What's Logical Optimizing in TiDB
First of all
Actually, every logical rule struct has a method named optimize()
rules of logical
From here, we can see logical optimizing items will be applied on by one.
var optRuleList = []logicalOptRule{
&gcSubstituter{},
&columnPruner{},
&resultReorder{},
&buildKeySolver{},
&decorrelateSolver{},
&semiJoinRewriter{},
&aggregationEliminator{},
&skewDistinctAggRewriter{},
&projectionEliminator{},
&maxMinEliminator{},
&ppdSolver{},
&outerJoinEliminator{},
&partitionProcessor{},
&collectPredicateColumnsPoint{},
&aggregationPushDownSolver{},
&pushDownTopNOptimizer{},
&syncWaitStatsLoadPoint{},
&joinReOrderSolver{},
&columnPruner{}, // column pruning again at last, note it will mess up the results of buildKeySolver
}
gcSubstituter
By mysql reference of Generated Column(in short: gc)
, we can see what the function of gc is. And from here, you can figure out how tidb implment gc
in period of logical plan. There are four parts of operands, (Selection, Projection, Sort, Aggregation), will subtitute gc as index.
LogicalSelection:
LogicalSelection represents a where or having predicate. firstly, in generated column function, index wouldn't none. Because of every calculating is based on index expression. secondly, when predicate operands are
=
,>
,>=
,<
,<=
,like
,or
,and
,not
. TiDB'll deal with it using generated column. However, when predicate isin
, there is a limition that all of scalars are the same type.LogicalProjection:
LogicalProjection represents a select fields plan and TiDB'll loop every field(selected colums from SQL) and try subsitiuting generated column.
LogicalSort:
LogicalSort stands for the order by plan. if there are any
sort
item in a query.LogicalAggregation:
LogicalAggregation represents an aggregate plan. And all agg functions are here. if there any field could be subtitiuted into one genereated colum, it'd be done.
go// AggFuncCount is the name of Count function. AggFuncCount = "count" // AggFuncSum is the name of Sum function. AggFuncSum = "sum" // AggFuncAvg is the name of Avg function. AggFuncAvg = "avg" // AggFuncFirstRow is the name of FirstRowColumn function. AggFuncFirstRow = "firstrow" // AggFuncMax is the name of max function. AggFuncMax = "max" // AggFuncMin is the name of min function. AggFuncMin = "min" // AggFuncGroupConcat is the name of group_concat function. AggFuncGroupConcat = "group_concat" // AggFuncBitOr is the name of bit_or function. AggFuncBitOr = "bit_or" // AggFuncBitXor is the name of bit_xor function. AggFuncBitXor = "bit_xor" // AggFuncBitAnd is the name of bit_and function. AggFuncBitAnd = "bit_and" // AggFuncVarPop is the name of var_pop function AggFuncVarPop = "var_pop" // AggFuncVarSamp is the name of var_samp function AggFuncVarSamp = "var_samp" // AggFuncStddevPop is the name of stddev_pop/std/stddev function AggFuncStddevPop = "stddev_pop" // AggFuncStddevSamp is the name of stddev_samp function AggFuncStddevSamp = "stddev_samp" // AggFuncJsonArrayagg is the name of json_arrayagg function AggFuncJsonArrayagg = "json_arrayagg" // AggFuncJsonObjectAgg is the name of json_objectagg function AggFuncJsonObjectAgg = "json_objectagg" // AggFuncApproxCountDistinct is the name of approx_count_distinct function. AggFuncApproxCountDistinct = "approx_count_distinct" // AggFuncApproxPercentile is the name of approx_percentile function. AggFuncApproxPercentile = "approx_percentile"
columnPruner
Actually, there are too many operands implmented
pruneCoumns
, you can PTAL at it's interface. After parsing AST, TiDB has already had a basic logic plan and column info without any tuning, when we step in here. I'll chooseLogicalSelection
which is one of the easiest to take a look deeply.That just deals with every column selected in sql or filterd in predicate and prunes part of columns which don't need. In
LogicalSelection
, as result offunc filter
is nil, so result is equal to columns selected plus predicate for filtering.gofunc extractColumns(result []*Column, expr Expression, filter func(*Column) bool) []*Column { switch v := expr.(type) { case *Column: if filter == nil || filter(v) { result = append(result, v) } case *ScalarFunction: for _, arg := range v.GetArgs() { result = extractColumns(result, arg, filter) } } return result }
resultReorder
ResultReorder reorder query results, which is not a common rule for all queries, it's specially implemented for a few customers. And here is a short comment in code place.
go/* Results of some queries are not ordered, for example: create table t (a int); insert into t values (1), (2); select a from t; In the case above, the result can be `1 2` or `2 1`, which is not ordered. This rule reorders results by modifying or injecting a Sort operator: 1. iterate the plan from the root, and ignore all input-order operators (Sel/Proj/Limit); 2. when meeting the first non-input-order operator, 2.1. if it's a Sort, update it by appending all output columns into its order-by list, 2.2. otherwise, inject a new Sort upon this operator. */
At here, based on the such clear comment, TiDB'll do
1
,2.1
and2.2
byfunc completeSort
andfunc injectSort
. PTAL at #pr for details.
buildKeyInfo
BuildKeyInfo as their name will collect the information of unique keys into schema(selfSchema
and childSchema
) and check keys of schema from query input coule be used. in addition, that to check if a query returns no more than one row is also an important action inside of this step. more details
decorrelateSolver
what's apply? from comments,
LogicalApply
gets one row from outer executor and gets one row from inner executor according to outer row.which means, there wouldn't have optimization without
Apply
operand logic.if
innerPlan
isLogicalSelection
, action is set new apply-plan children and reoptimize it using the samedecorrelateSolver
rule, you can see here.if
innerPlan
isLogicalMaxOneRow
, action is the same asLogicalSelection
.if
innerPlan
isLogicalProjection
andJoinType
isLeftOuterJoin
, action is it doesn't have optimization, because of below comments, every comparation will return 1 since the projection is evaluated after the join. And when anApply
isapply.JoinType != SemiJoin && apply.JoinType != LeftOuterSemiJoin && apply.JoinType != AntiSemiJoin && apply.JoinType != AntiLeftOuterSemiJoin
, it'll step to next step and try to substitute the all the schema with new expressions which maybe have been optimized bycolumnPruner
, which mainly it does is substitued all expressions or columns likegcSubstitute
.if
innerPlan
isLogicalAggregation
,sqlselect (select 1 from t1 where t1.a=t2.a) from t2; -- when t1.a=t2.a is false, the result should be null -- after decorrelateSolver, it'll be a format below select t1.* from t2 left outer join (select distinct(t1.a) as a from t1) as sub on t2.a = sub.a; -- | t2.a | | t1.a | | t2.a | t1.a | join result | | join result | -- | 1 | join | 0 | --> | 1 | 0 | nll | --> | 1 | 1 | 1 | -- | 2 | | 1 | | 1 | 1 | 1 | -- | 2 | 1 | null | -- | 2 | 1 | null |
if
innerPlan
isLogicalLimit
andapply.JoinType != SemiJoin && apply.JoinType != LeftOuterSemiJoin && apply.JoinType != AntiSemiJoin && apply.JoinType != AntiLeftOuterSemiJoin
, that means ifLogicalLimit
is not SemiJoin, the output of it might be expanded even though we arelimit 1
. because of function ofsemi join
below, if we did expanding, we'd getsemi join
the result of limit number of rows.sqlselect * from sale_detail; --返回结果。 +------------+-------------+-------------+------------+------------+ | shop_name | customer_id | total_price | sale_date | region | +------------+-------------+-------------+------------+------------+ | s1 | c1 | 100.1 | 2013 | china | | s2 | c2 | 100.2 | 2013 | china | | s3 | c3 | 100.3 | 2013 | china | +------------+-------------+-------------+------------+------------+ select * from sale_detail_sj; --返回结果。 +------------+-------------+-------------+------------+------------+ | shop_name | customer_id | total_price | sale_date | region | +------------+-------------+-------------+------------+------------+ | s1 | c1 | 100.1 | 2013 | china | | s2 | c2 | 100.2 | 2013 | china | | s5 | c2 | 100.2 | 2013 | china | | s2 | c2 | 100.2 | 2013 | china | +------------+-------------+-------------+------------+------------+ select * from sale_detail a left semi join sale_detail_sj b on a.total_price=b.total_price; +------------+-------------+-------------+------------+------------+ | shop_name | customer_id | total_price | sale_date | region | +------------+-------------+-------------+------------+------------+ | s2 | c2 | 100.2 | 2013 | china | | s1 | c1 | 100.1 | 2013 | china | +------------+-------------+-------------+------------+------------+ [right result] select count(*) from test t1 where exists (select value from test t2 where t1.id = t2.id limit 1); [wrong result] select count(*) from test t1 a semi join test t2 where t1.id = t2.id limit 1;
if
innerPlan
isLogicalAggregation
, It'll check ifApply
can be pulled up and an aggregation can be pulled up, then, do related actions like addsum
flag.if
innerPlan
isLogicalSort
, Since TiDB only pull up Selection, Projection, Aggregation, MaxOneRow, so there just setp in childen optimization.
The text may help you understand it easily.