Distributed query on batch mode #320

jingchen2222 · 2021-08-30T20:09:49Z

Motivation

Users are frustrated that they cannot run distributed batch queries on OpenMLDB, so we are considering supporting distributed batch queries gradually.
As a first step, we might support batch queries that satisfy some specific restrictions. For example:

- Simple queries on a single table can be supported because the queried rows are independent of each other.
- Complex aggregation queries might be supported if the aggregated data is organized by partition:
  - aggregation queries whose data is grouped by the partition key
  - aggregation queries whose data is filtered by the partition key
  - aggregation queries over windows filtered by the partition key
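The reason these restricted cases are safe to distribute can be sketched in a few lines. The example below is only an illustration, not OpenMLDB code: `Row`, `AggregateAll`, and `AggregateByPartition` are hypothetical names, and hash routing is an assumed partitioning scheme. It shows that when rows are routed by the grouping key, each partition can compute its sums on its own and the partial results can simply be combined.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical row type: `key` plays the role of the partition key.
struct Row {
    std::string key;
    int64_t value;
};

using Sums = std::map<std::string, int64_t>;

// Reference result: SUM(value) GROUP BY key computed over all rows in one place.
Sums AggregateAll(const std::vector<Row>& rows) {
    Sums sums;
    for (const Row& row : rows) sums[row.key] += row.value;
    return sums;
}

// Distributed sketch: route each row by a hash of its key, aggregate every
// partition independently, then take the union of the partial results.
// Assumes num_partitions > 0.
Sums AggregateByPartition(const std::vector<Row>& rows, std::size_t num_partitions) {
    std::vector<std::vector<Row>> partitions(num_partitions);
    for (const Row& row : rows) {
        std::size_t target = std::hash<std::string>{}(row.key) % num_partitions;
        partitions[target].push_back(row);
    }
    Sums merged;
    for (const std::vector<Row>& partition : partitions) {
        // A key never spans two partitions, so insert() never meets a duplicate.
        Sums partial = AggregateAll(partition);
        merged.insert(partial.begin(), partial.end());
    }
    return merged;
}
```

If the rows were partitioned by some other column, the same key could appear in several partitions and the partial sums would need a second merge step, which is why the restriction to the partition key matters.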

ISSUE related

jingchen2222 · 2021-09-01T02:51:49Z

SQL plan optimization for where and group op

Related issue #317

Previous work

The OpenMLDB planner already applies some optimization passes to the physical plan, so that group/join/filter operations can be optimized when their keys match a table index.

For instance, given the table create statement:

CREATE TABLE t1 (
col0 string, 
col1 int32, 
col2 int16, 
col3 float, 
col4 double, 
col5 int64, 
col6 string,
INDEX index0(col0) OPTIONS (ts = col5));

and the table query statement:

SELECT col0, 
sum(col1) as col1_sum, sum(col3) as col3_sum,
sum(col4) as col4_sum, sum(col2) as col2_sum,
sum(col5) as col5_sum 
FROM t1 WHERE col0 = "1" and col5 < 2 Group By col0;

Before optimization, the physical plan is:

PROJECT(type=GroupAggregation, group_keys=(col0))
  FILTER_BY(condition=col5 < 1000, left_keys=(col0), right_keys=("1"), index_keys=)
    GROUP_BY(group_keys=(col0))
      DATA_PROVIDER(type=Table, table=t1)

After the optimization passes are applied, the GROUP_BY op is eliminated because the group key col0 matches the key of index0. The physical plan becomes:

PROJECT(type=GroupAggregation, group_keys=(col0))
  FILTER_BY(condition=col5 < 1000, left_keys=(col0), right_keys=("1"), index_keys=)
    DATA_PROVIDER(type=Partition, table=t1, index=index0)
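The group elimination described above can be illustrated with a toy plan rewrite. This is a minimal sketch under stated assumptions, not OpenMLDB's implementation: `PlanNode`, `OpType`, `IndexDef`, and `EliminateGroupBy` are hypothetical names invented for the example, and the real `PhysicalOpNode` hierarchy is considerably richer.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified plan node; not OpenMLDB's PhysicalOpNode.
enum class OpType { kGroupBy, kTableProvider, kPartitionProvider };

struct PlanNode {
    OpType type;
    std::vector<std::string> keys;  // group keys, or index keys for a partition provider
    std::string table;
    std::string index;
    std::unique_ptr<PlanNode> child;
};

// Hypothetical description of a table index.
struct IndexDef {
    std::string name;
    std::vector<std::string> keys;
};

// If `node` is a GROUP_BY sitting directly on a table scan and its group keys
// equal the keys of some index, replace both nodes with a partition provider
// on that index. Otherwise return the plan unchanged.
std::unique_ptr<PlanNode> EliminateGroupBy(std::unique_ptr<PlanNode> node,
                                           const std::vector<IndexDef>& indexes) {
    if (!node || node->type != OpType::kGroupBy || !node->child ||
        node->child->type != OpType::kTableProvider) {
        return node;
    }
    for (const IndexDef& idx : indexes) {
        if (idx.keys == node->keys) {
            auto provider = std::make_unique<PlanNode>();
            provider->type = OpType::kPartitionProvider;
            provider->keys = idx.keys;
            provider->table = node->child->table;
            provider->index = idx.name;
            return provider;
        }
    }
    return node;
}
```

For the example query, a GROUP_BY on `col0` over a scan of `t1` with `index0(col0)` would collapse into a single partition provider on `index0`, while a GROUP_BY on a non-indexed column would be left in place.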

Issue description

In this issue, we are going to enhance the existing passes so that the GROUP op and the FILTER op are optimized further when both of them match the same table index.

SELECT col0, 
sum(col1) as col1_sum, sum(col3) as col3_sum,
sum(col4) as col4_sum, sum(col2) as col2_sum,
sum(col5) as col5_sum 
FROM t1 WHERE col0 = "1" and col5 < 2 Group By col0;

In the query above, both the GROUP and WHERE ops are based on column col0, which exactly matches index0. We therefore expect an optimized physical plan like:

PROJECT(type=GroupAggregation, group_keys=(col0))
  FILTER_BY(condition=col5 < 1000, left_keys=, right_keys=, index_keys="1")
    DATA_PROVIDER(type=Partition, table=t1, index=index0)
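The change to FILTER_BY in the expected plan, where the equality value moves from `right_keys` into `index_keys`, can be sketched as follows. This is an illustrative assumption about the rewrite rule, not the actual pass: `FilterKeys` and `MoveKeysToIndex` are hypothetical names, and keys are modeled as plain strings.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the key fields printed by FILTER_BY.
struct FilterKeys {
    std::vector<std::string> left_keys;   // columns used in equality predicates
    std::vector<std::string> right_keys;  // values those columns are compared with
    std::vector<std::string> index_keys;  // values used to seek into the index
};

// When the equality columns are exactly the key columns of the index, the
// compared values can be used as index keys and the equality check itself
// becomes unnecessary. Returns true if the filter was rewritten.
bool MoveKeysToIndex(const std::vector<std::string>& index_columns,
                     FilterKeys* filter) {
    if (filter == nullptr || filter->left_keys.empty()) return false;
    if (filter->left_keys != index_columns) return false;
    if (filter->left_keys.size() != filter->right_keys.size()) return false;
    filter->index_keys = filter->right_keys;
    filter->left_keys.clear();
    filter->right_keys.clear();
    return true;
}
```

The residual condition on `col5` is untouched by this step: it does not involve the index key, so it still has to be evaluated on the rows read from the matching partition.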

Implementation

From the discussion above, the essential work is to further optimize an op when its keys match the index keys of the PartitionDataProvider.

bool GroupAndSortOptimized::KeysOptimized(
    const SchemasContext* root_schemas_ctx, PhysicalOpNode* in, Key* left_key,
    Key* index_key, Key* right_key, Sort* sort, PhysicalOpNode** new_in) {
    // TODO: handle key optimization when the data provider type is
    // `kProviderTypePartition`.
    return false;  // placeholder: reports that nothing was optimized yet
}
