[Perf][Bench] Poor performance of join #567

Devjiu · 2023-07-04T22:27:46Z

In several reference requests join queries have similar issues.
In case of big tables main performance drop is issued by fetchChunks stage.
In current case AFAIU the slowest stage is compileWorkUnit,

198 ms total duration for executeRelAlgQuery
        197 ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:1389
          119 ms start(0ms) compileWorkUnit NativeCodegen.cpp:1403
          0 ms start(120ms) fetchChunks Execute.cpp:3090
          8 ms start(121ms) collectAllDeviceResults Execute.cpp:2550
          0 ms start(195ms) fetchChunks Execute.cpp:3090
          1 ms start(196ms) executePlan Execute.cpp:3462

60% of total execution time was spent in compileWorkUnit.

Pandas in this case executes query for ~5 ms vs ~240 ms HDK for single query.

Looks like the most significant part of compileWorkUnit is

  New thread(1)
    15 ms start(0ms) initHashTableForDevice BaselineJoinHashTable.cpp:562
      0 ms start(0ms) initHashTableOnCpuFromCache BaselineJoinHashTable.cpp:980
      0 ms start(0ms) initHashTableOnCpuFromCache BaselineJoinHashTable.cpp:980
      15 ms start(0ms) initHashTableOnCpu BaselineHashTableBuilder.h:193
        1 ms start(0ms) CPU Baseline-Hash: init_baseline_hash_join_buff_32 BaselineHashTableBuilder.h:227
  End thread(1)
  New thread(386)
    62 ms start(0ms) initHashTableForDevice BaselineJoinHashTable.cpp:562
      0 ms start(0ms) initHashTableOnCpuFromCache BaselineJoinHashTable.cpp:980
      0 ms start(0ms) initHashTableOnCpuFromCache BaselineJoinHashTable.cpp:980
      62 ms start(0ms) initHashTableOnCpu BaselineHashTableBuilder.h:193
        0 ms start(0ms) CPU Baseline-Hash: init_baseline_hash_join_buff_32 BaselineHashTableBuilder.h:227
        0 ms start(11ms) CPU Baseline-Hash: Additional Buffers init_hash_join_buff BaselineHashTableBuilder.h:309
  End thread(386)

There are 2 subqueries, so launchCPU, fetchChunks and compileWU are faced twice. The first subquery is to collect metadata - select count(*) ....
But there are still several blind spots in the total time of compileWorkUnit execution. (119 ms total, 90 ms reify, 15 ms initHashTable)

Also convertToArrowTable took pretty long time(16ms), but maybe it's incorrect test.

Total debug timers log:

198 ms total duration for executeRelAlgQuery
  198 ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:216
    0 ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:217
    197 ms start(0ms) execute RelAlgExecutor.cpp:411
      197 ms start(0ms) executeStep RelAlgExecutor.cpp:895
        197 ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:1389
          119 ms start(0ms) compileWorkUnit NativeCodegen.cpp:1403
            90 ms start(2ms) getInstance HashJoin.cpp:241
              90 ms start(2ms) reify BaselineJoinHashTable.cpp:235
                0 ms start(2ms) getOneColumnFragment ColumnFetcher.cpp:67
                  0 ms start(2ms) ChunkNS getChunk Chunk.cpp:40
                0 ms start(2ms) getOneColumnFragment ColumnFetcher.cpp:67
                  0 ms start(2ms) ChunkNS getChunk Chunk.cpp:40
                0 ms start(2ms) getOneColumnFragment ColumnFetcher.cpp:67
                  0 ms start(2ms) ChunkNS getChunk Chunk.cpp:40
                  New thread(1)
                    15 ms start(0ms) initHashTableForDevice BaselineJoinHashTable.cpp:562
                      0 ms start(0ms) initHashTableOnCpuFromCache BaselineJoinHashTable.cpp:980
                      0 ms start(0ms) initHashTableOnCpuFromCache BaselineJoinHashTable.cpp:980
                      15 ms start(0ms) initHashTableOnCpu BaselineHashTableBuilder.h:193
                        1 ms start(0ms) CPU Baseline-Hash: init_baseline_hash_join_buff_32 BaselineHashTableBuilder.h:227
                  End thread(1)
                0 ms start(18ms) getOneColumnFragment ColumnFetcher.cpp:67
                  0 ms start(18ms) ChunkNS getChunk Chunk.cpp:40
                0 ms start(18ms) getOneColumnFragment ColumnFetcher.cpp:67
                  0 ms start(18ms) ChunkNS getChunk Chunk.cpp:40
                0 ms start(18ms) getOneColumnFragment ColumnFetcher.cpp:67
                  0 ms start(18ms) ChunkNS getChunk Chunk.cpp:40
                  New thread(386)
                    62 ms start(0ms) initHashTableForDevice BaselineJoinHashTable.cpp:562
                      0 ms start(0ms) initHashTableOnCpuFromCache BaselineJoinHashTable.cpp:980
                      0 ms start(0ms) initHashTableOnCpuFromCache BaselineJoinHashTable.cpp:980
                      62 ms start(0ms) initHashTableOnCpu BaselineHashTableBuilder.h:193
                        0 ms start(0ms) CPU Baseline-Hash: init_baseline_hash_join_buff_32 BaselineHashTableBuilder.h:227
                        0 ms start(11ms) CPU Baseline-Hash: Additional Buffers init_hash_join_buff BaselineHashTableBuilder.h:309
                  End thread(386)
            0 ms start(104ms) markDeadRuntimeFuncs NativeCodegen.cpp:801
            4 ms start(105ms) generateNativeCPUCode Backend.cpp:97
              4 ms start(105ms) optimize_ir HelperFunctions.cpp:110
          0 ms start(120ms) ExecutionKernel::run ExecutionKernel.cpp:126
          0 ms start(120ms) fetchChunks Execute.cpp:3090
            0 ms start(120ms) buildSelectedFragsMapping Execute.cpp:3359
            0 ms start(120ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(120ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(120ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(120ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(120ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(120ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(120ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(120ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(120ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(120ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(120ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(120ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(120ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(120ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(120ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(120ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(120ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(120ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(120ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(120ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(120ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(120ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(120ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(120ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(120ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(120ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(120ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(120ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(120ms) getRowCountAndOffsetForAllFrags Execute.cpp:2994
          0 ms start(120ms) create QueryExecutionContext.cpp:94
          1 ms start(120ms) executePlan Execute.cpp:3462
            1 ms start(120ms) launchCpuCode QueryExecutionContext.cpp:564
          8 ms start(121ms) collectAllDeviceResults Execute.cpp:2550
            8 ms start(121ms) reduceMultiDeviceResults Execute.cpp:1178
              8 ms start(121ms) reduceMultiDeviceResultSets Execute.cpp:1245
                2 ms start(126ms) generateNativeCPUCode Backend.cpp:97
                  2 ms start(126ms) optimize_ir HelperFunctions.cpp:110
          65 ms start(130ms) compileWorkUnit NativeCodegen.cpp:1403
            0 ms start(131ms) getInstance HashJoin.cpp:241
              0 ms start(131ms) reify BaselineJoinHashTable.cpp:235
                0 ms start(131ms) getOneColumnFragment ColumnFetcher.cpp:67
                  0 ms start(131ms) ChunkNS getChunk Chunk.cpp:40
                0 ms start(131ms) getOneColumnFragment ColumnFetcher.cpp:67
                  0 ms start(131ms) ChunkNS getChunk Chunk.cpp:40
                0 ms start(131ms) getOneColumnFragment ColumnFetcher.cpp:67
                  0 ms start(131ms) ChunkNS getChunk Chunk.cpp:40
                  New thread(771)
                    0 ms start(0ms) initHashTableForDevice BaselineJoinHashTable.cpp:562
                      0 ms start(0ms) initHashTableOnCpuFromCache BaselineJoinHashTable.cpp:980
                  End thread(771)
            0 ms start(176ms) markDeadRuntimeFuncs NativeCodegen.cpp:801
            4 ms start(177ms) generateNativeCPUCode Backend.cpp:97
              4 ms start(177ms) optimize_ir HelperFunctions.cpp:110
          0 ms start(195ms) ExecutionKernel::run ExecutionKernel.cpp:126
          0 ms start(195ms) fetchChunks Execute.cpp:3090
            0 ms start(195ms) buildSelectedFragsMapping Execute.cpp:3359
            0 ms start(195ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(195ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(195ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(195ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(195ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(195ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(195ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(195ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(195ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(195ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(195ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(195ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(195ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(195ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(195ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(195ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(195ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(195ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(195ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(195ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(195ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(195ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(195ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(195ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(195ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(195ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(195ms) getOneTableColumnFragment ColumnFetcher.cpp:176
              0 ms start(195ms) ChunkNS getChunk Chunk.cpp:40
            0 ms start(195ms) getRowCountAndOffsetForAllFrags Execute.cpp:2994
          1 ms start(195ms) create QueryExecutionContext.cpp:94
          1 ms start(196ms) executePlan Execute.cpp:3462
            1 ms start(196ms) launchCpuCode QueryExecutionContext.cpp:564
            0 ms start(197ms) getRowSet QueryExecutionContext.cpp:192
          0 ms start(197ms) resultsUnion Execute.cpp:1133
[2023-07-04 09:06:45.134493] [0x00007f55b364c740] [info]    0 0 ArrowResultSetConverter.cpp:826 DEBUG_TIMER thread_id(0)
16 ms total duration for convertToArrowTable
  0 ms start(0ms) columnar conversion ArrowResultSetConverter.cpp:1426
  16 ms start(0ms) row-wise conversion ArrowResultSetConverter.cpp:1446
    14 ms start(0ms) fetch data in parallel_for ArrowResultSetConverter.cpp:1495
    1 ms start(14ms) append rows to arrow, finish builders ArrowResultSetConverter.cpp:1516

The text was updated successfully, but these errors were encountered:

This is an example that compares HDK with Pandas performance. Can be modified by increasing N (originally was 10). See also: #567 Signed-off-by: Dmitrii Makarenko <[email protected]>

Devjiu · 2023-07-04T22:31:20Z

Refence test code shared at #568 branch name is join_perf.

Devjiu · 2023-07-17T14:28:54Z

It was decided to test the behavior of the HDK as the data set size increases. The start number is 15_000, which may be too low for the hdk use case.

I checked benchmark with the following number of rows: [15_000, 300_000, 500_000, 700_000, 800_000, 1_000_000, 3_000_000, 6_000_000, 9_000_000, 15_000_000]

Get the following results:

Num Rows:      15_000, N:  2,    Hdk Time: 0.25210
Num Rows:      15_000, N:  2, Pandas Time: 0.00493
[hdk] shape:  (24837, 11)
[ pd] shape:  (24837, 11)
Num Rows:     300_000, N:  2,    Hdk Time: 3.13445
Num Rows:     300_000, N:  2, Pandas Time: 0.77139
[hdk] shape:  (9999326, 11)
[ pd] shape:  (9999326, 11)
Num Rows:     500_000, N:  2,    Hdk Time: 8.12875
Num Rows:     500_000, N:  2, Pandas Time: 1.70332
[hdk] shape:  (27779610, 11)
[ pd] shape:  (27779610, 11)
Num Rows:     700_000, N:  2,    Hdk Time: 16.94468
Num Rows:     700_000, N:  2, Pandas Time: 3.97328
[hdk] shape:  (54439501, 11)
[ pd] shape:  (54439501, 11)
Num Rows:     800_000, N:  2,    Hdk Time: 22.91816
Num Rows:     800_000, N:  2, Pandas Time: 6.94236
[hdk] shape:  (71111180, 11)
[ pd] shape:  (71111180, 11)
Num Rows:    1_000_000, N:  2,    Hdk Time: 35.59929
Num Rows:    1_000_000, N:  2, Pandas Time: 10.80405
[hdk] shape:  (111105866, 11)
[ pd] shape:  (111105866, 11)
Num Rows:    3_000_000, N:  2,    Hdk Time: 332.11756
Num Rows:    3_000_000, N:  2, Pandas Time: 131.54231
[hdk] shape:  (999980688, 11)
[ pd] shape:  (999980688, 11)
Num Rows:    6_000_000, N:  2,    Hdk Time: 8.26204
Num Rows:    6_000_000, N:  2, Pandas Time: 536.55373
[hdk] shape:  (705, 11)
[ pd] shape:  (3999922590, 11)
Traceback (most recent call last):
  File "/localdisk/dmitriim/benchmarks/db-benchmark/./samsung.py", line 107, in <module>
    df_ans = ht_base.join(
  File "_builder.pyx", line 687, in pyhdk._builder.QueryNode.run
  File "_sql.pyx", line 207, in pyhdk._sql.RelAlgExecutor.execute
RuntimeError: Ran out of slots in the query output buffer

Last error generated on 9_000_000 num_rows case.
Also notice, that data shape on 6_000_000 looks wrong.

Devjiu added bug Something isn't working customer request labels Jul 4, 2023

Devjiu assigned alexbaden Jul 4, 2023

Devjiu added a commit that referenced this issue Jul 4, 2023

[Bench] Reference for perfomance analysis

6a0c785

This is an example that compares HDK with Pandas performance. Can be modified by increasing N (originally was 10). See also: #567 Signed-off-by: Dmitrii Makarenko <[email protected]>

Devjiu mentioned this issue Jul 4, 2023

[Bench] Reference for perfomance analysis #568

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Perf][Bench] Poor performance of join #567

[Perf][Bench] Poor performance of join #567

Devjiu commented Jul 4, 2023

Devjiu commented Jul 4, 2023

Devjiu commented Jul 17, 2023

[Perf][Bench] Poor performance of join #567

[Perf][Bench] Poor performance of join #567

Comments

Devjiu commented Jul 4, 2023

Devjiu commented Jul 4, 2023

Devjiu commented Jul 17, 2023