Vectorized aggregation with grouping by one fixed-size column #7341
base: main
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main    #7341    +/-   ##
========================================
+ Coverage   80.06%   82.59%   +2.52%
========================================
  Files         190      231      +41
  Lines       37181    43037    +5856
  Branches     9450    10802    +1352
========================================
+ Hits        29770    35547    +5777
- Misses       2997     3190     +193
+ Partials     4414     4300     -114
```

☔ View full report in Codecov by Sentry.
This is really good functionality, but there's a lot of code to review, and it is a bit hard to understand and follow, so it takes time.
I think there are some things that can be done to improve the code and its readability so that it is easier for others to understand.
I've made some suggestions, and probably not all of them are valid or good ideas. But at least the places I've made suggestions for so far are areas that are particularly hard to follow.
I will submit what I have so far, and will try to do the rest later.
```c
	create_grouping_policy_batch(vector_agg_state->agg_defs,
								 vector_agg_state->output_grouping_columns,
								 /* partial_per_batch = */ grouping_column_offsets != NIL);

	if (list_length(vector_agg_state->output_grouping_columns) == 1)
```
It would be nice to have a comment before this check letting the reader know that we want to try optimizing the 1-column case in a special way, and that we later fall back to the "regular" per-batch grouping if the optimization wasn't possible.
```c
							 MemoryContext agg_extra_mctx)
{
	CountState *states = (CountState *) agg_states;
	for (int row = start_row; row < end_row; row++)
```
Just double-checking that it is really correct to be non-inclusive with the `end_row` here... `end_row` sounds like it is a "valid" row index, as opposed to using something like `num_rows` in a zero-indexed series. If `end_row` is not a valid index it should probably be called something else.
I guess that's the C++ habit of mine, where `end` is idiomatically a past-the-end invalid iterator. In general I think ranges with an exclusive right end are very common, like `[begin, end)`. Do you have a better name for this? Sometimes I write `past_the_end_row` to make it absolutely clear, but this feels a little too long for common usage...
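To illustrate the convention under discussion, here is a minimal, self-contained C sketch (names are illustrative, not the actual code): with a half-open range `[start_row, end_row)`, `end_row` is one past the last valid row, so adjacent sub-ranges chain together without overlap or off-by-one gaps.

```c
#include <assert.h>

/* Half-open range sketch: end_row is one past the last valid row,
 * like a C++ past-the-end iterator. The loop body runs exactly
 * (end_row - start_row) times. */
static int
count_rows(int start_row, int end_row)
{
	int n = 0;
	for (int row = start_row; row < end_row; row++)
		n++;
	return n;
}
```

One nice property: splitting `[0, 10)` into `[0, 4)` and `[4, 10)` covers every row exactly once, which is why half-open ranges compose well when a batch is processed in chunks.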
```c
						  const ArrowArray *vector, MemoryContext agg_extra_mctx)
{
	const uint64 *valid = vector->buffers[0];
	for (int row = start_row; row < end_row; row++)
```
Same here as above w.r.t. `end_row`.
```c
	 * A memory context for aggregate functions to allocate additional data,
	 * i.e. if they store strings or float8 datum on 32-bit systems. Valid until
	 * the grouping policy is reset.
```
Suggested change:

```diff
- * A memory context for aggregate functions to allocate additional data,
- * i.e. if they store strings or float8 datum on 32-bit systems. Valid until
+ * A memory context for aggregate functions to store varlen type values,
+ * e.g., strings or float8 datums on 32-bit systems. Valid until
  * the grouping policy is reset.
```
I was thinking more about statistical sketches or an exact count of unique values; these functions have to allocate generic variable-length data somewhere, and that's the purpose of this context.
```c
}

static pg_attribute_always_inline void
get_key_arrow_fixed_2(CompressedColumnValues column, int row, Datum *restrict key,
```
Looks like the following functions are never executed in tests. Would be good to have test coverage of these.
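For context, a hypothetical sketch of what such a fixed-width key reader does: take the row's 2-byte value from the Arrow values buffer and widen it into a Datum-sized integer key. The function name and zero-extension behavior here are illustrative assumptions, not the actual API.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 2-byte key reader: read values[row] and widen it to a
 * 64-bit key (zero-extended in this sketch). */
static uint64_t
get_key_fixed_2_sketch(const int16_t *values, int row)
{
	return (uint64_t) (uint16_t) values[row];
}
```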
```c
	/*
	 * Temporary storage of aggregate state offsets for a given batch. We keep
	 * it in the policy because it is potentially too big to keep on stack, and
	 * we don't want to reallocate it each batch.
	 */
```
Took me a while to figure out what these offsets are for. Could we do a better description here?
For example (if I got things right):
The agg_state_offsets array contains an offset for each "row" in an ArrowArray that points to the aggregate state for that row. Which agg state it points to is decided by the value (key) in the grouping column. Thus, the number of agg states is equal to the number of unique values in the grouping column's ArrowArray.
Yeah I'll try to describe it better. They are basically unique indexes of the grouping keys, and the array is indexed by the row index of the input batch.
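A simplified, self-contained illustration of that description (a linear scan stands in for the hash table; `fill_offsets_sketch` is an invented name): each row is assigned the unique index of its grouping key, so for keys `{5, 7, 5, 9, 7}` the offsets come out as `{0, 1, 0, 2, 1}` and three aggregate states are needed.

```c
#include <assert.h>
#include <stdint.h>

/* Map each row to the unique index of its grouping key. Returns the
 * number of distinct keys, i.e. the number of aggregate states needed.
 * The offsets array is indexed by the row index of the input batch. */
static int
fill_offsets_sketch(const int64_t *keys, int n_rows, uint32_t *offsets)
{
	int64_t unique[64]; /* assume few distinct keys for the sketch */
	int n_unique = 0;
	for (int row = 0; row < n_rows; row++)
	{
		int idx = -1;
		for (int u = 0; u < n_unique; u++)
			if (unique[u] == keys[row])
				idx = u;
		if (idx < 0)
		{
			unique[n_unique] = keys[row];
			idx = n_unique++;
		}
		offsets[row] = (uint32_t) idx;
	}
	return n_unique;
}
```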
```c
{
	Datum key;
	uint32 status;
	uint32 agg_state_index;
```
What about having the Agg state in the hash table entry instead? What are the pros/cons of having it here compared to in an outside array?
The hash table should be as small as possible, because we're accessing its memory randomly, and the speed of this depends on whether it fits into the caches.
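A sketch of that trade-off (field layout here is illustrative, not the exact struct): the hash entry stays small — just the key plus an index — while the per-aggregate states live in separate dense arrays addressed by `agg_state_index`. A smaller entry means more entries fit per cache line during the random probing the hash lookups do.

```c
#include <assert.h>
#include <stdint.h>

/* Compact hash entry: key + bookkeeping + index into the state arrays.
 * The (potentially much larger) aggregate states live outside the table. */
typedef struct HashEntrySketch
{
	uint64_t key;             /* Datum-sized grouping key */
	uint32_t status;          /* simplehash-style bookkeeping */
	uint32_t agg_state_index; /* index into each aggregate's state array */
} HashEntrySketch;

/* Example aggregate state that stays out of the hash table. */
typedef struct AvgStateSketch
{
	double sum;
	uint64_t count;
} AvgStateSketch;
```

If the states were embedded in the entry, every probe would drag the full state payload through the cache even for rows that only need the index.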
```c
	HashEntry *restrict entry = h_insert(table, key, &found);
	if (!found)
	{
		entry->agg_state_index = next_unused_state_index++;
```
I am wondering why we're not initializing the aggregate states here instead of in an additional loop after the fill_offsets?
Just following a common rule of thumb for high-throughput data processing: colocate similar items in memory, do the same operation on them in bulk.
In this grouping algorithm we have two major stages:
1. Match each input row to a unique key index (`offsets`). This uses the hash table and the key column data.
2. Calculate the aggregate functions. This uses their input columns and their states.
These stages are done separately, and the data required for them are grouped together. Following the same principle, each aggregate function is processed for the entire batch, rather than processing all aggregate functions for one row.
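The second stage can be sketched as one tight loop per aggregate over the whole batch, using the offsets produced by the first stage. This is a simplified, self-contained illustration (types and names are assumptions, not the real code):

```c
#include <assert.h>
#include <stdint.h>

/* Stage 2 for one hypothetical SUM(float8) aggregate: a single pass over
 * the batch, touching only this aggregate's input column and its dense
 * state array. offsets[row] is the unique key index from stage 1. */
static void
sum_add_batch(double *states, const uint32_t *offsets,
			  const double *values, int n_rows)
{
	for (int row = 0; row < n_rows; row++)
		states[offsets[row]] += values[row];
}
```

Initializing all the new states in one loop before this, rather than inside the hash-insert path, keeps each loop doing one kind of work on one kind of data.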
```c
		VectorAggDef *agg_def = lfirst(aggdeflc);
		lfirst(aggstatelc) =
			repalloc(lfirst(aggstatelc),
					 policy->allocated_aggstate_rows * agg_def->func.state_bytes);
```
Wouldn't it be nicer if `agg_def->func` had an allocation function instead? Then you'd just do `agg_def->func.alloc_states(num_states)`.
I think it could be slightly confusing. The aggregate functions can't decide to allocate their memory in a different way, so this is not a part of their interface. No need to make a method that has the same single-line implementation for every aggregate.
```c
	 * it in the policy because it is potentially too big to keep on stack, and
	 * we don't want to reallocate it each batch.
	 */
	uint32 *offsets;
```
Another question: what if we made this an array of pointers to the agg states instead of an array of offsets?
I want to colocate the states of one aggregate function together because they are processed together, and to use the same offsets (unique indexes of grouping key, basically) to access the states of all aggregate functions. With pointers this would be less convenient, and also they take up more memory.
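A self-contained sketch of that design point (invented names; simplified types): a single `uint32` offsets array is shared by all aggregate functions, each of which keeps its states in its own dense array. Offsets are also half the size of 64-bit pointers, so the per-batch array is smaller, and a pointer could only address one aggregate's states anyway.

```c
#include <assert.h>
#include <stdint.h>

/* Two hypothetical aggregates (SUM and MAX over float8 columns) reusing
 * the same offsets array; each keeps its own colocated state array. */
static void
update_two_aggs(const uint32_t *offsets, int n_rows,
				const double *col_a, double *sum_states,
				const double *col_b, double *max_states)
{
	for (int row = 0; row < n_rows; row++)
		sum_states[offsets[row]] += col_a[row];
	for (int row = 0; row < n_rows; row++)
		if (col_b[row] > max_states[offsets[row]])
			max_states[offsets[row]] = col_b[row];
}
```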
This PR prepares for #7341. It has various assorted refactorings and cosmetic changes:
* Various cosmetic things I don't know where to put.
* The definitions of aggregate functions and grouping columns in the vector agg node are now typed arrays and not lists.
* The aggregate function implementations always work with at most one filter bitmap. This reduces the amount of code and will help to support the aggregate FILTER clauses.
* Parts of the aggregate function implementations are restructured and renamed in a way that will make it easier to support hash grouping.
* EXPLAIN output is added for the vector agg node that mentions the grouping policy being used.
No functional changes are expected except for the EXPLAIN output.
Disable-check: force-changelog-file
---------
Signed-off-by: Alexander Kuzmenkov <[email protected]>
Co-authored-by: Erik Nordström <[email protected]>
This is a simplified implementation that uses the Postgres simplehash hash table with a generic Datum key for by-value fixed-size compressed columns.
The biggest improvement on a "sensible" query is about 90%, and a couple of queries show bigger improvements but these are very synthetic cases that don't make much sense:
https://grafana.ops.savannah-dev.timescale.com/d/fasYic_4z/compare-akuzm?orgId=1&var-branch=All&var-run1=3815&var-run2=3816&var-threshold=0.02&var-use_historical_thresholds=true&var-threshold_expression=2%20%2A%20percentile_cont%280.90%29&var-exact_suite_version=false&from=now-2d&to=now