-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Join and semi-join with result cardinality hint #901
Comments
Hi, I will work on this issue. |
Thanks, please go ahead! |
saminbassiri
added a commit
to saminbassiri/daphne
that referenced
this issue
Nov 18, 2024
- Introduced `numRowRes` as a parameter for `InnerJoin` and `SemiJoin` kernel functions, indicating the size of the result. - In `InnerJoin`: - If `numRowRes` is -1, the result size is set to `numRowRhs * numRowLhs`. - Otherwise, the result size is determined by `numRowRes`. - In `SemiJoin`: - If `numRowRes` is -1, the result size defaults to `numRowLhs`. - Otherwise, the result size is determined by `numRowRes`. - Updated DaphneDSL: - Added `numRowRes` as an optional parameter for `innerJoin` and `semiJoin` built-in functions. - If not provided, `numRowRes` defaults to -1, which is passed to DaphneIR operations. - Modified DaphneIR: - Made `numRowRes` a mandatory argument for `InnerJoinOp` and `SemiJoinOp`. - Implementation Updates: - Updated `DaphneDSLBuiltins.cpp` to handle default `numRowRes` values. - Set `numRowRes` to -1 in `SQLVisitor.cpp` for compatibility. - Adjusted `kernels.json` to reflect the new parameter in `innerJoin` and `semiJoin`. - Added script-level test cases to validate the new functionality. - Addresses issue `daphne-eu#901` by allowing users to specify result size to prevent over-allocation.
saminbassiri
added a commit
to saminbassiri/daphne
that referenced
this issue
Nov 18, 2024
- Introduced `numRowRes` as a parameter for `InnerJoin` and `SemiJoin` kernel functions, indicating the size of the result. - In `InnerJoin`: - If `numRowRes` is -1, the result size is set to `numRowRhs * numRowLhs`. - Otherwise, the result size is determined by `numRowRes`. - In `SemiJoin`: - If `numRowRes` is -1, the result size defaults to `numRowLhs`. - Otherwise, the result size is determined by `numRowRes`. - Updated DaphneDSL: - Added `numRowRes` as an optional parameter for `innerJoin` and `semiJoin` built-in functions. - If not provided, `numRowRes` defaults to -1, which is passed to DaphneIR operations. - Modified DaphneIR: - Made `numRowRes` a mandatory argument for `InnerJoinOp` and `SemiJoinOp`. - Implementation Updates: - Updated `DaphneDSLBuiltins.cpp` to handle default `numRowRes` values. - Set `numRowRes` to -1 in `SQLVisitor.cpp` for compatibility. - Adjusted `kernels.json` to reflect the new parameter in `innerJoin` and `semiJoin`. - Added script-level test cases to validate the new functionality. - Addresses issue `daphne-eu#901` by allowing users to specify result size to prevent over-allocation.
saminbassiri
added a commit
to saminbassiri/daphne
that referenced
this issue
Nov 25, 2024
- Introduced `numRowRes` as a parameter for `InnerJoin` and `SemiJoin` kernel functions, indicating the size of the result. - In `InnerJoin`: - If `numRowRes` is -1, the result size is set to `numRowRhs * numRowLhs`. - Otherwise, the result size is determined by `numRowRes`. - In `SemiJoin`: - If `numRowRes` is -1, the result size defaults to `numRowLhs`. - Otherwise, the result size is determined by `numRowRes`. - Updated DaphneDSL: - Added `numRowRes` as an optional parameter for `innerJoin` and `semiJoin` built-in functions. - If not provided, `numRowRes` defaults to -1, which is passed to DaphneIR operations. - Modified DaphneIR: - Made `numRowRes` a mandatory argument for `InnerJoinOp` and `SemiJoinOp`. - Implementation Updates: - Updated `DaphneDSLBuiltins.cpp` to handle default `numRowRes` values. - Set `numRowRes` to -1 in `SQLVisitor.cpp` for compatibility. - Adjusted `kernels.json` to reflect the new parameter in `innerJoin` and `semiJoin`. - Added script-level test cases to validate the new functionality. - Addresses issue `daphne-eu#901` by allowing users to specify result size to prevent over-allocation.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
The result of a relational join or semi-join can have the same size as the cartesian product of the two inputs, in the worst case. Thus, it is an upper bound for the result size. Very often, the actual result size will be much smaller, while the upper bound is too large to be allocated. In fact, many joins encountered in practice are N:1 join (e.g., primary-key foreign-key joins). In such cases, one input of the (semi-)join has unique keys. Then, the result size is upper-bounded by the other input.
Currently, DAPHNE's
innerJoin
-kernels always allocates the size of the cartesian product (src/local/runtime/kernels/innerJoin.h
, line 92:const size_t totalRows = numRowRhs * numRowLhs;
), and thesemiJoin
-kernel always allocates the size of the left-hand-side input (src/runtime/local/kernels/semiJoin.h
, line 75:res = DataObjectFactory::create<Frame>(numArgLhs, 1, schema, nullptr, false);
and 79). Both should be improved. We need a way to prevent DAPHNE from allocating the size of the cartesian product for join results for N:1 joins.As a quick fix, we want to add an optional parameter to both join variants that allows users to specify the number of result rows to allocate. In DaphneDSL, when
A
andB
are frames, it should be possible to write:Hints:
src/parser/daphnedsl/DaphneDSLBuiltins.cpp
. If the parameter is not provided in DaphneDSL, a-1
should be passed to the DaphneIR operations.InnerJoinOp
andSemiJoinOp
in DaphneIR.innerJoin
andsemiJoin
-kernels to use this additional argument; if it is-1
allocate the cartesian product, otherwise use the specified size for the result.We should go for a quick fix now since need this to work soon, but a better long-term solution could be:
The text was updated successfully, but these errors were encountered: