Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Join and semi-join with result cardinality hint #901

Open
pdamme opened this issue Nov 12, 2024 · 2 comments
Open

Join and semi-join with result cardinality hint #901

pdamme opened this issue Nov 12, 2024 · 2 comments
Assignees

Comments

@pdamme
Copy link
Collaborator

pdamme commented Nov 12, 2024

The result of a relational join or semi-join can have the same size as the cartesian product of the two inputs, in the worst case. Thus, it is an upper bound for the result size. Very often, the actual result size will be much smaller, while the upper bound is too large to be allocated. In fact, many joins encountered in practice are N:1 join (e.g., primary-key foreign-key joins). In such cases, one input of the (semi-)join has unique keys. Then, the result size is upper-bounded by the other input.

Currently, DAPHNE's innerJoin-kernels always allocates the size of the cartesian product (src/local/runtime/kernels/innerJoin.h, line 92: const size_t totalRows = numRowRhs * numRowLhs;), and the semiJoin-kernel always allocates the size of the left-hand-side input (src/runtime/local/kernels/semiJoin.h, line 75: res = DataObjectFactory::create<Frame>(numArgLhs, 1, schema, nullptr, false); and 79). Both should be improved. We need a way to prevent DAPHNE from allocating the size of the cartesian product for join results for N:1 joins.

As a quick fix, we want to add an optional parameter to both join variants that allows users to specify the number of result rows to allocate. In DaphneDSL, when A and B are frames, it should be possible to write:

f1 = innerJoin(A, B, "a.fk", "b.pk");           # allocates the size of AxB for the result
f2 = innerJoin(A, B, "a.fk", "b.pk", nrows(A)); # allocates the size of A for the result
f3 = semiJoin(A, B, "a.fk", "b.pk");            # allocates the size of AxB for the result
f4 = semiJoin(A, B, "a.fk", "b.pk", nrows(A));  # allocates the size of A for the result

Hints:

  • Add an optional parameter to the DaphneDSL built-in functions in src/parser/daphnedsl/DaphneDSLBuiltins.cpp. If the parameter is not provided in DaphneDSL, a -1 should be passed to the DaphneIR operations.
  • Add an additional mandatory argument to the InnerJoinOp and SemiJoinOp in DaphneIR.
  • Adapt the innerJoin and semiJoin-kernels to use this additional argument; if it is -1 allocate the cartesian product, otherwise use the specified size for the result.
  • Add script-level test cases.

We should go for a quick fix now since need this to work soon, but a better long-term solution could be:

  • Don't allocate the result of joins (and other operations, e.g., filters) upfront, but create it chunk-by-chunk.
  • Automatically detect N:1 joins in the DAPHNE compiler and/or runtime kernels. This could be achieved through a "unique" data property.
@saminbassiri
Copy link
Contributor

Hi, I will work on this issue.

@pdamme
Copy link
Collaborator Author

pdamme commented Nov 13, 2024

Thanks, please go ahead!

saminbassiri added a commit to saminbassiri/daphne that referenced this issue Nov 18, 2024
- Introduced `numRowRes` as a parameter for `InnerJoin` and `SemiJoin` kernel functions, indicating the size of the result.
- In `InnerJoin`:
  - If `numRowRes` is -1, the result size is set to `numRowRhs * numRowLhs`.
  - Otherwise, the result size is determined by `numRowRes`.
- In `SemiJoin`:
  - If `numRowRes` is -1, the result size defaults to `numRowLhs`.
  - Otherwise, the result size is determined by `numRowRes`.
- Updated DaphneDSL:
  - Added `numRowRes` as an optional parameter for `innerJoin` and `semiJoin` built-in functions.
  - If not provided, `numRowRes` defaults to -1, which is passed to DaphneIR operations.
- Modified DaphneIR:
  - Made `numRowRes` a mandatory argument for `InnerJoinOp` and `SemiJoinOp`.
- Implementation Updates:
  - Updated `DaphneDSLBuiltins.cpp` to handle default `numRowRes` values.
  - Set `numRowRes` to -1 in `SQLVisitor.cpp` for compatibility.
  - Adjusted `kernels.json` to reflect the new parameter in `innerJoin` and `semiJoin`.
- Added script-level test cases to validate the new functionality.
- Addresses issue `daphne-eu#901` by allowing users to specify result size to prevent over-allocation.
saminbassiri added a commit to saminbassiri/daphne that referenced this issue Nov 18, 2024
- Introduced `numRowRes` as a parameter for `InnerJoin` and `SemiJoin` kernel functions, indicating the size of the result.
- In `InnerJoin`:
  - If `numRowRes` is -1, the result size is set to `numRowRhs * numRowLhs`.
  - Otherwise, the result size is determined by `numRowRes`.
- In `SemiJoin`:
  - If `numRowRes` is -1, the result size defaults to `numRowLhs`.
  - Otherwise, the result size is determined by `numRowRes`.
- Updated DaphneDSL:
  - Added `numRowRes` as an optional parameter for `innerJoin` and `semiJoin` built-in functions.
  - If not provided, `numRowRes` defaults to -1, which is passed to DaphneIR operations.
- Modified DaphneIR:
  - Made `numRowRes` a mandatory argument for `InnerJoinOp` and `SemiJoinOp`.
- Implementation Updates:
  - Updated `DaphneDSLBuiltins.cpp` to handle default `numRowRes` values.
  - Set `numRowRes` to -1 in `SQLVisitor.cpp` for compatibility.
  - Adjusted `kernels.json` to reflect the new parameter in `innerJoin` and `semiJoin`.
- Added script-level test cases to validate the new functionality.
- Addresses issue `daphne-eu#901` by allowing users to specify result size to prevent over-allocation.
saminbassiri added a commit to saminbassiri/daphne that referenced this issue Nov 25, 2024
- Introduced `numRowRes` as a parameter for `InnerJoin` and `SemiJoin` kernel functions, indicating the size of the result.
- In `InnerJoin`:
  - If `numRowRes` is -1, the result size is set to `numRowRhs * numRowLhs`.
  - Otherwise, the result size is determined by `numRowRes`.
- In `SemiJoin`:
  - If `numRowRes` is -1, the result size defaults to `numRowLhs`.
  - Otherwise, the result size is determined by `numRowRes`.
- Updated DaphneDSL:
  - Added `numRowRes` as an optional parameter for `innerJoin` and `semiJoin` built-in functions.
  - If not provided, `numRowRes` defaults to -1, which is passed to DaphneIR operations.
- Modified DaphneIR:
  - Made `numRowRes` a mandatory argument for `InnerJoinOp` and `SemiJoinOp`.
- Implementation Updates:
  - Updated `DaphneDSLBuiltins.cpp` to handle default `numRowRes` values.
  - Set `numRowRes` to -1 in `SQLVisitor.cpp` for compatibility.
  - Adjusted `kernels.json` to reflect the new parameter in `innerJoin` and `semiJoin`.
- Added script-level test cases to validate the new functionality.
- Addresses issue `daphne-eu#901` by allowing users to specify result size to prevent over-allocation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants