I have a bunch of data that is processed on the fly, and I can't think of a good way to harmonize it with Kedro nodes.

Due to memory limitations and plain efficiency, I load X records at a time from a database, convert them into a pyarrow table, and save them to Parquet via pyarrow.parquet.ParquetWriter. I keep loading records and appending to the Parquet file until the query is exhausted.

The problem is twofold: I can't access the data catalog within the node, so the file save name has to be an input parameter (far from ideal), and the returned output has to be some sort of dummy, since the pyarrow table is never actually loaded into memory in full, only in small chunks.

Currently I'm thinking of decorating the node function to support this functionality, but I'm wondering whether there's something I've neglected to look at.
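For concreteness, the load-and-append loop described above might look roughly like this. This is a sketch, not code from the original post; the cursor, batch size, and schema handling are all assumptions.

```python
import pyarrow as pa
import pyarrow.parquet as pq

def dump_query_to_parquet(cursor, path, batch_size=50_000):
    """Fetch DB rows in batches and append each batch to one Parquet file."""
    writer = None
    try:
        while True:
            rows = cursor.fetchmany(batch_size)  # only batch_size rows in memory
            if not rows:
                break
            columns = [desc[0] for desc in cursor.description]
            table = pa.Table.from_pylist([dict(zip(columns, row)) for row in rows])
            if writer is None:
                # Open the writer lazily so the first batch supplies the schema.
                writer = pq.ParquetWriter(path, table.schema)
            writer.write_table(table)
    finally:
        if writer is not None:
            writer.close()
```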
Hi @dpavlic. Kedro today doesn't support streaming use cases; we've played around with some prototypes, but it is firmly a batch-based tool for now. In terms of your solution, it sounds like PySpark (or perhaps Dask) would be a good candidate, as they do some of this efficiency optimisation behind the scenes. That being said, your decorator solution is intriguing, so I would be interested to see what that would look like. As far as I can tell you've not neglected any Kedro functionality; I would reiterate that if you're reaching the performance limit of a pandas-based workflow, explore PySpark.
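For anyone landing here, a rough sketch of what the Dask route could look like; the table name, connection string, index column, and partition count below are placeholders, not anything from this thread.

```python
import dask.dataframe as dd

# Dask reads the table lazily, one partition at a time, so the full
# result set never has to fit in memory.
ddf = dd.read_sql_table(
    "my_table",                         # hypothetical table name
    "postgresql://user:pass@host/db",   # hypothetical connection URI
    index_col="id",                     # column Dask partitions on
    npartitions=100,
)
ddf.to_parquet("data/my_table.parquet")  # written partition by partition
```

PySpark's `spark.read.jdbc` offers a similar partitioned read if you're already on Spark.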
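As a postscript, one hedged guess at what the decorator mentioned above could look like: it wraps a node function that yields pyarrow record batches, streams them to a Parquet file, and returns the path as the dummy output. The decorator name, the `path_param` convention, and the generator contract are all hypothetical, not Kedro API.

```python
import functools

import pyarrow as pa
import pyarrow.parquet as pq

def streaming_output(path_param="output_path"):
    """Wrap a node that yields pyarrow RecordBatch objects; stream them to
    Parquet and return the file path as a dummy output for the Kedro DAG."""
    def decorator(node_func):
        @functools.wraps(node_func)
        def wrapper(*args, **kwargs):
            path = kwargs.pop(path_param)  # save path arrives as a node input
            writer = None
            try:
                for batch in node_func(*args, **kwargs):
                    if writer is None:
                        writer = pq.ParquetWriter(path, batch.schema)
                    writer.write_table(pa.Table.from_batches([batch]))
            finally:
                if writer is not None:
                    writer.close()
            return path  # dummy output; downstream nodes get the path string
        return wrapper
    return decorator
```

Returning the path rather than the data keeps memory usage flat while still giving Kedro a value to thread through the DAG.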