I have a bunch of data that is processed on the fly, and I can't think of a good way to harmonize it with Kedro nodes.

Due to memory limitations and plain efficiency, I load X records at a time from a database, convert them into a pyarrow table, and save them to Parquet via pyarrow.parquet.ParquetWriter. I keep loading records and appending to the Parquet file until the query is exhausted.

The problem is twofold: I can't access the data catalog within the node, so the file save name has to be an input parameter (far from ideal), and the returned output has to be some sort of dummy, since the pyarrow table is never actually loaded into memory in full, only in small chunks.

Currently I'm thinking of decorating the node function to support this functionality, but I'm wondering whether there's something I've neglected to look at.
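For concreteness, the load-and-append loop described above might look roughly like this. This is a sketch, not code from the original post; the cursor, batch size, and schema handling are all assumptions.

```python
import pyarrow as pa
import pyarrow.parquet as pq

def dump_query_to_parquet(cursor, path, batch_size=50_000):
    """Fetch DB rows in batches and append each batch to one Parquet file."""
    writer = None
    try:
        while True:
            rows = cursor.fetchmany(batch_size)  # only batch_size rows in memory
            if not rows:
                break
            columns = [desc[0] for desc in cursor.description]
            table = pa.Table.from_pylist([dict(zip(columns, row)) for row in rows])
            if writer is None:
                # Open the writer lazily so the first batch supplies the schema.
                writer = pq.ParquetWriter(path, table.schema)
            writer.write_table(table)
    finally:
        if writer is not None:
            writer.close()
```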
Hi @dpavlic. Kedro today doesn't support streaming use cases; we've played around with some prototypes, but it is firmly a batch-based tool for now. In terms of your solution, it sounds like PySpark (or perhaps Dask) would be a good candidate, as they do some of this efficiency optimisation behind the scenes. That being said, your decorator solution is intriguing, so I would be interested to see what that would look like. As far as I can tell you've not neglected any Kedro functionality; I would reiterate that if you're reaching the performance limit of a pandas-based workflow, explore PySpark.
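For anyone landing here, a rough sketch of what the Dask route could look like; the table name, connection string, index column, and partition count below are placeholders, not anything from this thread.

```python
import dask.dataframe as dd

# Dask reads the table lazily, one partition at a time, so the full
# result set never has to fit in memory.
ddf = dd.read_sql_table(
    "my_table",                         # hypothetical table name
    "postgresql://user:pass@host/db",   # hypothetical connection URI
    index_col="id",                     # column Dask partitions on
    npartitions=100,
)
ddf.to_parquet("data/my_table.parquet")  # written partition by partition
```

PySpark's `spark.read.jdbc` offers a similar partitioned read if you're already on Spark.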
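As a postscript, one hedged guess at what the decorator mentioned above could look like: it wraps a node function that yields pyarrow record batches, streams them to a Parquet file, and returns the path as the dummy output. The decorator name, the `path_param` convention, and the generator contract are all hypothetical, not Kedro API.

```python
import functools

import pyarrow as pa
import pyarrow.parquet as pq

def streaming_output(path_param="output_path"):
    """Wrap a node that yields pyarrow RecordBatch objects; stream them to
    Parquet and return the file path as a dummy output for the Kedro DAG."""
    def decorator(node_func):
        @functools.wraps(node_func)
        def wrapper(*args, **kwargs):
            path = kwargs.pop(path_param)  # save path arrives as a node input
            writer = None
            try:
                for batch in node_func(*args, **kwargs):
                    if writer is None:
                        writer = pq.ParquetWriter(path, batch.schema)
                    writer.write_table(pa.Table.from_batches([batch]))
            finally:
                if writer is not None:
                    writer.close()
            return path  # dummy output; downstream nodes get the path string
        return wrapper
    return decorator
```

Returning the path rather than the data keeps memory usage flat while still giving Kedro a value to thread through the DAG.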