Potentially inefficient memory usage #4111
-
@harm-matthias-harms Thanks for the analysis! See #819: regarding GC, note that Kedro has internal "garbage collection" inside the runner, where intermediate datasets are released from memory.
This is expected behavior and shouldn't cause any problems. These references live for the lifespan of a single node, and memory is released as soon as the next node starts. See kedro/kedro/runner/sequential_runner.py, line 88 in ba98135. I did a similar analysis specifically for this. Do you have a lot of source nodes that read in a lot of data? If you can confirm this, I'll try to look into it again.
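To illustrate the "internal garbage collection" mentioned above, here is a minimal sketch (hypothetical names, not Kedro's actual code) of how a sequential runner can release intermediate in-memory datasets: count how many downstream nodes still need each dataset, and drop it once that count reaches zero.

```python
from collections import Counter

def run_sequential(nodes, datasets):
    """nodes: list of (input_names, func, output_name) triples.
    datasets: dict mapping dataset name -> loaded data."""
    # Count how many nodes consume each dataset.
    load_counts = Counter(name for inputs, _, _ in nodes for name in inputs)
    released = []
    for inputs, func, output in nodes:
        datasets[output] = func(*(datasets[n] for n in inputs))
        for name in inputs:
            load_counts[name] -= 1
            if load_counts[name] == 0:
                # No remaining consumers: release the intermediate data.
                datasets.pop(name, None)
                released.append(name)
    return datasets, released

# Two-node chain: "a" -> "b" -> "c"; "a" and "b" are released as soon
# as their last consumer has run.
nodes = [(["a"], lambda x: x + 1, "b"), (["b"], lambda x: x * 2, "c")]
datasets, released = run_sequential(nodes, {"a": 1})
```

This mirrors the behavior described in the reply: memory for an intermediate dataset is held for the lifespan of the node that consumes it, then released before the next node runs.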
-
TL;DR
We observed that running Kedro nodes results in unexpectedly high memory usage relative to the dataset size. Profiling revealed that many object references are retained for the entire lifespan of a node, leading to increased memory consumption. As a result, while developing pipelines, we have had to apply unnecessary optimizations, such as splitting datasets or resizing computational resources. Memory usage is also influenced by the number of times datasets are converted to different formats.
Observed worst case scenario
Inputs
Loading multiple datasets, particularly those sourced from databases, can cause SQLAlchemy readers and pandas DataFrames to persist in memory for the entire duration of the node's execution.
Running the node
During execution, data may be combined, transformed, or converted to different DataFrame types (e.g., from GeoPandas to Polars). Intermediate references are eventually garbage collected, but inputs and outputs are retained in memory throughout the lifespan of the node.
Saving the new dataset
When saving data, additional transformations may be required, such as upserting a pandas.DataFrame into a database table. Even after the output is no longer needed, it remains in memory.
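As a hedged sketch of why saving can add yet another copy (names and data shapes here are illustrative, not Kedro or SQLAlchemy APIs): converting a column-oriented table into row records for an upsert means the original table and the converted rows coexist in memory until the save returns.

```python
def to_rows(table):
    """Convert a column-oriented table into row dicts (a second full copy)."""
    cols = list(table)
    n = len(table[cols[0]])
    return [{c: table[c][i] for c in cols} for i in range(n)]

def upsert(rows, db):
    # Stand-in for writing rows into a database table keyed by "id".
    for row in rows:
        db[row["id"]] = row

table = {"id": [1, 2], "value": ["a", "b"]}
db = {}
rows = to_rows(table)   # `table` and `rows` are both alive at this point
upsert(rows, db)
```

If the runner then keeps `table` (the node output) alive until its hooks have run, the conversion copy is pure overhead that only the eventual garbage collection reclaims.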
After nodes and hooks have run
Inputs and outputs are eventually garbage collected, but during processing, memory usage can spike to 2x to 6x the anticipated amount due to Kedro's operations.
Code insights
It appears that the memory issues are largely tied to the hook system. While I am not fully familiar with it, examining the `_call_node_run` function provides some insight into how long inputs and outputs are retained. For instance, inputs are passed to the `after_node_run` hook, meaning their references are kept for the node's entire lifespan. Outputs also seem to be retained for some time after the dataset is stored. Modifying this behavior would break current functionality, but raising awareness of this issue might lead to improved approaches in the future.
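The lifetime-extension effect described above can be reproduced in a few lines of plain Python (hypothetical names, not Kedro's actual code): because the runner must hold a reference to `inputs` in order to pass them to an after-node hook, the inputs cannot be collected until the hook has run, even if the node function finished with them long before. A `weakref` makes the object's lifetime observable.

```python
import gc
import weakref

class BigInput:
    """Stands in for a large loaded dataset."""

def run_node(inputs, func, after_node_run_hook):
    outputs = func(inputs)
    # The runner keeps `inputs` alive here solely so it can hand them
    # to the hook; they survive until after the hook returns.
    after_node_run_hook(inputs=inputs, outputs=outputs)
    return outputs

obj = BigInput()
ref = weakref.ref(obj)
seen_alive = []

def hook(inputs, outputs):
    # The input object is still alive when the hook fires.
    seen_alive.append(ref() is not None)

result = run_node({"x": obj}, lambda d: len(d), hook)
del obj
gc.collect()
freed_after_run = ref() is None  # only now can it be collected
```

This is only a model of the behavior, but it matches the observation above: any object passed into a post-run hook is pinned in memory for the node's entire lifespan, regardless of when the node function itself stopped needing it.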