Potentially inefficient memory usage #4111
-
@harm-matthias-harms Thanks for the analysis! See #819: regarding GC, note that Kedro has internal "garbage collection" inside the runner, where intermediate datasets are released from memory.
This is expected behavior and shouldn't cause any problems. These references live for the lifespan of a single node, and memory is released as soon as the next node starts. See kedro/kedro/runner/sequential_runner.py, line 88 in ba98135. I did a similar analysis specifically for this. Do you have a lot of source nodes that read in a lot of data? If you can confirm this, I'll try to look into it again.
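To illustrate the "internal garbage collection" mentioned above, here is a minimal sketch (hypothetical names, not Kedro's actual code) of how a sequential runner can release intermediate in-memory datasets: count how many downstream nodes still need each dataset, and drop it once that count reaches zero.

```python
from collections import Counter

def run_sequential(nodes, datasets):
    """nodes: list of (input_names, func, output_name) triples.
    datasets: dict mapping dataset name -> loaded data."""
    # Count how many nodes consume each dataset.
    load_counts = Counter(name for inputs, _, _ in nodes for name in inputs)
    released = []
    for inputs, func, output in nodes:
        datasets[output] = func(*(datasets[n] for n in inputs))
        for name in inputs:
            load_counts[name] -= 1
            if load_counts[name] == 0:
                # No remaining consumers: release the intermediate data.
                datasets.pop(name, None)
                released.append(name)
    return datasets, released

# Two-node chain: "a" -> "b" -> "c"; "a" and "b" are released as soon
# as their last consumer has run.
nodes = [(["a"], lambda x: x + 1, "b"), (["b"], lambda x: x * 2, "c")]
datasets, released = run_sequential(nodes, {"a": 1})
```

This mirrors the behavior described in the reply: memory for an intermediate dataset is held for the lifespan of the node that consumes it, then released before the next node runs.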
-
TL;DR
We observed that running Kedro nodes results in unexpectedly high memory usage relative to the dataset size. Profiling revealed that many object references are retained for the entire lifespan of a node, leading to increased memory consumption. As a result, while developing pipelines, we have had to apply unnecessary optimizations, such as splitting datasets or resizing computational resources. Memory usage is also influenced by the number of times datasets are converted to different formats.
Observed worst case scenario
Inputs
Loading multiple datasets, particularly those sourced from databases, can cause SQLAlchemy readers and pandas DataFrames to persist in memory for the entire duration of the node's execution.
Running the node
During execution, data may be combined, transformed, or converted to different DataFrame types (e.g., from GeoPandas to Polars). Intermediate references are eventually garbage collected, but inputs and outputs are retained in memory throughout the lifespan of the node.
Saving the new dataset
When saving data, additional transformations may be required, such as upserting a pandas.DataFrame into a database table. Even after the output is no longer needed, it remains in memory.
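As a hedged sketch of why saving can add yet another copy (names and data shapes here are illustrative, not Kedro or SQLAlchemy APIs): converting a column-oriented table into row records for an upsert means the original table and the converted rows coexist in memory until the save returns.

```python
def to_rows(table):
    """Convert a column-oriented table into row dicts (a second full copy)."""
    cols = list(table)
    n = len(table[cols[0]])
    return [{c: table[c][i] for c in cols} for i in range(n)]

def upsert(rows, db):
    # Stand-in for writing rows into a database table keyed by "id".
    for row in rows:
        db[row["id"]] = row

table = {"id": [1, 2], "value": ["a", "b"]}
db = {}
rows = to_rows(table)   # `table` and `rows` are both alive at this point
upsert(rows, db)
```

If the runner then keeps `table` (the node output) alive until its hooks have run, the conversion copy is pure overhead that only the eventual garbage collection reclaims.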
After nodes and hooks have run
Inputs and outputs are eventually garbage collected, but during processing, memory usage can spike to 2x to 6x the anticipated amount due to Kedro's operations.
Code insights
It appears that the memory issues are largely tied to the hook system. While I am not fully familiar with it, examining the `_call_node_run` function provides some insight into how long inputs and outputs are retained. For instance, inputs are passed to the `after_node_run` hook, meaning their references are kept for the node's entire lifespan. Outputs also seem to be retained for some time after the dataset is stored. Modifying this behavior would break current functionality, but raising awareness of this issue might lead to improved approaches in the future.
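The lifetime-extension effect described above can be reproduced in a few lines of plain Python (hypothetical names, not Kedro's actual code): because the runner must hold a reference to `inputs` in order to pass them to an after-node hook, the inputs cannot be collected until the hook has run, even if the node function finished with them long before. A `weakref` makes the object's lifetime observable.

```python
import gc
import weakref

class BigInput:
    """Stands in for a large loaded dataset."""

def run_node(inputs, func, after_node_run_hook):
    outputs = func(inputs)
    # The runner keeps `inputs` alive here solely so it can hand them
    # to the hook; they survive until after the hook returns.
    after_node_run_hook(inputs=inputs, outputs=outputs)
    return outputs

obj = BigInput()
ref = weakref.ref(obj)
seen_alive = []

def hook(inputs, outputs):
    # The input object is still alive when the hook fires.
    seen_alive.append(ref() is not None)

result = run_node({"x": obj}, lambda d: len(d), hook)
del obj
gc.collect()
freed_after_run = ref() is None  # only now can it be collected
```

This is only a model of the behavior, but it matches the observation above: any object passed into a post-run hook is pinned in memory for the node's entire lifespan, regardless of when the node function itself stopped needing it.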