rust engine consume a lot of memory compared to pyarrow #2968
Comments
@djouallah 👋 in the attached notebook, which write_deltalake call is resulting in memory pressure? I'm less familiar with duckdb, but I assume the […]
It is an Arrow RecordBatchReader, I think.
The Rust engine materializes everything in memory before starting the write. Pyarrow probably writes batch by batch when you pass a reader.
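To make the difference described above concrete, here is a minimal sketch in plain Python. It is not delta-rs or pyarrow code: `make_reader`, `write_materialized`, and `write_streaming` are hypothetical names, and a list of integers stands in for an Arrow record batch. It only illustrates why collecting a reader up front holds every batch at once, while consuming it batch by batch holds one.

```python
from typing import Iterator, List, Tuple

Batch = List[int]  # stand-in for an Arrow RecordBatch (assumption for illustration)


def make_reader(num_batches: int, rows_per_batch: int) -> Iterator[Batch]:
    """Hypothetical lazy reader: each batch is created only when requested."""
    for i in range(num_batches):
        yield [i] * rows_per_batch


def write_materialized(reader: Iterator[Batch]) -> Tuple[int, int]:
    """Collect every batch first, then 'write'. Returns (rows_written, peak_batches_held)."""
    batches = list(reader)  # the whole input is resident in memory here
    rows_written = sum(len(batch) for batch in batches)
    return rows_written, len(batches)


def write_streaming(reader: Iterator[Batch]) -> Tuple[int, int]:
    """'Write' each batch as it arrives. Returns (rows_written, peak_batches_held)."""
    rows_written = 0
    peak_batches_held = 0
    for batch in reader:
        # Only the current batch is referenced; the previous one can be freed.
        peak_batches_held = max(peak_batches_held, 1)
        rows_written += len(batch)
    return rows_written, peak_batches_held


print(write_materialized(make_reader(60, 1000)))  # (60000, 60)
print(write_streaming(make_reader(60, 1000)))     # (60000, 1)
```

Both functions write the same number of rows, but the materialized variant's peak grows with the size of the input, which matches the kind of memory growth reported in this issue.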
Just for my own understanding, is this something that can be fixed by datafusion?
@djouallah there is a PR to address this, but the contributor hasn't had time to finish it yet: #2289
I have been analyzing the performance around this issue, even with today's latest […]
Rust engine […]
Pyarrow engine […]
The root cause of this behavior is, in my opinion, the collection of […]
There is some trickiness around […]
Nothing required from anybody, I just wanted to share the analysis and progress here. 🤔 🏃
Environment
Delta-rs version: 0.21.0
Binding:
Environment: Linux
Bug
Switching from the pyarrow engine to the rust engine increases memory usage by nearly 3x. The job used to work fine, but now I am getting OOM errors.
I added a reproducible example with only 60 input files to demo the issue:
https://colab.research.google.com/drive/1fahlV0FgKSAS8sQvRMu47s3bDP1ekLbb#scrollTo=333a177b-f075-412e-8ca1-32d44f8c07eb