-
I created a Pandas DataFrame and write the data in overwrite or append mode using the write_deltalake method, but performance is poor whether the same data is overwritten or a new row is appended. In particular, reading with the DeltaTable method takes less than 3 seconds at first, but grows to 30 seconds after an overwrite or append has been performed about 900 times, even with calls to optimize() and vacuum(). Can you tell me the solution for this?

Overwrite mode: [Read time] after repeatedly writing a 10x10 table.
Append mode: [Read time] after adding a 10-column row each time.
-
Best guess so far is that each write creates a new table version. On every read we have to parse the entire Delta log, and that log naturally grows with each write operation against the table. Usually a checkpoint would be created (Databricks does that every 10 commits), but delta-rs does not do it automatically. So in this case, every time you load the table, up to 900 requests may need to be made to fetch all the log files... We are also actively working to improve delta log handling / parsing performance.
Instead of passing a path on every call, after the first write_deltalake call get a DeltaTable object and pass that in instead. That way only the incremental log updates have to be loaded on each write.
There is also a function to create a checkpoint, which you can run every ten or so commits. It is a method on DeltaTable called create_checkpoint.