-
I created a Pandas DataFrame and write the data in overwrite or append mode using the write_deltalake method, but performance is poor whether the same data is overwritten or a new row is appended. In particular, reading with the DeltaTable method takes less than 3 seconds at first, but grows to 30 seconds after an overwrite or append has been performed about 900 times, even with calls to optimize() and vacuum(). Can you tell me the solution for this?

Overwrite mode: [Read time] after repeatedly writing a 10x10 table.
Append mode: [Read time] after adding a 10-column row each time.
-
Best guess so far is that each write creates a new table version. On every read we have to parse the entire Delta log, and that log naturally grows with each write operation against the table. Usually a checkpoint would be created (Databricks does that every 10 commits), but delta-rs does not do it automatically. So in this case, every time you load the table, up to 900 requests may need to be made to fetch all the log files... We are also actively working to improve delta log handling / parsing performance.
Instead of passing a path on every call, after the first write_deltalake call get a DeltaTable object and pass that in instead. That way only the incremental log updates have to be loaded on each write.
There is also a function to create a checkpoint, which you can run every ten or so commits. It is a method on DeltaTable called create_checkpoint.