Missing CommitInfo in checkpoint file #2020
Replies: 8 comments
-
I would be happy to raise a PR, please let me know if my expectation is correct. |
Beta Was this translation helpful? Give feedback.
-
Hi @abhimanyusinghgaur - The current behavious is actually what is meant by the protocol. "replay" refers to the reconciled actions, so the checkpoint describes the current state of the table at that version, which includes tombstones for files within the retention period. At that time there is no more straight forward mapping to individual versions etc. That said, we will eventually support row tracking, which lets you correlate data with versions, however the commit into is indeed lost, once it has been vacuumed. |
Beta Was this translation helpful? Give feedback.
-
@roeap I see, Thanks. So, maybe it is more of a feature request than a bug. Let me tell you my use case. I have a data source from which I retrieve data using a cursor and save that data to a delta table. Now, at the very beginning, I don't have any cursor. I just query the data source, and it gives me a set of rows along with a cursor for me to make the next query. The usual pagination you can imagine. The cursor here is very important information for me. Every time I write the set of queried rows to the delta table, I also save the cursor in the commit metadata so that the next time when I load the delta table, I know from which cursor to resume querying data from the data source. That's what I was trying to achieve with #1584. Now, the moment I create a checkpoint, I lose that information. I can't figure out anymore what the cursor was. If somehow there were a way to preserve the commit metadata of the latest version even after a checkpoint operation, it would be the best way out for me. I know, I can save the cursor information in an external DB like PostgreSQL, but I didn't want to do that as I lose the atomicity that way. Being able to write both the data and metadata in a single delta transaction guaranteed that the data and cursor would always match. But, having a separate PostgreSQL to store that metadata won't give me that guarantee anymore as writing data and writing metadata would be two separate transactions in two separate systems altogether. Each one can fail irrespective of the other and then I can get inconsistencies between the two. Also, saving the cursor with each commit made me able to time-travel. So, let's say if I noticed some discrepancy in the delta table, then I could just revert the table to the last known good version, and I would still know the correct cursor from which to resume querying the data from the data source. To achieve this thing with PostgreSQL, it would require me to maintain the history of all the commit actions being performed on delta within PostgreSQL. Probably a time-series DB would be better suited for this use case. Overall, I don't want to maintain two separate systems for maintaining data and metadata. Could you please let me know if there could be a way to achieve what I'm trying to do just with delta? |
Beta Was this translation helpful? Give feedback.
-
The main question is, how far do you want to be able to travel back in time... the latest commit infos should be retained, even if a checkpoint gets created. so you should be able to read the commit infos that are stored on the table state once you laoded the table. The commit files are only ever deleted on a vacuum operation. so you can configure the retention period to you liking, and if you want to get the cursor for a version earlier then the latest checkpoint, you can load that table version explicitly (at thta point we cannot use the checpoint data to load thetable, and the state will therefore contain commit infos for earlier commits. Would that already help? |
Beta Was this translation helpful? Give feedback.
-
Not too far. Most of the time, I just want the latest commit info that contains metadata inserted by me. Note that it might not be in the latest version as operations like OPTIMIZE could have run after my commit. So, I just do a linear traversal over the commit inofs in reverse order and get the first one that contains the key I'm looking for.
90% of the time, yes! it would work. But, there are a few cases, where it won't work.
The 1st point is a real blocker for me. The 2nd point can be ignored for practical purposes, it's very unlikely to happen (though still a possibility). Overall, I would like both of the above points to be solved by including the commit info actions in the checkpoint file so that I get a 100% guarantee of data retention and not a chance of fate. In a sense, I'm looking to have ACID guarantees for metadata too, and not just the data. |
Beta Was this translation helpful? Give feedback.
-
Let's step back to your original use case:
Have you considered using transaction identifiers? You can associate an arbitrary string key "app id" with a long integer value "version." These app ids appear in snapshots, are preserved indefinitely in checkpoints, and can be updated by subsequent commits that advance the cursor. They are typically used to track high watermarks when consuming data from streaming sources, tho they can be used for anything the user wants. I don't know if delta-rs supports transaction identifiers yet, but if it does they seem like an ideal solution for your use case. Quite recently, domain metadata was also added, for more complex situations where an integer payload isn't enough -- but most Delta readers don't support that feature yet. A few additional thoughts: The Delta protocol specifically states:
The checkpoint includes everything from the table's entire history that is still relevant for the snapshot in question (it would be bad to "forget" about a file added to the table a year ago, if the user never removed that file). If checkpoints preserved commit info, a fast-moving table would quickly bloat its checkpoint as commit info actions accumulate. Even if we assumed that they somehow "expired" after 30 days, committing once per minute would mean 43k commits in the checkpoint. The only reliable way to get the commit info from each commit is to read it from the actual commit .json files. Spark Delta has an internal API for doing that, which returns all actions between two commit versions. I don't know if delta-rs has anything similar? Meanwhile, the erratic behavior you observe is actually a bug in delta-rs. A snapshot shouldn't ever contain commit info. It is not on the list of actions that participate in action reconciliation. |
Beta Was this translation helpful? Give feedback.
-
@abhimanyusinghgaur If I am understanding the challenge you're facing here, it's that when you read the table you will get a I am not sure what behavior Delta/Spark has, but it sounds like the "fix" would be to attempt to load the latest checkpoint and the latest version JSON file, even if the latest version is described in the latest checkpoint. That or provide an API for you to ensure the table state is hydrated from both those files, rather than what it sounds like delta-rs is doing currently. |
Beta Was this translation helpful? Give feedback.
-
I see. I was unaware of these two. Although, Tx IDs can only store an integer-like value, and what I need is to store a string, but the Domain Metadata seems promising. That will work best for my use case. Thanks!!
Nice idea! I guess, I can work with this for now until domain metadata lands an implementation. Thanks for both your suggestions @ryan-johnson-databricks and @rtyler. I think, this issue can be closed now, as it is not a bug in |
Beta Was this translation helpful? Give feedback.
-
Environment
Delta-rs version: Latest main b29766a
Binding: rust
Environment:
Bug
What happened:
When a checkpoint is created using
create_checkpoint
, it doesn't include theCommitInfo
action in the checkpoint file. It results in all the commit infos in the previous delta log JSON files being lost.What you expected to happen:
As per the delta protocol:
So, I expect that it should contain all the actions including
CommitInfo
actions.How to reproduce it:
Just create a delta table, do some writes, updates, etc, and then a checkpoint. You would see that the checkpoint parquet file doesn't include the
CommitInfo
actions from the previous delta log JSON files.More details:
This is the part that creates the checkpoint file:
delta-rs/rust/src/action/checkpoints.rs
Lines 318 to 356 in b29766a
You can see that it only includes the following actions:
and misses out on:
actions from this list of actions:
delta-rs/rust/src/action/mod.rs
Lines 546 to 563 in b29766a
I think, it should write all the actions to the checkpoint file.
Beta Was this translation helpful? Give feedback.
All reactions