-
Hey dadepo, to fully answer this question we would need a little bit more info. First off, are you working in Python or Rust? A bit more puzzling though, what do you mean by "already loaded into deltalake"?

Deltalake essentially combines data files with a commit log. The log allows Delta to track versions, provide ACID transactions, etc. As such it requires specialised writers to write to Delta tables, to keep the log up to date. While in principle it is agnostic of the file format used for data files, none of the implementations I am aware of would write the data as CSV.

That said, if you have the data in memory already, there is no need to write it to disk again and re-load it. Beyond figuring out which files to load (potentially already considering read predicates), Delta is not actually involved in evaluating the queries. And while Delta has some fairly advanced techniques to optimize reads, these are all heuristic in nature - i.e. they do not produce an exact query result.

Assuming you are in Python, you can use any suitable query / dataframe library to evaluate a query against the in-memory data - e.g. pandas, polars, duckdb, ... One thing to look out for, though, is that querying fields inside nested JSON structures may not be supported by all of these, so to make it easier for users you may want to expand the JSON payload into columns. pyarrow has some great utilities for parsing JSON into tables. And while I don't think pyarrow has SQL support (yet?), the record batches / tables can be converted to pandas or polars, or queried directly via DuckDB...
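For illustration, a minimal sketch of that approach might look like the following (assuming the Python deltalake package, a placeholder table_path, and that every row of content holds a single JSON object):

```python
import io

import duckdb
import pyarrow.json as paj
from deltalake import DeltaTable

table_path = "path/to/table"  # placeholder for the existing table location

# Load only the `content` column of the Delta table into memory as an Arrow table.
arrow_table = DeltaTable(table_path).to_pyarrow_table(columns=["content"])

# Expand the JSON payload into proper columns: join the per-row JSON strings
# into newline-delimited JSON and let pyarrow infer the schema while parsing.
ndjson = "\n".join(arrow_table.column("content").to_pylist()).encode("utf-8")
content_table = paj.read_json(io.BytesIO(ndjson))

# Arrow tables can be queried directly by DuckDB via replacement scans, so a
# user-supplied SQL string can refer to the expanded columns through the
# variable name `content_table`.
user_sql = "SELECT * FROM content_table LIMIT 10"  # example user query
print(duckdb.sql(user_sql).to_df())
```

The same content_table could of course be converted with to_pandas() or polars.from_arrow() instead, if pandas or polars is the preferred query layer.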
-
I have a bit of a strange question. Is it possible to convert a dataframe in memory into a delta-lake table?
Here is why I ask. I have a use case that goes somewhat like this: allow an arbitrary SQL string to be run against the data in a column of a CSV that has already been loaded into delta_lake at a table path, table_path. Let's say the CSV has 3 columns: id, content, timestamp.
The client just knows there is data at table_path and will want to write SQL against that. The implementation, for whatever reason, now has the data that needs to be queried inside a content column at table_path.
The stuff in content is JSON.
And there is the constraint that this setup cannot be changed. Akin to legacy.
Currently I can load table_path and then run select content from table_path to get the dataframe representing the content, but now I need to run user-defined queries against this content.
One way I thought about handling this is to first write out the dataframe representing content to a temporary location, load it back into memory as a Delta Lake table, and then run the query against it.
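For concreteness, what I have in mind is roughly this (just a sketch, assuming the Python deltalake package and pandas; tmp_path and the column handling are placeholders):

```python
import json
import tempfile

import pandas as pd
from deltalake import DeltaTable, write_deltalake

table_path = "path/to/table"  # the existing table location

# Pull the `content` column (JSON strings) out of the existing Delta table.
content_df = DeltaTable(table_path).to_pandas(columns=["content"])

# Expand each JSON payload into proper columns.
records = [json.loads(s) for s in content_df["content"]]
expanded = pd.json_normalize(records)

# Write the expanded frame out as a temporary Delta table and re-load it,
# so there is a table that the user-defined SQL could be pointed at.
tmp_path = tempfile.mkdtemp()
write_deltalake(tmp_path, expanded)
reloaded = DeltaTable(tmp_path).to_pandas()
```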
I was wondering if there is a better way to go about this?
I guess the core question is: what is the best way to run user-defined queries against the content of a column of a Delta table?