-
Hey dadepo, to fully answer this question we would need a little bit more info. First off, are you working in Python or Rust? A bit more puzzling though, what do you mean by "already loaded into deltalake"?

Deltalake essentially combines data files with a commit log. The log allows Delta to track versions, provide ACID transactions, etc. As such it requires specialised writers to write to Delta tables, to keep the log up to date. While in principle it is agnostic of the file format used for data files, none of the implementations I am aware of would write the data as CSV.

That said, if you have the data in memory already, there is no need to write it to disk again and re-load it. Beyond figuring out which files to load (potentially already considering read predicates), Delta is not actually involved in evaluating the queries. And while Delta has some fairly advanced techniques to optimize reads, these are all heuristic in nature - i.e. they do not produce an exact query result.

Assuming you are in Python, you can use any suitable query / dataframe library to evaluate a query against the in-memory data - e.g. pandas, polars, duckdb, ... One thing to look out for, though, is that querying fields inside nested JSON structures may not be supported by all of these, so to make it easier for users you may want to expand the JSON payload into columns. pyarrow has some great utilities for parsing JSON into tables. And while I don't think pyarrow has SQL support (yet?), the record batches / tables can be converted to pandas or polars, or queried directly via DuckDB...
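For illustration, a minimal sketch of that approach might look like the following (assuming the Python deltalake package, a placeholder table_path, and that every row of content holds a single JSON object):

```python
import io

import duckdb
import pyarrow.json as paj
from deltalake import DeltaTable

table_path = "path/to/table"  # placeholder for the existing table location

# Load only the `content` column of the Delta table into memory as an Arrow table.
arrow_table = DeltaTable(table_path).to_pyarrow_table(columns=["content"])

# Expand the JSON payload into proper columns: join the per-row JSON strings
# into newline-delimited JSON and let pyarrow infer the schema while parsing.
ndjson = "\n".join(arrow_table.column("content").to_pylist()).encode("utf-8")
content_table = paj.read_json(io.BytesIO(ndjson))

# Arrow tables can be queried directly by DuckDB via replacement scans, so a
# user-supplied SQL string can refer to the expanded columns through the
# variable name `content_table`.
user_sql = "SELECT * FROM content_table LIMIT 10"  # example user query
print(duckdb.sql(user_sql).to_df())
```

The same content_table could of course be converted with to_pandas() or polars.from_arrow() instead, if pandas or polars is the preferred query layer.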
-
I have a bit of a strange question. Is it possible to convert a dataframe in memory into a delta-lake table?
Here is why I ask. I have a use case that goes somewhat like this: allow an arbitrary SQL string to be run against the data in a column of a CSV that has already been loaded into delta_lake at a table path, table_path. Let's say the CSV has 3 columns: id, content, timestamp.
The client just knows there is data at table_path and will want to write SQL against that. The implementation, for whatever reason, now has the data that needs to be queried inside a content column at table_path.
The stuff in content is JSON.
And there is the constraint that this setup cannot be changed. Akin to legacy.
Currently I can load table_path and then run select content from table_path to get the dataframe representing the content, but now I need to run user-defined queries against this content.
One way I thought about handling this is to first write out the dataframe representing content to a temporary location, load it back into memory as a Delta Lake table, and then run the query against it.
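For concreteness, what I have in mind is roughly this (just a sketch, assuming the Python deltalake package and pandas; tmp_path and the column handling are placeholders):

```python
import json
import tempfile

import pandas as pd
from deltalake import DeltaTable, write_deltalake

table_path = "path/to/table"  # the existing table location

# Pull the `content` column (JSON strings) out of the existing Delta table.
content_df = DeltaTable(table_path).to_pandas(columns=["content"])

# Expand each JSON payload into proper columns.
records = [json.loads(s) for s in content_df["content"]]
expanded = pd.json_normalize(records)

# Write the expanded frame out as a temporary Delta table and re-load it,
# so there is a table that the user-defined SQL could be pointed at.
tmp_path = tempfile.mkdtemp()
write_deltalake(tmp_path, expanded)
reloaded = DeltaTable(tmp_path).to_pandas()
```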
I was wondering if there is a better way to go about this?
I guess the core question is: what is the best way to run user-defined queries against the content of a column of a Delta table?