Empty strings in CSV files aren't being interpreted as null when using a Dictionary(_, Utf8) #12041

Open
rumpuslabs opened this issue Aug 17, 2024 · 5 comments
Labels: bug (Something isn't working)

Comments

@rumpuslabs

Describe the bug

Related to #7797

Empty strings in CSV files aren't being interpreted as null when using a Dictionary(_, Utf8)

To Reproduce

Create a simple input.csv file like this:

id,name
1,
2,bob

Run the following code:

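use std::sync::Arc;

// `arrow` is used through DataFusion's re-export so the versions always match.
use datafusion::arrow;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::datasource::file_format::csv::CsvFormat;
use datafusion::datasource::listing::ListingOptions;
use datafusion::error::DataFusionError;
use datafusion::prelude::*;

#[tokio::main]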
async fn main() -> Result<(), DataFusionError> {
    let ctx = SessionContext::new();

    let format = CsvFormat::default();
    let listing_options = ListingOptions::new(Arc::new(format));
    ctx.register_listing_table(
        "input",
        "input.csv",
        listing_options.clone(),
        Some(Arc::new(Schema::new(vec![
            Field::new("id", DataType::Utf8, false),
            Field::new(
                "name",
                DataType::Dictionary(Box::new(DataType::UInt8), Box::new(DataType::Utf8)),
                true,
            ),
        ]))),
        None,
    )
    .await?;

    let results = ctx
        .table("input")
        .await?
        .filter(col("name").is_not_null())?
        .collect()
        .await?;

    let pretty_results = arrow::util::pretty::pretty_format_batches(&results)?.to_string();

    println!("{}", pretty_results);

    Ok(())
}

Expected behavior

I was expecting the output to look like this:

+----+------+
| id | name |
+----+------+
| 2  | bob  |
+----+------+

But the full dataset is returned instead:

+----+------+
| id | name |
+----+------+
| 1  |      |
| 2  | bob  |
+----+------+

Additional context

Tested on v41.0.0

Replacing DataType::Dictionary(Box::new(DataType::UInt8), Box::new(DataType::Utf8)) with DataType::Utf8 for the name column produces the expected output.
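
To make the expected semantics concrete, here is a minimal, standard-library-only sketch. It is not DataFusion or arrow-csv code; the helper names (`parse_nullable_field`, `read_name_column`, `non_null_names`) are hypothetical and only model the behavior this report expects for a nullable column, regardless of whether it is declared as Utf8 or Dictionary(_, Utf8).

```rust
/// Parse one CSV field the way this report expects for a nullable column:
/// an empty field becomes a null (`None`), anything else is kept as a string.
fn parse_nullable_field(raw: &str) -> Option<String> {
    if raw.is_empty() {
        None
    } else {
        Some(raw.to_string())
    }
}

/// Read the `name` column (the second field) from CSV text with a header row.
/// This is a toy parser: it does not handle quoting or escaped commas.
fn read_name_column(csv: &str) -> Vec<Option<String>> {
    csv.lines()
        .skip(1) // skip the `id,name` header
        .map(|line| {
            let raw = line.split(',').nth(1).unwrap_or("");
            parse_nullable_field(raw)
        })
        .collect()
}

/// Toy equivalent of `filter(col("name").is_not_null())` on this column.
fn non_null_names(column: &[Option<String>]) -> Vec<&str> {
    column.iter().filter_map(|value| value.as_deref()).collect()
}
```

Applied to the input.csv above, `read_name_column` yields a null for row 1 and "bob" for row 2, so the not-null filter keeps only "bob". The bug is that the Dictionary path appears to keep the empty string as a value instead of a null.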

@rumpuslabs added the bug label on Aug 17, 2024
@edmondop
Contributor

take

@edmondop
Contributor

@alamb shouldn't the csv reader also throw an error because "bob" is not a valid dictionary?

@alamb
Contributor

alamb commented Oct 1, 2024

I agree the discrepancy between Utf8 and Dictionary looks like a bug.

> @alamb shouldn't the csv reader also throw an error because "bob" is not a valid dictionary?

I think "bob" is a valid value for a DictionaryArray (whose values are Strings)

@edmondop
Contributor

@alamb before I file an issue to arrow-csv, why is "bob" a valid value for DictionaryArray? don't you need a key and a value for a dictionary?

@alamb
Contributor

alamb commented Nov 13, 2024

> @alamb before I file an issue to arrow-csv, why is "bob" a valid value for DictionaryArray? don't you need a key and a value for a dictionary?

I think a more precise version of this would be: "bob" is a valid value for DictionaryArray(Int32, Utf8). That is, if the dictionary's value type is strings (Utf8), then the dictionary should be able to hold string values.
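
For readers with the same question, a rough standard-library-only model of dictionary encoding may help. It is a sketch, not arrow's DictionaryArray implementation: `ToyDictionary` and its methods are hypothetical names, and it assumes nulls are represented as null keys. The point is that the reader builds the integer keys itself, so the CSV only needs to supply string values such as "bob".

```rust
use std::collections::HashMap;

/// Toy model of a dictionary-encoded string column.
/// `values` stores each distinct string once; `keys` has one entry per row,
/// either an index into `values` or `None` for a null row.
struct ToyDictionary {
    keys: Vec<Option<u8>>,
    values: Vec<String>,
}

impl ToyDictionary {
    /// Build the dictionary from plain optional strings, assigning keys
    /// in order of first appearance. Assumes fewer than 256 distinct values.
    fn from_rows(rows: &[Option<&str>]) -> Self {
        let mut keys = Vec::with_capacity(rows.len());
        let mut values: Vec<String> = Vec::new();
        let mut seen: HashMap<String, u8> = HashMap::new();
        for &row in rows {
            let key = match row {
                None => None, // a null row gets a null key
                Some(s) => Some(match seen.get(s) {
                    Some(&existing) => existing,
                    None => {
                        let new_key = values.len() as u8;
                        values.push(s.to_string());
                        seen.insert(s.to_string(), new_key);
                        new_key
                    }
                }),
            };
            keys.push(key);
        }
        ToyDictionary { keys, values }
    }

    /// Look up the logical value of a row by following its key.
    fn get(&self, row: usize) -> Option<&str> {
        self.keys[row].map(|k| self.values[k as usize].as_str())
    }
}
```

With the rows `[None, Some("bob"), Some("bob")]`, this model stores "bob" once in `values` and uses key 0 for both non-null rows, while the first row keeps a null key. Under that model, the empty field in this issue should end up as a null key rather than as an empty-string entry in `values`.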
