Support casting BinaryView --> Utf8 and LargeUtf8 #6162

Closed
Tracked by #6163 ...
alamb opened this issue Jul 31, 2024 · 3 comments · Fixed by #6180
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Contributor

alamb commented Jul 31, 2024

Part of #6163

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
While working to enable StringView use more widely in DataFusion in apache/datafusion#11723 I found this cast function was not supported:

Specifically, create a BinaryViewArray and then call cast to cast it to Utf8:

cast(binary_view_array, &DataType::Utf8)
External error: query failed: DataFusion error: Error during planning: Cannot cast file schema field string_col of type BinaryView to table schema field of type Utf8

I think this comes about when a column is marked as "binary" in a Parquet file and DataFusion tries to read it in as a Utf8 column: the reader fails with the error above.

Describe the solution you'd like
Add support to the cast kernel for BinaryView -> Utf8 and LargeUtf8

@RinChanNOWWW added most of the support in #5704, and I think we can simply use the cast_view_to_byte function to build the correct StringArray
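
To make the target layout concrete, here is a minimal std-only sketch of what a view-to-offsets conversion has to produce: one contiguous values buffer plus an offsets list, with i32 offsets for Utf8 and i64 offsets for LargeUtf8. This is not the arrow-rs implementation and does not show cast_view_to_byte itself. `OffsetLayout` and `views_to_offsets` are hypothetical names made up for illustration, and UTF-8 validation is deliberately left out here.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for an offsets-based (Utf8 / LargeUtf8 style) layout.
/// Not an arrow-rs type.
#[derive(Debug, PartialEq)]
pub struct OffsetLayout<O> {
    /// `len + 1` monotonically increasing positions into `values`.
    pub offsets: Vec<O>,
    /// All non-null values concatenated back to back.
    pub values: Vec<u8>,
    /// `true` where the input slot was non-null.
    pub validity: Vec<bool>,
}

/// Copy each (optional) byte slice into one contiguous buffer and record
/// the end offset of every slot. `O` plays the role of the offset type:
/// i32 for a Utf8-like layout, i64 for a LargeUtf8-like one.
pub fn views_to_offsets<O>(views: &[Option<&[u8]>]) -> Result<OffsetLayout<O>, String>
where
    O: TryFrom<usize> + Copy,
{
    let zero = O::try_from(0usize).map_err(|_| "offset type cannot hold 0".to_string())?;
    let mut offsets = Vec::with_capacity(views.len() + 1);
    offsets.push(zero);
    let mut values = Vec::new();
    let mut validity = Vec::with_capacity(views.len());

    for view in views {
        if let Some(bytes) = view {
            values.extend_from_slice(bytes);
        }
        validity.push(view.is_some());
        // A null slot repeats the previous offset, so its length is zero.
        let end = O::try_from(values.len())
            .map_err(|_| format!("offset {} overflows the offset type", values.len()))?;
        offsets.push(end);
    }

    Ok(OffsetLayout { offsets, values, validity })
}
```

The overflow check is the reason the two target types differ in practice: with a narrow offset type the conversion can fail even though every individual value is small.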

Describe alternatives you've considered

Additional context
FYI @XiangpengHao

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Jul 31, 2024
@xinlifoobar
Contributor

take

@alamb
Contributor Author

alamb commented Aug 1, 2024

BTW I had a hacky version in datafusion apache/datafusion#11723: https://github.com/apache/datafusion/pull/11723/files#diff-07b427ee25e195566e30cca0e77e5eb4c63c54ea74f6ea15914fdd7a5a889186R169

In case that helps

// Workaround arrow-rs bug in can_cast_types
// External error: query failed: DataFusion error: Arrow error: Cast error: Casting from BinaryView to Utf8 not supported
fn can_cast_types(from_type: &DataType, to_type: &DataType) -> bool {
    arrow::compute::can_cast_types(from_type, to_type)
        || matches!(
            (from_type, to_type),
            (DataType::BinaryView, DataType::Utf8 | DataType::LargeUtf8)
                | (DataType::Utf8 | DataType::LargeUtf8, DataType::BinaryView)
        )
}

// Work around arrow-rs casting bug
// External error: query failed: DataFusion error: Arrow error: Cast error: Casting from BinaryView to Utf8 not supported
fn cast(array: &dyn Array, to_type: &DataType) -> Result<ArrayRef, ArrowError> {
    match (array.data_type(), to_type) {
        (DataType::BinaryView, DataType::Utf8) => {
            let array = array.as_binary_view();
            let mut builder = StringBuilder::with_capacity(array.len(), 8 * 1024);
            for value in array.iter() {
                // check if the value is valid utf8 (should do this once, not each value)
                let value = value.map(|value| std::str::from_utf8(value)).transpose()?;

                builder.append_option(value);
            }

            Ok(Arc::new(builder.finish()))
        }
        // fallback to arrow kernel
        (_, _) => arrow::compute::cast(array, to_type),
    }
}
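
On the "should do this once, not each value" comment above: one way to do that, sketched here with only the standard library, is to validate the concatenated values buffer in a single pass and then check that every offset lands on a character boundary. The second check matters because two slices can each be invalid UTF-8 while their concatenation is valid (a multi-byte character split across values). `validate_utf8_once` is a hypothetical helper for illustration and plain `usize` offsets are an assumption; it is not an arrow-rs function.

```rust
/// Validate a concatenated values buffer once, then check each offset.
/// Hypothetical helper, not part of arrow-rs.
pub fn validate_utf8_once(values: &[u8], offsets: &[usize]) -> Result<(), String> {
    // Single pass over all bytes instead of one `from_utf8` call per value.
    let text = std::str::from_utf8(values)
        .map_err(|e| format!("values buffer is not valid UTF-8: {}", e))?;

    // Every value boundary must also be a character boundary; otherwise an
    // individual value would start or end in the middle of a code point.
    // `is_char_boundary` also returns false for offsets past the end.
    for &offset in offsets {
        if !text.is_char_boundary(offset) {
            return Err(format!("offset {} splits a UTF-8 code point", offset));
        }
    }
    Ok(())
}
```

For example, the bytes of "héllo" with offsets 0, 3, 6 pass, while offsets 0, 2, 6 are rejected because offset 2 falls inside the two-byte "é".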

@alamb
Contributor Author

alamb commented Aug 31, 2024

label_issue.py automatically added labels {'arrow'} from #6180
