Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StructArray take doesn't make fields nullable #6727

Closed
gatesn opened this issue Nov 13, 2024 · 10 comments
Closed

StructArray take doesn't make fields nullable #6727

gatesn opened this issue Nov 13, 2024 · 10 comments
Assignees
Labels
question Further information is requested

Comments

@gatesn
Copy link

gatesn commented Nov 13, 2024

Describe the bug
When calling arrow::compute::take on a StructArray with non-nullable fields and passing take indices that contain null values, the resulting StructArray still has non-nullable fields. This is an invalid state.

Expected behavior
The take function should convert all fields to nullable iff the take indices contain any nulls.

https://docs.rs/arrow-select/53.2.0/src/arrow_select/take.rs.html#238-239

@gatesn gatesn added the bug label Nov 13, 2024
@tustvold
Copy link
Contributor

Generally I wouldn't expect a selection kernel to alter the schema, so I think in this case it should raise an error

@gatesn
Copy link
Author

gatesn commented Nov 13, 2024

Yes that's also reasonable.


It's a bit annoying that Arrow DataTypes don't themselves have a nullable flat, since the selection kernels over non-nested arrays can also introduce nulls to previously non-null arrays.

@irenjj
Copy link

irenjj commented Nov 13, 2024

take

@tustvold
Copy link
Contributor

It's a bit annoying that Arrow DataTypes don't themselves have a nullable flat

One way to get this is to use StructArray in place of RecordBatch, this is actually what a lot of the IO logic in arrow-rs does, converting to RecordBatch at the edges.

IMO RecordBatch is confusing and arrow would be better off without it, but it's too late for that now 😅

@malikrohail
Copy link

can i fix

@irenjj irenjj removed their assignment Nov 18, 2024
@midnattsol
Copy link

midnattsol commented Nov 27, 2024

Hello, I'd be happy to work on this.

I'm already making some tests, adding the validation to the struct datatype:

        DataType::Struct(fields) => {
            let array: &StructArray = values.as_struct();
            if fields
            .iter()
            .zip(array.columns())
            .any(|(field, _)| !field.is_nullable())
            && indices.null_count() > 0 {
                return Err(ArrowError::ComputeError(
                    "Cannot use null indices on non-nullable fields".to_string(),
                ));
            }

Testing it, it seems to work when I try to use take over a StructArray:

use arrow::array::{Int32Array, StringArray, StructArray};
use arrow::compute::take;
use arrow::datatypes::{DataType, Field};
use std::sync::Arc;

fn main() {
    let field1 = Int32Array::from(vec![Some(1), Some(2), Some(3)]);
    let field2 = StringArray::from(vec![Some("a"), Some("b"), Some("c")]);
    let struct_array = StructArray::from(vec![
        (
            Arc::new(Field::new("field1", DataType::Int32, false)),
            Arc::new(field1) as Arc<dyn arrow::array::Array>,
        ),
        (
            Arc::new(Field::new("field2", DataType::Utf8, false)),
            Arc::new(field2) as Arc<dyn arrow::array::Array>,
        ),
    ]);

    let indices = Int32Array::from(vec![Some(0), None, Some(2)]);

    let result = take(&struct_array, &indices, None).unwrap();
    let result_struct = result.as_any().downcast_ref::<StructArray>().unwrap();

    println!("Result StructArray: {:?}", result_struct);

Result

     Running `target/debug/arrow-playground`
thread 'main' panicked at src/main.rs:25:54:
called `Result::unwrap()` on an `Err` value: ComputeError("Cannot introduce nulls in non-nullable field 'field1'")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

But when I try to use take over specific Fields, the issue stays the same. I can try to solve this too, but I just want to be sure that I'm not misanderstanding anything.

I would appreciate feedback, it's my first issue related with Arrow ecosystem.

Regards

@gatesn
Copy link
Author

gatesn commented Nov 28, 2024

The other possibly reasonable fix is that the null take indices should turn the StructArray's null buffer, rather than the null buffers of each of the fields.

But there doesn't seem to be good support for dealing with nullable struct arrays, e.g. DataFusion doesn't have a nullable struct scalar.

@midnattsol
Copy link

The other possibly reasonable fix is that the null take indices should turn the StructArray's null buffer, rather than the null buffers of each of the fields.

The issue I see with this approach is that managing nulls at the StructArray's null buffer, but not in the null buffers of each of the fields, can lead to uninitialized or default values in those fields. So, if someone tries to access those values using column_by_name without checking the StructArray's null bitmap, they could encounter unexpected behavior, right?

@tustvold
Copy link
Contributor

tustvold commented Nov 28, 2024

Re-reading the ticket again the premise is incorrect

the resulting StructArray still has non-nullable fields. This is an invalid state.

A non-nullable field can contain nulls provided these are masked by the null buffer of the parent StructArray, otherwise it would be impossible to have a structure where the StructArray is nullable but not its fields.

I don't think there is a bug here

@tustvold tustvold added question Further information is requested and removed good first issue Good for newcomers help wanted bug labels Nov 28, 2024
@gatesn
Copy link
Author

gatesn commented Nov 28, 2024

The other possibly reasonable fix is that the null take indices should turn the StructArray's null buffer, rather than the null buffers of each of the fields.

The issue I see with this approach is that managing nulls at the StructArray's null buffer, but not in the null buffers of each of the fields, can lead to uninitialized or default values in those fields. So, if someone tries to access those values using column_by_name without checking the StructArray's null bitmap, they could encounter unexpected behavior, right?

That's correct. But that's also true if I access a PrimitiveArray value buffer and don't check the null bitmap.

I think I've come round to agreeing with @tustvold, although it is a little odd, that a non-nullable field can contain nulls provided they are masked by the struct array's own validity bitmap. If a row is invalid in the struct array, then all bets are off when looking at that index in one of the struct's fields.

As such, I'm going to close this. Happy for someone else to re-open if they disagree with this conclusion

@gatesn gatesn closed this as completed Nov 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

5 participants