-
Notifications
You must be signed in to change notification settings - Fork 806
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
StructArray take doesn't make fields nullable #6727
Comments
Generally I wouldn't expect a selection kernel to alter the schema, so I think in this case it should raise an error |
Yes that's also reasonable. It's a bit annoying that Arrow DataTypes don't themselves have a nullable flat, since the selection kernels over non-nested arrays can also introduce nulls to previously non-null arrays. |
take |
One way to get this is to use StructArray in place of RecordBatch, this is actually what a lot of the IO logic in arrow-rs does, converting to RecordBatch at the edges. IMO RecordBatch is confusing and arrow would be better off without it, but it's too late for that now 😅 |
can i fix |
Hello, I'd be happy to work on this. I'm already making some tests, adding the validation to the struct datatype: DataType::Struct(fields) => {
let array: &StructArray = values.as_struct();
if fields
.iter()
.zip(array.columns())
.any(|(field, _)| !field.is_nullable())
&& indices.null_count() > 0 {
return Err(ArrowError::ComputeError(
"Cannot use null indices on non-nullable fields".to_string(),
));
} Testing it, it seems to work when I try to use take over a StructArray: use arrow::array::{Int32Array, StringArray, StructArray};
use arrow::compute::take;
use arrow::datatypes::{DataType, Field};
use std::sync::Arc;
fn main() {
let field1 = Int32Array::from(vec![Some(1), Some(2), Some(3)]);
let field2 = StringArray::from(vec![Some("a"), Some("b"), Some("c")]);
let struct_array = StructArray::from(vec![
(
Arc::new(Field::new("field1", DataType::Int32, false)),
Arc::new(field1) as Arc<dyn arrow::array::Array>,
),
(
Arc::new(Field::new("field2", DataType::Utf8, false)),
Arc::new(field2) as Arc<dyn arrow::array::Array>,
),
]);
let indices = Int32Array::from(vec![Some(0), None, Some(2)]);
let result = take(&struct_array, &indices, None).unwrap();
let result_struct = result.as_any().downcast_ref::<StructArray>().unwrap();
println!("Result StructArray: {:?}", result_struct); Result Running `target/debug/arrow-playground`
thread 'main' panicked at src/main.rs:25:54:
called `Result::unwrap()` on an `Err` value: ComputeError("Cannot introduce nulls in non-nullable field 'field1'")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace But when I try to use take over specific Fields, the issue stays the same. I can try to solve this too, but I just want to be sure that I'm not misanderstanding anything. I would appreciate feedback, it's my first issue related with Regards |
The other possibly reasonable fix is that the null take indices should turn the StructArray's null buffer, rather than the null buffers of each of the fields. But there doesn't seem to be good support for dealing with nullable struct arrays, e.g. DataFusion doesn't have a nullable struct scalar. |
The issue I see with this approach is that managing nulls at the StructArray's null buffer, but not in the null buffers of each of the fields, can lead to uninitialized or default values in those fields. So, if someone tries to access those values using |
Re-reading the ticket again the premise is incorrect
A non-nullable field can contain nulls provided these are masked by the null buffer of the parent StructArray, otherwise it would be impossible to have a structure where the StructArray is nullable but not its fields. I don't think there is a bug here |
That's correct. But that's also true if I access a PrimitiveArray value buffer and don't check the null bitmap. I think I've come round to agreeing with @tustvold, although it is a little odd, that a non-nullable field can contain nulls provided they are masked by the struct array's own validity bitmap. If a row is invalid in the struct array, then all bets are off when looking at that index in one of the struct's fields. As such, I'm going to close this. Happy for someone else to re-open if they disagree with this conclusion |
Describe the bug
When calling
arrow::compute::take
on a StructArray with non-nullable fields and passing take indices that contain null values, the resulting StructArray still has non-nullable fields. This is an invalid state.Expected behavior
The take function should convert all fields to nullable iff the take indices contain any nulls.
https://docs.rs/arrow-select/53.2.0/src/arrow_select/take.rs.html#238-239
The text was updated successfully, but these errors were encountered: