Coerce Array inner types #13452
looking good!
you can add tests in datafusion/sqllogictest/test_files/union.slt
That was a very fast review 😅🙏
@@ -1138,27 +1138,44 @@ fn numeric_string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<D
    }
}

fn coerce_list_children(lhs_field: &FieldRef, rhs_field: &FieldRef) -> Option<FieldRef> {
    Some(Arc::new(
        Arc::unwrap_or_clone(Arc::clone(lhs_field)).with_data_type(comparison_coercion(
Should we use type_union_resolution here instead of comparison_coercion?
Both will work, but IMO the current version is a bit better because it keeps the code aligned with the dictionary behaviour (dictionary_comparison_coercion uses comparison_coercion).
I think we should use type_union_resolution here, since we need to preserve the dictionary when we union two arrays. Comparison coercion doesn't preserve the dictionary type; only the inner type matters there.
I may have misunderstood your point, but I am pretty sure comparison_coercion preserves dicts:
.or_else(|| dictionary_comparison_coercion(lhs_type, rhs_type, true))
For example, this query correctly casts Dict(Utf8) to Dict(LargeUtf8):
select arrow_typeof(x) from (select make_array(arrow_cast('a', 'Dictionary(Int8, Utf8)')) x UNION ALL SELECT make_array(arrow_cast('b', 'Dictionary(Int8, LargeUtf8)'))) x;
Also, I think type_union_resolution is a bit more limiting than comparison_coercion. For example, if I switch to it, the following two queries will stop working:
-- type_union_resolution can't cast nulls
select make_array(arrow_cast('a', 'Utf8')) x UNION ALL SELECT make_array(NULL) x;
-- type_union_resolution can't handle large lists (or fixed lists)
select make_array(make_array(1)) x UNION ALL SELECT make_array(arrow_cast(make_array(-1), 'LargeList(Int8)')) x;
query T
select arrow_typeof(x) from (select make_array(arrow_cast('a', 'Dictionary(Int8, Utf8)')) x UNION ALL SELECT make_array(arrow_cast('b', 'Dictionary(Int8, LargeUtf8)'))) x;
----
List(Field { name: "item", data_type: LargeUtf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
List(Field { name: "item", data_type: LargeUtf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
Shouldn't the result be something like this instead?
query T
select arrow_typeof(x) from (select make_array(arrow_cast('a', 'Dictionary(Int8, Utf8)')) x UNION ALL SELECT make_array(arrow_cast('b', 'Dictionary(Int8, LargeUtf8)'))) x;
----
List(Field { name: "item", data_type: Dict(), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
List(Field { name: "item", data_type: Dict(), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
"Also, I think type_union_resolution is a bit more limiting than comparison_coercion, and so, for example, if I switch to it, the following two queries will stop working"
That indicates we need to handle more coercion cases in type_union_resolution.
dictionary_comparison_coercion should not preserve the dict if both sides are dicts. There is logic to optionally preserve the dict only when one side is a dict and the other is not.
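The dictionary rule described in this comment can be sketched with a toy type model. This is only an illustration under stated assumptions: `Ty`, `coerce_values`, and `coerce_dict` are hypothetical stand-ins, not arrow's `DataType` or DataFusion's actual `dictionary_comparison_coercion`, and the index type of the dictionary is omitted.

```rust
/// Toy stand-in for arrow's `DataType` (assumption: only string types and dictionaries).
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Utf8,
    LargeUtf8,
    /// Dictionary-encoded values of the boxed type (index type omitted).
    Dict(Box<Ty>),
}

/// Hypothetical value-type coercion: Utf8 widens to LargeUtf8.
fn coerce_values(lhs: &Ty, rhs: &Ty) -> Option<Ty> {
    match (lhs, rhs) {
        (Ty::Utf8, Ty::Utf8) => Some(Ty::Utf8),
        (Ty::Utf8 | Ty::LargeUtf8, Ty::Utf8 | Ty::LargeUtf8) => Some(Ty::LargeUtf8),
        _ => None,
    }
}

/// Sketch of the rule from the comment above.
fn coerce_dict(lhs: &Ty, rhs: &Ty, preserve_dictionaries: bool) -> Option<Ty> {
    match (lhs, rhs) {
        // Both sides are dictionaries: only the value types matter,
        // and the result is not dictionary-encoded.
        (Ty::Dict(l), Ty::Dict(r)) => coerce_values(l, r),
        // Exactly one side is a dictionary whose value type equals the other
        // side: keep the dictionary if the caller asked for that.
        (d @ Ty::Dict(v), other) | (other, d @ Ty::Dict(v))
            if preserve_dictionaries && v.as_ref() == other =>
        {
            Some(d.clone())
        }
        // Otherwise unwrap the dictionary and coerce the value types.
        (Ty::Dict(v), other) | (other, Ty::Dict(v)) => coerce_values(v, other),
        _ => None,
    }
}
```

Under this sketch, two dictionaries with Utf8 and LargeUtf8 values resolve to plain LargeUtf8, which matches the List of LargeUtf8 output shown in the test above.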
greatest(x, y, z) should be equivalent to select max(a) from (select x as a union all select y as a union all select z as a)
and should also be equivalent to select array_max(array[x, y, z]).
All of these should converge x, y, z to the common super type, and I believe type_union_resolution is supposed to do just that.
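The idea of converging several inputs to one common super type can be sketched as a fold over a pairwise widening rule. This is a minimal illustration, assuming a toy `Ty` enum and a hypothetical `widen` rule; it is not DataFusion's `type_union_resolution` and does not mirror its actual coercion table.

```rust
/// Toy type set (assumption: not arrow's `DataType`).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Ty {
    Int32,
    Int64,
    Utf8,
}

/// Hypothetical pairwise rule: the smallest type both inputs can be cast to.
fn widen(a: Ty, b: Ty) -> Option<Ty> {
    match (a, b) {
        (x, y) if x == y => Some(x),
        (Ty::Int32 | Ty::Int64, Ty::Int32 | Ty::Int64) => Some(Ty::Int64),
        _ => None,
    }
}

/// Fold the pairwise rule over all inputs. Returns `None` for an empty
/// slice or as soon as two types have no common super type.
fn common_super_type(types: &[Ty]) -> Option<Ty> {
    let (first, rest) = types.split_first()?;
    rest.iter().try_fold(*first, |acc, t| widen(acc, *t))
}
```

With a single resolution function like this, the three formulations above (greatest, UNION ALL with max, and array_max over an array literal) would all see the same resolved type for x, y, z.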
Okay, that's fair! I've switched to type_union_resolution and will create a ticket to handle large arrays and nulls.
@@ -1138,27 +1138,44 @@ fn numeric_string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<D
    }
}

fn coerce_list_children(lhs_field: &FieldRef, rhs_field: &FieldRef) -> Option<FieldRef> {
    Some(Arc::new(
        Arc::unwrap_or_clone(Arc::clone(lhs_field)).with_data_type(comparison_coercion(
#13468 seems related to this
Given that this PR doesn’t change how field names behave, let’s go with it as is and then fix that separately in the PR you highlighted?
i am fine assuming the answer to this question is 'nope':
can there be a problem if we coerce two lists and they have different field names?
i hope it is 'nope, no problem'
@alamb @jayzhan211 please confirm
I don't think there is any problem in theory. I am a little fuzzy on the semantic meaning of the Field's name in a DataType::List. Sometimes the code is overly pedantic when comparing, but I think semantically two lists are the same if their elements' types are the same (along with nullability and metadata). I don't think the field name should be compared.
However, I remember there have been issues on this point before.
I think #13481 is a complaint that the list item name isn't preserved in some cases. I don't know when it would matter, though. From the SQL perspective, it shouldn't.
👍
Thank you @findepi and @blaginin and @jayzhan211 -- I think this looks reasonable to me
Let's wait to see if @jayzhan211 wants to review again as well
@@ -1138,27 +1138,47 @@ fn numeric_string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<D
    }
}

fn coerce_list_children(lhs_field: &FieldRef, rhs_field: &FieldRef) -> Option<FieldRef> {
    Some(Arc::new(
Could you add a comment here explaining why this function is needed -- specifically that it is setting the DataType / field name correctly?
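A rough sketch of what such a helper does, based only on the diff fragment above: it keeps the left field (including its name) and swaps in the coerced element type. The `DataType`, `Field`, and `coerce_types` definitions below are simplified, hypothetical stand-ins for illustration, not arrow's types or DataFusion's coercion functions.

```rust
use std::sync::Arc;

/// Simplified stand-in for arrow's `DataType` (assumption: three variants only).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Int64,
    Utf8,
}

/// Simplified stand-in for arrow's `Field`.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

type FieldRef = Arc<Field>;

/// Hypothetical element-type coercion: integers widen to Int64.
fn coerce_types(lhs: &DataType, rhs: &DataType) -> Option<DataType> {
    use DataType::*;
    match (lhs, rhs) {
        (Int32, Int32) => Some(Int32),
        (Int32 | Int64, Int32 | Int64) => Some(Int64),
        (Utf8, Utf8) => Some(Utf8),
        _ => None,
    }
}

/// Coerce the element fields of two list types.
/// The result keeps the left field's name and other attributes; only the
/// data type is replaced by the coerced element type.
fn coerce_list_children(lhs: &FieldRef, rhs: &FieldRef) -> Option<FieldRef> {
    let data_type = coerce_types(&lhs.data_type, &rhs.data_type)?;
    Some(Arc::new(Field {
        data_type,
        ..(**lhs).clone()
    }))
}
```

In this sketch, coercing an `item: Int32` field with an `element: Int64` field yields `item: Int64`, and incompatible element types yield `None`, so the enclosing list coercion fails as a whole.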
…coerce-arrays-inner-types
@@ -198,7 +198,6 @@ pub(crate) fn make_array_inner(arrays: &[ArrayRef]) -> Result<ArrayRef> {
            let array = new_null_array(&DataType::Int64, length);
            Ok(Arc::new(array_into_list_array_nullable(array)))
        }
        LargeList(..) => array_array::<i64>(arrays, data_type),
I don't think this line was correct. According to the signature, the function output is always a List. However, this one sometimes returns a LargeList.
I added a test to illustrate this:
select make_array(make_array(1), arrow_cast(make_array(-1), 'LargeList(Int8)'))
query ?
select make_array(make_array(1), arrow_cast(make_array(-1), 'LargeList(Int8)'));
----
[[1], [-1]]
query T
select arrow_typeof(make_array(make_array(1), arrow_cast(make_array(-1), 'LargeList(Int8)')));
----
List(Field { name: "item", data_type: LargeList(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
It seems correct 🤔 ?
I see, you fixed it :) Thanks
(FixedSizeList(_, _), List(_)) => Some(rhs_type.clone()),
(
    LargeList(lhs_field),
    List(rhs_field) | LargeList(rhs_field) | FixedSizeList(rhs_field, _),
- Let's maybe add listview to the mix. it's hard to test, but why not be future-proof here?
- let's maybe reorder clauses so that it's clear that any list types are handled
match (lhs_type, rhs_type) {
// Coerce to the left side FixedSizeList type if the list lengths are the same,
// otherwise coerce to list with the left type for dynamic length
(FixedSizeList(lhs_field, ls), FixedSizeList(rhs_field, rs)) if ls == rs => Some(
FixedSizeList(coerce_list_children(lhs_field, rhs_field)?, *rs),
),
// Left is a LargeList[View] or right is a LargeList[View]
(
LargeList(lhs_field) | LargeListView(lhs_field),
List(rhs_field)
| ListView(rhs_field)
| LargeList(rhs_field)
| LargeListView(rhs_field)
| FixedSizeList(rhs_field, _),
)
| (
List(lhs_field)
| ListView(lhs_field)
| FixedSizeList(lhs_field, _)
| LargeList(lhs_field)
| LargeListView(lhs_field),
LargeList(rhs_field) | LargeListView(rhs_field),
) => Some(LargeList(coerce_list_children(lhs_field, rhs_field)?)),
// Left and right are lists
(
List(lhs_field) | ListView(lhs_field) | FixedSizeList(lhs_field, _),
List(rhs_field) | ListView(rhs_field) | FixedSizeList(rhs_field, _),
) => Some(List(coerce_list_children(lhs_field, rhs_field)?)),
_ => None,
}
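To see how the reordered arms cover every pairing, here is a self-contained toy version that leaves out the ListView variants. `Elem`, `ListType`, `coerce_elem`, and `coerce_list` are hypothetical names for illustration; the real code matches on arrow's `DataType` and coerces fields, not bare element types.

```rust
/// Toy element type (assumption: stands in for the list's child field).
#[derive(Debug, Clone, PartialEq)]
enum Elem {
    Int32,
    Int64,
    Utf8,
}

/// Toy list types, without the ListView variants.
#[derive(Debug, Clone, PartialEq)]
enum ListType {
    List(Elem),
    LargeList(Elem),
    FixedSizeList(Elem, i32),
}

/// Hypothetical element coercion: equal types stay, integers widen to Int64.
fn coerce_elem(lhs: &Elem, rhs: &Elem) -> Option<Elem> {
    match (lhs, rhs) {
        (l, r) if l == r => Some(l.clone()),
        (Elem::Int32 | Elem::Int64, Elem::Int32 | Elem::Int64) => Some(Elem::Int64),
        _ => None,
    }
}

fn coerce_list(lhs: &ListType, rhs: &ListType) -> Option<ListType> {
    use ListType::*;
    match (lhs, rhs) {
        // Same fixed length on both sides: stay fixed-size.
        (FixedSizeList(l, ls), FixedSizeList(r, rs)) if ls == rs => {
            Some(FixedSizeList(coerce_elem(l, r)?, *rs))
        }
        // A LargeList on either side: the result is a LargeList.
        (LargeList(l), List(r) | LargeList(r) | FixedSizeList(r, _))
        | (List(l) | FixedSizeList(l, _), LargeList(r)) => {
            Some(LargeList(coerce_elem(l, r)?))
        }
        // Everything else is a pairing of List and FixedSizeList: use List.
        (List(l) | FixedSizeList(l, _), List(r) | FixedSizeList(r, _)) => {
            Some(List(coerce_elem(l, r)?))
        }
    }
}
```

Because the three arms are exhaustive over the toy enum, the compiler accepts the match without a catch-all arm, which is one way to make it visible that every list pairing is handled.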
Is ListView already supported in arrow? I would prefer to handle list views only if there is a corresponding test as well, to ensure test coverage.
TBH I'd also rather handle this separately, with a proper test. I'll file a ticket.
but will reorder the arms
Co-authored-by: Jay Zhan <[email protected]>
…coerce-arrays-inner-types
@jayzhan211 are you good to merge this PR?
Which issue does this PR close?
Closes #12291
Rationale for this change
Currently, we don't attempt to coerce inner list types, and so this works (although it shouldn't)
What changes are included in this PR?
Now we’ll try to coerce inner types instead of just using the type of the first array.
Are these changes tested?
Yes, added a test
Are there any user-facing changes?
No