Coerce Array inner types #13452

Merged 19 commits on Nov 21, 2024

blaginin
Contributor

Which issue does this PR close?

Closes #12291

Rationale for this change

Currently, we don't attempt to coerce inner list types, so the following query works (although it shouldn't):

SELECT make_array(2) x UNION ALL SELECT make_array(now()) x;

What changes are included in this PR?

Now we’ll try to coerce inner types instead of just using the type of the first array.

Are these changes tested?

Yes, added a test

Are there any user-facing changes?

No

Member

@findepi findepi left a comment

looking good!

you can add tests in datafusion/sqllogictest/test_files/union.slt

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Nov 16, 2024
@blaginin
Contributor Author

That was a very fast review 😅🙏

@blaginin blaginin marked this pull request as ready for review November 16, 2024 18:22
datafusion/expr-common/src/type_coercion/binary.rs (outdated)
@@ -1138,27 +1138,44 @@ fn numeric_string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<D
    }
}

fn coerce_list_children(lhs_field: &FieldRef, rhs_field: &FieldRef) -> Option<FieldRef> {
    Some(Arc::new(
        Arc::unwrap_or_clone(Arc::clone(lhs_field)).with_data_type(comparison_coercion(
Member

comparison_coercion

should we use type_union_resolution here?

Contributor Author

Both will work, but IMO the current version is a bit better, as it keeps the code aligned with the dictionary behaviour (dictionary_comparison_coercion uses comparison_coercion).

Contributor

@jayzhan211 jayzhan211 Nov 18, 2024

I think we should use type_union_resolution here, since we need to preserve the dictionary type when we union two arrays. comparison_coercion doesn't preserve the dictionary type; only the inner type matters.

Contributor Author

I may have misunderstood your point, but I am pretty sure comparison_coercion preserves dicts:

.or_else(|| dictionary_comparison_coercion(lhs_type, rhs_type, true))

For example, this query correctly casts Dict(Utf8) to Dict(LargeUtf8)

select arrow_typeof(x) from (select make_array(arrow_cast('a', 'Dictionary(Int8, Utf8)')) x UNION ALL SELECT make_array(arrow_cast('b', 'Dictionary(Int8, LargeUtf8)'))) x;

Also, I think type_union_resolution is a bit more limiting than comparison_coercion, and so, for example, if I switch to it, the following two queries will stop working:

-- type_union_resolution can't cast nulls
select make_array(arrow_cast('a', 'Utf8')) x UNION ALL SELECT make_array(NULL) x;

-- type_union_resolution can't handle large lists (or fixed lists)
select make_array(make_array(1)) x UNION ALL SELECT make_array(arrow_cast(make_array(-1), 'LargeList(Int8)')) x;

Contributor

query T
select arrow_typeof(x) from (select make_array(arrow_cast('a', 'Dictionary(Int8, Utf8)')) x UNION ALL SELECT make_array(arrow_cast('b', 'Dictionary(Int8, LargeUtf8)'))) x;
----
List(Field { name: "item", data_type: LargeUtf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
List(Field { name: "item", data_type: LargeUtf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })

Shouldn't the result look like this?

query T
select arrow_typeof(x) from (select make_array(arrow_cast('a', 'Dictionary(Int8, Utf8)')) x UNION ALL SELECT make_array(arrow_cast('b', 'Dictionary(Int8, LargeUtf8)'))) x;
----
List(Field { name: "item", data_type: Dict(), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
List(Field { name: "item", data_type: Dict(), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })

Contributor

Also, I think type_union_resolution is a bit more limiting than comparison_coercion, and so, for example, if I switch to it, the following two queries will stop working

It indicates we need to handle more coercion for type_union_resolution.

dictionary_comparison_coercion should not preserve the dict if both sides are dicts. There is logic to optionally preserve the dict if one side is a dict and the other is not.

Member

greatest(x, y, z) should be equivalent to select max(a) from (select x as a union all select y as a union all select z as a), and should also be equivalent to select array_max(array[x, y, z])

All of these should converge x, y, z to the common super type, and I believe type_union_resolution is supposed to do just that.
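
As a rough illustration of "converging to the common super type", here is a minimal sketch over a toy three-type lattice. `ToyType`, `common_super_type`, and `resolve_all` are hypothetical names invented for this example; they are not DataFusion APIs and do not reproduce the real rules of type_union_resolution. The sketch only shows that folding one pairwise rule over all inputs gives the same answer regardless of whether the inputs arrive as function arguments, union branches, or array elements.

```rust
// Toy type lattice for illustration only; NOT DataFusion's coercion rules.
// Variant order defines "width": later variants are wider.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum ToyType {
    Int32,
    Int64,
    Float64,
}

/// Hypothetical pairwise rule: the wider of the two types wins.
fn common_super_type(a: ToyType, b: ToyType) -> ToyType {
    a.max(b)
}

/// Fold the pairwise rule over every input; returns `None` for empty input.
fn resolve_all(types: &[ToyType]) -> Option<ToyType> {
    types.iter().copied().reduce(common_super_type)
}

fn main() {
    use ToyType::*;
    // The same three inputs in two different orders resolve identically.
    let as_greatest = resolve_all(&[Int32, Float64, Int64]);
    let as_union = resolve_all(&[Int64, Int32, Float64]);
    assert_eq!(as_greatest, as_union);
    println!("{:?}", as_greatest);
}
```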

Contributor Author

Okay, that's fair! I've switched to type_union_resolution and will create a ticket to handle large arrays and nulls

datafusion/expr-common/src/type_coercion/binary.rs (outdated)
@@ -1138,27 +1138,44 @@ fn numeric_string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<D
    }
}

fn coerce_list_children(lhs_field: &FieldRef, rhs_field: &FieldRef) -> Option<FieldRef> {
    Some(Arc::new(
        Arc::unwrap_or_clone(Arc::clone(lhs_field)).with_data_type(comparison_coercion(
Member

@alamb can there be a problem if we coerce two lists and they have different field names?

@blaginin the resulting field nullability should be OR of nullability of sources
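
A minimal sketch of the nullability rule suggested above, using a simplified stand-in struct rather than the real arrow `Field`. `ToyField` and `merge_fields` are hypothetical names for illustration. Keeping the left-hand name is an assumption here, chosen because the hunk under review builds the result from `lhs_field`.

```rust
// Simplified stand-in for an Arrow field; not the real arrow `Field` type.
#[derive(Debug, Clone, PartialEq)]
struct ToyField {
    name: String,
    nullable: bool,
}

/// Hypothetical merge of two list item fields:
/// keep the left name (assumption) and OR the nullability flags.
fn merge_fields(lhs: &ToyField, rhs: &ToyField) -> ToyField {
    ToyField {
        name: lhs.name.clone(),
        // The result may contain nulls if either source may contain nulls.
        nullable: lhs.nullable || rhs.nullable,
    }
}

fn main() {
    let lhs = ToyField { name: "item".to_string(), nullable: false };
    let rhs = ToyField { name: "element".to_string(), nullable: true };
    let merged = merge_fields(&lhs, &rhs);
    assert!(merged.nullable);
    assert_eq!(merged.name, "item");
}
```

If only the left field's flag were copied, a non-nullable left side combined with a nullable right side would produce a field that claims to hold no nulls while the unioned data can contain them.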

Member

#13468 seems related to this

Contributor Author

Given that this PR doesn't change the field-name behaviour, let's go as is and then fix it separately in that PR you highlighted?

Member

i am fine assuming the answer to this question is 'nope':

can there be a problem if we coerce two lists and they have different field names?

i hope it is 'nope, no problem'
@alamb @jayzhan211 please confirm

Contributor

I don't think there is any problem in theory. I am a little fuzzy about what the Field's name in a DataType::List means semantically. Sometimes the code is overly pedantic when comparing, but I think two lists are semantically the same if their element types are the same (along with nullability and metadata). I don't think the field name should be compared.

However, I remember there have been issues before on this point

Member

I think #13481 is a complaint that the list item name isn't preserved in some cases. I don't know when it would matter, though. From a SQL perspective, it shouldn't.

@jayzhan211

This comment was marked as outdated.

jayzhan211
jayzhan211 previously approved these changes Nov 18, 2024
Contributor

@jayzhan211 jayzhan211 left a comment

👍

@jayzhan211 jayzhan211 dismissed their stale review November 18, 2024 03:57

type_union_resolution

Contributor

@alamb alamb left a comment

Thank you @findepi and @blaginin and @jayzhan211 -- I think this looks reasonable to me

Let's wait to see if @jayzhan211 wants to review again as well

@@ -1138,27 +1138,47 @@ fn numeric_string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<D
    }
}

fn coerce_list_children(lhs_field: &FieldRef, rhs_field: &FieldRef) -> Option<FieldRef> {
    Some(Arc::new(
Contributor

Could you add a comment here explaining why this function is needed -- specifically that it is setting the DataType / field name correctly?

@@ -198,7 +198,6 @@ pub(crate) fn make_array_inner(arrays: &[ArrayRef]) -> Result<ArrayRef> {
            let array = new_null_array(&DataType::Int64, length);
            Ok(Arc::new(array_into_list_array_nullable(array)))
        }
        LargeList(..) => array_array::<i64>(arrays, data_type),
Contributor Author

@blaginin blaginin Nov 19, 2024

I don't think this line was correct. According to the signature, the function output is always a List. However, this one sometimes returns a LargeList.

I added a test to illustrate this:

select make_array(make_array(1), arrow_cast(make_array(-1), 'LargeList(Int8)'))

Contributor

query ?
select make_array(make_array(1), arrow_cast(make_array(-1), 'LargeList(Int8)'));
----
[[1], [-1]]

query T
select arrow_typeof(make_array(make_array(1), arrow_cast(make_array(-1), 'LargeList(Int8)')));
----
List(Field { name: "item", data_type: LargeList(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })

It seems correct 🤔 ?

Contributor

I see, you fixed it :) Thanks

        (FixedSizeList(_, _), List(_)) => Some(rhs_type.clone()),
        (
            LargeList(lhs_field),
            List(rhs_field) | LargeList(rhs_field) | FixedSizeList(rhs_field, _),
Member

  • Let's maybe add listview to the mix. it's hard to test, but why not be future-proof here?
  • let's maybe reorder clauses so that it's clear that any list types are handled
 match (lhs_type, rhs_type) {
        // Coerce to the left side FixedSizeList type if the list lengths are the same,
        // otherwise coerce to list with the left type for dynamic length
        (FixedSizeList(lhs_field, ls), FixedSizeList(rhs_field, rs)) if ls == rs => Some(
            FixedSizeList(coerce_list_children(lhs_field, rhs_field)?, *rs),
        ),

        // Left is a LargeList[View] or right is a LargeList[View]
        (
            LargeList(lhs_field) | LargeListView(lhs_field),
            List(rhs_field)
            | ListView(rhs_field)
            | LargeList(rhs_field)
            | LargeListView(rhs_field)
            | FixedSizeList(rhs_field, _),
        )
        | (
            List(lhs_field)
            | ListView(lhs_field)
            | FixedSizeList(lhs_field, _)
            | LargeList(lhs_field)
            | LargeListView(lhs_field),
            LargeList(rhs_field) | LargeListView(rhs_field),
        ) => Some(LargeList(coerce_list_children(lhs_field, rhs_field)?)),

        // Left and right are lists
        (
            List(lhs_field) | ListView(lhs_field) | FixedSizeList(lhs_field, _),
            List(rhs_field) | ListView(rhs_field) | FixedSizeList(rhs_field, _),
        ) => Some(List(coerce_list_children(lhs_field, rhs_field)?)),

        _ => None,
    }
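
The arm ordering in the suggestion above can be exercised in isolation. The following is a minimal sketch that models only the list container kind, ignoring the item field and the ListView variants. `ListKind` and `coerce_list_kind` are hypothetical names for this example and are not part of arrow or DataFusion.

```rust
// Toy model of the list container kinds; item types are ignored.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ListKind {
    List,
    LargeList,
    FixedSizeList(i32),
}

/// Hypothetical helper mirroring the arm order of the suggested match:
/// 1. two fixed-size lists of equal length stay fixed-size,
/// 2. anything paired with a large list becomes a large list,
/// 3. every other pairing falls back to a plain list.
fn coerce_list_kind(lhs: ListKind, rhs: ListKind) -> ListKind {
    use ListKind::*;
    match (lhs, rhs) {
        (FixedSizeList(l), FixedSizeList(r)) if l == r => FixedSizeList(l),
        (LargeList, _) | (_, LargeList) => LargeList,
        _ => List,
    }
}

fn main() {
    use ListKind::*;
    assert_eq!(coerce_list_kind(FixedSizeList(3), FixedSizeList(3)), FixedSizeList(3));
    assert_eq!(coerce_list_kind(FixedSizeList(3), FixedSizeList(4)), List);
    assert_eq!(coerce_list_kind(List, LargeList), LargeList);
    assert_eq!(coerce_list_kind(FixedSizeList(2), List), List);
}
```

Putting the equal-length fixed-size arm first matters: if the fallback arm came earlier, two fixed-size lists of the same length would be widened to a plain list unnecessarily.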

Contributor

@jayzhan211 jayzhan211 Nov 20, 2024

Is ListView already supported in arrow? I would prefer to handle ListView only if there is a corresponding test as well, to ensure test coverage.

Contributor Author

TBH I'd also prefer to handle this separately, with a proper test. I'll file a ticket.

Contributor Author

but will reorder the arms

@alamb
Contributor

alamb commented Nov 20, 2024

@jayzhan211 are you good to merge this PR?

@jayzhan211 jayzhan211 merged commit 240402d into apache:main Nov 21, 2024
25 checks passed
@jayzhan211
Contributor

Thanks @blaginin @alamb @findepi

Labels
sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Arrays of non-coercible types are coercible
4 participants