Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coerce Array inner types #13452

Merged
merged 19 commits into from
Nov 21, 2024
Merged
Show file tree
Hide file tree
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
52 changes: 41 additions & 11 deletions datafusion/expr-common/src/type_coercion/binary.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1138,27 +1138,44 @@ fn numeric_string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<D
}
}

fn coerce_list_children(lhs_field: &FieldRef, rhs_field: &FieldRef) -> Option<FieldRef> {
Some(Arc::new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment here explaining why this function is needed -- specifically that it is setting the DataType / field name correctly?

Arc::unwrap_or_clone(Arc::clone(lhs_field)).with_data_type(comparison_coercion(
blaginin marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comparison_coercion

should we use type_union_resolution here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So both will work, but IMO the current version is a bit better as it makes code aligned with the dictionally behaviour (dictionary_comparison_coercion uses comparison_coercion)

Copy link
Contributor

@jayzhan211 jayzhan211 Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use type_union_resolution here since we need to preserve dictionary when we union two array. But comparison coercion doesn't preserve dictionary type, only inner type matters

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have misunderstood your point, but I am pretty sure comparison_coercion preserves dicts:

.or_else(|| dictionary_comparison_coercion(lhs_type, rhs_type, true))

For example, this query correctly casts Dict(Utf8) to Dict(LargeUtf8)

select arrow_typeof(x) from (select make_array(arrow_cast('a', 'Dictionary(Int8, Utf8)')) x UNION ALL SELECT make_array(arrow_cast('b', 'Dictionary(Int8, LargeUtf8)'))) x;

Also, I think type_union_resolution is a bit more limiting than comparison_coercion, and so, for example, if I switch to it, the following two queries will stop working:

-- type_union_resolution can't cast nulls
select make_array(arrow_cast('a', 'Utf8')) x UNION ALL SELECT make_array(NULL) x;

-- type_union_resolution can't handle large lists (or fixed lists)
select make_array(make_array(1)) x UNION ALL SELECT make_array(arrow_cast(make_array(-1), 'LargeList(Int8)')) x;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query T
select arrow_typeof(x) from (select make_array(arrow_cast('a', 'Dictionary(Int8, Utf8)')) x UNION ALL SELECT make_array(arrow_cast('b', 'Dictionary(Int8, LargeUtf8)'))) x;
----
List(Field { name: "item", data_type: LargeUtf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
List(Field { name: "item", data_type: LargeUtf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })

Shouldn't the result be like

query T
select arrow_typeof(x) from (select make_array(arrow_cast('a', 'Dictionary(Int8, Utf8)')) x UNION ALL SELECT make_array(arrow_cast('b', 'Dictionary(Int8, LargeUtf8)'))) x;
----
List(Field { name: "item", data_type: Dict(), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
List(Field { name: "item", data_type: Dict(), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think type_union_resolution is a bit more limiting than comparison_coercion, and so, for example, if I switch to it, the following two queries will stop working

It indicates we need to handle more coercion for type_union_resolution.

dictionary_comparison_coercion should not preserve dict if both are dict. There is a logic to optionally preserve dict if one of them is dict, one is not

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

greatest(x, y, z) should be equivalent to select max(a) from select x as a union all select y as a select z as a and also should be equivalent to select array_max(array[x, y, z])

all these should converge x, y, z to the common super type, and i believe type_union_resolution is supposedly doing just that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that's fair! I've switched to type_union_resolution and will create a ticket to handle large arrays and nulls

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb can there be a problem if we coerce two lists and they have different field names?

@blaginin the resulting field nullability should be OR of nullability of sources

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#13468 seems related to this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this PR doesn’t change names behaviour, let’s go as is and then fix it separately in that PR you highlighted?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am fine assuming the answer to this question is 'nope':

can there be a problem if we coerce two lists and they have different field names?

i hope it is 'nope, no problem'
@alamb @jayzhan211 please confirm

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is any problem in theory. I am a little fuzzy about what the semantic meaning of the Field's name in a DataType::List means -- sometimes the code is overly pedantic when comparing but I think semantically two lists are the same if their element's types are the same (and nullness and metadata). I don't think the field name should be compared

However, I remember there have been issues before on this point

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think #13481 is a complain that list item name isn't preserved in some cases. I don't know when it would matter though. From SQL perspective, it shouldn't.

lhs_field.data_type(),
rhs_field.data_type(),
)?),
))
}

/// Coercion rules for list types.
fn list_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {
use arrow::datatypes::DataType::*;
match (lhs_type, rhs_type) {
(List(_), List(_)) => Some(lhs_type.clone()),
(LargeList(_), List(_)) => Some(lhs_type.clone()),
blaginin marked this conversation as resolved.
Show resolved Hide resolved
(List(_), LargeList(_)) => Some(rhs_type.clone()),
(LargeList(_), LargeList(_)) => Some(lhs_type.clone()),
(List(_), FixedSizeList(_, _)) => Some(lhs_type.clone()),
(FixedSizeList(_, _), List(_)) => Some(rhs_type.clone()),
(
LargeList(lhs_field),
List(rhs_field) | LargeList(rhs_field) | FixedSizeList(rhs_field, _),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Let's maybe add listview to the mix. it's hard to test, but why not be future-proof here?
  • let's maybe reorder clauses so that it's clear that any list types are handled
 match (lhs_type, rhs_type) {
        // Coerce to the left side FixedSizeList type if the list lengths are the same,
        // otherwise coerce to list with the left type for dynamic length
        (FixedSizeList(lhs_field, ls), FixedSizeList(rhs_field, rs)) if ls == rs => Some(
            FixedSizeList(coerce_list_children(lhs_field, rhs_field)?, *rs),
        ),

        // Left is a LargeList[View] or right is a LargeList[View]
        (
            LargeList(lhs_field) | LargeListView(lhs_field),
            List(rhs_field)
            | ListView(rhs_field)
            | LargeList(rhs_field)
            | LargeListView(rhs_field)
            | FixedSizeList(rhs_field, _),
        )
        | (
            List(lhs_field)
            | ListView(lhs_field)
            | FixedSizeList(lhs_field, _)
            | LargeList(lhs_field)
            | LargeListView(lhs_field),
            LargeList(rhs_field) | LargeListView(rhs_field),
        ) => Some(LargeList(coerce_list_children(lhs_field, rhs_field)?)),

        // Left and right are lists
        (
            List(lhs_field) | ListView(lhs_field) | FixedSizeList(lhs_field, _),
            List(rhs_field) | ListView(rhs_field) | FixedSizeList(rhs_field, _),
        ) => Some(List(coerce_list_children(lhs_field, rhs_field)?)),

        _ => None,
    }

Copy link
Contributor

@jayzhan211 jayzhan211 Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ListView already supported in arrow? I would prefer to handle list view if there is corresponding test as well to ensure the test coverage

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I'd as well would handle this separately with a proper test. I'll put a ticket

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but will reorder the arms

)
| (FixedSizeList(lhs_field, _) | List(lhs_field), LargeList(rhs_field)) => {
blaginin marked this conversation as resolved.
Show resolved Hide resolved
Some(LargeList(coerce_list_children(lhs_field, rhs_field)?))
}

(List(lhs_field), List(rhs_field) | FixedSizeList(rhs_field, _))
| (FixedSizeList(lhs_field, _), List(rhs_field)) => {
Some(List(coerce_list_children(lhs_field, rhs_field)?))
}

// Coerce to the left side FixedSizeList type if the list lengths are the same,
// otherwise coerce to list with the left type for dynamic length
(FixedSizeList(lf, ls), FixedSizeList(_, rs)) => {
(FixedSizeList(lhs_field, ls), FixedSizeList(rhs_field, rs)) => {
if ls == rs {
Some(lhs_type.clone())
Some(FixedSizeList(
coerce_list_children(lhs_field, rhs_field)?,
*rs,
))
} else {
Some(List(Arc::clone(lf)))
Some(List(coerce_list_children(lhs_field, rhs_field)?))
}
}
(LargeList(_), FixedSizeList(_, _)) => Some(lhs_type.clone()),
(FixedSizeList(_, _), LargeList(_)) => Some(rhs_type.clone()),
_ => None,
}
}
Expand Down Expand Up @@ -2105,6 +2122,19 @@ mod tests {
DataType::List(Arc::clone(&inner_field))
);

// Negative test: inner_timestamp_field and inner_field are not compatible because their inner types are not compatible
let inner_timestamp_field = Arc::new(Field::new(
"item",
DataType::Timestamp(TimeUnit::Microsecond, None),
true,
));
let result_type = get_input_types(
&DataType::List(Arc::clone(&inner_field)),
&Operator::Eq,
&DataType::List(Arc::clone(&inner_timestamp_field)),
);
assert!(result_type.is_err());

// TODO add other data type
Ok(())
}
Expand Down
4 changes: 4 additions & 0 deletions datafusion/sqllogictest/test_files/union.slt
Original file line number Diff line number Diff line change
Expand Up @@ -761,3 +761,7 @@ SELECT NULL WHERE FALSE;
----
0.5
1

# Test Union of List Types. Issue: https://github.com/apache/datafusion/issues/12291
query error DataFusion error: type_coercion\ncaused by\nError during planning: Incompatible inputs for Union: Previous inputs were of type List(.*), but got incompatible type List(.*) on column 'x'
SELECT make_array(2) x UNION ALL SELECT make_array(now()) x;