Merge with LargeList arrow type not working #2851

Open
ericvandever opened this issue Sep 6, 2024 · 3 comments
Labels
binding/python Issues for the Python package bug Something isn't working on-hold Issues and Pull Requests that are on hold for some reason


ericvandever commented Sep 6, 2024

Environment

--------Version info---------
Polars: 1.6.0
Index type: UInt32
Platform: macOS-14.5-arm64-arm-64bit
Python: 3.12.3 (main, Jun 6 2024, 07:44:06) [Clang 15.0.0 (clang-1500.3.9.4)]
----Optional dependencies----
deltalake 0.19.2
numpy 2.1.1
pyarrow 17.0.0


Bug

What happened:
In deltalake 0.19 the large_dtypes flag was deprecated, which broke my workflow of merging into Delta tables that contain list-type columns.

What you expected to happen:
To still be able to merge into tables that include list (Arrow LargeList) columns. If that is intentionally unsupported, I would at least expect some documentation or a migration guide telling me what I have to change.

How to reproduce it:

Works in 0.18.x, fails on 0.19.x:

    import polars as pl

    # tmp_path is any empty directory (the pytest tmp_path fixture in my test)
    pl.DataFrame({"a": ["test"], "b": [[1]]}).write_delta(tmp_path)
    pl.DataFrame({"a": ["test1"], "b": [[1]]}).write_delta(
        tmp_path,
        mode="merge",
        delta_merge_options={
            "predicate": "s.a = t.a",
            "source_alias": "s",
            "target_alias": "t",
        },
    ).when_not_matched_insert_all().execute()

More details:
Using Polars implies large_dtypes=True, which is passed through to the underlying write_deltalake call. If I bypass the Polars function and call DeltaTable.create with a schema that contains only non-large types, things work as expected. Is it intended that TableMerger does not support large_dtypes?

A broader question (for both Polars and Delta): what are the implications of forcing non-large types when interacting with Polars? I know the Polars maintainers feel strongly that the large types should be used, but my workflows need nowhere near the number of elements that the standard list supports, so LargeList just uses extra space.
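To put a rough number on the "extra space" point: in the Arrow columnar format a List array stores int32 offsets and a LargeList array stores int64 offsets, with one offset per row plus one. The sketch below only counts the offsets buffer and ignores buffer padding and the child values, which are identical in both layouts. The function names are made up for illustration.

```python
def offsets_nbytes(num_rows: int, large: bool) -> int:
    """Bytes used by the offsets buffer of a list column with num_rows rows."""
    width = 8 if large else 4  # int64 offsets for LargeList, int32 for List
    return (num_rows + 1) * width  # one offset per row, plus one terminating offset


def large_list_overhead(num_rows: int) -> int:
    """Extra offset bytes that LargeList spends compared with List."""
    return offsets_nbytes(num_rows, large=True) - offsets_nbytes(num_rows, large=False)


rows = 1_000_000
print(large_list_overhead(rows))  # 4000004
```

So the overhead is about 4 bytes per row per list column. The int32 offsets cap the total number of child elements in a single List array at 2**31 - 1, which is the limit the large types exist to lift.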

test output

============================== 1 failed in 0.12s ===============================
FAILED                        [100%]
tests/polars/deltalake/test_persistence.py:726 (test_polar_deltatable)
tmp_path = PosixPath('/private/var/folders/xq/fhffzxx95nvgzgl911sjsthh0000gn/T/pytest-of-ericv/pytest-59/test_polar_deltatable0')

    @pytest.mark.write_disk
    def test_polar_deltatable(tmp_path: Path) -> None:
        pl.DataFrame({"a": ["test"], "b": [[1]]}).write_delta(tmp_path)
        pl.DataFrame({"a": ["test1"], "b": [[1]]}).write_delta(
            tmp_path,
            mode="merge",
            delta_merge_options={
                "predicate": "s.a = t.a",
                "source_alias": "s",
                "target_alias": "t",
            },
>       ).when_not_matched_insert_all().execute()

test_persistence.py:738: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <deltalake.table.TableMerger object at 0x11fca8650>

    def execute(self) -> Dict[str, Any]:
        """Executes `MERGE` with the previously provided settings in Rust with Apache Datafusion query engine.
    
        Returns:
            Dict: metrics
        """
>       metrics = self._table.merge_execute(self._builder)
E       _internal.DeltaError: Generic DeltaTable error: type_coercion
E       caused by
E       Error during planning: Failed to coerce then ([LargeList(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })]) and else (None) to common types in CASE WHEN expression

../../../.venv/lib/python3.12/site-packages/deltalake/table.py:1677: DeltaError
@ericvandever ericvandever added the bug Something isn't working label Sep 6, 2024
@rtyler rtyler added the binding/python Issues for the Python package label Sep 6, 2024
ion-elgreco (Collaborator) commented

Please share an error msg

ericvandever (Author) commented

> Please share an error msg

I've updated the message, can't believe I forgot.

@ion-elgreco ion-elgreco changed the title DeltaTable large_dtypes deprecated expected behavior? Merge with LargeList arrow type not working Sep 7, 2024
ion-elgreco (Collaborator) commented

@ericvandever this is out of our control; I've created an issue upstream in datafusion.

The only reason it worked before is that the data was downcast to List prior to the merge.
