
Support for handling duplicate timestamps in rolling operations similar to Pandas #85

Open
fachra opened this issue Aug 10, 2024 · 8 comments
Labels: enhancement (New feature or request)

Comments

@fachra

fachra commented Aug 10, 2024

Description

Feature Request:

I would like to request a feature in Polars that allows handling duplicate timestamps in rolling operations in a way that mimics Pandas' behavior.
Related issue: pola-rs/polars#6691
Related question: stackoverflow

Context:

In Pandas, when performing a rolling operation with a time-based window on a DataFrame that includes duplicate timestamps, each row is treated individually, and the rolling sum (or other rolling operations) accounts for each row separately, even if they share the same timestamp.

For example, consider the following code in Pandas:

import pandas as pd

data = {
    "timestamp": [
        "2023-08-04 10:00:00",
        "2023-08-04 10:05:00",
        "2023-08-04 10:10:00",
        "2023-08-04 10:10:00",
        "2023-08-04 10:20:00",
        "2023-08-04 10:20:00",
    ],
    "value": [1, 2, 3, 4, 5, 6],
}

df = pd.DataFrame(data)
df["timestamp"] = pd.to_datetime(df["timestamp"])
df.set_index("timestamp", inplace=True)
df["rolling_sum"] = df["value"].rolling("10min").sum()
print(df)

This results in:

                     value  rolling_sum
timestamp                              
2023-08-04 10:00:00      1          1.0
2023-08-04 10:05:00      2          3.0
2023-08-04 10:10:00      3          5.0
2023-08-04 10:10:00      4          9.0
2023-08-04 10:20:00      5          5.0
2023-08-04 10:20:00      6         11.0

In Polars, using a similar operation does not yield the same result, as duplicate timestamps are aggregated differently:

import polars as pl

data = {
    "timestamp": [
        "2023-08-04 10:00:00",
        "2023-08-04 10:05:00",
        "2023-08-04 10:10:00",
        "2023-08-04 10:10:00",
        "2023-08-04 10:20:00",
        "2023-08-04 10:20:00",
    ],
    "value": [1, 2, 3, 4, 5, 6],
}

df = (
    pl.DataFrame(data)
    .with_columns(pl.col("timestamp").str.strptime(pl.Datetime))
    .sort("timestamp")
    .with_columns(
        pl.col("value")
        .rolling_sum_by(by="timestamp", window_size="10m")
        .alias("rolling_sum")
    )
)
print(df)

This results in:

┌─────────────────────┬───────┬─────────────┐
│ timestamp           ┆ value ┆ rolling_sum │
│ ---                 ┆ ---   ┆ ---         │
│ datetime[μs]        ┆ i64   ┆ i64         │
╞═════════════════════╪═══════╪═════════════╡
│ 2023-08-04 10:00:00 ┆ 1     ┆ 1           │
│ 2023-08-04 10:05:00 ┆ 2     ┆ 3           │
│ 2023-08-04 10:10:00 ┆ 3     ┆ 9           │
│ 2023-08-04 10:10:00 ┆ 4     ┆ 9           │
│ 2023-08-04 10:20:00 ┆ 5     ┆ 11          │
│ 2023-08-04 10:20:00 ┆ 6     ┆ 11          │
└─────────────────────┴───────┴─────────────┘

Proposed Solution:

It would be beneficial if Polars could offer an option or parameter within the rolling operations that allows for handling duplicate timestamps in a way that each row is treated individually, similar to Pandas.
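
For illustration only, a call with such an option might look like the sketch below. The `duplicates` parameter and its `"each_row"` value are hypothetical and do not exist in Polars today; they merely name the behaviour being requested.

import polars as pl

# Hypothetical API sketch: the `duplicates` parameter is not part of Polars.
# "each_row" would make every row see only rows at or before its own position,
# matching Pandas' treatment of duplicate timestamps.
df = df.with_columns(
    pl.col("value")
    .rolling_sum_by("timestamp", window_size="10m", duplicates="each_row")
    .alias("rolling_sum")
)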

Thank you for considering this feature request!

@fachra fachra added the enhancement New feature or request label Aug 10, 2024
@MarcoGorelli
Collaborator

thanks for the request

may I ask what your use case is for considering them separate? the current behaviour looks to me like it respects the rolling definition

@fachra
Author

fachra commented Aug 10, 2024

Thank you for your response! Let me provide more context about my use case.

In my specific use case, I need Polars' results to match the output of a production system. In the production system, data is streamed in one by one, but the timestamps are only accurate to the second. For multiple data points received within the same second, I know the order in which they were received, but I don't have more precise timestamps.

Each time a new data point arrives, the production system calculates the sum of all data points received in the past five seconds. If I use Polars' current rolling calculation method, future data points within the same second would be included in the calculation, which doesn't match the actual behavior of the production system.

I want to emphasize that I don't think the current implementation in Polars is incorrect. It works as intended and respects the rolling definition. However, it would be helpful if there were an option or parameter that allowed switching between the current behavior and a mode that matches Pandas' rolling calculation, particularly for handling duplicate timestamps. Currently, achieving similar behavior to Pandas with the current version of Polars is not easy.
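
To make the timing concrete, here is a minimal sketch of what the production system computes on each arrival (plain Python; the timestamps and the 5-second window are illustrative, not taken from the real system):

from datetime import datetime, timedelta

# Events in arrival order; timestamps are only accurate to the second,
# so the first two rows share a timestamp even though one arrived later.
events = [
    (datetime(2023, 8, 4, 10, 0, 0), 1.0),
    (datetime(2023, 8, 4, 10, 0, 0), 2.0),
    (datetime(2023, 8, 4, 10, 0, 3), 3.0),
]

window = timedelta(seconds=5)
seen = []
for ts, value in events:
    seen.append((ts, value))
    # Only rows received so far are visible; rows that arrive later,
    # even within the same second, must not be included.
    rolling_sum = sum(v for t, v in seen if ts - window < t <= ts)
    print(ts, value, rolling_sum)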

@JSteilberg

JSteilberg commented Aug 17, 2024

Commenting to add that, in the case of plain old rolling(), this behavior leads to duplicate groupings, which in my opinion is sometimes undesirable:

data = pl.DataFrame({
    "timestamp": [
        "2023-08-04 10:00:00",
        "2023-08-04 10:05:00",
        "2023-08-04 10:10:00",
        "2023-08-04 10:10:00",
        "2023-08-04 10:20:00",
        "2023-08-04 10:20:00",
    ],
    "value": [1, 2, 3, 4, 5, 6],
})

df = (
    data
    .with_columns(pl.col("timestamp").str.strptime(pl.Datetime))
    .sort("timestamp")
    .rolling(index_column="timestamp", period="10m")
    .agg(pl.col("value").alias("value_friends"))
).with_columns(data.select("value"))
print(df)

Output:

shape: (6, 3)
┌─────────────────────┬───────────────┬───────┐
│ timestamp           ┆ value_friends ┆ value │
│ ---                 ┆ ---           ┆ ---   │
│ datetime[μs]        ┆ list[i64]     ┆ i64   │
╞═════════════════════╪═══════════════╪═══════╡
│ 2023-08-04 10:00:00 ┆ [1]           ┆ 1     │
│ 2023-08-04 10:05:00 ┆ [1, 2]        ┆ 2     │
│ 2023-08-04 10:10:00 ┆ [2, 3, 4]     ┆ 3     │
│ 2023-08-04 10:10:00 ┆ [2, 3, 4]     ┆ 4     │
│ 2023-08-04 10:20:00 ┆ [5, 6]        ┆ 5     │
│ 2023-08-04 10:20:00 ┆ [5, 6]        ┆ 6     │
└─────────────────────┴───────────────┴───────┘

This could be undesirable in the following convoluted but relevant case: say I want to look at each value in my dataframe and find its "friends" within a time period before it, so that I can do processing with respect to the value and those friends. I know that in real life no two of my values actually happen at the same time, but with my sensor's API I can only get resolution down to five minutes, even though the values will be ordered by true time of arrival to the sensor.

In this case, I really want

shape: (6, 3)
┌─────────────────────┬───────────────┬───────┐
│ timestamp           ┆ value_friends ┆ value │
│ ---                 ┆ ---           ┆ ---   │
│ datetime[μs]        ┆ list[i64]     ┆ i64   │
╞═════════════════════╪═══════════════╪═══════╡
│ 2023-08-04 10:00:00 ┆ [1]           ┆ 1     │
│ 2023-08-04 10:05:00 ┆ [1, 2]        ┆ 2     │
│ 2023-08-04 10:10:00 ┆ [2, 3]        ┆ 3     │
│ 2023-08-04 10:10:00 ┆ [2, 3, 4]     ┆ 4     │
│ 2023-08-04 10:20:00 ┆ [5]           ┆ 5     │
│ 2023-08-04 10:20:00 ┆ [5, 6]        ┆ 6     │
└─────────────────────┴───────────────┴───────┘

I cannot think of a perfect solution to obtain the above table; I'm considering adding range(0, num_rows) nanoseconds to the timestamp column (a rough sketch of that idea is below).
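
For what it's worth, a rough sketch of that tie-breaking idea, assuming the rows are already in true arrival order. I use microsecond offsets here rather than nanoseconds so that they survive the default datetime[μs] precision; the shift is negligible next to the 10m window, but it does alter the timestamps shown in the output:

import polars as pl

data = pl.DataFrame({
    "timestamp": [
        "2023-08-04 10:00:00",
        "2023-08-04 10:05:00",
        "2023-08-04 10:10:00",
        "2023-08-04 10:10:00",
        "2023-08-04 10:20:00",
        "2023-08-04 10:20:00",
    ],
    "value": [1, 2, 3, 4, 5, 6],
})

df = (
    data
    .with_columns(pl.col("timestamp").str.strptime(pl.Datetime))
    # break ties: add 0, 1, 2, ... microseconds in row (arrival) order
    .with_columns(pl.col("timestamp") + pl.duration(microseconds=pl.int_range(pl.len())))
    .sort("timestamp")
    .rolling(index_column="timestamp", period="10m")
    .agg(pl.col("value").alias("value_friends"))
    # re-attach the original values (row order is preserved by the rolling agg)
    .with_columns(data.get_column("value"))
)
print(df)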

Thanks for maintaining/developing this library by the way... It's all I've been doing for the past week. Took a 4-hour multiprocessing pandas query I was doing down to 10 minutes!

@fachra
Author

fachra commented Aug 18, 2024

@JSteilberg There is a workaround to get what you want.
But I hope Polars can make this behavior easier to achieve.

import polars as pl
data = pl.DataFrame({
    "timestamp": [
        "2023-08-04 10:00:00",
        "2023-08-04 10:05:00",
        "2023-08-04 10:10:00",
        "2023-08-04 10:10:00",
        "2023-08-04 10:20:00",
        "2023-08-04 10:20:00",
    ],
    "value": [1, 2, 3, 4, 5, 6],
})

df = (
    data.with_columns(
        pl.col("timestamp").str.strptime(pl.Datetime)
    )
    .sort("timestamp")
    # aggregate values into list column
    .rolling(index_column="timestamp", period="10m").agg(pl.col('value').alias('value_old_friends'))
    # take sum of sliced list column to get desired behaviour
    .with_columns(
        pl.col("value_old_friends")
        .list.slice(
            0,
            pl.col("value_old_friends").list.len() - pl.int_range(pl.len()).reverse().over("timestamp")
        )
        .alias('value_friends')
    )
).with_columns(data.select("value"))
print(df)

Output

shape: (6, 4)
┌─────────────────────┬───────────────────┬───────────────┬───────┐
│ timestamp           ┆ value_old_friends ┆ value_friends ┆ value │
│ ---                 ┆ ---               ┆ ---           ┆ ---   │
│ datetime[μs]        ┆ list[i64]         ┆ list[i64]     ┆ i64   │
╞═════════════════════╪═══════════════════╪═══════════════╪═══════╡
│ 2023-08-04 10:00:00 ┆ [1]               ┆ [1]           ┆ 1     │
│ 2023-08-04 10:05:00 ┆ [1, 2]            ┆ [1, 2]        ┆ 2     │
│ 2023-08-04 10:10:00 ┆ [2, 3, 4]         ┆ [2, 3]        ┆ 3     │
│ 2023-08-04 10:10:00 ┆ [2, 3, 4]         ┆ [2, 3, 4]     ┆ 4     │
│ 2023-08-04 10:20:00 ┆ [5, 6]            ┆ [5]           ┆ 5     │
│ 2023-08-04 10:20:00 ┆ [5, 6]            ┆ [5, 6]        ┆ 6     │
└─────────────────────┴───────────────────┴───────────────┴───────┘

@MarcoGorelli
Collaborator

One thing that'd be tricky to define about doing what you're asking for is that Polars no longer requires data to be sorted by the 'by' column in the rolling_*_by aggregations, so it's ambiguous which of the duplicate timestamps the user intends as the one that came first

The current behaviour matches what DuckDB does:

In [17]: df
Out[17]:
shape: (4, 2)
┌────────────┬─────┐
│ a          ┆ b   │
│ ---        ┆ --- │
│ date       ┆ i64 │
╞════════════╪═════╡
│ 2020-01-03 ┆ 4   │
│ 2020-01-01 ┆ 1   │
│ 2020-01-02 ┆ 9   │
│ 2020-01-02 ┆ 2   │
└────────────┴─────┘

In [18]: duckdb.sql('select a, b, mean(b) over (order by a range between interval 1 days preceding and current row) as b_roll from df')
Out[18]:
┌────────────┬───────┬────────┐
│     a      │   b   │ b_roll │
│    date    │ int64 │ double │
├────────────┼───────┼────────┤
│ 2020-01-01 │     1 │    1.0 │
│ 2020-01-02 │     9 │    4.0 │
│ 2020-01-02 │     2 │    4.0 │
│ 2020-01-03 │     4 │    5.0 │
└────────────┴───────┴────────┘

In [19]: df.with_columns(b_roll=pl.col('b').rolling_mean_by('a', window_size='2d'))
Out[19]:
shape: (4, 3)
┌────────────┬─────┬────────┐
│ a          ┆ b   ┆ b_roll │
│ ---        ┆ --- ┆ ---    │
│ date       ┆ i64 ┆ f64    │
╞════════════╪═════╪════════╡
│ 2020-01-03 ┆ 4   ┆ 5.0    │
│ 2020-01-01 ┆ 1   ┆ 1.0    │
│ 2020-01-02 ┆ 9   ┆ 4.0    │
│ 2020-01-02 ┆ 2   ┆ 4.0    │
└────────────┴─────┴────────┘

In [20]: df.to_pandas().set_index('a').sort_index()['b'].rolling('2D').mean()
Out[20]:
a
2020-01-01    1.0
2020-01-02    5.0
2020-01-02    4.0
2020-01-03    5.0
Name: b, dtype: float64

@JSteilberg

Should the docs for plain rolling be updated in that case? It seems to still say the index should be sorted

https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.rolling.html#polars.DataFrame.rolling

I agree that if it no longer requires sorting then my ask is malformed

@MarcoGorelli
Collaborator

rolling still requires sorted data, but rolling_*_by aggs don't

maybe we can wrap the workaround as a convenience function in polars-xdt, so you can just import it and have something more readable
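
For example, such a helper could simply wrap the slice trick from the workaround above. Everything here is hypothetical: the function name, its signature, and its placement in polars-xdt are illustrative only.

import polars as pl

# Hypothetical convenience function; the name and signature are illustrative.
def rolling_agg_by_row(
    df: pl.DataFrame, index_column: str, value: str, period: str, out: str = "window"
) -> pl.DataFrame:
    """For each row, collect the values of `value` within `period` before it,
    treating rows that share a timestamp one by one in row order.
    Assumes `df` is already sorted by `index_column` in arrival order."""
    return (
        df.rolling(index_column=index_column, period=period)
        .agg(pl.col(value).alias(out))
        .with_columns(
            # drop the trailing entries contributed by same-timestamp rows
            # that come after the current row
            pl.col(out).list.slice(
                0,
                pl.col(out).list.len()
                - pl.int_range(pl.len()).reverse().over(index_column),
            )
        )
        # re-attach the original values (row order is preserved by the rolling agg)
        .with_columns(df.get_column(value))
    )

# Usage sketch; appending `.list.sum()` to the sliced column instead would give
# a Pandas-style rolling sum.
# result = rolling_agg_by_row(df, "timestamp", "value", "10m", out="value_friends")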

@MarcoGorelli MarcoGorelli transferred this issue from pola-rs/polars Oct 13, 2024
@MarcoGorelli
Collaborator

from discussion - can confirm that Polars should keep its current behaviour, but we can add the requested one here in polars-xdt
