Support for handling duplicate timestamps in rolling operations similar to Pandas #85
Comments
Thanks for the request! May I ask what your use case is for considering them separate? The current behaviour looks to me like it respects the rolling definition.
Thank you for your response! Let me provide more context about my use case.

In my specific use case, I need Polars' results to match the output of a production system. In the production system, data is streamed in one by one, but the timestamps are only accurate to the second. For multiple data points received within the same second, I know the order in which they were received, but I don't have more precise timestamps. Each time a new data point arrives, the production system calculates the sum of all data points received in the past five seconds. If I use Polars' current rolling calculation method, future data points within the same second would be included in the calculation, which doesn't match the actual behavior of the production system.

I want to emphasize that I don't think the current implementation in Polars is incorrect. It works as intended and respects the rolling definition. However, it would be helpful if there were an option or parameter that allowed switching between the current behavior and a mode that matches Pandas' rolling calculation, particularly for handling duplicate timestamps. Currently, achieving similar behavior to Pandas with the current version of Polars is not easy.
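To make the intended semantics concrete, here is a small illustrative sketch (not the production system's actual code) of the per-arrival computation described above:

```python
from collections import deque
from datetime import datetime, timedelta

def streaming_window_sums(events, window=timedelta(seconds=5)):
    """Per-arrival rolling sum: `events` is an iterable of
    (timestamp, value) pairs in arrival order; timestamps are only
    second-accurate and may repeat."""
    buffer = deque()
    sums = []
    for ts, value in events:
        buffer.append((ts, value))
        # evict points that fall outside the (ts - window, ts] interval
        while buffer[0][0] <= ts - window:
            buffer.popleft()
        # only already-received points are in the buffer, so a later
        # point sharing this timestamp cannot influence this sum
        sums.append(sum(v for _, v in buffer))
    return sums

events = [
    (datetime(2023, 8, 4, 10, 0, 0), 1),
    (datetime(2023, 8, 4, 10, 0, 1), 2),
    (datetime(2023, 8, 4, 10, 0, 1), 3),  # same second, arrived later
]
print(streaming_window_sums(events))  # [1, 3, 6]
```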
Commenting to add that in the case of plain old `rolling`, this could be undesirable in the following convoluted but relevant case: Say I want to look at each value in my dataframe and find its "friends" within a time period before it, so that I can do processing with respect to the value and those friends. I know that in real life no two of my values actually happen at the same time, but with my sensor's API I can only get resolution down to five minutes, even though the values will be ordered by true time of arrival to the sensor. In this case, I really want each value's window to contain only itself and the friends that arrived before it, not later arrivals that happen to share its timestamp.

I cannot think of a perfect solution to obtain that behaviour. Thanks for maintaining/developing this library, by the way... It's all I've been doing for the past week. Took a 4-hour multiprocessing pandas query I was doing down to 10 minutes!
@JSteilberg There is a workaround to get what you want:

```python
import polars as pl

data = pl.DataFrame({
    "timestamp": [
        "2023-08-04 10:00:00",
        "2023-08-04 10:05:00",
        "2023-08-04 10:10:00",
        "2023-08-04 10:10:00",
        "2023-08-04 10:20:00",
        "2023-08-04 10:20:00",
    ],
    "value": [1, 2, 3, 4, 5, 6],
})
df = (
    data.with_columns(
        pl.col("timestamp").str.strptime(pl.Datetime)
    )
    .sort("timestamp")
    # aggregate all values within each 10-minute window into a list column
    .rolling(index_column="timestamp", period="10m")
    .agg(pl.col("value").alias("value_old_friends"))
    # slice the list column so each row only keeps values that arrived
    # at or before it (later arrivals sharing the timestamp are dropped)
    .with_columns(
        pl.col("value_old_friends")
        .list.slice(
            0,
            pl.col("value_old_friends").list.len()
            - pl.int_range(pl.len()).reverse().over("timestamp"),
        )
        .alias("value_friends")
    )
).with_columns(data.select("value"))
print(df)
```
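Output (worked through by hand from the snippet above, so treat the exact frame formatting as approximate):

```
shape: (6, 4)
┌─────────────────────┬───────────────────┬───────────────┬───────┐
│ timestamp           ┆ value_old_friends ┆ value_friends ┆ value │
│ ---                 ┆ ---               ┆ ---           ┆ ---   │
│ datetime[μs]        ┆ list[i64]         ┆ list[i64]     ┆ i64   │
╞═════════════════════╪═══════════════════╪═══════════════╪═══════╡
│ 2023-08-04 10:00:00 ┆ [1]               ┆ [1]           ┆ 1     │
│ 2023-08-04 10:05:00 ┆ [1, 2]            ┆ [1, 2]        ┆ 2     │
│ 2023-08-04 10:10:00 ┆ [2, 3, 4]         ┆ [2, 3]        ┆ 3     │
│ 2023-08-04 10:10:00 ┆ [2, 3, 4]         ┆ [2, 3, 4]     ┆ 4     │
│ 2023-08-04 10:20:00 ┆ [5, 6]            ┆ [5]           ┆ 5     │
│ 2023-08-04 10:20:00 ┆ [5, 6]            ┆ [5, 6]        ┆ 6     │
└─────────────────────┴───────────────────┴───────────────┴───────┘
```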
One thing that'd be tricky to define about doing what you're asking for is that Polars no longer requires data to be sorted by the index column.

The current behaviour matches what DuckDB does:

```python
In [17]: df
Out[17]:
shape: (4, 2)
┌────────────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ date ┆ i64 │
╞════════════╪═════╡
│ 2020-01-03 ┆ 4 │
│ 2020-01-01 ┆ 1 │
│ 2020-01-02 ┆ 9 │
│ 2020-01-02 ┆ 2 │
└────────────┴─────┘
In [18]: duckdb.sql('select a, b, mean(b) over (order by a range between interval 1 days preceding and current row) as b_roll from df')
Out[18]:
┌────────────┬───────┬────────┐
│ a │ b │ b_roll │
│ date │ int64 │ double │
├────────────┼───────┼────────┤
│ 2020-01-01 │ 1 │ 1.0 │
│ 2020-01-02 │ 9 │ 4.0 │
│ 2020-01-02 │ 2 │ 4.0 │
│ 2020-01-03 │ 4 │ 5.0 │
└────────────┴───────┴────────┘
In [19]: df.with_columns(b_roll=pl.col('b').rolling_mean_by('a', window_size='2d'))
Out[19]:
shape: (4, 3)
┌────────────┬─────┬────────┐
│ a ┆ b ┆ b_roll │
│ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ f64 │
╞════════════╪═════╪════════╡
│ 2020-01-03 ┆ 4 ┆ 5.0 │
│ 2020-01-01 ┆ 1 ┆ 1.0 │
│ 2020-01-02 ┆ 9 ┆ 4.0 │
│ 2020-01-02 ┆ 2 ┆ 4.0 │
└────────────┴─────┴────────┘
In [20]: df.to_pandas().set_index('a').sort_index()['b'].rolling('2D').mean()
Out[20]:
a
2020-01-01 1.0
2020-01-02 5.0
2020-01-02 4.0
2020-01-03 5.0
Name: b, dtype: float64
```
Should the docs for plain `rolling` mention this? I agree that if it no longer requires sorting, then my ask is malformed.
Maybe we can wrap the workaround as a convenience function in polars-xdt, so you can just import it and have something more readable.
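For illustration, a sketch of what such a helper could look like (the name `rolling_sum_before` and its signature are hypothetical, not an existing polars-xdt API):

```python
import polars as pl

def rolling_sum_before(df, index_column, value_column, period):
    """Hypothetical helper: rolling sum over `period` where each row
    only counts rows that arrived at or before it, even when the
    index column contains duplicates (Pandas-like behaviour)."""
    # for rows sharing a timestamp, count how many arrive after this one
    n_later = pl.int_range(pl.len()).reverse().over(index_column)
    return (
        df.sort(index_column)
        .rolling(index_column=index_column, period=period)
        .agg(pl.col(value_column))
        .with_columns(
            pl.col(value_column)
            .list.slice(0, pl.col(value_column).list.len() - n_later)
            .list.sum()
            .alias(f"{value_column}_rolling_sum")
        )
    )
```

This just wraps the list-slice workaround above, so the semantics match the per-arrival behaviour described earlier in the thread.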
From discussion: can confirm that Polars should keep its current behaviour, but we can add the requested one here in polars-xdt.
Description
Feature Request:
I would like to request a feature in Polars that allows handling duplicate timestamps in rolling operations in a way that mimics Pandas' behavior.
Related issue: pola-rs/polars#6691
Related question: stackoverflow
Context:
In Pandas, when performing a rolling operation with a time-based window on a DataFrame that includes duplicate timestamps, each row is treated individually, and the rolling sum (or other rolling operations) accounts for each row separately, even if they share the same timestamp.
For example, consider the following code in Pandas:
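A minimal sketch of the kind of operation in question, assuming second-accurate timestamps with duplicates (the data here is illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [1, 2, 3, 4]},
    index=pd.to_datetime([
        "2023-08-04 10:00:00",
        "2023-08-04 10:00:01",
        "2023-08-04 10:00:01",  # duplicate timestamp, arrived later
        "2023-08-04 10:00:02",
    ]),
)
print(df["value"].rolling("5s").sum())
```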
This results in:
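(Output of the sketch above; note that the two rows sharing 10:00:01 get different sums, because each row only sees rows up to its own position.)

```
2023-08-04 10:00:00     1.0
2023-08-04 10:00:01     3.0
2023-08-04 10:00:01     6.0
2023-08-04 10:00:02    10.0
Name: value, dtype: float64
```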
In Polars, using a similar operation does not yield the same result, as duplicate timestamps are aggregated differently:
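The equivalent sketch in Polars, using `rolling_sum_by` on the same illustrative data:

```python
import polars as pl
from datetime import datetime

df = pl.DataFrame({
    "timestamp": [
        datetime(2023, 8, 4, 10, 0, 0),
        datetime(2023, 8, 4, 10, 0, 1),
        datetime(2023, 8, 4, 10, 0, 1),  # duplicate timestamp
        datetime(2023, 8, 4, 10, 0, 2),
    ],
    "value": [1, 2, 3, 4],
})
print(df.with_columns(
    rolling_sum=pl.col("value").rolling_sum_by("timestamp", window_size="5s")
))
```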
This results in:
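(Output of the Polars sketch; both rows at 10:00:01 see the full set of values sharing that timestamp.)

```
shape: (4, 3)
┌─────────────────────┬───────┬─────────────┐
│ timestamp           ┆ value ┆ rolling_sum │
│ ---                 ┆ ---   ┆ ---         │
│ datetime[μs]        ┆ i64   ┆ i64         │
╞═════════════════════╪═══════╪═════════════╡
│ 2023-08-04 10:00:00 ┆ 1     ┆ 1           │
│ 2023-08-04 10:00:01 ┆ 2     ┆ 6           │
│ 2023-08-04 10:00:01 ┆ 3     ┆ 6           │
│ 2023-08-04 10:00:02 ┆ 4     ┆ 10          │
└─────────────────────┴───────┴─────────────┘
```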
Proposed Solution:
It would be beneficial if Polars could offer an option or parameter within the rolling operations that allows for handling duplicate timestamps in a way that each row is treated individually, similar to Pandas.
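For instance, something along these lines (the `include_later_duplicates` parameter is purely illustrative and does not exist in Polars today):

```python
# hypothetical API sketch; `include_later_duplicates` is not a real
# Polars parameter, it only illustrates the requested switch
df.with_columns(
    rolling_sum=pl.col("value").rolling_sum_by(
        "timestamp",
        window_size="5s",
        include_later_duplicates=False,  # Pandas-like per-row windows
    )
)
```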
Thank you for considering this feature request!