Support for handling duplicate timestamps in rolling operations similar to Pandas #85
Comments
Thanks for the request! May I ask what your use case is for considering them separate? The current behaviour looks to me like it respects the rolling definition.
Thank you for your response! Let me provide more context about my use case.

In my specific use case, I need Polars' results to match the output of a production system. In the production system, data is streamed in one by one, but the timestamps are only accurate to the second. For multiple data points received within the same second, I know the order in which they were received, but I don't have more precise timestamps. Each time a new data point arrives, the production system calculates the sum of all data points received in the past five seconds. If I use Polars' current rolling calculation method, future data points within the same second would be included in the calculation, which doesn't match the actual behavior of the production system.

I want to emphasize that I don't think the current implementation in Polars is incorrect. It works as intended and respects the rolling definition. However, it would be helpful if there were an option or parameter that allowed switching between the current behavior and a mode that matches Pandas' rolling calculation, particularly for handling duplicate timestamps. Currently, achieving similar behavior to Pandas with the current version of Polars is not easy.
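To make the intended semantics concrete, here is a small illustrative sketch (not the production system's actual code) of the per-arrival computation described above:

```python
from collections import deque
from datetime import datetime, timedelta

def streaming_window_sums(events, window=timedelta(seconds=5)):
    """Per-arrival rolling sum: `events` is an iterable of
    (timestamp, value) pairs in arrival order; timestamps are only
    second-accurate and may repeat."""
    buffer = deque()
    sums = []
    for ts, value in events:
        buffer.append((ts, value))
        # evict points that fall outside the (ts - window, ts] interval
        while buffer[0][0] <= ts - window:
            buffer.popleft()
        # only already-received points are in the buffer, so a later
        # point sharing this timestamp cannot influence this sum
        sums.append(sum(v for _, v in buffer))
    return sums

events = [
    (datetime(2023, 8, 4, 10, 0, 0), 1),
    (datetime(2023, 8, 4, 10, 0, 1), 2),
    (datetime(2023, 8, 4, 10, 0, 1), 3),  # same second, arrived later
]
print(streaming_window_sums(events))  # [1, 3, 6]
```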
Commenting to add that in the case of plain old `rolling`, this could be undesirable in the following convoluted but relevant case: Say I want to look at each value in my dataframe and find its "friends" within a time period before it, so that I can do processing with respect to the value and those friends. I know that in real life no two of my values actually happen at the same time, but with my sensor's API I can only get resolution down to five minutes, even though the values will be ordered by true time of arrival to the sensor. In this case, I really want each value's window to contain only itself and the friends that arrived before it, not later arrivals that happen to share its timestamp.

I cannot think of a perfect solution to obtain that behaviour. Thanks for maintaining/developing this library, by the way... It's all I've been doing for the past week. Took a 4-hour multiprocessing pandas query I was doing down to 10 minutes!
@JSteilberg There is a workaround to get what you want:

```python
import polars as pl

data = pl.DataFrame({
    "timestamp": [
        "2023-08-04 10:00:00",
        "2023-08-04 10:05:00",
        "2023-08-04 10:10:00",
        "2023-08-04 10:10:00",
        "2023-08-04 10:20:00",
        "2023-08-04 10:20:00",
    ],
    "value": [1, 2, 3, 4, 5, 6],
})
df = (
    data.with_columns(
        pl.col("timestamp").str.strptime(pl.Datetime)
    )
    .sort("timestamp")
    # aggregate all values within each 10-minute window into a list column
    .rolling(index_column="timestamp", period="10m")
    .agg(pl.col("value").alias("value_old_friends"))
    # slice the list column so each row only keeps values that arrived
    # at or before it (later arrivals sharing the timestamp are dropped)
    .with_columns(
        pl.col("value_old_friends")
        .list.slice(
            0,
            pl.col("value_old_friends").list.len()
            - pl.int_range(pl.len()).reverse().over("timestamp"),
        )
        .alias("value_friends")
    )
).with_columns(data.select("value"))
print(df)
```
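Output (worked through by hand from the snippet above, so treat the exact frame formatting as approximate):

```
shape: (6, 4)
┌─────────────────────┬───────────────────┬───────────────┬───────┐
│ timestamp           ┆ value_old_friends ┆ value_friends ┆ value │
│ ---                 ┆ ---               ┆ ---           ┆ ---   │
│ datetime[μs]        ┆ list[i64]         ┆ list[i64]     ┆ i64   │
╞═════════════════════╪═══════════════════╪═══════════════╪═══════╡
│ 2023-08-04 10:00:00 ┆ [1]               ┆ [1]           ┆ 1     │
│ 2023-08-04 10:05:00 ┆ [1, 2]            ┆ [1, 2]        ┆ 2     │
│ 2023-08-04 10:10:00 ┆ [2, 3, 4]         ┆ [2, 3]        ┆ 3     │
│ 2023-08-04 10:10:00 ┆ [2, 3, 4]         ┆ [2, 3, 4]     ┆ 4     │
│ 2023-08-04 10:20:00 ┆ [5, 6]            ┆ [5]           ┆ 5     │
│ 2023-08-04 10:20:00 ┆ [5, 6]            ┆ [5, 6]        ┆ 6     │
└─────────────────────┴───────────────────┴───────────────┴───────┘
```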
One thing that'd be tricky to define about doing what you're asking for is that Polars no longer requires data to be sorted by the index column.

The current behaviour matches what DuckDB does:

```python
In [17]: df
Out[17]:
shape: (4, 2)
┌────────────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ date ┆ i64 │
╞════════════╪═════╡
│ 2020-01-03 ┆ 4 │
│ 2020-01-01 ┆ 1 │
│ 2020-01-02 ┆ 9 │
│ 2020-01-02 ┆ 2 │
└────────────┴─────┘
In [18]: duckdb.sql('select a, b, mean(b) over (order by a range between interval 1 days preceding and current row) as b_roll from df')
Out[18]:
┌────────────┬───────┬────────┐
│ a │ b │ b_roll │
│ date │ int64 │ double │
├────────────┼───────┼────────┤
│ 2020-01-01 │ 1 │ 1.0 │
│ 2020-01-02 │ 9 │ 4.0 │
│ 2020-01-02 │ 2 │ 4.0 │
│ 2020-01-03 │ 4 │ 5.0 │
└────────────┴───────┴────────┘
In [19]: df.with_columns(b_roll=pl.col('b').rolling_mean_by('a', window_size='2d'))
Out[19]:
shape: (4, 3)
┌────────────┬─────┬────────┐
│ a ┆ b ┆ b_roll │
│ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ f64 │
╞════════════╪═════╪════════╡
│ 2020-01-03 ┆ 4 ┆ 5.0 │
│ 2020-01-01 ┆ 1 ┆ 1.0 │
│ 2020-01-02 ┆ 9 ┆ 4.0 │
│ 2020-01-02 ┆ 2 ┆ 4.0 │
└────────────┴─────┴────────┘
In [20]: df.to_pandas().set_index('a').sort_index()['b'].rolling('2D').mean()
Out[20]:
a
2020-01-01 1.0
2020-01-02 5.0
2020-01-02 4.0
2020-01-03 5.0
Name: b, dtype: float64
```
Should the docs for plain `rolling` mention this? I agree that if it no longer requires sorting, then my ask is malformed.
Maybe we can wrap the workaround as a convenience function in polars-xdt, so you can just import it and have something more readable.
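For illustration, a sketch of what such a helper could look like (the name `rolling_sum_before` and its signature are hypothetical, not an existing polars-xdt API):

```python
import polars as pl

def rolling_sum_before(df, index_column, value_column, period):
    """Hypothetical helper: rolling sum over `period` where each row
    only counts rows that arrived at or before it, even when the
    index column contains duplicates (Pandas-like behaviour)."""
    # for rows sharing a timestamp, count how many arrive after this one
    n_later = pl.int_range(pl.len()).reverse().over(index_column)
    return (
        df.sort(index_column)
        .rolling(index_column=index_column, period=period)
        .agg(pl.col(value_column))
        .with_columns(
            pl.col(value_column)
            .list.slice(0, pl.col(value_column).list.len() - n_later)
            .list.sum()
            .alias(f"{value_column}_rolling_sum")
        )
    )
```

This just wraps the list-slice workaround above, so the semantics match the per-arrival behaviour described earlier in the thread.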
From discussion: can confirm that Polars should keep its current behaviour, but we can add the requested one here in polars-xdt.
Description
Feature Request:
I would like to request a feature in Polars that allows handling duplicate timestamps in rolling operations in a way that mimics Pandas' behavior.
Related issue: pola-rs/polars#6691
Related question: stackoverflow
Context:
In Pandas, when performing a rolling operation with a time-based window on a DataFrame that includes duplicate timestamps, each row is treated individually, and the rolling sum (or other rolling operations) accounts for each row separately, even if they share the same timestamp.
For example, consider the following code in Pandas:
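A minimal sketch of the kind of operation in question, assuming second-accurate timestamps with duplicates (the data here is illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [1, 2, 3, 4]},
    index=pd.to_datetime([
        "2023-08-04 10:00:00",
        "2023-08-04 10:00:01",
        "2023-08-04 10:00:01",  # duplicate timestamp, arrived later
        "2023-08-04 10:00:02",
    ]),
)
print(df["value"].rolling("5s").sum())
```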
This results in:
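(Output of the sketch above; note that the two rows sharing 10:00:01 get different sums, because each row only sees rows up to its own position.)

```
2023-08-04 10:00:00     1.0
2023-08-04 10:00:01     3.0
2023-08-04 10:00:01     6.0
2023-08-04 10:00:02    10.0
Name: value, dtype: float64
```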
In Polars, using a similar operation does not yield the same result, as duplicate timestamps are aggregated differently:
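The equivalent sketch in Polars, using `rolling_sum_by` on the same illustrative data:

```python
import polars as pl
from datetime import datetime

df = pl.DataFrame({
    "timestamp": [
        datetime(2023, 8, 4, 10, 0, 0),
        datetime(2023, 8, 4, 10, 0, 1),
        datetime(2023, 8, 4, 10, 0, 1),  # duplicate timestamp
        datetime(2023, 8, 4, 10, 0, 2),
    ],
    "value": [1, 2, 3, 4],
})
print(df.with_columns(
    rolling_sum=pl.col("value").rolling_sum_by("timestamp", window_size="5s")
))
```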
This results in:
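(Output of the Polars sketch; both rows at 10:00:01 see the full set of values sharing that timestamp.)

```
shape: (4, 3)
┌─────────────────────┬───────┬─────────────┐
│ timestamp           ┆ value ┆ rolling_sum │
│ ---                 ┆ ---   ┆ ---         │
│ datetime[μs]        ┆ i64   ┆ i64         │
╞═════════════════════╪═══════╪═════════════╡
│ 2023-08-04 10:00:00 ┆ 1     ┆ 1           │
│ 2023-08-04 10:00:01 ┆ 2     ┆ 6           │
│ 2023-08-04 10:00:01 ┆ 3     ┆ 6           │
│ 2023-08-04 10:00:02 ┆ 4     ┆ 10          │
└─────────────────────┴───────┴─────────────┘
```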
Proposed Solution:
It would be beneficial if Polars could offer an option or parameter within the rolling operations that allows for handling duplicate timestamps in a way that each row is treated individually, similar to Pandas.
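For instance, something along these lines (the `include_later_duplicates` parameter is purely illustrative and does not exist in Polars today):

```python
# hypothetical API sketch; `include_later_duplicates` is not a real
# Polars parameter, it only illustrates the requested switch
df.with_columns(
    rolling_sum=pl.col("value").rolling_sum_by(
        "timestamp",
        window_size="5s",
        include_later_duplicates=False,  # Pandas-like per-row windows
    )
)
```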
Thank you for considering this feature request!