Calculate cumulative sum of time series X for time points in series Y


Imagine transactions, identified by amount, arriving throughout the day. You want to calculate the running total of amount at given points in time (9 am, 10 am, etc.).

With pandas, I would use apply to perform such an operation. With Polars, I tried using map_elements. I have also considered group_by_dynamic but I am not sure it gives me control of the time grid's start / end / increment.

Is there a better way?

import polars as pl
import datetime

df = pl.DataFrame({
    "time": [
        datetime.datetime(2025, 2, 2, 11, 1),
        datetime.datetime(2025, 2, 2, 11, 2),
        datetime.datetime(2025, 2, 2, 11, 3)
    ],
    "amount": [5.0, -1, 10]  
}) 

dg = pl.DataFrame(
    pl.datetime_range(
        datetime.datetime(2025, 2, 2, 11, 0), 
        datetime.datetime(2025, 2, 2, 11, 5), 
        "1m",
        eager=True
    ),
    schema=["time"]
)

def _cumsum(dt):
    # Sum of all transaction amounts at or before the given timestamp.
    return df.filter(pl.col("time") <= dt).select(pl.col("amount")).sum().item()

dg.with_columns(
    cum_amount=pl.col("time").map_elements(_cumsum, return_dtype=pl.Float64)
) 
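
For reference, the map_elements version above produces the running totals below (note that the sum over an empty filter comes back as 0.0 rather than null):

shape: (6, 2)
┌─────────────────────┬────────────┐
│ time                ┆ cum_amount │
│ ---                 ┆ ---        │
│ datetime[μs]        ┆ f64        │
╞═════════════════════╪════════════╡
│ 2025-02-02 11:00:00 ┆ 0.0        │
│ 2025-02-02 11:01:00 ┆ 5.0        │
│ 2025-02-02 11:02:00 ┆ 4.0        │
│ 2025-02-02 11:03:00 ┆ 14.0       │
│ 2025-02-02 11:04:00 ┆ 14.0       │
│ 2025-02-02 11:05:00 ┆ 14.0       │
└─────────────────────┴────────────┘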

  • It seems like your example excludes cases that you might care about, such as multiple entries in df for just one in dg. Additionally, and most importantly, you should have an expected output to go along with your example. – Dean MacGregor

1 Answer


This can be achieved relying purely on Polars' native expressions API.

As a first step, we can associate each row in df with the earliest timestamp in dg that is equal to or later than the corresponding timestamp in df. For this, pl.DataFrame.join_asof with strategy="forward" can be used.

df.join_asof(dg, on="time", strategy="forward", coalesce=False)
shape: (3, 3)
┌─────────────────────┬────────┬─────────────────────┐
│ time                ┆ amount ┆ time_right          │
│ ---                 ┆ ---    ┆ ---                 │
│ datetime[μs]        ┆ f64    ┆ datetime[μs]        │
╞═════════════════════╪════════╪═════════════════════╡
│ 2025-02-02 11:01:00 ┆ 5.0    ┆ 2025-02-02 11:01:00 │
│ 2025-02-02 11:02:00 ┆ -1.0   ┆ 2025-02-02 11:02:00 │
│ 2025-02-02 11:03:00 ┆ 10.0   ┆ 2025-02-02 11:03:00 │
└─────────────────────┴────────┴─────────────────────┘

Next, we can use these timestamps to join the amount values to dg.

dg.join(
    (
        df
        .join_asof(dg, on="time", strategy="forward", coalesce=False)
        .select("amount", pl.col("time_right").alias("time"))
    ),
    on="time",
    how="left",
)
shape: (6, 2)
┌─────────────────────┬────────┐
│ time                ┆ amount │
│ ---                 ┆ ---    │
│ datetime[μs]        ┆ f64    │
╞═════════════════════╪════════╡
│ 2025-02-02 11:00:00 ┆ null   │
│ 2025-02-02 11:01:00 ┆ 5.0    │
│ 2025-02-02 11:02:00 ┆ -1.0   │
│ 2025-02-02 11:03:00 ┆ 10.0   │
│ 2025-02-02 11:04:00 ┆ null   │
│ 2025-02-02 11:05:00 ┆ null   │
└─────────────────────┴────────┘

Note that we rename the time_right column in the dataframe returned by pl.DataFrame.join_asof back to time before merging it into dg.
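
Equivalently, a minimal sketch of the same renaming step using drop and rename instead of re-selecting with an alias:

(
    df
    .join_asof(dg, on="time", strategy="forward", coalesce=False)
    # Drop the original transaction timestamps and rename the grid
    # timestamps back to "time" before the merge.
    .drop("time")
    .rename({"time_right": "time"})
)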

While not shown in this example, there may now be duplicate rows for a given timestamp, as multiple values in df can be associated with the same timestamp in dg. Hence, we first aggregate the amount values for each timestamp, and then perform a regular cumulative sum (a sketch of the duplicate case follows the final output below).

(
    dg
    .join(
        (
            df
            .join_asof(dg, on="time", strategy="forward", coalesce=False)
            .select("amount", pl.col("time_right").alias("time"))
        ),
        on="time",
        how="left",
    )
    .group_by("time", maintain_order=True)  # group_by output order is otherwise not guaranteed
    .agg(pl.col("amount").sum())
    .with_columns(
        pl.col("amount").cum_sum().name.prefix("cum_")
    )
)
shape: (6, 3)
┌─────────────────────┬────────┬────────────┐
│ time                ┆ amount ┆ cum_amount │
│ ---                 ┆ ---    ┆ ---        │
│ datetime[μs]        ┆ f64    ┆ f64        │
╞═════════════════════╪════════╪════════════╡
│ 2025-02-02 11:00:00 ┆ 0.0    ┆ 0.0        │
│ 2025-02-02 11:01:00 ┆ 5.0    ┆ 5.0        │
│ 2025-02-02 11:02:00 ┆ -1.0   ┆ 4.0        │
│ 2025-02-02 11:03:00 ┆ 10.0   ┆ 14.0       │
│ 2025-02-02 11:04:00 ┆ 0.0    ┆ 14.0       │
│ 2025-02-02 11:05:00 ┆ 0.0    ┆ 14.0       │
└─────────────────────┴────────┴────────────┘
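
To illustrate the duplicate-timestamp case mentioned above, here is a minimal sketch. The input df2 and its sub-minute transactions are hypothetical, and dg is the grid from the question: both transactions fall within the same minute, so both roll forward to the 11:03 grid point and are summed there before the cumulative sum.

import datetime
import polars as pl

# Hypothetical input: two transactions within the same minute, so both
# map forward to the 11:03 grid point.
df2 = pl.DataFrame({
    "time": [
        datetime.datetime(2025, 2, 2, 11, 2, 30),
        datetime.datetime(2025, 2, 2, 11, 2, 45),
    ],
    "amount": [5.0, 10.0],
})

(
    dg
    .join(
        df2
        .join_asof(dg, on="time", strategy="forward", coalesce=False)
        .select("amount", pl.col("time_right").alias("time")),
        on="time",
        how="left",
    )
    .group_by("time", maintain_order=True)
    .agg(pl.col("amount").sum())
    .with_columns(pl.col("amount").cum_sum().name.prefix("cum_"))
)
shape: (6, 3)
┌─────────────────────┬────────┬────────────┐
│ time                ┆ amount ┆ cum_amount │
│ ---                 ┆ ---    ┆ ---        │
│ datetime[μs]        ┆ f64    ┆ f64        │
╞═════════════════════╪════════╪════════════╡
│ 2025-02-02 11:00:00 ┆ 0.0    ┆ 0.0        │
│ 2025-02-02 11:01:00 ┆ 0.0    ┆ 0.0        │
│ 2025-02-02 11:02:00 ┆ 0.0    ┆ 0.0        │
│ 2025-02-02 11:03:00 ┆ 15.0   ┆ 15.0       │
│ 2025-02-02 11:04:00 ┆ 0.0    ┆ 15.0       │
│ 2025-02-02 11:05:00 ┆ 0.0    ┆ 15.0       │
└─────────────────────┴────────┴────────────┘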