
Modify DataFrame based on previous row (cumulative sum with condition based on previous cumulative sum result)

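I have a DataFrame with one column of numbers (quantity). Each row represents one day, so the whole DataFrame should be treated as sequential data. I want to add a second column that calculates the cumulative sum of the quantity column, but if at any point the cumulative sum is greater than 0, the next row should start counting the cumulative sum from 0.

I solved this problem using iterrows(), but I read that this function is very inefficient, and with millions of rows the calculation takes over 20 minutes. My solution is below: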

import pandas as pd

df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])

for index, row in df.iterrows():
    if index == 0:
        # First row: there is no previous outcome to carry over
        df.loc[index, 'outcome'] = df.loc[index, 'quantity']
    else:
        previous_outcome = df.loc[index-1, 'outcome']
        # Restart from 0 if the previous cumulative sum was positive
        if previous_outcome > 0:
            previous_outcome = 0

        df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']


#   quantity    outcome
#   -1          -1.0
#   -1          -2.0
#   -1          -3.0
#   -1          -4.0
#   15          11.0 <- since this is greater than 0, next line will start counting from 0
#   -1          -1.0
#   -1          -2.0
#   -1          -3.0
#   -1          -4.0
#   5            1.0 <- since this is greater than 0, next line will start counting from 0
#   -1          -1.0
#   15          14.0 <- since this is greater than 0, next line will start counting from 0
#   -1          -1.0
#   -1          -2.0
#   -1          -3.0


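I'm also not sure whether the "if index == 0" block is the best solution, or whether this could be solved in a more elegant way. Without this block there is an error, because the first row has no "previous row" to use in the calculation.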

Iterating over DataFrame rows is very slow and should be avoided. Working with chunks of data is the way to go with pandas.


For your case, treating your DataFrame column quantity as a numpy array, the code below should speed up the process considerably compared to your approach:

import pandas as pd
import numpy as np

df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])

x = np.array(df.quantity)
y = np.zeros(x.size)

total = 0
for i, xi in enumerate(x):
    total += xi
    y[i] = total
    total = total if total < 0 else 0

df['outcome'] = y



    quantity  outcome
0         -1     -1.0
1         -1     -2.0
2         -1     -3.0
3         -1     -4.0
4         15     11.0
5         -1     -1.0
6         -1     -2.0
7         -1     -3.0
8         -1     -4.0
9          5      1.0
10        -1     -1.0
11        15     14.0
12        -1     -1.0
13        -1     -2.0
14        -1     -3.0
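On the question of whether the "if index == 0" special case can be avoided: the rule is a one-step recurrence (each outcome depends only on the previous outcome and the current quantity), so it can be written as a fold. Below is a minimal sketch using the standard library's `itertools.accumulate`. The names `reset_cumsum` and `step` are made up for illustration, and I have not benchmarked this; it is still a Python-level loop, so I would not expect it to be faster than the numpy version above.

```python
from itertools import accumulate

def reset_cumsum(quantities, lim=0):
    """Cumulative sum that restarts from 0 after any outcome above `lim`.

    `reset_cumsum` and `lim` are illustrative names, not part of pandas.
    """
    def step(previous_outcome, quantity):
        # Carry the previous outcome forward only if it did not exceed `lim`.
        carried = 0 if previous_outcome > lim else previous_outcome
        return carried + quantity

    # accumulate() passes the first element through unchanged, so no
    # special case for the first row is needed.
    return list(accumulate(quantities, step))

quantity = [-1, -1, -1, -1, 15, -1, -1, -1, -1, 5, -1, 15, -1, -1, -1]
outcome = reset_cumsum(quantity)
print(outcome)
```

With a DataFrame, the result could be assigned back with something like `df['outcome'] = reset_cumsum(df['quantity'].tolist())`.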

If you still need more speed, I suggest looking at numba, as per jezrael's answer.

Edit - Performance test

I got curious about performance and wrote this module with all 3 approaches.


"""Performance test of iteration over DataFrame rows.

Methods tested are `DataFrame.iterrows()`, loop over `numpy.array`,
and same using `numba`.
"""
from numba import njit
import pandas as pd
import numpy as np

def pditerrows(df):
    """Iterate over DataFrame using `iterrows`"""

    for index, row in df.iterrows():
        if index == 0:
            df.loc[index, 'outcome'] = df.loc[index, 'quantity']
        else:
            previous_outcome = df.loc[index-1, 'outcome']
            if previous_outcome > 0:
                previous_outcome = 0

            df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']
    return df

def nparray(df):
    """Convert DataFrame column to `numpy` arrays."""

    x = np.array(df.quantity)
    y = np.zeros(x.size)

    total = 0
    for i, xi in enumerate(x):
        total += xi
        y[i] = total
        total = total if total < 0 else 0
    df['outcome'] = y
    return df

@njit
def f(x, lim):
    result = np.empty(len(x))
    result[0] = x[0]

    for i, j in enumerate(x[1:], 1):
        previous_outcome = result[i-1]
        if previous_outcome > lim:
            previous_outcome = 0
        result[i] = previous_outcome + x[i]
    return result

def numbaloop(df):
    """Convert DataFrame to `numpy` arrays and loop using `numba`."""
    df['outcome'] = f(df.quantity.to_numpy(), 0)
    return df

def create_df(size):
    """Create a DataFrame filled with -1's and 15's, with 90% of 
    the entries equal to -1 and 10% equal to 15, randomly 
    placed in the array.
    """
    df = pd.DataFrame(
            np.random.choice(
                (-1, 15), 
                size=size,
                p=[0.9, 0.1]
            ),
            columns=['quantity']
    )
    return df

# Make sure all tests lead to the same result
df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1],
                  columns=['quantity'])
assert nparray(df.copy()).equals(pditerrows(df.copy()))
assert nparray(df.copy()).equals(numbaloop(df.copy()))

Running it for a fairly small array, size = 20_000, gives:

In: import bench_dataframe as bd
 .. df = bd.create_df(size=20_000)

In: %timeit bd.pditerrows(df.copy())
7.06 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In: %timeit bd.nparray(df.copy())
9.76 ms ± 710 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In: %timeit bd.numbaloop(df.copy())
437 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Here the numpy array was over 700 times faster than iterrows(), and numba was still 22 times faster than numpy.

For a larger array, size = 200_000, we get:

In: import bench_dataframe as bd
 .. df = bd.create_df(size=200_000)

In: %timeit bd.pditerrows(df.copy())
I gave up and hit Ctrl+C after 10 minutes or so... =P

In: %timeit bd.nparray(df.copy())
86 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In: %timeit bd.numbaloop(df.copy())
3.15 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

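This again makes numba over 25 times faster than the numpy array, and confirms that you should avoid iterrows() at all costs for anything more than a couple hundred rows.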
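The timings above come from IPython's `%timeit`. Outside IPython, a rough equivalent can be put together with the standard library's `timeit` module. The sketch below is only an illustration under a few assumptions: it times a plain-Python stand-in for the loop (the hypothetical `reset_cumsum_loop`) on a list drawn with the same 90%/10% split as `create_df`, so that it runs without pandas, numpy or numba installed. To time the real functions, the lambda would call e.g. `bd.nparray(df.copy())` instead. Note that it reports the best run rather than the mean ± std. dev. that `%timeit` prints, so the numbers are not directly comparable.

```python
import random
import timeit

def reset_cumsum_loop(quantities):
    """Plain-Python stand-in for the loop in `nparray` (hypothetical helper)."""
    outcome = []
    total = 0
    for q in quantities:
        total += q
        outcome.append(total)
        # Restart from 0 once the running total is no longer negative.
        total = total if total < 0 else 0
    return outcome

random.seed(0)  # fixed seed so repeated runs time the same data
# Same distribution as create_df: 90% of entries are -1, 10% are 15.
data = random.choices((-1, 15), weights=(0.9, 0.1), k=20_000)

# 5 repeats of 10 calls each; keep the best average time per call.
number = 10
timings = timeit.repeat(lambda: reset_cumsum_loop(data), repeat=5, number=number)
best_per_call = min(timings) / number
print(f"best of 5: {best_per_call * 1e3:.2f} ms per call")
```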

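If performance is important, I think numba is the best choice when working with loops:

import numpy as np
from numba import njit

@njit
def f(x, lim):
    # np.int was removed in NumPy 1.24; use an explicit integer dtype
    result = np.empty(len(x), dtype=np.int64)
    result[0] = x[0]

    for i, j in enumerate(x[1:], 1):
        previous_outcome = result[i-1]
        # Restart from 0 if the previous cumulative sum exceeded the limit
        if previous_outcome > lim:
            previous_outcome = 0
        result[i] = previous_outcome + x[i]
    return result

df['outcome1'] = f(df.quantity.to_numpy(), 0)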
    quantity  outcome  outcome1
0         -1     -1.0        -1
1         -1     -2.0        -2
2         -1     -3.0        -3
3         -1     -4.0        -4
4         15     11.0        11
5         -1     -1.0        -1
6         -1     -2.0        -2
7         -1     -3.0        -3
8         -1     -4.0        -4
9          5      1.0         1
10        -1     -1.0        -1
11        15     14.0        14
12        -1     -1.0        -1
13        -1     -2.0        -2
14        -1     -3.0        -3