Modify DataFrame based on previous row (cumulative sum with condition based on previous cumulative sum result)

Question

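I have a dataframe with one column containing numbers (quantity). Each row represents one day, so the whole dataframe should be treated as sequential data. I want to add a second column that computes the cumulative sum of the quantity column, but whenever the cumulative sum becomes greater than 0, the next row should start counting the cumulative sum from 0 again.

I solved this using iterrows(), but I have read that this function is very inefficient, and with millions of rows the calculation takes more than 20 minutes. My solution is below: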

import pandas as pd

df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])


for index, row in df.iterrows():
    if index == 0:
        df.loc[index, 'outcome'] = df.loc[index, 'quantity']
    else:
        previous_outcome = df.loc[index-1, 'outcome'] 
        if previous_outcome > 0:
            previous_outcome = 0

        df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']

print(df)

#   quantity    outcome
#   -1          -1.0
#   -1          -2.0
#   -1          -3.0
#   -1          -4.0
#   15          11.0 <- since this is greater than 0, next line will start counting from 0
#   -1          -1.0
#   -1          -2.0
#   -1          -3.0
#   -1          -4.0
#   5            1.0 <- since this is greater than 0, next line will start counting from 0
#   -1          -1.0
#   15          14.0 <- since this is greater than 0, next line will start counting from 0
#   -1          -1.0
#   -1          -2.0
#   -1          -3.0

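Is there a faster (more optimized) way to calculate this?

I am also not sure whether the "if index == 0" block is the best solution. Can this be solved in a more elegant way? Without this block an error is raised, because the first row has no "previous row" to use in the calculation.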

Answer 1

Iterating over DataFrame rows is very slow and should be avoided. Working with chunks of data is the pandas way.

In your case, treating the DataFrame column quantity as a numpy array, the code below should speed up the process considerably compared to your approach:

import pandas as pd
import numpy as np

df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])

x = np.array(df.quantity)
y = np.zeros(x.size)

total = 0
for i, xi in enumerate(x):
    total += xi
    y[i] = total
    total = total if total < 0 else 0

df['outcome'] = y

print(df)

Output:

    quantity  outcome
0         -1     -1.0
1         -1     -2.0
2         -1     -3.0
3         -1     -4.0
4         15     11.0
5         -1     -1.0
6         -1     -2.0
7         -1     -3.0
8         -1     -4.0
9          5      1.0
10        -1     -1.0
11        15     14.0
12        -1     -1.0
13        -1     -2.0
14        -1     -3.0
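
On the "if index == 0" part of the question: the loop above needs no special case for the first row, because total starts at 0. As a minimal sketch of the same recurrence in plain Python (the helper name reset_cumsum and its lim parameter are my own illustrative names, not part of pandas or numpy), itertools.accumulate passes the first element through unchanged and then applies the rule to each following value:

```python
from itertools import accumulate


def reset_cumsum(values, lim=0):
    """Running sum that restarts from 0 after any total greater than `lim`.

    `reset_cumsum` and `lim` are illustrative names, not a library API.
    """
    def step(previous_total, value):
        # Carry the previous total only while it has not exceeded the limit.
        carried = previous_total if previous_total <= lim else 0
        return carried + value

    # accumulate() yields the first value as-is, so no index == 0 branch is needed.
    return list(accumulate(values, step))


quantity = [-1, -1, -1, -1, 15, -1, -1, -1, -1, 5, -1, 15, -1, -1, -1]
outcome = reset_cumsum(quantity)
print(outcome)
# With a DataFrame this could be assigned as:
# df['outcome'] = reset_cumsum(df['quantity'])
```

This is still a Python-level loop, so it should not be expected to beat numba. It mainly shows that the first row does not need special handling.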

If you need even more speed, consider numba, as suggested in jezrael's answer.

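Edit - Performance test

I was curious about the performance, so I put together this module with all 3 approaches.

I did not optimize the individual functions; I just copied the code from the OP and from the numba answer and made a few minor changes.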

"""
bench_dataframe.py
Performance test of iteration over DataFrame rows.

Methods tested are `DataFrame.iterrows()`, loop over `numpy.array`,
and same using `numba`.
"""
from numba import njit
import pandas as pd
import numpy as np


def pditerrows(df):
    """Iterate over DataFrame using `iterrows`"""

    for index, row in df.iterrows():
        if index == 0:
            df.loc[index, 'outcome'] = df.loc[index, 'quantity']
        else:
            previous_outcome = df.loc[index-1, 'outcome'] 
            if previous_outcome > 0:
                previous_outcome = 0

            df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']
            
    return df


def nparray(df):
    """Convert DataFrame column to `numpy` arrays."""

    x = np.array(df.quantity)
    y = np.zeros(x.size)

    total = 0
    for i, xi in enumerate(x):
        total += xi
        y[i] = total
        total = total if total < 0 else 0
    
    df['outcome'] = y
    
    return df


@njit
def f(x, lim):
    result = np.empty(len(x))
    result[0] = x[0]

    for i, j in enumerate(x[1:], 1):
        previous_outcome = result[i-1]
        if previous_outcome > lim:
            previous_outcome = 0
        result[i] = previous_outcome + x[i]
    return result

def numbaloop(df):
    """Convert DataFrame to `numpy` arrays and loop using `numba`."""
    df['outcome'] = f(df.quantity.to_numpy(), 0)
    return df

def create_df(size):
    """Create a DataFrame filled with -1's and 15's, with 90% of
    the entries equal to -1 and 10% equal to 15, randomly
    placed in the array.
    """
    df = pd.DataFrame(
            np.random.choice(
                (-1, 15), 
                size=size, 
                p=[0.9, 0.1]
            ),
            columns=['quantity'])
    return df


# Make sure all tests lead to the same result
df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1],
                  columns=['quantity'])
assert nparray(df.copy()).equals(pditerrows(df.copy()))
assert nparray(df.copy()).equals(numbaloop(df.copy()))

Running it for a fairly small array, size = 20_000, gives:

In: import bench_dataframe as bd
 .. df = bd.create_df(size=20_000)

In: %timeit bd.pditerrows(df.copy())
7.06 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In: %timeit bd.nparray(df.copy())
9.76 ms ± 710 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In: %timeit bd.numbaloop(df.copy())
437 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Here the numpy array loop is more than 700 times faster than iterrows(), and numba is another 22 times faster than numpy.

For a larger array, size = 200_000, we get:

In: import bench_dataframe as bd
 .. df = bd.create_df(size=200_000)

In: %timeit bd.pditerrows(df.copy())
I gave up and hit Ctrl+C after 10 minutes or so... =P

In: %timeit bd.nparray(df.copy())
86 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In: %timeit bd.numbaloop(df.copy())
3.15 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

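Again, numba is more than 25 times faster than the numpy array loop, which confirms that you should avoid iterrows() at all costs for anything beyond a few hundred rows.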
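
Note that %timeit is an IPython magic. Outside IPython, a comparable measurement can be sketched with the standard-library timeit module. The helper names make_quantities, loop_cumsum and best_time below are hypothetical, and the plain-Python loop only stands in for the functions in bench_dataframe.py, so absolute numbers will differ from the ones above:

```python
import random
import timeit


def make_quantities(size, seed=0):
    """Plain-list analogue of create_df: about 90% -1 and 10% 15."""
    rng = random.Random(seed)
    return rng.choices((-1, 15), weights=(0.9, 0.1), k=size)


def loop_cumsum(values):
    """Running sum that restarts from 0 after any positive total."""
    result = []
    total = 0
    for value in values:
        total += value
        result.append(total)
        total = total if total < 0 else 0
    return result


def best_time(func, *args, repeat=7, number=10):
    """Best average seconds per call over `repeat` runs of `number` calls.

    Unlike %timeit, this reports the minimum rather than mean and std. dev.
    """
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number


data = make_quantities(20_000)
seconds = best_time(loop_cumsum, data)
print(f"loop_cumsum: {seconds * 1e3:.3f} ms per call")
```

To time the functions from the module instead, something like best_time(bd.nparray, df.copy()) should work in the same way, assuming the module is importable as bd.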

Answer 2

If performance matters, I think numba is the best option when a loop is needed:

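from numba import njit
import numpy as np


@njit
def f(x, lim):
    # np.int was removed in NumPy 1.24; use an explicit integer dtype instead
    result = np.empty(len(x), dtype=np.int64)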
    result[0] = x[0]

    for i, j in enumerate(x[1:], 1):
        previous_outcome = result[i-1]
        if previous_outcome > lim:
            previous_outcome = 0
        result[i] = previous_outcome + x[i]
    return result

df['outcome1'] = f(df.quantity.to_numpy(), 0)
print(df)
    quantity  outcome  outcome1
0         -1     -1.0        -1
1         -1     -2.0        -2
2         -1     -3.0        -3
3         -1     -4.0        -4
4         15     11.0        11
5         -1     -1.0        -1
6         -1     -2.0        -2
7         -1     -3.0        -3
8         -1     -4.0        -4
9          5      1.0         1
10        -1     -1.0        -1
11        15     14.0        14
12        -1     -1.0        -1
13        -1     -2.0        -2
14        -1     -3.0        -3
