Modify DataFrame based on previous row (cumulative sum with a condition based on the previous cumulative sum result)
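I have a DataFrame with one column containing numbers (quantity). Each row represents one day, so the whole DataFrame should be treated as sequential data. I want to add a second column that holds the cumulative sum of the quantity column, but if at any point the cumulative sum is greater than 0, the next row should start the cumulative sum again from 0.
I solved this with iterrows(), but I have read that this function is very inefficient, and with millions of rows the calculation takes more than 20 minutes. My solution is below: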
import pandas as pd

df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])

for index, row in df.iterrows():
    if index == 0:
        # first row: there is no previous outcome to build on
        df.loc[index, 'outcome'] = df.loc[index, 'quantity']
    else:
        previous_outcome = df.loc[index-1, 'outcome']
        # restart from 0 if the previous cumulative sum was positive
        if previous_outcome > 0:
            previous_outcome = 0
        df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']

print(df)
# quantity outcome
# -1 -1.0
# -1 -2.0
# -1 -3.0
# -1 -4.0
# 15 11.0 <- since this is greater than 0, next line will start counting from 0
# -1 -1.0
# -1 -2.0
# -1 -3.0
# -1 -4.0
# 5 1.0 <- since this is greater than 0, next line will start counting from 0
# -1 -1.0
# 15 14.0 <- since this is greater than 0, next line will start counting from 0
# -1 -1.0
# -1 -2.0
# -1 -3.0
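Is there a faster (more optimized) way to calculate this?
I am also not sure whether the "if index == 0" block is the best solution, or whether this can be handled more elegantly. Without this block an error is raised, because the first row has no "previous row" to use in the calculation.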
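Iterating over DataFrame rows is very slow and should be avoided. Working on whole chunks of data at once is the pandas way.
In your case, treating the DataFrame column quantity as a numpy array, the code below should speed things up considerably compared to your approach: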
import pandas as pd
import numpy as np

df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])

x = np.array(df.quantity)
y = np.zeros(x.size)
total = 0
for i, xi in enumerate(x):
    total += xi
    y[i] = total
    # keep the running total only while it is negative, otherwise restart from 0
    total = total if total < 0 else 0

df['outcome'] = y
print(df)
Output:
quantity outcome
0 -1 -1.0
1 -1 -2.0
2 -1 -3.0
3 -1 -4.0
4 15 11.0
5 -1 -1.0
6 -1 -2.0
7 -1 -3.0
8 -1 -4.0
9 5 1.0
10 -1 -1.0
11 15 14.0
12 -1 -1.0
13 -1 -2.0
14 -1 -3.0
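On the question of whether the "if index == 0" special case is needed: the loop above already avoids it by carrying a running total that starts at 0. The same recurrence can be sketched with the standard library's `itertools.accumulate`, which takes the first element as the starting value on its own. This is only an illustration; the names `reset_cumsum` and `step` are made up for this example, and I have not timed it against the numpy loop.

```python
from itertools import accumulate


def reset_cumsum(values):
    """Running total that restarts from 0 after any row where it reached 0 or more."""
    def step(total, value):
        # Carry the previous total only while it is still negative.
        return (total if total < 0 else 0) + value

    return list(accumulate(values, step))


quantity = [-1, -1, -1, -1, 15, -1, -1, -1, -1, 5, -1, 15, -1, -1, -1]
outcome = reset_cumsum(quantity)
print(outcome)
# [-1, -2, -3, -4, 11, -1, -2, -3, -4, 1, -1, 14, -1, -2, -3]
```

Resetting when the total is exactly 0 gives the same result as resetting only when it is greater than 0, since a total of 0 reset to 0 is unchanged.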
If you need even more speed, take a look at numba, as suggested by jezrael.
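Edit - performance test
I was curious about the performance, so I put together this module with all 3 approaches.
I have not optimized the individual functions; I just copied the code from the OP and from the numba answer and made a few minor changes.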
"""
bench_dataframe.py
Performance test of iteration over DataFrame rows.
Methods tested are `DataFrame.iterrows()`, loop over `numpy.array`,
and same using `numba`.
"""
from numba import njit
import pandas as pd
import numpy as np

def pditerrows(df):
    """Iterate over DataFrame using `iterrows`"""
    for index, row in df.iterrows():
        if index == 0:
            df.loc[index, 'outcome'] = df.loc[index, 'quantity']
        else:
            previous_outcome = df.loc[index-1, 'outcome']
            if previous_outcome > 0:
                previous_outcome = 0
            df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']
    return df

def nparray(df):
    """Convert DataFrame column to `numpy` arrays."""
    x = np.array(df.quantity)
    y = np.zeros(x.size)
    total = 0
    for i, xi in enumerate(x):
        total += xi
        y[i] = total
        total = total if total < 0 else 0
    df['outcome'] = y
    return df

@njit
def f(x, lim):
    result = np.empty(len(x))
    result[0] = x[0]
    for i, j in enumerate(x[1:], 1):
        previous_outcome = result[i-1]
        if previous_outcome > lim:
            previous_outcome = 0
        result[i] = previous_outcome + x[i]
    return result

def numbaloop(df):
    """Convert DataFrame to `numpy` arrays and loop using `numba`.
    See [
    """
    df['outcome'] = f(df.quantity.to_numpy(), 0)
    return df

def create_df(size):
    """Create a DataFrame filled with -1's and 15's, with 90% of
    the entries equal to -1 and 10% equal to 15, randomly
    placed in the array.
    """
    df = pd.DataFrame(
        np.random.choice(
            (-1, 15),
            size=size,
            p=[0.9, 0.1]
        ),
        columns=['quantity'])
    return df

# Make sure all tests lead to the same result
df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1],
                  columns=['quantity'])
assert nparray(df.copy()).equals(pditerrows(df.copy()))
assert nparray(df.copy()).equals(numbaloop(df.copy()))
Running it for a smallish array, size = 20_000, gives:
In: import bench_dataframe as bd
.. df = bd.create_df(size=20_000)
In: %timeit bd.pditerrows(df.copy())
7.06 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In: %timeit bd.nparray(df.copy())
9.76 ms ± 710 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In: %timeit bd.numbaloop(df.copy())
437 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Here the numpy array is more than 700 times faster than iterrows(), and numba is still about 22 times faster than numpy.
For a larger array, size = 200_000, we get:
In: import bench_dataframe as bd
.. df = bd.create_df(size=200_000)
In: %timeit bd.pditerrows(df.copy())
I gave up and hit Ctrl+C after 10 minutes or so... =P
In: %timeit bd.nparray(df.copy())
86 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In: %timeit bd.numbaloop(df.copy())
3.15 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
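Again numba comes out more than 25 times faster than the numpy array, and this confirms that you should avoid iterrows() at all costs for anything more than a few hundred rows.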
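The timings above come from IPython's `%timeit`. Outside IPython, the standard-library `timeit` module can give a comparable measurement. The following is a minimal sketch under my own assumptions: the helper names `reset_cumsum_loop` and `best_time` are hypothetical, the test data is a fixed pattern instead of the random `create_df` output, and the loop runs on a plain Python list so the example has no third-party dependencies. It reports the best run, whereas `%timeit` reports mean ± standard deviation, so the numbers are not directly comparable to the ones shown above.

```python
import timeit


def reset_cumsum_loop(values):
    """Plain-Python version of the running-total loop."""
    out = []
    total = 0
    for value in values:
        total += value
        out.append(total)
        # Restart from 0 once the running total is no longer negative.
        total = total if total < 0 else 0
    return out


def best_time(func, data, repeat=5, number=10):
    """Return the best per-call time in seconds over `repeat` runs of `number` calls."""
    timings = timeit.repeat(lambda: func(data), repeat=repeat, number=number)
    return min(timings) / number


# Hypothetical test data: mostly -1, with a 15 at every tenth position.
data = [15 if i % 10 == 9 else -1 for i in range(20_000)]
print(f"{best_time(reset_cumsum_loop, data) * 1e3:.3f} ms per call")
```

The same `best_time` helper could be pointed at any of the three functions in the module by passing a function that takes a single argument, for example a lambda that copies the DataFrame before calling it.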
If performance matters, I think numba is the best choice when working with loops:
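import numpy as np
from numba import njit

@njit
def f(x, lim):
    # np.int was removed in NumPy 1.24, so use an explicit integer dtype
    result = np.empty(len(x), dtype=np.int64)
    result[0] = x[0]
    for i, j in enumerate(x[1:], 1):
        previous_outcome = result[i-1]
        # restart from 0 once the previous running total exceeded lim
        if previous_outcome > lim:
            previous_outcome = 0
        result[i] = previous_outcome + x[i]
    return result

df['outcome1'] = f(df.quantity.to_numpy(), 0)
print(df)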
quantity outcome outcome1
0 -1 -1.0 -1
1 -1 -2.0 -2
2 -1 -3.0 -3
3 -1 -4.0 -4
4 15 11.0 11
5 -1 -1.0 -1
6 -1 -2.0 -2
7 -1 -3.0 -3
8 -1 -4.0 -4
9 5 1.0 1
10 -1 -1.0 -1
11 15 14.0 14
12 -1 -1.0 -1
13 -1 -2.0 -2
14 -1 -3.0 -3