Pandas dataframe vectorized bucketing/aggregation?

Question

Task

I have a dataframe that looks like this:

date	money_spent ($)	meals_eaten	weight
2021-01-01 10:00:00	350	5	140
2021-01-02 18:00:00	250	2	170
2021-01-03 12:10:00	200	3	160
2021-01-04 19:40:00	100	1	150

I would like to discretize it so that the rows are "cut" every $X. I want to know some statistics for every $X I spend.

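So if I were to use $500 as the threshold, the first two rows would fall into the first cut, and I could aggregate the remaining columns as follows: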

first date of the cut
average meals_eaten
minimum weight
maximum weight

So the final table would have two rows, like this:

date	cumulative_spent ($)	meals_eaten	min_weight	max_weight
2021-01-01 10:00:00	600	3.5	140	170
2021-01-03 12:10:00	300	2	150	160

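My approach:

My first instinct is to calculate the cumsum() of money_spent (assume the data is sorted by date), then use pd.cut() to create a new column, call it spent_bin, that determines each row's bin.

Note: in this toy example, spent_bin would basically be [0,500] for the first two rows and (500,1000] for the last two.

Then it is fairly simple: I do a groupby on spent_bin and aggregate as follows: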

.agg({
    'date':'first', 
    'meals_eaten':'mean', 
    'weight': ['min', 'max']
})

What I've tried

import pandas as pd


rows = [
{"date":"2021-01-01 10:00:00","money_spent":350, "meals_eaten":5, "weight":140},
{"date":"2021-01-02 18:00:00","money_spent":250, "meals_eaten":2, "weight":170},
{"date":"2021-01-03 12:10:00","money_spent":200, "meals_eaten":3, "weight":160},
{"date":"2021-01-05 22:07:00","money_spent":100, "meals_eaten":1, "weight":150}]

df = pd.DataFrame.from_dict(rows)
df['date'] = pd.to_datetime(df.date)
df['cum_spent'] = df.money_spent.cumsum()

print(df)
print(pd.cut(df.cum_spent, 500))

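For some reason I can't get the cut step to work. The above is my toy code. The labels don't come out as the clean [0,500], (500,1000]. Honestly, I would settle for [350,500], (500,800] (the actual cumulative-sum values at the edges of each cut), but I can't get even that, although I am doing the same thing as the documentation example. Any help?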
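A likely cause of the messy labels, sketched below on the toy numbers: when the second argument of pd.cut() is an integer, pandas treats it as the number of equal-width bins, not as a bin width. Passing an explicit list of edges gives the clean labels. The edge list [0, 500, 1000] is an assumption chosen for this toy data.

```python
import pandas as pd

cum_spent = pd.Series([350, 250, 200, 100]).cumsum()  # 350, 600, 800, 900

# An integer means "this many equal-width bins" between min and max.
many_bins = pd.cut(cum_spent, 500)
print(len(many_bins.cat.categories))  # number of intervals created

# A list of edges means one interval per consecutive pair of edges.
spent_bin = pd.cut(cum_spent, bins=[0, 500, 1000])
print(spent_bin.tolist())
```

Note that even with clean edges, only the first row lands in the first interval here, because the cumulative sum after the second row (600) already exceeds 500. Fixed edges on the cumulative sum therefore do not, by themselves, keep the row that crosses the threshold in the bin it closes.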

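Caveats and difficulties:

Writing this as a for loop is easy, of course: just while cum_spent < 500:. The problem is that my actual dataset has millions of rows, and processing a single df this way currently takes 20 minutes.

There is also a small wrinkle: sometimes a row straddles the interval boundary. When that happens, I want that last row to be included. This shows up in the toy example, where row #2 actually ends the bin at $600 rather than $500. But it is the first row that reaches or exceeds $500, so I include it in the first bin.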
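To make the rule concrete, here is a minimal loop-based sketch of it, compared with fixed $500 edges on the plain cumulative sum. The function name reset_buckets and the four-row spend array are made up for illustration. The two labelings agree on the toy data but diverge once an overshoot has to be discarded, which is why the running total needs to be reset row by row.

```python
import numpy as np

def reset_buckets(values, limit):
    """Label each row with a bucket id; the row that reaches the limit closes its bucket."""
    total, bucket, ids = 0, 0, []
    for v in values:
        ids.append(bucket)      # the crossing row still belongs to the current bucket
        total += v
        if total >= limit:      # bucket is full: start a new one and drop the overshoot
            bucket += 1
            total = 0
    return ids

spend = np.array([400, 400, 400, 400])
cum = spend.cumsum()

# Fixed edges: how many multiples of 500 were reached before each row.
fixed = (np.concatenate(([0], cum[:-1])) // 500).tolist()
reset = reset_buckets(spend, 500)

print(fixed)
print(reset)
```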

Answer 1

A custom function that implements a cumsum with a reset limit:

from numba import njit

@njit
def cumli(x, lim):
    # Running total that resets once it reaches lim.
    # Emits 1 on the row that closes a bucket, 0 otherwise.
    total = 0
    result = []
    for y in x:
        check = 0
        total += y
        if total >= lim:
            total = 0
            check = 1
        result.append(check)
    return result

# Column name as in the question's table; the question's toy code calls it 'money_spent'.
df['new'] = cumli(df['money_spent ($)'].values, 500)
# Reverse cumsum of the flags, so the closing row shares a key with the rows before it.
out = df.groupby(df.new.iloc[::-1].cumsum()).agg(
    date = ('date','first'),
    meals_eaten = ('meals_eaten','mean'),
    min_weight = ('weight','min'),
    max_weight = ('weight','max')).sort_index(ascending=False)
Out[81]: 
            date  meals_eaten  min_weight  max_weight
new                                                  
1    2021-01-01           3.5         140         170
0    2021-01-03           2.0         150         160
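For checking the logic without installing numba, the following sketch runs the same pipeline on the toy data with a pure-Python stand-in for cumli. The names reset_flags and bucket are assumptions made for this example. It also departs from the answer in two labeled ways: it builds the group key with a forward shift and cumsum instead of the reversed cumsum, and it adds the cumulative_spent column from the question's desired table.

```python
import pandas as pd

def reset_flags(values, limit):
    """Pure-Python stand-in for cumli: 1 marks the row that closes a bucket."""
    total, flags = 0, []
    for v in values:
        total += v
        closed = total >= limit
        flags.append(int(closed))
        if closed:
            total = 0
    return flags

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-01-01 10:00:00", "2021-01-02 18:00:00",
        "2021-01-03 12:10:00", "2021-01-04 19:40:00",
    ]),
    "money_spent": [350, 250, 200, 100],
    "meals_eaten": [5, 2, 3, 1],
    "weight": [140, 170, 160, 150],
})

flags = pd.Series(reset_flags(df["money_spent"].to_numpy(), 500), index=df.index)
# Shift by one so the closing row stays in the bucket it closes,
# then count how many buckets were closed before each row.
bucket = flags.shift(fill_value=0).cumsum().rename("bucket")

out = df.groupby(bucket).agg(
    date=("date", "first"),
    cumulative_spent=("money_spent", "sum"),
    meals_eaten=("meals_eaten", "mean"),
    min_weight=("weight", "min"),
    max_weight=("weight", "max"),
)
print(out)
```

With the forward key the buckets come out in chronological order, so no final sort is needed. On millions of rows the Python loop would be slow, which is the reason the answer compiles it with numba.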
