Pandas - Sum column of data until value is met, build subset, rinse and repeat for all rows

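Newbie here, but I'm hoping someone can help me write code to break up a large dataframe. I need to do this over a lot of rows (possibly hundreds of thousands), so I'd like to use Pandas to hold all the data in a dataframe. I'm trying to work out the logic on a smaller subset of the data before trying it on the larger dataset, which I will load with dask or with Pandas and chunksize, so it needs to be as memory efficient as possible.

Say I have the following dataframe: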

     a                        b
0   10  random_data_that_I_need
1   23  random_data_that_I_need
2   45  random_data_that_I_need
3   32  random_data_that_I_need
4   15  random_data_that_I_need
5   10  random_data_that_I_need
6   34  random_data_that_I_need
7   65  random_data_that_I_need
8   20  random_data_that_I_need
9   45  random_data_that_I_need
10  11  random_data_that_I_need
11  12  random_data_that_I_need
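
For anyone who wants to reproduce this, the frame above can be rebuilt roughly as follows. This is only a sketch: the values are copied from the table, and the default integer index is an assumption based on how the table is printed.

```python
import pandas as pd

# Values of column "a" copied from the table above.
a_values = [10, 23, 45, 32, 15, 10, 34, 65, 20, 45, 11, 12]

df = pd.DataFrame({
    "a": a_values,
    # Placeholder payload column, identical on every row as in the example.
    "b": ["random_data_that_I_need"] * len(a_values),
})

print(df)
```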

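What I'd like to do is sum up column "a" until a value is met; let's say my target threshold is 50. Once that threshold is met, I want all the rows that got me there to be included as a subset. If adding the next row puts me over, that is OK: because the sum of the previous rows was still under the threshold of 50, that next row should be added and then the process restarts. If I have any leftover rows at the end that don't get me to the threshold, they should be grouped together as a final subset.

So the end result would look like: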

result_df1:
0  10  random_data_that_I_need
1  23  random_data_that_I_need
2  45  random_data_that_I_need

result_df2:
3  32  random_data_that_I_need
4  15  random_data_that_I_need
5  10  random_data_that_I_need

result_df3:
6  34  random_data_that_I_need
7  65  random_data_that_I_need

result_df4:
8  20  random_data_that_I_need
9  45  random_data_that_I_need

result_df5:
10 11  random_data_that_I_need
11 12  random_data_that_I_need

The results don't have to be dataframes... but it would be nice if they were.

One way:

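df_list = []
old_index = 0
while True:
    # True from the first remaining row where the running sum exceeds 50
    m = df.iloc[old_index:, :].a.cumsum().sub(50).gt(0)
    if not m.any():
        break
    # idxmax returns an index label; using it with iloc assumes the default RangeIndex
    index = m.idxmax()
    df_list.append(df.iloc[old_index:index+1])
    old_index = index + 1

# Leftover rows that never reach the threshold, if any
if old_index < len(df):
    df_list.append(df.iloc[old_index:, :])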
Output:
[    a                        b
 0  10  random_data_that_I_need
 1  23  random_data_that_I_need
 2  45  random_data_that_I_need,
     a                        b
 3  32  random_data_that_I_need
 4  15  random_data_that_I_need
 5  10  random_data_that_I_need,
     a                        b
 6  34  random_data_that_I_need
 7  65  random_data_that_I_need,
     a                        b
 8  20  random_data_that_I_need
 9  45  random_data_that_I_need,
      a                        b
 10  11  random_data_that_I_need
 11  12  random_data_that_I_need]
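
A result like the one above can be sanity-checked against the rule in the question. The sketch below is not part of the original answer: `check_groups` and `bounds` are hypothetical names, and the split positions are hard-coded from the output shown above rather than computed.

```python
import pandas as pd

def check_groups(df, groups, threshold=50):
    """Raise AssertionError if the groups break the rule from the question."""
    # Concatenating the pieces in order should give back the original frame.
    assert pd.concat(groups).equals(df)
    for position, group in enumerate(groups):
        is_last = position == len(groups) - 1
        # Every group except a possible leftover tail must exceed the threshold...
        assert group["a"].sum() > threshold or is_last
        # ...and must not have exceeded it before its final row.
        assert group["a"].iloc[:-1].sum() <= threshold

df = pd.DataFrame({
    "a": [10, 23, 45, 32, 15, 10, 34, 65, 20, 45, 11, 12],
    "b": ["random_data_that_I_need"] * 12,
})

# Split positions taken from the output above (assumed, not computed here).
bounds = [0, 3, 6, 8, 10, 12]
df_list = [df.iloc[start:stop] for start, stop in zip(bounds, bounds[1:])]

check_groups(df, df_list)  # passes silently when the split follows the rule
```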
Alternative:
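sums = 0
df_list = []
old_index = 0
for index, i in enumerate(df.a):
    sums += i
    if sums > 50:
        # Close the current group at this row and restart the running sum
        df_list.append(df[old_index:index+1])
        old_index = index + 1
        sums = 0
# Leftover rows that never reach the threshold, if any
if old_index < len(df):
    df_list.append(df[old_index:])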
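
Another approach, growing each group with pd.concat:

import pandas as pd

list_of_df = []
current_df = df.iloc[0:1]
for idx in range(1, df.shape[0]):
    # Note: a group is closed here once its sum reaches 50 (>= 50, not > 50)
    if current_df['a'].sum() < 50:
        current_df = pd.concat([current_df, df.iloc[idx:idx+1]])
    else:
        list_of_df.append(current_df)
        current_df = df.iloc[idx:idx+1]
# Append the group still being built when the rows run out
list_of_df.append(current_df)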

To get a dataframe, just pull it from the list like this:

# get the first dataframe 
list_of_df[0]

# or if you want to output all dataframes to the console like your example:
for dataframe in list_of_df:
    print(dataframe)
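
Since memory efficiency was a concern, here is one more sketch that computes a group id per row in a single pass and lets `groupby` do the slicing, instead of building frames row by row. It is not taken from the answers above: `threshold_group_ids` is a hypothetical helper, and it assumes a group closes once its running sum exceeds the threshold (strictly greater than 50).

```python
import numpy as np
import pandas as pd

def threshold_group_ids(values, threshold=50):
    """Return one integer group id per value.

    A group is closed by the row that pushes its running sum above threshold.
    """
    ids = np.empty(len(values), dtype=np.int64)
    running, group = 0, 0
    for pos, value in enumerate(values):
        ids[pos] = group
        running += value
        if running > threshold:
            # This row closes the group; the next row starts a new one.
            group += 1
            running = 0
    return ids

df = pd.DataFrame({
    "a": [10, 23, 45, 32, 15, 10, 34, 65, 20, 45, 11, 12],
    "b": ["random_data_that_I_need"] * 12,
})

ids = threshold_group_ids(df["a"].to_numpy(), threshold=50)
# groupby accepts an array of keys with one entry per row.
groups = [chunk for _, chunk in df.groupby(ids)]

print([chunk["a"].tolist() for chunk in groups])
# [[10, 23, 45], [32, 15, 10], [34, 65], [20, 45], [11, 12]]
```

The loop only touches a NumPy array of the "a" values, so the other columns are not copied until `groupby` splits the frame.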