Pandas - Sum column of data until value is met, build subset, rinse and repeat for all rows
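Newbie here, hoping someone can help me with code to break up a large dataframe. I need to do this on a lot of rows, possibly hundreds of thousands, so I figured I'd use Pandas to get all of the data into a dataframe. I'm trying to work out the logic on a smaller subset of data before running it on the larger data set, which I will bring in with either dask or Pandas with chunksize, so it needs to be as memory efficient as possible.
Let's say I have the following dataframe: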
a b
0 10 random_data_that_I_need
1 23 random_data_that_I_need
2 45 random_data_that_I_need
3 32 random_data_that_I_need
4 15 random_data_that_I_need
5 10 random_data_that_I_need
6 34 random_data_that_I_need
7 65 random_data_that_I_need
8 20 random_data_that_I_need
9 45 random_data_that_I_need
10 11 random_data_that_I_need
11 12 random_data_that_I_need
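For anyone who wants to run the snippets in this thread, the sample frame above can be rebuilt roughly like this. A minimal sketch: the column names `a` and `b` come from the question, and a default integer index is assumed.

```python
import pandas as pd

# Values of column "a" copied from the table above
a_values = [10, 23, 45, 32, 15, 10, 34, 65, 20, 45, 11, 12]

df = pd.DataFrame({
    "a": a_values,
    "b": ["random_data_that_I_need"] * len(a_values),
})
```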
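What I'd like to do is sum up column "a" until a value is met; let's say my target threshold is 50. Once the threshold is met, I'd like all of the rows that got me there to be included as a subset. If adding the next row puts me over, that's okay: since the sum of the previous rows was still below the threshold of 50, it should add the next row and then restart the process. If I have any leftover rows at the end that don't get me to the threshold, just lump them together.
So the end result would look like: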
result_df1:
0 10 random_data_that_I_need
1 23 random_data_that_I_need
2 45 random_data_that_I_need
result_df2:
3 32 random_data_that_I_need
4 15 random_data_that_I_need
5 10 random_data_that_I_need
result_df3:
6 34 random_data_that_I_need
7 65 random_data_that_I_need
result_df4:
8 20 random_data_that_I_need
9 45 random_data_that_I_need
result_df5:
10 11 random_data_that_I_need
11 12 random_data_that_I_need
The result doesn't necessarily have to be dataframes, but it would be nice if it were.
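To pin the rule down before touching pandas, here is the same split on the bare values of column `a`. This is only a sketch of my reading of the question: a subset closes as soon as its running total goes over the threshold (strictly greater than 50 is an assumption; use `>=` if hitting exactly 50 should also close a subset). The function name `split_positions` is made up for illustration.

```python
def split_positions(values, threshold=50):
    """Return (start, stop) half-open position pairs, one per subset."""
    bounds = []
    start = 0      # position where the current subset begins
    running = 0    # running total of the current subset
    for pos, value in enumerate(values):
        running += value
        if running > threshold:
            # the row that pushed the total over is included in the subset
            bounds.append((start, pos + 1))
            start = pos + 1
            running = 0
    if start < len(values):
        # leftover rows that never got over the threshold
        bounds.append((start, len(values)))
    return bounds

a_values = [10, 23, 45, 32, 15, 10, 34, 65, 20, 45, 11, 12]
print(split_positions(a_values))
# [(0, 3), (3, 6), (6, 8), (8, 10), (10, 12)]
```

Each pair can be passed straight to `df.iloc[start:stop]` to get the matching rows.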
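One way:
threshold = 50
df_list = []
old_index = 0
while old_index < len(df):
    # True from the first row where the running total of the remaining rows exceeds the threshold
    m = df["a"].iloc[old_index:].cumsum().gt(threshold)
    if not m.any():
        break
    # position of that first True, counted from the start of df
    index = old_index + m.to_numpy().argmax()
    df_list.append(df.iloc[old_index:index + 1])
    old_index = index + 1
# leftover rows that never get over the threshold (skipped if there are none)
if old_index < len(df):
    df_list.append(df.iloc[old_index:])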
Output:
[ a b
0 10 random_data_that_I_need
1 23 random_data_that_I_need
2 45 random_data_that_I_need,
a b
3 32 random_data_that_I_need
4 15 random_data_that_I_need
5 10 random_data_that_I_need,
a b
6 34 random_data_that_I_need
7 65 random_data_that_I_need,
a b
8 20 random_data_that_I_need
9 45 random_data_that_I_need,
a b
10 11 random_data_that_I_need
11 12 random_data_that_I_need]
Alternative:
sums = 0
df_list = []
old_index = 0
for index, i in enumerate(df.a):
    sums += i
    if sums > 50:
        # close the subset, including the row that pushed the total over 50
        df_list.append(df[old_index:index+1])
        old_index = index + 1
        sums = 0
# leftover rows
df_list.append(df[old_index:])
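If you'd rather get labels than slices, a variant of the loop above is to compute a subset id per row and hand it to `groupby`. A sketch, not a benchmarked solution: the helper name `threshold_groups` and the strict `>` comparison are my assumptions. Because the running total resets after each subset, it can't be written as one plain `cumsum`, hence the loop over the NumPy values.

```python
import numpy as np
import pandas as pd

def threshold_groups(values, threshold=50):
    """Give each value a subset id; a subset ends once its total exceeds threshold."""
    ids = np.empty(len(values), dtype=np.int64)
    group = 0
    running = 0
    for pos, value in enumerate(values):
        ids[pos] = group
        running += value
        if running > threshold:
            # this row closes the subset; the next row starts a new one
            group += 1
            running = 0
    return ids

df = pd.DataFrame({
    "a": [10, 23, 45, 32, 15, 10, 34, 65, 20, 45, 11, 12],
    "b": ["random_data_that_I_need"] * 12,
})
ids = threshold_groups(df["a"].to_numpy())
# one dataframe per subset id, in order
df_list = [part for _, part in df.groupby(ids)]
```

Keeping the ids around also makes per-subset aggregates easy, for example `df.groupby(ids)["a"].sum()`.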
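Another option, growing each subset one row at a time with pd.concat:
import pandas as pd

list_of_df = []
current_df = df.iloc[0:1]
for idx in range(1, df.shape[0]):
    # keep adding rows while the subset total is still below the threshold
    if current_df['a'].sum() < 50:
        current_df = pd.concat([current_df, df.iloc[idx:idx+1]])
    else:
        list_of_df.append(current_df)
        current_df = df.iloc[idx:idx+1]
    # on the last row, store whatever subset is still open
    if idx == df.shape[0]-1:
        list_of_df.append(current_df)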
To get a dataframe, just index it from the list like so:
# get the first dataframe
list_of_df[0]
# or if you want to output all dataframes to the console like your example:
for dataframe in list_of_df:
    print(dataframe)
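On the memory question: if the data arrives in pieces (for example from `pd.read_csv` with `chunksize`), the subsets can be yielded as they complete, carrying unfinished rows over to the next chunk. A rough sketch under assumptions: the generator name `iter_subsets` is hypothetical, the column is called `a`, and a subset closes when its total is strictly over the threshold. The demo cuts an in-memory frame into chunks instead of reading a file.

```python
import pandas as pd

def iter_subsets(chunks, column="a", threshold=50):
    """Yield subsets one at a time from an iterable of DataFrame chunks."""
    pending = []   # pieces of the subset still being built
    running = 0    # running total of `column` for that subset
    for chunk in chunks:
        start = 0
        for pos, value in enumerate(chunk[column].to_numpy()):
            running += value
            if running > threshold:
                pending.append(chunk.iloc[start:pos + 1])
                yield pd.concat(pending)
                pending, running, start = [], 0, pos + 1
        if start < len(chunk):
            # unfinished rows carry over into the next chunk
            pending.append(chunk.iloc[start:])
    if pending:
        # leftover rows that never got over the threshold
        yield pd.concat(pending)

df = pd.DataFrame({
    "a": [10, 23, 45, 32, 15, 10, 34, 65, 20, 45, 11, 12],
    "b": ["random_data_that_I_need"] * 12,
})
# Demo: pretend the data arrives five rows at a time
chunks = (df.iloc[i:i + 5] for i in range(0, len(df), 5))
subsets = list(iter_subsets(chunks))
```

Only the rows of the subset in progress and the current chunk are held at once, so each subset can be processed or written out as it is yielded instead of being collected in a list as the demo does.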