How to use the summed values of a column in a multi-level indexed pandas dataframe as a condition for values in a new column
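I have a multi-level indexed pandas dataframe. I want to create a new column whose values are based on a condition. The condition is based on summing another column for that index and then halving it. If this is less than the last value stored in a separate list, the new column takes the same value as another column in the dataframe. If the condition is not met, all the values in the new column should be 0.
Following a related question, I tried a combination of np.where and df.sum(level=0, axis=1), but this results in the following error:
ValueError: operands could not be broadcast together with shapes (2,8) (21,) ()
Here is an example of my dataframe and the code I am currently using: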
import pandas as pd
import numpy as np
balance = [1400]
data = {'EVENT_ID': [112335580,112335580,112335580,112335580,112335580,112335580,112335580,112335580, 112335582,
112335582,112335582,112335582,112335582,112335582,112335582,112335582,112335582,112335582,
112335582,112335582,112335582],
'SELECTION_ID': [6356576,2554439,2503211,6297034,4233251,2522967,5284417,7660920,8112876,7546023,8175276,8145908,
8175274,7300754,8065540,8175275,8106158,8086265,2291406,8065533,8125015],
'Pot_Bet': [3.236731,2.416966,2.278365,2.264023,2.225353,2.174407, 2.141420,2.122386,2.832997,2.411094,
2.167218,2.138972,2.132137,2.128341,2.116338,2.115239,2.115123,2.114284362,2.113420,
2.113186,2.112729],
'Liability':[3.236731, 2.416966, 12.245492, 12.795112, 15.079176, 23.336171, 50.741182, 571.003118, 2.832997, 6.691736, 15.808607, 27.935834, 35.954927, 43.275250, 147.165537, 193.017915, 199.622454, 265.809019, 405.808678, 473.926781, 706.332594]}
df = pd.DataFrame(data, columns=['EVENT_ID', 'SELECTION_ID', 'Pot_Bet', 'Liability'])
df.set_index(['EVENT_ID', 'SELECTION_ID'], inplace=True) #Selecting columns for indexing
df['Bet'] = np.where(df.sum(level = 0) > 0.5*balance[-1], df['Pot_Bet'], 0)
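This results in the error stated above.
For index 112335580, the values in the new column should be the same as 'Pot_Bet', whereas for index 112335582 the values in the new column should be 0.
Cheers,
Sandy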
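The problem is that df.sum(level=0) is the same as df.groupby(level=0).sum(): it aggregates by the first level of the MultiIndex, so the result has one row per EVENT_ID instead of one row per original row.
The solution is to use GroupBy.transform on a Series, which returns a result of the same size as the original DataFrame: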
df['Bet'] = np.where(df.groupby(level = 0)['Pot_Bet'].transform('sum') > 0.5*balance[-1],
df['Pot_Bet'], 0)
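The difference between aggregating and transforming can be sketched on a small frame. The data and the threshold below are hypothetical and chosen only to keep the shapes easy to follow; they are not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two events with 2 and 3 selections.
toy = pd.DataFrame(
    {
        "EVENT_ID": [1, 1, 2, 2, 2],
        "SELECTION_ID": [10, 11, 20, 21, 22],
        "Pot_Bet": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
).set_index(["EVENT_ID", "SELECTION_ID"])

grouped = toy.groupby(level=0)["Pot_Bet"]

aggregated = grouped.sum()              # one value per EVENT_ID
broadcasted = grouped.transform("sum")  # one value per original row

print(aggregated.shape)   # (2,)
print(broadcasted.shape)  # (5,)

# The transformed Series has one entry per row of toy["Pot_Bet"],
# so np.where can compare and select element-wise.
threshold = 5.0  # hypothetical threshold
toy["Bet"] = np.where(broadcasted > threshold, toy["Pot_Bet"], 0)
print(toy)
```

The aggregated result has one entry per event, which is why it cannot be combined with a per-row column inside np.where, while the transformed result lines up with every row.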
Details:
print (df.groupby(level = 0)['Pot_Bet'].transform('sum'))
EVENT_ID   SELECTION_ID
112335580  6356576         18.859651
           2554439         18.859651
           2503211         18.859651
           6297034         18.859651
           4233251         18.859651
           2522967         18.859651
           5284417         18.859651
           7660920         18.859651
112335582  8112876         28.611078
           7546023         28.611078
           8175276         28.611078
           8145908         28.611078
           8175274         28.611078
           7300754         28.611078
           8065540         28.611078
           8175275         28.611078
           8106158         28.611078
           8086265         28.611078
           2291406         28.611078
           8065533         28.611078
           8125015         28.611078
Name: Pot_Bet, dtype: float64
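If you need only one column, you can select it as a Series by its column name before aggregating. Note that the level argument of sum was removed in pandas 2.0, so the first form below runs only on older versions, while the groupby form works on both: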
print (df['Pot_Bet'].sum(level=0))
EVENT_ID
112335580 18.859651
112335582 28.611078
Name: Pot_Bet, dtype: float64
print (df.groupby(level = 0)['Pot_Bet'].sum())
EVENT_ID
112335580 18.859651
112335582 28.611078
Name: Pot_Bet, dtype: float64
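With the sample data, the 'Pot_Bet' totals (18.86 and 28.61) are both far below 0.5*balance[-1] = 700, so the comparison above yields 0 for every row. One reading that reproduces the result described in the question is to compare each event's total 'Liability' with half the balance using <. This is an assumption about the intent, not something the question states outright. A minimal sketch under that assumption, written with groupby so that it does not rely on the level argument of sum:

```python
import numpy as np
import pandas as pd

balance = [1400]
data = {
    "EVENT_ID": [112335580] * 8 + [112335582] * 13,
    "SELECTION_ID": [6356576, 2554439, 2503211, 6297034, 4233251, 2522967,
                     5284417, 7660920, 8112876, 7546023, 8175276, 8145908,
                     8175274, 7300754, 8065540, 8175275, 8106158, 8086265,
                     2291406, 8065533, 8125015],
    "Pot_Bet": [3.236731, 2.416966, 2.278365, 2.264023, 2.225353, 2.174407,
                2.141420, 2.122386, 2.832997, 2.411094, 2.167218, 2.138972,
                2.132137, 2.128341, 2.116338, 2.115239, 2.115123, 2.114284362,
                2.113420, 2.113186, 2.112729],
    "Liability": [3.236731, 2.416966, 12.245492, 12.795112, 15.079176,
                  23.336171, 50.741182, 571.003118, 2.832997, 6.691736,
                  15.808607, 27.935834, 35.954927, 43.275250, 147.165537,
                  193.017915, 199.622454, 265.809019, 405.808678, 473.926781,
                  706.332594],
}
df = pd.DataFrame(data).set_index(["EVENT_ID", "SELECTION_ID"])

# Assumption: the condition compares each event's total Liability
# with half of the most recent balance.
liability_total = df.groupby(level="EVENT_ID")["Liability"].transform("sum")
df["Bet"] = np.where(liability_total < 0.5 * balance[-1], df["Pot_Bet"], 0)
print(df)
```

Under this assumption, event 112335580 (total Liability about 690.85) keeps its 'Pot_Bet' values, and event 112335582 (total well above 700) gets 0, which matches the outcome the question asks for.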