Pandas bin and count
I'm new to Pandas, so please don't be too harsh ;) Suppose my initial data frame looks like this:
import numpy as np
import pandas as pd

#::: initialize dictionary
np.random.seed(0)
d = {}
d['size'] = 2 * np.random.randn(100) + 3
d['flag_A'] = np.random.randint(0,2,100).astype(bool)
d['flag_B'] = np.random.randint(0,2,100).astype(bool)
d['flag_C'] = np.random.randint(0,2,100).astype(bool)
#::: convert dictionary into pandas dataframe
df = pd.DataFrame(d)
I now bin the data frame according to 'size':
#::: bin pandas dataframe per size
bins = np.arange(0,10,1)
groups = df.groupby( pd.cut( df['size'], bins ) )
This results in the following output:
---
(0, 1]
flag_A flag_B flag_C size
25 False False True 0.091269
40 True True True 0.902894
41 True True True 0.159964
46 False True True 0.494409
53 False True True 0.638736
73 True False True 0.530348
80 True False False 0.669700
88 True True True 0.858495
---
(1, 2]
flag_A flag_B flag_C size
...
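My question now is: how do I go from here to counting the True and False values of each flag (A, B, C) per bin? For example, for bin=(0,1] I would like to get something like N_flag_A_true = 5, N_flag_A_false = 3, and so on. Ideally, I'd like to summarize this information either by extending this data frame or in a new data frame.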
This can be done with multi-index groupbys, concatenating the results and unstacking:
flag_A = df.groupby( [pd.cut( df['size'], bins),'flag_A'] ).count()['size'].to_frame()
flag_B = df.groupby( [pd.cut( df['size'], bins),'flag_B'] ).count()['size'].to_frame()
flag_C = df.groupby( [pd.cut( df['size'], bins),'flag_C'] ).count()['size'].to_frame()
T = pd.concat([flag_A,flag_B],axis=1)
R = pd.concat([T,flag_C],axis=1)
R.columns = ['flag_A','flag_B','flag_C']
R.index.names = [u'Bins',u'Value']
R = R.unstack('Value')
The result is:
flag_A flag_B flag_C
Value False True False True False True
Bins
(0, 1] 3.0 5.0 3.0 5.0 1.0 7.0
(1, 2] 6.0 8.0 7.0 7.0 5.0 9.0
(2, 3] 7.0 9.0 11.0 5.0 13.0 3.0
(3, 4] 15.0 12.0 12.0 15.0 17.0 10.0
(4, 5] 2.0 8.0 5.0 5.0 7.0 3.0
(5, 6] 5.0 5.0 3.0 7.0 7.0 3.0
(6, 7] 1.0 5.0 NaN 6.0 3.0 3.0
(7, 8] NaN 2.0 1.0 1.0 NaN 2.0
(8, 9] NaN NaN NaN NaN NaN NaN
Edit: you can flatten the MultiIndex in the columns like this:
R.columns = ['flag_A_F','flag_A_T','flag_B_F','flag_B_T','flag_C_F','flag_C_T']
Result:
flag_A_F flag_A_T flag_B_F flag_B_T flag_C_F flag_C_T
Bins
(0, 1] 3.0 5.0 3.0 5.0 1.0 7.0
(1, 2] 6.0 8.0 7.0 7.0 5.0 9.0
(2, 3] 7.0 9.0 11.0 5.0 13.0 3.0
(3, 4] 15.0 12.0 12.0 15.0 17.0 10.0
(4, 5] 2.0 8.0 5.0 5.0 7.0 3.0
(5, 6] 5.0 5.0 3.0 7.0 7.0 3.0
(6, 7] 1.0 5.0 NaN 6.0 3.0 3.0
(7, 8] NaN 2.0 1.0 1.0 NaN 2.0
(8, 9] NaN NaN NaN NaN NaN NaN
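As a hedged variation on the approach above, the three near-identical groupby lines can be folded into a loop, and the flat column names can be derived from the MultiIndex instead of being typed by hand. This is only a sketch: the names `binned`, `flags` and `tables` are my own, and I use `size()` with `fill_value=0` so that empty bins show 0 rather than NaN, which differs slightly from the output shown above.

```python
import numpy as np
import pandas as pd

# Same setup as in the question
np.random.seed(0)
df = pd.DataFrame({
    'size': 2 * np.random.randn(100) + 3,
    'flag_A': np.random.randint(0, 2, 100).astype(bool),
    'flag_B': np.random.randint(0, 2, 100).astype(bool),
    'flag_C': np.random.randint(0, 2, 100).astype(bool),
})
bins = np.arange(0, 10, 1)
binned = pd.cut(df['size'], bins)  # hypothetical helper name

flags = ['flag_A', 'flag_B', 'flag_C']

# One small table per flag: rows are bins, columns are False / True
tables = {
    flag: df.groupby([binned, flag], observed=False)
            .size()
            .unstack(flag, fill_value=0)
    for flag in flags
}

# Side-by-side concat; the dict keys become the outer column level
R = pd.concat(tables, axis=1)
R.index.name = 'Bins'

# Flatten (flag, value) tuples into names such as 'flag_A_F' / 'flag_A_T'
R.columns = [f"{flag}_{'T' if value else 'F'}" for flag, value in R.columns]
print(R)
```

Rows whose 'size' falls outside the bin edges are dropped by `pd.cut` (they become NaN), so the counts per flag add up to the number of binned rows, not necessarily to 100.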
You can assign your groups to the DF and then use pd.melt:
df['group'] = pd.cut(df['size'], bins=bins)
melted = pd.melt(df, id_vars='group', value_vars=['flag_A', 'flag_B', 'flag_C'])
Which will give you:
group variable value
0 (6, 7] flag_A False
1 (3, 4] flag_A False
2 (4, 5] flag_A True
3 (7, 8] flag_A True
4 (6, 7] flag_A True
5 (1, 2] flag_A False
[...]
Then group by the columns and compute the size of each group:
df2 = melted.groupby(['group', 'variable', 'value']).size()
This gives you:
group variable value
(0, 1] flag_A False 3
True 5
flag_B False 3
True 5
flag_C False 1
True 7
(1, 2] flag_A False 6
True 8
flag_B False 7
True 7
flag_C False 5
True 9
(2, 3] flag_A False 7
True 9
flag_B False 11
True 5
flag_C False 13
True 3
[...]
Then you need to reshape it into whatever form you want to use...
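One possible way to do that reshaping, as a sketch rather than the only option: unstack the 'variable' and 'value' levels so that each bin becomes one row and each (flag, value) pair becomes a column. The name `wide` is my own, and `fill_value=0` is an assumption about how empty combinations should be displayed.

```python
import numpy as np
import pandas as pd

# Same setup as in the question
np.random.seed(0)
df = pd.DataFrame({
    'size': 2 * np.random.randn(100) + 3,
    'flag_A': np.random.randint(0, 2, 100).astype(bool),
    'flag_B': np.random.randint(0, 2, 100).astype(bool),
    'flag_C': np.random.randint(0, 2, 100).astype(bool),
})
bins = np.arange(0, 10, 1)
flags = ['flag_A', 'flag_B', 'flag_C']

# Melt and count, as described above
df['group'] = pd.cut(df['size'], bins=bins)
melted = pd.melt(df, id_vars='group', value_vars=flags)
counts = melted.groupby(['group', 'variable', 'value'], observed=False).size()

# Move 'variable' and 'value' from the row index into the columns,
# then sort the columns so each flag's False/True pair sits together
wide = counts.unstack(['variable', 'value'], fill_value=0).sort_index(axis=1)
print(wide)
```

A single column can then be picked with a tuple, for example `wide[('flag_A', True)]` for the per-bin count of True values of flag_A.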