Pandas bin and count

I'm new to Pandas, so please go easy on me ;) Suppose my initial DataFrame looks like this:

#::: imports
import numpy as np
import pandas as pd

#::: initialize dictionary
np.random.seed(0)
d = {}
d['size'] = 2 * np.random.randn(100) + 3
d['flag_A'] = np.random.randint(0,2,100).astype(bool)
d['flag_B'] = np.random.randint(0,2,100).astype(bool)
d['flag_C'] = np.random.randint(0,2,100).astype(bool)

#::: convert dictionary into pandas dataframe
df = pd.DataFrame(d)

I now bin the DataFrame by 'size':

#::: bin pandas dataframe per size
bins = np.arange(0,10,1)
groups = df.groupby( pd.cut( df['size'], bins ) )

which leads to this output:

---
(0, 1]
   flag_A flag_B flag_C      size
25  False  False   True  0.091269
40   True   True   True  0.902894
41   True   True   True  0.159964
46  False   True   True  0.494409
53  False   True   True  0.638736
73   True  False   True  0.530348
80   True  False  False  0.669700
88   True   True   True  0.858495
---
(1, 2]
   flag_A flag_B flag_C      size
...
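
For anyone reproducing this: the listing above is what you would see when looping over the groupby object and printing each group. Here is a minimal sketch of that loop, assuming the same seed and bins as in the question. Only one flag column is built to keep it short, and observed=True is my own addition so that empty bins are skipped.

```python
import numpy as np
import pandas as pd

# Rebuild a reduced version of the question's DataFrame (size + one flag)
np.random.seed(0)
df = pd.DataFrame({
    'size': 2 * np.random.randn(100) + 3,
    'flag_A': np.random.randint(0, 2, 100).astype(bool),
})

# Bin by size; observed=True (an assumption, not in the question) skips empty bins
bins = np.arange(0, 10, 1)
groups = df.groupby(pd.cut(df['size'], bins), observed=True)

# Each iteration yields the bin interval and the rows that fall into it
for interval, frame in groups:
    print('---')
    print(interval)
    print(frame)
```

Rows whose size falls outside the bins (for example negative values) get NaN from pd.cut and do not show up in any group.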

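My question now is: how do I go from here to counting the True and False values of each flag (A, B, C) per bin? For example, for bin=(0,1] I would like to get something like N_flag_A_true = 5, N_flag_A_false = 3, and so on. Ideally, I would like to summarize this information either by extending this DataFrame or in a new DataFrame.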

You can do this with multi-index groupbys, then concatenate the results and unstack:

flag_A = df.groupby( [pd.cut( df['size'], bins),'flag_A'] ).count()['size'].to_frame()
flag_B = df.groupby( [pd.cut( df['size'], bins),'flag_B'] ).count()['size'].to_frame()
flag_C = df.groupby( [pd.cut( df['size'], bins),'flag_C'] ).count()['size'].to_frame()

# concatenate the three count columns side by side
R = pd.concat([flag_A, flag_B, flag_C], axis=1)
R.columns = ['flag_A','flag_B','flag_C']
R.index.names = ['Bins','Value']
R = R.unstack('Value')

The result is:

       flag_A       flag_B       flag_C      
Value   False True   False True   False True 
Bins                                         
(0, 1]    3.0   5.0    3.0   5.0    1.0   7.0
(1, 2]    6.0   8.0    7.0   7.0    5.0   9.0
(2, 3]    7.0   9.0   11.0   5.0   13.0   3.0
(3, 4]   15.0  12.0   12.0  15.0   17.0  10.0
(4, 5]    2.0   8.0    5.0   5.0    7.0   3.0
(5, 6]    5.0   5.0    3.0   7.0    7.0   3.0
(6, 7]    1.0   5.0    NaN   6.0    3.0   3.0
(7, 8]    NaN   2.0    1.0   1.0    NaN   2.0
(8, 9]    NaN   NaN    NaN   NaN    NaN   NaN

Edit: you can flatten the MultiIndex in the columns like this:

R.columns = ['flag_A_F','flag_A_T','flag_B_F','flag_B_T','flag_C_F','flag_C_T']

Result:

        flag_A_F  flag_A_T  flag_B_F  flag_B_T  flag_C_F  flag_C_T
Bins                                                              
(0, 1]       3.0       5.0       3.0       5.0       1.0       7.0
(1, 2]       6.0       8.0       7.0       7.0       5.0       9.0
(2, 3]       7.0       9.0      11.0       5.0      13.0       3.0
(3, 4]      15.0      12.0      12.0      15.0      17.0      10.0
(4, 5]       2.0       8.0       5.0       5.0       7.0       3.0
(5, 6]       5.0       5.0       3.0       7.0       7.0       3.0
(6, 7]       1.0       5.0       NaN       6.0       3.0       3.0
(7, 8]       NaN       2.0       1.0       1.0       NaN       2.0
(8, 9]       NaN       NaN       NaN       NaN       NaN       NaN
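
If you would rather not type the flattened names by hand, they can probably be built from the MultiIndex itself. A small sketch, using a toy stand-in for R with just the first two bins and two flags from the table above (the _T/_F suffix scheme simply mirrors the hand-written names):

```python
import pandas as pd

# Toy stand-in for R: two flags, two bins, counts copied from the table above
columns = pd.MultiIndex.from_product([['flag_A', 'flag_B'], [False, True]])
R = pd.DataFrame([[3, 5, 3, 5], [6, 8, 7, 7]],
                 index=['(0, 1]', '(1, 2]'], columns=columns)

# Each column label is a (flag, value) tuple; join it into a single string
R.columns = ['{}_{}'.format(flag, 'T' if value else 'F')
             for flag, value in R.columns]
print(R)
```

This keeps the column order that unstack produced, so the names cannot drift out of sync with the data.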

You can assign the groups to the DataFrame and then use pd.melt:

df['group'] = pd.cut(df['size'], bins=bins)
melted = pd.melt(df, id_vars='group', value_vars=['flag_A', 'flag_B', 'flag_C'])

Which gives you:

      group variable  value
0    (6, 7]   flag_A  False
1    (3, 4]   flag_A  False
2    (4, 5]   flag_A   True
3    (7, 8]   flag_A   True
4    (6, 7]   flag_A   True
5    (1, 2]   flag_A  False
[...]

Then group by the columns and compute the size of each group:

df2 = melted.groupby(['group', 'variable', 'value']).size()

This gives you:

group   variable  value
(0, 1]  flag_A    False     3
                  True      5
        flag_B    False     3
                  True      5
        flag_C    False     1
                  True      7
(1, 2]  flag_A    False     6
                  True      8
        flag_B    False     7
                  True      7
        flag_C    False     5
                  True      9
(2, 3]  flag_A    False     7
                  True      9
        flag_B    False    11
                  True      5
        flag_C    False    13
                  True      3
        [...]

Then you need to reshape it however you want to use it...
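
One possible reshape, as a sketch: unstack the 'variable' and 'value' levels so that each bin becomes a row and each flag/value pair becomes a column. The names `wide` and the observed=True argument are my own additions, not part of the answer above; the rest repeats the melt steps with the question's seed and bins.

```python
import numpy as np
import pandas as pd

# Rebuild the question's DataFrame
np.random.seed(0)
df = pd.DataFrame({
    'size': 2 * np.random.randn(100) + 3,
    'flag_A': np.random.randint(0, 2, 100).astype(bool),
    'flag_B': np.random.randint(0, 2, 100).astype(bool),
    'flag_C': np.random.randint(0, 2, 100).astype(bool),
})
bins = np.arange(0, 10, 1)

# Melt as in the answer, then count rows per (group, variable, value)
df['group'] = pd.cut(df['size'], bins=bins)
melted = pd.melt(df, id_vars='group', value_vars=['flag_A', 'flag_B', 'flag_C'])
df2 = melted.groupby(['group', 'variable', 'value'], observed=True).size()

# Move 'variable' and 'value' into the columns; missing combinations become 0
wide = df2.unstack(['variable', 'value']).fillna(0).astype(int)
print(wide)
```

With observed=True, bins that contain no rows are left out of the result instead of appearing as empty rows.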