Crosstab in pandas with enforced values

I have a data sample such as

import pandas as pd

df = pd.DataFrame({'A':[1,1,3,3,4],
                   'B':['Very Happy','Sad','Sad','Happy','Happy'],
                   'C': [True,False,False,True,False]})
>> df 


   A           B      C
0  1  Very Happy   True
1  1         Sad  False
2  3         Sad  False
3  3       Happy   True
4  4       Happy  False

and I want to count the occurrences of each combination, so crosstab is the way to go

counts = pd.crosstab(index = df['A'], columns = [df['B'],df['C']])
>> counts


B Happy         Sad Very Happy
C False True  False      True 
A                             
1     0     0     1          1
3     0     1     1          0
4     1     0     0          0

However, neither the 'Very Sad' emotion nor id 2 appears in this data sample, so they are missing from the crosstab. I would like the result to be

  Very Happy       Happy         Sad       Very Sad      
       True  False True  False True  False    True  False
1          1     0     0     0     0     1        0     0
2          0     0     0     0     0     0        0     0
3          0     0     1     0     0     1        0     0
4          0     0     0     1     0     0        0     0

My workaround is to set up a template

emotions = ['Very Happy', 'Happy', 'Sad', 'Very Sad']
ids = [1,2,3,4]
truths = [True,False]

template = pd.DataFrame(index = pd.Index(ids),
                    columns= pd.MultiIndex.from_product((emotions,truths)))
>> template
  Very Happy       Happy         Sad       Very Sad      
       True  False True  False True  False    True  False
1        NaN   NaN   NaN   NaN   NaN   NaN      NaN   NaN
2        NaN   NaN   NaN   NaN   NaN   NaN      NaN   NaN
3        NaN   NaN   NaN   NaN   NaN   NaN      NaN   NaN
4        NaN   NaN   NaN   NaN   NaN   NaN      NaN   NaN

and then fill it in

template.unstack()[counts.unstack().index] = counts.unstack()
template = template.fillna(0)
>> template
  Very Happy       Happy         Sad       Very Sad      
       True  False True  False True  False    True  False
1          1     0     0     0     0     1        0     0
2          0     0     0     0     0     0        0     0
3          0     0     1     0     0     1        0     0
4          0     0     0     1     0     0        0     0

The problem is that it feels like there must be a cleaner, more readable way to achieve the same result. Any ideas?

That's a pivot_table:

>>> pv = df.pivot_table(index='A',
...                     columns=['B', 'C'],
...                     aggfunc='size',
...                     fill_value=0)
>>> pv
B Happy         Sad Very Happy
C False True  False      True 
A                             
1     0     0     1          1
3     0     1     1          0
4     1     0     0          0
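
As a quick sanity check, the sketch below rebuilds the sample frame and compares this pivot_table call with the pd.crosstab call from the question. The names `counts` and `pv` follow the post; `same_labels` and `same_values` are illustrative names, and the comparison is an addition for illustration rather than part of the original answer.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'A': [1, 1, 3, 3, 4],
                   'B': ['Very Happy', 'Sad', 'Sad', 'Happy', 'Happy'],
                   'C': [True, False, False, True, False]})

# The crosstab from the question.
counts = pd.crosstab(index=df['A'], columns=[df['B'], df['C']])

# The pivot_table from the answer: 'size' counts rows per (A, B, C) group.
pv = df.pivot_table(index='A', columns=['B', 'C'],
                    aggfunc='size', fill_value=0)

# Compare row labels, column labels, and cell values.
same_labels = (list(pv.index) == list(counts.index)
               and list(pv.columns) == list(counts.columns))
same_values = bool((pv.to_numpy() == counts.to_numpy()).all())
print(same_labels and same_values)
```

Both tables only contain combinations that were actually observed, which is why 'Very Sad' and id 2 are absent from either one.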

The missing columns/rows are absent because their cross sections do not exist in the frame. You can add them with .reindex:

>>> cols = pd.MultiIndex.from_product((['Very Happy', 'Happy', 'Sad', 'Very Sad'], [True, False]))
>>> pv.reindex(index=range(1, 5), columns=cols, fill_value=0)
  Very Happy       Happy         Sad       Very Sad      
       True  False True  False True  False    True  False
A                                                        
1          1     0     0     0     0     1        0     0
2          0     0     0     0     0     0        0     0
3          0     0     1     0     0     1        0     0
4          0     0     0     1     0     0        0     0
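
The same reindex step should work equally well on the crosstab output from the question, so the two steps can be wrapped together. The following is a minimal sketch under that assumption; the helper name `enforced_crosstab` and its parameters are hypothetical and only for illustration, and the column names 'A', 'B', 'C' are taken from the sample frame.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'A': [1, 1, 3, 3, 4],
                   'B': ['Very Happy', 'Sad', 'Sad', 'Happy', 'Happy'],
                   'C': [True, False, False, True, False]})


def enforced_crosstab(frame, ids, emotions, truths):
    """Hypothetical helper: count (B, C) combinations per A, then force
    every expected row and column to be present, filling gaps with 0."""
    counts = pd.crosstab(index=frame['A'],
                         columns=[frame['B'], frame['C']])
    # Full set of expected columns, in the order the caller asked for.
    cols = pd.MultiIndex.from_product((emotions, truths), names=['B', 'C'])
    rows = pd.Index(ids, name='A')
    return counts.reindex(index=rows, columns=cols, fill_value=0)


full = enforced_crosstab(df,
                         ids=[1, 2, 3, 4],
                         emotions=['Very Happy', 'Happy', 'Sad', 'Very Sad'],
                         truths=[True, False])
print(full)
```

Passing `fill_value=0` to reindex keeps the counts as integers instead of introducing NaN, so no separate fillna step is needed.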