Pandas df group by count elements

My dataframe looks like this:

import pandas as pd

# initialize list of lists
data = [[1998, 1998, 2002, 2003], [2001, 1999, 1993, 2003], [1998, 1999, 2003, 1994], [1998, 1997, 2003, 1993], [1999, 2001, 1996, 1999]]

df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])

I want to count how often each date occurs, expressed as a percentage, so that the dataframe looks like this:

    1997    1998    1999
A   20%     80%     100%
B   30%     10%     0%
C   70%     10%     0%

I tried using pandas groupby.

The logic is not fully clear (the provided output does not appear to be the real output for the provided input), but here are some approaches:

Using crosstab

Percentage per year

df2 = df.melt(value_name='year')

df2 = pd.crosstab(df2['variable'], df2['year'], normalize='columns').mul(100)

# or
# df2 = pd.crosstab(df2['variable'], df2['year'])
# df2.div(df2.sum()).mul(100)

Output:

year      1993   1994   1996   1997  1998  1999  2001   2002  2003
variable                                                          
A          0.0    0.0    0.0    0.0  75.0  25.0  50.0    0.0   0.0
B          0.0    0.0    0.0  100.0  25.0  50.0  50.0    0.0   0.0
C         50.0    0.0  100.0    0.0   0.0   0.0   0.0  100.0  50.0
D         50.0  100.0    0.0    0.0   0.0  25.0   0.0    0.0  50.0

Percentage per variable

df2 = df.melt(value_name='year')
pd.crosstab(df2['variable'], df2['year'], normalize='index').mul(100)

# or
# df2 = pd.crosstab(df2['variable'], df2['year'])
# df2.div(df2.sum(1), axis=0).mul(100)

Output:

year      1993  1994  1996  1997  1998  1999  2001  2002  2003
variable                                                      
A          0.0   0.0   0.0   0.0  60.0  20.0  20.0   0.0   0.0
B          0.0   0.0   0.0  20.0  20.0  40.0  20.0   0.0   0.0
C         20.0   0.0  20.0   0.0   0.0   0.0   0.0  20.0  40.0
D         20.0  20.0   0.0   0.0   0.0  20.0   0.0   0.0  40.0
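
To make the difference between the two normalize settings concrete, here is a minimal sketch that builds both tables and sums them along the normalized axis. The names long_df, per_year, per_variable, col_sums and row_sums are my own and not part of the answer above. With normalize='columns' every year column should sum to 100, and with normalize='index' every variable row should sum to 100.

```python
import pandas as pd

data = [[1998, 1998, 2002, 2003], [2001, 1999, 1993, 2003],
        [1998, 1999, 2003, 1994], [1998, 1997, 2003, 1993],
        [1999, 2001, 1996, 1999]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])

# long format: one row per (variable, year) observation
long_df = df.melt(value_name='year')

# share of each variable within a year (columns sum to 100)
per_year = pd.crosstab(long_df['variable'], long_df['year'],
                       normalize='columns').mul(100)

# share of each year within a variable (rows sum to 100)
per_variable = pd.crosstab(long_df['variable'], long_df['year'],
                           normalize='index').mul(100)

col_sums = per_year.sum(axis=0)      # one total per year
row_sums = per_variable.sum(axis=1)  # one total per variable
print(col_sums)
print(row_sums)
```

Which of the two you want depends on whether the percentages should be read down a year column or across a variable row.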

Using groupby

(df.stack()
 .groupby(level=1)
 .apply(lambda s: s.value_counts(normalize=True))
 .unstack(fill_value=0)
 .mul(100)
 )

Output:

   1993  1994  1996  1997  1998  1999  2001  2002  2003
A   0.0   0.0   0.0   0.0  60.0  20.0  20.0   0.0   0.0
B   0.0   0.0   0.0  20.0  20.0  40.0  20.0   0.0   0.0
C  20.0   0.0  20.0   0.0   0.0   0.0   0.0  20.0  40.0
D  20.0  20.0   0.0   0.0   0.0  20.0   0.0   0.0  40.0
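
The per-variable table can probably also be produced without apply, because a grouped Series has its own value_counts method. The sketch below is a variation on the answer above, not the answerer's code: it uses melt instead of stack so that the columns get explicit names, and the names long_df and pct are my own.

```python
import pandas as pd

data = [[1998, 1998, 2002, 2003], [2001, 1999, 1993, 2003],
        [1998, 1999, 2003, 1994], [1998, 1997, 2003, 1993],
        [1999, 2001, 1996, 1999]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])

# long format with explicit column names
long_df = df.melt(var_name='variable', value_name='year')

pct = (long_df.groupby('variable')['year']
       .value_counts(normalize=True)   # fraction of each year within a variable
       .unstack(fill_value=0)          # years become columns, missing pairs become 0
       .sort_index(axis=1)             # keep the years in ascending order
       .mul(100)
       )
print(pct)
```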

Another option could be:

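# getting normalized value_counts for each column, keyed by column name
df2 = pd.concat([df[col].value_counts(normalize=True) for col in df.columns], axis=1, keys=df.columns).sort_index()

# filling null values with 0 and converting fractions to percentages
df2 = df2.fillna(0).mul(100)

# rounding, transforming to string and adding %
df2 = df2.round().astype('int').astype('str') + '%'

# selecting the years and columns from the question (the index holds integers, not strings)
df2.loc[1997:1999, 'A':'C'].T

Output:

    1997    1998    1999
A   0%      60%     20%
B   20%     20%     40%
C   0%      0%      0%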

melt + groupby + unstack

(df.melt().groupby(['variable', 'value']).size() 
 / df.melt().groupby('value').size()).unstack(1)

Out[1]: 
value     1993  1994  1996  1997  1998  1999  2001  2002  2003
variable                                                      
A          NaN   NaN   NaN   NaN  0.75  0.25   0.5   NaN   NaN
B          NaN   NaN   NaN   1.0  0.25  0.50   0.5   NaN   NaN
C          0.5   NaN   1.0   NaN   NaN   NaN   NaN   1.0   0.5
D          0.5   1.0   NaN   NaN   NaN  0.25   NaN   NaN   0.5
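
If zeros are preferred over NaN and percentages over fractions, so that the result lines up with the crosstab output, one possible variation is to unstack the counts first and then divide by the column totals. This is a sketch under that assumption. The names long_df, counts and share are my own.

```python
import pandas as pd

data = [[1998, 1998, 2002, 2003], [2001, 1999, 1993, 2003],
        [1998, 1999, 2003, 1994], [1998, 1997, 2003, 1993],
        [1999, 2001, 1996, 1999]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])

long_df = df.melt()

# raw counts per (variable, value) pair, with 0 where a pair never occurs
counts = long_df.groupby(['variable', 'value']).size().unstack(fill_value=0)

# divide each year column by its total and scale to percent
share = counts.div(counts.sum(axis=0), axis=1).mul(100)
print(share)
```

Dividing after unstack also avoids relying on index-level alignment between a MultiIndex Series and a single-level Series.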