Pandas df group by count elements
My dataframe looks like this.
import pandas as pd
# initialize list of lists
data = [[1998, 1998, 2002, 2003], [2001, 1999, 1993, 2003], [1998, 1999, 2003, 1994], [1998, 1997, 2003, 1993], [1999, 2001, 1996, 1999]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])
I want to compute how often each date occurs, as a percentage, so that the dataframe looks like this:
1997 1998 1999
A 20% 80% 100%
B 30% 10% 0%
C 70% 10% 0%
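I tried using Pandas groupby.
The logic is not entirely clear (the provided output does not appear to be the real output for the provided input), but here are a few approaches:
Using crosstab
Percentage per year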
df2 = df.melt(value_name='year')
df2 = pd.crosstab(df2['variable'], df2['year'], normalize='columns').mul(100)
# or
# df2 = pd.crosstab(df2['variable'], df2['year'])
# df2.div(df2.sum()).mul(100)
Output:
year 1993 1994 1996 1997 1998 1999 2001 2002 2003
variable
A 0.0 0.0 0.0 0.0 75.0 25.0 50.0 0.0 0.0
B 0.0 0.0 0.0 100.0 25.0 50.0 50.0 0.0 0.0
C 50.0 0.0 100.0 0.0 0.0 0.0 0.0 100.0 50.0
D 50.0 100.0 0.0 0.0 0.0 25.0 0.0 0.0 50.0
Percentage per variable
df2 = df.melt(value_name='year')
pd.crosstab(df2['variable'], df2['year'], normalize='index').mul(100)
# or
# df2 = pd.crosstab(df2['variable'], df2['year'])
# df2.div(df2.sum(1), axis=0).mul(100)
Output:
year 1993 1994 1996 1997 1998 1999 2001 2002 2003
variable
A 0.0 0.0 0.0 0.0 60.0 20.0 20.0 0.0 0.0
B 0.0 0.0 0.0 20.0 20.0 40.0 20.0 0.0 0.0
C 20.0 0.0 20.0 0.0 0.0 0.0 0.0 20.0 40.0
D 20.0 20.0 0.0 0.0 0.0 20.0 0.0 0.0 40.0
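The question asks for `%` strings and only a few years, so the per-variable crosstab can be trimmed and formatted afterwards. The following is a minimal sketch, assuming the years 1997 to 1999 and columns A to C are the ones of interest (that selection and the names `long_df`, `pct`, `years`, `subset` and `formatted` are illustrative, not from the question):

```python
import pandas as pd

data = [[1998, 1998, 2002, 2003], [2001, 1999, 1993, 2003],
        [1998, 1999, 2003, 1994], [1998, 1997, 2003, 1993],
        [1999, 2001, 1996, 1999]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])

# Long format: one row per (variable, year) observation.
long_df = df.melt(value_name='year')

# Share of each year within each original column, in percent.
pct = pd.crosstab(long_df['variable'], long_df['year'], normalize='index').mul(100)

# Assumed selection: keep three years and three columns.
years = [1997, 1998, 1999]
subset = pct.loc[['A', 'B', 'C']].reindex(columns=years, fill_value=0)

# Round to whole numbers and append the percent sign.
formatted = subset.round().astype(int).astype(str) + '%'
print(formatted)
```

Using `reindex` rather than a plain column selection means a year that never occurs in the data shows up as 0% instead of raising a `KeyError`.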
Using groupby
(df.stack()
.groupby(level=1)
.apply(lambda s: s.value_counts(normalize=True))
.unstack(fill_value=0)
.mul(100)
)
Output:
1993 1994 1996 1997 1998 1999 2001 2002 2003
A 0.0 0.0 0.0 0.0 60.0 20.0 20.0 0.0 0.0
B 0.0 0.0 0.0 20.0 20.0 40.0 20.0 0.0 0.0
C 20.0 0.0 20.0 0.0 0.0 0.0 0.0 20.0 40.0
D 20.0 20.0 0.0 0.0 0.0 20.0 0.0 0.0 40.0
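The groupby result appears to carry the same numbers as the `normalize='index'` crosstab. A quick way to confirm that on this data is to build both and compare the values; this is only a sketch, and the names `via_groupby`, `via_crosstab` and `same` are made up for the check:

```python
import numpy as np
import pandas as pd

data = [[1998, 1998, 2002, 2003], [2001, 1999, 1993, 2003],
        [1998, 1999, 2003, 1994], [1998, 1997, 2003, 1993],
        [1999, 2001, 1996, 1999]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])

# groupby route: stack to a Series, group by the column label (level 1),
# then take normalized value counts per group.
via_groupby = (df.stack()
                 .groupby(level=1)
                 .apply(lambda s: s.value_counts(normalize=True))
                 .unstack(fill_value=0)
                 .mul(100))

# crosstab route: percentages within each original column.
long_df = df.melt(value_name='year')
via_crosstab = pd.crosstab(long_df['variable'], long_df['year'],
                           normalize='index').mul(100)

# Compare raw values only; the axis names differ between the two results.
same = np.allclose(via_groupby.to_numpy(), via_crosstab.to_numpy())
print(same)
```

The comparison ignores axis names on purpose, since `crosstab` labels the axes `variable` and `year` while the groupby route leaves them unnamed.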
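Another option could be:
# getting normalized value_counts for each column; keys= keeps the column names as labels
df2 = pd.concat([df[col].value_counts(normalize=True) for col in df.columns], axis=1, keys=df.columns)
# filling null values with 0 and sorting the years so they can be sliced
df2 = df2.fillna(0).sort_index()
# converting fractions to whole percentages, then to strings with %
df2 = df2.mul(100).round().astype('int').astype('str') + '%'
# selecting the years and columns from the question (the index holds integers, not strings)
df2.loc[1997:1999, 'A':'C'].T
Output:
1997 1998 1999
A 0% 60% 20%
B 20% 20% 40%
C 0% 0% 0%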
melt + groupby + unstack
(df.melt().groupby(['variable', 'value']).size()
/ df.melt().groupby('value').size()).unstack(1)
Out[1]:
value 1993 1994 1996 1997 1998 1999 2001 2002 2003
variable
A NaN NaN NaN NaN 0.75 0.25 0.5 NaN NaN
B NaN NaN NaN 1.0 0.25 0.50 0.5 NaN NaN
C 0.5 NaN 1.0 NaN NaN NaN NaN 1.0 0.5
D 0.5 1.0 NaN NaN NaN 0.25 NaN NaN 0.5
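The NaN cells above mark year and column pairs that never occur. If zeros and percentages are preferred, as in the crosstab output, something like the following might do. It is a sketch that swaps the plain `/` for `Series.div` with `level='value'` to make the alignment explicit; the names `long_df`, `counts`, `totals` and `share` are illustrative:

```python
import pandas as pd

data = [[1998, 1998, 2002, 2003], [2001, 1999, 1993, 2003],
        [1998, 1999, 2003, 1994], [1998, 1997, 2003, 1993],
        [1999, 2001, 1996, 1999]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])

# melt() with defaults yields the columns 'variable' and 'value'.
long_df = df.melt()

# Occurrences of each year per original column, and per year overall.
counts = long_df.groupby(['variable', 'value']).size()
totals = long_df.groupby('value').size()

share = (
    counts.div(totals, level='value')       # divide by that year's total
          .unstack('value', fill_value=0)   # years become columns, gaps become 0
          .mul(100)                         # fractions to percent
)
print(share)
```

Melting once and reusing `long_df` also avoids calling `df.melt()` twice.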