How can I calculate percentage of a groupby column and sort it by descending order?
Question: How can I calculate the percentage of each group in a groupby column and sort the result in descending order?
Expected output:
country          count   percentage
United States     2555   45%
India              923   12%
United Kingdom     397   4%
Japan              226   3%
South Korea        183   2%
I did some research, went through the pandas documentation and looked at other questions on Stack Overflow, with no luck.
I tried the following:
Attempt #1:
df2 = df.groupby('country')['show_id'].count().nlargest()
df3 = df2.groupby(level=0).apply(lambda x: x/x.sum() * 100)
Output:
director
A. L. Vijay 100.0
A. Raajdheep 100.0
A. Salaam 100.0
A.R. Murugadoss 100.0
Aadish Keluskar 100.0
...
Çagan Irmak 100.0
Ísold Uggadóttir 100.0
Óskar Thór Axelsson 100.0
Ömer Faruk Sorak 100.0
Şenol Sönmez 100.0
Name: show_id, Length: 4049, dtype: float64
Attempt #2:
df2 = df.groupby('country')['show_id'].count()
df2['percentage'] = df2['show_id']/6000
Output:
KeyError: 'show_id'
Sample of the dataset:
import pandas as pd
df = pd.DataFrame({
'show_id':['81145628','80117401','70234439'],
'type':['Movie','Movie','TV Show'],
'title':['Norm of the North: King Sized Adventure',
'Jandino: Whatever it Takes',
'Transformers Prime'],
'director':['Richard Finn, Tim Maltby',float('nan'),float('nan')],
'cast':['Alan Marriott, Andrew Toth, Brian Dobson',
'Jandino Asporaat','Peter Cullen, Sumalee Montano, Frank Welker'],
'country':['United States, India, South Korea, China',
'United Kingdom','United States'],
'date_added':['September 9, 2019',
'September 9, 2016',
'September 8, 2018'],
'release_year':['2019','2016','2013'],
'rating':['TV-PG','TV-MA','TV-Y7-FV'],
'duration':['90 min','94 min','1 Season'],
'listed_in':['Children & Family Movies, Comedies',
'Stand-Up Comedy','Kids TV'],
'description':['Before planning an awesome wedding for his',
'Jandino Asporaat riffs on the challenges of ra',
'With the help of three human allies, the Autob']})
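This does not handle rows where the 'country' field lists several countries, but the lines below should work for the rest of the question:
Create the initial DataFrame: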
df = pd.DataFrame({
'show_id':['81145628','80117401','70234439'],
'type':['Movie','Movie','TV Show'],
'title':['Norm of the North: King Sized Adventure',
'Jandino: Whatever it Takes',
'Transformers Prime'],
'director':['Richard Finn, Tim Maltby',0,0],
'cast':['Alan Marriott, Andrew Toth, Brian Dobson',
'Jandino Asporaat','Peter Cullen, Sumalee Montano, Frank Welker'],
'country':['United States, India, South Korea, China',
'United Kingdom','United States'],
'date_added':['September 9, 2019',
'September 9, 2016',
'September 8, 2018'],
'release_year':['2019','2016','2013'],
'rating':['TV-PG','TV-MA','TV-Y7-FV'],
'duration':['90 min','94 min','1 Season'],
'listed_in':['Children & Family Movies, Comedies',
'Stand-Up Comedy','Kids TV'],
'description':['Before planning an awesome wedding for his',
'Jandino Asporaat riffs on the challenges of ra',
'With the help of three human allies, the Autob']})
Group by country:
df2 = df.groupby(by="country", as_index=False)['show_id']\
.agg('count')
Rename the aggregated column:
df2 = df2.rename(columns={'show_id':'count'})
Create the percentage column:
df2['percent'] = (df2['count']*100)/df2['count'].sum()
Sort in descending order:
df2 = df2.sort_values(by='percent', ascending=False)
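The steps above can also be chained together. The following is a minimal sketch, assuming a small made-up DataFrame with one country per row; the data, the name `summary`, and the `percentage_label` column are illustrative and not part of the question.

```python
import pandas as pd

# Hypothetical data: one country per row (not the real dataset)
df = pd.DataFrame({
    'show_id': ['1', '2', '3', '4', '5'],
    'country': ['United States', 'India', 'United States',
                'Japan', 'United States'],
})

# Count shows per country and give the count column a readable name
summary = (
    df.groupby('country', as_index=False)['show_id']
      .count()
      .rename(columns={'show_id': 'count'})
)

# Share of the overall total, in percent
summary['percentage'] = summary['count'] * 100 / summary['count'].sum()

# Largest share first, with a fresh 0..n-1 index
summary = summary.sort_values('percentage', ascending=False).reset_index(drop=True)

# Optional: a string column such as '60%' for display only
summary['percentage_label'] = (
    summary['percentage'].round().astype(int).astype(str) + '%'
)
print(summary)
```

Keeping the numeric `percentage` column next to the string label means the table can still be sorted or filtered numerically afterwards.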
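Part of the problem in your attempt #1 is likely the second groupby: after count(), every country is already a single row, so groupby(level=0) puts each value in its own group and x/x.sum() * 100 is always 100. Dividing by the sum of the whole column, as above, gives the share of the total. (Passing 'country' positionally is the same as passing by='country', so the missing 'by' keyword is not the cause.)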
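To see where the two attempts in the question go wrong, it may help to reproduce them on a tiny Series. This is a sketch on made-up data; it uses `transform` instead of the question's `apply` to keep the output shape predictable across pandas versions, and the names `counts`, `per_group`, `share` and `table` are illustrative.

```python
import pandas as pd

# Hypothetical data (not the real dataset)
df = pd.DataFrame({
    'show_id': ['1', '2', '3', '4'],
    'country': ['United States', 'India', 'United States', 'Japan'],
})

# Series of counts, indexed by country
counts = df.groupby('country')['show_id'].count()

# Attempt #1 pattern: grouping the already-aggregated Series by its index
# makes one-element groups, so every value is divided by itself
per_group = counts.groupby(level=0).transform(lambda x: x / x.sum() * 100)
print(per_group.tolist())

# Dividing by the total of the whole Series gives the intended shares
share = (counts / counts.sum() * 100).sort_values(ascending=False)
print(share)

# Attempt #2 pattern: counts is a Series, so counts['show_id'] would look for
# an index label called 'show_id'; a DataFrame gives a real column instead
table = counts.to_frame('count')
table['percentage'] = table['count'] / table['count'].sum() * 100
print(table)
```

Dividing by `counts.sum()` rather than a hard-coded total such as 6000 keeps the percentages correct if the number of rows changes.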
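# value_counts() counts rows per country and sorts in descending order by default
newDF = pd.DataFrame(df['country'].value_counts())
# normalize=True returns fractions of the total; scale to percent and round
newDF['percentage'] = df['country'].value_counts(normalize=True).mul(100).round(2)
newDF.columns = ['count', 'percentage']
newDF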
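Neither snippet above handles rows whose 'country' field lists several countries. One possible way to cover that case, sketched here on made-up data, is to split the field on commas and use `Series.explode` so each country gets its own row before counting. The names `countries` and `result` are illustrative, and the comma separator is assumed from the sample data.

```python
import pandas as pd

# Hypothetical data with a comma-separated 'country' field
df = pd.DataFrame({
    'show_id': ['1', '2', '3'],
    'country': ['United States, India', 'United Kingdom', 'United States'],
})

# One entry per (show, country): split on commas, explode, trim whitespace
countries = df['country'].str.split(',').explode().str.strip()

# value_counts() already returns the counts in descending order
result = countries.value_counts().to_frame('count')

# Percentages are relative to country mentions, not to the number of shows
result['percentage'] = (result['count'] / result['count'].sum() * 100).round(2)
print(result)
```

Because a show with several countries is counted once per country, the percentages sum to 100 over country mentions. Dividing by `len(df)` instead would give the share of shows that involve each country, and those shares can add up to more than 100.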