固定列名并在按两列对数据框进行分组后重命名它们
Fixing column names and renaming them after grouping the dataframe by two columns
我有一个数据框:
{'ARTICLE_ID': {0: 111, 1: 111, 2: 222, 3: 222, 4: 222}, 'CITEDIN_ARTICLE_ID': {0: 11, 1: 11, 2: 11, 3: 22, 4: 22}, 'enrollment': {0: 10, 1: 10, 2: 10, 3: 10, 4: 10}, 'Trial_year': {0: 2017, 1: 2017, 2: 2017, 3: 2017, 4: 2017}, 'AUTHOR_ID': {0: 'aaa', 1: 'aaa', 2: 'aaa', 3: 'aaa', 4: 'aaa'}, 'AUTHOR_RANK': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}}
我按两列分组
df_grouped = df.groupby(['AUTHOR_ID', 'Trial_year']).agg({'ARTICLE_ID': "count",
'enrollment': ["count", 'sum']}).reset_index()
因此,我收到了这个数据框,其中列名有两个级别
{('AUTHOR_ID', ''): {0: 'aaa'}, ('Trial_year', ''): {0: 2017}, ('ARTICLE_ID', 'count'): {0: 5}, ('enrollment', 'count'): {0: 5}, ('enrollment', 'sum'): {0: 50}}
我的理想输出——一级列名和重命名列名的dataframe
`AUTHOR_ID`, `Trial_year`, `ARTICLE_ID_count`, `enrollment_count`, `enrollment_sum`
您可以修改列:
df_grouped.columns = [f"{i}_{j}" if j!='' else i for i,j in df_grouped.columns]
或从头开始使用NamedAgg
:
df_grouped = (df.groupby(['AUTHOR_ID', 'Trial_year'])
.agg(ARTICLE_ID_count=('ARTICLE_ID', "count"),
enrollment_count=('enrollment','count'),
enrollment_sum=('enrollment','sum')).reset_index())
您还可以将字典传递给 groupby.agg
以获得简洁的代码:
df_grouped = (df.groupby(['AUTHOR_ID', 'Trial_year'], as_index=False)
.agg(**{'_'.join(pair): pair for pair in [('ARTICLE_ID', 'count'),
('enrollment','count'),
('enrollment','sum')]}))
输出:
AUTHOR_ID Trial_year ARTICLE_ID_count enrollment_count enrollment_sum
0 aaa 2017 5 5 50
我有一个数据框:
{'ARTICLE_ID': {0: 111, 1: 111, 2: 222, 3: 222, 4: 222}, 'CITEDIN_ARTICLE_ID': {0: 11, 1: 11, 2: 11, 3: 22, 4: 22}, 'enrollment': {0: 10, 1: 10, 2: 10, 3: 10, 4: 10}, 'Trial_year': {0: 2017, 1: 2017, 2: 2017, 3: 2017, 4: 2017}, 'AUTHOR_ID': {0: 'aaa', 1: 'aaa', 2: 'aaa', 3: 'aaa', 4: 'aaa'}, 'AUTHOR_RANK': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}}
我按两列分组
df_grouped = df.groupby(['AUTHOR_ID', 'Trial_year']).agg({'ARTICLE_ID': "count",
'enrollment': ["count", 'sum']}).reset_index()
因此,我收到了这个数据框,其中列名有两个级别
{('AUTHOR_ID', ''): {0: 'aaa'}, ('Trial_year', ''): {0: 2017}, ('ARTICLE_ID', 'count'): {0: 5}, ('enrollment', 'count'): {0: 5}, ('enrollment', 'sum'): {0: 50}}
我的理想输出——一级列名和重命名列名的dataframe
`AUTHOR_ID`, `Trial_year`, `ARTICLE_ID_count`, `enrollment_count`, `enrollment_sum`
您可以修改列:
df_grouped.columns = [f"{i}_{j}" if j!='' else i for i,j in df_grouped.columns]
或从头开始使用NamedAgg
:
df_grouped = (df.groupby(['AUTHOR_ID', 'Trial_year'])
.agg(ARTICLE_ID_count=('ARTICLE_ID', "count"),
enrollment_count=('enrollment','count'),
enrollment_sum=('enrollment','sum')).reset_index())
您还可以将字典传递给 groupby.agg
以获得简洁的代码:
df_grouped = (df.groupby(['AUTHOR_ID', 'Trial_year'], as_index=False)
.agg(**{'_'.join(pair): pair for pair in [('ARTICLE_ID', 'count'),
('enrollment','count'),
('enrollment','sum')]}))
输出:
AUTHOR_ID Trial_year ARTICLE_ID_count enrollment_count enrollment_sum
0 aaa 2017 5 5 50