Groupby to create a list
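I'm using JupyterLab to print some data from a spreadsheet in a specific way.
I have two different files:
1)
2)
For each original_id == id, I want to group by country and list the brands, then sum and list the holdings of each brand.
The result I get with my code is this: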
FundID Domicile (brand, AUM)
0 A1 IT (BBB, 10.0), UK (BBB, 7.0),
1 B2 CH (AAA, 12.0),
2 C3 DE (CCC, 5.0),
3 D4 CH (EEE, 9.0), UK (EEE, 11.0),
However, my objective is to get something like this:
The code is:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)
df_fofs = pd.read_excel('SampleDF.xlsx')
df_extract = pd.read_excel('SampleID_ex.xlsx')
df_extract
original_id
0 A1
1 B2
2 C3
3 D4
df_fofs
brand country id holding
0 AAA UK A1 2000000
1 AAA CH B2 4000000
2 BBB UK A1 7000000
3 CCC DE C3 5000000
4 BBB IT A1 10000000
5 EEE UK D4 11000000
6 EEE CH D4 3000000
7 EEE CH D4 6000000
8 AAA CH B2 8000000
fund_ids = list(df_extract['original_id'])
result = {}
for fund in fund_ids:
    temp = []
    df_funds = df_fofs[(df_fofs['id'] == fund )][['country', 'brand', 'holding']]
    domicile_fof = df_fofs[df_fofs['id'] == fund ][['country', 'holding']]
    df_funds = df_funds.groupby(['country', 'brand'])["holding"].sum()
    domicile_fof = domicile_fof.groupby('country')["holding"].sum()
    s = ''
    for i in range(len(df_funds)):
        row = df_funds.reset_index().iloc[i]
        if row['holding'] >= 5000000:
            s += row['country'] + ' (' + str(row['brand']) + ', ' + str(round(((row['holding'])/1000000), 2)) + '), '
    result[fund] = [s]
df_result = pd.DataFrame.from_dict(result, orient = 'index')
df_result.reset_index(inplace = True)
df_result.columns = ['FundID', 'Domicile (brand, AUM)']
df_result
FundID Domicile (brand, AUM)
0 A1 IT (BBB, 10.0), UK (BBB, 7.0),
1 B2 CH (AAA, 12.0),
2 C3 DE (CCC, 5.0),
3 D4 CH (EEE, 9.0), UK (EEE, 11.0),
You can merge the two tables on id, group by id and country to build the inner items, then group by id alone to build the outer level:
def f(x):
    # Format each row of one (id, country) group as "brand (holding in millions)"
    n = x.apply(lambda r: '{} ({})'.format(r['brand'],int(r['holding']/1000000)), axis=1)
    return '{} [{}]'.format(x.iloc[0]['country'],', '.join(n))

df_extract.merge(df_fofs, left_on='original_id', right_on='id') \
    .groupby(['original_id','country']).apply(f) \
    .groupby(level=0).apply(', '.join)
original_id
A1 IT [BBB (10)], UK [AAA (2), BBB (7)]
B2 CH [AAA (4), AAA (8)]
C3 DE [CCC (5)]
D4 CH [EEE (3), EEE (6)], UK [EEE (11)]
dtype: object
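The output above lists a repeated brand once per row (for example AAA (4), AAA (8) for B2). The following is a minimal sketch, assuming the intent is to sum holdings per brand before nesting them under each country. The sample frames are rebuilt inline from the question's data, and the names totals, per_country and summary are only illustrative.

```python
import pandas as pd

# Sample data copied from the question (holdings in full units)
df_fofs = pd.DataFrame(
    [
        ("AAA", "UK", "A1", 2_000_000),
        ("AAA", "CH", "B2", 4_000_000),
        ("BBB", "UK", "A1", 7_000_000),
        ("CCC", "DE", "C3", 5_000_000),
        ("BBB", "IT", "A1", 10_000_000),
        ("EEE", "UK", "D4", 11_000_000),
        ("EEE", "CH", "D4", 3_000_000),
        ("EEE", "CH", "D4", 6_000_000),
        ("AAA", "CH", "B2", 8_000_000),
    ],
    columns=["brand", "country", "id", "holding"],
)
df_extract = pd.DataFrame({"original_id": ["A1", "B2", "C3", "D4"]})

merged = df_extract.merge(df_fofs, left_on="original_id", right_on="id")

# Step 1: one row per (fund, country, brand) with the summed holding
totals = merged.groupby(
    ["original_id", "country", "brand"], as_index=False
)["holding"].sum()
# Whole millions, truncated (same effect as int(holding / 1000000) here)
totals["item"] = (
    totals["brand"] + " (" + (totals["holding"] // 1_000_000).astype(str) + ")"
)

# Step 2: inner level, join the brand items of each (fund, country)
per_country = (
    totals.groupby(["original_id", "country"])["item"].agg(", ".join).reset_index()
)
per_country["entry"] = per_country["country"] + " [" + per_country["item"] + "]"

# Step 3: outer level, join the country entries of each fund
summary = per_country.groupby("original_id")["entry"].agg(", ".join)
print(summary)
```

With this ordering of steps, B2 should come out as CH [AAA (12)] rather than listing 4 and 8 separately.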
Try this:
>>> df
brand country id holding
0 AAA UK A1 2
1 AAA CH B2 4
2 BBB UK A1 7
3 CCC DE C3 5
4 BBB IT A1 10
5 EEE UK D4 11
6 EEE CH D4 3
7 EEE CH D4 6
8 AAA CH B2 8
>>> # Sum holdings per (id, country, brand)
>>> final_df = df.groupby(by=['id', 'country', 'brand'], as_index=False)['holding'].sum()
>>> final_df.groupby(by='id')\
...     .apply(lambda x: ", ".join([f"{row['country']} [{row['brand']}({row['holding']})]"
...                                 for _, row in x.iterrows()]))
id
A1 IT [BBB(10)], UK [AAA(2)], UK [BBB(7)]
B2 CH [AAA(12)]
C3 DE [CCC(5)]
D4 CH [EEE(9)], UK [EEE(11)]
dtype: object
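The loop in the question also drops positions below 5,000,000, which the snippets above do not do. Here is a minimal sketch of that filter without an explicit loop, assuming the cutoff applies to the summed holding per (id, country, brand), as in the question's code. MIN_HOLDING and the column names aum_m and entry are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question (holdings in full units)
df = pd.DataFrame(
    [
        ("AAA", "UK", "A1", 2_000_000),
        ("AAA", "CH", "B2", 4_000_000),
        ("BBB", "UK", "A1", 7_000_000),
        ("CCC", "DE", "C3", 5_000_000),
        ("BBB", "IT", "A1", 10_000_000),
        ("EEE", "UK", "D4", 11_000_000),
        ("EEE", "CH", "D4", 3_000_000),
        ("EEE", "CH", "D4", 6_000_000),
        ("AAA", "CH", "B2", 8_000_000),
    ],
    columns=["brand", "country", "id", "holding"],
)

MIN_HOLDING = 5_000_000  # cutoff taken from the question's loop

# Sum first, then filter, so that 3M + 6M for D4/CH/EEE survives as 9M
totals = df.groupby(["id", "country", "brand"], as_index=False)["holding"].sum()
kept = totals[totals["holding"] >= MIN_HOLDING].copy()

# Express the holding in millions, rounded to 2 decimals as in the question
kept["aum_m"] = (kept["holding"] / 1_000_000).round(2)
kept["entry"] = (
    kept["country"] + " (" + kept["brand"] + ", " + kept["aum_m"].astype(str) + ")"
)

filtered = kept.groupby("id")["entry"].agg(", ".join)
print(filtered)
```

A fund whose positions all fall below the cutoff would not appear in filtered at all; reindexing on the list of fund ids would bring it back as a missing value if that matters.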
What you are looking for is the function pandas.DataFrame.pivot_table.
Documentation here: pandas/pivot_table.
This code solves your example (instead of a list, I used a MultiIndex):
import pandas as pd
df = pd.DataFrame([
    ('AAA','UK','A1',2000000),
    ('AAA','CH','B2',4000000),
    ('BBB','UK','A1',7000000),
    ('CCC','DE','C3',5000000),
    ('BBB','IT','A1',10000000),
    ('EEE','UK','D4',11000000),
    ('EEE','CH','D4',3000000),
    ('EEE','CH','D4',6000000),
    ('AAA','CH','B2',8000000)],
    columns=['brand', 'country', 'id', 'holding'])
# aggfunc='sum' totals repeated (id, country, brand) rows; the default is the mean
df.pivot_table(values='holding',index=['id','country','brand'],aggfunc='sum')
The result is:
[image: resulting dataframe]
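One detail worth checking with pivot_table: it aggregates with the mean unless aggfunc is given, so summed holdings need aggfunc='sum' passed explicitly. A small sketch comparing the two on the D4 rows from the question (the names idx, mean_table and sum_table are only illustrative):

```python
import pandas as pd

# Only the D4 rows from the question; CH/EEE appears twice (3M and 6M)
df = pd.DataFrame(
    [
        ("EEE", "UK", "D4", 11_000_000),
        ("EEE", "CH", "D4", 3_000_000),
        ("EEE", "CH", "D4", 6_000_000),
    ],
    columns=["brand", "country", "id", "holding"],
)

idx = ["id", "country", "brand"]

# Default aggregation is the mean of the duplicated rows
mean_table = df.pivot_table(values="holding", index=idx)
# Explicit sum gives the total holding per (id, country, brand)
sum_table = df.pivot_table(values="holding", index=idx, aggfunc="sum")

print(mean_table.loc[("D4", "CH", "EEE"), "holding"])  # 4500000.0
print(sum_table.loc[("D4", "CH", "EEE"), "holding"])   # 9000000
```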