Pandas: concatenate multilevel index such that the other DataFrame has different sorting
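Suppose you have the following two DataFrames, dfs and df1.
dfs is defined by
import pandas as pd

dfs = pd.DataFrame({
    'Year': [2006, 2006, 2006, 2006, 2006],
    'Provider': list('abcsd'),
    'Accepted': [2570, 1020, 2140, 120, 15]
})
# Grouping makes (Year, Provider) a two-level MultiIndex
dfs = dfs.groupby(['Year', 'Provider']).sum()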
df1 is defined by
df1 = pd.DataFrame({
    'Year': [2006, 2006, 2006, 2006, 2006],
    'Provider': list('aabbc'),
    'Gender': list('mfmfm'),
    'Accepted app': ['990', '1180', '435', '405', '985']
})
I want to merge these two DataFrames to get something like
df2 = pd.DataFrame({
    'Year': [2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006],
    'Provider': list('abcsdabcsd'),
    'Accepted': [2570, 1020, 2140, 120, 15, 2570, 1020, 2140, 120, 15],
    'Gender': ['m', 'm', 'm', 'Nan', 'Nan', 'f', 'f', 'Nan', 'Nan', 'Nan'],
    'Accepted app': ['990', '435', '985', 'Nan', 'Nan', '1180', '405', 'Nan', 'Nan', 'Nan']
})
I don't know how to keep the multilevel index of dfs, or how to merge the two.
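You can repeat the rows of dfs once per unique value of Gender with concat, use DataFrame.reset_index to turn the index levels into columns, and then merge with a left join:
df = (pd.concat([dfs.assign(Gender=c) for c in df1['Gender'].unique()])
        .reset_index()
        .merge(df1, on=['Provider', 'Year', 'Gender'], how='left'))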
print(df)
   Year Provider  Accepted Gender Accepted app
0  2006        a      2570      m          990
1  2006        b      1020      m          435
2  2006        c      2140      m          985
3  2006        d        15      m          NaN
4  2006        s       120      m          NaN
5  2006        a      2570      f         1180
6  2006        b      1020      f          405
7  2006        c      2140      f          NaN
8  2006        d        15      f          NaN
9  2006        s       120      f          NaN
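To see what each link of the chain contributes, the same pipeline can be split into named steps. This is only a sketch of the approach above on the question's sample data; the intermediate names genders, repeated, flat and merged are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

dfs = pd.DataFrame({
    'Year': [2006, 2006, 2006, 2006, 2006],
    'Provider': list('abcsd'),
    'Accepted': [2570, 1020, 2140, 120, 15],
}).groupby(['Year', 'Provider']).sum()

df1 = pd.DataFrame({
    'Year': [2006, 2006, 2006, 2006, 2006],
    'Provider': list('aabbc'),
    'Gender': list('mfmfm'),
    'Accepted app': ['990', '1180', '435', '405', '985'],
})

# Step 1: build one copy of dfs per unique Gender value and stack the copies.
# The (Year, Provider) MultiIndex is still in place at this point.
genders = df1['Gender'].unique()
repeated = pd.concat([dfs.assign(Gender=g) for g in genders])

# Step 2: move the index levels into ordinary columns so that
# Year and Provider can be used as merge keys.
flat = repeated.reset_index()

# Step 3: left join, so every row of flat survives and rows without
# a partner in df1 get NaN in 'Accepted app'.
merged = flat.merge(df1, on=['Provider', 'Year', 'Gender'], how='left')

print(repeated.shape, flat.shape, merged.shape)
```

Because the join is a left join, the row count is fixed by step 1 (rows of dfs times the number of unique genders); df1 only contributes the extra column.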
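If you also want Gender to be missing in the rows that have no match in df1, you can identify where each row came from with the indicator=True parameter of DataFrame.merge, and then replace Gender with a missing value in the rows marked left_only:
df = (pd.concat([dfs.assign(Gender=c) for c in df1['Gender'].unique()])
        .reset_index()
        .merge(df1, on=['Provider', 'Year', 'Gender'], how='left', indicator=True)
        .assign(Gender=lambda x: x['Gender'].mask(x['_merge'].eq('left_only')))
        .drop('_merge', axis=1))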
print(df)
   Year Provider  Accepted Gender Accepted app
0  2006        a      2570      m          990
1  2006        b      1020      m          435
2  2006        c      2140      m          985
3  2006        d        15    NaN          NaN
4  2006        s       120    NaN          NaN
5  2006        a      2570      f         1180
6  2006        b      1020      f          405
7  2006        c      2140    NaN          NaN
8  2006        d        15    NaN          NaN
9  2006        s       120    NaN          NaN
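The question also asks about keeping the multilevel index of dfs. Both results above come back with a plain integer index, because reset_index turned Year and Provider into columns. One possible way to get the two index levels back, assuming that is what is wanted, is to finish the chain with set_index. The sketch below adds that step to the second solution; the name out is illustrative only.

```python
import pandas as pd

dfs = pd.DataFrame({
    'Year': [2006, 2006, 2006, 2006, 2006],
    'Provider': list('abcsd'),
    'Accepted': [2570, 1020, 2140, 120, 15],
}).groupby(['Year', 'Provider']).sum()

df1 = pd.DataFrame({
    'Year': [2006, 2006, 2006, 2006, 2006],
    'Provider': list('aabbc'),
    'Gender': list('mfmfm'),
    'Accepted app': ['990', '1180', '435', '405', '985'],
})

out = (pd.concat([dfs.assign(Gender=c) for c in df1['Gender'].unique()])
         .reset_index()
         .merge(df1, on=['Provider', 'Year', 'Gender'], how='left', indicator=True)
         # Blank out Gender where the row exists only on the left side.
         .assign(Gender=lambda x: x['Gender'].mask(x['_merge'].eq('left_only')))
         .drop(columns='_merge')
         # Assumption: the original index levels should be restored.
         .set_index(['Year', 'Provider']))

print(out)
```

Note that the restored index is no longer unique, since each (Year, Provider) pair now appears once per gender. If sorted lookups matter, calling out.sort_index() afterwards groups the repeated keys together.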