How to join dataframes with multiple IDs?
I have two dataframes and a rather tricky join to accomplish.
The first dataframe:
import pandas as pd

data = [[0, 'Standard1', [100, 101, 102]], [1, 'Standard2', [100, 102]], [2, 'Standard3', [103]]]
df1 = pd.DataFrame(data, columns=['RuleSetID', 'RuleSetName', 'KeyWordGroupID'])
df1
Output:
RuleSetID RuleSetName KeyWordGroupID
0 Standard1 [100, 101, 102]
1 Standard2 [100, 102]
2 Standard3 [103]
... ... ...
The second:
data = [[100, 'verahren', ['word1', 'word2']],
[101, 'flaechen', ['word3']],
[102, 'nutzung', ['word4', 'word5']],
[103, 'ort', ['word6', 'word7']]]
df2 = pd.DataFrame(data, columns = ['KeyWordGroupID', 'KeyWordGroupName', 'KeyWords'])
df2
Output:
KeyWordGroupID KeyWordGroupName KeyWords
100 verahren ['word1', 'word2']
101 flaechen ['word3']
102 nutzung ['word4', 'word5']
103 ort ['word6', 'word7']
... ... ...
Desired output:
RuleSetID RuleSetName KeyWordGroupID
0 Standard1 [['word1', 'word2'], ['word3'], ['word4', 'word5']]
1 Standard2 [['word1', 'word2'], ['word4', 'word5']]
2 Standard3 [['word6', 'word7']]
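I tried using df.to_dict('records') to convert the second dataframe into a dictionary and passing it to a pandas apply with a user-defined function that matches by key, but that does not seem like a clean approach.
Does anyone have a way to solve this? Any ideas are appreciated.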
I think you have a couple of different options:
- You can create a dictionary and use map
- You can convert the lists to strings and use replace
Option 1
e = df1.explode('KeyWordGroupID')  # explode your frame: one row per KeyWordGroupID
# create a dictionary from KeyWords and map it to the KeyWordGroupID
e['KeyWords'] = e['KeyWordGroupID'].map(df2.set_index('KeyWordGroupID')['KeyWords'].to_dict())
# collect the keyword lists per RuleSetID and merge them back onto df1
new_df = df1.merge(e.groupby('RuleSetID')['KeyWords'].agg(list), right_index=True, left_on='RuleSetID')
RuleSetID RuleSetName KeyWordGroupID \
0 0 Standard1 [100, 101, 102]
1 1 Standard2 [100, 102]
2 2 Standard3 [103]
KeyWords
0 [[word1, word2], [word3], [word4, word5]]
1 [[word1, word2], [word4, word5]]
2 [[word6, word7]]
@Corralien solved this with pandas. Here I would like to introduce a more concise way using datar, a re-imagination of pandas APIs:
>>> import pandas as pd
>>> from datar.all import f, unchop, left_join, group_by, summarise
>>>
>>> (
...     df1
...     >> unchop(f.KeyWordGroupID)  # One KeyWordGroupID per row
...     >> left_join(df2, by=f.KeyWordGroupID)  # Attach df2 by KeyWordGroupID
...     >> group_by(f.RuleSetID, f.RuleSetName)
...     >> summarise(KeyWords = f.KeyWords.agg(pd.Series))  # Concatenate the KeyWords
... )
[2022-03-28 13:52:38][datar][ INFO] `summarise()` has grouped output by ['RuleSetID'] (override with `_groups` argument)
RuleSetID RuleSetName KeyWords
<int64> <object> <object>
0 0 Standard1 [[word1, word2], [word3], [word4, word5]]
1 1 Standard2 [[word1, word2], [word4, word5]]
2 2 Standard3 [word6, word7]
[TibbleGrouped: RuleSetID (n=3)]
The same idea with pandas itself:
(
df1
.explode("KeyWordGroupID")
.merge(df2, how="left", on="KeyWordGroupID")
.groupby(["RuleSetID", "RuleSetName"])
.agg({"KeyWords": pd.Series})
.reset_index()
)
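The datar output printed above shows Standard3 as a flat [word6, word7] rather than a nested list. If consistently nested lists are wanted, as in the desired output, one option is to aggregate with list instead of pd.Series. The following is only a sketch of that variant of the chain above; the astype cast is an added precaution (explode leaves the column as object dtype) and the name result is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[0, "Standard1", [100, 101, 102]], [1, "Standard2", [100, 102]], [2, "Standard3", [103]]],
    columns=["RuleSetID", "RuleSetName", "KeyWordGroupID"],
)
df2 = pd.DataFrame(
    [[100, "verahren", ["word1", "word2"]], [101, "flaechen", ["word3"]],
     [102, "nutzung", ["word4", "word5"]], [103, "ort", ["word6", "word7"]]],
    columns=["KeyWordGroupID", "KeyWordGroupName", "KeyWords"],
)

result = (
    df1
    .explode("KeyWordGroupID")                 # one row per (rule set, group ID)
    .astype({"KeyWordGroupID": "int64"})       # precaution: match df2's integer key dtype
    .merge(df2, how="left", on="KeyWordGroupID")
    .groupby(["RuleSetID", "RuleSetName"])
    .agg({"KeyWords": list})                   # list keeps one nested list per group ID
    .reset_index()
)
print(result)
```

With list as the aggregator, a rule set that has a single group ID still gets a list containing one keyword list.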
The main idea is to convert df2 into a dict-like mapping Series, where the key is the KeyWordGroupID column and the value is the KeyWords column.
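To make the mapping Series concrete, here is a small sketch of what it looks like and how map uses it. The names mapping, ids, and looked_up are made up for illustration.

```python
import pandas as pd

df2 = pd.DataFrame(
    [[100, "verahren", ["word1", "word2"]], [101, "flaechen", ["word3"]],
     [102, "nutzung", ["word4", "word5"]], [103, "ort", ["word6", "word7"]]],
    columns=["KeyWordGroupID", "KeyWordGroupName", "KeyWords"],
)

# Index = KeyWordGroupID, values = KeyWords: behaves like a lookup table
mapping = df2.set_index("KeyWordGroupID")["KeyWords"]
print(mapping.loc[101])          # the keyword list stored for group 101

# Series.map looks each value up in the index of the mapping Series
ids = pd.Series([100, 102])
looked_up = ids.map(mapping)
print(looked_up.tolist())
```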
You can use explode to flatten the KeyWordGroupID column of df1, then map it to df2, and then groupby to reshape your first dataframe:
df1['KeyWordGroupID'] = (
df1['KeyWordGroupID'].explode().map(df2.set_index('KeyWordGroupID')['KeyWords'])
.groupby(level=0).apply(list)
)
print(df1)
# Output
RuleSetID RuleSetName KeyWordGroupID
0 0 Standard1 [[word1, word2], [word3], [word4, word5]]
1 1 Standard2 [[word1, word2], [word4, word5]]
2 2 Standard3 [[word6, word7]]
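For readers who want to see what each step of that chain produces, the sketch below splits it into intermediate variables. It writes the result to a new KeyWords column via assign instead of overwriting KeyWordGroupID; that is a deliberate change from the snippet above, and the names exploded, mapped, keywords, and result are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[0, "Standard1", [100, 101, 102]], [1, "Standard2", [100, 102]], [2, "Standard3", [103]]],
    columns=["RuleSetID", "RuleSetName", "KeyWordGroupID"],
)
df2 = pd.DataFrame(
    [[100, "verahren", ["word1", "word2"]], [101, "flaechen", ["word3"]],
     [102, "nutzung", ["word4", "word5"]], [103, "ort", ["word6", "word7"]]],
    columns=["KeyWordGroupID", "KeyWordGroupName", "KeyWords"],
)

# Step 1: one element per row; the original row label is repeated in the index
exploded = df1["KeyWordGroupID"].explode()

# Step 2: replace each group ID by its keyword list
mapped = exploded.map(df2.set_index("KeyWordGroupID")["KeyWords"])

# Step 3: regroup by the original row label and collect the lists
keywords = mapped.groupby(level=0).agg(list)

# Attach as a new column; assign aligns on the index and leaves df1 untouched
result = df1.assign(KeyWords=keywords)
print(result)
```

Keeping the original KeyWordGroupID column makes it easier to check the result against the IDs it came from.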