How to join dataframes with multiple IDs?

I have two dataframes and a rather tricky join to accomplish.

The first dataframe:

import pandas as pd

data = [[0, 'Standard1', [100, 101, 102]], [1, 'Standard2', [100, 102]], [2, 'Standard3', [103]]]

df1 = pd.DataFrame(data, columns=['RuleSetID', 'RuleSetName', 'KeyWordGroupID'])
df1

Output:

RuleSetID   RuleSetName    KeyWordGroupID
    0         Standard1    [100, 101, 102]
    1         Standard2    [100, 102]
    2         Standard3    [103]
   ...         ...          ... 

The second one:

data = [[100, 'verahren', ['word1', 'word2']], 
        [101, 'flaechen', ['word3']], 
        [102, 'nutzung', ['word4', 'word5']],
        [103, 'ort', ['word6', 'word7']]]
 
df2 = pd.DataFrame(data, columns = ['KeyWordGroupID', 'KeyWordGroupName', 'KeyWords'])
df2

Output:

KeyWordGroupID  KeyWordGroupName    KeyWords
    100               verahren      ['word1', 'word2']
    101               flaechen      ['word3']
    102               nutzung       ['word4', 'word5']
    103               ort           ['word6', 'word7']
    ...               ...            ...

Expected output:

RuleSetID   RuleSetName    KeyWordGroupID
    0         Standard1    [['word1', 'word2'], ['word3'], ['word4', 'word5']]
    1         Standard2    [['word1', 'word2'], ['word4', 'word5']]
    2         Standard3    [['word6', 'word7']]

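I tried converting the second dataframe to a dictionary with df.to_dict('records') and passing it to a user-defined function through pandas apply to match on the key values, but that does not seem like a clean approach.

Does anyone have a way to solve this? Any ideas are appreciated.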

I think you have a couple of different options:

  1. You can create a dictionary and use map
  2. You can convert the lists to strings and use replace

Option 1

e = df1.explode('KeyWordGroupID')  # explode the frame: one row per KeyWordGroupID
# create a dictionary from KeyWords and map it to the KeyWordGroupID
e['KeyWords'] = e['KeyWordGroupID'].map(df2.set_index('KeyWordGroupID')['KeyWords'].to_dict())
# merge df1 with e
new_df = df1.merge(e.groupby('RuleSetID')['KeyWords'].agg(list), right_index=True, left_on='RuleSetID')

   RuleSetID RuleSetName   KeyWordGroupID  \
0          0   Standard1  [100, 101, 102]   
1          1   Standard2       [100, 102]   
2          2   Standard3            [103]   

                                    KeyWords  
0  [[word1, word2], [word3], [word4, word5]]  
1           [[word1, word2], [word4, word5]]  
2                           [[word6, word7]]  
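
Option 2 is only named in the list above, so here is a rough sketch of how it might look. It assumes the IDs never appear inside the keywords themselves and that the string form of each list can be parsed back with ast.literal_eval; the names patterns, as_text and replaced are just for illustration.

```python
import ast
import pandas as pd

df1 = pd.DataFrame(
    [[0, "Standard1", [100, 101, 102]], [1, "Standard2", [100, 102]], [2, "Standard3", [103]]],
    columns=["RuleSetID", "RuleSetName", "KeyWordGroupID"],
)
df2 = pd.DataFrame(
    [[100, "verahren", ["word1", "word2"]],
     [101, "flaechen", ["word3"]],
     [102, "nutzung", ["word4", "word5"]],
     [103, "ort", ["word6", "word7"]]],
    columns=["KeyWordGroupID", "KeyWordGroupName", "KeyWords"],
)

# One regex pattern per ID, mapped to the string form of its keyword list.
# The \b word boundaries keep an ID such as 10 from matching inside 100.
patterns = {
    rf"\b{group_id}\b": str(words)
    for group_id, words in zip(df2["KeyWordGroupID"], df2["KeyWords"])
}

# Turn each list of IDs into text, e.g. "[100, 101, 102]" (kept as object dtype).
as_text = df1["KeyWordGroupID"].map(str).astype(object)

# Substitute every ID inside the text with its keyword list.
replaced = as_text.replace(patterns, regex=True)

# Parse the text back into real Python lists.
df1["KeyWords"] = replaced.map(ast.literal_eval)
print(df1[["RuleSetID", "KeyWords"]])
```

This round trip through strings is fragile compared with Option 1, so I would only reach for it on small, well-behaved data.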

@Corralien solved this with pandas. Here I would like to show a neater way using datar, a re-imagination of pandas APIs:

>>> from datar.all import f, unchop, left_join, group_by, summarise
>>> 
>>> (
...     df1 
...     >> unchop(f.KeyWordGroupID)  # Put one KeyWordGroupID per row
...     >> left_join(df2, by=f.KeyWordGroupID)  # Attach df2 by KeyWordGroupIDs
...     >> group_by(f.RuleSetID, f.RuleSetName)
...     >> summarise(KeyWords = f.KeyWords.agg(pd.Series))  # Concatenate the KeyWords
... )
[2022-03-28 13:52:38][datar][   INFO] `summarise()` has grouped output by ['RuleSetID'] (override with `_groups` argument)
   RuleSetID RuleSetName                                   KeyWords
     <int64>    <object>                                   <object>
0          0   Standard1  [[word1, word2], [word3], [word4, word5]]
1          1   Standard2           [[word1, word2], [word4, word5]]
2          2   Standard3                             [word6, word7]
[TibbleGrouped: RuleSetID (n=3)]

The same idea with pandas itself:

(
  df1
  .explode("KeyWordGroupID")
  .merge(df2, how="left", on="KeyWordGroupID")
  .groupby(["RuleSetID", "RuleSetName"])
  .agg({"KeyWords": pd.Series})
  .reset_index()
)
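
If every row should hold a list of lists, including rule sets with a single keyword group as in the question's expected output, one variant is to aggregate with list. This is a minimal sketch that rebuilds df1 and df2 from the question; the astype step and the result name are my own additions, there to make the merge key an integer column again after explode.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[0, "Standard1", [100, 101, 102]], [1, "Standard2", [100, 102]], [2, "Standard3", [103]]],
    columns=["RuleSetID", "RuleSetName", "KeyWordGroupID"],
)
df2 = pd.DataFrame(
    [[100, "verahren", ["word1", "word2"]],
     [101, "flaechen", ["word3"]],
     [102, "nutzung", ["word4", "word5"]],
     [103, "ort", ["word6", "word7"]]],
    columns=["KeyWordGroupID", "KeyWordGroupName", "KeyWords"],
)

result = (
    df1
    .explode("KeyWordGroupID")                    # one row per (rule set, group ID)
    .astype({"KeyWordGroupID": "int64"})          # explode leaves an object column
    .merge(df2[["KeyWordGroupID", "KeyWords"]], how="left", on="KeyWordGroupID")
    .groupby(["RuleSetID", "RuleSetName"])["KeyWords"]
    .agg(list)                                    # collect the keyword lists per rule set
    .reset_index()
)
print(result)
```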

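The main idea is to convert df2 into a mapping Series, where the key is the KeyWordGroupID column and the value is the KeyWords column.

You can use explode to flatten the KeyWordGroupID column of df1, then map it through df2, then groupby to reshape your first dataframe: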

df1['KeyWordGroupID'] = (
    df1['KeyWordGroupID'].explode().map(df2.set_index('KeyWordGroupID')['KeyWords'])
                         .groupby(level=0).apply(list)
)
print(df1)

# Output
   RuleSetID RuleSetName                             KeyWordGroupID
0          0   Standard1  [[word1, word2], [word3], [word4, word5]]
1          1   Standard2           [[word1, word2], [word4, word5]]
2          2   Standard3                           [[word6, word7]]
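
To make the chain easier to follow, here is the same idea split into intermediate steps. It is a sketch on the question's data and leaves df1 untouched; the names lookup, exploded, mapped and regrouped are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[0, "Standard1", [100, 101, 102]], [1, "Standard2", [100, 102]], [2, "Standard3", [103]]],
    columns=["RuleSetID", "RuleSetName", "KeyWordGroupID"],
)
df2 = pd.DataFrame(
    [[100, "verahren", ["word1", "word2"]],
     [101, "flaechen", ["word3"]],
     [102, "nutzung", ["word4", "word5"]],
     [103, "ort", ["word6", "word7"]]],
    columns=["KeyWordGroupID", "KeyWordGroupName", "KeyWords"],
)

# Step 1: mapping Series, index = KeyWordGroupID, values = KeyWords.
lookup = df2.set_index("KeyWordGroupID")["KeyWords"]

# Step 2: one ID per row; the index still repeats the original row labels of df1.
exploded = df1["KeyWordGroupID"].explode()

# Step 3: look up each ID in the mapping Series.
mapped = exploded.map(lookup)

# Step 4: group by the original row label (index level 0) and rebuild one list per row.
regrouped = mapped.groupby(level=0).apply(list)
print(regrouped)
```

Because the exploded Series keeps the original index, groupby(level=0) is enough to put the keyword lists back on the rows they came from.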