How to compare 2 DataFrames on set columns using Groupby where the data is not organized ideally in one of them

Question

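The problem I am trying to solve is reconciling the rates applied to actual accounts, shown in "AccountTable", against the rates that should be set, shown in "RateTable". For each account, the rate can be set at different levels: either at the account level or at the parent level. Multiple accounts can be linked to the same parent but differ by currency, for example. I can already do the comparison, but my solution involves a lot of repeated code and does not scale: this example looks at only 2 different groupings, whereas I could have up to 9 different grouping combinations to compare.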

Here is the sample AccountTable:

import pandas as pd
import numpy as np
AccountTable = pd.DataFrame([[1234567890,456,'EUR',3.5],
                    [7854567890,15,'USD',2.7],
                    [9632587415,56,'GBP',1.4]],
columns = ['Account','ParentID','Cur','Rate'])
AccountTable

Output:

      Account  ParentID  Cur  Rate
0  1234567890       456  EUR   3.5
1  7854567890        15  USD   2.7
2  9632587415        56  GBP   1.4

Here is the RateTable:

RateTable = pd.DataFrame([['Account',1234567890,'EUR',3.5], # rate set at account level; should return a match
                    ['ParentID',456,'EUR',3.5], # should be unused, as a match is found at account level
                    ['ParentID',15,'USD',2.7], # rate set at parent level and matches
                    ['ParentID',15,'CAD',1.5], # CAD not in AccountTable, therefore unused
                    ['Account',9876542190,'EUR',3.5], # account not in AccountTable, therefore unused
                    ['ParentID',56,'GBP',1.5]], # rate set at parent level but rates don't match, so this should return a mismatch
columns = ['Level_Type','ID','Cur','Set_Rate'])

Output:

  Level_Type          ID  Cur  Set_Rate
0    Account  1234567890  EUR       3.5
1   ParentID         456  EUR       3.5
2   ParentID          15  USD       2.7
3   ParentID          15  CAD       1.5
4    Account  9876542190  EUR       3.5
5   ParentID          56  GBP       1.5

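My solution is below. I split RateTable into multiple DataFrames according to the different levels; in this example there are 2, the account level and the parent level. I then join each of them to AccountTable separately (using pd.merge in the code below) and compare the rates.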

option1 = ['Account']
option2 = ['ParentID']
AccountView = RateTable[RateTable['Level_Type'].isin(option1)]
ParentView = RateTable[RateTable['Level_Type'].isin(option2)]
AccountView = AccountView.rename(columns={'Set_Rate':'Account_Set_Rate'})
ParentView = ParentView.rename(columns={'Set_Rate':'Parent_Set_Rate'})
AccountView = AccountView.rename(columns={'ID':'Account_ID'})
ParentView = ParentView.rename(columns={'ID':'Parent_ID'})
# new view to identify matches at Account level Only 
df = pd.merge(AccountTable, AccountView, left_on=['Account','Cur'], right_on=['Account_ID','Cur'], how='left')
df['Account_level_RateMatch'] = np.where(df['Rate'] == df['Account_Set_Rate'],'1','0').astype(int)

      Account  ParentID  Cur  Rate Level_Type    Account_ID  Account_Set_Rate  Account_level_RateMatch
0  1234567890       456  EUR   3.5    Account  1.234568e+09               3.5                        1
1  7854567890        15  USD   2.7        NaN           NaN               NaN                        0
2  9632587415        56  GBP   1.4        NaN           NaN               NaN                        0

The same again, but now matching at the parent level:

df = pd.merge(AccountTable, ParentView, left_on=['ParentID','Cur'], right_on=['Parent_ID','Cur'], how='left')
df['Parent_level_RateMatch'] = np.where(df['Rate'] == df['Parent_Set_Rate'],'1','0').astype(int) # compare rates

Output:

      Account  ParentID  Cur  Rate Level_Type  Parent_ID  Parent_Set_Rate  Parent_level_RateMatch
0  1234567890       456  EUR   3.5   ParentID        456              3.5                       1
1  7854567890        15  USD   2.7   ParentID         15              2.7                       1
2  9632587415        56  GBP   1.4   ParentID         56              1.5                       0

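I need a better way to compare the account rates against the RateTable than looking at each level separately. In addition, the logic needs to be that if a match is found at the first level (the account level), it stops there and does not check the next level (the parent level). For example, the first row (account 1234567890) matches at both the account level and the parent level, so it should be reported at the account level.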

Any ideas or solutions would be much appreciated.

Desired output:

      Account  ParentID  Cur  Rate  IsMatch LevelFound
0  1234567890       456  EUR   3.5        1    Account
1  7854567890        15  USD   2.7        1     Parent
2  9632587415        56  GBP   1.4        0     Parent

Answer 1

EDIT: a solution similar to the original one, but better suited to the OP's expected output.

# define the order of the levels
ordered_levels = ['Account','ParentID'] 

# find all the rows where a rate is set, level by level
res = (
    pd.concat(
        [AccountTable
           .merge(RateTable.loc[RateTable['Level_Type'].eq(lvl), # rows for the current level
                                 # columns for comparison with AccountTable
                                ['ID','Cur','Set_Rate']]
                           .rename(columns={'ID':lvl, 'Cur':'Cur_opt'}), 
                 on=lvl, how='inner')
           .query('Cur == Cur_opt') #EDIT to query same cur
           .assign(LevelFound=lvl, 
                   #EDIT if rate not the same then 0
                   Is_Match=lambda x: x['Rate'].eq(x['Set_Rate']).astype(int)) 
         for lvl in ordered_levels]) # do the merge operation on each level
    #EDIT for selecting first 1 if any, then first 0
    # (mergesort is stable, so ties keep the level order from ordered_levels)
    .sort_values('Is_Match', ascending=False, kind='mergesort') 
    # keep the first match per initial AccountTable row, or else the highest-level non-match
    .drop_duplicates(ordered_levels) 
    [AccountTable.columns.tolist() + ['LevelFound','Is_Match']] 
)
print(res)
#       Account  ParentID  Cur  Rate LevelFound  Is_Match
# 0  1234567890       456  EUR   3.5    Account         1
# 1  7854567890        15  USD   2.7   ParentID         1
# 3  9632587415        56  GBP   1.4   ParentID         0
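
With up to 9 levels mentioned in the question, the same idea could be wrapped in a function. The following is only a sketch under assumptions: the name reconcile_rates and the helper column Priority are hypothetical, the currency is matched by merging on [level, 'Cur'] instead of a later query, and, like the code above, a match at any level wins over a mismatch at a higher-priority level. The final left merge also keeps accounts that have no rate set at any level (they get NaN).

```python
import pandas as pd

AccountTable = pd.DataFrame(
    [[1234567890, 456, 'EUR', 3.5],
     [7854567890, 15, 'USD', 2.7],
     [9632587415, 56, 'GBP', 1.4]],
    columns=['Account', 'ParentID', 'Cur', 'Rate'])

RateTable = pd.DataFrame(
    [['Account', 1234567890, 'EUR', 3.5],
     ['ParentID', 456, 'EUR', 3.5],
     ['ParentID', 15, 'USD', 2.7],
     ['ParentID', 15, 'CAD', 1.5],
     ['Account', 9876542190, 'EUR', 3.5],
     ['ParentID', 56, 'GBP', 1.5]],
    columns=['Level_Type', 'ID', 'Cur', 'Set_Rate'])


def reconcile_rates(accounts, rates, ordered_levels):
    """Hypothetical helper: pick, per account row, the best level found."""
    key_cols = accounts.columns.tolist()
    candidates = []
    for priority, lvl in enumerate(ordered_levels):
        # rates set at this level, with ID renamed to the matching account column
        level_rates = (rates.loc[rates['Level_Type'].eq(lvl), ['ID', 'Cur', 'Set_Rate']]
                            .rename(columns={'ID': lvl}))
        merged = accounts.merge(level_rates, on=[lvl, 'Cur'], how='inner')
        merged['LevelFound'] = lvl
        merged['Priority'] = priority  # lower number = checked first
        merged['Is_Match'] = merged['Rate'].eq(merged['Set_Rate']).astype(int)
        candidates.append(merged)
    best = (pd.concat(candidates, ignore_index=True)
              # matches first, then the earliest level among equals
              .sort_values(['Is_Match', 'Priority'], ascending=[False, True])
              .drop_duplicates(subset=key_cols)
              [key_cols + ['LevelFound', 'Is_Match']])
    # left merge restores the original row order and keeps unmatched accounts
    return accounts.merge(best, on=key_cols, how='left')


res = reconcile_rates(AccountTable, RateTable, ['Account', 'ParentID'])
print(res)
```

Adding a level would then only mean appending its name to the list, provided AccountTable has a column with that name and RateTable uses the same name in Level_Type.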

Original solution

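Here is a solution. You first need to define the order of the levels. You can then loop over each level, select the wanted rows in RateTable, merge them with the accounts, and keep only the rows where both currency and rate match (query). Then concat all the matched data and keep only the first match for each initial row of AccountTable.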

# define the order of the levels
ordered_levels = ['Account','ParentID'] 

# find all the matching rates
matched = (
    pd.concat(
        [AccountTable
           .merge(RateTable.loc[RateTable['Level_Type'].eq(lvl), # rows for the current level
                                # columns for comparison with AccountTable
                                ['ID','Cur','Set_Rate']] 
                           .rename(columns={'ID':lvl, 'Cur':'Cur_opt'}), 
                 on=lvl, how='inner')
           # keep only the matching data
           .query('Cur == Cur_opt and Rate == Set_Rate')
           # add the two columns for the output
           .assign(LevelFound=lvl, Is_Match=1)
         for lvl in ordered_levels]) # do the merge operation on each level
    .drop_duplicates(ordered_levels) # keep the first match per initial AccountTable row
    [AccountTable.columns.tolist() + ['LevelFound','Is_Match']] 
)
print(matched) # note that the row without a match is missing
#       Account  ParentID  Cur  Rate LevelFound  Is_Match
# 0  1234567890       456  EUR   3.5    Account         1
# 1  7854567890        15  USD   2.7   ParentID         1

If you want to add the rows without a match, you can do this:

res = AccountTable.merge(matched, how='left')
print(res)
#       Account  ParentID  Cur  Rate LevelFound  Is_Match
# 0  1234567890       456  EUR   3.5    Account       1.0
# 1  7854567890        15  USD   2.7   ParentID       1.0
# 2  9632587415        56  GBP   1.4        NaN       NaN
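
The left merge leaves NaN in Is_Match for the unmatched row, which also turns the column into floats. If a 0/1 integer flag is preferred, as in the question's desired output, one possible follow-up step is sketched below. The frame is rebuilt by hand here only so that the snippet runs on its own; LevelFound stays missing because this version of the solution does not record a level for non-matching rows.

```python
import pandas as pd

# result of the left merge above, rebuilt by hand for a standalone example
res = pd.DataFrame({
    'Account': [1234567890, 7854567890, 9632587415],
    'ParentID': [456, 15, 56],
    'Cur': ['EUR', 'USD', 'GBP'],
    'Rate': [3.5, 2.7, 1.4],
    'LevelFound': ['Account', 'ParentID', None],
    'Is_Match': [1.0, 1.0, float('nan')],
})

# rows with no match get 0, and the column goes back to integers
res['Is_Match'] = res['Is_Match'].fillna(0).astype(int)
print(res)
```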

pandas

numpy

group-by

transpose

pandas-melt