How to compare 2 DataFrames on set columns using Groupby where the data is not organized ideally in one of them
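The problem I am trying to solve is reconciling the rates applied to the actual accounts shown in "AccountTable" against the rates that should be set according to "RateTable". For each account, the rate can be set at different levels: either at the account level or at the parent level. Multiple accounts can be linked to the same parent but differ by currency, for example. I can do the comparison, but my solution involves a lot of repeated code and does not scale. This example only looks at 2 different groupings, and I may have up to 9 different grouping combinations to compare.
Here is the sample AccountTable: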
import pandas as pd
import numpy as np
AccountTable = pd.DataFrame([[1234567890,456,'EUR',3.5],
[7854567890,15,'USD',2.7],
[9632587415,56,'GBP',1.4]],
columns = ['Account','ParentID','Cur','Rate'])
AccountTable
Output:
Account ParentID Cur Rate
0 1234567890 456 EUR 3.5
1 7854567890 15 USD 2.7
2 9632587415 56 GBP 1.4
Here is the RateTable:
RateTable = pd.DataFrame([['Account',1234567890,'EUR',3.5], # rate set at account level and should return a match
                          ['ParentID',456,'EUR',3.5], # should be unused, as a match is found at account level
                          ['ParentID',15,'USD',2.7], # rate set at parent level and matches
                          ['ParentID',15,'CAD',1.5], # CAD not in AccountTable, therefore unused
                          ['Account',9876542190,'EUR',3.5], # account not in AccountTable, therefore unused
                          ['ParentID',56,'GBP',1.5]], # rate set at parent level but the rates don't match, so this should return a mismatch
                         columns = ['Level_Type','ID','Cur','Set_Rate'])
Output:
Level_Type ID Cur Set_Rate
0 Account 1234567890 EUR 3.5
1 ParentID 456 EUR 3.5
2 ParentID 15 USD 2.7
3 ParentID 15 CAD 1.5
4 Account 9876542190 EUR 3.5
5 ParentID 56 GBP 1.5
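My solution is as follows: I split RateTable into multiple DataFrames, one per level. In this example there are 2, the account level and the parent level. I then join each of them to AccountTable independently (with pd.merge in the code below) and compare the rates.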
option1 = ['Account']
option2 = ['ParentID']
AccountView = RateTable[RateTable['Level_Type'].isin(option1)]
ParentView = RateTable[RateTable['Level_Type'].isin(option2)]
AccountView = AccountView.rename(columns={'Set_Rate':'Account_Set_Rate'})
ParentView = ParentView.rename(columns={'Set_Rate':'Parent_Set_Rate'})
AccountView = AccountView.rename(columns={'ID':'Account_ID'})
ParentView = ParentView.rename(columns={'ID':'Parent_ID'})
# new view to identify matches at Account level Only
df = pd.merge(AccountTable, AccountView, left_on=['Account','Cur'], right_on=['Account_ID','Cur'], how='left')
df['Account_level_RateMatch'] = np.where(df['Rate'] == df['Account_Set_Rate'],'1','0').astype(int)
Account ParentID Cur Rate Level_Type Account_ID Account_Set_Rate Account_level_RateMatch
0 1234567890 456 EUR 3.5 Account 1.234568e+09 3.5 1
1 7854567890 15 USD 2.7 NaN NaN NaN 0
2 9632587415 56 GBP 1.4 NaN NaN NaN 0
The same as above, but now matching at the parent level:
df = pd.merge(AccountTable, ParentView, left_on=['ParentID','Cur'], right_on=['Parent_ID','Cur'], how='left')
df['Parent_level_RateMatch'] = np.where(df['Rate'] == df['Parent_Set_Rate'],'1','0').astype(int) # compare rates
Output:
Account ParentID Cur Rate Level_Type Parent_ID Parent_Set_Rate Parent_level_RateMatch
0 1234567890 456 EUR 3.5 ParentID 456 3.5 1
1 7854567890 15 USD 2.7 ParentID 15 2.7 1
2 9632587415 56 GBP 1.4 ParentID 56 1.5 0
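I need a better way to compare the account rates against the rate table than looking at each level separately. In addition, the logic needs to be that if a match is found at the first level (the account level), it stops there and does not check the next level, i.e. the parent level. For example, the first row (index 0) matches at both the account level and the parent level.
Any ideas or solutions would be much appreciated.
Desired output: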
Account ParentID Cur Rate IsMatch LevelFound
0 1234567890 456 EUR 3.5 1 Account
1 7854567890 15 USD 2.7 1 Parent
2 9632587415 56 GBP 1.4 0 Parent
EDIT: a solution similar to the original one, but better suited to the OP's expected output
# define the order of the levels (highest priority first)
ordered_levels = ['Account','ParentID']
# find, for each level, the rate rows that apply to each account
res = (
    pd.concat(
        [AccountTable
            .merge(RateTable.loc[RateTable['Level_Type'].eq(lvl), # rows of the current level
                                 # columns for comparison with AccountTable
                                 ['ID','Cur','Set_Rate']]
                            .rename(columns={'ID':lvl, 'Cur':'Cur_opt'}),
                   on=lvl, how='inner')
            .query('Cur == Cur_opt') # EDIT: keep only rows with the same currency
            .assign(LevelFound=lvl,
                    # EDIT: 1 if the rates are equal, otherwise 0
                    Is_Match=lambda x: x['Rate'].eq(x['Set_Rate']).astype(int))
         for lvl in ordered_levels]) # do the merge operation on each level
    # EDIT: put matches (1) first; a stable sort keeps the level order within ties
    .sort_values('Is_Match', ascending=False, kind='mergesort')
    # keep, per initial AccountTable row, the first match or else the highest-level non-match
    .drop_duplicates(ordered_levels)
    [AccountTable.columns.tolist() + ['LevelFound','Is_Match']]
)
print(res)
# Account ParentID Cur Rate LevelFound Is_Match
# 0 1234567890 456 EUR 3.5 Account 1
# 1 7854567890 15 USD 2.7 ParentID 1
# 3 9632587415 56 GBP 1.4 ParentID 0
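Since the question mentions up to 9 level combinations, the same idea can be wrapped in a function. The following is only a minimal sketch, not the answer's exact code: the function name `reconcile_rates` and the helper columns `_row` and `LevelRank` are made up for illustration, it merges on the level column and `Cur` together instead of merging and then calling `query`, and it assumes every level name in `ordered_levels` is also a column of the account table. Unlike the code above, accounts that have no rate row at any level are kept, with NaN in the two result columns.

```python
import pandas as pd

AccountTable = pd.DataFrame(
    [[1234567890, 456, 'EUR', 3.5],
     [7854567890, 15, 'USD', 2.7],
     [9632587415, 56, 'GBP', 1.4]],
    columns=['Account', 'ParentID', 'Cur', 'Rate'])

RateTable = pd.DataFrame(
    [['Account', 1234567890, 'EUR', 3.5],
     ['ParentID', 456, 'EUR', 3.5],
     ['ParentID', 15, 'USD', 2.7],
     ['ParentID', 15, 'CAD', 1.5],
     ['Account', 9876542190, 'EUR', 3.5],
     ['ParentID', 56, 'GBP', 1.5]],
    columns=['Level_Type', 'ID', 'Cur', 'Set_Rate'])


def reconcile_rates(accounts, rates, ordered_levels):
    """Hypothetical helper: pick one rate row per account across ordered levels."""
    # Temporary row id so each account row can be identified after the merges.
    acc = accounts.assign(_row=range(len(accounts)))
    candidates = []
    for rank, lvl in enumerate(ordered_levels):
        # Rate rows defined at this level, with ID renamed to the level column.
        level_rates = (rates.loc[rates['Level_Type'].eq(lvl), ['ID', 'Cur', 'Set_Rate']]
                       .rename(columns={'ID': lvl}))
        candidates.append(
            acc.merge(level_rates, on=[lvl, 'Cur'], how='inner')
               .assign(LevelFound=lvl, LevelRank=rank,
                       Is_Match=lambda d: d['Rate'].eq(d['Set_Rate']).astype(int)))
    # Matches first, then the earliest level; keep one candidate per account row.
    best = (pd.concat(candidates, ignore_index=True)
            .sort_values(['Is_Match', 'LevelRank'], ascending=[False, True])
            .drop_duplicates('_row'))
    # Left merge so accounts without any rate row are kept (NaN in both columns).
    return (acc.merge(best[['_row', 'LevelFound', 'Is_Match']], on='_row', how='left')
               .drop(columns='_row'))


res = reconcile_rates(AccountTable, RateTable, ['Account', 'ParentID'])
print(res)
```

Sorting on an explicit `LevelRank` column makes the level priority part of the sort key, so the result does not depend on the sort algorithm being stable. Adding a level should then only require another entry in `ordered_levels`.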
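Original solution
Here is a solution: you first need to define the order of the levels. Then you can loop over each level, select the wanted rows in RateTable, merge them with AccountTable, and keep only the rows where both the currency and the rate match (query). Finally, concat all the matched data and keep only the first match for each initial row of AccountTable.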
# define the order of the levels (highest priority first)
ordered_levels = ['Account','ParentID']
# find all the matching rates
matched = (
    pd.concat(
        [AccountTable
            .merge(RateTable.loc[RateTable['Level_Type'].eq(lvl), # rows of the current level
                                 # columns for comparison with AccountTable
                                 ['ID','Cur','Set_Rate']]
                            .rename(columns={'ID':lvl, 'Cur':'Cur_opt'}),
                   on=lvl, how='inner')
            # keep only the rows where both currency and rate match
            .query('Cur == Cur_opt and Rate == Set_Rate')
            # add the two columns for the output
            .assign(LevelFound=lvl, Is_Match=1)
         for lvl in ordered_levels]) # do the merge operation on each level
    .drop_duplicates(ordered_levels) # keep the first match per initial AccountTable row
    [AccountTable.columns.tolist() + ['LevelFound','Is_Match']]
)
print(matched) # note that the row without a match is missing
# Account ParentID Cur Rate LevelFound Is_Match
# 0 1234567890 456 EUR 3.5 Account 1
# 1 7854567890 15 USD 2.7 ParentID 1
If you want to add back the rows without a match, you can do this:
res = AccountTable.merge(matched, how='left')
print(res)
# Account ParentID Cur Rate LevelFound Is_Match
# 0 1234567890 456 EUR 3.5 Account 1.0
# 1 7854567890 15 USD 2.7 ParentID 1.0
# 2 9632587415 56 GBP 1.4 NaN NaN
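After the left merge, `Is_Match` becomes a float column because of the NaN in the unmatched row. If integer 0/1 flags are preferred, the NaN values can be filled, as in the short sketch below. The frame is rebuilt by hand here so the snippet runs on its own, and the `'NoMatch'` label is only a placeholder chosen for illustration. Note that this approach cannot report the level at which a mismatching rate was found, since those rows were dropped by `query`.

```python
import numpy as np
import pandas as pd

# Stand-in for the left-merge result shown above (the last row had no match).
res = pd.DataFrame({
    'Account': [1234567890, 7854567890, 9632587415],
    'ParentID': [456, 15, 56],
    'Cur': ['EUR', 'USD', 'GBP'],
    'Rate': [3.5, 2.7, 1.4],
    'LevelFound': ['Account', 'ParentID', np.nan],
    'Is_Match': [1.0, 1.0, np.nan],
})

cleaned = res.assign(
    # unmatched rows become 0, and the column goes back to integers
    Is_Match=res['Is_Match'].fillna(0).astype(int),
    # placeholder label (an assumption) for rows where no matching rate was found
    LevelFound=res['LevelFound'].fillna('NoMatch'),
)
print(cleaned)
```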