Pandas:根据字典中的值搜索输入数据

Pandas: Search input data against values in dictionary

我有 2 个数据源 - 配置和输入数据

配置

+------------+-------+
| ContractID | PB ID |
+------------+-------+
| H9500      | 002   |
+------------+-------+
| H9500      | 008   |
+------------+-------+
| H3544      | 800   |
+------------+-------+
| H3544      | 801   |
+------------+-------+
| H3544      | 802   |
+------------+-------+

输入文件

+-------+------------+-------+
| Index | ContractID | PB ID |
+-------+------------+-------+
| 1     | H9500      | 456   |
+-------+------------+-------+
| 2     | H9500      | 008   |
+-------+------------+-------+
| 3     | H9500      | 002   |
+-------+------------+-------+
| 4     | H3544      | 853   |
+-------+------------+-------+
| 5     | H3544      | 802   |
+-------+------------+-------+
| 6     | H4599      | 465   |
+-------+------------+-------+

我正在尝试遍历输入文件并检查合同 ID 是否有效。如果 Contract ID = H9500PB ID= 456 无效,因为该组合不存在于配置文件中。 Contract ID = H9500PB ID= 008 有效,因为该组合存在于配置文件中。

我正在使用以下逻辑

input_df=pd.DataFrame({'Index': [1,2,3,4,5,6], 
                      'ContractID' : ['H9500','H9500','H9500','H3544','H3544','H4599'],
                       'PBID': ['456','008','002','853','802','465']})

config_df=pd.DataFrame({'ContractID':['H9500','H9500','H3544','H3544','H3544'],
                        'PBID':['002','008','800','801','802']})

config_dict={k: list(v) for k,v in config_df.groupby("ContractID")["PBID"]}

def test_6_3(s):
    if config_dict.get(s["ContractID"]) and s["PBID"] not in config_dict.values():
        return "Invalid PB Contract"
    else:
        return "Valid"
    
input_df['test_6_3'] = input_df.apply(test_6_3, axis=1)

input_df

但我没有得到预期的结果

+-------+------------+-------+----------+
| Index | ContractID | PB ID | test_6_3 |
+-------+------------+-------+----------+
| 1     | H9500      | 456   | Invalid  |
+-------+------------+-------+----------+
| 2     | H9500      | 008   | Valid    |
+-------+------------+-------+----------+
| 3     | H9500      | 002   | Valid    |
+-------+------------+-------+----------+
| 4     | H3544      | 853   | Invalid  |
+-------+------------+-------+----------+
| 5     | H3544      | 802   | Valid    |
+-------+------------+-------+----------+
| 6     | H4599      | 465   | Invalid  |
+-------+------------+-------+----------+

UPDATED(基于 OP 更正的输入):

这应该可以解决您的问题:

df = input_df.join(config_df.assign(test_6_3='Valid').set_index(['ContractID', 'PBID']), on = ['ContractID', 'PBID']).fillna('Invalid')

输入:

   Index ContractID PBID
0      1      H9500  456
1      2      H9500  008
2      3      H9500  002
3      4      H3544  853
4      5      H3544  802
5      6      H4599  465
  ContractID PBID
0      H9500  002
1      H9500  008
2      H3544  800
3      H3544  801
4      H3544  802

输出:

   Index ContractID PBID test_6_3
0      1      H9500  456  Invalid
1      2      H9500  008    Valid
2      3      H9500  002    Valid
3      4      H3544  853  Invalid
4      5      H3544  802    Valid
5      6      H4599  465  Invalid

更新#2:

要坚持 OP 问题中的 apply() 方法并使其有效,您可以按如下方式修改 test_6_3()

def test_6_3(s):
    if s["ContractID"] in config_dict and s["PBID"] in config_dict[s["ContractID"]]:
        return "Valid"
    else:
        return "Invalid"

输出:

   Index ContractID PBID test_6_3
0      1      H9500  456  Invalid
1      2      H9500  008    Valid
2      3      H9500  002    Valid
3      4      H3544  853  Invalid
4      5      H3544  802    Valid
5      6      H4599  465  Invalid
names = ['ContractID', 'PBID']
config = dfc[names].apply(tuple, axis=1).to_list()
config

输出:

[('H9500', 2), ('H9500', 8), ('H3544', 800), ('H3544', 801), ('H3544', 802)]

代码:

(
    df.assign(test_6_3=df[names].apply(lambda x: tuple(x) in config, axis=1))
    .assign(test_6_3=lambda x: x.test_6_3.map({True: 'Valid', False: 'Invalid'}))
)

输出:

   Index ContractID  PBID test_6_3
0      1      H9500   456  Invalid
1      2      H9500     8    Valid
2      3      H9500     2    Valid
3      4      H3544   853  Invalid
4      5      H3544   802    Valid
5      6      H4599   465  Invalid