Pandas: Search input data against values in dictionary
I have two data sources: a configuration and the input data.
Configuration
+------------+-------+
| ContractID | PB ID |
+------------+-------+
| H9500 | 002 |
+------------+-------+
| H9500 | 008 |
+------------+-------+
| H3544 | 800 |
+------------+-------+
| H3544 | 801 |
+------------+-------+
| H3544 | 802 |
+------------+-------+
Input file
+-------+------------+-------+
| Index | ContractID | PB ID |
+-------+------------+-------+
| 1 | H9500 | 456 |
+-------+------------+-------+
| 2 | H9500 | 008 |
+-------+------------+-------+
| 3 | H9500 | 002 |
+-------+------------+-------+
| 4 | H3544 | 853 |
+-------+------------+-------+
| 5 | H3544 | 802 |
+-------+------------+-------+
| 6 | H4599 | 465 |
+-------+------------+-------+
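I am trying to iterate over the input file and check whether each Contract ID / PB ID combination is valid. For example, Contract ID = H9500 with PB ID = 456 is invalid because that combination does not exist in the configuration, while Contract ID = H9500 with PB ID = 008 is valid because it does.
I am using the following logic: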
import pandas as pd

input_df = pd.DataFrame({'Index': [1, 2, 3, 4, 5, 6],
                         'ContractID': ['H9500', 'H9500', 'H9500', 'H3544', 'H3544', 'H4599'],
                         'PBID': ['456', '008', '002', '853', '802', '465']})
config_df = pd.DataFrame({'ContractID': ['H9500', 'H9500', 'H3544', 'H3544', 'H3544'],
                          'PBID': ['002', '008', '800', '801', '802']})
# Map each ContractID to the list of PB IDs configured for it
config_dict = {k: list(v) for k, v in config_df.groupby("ContractID")["PBID"]}

def test_6_3(s):
    if config_dict.get(s["ContractID"]) and s["PBID"] not in config_dict.values():
        return "Invalid PB Contract"
    else:
        return "Valid"

input_df['test_6_3'] = input_df.apply(test_6_3, axis=1)
input_df
But I am not getting the expected result:
+-------+------------+-------+----------+
| Index | ContractID | PB ID | test_6_3 |
+-------+------------+-------+----------+
| 1 | H9500 | 456 | Invalid |
+-------+------------+-------+----------+
| 2 | H9500 | 008 | Valid |
+-------+------------+-------+----------+
| 3 | H9500 | 002 | Valid |
+-------+------------+-------+----------+
| 4 | H3544 | 853 | Invalid |
+-------+------------+-------+----------+
| 5 | H3544 | 802 | Valid |
+-------+------------+-------+----------+
| 6 | H4599 | 465 | Invalid |
+-------+------------+-------+----------+
UPDATED (based on the OP's corrected input):
This should solve your problem:
df = input_df.join(config_df.assign(test_6_3='Valid').set_index(['ContractID', 'PBID']), on = ['ContractID', 'PBID']).fillna('Invalid')
Input:
Index ContractID PBID
0 1 H9500 456
1 2 H9500 008
2 3 H9500 002
3 4 H3544 853
4 5 H3544 802
5 6 H4599 465
ContractID PBID
0 H9500 002
1 H9500 008
2 H3544 800
3 H3544 801
4 H3544 802
Output:
Index ContractID PBID test_6_3
0 1 H9500 456 Invalid
1 2 H9500 008 Valid
2 3 H9500 002 Valid
3 4 H3544 853 Invalid
4 5 H3544 802 Valid
5 6 H4599 465 Invalid
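One caveat with the one-liner above: fillna('Invalid') fills missing values in every column, so it could overwrite genuine NaNs elsewhere in a wider input file. As a possible alternative, here is a minimal sketch using merge() with indicator=True, which flags matches in a separate _merge column. The names merged and result are only illustrative, and the DataFrames are rebuilt from the question so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

input_df = pd.DataFrame({
    "Index": [1, 2, 3, 4, 5, 6],
    "ContractID": ["H9500", "H9500", "H9500", "H3544", "H3544", "H4599"],
    "PBID": ["456", "008", "002", "853", "802", "465"],
})
config_df = pd.DataFrame({
    "ContractID": ["H9500", "H9500", "H3544", "H3544", "H3544"],
    "PBID": ["002", "008", "800", "801", "802"],
})

# Left merge keeps every input row; _merge is "both" when the pair exists in config.
# drop_duplicates() guards against duplicated config rows multiplying input rows.
merged = input_df.merge(
    config_df.drop_duplicates(),
    on=["ContractID", "PBID"],
    how="left",
    indicator=True,
)

# Only the new column is written, so NaNs in other columns are left untouched.
merged["test_6_3"] = np.where(merged["_merge"] == "both", "Valid", "Invalid")
result = merged.drop(columns="_merge")
print(result)
```

On the sample data this should give the same Valid/Invalid column as the join() version.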
Update #2:
To stick with the apply() approach from the OP's question and make it work, you can modify test_6_3() as follows:
def test_6_3(s):
    if s["ContractID"] in config_dict and s["PBID"] in config_dict[s["ContractID"]]:
        return "Valid"
    else:
        return "Invalid"
Output:
Index ContractID PBID test_6_3
0 1 H9500 456 Invalid
1 2 H9500 008 Valid
2 3 H9500 002 Valid
3 4 H3544 853 Invalid
4 5 H3544 802 Valid
5 6 H4599 465 Invalid
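For context, the condition in the question fails because config_dict.values() yields whole lists of PB IDs, so a single string such as '456' is never "in" it. The not in test is therefore always true, and every row with a known ContractID is flagged as invalid. If a row-wise apply() turns out to be slow on a large file, the same check can also be done without a Python-level function. The following is only a sketch under that assumption: it builds a MultiIndex of (ContractID, PBID) pairs for both frames and uses isin(). The names valid_pairs, input_pairs and mask are illustrative.

```python
import numpy as np
import pandas as pd

input_df = pd.DataFrame({
    "Index": [1, 2, 3, 4, 5, 6],
    "ContractID": ["H9500", "H9500", "H9500", "H3544", "H3544", "H4599"],
    "PBID": ["456", "008", "002", "853", "802", "465"],
})
config_df = pd.DataFrame({
    "ContractID": ["H9500", "H9500", "H3544", "H3544", "H3544"],
    "PBID": ["002", "008", "800", "801", "802"],
})

# Treat each (ContractID, PBID) pair as one key.
valid_pairs = pd.MultiIndex.from_frame(config_df[["ContractID", "PBID"]])
input_pairs = pd.MultiIndex.from_frame(input_df[["ContractID", "PBID"]])

# Boolean array: True where the input pair appears in the configuration.
mask = input_pairs.isin(valid_pairs)
input_df["test_6_3"] = np.where(mask, "Valid", "Invalid")
print(input_df)
```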
# Here dfc is presumably the configuration DataFrame and df the input DataFrame
names = ['ContractID', 'PBID']
config = dfc[names].apply(tuple, axis=1).to_list()
config
Output:
[('H9500', 2), ('H9500', 8), ('H3544', 800), ('H3544', 801), ('H3544', 802)]
Code:
(
    df.assign(test_6_3=df[names].apply(lambda x: tuple(x) in config, axis=1))
    .assign(test_6_3=lambda x: x.test_6_3.map({True: 'Valid', False: 'Invalid'}))
)
Output:
Index ContractID PBID test_6_3
0 1 H9500 456 Invalid
1 2 H9500 8 Valid
2 3 H9500 2 Valid
3 4 H3544 853 Invalid
4 5 H3544 802 Valid
5 6 H4599 465 Invalid
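Note that in the output above the PB IDs appear as 2 and 8 rather than 002 and 008, which suggests they were parsed as integers in that run. Tuple membership only works when both frames use the same dtype for PBID: ('H9500', '008') will not match ('H9500', 8). A minimal sketch of one way to guard against this follows. It assumes PB IDs are always three digits wide, and normalize_pbid is a hypothetical helper, not something from the answers above. It also uses a set instead of a list, so each lookup does not scan the whole configuration.

```python
import pandas as pd

# Config read with integer PB IDs (leading zeros lost), input read as strings.
dfc = pd.DataFrame({
    "ContractID": ["H9500", "H9500", "H3544", "H3544", "H3544"],
    "PBID": [2, 8, 800, 801, 802],
})
df = pd.DataFrame({
    "Index": [1, 2, 3, 4, 5, 6],
    "ContractID": ["H9500", "H9500", "H9500", "H3544", "H3544", "H4599"],
    "PBID": ["456", "008", "002", "853", "802", "465"],
})

def normalize_pbid(series, width=3):
    """Hypothetical helper: cast to str and left-pad with zeros (width is an assumption)."""
    return series.astype(str).str.zfill(width)

dfc = dfc.assign(PBID=normalize_pbid(dfc["PBID"]))
df = df.assign(PBID=normalize_pbid(df["PBID"]))

# Set of valid (ContractID, PBID) pairs for constant-time membership tests.
config = set(zip(dfc["ContractID"], dfc["PBID"]))
is_valid = [pair in config for pair in zip(df["ContractID"], df["PBID"])]

out = df.assign(
    test_6_3=pd.Series(is_valid, index=df.index).map({True: "Valid", False: "Invalid"})
)
print(out)
```

When reading from CSV, passing dtype={'PBID': str} to pd.read_csv avoids losing the leading zeros in the first place.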