How to extract the top item from a defaultdict(list)?

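I'm just getting started with defaultdicts. I have a matching script that uses a unique identifier as the key and uses defaultdict(list) to store a list of potential matches for that identifier. Each match consists of a company name, an address, and match scores (produced by the matching algorithms). Sometimes it is a 1-1 match, meaning exactly one match is associated with the key, but sometimes the algorithm also catches close matches, so a key has several candidates. For those keys I would like to select the match with the highest score.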

Goal: extract the data for each unique identifier from the defaultdict(list). If a unique identifier has more than one value, take the entry with the highest Lev score, Fuzzy score, and Jaro score.

Here is a preview of the data:

#imports
from collections import defaultdict
test_dic_stack = defaultdict(list)

#testing data (unique1 has a 1-1 match &  unique2 has a 1-5 match)
test_dic_stack['unique1'].append({'Account Name': 'company1', 'Matching Account': 'company1', 'Account_Address': '123 Road', 'Address_match': '123 Road',  'Lev_score': 98.0, 'Fuzzy_score': 100, 'Jaro_Score': 99.0})
test_dic_stack['unique2'].append({'Account Name': 'company1', 'Matching Account': 'company1', 'Account_Address': '1 awesome street', 'Address_match': '1 awesome street',  'Lev_score': 91.0, 'Fuzzy_score': 89, 'Jaro_Score': 99.0})
test_dic_stack['unique2'].append({'Account Name': 'company2', 'Matching Account': 'company2', 'Account_Address': '1 awesome street', 'Address_match': '1 awesome st',  'Lev_score': 71.0, 'Fuzzy_score': 82, 'Jaro_Score': 84.0})
test_dic_stack['unique2'].append({'Account Name': 'company3', 'Matching Account': 'company3', 'Account_Address': '1 awesome street', 'Address_match': '1 awesome street suite 1',  'Lev_score': 88.0, 'Fuzzy_score': 89, 'Jaro_Score': 90.0})
test_dic_stack['unique2'].append({'Account Name': 'company4', 'Matching Account': 'company4', 'Account_Address': '1 awesome street', 'Address_match': '1 awe street',  'Lev_score': 81.0, 'Fuzzy_score': 90, 'Jaro_Score': 86.0})
test_dic_stack['unique2'].append({'Account Name': 'company5', 'Matching Account': 'company5', 'Account_Address': '1 awesome street', 'Address_match': '1 awe st',  'Lev_score': 70.0, 'Fuzzy_score': 86, 'Jaro_Score': 89.0})

#defaultdict preview
defaultdict(list,
            {'unique1': [{'Account Name': 'company1',
               'Matching Account': 'company1',
               'Account_Address': '123 Road',
               'Address_match': '123 Road',
               'Lev_score': 98.0,
               'Fuzzy_score': 100,
               'Jaro_Score': 99.0}],
             'unique2': [{'Account Name': 'company1',
               'Matching Account': 'company1',
               'Account_Address': '1 awesome street',
               'Address_match': '1 awesome street',
               'Lev_score': 91.0,
               'Fuzzy_score': 89,
               'Jaro_Score': 99.0},
              {'Account Name': 'company2',
               'Matching Account': 'company2',
               'Account_Address': '1 awesome street',
               'Address_match': '1 awesome st',
               'Lev_score': 71.0,
               'Fuzzy_score': 82,
               'Jaro_Score': 84.0},
              {'Account Name': 'company3',
               'Matching Account': 'company3',
               'Account_Address': '1 awesome street',
               'Address_match': '1 awesome street suite 1',
               'Lev_score': 88.0,
               'Fuzzy_score': 89,
               'Jaro_Score': 90.0},
              {'Account Name': 'company4',
               'Matching Account': 'company4',
               'Account_Address': '1 awesome street',
               'Address_match': '1 awe street',
               'Lev_score': 81.0,
               'Fuzzy_score': 90,
               'Jaro_Score': 86.0},
              {'Account Name': 'company5',
               'Matching Account': 'company5',
               'Account_Address': '1 awesome street',
               'Address_match': '1 awe st',
               'Lev_score': 70.0,
               'Fuzzy_score': 86,
               'Jaro_Score': 89.0}]})

This is the result I'm after:
Extract the unique1 data and the "best match" data for unique2. Note that the best match is not always the first one in the list.

results = [{'unique1': {'Account Name': 'company1',
               'Matching Account': 'company1',
               'Account_Address': '123 Road',
               'Address_match': '123 Road',
               'Lev_score': 98.0,
               'Fuzzy_score': 100,
               'Jaro_Score': 99.0},

          'unique2': {'Account Name': 'company1',
               'Matching Account': 'company1',
               'Account_Address': '1 awesome street',
               'Address_match': '1 awesome street',
               'Lev_score': 91.0,
               'Fuzzy_score': 89,
               'Jaro_Score': 99.0}}]


You can use a dictionary comprehension with max, using the sum of the three scores as the key function.

Assuming d is the input dictionary (test_dic_stack in the question):

out = {k:max(v, key=lambda x: sum((x['Fuzzy_score'], x['Lev_score'], x['Jaro_Score'])))
       for k,v in d.items()}

Output:

{'unique1': {'Account Name': 'company1',
  'Matching Account': 'company1',
  'Account_Address': '123 Road',
  'Address_match': '123 Road',
  'Lev_score': 98.0,
  'Fuzzy_score': 100,
  'Jaro_Score': 99.0},
 'unique2': {'Account Name': 'company1',
  'Matching Account': 'company1',
  'Account_Address': '1 awesome street',
  'Address_match': '1 awesome street',
  'Lev_score': 91.0,
  'Fuzzy_score': 89,
  'Jaro_Score': 99.0}}
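
Summing the three scores is only one way to read "highest score", so here is a minimal, self-contained sketch that makes the ranking rule swappable. The names SCORE_KEYS, total_score and best_matches are made up for illustration, and the data is a trimmed-down version of the question's (only the name and score fields). The `if v` guard is an extra assumption on my part: max raises ValueError on an empty list, and a defaultdict(list) can end up with empty lists just from looking up a missing key.

```python
from collections import defaultdict

# Score field names as they appear in the question's data
SCORE_KEYS = ('Lev_score', 'Fuzzy_score', 'Jaro_Score')

def total_score(match):
    """Sum of the three scores for one candidate match."""
    return sum(match[k] for k in SCORE_KEYS)

def best_matches(d, key=total_score):
    """Return {identifier: best candidate}, skipping identifiers with no candidates."""
    return {k: max(v, key=key) for k, v in d.items() if v}

# Trimmed-down sample data; the best match for unique2 is deliberately not first
d = defaultdict(list)
d['unique1'].append({'Account Name': 'company1', 'Lev_score': 98.0, 'Fuzzy_score': 100, 'Jaro_Score': 99.0})
d['unique2'].append({'Account Name': 'company4', 'Lev_score': 81.0, 'Fuzzy_score': 90, 'Jaro_Score': 86.0})
d['unique2'].append({'Account Name': 'company1', 'Lev_score': 91.0, 'Fuzzy_score': 89, 'Jaro_Score': 99.0})

# Ranking by the sum of the three scores (same rule as the answer above)
by_sum = best_matches(d)

# Alternative rule: compare Fuzzy first, then Lev, then Jaro (tuple comparison)
by_fuzzy_first = best_matches(
    d, key=lambda m: (m['Fuzzy_score'], m['Lev_score'], m['Jaro_Score'])
)

print(by_sum['unique2']['Account Name'])          # company1
print(by_fuzzy_first['unique2']['Account Name'])  # company4
```

The two rules disagree on unique2 here, which is why it is worth deciding up front whether the scores should be combined or prioritised. When two candidates tie, max returns the first one it encounters in the list.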