Merge tuples in a list (spaCy training set related)

I have a list of n tuples (10K or more) in the following form (spaCy's training format):

[
('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]

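The plan is to group identical sentences and merge their dictionaries. The obvious approach is a brute-force nested loop, but that would be very slow with 10-25K items. Is there a better/optimal way to do this?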

Desired output:

[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]

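Since the sentence is your key, a dict is the natural way to do this. For each entry in the list, append its entity tuples to a running list for that sentence, like so: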

itemlist = [
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]

sentence_dict = {}
for sentence, annotations in itemlist:
    # Reuse the dict already stored for this sentence, or start a new one
    value_dict = sentence_dict.setdefault(sentence, {'entities': []})
    # extend() keeps every entity tuple, not just the first one in the list
    value_dict['entities'].extend(annotations['entities'])

list_of_tuples = []
for sentence, entity_dict in sentence_dict.items():
    list_of_tuples.append((sentence, entity_dict))

>>> print(list_of_tuples)
[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]
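The same grouping can also be written with `collections.defaultdict`, which removes the need to check whether a sentence has been seen before. This is only a minimal sketch: the function name `merge_entities` is made up for illustration, and it assumes every item is a `(sentence, {'entities': [...]})` pair as in the question. It makes a single pass over the list, so the work grows roughly linearly with the number of items instead of quadratically as with a nested loop.

```python
from collections import defaultdict


def merge_entities(items):
    """Group (sentence, {'entities': [...]}) pairs by sentence.

    Hypothetical helper, not part of spaCy.
    """
    merged = defaultdict(list)
    for sentence, annotations in items:
        # extend() keeps every span of an entry, not only the first one
        merged[sentence].extend(annotations.get('entities', []))
    # Dicts preserve insertion order (Python 3.7+), so sentences
    # come out in the order they were first seen
    return [(sentence, {'entities': spans}) for sentence, spans in merged.items()]


itemlist = [
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]}),
]
print(merge_entities(itemlist))
```

Because new lists are built for the result, the dictionaries in the input list are left unchanged.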

Hope this helps, happy coding!

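Take advantage of the fact that a str in Python is hashable, so it can be used as a dictionary key.

Here I use a dictionary whose key is the sentence string, i.e. the first element of each tuple.

If you have memory constraints, you can process the data in batches or use a hosted platform such as Google Colab.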

temp = [
('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]
data = {}
for sentence, annotations in temp:
    # setdefault creates a fresh dict per sentence, so the input dicts are never mutated
    data.setdefault(sentence, {'entities': []})['entities'].extend(annotations['entities'])
temp = list(data.items())
print(temp)
[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]
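The batching suggestion above could look roughly like the following. It is a sketch under assumptions: `chunks`, `merge_in_batches`, and `batch_size` are names invented here, and the input is taken to be any iterable (for example a generator reading from disk), so the raw data never has to be held in memory all at once. The merged dictionary itself still has to fit in memory.

```python
from itertools import islice


def chunks(iterable, size):
    """Yield lists of at most `size` items from any iterable (hypothetical helper)."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def merge_in_batches(items, batch_size=1000):
    """Merge entities per sentence, consuming the input one batch at a time."""
    merged = {}
    for batch in chunks(items, batch_size):
        for sentence, annotations in batch:
            # One shared dict accumulates spans across all batches
            merged.setdefault(sentence, []).extend(annotations['entities'])
    return [(sentence, {'entities': spans}) for sentence, spans in merged.items()]


def example_stream():
    # Stand-in for a lazy data source such as a file reader
    yield ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]})
    yield ('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})


print(merge_in_batches(example_stream(), batch_size=1))
```

For 10-25K short sentences a single in-memory pass is normally enough; batching mainly matters when the input is much larger or arrives as a stream.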