Merge Tuples in a list - Spacy Trainset related

Question

I have a list of n (10K or more) tuples in the following form (spaCy's training format):

[
('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]

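The plan is to group identical sentences and merge their dictionaries. The obvious approach is a brute-force loop, but with 10-25K records that would be very slow. Is there a better or more efficient way to do this?

Expected output: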

[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]

Answer 1

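Since the sentence is your key, a dict is the natural way to do this. For each entry in the list, add its entities to a running list for that sentence, like this: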

itemlist = [
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]

sentence_dict = {}
for sentence, annotations in itemlist:
    # Start an empty entity list the first time a sentence is seen
    value_dict = sentence_dict.setdefault(sentence, {'entities': []})
    # extend() keeps every entity of the entry, not only the first one
    value_dict['entities'].extend(annotations['entities'])

# Convert the dict back into a list of (sentence, annotations) tuples
list_of_tuples = list(sentence_dict.items())

>>> print(list_of_tuples)
[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]
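
The same grouping idea can also be wrapped in a reusable function. The following is only a sketch: the name `merge_annotations` is made up for illustration, and dropping exact duplicate spans and sorting by start offset are extra steps that the question did not ask for. It assumes every span is a hashable tuple such as `(start, end, label)`.

```python
from collections import defaultdict


def merge_annotations(items):
    """Group (text, {'entities': [...]}) tuples by text and merge their spans.

    Hypothetical helper, not part of spaCy.
    """
    merged = defaultdict(list)
    for text, annotations in items:
        # extend() keeps every span, even if one entry carries several
        merged[text].extend(annotations.get('entities', []))
    # Assumption: spans are hashable tuples, so set() can drop exact duplicates;
    # sorted() then orders them by start offset
    return [(text, {'entities': sorted(set(spans))})
            for text, spans in merged.items()]


itemlist = [
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]
print(merge_annotations(itemlist))
```

Each entry is visited once and each dictionary lookup takes constant time on average, so the grouping pass scales roughly linearly with the number of tuples.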

Hope this helps, happy coding!

Answer 2

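This takes advantage of the fact that a Python str is hashable, so it can serve as a dictionary key.

Here I use a dictionary keyed by the sentence string, which is the first element of each tuple.

If you are memory-constrained, you can process the data in batches or run it on a hosted platform such as Google Colab.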

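temp = [
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]
data = {}
for sentence, annotations in temp:
    if sentence in data:
        # extend() keeps every entity of the entry, not only the first one
        data[sentence]['entities'].extend(annotations['entities'])
    else:
        # Copy the list so the dicts in the input are not mutated later
        data[sentence] = {'entities': list(annotations['entities'])}
temp = list(data.items())
print(temp)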

[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]
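
The batching suggestion in this answer could look roughly like the sketch below. The function name `merge_batches` is made up for illustration, and the batches are assumed to arrive as any iterable of lists, for example chunks read from disk one at a time. The running dictionary still has to hold every distinct sentence, so batching only limits how much of the input sits in memory at once.

```python
def merge_batches(batches):
    """Merge entity lists across an iterable of batches (hypothetical helper)."""
    merged = {}
    for batch in batches:
        for text, annotations in batch:
            # setdefault() creates a fresh list per sentence, so the
            # input dictionaries are never mutated
            merged.setdefault(text, []).extend(annotations['entities'])
    return [(text, {'entities': spans}) for text, spans in merged.items()]


# Two tiny batches standing in for chunks loaded separately
batch_1 = [('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]})]
batch_2 = [('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})]
print(merge_batches([batch_1, batch_2]))
```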

Tags: python, python-3.x, tuples, spacy, named-entity-recognition