Merge Tuples in a list - spaCy Trainset related
I have a list of n tuples (10K or more) in the following form (spaCy's training format):
[
('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]
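The plan is to group identical sentences and merge their dictionaries. My obvious first idea was a brute-force loop, but that would be very slow with 10-25K rows. Is there a better/optimal way to do this?
Expected output: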
[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]
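Since the sentence is your key, a dict is the natural way to do this. For each entry in the list, append its entity tuple to the running list for that sentence, like this: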
itemlist = [
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]

# Map each sentence to one dict that accumulates all of its entities.
sentence_dict = {}
for sentence, annotations in itemlist:
    value_dict = sentence_dict.setdefault(sentence, {'entities': []})
    # extend() keeps every entity, even when an item carries more than one.
    value_dict['entities'].extend(annotations['entities'])

# Convert back to the list-of-tuples training format.
list_of_tuples = list(sentence_dict.items())
>>> print(list_of_tuples)
[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]
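The same grouping can also be sketched with collections.defaultdict. This is only an illustration, not part of the code above: the function name merge_annotations is made up here, and skipping exact duplicates and sorting entities by start offset are assumptions about what you would want in a training set.

```python
from collections import defaultdict


def merge_annotations(items):
    """Group (sentence, {'entities': [...]}) pairs by sentence (hypothetical helper)."""
    grouped = defaultdict(list)
    for sentence, annotations in items:
        for entity in annotations.get('entities', []):
            # Assumption: an exact duplicate span/label should only be kept once.
            if entity not in grouped[sentence]:
                grouped[sentence].append(entity)
    # Sort each entity list by (start, end, label) for a stable result.
    return [(s, {'entities': sorted(ents)}) for s, ents in grouped.items()]


train_data = [
    ('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
]
merged = merge_annotations(train_data)
print(merged)
```

Each row is visited once and each dict lookup is constant time on average, so the cost grows roughly linearly with the number of rows rather than quadratically as with a nested loop.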
Hope that helps, happy coding!
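Take advantage of the fact that a str in Python is hashable, so it can be used as a dictionary key.
Here I use a dictionary whose key is the sentence string, i.e. the first element of each tuple.
If you have memory constraints, you can process the data in batches or run it on a hosted platform such as Google Colab.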
temp = [
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]})
]

data = {}
for sentence, annotations in temp:
    if sentence in data:
        data[sentence]['entities'].extend(annotations['entities'])
    else:
        # Copy the entity list so the input dicts are not mutated.
        data[sentence] = {'entities': list(annotations['entities'])}

temp = list(data.items())
print(temp)
[('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG'), (36, 42, 'ECOM-ORG')]})]
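The batching idea mentioned above could look roughly like this. It is a sketch under assumptions: iter_batches and merge_batch are hypothetical helper names, and the batch size of 1 is only there to make the example visible. Note that the running dict still holds every unique sentence, so batching mainly helps when the rows are streamed from disk instead of loaded all at once.

```python
from itertools import islice


def iter_batches(items, batch_size):
    """Yield successive lists of at most batch_size items (hypothetical helper)."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def merge_batch(merged, batch):
    """Fold one batch into the running sentence -> entity list mapping."""
    for sentence, annotations in batch:
        merged.setdefault(sentence, []).extend(annotations['entities'])


rows = [
    ('Apple are selling their products on Amazon', {'entities': [(0, 5, 'PROD-ORG')]}),
    ('Apple are selling their products on Amazon', {'entities': [(36, 42, 'ECOM-ORG')]}),
]

merged = {}
for batch in iter_batches(rows, batch_size=1):
    merge_batch(merged, batch)

# Convert back to the list-of-tuples training format.
result = [(sentence, {'entities': entities}) for sentence, entities in merged.items()]
print(result)
```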