Named Entity Recognition with Huggingface transformers, mapping back to complete entities
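I'm looking at the documentation for the Huggingface pipeline for Named Entity Recognition, and it's not clear to me how these results are meant to be used in an actual entity recognition model.
For instance, given the example in the documentation: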
>>> from transformers import pipeline
>>> nlp = pipeline("ner")
>>> sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very " \
... "close to the Manhattan Bridge which is visible from the window."
This outputs a list of all words that have been identified as an entity from the 9 classes defined above. Here are the expected results:
print(nlp(sequence))
[
{'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
{'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
{'word': 'Face', 'score': 0.9982671737670898, 'entity': 'I-ORG'},
{'word': 'Inc', 'score': 0.9994403719902039, 'entity': 'I-ORG'},
{'word': 'New', 'score': 0.9994346499443054, 'entity': 'I-LOC'},
{'word': 'York', 'score': 0.9993270635604858, 'entity': 'I-LOC'},
{'word': 'City', 'score': 0.9993864893913269, 'entity': 'I-LOC'},
{'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
{'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
{'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
{'word': 'Manhattan', 'score': 0.9758241176605225, 'entity': 'I-LOC'},
{'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
]
While this alone is impressive, it's unclear to me what the correct way is to get "DUMBO" from:
{'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
{'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
{'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
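or even to get cleaner multi-token matches, such as distinguishing "New York City" from the simple city of "York".
While I can imagine heuristic methods, what is the correct, intended way to join these tokens back into correct labels given your inputs?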
The pipeline object can do that for you when you set the appropriate parameter:
- transformers < 4.7.0: set grouped_entities to True.
- transformers >= 4.7.0: set aggregation_strategy to simple.
from transformers import pipeline
#transformers < 4.7.0
#ner = pipeline("ner", grouped_entities=True)
ner = pipeline("ner", aggregation_strategy='simple')
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge which is visible from the window."
output = ner(sequence)
print(output)
Output:
[{'entity_group': 'I-ORG', 'score': 0.9970663785934448, 'word': 'Hugging Face Inc'}
, {'entity_group': 'I-LOC', 'score': 0.9993778467178345, 'word': 'New York City'}
, {'entity_group': 'I-LOC', 'score': 0.9571147759755453, 'word': 'DUMBO'}
, {'entity_group': 'I-LOC', 'score': 0.9838141202926636, 'word': 'Manhattan Bridge'}
, {'entity_group': 'I-LOC', 'score': 0.9838141202926636, 'word': 'Manhattan Bridge'}]
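To make the grouping less of a black box, here is a minimal sketch of how token-level predictions could be merged by hand. It is not the library's implementation. It assumes each token dict carries an `index` field giving its position in the tokenized input (the index values below are illustrative, not taken from the output above), it drops the `I-`/`B-` prefix from the label, and it averages the token scores, which is a choice made for this sketch. The helper name `group_tokens` is hypothetical.

```python
from statistics import mean


def group_tokens(tokens):
    """Merge adjacent token-level predictions into entity-level spans.

    Assumes every token dict has 'word', 'score', 'entity' and 'index' keys.
    """
    groups = []
    for tok in tokens:
        label = tok["entity"].split("-", 1)[-1]  # "I-LOC" -> "LOC"
        is_subword = tok["word"].startswith("##")
        last = groups[-1] if groups else None
        adjacent = last is not None and tok["index"] == last["last_index"] + 1

        if adjacent and (is_subword or label == last["label"]):
            # Word pieces are glued on directly; whole words get a space.
            last["word"] += tok["word"][2:] if is_subword else " " + tok["word"]
            last["scores"].append(tok["score"])
            last["last_index"] = tok["index"]
        else:
            groups.append({
                "label": label,
                "word": tok["word"],
                "scores": [tok["score"]],
                "last_index": tok["index"],
            })

    return [
        {"entity_group": g["label"], "score": mean(g["scores"]), "word": g["word"]}
        for g in groups
    ]


# Illustrative input: scores rounded, index values assumed for the example.
tokens = [
    {"word": "New", "score": 0.9994, "entity": "I-LOC", "index": 11},
    {"word": "York", "score": 0.9993, "entity": "I-LOC", "index": 12},
    {"word": "City", "score": 0.9994, "entity": "I-LOC", "index": 13},
    {"word": "D", "score": 0.9826, "entity": "I-LOC", "index": 19},
    {"word": "##UM", "score": 0.9370, "entity": "I-LOC", "index": 20},
    {"word": "##BO", "score": 0.8987, "entity": "I-LOC", "index": 21},
]

groups = group_tokens(tokens)
print([g["word"] for g in groups])  # ['New York City', 'DUMBO']
```

Relying on adjacency rather than on the label alone is what keeps "New York City" and "DUMBO" apart even though both are tagged I-LOC. In practice the pipeline's own aggregation option is the simpler route.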
Quick update: grouped_entities has been deprecated.
UserWarning: grouped_entities is deprecated and will be removed in version v5.0.0, defaulted to aggregation_strategy="AggregationStrategy.SIMPLE" instead.
f'grouped_entities is deprecated and will be removed in version v5.0.0, defaulted to aggregation_strategy="{aggregation_strategy}" instead.'
You will have to change your code to:
ner = pipeline("ner", aggregation_strategy="simple")
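If the same code has to run against both older and newer installs, one option is to pick the keyword argument from the installed version. The sketch below uses the 4.7.0 threshold mentioned in the answer above. The helper name `ner_pipeline_kwargs` is hypothetical, and it assumes the version string starts with three dot-separated numbers.

```python
import re


def ner_pipeline_kwargs(transformers_version):
    """Return the entity-grouping keyword argument for a given version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", transformers_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {transformers_version!r}")
    parsed = tuple(int(part) for part in match.groups())
    if parsed >= (4, 7, 0):
        return {"aggregation_strategy": "simple"}
    return {"grouped_entities": True}


# Intended usage (requires transformers to be installed):
#   import transformers
#   ner = transformers.pipeline("ner", **ner_pipeline_kwargs(transformers.__version__))
```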