How to get mentions instead of tokens in PyTorch NER?

Question

I am using PyTorch with a pretrained model.

Here is my code:

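from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          pipeline)


class NER(object):
    def __init__(self, model_name_or_path, tokenizer_name_or_path):
        # Load the tokenizer and the token-classification model.
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path)
        self.model = AutoModelForTokenClassification.from_pretrained(
            model_name_or_path)
        # Wrap both in a Hugging Face NER pipeline.
        self.nlp = pipeline("ner", model=self.model, tokenizer=self.tokenizer)

    def get_mention_entities(self, query):
        # Returns one dict per tagged token.
        return self.nlp(query)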

When I call get_mention_entities on "اونجا دانشگاه صنعتی امیرکبیر است" and print its output, I get:

[{'entity': 'B-FAC', 'score': 0.9454591, 'index': 2, 'word': 'دانشگاه', 'start': 6, 'end': 13}, {'entity': 'I-FAC', 'score': 0.9713519, 'index': 3, 'word': 'صنعتی', 'start': 14, 'end': 19}, {'entity': 'I-FAC', 'score': 0.9860724, 'index': 4, 'word': 'امیرکبیر', 'start': 20, 'end': 28}]

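As you can see, it recognizes the university name, but the name is split across three tokens in the list.

Is there a standard way to combine these tokens based on the 'entity' attribute?

The desired output looks like this: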

[{'entity': 'FAC', 'word': 'دانشگاه صنعتی امیرکبیر', 'start': 6, 'end': 28}]

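I could, of course, write a function that iterates over the tokens, compares them, and merges them based on the 'entity' attribute, but I would prefer a standard way, such as a built-in PyTorch function or something similar.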

My question is similar to this question.

PS: "دانشگاه صنعتی امیرکبیر" is the name of a university.

Answer 1

Hugging Face's NER pipeline has a parameter, grouped_entities=True, which does exactly what you want: it groups the B- and I- tagged tokens into unified entities.

Changing the pipeline construction to

self.nlp = pipeline("ner", model=self.model, tokenizer=self.tokenizer, grouped_entities=True)

should do the trick.
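To illustrate what this grouping amounts to, here is a minimal pure-Python sketch that merges the ungrouped output by hand. It is not the library's implementation; the function name merge_bio_entities is hypothetical, and it assumes each token dict carries the 'entity', 'index', 'start' and 'end' keys shown in the question. Note also that the pipeline's own grouped output, as far as I know, uses the key 'entity_group' rather than 'entity', and that more recent transformers releases select this behavior with aggregation_strategy="simple".

```python
def merge_bio_entities(tokens, text):
    """Merge consecutive B-/I- token dicts into one dict per mention.

    Hypothetical helper for illustration only; `tokens` is assumed to be
    the ungrouped "ner" pipeline output and `text` the original query.
    """
    mentions = []  # each item: [label, start, end, index_of_last_token]
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        if not label:
            continue  # skip "O" or any tag without a B-/I- prefix
        last = mentions[-1] if mentions else None
        continues_mention = (
            last is not None
            and prefix == "I"
            and label == last[0]
            and tok["index"] == last[3] + 1  # must be the very next token
        )
        if continues_mention:
            last[2] = tok["end"]
            last[3] = tok["index"]
        else:
            mentions.append([label, tok["start"], tok["end"], tok["index"]])
    # Slice the surface form from the original text so spacing is preserved.
    return [
        {"entity": label, "word": text[start:end], "start": start, "end": end}
        for label, start, end, _ in mentions
    ]


text = "اونجا دانشگاه صنعتی امیرکبیر است"
tokens = [
    {"entity": "B-FAC", "index": 2, "word": "دانشگاه", "start": 6, "end": 13},
    {"entity": "I-FAC", "index": 3, "word": "صنعتی", "start": 14, "end": 19},
    {"entity": "I-FAC", "index": 4, "word": "امیرکبیر", "start": 20, "end": 28},
]
print(merge_bio_entities(tokens, text))
```

For the tokens from the question this yields a single FAC mention spanning characters 6 to 28. Taking the word from the original text, rather than joining the token strings, avoids having to guess how subword pieces should be glued back together.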

Tags: python, named-entity-recognition, mention, pytorch, huggingface-transformers