Python named entity recognition (NER): Replace named entities with labels
I am new to NER in Python, and I am trying to replace the named entities in a text input with their labels.
from nerd import ner
input_text = """Stack Overflow is a question and answer site for professional and enthusiast programmers. It is a privately held website, the flagship site of the Stack Exchange Network,[5][6][7] created in 2008 by Jeff Atwood and Joel Spolsky."""
doc = ner.name(input_text, language='en_core_web_sm')
text_label = [(X.text, X.label_) for X in doc]
print(text_label)
The output is: [('2008', 'DATE'), ('Jeff Atwood', 'PERSON'), ('Joel Spolsky', 'PERSON')]
I can then extract the people, for example:
people = [i for i,label in text_label if 'PERSON' in label]
print(people)
which gives ['Jeff Atwood', 'Joel Spolsky'].
My question is how to replace the recognized named entities in the original input text so that the result is:
Stack Overflow is a question and answer site for professional and enthusiast programmers. It is a privately held website, the flagship site of the Stack Exchange Network,[5][6][7] created in DATE by PERSON and PERSON.
Thanks a lot!
You can iterate over text_label and replace each entity text with its corresponding label:
# Replace every occurrence of each entity's text with its label
for text, label in text_label:
    input_text = input_text.replace(text, label)
print(input_text)
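You can indeed iterate over the texts and labels as @taha explains, but in the general case this is a bad idea! The loop can conflate occurrences that share the same surface text but have different types (or that are sometimes not entities at all), because the replacement relies only on each entity's text and not on where it occurs.
For example, consider the following: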
In 2000 I sent 2000 emails.
I saw a statue of Washington in Washington.
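You would not be able to tell the occurrences of "2000" or "Washington" apart! This may seem rare, but isn't it better to avoid such mistakes, especially in very long documents?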
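To make the problem concrete, here is a minimal sketch of the first sentence run through the replace loop. The `text_label` value is hand-written and hypothetical; it stands in for an NER result in which only the first "2000" was tagged as a DATE.

```python
text = "In 2000 I sent 2000 emails."

# Hypothetical NER output: only the first "2000" (the year) is an entity.
text_label = [("2000", "DATE")]

replaced = text
for ent_text, label in text_label:
    # str.replace rewrites every occurrence, not only the tagged one.
    replaced = replaced.replace(ent_text, label)

print(replaced)  # In DATE I sent DATE emails.
```

The second "2000" is a plain count, yet it is relabelled as well, because the loop never looks at where the entity was found.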
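As far as I understand, the ner Python module looks like a simple binding to spaCy, so I guess you can access the "start_char" and "end_char" values to avoid this, with a bit of basic Python programming. By the way, I also think this should be more efficient from a computational point of view.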
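A minimal sketch of the offset-based idea, assuming each entity can be reduced to a (start_char, end_char, label) tuple. The helper name `replace_entities` is made up for illustration, and the spans below are hand-written stand-ins for NER output so that the example runs without any model installed.

```python
def replace_entities(text, spans):
    """Replace each (start_char, end_char, label) span in text with its label."""
    # Go from the last span to the first so that the offsets of the
    # spans still to be processed are not shifted by earlier edits.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + label + text[end:]
    return text


text = "I saw a statue of Washington in Washington."

# Hand-written spans standing in for NER output (an assumption):
# the first "Washington" is a person, the second a place.
first = text.index("Washington")
second = text.index("Washington", first + 1)
size = len("Washington")
spans = [(first, first + size, "PERSON"), (second, second + size, "GPE")]

result = replace_entities(text, spans)
print(result)  # I saw a statue of PERSON in GPE.
```

With spaCy itself the spans could be built as `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]`. Whether the objects returned by `ner.name` expose the same attributes is an assumption that would need checking.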