Removing compound-word named entities from a document using spaCy

Question

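How can I use spaCy to remove named entities from a text when some of those entities are compound words (that is, they span more than one token)?

I am aware of a related question, but I believe this one is not a duplicate, because the accepted answer posted there fails when a named entity is a compound word. The sample code below shows why the accepted answer to the linked question fails.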

import spacy

nlp = spacy.load('en_core_web_sm')

text_data = 'This is a text document that speaks about entities like New York and Nokia'

document = nlp(text_data)

text_no_namedentities = []

ents = [e.text for e in document.ents]
for item in document:
    if item.text in ents:
        pass
    else:
        text_no_namedentities.append(item.text)
print(" ".join(text_no_namedentities))

Output:

This is a text document that speaks about entities like New York and

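What is the best way to remove named entities from a text, including compound-word entities? Thanks.

P.S. I would have posted this as a comment on the linked question, but as a new user I lack the reputation to comment. I tried posting it as an answer there, but since I did not know the solution (only that the accepted answer fails for compound words), it was not a good answer and was deleted. A new question seemed to be my last resort, but if this is not appropriate, any advice on the right course of action in this situation would be greatly appreciated.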

Answer 1

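If a token is part of a named entity, its token.ent_type_ attribute holds that entity's type, even when the token belongs to a compound (multi-token) named entity. An approach that compares token.text against the entity strings cannot match here, because "New" and "York" are separate tokens while the entity text is "New York". Filtering on token.ent_type_ avoids that problem, as follows: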

import spacy

nlp = spacy.load('en_core_web_sm')

text_data = 'This is a text document that speaks about entities like New York and Nokia. My name is John Smith'

doc = nlp(text_data)

# You can add other Named Entities types that you would want to remove to the list
text_without_ne = [token.text for token in doc if token.ent_type_ not in ['GPE', 'PERSON']]

print(" ".join(text_without_ne))

Output:

This is a text document that speaks about entities like and . My name is
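
Joining tokens with spaces leaves stray gaps such as "and ." in the output above. A possible alternative, shown here only as a minimal sketch, is to cut the entities out of the original string by character offsets. The helper name remove_spans is hypothetical and not part of spaCy; with spaCy, the offsets could be taken from ent.start_char and ent.end_char for each ent in doc.ents. To keep the sketch runnable without a model, the offsets below are computed by hand with str.index as a stand-in for what the entity recognizer would return.

```python
def remove_spans(text, spans):
    """Return `text` with the given (start, end) character spans cut out.

    `spans` is an iterable of (start, end) offsets, end exclusive.
    Hypothetical helper: with spaCy the spans could be built as
    [(ent.start_char, ent.end_char) for ent in doc.ents].
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start > cursor:
            # Keep the text between the previous span and this one.
            pieces.append(text[cursor:start])
        # max() guards against overlapping or nested spans.
        cursor = max(cursor, end)
    pieces.append(text[cursor:])
    # Collapse the whitespace runs left behind by the removed spans.
    return " ".join("".join(pieces).split())


text = 'This is a text document that speaks about entities like New York and Nokia'

# Hand-built offsets standing in for the entity recognizer's output.
spans = []
for entity in ("New York", "Nokia"):
    start = text.index(entity)
    spans.append((start, start + len(entity)))

print(remove_spans(text, spans))
```

Because the cut is made on character offsets rather than on individual tokens, a multi-word entity such as "New York" is removed as a single unit, and the spacing of the remaining text is rebuilt from the original string instead of from a token join.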


Tags: python, nlp, named-entity-recognition, spacy