Filter Proper Nouns with More Than 1 Token with spaCy

Question

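I want to use spaCy to filter a text file for all proper nouns that consist of more than one token. Does anyone know how to do this?

For example: it should return New York and New Orleans, but not Mexico.

I only want to use the standard library and spaCy.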

Answer 1

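I hope I understood your question correctly. If you are trying to return proper nouns that are longer than one word, such as names or cities, run the following code.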

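import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

matcher = Matcher(nlp.vocab)
text = "It would return New York City and New Orleans but not Mexico. Some more cities to test are New Dehli, Berlin, Sao Paulo, Buenos Aires, and Moscow."
doc = nlp(text)

# Two or more consecutive proper nouns, bounded on each side by a token
# that is not a proper noun.
pattern = [
    [
        {"POS": "PROPN", "OP": "!"},                     # left boundary: not a proper noun
        {"POS": "PROPN", "DEP": "compound", "OP": "+"},  # one or more compound modifiers
        {"POS": "PROPN"},                                # head of the name
        {"POS": "PROPN", "OP": "!"},                     # right boundary: not a proper noun
    ]
]
matcher.add("multiPropn", pattern)
matches = matcher(doc)
for match_id, start, end in matches:
    # Drop the boundary token on each side before printing.
    print(doc[start + 1:end - 1])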

# Output:
# New York City
# New Orleans
# New Dehli
# Sao Paulo
# Buenos Aires
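
One thing to keep in mind with the pattern above: it needs a non-proper-noun token on both sides, so a name at the very start or end of the text may be skipped. As a rough alternative (not the Matcher approach, just a sketch using `itertools.groupby` from the standard library), you could group consecutive `PROPN` tokens yourself. The helper names `multi_token_proper_nouns` and `extract_from_text` are made up for this illustration, and the sketch ignores the `compound` dependency, so any adjacent proper nouns are merged into one name.

```python
from itertools import groupby


def multi_token_proper_nouns(tokens):
    """Return runs of two or more consecutive PROPN tokens as strings.

    `tokens` is any iterable of objects exposing `.pos_` and `.text`,
    such as a spaCy Doc.
    """
    results = []
    # groupby splits the sequence into consecutive runs that share a key;
    # here the key is whether the token is tagged as a proper noun.
    for is_propn, run in groupby(tokens, key=lambda tok: tok.pos_ == "PROPN"):
        run = list(run)
        if is_propn and len(run) > 1:
            results.append(" ".join(tok.text for tok in run))
    return results


def extract_from_text(text, model="en_core_web_sm"):
    """Hypothetical convenience wrapper: tag `text` with spaCy, then filter."""
    import spacy  # imported lazily so the helper above needs only the stdlib

    nlp = spacy.load(model)
    return multi_token_proper_nouns(nlp(text))
```

Calling `extract_from_text(text)` with the sample sentence from the answer should give a similar list of names, though the exact result depends on how the model tags each token.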

python

spacy