
Noun phrases with spacy

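How can I extract noun phrases from text using spacy?
I am not referring to part-of-speech tags. I can't find anything in the documentation about noun phrases or regular parse trees.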

If you want base NPs, i.e. NPs without coordination, prepositional phrases or relative clauses, you can use the noun_chunks iterator on the Doc and Span objects:

>>> from spacy.en import English
>>> nlp = English()
>>> doc = nlp(u'The cat and the dog sleep in the basket near the door.')
>>> for np in doc.noun_chunks:
...     np.text
...
u'The cat'
u'the dog'
u'the basket'
u'the door'


from spacy.symbols import *

np_labels = set([nsubj, nsubjpass, dobj, iobj, pobj]) # Probably others too
def iter_nps(doc):
    for word in doc:
        if word.dep in np_labels:
            yield word.subtree
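To see what word.subtree gives you compared with noun_chunks, here is a rough sketch on a hand-built toy parse, so it runs without spaCy or a model. The dependency labels are the ones spaCy reports for this sentence elsewhere in this thread; the head indices are my own assumption of a typical parse, and `subtree_indices` / `iter_np_texts` are hypothetical helpers, not spaCy API. I also added `attr` to the label set purely to show a subtree that is longer than the base NP.

```python
# Toy dependency parse of "Bananas are an excellent source of potassium."
# Each entry is (text, index of the head token, dependency label).
# Head indices are assigned by hand (an assumption), not produced by a parser.
TOKENS = [
    ("Bananas", 1, "nsubj"),
    ("are", 1, "ROOT"),
    ("an", 4, "det"),
    ("excellent", 4, "amod"),
    ("source", 1, "attr"),
    ("of", 4, "prep"),
    ("potassium", 5, "pobj"),
    (".", 1, "punct"),
]

# Same labels as np_labels above, plus "attr" for illustration.
NP_LABELS = {"nsubj", "nsubjpass", "dobj", "iobj", "pobj", "attr"}


def subtree_indices(tokens, root):
    """Return the sorted indices of `root` and all of its descendants."""
    found = {root}
    changed = True
    while changed:  # keep sweeping until no new descendant is found
        changed = False
        for i, (_, head, _) in enumerate(tokens):
            if head in found and i not in found:
                found.add(i)
                changed = True
    return sorted(found)


def iter_np_texts(tokens, labels):
    """Yield the subtree text of every token whose label is in `labels`."""
    for i, (_, _, dep) in enumerate(tokens):
        if dep in labels:
            yield " ".join(tokens[j][0] for j in subtree_indices(tokens, i))


phrases = list(iter_np_texts(TOKENS, NP_LABELS))
print(phrases)
```

With a real spaCy Doc, the contiguous span covered by a token's subtree can be taken as doc[word.left_edge.i : word.right_edge.i + 1], which saves you from joining the subtree tokens yourself.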

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp('Bananas are an excellent source of potassium.')
for np in doc.noun_chunks:
    print(np.text)
  an excellent source

for word in doc:
    print('word.dep:', word.dep, ' | ', 'word.dep_:', word.dep_)
  word.dep: 429  |  word.dep_: nsubj
  word.dep: 8206900633647566924  |  word.dep_: ROOT
  word.dep: 415  |  word.dep_: det
  word.dep: 402  |  word.dep_: amod
  word.dep: 404  |  word.dep_: attr
  word.dep: 443  |  word.dep_: prep
  word.dep: 439  |  word.dep_: pobj
  word.dep: 445  |  word.dep_: punct

from spacy.symbols import *
np_labels = set([nsubj, nsubjpass, dobj, iobj, pobj])
print('np_labels:', np_labels)
  np_labels: {416, 422, 429, 430, 439}


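def iter_nps(doc):
    for word in doc:
        if word.dep in np_labels:
            yield word.subtree

# Calling the function only creates a generator; nothing is iterated yet
print(iter_nps(doc))
  <generator object iter_nps at 0x7fd7b08b5bd0>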
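
The generator object shows up because calling a generator function does not run its body; you have to loop over the result or wrap it in list(). A minimal sketch in plain Python, with hand-written (text, dep) pairs standing in for spaCy tokens (`PAIRS`, `WANTED` and `iter_matching` are made-up names for illustration only):

```python
# Hand-written (text, dependency label) pairs standing in for spaCy tokens.
PAIRS = [
    ("Bananas", "nsubj"),
    ("are", "ROOT"),
    ("source", "attr"),
    ("potassium", "pobj"),
]
WANTED = {"nsubj", "nsubjpass", "dobj", "iobj", "pobj"}


def iter_matching(pairs, wanted):
    """Yield the text of every pair whose label is in `wanted`."""
    for text, dep in pairs:
        if dep in wanted:
            yield text


gen = iter_matching(PAIRS, WANTED)
print(gen)           # a generator object; the loop body has not run yet
matched = list(gen)  # iterating is what actually runs the function body
print(matched)       # ['Bananas', 'potassium']
```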

## Modified method:
def iter_nps(doc):
    for word in doc:
        if word.dep in np_labels:
            print(word.text, word.dep_)

iter_nps(doc)
  Bananas nsubj
  potassium pobj

doc = nlp('BRCA1 is a tumor suppressor protein that functions to maintain genomic stability.')
for np in doc.noun_chunks:
    print(np.text)
  a tumor suppressor protein
  genomic stability

iter_nps(doc)
  BRCA1 nsubj
  that nsubj
  stability dobj

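If you want to specify more precisely which kind of noun phrase to extract, you can use textacy's matches function. You can pass any combination of POS tags. For example,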

textacy.extract.matches(doc, "POS:ADP POS:DET:? POS:ADJ:? POS:NOUN:+")

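will return any nouns that are preceded by a preposition and, optionally, by a determiner and/or an adjective.

Textacy is built on top of spacy, so the two should work well together.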
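
To make the pattern semantics concrete, here is a small plain-Python sketch of what "ADP, optional DET, optional ADJ, one or more NOUN" selects. This is not how textacy implements it; it just maps each POS tag to one character and runs an ordinary regex over the result. The POS tags in `TAGGED` are assigned by hand (an assumption, not tagger output), and `CODES` and `match_pos_pattern` are hypothetical names.

```python
import re

# Hand-tagged tokens for "The cat and the dog sleep in the basket near the door."
TAGGED = [
    ("The", "DET"), ("cat", "NOUN"), ("and", "CCONJ"),
    ("the", "DET"), ("dog", "NOUN"), ("sleep", "VERB"),
    ("in", "ADP"), ("the", "DET"), ("basket", "NOUN"),
    ("near", "ADP"), ("the", "DET"), ("door", "NOUN"),
    (".", "PUNCT"),
]

# One character per tag of interest; every other tag becomes "x".
CODES = {"ADP": "P", "DET": "D", "ADJ": "J", "NOUN": "N"}


def match_pos_pattern(tagged, regex=r"PD?J?N+"):
    """Return the token sequences whose POS tags match `regex`.

    The default regex mirrors "POS:ADP POS:DET:? POS:ADJ:? POS:NOUN:+".
    """
    codes = "".join(CODES.get(pos, "x") for _, pos in tagged)
    # Each character corresponds to one token, so match offsets are token indices.
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in re.finditer(regex, codes)
    ]


matches = match_pos_pattern(TAGGED)
print(matches)
```

Dropping the leading P from the regex gives plain determiner/adjective/noun sequences, which is close to what noun_chunks returns for simple sentences.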

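from spacy.en import English may give you the error

No module named 'spacy.en'

In spacy 2.0+ all language data has been moved to the submodule spacy.lang.

Please use from spacy.lang.en import English instead.

Then follow all the remaining steps from @syllogism_'s answer. Note that English() on its own creates a pipeline without a parser, and noun_chunks needs a dependency parse, so you may have to load a trained model (e.g. spacy.load("en_core_web_sm")) as in the other answers.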


    import spacy
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("When Sebastian Thrun started working on self-driving cars at "
              "Google in 2007, few people outside of the company took him "
              "seriously. “I can tell you very senior CEOs of major American "
              "car companies would shake my hand and turn away because I wasn’t "
              "worth talking to,” said Thrun, in an interview with Recode earlier "
              "this week.")
    # doc text is from the spacy website
    for x in doc:
        # here you can get nouns, proper nouns and pronouns
        if x.pos_ in ("NOUN", "PROPN", "PRON"):
            print(x.text)