How to get all noun phrases in Spacy

I am new to Spacy and I want to extract "all" noun phrases from a sentence. I'm wondering how I can do this. I have the following code:

import spacy

nlp = spacy.load("en_core_web_sm")

with open("E:/test.txt", "r") as file:
    doc = nlp(file.read())

for np in doc.noun_chunks:
    print(np.text)

But it returns only the base noun phrases, i.e. phrases that do not have any other NP nested inside them. That is, for the phrase below I get the following result:

Phrase: We try to explicitly describe the geometry of the edges of the images.

Result: We, the geometry, the edges, the images.

Expected result: We, the geometry, the edges, the images, the geometry of the edges of the images, the edges of the images.

How can I get all noun phrases, including the nested ones?

Please see the commented code below, which recursively combines nouns. The code was inspired by the Spacy Docs here

import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("We try to explicitly describe the geometry of the edges of the images.")

for np in doc.noun_chunks: # use np instead of np.text
    print(np)

print()

# code to recursively combine nouns
# 'We' is actually a pronoun but included in your question
# hence the token.pos_ == "PRON" part in the last if statement
# suggest you extract PRON separately like the noun-chunks above

# collect the indices of all nouns
nounIndices = [i for i, token in enumerate(doc) if token.pos_ == "NOUN"]
print(nounIndices)

for idxValue in nounIndices:
    # re-parse so every merge starts from the original token indices
    doc = nlp("We try to explicitly describe the geometry of the edges of the images.")
    # merge the noun's full subtree (left edge to right edge) into one token;
    # Span.merge() was removed in spaCy 3, so use the retokenizer instead
    span = doc[doc[idxValue].left_edge.i : doc[idxValue].right_edge.i + 1]
    with doc.retokenize() as retokenizer:
        retokenizer.merge(span)

    for token in doc:
        if token.dep_ in ("dobj", "pobj") or token.pos_ == "PRON":
            print(token.text)

For each noun chunk, you can also get the subtree beneath it. Spacy offers two ways to access it: the left_edge and right_edge attributes, and the subtree attribute, which returns a Token iterator rather than a span. Combining noun_chunks with their subtrees produces some duplicates, which can be removed afterwards.

Here is an example using the left_edge and right_edge attributes:

{np.text
 for nc in doc.noun_chunks
 for np in [
     nc,
     doc[nc.root.left_edge.i : nc.root.right_edge.i + 1]]}

==>

{'We',
 'the edges',
 'the edges of the images',
 'the geometry',
 'the geometry of the edges of the images',
 'the images'}

Try this to get all the nouns from the text:

import spacy
nlp = spacy.load("en_core_web_sm")
text = ("We try to explicitly describe the geometry of the edges of the images.")
doc = nlp(text)
print([chunk.text for chunk in doc.noun_chunks])