How to improve textacy.extract.semistructured_statements() results

For this project, I am using the Wikipedia, spacy, and textacy.extract modules.

I use the wikipedia module to fetch the page for whatever subject I set. It returns the page's content as a single string.

Then I use textacy.extract.semistructured_statements() to filter out facts. It takes two required arguments: the first is the document, and the second is the entity.
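For illustration, a minimal sketch of such a call (assuming the pre-0.11 textacy API used throughout this post, where the entity is positional and the cue defaults to "be"; the sample sentence is my own):

import spacy
import textacy

nlp = spacy.load("en_core_web_sm")
document = nlp("Bill Gates is the co-founder of Microsoft.")

#Yields (entity, cue, fragment) triples of spaCy spans
for entity, cue, fact in textacy.extract.semistructured_statements(document, "Bill Gates"):
    print(entity, cue, fact)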

For testing purposes, I tried setting the subject to Ubuntu and to Bill Gates.


import wikipedia
import spacy
import textacy

#The Subject we are looking for
subject = 'Bill Gates'

#The Wikipedia Page
wikiResults = wikipedia.search(subject)
wikiPage = wikipedia.page(wikiResults[0]).content

#Spacy
nlp = spacy.load("en_core_web_sm")
document = nlp(wikiPage)

#Textacy.Extract
statements = textacy.extract.semistructured_statements(document, subject)

for statement in statements:
    subject, verb, fact = statement

    print(fact)

So when I run the program, I get back multiple results when searching for Ubuntu, but none for Bill Gates. Why is that? How can I improve my code to extract more facts from the Wikipedia page?


Edit: here are the final results

Ubuntu:

Bill Gates:

Many thanks to gabriele m. for pointing me in the right direction.

I added ["It", "he", "she", "they"], which I saw in a neuralcoref module example; an alternative that uses neuralcoref directly is sketched after the code below.

The code below will do the job:

import wikipedia
import spacy
import textacy
import en_core_web_sm

subject = 'Bill Gates'

#The Wikipedia Page
wikiResults = wikipedia.search(subject)

wikiPage = wikipedia.page(wikiResults[0]).content

nlp = en_core_web_sm.load()
document = nlp(wikiPage)
uniqueStatements = set()

#Search every name token (plus common pronouns) against several cue verbs
for word in ["It", "he", "she", "they"] + subject.split(' '):
    for cue in ["be", "have", "write", "talk", "talk about"]:
        statements = textacy.extract.semistructured_statements(document, word, cue=cue, max_n_words=200)
        for statement in statements:
            uniqueStatements.add(statement)

for statement in uniqueStatements:
    entity, cue, fact = statement
    print(entity, cue, fact)
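As an alternative to hand-listing pronouns, one could let neuralcoref rewrite pronouns back to their antecedents before extracting. This is a sketch under the assumptions that neuralcoref is installed (it only supports spaCy 2.x) and that the same pre-0.11 textacy API is available:

import wikipedia
import spacy
import neuralcoref
import textacy

subject = 'Bill Gates'
wikiPage = wikipedia.page(wikipedia.search(subject)[0]).content

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)  #adds the doc._.coref_resolved extension

#Replace pronouns with their antecedents, then re-parse the resolved text
resolvedText = nlp(wikiPage)._.coref_resolved
document = nlp(resolvedText)

uniqueStatements = set()
for word in subject.split(' ') + [subject]:
    for cue in ["be", "have", "write"]:
        for statement in textacy.extract.semistructured_statements(document, word, cue=cue, max_n_words=200):
            uniqueStatements.add(statement)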

You need to process the document with different cues to extract the common verbs used to describe the subject, and if the subject is more than one word you also need to split the string. For example, for Bill Gates you will want to search the combinations 'Bill', 'Gates', and 'Bill Gates', and you will need to extract with the different cue base verbs used to describe the person/object of interest (a small helper for building those combinations is sketched below).
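For instance, a small helper for building those combinations (the helper name is my own, not from the original answer):

def nameVariants(subject):
    #Full name plus each individual token
    parts = subject.split(' ')
    return parts + [subject] if len(parts) > 1 else parts

print(nameVariants('Bill Gates'))  #['Bill', 'Gates', 'Bill Gates']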

For example, searching for 'Gates':

statements = textacy.extract.semistructured_statements(document, "Gates", cue='have', max_n_words=200)

will give you more results, such as:

* entity: Gates , cue: had , fact: primary responsibility for Microsoft's product strategy from the company's founding in 1975 until 2006
* entity: Gates , cue: is , fact: notorious for not being reachable by phone and for not returning phone calls
* entity: Gates , cue: was , fact: the second wealthiest person behind Carlos Slim, but regained the top position in 2013, according to the Bloomberg Billionaires List
* entity: Bill , cue: were , fact: the second-most generous philanthropists in America, having given over  billion to charity
* entity: Gates , cue: was , fact: seven years old
* entity: Gates , cue: was , fact: the guest on BBC Radio 4's Desert Island Discs on January 31, 2016, in which he talks about his relationships with his father and Steve Jobs, meeting Melinda Ann French, the start of Microsoft and some of his habits (for example reading The Economist "from cover to cover every week
* entity: Gates , cue: was , fact: the world's highest-earning billionaire in 2013, as his net worth increased by US.8 billion to US.5 billion

Note that the verb can be negated, as in result 2!

I also noticed that using a max_n_words larger than the default of 20 can produce more interesting statements.
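One version caveat (my note, not from the original answer): my understanding is that in textacy 0.11 and later this function takes entity and cue as keyword-only arguments and replaces min_n_words/max_n_words with a fragment_len_range tuple, so on a newer install the call would look roughly like:

#Assumed textacy >= 0.11 signature; check your installed version's docs
statements = textacy.extract.semistructured_statements(document, entity="Gates", cue="have", fragment_len_range=(1, 200))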

Here is my complete script:

import wikipedia
import spacy
import textacy
import en_core_web_sm

subject = 'Bill Gates'

#The Wikipedia Page
wikiResults = wikipedia.search(subject)
#print("wikiResults:", wikiResults)
wikiPage = wikipedia.page(wikiResults[0]).content
print("\n\nwikiPage:", wikiPage, "'\n")
nlp = en_core_web_sm.load()
document = nlp(wikiPage)
uniqueStatements = set()
for word in ["Gates", "Bill", "Bill Gates"]:
    for cue in ["be", "have", "write", "talk", "talk about"]:
        statements = textacy.extract.semistructured_statements(document, word, cue=cue, max_n_words=200)
        for statement in statements:
            uniqueStatements.add(statement)

print("found", len(uniqueStatements), "statements.")
for statement in uniqueStatements:
    entity, cue, fact = statement
    print("* entity:", entity, ", cue:", cue, ", fact:", fact)

Different subjects and cue verbs got me 23 results instead of one.