NLP - information extraction in Python (spaCy)
I am trying to extract this type of information from the following paragraph structure:
women_ran  men_ran  kids_ran  walked
1          2        1         3
2          4        3         1
3          6        5         2
text = ["On Tuesday, one women ran on the street while 2 men ran and 1 child ran on the sidewalk. Also, there were 3 people walking.", "One person was walking yesterday, but there were 2 women running as well as 4 men and 3 kids running.", "The other day, there were three women running and also 6 men and 5 kids running on the sidewalk. Also, there were 2 people walking in the park."]
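I am using Python's spaCy as my NLP library. I am new to NLP work and am hoping for some guidance on the best way to extract this tabular information from sentences like these.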
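If it were simply a matter of identifying whether there were individuals running or walking, I would just use sklearn to fit a classification model, but the information I need to extract is obviously more granular than that (I am trying to retrieve each subcategory and its value). Any guidance would be greatly appreciated.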
You'll want to use the dependency parse for this. You can see a visualisation of your example sentence using the displaCy visualiser.
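You could implement the rules you need in a few different ways, much like how there are always multiple ways to write an XPath query, a DOM selector, and so on.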
Something like this should work:
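    import spacy

    # The 'en' shortcut only works on older spaCy releases; spaCy 3 and later
    # need the full package name (python -m spacy download en_core_web_sm).
    nlp = spacy.load('en_core_web_sm')
    docs = [nlp(t) for t in text]
    for i, doc in enumerate(docs):
        for j, sent in enumerate(doc.sents):
            # Nominal subjects of the sentence, e.g. "men" in "2 men ran"
            subjects = [w for w in sent if w.dep_ == 'nsubj']
            for subject in subjects:
                # Numeric modifiers attached to the left of the subject, e.g. "2"
                numbers = [w for w in subject.lefts if w.dep_ == 'nummod']
                if len(numbers) == 1:
                    print('document.sentence: {}.{}, subject: {}, action: {}, numbers: {}'.format(i, j, subject.text, subject.head.text, numbers[0].text))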
For the examples in text, you should get:
document.sentence: 0.0, subject: men, action: ran, numbers: 2
document.sentence: 0.0, subject: child, action: ran, numbers: 1
document.sentence: 0.1, subject: people, action: walking, numbers: 3
document.sentence: 1.0, subject: person, action: walking, numbers: One
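To get from those printed matches to the table in the question, one option is to collect each match as a tuple and pivot it into one row per document. The following is only a rough sketch under some assumptions: the names NUMBER_WORDS, SUBJECT_GROUPS, RUN_FORMS, WALK_FORMS, to_int, column_for and build_table are made up for this illustration, the word lists are deliberately tiny, and the records are typed in by hand from the output above rather than produced by spaCy.

```python
# Hypothetical lookup tables, written for this example only.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
SUBJECT_GROUPS = {"woman": "women", "women": "women",
                  "man": "men", "men": "men",
                  "child": "kids", "children": "kids",
                  "kid": "kids", "kids": "kids"}
RUN_FORMS = {"run", "runs", "ran", "running"}
WALK_FORMS = {"walk", "walks", "walked", "walking"}
COLUMNS = ["women_ran", "men_ran", "kids_ran", "walked"]


def to_int(token_text):
    """Turn '2' or 'One' into an int; return None if it is not recognised."""
    lowered = token_text.lower()
    if lowered.isdigit():
        return int(lowered)
    return NUMBER_WORDS.get(lowered)


def column_for(subject, action):
    """Map a (subject, action) pair to one of COLUMNS, or None if no match."""
    action = action.lower()
    if action in WALK_FORMS:
        # The target table does not split walkers by subject.
        return "walked"
    group = SUBJECT_GROUPS.get(subject.lower())
    if action in RUN_FORMS and group is not None:
        return group + "_ran"
    return None


def build_table(records, n_docs):
    """Pivot (doc_index, subject, action, number) tuples into one row per document."""
    rows = [dict.fromkeys(COLUMNS) for _ in range(n_docs)]
    for doc_index, subject, action, number in records:
        column = column_for(subject, action)
        value = to_int(number)
        if column is not None and value is not None:
            rows[doc_index][column] = value
    return rows


# Records copied by hand from the printed output above.
records = [
    (0, "men", "ran", "2"),
    (0, "child", "ran", "1"),
    (0, "people", "walking", "3"),
    (1, "person", "walking", "One"),
]
table = build_table(records, n_docs=3)
for row in table:
    print(row)
```

Cells that stay None are the counts the nsubj rule above did not pick up (for example the "there were 2 women running" constructions), so they show where further dependency rules would be needed.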