How to parse a specific sentence?

Question

Consider this minimal data frame:

import spacy
nlp = spacy.load('en_core_web_sm')
import pandas as pd
import numpy as np    

mydata = pd.DataFrame({'text' : [u'the cat eats the dog. the dog eats the cat']})

I know I can use apply on my text column to run spaCy:

mydata['parsed'] = mydata.text.apply(lambda x: nlp(x))

However, I want to do something more subtle: how can I use part-of-speech tagging and spaCy to extract the sentences whose subject is dog?

The output should be the extracted column below:

Out[16]: 
              extracted                                        text
0  the dog eats the cat  the cat eats the dog. the dog eats the cat

Thanks!

Answer 1

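This is not really a pandas question. You have three separate problems:

1. Split each string into sentences
2. Determine the subject of each sentence
3. Return the sentence if its subject is dog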

1. We can use the split() method to split a string into a list.

my_string = "the dog ate the bread. the cat ate the bread"
sentences = my_string.split('.')
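
One thing to watch for: split('.') keeps the space that follows each period and produces an empty string when the text ends with a period. A minimal sketch of a cleanup step, where split_sentences is a hypothetical helper name and not part of spaCy or pandas:

```python
def split_sentences(text):
    """Split on '.' and drop empty or whitespace-only pieces."""
    pieces = text.split('.')
    return [piece.strip() for piece in pieces if piece.strip()]


my_string = "the dog ate the bread. the cat ate the bread."

# Plain split keeps the leading space and a trailing empty string
print(my_string.split('.'))
# ['the dog ate the bread', ' the cat ate the bread', '']

# The helper returns only the cleaned sentences
print(split_sentences(my_string))
# ['the dog ate the bread', 'the cat ate the bread']
```

Splitting on '.' is still naive (abbreviations such as "e.g." would be cut apart), so treat this as good enough for simple text only.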

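2. According to the spaCy documentation, calling nlp() on a string gives us a Doc, which contains tokens that in turn carry a number of properties.

The property we are interested in is dep_, because it tells us the relation between a token and the other tokens, i.e. whether or not our token is the subject.

You can find the list of attributes here: https://spacy.io/usage/linguistic-features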

doc = nlp(my_string)

for token in doc:
    print(token.dep_)  # if this prints `nsubj` the token is a nominal subject!
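
To make the filtering step concrete, here is a small sketch that collects the text of every token labelled nsubj. The helper name subject_texts is made up for illustration, and so that the snippet runs without a downloaded model it is exercised on hand-written stand-in tokens; the labels below are typed by hand, not actual parser output.

```python
from types import SimpleNamespace


def subject_texts(tokens):
    """Return the text of every token labelled as a nominal subject."""
    return [token.text for token in tokens if token.dep_ == 'nsubj']


# Stand-ins for spaCy tokens: only the two attributes used above are present.
# These dependency labels are an assumption written by hand for illustration.
stand_in_doc = [
    SimpleNamespace(text='the', dep_='det'),
    SimpleNamespace(text='dog', dep_='nsubj'),
    SimpleNamespace(text='ate', dep_='ROOT'),
    SimpleNamespace(text='the', dep_='det'),
    SimpleNamespace(text='bread', dep_='dobj'),
]

print(subject_texts(stand_in_doc))  # ['dog']
```

Iterating over a real Doc yields tokens that expose the same text and dep_ attributes, so subject_texts(nlp(my_string)) should work the same way on parsed text.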

3. To check whether the token equals 'dog' we need to get the text attribute from the token:

token.text

Putting it all together:

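import spacy

NLP = spacy.load('en_core_web_sm')

def extract_sentence_based_on_subject(string, subject):
    # Naive sentence split on '.'
    sentences = string.split('.')

    for sentence in sentences:
        # strip() removes the leading space left behind by split('.')
        sentence = sentence.strip()
        doc = NLP(sentence)
        for token in doc:
            # 'nsubj' marks the nominal subject of the sentence
            if token.dep_ == 'nsubj' and token.text == subject:
                return sentence

    # No sentence has the requested subject
    return None


mydata['extracted'] = mydata['text'].apply(extract_sentence_based_on_subject, subject='dog')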
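
The function above stops at the first matching sentence. If a text may contain several sentences with the same subject, a variation along these lines could return all of them. This is only a sketch: extract_all and its parse parameter are hypothetical names, and parse is assumed to be any callable that turns a string into tokens with text and dep_ attributes, such as the loaded spaCy pipeline.

```python
def extract_all(text, subject, parse):
    """Return every sentence in text whose nominal subject equals subject.

    parse is assumed to be a callable mapping a string to an iterable of
    tokens that expose .text and .dep_ (for example a loaded spaCy pipeline).
    """
    matches = []
    for sentence in text.split('.'):
        sentence = sentence.strip()
        if not sentence:
            # Skip the empty piece that follows a trailing period
            continue
        tokens = parse(sentence)
        if any(t.dep_ == 'nsubj' and t.text == subject for t in tokens):
            matches.append(sentence)
    return matches
```

With the pipeline loaded as NLP, it could be applied as mydata['text'].apply(extract_all, subject='dog', parse=NLP), which gives a list of sentences per row instead of a single string.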

python

pandas

spacy