Using spacy to get rid of stopwords in pandas series

I have been trying to get rid of stopwords using the spacy library.

Code

import spacy
import pandas as pd
import numpy as np

nlp = spacy.load('en_core_web_sm')

my_series:

my_series

0        this laptop sits at just over 4 stars while so...
1        i ordered this monitor because i wanted to mak...
2        this monitor is a great deal for the price and...
3        bought this for the height adjustment. the swi...
4        worked for a month and then it died. after 5 c...
                               ...                        
30618                                           great deal
30619                                      pour le travail
30620                                         business use
30621                                            good size
30622    pour mon ordinateur.plus grande image.vraiment...
Name: text_body, Length: 30623, dtype: object

Tokenizing

s_tokenized = my_series.apply(lambda x: nlp(x))

Removing stopwords

all_stopwords = nlp.Defaults.stop_words
filtered_text=s_tokenized.apply(lambda x: [w for w in x if not w in all_stopwords])
filtered_text

0        [this, laptop, sits, at, just, over, 4, stars,...
1        [i, ordered, this, monitor, because, i, wanted...
2        [this, monitor, is, a, great, deal, for, the, ...
3        [bought, this, for, the, height, adjustment, ....
4        [worked, for, a, month, and, then, it, died, ....
                               ...                        
30618                                        [great, deal]
30619                                  [pour, le, travail]
30620                                      [business, use]
30621                                         [good, size]
30622    [pour, mon, ordinateur.plus, grande, image.vra...
Name: text_body, Length: 30623, dtype: object

Tokenizing seems to work fine, but removing stopwords doesn't appear to remove any words at all, and it doesn't raise any errors either. Is there something I'm missing or doing wrong?

The problem is in this line:

filtered_text=s_tokenized.apply(lambda x: [w for w in x if not w in all_stopwords])

Correct it to:

filtered_text = s_tokenized.apply(lambda x: [w for w in x if w.text not in all_stopwords])
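Why the original version silently removed nothing: iterating over a spacy `Doc` yields `Token` objects, not strings, and a `Token` never compares equal to the plain strings in `nlp.Defaults.stop_words`, so the membership test is always false. A minimal sketch of the difference (using `spacy.blank("en")`, which needs no downloaded model, since stop words are language defaults):

```python
import spacy

# A blank English pipeline is enough here: the tokenizer and the stop word
# list are language defaults, so no downloaded model is required.
nlp = spacy.blank("en")
all_stopwords = nlp.Defaults.stop_words  # a set of plain strings

doc = nlp("this is a test")
tok = doc[0]  # the Token "this"

print(tok in all_stopwords)       # False: a Token is never equal to a str
print(tok.text in all_stopwords)  # True: compare the token's text instead
```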

And you're good to go:

import spacy
nlp = spacy.load("en_core_web_sm")
s_tokenized = my_series.apply(nlp)
all_stopwords = nlp.Defaults.stop_words
filtered_text = s_tokenized.apply(lambda x: [w for w in x if w.text not in all_stopwords])
filtered_text
0      [laptop, sits, 4, stars]
1    [ordered, monitor, wanted]
dtype: object

Note that you don't need a pandas series to hold your data. A plain string or a list of strings is enough. The spacy way of doing the same thing is:

import spacy
nlp = spacy.load("en_core_web_sm")
texts = ["this laptop sits at just over 4 stars while", "i ordered this monitor because i wanted"]
docs = nlp.pipe(texts)
filtered_text = []
for doc in docs:
    # tok.is_stop is a built-in per-token flag, so no stop word set is needed
    filtered_text.append([tok for tok in doc if not tok.is_stop])
print(filtered_text)

[[laptop, sits, 4, stars], [ordered, monitor, wanted]]
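Both versions above leave you with lists of `Token` objects. If you want plain strings back (for example, to store the cleaned text in a pandas column again), join the surviving tokens' `.text` values. A sketch, again using `spacy.blank("en")` so no model download is needed (`is_stop` is a lexical attribute and works without one):

```python
import spacy

nlp = spacy.blank("en")  # is_stop is a lexical attribute, so a blank pipeline suffices
texts = ["this laptop sits at just over 4 stars while",
         "i ordered this monitor because i wanted"]

# Join the non-stopword tokens back into strings instead of keeping Token objects
cleaned = [" ".join(tok.text for tok in doc if not tok.is_stop)
           for doc in nlp.pipe(texts)]
print(cleaned)  # ['laptop sits 4 stars', 'ordered monitor wanted']
```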