Using spaCy to remove stopwords from a pandas Series
I have been trying to remove stopwords using the spaCy library.
Code
import spacy
import pandas as pd
import numpy as np
nlp = spacy.load('en_core_web_sm')
my_series:
0 this laptop sits at just over 4 stars while so...
1 i ordered this monitor because i wanted to mak...
2 this monitor is a great deal for the price and...
3 bought this for the height adjustment. the swi...
4 worked for a month and then it died. after 5 c...
...
30618 great deal
30619 pour le travail
30620 business use
30621 good size
30622 pour mon ordinateur.plus grande image.vraiment...
Name: text_body, Length: 30623, dtype: object
Tokenize
s_tokenized=my_series.apply(lambda x: nlp(x))
Remove stopwords
all_stopwords = nlp.Defaults.stop_words
filtered_text=s_tokenized.apply(lambda x: [w for w in x if not w in all_stopwords])
filtered_text
0 [this, laptop, sits, at, just, over, 4, stars,...
1 [i, ordered, this, monitor, because, i, wanted...
2 [this, monitor, is, a, great, deal, for, the, ...
3 [bought, this, for, the, height, adjustment, ....
4 [worked, for, a, month, and, then, it, died, ....
...
30618 [great, deal]
30619 [pour, le, travail]
30620 [business, use]
30621 [good, size]
30622 [pour, mon, ordinateur.plus, grande, image.vra...
Name: text_body, Length: 30623, dtype: object
Tokenization seems to work fine, but the stopword removal doesn't appear to remove any words at all, and it doesn't raise any errors either. Is there something I'm missing or doing wrong?
You have an issue with this line:
filtered_text=s_tokenized.apply(lambda x: [w for w in x if not w in all_stopwords])
Correct it to:
filtered_text = s_tokenized.apply(lambda x: [w for w in x if w.text not in all_stopwords])
and you're good to go:
import spacy
nlp = spacy.load("en_core_web_sm")
s_tokenized = my_series.apply(nlp)
all_stopwords = nlp.Defaults.stop_words
filtered_text = s_tokenized.apply(lambda x: [w for w in x if w.text not in all_stopwords])
filtered_text
0 [laptop, sits, 4, stars]
1 [ordered, monitor, wanted]
dtype: object
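The original comparison fails silently because `s_tokenized` holds spaCy `Token` objects, while `nlp.Defaults.stop_words` is a set of plain strings, so `w in all_stopwords` is always False. A minimal sketch of the difference (using a blank English pipeline so no model download is needed):

```python
import spacy

# A blank English pipeline is enough to tokenize; no trained model required.
nlp = spacy.blank("en")
doc = nlp("this is a test")
stopwords = nlp.Defaults.stop_words

tok = doc[0]                   # a spacy.tokens.Token, not a str
print(tok in stopwords)        # False: a Token never equals a string
print(tok.text in stopwords)   # True: compare the token's text instead
```

This is also why no error is raised: membership tests against a set are perfectly legal for any hashable object, they just never match.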
Note that you do not need a pandas Series to hold your data; a string or a list of strings is enough. The spaCy way of doing the same thing is:
import spacy
nlp = spacy.load("en_core_web_sm")
texts = ["this laptop sits at just over 4 stars while", "i ordered this monitor because i wanted"]
docs = nlp.pipe(texts)
filtered_text = []
for doc in docs:
    # yield [tok for tok in doc if not tok.is_stop]
    filtered_text.append([tok for tok in doc if not tok.is_stop])
print(filtered_text)
[[laptop, sits, 4, stars], [ordered, monitor, wanted]]
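If all you need is tokenization and stopword filtering, a blank pipeline is much faster than a full trained model, and joining the surviving tokens back into strings keeps the result pandas-friendly. A sketch under those assumptions:

```python
import spacy

# spacy.blank("en") gives a tokenizer-only pipeline: no model download,
# and lexical attributes like is_stop still work.
nlp = spacy.blank("en")
texts = ["this laptop sits at just over 4 stars while",
         "i ordered this monitor because i wanted"]

# Keep token text (plain strings) rather than Token objects, so the result
# can go straight back into a pandas Series if you want one.
filtered = [" ".join(tok.text for tok in doc if not tok.is_stop)
            for doc in nlp.pipe(texts)]
print(filtered)  # ['laptop sits 4 stars', 'ordered monitor wanted']
```

On 30,000+ short reviews like the ones above, skipping the tagger/parser this way makes a noticeable difference in runtime.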