Removing altered stopwords
Background:
1) I have the following code, which uses the nltk package to remove stopwords:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
your_string = "The dog does not bark at the tree when it sees a squirrel"
tokens = word_tokenize(your_string)
lower_tokens = [t.lower() for t in tokens]
filtered_words = [word for word in lower_tokens if word not in stopwords.words('english')]
2) This code removes stopwords such as the, as shown below:
['dog', 'bark', 'tree', 'sees', 'squirrel']
3) I modified the stopwords to keep the word not, using the following code:
to_remove = ['not']
new_stopwords = set(stopwords.words('english')).difference(to_remove)
Problem:
4) But when I use new_stopwords with the following code:
your_string = "The dog does not bark at the tree when it sees a squirrel"
tokens = word_tokenize(your_string)
lower_tokens = [t.lower() for t in tokens]
filtered_words = [word for word in lower_tokens if word not in new_stopwords.words('english')]
5) I get the following error, because new_stopwords is a set:
AttributeError: 'set' object has no attribute 'words'
Question:
6) How can I use the newly defined new_stopwords to get the desired output:
['dog', 'not', 'bark', 'tree', 'sees', 'squirrel']
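You are very close, but you are misreading the error message: the problem is not that "new_stopwords is a set", as you put it, but that "a set has no attribute words". And it doesn't. new_stopwords is already a set of words, which means you can use it directly in the list comprehension: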
filtered_words = [word for word in lower_tokens if word not in new_stopwords]
You can also skip modifying the stopword list altogether and simply use two conditions:
keep_list = ['not']
filtered_words = [word for word in lower_tokens if (word not in stopwords.words("english")) or (word in keep_list)]
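To see the set-based approach end to end without needing the NLTK corpus downloaded, here is a minimal sketch. Note the assumptions: base_stopwords is a small hand-picked stand-in for stopwords.words('english'), str.split() stands in for word_tokenize, and remove_stopwords is a hypothetical helper name, not part of nltk.

```python
# Stand-in for stopwords.words('english'): a small hand-picked subset so the
# sketch runs without the NLTK corpus (an assumption, not the full list).
base_stopwords = ["the", "does", "not", "at", "when", "it", "a"]


def remove_stopwords(tokens, stop_set):
    """Return the lowercased tokens that are not members of stop_set."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in stop_set]


# Build the altered stopword set once, dropping the words we want to keep.
to_remove = ["not"]
new_stopwords = set(base_stopwords).difference(to_remove)

your_string = "The dog does not bark at the tree when it sees a squirrel"
tokens = your_string.split()  # stand-in for word_tokenize(your_string)

# A set supports "in" directly; there is no .words() method to call on it.
filtered_words = remove_stopwords(tokens, new_stopwords)
print(filtered_words)
```

Building the set once outside the comprehension also avoids calling stopwords.words('english') again for every token, which the two-condition version does.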