Why are stop words not being excluded from the word cloud when using Python's wordcloud library?
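I want to keep 'The', 'They' and 'My' from appearing in my word cloud. I'm using the Python library 'wordcloud' as shown below and updating the stop word list with these 3 additional stop words, but the word cloud still includes them. What do I need to change so that these 3 words are excluded?
The libraries I imported are: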
import numpy as np
import pandas as pd
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
I've tried adding elements to the STOPWORDS set, but even though the words are added successfully, the word cloud still shows the 3 words I added to the set:
len(STOPWORDS)
Output: 192
Then I run:
STOPWORDS.add('The')
STOPWORDS.add('They')
STOPWORDS.add('My')
Then I run:
len(STOPWORDS)
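Output: 195
I'm running Python version 3.7.3.
I know I can modify the text input to remove the 3 words before running wordcloud (instead of trying to modify WordCloud's stop word set), but I'd like to know whether there is a bug in WordCloud or whether I'm updating/using the stop words incorrectly?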
pip install nltk
Don't forget to install the stop words.
python
>>> import nltk
>>> nltk.download('stopwords')
Try this:
from wordcloud import WordCloud
from matplotlib import pyplot as plt
from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))
text = "The bear sat with the cat. They were good friends. " + \
"My friend is a bit bear like. He's lovely. The bear, the cat, the dog and me were all sat " + \
"there enjoying the view. You should have seen it. The view was absolutely lovely. " + \
"It was such a lovely day. The bear was loving it too."
cloud = WordCloud(stopwords=stopwords,
background_color='white',
max_words=10).generate(text.lower())
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()
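Wordcloud's default is collocations=True, so frequent phrases of two adjacent words are included in the cloud. Importantly for your issue, stop word removal works differently for collocations: for example, "Thank you" is a valid collocation and may appear in the generated cloud even though "you" is in the default stop words. Only collocations that consist entirely of stop words are removed.
The reason for this, which sounds not unreasonable, is that if stop words were removed before building the collocation list, then e.g. "thank you very much" would yield "thank very" as a collocation, which I definitely wouldn't want.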
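The ordering argument above can be illustrated with a small pure-Python sketch. It is not wordcloud's actual code: the `pairwise` helper, the tiny `STOP` set and the sample phrase are all assumptions made up for this illustration.

```python
from itertools import tee

# Hypothetical, tiny stop word set used only for this illustration.
STOP = {"you", "to"}

def pairwise(words):
    """Return adjacent pairs: (w0, w1), (w1, w2), ..."""
    a, b = tee(words)
    next(b, None)
    return list(zip(a, b))

words = "thank you very much to you".split()

# Order A: build bigrams first, then drop only the bigrams
# that consist entirely of stop words.
bigrams_first = [p for p in pairwise(words) if not all(w in STOP for w in p)]

# Order B: remove stop words first, then build bigrams from what is left.
filtered = [w for w in words if w not in STOP]
stopwords_first = pairwise(filtered)

print(bigrams_first)    # keeps ('thank', 'you'), drops ('to', 'you')
print(stopwords_first)  # contains the unwanted ('thank', 'very')
```

With order A, "thank you" survives even though "you" is a stop word, which matches the behaviour described above; with order B, the artificial pair "thank very" appears.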
So, to get your stop words to work the way you expect, i.e. no stop words at all appear in the cloud, you can use collocations=False like this:
my_wordcloud = WordCloud(
stopwords=my_stopwords,
background_color='white',
collocations=False,
max_words=10).generate(all_tweets_as_one_string)
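Update:
- With collocations False, the stop words are all lowercased so that they can be compared against the lowercased text when removing them, so there is no need to add 'The' etc.
- With collocations True (the default), the stop words are lowercased, but when looking for all-stop-word collocations to remove, the source text is not lowercased, so e.g. The in the text is not removed while the is removed. That's why @Balaji Ambresh's code works, and you'll see there are no capitals in the cloud. This might be a defect in Wordcloud, I'm not sure. However, adding e.g. The to the stop words won't affect this, because the stop words are always lowercased irrespective of collocations True/False.
All of this is visible in the source code :-)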
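A minimal sketch of the lowercasing behaviour described for collocations=False, showing why adding 'The' or 'They' changes nothing. The `remove_stopwords` function is a simplified stand-in written for this illustration, not the library's implementation, and the stop word sets are made up.

```python
def remove_stopwords(words, stopwords):
    """Drop stop words, comparing case-insensitively.

    Simplified stand-in for illustration; not wordcloud's own code.
    """
    lowered = {s.lower() for s in stopwords}  # stop words are lowercased first
    return [w for w in words if w.lower() not in lowered]

words = "The bear sat with the cat They were good friends".split()

base = {"the", "they", "with", "were"}   # hypothetical stop word set
extra = base | {"The", "They"}           # same set plus capitalised variants

kept_base = remove_stopwords(words, base)
kept_extra = remove_stopwords(words, extra)

print(kept_base)                # ['bear', 'sat', 'cat', 'good', 'friends']
print(kept_extra == kept_base)  # True: the capitalised entries add nothing
```

Because both the stop words and each token are lowercased before comparison, the set with 'The' and 'They' added filters exactly the same words as the set without them.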
For example, with the default collocations=True I get: (image not preserved)
Then with collocations=False I get: (image not preserved)
Code:
from wordcloud import WordCloud
from matplotlib import pyplot as plt
text = "The bear sat with the cat. They were good friends. " + \
"My friend is a bit bear like. He's lovely. The bear, the cat, the dog and me were all sat " + \
"there enjoying the view. You should have seen it. The view was absolutely lovely. " + \
"It was such a lovely day. The bear was loving it too."
cloud = WordCloud(collocations=False,
background_color='white',
max_words=10).generate(text)
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()