Word Cloud showing several ' amongst words and not sure why

Question

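I tried to exclude them by adding "'" to the stopwords, but that failed. I'm not sure where they are being pulled from, since they are not in the file. Thanks for your help.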

from wordcloud import WordCloud
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np 

url = 'https://raw.githubusercontent.com/Imme21/WordCloud/main/StockData3.csv'
# on_bad_lines='skip' replaces error_bad_lines=False (deprecated in pandas 1.3, removed in 2.0)
df = pd.read_csv(url, on_bad_lines='skip')
df.dropna(inplace = True)
text = df['Stock'].values

wordcloud = WordCloud(background_color = 'white',
            stopwords = ['Date','Stock', 'Tickers', 
                         'Open','Close', 'High', 
                         'Low', 'IV', 'under',
                         'over', 'price', 'change', 
                         '%', 'null']).generate(str(text))


plt.imshow(wordcloud) 
plt.axis("off")
plt.show()

Answer 1

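This happens because you have an array of strings, and str(text) renders every element wrapped in quotes. WordCloud assumes a trailing apostrophe can be part of a word (so that it can handle words like can't and aren't), so each closing quote sticks to the ticker before it. See this post for details.

You can fix this by passing a single string of space-separated words instead of the array. ' '.join(text) should solve your problem.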
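To make the cause visible, here is a minimal sketch using only the standard library. The pattern `\w[\w']+` is an assumed approximation of WordCloud's default tokenizer, not a copy of its code, and `array_as_text` is written out by hand to mimic what `str(text)` produces for a NumPy array of strings. The names `tokenize`, `array_as_text` and `joined_text` are illustrative.

```python
import re

# Hand-written stand-in for str(df['Stock'].values): elements are quoted
array_as_text = "['GME' 'SPY' 'TSLA' 'PLTR']"

# The same tickers as one space-separated string
joined_text = "GME SPY TSLA PLTR"

# Assumption: approximates WordCloud's default token pattern, which lets
# apostrophes follow a word character so that "can't" stays one token
TOKEN_PATTERN = r"\w[\w']+"


def tokenize(text):
    """Split text into tokens the way the assumed pattern would."""
    return re.findall(TOKEN_PATTERN, text)


print(tokenize(array_as_text))  # ["GME'", "SPY'", "TSLA'", "PLTR'"]
print(tokenize(joined_text))    # ['GME', 'SPY', 'TSLA', 'PLTR']
```

The opening quote is dropped because a token must start with a word character, but the closing quote is kept. That is also why adding "'" to the stopwords has no effect: the apostrophe is never a token on its own.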

# You read a string from the file and converted it to an array
>>> text = df['Stock'].values
>>> text
array(['GME', 'SPY', 'TSLA', 'PLTR', 'AAPL', 'AMC', 'RKT', 'NIO', 'DIS',
.....

>>> text = ' '.join(text)
>>> text
'GME SPY TSLA PLTR AAPL AMC RKT NIO.....

This should get rid of the stray trailing quotes or apostrophes in your word cloud.

Answer 2

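The problem lies in how the string is built from the values in the dataframe column, specifically text = df['Stock'].values and .generate(str(text)).

Using pandas.Series.str.cat produces a "proper" string and gives you the desired result: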

...
>>> text = df['Stock'].str.cat(sep=' ')
...
>>> wordcloud = WordCloud(background_color = 'white',
            stopwords = ['Date','Stock', 'Tickers', 
                         'Open','Close', 'High', 
                         'Low', 'IV', 'under',
                         'over', 'price', 'change', 
                         '%', 'null']).generate(text)
...
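As a small self-contained illustration of what `str.cat` returns, the sketch below uses a made-up Series (`stocks`, with invented sample values) in place of the real `df['Stock']` column. With only `sep` given, missing values are omitted from the result, so this particular step does not depend on an earlier `dropna`.

```python
import pandas as pd

# Hypothetical stand-in for df['Stock'], including one missing value
stocks = pd.Series(['GME', 'SPY', None, 'TSLA'], name='Stock')

# Concatenate all entries into one space-separated string;
# missing values are skipped when na_rep is left at its default
text = stocks.str.cat(sep=' ')

print(text)  # GME SPY TSLA
```

The result is a plain Python string with no quotes or brackets, so it can be passed to `.generate(text)` directly.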


python

stop-words

word-cloud