How can I generate a word cloud from tokenized words in Python?
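I have code that imports a txt file and tokenizes it with the NLTK library (as done in https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk). I got almost everything I needed done easily, but I'm struggling to build a word cloud from the words I now have, and even after hours of searching online I have no clue how to do it.
Here is my code so far: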
# Load libraries
!pip install nltk
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
# Import file
f = open('PNAD2002.txt','r')
pnad2002 = ""
while 1:
    line = f.readline()
    if not line:break
    pnad2002 += line
f.close()
tokenized_word=word_tokenize(pnad2002)
tokenized_word_2 = [w.lower() for w in tokenized_word]
I would like to use the following code (from https://github.com/amueller/word_cloud/blob/master/examples/simple.py):
# Read the whole text.
text = open(path.join(d, 'constitution.txt')).read()
# Generate a word cloud image
wordcloud = WordCloud().generate(text)
# Display the generated image:
# the matplotlib way:
import matplotlib.pyplot as plt
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
# lower max_font_size
wordcloud = WordCloud(max_font_size=40).generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
But I don't know how to use my tokenized words with it.
You need to instantiate a WordCloud object and then call generate_from_text:
wc = WordCloud()
img = wc.generate_from_text(' '.join(tokenized_word_2))
img.to_file('wordcloud.jpeg') # example of something you can do with the img
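If you would rather use the token list directly than join it back into a string, one option is to count the tokens yourself and pass the counts to generate_from_frequencies, which skips the library's own text splitting. The following is a minimal sketch, not part of the original answer: the helper names token_frequencies and cloud_from_tokens, the sample tokens, and the choice to drop non-alphabetic tokens are all assumptions made for illustration.

```python
from collections import Counter


def token_frequencies(tokens, stopwords=frozenset()):
    """Count lowercased alphabetic tokens, skipping any listed in `stopwords`."""
    words = (t.lower() for t in tokens)
    # isalpha() drops punctuation and numbers; it also drops hyphenated words.
    return Counter(w for w in words if w.isalpha() and w not in stopwords)


def cloud_from_tokens(tokens, out_path, stopwords=frozenset()):
    """Render a word cloud from pre-tokenized words and save it to `out_path`."""
    # Imported here so the counting helper works without wordcloud installed.
    from wordcloud import WordCloud

    freqs = token_frequencies(tokens, stopwords)
    wc = WordCloud().generate_from_frequencies(freqs)
    wc.to_file(out_path)
    return wc


# Hypothetical sample tokens standing in for tokenized_word_2.
sample = ["The", "census", ",", "the", "survey", "and", "the", "census", "."]
freqs = token_frequencies(sample, stopwords={"the", "and"})
print(freqs.most_common())
```

With the question's data, the call would look like cloud_from_tokens(tokenized_word_2, 'wordcloud.png'). Counting the tokens yourself means punctuation tokens produced by word_tokenize never reach the cloud, and you decide which stopword list applies.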
You can pass many customization options to WordCloud; examples are available online, for instance: https://www.datacamp.com/community/tutorials/wordcloud-python
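As a rough illustration of that customization, here is a sketch that collects a few WordCloud constructor arguments and displays the result with matplotlib, as in the example the question quotes. The option values, the helper names cloud_options and show_cloud, and the extra_stopwords parameter are assumptions for illustration, not recommendations from the original answer.

```python
def cloud_options():
    """Return illustrative keyword arguments for the WordCloud constructor."""
    return {
        "width": 800,                 # canvas width in pixels
        "height": 400,                # canvas height in pixels
        "background_color": "white",
        "max_words": 100,             # keep only the most frequent words
        "collocations": False,        # do not add two-word phrases
    }


def show_cloud(tokens, extra_stopwords=()):
    """Build a customized cloud from tokens and display it with matplotlib."""
    # Imported here so cloud_options() can be inspected without these packages.
    import matplotlib.pyplot as plt
    from wordcloud import STOPWORDS, WordCloud

    # STOPWORDS is the library's built-in English list; extend it as needed.
    stopwords = STOPWORDS | set(extra_stopwords)
    wc = WordCloud(stopwords=stopwords, **cloud_options())
    wc.generate_from_text(" ".join(tokens))

    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()
    return wc
```

With the question's data, the call would be show_cloud(tokenized_word_2). If the text is not in English, the built-in STOPWORDS list will filter little, so passing a language-specific list through extra_stopwords is likely to matter more than the visual options.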