Creating an n-gram word cloud using Python

I am trying to generate a word cloud using bigrams. I can extract the top 30 discriminative words for each category, but I cannot get the word pairs to display together when plotting: my word cloud image still looks like a unigram cloud. I used the following script with the scikit-learn package.

import numpy
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def create_wordcloud(pipeline):
    """
    Create word cloud with top 30 discriminative words for each category
    """
    class_labels = numpy.array(['Arts', 'Music', 'News', 'Politics', 'Science', 'Sports', 'Technology'])

    feature_names = pipeline.named_steps['vectorizer'].get_feature_names()
    word_text = []

    for i, class_label in enumerate(class_labels):
        # Indices of the 30 largest coefficients for this class
        top30 = numpy.argsort(pipeline.named_steps['clf'].coef_[i])[-30:]

        print("%s: %s" % (class_label, " ".join(feature_names[j] + "," for j in top30)))

        for j in top30:
            word_text.append(feature_names[j])

        wordcloud1 = WordCloud(width=800, height=500, margin=10,
                               random_state=3, collocations=True).generate(' '.join(word_text))

        # Save word cloud as .png file
        # Image files are saved to the folder "classification_model"
        wordcloud1.to_file(class_label + "_wordcloud.png")

        # Plot wordcloud on console
        plt.figure(figsize=(15, 8))
        plt.imshow(wordcloud1, interpolation="bilinear")
        plt.axis("off")
        plt.show()
        word_text = []

Here is my pipeline code:

pipeline = Pipeline([
    # SVM using TfidfVectorizer
    ('vectorizer', TfidfVectorizer(max_features=25000, ngram_range=(2, 2),
                                   sublinear_tf=True, max_df=0.95, min_df=2,
                                   stop_words=stop_words1)),
    ('clf', LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-3))
])

These are some of the features I obtained for the category "Arts":
Arts: cosmetics businesspeople, television personality, reality television, television presenters, actors london, film producers, actresses television, indian film, set index, actresses actresses, television actors, century actors, births actors, television series, century actresses, actors television, stand comedian, television personalities, television actresses, comedian actor, stand comedians, film actresses, film actors, film directors

I think you need to join the words within each n-gram from feature_names with some character other than a space; I would suggest an underscore. As it stands, I believe this part splits your n-grams back into separate words:

' '.join(word_text)

I think you have to replace the space with an underscore here:

word_text.append(feature_names[j])

Change it to:

word_text.append(feature_names[j].replace(' ', '_'))
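To see why this works, note that WordCloud tokenizes its input text on whitespace, so a bigram like "television personality" is counted as two separate words unless you glue it together first. A minimal sketch of the idea (the feature names below are made-up examples, not your actual features):

```python
# Hypothetical bigram feature names, as TfidfVectorizer would produce them
feature_names = ["television personality", "reality television", "film producers"]

# Join each bigram's words with an underscore so it survives as one token
word_text = [name.replace(' ', '_') for name in feature_names]
text = ' '.join(word_text)

# Splitting on whitespace (as WordCloud does) now keeps each bigram whole
print(text.split())
# → ['television_personality', 'reality_television', 'film_producers']
```

The underscores will be visible in the rendered cloud, but each bigram is drawn as a single unit, which is what you want here.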