Python KMeans: print absolute frequency of words in each cluster
Hello, is there a way to print the absolute frequency of each word in a cluster?
My code looks like this:
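from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# documents is my list of input strings (originally named list, which shadows the builtin)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
true_k = 4
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
print("Top terms per cluster:")
# Sort each centroid's feature indices by descending weight
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :5]:
        print(' %s' % terms[ind])
    print()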
My result looks like this, for example:
Top terms per cluster:
Cluster 0:
house
roof
table
chair
tv
Cluster 1:
...
But I would like something like this, with the absolute frequency of each word:
Cluster 0:
house 65
roof 45
table 44
chair 33
tv 18
Thank you in advance :)
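I'm not sure why you need TfidfVectorizer for individual words. In any case, use KMeans to predict a cluster label for each word, then check the word frequencies within each cluster with df[df.cluster == label].words.value_counts(), where label is the cluster you are interested in.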
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
words = ['this','is','a','very','long','text','my','name','is','not','cortana','today','I','will',
'write','a','long','text','I','am','from','planet','earth','this','text','does','not','make',
'sense']
#tfidf
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(words)
#kmeans
true_k = 4
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
lab = model.predict(X)
#save cluster labels for each sample in a dataframe
df = pd.DataFrame({'words':words, 'cluster':lab})
# check word frequencies for cluster 1 (print so it also shows when run as a script)
print(df[df.cluster == 1].words.value_counts())