Plot centroids in K-Means using TF-IDF
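I'm writing code to cluster text with KMeans and everything works, but I can't plot the centroids together with the points. I don't know how to use matplotlib, only seaborn, and all I have are the vectors created by TF-IDF.
MiniBatchKMeans has the attribute cluster_centers_, but I can't get it to show up in the plot.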
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
df_abstracts = df_cleared['abstract'].tolist() # list with 33,000 lines of strings
tfidf = TfidfVectorizer(max_features=2**12, ngram_range=(1,4), stop_words = 'english')
vextorized = tfidf.fit_transform(df_abstracts)
# For the plot, reduce the dense TF-IDF matrix with PCA; only the first two components are plotted.
from sklearn.decomposition import PCA
pca = PCA(n_components = 9)
X_pca = pca.fit_transform(vextorized.toarray())
from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(init='k-means++', n_clusters=4, max_iter=500, n_init=10,
                         random_state=9)
y_pred = kmeans.fit_predict(vextorized)
np.unique(y_pred)
palette = sns.color_palette('bright', len(set(y_pred)))
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=y_pred, legend='full', palette=palette)
plt.title('Clustered')
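You ran k-means on the original TF-IDF data, so the centers live in the original feature space. To get them into PCA space, you need to transform them with the same fitted PCA that you used for the points.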
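To make that concrete, here is a minimal sketch of what the projection does to the centers. The random arrays `X` and `centers` are made-up stand-ins for the dense TF-IDF matrix and `kmeans.cluster_centers_`, and it assumes PCA is used with the default `whiten=False`:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))        # stand-in for the dense TF-IDF matrix
centers = rng.normal(size=(4, 6))   # stand-in for kmeans.cluster_centers_

pca = PCA(n_components=2).fit(X)

# With whiten=False, transform subtracts the per-feature mean learned
# during fit and then projects onto the principal axes.
manual = (centers - pca.mean_) @ pca.components_.T
via_transform = pca.transform(centers)

print(np.allclose(manual, via_transform))
```

The point is that the centers must go through the PCA object that was fitted on the data. A PCA fitted on the centers alone would give different axes.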
I'll use an example dataset:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
categories = ['rec.sport.baseball', 'sci.electronics',
              'comp.os.ms-windows.misc', 'talk.politics.misc']
newsgroups = fetch_20newsgroups(subset='train',
                                categories=categories)
X_train = newsgroups.data
y_train = newsgroups.target
tfidf = TfidfVectorizer(max_features=2**12, ngram_range=(1,4), stop_words = 'english')
vextorized = tfidf.fit_transform(X_train)
When you run the PCA, keep the fitted object so that it can be used later to project the k-means centers:
pca = PCA(n_components = 9).fit(vextorized.toarray())
X_pca = pca.transform(vextorized.toarray())
This is what the data looks like with the actual labels:
labels = [newsgroups.target_names[i] for i in y_train]
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=labels, legend='full', palette="Set2")
Now k-means:
kmeans = MiniBatchKMeans(init='k-means++', n_clusters=4, max_iter=500, n_init=10,
                         random_state=777)
y_pred = kmeans.fit_predict(vextorized)
palette = sns.color_palette('bright', len(set(y_pred)))
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=y_pred, legend='full', palette=palette)
plt.title('Clustered')
We project the centers onto the first two principal components and plot them:
centers_on_PCs = pca.transform(kmeans.cluster_centers_)
plt.scatter(x=centers_on_PCs[:,0],y=centers_on_PCs[:,1],s=200,c="k",marker="X")
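If the points and the centers end up in separate figures (for example when the calls run in different notebook cells), one option is to draw both on the same Axes explicitly. The following is only a sketch of that idea: the six-document corpus, the two-cluster setup, and the output file name `clusters.png` are made up for illustration and are not from the question.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus, only for illustration
docs = [
    "the pitcher threw a fast ball",
    "home run in the ninth inning",
    "the batter hit the ball",
    "resistor and capacitor in the circuit",
    "voltage across the resistor",
    "solder the circuit board",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
X_dense = X.toarray()

# Keep the fitted PCA so the same projection can be applied to the centers
pca = PCA(n_components=2).fit(X_dense)
X_2d = pca.transform(X_dense)

# Cluster in the original TF-IDF space, as in the question
kmeans = MiniBatchKMeans(n_clusters=2, n_init=10, random_state=0)
y_pred = kmeans.fit_predict(X)
centers_2d = pca.transform(kmeans.cluster_centers_)

# Draw the points and the centers on one shared Axes
fig, ax = plt.subplots()
palette = sns.color_palette("bright", len(set(y_pred)))
sns.scatterplot(x=X_2d[:, 0], y=X_2d[:, 1], hue=y_pred,
                legend="full", palette=palette, ax=ax)
ax.scatter(centers_2d[:, 0], centers_2d[:, 1], s=200, c="k", marker="X")
ax.set_title("Clustered")
fig.savefig("clusters.png")  # or plt.show() in an interactive session
```

Passing `ax=ax` to seaborn and then calling `ax.scatter` keeps everything in one figure, so you only need matplotlib for that last call.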