使用不同颜色和标签的聚类
Cluster using different colours and labels
我正在研究文本聚类。我需要使用不同的颜色绘制数据。
我使用 kmeans
聚类方法和 tf-idf
相似性方法。
kmeans_labels =KMeans(n_clusters=3).fit(vectorized_text).labels_
pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(X_train['Sentences']).todense()
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
plt.scatter(data2D[:,0], data2D[:,1])
kmeans.fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
labels=np.array([kmeans.labels_])
目前,我的输出如下: 有一些元素是测试。
我需要添加标签(它们是字符串)并按簇区分点:每个簇都应该有自己的颜色以使 reader 易于分析图表。
能否请您告诉我如何更改我的代码以包含标签和颜色?我认为任何例子都会很棒。
我的数据集样本是(上面的输出是从不同的样本生成的):
句子
Where do we do list them? ...
Make me a list of the things we would need and I'll take you into town. ...
Do you have a list yet? ...
The first was a list for Howie. ...
You're not on my list tonight. ...
I'm gonna print this list on my computer, given you're always bellyaching about my writing.
使用您的代码尝试以下操作:
kmeans_labels =KMeans(n_clusters=3).fit(vectorized_text).labels_
pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(X_train['Sentences']).todense()
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
kmeans.fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
group = kmeans.labels_
cdict = {0: 'red', 1: 'blue', 2: 'green'}
ldict = {0: 'label_1', 1: 'label_2', 2: 'label_3'}
fig, ax = plt.subplots()
for g in np.unique(group):
ix = np.where(group == g)
ax.scatter(data2D[:,0][ix], data2D[:,1][ix], c=cdict[g], label=ldict[g], s=100)
ax.legend()
plt.show()
我假设你的 kmeans
有 n_clusters=3
。 cdict
和 ldict
需要根据簇数进行相应设置。在这种情况下,集群 0 将是红色的,标签为 label_1
,集群 1 将是蓝色的,标签为 label_2
,依此类推。
编辑:我将 cdict
的键更改为从 0 开始。
编辑 2:添加标签。
我们可以使用示例数据集:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
newsgroups = fetch_20newsgroups(subset='train',
categories=['talk.religion.misc','sci.space', 'misc.forsale'])
X_train = newsgroups.data
y_train = newsgroups.target
pipeline = Pipeline([('tfidf', TfidfVectorizer(max_features=5000))])
X = pipeline.fit_transform(X_train).todense()
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
像你一样做 KMeans,获取集群和中心,所以只需为集群添加一个名称:
kmeans =KMeans(n_clusters=3).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
labels=kmeans.labels_
cluster_name = ["Cluster"+str(i) for i in set(labels)]
您可以通过向 "c="
提供集群并从 cm 调用颜色映射或定义您自己的映射来添加颜色:
plt.scatter(data2D[:,0], data2D[:,1],c=labels,cmap='Set3',alpha=0.7)
for i, txt in enumerate(cluster_name):
plt.text(centers2D[i,0], centers2D[i,1],s=txt,ha="center",va="center")
你也可以考虑用seaborn:
sns.scatterplot(data2D[:,0], data2D[:, 1], hue=labels, legend='full',palette="Set1")
我正在研究文本聚类。我需要使用不同的颜色绘制数据。
我使用 kmeans
聚类方法和 tf-idf
相似性方法。
kmeans_labels =KMeans(n_clusters=3).fit(vectorized_text).labels_
pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(X_train['Sentences']).todense()
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
plt.scatter(data2D[:,0], data2D[:,1])
kmeans.fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
labels=np.array([kmeans.labels_])
目前,我的输出如下:
能否请您告诉我如何更改我的代码以包含标签和颜色?我认为任何例子都会很棒。
我的数据集样本是(上面的输出是从不同的样本生成的):
句子
Where do we do list them? ...
Make me a list of the things we would need and I'll take you into town. ...
Do you have a list yet? ...
The first was a list for Howie. ...
You're not on my list tonight. ...
I'm gonna print this list on my computer, given you're always bellyaching about my writing.
使用您的代码尝试以下操作:
kmeans_labels =KMeans(n_clusters=3).fit(vectorized_text).labels_
pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(X_train['Sentences']).todense()
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
kmeans.fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
group = kmeans.labels_
cdict = {0: 'red', 1: 'blue', 2: 'green'}
ldict = {0: 'label_1', 1: 'label_2', 2: 'label_3'}
fig, ax = plt.subplots()
for g in np.unique(group):
ix = np.where(group == g)
ax.scatter(data2D[:,0][ix], data2D[:,1][ix], c=cdict[g], label=ldict[g], s=100)
ax.legend()
plt.show()
我假设你的 kmeans
有 n_clusters=3
。 cdict
和 ldict
需要根据簇数进行相应设置。在这种情况下,集群 0 将是红色的,标签为 label_1
,集群 1 将是蓝色的,标签为 label_2
,依此类推。
编辑:我将 cdict
的键更改为从 0 开始。
编辑 2:添加标签。
我们可以使用示例数据集:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
newsgroups = fetch_20newsgroups(subset='train',
categories=['talk.religion.misc','sci.space', 'misc.forsale'])
X_train = newsgroups.data
y_train = newsgroups.target
pipeline = Pipeline([('tfidf', TfidfVectorizer(max_features=5000))])
X = pipeline.fit_transform(X_train).todense()
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
像你一样做 KMeans,获取集群和中心,所以只需为集群添加一个名称:
kmeans =KMeans(n_clusters=3).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
labels=kmeans.labels_
cluster_name = ["Cluster"+str(i) for i in set(labels)]
您可以通过向 "c="
提供集群并从 cm 调用颜色映射或定义您自己的映射来添加颜色:
plt.scatter(data2D[:,0], data2D[:,1],c=labels,cmap='Set3',alpha=0.7)
for i, txt in enumerate(cluster_name):
plt.text(centers2D[i,0], centers2D[i,1],s=txt,ha="center",va="center")
你也可以考虑用seaborn:
sns.scatterplot(data2D[:,0], data2D[:, 1], hue=labels, legend='full',palette="Set1")