Visualise clusters with k-means

Question

I have the following dataset:

    Date    Text
0   05/26/2020  è morto all'improvviso jk, aveva...
1   05/26/2020  è morto a 51 anni jk, attore, co...
2   05/26/2020  aveva 51 anni e si trovava in Italia. il rico...
3   05/26/2020  arriva a milano nel 1990 per una serie di conc...
4   05/26/2020  jk, l'attore e comico, e...
5   05/26/2020  spettacolo.it ha appreso che jk, l'...
6   05/26/2020  e' morto all'improvviso jk. cant...
7   05/26/2020  addio a jk . una morte improvvis...
8   05/26/2020  lutto nel mondo della televisione. è morto a 5...
9   05/26/2020  è morto all'età di 51 anni ...
10  05/26/2020  è morto all'età di 51 anni ...
11  05/26/2020  all'improvviso se ne è andato  ...
12  05/26/2020  è andato al supermercato  ...
13  05/26/2020  jk è morto improvvisamente a 51 ...
14  05/26/2020  è morto, a menfi, il 51enne jk...
15  05/26/2020  muore a cinquantuno anni jk, il ...

I would like to use clustering (k-means) to create labels for classifying the texts. I did the following:

import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
from nltk.corpus import stopwords

stop_words = stopwords.words('italian')

def preprocessing(line):
    line = re.sub(r"[^a-zA-Z]", " ", line.lower())
    words = word_tokenize(line)
    words_lemmed = [WordNetLemmatizer().lemmatize(w) for w in words if w not in stop_words]
    return words_lemmed


vect = TfidfVectorizer(tokenizer=preprocessing)
vectorized_text = vect.fit_transform(df['Text'])
kmeans = KMeans(n_clusters=2).fit(vectorized_text)

Then:

import pandas as pd

cl = kmeans.predict(vectorized_text)
df['Cluster'] = pd.Series(cl, index=df.index)
df.groupby("Cluster").count()

I would like to know how to visualise the results. I tried the following:

plt.scatter(vectorized_text, cl)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red')
plt.show()

But I get this error:

ValueError: x and y must be the same size

The error comes from plt.scatter(vectorized_text, cl), so something is wrong there. Looking at possible solutions on the web, I found some that use PCA. Should I consider it?

Thanks

Update: after receiving the answer below, I tried:

plt.scatter(vectorized_text[:, 0] ,cl)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red')
plt.show()

Unfortunately, I still get the error:

ValueError: x and y must be the same size

Answer 1

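The x argument of plt.scatter() must have shape (n,), which is not the case here. You can only select one column of vectorized_text for the scatter plot, not all of them. Right now x has shape 209x1245, while y has shape (209,).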
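The mismatch is easy to confirm by printing both shapes. The following is a minimal sketch on a hypothetical four-line corpus standing in for df['Text'] (so the numbers differ from 209x1245), using the default TfidfVectorizer tokenizer instead of the custom preprocessing function:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; an assumption standing in for df['Text'].
texts = [
    "è morto all'improvviso",
    "è morto a 51 anni",
    "arriva a milano nel 1990",
    "è andato al supermercato",
]

vectorized_text = TfidfVectorizer().fit_transform(texts)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectorized_text)
cl = kmeans.predict(vectorized_text)

# x is a 2-D sparse matrix: one row per document, one column per term.
print(vectorized_text.shape)
# y is 1-D: one cluster label per document.
print(cl.shape)
```

Only the number of rows matches the length of cl, which is why a single column has to be picked before plotting.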

How do you convert `vectorized_text` to a 1-D array?

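Spoiler: you can't! You first need to slice one column out of it, then convert that column to a dense matrix (it is currently a sparse matrix), and then convert it to an array.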

Suppose you want to plot the first column of vectorized_text. The x you need to pass to plt.scatter is:

np.asarray(vectorized_text[:, 0].todense())
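Put together, a plot along these lines might look like the sketch below. It rests on several assumptions: the corpus is a hypothetical toy one, the default tokenizer replaces the custom preprocessing function, `.ravel()` is added to flatten the (n, 1) column to shape (n,), and the output filename is made up. The second panel is not part of this answer: it uses TruncatedSVD, which accepts sparse input directly, as a stand-in for the PCA idea raised in the question, so that points and cluster centres share the same two axes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; an assumption standing in for df['Text'].
texts = [
    "è morto all'improvviso",
    "è morto a 51 anni",
    "è morto all'età di 51 anni",
    "arriva a milano nel 1990",
    "è andato al supermercato",
    "lutto nel mondo della televisione",
]

vectorized_text = TfidfVectorizer().fit_transform(texts)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectorized_text)
cl = kmeans.predict(vectorized_text)

# Slice one column, densify it, and flatten it to shape (n,).
x = np.asarray(vectorized_text[:, 0].todense()).ravel()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left: TF-IDF weight of the first term against the cluster label.
ax1.scatter(x, cl)
ax1.set_xlabel("TF-IDF weight of term 0")
ax1.set_ylabel("cluster label")

# Right: project documents and centres onto the same two components.
svd = TruncatedSVD(n_components=2, random_state=0)
points = svd.fit_transform(vectorized_text)
centers = svd.transform(kmeans.cluster_centers_)
ax2.scatter(points[:, 0], points[:, 1], c=cl)
ax2.scatter(centers[:, 0], centers[:, 1], s=300, c="red")
ax2.set_xlabel("component 1")
ax2.set_ylabel("component 2")

fig.savefig("clusters.png")  # hypothetical output path
```

A single TF-IDF column says little about how the clusters separate, so the two-component view is usually the more informative of the two panels.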
