KMeans clustering won't work on a dataframe with more than 4 columns

Question

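I asked a similar question earlier and received some useful replies. However, I still have not managed to get KMeans clustering to work on a dataframe with more than 4 columns.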

The dataframe in question has 5 columns, as shown below:

col1,col2,col3,col4,col5
0.54,0.68,0.46,0.98,0.15
0.52,0.44,0.19,0.29,0.44
1.27,1.15,1.32,0.60,0.14
0.88,0.79,0.63,0.58,0.18
1.39,1.15,1.32,0.41,0.44
0.86,0.80,0.65,0.65,0.11
1.68,1.99,3.97,0.16,0.55
0.78,0.63,0.40,0.36,0.10
2.95,2.66,7.11,0.18,0.15
1.44,1.33,1.79,0.24,0.22

I have a simple KMeans clustering script in Python that I am trying to apply to the 5-column dataframe, shown below.

from numpy import unique
from numpy import where
from sklearn.cluster import KMeans
from matplotlib import pyplot
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')

X = np.array(df)

model = KMeans(n_clusters=5)
model.fit(X)
yhat = model.predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], X[row_ix, 2], X[row_ix, 3], X[row_ix, 4])
pyplot.show()

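When I run the code, it complains about the line pyplot.scatter(X[row_ix, 0], X[row_ix, 1], X[row_ix, 2], X[row_ix, 3], X[row_ix, 4]) with the error message 'ValueError: Unrecognized marker style [[0.14 0.44 0.22]]'. However, if I drop the 5th column (i.e. col5) from the dataframe and remove X[row_ix, 4] from the code, the clustering works.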

What do I need to do to get KMeans working on my sample dataframe?

[Update: 2 or 3 dimensions at a time]

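Following the earlier post, it was suggested that I could split the task by plotting 2 or 3 dimensions at a time with the function below. However, the function does not produce the expected clustering output (see the attached output.png).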

def plot(self):
    

    import itertools
    combinations = itertools.combinations(range(self.K), 2) # generate all combinations of features
    
    fig, axes = plt.subplots(figsize=(12, 8), nrows=len(combinations), ncols=1) # initialise one subplot for each feature combination

    for (x,y), ax in zip(combinations, axes.ravel()): # loop through combinations and subplots
        
        
        for i, index in enumerate(self.clusters):
            point = self.X[index].T
            
            # only get the coordinates for this combination:
            px, py = point[x], point[y]
            ax.scatter(px, py)

        for point in self.centroids:
            
            # only get the coordinates for this combination:
            px, py = point[x], point[y]
            
            ax.scatter(px, py, marker="x", color='black', linewidth=2)

        ax.set_title('feature {} vs feature {}'.format(x,y))
    plt.show()

How can I fix the function above to get the clustering output?

Answer 1

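Your KMeans works; it is the way you are trying to display the results that is wrong. If you look at the documentation for matplotlib's scatter function (https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.scatter.html), you will see that the first four parameters (x, y, s, c) accept array-likes, whereas the fifth (marker) accepts only a 'MarkerStyle'. That is why the error appears only once you add the fifth argument. In effect, you are trying to plot a 5-dimensional dataset on a 2-dimensional plane, which is not possible without first reducing the dimensionality. PCA or PLSDA could be good options for reducing the dimensionality of your dataset.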
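To make the point about positional arguments concrete, here is a minimal sketch. It assumes the ten sample rows from the question are embedded inline instead of being read from data.csv, and it uses n_clusters=3, random_state=0, the Agg backend, and the output file name purely as illustrative choices. KMeans is fitted on all 5 columns; only x and y are passed to scatter positionally, and the cluster labels go in through the c keyword, so nothing is ever interpreted as a marker.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# The ten sample rows from the question (assumption: stands in for data.csv)
X = np.array([
    [0.54, 0.68, 0.46, 0.98, 0.15],
    [0.52, 0.44, 0.19, 0.29, 0.44],
    [1.27, 1.15, 1.32, 0.60, 0.14],
    [0.88, 0.79, 0.63, 0.58, 0.18],
    [1.39, 1.15, 1.32, 0.41, 0.44],
    [0.86, 0.80, 0.65, 0.65, 0.11],
    [1.68, 1.99, 3.97, 0.16, 0.55],
    [0.78, 0.63, 0.40, 0.36, 0.10],
    [2.95, 2.66, 7.11, 0.18, 0.15],
    [1.44, 1.33, 1.79, 0.24, 0.22],
])

# n_clusters=3 and random_state=0 are illustrative, not taken from the question
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X)  # the clustering itself uses all 5 columns

fig, ax = plt.subplots()
# scatter's positional order is x, y, s, c, marker: pass only x and y
# positionally and supply the colour (cluster label) by keyword.
points = ax.scatter(X[:, 0], X[:, 1], c=labels)
ax.set_xlabel("col1")
ax.set_ylabel("col2")
fig.savefig("clusters_col1_col2.png")  # hypothetical output file name
```

The plot shows only col1 against col2, so clusters that are separated mainly along the other three columns may overlap in this view.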

Answer 2

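As noted in the other answer and the comments, you cannot plot all 5 axes together. One approach is to use dimensionality reduction, for example PCA, to bring the data down to 2 dimensions and plot that: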

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

df = pd.read_csv('test.csv')

# Cluster on all 5 columns
model = KMeans(n_clusters=5)
model.fit(df)
yhat = model.predict(df)
clusters = np.unique(yhat)

# Project the 5 columns onto the first two principal components for plotting
dims = PCA(n_components=2).fit_transform(df)
dims = pd.DataFrame(dims, columns=['PC1', 'PC2'])

fig, ax = plt.subplots(1, 1)
for cluster in clusters:
    ix = yhat == cluster  # boolean mask for the rows in this cluster
    ax.scatter(x=dims.loc[ix, 'PC1'], y=dims.loc[ix, 'PC2'], label=cluster)
ax.legend()
plt.show()
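Before trusting a 2-D PCA plot, it can help to check how much of the variance the two components actually retain. The following is a minimal sketch, assuming the ten sample rows from the question are embedded inline rather than read from a CSV file; it relies on scikit-learn's explained_variance_ratio_ attribute of a fitted PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

# The ten sample rows from the question (assumption: stands in for the CSV file)
X = np.array([
    [0.54, 0.68, 0.46, 0.98, 0.15],
    [0.52, 0.44, 0.19, 0.29, 0.44],
    [1.27, 1.15, 1.32, 0.60, 0.14],
    [0.88, 0.79, 0.63, 0.58, 0.18],
    [1.39, 1.15, 1.32, 0.41, 0.44],
    [0.86, 0.80, 0.65, 0.65, 0.11],
    [1.68, 1.99, 3.97, 0.16, 0.55],
    [0.78, 0.63, 0.40, 0.36, 0.10],
    [2.95, 2.66, 7.11, 0.18, 0.15],
    [1.44, 1.33, 1.79, 0.24, 0.22],
])

pca = PCA(n_components=2)
dims = pca.fit_transform(X)  # shape (10, 2): one 2-D point per row

# Fraction of the total variance carried by each principal component
ratios = pca.explained_variance_ratio_
print("per component:", ratios)
print("retained in 2-D:", ratios.sum())
```

If the retained fraction is low, points that look close together in the PC1/PC2 plot may still be far apart in the original 5 columns.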

Alternatively, you can use seaborn and visualize all the variables pairwise, which is manageable when you have only 5 variables:

import seaborn as sns
df['cluster'] = yhat
sns.pairplot(data=df,hue='cluster',diag_kind=None)
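If you would rather keep a hand-rolled pairwise plot along the lines of the function in the question's update, here is a minimal sketch. The helper name plot_feature_pairs and its arguments are hypothetical, as are n_clusters=3, random_state=0, the Agg backend, and the output file name. Two things differ from the question's function: the pairs are built over the feature columns (assuming self.K there is the number of clusters, which is not stated), and they are materialised as a list, because len() cannot be called on the iterator that itertools.combinations returns.

```python
import itertools
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans


def plot_feature_pairs(X, labels, centroids):
    """Hypothetical helper: draw one subplot per pair of feature columns."""
    n_features = X.shape[1]
    # A list (not a bare iterator) so that len() works below
    pairs = list(itertools.combinations(range(n_features), 2))
    fig, axes = plt.subplots(nrows=len(pairs), ncols=1,
                             figsize=(6, 2.5 * len(pairs)), squeeze=False)
    for (x, y), ax in zip(pairs, axes.ravel()):
        # One scatter call per cluster so each cluster gets its own colour
        for cluster in np.unique(labels):
            mask = labels == cluster
            ax.scatter(X[mask, x], X[mask, y], label=f"cluster {cluster}")
        # Centroids restricted to the same two feature columns
        ax.scatter(centroids[:, x], centroids[:, y],
                   marker="x", color="black", linewidths=2)
        ax.set_title(f"feature {x} vs feature {y}")
    fig.tight_layout()
    return fig, pairs


# The ten sample rows from the question (assumption: stands in for the CSV file)
X = np.array([
    [0.54, 0.68, 0.46, 0.98, 0.15],
    [0.52, 0.44, 0.19, 0.29, 0.44],
    [1.27, 1.15, 1.32, 0.60, 0.14],
    [0.88, 0.79, 0.63, 0.58, 0.18],
    [1.39, 1.15, 1.32, 0.41, 0.44],
    [0.86, 0.80, 0.65, 0.65, 0.11],
    [1.68, 1.99, 3.97, 0.16, 0.55],
    [0.78, 0.63, 0.40, 0.36, 0.10],
    [2.95, 2.66, 7.11, 0.18, 0.15],
    [1.44, 1.33, 1.79, 0.24, 0.22],
])

# n_clusters=3 and random_state=0 are illustrative, not taken from the question
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
fig, pairs = plot_feature_pairs(X, model.labels_, model.cluster_centers_)
fig.savefig("feature_pairs.png")  # hypothetical output file name
```

With 5 columns this yields 10 subplots, one for each pair of features, which is the same information seaborn's pairplot lays out in a grid.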
