K-Means R vs K-Means Python: different cluster values generating different bar graphs

Question

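Below are two sets of code that do the same thing, one in Python and the other in R. Both plot K-means clusters over a PCA projection, but when I draw the bar chart of the cluster centers at the end, the centers and the graphs are completely different. I think something is wrong with the K-means and cluster calculation in Python. The R code is the original. I would like to understand why the bar chart in Python does not match; I believe the cluster centers are the cause. Please take a look and provide some feedback.

Please use the link below to download the dataset I used to generate these charts.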

https://www.dropbox.com/s/fhnxxrjl07y0h2c/TableStats2.csv?dl=0

R code

## Retrieve libraries needed for script
library("ggplot2")
library("reshape2")
  

# Forward slashes: a single backslash in an R string is read as an escape
pcp <- read.csv(file='E:/ProgramData/R/Code/TableStats2.csv')

#Label each row with table Name to Plot names on chart.
data <- pcp
rownames(data) <- data[, 1]


#Gather all the data and leave out Table Names
data <- data[, -1]
data <- data[, -1]

#Create the PCA (Principal Component Analysis)
data <- scale(data)
pca <- prcomp(data)

plot.data <- data.frame(pca$x[, 1:2])

set.seed(2121)
clusters <- kmeans(data, 6)
plot.data$clusters <- factor(clusters$cluster)

g <- ggplot(plot.data, aes(x = PC1, y = PC2, colour = clusters)) +
  geom_point(size = 3.5) +
  geom_text(label = rownames(data), colour = "darkgrey", hjust = .7) +
  theme_bw()

behaviours <- data.frame(clusters$centers)
behaviours$cluster <- 1:6
behavious <- melt(behaviours, "cluster")

g2 <- ggplot(behavious, aes(x = variable, y = value)) +
  geom_bar(stat = "identity", position = 'identity', fill = "steelblue") +
  facet_wrap(~cluster) +
  theme_grey() +
  theme(axis.text.x = element_text(angle = 90))

Python code

import pandas as pd
import seaborn as sns  # used for the scatter plots below
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
from plotnine import ggplot, aes, geom_line, geom_bar, facet_wrap, theme_grey, theme, element_text

TableStats = pd.read_csv(r'E:\ProgramData\R\Code\TableStats2.csv')

sc = StandardScaler()
pca = PCA()
tables = TableStats.iloc[:,0]
y = tables

features = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)','Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)']
# Separating out the features
x = TableStats.loc[:, features].values

x = sc.fit_transform(x)
dpca = pca.fit_transform(x)
x1 = dpca[:,0]
y1 = dpca[:,1]
plt.figure(figsize=(20,11))
plot = plt.scatter(x1,y1, c=y.index.tolist())
for i, label in enumerate(y):
  #print(label)
  plt.annotate(label,(x1[i], y1[i]))
plt.show()

df = pd.DataFrame(dpca,columns = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)','Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)']) 

clusters = KMeans(n_clusters=6,init='k-means++', random_state=2121).fit(df)

df['Cluster'] = clusters.labels_
df['Cluster Centroid D1'] = df['Cluster'].apply(lambda label: clusters.cluster_centers_[label][0])
df['Cluster Centroid D2'] = df['Cluster'].apply(lambda label: clusters.cluster_centers_[label][1])
df['tables'] = tables

#print Table Names
plt.figure(figsize=(20, 11))
ax = sns.scatterplot(data=df, x=x1, y=y1, hue='Cluster', s=200, palette='coolwarm', legend=True)
ax = sns.scatterplot(data=df, x="Cluster Centroid D1", y="Cluster Centroid D2", hue='Cluster', s=1000, palette='coolwarm', legend=False, alpha=0.1)
for line in range(0,df.shape[0]):
    ax.text(x1[line]+0.05, y1[line],TableStats['Object Name'][line], horizontalalignment='left',size='medium', color='black',weight='semibold')

plt.legend(loc='upper right', title='Cluster')
ax.set_title("Clustered Points", fontsize='xx-large', y=1.05);
plt.show()

# Here is where the R and Python graphs differ, because the cluster centers don't match
behaviours = pd.DataFrame(clusters.cluster_centers_)
behaviours.columns = clusters.feature_names_in_
behaviours['cluster'] = [1,2,3,4,5,6]

b2 = pd.melt(behaviours, id_vars = "cluster",value_name="value")

(ggplot(b2, aes(x = 'variable', y = 'value')) + 
geom_bar(stat = "identity", position = 'identity', fill = "steelblue") + 
facet_wrap('~cluster') + 
theme_grey() + 
theme(axis_text_x = element_text(rotation = 90, hjust=1), figure_size=(20,8)) 
)

Answer 1

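Updated now that I can run this in both R and Python.

Looking at this specific problem, check the output of the PCA: the two outputs differ, so the k-means results will not be the same either. The cause is in your R code. You repeat the line data <- data[, -1], which removes the table names and then also the first column of data. Remove the extra line and the clusters look the same.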
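To illustrate why the duplicated `data <- data[, -1]` matters, here is a minimal sketch on synthetic data. The random matrix is only a stand-in for the seven feature columns, not the real TableStats2.csv, and `first_two_pcs` is a hypothetical helper. It compares the first two principal components with and without the first feature column.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the seven feature columns (an assumption, not the real data)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 7))


def first_two_pcs(features):
    """Scale the features, run PCA, and return the first two component scores."""
    scaled = StandardScaler().fit_transform(features)
    return PCA().fit_transform(scaled)[:, :2]


pcs_all = first_two_pcs(X)             # all seven features, as in the Python code
pcs_dropped = first_two_pcs(X[:, 1:])  # first feature removed, as the extra R line does

# Compare absolute values so that a mere sign flip of a component is ignored
print(np.allclose(np.abs(pcs_all), np.abs(pcs_dropped)))
```

If the two projections differ here, any clustering built on top of them can be expected to differ as well, which is the situation described above.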

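General comments on the R and Python kmeans implementations

In general, R and Python use different algorithms by default. R uses "Hartigan-Wong" by default, while Python's scikit-learn presumably uses "elkan". Set algorithm='Lloyd' in R and algorithm='full' in Python (which I believe currently also runs Lloyd's algorithm) so that both are at least attempting the same thing. In scikit-learn 1.1 and later, the same option is spelled algorithm='lloyd'.

You also have different initialization methods: R uses random initialization, while in Python you are using 'k-means++'. Set init='random' in Python to make these match.

They have different maximum numbers of iterations: R defaults to 10, Python to 300. Set these to be equal as well.

Finally, remove random_state from your Python KMeans call if you have set it (and check that you are not calling set.seed in R either).

Once that is done, try running both several times and compare the distribution of values. Hopefully you will see overlap between the two implementations.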

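See the documentation for the R implementation and the scikit-learn implementation.

One last point: kmeans is unsupervised. The class labels have no absolute meaning. Run the code several times and class 0 will not always be assigned to the same data points, even when the data points are grouped identically.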
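Because the label numbers are arbitrary, two runs are better compared by grouping than by label value. The sketch below uses a hypothetical helper, `same_grouping`, and made-up label lists to check whether two labelings split the points into the same groups regardless of how the groups are numbered.

```python
def same_grouping(labels_a, labels_b):
    """Return True if both labelings partition the points identically,
    ignoring which integer is used to name each cluster."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label in one run must map to exactly one label in the other run
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [2, 2, 0, 0, 1, 1]  # same groups, labels permuted
run_3 = [0, 1, 1, 1, 2, 2]  # second point moved to another group

print(same_grouping(run_1, run_2))  # True
print(same_grouping(run_1, run_3))  # False
```

For the bar charts this means that facet 1 in R need not correspond to facet 1 in Python, even when the clusters themselves agree. For a graded rather than yes/no comparison, `sklearn.metrics.adjusted_rand_score` is another option.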

Here is a reproducible example:

import pandas as pd
from sklearn import cluster, datasets

from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

X, y = datasets.make_blobs(100,2,centers=6)
df = pd.DataFrame(X)

random_states = list(range(0,60,10))
fig, ax = plt.subplots(3,2, figsize=(20,16))
for i, r in enumerate(random_states):

    clusters = KMeans(n_clusters=6,init='k-means++', random_state=r).fit(X)

    df = (df
      .assign(**{
          'Cluster': clusters.labels_,
          'Cluster Centroid D1': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][0]),
          'Cluster Centroid D2': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][1]),
      })
     )
    
    row = i//2
    col = i - row*2
    sns.scatterplot(data=df, x=0, y=1, hue='Cluster', s=200, palette='coolwarm', legend=True, ax=ax[row,col])
    sns.scatterplot(data=df, x="Cluster Centroid D1", y="Cluster Centroid D2", hue='Cluster', s=1000, 
                    palette='coolwarm', legend=False, alpha=0.1, ax=ax[row,col])

Here is the version that uses your data:

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

TableStats = pd.read_csv('TableStats2.csv')

sc = StandardScaler()
pca = PCA()
tables = TableStats.iloc[:,0]
y = tables

features = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)',
            'Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)']

# Separating out the features
x = TableStats.loc[:, features].values

x = sc.fit_transform(x)
dpca = pca.fit_transform(x)
x1 = dpca[:,0]
y1 = dpca[:,1]

random_states = [1,2,3,4,5,6]
for r in random_states:
    df = pd.DataFrame(dpca,columns = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)',
                                      'Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)']) 
    clusters = KMeans(n_clusters=6,init='k-means++', random_state=r).fit(df)

    df = (df
          .assign(**{
              'Cluster': clusters.labels_,
              'Cluster Centroid D1': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][0]),
              'Cluster Centroid D2': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][1]),
          })
         )
    
    plt.figure(figsize=(20, 11))
    ax = sns.scatterplot(data=df, x=x1, y=y1, hue='Cluster', s=200, palette='coolwarm', legend=True)
    ax = sns.scatterplot(data=df, x="Cluster Centroid D1", y="Cluster Centroid D2", hue='Cluster', 
                         s=1000, palette='coolwarm', legend=False, alpha=0.1)    

    plt.legend(loc='upper right', title='Cluster')
    ax.set_title("Clustered Points", fontsize='xx-large', y=1.05);
    plt.show()
