K-Means in R vs. K-Means in Python: different cluster values generate different bar graphs
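Below are two sets of code that do the same thing, one in Python and one in R. Both plot K-means over a PCA, but once I draw the bar chart of the cluster centers at the end, the graphs are completely different. I think something is wrong with the K-means and cluster calculation in Python. The original code is the R version. I would like to understand why the bar charts in Python do not match; I believe the cause is the centers. Please take a look and give me some feedback.
Please use the link below to download the dataset I used to generate these graphs.
https://www.dropbox.com/s/fhnxxrjl07y0h2c/TableStats2.csv?dl=0
R code: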
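## Retrieve libraries needed for the script
library("ggplot2")
library("reshape2")
# Forward slashes avoid R's "unrecognized escape" error for backslashes in strings
pcp <- read.csv(file='E:/ProgramData/R/Code/TableStats2.csv')
# Label each row with the table name so the names can be plotted on the chart.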
data <- pcp
rownames(data) <- data[, 1]
#Gather all the data and leave out Table Names
data <- data[, -1]
data <- data[, -1]
# Create the PCA (Principal Component Analysis)
data <- scale(data)
pca <- prcomp(data)
plot.data <- data.frame(pca$x[, 1:2])
set.seed(2121)
clusters <- kmeans(data, 6)
plot.data$clusters <- factor(clusters$cluster)
g <- ggplot(plot.data, aes(x = PC1, y = PC2, colour = clusters)) +
  geom_point(size = 3.5) +
  geom_text(label = rownames(data), colour = "darkgrey", hjust = .7) +
  theme_bw()
behaviours <- data.frame(clusters$centers)
behaviours$cluster <- 1:6
behavious <- melt(behaviours, "cluster")
g2 <- ggplot(behavious, aes(x = variable, y = value)) +
  geom_bar(stat = "identity", position = 'identity', fill = "steelblue") +
  facet_wrap(~cluster) +
  theme_grey() +
  theme(axis.text.x = element_text(angle = 90))
Python code:
import pandas as pd
import seaborn as sns  # needed for the sns.scatterplot calls further down
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
from plotnine import ggplot, aes, geom_line, geom_bar, facet_wrap, theme_grey, theme, element_text
TableStats = pd.read_csv(r'E:\ProgramData\R\Code\TableStats2.csv')
sc = StandardScaler()
pca = PCA()
tables = TableStats.iloc[:,0]
y = tables
features = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)','Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)']
# Separating out the features
x = TableStats.loc[:, features].values
x = sc.fit_transform(x)
dpca = pca.fit_transform(x)
x1 = dpca[:,0]
y1 = dpca[:,1]
plt.figure(figsize=(20,11))
plot = plt.scatter(x1,y1, c=y.index.tolist())
for i, label in enumerate(y):
    # print(label)
    plt.annotate(label, (x1[i], y1[i]))
plt.show()
df = pd.DataFrame(dpca,columns = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)','Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)'])
clusters = KMeans(n_clusters=6,init='k-means++', random_state=2121).fit(df)
df['Cluster'] = clusters.labels_
df['Cluster Centroid D1'] = df['Cluster'].apply(lambda label: clusters.cluster_centers_[label][0])
df['Cluster Centroid D2'] = df['Cluster'].apply(lambda label: clusters.cluster_centers_[label][1])
df['tables'] = tables
#print Table Names
plt.figure(figsize=(20, 11))
ax = sns.scatterplot(data=df, x=x1, y=y1, hue='Cluster', s=200, palette='coolwarm', legend=True)
ax = sns.scatterplot(data=df, x="Cluster Centroid D1", y="Cluster Centroid D2", hue='Cluster', s=1000, palette='coolwarm', legend=False, alpha=0.1)
for line in range(0, df.shape[0]):
    ax.text(x1[line]+0.05, y1[line], TableStats['Object Name'][line], horizontalalignment='left', size='medium', color='black', weight='semibold')
plt.legend(loc='upper right', title='Cluster')
ax.set_title("Clustered Points", fontsize='xx-large', y=1.05);
plt.show()
# This is where the R and Python graphs differ, because the cluster centers don't match
behaviours = pd.DataFrame(clusters.cluster_centers_)
behaviours.columns = clusters.feature_names_in_
behaviours['cluster'] = [1,2,3,4,5,6]
b2 = pd.melt(behaviours, id_vars = "cluster",value_name="value")
(ggplot(b2, aes(x = 'variable', y = 'value')) +
    geom_bar(stat = "identity", position = 'identity', fill = "steelblue") +
    facet_wrap('~cluster') +
    theme_grey() +
    theme(axis_text_x = element_text(rotation = 90, hjust=1), figure_size=(20,8))
)
Update: it now works for me in both R and Python.
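Looking at this specific problem, check the output of the PCA: the two outputs differ, so the k-means results will not be the same. The cause is in your R code: you repeat the line data <- data[, -1], which removes both the table names and the first column of data. Remove the extra line and the clusters look the same.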
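To see why one extra dropped column matters, here is a minimal NumPy sketch on made-up data (the array `raw` and the helper names are hypothetical stand-ins, not the TableStats2.csv data). It standardizes and projects the same rows twice, once with all numeric columns and once with the first column removed, which mimics the repeated data <- data[, -1]:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the numeric columns of the table (30 rows, 4 features)
raw = rng.normal(size=(30, 4))


def standardize(a):
    # Column-wise z-scores; ddof=1 matches the sample standard deviation
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)


def pca_scores(a):
    # Project centered data onto its principal axes via SVD
    centered = a - a.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt.T


full = pca_scores(standardize(raw))            # all numeric columns kept
dropped = pca_scores(standardize(raw[:, 1:]))  # first numeric column lost

print(full.shape, dropped.shape)
# Compare first-component scores up to sign, since PCA axes can flip
print(np.allclose(np.abs(full[:, 0]), np.abs(dropped[:, 0])))
```

If the two score matrices differ like this for your data, any clustering built on them can be expected to differ as well.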
General comments on the R and Python kmeans implementations
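In general, R and Python use different algorithms by default. R defaults to "Hartigan-Wong", while Python's scikit-learn probably uses "elkan". Set algorithm='Lloyd' in R and algorithm='full' in Python (which I believe currently runs the Lloyd algorithm too; scikit-learn 1.1 and later spell this option 'lloyd') to make sure both are at least attempting the same thing.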
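For orientation, Lloyd's algorithm alternates two steps: assign each point to its nearest center, then move each center to the mean of its points. The following is a minimal NumPy sketch of that idea under simple assumptions (random starting points, a fixed iteration cap). It is not the code used by R or scikit-learn, and the function name lloyd_kmeans is made up for illustration:

```python
import numpy as np


def lloyd_kmeans(X, k, max_iter=10, seed=0):
    """Illustrative Lloyd iteration; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct data points as starting centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment so labels are consistent with the returned centers
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers


# Two well-separated synthetic blobs
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(10, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(10, 2))
X = np.vstack([blob_a, blob_b])
labels, centers = lloyd_kmeans(X, k=2)
print(labels)
```

Hartigan-Wong and Elkan reach their solutions differently, which is one reason the defaults of the two libraries are not directly comparable.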
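You also have different initialization methods: R picks random starting centers, while in Python you are using 'k-means++'. Set init='random' in Python to make these match.
They have different maximum numbers of iterations: R defaults to 10 and Python to 300. Set these to be equal as well.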
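Finally, leave random_state unset in the Python KMeans call (and check that you are not calling set.seed in R either).
Once this is done, try running both several times and compare the distribution of values. Hopefully you will see overlap between the two implementations.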
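Putting those suggestions together on the Python side might look like the sketch below. It uses synthetic blobs rather than your data, picks the Lloyd option name based on the installed scikit-learn version, and sets n_init=1, which is my assumption for mirroring R's default of a single random start (nstart = 1):

```python
import numpy as np
import sklearn
from sklearn.cluster import KMeans

# 'full' was renamed to 'lloyd' in scikit-learn 1.1
major, minor = (int(p) for p in sklearn.__version__.split(".")[:2])
LLOYD = "lloyd" if (major, minor) >= (1, 1) else "full"

# Synthetic data: six Gaussian blobs in two dimensions (not the real dataset)
rng = np.random.default_rng(0)
blob_centers = rng.uniform(-10, 10, size=(6, 2))
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in blob_centers])

inertias = []
for _ in range(10):
    km = KMeans(
        n_clusters=6,
        init="random",    # random starting centers, as in R
        n_init=1,         # assumption: one start per fit, like nstart = 1
        max_iter=10,      # R's default iter.max
        algorithm=LLOYD,  # plain Lloyd iterations
        # random_state deliberately left unset so every fit differs
    ).fit(X)
    inertias.append(km.inertia_)

# Spread of the within-cluster sum of squares across the ten fits
print(min(inertias), max(inertias))
```

On the R side, the matching quantity to collect over repeated runs would be tot.withinss from kmeans(data, 6, algorithm = "Lloyd").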
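See the documentation for the R implementation and the scikit-learn implementation.
One last point: kmeans is unsupervised, so the class labels have no absolute meaning. Run the code several times and class 0 will not always be assigned to the same data points, even when the points are grouped identically.
Here is a reproducible example: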
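Because of this, facet 1 in the R bar chart need not correspond to cluster 1 in Python. One way to compare two runs regardless of label names is to relabel the clusters by order of first appearance. A small sketch with made-up label vectors (canonical_labels is a hypothetical helper, not part of either library):

```python
def canonical_labels(labels):
    """Rename cluster ids by order of first appearance: 0, 1, 2, ..."""
    mapping = {}
    result = []
    for lab in labels:
        if lab not in mapping:
            mapping[lab] = len(mapping)
        result.append(mapping[lab])
    return result


# Same grouping of six points, different arbitrary cluster ids
run_a = [0, 0, 1, 1, 2, 2]
run_b = [2, 2, 0, 0, 1, 1]

print(run_a == run_b)                                      # False
print(canonical_labels(run_a) == canonical_labels(run_b))  # True
```

The same idea applies to the bar charts: match clusters by their members or their centers before comparing panels, not by the cluster number.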
import pandas as pd
from sklearn import datasets
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

# Six synthetic blobs in two dimensions
X, y = datasets.make_blobs(100, 2, centers=6)
df = pd.DataFrame(X)
# Fit with six different seeds and draw each result in its own panel
random_states = list(range(0, 60, 10))
fig, ax = plt.subplots(3, 2, figsize=(20, 16))
for i, r in enumerate(random_states):
    clusters = KMeans(n_clusters=6, init='k-means++', random_state=r).fit(X)
    df = (df
          .assign(**{
              'Cluster': clusters.labels_,
              'Cluster Centroid D1': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][0]),
              'Cluster Centroid D2': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][1]),
          })
          )
    row = i // 2
    col = i - row * 2
    sns.scatterplot(data=df, x=0, y=1, hue='Cluster', s=200, palette='coolwarm', legend=True, ax=ax[row, col])
    sns.scatterplot(data=df, x="Cluster Centroid D1", y="Cluster Centroid D2", hue='Cluster', s=1000,
                    palette='coolwarm', legend=False, alpha=0.1, ax=ax[row, col])
plt.show()
Here is a version that uses your data:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

TableStats = pd.read_csv('TableStats2.csv')
sc = StandardScaler()
pca = PCA()
tables = TableStats.iloc[:, 0]
y = tables
features = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)',
            'Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)']
# Separating out the features
x = TableStats.loc[:, features].values
x = sc.fit_transform(x)
dpca = pca.fit_transform(x)
x1 = dpca[:, 0]
y1 = dpca[:, 1]
# Refit with several seeds; each seed gets its own figure
random_states = [1, 2, 3, 4, 5, 6]
for r in random_states:
    df = pd.DataFrame(dpca, columns=['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)',
                                     'Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)'])
    clusters = KMeans(n_clusters=6, init='k-means++', random_state=r).fit(df)
    df = (df
          .assign(**{
              'Cluster': clusters.labels_,
              'Cluster Centroid D1': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][0]),
              'Cluster Centroid D2': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][1]),
          })
          )
    plt.figure(figsize=(20, 11))
    ax = sns.scatterplot(data=df, x=x1, y=y1, hue='Cluster', s=200, palette='coolwarm', legend=True)
    ax = sns.scatterplot(data=df, x="Cluster Centroid D1", y="Cluster Centroid D2", hue='Cluster',
                         s=1000, palette='coolwarm', legend=False, alpha=0.1)
    plt.legend(loc='upper right', title='Cluster')
    ax.set_title("Clustered Points", fontsize='xx-large', y=1.05)
    plt.show()