为什么 kmeans 每次都给出完全相同的结果?
Why does kmeans give exactly the same results everytime?
我重新运行 kmeans 4次并得到
根据其他答案,我明白了
Everytime K-Means initializes the centroid, it is generated randomly.
能否请您解释一下为什么每次的结果都完全一样?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%config InlineBackend.figure_format = 'svg' # Change the image format to svg for better quality
don = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/donclassif.txt.gz', sep=';')
fig, ax = plt.subplots(nrows=2, ncols=2, figsize= 2 * np.array(plt.rcParams['figure.figsize']))
for row in ax:
for col in row:
kmeans = KMeans(n_clusters = 4)
kmeans.fit(don)
y_kmeans = kmeans.predict(don)
col.scatter(don['V1'], don['V2'], c = y_kmeans, cmap = 'viridis')
centers = kmeans.cluster_centers_
col.scatter(centers[:, 0], centers[:, 1], c = 'red', s = 200, alpha = 0.5);
plt.show()
我 post @AEF 的评论将此问题从未回答列表中删除。
Random initialziation does not necessarily mean random result. Easiest example: k-means with k=1 always finds the mean in one step, regardless of where the center is initialised.
它们不一样。他们很相似。 K-means 是一种迭代移动质心的算法,以便它们在拆分数据时变得越来越好,虽然这个过程是确定性的,但您必须为这些质心选择初始值,这通常是随机完成的。随机开始,并不意味着最终的质心是随机的。他们会收敛到比较好的东西,而且往往是相似的。
通过这个简单的修改查看您的代码:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%config InlineBackend.figure_format = 'svg' # Change the image format to svg for better quality
don = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/donclassif.txt.gz', sep=';')
fig, ax = plt.subplots(nrows=2, ncols=2, figsize= 2 * np.array(plt.rcParams['figure.figsize']))
cc = []
for row in ax:
for col in row:
kmeans = KMeans(n_clusters = 4)
kmeans.fit(don)
cc.append(kmeans.cluster_centers_)
y_kmeans = kmeans.predict(don)
col.scatter(don['V1'], don['V2'], c = y_kmeans, cmap = 'viridis')
centers = kmeans.cluster_centers_
col.scatter(centers[:, 0], centers[:, 1], c = 'red', s = 200, alpha = 0.5);
plt.show()
cc
如果您查看这些质心的确切值,它们将如下所示:
[array([[ 4.97975722, 4.93316461],
[ 5.21715504, -0.18757547],
[ 0.31141141, 0.06726803],
[ 0.00747797, 5.00534801]]),
array([[ 5.21374245, -0.18608103],
[ 0.00747797, 5.00534801],
[ 0.30592308, 0.06549162],
[ 4.97975722, 4.93316461]]),
array([[ 0.30066361, 0.06804847],
[ 4.97975722, 4.93316461],
[ 5.21017831, -0.18735444],
[ 0.00747797, 5.00534801]]),
array([[ 5.21374245, -0.18608103],
[ 4.97975722, 4.93316461],
[ 0.00747797, 5.00534801],
[ 0.30592308, 0.06549162]])]
相似但不同的值集。
还有:
查看 KMeans 的默认参数。有一个叫做 n_init:
Number of time the k-means algorithm will be run with different
centroid seeds. The final results will be the best output of
n_init consecutive runs in terms of inertia.
默认情况下它等于 10。这意味着每次您 运行 k-means 它实际上 运行 10 次并选择最好的结果。这些最佳结果将比 k-means 中单个 运行 的结果更加相似。
我重新运行 kmeans 4次并得到
根据其他答案,我明白了
Everytime K-Means initializes the centroid, it is generated randomly.
能否请您解释一下为什么每次的结果都完全一样?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%config InlineBackend.figure_format = 'svg' # Change the image format to svg for better quality
don = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/donclassif.txt.gz', sep=';')
fig, ax = plt.subplots(nrows=2, ncols=2, figsize= 2 * np.array(plt.rcParams['figure.figsize']))
for row in ax:
for col in row:
kmeans = KMeans(n_clusters = 4)
kmeans.fit(don)
y_kmeans = kmeans.predict(don)
col.scatter(don['V1'], don['V2'], c = y_kmeans, cmap = 'viridis')
centers = kmeans.cluster_centers_
col.scatter(centers[:, 0], centers[:, 1], c = 'red', s = 200, alpha = 0.5);
plt.show()
我 post @AEF 的评论将此问题从未回答列表中删除。
Random initialziation does not necessarily mean random result. Easiest example: k-means with k=1 always finds the mean in one step, regardless of where the center is initialised.
它们不一样。他们很相似。 K-means 是一种迭代移动质心的算法,以便它们在拆分数据时变得越来越好,虽然这个过程是确定性的,但您必须为这些质心选择初始值,这通常是随机完成的。随机开始,并不意味着最终的质心是随机的。他们会收敛到比较好的东西,而且往往是相似的。
通过这个简单的修改查看您的代码:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%config InlineBackend.figure_format = 'svg' # Change the image format to svg for better quality
don = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/donclassif.txt.gz', sep=';')
fig, ax = plt.subplots(nrows=2, ncols=2, figsize= 2 * np.array(plt.rcParams['figure.figsize']))
cc = []
for row in ax:
for col in row:
kmeans = KMeans(n_clusters = 4)
kmeans.fit(don)
cc.append(kmeans.cluster_centers_)
y_kmeans = kmeans.predict(don)
col.scatter(don['V1'], don['V2'], c = y_kmeans, cmap = 'viridis')
centers = kmeans.cluster_centers_
col.scatter(centers[:, 0], centers[:, 1], c = 'red', s = 200, alpha = 0.5);
plt.show()
cc
如果您查看这些质心的确切值,它们将如下所示:
[array([[ 4.97975722, 4.93316461],
[ 5.21715504, -0.18757547],
[ 0.31141141, 0.06726803],
[ 0.00747797, 5.00534801]]),
array([[ 5.21374245, -0.18608103],
[ 0.00747797, 5.00534801],
[ 0.30592308, 0.06549162],
[ 4.97975722, 4.93316461]]),
array([[ 0.30066361, 0.06804847],
[ 4.97975722, 4.93316461],
[ 5.21017831, -0.18735444],
[ 0.00747797, 5.00534801]]),
array([[ 5.21374245, -0.18608103],
[ 4.97975722, 4.93316461],
[ 0.00747797, 5.00534801],
[ 0.30592308, 0.06549162]])]
相似但不同的值集。
还有:
查看 KMeans 的默认参数。有一个叫做 n_init:
Number of time the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.
默认情况下它等于 10。这意味着每次您 运行 k-means 它实际上 运行 10 次并选择最好的结果。这些最佳结果将比 k-means 中单个 运行 的结果更加相似。