Cluster groups continuously instead of discretely - python

I am trying to cluster a set of points probabilistically. Using the code below, I have a set of XY points, which are recorded in X and Y. I want to cluster them into groups using a reference point, shown in X2 and Y2.

With the help of a previous answer, the current approach is to measure the distance from the reference point and cluster into groups using k-means. Although it provides a way to cluster using the reference point, the hard cut-off and the commitment to exactly k clusters make it somewhat unsuitable across many datasets. For example, the number of clusters needed for this sample may be 3, but a separate sample may differ; I have to manually edit k every time.

Given the non-probabilistic nature of k-means, a separate option may be a GMM. Is it possible to account for the reference point when modelling? If I attach the output below, the underlying model does not cluster the way I am hoping.

If I look at the probability of each point being in a group, it is not clustered the way I'd hoped. With this, I run into the same problem of manually changing the number of components. Because the points are randomly spread, using AIC or BIC to select the appropriate number of clusters does not work: there is no optimal number.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.DataFrame({                                   
    'X' : [-1.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,0.0,-2.5,2.0,8.0,-10.5,15.0,-20.0,-32.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-30.0,-15.0,20.0,-15.0,-10.0],
    'Y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,4.0,5.0,-3.5,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,20.0,-20.0,2.0,-17.5,-15,19.0,20.0],     
    'X2' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],
    'Y2' : [0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],           
    })

k-means:

# distance to the reference point (X2, Y2); since the reference is the
# origin here, this reduces to sqrt(X**2 + Y**2)
df['distance'] = np.sqrt((df['X'] - df['X2'])**2 + (df['Y'] - df['Y2'])**2)

model = KMeans(n_clusters = 2) 

model_data = np.array([df['distance'].values, np.zeros(df.shape[0])])
model.fit(model_data.T) 
df['group'] = model.labels_ 

plt.scatter(df['X'], df['Y'], c = model.labels_, cmap = 'bwr', marker = 'o', s = 5)
plt.scatter(df['X2'], df['Y2'], c ='k', marker = 'o', s = 5)

GMM:

Y_sklearn = df[['X','Y']].values

from sklearn import mixture

gmm = mixture.GaussianMixture(n_components=3, covariance_type='diag', random_state=42)
gmm.fit(Y_sklearn)
labels = gmm.predict(Y_sklearn)
df['group'] = labels
plt.scatter(Y_sklearn[:, 0], Y_sklearn[:, 1], c=labels, s=5, cmap='viridis');
plt.scatter(df['X2'], df['Y2'], c='red', marker = 'x', edgecolor = 'k', s = 5, zorder = 10)

proba = pd.DataFrame(gmm.predict_proba(Y_sklearn).round(2)).reset_index(drop = True)
df_pred = pd.concat([df, proba], axis = 1)
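
For reference, the AIC/BIC sweep described above would look something like the following sketch (GaussianMixture exposes aic() and bic() directly; the range of k here is illustrative). As noted, on randomly spread points like these the criterion shows no clear minimum:

# Sketch: score a range of component counts with BIC.
bics = []
for k in range(1, 8):
    g = mixture.GaussianMixture(n_components=k, covariance_type='diag', random_state=42)
    g.fit(Y_sklearn)
    bics.append(g.bic(Y_sklearn))
print(bics)  # no pronounced minimum on data spread like this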

With your centre point at 0,0, we can calculate the Euclidean distance from that point to all the points in your df:

df['distance'] = np.sqrt(df['X']**2 + df['Y']**2)

If the centre point were not zero, it would be:

df['distance'] = np.sqrt((centre_point_x - df['X'])**2 + (centre_point_y - df['Y'])**2)

Using your data and chart as before, we can plot this and see the distance measure increasing as we move away from the centre.

fig, ax = plt.subplots(figsize = (6,6))

ax.scatter(df['X'], df['Y'], c = df['distance'], cmap = 'viridis', marker = 'o', s = 30)

ax.set_xlim([-35, 35])
ax.set_ylim([-35, 35])

plt.show()

K-means

We can now take this distance data and use it to calculate K-means clusters as we did before, but this time using the distance data alongside an array of zeros (the zeros are there because this k-means needs a 2-D array, but we only want to split on the 1-D distance data, so the zeros act as a 'filler').

model = KMeans(n_clusters = 2) #choose how many clusters
# create this 2d array for the KMeans model
model_data = np.array([df['distance'].values, np.zeros(df.shape[0])])
model.fit(model_data.T) # transformed array because the above code produces
# data with 27 columns and 2 rows but we want it the other way round
df['group'] = model.labels_ # put the labels into the dataframe
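
As an aside (my own note, not part of the original answer), the zero-filler can be avoided entirely by reshaping the 1-D distances into a single-feature column, which KMeans accepts directly:

# Equivalent sketch: fit on an (n_samples, 1) array instead of padding with zeros.
model = KMeans(n_clusters = 2)
model.fit(df['distance'].values.reshape(-1, 1))
df['group'] = model.labels_  # same labels column as before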

Then we can plot the results:

fig, ax = plt.subplots(figsize = (6,6))

ax.scatter(df['X'], df['Y'], c = df['group'], cmap = 'viridis', marker = 'o', s = 30)

ax.set_xlim([-35, 35])
ax.set_ylim([-35, 35])

plt.show()

For three clusters, we get the following:

Other clustering methods

Check out SKlearn's clustering page for more options. I experimented with DBSCAN with some decent results, but it depends on what you are trying to achieve exactly. Check out the table below their example charts to see how the methods compare.

Do you mean density estimation? You can model your data as a mixture of Gaussians and then obtain the probability of a point belonging to the mixture. You can use sklearn.mixture.GaussianMixture for that. By changing the number of components you control the number of clusters you will have. The metric to cluster on is the Euclidean distance from the reference point, so the GMM model will give you a prediction of which cluster a data point should be classified to.

Since your metric is 1-D, you will get a set of Gaussian distributions, i.e. a set of means and variances. So you can easily calculate the probability of any point being in a certain cluster: just calculate its distance from the reference point and put the value into the normal distribution's pdf formula.

To make the picture clearer, I changed the reference point to (-5, 5) and selected the number of clusters = 4. To get the best number of clusters, use some metric that minimises the total variance and penalises growth of the number of mixtures, for example argmin(model.covariances_.sum()*num_clusters).

import pandas as pd
from sklearn.mixture import GaussianMixture
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
df = pd.DataFrame({                                   
    'X' : [-1.0,-1.0,0.5,0.0,0.0,2.0,3.0,5.0,0.0,-2.5,2.0,8.0,-10.5,15.0,-20.0,-32.0,-20.0,-20.0,-10.0,20.5,0.0,20.0,-30.0,-15.0,20.0,-15.0,-10.0],
    'Y' : [0.0,1.0,-0.5,0.5,-0.5,0.0,1.0,4.0,5.0,-3.5,-2.0,-8.0,-0.5,-10.5,-20.5,0.0,16.0,-15.0,5.0,13.5,20.0,-20.0,2.0,-17.5,-15,19.0,20.0],     
    })

ref_X, ref_Y = -5, 5
dist  = np.sqrt((df.X-ref_X)**2 + (df.Y-ref_Y)**2)

n_mix = 4
gmm = GaussianMixture(n_mix)
model = gmm.fit(dist.values.reshape(-1,1))
x = np.linspace(-35., 35.)
y = np.linspace(-30., 30.)
X, Y = np.meshgrid(x, y)
XX = np.sqrt((X.ravel() - ref_X)**2 + (Y.ravel() - ref_Y)**2)
Z = model.score_samples(XX.reshape(-1,1))
Z = Z.reshape(X.shape)

# plot grid points probabilities
plt.set_cmap('plasma')
plt.contourf(X, Y, Z, 40)
plt.scatter(df.X, df.Y, c=model.predict(dist.values.reshape(-1,1)), edgecolor='black') 
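
To make the pdf calculation described above concrete, here is a minimal sketch (my own addition) that plugs one point's distance to the reference into each fitted component's normal pdf; it reuses the scipy.stats.norm import from the block above, and the point coordinates are arbitrary:

# Sketch: probability of one point belonging to each 1-D mixture component.
point_x, point_y = 3.0, 1.0  # any point of interest
d = np.sqrt((point_x - ref_X)**2 + (point_y - ref_Y)**2)

means = model.means_.ravel()                # component means
stds = np.sqrt(model.covariances_.ravel())  # component standard deviations

densities = model.weights_ * norm.pdf(d, means, stds)  # weighted pdf values
probs = densities / densities.sum()  # normalise over components
print(probs.round(3))  # P(component | point)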

You can read more here and here.

P.S. score_samples() returns the log-likelihood; use exp() to convert it to probability.
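
The cluster-count heuristic mentioned above, argmin(model.covariances_.sum()*num_clusters), could be sketched as follows; the scoring expression is the answer's own suggestion, while the loop around it is illustrative:

# Sketch: pick the component count minimising the penalised total variance.
scores = {}
for k in range(1, 8):
    m = GaussianMixture(k).fit(dist.values.reshape(-1, 1))
    scores[k] = m.covariances_.sum() * k  # total variance * num_clusters
best_k = min(scores, key=scores.get)
print(best_k, scores)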

In my opinion, if you want to define clusters as "areas where points are close to each other", you should use DBSCAN. This clustering algorithm finds clusters by looking at areas where points are close to each other (i.e. dense areas), separated from other clusters by areas where the points are less dense. The algorithm can also classify points as noise (outliers), labelled -1: these are points that do not belong to any cluster.

Here is some code that performs DBSCAN clustering and inserts the cluster labels as a new categorical column into the original Y_sklearn DataFrame. It also prints how many clusters and how many outliers were found.

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN


Y_sklearn = df.loc[:, ["X", "Y"]].copy()
n_points = Y_sklearn.shape[0]

dbs = DBSCAN()
labels_clusters = dbs.fit_predict(Y_sklearn)

#Number of found clusters (outliers are not considered a cluster).
n_clusters = labels_clusters.max() + 1
print(f"DBSCAN found {n_clusters} clusters in dataset with {n_points} points.")

#Number of found outliers (possibly no outliers found).
n_outliers = np.count_nonzero((labels_clusters == -1))
if n_outliers:
    print(f"{n_outliers} outliers were found.\n")
else:
    print(f"No outliers were found.\n")

#Add cluster labels as a new column to original DataFrame.
Y_sklearn["cluster"] = labels_clusters
#Setting `cluster` column to Categorical dtype makes seaborn function properly treat
#cluster labels as categorical, and not numerical.
Y_sklearn["cluster"] = Y_sklearn["cluster"].astype("category")

If you want to plot the results, I suggest you use Seaborn. Below is some code that plots the points of the Y_sklearn DataFrame and colours them by the cluster they belong to. I also define a new palette, which is just the default Seaborn palette, except that outliers (with label -1) are shown in black.

import matplotlib.pyplot as plt
import seaborn as sns


name_palette = "tab10"
palette = sns.color_palette(name_palette)
if n_outliers:
    color_outliers = "black"
    palette.insert(0, color_outliers)
else:
    pass
sns.set_palette(palette)


fig, ax = plt.subplots()
sns.scatterplot(data=Y_sklearn,
                x="X",
                y="Y",
                hue="cluster",
                ax=ax,
                )

With the default hyperparameters, the DBSCAN algorithm found no clusters in the data you provided: all points are considered outliers, because there is no area where the points are significantly denser. Is that your entire dataset, or just a sample? If it is a sample, the full dataset will have many more points and DBSCAN will certainly find some high-density areas. Alternatively, you can try tuning the hyperparameters, min_samples and eps in particular. If you want to "force" the algorithm to find more clusters, you can decrease min_samples (default 5) or increase eps (default 0.5). Of course, the best hyperparameter values depend on the specific dataset, but the defaults are considered very good for DBSCAN. So, if the algorithm considers all the points in your dataset to be outliers, it means that there are no "natural" clusters!
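
As a sketch of that tuning (the eps and min_samples values here are purely illustrative, picked for this small, spread-out sample rather than recommended defaults):

# Sketch: loosen the density requirements so DBSCAN can form clusters
# on this small, spread-out sample.
dbs_tuned = DBSCAN(eps=5.0, min_samples=3)
labels_tuned = dbs_tuned.fit_predict(Y_sklearn[["X", "Y"]])
print(f"Found {labels_tuned.max() + 1} clusters and "
      f"{np.count_nonzero(labels_tuned == -1)} outliers.")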