Get boundary coordinates for clusters created from sklearn Gaussian Mixture

Question

I have a DataFrame df to which I have applied sklearn.mixture.GaussianMixture in order to cluster my data. It is a fairly simple model:

# Pandas and Numpy
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt

# Gaussian mixture clustering
from sklearn.mixture import GaussianMixture

# Define Colours and labels
colours = ['cyan', 'chartreuse']
lab = ['Segment 1', 'Segment 2',]

# Define dataset
X = df[['weights', 'percentiles']].to_numpy()
# Define the model
gm_model = GaussianMixture(n_components=2)
# Fit the model
gm_model.fit(X)
# Assign a cluster to each example
yhat = gm_model.predict(X)
# Retrieve unique clusters
clusters = np.unique(yhat)

# Create scatter plot for samples from each cluster
for i, cluster in enumerate(clusters):
    # Boolean mask selecting the samples assigned to this cluster
    mask = yhat == cluster
    # Create scatter of these samples with a different colour and label for each segment
    plt.scatter(X[mask, 0], X[mask, 1], s=1, c=colours[i], label=lab[i])

lgnd = plt.legend(loc='lower right', scatterpoints=1, fontsize=30)

plt.show()

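What I then want to do is take another DataFrame, df_1, and find which of its values fall within the clusters created from df. df and df_1 have exactly the same structure: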

print(df.columns)

Index(['id', 'percentiles', 'weights', 'is_good'],
      dtype='object')

print(df.dtypes)

id                  object
percentiles        float64
weights            float64
is_good             object

So I want to take the rows of df_1 where df_1['is_good'] == 'Yes' and find which of them fall into the clusters created from df.

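I was thinking of doing this by finding the coordinates of each cluster's boundary, then finding all the values in df_1 that lie inside those boundaries and labelling them as belonging to that particular cluster. To do that, however, I need to know how to find the coordinates of the cluster boundaries. Or, if there is another (or better) way to do this, I would love to know!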

Answer 1

You can predict on df_1 in the same way you did for df, reusing the model that was already fitted:

X = df_1[['weights', 'percentiles']].to_numpy()
prediction = gm_model.predict(X)
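
To tie this back to the is_good filter from the question, here is a minimal sketch of the whole flow. The make_frame helper and the synthetic values are assumptions made only so the example runs on its own, and the cluster and cluster_prob column names are illustrative; random_state=0 is added just to make the run repeatable. The idea is to filter df_1 first, then pass the same two feature columns, in the same order, to the model that was fitted on df.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def make_frame(n, rng):
    """Hypothetical helper: toy data with the same columns as in the question."""
    half = n // 2
    weights = np.concatenate([rng.normal(10, 1, half), rng.normal(50, 1, n - half)])
    percentiles = np.concatenate([rng.normal(0.2, 0.05, half), rng.normal(0.8, 0.05, n - half)])
    return pd.DataFrame({
        "id": [f"id_{i}" for i in range(n)],
        "percentiles": percentiles,
        "weights": weights,
        "is_good": rng.choice(["Yes", "No"], size=n),
    })

# Synthetic stand-ins for the real df and df_1 (assumption)
df = make_frame(200, rng)
df_1 = make_frame(100, rng)

# Same feature columns, in the same order, for fitting and predicting
features = ["weights", "percentiles"]

gm_model = GaussianMixture(n_components=2, random_state=0)
gm_model.fit(df[features].to_numpy())

# Keep only the rows flagged as good, then reuse the fitted model
good = df_1.loc[df_1["is_good"] == "Yes"].copy()
X_new = good[features].to_numpy()

# Hard cluster assignment for each new row
good["cluster"] = gm_model.predict(X_new)
# Posterior probability of the assigned cluster (how confident the assignment is)
good["cluster_prob"] = gm_model.predict_proba(X_new).max(axis=1)

print(good[["id", "weights", "percentiles", "cluster", "cluster_prob"]].head())
```

With this approach no boundary coordinates are needed: predict assigns each new point to the component with the highest posterior probability, and predict_proba shows how close a point is to being assigned to the other cluster.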

python

numpy

cluster-analysis

pandas

scikit-learn