Should we scale before the KElbowVisualizer method for clustering in Python?

Question

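I know that we need to scale the data before any clustering.

What I would like to know is whether the KElbowVisualizer method does the scaling itself, or whether I should scale the data before passing it in.

I searched the method's documentation but could not find an answer. If you have found one, could you share it with me? Thanks.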

Answer 1

I looked at the implementation of KElbowVisualizer in yellowbrick/cluster/elbow.py on GitHub, and I did not find any code under the fit function (line 306) that scales the X variable.

# https://github.com/DistrictDataLabs/yellowbrick/blob/main/yellowbrick/cluster/elbow.py
#...
    def fit(self, X, y=None, **kwargs):
        """
        Fits n KMeans models where n is the length of ``self.k_values_``,
        storing the silhouette scores in the ``self.k_scores_`` attribute.
        The "elbow" and silhouette score corresponding to it are stored in
        ``self.elbow_value`` and ``self.elbow_score`` respectively.
        This method finishes up by calling draw to create the plot.
        """

        self.k_scores_ = []
        self.k_timers_ = []
        self.kneedle = None
        self.knee_value = None

        if self.locate_elbow:
            self.elbow_value_ = None
            self.elbow_score_ = None

        for k in self.k_values_:
            # Compute the start time for each  model
            start = time.time()

            # Set the k value and fit the model
            self.estimator.set_params(n_clusters=k)
            self.estimator.fit(X, **kwargs)

            # Append the time and score to our plottable metrics
            self.k_timers_.append(time.time() - start)
            self.k_scores_.append(self.scoring_metric(X, self.estimator.labels_))
#...

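So X reaches the estimator unchanged, and you will probably need to scale your data (the X argument) yourself before passing it to KElbowVisualizer().fit().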
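As a minimal sketch of that advice, the snippet below standardizes the data with scikit-learn's StandardScaler before any clustering is run. The toy data, the variable names (`X_scaled`, `inertias`) and the range of k are assumptions made for illustration, not part of the question. The yellowbrick call appears only as a comment so the snippet runs with scikit-learn alone. The loop is a rough manual stand-in for what the visualizer's fit does, not its actual implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0.0, 1.0, size=200),      # values around single units
    rng.normal(0.0, 1000.0, size=200),   # values in the thousands
])

# Scale first, because fit() hands X to the estimator as-is.
X_scaled = StandardScaler().fit_transform(X)

# With yellowbrick installed, the scaled array is what you would pass in:
#   from yellowbrick.cluster import KElbowVisualizer
#   KElbowVisualizer(KMeans(n_init=10, random_state=0), k=(2, 8)).fit(X_scaled)

# Rough manual stand-in using scikit-learn only: fit one KMeans per k
# and record the within-cluster sum of squares (inertia).
inertias = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_
```

After scaling, each column has zero mean and unit variance, so the large-valued feature no longer dominates the Euclidean distances that KMeans relies on.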

python

cluster-analysis

dataframe

scikit-learn