Is there a way to compute the cluster_std of a dataset generated by make_blobs?

Question

make_blobs() generates isotropic Gaussian blobs for clustering.

The cluster_std parameter is the standard deviation of the clusters.

I generated a dataset:

from sklearn.datasets import make_blobs
x, y = make_blobs(n_samples=100, centers=6,
                  cluster_std=0.60, random_state=1234)

I am trying to compute the standard deviation:

np.std(x)

which outputs

5.122249276993561

This is far from the initial parameter of 0.60.

Is there a way to compute the standard deviation correctly?

Answer 1

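From the source of make_blobs(), you can see that the specified standard deviation of 0.60 is passed as the scale argument to generator.normal(loc=centers[i], scale=std, size=(n, n_features)), which is how sklearn generates the data points for each cluster.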
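The per-cluster sampling described above can be sketched with NumPy alone. This is a simplified illustration rather than sklearn's actual code: the centers, the seed and the number of points per cluster are made-up values chosen only for the demonstration.

```python
import numpy as np

# Hypothetical values for illustration; make_blobs draws its own centers.
rng = np.random.RandomState(0)
centers = np.array([[-5.0, 0.0], [0.0, 5.0], [5.0, -5.0]])
std = 0.60
n_per_cluster = 1000

# One normal draw per cluster, centred on that cluster's centre.
blobs = [rng.normal(loc=c, scale=std, size=(n_per_cluster, 2)) for c in centers]
points = np.vstack(blobs)
labels = np.repeat(np.arange(len(centers)), n_per_cluster)

# Spread of all values together: dominated by the distance between centers.
print(np.std(points))

# Spread inside each cluster: this is what `std` controls.
for k in range(len(centers)):
    print(k, np.std(points[labels == k], axis=0))
```

Because the spread is applied around each centre separately, only the per-cluster figures should land near 0.60.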

You should compute the standard deviation of each feature within each cluster:

import numpy as np

for i in set(y):
    print('--> label {}'.format(i))
    for j in range(x.shape[1]):
        print('std for feature {}: {}'.format(j, np.std(x[y==i][:,j])))

You get:

--> label 0
std for feature 0: 0.345293121830674
std for feature 1: 0.7142696641502757
--> label 1
std for feature 0: 0.5041694666576663
std for feature 1: 0.6269103210381141
--> label 2
std for feature 0: 0.4168488521809934
std for feature 1: 0.6994177825578384
--> label 3
std for feature 0: 0.5760022004454849
std for feature 1: 0.580543624607708
--> label 4
std for feature 0: 0.5977962642901783
std for feature 1: 0.5271686872743192
--> label 5
std for feature 0: 0.6462807280468825
std for feature 1: 0.4928028738564903
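If a single number comparable to cluster_std is wanted, one option is to pool the deviations from each cluster's own mean across all clusters and features. The sketch below assumes every cluster shares the same isotropic standard deviation, which is the case when cluster_std is a single float. The helper name pooled_cluster_std is hypothetical and not part of sklearn.

```python
import numpy as np
from sklearn.datasets import make_blobs

def pooled_cluster_std(x, y):
    """Pool squared deviations from each cluster mean over clusters and features."""
    labels = np.unique(y)
    sq_dev = 0.0
    for k in labels:
        members = x[y == k]
        sq_dev += ((members - members.mean(axis=0)) ** 2).sum()
    # One mean is estimated per cluster and per feature, so those are
    # subtracted from the number of values to get the degrees of freedom.
    dof = x.size - len(labels) * x.shape[1]
    return np.sqrt(sq_dev / dof)

x, y = make_blobs(n_samples=100, centers=6, cluster_std=0.60, random_state=1234)
estimate = pooled_cluster_std(x, y)
print(estimate)
```

Pooling uses all 100 points at once, so the estimate is typically steadier than any of the twelve per-cluster, per-feature values above.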

Answer 2

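If we do not specify the axis argument in np.std(), all the data points are flattened into a single array before the standard deviation is computed.

From the documentation: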

axis : None or int or tuple of ints, optional Axis or axes along which
the standard deviation is computed. The default is to compute the
standard deviation of the flattened array.

Even when an axis is specified, we do not get the desired result:

np.std(x,axis=0)
array([5.51732287, 4.27190484])

The reason is that the standard deviation we provided applies to each cluster, not to the whole dataset.
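One way to see why the whole-dataset figure is so much larger is to split the total variance into a within-cluster part and a between-cluster part. The sketch below checks that decomposition numerically on the same dataset; the variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs

x, y = make_blobs(n_samples=100, centers=6, cluster_std=0.60, random_state=1234)

labels = np.unique(y)
weights = np.array([np.mean(y == k) for k in labels])       # share of points per cluster
means = np.array([x[y == k].mean(axis=0) for k in labels])  # per-cluster means
within = np.array([x[y == k].var(axis=0) for k in labels])  # per-cluster variances

# Weighted average of the within-cluster variances.
within_part = (weights[:, None] * within).sum(axis=0)
# Weighted variance of the cluster means around the overall mean.
between_part = (weights[:, None] * (means - x.mean(axis=0)) ** 2).sum(axis=0)

total = x.var(axis=0)
print(np.sqrt(total))         # same quantity as np.std(x, axis=0)
print(np.sqrt(within_part))   # the part that cluster_std controls
print(np.sqrt(between_part))  # the part due to the spread of the centers
```

The total variance equals the sum of the two parts, so np.std(x, axis=0) mostly measures how far apart the centers are.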

From the documentation:

cluster_std : float or sequence of floats, optional (default=1.0) The
standard deviation of the clusters.

Now, if we compute the standard deviation of each cluster:

>>> sample_size = 100
>>> x, y = make_blobs(n_samples=sample_size, centers=6,
...                   cluster_std=0.60, random_state=1234)
>>> for i in range(6):
...     print(np.std(x[y==i], axis=0))

[0.34529312 0.71426966]
[0.50416947 0.62691032]
[0.41684885 0.69941778]
[0.5760022  0.58054362]
[0.59779626 0.52716869]
[0.64628073 0.49280287]

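However, these values are not always close to the specified value of 0.60.

Now comes the statistics part. Only as we increase the sample size does the sample standard deviation approach the population standard deviation (the value we specified earlier).

If we set sample_size to 10,000,000, the results are very close: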

[0.600691   0.60049266]
[0.60009299 0.60028479]
[0.60048685 0.60019785]
[0.60000098 0.60000844]
[0.59989123 0.60017014]
[0.60010969 0.59936852]
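The effect of sample size can also be checked without generating ten million points. The sketch below repeats a small experiment many times and measures how much the sample standard deviation scatters around 0.60. It uses plain NumPy draws rather than make_blobs. The seed, the repeat count and the sample sizes are arbitrary choices, with 17 picked because it is roughly 100 / 6 points per cluster. The comparison value sigma / sqrt(2n) is the usual large-sample approximation for normally distributed data.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
sigma = 0.60
repeats = 200

spreads = {}
for n in (17, 500, 20_000):
    # Each row is one independent sample of size n.
    samples = rng.normal(loc=0.0, scale=sigma, size=(repeats, n))
    stds = samples.std(axis=1)
    spreads[n] = stds.std()
    # Observed scatter of the estimate next to the large-sample approximation.
    print(n, spreads[n], sigma / np.sqrt(2 * n))
```

With about 17 points per cluster the estimate can easily be off by around 0.1, which is consistent with per-cluster values such as 0.35 or 0.71 in the small dataset.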
