Why is the variance of data normalized by sklearn not equal to 1?

Question

I'm using preprocessing from the sklearn package to normalize data as follows:

import pandas as pd
from sklearn import preprocessing

# Load the tab-separated decathlon data set
decathlon = pd.read_csv("https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/decathlon.txt", sep='\t')
decathlon.describe()

# Standardize the first 10 columns on a copy, leaving the original untouched
nor_df = decathlon.copy()
nor_df.iloc[:, 0:10] = preprocessing.scale(decathlon.iloc[:, 0:10])
nor_df.describe()

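In the result, the mean is -1.516402e-16, which is almost 0. The variance, on the other hand, is 1.012423e+00, i.e. 1.012423. To me, 1.012423 is not close to 1.

Could you please elaborate on this phenomenon?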

Answer 1

In this case, sklearn and pandas calculate the standard deviation differently.

sklearn.preprocessing.scale:

We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.

pandas.DataFrame.describe uses pandas.core.series.Series.std, where:

Normalized by N-1 by default. This can be changed using the ddof argument

...

ddof : int, default 1 Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

Note that, as of 2020-10-28, pandas.DataFrame.describe does not have a ddof parameter, so the Series default of ddof=1 is always used.
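
The gap between the two conventions can be checked numerically. After scaling with ddof=0, the ddof=1 standard deviation of each column should equal sqrt(N / (N - 1)). The sketch below uses NumPy only, with synthetic data in place of the decathlon file; the helper name scale_population and the row count N = 41 are assumptions made for illustration (41 rows would be consistent with the 1.012423 reported in the question).

```python
import numpy as np


def scale_population(x):
    """Standardize each column using the population std (ddof=0).

    Hypothetical helper that mirrors the behaviour described in the
    quoted sklearn.preprocessing.scale documentation.
    """
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=0)


rng = np.random.default_rng(0)
n = 41  # assumed number of rows, not taken from the original post
data = rng.normal(loc=10.0, scale=3.0, size=(n, 3))  # synthetic stand-in data

scaled = scale_population(data)

std_ddof0 = scaled.std(axis=0, ddof=0)  # convention used for the scaling
std_ddof1 = scaled.std(axis=0, ddof=1)  # convention pandas uses by default
expected_ratio = np.sqrt(n / (n - 1))

print(std_ddof0)
print(std_ddof1)
print(expected_ratio)
```

With pandas, calling nor_df.std(ddof=0) instead of reading the std row of describe() should report values equal to 1 up to floating-point error, since DataFrame.std accepts a ddof argument.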

