Why is the variance of data standardized by sklearn not equal to 1?

I am using preprocessing from the sklearn package to normalize data, as follows:

import pandas as pd
import urllib3
from sklearn import preprocessing

decathlon = pd.read_csv("https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/decathlon.txt", sep='\t')
decathlon.describe()

nor_df = decathlon.copy()
nor_df.iloc[:, 0:10] = preprocessing.scale(decathlon.iloc[:, 0:10])
nor_df.describe()

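In the result, the mean is -1.516402e-16, which is almost 0. By contrast, the variance is 1.012423e+00, i.e. 1.012423. To me, 1.012423 is not close to 1.

Could you please elaborate on this phenomenon?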

In this case, sklearn and pandas compute the std differently.

sklearn.preprocessing.scale:

We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.

pandas.DataFrame.describe uses pandas.core.series.Series.std, where:

Normalized by N-1 by default. This can be changed using the ddof argument

...

ddof : int, default 1 Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
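The two quoted definitions differ only in the divisor, so the gap can be worked out directly. As a short derivation, assume z_1, ..., z_N are the values of one column after scaling with ddof=0 (these symbols are introduced here for illustration):

```latex
\[
\frac{1}{N}\sum_{i=1}^{N}(z_i-\bar z)^2 = 1
\quad\Longrightarrow\quad
s_{\mathrm{ddof}=1}
= \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(z_i-\bar z)^2}
= \sqrt{\frac{N}{N-1}}
\]
```

For example, N = 41 would give sqrt(41/40) ≈ 1.012423, which matches the reported value. The row count is not stated in the question, so N = 41 is an inference from that number rather than a given.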

Note that as of 2020-10-28, pandas.DataFrame.describe has no ddof parameter, so the Series.std default of ddof=1 is always used.
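The difference can be reproduced without the dataset. The sketch below is a minimal illustration using numpy only: it standardizes synthetic data with ddof=0 (which the sklearn documentation quoted above says is equivalent to what scale does) and then measures the std with both divisors. The sample size of 41, the random seed, and all variable names are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic stand-in for one numeric column; size 41 is an illustrative assumption.
rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=3.0, size=41)
n = x.size

# Standardize with the biased estimator (divisor N), i.e. numpy.std(x, ddof=0).
z = (x - x.mean()) / x.std(ddof=0)

std_biased = z.std(ddof=0)    # divisor N: exactly 1 up to rounding error
std_unbiased = z.std(ddof=1)  # divisor N - 1: the pandas default
expected_ratio = np.sqrt(n / (n - 1))

print(f"std with ddof=0: {std_biased:.6f}")
print(f"std with ddof=1: {std_unbiased:.6f}")
print(f"sqrt(N/(N-1)):   {expected_ratio:.6f}")
```

To check the scaled DataFrame itself under the same convention as sklearn, one option is to call std with an explicit argument, for example nor_df.iloc[:, 0:10].std(ddof=0), instead of reading the std row from describe().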