Why is the variance of data normalized by sklearn not equal to 1?
I am using preprocessing from the sklearn package to normalize data as follows:
import pandas as pd
from sklearn import preprocessing
decathlon = pd.read_csv("https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/decathlon.txt", sep='\t')
decathlon.describe()
nor_df = decathlon.copy()
nor_df.iloc[:, 0:10] = preprocessing.scale(decathlon.iloc[:, 0:10])
nor_df.describe()
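The result is that the mean is -1.516402e-16, which is almost 0. By contrast, the variance is 1.012423e+00, i.e. 1.012423. To me, 1.012423 is not close to 1.
Could you please elaborate on this phenomenon?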
In this case sklearn and pandas compute std differently.
sklearn.preprocessing.scale:
We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.
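To make the effect of ddof concrete, here is a minimal sketch using only numpy. The array x is a made-up sample, not the decathlon data, and the standardization is written out by hand as "centre, then divide by the ddof=0 standard deviation", which is an assumption about what the quoted note describes rather than a copy of sklearn's implementation.

```python
import numpy as np

# Hypothetical sample; any 1-D array of floats works here.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = x.size

biased = np.std(x, ddof=0)    # divides the sum of squares by N
unbiased = np.std(x, ddof=1)  # divides the sum of squares by N - 1

# Standardize with the biased (ddof=0) estimator.
z = (x - x.mean()) / biased

print(np.std(z, ddof=0))  # 1 up to floating-point rounding
print(np.std(z, ddof=1))  # sqrt(N / (N - 1)), slightly above 1
```

The second printed value is larger than 1 only because it is measured with a different divisor than the one used for scaling.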
pandas.DataFrame.describe uses pandas.core.series.Series.std, where:
Normalized by N-1 by default. This can be changed using the ddof argument
...
ddof : int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof,
where N represents the number of elements.
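The gap between the two conventions can be worked out directly. The sketch below assumes a column z that has already been standardized with ddof=0, so its mean is 0 and its divide-by-N variance is exactly 1. The symbols z_i, N and s_{N-1} are notation introduced here, and N = 41 is an assumed row count chosen because it reproduces the reported number, not something stated in the post.

```latex
\frac{1}{N}\sum_{i=1}^{N} z_i^2 = 1
\quad\Longrightarrow\quad
s_{N-1} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} z_i^2}
        = \sqrt{\frac{N}{N-1}}

% Assumed example: N = 41
s_{N-1} = \sqrt{\frac{41}{40}} \approx 1.012423
```

Note that describe() reports this quantity in its std row, so 1.012423 is a standard deviation; the corresponding divide-by-(N-1) variance would be N/(N-1).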
Note that, as of 2020-10-28, pandas.DataFrame.describe has no ddof parameter, so the Series default of ddof=1 is always used.
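Since describe() cannot be told which ddof to use, one way to check the scaled data is to call std on the DataFrame directly. The following is a minimal sketch on synthetic data: the DataFrame df, its column names, and the choice of 41 rows are all hypothetical stand-ins for the decathlon table.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical stand-in for the real table: 41 rows, 3 numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(loc=10.0, scale=3.0, size=(41, 3)),
                  columns=["a", "b", "c"])

# Standardize every column, then wrap the result back into a DataFrame.
scaled = pd.DataFrame(preprocessing.scale(df), columns=df.columns)
n = len(scaled)

std_ddof1 = scaled.std()        # pandas default, the value describe() shows
std_ddof0 = scaled.std(ddof=0)  # same divisor that preprocessing.scale uses

print(std_ddof1)  # each entry should be close to sqrt(n / (n - 1))
print(std_ddof0)  # each entry should be close to 1
```

If a sample standard deviation of exactly 1 is wanted instead, multiplying the scaled columns by sqrt((n - 1) / n) would achieve that, although the quoted sklearn note suggests the choice rarely matters for model performance.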