Why do Python and R give two different normalized results?
Can anyone explain the math behind the scenes? Why do Python and R return different results for the same data? Which one should I use in a real business scenario?
Raw data:
id cost sales item
1 300 50 pen
2 3 88 wf
3 1 70 gher
4 5 80 dger
5 2 999 ww
Python code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('Scale.csv')
df[['cost', 'sales']] = StandardScaler().fit_transform(df[['cost', 'sales']])
df
Python normalized result:
id cost sales item
0 1 1.999876 -0.559003 pen
1 2 -0.497867 -0.456582 wf
2 3 -0.514686 -0.505097 gher
3 4 -0.481047 -0.478144 dger
4 5 -0.506276 1.998826 ww
And the R code:
library(readr)
library(dplyr)
df <- read_csv("C:/Users/Ho/Desktop/Scale.csv")
# scale() returns a one-column matrix, so as.vector() flattens it back to a plain column
df <- df %>% mutate(across(c(cost, sales), ~ as.vector(scale(.x))))
R normalized result:
id cost sales item
1 1 1.7887437 -0.4999873 pen
2 2 -0.4453054 -0.4083792 wf
3 3 -0.4603495 -0.4517725 gher
4 4 -0.4302613 -0.4276651 dger
5 5 -0.4528275 1.7878041 ww
Thanks, @文
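I rarely use these functions in Python, but the data suggest that the difference lies in the divisor used for the variance: the Python function standardizes with a variance computed using 'n', while R uses 'n-1'. We can convert between the two by multiplication; the R session below shows that multiplying the R values by sqrt(5/4) reproduces the Python values.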
> tab <- read.table(textConnection("1 1 1.7887437 -0.4999873 pen
+ 2 2 -0.4453054 -0.4083792 wf
+ 3 3 -0.4603495 -0.4517725 gher
+ 4 4 -0.4302613 -0.4276651 dger
+ 5 5 -0.4528275 1.7878041 ww"))
> tab
V1 V2 V3 V4 V5
1 1 1 1.78874369999999994 -0.49998730000000002 pen
2 2 2 -0.44530540000000002 -0.40837920000000000 wf
3 3 3 -0.46034950000000002 -0.45177250000000002 gher
4 4 4 -0.43026130000000001 -0.42766510000000002 dger
5 5 5 -0.45282749999999999 1.78780410000000001 ww
> # To transform as if we used n in the denominator instead of
> # n-1 we just multiply by sqrt(n/(n-1))
> tab$V3 * sqrt(5/4)
[1] 1.99987625376224520 -0.49786657257386746 -0.51468638770401975
[4] -0.48104675744371517 -0.50627653604064304
> tab$V4 * sqrt(5/4)
[1] -0.55900279534329034 -0.45658182589849106 -0.50509701018251196
[4] -0.47814411760212272 1.99882574902641608
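The same check can be sketched in Python using only the standard library, which may make the 'n' versus 'n-1' point easier to see. This is a minimal illustration, not what StandardScaler or scale() do internally: the helper name zscores and its sample flag are hypothetical, and the two columns are typed in by hand from the raw data in the question. statistics.pstdev divides the variance by n and statistics.stdev divides by n-1.

```python
import math
import statistics

# Columns copied by hand from the raw data in the question.
cost = [300, 3, 1, 5, 2]
sales = [50, 88, 70, 80, 999]


def zscores(values, sample=False):
    """Standardize values (hypothetical helper for illustration).

    sample=False divides the variance by n   (matches the Python output above).
    sample=True  divides the variance by n-1 (matches the R output above).
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values) if sample else statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


cost_n = zscores(cost)                  # divisor n
cost_n1 = zscores(cost, sample=True)    # divisor n-1
sales_n = zscores(sales)
sales_n1 = zscores(sales, sample=True)

# Converting the n-1 version to the n version: multiply by sqrt(n / (n-1)).
n = len(cost)
factor = math.sqrt(n / (n - 1))
cost_converted = [z * factor for z in cost_n1]

print([round(z, 6) for z in cost_n])    # compare with the Python table in the question
print([round(z, 7) for z in cost_n1])   # compare with the R table in the question
print([round(z, 6) for z in cost_converted])
```

With five rows the two conventions differ by a factor of sqrt(5/4), roughly 1.118. As n grows the factor approaches 1, so the choice of divisor matters less and less on larger datasets.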