Cumulative Explained Variance for PCA in Python
I have a simple R script that runs FactoMineR's PCA on a tiny data frame in order to find the cumulative percentage of variance explained by each variable:
library(FactoMineR)
a <- c(1, 2, 3, 4, 5)
b <- c(4, 2, 9, 23, 3)
c <- c(9, 8, 7, 6, 6)
d <- c(45, 36, 74, 35, 29)
df <- data.frame(a, b, c, d)
df_pca <- PCA(df, ncp = 4, graph=F)
print(df_pca$eig$`cumulative percentage of variance`)
Which returns:
> print(df_pca$eig$`cumulative percentage of variance`)
[1] 58.55305 84.44577 99.86661 100.00000
I am trying to do the same thing in Python using scikit-learn's decomposition package, as follows:
import pandas as pd
from sklearn import decomposition
a = [1, 2, 3, 4, 5]
b = [4, 2, 9, 23, 3]
c = [9, 8, 7, 6, 6]
d = [45, 36, 74, 35, 29]
df = pd.DataFrame({'a': a,
                   'b': b,
                   'c': c,
                   'd': d})
pca = decomposition.PCA(n_components=4)
pca.fit(df)
transformed_pca = pca.transform(df)
# accumulate the explained variance ratio, one component at a time
cum_explained_var = []
for i in range(len(pca.explained_variance_ratio_)):
    if i == 0:
        cum_explained_var.append(pca.explained_variance_ratio_[i])
    else:
        cum_explained_var.append(pca.explained_variance_ratio_[i] +
                                 cum_explained_var[i-1])
print(cum_explained_var)
But this results in:
[0.79987089715487936, 0.99224337624509307, 0.99997254568237226, 1.0]
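As you can see, both correctly add up to 100%, but the contribution of each variable appears to differ between the R and Python versions. Does anyone know where these differences come from, or how to correctly replicate the R results in Python?
EDIT: Thanks to Vlo, I now know that the difference stems from the FactoMineR PCA function scaling the data by default. After scaling my data with the sklearn preprocessing package (pca_data = preprocessing.scale(df)) before running PCA, my results match.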
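To see why scaling changes the percentages, it may help to compare the eigenvalues of the covariance matrix (what PCA on the raw data amounts to) with those of the correlation matrix (what PCA on standardized data amounts to). The following is only a rough sketch using NumPy; the helper name cumulative_ratio is made up for illustration and is not part of FactoMineR or scikit-learn.

```python
import numpy as np

# Same data as in the question; columns are a, b, c, d.
data = np.array([
    [1, 4, 9, 45],
    [2, 2, 8, 36],
    [3, 9, 7, 74],
    [4, 23, 6, 35],
    [5, 3, 6, 29],
], dtype=float)


def cumulative_ratio(matrix):
    """Hypothetical helper: cumulative share of the eigenvalue total."""
    # eigvalsh returns eigenvalues in ascending order, so reverse them.
    eigenvalues = np.linalg.eigvalsh(matrix)[::-1]
    return np.cumsum(eigenvalues) / eigenvalues.sum()


# Unscaled PCA: column d has by far the largest variance and dominates.
cov_cum = cumulative_ratio(np.cov(data, rowvar=False))

# Scaled PCA: every variable contributes unit variance.
corr_cum = cumulative_ratio(np.corrcoef(data, rowvar=False))

print(cov_cum)
print(corr_cum)
```

With this data, cov_cum should line up with the unscaled scikit-learn output above and corr_cum with the FactoMineR percentages (divided by 100).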
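Thanks to Vlo, I learned that the difference between the FactoMineR PCA function and the sklearn PCA function is that FactoMineR scales the data by default. By simply adding a scaling function to my Python code, I was able to reproduce the results.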
import pandas as pd
from sklearn import decomposition, preprocessing
a = [1, 2, 3, 4, 5]
b = [4, 2, 9, 23, 3]
c = [9, 8, 7, 6, 6]
d = [45, 36, 74, 35, 29]
df = pd.DataFrame({'a': a,
                   'b': b,
                   'c': c,
                   'd': d})
# standardize each column to zero mean and unit variance before PCA
pca_data = preprocessing.scale(df)
pca = decomposition.PCA(n_components=4)
pca.fit(pca_data)
transformed_pca = pca.transform(pca_data)
# accumulate the explained variance ratio, one component at a time
cum_explained_var = []
for i in range(len(pca.explained_variance_ratio_)):
    if i == 0:
        cum_explained_var.append(pca.explained_variance_ratio_[i])
    else:
        cum_explained_var.append(pca.explained_variance_ratio_[i] +
                                 cum_explained_var[i-1])
print(cum_explained_var)
Output:
[0.58553054049052267, 0.8444577483783724, 0.9986661265687754, 0.99999999999999978]
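As a possible shorter variant of the code above, the manual accumulation loop could be replaced with numpy.cumsum, and preprocessing.scale with StandardScaler (which should standardize the columns the same way). This is a sketch of one way to write it, not the only one; multiplying by 100 is just to put the numbers on the same scale as the FactoMineR output.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [4, 2, 9, 23, 3],
    "c": [9, 8, 7, 6, 6],
    "d": [45, 36, 74, 35, 29],
})

# Standardize each column (zero mean, unit variance) before fitting PCA.
scaled = StandardScaler().fit_transform(df)
pca = PCA(n_components=4).fit(scaled)

# Running total of the per-component variance ratios, as percentages.
cum_explained_var = np.cumsum(pca.explained_variance_ratio_) * 100
print(cum_explained_var.round(5))
```

The printed values should agree with the cumulative percentages reported by FactoMineR in the question.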