Truncated SVD and PCA
In theory, PCA and SVD give the same projection when the features have zero mean, so I tried this in Python.
from sklearn import datasets
cancer = datasets.load_breast_cancer()
from sklearn.preprocessing import StandardScaler
# we can set our feature to have mean 0 by setting with_mean=False
scaler = StandardScaler(with_mean=False,with_std=False)
scaler.fit(cancer.data)
X_scaled = scaler.transform(cancer.data)
from sklearn.decomposition import PCA
pca=PCA(n_components=3,svd_solver='randomized')
pca.fit(X_scaled)
X_pca=pca.transform(X_scaled)
from sklearn.decomposition import TruncatedSVD
svdm=TruncatedSVD(n_components=3,algorithm='randomized')
svdm.fit(X_scaled)
X_svdm=svdm.transform(X_scaled)
But when I print the results, they are different. Why is that?
print(X_pca)
print(X_svdm)
>>>[[1160.1425737 -293.91754364 48.57839763]
[1269.12244319 15.63018184 -35.39453423]
[ 995.79388896 39.15674324 -1.70975298]
...
[ 314.50175618 47.55352518 -10.44240718]
[1124.85811531 34.12922497 -19.74208742]
[-771.52762188 -88.64310636 23.88903189]]
>>>[[2241.97427647 347.71556015 -27.53741942]
[2372.40840267 56.90166991 23.86316187]
[2101.8402797 11.94762737 30.41138602]
...
[1424.53280954 -55.0217124 -3.5794351 ]
[2231.65579282 19.99439854 3.31619182]
[ 331.69302638 -5.29733966 -39.12136435]]
What should I change so that both algorithms give the same result?
The relevant StandardScaler parameter, from the scikit-learn documentation:
with_mean : bool, default=True
    If True, center the data before scaling. This does not work (and will
    raise an exception) when attempted on sparse matrices, because centering
    them entails building a dense matrix which in common use cases is likely
    to be too large to fit in memory.
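As a quick illustration of what these flags do, here is a minimal sketch (not part of the original answer; names such as X_noop and X_centered are made up for this example). It checks that a scaler with both flags off returns the data unchanged, and that with_mean=True is what actually brings the column means to zero.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X = datasets.load_breast_cancer().data

# Both flags off, as in the question: the scaler neither centers nor scales.
X_noop = StandardScaler(with_mean=False, with_std=False).fit_transform(X)
unchanged = np.allclose(X_noop, X)

# with_mean=True subtracts each column's mean (no scaling here).
X_centered = StandardScaler(with_mean=True, with_std=False).fit_transform(X)
max_abs_col_mean = np.abs(X_centered.mean(axis=0)).max()

print("data unchanged with both flags off:", unchanged)
print("largest column mean of the raw data:", X.mean(axis=0).max())
print("largest |column mean| after centering:", max_abs_col_mean)
```

So the X_scaled in the question is just the raw data, whose features are far from zero mean.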
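For PCA and TruncatedSVD to give the same output, the data must be centered: PCA subtracts the feature means internally, whereas TruncatedSVD decomposes the matrix exactly as given (see also this post for details). Scaling is not required for the two to agree. So if you use the default settings, which both center and scale the data: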
# which is also the default
scaler = StandardScaler(with_mean=True, with_std=True)
X_scaled = scaler.fit_transform(cancer.data)
pca=PCA(n_components=3,svd_solver='randomized')
pca.fit(X_scaled)
X_pca=pca.transform(X_scaled)
svdm=TruncatedSVD(n_components=3,algorithm='randomized')
svdm.fit(X_scaled)
X_svdm=svdm.transform(X_scaled)
X_pca
array([[ 9.19283683, 1.94858306, -1.12316567],
[ 2.3878018 , -3.76817175, -0.52929196],
[ 5.73389628, -1.0751738 , -0.55174751],
...,
[ 1.25617928, -1.90229671, 0.56273027],
[10.37479406, 1.67201011, -1.87702986],
[-5.4752433 , -0.6706368 , 1.49044385]])
X_svdm
array([[ 9.19283683, 1.94858307, -1.12316615],
[ 2.3878018 , -3.76817174, -0.52929266],
[ 5.73389628, -1.0751738 , -0.55174759],
...,
[ 1.25617928, -1.90229671, 0.56273052],
[10.37479406, 1.67201011, -1.87702935],
[-5.4752433 , -0.67063679, 1.49044309]])
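The small differences in the last digits come from the randomized solver. To compare the two projections programmatically rather than by eye, something like the following sketch could be used. It makes a few assumptions that are not in the answer above: the 'full' and 'arpack' solvers replace 'randomized' so the comparison is not affected by random approximation error, the data is only centered (not scaled), and same_up_to_sign is a hypothetical helper that compares absolute values, because the sign of each component is arbitrary.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA, TruncatedSVD

X = datasets.load_breast_cancer().data
X_centered = X - X.mean(axis=0)  # center only, no scaling


def same_up_to_sign(a, b, tol=1e-6):
    """Hypothetical helper: compare projections while ignoring column signs."""
    return np.allclose(np.abs(a), np.abs(b), rtol=tol, atol=tol)


# PCA centers internally, so it can be fed the raw data.
X_pca = PCA(n_components=3, svd_solver='full').fit_transform(X)

# TruncatedSVD does not center: raw and centered inputs give different results.
X_svd_raw = TruncatedSVD(
    n_components=3, algorithm='arpack', random_state=0
).fit_transform(X)
X_svd_centered = TruncatedSVD(
    n_components=3, algorithm='arpack', random_state=0
).fit_transform(X_centered)

print("PCA vs TruncatedSVD on raw data:     ", same_up_to_sign(X_pca, X_svd_raw))
print("PCA vs TruncatedSVD on centered data:", same_up_to_sign(X_pca, X_svd_centered))
```

The first comparison reproduces the mismatch from the question, and the second shows that centering alone is enough to make the projections agree.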