Why does sklearn's PCA not return reproducible results?
I am using scikit-learn's PCA and noticed some very odd behavior. Essentially, the results are not reproducible when using more than 500 samples. This example shows what is happening:
import numpy as np
from sklearn.decomposition import PCA
Ncomp = 15
Nsamp = 501
Nfeat = 30
PCAnalyzer = PCA(n_components = Ncomp)
ManySamples = np.random.rand(Nsamp, Nfeat)
TestSample = np.ones((1, Nfeat))
print(PCAnalyzer.fit(ManySamples).transform(TestSample))
print(PCAnalyzer.fit(ManySamples).transform(TestSample))
print(PCAnalyzer.fit(ManySamples).transform(TestSample))
print(PCAnalyzer.fit(ManySamples).transform(TestSample))
It outputs:
>>> print(PCAnalyzer.fit(ManySamples).transform(TestSample))
[[-0.25641111 0.42327221 0.4616427 -0.72047479 -0.12386481 0.10608497
0.28739712 -0.26003239 1.27305465 1.05307604 -0.53915119 -0.07127874
0.25312454 -0.12052255 -0.06738885]]
>>> print(PCAnalyzer.fit(ManySamples).transform(TestSample))
[[-0.26656397 0.42293446 0.45487161 -0.7339531 -0.16134778 0.15389179
0.27052166 -0.33565591 1.26289845 0.96118269 0.5362569 -0.54688338
0.08329318 -0.08423136 -0.00253318]]
>>> print(PCAnalyzer.fit(ManySamples).transform(TestSample))
[[-0.21899525 0.38527988 0.45101669 -0.73443888 -0.20501978 0.09640448
0.17826649 -0.37653009 1.04856884 1.10948052 0.60700417 -0.39864793
0.18020651 0.08061955 0.05383696]]
>>> print(PCAnalyzer.fit(ManySamples).transform(TestSample))
[[-0.27070256 0.41532602 0.45936926 -0.73820121 -0.18160026 -0.13139435
0.28015907 -0.28144421 1.16554587 1.00472104 0.16983399 -0.67157762
-0.3005816 0.54645421 0.09807374]]
Reducing the number of samples (Nsamp) to 500 or fewer, or increasing the number of components (Ncomp) to 20 or more, makes the results reproducible — but that is not an option for me.
Sometimes, reading the documentation helps:
It uses the LAPACK implementation of the full SVD or a randomized truncated SVD by the method of Halko et al. 2009, depending on the shape of the input data and the number of components to extract.
This fixes the problem:
PCAnalyzer = PCA(n_components = Ncomp, svd_solver = 'full')
This happens because of the default solver that sklearn selects. From the docs:
the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.
If you need reproducible results, use a different solver, or set random_state.
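A minimal sketch of both fixes, reusing the array shapes from the question (variable names are mine, not from the original post): pinning random_state makes the randomized solver deterministic across fits, and svd_solver='full' avoids the randomized path entirely.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(501, 30)   # >500 samples, as in the question
test = np.ones((1, 30))

# Fix 1: keep the randomized solver, but seed it so every fit is identical.
pca_seeded = PCA(n_components=15, svd_solver='randomized', random_state=42)
a = pca_seeded.fit(X).transform(test)
b = pca_seeded.fit(X).transform(test)
assert np.allclose(a, b)  # reproducible across refits

# Fix 2: force the exact LAPACK full SVD, which has no random component.
pca_full = PCA(n_components=15, svd_solver='full')
c = pca_full.fit(X).transform(test)
d = pca_full.fit(X).transform(test)
assert np.allclose(c, d)
```

Note that the two solvers will generally not agree with each other exactly (randomized SVD is an approximation); each is only consistent with itself.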