Why do I get different PC values when implementing nsprcomp on the same data set?
nonpca <- nsprcomp(data, ncomp = i, nneg = TRUE, scale. = TRUE)
names(nonpca)
SumPC <- rowSums(nonpca$rotation)
w <- SumPC / sum(SumPC)
My data is a csv file and it is the same file every time, yet each time I run the code I get different PC values. ncomp = i runs through a for loop over 1:9, as in the sketch below.
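For reference, a minimal sketch of the loop being described (the csv file name and the data object are placeholders, not from the original post):

library(nsprcomp)
data <- read.csv("my_data.csv")                     # placeholder: same file every run
for (i in 1:9) {
  nonpca <- nsprcomp(data, ncomp = i, nneg = TRUE, scale. = TRUE)
  SumPC  <- rowSums(nonpca$rotation)                # per-variable sum of loadings
  w      <- SumPC / sum(SumPC)                      # normalise to weights
}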
If you check the help page for nsprcomp, it says:
This package implements two non-negative and/or sparse PCA algorithms
which are rooted in expectation-maximization (EM) for a probabilistic
generative model of PCA (Sigg and Buhmann, 2008). The nsprcomp
algorithm can also be described as applying a soft-thresholding
operator to the well-known power iteration method for computing
eigenvalues.
If you are used to computing PCA with prcomp or princomp, they use the SVD or the eigendecomposition of the covariance matrix, as explained in this post, so they are deterministic and return the same values every time.
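To see the contrast, a quick sketch using the built-in mtcars data (not your csv) showing that prcomp returns the identical result on every run:

p1 <- prcomp(mtcars, scale. = TRUE)$x
p2 <- prcomp(mtcars, scale. = TRUE)$x
identical(p1, p2)                                   # TRUE: SVD is deterministic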
There are also other approaches that rely on EM to compute the PCs; you can have a look at this paper as well. The solution is not deterministic, but that is a good trade-off when your data set is large, as explained in the help page of nsprcomp:
The nsprcomp algorithm is suitable for large and high-dimensional data
sets, because it entirely avoids computing the covariance matrix. It
is therefore especially suited to the case where the number of
features exceeds the number of observations.
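As an illustration of the case where features exceed observations (simulated data, purely for demonstration; k is the documented cardinality/sparsity argument of nsprcomp):

set.seed(1)
X <- matrix(rnorm(50 * 500), nrow = 50)             # 50 observations, 500 features
fit <- nsprcomp(X, ncomp = 3, k = 20)               # at most 20 non-zero loadings per PC
dim(fit$rotation)                                   # 500 x 3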
You will see that most of your PCs are very close, perhaps with the sign flipped. If you want reproducible PCs, you can set a seed:
set.seed(111)
head(nsprcomp(mtcars)$x[,1:2])
                          PC1        PC2
Mazda RX4          -79.595868  -2.152978
Mazda RX4 Wag      -79.598008  -2.168224
Datsun 710        -133.895409   5.022698
Hornet 4 Drive       8.528272 -44.983404
Hornet Sportabout  128.694362 -30.783879
Valiant            -23.211004 -35.112577
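One rough way to check whether two runs differ only by a sign flip (a sketch, not from the original answer; with EM the values can also differ slightly beyond the sign):

run1 <- nsprcomp(mtcars, ncomp = 2)$x
run2 <- nsprcomp(mtcars, ncomp = 2)$x
sapply(1:2, function(j)
  isTRUE(all.equal(run1[, j],  run2[, j])) ||
  isTRUE(all.equal(run1[, j], -run2[, j])))         # TRUE where only the sign differs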
But be sure to check the argument nrestart to make sure you are not ending up in a poor local maximum:
nrestart: the number of random restarts for computing the principal
component via expectation-maximization (EM) iterations. The
solution achieving maximum standard deviation over all random
restarts is kept. A value greater than one can help to avoid
poor local maxima.
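For example, combining a seed with several restarts (nrestart is a documented nsprcomp argument; the values here are only illustrative):

set.seed(111)
nonpca <- nsprcomp(mtcars, ncomp = 2, nneg = TRUE, scale. = TRUE,
                   nrestart = 10)                   # keep the best of 10 EM restarts
head(nonpca$x)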