Dynamically selecting principal components from the PCA output

Question

This seems like a trivial question, but I can't work it out!

I took the numeric columns of the iris dataset and standardized them as follows:

newiris<-iris[,1:4]
iris.norm<-data.frame(scale(newiris))
head(iris.norm)
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1   -0.8976739  1.01560199    -1.335752   -1.311052
2   -1.1392005 -0.13153881    -1.335752   -1.311052
3   -1.3807271  0.32731751    -1.392399   -1.311052
4   -1.5014904  0.09788935    -1.279104   -1.311052
5   -1.0184372  1.24503015    -1.335752   -1.311052
6   -0.5353840  1.93331463    -1.165809   -1.048667

# performed PCA now
pccomp <- prcomp(iris.norm )
summary(pccomp)
a <- summary(pccomp)
df<- as.data.frame(a$importance)
df <- t(df)
df
##     Standard deviation Proportion of Variance Cumulative Proportion
## PC1          1.7083611                0.72962               0.72962
## PC2          0.9560494                0.22851               0.95813
## PC3          0.3830886                0.03669               0.99482
## PC4          0.1439265                0.00518               1.00000

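Now convert the row names of df into a column, so that the PC labels (currently the row names) form the first column for further manipulation: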

   library(tibble)
   library(dplyr)
   # no head() here: it would silently cap the table at 6 PCs on wider data
   df <- rownames_to_column(as.data.frame(df), var="PrinComp")
   df
   ##   PrinComp Standard deviation Proportion of Variance Cumulative Proportion
   ## 1      PC1          1.7083611                0.72962               0.72962
   ## 2      PC2          0.9560494                0.22851               0.95813
   ## 3      PC3          0.3830886                0.03669               0.99482
   ## 4      PC4          0.1439265                0.00518               1.00000

 # Now will be selecting only those PCs where the cumulative proportion is say less than 96%
# subsetting
pcs<-as.vector(as.character(df[which(df$`Cumulative Proportion`<0.96),][,1])) # cumulative prop less than 96%
pcs
## [1] "PC1" "PC2"

Now I am creating a static data frame of PCs that contains the scores of the first 2 principal components we obtained from the condition above (cum prop < 0.96):

 x1 <- pccomp$x[,1]
 x2 <- pccomp$x[,2]
 pcdf <- cbind(x1,x2)
 head(pcdf)
##             x1         x2
## [1,] -2.257141 -0.4784238
## [2,] -2.074013  0.6718827
## [3,] -2.356335  0.3407664
## [4,] -2.291707  0.5953999
## [5,] -2.381863 -0.6446757
## [6,] -2.068701 -1.4842053

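My question is: once I know the number of PCs from a condition such as cumulative proportion less than 0.95, how can I create the PC data frame above dynamically?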

Answer 1

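You can run a while loop over the Cumulative Proportion column of df and append the transformed values for as long as the cumulative proportion stays below the desired threshold.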

threshold = 0.96
pcdf = list()
i    = 1
# the bound on i stops the loop at the last row, e.g. when threshold > 1
while(i <= nrow(df) && df$`Cumulative Proportion`[i]<threshold){
    pcdf[[i]] = pccomp$x[,i]
    i = i +1
}
pcdf = as.data.frame(pcdf)

names(pcdf) = paste("x",c(1:ncol(pcdf)),sep="")

Output

> head(pcdf)
         x1         x2
1 -2.257141 -0.4784238
2 -2.074013  0.6718827
3 -2.356335  0.3407664
4 -2.291707  0.5953999
5 -2.381863 -0.6446757
6 -2.068701 -1.4842053

Running the same code with threshold = 0.999 gives

> head(pcdf)
         x1         x2          x3
1 -2.257141 -0.4784238  0.12727962
2 -2.074013  0.6718827  0.23382552
3 -2.356335  0.3407664 -0.04405390
4 -2.291707  0.5953999 -0.09098530
5 -2.381863 -0.6446757 -0.01568565
6 -2.068701 -1.4842053 -0.02687825
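
As a side note, the cumulative proportion never decreases, so the loop could also be replaced by simply counting how many entries lie below the threshold. Here is a minimal sketch, assuming the same iris PCA as in the question. The names cum_prop and k are only illustrative and do not appear in the original code:

```r
# Same PCA as in the question: standardized numeric columns of iris
pccomp <- prcomp(scale(iris[, 1:4]))

# Cumulative proportion of variance, read from the summary table
cum_prop <- summary(pccomp)$importance["Cumulative Proportion", ]

threshold <- 0.96
# The values never decrease, so the number of entries below the
# threshold equals the number of leading PCs to keep (illustrative name: k)
k <- sum(cum_prop < threshold)

# drop = FALSE keeps a matrix even when k is 1 (or 0)
pcdf <- as.data.frame(pccomp$x[, seq_len(k), drop = FALSE])
names(pcdf) <- sprintf("x%d", seq_len(k))
head(pcdf)
```

With threshold = 0.96 this should give the same two columns x1 and x2 as the loop.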

Update

Suppose you already know the number of principal components you want, i. You can use

a <- sapply(X = c(1:i),FUN = function(X) pccomp$x[,X]) # one column of scores per PC

instead of the whole while loop section. So for i = 2 you get

> head(a)
          [,1]       [,2]
[1,] -2.257141 -0.4784238
[2,] -2.074013  0.6718827
[3,] -2.356335  0.3407664
[4,] -2.291707  0.5953999
[5,] -2.381863 -0.6446757
[6,] -2.068701 -1.4842053

where a is your result.
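
If i is known, plain matrix indexing should give the same scores without sapply. This is a small sketch, assuming the iris PCA from the question. The names a_sapply and a_index are only illustrative:

```r
# Same PCA as in the question: standardized numeric columns of iris
pccomp <- prcomp(scale(iris[, 1:4]))
i <- 2

# Column-by-column collection, as in the update above
a_sapply <- sapply(seq_len(i), function(j) pccomp$x[, j])

# Direct indexing; drop = FALSE keeps a matrix even when i is 1
a_index <- pccomp$x[, seq_len(i), drop = FALSE]

# The scores agree; only the column names differ ("PC1", "PC2" vs none)
all.equal(unname(a_index), unname(a_sapply))
colnames(a_index)
```

The indexed version keeps the PC1, PC2 column names, which can be handy when the result is later turned into a data frame.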

Answer 2

Assuming you always want at least one PC, here is a one-liner version:

p <- 0.96
pccomp$x[,1:nrow(df[which(df$`Cumulative Proportion`<p),])] # first two PCs
p <- 0.75
pccomp$x[,1:nrow(df[which(df$`Cumulative Proportion`<p),])] # first PC
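
The one-liner relies on 1:0 still picking the first column when no PC is below p, and it returns a plain vector when only one PC is selected. A sketch of a more explicit variant is below. Note that select_pcs and its min_pcs argument are hypothetical names made up for this illustration, and the cumulative proportion is computed from pccomp$sdev rather than read from df:

```r
# Hypothetical helper (not part of any package): keep the leading PCs whose
# cumulative proportion of variance is strictly below p, but at least min_pcs
select_pcs <- function(pca, p, min_pcs = 1L) {
  var_explained <- pca$sdev^2
  cum_prop <- cumsum(var_explained) / sum(var_explained)
  k <- max(sum(cum_prop < p), min_pcs)
  k <- min(k, ncol(pca$x))            # never ask for more PCs than exist
  pca$x[, seq_len(k), drop = FALSE]   # always a matrix, even for one PC
}

# Same PCA as in the question: standardized numeric columns of iris
pccomp <- prcomp(scale(iris[, 1:4]))

colnames(select_pcs(pccomp, 0.96))  # "PC1" "PC2"
colnames(select_pcs(pccomp, 0.75))  # "PC1"
colnames(select_pcs(pccomp, 0.50))  # "PC1" (fallback to min_pcs)
```

Making the fallback an explicit argument keeps the "at least one PC" assumption visible instead of hiding it in the indexing trick.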

Answer 3

Adding to the great solutions provided above:

pcs<-as.vector(as.character(df[which(df$`Cumulative Proportion`<0.96),][,1])) # cumulative prop less than 96%
 pcs  
 ## [1] "PC1" "PC2"
i=length(pcs) # we get the no of PCs fulfilling the cum prop condition
a <- sapply(X = c(1:i),FUN = function(X) pccomp$x[,X]) # scores of the first i PCs, one column per PC
head(a)

> head(a)
        [,1]       [,2]
[1,] -2.257141 -0.4784238
[2,] -2.074013  0.6718827
[3,] -2.356335  0.3407664
[4,] -2.291707  0.5953999
[5,] -2.381863 -0.6446757
[6,] -2.068701 -1.4842053

Done!
