Extract PCn of a PCA Analysis

Question

I have a data frame as follows.....

df <- data.frame(variableA, variableB, variableC, variableD, variableE)

# keep the fitted object so it can be summarised
pca <- prcomp(scale(df))
summary(pca)

which gives the following result:

                          PC1    PC2    PC3     PC4     PC5
Cumulative Proportion  0.5127 0.7222 0.8938 0.96075 1.00000

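Is there a way to add the PCs to a new data frame until the cumulative proportion reaches 85%?

I have many data frames I'd like to run this on, and they vary in size, but I'd like to use 85% as an arbitrary cutoff.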

Answer 1

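It's hard to say without more detail, but you will probably run into trouble because the length of the result vector varies from analysis to analysis. For example, one analysis may need 4 principal components to meet your criterion and another only 3. Data frames, on the other hand, are rectangular: every row must have the same length and every column must have the same length, so you cannot build a data frame in which one row has 3 columns and another has 4.

A couple of simple options for vectors like yours: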

# your vector of cumulative proportions, one value per PC
x1 <- summary(prcomp(scale(df)))$importance["Cumulative Proportion", ]

1) Make a data frame sized for the largest number of components, fill it with NA, then replace the NAs where appropriate.

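# storage df: one row per analysis, one column per possible component
outDF <- data.frame(matrix(NA_real_, nrow = 2, ncol = length(x1)))
# store the first analysis; components at or above the cutoff stay NA
outDF[1, x1 < 0.85] <- x1[x1 < 0.85]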

2) Store the results as a list, since a list does not need to be rectangular.

# storage list
outList <- list()
# store
outList[[1]] <- x1[x1 < 0.85]
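As a rough sketch of how the two options combine across several analyses: the results are collected in a list first, then padded with NA into a rectangular data frame. The built-in `USArrests` and `mtcars` data sets stand in for your own data frames, and the names `dfs`, `cum_list` and `padded` are made up for this illustration.

```r
# Stand-ins for "many data frames of different sizes" (assumption for this sketch)
dfs <- list(arrests = USArrests, cars = mtcars)

# Option 2: one vector per analysis, stored in a list (lengths may differ)
cum_list <- lapply(dfs, function(d) {
  cum <- summary(prcomp(d, scale. = TRUE))$importance["Cumulative Proportion", ]
  cum[cum < 0.85]
})

# Option 1: pad every vector with NA up to the longest one, then bind into rows
max_len <- max(lengths(cum_list))
padded <- lapply(cum_list, function(v) {
  c(unname(v), rep(NA_real_, max_len - length(v)))
})
outDF <- as.data.frame(do.call(rbind, padded))
names(outDF) <- paste0("PC", seq_len(max_len))
outDF
```

The list is usually easier to work with; the padded data frame is mainly useful if you want a single table for reporting.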

Answer 2

Here is one approach to identifying the components that explain up to 85% of the variance, using the spam data from the kernlab package.

library(kernlab)
data(spam)
# log transform independent variables, ensuring all values above 0
princomp <- prcomp(log10(spam[,-58]+1))
stats <- summary(princomp)
# extract variable importance and list items explaining up to 85% variance
importance <- stats$importance[3,]
importance[importance <= 0.85]

...and the output:

> importance[importance <= 0.85]
    PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9    PC10    PC11 
0.49761 0.58021 0.63101 0.67502 0.70835 0.73188 0.75100 0.76643 0.78044 0.79368 0.80648 
   PC12    PC13    PC14 
0.81886 0.83046 0.84129 
>

We can obtain the factor scores for the first 14 components and save them as a data frame as follows.

resultNames <- names(importance[importance <= 0.85])
# return factor scores 
x_result <- as.data.frame(princomp$x[,resultNames])
head(x_result)

...and the output:

> head(x_result)
         PC1         PC2          PC3          PC4          PC5         PC6         PC7
1  0.7364988  0.19181730  0.041818854 -0.009236399  0.001232911  0.03723833 -0.01144332
2  1.3478167  0.22953561 -0.149444409  0.091569400 -0.148434128 -0.01923707 -0.07119210
3  2.0489632 -0.02668038  0.222492079 -0.107120738 -0.092968198 -0.06400683 -0.07078830
4  0.4912016  0.20921288 -0.002072148  0.015524007 -0.002347262 -0.14519336 -0.09238828
5  0.4911676  0.20916725 -0.002122664  0.015467369 -0.002373622 -0.14517812 -0.09243136
6 -0.2337956 -0.10508875  0.187831101 -0.335491660  0.099445713  0.09516875  0.11234080
          PC8          PC9        PC10        PC11        PC12         PC13        PC14
1 -0.08745771  0.079650230 -0.14450436  0.15945517 -0.06490913 -0.042909658  0.05739735
2  0.00233124 -0.091471125 -0.10304536  0.06973190  0.09373344  0.003069536  0.02892939
3 -0.10888375  0.227437609 -0.07419313  0.08217271 -0.12488575  0.150950134  0.05180459
4 -0.15862241  0.003044418  0.01609690  0.01720151  0.02313224  0.142176889 -0.04013102
5 -0.15848785  0.002944493  0.01606874  0.01725410  0.02304496  0.142527110 -0.04007788
6 -0.13790588  0.197294502  0.07851300 -0.08131269 -0.02091459  0.246810914 -0.01869192
>

To merge the scores with the original data frame, we can use cbind().

mergedData <- cbind(spam,x_result)
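One detail about the `importance <= 0.85` filter used above: it keeps only the components whose cumulative proportion is still at or below the cutoff, so the retained set stops short of 85% (84.1% in the spam output). If you want the cutoff to actually be reached, you also need the next component. The sketch below shows both readings on the built-in `USArrests` data, which is a stand-in chosen only so the example runs without extra packages; the variable names are illustrative.

```r
pca <- prcomp(USArrests, scale. = TRUE)

# Cumulative proportion of variance, computed from the component standard deviations
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
names(cum_var) <- colnames(pca$x)

# Reading 1: components that are still at or below the cutoff
below <- names(cum_var)[cum_var <= 0.85]

# Reading 2: all components up to the first one that reaches the cutoff
n_reach <- which(cum_var >= 0.85)[1]
reach <- names(cum_var)[seq_len(n_reach)]

below
reach
```

Which reading is right depends on whether 85% is meant as a ceiling or as a minimum to be explained.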

Answer 3

You can extract the threshold from the summary, for example like this:

getMinPCs <- function(mat, thresh=.85){
    return(which(summary(prcomp(scale(mat)))$importance["Cumulative Proportion",] >= thresh)[1])
}

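(Although, obviously, you would probably want to run prcomp only once and do some additional work on the subset inside the function.)

It isn't clear to me what you want in the new data.frame. Perhaps the rotation matrix, which you could subset and return, or you could subset and return the whole list: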

getMinPCrotations <- function(mat, thresh=.85){
    res_pca <- prcomp(scale(mat))
    nPCs <- which(summary(res_pca)$importance["Cumulative Proportion",] >= thresh)[1]
    # sdev, rotation and x have one entry/column per component, so subset those;
    # center and scale have one entry per original variable, so keep them whole
    sub <- list(sdev=res_pca$sdev[seq_len(nPCs)], 
                rotation=res_pca$rotation[, seq_len(nPCs), drop=FALSE],
                center=res_pca$center,
                scale=res_pca$scale,
                x=res_pca$x[, seq_len(nPCs), drop=FALSE]
    )
    # class(sub) <- "prcomp"
    return(sub)
}
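If what you want in the end is one data frame of scores per input data frame, the same idea can be wrapped in a small helper and applied over a list. This is only a sketch: `pc_scores_up_to` is a hypothetical name, and `USArrests` and `mtcars` are built-in data sets standing in for your own numeric data frames.

```r
# Hypothetical helper: scores for the components needed to reach `thresh`
pc_scores_up_to <- function(df, thresh = 0.85) {
  pca <- prcomp(df, scale. = TRUE)            # run the PCA only once
  cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
  n_pcs <- which(cum_var >= thresh)[1]        # first PC at or past the cutoff
  # drop = FALSE keeps a matrix even when a single component is enough
  as.data.frame(pca$x[, seq_len(n_pcs), drop = FALSE])
}

# Stand-ins for data frames of different sizes (assumption for this sketch)
dfs <- list(arrests = USArrests, cars = mtcars)
score_list <- lapply(dfs, pc_scores_up_to)

# Number of components kept for each input
sapply(score_list, ncol)
```

Returning a list of data frames sidesteps the rectangularity problem, because each element can have its own number of PC columns.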
