What exactly are ctr, distance and dimensions in PCA summary in FactoMineR?

Question

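I am trying to run PCA and MCA on my dataset using the FactoMineR package.

After some initial cleaning, I applied the PCA() function to the dataset, and I am now trying to understand the summary of the results.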

library(reshape)
library(gridExtra)
library(gdata)
library(ggplot2)
library(ggbiplot)
library(FactoMineR)

x <- read.csv('cars.csv',stringsAsFactors = FALSE)
y <- na.omit(x)

y <- y[,c(-8,-9)]
s <- y[,-1]
rownames(s) <- make.names(y[,1], unique = TRUE)


res.pca <- PCA(s, quanti.sup = NULL, quali.sup=NULL,scale.unit = TRUE,ncp=2)
summary(res.pca)

This is what summary(res.pca) prints in my console:

Call:
PCA(X = s, scale.unit = TRUE, ncp = 2, quanti.sup = NULL, quali.sup = NULL) 


Eigenvalues
                       Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
Variance               4.788   0.729   0.258   0.125   0.063   0.036
% of var.             79.804  12.144   4.308   2.086   1.053   0.605
Cumulative % of var.  79.804  91.948  96.256  98.342  99.395 100.000

Individuals (the 10 first)
                              Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
chevrolet.chevelle.malibu |  2.516 |  2.326  0.288  0.855 | -0.572  0.115  0.052 |
buick.skylark.320         |  3.307 |  3.206  0.548  0.940 | -0.683  0.163  0.043 |
plymouth.satellite        |  2.915 |  2.670  0.380  0.839 | -0.994  0.346  0.116 |
amc.rebel.sst             |  2.749 |  2.605  0.362  0.898 | -0.623  0.136  0.051 |
ford.torino               |  2.908 |  2.600  0.360  0.799 | -1.094  0.419  0.141 |
ford.galaxie.500          |  4.578 |  4.401  1.032  0.924 | -1.011  0.358  0.049 |
chevrolet.impala          |  5.210 |  4.920  1.289  0.892 | -1.368  0.655  0.069 |
plymouth.fury.iii         |  5.144 |  4.836  1.246  0.884 | -1.537  0.827  0.089 |
pontiac.catalina          |  5.165 |  4.910  1.285  0.904 | -1.041  0.379  0.041 |
amc.ambassador.dpl        |  4.406 |  4.056  0.876  0.847 | -1.668  0.974  0.143 |

Variables
                             Dim.1    ctr   cos2    Dim.2    ctr   cos2  
Cylinders                 |  0.942 18.543  0.888 |  0.127  2.200  0.016 |
Displacement              |  0.971 19.672  0.942 |  0.093  1.177  0.009 |
Horsepower                |  0.950 18.846  0.902 | -0.142  2.761  0.020 |
Weight                    |  0.941 18.499  0.886 |  0.244  8.185  0.060 |
MPG                       | -0.873 15.918  0.762 | -0.209  5.994  0.044 |
Acceleration              | -0.639  8.522  0.408 |  0.762 79.683  0.581 |

I understand most of this summary, but I am not sure what Dist, ctr and Dim mean for the data points, i.e.

 Individuals (the 10 first)
                              Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
chevrolet.chevelle.malibu |  2.516 |  2.326  0.288  0.855 | -0.572  0.115  0.052 |
buick.skylark.320         |  3.307 |  3.206  0.548  0.940 | -0.683  0.163  0.043 |
plymouth.satellite        |  2.915 |  2.670  0.380  0.839 | -0.994  0.346  0.116 |
amc.rebel.sst             |  2.749 |  2.605  0.362  0.898 | -0.623  0.136  0.051 |

Answer 1

Let's look at the individuals summary table, based on the example dataset shipped with the package:

library(FactoMineR)
data(decathlon)
res.pca <- PCA(decathlon, quanti.sup = 11:12, quali.sup=13)

> summary(res.pca)
Call:
PCA(X = decathlon, ncp = 5, quanti.sup = 11:12, quali.sup = 13) 
...
Individuals (the 10 first)
                Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
SEBRLE      |  2.369 |  0.792  0.467  0.112 |  0.772  0.836  0.106 |  0.827  1.187
CLAY        |  3.507 |  1.235  1.137  0.124 |  0.575  0.464  0.027 |  2.141  7.960
KARPOV      |  3.396 |  1.358  1.375  0.160 |  0.484  0.329  0.020 |  1.956  6.644
...

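Dist can be thought of as a summary measure of each individual's measurements across all relevant columns of the dataset. It is calculated as sqrt(rowSums(X^2)), where X is the scaled version of the input dataset (after trimming away the supplementary variables).

If the default options in PCA are in place, i.e. scale.unit = TRUE, row.w = NULL, col.w = NULL, X should be equivalent to scale(as.matrix(<trimmed down dataset>)) * sqrt(n/(n-1)), where n is the number of individuals. The correction factor is needed because scale() divides by the sample standard deviation (n - 1 in the denominator), whereas PCA standardizes with the population standard deviation (n in the denominator). I did not check the non-default options, since I find the intuitive interpretation more important here than the detailed calculation.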

# verify the calculated values against summary table's Dist values
> X <- scale(as.matrix(decathlon[,1:10])) * sqrt(nrow(decathlon)/(nrow(decathlon) - 1))
> sqrt(rowSums(X^2))
     SEBRLE        CLAY      KARPOV     BERNARD      YURKOV     WARNERS   ZSIVOCZKY 
   2.368839    3.507004    3.396399    2.762607    3.017906    2.427873    2.563128 
...
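
As a cross-check, the manual distances can also be compared with the values stored in res.pca$ind$dist. This is only a sketch under the default options (scale.unit = TRUE, uniform weights): it builds X by hand using the population standard deviation, which is what the sqrt(n/(n-1)) correction amounts to. The helper name pop_sd is purely illustrative and not part of FactoMineR.

```r
library(FactoMineR)
data(decathlon)

# Refit without drawing the plots; columns 11:13 are supplementary
res.pca <- PCA(decathlon, quanti.sup = 11:12, quali.sup = 13, graph = FALSE)

# Keep the ten active variables only
M <- as.matrix(decathlon[, 1:10])

# Population standard deviation (divides by n, not n - 1); illustrative helper
pop_sd <- function(v) sqrt(mean((v - mean(v))^2))

# Centre each column, then divide it by its population SD
X <- sweep(M, 2, colMeans(M), "-")
X <- sweep(X, 2, apply(M, 2, pop_sd), "/")

# Distance of each individual from the origin of the standardized space
dist_manual <- sqrt(rowSums(X^2))

# Expected to be TRUE if the assumptions above hold
all.equal(unname(dist_manual), unname(res.pca$ind$dist))
```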

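Dim.X is each individual's coordinate on principal component X, i.e. the projection onto that component of the individual's position relative to the origin in the multidimensional space. To visualize this, use plot(res.pca, choix = "ind") for the individuals factor map, adjust the xlim / ylim / axes arguments to zoom in on any specific individual, and compare against the table values. See ?plot.PCA for more arguments.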

# plot the individuals factor map on the first two principal components
> plot(res.pca, axes = c(1, 2), choix = "ind")

# zoom in to check Sebrle's, Clay's and Karpov's coordinates
> plot(res.pca, axes = c(1, 2), choix = "ind", xlim = c(0, 2), ylim = c(0, 1))
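
The coordinates can also be reproduced outside FactoMineR. A minimal sketch, assuming the default options: base R's prcomp() on the ten active variables should give the same scores up to the sqrt(n/(n-1)) rescaling and an arbitrary sign flip per axis, which is why absolute values are compared here.

```r
library(FactoMineR)
data(decathlon)
res.pca <- PCA(decathlon, quanti.sup = 11:12, quali.sup = 13, graph = FALSE)

n <- nrow(decathlon)

# prcomp() standardizes with the sample SD, so rescale to the population SD
pc <- prcomp(decathlon[, 1:10], center = TRUE, scale. = TRUE)
scores <- pc$x[, 1:5] * sqrt(n / (n - 1))

# Signs of principal axes are arbitrary, so compare absolute values
all.equal(abs(unname(scores)), abs(unname(res.pca$ind$coord)))

# With uniform weights, the mean squared coordinate on an axis is its eigenvalue
colMeans(res.pca$ind$coord^2)
res.pca$eig[1:5, 1]
```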

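ctr is each individual's contribution to a given principal component, expressed as a percentage. You can get the full list of contributions from res.pca$ind$contrib. Each column sums to 100 (%).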

# view each individual's contribution to each principal component
> head(res.pca$ind$contrib)
             Dim.1     Dim.2    Dim.3      Dim.4      Dim.5
SEBRLE  0.46715109 0.8359506 1.186888  3.1842186  1.7811617
CLAY    1.13695340 0.4635341 7.959744  0.2905893 13.8872052
KARPOV  1.37515734 0.3289363 6.643820  7.9543342  2.2523610
BERNARD 0.27693912 1.0740657 1.374952 11.3801552  0.4658144
YURKOV  0.25595504 6.3757577 2.605847  1.7611939  5.5775065
WARNERS 0.09494738 3.9862179 1.020117  0.8014610  3.5736432

# verify that each principal component's contributions sum to 100%
> colSums(res.pca$ind$contrib)
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 
  100   100   100   100   100
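
For reference, with uniform row weights the contribution appears to reduce to 100 * Dim.X^2 / (n * eigenvalue of component X). The sketch below checks that formula against res.pca$ind$contrib; it assumes the default options and has not been checked against non-uniform row.w.

```r
library(FactoMineR)
data(decathlon)
res.pca <- PCA(decathlon, quanti.sup = 11:12, quali.sup = 13, graph = FALSE)

n   <- nrow(decathlon)
eig <- res.pca$eig[1:5, 1]   # eigenvalues of the five retained components

# Each squared coordinate as a share of n * eigenvalue, in percent
contrib_manual <- 100 * sweep(res.pca$ind$coord^2, 2, n * eig, "/")

# Expected to be TRUE under the default (uniform) weights
all.equal(unname(contrib_manual), unname(res.pca$ind$contrib))
```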

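cos2 is the squared cosine of the individual on each principal component, calculated as (Dim.X/Dist)^2. The closer it is to 1 for a given principal component, the better that component alone captures the individual's characteristics.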

# verify the calculated values against summary table's cos2 values
> head((res.pca$ind$coord/res.pca$ind$dist)^2)
             Dim.1      Dim.2      Dim.3      Dim.4      Dim.5
SEBRLE  0.11167888 0.10610262 0.12183534 0.24588345 0.08911755
CLAY    0.12400941 0.02684265 0.37278712 0.01023775 0.31701007
KARPOV  0.15991886 0.02030911 0.33175306 0.29878849 0.05481905
BERNARD 0.04867778 0.10023262 0.10377289 0.64611132 0.01713585
YURKOV  0.03769960 0.49858212 0.16480554 0.08379015 0.17193305
WARNERS 0.02160805 0.48164324 0.09968563 0.05891525 0.17021193
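
One consequence worth checking: since the squared coordinates over all components add up to Dist^2, an individual's cos2 values should sum to 1 once every component is kept. The sketch below refits with ncp = 10 (one component per active variable) to show this. The name res.full is illustrative.

```r
library(FactoMineR)
data(decathlon)

# Keep all ten components instead of the default five
res.full <- PCA(decathlon, ncp = 10, quanti.sup = 11:12, quali.sup = 13,
                graph = FALSE)

# Per individual, cos2 across all components should sum to 1
range(rowSums(res.full$ind$cos2))

# Quality of representation on the first factor plane (Dim.1 + Dim.2)
head(rowSums(res.full$ind$cos2[, 1:2]))
```

An individual with a low Dim.1 + Dim.2 total is poorly represented on the default factor map, so its position there should be read with caution.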

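For variables, the interpretation of "Dim.X" / "ctr" / "cos2" is similar. The exact calculations are more involved, especially when you specify non-uniform weights for rows / columns. You can look at the source code of PCA for the details.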
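
For the default case (scale.unit = TRUE, uniform weights) the variable columns can be checked in the same way. This is a sketch under those assumptions only: Dim.X should be the correlation between the variable and component X, cos2 its square, and ctr equal to 100 * cos2 / eigenvalue.

```r
library(FactoMineR)
data(decathlon)
res.pca <- PCA(decathlon, quanti.sup = 11:12, quali.sup = 13, graph = FALSE)

eig <- res.pca$eig[1:5, 1]   # eigenvalues of the five retained components

# Dim.X: correlation of each active variable with each component
cor_manual <- cor(decathlon[, 1:10], res.pca$ind$coord)
all.equal(unname(cor_manual), unname(res.pca$var$coord))

# cos2: squared coordinate
all.equal(res.pca$var$coord^2, res.pca$var$cos2)

# ctr: cos2 as a percentage of the component's eigenvalue
contrib_manual <- 100 * sweep(res.pca$var$cos2, 2, eig, "/")
all.equal(unname(contrib_manual), unname(res.pca$var$contrib))
```

The same relations can be seen in the question's table: for Cylinders on Dim.1, 0.942^2 is about 0.888, and 100 * 0.888 / 4.788 is about 18.5.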
