Principal Component Regression? What is the dependent variable?

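I am running a PCA to try to clean up the coefficients of highly correlated variables. I have a very large dataset, but I will try to simplify it here. I have the formula: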

lm(y~x1+x2+x3...x55) -> reg_linear_model

The issue I am having is that x1:x4 are all very highly correlated, and because of this some of them come out negative. When I run the PCA I get the list of components and their values. I would like to test which components to use, but the dependent variable Y is three years of data broken up by week, so it is y1, y2, y3, y4, ..., y156 (156 weeks). The problem is that I cannot regress y on the components because the lengths are different. Do I need to transform Y in some way so that it has the same number of rows as the components? It is very hard to find an answer for this. A lot of PCR explanations just say to regress y on the components, but Y is not in the PCA.

Thanks for any help with this!

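This is how you would usually do it. We can use the iris dataset, with Sepal.Length as the dependent variable and the other variables as the independent variables.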

First, Petal.Width and Petal.Length are strongly correlated:

cor(iris[,2:4])
             Sepal.Width Petal.Length Petal.Width
Sepal.Width    1.0000000   -0.4284401  -0.3661259
Petal.Length  -0.4284401    1.0000000   0.9628654
Petal.Width   -0.3661259    0.9628654   1.0000000

Like you said, if we run the regression, we see that one of the coefficients becomes negative:

summary(lm(Sepal.Length ~ .,data=iris[,1:4]))

Call:
lm(formula = Sepal.Length ~ ., data = iris[, 1:4])

Residuals:
     Min       1Q   Median       3Q      Max 
-0.82816 -0.21989  0.01875  0.19709  0.84570 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.85600    0.25078   7.401 9.85e-12 ***
Sepal.Width   0.65084    0.06665   9.765  < 2e-16 ***
Petal.Length  0.70913    0.05672  12.502  < 2e-16 ***
Petal.Width  -0.55648    0.12755  -4.363 2.41e-05 ***

We run a PCA and get the principal components, which are stored under $x:

pca=prcomp(iris[,2:4])
cor(iris[,"Sepal.Length"],pca$x)
           PC1       PC2       PC3
[1,] 0.8619141 -0.279587 0.1937703
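
On the length mismatch raised in the question: prcomp returns one row of scores per row of its input, so pca$x lines up with the response as long as the PCA is run on the predictor matrix with observations in rows. A quick check on iris, as a sketch (with 156 weekly observations you would expect 156 rows in pca$x; if you see 55 rows instead, one possible cause, which is only a guess about your setup, is that the matrix was transposed before the PCA):

```r
# Scores have one row per observation, so they align with the response
pca <- prcomp(iris[, 2:4])

n_obs <- nrow(iris)        # number of observations in the data
dim_scores <- dim(pca$x)   # rows = observations, columns = components
print(dim_scores)

# The response and the scores must have the same number of rows
stopifnot(dim_scores[1] == n_obs,
          dim_scores[1] == length(iris$Sepal.Length))
```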

data = data.frame(
Sepal.Length=iris[,"Sepal.Length"],
pca$x)

summary(lm(Sepal.Length ~ .,data=data))

Call:
lm(formula = Sepal.Length ~ ., data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.82816 -0.21989  0.01875  0.19709  0.84570 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.84333    0.02568 227.519  < 2e-16 ***
PC1          0.37123    0.01340  27.697  < 2e-16 ***
PC2         -0.58457    0.06506  -8.984 1.22e-15 ***
PC3          0.86983    0.13969   6.227 4.80e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
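
The residuals are identical to the first fit because using all three components is just a rotation of the original predictors. A sketch of how the PC coefficients map back to the original variables, assuming the default prcomp settings used above (centered, not scaled); the names fit_pc, fit_raw, beta and intercept are only illustrative:

```r
pca <- prcomp(iris[, 2:4])
scores <- data.frame(Sepal.Length = iris$Sepal.Length, pca$x)
fit_pc <- lm(Sepal.Length ~ ., data = scores)

# Scores are (X - center) %*% rotation, so the slopes on the original
# scale are rotation %*% (PC coefficients)
gamma <- coef(fit_pc)[-1]
beta <- drop(pca$rotation %*% gamma)

# The intercept absorbs the centering applied by prcomp
intercept <- unname(coef(fit_pc)[1] - sum(pca$center * beta))

# Compare with the direct regression on the original predictors
fit_raw <- lm(Sepal.Length ~ ., data = iris[, 1:4])
print(rbind(from_pcs = c("(Intercept)" = intercept, beta),
            direct   = coef(fit_raw)))
```

With fewer than all components the two rows would no longer agree, which is the point of PCR: dropping components trades some fit for more stable coefficients.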

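The principal components are uncorrelated with each other, so you can use them in the regression. If you have many variables, you can also select components by their correlation with the target variable, as above.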
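
The selection step can be sketched as follows. The cutoff k = 2 is an arbitrary choice for illustration, and the names r, keep and fit_reduced are hypothetical; since the selection uses the response, it is worth checking the choice on held-out data.

```r
pca <- prcomp(iris[, 2:4])
y <- iris$Sepal.Length

# Absolute correlation of each component with the response
r <- abs(drop(cor(y, pca$x)))

# Keep the k most correlated components (k = 2 is arbitrary here)
k <- 2
keep <- names(sort(r, decreasing = TRUE))[seq_len(k)]

# Regress the response on the selected components only
scores <- data.frame(y = y, pca$x[, keep, drop = FALSE])
fit_reduced <- lm(y ~ ., data = scores)

print(keep)
print(summary(fit_reduced)$r.squared)
```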