Principal Component Regression? What is the dependent variable?
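I am performing PCA to try to clean up the coefficients of highly correlated variables. I have a very large data set, but I will try to simplify here. I have the formula: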
lm(y~x1+x2+x3...x55) -> reg_linear_model
The problem is that x1:x4 are all very highly correlated, so some of them come out with negative coefficients. When I run the PCA I get the list of components and their values. I would like to test which components to use, but the dependent variable Y is three years of data broken up by week, so it is y1, y2, y3, y4, ..., y156 (156 weeks). The issue is that I cannot regress y on the components because the lengths are different. Do I need to transform Y in some way so that it has the same number of rows as the components? It is very hard to find an answer for this. A lot of PCR explanations just say to regress y on the components, but Y is not in the PCA.
Thanks for any help with this!
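This is how it is usually done. We can use the iris dataset, with Sepal.Length as the dependent variable and the other variables as the independent variables.
First, note that Petal.Width and Petal.Length are strongly correlated: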
cor(iris[,2:4])
             Sepal.Width Petal.Length Petal.Width
Sepal.Width    1.0000000   -0.4284401  -0.3661259
Petal.Length  -0.4284401    1.0000000   0.9628654
Petal.Width   -0.3661259    0.9628654   1.0000000
As you said, if we run the regression, one of the coefficients turns negative:
summary(lm(Sepal.Length ~ .,data=iris[,1:4]))
Call:
lm(formula = Sepal.Length ~ ., data = iris[, 1:4])
Residuals:
     Min       1Q   Median       3Q      Max
-0.82816 -0.21989  0.01875  0.19709  0.84570

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.85600    0.25078   7.401 9.85e-12 ***
Sepal.Width   0.65084    0.06665   9.765  < 2e-16 ***
Petal.Length  0.70913    0.05672  12.502  < 2e-16 ***
Petal.Width  -0.55648    0.12755  -4.363 2.41e-05 ***
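We then run a PCA. The principal component scores are stored under $x, with one row per observation, so they have the same number of rows as the dependent variable: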
pca=prcomp(iris[,2:4])
cor(iris[,"Sepal.Length"],pca$x)
           PC1       PC2       PC3
[1,] 0.8619141 -0.279587 0.1937703
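On the length problem raised in the question, a quick check may help. The sketch below rebuilds the same `pca` object so that it runs on its own, then compares the dimensions of the pieces `prcomp` returns. The guess that the mismatch comes from using the loadings (`pca$rotation`) or `pca$sdev` instead of the scores (`pca$x`) is an assumption, since the question does not show the PCA call.

```r
# Rebuild the PCA from the answer so this snippet is self-contained
pca <- prcomp(iris[, 2:4])
y   <- iris[, "Sepal.Length"]

# Scores: one row per observation, one column per component
nrow(pca$x)        # 150
length(y)          # 150

# Loadings: one row per ORIGINAL VARIABLE, one column per component
dim(pca$rotation)  # 3 3

# Standard deviations: one value per component
length(pca$sdev)   # 3

# Only the scores line up with y row by row
stopifnot(nrow(pca$x) == length(y))
```

With 55 predictors and 156 weekly observations, `pca$x` would have 156 rows and `pca$rotation` would have 55, so regressing y on the loadings would produce exactly this kind of length error. No transformation of Y should be needed as long as the PCA is run on the same 156 rows of predictors.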
data = data.frame(
Sepal.Length=iris[,"Sepal.Length"],
pca$x)
summary(lm(Sepal.Length ~ .,data=data))
Call:
lm(formula = Sepal.Length ~ ., data = data)
Residuals:
     Min       1Q   Median       3Q      Max
-0.82816 -0.21989  0.01875  0.19709  0.84570

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.84333    0.02568 227.519  < 2e-16 ***
PC1          0.37123    0.01340  27.697  < 2e-16 ***
PC2         -0.58457    0.06506  -8.984 1.22e-15 ***
PC3          0.86983    0.13969   6.227 4.80e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
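Note that the residuals here are identical to those of the original regression. When all components are kept, the PC regression is just a rotation of the same fit, and the negative coefficient reappears once the PC coefficients are mapped back to the original variables. The sketch below illustrates this under the default `prcomp` settings (centered, unscaled). The names `ols`, `pcr_full`, `pcr_2`, `beta_orig` and `beta_2` are introduced here for illustration only.

```r
X   <- iris[, 2:4]
y   <- iris[, "Sepal.Length"]
pca <- prcomp(X)

# Ordinary regression on the original variables
ols <- lm(y ~ ., data = X)

# Regression on all three principal component scores
pcr_full <- lm(y ~ pca$x)

# Map PC coefficients back to the original variables:
# scores = centered X %*% rotation, so beta_orig = rotation %*% beta_pc
beta_orig <- drop(pca$rotation %*% coef(pcr_full)[-1])

# Same slopes and same residuals as the ordinary regression
all.equal(unname(beta_orig), unname(coef(ols)[-1]))  # TRUE
all.equal(residuals(ols), residuals(pcr_full))       # TRUE

# Dropping a component is what changes the fit: keep PC1 and PC2 only
pcr_2  <- lm(y ~ pca$x[, 1:2])
beta_2 <- drop(pca$rotation[, 1:2] %*% coef(pcr_2)[-1])
beta_2
```

So PCA only helps with the correlated predictors if some components are left out. Which ones to leave out is a modelling choice.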
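The principal components are uncorrelated, so you can use them in a regression. If you have many variables, you can also select components by their correlation with the target variable, as above.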
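A minimal sketch of that selection step, again on iris. The cutoff of 0.25 is an arbitrary value chosen for illustration, not a recommendation, and `r`, `keep` and `pcr_data` are hypothetical names.

```r
pca <- prcomp(iris[, 2:4])
y   <- iris[, "Sepal.Length"]

# The scores are uncorrelated: off-diagonal correlations are numerically zero
round(cor(pca$x), 10)

# Correlation of each component with the response, as a named vector
r <- cor(y, pca$x)[1, ]

# Keep components whose absolute correlation exceeds the (arbitrary) cutoff
cutoff <- 0.25
keep   <- names(r)[abs(r) > cutoff]
keep   # "PC1" "PC2"

# Regress the response on the selected components only
pcr_data <- data.frame(Sepal.Length = y, pca$x[, keep, drop = FALSE])
fit <- lm(Sepal.Length ~ ., data = pcr_data)
summary(fit)
```

For a larger problem such as the 55-variable one in the question, it may be worth choosing the number of components by cross-validation rather than by a fixed correlation cutoff.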