如何使用相关或协方差矩阵而不是使用 R 的数据框来获得回归系数和模型拟合?
How to get regression coefficients and model fits using correlation or covariance matrix instead of data frame using R?
我希望能够通过提供相关性或协方差矩阵而不是 data.frame 来从多元线性回归中回归系数。我知道你丢失了一些与确定截距等相关的信息,但即使是相关矩阵也应该足以获得标准化系数和方差估计解释。
例如,如果您有以下数据
# get some data
library(MASS)
data("Cars93")
x <- Cars93[,c("EngineSize", "Horsepower", "RPM")]
您可以 运行 回归如下:
lm(EngineSize ~ Horsepower + RPM, x)
但是如果你没有数据而是相关矩阵或协方差矩阵会怎么样:
corx <- cor(x)
covx <- cov(x)
- R 中的哪个函数允许您运行 基于相关或协方差矩阵的回归?理想情况下,它应该类似于
lm
,这样您就可以轻松获得诸如 r 平方、调整后的 r 平方、预测值等信息。据推测,对于其中一些事情,您还需要提供样本量和可能的均值向量。不过这样也行。
即,类似于:
lm(EngineSize ~ Horsepower + RPM, cov = covx) # obviously this doesn't work
请注意,Stats.SE 上的这个答案提供了为什么可能的理论解释,并提供了一些用于计算系数的自定义 R 代码的示例?
记住:
$beta=(X'X)^-1。 X'Y$
尝试:
(bs<-solve(covx[-1,-1],covx[-1,1]))
Horsepower RPM
0.01491908 -0.00100051
对于截距,您需要变量的平均值。
例如:
ms=colMeans(x)
(b0=ms[1]-bs%*%ms[-1])
[,1]
[1,] 5.805301
使用 lavaan,您可以执行以下操作:
library(MASS)
data("Cars93")
x <- Cars93[,c("EngineSize", "Horsepower", "RPM")]
lav.input<- cov(x)
lav.mean <- colMeans(x)
library(lavaan)
m1 <- 'EngineSize ~ Horsepower+RPM'
fit <- sem(m1, sample.cov = lav.input,sample.nobs = nrow(x), meanstructure = TRUE, sample.mean = lav.mean)
summary(fit, standardize=TRUE)
结果是:
Regressions:
Estimate Std.Err Z-value P(>|z|) Std.lv Std.all
EngineSize ~
Horsepower 0.015 0.001 19.889 0.000 0.015 0.753
RPM -0.001 0.000 -15.197 0.000 -0.001 -0.576
Intercepts:
Estimate Std.Err Z-value P(>|z|) Std.lv Std.all
EngineSize 5.805 0.362 16.022 0.000 5.805 5.627
Variances:
Estimate Std.Err Z-value P(>|z|) Std.lv Std.all
EngineSize 0.142 0.021 6.819 0.000 0.142 0.133
我认为 lavaan 听起来是个不错的选择,我注意到@Philip 为我指出了正确的方向。我只是在这里提到如何使用 lavaan(特别是 r 平方和调整后的 r 平方)提取一些您可能需要的额外模型特征。
最新版本见:https://gist.github.com/jeromyanglim/9f766e030966eaa1241f10bd7d6e2812
:
# get data
library(MASS)
data("Cars93")
x <- Cars93[,c("EngineSize", "Horsepower", "RPM")]
# define sample statistics
covx <- cov(x)
n <- nrow(x)
means <- sapply(x, mean) # this is optional
fit <- lavaan::sem("EngineSize ~ Horsepower + RPM", sample.cov = covx,
sample.mean = means,
sample.nobs = n)
coef(fit) # unstandardised coefficients
standardizedSolution(fit) # Standardised coefficients
inspect(fit, 'r2') # r-squared
# adjusted r-squared
adjr2 <- function(rsquared, n, p) 1 - (1-rsquared) * ((n-1)/(n-p-1))
# update p below with number of predictor variables
adjr2(inspect(fit, 'r2'), n = inspect(fit, "nobs"), p = 2)
自定义函数
这里有一些功能可以提供 lavaan 的合身度以及一些相关的功能(即,基本上包装了上述大部分功能)。它假设您没有办法的情况。
covlm <- function(dv, ivs, n, cov) {
# Assumes lavaan package
# library(lavaan)
# dv: charcter vector of length 1 with name of outcome variable
# ivs: character vector of names of predictors
# n: numeric vector of length 1: sample size
# cov: covariance matrix where row and column names
# correspond to dv and ivs
# Return
# list with lavaan model fit
# and various other features of the model
results <- list()
eq <- paste(dv, "~", paste(ivs, collapse = " + "))
results$fit <- lavaan::sem(eq, sample.cov = cov,
sample.nobs = n)
# coefficients
ufit <- parameterestimates(results$fit)
ufit <- ufit[ufit$op == "~", ]
results$coef <- ufit$est
names(results$coef) <- ufit$rhs
sfit <- standardizedsolution(results$fit)
sfit <- sfit[sfit$op == "~", ]
results$standardizedcoef <- sfit$est.std
names(results$standardizedcoef) <- sfit$rhs
# use unclass to not limit r2 to 3 decimals
results$r.squared <- unclass(inspect(results$fit, 'r2')) # r-squared
# adjusted r-squared
adjr2 <- function(rsquared, n, p) 1 - (1-rsquared) * ((n-1)/(n-p-1))
results$adj.r.squared <- adjr2(unclass(inspect(results$fit, 'r2')),
n = n, p = length(ivs))
results
}
例如:
x <- Cars93[,c("EngineSize", "Horsepower", "RPM")]
covlm(dv = "EngineSize", ivs = c("Horsepower", "RPM"),
n = nrow(x), cov = cov(x))
这全部产生:
$fit
lavaan (0.5-20) converged normally after 27 iterations
Number of observations 93
Estimator ML
Minimum Function Test Statistic 0.000
Degrees of freedom 0
Minimum Function Value 0.0000000000000
$coef
Horsepower RPM
0.01491908 -0.00100051
$standardizedcoef
Horsepower RPM
0.7532350 -0.5755326
$r.squared
EngineSize
0.867
$adj.r.squared
EngineSize
0.864
另一种时髦的解决方案是生成一个与原始数据具有相同方差-协方差矩阵的数据集。您可以使用 MASS
包中的 mvrnorm()
来执行此操作。在这个新数据集上使用 lm()
将产生与原始数据集估计值相同的参数估计值和标准误差(截距除外,除非您知道每个变量的均值,否则无法访问截距)。这是一个示例:
#Assuming the variance covariance matrix is called VC
n <- 100 #sample size
nvar <- ncol(VC)
fake.data <- mvrnorm(n, mu = rep(0, nvar), sigma = VC, empirical = TRUE)
lm(Y~., data = fake.data)
我希望能够通过提供相关性或协方差矩阵而不是 data.frame 来从多元线性回归中回归系数。我知道你丢失了一些与确定截距等相关的信息,但即使是相关矩阵也应该足以获得标准化系数和方差估计解释。
例如,如果您有以下数据
# get some data
library(MASS)
data("Cars93")
x <- Cars93[,c("EngineSize", "Horsepower", "RPM")]
您可以 运行 回归如下:
lm(EngineSize ~ Horsepower + RPM, x)
但是如果你没有数据而是相关矩阵或协方差矩阵会怎么样:
corx <- cor(x)
covx <- cov(x)
- R 中的哪个函数允许您运行 基于相关或协方差矩阵的回归?理想情况下,它应该类似于
lm
,这样您就可以轻松获得诸如 r 平方、调整后的 r 平方、预测值等信息。据推测,对于其中一些事情,您还需要提供样本量和可能的均值向量。不过这样也行。
即,类似于:
lm(EngineSize ~ Horsepower + RPM, cov = covx) # obviously this doesn't work
请注意,Stats.SE 上的这个答案提供了为什么可能的理论解释,并提供了一些用于计算系数的自定义 R 代码的示例?
记住:
$beta=(X'X)^-1。 X'Y$
尝试:
(bs<-solve(covx[-1,-1],covx[-1,1]))
Horsepower RPM
0.01491908 -0.00100051
对于截距,您需要变量的平均值。 例如:
ms=colMeans(x)
(b0=ms[1]-bs%*%ms[-1])
[,1]
[1,] 5.805301
使用 lavaan,您可以执行以下操作:
library(MASS)
data("Cars93")
x <- Cars93[,c("EngineSize", "Horsepower", "RPM")]
lav.input<- cov(x)
lav.mean <- colMeans(x)
library(lavaan)
m1 <- 'EngineSize ~ Horsepower+RPM'
fit <- sem(m1, sample.cov = lav.input,sample.nobs = nrow(x), meanstructure = TRUE, sample.mean = lav.mean)
summary(fit, standardize=TRUE)
结果是:
Regressions:
Estimate Std.Err Z-value P(>|z|) Std.lv Std.all
EngineSize ~
Horsepower 0.015 0.001 19.889 0.000 0.015 0.753
RPM -0.001 0.000 -15.197 0.000 -0.001 -0.576
Intercepts:
Estimate Std.Err Z-value P(>|z|) Std.lv Std.all
EngineSize 5.805 0.362 16.022 0.000 5.805 5.627
Variances:
Estimate Std.Err Z-value P(>|z|) Std.lv Std.all
EngineSize 0.142 0.021 6.819 0.000 0.142 0.133
我认为 lavaan 听起来是个不错的选择,我注意到@Philip 为我指出了正确的方向。我只是在这里提到如何使用 lavaan(特别是 r 平方和调整后的 r 平方)提取一些您可能需要的额外模型特征。
最新版本见:https://gist.github.com/jeromyanglim/9f766e030966eaa1241f10bd7d6e2812 :
# get data
library(MASS)
data("Cars93")
x <- Cars93[,c("EngineSize", "Horsepower", "RPM")]
# define sample statistics
covx <- cov(x)
n <- nrow(x)
means <- sapply(x, mean) # this is optional
fit <- lavaan::sem("EngineSize ~ Horsepower + RPM", sample.cov = covx,
sample.mean = means,
sample.nobs = n)
coef(fit) # unstandardised coefficients
standardizedSolution(fit) # Standardised coefficients
inspect(fit, 'r2') # r-squared
# adjusted r-squared
adjr2 <- function(rsquared, n, p) 1 - (1-rsquared) * ((n-1)/(n-p-1))
# update p below with number of predictor variables
adjr2(inspect(fit, 'r2'), n = inspect(fit, "nobs"), p = 2)
自定义函数
这里有一些功能可以提供 lavaan 的合身度以及一些相关的功能(即,基本上包装了上述大部分功能)。它假设您没有办法的情况。
covlm <- function(dv, ivs, n, cov) {
# Assumes lavaan package
# library(lavaan)
# dv: charcter vector of length 1 with name of outcome variable
# ivs: character vector of names of predictors
# n: numeric vector of length 1: sample size
# cov: covariance matrix where row and column names
# correspond to dv and ivs
# Return
# list with lavaan model fit
# and various other features of the model
results <- list()
eq <- paste(dv, "~", paste(ivs, collapse = " + "))
results$fit <- lavaan::sem(eq, sample.cov = cov,
sample.nobs = n)
# coefficients
ufit <- parameterestimates(results$fit)
ufit <- ufit[ufit$op == "~", ]
results$coef <- ufit$est
names(results$coef) <- ufit$rhs
sfit <- standardizedsolution(results$fit)
sfit <- sfit[sfit$op == "~", ]
results$standardizedcoef <- sfit$est.std
names(results$standardizedcoef) <- sfit$rhs
# use unclass to not limit r2 to 3 decimals
results$r.squared <- unclass(inspect(results$fit, 'r2')) # r-squared
# adjusted r-squared
adjr2 <- function(rsquared, n, p) 1 - (1-rsquared) * ((n-1)/(n-p-1))
results$adj.r.squared <- adjr2(unclass(inspect(results$fit, 'r2')),
n = n, p = length(ivs))
results
}
例如:
x <- Cars93[,c("EngineSize", "Horsepower", "RPM")]
covlm(dv = "EngineSize", ivs = c("Horsepower", "RPM"),
n = nrow(x), cov = cov(x))
这全部产生:
$fit
lavaan (0.5-20) converged normally after 27 iterations
Number of observations 93
Estimator ML
Minimum Function Test Statistic 0.000
Degrees of freedom 0
Minimum Function Value 0.0000000000000
$coef
Horsepower RPM
0.01491908 -0.00100051
$standardizedcoef
Horsepower RPM
0.7532350 -0.5755326
$r.squared
EngineSize
0.867
$adj.r.squared
EngineSize
0.864
另一种时髦的解决方案是生成一个与原始数据具有相同方差-协方差矩阵的数据集。您可以使用 MASS
包中的 mvrnorm()
来执行此操作。在这个新数据集上使用 lm()
将产生与原始数据集估计值相同的参数估计值和标准误差(截距除外,除非您知道每个变量的均值,否则无法访问截距)。这是一个示例:
#Assuming the variance covariance matrix is called VC
n <- 100 #sample size
nvar <- ncol(VC)
fake.data <- mvrnorm(n, mu = rep(0, nvar), sigma = VC, empirical = TRUE)
lm(Y~., data = fake.data)