Quickly retrieve p-values from multiple lm() fits in R
I have a matrix (mat) with dims "13, 20000000" and the following groups:
[1,] "wildtype"
[2,] "wildtype"
[3,] "wildtype"
[4,] "wildtype"
[5,] "wildtype"
[6,] "wildtype"
[7,] "wildtype"
[8,] "wildtype"
[9,] "wildtype"
[10,] "wildtype"
[11,] "mutant"
[12,] "mutant"
[13,] "mutant"
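Using the R code below, I run lm() on each data point, 20M times. lm(mat ~ groups) is really quick. What takes a long time is extracting the p-value for each model using summary(lm1).
How can I speed up the extraction of the p-values?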
tvals_out <-'/tmp/tvals_lm.csv'
infile <- '/tmp/tempdata.dat'
con <- file(infile, "rb")
dim <- readBin(con, "integer", 2)
mat <- matrix( readBin(con, "numeric", prod(dim)), dim[1], dim[2])
close(con)
groups = factor(c(rep('wt', 10), rep('mut', 3)))
lm1 <- lm(mat ~ groups)
# This is the longest running bit
sum_lm1 <- summary(lm1)
num_pixels <- dim(mat)[2]
result_pvalues <- numeric(num_pixels)
result_pvalues <- vapply(sum_lm1, function(x) x$coefficients[,4][2], FUN.VALUE = 1)
write.table(result_pvalues, tvals_out, sep=',');
outCon <- file(tvals_out, "wb")
writeBin(result_pvalues, outCon)
close(outCon)
Edit:
I have added a sample of the data, 10 data points from the mat object:
m <- c(28, 28, 28, 29, 33, 39, 49, 58, 63,64,30, 27, 24, 20, 17, 19, 33, 49, 56,57,36, 32, 28, 23, 20, 27, 48, 77, 96, 103,27, 26, 26, 23, 21, 23, 33, 46, 53,52,24, 20, 17, 13, 11, 14, 33, 47, 40,32,40, 46, 49, 48, 44, 49, 57, 59, 61,53,22, 24, 26, 32, 38, 39, 44, 53, 59,58,16, 16, 14, 10,7, 14, 34, 55, 62,61,28, 25, 21, 19, 22, 32, 45, 58, 64,61,28, 26, 21, 16, 14, 19, 33, 50, 59,59,17, 16, 15, 14, 17, 25, 38, 54, 61,58,11, 11, 12, 13, 16, 23, 34, 46, 51,45,22, 21, 20, 19, 16, 18, 32, 51, 50,38)
mat <- matrix(m, nrow=13)
How about trying the broom package?
install.packages("broom")
library(broom)
tidy(lm(mat ~ groups))
# response term estimate std.error statistic p.value
# 1 Y1 (Intercept) 27.000000 7.967548 3.3887465 6.048267e-03
# 2 Y1 groupswt 14.900000 9.084402 1.6401740 1.292246e-01
# 3 Y2 (Intercept) 23.333333 7.809797 2.9877004 1.234835e-02
# 4 Y2 groupswt 11.366667 8.904539 1.2765026 2.280689e-01
# 5 Y3 (Intercept) 44.000000 17.192317 2.5592828 2.655251e-02
# ...and more...
Then extract only the results for groupswt (note: there are various ways to do this):
subset(tidy(lm(mat ~ groups)), term == "groupswt")[, c(1,6)]
# response p.value
# 2 Y1 0.12922460
# 4 Y2 0.22806894
# 6 Y3 0.88113522
# 8 Y4 0.20645833
# 10 Y5 0.10362436
# 12 Y6 0.84642990
# 14 Y7 0.27171390
# 16 Y8 0.15398258
# 18 Y9 0.66351492
# 20 Y10 0.05942893
I have a script where I run a lot of regressions and then collect the coefficients, including the p-values. This is what it looks like:
library(data.table)
summ<-summary(lm1)$coefficients
coeffs<-data.table(summ)
coeffs[,coef:=row.names(summ)]
setnames(coeffs,c("estimate", "stderr","t","p","coef"))
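I have a hard time imagining what would be faster than summary. To test that, I wrote a quick function that computes the p-values from the coefficients and standard errors. I also tried the broom approach. The results on the sample data are below.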
m <- c(28, 28, 28, 29, 33, 39, 49, 58, 63,64,30, 27, 24, 20, 17, 19, 33, 49, 56,57,36, 32, 28, 23, 20, 27, 48, 77, 96, 103,27, 26, 26, 23, 21, 23, 33, 46, 53,52,24, 20, 17, 13, 11, 14, 33, 47, 40,32,40, 46, 49, 48, 44, 49, 57, 59, 61,53,22, 24, 26, 32, 38, 39, 44, 53, 59,58,16, 16, 14, 10,7, 14, 34, 55, 62,61,28, 25, 21, 19, 22, 32, 45, 58, 64,61,28, 26, 21, 16, 14, 19, 33, 50, 59,59,17, 16, 15, 14, 17, 25, 38, 54, 61,58,11, 11, 12, 13, 16, 23, 34, 46, 51,45,22, 21, 20, 19, 16, 18, 32, 51, 50,38)
mat <- matrix(m, nrow=13)
groups <- rep(c("wildtype", "mutant"), times = c(10, 3))
fit <- lm(mat ~ groups)
#* Using summary
do.call("cbind", lapply(summary(fit), function(f) coef(f)[, 4]))
#* Directly calculating p-value
pvalOnly <- function(fit) {
  pt(abs(coef(fit) / sqrt(diag(vcov(fit)))),
     df = fit$df.residual,
     lower.tail = FALSE) * 2
}
pvalDirect <- pvalOnly(fit)
#* Using broom
library(broom)
tidy(fit)$p.value
library(microbenchmark)
microbenchmark(
summary = do.call("cbind", lapply(summary(fit), function(f) coef(f)[, 4])),
direct = pvalOnly(fit),
broom = tidy(fit)$p.value
)
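As you can see, in this very small example, using summary is still slightly faster than the direct calculation. broom adds a lot of time (not surprising, since it does a lot of work tidying things you do not want to capture).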
Unit: milliseconds
expr min lq mean median uq max neval cld
summary 1.685857 1.744652 1.969350 1.804914 1.877931 4.929129 100 a
direct 1.860630 1.933501 2.184573 2.047279 2.160765 6.442852 100 a
broom 5.303015 5.557257 6.060014 5.818830 5.999028 9.879372 100 b
The following function can extract the p-values from the fit of a 13x20,000,000 matrix (like yours) in about 25 seconds.
pvalOnly2 <- function(fit) {
  # coefficient estimates, one column per model
  est <- fit$coefficients[fit$qr$pivot, ]
  # get R, the diagonal of (X'X)^-1: see stats:::summary.lm to see how this is calculated
  p1 <- 1L:(fit$rank)
  R <- diag(chol2inv(fit$qr$qr[p1, p1, drop = FALSE]))
  # residual variance of each model (residual sum of squares / residual df)
  resvar <- colSums(fit$residuals^2) / fit$df.residual
  # R is the same across models, resvar is the same within each model
  se <- sqrt(outer(R, resvar))
  pt(abs(est / se), df = fit$df.residual, lower.tail = FALSE) * 2
}
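This computes the same p-values as calling summary (or Benjamin's pvalOnly function). However, it skips all the other steps that summary performs for each model, which makes it much faster. (Note that Benjamin's pvalOnly calls vcov, which in turn calls summary, which is why it does not save any time.)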
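As a quick sanity check of that claim, the sketch below compares the shortcut with summary on a small fit. The simulated 13x3 matrix and the names fast and slow are made up for illustration; pvalOnly2 is repeated only so that the snippet runs on its own.

```r
# Made-up data: 13 observations (rows) x 3 responses (columns)
set.seed(1)
mat <- matrix(rnorm(13 * 3), nrow = 13)
groups <- rep(c("wildtype", "mutant"), times = c(10, 3))
fit <- lm(mat ~ groups)

# Same computation as pvalOnly2 above, repeated so this block is self-contained
pvalOnly2 <- function(fit) {
  est <- fit$coefficients[fit$qr$pivot, ]
  p1 <- 1L:(fit$rank)
  R <- diag(chol2inv(fit$qr$qr[p1, p1, drop = FALSE]))
  resvar <- colSums(fit$residuals^2) / fit$df.residual
  se <- sqrt(outer(R, resvar))
  pt(abs(est / se), df = fit$df.residual, lower.tail = FALSE) * 2
}

fast <- pvalOnly2(fit)
# Reference: column 4 of each per-response coefficient table from summary()
slow <- sapply(summary(fit), function(s) coef(s)[, 4])

# Compare values only; the two results carry different dimnames
all.equal(unname(fast), unname(slow))
```

The comparison drops dimnames because summary labels the columns "Response Y1", "Response Y2", and so on, while the shortcut keeps whatever column names the coefficient matrix has.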
On the small matrix, this is about 30 times faster than summary:
m <- c(28, 28, 28, 29, 33, 39, 49, 58, 63,64,30, 27, 24, 20, 17, 19, 33, 49, 56,57,36, 32, 28, 23, 20, 27, 48, 77, 96, 103,27, 26, 26, 23, 21, 23, 33, 46, 53,52,24, 20, 17, 13, 11, 14, 33, 47, 40,32,40, 46, 49, 48, 44, 49, 57, 59, 61,53,22, 24, 26, 32, 38, 39, 44, 53, 59,58,16, 16, 14, 10,7, 14, 34, 55, 62,61,28, 25, 21, 19, 22, 32, 45, 58, 64,61,28, 26, 21, 16, 14, 19, 33, 50, 59,59,17, 16, 15, 14, 17, 25, 38, 54, 61,58,11, 11, 12, 13, 16, 23, 34, 46, 51,45,22, 21, 20, 19, 16, 18, 32, 51, 50,38)
mat <- matrix(m, nrow=13)
groups <- rep(c("wildtype", "mutant"), times = c(10, 3))
fit <- lm(mat ~ groups)
library(microbenchmark)
microbenchmark(summary = do.call("cbind", lapply(summary(fit), function(f) coef(f)[, 4])),
pvalOnly2(fit))
Results:
Unit: microseconds
expr min lq mean median uq max neval cld
summary 3383.085 3702.238 3978.110 3919.0755 4147.4015 5475.223 100 b
pvalOnly2(fit) 81.538 91.541 136.903 137.1275 157.5535 459.415 100 a
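The speed advantage gets even bigger, however, when you fit more models. On a 13x1000 matrix, it has about a 300x advantage. On my machine, with 20 million columns, it computes the p-values in about 25 seconds, which is actually about twice as fast as the fit <- lm(mat ~ groups) step itself: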
> mat <- mat[, rep(1:10, 2e6)] # just replicating same coefs
> dim(mat)
[1] 13 20000000
> system.time(fit <- lm(mat ~ groups))
user system elapsed
37.272 10.296 58.372
> system.time(pvals <- pvalOnly2(fit))
user system elapsed
21.945 1.889 24.322
The resulting p-values are correct (the same as you would get from summary):
> dim(pvals)
[1] 2 20000000
> pvals[, 1:10]
[,1] [,2] [,3] [,4] [,5] [,6]
(Intercept) 0.006048267 0.01234835 0.02655251 0.0004555316 0.001004109 0.01608319
groupswildtype 0.129224604 0.22806894 0.88113522 0.2064583345 0.103624361 0.84642990
[,7] [,8] [,9] [,10]
(Intercept) 0.0004630405 0.1386393 0.05107805 5.042796e-05
groupswildtype 0.2717139022 0.1539826 0.66351492 5.942893e-02
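(By the way, profiling shows that almost all of the running time in the function is spent in pt. Since that is done in C, this is about as fast as it would be in any language.)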
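Since the question only needs the p-value of the group term, written out as raw binary, the last step might look like the sketch below. The small pvals matrix stands in for the real 2 x 20,000,000 result (its values are the first two columns shown above), and the temporary file and the names group_p, out, and back are assumptions for illustration.

```r
# Stand-in for the matrix returned by pvalOnly2(fit): one row per coefficient
pvals <- matrix(c(0.006048267, 0.129224604,
                  0.01234835,  0.22806894),
                nrow = 2,
                dimnames = list(c("(Intercept)", "groupswildtype"), NULL))

# Keep only the row for the group effect
group_p <- pvals["groupswildtype", ]

# Write the doubles as raw binary, as the question does with writeBin()
out <- tempfile(fileext = ".bin")  # hypothetical output path
con <- file(out, "wb")
writeBin(group_p, con)
close(con)

# Read the values back to confirm the round trip
con <- file(out, "rb")
back <- readBin(con, "numeric", n = length(group_p))
close(con)
identical(back, group_p)
```

Selecting the row by name rather than by position keeps the code correct if the coefficient order changes, for example when the factor levels are releveled.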
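In response to your comment, you can also get a p-value for each model as a whole (from the F statistic) with the following function, which runs at a similar speed to pvalOnly2: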
modelPvalOnly <- function(fit) {
  f <- t(fit$fitted.values)
  if (attr(fit$terms, "intercept")) {
    mss <- rowSums((f - rowMeans(f)) ^ 2)
    numdf <- fit$rank - 1
  } else {
    mss <- rowSums(f ^ 2)
    numdf <- fit$rank
  }
  resvar <- colSums(fit$residuals^2) / fit$df.residual
  fstat <- (mss / numdf) / resvar
  pval <- pf(fstat, numdf, fit$df.residual, lower.tail = FALSE)
  pval
}
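A small check, on made-up data, that this agrees with the F-test p-value implied by summary. The simulated matrix and the names fast and slow are assumptions for illustration, and modelPvalOnly is repeated so the snippet is self-contained.

```r
# Made-up data: 13 observations (rows) x 3 responses (columns)
set.seed(1)
mat <- matrix(rnorm(13 * 3), nrow = 13)
groups <- rep(c("wildtype", "mutant"), times = c(10, 3))
fit <- lm(mat ~ groups)

# Same computation as modelPvalOnly above, repeated so this block runs alone
modelPvalOnly <- function(fit) {
  f <- t(fit$fitted.values)
  if (attr(fit$terms, "intercept")) {
    mss <- rowSums((f - rowMeans(f)) ^ 2)
    numdf <- fit$rank - 1
  } else {
    mss <- rowSums(f ^ 2)
    numdf <- fit$rank
  }
  resvar <- colSums(fit$residuals^2) / fit$df.residual
  fstat <- (mss / numdf) / resvar
  pf(fstat, numdf, fit$df.residual, lower.tail = FALSE)
}

fast <- modelPvalOnly(fit)
# Reference: rebuild the p-value from the F statistic stored by summary()
slow <- sapply(summary(fit), function(s) {
  fs <- s$fstatistic
  pf(fs[["value"]], fs[["numdf"]], fs[["dendf"]], lower.tail = FALSE)
})

all.equal(unname(fast), unname(slow))
```

With a single two-level predictor, as here, the model F test and the t test on the group coefficient are equivalent, so these values should also match the second row of the pvalOnly2 output.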