Improve speed of queries using complex survey design in R
I have a large dataset (more than 20 million observations) that I analyze with the survey
package, and even simple queries take a long time to run. I have tried to speed up my code, but I would like to know whether there are better ways to make this more efficient.
In my benchmark, I compare the speed of three commands that use svyby/svytotal:
- the simple svyby/svytotal command
- parallel computation with foreach dopar using 7 cores
- a compiled version of option 2
Spoiler: option 3 is more than twice as fast as the first option, but it is not suitable for large datasets, because it relies on parallel computing and quickly hits memory limits when working with large datasets. I run into this problem even though I have 16 GB of RAM. There are several solutions to this memory limitation, but none of them apply to survey design objects.
Any ideas on how to make this faster without crashing because of memory limits?
My code and a reproducible example:
# Load Packages
library(survey)
library(data.table)
library(compiler)
library(foreach)
library(doParallel)
options(digits=3)
# Load Data
data(api)
# Convert data to data.table format (mostly to increase speed of the process)
apiclus1 <- as.data.table(apiclus1)
# Multiplicate data observations by 1000
apiclus1 <- apiclus1[rep(seq_len(nrow(apiclus1)), 1000), ]
# create a count variable
apiclus1[, Vcount := 1]
# create survey design
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
1) Simple code
t1 <- Sys.time()
table1 <- svyby(~Vcount,
~stype+dnum+cname,
design = dclus1,
svytotal)
T1 <- Sys.time() - t1
2) Parallel computation with foreach dopar using 7 cores
# in this option, I create a list with different subsets of the survey design
# that will be passed to different CPU cores to work at the same time
subdesign <- function(i){ subset(dclus1, dnum==i)}
groups <- unique(apiclus1$dnum)
list_subsets <- lapply(groups, subdesign) # apply function and get all subsets in a list
i <- NULL
# Start Parallel
registerDoParallel(cores=7)
t2 <- Sys.time()
table2 <- foreach (i = list_subsets, .combine= rbind, .packages="survey") %dopar% {
options( survey.lonely.psu = "remove" )
svyby(~Vcount,
~stype+dnum+cname,
design = i,
svytotal)}
T2 <- Sys.time() - t2
3) Compiled version of option 2
# make a function of the previous query
query2 <- function (list_subsets) { foreach (i = list_subsets, .combine= rbind, .packages="survey") %dopar% {
svyby(~Vcount,
~stype+dnum+cname,
design = i,
svytotal)}}
# Compile the function to increase speed
query3 <- cmpfun(query2)
t3 <- Sys.time()
table3 <- query3(list_subsets)
T3 <- Sys.time() - t3
Results
> T1: 1.9 secs
> T2: 1.13 secs
> T3: 0.58 secs
barplot(c(T1, T2, T3),
names.arg = c("1) simple table", "2) parallel", "3) compiled parallel"),
ylab="Seconds")
Thanks for the nicely-asked question. Working efficiently with large survey datasets in R probably requires some basic SQL syntax (which is much easier to learn than R). MonetDB is the only big-data option compatible with the survey
package; exploring other high-performance packages is (probably) not going to be fruitful. Generally when I am exploring a huge dataset, I write directly in SQL queries rather than using the survey package, because the standard error computations are computationally intensive (and variances are not that useful during interactive data exploration). Notice how the final SQL timing below beats all of the other options. To calculate a quick weighted mean, use something like "SELECT by_column , SUM( your_column * the_weight ) / SUM( the_weight ) FROM yourdata GROUP BY by_column"
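For example, a minimal sketch of that weighted-mean pattern, assuming the db connection and the apiclus1 table created in the code further down (api00 and pw are columns of the api dataset):
# hypothetical example: weighted mean of api00 by stype via plain SQL,
# using the MonetDBLite connection `db` that gets set up below
res_wm <- dbGetQuery( db ,
    "SELECT stype , SUM( api00 * pw ) / SUM( pw ) AS wtd_mean_api00
     FROM apiclus1 GROUP BY stype" )
# the point estimates should match the (much slower) design-based call:
# svyby( ~api00 , ~stype , dclus1 , svymean )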
When you do need standard errors interactively, linearized designs (svydesign
) are generally more computationally intensive than replication designs (svrepdesign
), but sometimes creating the replicate-weighted design (as I do with jk1w_dclus1
below) requires more familiarity with survey methodology than some users are comfortable with.
# Load Packages
library(MonetDB.R)
library(MonetDBLite)
library(DBI) # suggested in comments and needed on OSX
library(survey)
# Load Data
data(api)
# Multiplicate data observations by 10000
apiclus1 <- apiclus1[rep(seq_len(nrow(apiclus1)), 10000), ]
# create a count variable
apiclus1$vcount <- 1
# create survey design
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
dbfolder <- tempdir()
db <- dbConnect( MonetDBLite() , dbfolder )
dbWriteTable( db , 'apiclus1' , apiclus1 )
db_dclus1 <-
svydesign(
weight = ~pw ,
id = ~dnum ,
data = "apiclus1" ,
dbtype = "MonetDBLite" ,
dbname = dbfolder ,
fpc = ~fpc
)
# you provided a design without strata,
# so type="JK1" matches that most closely.
# but see survey:::as.svrepdesign for other linearization-to-replication options
jk1w <- jk1weights( psu = apiclus1$dnum , fpc = apiclus1$fpc )
# after the replicate-weights have been constructed,
# here's the `svrepdesign` call..
jk1w_dclus1 <-
svrepdesign(
weight = ~pw ,
type = "JK1" ,
repweights = jk1w$repweights ,
combined.weights = FALSE ,
scale = jk1w$scale ,
rscales = jk1w$rscales ,
data = 'apiclus1' ,
dbtype = "MonetDBLite" ,
dbname = dbfolder
)
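# a simpler alternative sketch (for the in-memory design only):
# survey::as.svrepdesign() converts a linearized design to a
# replicate-weights design in one step, without building
# jk1weights() by hand; type = "JK1" matches the unstratified
# cluster design used here
jk1w_alt <- as.svrepdesign( dclus1 , type = "JK1" )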
# slow
system.time(res1 <- svyby(~vcount,~stype+dnum+cname,design = dclus1,svytotal))
# > system.time(res1 <- svyby(~vcount,~stype+dnum+cname,design = dclus1,svytotal))
# user system elapsed
# 17.40 2.86 20.27
# faster
system.time(res2 <- svyby(~vcount,~stype+dnum+cname,design = db_dclus1,svytotal))
# > system.time(res2 <- svyby(~vcount,~stype+dnum+cname,design = db_dclus1,svytotal))
# user system elapsed
# 13.00 1.20 14.18
# fastest
system.time(res3 <- svyby(~vcount,~stype+dnum+cname,design = jk1w_dclus1,svytotal))
# > system.time(res3 <- svyby(~vcount,~stype+dnum+cname,design = jk1w_dclus1,svytotal))
# user system elapsed
# 10.75 1.19 11.96
# same standard errors across the board
all.equal( SE( res1 ) , SE( res2 ) )
all.equal( SE( res2 ) , SE( res3 ) )
# NOTE: the replicate-weighted design will be slightly different
# for certain designs. however this technique is defensible
# and gets used in
# https://github.com/ajdamico/asdfree/tree/master/Censo%20Demografico
# at the point you do not care about standard errors,
# learn some sql:
system.time( res4 <- dbGetQuery( db , "SELECT stype , dnum , cname , SUM( pw ) FROM apiclus1 GROUP BY stype , dnum , cname" ) )
# because this is near-instantaneous, no matter how much data you have.
# same numbers as res1:
all.equal( as.numeric( sort( coef( res1 ) ) ) , sort( res4$L1 ) )
# > system.time( res4 <- dbGetQuery( db , "SELECT stype , dnum , cname , SUM( pw ) FROM apiclus1 GROUP BY stype , dnum , cname" ) )
# user system elapsed
# 0.15 0.20 0.23