R: speed up multiple if statements

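I have a data frame (mydata) with 220k rows, and for every row I want to run 8 if statements against one column (BRLABELS). A plain if / else if chain takes about 5 minutes, and I would like to speed it up.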

I tried using switch instead. First I defined a function:

group_label<-function(x){
  switch(x,"15-19"=1,"20-24"=1,"25-29"=2,"30-34"=2,"35-39"=3,"40-44"=3,
         "45-49"=4,"50-54"=4,"55-59"=5,"60-64"=5,"ISCED 0"=6,"ISCED 1"=6,"ISCED 2"=6,"ISCED 3"=7,"ISCED 4"=7,"ISCED 5"=8,"ISCED 6"=8,0)}

Then I called it in a for loop:

k <- nrow(mydata)  # assumed: k is the number of rows (it is not defined in the original post)
for (i in 1:k) {
  x <- mydata$BRLABELS[i]
  mydata$group[i] <- group_label(x)
}

Confusingly, this approach takes about 15 minutes, even though switch is supposed to be the better fit for multiple if statements.

Can someone explain why this happens and suggest an efficient alternative?

You can copy/paste the mapping from the switch call into a named vector:

new_values <- c("15-19"=1,"20-24"=1,"25-29"=2,"30-34"=2,"35-39"=3,"40-44"=3, "45-49"=4,"50-54"=4,"55-59"=5,"60-64"=5,"ISCED 0"=6,"ISCED 1"=6,"ISCED 2"=6,"ISCED 3"=7,"ISCED 4"=7,"ISCED 5"=8,"ISCED 6"=8,0)

and update the values with a single vectorized lookup:

mydata$BRLABELS <- new_values[mydata$BRLABELS]

I assume BRLABELS is not a factor (otherwise your code would not work).
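Two details of the lookup are worth a hedged sketch. Indexing a named vector with a factor uses the factor's integer codes rather than its labels, and the unnamed trailing 0 copied from switch does not act as a default: a label that is missing from the names comes back as NA. The example below assumes a small hypothetical input (labels) and handles both cases explicitly; it is one possible way to do it, not the only one.

```r
# Lookup table copied from the switch call, without the unnamed default.
new_values <- c("15-19" = 1, "20-24" = 1, "25-29" = 2, "30-34" = 2,
                "35-39" = 3, "40-44" = 3, "45-49" = 4, "50-54" = 4,
                "55-59" = 5, "60-64" = 5, "ISCED 0" = 6, "ISCED 1" = 6,
                "ISCED 2" = 6, "ISCED 3" = 7, "ISCED 4" = 7,
                "ISCED 5" = 8, "ISCED 6" = 8)

# Hypothetical input: a factor that also contains an unmapped label.
labels <- factor(c("15-19", "ISCED 3", "unknown"))

# as.character() makes the lookup go by label, not by factor code;
# unname() drops the names that the lookup carries over.
group <- unname(new_values[as.character(labels)])

# Labels with no entry in the table come back as NA; map them to 0
# to mimic the default branch of the original switch.
group[is.na(group)] <- 0

group
# [1] 1 7 0
```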

Update: timing test

group_label<-function(x){
  switch(x,"15-19"=1,"20-24"=1,"25-29"=2,"30-34"=2,"35-39"=3,"40-44"=3,
         "45-49"=4,"50-54"=4,"55-59"=5,"60-64"=5,"ISCED 0"=6,"ISCED 1"=6,"ISCED 2"=6,"ISCED 3"=7,"ISCED 4"=7,"ISCED 5"=8,"ISCED 6"=8,0)}

new_values <- c("15-19"=1,"20-24"=1,"25-29"=2,"30-34"=2,"35-39"=3,"40-44"=3, "45-49"=4,"50-54"=4,"55-59"=5,"60-64"=5,"ISCED 0"=6,"ISCED 1"=6,"ISCED 2"=6,"ISCED 3"=7,"ISCED 4"=7,"ISCED 5"=8,"ISCED 6"=8,0)

mydata <- 
  data.frame(
    BRLABELS = 
      sample(c("15-19","20-24","25-29","30-34","35-39","40-44",
               "45-49","50-54","55-59","60-64","ISCED 0","ISCED 1","ISCED 2","ISCED 3",
               "ISCED 4","ISCED 5","ISCED 6"), 
             10000, replace = TRUE ), 
    stringsAsFactors = FALSE)


mydata2 <- mydata  # separate copy so each method writes to its own data frame

library(microbenchmark)

microbenchmark(times = 5,
  for_loop = for ( i in 1:nrow(mydata)){
    x<-mydata$BRLABELS[i]
    mydata$group[i]<-group_label(x)},
  direct = mydata2$group <- new_values[mydata2$BRLABELS]
  )


#     Unit: microseconds
#     expr            min         lq        mean     median         uq        max neval cld
#     for_loop 737247.663 765056.444 781973.1502 769505.576 814000.738 824055.330     5   b
#     direct      325.432    326.715    375.2092    344.249    387.012    492.638     5  a 

In the end I used the recode function from the "car" package, as James suggested.

mydata$BRLABELS<-recode(mydata$BRLABELS,"c('15-19','20-24')='15-24';c('25-29','30-34')='25-34';c('35-39','40-44')='35-44'; c('45-49','50-54')='45-54';c('55-59','60-64')='55-64';c('ISCED 0','ISCED 1','ISCED 2')='ISCED 0-2';c('ISCED 3','ISCED 4')='ISCED 3-4';c('ISCED 5','ISCED 6')='ISCED 5-6'; else ='0'") 

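It is more intuitive than the for/if loop, and the difference in run time is large. Finally, I used the plyr package to sum the column I wanted (which was the ultimate goal).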

ddply(mydata,~GEO +VAR +ANSWER +LABELS +BREAKDOWN +BRLABELS ,summarise,VALUE=sum(VALUE)) 
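To make the final aggregation step concrete, here is a small sketch on made-up data. It uses base R's aggregate as a stand-in for the ddply call, so no extra packages are needed; the toy data frame, its values, and the reduced set of grouping columns are assumptions for illustration only.

```r
# Hypothetical data after recoding: several rows share the same group label.
toy <- data.frame(
  GEO      = c("AT", "AT", "AT", "BE"),
  BRLABELS = c("15-24", "15-24", "25-34", "15-24"),
  VALUE    = c(10, 5, 7, 3),
  stringsAsFactors = FALSE
)

# Sum VALUE within each combination of the grouping columns.
totals <- aggregate(VALUE ~ GEO + BRLABELS, data = toy, FUN = sum)

totals
```

The two "AT" / "15-24" rows collapse into a single row with VALUE 15, which is the effect the recoding was meant to enable.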

Thanks, everyone, for your help.