Speed up or replace a loop over millions of rows: checking which of several date ranges a date falls into
Good evening everyone. I have 6 million rows of data, and they come in four types.
z=structure(list(date = structure(c(11866, 16190, 14729, 11718), class = "Date"),
beg1 = structure(c(12264, 12264, 13970, 12264), class = "Date"),
end1 = structure(c(17621, 14760, 14760, 13298), class = "Date"),
ID1 = c(1003587, 1000396, 1010743, 1002113), beg2 = structure(c(NA,
14790, 14790, 13299), class = "Date"), end2 = structure(c(NA,
17621, 15217, 13969), class = "Date"), ID2 = c(NA, 1024488,
1027877, 1002824), beg3 = structure(c(NA, NA, 15218, 13970
), class = "Date"), end3 = structure(c(NA, NA, 17621, 14760
), class = "Date"), ID3 = c(NA, NA, 1031361, 1002113), beg4 = structure(c(NA,
NA, NA, 14790), class = "Date"), end4 = structure(c(NA, NA,
NA, 17621), class = "Date"), ID4 = c(NA, NA, NA, 1021290),
realID = c(NA, NA, NA, NA)), row.names = c(267365L, 193587L,
5294385L, 2039421L), class = "data.frame")
I am trying to assign a suitable ID to each row according to which date range its date falls into, using a loop.
for(i in 1:nrow(z)){tryCatch({print(i)
if(between(z$date[i],z$beg1[i],z$end1[i])==T){z$realID[i]=z$ID1[i]}
if(between(z$date[i],z$beg2[i],z$end2[i])==T){z$realID[i]=z$ID2[i]}
if(between(z$date[i],z$beg3[i],z$end3[i])==T){z$realID[i]=z$ID3[i]}
if(between(z$date[i],z$beg4[i],z$end4[i])==T){z$realID[i]=z$ID4[i]}},error=function(e){})}
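The code works.
However, with this much data the loop is inefficient and may take nearly a day to finish.
Does anyone know how I could improve or replace this code?
Thank you very much.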
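Because R is a vectorized language, the best way to speed this code up is to operate on whole vectors rather than looping over each element.
A simple solution is a series of ifelse statements.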
# >= and <= keep both bounds inclusive, matching between() in the original loop
z$realID <- ifelse(!is.na(z$beg1) & z$date >= z$beg1 & z$date <= z$end1, z$ID1, z$realID)
z$realID <- ifelse(!is.na(z$beg2) & z$date >= z$beg2 & z$date <= z$end2, z$ID2, z$realID)
z$realID <- ifelse(!is.na(z$beg3) & z$date >= z$beg3 & z$date <= z$end3, z$ID3, z$realID)
z$realID <- ifelse(!is.na(z$beg4) & z$date >= z$beg4 & z$date <= z$end4, z$ID4, z$realID)
Where the ifelse condition evaluates to TRUE, realID is updated; otherwise it keeps its previous value.
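If it helps, the same idea can be written as a short loop over the four range columns (not over the 6 million rows), using logical indexing instead of ifelse. This is only a sketch: the helper name assign_real_id and its n_ranges argument are made up for illustration, and it assumes the columns follow the beg1/end1/ID1 ... beg4/end4/ID4 naming from the question, that bounds are inclusive as with between(), and that a missing date or bound counts as "no match".

```r
# Hypothetical helper (not from the original post): vectorized over rows,
# looping only over the small number of date-range columns.
assign_real_id <- function(df, n_ranges = 4) {
  real_id <- rep(NA_real_, nrow(df))
  for (k in seq_len(n_ranges)) {
    beg <- df[[paste0("beg", k)]]
    end <- df[[paste0("end", k)]]
    id  <- df[[paste0("ID", k)]]
    # Assumption: inclusive bounds; any NA in date/beg/end means no match.
    hit <- !is.na(df$date) & !is.na(beg) & !is.na(end) &
      df$date >= beg & df$date <= end
    # Later ranges overwrite earlier ones, like the sequence of if statements.
    real_id[hit] <- id[hit]
  }
  real_id
}

# The four sample rows from the question, rebuilt compactly.
d <- function(x) as.Date(x, origin = "1970-01-01")
z <- data.frame(
  date = d(c(11866, 16190, 14729, 11718)),
  beg1 = d(c(12264, 12264, 13970, 12264)),
  end1 = d(c(17621, 14760, 14760, 13298)),
  ID1  = c(1003587, 1000396, 1010743, 1002113),
  beg2 = d(c(NA, 14790, 14790, 13299)),
  end2 = d(c(NA, 17621, 15217, 13969)),
  ID2  = c(NA, 1024488, 1027877, 1002824),
  beg3 = d(c(NA, NA, 15218, 13970)),
  end3 = d(c(NA, NA, 17621, 14760)),
  ID3  = c(NA, NA, 1031361, 1002113),
  beg4 = d(c(NA, NA, NA, 14790)),
  end4 = d(c(NA, NA, NA, 17621)),
  ID4  = c(NA, NA, NA, 1021290)
)

z$realID <- assign_real_id(z)
# Rows 2 and 3 get IDs 1024488 and 1010743; rows 1 and 4 stay NA
# because their dates fall outside every range.
print(z$realID)
```

Because hit never contains NA, a row with a missing bound simply keeps whatever realID it already had, so no tryCatch is needed.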