提高大数据的查找性能 table
Improving lookup performance with large data table
我正在寻找提高对大型(超过一百万行)table 参考数据的查找性能的选项。查找根据 table 的每一行中的范围匹配一个键。可以进行多个匹配,我可以列出所有匹配项或最紧密的匹配项(范围最小的匹配项)。
下面的代码显示了一个小玩具示例。
# Create small example dataset with overlapping ranges
#
ip_from <- c(1,2,4,6,7, 7, 8,11)
ip_to <- c(5,3,5,6,9,15,10,15)
vl <- c("A","B","C","D","E","F","G","H")
dt1 <- data.table(ip_from,ip_to,vl,stringsAsFactors=FALSE)
str(dt1)
head(dt1,n=10)
#
# Perform some lookups manually
# A hit is when ip_from<= x <=ip_to
# There can be multiple hits
#
x <- 8
dt2 <- dt1[x>=ip_from & x<=ip_to] #Can produce multiple hits
head(dt2)
y <- dt2[,min(ip_to-ip_from)] #Get smallest difference
dt3 <- dt2[(ip_to-ip_from)==y] #Get hits matching smallest
head(dt3)
dt4 <- dt3[1] #Take first hit if multiples
head(dt4)
#
# Create a function to do the match
# It crudely caters for misses.
#
lkp <- function(dtbl,ky){
dtx <- dtbl[ky>=ip_from & ky<=ip_to]
if (nrow(dtx)!=0) {
y <- dtx[,min(ip_to-ip_from)]
dty <- dtx[(ip_to-ip_from)==y]
dtz <- dty[1]
return(dtz)
}
else {
return(NA)
}
}
#
# Make a set of keys
#
z <- as.data.frame(c(8,6))
colnames(z) <- "Obs"
#
# Do lookups, this gets slow.
#
zz <- rbindlist(apply(z,1,function(x) lkp(dt1,x[1])))
zzz <- cbind(z,zz)
head(zzz)
性能随着要查找的键的数量增加而下降,我可能会遇到成千上万个键的情况。请注意,在更实际的情况下,高值和低值是数据 table.
中的键
是否有任何其他方法可以显着提高性能?
我会这样做:
setDT(z)
dt1[, range := ip_to - ip_from] #calculate all ranges
setorder(dt1, range) #order by ranges
#do a data.table non-equi join taking only the first match
dt1[z, .(Obs = Obs, vl = x.vl, ip_from = x.ip_from, ip_to = x.ip_to),
on = c("ip_from <= Obs", "ip_to >= Obs"), mult = "first"]
# Obs vl ip_from ip_to
#1: 8 E 7 9
#2: 6 D 6 6
比较:
print(zzz)
# Obs ip_from ip_to vl
#1 8 7 9 E
#2 6 6 6 D
我正在寻找提高对大型(超过一百万行)table 参考数据的查找性能的选项。查找根据 table 的每一行中的范围匹配一个键。可以进行多个匹配,我可以列出所有匹配项或最紧密的匹配项(范围最小的匹配项)。
下面的代码显示了一个小玩具示例。
# Create small example dataset with overlapping ranges
#
ip_from <- c(1,2,4,6,7, 7, 8,11)
ip_to <- c(5,3,5,6,9,15,10,15)
vl <- c("A","B","C","D","E","F","G","H")
dt1 <- data.table(ip_from,ip_to,vl,stringsAsFactors=FALSE)
str(dt1)
head(dt1,n=10)
#
# Perform some lookups manually
# A hit is when ip_from<= x <=ip_to
# There can be multiple hits
#
x <- 8
dt2 <- dt1[x>=ip_from & x<=ip_to] #Can produce multiple hits
head(dt2)
y <- dt2[,min(ip_to-ip_from)] #Get smallest difference
dt3 <- dt2[(ip_to-ip_from)==y] #Get hits matching smallest
head(dt3)
dt4 <- dt3[1] #Take first hit if multiples
head(dt4)
#
# Create a function to do the match
# It crudely caters for misses.
#
lkp <- function(dtbl,ky){
dtx <- dtbl[ky>=ip_from & ky<=ip_to]
if (nrow(dtx)!=0) {
y <- dtx[,min(ip_to-ip_from)]
dty <- dtx[(ip_to-ip_from)==y]
dtz <- dty[1]
return(dtz)
}
else {
return(NA)
}
}
#
# Make a set of keys
#
z <- as.data.frame(c(8,6))
colnames(z) <- "Obs"
#
# Do lookups, this gets slow.
#
zz <- rbindlist(apply(z,1,function(x) lkp(dt1,x[1])))
zzz <- cbind(z,zz)
head(zzz)
性能随着要查找的键的数量增加而下降,我可能会遇到成千上万个键的情况。请注意,在更实际的情况下,高值和低值是数据 table.
中的键是否有任何其他方法可以显着提高性能?
我会这样做:
setDT(z)
dt1[, range := ip_to - ip_from] #calculate all ranges
setorder(dt1, range) #order by ranges
#do a data.table non-equi join taking only the first match
dt1[z, .(Obs = Obs, vl = x.vl, ip_from = x.ip_from, ip_to = x.ip_to),
on = c("ip_from <= Obs", "ip_to >= Obs"), mult = "first"]
# Obs vl ip_from ip_to
#1: 8 E 7 9
#2: 6 D 6 6
比较:
print(zzz)
# Obs ip_from ip_to vl
#1 8 7 9 E
#2 6 6 6 D