How to fast search inside a large data.table (57M obs)?

I need a function that returns the value of one column of a data.table based on the values of two other columns:

require(data.table)

dt <- data.table(
    "base" = c("of", "of", "of", "lead and background vocals", "save thou me from", "silent in the face"),
    "prediction" = c("the", "set", "course", "from", "the", "of"),
    "count" = c(258586, 246646, 137533, 4, 4, 4)
)

> dt
#                         base prediction  count
#1:                         of        the 258586
#2:                         of        set 246646
#3:                         of     course 137533
#4: lead and background vocals       from      4
#5:          save thou me from        the      4
#6:         silent in the face         of      4

# the function needs to return the "prediction" value based on the max "count" value for the input "base" value.
# giving the input "of" to the function, the desired output is:
> prediction("of")
[1] "the"
# or:
> prediction("save thou me from")
[1] "the"

The solution provided here works for small datasets, but not for a very large data.table (57M obs):

f1 <- function(val) dt[base == val, prediction[which.max(count)]]

I have read that indexing the data.table and searching with the sqldf function can speed this up, but I don't know how to do that yet.

Thanks in advance.

With sqldf it would go like this. If you cannot fit it into memory, add the dbname = tempfile() argument.

library(sqldf)

val <- "of"
fn$sqldf("select max(count) count, prediction from dt where base = '$val'")
##   count prediction
##1 258586        the
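
If the table does not fit into memory, the same call can be backed by a temporary on-disk database instead; a minimal sketch, passing sqldf's dbname argument mentioned above:

val <- "of"
# back the query with a temporary SQLite file on disk instead of RAM;
# the result is the same as the in-memory call above
fn$sqldf("select max(count) count, prediction from dt where base = '$val'",
    dbname = tempfile())
##   count prediction
##1 258586        the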

Alternatively, set up a database directly using RSQLite and create an index:

library(gsubfn)
library(RSQLite)

con <- dbConnect(SQLite(), "dt.db")
dbWriteTable(con, "dt", dt)
dbExecute(con, "create index idx on dt(base)")

val <- "of"
fn$dbGetQuery(con, "select max(count) count, prediction from dt where base = '$val'")
##    count prediction
## 1 258586        the

dbDisconnect(con)
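
Since the question asks for a prediction() function, the query can be wrapped accordingly. A sketch that re-opens the on-disk database built above and binds the search value as a query parameter (DBI's params argument) rather than pasting it into the SQL string:

# re-open the indexed database created above
con <- dbConnect(SQLite(), "dt.db")

# wrapper sketch: return the prediction with the largest count for x
prediction <- function(x) {
  dbGetQuery(con,
    "select prediction from dt where base = ? order by count desc limit 1",
    params = list(x))$prediction
}

prediction("of")
## [1] "the"

dbDisconnect(con)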

Note

Run this first:

library(data.table)

dt <- data.table(
    "base" = c("of", "of", "of", "lead and background vocals", 
     "save thou me from", "silent in the face"),
    "prediction" = c("the", "set", "course", "from", "the", "of"),
    "count" = c(258586, 246646, 137533, 4, 4, 4)
)

You could consider using only data.table, as shown below. I think it can significantly improve the speed.

dt <- data.table(
    "base" = c("of", "of", "of", "lead and background vocals",
        "save thou me from", "silent in the face"),
    "prediction" = c("the", "set", "course", "from", "the", "of"),
    "count" = c(258586, 246646, 137533, 4, 4, 4)
)

# set the key on both base and count.
# This rearranges the data such that the max value of count for each group in base 
# corresponds to the last row.
setDT(dt, key = c("base", "count"))

# for a given group in base, we consider only the last value of prediction as it is 
# on the same row with the max value of count. 
prediction <- function(x) {
  dt[.(x), prediction[.N] ]
}

prediction("of")
#"the"
prediction("save thou me from")
#"the"