How to search quickly inside a large data.table (57M obs) using sqldf?
I need a function that returns the value of one data.table column based on the values of two other columns:
require(data.table)
dt <- data.table(
"base" = c("of", "of", "of", "lead and background vocals", "save thou me from", "silent in the face"),
"prediction" = c("the", "set", "course", "from", "the", "of"),
"count" = c(258586, 246646, 137533, 4, 4, 4)
)
> dt
# base prediction count
#1: of the 258586
#2: of set 246646
#3: of course 137533
#4: lead and background vocals from 4
#5: save thou me from the 4
#6: silent in the face of 4
# the function needs to return the "prediction" value based on the max "count" value for the input "base" value.
# giving the input "of" to function:
> prediction("of")
# the desired output is:
> "the"
# or:
> prediction("save thou me from")
> "the"
The solution provided here works for small datasets, but not for a very large data.table (57M obs):
f1 <- function(val) dt[base == val, prediction[which.max(count)]]
I have read that indexing the data.table and searching with sqldf functions could speed this up, but I do not know how to do it yet.
Thanks in advance.
With sqldf it would look like this. If the data does not fit into memory, add the dbname = tempfile() argument.
library(sqldf)
val <- "of"
fn$sqldf("select max(count) count, prediction from dt where base = '$val'")
## count prediction
##1 258586 the
Alternatively, set up the database directly with RSQLite and create an index:
library(gsubfn)
library(RSQLite)
con <- dbConnect(SQLite(), "dt.db")
dbWriteTable(con, "dt", dt)
dbExecute(con, "create index idx on dt(base)")
val <- "of"
fn$dbGetQuery(con, "select max(count) count, prediction from dt where base = '$val'")
## count prediction
## 1 258586 the
dbDisconnect(con)
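The queries above paste `val` into the SQL text, which breaks as soon as the search phrase contains a single quote. As a minimal sketch, assuming the same table layout, the lookup can bind the value through the `params` argument of `dbGetQuery` instead. The helper name `predict_sql` and the in-memory database are illustrative choices, not part of the original answer, and `order by ... limit 1` is swapped in for the `max(count)` form used above.

```r
library(data.table)
library(DBI)
library(RSQLite)

dt <- data.table(
  base = c("of", "of", "of", "save thou me from"),
  prediction = c("the", "set", "course", "the"),
  count = c(258586, 246646, 137533, 4)
)

# In-memory database for illustration only; pass a file path instead
# when the data does not fit into RAM.
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "dt", dt)
dbExecute(con, "create index idx on dt(base)")

# Hypothetical helper: the `?` placeholder is bound to `val` by the driver,
# so quotes inside `val` never become part of the SQL text.
predict_sql <- function(con, val) {
  res <- dbGetQuery(
    con,
    'select prediction from dt where base = ? order by "count" desc limit 1',
    params = list(val)
  )
  # Return NA when the phrase is not in the table.
  if (nrow(res) == 0) NA_character_ else res$prediction[1]
}

p_of      <- predict_sql(con, "of")
p_save    <- predict_sql(con, "save thou me from")
p_missing <- predict_sql(con, "don't know")  # contains a single quote

dbDisconnect(con)
```

Here `p_of` and `p_save` should both come back as "the", and the phrase containing an apostrophe is handled without any escaping.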
Note
Run this first:
library(data.table)
dt <- data.table(
"base" = c("of", "of", "of", "lead and background vocals",
"save thou me from", "silent in the face"),
"prediction" = c("the", "set", "course", "from", "the", "of"),
"count" = c(258586, 246646, 137533, 4, 4, 4)
)
You could consider using data.table alone, as shown below. I think it could improve speed significantly.
dt <- data.table(
"base" = c("of", "of", "of", "lead and background vocals", "save thou me from",
"silent in the face"),
"prediction" = c("the", "set", "course", "from", "the", "of"),
"count" = c(258586, 246646, 137533, 4, 4, 4)
)
# set the key on both base and count.
# This rearranges the data such that the max value of count for each group in base
# corresponds to the last row.
setDT(dt, key = c("base", "count"))
# for a given group in base, we consider only the last value of prediction as it is
# on the same row with the max value of count.
prediction <- function(x) {
dt[.(x), prediction[.N] ]
}
prediction("of")
#"the"
prediction("save thou me from")
#"the"
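If the function will be called many times, another option is to reduce the table once to the best prediction per base and look values up in that much smaller keyed table. This is only a sketch of that idea, not something benchmarked on the 57M-row data; the names `top` and `predict_top` are made up for illustration.

```r
library(data.table)

dt <- data.table(
  base = c("of", "of", "of", "lead and background vocals",
           "save thou me from", "silent in the face"),
  prediction = c("the", "set", "course", "from", "the", "of"),
  count = c(258586, 246646, 137533, 4, 4, 4)
)

# One pass over the full table: find the row number of the largest count
# within each base group, then keep only those rows.
best_rows <- dt[, .I[which.max(count)], by = base]$V1
top <- dt[best_rows, .(base, prediction)]

# Keying sorts `top` by base so that lookups use binary search.
setkey(top, base)

# Hypothetical helper: keyed join on base; unmatched input gives NA.
predict_top <- function(x) top[.(x), prediction]

predict_top("of")
predict_top("save thou me from")
predict_top("no such phrase")
```

The one-off aggregation costs a full scan, but every later call touches only one row per distinct base. On ties, `which.max` keeps the first row with the maximum count.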