R data.table: function wrapper around an ad-hoc join (with aggregation in a chain)
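[data.table_1.9.6]
For background: I am trying to build OLAP-like query functionality on a star-schema-like data layout, i.e. one large fact table and several meta tables. I am building a function wrapper around a data.table join followed by an aggregation in a chain, as follows: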
# dummy data
dt1 = data.table(id = 1:5, x=letters[1:5], a=11:15, b=21:25)
dt2 = data.table(k=11:15, z=letters[11:15])
# standard data.table query with ad-hoc key -> works fine
dt1[dt2, c("z") := .(i.z), with = F,
on = c(a="k")][, .(m = sum(a, na.rm = T),
count = .N), by = c("z")]
# wrapper function with setkey -> works fine
agg_foo <- function(x, meta_tbl, x_key, meta_key, agg_var) {
setkeyv(x, x_key)
setkeyv(meta_tbl, meta_key)
x[meta_tbl, (agg_var) := get(agg_var)][,.(a_sum = sum(a, na.rm=T),
count = .N),
by = c(agg_var)]
x[, (agg_var) := .(NULL)]
}
# call function (works fine)
agg_foo(x=dt1, meta_tbl=dt2, x_key="a", meta_key="k",agg_var="z")
# wrapper function with ad-hoc key -> does not work
agg_foo_ad_hoc <- function(x, meta_tbl, x_key, meta_key, agg_var) {
x[meta_tbl, (agg_var) := get(agg_var),
on = c(x_key = meta_key)][,.(a_sum = sum(a, na.rm=T),
count = .N), by = c(agg_var)]
x[, (agg_var) := .(NULL)]
}
# call function (causes error)
agg_foo_ad_hoc(x=dt1, meta_tbl=dt2, x_key="a", meta_key="k",agg_var="z")
Error in forderv(x, by = rightcols) :
'by' value -2147483648 out of range [1,4]
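My guess is that I have to supply the ad-hoc "on" argument differently. I tried on = c(get(x_key) = meta_key), but then R complains about an unexpected bracket. I could use the setkey version of the function, but I wonder whether that is efficient: the function will work on different meta tables depending on which aggregation attribute is used, and so it would constantly re-set the key. Or is setkey always preferred? The real fact table (x here) has more than 30 million rows.
All you need to do is build a vector with the right names. Here is one way: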
agg_foo_ad_hoc <- function(x, meta_tbl, x_key, meta_key, agg_var) {
x[meta_tbl, (agg_var) := get(agg_var),
on = setNames(meta_key, x_key)][,.(a_sum = sum(a, na.rm=T),
count = .N), by = c(agg_var)]
x[, (agg_var) := .(NULL)]
}
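To see why the original call fails, it may help to compare the two ways of building the "on" vector outside of any join. The sketch below uses only base R; the variable names literal and evaluated are illustrative and not part of the original post.

```r
x_key <- "a"
meta_key <- "k"

# c() uses the argument name as written, so the element is named "x_key",
# not "a". data.table then looks for a column called x_key in x and fails.
literal <- c(x_key = meta_key)
print(names(literal))

# setNames() evaluates x_key first, so the element is named "a".
# This is the same vector as the hand-written c(a = "k").
evaluated <- setNames(meta_key, x_key)
print(names(evaluated))
print(identical(evaluated, c(a = "k")))
```

The attempt c(get(x_key) = meta_key) cannot work either, because the left-hand side of = inside c() must be a plain name or string, not a function call, which is why the parser rejects it.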