Conditional join in R
I would like to conditionally join two data tables:
library(data.table)
set.seed(1)
key.table <-
  data.table(
    out = (0:10)/10,
    keyz = sort(runif(11))
  )
large.tbl <-
  data.table(
    ab = rnorm(1e6),
    cd = runif(1e6)
  )
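according to the following rule: for each row of large.tbl, match the smallest value of out in key.table among the rows whose keyz value is greater than cd. I have the following: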
library(dplyr)
large.tbl %>%
  rowwise %>%
  mutate(out = min(key.table$out[key.table$keyz > cd]))
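which gives the correct output. The problem is that the rowwise operation seems too expensive for the large.tbl I am actually using: it crashes unless it is run on one particular computer. Is there a less memory-hungry operation? The following seems slightly faster, but not by enough to solve my problem.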
large.tbl %>%
  group_by(cd) %>%
  mutate(out = min(key.table$out[key.table$keyz > cd]))
This sounds like a question with a data.table answer, but the answer does not have to use that package.
If key.table$out is also sorted, as it is in your toy example, the following will work:
ind <- findInterval(large.tbl$cd, key.table$keyz) + 1
large.tbl$out <- key.table$out[ind]
head(large.tbl)
#             ab         cd out
#1: -0.928567035 0.99473795  NA
#2: -0.294720447 0.41107393 0.5
#3: -0.005767173 0.91086585 1.0
#4:  2.404653389 0.66491244 0.8
#5:  0.763593461 0.09590456 0.1
#6: -0.799009249 0.50963409 0.5
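To see why the `+ 1` works: `findInterval(x, v)` returns, for each element of `x`, how many elements of the sorted vector `v` are less than or equal to it, so adding one gives the position of the first element of `v` that is strictly greater. The sketch below checks this against a brute-force version of the rule on a hand-made key. The names `keyz_small`, `out_small` and `cd_small` are made up for illustration and are not part of the original post.

```r
# Toy key: keyz_small must be sorted in increasing order for findInterval().
keyz_small <- c(0.2, 0.5, 0.9)
out_small  <- c(0.0, 0.5, 1.0)   # sorted alongside keyz_small
cd_small   <- c(0.1, 0.5, 0.7, 0.95)

# Count of keyz_small values <= each cd, plus one:
# the index of the first keyz_small strictly greater than cd.
ind_small <- findInterval(cd_small, keyz_small) + 1

# An index past the end of out_small yields NA (no keyz_small exceeds cd).
res_fast <- out_small[ind_small]

# Brute-force reference that follows the rule in the question row by row.
res_slow <- vapply(cd_small, function(x) {
  cand <- out_small[keyz_small > x]
  if (length(cand) == 0L) NA_real_ else min(cand)
}, numeric(1))

stopifnot(identical(res_fast, res_slow))
```

One difference from the dplyr version is worth keeping in mind: when no keyz exceeds cd, `min()` of an empty vector returns `Inf` with a warning, whereas the indexing approach returns `NA`.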
If key.table$out is not sorted:
ind <- findInterval(large.tbl$cd, key.table$keyz) + 1
vec <- rev(cummin(rev(key.table$out)))
large.tbl$out <- vec[ind]
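Here `rev(cummin(rev(x)))` is a running minimum taken from the right, so element `i` of the result holds `min(x[i:length(x)])`. Indexing it with `ind` therefore gives the smallest out among all rows whose keyz exceeds cd, even when out is in arbitrary order. A minimal sketch on made-up data (the names `keyz_small`, `out_unsorted`, `cd_small` and `suffix_min` are illustrative only):

```r
keyz_small   <- c(0.2, 0.5, 0.9)   # keyz still has to be sorted
out_unsorted <- c(0.7, 0.1, 0.4)   # out is deliberately not sorted
cd_small     <- c(0.1, 0.3, 0.6, 0.95)

# suffix_min[i] is the minimum of out_unsorted[i], ..., out_unsorted[n].
suffix_min <- rev(cummin(rev(out_unsorted)))

# Index of the first keyz_small strictly greater than each cd.
ind_small <- findInterval(cd_small, keyz_small) + 1
res_fast  <- suffix_min[ind_small]

# Brute-force reference that follows the rule in the question row by row.
res_slow <- vapply(cd_small, function(x) {
  cand <- out_unsorted[keyz_small > x]
  if (length(cand) == 0L) NA_real_ else min(cand)
}, numeric(1))

stopifnot(identical(res_fast, res_slow))
```

The suffix minimum is computed once over the small key table, so the cost per row of the large table remains a single binary search.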
What you want is:
setkey(large.tbl, cd)
setkey(key.table, keyz)
key.table[large.tbl, roll = -Inf]
See ?data.table, under roll:
Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i's row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then the prevailing value in x is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF). Usually, there should be no duplicates in x's key, the last key column is a date (or time, or datetime) and all the columns of x's key are joined to. A common idiom is to select a contemporaneous regular time series (dts) across a set of identifiers (ids): DT[CJ(ids,dts),roll=TRUE] where DT has a 2-column key (id,date) and CJ stands for cross join. When roll is a positive number, this limits how far values are carried forward. roll=TRUE is equivalent to roll=+Inf. When roll is a negative number, values are rolled backwards; i.e., next observation carried backwards (NOCB). Use -Inf for unlimited roll back. When roll is "nearest", the nearest value is joined to.
(To be fair, I think this could use some clarification; it is very dense.)
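Since the help text is dense, a small example may make the roll directions concrete. This is only a sketch: the tables `key_small` and `big_small` are made up for illustration, and it uses the `on =` argument rather than `setkey()`, an assumption made so that the large table keeps its original row order.

```r
library(data.table)

key_small <- data.table(out = c(0.0, 0.5, 1.0), keyz = c(0.2, 0.5, 0.9))
big_small <- data.table(ab = c(10, 20, 30, 40), cd = c(0.1, 0.3, 0.7, 0.95))

# roll = -Inf (NOCB): each cd picks up the row of key_small with the
# next keyz at or after it; a cd beyond the last keyz gets NA.
nocb <- key_small[big_small, on = .(keyz = cd), roll = -Inf]

# roll = TRUE (LOCF): each cd picks up the row with the previous keyz;
# a cd before the first keyz gets NA.
locf <- key_small[big_small, on = .(keyz = cd), roll = TRUE]

print(nocb)
print(locf)
```

Two caveats, as far as I understand the behaviour: a cd exactly equal to a keyz matches that row rather than the next one, and the join returns the out at the next keyz instead of a minimum over all larger keyz, so it agrees with the rule in the question only when out is non-decreasing in keyz.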