Using data.table to left join with equality and inequality conditions, and multiple matches per left table row

I am trying to work out how to use a data.table approach to join two datasets on a condition that mixes equality and inequality sub-conditions. Here is some sample data:

> A <- data.table(name = c("Sally","Joe","Fred"),age = c(20,25,30))
> B <- data.table(name = c("Sally","Joe","Fred","Fred"),age = c(20,30,35,40),condition = c("deceased","good","good","ailing"))
> A
    name age
1: Sally  20
2:   Joe  25
3:  Fred  30

> B
    name age condition
1: Sally  20  deceased
2:   Joe  30      good
3:  Fred  35      good
4:  Fred  40    ailing

When I run A[B, on = .(name = name, age < age), condition := i.condition], I only get the following 3 rows:

> A
    name age condition
1: Sally  20      <NA>
2:   Joe  25      good
3:  Fred  30    ailing

Counter-intuitively for a typical SQL user, this does not return all rows that match the join condition (4 in this case). I am using data.table_1.11.8.

Is there a data.table approach that lets me

  1. handle join conditions whose sub-conditions mix equality and inequality,
  2. assign to the existing dataset with := to avoid unnecessary memory use, and
  3. keep all rows that match the join condition, as SQL would?

If there is no data.table solution, what is the best alternative (my datasets are large and I would like to rely on as few packages as possible)?

EDIT

To clarify the output I am looking for, here is the SQL code whose behaviour I am trying to emulate:

create table #A (
name varchar(50),
age integer
);

insert into #A
values ('Sally',20),
       ('Joe',25),
       ('Fred',30);

create table #B (
name varchar(50),
age integer,
condition varchar(50)
);

insert into #B
values ('Sally',20,'deceased'),
       ('Joe',30,'good'),
       ('Fred',35,'good'),
       ('Fred',40,'ailing');

select
#A.*,
condition
from #A left join #B
on  #A.name = #B.name
and #A.age < #B.age;

The above returns the following result set:

name    age   condition
Sally   20    NULL
Joe     25    good
Fred    30    good
Fred    30    ailing

If an SQL-style left join (as described in the edit) is what is needed, it can be achieved with code very similar to icecreamtoucan's suggestion in the comments:

B[A,on=.(name = name, age > age)]

Note: if the result set would contain more rows than the sum of the rows of the joined tables, data.table assumes you have made a mistake (unlike an SQL engine) and throws an error. The fix (assuming you have not in fact made a mistake) is to add allow.cartesian = TRUE.
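For this small example the flag is not actually needed (the result has only 4 rows), but when it is, the call simply becomes:

B[A, on = .(name = name, age > age), allow.cartesian = TRUE]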

Also, unlike SQL, this join does not return all columns of both tables. Instead (somewhat frustratingly for someone coming from an SQL background), the values of the left table's columns used in the inequality conditions are returned in columns that carry the names of the right table's columns they were compared against in those inequality conditions!

The workaround here (which I found in another SO answer some time ago but cannot locate now) is to create aliased duplicates of the join columns you want to keep, use the duplicates in the join condition, and then specify the columns to keep in the join.

For example:

A <- data.table( group = rep("WIZARD LEAGUE",3)
                ,name = rep("Fred",time=3)
                ,status_start = as.Date("2017-01-01") + c(0,370,545)
                ,status_end = as.Date("2017-01-01") + c(369,544,365*3-1) 
                ,status = c("UNEMPLOYED","EMPLOYED","RETIRED"))
A <- rbind(A, data.table( group = "WIZARD LEAGUE"
                         ,name = "Sally"
                         ,status_start = as.Date("2017-01-01")
                         ,status_end = as.Date("2019-12-31")
                         ,status = "CONTRACTED"))
> A
           group  name status_start status_end     status
1: WIZARD LEAGUE  Fred   2017-01-01 2018-01-05 UNEMPLOYED
2: WIZARD LEAGUE  Fred   2018-01-06 2018-06-29   EMPLOYED
3: WIZARD LEAGUE  Fred   2018-06-30 2019-12-31    RETIRED
4: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED


B <- data.table( group = rep("WIZARD LEAGUE",time=5)
                ,loc_start = as.Date("2017-01-01") + 180*0:4
                ,loc_end = as.Date("2017-01-01") + 180*1:5-1
                , loc = c("US","GER","FRA","ITA","MOR"))

> B
           group  loc_start    loc_end loc
1: WIZARD LEAGUE 2017-01-01 2017-06-29  US
2: WIZARD LEAGUE 2017-06-30 2017-12-26 GER
3: WIZARD LEAGUE 2017-12-27 2018-06-24 FRA
4: WIZARD LEAGUE 2018-06-25 2018-12-21 ITA
5: WIZARD LEAGUE 2018-12-22 2019-06-19 MOR

>#Try to join all rows whose date ranges intersect:

>B[A,on=.(group = group, loc_end >= status_start,  loc_start <= status_end)]

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 12 rows; more than 9 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

>#Try the join with allow.cartesian = TRUE
>#this succeeds but messes up column names

> B[A,on=.(group = group, loc_end >= status_start,  loc_start <= status_end), allow.cartesian = TRUE]
            group  loc_start    loc_end loc  name     status
 1: WIZARD LEAGUE 2018-01-05 2017-01-01  US  Fred UNEMPLOYED
 2: WIZARD LEAGUE 2018-01-05 2017-01-01 GER  Fred UNEMPLOYED
 3: WIZARD LEAGUE 2018-01-05 2017-01-01 FRA  Fred UNEMPLOYED
 4: WIZARD LEAGUE 2018-06-29 2018-01-06 FRA  Fred   EMPLOYED
 5: WIZARD LEAGUE 2018-06-29 2018-01-06 ITA  Fred   EMPLOYED
 6: WIZARD LEAGUE 2019-12-31 2018-06-30 ITA  Fred    RETIRED
 7: WIZARD LEAGUE 2019-12-31 2018-06-30 MOR  Fred    RETIRED
 8: WIZARD LEAGUE 2019-12-31 2017-01-01  US Sally CONTRACTED
 9: WIZARD LEAGUE 2019-12-31 2017-01-01 GER Sally CONTRACTED
10: WIZARD LEAGUE 2019-12-31 2017-01-01 FRA Sally CONTRACTED
11: WIZARD LEAGUE 2019-12-31 2017-01-01 ITA Sally CONTRACTED
12: WIZARD LEAGUE 2019-12-31 2017-01-01 MOR Sally CONTRACTED

>#Create aliased duplicates of the columns in the inequality condition
>#and specify the columns to keep

> keep_cols <- c(names(A),setdiff(names(B),names(A)))
> A[,start_dup := status_start]
> A[,end_dup := status_end]
> B[,start := loc_start]
> B[,end := loc_end]
>
>#Now the join works as expected (by SQL convention)
>
> B[ A
    ,..keep_cols
    ,on=.( group = group
          ,end >= start_dup
          ,start <= end_dup)
          ,allow.cartesian = TRUE]
            group  name status_start status_end     status  loc_start    loc_end loc
 1: WIZARD LEAGUE  Fred   2017-01-01 2018-01-05 UNEMPLOYED 2017-01-01 2017-06-29  US
 2: WIZARD LEAGUE  Fred   2017-01-01 2018-01-05 UNEMPLOYED 2017-06-30 2017-12-26 GER
 3: WIZARD LEAGUE  Fred   2017-01-01 2018-01-05 UNEMPLOYED 2017-12-27 2018-06-24 FRA
 4: WIZARD LEAGUE  Fred   2018-01-06 2018-06-29   EMPLOYED 2017-12-27 2018-06-24 FRA
 5: WIZARD LEAGUE  Fred   2018-01-06 2018-06-29   EMPLOYED 2018-06-25 2018-12-21 ITA
 6: WIZARD LEAGUE  Fred   2018-06-30 2019-12-31    RETIRED 2018-06-25 2018-12-21 ITA
 7: WIZARD LEAGUE  Fred   2018-06-30 2019-12-31    RETIRED 2018-12-22 2019-06-19 MOR
 8: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED 2017-01-01 2017-06-29  US
 9: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED 2017-06-30 2017-12-26 GER
10: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED 2017-12-27 2018-06-24 FRA
11: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED 2018-06-25 2018-12-21 ITA
12: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED 2018-12-22 2019-06-19 MOR

I am certainly not the first to point out these departures from SQL convention, or that reproducing the SQL behaviour is rather cumbersome (as shown above), and I believe improvements are actively being considered.

For anyone considering alternative strategies (e.g. the sqldf package): while there are many worthwhile alternatives to data.table, I have struggled to find anything that comes close to data.table's speed on very large datasets, whether for joins or other operations. Needless to say, it has many other benefits that make it indispensable to me and many others. So for those working with large datasets, if the above looks cumbersome, I would suggest not abandoning data.table joins, but rather getting into the habit of these extra steps, or writing a helper function that reproduces the sequence (a sketch of one possible helper follows) until improved syntax arrives.
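To be concrete, here is a minimal sketch of what such a helper could look like for the date-range-overlap case above. The name overlap_left_join and its arguments are my own invention (nothing in data.table provides this function); it simply automates the copy/alias/join/select sequence, and it assumes both inputs are data.tables and that the *_dup column names do not collide with existing columns:

overlap_left_join <- function(x, y, eq_cols, x_start, x_end, y_start, y_end) {
  # x: left table (all of whose rows are kept), y: right table
  # eq_cols: character vector of equality join columns present in both tables
  # x_start/x_end, y_start/y_end: names of the range columns in x and y
  xx <- copy(x)           # work on copies so the caller's tables are untouched
  yy <- copy(y)
  keep_cols <- c(names(xx), setdiff(names(yy), names(xx)))

  # aliased duplicates of the range columns, so the originals survive the join
  xx[, x_start_dup := get(x_start)]
  xx[, x_end_dup   := get(x_end)]
  yy[, y_start_dup := get(y_start)]
  yy[, y_end_dup   := get(y_end)]

  on_clause <- c(eq_cols, "y_end_dup>=x_start_dup", "y_start_dup<=x_end_dup")
  yy[xx, ..keep_cols, on = on_clause, allow.cartesian = TRUE]
}

# usage, with A and B as originally defined above (before the manual *_dup columns were added):
overlap_left_join(A, B, "group", "status_start", "status_end", "loc_start", "loc_end")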

Finally, I have not touched on disjunctive joins here, but as far as I can tell these are another shortcoming of the data.table approach (and another area where sqldf is useful). I have been getting around them with ad hoc "hacks" of sorts, but I would welcome any helpful suggestions on the best way to handle them in data.table.
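For what it is worth, the kind of "hack" I mean is simply to run one join per disjunct and take the union of the results. A minimal sketch with made-up tables X and Y (the columns and the OR condition are purely illustrative):

X <- data.table(a = c(1L, 2L, 3L), b = c(7L, 1L, 9L), lbl = c("p", "q", "r"))
Y <- data.table(key = c(1L, 9L), val = c(100, 300))

# want all pairs where X.a == Y.key OR X.b == Y.key; on= only expresses
# conjunctions, so join once per disjunct and de-duplicate the union
keep <- c("lbl", "key", "val")
j1 <- Y[X, ..keep, on = .(key = a), nomatch = 0L]
j2 <- Y[X, ..keep, on = .(key = b), nomatch = 0L]
res <- unique(rbindlist(list(j1, j2)))

This scans the tables once per disjunct and only returns the columns listed in keep, so it is very much a workaround rather than a proper solution.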