Using data.table to left join with equality and inequality conditions, and multiple matches per left table row

I am trying to work out how to use a data.table approach to join two datasets on a condition that mixes equality and inequality sub-conditions. Here is some sample data:

> A <- data.table(name = c("Sally","Joe","Fred"),age = c(20,25,30))
> B <- data.table(name = c("Sally","Joe","Fred","Fred"),age = c(20,30,35,40),condition = c("deceased","good","good","ailing"))
> A
    name age
1: Sally  20
2:   Joe  25
3:  Fred  30

> B
    name age condition
1: Sally  20  deceased
2:   Joe  30      good
3:  Fred  35      good
4:  Fred  40    ailing

When I run A[B, on = .(name = name, age < age), condition := i.condition], I only get the following 3 rows:

> A
    name age condition
1: Sally  20      <NA>
2:   Joe  25      good
3:  Fred  30    ailing

Counter-intuitively for a typical SQL user, this does not return all rows that match the join condition (4 in this case). I am using data.table_1.11.8.

Is there a data.table approach that lets me

  1. handle join conditions whose sub-conditions mix equality and inequality,
  2. assign to the existing dataset with := to avoid unnecessary memory use, and
  3. keep all rows that match the join condition, as SQL would?

If there is no data.table solution, what is the best alternative (my datasets are large and I would like to rely on as few packages as possible)?

EDIT

To clarify the output I am looking for, here is the SQL code whose behaviour I am trying to emulate:

create table #A (
name varchar(50),
age integer
);

insert into #A
values ('Sally',20),
       ('Joe',25),
       ('Fred',30);

create table #B (
name varchar(50),
age integer,
condition varchar(50)
);

insert into #B
values ('Sally',20,'deceased'),
       ('Joe',30,'good'),
       ('Fred',35,'good'),
       ('Fred',40,'ailing');

select
#A.*,
condition
from #A left join #B
on  #A.name = #B.name
and #A.age < #B.age;

The above returns the following result set:

name    age   condition
Sally   20    NULL
Joe     25    good
Fred    30    good
Fred    30    ailing

If an SQL-style left join (as described in the edit) is what is needed, it can be achieved with code very similar to icecreamtoucan's suggestion in the comments:

B[A,on=.(name = name, age > age)]

Note: if the result set would contain more rows than the sum of the rows of the joined tables, data.table assumes you have made a mistake (unlike an SQL engine) and throws an error. The fix (assuming you have not in fact made a mistake) is to add allow.cartesian = TRUE.
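For this small example the flag is not actually needed (the result has only 4 rows), but when it is, the call simply becomes:

B[A, on = .(name = name, age > age), allow.cartesian = TRUE]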

Also, unlike SQL, this join does not return all columns of both tables. Instead (somewhat frustratingly for someone coming from an SQL background), the values of the left table's columns used in the inequality conditions are returned in columns that carry the names of the right table's columns they were compared against in those inequality conditions!

The workaround here (which I found in another SO answer some time ago but cannot locate now) is to create aliased duplicates of the join columns you want to keep, use the duplicates in the join condition, and then specify the columns to keep in the join.

For example:

A <- data.table( group = rep("WIZARD LEAGUE",3)
                ,name = rep("Fred",time=3)
                ,status_start = as.Date("2017-01-01") + c(0,370,545)
                ,status_end = as.Date("2017-01-01") + c(369,544,365*3-1) 
                ,status = c("UNEMPLOYED","EMPLOYED","RETIRED"))
A <- rbind(A, data.table( group = "WIZARD LEAGUE"
                         ,name = "Sally"
                         ,status_start = as.Date("2017-01-01")
                         ,status_end = as.Date("2019-12-31")
                         ,status = "CONTRACTED"))
> A
           group  name status_start status_end     status
1: WIZARD LEAGUE  Fred   2017-01-01 2018-01-05 UNEMPLOYED
2: WIZARD LEAGUE  Fred   2018-01-06 2018-06-29   EMPLOYED
3: WIZARD LEAGUE  Fred   2018-06-30 2019-12-31    RETIRED
4: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED


B <- data.table( group = rep("WIZARD LEAGUE",time=5)
                ,loc_start = as.Date("2017-01-01") + 180*0:4
                ,loc_end = as.Date("2017-01-01") + 180*1:5-1
                , loc = c("US","GER","FRA","ITA","MOR"))

> B
           group  loc_start    loc_end loc
1: WIZARD LEAGUE 2017-01-01 2017-06-29  US
2: WIZARD LEAGUE 2017-06-30 2017-12-26 GER
3: WIZARD LEAGUE 2017-12-27 2018-06-24 FRA
4: WIZARD LEAGUE 2018-06-25 2018-12-21 ITA
5: WIZARD LEAGUE 2018-12-22 2019-06-19 MOR

>#Try to join all rows whose date ranges intersect:

>B[A,on=.(group = group, loc_end >= status_start,  loc_start <= status_end)]

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 12 rows; more than 9 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

>#Try the join with allow.cartesian = TRUE
>#this succeeds but messes up column names

> B[A,on=.(group = group, loc_end >= status_start,  loc_start <= status_end), allow.cartesian = TRUE]
            group  loc_start    loc_end loc  name     status
 1: WIZARD LEAGUE 2018-01-05 2017-01-01  US  Fred UNEMPLOYED
 2: WIZARD LEAGUE 2018-01-05 2017-01-01 GER  Fred UNEMPLOYED
 3: WIZARD LEAGUE 2018-01-05 2017-01-01 FRA  Fred UNEMPLOYED
 4: WIZARD LEAGUE 2018-06-29 2018-01-06 FRA  Fred   EMPLOYED
 5: WIZARD LEAGUE 2018-06-29 2018-01-06 ITA  Fred   EMPLOYED
 6: WIZARD LEAGUE 2019-12-31 2018-06-30 ITA  Fred    RETIRED
 7: WIZARD LEAGUE 2019-12-31 2018-06-30 MOR  Fred    RETIRED
 8: WIZARD LEAGUE 2019-12-31 2017-01-01  US Sally CONTRACTED
 9: WIZARD LEAGUE 2019-12-31 2017-01-01 GER Sally CONTRACTED
10: WIZARD LEAGUE 2019-12-31 2017-01-01 FRA Sally CONTRACTED
11: WIZARD LEAGUE 2019-12-31 2017-01-01 ITA Sally CONTRACTED
12: WIZARD LEAGUE 2019-12-31 2017-01-01 MOR Sally CONTRACTED

>#Create aliased duplicates of the columns in the inequality condition
>#and specify the columns to keep

> keep_cols <- c(names(A),setdiff(names(B),names(A)))
> A[,start_dup := status_start]
> A[,end_dup := status_end]
> B[,start := loc_start]
> B[,end := loc_end]
>
>#Now the join works as expected (by SQL convention)
>
> B[ A
    ,..keep_cols
    ,on=.( group = group
          ,end >= start_dup
          ,start <= end_dup)
          ,allow.cartesian = TRUE]
            group  name status_start status_end     status  loc_start    loc_end loc
 1: WIZARD LEAGUE  Fred   2017-01-01 2018-01-05 UNEMPLOYED 2017-01-01 2017-06-29  US
 2: WIZARD LEAGUE  Fred   2017-01-01 2018-01-05 UNEMPLOYED 2017-06-30 2017-12-26 GER
 3: WIZARD LEAGUE  Fred   2017-01-01 2018-01-05 UNEMPLOYED 2017-12-27 2018-06-24 FRA
 4: WIZARD LEAGUE  Fred   2018-01-06 2018-06-29   EMPLOYED 2017-12-27 2018-06-24 FRA
 5: WIZARD LEAGUE  Fred   2018-01-06 2018-06-29   EMPLOYED 2018-06-25 2018-12-21 ITA
 6: WIZARD LEAGUE  Fred   2018-06-30 2019-12-31    RETIRED 2018-06-25 2018-12-21 ITA
 7: WIZARD LEAGUE  Fred   2018-06-30 2019-12-31    RETIRED 2018-12-22 2019-06-19 MOR
 8: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED 2017-01-01 2017-06-29  US
 9: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED 2017-06-30 2017-12-26 GER
10: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED 2017-12-27 2018-06-24 FRA
11: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED 2018-06-25 2018-12-21 ITA
12: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED 2018-12-22 2019-06-19 MOR

I am certainly not the first to point out these departures from SQL convention, or that reproducing the SQL behaviour is rather cumbersome (as shown above), and I believe improvements are actively being considered.

For anyone considering alternative strategies (e.g. the sqldf package): while there are many worthwhile alternatives to data.table, I have struggled to find anything that comes close to data.table's speed on very large datasets, whether for joins or other operations. Needless to say, it has many other benefits that make it indispensable to me and many others. So for those working with large datasets, if the above looks cumbersome, I would suggest not abandoning data.table joins, but rather getting into the habit of these extra steps, or writing a helper function that reproduces the sequence (a sketch of one possible helper follows) until improved syntax arrives.
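To be concrete, here is a minimal sketch of what such a helper could look like for the date-range-overlap case above. The name overlap_left_join and its arguments are my own invention (nothing in data.table provides this function); it simply automates the copy/alias/join/select sequence, and it assumes both inputs are data.tables and that the *_dup column names do not collide with existing columns:

overlap_left_join <- function(x, y, eq_cols, x_start, x_end, y_start, y_end) {
  # x: left table (all of whose rows are kept), y: right table
  # eq_cols: character vector of equality join columns present in both tables
  # x_start/x_end, y_start/y_end: names of the range columns in x and y
  xx <- copy(x)           # work on copies so the caller's tables are untouched
  yy <- copy(y)
  keep_cols <- c(names(xx), setdiff(names(yy), names(xx)))

  # aliased duplicates of the range columns, so the originals survive the join
  xx[, x_start_dup := get(x_start)]
  xx[, x_end_dup   := get(x_end)]
  yy[, y_start_dup := get(y_start)]
  yy[, y_end_dup   := get(y_end)]

  on_clause <- c(eq_cols, "y_end_dup>=x_start_dup", "y_start_dup<=x_end_dup")
  yy[xx, ..keep_cols, on = on_clause, allow.cartesian = TRUE]
}

# usage, with A and B as originally defined above (before the manual *_dup columns were added):
overlap_left_join(A, B, "group", "status_start", "status_end", "loc_start", "loc_end")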

Finally, I have not touched on disjunctive joins here, but as far as I can tell these are another shortcoming of the data.table approach (and another area where sqldf is useful). I have been getting around them with ad hoc "hacks" of sorts, but I would welcome any helpful suggestions on the best way to handle them in data.table.
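For what it is worth, the kind of "hack" I mean is simply to run one join per disjunct and take the union of the results. A minimal sketch with made-up tables X and Y (the columns and the OR condition are purely illustrative):

X <- data.table(a = c(1L, 2L, 3L), b = c(7L, 1L, 9L), lbl = c("p", "q", "r"))
Y <- data.table(key = c(1L, 9L), val = c(100, 300))

# want all pairs where X.a == Y.key OR X.b == Y.key; on= only expresses
# conjunctions, so join once per disjunct and de-duplicate the union
keep <- c("lbl", "key", "val")
j1 <- Y[X, ..keep, on = .(key = a), nomatch = 0L]
j2 <- Y[X, ..keep, on = .(key = b), nomatch = 0L]
res <- unique(rbindlist(list(j1, j2)))

This scans the tables once per disjunct and only returns the columns listed in keep, so it is very much a workaround rather than a proper solution.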