Using data.table to left join with equality and inequality conditions, and multiple matches per left table row
I am trying to work out how to use a data.table approach to join two datasets on conditions that include both equality and inequality sub-conditions. Here is some sample data:
> A <- data.table(name = c("Sally","Joe","Fred"),age = c(20,25,30))
> B <- data.table(name = c("Sally","Joe","Fred","Fred"),age = c(20,30,35,40),condition = c("deceased","good","good","ailing"))
> A
name age
1: Sally 20
2: Joe 25
3: Fred 30
> B
name age condition
1: Sally 20 deceased
2: Joe 30 good
3: Fred 35 good
4: Fred 40 ailing
When I run A[B, on = .(name = name, age < age), condition := i.condition]
I only get the following 3 rows:
> A
name age condition
1: Sally 20 <NA>
2: Joe 25 good
3: Fred 30 ailing
This is counter-intuitive for a typical SQL user, who would expect all rows matching the join conditions to be returned (4 in this case). I am using data.table_1.11.8.
Is there a data.table approach that lets me
- handle conditions whose sub-conditions may be a mix of equality and inequality conditions,
- assign to the existing dataset with := to avoid unnecessary memory use, and
- keep all rows that match the join conditions, as SQL would?
If there is no data.table solution, what is the best alternative (my datasets are large, and I would like to rely on as few packages as possible)?
Edit
To clarify the output I am looking for, here is the SQL whose behaviour I am trying to emulate:
create table #A (
name varchar(50),
age integer
);
insert into #A
values ('Sally',20),
('Joe',25),
('Fred',30);
create table #B (
name varchar(50),
age integer,
condition varchar(50)
);
insert into #B
values ('Sally',20,'deceased'),
('Joe',30,'good'),
('Fred',35,'good'),
('Fred',40,'ailing');
select
#A.*,
condition
from #A left join #B
on #A.name = #B.name
and #A.age < #B.age;
The above returns the following result set:
name age condition
Sally 20 NULL
Joe 25 good
Fred 30 good
Fred 30 ailing
A SQL-style left join (as described in the edit) can be achieved with code very similar to what icecreamtoucan suggested in the comments:
B[A,on=.(name = name, age > age)]
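To return only the columns of the SQL result set (A.*, condition), the selection can be restricted in j, along these lines (a minimal sketch; the i. prefix refers to columns of A, the table inside the brackets):
# keep A's name and age plus B's condition; unmatched rows get an NA condition
B[A, .(name = i.name, age = i.age, condition), on = .(name = name, age > age)]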
Note: if the result set would have more rows than the combined row count of the two tables being joined, data.table assumes you have made a mistake (unlike a SQL engine) and throws an error. The fix (assuming you have not made a mistake) is to add allow.cartesian = TRUE.
Also, unlike SQL, this join does not return all the columns of the two tables. Instead (somewhat frustratingly for someone coming from a SQL background), the values of the left table's columns used in the join's inequality conditions are returned in columns named after the right-table columns they are compared against in those conditions!
The solution here (which I found in another SO answer some time ago but can no longer locate) is to create aliased copies of the join columns you want to keep, use the copies in the join conditions, and then specify the columns to keep in the join.
For example:
A <- data.table( group = rep("WIZARD LEAGUE",3)
,name = rep("Fred",time=3)
,status_start = as.Date("2017-01-01") + c(0,370,545)
,status_end = as.Date("2017-01-01") + c(369,544,365*3-1)
,status = c("UNEMPLOYED","EMPLOYED","RETIRED"))
A <- rbind(A, data.table( group = "WIZARD LEAGUE"
,name = "Sally"
,status_start = as.Date("2017-01-01")
,status_end = as.Date("2019-12-31")
,status = "CONTRACTED"))
> A
group name status_start status_end status
1: WIZARD LEAGUE Fred 2017-01-01 2018-01-05 UNEMPLOYED
2: WIZARD LEAGUE Fred 2018-01-06 2018-06-29 EMPLOYED
3: WIZARD LEAGUE Fred 2018-06-30 2019-12-31 RETIRED
4: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED
B <- data.table( group = rep("WIZARD LEAGUE",time=5)
,loc_start = as.Date("2017-01-01") + 180*0:4
,loc_end = as.Date("2017-01-01") + 180*1:5-1
, loc = c("US","GER","FRA","ITA","MOR"))
> B
group loc_start loc_end loc
1: WIZARD LEAGUE 2017-01-01 2017-06-29 US
2: WIZARD LEAGUE 2017-06-30 2017-12-26 GER
3: WIZARD LEAGUE 2017-12-27 2018-06-24 FRA
4: WIZARD LEAGUE 2018-06-25 2018-12-21 ITA
5: WIZARD LEAGUE 2018-12-22 2019-06-19 MOR
>#Try to join all rows whose date ranges intersect:
>B[A,on=.(group = group, loc_end >= status_start, loc_start <= status_end)]
Error in vecseq(f__, len__, if (allow.cartesian || notjoin ||
!anyDuplicated(f__, : Join results in 12 rows; more than 9 =
nrow(x)+nrow(i). Check for duplicate key values in i each of which
join to the same group in x over and over again. If that's ok, try
by=.EACHI to run j for each group to avoid the large allocation. If
you are sure you wish to proceed, rerun with allow.cartesian=TRUE.
Otherwise, please search for this error message in the FAQ, Wiki,
Stack Overflow and data.table issue tracker for advice.
>#Try the join with allow.cartesian = TRUE
>#this succeeds but messes up column names
> B[A,on=.(group = group, loc_end >= status_start, loc_start <= status_end), allow.cartesian = TRUE]
group loc_start loc_end loc name status
1: WIZARD LEAGUE 2018-01-05 2017-01-01 US Fred UNEMPLOYED
2: WIZARD LEAGUE 2018-01-05 2017-01-01 GER Fred UNEMPLOYED
3: WIZARD LEAGUE 2018-01-05 2017-01-01 FRA Fred UNEMPLOYED
4: WIZARD LEAGUE 2018-06-29 2018-01-06 FRA Fred EMPLOYED
5: WIZARD LEAGUE 2018-06-29 2018-01-06 ITA Fred EMPLOYED
6: WIZARD LEAGUE 2019-12-31 2018-06-30 ITA Fred RETIRED
7: WIZARD LEAGUE 2019-12-31 2018-06-30 MOR Fred RETIRED
8: WIZARD LEAGUE 2019-12-31 2017-01-01 US Sally CONTRACTED
9: WIZARD LEAGUE 2019-12-31 2017-01-01 GER Sally CONTRACTED
10: WIZARD LEAGUE 2019-12-31 2017-01-01 FRA Sally CONTRACTED
11: WIZARD LEAGUE 2019-12-31 2017-01-01 ITA Sally CONTRACTED
12: WIZARD LEAGUE 2019-12-31 2017-01-01 MOR Sally CONTRACTED
>#Create aliased duplicates of the columns in the inequality condition
>#and specify the columns to keep
> keep_cols <- c(names(A),setdiff(names(B),names(A)))
> A[,start_dup := status_start]
> A[,end_dup := status_end]
> B[,start := loc_start]
> B[,end := loc_end]
>
>#Now the join works as expected (by SQL convention)
>
> B[ A
,..keep_cols
,on=.( group = group
,end >= start_dup
,start <= end_dup)
,allow.cartesian = TRUE]
group name status_start status_end status loc_start loc_end loc
1: WIZARD LEAGUE Fred 2017-01-01 2018-01-05 UNEMPLOYED 2017-01-01 2017-06-29 US
2: WIZARD LEAGUE Fred 2017-01-01 2018-01-05 UNEMPLOYED 2017-06-30 2017-12-26 GER
3: WIZARD LEAGUE Fred 2017-01-01 2018-01-05 UNEMPLOYED 2017-12-27 2018-06-24 FRA
4: WIZARD LEAGUE Fred 2018-01-06 2018-06-29 EMPLOYED 2017-12-27 2018-06-24 FRA
5: WIZARD LEAGUE Fred 2018-01-06 2018-06-29 EMPLOYED 2018-06-25 2018-12-21 ITA
6: WIZARD LEAGUE Fred 2018-06-30 2019-12-31 RETIRED 2018-06-25 2018-12-21 ITA
7: WIZARD LEAGUE Fred 2018-06-30 2019-12-31 RETIRED 2018-12-22 2019-06-19 MOR
8: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED 2017-01-01 2017-06-29 US
9: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED 2017-06-30 2017-12-26 GER
10: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED 2017-12-27 2018-06-24 FRA
11: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED 2018-06-25 2018-12-21 ITA
12: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED 2018-12-22 2019-06-19 MOR
I am certainly not the first to point out these departures from SQL convention, nor that reproducing the SQL behaviour is fairly cumbersome (as shown above), and I believe improvements are actively being considered.
To anyone considering alternative strategies (such as the sqldf package): while there are many worthwhile alternatives to data.table, I have struggled to find anything that matches data.table for speed on very large datasets, whether for joins or other operations. Needless to say, there are many other benefits that make the package indispensable to me and to many others. So, for those working with large datasets, if the above looks cumbersome I would suggest not abandoning data.table joins, but rather getting into the habit of these steps, or writing a helper function that replicates the sequence of actions, until improved syntax arrives.
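A rough sketch of what such a helper might look like for the interval-overlap join above (overlap_left_join is only an illustrative name, and the sketch assumes the two tables share no column names other than the by column(s) and that the temporary *_start_ / *_end_ names do not collide with existing columns):
library(data.table)

# SQL-style left join of rhs onto lhs wherever the two intervals overlap,
# keeping the original columns of both tables by joining on temporary copies
overlap_left_join <- function(lhs, rhs, by,
                              lhs_start, lhs_end, rhs_start, rhs_end) {
  lhs <- copy(lhs); rhs <- copy(rhs)                     # leave the caller's tables untouched
  set(lhs, j = "lhs_start_", value = lhs[[lhs_start]])   # aliased copies used
  set(lhs, j = "lhs_end_",   value = lhs[[lhs_end]])     # only in the join conditions
  set(rhs, j = "rhs_start_", value = rhs[[rhs_start]])
  set(rhs, j = "rhs_end_",   value = rhs[[rhs_end]])
  keep <- c(setdiff(names(lhs), c("lhs_start_", "lhs_end_")),
            setdiff(names(rhs), c(names(lhs), "rhs_start_", "rhs_end_")))
  rhs[lhs, ..keep,
      on = c(by, "rhs_end_>=lhs_start_", "rhs_start_<=lhs_end_"),
      allow.cartesian = TRUE]
}

# e.g., with A and B as originally created above (before the *_dup columns were
# added), this should reproduce the 12-row result:
overlap_left_join(A, B, by = "group",
                  lhs_start = "status_start", lhs_end = "status_end",
                  rhs_start = "loc_start",    rhs_end = "loc_end")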
Finally, I have not touched on disjunctive joins here, but as far as I can tell this is another shortcoming of the data.table approach (and another area where sqldf is useful). I have been getting around these with ad hoc "hacks" of sorts, but I would appreciate any useful advice on the best way to handle them in data.table.
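For what it is worth, one possible hack of that sort is to evaluate each disjunct as its own join and then de-duplicate the union of the results. A sketch for the question's A and B and the disjunctive condition A.name = B.name OR A.age < B.age (inner-join semantics):
# one join per disjunct, selecting the same columns from each
j1 <- B[A, .(name = i.name, age = i.age, condition),
        on = .(name = name), nomatch = 0L]
j2 <- B[A, .(name = i.name, age = i.age, condition),
        on = .(age > age), nomatch = 0L, allow.cartesian = TRUE]
# unique() de-duplicates on the selected columns, so keep enough identifying
# columns if the exact SQL row multiplicity matters
unique(rbind(j1, j2))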