Merging based on multiple ranges
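I want to merge two data frames on multiple ranges. I have put together a representative example below. The sqldf solution works, but I would like to know whether there is a better way to do this (for example, with data.table).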
library(data.table)
library(sqldf)
library(magrittr)  # provides %>%

base <- data.frame(lower1 = c(12, 12, 3, 2), upper1 = c(20, 20, 20, 4),
                   lower2 = c(12, 12, 3, 2), upper2 = c(20, 20, 20, 4)) %>%
  data.table()
more_info <- data.frame(color = 'red', value1 = 4, value2 = 4, thing1 = 5, thing2 = 5) %>%
  data.table()
setkey(base, lower1, upper1, lower2, upper2)
setkey(more_info, value1, value2, thing1, thing2)

# works
sqldf('select * from base left join more_info
       on (base.lower1 <= more_info.value1 and base.upper1 >= more_info.value1
       and base.lower2 <= more_info.thing1 and base.upper2 >= more_info.thing1)')

# doesn't work but is what I would like to do
setkey(base, lower1, upper1, lower2, upper2)
setkey(more_info, value1, value2, thing1, thing2)
foverlaps(more_info, base, by.x = key(more_info), by.y = key(base), type = 'within',
          mult = 'all', nomatch = NA)
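As a bit of background, I have a matching algorithm whose run time I need to improve. It works by filtering a large number of loans down to a smaller number of potential matches based on certain characteristics. I then apply whatever additional statistical techniques are needed to find the best match. The bottleneck is repeatedly filtering the large data set of all candidates to reduce the number of potential matches. My goal is to find a faster way to build a data frame of potential matches, and then use group-by and other vectorized functions to finish the matching process.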
Something like this:
more_info[base, .(lower1, upper1, lower2, upper2, color, value1 = x.value1,
                  value2 = x.value2, thing1 = x.thing1, thing2 = x.thing2),
          on = .(value1 >= lower1, value1 <= upper1, thing1 >= lower2, thing1 <= upper2)]
Output:
   lower1 upper1 lower2 upper2 color value1 value2 thing1 thing2
1:     12     20     12     20  <NA>     NA     NA     NA     NA
2:     12     20     12     20  <NA>     NA     NA     NA     NA
3:      3     20      3     20   red      4      4      5      5
4:      2      4      2      4  <NA>     NA     NA     NA     NA
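If only the potential matches are needed for a later group-by step, as the question describes, the same non-equi join can be turned into an inner join. The following is a minimal sketch under assumptions, not part of the original answer: the `id` column and the `candidates` name are hypothetical and added here only so there is something to group by, and `nomatch = NULL` assumes a reasonably recent data.table release.

```r
library(data.table)

# Same toy data as in the question, plus a hypothetical `id` column
# (an assumption, added only to identify base rows when grouping).
base <- data.table(id = 1:4,
                   lower1 = c(12, 12, 3, 2), upper1 = c(20, 20, 20, 4),
                   lower2 = c(12, 12, 3, 2), upper2 = c(20, 20, 20, 4))
more_info <- data.table(color = "red", value1 = 4, value2 = 4,
                        thing1 = 5, thing2 = 5)

# Non-equi join on both ranges; nomatch = NULL drops base rows
# that have no candidate instead of filling them with NA.
candidates <- more_info[base,
                        .(id, color, value1 = x.value1, thing1 = x.thing1),
                        on = .(value1 >= lower1, value1 <= upper1,
                               thing1 >= lower2, thing1 <= upper2),
                        nomatch = NULL]

# Number of potential matches per base row
candidates[, .N, by = id]
```

Any further scoring or filtering of `candidates` can then be done per group with `by = id`.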