Merge two tables on an exact and a fuzzy match
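I have two tables that I want to merge based on an exact match on one variable and a fuzzy match on another.
Consider the two tables below. For each id1 in dt1, I want to find an id2 in dt2 whose size matches exactly and whose date is equal to or later than the date in dt1. If there are multiple matches, I want to pick one at random.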
library(data.table)

# dt1: the rows to find a match for (one row per id1)
dt1 <- data.table(c("A", "B"), c(2, 3), as.Date(c("2013-03-27", "2014-05-08"), format = '%Y-%m-%d'))
setnames(dt1, c("V1", "V2", "V3"), c("id1", "size", "date"))
# dt2: the candidate rows from which id2 is picked
dt2 <- data.table(1:10, c(2, 4, 3, 2, 2, 2, 3, 2, 4, 4),
                  as.Date(c("2014-02-25", "2011-08-02", "2014-06-21", "2013-11-29", "2012-02-21", "2011-12-02",
                            "2014-04-22", "2011-03-05", "2014-04-21", "2014-10-29"), format = '%Y-%m-%d'))
setnames(dt2, c("V1", "V2", "V3"), c("id2", "size", "date"))
The resulting table could look like this:
id1 size date id2
1: A 2 2013-03-27 1
2: B 3 2014-05-08 3
or like this (depending on the random selection):
id1 size date id2
1: A 2 2013-03-27 4
2: B 3 2014-05-08 3
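I'm not sure this is what most people have in mind when they say 'fuzzy matching'. Do you want to merge the two tables and then sample at random from the matches, like this: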
library(data.table)
library(tidyverse)
set.seed(1234)
dt1 <- data.table(c("A", "B"), c(2, 3), as.Date(c("2013-03-27", "2014-05-08"), format = '%Y-%m-%d'))
setnames(dt1, c("V1", "V2", "V3"),
c("id1", "size", "date"))
dt2 <- data.table(1:10, c(2, 4, 3, 2, 2, 2, 3, 2, 4, 4), as.Date(c("2014-02-25", "2011-08-02", "2014-06-21", "2013-11-29", "2012-02-21", "2011-12-02",
"2014-04-22", "2011-03-05", "2014-04-21", "2014-10-29"), format = '%Y-%m-%d'))
setnames(dt2, c("V1", "V2", "V3"),
c("id2", "size", "date"))
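# Join on size, keep only dt2 rows dated on or after the dt1 date,
# then pick one match at random per id1 (grouping by id1 rather than size,
# so that two id1 values sharing a size each get their own match).
dt <- full_join(dt1, dt2, by = "size") %>%
  filter(date.y >= date.x) %>%
  group_by(id1) %>%
  sample_n(size = 1)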
To join on size and select the appropriate date entries, we can use a non-equi join:
> # Rename the date columns to make the join step clear:
> setnames(dt1, "date", "date1")
> setnames(dt2, "date", "date2")
> # Non equi-join will give all entries in dt2 matching on size where
> # date2 >= date1:
> dt2[dt1, on=.(size, date2 >= date1)]
id2 size date2 id1
1: 4 2 2013-03-27 A
2: 1 2 2013-03-27 A
3: 3 3 2014-05-08 B
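One thing to note about the output above: in a data.table non-equi join, the date2 column of the result holds the date1 values from dt1, not the original dates from dt2. Below is a minimal sketch of one way to keep both dates, using the x. and i. prefixes in j. The output column layout and the name candidates are my own choice, not part of the answer.

```r
library(data.table)

# Same data as in the question, with the date columns already renamed.
dt1 <- data.table(id1 = c("A", "B"), size = c(2, 3),
                  date1 = as.Date(c("2013-03-27", "2014-05-08")))
dt2 <- data.table(id2 = 1:10, size = c(2, 4, 3, 2, 2, 2, 3, 2, 4, 4),
                  date2 = as.Date(c("2014-02-25", "2011-08-02", "2014-06-21",
                                    "2013-11-29", "2012-02-21", "2011-12-02",
                                    "2014-04-22", "2011-03-05", "2014-04-21",
                                    "2014-10-29")))

# x.date2 is the date stored in dt2, i.date1 is the date stored in dt1.
candidates <- dt2[dt1, on = .(size, date2 >= date1),
                  .(id1, size, date1 = i.date1, id2 = x.id2, date2 = x.date2)]
print(candidates)
```

Each row of candidates then carries the dt1 date and the real dt2 date side by side, which makes it easy to check that date2 >= date1 holds.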
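I couldn't find a reliable way to perform the random selection step as part of the join itself. As a hacky solution, we can add a new column containing shuffled row numbers to the table above, and then select, for each id1, the row with the largest shuffled row number: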
> joined <- dt2[dt1, on=.(size, date2 >= date1)]
> joined[, selection_column := sample(.I, .N)]
> filtered <- joined[,.SD[which.max(selection_column)], by=id1]
> filtered[, selection_column := NULL]
> filtered
id1 id2 size date2
1: A 1 2 2013-03-27
2: B 3 3 2014-05-08
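If you would rather avoid the helper column, one possible data.table-only variant is to sample a row index within each id1 group. This is only a sketch and not part of the answer above; the name picked is arbitrary. It uses sample(.N, 1) rather than sample(.I, 1), because sample(x, 1) with a single number x draws from 1:x, which would go wrong for groups with exactly one row.

```r
library(data.table)

dt1 <- data.table(id1 = c("A", "B"), size = c(2, 3),
                  date1 = as.Date(c("2013-03-27", "2014-05-08")))
dt2 <- data.table(id2 = 1:10, size = c(2, 4, 3, 2, 2, 2, 3, 2, 4, 4),
                  date2 = as.Date(c("2014-02-25", "2011-08-02", "2014-06-21",
                                    "2013-11-29", "2012-02-21", "2011-12-02",
                                    "2014-04-22", "2011-03-05", "2014-04-21",
                                    "2014-10-29")))

set.seed(1)  # only to make the random pick reproducible
joined <- dt2[dt1, on = .(size, date2 >= date1)]

# Within each id1 group, draw one position from 1..N and keep that row.
picked <- joined[, .SD[sample(.N, 1)], by = id1]
print(picked)
```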
Alternatively, we can use dplyr for the random selection step:
> library(dplyr)
> dt2[dt1, on=.(size, date2 >= date1)] %>%
+ group_by(id1) %>%
+ sample_n(1) %>%
+ as.data.table()
id2 size date2 id1
1: 4 2 2013-03-27 A
2: 3 3 2014-05-08 B
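In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(). Assuming such a version is installed, the same step could be sketched as follows (the name picked is arbitrary):

```r
library(data.table)
library(dplyr)

dt1 <- data.table(id1 = c("A", "B"), size = c(2, 3),
                  date1 = as.Date(c("2013-03-27", "2014-05-08")))
dt2 <- data.table(id2 = 1:10, size = c(2, 4, 3, 2, 2, 2, 3, 2, 4, 4),
                  date2 = as.Date(c("2014-02-25", "2011-08-02", "2014-06-21",
                                    "2013-11-29", "2012-02-21", "2011-12-02",
                                    "2014-04-22", "2011-03-05", "2014-04-21",
                                    "2014-10-29")))

# Non-equi join in data.table, then one random row per id1 with dplyr.
picked <- dt2[dt1, on = .(size, date2 >= date1)] %>%
  group_by(id1) %>%
  slice_sample(n = 1) %>%
  ungroup() %>%
  as.data.table()
print(picked)
```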