Efficiently subsetting a data.table multiple times
I have data in this format:
> data = data.table(id = 1:10, date = seq(as.Date("2016-01-01"), by = 1, length = 10))
> data
    id       date
 1:  1 2016-01-01
 2:  2 2016-01-02
 3:  3 2016-01-03
 4:  4 2016-01-04
 5:  5 2016-01-05
 6:  6 2016-01-06
 7:  7 2016-01-07
 8:  8 2016-01-08
 9:  9 2016-01-09
10: 10 2016-01-10
I have another table that holds the queries/subsets I want to perform.
> query = data.table(id = c(1,4,7), date_start = c("2016-01-01", "2016-01-01", "2016-01-01"), date_end = c("2016-01-04", "2016-01-02", "2016-01-03"))
> query
   id date_start   date_end
1:  1 2016-01-01 2016-01-04
2:  4 2016-01-01 2016-01-02
3:  7 2016-01-01 2016-01-03
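I would like to do something like this:
subset(data, (id == query$id[1] & date > query$date_start[1] & date < query$date_end[1]) |
             (id == query$id[2] & date > query$date_start[2] & date < query$date_end[2]) |
             (id == query$id[3] & date > query$date_start[3] & date < query$date_end[3]))
Is there a way to generate the subset query automatically, without using a for loop and rbind-ing the results?
Thanks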
require(data.table)
data = data.table(id = 1:10, date = seq(as.Date("2016-01-01"), by = 1, length = 10))
query = data.table(id = c(1,4,7), date_start = c("2016-01-01", "2016-01-01",
"2016-01-01"), date_end = c("2016-01-04", "2016-01-02", "2016-01-03"))
First, you can join them:
data.full <- merge(data, query, by = "id", all.x = TRUE)
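Next, if you want to drop the observations that are not referenced in query and keep only those that are referenced and fall within the date range, you can do this: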
data.final <- data.full[date >= date_start & date <= date_end,]
data.final
   id       date date_start   date_end
1:  1 2016-01-01 2016-01-01 2016-01-04
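Or, if you want to keep the records that are not referenced in query as well as those that are referenced and fall within the date range: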
data.final <- data.full[is.na(date_start) | (date >= date_start & date <= date_end),]
data.final
   id       date date_start   date_end
1:  1 2016-01-01 2016-01-01 2016-01-04
2:  2 2016-01-02         NA         NA
3:  3 2016-01-03         NA         NA
4:  5 2016-01-05         NA         NA
5:  6 2016-01-06         NA         NA
6:  8 2016-01-08         NA         NA
7:  9 2016-01-09         NA         NA
8: 10 2016-01-10         NA         NA
If we transform the OP's data a little, we get
library(data.table)
data = setDT(structure(list(id = 1:10, date = structure(16801:16810, class = c("IDate",
"Date")), date2 = structure(16801:16810, class = c("IDate", "Date"
))), .Names = c("id", "date", "date2"), row.names = c(NA, -10L
), class = c("data.table", "data.frame"), sorted = c("id",
"date", "date2")))
query = setDT(structure(list(id = c(1, 4, 7), date_start =
structure(c(16801L,
16801L, 16801L), class = c("IDate", "Date")), date_end = structure(c(16804L,
16802L, 16803L), class = c("IDate", "Date"))), .Names = c("id",
"date_start", "date_end"), row.names = c(NA, -3L), class = c("data.table",
"data.frame"), sorted = c("id",
"date_start", "date_end")))
...and then we can use foverlaps like this:
foverlaps(data, query, nomatch=0)
#    id date_start   date_end       date      date2
# 1:  1 2016-01-01 2016-01-04 2016-01-01 2016-01-01
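For this approach, I think the following steps are needed before the merge:
- Convert all dates to IDate
- Create an additional date column in the main data
- Set the keys on each table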
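The preparation steps listed above might look like the following when starting from the question's original tables rather than the dput output. This is only a sketch: converting id to integer in query is an extra assumption made here so that both key columns have the same type, and the name res is illustrative.

```r
library(data.table)

# The question's original tables (date_start/date_end are character there)
data  <- data.table(id = 1:10,
                    date = seq(as.Date("2016-01-01"), by = 1, length.out = 10))
query <- data.table(id = c(1, 4, 7),
                    date_start = c("2016-01-01", "2016-01-01", "2016-01-01"),
                    date_end   = c("2016-01-04", "2016-01-02", "2016-01-03"))

# Step 1: make every date column an IDate
# (assumption: id is also cast to integer so it matches data$id)
data[, date := as.IDate(date)]
query[, `:=`(id         = as.integer(id),
             date_start = as.IDate(date_start),
             date_end   = as.IDate(date_end))]

# Step 2: foverlaps works on intervals, so give data a zero-length
# interval by duplicating its date column
data[, date2 := date]

# Step 3: key both tables, with the interval columns last
setkey(data, id, date, date2)
setkey(query, id, date_start, date_end)

# Overlap join; nomatch = 0L drops rows of data with no matching query
res <- foverlaps(data, query, nomatch = 0L)
```

Note that foverlaps treats the intervals as closed, so a date equal to date_start or date_end counts as a match, unlike the strict > and < in the question.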
In the current development version, non-equi joins can be done directly, as follows:
# data.table v1.9.7+
data[query, .(id, x.date), on=.(id, date>=date_start, date<=date_end)]
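If necessary, add nomatch=0L to drop the non-matching rows from the result.
At the moment .(id, x.date) is required, until I work out what the default output of non-equi joins should look like.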
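For reference, here is a self-contained sketch of that non-equi join on the question's data, assuming a data.table version that includes non-equi joins (the 1.9.7 development version mentioned above, or later). The question's query stores its dates as character and its id as double, so this sketch builds query with Date and integer columns instead; that conversion, the column name date in the output, and the name res are assumptions made here.

```r
library(data.table)

data  <- data.table(id = 1:10,
                    date = seq(as.Date("2016-01-01"), by = 1, length.out = 10))

# Assumption: query is built with integer ids and Date columns so that
# the join columns have the same types as in data
query <- data.table(id = c(1L, 4L, 7L),
                    date_start = as.Date(c("2016-01-01", "2016-01-01", "2016-01-01")),
                    date_end   = as.Date(c("2016-01-04", "2016-01-02", "2016-01-03")))

# For each row of query, pick the rows of data with the same id whose
# date lies in [date_start, date_end]; x.date is data's own date column
res <- data[query,
            .(id, date = x.date),
            on = .(id, date >= date_start, date <= date_end),
            nomatch = 0L]
```

To reproduce the strict bounds from the question, the conditions in on= could be written with > and < instead of >= and <=.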