R/data.table: Partial rolling join
I have the following data structure:
> dt
ID MiscInfo Date Val
1: A info_a 2000-01-01 0
2: A info_a 2000-01-03 3
3: B info_b 2001-01-01 1
4: B info_b 2001-01-04 5
5: B info_b 2001-01-07 13
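Here, Date is missing some entries per ID, namely those where Val == 0, and MiscInfo stands for a set of N > 50 attribute variables. My end goal is to fill in the missing entries so that I get the following structure: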
> dt_pref
ID MiscInfo Date Val
1: A info_a 2000-01-01 0
2: A info_a 2000-01-02 0
3: A info_a 2000-01-03 3
4: B info_b 2001-01-01 1
5: B info_b 2001-01-02 0
6: B info_b 2001-01-03 0
7: B info_b 2001-01-04 5
8: B info_b 2001-01-05 0
9: B info_b 2001-01-06 0
10: B info_b 2001-01-07 13
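Judging from similar requests, a rolling join looks like a good way to achieve this. The problem I run into is that I cannot select which columns get rolled, as shown below: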
drange = dt[, .(Date = seq(min(Date), max(Date), 1)), ID] %>% setkey(ID, Date)
dt[drange, roll = TRUE]
ID MiscInfo Date Val
1: A info_a 2000-01-01 0
2: A info_a 2000-01-02 0
3: A info_a 2000-01-03 3
4: B info_b 2001-01-01 1
5: B info_b 2001-01-02 1
6: B info_b 2001-01-03 1
7: B info_b 2001-01-04 5
8: B info_b 2001-01-05 5
9: B info_b 2001-01-06 5
10: B info_b 2001-01-07 13
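In this case the MiscInfo column has been rolled exactly the way I want, but the Val column is of course rolled as well, whereas I want the filled-in values set to 0. I could of course also go the other way with roll = 0: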
dt[drange, roll = 0]
ID MiscInfo Date Val
1: A info_a 2000-01-01 0
2: A <NA> 2000-01-02 NA
3: A info_a 2000-01-03 3
4: B info_b 2001-01-01 1
5: B <NA> 2001-01-02 NA
6: B <NA> 2001-01-03 NA
7: B info_b 2001-01-04 5
8: B <NA> 2001-01-05 NA
9: B <NA> 2001-01-06 NA
10: B info_b 2001-01-07 13
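In this case I could of course apply something like dt[is.na(Val), Val := 0], but handling the NA entries of the (very large) MiscInfo set of columns along the same lines is computationally inefficient, and I suspect there is a join-related way to do this. In short, I want to preset Val to 0 for the "filled-in" entries and roll the remaining columns in an efficient way. Any ideas?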
Reproducible example:
library(data.table)
library(magrittr)  # provides %>%
dt = data.table(
ID = c('A', 'A', 'B', 'B', 'B'),
MiscInfo = c(rep('info_a', 2), rep('info_b', 3)),
Date = as.Date(c('2000-01-01', '2000-01-03', '2001-01-01', '2001-01-04', '2001-01-07')),
Val = c(0,3,1,5,13)
) %>% setkey(ID, Date)
dt_pref = data.table(
ID = c(rep('A', 3), rep('B', 7)),
MiscInfo = c(rep("info_a", 3), rep("info_b", 7)),
Date = as.Date(c(10957, 10958, 10959, 11323, 11324, 11325, 11326, 11327, 11328, 11329), origin = '1970-01-01'),
Val = c(0, 0, 3, 1, 0, 0, 5, 0, 0, 13)
)
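Perhaps this could be used, even for more complex cases:
library(dplyr)  # case_when() comes from dplyr
# Build the full daily grid per ID (carrying MiscInfo along), then full-join it onto dt
merge(dt,
      dt[, .(Date = seq.Date(from = min(Date), to = max(Date), by = 1)), by = c("ID", "MiscInfo")],
      by = c("ID", "Date"),
      all = TRUE)[, .(ID, Date, MiscInfo.y,
                      # rows that exist only in the grid have NA in Val: set those to 0
                      Val = case_when(is.na(Val) ~ 0, TRUE ~ Val))]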
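For comparison, here is a rough sketch that stays with the rolling join the question started from, instead of the merge above. The idea is to roll every column forward, then overwrite Val with the result of an exact join on the same grid, so that only the filled-in rows end up as 0. It assumes each (ID, Date) pair is unique in dt, that Val is a double with no genuine NAs of its own, and a data.table version that ships fcoalesce() (1.12.4 or later). The names res and exact_val are only illustrative.

```r
library(data.table)

dt <- data.table(
  ID       = c("A", "A", "B", "B", "B"),
  MiscInfo = c(rep("info_a", 2), rep("info_b", 3)),
  Date     = as.Date(c("2000-01-01", "2000-01-03", "2001-01-01",
                       "2001-01-04", "2001-01-07")),
  Val      = c(0, 3, 1, 5, 13)
)
setkey(dt, ID, Date)

# Full daily grid per ID, from the first to the last observed date
drange <- dt[, .(Date = seq(min(Date), max(Date), by = "day")), by = ID]

# 1) Rolling join: every column of dt, including Val, is carried forward
res <- dt[drange, on = .(ID, Date), roll = TRUE]

# 2) Exact join on the same grid, pulling out Val only:
#    NA wherever the date was absent from dt (assumes unique ID/Date pairs)
exact_val <- dt[drange, on = .(ID, Date), Val]

# 3) Replace the rolled Val by the exact one, with 0 for the filled-in rows
#    (fcoalesce needs matching types, hence the assumption that Val is double)
res[, Val := fcoalesce(exact_val, 0)]
```

Only Val is touched after the join, so the MiscInfo columns need no NA handling however many there are. The second join reuses the same grid and keys, at the cost of one extra lookup.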