Is there a data.table way of filling in gaps of time periods?
Is there an elegant way in data.table to fill in missing time periods, like timetk::pad_by_time and tsibble::fill_gaps?
The data might look like this:
library(data.table)
data <- data.table(
  Date = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-02-01",
           "2020-02-01", "2020-03-01", "2020-03-01", "2020-03-01"),
  Card = c(1, 2, 3, 1, 3, 1, 2, 3),
  A    = rnorm(8)
)
Card 2 has an implicitly missing observation for 2020-02-01.
With the tsibble package, you can do the following:
library(tsibble)
library(lubridate)  # provides ymd()
data <- data[, .(Date = yearmonth(ymd(Date)),
                 Card = as.character(Card),
                 A    = as.numeric(A))]
data <- as_tsibble(data, key = Card, index = Date)
data <- fill_gaps(data)
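With the timetk package, you can do the following:
library(timetk)
library(dplyr)      # provides group_by(), ungroup() and %>%
library(lubridate)  # provides ymd()
data <- data[, .(Date = ymd(Date),
                 Card = as.character(Card),
                 A    = as.numeric(A))]
data <- data %>%
  group_by(Card) %>%
  pad_by_time(Date, .by = "month") %>%
  ungroup()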
As for data.table:
If no key is set, then
data2 <- data[CJ(Date, Card, unique = TRUE), on = .(Date, Card)]
data2
# Date Card A
# <char> <num> <num>
# 1: 2020-01-01 1 1.37095845
# 2: 2020-01-01 2 -0.56469817
# 3: 2020-01-01 3 0.36312841
# 4: 2020-02-01 1 0.63286260
# 5: 2020-02-01 2 NA
# 6: 2020-02-01 3 0.40426832
# 7: 2020-03-01 1 -0.10612452
# 8: 2020-03-01 2 1.51152200
# 9: 2020-03-01 3 -0.09465904
(Updated/simplified, thanks to @sindri_baldur!)
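To see which rows the cross join adds, the same CJ() grid can be anti-joined against the data. This is a minimal sketch: the seed and the names all_combos and gaps are illustrative assumptions, not part of the original answer.

```r
library(data.table)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
data <- data.table(
  Date = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-02-01",
           "2020-02-01", "2020-03-01", "2020-03-01", "2020-03-01"),
  Card = c(1, 2, 3, 1, 3, 1, 2, 3),
  A    = rnorm(8)
)

# Every Date x Card combination that occurs anywhere in the data
all_combos <- CJ(Date = data$Date, Card = data$Card, unique = TRUE)

# Anti-join: keep only the combinations that have no matching row in `data`
gaps <- all_combos[!data, on = .(Date, Card)]
gaps
```

Here gaps should hold only the Card 2 / 2020-02-01 combination, i.e. the implicit gap described in the question.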
If a key is set, you can use @Frank's method:
data2 <- data[ do.call(CJ, c(mget(key(data)), unique = TRUE)), ]
From here, you can use nafill as needed, perhaps:
data2[, A := nafill(A, type = "locf"), by = .(Card)]
# Date Card A
# <char> <num> <num>
# 1: 2020-01-01 1 1.37095845
# 2: 2020-01-01 2 -0.56469817
# 3: 2020-01-01 3 0.36312841
# 4: 2020-02-01 1 0.63286260
# 5: 2020-02-01 2 -0.56469817
# 6: 2020-02-01 3 0.40426832
# 7: 2020-03-01 1 -0.10612452
# 8: 2020-03-01 2 1.51152200
# 9: 2020-03-01 3 -0.09465904
(How you fill depends on what you know about the context of the data; it could just as easily be by = .(Date), or some form of imputation.)
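As a rough illustration of other fill choices, the sketch below uses a small made-up table (not the question's data) and shows a constant fill with nafill() and a per-card linear interpolation with stats::approx(). The column names A_zero and A_lin are assumptions for this example, and which fill is appropriate depends on the data.

```r
library(data.table)

# Toy data with gaps already made explicit as NA (hypothetical values)
filled <- data.table(
  Date = rep(c("2020-01-01", "2020-02-01", "2020-03-01"), times = 2),
  Card = rep(c(1, 2), each = 3),
  A    = c(1, NA, 3, 10, 20, NA)
)

# Constant fill: replace every NA with 0
filled[, A_zero := nafill(A, type = "const", fill = 0)]

# Linear interpolation within each card, using row position as the x axis
# (reasonable here because the periods are evenly spaced);
# rule = 2 carries the nearest observed value out to the edges
filled[, A_lin := approx(seq_len(.N), A, xout = seq_len(.N), rule = 2)$y,
       by = Card]
filled
```

Note that approx() needs at least two non-missing values per group for linear interpolation, so very sparse cards would need separate handling.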
Update: the join above expands to every possible combination, so it can pad beyond the outer span of a particular Card. In that case you might see:
data <- data[-1,]
data[CJ(Date, Card, unique = TRUE), on = .(Date, Card)]
# Date Card A
# <char> <num> <num>
# 1: 2020-01-01 1 NA
# 2: 2020-01-01 2 -0.42225588
# 3: 2020-01-01 3 -0.12235017
# 4: 2020-02-01 1 0.18819303
# 5: 2020-02-01 2 NA
# 6: 2020-02-01 3 0.11916096
# 7: 2020-03-01 1 -0.02509255
# 8: 2020-03-01 2 0.10807273
# 9: 2020-03-01 3 -0.48543524
I think there are two approaches:
Run the code above, then remove each group's leading (and trailing) NAs:
data[CJ(Date, Card, unique = TRUE), on = .(Date, Card)
][, .SD[ !is.na(A) | !seq_len(.N) %in% c(1, .N),], by = Card]
# Card Date A
# <num> <char> <num>
# 1: 1 2020-02-01 0.18819303
# 2: 1 2020-03-01 -0.02509255
# 3: 2 2020-01-01 -0.42225588
# 4: 2 2020-02-01 NA
# 5: 2 2020-03-01 0.10807273
# 6: 3 2020-01-01 -0.12235017
# 7: 3 2020-02-01 0.11916096
# 8: 3 2020-03-01 -0.48543524
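The filter above drops an NA only when it sits in the first or last row of a group, so a run of several leading or trailing NAs would be trimmed by just one row. A sketch of one way to trim whole runs, keeping everything between the first and last observed value per card (the table padded and its values are made up for illustration):

```r
library(data.table)

# Toy padded data: Card 1 has two leading NAs, Card 2 has two trailing NAs
padded <- data.table(
  Card = rep(c(1, 2), each = 5),
  Date = rep(sprintf("2020-%02d-01", 1:5), times = 2),
  A    = c(NA, NA, 0.5, NA, 0.7,
           0.1, NA, 0.3, NA, NA)
)

trimmed <- padded[, {
  observed <- which(!is.na(A))           # row positions with real data
  if (length(observed)) {
    .SD[min(observed):max(observed)]     # keep first..last observed row
  } else {
    .SD[0L]                              # card with no data at all: drop it
  }
}, by = Card]
trimmed
```

Interior NAs (the actual gaps) are kept, while padding outside each card's observed span is removed regardless of how long the run is.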
A completely different approach (this assumes Date class, which the above did not strictly require):
data[,Date := as.Date(Date)]
data[data[, .(Date = do.call(seq, c(as.list(range(Date)), by = "month"))),
by = .(Card)],
on = .(Date, Card)]
# Date Card A
# <Date> <num> <num>
# 1: 2020-01-01 2 -0.42225588
# 2: 2020-02-01 2 NA
# 3: 2020-03-01 2 0.10807273
# 4: 2020-01-01 3 -0.12235017
# 5: 2020-02-01 3 0.11916096
# 6: 2020-03-01 3 -0.48543524
# 7: 2020-02-01 1 0.18819303
# 8: 2020-03-01 1 -0.02509255
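If this per-card padding is needed repeatedly, it could be wrapped in a small function. The helper pad_months below is hypothetical (it is not a data.table function) and assumes columns named Date (of class Date) and Card, with monthly data anchored on the same day of the month:

```r
library(data.table)

# Hypothetical helper: pad each Card to a complete monthly sequence
# between that card's own first and last Date
pad_months <- function(dt) {
  spans <- dt[, .(Date = seq(min(Date), max(Date), by = "month")), by = Card]
  dt[spans, on = .(Card, Date)]   # rows missing from dt come back with NA
}

# Toy data: Card 1 is missing February, Card 2 only starts in February
obs <- data.table(
  Date = as.Date(c("2020-01-01", "2020-03-01", "2020-02-01", "2020-03-01")),
  Card = c(1, 1, 2, 2),
  A    = c(0.1, 0.3, 0.2, 0.4)
)

padded <- pad_months(obs)
padded
```

Because each card's sequence starts at its own first date, Card 2 is not padded back to January, which is the behaviour that distinguishes this approach from the CJ() cross join.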