Fill dates by groups in a data frame in R
I want to go from this:
name | code | date | usage | result |
---|---|---|---|---|
Jennifer Aniston | 23211 | 2021-11-04 | 345 | 1 |
Jennifer Aniston | 23211 | 2021-11-05 | 260 | 1 |
Jennifer Aniston | 23211 | 2021-11-06 | 230 | 0 |
Jennifer Aniston | 23211 | 2021-11-07 | 0 | 0 |
Matthew Perry | 44215 | 2022-10-01 | 312 | 1 |
Matthew Perry | 44215 | 2022-10-04 | 230 | 0 |
Matthew Perry | 44215 | 2021-10-05 | 232 | 0 |
Lisa Kudrow | 55120 | 2022-01-01 | 132 | 0 |
Lisa Kudrow | 55120 | 2022-01-02 | 125 | 0 |
Lisa Kudrow | 55120 | 2022-01-04 | 345 | 1 |
Lisa Kudrow | 55120 | 2022-01-06 | 321 | 1 |
Lisa Kudrow | 55120 | 2022-01-07 | 431 | 1 |
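(Note: for Jennifer Aniston we have every date from the earliest to the latest date that appears. For Matthew Perry dates are missing: the data starts on 2022-10-01, but October 2 and 3 are absent. The same goes for Lisa Kudrow: the data starts on January 1, but January 3 and 5 are missing.)
to this: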
name | code | date | usage | result |
---|---|---|---|---|
Jennifer Aniston | 23211 | 2021-11-04 | 345 | 1 |
Jennifer Aniston | 23211 | 2021-11-05 | 260 | 1 |
Jennifer Aniston | 23211 | 2021-11-06 | 230 | 0 |
Jennifer Aniston | 23211 | 2021-11-07 | 0 | 0 |
Matthew Perry | 44215 | 2022-10-01 | 312 | 1 |
Matthew Perry | 44215 | 2022-10-02 | NA | NA |
Matthew Perry | 44215 | 2022-10-03 | NA | NA |
Matthew Perry | 44215 | 2022-10-04 | 230 | 0 |
Matthew Perry | 44215 | 2021-10-05 | 232 | 0 |
Lisa Kudrow | 55120 | 2022-01-01 | 132 | 0 |
Lisa Kudrow | 55120 | 2022-01-02 | 125 | 0 |
Lisa Kudrow | 55120 | 2022-01-03 | NA | NA |
Lisa Kudrow | 55120 | 2022-01-04 | 345 | 1 |
Lisa Kudrow | 55120 | 2022-01-05 | NA | NA |
Lisa Kudrow | 55120 | 2022-01-06 | 321 | 1 |
Lisa Kudrow | 55120 | 2022-01-07 | 431 | 1 |
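So now we have all the dates, with NA wherever no data is available, and with the person's name and code filled in.
Any idea how to achieve this in R? (Preferably using dplyr and pipes.)
Assuming your dates are actually stored as Date objects rather than character strings (we can't tell from the table in the question), and assuming the year in row 7 is a typo (2021 instead of 2022), you could do this: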
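library(tidyverse)

df %>%
  split(.$name) %>%   # one data frame per person
  lapply(function(x) {
    # add a row for every day between this person's first and last date;
    # name and code are filled in, usage and result stay NA
    complete(x, expand(x, date = seq(min(x$date), max(x$date), by = 'day')),
             fill = list(name = x$name[1], code = x$code[1]))
  }) %>%
  bind_rows()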
#> # A tibble: 16 x 5
#> date name code usage result
#> <date> <chr> <int> <int> <int>
#> 1 2021-11-04 Jennifer Aniston 23211 345 1
#> 2 2021-11-05 Jennifer Aniston 23211 260 1
#> 3 2021-11-06 Jennifer Aniston 23211 230 0
#> 4 2021-11-07 Jennifer Aniston 23211 0 0
#> 5 2022-01-01 Lisa Kudrow 55120 132 0
#> 6 2022-01-02 Lisa Kudrow 55120 125 0
#> 7 2022-01-03 Lisa Kudrow 55120 NA NA
#> 8 2022-01-04 Lisa Kudrow 55120 345 1
#> 9 2022-01-05 Lisa Kudrow 55120 NA NA
#> 10 2022-01-06 Lisa Kudrow 55120 321 1
#> 11 2022-01-07 Lisa Kudrow 55120 431 1
#> 12 2022-10-01 Matthew Perry 44215 312 1
#> 13 2022-10-02 Matthew Perry 44215 NA NA
#> 14 2022-10-03 Matthew Perry 44215 NA NA
#> 15 2022-10-04 Matthew Perry 44215 230 0
#> 16 2022-10-05 Matthew Perry 44215 232 0
Created on 2022-06-02 by the reprex package (v2.0.1)

Data taken from the question in reproducible format:
df <- structure(list(name = c("Jennifer Aniston", "Jennifer Aniston",
"Jennifer Aniston", "Jennifer Aniston", "Matthew Perry", "Matthew Perry",
"Matthew Perry", "Lisa Kudrow", "Lisa Kudrow", "Lisa Kudrow",
"Lisa Kudrow", "Lisa Kudrow"), code = c(23211L, 23211L, 23211L,
23211L, 44215L, 44215L, 44215L, 55120L, 55120L, 55120L, 55120L,
55120L), date = structure(c(18935, 18936, 18937, 18938, 19266,
19269, 19270, 18993, 18994, 18996, 18998, 18999), class = "Date"),
usage = c(345L, 260L, 230L, 0L, 312L, 230L, 232L, 132L, 125L,
345L, 321L, 431L), result = c(1L, 1L, 0L, 0L, 1L, 0L, 0L,
0L, 0L, 1L, 1L, 1L)), row.names = c(NA, -12L), class = "data.frame")
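The integers inside `structure(..., class = "Date")` above are day counts since 1970-01-01, which is how R stores `Date` values. If your own `date` column is still character (which it may be if you typed in the table from the question), a conversion along these lines should be enough before running code like the above. `df_chr` is a made-up name used only for illustration:

```r
# Hypothetical input: dates stored as character strings, as in the question's table
df_chr <- data.frame(
  name = c("Lisa Kudrow", "Lisa Kudrow", "Lisa Kudrow"),
  code = c(55120L, 55120L, 55120L),
  date = c("2022-01-01", "2022-01-02", "2022-01-04"),
  stringsAsFactors = FALSE
)

# as.Date() parses "YYYY-MM-DD" strings; the format is spelled out for clarity
df_chr$date <- as.Date(df_chr$date, format = "%Y-%m-%d")

class(df_chr$date)
#> [1] "Date"

# The underlying numbers are days since 1970-01-01, matching the dput() output
as.numeric(df_chr$date)
#> [1] 18993 18994 18996
```

Without this step, `seq(min(x$date), max(x$date), by = 'day')` has no dates to work with, because `seq()` cannot build a daily sequence from character strings.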
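Allan's answer is great. However, since I have struggled with expand() and complete() in the past, I wrote a custom function that I now use often. Maybe you will find it useful (and if you improve it, please let me know ;-)).
See the example below, using the df from Allan's post: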
autocomplete <- function(.data, .nestingvars, .indexvar) {
  # initialize the index that will expand the groups formed in .data by .nestingvars
  index <- data.frame(index = seq.Date(from = min(.data[[.indexvar]]),
                                       to = max(.data[[.indexvar]]),
                                       by = "day"))
  names(index)[1] <- .indexvar
  # perform a cross join to get all possible combinations of .nestingvars and .indexvar
  index <- base::merge(index, unique(.data[.nestingvars]), by = NULL)
  # merge index and data
  out <- base::merge(.data, index, by = c(.nestingvars, .indexvar), all.y = TRUE)
  return(out)
}

library(dplyr)

split(df, ~ name, drop = TRUE) %>%
  purrr::map(.x = .,
             .f = ~ autocomplete(.data = .,
                                 .nestingvars = c("name", "code"),
                                 .indexvar = "date")) %>%
  bind_rows()
#> name code date usage result
#> 1 Jennifer Aniston 23211 2021-11-04 345 1
#> 2 Jennifer Aniston 23211 2021-11-05 260 1
#> 3 Jennifer Aniston 23211 2021-11-06 230 0
#> 4 Jennifer Aniston 23211 2021-11-07 0 0
#> 5 Lisa Kudrow 55120 2022-01-01 132 0
#> 6 Lisa Kudrow 55120 2022-01-02 125 0
#> 7 Lisa Kudrow 55120 2022-01-03 NA NA
#> 8 Lisa Kudrow 55120 2022-01-04 345 1
#> 9 Lisa Kudrow 55120 2022-01-05 NA NA
#> 10 Lisa Kudrow 55120 2022-01-06 321 1
#> 11 Lisa Kudrow 55120 2022-01-07 431 1
#> 12 Matthew Perry 44215 2022-10-01 312 1
#> 13 Matthew Perry 44215 2022-10-02 NA NA
#> 14 Matthew Perry 44215 2022-10-03 NA NA
#> 15 Matthew Perry 44215 2022-10-04 230 0
#> 16 Matthew Perry 44215 2022-10-05 232 0
Created on 2022-06-02 by the reprex package (v2.0.1)
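Since the question asks for dplyr and pipes, it may be worth noting that `tidyr::complete()` works group by group on a grouped data frame, so the split-and-recombine step can probably be avoided. This is a minimal sketch, not taken from either answer. It assumes `date` is of class `Date` and that the row-7 year has been corrected to 2022, as in the answers above. `filled` is just an illustrative name:

```r
library(dplyr)
library(tidyr)

# Same data as in the question, with the row-7 year assumed to be 2022
df <- data.frame(
  name   = rep(c("Jennifer Aniston", "Matthew Perry", "Lisa Kudrow"),
               times = c(4, 3, 5)),
  code   = rep(c(23211L, 44215L, 55120L), times = c(4, 3, 5)),
  date   = as.Date(c("2021-11-04", "2021-11-05", "2021-11-06", "2021-11-07",
                     "2022-10-01", "2022-10-04", "2022-10-05",
                     "2022-01-01", "2022-01-02", "2022-01-04",
                     "2022-01-06", "2022-01-07")),
  usage  = c(345L, 260L, 230L, 0L, 312L, 230L, 232L,
             132L, 125L, 345L, 321L, 431L),
  result = c(1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L)
)

filled <- df %>%
  group_by(name, code) %>%
  # within each group, add a row for every day between its first and last date
  complete(date = seq(min(date), max(date), by = "day")) %>%
  ungroup()

nrow(filled)
#> [1] 16
```

Grouping by both `name` and `code` keeps those two columns filled in on the newly added rows, while `usage` and `result` are left as NA there.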