Counting the days of service use in each month
I am currently reshaping health service data. My data frame contains the start and end dates of each person's service use:
id <- c("A", "A", "B")
start <- c("2018-04-01", "2019-04-02", "2018-09-01")
end <- c("2019-04-01", "2019-04-05", "2018-09-02")
df <- data.frame(id, start, end)
id start end
A 2018-04-01 2019-04-01
A 2019-04-02 2019-04-05
B 2018-09-01 2018-09-02
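I would like to: (1) count the number of days falling in each month for each episode of service use; (2) total the service-use days for each person; (3) create a new column for every possible month; (4) produce a new data frame. The end goal is a data frame like this: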
id 2018_Jan 2018_Feb 2018_Mar 2018_Apr 2018_May 2018_Jun ... 2018_Sep ... 2019_Sep
A 0 0 0 30 31 31 ... 30 ... 1
B 0 0 0 0 0 0 ... 1 ... 0
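The lubridate package and its functions should help with this. My question is similar to this post, which counts the days in each month. However, I am not sure how to apply it to build the data frame I want.
Thank you very much for your help.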
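Here is one approach. First, I build all combinations of id and year-month from January 2018 to December 2019. Then I summarise the data by id and year-month. Finally, I join the two data sets (so that months in which nothing happened are still captured) and pivot to wide format.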
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
id <- c("A", "A", "B")
start <- c("2018/04/01", "2019-04-02", "2018-09-01")
end <- c("2019-04-01", "2019-04-05", "2018-09-02")
df <- data.frame(id, start, end)
# Every id x year-month combination for 2018-2019.
# All twelve months are listed; leaving one out (e.g. "May") would turn that
# month into an NA factor level and produce a stray `NA` column after pivoting.
all_dates <- expand.grid(id = unique(df$id),
                         month = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                                   "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
                         year = 2018:2019) %>%
  mutate(yrmo = paste(year, month, sep="_")) %>%
  select(id, yrmo)
df <- df %>%
  mutate(start = ymd(start),
         end = ymd(end)) %>%
  rowwise() %>%
  # one row per day of service use; the end date is included in the count
  summarise(id = id, obs = 1, dates = seq(start, end, by=1)) %>%
  mutate(yrmo = paste(year(dates), month(dates, label=TRUE, abbr=TRUE), sep="_")) %>%
  group_by(id, yrmo) %>%
  summarise(obs = n()) %>%
  # add the id x month combinations that have no observed days
  full_join(., all_dates) %>%
  # order the months chronologically rather than alphabetically
  mutate(yrmo = factor(yrmo, levels = all_dates$yrmo[which(all_dates$id == "A")])) %>%
  arrange(id, yrmo) %>%
  pivot_wider(names_from="yrmo", values_from="obs") %>%
  mutate(across(everything(), ~ifelse(is.na(.x), 0, .x)))
#> `summarise()` has grouped output by 'id'. You can override using the `.groups`
#> argument.
#> Joining, by = c("id", "yrmo")
df
#> # A tibble: 2 × 25
#> # Groups:   id [2]
#>   id    `2018_Jan` `2018_Feb` `2018_Mar` `2018_Apr` `2018_May` `2018_Jun`
#>   <chr>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
#> 1 A              0          0          0         30         31         30
#> 2 B              0          0          0          0          0          0
#> # … with 18 more variables: `2018_Jul` <dbl>, `2018_Aug` <dbl>,
#> #   `2018_Sep` <int>, `2018_Oct` <dbl>, `2018_Nov` <dbl>, `2018_Dec` <dbl>,
#> #   `2019_Jan` <dbl>, `2019_Feb` <dbl>, `2019_Mar` <dbl>, `2019_Apr` <dbl>,
#> #   `2019_May` <dbl>, `2019_Jun` <dbl>, `2019_Jul` <dbl>, `2019_Aug` <dbl>,
#> #   `2019_Sep` <dbl>, `2019_Oct` <dbl>, `2019_Nov` <dbl>, `2019_Dec` <dbl>
Created on 2022-03-04 by the reprex package (v2.0.1)
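The "all combinations, then fill with zero" idea can also be sketched without any packages. This is only an illustration under stated assumptions, not the answerer's method: it swaps the join for base R's table() on a factor whose levels are the full month grid, and the names days, all_yrmo, counts and wide are made up for the example. The end date is counted, as in the code above.

```r
# Example data copied from the question, with dates parsed up front
df <- data.frame(
  id    = c("A", "A", "B"),
  start = as.Date(c("2018-04-01", "2019-04-02", "2018-09-01")),
  end   = as.Date(c("2019-04-01", "2019-04-05", "2018-09-02"))
)

# One row per id and calendar day of service use (end date included)
days <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  data.frame(id = df$id[i], date = seq(df$start[i], df$end[i], by = "day"))
}))

# Full grid of year-month labels for 2018-2019, in calendar order.
# month.abb is built in and does not depend on the session locale.
all_yrmo <- paste(rep(2018:2019, each = 12), month.abb, sep = "_")

# Label every day with its year-month
day_yrmo <- paste(format(days$date, "%Y"),
                  month.abb[as.integer(format(days$date, "%m"))],
                  sep = "_")

# table() counts every id x level pair, so unobserved months come out as 0
counts <- table(id = days$id, yrmo = factor(day_yrmo, levels = all_yrmo))

# Wide data frame: one row per id, one column per month
wide <- data.frame(id = rownames(counts), unclass(counts),
                   check.names = FALSE, row.names = NULL)
```

Because the factor levels already contain every month, no separate join or NA replacement step is needed in this variant.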
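Here is a {tidyverse} solution.
- Use dplyr::summarize() with seq() to generate the full range of dates for each observation.
  - I use end - 1 in seq() so that the end date is not included in the count, consistent with your example.
- Use lubridate::floor_date(unit = "month") to convert these dates to months (technically, this changes each date to the first day of its month).
- Use dplyr::count() to count up the month-days for each id.
- Because you want columns in the output for months with no observations, I wrote a function, based on tidyr::complete(), that adds the unobserved months.
- Finally, use tidyr::pivot_wider() to get a column for each month.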
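To make the date-handling steps in this list concrete, here is a rough base R sketch for a single observation (the first row of the question's data). It is an illustration only: format(x, "%Y-%m-01") stands in for floor_date(unit = "month"), a factor with all months as levels stands in for complete(), and the variable names are invented for the example.

```r
# First observation from the question
start <- as.Date("2018-04-01")
end   <- as.Date("2019-04-01")

# Full range of service days, leaving out the end date (end - 1)
dates <- seq(start, end - 1, by = "day")

# "Floor" each date to the first day of its month
month_start <- as.Date(format(dates, "%Y-%m-01"))

# Every month from January of the first year to December of the last year
first_month <- as.Date(format(min(month_start), "%Y-01-01"))
last_month  <- as.Date(format(max(month_start), "%Y-12-01"))
all_months  <- seq(first_month, last_month, by = "month")

# Count days per month; months with no service days are kept with count 0
counts <- table(factor(format(month_start), levels = format(all_months)))
```

Here counts has one entry per month of 2018 and 2019, and April 2019 is zero because the end date itself is excluded.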
library(tidyverse)
library(lubridate)
# Add rows for every month of every calendar year spanned by the data,
# so that unobserved months still get a column after pivoting
complete_months <- function(.data, month, ..., fill = list()) {
  month <- pull(.data, {{ month }})
  firstday <- floor_date(min(month, na.rm = TRUE), unit = "year")
  lastday <- ceiling_date(max(month, na.rm = TRUE), unit = "year") - 1
  allmonths <- seq(firstday, lastday, by = "month")
  complete(.data, month = allmonths, ..., fill = fill)
}
month_counts <- df %>%
  mutate(across(start:end, ymd)) %>%
  group_by(id, obs = row_number()) %>%
  summarize(
    # use end - 1 in seq() to omit end date from count
    month = floor_date(seq(start, end - 1, by = 1), unit = "month"),
    .groups = "drop"
  ) %>%
  # number of service days per month and id
  count(month, id) %>%
  complete_months(month, id, fill = list(n = 0)) %>%
  # "%b" gives the abbreviated month name in the current locale
  mutate(month = strftime(month, "%Y_%b")) %>%
  pivot_wider(
    names_from = month,
    values_from = n
  )
month_counts
# # A tibble: 2 x 25
# id `2018_Jan` `2018_Feb` `2018_Mar` `2018_Apr` `2018_May` `2018_Jun`
# <chr> <int> <int> <int> <int> <int> <int>
# 1 A 0 0 0 30 31 30
# 2 B 0 0 0 0 0 0
# # ... with 18 more variables: `2018_Jul` <int>, `2018_Aug` <int>,
# # `2018_Sep` <int>, `2018_Oct` <int>, `2018_Nov` <int>, `2018_Dec` <int>,
# # `2019_Jan` <int>, `2019_Feb` <int>, `2019_Mar` <int>, `2019_Apr` <int>,
# # `2019_May` <int>, `2019_Jun` <int>, `2019_Jul` <int>, `2019_Aug` <int>,
# # `2019_Sep` <int>, `2019_Oct` <int>, `2019_Nov` <int>, `2019_Dec` <int>
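The two answers differ in whether the end date is counted: the first uses seq(start, end), the second seq(start, end - 1). A small sketch of the difference, using person B's dates from the question (the variable names are chosen just for this example):

```r
# Person B's single observation from the question
start <- as.Date("2018-09-01")
end   <- as.Date("2018-09-02")

# End date counted (as in the first answer)
inclusive <- seq(start, end, by = "day")

# End date left out (as in the second answer)
exclusive <- seq(start, end - 1, by = "day")

length(inclusive)  # 2
length(exclusive)  # 1
```

The exclusive version matches the value of 1 shown for B under 2018_Sep in the question's target table. One thing to check in real data, as an assumption about inputs rather than something raised in the thread: if a row has end equal to start, end - 1 falls before start and seq() will stop with an error, so such rows may need to be handled separately.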