Calculate difference between dates by group in R
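I am trying to calculate the difference between the minimum and maximum dates by group in R. I found code that does this here. However, replicating the example does not produce the expected result. Here is a sample of the dataset used: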
HS_Hatch <- structure(list(ClutchID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L
), DateVisit = c("3/15/2012", "3/18/2012", "3/20/2012", "4/1/2012",
"4/3/2012", "3/18/2012", "3/20/2012", "3/22/2012", "4/3/2012",
"4/4/2012", "3/22/2012", "4/3/2012", "4/4/2012", "3/18/2012",
"3/20/2012", "3/22/2012", "4/2/2012", "4/3/2012", "4/4/2012",
"3/20/2012", "3/22/2012", "3/25/2012", "3/27/2012", "4/4/2012",
"4/5/2012"), Year = c(2012L, 2012L, 2012L, 2012L, 2012L, 2012L,
2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L,
2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L,
2012L), Survive = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -25L), .Names = c("ClutchID",
"DateVisit", "Year", "Survive"), spec = structure(list(cols = structure(list(
ClutchID = structure(list(), class = c("collector_integer",
"collector")), DateVisit = structure(list(), class = c("collector_character",
"collector")), Year = structure(list(), class = c("collector_integer",
"collector")), Survive = structure(list(), class = c("collector_integer",
"collector"))), .Names = c("ClutchID", "DateVisit", "Year",
"Survive")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
Here is the suggested solution using dplyr:
library(dplyr)
HS_Hatch <- HS_Hatch %>%
  mutate(date_visit = as.Date(DateVisit, "%m/%d/%Y"))
exposure <- HS_Hatch %>%
  group_by(ClutchID) %>%
  summarize(first_visit = min(date_visit),
            last_visit = max(date_visit),
            exposure = last_visit - first_visit)
This is the expected result:
  ClutchID first_visit last_visit exposure
     <int>      <date>     <date>    <dbl>
1        1  2012-03-15 2012-04-03       19
2        2  2012-03-18 2012-04-04       17
3        3  2012-03-22 2012-04-04       13
4        4  2012-03-18 2012-04-04       17
5        5  2012-03-20 2012-04-05       16
This is the actual result:
  first_visit last_visit exposure
1  2012-03-15 2012-04-05  21 days
The grouping factor seems to be ignored. How do I get it to calculate the date difference for each ClutchID?
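The code works when only dplyr is loaded. Change summarize to dplyr::summarize to make the call explicit. If plyr is loaded after dplyr, plyr's summarize masks dplyr's and ignores the grouping, which would likely explain the single-row result. I recommend not using plyr, since you can do everything with dplyr and the tidyverse.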
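The namespace-qualified call suggested above can be sketched as follows. This is a minimal illustration, not the asker's exact session: the `visits` data frame is a hypothetical two-clutch subset of the question's data, and the explanation that plyr masked `summarize` is an assumption, since the question does not show which packages were attached.

```r
library(dplyr)

# Hypothetical subset of the question's data (two clutches only).
visits <- data.frame(
  ClutchID  = c(1L, 1L, 1L, 2L, 2L),
  DateVisit = c("3/15/2012", "3/18/2012", "4/3/2012", "3/18/2012", "4/4/2012"),
  stringsAsFactors = FALSE
)

# Report which package an unqualified summarize() would resolve to.
summarize_source <- environmentName(environment(summarize))
print(summarize_source)

# Qualifying each verb with dplyr:: rules out masking by another package.
exposure <- visits %>%
  dplyr::mutate(date_visit = as.Date(DateVisit, format = "%m/%d/%Y")) %>%
  dplyr::group_by(ClutchID) %>%
  dplyr::summarize(
    first_visit = min(date_visit),
    last_visit  = max(date_visit),
    # Convert the difftime to a plain number of days.
    exposure    = as.numeric(difftime(last_visit, first_visit, units = "days"))
  )

print(exposure)
```

If `summarize_source` reports a package other than dplyr in your own session, the unqualified call in the question is being dispatched to that package instead.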
After importing the data frame, try this:
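# Parse the text dates into date_visit, the column the pipeline uses
HS_Hatch$date_visit = as.Date(HS_Hatch$DateVisit, "%m/%d/%Y")
# Convert to POSIXct; no format string is needed because the input is already a Date
HS_Hatch$date_visit = as.POSIXct(HS_Hatch$date_visit)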
Then change your dplyr pipeline to:
HS_Hatch <- HS_Hatch %>%
  group_by(ClutchID) %>%
  summarize(first_visit = min(date_visit),
            last_visit = max(date_visit),
            exposure = last_visit - first_visit)
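This gives the expected result. It works because the POSIXct format stores time as the number of seconds since "the origin" (1970-01-01 UTC), so you can compute differences.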
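To make the storage argument concrete, here is a small base R sketch using two dates from clutch 1. It also shows that Date objects, which store days rather than seconds since the origin, support the same subtraction, so the POSIXct conversion may not be strictly required. The variable names are illustrative, and `tz = "UTC"` is an assumption made to keep the arithmetic free of daylight-saving shifts.

```r
# Date: stored as the number of days since 1970-01-01.
first <- as.Date("3/15/2012", format = "%m/%d/%Y")
last  <- as.Date("4/3/2012",  format = "%m/%d/%Y")
days_since_origin <- unclass(first)  # a plain number of days
date_diff <- last - first            # a difftime measured in days

# POSIXct: stored as the number of seconds since 1970-01-01 UTC.
first_ct <- as.POSIXct("3/15/2012", format = "%m/%d/%Y", tz = "UTC")
last_ct  <- as.POSIXct("4/3/2012",  format = "%m/%d/%Y", tz = "UTC")
ct_diff  <- difftime(last_ct, first_ct, units = "days")

print(date_diff)
print(ct_diff)
```

Both differences come out to 19 days, matching the expected exposure for ClutchID 1.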