How to compare within a group the first value to each subsequent value until a condition is met
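I have a data frame with the following general structure:
Data frame:
|rownum|group |date |
|------|------|-----------|
|1 | a |2021-05-01 |
|2 | a |2021-05-02 |
|3 | a |2021-05-03 |
|4 | b |2021-05-15 |
|5 | b |2021-05-17 |
|6 | b |2021-05-30 |
|7 | b |2021-05-31 |
|8 | b |2021-05-31 |
|9 | c |2021-05-01 |
|10 | c |2021-05-05 |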
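What I want to do is, within each group, compare the first row to each following row until the difference between the dates meets some threshold, say 10 days. Once a row meets that threshold, I want the next row to become the new reference and be tested against the rows after it. It would look like this:
Result, using a threshold of 10: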
|rownum|group |date |date diff|
|------|------|-----------|---|
|1 | a |2021-05-01 |NA|
|2 | a |2021-05-02 |1|
|3 | a |2021-05-03 |2|
|4 | b |2021-05-15 |NA|
|5 | b |2021-05-17 |2|
|6 | b |2021-05-30 |15 (meets criteria, start from row 7 now)|
|7 | b |2021-05-31 | NA|
|8 | b |2021-05-31 | 0|
|9 | c |2021-05-01 | NA|
|10 | c |2021-05-05 | 4|
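To reiterate: the first row of a group is compared to the subsequent rows until some threshold is met. Counting then restarts from the next row in the group, which is compared to the rows that follow it. The difference is recorded as datediff.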
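The rule restated above can be written as a plain loop over one group's dates. This is only a minimal sketch of one reading of the rule (the comparison is always against the current reference row, and the row after a threshold-crossing row becomes the new reference); `anchor_diff` and its `threshold` argument are hypothetical names, not part of the original post.

```r
# Hypothetical helper (not from the original post): handles ONE group's dates,
# assumed to be sorted in ascending order.
anchor_diff <- function(dates, threshold = 10) {
  dates <- as.Date(dates)
  out <- rep(NA_real_, length(dates))
  anchor <- 1L                       # index of the current reference row
  for (i in seq_along(dates)) {
    if (i == anchor) next            # the reference row itself stays NA
    out[i] <- as.numeric(dates[i] - dates[anchor])
    # Once the threshold is exceeded, the NEXT row becomes the new reference
    if (out[i] > threshold) anchor <- i + 1L
  }
  out
}

# Group "b" from the example data
b <- as.Date(c("2021-05-15", "2021-05-17", "2021-05-30",
               "2021-05-31", "2021-05-31"))
anchor_diff(b)
# NA  2 15 NA  0
```

Applied per group (for example with `ave()` or inside `group_by() %>% mutate()`), this should give the datediff column shown in the table above.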
I have tried this, but I don't know whether sapply is the way to go:
dataframe %>%
group_by(group) %>%
mutate(
datediff = sapply(date, function(x) {
all(difftime(dataframe$date,dplyr::lag(dataframe, n = 1, default = NA)))
}
)
)
I also tried this, which I think is closer to what I want:
for (m in 1:length(dataframe)) {
dataframe <- dataframe %>%
group_by(group) %>%
rowwise() %>%
mutate(datediff = difftime(dataframe$date,dplyr::lag(date, n = m, default = NA), units="days"))
}
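So far I haven't been able to get the row-wise comparison right to implement the threshold part.
Here is a roundabout way to get what you need; note that with this solution some of the NAs are set to 0: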
library(tidyverse)
df %>%
group_by(group) %>%
mutate(date = as.Date(date),
date_diff = date - first(date),
flag = date_diff > 10) %>%
group_by(group, flag) %>%
mutate(temp_group = cur_group_id()) %>%
group_by(temp_group) %>%
mutate(date_diff = case_when(date_diff == first(date_diff) ~ date_diff,
date_diff != first(date_diff) & date_diff < 10 ~ date - first(date),
date_diff != first(date_diff) & date_diff > 10 ~ date - nth(date, 2))) %>%
ungroup() %>%
select(group, date, date_diff)
# # A tibble: 10 x 3
#    group date       date_diff
#    <chr> <date>     <drtn>
#  1 a     2021-05-01  0 days
#  2 a     2021-05-02  1 days
#  3 a     2021-05-03  2 days
#  4 b     2021-05-15  0 days
#  5 b     2021-05-17  2 days
#  6 b     2021-05-30 15 days
#  7 b     2021-05-31  0 days
#  8 b     2021-05-31  0 days
#  9 c     2021-05-01  0 days
# 10 c     2021-05-05  4 days
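Base R
func <- function(x, threshold = 10) {
  # Run-length encode whether each gap to the previous row exceeds the threshold
  r <- rle(c(0, diff(x)) > threshold)
  if ((len <- length(r$values)) > 1) {
    # Move one row from the last run to the first run, so the row that crosses
    # the threshold stays in its segment and the following row starts a new one
    r$lengths[len] <- r$lengths[len] - 1L
    r$lengths[1] <- r$lengths[1] + 1L
  }
  # The cumulative sum of the flags serves as a segment id within the group
  cumsum(inverse.rle(r))
}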
dat$group2 <- ave(as.numeric(dat$date), dat$group, FUN = func)
dat$datediff <- ave(as.numeric(dat$date), dat[,c("group", "group2")], FUN = function(x) c(NA, (x - x[1])[-1]))
dat$group2 <- NULL
dat
# rownum group date datediff
# 1 1 a 2021-05-01 NA
# 2 2 a 2021-05-02 1
# 3 3 a 2021-05-03 2
# 4 4 b 2021-05-15 NA
# 5 5 b 2021-05-17 2
# 6 6 b 2021-05-30 15
# 7 7 b 2021-05-31 NA
# 8 8 b 2021-05-31 0
# 9 9 c 2021-05-01 NA
# 10 10 c 2021-05-05 4
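To see what the helper above produces before the second `ave()` call, it can be run on a single group. The sketch below copies `func` verbatim so that it is self-contained. The second vector, `x`, is made-up data (not from the question) that illustrates a point worth checking against your own data: `func` looks at the gap between adjacent rows, so if the intended rule is "difference from the segment's first row", a slow drift past the threshold would not start a new segment.

```r
# Copy of func from the answer above
func <- function(x, threshold = 10) {
  r <- rle(c(0, diff(x)) > threshold)
  if ((len <- length(r$values)) > 1) {
    r$lengths[len] <- r$lengths[len] - 1L
    r$lengths[1] <- r$lengths[1] + 1L
  }
  cumsum(inverse.rle(r))
}

# Group "b" from the example data, passed as numeric like in the ave() call
b <- as.Date(c("2021-05-15", "2021-05-17", "2021-05-30",
               "2021-05-31", "2021-05-31"))
func(as.numeric(b))
# 0 0 0 1 1   -> rows 1-3 form one segment, rows 4-5 the next

# Made-up dates: every adjacent gap is <= 10 days, yet the fourth date is
# 12 days after the first one
x <- as.Date(c("2021-05-01", "2021-05-05", "2021-05-09",
               "2021-05-13", "2021-05-14"))
func(as.numeric(x))
# 0 0 0 0 0   -> a single segment
```

For the data in the question both readings give the same result, so this only matters if your real data can drift past the threshold in small steps.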
dplyr
library(dplyr)
dat %>%
group_by(group) %>%
mutate(group2 = func(date)) %>%
group_by(group, group2) %>%
mutate(datediff = c(NA, (date - date[1])[-1])) %>%
ungroup() %>%
select(-group2)
# # A tibble: 10 x 4
# rownum group date datediff
# <int> <chr> <date> <dbl>
# 1 1 a 2021-05-01 NA
# 2 2 a 2021-05-02 1
# 3 3 a 2021-05-03 2
# 4 4 b 2021-05-15 NA
# 5 5 b 2021-05-17 2
# 6 6 b 2021-05-30 15
# 7 7 b 2021-05-31 NA
# 8 8 b 2021-05-31 0
# 9 9 c 2021-05-01 NA
# 10 10 c 2021-05-05 4
Data
dat <- structure(list(rownum = 1:10, group = c("a", "a", "a", "b", "b", "b", "b", "b", "c", "c"), date = structure(c(18748, 18749, 18750, 18762, 18764, 18777, 18778, 18778, 18748, 18752), class = "Date")), row.names = c(NA, -10L), class = "data.frame")
(I have already converted dat$date to Date class.)
Another tidyverse solution. We can use accumulate to do this. dat is taken from r2evans's example.
library(tidyverse)
dat2 <- dat %>%
group_by(group) %>%
mutate(diff_lag = as.integer(date - lag(date))) %>%
mutate(diff = accumulate(diff_lag, function(x, y){
if (is.na(x)){
res <- y
} else if (x > 10){
res <- NA
} else {
res <- x + y
}
return(res)
})) %>%
select(-diff_lag) %>%
ungroup()
dat2
# # A tibble: 10 x 4
# rownum group date diff
# <int> <chr> <date> <int>
# 1 1 a 2021-05-01 NA
# 2 2 a 2021-05-02 1
# 3 3 a 2021-05-03 2
# 4 4 b 2021-05-15 NA
# 5 5 b 2021-05-17 2
# 6 6 b 2021-05-30 15
# 7 7 b 2021-05-31 NA
# 8 8 b 2021-05-31 0
# 9 9 c 2021-05-01 NA
# 10 10 c 2021-05-05 4
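To make the accumulate step easier to follow, here is a small trace of the same fold on group b alone. It is a sketch that uses base R's `Reduce(..., accumulate = TRUE)`, which performs the same kind of left fold for this case, so it runs without purrr; `step` and `gaps` are illustrative names, not part of the answer above.

```r
# Same branching as the anonymous function passed to accumulate() above
step <- function(acc, gap, threshold = 10) {
  if (is.na(acc)) {
    gap                 # fresh segment: start counting from this gap
  } else if (acc > threshold) {
    NA_integer_         # previous row crossed the threshold: reset here
  } else {
    acc + gap           # keep adding gaps, i.e. the distance from the reference row
  }
}

# Group "b" from the example data
b <- as.Date(c("2021-05-15", "2021-05-17", "2021-05-30",
               "2021-05-31", "2021-05-31"))
gaps <- c(NA_integer_, as.integer(diff(b)))   # NA 2 13 1 0, same as diff_lag
Reduce(step, gaps, accumulate = TRUE)
# NA  2 15 NA  0
```

Because the running total is the distance from the current reference row, this version resets on the cumulative difference rather than on the gap between adjacent rows.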