通过R中的几个条件组合数据
Combine data by several condition in R
我想根据两个条件合并两个数据:
- 相同ID(只保留第一个数据中的ID)
- 如果date_mid(来自dat2)在date_begin和date_end(都来自dat1)之间,粘贴结果(来自dat2),如果不是,记为“NA “
此外,如果组合数据中的 ID 已经有结果(健康或生病),我想删除这些行。在下面的示例中,我想删除第 3 行和第 12 行。
第一个数据(dat1):
dat1 <- tibble(ID = c(paste0(rep("A"), 1:10), "A2", "A10"),
date_begin = seq(as.Date("2020/1/1"), by = "month", length.out = 12),
date_end = date_begin + 365)
dat1
# A tibble: 12 x 3
ID date_begin date_end
<chr> <date> <date>
1 A1 2020-01-01 2020-12-31
2 A2 2020-02-01 2021-01-31
3 A3 2020-03-01 2021-03-01
4 A4 2020-04-01 2021-04-01
5 A5 2020-05-01 2021-05-01
6 A6 2020-06-01 2021-06-01
7 A7 2020-07-01 2021-07-01
8 A8 2020-08-01 2021-08-01
9 A9 2020-09-01 2021-09-01
10 A10 2020-10-01 2021-10-01
11 A2 2020-11-01 2021-11-01
12 A10 2020-12-01 2021-12-01
第二个数据(dat2):
dat2 <- tibble(ID = c(paste0(rep("A"), 1:4), paste0(rep("A"), 9:15), "A2"),
date_mid = seq(as.Date("2020/1/1"), by = "month", length.out = 12) + 100,
result = rep(c("healthy", "sick"), length = 12))
dat2
# A tibble: 12 x 3
ID date_mid result
<chr> <date> <chr>
1 A1 2020-04-10 healthy
2 A2 2020-05-11 sick
3 A3 2020-06-09 healthy
4 A4 2020-07-10 sick
5 A9 2020-08-09 healthy
6 A10 2020-09-09 sick
7 A11 2020-10-09 healthy
8 A12 2020-11-09 sick
9 A13 2020-12-10 healthy
10 A14 2021-01-09 sick
11 A15 2021-02-09 healthy
12 A2 2021-03-11 sick
我试过 left_join
如下:
left_join(dat1, dat2, by = "ID") %>%
mutate(result = ifelse(date_mid %within% interval(date_begin, date_end), result, NA))
# A tibble: 14 x 5
ID date_begin date_end date_mid result
<chr> <date> <date> <date> <chr>
1 A1 2020-01-01 2020-12-31 2020-04-10 healthy
2 A2 2020-02-01 2021-01-31 2020-05-11 sick
3 A2 2020-02-01 2021-01-31 2021-03-11 NA
4 A3 2020-03-01 2021-03-01 2020-06-09 healthy
5 A4 2020-04-01 2021-04-01 2020-07-10 sick
6 A5 2020-05-01 2021-05-01 NA NA
7 A6 2020-06-01 2021-06-01 NA NA
8 A7 2020-07-01 2021-07-01 NA NA
9 A8 2020-08-01 2021-08-01 NA NA
10 A9 2020-09-01 2021-09-01 2020-08-09 NA
11 A10 2020-10-01 2021-10-01 2020-09-09 NA
12 A2 2020-11-01 2021-11-01 2020-05-11 NA
13 A2 2020-11-01 2021-11-01 2021-03-11 sick
14 A10 2020-12-01 2021-12-01 2020-09-09 NA
正如我提到的,我想删除 ID A2 的第 3 行和第 12 行,因为 A2 在第 2 行和第 13 行中已经有健康或生病的结果。
我想要的确切结果是这样的(只有2行A2):
# A tibble: 12 x 5
ID date_begin date_end date_mid result
<chr> <date> <date> <date> <chr>
1 A1 2020-01-01 2020-12-31 2020-04-10 healthy
2 A2 2020-02-01 2021-01-31 2020-05-11 sick
3 A3 2020-03-01 2021-03-01 2020-06-09 healthy
4 A4 2020-04-01 2021-04-01 2020-07-10 sick
5 A5 2020-05-01 2021-05-01 NA NA
6 A6 2020-06-01 2021-06-01 NA NA
7 A7 2020-07-01 2021-07-01 NA NA
8 A8 2020-08-01 2021-08-01 NA NA
9 A9 2020-09-01 2021-09-01 2020-08-09 NA
10 A10 2020-10-01 2021-10-01 2020-09-09 NA
11 A2 2020-11-01 2021-11-01 2021-03-11 sick
12 A10 2020-12-01 2021-12-01 2020-09-09 NA
感谢任何指点,谢谢。
如果加入后结果中 ID
的行不止一行,则仅保留非 NA 行。这可以写成 dplyr
为 -
library(dplyr)
library(lubridate)
left_join(dat1, dat2, by = "ID") %>%
mutate(result = ifelse(date_mid %within% interval(date_begin, date_end), result, NA)) %>%
group_by(ID, date_begin, date_end) %>%
filter(if(n() > 1) !is.na(result) else TRUE) %>%
ungroup
# ID date_begin date_end date_mid result
# <chr> <date> <date> <date> <chr>
# 1 A1 2020-01-01 2020-12-31 2020-04-10 healthy
# 2 A2 2020-02-01 2021-01-31 2020-05-11 sick
# 3 A3 2020-03-01 2021-03-01 2020-06-09 healthy
# 4 A4 2020-04-01 2021-04-01 2020-07-10 sick
# 5 A5 2020-05-01 2021-05-01 NA NA
# 6 A6 2020-06-01 2021-06-01 NA NA
# 7 A7 2020-07-01 2021-07-01 NA NA
# 8 A8 2020-08-01 2021-08-01 NA NA
# 9 A9 2020-09-01 2021-09-01 2020-08-09 NA
#10 A10 2020-10-01 2021-10-01 2020-09-09 NA
#11 A2 2020-11-01 2021-11-01 2021-03-11 sick
#12 A10 2020-12-01 2021-12-01 2020-09-09 NA
我想根据两个条件合并两个数据:
- 相同ID(只保留第一个数据中的ID)
- 如果date_mid(来自dat2)在date_begin和date_end(都来自dat1)之间,粘贴结果(来自dat2),如果不是,记为“NA “
此外,如果组合数据中的 ID 已经有结果(健康或生病),我想删除这些行。在下面的示例中,我想删除第 3 行和第 12 行。
第一个数据(dat1):
dat1 <- tibble(ID = c(paste0(rep("A"), 1:10), "A2", "A10"),
date_begin = seq(as.Date("2020/1/1"), by = "month", length.out = 12),
date_end = date_begin + 365)
dat1
# A tibble: 12 x 3
ID date_begin date_end
<chr> <date> <date>
1 A1 2020-01-01 2020-12-31
2 A2 2020-02-01 2021-01-31
3 A3 2020-03-01 2021-03-01
4 A4 2020-04-01 2021-04-01
5 A5 2020-05-01 2021-05-01
6 A6 2020-06-01 2021-06-01
7 A7 2020-07-01 2021-07-01
8 A8 2020-08-01 2021-08-01
9 A9 2020-09-01 2021-09-01
10 A10 2020-10-01 2021-10-01
11 A2 2020-11-01 2021-11-01
12 A10 2020-12-01 2021-12-01
第二个数据(dat2):
dat2 <- tibble(ID = c(paste0(rep("A"), 1:4), paste0(rep("A"), 9:15), "A2"),
date_mid = seq(as.Date("2020/1/1"), by = "month", length.out = 12) + 100,
result = rep(c("healthy", "sick"), length = 12))
dat2
# A tibble: 12 x 3
ID date_mid result
<chr> <date> <chr>
1 A1 2020-04-10 healthy
2 A2 2020-05-11 sick
3 A3 2020-06-09 healthy
4 A4 2020-07-10 sick
5 A9 2020-08-09 healthy
6 A10 2020-09-09 sick
7 A11 2020-10-09 healthy
8 A12 2020-11-09 sick
9 A13 2020-12-10 healthy
10 A14 2021-01-09 sick
11 A15 2021-02-09 healthy
12 A2 2021-03-11 sick
我试过 left_join
如下:
left_join(dat1, dat2, by = "ID") %>%
mutate(result = ifelse(date_mid %within% interval(date_begin, date_end), result, NA))
# A tibble: 14 x 5
ID date_begin date_end date_mid result
<chr> <date> <date> <date> <chr>
1 A1 2020-01-01 2020-12-31 2020-04-10 healthy
2 A2 2020-02-01 2021-01-31 2020-05-11 sick
3 A2 2020-02-01 2021-01-31 2021-03-11 NA
4 A3 2020-03-01 2021-03-01 2020-06-09 healthy
5 A4 2020-04-01 2021-04-01 2020-07-10 sick
6 A5 2020-05-01 2021-05-01 NA NA
7 A6 2020-06-01 2021-06-01 NA NA
8 A7 2020-07-01 2021-07-01 NA NA
9 A8 2020-08-01 2021-08-01 NA NA
10 A9 2020-09-01 2021-09-01 2020-08-09 NA
11 A10 2020-10-01 2021-10-01 2020-09-09 NA
12 A2 2020-11-01 2021-11-01 2020-05-11 NA
13 A2 2020-11-01 2021-11-01 2021-03-11 sick
14 A10 2020-12-01 2021-12-01 2020-09-09 NA
正如我提到的,我想删除 ID A2 的第 3 行和第 12 行,因为 A2 在第 2 行和第 13 行中已经有健康或生病的结果。
我想要的确切结果是这样的(只有2行A2):
# A tibble: 12 x 5
ID date_begin date_end date_mid result
<chr> <date> <date> <date> <chr>
1 A1 2020-01-01 2020-12-31 2020-04-10 healthy
2 A2 2020-02-01 2021-01-31 2020-05-11 sick
3 A3 2020-03-01 2021-03-01 2020-06-09 healthy
4 A4 2020-04-01 2021-04-01 2020-07-10 sick
5 A5 2020-05-01 2021-05-01 NA NA
6 A6 2020-06-01 2021-06-01 NA NA
7 A7 2020-07-01 2021-07-01 NA NA
8 A8 2020-08-01 2021-08-01 NA NA
9 A9 2020-09-01 2021-09-01 2020-08-09 NA
10 A10 2020-10-01 2021-10-01 2020-09-09 NA
11 A2 2020-11-01 2021-11-01 2021-03-11 sick
12 A10 2020-12-01 2021-12-01 2020-09-09 NA
感谢任何指点,谢谢。
如果加入后结果中 ID
的行不止一行,则仅保留非 NA 行。这可以写成 dplyr
为 -
library(dplyr)
library(lubridate)
left_join(dat1, dat2, by = "ID") %>%
mutate(result = ifelse(date_mid %within% interval(date_begin, date_end), result, NA)) %>%
group_by(ID, date_begin, date_end) %>%
filter(if(n() > 1) !is.na(result) else TRUE) %>%
ungroup
# ID date_begin date_end date_mid result
# <chr> <date> <date> <date> <chr>
# 1 A1 2020-01-01 2020-12-31 2020-04-10 healthy
# 2 A2 2020-02-01 2021-01-31 2020-05-11 sick
# 3 A3 2020-03-01 2021-03-01 2020-06-09 healthy
# 4 A4 2020-04-01 2021-04-01 2020-07-10 sick
# 5 A5 2020-05-01 2021-05-01 NA NA
# 6 A6 2020-06-01 2021-06-01 NA NA
# 7 A7 2020-07-01 2021-07-01 NA NA
# 8 A8 2020-08-01 2021-08-01 NA NA
# 9 A9 2020-09-01 2021-09-01 2020-08-09 NA
#10 A10 2020-10-01 2021-10-01 2020-09-09 NA
#11 A2 2020-11-01 2021-11-01 2021-03-11 sick
#12 A10 2020-12-01 2021-12-01 2020-09-09 NA