Find the total number of IDs that satisfy a condition between two dates
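I have a dataset, PosNeg, like the one below. I need to count the IDs that show a pattern such as P N P P or N P N N P N, that is, at least one N (negative) between two P (positive). If the pattern occurs at least once, the ID is counted. Dates are always in ascending order.
For example: ID 1 has at least one N between two Ps (on 02/25), so I count ID 1. IDs 2 and 3 have no N between two Ps, so they are not counted. ID 4 also has an N between two Ps (on 03/18), so I include 4. The total number of IDs that satisfy the condition is therefore 2 (IDs 1 and 4).
My idea was to find, for each ID, the min(date) and max(date) of the positives and then look for any negative between those dates, but I am not sure how to implement it. Any suggestions in R/Python/SQL would help.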
ID | Test | Date |
---|---|---|
1 | P | 2021-01-02 |
1 | P | 2021-01-08 |
1 | N | 2021-02-25 |
1 | P | 2021-03-26 |
2 | N | 2021-02-05 |
2 | P | 2021-03-04 |
2 | N | 2021-03-30 |
3 | N | 2021-01-24 |
3 | P | 2021-02-10 |
4 | N | 2021-02-15 |
4 | P | 2021-02-28 |
4 | N | 2021-03-18 |
4 | P | 2021-04-11 |
Output:
Total |
---|
2 |
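The min/max idea described in the question can be sketched roughly as follows. This is only an illustration in plain Python (one of the languages the question allows), not code from the thread; the function name `ids_with_negative_between_positives` and the tuple layout of `rows` are assumptions made for the example. It relies on the dates being ISO-formatted strings, which compare correctly as text.

```python
from collections import defaultdict

# Sample rows copied from the table in the question: (ID, Test, Date).
rows = [
    (1, "P", "2021-01-02"), (1, "P", "2021-01-08"),
    (1, "N", "2021-02-25"), (1, "P", "2021-03-26"),
    (2, "N", "2021-02-05"), (2, "P", "2021-03-04"), (2, "N", "2021-03-30"),
    (3, "N", "2021-01-24"), (3, "P", "2021-02-10"),
    (4, "N", "2021-02-15"), (4, "P", "2021-02-28"),
    (4, "N", "2021-03-18"), (4, "P", "2021-04-11"),
]

def ids_with_negative_between_positives(rows):
    """Return the sorted IDs with an N dated strictly between their first and last P."""
    by_id = defaultdict(list)
    for id_, test, date in rows:
        by_id[id_].append((date, test))

    matched = []
    for id_, tests in by_id.items():
        positive_dates = [d for d, t in tests if t == "P"]
        if len(positive_dates) < 2:
            continue  # fewer than two positives: nothing can lie between them
        first_p, last_p = min(positive_dates), max(positive_dates)
        if any(t == "N" and first_p < d < last_p for d, t in tests):
            matched.append(id_)
    return sorted(matched)

matched_ids = ids_with_negative_between_positives(rows)
print(matched_ids, len(matched_ids))  # [1, 4] 2
```

Because any negative that falls between the first and last positive necessarily sits between two positives, this check should agree with the pattern described above for any number of consecutive negatives.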
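EDIT1: There can be more than one N between two Ps (at least 1), not just exactly 1, and I want those cases included in my count.
EDIT2: I expected this ID to be included, but it is missing from the resulting data frame, even though there are multiple Ns between 2 Ps.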
ID | DATE | TEST |
---|---|---|
1 | 2020-06-12 | N |
1 | 2020-08-20 | N |
1 | 2020-10-04 | N |
1 | 2020-12-09 | N |
1 | 2021-01-08 | P |
1 | 2021-02-05 | P |
1 | 2021-03-26 | P |
1 | 2021-05-26 | P |
1 | 2021-06-30 | N |
1 | 2021-07-21 | N |
1 | 2021-08-23 | N |
1 | 2021-09-16 | N |
1 | 2021-10-08 | N |
1 | 2021-10-18 | N |
1 | 2021-10-29 | P |
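EDIT3: The ID in the previous edit is 1, but in the output on my real data the IDs start from 15. I think they should start from 1. Also, my real data contains 'Negative' and 'Positive' rather than N and P. My code currently looks like this: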
data4c %>% group_by(STUDY_ID) %>%
  summarise(isP = str_detect(str_c(TEST, collapse = ""),
                             "PositiveNegative+Positive"), .groups = 'drop') %>%
  filter(isP)
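One likely reason a pattern like `PositiveNegative+Positive` misses IDs with several consecutive negatives: in a regular expression, `+` applies only to the token right before it, so `Negative+` means `Negativ` followed by one or more `e`, not a repeated `Negative`. Wrapping the word in a group changes that. A small sketch in Python (the `tests` list is made-up illustration data, not the real dataset):

```python
import re

# Hypothetical sequence with two consecutive negatives between positives.
tests = ["Positive", "Negative", "Negative", "Positive"]
joined = "".join(tests)

# `+` binds to the final "e" only, so a second "Negative" breaks the match.
ungrouped = re.search(r"PositiveNegative+Positive", joined) is not None

# Grouping makes `+` repeat the whole word.
grouped = re.search(r"Positive(Negative)+Positive", joined) is not None

print(ungrouped, grouped)  # False True
```

The same grouping syntax, `Positive(Negative)+Positive`, is valid in the regex flavor used by stringr, so it should carry over to the `str_detect` call.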
Updated answer
library(dplyr)
dat %>%
group_by(ID) %>%
summarize(
yourcond = grepl(pattern = "PN+P", x = paste(Test, collapse = "")))
Result:
# A tibble: 5 x 2
ID yourcond
<dbl> <lgl>
1 1 TRUE
2 2 FALSE
3 3 FALSE
4 4 TRUE
5 5 TRUE
New data:
dat <- structure(list(ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), Test = c("P", "P",
"N", "P", "N", "P", "N", "N", "P", "N", "P", "N", "P", "N", "N",
"N", "N", "P", "P", "P", "P", "N", "N", "N", "N", "N", "N", "P"
), Date = c("2021-01-02", "2021-01-08", "2021-02-25", "2021-03-26",
"2021-02-05", "2021-03-04", "2021-03-30", "2021-01-24", "2021-02-10",
"2021-02-15", "2021-02-28", "2021-03-18", "2021-04-11", "2020-06-12",
"2020-08-20", "2020-10-04", "2020-12-09", "2021-01-08", "2021-02-05",
"2021-03-26", "2021-05-26", "2021-06-30", "2021-07-21", "2021-08-23",
"2021-09-16", "2021-10-08", "2021-10-18", "2021-10-29")), row.names = c(NA,
-28L), class = "data.frame")
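For readers who prefer Python, the collapse-and-match approach of the updated answer can be sketched as below. This is an illustrative translation, not part of the answer; `has_pnp` is a hypothetical helper name, and the `ids` and `tests` lists are transcribed from the `dat` object above. `itertools.groupby` assumes the rows of each ID are contiguous, as they are here.

```python
import re
from itertools import groupby

# IDs and test results in date order, transcribed from `dat` above.
ids = [1] * 4 + [2] * 3 + [3] * 2 + [4] * 4 + [5] * 15
tests = list("PPNP" "NPN" "NP" "NPNP" "NNNNPPPPNNNNNNP")

def has_pnp(ids, tests):
    """Map each ID to True if its collapsed tests contain P, one or more N, then P."""
    result = {}
    for id_, group in groupby(zip(ids, tests), key=lambda pair: pair[0]):
        collapsed = "".join(test for _, test in group)
        result[id_] = re.search(r"PN+P", collapsed) is not None
    return result

flags = has_pnp(ids, tests)
print(flags)                # {1: True, 2: False, 3: False, 4: True, 5: True}
print(sum(flags.values()))  # 3
```

Summing the flags gives the total the question asks for; with the extra ID 5 included, that total is 3 rather than 2.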
Previous answer
library(dplyr)
dat %>%
group_by(ID) %>%
summarize(
yourcond = any((Test == "N") & (lag(Test) == "P") & (lead(Test) == "P")))
Result:
# A tibble: 4 x 2
ID yourcond
<int> <lgl>
1 1 TRUE
2 2 NA
3 3 NA
4 4 TRUE
You can add count(yourcond) to the dplyr chain to return the number of NA and TRUE values.
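A note on the limits of this neighbor check: it flags only an N whose immediate predecessor and successor are both P, so a run of two or more Ns between Ps goes undetected, which matches the problem raised in EDIT1 and EDIT2. The NA results for IDs 2 and 3 arise because lag() and lead() return NA at the edges of each group. The sketch below reproduces the neighbor logic in Python for illustration only; `neighbor_check` is a hypothetical name, and unlike the R version it returns False rather than NA when nothing matches.

```python
def neighbor_check(tests):
    """True if some N has a P immediately before and immediately after it."""
    return any(
        tests[i - 1] == "P" and tests[i] == "N" and tests[i + 1] == "P"
        for i in range(1, len(tests) - 1)
    )

print(neighbor_check(list("PPNP")))  # True: a single N between two Ps
print(neighbor_check(list("PNNP")))  # False: a run of two Ns is missed
```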
Data:
dat <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L,
4L, 4L, 4L), Test = c("P", "P", "N", "P", "N", "P", "N", "N",
"P", "N", "P", "N", "P"), Date = c("2021-01-02", "2021-01-08",
"2021-02-25", "2021-03-26", "2021-02-05", "2021-03-04", "2021-03-30",
"2021-01-24", "2021-02-10", "2021-02-15", "2021-02-28", "2021-03-18",
"2021-04-11")), class = "data.frame", row.names = c(NA, -13L))
Here is an option with str_c/str_detect: group by 'ID', paste the 'Test' elements together, and check for the pattern of a P followed by one or more N (N+) followed by a P.
library(stringr)
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise(isP = str_detect(str_c(substr(Test, 1, 1), collapse = ""), "PN+P"),
.groups = 'drop') %>%
filter(isP)
# A tibble: 2 × 2
ID isP
<int> <lgl>
1 1 TRUE
2 4 TRUE
Using the OP's new data
df2 %>% group_by(ID) %>%
summarise(isP = str_detect(str_c(substr(TEST,1, 1), collapse = ""), "PN+P"),
.groups = 'drop') %>%
filter(isP)
# A tibble: 1 × 2
ID isP
<int> <lgl>
1 1 TRUE
Edit: added substr to extract the first letter of the 'Test' column, because the values in the original data are not 'P' or 'N' as shown in the example.
Data
df2 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), DATE = c("2020-06-12", "2020-08-20", "2020-10-04",
"2020-12-09", "2021-01-08", "2021-02-05", "2021-03-26", "2021-05-26",
"2021-06-30", "2021-07-21", "2021-08-23", "2021-09-16", "2021-10-08",
"2021-10-18", "2021-10-29"), TEST = c("N", "N", "N", "N", "P",
"P", "P", "P", "N", "N", "N", "N", "N", "N", "P")),
class = "data.frame", row.names = c(NA,
-15L))