Find the total number of IDs between two dates that satisfy a condition

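I have a dataset, PosNeg, like the one below. I need the count of IDs that show a pattern such as P N P P or N P N N P N, that is, at least one N (negative) between two P (positive). If the pattern occurs at least once, the ID is counted. Dates are always in ascending order.

For example: ID 1 has at least one N between two P's (on 02/25), so I count ID 1. IDs 2 and 3 have no N between two P's, so they are not counted. ID 4 also has an N between two P's (on 03/18), so I include ID 4. The total number of IDs meeting the condition is therefore 2 (IDs 1 and 4).

My idea was to find, for each ID, the min(date) and the max(date) of the positives and then look for any negatives between those dates, but I am not sure how to implement it. Any suggestions in R/Python/SQL would be helpful.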

ID Test Date
1 P 2021-01-02
1 P 2021-01-08
1 N 2021-02-25
1 P 2021-03-26
2 N 2021-02-05
2 P 2021-03-04
2 N 2021-03-30
3 N 2021-01-24
3 P 2021-02-10
4 N 2021-02-15
4 P 2021-02-28
4 N 2021-03-18
4 P 2021-04-11

Output:

Total
2
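
The min/max-date idea described in the question can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper name `has_n_between_p` is hypothetical, the `Test` column is assumed to contain only "P" and "N", and dates are assumed to be distinct within an ID. Under those assumptions, an N strictly between the first and the last P implies that a "P, one or more N, P" run exists.

```r
# Sample data copied from the question
PosNeg <- data.frame(
  ID   = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4),
  Test = c("P", "P", "N", "P", "N", "P", "N", "N", "P", "N", "P", "N", "P"),
  Date = as.Date(c("2021-01-02", "2021-01-08", "2021-02-25", "2021-03-26",
                   "2021-02-05", "2021-03-04", "2021-03-30",
                   "2021-01-24", "2021-02-10",
                   "2021-02-15", "2021-02-28", "2021-03-18", "2021-04-11"))
)

# Hypothetical helper: TRUE when some N lies strictly between
# the first and the last P of a single ID
has_n_between_p <- function(test, date) {
  pos_dates <- date[test == "P"]
  if (length(pos_dates) < 2) return(FALSE)  # fewer than two P's: nothing to bracket
  any(test == "N" & date > min(pos_dates) & date < max(pos_dates))
}

# One flag per ID, then the total number of qualifying IDs
flags <- sapply(split(PosNeg, PosNeg$ID),
                function(d) has_n_between_p(d$Test, d$Date))
total <- sum(flags)
print(total)
```

For the sample data this flags IDs 1 and 4, matching the expected total of 2.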

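EDIT1: There can be several N's between two P's (at least 1, not exactly 1), and I want those IDs included in my count.

EDIT2: I expected this ID to be included, but it is missing from the resulting data frame, even though there are several N's between two P's.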

ID DATE TEST
1 2020-06-12 N
1 2020-08-20 N
1 2020-10-04 N
1 2020-12-09 N
1 2021-01-08 P
1 2021-02-05 P
1 2021-03-26 P
1 2021-05-26 P
1 2021-06-30 N
1 2021-07-21 N
1 2021-08-23 N
1 2021-09-16 N
1 2021-10-08 N
1 2021-10-18 N
1 2021-10-29 P

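EDIT3: The ID in the previous edit is 1, but in the output from my real data the IDs start at 15. I think they should start at 1. Also, my real data contains 'Negative' and 'Positive' rather than N and P. My code now looks like this: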

data4c %>% 
  group_by(STUDY_ID) %>% 
  summarise(isP = str_detect(str_c(TEST, collapse = ""), 
                             "PositiveNegative+Positive"), .groups = 'drop') %>% 
  filter(isP)

Updated answer

library(dplyr)

dat %>%
  group_by(ID) %>%
  summarize(
    yourcond = grepl(pattern = "PN+P", x = paste(Test, collapse = "")))

Result:

# A tibble: 5 x 2
     ID yourcond
  <dbl> <lgl>   
1     1 TRUE    
2     2 FALSE   
3     3 FALSE   
4     4 TRUE    
5     5 TRUE  

New data:

dat <- structure(list(
  ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 
         5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), 
  Test = c("P", "P", "N", "P", "N", "P", "N", "N", "P", "N", "P", "N", "P", 
           "N", "N", "N", "N", "P", "P", "P", "P", "N", "N", "N", "N", "N", "N", "P"), 
  Date = c("2021-01-02", "2021-01-08", "2021-02-25", "2021-03-26", 
           "2021-02-05", "2021-03-04", "2021-03-30", "2021-01-24", "2021-02-10", 
           "2021-02-15", "2021-02-28", "2021-03-18", "2021-04-11", "2020-06-12", 
           "2020-08-20", "2020-10-04", "2020-12-09", "2021-01-08", "2021-02-05", 
           "2021-03-26", "2021-05-26", "2021-06-30", "2021-07-21", "2021-08-23", 
           "2021-09-16", "2021-10-08", "2021-10-18", "2021-10-29")), 
  row.names = c(NA, -28L), class = "data.frame")
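
The question asks for a single total rather than one flag per ID. A possible way to get there, sketched in base R with the same "PN+P" check, is to sum the per-ID flags. The data below repeats the 28 `ID`/`Test` values of the new data in compact form (the `Date` column is left out because rows are already in date order within each ID); the names `flags` and `total` are only illustrative.

```r
# Same ID/Test values as the new data, written compactly
dat <- data.frame(
  ID   = rep(1:5, times = c(4, 3, 2, 4, 15)),
  Test = strsplit("PPNPNPNNPNPNPNNNNPPPPNNNNNNP", "")[[1]]
)

# Collapse each ID's results into one string and look for
# a P, one or more N, then another P
flags <- sapply(split(dat$Test, dat$ID),
                function(x) grepl("PN+P", paste(x, collapse = "")))

# Number of IDs for which the pattern occurs at least once
total <- sum(flags)
print(total)
```

With this data the flagged IDs are 1, 4 and 5, so the total is 3.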

Previous answer

library(dplyr)

dat %>%
  group_by(ID) %>%
  summarize(
    yourcond = any((Test == "N") & (lag(Test) == "P") & (lead(Test) == "P")))

Result:

# A tibble: 4 x 2
     ID yourcond
  <int> <lgl>   
1     1 TRUE    
2     2 NA      
3     3 NA      
4     4 TRUE 

You can add count(yourcond) to the dplyr chain to return the counts of NA and TRUE.

Data:
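
The NA values for IDs 2 and 3 come from lag() and lead() returning NA at the first and last row of each group, which makes any() return NA when no row is TRUE. If FALSE is preferred, one option (a sketch, not part of the original answer) is to pass na.rm = TRUE to any(). Note that this lag/lead check only detects exactly one N between two P's; runs of several N's need the regex approach from the updated answer.

```r
library(dplyr)

# ID/Test values from the data used in this answer
dat <- data.frame(
  ID   = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L),
  Test = c("P", "P", "N", "P", "N", "P", "N", "N", "P", "N", "P", "N", "P")
)

res <- dat %>%
  group_by(ID) %>%
  summarize(
    # na.rm = TRUE drops the NA comparisons produced at the group edges
    yourcond = any(Test == "N" & lag(Test) == "P" & lead(Test) == "P",
                   na.rm = TRUE)
  )

# Counts of FALSE and TRUE across IDs
res %>% count(yourcond)
```

Here IDs 1 and 4 come out TRUE and IDs 2 and 3 come out FALSE instead of NA.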

dat <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 
4L, 4L, 4L), Test = c("P", "P", "N", "P", "N", "P", "N", "N", 
"P", "N", "P", "N", "P"), Date = c("2021-01-02", "2021-01-08", 
"2021-02-25", "2021-03-26", "2021-02-05", "2021-03-04", "2021-03-30", 
"2021-01-24", "2021-02-10", "2021-02-15", "2021-02-28", "2021-03-18", 
"2021-04-11")), class = "data.frame", row.names = c(NA, -13L))

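Here is an option with str_c/str_detect: group by 'ID', paste the 'Test' elements together, then check whether the pattern occurs in which a P is followed by one or more N (N+) and then another P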

library(stringr)
library(dplyr)
df1 %>% 
  group_by(ID) %>% 
  summarise(isP = str_detect(str_c(substr(Test, 1, 1), collapse = ""), "PN+P"), 
    .groups = 'drop') %>% 
 filter(isP)
# A tibble: 2 × 2
     ID isP  
  <int> <lgl>
1     1 TRUE 
2     4 TRUE 

Using the OP's new data

> df2 %>%  group_by(ID) %>% 
   summarise(isP = str_detect(str_c(substr(TEST,1, 1), collapse = ""), "PN+P"), 
     .groups = 'drop') %>% 
  filter(isP)
# A tibble: 1 × 2
     ID isP  
  <int> <lgl>
1     1 TRUE 

Edit: added substr to extract the first letter of the 'Test' column, because the original data values are not 'P' or 'N' as shown in the example.

Data
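
To illustrate why the first-letter step matters when the values are full words, here is a small base R sketch on a made-up sequence (the vector `test` is hypothetical, not the OP's real data). In a pattern like "PositiveNegative+Positive" the `+` applies only to the final "e" of "Negative", so two or more consecutive "Negative" entries are not matched; this may be one reason an ID goes missing. Grouping the word, or reducing each value to its first letter, avoids that.

```r
# Hypothetical sequence with full labels: Positive, two Negatives, Positive
test <- c("Positive", "Negative", "Negative", "Positive")
collapsed <- paste(test, collapse = "")

# `+` binds only to the preceding "e", so a repeated "Negative" does not match
naive <- grepl("PositiveNegative+Positive", collapsed)

# Parentheses make `+` apply to the whole word "Negative"
grouped <- grepl("Positive(Negative)+Positive", collapsed)

# First-letter approach: reduce each value to "P" or "N", then use "PN+P"
initials <- grepl("PN+P", paste(substr(test, 1, 1), collapse = ""))

print(c(naive = naive, grouped = grouped, initials = initials))
```

On this sequence the naive pattern gives FALSE, while the grouped pattern and the first-letter approach both give TRUE.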

df2 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L), DATE = c("2020-06-12", "2020-08-20", "2020-10-04", 
"2020-12-09", "2021-01-08", "2021-02-05", "2021-03-26", "2021-05-26", 
"2021-06-30", "2021-07-21", "2021-08-23", "2021-09-16", "2021-10-08", 
"2021-10-18", "2021-10-29"), TEST = c("N", "N", "N", "N", "P", 
"P", "P", "P", "N", "N", "N", "N", "N", "N", "P")), 
class = "data.frame", row.names = c(NA, 
-15L))