Delete a group if it has 3 consecutive identical values or if a value is less than the previous one

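I need to remove, by group, every group in which the same value occurs 3 times in a row, or in which a value is less than the one in the previous row. I have tried dplyr's lead() function, but to no avail. The example data is only a subset of several thousand rows, where a conditional approach seems to work best.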

Data before

print(df)
   sample measurement
1       a      0.1443
2       a      0.2220
3       a      0.3330
4       a      0.9435
5       a      0.8051 # Delete sample "a" as this value is less than the previous one
6       b      0.1554
7       b      0.2775
8       b      0.3885
9       b      1.2210
10      b      1.8093
11      c      0.0000
12      c      0.0000
13      c      0.0000 # Delete sample "c" as there are 3 identical values in a row
14      c      0.0333
15      c      0.2997

Desired output

   sample measurement
1       b      0.1554
2       b      0.2775
3       b      0.3885
4       b      1.2210
5       b      1.8093

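Failed attempt

Here I tried to filter out any group in which a following measurement is less than or equal to the current one, but it failed: the result is empty.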

df %>%
    group_by(sample) %>%
        filter(!any(lead(measurement) <= measurement)) %>% 
            ungroup()

# A tibble: 0 x 2
# ... with 2 variables: sample <chr>, measurement <dbl>
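
A likely reason for the empty result (an inference from how lead() and any() handle NA, not something stated in the post): lead() returns NA for the last row of each group, so in a group where no comparison is TRUE, any() returns NA, !NA is still NA, and filter() drops rows whose condition is NA. The base R sketch below reproduces this for sample "b"; the names nxt and cmp are made up for illustration.

```r
# Measurements of sample "b" from the example data
measurement <- c(0.1554, 0.2775, 0.3885, 1.2210, 1.8093)

# Base R stand-in for dplyr::lead(): shift left by one and pad with NA
nxt <- c(measurement[-1], NA)

# Comparison used in the failed attempt; the last element is NA
cmp <- nxt <= measurement

any(cmp)                  # NA: no TRUE found, but one value is unknown
!any(cmp)                 # NA: filter() does not keep rows whose condition is NA
!any(cmp, na.rm = TRUE)   # TRUE: ignoring the NA keeps the group
```

Passing na.rm = TRUE to any() in the original pipeline should therefore keep sample "b". Note that the <= comparison rejects a group as soon as two neighbouring values are equal, which is stricter than the "3 in a row" rule described above.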

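If I instead keep only the groups that meet the condition (the same pipeline without the negation), it does what I expect. I am sure there is a better way to do this.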

df %>%
    group_by(sample) %>%
        filter(any(lead(measurement) <= measurement)) %>% 
            ungroup()

   sample measurement
   <chr>        <dbl>
 1 a           0.144 
 2 a           0.222 
 3 a           0.333 
 4 a           0.944 
 5 a           0.805 
 6 c           0     
 7 c           0     
 8 c           0     
 9 c           0.0333
10 c           0.300 

Reproducible code

df <- structure(list(sample = c("a", "a", "a", "a", "a", "b", "b", 
"b", "b", "b", "c", "c", "c", "c", "c"), measurement = c(0.1443, 
0.222, 0.333, 0.9435, 0.8051, 0.1554, 0.2775, 0.3885, 1.221, 
1.8093, 0, 0, 0, 0.0333, 0.2997)), row.names = c(NA, -15L), na.action = structure(c(`1` = 1L, 
`2` = 2L, `3` = 3L, `5` = 5L, `6` = 6L, `8` = 8L, `9` = 9L, `11` = 11L, 
`12` = 12L, `15` = 15L, `16` = 16L, `17` = 17L, `18` = 18L, `20` = 20L, 
`21` = 21L, `23` = 23L, `24` = 24L, `26` = 26L, `27` = 27L, `30` = 30L, 
`31` = 31L, `32` = 32L, `33` = 33L, `35` = 35L, `36` = 36L, `38` = 38L, 
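`39` = 39L, `41` = 41L, `42` = 42L, `45` = 45L), class = "omit"), class = "data.frame")

library(dplyr)

# Keep a group only if every measurement is strictly greater than the one
# before it (the first row is compared against 0) and it has at least 3 rows
df %>%
    group_by(sample) %>%
    filter(!any(measurement - lag(measurement, default = 0) <= 0) & n() >= 3) %>%
    ungroup()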

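One option is to make two checks after grouping by 'sample': 1) take the differences of adjacent elements with diff and check whether any value is less than 0, or 2) get the run-length id (rleid) of 'measurement', tabulate it, and check whether any frequency is greater than or equal to 3. Groups that meet either condition are removed.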

library(dplyr)
library(data.table)
df %>%
    group_by(sample) %>%
    filter( !(any(diff(measurement) < 0)| any(tabulate(rleid(measurement)) >=3)))
# A tibble: 5 x 2
# Groups:   sample [1]
#  sample measurement
#  <chr>        <dbl>
#1 b            0.155
#2 b            0.278
#3 b            0.388
#4 b            1.22 
#5 b            1.81
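
For readers who would rather avoid the data.table dependency, the same two checks can be sketched in base R, with rle() standing in for rleid() plus tabulate(). This is a minimal sketch, not the answerer's code: the helper name keep_group is hypothetical, and the data frame is rebuilt inline so the block runs on its own.

```r
# Example data from the question, rebuilt inline
df <- data.frame(
  sample = rep(c("a", "b", "c"), each = 5),
  measurement = c(0.1443, 0.2220, 0.3330, 0.9435, 0.8051,
                  0.1554, 0.2775, 0.3885, 1.2210, 1.8093,
                  0.0000, 0.0000, 0.0000, 0.0333, 0.2997)
)

# Hypothetical helper: TRUE if a group passes both checks
keep_group <- function(x) {
  no_decrease <- !any(diff(x) < 0)          # no value below its predecessor
  no_triple   <- !any(rle(x)$lengths >= 3)  # no run of 3 or more equal values
  no_decrease && no_triple
}

# One logical per sample, named by sample
ok <- vapply(split(df$measurement, df$sample), keep_group, logical(1))

# Look up each row's group result and keep only the passing groups
result <- df[unname(ok[as.character(df$sample)]), ]
result
```

On the example data only the rows of sample "b" remain. Both rle() and diff() compare doubles exactly, so measurements that differ only by floating-point noise are not treated as equal.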