Delete a group if it has 3 consecutive identical values or if a value is less than the previous one
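I need to delete, by group, every sample that contains the same value 3 times in a row, or in which a value is less than the value in the previous row. I have tried dplyr's lead function, but without success. The example data is just a subset of thousands of rows, where a conditional filter would work best.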
Data before
print(df)
sample measurement
1 a 0.1443
2 a 0.2220
3 a 0.3330
4 a 0.9435
5 a 0.8051 # Delete sample "a" as value is less than previous
6 b 0.1554
7 b 0.2775
8 b 0.3885
9 b 1.2210
10 b 1.8093
11 c 0.0000
12 c 0.0000
13 c 0.0000 # Delete sample "c" as there are 3 identical values in a row
14 c 0.0333
15 c 0.2997
Desired output
sample measurement
1 b 0.1554
2 b 0.2775
3 b 0.3885
4 b 1.2210
5 b 1.8093
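Failed attempt
Here I tried to filter out every group in which a leading measurement is less than or equal to the measurement in the current row, but it was unsuccessful.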
df %>%
group_by(sample) %>%
filter(!any(lead(measurement) <= measurement)) %>%
ungroup()
# A tibble: 0 x 2
# ... with 2 variables: sample <chr>, measurement <dbl>
If I instead try to collect only the rows that meet the condition, the code above does what I expect. I am sure there is a better way to do this.
df %>%
group_by(sample) %>%
filter(any(lead(measurement) <= measurement)) %>%
ungroup()
sample measurement
<chr> <dbl>
1 a 0.144
2 a 0.222
3 a 0.333
4 a 0.944
5 a 0.805
6 c 0
7 c 0
8 c 0
9 c 0.0333
10 c 0.300
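A likely reason the negated filter returns zero rows is that lead() has no next value for the last row of each group and returns NA there. For a strictly increasing group such as "b", every comparison is FALSE except that trailing NA, so any() yields NA, !NA is still NA, and filter() drops rows whose condition is NA. The sketch below reproduces this on group "b" and then passes na.rm = TRUE to any(); the data frame is rebuilt by hand from the question, and the name kept is only illustrative.

```r
library(dplyr)

# Group "b" from the question: strictly increasing, so it should be kept
b <- c(0.1554, 0.2775, 0.3885, 1.2210, 1.8093)

# lead() has no "next" value for the last element, so the comparison ends in NA
cmp <- lead(b) <= b
any(cmp)                # NA: no element is TRUE and one is NA
any(cmp, na.rm = TRUE)  # FALSE

# Example data rebuilt from the question
df <- data.frame(
  sample = rep(c("a", "b", "c"), each = 5),
  measurement = c(0.1443, 0.222, 0.333, 0.9435, 0.8051,
                  0.1554, 0.2775, 0.3885, 1.221, 1.8093,
                  0, 0, 0, 0.0333, 0.2997)
)

# Same filter as the failed attempt, but ignoring the trailing NA
kept <- df %>%
  group_by(sample) %>%
  filter(!any(lead(measurement) <= measurement, na.rm = TRUE)) %>%
  ungroup()
```

Note that this condition is stricter than the stated requirement: it drops a group as soon as two adjacent values are equal, not only when three in a row are identical. On the example data that makes no difference, and only sample "b" remains.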
Reproducible code
df <- structure(list(sample = c("a", "a", "a", "a", "a", "b", "b",
"b", "b", "b", "c", "c", "c", "c", "c"), measurement = c(0.1443,
0.222, 0.333, 0.9435, 0.8051, 0.1554, 0.2775, 0.3885, 1.221,
1.8093, 0, 0, 0, 0.0333, 0.2997)), row.names = c(NA, -15L), na.action = structure(c(`1` = 1L,
`2` = 2L, `3` = 3L, `5` = 5L, `6` = 6L, `8` = 8L, `9` = 9L, `11` = 11L,
`12` = 12L, `15` = 15L, `16` = 16L, `17` = 17L, `18` = 18L, `20` = 20L,
`21` = 21L, `23` = 23L, `24` = 24L, `26` = 26L, `27` = 27L, `30` = 30L,
`31` = 31L, `32` = 32L, `33` = 33L, `35` = 35L, `36` = 36L, `38` = 38L,
`39` = 39L, `41` = 41L, `42` = 42L, `45` = 45L), class = "omit"), class = "data.frame")
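# Keep groups of at least 3 rows in which every value is strictly greater
# than the one before it (the first value of a group is compared with 0)
df %>%
  group_by(sample) %>%
  filter(!any(measurement - lag(measurement, default = 0) <= 0) & n() >= 3) %>%
  ungroup()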
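One thing to keep in mind with the filter above: it compares each value with its predecessor, so it removes a group as soon as two adjacent values are equal, which is stricter than "3 identical values in a row". A small base R sketch on a made-up vector (x is hypothetical, not part of the question's data) shows the difference between that check and a run-length check:

```r
# Hypothetical group: one repeated pair, but no run of 3 identical values
x <- c(0.1, 0.2, 0.2, 0.5)

# Base R equivalent of lag(x, default = 0)
prev <- c(0, head(x, -1))

# Check used in the filter above: any value not strictly greater than the previous one
strict_fail <- any(x - prev <= 0)

# Run-length check: any value repeated 3 or more times in a row
run3 <- any(rle(x)$lengths >= 3)

strict_fail  # TRUE, the group would be dropped
run3         # FALSE, the group has no run of 3
```

On the question's example data both checks agree, so the difference only matters if a sample can legitimately contain exactly two equal consecutive measurements.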
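One option is to do two checks after grouping by 'sample': 1) use diff to get the differences between adjacent elements and check whether any value is less than 0, or 2) get the run-length id (rleid) of 'measurement', tabulate it, and check whether any frequency is greater than or equal to 3.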
library(dplyr)
library(data.table)
df %>%
group_by(sample) %>%
filter( !(any(diff(measurement) < 0)| any(tabulate(rleid(measurement)) >=3)))
# A tibble: 5 x 2
# Groups: sample [1]
# sample measurement
# <chr> <dbl>
#1 b 0.155
#2 b 0.278
#3 b 0.388
#4 b 1.22
#5 b 1.81
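If loading data.table only for rleid is not desirable, the same two checks can be sketched in base R, since rle() already reports the length of each run. This is a minimal sketch under a few assumptions: the helper name drop_group and its run argument are made up for illustration, the data frame is rebuilt by hand from the question, and measurements are assumed to contain no NA values and to be compared by exact equality.

```r
# TRUE when a group should be dropped: some value is lower than the one
# before it, or the same value occurs `run` or more times in a row
drop_group <- function(x, run = 3) {
  decreasing <- any(diff(x) < 0)
  long_run <- any(rle(x)$lengths >= run)
  decreasing || long_run
}

# Example data rebuilt from the question
df <- data.frame(
  sample = rep(c("a", "b", "c"), each = 5),
  measurement = c(0.1443, 0.222, 0.333, 0.9435, 0.8051,
                  0.1554, 0.2775, 0.3885, 1.221, 1.8093,
                  0, 0, 0, 0.0333, 0.2997)
)

# One logical flag per sample, then mapped back onto the rows by sample name
bad <- tapply(df$measurement, df$sample, drop_group)
keep <- !as.vector(bad[as.character(df$sample)])
result <- df[keep, ]
```

Here result holds only the five rows of sample "b". The threshold is a parameter, so a different run length can be tested without rewriting the condition.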