Dropping duplicate values in a column only in specific ranges of row indices in R
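I have a test data frame df from which I want to remove duplicate values in the Hits column, without removing the rows associated with those duplicates. The catch is that the removal may only happen within certain ranges of row indices.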
df <- data.frame(
Hits = c("#a", "#ID:987129470", "#b", "Hit1", "Hit1", "Hit2", "Hit3", "Hit3", "#a", "#ID:6971324987", "#b", "Hit1", "Hit2", "Hit2", "Hit3"),
Category1 = c(NA, NA, NA, 0.001, 0.001, 0.002, 0.003, 0.003, NA, NA, NA, 0.023, 0.341, 0.341, 0.569),
Category2 = c(NA, NA, NA, 100, 100, 99, 98, 98, NA, NA, NA, 100, 95, 95, 97),
Category3 = c(NA, NA, NA, 100, 100, 99, 98, 98, NA, NA, NA, 98, 97, 97, 92))
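In this example, the row index ranges where duplicates should be dropped are 4:8 and 12:15. Essentially, repeated hits under each ID should be blanked out while the associated values in the other columns stay unchanged. In the original data frame (about 100k rows!) it is not feasible to specify the ranges. How can I solve this?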
Group the rows first, starting a new group at each #a, then use an ifelse statement.
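library(dplyr)
df %>%
  # every "#a" row starts a new block, so the running count is a block id
  group_by(id_Group = cumsum(Hits == "#a")) %>%
  # within a block, blank out every repeat of a value already seen
  mutate(Hits = ifelse(duplicated(Hits), "", Hits)) %>%
  ungroup() %>%
  # drop the helper column again
  select(-id_Group)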
Hits Category1 Category2 Category3
<chr> <dbl> <dbl> <dbl>
1 "#a" NA NA NA
2 "#ID:987129470" NA NA NA
3 "#b" NA NA NA
4 "Hit1" 0.001 100 100
5 "" 0.001 100 100
6 "Hit2" 0.002 99 99
7 "Hit3" 0.003 98 98
8 "" 0.003 98 98
9 "#a" NA NA NA
10 "#ID:6971324987" NA NA NA
11 "#b" NA NA NA
12 "Hit1" 0.023 100 98
13 "Hit2" 0.341 95 97
14 "" 0.341 95 97
15 "Hit3" 0.569 97 92
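To see why no explicit row ranges are needed, here is a minimal base R sketch of the same idea, using only the Hits values from the question. The names hits, block_id, is_dup and hits_clean are made up for illustration, and it assumes, as in the sample data, that every block begins with a "#a" row.

```r
# Hits column from the question's sample data
hits <- c("#a", "#ID:987129470", "#b", "Hit1", "Hit1", "Hit2", "Hit3", "Hit3",
          "#a", "#ID:6971324987", "#b", "Hit1", "Hit2", "Hit2", "Hit3")

# Running count of "#a" rows: rows 1-8 get block 1, rows 9-15 get block 2
block_id <- cumsum(hits == "#a")

# A row is a duplicate if the same (block, value) pair appeared earlier
is_dup <- duplicated(data.frame(block_id, hits))

# Blank out the repeated values; all rows are kept
hits_clean <- ifelse(is_dup, "", hits)

which(is_dup)
# 5 8 14
```

Because the block id is derived from the data itself, the same lines should work unchanged on a much longer column, provided the "#a" marker convention holds there too.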
Another possible solution:
library(tidyverse)
df <- data.frame(
Hits = c("#a", "#ID:987129470", "#b", "Hit1", "Hit1", "Hit2", "Hit3", "Hit3", "#a", "#ID:6971324987", "#b", "Hit1", "Hit2", "Hit2", "Hit3"),
Category1 = c(NA, NA, NA, 0.001, 0.001, 0.002, 0.003, 0.003, NA, NA, NA, 0.023, 0.341, 0.341, 0.569),
Category2 = c(NA, NA, NA, 100, 100, 99, 98, 98, NA, NA, NA, 100, 95, 95, 97),
Category3 = c(NA, NA, NA, 100, 100, 99, 98, 98, NA, NA, NA, 98, 97, 97, 92))
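df %>%
  # run_id numbers each run of consecutive hit / non-hit rows;
  # "^Hit" matches values that start with "Hit"
  group_by(Hits, run_id = data.table::rleid(str_detect(Hits, "^Hit"))) %>%
  # within a run, keep the first occurrence of each hit and blank the rest
  mutate(Hits = if_else(row_number() > 1 & str_detect(Hits, "^Hit"), "", Hits)) %>%
  ungroup() %>%
  select(-run_id)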
#> # A tibble: 15 × 4
#> Hits Category1 Category2 Category3
#> <chr> <dbl> <dbl> <dbl>
#> 1 "#a" NA NA NA
#> 2 "#ID:987129470" NA NA NA
#> 3 "#b" NA NA NA
#> 4 "Hit1" 0.001 100 100
#> 5 "" 0.001 100 100
#> 6 "Hit2" 0.002 99 99
#> 7 "Hit3" 0.003 98 98
#> 8 "" 0.003 98 98
#> 9 "#a" NA NA NA
#> 10 "#ID:6971324987" NA NA NA
#> 11 "#b" NA NA NA
#> 12 "Hit1" 0.023 100 98
#> 13 "Hit2" 0.341 95 97
#> 14 "" 0.341 95 97
#> 15 "Hit3" 0.569 97 92
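The grouping in this second answer relies on run-length ids: consecutive rows that are all hits (or all non-hits) share one id. The sketch below reproduces that step in base R with rle() instead of data.table::rleid(), purely as an illustration. The variable names are hypothetical, and it assumes that hit labels always start with "Hit", which holds for the sample data but may not hold for the real data.

```r
# Hits column from the question's sample data
hits <- c("#a", "#ID:987129470", "#b", "Hit1", "Hit1", "Hit2", "Hit3", "Hit3",
          "#a", "#ID:6971324987", "#b", "Hit1", "Hit2", "Hit2", "Hit3")

# TRUE for hit rows, FALSE for the "#..." header rows
is_hit <- startsWith(hits, "Hit")

# Number the runs of consecutive equal values in is_hit
runs <- rle(is_hit)
run_id <- rep(seq_along(runs$lengths), runs$lengths)
run_id
# 1 1 1 2 2 2 2 2 3 3 3 4 4 4 4

# Blank a hit only if the same value already occurred in the same run
blank <- is_hit & duplicated(data.frame(run_id, hits))
hits_clean <- ifelse(blank, "", hits)
```

Compared with grouping on cumsum(Hits == "#a"), this variant does not depend on the "#a" marker, but it does depend on being able to recognise hit rows by their prefix.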