filter_all() with a negated str_detect() approach
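Dear colleagues, hello,
First of all, this is my first time asking a question here, so I hope I will be clear.
I am currently facing some challenges handling a large number of data frames with variable dimensions and unconventional column names. The challenge is to remove unwanted rows that match several keywords (here, rows for whole-genome shotgun sequencing samples). Selecting the rows that match a keyword is easy; for that I use filter_all(any_vars(str_detect(., "WGS"))). However, trying to negate the code with negate=T or !str_detect() returns the whole data frame, as if the negation had no effect, and using all_vars() removes every row in the df.
I found a solution, but I find it heavy and I am pretty sure there is a better way to do this: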
> tmp <- metadata[["PRJNA237362"]]
> no <- tmp %>% filter_all(any_vars(str_detect(., "WGS")))
> final <- tmp[tmp$Run %notin% no$Run,]
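I am not very familiar with the tidyverse and still have a lot to learn, so I may be missing something here.
I do not understand why filter returns the whole df when the expression is negated.
Thanks to anyone who answers!
Have a nice day.
Remi
A reproducible example of what I am working on: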
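> library(tidyverse)
> `%notin%` <- Negate(`%in%`)  # helper used below, defined here so the example runs on its own
> data(msleep)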
> msleep%>% filter_all(any_vars(str_detect(., "omni"))) %>% glimpse()
> msleep%>% filter_all(any_vars(str_detect(., "omni", negate=T))) %>% glimpse()
> no <- msleep %>% filter_all(any_vars(str_detect(., "omni"))) %>% glimpse()
> yes <- msleep[msleep$vore %notin% no$vore,] %>% glimpse()
Here is part of the df I am working on:
> df = structure(list(Run = c("ERR2804817", "ERR2804818", "ERR2804819",
"ERR2804820", "ERR2804821", "ERR2834367", "ERR2834371", "ERR2834373",
"ERR2834374", "ERR2834375", "ERR2834376", "ERR2834377", "ERR2834379",
"ERR2828323", "ERR2828326", "ERR2828327", "ERR2828328", "ERR2828330"
), LibraryLayout = c("PAIRED", "PAIRED", "PAIRED", "PAIRED",
"PAIRED", "PAIRED", "PAIRED", "PAIRED", "PAIRED", "PAIRED", "PAIRED",
"PAIRED", "PAIRED", "SINGLE", "SINGLE", "SINGLE", "SINGLE", "SINGLE"
), Library.Name = c("Bangladeshi_2yr", "Bangladeshi_2yr", "Bangladeshi_2yr",
"Bangladeshi_2yr", "Bangladeshi_2yr", "table S7A,B; WGS", "table S7A,B; WGS",
"table S7A,B; WGS", "table S7A,B; WGS", "table S7A,B; WGS", "table S7A,B; WGS",
"table S7A,B; WGS", "table S7A,B; WGS", "table S12", "table S12",
"table S12", "table S12", "table S12"), LibrarySource = c("METAGENOMIC",
"METAGENOMIC", "METAGENOMIC", "METAGENOMIC", "METAGENOMIC", "GENOMIC",
"GENOMIC", "GENOMIC", "GENOMIC", "GENOMIC", "GENOMIC", "GENOMIC",
"GENOMIC", "METATRANSCRIPTOMIC", "METATRANSCRIPTOMIC", "METATRANSCRIPTOMIC",
"METATRANSCRIPTOMIC", "METATRANSCRIPTOMIC"), Instrument = c("Illumina MiSeq",
"Illumina MiSeq", "Illumina MiSeq", "Illumina MiSeq", "Illumina MiSeq",
"Illumina MiSeq", "Illumina MiSeq", "Illumina MiSeq", "Illumina MiSeq",
"Illumina MiSeq", "Illumina MiSeq", "Illumina MiSeq", "Illumina MiSeq",
"NextSeq 500", "NextSeq 500", "NextSeq 500", "NextSeq 500", "NextSeq 500"
)), row.names = c(1L, 2L, 3L, 4L, 5L, 73L, 74L, 75L, 76L, 77L,
78L, 79L, 80L, 806L, 807L, 808L, 809L, 810L), class = "data.frame")
> # Here is what I have for now
> `%notin%` = Negate(`%in%`)
> # any_vars() takes a single expression, so the stray everything() argument is dropped
> tmp = meta %>% filter_all(any_vars(str_detect(., "shotgun|WGS|whole genome|all_genome|WXS|WholeGenomeShotgun|Whole genome shotgun|Metatranscriptomic|WXS")))
> meta = meta[meta$Run %notin% tmp$Run,]
Ultimately, I would like to do something like this:
> tmp = meta %>% filter_all(any_vars(!str_detect(., "shotgun|WGS|whole genome|all_genome|WXS|WholeGenomeShotgun|Whole genome shotgun|Metatranscriptomic|WXS")))
> #OR this version
> tmp = meta %>% filter_all(any_vars(str_detect(., "shotgun|WGS|whole genome|all_genome|WXS|WholeGenomeShotgun|Whole genome shotgun|Metatranscriptomic|WXS", negate=T)))
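The tricky part is that I cannot predict the colnames or the dimensions of my df, so I wrote a for() loop with conditions that detects the patterns, removes the matching rows, and returns a clean df.
For now my code works, but I am sure there is a better way.
Thank you very much.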
> packageVersion("tidyverse")
[1] ‘1.3.0’
> packageVersion("dplyr")
[1] ‘1.0.5’
> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
[3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
[5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
[7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rmdformats_1.0.1 ggpubr_0.4.0 forcats_0.5.1 stringr_1.4.0
[5] dplyr_1.0.5 purrr_0.3.4 readr_1.4.0 tidyr_1.1.3
[9] tibble_3.1.0 tidyverse_1.3.0 ade4_1.7-16 factoextra_1.0.7
[13] ggplot2_3.3.3 FactoMineR_2.4
loaded via a namespace (and not attached):
[1] httr_1.4.2 jsonlite_1.7.2 prettydoc_0.4.1
[4] carData_3.0-4 modelr_0.1.8 assertthat_0.2.1
[7] cellranger_1.1.0 yaml_2.2.1 progress_1.2.2
[10] ggrepel_0.9.1 pillar_1.5.1 backports_1.2.1
[13] lattice_0.20-41 glue_1.4.2 digest_0.6.27
[16] ggsignif_0.6.1 rvest_0.3.6 colorspace_2.0-0
[19] cowplot_1.1.1 htmltools_0.5.1.1 pkgconfig_2.0.3
[22] broom_0.7.5 haven_2.3.1 bookdown_0.21
[25] scales_1.1.1 openxlsx_4.2.3 rio_0.5.26
[28] farver_2.1.0 generics_0.1.0 car_3.0-10
[31] ellipsis_0.3.1 DT_0.17 withr_2.4.1
[34] cli_2.3.1 magrittr_2.0.1 crayon_1.4.1
[37] readxl_1.3.1 evaluate_0.14 fs_1.5.0
[40] fansi_0.4.2 MASS_7.3-53.1 rstatix_0.7.0
[43] xml2_1.3.2 foreign_0.8-81 tools_4.0.4
[46] data.table_1.14.0 prettyunits_1.1.1 hms_1.0.0
[49] lifecycle_1.0.0 munsell_0.5.0 reprex_1.0.0
[52] zip_2.1.1 cluster_2.1.1 flashClust_1.01-2
[55] compiler_4.0.4 rlang_0.4.10 grid_4.0.4
[58] rstudioapi_0.13 htmlwidgets_1.5.3 leaps_3.1
[61] labeling_0.4.2 rmarkdown_2.7 gtable_0.3.0
[64] abind_1.4-5 DBI_1.1.1 curl_4.3
[67] R6_2.5.0 lubridate_1.7.10 knitr_1.31
[70] utf8_1.2.1 stringi_1.5.3 Rcpp_1.0.6
[73] vctrs_0.3.6 scatterplot3d_0.3-41 dbplyr_2.1.0
[76] tidyselect_1.1.0 xfun_0.22
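Since you mention multiple keywords, you can pass them all to str_detect() by joining them with the regex | (OR) operator.
The following lines filter out (via negate = TRUE) all rows in which at least one variable matches at least one of the given patterns, here ui|Br|lis|Ch|omni.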
library(tidyverse)
keywords_to_remove <- c("ui", "Br", "lis", "Ch", "omni")
keywords_regex <- paste0(keywords_to_remove, collapse = "|")
msleep %>%
filter(if_all(
.cols = everything(),
.fns = ~ stringr::str_detect(.x, keywords_regex, negate = TRUE))
)
#> # A tibble: 9 x 11
#> name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
#> 2 Dog Canis carni Carn… domesticated 10.1 2.9 0.333 13.9
#> 3 Long-… Dasyp… carni Cing… lc 17.4 3.1 0.383 6.6
#> 4 Horse Equus herbi Peri… domesticated 2.9 0.6 1 21.1
#> 5 Golde… Mesoc… herbi Rode… en 14.3 3.1 0.2 9.7
#> 6 House… Mus herbi Rode… nt 12.5 1.4 0.183 11.5
#> 7 Rabbit Oryct… herbi Lago… domesticated 8.4 0.9 0.417 15.6
#> 8 Labor… Rattus herbi Rode… lc 13 2.4 0.183 11
#> 9 Easte… Scalo… inse… Sori… lc 8.4 2.1 0.167 15.6
#> # … with 2 more variables: brainwt <dbl>, bodywt <dbl>
packageVersion("dplyr")
#> [1] '1.0.5'
Created on 2021-03-23 by the reprex package (v1.0.0)
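Second edit, based on the updated information:
Another way to solve this is to work row-wise and add a match column based on the regular expression of your choice.
If you want to keep rows with NA values in the final filter, this should work: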
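pattern <- "omni"
msleep %>%
  rowwise() %>%
  # na.rm = TRUE: an NA cell counts as "no match", so any() never returns NA
  # and rows containing NA survive the filter
  mutate(regex_match = any(str_detect(c_across(where(is.character)),
                                      regex(pattern)), na.rm = TRUE)) %>%
  filter(!regex_match)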
If you want to exclude rows with NA values, add a replace_na() step:
msleep %>%
rowwise() %>%
mutate(regex_match = any(str_detect(c_across(is.character), regex("omni")), na.rm = FALSE),
regex_match = replace_na(regex_match, TRUE)) %>%
filter(!regex_match)
The first version, applied to your metadata:
pattern <- "shotgun|WGS|whole genome|all_genome|WXS|WholeGenomeShotgun|Whole genome shotgun|Metatranscriptomic|WXS"
metadata %>%
  rowwise() %>%
  # na.rm = TRUE keeps rows whose only "non-match" is a missing value
  mutate(regex_match = any(str_detect(c_across(where(is.character)), regex(pattern)), na.rm = TRUE)) %>%
  filter(!regex_match)
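Because the question mentions many data frames with unpredictable column names, the same idea can be wrapped in a small function and mapped over a list. This is only a sketch under assumptions: `metadata_toy` is made-up stand-in data, `drop_matching_rows()` is a hypothetical helper (not a dplyr function), the pattern is shortened, and a missing value is treated as "does not match".

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the `metadata` list: data frames whose column
# names and dimensions differ from one study to the next.
metadata_toy <- list(
  study_a = tibble::tibble(
    Run          = c("ERR1", "ERR2", "ERR3"),
    Library.Name = c("Bangladeshi_2yr", "table S7A,B; WGS", NA)
  ),
  study_b = tibble::tibble(
    run_id   = c("SRR1", "SRR2"),
    source   = c("METAGENOMIC", "GENOMIC"),
    strategy = c("AMPLICON", "WXS")
  )
)

unwanted <- "shotgun|WGS|whole genome|WXS"

# Hypothetical helper: keep a row only if no character column matches
# `pattern`. is.na(.x) makes a missing value count as "no match".
drop_matching_rows <- function(df, pattern) {
  df %>%
    filter(if_all(where(is.character), ~ is.na(.x) | !str_detect(.x, pattern)))
}

# Apply the same filter to every data frame, whatever its columns are.
cleaned <- lapply(metadata_toy, drop_matching_rows, pattern = unwanted)

cleaned$study_a$Run     # "ERR1" "ERR3"
cleaned$study_b$run_id  # "SRR1"
```

This avoids both the for() loop and the row-wise step, at the cost of requiring dplyr 1.0.4 or later for if_all().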
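Edit 1
I think the problem is the combination of the negation with the any_vars() syntax: the filter returns the whole data frame because every row has at least one column whose value does not contain "omni" (or "WGS" in your data).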
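To make that concrete, here is a minimal sketch on a made-up tibble (`toy` and its columns are hypothetical). It uses if_any() and if_all() in the roles of any_vars() and all_vars(): the complement of "any column contains the pattern" is "all columns lack the pattern", not "any column lacks the pattern".

```r
library(dplyr)
library(stringr)

# Made-up data: "WGS" appears in a different column in rows r1 and r2.
toy <- tibble::tibble(
  id   = c("r1", "r2", "r3"),
  kind = c("WGS", "amplicon", "amplicon"),
  note = c("x", "table S7; WGS", "table S12")
)

# "Keep rows where ANY column lacks the pattern": true for every row,
# because the id column never contains "WGS".
any_not <- toy %>%
  filter(if_any(everything(), ~ !str_detect(.x, "WGS")))

# "Keep rows where ALL columns lack the pattern": the actual complement
# of "any column contains the pattern".
all_not <- toy %>%
  filter(if_all(everything(), ~ !str_detect(.x, "WGS")))

nrow(any_not)  # 3
all_not$id     # "r3"
```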
With the latest dplyr syntax, you could try the following:
msleep %>% filter(if_all(starts_with("vore"), ~!str_detect(.x, "omni")))
which looks at a single column only, or
msleep %>% filter(if_all(everything(), ~!str_detect(.x, "omni")))
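for the whole data frame.
Does this suit your needs?
The propositions from @Marcelo Avila and @awaji98 solved my problem. However, I want to point out a subtlety: rows containing NA seem to be removed by the propositions above: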
msleep %>% filter_all(all_vars(str_detect(., "omni", negate=T)))
msleep %>% filter(if_all(everything(), ~!str_detect(.x, "omni")))
msleep %>%
filter(if_all(
.cols = everything(),
.fns = ~ stringr::str_detect(.x, "omni", negate = TRUE))
)
no <- msleep %>% filter(if_any(everything(), ~str_detect(., "omni")))
no
# A tibble: 20 x 11
name genus vore order conservation sleep_total sleep_rem sleep_cycle
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Owl mon… Aotus omni Prima… NA 17 1.8 NA
2 Greater… Blari… omni Soric… lc 14.9 2.3 0.133
3 Grivet Cerco… omni Prima… lc 10 0.7 NA
4 Star-no… Condy… omni Soric… lc 10.3 2.2 NA
5 African… Crice… omni Roden… NA 8.3 2 NA
6 Lesser … Crypt… omni Soric… lc 9.1 1.4 0.15
7 North A… Didel… omni Didel… lc 18 4.9 0.333
8 Europea… Erina… omni Erina… lc 10.1 3.5 0.283
9 Patas m… Eryth… omni Prima… lc 10.9 1.1 NA
10 Galago Galago omni Prima… NA 9.8 1.1 0.55
11 Human Homo omni Prima… NA 8 1.9 1.5
12 Macaque Macaca omni Prima… NA 10.1 1.2 0.75
13 Chimpan… Pan omni Prima… NA 9.7 1.4 1.42
14 Baboon Papio omni Prima… NA 9.4 1 0.667
15 Potto Perod… omni Prima… lc 11 NA NA
16 African… Rhabd… omni Roden… NA 8.7 NA NA
17 Squirre… Saimi… omni Prima… NA 9.6 1.4 NA
18 Pig Sus omni Artio… domesticated 9.1 2.4 0.5
19 Tenrec Tenrec omni Afros… NA 15.6 2.3 NA
20 Tree sh… Tupaia omni Scand… NA 8.9 2.6 0.233
# … with 3 more variables: awake <dbl>, brainwt <dbl>, bodywt <dbl>
We find 20 rows containing the pattern "omni".
no <- msleep %>% filter(if_any(everything(), ~str_detect(., "omni")))
msleep[msleep$vore %notin% no$vore,]
# A tibble: 63 x 11
name genus vore order conservation sleep_total sleep_rem sleep_cycle
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Cheetah Acino… carni Carni… lc 12.1 NA NA
2 Mountai… Aplod… herbi Roden… nt 14.4 2.4 NA
3 Cow Bos herbi Artio… domesticated 4 0.7 0.667
4 Three-t… Brady… herbi Pilosa NA 14.4 2.2 0.767
5 Norther… Callo… carni Carni… vu 8.7 1.4 0.383
6 Vesper … Calom… NA Roden… NA 7 NA NA
7 Dog Canis carni Carni… domesticated 10.1 2.9 0.333
8 Roe deer Capre… herbi Artio… lc 3 NA NA
9 Goat Capri herbi Artio… lc 5.3 0.6 NA
10 Guinea … Cavis herbi Roden… domesticated 9.4 0.8 0.217
# … with 53 more rows, and 3 more variables: awake <dbl>, brainwt <dbl>,
# bodywt <dbl>
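This effectively removes the 20 rows and returns a 63-row df.
However, seemingly because of the NAs, the following code (like the other snippets above) returns the wrong df.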
library(tidyverse)
msleep %>%
filter(
if_all(
everything(),
~stringr::str_detect(., "omni", negate = T)
)
)
# A tibble: 15 x 11
name genus vore order conservation sleep_total sleep_rem sleep_cycle
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Cow Bos herbi Artio… domesticated 4 0.7 0.667
2 Dog Canis carni Carni… domesticated 10.1 2.9 0.333
3 Guinea … Cavis herbi Roden… domesticated 9.4 0.8 0.217
4 Chinchi… Chinc… herbi Roden… domesticated 12.5 1.5 0.117
5 Long-no… Dasyp… carni Cingu… lc 17.4 3.1 0.383
6 Big bro… Eptes… inse… Chiro… lc 19.7 3.9 0.117
7 Horse Equus herbi Peris… domesticated 2.9 0.6 1
8 Domesti… Felis carni Carni… domesticated 12.5 3.2 0.417
9 Golden … Mesoc… herbi Roden… en 14.3 3.1 0.2
10 House m… Mus herbi Roden… nt 12.5 1.4 0.183
11 Rabbit Oryct… herbi Lagom… domesticated 8.4 0.9 0.417
12 Laborat… Rattus herbi Roden… lc 13 2.4 0.183
13 Eastern… Scalo… inse… Soric… lc 8.4 2.1 0.167
14 Thirtee… Sperm… herbi Roden… lc 13.8 3.4 0.217
15 Brazili… Tapir… herbi Peris… vu 4.4 1 0.9
# … with 3 more variables: awake <dbl>, brainwt <dbl>, bodywt <dbl>
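Something strange happens when str_detect() is negated.
If anyone has some insight into this, that would be great, because I fear I will not sleep tonight.
Many thanks,
Cheers from Paris.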
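A likely explanation for the 15 rows, offered as a sketch rather than a definitive diagnosis: str_detect() returns NA for a missing value, !NA is still NA, and filter() keeps only rows whose condition is TRUE, so any row with an NA in a tested column is dropped. The toy tibble and the names `strict` and `na_safe` below are made up for illustration, and treating NA as "does not contain the pattern" is an assumption about the intended result.

```r
library(dplyr)
library(stringr)

# Made-up data with NAs in both a character and a numeric column.
toy <- tibble::tibble(
  name  = c("a", "b", "c", "d"),
  vore  = c("omni", "herbi", NA, "carni"),
  sleep = c(10, NA, 7, 12)
)

# str_detect() propagates missing values.
str_detect(c("omni", "herbi", NA), "omni")
#> [1]  TRUE FALSE    NA

# Plain negation: rows "b" and "c" are lost because one of their tests is NA.
strict <- toy %>%
  filter(if_all(everything(), ~ !str_detect(.x, "omni")))

# NA-safe negation: a missing value counts as "does not contain the pattern".
na_safe <- toy %>%
  filter(if_all(everything(), ~ is.na(.x) | !str_detect(.x, "omni")))

strict$name   # "d"
na_safe$name  # "b" "c" "d"
```

Applied to msleep, the NA-safe form would be expected to return the same 63 rows as the %notin% workaround, since 20 of its 83 rows match "omni".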