How to unlist column of strings to count matches?
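I want to count the matching strings between two datasets. The first dataset has a column of genes and a second column listing the genes that interact with each of them.
For example: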
#dataset1
Gene Interactors
ACE BRCA2, NOS2, SEPT9
HER2 AGT, TGRF
YUO SEPT9, NOS2, TET2
I have a second dataset with the same layout: genes and their interacting genes. For example:
#dataset2
Gene Interactors
RTY ADFD, NOS3, SEPT9
TERT ADAM2, GERP
GHJ TET2, NOS2
I want to count, for each row of dataset 1, how many of its Interactors also appear among the Interactors of dataset 2.
Example output:
Gene Interactors Secondary_interaction_count
ACE BRCA2, NOS2, SEPT9 2 #SEPT9 and NOS2 are in the 2nd dataset under interacting genes
HER2 AGT, TGRF 0
YUO SEPT9, NOS2, TET2 3 #all 3 are in dataset 2
So far I have two attempts. The first only returns TRUE or FALSE, and I don't know how to turn it into a count:
temp <- unlist(strsplit(df2$Interactors, ', '))
df1$secondary_count <- sapply(strsplit(df1$Interactors, ', '),
                              function(x) any(x %in% temp))
The second, I think, does not split the strings, and I'm not sure how to modify it:
df1 %>%
  mutate(secondary_count = str_count(Interactors, str_c(df2$Interactors, collapse = '|')))
Is there a way to modify either of these attempts to get a count? Or should I try a different approach?
Input data:
#df1:
structure(list(Gene = c("ACE", "HER2", "YUO"), Interactors = c("BRCA2, NOS2, SEPT9",
"AGT, TGRF", "SEPT9, NOS2, TET2")), row.names = c(NA, -3L), class = c("data.table",
"data.frame"))
#df2:
structure(list(Gene = c("RTY", "TERT", "GHJ"), Interactors = c("ADFD, NOS3, SEPT9",
"ADAM2, GERP", "TET2, NOS2")), row.names = c(NA, -3L), class = c("data.table",
"data.frame"))
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] sqldf_0.4-11 RSQLite_2.2.1 gsubfn_0.7 proto_1.0.0
[5] forcats_0.5.0 stringr_1.4.0 purrr_0.3.4 readr_1.4.0
[9] tidyr_1.1.2 tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0
[13] plyr_1.8.6 dplyr_1.0.2 data.table_1.13.2
loaded via a namespace (and not attached):
[1] gtools_3.8.2 tidyselect_1.1.0 haven_2.3.1 tcltk_4.0.2
[5] colorspace_1.4-1 vctrs_0.3.4 generics_0.0.2 chron_2.3-56
[9] blob_1.2.1 rlang_0.4.8 pillar_1.4.6 glue_1.4.1
[13] withr_2.3.0 DBI_1.1.0 bit64_4.0.5 dbplyr_1.4.4
[17] modelr_0.1.8 readxl_1.3.1 lifecycle_0.2.0 munsell_0.5.0
[21] gtable_0.3.0 cellranger_1.1.0 rvest_0.3.6 memoise_1.1.0
[25] fansi_0.4.1 broom_0.7.2 Rcpp_1.0.5 scales_1.1.1
[29] backports_1.1.10 jsonlite_1.7.1 fs_1.5.0 bit_4.0.4
[33] hms_0.5.3 digest_0.6.27 stringi_1.5.3 grid_4.0.2
[37] cli_2.1.0 tools_4.0.2 magrittr_1.5 crayon_1.3.4
[41] pkgconfig_2.0.3 ellipsis_0.3.1 xml2_1.3.2 reprex_0.3.0
[45] lubridate_1.7.9 assertthat_0.2.1 httr_1.4.2 rstudioapi_0.11
[49] R6_2.4.1 compiler_4.0.2
Try this:
library(tidyr)
library(dplyr)
sep_rows <- . %>% separate_rows(Interactors, sep = ", ")
df1 %>%
  sep_rows() %>%
  mutate(
    found = !is.na(match(Interactors, sep_rows(df2)$Interactors))
  ) %>%
  group_by(Gene) %>%
  summarise(
    Interactors = toString(Interactors),
    Secondary_interaction_count = sum(found)
  )
Output:
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 3
Gene Interactors Secondary_interaction_count
<chr> <chr> <int>
1 ACE BRCA2, NOS2, SEPT9 2
2 HER2 AGT, TGRF 0
3 YUO SEPT9, NOS2, TET2 3
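For comparison, the first attempt in the question can probably be turned into a count with a very small change: `any()` collapses the matches to a single TRUE/FALSE, whereas `sum()` counts them. Below is a minimal base R sketch of that idea. It assumes the column is named `Interactors` (capitalised, as in the input data) and rebuilds `df1` and `df2` as plain data frames so that it runs on its own; the output column name is taken from the example output in the question.

```r
# Input data from the question, rebuilt as plain data frames
df1 <- data.frame(
  Gene = c("ACE", "HER2", "YUO"),
  Interactors = c("BRCA2, NOS2, SEPT9", "AGT, TGRF", "SEPT9, NOS2, TET2"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Gene = c("RTY", "TERT", "GHJ"),
  Interactors = c("ADFD, NOS3, SEPT9", "ADAM2, GERP", "TET2, NOS2"),
  stringsAsFactors = FALSE
)

# Every distinct interactor that appears anywhere in df2
temp <- unique(unlist(strsplit(df2$Interactors, ", ", fixed = TRUE)))

# Split each row of df1 and count how many of its interactors are in temp;
# sum() over a logical vector counts the TRUE values
df1$Secondary_interaction_count <- vapply(
  strsplit(df1$Interactors, ", ", fixed = TRUE),
  function(x) sum(x %in% temp),
  integer(1)
)

df1
```

Because the strings are split before matching, this compares whole gene names, so a name such as NOS2 cannot be matched inside a longer name.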
Another approach:
> df1 %>% separate_rows(Interactors) %>% rowwise() %>%
+ mutate(secondary_interactions = str_extract_all(Interactors, paste0(df2 %>% separate_rows(Interactors) %>% pull(Interactors), collapse = '|'))) %>%
+ unnest(secondary_interactions, keep_empty = T) %>% group_by(Gene) %>%
+ mutate(Interactors = toString(Interactors), secondary_interactions_cnt = case_when(is.na(secondary_interactions) ~ 0, TRUE ~ 1)) %>%
+ mutate(secondary_interactions = sum(secondary_interactions_cnt)) %>% select(-4)%>% distinct()
# A tibble: 3 x 3
# Groups: Gene [3]
Gene Interactors secondary_interactions
<chr> <chr> <dbl>
1 ACE BRCA2, NOS2, SEPT9 2
2 HER2 AGT, TGRF 0
3 YUO SEPT9, NOS2, TET2 3
>
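The `str_count()` attempt in the question can likely be adapted as well. As written, it joins the unsplit strings of dataset 2, so the pattern only matches when a whole comma-separated list occurs verbatim. The sketch below splits dataset 2 into individual gene names first and wraps the alternation in `\\b` word boundaries. It assumes gene names consist only of letters and digits (a name containing a hyphen or other punctuation would need a different boundary rule), and it rebuilds the input data as plain data frames so that it is self-contained.

```r
library(stringr)

# Input data from the question, rebuilt as plain data frames
df1 <- data.frame(
  Gene = c("ACE", "HER2", "YUO"),
  Interactors = c("BRCA2, NOS2, SEPT9", "AGT, TGRF", "SEPT9, NOS2, TET2"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Gene = c("RTY", "TERT", "GHJ"),
  Interactors = c("ADFD, NOS3, SEPT9", "ADAM2, GERP", "TET2, NOS2"),
  stringsAsFactors = FALSE
)

# Individual gene names from df2 (split on a comma plus optional whitespace)
genes2 <- unique(unlist(str_split(df2$Interactors, ",\\s*")))

# One alternation of whole names; \\b stops a name matching inside a longer one
pattern <- str_c("\\b(", str_c(genes2, collapse = "|"), ")\\b")

# Count pattern matches in each row of df1
df1$secondary_count <- str_count(df1$Interactors, pattern)

df1
```

Note that this counts every occurrence, so a gene listed twice in the same row of dataset 1 would be counted twice; splitting and using `%in%` may be the safer choice if that matters.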