How to semi_join two dataframes by string column with one being colon-separated

I have two data frames, dfa and dfb:

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = c(1:5)
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
  id = c(6:10)
)

They look like this:

> dfa
  gene_name id
1     MUC16  1
2      MUC2  2
3       MET  3
4      FAT1  4
5      TERT  5

> dfb
  gene_name id
1      MUC1  6
2 MET; BLEP  7
3     MUC21  8
4       FAT  9
5      TERT 10

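dfa is my list of genes of interest: I want to keep the rows of dfb in which any of those genes appear, paying attention to the digits (MUC1 must not be treated as a match for MUC16). My new_df should look like this: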

> new_df
  gene_name id
1 MET; BLEP  7
2      TERT 10

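My problem is that the regular dplyr::semi_join() does exact matching, which does not account for the fact that dfb$gene_name can contain several genes separated by "; ". This means that in this example the "MET; BLEP" row is not kept, even though it contains "MET".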
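
To make the problem concrete, here is a minimal reproduction with the data frames above. The only addition is stringsAsFactors = FALSE, an assumption made here so that the columns are plain character vectors on older R versions too:

```r
library(dplyr)

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = 1:5,
  stringsAsFactors = FALSE
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
  id = 6:10,
  stringsAsFactors = FALSE
)

# Exact matching on the whole string: "MET; BLEP" is not equal to "MET",
# so only the TERT row (id 10) is kept.
exact <- semi_join(dfb, dfa, by = "gene_name")
exact
```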

I tried looking into fuzzyjoin::regex_semi_join, but I couldn't get it to do what I want...

Tidyverse solutions are welcome (maybe stringr?).

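Edit: follow-up question...

How would I do the reverse, i.e. an anti_join? Simply changing semi_join to anti_join in the separate_rows() approach does not work, because the row MET; BLEP shows up when it should not...

Adding a filter(gene_name == new_col) after the anti_join works for the simple dataset provided, but not if I twist it slightly like this: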

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = c(1:5)
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21; BLOUB", "FAT", "TERT"),
  id = c(6:10)
)

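...then it no longer works. Here and in my real dataset, dfa contains no semicolons; it is just a single column of individual gene names. dfb, however, contains a lot of information, with many different semicolon-separated combinations...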

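You can split the data frame with separate_rows() before joining. Note that if BLEP were also present in dfa, the MET; BLEP row would come out twice, which is why distinct() is used.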

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = c(1:5),
  stringsAsFactors = FALSE
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
  id = c(6:10),
  stringsAsFactors = FALSE
)


library(tidyverse)

dfb %>%
  mutate(new_col = gene_name) %>%
  separate_rows(new_col, sep = "; ") %>%
  semi_join(dfa, by = c("new_col" = "gene_name")) %>%
  select(gene_name, id) %>%
  distinct()
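
For the follow-up about the reverse operation, one possible sketch (not necessarily the only way) is to compute the matched rows first and then anti_join the original dfb against them. Swapping semi_join for anti_join directly fails because, after splitting, the BLEP piece of "MET; BLEP" has no match in dfa and therefore survives. The names matched and unmatched are just illustrative, and the "twisted" dfb from the question is used:

```r
library(dplyr)
library(tidyr)

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = 1:5,
  stringsAsFactors = FALSE
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21; BLOUB", "FAT", "TERT"),
  id = 6:10,
  stringsAsFactors = FALSE
)

# Rows of dfb in which at least one listed gene appears in dfa
matched <- dfb %>%
  mutate(new_col = gene_name) %>%
  separate_rows(new_col, sep = "; ") %>%
  semi_join(dfa, by = c("new_col" = "gene_name")) %>%
  distinct(gene_name, id)

# All remaining rows of dfb, i.e. rows where none of the genes is in dfa
unmatched <- dfb %>%
  anti_join(matched, by = c("gene_name", "id"))
```

This assumes that the combination of gene_name and id identifies a row of dfb; if it does not, add a row identifier before splitting and join on that instead.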


Here is a solution using stringr and purrr.

library(tidyverse)

dfb %>%
 mutate(gene_name_list = str_split(gene_name, "; ")) %>%
 mutate(gene_of_interest = map_lgl(gene_name_list, some, ~ . %in% dfa$gene_name)) %>%
 filter(gene_of_interest == TRUE) %>%
 select(gene_name, id)
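
The same idea can probably cover the follow-up (anti_join) case as well, since the logical flag only has to be negated. A minimal sketch, using the "twisted" dfb from the question; has_gene_of_interest is a hypothetical helper introduced here for readability, not part of any package:

```r
library(dplyr)
library(purrr)
library(stringr)

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = 1:5,
  stringsAsFactors = FALSE
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21; BLOUB", "FAT", "TERT"),
  id = 6:10,
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if any gene in the split vector is listed in dfa
has_gene_of_interest <- function(genes) any(genes %in% dfa$gene_name)

# Flag each row of dfb once, then filter in either direction
dfb_flagged <- dfb %>%
  mutate(gene_of_interest = map_lgl(str_split(gene_name, "; "),
                                    has_gene_of_interest))

# semi_join-like result: rows containing at least one gene from dfa
kept <- dfb_flagged %>%
  filter(gene_of_interest) %>%
  select(gene_name, id)

# anti_join-like result: rows containing no gene from dfa
dropped <- dfb_flagged %>%
  filter(!gene_of_interest) %>%
  select(gene_name, id)
```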

I think I finally managed to get the fuzzyjoin regex joins to do what I want. It is quite simple; I just needed to adjust my dfa filter list:

library(fuzzyjoin)

# wrap each gene of the filter list in "\\b" (regex word boundary)
# so that only whole words match (MUC2 must not match MUC21);
# the backslash has to be doubled inside an R string
dfa$gene_name <- paste0("\\b", dfa$gene_name, "\\b")

# to keep genes from dfb that are present in the dfa filter list
dfb %>% 
  regex_semi_join(dfa, by = c(gene_name = "gene_name"))

# to exclude genes from dfb that are present in the dfa filter blacklist
dfb %>% 
  regex_anti_join(dfa, by = c(gene_name = "gene_name"))

There is one drawback, though: it is slow...
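
If speed is a concern, one alternative worth trying is to collapse the whole filter list into a single alternation pattern and test each row of dfb once with stringr::str_detect(). This is only a sketch and has not been benchmarked here. It assumes the gene names contain no regex metacharacters, and, like the approach above, it relies on \b, so a name such as "HLA" would also match inside "HLA-A".

```r
library(stringr)

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = 1:5,
  stringsAsFactors = FALSE
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21; BLOUB", "FAT", "TERT"),
  id = 6:10,
  stringsAsFactors = FALSE
)

# One pattern for the whole list: \b(MUC16|MUC2|MET|FAT1|TERT)\b
pattern <- paste0("\\b(", paste(dfa$gene_name, collapse = "|"), ")\\b")

# TRUE for rows of dfb that contain at least one whole-word gene from dfa
keep <- str_detect(dfb$gene_name, pattern)

kept <- dfb[keep, ]      # semi_join-like result
dropped <- dfb[!keep, ]  # anti_join-like result
```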