How to find matching values in multiple unlisted strings?
I have two datasets:
#df1:
Gene interactors
ACE BRCA, HER2
NOS NA, NA
P53 NA
CDON TGBP
df2:
Gene interactors
AGT NOS, HER2
NPKB CDON
P70 GPC
IK TGBP
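I want to find the genes in df1 that are listed as interactors in df2, and to identify the genes in df1 whose interactors match interactors in df2.
Expected output: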
Gene interactors matched_gene_interactor matched_interactor_interactor
ACE BRCA, HER2 FALSE TRUE
NOS NA, NA TRUE FALSE
P53 NA FALSE FALSE
CDON TGBP TRUE TRUE
#ACE has an interactor (HER2) in both df1 and df2
#NOS matches itself as an interactor in df2
#CDON matches itself as an interactor in df2 and as having an interactor (TGBP) in both df1 and df2
I have been able to get the matched_gene_interactor column with the following code:
df1$matched_gene_interactor <- df1$Gene %in% unlist(strsplit(df2$interactors, ", "))
But I am stuck on getting the second column, matched_interactor_interactor.
I have tried a few things but have not worked out how to get the second column I want, for example:
df1interactors <- unlist(strsplit(df1$interactors, ", "))
df2interactors <- unlist(strsplit(df2$interactors, ", "))
matched_interactor_interactor <- df1interactors %in% df2interactors
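How can I match two datasets whose strings have been split and unlisted? I come from a biology background, so I am not sure where to start.
Sample input data: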
df1:
structure(list(Gene = c("ACE", "NOS", "P53", "CDON"), interactors = c("BRCA, HER2",
"NA, NA", NA, "TGBP")), row.names = c(NA, -4L), class = c("data.table",
"data.frame"))
df2:
structure(list(Gene = c("AGT", "NPKB", "P70", "IK"), interactors = c("NOS, HER2",
"CDON", "GPC", "TGBP")), row.names = c(NA, -4L), class = c("data.table",
"data.frame"))
You can split the interactors column of df2 on ", " and then check, for each row of df1, whether any of that row's interactors values appears in the result.
temp <- unlist(strsplit(df2$interactors, ', '))
df1$matched_interactor_interactor <- sapply(strsplit(df1$interactors, ', '),
function(x) any(x %in% temp))
df1
# Gene interactors matched_gene_interactor matched_interactor_interactor
#1: ACE BRCA, HER2 FALSE TRUE
#2: NOS NA, NA TRUE FALSE
#3: P53 <NA> FALSE FALSE
#4: CDON TGBP TRUE TRUE
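For reference, here is a self-contained sketch of the same approach that can be pasted into a fresh R session. It rebuilds the sample data as plain data.frames (an assumption made for portability; the original objects also carry the data.table class), and the names pool and per_row are made up for illustration. It also shows why the attempt in the question did not line up with df1: calling unlist() on df1's split values discards the row boundaries, so the result has 6 elements for 4 rows.

```r
# Sample data from the question, rebuilt as plain data.frames (an assumption;
# the original objects also carry the "data.table" class).
df1 <- data.frame(
  Gene = c("ACE", "NOS", "P53", "CDON"),
  interactors = c("BRCA, HER2", "NA, NA", NA, "TGBP"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Gene = c("AGT", "NPKB", "P70", "IK"),
  interactors = c("NOS, HER2", "CDON", "GPC", "TGBP"),
  stringsAsFactors = FALSE
)

# Flat pool of every interactor named anywhere in df2 (hypothetical name).
pool <- unlist(strsplit(df2$interactors, ", ", fixed = TRUE))

# Keep df1's split as a list: one character vector per row (hypothetical name).
per_row <- strsplit(df1$interactors, ", ", fixed = TRUE)

# Flattening df1 as well loses the row boundaries: 6 values for 4 rows.
n_flat <- length(unlist(per_row))

# Column 1: is the gene itself named as an interactor in df2?
df1$matched_gene_interactor <- df1$Gene %in% pool

# Column 2: does any interactor in this row also appear in df2's pool?
# vapply() guarantees exactly one logical value per row.
df1$matched_interactor_interactor <- vapply(
  per_row, function(x) any(x %in% pool), logical(1)
)

df1
```

vapply() is used here instead of sapply() only because it fails loudly if a row ever yields something other than a single logical; for this data the two return the same values.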
If df2$interactors is not very large, you can also do this without splitting df1$interactors, by building a regular-expression pattern dynamically:
grepl(paste0('\\b', temp, '\\b', collapse = '|'), df1$interactors)
#[1] TRUE FALSE FALSE TRUE
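To see what the word boundaries buy you, here is a small sketch comparing the pattern with and without them. The last test string ("HER21, CDON2") is invented for illustration and is not in the question's data, and perl = TRUE is my own choice for predictable boundary handling rather than something the approach requires. Note that inside an R string the boundary has to be written as "\\b"; a single "\b" is a backspace character.

```r
# Interactors pooled from df2 in the question.
pool <- c("NOS", "HER2", "CDON", "GPC", "TGBP")

# df1$interactors from the question plus one made-up row ("HER21, CDON2")
# whose names merely contain pool entries as substrings.
x <- c("BRCA, HER2", "NA, NA", NA, "TGBP", "HER21, CDON2")

# Without boundaries: "NOS|HER2|CDON|GPC|TGBP"
pattern_plain <- paste0(pool, collapse = "|")

# With boundaries: "\\bNOS\\b|\\bHER2\\b|..." (each name must match as a whole word)
pattern_bounded <- paste0("\\b", pool, "\\b", collapse = "|")

# perl = TRUE is an assumption made here, not part of the original answer.
hits_plain   <- grepl(pattern_plain, x, perl = TRUE)
hits_bounded <- grepl(pattern_bounded, x, perl = TRUE)

hits_plain
hits_bounded
```

The plain pattern flags the made-up row because "HER2" is a substring of "HER21", while the bounded pattern does not. The missing value comes back as FALSE in both cases, since grepl() returns FALSE for NA input.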