How can I find common rows between two dataframes based on two different numeric columns?

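I have two different dataframes and want to find the rows they have in common based on two numeric columns (the start and end positions of genes in a genome). I don't want to require an exact match on these two columns; I want to find matches within a (+/-) tolerance.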

Specifically, I want to treat rows as common even if their start and end positions differ by 200 bp or 500 bp. I'm a beginner in R and can't find a way to do this.

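In df2, the START position of DEC1 differs from the one in df1, but only by 400 bp, so I want to count that gene as common to df1 and df2. PSA and AKT differ in START by more than 500 bp, so I don't want them, even though their END positions are identical in df1 and df2.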

df1 <- data.frame(name = c("DEC1", "PSA", "DEC2", "AKT"), START = c("9494957", "39689186", "89435677", "78484829"), END = c("52521320", "114050940", "100952138", "78486308"), STRAND = c("+", "+", "+", "-"))
df2 <- data.frame(name = c("DEC1", "PSA", "DEC2", "AKT"), START = c("9494557", "37689186", "89435677", "79484829"), END = c("52521320", "114050940", "100952138", "78486308"), STRAND = c("+", "+", "+", "-"))

As far as I know, the only way to compare the two columns is to merge (join) the data frames first.

Using data.table:

require(data.table)
#> Loading required package: data.table
df1 <- setDT(data.frame(name = c("DEC1", "PSA", "DEC2", "AKT"), START = c(9494957, 39689186, 89435677, 78484829), END = c(52521320, 114050940, 100952138, 78486308), STRAND = c("+", "+", "+", "-")))
df2 <- setDT(data.frame(name = c("DEC1", "PSA", "DEC2", "AKT"), START = c(9494557, 37689186, 89435677, 79484829), END = c(52521320, 114050940, 100952138, 78486308), STRAND = c("+", "+", "+", "-")))
df3 <- df1[df2[,.(name,START2=START, END2 = END)], on='name']

# keep rows whose START and END are both within 500 bp of each other
df3[abs(START2 - START) <= 500 &
      abs(END2 - END) <= 500]
#>    name    START       END STRAND   START2      END2
#> 1: DEC1  9494957  52521320      +  9494557  52521320
#> 2: DEC2 89435677 100952138      + 89435677 100952138

Or, using dplyr:

library(dplyr)
df3 <- inner_join(df1, df2, suffix = c('1', '2'), by = 'name')
df3 %>% filter(abs(START2 - START1) <= 500, abs(END2 - END1) <= 500)

Created on 2022-04-30 by the reprex package (v2.0.1)
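
Since the question mentions both a 200 bp and a 500 bp window, it may help to make the tolerance a parameter. The following is only a minimal base R sketch, not part of the original answer: the helper name `common_within` and the data frame names `g1` and `g2` are hypothetical, and it assumes START and END are stored as numbers (not strings) and that each gene name appears once per data frame.

```r
# Hypothetical helper: join on `name`, then keep rows whose START and END
# both differ by at most `tol` bp between the two data frames.
common_within <- function(a, b, tol = 500) {
  m <- merge(a, b, by = "name", suffixes = c("1", "2"))
  keep <- abs(m$START1 - m$START2) <= tol & abs(m$END1 - m$END2) <= tol
  m[keep, , drop = FALSE]
}

# Same positions as in the question, stored as numeric values
g1 <- data.frame(name  = c("DEC1", "PSA", "DEC2", "AKT"),
                 START = c(9494957, 39689186, 89435677, 78484829),
                 END   = c(52521320, 114050940, 100952138, 78486308))
g2 <- data.frame(name  = c("DEC1", "PSA", "DEC2", "AKT"),
                 START = c(9494557, 37689186, 89435677, 79484829),
                 END   = c(52521320, 114050940, 100952138, 78486308))

res500 <- common_within(g1, g2, tol = 500)  # DEC1 and DEC2
res200 <- common_within(g1, g2, tol = 200)  # DEC2 only (DEC1 differs by 400 bp)
```

With a 500 bp window this keeps DEC1 and DEC2; tightening it to 200 bp drops DEC1, whose START differs by 400 bp.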