Common elements in data frames

Question

I have three data frames, each with a lot of information and the following row names:

ENSG00000000971 ENSG00000000971 ENSG00000000971
ENSG00000004139 ENSG00000004139 ENSG00000003987
ENSG00000005001 ENSG00000004848 ENSG00000004848
ENSG00000005102 ENSG00000002330 ENSG00000002330
ENSG00000005486 ENSG00000005102 ENSG00000006047
...             ...             ...

What I want is to find every entry (row name) that occurs in at least two of the data frames. That is, the end result should be a list like this:

ENSG00000000971
ENSG00000004139
ENSG00000004848
ENSG00000005102
ENSG00000002330

How can I do this? I tried the following:

shared.DESeq2.edgeR = data.frame(row.names(res.DESeq2) %in% row.names(res.edgeR))
shared.DESeq2.limma = data.frame(row.names(res.DESeq2) %in% row.names(res.limma))
shared.edgeR.limma = data.frame(row.names(res.edgeR) %in% row.names(res.limma))
shared = merge(merge(shared.DESeq2.edgeR, shared.DESeq2.limma), shared.edgeR.limma)

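...where the three res.[DESeq2/edgeR/limma] objects are the data frames, but this takes a very long time to run (I did not even let it finish, so I do not know whether it actually works). I already have code that does this for elements common to all three data frames, but I am also interested in elements shared by two or more of them, and I cannot find a good way to do that. Any ideas?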

Answer 1

Try this example:

#dummy data, with real data we would do: res.DESeq2_rn <-row.names(res.DESeq2)
res.DESeq2_rn <- letters[1:4]
res.edgeR_rn <- letters[3:8]
res.limma_rn <- letters[c(1,3,8,10)]

#get counts
res <- table(c(res.DESeq2_rn, res.edgeR_rn, res.limma_rn))
res
# a b c d e f g h j 
# 2 1 3 2 1 1 1 2 1 

#result
names(res)[ res>=2 ]
#[1] "a" "c" "d" "h"
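
The same counting idea can be applied to the data frames directly. The following is a minimal sketch, not part of the original answer: df1, df2 and df3 are hypothetical stand-ins for res.DESeq2, res.edgeR and res.limma, and shared_rownames() is an illustrative helper name. It relies on row names being unique within a data frame, so each frame contributes at most one count per name.

```r
# Hypothetical stand-ins for res.DESeq2, res.edgeR and res.limma
df1 <- data.frame(x = 1:4, row.names = letters[1:4])
df2 <- data.frame(x = 1:6, row.names = letters[3:8])
df3 <- data.frame(x = 1:4, row.names = letters[c(1, 3, 8, 10)])

# Illustrative helper: row names found in at least `min_count` of the data frames
shared_rownames <- function(dfs, min_count = 2) {
  # Pool all row names into one vector and count how often each occurs
  counts <- table(unlist(lapply(dfs, row.names)))
  names(counts)[counts >= min_count]
}

in_two_or_more <- shared_rownames(list(df1, df2, df3))
in_all_three   <- shared_rownames(list(df1, df2, df3), min_count = 3)

print(in_two_or_more)  # "a" "c" "d" "h"
print(in_all_three)    # "c"
```

Raising min_count to the number of data frames gives the "common to all" case the question already covers.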

Edit: benchmarking shows that @vaettchen's solution is the winner!

library(microbenchmark)
library(ggplot2)
# create a large random character vector (this takes a lot of time!)
set.seed(123)
myNames <- sapply(1:1000000,
                  function(i)paste( sample( letters, 8, replace = TRUE ), collapse = "" ))
A <- sample(myNames,1000)
B <- sample(myNames,2000)
C <- sample(myNames,3000)

#benchmarking 3 options
myBench <- microbenchmark(
  Which={
    res <- c(A,B,C)
    out1 <- unique( res[ which( duplicated( res ) ) ] ) },
  Table={ 
    res <- c(A,B,C)
    y <- table( res )
    out2 <- names( y )[ y >= 2 ] },
  Intersect={ 
    out3 <- 
      unique(
        c(intersect(A,B),
          intersect(A,C),
          intersect(B,C)))},
  times=1000)

print(myBench)
qplot(y=time, data=myBench, colour=expr) + scale_y_log10()

Unit: microseconds
      expr       min         lq       mean     median         uq       max neval cld
     Which   266.837   280.4190   527.8266   288.2680   301.2475  59255.34  1000  a 
     Table 32167.286 32739.5945 34851.2260 33072.0825 33524.2550 108176.22  1000   b
 Intersect   450.965   472.3965   667.3316   484.7725   499.8650  60266.54  1000  a
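
Since the benchmark only compares speed, a quick check that the three options return the same set may be useful. This is a sketch on the toy vectors rather than on the benchmark data; the names out_which, out_table, out_intersect, A_dup and B_one are introduced here for illustration only.

```r
# Toy vectors from the example above
A <- letters[1:4]
B <- letters[3:8]
C <- letters[c(1, 3, 8, 10)]

all_names <- c(A, B, C)

# The three options from the benchmark
out_which     <- unique(all_names[which(duplicated(all_names))])
counts        <- table(all_names)
out_table     <- names(counts)[counts >= 2]
out_intersect <- unique(c(intersect(A, B), intersect(A, C), intersect(B, C)))

# The methods return different orderings, so compare as sets
print(setequal(out_which, out_table))      # TRUE
print(setequal(out_which, out_intersect))  # TRUE

# Caveat: a value repeated inside ONE vector is reported by the
# duplicated() and table() options, but not by intersect()
A_dup <- c("a", "a", "b")
B_one <- "z"
pooled <- c(A_dup, B_one)
dup_result <- unique(pooled[duplicated(pooled)])
int_result <- intersect(A_dup, B_one)
print(dup_result)  # "a"
print(int_result)  # character(0)
```

Row names of a data frame are unique, so the caveat does not arise for the original question, but it matters if the inputs are arbitrary character vectors.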

Answer 2

Another approach, using @zx8754's example data:

# dummy data
res.DESeq2 <- letters[ 1:4 ]
res.edgeR <- letters[ 3:8 ]
res.limma <- letters[ c( 1, 3, 8, 10 ) ]

# combine into one vector                  
res <- c( res.DESeq2, res.edgeR, res.limma )
res
[1] "a" "b" "c" "d" "c" "d" "e" "f" "g" "h" "a" "c" "h" "j"

# result
unique( res[ which( duplicated( res ) ) ] )
[1] "c" "d" "a" "h"

Edit

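@zx8754's answer was accepted and, for various reasons, it is clean and elegant. Purely out of curiosity, I looked at the performance difference between his approach and mine on a large sample, and found it interesting enough to post: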

# create a large random character vector (this takes a lot of time!)
res <- rep( "x", 1000000 )
for( i in 1:1000000) 
    res[ i ] <- paste( sample( letters, 8, replace = TRUE ), collapse = "" )
head( res )
[1] "vsvkljgr" "ulxhqnas" "upqqtrdk" "pynuaihp" "srjtnvqm" "mxnlytvd"

# vaettchen:
system.time( x <- unique( res[ which( duplicated( res ) ) ] ) )
 user  system elapsed 
0.173   0.000   0.171 
x
[1] "zlzlwinb" "wielycpx"

# zx8754
system.time( { y <- table( res ); z <- names( y )[ y >= 2 ] } )
  user  system elapsed
18.945   0.020  19.058 
z
[1] "wielycpx" "zlzlwinb"

For sufficiently large data or repeated calls, the difference can matter. A brief explanation of what my code does:

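duplicated( res ) creates a logical vector of the same length as res, with TRUE for every element that has already appeared earlier in the vector and FALSE otherwise
which( ... ) converts it into a vector of the indices where the value is TRUE
res[ ... ] extracts the actual character values of res at those index positions
unique( ... ) reduces each character value to a single occurrence, which is the answer @Sajber is looking for (as I understand it)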
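
The steps above can be traced one at a time on the dummy data. This is only an illustrative breakdown; the intermediate names is_dup, idx and repeats are introduced here and do not appear in the original code.

```r
# Combined dummy vector from the example above
res <- c("a", "b", "c", "d", "c", "d", "e", "f", "g", "h", "a", "c", "h", "j")

# Step 1: TRUE for the second and later occurrences of a value
is_dup <- duplicated(res)

# Step 2: positions of the TRUE entries
idx <- which(is_dup)
print(idx)      # 5 6 11 12 13

# Step 3: the values at those positions ("c" shows up twice, as it occurs three times)
repeats <- res[idx]
print(repeats)  # "c" "d" "a" "c" "h"

# Step 4: keep each value once
shared <- unique(repeats)
print(shared)   # "c" "d" "a" "h"

# duplicated() never returns NA, so indexing with the logical vector
# directly gives the same values as going through which()
print(identical(res[is_dup], res[idx]))  # TRUE
```

The result comes back in order of first repeat, not alphabetically; wrap it in sort() if a sorted list is needed.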
