Replace cells with a matching ID in an R data frame

I have data in the following form:

Input_SNP       Set_1     Set_2     Set_3     Set_4     Set_5    Set_6     Set_7
rs70812    4:12309   7:189029   2:2134   17:43232  12:51123  11:15123  19:4312
rs34812    5:61233   2:571022  1:57012   3:537012  14:57123  4:57129   1:61507
rs15602    1:571209  12:34120  9:41236   12:32417  3:57120   9:34123   3:41235
rs90143    7:83541   9:659123  5:23412   16:98234  18:472351 20:12357  1:13421
rs70823    14:89023  13:42081  8:32098   5:431332  9:234134  13:7831   2:74012
rs100980   11:51003  1:100098  10:409123 12:412309 13:34123  16:431098 3:58023
rs10341    18:90312  15:609123 1:70923   2:102358  5:019824  17:120394 9:80123

I actually have 10,000 sets and about 4,000 rows, but this is a good example. I also have a second file:

set snpID     rsMatch
1   4:12309   rs241984
2   7:189029  rs104141
3   2:2134    rs485506
4   17:43232  rs345180
5   12:51123  rs129819
6   11:15123  rs757492
7   19:4312   rs711403
1   5:61233   rs341098
2   2:571022  rs512309
3   1:57012   rs120394
4   3:537012  rs510293
5   14:571234 rs234098
6   4:57129   rs71302
7   1:61507   rs234109
1   1:571209  rs09384
... ...       ...

I would like to replace the numeric format in Set_1, Set_2, Set_3, etc. with the corresponding rsMatch format, like this:

    Input_SNP  Set_1     Set_2     Set_3     Set_4     Set_5     Set_6     Set_7
    rs70812    rs241984  rs104141  rs485506  rs345180  rs129819  rs757492  rs711403
    rs34812    rs341098  rs512309  rs120394  rs510293  rs234098  rs71302   rs234109
    rs15602    rs09384   ...       ...       ...       ...       ...
    ...        ...       ...       ...       ...       ...       ...

Do you have any suggestions on how to do this? I was thinking of R data frames, but I am open to anything...

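You should make a copy, but I live dangerously and am working on the original. First, we need to match the values in the Set_n columns against the second input data frame: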

 sapply(inp1[-1], match, inp2$snpID)
     Set_1 Set_2 Set_3 Set_4 Set_5 Set_6 Set_7
[1,]     1     2     3     4     5     6     7
[2,]     8     9    10    11    NA    13    14
[3,]    15    NA    NA    NA    NA    NA    NA
[4,]    NA    NA    NA    NA    NA    NA    NA
[5,]    NA    NA    NA    NA    NA    NA    NA
[6,]    NA    NA    NA    NA    NA    NA    NA
[7,]    NA    NA    NA    NA    NA    NA    NA

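You didn't give us all the values needed, but the NAs are needed as placeholders. These values are the index positions in the second data frame. Note that it is transposed (which is easy to fix with t()).

The next step is to replace the items with the looked-up values from the rsMatch column: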

inp1[-1][] <- inp2$rsMatch[ t(sapply(inp1[-1], match, inp2$snpID)) ]
#----------------
> inp1
  Input_SNP    Set_1    Set_2   Set_3 Set_4 Set_5 Set_6 Set_7
1   rs70812 rs241984 rs341098 rs09384  <NA>  <NA>  <NA>  <NA>
2   rs34812 rs104141 rs512309    <NA>  <NA>  <NA>  <NA>  <NA>
3   rs15602 rs485506 rs120394    <NA>  <NA>  <NA>  <NA>  <NA>
4   rs90143 rs345180 rs510293    <NA>  <NA>  <NA>  <NA>  <NA>
5   rs70823 rs129819     <NA>    <NA>  <NA>  <NA>  <NA>  <NA>
6  rs100980 rs757492  rs71302    <NA>  <NA>  <NA>  <NA>  <NA>
7   rs10341 rs711403 rs234109    <NA>  <NA>  <NA>  <NA>  <NA>
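
In the output above, the looked-up values run down each column instead of across each row: the index matrix is transposed and then used in a column-wise fill. Below is a minimal sketch that skips the transpose, run on a cut-down copy of the example data. The names `wide`, `long` and `fixed` are illustrative and not from the question, and the sketch assumes each snpID maps to a single rsMatch regardless of set.

```r
# Cut-down copies of the two inputs from the question (two rows, three sets).
wide <- data.frame(
  Input_SNP = c("rs70812", "rs34812"),
  Set_1 = c("4:12309", "5:61233"),
  Set_2 = c("7:189029", "2:571022"),
  Set_3 = c("2:2134", "1:57012"),
  stringsAsFactors = FALSE
)
long <- data.frame(
  set = c(1, 2, 3, 1, 2, 3),
  snpID = c("4:12309", "7:189029", "2:2134", "5:61233", "2:571022", "1:57012"),
  rsMatch = c("rs241984", "rs104141", "rs485506", "rs341098", "rs512309", "rs120394"),
  stringsAsFactors = FALSE
)

fixed <- wide
# For each Set_* column, find the position of every value in long$snpID and
# take the rsMatch at that position; values with no match become NA.
fixed[-1] <- lapply(wide[-1], function(col) long$rsMatch[match(col, long$snpID)])
print(fixed)
```

Working column by column keeps each result aligned with its own rows, so no transpose is involved.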

Second try: the index could be 'cbind( 1.1+(.9:nrow(inp2))%/%7, inp2$set+1)', which did succeed, but the rep(.) method shown below is more reliable.

out1 <- inp1
out1[ cbind( rep(1:(nrow(inp2)), length=nrow(inp2), each=7), inp2$set+1) ] <- inp2$rsMatch

> out1
  Input_SNP    Set_1     Set_2     Set_3     Set_4     Set_5     Set_6    Set_7
1   rs70812 rs241984  rs104141  rs485506  rs345180  rs129819  rs757492 rs711403
2   rs34812 rs341098  rs512309  rs120394  rs510293  rs234098   rs71302 rs234109
3   rs15602  rs09384  12:34120   9:41236  12:32417   3:57120   9:34123  3:41235
4   rs90143  7:83541  9:659123   5:23412  16:98234 18:472351  20:12357  1:13421
5   rs70823 14:89023  13:42081   8:32098  5:431332  9:234134   13:7831  2:74012
6  rs100980 11:51003  1:100098 10:409123 12:412309  13:34123 16:431098  3:58023
7   rs10341 18:90312 15:609123   1:70923  2:102358  5:019824 17:120394  9:80123
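
The assignment above relies on two-column matrix indexing: each row of the index matrix is a (row, column) pair, and the values are written to those cells in order. Here is a small sketch of the same idea on a cut-down copy of the data. The names `wide`, `long`, `row_idx` and `col_idx` are illustrative, and the approach assumes the long file is ordered row by row with every set present, since it never looks at snpID.

```r
wide <- data.frame(
  Input_SNP = c("rs70812", "rs34812"),
  Set_1 = c("4:12309", "5:61233"),
  Set_2 = c("7:189029", "2:571022"),
  Set_3 = c("2:2134", "1:57012"),
  stringsAsFactors = FALSE
)
long <- data.frame(
  set = c(1, 2, 3, 1, 2, 3),
  snpID = c("4:12309", "7:189029", "2:2134", "5:61233", "2:571022", "1:57012"),
  rsMatch = c("rs241984", "rs104141", "rs485506", "rs341098", "rs512309", "rs120394"),
  stringsAsFactors = FALSE
)

n_sets <- ncol(wide) - 1
# Row of `wide` that each line of `long` belongs to: 1,1,1,2,2,2
row_idx <- rep(seq_len(nrow(wide)), each = n_sets)[seq_len(nrow(long))]
# Column of `wide` for each line of `long`; +1 skips the Input_SNP column
col_idx <- long$set + 1

out <- wide
out[cbind(row_idx, col_idx)] <- long$rsMatch
print(out)
```

Because the position alone decides where a value lands, a missing or reordered line in the long file would silently shift results, which is the main trade-off against a match()-based lookup.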

It seems to me that the request does not actually use the Input_SNP values in the matching.

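Forgive me in advance, but I see an Excel and an SQL solution here, since you are relating two different datasets (i.e., database tables or worksheets). Either solution can still be integrated as data preparation before importing into R. This may be more useful to future readers than to the OP.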

Excel Solution

A simple VLOOKUP or INDEX/MATCH works (see the two examples below, which use worksheets named RsmatchWide and RsmatchLong). IFERROR() is used to suppress #N/A.

=IFERROR(INDEX(RsmatchLong!$C:$C, 
         MATCH(RsmatchWide!B2,RsmatchLong!$B:$B, FALSE)), "")

=IFERROR(VLOOKUP(RsmatchWide!B2,RsmatchLong!$B:$C,2,FALSE),"")

Once ready, save the worksheet as a csv and import it into R:

df <- read.csv("C:/Path/To/RsMatchDataset.csv")

SQL Solution

Run a select query with a separate subquery for each set (the example below uses MS Access, but it should work in any SQL dialect, including SQLite, MySQL, SQL Server, etc.):

SELECT rFinal.Input_SNP,

  (SELECT RsmatchLong.rsMatch
   FROM RsmatchLong INNER JOIN RsmatchWide r1 ON RsmatchLong.snpID = r1.Set_1
   WHERE r1.Input_SNP = rFinal.Input_SNP) As Set_1,

  (SELECT RsmatchLong.rsMatch
   FROM RsmatchLong INNER JOIN RsmatchWide r2 ON RsmatchLong.snpID = r2.Set_2
   WHERE r2.Input_SNP = rFinal.Input_SNP) As Set_2,

  (SELECT RsmatchLong.rsMatch
   FROM RsmatchLong INNER JOIN RsmatchWide r3 ON RsmatchLong.snpID = r3.Set_3
   WHERE r3.Input_SNP = rFinal.Input_SNP) As Set_3,

  (SELECT RsmatchLong.rsMatch
   FROM RsmatchLong INNER JOIN RsmatchWide r4 ON RsmatchLong.snpID = r4.Set_4
   WHERE r4.Input_SNP = rFinal.Input_SNP) As Set_4,

  (SELECT RsmatchLong.rsMatch
   FROM RsmatchLong INNER JOIN RsmatchWide r5 ON RsmatchLong.snpID = r5.Set_5
   WHERE r5.Input_SNP = rFinal.Input_SNP) As Set_5,

  (SELECT RsmatchLong.rsMatch
   FROM RsmatchLong INNER JOIN RsmatchWide r6 ON RsmatchLong.snpID = r6.Set_6
   WHERE r6.Input_SNP = rFinal.Input_SNP) As Set_6,

  (SELECT RsmatchLong.rsMatch
   FROM RsmatchLong INNER JOIN RsmatchWide r7 ON RsmatchLong.snpID = r7.Set_7
   WHERE r7.Input_SNP = rFinal.Input_SNP) As Set_7

FROM RsMatchWide rFinal

R can even create the underlying tables and then run the query using RODBC:

library(RODBC) 

conn <- odbcDriverConnect(paste0('driver={Microsoft Access Driver (*.mdb, *.accdb)};',
                                 'DBQ=C:\\PathTo\\Database.accdb'))  # backslashes must be escaped in R strings

# SAVING DATA FRAMES AS NEW DB TABLES
sqlSave(conn, RsMatchWide, append=FALSE, rownames=TRUE)
sqlSave(conn, RsMatchLong, append=FALSE, rownames=TRUE)

# CREATING DATA FRAME FROM QUERY, 
# QUERY STRING, strSQL, WILL BE SQL SELECT STATEMENT ABOVE
newdf <- sqlQuery(conn, strSQL)

close(conn) 
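
Typing one subquery per set by hand does not scale well, so the query string can be assembled in R instead. The sketch below only builds the text of the SELECT statement shown earlier. `build_query` is a hypothetical helper, not part of RODBC, and the table and column names are taken from the example.

```r
# Hypothetical helper: builds the SELECT statement with one subquery per set.
build_query <- function(n_sets) {
  subqueries <- sprintf(
    paste0(
      "(SELECT RsmatchLong.rsMatch ",
      "FROM RsmatchLong INNER JOIN RsmatchWide r%1$d ",
      "ON RsmatchLong.snpID = r%1$d.Set_%1$d ",
      "WHERE r%1$d.Input_SNP = rFinal.Input_SNP) As Set_%1$d"
    ),
    seq_len(n_sets)  # sprintf is vectorised, giving one string per set
  )
  paste0(
    "SELECT rFinal.Input_SNP, ",
    paste(subqueries, collapse = ", "),
    " FROM RsmatchWide rFinal"
  )
}

strSQL <- build_query(7)
cat(strSQL, "\n")
```

The resulting `strSQL` could then be passed to sqlQuery() as in the code above. Whether a database accepts a statement with thousands of subqueries is a separate question.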

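The only challenge I foresee with the above is scaling it to your 10,000 sets. Excel has column limits, as do the various SQL databases. Consider splitting and merging in R.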

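With suitable reshaping, merge can solve this. I am using library(reshape2) to get the data into the right shape for the merge and to cast the result back into the output format.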

#read in files
df1<-read.table("file1",header=TRUE,stringsAsFactors=FALSE)   
df2<-read.table("file2",header=TRUE,stringsAsFactors=FALSE)

library(reshape2)
m1<-melt(df1,id.vars="Input_SNP")
m2<-transform(df2,variable=paste0("Set_",set),value=snpID)
m<-merge(m1,m2)
out<-dcast(m,Input_SNP~variable,value.var="rsMatch")

print(out)

  Input_SNP    Set_1    Set_2    Set_3    Set_4    Set_5    Set_6    Set_7
1   rs15602  rs09384     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>
2   rs34812 rs341098 rs512309 rs120394 rs510293     <NA>  rs71302 rs234109
3   rs70812 rs241984 rs104141 rs485506 rs345180 rs129819 rs757492 rs711403
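
The output above has only three rows because merge() performs an inner join by default, so any Input_SNP with no matching snpID drops out entirely. Here is a small sketch of the difference, using made-up toy frames `left` and `right` rather than the question's data.

```r
# Toy data, purely illustrative: id "c" has no counterpart in `right`.
left <- data.frame(
  id = c("a", "b", "c"),
  value = c("1:1", "2:2", "3:3"),
  stringsAsFactors = FALSE
)
right <- data.frame(
  value = c("1:1", "2:2"),
  rsMatch = c("rs1", "rs2"),
  stringsAsFactors = FALSE
)

inner <- merge(left, right)                # default: unmatched rows are dropped
outer <- merge(left, right, all.x = TRUE)  # keeps every row of `left`, NA where unmatched

print(inner)
print(outer)
```

Passing all.x = TRUE in the merge step would be one way to keep every Input_SNP in the final table, at the cost of carrying NA rows through the cast.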

Using data.table v1.9.5 (see the package's installation instructions):

require(data.table) # v1.9.5+
setDT(dt)
setDT(key)
ids  = seq_len(7L) # or 10000L in your case
cols = paste("Set", ids, sep="_")
on   = "snpID"
for (i in ids) {
    names(on) = cols[i]
    dt[key[set == i], cols[i] := rsMatch, on = on]
}
dt[]

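The key[set == i] subset should be very fast, as it uses binary search on the set column through auto indexing. For each subset corresponding to i, we join snpID from the subset data.table to dt on the corresponding Set* column, and update the corresponding column by reference (cols[i] := rsMatch) with column rsMatch.

This should be both fast and memory efficient.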
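
For readers without data.table, the same per-set logic can be sketched in base R. This is an analog, not the data.table code: it copies the data rather than updating by reference, and the names `wide`, `long` and `updated` are illustrative. Like the update join, it leaves a cell unchanged when no snpID in that set matches.

```r
wide <- data.frame(
  Input_SNP = c("rs70812", "rs34812"),
  Set_1 = c("4:12309", "5:61233"),
  Set_2 = c("7:189029", "2:571022"),
  stringsAsFactors = FALSE
)
# The long file here deliberately lacks an entry for 2:571022 in set 2.
long <- data.frame(
  set = c(1, 2, 1),
  snpID = c("4:12309", "7:189029", "5:61233"),
  rsMatch = c("rs241984", "rs104141", "rs341098"),
  stringsAsFactors = FALSE
)

updated <- wide
for (i in 1:2) {
  col <- paste0("Set_", i)
  sub <- long[long$set == i, ]              # only the lookup rows for this set
  hit <- match(updated[[col]], sub$snpID)   # position of each cell's snpID, NA if absent
  found <- !is.na(hit)
  updated[[col]][found] <- sub$rsMatch[hit[found]]  # overwrite matched cells only
}
print(updated)
```

Restricting the lookup to one set at a time matters if the same snpID can appear in more than one set with a different rsMatch.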