How to conditionally count and record if a sample appears in rows of another dataset?

Question

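I have a dataset of genetic IDs (dataset1) and a dataset of interacting IDs (dataset2). I am trying to count the IDs in dataset1 that appear in either of the two interactor columns of dataset2, and to record the interacting/matching IDs in a third column.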

Dataset1:

ID
1
2
3

Dataset2:

Interactor1    Interactor2
1                  5
2                  3
1                  10

Output:

ID   InteractionCount   Interactors
1    2                  5, 10
2    1                  3
3    1                  2

So the output contains every ID from dataset1, a count of how often that ID appears in column 1 or column 2 of dataset2, and, where it does appear, the dataset2 IDs it interacts with.

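I have a biology background, so I have been guessing my way through this. So far I have managed to use merge() and setDT(mergeddata)[, .N, by=ID] to try to count the dataset1 IDs that appear in dataset2, but I am not sure whether this is the right approach for also creating the column that stores the interacting IDs. Any help with functions that could store the matching IDs in a third column would be much appreciated.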

Input data:

dput(dataset1)
structure(list(ID = 1:3), row.names = c(NA, -3L), class = c("data.table", 
"data.frame"))

dput(dataset2)
structure(list(Interactor1 = c(1L, 2L, 1L), Interactor2 = c(5L, 
3L, 10L)), row.names = c(NA, -3L), class = c("data.table", "data.frame"
))

Answer 1

Here is a solution based on the tidyverse package.

library(tidyverse)

d1 <- tibble(ID=1:3)
d2 <- tibble(Interactor1=c(1, 2, 1), Interactor2=c(5, 3, 10))

I think some of your difficulty comes from your data not being tidy. You can read about what that means on the tidyverse home page. Let's tidy up d2:

d2narrow <- d2 %>% gather(key="Where", value="ID", Interactor1, Interactor2)
d2narrow

This gives:

# A tibble: 6 x 2
  Where          ID
  <chr>       <dbl>
1 Interactor1     1
2 Interactor1     2
3 Interactor1     1
4 Interactor2     5
5 Interactor2     3
6 Interactor2    10
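
As a side note, current tidyr documents gather() as superseded by pivot_longer(). A minimal sketch of the same reshaping step with pivot_longer() follows, assuming tidyr 1.0 or later. The rows come out in a different order than with gather(), which does not affect the counts.

```r
library(dplyr)
library(tidyr)
library(tibble)

d2 <- tibble(Interactor1 = c(1, 2, 1), Interactor2 = c(5, 3, 10))

# Stack both interactor columns into a single ID column,
# keeping the source column name in "Where".
d2narrow <- d2 %>%
  pivot_longer(cols = c(Interactor1, Interactor2),
               names_to = "Where", values_to = "ID")

# One row per distinct ID with the number of times it occurs.
counts <- d2narrow %>% count(ID, name = "InteractionCount")
counts
```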

Now getting InteractionCount is easy:

counts <- d2narrow %>% group_by(ID) %>% summarise(InteractionCount=n())
counts

# A tibble: 5 x 2
     ID InteractionCount
  <dbl>            <int>
1     1                2
2     2                1
3     3                1
4     5                1
5    10                1

We can get the list of Interactor2 values for each value of Interactor1 by going back to the original d2:

interactors1 <- d2 %>% 
                  group_by(Interactor1) %>% 
                  summarise(With1=list(unique(Interactor2))) %>% 
                  rename(ID=Interactor1)
interactors1

# A tibble: 2 x 2
     ID With1    
  <dbl> <list>   
1     1 <dbl [2]>
2     2 <dbl [1]>

Things get a little more complicated if an ID can appear in both Interactor1 and Interactor2. (That doesn't happen in your example, but just in case...)

interactors2 <- d2 %>% group_by(Interactor2) %>% summarise(With2=list(unique(Interactor1))) %>% rename(ID=Interactor2)
interactors <- interactors1 %>% 
                 full_join(interactors2, by="ID") %>% 
                 unnest(cols=c(With1, With2)) %>% 
                 mutate(With=ifelse(is.na(With1), With2, With1)) %>% 
                 select(-With1, -With2)
interactors <- interactors %>% 
                 group_by(ID) %>% 
                 summarise(Interactors=list(unique(With)))
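
One thing to watch with the join above: the ifelse() keeps With1 whenever it is present, so for an ID that really does occur in both columns the partners found via Interactor2 may not all survive. A possible alternative, sketched here under the assumption that each row of d2 is a symmetric interaction, is to stack the pairs in both directions first and then summarise once. The name `pairs` is made up for this illustration.

```r
library(dplyr)
library(tibble)

d1 <- tibble(ID = c(1, 2, 3))
d2 <- tibble(Interactor1 = c(1, 2, 1), Interactor2 = c(5, 3, 10))

# Every interaction listed from both sides: (ID, partner) and (partner, ID).
pairs <- bind_rows(
  d2 %>% transmute(ID = Interactor1, With = Interactor2),
  d2 %>% transmute(ID = Interactor2, With = Interactor1)
)

# Keep only the IDs of interest, then count and collect partners per ID.
result <- pairs %>%
  filter(ID %in% d1$ID) %>%
  group_by(ID) %>%
  summarise(InteractionCount = n(),
            Interactors = toString(With))
result
```

With this layout an ID that appears in both columns simply contributes rows from both halves of `pairs`, so no special case is needed.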

Now you can put everything together, making sure you only get data for the IDs you want:

interactors <- d1 %>% left_join(counts, by="ID") %>% left_join(interactors, by="ID")
interactors

# A tibble: 3 x 3
     ID InteractionCount Interactors
  <dbl>            <int> <list>     
1     1                2 <dbl [2]>  
2     2                1 <dbl [1]>  
3     3                1 <dbl [1]>

This is the data in the format you asked for (one column containing the list of interactors for each ID). To prove it:

interactors$Interactors[1]

[[1]]
[1]  5 10

But I think you may find it easier to do more with the answer if it is in tidy form:

interactors %>% unnest(cols=c(Interactors))

# A tibble: 4 x 3
     ID InteractionCount Interactors
  <dbl>            <int>       <dbl>
1     1                2           5
2     1                2          10
3     2                1           3
4     3                1           2
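
If you would rather have the comma-separated text shown in the question than a list column, one way is to collapse each list element with toString(). This is only a sketch: the `interactors` tibble is rebuilt by hand here so that the snippet runs on its own, and `as_text` is a made-up name.

```r
library(dplyr)
library(tibble)

# Hand-built stand-in for the list-column result shown above.
interactors <- tibble(
  ID = c(1, 2, 3),
  InteractionCount = c(2L, 1L, 1L),
  Interactors = list(c(5, 10), 3, 2)
)

# Collapse each list element into a single "a, b" string.
as_text <- interactors %>%
  mutate(Interactors = vapply(Interactors, toString, character(1)))
as_text
```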

Answer 2

Here is an option using data.table:

x <- names(DT2)
cols <- c("InteractionCount", "Interactors")

#ensure that the pairs are ordered for each row and there are no duplicated pairs
DT2 <- setkeyv(unique(DT2[,(x) := .(pmin(i1, i2), pmax(i1, i2))]), x)

#for each ID find the neighbours linked to it
neighbours <- rbindlist(list(DT2[, .(.N, toString(i2)), i1],
    DT2[, .(.N, toString(i1)), i2]), use.names=FALSE)
setnames(neighbours, names(neighbours), c("ID", cols))

#update DT1 (dataset1) using the above data
DT1[, (cols) := neighbours[DT1, on=.(ID), mget(cols)]]

Output of DT1:

   ID InteractionCount Interactors
1:  1                2       5, 10
2:  2                1           3
3:  3                1           2

Data:

library(data.table)
DT1 <- structure(list(ID = 1:3), row.names = c(NA, -3L), class = c("data.table", "data.frame"))
DT2 <- structure(list(i1 = c(1L, 2L, 1L), i2 = c(5L, 3L, 10L)), row.names = c(NA, -3L), class = c("data.table", "data.frame"))
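
To see what the pmin()/pmax() step in this answer is for, here is a small sketch with a made-up table (`pairs` is not part of the answer) in which the same interaction is listed twice, once in each direction. Ordering each row makes the two copies identical, so unique() can drop one.

```r
library(data.table)

# The pair 1-5 is recorded twice, as (1, 5) and as (5, 1).
pairs <- data.table(i1 = c(1L, 5L, 2L), i2 = c(5L, 1L, 3L))

# Put the smaller ID first in every row; the right-hand side is
# evaluated on the original columns before either one is overwritten.
pairs[, c("i1", "i2") := .(pmin(i1, i2), pmax(i1, i2))]

# The two copies of 1-5 are now identical rows and collapse to one.
pairs <- unique(pairs)
pairs
```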

Answer 3

Another data.table answer.

library(data.table)
d1 <- data.table(ID=1:3)
d2 <- data.table(I1=c(1,2,1),I2=c(5,3,10))

# first stack I1 on I2 and vice versa
Output <- d2[,.(ID=c(I1,I2),x=c(I2,I1))]
Output
#    ID  x
# 1:  1  5
# 2:  2  3
# 3:  1 10
# 4:  5  1
# 5:  3  2
# 6: 10  1

# then collect the desired columns
# (match on the ID values themselves; d1[(ID)] would index d1 by row number)
Output <- Output[ID %in% d1$ID][
  ,.(InteractionCount=.N,
    Interactors = list(x)),
  by=ID]
Output
#    ID InteractionCount Interactors
# 1:  1                2        5,10
# 2:  2                1           3
# 3:  3                1           2
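
The question asks for all IDs from dataset1 in the output, and the filter above drops any ID that has no interactions at all. One way to keep such IDs with a count of 0 is a join with by = .EACHI, sketched below. The extra ID 4 and the names `stacked`, `partner` and `result` are made up for this illustration.

```r
library(data.table)

d1 <- data.table(ID = c(1, 2, 3, 4))   # ID 4 has no interactions
d2 <- data.table(I1 = c(1, 2, 1), I2 = c(5, 3, 10))

# Stack the pairs in both directions, as in the answer above.
stacked <- d2[, .(ID = c(I1, I2), partner = c(I2, I1))]

# For each row of d1, summarise the matching rows of stacked.
# An unmatched ID yields a single NA partner, which is excluded here.
result <- stacked[d1, on = "ID",
                  .(InteractionCount = sum(!is.na(partner)),
                    Interactors = toString(partner[!is.na(partner)])),
                  by = .EACHI]
result
```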

Edit: if the IDs are not numeric, you can set a key on d1:

library(data.table)
d1 <- data.table(ID=c("1","2","3A"))
setkey(d1,ID)
d2 <- data.table(I1=c("1","2","1"),I2=c("5","3A","10"))

Output <- d2[,.(ID=c(I1,I2),x=c(I2,I1))]
Output
#    ID  x
# 1:  1  5
# 2:  2 3A
# 3:  1 10
# 4:  5  1
# 5: 3A  2
# 6: 10  1

Output <- Output[ID %in% unlist(d1[(ID)])][
  ,.(InteractionCount=.N,
    Interactors = list(x)),
  by=ID]
Output
#    ID InteractionCount Interactors
# 1:  1                2        5,10
# 2:  2                1          3A
# 3: 3A                1           2

Tags: r, bioinformatics, count, data.table