How to group rows based on connecting or duplicate information?

Question

I have a genetic dataset of genome positions, and I would like to group the rows/genome positions in this dataset based on connecting or duplicate information. What I mean is:

If I have a dataset of points A, B, C, etc.:

Point Connections
A       A, B
B       B, C
C       C, B
D       D, E, F, G

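I want to group all rows that are connected to each other (whether directly or indirectly) by giving them a matching group number column. For example, this dataset would be grouped as: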

Point Connections     Group
A       A, B            1
B       B, C            1
C       C, B            1 
D       D, E, F, G      2

#A B and C are all connected to each other so are in the same group, even if A and C are 
#not directly connected in the Connections column
#D is the first row seen that is unrelated so is put in a separate group which would also
#include D's connecting letters and any connectors of those letters

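A sample of my actual dataset is chromosome positions (CP), where the first number is the chromosome and the second number (after the :) is the genomic position on that chromosome, so it looks like this (the real data has ~3000 rows):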

CP        linked_CPS
1:100    1:100, 1:203
1:102    1:102
1:203    1:100, 1:203, 1:400
1:400    1:400
2:400    2:400, 2:401
2:401    2:401, 2:402

Expected output, with connected rows grouped:

CP        linked_CPS          Group
1:100    1:100, 1:203           1
1:203    1:100, 1:203, 1:400    1
1:400    1:400                  1
1:102    1:102                  2
2:400    2:400, 2:401           3
2:401    2:401, 2:402           3

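Note that positions on different chromosomes (the number before the colon in CP, e.g. 1: or 2:) cannot be in the same group even if the position number is the same. For example, 1:400 and 2:400 would not be in the same group, because they are on chromosomes 1 and 2.

Also for context, my end goal is to take the minimum and maximum position of each group to set the region distance of each group in the genome.

I have looked at other questions about grouping on matching/duplicate information, but I am not sure how to apply them to this problem. I also come from a biology background, so I am not sure which packages/functions are best. Any help would be appreciated.

Input data: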

structure(list(CP = c("1:100", "1:102", "1:203", "1:400", "2:400", 
"2:401"), linked_CPS = c("1:100, 1:203", "1:102", "1:100, 1:203, 1:400", 
"1:400", "2:400, 2:401", "2:401, 2:402")), row.names = c(NA, 
-6L), class = c("data.table", "data.frame"))

Answer 1

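If I understand your question correctly, you are looking for the connected components of a graph: each position is a node, each entry in linked_CPS is an edge, and a group is every node that can be reached from another through a chain of edges.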
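To make the idea concrete, here is a minimal base R sketch of connected components using a simple union-find, run on the sample data from the question. The helper name `find_root` and the `Group` numbering (order of first appearance) are my own choices for illustration, not part of any package. For ~3000 rows this naive version should be fast enough, but a graph library is the more robust option.

```r
# Sample data from the question
df <- data.frame(
  CP = c("1:100", "1:102", "1:203", "1:400", "2:400", "2:401"),
  linked_CPS = c("1:100, 1:203", "1:102", "1:100, 1:203, 1:400",
                 "1:400", "2:400, 2:401", "2:401, 2:402"),
  stringsAsFactors = FALSE
)

# One edge per (CP, linked position) pair
links <- strsplit(df$linked_CPS, ", ", fixed = TRUE)
edges <- data.frame(from = rep(df$CP, lengths(links)),
                    to = unlist(links),
                    stringsAsFactors = FALSE)

# Union-find: every node starts as its own parent
nodes <- unique(c(edges$from, edges$to))
parent <- setNames(nodes, nodes)

# Hypothetical helper: follow parent links until a node points to itself
find_root <- function(x) {
  while (parent[[x]] != x) x <- parent[[x]]
  x
}

for (i in seq_len(nrow(edges))) {
  a <- find_root(edges$from[i])
  b <- find_root(edges$to[i])
  if (a != b) parent[[b]] <- a  # merge the two components
}

# Number the components in order of first appearance
roots <- vapply(nodes, find_root, character(1))
group <- setNames(match(roots, unique(roots)), nodes)
df$Group <- unname(group[df$CP])
df
```

Because the chromosome is part of each label, 1:400 and 2:400 are different nodes and can only end up in the same group if some row explicitly links them.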

The code below converts your data.frame into a graph and finds those components.

library(tidyverse)
library(tidygraph)

df <- structure(list(CP = c("1:100", "1:102", "1:203", "1:400", "2:400", 
                      "2:401"), linked_CPS = c("1:100, 1:203", "1:102", "1:100, 1:203, 1:400", 
                                               "1:400", "2:400, 2:401", "2:401, 2:402")), row.names = c(NA, 
                                                                                                        -6L), class = c("data.table", "data.frame"))

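df %>% 
  # one row per (CP, linked position) pair, i.e. one edge per row
  separate_rows(linked_CPS, sep = ", ") %>% 
  # the two columns are read as the two ends of each edge
  as_tbl_graph() %>% 
  activate(nodes) %>% 
  # label each node with the connected component it belongs to
  mutate(group = group_components()) %>% 
  as_tibble()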

This gives:

# A tibble: 7 x 2
  name  group
  <chr> <int>
1 1:100     1
2 1:102     3
3 1:203     1
4 1:400     1
5 2:400     2
6 2:401     2
7 2:402     2
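For the end goal mentioned in the question (minimum and maximum position per group), a rough base R sketch could start from a node-to-group table like the one printed above. The `groups` table here is copied by hand from that output so the example runs on its own; in practice you would use the result of the pipeline. The column names `chr`, `pos`, `start` and `end` are my own, and counting linked-only positions such as 2:402 towards the region is an assumption you may want to change.

```r
# Node-to-group table, copied from the output above
groups <- data.frame(
  name  = c("1:100", "1:102", "1:203", "1:400", "2:400", "2:401", "2:402"),
  group = c(1L, 3L, 1L, 1L, 2L, 2L, 2L),
  stringsAsFactors = FALSE
)

# Attach the group to the original rows by matching on CP
cp <- c("1:100", "1:102", "1:203", "1:400", "2:400", "2:401")
row_group <- groups$group[match(cp, groups$name)]

# Split "chromosome:position" into two columns
parts <- strsplit(groups$name, ":", fixed = TRUE)
groups$chr <- vapply(parts, `[`, character(1), 1)
groups$pos <- as.integer(vapply(parts, `[`, character(1), 2))

# One row per group: chromosome, minimum and maximum position
regions <- do.call(rbind, lapply(split(groups, groups$group), function(g) {
  data.frame(group = g$group[1], chr = g$chr[1],
             start = min(g$pos), end = max(g$pos),
             stringsAsFactors = FALSE)
}))
regions
```

Each group sits on a single chromosome here, so taking the first `chr` per group is safe for this data; it is worth checking that assumption on the full dataset.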

sorting

r

bioinformatics

data.table