筛选 table 以仅保留非冗余组

Filter a table to keep only non-redundant groups

我在 R 中进行系统发育分析已有一段时间了,使用了 ape、phangorn 和 phytools 等库。

在解决问题时,我遇到了一个 presence/absence data.frame,它指定感兴趣的基因是否属于(或不属于)某个组。

例如:

             gene11    gene25  gene33  gene54  gene55 gene65 gene73   gene88
group_1         1        1        0      0       0     0      0       0      
group_2         1        1        1      0       0     0      0       0      
group_3         1        0        1      0       0     0      0       0      
group_4         0        1        1      0       0     0      0       0      
group_5         0        0        0      1       1     0      0       0      
group_6         0        0        0      1       0     0      0       0      
group_7         0        0        0      0       1     0      0       0      
group_8         0        0        0      0       0     1      1       1      
group_9         0        0        0      0       0     1      1       0      
group_10        0        0        0      0       0     1      0       1      
group_11        0        0        0      0       0     0      1       1  

正如预期的那样,在处理生物实体组时,这些实体有多种关联方式:基因 11、25 和 33 形成一个组,它们的关系也可以描述为更小的组,描述成对关系。

重要的是group_2, group_5group_8是生物学上相关的基因组,它们事先并不知道是相关组。 其他较小的组是由于这些相关组中显示的关系而出现的:group_1 与 gene11 和 gene25 相关,但是是嵌套在 broader 中的组(和相关)group_2。 其他情况同理:group_8表示gene65、gene73、gene88之间的关系;关于这些基因的其他组(group_9、group_10 和 group_11)只是描述作为更广泛组的一部分的基因之间存在的成对关系的子组 group_8 .

事先已知的是基因形成不相交组的簇,每个簇由其他(逐渐变小的)簇组成。我有兴趣捕获最大的不相交组。

问题的明确定义 由另一位用户 (@Shree) 完成:

Find minimum number of groups such that all other groups are a sub-group of at least one of those groups. Also a group has to have at least 2 genes i.e. two 1s in a row. Also assuming, 1,01,0 is a subgroup of 1,1,1,0 but 0,1,1,1 is not a subgroup of 1,1,1,0.

提前致谢!

这是一种使用混合整数规划方法的方法。我使用 ompr 进行数学建模,使用 glpk(免费开源)作为求解器。建模逻辑在代码中作为注释提供。

我认为这个问题可以用数学方法描述如下-

Filter dataframe to minimize number of rows such that sum of all columns is 1. Selected rows are called primary groups and every other row should be a subgroup of a primary group. A column (gene) can belong to only one primary group. Any unselected row is a subgroup of a primary group when subgroup <= primary group at all positions (columns). Therefore, (0,0,1,1) is subgroup of (0,1,1,1) but (1,0,1,1) is not a subgroup of (0,1,1,1).

library(dplyr)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)

gene_mat <- as.matrix(df)
nr <- nrow(gene_mat)
nc <- ncol(gene_mat)

model <- MIPModel() %>% 
  # binary variable x[i] is 1 if row i is selected else 0
  add_variable(x[i], i = 1:nr, type = "binary") %>% 
  # minimize total rows selected
  set_objective(sum_expr(x[i], i = 1:nr), "min") %>% 
  # sum of columns of selected rows must be = 1
  add_constraint(sum_expr(gene_mat[i,j]*x[i], i = 1:nr) == 1, j = 1:nc) %>% 
  solve_model(with_ROI(solver = "glpk"))

# get rows selected
group_rows <- model %>% 
  get_solution(x[i]) %>% 
  filter(value > 0) %>% 
  pull(i) %>% 
  print()

result <- df[group_rows, ]

        gene11 gene25 gene33 gene54 gene55 gene65 gene73 gene88
group_2      1      1      1      0      0      0      0      0
group_5      0      0      0      1      1      0      0      0
group_8      0      0      0      0      0      1      1      1

重要提示-

上述公式没有解决 subgroup <= primary group,而是依赖于 OP 提到的事实 "What is known beforehand is that genes form clusters of disjoint groups"。这意味着数据中不存在如下所示的情况,因为第 1、3、4 行不形成不相交的组,即第 3 列属于不允许的 2 个主要组。

1 1 0 0 0
0 1 0 0 0
1 0 1 0 0 <- this row is not a subgroup of any row
0 0 1 1 1

无论如何,这里的代码用于进行安全检查以确保所有未选中的行都是只有一个主要组的子组 -

test <- lapply(group_rows, function(x) {
  sweep(df, 2, as.numeric(df[x, ]), "<=") %>% 
    {which(rowSums(.) == ncol(df))}
})

# all is okay if below returns TRUE
length(Reduce(intersect, test)) == 0

数据-

df <- structure(list(
  gene11 = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,0L, 0L), 
  gene25 = c(1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
  gene33 = c(0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
  gene54 = c(0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L), 
  gene55 = c(0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L), 
  gene65 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L), 
  gene73 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L), 
  gene88 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L)), 
  class = "data.frame", 
  row.names = c("group_1", "group_2", "group_3", "group_4", 
                "group_5", "group_6", "group_7", "group_8",
                "group_9", "group_10", "group_11")
)

这是一个选项,我们 split 将数据分成数据列块,并使用 rep 创建的分组索引(此处,1 和 2 对应于前 3 列和接下来的 2 列),然后遍历 list,使用 filter_all 提取全 1 和 mutate 的行,方法是将 NA 替换为 0

library(dplyr)
library(purrr)
library(tibble)
split.default(df, rep(1:2, c(3, 2))) %>%
     map_dfr(~ .x %>%
                  rownames_to_column('rn') %>% 
                  filter_at(-1, all_vars(.==1))) %>% 
        mutate_all(replace_na, 0) %>%
        column_to_rownames('rn')
#.       gene1 gene2 gene3 gene4 gene5
#group_1     1     1     1     0     0
#group_7     0     0     0     1     1

数据

df <- structure(list(gene1 = c(1L, 0L, 1L, 1L, 0L, 0L, 0L), gene2 = c(1L, 
1L, 1L, 0L, 0L, 0L, 0L), gene3 = c(1L, 1L, 0L, 1L, 0L, 0L, 0L
), gene4 = c(0L, 0L, 0L, 0L, 1L, 0L, 1L), gene5 = c(0L, 0L, 0L, 
0L, 0L, 1L, 1L)), .Names = c("gene1", "gene2", "gene3", "gene4", 
"gene5"), class = "data.frame", row.names = c("group_1", "group_2", 
"group_3", "group_4", "group_5", "group_6", "group_7"))