How to apply a function to grouped rows between 2 dataframes?

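I have 2 data frames of genetic data, and I want to run a hypergeometric test (using the GeneOverlap package) for every phenotype across the two datasets. I am trying to automate this and store the result for each phenotype in a new data frame, but I am stuck on automating the test over all phenotypes in both data frames.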

My datasets look like this:

Dataset 1:

Gene      Gene_count   Phenotype
Gene1          5       Phenotype1
Gene1          5       Phenotype2
Gene2          3       Phenotype1
Gene3         16       Phenotype6
Gene3.        16       Phenotype2
Gene3         16       Phenotype1

Dataset 2:

Gene    Gene_count     Phenotype
Gene1         10       Phenotype1
Gene1         10       Phenotype2
Gene4         4        Phenotype1
Gene2         17       Phenotype6
Gene6         3        Phenotype2
Gene7         2        Phenotype1

Currently I run the hypergeometric test for one phenotype at a time, like this:

dataset1_pheno1 <- dataset1  %>%
  filter(str_detect(Phenotype, 'Phenotype1'))

dataset2_pheno1 <- dataset2  %>%
  filter(str_detect(Phenotype, 'Phenotype1'))

go.obj <- newGeneOverlap(dataset1_pheno1$Gene, 
                         dataset2_pheno1$Gene,
                         genome.size=1871)
go.obj <- testGeneOverlap(go.obj)
go.obj 

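I want to repeat this for every phenotype in the 2 datasets. So far I have tried dplyr's group_by() and then running the GeneOverlap functions inside it, but I could not get that to work. Which functions can I use to group the rows of both datasets by a column and then run the test on one group at a time?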

Example input data:

library(GeneOverlap)
library(dplyr)
library(stringr)

dataset1 <- structure(list(Gene = c("Gene1", "Gene1", "Gene2", "Gene3", "Gene3.", 
"Gene3"), Gene_count = c(5L, 5L, 3L, 16L, 16L, 16L), Phenotype = c("Phenotype1", 
"Phenotype2", "Phenotype1", "Phenotype6", "Phenotype2", "Phenotype1"
)), row.names = c(NA, -6L), class = c("data.table", "data.frame"
))


dataset2 <- structure(list(Gene = c("Gene1", "Gene1", "Gene4", "Gene2", "Gene6", 
"Gene7"), Gene_count = c(10L, 10L, 4L, 17L, 3L, 2L), Phenotype = c("Phenotype1", 
"Phenotype2", "Phenotype1", "Phenotype6", "Phenotype2", "Phenotype1"
)), row.names = c(NA, -6L), class = c("data.table", "data.frame"
))

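You can split each dataset into a list by Phenotype, then use Map to run the test on each pair of subsets. Note, however, that Map pairs the list elements by position, so both datasets must contain the same number of unique phenotypes in the same order. In other words, all(names(d1_split) == names(d2_split)) must be TRUE.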

d1_split <- split(dataset1, dataset1$Phenotype)
d2_split <- split(dataset2, dataset2$Phenotype)

# this should be TRUE in order for Map to work correctly
all(names(d1_split) == names(d2_split))

tests <- Map(function(d1, d2) {
  go.obj <- newGeneOverlap(d1$Gene, d2$Gene, genome.size = 1871)
  return(testGeneOverlap(go.obj))
}, d1_split, d2_split)
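
If the two datasets do not share exactly the same phenotypes, or if you want the results in a data frame as the question asks, something like the following may work. It is only a sketch: it assumes that phenotypes found in just one dataset should be skipped, it rebuilds the sample data as plain data frames so it runs on its own, and the names genes1, genes2, common and results (and the result column names) are made up for illustration. getIntersection(), getPval(), getOddsRatio() and getJaccard() are accessor functions from the GeneOverlap package.

```r
library(GeneOverlap)

# Same sample data as in the question, rebuilt as plain data frames
dataset1 <- data.frame(
  Gene = c("Gene1", "Gene1", "Gene2", "Gene3", "Gene3.", "Gene3"),
  Gene_count = c(5L, 5L, 3L, 16L, 16L, 16L),
  Phenotype = c("Phenotype1", "Phenotype2", "Phenotype1",
                "Phenotype6", "Phenotype2", "Phenotype1")
)
dataset2 <- data.frame(
  Gene = c("Gene1", "Gene1", "Gene4", "Gene2", "Gene6", "Gene7"),
  Gene_count = c(10L, 10L, 4L, 17L, 3L, 2L),
  Phenotype = c("Phenotype1", "Phenotype2", "Phenotype1",
                "Phenotype6", "Phenotype2", "Phenotype1")
)

# One vector of unique gene IDs per phenotype
genes1 <- lapply(split(dataset1$Gene, dataset1$Phenotype), unique)
genes2 <- lapply(split(dataset2$Gene, dataset2$Phenotype), unique)

# Assumption: only phenotypes present in BOTH datasets are tested.
# Indexing both lists by `common` also guarantees matching order for Map.
common <- intersect(names(genes1), names(genes2))

tests <- Map(function(g1, g2) {
  testGeneOverlap(newGeneOverlap(g1, g2, genome.size = 1871))
}, genes1[common], genes2[common])

# Collect one row per phenotype
results <- data.frame(
  Phenotype  = common,
  n_overlap  = vapply(tests, function(x) length(getIntersection(x)), integer(1)),
  p_value    = vapply(tests, getPval, numeric(1)),
  odds_ratio = vapply(tests, getOddsRatio, numeric(1)),
  jaccard    = vapply(tests, getJaccard, numeric(1)),
  row.names  = NULL
)
results
```

Subsetting both lists by the shared names sidesteps the all(names(d1_split) == names(d2_split)) requirement, at the cost of silently dropping phenotypes that appear in only one dataset. You can see which ones were dropped with setdiff(names(genes1), common) and setdiff(names(genes2), common).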