从系统发育树的集合中获得平均系统发育树分支长度

Obtaining mean phylogenetic tree branch lengths from an ensemble of phylogenetic trees

我有一组系统发育树,其中一些具有不同的拓扑结构和不同的分支长度。这里和示例集:

(LA:97.592181158,((HS:82.6284812237,RN:72.190055848635):10.438414999999999):3.989335,((CP:32.2668593286,CL:32.266858085):39.9232054349,(CS:78.2389673073,BT:78.238955218815):8.378847):10.974376);

(((HS:71.9309734249,((CP:30.289472339999996,CL:30.289473923):31.8509454,RN:62.1404181356):9.790551):2.049235,(CS:62.74606492390001,BS:62.74606028250001):11.234141000000001):5.067314,LA:79.0475136246);

(((((CP:39.415718961379994,CL:39.4157161214):29.043224136600003,RN:68.4589436016):8.947169,HS:77.4061105636):4.509818,(BS:63.09170355585999,CS:63.09171066541):18.824224):13.975551000000001,LA:95.891473546);

(LA:95.630761929,((HS:73.4928857457,((CP:32.673882875400004,CL:32.673881941):33.703323212,RN:66.37720021233):7.115682):5.537861,(CS:61.798048265700004,BS:61.798043931600006):17.232697):16.600025000000002);

(((HS:72.6356569413,((CP:34.015223002300004,CL:34.015223157499996):35.207698155399996,RN:69.2229294656):3.412726):8.746038,(CS:68.62665546391,BS:68.6266424085):12.755043999999998):13.40646,LA:94.78814570300001);

(LA:89.58710099299999,((HS:72.440439124,((CP:32.270428384199995,CL:32.2704269484):32.0556597315,RN:64.32607145395):8.114349):6.962274,(CS:66.3266360702,BS:66.3266352709):13.076080999999999):10.184418);

(LA:91.116083247,((HS:73.8383213643,((CP:36.4068361936,CL:36.4068400719):32.297183626700004,RN:68.704029984267):5.134297):6.50389,(BS:68.6124876659,CS:68.61249734691):11.729719):10.773886000000001);

(((HS:91.025288418,((CP:40.288406529099994,CL:40.288401832999995):29.854198951399997,RN:70.14260821095):20.882673999999998):6.163698,(CS:81.12951949976,BS:81.12952162629999):16.059462):13.109915,LA:110.298870881);

在此示例中有 2 个独特的拓扑 - 使用 Rape unique.multiPhylo 表明(假设上面的示例保存到文件 tree.fn) :

tree <- ape::read.tree(tree.fn)
unique.tree <- ape::unique.multiPhylo(tree, use.tip.label = F, use.edge.length = F)

> length(tree)
[1] 8
> length(unique.tree)
[1] 2

我的问题是如何获得树列表,每棵树代表输入列表中的唯一拓扑结构,分支长度是具有相同拓扑结构的所有树的汇总统计数据,例如均值或中值.

在上面的例子中,它会 return 第一棵树,因为它的拓扑结构是唯一的,而另一棵树是其他树的拓扑结构,具有平均或中值分支长度?

如果我理解得很好,您想将每个唯一树的所有树分类到不同的组中(例如,在您的示例中,第一组包含一棵树,等等...)然后测量每个组的一些统计数据?

您可以先将拓扑分组到一个列表中:

set.seed(5)
## Generating 20 4 tip trees (hopefully they will be identical topologies!)
tree_list <- rmtree(20, 4)
## How many unique topologies?
length(unique(tree_list))
## Sorting the trees by topologies
tree_list_tmp <- tree_list
sorted_tree_list <- list()
counter <- 0
while(length(tree_list_tmp) != 0) {
    counter <- counter+1
    ## Is the first tree equal to any of the trees in the list
    equal_to_tree_one <- unlist(lapply(tree_list_tmp, function(x, base) all.equal(x, base, use.edge.length = FALSE), base = tree_list_tmp[[1]]))
    ## Saving the identical trees
    sorted_tree_list[[counter]] <- tree_list_tmp[which(equal_to_tree_one)]
    ## Removing them from the list
    tree_list_tmp <- tree_list_tmp[-which(equal_to_tree_one)]
    ## Repeat while there are still some trees!
}
## The list of topologies should be equal to the number of unique trees
length(sorted_tree_list) == length(unique(tree_list))
## Giving them names for fancyness
names(sorted_tree_list) <- paste0("topology", 1:length(sorted_tree_list))

然后对于每个唯一拓扑组中的所有树,您可以通过制作一个函数来提取不同的汇总统计信息。例如,我将测量分支长度 sd、平均值和 90% 分位数。

## function for getting some stats
get.statistics <- function(unique_topology_group) {
    ## Extract the branch lengths of all the trees
    branch_lengths <- unlist(lapply(unique_topology_group, function(x) x$edge.length))

    ## Apply some statistics
    return(c(   n = length(unique_topology_group),
             mean = mean(branch_lengths),
               sd = sd(branch_lengths),
               quantile(branch_lengths, prob = c(0.05, 0.95))))
} 

## Getting all the stats
all_stats <- lapply(sorted_tree_list, get.statistics)
## and making it into a nice table
round(do.call(rbind, all_stats), digits = 3)
#            n  mean    sd    5%   95%
# topology1  3 0.559 0.315 0.113 0.962
# topology2  2 0.556 0.259 0.201 0.889
# topology3  4 0.525 0.378 0.033 0.989
# topology4  2 0.489 0.291 0.049 0.855
# topology5  2 0.549 0.291 0.062 0.882
# topology6  1 0.731 0.211 0.443 0.926
# topology7  3 0.432 0.224 0.091 0.789
# topology8  1 0.577 0.329 0.115 0.890
# topology9  1 0.473 0.351 0.108 0.833
# topology10 1 0.439 0.307 0.060 0.795

当然你可以调整它以获得你自己想要的统计数据,甚至可以得到每组每棵树的统计数据(使用双 lapply lapply(sorted_trees_list, lapply, get.statistics) 或类似的东西)。