计算物种出生-死亡的总系统发育分支长度之和 table

Question

背景

这个问题有点复杂，我先介绍一下背景。要生成物种出生-死亡的示例 table（L table），我建议使用 DDD 中的 dd_sim() 函数包。

library(DDD)
library(tidyverse)
library(picante)

result <- dd_sim(c(0.2, 0.1, 20), 10) 
# with birth rate 0.2, death rate 0.1, carrying capacity 20 and overall 10 million years.
L <- result$L

L

            [,1] [,2] [,3]       [,4]
 [1,] 10.0000000    0   -1 -1.0000000
 [2,] 10.0000000   -1    2  2.7058965
 [3,]  8.5908774    2    3  6.6301616
 [4,]  8.4786474    3    4  3.3866813
 [5,]  8.4455262   -1   -5 -1.0000000
 [6,]  8.3431071    4    6  3.5624756
 [7,]  5.3784683    2    7  0.6975934
 [8,]  3.8950593    6    8 -1.0000000
 [9,]  1.5032100   -5   -9 -1.0000000
[10,]  0.8393589    7   10 -1.0000000
[11,]  0.6118985   -5  -11 -1.0000000

Ltable有4列：

第一列是一个物种在 Mya 出生的时间
第二列是物种的亲本标签；正值和负值表示该物种属于左冠系还是右冠系
第三列是子种本身的标签；正负值表示物种属于左冠系还是右冠系
第四栏是物种灭绝的时间；如果第四个元素等于-1，则该物种仍然存在。

我做了什么

有了这个 L table，我现在有了一个现存的社区。我想计算它的系统发育多样性（也称为信仰指数）

使用 DDD 和 picante 函数我可以做到：

# convert L table into community data matrix
comm = as.data.frame(L) %>% dplyr::select(V3:V4) %>%
dplyr::rename(id = V3, pa = V4) %>%
dplyr::mutate(id = paste0("t",abs(id))) %>%
dplyr::mutate(pa = dplyr::if_else(pa == -1, 1, 0)) %>%
dplyr::mutate(plot = 0) %>%
dplyr::select(plot, pa, id) %>%
picante::sample2matrix()

# convert L table into phylogeny
phy = DDD::L2phylo(L, dropextinct = T)

# calculate Faith's index using pd() function
Faith = picante::pd(comm,phy)

问题

虽然我达到了目的，但过程似乎是多余且耗时的。我必须来回转换我原来的 L table 因为我必须使用现有的函数。

根据定义，Faith 指数基本上是群落总系统发育分支长度的总和，所以我的问题是：

Is it possible to calculate Faith's index directly from the L table?

提前谢谢！

Answer 1

我检查了 pd::sample2matrix 以了解它的内部功能。 tapply 调用和以下行看起来是唯一必要的部分。

library(DDD)
library(tidyverse)
library(picante)
#> Loading required package: ape
#> Loading required package: vegan
#> Loading required package: permute
#> Loading required package: lattice
#> This is vegan 2.5-6
#> Loading required package: nlme
#> 
#> Attaching package: 'nlme'
#> The following object is masked from 'package:dplyr':
#> 
#>     collapse
set.seed(100)
result <- dd_sim(c(0.2, 0.1, 20), 10) 
# with birth rate 0.2, death rate 0.1, carrying capacity 20 and overall 10 million years.
L <- result$L

# convert L table into community data matrix
comm_original = as.data.frame(L) %>% dplyr::select(V3:V4) %>%
  dplyr::rename(id = V3, pa = V4) %>%
  dplyr::mutate(id = paste0("t",abs(id))) %>%
  dplyr::mutate(pa = dplyr::if_else(pa == -1, 1, 0)) %>%
  dplyr::mutate(plot = 0) %>%
  dplyr::select(plot, pa, id) %>%
  picante::sample2matrix()


# Instead of using dplyr, we'll do some base R operations
# on L. The code doesn't look as nice, but it should be
# significantly faster.
pa <- ifelse(L[, 4] == -1, 1, 0)
plot <- rep(0, length(pa))
id <- paste0("t", abs(L[,3]))
comm_new <- tapply(pa, list(plot, id), sum)
comm_new[is.na(comm_new)] <- 0
# convert L table into phylogeny
phy = DDD::L2phylo(L, dropextinct = T)

# calculate Faith's index using pd() function
picante::pd(comm_original,phy)
#>         PD SR
#> 0 29.82483  6

picante::pd(comm_new, phy)
#>         PD SR
#> 0 29.82483  6
Created on 2019-11-17 by the reprex package (v0.3.0)

编辑：original() 是您最初构建的方式 comm，new() 是上面给出的方式。如果将其换入，看起来您可以期待 2 倍的加速。我知道根据工作负载的大小，这并不是一个巨大的收益，但总比没有好。

expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result               memory          time    gc            
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>               <list>          <list>  <list>        
1 original()   9.76ms  10.24ms      96.1     552KB     2.04    47     1      489ms <df[,1125] [1 x 1,1~ <df[,3] [107 x~ <bch:t~ <tibble [48 x~
2 new()        4.57ms   4.84ms     201.      464KB     2.07    97     1      483ms <dbl[,1125] [1 x 1,~ <df[,3] [63 x ~ <bch:t~ <tibble [98 x~

Answer 2

您可以简单地使用 DDD::L2phylo 生成的 phylo 对象的 phy$edge.length 组件：

## Measuring the sum of the branch lengths from `phy`
sum_br_length <- sum(phy$edge.length)
sum_br_length == Faith$PD
# [1] TRUE

## Measuring the sum of the branch length from `L`
sum_br_length <- sum(DDD::L2phylo(L, dropextinct = TRUE)$edge.length)
sum_br_length == Faith$PD
# [1] TRUE

还有一些有趣的微基准测试：

library(microbenchmark)
## Function 1
fun1 <- function(L) {
    comm = as.data.frame(L) %>% dplyr::select(V3:V4) %>%
    dplyr::rename(id = V3, pa = V4) %>%
    dplyr::mutate(id = paste0("t",abs(id))) %>%
    dplyr::mutate(pa = dplyr::if_else(pa == -1, 1, 0)) %>%
    dplyr::mutate(plot = 0) %>%
    dplyr::select(plot, pa, id) %>%
    picante::sample2matrix()

    # convert L table into phylogeny
    phy = DDD::L2phylo(L, dropextinct = T)

    # calculate Faith's index using pd() function
    Faith = picante::pd(comm,phy)
    return(Faith$PD)
}
## Function 2
fun2 <- function(L) {
    phy <- DDD::L2phylo(L, dropextinct = T)
    return(sum(phy$edge.length))
}
## Function 3
fun3 <- function(L) {
    return(sum(DDD::L2phylo(L, dropextinct = TRUE)$edge.length))
}

## Do all of them give the same results
fun1(L) == Faith$PD
# [1] TRUE
fun2(L) == Faith$PD
# [1] TRUE
fun3(L) == Faith$PD
# [1] TRUE

## Which function fastest?
microbenchmark(fun1(L), fun2(L), fun3(L))
# Unit: milliseconds
#     expr      min       lq     mean   median       uq       max neval
#  fun1(L) 6.486462 6.900641 8.273386 7.445334 8.667535 16.888429   100
#  fun2(L) 1.627854 1.683204 2.215531 1.771219 2.229408  9.522366   100
#  fun3(L) 1.630635 1.663181 2.229206 1.859733 2.448196  7.573001   100

计算物种出生-死亡的总系统发育分支长度之和 table

Calculate sum of the total phylogenetic branch length from species birth-death table

r

bioinformatics

phylogeny

背景

我做了什么

问题