计算物种出生-死亡的总系统发育分支长度之和 table
Calculate sum of the total phylogenetic branch length from species birth-death table
背景
这个问题有点复杂,我先介绍一下背景。要生成物种出生-死亡的示例 table(L table),我建议使用 DDD
中的 dd_sim()
函数包。
library(DDD)
library(tidyverse)
library(picante)
result <- dd_sim(c(0.2, 0.1, 20), 10)
# with birth rate 0.2, death rate 0.1, carrying capacity 20 and overall 10 million years.
L <- result$L
L
[,1] [,2] [,3] [,4]
[1,] 10.0000000 0 -1 -1.0000000
[2,] 10.0000000 -1 2 2.7058965
[3,] 8.5908774 2 3 6.6301616
[4,] 8.4786474 3 4 3.3866813
[5,] 8.4455262 -1 -5 -1.0000000
[6,] 8.3431071 4 6 3.5624756
[7,] 5.3784683 2 7 0.6975934
[8,] 3.8950593 6 8 -1.0000000
[9,] 1.5032100 -5 -9 -1.0000000
[10,] 0.8393589 7 10 -1.0000000
[11,] 0.6118985 -5 -11 -1.0000000
Ltable有4列:
- 第一列是一个物种在 Mya 出生的时间
- 第二列是物种的亲本标签;正值和负值表示该物种属于左冠系还是右冠系
- 第三列是子种本身的标签;正负值表示物种属于左冠系还是右冠系
- 第四栏是物种灭绝的时间;如果
第四个元素等于-1,则该物种仍然存在。
我做了什么
有了这个 L table,我现在有了一个现存的社区。我想计算它的系统发育多样性(也称为信仰指数)
使用 DDD
和 picante
函数我可以做到:
# convert L table into community data matrix
comm = as.data.frame(L) %>% dplyr::select(V3:V4) %>%
dplyr::rename(id = V3, pa = V4) %>%
dplyr::mutate(id = paste0("t",abs(id))) %>%
dplyr::mutate(pa = dplyr::if_else(pa == -1, 1, 0)) %>%
dplyr::mutate(plot = 0) %>%
dplyr::select(plot, pa, id) %>%
picante::sample2matrix()
# convert L table into phylogeny
phy = DDD::L2phylo(L, dropextinct = T)
# calculate Faith's index using pd() function
Faith = picante::pd(comm,phy)
问题
虽然我达到了目的,但过程似乎是多余且耗时的。我必须来回转换我原来的 L table 因为我必须使用现有的函数。
根据定义,Faith 指数基本上是群落总系统发育分支长度的总和,所以我的问题是:
Is it possible to calculate Faith's index directly from the L table?
提前谢谢!
我检查了 pd::sample2matrix
以了解它的内部功能。 tapply
调用和以下行看起来是唯一必要的部分。
library(DDD)
library(tidyverse)
library(picante)
#> Loading required package: ape
#> Loading required package: vegan
#> Loading required package: permute
#> Loading required package: lattice
#> This is vegan 2.5-6
#> Loading required package: nlme
#>
#> Attaching package: 'nlme'
#> The following object is masked from 'package:dplyr':
#>
#> collapse
set.seed(100)
result <- dd_sim(c(0.2, 0.1, 20), 10)
# with birth rate 0.2, death rate 0.1, carrying capacity 20 and overall 10 million years.
L <- result$L
# convert L table into community data matrix
comm_original = as.data.frame(L) %>% dplyr::select(V3:V4) %>%
dplyr::rename(id = V3, pa = V4) %>%
dplyr::mutate(id = paste0("t",abs(id))) %>%
dplyr::mutate(pa = dplyr::if_else(pa == -1, 1, 0)) %>%
dplyr::mutate(plot = 0) %>%
dplyr::select(plot, pa, id) %>%
picante::sample2matrix()
# Instead of using dplyr, we'll do some base R operations
# on L. The code doesn't look as nice, but it should be
# significantly faster.
pa <- ifelse(L[, 4] == -1, 1, 0)
plot <- rep(0, length(pa))
id <- paste0("t", abs(L[,3]))
comm_new <- tapply(pa, list(plot, id), sum)
comm_new[is.na(comm_new)] <- 0
# convert L table into phylogeny
phy = DDD::L2phylo(L, dropextinct = T)
# calculate Faith's index using pd() function
picante::pd(comm_original,phy)
#> PD SR
#> 0 29.82483 6
picante::pd(comm_new, phy)
#> PD SR
#> 0 29.82483 6
Created on 2019-11-17 by the reprex package (v0.3.0)
编辑:original()
是您最初构建的方式 comm
,new()
是上面给出的方式。如果将其换入,看起来您可以期待 2 倍的加速。我知道根据工作负载的大小,这并不是一个巨大的收益,但总比没有好。
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 original() 9.76ms 10.24ms 96.1 552KB 2.04 47 1 489ms <df[,1125] [1 x 1,1~ <df[,3] [107 x~ <bch:t~ <tibble [48 x~
2 new() 4.57ms 4.84ms 201. 464KB 2.07 97 1 483ms <dbl[,1125] [1 x 1,~ <df[,3] [63 x ~ <bch:t~ <tibble [98 x~
您可以简单地使用 DDD::L2phylo
生成的 phylo
对象的 phy$edge.length
组件:
## Measuring the sum of the branch lengths from `phy`
sum_br_length <- sum(phy$edge.length)
sum_br_length == Faith$PD
# [1] TRUE
## Measuring the sum of the branch length from `L`
sum_br_length <- sum(DDD::L2phylo(L, dropextinct = TRUE)$edge.length)
sum_br_length == Faith$PD
# [1] TRUE
还有一些有趣的微基准测试:
library(microbenchmark)
## Function 1
fun1 <- function(L) {
comm = as.data.frame(L) %>% dplyr::select(V3:V4) %>%
dplyr::rename(id = V3, pa = V4) %>%
dplyr::mutate(id = paste0("t",abs(id))) %>%
dplyr::mutate(pa = dplyr::if_else(pa == -1, 1, 0)) %>%
dplyr::mutate(plot = 0) %>%
dplyr::select(plot, pa, id) %>%
picante::sample2matrix()
# convert L table into phylogeny
phy = DDD::L2phylo(L, dropextinct = T)
# calculate Faith's index using pd() function
Faith = picante::pd(comm,phy)
return(Faith$PD)
}
## Function 2
fun2 <- function(L) {
phy <- DDD::L2phylo(L, dropextinct = T)
return(sum(phy$edge.length))
}
## Function 3
fun3 <- function(L) {
return(sum(DDD::L2phylo(L, dropextinct = TRUE)$edge.length))
}
## Do all of them give the same results
fun1(L) == Faith$PD
# [1] TRUE
fun2(L) == Faith$PD
# [1] TRUE
fun3(L) == Faith$PD
# [1] TRUE
## Which function fastest?
microbenchmark(fun1(L), fun2(L), fun3(L))
# Unit: milliseconds
# expr min lq mean median uq max neval
# fun1(L) 6.486462 6.900641 8.273386 7.445334 8.667535 16.888429 100
# fun2(L) 1.627854 1.683204 2.215531 1.771219 2.229408 9.522366 100
# fun3(L) 1.630635 1.663181 2.229206 1.859733 2.448196 7.573001 100
背景
这个问题有点复杂,我先介绍一下背景。要生成物种出生-死亡的示例 table(L table),我建议使用 DDD
中的 dd_sim()
函数包。
library(DDD)
library(tidyverse)
library(picante)
result <- dd_sim(c(0.2, 0.1, 20), 10)
# with birth rate 0.2, death rate 0.1, carrying capacity 20 and overall 10 million years.
L <- result$L
L
[,1] [,2] [,3] [,4]
[1,] 10.0000000 0 -1 -1.0000000
[2,] 10.0000000 -1 2 2.7058965
[3,] 8.5908774 2 3 6.6301616
[4,] 8.4786474 3 4 3.3866813
[5,] 8.4455262 -1 -5 -1.0000000
[6,] 8.3431071 4 6 3.5624756
[7,] 5.3784683 2 7 0.6975934
[8,] 3.8950593 6 8 -1.0000000
[9,] 1.5032100 -5 -9 -1.0000000
[10,] 0.8393589 7 10 -1.0000000
[11,] 0.6118985 -5 -11 -1.0000000
Ltable有4列:
- 第一列是一个物种在 Mya 出生的时间
- 第二列是物种的亲本标签;正值和负值表示该物种属于左冠系还是右冠系
- 第三列是子种本身的标签;正负值表示物种属于左冠系还是右冠系
- 第四栏是物种灭绝的时间;如果 第四个元素等于-1,则该物种仍然存在。
我做了什么
有了这个 L table,我现在有了一个现存的社区。我想计算它的系统发育多样性(也称为信仰指数)
使用 DDD
和 picante
函数我可以做到:
# convert L table into community data matrix
comm = as.data.frame(L) %>% dplyr::select(V3:V4) %>%
dplyr::rename(id = V3, pa = V4) %>%
dplyr::mutate(id = paste0("t",abs(id))) %>%
dplyr::mutate(pa = dplyr::if_else(pa == -1, 1, 0)) %>%
dplyr::mutate(plot = 0) %>%
dplyr::select(plot, pa, id) %>%
picante::sample2matrix()
# convert L table into phylogeny
phy = DDD::L2phylo(L, dropextinct = T)
# calculate Faith's index using pd() function
Faith = picante::pd(comm,phy)
问题
虽然我达到了目的,但过程似乎是多余且耗时的。我必须来回转换我原来的 L table 因为我必须使用现有的函数。
根据定义,Faith 指数基本上是群落总系统发育分支长度的总和,所以我的问题是:
Is it possible to calculate Faith's index directly from the L table?
提前谢谢!
我检查了 pd::sample2matrix
以了解它的内部功能。 tapply
调用和以下行看起来是唯一必要的部分。
library(DDD)
library(tidyverse)
library(picante)
#> Loading required package: ape
#> Loading required package: vegan
#> Loading required package: permute
#> Loading required package: lattice
#> This is vegan 2.5-6
#> Loading required package: nlme
#>
#> Attaching package: 'nlme'
#> The following object is masked from 'package:dplyr':
#>
#> collapse
set.seed(100)
result <- dd_sim(c(0.2, 0.1, 20), 10)
# with birth rate 0.2, death rate 0.1, carrying capacity 20 and overall 10 million years.
L <- result$L
# convert L table into community data matrix
comm_original = as.data.frame(L) %>% dplyr::select(V3:V4) %>%
dplyr::rename(id = V3, pa = V4) %>%
dplyr::mutate(id = paste0("t",abs(id))) %>%
dplyr::mutate(pa = dplyr::if_else(pa == -1, 1, 0)) %>%
dplyr::mutate(plot = 0) %>%
dplyr::select(plot, pa, id) %>%
picante::sample2matrix()
# Instead of using dplyr, we'll do some base R operations
# on L. The code doesn't look as nice, but it should be
# significantly faster.
pa <- ifelse(L[, 4] == -1, 1, 0)
plot <- rep(0, length(pa))
id <- paste0("t", abs(L[,3]))
comm_new <- tapply(pa, list(plot, id), sum)
comm_new[is.na(comm_new)] <- 0
# convert L table into phylogeny
phy = DDD::L2phylo(L, dropextinct = T)
# calculate Faith's index using pd() function
picante::pd(comm_original,phy)
#> PD SR
#> 0 29.82483 6
picante::pd(comm_new, phy)
#> PD SR
#> 0 29.82483 6
Created on 2019-11-17 by the reprex package (v0.3.0)
编辑:original()
是您最初构建的方式 comm
,new()
是上面给出的方式。如果将其换入,看起来您可以期待 2 倍的加速。我知道根据工作负载的大小,这并不是一个巨大的收益,但总比没有好。
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 original() 9.76ms 10.24ms 96.1 552KB 2.04 47 1 489ms <df[,1125] [1 x 1,1~ <df[,3] [107 x~ <bch:t~ <tibble [48 x~
2 new() 4.57ms 4.84ms 201. 464KB 2.07 97 1 483ms <dbl[,1125] [1 x 1,~ <df[,3] [63 x ~ <bch:t~ <tibble [98 x~
您可以简单地使用 DDD::L2phylo
生成的 phylo
对象的 phy$edge.length
组件:
## Measuring the sum of the branch lengths from `phy`
sum_br_length <- sum(phy$edge.length)
sum_br_length == Faith$PD
# [1] TRUE
## Measuring the sum of the branch length from `L`
sum_br_length <- sum(DDD::L2phylo(L, dropextinct = TRUE)$edge.length)
sum_br_length == Faith$PD
# [1] TRUE
还有一些有趣的微基准测试:
library(microbenchmark)
## Function 1
fun1 <- function(L) {
comm = as.data.frame(L) %>% dplyr::select(V3:V4) %>%
dplyr::rename(id = V3, pa = V4) %>%
dplyr::mutate(id = paste0("t",abs(id))) %>%
dplyr::mutate(pa = dplyr::if_else(pa == -1, 1, 0)) %>%
dplyr::mutate(plot = 0) %>%
dplyr::select(plot, pa, id) %>%
picante::sample2matrix()
# convert L table into phylogeny
phy = DDD::L2phylo(L, dropextinct = T)
# calculate Faith's index using pd() function
Faith = picante::pd(comm,phy)
return(Faith$PD)
}
## Function 2
fun2 <- function(L) {
phy <- DDD::L2phylo(L, dropextinct = T)
return(sum(phy$edge.length))
}
## Function 3
fun3 <- function(L) {
return(sum(DDD::L2phylo(L, dropextinct = TRUE)$edge.length))
}
## Do all of them give the same results
fun1(L) == Faith$PD
# [1] TRUE
fun2(L) == Faith$PD
# [1] TRUE
fun3(L) == Faith$PD
# [1] TRUE
## Which function fastest?
microbenchmark(fun1(L), fun2(L), fun3(L))
# Unit: milliseconds
# expr min lq mean median uq max neval
# fun1(L) 6.486462 6.900641 8.273386 7.445334 8.667535 16.888429 100
# fun2(L) 1.627854 1.683204 2.215531 1.771219 2.229408 9.522366 100
# fun3(L) 1.630635 1.663181 2.229206 1.859733 2.448196 7.573001 100