R 中的系统发育学:在树上工作与在树上阅读时的不同结果

Phylogenetics in R: different results when working on a tree compared to reading it in

这个问题直接源于我在这里问的前一个问题:

((((11:0.00201426,12:5e-08,(9:1e-08,10:1e-08,8:1e-08)40:0.00403036)41:0.00099978,7:5e-08)42:0.01717066,(3:0.00191517,(4:0.00196859,(5:1e-08,6:1e-08)71:0.00205168)70:0.00112995)69:0.01796015)43:0.042592645,((1:0.00136179,2:0.00267375)44:0.05586907,(((13:0.00093161,14:0.00532243)47:0.01252989,((15:1e-08,16:1e-08)49:0.00123243,(17:0.00272478,(18:0.00085725,19:0.00113572)51:0.01307761)50:0.00847373)48:0.01103656)46:0.00843782,((20:0.0020268,(21:0.00099593,22:1e-08)54:0.00099081)53:0.00297097,(23:0.00200672,(25:1e-08,(36:1e-08,37:1e-08,35:1e-08,34:1e-08,33:1e-08,32:1e-08,31:1e-08,30:1e-08,29:1e-08,28:0.00099682,27:1e-08,26:1e-08)58:0.00200056,24:1e-08)56:0.00100953)55:0.00210137)52:0.01233888)45:0.01906982)73:0.003562205)38;

我用 R 中的基因树创建了这棵树,使用 midpoint 函数对其进行生根。现在我对其应用以下函数:

drop_dupes <- function(tree,thres=1e-05){
  tips <- which(tree$edge[,2] %in% 1:Ntip(tree))
  toDrop <- tree$edge.length[tips] < thres
  newtree <- drop.tip(tree,tree$tip.label[toDrop])
  return(newtree)
}

现在我的问题是,当我应用这个函数时,我没有得到我期望的树。绘制时的输出如下所示:

但是,如果我使用 write.tree 将树写入文本文件,然后将其作为 Newick 字符串再次读入,然后应用上面的 drop_dupes 函数,我得到不同的树,如下图所示:

这就是让我困惑的地方。为什么将相同的函数应用于同一棵树时会得到两个不同的输出,唯一的区别是它是读入的还是已经是 R 中的变量?

编辑:这里有一些更详细的信息。初始基因树如下:

(B.weihenstephanensis.KBAB4.ffn:0.00136179,B.weihenstephanensisWSBC10204.ffn:0.00267375,(((B.cereus.NJW.ffn:0.00191517,(B.thuringiensis.HS181.ffn:0.00196859,(B.thuringiensis.Bt407.ffn:0.00000001,B.thuringiensis.chinensisCT43.ffn:0.00000001)0.879000:0.00205168)0.738000:0.00112995)0.969000:0.01796015,(B.cereus.FORC013.ffn:0.00000005,((B.thuringiensis.galleriaeHD29.ffn:0.00000001,(B.thuringiensis.kurstakiYBT1520.ffn:0.00000001,B.thuringiensis.YWC28.ffn:0.00000001)0.000000:0.00000001)0.971000:0.00403036,(B.cereus.ATCC14579.ffn:0.00201426,B.thuringiensis.tolworthi.ffn:0.00000005)0.000000:0.00000005)0.377000:0.00099978)0.969000:0.01717066)1.000000:0.04615485,(((B.cereus.FM1.ffn:0.00093161,B.cereus.FT9.ffn:0.00532243)0.990000:0.01252989,((B.cereus.AH187.ffn:0.00000001,B.cereus.NC7401.ffn:0.00000001)0.694000:0.00123243,(B.thuringiensis.finitimusYBT020.ffn:0.00272478,(B.cereus.ATCC10987.ffn:0.00085725,B.cereus.FRI35.ffn:0.00113572)0.994000:0.01307761)0.973000:0.00847373)0.972000:0.01103656)0.863000:0.00843782,((B.thuringiensis.9727.ffn:0.00202680,(B.cereus.03BB102.ffn:0.00099593,B.cereus.D17.ffn:0.00000001)0.741000:0.00099081)0.822000:0.00297097,(B.cereus.E33L.ffn:0.00200672,(B.cereus.S28.ffn:0.00000001,(B.thuringiensis.HD1011.ffn:0.00000001,(B.anthracis.Vollum1B.ffn:0.00000001,(B.anthracis.Turkey32.ffn:0.00000001,(B.anthracis.RA3.ffn:0.00099682,(B.anthracis.Pasteur.ffn:0.00000001,(B.anthracis.Larissa.ffn:0.00000001,(B.anthracis.Cvac02.ffn:0.00000001,(B.anthracis.CDC684.ffn:0.00000001,(B.anthracis.BFV.ffn:0.00000001,(B.anthracis.Ames.ffn:0.00000001,(B.anthracis.A16R.ffn:0.00000001,(B.anthracis.A0248.ffn:0.00000001,B.anthracis.A1144.ffn:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000005)0.000000:0.00000005)0.956000:0.00200056)0.000000:0.00000006)0.805000:0.00100953)0.809000:0.00210137)0.957000:0.01233888)0.929000:0.01906982)1.000000:0.05586907);

我把它作为tree1读入R。然后我使用以下代码:

#Function to label nodes and tips as sequential integers
sort.names <- function(tr){
  tr$node.label<-(length(tr$tip.label) + 1):(length(tr$tip.label)+ tr$Nnode)
  ##some of these are tips, some are nodes, need to treat differently
  tr$tip.label<-1:(length(tr$tip.label))
  return(tr)
}

#Function to check if tree is rooted, and if it is not to use midpoint #rooting
rootCheck <- function(tree){
  if(is.rooted(tree) == FALSE){
    rootedTree <- midpoint(tree)
  }
  return(rootedTree)
}

#The above mentioned function to remove duplicate tips
drop_dupes <- function(tree,thres=1e-05){
  tips <- which(tree$edge[,2] %in% 1:Ntip(tree))
  toDrop <- tree$edge.length[tips] < thres
  newtree <- drop.tip(tree,tree$tip.label[toDrop])
  return(newtree)
}

#Use functions on tree
a <- rootCheck(tree1)
b <- sort.names(a)
c <- di2multi(b, tol = 1e-05)
d <- drop_dupes(c)

此时,如果我绘制树 d,我将得到上面的第一个图。但是,如果我将树 c 写入文本文件,然后将其读回,然后在其上使用 drop_dupes 函数,我将得到后一棵树。

我对照页面顶部的Newick树检查了树c的newick文件,肯定是一样的。

问题出在 sort.names 函数中。它有效地重新排列了树的书写方式,然后索引引用了它不应该引用的其他节点。

如果您需要 tip.labels 编号,为什么不单独标记它们?

tax.num <- data.frame(taxa = tree1$tip.label, 
    numbers = 1:(length(tree1$tip.label)))

a <- midpoint(drop_dupes(tree1))
a$tip.label <- tax.num$number[match(a$tip.label, tax.num$taxa)]

plot(a)

此外,di2multi似乎是多余的。它创建了多项式,据我所知,目标是丢弃短分支的提示。