Sequence analysis clustering CHI2 EUCLID error

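I am very new to sequence analysis and am trying to identify clusters in an aggregated sequence matrix, focusing mainly on state durations. However, when I use method='CHI2' or method='EUCLID' with step=1 (or any other value), I get this error: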

Error in if (SCres > currentSCres) { : 
  missing value where TRUE/FALSE needed

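Any idea why? (The distance matrix contains some NaN values, which may be caused by the sequences having different lengths.)

Here is what the sequence object and the distance matrix look like: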

Sequence                                         
1    a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a
2    a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a  
3    a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-c-c-c
4    a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e
5    b-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a

Distance matrix
          1    2         3         4
2       NaN
3 289.92897  NaN
4 141.07472  NaN 263.22855
5  10.22425  NaN 290.10919 141.44473

Code:

library(TraMineR) #version 2.0-13
library(WeightedCluster) #version 1.4

SO <- seqdef(DAT, right = "DEL")
DM <- seqdist(SO, method = "CHI2", step = 1, full.matrix = FALSE)
FIT <- seqpropclust(SO, diss = DM, maxcluster = 8,
      properties = c("state", "duration", "spell.age", "spell.dur",
        "transition", "pattern", "AFtransition", "AFpattern", "Complexity"))

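The "CHI2" distance between two sequences x and y, as computed by TraMineR, is the chi-squared distance between the state distributions within successive periods of length step. See Studer and Ritschard (2014, p. 8).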

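This means that with step=1, a chi-squared distance is computed at each position. When one of the sequences has a void at some position (for example, the last position of the second sequence), the distance cannot be computed at that position, and the CHI2 distance between that sequence and any other sequence comes out as NaN.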
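To make the mechanism concrete, here is a minimal base-R sketch of a position-wise chi-squared distance (the step=1 case). It is not TraMineR's implementation: the function name chi2_by_position, the NA encoding of voids, and the choice to compute the overall state distribution from non-void observations only are assumptions made for illustration.

```r
# The five sequences from the question
dat <- c("a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a",
         "a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a",
         "a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-c-c-c",
         "a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e",
         "b-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a")
tokens <- strsplit(dat, "-", fixed = TRUE)
n_pos <- max(lengths(tokens))
# One row per sequence; shorter sequences are padded with NA (assumed void marker)
m <- t(sapply(tokens, function(s) c(s, rep(NA, n_pos - length(s)))))

# Hypothetical helper: chi-squared distance summed over single positions
chi2_by_position <- function(m) {
  n <- nrow(m)
  states <- sort(unique(na.omit(as.vector(m))))
  d2 <- matrix(0, n, n)
  for (pos in seq_len(ncol(m))) {
    counts <- outer(m[, pos], states, "==")
    counts[is.na(counts)] <- FALSE          # a void contributes no observation
    counts <- counts * 1
    marg <- colSums(counts) / sum(counts)   # overall state distribution at pos
    prop <- counts / rowSums(counts)        # 0/0 = NaN for a void position
    keep <- marg > 0                        # skip states absent at this position
    for (i in seq_len(n - 1)) {
      for (j in (i + 1):n) {
        term <- sum((prop[i, keep] - prop[j, keep])^2 / marg[keep])
        d2[i, j] <- d2[i, j] + term
        d2[j, i] <- d2[i, j]
      }
    }
  }
  sqrt(d2)
}

d_full <- chi2_by_position(m)         # all 25 positions, void included
d_24 <- chi2_by_position(m[, 1:24])   # first 24 positions only, no void
print(d_full[2, ])  # NaN for every pair involving sequence 2
print(d_24[1, 5])   # 2.5
print(d_24[3, 4])   # sqrt(20), about 4.472
```

On all 25 positions, every distance involving sequence 2 is NaN because its state distribution at position 25 is 0/0, and a single NaN term contaminates the whole sum. On the first 24 positions, where no sequence has a void, all distances are finite.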

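To avoid this, you can use one of the following workarounds:

1) Set step large enough to ensure that every sequence has at least one non-void element in each period. In your example, the longest sequence has length 25. To ensure that the last period contains a non-void element, you have to set step=5.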

DAT <- c("a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a",
         "a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a",  
         "a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-c-c-c",
         "a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e",
         "b-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a")
SO <- seqdef(DAT)
DM <- seqdist(SO, method = "CHI2", step=5)
DM
##          [,1]     [,2]     [,3]     [,4]     [,5]
## [1,] 0.000000 0.000000 4.543441 4.543441 1.030776
## [2,] 0.000000 0.000000 4.543441 4.543441 1.030776
## [3,] 4.543441 4.543441 0.000000 2.028370 4.604927
## [4,] 4.543441 4.543441 2.028370 0.000000 4.604927
## [5,] 1.030776 1.030776 4.604927 4.604927 0.000000
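The choice of step=5 in workaround 1 can be checked with a little arithmetic. The sketch below assumes that periods are consecutive blocks of step positions starting at position 1, with a possibly shorter final block, and that voids occur only at the right end of a sequence. Both are assumptions for illustration; the helper step_covers_all is hypothetical, and whether seqdist accepts a step that does not divide the sequence length is not checked here.

```r
# Lengths of the five example sequences (sequence 2 is one position shorter)
seq_lengths <- c(25, 24, 25, 25, 25)
max_len <- max(seq_lengths)

# Hypothetical helper: TRUE when every sequence has at least one observed
# position in every block of `step` consecutive positions
step_covers_all <- function(step, seq_lengths, max_len) {
  block_start <- seq(1, max_len, by = step)
  # With voids only at the right end, a sequence reaches a block
  # exactly when its length is at least the block's first position
  min(seq_lengths) >= max(block_start)
}

candidates <- seq_len(max_len)
ok <- sapply(candidates, step_covers_all,
             seq_lengths = seq_lengths, max_len = max_len)
smallest <- min(candidates[ok])
print(smallest)  # 5
```

For step values 1 to 4 the final block starts at position 25, which is void in sequence 2. With step=5 the final block spans positions 21 to 25, so sequence 2 still contributes four observed states to it.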

2) Drop the columns that contain void elements:

SOdrop <- SO[,1:(ncol(SO)-1)]
SOdrop
DMd <- seqdist(SOdrop, method = "CHI2", step=1)
DMd
##          [,1]     [,2]      [,3]      [,4]     [,5]
## [1,]  0.00000  0.00000 10.041580 10.041580  2.50000
## [2,]  0.00000  0.00000 10.041580 10.041580  2.50000
## [3,] 10.04158 10.04158  0.000000  4.472136 10.34811
## [4,] 10.04158 10.04158  4.472136  0.000000 10.34811
## [5,]  2.50000  2.50000 10.348108 10.348108  0.00000

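3) Pad the shorter sequences with missing values and treat the missing value as an additional possible state. By default, seqdef uses right='DEL', which produces voids. Here we set right=NA to get missing values instead.

SOm <- seqdef(DAT, right = NA)
DMm <- seqdist(SOm, method = "CHI2", step = 1, with.missing = TRUE)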
DMm
##          [,1]      [,2]      [,3]      [,4]      [,5]
## [1,]  0.000000  2.738613 10.408330 10.408330  2.500000
## [2,]  2.738613  0.000000 10.527741 10.527741  3.708099
## [3,] 10.408330 10.527741  0.000000  5.477226 10.704360
## [4,] 10.408330 10.527741  5.477226  0.000000 10.704360
## [5,]  2.500000  3.708099 10.704360 10.704360  0.000000

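Now, the error reported in the question is not raised by seqdist, but by the seqpropclust function of the WeightedCluster package. The error is evidently caused by the NaN values in the dissimilarity matrix.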
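The error message itself is ordinary R behavior and can be reproduced without either package. The sketch below shows how a comparison involving NaN breaks an if statement, and how to check a dissimilarity matrix for NaN before passing it to a clustering function. The matrix dm is a toy stand-in for the seqdist output, not real results.

```r
# Comparing NaN with a number yields NA rather than TRUE or FALSE
cmp <- NaN > 1
print(cmp)  # NA

# `if` cannot branch on NA, which raises the error seen in the question
msg <- tryCatch(
  if (cmp) "bigger" else "not bigger",
  error = function(e) conditionMessage(e)
)
print(msg)

# Toy dissimilarity matrix (made-up values) with NaN in row and column 2
dm <- matrix(c(0,    NaN, 10.2,
               NaN,  0,   NaN,
               10.2, NaN, 0), nrow = 3, byrow = TRUE)

# anyNA() is TRUE for NaN as well as NA
has_nan <- anyNA(dm)
# Rows whose off-diagonal entries are all NaN point to the offending sequences
culprit <- which(rowSums(is.na(dm)) == ncol(dm) - 1)
print(has_nan)  # TRUE
print(culprit)  # 2
```

Running such a check on the dissimilarity matrix before clustering points directly at the sequences that need one of the three workarounds above.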