Sequence analysis clustering CHI2 EUCLID error
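I'm very new to sequence analysis and am trying to identify clusters in aggregated sequence matrices, with a focus on state durations. However, when using method='CHI2' or method='EUCLID' in combination with step=1 (but not otherwise), I receive an error:
Error in if (SCres > currentSCres) { : missing value where TRUE/FALSE needed
Any idea why? (There are some NaN values in the distance matrix, which might be caused by the sequences having different lengths.)
Here is what the sequence object and distance matrix look like: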
Sequence
1 a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a
2 a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a
3 a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-c-c-c
4 a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e
5 b-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a
Distance matrix
1 2 3 4
2 NaN
3 289.92897 NaN
4 141.07472 NaN 263.22855
5 10.22425 NaN 290.10919 141.44473
Code:
library(TraMineR) #version 2.0-13
library(WeightedCluster) #version 1.4
SO = seqdef(DAT,right='DEL')
DM = seqdist(SO, method = "CHI2", step=1, full.matrix = F)
FIT = seqpropclust(SO, diss=DM, maxcluster=8,
properties=c("state", "duration", "spell.age","spell.dur",
"transition","pattern", "AFtransition", "AFpattern","Complexity"))
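The "CHI2" distance between two sequences x and y is computed by TraMineR as the chi-squared distance between the state distributions within successive periods of length step. See Studer and Ritschard (2014, p 8).
This means that for step=1, a chi-squared distance is computed at each position. When one of the sequences has a void at some position (for example, the last position of the second sequence), the distance cannot be computed for that position, and we get NaN as the CHI2 distance between this sequence and any other sequence.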
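To see where the NaN comes from, here is a minimal base-R sketch of a chi-squared distance on a single interval. It is a simplified illustration, not TraMineR's actual implementation: the helper names state_props and chi2_dist and the reference distribution ref are assumptions made up for this example.

```r
# Simplified illustration only; not TraMineR's actual implementation.
alphabet <- c("a", "b", "c", "e")

# Proportion of each state among the observed positions of one interval.
state_props <- function(states) {
  counts <- table(factor(states, levels = alphabet))
  as.numeric(counts) / length(states)  # 0/0 gives NaN for an empty interval
}

# Chi-squared distance between two state distributions, weighted by a
# reference distribution `ref` (a hypothetical pooled distribution here).
chi2_dist <- function(p, q, ref) {
  keep <- ref > 0
  sqrt(sum((p[keep] - q[keep])^2 / ref[keep]))
}

ref   <- c(0.5, 0, 0.25, 0.25)       # hypothetical values, for illustration
full  <- state_props("a")            # an interval holding one observed state
other <- state_props("e")            # another observed interval
void  <- state_props(character(0))   # an interval with no observed state

chi2_dist(full, other, ref)  # finite
chi2_dist(full, void, ref)   # NaN, because the void proportions are 0/0
```

With step=1 every interval is a single position, so a sequence that ends one position early has an empty last interval, and the NaN propagates to all of its pairwise distances.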
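To avoid this, you can use one of the following workarounds:
1) Set a step value large enough to ensure that each sequence has at least one non-void element in each period. In your example, the longest sequence has length 25. To ensure that the last period contains non-void elements, you have to set step=5.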
DAT <- c("a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a",
"a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a",
"a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-c-c-c",
"a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e",
"b-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a")
SO <- seqdef(DAT)
DM <- seqdist(SO, method = "CHI2", step=5)
DM
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.000000 0.000000 4.543441 4.543441 1.030776
## [2,] 0.000000 0.000000 4.543441 4.543441 1.030776
## [3,] 4.543441 4.543441 0.000000 2.028370 4.604927
## [4,] 4.543441 4.543441 2.028370 0.000000 4.604927
## [5,] 1.030776 1.030776 4.604927 4.604927 0.000000
2) Drop the columns that contain void elements:
SOdrop <- SO[,1:(ncol(SO)-1)]
SOdrop
DMd <- seqdist(SOdrop, method = "CHI2", step=1)
DMd
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.00000 0.00000 10.041580 10.041580 2.50000
## [2,] 0.00000 0.00000 10.041580 10.041580 2.50000
## [3,] 10.04158 10.04158 0.000000 4.472136 10.34811
## [4,] 10.04158 10.04158 4.472136 0.000000 10.34811
## [5,] 2.50000 2.50000 10.348108 10.348108 0.00000
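3) Pad the shorter sequences with missing values and treat the missing value as an additional possible state. By default, seqdef uses right='DEL', which produces voids. Here we set right=NA to get missing values instead.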
SOm = seqdef(DAT, right=NA)
DMm = seqdist(SOm, method = "CHI2", step=1, with.missing=TRUE)
DMm
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.000000 2.738613 10.408330 10.408330 2.500000
## [2,] 2.738613 0.000000 10.527741 10.527741 3.708099
## [3,] 10.408330 10.527741 0.000000 5.477226 10.704360
## [4,] 10.408330 10.527741 5.477226 0.000000 10.704360
## [5,] 2.500000 3.708099 10.704360 10.704360 0.000000
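For workaround 1, the choice of step=5 can be checked with a little base-R arithmetic. This sketch assumes that step has to be a divisor of the maximum sequence length (check the seqdist help page for your TraMineR version), and the vector of observed lengths is typed in by hand from the example rather than read from the sequence object.

```r
# Observed (non-void) lengths of the five example sequences, entered by hand.
seq_len_obs <- c(25, 24, 25, 25, 25)
max_len <- max(seq_len_obs)

# Candidate steps: divisors of the maximum length (assumed requirement).
candidates <- which(max_len %% seq_len(max_len) == 0)

# A step is safe when the last interval starts at or before the end of the
# shortest sequence, so that every sequence has something observed in it.
last_start <- max_len - candidates + 1
safe <- candidates[last_start <= min(seq_len_obs)]
safe
```

For these lengths the safe values are 5 and 25. With step=25 there is a single interval, so the distance would only compare overall state distributions; step=5 is the smallest value that keeps some timing information.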
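Now, the error reported in the question is not raised by seqdist, but by the seqpropclust function of the WeightedCluster library. The error is clearly caused by the NaN values in the dissimilarity matrix.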
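A quick way to catch this before clustering is to inspect the dissimilarity matrix for undefined entries. The sketch below uses only base R; it rebuilds the matrix printed in the question by hand, and the tryCatch line merely reproduces the same kind of comparison failure. It makes no claim about what seqpropclust does internally beyond the if (SCres > currentSCres) line shown in the error.

```r
# Rebuild the lower-triangular distances shown in the question as a full matrix.
DM <- matrix(0, 5, 5)
DM[lower.tri(DM)] <- c(NaN, 289.92897, 141.07472, 10.22425,  # column 1
                       NaN, NaN, NaN,                        # column 2
                       263.22855, 290.10919,                 # column 3
                       141.44473)                            # column 4
DM <- DM + t(DM)

# Is anything undefined, and which sequence is involved most often?
has_nan <- anyNA(DM)
nan_per_seq <- rowSums(is.nan(DM))
culprit <- which.max(nan_per_seq)

# Comparing with an undefined value inside if() fails in the same way.
msg <- tryCatch(if (NaN > 1) "not reached",
                error = function(e) conditionMessage(e))

has_nan
nan_per_seq
culprit
msg
```

Here the second sequence accounts for all the undefined distances, which points back to the void in its last position.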