R 中的周期性模式识别

Periodic Patterns Identification in R

我想识别时间序列中的时间模式。

structure(list(ID = c("a", "b", "c", "d", "e", "f", "g", "h", 
"i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", 
"v", "w", "x"), `2016/01` = c(1, NA, NA, 1, NA, NA, 1, NA, NA, 
1, NA, 1, 1, 1, NA, 1, NA, NA, 1, NA, NA, 1, NA, NA), `2016/02` = c(NA, 
1, NA, NA, 1, NA, NA, 1, NA, NA, 1, 1, 1, NA, 1, NA, 1, NA, NA, 
1, NA, NA, 1, NA), `2016/03` = c(NA, NA, 1, NA, NA, 1, NA, NA, 
1, 1, NA, 1, 1, 1, NA, NA, NA, 1, NA, NA, 1, NA, NA, 1), `2016/04` = c(NA, 
NA, NA, 1, NA, NA, NA, NA, NA, NA, 1, 1, 1, NA, 1, NA, NA, NA, 
1, NA, NA, NA, NA, NA), `2016/05` = c(NA, NA, NA, NA, 1, NA, 
NA, NA, NA, 1, NA, 1, 1, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA, 
NA), `2016/06` = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1, 
1, 1, NA, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA), `2016/07` = c(NA, 
NA, NA, 1, NA, NA, 1, NA, NA, 1, NA, 1, 1, 1, NA, 1, NA, NA, 
1, NA, NA, NA, NA, NA), `2016/08` = c(NA, NA, NA, NA, 1, NA, 
NA, 1, NA, NA, 1, 1, 1, NA, 1, NA, 1, NA, NA, 1, NA, NA, NA, 
NA), `2016/09` = c(NA, NA, NA, NA, NA, 1, NA, NA, 1, 1, NA, 1, 
1, 1, NA, NA, NA, 1, NA, NA, 1, NA, NA, NA), `2016/10` = c(NA, 
NA, NA, 1, NA, NA, NA, NA, NA, NA, 1, 1, 1, NA, 1, NA, NA, NA, 
1, NA, NA, NA, NA, NA), `2016/11` = c(NA, NA, NA, NA, 1, NA, 
NA, NA, NA, 1, NA, 1, 1, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA, 
NA), `2016/12` = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1, 
1, 1, NA, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA), `2017/01` = c(1, 
NA, NA, 1, NA, NA, 1, NA, NA, 1, NA, 1, 1, 1, NA, 1, NA, NA, 
1, NA, NA, 1, NA, NA), `2017/02` = c(NA, 1, NA, NA, 1, NA, NA, 
1, NA, NA, 1, 1, 1, NA, 1, NA, 1, NA, NA, 1, NA, NA, 1, NA), 
    `2017/03` = c(NA, NA, 1, NA, NA, 1, NA, NA, 1, 1, NA, 1, 
    1, 1, NA, NA, NA, 1, NA, NA, 1, NA, NA, 1), `2017/04` = c(NA, 
    NA, NA, 1, NA, NA, NA, NA, NA, NA, 1, 1, 1, NA, 1, NA, NA, 
    NA, 1, NA, NA, NA, NA, NA), `2017/05` = c(NA, NA, NA, NA, 
    1, NA, NA, NA, NA, 1, NA, 1, 1, 1, NA, NA, NA, NA, NA, 1, 
    NA, NA, NA, NA), `2017/06` = c(NA, NA, NA, NA, NA, 1, NA, 
    NA, NA, NA, 1, 1, 1, NA, 1, NA, NA, NA, NA, NA, 1, NA, NA, 
    NA), `2017/07` = c(NA, NA, NA, 1, NA, NA, 1, NA, NA, 1, NA, 
    1, 1, 1, NA, 1, NA, NA, 1, NA, NA, NA, NA, NA), `2017/08` = c(NA, 
    NA, NA, NA, 1, NA, NA, 1, NA, NA, 1, 1, 1, NA, 1, NA, 1, 
    NA, NA, 1, NA, NA, NA, NA), `2017/09` = c(NA, NA, NA, NA, 
    NA, 1, NA, NA, NA, 1, NA, 1, 1, 1, NA, NA, NA, NA, NA, NA, 
    1, NA, NA, NA), `2017/10` = c(NA, NA, NA, 1, NA, NA, NA, 
    NA, NA, NA, 1, 1, 1, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA, 
    NA), `2017/11` = c(NA, NA, NA, NA, 1, NA, NA, NA, NA, 1, 
    NA, 1, 1, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA), `2017/12` = c(1, 
    NA, NA, NA, NA, 1, NA, NA, NA, NA, 1, 1, 1, NA, 1, NA, NA, 
    NA, NA, NA, 1, 1, NA, NA), `2018/01` = c(NA, 1, NA, 1, NA, 
    NA, 1, NA, NA, 1, NA, 1, 1, 1, NA, 1, NA, NA, 1, NA, NA, 
    NA, 1, NA), `2018/02` = c(NA, NA, 1, NA, 1, NA, NA, 1, NA, 
    NA, 1, 1, 1, NA, 1, NA, 1, NA, NA, 1, NA, NA, NA, 1), `2018/03` = c(NA, 
    NA, NA, NA, NA, 1, NA, NA, 1, 1, NA, 1, 1, 1, NA, NA, NA, 
    1, NA, NA, 1, NA, NA, NA), `2018/04` = c(NA, NA, NA, 1, NA, 
    NA, NA, NA, NA, NA, 1, 1, 1, NA, 1, NA, NA, NA, 1, NA, NA, 
    NA, NA, NA), `2018/05` = c(NA, NA, NA, NA, 1, NA, NA, NA, 
    NA, 1, NA, 1, 1, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA
    ), `2018/06` = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1, 
    1, 1, NA, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA), `2018/07` = c(NA, 
    NA, NA, 1, NA, NA, 1, NA, NA, 1, NA, 1, 1, 1, NA, 1, NA, 
    NA, 1, NA, NA, NA, NA, NA), `2018/08` = c(NA, NA, NA, NA, 
    1, NA, NA, 1, NA, NA, 1, 1, 1, NA, 1, NA, 1, NA, NA, 1, NA, 
    NA, NA, NA), `2018/09` = c(NA, NA, NA, NA, NA, 1, NA, NA, 
    1, 1, NA, 1, 1, 1, NA, NA, NA, 1, NA, NA, 1, NA, NA, NA), 
    `2018/10` = c(NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, 1, 1, 
    1, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA, NA), `2018/11` = c(NA, 
    NA, NA, NA, 1, NA, NA, NA, NA, 1, NA, 1, 1, 1, NA, NA, NA, 
    NA, NA, 1, NA, NA, NA, NA), `2018/12` = c(NA, NA, NA, NA, 
    NA, 1, NA, NA, NA, NA, 1, 1, 1, NA, 1, NA, NA, NA, NA, NA, 
    1, NA, NA, NA)), row.names = c(NA, -24L), class = c("tbl_df", 
"tbl", "data.frame"))

上层数据框中个体:

列表项 a 与 v 具有相同的模式 b 与 w 具有相同的模式 c 与 x

具有相同的模式

在上面的数据框中,个体 a、b、c、v、w 和 x 具有相同的频率 - 每年。

还有一些其他情况,如双月经、每季月经和半月经。

我的 objective 是识别所有这些案例并根据时间模式对所有个体进行分类。

我想包 arulesSequences 可能会有用。

你能帮帮我吗?

我认为完整的层次聚类是一个好的开始:

library(gplots)
library(dendsort)

# data preparation
dm <- matrix( as.numeric(!is.na(dat[,-1])), nrow=nrow(dat[,-1]) )
rownames(dm) <- dat$ID
colnames(dm) <- colnames(dat[,-1])

heatmap.2( dm, trace="none", hclustfun=function(x){
  dendsort(hclust(x, method="single"), type="average")
  }, col=c("grey90","darkblue") )

通过列的所有时间相关连接都清晰可见。 我包括 dendsort 以将相似的集群聚集在一起,使 ID 相关模式更加明显。

此外,仅绘制行簇可以让您更好地可视化时间模式。

heatmap.2( dm, trace="none", Colv=NA, dendrogram="row", 
  hclustfun=function(x){ dendsort(hclust(x, method="single"), 
  type="average") }, col=c("grey90","darkblue") )

添加摘要和 k-means 进行比较:

层次集群

dis <- dist(dm, method="euclidean")
hc <- hclust(dis, method="single")
# choose the height where to cut
# lower means more fine grained cluster, less member per cluster
cutree(hc, h=4)
a b c d e f g h i j k l m n o p q r s t u v w x 
1 2 1 3 2 4 1 2 1 5 6 7 7 5 6 1 2 1 3 2 4 1 2 1
# higher h means larger clusters, i.e. more member per cluster
cutree(hc, h=5)
a b c d e f g h i j k l m n o p q r s t u v w x 
1 2 1 1 2 1 1 2 1 1 2 3 3 1 2 1 2 1 1 2 1 1 2 1

k-均值

# pre-defining k=6, has to be rerun to change k
km <- kmeans(dm, 6, algorithm="Hartigan-Wong")
km$cluster
a b c d e f g h i j k l m n o p q r s t u v w x 
2 5 2 6 5 4 2 5 4 3 1 1 1 3 1 2 5 4 6 5 4 2 5 2