R 中的周期性模式识别
Periodic Patterns Identification in R
我想识别时间序列中的时间模式。
structure(list(ID = c("a", "b", "c", "d", "e", "f", "g", "h",
"i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u",
"v", "w", "x"), `2016/01` = c(1, NA, NA, 1, NA, NA, 1, NA, NA,
1, NA, 1, 1, 1, NA, 1, NA, NA, 1, NA, NA, 1, NA, NA), `2016/02` = c(NA,
1, NA, NA, 1, NA, NA, 1, NA, NA, 1, 1, 1, NA, 1, NA, 1, NA, NA,
1, NA, NA, 1, NA), `2016/03` = c(NA, NA, 1, NA, NA, 1, NA, NA,
1, 1, NA, 1, 1, 1, NA, NA, NA, 1, NA, NA, 1, NA, NA, 1), `2016/04` = c(NA,
NA, NA, 1, NA, NA, NA, NA, NA, NA, 1, 1, 1, NA, 1, NA, NA, NA,
1, NA, NA, NA, NA, NA), `2016/05` = c(NA, NA, NA, NA, 1, NA,
NA, NA, NA, 1, NA, 1, 1, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA,
NA), `2016/06` = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1,
1, 1, NA, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA), `2016/07` = c(NA,
NA, NA, 1, NA, NA, 1, NA, NA, 1, NA, 1, 1, 1, NA, 1, NA, NA,
1, NA, NA, NA, NA, NA), `2016/08` = c(NA, NA, NA, NA, 1, NA,
NA, 1, NA, NA, 1, 1, 1, NA, 1, NA, 1, NA, NA, 1, NA, NA, NA,
NA), `2016/09` = c(NA, NA, NA, NA, NA, 1, NA, NA, 1, 1, NA, 1,
1, 1, NA, NA, NA, 1, NA, NA, 1, NA, NA, NA), `2016/10` = c(NA,
NA, NA, 1, NA, NA, NA, NA, NA, NA, 1, 1, 1, NA, 1, NA, NA, NA,
1, NA, NA, NA, NA, NA), `2016/11` = c(NA, NA, NA, NA, 1, NA,
NA, NA, NA, 1, NA, 1, 1, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA,
NA), `2016/12` = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1,
1, 1, NA, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA), `2017/01` = c(1,
NA, NA, 1, NA, NA, 1, NA, NA, 1, NA, 1, 1, 1, NA, 1, NA, NA,
1, NA, NA, 1, NA, NA), `2017/02` = c(NA, 1, NA, NA, 1, NA, NA,
1, NA, NA, 1, 1, 1, NA, 1, NA, 1, NA, NA, 1, NA, NA, 1, NA),
`2017/03` = c(NA, NA, 1, NA, NA, 1, NA, NA, 1, 1, NA, 1,
1, 1, NA, NA, NA, 1, NA, NA, 1, NA, NA, 1), `2017/04` = c(NA,
NA, NA, 1, NA, NA, NA, NA, NA, NA, 1, 1, 1, NA, 1, NA, NA,
NA, 1, NA, NA, NA, NA, NA), `2017/05` = c(NA, NA, NA, NA,
1, NA, NA, NA, NA, 1, NA, 1, 1, 1, NA, NA, NA, NA, NA, 1,
NA, NA, NA, NA), `2017/06` = c(NA, NA, NA, NA, NA, 1, NA,
NA, NA, NA, 1, 1, 1, NA, 1, NA, NA, NA, NA, NA, 1, NA, NA,
NA), `2017/07` = c(NA, NA, NA, 1, NA, NA, 1, NA, NA, 1, NA,
1, 1, 1, NA, 1, NA, NA, 1, NA, NA, NA, NA, NA), `2017/08` = c(NA,
NA, NA, NA, 1, NA, NA, 1, NA, NA, 1, 1, 1, NA, 1, NA, 1,
NA, NA, 1, NA, NA, NA, NA), `2017/09` = c(NA, NA, NA, NA,
NA, 1, NA, NA, NA, 1, NA, 1, 1, 1, NA, NA, NA, NA, NA, NA,
1, NA, NA, NA), `2017/10` = c(NA, NA, NA, 1, NA, NA, NA,
NA, NA, NA, 1, 1, 1, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA,
NA), `2017/11` = c(NA, NA, NA, NA, 1, NA, NA, NA, NA, 1,
NA, 1, 1, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA), `2017/12` = c(1,
NA, NA, NA, NA, 1, NA, NA, NA, NA, 1, 1, 1, NA, 1, NA, NA,
NA, NA, NA, 1, 1, NA, NA), `2018/01` = c(NA, 1, NA, 1, NA,
NA, 1, NA, NA, 1, NA, 1, 1, 1, NA, 1, NA, NA, 1, NA, NA,
NA, 1, NA), `2018/02` = c(NA, NA, 1, NA, 1, NA, NA, 1, NA,
NA, 1, 1, 1, NA, 1, NA, 1, NA, NA, 1, NA, NA, NA, 1), `2018/03` = c(NA,
NA, NA, NA, NA, 1, NA, NA, 1, 1, NA, 1, 1, 1, NA, NA, NA,
1, NA, NA, 1, NA, NA, NA), `2018/04` = c(NA, NA, NA, 1, NA,
NA, NA, NA, NA, NA, 1, 1, 1, NA, 1, NA, NA, NA, 1, NA, NA,
NA, NA, NA), `2018/05` = c(NA, NA, NA, NA, 1, NA, NA, NA,
NA, 1, NA, 1, 1, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA
), `2018/06` = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1,
1, 1, NA, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA), `2018/07` = c(NA,
NA, NA, 1, NA, NA, 1, NA, NA, 1, NA, 1, 1, 1, NA, 1, NA,
NA, 1, NA, NA, NA, NA, NA), `2018/08` = c(NA, NA, NA, NA,
1, NA, NA, 1, NA, NA, 1, 1, 1, NA, 1, NA, 1, NA, NA, 1, NA,
NA, NA, NA), `2018/09` = c(NA, NA, NA, NA, NA, 1, NA, NA,
1, 1, NA, 1, 1, 1, NA, NA, NA, 1, NA, NA, 1, NA, NA, NA),
`2018/10` = c(NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, 1, 1,
1, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA, NA), `2018/11` = c(NA,
NA, NA, NA, 1, NA, NA, NA, NA, 1, NA, 1, 1, 1, NA, NA, NA,
NA, NA, 1, NA, NA, NA, NA), `2018/12` = c(NA, NA, NA, NA,
NA, 1, NA, NA, NA, NA, 1, 1, 1, NA, 1, NA, NA, NA, NA, NA,
1, NA, NA, NA)), row.names = c(NA, -24L), class = c("tbl_df",
"tbl", "data.frame"))
上层数据框中个体:
列表项
a 与 v 具有相同的模式
b 与 w 具有相同的模式
c 与 x
具有相同的模式
在上面的数据框中,个体 a、b、c、v、w 和 x 具有相同的频率 - 每年。
还有一些其他情况,如双月经、每季月经和半月经。
我的 objective 是识别所有这些案例并根据时间模式对所有个体进行分类。
我想包 arulesSequences 可能会有用。
你能帮帮我吗?
我认为完整的层次聚类是一个好的开始:
library(gplots)
library(dendsort)
# data preparation
dm <- matrix( as.numeric(!is.na(dat[,-1])), nrow=nrow(dat[,-1]) )
rownames(dm) <- dat$ID
colnames(dm) <- colnames(dat[,-1])
heatmap.2( dm, trace="none", hclustfun=function(x){
dendsort(hclust(x, method="single"), type="average")
}, col=c("grey90","darkblue") )
通过列的所有时间相关连接都清晰可见。
我包括 dendsort 以将相似的集群聚集在一起,使 ID 相关模式更加明显。
此外,仅绘制行簇可以让您更好地可视化时间模式。
heatmap.2( dm, trace="none", Colv=NA, dendrogram="row",
hclustfun=function(x){ dendsort(hclust(x, method="single"),
type="average") }, col=c("grey90","darkblue") )
添加摘要和 k-means 进行比较:
层次集群
dis <- dist(dm, method="euclidean")
hc <- hclust(dis, method="single")
# choose the height where to cut
# lower means more fine grained cluster, less member per cluster
cutree(hc, h=4)
a b c d e f g h i j k l m n o p q r s t u v w x
1 2 1 3 2 4 1 2 1 5 6 7 7 5 6 1 2 1 3 2 4 1 2 1
# higher h means larger clusters, i.e. more member per cluster
cutree(hc, h=5)
a b c d e f g h i j k l m n o p q r s t u v w x
1 2 1 1 2 1 1 2 1 1 2 3 3 1 2 1 2 1 1 2 1 1 2 1
k-均值
# pre-defining k=6, has to be rerun to change k
km <- kmeans(dm, 6, algorithm="Hartigan-Wong")
km$cluster
a b c d e f g h i j k l m n o p q r s t u v w x
2 5 2 6 5 4 2 5 4 3 1 1 1 3 1 2 5 4 6 5 4 2 5 2
我想识别时间序列中的时间模式。
structure(list(ID = c("a", "b", "c", "d", "e", "f", "g", "h",
"i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u",
"v", "w", "x"), `2016/01` = c(1, NA, NA, 1, NA, NA, 1, NA, NA,
1, NA, 1, 1, 1, NA, 1, NA, NA, 1, NA, NA, 1, NA, NA), `2016/02` = c(NA,
1, NA, NA, 1, NA, NA, 1, NA, NA, 1, 1, 1, NA, 1, NA, 1, NA, NA,
1, NA, NA, 1, NA), `2016/03` = c(NA, NA, 1, NA, NA, 1, NA, NA,
1, 1, NA, 1, 1, 1, NA, NA, NA, 1, NA, NA, 1, NA, NA, 1), `2016/04` = c(NA,
NA, NA, 1, NA, NA, NA, NA, NA, NA, 1, 1, 1, NA, 1, NA, NA, NA,
1, NA, NA, NA, NA, NA), `2016/05` = c(NA, NA, NA, NA, 1, NA,
NA, NA, NA, 1, NA, 1, 1, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA,
NA), `2016/06` = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1,
1, 1, NA, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA), `2016/07` = c(NA,
NA, NA, 1, NA, NA, 1, NA, NA, 1, NA, 1, 1, 1, NA, 1, NA, NA,
1, NA, NA, NA, NA, NA), `2016/08` = c(NA, NA, NA, NA, 1, NA,
NA, 1, NA, NA, 1, 1, 1, NA, 1, NA, 1, NA, NA, 1, NA, NA, NA,
NA), `2016/09` = c(NA, NA, NA, NA, NA, 1, NA, NA, 1, 1, NA, 1,
1, 1, NA, NA, NA, 1, NA, NA, 1, NA, NA, NA), `2016/10` = c(NA,
NA, NA, 1, NA, NA, NA, NA, NA, NA, 1, 1, 1, NA, 1, NA, NA, NA,
1, NA, NA, NA, NA, NA), `2016/11` = c(NA, NA, NA, NA, 1, NA,
NA, NA, NA, 1, NA, 1, 1, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA,
NA), `2016/12` = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1,
1, 1, NA, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA), `2017/01` = c(1,
NA, NA, 1, NA, NA, 1, NA, NA, 1, NA, 1, 1, 1, NA, 1, NA, NA,
1, NA, NA, 1, NA, NA), `2017/02` = c(NA, 1, NA, NA, 1, NA, NA,
1, NA, NA, 1, 1, 1, NA, 1, NA, 1, NA, NA, 1, NA, NA, 1, NA),
`2017/03` = c(NA, NA, 1, NA, NA, 1, NA, NA, 1, 1, NA, 1,
1, 1, NA, NA, NA, 1, NA, NA, 1, NA, NA, 1), `2017/04` = c(NA,
NA, NA, 1, NA, NA, NA, NA, NA, NA, 1, 1, 1, NA, 1, NA, NA,
NA, 1, NA, NA, NA, NA, NA), `2017/05` = c(NA, NA, NA, NA,
1, NA, NA, NA, NA, 1, NA, 1, 1, 1, NA, NA, NA, NA, NA, 1,
NA, NA, NA, NA), `2017/06` = c(NA, NA, NA, NA, NA, 1, NA,
NA, NA, NA, 1, 1, 1, NA, 1, NA, NA, NA, NA, NA, 1, NA, NA,
NA), `2017/07` = c(NA, NA, NA, 1, NA, NA, 1, NA, NA, 1, NA,
1, 1, 1, NA, 1, NA, NA, 1, NA, NA, NA, NA, NA), `2017/08` = c(NA,
NA, NA, NA, 1, NA, NA, 1, NA, NA, 1, 1, 1, NA, 1, NA, 1,
NA, NA, 1, NA, NA, NA, NA), `2017/09` = c(NA, NA, NA, NA,
NA, 1, NA, NA, NA, 1, NA, 1, 1, 1, NA, NA, NA, NA, NA, NA,
1, NA, NA, NA), `2017/10` = c(NA, NA, NA, 1, NA, NA, NA,
NA, NA, NA, 1, 1, 1, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA,
NA), `2017/11` = c(NA, NA, NA, NA, 1, NA, NA, NA, NA, 1,
NA, 1, 1, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA), `2017/12` = c(1,
NA, NA, NA, NA, 1, NA, NA, NA, NA, 1, 1, 1, NA, 1, NA, NA,
NA, NA, NA, 1, 1, NA, NA), `2018/01` = c(NA, 1, NA, 1, NA,
NA, 1, NA, NA, 1, NA, 1, 1, 1, NA, 1, NA, NA, 1, NA, NA,
NA, 1, NA), `2018/02` = c(NA, NA, 1, NA, 1, NA, NA, 1, NA,
NA, 1, 1, 1, NA, 1, NA, 1, NA, NA, 1, NA, NA, NA, 1), `2018/03` = c(NA,
NA, NA, NA, NA, 1, NA, NA, 1, 1, NA, 1, 1, 1, NA, NA, NA,
1, NA, NA, 1, NA, NA, NA), `2018/04` = c(NA, NA, NA, 1, NA,
NA, NA, NA, NA, NA, 1, 1, 1, NA, 1, NA, NA, NA, 1, NA, NA,
NA, NA, NA), `2018/05` = c(NA, NA, NA, NA, 1, NA, NA, NA,
NA, 1, NA, 1, 1, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA
), `2018/06` = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1,
1, 1, NA, 1, NA, NA, NA, NA, NA, 1, NA, NA, NA), `2018/07` = c(NA,
NA, NA, 1, NA, NA, 1, NA, NA, 1, NA, 1, 1, 1, NA, 1, NA,
NA, 1, NA, NA, NA, NA, NA), `2018/08` = c(NA, NA, NA, NA,
1, NA, NA, 1, NA, NA, 1, 1, 1, NA, 1, NA, 1, NA, NA, 1, NA,
NA, NA, NA), `2018/09` = c(NA, NA, NA, NA, NA, 1, NA, NA,
1, 1, NA, 1, 1, 1, NA, NA, NA, 1, NA, NA, 1, NA, NA, NA),
`2018/10` = c(NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, 1, 1,
1, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA, NA), `2018/11` = c(NA,
NA, NA, NA, 1, NA, NA, NA, NA, 1, NA, 1, 1, 1, NA, NA, NA,
NA, NA, 1, NA, NA, NA, NA), `2018/12` = c(NA, NA, NA, NA,
NA, 1, NA, NA, NA, NA, 1, 1, 1, NA, 1, NA, NA, NA, NA, NA,
1, NA, NA, NA)), row.names = c(NA, -24L), class = c("tbl_df",
"tbl", "data.frame"))
上层数据框中个体:
列表项 a 与 v 具有相同的模式 b 与 w 具有相同的模式 c 与 x
具有相同的模式在上面的数据框中,个体 a、b、c、v、w 和 x 具有相同的频率 - 每年。
还有一些其他情况,如双月经、每季月经和半月经。
我的 objective 是识别所有这些案例并根据时间模式对所有个体进行分类。
我想包 arulesSequences 可能会有用。
你能帮帮我吗?
我认为完整的层次聚类是一个好的开始:
library(gplots)
library(dendsort)
# data preparation
dm <- matrix( as.numeric(!is.na(dat[,-1])), nrow=nrow(dat[,-1]) )
rownames(dm) <- dat$ID
colnames(dm) <- colnames(dat[,-1])
heatmap.2( dm, trace="none", hclustfun=function(x){
dendsort(hclust(x, method="single"), type="average")
}, col=c("grey90","darkblue") )
通过列的所有时间相关连接都清晰可见。 我包括 dendsort 以将相似的集群聚集在一起,使 ID 相关模式更加明显。
此外,仅绘制行簇可以让您更好地可视化时间模式。
heatmap.2( dm, trace="none", Colv=NA, dendrogram="row",
hclustfun=function(x){ dendsort(hclust(x, method="single"),
type="average") }, col=c("grey90","darkblue") )
添加摘要和 k-means 进行比较:
层次集群
dis <- dist(dm, method="euclidean")
hc <- hclust(dis, method="single")
# choose the height where to cut
# lower means more fine grained cluster, less member per cluster
cutree(hc, h=4)
a b c d e f g h i j k l m n o p q r s t u v w x
1 2 1 3 2 4 1 2 1 5 6 7 7 5 6 1 2 1 3 2 4 1 2 1
# higher h means larger clusters, i.e. more member per cluster
cutree(hc, h=5)
a b c d e f g h i j k l m n o p q r s t u v w x
1 2 1 1 2 1 1 2 1 1 2 3 3 1 2 1 2 1 1 2 1 1 2 1
k-均值
# pre-defining k=6, has to be rerun to change k
km <- kmeans(dm, 6, algorithm="Hartigan-Wong")
km$cluster
a b c d e f g h i j k l m n o p q r s t u v w x
2 5 2 6 5 4 2 5 4 3 1 1 1 3 1 2 5 4 6 5 4 2 5 2