Clustering data based on case dates

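I have a dataset of 20,000 cases, each with an onset date ('onsetdate'). Every case lives or works in a group home, and I want to cluster the cases within each home based on their onset dates.

So I want to identify the first case that occurs in a home. If another case occurs within 14 days of that first case, I want to add it to the same cluster. If a further case occurs within 14 days of any case already in the cluster, I want to add it to that cluster as well. Once the next case falls more than 14 days after the previous one, I stop adding cases to the cluster; at that point a new cluster starts, and the process repeats until every case has been assigned. The cluster 'start date' is the onset date of the first case added to the cluster, and the end date is 14 days after the onset date of the last case added to the cluster.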

Here is some dummy data:

dummy <- data.frame(case = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19), 
                onsetdate  = as.Date(c("2012-08-30", "2012-09-03", "2012-09-09", "2012-09-17", "2012-11-01", "2012-11-05", "2012-11-30", "2012-08-30", "2012-09-03", "2012-10-09", "2012-10-17", "2012-10-30", "2020-12-26", "2020-12-23", "2020-12-30", "2020-12-25", "2021-04-22", "2021-05-03", "2021-05-10")),
                position = c("Resident", "Staff", "Resident", "Staff", "Staff", "Resident", "Resident", "Staff", "Resident", "Staff", "Staff", "Resident", "Resident", "Resident", "Staff", "Resident", "Staff", "Staff", "Resident") , 
                grouphome = c("Group Home 1", "Group Home 1","Group Home 1","Group Home 1","Group Home 1","Group Home 1","Group Home 1","Group Home 1","Group Home 2","Group Home 2","Group Home 2","Group Home 2", "Group Home 3", "Group Home 3","Group Home 3","Group Home 3","Group Home 3","Group Home 3","Group Home 3")
                )

The output would look like this:

result <- data.frame(grouphome  = c("Group Home 1", "Group Home 1","Group Home 1","Group Home 2","Group Home 2", "Group Home 3", "Group Home 3"), 
                 clusterNumber = c("1", "2", "3", "1", "2", "1", "2"), 
                 clusterStart = as.Date(c("2012-08-30", "2012-11-01", "2012-11-30", "2012-09-03", "2012-10-09", "2020-12-23", "2021-04-22")),
                 cases = c("5", "2", "1", "1", "3", "4", "3"))

Thank you very much.

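It looks like you first want to group_by grouphome.

You can then also group_by clusterNumber, which is determined by looking for differences between consecutive onsetdate values that are greater than 14 days. Applying cumsum (the cumulative sum) to those flags gives a running counter of clusters.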
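To make the diff/cumsum counter concrete, here is a minimal base R sketch on a single home. It uses the Group Home 2 onset dates from the dummy data, already in date order; the variable names `onset`, `gaps`, `new_cluster` and `cluster_number` are only illustrative.

```r
# Onset dates for one home, already in chronological order
# (the Group Home 2 dates from the dummy data)
onset <- as.Date(c("2012-09-03", "2012-10-09", "2012-10-17", "2012-10-30"))

# Gap in days between each case and the previous one
gaps <- as.numeric(diff(onset))        # 36, 8, 13

# TRUE wherever a gap exceeds 14 days, i.e. where a new cluster starts;
# the first case has no predecessor, so it gets FALSE
new_cluster <- c(FALSE, gaps > 14)     # FALSE, TRUE, FALSE, FALSE

# Running count of cluster starts, shifted so numbering begins at 1
cluster_number <- 1 + cumsum(new_cluster)
cluster_number                         # 1 2 2 2
```

The first case sits alone in cluster 1 because the next case is 36 days later; the remaining three cases are each within 14 days of the one before, so they share cluster 2.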

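The final summarise takes the first date in each cluster within a group home as clusterStart, and cases is the number of rows in that cluster.

This assumes the dates are already in chronological order within each home. If that is not the case (in the dummy data, case 8 is out of order), you need to arrange first.

Edit: To also add two columns with the totals of "Resident" and "Staff" for each clusterNumber, you can sum the rows where position equals each of those two values.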

library(dplyr)

dummy %>%
  # the dummy data is not in date order, so sort within each home first
  arrange(grouphome, onsetdate) %>%
  group_by(grouphome) %>%
  # a gap of more than 14 days since the previous case starts a new cluster
  group_by(clusterNumber = 1 + cumsum(c(0, diff(onsetdate) > 14)), .add = TRUE) %>%
  summarise(clusterStart = first(onsetdate),
            cases = n(),
            resident = sum(position == "Resident"),
            staff = sum(position == "Staff"))

Output

  grouphome    clusterNumber clusterStart cases resident staff
  <chr>                <dbl> <date>       <int>    <int> <int>
1 Group Home 1             1 2012-08-30       5        2     3
2 Group Home 1             2 2012-11-01       2        1     1
3 Group Home 1             3 2012-11-30       1        1     0
4 Group Home 2             1 2012-09-03       1        1     0
5 Group Home 2             2 2012-10-09       3        1     2
6 Group Home 3             1 2020-12-23       4        3     1
7 Group Home 3             2 2021-04-22       3        1     2
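The question also asks for a cluster end date, defined as 14 days after the last case added to the cluster. One possible way to get it, sketched below, is to add `last(onsetdate) + 14` to the summarise step. The `toy` data frame is a stand-in that holds only the Group Home 3 dates from the dummy data, and the column name `clusterEnd` is an assumption, since the question does not name that column.

```r
library(dplyr)

# Stand-in for the dummy data: Group Home 3 onset dates only, left unsorted
toy <- data.frame(
  grouphome = "Group Home 3",
  onsetdate = as.Date(c("2020-12-26", "2020-12-23", "2020-12-30", "2020-12-25",
                        "2021-04-22", "2021-05-03", "2021-05-10"))
)

clusters <- toy %>%
  # sort within each home so diff() compares consecutive cases
  arrange(grouphome, onsetdate) %>%
  group_by(grouphome) %>%
  # a gap of more than 14 days since the previous case starts a new cluster
  group_by(clusterNumber = 1 + cumsum(c(0, diff(onsetdate) > 14)), .add = TRUE) %>%
  summarise(clusterStart = first(onsetdate),
            clusterEnd   = last(onsetdate) + 14,  # assumed column name
            cases        = n(),
            .groups = "drop")

clusters
```

For these dates the first cluster runs from 2020-12-23 to 2021-01-13 with 4 cases, and the second from 2021-04-22 to 2021-05-24 with 3 cases.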