Clustering data based on case dates
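I have a dataset of 20,000 cases, each with an onset date ('onsetdate'). Each case lives in a group home, and I want to cluster the cases within each home by onset date.
So I want to identify the first case to occur in a home. If another case occurs within 14 days of that first case, I want to add it to the same cluster. If another case occurs within 14 days of any case already in the cluster, I want to add it to the same cluster as well. Once the next case is more than 14 days after the previous one, I stop adding cases to the cluster; a new cluster then starts, and the process repeats until every case has been assigned. The cluster 'start date' is the onset date of the first case added to the cluster, and the end date is 14 days after the last case added to the cluster.
Here is some dummy data: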
dummy <- data.frame(case = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19),
onsetdate = as.Date(c("2012-08-30", "2012-09-03", "2012-09-09", "2012-09-17", "2012-11-01", "2012-11-05", "2012-11-30", "2012-08-30", "2012-09-03", "2012-10-09", "2012-10-17", "2012-10-30", "2020-12-26", "2020-12-23", "2020-12-30", "2020-12-25", "2021-04-22", "2021-05-03", "2021-05-10")),
position = c("Resident", "Staff", "Resident", "Staff", "Staff", "Resident", "Resident", "Staff", "Resident", "Staff", "Staff", "Resident", "Resident", "Resident", "Staff", "Resident", "Staff", "Staff", "Resident") ,
grouphome = c("Group Home 1", "Group Home 1","Group Home 1","Group Home 1","Group Home 1","Group Home 1","Group Home 1","Group Home 1","Group Home 2","Group Home 2","Group Home 2","Group Home 2", "Group Home 3", "Group Home 3","Group Home 3","Group Home 3","Group Home 3","Group Home 3","Group Home 3")
)
The output would look like this:
result <- data.frame(grouphome = c("Group Home 1", "Group Home 1","Group Home 1","Group Home 2","Group Home 2", "Group Home 3", "Group Home 3"),
clusterNumber = c("1", "2", "3", "1", "2", "1", "2"),
clusterStart = as.Date(c("2012-08-30", "2012-11-01", "2012-11-30", "2012-09-03", "2012-10-09", "2020-12-23", "2021-04-22")),
cases = c("5", "2", "1", "1", "3", "4", "3"))
Thank you very much.
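It looks like you first want to group_by grouphome.
You can then also group_by clusterNumber, which can be determined by looking for differences in onsetdate greater than 14 days. Using cumsum, the cumulative sum, gives you a counter for this.
The final summarise takes the first date in each group home cluster as clusterStart, and cases is the number of rows in that cluster.
This assumes the dates are already sorted chronologically within each home. If they are not, you need to arrange first.
Edit: To also add two columns with the total number of residents and staff for each clusterNumber, you can sum the rows where position equals each of those two values.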
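To see why the cumsum counter works, here is a minimal sketch in base R on a handful of onset dates from Group Home 1, assumed to be sorted already. The variable names (onset, gap_days, new_cluster, cluster_id) are illustrative only and do not appear in the question.

```r
# Sorted onset dates for a single home (taken from the Group Home 1 dummy rows)
onset <- as.Date(c("2012-08-30", "2012-09-03", "2012-09-09", "2012-09-17",
                   "2012-11-01", "2012-11-05", "2012-11-30"))

# Days between each case and the one before it
gap_days <- as.numeric(diff(onset))

# 1 where a gap of more than 14 days opens a new cluster; the first case gets 0
new_cluster <- c(0, gap_days > 14)

# Running total of cluster breaks, shifted so numbering starts at 1
cluster_id <- 1 + cumsum(new_cluster)
print(cluster_id)
```

Each gap longer than 14 days bumps the counter by one, so consecutive cases share an id until the chain breaks.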
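library(dplyr)
dummy %>%
  # the dummy data is not in date order within each home, so sort first
  arrange(grouphome, onsetdate) %>%
  group_by(grouphome) %>%
  # start a new cluster whenever the gap to the previous case exceeds 14 days
  group_by(clusterNumber = 1 + cumsum(c(0, diff(onsetdate) > 14)), .add = TRUE) %>%
  summarise(clusterStart = first(onsetdate),
            cases = n(),
            resident = sum(position == "Resident"),
            staff = sum(position == "Staff"),
            .groups = "drop")
Output
  grouphome    clusterNumber clusterStart cases resident staff
  <chr>                <dbl> <date>       <int>    <int> <int>
1 Group Home 1             1 2012-08-30       5        2     3
2 Group Home 1             2 2012-11-01       2        1     1
3 Group Home 1             3 2012-11-30       1        1     0
4 Group Home 2             1 2012-09-03       1        1     0
5 Group Home 2             2 2012-10-09       3        1     2
6 Group Home 3             1 2020-12-23       4        3     1
7 Group Home 3             2 2021-04-22       3        1     2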
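The question also asks for a cluster end date, defined as 14 days after the last case added to the cluster, which the summary above does not compute. One way this might be added is sketched below on a small subset (the Group Home 2 rows of the dummy data). The column name clusterEnd and the data frame names cases_small and clusters are assumptions for illustration, not part of the original answer.

```r
library(dplyr)

# Illustrative subset: the four Group Home 2 cases from the dummy data
cases_small <- data.frame(
  grouphome = "Group Home 2",
  onsetdate = as.Date(c("2012-09-03", "2012-10-09", "2012-10-17", "2012-10-30"))
)

clusters <- cases_small %>%
  arrange(grouphome, onsetdate) %>%          # sort so diff() sees consecutive cases
  group_by(grouphome) %>%
  group_by(clusterNumber = 1 + cumsum(c(0, diff(onsetdate) > 14)), .add = TRUE) %>%
  summarise(clusterStart = min(onsetdate),
            clusterEnd = max(onsetdate) + 14, # assumed: 14 days after the last case
            cases = n(),
            .groups = "drop")

print(clusters)
```

Adding an integer to a Date shifts it by that many days, so no extra date library is needed for this step.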