Merge data rows if they have sequential time intervals within a specified time period
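I have a dataset of patient medications with a Start.Date and a Stop.Date. Each medication interval is represented on one row. I would like to merge rows that represent the same medication, but only if the Start.Date of the later interval is within 30 days (or whatever gap I choose to specify) of the Stop.Date of the earlier interval. Suppose you have the data frame below: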
ID = c(2, 2, 2, 2, 3, 5)
Medication = c("aspirin", "aspirin", "aspirin", "tylenol", "lipitor", "advil")
Start.Date = c("05/01/2017", "05/30/2017", "07/15/2017", "05/01/2017", "05/06/2017", "05/28/2017")
Stop.Date = c("05/04/2017", "06/10/2017", "07/27/2017", "05/15/2017", "05/12/2017", "06/13/2017")
df = data.frame(ID, Medication, Start.Date, Stop.Date)
ID Medication Start.Date Stop.Date
2 aspirin 05/01/2017 05/04/2017
2 aspirin 05/30/2017 06/10/2017
2 aspirin 07/15/2017 07/27/2017
2 tylenol 05/01/2017 05/15/2017
3 lipitor 05/06/2017 05/12/2017
5 advil 05/28/2017 06/13/2017
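I would like to reduce the number of rows by ID and Medication whenever the next Start.Date falls within the specified 30 days after a Stop.Date. The new Start.Date and Stop.Date would span both medication intervals plus the gap of 30 days or less between them. It should look like this: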
ID Medication Start.Date Stop.Date
2 aspirin 05/01/2017 06/10/2017
2 aspirin 07/15/2017 07/27/2017
2 tylenol 05/01/2017 05/15/2017
3 lipitor 05/06/2017 05/12/2017
5 advil 05/28/2017 06/13/2017
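Could you please try the following and let me know whether it helps. It computes the length of each interval in days and keeps the rows where that length is under 30: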
df$date_diff <- as.Date(as.character(df$Stop.Date), format="%m/%d/%Y")-as.Date(as.character(df$Start.Date), format="%m/%d/%Y")
ind <- apply( df[5] , 1 , function(x) any( x < 30 ) )
df[ ind , ]
First convert your date strings to class Date so that you can calculate intervals:
df$Start.Date <- as.Date(df$Start.Date, '%m/%d/%Y')
df$Stop.Date <- as.Date(df$Stop.Date, '%m/%d/%Y')
> df$Stop.Date - df$Start.Date
Time differences in days
[1] 3 11 12 14 6 16
To calculate the difference between each stop date and the next start date:
c(Inf, df[-1,'Start.Date'] - df[-nrow(df),'Stop.Date'])
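The leading value in c(Inf, ...) keeps the result the same length as the data, because the first row has no preceding interval. Before computing it, order your data to make sure the dates are in sequence: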
df <- df[order(df$ID, df$Medication, df$Start.Date), ]
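As a quick check of the gap calculation on the question's sample data (a sketch; the variable name gap is only for illustration), note that without grouping the last value compares lipitor for ID 3 with advil for ID 5, which is why the per-patient, per-medication step that follows is needed:

```r
# Sample data from the question, with dates already converted to class Date
df <- data.frame(
  ID = c(2, 2, 2, 2, 3, 5),
  Medication = c("aspirin", "aspirin", "aspirin", "tylenol", "lipitor", "advil"),
  Start.Date = as.Date(c("05/01/2017", "05/30/2017", "07/15/2017",
                         "05/01/2017", "05/06/2017", "05/28/2017"), "%m/%d/%Y"),
  Stop.Date = as.Date(c("05/04/2017", "06/10/2017", "07/27/2017",
                        "05/15/2017", "05/12/2017", "06/13/2017"), "%m/%d/%Y")
)
df <- df[order(df$ID, df$Medication, df$Start.Date), ]

# Days between each row's start and the previous row's stop, ignoring groups
gap <- c(Inf, as.numeric(df$Start.Date[-1] - df$Stop.Date[-nrow(df)]))
gap
# The last gap (lipitor to advil) is 16 days, yet those rows belong to
# different patients and must not be merged
```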
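Now calculate the interval for each patient and medication. This is easily done with dplyr or data.table:
# Using dplyr (the %<>% assignment pipe comes from magrittr):
library(dplyr)
library(magrittr)
df %<>% group_by(ID, Medication) %>% mutate(interval = c(Inf, Start.Date[-1] - Stop.Date[-n()]))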
# Using data.table:
library(data.table)
df <- as.data.table(df)
df[, interval := c(Inf, Start.Date[-1] - Stop.Date[-.N]), by = .(ID, Medication)]
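For each row whose interval is 30 days or less, copy its Stop.Date back to the preceding row:
# Walk from the last row to the first so that chains of three or more merged rows carry the final Stop.Date all the way back
for(i in nrow(df):1) if(df$interval[i]<=30) df$Stop.Date[i-1] <- df$Stop.Date[i]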
Finally, drop the rows with interval <= 30 and then the interval column:
# If you're using dplyr:
df %<>% filter(!interval<=30) %>% select(-interval)
# If you're using data.table:
df <- df[!interval<=30, ]; df[, interval := NULL]
> df
ID Medication Start.Date Stop.Date
1: 2 aspirin 2017-05-01 2017-06-10
2: 2 aspirin 2017-07-15 2017-07-27
3: 2 tylenol 2017-05-01 2017-05-15
4: 3 lipitor 2017-05-06 2017-05-12
5: 5 advil 2017-05-28 2017-06-13
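The sample data never has three or more consecutive rows that should all merge into one. A minimal sketch of that case, using hypothetical dates for a single patient and medication (the data frame x is not from the question) and visiting the rows from last to first so that the final Stop.Date is carried all the way back to the first row of the chain:

```r
# Hypothetical data: three periods, each starting within 30 days of the previous stop
x <- data.frame(
  Start.Date = as.Date(c("2017-05-01", "2017-05-30", "2017-06-20")),
  Stop.Date  = as.Date(c("2017-05-04", "2017-06-10", "2017-06-25"))
)

# Gap in days to the previous row; the first row has no predecessor
x$interval <- c(Inf, as.numeric(x$Start.Date[-1] - x$Stop.Date[-nrow(x)]))

# Propagate stop dates backwards, starting from the last row
for (i in rev(seq_len(nrow(x)))) {
  if (x$interval[i] <= 30) x$Stop.Date[i - 1] <- x$Stop.Date[i]
}

# Keep only the rows that open a new period
x <- x[x$interval > 30, c("Start.Date", "Stop.Date")]
x
```

All three rows collapse into a single period that runs from the first start date to the last stop date.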
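The OP has asked to collapse medication periods whenever the gap between the end of one period and the start of the next is 30 days or less.
The solutions below require that medication periods for the same person and the same medication never overlap, which is a sensible assumption (and is checked in the code).
1. dplyr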
library(dplyr)
library(magrittr)
min_gap <- 30
df %>%
# convert date strings to class Date
mutate_at(c("Start.Date", "Stop.Date"), lubridate::mdy) %>%
arrange(ID, Medication, Start.Date) %>%
group_by(ID, Medication) %T>%
# medication periods must not overlap for ID and Medication
{summarize(., tmp = all(Start.Date >= lag(Stop.Date, default = Start.Date[1] - 1))) %$%
stopifnot(all(tmp))} %>%
# count non-subsequent medication periods, i.e., with gaps of more than min_gap days
mutate(Medic.Period = cumsum(Start.Date > lag(Stop.Date, default = Start.Date[1]) + min_gap)) %>%
# determine start and stop dates for each collapsed period
group_by(ID, Medication, Medic.Period) %>%
summarise(Start.Date = first(Start.Date), Stop.Date = last(Stop.Date))
# A tibble: 5 x 5
# Groups: ID, Medication [?]
ID Medication Medic.Period Start.Date Stop.Date
<dbl> <fct> <int> <date> <date>
1 2 aspirin 0 2017-05-01 2017-06-10
2 2 aspirin 1 2017-07-15 2017-07-27
3 2 tylenol 0 2017-05-01 2017-05-15
4 3 lipitor 0 2017-05-06 2017-05-12
5 5 advil 0 2017-05-28 2017-06-13
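The cumsum() function is used to increment the medication period counter whenever a new period is encountered, i.e., when the gap between the stop of the previous period and the start of the current period exceeds 30 days.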
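To make the counter concrete, here is a small sketch of the same idea on the three aspirin rows for ID 2, using only base R. The names prev_stop, new_period and period_id are illustrative and do not appear in the answer's code:

```r
# The three aspirin periods for ID 2, already sorted by start date
start <- as.Date(c("2017-05-01", "2017-05-30", "2017-07-15"))
stop  <- as.Date(c("2017-05-04", "2017-06-10", "2017-07-27"))
min_gap <- 30

# Stop date of the previous row; the first row has no predecessor,
# so its own start date is used, which can never trigger a new period
prev_stop <- c(start[1], head(stop, -1))

# TRUE wherever the gap to the previous period exceeds min_gap days
new_period <- start > prev_stop + min_gap

# The running count of TRUE values labels each collapsed period
period_id <- cumsum(new_period)
period_id
```

Rows that share a period_id belong to the same collapsed period, so grouping by it and taking the first start and last stop gives the merged interval.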
2. data.table
library(data.table)
min_gap <- 30
# coerce date strings to class Date
cols <- stringr::str_subset(names(df), "Date$")
setDT(df)[, (cols) := lapply(.SD, lubridate::mdy), .SDcols = cols][
# create medication counters for each ID and Medication
order(Start.Date),
Medic.Period := {
tmp <- shift(Stop.Date, fill = Start.Date[1] - 1)
stopifnot(all(Start.Date > tmp))
cumsum(Start.Date > tmp + min_gap)
},
by = .(ID, Medication)][
# aggregate to find the overall start and stop dates for each medication period
, .(Start.Date = min(Start.Date), Stop.Date = max(Stop.Date)),
by = .(ID, Medication, Medic.Period)]
ID Medication Medic.Period Start.Date Stop.Date
1: 2 aspirin 0 2017-05-01 2017-06-10
2: 2 aspirin 1 2017-07-15 2017-07-27
3: 2 tylenol 0 2017-05-01 2017-05-15
4: 3 lipitor 0 2017-05-06 2017-05-12
5: 5 advil 0 2017-05-28 2017-06-13
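For readers who prefer to avoid extra packages, the same run-counting idea can be sketched in base R. This is only a sketch under the same assumption as above, namely that periods for one ID and Medication never overlap, and the helper names (same_group, gap, new_run, run, res) are made up for illustration:

```r
min_gap <- 30

# Sample data from the question
df <- data.frame(
  ID = c(2, 2, 2, 2, 3, 5),
  Medication = c("aspirin", "aspirin", "aspirin", "tylenol", "lipitor", "advil"),
  Start.Date = as.Date(c("05/01/2017", "05/30/2017", "07/15/2017",
                         "05/01/2017", "05/06/2017", "05/28/2017"), "%m/%d/%Y"),
  Stop.Date = as.Date(c("05/04/2017", "06/10/2017", "07/27/2017",
                        "05/15/2017", "05/12/2017", "06/13/2017"), "%m/%d/%Y")
)
df <- df[order(df$ID, df$Medication, df$Start.Date), ]
n <- nrow(df)

# Does a row continue the same ID and Medication as the row before it?
same_group <- c(FALSE, df$ID[-1] == df$ID[-n] & df$Medication[-1] == df$Medication[-n])

# Days between a row's start and the previous row's stop
gap <- c(Inf, as.numeric(df$Start.Date[-1] - df$Stop.Date[-n]))

# A new run starts on a group change or when the gap exceeds min_gap
new_run <- !same_group | gap > min_gap
run <- cumsum(new_run)

# First row of each run supplies the start, last row supplies the stop
first <- !duplicated(run)
last  <- !duplicated(run, fromLast = TRUE)
res <- data.frame(
  ID = df$ID[first],
  Medication = df$Medication[first],
  Start.Date = df$Start.Date[first],
  Stop.Date = df$Stop.Date[last]
)
res
```

Taking the stop date from the last row of each run relies on the no-overlap assumption; with overlapping periods a per-run maximum would be needed instead.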