如何在 R 中构建高效的查找循环

How to build efficient loops for lookup in R

我有一个数据集,其中包含一个人离开网络的日期。一个人可以多次离开网络,因为他们可以在离开后再次加入网络。以下代码复制了该场景。

library(data.table)
Leaving_Date<- data.table(Id= c(1,2,3,4,3,5),Date = as.Date(
c("2017-01-01","2017-02-03","2017-01-01","2017-03-10","2017-02-09","2017-02-05")))

(ID 在此 table 中重复多次,因为一个人可以多次离开网络,前提是他们再次加入网络)

 > Leaving_Date
   Id       Date
1:  1 2017-01-01
2:  2 2017-02-03
3:  3 2017-01-01
4:  4 2017-03-10
5:  3 2017-02-09
6:  5 2017-02-05

我有另一个数据集,它给出了某个人被跟踪的日期,可以是在他们离开网络之前或之后。以下代码复制了该场景。

FOLLOWUPs <- data.table(Id = c(1,2,3,2,2,3,3,4,1,5),
                        Date =as.Date(c("2016-10-01","2017-02-04",
                        "2017-01-17","2017-02-23", "2017-03-03",
                        "2017-02-10","2017-02-11","2017-01-01",
                        "2017-01-15","2017-01-01")))


> FOLLOWUPs
    Id       Date
 1:  1 2016-10-01
 2:  2 2017-02-04
 3:  3 2017-01-17
 4:  2 2017-02-23
 5:  2 2017-03-03
 6:  3 2017-02-10
 7:  3 2017-02-11
 8:  4 2017-01-01
 9:  1 2017-01-15
10:  5 2017-01-01

现在我想查找 Leaving_Date 中的每个案例并找到他们跟进的日期,并在0 和 1。我正在使用以下代码:

SEVENDAY_FOLLOWUP <- vector()
FOURTEEN_FOLLOWUP <- vector()
THIRTYDAY_FOLLOWUP <- vector()
for(i in 1:nrow(Leaving_Date)){
  sub_data <- FOLLOWUPs[Id== Leaving_Date[i,1]]
  if(nrow(sub_data[Date > Leaving_Date[i,Date] &
                   Date < (Leaving_Date[i,Date]+7)])== 0){
     SEVENDAY_FOLLOWUP <- rbind(SEVENDAY_FOLLOWUP,0)
   }
   else{
     SEVENDAY_FOLLOWUP <- rbind(SEVENDAY_FOLLOWUP,1)
   }

   if(nrow(sub_data[Date > Leaving_Date[i,Date] &
                    Date < (Leaving_Date[i,Date]+14)])== 0){
     FOURTEEN_FOLLOWUP <- rbind(FOURTEEN_FOLLOWUP,0)
   }
   else{
     FOURTEEN_FOLLOWUP <- rbind(FOURTEEN_FOLLOWUP,1)
   }

   if(nrow(sub_data[Date > Leaving_Date[i,Date] &
                    Date < (Leaving_Date[i,Date]+30)])== 0){
     THIRTYDAY_FOLLOWUP <- rbind(THIRTYDAY_FOLLOWUP,0)
   }
   else{
     THIRTYDAY_FOLLOWUP <- rbind(THIRTYDAY_FOLLOWUP,1)
   }
 }               


 Leaving_Date$SEVENDAY <- as.vector(SEVENDAY_FOLLOWUP)
 Leaving_Date$FOURTEENDAY <- as.vector(FOURTEEN_FOLLOWUP)
 Leaving_Date$THIRTYDAY <- as.vector(THIRTYDAY_FOLLOWUP)

最终数据

 > Leaving_Date
    Id       Date SEVENDAY FOURTEENDAY THIRTYDAY
 1:  1 2017-01-01        0           0         1
 2:  2 2017-02-03        1           1         1
 3:  3 2017-01-01        0           0         1
 4:  4 2017-03-10        0           0         0
 5:  3 2017-02-09        1           1         1
 6:  5 2017-02-05        0           0         0

这段代码效率很低,因为我必须 运行 它才能进行 100k 次观察,而且会花费很多时间。有什么有效的方法可以做到这一点。

我认为这可以满足您使用 dplyr 的需求。

它按 Id 执行 'inner join' - 为给定的 Id 在两个数据框中生成所有日期组合 - 然后计算日期差异,按 Id 分组,然后检查是否有值落入三个类别的范围。

library(dplyr)

Leaving_Date2 <- Leaving_Date %>% inner_join(FOLLOWUPs %>% rename(FU_Date=Date)) %>% 
  mutate(datediff=as.numeric(FU_Date-Date)) %>% group_by(Id,Date) %>% 
  summarise(SEVENDAY=as.numeric(any(datediff %in% 0:6)),
            FOURTEENDAY=as.numeric(any(datediff %in% 0:13)),
            THIRTYDAY=as.numeric(any(datediff %in% 0:29)))

我们可以将其作为查询而不是循环来执行。首先,我稍微清理了你的 data.tables,因为我对变量名称感到困惑。

为了简化比较步骤,我们首先预先计算 7、14 和 30 天阈值的跟进日期限制。

library(dplyr)

dt_leaving_neat = Leaving_Date %>%
  mutate(.id = 1:n()) %>%
  mutate(limit_07 = Date + 7) %>%
  mutate(limit_14 = Date + 14) %>%
  mutate(limit_30 = Date + 30) %>%
  rename(id = .id, id_person = Id, leaving_date = Date)

dt_follow_neat = FOLLOWUPs %>% 
  select(id_person = Id, followed_up_date = Date)

实际操作只是查询。为了便于阅读,它写成 dplyr,但如果速度是您的主要关注点,您可以将其翻译成 data.table。我建议 运行 管道中的每个步骤,以确保您了解正在发生的事情。

dt_followed_up = dt_leaving_neat %>%
  tidyr::gather(follow_up, limit_date, limit_07:limit_30) %>%
  left_join(dt_follow_neat, by = "id_person") %>%
  mutate(followed_up = (followed_up_date > leaving_date) & (followed_up_date < limit_date)) %>%
  select(id, id_person, leaving_date, follow_up, followed_up) %>%
  filter(followed_up == TRUE) %>%
  unique() %>%
  tidyr::spread(follow_up, followed_up, fill = 0) %>%
  select(id, id_person, leaving_date, limit_07, limit_14, limit_30)

想法是将离开日期加入跟进日期并检查跟进日期是否在阈值内(并且也在离开日期之后,因为大概你不能在离开前跟进)。

然后对 return 您想要的格式进行一些最后的清理。您也可以使用 selectrename 来改回列名。

dt_result = dt_leaving_neat %>%
  select(id, id_person, leaving_date) %>%
  left_join(dt_followed_up, by = c("id", "id_person", "leaving_date"))

dt_result[is.na(dt_result)] = 0

结果

> dt_result
  id id_person leaving_date limit_07 limit_14 limit_30
1  1         1   2017-01-01        0        0        1
2  2         2   2017-02-03        1        1        1
3  3         3   2017-01-01        0        0        1
4  4         4   2017-03-10        0        0        0
5  5         3   2017-02-09        1        1        1
6  6         5   2017-02-05        0        0        0

根据 Andrew 的回答,等效的 1 行 data.table soln 是

FOLLOWUPs[Leaving_Date, on = "Id", .(Id, follow_date = Date, leaving_date = i.Date)][, diff := follow_date - leaving_date][, .(seven = any(diff %in% 0:6), fourteen = any(diff %in% 0:13), thirty = any(diff %in% 0:29)), .(Id, leaving_date)]

使用非相等连接:

setorder(FOLLOWUPs, Id, Date)
Leaving_Date[, n := 
  FOLLOWUPs[.SD, on=.(Id, Date > Date), mult = "first", x.Date - i.Date]
]

   Id       Date       n
1:  1 2017-01-01 14 days
2:  2 2017-02-03  1 days
3:  3 2017-01-01 16 days
4:  4 2017-03-10 NA days
5:  3 2017-02-09  1 days
6:  5 2017-02-05 NA days

Date 切换到 IDate 可能会使速度提高一倍左右。参见 ?IDate


我认为最好就此打住,但如果需要,可以将 n 与 7、14、30 进行比较,例如

Leaving_Date[, bin := c(7, 14, 30)[ findInterval(n, c(0, 7, 14, 30)) ]]

   Id       Date       n bin
1:  1 2017-01-01 14 days  30
2:  2 2017-02-03  1 days   7
3:  3 2017-01-01 16 days  30
4:  4 2017-03-10 NA days  NA
5:  3 2017-02-09  1 days   7
6:  5 2017-02-05 NA days  NA

旁注:请不要给表起这样的名字。