根据打开ID时间和关闭ID时间生成活动ID的周期时间序列

Generate periodic time series of active ID based on the open ID time and close ID time

我有一个标题,每一行都包含一个 ID 的打开日期和关闭日期。 有了这 2 个信息,我应该可以提取出每周有多少 ID 处于活跃状态,每周有多少 ID 处于关闭状态,以及活跃 ID 在该时间段内的增长率。

例如 ID aa 的开放日期为 week 1,结束日期为 week 5。 因此,从 week 1week 5.

,ID aa 将被计为 active_id

另一个 ID bb 的开放日期是 week 1 但没有关闭日期 NA,这意味着 ID 从 week 1 开始开放但直到现在才关闭(比如现在是 week 10)。因此,从 week 1week 10.

,ID bb 将被计为 active_id
# create weekly row
set.seed(1990)
have <- tibble(id = as.vector(outer(letters, letters, paste0))[1:48]
                        , open_week = sample(1:10,48, replace = T)
                        , age_week = sample(1:7,48, replace = T)) %>%
  mutate(close_week = open_week + age_week) %>%
  arrange(open_week)

# some are closed, some are not closed
# if not closed, set to NA
have$close_week[sample(c(TRUE, FALSE),48, replace = T, prob = c(0.3,0.7))] <- NA

# recalculate ID age for NA
have <- have %>%
  mutate(age_week = if_else(is.na(close_week), max(open_week) - open_week, age_week))

have

> have
# A tibble: 48 x 4
   id    open_week age_week close_week
   <chr>     <int>    <int>      <int>
 1 wa           10        0         NA
 2 sb            4        1          5
 3 ja            8        1          9
 4 cb            9        1         NA
 5 tb            9        1         NA
 6 hb           10        1         11
 7 pb            1        2          3
 8 la            3        2          5
 9 oa            6        2          8
10 rb            6        2          8


您可能会注意到,我想概括地说,我想每周生成指标(至少基于这个可重现的数据时间范围)以用于特征工程目的。我没有得到每周快照的奢侈,这肯定会简化许多这些操作。但我认为这非常有趣(至少对我而言)如何单独使用这 3 列(IDopen timeclose time),我可以 re-generate 每周数据快照.


# Daily time series
# these active_id numbers, close_id, median age week, active_growth_rate are fictionous, not actual values based on 
# the have data above
want <- tibble(open_week = seq(min(have$open_week),max(have$open_week))
               ,active_id = c(sample(10:18,length(open_week), replace = T))
               ,close_id = 20 - active_id
               ,median_age_week_active = c(sample(2:6,length(open_week), replace = T))
               ,median_age_week_closed = c(sample(2:6,length(open_week), replace = T)) 
               ,active_growth_rate = ((active_id - lag(active_id))/active_id) * 100)
> want
# A tibble: 10 x 6
   open_week active_id close_id median_age_week_a… median_age_week_c… active_growth_r…
       <int>     <int>    <dbl>              <int>              <int>            <dbl>
 1         1        12        8                  4                  2            NA   
 2         2        10       10                  3                  4           -20   
 3         3        11        9                  6                  6             9.09
 4         4        11        9                  4                  3             0   
 5         5        16        4                  3                  5            31.2 
 6         6        10       10                  3                  3           -60   
 7         7        14        6                  4                  5            28.6 
 8         8        10       10                  4                  2           -40   
 9         9        18        2                  4                  6            44.4 
10        10        18        2                  4                  4             0   

鉴于更新修改了答案。

遵循此 tidyverse 策略。解释-

  • 为第一周生成一个 seq 到一周的最大值
  • 使用rowwise进行以下操作
    • 计算在周数期间打开的案例 - 只需检查序列中的周数等于 have$start
    • 的位置
    • cases cases closed during week similarly (here we have to use na.rm = T, so that conditional may return only T/F and no NA
    • 收集列表中的所有活动案例并计算它们与起点的距离(但这里要小心,have$end 中有 NA 的案例也被计算在内。)
    • 类似地收集列表中所有关闭的案例并计算它们的距离
    • 计算活跃案例的长度
    • 计算两个列表的中位数
  • 取消分组后(按行)计算增长百分比。
library(tidyverse)

seq_len(max(have$close_week, na.rm = T)) %>% 
  as.data.frame() %>%
  set_names("Week") %>% 
  rowwise() %>%
  mutate(opened = sum(Week == have$open_week),
         closed = sum(Week == have$close_week, na.rm = T),
         act_ages_med = list(Week - have$open_week[Week >= have$open_week & 
                                                     Week < ifelse(is.na(have$close_week),
                                                                   max(have$close_week, na.rm = T) +1,
                                                                   have$close_week)]),
         cls_ages_med = list((Week - have$open_week[Week == have$close_week]) %>% na.omit()),
         active = length(act_ages_med),
         act_ages_med = median(act_ages_med),
         cls_ages_med = median(cls_ages_med)) %>% 
  ungroup() %>%
  mutate(active_grth = (active - lag(active))*100/lag(active))

# A tibble: 14 x 7
    Week opened closed act_ages_med cls_ages_med active active_grth
   <int>  <int>  <int>        <dbl>        <dbl>  <int>       <dbl>
 1     1      8      0          0           NA        8       NA   
 2     2      5      0          1           NA       13       62.5 
 3     3      5      1          1            2       17       30.8 
 4     4      7      2          1            3       22       29.4 
 5     5      4      3          2            2       23        4.55
 6     6      4      3          2            4       24        4.35
 7     7      4      4          2.5          4       24        0   
 8     8      4      7          3            4       21      -12.5 
 9     9      3      3          3            5       21        0   
10    10      4      1          3            3       24       14.3 
11    11      0      3          4            2       21      -12.5 
12    12      0      4          5            7       17      -19.0 
13    13      0      1          6            6       16       -5.88
14    14      0      2          7.5          5.5     14      -12.5

较早的回答

这行得通吗?在我看来 age_week 根本不需要

have %>% select(-age_week) %>%
  pivot_longer(-id, names_to = "event", values_to = "week") %>%
  mutate(event = factor(event, levels = c("open_week", "close_week"), ordered = T)) %>%
  filter(!is.na(week)) %>%
  arrange(week, event) %>%
  mutate(d = ifelse(event == "open_week", 1, -1),
         d = cumsum(d)) %>%
  group_by(week) %>%
  summarise(opened = sum(event == "open_week"),
            active = last(d),
            closed = sum(event == "close_week")) %>%
  mutate(active_gr_rate = (active - lag(active))*100/lag(active))

# A tibble: 14 x 5
    week opened active closed active_gr_rate
   <int>  <int>  <dbl>  <int>          <dbl>
 1     1      8      8      0          NA   
 2     2      5     13      0          62.5 
 3     3      5     17      1          30.8 
 4     4      7     22      2          29.4 
 5     5      4     23      3           4.55
 6     6      4     24      3           4.35
 7     7      4     24      4           0   
 8     8      4     21      7         -12.5 
 9     9      3     21      3           0   
10    10      4     24      1          14.3 
11    11      0     21      3         -12.5 
12    12      0     17      4         -19.0 
13    13      0     16      1          -5.88
14    14      0     14      2         -12.5 

具有 NA 状态的笔记 ID 被假定为开放截止日期