根据打开ID时间和关闭ID时间生成活动ID的周期时间序列
Generate periodic time series of active ID based on the open ID time and close ID time
我有一个标题,每一行都包含一个 ID 的打开日期和关闭日期。
有了这 2 个信息,我应该可以提取出每周有多少 ID 处于活跃状态,每周有多少 ID 处于关闭状态,以及活跃 ID 在该时间段内的增长率。
例如 ID aa
的开放日期为 week 1
,结束日期为 week 5
。
因此,从 week 1
到 week 5
.
,ID aa
将被计为 active_id
另一个 ID bb
的开放日期是 week 1
但没有关闭日期 NA
,这意味着 ID 从 week 1
开始开放但直到现在才关闭(比如现在是 week 10
)。因此,从 week 1
到 week 10
.
,ID bb
将被计为 active_id
# create weekly row
set.seed(1990)
have <- tibble(id = as.vector(outer(letters, letters, paste0))[1:48]
, open_week = sample(1:10,48, replace = T)
, age_week = sample(1:7,48, replace = T)) %>%
mutate(close_week = open_week + age_week) %>%
arrange(open_week)
# some are closed, some are not closed
# if not closed, set to NA
have$close_week[sample(c(TRUE, FALSE),48, replace = T, prob = c(0.3,0.7))] <- NA
# recalculate ID age for NA
have <- have %>%
mutate(age_week = if_else(is.na(close_week), max(open_week) - open_week, age_week))
have
> have
# A tibble: 48 x 4
id open_week age_week close_week
<chr> <int> <int> <int>
1 wa 10 0 NA
2 sb 4 1 5
3 ja 8 1 9
4 cb 9 1 NA
5 tb 9 1 NA
6 hb 10 1 11
7 pb 1 2 3
8 la 3 2 5
9 oa 6 2 8
10 rb 6 2 8
您可能会注意到,我想概括地说,我想每周生成指标(至少基于这个可重现的数据时间范围)以用于特征工程目的。我没有得到每周快照的奢侈,这肯定会简化许多这些操作。但我认为这非常有趣(至少对我而言)如何单独使用这 3 列(ID
、open time
、close time
),我可以 re-generate 每周数据快照.
# Daily time series
# these active_id numbers, close_id, median age week, active_growth_rate are fictionous, not actual values based on
# the have data above
want <- tibble(open_week = seq(min(have$open_week),max(have$open_week))
,active_id = c(sample(10:18,length(open_week), replace = T))
,close_id = 20 - active_id
,median_age_week_active = c(sample(2:6,length(open_week), replace = T))
,median_age_week_closed = c(sample(2:6,length(open_week), replace = T))
,active_growth_rate = ((active_id - lag(active_id))/active_id) * 100)
> want
# A tibble: 10 x 6
open_week active_id close_id median_age_week_a… median_age_week_c… active_growth_r…
<int> <int> <dbl> <int> <int> <dbl>
1 1 12 8 4 2 NA
2 2 10 10 3 4 -20
3 3 11 9 6 6 9.09
4 4 11 9 4 3 0
5 5 16 4 3 5 31.2
6 6 10 10 3 3 -60
7 7 14 6 4 5 28.6
8 8 10 10 4 2 -40
9 9 18 2 4 6 44.4
10 10 18 2 4 4 0
鉴于更新修改了答案。
遵循此 tidyverse
策略。解释-
- 为第一周生成一个
seq
到一周的最大值
- 使用
rowwise
进行以下操作
- 计算在周数期间打开的案例 - 只需检查序列中的周数等于 have$start
的位置
- cases cases closed during week similarly (here we have to use
na.rm = T
, so that conditional may return only T/F and no NA
- 收集列表中的所有活动案例并计算它们与起点的距离(但这里要小心,have$end 中有 NA 的案例也被计算在内。)
- 类似地收集列表中所有关闭的案例并计算它们的距离
- 计算活跃案例的长度
- 计算两个列表的中位数
- 取消分组后(按行)计算增长百分比。
library(tidyverse)
seq_len(max(have$close_week, na.rm = T)) %>%
as.data.frame() %>%
set_names("Week") %>%
rowwise() %>%
mutate(opened = sum(Week == have$open_week),
closed = sum(Week == have$close_week, na.rm = T),
act_ages_med = list(Week - have$open_week[Week >= have$open_week &
Week < ifelse(is.na(have$close_week),
max(have$close_week, na.rm = T) +1,
have$close_week)]),
cls_ages_med = list((Week - have$open_week[Week == have$close_week]) %>% na.omit()),
active = length(act_ages_med),
act_ages_med = median(act_ages_med),
cls_ages_med = median(cls_ages_med)) %>%
ungroup() %>%
mutate(active_grth = (active - lag(active))*100/lag(active))
# A tibble: 14 x 7
Week opened closed act_ages_med cls_ages_med active active_grth
<int> <int> <int> <dbl> <dbl> <int> <dbl>
1 1 8 0 0 NA 8 NA
2 2 5 0 1 NA 13 62.5
3 3 5 1 1 2 17 30.8
4 4 7 2 1 3 22 29.4
5 5 4 3 2 2 23 4.55
6 6 4 3 2 4 24 4.35
7 7 4 4 2.5 4 24 0
8 8 4 7 3 4 21 -12.5
9 9 3 3 3 5 21 0
10 10 4 1 3 3 24 14.3
11 11 0 3 4 2 21 -12.5
12 12 0 4 5 7 17 -19.0
13 13 0 1 6 6 16 -5.88
14 14 0 2 7.5 5.5 14 -12.5
较早的回答
这行得通吗?在我看来 age_week
根本不需要
have %>% select(-age_week) %>%
pivot_longer(-id, names_to = "event", values_to = "week") %>%
mutate(event = factor(event, levels = c("open_week", "close_week"), ordered = T)) %>%
filter(!is.na(week)) %>%
arrange(week, event) %>%
mutate(d = ifelse(event == "open_week", 1, -1),
d = cumsum(d)) %>%
group_by(week) %>%
summarise(opened = sum(event == "open_week"),
active = last(d),
closed = sum(event == "close_week")) %>%
mutate(active_gr_rate = (active - lag(active))*100/lag(active))
# A tibble: 14 x 5
week opened active closed active_gr_rate
<int> <int> <dbl> <int> <dbl>
1 1 8 8 0 NA
2 2 5 13 0 62.5
3 3 5 17 1 30.8
4 4 7 22 2 29.4
5 5 4 23 3 4.55
6 6 4 24 3 4.35
7 7 4 24 4 0
8 8 4 21 7 -12.5
9 9 3 21 3 0
10 10 4 24 1 14.3
11 11 0 21 3 -12.5
12 12 0 17 4 -19.0
13 13 0 16 1 -5.88
14 14 0 14 2 -12.5
具有 NA 状态的笔记 ID 被假定为开放截止日期
我有一个标题,每一行都包含一个 ID 的打开日期和关闭日期。 有了这 2 个信息,我应该可以提取出每周有多少 ID 处于活跃状态,每周有多少 ID 处于关闭状态,以及活跃 ID 在该时间段内的增长率。
例如 ID aa
的开放日期为 week 1
,结束日期为 week 5
。
因此,从 week 1
到 week 5
.
aa
将被计为 active_id
另一个 ID bb
的开放日期是 week 1
但没有关闭日期 NA
,这意味着 ID 从 week 1
开始开放但直到现在才关闭(比如现在是 week 10
)。因此,从 week 1
到 week 10
.
bb
将被计为 active_id
# create weekly row
set.seed(1990)
have <- tibble(id = as.vector(outer(letters, letters, paste0))[1:48]
, open_week = sample(1:10,48, replace = T)
, age_week = sample(1:7,48, replace = T)) %>%
mutate(close_week = open_week + age_week) %>%
arrange(open_week)
# some are closed, some are not closed
# if not closed, set to NA
have$close_week[sample(c(TRUE, FALSE),48, replace = T, prob = c(0.3,0.7))] <- NA
# recalculate ID age for NA
have <- have %>%
mutate(age_week = if_else(is.na(close_week), max(open_week) - open_week, age_week))
have
> have
# A tibble: 48 x 4
id open_week age_week close_week
<chr> <int> <int> <int>
1 wa 10 0 NA
2 sb 4 1 5
3 ja 8 1 9
4 cb 9 1 NA
5 tb 9 1 NA
6 hb 10 1 11
7 pb 1 2 3
8 la 3 2 5
9 oa 6 2 8
10 rb 6 2 8
您可能会注意到,我想概括地说,我想每周生成指标(至少基于这个可重现的数据时间范围)以用于特征工程目的。我没有得到每周快照的奢侈,这肯定会简化许多这些操作。但我认为这非常有趣(至少对我而言)如何单独使用这 3 列(ID
、open time
、close time
),我可以 re-generate 每周数据快照.
# Daily time series
# these active_id numbers, close_id, median age week, active_growth_rate are fictionous, not actual values based on
# the have data above
want <- tibble(open_week = seq(min(have$open_week),max(have$open_week))
,active_id = c(sample(10:18,length(open_week), replace = T))
,close_id = 20 - active_id
,median_age_week_active = c(sample(2:6,length(open_week), replace = T))
,median_age_week_closed = c(sample(2:6,length(open_week), replace = T))
,active_growth_rate = ((active_id - lag(active_id))/active_id) * 100)
> want
# A tibble: 10 x 6
open_week active_id close_id median_age_week_a… median_age_week_c… active_growth_r…
<int> <int> <dbl> <int> <int> <dbl>
1 1 12 8 4 2 NA
2 2 10 10 3 4 -20
3 3 11 9 6 6 9.09
4 4 11 9 4 3 0
5 5 16 4 3 5 31.2
6 6 10 10 3 3 -60
7 7 14 6 4 5 28.6
8 8 10 10 4 2 -40
9 9 18 2 4 6 44.4
10 10 18 2 4 4 0
鉴于更新修改了答案。
遵循此 tidyverse
策略。解释-
- 为第一周生成一个
seq
到一周的最大值 - 使用
rowwise
进行以下操作- 计算在周数期间打开的案例 - 只需检查序列中的周数等于 have$start 的位置
- cases cases closed during week similarly (here we have to use
na.rm = T
, so that conditional may return only T/F and no NA - 收集列表中的所有活动案例并计算它们与起点的距离(但这里要小心,have$end 中有 NA 的案例也被计算在内。)
- 类似地收集列表中所有关闭的案例并计算它们的距离
- 计算活跃案例的长度
- 计算两个列表的中位数
- 取消分组后(按行)计算增长百分比。
library(tidyverse)
seq_len(max(have$close_week, na.rm = T)) %>%
as.data.frame() %>%
set_names("Week") %>%
rowwise() %>%
mutate(opened = sum(Week == have$open_week),
closed = sum(Week == have$close_week, na.rm = T),
act_ages_med = list(Week - have$open_week[Week >= have$open_week &
Week < ifelse(is.na(have$close_week),
max(have$close_week, na.rm = T) +1,
have$close_week)]),
cls_ages_med = list((Week - have$open_week[Week == have$close_week]) %>% na.omit()),
active = length(act_ages_med),
act_ages_med = median(act_ages_med),
cls_ages_med = median(cls_ages_med)) %>%
ungroup() %>%
mutate(active_grth = (active - lag(active))*100/lag(active))
# A tibble: 14 x 7
Week opened closed act_ages_med cls_ages_med active active_grth
<int> <int> <int> <dbl> <dbl> <int> <dbl>
1 1 8 0 0 NA 8 NA
2 2 5 0 1 NA 13 62.5
3 3 5 1 1 2 17 30.8
4 4 7 2 1 3 22 29.4
5 5 4 3 2 2 23 4.55
6 6 4 3 2 4 24 4.35
7 7 4 4 2.5 4 24 0
8 8 4 7 3 4 21 -12.5
9 9 3 3 3 5 21 0
10 10 4 1 3 3 24 14.3
11 11 0 3 4 2 21 -12.5
12 12 0 4 5 7 17 -19.0
13 13 0 1 6 6 16 -5.88
14 14 0 2 7.5 5.5 14 -12.5
较早的回答
这行得通吗?在我看来 age_week
根本不需要
have %>% select(-age_week) %>%
pivot_longer(-id, names_to = "event", values_to = "week") %>%
mutate(event = factor(event, levels = c("open_week", "close_week"), ordered = T)) %>%
filter(!is.na(week)) %>%
arrange(week, event) %>%
mutate(d = ifelse(event == "open_week", 1, -1),
d = cumsum(d)) %>%
group_by(week) %>%
summarise(opened = sum(event == "open_week"),
active = last(d),
closed = sum(event == "close_week")) %>%
mutate(active_gr_rate = (active - lag(active))*100/lag(active))
# A tibble: 14 x 5
week opened active closed active_gr_rate
<int> <int> <dbl> <int> <dbl>
1 1 8 8 0 NA
2 2 5 13 0 62.5
3 3 5 17 1 30.8
4 4 7 22 2 29.4
5 5 4 23 3 4.55
6 6 4 24 3 4.35
7 7 4 24 4 0
8 8 4 21 7 -12.5
9 9 3 21 3 0
10 10 4 24 1 14.3
11 11 0 21 3 -12.5
12 12 0 17 4 -19.0
13 13 0 16 1 -5.88
14 14 0 14 2 -12.5
具有 NA 状态的笔记 ID 被假定为开放截止日期