R: Calculating the number of occurrences within a specific past time period for each unique individual in a dataset
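I am trying to count the number of times an event occurred for a given individual within a specific time period in the past. In this particular case, for each new observation (reflecting a single scheduling request), I need to know how many trips the individual scheduled (trip_scheduled) in the previous 60 days. Eventually I will need to count how many times that individual cancelled a scheduled trip on the same day during the previous 60 days, but I am starting with just the count over the "moving" 60-day period.
I found an elegant answer to a similar but slightly different question in this post:
My situation differs in a couple of ways. First, I am trying to look at a previous time period, and I don't know whether that changes my approach. Second, I need to run the analysis for more than 40,000 individuals, which I have been trying to do with a mix of code I found in another answer, a for loop (which I know is frowned upon), and dplyr grouping. It is not working at all.
Can anyone point me in the right direction? I would be happy to stick with dplyr and base R; I just don't know much about data.table.
Here is the code and test data I have been trying to use: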
test_set2 <- structure(list(tripID = c("20180112-100037-674-101", "20180112-100037-674-201",
"20180112-100037-674-301", "20180113-100037-676-101", "20180113-100037-676-201",
"20180115-100037-675-101", "20180115-100037-675-201", "20180116-100037-677-101",
"20180116-100037-677-201", "20180131-100037-678-101", "20180101-100146-707-101",
"20180101-100146-707-201", "20180102-100146-708-101", "20180102-100146-708-201",
"20180103-100146-709-101", "20180103-100146-709-201", "20180104-100146-710-101",
"20180104-100146-710-201", "20180105-100146-711-101", "20180105-100146-711-201",
"20180403-100532-223-101", "20180403-100532-223-201", "20180620-100532-224-101",
"20180620-100532-224-201", "20180704-100532-225-101", "20180704-100532-225-201",
"20180926-100532-228-101", "20180926-100532-228-201", "20180927-100532-226-101",
"20180927-100532-226-201"), CUSTOMER_ID = c(100037L, 100037L,
100037L, 100037L, 100037L, 100037L, 100037L, 100037L, 100037L,
100037L, 100146L, 100146L, 100146L, 100146L, 100146L, 100146L,
100146L, 100146L, 100146L, 100146L, 100532L, 100532L, 100532L,
100532L, 100532L, 100532L, 100532L, 100532L, 100532L, 100532L
), trip_date = structure(c(17543, 17543, 17543, 17544, 17544,
17546, 17546, 17547, 17547, 17562, 17532, 17532, 17533, 17533,
17534, 17534, 17535, 17535, 17536, 17536, 17624, 17624, 17702,
17702, 17716, 17716, 17800, 17800, 17801, 17801), class = "Date"),
trip_scheduled = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), same_day_cancel = c(1,
1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, -30L), groups = structure(list(
CUSTOMER_ID = c(100037L, 100146L, 100532L), .rows = list(
1:10, 11:20, 21:30)), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
running_frame <- test_set2[1,]
unique_customers <- unique(test_set2$CUSTOMER_ID)
for (cust in unique_customers){
temp_events <- test_set2 %>% filter(CUSTOMER_ID == i)
cs = cumsum(temp_events$trip_scheduled) # cumulative number of trips of individual
output_temp <- data.frame(temp_events,
trips_minus_60 = cs[findInterval(temp_events$trip_date - 60, temp_events$trip_date, left.open = TRUE)] - cs)
new_table <- rbind(new_table,output_temp)
}
Here is the error it most recently produced:
Error in data.frame(temp_events, trips_minus_60 = cs[findInterval(temp_events$trip_date - :
  arguments imply differing number of rows: 10, 0
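For readers wondering where the "10, 0" comes from, here is a minimal sketch, assuming the loop body runs for one customer whose ten trips all fall within 60 days of each other. The vectors `dates` and `cs` are made up for illustration and are not the data above. In that situation `findInterval()` returns 0 for every row, and indexing a vector with 0 gives a zero-length result, which `data.frame()` cannot combine with a 10-row frame.

```r
# Hypothetical dates for one customer; all within 60 days of each other
dates <- as.numeric(as.Date("2018-01-12")) + c(0, 0, 0, 1, 1, 3, 3, 4, 4, 19)
cs <- cumsum(rep(1, length(dates)))   # cumulative trip count per row

# With left.open = TRUE, each index is the number of dates strictly
# earlier than (date - 60); here there are none, so every index is 0
idx <- findInterval(dates - 60, dates, left.open = TRUE)
idx

# Indexing with 0 selects nothing, so the result has length 0, not 10
length(cs[idx])
length(cs[idx] - cs)

# Prepending a 0 and shifting the index by one keeps the length at 10
trips_before_window <- c(0, cs)[idx + 1]
length(trips_before_window)

# Cumulative count up to each row minus the trips older than the window
cs - trips_before_window
```

Note also that the subtraction in the original line is `cs[...] - cs`, which would give negative numbers even when the lengths match; the sketch uses the opposite order.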
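I'm not sure whether this meets your needs, but here is a tidyverse solution based on @Axeman's answer that you linked to. After group_by on your CUSTOMER_ID, you can sum all rows where trip_scheduled is 1 and the date falls between the current date and 60 days before it. I expect you could do something similar for same_day_cancel.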
library(tidyverse)
test_set2 %>%
group_by(CUSTOMER_ID) %>%
mutate(schedule_60 = unlist(map(trip_date, ~sum(trip_scheduled == 1 & between(trip_date, . - 60, .))))) %>%
print(n=30)
# A tibble: 30 x 6
# Groups: CUSTOMER_ID [3]
tripID CUSTOMER_ID trip_date trip_scheduled same_day_cancel schedule_60
<chr> <int> <date> <dbl> <dbl> <int>
1 20180112-100037-674-101 100037 2018-01-12 1 1 3
2 20180112-100037-674-201 100037 2018-01-12 1 1 3
3 20180112-100037-674-301 100037 2018-01-12 1 1 3
4 20180113-100037-676-101 100037 2018-01-13 1 0 5
5 20180113-100037-676-201 100037 2018-01-13 1 0 5
6 20180115-100037-675-101 100037 2018-01-15 1 1 7
7 20180115-100037-675-201 100037 2018-01-15 1 1 7
8 20180116-100037-677-101 100037 2018-01-16 1 0 9
9 20180116-100037-677-201 100037 2018-01-16 1 0 9
10 20180131-100037-678-101 100037 2018-01-31 1 0 10
11 20180101-100146-707-101 100146 2018-01-01 1 1 2
12 20180101-100146-707-201 100146 2018-01-01 1 1 2
13 20180102-100146-708-101 100146 2018-01-02 1 1 4
14 20180102-100146-708-201 100146 2018-01-02 1 1 4
15 20180103-100146-709-101 100146 2018-01-03 1 1 6
16 20180103-100146-709-201 100146 2018-01-03 1 1 6
17 20180104-100146-710-101 100146 2018-01-04 1 1 8
18 20180104-100146-710-201 100146 2018-01-04 1 1 8
19 20180105-100146-711-101 100146 2018-01-05 1 1 10
20 20180105-100146-711-201 100146 2018-01-05 1 1 10
21 20180403-100532-223-101 100532 2018-04-03 1 0 2
22 20180403-100532-223-201 100532 2018-04-03 1 0 2
23 20180620-100532-224-101 100532 2018-06-20 1 0 2
24 20180620-100532-224-201 100532 2018-06-20 1 0 2
25 20180704-100532-225-101 100532 2018-07-04 1 0 4
26 20180704-100532-225-201 100532 2018-07-04 1 0 4
27 20180926-100532-228-101 100532 2018-09-26 1 0 2
28 20180926-100532-228-201 100532 2018-09-26 1 0 2
29 20180927-100532-226-101 100532 2018-09-27 1 0 4
30 20180927-100532-226-201 100532 2018-09-27 1 0 4
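Note that the counts above include the current day, because between() is inclusive at both ends. As a rough sketch of how the same idea might be extended to same_day_cancel while counting only the days strictly before each request, the base R version below uses a small made-up data frame (`trips`) and a hypothetical helper (`count_prior`); neither comes from the original post, and the output column names are illustrative.

```r
# Small made-up data set (not the asker's data): one row per trip request
trips <- data.frame(
  CUSTOMER_ID     = c(1L, 1L, 1L, 1L, 2L, 2L),
  trip_date       = as.Date(c("2018-01-01", "2018-01-10", "2018-02-15",
                              "2018-04-01", "2018-01-05", "2018-01-06")),
  trip_scheduled  = c(1, 1, 1, 1, 1, 1),
  same_day_cancel = c(1, 0, 1, 0, 0, 1)
)

# Hypothetical helper: for each date, sum `value` over the rows dated
# within the `days` days before it (the current day itself is excluded)
count_prior <- function(dates, value, days = 60) {
  d <- as.numeric(dates)
  vapply(d, function(x) sum(value[d >= x - days & d < x]), numeric(1))
}

# Apply the helper within each customer; ave() keeps the original row order
rows <- seq_len(nrow(trips))
trips$schedule_prev_60 <- ave(rows, trips$CUSTOMER_ID, FUN = function(i)
  count_prior(trips$trip_date[i], trips$trip_scheduled[i]))
trips$cancel_prev_60 <- ave(rows, trips$CUSTOMER_ID, FUN = function(i)
  count_prior(trips$trip_date[i], trips$same_day_cancel[i]))

trips
```

Each customer's rows are compared against all of that customer's other rows, so the cost grows with the square of the trips per customer. That is probably acceptable when individuals have a modest number of trips, but it is an assumption worth checking on the full 40,000-person data.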