Creating a unique variable based on row differences of another variable, considering groups

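Using the data below, I want to create a new unique customer ID based on each customer's contact dates. Rule: after every two days, I want each customer to get a new unique customer ID, and to preserve it on the following record if the next contact date for the same customer is within the following two days; if not, assign a new ID to that same customer.

I was only able to calculate the date differences. The original data set I work with is larger, so I would prefer a data.table solution if possible.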

library(data.table)
treshold <- 2
dt <- structure(list(customer_id = c('10','20','20','20','20','20','30','30','30','30','30','40','50','50'),
                      contact_date = as.Date(c("2019-01-05","2019-01-01","2019-01-01","2019-01-02",
                                               "2019-01-08","2019-01-09","2019-02-02","2019-02-05",
                                               "2019-02-05","2019-02-09","2019-02-12","2019-02-01",
                                               "2019-02-01","2019-02-05")),
                      desired_output = c(1,2,2,2,3,3,4,5,5,6,7,8,9,10)), 
                 class = "data.frame", 
                 row.names = 1:14)
setDT(dt)
setorder(dt, customer_id, contact_date)
dt[, date_diff_in_days:=contact_date - shift(contact_date, type = c("lag")), by=customer_id]
dt[, date_diff_in_days:=as.numeric(date_diff_in_days)]
dt

    customer_id contact_date desired_output date_diff_in_days
 1:          10   2019-01-05              1                NA
 2:          20   2019-01-01              2                NA
 3:          20   2019-01-01              2                 0
 4:          20   2019-01-02              2                 1
 5:          20   2019-01-08              3                 6
 6:          20   2019-01-09              3                 1
 7:          30   2019-02-02              4                NA
 8:          30   2019-02-05              5                 3
 9:          30   2019-02-05              5                 0
10:          30   2019-02-09              6                 4
11:          30   2019-02-12              7                 3
12:          40   2019-02-01              8                NA
13:          50   2019-02-01              9                NA
14:          50   2019-02-05             10                 4

We use cumsum to increment the ID whenever date_diff_in_days is NA (the first record of a customer) or exceeds the threshold.

dt[, result := cumsum(is.na(date_diff_in_days) | date_diff_in_days > treshold)]
#     customer_id contact_date desired_output date_diff_in_days result
#  1:          10   2019-01-05              1                NA      1
#  2:          20   2019-01-01              2                NA      2
#  3:          20   2019-01-01              2                 0      2
#  4:          20   2019-01-02              2                 1      2
#  5:          20   2019-01-08              3                 6      3
#  6:          20   2019-01-09              3                 1      3
#  7:          30   2019-02-02              4                NA      4
#  8:          30   2019-02-05              5                 3      5
#  9:          30   2019-02-05              5                 0      5
# 10:          30   2019-02-09              6                 4      6
# 11:          30   2019-02-12              7                 3      7
# 12:          40   2019-02-01              8                NA      8
# 13:          50   2019-02-01              9                NA      9
# 14:          50   2019-02-05             10                 4     10
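
To see why this works, here is a minimal sketch on a toy vector of day gaps. The names `gaps`, `threshold` and `starts_new` are made up for illustration and are not part of the original data. The idea is that cumsum over a logical vector counts the TRUE values seen so far, so every TRUE opens a new ID and every FALSE carries the current ID forward.

```r
# Toy day gaps; NA stands for the first record of a customer
gaps <- c(NA, 0, 1, 6, 1, NA, 3)
threshold <- 2  # illustrative name, plays the role of the threshold above

# TRUE wherever a new ID should start:
# either there is no previous record (NA) or the gap is too large
starts_new <- is.na(gaps) | gaps > threshold
starts_new
# [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE

# Running count of TRUEs gives the ID for each row
cumsum(starts_new)
# [1] 1 1 1 2 2 3 4
```

Note that `is.na(gaps) | gaps > threshold` yields TRUE for the NA entries, because `TRUE | NA` evaluates to TRUE in R, so cumsum never sees a missing value.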

Rule: After every two days, I want each customer to get a new unique customer ID and preserve it on the following record if the following contact date for the same customer is within the following two days; if not, assign a new ID to this same customer.

When creating a new ID, you can use the auto-counter .GRP, provided you set up the by= vector correctly to capture the rule:

thresh <- 2
dt[, g := .GRP, by=.(
  customer_id, 
  cumsum(contact_date - shift(contact_date, fill=first(contact_date)) > thresh)
)]

dt[, any(g != desired_output)]
# [1] FALSE
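
To make the grouping easier to inspect, the sketch below splits the same idea into two steps on a small made-up table. The table `toy` and the column `run` are assumptions introduced only for illustration. It assumes the rows are already sorted by customer and date, as in the question.

```r
library(data.table)

# Hypothetical toy data, already sorted by customer_id and contact_date
toy <- data.table(
  customer_id  = c("a", "a", "a", "b", "b"),
  contact_date = as.Date(c("2019-01-01", "2019-01-02", "2019-01-08",
                           "2019-01-01", "2019-01-05"))
)
thresh <- 2

# Step 1: a running counter over the whole table that increments
# whenever the gap to the previous row exceeds the threshold.
# by= expressions are evaluated on the full table, so this counter
# ignores customer boundaries.
toy[, run := cumsum(
  as.numeric(contact_date - shift(contact_date, fill = first(contact_date))) > thresh
)]

# Step 2: .GRP numbers each (customer_id, run) pair in order of appearance.
# Including customer_id in by= is what separates the customers.
toy[, g := .GRP, by = .(customer_id, run)]
toy
#    customer_id contact_date run g
# 1:           a   2019-01-01   0 1
# 2:           a   2019-01-02   0 1
# 3:           a   2019-01-08   1 2
# 4:           b   2019-01-01   1 3
# 5:           b   2019-01-05   2 4
```

Rows 3 and 4 share the same `run` value but belong to different customers, so they still end up in different groups.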

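I think the .GRP code above is correct, since it works on the example, but you may want to check it on your real data (comparing against the results of other approaches, such as Gregor's) to be sure.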
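
One way such a check might look is sketched below on synthetic data. Everything here (the table `big`, the columns `d`, `id_cumsum` and `id_grp`, the sample sizes and the seed) is invented for illustration; substitute your own table. It assumes the table is sorted by customer and date before either approach runs.

```r
library(data.table)

# Synthetic data: 50 hypothetical customers, 1000 contacts over about two months
set.seed(1)
n <- 1000
big <- data.table(
  customer_id  = sample(sprintf("%03d", 1:50), n, replace = TRUE),
  contact_date = as.Date("2019-01-01") + sample(0:60, n, replace = TRUE)
)
setorder(big, customer_id, contact_date)
thresh <- 2

# Approach 1: lagged difference per customer, then one cumsum over the table
big[, d := as.numeric(contact_date - shift(contact_date)), by = customer_id]
big[, id_cumsum := cumsum(is.na(d) | d > thresh)]

# Approach 2: .GRP over customer_id plus a running gap counter
big[, id_grp := .GRP, by = .(
  customer_id,
  cumsum(as.numeric(contact_date - shift(contact_date, fill = first(contact_date))) > thresh)
)]

# TRUE means both approaches assigned identical IDs on every row
big[, all(id_cumsum == id_grp)]
```

If the final line returns FALSE on your data, inspecting `big[id_cumsum != id_grp]` should show where the two approaches diverge; an unsorted table is the most likely cause.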