R：计算指定时间范围内不同类别的数量

Question

这是一些虚拟数据：

  user_id       date category
       27 2016-01-01    apple
       27 2016-01-03    apple
       27 2016-01-05     pear
       27 2016-01-07     plum
       27 2016-01-10    apple
       27 2016-01-14     pear
       27 2016-01-16     plum
       11 2016-01-01    apple
       11 2016-01-03     pear
       11 2016-01-05     pear
       11 2016-01-07     pear
       11 2016-01-10    apple
       11 2016-01-14    apple
       11 2016-01-16    apple

我想为每个 user_id 计算 在指定时间段内（例如过去 7、14 天）不同 categories 的数量，包括当前订单

解决方案如下所示：

 user_id       date category distinct_7 distinct_14
      27 2016-01-01    apple          1           1
      27 2016-01-03    apple          1           1
      27 2016-01-05     pear          2           2
      27 2016-01-07     plum          3           3
      27 2016-01-10    apple          3           3
      27 2016-01-14     pear          3           3
      27 2016-01-16     plum          3           3
      11 2016-01-01    apple          1           1
      11 2016-01-03     pear          2           2
      11 2016-01-05     pear          2           2
      11 2016-01-07     pear          2           2
      11 2016-01-10    apple          2           2
      11 2016-01-14    apple          2           2
      11 2016-01-16    apple          1           2

我发布了类似的问题 or ，但是其中 none 提到计算指定时间段内的累积唯一值。非常感谢您的帮助！

Answer 1

这里有两个 data.table 解决方案，一个有两个嵌套 lapply，另一个使用 non-equi joins.

第一个是一个相当笨拙的 data.table 解决方案，但它重现了预期的答案。它适用于任意数量的时间范围。（尽管他在评论中建议的@alistaire 的简洁 tidyverse 解决方案也可以修改）。

它使用两个嵌套lapply。第一个循环遍历时间范围，第二个循环遍历日期。临时结果与原始数据相结合，然后从长格式重塑为宽格式，这样我们将以每个时间范围的单独列结束。

library(data.table)
tmp <- rbindlist(
  lapply(c(7L, 14L), 
         function(ldays) rbindlist(
           lapply(unique(dt$date), 
                  function(ldate) {
                    dt[between(date, ldate - ldays, ldate), 
                       .(distinct = sprintf("distinct_%02i", ldays), 
                         date = ldate, 
                         N = uniqueN(category)), 
                       by = .(user_id)]
                  })
         )
  )
)
dcast(tmp[dt, on=c("user_id", "date")], 
      ... ~ distinct, value.var = "N")[order(-user_id, date, category)] 
#          date user_id category distinct_07 distinct_14
# 1: 2016-01-01      27    apple           1           1
# 2: 2016-01-03      27    apple           1           1
# 3: 2016-01-05      27     pear           2           2
# 4: 2016-01-07      27     plum           3           3
# 5: 2016-01-10      27    apple           3           3
# 6: 2016-01-14      27     pear           3           3
# 7: 2016-01-16      27     plum           3           3
# 8: 2016-01-01      11    apple           1           1
# 9: 2016-01-03      11     pear           2           2
#10: 2016-01-05      11     pear           2           2
#11: 2016-01-07      11     pear           2           2
#12: 2016-01-10      11    apple           2           2
#13: 2016-01-14      11    apple           2           2
#14: 2016-01-16      11    apple           1           2

这是一个变体 following a suggestion by @Frank，它使用 data.table 的 non-equi 连接 而不是第二个 lapply：

tmp <- rbindlist(
  lapply(c(7L, 14L), 
         function(ldays) {
           dt[.(user_id = user_id, dago = date - ldays, d = date), 
              on=.(user_id, date >= dago, date <= d), 
              .(distinct = sprintf("distinct_%02i", ldays),
                N = uniqueN(category)), 
              by = .EACHI]
         }
  )
)[, date := NULL]
# 
dcast(tmp[dt, on=c("user_id", "date")], 
      ... ~ distinct, value.var = "N")[order(-user_id, date, category)]

数据：

dt <- fread("user_id       date category
       27 2016-01-01    apple
       27 2016-01-03    apple
       27 2016-01-05     pear
       27 2016-01-07     plum
       27 2016-01-10    apple
       27 2016-01-14     pear
       27 2016-01-16     plum
       11 2016-01-01    apple
       11 2016-01-03     pear
       11 2016-01-05     pear
       11 2016-01-07     pear
       11 2016-01-10    apple
       11 2016-01-14    apple
       11 2016-01-16    apple")
dt[, date := as.IDate(date)]

顺便说一句：过去 7 天、14 天 的措辞有些误导，因为时间段实际上由 8 天和 15 天组成，分别为

Answer 2

在 tidyverse 中，您可以使用 map_int 迭代一组值并简化为整数 à la sapply 或 vapply。通过比较或助手 between 计算对象子集 n_distinct（如 length(unique(...))）的不同出现次数，最小值设置为从当天减去的适当数量，然后您就设置好了.

library(tidyverse)

df %>% group_by(user_id) %>% 
    mutate(distinct_7  = map_int(date, ~n_distinct(category[between(date, .x - 7, .x)])), 
           distinct_14 = map_int(date, ~n_distinct(category[between(date, .x - 14, .x)])))

## Source: local data frame [14 x 5]
## Groups: user_id [2]
## 
##    user_id       date category distinct_7 distinct_14
##      <int>     <date>   <fctr>      <int>       <int>
## 1       27 2016-01-01    apple          1           1
## 2       27 2016-01-03    apple          1           1
## 3       27 2016-01-05     pear          2           2
## 4       27 2016-01-07     plum          3           3
## 5       27 2016-01-10    apple          3           3
## 6       27 2016-01-14     pear          3           3
## 7       27 2016-01-16     plum          3           3
## 8       11 2016-01-01    apple          1           1
## 9       11 2016-01-03     pear          2           2
## 10      11 2016-01-05     pear          2           2
## 11      11 2016-01-07     pear          2           2
## 12      11 2016-01-10    apple          2           2
## 13      11 2016-01-14    apple          2           2
## 14      11 2016-01-16    apple          1           2

Answer 3

我推荐使用 runner 包。您可以将运行 windows 上的任何 R 函数与 runner 函数一起使用。下面的代码获取 desided 输出，即过去 7 天 + 当前和过去 14 天 + 当前（当前 8 天和 15 天）：

df <- read.table(
  text = "  user_id       date category
       27 2016-01-01    apple
  27 2016-01-03    apple
  27 2016-01-05     pear
  27 2016-01-07     plum
  27 2016-01-10    apple
  27 2016-01-14     pear
  27 2016-01-16     plum
  11 2016-01-01    apple
  11 2016-01-03     pear
  11 2016-01-05     pear
  11 2016-01-07     pear
  11 2016-01-10    apple
  11 2016-01-14    apple
  11 2016-01-16    apple", header = TRUE, colClasses = c("integer", "Date", "character"))



library(dplyr)
library(runner)
df %>%
  group_by(user_id) %>%
  mutate(distinct_7  = runner(category, k = 7 + 1, idx = date, 
                              f = function(x) length(unique(x))),
         distinct_14 = runner(category, k = 14 + 1, idx = date, 
                              f = function(x) length(unique(x))))

更多信息请参见 package and function 文档。

R：计算指定时间范围内不同类别的数量

R: calculate number of distinct categories in the specified time frame

r

data.table

dplyr

distinct-values