R - How to count users based on account open/close dates when users have multiple accounts

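I have a list of accounts (more than 300k rows) going back six years. It contains the user number, the open and close dates, and other information such as location. We offer several kinds of account, and a user can hold one or more of them in any combination, either consecutively or overlapping.

I have been asked to find out how many users we had in any given month. They want the figure broken down by location and also as a total.

So I have a table like this: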

   User    Open       Close      Area 
 1 A       2018-02-13 2018-07-31 West 
 2 B       2018-02-26 2018-06-04 North
 3 B       2018-02-27 2018-03-15 North
 4 C       2018-02-27 2018-05-26 South
 5 C       2018-03-15 2018-06-03 South
 6 D       2018-03-20 2018-07-02 East 
 7 E       2018-04-01 2018-06-19 West 
 8 E       2018-04-14 2018-05-04 West 
 9 F       2018-03-20 2018-04-19 North
10 G       2018-04-26 2018-07-04 South
11 H       2017-12-29 2018-03-21 East
12 I       2016-11-29 2020-04-10 West
13 J       2018-01-31 2018-12-20 West
14 K       2017-10-31 2018-10-30 North
15 K       2018-10-31 2019-10-30 North

I would like a result that looks like this:

      Month  Total North  East South  West
1 Feb 18     3     1     0     1     1
2 Mar 18     5     2     1     1     1
3 Apr 18     7     2     1     2     2
4 May 18     6     1     1     2     2
5 Jun 18     6     1     1     2     2
6 Jul 18     3     0     1     1     1

I can filter the data to get what I need for an individual month using

df %>%
  filter(Open <= as.Date("2018-04-30") & Close >= as.Date("2018-04-01")) %>%
  distinct(User, .keep_all = TRUE) %>%
  count(Area)

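What I can't work out is how to repeat this automatically for every month in the data set. Is there a way to have R repeat the above for each month in my data and then pass the results into a second table?

Thank you very much for any help, and thanks for your time.

Edit: added examples to the source data for which Martin Gal's solution returns NA when accounts span multiple years.

Here's how I would do it: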

library(tidyverse)

set.seed(14159)

## generating some data that looks roughly
##  like your data

data <- tibble(
  user = sample(LETTERS[1:5], size = 20, replace = TRUE),
  open = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 20),
  close = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 20),
  area = sample(c("N", "E", "S", "W"), 20, replace = T)
) %>%
  filter(
    close > open
  )

data
#> # A tibble: 9 × 4
#>   user  open       close      area 
#>   <chr> <date>     <date>     <chr>
#> 1 A     1999-04-03 1999-07-28 N    
#> 2 B     1999-01-27 1999-05-12 W    
#> 3 B     1999-06-05 1999-12-29 W    
#> 4 C     1999-09-26 1999-12-30 W    
#> 5 C     1999-04-21 1999-12-04 E    
#> 6 C     1999-08-11 1999-12-12 N    
#> 7 A     1999-02-13 1999-09-16 W    
#> 8 E     1999-02-17 1999-05-21 E    
#> 9 B     1999-07-26 1999-08-16 S

## figuring out what months are in between open and close
get_months_in_range <- function(open, close) {
  seq.Date(
    open,
    close,
    by = "month"
  ) %>%
    list()
}

data %>%
  rowwise() %>%
  mutate(
    Month = get_months_in_range(open, close)
  ) %>%
  ungroup() %>%
  unnest_longer(
    col = Month
  ) %>%
  count(Month, area) %>%
  pivot_wider(
    names_from = area,
    values_from = n,
    values_fill = 0
  ) %>%
  rowwise() %>%
  mutate(
    Total = sum(
      c_across(
        -Month
      )
    )
  ) %>%
  ungroup()
#> # A tibble: 45 × 6
#>    Month          W     E     N     S Total
#>    <date>     <int> <int> <int> <int> <int>
#>  1 1999-01-27     1     0     0     0     1
#>  2 1999-02-13     1     0     0     0     1
#>  3 1999-02-17     0     1     0     0     1
#>  4 1999-02-27     1     0     0     0     1
#>  5 1999-03-13     1     0     0     0     1
#>  6 1999-03-17     0     1     0     0     1
#>  7 1999-03-27     1     0     0     0     1
#>  8 1999-04-03     0     0     1     0     1
#>  9 1999-04-13     1     0     0     0     1
#> 10 1999-04-17     0     1     0     0     1
#> # … with 35 more rows

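Created on 2021-08-18 by the reprex package (v2.0.1)

This isn't the sexiest solution in the world, but I think it will get you where you want to go. Basically, I just made a helper function that gives me a sequence of dates, one month apart, between open and close; you can then group by those dates to work out how many users you have in any given month. Note that, as written, each sequence starts on the account's own open day, so Month holds dates such as 1999-01-27 rather than calendar months, and accounts that overlap are counted separately. Flooring the dates to the first of the month and de-duplicating users before counting would give one row per calendar month. If you'd like more explanation of what the long dplyr chain is doing, let me know.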
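As a minimal sketch of that month-expansion idea, the block below floors each account's dates to the first of the month and counts distinct users rather than accounts. The names `accounts` and `monthly_users` are made up for illustration, and the rows are the first ten accounts from the question's table; it assumes `Open` and `Close` are already of class Date.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# First ten accounts from the question's table (illustrative copy)
accounts <- tibble(
  User  = c("A", "B", "B", "C", "C", "D", "E", "E", "F", "G"),
  Open  = as.Date(c("2018-02-13", "2018-02-26", "2018-02-27", "2018-02-27",
                    "2018-03-15", "2018-03-20", "2018-04-01", "2018-04-14",
                    "2018-03-20", "2018-04-26")),
  Close = as.Date(c("2018-07-31", "2018-06-04", "2018-03-15", "2018-05-26",
                    "2018-06-03", "2018-07-02", "2018-06-19", "2018-05-04",
                    "2018-04-19", "2018-07-04")),
  Area  = c("West", "North", "North", "South", "South", "East",
            "West", "West", "North", "South")
)

monthly_users <- accounts %>%
  rowwise() %>%
  # One entry per calendar month the account touches, keyed by the 1st of the month
  mutate(Month = list(seq(floor_date(Open, "month"),
                          floor_date(Close, "month"),
                          by = "month"))) %>%
  ungroup() %>%
  unnest_longer(Month) %>%
  # A user with several accounts in the same month and area counts once
  distinct(User, Area, Month) %>%
  count(Month, Area) %>%
  pivot_wider(names_from = Area, values_from = n, values_fill = 0) %>%
  mutate(Total = rowSums(across(-Month))) %>%
  arrange(Month)

monthly_users
```

On these ten rows the totals should match the table requested in the question (3, 5, 7, 6, 6, 3 for February to July 2018). One design caveat: `Total` here is the sum of the area columns, so a user holding accounts in two different areas in the same month would be counted twice; counting `distinct(User, Month)` separately would avoid that.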

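Welcome to SO. I couldn't test this code because you didn't provide a snippet of your data in a usable format (see below for advice on that), but I think the basic idea is to extract a year-month value from Open and then use group_by. For example: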

library(lubridate)
library(dplyr)

df %>%
  mutate(
    # Open is written year-month-day, so parse it with as.Date() rather than dmy()
    Date = as.Date(Open),
    Month_Yr = format_ISO8601(Date, precision = "ym")
  ) %>%
  group_by(Month_Yr) %>%
  distinct(User, .keep_all = TRUE) %>%
  count(Area)

When sharing data on SO, it is usually best to use dput. If you're not sure how to use it, see ?dput.
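For illustration, here is a small sketch of what sharing data with dput looks like. The two-row `df` below is a hypothetical stand-in for the real data frame, not the asker's data.

```r
# Hypothetical stand-in for the real data frame
df <- data.frame(
  User  = c("A", "B"),
  Open  = as.Date(c("2018-02-13", "2018-02-26")),
  Close = as.Date(c("2018-07-31", "2018-06-04")),
  Area  = c("West", "North")
)

# Prints an R expression that recreates the first rows; paste that text into the post
dput(head(df, 10))

# Readers can rebuild the object by evaluating the pasted text
df_copy <- eval(parse(text = paste(capture.output(dput(df)), collapse = "\n")))
all.equal(df, df_copy)
```

Using `head()` keeps the pasted output short while still preserving column types such as Date, which a plain printed table loses.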

Here is a general solution that also works for date ranges spanning more than one year.

library(dplyr)
library(tidyr)
library(lubridate)

df %>%
  group_by(rn = row_number()) %>%
  mutate(seq = list(seq(month(Open), month(Close) + 12 * (year(Close) - year(Open))))) %>% 
  unnest(seq) %>%
  mutate(
    seq_2 = (seq - 1) %% 12 + 1,
    month = month(seq_2, label = TRUE),
    # integer arithmetic avoids the NA that adding months to a date such as the 31st can produce
    year  = year(Open) + (seq - 1) %/% 12
    ) %>% 
  ungroup() %>% 
  distinct(User, month, year, Area) %>% 
  count(month, year, Area) %>% 
  pivot_wider(
    names_from = "Area", 
    values_from = "n", 
    values_fill = 0
    ) %>% 
  mutate(Total = rowSums(across(c(North, South, West, East))))

returns

  month  year North South  West  East Total
  <ord> <dbl> <int> <int> <int> <int> <dbl>
1 Feb    2018     1     1     1     0     3
2 Mar    2018     2     1     1     1     5
3 Apr    2018     2     2     2     1     7
4 May    2018     1     2     2     1     6
5 Jun    2018     1     2     2     1     6
6 Jul    2018     0     1     1     1     3
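The pipeline above numbers months consecutively past 12 when an account crosses a year boundary. A minimal base R sketch of the same idea follows; `months_covered` is a hypothetical helper written for this illustration, not part of the answer. Working on a whole-number month index sidesteps calendar arithmetic on the day of the month, which matters for accounts opened on the 31st, where adding lubridate `months()` to the date can yield NA.

```r
# Hypothetical helper: list the calendar months an account touches,
# using only integer arithmetic on a running month index.
months_covered <- function(open, close) {
  open_lt  <- as.POSIXlt(open)
  close_lt <- as.POSIXlt(close)
  # Months counted from year 0; $year is years since 1900 and $mon is 0-based
  first <- (open_lt$year + 1900) * 12 + open_lt$mon
  last  <- (close_lt$year + 1900) * 12 + close_lt$mon
  idx <- seq(first, last)
  data.frame(year = idx %/% 12, month = idx %% 12 + 1)
}

# An account opened on the 31st that runs into the next year
months_covered(as.Date("2017-10-31"), as.Date("2018-01-15"))
```

For this example the helper returns four rows: October, November and December 2017, then January 2018.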

Data

df <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), User = c("A", 
"B", "B", "C", "C", "D", "E", "E", "F", "G"), Open = structure(c(17575, 
17588, 17589, 17589, 17605, 17610, 17622, 17635, 17610, 17647
), class = "Date"), Close = structure(c(17743, 17686, 17605, 
17677, 17685, 17714, 17701, 17655, 17640, 17716), class = "Date"), 
    Area = c("West", "North", "North", "South", "South", "East", 
    "West", "West", "North", "South")), problems = structure(list(
    row = 10L, col = "Area", expected = "", actual = "embedded null", 
    file = "literal data"), row.names = c(NA, -1L), class = c("tbl_df", 
"tbl", "data.frame")), class = c("spec_tbl_df", "tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -10L), spec = structure(list(
    cols = list(id = structure(list(), class = c("collector_double", 
    "collector")), User = structure(list(), class = c("collector_character", 
    "collector")), Open = structure(list(format = ""), class = c("collector_date", 
    "collector")), Close = structure(list(format = ""), class = c("collector_date", 
    "collector")), Area = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1L), class = "col_spec"))