Defining elapsed months for every row in a data.table in R

I have a data.table as shown below:

    ID   Date_col    Value
    1    2017-08-01  A
    1    2017-09-01  A
    1    2017-10-01  B
    2    2017-06-01  A
    2    2017-07-01  A        
    2    2017-08-01  C        
    2    2017-09-01  A   

I would like to add a column that gives, per ID, the number of months elapsed since the first observation, like this:

    ID   Date_col    Value  Months_spent  
    1    2017-08-01  A      0
    1    2017-09-01  A      1
    1    2017-10-01  B      2
    2    2017-06-01  A      0
    2    2017-07-01  A      1 
    2    2017-08-01  C      2 
    2    2017-09-01  A      3

I tried the following, but got an error saying that 'to' must be of length 1:

DT[, Months_spent := length(seq.Date(Date_col[1L], Date_col, by = "month")), by = ID]

Please help me resolve this error. Of course, any other efficient solution would be appreciated as well.
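
The error arises because seq.Date() expects a single `to` date, whereas Date_col is the whole vector of dates within each group. As a minimal sketch of a direct fix (assuming the rows are sorted by date within each ID, since seq.Date() with a positive `by` step fails when `to` is earlier than `from`), the sequence can be built once per row:

```r
library(data.table)

# Sample data matching the question
DT <- data.table(
  ID = c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Date_col = as.Date(c("2017-08-01", "2017-09-01", "2017-10-01",
                       "2017-06-01", "2017-07-01", "2017-08-01", "2017-09-01")),
  Value = c("A", "A", "B", "A", "A", "C", "A")
)

# Build one monthly sequence per row, so that `to` is always a single date.
# The sequence includes the start month, hence the "- 1L".
DT[, Months_spent := vapply(
  seq_along(Date_col),
  function(i) length(seq(Date_col[1L], Date_col[i], by = "month")) - 1L,
  integer(1L)
), by = ID]
```

This builds a sequence for every row, so it is likely to be slow on large tables. It is meant to show where the error comes from rather than to be the preferred solution.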

We can use dplyr, with month() taken from lubridate:

library(dplyr)
library(lubridate)  # provides month()
df1 %>%
    group_by(ID) %>%
    mutate(months_spent = month(Date_col) - first(month(Date_col)))
# A tibble: 7 x 4
# Groups: ID [2]
#     ID   Date_col Value months_spent
#  <int>     <date> <chr>        <int>
#1     1 2017-08-01     A            0
#2     1 2017-09-01     A            1
#3     1 2017-10-01     B            2
#4     2 2017-06-01     A            0
#5     2 2017-07-01     A            1
#6     2 2017-08-01     C            2
#7     2 2017-09-01     A            3

Or, if the dates span multiple years:

df1 %>% 
    group_by(ID) %>% 
    mutate(months_spent =  as.integer(round((Date_col - first(Date_col))/(365/12))))
# A tibble: 7 x 4
# Groups: ID [2]
#     ID   Date_col Value months_spent
#  <int>     <date> <chr>        <int>
#1     1 2017-08-01     A            0
#2     1 2017-09-01     A            1
#3     1 2017-10-01     B            2
#4     2 2017-06-01     A            0
#5     2 2017-07-01     A            1
#6     2 2017-08-01     C            2
#7     2 2017-09-01     A            3
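
Dividing the day difference by 365/12 is an approximation, because months differ in length. As a hedged alternative sketch, the calendar-month difference can be computed exactly from the year and month components. This assumes lubridate is available for year() and month(); the small `df1` below is made up for illustration and crosses a year boundary.

```r
library(dplyr)
library(lubridate)

# Illustrative data (not from the question) that crosses a year boundary
df1 <- data.frame(
  ID = c(1L, 1L, 1L),
  Date_col = as.Date(c("2017-11-01", "2017-12-01", "2018-02-01")),
  Value = c("A", "A", "B")
)

res <- df1 %>%
  group_by(ID) %>%
  # 12 months per year of difference, plus the difference in month numbers
  mutate(months_spent = 12 * (year(Date_col) - year(first(Date_col))) +
           (month(Date_col) - month(first(Date_col)))) %>%
  ungroup()
```

Like the snippets above, this takes first(Date_col) as the reference, so it assumes each ID starts with its earliest date.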

Here is an option with data.table:

dt[, Months_spent := {
  full_seq <- seq.Date(Date_col[1L], max(Date_col), by = "month")
  match(Date_col, full_seq) - 1L
}, by = ID]
#    ID   Date_col Value Months_spent
# 1:  1 2017-08-01     A            0
# 2:  1 2017-09-01     A            1
# 3:  1 2017-10-01     B            2
# 4:  2 2017-06-01     A            0
# 5:  2 2017-07-01     A            1
# 6:  2 2017-08-01     C            2
# 7:  2 2017-09-01     A            3

This assumes that the data are already sorted, i.e. each ID starts with its earliest month, and that Date_col is of class Date or IDate.
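
The ordering assumption can be relaxed by starting the sequence at min(Date_col) instead of Date_col[1L]. The sketch below uses made-up, unordered data and also includes one mid-month date, to show that match() returns NA for any date that does not fall exactly on the monthly sequence.

```r
library(data.table)

# Illustrative data: unordered, with one mid-month date
dt <- data.table(
  ID = c(1L, 1L, 1L, 1L),
  Date_col = as.Date(c("2017-10-01", "2017-08-01", "2017-09-01", "2017-09-15")),
  Value = c("B", "A", "A", "X")
)

dt[, Months_spent := {
  # Start at the earliest date of the group, wherever it sits in the table
  full_seq <- seq(min(Date_col), max(Date_col), by = "month")
  # Position in the monthly sequence, zero-based; NA if the date is not on it
  match(Date_col, full_seq) - 1L
}, by = ID]
```

The mid-month row ends up as NA, so this variant is only suitable when every date lies on the same day of the month as the earliest date of its ID.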

Here is a robust solution using dplyr and zoo. It groups by ID and always computes the month difference relative to the first Date_col value of that ID.

library(dplyr)
library(zoo)

df <- data.frame(ID = c(1,1,1,2,2,2,2),
                 Date_col = c("8/1/2017","9/1/2017","10/1/2017","6/1/2017","7/1/2017","8/1/2017","9/1/2017"),
                 Value = c("A","A","B","A","A","C","A"),
                 stringsAsFactors = FALSE)


df$Date_col <- as.Date(df$Date_col, format = "%m/%d/%Y")

df <- df %>%
      arrange(ID, Date_col) %>%
      group_by(ID) %>%
      # yearmon values are stored as year + fraction, so round() guards against floating-point error
      mutate(Months_spent = round(12 * as.numeric(as.yearmon(Date_col) - as.yearmon(first(Date_col)))))

#> df
#Source: local data frame [7 x 4]
#Groups: ID [2]
#
#     ID   Date_col Value Months_spent
#  <dbl>     <date> <chr>        <dbl>
#1     1 2017-08-01     A            0
#2     1 2017-09-01     A            1
#3     1 2017-10-01     B            2
#4     2 2017-06-01     A            0
#5     2 2017-07-01     A            1
#6     2 2017-08-01     C            2
#7     2 2017-09-01     A            3

Here is a data.table solution that also works with unordered dates, gaps in the monthly sequence, and dates within a month, e.g. 2017-08-15:

as.IMonth <- function(x) 12 * year(x) + month(x)
DT2[, Months_spent := as.IMonth(Date_col) - as.IMonth(min(Date_col)), by = ID][]

   ID   Date_col Value Months_spent
1:  1 2017-08-01     A            0
2:  1 2017-09-01     A            1
3:  1 2017-10-01     B            2
4:  1 2018-02-01     X            6
5:  2 2017-06-01     A            0
6:  2 2017-07-01     A            1
7:  2 2017-08-01     C            2
8:  2 2017-08-15     X            2
9:  2 2017-09-01     A            3

Note that the rows with Value == X were added to the OP's sample dataset to demonstrate gaps and within-month dates.
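
As a small worked check of the as.IMonth() helper, the sketch below applies it to a few made-up dates around a year boundary. The helper is the one defined above (written here with an integer literal `12L`); the dates are illustrative only.

```r
library(data.table)

# Month index: months counted from year 0, so differences are calendar months
as.IMonth <- function(x) 12L * year(x) + month(x)

# Illustrative dates, including a year change and non-first days of the month
d <- as.IDate(c("2017-08-15", "2017-12-31", "2018-01-01", "2018-02-01"))

# Elapsed calendar months relative to the earliest date
elapsed <- as.IMonth(d) - as.IMonth(min(d))
```

Because only the year and month are used, the day of the month is ignored: 2017-12-31 and 2018-01-01 are one day apart, yet they count as one month apart.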

Data

DT <- fread(
  "    ID   Date_col    Value
  1    2017-08-01  A
  1    2017-09-01  A
  1    2017-10-01  B
  2    2017-06-01  A
  2    2017-07-01  A        
  2    2017-08-01  C        
  2    2017-09-01  A  ")[, Date_col := as.IDate(Date_col)][]

DT2 <- rbind(DT, 
             fread("ID Date_col Value\n1 2018-02-01 X\n2 2017-08-15 X")[
               , Date_col := as.IDate(Date_col)])
setorder(DT2, ID, Date_col)