Defining elapsed months for every row in a data.table in R
I have a data.table that looks like this:
ID Date_col Value
1 2017-08-01 A
1 2017-09-01 A
1 2017-10-01 B
2 2017-06-01 A
2 2017-07-01 A
2 2017-08-01 C
2 2017-09-01 A
I would like to add a column that gives, per ID, the number of months elapsed since the first observation, like this:
ID Date_col Value Months_spent
1 2017-08-01 A 0
1 2017-09-01 A 1
1 2017-10-01 B 2
2 2017-06-01 A 0
2 2017-07-01 A 1
2 2017-08-01 C 2
2 2017-09-01 A 3
I tried the following, but it fails with an error saying that the "to" date must be of length 1:
DT[, Months_spent := length(seq.Date(Date_col[1L], Date_col, by = "month")), by = ID]
Please help me resolve this error. Of course, any other efficient solution would be appreciated as well.
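On the error itself: seq.Date() accepts a single `to` date, whereas inside `by = ID` the expression `Date_col` is the whole vector of dates for that group. Below is a minimal sketch of one way to keep the original idea by building one sequence per row with vapply(). It assumes the rows of each ID are sorted so that the first date is the earliest, and it rebuilds the sample data inline so that it runs on its own.

```r
library(data.table)

# Sample data from the question, rebuilt inline for a self-contained example
DT <- data.table(
  ID = c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Date_col = as.Date(c("2017-08-01", "2017-09-01", "2017-10-01",
                       "2017-06-01", "2017-07-01", "2017-08-01", "2017-09-01")),
  Value = c("A", "A", "B", "A", "A", "C", "A")
)

# seq() takes exactly one `to` date, so loop over the rows of each group
# and count the monthly steps from the group's first date to each row's date.
# Assumption: Date_col[1L] is the earliest date within each ID.
DT[, Months_spent := vapply(
  seq_along(Date_col),
  function(i) length(seq(Date_col[1L], Date_col[i], by = "month")) - 1L,
  integer(1L)
), by = ID]

DT$Months_spent
#> [1] 0 1 2 0 1 2 3
```

This builds a full date sequence for every row, so it is mainly useful for understanding the error; the vectorised answers are likely to be faster on large tables.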
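We can use dplyr. This version only compares month numbers, so it assumes all dates of an ID fall within one calendar year:
library(data.table)  # provides month(), which returns an integer
library(dplyr)
df1 %>%
  group_by(ID) %>%
  mutate(months_spent = month(Date_col) - first(month(Date_col)))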
# A tibble: 7 x 4
# Groups: ID [2]
# ID Date_col Value months_spent
# <int> <date> <chr> <int>
#1 1 2017-08-01 A 0
#2 1 2017-09-01 A 1
#3 1 2017-10-01 B 2
#4 2 2017-06-01 A 0
#5 2 2017-07-01 A 1
#6 2 2017-08-01 C 2
#7 2 2017-09-01 A 3
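Or, if the dates span multiple years:
df1 %>%
  group_by(ID) %>%
  # approximate: days elapsed divided by the average month length (365/12 days)
  mutate(months_spent = as.integer(round((Date_col - first(Date_col)) / (365 / 12))))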
# A tibble: 7 x 4
# Groups: ID [2]
# ID Date_col Value months_spent
# <int> <date> <chr> <int>
#1 1 2017-08-01 A 0
#2 1 2017-09-01 A 1
#3 1 2017-10-01 B 2
#4 2 2017-06-01 A 0
#5 2 2017-07-01 A 1
#6 2 2017-08-01 C 2
#7 2 2017-09-01 A 3
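To see why the year matters, here is a small base-R sketch comparing the two calculations on a pair of dates that straddle a year boundary. The dates and the helper `month_num()` (a stand-in for month()) are made up for illustration and are not part of the answer above.

```r
# Hypothetical dates for illustration: the second one falls in the next year
first_date <- as.Date("2017-08-01")
later_date <- as.Date("2018-02-01")

# Base-R stand-in for month(): extract the month number as an integer
month_num <- function(x) as.integer(format(x, "%m"))

# Month-number difference ignores the year and goes negative
naive_diff <- month_num(later_date) - month_num(first_date)
naive_diff
#> [1] -6

# Day difference divided by the average month length, then rounded
days_elapsed <- as.numeric(later_date - first_date, units = "days")
approx_diff <- as.integer(round(days_elapsed / (365 / 12)))
approx_diff
#> [1] 6
```

The second calculation is an approximation based on an average month length, so it is best suited to dates that sit on the same day of each month.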
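Here is an option in data.table:
DT[, Months_spent := {
  # all month starts from the first date of the group up to its latest date
  full_seq <- seq.Date(Date_col[1L], max(Date_col), by = "month")
  # position of each date in that sequence, shifted to start at 0
  match(Date_col, full_seq) - 1L
}, by = ID]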
# ID Date_col Value Months_spent
# 1: 1 2017-08-01 A 0
# 2: 1 2017-09-01 A 1
# 3: 1 2017-10-01 B 2
# 4: 2 2017-06-01 A 0
# 5: 2 2017-07-01 A 1
# 6: 2 2017-08-01 C 2
# 7: 2 2017-09-01 A 3
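This assumes the data are already ordered, i.e. each ID starts with its earliest month, and that Date_col is a proper Date/IDate column. Dates that do not fall exactly on the monthly sequence (for example 2017-08-15) are not matched and yield NA.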
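The reliance on exact matches can be checked with a short base-R sketch. The three dates below are invented for illustration; the last one falls in the middle of a month.

```r
# Hypothetical dates: two month starts and one mid-month date
dates <- as.Date(c("2017-06-01", "2017-07-01", "2017-08-15"))

# Monthly sequence from the first date up to the latest date
full_seq <- seq(dates[1L], max(dates), by = "month")

# match() looks for exact dates, so 2017-08-15 is not found in the sequence
months_spent <- match(dates, full_seq) - 1L
months_spent
#> [1]  0  1 NA
```

If mid-month dates can occur, an approach based on year and month numbers avoids the NA.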
Here is a robust solution using dplyr and zoo. It groups by ID and always computes the month difference relative to the first Date_col value of that ID.
library(dplyr)
library(zoo)
df <- data.frame(ID = c(1,1,1,2,2,2,2),
                 Date_col = c("8/1/2017","9/1/2017","10/1/2017","6/1/2017","7/1/2017","8/1/2017","9/1/2017"),
                 Value = c("A","A","B","A","A","C","A"),
                 stringsAsFactors = FALSE)
df$Date_col <- as.Date(df$Date_col, format = "%m/%d/%Y")
df <- df %>%
  arrange(ID, Date_col) %>%  # sort so that first() is the earliest date per ID
  group_by(ID) %>%
  # yearmon differences are fractions of a year; round() removes floating-point noise
  mutate(Months_spent = round(12 * as.numeric(as.yearmon(Date_col) - as.yearmon(first(Date_col)))))
#> df
#Source: local data frame [7 x 4]
#Groups: ID [2]
#
# ID Date_col Value Months_spent
# <dbl> <date> <chr> <dbl>
#1 1 2017-08-01 A 0
#2 1 2017-09-01 A 1
#3 1 2017-10-01 B 2
#4 2 2017-06-01 A 0
#5 2 2017-07-01 A 1
#6 2 2017-08-01 C 2
#7 2 2017-09-01 A 3
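For readers unfamiliar with zoo, the sketch below shows why multiplying a yearmon difference by 12 gives months. A yearmon value is stored as a number of the form year + (month - 1) / 12. The two dates are made up for illustration, and round() is added here as a guard against floating-point noise.

```r
library(zoo)

# Hypothetical dates for illustration, six calendar months apart
first_date <- as.Date("2017-08-01")
later_date <- as.Date("2018-02-01")

# Underlying numeric value of a yearmon: 2017 + 7/12 for August 2017
as.numeric(as.yearmon(first_date))

# Subtracting two yearmon values gives a fraction of a year;
# multiplying by 12 converts that fraction to months.
months_between <- round(12 * as.numeric(as.yearmon(later_date) - as.yearmon(first_date)))
months_between
#> [1] 6
```

Because only the year and month enter the calculation, the day of the month does not affect the result.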
Here is a data.table solution that also works with unordered dates, gaps in the monthly sequence, and dates within a month, e.g. 2017-08-15:
as.IMonth <- function(x) 12 * year(x) + month(x)
DT2[, Months_spent := as.IMonth(Date_col) - as.IMonth(min(Date_col)), by = ID][]
ID Date_col Value Months_spent
1: 1 2017-08-01 A 0
2: 1 2017-09-01 A 1
3: 1 2017-10-01 B 2
4: 1 2018-02-01 X 6
5: 2 2017-06-01 A 0
6: 2 2017-07-01 A 1
7: 2 2017-08-01 C 2
8: 2 2017-08-15 X 2
9: 2 2017-09-01 A 3
Note that the rows with Value == X have been added to the OP's sample data set to demonstrate gaps and dates within a month.
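The idea behind as.IMonth is to turn each date into a running month count, 12 * year + month, so that subtraction works across year boundaries. Here is a minimal base-R sketch of the same arithmetic, without data.table's year() and month(). The helper name `month_index()` and the dates are hypothetical and chosen to be unordered, mid-month and spanning two years.

```r
# Running month count: 12 * year + month, using base R only
month_index <- function(x) {
  lt <- as.POSIXlt(x)
  # lt$year counts years since 1900, lt$mon is 0-based
  12L * (lt$year + 1900L) + lt$mon + 1L
}

# Hypothetical dates: unordered, one mid-month, one in the following year
dates <- as.Date(c("2017-08-15", "2017-06-01", "2018-02-01"))

# Months elapsed since the earliest date, regardless of row order
months_spent <- month_index(dates) - month_index(min(dates))
months_spent
#> [1] 2 0 8
```

Using min() rather than the first element is what makes the result independent of the row order.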
Data
library(data.table)
DT <- fread(
" ID Date_col Value
1 2017-08-01 A
1 2017-09-01 A
1 2017-10-01 B
2 2017-06-01 A
2 2017-07-01 A
2 2017-08-01 C
2 2017-09-01 A ")[, Date_col := as.IDate(Date_col)][]
DT2 <- rbind(DT,
fread("ID Date_col Value\n1 2018-02-01 X\n2 2017-08-15 X")[
, Date_col := as.IDate(Date_col)])
setorder(DT2, ID, Date_col)