R: Optimal way to create a time series from start and end dates for groups
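I have a data set in which each group has a start date and an end date. I want to convert it into a data set with one row per group for each time period (month) between those dates.
Here is a sample of the input data; groups are identified by id: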
dat <- structure(list(id = c(723654, 885618, 269861, 1383642, 250276,
815511, 1506680, 1567855, 667345, 795731), startdate = c("2008-06-29",
"2008-12-01", "2006-09-27", "2010-02-03", "2006-08-31", "2008-09-10",
"2010-04-11", "2010-05-15", "2008-04-12", "2008-08-28"), enddate = c("2008-08-13",
"2009-02-08", "2007-10-12", "2010-09-09", "2007-06-30", "2010-04-27",
"2010-04-13", "2010-05-16", "2010-04-20", "2010-03-09")), .Names = c("id",
"startdate", "enddate"), class = "data.frame", row.names = c("1",
"2", "3", "4", "6", "7", "8", "9", "10", "11"))
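I wrote a function and vectorized it. The function takes the three values stored in each row and generates a monthly time series tagged with the group identifier.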
genDateRange<-function(start, end, id){
dates<-seq(as.Date(start), as.Date(end), by="month")
return( cbind(month=as.character(dates), id=rep(id, length(dates))))
}
genDataRange<-Vectorize(genDateRange)
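To make the shape of the intermediate result concrete, here is a minimal sketch of what the function returns for two of the sample rows. It calls the function through `Map` instead of the `Vectorize` wrapper; that substitution is an assumption of this sketch, chosen because `Map` always returns a list, with one character matrix per input row.

```r
# Same function as above, written out so the sketch is self-contained.
genDateRange <- function(start, end, id) {
  dates <- seq(as.Date(start), as.Date(end), by = "month")
  cbind(month = as.character(dates), id = rep(id, length(dates)))
}

# Two rows taken from the sample data.
starts <- c("2008-06-29", "2008-12-01")
ends   <- c("2008-08-13", "2009-02-08")
ids    <- c(723654, 885618)

# Map() applies the function row by row and always returns a list;
# unname() drops the names that Map() derives from the start dates.
pieces <- unname(Map(genDateRange, starts, ends, ids))

# Each element is a character matrix with one row per month.
rows_per_group <- vapply(pieces, nrow, integer(1))
print(rows_per_group)
```

Because `cbind` mixes a character column with a numeric one, the id is stored as text in each matrix.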
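I run the function as follows to combine the results into a single table. The output has more than 6 million rows, so this takes a very long time. I need a faster approach.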
range<-do.call(rbind,genDataRange(dat$startdate, dat$enddate, dat$id))
The first ten rows of the output look like this:
structure(c("2008-06-29", "2008-07-29", "2008-12-01", "2009-01-01",
"2009-02-01", "2006-09-27", "2006-10-27", "2006-11-27", "2006-12-27",
"2007-01-27", "723654", "723654", "885618", "885618", "885618",
"269861", "269861", "269861", "269861", "269861"), .Dim = c(10L,
2L), .Dimnames = list(NULL, c("month", "id")))
I am hoping there is a faster way to do this. I suspect I have been too focused on one approach and am missing a simpler solution.
For large data sets, this
library(data.table)
range <- rbindlist(lapply(genDataRange(dat$startdate, dat$enddate, dat$id),as.data.frame))
should be faster than
range<-do.call(rbind,genDataRange(dat$startdate, dat$enddate, dat$id))
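As a rough check that the two calls build the same rows, the following sketch runs both on two rows of the sample data. It assumes the data.table package is installed, and it uses `Map` in place of the `Vectorize` wrapper so that the intermediate result is always a list. It makes no timing claim.

```r
library(data.table)

genDateRange <- function(start, end, id) {
  dates <- seq(as.Date(start), as.Date(end), by = "month")
  cbind(month = as.character(dates), id = rep(id, length(dates)))
}

# Two rows from the sample data; unname() keeps the list free of names.
pieces <- unname(Map(genDateRange,
                     c("2008-06-29", "2008-12-01"),
                     c("2008-08-13", "2009-02-08"),
                     c(723654, 885618)))

# Base R: bind the character matrices row-wise.
slow <- do.call(rbind, pieces)

# data.table: convert each matrix to a data frame, then bind them in one call.
fast <- rbindlist(lapply(pieces, as.data.frame))

# Both results hold the same months and ids, in the same order.
stopifnot(nrow(fast) == nrow(slow),
          all(fast$month == slow[, "month"]),
          all(fast$id == slow[, "id"]))
```

Note that `slow` is a character matrix while `fast` is a data.table with character columns, so in either case the dates and ids would still need converting back before being used as such.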
There is no need for the generator function or rbindlist, because data.table can handle this easily without them.
# start with a data.table and date columns
library(data.table)
dat <- data.table(dat)
dat[,`:=`(startdate = as.Date(startdate), enddate = as.Date(enddate))]
dat[,num_mons:= length(seq(from=startdate, to=enddate, by='month')),by=1:nrow(dat)]
dat # now your data.table looks like this
# id startdate enddate num_mons
# 1: 723654 2008-06-29 2008-08-13 2
# 2: 885618 2008-12-01 2009-02-08 3
# 3: 269861 2006-09-27 2007-10-12 13
# 4: 1383642 2010-02-03 2010-09-09 8
# 5: 250276 2006-08-31 2007-06-30 10
# 6: 815511 2008-09-10 2010-04-27 20
# 7: 1506680 2010-04-11 2010-04-13 1
# 8: 1567855 2010-05-15 2010-05-16 1
# 9: 667345 2008-04-12 2010-04-20 25
# 10: 795731 2008-08-28 2010-03-09 19
out <- dat[, list(month=seq.Date(startdate, by="month",length.out=num_mons)), by=id]
out
# id month
# 1: 723654 2008-06-29
# 2: 723654 2008-07-29
# 3: 885618 2008-12-01
# 4: 885618 2009-01-01
# 5: 885618 2009-02-01
# ---
# 98: 795731 2009-10-28
# 99: 795731 2009-11-28
# 100: 795731 2009-12-28
# 101: 795731 2010-01-28
# 102: 795731 2010-02-28
This question is related; the difference is that in the question you asked we are iterating over the rows of the data.table rather than replicating them.
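As a variation on the data.table code above, the helper column can be skipped by calling `seq` with both endpoints inside the grouped expression. This is a minimal sketch on two of the sample rows, not the answer's own code. It assumes data.table is installed and that each id appears in only one row, since grouping by id would otherwise pass several start dates to `seq` at once.

```r
library(data.table)

# Two rows from the sample data, with the dates already converted.
dat <- data.table(
  id        = c(723654, 885618),
  startdate = as.Date(c("2008-06-29", "2008-12-01")),
  enddate   = as.Date(c("2008-08-13", "2009-02-08"))
)

# One group per id; seq() expands each start/end pair into monthly dates.
out <- dat[, .(month = seq(startdate, enddate, by = "month")), by = id]
print(out)
```

For these two rows the result has five rows, matching the first five rows of the output shown above.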