Categorize based on date ranges in R

Question

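How can I categorize each row in a large R data frame (> 2 million rows) based on date ranges defined in a separate, much smaller R data frame (12 rows)?

My large data frame, captures, looks similar to this when called via head(captures):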

       id       date sex
1  160520 2016-11-22   1
2 1029735 2016-11-12   1
3 1885200 2016-11-05   1
4 2058366 2015-09-26   2
5 2058367 2015-09-26   1
6 2058368 2015-09-26   1

My small data frame, seasons, looks like this in its entirety:

Season Opening.Date Closing.Date
  2016   2016-09-24   2017-01-15
  2015   2015-09-26   2016-01-10
  2014   2014-09-27   2015-01-11
  2013   2013-09-28   2014-01-12
  2012   2012-09-22   2013-01-13
  2011   2011-09-24   2012-01-08
  2010   2010-09-25   2011-01-16
  2009   2009-09-26   2010-01-17
  2008   2008-09-27   2009-01-18
  2007   2007-09-22   2008-01-13
  2006   2006-09-23   2007-01-14
  2005   2005-09-24   2006-01-15

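I need to add a 'season' column to my captures data frame, where the value is determined by whether, and where, captures$date falls within the ranges defined in seasons.

Here is a long-hand solution I came up with that doesn't work for me because my data frame is too large.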

#add packages
library(dplyr)
library(lubridate)
#make blank column
captures$season=NA
for (i in 1:length(seasons$Season)){
  for (j in 1:length(captures$id)) {
    captures$season[j]=ifelse(between(captures$date[j],ymd(seasons$Opening.Date[i]),ymd(seasons$Closing.Date[i])),seasons$Season[i],captures$season[j])
  }
}

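Again, this isn't working for me, as R crashes every time. I also realize that this doesn't take advantage of vectorization in R. Any help is appreciated!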

Answer 1

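It would indeed be great if you could do a join operation efficiently based on a range of values instead of equality. Unfortunately, I don't know if a general solution exists. For now, I suggest using a single for loop.

Vectorization pays off most along the tallest data. That is, if we loop over one data.frame and vectorize over the other, it makes more sense to vectorize over the longer one and loop over the shorter one. With this in mind, we'll loop over the 12 rows of seasons and vectorize over the 2M rows of data.

Your data: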

txt <- "Season Opening.Date Closing.Date
  2016   2016-09-24   2017-01-15
  2015   2015-09-26   2016-01-10
  2014   2014-09-27   2015-01-11
  2013   2013-09-28   2014-01-12
  2012   2012-09-22   2013-01-13
  2011   2011-09-24   2012-01-08
  2010   2010-09-25   2011-01-16
  2009   2009-09-26   2010-01-17
  2008   2008-09-27   2009-01-18
  2007   2007-09-22   2008-01-13
  2006   2006-09-23   2007-01-14
  2005   2005-09-24   2006-01-15"
seasons <- read.table(text = txt, header = TRUE)
seasons[2:3] <- lapply(seasons[2:3], as.Date)

txt <- "       id       date sex
1  160520 2016-11-22   1
2 1029735 2016-11-12   1
3 1885200 2016-11-05   1
4 2058366 2015-09-26   2
5 2058367 2015-09-26   1
6 2058368 2015-09-26   1"
dat <- read.table(text = txt, header = TRUE)
dat$date <- as.Date(dat$date)

To start, we assume that season is not yet defined for any row:

dat$season <- NA

Loop over the rows of seasons:

for (i in seq_len(nrow(seasons))) {
  # Fill only rows that are still unassigned and fall inside season i.
  # Both ends are inclusive, matching between() in the question.
  dat$season <- ifelse(is.na(dat$season) &
                         dat$date >= seasons$Opening.Date[i] &
                         dat$date <= seasons$Closing.Date[i],
                       seasons$Season[i], dat$season)
}
dat
#        id       date sex season
# 1  160520 2016-11-22   1   2016
# 2 1029735 2016-11-12   1   2016
# 3 1885200 2016-11-05   1   2016
# 4 2058366 2015-09-26   2   2015
# 5 2058367 2015-09-26   1   2015
# 6 2058368 2015-09-26   1   2015
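
If the seasons never overlap, the loop can be dropped entirely. The following is only a sketch of that idea using base R's findInterval(), not part of the answer above. The trimmed seasons table, the sorted copy s, and the off-season row with id 999999 are made up for illustration.

```r
# Assumption: seasons do not overlap, so each date can belong to at most one.
seasons <- data.frame(
  Season       = c(2016, 2015, 2014),
  Opening.Date = as.Date(c("2016-09-24", "2015-09-26", "2014-09-27")),
  Closing.Date = as.Date(c("2017-01-15", "2016-01-10", "2015-01-11"))
)
dat <- data.frame(
  id   = c(160520, 2058366, 999999),  # 999999 is a hypothetical off-season row
  date = as.Date(c("2016-11-22", "2015-09-26", "2016-06-01"))
)

# findInterval() needs the break points in increasing order.
s <- seasons[order(seasons$Opening.Date), ]

# For each date: index of the latest season whose opening date is <= the date
# (0 when the date precedes every opening date).
idx <- findInterval(as.numeric(dat$date), as.numeric(s$Opening.Date))
idx[idx == 0] <- NA

# The date is in that season only if it is also on or before the closing date.
in_season <- !is.na(idx) & dat$date <= s$Closing.Date[idx]
dat$season <- ifelse(in_season, s$Season[idx], NA)
dat
```

Dates that fall between seasons are left as NA, just as in the loop version.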

Answer 2

Here's a solution using non-equi joins from data.table:

require(data.table) # v1.10.4+
setDT(captures) # convert data.frames to data.tables
setDT(seasons)

ans <- seasons[captures, Season,
                 on=.(Opening.Date<=date, Closing.Date>=date), 
                 mult="first"]
# [1] 2016 2016 2016 2015 2015 2015
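captures[, season := ans]

For each row in captures, the index of the first matching row in seasons (mult="first") is computed from the conditions given to the on argument. The value of Season at those indices is then returned and saved in ans, which is then added to captures as a new column by reference.

I've shown it in two steps for the sake of understanding.

You can see the first matching indices by using which=TRUE instead: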

seasons[captures, 
          on=.(Opening.Date<=date, Closing.Date>=date),
          mult="first", 
          which=TRUE]
# [1] 1 1 1 2 2 2
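
As a quick check of what happens to dates outside every season, here is a small self-contained sketch of the same join. The two-row seasons table and the third captures row (id 999999, an off-season date) are hypothetical. With the default nomatch = NA, that row should come back as NA rather than being dropped.

```r
library(data.table)

seasons <- data.table(
  Season       = c(2016L, 2015L),
  Opening.Date = as.Date(c("2016-09-24", "2015-09-26")),
  Closing.Date = as.Date(c("2017-01-15", "2016-01-10"))
)
captures <- data.table(
  id   = c(160520L, 2058366L, 999999L),  # 999999 is a hypothetical row
  date = as.Date(c("2016-11-22", "2015-09-26", "2016-06-01"))
)

# One result per row of captures; unmatched rows yield NA.
ans <- seasons[captures, Season,
               on = .(Opening.Date <= date, Closing.Date >= date),
               mult = "first"]

# Add the result to captures by reference.
captures[, season := ans]
captures
```

Because the result has one element per row of captures, it can be assigned straight back as a column.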

Answer 3

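You can try sqldf. Note that I had to rename Opening.Date and Closing.Date to Opening_Date and Closing_Date, because a dot in a column name clashes with SQL's table.column syntax.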

library(sqldf)

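# A left join keeps every captures row and attaches the matching Season to it,
# so the result does not depend on the order in which rows come back.
captures <- sqldf("select c.*, s.Season
                   from captures c
                   left join seasons s
                     on c.date >= s.Opening_Date and c.date <= s.Closing_Date")
captures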
       id       date sex Season
1  160520 2016-11-22   1   2016
2 1029735 2016-11-12   1   2016
3 1885200 2016-11-05   1   2016
4 2058366 2015-09-26   2   2015
5 2058367 2015-09-26   1   2015
6 2058368 2015-09-26   1   2015

Data

txt <- "Season Opening_Date Closing_Date
  2016   2016-09-24   2017-01-15
  2015   2015-09-26   2016-01-10
  2014   2014-09-27   2015-01-11
  2013   2013-09-28   2014-01-12
  2012   2012-09-22   2013-01-13
  2011   2011-09-24   2012-01-08
  2010   2010-09-25   2011-01-16
  2009   2009-09-26   2010-01-17
  2008   2008-09-27   2009-01-18
  2007   2007-09-22   2008-01-13
  2006   2006-09-23   2007-01-14
  2005   2005-09-24   2006-01-15"
seasons <- read.table(text = txt, header = TRUE)
seasons[2:3] <- lapply(seasons[2:3], as.Date)

txt <- "       id       date sex
1  160520 2016-11-22   1
2 1029735 2016-11-12   1
3 1885200 2016-11-05   1
4 2058366 2015-09-26   2
5 2058367 2015-09-26   1
6 2058368 2015-09-26   1"
captures <- read.table(text = txt, header = TRUE)
captures$date <- as.Date(captures$date)
