R: conditionally matching a date-time from one data frame to the closest date-time field in a second data frame

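I have two data frames, df.events and df.activ.

df.activ holds very fine-grained, minute-level data and has an order of magnitude more records (1,000,000+) than df.events, which has about 100,000 records, also at minute-level granularity. The two data frames share two fields, DateTime and Geo. Both DateTime columns are in as.POSIXlt format, %Y-%m-%d %H:%M:%S.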

df.activ <- read.table(text=
                          '"DateTime","Geo","Bin1","Bin2"
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:11:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,510,0,1
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:12:00,NA,0,0
                        2014-07-01 00:13:00,618,1,1
                        2014-07-01 00:13:00,510,0,1
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0
                        2014-07-01 00:13:00,NA,0,0',header=TRUE,sep=",")

df.events <- read.table(text=
                          '"Units","Geo","DateTime"
                        225,999,2014-07-01 00:09:00
                        40,510,2014-07-01 00:12:00
                        5,999,2014-07-01 00:28:00
                        115,999,2014-07-01 00:44:00
                        0,999,2014-07-01 00:47:00',header=TRUE,sep=",")

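My goal is to merge each df.activ row with the closest DateTime in df.events, provided the Geo field in that df.events row is 999.

If the Geo in df.events is not 999, then I only want to merge the df.events row when the Geo fields match (for example, the Geo = 510 cases in the data frames provided).

I know that for loops are not the right way to solve problems in R, but conceptually I'm looking to do a nested for loop: loop down the DateTime field of df.activ and bring the closest DateTime from df.events into the record if that event's Geo field is 999 or matches the Geo field in df.activ.

The data frame below is what I'm after: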

df.idealresults <- read.table(text=
                              'DateTime,Geo,Bin1,Bin2,events.DateTime,events.Units,Events.Geo
                              7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:11,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,510,0,1,7/1/2014 0:12,40,510
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:12,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,618,1,1,7/1/2014 0:09,225,999
                              7/1/2014 0:13,510,0,1,7/1/2014 0:12,40,510
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999
                              7/1/2014 0:13,NA,0,0,7/1/2014 0:09,225,999',header=TRUE,sep=',')

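So far I've been able to merge df.activ to the closest DateTime in df.events. I did this with an na.locf-based approach, inspired by the second half of the answer in this SO post. I've been struggling to work the Geo matching logic into that approach; the nature of na.locf makes this hard to get right, because it relies on the NAs created when the two vectors are bound together before the merge step.

Sometimes it's hard to avoid loops, especially in a situation like yours. Sometimes we end up putting a lot of effort into avoiding them when they may be the best we can do, or are not far behind in performance and/or readability. That said, this does the trick: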

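# Convert both DateTime columns to POSIXct so they can be subtracted
df.activ$DateTime <- as.POSIXct(df.activ$DateTime)
df.events$DateTime <- as.POSIXct(df.events$DateTime)

# Start from df.activ and add empty columns 5:7 to hold the matched event
results <- df.activ
results$events.Units <- NA
results$events.Geo <- NA
results$events.Datetime <- NA

for (i in seq_len(nrow(df.activ))) {
  # Row indices of df.events, ordered from nearest to farthest in time
  diffs <- order(abs(df.activ$DateTime[i] - df.events$DateTime))
  for (j in seq_along(diffs)) {
    # Accept the nearest event whose Geo is 999 or equals this row's Geo;
    # isTRUE() makes the comparison FALSE when df.activ$Geo[i] is NA
    if (df.events$Geo[diffs[j]] == 999 ||
        isTRUE(df.events$Geo[diffs[j]] == df.activ$Geo[i])) {
      results[i, 5:7] <- df.events[diffs[j], ]
      break
    }
  }
}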

results$events.DateTime <- as.POSIXct(results$events.Datetime,origin = "1970-01-01")

results
              DateTime Geo Bin1 Bin2 events.Units events.Geo events.Datetime     events.DateTime
1  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
2  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
3  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
4  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
5  2014-07-01 00:11:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
6  2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
7  2014-07-01 00:12:00 510    0    1           40        510      1404187920 2014-07-01 00:12:00
8  2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
9  2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
10 2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
11 2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
12 2014-07-01 00:12:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
13 2014-07-01 00:13:00 618    1    1          225        999      1404187740 2014-07-01 00:09:00
14 2014-07-01 00:13:00 510    0    1           40        510      1404187920 2014-07-01 00:12:00
15 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
16 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
17 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
18 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
19 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
20 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
21 2014-07-01 00:13:00  NA    0    0          225        999      1404187740 2014-07-01 00:09:00
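
As a hedged sketch of the same matching rule without the inner loop: for each activity row, restrict the events to the eligible ones first (Geo 999, or Geo equal to the row's Geo), then take the one with the smallest time gap. The frames `activ` and `events` below are small stand-ins shaped like the question's data, and `nearest_event()` is a hypothetical helper, not part of the answer above.

```r
# Stand-in data (assumed), a cut-down version of df.activ and df.events
activ <- data.frame(
  DateTime = as.POSIXct(c("2014-07-01 00:11:00", "2014-07-01 00:12:00",
                          "2014-07-01 00:13:00", "2014-07-01 00:13:00"),
                        tz = "UTC"),
  Geo = c(NA, 510, 618, 510)
)
events <- data.frame(
  Units = c(225, 40, 5),
  Geo = c(999, 510, 999),
  DateTime = as.POSIXct(c("2014-07-01 00:09:00", "2014-07-01 00:12:00",
                          "2014-07-01 00:28:00"), tz = "UTC")
)

# Hypothetical helper: index of the nearest eligible event for activ row i
nearest_event <- function(i) {
  g <- activ$Geo[i]
  # Eligible events: Geo 999, or the same Geo as this row (never when g is NA)
  ok <- which(events$Geo == 999 | (!is.na(g) & events$Geo == g))
  if (length(ok) == 0) return(NA_integer_)
  gap <- abs(as.numeric(difftime(events$DateTime[ok], activ$DateTime[i],
                                 units = "secs")))
  ok[which.min(gap)]
}

idx <- vapply(seq_len(nrow(activ)), nearest_event, integer(1))

matched <- data.frame(
  activ,
  events.Units = events$Units[idx],
  events.Geo = events$Geo[idx],
  events.DateTime = events$DateTime[idx]
)
```

This still does one pass over the events per activity row, so it is not expected to scale better than the loop; it mainly keeps the date-time column as POSIXct and makes the eligibility rule explicit.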

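I'm at work and this question seems to be more or less solved, so I'll keep it short. You could also do a full outer merge and then simply take the difference of the dates, then keep the distinct rows after ordering by the absolute value of the date difference.

This is probably the fastest way to do the merge algorithmically, but it needs more RAM than the loop (your full merge will have n1*n2 observations).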
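
A minimal base R sketch of that idea, under stated assumptions: the frames `activ` and `events` are small stand-ins for the question's data, the `id` column is added here only so that duplicate activity rows can be told apart, and the event columns are prefixed with `events.` before merging to avoid name clashes. `merge()` with `by = NULL` returns the Cartesian product of the two frames.

```r
# Stand-in data (assumed); id is a helper column identifying each activ row
activ <- data.frame(
  id = 1:4,
  DateTime = as.POSIXct(c("2014-07-01 00:11:00", "2014-07-01 00:12:00",
                          "2014-07-01 00:13:00", "2014-07-01 00:13:00"),
                        tz = "UTC"),
  Geo = c(NA, 510, 618, 510)
)
events <- data.frame(
  Units = c(225, 40, 5),
  Geo = c(999, 510, 999),
  DateTime = as.POSIXct(c("2014-07-01 00:09:00", "2014-07-01 00:12:00",
                          "2014-07-01 00:28:00"), tz = "UTC")
)
names(events) <- paste0("events.", names(events))

# Full cross join: nrow(activ) * nrow(events) rows
pairs <- merge(activ, events, by = NULL)
n_pairs <- nrow(pairs)

# Keep only eligible pairings: event Geo is 999 or equals the activity Geo
keep <- pairs$events.Geo == 999 |
  (!is.na(pairs$Geo) & pairs$events.Geo == pairs$Geo)
pairs <- pairs[which(keep), ]

# Absolute time gap in seconds, then the closest pairing per activity row
pairs$gap <- abs(as.numeric(difftime(pairs$DateTime, pairs$events.DateTime,
                                     units = "secs")))
pairs <- pairs[order(pairs$id, pairs$gap), ]
nearest <- pairs[!duplicated(pairs$id), ]
```

With the record counts given in the question, the intermediate `pairs` frame would hold on the order of 1,000,000 × 100,000 rows, so in practice the cross join would probably need to be done in chunks (for example, per day or per block of activity rows).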