How to join two dataframes by nearest time-date?

Question

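I have 2 datasets, each containing a date-time value in POSIXlt format plus some other numeric and character variables.

I want to merge the two datasets on the date-time column. However, the timestamps in the two datasets do not match, so I need to combine them by the nearest date-time (before or after). In my example, the value "e" at 2016-03-01 23:52:00 needs to be merged with "binH" at 2016-03-02 00:00:00, not with "binG".

Is there a function that lets me combine my datasets by the nearest date-time value, even when that value is later?

I have found ways to merge each date to the previous date using the cut() function or roll=Inf in data.table. However, I could not get my timestamps into any format that roll='nearest' would accept.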

    > df1
                    date1 value
    1 2016-03-01 17:52:00     a
    2 2016-03-01 18:01:30     b
    3 2016-03-01 18:05:00     c
    4 2016-03-01 20:42:30     d
    5 2016-03-01 23:52:00     e

    > df2
                    date2 bin_name
    1 2016-03-01 17:00:00     binA
    2 2016-03-01 18:00:00     binB
    3 2016-03-01 19:00:00     binC
    4 2016-03-01 20:00:00     binD
    5 2016-03-01 21:00:00     binE
    6 2016-03-01 22:00:00     binF
    7 2016-03-01 23:00:00     binG
    8 2016-03-02 00:00:00     binH
    9 2016-03-02 01:00:00     binI
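
For illustration, the "merge to the previous date" behaviour described above can be reproduced in base R with findInterval(). This is only a sketch: the sample data is rebuilt by hand from the printout, the UTC time zone is an assumption, and the variable names prev_idx and prev_bin are made up for this example. It shows why "e" ends up in "binG" when only earlier timestamps are considered.

```r
# Sample data rebuilt by hand from the printout above (time zone UTC is an assumption)
date1 <- as.POSIXct(c("2016-03-01 17:52:00", "2016-03-01 18:01:30",
                      "2016-03-01 18:05:00", "2016-03-01 20:42:30",
                      "2016-03-01 23:52:00"), tz = "UTC")
value <- c("a", "b", "c", "d", "e")
date2 <- as.POSIXct("2016-03-01 17:00:00", tz = "UTC") + 3600 * (0:8)
bin_name <- paste0("bin", LETTERS[1:9])

# findInterval() gives, for each date1, the index of the last date2 that is <= date1
prev_idx <- findInterval(as.numeric(date1), as.numeric(date2))
prev_bin <- bin_name[prev_idx]

data.frame(date1, value, prev_bin)
```

With this approach "e" (23:52:00) is assigned to "binG" (23:00:00), which is exactly the result the question wants to avoid.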

Answer 1

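data.table should work for this (can you explain the error you ran into?), although it does tend to convert POSIXlt to POSIXct on its own (it may help to do that conversion on your date-time column manually to keep data.table happy). Also make sure you set the key column before using roll.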
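
As a minimal sketch of those two preparation steps (explicit conversion to POSIXct, then setting the key), assuming a small made-up table named dt and a UTC time zone, neither of which comes from the question:

```r
library(data.table)

# Illustrative timestamps stored as POSIXlt, as described in the question
lt <- as.POSIXlt(c("2016-03-01 18:01:30", "2016-03-01 17:52:00"), tz = "UTC")

# Convert explicitly to POSIXct before building the table
dt <- data.table(date1 = as.POSIXct(lt), value = c("b", "a"))

# Set the key; this also sorts the table by date1
setkey(dt, date1)
key(dt)
```

After setkey() the rows are ordered by date1, so "a" (17:52:00) comes before "b" (18:01:30).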

(I created my own example tables here to make my life a little easier. If you want to run dput on yours, I am happy to update this example with your data):

    library(data.table)

    new <- data.table(
      date = as.POSIXct(c("2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00")),
      data.new = c("t", "u", "v")
    )
    head(new, 2)

                      date data.new
    1: 2016-03-02 12:20:00        t
    2: 2016-03-07 12:20:00        u

    old <- data.table(
      date = as.POSIXct(c("2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00", "2015-03-02 12:20:00")),
      data.old = c("a", "b", "c", "d")
    )
    head(old, 2)

                      date data.old
    1: 2016-03-02 12:20:00        a
    2: 2016-03-07 12:20:00        b

    # Key both tables on the date column before the rolling join
    setkey(new, date)
    setkey(old, date)

    # For every row of old, take the row of new with the nearest date
    combined <- new[old, roll = "nearest"]
    combined

                      date data.new data.old
    1: 2015-03-02 12:20:00        t        d
    2: 2016-03-02 12:20:00        t        a
    3: 2016-03-07 12:20:00        u        b
    4: 2016-04-02 12:20:00        v        c

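I deliberately gave the two tables different numbers of rows to show how the rolling join handles multiple matches. You can switch the direction of the join: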

    combined <- old[new, roll = "nearest"]
    combined

                      date data.old data.new
    1: 2016-03-02 12:20:00        a        t
    2: 2016-03-07 12:20:00        b        u
    3: 2016-04-02 12:20:00        c        v
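
Applied to the data from the question, the same idea might look like the sketch below. The tables are rebuilt by hand from the printout, the UTC time zone is an assumption, and bin_time is a helper column added here only so that the bin's own timestamp stays visible after the join. It uses on= instead of keys because the two date columns have different names.

```r
library(data.table)

# Sample data rebuilt by hand from the question (time zone UTC is an assumption)
df1 <- data.table(
  date1 = as.POSIXct(c("2016-03-01 17:52:00", "2016-03-01 18:01:30",
                       "2016-03-01 18:05:00", "2016-03-01 20:42:30",
                       "2016-03-01 23:52:00"), tz = "UTC"),
  value = c("a", "b", "c", "d", "e")
)
df2 <- data.table(
  date2 = as.POSIXct("2016-03-01 17:00:00", tz = "UTC") + 3600 * (0:8),
  bin_name = paste0("bin", LETTERS[1:9])
)

# Keep a copy of the bin time: after the join, date2 holds the values of date1
df2[, bin_time := date2]

# For every row of df1, pick the row of df2 whose date2 is nearest to date1
joined <- df2[df1, on = c(date2 = "date1"), roll = "nearest"]
joined
```

In this sketch "e" (23:52:00) is matched to "binH" (00:00:00 on the next day), which is the pairing the question asks for.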

Answer 2

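I ran into a similar problem, but instead of using data.table or tidyverse I wrote my own function for "approximate merging", amerge. It takes 4 arguments:

two data frames,
a vector of column names for the "firm" (non-approximate) merge, which must exist in both data frames,
and the name of a single column (present in both data frames) for the approximate merge. It works with any numeric value, including dates.

The idea is to merge the best-matching rows 1-to-1 without losing any row from either data frame. Here is my commented code and a working example.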

    amerge <- function(d1, d2, firm=NULL, approx=NULL) {
      rt = Sys.time()

      # Take care of conflicting column names
      n2 = data.frame(oldname = names(d2), newname = names(d2))
      n2$newname = as.character(n2$newname)
      n2$newname[(n2$oldname %in% names(d1)) & !(n2$oldname %in% firm)] =
        paste(n2$newname[(n2$oldname %in% names(d1)) & !(n2$oldname %in% firm)], "2", sep=".")

      # Add unique row IDs
      if (length(c(firm, approx))>1) {
        d1$ID1 = factor(apply(d1[,c(approx,firm)], 1, paste, collapse=" "))
        d2$ID2 = factor(apply(d2[,c(approx,firm)], 1, paste, collapse=" "))
      } else {
        d1$ID1 = factor(d1[,c(approx,firm)])
        d2$ID2 = factor(d2[,c(approx,firm)])
      }

      # Perform initial merge on the 'firm' parameters, if any
      # Otherwise match all to all
      if (length(firm)>0) {
        t1 = merge(d1, d2, by=firm, all=TRUE, suffixes=c("",".2"))
      } else {
        names(d2)= c(n2$newname,"ID2")
        t1 = data.frame()
        for (i1 in seq_len(nrow(d1))) {
          trow = d1[i1,]
          t1 = rbind(t1, cbind(trow, d2))
        }
      }

      # Match by the most approximate record
      if (length(approx)==1) {
        # Calculate the differential for approximate merging
        t1$DIFF = abs(t1[,approx] - t1[,n2$newname[n2$oldname==approx]])
        # Sort data by ascending DIFF, so that best matching records are used first
        t1 = t1[order(t1$DIFF, t1$ID1, t1$ID2),]
        t2 = data.frame()
        d2$used = 0
        # For each record of d1, find match from d2
        for (i1 in na.omit(unique(t1$ID1))) {
          tx = t1[!is.na(t1$DIFF) & t1$ID1==i1,]
          # If there are non-missing records, get the one with minimum DIFF (top one)
          if (nrow(tx)>0) {
            tx = tx[1,]
            # If matching record found, remove it from the pool, so it's not used again
            t1[!is.na(t1$ID2) & t1$ID2==tx$ID2, c(n2$newname[!(n2$newname %in% firm)], "DIFF")] = NA
            # And mark it as used
            d2$used[d2$ID2==tx$ID2] = 1
          } else {
            # If there are no non-missing records, just get the first one from the top
            tx = t1[!is.na(t1$ID1) & t1$ID1==i1,][1,]
          }
          t2 = rbind(t2,tx)
        }
      } else {
        t2 = t1
      }
      # Make the records the same order as d1
      t2 = t2[match(d1$ID1, t2$ID1),]
      # Add unmatched records from d2 to the end of output
      if (any(d2$used==0)) {
        tx = t1[t1$ID2 %in% d2$ID2[d2$used==0], ]
        tx = tx[!duplicated(tx$ID2),]
        tx[, names(d1)[!(names(d1) %in% c(firm))]] = NA
        t2 = rbind(t2,tx)
        t2[is.na(t2[,approx]), approx] = t2[is.na(t2[,approx]), n2$newname[n2$oldname==approx]]
      }
      t2$DIFF = t2$ID1 = t2$ID2 = NULL
      # Report elapsed time; units must be passed by name (the third positional argument of difftime is tz)
      cat("* Run time: ", round(difftime(Sys.time(), rt, units="secs"),1), " seconds.\n", sep="")
      return(t2)
    }

And the example:

    new <- data.frame(ID=c(1,1,1,2), date = as.POSIXct( c("2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00", "2016-04-12 11:03:00")), new = c("t","u","v","x"))
    old <- data.frame(ID=c(1,1,1,1,1), date = as.POSIXct( c("2016-03-07 12:20:00", "2016-04-02 12:20:00", "2016-03-01 10:09:00", "2015-04-12 10:09:00","2016-03-03 12:20:00")), old = c("a","b","c","d","e"))

    amerge(old, new, firm="ID", approx="date")

It outputs:

       ID                date  old              date.2  new
    2   1 2016-03-07 12:20:00    a 2016-03-07 12:20:00    u
    6   1 2016-04-02 12:20:00    b 2016-04-02 12:20:00    v
    7   1 2016-03-01 10:09:00    c                <NA> <NA>
    10  1 2015-04-12 10:09:00    d                <NA> <NA>
    13  1 2016-03-03 12:20:00    e 2016-03-02 12:20:00    t
    16  2 2016-04-12 11:03:00 <NA> 2016-04-12 11:03:00    x

So it works as I intended: each row of both data frames appears exactly once, matched by the shortest time difference. One caveat: where date would otherwise be missing, the function copies date.2 into the date column.
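
For comparison, here is a minimal base-R sketch of plain nearest matching without the 1-to-1 constraint. The helper nearest_index is hypothetical and not part of amerge, and the dates are a hand-picked subset of the example above with an assumed UTC time zone. It shows that a simple nearest lookup can reuse the same row several times, which is the case amerge is designed to avoid.

```r
# Hypothetical helper (not part of amerge): index of the closest candidate for each x
nearest_index <- function(x, candidates) {
  x_num <- as.numeric(x)
  cand_num <- as.numeric(candidates)
  vapply(x_num, function(v) which.min(abs(cand_num - v)), integer(1))
}

# Hand-picked subset of the example dates (time zone UTC is an assumption)
old_dates <- as.POSIXct(c("2016-03-07 12:20:00", "2016-03-03 12:20:00",
                          "2016-03-01 10:09:00"), tz = "UTC")
new_dates <- as.POSIXct(c("2016-03-02 12:20:00", "2016-03-07 12:20:00",
                          "2016-04-02 12:20:00"), tz = "UTC")

idx <- nearest_index(old_dates, new_dates)
data.frame(old_date = old_dates, nearest_new = new_dates[idx])
```

Here both 2016-03-03 and 2016-03-01 map to the same new row (2016-03-02), whereas in the amerge output above only "e" gets that row and "c" is left unmatched.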

Tags: datetime, r, posixlt