Merging data frames based on several conditions

I have two data frames.

df1 <- data.frame(
oID = c(111,112,113,114,115,116,117,118,119,120),
x1 = c(1,2,3,4,5,6,7,8,9,10),
x2 = c(1,2,3,4,5,6,7,8,9,10),
y1 = c(10,9,8,7,6,5,4,3,2,1),
y2 = c(10,9,8,7,6,5,4,3,2,1)
)

df2 <- data.frame(
oID = c(115,116,117,118,119,120,121,122,123),
sID = c(105,106,107,108,109,110,111,112,113),
x1 = c(1,2,3,4,5,6,7,8,9),
x2 = c(1,2,2,2,2,2,2,2,2)
)

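I want to add specific cases from df2 to df1. A case should only be added if its sID matches any oID in df1.

When adding cases from df2 to df1, I want to perform some additional operations:

Example: look at the case with oID 123 in df2. Its sID is 113, which matches a case in df1. I want to create a new case in df1 with the following characteristics: oID = 123; x1 = 9; x2 = 2; y1 = 8; y2 = 8. In other words, oID, x1 and x2 come from the df2 case, while y1 and y2 come from the df1 case whose oID equals the sID.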

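If I understood your question correctly, you can merge the tables, then compare some values and replace them. After that, you only need to clean up a few columns.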

library(data.table)

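# Convert both data frames to data.tables by reference
setDT(df1)
setDT(df2)

# Left join: keep every df1 row and attach the df2 row whose sID equals df1's oID.
# Clashing column names get the suffixes .x (df1) and .y (df2).
merged <- merge(df1, df2, by.x = "oID", by.y = "sID", all.x = TRUE)
# Where a df2 match exists, overwrite oID, x1 and x2 with the df2 values
merged[!is.na(oID.y), oID := oID.y][!is.na(x1.y), x1.x := x1.y][!is.na(x2.y), x2.x := x2.y]
# Drop the helper columns and restore the original column names
merged <- merged[, .(oID, x1.x, x2.x, y1, y2)]
setnames(merged, names(df1))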

merged

    oID x1 x2 y1 y2
 1: 121  7  2 10 10
 2: 122  8  2  9  9
 3: 123  9  2  8  8
 4: 114  4  4  7  7
 5: 115  5  5  6  6
 6: 116  6  6  5  5
 7: 117  7  7  4  4
 8: 118  8  8  3  3
 9: 119  9  9  2  2
10: 120 10 10  1  1

Data

df1 <- data.frame(
  oID = c(111,112,113,114,115,116,117,118,119,120),
  x1 = c(1,2,3,4,5,6,7,8,9,10),
  x2 = c(1,2,3,4,5,6,7,8,9,10),
  y1 = c(10,9,8,7,6,5,4,3,2,1),
  y2 = c(10,9,8,7,6,5,4,3,2,1)
)

df2 <- data.frame(
  oID = c(115,116,117,118,119,120,121,122,123),
  sID = c(105,106,107,108,109,110,111,112,113),
  x1 = c(1,2,3,4,5,6,7,8,9),
  x2 = c(1,2,2,2,2,2,2,2,2)
)

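Here is a tidyverse approach. First, append to df1 the rows returned by a semi_join of df2 with df1, which keeps only the df2 rows whose sID matches an oID in df1. This adds the x values for the matching rows. All that is then missing is the y values: where sID is present (not NA), match() finds the row whose oID equals that sID and uses its y value.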

library(tidyverse)

bind_rows(
  df1,
  semi_join(
    df2,
    df1,
    by = c("sID" = "oID")
  )
) %>%
  mutate(across(y1:y2, ~ifelse(!is.na(sID), .[match(sID, oID)], .))) %>%
  select(-sID)
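
As a small illustration of the lookup step, the sketch below shows what match() returns for a few IDs taken from the question's data. The short vectors here are made up for the illustration and are not part of the pipeline above; NA stands in for a row that came from df1 and therefore has no sID.

```r
# Illustration only: a few IDs copied from the question's data
oID <- c(111, 112, 113, 114)
y1  <- c(10, 9, 8, 7)
sID <- c(NA, 113, 111)   # NA mimics a row that originally came from df1

# match() gives, for each sID, the position of the first equal value in oID
pos <- match(sID, oID)
pos        # NA 3 1

# Indexing y1 by those positions pulls the y value of the matching row
y1[pos]    # NA 8 10
```

Rows with an NA position keep their own y value in the pipeline above, because ifelse() falls back to the original column whenever sID is NA.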

Another approach is to bind the rows after performing two consecutive joins:

bind_rows(
  df1,
  semi_join(
    df2,
    df1,
    by = c("sID" = "oID")
  ) %>%
    left_join(
      df1[, c("oID", "y1", "y2")],
      by = c("sID" = "oID")
    ) %>%
    select(-sID)
)

Output

   oID x1 x2 y1 y2
1  111  1  1 10 10
2  112  2  2  9  9
3  113  3  3  8  8
4  114  4  4  7  7
5  115  5  5  6  6
6  116  6  6  5  5
7  117  7  7  4  4
8  118  8  8  3  3
9  119  9  9  2  2
10 120 10 10  1  1
11 121  7  2 10 10
12 122  8  2  9  9
13 123  9  2  8  8
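
For comparison, the same append can be sketched in base R without any extra packages. This is only a minimal sketch under the assumptions stated in the question (each sID matches at most one oID in df1); the names matched, idx, new_rows and result are introduced here for illustration.

```r
df1 <- data.frame(
  oID = c(111,112,113,114,115,116,117,118,119,120),
  x1 = c(1,2,3,4,5,6,7,8,9,10),
  x2 = c(1,2,3,4,5,6,7,8,9,10),
  y1 = c(10,9,8,7,6,5,4,3,2,1),
  y2 = c(10,9,8,7,6,5,4,3,2,1)
)

df2 <- data.frame(
  oID = c(115,116,117,118,119,120,121,122,123),
  sID = c(105,106,107,108,109,110,111,112,113),
  x1 = c(1,2,3,4,5,6,7,8,9),
  x2 = c(1,2,2,2,2,2,2,2,2)
)

# Keep only the df2 rows whose sID appears among the oID values of df1
matched <- df2[df2$sID %in% df1$oID, ]

# Position in df1 of the row that each remaining sID points to
idx <- match(matched$sID, df1$oID)

# New cases: oID, x1, x2 from df2; y1, y2 borrowed from the matching df1 row
new_rows <- data.frame(
  oID = matched$oID,
  x1  = matched$x1,
  x2  = matched$x2,
  y1  = df1$y1[idx],
  y2  = df1$y2[idx]
)

# Append the new cases below the original df1 rows
result <- rbind(df1, new_rows)
result
```

With the question's data this should give the same 13 rows as the output above, with the three new cases (oID 121, 122 and 123) at the bottom.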