Matching values from two column pairs in different data frames in R

Question

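I have two data frames that are edge lists, with "source" and "target" as their first two columns. The second data frame also has a third column holding an edge attribute. The two data frames differ in length, and I want to (1) retrieve the edges from one data frame that are not in the other, and (2) get the values from the second data frame for the edges that match.

Example: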

> A <- data.frame(source=c("v1", "v1", "v2", "v2"), target=c("v2", "v4", "v3", "v4"))
> B <- data.frame(source=c("V1", "V2", "v1", "V4", "V4", "V5"), target=c("V2", "V5", "V3", "V3", "V2", "V4"), variable=c(3,4,0,2,1,0))
> A
  source target
1     v1     v2
2     v1     v4
3     v2     v3
4     v2     v4
> B
  source target variable
1     V1     V2        3
2     V2     V5        4
3     v1     V3        0
4     V4     V3        2
5     V4     V2        1
6     V5     V4        0

Desired result (1):

  source target
1     V2     V5
2     V1     V3
3     V4     V3
4     V5     V4

Desired result (2):

  source target variable
1     V1     V2        3
2     V2     V4        1

How can I do this in R?

Answer 1

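The first result calls for an anti_join, but you need to anti-join on both orientations of source and target, since direction does not seem to matter in your example. Note that I had to use toupper because the capitalization in your example is inconsistent, and the expected output suggests that case should be ignored.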

library(dplyr)

anti_join(anti_join(B, A %>% mutate_all(toupper), 
                    by = c("source", "target")),
          A %>% mutate_all(toupper), 
          by = c(target = "source", source = "target")) %>%
  select(-variable)
#>   source target
#> 1     V2     V5
#> 2     v1     V3
#> 3     V4     V3
#> 4     V5     V4
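To see why both orientations are needed, here is a small check on the question's data. It is only a sketch: `A_up`, `one_way` and `both_ways` are names introduced for illustration, and `across()` is used as the current dplyr replacement for `mutate_all()`. With the direct anti_join alone, the reversed edge V4 -> V2 slips through even though A contains v2 -> v4:

```r
library(dplyr)

A <- data.frame(source = c("v1", "v1", "v2", "v2"),
                target = c("v2", "v4", "v3", "v4"))
B <- data.frame(source = c("V1", "V2", "v1", "V4", "V4", "V5"),
                target = c("V2", "V5", "V3", "V3", "V2", "V4"),
                variable = c(3, 4, 0, 2, 1, 0))

# Upper-case every column of A so it can be compared with B
A_up <- A %>% mutate(across(everything(), toupper))

# Direct orientation only: drops V1 -> V2 but keeps V4 -> V2
one_way <- anti_join(B, A_up, by = c("source", "target"))

# Second pass with source and target swapped also drops V4 -> V2
both_ways <- anti_join(one_way, A_up,
                       by = c(source = "target", target = "source"))

nrow(one_way)    # 5 rows, the reversed edge is still there
nrow(both_ways)  # 4 rows
```

The second pass is what removes the edge that matches A only after swapping source and target.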

The second result can be obtained by binding two inner_joins:

bind_rows(inner_join(B, A %>% mutate_all(toupper), 
                     by = c("source", "target")), 
          inner_join(B, A %>% mutate_all(toupper), 
                     by = c(source = "target", target = "source")))
#>   source target variable
#> 1     V1     V2        3
#> 2     V4     V2        1
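If direction really is irrelevant, a possible alternative is to give each edge a canonical, case-insensitive key first, so that a single join covers both orientations. The following is a minimal sketch under that assumption. `add_key()` and the `lo`/`hi` columns are hypothetical names that do not appear in the answer above, and the ordering of the endpoints relies on `pmin()`/`pmax()` comparing character vectors.

```r
library(dplyr)

A <- data.frame(source = c("v1", "v1", "v2", "v2"),
                target = c("v2", "v4", "v3", "v4"))
B <- data.frame(source = c("V1", "V2", "v1", "V4", "V4", "V5"),
                target = c("V2", "V5", "V3", "V3", "V2", "V4"),
                variable = c(3, 4, 0, 2, 1, 0))

# Hypothetical helper: lo/hi hold the two upper-cased endpoints in sorted
# order, so V4 -> V2 and v2 -> v4 end up with the same (lo, hi) pair.
add_key <- function(df) {
  df %>%
    mutate(lo = pmin(toupper(source), toupper(target)),
           hi = pmax(toupper(source), toupper(target)))
}

A_key <- A %>% add_key() %>% distinct(lo, hi)
B_key <- add_key(B)

# (1) edges of B that have no counterpart in A
res1 <- B_key %>%
  anti_join(A_key, by = c("lo", "hi")) %>%
  select(source, target)

# (2) edges of B that also occur in A, together with their attribute
res2 <- B_key %>%
  semi_join(A_key, by = c("lo", "hi")) %>%
  select(source, target, variable)
```

`semi_join()` is used for (2) because only the columns of B are wanted. Like the two bound inner_joins, this keeps B's own orientation (V4, V2) rather than A's.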

Answer 2

Using data.table:

# Load data.table and convert the data.frames to data.tables
library(data.table)
setDT(A)
setDT(B)

# If direction doesn't matter sort "source/target"
# Also need to standardise the data format, toupper()
cols <- c("source", "target")
foo <- function(x) paste(toupper(sort(unlist(x))), collapse="-")
A[, oedge := foo(.SD), .SDcols = cols, by = seq_len(nrow(A))]
B[, oedge := foo(.SD), .SDcols = cols, by = seq_len(nrow(B))]

# Do anti-join and inner join
B[!A, .SD, on="oedge", .SDcols=cols]
#    source target
# 1:     V2     V5
# 2:     v1     V3
# 3:     V4     V3
# 4:     V5     V4
B[A, .SD, on="oedge", .SDcols=c(cols, "variable"), nomatch = NULL]
#    source target variable
# 1:     V1     V2        3
# 2:     V4     V2        1
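The row-wise `by = seq_len(nrow(...))` call above runs `foo()` once per row, which may get slow on large edge lists. A vectorised variant of the same key could look like the sketch below. `edge_key()` is a hypothetical helper, not part of data.table, and it assumes each edge has exactly two endpoints so that `pmin()`/`pmax()` can stand in for `sort()`.

```r
library(data.table)

A <- data.table(source = c("v1", "v1", "v2", "v2"),
                target = c("v2", "v4", "v3", "v4"))
B <- data.table(source = c("V1", "V2", "v1", "V4", "V4", "V5"),
                target = c("V2", "V5", "V3", "V3", "V2", "V4"),
                variable = c(3, 4, 0, 2, 1, 0))

# Hypothetical helper: builds an undirected, upper-cased key such as "V2-V4"
# for whole columns at once instead of row by row.
edge_key <- function(s, t) {
  s <- toupper(s)
  t <- toupper(t)
  paste(pmin(s, t), pmax(s, t), sep = "-")
}

A[, oedge := edge_key(source, target)]
B[, oedge := edge_key(source, target)]

# Anti-join: edges of B that are not in A
only_b <- B[!A, .(source, target), on = "oedge"]

# Inner join: edges of B that are in A, with their attribute
both <- B[A, .(source, target, variable), on = "oedge", nomatch = NULL]
```

The joins themselves are the same as in the answer; only the construction of `oedge` changes.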
