Matching values from two column pairs in different data frames in R
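I have two data frames that are edge lists with "source" and "target" as their first two columns; the second data frame also has a third column holding an edge attribute. The two data frames differ in length, and I want to (1) retrieve the edges from one data frame that are not in the other, and (2) pull the values from the second data frame for the edges that match.
Example: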
> A <- data.frame(source=c("v1", "v1", "v2", "v2"), target=c("v2", "v4", "v3", "v4"))
> B <- data.frame(source=c("V1", "V2", "v1", "V4", "V4", "V5"), target=c("V2", "V5", "V3", "V3", "V2", "V4"), variable=c(3,4,0,2,1,0))
> A
  source target
1     v1     v2
2     v1     v4
3     v2     v3
4     v2     v4
> B
  source target variable
1     V1     V2        3
2     V2     V5        4
3     v1     V3        0
4     V4     V3        2
5     V4     V2        1
6     V5     V4        0
Desired result (1):
  source target
1     V2     V5
2     V1     V3
3     V4     V3
4     V5     V4
Desired result (2):
  source target variable
1     V1     V2        3
2     V2     V4        1
How can I achieve this in R?
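You get the first result with anti_join, but you need to anti-join on both combinations of source and target, because direction does not seem to matter in your example. Note that I had to use toupper, because the capitalization in your example is inconsistent and the example suggests that case should be ignored.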
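library(dplyr)
# Drop the edges of B that match A in the same direction, then those that
# match in the reverse direction. A is upper-cased to line up with B.
# mutate_all() still works but is superseded by across() in dplyr >= 1.0.0.
anti_join(anti_join(B, A %>% mutate_all(toupper),
                    by = c("source", "target")),
          A %>% mutate_all(toupper),
          by = c(target = "source", source = "target")) %>%
  select(-variable)
#>   source target
#> 1     V2     V5
#> 2     v1     V3
#> 3     V4     V3
#> 4     V5     V4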
The second result can be obtained by binding two inner_join calls:
bind_rows(inner_join(B, A %>% mutate_all(toupper),
                     by = c("source", "target")),
          inner_join(B, A %>% mutate_all(toupper),
                     by = c(source = "target", target = "source")))
#>   source target variable
#> 1     V1     V2        3
#> 2     V4     V2        1
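The two joins above can also be collapsed into a single comparison by giving every edge one case-insensitive, direction-independent key. The following is only a minimal base R sketch of that idea, not part of the original answer; `edge_key()` and the variable names `key_A`, `key_B`, `only_in_B` and `in_both` are hypothetical names introduced for illustration.

```r
# Example data from the question
A <- data.frame(source = c("v1", "v1", "v2", "v2"),
                target = c("v2", "v4", "v3", "v4"))
B <- data.frame(source = c("V1", "V2", "v1", "V4", "V4", "V5"),
                target = c("V2", "V5", "V3", "V3", "V2", "V4"),
                variable = c(3, 4, 0, 2, 1, 0))

# Hypothetical helper: upper-case both endpoints, then put the
# alphabetically smaller label first, so "V4","V2" and "v2","v4"
# both become "V2-V4".
edge_key <- function(df) {
  s <- toupper(df$source)
  t <- toupper(df$target)
  paste(pmin(s, t), pmax(s, t), sep = "-")
}

key_A <- edge_key(A)
key_B <- edge_key(B)

# (1) Edges of B that have no counterpart in A
only_in_B <- B[!key_B %in% key_A, c("source", "target")]

# (2) Edges of B that also occur in A, with their attribute
in_both <- B[key_B %in% key_A, ]
```

Because both data frames are upper-cased inside the helper, a lower-case label in B (such as the "v1" in its third row) would still match an edge in A, which the joins above do not handle since they only upper-case A.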
Using data.table:
# Load data.table and convert the data.frames to data.tables (by reference)
library(data.table)
setDT(A)
setDT(B)
# If direction doesn't matter, sort "source"/"target" within each row
# Labels are upper-cased before sorting so that mixed case cannot
# change the order of the two endpoints
cols <- c("source", "target")
foo <- function(x) paste(sort(toupper(unlist(x))), collapse = "-")
A[, oedge := foo(.SD), .SDcols = cols, by = seq_len(nrow(A))]
B[, oedge := foo(.SD), .SDcols = cols, by = seq_len(nrow(B))]
# Anti-join: edges of B that are not in A
B[!A, .SD, on = "oedge", .SDcols = cols]
#    source target
# 1:     V2     V5
# 2:     v1     V3
# 3:     V4     V3
# 4:     V5     V4
# Inner join: edges of B that are in A, with their attribute
B[A, .SD, on = "oedge", .SDcols = c(cols, "variable"), nomatch = NULL]
#    source target variable
# 1:     V1     V2        3
# 2:     V4     V2        1
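Grouping by `seq_len(nrow(...))` calls the key function once per row, which may become slow on large edge lists. As a hedged variation, not taken from the answer above, the same `oedge` key could be built in one vectorised step with `pmin()` and `pmax()`; `add_edge_key()` is a hypothetical helper name, and the data is recreated so that the sketch stands on its own.

```r
library(data.table)

# Example data from the question, created directly as data.tables
A <- data.table(source = c("v1", "v1", "v2", "v2"),
                target = c("v2", "v4", "v3", "v4"))
B <- data.table(source = c("V1", "V2", "v1", "V4", "V4", "V5"),
                target = c("V2", "V5", "V3", "V3", "V2", "V4"),
                variable = c(3, 4, 0, 2, 1, 0))

# Hypothetical helper: adds the undirected, upper-cased key by reference,
# computed for all rows at once instead of row by row.
add_edge_key <- function(dt) {
  dt[, oedge := paste(pmin(toupper(source), toupper(target)),
                      pmax(toupper(source), toupper(target)),
                      sep = "-")]
  invisible(dt)
}
add_edge_key(A)
add_edge_key(B)

# Anti-join: edges of B that are not in A
res1 <- B[!A, .(source, target), on = "oedge"]

# Inner join: edges of B that are in A, with their attribute
res2 <- B[A, .(source, target, variable), on = "oedge", nomatch = NULL]
```

In the inner join, the unprefixed `source` and `target` in `j` refer to B's columns, so the result keeps B's orientation and capitalisation, as in the output shown above.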