Self Joining in R

Question

Here is a sample tibble:

library(tibble)

test <- tibble(a = c("dd1","dd2","dd3","dd4","dd5"),
               name = c("a", "b", "c", "d", "e"),
               b = c("dd3","dd4","dd1","dd5","dd2"))

I want to add a new column b_name by self-joining test, using:

dplyr::inner_join(test, test, by = c("a" = "b"))

My table is very large (2.7 million rows, 4 columns), and I get the following error:

Error: std::bad_alloc

Please advise on the right way / best practice to do this.

My end goal is a structure like the following:

   a     name  b     b_name
   dd1   a     dd3   c
   dd2   b     dd4   d
   dd3   c     dd1   a
   dd4   d     dd5   e
   dd5   e     dd2   b

Answer 1

With this many rows, I think data.table may give you a speed-up. So here is a data.table solution:

library(data.table)
setDT(test)

Method #1: self-join:

test[test, on = c(a = "b")]
# test[test, on = .(a == b)] ## identical

Method #2: using merge (data.table's merge method):

merge(test, test, by.x = "a", by.y = "b")
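If only the b_name column from the question is needed, an update join may be worth trying, since it adds the column by reference instead of building a second joined table. The following is a minimal sketch, not part of the original answer; the lookup table and its b_name column are names introduced here for illustration.

```r
library(data.table)

test <- data.table(
  a    = c("dd1", "dd2", "dd3", "dd4", "dd5"),
  name = c("a", "b", "c", "d", "e"),
  b    = c("dd3", "dd4", "dd1", "dd5", "dd2")
)

# Hypothetical helper table: one row per key, holding the name to look up.
lookup <- test[, .(a, b_name = name)]

# Update join: for every row of test whose b equals lookup$a,
# copy lookup's b_name into test by reference (row order is unchanged).
test[lookup, b_name := i.b_name, on = .(b = a)]

print(test)
# b_name: "c" "d" "a" "e" "b"
```

This assumes the values in a are unique; with duplicated keys, which matching name ends up in b_name depends on the row order of lookup.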

Answer 2

Here is another simple solution, using the match function from base R and the mutate function from the dplyr package:

library(dplyr)

new_test <- test %>%
  mutate(b_name = name[match(b, a)])

However, be careful with very long tables, since match may not be the most efficient implementation.
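To make the lookup behaviour of match explicit, here is a small base R sketch (it uses a plain data.frame instead of a tibble so that it has no package dependencies). match returns at most one position per element, so the result always has the same number of rows as the input; it resolves duplicated keys to their first occurrence and returns NA when a value of b has no counterpart in a. The uniqueness check and the idx variable are additions for illustration.

```r
test <- data.frame(
  a    = c("dd1", "dd2", "dd3", "dd4", "dd5"),
  name = c("a", "b", "c", "d", "e"),
  b    = c("dd3", "dd4", "dd1", "dd5", "dd2"),
  stringsAsFactors = FALSE
)

# Guard: match() silently picks the first occurrence of a duplicated key,
# so confirm that the lookup key is unique before relying on it.
stopifnot(anyDuplicated(test$a) == 0)

# Position in a of each value of b (NA where b has no match in a).
idx <- match(test$b, test$a)

# Pull the name found at those positions.
test$b_name <- test$name[idx]

print(test)
# b_name: "c" "d" "a" "e" "b"
```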

Answer 3

Another option is fmatch from the fastmatch package:

library(fastmatch)
test$b_name <- with(test, name[fmatch(b, a)])
test$b_name
#[1] "c" "d" "a" "e" "b"

According to the ?fmatch description:

fmatch is a faster version of the built-in match() function.
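Before switching, it may be worth confirming on your own data that fmatch gives the same positions as match. The sketch below does this on synthetic keys; the size n and the generated key values are assumptions made for illustration, not figures from the question.

```r
library(fastmatch)

set.seed(1)
n <- 100000L                          # illustrative size only
a <- sprintf("dd%d", seq_len(n))      # unique synthetic keys
b <- sample(a)                        # the same keys in shuffled order

# Both functions return, for each element of b, its position in a.
pos_base <- match(b, a)
pos_fast <- fmatch(b, a)

same <- identical(pos_base, pos_fast)
print(same)
```

Any speed difference depends on the data and on how often the same lookup table is reused, so timing both calls on the real table is the safer way to decide.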
