Using plyr to join two massive data frames on two columns

Question

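I have a very large data frame that I need to join to another data frame on two columns. I have been using merge to do it, but R runs out of memory as the tables get larger. Is there a similar solution using dplyr or plyr? I have heard that they need less memory. I generally know how to use the join function in plyr; what I am struggling with is joining on two columns. The merge syntax I have been using is: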

Correlation_Table <- merge(Correlation_Table, inter, by.x = c(1,2), by.y = c(1,2), all.x = TRUE, all.y = TRUE)

For example, if I have the following two data frames:

> head(df1)
  x y         z          a
1 1 2 429.57410  43.746670
2 2 3 717.98184 524.288886
3 3 4 601.66938 640.245469
4 4 5  87.41476 318.964765
5 5 6 586.22234 196.759991
6 6 7 619.82194   3.308136
> head(df2)
   b  c        d
1  5  8 152.2855
2  6  9 191.5406
3  7 10 197.0520
4  8 11 175.4209
5  9 12 157.6239
6 10 13 136.3286

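Here columns x and y of df1 are dimensions, columns b and c of df2 are also dimensions, and the remaining columns are measures. My goal is to create a new data frame containing all three measures (z, a and d), where the records of df1.x and df1.y match df2.b and df2.c.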

Is this possible using plyr?

Answer 1

You could try

library(dplyr)
res1 <- full_join(df1, df2, by=c('x'='b', 'y'='c'))

According to ?full_join:

by: a character vector of variables to join by. If ‘NULL’, the default, ‘join’ will do a natural join, using all variables with common names across the two tables. A message lists the variables so that you can check they're right. To join by different variables on x and y use a named vector. For example, ‘by = c("a" = "b")’ will match ‘x.a’ to ‘y.b’.
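As a minimal sketch of the named-vector form of `by`, the snippet below rebuilds only the six rows of each data frame shown by head() in the question (the real tables are much larger) and joins them. None of these six key pairs happen to match, so under that assumption the full join should keep all 12 rows and fill the missing measures with NA.

```r
library(dplyr)

# Only the rows printed by head() in the question, not the full tables
df1 <- data.frame(
  x = 1:6, y = 2:7,
  z = c(429.57410, 717.98184, 601.66938, 87.41476, 586.22234, 619.82194),
  a = c(43.746670, 524.288886, 640.245469, 318.964765, 196.759991, 3.308136)
)
df2 <- data.frame(
  b = 5:10, c = 8:13,
  d = c(152.2855, 191.5406, 197.0520, 175.4209, 157.6239, 136.3286)
)

# Names on the left refer to df1, values on the right to df2:
# match df1$x to df2$b and df1$y to df2$c
res1 <- full_join(df1, df2, by = c("x" = "b", "y" = "c"))

dim(res1)    # rows and columns of the joined result
names(res1)  # key columns keep the names from df1
```

The key columns in the result take their names from the first table, so df2's b and c do not appear as separate columns.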

Compare the result with:

res2 <- merge(df1, df2, by.x = c(1, 2), by.y = c(1, 2),
              all.x = TRUE, all.y = TRUE)

Note: the order of the rows will differ between the two results.
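One way to check that the two approaches agree apart from row order is to sort both results on the key columns before comparing. The sketch below uses small made-up data frames (the values and the helper `sort_rows` are hypothetical, not from the question) in which two key pairs match and one row on each side has no partner.

```r
library(dplyr)

# Hypothetical toy data; df1 is deliberately not sorted by its keys
df1 <- data.frame(x = c(3, 1, 2), y = c(4, 2, 3), z = c(30, 10, 20))
df2 <- data.frame(b = c(3, 2, 9), c = c(4, 3, 9), d = c(0.3, 0.2, 0.9))

res1 <- full_join(df1, df2, by = c("x" = "b", "y" = "c"))
res2 <- merge(df1, df2, by.x = c(1, 2), by.y = c(1, 2),
              all.x = TRUE, all.y = TRUE)

# Hypothetical helper: order rows by both keys and reset the row names
sort_rows <- function(d) {
  d <- d[order(d$x, d$y), ]
  rownames(d) <- NULL
  d
}

# The raw results list the rows in a different order ...
res1$x
res2$x

# ... but should hold the same rows once sorted
all.equal(sort_rows(res1), sort_rows(res2))
```

If the sorted results do not compare equal on real data, differing column types in the key columns are a likely first thing to check.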
