How do I merge 2 data frames in R based on 2 columns?

Question

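I want to merge 2 data frames in R based on 2 columns. The two data frames are called popr and droppedcol, and they share the same 2 variables, USUBJID and TRTAG2N, which are the variables I want to combine the 2 data frames by.

The merge function works when I merge on only one column: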

merged <- merge(popr,droppedcol,by="USUBJID")

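When I try to merge on 2 columns and view the resulting data frame "duration", the table is empty: there are no values, only column headers. It says "no data available in table".

I am tasked with replicating this SAS code in R: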

data duration;
  set pop combined1 ;
  by usubjid trtag2n;
run;

In R, I have tried the following:

duration<- merge(popr,droppedcol,by.x="USUBJID","TRTAG2N",by.y="USUBJID","TRTAG2N")

duration <- full_join(popr,droppedcol,by = c("USUBJID","TRTAG2N"))

duration <- merge(popr,droppedcol,by = c("USUBJID","TRTAG2N"))

I would like to see a data frame with the columns USUBJID, TRTAG2N, TRTAG2 and FUDURAG2, ordered first by FUDURAG2 and then by USUBJID.

Answer 1

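Per the SAS documentation, Combining SAS Data Sets, and as confirmed by SAS guru @Tom in the comments above, set with by simply means you are interleaving the data sets. No merge (which, by the way, is also a SAS method, one you are not using here) is taking place: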

Interleaving uses a SET statement and a BY statement to combine multiple data sets into one new data set. The number of observations in the new data set is the sum of the number of observations from the original data sets. However, the observations in the new data set are arranged by the values of the BY variable or variables and, within each BY group, by the order of the data sets in which they occur. You can interleave data sets either by using a BY variable or by using an index.

Therefore, in R the best translation of set without by is rbind(), and set with by is rbind + order (on rows):

duration <- rbind(pop, combined1)                                # STACK DFs
duration <- with(duration, duration[order(usubjid, trtag2n),])   # ORDER ROWS
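To make the interleave concrete, here is a minimal sketch on two made-up data frames. The values (and the fudurag2 contents) are invented for illustration; only the names pop, combined1, usubjid and trtag2n come from the code above. It stacks the rows and then sorts by the two BY variables, so the row count of the result is the sum of the inputs. Because order() leaves ties in their original order, rows from pop stay ahead of rows from combined1 within each BY group, which matches the SAS description quoted above.

```r
# Toy stand-ins for pop and combined1 (values are invented for illustration)
pop <- data.frame(
  usubjid  = c("S03", "S01", "S02"),
  trtag2n  = c(1, 2, 1),
  fudurag2 = c(10, 20, 30),
  stringsAsFactors = FALSE
)
combined1 <- data.frame(
  usubjid  = c("S02", "S01"),
  trtag2n  = c(1, 1),
  fudurag2 = c(40, 50),
  stringsAsFactors = FALSE
)

# Stack the rows, then sort by the BY variables
duration <- rbind(pop, combined1)
duration <- duration[order(duration$usubjid, duration$trtag2n), ]
rownames(duration) <- NULL

nrow(duration)  # 5 = 3 rows from pop + 2 rows from combined1
```

Note that R column names are case-sensitive, so if your real columns are USUBJID and TRTAG2N, use those spellings in order().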

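Be aware, though, that base rbind does not allow unmatched columns between the data frames being concatenated. Several third-party packages do allow unmatched columns, including plyr::rbind.fill, dplyr::bind_rows, and data.table::rbindlist (with fill = TRUE).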
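If you would rather stay in base R when the columns differ, one possible workaround is to add the missing columns as NA before calling rbind. This is only a sketch: the helper name rbind_fill_base is hypothetical (it is not part of any package), the data are invented, and it assumes both data frames have at least one row.

```r
# Hypothetical helper: stack two data frames whose columns only partly overlap,
# filling columns that are missing from either input with NA
rbind_fill_base <- function(a, b) {
  all_cols <- union(names(a), names(b))
  for (col in setdiff(all_cols, names(a))) a[[col]] <- NA
  for (col in setdiff(all_cols, names(b))) b[[col]] <- NA
  rbind(a[all_cols], b[all_cols])
}

# Invented example data: pop lacks fudurag2, combined1 lacks trtag2
pop <- data.frame(usubjid = c("S02", "S01"), trtag2n = c(1, 1),
                  trtag2 = c("A", "A"), stringsAsFactors = FALSE)
combined1 <- data.frame(usubjid = "S01", trtag2n = 2,
                        fudurag2 = 12, stringsAsFactors = FALSE)

# Stack with NA fill, then sort by the BY variables as before
duration <- rbind_fill_base(pop, combined1)
duration <- duration[order(duration$usubjid, duration$trtag2n), ]
rownames(duration) <- NULL
```

With the packages listed above the same stacking is a single call, for example dplyr::bind_rows(pop, combined1), followed by the same order() step.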
