Find discrepancies between two tables

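I come from a SAS/SQL background and am now using R. I'm trying to write code that takes two tables, compares them, and produces a list of the discrepancies. This code will be reused for many different sets of tables, so I need to avoid hardcoding.

I have been working from "Identifying specific differences between two data sets in R", but it doesn't get me all the way there.

Example data, using the combination of LastName/FirstName (which is unique) as the key --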

Dataset One --

Last_Name  First_Name  Street_Address   ZIP     VisitCount
Doe        John        1234 Main St     12345   20
Doe        Jane        4321 Tower St    54321   10
Don        Bob         771  North Ave   23232   5
Smith      Mike        732 South Blvd.  77777   3        

Dataset Two --

Last_Name  First_Name  Street_Address   ZIP     VisitCount
Doe        John        1234 Main St     12345   20
Doe        Jane        4111 Tower St    32132   17
Donn       Bob         771  North Ave   11111   5

   Desired Output --

   LastName FirstName VarName         TableOne        TableTwo
   Doe      Jane      StreetAddress   4321 Tower St   4111 Tower St 
   Doe      Jane      Zip             23232           32132
   Doe      Jane      VisitCount      5               17

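Note that this output ignores records whose ID does not appear in both tables (for example, because Bob's last name is "Don" in one table and "Donn" in the other, we ignore that record entirely).

I have explored doing this by applying the melt function to both data sets and then comparing them, but the size of the data I'm working with suggests that this would not be practical. In SAS I used Proc Compare for this kind of work, but I haven't found an exact equivalent in R.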

dplyr and tidyr work well here. First, a slightly reduced data set:

dat1 <- data.frame(Last_Name = c('Doe', 'Doe', 'Don', 'Smith'),
                   First_Name = c('John', 'Jane', 'Bob', 'Mike'),
                   ZIP = c(12345, 54321, 23232, 77777),
                   VisitCount = c(20, 10, 5, 3),
                   stringsAsFactors = FALSE)
dat2 <- data.frame(Last_Name = c('Doe', 'Doe', 'Donn'),
                   First_Name = c('John', 'Jane', 'Bob'),
                   ZIP = c(12345, 32132, 11111),
                   VisitCount = c(20, 17, 5),
                   stringsAsFactors = FALSE)

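(Sorry, I didn't want to type it all in. If it matters, please provide a reproducible example with well-defined data structures.)

Also, your "desired output" looks a little off for Jane Doe's ZIP and VisitCount.

Your idea of melting them works well: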

library(dplyr)
library(tidyr)
dat1g <- gather(dat1, key, value, -Last_Name, -First_Name)
dat2g <- gather(dat2, key, value, -Last_Name, -First_Name)
head(dat1g)
##   Last_Name First_Name        key value
## 1       Doe       John        ZIP 12345
## 2       Doe       Jane        ZIP 54321
## 3       Don        Bob        ZIP 23232
## 4     Smith       Mike        ZIP 77777
## 5       Doe       John VisitCount    20
## 6       Doe       Jane VisitCount    10

From here, it's deceptively simple:

dat1g %>%
    inner_join(dat2g, by = c('Last_Name', 'First_Name', 'key')) %>%
    filter(value.x != value.y)
##   Last_Name First_Name        key value.x value.y
## 1       Doe       Jane        ZIP   54321   32132
## 2       Doe       Jane VisitCount      10      17
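Since the question asks for something reusable across many table pairs, the two steps above can be wrapped in a function. The following is only a sketch under a few assumptions: the name compare_tables and its keys argument are hypothetical (they do not come from dplyr or tidyr), it uses pivot_longer(), which supersedes gather() in tidyr 1.0 and later, and it casts every value to character so that text and numeric columns can share one column.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not part of any package): the key columns are passed
# in as a character vector, so nothing about the tables is hardcoded.
compare_tables <- function(x, y, keys) {
  to_long <- function(df) {
    pivot_longer(df,
                 cols = !all_of(keys),
                 names_to = "VarName",
                 values_to = "value",
                 # cast to character so columns of different types can be stacked
                 values_transform = list(value = as.character))
  }
  # inner_join keeps only keys present in both tables
  inner_join(to_long(x), to_long(y),
             by = c(keys, "VarName"),
             suffix = c(".one", ".two")) %>%
    filter(value.one != value.two)
}

dat1 <- data.frame(Last_Name = c("Doe", "Doe", "Don", "Smith"),
                   First_Name = c("John", "Jane", "Bob", "Mike"),
                   ZIP = c(12345, 54321, 23232, 77777),
                   VisitCount = c(20, 10, 5, 3),
                   stringsAsFactors = FALSE)
dat2 <- data.frame(Last_Name = c("Doe", "Doe", "Donn"),
                   First_Name = c("John", "Jane", "Bob"),
                   ZIP = c(12345, 32132, 11111),
                   VisitCount = c(20, 17, 5),
                   stringsAsFactors = FALSE)

diffs <- compare_tables(dat1, dat2, keys = c("Last_Name", "First_Name"))
print(diffs)
```

One caveat with this sketch: filter() drops rows where the comparison is NA, so a value that is missing in one table and present in the other would not be reported. If that case matters, the filter condition would need to be extended.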

Here is a solution based on data.table:

library(data.table)

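# Convert to data.table and melt to long form (one row per key and variable).
# unlist() coerces the mixed column types to character, which is fine for comparison.
setDT(d1)
d1 <- d1[, list(VarName = names(.SD), TableOne = unlist(.SD, use.names = FALSE)), by = c('Last_Name', 'First_Name')]

setDT(d2)
d2 <- d2[, list(VarName = names(.SD), TableTwo = unlist(.SD, use.names = FALSE)), by = c('Last_Name', 'First_Name')]

# Set keys for merging
setkey(d1, Last_Name, First_Name, VarName)

# Inner join on the key, then keep only the rows whose values differ
d1[d2, nomatch = 0][TableOne != TableTwo]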

#     Last_Name First_Name        VarName      TableOne      TableTwo
#     1:       Doe       Jane Street_Address 4321 Tower St 4111 Tower St
#     2:       Doe       Jane            ZIP         54321         32132
#     3:       Doe       Jane     VisitCount            10            17

where the input data sets are:

# Input Data Sets
d1 <- structure(list(Last_Name = c("Doe", "Doe", "Don", "Smith"), First_Name = c("John", 
"Jane", "Bob", "Mike"), Street_Address = c("1234 Main St", "4321 Tower St", 
"771  North Ave", "732 South Blvd."), ZIP = c(12345L, 54321L, 
23232L, 77777L), VisitCount = c(20L, 10L, 5L, 3L)), .Names = c("Last_Name", 
"First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -4L))                                                                                                               

d2 <- structure(list(Last_Name = c("Doe", "Doe", "Donn"), First_Name = c("John", 
"Jane", "Bob"), Street_Address = c("1234 Main St", "4111 Tower St", 
"771  North Ave"), ZIP = c(12345L, 32132L, 11111L), VisitCount = c(20L, 
17L, 5L)), .Names = c("Last_Name", "First_Name", "Street_Address", 
"ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -3L))
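
The same idea can probably be written with data.table's own melt() and merge() and wrapped in a function that takes the key columns as an argument. This is a sketch, not the answer's original code: compare_dt and its arguments are hypothetical names, and the non-key columns are cast to character up front so that melt() does not have to reconcile different column types.

```r
library(data.table)

# Hypothetical helper (illustrative names): compare two tables on `keys`
# and return one row per differing value.
compare_dt <- function(x, y, keys) {
  to_long <- function(df, value_name) {
    dt <- as.data.table(df)  # copies, so the caller's data is left untouched
    value_cols <- setdiff(names(dt), keys)
    # cast every non-key column to character so mixed types melt cleanly
    dt[, (value_cols) := lapply(.SD, as.character), .SDcols = value_cols]
    melt(dt, id.vars = keys, variable.name = "VarName",
         value.name = value_name, variable.factor = FALSE)
  }
  # merge() on data.tables is an inner join by default
  merged <- merge(to_long(x, "TableOne"), to_long(y, "TableTwo"),
                  by = c(keys, "VarName"))
  merged[TableOne != TableTwo]
}

t1 <- data.frame(Last_Name = c("Doe", "Doe", "Don", "Smith"),
                 First_Name = c("John", "Jane", "Bob", "Mike"),
                 Street_Address = c("1234 Main St", "4321 Tower St",
                                    "771  North Ave", "732 South Blvd."),
                 ZIP = c(12345L, 54321L, 23232L, 77777L),
                 VisitCount = c(20L, 10L, 5L, 3L),
                 stringsAsFactors = FALSE)
t2 <- data.frame(Last_Name = c("Doe", "Doe", "Donn"),
                 First_Name = c("John", "Jane", "Bob"),
                 Street_Address = c("1234 Main St", "4111 Tower St",
                                    "771  North Ave"),
                 ZIP = c(12345L, 32132L, 11111L),
                 VisitCount = c(20L, 17L, 5L),
                 stringsAsFactors = FALSE)

diffs_dt <- compare_dt(t1, t2, keys = c("Last_Name", "First_Name"))
print(diffs_dt)
```

Because merge() sorts by the join columns, the rows may come back in a different order than in the hand-rolled version above.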

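The dataCompareR package is designed to solve exactly this problem. The package's vignette contains some simple examples, and I have used the package to solve the original problem below.

Disclaimer: I was involved in creating this package.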

library(dataCompareR)

d1 <- structure(list(Last_Name = c("Doe", "Doe", "Don", "Smith"), First_Name = c("John", "Jane", "Bob", "Mike"), Street_Address = c("1234 Main St", "4321 Tower St", "771  North Ave", "732 South Blvd."), ZIP = c(12345L, 54321L, 23232L, 77777L), VisitCount = c(20L, 10L, 5L, 3L)), .Names = c("Last_Name", "First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -4L))                                                                                                               

d2 <- structure(list(Last_Name = c("Doe", "Doe", "Donn"), First_Name = c("John", "Jane", "Bob"), Street_Address = c("1234 Main St", "4111 Tower St", "771  North Ave"), ZIP = c(12345L, 32132L, 11111L), VisitCount = c(20L, 17L, 5L)), .Names = c("Last_Name", "First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -3L))

compd1d2 <- rCompare(d1, d2, keys = c("First_Name", "Last_Name"))

print(compd1d2)

All columns were compared, 3 row(s) were dropped from comparison
There are  3 mismatched variables:
First and last 5 observations for the  3 mismatched variables
FIRST_NAME LAST_NAME        valueA        valueB       variable     typeA  typeB diffAB
1       Jane       Doe 4321 Tower St 4111 Tower St STREET_ADDRESS character character       
2       Jane       Doe            10            17     VISITCOUNT   integer   integer     -7
3       Jane       Doe         54321         32132            ZIP   integer   integer  22189

For a more detailed and more nicely formatted summary, you can run

summary(compd1d2)

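The use of FIRST_NAME and LAST_NAME as the 'join' between the two tables is controlled by the keys = argument of the rCompare function. In this case, any rows that do not match on these two variables are dropped from the comparison, but you can get more detailed comparison output by using summary.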