Find discrepancies between two tables
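I'm coming to R from a SAS/SQL background, and I'm trying to write code that takes two tables, compares them, and produces a list of discrepancies. This code will be reused for many different sets of tables, so I need to avoid hard-coding.
I've been working from "Identifying specific differences between two data sets in R", but it didn't get me all the way there.
Sample data, using the combination of LastName/FirstName (unique) as the key --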
Dataset One --
Last_Name  First_Name  Street_Address   ZIP    VisitCount
Doe        John        1234 Main St     12345  20
Doe        Jane        4321 Tower St    54321  10
Don        Bob         771 North Ave    23232  5
Smith      Mike        732 South Blvd.  77777  3
Dataset Two --
Last_Name  First_Name  Street_Address   ZIP    VisitCount
Doe        John        1234 Main St     12345  20
Doe        Jane        4111 Tower St    32132  17
Donn       Bob         771 North Ave    11111  5
Desired Output --
LastName  FirstName  VarName        TableOne       TableTwo
Doe       Jane       StreetAddress  4321 Tower St  4111 Tower St
Doe       Jane       Zip            23232          32132
Doe       Jane       VisitCount     5              17
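Note that this output ignores records that don't have the same ID in both tables (for example, because Bob's last name is "Don" in one table and "Donn" in the other, we ignore that record entirely).
I've explored doing this by applying the melt function to both data sets and then comparing them, but the size of the data I'm working with suggests that would be impractical. In SAS I use Proc Compare for this kind of work, but I haven't found an exact equivalent in R.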
dplyr and tidyr work well here. First, a slightly reduced data set:
dat1 <- data.frame(Last_Name = c('Doe', 'Doe', 'Don', 'Smith'),
                   First_Name = c('John', 'Jane', 'Bob', 'Mike'),
                   ZIP = c(12345, 54321, 23232, 77777),
                   VisitCount = c(20, 10, 5, 3),
                   stringsAsFactors = FALSE)
dat2 <- data.frame(Last_Name = c('Doe', 'Doe', 'Donn'),
                   First_Name = c('John', 'Jane', 'Bob'),
                   ZIP = c(12345, 32132, 11111),
                   VisitCount = c(20, 17, 5),
                   stringsAsFactors = FALSE)
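(Sorry, I didn't want to type it all out. If it matters, please provide a reproducible example with a well-defined data structure.)
Also, your "desired output" looks slightly off for Jane Doe's ZIP and VisitCount: it shows 23232 and 5, which are Bob's values in Dataset One.
Your idea of melting them works well: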
library(dplyr)
library(tidyr)
dat1g <- gather(dat1, key, value, -Last_Name, -First_Name)
dat2g <- gather(dat2, key, value, -Last_Name, -First_Name)
head(dat1g)
## Last_Name First_Name key value
## 1 Doe John ZIP 12345
## 2 Doe Jane ZIP 54321
## 3 Don Bob ZIP 23232
## 4 Smith Mike ZIP 77777
## 5 Doe John VisitCount 20
## 6 Doe Jane VisitCount 10
From here, it's deceptively simple:
dat1g %>%
  inner_join(dat2g, by = c('Last_Name', 'First_Name', 'key')) %>%
  filter(value.x != value.y)
## Last_Name First_Name key value.x value.y
## 1 Doe Jane ZIP 54321 32132
## 2 Doe Jane VisitCount 10 17
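Since the question asks for something reusable without hard-coded column names, and worries that melting large tables is impractical, the join-and-filter idea above can also be sketched in base R by joining the wide tables once and comparing one column at a time. This is only a sketch under stated assumptions: `compare_tables` and its arguments are hypothetical names, not from any package, and values are compared as character strings, so numeric columns are compared by their printed representation.

```r
# Hypothetical helper; the name and arguments are illustrative only.
compare_tables <- function(a, b, keys) {
  # Compare only the non-key columns that exist in both tables
  vars <- setdiff(intersect(names(a), names(b)), keys)
  # Inner join on the keys; rows without a match (e.g. Don vs. Donn) drop out
  m <- merge(a, b, by = keys, suffixes = c(".one", ".two"))
  pieces <- lapply(vars, function(v) {
    x <- as.character(m[[paste0(v, ".one")]])
    y <- as.character(m[[paste0(v, ".two")]])
    # A cell differs if the values are unequal or exactly one side is NA
    differs <- xor(is.na(x), is.na(y)) | (!is.na(x) & !is.na(y) & x != y)
    data.frame(m[differs, keys, drop = FALSE],
               VarName = rep(v, sum(differs)),
               TableOne = x[differs],
               TableTwo = y[differs],
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Sample data from the question
d1 <- data.frame(Last_Name = c("Doe", "Doe", "Don", "Smith"),
                 First_Name = c("John", "Jane", "Bob", "Mike"),
                 Street_Address = c("1234 Main St", "4321 Tower St",
                                    "771 North Ave", "732 South Blvd."),
                 ZIP = c(12345L, 54321L, 23232L, 77777L),
                 VisitCount = c(20L, 10L, 5L, 3L),
                 stringsAsFactors = FALSE)
d2 <- data.frame(Last_Name = c("Doe", "Doe", "Donn"),
                 First_Name = c("John", "Jane", "Bob"),
                 Street_Address = c("1234 Main St", "4111 Tower St",
                                    "771 North Ave"),
                 ZIP = c(12345L, 32132L, 11111L),
                 VisitCount = c(20L, 17L, 5L),
                 stringsAsFactors = FALSE)

res <- compare_tables(d1, d2, keys = c("Last_Name", "First_Name"))
print(res)
```

Only the key column names are passed in, so the same call should work for other table pairs as long as the keys uniquely identify rows in both tables.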
Here is a data.table-based solution:
library(data.table)
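# Convert to data.table and reshape to long format (one row per key/variable).
# unlist() coerces mixed column types to character.
setDT(d1)
d1 <- d1[, list(VarName = names(.SD), TableOne = unlist(.SD, use.names = FALSE)), by = c('Last_Name', 'First_Name')]
setDT(d2)
d2 <- d2[, list(VarName = names(.SD), TableTwo = unlist(.SD, use.names = FALSE)), by = c('Last_Name', 'First_Name')]
# Key d1 so that d1[d2] joins on Last_Name, First_Name and VarName
setkey(d1, Last_Name, First_Name, VarName)
# Inner join, then keep only the rows whose values differ
d1[d2, nomatch = 0][TableOne != TableTwo]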
# Last_Name First_Name VarName TableOne TableTwo
# 1: Doe Jane Street_Address 4321 Tower St 4111 Tower St
# 2: Doe Jane ZIP 54321 32132
# 3: Doe Jane VisitCount 10 17
where the input data sets are:
# Input Data Sets
d1 <- structure(list(Last_Name = c("Doe", "Doe", "Don", "Smith"), First_Name = c("John",
"Jane", "Bob", "Mike"), Street_Address = c("1234 Main St", "4321 Tower St",
"771 North Ave", "732 South Blvd."), ZIP = c(12345L, 54321L,
23232L, 77777L), VisitCount = c(20L, 10L, 5L, 3L)), .Names = c("Last_Name",
"First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -4L))
d2 <- structure(list(Last_Name = c("Doe", "Doe", "Donn"), First_Name = c("John",
"Jane", "Bob"), Street_Address = c("1234 Main St", "4111 Tower St",
"771 North Ave"), ZIP = c(12345L, 32132L, 11111L), VisitCount = c(20L,
17L, 5L)), .Names = c("Last_Name", "First_Name", "Street_Address",
"ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -3L))
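The same long-format join can be wrapped in a function that takes the key columns as an argument, using data.table's `melt()` and an `on =` join instead of `setkey()`. This is a rough sketch rather than a definitive implementation: `dt_diff` is a hypothetical name, and every measured column is cast to character before melting so that mixed types compare as strings.

```r
library(data.table)

# Hypothetical helper (not part of data.table); names are illustrative.
dt_diff <- function(a, b, keys) {
  to_long <- function(df, value_name) {
    dt <- copy(as.data.table(df))           # work on a copy of the input
    vars <- setdiff(names(dt), keys)
    # Cast measured columns to character so mixed types melt cleanly
    dt[, (vars) := lapply(.SD, as.character), .SDcols = vars]
    melt(dt, id.vars = keys, variable.name = "VarName",
         value.name = value_name, variable.factor = FALSE)
  }
  one <- to_long(a, "TableOne")
  two <- to_long(b, "TableTwo")
  # Inner join on keys + variable name, then keep cells that differ
  one[two, on = c(keys, "VarName"), nomatch = NULL][TableOne != TableTwo]
}

# Sample data from the question
t1 <- data.frame(Last_Name = c("Doe", "Doe", "Don", "Smith"),
                 First_Name = c("John", "Jane", "Bob", "Mike"),
                 Street_Address = c("1234 Main St", "4321 Tower St",
                                    "771 North Ave", "732 South Blvd."),
                 ZIP = c(12345L, 54321L, 23232L, 77777L),
                 VisitCount = c(20L, 10L, 5L, 3L),
                 stringsAsFactors = FALSE)
t2 <- data.frame(Last_Name = c("Doe", "Doe", "Donn"),
                 First_Name = c("John", "Jane", "Bob"),
                 Street_Address = c("1234 Main St", "4111 Tower St",
                                    "771 North Ave"),
                 ZIP = c(12345L, 32132L, 11111L),
                 VisitCount = c(20L, 17L, 5L),
                 stringsAsFactors = FALSE)

diffs <- dt_diff(t1, t2, keys = c("Last_Name", "First_Name"))
print(diffs)
```

Because the helper copies its inputs, the original data frames are left unchanged, unlike `setDT()`, which converts them by reference.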
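The dataCompareR package is designed to solve exactly this problem. The package's vignette contains some simple examples, and I have used the package to solve the original problem below.
Disclaimer: I was involved in creating this package.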
library(dataCompareR)
d1 <- structure(list(Last_Name = c("Doe", "Doe", "Don", "Smith"), First_Name = c("John", "Jane", "Bob", "Mike"), Street_Address = c("1234 Main St", "4321 Tower St", "771 North Ave", "732 South Blvd."), ZIP = c(12345L, 54321L, 23232L, 77777L), VisitCount = c(20L, 10L, 5L, 3L)), .Names = c("Last_Name", "First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -4L))
d2 <- structure(list(Last_Name = c("Doe", "Doe", "Donn"), First_Name = c("John", "Jane", "Bob"), Street_Address = c("1234 Main St", "4111 Tower St", "771 North Ave"), ZIP = c(12345L, 32132L, 11111L), VisitCount = c(20L, 17L, 5L)), .Names = c("Last_Name", "First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -3L))
compd1d2 <- rCompare(d1, d2, keys = c("First_Name", "Last_Name"))
print(compd1d2)
All columns were compared, 3 row(s) were dropped from comparison
There are 3 mismatched variables:
First and last 5 observations for the 3 mismatched variables
FIRST_NAME LAST_NAME valueA valueB variable typeA typeB diffAB
1 Jane Doe 4321 Tower St 4111 Tower St STREET_ADDRESS character character
2 Jane Doe 10 17 VISITCOUNT integer integer -7
3 Jane Doe 54321 32132 ZIP integer integer 22189
For a more detailed and prettier summary, the user can run
summary(compd1d2)
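The use of FIRST_NAME and LAST_NAME as the 'join' between the two tables is controlled by the keys = argument of the rCompare function. In this case, any rows that do not match on these two variables are dropped from the comparison, but you can get more detailed comparison output by using summary.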