Comparing Columns of Names Between Two Dataframes Before Joining with Dplyr

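I'd like to know whether there is a simple way to compare columns before doing a join in dplyr. Below are two simple data frames. I want to do a full join on first and last names, but there are some misspellings and formatting differences, such as "Elizabeth Ray" vs. "Elizabeth".

I'd like to compare these columns before joining. Ideally there would be a way to generate a list or vector of all the differences, with their indices, so that I can correct them before the join.

I'm open to an easier approach if there is one, but I'm hoping for the simplest possible way. I'd like a solution based on dplyr, tidyr, and stringr.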

FirstNames<-c("Chris","Doug","Shintaro","Bubbles","Elsa")
LastNames<-c("MacDougall","Shapiro","Yamazaki","Murphy","Elizabeth Ray")
Pets<-c("Cat","Dog","Cat","Dog","Cat")
Names1<-data.frame(FirstNames,LastNames,Pets)

FirstNames2<-c("Chris","Doug","Shintaro","Bubbles","Elsa")
LastNames2<-c("MacDougal","Shapiro","Yamazaku","Murphy","Elizabeth")
Dwelling<-c("House","House","Apartment","Condo","House")
Names2<-data.frame(FirstNames2,LastNames2,Dwelling)

I'm posting this as an answer because I don't have access to comments.

df = Names1[!(Names1$LastNames %in% Names2$LastNames2), ]

Try the code above. It keeps the rows of Names1 whose last name has no exact match in Names2.
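
Since the question asks for dplyr, here is a sketch of the same exact-match check written with anti_join(). The row_id column is a hypothetical helper, added only so the original row positions survive the join; it is not part of the question's data.

```r
library(dplyr)

Names1 <- data.frame(
  FirstNames = c("Chris", "Doug", "Shintaro", "Bubbles", "Elsa"),
  LastNames  = c("MacDougall", "Shapiro", "Yamazaki", "Murphy", "Elizabeth Ray"),
  Pets       = c("Cat", "Dog", "Cat", "Dog", "Cat"),
  stringsAsFactors = FALSE
)
Names2 <- data.frame(
  FirstNames2 = c("Chris", "Doug", "Shintaro", "Bubbles", "Elsa"),
  LastNames2  = c("MacDougal", "Shapiro", "Yamazaku", "Murphy", "Elizabeth"),
  Dwelling    = c("House", "House", "Apartment", "Condo", "House"),
  stringsAsFactors = FALSE
)

# row_id is a hypothetical helper column recording the original position
Unmatched <- Names1 %>%
  mutate(row_id = row_number()) %>%
  anti_join(Names2, by = c("LastNames" = "LastNames2"))

# Positions in Names1 whose last name has no exact counterpart in Names2
Unmatched$row_id
```

This only flags exact mismatches; it says nothing about how close the two spellings are.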

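To compare how similar your records are, I think you may be looking for a way to apply a fuzzy-matching measure to your name comparison. In other words: applying a string distance function while performing a record linkage task. (Forgive me if you already know all this, but these keywords helped me a lot at the start.)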

There is a great package, stringdist, that works very well for these applications, but RecordLinkage will probably get you started on aligning your data frames the fastest.
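
For reference, a minimal sketch of what stringdist gives you on the last-name columns. The choice of methods ("lv" and "jw") and the prefix weight p = 0.1 are my own illustrative assumptions, not something the question requires.

```r
library(stringdist)

lastName  <- c("MacDougall", "Shapiro", "Yamazaki", "Murphy", "Elizabeth Ray")
lastName2 <- c("MacDougal", "Shapiro", "Yamazaku", "Murphy", "Elizabeth")

# Element-wise Levenshtein distance: number of single-character edits
lvDist <- stringdist(lastName, lastName2, method = "lv")

# Element-wise Jaro-Winkler distance: 0 means identical, values near 1 mean very different
jwDist <- stringdist(lastName, lastName2, method = "jw", p = 0.1)

# Side-by-side view for review
data.frame(lastName, lastName2, lvDist, jwDist)
```

Note that stringdist reports distances (0 is a perfect match), whereas the RecordLinkage results shown in this answer are similarities (1 is a perfect match).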

If you'd like to see the first and last name values ordered from most similar to least similar, you can use code like the following:

library(RecordLinkage)
library(dplyr)

id <- c(1:5) # added in to allow joining of data tables & comparison results
firstName <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
lastName <- c("MacDougall","Shapiro","Yamazaki","Murphy","Elizabeth Ray")
pet <- c("Cat","Dog","Cat","Dog","Cat")
Names1 <- data.frame(id, firstName, lastName, pet)

id <- c(1:5) # added in to allow joining of data tables & comparison results
firstName2 <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
lastName2 <- c("MacDougal","Shapiro","Yamazaku","Murphy","Elizabeth")
dwelling <- c("House","House","Apartment","Condo","House")
Names2 <- data.frame(id, firstName2, lastName2, dwelling)

# RecordLinkage function that calculates string distance b/w records in two data frames
Results <- compare.linkage(Names1, Names2, blockfld = 1, strcmp = T, exclude = 4)
Results
#  $data1
#    firstName      lastName  pet
# 1      Chris    MacDougall  Cat
# 2       Doug       Shapiro  Dog
# 3   Shintaro      Yamazaki  Cat
# 4    Bubbles        Murphy  Dog
# 5       Elsa Elizabeth Ray  Cat

# $data2
#    firstName2  lastName2  dwelling
# 1       Chris  MacDougal     House
# 2        Doug    Shapiro     House
# 3    Shintaro   Yamazaku Apartment
# 4     Bubbles     Murphy     Condo
# 5        Elsa  Elizabeth     House

# $pairs
# id1 id2 id firstName  lastName is_match
# 1   1   1  1         1 0.9800000       NA
# 2   2   2  1         1 1.0000000       NA
# 3   3   3  1         1 0.9500000       NA
# 4   4   4  1         1 1.0000000       NA
# 5   5   5  1         1 0.9384615       NA

# $frequencies
# id firstName  lastName 
# 0.200     0.200     0.125 
# $type
# [1] "linkage"

# attr(,"class")
# [1] "RecLinkData"

# Trim $pairs dataframe (seen above) to contain just id's & similarity measures
PairsSelect <- 
    Results$pairs %>% 
    select(id = id1, firstNameSim = firstName, lastNameSim = lastName)

# Join original data & string comparison results together
# reorganize data to facilitate review
JoinedResults <-
    left_join(Names1, Names2, by = "id") %>% 
    left_join(PairsSelect, by = "id") %>% 
    select(id, firstNameSim, firstName, firstName2, lastNameSim, lastName, lastName2) %>% 
    arrange(desc(lastNameSim), desc(firstNameSim), id)
JoinedResults
# id firstNameSim firstName firstName2 lastNameSim      lastName lastName2
# 1  2            1      Doug       Doug   1.0000000       Shapiro   Shapiro
# 2  4            1   Bubbles    Bubbles   1.0000000        Murphy    Murphy
# 3  1            1     Chris      Chris   0.9800000    MacDougall MacDougal
# 4  3            1  Shintaro   Shintaro   0.9500000      Yamazaki  Yamazaku
# 5  5            1      Elsa       Elsa   0.9384615 Elizabeth Ray Elizabeth

# If you want to collect just the perfect matches
PerfectMatches <- 
    JoinedResults %>% 
    filter(firstNameSim == 1 & lastNameSim == 1) %>% 
    select(id, firstName, lastName)
PerfectMatches
#   id firstName lastName
# 1  2      Doug  Shapiro
# 2  4   Bubbles   Murphy

# To collect the matches that are going to need alignment
ImperfectMatches <- 
    JoinedResults %>% 
    filter(firstNameSim < 1 | lastNameSim < 1) %>% 
    mutate(flgFrstNm = 0, flgLstNm = 0)
ImperfectMatches
#   id firstNameSim firstName firstName2 lastNameSim      lastName lastName2 flgFrstNm flgLstNm
# 1  1            1     Chris      Chris   0.9800000    MacDougall MacDougal         0        0
# 2  3            1  Shintaro   Shintaro   0.9500000      Yamazaki  Yamazaku         0        0
# 3  5            1      Elsa       Elsa   0.9384615 Elizabeth Ray Elizabeth         0        0
# 

# If you want to enter your column preference in a flag column to facilitate faster rectification...
write.csv(ImperfectMatches, "ImperfectMatches.csv", na = "", row.names = F)
## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
# Flag data externally - save file to new name with '_reviewed' appended to filename
## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##
#reload results
FlaggedMatches <- read.csv("ImperfectMatches_reviewed.csv", stringsAsFactors = F)
FlaggedMatches
## Where a 1 is the 1st data set preferred and 0 (or 2 if that is easier for the 'data processor') means the 2nd data set is preferred.
#   id firstNameSim firstName firstName2 lastNameSim      lastName lastName2 flgFrstNm flgLstNm
# 1  1            1     Chris      Chris   0.9800000    MacDougall MacDougal         1        0
# 2  3            1  Shintaro   Shintaro   0.9500000      Yamazaki  Yamazaku         1        1
# 3  5            1      Elsa       Elsa   0.9384615 Elizabeth Ray Elizabeth         1        0

## Executing Assembly of preferred/rectified firstName and lastName columns
ResolvedMatches <- 
    FlaggedMatches %>% 
    mutate(rectifiedFirstName = ifelse(flgFrstNm == 1,firstName, firstName2),
           rectifiedLastName = ifelse(flgLstNm == 1, lastName, lastName2)) %>% 
    select(id, starts_with("rectified"))

ResolvedMatches
# id rectifiedFirstName rectifiedLastName
# 1  1              Chris         MacDougal
# 2  3           Shintaro          Yamazaki
# 3  5               Elsa         Elizabeth

The dplyr part is fairly intuitive, but the compare.linkage() function needs some explanation.

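The first two arguments are obvious: the two data frames you are comparing (dataframe1 and dataframe2). [If you only want to compare the records within a single data frame against each other (to deduplicate a record set), you can use compare.dedup() instead and reference just one data frame.]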

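blockfld names the column(s) on which two records must agree exactly before they are compared at all. In the original 3-column data frames, setting it to 1 or 2 would require a 100% match on first name or last name, respectively. Instead, you may want to include a primary/foreign key in your data sets and reference that column in the blockfld argument, which is what the code above does with blockfld = 1 (the id column). Alternatively, if your records aren't actually constructed so equivalently, you can leave this argument out entirely (it defaults to FALSE), and all possible combinations will be compared [the cross product of the data frames].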
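
To make the effect of blocking concrete, here is a rough sketch in base R. It does not call RecordLinkage at all; it only counts the candidate pairs you would get with a key versus with the full cross product, using trimmed-down copies of the two data frames.

```r
Names1 <- data.frame(
  id = 1:5,
  lastName = c("MacDougall", "Shapiro", "Yamazaki", "Murphy", "Elizabeth Ray"),
  stringsAsFactors = FALSE
)
Names2 <- data.frame(
  id = 1:5,
  lastName2 = c("MacDougal", "Shapiro", "Yamazaku", "Murphy", "Elizabeth"),
  stringsAsFactors = FALSE
)

# "Blocked" on id: only records sharing a key are paired
BlockedPairs <- merge(Names1, Names2, by = "id")

# No blocking: by = NULL makes merge() return the Cartesian product
AllPairs <- merge(Names1, Names2, by = NULL)

nrow(BlockedPairs)  # 5 candidate pairs
nrow(AllPairs)      # 25 candidate pairs
```

The pair count grows with the product of the two row counts when nothing is blocked, which is why a key (or another blocking column) matters on larger data.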

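strcmp = TRUE applies a string comparison function to the columns you are comparing, giving a similarity between 0 and 1 (1 meaning identical); if you leave it as FALSE, it only tests for exact 1:1 string correspondence.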
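
If you want to spot-check individual pairs, RecordLinkage also exports its Jaro-Winkler comparator as jarowinkler(), which is the default string comparison function in compare.linkage(). The following is a sketch for checking values by hand, not a description of the package internals; the results should line up with the lastName column of $pairs shown earlier.

```r
library(RecordLinkage)

lastName  <- c("MacDougall", "Shapiro", "Yamazaki", "Murphy", "Elizabeth Ray")
lastName2 <- c("MacDougal", "Shapiro", "Yamazaku", "Murphy", "Elizabeth")

# Element-wise Jaro-Winkler similarity: 1 means the strings are identical
lastNameSim <- jarowinkler(lastName, lastName2)

# Side-by-side view, rounded for readability
data.frame(lastName, lastName2, lastNameSim = round(lastNameSim, 4))
```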

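exclude is also a nice way to avoid having to build intermediate data frames and select only the columns you want compared with each other: exclude = 4 simply drops the pet and dwelling comparison from the results (that column was number 3 in the original, unkeyed data frames).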

The 4-column keyed data frames in the code above (not the 3-column data frames from the original question) produce the following results:

#  $data1
#    firstName      lastName  pet
# 1      Chris    MacDougall  Cat
# 2       Doug       Shapiro  Dog
# 3   Shintaro      Yamazaki  Cat
# 4    Bubbles        Murphy  Dog
# 5       Elsa Elizabeth Ray  Cat

# $data2
#    firstName2  lastName2  dwelling
# 1       Chris  MacDougal     House
# 2        Doug    Shapiro     House
# 3    Shintaro   Yamazaku Apartment
# 4     Bubbles     Murphy     Condo
# 5        Elsa  Elizabeth     House

# $pairs
# id1 id2 id firstName  lastName is_match
# 1   1   1  1         1 0.9800000       NA
# 2   2   2  1         1 1.0000000       NA
# 3   3   3  1         1 0.9500000       NA
# 4   4   4  1         1 1.0000000       NA
# 5   5   5  1         1 0.9384615       NA

# $frequencies
# id firstName  lastName 
# 0.200     0.200     0.125 
# $type
# [1] "linkage"

# attr(,"class")
# [1] "RecLinkData"

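Several of the sections above (e.g. $data1, $data2, $pairs) are their own data frames. Add a key and you can join them all together, then use the values in the pairs data frame as a threshold gate, and even copy data1 values into the data2 frame when, for example, a pair's similarity rating is > 0.95. (NB: is_match looks important, but it is for training the matching tool and is not relevant to our task here.)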
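
A minimal sketch of that threshold-gate idea with dplyr. The Reviewed data frame is typed in by hand from the imperfect matches shown earlier, and simThreshold is a hypothetical cutoff you would tune to your own data.

```r
library(dplyr)

# Hand-copied from the imperfect matches shown earlier
Reviewed <- data.frame(
  id          = c(1, 3, 5),
  lastNameSim = c(0.98, 0.95, 0.9384615),
  lastName    = c("MacDougall", "Yamazaki", "Elizabeth Ray"),
  lastName2   = c("MacDougal", "Yamazaku", "Elizabeth"),
  stringsAsFactors = FALSE
)

simThreshold <- 0.95  # hypothetical cutoff, not a recommended value

# Copy the data1 spelling into lastName2 only when similarity is strictly above the cutoff
Gated <- Reviewed %>%
  mutate(lastName2 = if_else(lastNameSim > simThreshold, lastName, lastName2))

Gated
```

With a strict > comparison, only id 1 is overwritten here; ids 3 and 5 are left for manual review.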

Anyway, I hope you find the sudden boost these libraries give you as exciting to work with as I did when I first came across them.

BTW: I also found this Comparison of String Distance Algorithms to be a great survey of the string distance metrics currently available.

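Using the standard adist() function suggested by @alistaire is a very efficient approach (and quite possibly the one an instructor would want to see). adist's string metric is limited to the generalized Levenshtein (edit) distance, but that looks like exactly what you are after.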

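The code is below. (Since this looks like an introductory R coding class focused on data handling, I've added some best-practice polish to the reproducible example from the question.)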

library(dplyr)

id <- c(1:5)
firstName <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
lastName <- c("MacDougall","Shapiro","Yamazaki","Murphy","Elizabeth Ray")
pet <- c("Cat","Dog","Cat","Dog","Cat")
Names1 <- data.frame(id, firstName, lastName, pet)

id <- c(1:5)
firstName2 <- c("Chris","Doug","Shintaro","Bubbles","Elsa")
lastName2 <- c("MacDougal","Shapiro","Yamazaku","Murphy","Elizabeth")
dwelling <- c("House","House","Apartment","Condo","House")
Names2 <- data.frame(id, firstName2, lastName2, dwelling)
# NB: technically you could merge these data frames later with `bind_cols()` but best 
# datahandling practices dictate joining/comparing data based on keys (instead of 
# binding columns together based upon the order in which tables are initially arranged.)
#[also preference is for column headers to be singular and lower case, and tables/dataframes to be uppercase and plural - from (or extension from principles in): https://google.github.io/styleguide/Rguide.xml]

## adist() calculates string distance b/w records in two data frames
# Matrix between all columns is great way to ascertain similarity of data
# on overall column to column basis.
# 0 is closest resemblance, higher numbers are lowest resemblance
ResultsInterColumnComparison <-
    adist(Names1, Names2, partial = T)
ResultsInterColumnComparison

# firstName to firstName2 & lastName to LastName2 are similar columns.
#           id firstName2 lastName2 dwelling
# id         0          2         2        2
# firstName 15          0         3        4
# lastName  15          3         0        5
# pet       15          5         4        3

# adist element-wise difference counts: diag() keeps only row i vs. row i
dltFrstN <- diag(adist(Names1$firstName, Names2$firstName2, partial = T))
dltLstN <- diag(adist(Names1$lastName, Names2$lastName2, partial = T))

# Join all info together
DFcompilation <- 
    data.frame(id, dltFrstN, firstName, firstName2, dltLstN, lastName, lastName2) %>% 
    arrange(desc(dltLstN), desc(dltFrstN))
DFcompilation
#   id dltFrstN firstName firstName2 dltLstN      lastName lastName2
# 1  5        0      Elsa       Elsa       4 Elizabeth Ray Elizabeth
# 2  1        0     Chris      Chris       1    MacDougall MacDougal
# 3  3        0  Shintaro   Shintaro       1      Yamazaki  Yamazaku
# 4  2        0      Doug       Doug       0       Shapiro   Shapiro
# 5  4        0   Bubbles    Bubbles       0        Murphy    Murphy

This approach is simpler and the code it needs is more concise. I hope it is also more useful for your purposes.
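
To get the vector of differing indices the question asks for, something along these lines should work. It is a base R sketch that uses the plain (non-partial) edit distance; mismatchIdx is a name I made up for illustration.

```r
lastName  <- c("MacDougall", "Shapiro", "Yamazaki", "Murphy", "Elizabeth Ray")
lastName2 <- c("MacDougal", "Shapiro", "Yamazaku", "Murphy", "Elizabeth")

# Row-by-row edit distance: diag() keeps element i of lastName vs. element i of lastName2
dltLstN <- diag(adist(lastName, lastName2))

# Indices of the rows whose last names differ
mismatchIdx <- which(dltLstN > 0)
mismatchIdx

# The differing values themselves, ready for correction before the join
data.frame(
  index     = mismatchIdx,
  lastName  = lastName[mismatchIdx],
  lastName2 = lastName2[mismatchIdx],
  dltLstN   = dltLstN[mismatchIdx]
)
```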