根据名字、中间名、姓氏和出生日期匹配两个 DF(考虑数据缺陷)

match two DFs based on first, middle, last name & date of birth (account for data flaws)

我有一个很简单的问题:我想查看DF1中的哪些人包含在DF2中。我想根据他们

这样做

我只想保留 DF1 和 DF2 中正确匹配的那些行。

DF1 看起来像这样

(编辑:“XXX”改为“乔”)

DF1 <- data.frame(row_ID = 1:13, 
                  first_name = c("Jovana", "Jovana", "Jovana", "Joe", "Jovana", "Jovana", "Zuhair", "Jackson", "James", "Alexandria", "Nicole", "Carl", "Matthew"),
                  middle_name = c("Cole", "", "Joe", "Cole", "Cole", "Cole", "Beate", "Milhouse",  "", "Macy", "Riley", "", ""),
                  last_name = c("Tossie", "Tossie", "Tossie", "Tossie", "Tossie", "Joe", "Biddison", "Beck", "Baker", "Maya", "Grinstead", "Domenico", "Hosler"),
                  date_of_birth = as.Date(c("1930-07-05","1930-07-05", "1930-07-05", "1930-07-05", "2000-01-01", "1930-07-05", "1939-04-18", "1936-11-11", "1933-02-18"," 1942-10-18", "1945-03-24", "1948-01-25", "1951-02-03")),
                  var_difference = c("none", "no middle name",  "diff middle name", "first name", "date of birth", "last name", "middle name not abbr", "middle name incl", "no title", "middle name column", "columns", "columns", "columns"),
                  var_should_be_found = c("yes", "yes", "no", "no", "no", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes"))

DF2 看起来像这样:

(编辑:Zuhair Biddison BD 1933-02-18 至 1939-04-18)

DF2 <- data.frame(row_ID = 1:20, 
                  first_name = c("Jovana","Zuhair","Jackson","Dr. James","Alexandria Macy","Nicole Riley Grinstead","","","Isaiah","Wyatt","Rayyana","Dhaahir","Lauren",
                                                "Tony","Aziza","Cody","Paige","Jasmine","Kawkab","Pedro"),
                  middle_name = c("Cole","B.","", "","","","","","Kyrie","","Zachary", "","Tyler","", "Brian", 
                                  "","Amy", "","Robert",""),
                  last_name = c("Tossie","Biddison","Beck","Baker","Maya","", "Carl Domenico","Hosler, Matthew","Bishop","Ericson","Leptich","Franks","Pummer","Neves","Ferguson","Jennings",
                                "Phillips","Wyatt","Caisse","Laplante"),
                  date_of_birth = as.Date(c("1930-07-05", "1939-04-18", "1936-11-11", "1939-04-18", 
                                            "1942-10-18", "1945-03-24", "1948-01-25", "1951-02-03", 
                                            "1954-05-27", "1957-08-05", "1960-08-01", "1963-11-26", 
                                            "1966-05-25", "1969-11-19", "1972-01-28", "1975-06-17", 
                                            "1978-07-24", "1981-07-11", "1984-10-28", "1987-09-14")),
                  var_other = sample(colours(), 20)
                  )

DF2有很多缺陷

有时:

  1. 中间名缩写
  2. 没有中间名
  3. 标题包含在 first_name 列中
  4. 中间名出现在first_name列
  5. 名字和姓氏一起出现在last-name-column中(顺序:first-namelast-name)
  6. 名字和姓氏一起出现在last-name-column中(顺序:last-name、first-name)

如前所述,最后,我想只保留DF1和DF2中出现的人的行,丢弃其余行,并合并DF1和DF2的列。

首先请问这个有什么方便快捷的功能吗? (问题看起来很简单,但我没有找到)

如果没有,这就是我所做的。它有效,但对我来说太慢了。对于 DF1(约 74000 obs)和几个 DF2 之一(超过 100000 obs),需要数小时

如有任何帮助,我将不胜感激!

我的做法:

1。合并所有姓名(名字、中间名、姓氏),至少有 2 个匹配,稍后。

DF1$all_names <- paste(DF1$first_name,
                       DF1$middle_name, 
                       DF1$last_name,
                       sep = " ")

DF2$all_names <- paste(DF2$first_name,
                       DF2$middle_name, 
                       DF2$last_name,
                       sep = " ")

2。首先寻找匹配的生日(首先,log-algorithm,然后是树)

##########################
# FUNCTION: BD MATCH log #
##########################
BD_MATCH <- function(the_data, birthday){
  
  not_precise_date <- T
  not_found <- T
  bd_found <- F
  
  while(not_precise_date & !bd_found & nrow(the_data)> 1){
    # check half
    half_of_df <- ceiling(nrow(the_data)/2)
    
    # is bd at half?
    bd_found <- the_data[half_of_df, "date_of_birth"] == birthday
    
    if(bd_found){bd_row_id <- the_data[half_of_df, "row_ID"]; break} # else{bd_row_id <- NULL}
    
    # is the bd above or below
    in_upper_half <- the_data[half_of_df, "date_of_birth"] >= birthday
    
    # subset accordingly
    if(in_upper_half){the_data <- the_data[1:half_of_df, ]
    } else{the_data <- the_data[(half_of_df+1):nrow(the_data), ]}
    
  }
  
  if(bd_found){return(bd_row_id)} else{return(NA)}
}

###########################
# FUNCTION: BD MATCH tree #
###########################
# search above and below for duplicate bds
TREE_FUN <- function(the_bd_vec, the_row){
  
  birthday <- the_bd_vec[the_row]
  
  # search above
  i <- the_row
  bd_criterion <- T
  
  while(bd_criterion & i>1){
    
    i <- i-1
    bd_criterion <- the_bd_vec[i] == birthday
  }
  begin <- ifelse(bd_criterion, 1, i+1)
  
  # search below
  i <- the_row
  bd_criterion <- T
  
  while(bd_criterion & i <= length(the_bd_vec)){
    
    i <- i+1
    
    bd_criterion <- the_bd_vec[i] == birthday
  }
  
  if(is.na(bd_criterion)|bd_criterion == F){
    end <- i-1
  } else{
    end <- i
  }
  
  return(begin:end)
}

3。检查是否至少有 2 个名称匹配

(这匹配,i.a。例如,姓氏不同,但名字、中间名和生日匹配的人。这是不正确的,但非常罕见.)

##########
# SEARCH #
##########

res_list <- list()

for(j in 1:nrow(DF1)){
  birthday <-  DF1$date_of_birth[j]
  DF1_name <- strsplit(DF1$all_names[j], split = " ")
  
  # SEARCH BIRTHDAY
  bd_row_id <- BD_MATCH(DF2, birthday)
  
  # SEARCH NAME 
  if(is.na(bd_row_id)){
    
    res_list[[j]] <- NA
    
  } else{
    
    the_row <- which(DF2$row_ID == bd_row_id)
    the_bd_vec <- DF2$date_of_birth
    
    begin_end <- TREE_FUN(the_bd_vec, the_row)
    
    BD_subset <- DF2[begin_end, ]
    
    ##############
    # NAME CHECK #
    ##############
    DF2_name <- strsplit(BD_subset$all_names, split = " ")
    
    the_vec <- NULL
    nest <- list()
    for(k in seq(DF2_name)){
      if(sum(DF2_name[[k]] %in% DF1_name[[1]]) >= 2) {
        
        the_vec <- c(the_vec, k)
        nest[[k]] <- BD_subset[the_vec, ]
        
      } else {
        nest[[k]] <- NA
      }
    }
    
    if(sum(is.na(nest)) == length(nest)){
      res_list[[j]] <- NA
    }
    else{
      res_list[[j]] <- bind_rows(nest[!is.na(nest)])  
    }
    
  } 
  print(j)
}


found_DF1 <- DF1[which(!is.na(res_list)), ]
found_DF2 <- res_list[!is.na(res_list)]

for(i in seq(found_DF2)){
  found_DF2[[i]] <- cbind(found_DF2[[i]], found_DF1[i , ])
}

found_DF2 <- bind_rows(found_DF2)

避免清理和计算循环。相反,考虑通过向量化操作清理两个数据帧,以正确规范化名称,每个名称列都包含一个标识符。然后,运行 两个 mergerbind 首先是所有三个名字,其次是名字和姓氏。然后 运行 unique() 到 de-duplicate 行。

withinstrsplitifelse清理)

注意:以下解决方案适用于发布的数据,可能需要针对其他数据问题进行扩展。

DF1_clean <- within(
  DF1, {
    first_name <- gsub("XXX", "", first_name)
    middle_name <- gsub("XXX", "", middle_name)
    last_name <- gsub("XXX", "", last_name)
  }
)

DF2_clean <- within(
  DF2, {
    # FIRST NAME CLEANUP
    first_temp <- trimws(gsub("Dr.|Mr.|Ms.|Mrs.", "", first_name))
    first_name_ <- trimws(sapply(strsplit(first_temp, " "), `[`, 1))
    middle_name_ <- trimws(sapply(strsplit(first_temp, " "), `[`, 2))
    last_name_ <- trimws(sapply(strsplit(first_temp, " "), `[`, 3))
    
    first_name <- ifelse(is.na(first_name_), first_name, first_name_)
    middle_name <- ifelse(is.na(middle_name_), middle_name, middle_name_)
    last_name <- ifelse(is.na(last_name_), last_name, last_name_)
    
    # LAST NAME CLEANUP
    last_temp <- trimws(gsub("Jr|Sr|III", "", last_name))
    first_name_ <- ifelse(
      grepl(",", last_temp), 
      sapply(strsplit(last_temp, ","), `[`, 2),
      sapply(strsplit(last_temp, " "), `[`, 1)
    )
    last_name_ <- ifelse(
      grepl(",", last_temp), 
      sapply(strsplit(last_temp, ","), `[`, 1),
      sapply(strsplit(last_temp, " "), `[`, 2)
    )
    
    first_temp <- trimws(first_name)
    first_name <- trimws(ifelse(first_temp=="", first_name_, first_name))
    last_name <- trimws(ifelse(first_temp=="", last_name_, last_name))
    
    # REMOVE HELPER TEMP COLUMNS
    rm(first_temp, last_temp, first_name_, middle_name_, last_name_)
  }
)

merge + rbind

final_df <- rbind.data.frame(
  merge(
    DF1_clean, DF2_clean, 
    by=c("first_name", "middle_name", "last_name", "date_of_birth"),
    suffixes=c("_DF1", "_DF2")
  ),
  merge(
    DF1_clean, transform(DF2_clean, middle_name=NULL),
    by=c("first_name", "last_name", "date_of_birth"),
    suffixes=c("_DF1", "_DF2")
  )
) |> unique()

输出

注意:Zuhair Biddison 和 James Baker 的倒置出生日期在 OP 的输入数据中是固定的,以匹配两个数据框。

final_df
   first_name middle_name last_name date_of_birth row_ID_DF1       var_difference var_should_be_found row_ID_DF2       var_other
1  Alexandria        Macy      Maya    1942-10-18         10   middle name column                 yes          5          gray58
2        Carl              Domenico    1948-01-25         12              columns                 yes          7 mediumvioletred
3       James                 Baker    1933-02-18          9             no title                 yes          4          grey94
4      Jovana        Cole    Tossie    1930-07-05          1                 none                 yes          1         bisque1
5     Matthew                Hosler    1951-02-03         13              columns                 yes          8           wheat
6      Nicole       Riley Grinstead    1945-03-24         11              columns                 yes          6         yellow1
9     Jackson    Milhouse      Beck    1936-11-11          8     middle name incl                 yes          3      steelblue1
11     Jovana                Tossie    1930-07-05          3     diff middle name                  no          1         bisque1
13     Jovana                Tossie    1930-07-05          2       no middle name                 yes          1         bisque1
16     Zuhair       Beate  Biddison    1939-04-18          7 middle name not abbr                 yes          2         orange2