Match two DFs based on first, middle, last name & date of birth (account for data flaws)
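I have a rather simple question: I want to see which persons in DF1 are also contained in DF2. I want to match them based on their
- first name,
- middle name,
- last name,
- date of birth.
I want to keep only those rows of DF1 and DF2 that match correctly.
DF1 looks like this
(EDIT: "XXX" changed to "Joe")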
DF1 <- data.frame(row_ID = 1:13,
                  first_name = c("Jovana", "Jovana", "Jovana", "Joe", "Jovana", "Jovana", "Zuhair", "Jackson", "James", "Alexandria", "Nicole", "Carl", "Matthew"),
                  middle_name = c("Cole", "", "Joe", "Cole", "Cole", "Cole", "Beate", "Milhouse", "", "Macy", "Riley", "", ""),
                  last_name = c("Tossie", "Tossie", "Tossie", "Tossie", "Tossie", "Joe", "Biddison", "Beck", "Baker", "Maya", "Grinstead", "Domenico", "Hosler"),
                  date_of_birth = as.Date(c("1930-07-05", "1930-07-05", "1930-07-05", "1930-07-05", "2000-01-01", "1930-07-05", "1939-04-18", "1936-11-11", "1933-02-18", "1942-10-18", "1945-03-24", "1948-01-25", "1951-02-03")),
                  var_difference = c("none", "no middle name", "diff middle name", "first name", "date of birth", "last name", "middle name not abbr", "middle name incl", "no title", "middle name column", "columns", "columns", "columns"),
                  var_should_be_found = c("yes", "yes", "no", "no", "no", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes"))
DF2 looks like this:
(EDIT: Zuhair Biddison's date of birth changed from 1933-02-18 to 1939-04-18)
DF2 <- data.frame(row_ID = 1:20,
                  first_name = c("Jovana","Zuhair","Jackson","Dr. James","Alexandria Macy","Nicole Riley Grinstead","","","Isaiah","Wyatt","Rayyana","Dhaahir","Lauren",
                                 "Tony","Aziza","Cody","Paige","Jasmine","Kawkab","Pedro"),
                  middle_name = c("Cole","B.","", "","","","","","Kyrie","","Zachary", "","Tyler","", "Brian",
                                  "","Amy", "","Robert",""),
                  last_name = c("Tossie","Biddison","Beck","Baker","Maya","", "Carl Domenico","Hosler, Matthew","Bishop","Ericson","Leptich","Franks","Pummer","Neves","Ferguson","Jennings",
                                "Phillips","Wyatt","Caisse","Laplante"),
                  date_of_birth = as.Date(c("1930-07-05", "1939-04-18", "1936-11-11", "1939-04-18",
                                            "1942-10-18", "1945-03-24", "1948-01-25", "1951-02-03",
                                            "1954-05-27", "1957-08-05", "1960-08-01", "1963-11-26",
                                            "1966-05-25", "1969-11-19", "1972-01-28", "1975-06-17",
                                            "1978-07-24", "1981-07-11", "1984-10-28", "1987-09-14")),
                  var_other = sample(colours(), 20)
)
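DF2 has a lot of flaws.
Sometimes:
- the middle name is abbreviated
- there is no middle name
- a title is included in the first_name column
- the middle name appears in the first_name column
- first and last name appear together in the last_name column (order: first name last name)
- first and last name appear together in the last_name column (order: last name, first name)
As mentioned, in the end I want to keep only the rows of persons who appear in both DF1 and DF2, discard the rest, and combine the columns of DF1 and DF2.
First of all, is there a quick and convenient function for this? (The problem seems simple, but I did not find one.)
If not, here is what I did. It works, but it is too slow for me. For DF1 (ca. 74,000 obs) and one of several DF2s (more than 100,000 obs), it takes hours.
Any help is much appreciated!
My approach:
1. Combine all names (first, middle, last) so that at least 2 of them can be matched later.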
DF1$all_names <- paste(DF1$first_name,
                       DF1$middle_name,
                       DF1$last_name,
                       sep = " ")
DF2$all_names <- paste(DF2$first_name,
                       DF2$middle_name,
                       DF2$last_name,
                       sep = " ")
2. Look for matching birthdays first (first a log algorithm, then a tree)
##########################
# FUNCTION: BD MATCH log #
##########################
# Binary search for one row with the given birthday.
# NOTE: only valid if the_data is sorted by date_of_birth (ascending).
BD_MATCH <- function(the_data, birthday){
  not_precise_date <- T
  not_found <- T
  bd_found <- F
  while(not_precise_date & !bd_found & nrow(the_data) > 1){
    # check half
    half_of_df <- ceiling(nrow(the_data)/2)
    # is bd at half?
    bd_found <- the_data[half_of_df, "date_of_birth"] == birthday
    if(bd_found){bd_row_id <- the_data[half_of_df, "row_ID"]; break} # else{bd_row_id <- NULL}
    # is the bd above or below
    in_upper_half <- the_data[half_of_df, "date_of_birth"] >= birthday
    # subset accordingly
    if(in_upper_half){the_data <- the_data[1:half_of_df, ]
    } else{the_data <- the_data[(half_of_df+1):nrow(the_data), ]}
  }
  # the loop stops when one row is left; that row may not have been checked yet
  if(!bd_found && nrow(the_data) == 1 && the_data[1, "date_of_birth"] == birthday){
    bd_found <- T
    bd_row_id <- the_data[1, "row_ID"]
  }
  if(bd_found){return(bd_row_id)} else{return(NA)}
}
###########################
# FUNCTION: BD MATCH tree #
###########################
# search above and below for duplicate bds
TREE_FUN <- function(the_bd_vec, the_row){
  birthday <- the_bd_vec[the_row]
  # search above
  i <- the_row
  bd_criterion <- T
  while(bd_criterion & i>1){
    i <- i-1
    bd_criterion <- the_bd_vec[i] == birthday
  }
  begin <- ifelse(bd_criterion, 1, i+1)
  # search below
  i <- the_row
  bd_criterion <- T
  while(bd_criterion & i <= length(the_bd_vec)){
    i <- i+1
    bd_criterion <- the_bd_vec[i] == birthday
  }
  if(is.na(bd_criterion)|bd_criterion == F){
    end <- i-1
  } else{
    end <- i
  }
  return(begin:end)
}
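For comparison with the two search functions above, here is a minimal sketch of a lookup that groups row positions by birthday once and then retrieves them by name. It does not need the data to be sorted. The toy data frame and the names `bd_index` and `rows_with_birthday` are made up for illustration and are not part of the original code.

```r
# Toy stand-in for DF2 (hypothetical data, deliberately unsorted)
toy <- data.frame(
  row_ID = 1:5,
  date_of_birth = as.Date(c("1930-07-05", "1939-04-18", "1936-11-11",
                            "1939-04-18", "1942-10-18"))
)

# Build the index once: birthday (as character) -> row positions
bd_index <- split(seq_len(nrow(toy)), format(toy$date_of_birth))

# Return all row positions for one birthday, or integer(0) if it is absent
rows_with_birthday <- function(index, birthday) {
  hit <- index[[format(birthday)]]
  if (is.null(hit)) integer(0) else hit
}

hits <- rows_with_birthday(bd_index, as.Date("1939-04-18"))  # rows 2 and 4
none <- rows_with_birthday(bd_index, as.Date("2000-01-01"))  # no rows
```

Building the index costs one pass over the data, after which each lookup is a single list access instead of repeated subsetting of the data frame.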
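3. Check whether at least 2 names match
(Among other things, this also matches persons whose last name differs but whose first name, middle name and birthday match. That is incorrect, but very rare.)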
##########
# SEARCH #
##########
library(dplyr) # for bind_rows()
res_list <- list()
for(j in 1:nrow(DF1)){
  birthday <- DF1$date_of_birth[j]
  DF1_name <- strsplit(DF1$all_names[j], split = " ")
  # SEARCH BIRTHDAY
  bd_row_id <- BD_MATCH(DF2, birthday)
  # SEARCH NAME
  if(is.na(bd_row_id)){
    res_list[[j]] <- NA
  } else{
    the_row <- which(DF2$row_ID == bd_row_id)
    the_bd_vec <- DF2$date_of_birth
    begin_end <- TREE_FUN(the_bd_vec, the_row)
    BD_subset <- DF2[begin_end, ]
    ##############
    # NAME CHECK #
    ##############
    DF2_name <- strsplit(BD_subset$all_names, split = " ")
    nest <- list()
    for(k in seq(DF2_name)){
      if(sum(DF2_name[[k]] %in% DF1_name[[1]]) >= 2) {
        # keep only row k (indexing with an accumulated vector duplicated earlier hits)
        nest[[k]] <- BD_subset[k, ]
      } else {
        nest[[k]] <- NA
      }
    }
    if(sum(is.na(nest)) == length(nest)){
      res_list[[j]] <- NA
    } else{
      res_list[[j]] <- bind_rows(nest[!is.na(nest)])
    }
  }
  print(j)
}
found_DF1 <- DF1[which(!is.na(res_list)), ]
found_DF2 <- res_list[!is.na(res_list)]
for(i in seq(found_DF2)){
  found_DF2[[i]] <- cbind(found_DF2[[i]], found_DF1[i , ])
}
found_DF2 <- bind_rows(found_DF2)
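The name check from step 3 can be isolated as a small helper, sketched below. One assumption differs from the loop above: it splits on runs of whitespace (`"\\s+"`) rather than on a single space, because `paste()` with an empty middle name produces a double space and splitting on `" "` then yields an empty token that can count as a match. The name `shares_tokens` and the `min_shared` argument are hypothetical.

```r
# TRUE if at least `min_shared` whitespace-separated tokens of `a` occur in `b`
shares_tokens <- function(a, b, min_shared = 2) {
  tok_a <- strsplit(trimws(a), "\\s+")[[1]]
  tok_b <- strsplit(trimws(b), "\\s+")[[1]]
  sum(tok_a %in% tok_b) >= min_shared
}

same_person <- shares_tokens("Jovana Cole Tossie", "Jovana  Tossie")   # first + last shared
two_of_three <- shares_tokens("Jovana Cole Tossie", "Joe Cole Tossie") # middle + last shared
one_only <- shares_tokens("Jovana Cole Tossie", "Jovana Joe Joe")      # only first shared
```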
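Avoid loops for both the cleanup and the computation. Instead, consider cleaning both data frames with vectorized operations so that the names are properly normalized and each name column holds a single identifier. Then run an rbind of two merge calls: first on all three names, then on first and last name only. Finally, run unique() to de-duplicate the rows.
within (cleanup with strsplit and ifelse)
Note: the solution below works on the posted data and may need to be extended for other data issues.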
DF1_clean <- within(
  DF1, {
    first_name <- gsub("XXX", "", first_name)
    middle_name <- gsub("XXX", "", middle_name)
    last_name <- gsub("XXX", "", last_name)
  }
)
DF2_clean <- within(
  DF2, {
    # FIRST NAME CLEANUP
    # strip a leading title; the dot is escaped so names such as "Drew" stay intact
    first_temp <- trimws(gsub("^\\s*(Dr|Mrs|Mr|Ms)\\.\\s*", "", first_name, perl = TRUE))
    first_name_ <- trimws(sapply(strsplit(first_temp, " "), `[`, 1))
    middle_name_ <- trimws(sapply(strsplit(first_temp, " "), `[`, 2))
    last_name_ <- trimws(sapply(strsplit(first_temp, " "), `[`, 3))
    first_name <- ifelse(is.na(first_name_), first_name, first_name_)
    middle_name <- ifelse(is.na(middle_name_), middle_name, middle_name_)
    last_name <- ifelse(is.na(last_name_), last_name, last_name_)
    # LAST NAME CLEANUP
    # strip suffixes only as whole words so names such as "Srinivasan" stay intact
    last_temp <- trimws(gsub("\\b(Jr|Sr|III)\\b\\.?", "", last_name, perl = TRUE))
    first_name_ <- ifelse(
      grepl(",", last_temp),
      sapply(strsplit(last_temp, ","), `[`, 2),
      sapply(strsplit(last_temp, " "), `[`, 1)
    )
    last_name_ <- ifelse(
      grepl(",", last_temp),
      sapply(strsplit(last_temp, ","), `[`, 1),
      sapply(strsplit(last_temp, " "), `[`, 2)
    )
    first_temp <- trimws(first_name)
    first_name <- trimws(ifelse(first_temp=="", first_name_, first_name))
    last_name <- trimws(ifelse(first_temp=="", last_name_, last_name))
    # REMOVE HELPER TEMP COLUMNS
    rm(first_temp, last_temp, first_name_, middle_name_, last_name_)
  }
)
merge + rbind
final_df <- rbind.data.frame(
  merge(
    DF1_clean, DF2_clean,
    by=c("first_name", "middle_name", "last_name", "date_of_birth"),
    suffixes=c("_DF1", "_DF2")
  ),
  merge(
    DF1_clean, transform(DF2_clean, middle_name=NULL),
    by=c("first_name", "last_name", "date_of_birth"),
    suffixes=c("_DF1", "_DF2")
  )
) |> unique()
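To see why the unique() step matters, here is a toy sketch of the same two-pass idea on made-up data: a row that matches on all three names is returned by both merge passes, so the stacked result contains it twice until it is de-duplicated. The data frames `a` and `b` and their columns `id_a` and `id_b` are hypothetical.

```r
# Hypothetical toy data: Ann matches on all three names, Bob only on first + last
a <- data.frame(first_name = c("Ann", "Bob"), middle_name = c("Lee", ""),
                last_name = c("Ray", "Fox"), id_a = 1:2)
b <- data.frame(first_name = c("Ann", "Bob"), middle_name = c("Lee", "K."),
                last_name = c("Ray", "Fox"), id_b = 1:2)

# Pass 1: strict match on all three names
strict <- merge(a, b, by = c("first_name", "middle_name", "last_name"))

# Pass 2: loose match that ignores the middle name of b
loose <- merge(a, transform(b, middle_name = NULL),
               by = c("first_name", "last_name"))

# rbind() on data frames aligns columns by name, so the differing column order is fine
stacked <- rbind(strict, loose)   # Ann appears twice here
both <- unique(stacked)           # one row per matched person
```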
Output
Note: the inverted dates of birth of Zuhair Biddison and James Baker were fixed in the OP's input data so that the two data frames match.
final_df
   first_name middle_name last_name date_of_birth row_ID_DF1       var_difference var_should_be_found row_ID_DF2       var_other
 1 Alexandria        Macy      Maya    1942-10-18         10   middle name column                 yes          5          gray58
 2       Carl              Domenico    1948-01-25         12              columns                 yes          7 mediumvioletred
 3      James                 Baker    1933-02-18          9             no title                 yes          4          grey94
 4     Jovana        Cole    Tossie    1930-07-05          1                 none                 yes          1         bisque1
 5    Matthew                Hosler    1951-02-03         13              columns                 yes          8           wheat
 6     Nicole       Riley Grinstead    1945-03-24         11              columns                 yes          6         yellow1
 9    Jackson    Milhouse      Beck    1936-11-11          8     middle name incl                 yes          3      steelblue1
11     Jovana                Tossie    1930-07-05          3     diff middle name                  no          1         bisque1
13     Jovana                Tossie    1930-07-05          2       no middle name                 yes          1         bisque1
16     Zuhair       Beate  Biddison    1939-04-18          7 middle name not abbr                 yes          2         orange2
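The second merge pass ignores middle names entirely, which is why the row flagged "diff middle name" (row 11, var_should_be_found = no) appears in the output above. One possible refinement, which is not part of the answer, is to filter the loose matches with a middle-name compatibility check. The sketch below treats two middle names as compatible if either is empty or if the shorter one, with dots removed, is a prefix of the longer one. The function name `middle_compatible` and the prefix rule are assumptions.

```r
# TRUE where two middle names could belong to the same person:
# one is missing, or the shorter (e.g. an initial "B.") is a prefix of the longer
middle_compatible <- function(m1, m2) {
  m1 <- tolower(gsub("\\.", "", trimws(m1)))
  m2 <- tolower(gsub("\\.", "", trimws(m2)))
  short <- pmin(nchar(m1), nchar(m2))
  m1 == "" | m2 == "" | substr(m1, 1, short) == substr(m2, 1, short)
}

res <- middle_compatible(c("Beate", "Joe", "", "Cole"),
                         c("B.",    "Cole", "Cole", "Cole"))
```

Applying such a check would require keeping both middle-name columns in the second merge (for example by renaming the DF2 column instead of dropping it), so that the two values can be compared row by row.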