Fuzzy outer join/merge in R
I have two datasets and want to fuzzy-join them.
Here are the two datasets.
library(data.table)
# data1
dt1 <- fread("NAME State type
ABERCOMBIE TOWNSHIP ND TS
ABERDEEN TOWNSHIP NJ TS
ABERDEEN TOWNSHIP SD TS
ABBOTSFORD CITY WI CI
ABERDEEN CITY WA CI
ADA TOWNSHIP MI TS
ADAMS IL TS", header = T)
# data2
dt2 <- fread("NAME State type
ABERDEEN TWP N J NJ TS
ABERDEEN WASH WA CI
ABBOTSFORD WIS WI CI
ADA TWP MICH MI TS
ADA OHIO OH CI
ADAMS MASS MA CI
ADAMSVILLE ALA AL CI", header = T)
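The two datasets share the same values in State and type; the NAME column, however, is not identical between them, only similar.
I could take the first 3 or 4 characters of NAME in each dataset and merge on that substring, but with a large number of observations the share of correct matches does not seem high.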
dt1$NameSubstr <- substr(dt1$NAME, 1, 4)
dt2$NameSubstr <- substr(dt2$NAME, 1, 4)
merge(dt1, dt2, by = c("NameSubstr", "State", "type"), all = T)
This approach does not work well.
I looked at the fuzzyjoin package, but I am not sure whether I am using it correctly.
library(fuzzyjoin)
fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`!=`, `==`, `==`))
# Results
NAME.x State.x type.x NAME.y State.y type.y
1: ABERDEEN TOWNSHIP NJ TS ABERDEEN TWP N J NJ TS
2: ABBOTSFORD CITY WI CI ABBOTSFORD WIS WI CI
3: ABERDEEN CITY WA CI ABERDEEN WASH WA CI
4: ADA TOWNSHIP MI TS ADA TWP MICH MI TS
5: ABERCOMBIE TOWNSHIP ND TS <NA> <NA> <NA>
6: ABERDEEN TOWNSHIP SD TS <NA> <NA> <NA>
7: ADAMS IL TS <NA> <NA> <NA>
8: <NA> <NA> <NA> ADA OHIO OH CI
9: <NA> <NA> <NA> ADAMS MASS MA CI
10: <NA> <NA> <NA> ADAMSVILLE ALA AL CI
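The result above is correct for this example. However, if any row has exactly the same NAME in both datasets, the answer is no longer correct.
To show this, I added one new observation to both datasets.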
dt1 <- fread("NAME State type
ABERCOMBIE TOWNSHIP ND TS
ABERDEEN TOWNSHIP NJ TS
ABERDEEN TOWNSHIP SD TS
ABBOTSFORD CITY WI CI
ABERDEEN CITY WA CI
ADA TOWNSHIP MI TS
ADAMS IL TS
THE SAME AA BB
", header = T)
dt2 <- fread("NAME State type
ABERDEEN TWP N J NJ TS
ABERDEEN WASH WA CI
ABBOTSFORD WIS WI CI
ADA TWP MICH MI TS
ADA OHIO OH CI
ADAMS MASS MA CI
ADAMSVILLE ALA AL CI
THE SAME AA BB
", header = T)
fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`!=`, `==`, `==`))
NAME.x State.x type.x NAME.y State.y type.y
1: ABERDEEN TOWNSHIP NJ TS ABERDEEN TWP N J NJ TS
2: ABBOTSFORD CITY WI CI ABBOTSFORD WIS WI CI
3: ABERDEEN CITY WA CI ABERDEEN WASH WA CI
4: ADA TOWNSHIP MI TS ADA TWP MICH MI TS
5: ABERCOMBIE TOWNSHIP ND TS <NA> <NA> <NA>
6: ABERDEEN TOWNSHIP SD TS <NA> <NA> <NA>
7: ADAMS IL TS <NA> <NA> <NA>
8: THE SAME AA BB <NA> <NA> <NA>
9: <NA> <NA> <NA> ADA OHIO OH CI
10: <NA> <NA> <NA> ADAMS MASS MA CI
11: <NA> <NA> <NA> ADAMSVILLE ALA AL CI
12: <NA> <NA> <NA> THE SAME AA BB
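This result is incorrect: the THE SAME row is identical in both datasets, yet it appears twice as unmatched (rows 8 and 12) instead of being joined.
Any suggestions?
It seems I cannot use fuzzy_full_join.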
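This happens because you asked fuzzy_full_join to return rows whose names do not match (using !=) while State and type do match (using == and ==). A pair in which all three columns match therefore fails the first condition and is not joined.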
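To see why the identical row drops out, the three predicates can be evaluated by hand on a single candidate pair. This is a minimal base R sketch; the helper `pair_matches` is hypothetical and only mimics the idea that a pair of rows must satisfy every function in `match_fun`. It is not fuzzyjoin's actual implementation.

```r
# Hypothetical helper: a pair of rows counts as a match only if
# every predicate returns TRUE for its corresponding column.
pair_matches <- function(x, y, funs) {
  all(mapply(function(f, a, b) f(a, b), funs, x, y))
}

funs_neq <- list(`!=`, `==`, `==`)  # NAME must differ, State and type must agree
funs_eq  <- list(`==`, `==`, `==`)  # all three columns must agree

# Values in the order NAME, State, type
same_x <- c("THESAME", "AA", "BB")
same_y <- c("THESAME", "AA", "BB")
near_x <- c("ADATOWNSHIP", "MI", "TS")
near_y <- c("ADATWPMICH", "MI", "TS")

res_same_neq <- pair_matches(same_x, same_y, funs_neq)  # identical names fail `!=`
res_same_eq  <- pair_matches(same_x, same_y, funs_eq)
res_near_neq <- pair_matches(near_x, near_y, funs_neq)
res_near_eq  <- pair_matches(near_x, near_y, funs_eq)   # different names fail `==`

print(c(res_same_neq, res_same_eq, res_near_neq, res_near_eq))
```

Each predicate list accepts exactly the pairs the other one rejects, which is why neither run alone returns every pair that agrees on State and type.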
You can run it twice, once with each of these:
match_fun = list(`!=`, `==`, `==`))
match_fun = list(`==`, `==`, `==`))
library(data.table); library(fuzzyjoin)
#> Warning: package 'data.table' was built under R version 3.5.2
dt1 <- fread("NAME State type
ABERCOMBIETOWNSHIP ND TS
ABERDEENTOWNSHIP NJ TS
ABERDEENTOWNSHIP SD TS
ABBOTSFORDCITY WI CI
ABERDEENCITY WA CI
ADATOWNSHIP MI TS
ADAMS IL TS
THESAME AA BB
", header = T)
dt2 <- fread("NAME State type
ABERDEENTWPNJ NJ TS
ABERDEENWASH WA CI
ABBOTSFORDWIS WI CI
ADATWPMICH MI TS
ADAOHIO OH CI
ADAMSMASS MA CI
ADAMSVILLEALA AL CI
THESAME AA BB
", header = T)
fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`!=`, `==`, `==`))
#> NAME.x State.x type.x NAME.y State.y type.y
#> 1: ABERDEENTOWNSHIP NJ TS ABERDEENTWPNJ NJ TS
#> 2: ABBOTSFORDCITY WI CI ABBOTSFORDWIS WI CI
#> 3: ABERDEENCITY WA CI ABERDEENWASH WA CI
#> 4: ADATOWNSHIP MI TS ADATWPMICH MI TS
#> 5: ABERCOMBIETOWNSHIP ND TS <NA> <NA> <NA>
#> 6: ABERDEENTOWNSHIP SD TS <NA> <NA> <NA>
#> 7: ADAMS IL TS <NA> <NA> <NA>
#> 8: THESAME AA BB <NA> <NA> <NA>
#> 9: <NA> <NA> <NA> ADAOHIO OH CI
#> 10: <NA> <NA> <NA> ADAMSMASS MA CI
#> 11: <NA> <NA> <NA> ADAMSVILLEALA AL CI
#> 12: <NA> <NA> <NA> THESAME AA BB
fuzzy_full_join(dt1, dt2, by = c("NAME" = "NAME", "State" = "State", "type" = "type"), match_fun = list(`==`, `==`, `==`))
#> NAME.x State.x type.x NAME.y State.y type.y
#> 1: THESAME AA BB THESAME AA BB
#> 2: ABERCOMBIETOWNSHIP ND TS <NA> <NA> <NA>
#> 3: ABERDEENTOWNSHIP NJ TS <NA> <NA> <NA>
#> 4: ABERDEENTOWNSHIP SD TS <NA> <NA> <NA>
#> 5: ABBOTSFORDCITY WI CI <NA> <NA> <NA>
#> 6: ABERDEENCITY WA CI <NA> <NA> <NA>
#> 7: ADATOWNSHIP MI TS <NA> <NA> <NA>
#> 8: ADAMS IL TS <NA> <NA> <NA>
#> 9: <NA> <NA> <NA> ABERDEENTWPNJ NJ TS
#> 10: <NA> <NA> <NA> ABERDEENWASH WA CI
#> 11: <NA> <NA> <NA> ABBOTSFORDWIS WI CI
#> 12: <NA> <NA> <NA> ADATWPMICH MI TS
#> 13: <NA> <NA> <NA> ADAOHIO OH CI
#> 14: <NA> <NA> <NA> ADAMSMASS MA CI
#> 15: <NA> <NA> <NA> ADAMSVILLEALA AL CI
Created on 2019-03-17 by the reprex package (v0.2.1)
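Since every pair of names satisfies either `!=` or `==`, the matched rows from the two runs together are simply all pairs that agree on State and type. Under that reading, a single full outer join on those two columns might give the combined result in one pass. The sketch below uses base R `merge` on plain data frames rather than fuzzyjoin, and the `name_dist` column (edit distance from `utils::adist`) is an added assumption, not part of the original answer, for judging how close the joined names are.

```r
# Same example data as above, built as plain data frames
dt1 <- data.frame(
  NAME  = c("ABERCOMBIETOWNSHIP", "ABERDEENTOWNSHIP", "ABERDEENTOWNSHIP",
            "ABBOTSFORDCITY", "ABERDEENCITY", "ADATOWNSHIP", "ADAMS", "THESAME"),
  State = c("ND", "NJ", "SD", "WI", "WA", "MI", "IL", "AA"),
  type  = c("TS", "TS", "TS", "CI", "CI", "TS", "TS", "BB"),
  stringsAsFactors = FALSE
)
dt2 <- data.frame(
  NAME  = c("ABERDEENTWPNJ", "ABERDEENWASH", "ABBOTSFORDWIS", "ADATWPMICH",
            "ADAOHIO", "ADAMSMASS", "ADAMSVILLEALA", "THESAME"),
  State = c("NJ", "WA", "WI", "MI", "OH", "MA", "AL", "AA"),
  type  = c("TS", "CI", "CI", "TS", "CI", "CI", "CI", "BB"),
  stringsAsFactors = FALSE
)

# Full outer join on State and type; NAME is kept from both sides
joined <- merge(dt1, dt2, by = c("State", "type"), all = TRUE,
                suffixes = c(".x", ".y"))

# Assumption: edit distance between the two names as a rough similarity check
# (NA where one side has no partner)
joined$name_dist <- mapply(function(a, b) {
  if (is.na(a) || is.na(b)) NA_real_ else as.numeric(utils::adist(a, b))
}, joined$NAME.x, joined$NAME.y, USE.NAMES = FALSE)

print(joined)
```

If several rows share the same State and type, this join returns every combination of them, so a threshold on `name_dist` (or a similar score) would still be needed to keep only plausible name pairs.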