Fuzzy matching (not row-to-row) in R
I need to do fuzzy matching with the following pattern: Table A contains address strings (which I have already pre-formatted, e.g. removed spaces), and I have to verify that they are correct. Table B contains every possible address (in the same format as Table A). So I don't want to just match row 1 of Table A against row 1 of Table B and so on; I want to compare each row of Table A against every row of Table B and find the closest match for each one.
From what I've checked, adist
and agrep
work row-by-row by default, and when I tried to use them I immediately ran out of memory. Is this even doable in R with only 8 GB of RAM?
I found sample code for a similar problem and based my solution on it, but performance is still an issue. It works fine on a sample of 600 rows from Table A and 2,000 rows from Table B, but the full data sets are 600,000 and 900,000 rows respectively.
adresy_odl <- adist(TableA$Adres, TableB$Adres, partial = FALSE, ignore.case = TRUE)
min_odl <- apply(adresy_odl, 1, min)

match.s1.s2 <- NULL
for (i in 1:nrow(adresy_odl)) {
  s2.i <- match(min_odl[i], adresy_odl[i, ])  # index of the closest match in TableB
  s1.i <- i
  match.s1.s2 <- rbind(
    data.frame(s2.i = s2.i, s1.i = s1.i,
               s2name = TableB[s2.i, ]$Adres,
               s1name = TableA[s1.i, ]$Adres,
               adist = min_odl[i]),
    match.s1.s2)
}
The memory error already occurs on the very first line (the adist call):
Error: cannot allocate vector of size 1897.0 Gb
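One way to stay within memory (a sketch under my own assumptions, not code from the answer below; `match_in_chunks` and `chunk_size` are illustrative names) is to process Table A in chunks, so adist only ever allocates a chunk_size × nrow(TableB) matrix instead of the full 600,000 × 900,000 one:

```r
# Find the closest match in b for every element of a, processing a in
# chunks so the distance matrix stays small. chunk_size is illustrative;
# tune it to your RAM.
match_in_chunks <- function(a, b, chunk_size = 1000) {
  idx <- split(seq_along(a), ceiling(seq_along(a) / chunk_size))
  out <- lapply(idx, function(i) {
    d <- adist(a[i], b, ignore.case = TRUE)  # chunk_size x length(b)
    best <- apply(d, 1, which.min)           # column of the closest match
    data.frame(s1.i = i,
               s2.i = best,
               s1name = a[i],
               s2name = b[best],
               adist  = d[cbind(seq_along(i), best)])
  })
  do.call(rbind, out)
}
```

This reproduces what the loop above computes (which.min picks the first minimum, like match(min, row)), but never materialises the full matrix; total runtime is still quadratic, so it trades memory for time.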
Below is a sample of the data I use (CSV). tableA and tableB look exactly the same; the only difference is that tableB has every possible combination of zipcode, street, and city, while in tableA mostly the zipcodes are wrong or the street names are misspelled.
Table A:
"","Zipcode","Street","City","Adres"
"33854","80-221","Traugutta","Gdańsk","80-221TrauguttaGdańsk"
"157093","80-276","KsBernardaSychty","Gdańsk","80-276KsBernardaSychtyGdańsk"
"200115","80-339","Grunwaldzka","Gdańsk","80-339GrunwaldzkaGdańsk"
"344514","80-318","Wąsowicza","Gdańsk","80-318WąsowiczaGdańsk"
"355415","80-625","Stryjewskiego","Gdańsk","80-625StryjewskiegoGdańsk"
"356414","80-452","Kilińskiego","Gdańsk","80-452KilińskiegoGdańsk"
Table B:
"","Zipcode","Street","City","Adres"
"47204","80-180","11Listopada","Gdańsk","80-18011ListopadaGdańsk"
"47205","80-041","3BrygadySzczerbca","Gdańsk","80-0413BrygadySzczerbcaGdańsk"
"47206","80-802","3Maja","Gdańsk","80-8023MajaGdańsk"
"47207","80-299","Achillesa","Gdańsk","80-299AchillesaGdańsk"
"47208","80-316","AdamaAsnyka","Gdańsk","80-316AdamaAsnykaGdańsk"
"47209","80-405","AdamaMickiewicza","Gdańsk","80-405AdamaMickiewiczaGdańsk"
"47210","80-425","AdamaMickiewicza","Gdańsk","80-425AdamaMickiewiczaGdańsk"
"47211","80-456","AdolfaDygasińskiego","Gdańsk","80-456AdolfaDygasińskiegoGdańsk"
The first few rows of my code's results:
"","s2.i","s1.i","s2name","s1name","adist"
"1",1333,614,"80-152PowstańcówWarszawskichGdańsk","80-158PowstańcówWarszawskichGdańsk",1
"2",257,613,"80-180CzerskaGdańsk","80-180ZEUSAGdańsk",3
"3",1916,612,"80-119WojskiegoGdańsk","80-355BeniowskiegoGdańsk",8
"4",1916,611,"80-119WojskiegoGdańsk","80-180PorębskiegoGdańsk",6
"5",181,610,"80-204BraciŚniadeckichGdańsk","80-210ŚniadeckichGdańsk",7
"6",181,609,"80-204BraciŚniadeckichGdańsk","80-210ŚniadeckichGdańsk",7
"7",21,608,"80-401alGenJózefaHalleraGdańsk","80-401GenJózefaHalleraGdańsk",2
"8",1431,607,"80-264RomanaDmowskiegoGdańsk","80-264DmowskiegoGdańsk",6
"9",1610,606,"80-239StefanaCzarnieckiegoGdańsk","80-239StefanaCzarnieckiegoGdańsk",0
I would try the excellent fuzzyjoin
package by @drob of Stack Overflow:
library(dplyr)

dict_df <- tibble::tribble(
  ~ID, ~Zipcode, ~Street, ~City, ~Adres,
  "33854", "80-221", "Traugutta", "Gdańsk", "80-221TrauguttaGdańsk",
  "157093", "80-276", "KsBernardaSychty", "Gdańsk", "80-276KsBernardaSychtyGdańsk",
  "200115", "80-339", "Grunwaldzka", "Gdańsk", "80-339GrunwaldzkaGdańsk",
  "344514", "80-318", "Wąsowicza", "Gdańsk", "80-318WąsowiczaGdańsk",
  "355415", "80-625", "Stryjewskiego", "Gdańsk", "80-625StryjewskiegoGdańsk",
  "356414", "80-452", "Kilińskiego", "Gdańsk", "80-452KilińskiegoGdańsk") %>%
  select(ID, Adres)

noise_df <- tibble::tribble(
  ~Zipcode, ~Street, ~City, ~Adres,
  "80-221", "Trauguta", "Gdansk", "80-221TraugutaGdansk",
  "80-211", "Traugguta", "Gdansk", "80-211TrauggutaGdansk",
  "80-276", "KsBernardaSychty", "Gdańsk", "80-276KsBernardaSychtyGdańsk",
  "80-267", "KsBernardaSyschty", "Gdańsk", "80-276KsBernardaSyschtyGdańsk",
  "80-339", "Grunwaldzka", "Gdańsk", "80-339GrunwaldzkaGdańsk",
  "80-399", "Grunwaldzka", "dansk", "80-399Grunwaldzkadańsk",
  "80-318", "Wasowicza", "Gdańsk", "80-318WasowiczaGdańsk",
  "80-625", "Stryjewskiego", "Gdańsk", "80-625StryjewskiegoGdańsk",
  "80-625", "Stryewskogo", "Gdansk", "80-625StryewskogoGdansk",
  "80-452", "Kilinskiego", "Gdańsk", "80-452KilinskiegoGdańsk")
library(fuzzyjoin)

noise_df %>%
  # jaccard with max_dist = 0.5; try other distance methods with different
  # max_dist values to reduce memory use
  stringdist_left_join(dict_df, by = "Adres", distance_col = "dist",
                       method = "jaccard", max_dist = 0.5) %>%
  select(Adres.x, ID, Adres.y, dist) %>%
  group_by(Adres.x) %>%
  # keep the best-fitting record per address
  top_n(-1, dist)
The resulting table consists of the original address (Adres.x
), the best match from the dictionary (ID
and Adres.y
), and the string distance.
# A tibble: 10 x 4
# Groups: Adres.x [10]
Adres.x ID Adres.y dist
<chr> <chr> <chr> <dbl>
1 80-221TraugutaGdansk 33854 80-221TrauguttaGdańsk 0.11764706
2 80-211TrauggutaGdansk 33854 80-221TrauguttaGdańsk 0.11764706
3 80-276KsBernardaSychtyGdańsk 157093 80-276KsBernardaSychtyGdańsk 0.00000000
4 80-276KsBernardaSyschtyGdańsk 157093 80-276KsBernardaSychtyGdańsk 0.00000000
5 80-339GrunwaldzkaGdańsk 200115 80-339GrunwaldzkaGdańsk 0.00000000
6 80-399Grunwaldzkadańsk 200115 80-339GrunwaldzkaGdańsk 0.00000000
7 80-318WasowiczaGdańsk 344514 80-318WąsowiczaGdańsk 0.05555556
8 80-625StryjewskiegoGdańsk 355415 80-625StryjewskiegoGdańsk 0.00000000
9 80-625StryewskogoGdansk 355415 80-625StryjewskiegoGdańsk 0.17391304
10 80-452KilinskiegoGdańsk 356414 80-452KilińskiegoGdańsk 0.05263158
I've found that fuzzy matching works best when you convert everything to lowercase ASCII (iconv()
and tolower()
).
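That normalization can be applied to both tables up front; a minimal sketch (`normalize_adres` is an illustrative name, and note that //TRANSLIT transliteration is locale- and platform-dependent):

```r
# Fold addresses to lowercase ASCII before computing string distances.
# //TRANSLIT asks iconv to approximate accented characters (e.g. ń -> n on
# glibc); on other platforms some characters may come back as "?" or NA.
normalize_adres <- function(x) {
  tolower(iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT"))
}

normalize_adres("80-452KilińskiegoGdańsk")
```

Matching on the normalized column and reporting the original one keeps diacritic differences from inflating the distances.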
Update: this should use less memory:
library(purrr)
library(dplyr)

noise_df %>%
  split(.$Adres) %>%
  # jaccard with max_dist = 0.5; try other distance methods with different
  # max_dist values to reduce memory use
  map_df(~ stringdist_left_join(.x, dict_df, by = "Adres", distance_col = "dist",
                                method = "jaccard", max_dist = 0.5,
                                ignore_case = TRUE) %>%
           select(Adres.x, ID, Adres.y, dist) %>%
           group_by(Adres.x) %>%
           # keep the best-fitting record per address
           top_n(-1, dist))
Update 2: with the "lv" distance method you get too many missing values and NAs. In some cases, when no match is found, stringdist_left_join
drops the distance
column you asked for, which is why the rest of the pipeline then fails, first at select
and then at top_n
. To see what is going on, take a small sample of the data, change map_df
to map
, and browse the resulting list.
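That debugging switch might look like this (a sketch with tiny inline stand-ins for noise_df and dict_df, so it runs on its own):

```r
library(dplyr)
library(purrr)
library(fuzzyjoin)

# Tiny illustrative stand-ins for dict_df / noise_df; the second noise
# address is deliberately far from everything in the dictionary.
dict  <- tibble::tibble(ID = "1", Adres = c("80-221TrauguttaGdansk"))
noise <- tibble::tibble(Adres = c("80-221TraugutaGdansk", "zzzzzz"))

# map() instead of map_df(): one data frame per address, not row-bound,
# so elements where the join found no match (and the "dist" column may be
# absent) can be inspected before they break the pipeline.
results <- noise %>%
  split(.$Adres) %>%
  map(~ stringdist_left_join(.x, dict, by = "Adres",
                             distance_col = "dist",
                             method = "jaccard", max_dist = 0.5))

str(results, max.level = 1)
```

Once each element looks right, switching back to map_df restores the original pipeline.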