R: Group Similar Addresses Together
I have a file with 400,000 rows of manually entered addresses that need to be geocoded. There are many different variations of the same address in the file, so burning API calls on the same address multiple times seems wasteful.
To cut this down, I want to reduce these five rows:
Address
1 Main Street, Country A, World
1 Main St, Country A, World
1 Maine St, Country A, World
2 Side Street, Country A, World
2 Side St. Country A, World
down to these two:
Address
1 Main Street, Country A, World
2 Side Street, Country A, World
Using the stringdist package, I can group the 'word' parts of the strings together, but string-matching algorithms don't distinguish between numbers. That means they classify two different house numbers on the same street as the same address.
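For example, a minimal illustration of the issue (the two sample strings are mine, not from the data):

library(stringdist)

# Two different house numbers on the same street...
a <- "1 Main Street, Country A, World"
b <- "2 Main Street, Country A, World"

# ...differ by a single character, so Levenshtein similarity is ~0.97 and
# any reasonable threshold lumps them into one group
stringsim(a, b, method = "lv")
#> [1] 0.9677419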
To get around this, I came up with two approaches. First, I tried using regular expressions to manually separate the numbers and the street names into separate columns and then rejoin them. The problem is that with this many manually entered addresses there seem to be hundreds of different edge cases, and it gets unwieldy.
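A minimal sketch of that first idea, assuming the house number always leads the string (which is exactly the assumption the edge cases keep breaking):

library(stringr)

addresses <- c("1 Main Street, Country A, World",
               "2 Side St. Country A, World")

# Peel the leading house number off the street part; anything that doesn't
# start with digits comes back NA and needs a rule of its own
number <- str_extract(addresses, "^[0-9]+")
street <- str_trim(str_remove(addresses, "^[0-9]+"))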
Using this answer on converting numbers into words, and this one on converting words back into numbers, I have a second approach that handles the edge cases, but it is extremely computationally expensive. Is there a better third way?
library(gsubfn)
library(english)
library(qdap)
library(stringdist)
library(tidyverse)
similarGroups <- function(x, thresh = 0.8, method = "lv"){
  grp <- integer(length(x))
  Address <- x
  x <- tolower(x)
  for(i in seq_along(Address)){
    if(!is.na(Address[i])){
      # Similarity of address i to every address not yet assigned a group
      sim <- stringdist::stringsim(x[i], x, method = method)
      k <- which(sim > thresh & !is.na(Address))
      grp[k] <- i
      # Mark grouped addresses as NA so they are not matched again
      is.na(Address) <- k
    }
  }
  grp
}
df <- data.frame(Address = c("1 Main Street, Country A, World",
                             "1 Main St, Country A, World",
                             "1 Maine St, Country A, World",
                             "2 Side Street, Country A, World",
                             "2 Side St. Country A, World"))
df1 <- df %>%
  # Convert numbers into words
  mutate(Address = replace_number(Address),
         # Group similar addresses together
         Address = Address[similarGroups(Address, thresh = 0.8, method = "lv")],
         # Convert words back into numbers
         Address = gsubfn("\\w+", setNames(as.list(1:1000), as.english(1:1000)), Address)
  ) %>%
  # Remove the duplicates
  unique()
stringdist::stringsimmatrix lets you compare the similarity between all pairs of strings:
library(dplyr)
library(stringr)
df <- data.frame(Address = c("1 Main Street, Country A, World",
                             "1 Main St, Country A, World",
                             "3 Main St, Country A, World",
                             "2 Side Street, Country A, World",
                             "2 Side St. PO 5678 Country A, World"))
stringdist::stringsimmatrix(df$Address)
          1         2         3         4         5
1 1.0000000 0.8709677 0.8387097 0.8387097 0.5161290
2 0.8518519 1.0000000 0.9629630 0.6666667 0.4444444
3 0.8148148 0.9629630 1.0000000 0.6666667 0.4444444
4 0.8387097 0.7096774 0.7096774 1.0000000 0.6774194
5 0.5833333 0.5833333 0.5833333 0.7222222 1.0000000
As you pointed out, in the example above rows 2 and 3 are very similar by this criterion (96%), even though their house numbers differ.
You can add another criterion by extracting the numbers from the strings and comparing their similarity:
# Extract the numbers from each address
nums <- df %>% rowwise() %>% mutate(numlist = str_extract_all(Address, "\\(?[0-9]+\\)?"))

# Create all pairs of number vectors
numpairs <- expand.grid(nums$numlist, nums$numlist)

# Calculate Jaccard similarity between the number sets
numsim <- numpairs %>% rowwise() %>% mutate(dist = length(intersect(Var1, Var2)) / length(union(Var1, Var2)))

# Return the similarity matrix
matrix(numsim$dist, nrow(df))
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    0  0.0  0.0
[2,]    1    1    0  0.0  0.0
[3,]    0    0    1  0.0  0.0
[4,]    0    0    0  1.0  0.5
[5,]    0    0    0  0.5  1.0
By this new criterion, rows 2 and 3 are clearly different.
You can combine the two criteria to decide whether addresses are similar enough, for example:
matrix(numsim$dist,nrow(df)) * stringdist::stringsimmatrix(df$Address)
          1         2 3         4         5
1 1.0000000 0.8709677 0 0.0000000 0.0000000
2 0.8518519 1.0000000 0 0.0000000 0.0000000
3 0.0000000 0.0000000 1 0.0000000 0.0000000
4 0.0000000 0.0000000 0 1.0000000 0.3387097
5 0.0000000 0.0000000 0 0.3611111 1.0000000
To handle hundreds of thousands of addresses, expand.grid won't cope with the whole dataset at once, but you can split and/or parallelize by country to avoid an infeasible full Cartesian product.
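A sketch of that blocking idea, assuming a Country column exists to split on (the column and sample data here are made up):

library(dplyr)
library(purrr)

df_blocked <- tibble::tibble(
  Address = c("1 Main Street, Country A, World",
              "1 Main St, Country A, World",
              "9 Other Road, Country B, World"),
  Country = c("A", "A", "B")
)

# One small similarity matrix per country instead of one 400k x 400k matrix;
# swap map() for furrr::future_map() to process the blocks in parallel
sims <- df_blocked %>%
  group_split(Country) %>%
  map(~ stringdist::stringsimmatrix(.x$Address))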
You might want to look at OpenRefine, or the refinr package for R, which is much less visual but still good. It has two functions, key_collision_merge and n_gram_merge, which take several arguments. If you have a good dictionary of addresses, you can pass that to key_collision_merge.
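For instance, a sketch of the dictionary route (the dictionary entries here are invented):

library(refinr)

# Canonical addresses; variants whose normalized keys collide with a
# dictionary entry are replaced by that entry
dict <- c("1 Main Street, Country A, World",
          "2 Side Street, Country A, World")

messy <- c("1 main street, country a, world",
           "2 Side Street Country A World")

key_collision_merge(messy, dict = dict)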
It's also a good idea to note the abbreviations you see often (St., Blvd., Rd., etc.) and replace all of them. There are good tables of these abbreviations, e.g. https://www.pb.com/docs/US/pdf/SIS/Mail-Services/USPS-Suffix-Abbreviations.pdf.
Then:
library(refinr)
library(dplyr)
library(stringr)

df <- tibble(Address = c("1 Main Street, Country A, World",
                         "1 Main St, Country A, World",
                         "1 Maine St, Country A, World",
                         "2 Side Street, Country A, World",
                         "2 Side St. Country A, World",
                         "3 Side Rd. Country A, World",
                         "3 Side Road Country B World"))

df2 <- df %>%
  # Expand the common suffix abbreviations before merging
  mutate(address_fix = str_replace_all(Address, "St\\.|St\\,|St\\s", "Street"),
         address_fix = str_replace_all(address_fix, "Rd\\.|Rd\\,|Rd\\s", "Road")) %>%
  mutate(address_merge = n_gram_merge(address_fix, numgram = 1))

df2$address_merge
df2$address_merge
[1] "1 Main Street Country A, World"
[2] "1 Main Street Country A, World"
[3] "1 Main Street Country A, World"
[4] "2 Side Street Country A, World"
[5] "2 Side Street Country A, World"
[6] "3 Side Road Country A, World"
[7] "3 Side Road Country B World"