R:使用plyr对两个数据源的匹配子集进行模糊字符串匹配
R: Using plyr to perform fuzzy string matching between matching subsets of two data sources
假设我有一个县列表,其中存在不同数量的拼写错误或其他问题,这些问题将它们与 2010 FIPS dataset(下面创建 fips
数据框的代码)区分开来,但是拼写错误的县居住地输入正确。这是来自我的完整数据集的 21 个随机观察的 sample
:
tomatch <- structure(list(county = c("Beauregard", "De Soto", "Dekalb", "Webster",
"Saint Joseph", "West Feliciana", "Ketchikan Gateway", "Evangeline",
"Richmond City", "Saint Mary", "Saint Louis City", "Mclean",
"Union", "Bienville", "Covington City", "Martinsville City",
"Claiborne", "King And Queen", "Mclean", "Mcminn", "Prince Georges"
), state = c("LA", "LA", "GA", "LA", "IN", "LA", "AK", "LA", "VA",
"LA", "MO", "KY", "LA", "LA", "VA", "VA", "LA", "VA", "ND", "TN",
"MD")), .Names = c("county", "state"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -21L))
county state
1 Beauregard LA
2 De Soto LA
3 Dekalb GA
4 Webster LA
5 Saint Joseph IN
6 West Feliciana LA
7 Ketchikan Gateway AK
8 Evangeline LA
9 Richmond City VA
10 Saint Mary LA
11 Saint Louis City MO
12 Mclean KY
13 Union LA
14 Bienville LA
15 Covington City VA
16 Martinsville City VA
17 Claiborne LA
18 King And Queen VA
19 Mclean ND
20 Mcminn TN
21 Prince Georges MD
我已经使用 adist
创建了一个模糊字符串匹配算法,该算法将我大约 80% 的县与 fips
中的县名相匹配。然而,有时它会匹配拼写相似但来自不同州的两个县(例如,"Webster, LA" 匹配到 "Webster, GA" 而不是 "Webster Parrish, LA")。
distance <- adist(tomatch$county,
fips$countyname,
partial = TRUE)
min.name <- apply(distance, 1, min)
matchedcounties <- NULL
for(i in 1:nrow(distance)) {
s2.i <- match(min.name[i], distance[i, ])
s1.i <- i
matchedcounties <- rbind(data.frame(s2.i = s2.i,
s1.i = s1.i,
s1name = tomatch[s1.i, ]$county,
s2name = fips[s2.i, ]$countyname,
adist = min.name[i]),
matchedcounties)
}
因此,我想将县的模糊字符串匹配限制为具有匹配状态的正确拼写版本。
我目前的算法制作了一个大矩阵,计算两个源之间的标准 Levenshtein 距离,然后选择距离最小的值。
为了解决我的问题,我想我需要创建一个可以应用于 ddply
的每个 'state' 组的函数,但我对如何实现感到困惑应该表明 ddply
函数中的组值应该匹配另一个数据帧。 dplyr
解决方案或使用任何其他包的解决方案也将不胜感激。
创建 FIPS 数据集的代码:
download.file('http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt',
'./nationalfips.txt')
fips <- read.csv('./nationalfips.txt',
stringsAsFactors = FALSE, colClasses = 'character', header = FALSE)
names(fips) <- c('state', 'statefips', 'countyfips', 'countyname', 'classfips')
# remove 'County' from countyname
fips$countyname <- sub('County', '', fips$countyname, fixed = TRUE)
fips$countyname <- stringr::str_trim(fips$countyname)
没有示例数据,但尝试使用 agrep 而不是 adist 并仅搜索该状态下的名称
sapply(df_tomatch$county, function(x) agrep(x,df_matchby[df_matchby$state==dj_tomatch[x,'state'],'county'],value=TRUE)
您可以在 agrep
中使用 max.distance
参数来改变它们需要匹配的程度。此外,设置 value=TRUE
returns 匹配字符串的值而不是匹配的位置。
这是 dplyr 的一种方法。我首先按州加入 tomatch
data.frame 和 FIPS 名称(仅允许州内匹配):
require(dplyr)
df <- tomatch %>%
left_join(fips, by="state")
接下来,我注意到很多县在 FIPS 数据集中没有 'Saint',而是 'St.'。首先清理它应该会改善获得的结果。
df <- df %>%
mutate(county_clean = gsub("Saint", "St.", county))
然后,将这个data.frame按县分组,用adist计算距离:
df <- df %>%
group_by(county_clean) %>% # Calculate the distance per county
mutate(dist = diag(adist(county_clean, countyname, partial=TRUE))) %>%
arrange(county, dist) # Used this for visual inspection.
请注意,我将结果矩阵的对角线作为 adist returns 一个 n x m 矩阵,其中 n 代表 x 向量,m 代表 y 向量(它计算所有组合)。
或者,您可以添加 agrep 结果:
df <- df %>%
rowwise() %>% # 'group_by' a single row.
mutate(agrep_result = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
ungroup() # Always a good idea to remove 'groups' after you're done.
然后像之前那样过滤,取最小距离:
df <- df %>%
group_by(county_clean) %>% # Causes it to calculate the 'min' per group
filter(dist == min(dist)) %>%
ungroup()
请注意,这可能会导致 tomatch
中的每个输入行返回多行。
或者,一次完成所有操作 运行(一旦我确信它正在执行它应该执行的操作,我通常会将代码更改为这种格式):
df <- tomatch %>%
# Join on all names in the relevant state and clean 'St.'
left_join(fips, by="state") %>%
mutate(county_clean = gsub("Saint", "St.", county)) %>%
# Calculate the distances, per original county name.
group_by(county_clean) %>%
mutate(dist = diag(adist(county_clean, countyname, partial=TRUE))) %>%
# Append the agrepl result
rowwise() %>%
mutate(string_agrep = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
ungroup() %>%
# Only retain minimum distances
group_by(county_clean) %>%
filter(dist == min(dist))
两种情况下的结果:
county county_clean state countyname dist string_agrep
1 Beauregard Beauregard LA Beauregard Parish 0 TRUE
2 De Soto De Soto LA De Soto Parish 0 TRUE
3 Dekalb Dekalb GA DeKalb 1 TRUE
4 Webster Webster LA Webster Parish 0 TRUE
5 Saint Joseph St. Joseph IN St. Joseph 0 TRUE
6 West Feliciana West Feliciana LA West Feliciana Parish 0 TRUE
7 Ketchikan Gateway Ketchikan Gateway AK Ketchikan Gateway Borough 0 TRUE
8 Evangeline Evangeline LA Evangeline Parish 0 TRUE
9 Richmond City Richmond City VA Richmond city 1 TRUE
10 Saint Mary St. Mary LA St. Mary Parish 0 TRUE
11 Saint Louis City St. Louis City MO St. Louis city 1 TRUE
12 Mclean Mclean KY McLean 1 TRUE
13 Union Union LA Union Parish 0 TRUE
14 Bienville Bienville LA Bienville Parish 0 TRUE
15 Covington City Covington City VA Covington city 1 TRUE
16 Martinsville City Martinsville City VA Martinsville city 1 TRUE
17 Claiborne Claiborne LA Claiborne Parish 0 TRUE
18 King And Queen King And Queen VA King and Queen 1 TRUE
19 Mclean Mclean ND McLean 1 TRUE
20 Mcminn Mcminn TN McMinn 1 TRUE
21 Prince Georges Prince Georges MD Prince George's 1 TRU
假设我有一个县列表,其中存在不同数量的拼写错误或其他问题,这些问题将它们与 2010 FIPS dataset(下面创建 fips
数据框的代码)区分开来,但是拼写错误的县居住地输入正确。这是来自我的完整数据集的 21 个随机观察的 sample
:
tomatch <- structure(list(county = c("Beauregard", "De Soto", "Dekalb", "Webster",
"Saint Joseph", "West Feliciana", "Ketchikan Gateway", "Evangeline",
"Richmond City", "Saint Mary", "Saint Louis City", "Mclean",
"Union", "Bienville", "Covington City", "Martinsville City",
"Claiborne", "King And Queen", "Mclean", "Mcminn", "Prince Georges"
), state = c("LA", "LA", "GA", "LA", "IN", "LA", "AK", "LA", "VA",
"LA", "MO", "KY", "LA", "LA", "VA", "VA", "LA", "VA", "ND", "TN",
"MD")), .Names = c("county", "state"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -21L))
county state
1 Beauregard LA
2 De Soto LA
3 Dekalb GA
4 Webster LA
5 Saint Joseph IN
6 West Feliciana LA
7 Ketchikan Gateway AK
8 Evangeline LA
9 Richmond City VA
10 Saint Mary LA
11 Saint Louis City MO
12 Mclean KY
13 Union LA
14 Bienville LA
15 Covington City VA
16 Martinsville City VA
17 Claiborne LA
18 King And Queen VA
19 Mclean ND
20 Mcminn TN
21 Prince Georges MD
我已经使用 adist
创建了一个模糊字符串匹配算法,该算法将我大约 80% 的县与 fips
中的县名相匹配。然而,有时它会匹配拼写相似但来自不同州的两个县(例如,"Webster, LA" 匹配到 "Webster, GA" 而不是 "Webster Parrish, LA")。
distance <- adist(tomatch$county,
fips$countyname,
partial = TRUE)
min.name <- apply(distance, 1, min)
matchedcounties <- NULL
for(i in 1:nrow(distance)) {
s2.i <- match(min.name[i], distance[i, ])
s1.i <- i
matchedcounties <- rbind(data.frame(s2.i = s2.i,
s1.i = s1.i,
s1name = tomatch[s1.i, ]$county,
s2name = fips[s2.i, ]$countyname,
adist = min.name[i]),
matchedcounties)
}
因此,我想将县的模糊字符串匹配限制为具有匹配状态的正确拼写版本。
我目前的算法制作了一个大矩阵,计算两个源之间的标准 Levenshtein 距离,然后选择距离最小的值。
为了解决我的问题,我想我需要创建一个可以应用于 ddply
的每个 'state' 组的函数,但我对如何实现感到困惑应该表明 ddply
函数中的组值应该匹配另一个数据帧。 dplyr
解决方案或使用任何其他包的解决方案也将不胜感激。
创建 FIPS 数据集的代码:
download.file('http://www2.census.gov/geo/docs/reference/codes/files/national_county.txt',
'./nationalfips.txt')
fips <- read.csv('./nationalfips.txt',
stringsAsFactors = FALSE, colClasses = 'character', header = FALSE)
names(fips) <- c('state', 'statefips', 'countyfips', 'countyname', 'classfips')
# remove 'County' from countyname
fips$countyname <- sub('County', '', fips$countyname, fixed = TRUE)
fips$countyname <- stringr::str_trim(fips$countyname)
没有示例数据,但尝试使用 agrep 而不是 adist 并仅搜索该状态下的名称
sapply(df_tomatch$county, function(x) agrep(x,df_matchby[df_matchby$state==dj_tomatch[x,'state'],'county'],value=TRUE)
您可以在 agrep
中使用 max.distance
参数来改变它们需要匹配的程度。此外,设置 value=TRUE
returns 匹配字符串的值而不是匹配的位置。
这是 dplyr 的一种方法。我首先按州加入 tomatch
data.frame 和 FIPS 名称(仅允许州内匹配):
require(dplyr)
df <- tomatch %>%
left_join(fips, by="state")
接下来,我注意到很多县在 FIPS 数据集中没有 'Saint',而是 'St.'。首先清理它应该会改善获得的结果。
df <- df %>%
mutate(county_clean = gsub("Saint", "St.", county))
然后,将这个data.frame按县分组,用adist计算距离:
df <- df %>%
group_by(county_clean) %>% # Calculate the distance per county
mutate(dist = diag(adist(county_clean, countyname, partial=TRUE))) %>%
arrange(county, dist) # Used this for visual inspection.
请注意,我将结果矩阵的对角线作为 adist returns 一个 n x m 矩阵,其中 n 代表 x 向量,m 代表 y 向量(它计算所有组合)。 或者,您可以添加 agrep 结果:
df <- df %>%
rowwise() %>% # 'group_by' a single row.
mutate(agrep_result = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
ungroup() # Always a good idea to remove 'groups' after you're done.
然后像之前那样过滤,取最小距离:
df <- df %>%
group_by(county_clean) %>% # Causes it to calculate the 'min' per group
filter(dist == min(dist)) %>%
ungroup()
请注意,这可能会导致 tomatch
中的每个输入行返回多行。
或者,一次完成所有操作 运行(一旦我确信它正在执行它应该执行的操作,我通常会将代码更改为这种格式):
df <- tomatch %>%
# Join on all names in the relevant state and clean 'St.'
left_join(fips, by="state") %>%
mutate(county_clean = gsub("Saint", "St.", county)) %>%
# Calculate the distances, per original county name.
group_by(county_clean) %>%
mutate(dist = diag(adist(county_clean, countyname, partial=TRUE))) %>%
# Append the agrepl result
rowwise() %>%
mutate(string_agrep = agrepl(county_clean, countyname, max.distance = 0.3)) %>%
ungroup() %>%
# Only retain minimum distances
group_by(county_clean) %>%
filter(dist == min(dist))
两种情况下的结果:
county county_clean state countyname dist string_agrep
1 Beauregard Beauregard LA Beauregard Parish 0 TRUE
2 De Soto De Soto LA De Soto Parish 0 TRUE
3 Dekalb Dekalb GA DeKalb 1 TRUE
4 Webster Webster LA Webster Parish 0 TRUE
5 Saint Joseph St. Joseph IN St. Joseph 0 TRUE
6 West Feliciana West Feliciana LA West Feliciana Parish 0 TRUE
7 Ketchikan Gateway Ketchikan Gateway AK Ketchikan Gateway Borough 0 TRUE
8 Evangeline Evangeline LA Evangeline Parish 0 TRUE
9 Richmond City Richmond City VA Richmond city 1 TRUE
10 Saint Mary St. Mary LA St. Mary Parish 0 TRUE
11 Saint Louis City St. Louis City MO St. Louis city 1 TRUE
12 Mclean Mclean KY McLean 1 TRUE
13 Union Union LA Union Parish 0 TRUE
14 Bienville Bienville LA Bienville Parish 0 TRUE
15 Covington City Covington City VA Covington city 1 TRUE
16 Martinsville City Martinsville City VA Martinsville city 1 TRUE
17 Claiborne Claiborne LA Claiborne Parish 0 TRUE
18 King And Queen King And Queen VA King and Queen 1 TRUE
19 Mclean Mclean ND McLean 1 TRUE
20 Mcminn Mcminn TN McMinn 1 TRUE
21 Prince Georges Prince Georges MD Prince George's 1 TRU