Subset rows by partially matching two columns on words longer than n characters
I'm open to doing this in Python pandas, but in R I have the following df:
result<-structure(list(traffic_Count_Street = c("San Angelo", "W Commerce St",
"W Commerce St", "S Gevers St", "Austin Hwy", "W Evergreen St"
), unit_Street = c("San Pedro Ave", "W Commerce", "W Commerce",
"S New Braunfels", "Austin Highway", "W Cypress")), .Names = c("traffic_Count_Street",
"unit_Street"), row.names = c(1L, 17L, 18L, 34L, 260L, 273L), class = "data.frame")
1 San Angelo San Pedro Ave
17 W Commerce St W Commerce
18 W Commerce St W Commerce
34 S Gevers St S New Braunfels
260 Austin Hwy Austin Highway
273 W Evergreen St W Cypress
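For each row, I want to partially match column 1 against column 2: a row counts as a match if a word longer than 3 characters in one column also appears in the other.
I would remove: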
1 San Angelo San Pedro Ave
34 S Gevers St S New Braunfels
273 W Evergreen St W Cypress
and keep:
17 W Commerce St W Commerce
18 W Commerce St W Commerce
260 Austin Hwy Austin Highway
I tried using stringr as follows, but without success:
result$unit_Street[str_detect(result$traffic_Count_Street, "\w{3}")]
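Create a distance filter with an adjustable threshold, then tune the threshold until you get the result you want. In this case, a Levenshtein distance threshold of 5 works well:
distanceFilter <- function(df, thresh=5) {
  # For each row, keep it if the edit distance between column 1 and column 2 is below thresh
  ind <- apply(df, 1, function(x) adist(x[1], x[2]) < thresh )
  df[ind,]
}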
distanceFilter(result, 5)
# traffic_Count_Street unit_Street
# 17 W Commerce St W Commerce
# 18 W Commerce St W Commerce
# 260 Austin Hwy Austin Highway
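To pick a threshold, it can help to look at the per-row distances directly. The following is a minimal sketch, assuming the sample data from the question; the name `row_dist` is made up for illustration.

```r
# Sample data copied from the question
result <- data.frame(
  traffic_Count_Street = c("San Angelo", "W Commerce St", "W Commerce St",
                           "S Gevers St", "Austin Hwy", "W Evergreen St"),
  unit_Street = c("San Pedro Ave", "W Commerce", "W Commerce",
                  "S New Braunfels", "Austin Highway", "W Cypress"),
  row.names = c(1L, 17L, 18L, 34L, 260L, 273L),
  stringsAsFactors = FALSE
)

# adist() returns the full pairwise distance matrix;
# its diagonal holds the distance between the two columns of each row.
row_dist <- diag(adist(result$traffic_Count_Street, result$unit_Street))
names(row_dist) <- rownames(result)  # row_dist is an illustrative name
print(row_dist)
```

Rows whose distance falls below the chosen threshold are the ones `distanceFilter` keeps.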
For more information, see the Wikipedia page on Levenshtein distance and the R help page for adist (?adist).
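The distance filter approximates the rule in the question rather than implementing it literally. A word-based alternative might look like the sketch below. It is only one possible reading of the requirement: it treats a row as a match when both columns share at least one identical word longer than 3 characters, compared case-insensitively. The helpers `long_words` and `shares_long_word` are hypothetical names, not part of any package.

```r
# Sample data copied from the question
result <- data.frame(
  traffic_Count_Street = c("San Angelo", "W Commerce St", "W Commerce St",
                           "S Gevers St", "Austin Hwy", "W Evergreen St"),
  unit_Street = c("San Pedro Ave", "W Commerce", "W Commerce",
                  "S New Braunfels", "Austin Highway", "W Cypress"),
  row.names = c(1L, 17L, 18L, 34L, 260L, 273L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lower-cased words of s that are longer than min_chars
long_words <- function(s, min_chars = 3) {
  words <- strsplit(tolower(s), "\\s+")[[1]]
  words[nchar(words) > min_chars]
}

# Hypothetical helper: TRUE if a and b share at least one long word
shares_long_word <- function(a, b, min_chars = 3) {
  length(intersect(long_words(a, min_chars), long_words(b, min_chars))) > 0
}

# Apply the check row by row and keep only the matching rows
keep <- mapply(shares_long_word,
               result$traffic_Count_Street, result$unit_Street,
               USE.NAMES = FALSE)
matched <- result[keep, ]
print(matched)
```

Unlike the distance filter, this needs no threshold tuning, but it only catches exact word matches: abbreviations such as "Hwy" for "Highway" count only if some other long word in the row, here "Austin", is shared.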