stringdist distance unexpectedly large
The data below unexpectedly fails to match. I expected the distance to be 5, but even with max_dist = 7 I get no match:
library(fuzzyjoin)
one <- as.data.frame("Other field crops (non-organic)")
names(one) <- "A"
two <- as.data.frame("other_field_crops_non_organic")
names(two) <- "A"
stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 7, ignore_case=TRUE)
A.x A.y
1 Other field crops (non-organic) <NA>
Only at max_dist = 10 do I get a match:
stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 10, ignore_case=TRUE)
A.x A.y
1 Other field crops (non-organic) other_field_crops_non_organic
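Can anyone explain why this distance is greater than 9? Is it related to the parentheses? If so, how can I work around it without removing the parentheses?
EDIT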
library(fuzzyjoin)
one <- as.data.frame("Other field crops non-organic")
names(one) <- "A"
two <- as.data.frame("other_field_crops_non_organic")
names(two) <- "A"
stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 5, ignore_case=TRUE)
A.x A.y
1 Other field crops non-organic <NA>
Even without the parentheses I cannot get a distance within 5.
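The problem comes down to the method you are using to compute the string distance. You are using the lcs (longest common substring) method, which effectively allows only deletions and insertions, not substitutions. From the documentation: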
The longest common substring (method='lcs') is defined as the longest string that can be obtained by pairing characters from a and b while keeping the order of characters intact. The lcs-distance is defined as the number of unpaired characters. The distance is equivalent to the edit distance allowing only deletions and insertions, each with weight one.
So when we turn a space into an underscore, each such substitution costs 2 (one deletion plus one insertion):
stringdist('abc def', 'abc_def', method = 'lcs')
#> [1] 2
This contrasts with the default 'osa' method, which, like the Levenshtein distance and the base R function adist, allows direct substitutions at a cost of only 1:
stringdist('abc def', 'abc_def', method = 'osa')
#> [1] 1
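To make the insert-and-delete-only rule concrete, here is a minimal base R sketch. It is not stringdist's own implementation, and `lcs_length` and `lcs_distance` are hypothetical helper names. It pairs characters in order (what is usually called the longest common subsequence) and counts whatever is left unpaired in either string.

```r
# Length of the longest in-order pairing of characters between a and b.
lcs_length <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # tab[i + 1, j + 1] = pairing length for the first i chars of x and first j of y
  tab <- matrix(0L, nrow = length(x) + 1, ncol = length(y) + 1)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      tab[i + 1, j + 1] <- if (x[i] == y[j]) {
        tab[i, j] + 1L
      } else {
        max(tab[i, j + 1], tab[i + 1, j])
      }
    }
  }
  tab[length(x) + 1, length(y) + 1]
}

# Number of unpaired characters across both strings.
lcs_distance <- function(a, b) {
  nchar(a) + nchar(b) - 2L * lcs_length(a, b)
}

lcs_distance("abc def", "abc_def")    # space deleted, underscore inserted
#> [1] 2
drop(adist("abc def", "abc_def"))     # base R Levenshtein: one substitution
#> [1] 1

# The two (lower-cased) pairs from the question:
lcs_distance("other field crops (non-organic)", "other_field_crops_non_organic")
#> [1] 10
lcs_distance("other field crops non-organic", "other_field_crops_non_organic")
#> [1] 8
```

The last call suggests why the edited example still fails at max_dist = 5: four separators differ, and each one costs 2 under lcs.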
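You can compare how the different stringdist methods score your two strings. To simplify further, let's make both lower case, since you already specified ignore_case in the left join: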
library(stringdist)
a <- "other field crops (non-organic)"
b <- "other_field_crops_non_organic"
methods <- c("osa", "lv", "dl", "hamming", "lcs",
"qgram", "cosine", "jaccard", "jw", "soundex")
sapply(methods, function(x) stringdist(a, b, method = x))
#> osa lv dl hamming lcs qgram cosine
#> 6.0000000 6.0000000 6.0000000 Inf 10.0000000 10.0000000 0.2025635
#> jaccard jw soundex
#> 0.2500000 0.1104931 0.0000000
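You can see that the Hamming distance is infinite because your strings have different lengths, and that osa (the default method) gives only 6, whereas lcs needs 10 (the unpaired characters: 4 underscores, 3 spaces, 1 hyphen and 2 parentheses). If this string pair is representative of your data, you may want to switch to "osa".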
Created on 2022-04-14 by the reprex package (v2.0.1)
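If switching methods is acceptable, the join from the question might look like the sketch below. It assumes the fuzzyjoin package is installed and that the pair above is typical of the data. The `distance_col` argument is optional and is used here only to expose the computed distance; `dist` is an arbitrary column name.

```r
library(fuzzyjoin)

one <- data.frame(A = "Other field crops (non-organic)", stringsAsFactors = FALSE)
two <- data.frame(A = "other_field_crops_non_organic", stringsAsFactors = FALSE)

# Same join as in the question, but with method = "osa", where a
# substitution costs 1 instead of 2. The osa distance reported above for
# the lower-cased pair is 6, so max_dist = 7 should be enough.
result <- stringdist_left_join(
  one, two,
  by = "A",
  method = "osa",
  max_dist = 7,
  ignore_case = TRUE,
  distance_col = "dist"
)
result
```

Whether 7 is a sensible threshold for a whole data set is a separate question: a looser limit also lets unrelated short strings match.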
Could you clean the text before joining? If the problem is only special characters, removing them first may make the join easier.
library(fuzzyjoin)
library(stringdist)
library(stringr)
## sample data
one <- as.data.frame("Other field crops (non-organic)")
names(one) <- "A"
two <- as.data.frame("other_field_crops_non_organic")
names(two) <- "A"
##
# remove special chars, make lower-case, single-space between strings
# you might want to use purrr or *apply for multiple columns
one$A <- str_replace_all(one$A, "[^[:alnum:]]", " ") %>%
  tolower() %>%
  str_squish()
two$A <- str_replace_all(two$A, "[^[:alnum:]]", " ") %>%
  tolower() %>%
  str_squish()
stringdist(one$A, two$A)
#> [1] 0
stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 7, ignore_case=TRUE)
#> A.x A.y
#> 1 other field crops non organic other field crops non organic
Created on 2022-04-14 by the reprex package (v2.0.1)
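Since the question asks to keep the parentheses, one variation on this cleaning idea is to put the normalised text in a separate key column and leave the original strings untouched. The following is a minimal base R sketch; `normalise_key` and the `key` column are hypothetical names. `merge()` is an exact join, so it works here only because the cleaned keys turn out identical.

```r
one <- data.frame(A = "Other field crops (non-organic)", stringsAsFactors = FALSE)
two <- data.frame(A = "other_field_crops_non_organic", stringsAsFactors = FALSE)

# Collapse every run of non-alphanumeric characters to one space,
# lower-case the result and trim leading/trailing whitespace.
normalise_key <- function(x) {
  x <- gsub("[^[:alnum:]]+", " ", x)
  trimws(tolower(x))
}

one$key <- normalise_key(one$A)
two$key <- normalise_key(two$A)

# Left join on the cleaned key; the original A columns come through
# unchanged as A.x and A.y.
matched <- merge(one, two, by = "key", all.x = TRUE, suffixes = c(".x", ".y"))
matched[, c("A.x", "A.y")]
```

If the cleaned keys are only close rather than identical, the same `key` column could be passed to stringdist_left_join via by = "key" instead of using `merge()`.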