Stringdist distance unexpectedly large

Question

The data below unexpectedly fails to match. I expected the distance to be 5, but I cannot find a match even at 7:

library(fuzzyjoin)
one <- as.data.frame("Other field crops (non-organic)")
names(one) <- "A"
two <-  as.data.frame("other_field_crops_non_organic")
names(two) <- "A"

stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 7, ignore_case=TRUE)

                              A.x  A.y
1 Other field crops (non-organic) <NA>

I only get a match at 10:

stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 10, ignore_case=TRUE)
                              A.x                           A.y
1 Other field crops (non-organic) other_field_crops_non_organic

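Can anyone explain why this distance is greater than 9? Is it related to the brackets? If so, how can I work around this without removing the brackets?

EDIT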

library(fuzzyjoin)
one <- as.data.frame("Other field crops non-organic")
names(one) <- "A"
two <-  as.data.frame("other_field_crops_non_organic")
names(two) <- "A"

stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 5, ignore_case=TRUE)
                            A.x  A.y
1 Other field crops non-organic <NA>

Even without the brackets I cannot get a match within a distance of 5.

Answer 1

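The problem comes down to the method you are using to calculate the string distance. You are using the lcs (longest common substring) method, which effectively allows only deletions and insertions, not substitutions. From the documentation: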

The longest common substring (method='lcs') is defined as the longest string that can be obtained by pairing characters from a and b while keeping the order of characters intact. The lcs-distance is defined as the number of unpaired characters. The distance is equivalent to the edit distance allowing only deletions and insertions, each with weight one.

So when a space is turned into an underscore, each substitution costs 2 (one deletion plus one insertion):

stringdist('abc def', 'abc_def', method = 'lcs')
#> [1] 2

This contrasts with the default 'osa' method, which, like the Levenshtein distance and the base R function adist, allows direct substitution at a cost of only 1:

stringdist('abc def', 'abc_def', method = 'osa')
#> [1] 1
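To make the "insertions and deletions only" rule concrete, here is a minimal base R sketch of that distance using dynamic programming. The helper `lcs_dist` is a hypothetical name introduced for illustration and is not part of stringdist; it is only meant to reproduce the arithmetic, not the package's implementation.

```r
# Hypothetical helper (not part of stringdist): edit distance that allows
# only insertions and deletions, each with weight one.
lcs_dist <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # d[i + 1, j + 1] holds the distance between the first i characters of a
  # and the first j characters of b
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      d[i + 1, j + 1] <- if (x[i] == y[j]) {
        d[i, j]                              # characters pair up, no cost
      } else {
        min(d[i, j + 1], d[i + 1, j]) + 1    # delete from a or insert from b
      }
    }
  }
  d[n + 1, m + 1]
}

lcs_dist("abc def", "abc_def")       # 2: delete the space, insert the underscore
drop(adist("abc def", "abc_def"))    # 1: Levenshtein allows one substitution

lcs_dist("other field crops non-organic",
         "other_field_crops_non_organic")   # 8: four swaps at 2 each
```

The last call corresponds to the strings in the question's edit (lower-cased): four characters differ, each costing 2 under lcs, which is why a max_dist of 5 is not enough there.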

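You can compare how the different stringdist methods score your two strings. To simplify things, let's make both lower case, since you have already specified ignore_case in the left join: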

library(stringdist)

a <- "other field crops (non-organic)"
b <- "other_field_crops_non_organic"
methods <- c("osa", "lv", "dl", "hamming", "lcs", 
             "qgram", "cosine", "jaccard", "jw", "soundex")

sapply(methods, function(x) stringdist(a, b, method = x))
#>        osa         lv         dl    hamming        lcs      qgram     cosine 
#>  6.0000000  6.0000000  6.0000000        Inf 10.0000000 10.0000000  0.2025635 
#>    jaccard         jw    soundex 
#>  0.2500000  0.1104931  0.0000000

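You can see that the Hamming distance is infinite because your strings have different lengths. osa (the default method) gives only 6, but lcs needs 10: the 4 underscores in one string and the 3 spaces, 1 hyphen and 2 brackets in the other are all left unpaired. If this string pair is representative of your data, you may want to switch to "osa".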
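As a sketch of what that switch could look like, the join from the question can be rerun with method = "osa". The threshold of 6 is taken from the osa value in the comparison above; whether it is a sensible cut-off for a full data set is an assumption you would need to check against your own strings.

```r
library(fuzzyjoin)

one <- data.frame(A = "Other field crops (non-organic)")
two <- data.frame(A = "other_field_crops_non_organic")

# Same join as in the question, but with method = "osa".
# max_dist = 6 equals the osa distance between the two lower-cased strings,
# and the question's own output shows that max_dist is inclusive.
res <- stringdist_left_join(one, two, by = "A", method = "osa",
                            max_dist = 6, ignore_case = TRUE)
res
```

With these inputs the row should now pair up instead of returning NA in A.y, and the brackets stay in place.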

Created on 2022-04-14 by the reprex package (v2.0.1)

Answer 2

Can you clean up the text before joining? If the problem is only special characters, removing them first may make the join easier.

library(fuzzyjoin)
library(stringdist)
library(stringr)

## sample data
one <- as.data.frame("Other field crops (non-organic)")
names(one) <- "A"
two <-  as.data.frame("other_field_crops_non_organic")
names(two) <- "A"
##

# remove special chars, make lower-case, single-space between strings
#  you might want to use purrr or *apply for multiple columns
one$A <- str_replace_all(one$A, "[^[:alnum:]]", " ") %>% 
  tolower() %>% 
  str_squish()
two$A <- str_replace_all(two$A, "[^[:alnum:]]", " ") %>% 
  tolower() %>%
  str_squish()


stringdist(one$A, two$A)
#> [1] 0

stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 7, ignore_case=TRUE)
#>                             A.x                           A.y
#> 1 other field crops non organic other field crops non organic
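Since the question asks to keep the brackets, one possible variation is to leave column A untouched and join on a cleaned copy instead. This is a sketch under assumptions: `clean_key` and the `key` column are hypothetical names introduced here, the cleaning uses base R rather than stringr, and max_dist = 2 is an arbitrary tolerance for small typos.

```r
library(fuzzyjoin)

one <- data.frame(A = "Other field crops (non-organic)")
two <- data.frame(A = "other_field_crops_non_organic")

# Hypothetical helper: lower-case, turn runs of non-alphanumerics into a
# single space, and trim leading/trailing whitespace.
clean_key <- function(x) {
  trimws(gsub("[^[:alnum:]]+", " ", tolower(x)))
}

# Keep the original text in A and put the cleaned version in a separate column.
one$key <- clean_key(one$A)
two$key <- clean_key(two$A)

# Join on the cleaned key; the original strings come along unchanged.
res <- stringdist_left_join(one, two, by = "key", method = "lcs", max_dist = 2)
res
```

Because both data frames share the column names A and key, the result carries them with .x and .y suffixes, so the original bracketed string is still available in A.x.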

Created on 2022-04-14 by the reprex package (v2.0.1)
