Stringdist distance unexpectedly large

The data below unexpectedly fails to match. I expected the distance to be 5, but even with max_dist = 7 I get no match:

library(fuzzyjoin)
one <- as.data.frame("Other field crops (non-organic)")
names(one) <- "A"
two <-  as.data.frame("other_field_crops_non_organic")
names(two) <- "A"

stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 7, ignore_case=TRUE)

                              A.x  A.y
1 Other field crops (non-organic) <NA>

Only at max_dist = 10 do I get a match:

stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 10, ignore_case=TRUE)
                              A.x                           A.y
1 Other field crops (non-organic) other_field_crops_non_organic

Can anyone explain why this distance is greater than 9? Is it related to the parentheses? If so, how can I work around it without removing them?

EDIT

library(fuzzyjoin)
one <- as.data.frame("Other field crops non-organic")
names(one) <- "A"
two <-  as.data.frame("other_field_crops_non_organic")
names(two) <- "A"

stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 5, ignore_case=TRUE)
                            A.x  A.y
1 Other field crops non-organic <NA>

Even without the parentheses I cannot get a match within a distance of 5.

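The problem comes down to the method you are using to compute the string distance. You are using the lcs (longest common substring) method, which effectively allows only deletions and insertions, not substitutions. From the documentation: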

The longest common substring (method='lcs') is defined as the longest string that can be obtained by pairing characters from a and b while keeping the order of characters intact. The lcs-distance is defined as the number of unpaired characters. The distance is equivalent to the edit distance allowing only deletions and insertions, each with weight one.

So when a space is turned into an underscore, each such substitution costs 2 (one deletion plus one insertion):

stringdist('abc def', 'abc_def', method = 'lcs')
#> [1] 2
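The quoted definition can be checked by hand. What the documentation calls the longest common substring is, by that definition, a subsequence (the paired characters need not be contiguous), so the number of unpaired characters is the sum of both lengths minus twice the length of that subsequence. The following is a minimal base R sketch of that idea, not stringdist's actual implementation; `lcs_length` and `lcs_distance` are hypothetical helper names.

```r
# Hypothetical helper: length of the longest common subsequence (base R only).
lcs_length <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # prev[j + 1] is the LCS length of the x prefix processed so far and y[1..j]
  prev <- integer(length(y) + 1)
  for (xi in x) {
    cur <- integer(length(y) + 1)
    for (j in seq_along(y)) {
      cur[j + 1] <- if (xi == y[j]) prev[j] + 1L else max(prev[j + 1], cur[j])
    }
    prev <- cur
  }
  prev[length(y) + 1]
}

# Unpaired characters = everything not covered by the common subsequence.
lcs_distance <- function(a, b) nchar(a) + nchar(b) - 2L * lcs_length(a, b)

lcs_distance("abc def", "abc_def")
#> [1] 2
lcs_distance("other field crops (non-organic)", "other_field_crops_non_organic")
#> [1] 10
```

The second call uses the lower-cased strings from the question: 31 + 29 characters with 25 paired letters leaves 10 unpaired characters.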

This contrasts with the default 'osa' method, which, like the Levenshtein distance and the R function adist, allows direct substitution at a cost of only 1:

stringdist('abc def', 'abc_def', method = 'osa')
#> [1] 1
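If stringdist is not at hand, the same substitution cost can be seen with base R's `adist()`, which computes a generalized Levenshtein distance. This is only a cross-check, assuming the default costs of 1 for insertions, deletions and substitutions; the `ignore.case = TRUE` argument plays the role that `ignore_case` plays in the join.

```r
# adist() ships with base R (package utils) and returns a matrix of distances.
adist("abc def", "abc_def")[1, 1]
#> [1] 1

# The strings from the question, compared case-insensitively.
adist("Other field crops (non-organic)", "other_field_crops_non_organic",
      ignore.case = TRUE)[1, 1]
#> [1] 6
```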

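You can compare how the different stringdist methods score your two strings. To simplify things, let's make both lower case, since you already specify ignore_case in your left join: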

library(stringdist)

a <- "other field crops (non-organic)"
b <- "other_field_crops_non_organic"
methods <- c("osa", "lv", "dl", "hamming", "lcs", 
             "qgram", "cosine", "jaccard", "jw", "soundex")

sapply(methods, function(x) stringdist(a, b, method = x))
#>        osa         lv         dl    hamming        lcs      qgram     cosine 
#>  6.0000000  6.0000000  6.0000000        Inf 10.0000000 10.0000000  0.2025635 
#>    jaccard         jw    soundex 
#>  0.2500000  0.1104931  0.0000000

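You can see that the Hamming distance is infinite because your strings have different lengths, and that osa (the default method) gives only 6, whereas lcs gives 10: the 4 underscores, 3 spaces, 1 hyphen and 2 parentheses are each an unpaired character. If this pair of strings is representative of your data, you may want to switch to "osa".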

Created on 2022-04-14 by the reprex package (v2.0.1)
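The count of 10 for lcs can be reproduced by tabulating the characters that have no partner in the other string. The sketch below relies on an assumption that holds for this particular pair only: the letters of both strings are identical and in the same order, so every non-alphanumeric character is unpaired. `leftover` is a hypothetical helper name.

```r
a <- "other field crops (non-organic)"
b <- "other_field_crops_non_organic"

# Drop letters and digits, then tabulate what is left in each string.
leftover <- function(s) table(strsplit(gsub("[[:alnum:]]", "", s), "")[[1]])

leftover(a)   # 3 spaces, "(", "-" and ")": 6 characters
leftover(b)   # 4 underscores

sum(leftover(a)) + sum(leftover(b))
#> [1] 10
```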

Could you clean the text before joining? If the problem is only the special characters, removing them first might make the join easier.

library(fuzzyjoin)
library(stringdist)
library(stringr)

## sample data
one <- as.data.frame("Other field crops (non-organic)")
names(one) <- "A"
two <-  as.data.frame("other_field_crops_non_organic")
names(two) <- "A"
##

# remove special chars, make lower-case, single-space between strings
#  you might want to use purrr or *apply for multiple columns
one$A <- str_replace_all(one$A, "[^[:alnum:]]", " ") %>% 
  tolower() %>% 
  str_squish()
two$A <- str_replace_all(two$A, "[^[:alnum:]]", " ") %>% 
  tolower() %>%
  str_squish()


stringdist(one$A, two$A)
#> [1] 0

stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 7, ignore_case=TRUE)
#>                             A.x                           A.y
#> 1 other field crops non organic other field crops non organic

Created on 2022-04-14 by the reprex package (v2.0.1)
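Since the question asks to keep the parentheses, one possible variation is to leave column A untouched and join on a cleaned copy instead. The sketch below uses base R only; `normalize_label` and the `key` column are hypothetical names, and the exact `merge()` stands in for the fuzzy join on the assumption that the cleaned labels are identical (a fuzzy join on `key` should work the same way if they are merely close).

```r
# Hypothetical helper: lower-case, and collapse every run of
# non-alphanumeric characters into a single space.
normalize_label <- function(x) {
  x <- gsub("[^[:alnum:]]+", " ", x)
  trimws(tolower(x))
}

one <- data.frame(A = "Other field crops (non-organic)", stringsAsFactors = FALSE)
two <- data.frame(A = "other_field_crops_non_organic", stringsAsFactors = FALSE)

# Join on a derived key so the original labels stay as they were.
one$key <- normalize_label(one$A)
two$key <- normalize_label(two$A)

joined <- merge(one, two, by = "key", all.x = TRUE)
joined
```

For several columns, something like `df[cols] <- lapply(df[cols], normalize_label)` would apply the same cleaning to each of them.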