Understanding constraints in agrep fuzzy matching in R

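This seems simple, but for some reason I don't understand how agrep fuzzy matching behaves when substitutions are involved. When I specify all=2, two substitutions produce a match as expected, but when I specify substitutions=2 they do not. Why is that?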

# Finds a match, as expected
agrep("abcdeX", "abcdef", value = TRUE,
      max.distance = list(sub = 1, ins = 0, del = 0))
#> [1] "abcdef"

# Finds no match, as expected
agrep("abcdXX", "abcdef", value = TRUE,
      max.distance = list(sub = 1, ins = 0, del = 0))
#> character(0)

# Finds a match, as expected
agrep("abcdXX", "abcdef", value = TRUE,
      max.distance = list(all = 2))
#> [1] "abcdef"

# UNEXPECTEDLY finds no match
agrep("abcdXX", "abcdef", value = TRUE,
      max.distance = list(sub = 2, ins = 0, del = 0))
#> character(0)

Created on 2021-06-03 by the reprex package (v2.0.0)

all is an upper limit that always applies, regardless of the other max.distance controls (apart from cost). It defaults to 10%.

# one character can change
agrep(pattern = "abcdXX", x = "abcdef", value = TRUE,
     max.distance = list(sub = 2, ins = 0, del = 0, all = 0.1))
# character(0)

# two characters can change
agrep(pattern = "abcdXX", x = "abcdef", value = TRUE,
     max.distance = list(sub = 2, ins = 0, del = 0, all = 0.2))
# [1] "abcdef"

# one character can change
agrep(pattern = "abcdXX", x = "abcdef", value = TRUE,
    max.distance = list(sub = 1, ins = 1, del = 0, all = 0.1))
# character(0)

# two characters can change
agrep(pattern = "abcdXX", x = "abcdef", value = TRUE,
    max.distance = list(sub = 1, ins = 1, del = 0, all = 0.2))
# [1] "abcdef"

The fractional mode for all is a little awkward at the point where it switches to integer mode, which happens at 1.

# all just below 1 is read as a fraction: up to 8 edits in total, so ins = 2 is the effective limit
agrep(pattern = "abcdXXef", x = "abcdef", value = TRUE,
    max.distance = list(sub = 0, ins = 2, del = 0, all = 1 - 1e-9))
# [1] "abcdef"

# all = 1 is read as an integer: only 1 edit in total
agrep(pattern = "abcdXXef", x = "abcdef", value = TRUE,
    max.distance = list(sub = 0, ins = 2, del = 0, all = 1))
# character(0)

When you suppress all by setting it to just below 1, the limits for the individual edit types (sub, ins, del) apply.

# two substitutions allowed
agrep(pattern = "abcdXX", 
    x = c("abcdef", "abcXdef", "abcefg"), value = TRUE,
    max.distance = list(sub = 2, ins = 0, del = 0, all = 1 - 1e-9))
# [1] "abcdef"

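The purpose of setting costs is to let you move through mutation space at different rates in different directions. This will depend on your use case. For example, some dialects of a language may be more likely to add letters. You might choose to make one deletion cost as much as two insertions. By default, when costs = NULL, all weights are equal, i.e. costs = c(ins = 1, del = 1, sub = 1).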
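As a rough illustration of unequal weights, the sketch below uses adist() from base R, which accepts the same style of costs vector, to show that the weighted distance between two strings then depends on the direction of the edit. The weights vector and the example strings are made up for illustration, and no particular agrep() result is claimed.

```r
# Hypothetical weights: one deletion costs as much as two insertions.
weights <- c(ins = 1, del = 2, sub = 1)

# The two strings differ by a single trailing character, so one direction
# needs an insertion and the other needs a deletion.
d_forward  <- adist("abc", "abcd", costs = weights)[1, 1]
d_backward <- adist("abcd", "abc", costs = weights)[1, 1]

# With unit costs (the default) the same edit has distance 1.
d_unit <- adist("abc", "abcd")[1, 1]

c(forward = d_forward, backward = d_backward, unit = d_unit)

# The same vector can be passed to agrep() through its costs argument,
# with the total weighted cost capped via max.distance$cost.
agrep("abcd", "abc", value = TRUE, costs = weights,
      max.distance = list(cost = 2))
```

With these weights the two directions no longer cost the same, which is the asymmetry the costs argument is meant to express.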

Edit: regarding your comment about why some patterns match and others don't, the 10% refers to the number of characters in the pattern, rounded up.

agrep(pattern = "01234567XX89", x = "0123456789", value = TRUE, 
    max.distance = list(sub = 0, ins = 2, del = 0))
# [1] "0123456789"
agrep(pattern = "01234567XX", x = "0123456789", value = TRUE, 
    max.distance = list(sub = 2, ins = 0, del = 0))
# character(0)
num_mutations <- nchar(c("01234567XX89", "01234567XX")) * 0.1
num_mutations
# [1] 1.2 1.0
ceiling(num_mutations)
# [1] 2 1

The second pattern has only 10 characters, so only one substitution is allowed.
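The rule described above can be sketched as a small helper. This is only an approximation under stated assumptions: agrep_all_limit is a hypothetical name, not part of R, and it assumes unit costs, with values of all below 1 treated as a fraction of the pattern length (rounded up) and values of 1 or more treated as an integer count.

```r
# Hypothetical helper (not part of base R): estimates the total number of
# edits that max.distance$all permits for a pattern, assuming unit costs.
agrep_all_limit <- function(pattern, all = 0.1) {
  if (all < 1) {
    # Fractional mode: fraction of the pattern length, rounded up
    ceiling(all * nchar(pattern))
  } else {
    # Integer mode: the value is taken as a count of edits
    all
  }
}

agrep_all_limit("01234567XX89")           # 12 characters at the default 10%
agrep_all_limit("01234567XX")             # 10 characters at the default 10%
agrep_all_limit("abcdXX", all = 0.2)      # 6 characters at 20%
agrep_all_limit("abcdXXef", all = 1 - 1e-9)
agrep_all_limit("abcdXXef", all = 1)
```

Checking a pattern with this before calling agrep() gives a quick sense of whether the default all is likely to be the binding limit, rather than sub, ins or del.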