R - String Distance with weighted words

Is there any way to weight specific words using the stringdist package or another string distance package?

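I often have strings that share a common word such as "city" or "university". As a result they get relatively close string distance matches even though they are very different (e.g. "University of Utah" and "University of Ohio", or "XYZ City" and "ABC City").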


Sure, one option would be to str_remove those common words before matching, but then "XYZ County" and "XYZ City" would look identical.


Example: "University of Utah" and "University of Ohio"

stringdist("University of Utah", "University of Ohio") / max(nchar("University of Utah"), nchar("University of Ohio"))

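The normalized string distance is 0.2222, which is relatively low. Yet the normalized OSA string distance between just "Utah" and "Ohio" is 1 (4 edits over 4 characters). The shared "University of " prefix dilutes those same 4 edits over 18 characters:

4 / 18 = 0.222222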

However, removing "University of" and other common strings (such as "State") beforehand would lead to a match between "University of Ohio" and "Ohio State".

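Weighting strings like "University of" so that they count as, say, 0.25 of their actual number of characters in the normalization denominator would reduce the impact of those common substrings, i.e.: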

4 / (18 * 0.25) = 0.888888.


stringdist("University of Ohio", "Ohio State")

yields 16. But taking 0.25 of the denominator:

16 / (18 * .25) = 3.55555.
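
A rough sketch of how that weighting could be applied per word instead of to the whole denominator. Everything named here (`common_words`, `common_weight`, `weighted_nchar`, `weighted_dist`) is made up for illustration and is not a stringdist feature; only `stringdist()` itself comes from the package. Spaces are ignored when counting characters, which is an arbitrary choice.

```r
library(stringdist)

# Hypothetical list of low-information words and the weight their characters get
common_words  <- c("university", "of", "state", "city")
common_weight <- 0.25

# Weighted length of a string: characters in common words count 0.25 each,
# characters in all other words count 1 each (spaces are not counted)
weighted_nchar <- function(s) {
  words <- strsplit(tolower(s), "\\s+")[[1]]
  w <- ifelse(words %in% common_words, common_weight, 1)
  sum(nchar(words) * w)
}

# Raw OSA distance divided by the larger of the two weighted lengths
weighted_dist <- function(a, b) {
  stringdist(a, b, method = "osa") /
    max(weighted_nchar(a), weighted_nchar(b))
}

weighted_dist("University of Utah", "University of Ohio")  # 4 / 7, about 0.571
weighted_dist("University of Ohio", "Ohio State")
```

Like the hand calculation above, the result is no longer bounded by 1, so it is only useful for ranking candidate matches, not as a proper normalized distance.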

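Perhaps a better option would be to use LCS, but down-weight substrings that match a list of common strings. So even though "University of Utah" and "University of Ohio" share a 14-character common substring, its contribution to the LCS value would be reduced if "University of" appears in that list.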
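
One way the LCS idea might look, as a sketch only. Note that `method = "lcs"` in stringdist is the longest common *subsequence* distance (the number of unpaired characters), so the common length is recovered from it here. The `common_phrases` list, the 0.25 weight, and both helper functions are hypothetical, and the discount simply assumes that any listed phrase found in both strings is fully part of the common subsequence.

```r
library(stringdist)

# Hypothetical list of common phrases and the weight their characters keep
common_phrases <- c("university of", "state")
phrase_weight  <- 0.25

# Length of the longest common subsequence, derived from the LCS distance:
# distance = nchar(a) + nchar(b) - 2 * (common length)
lcs_length <- function(a, b) {
  (nchar(a) + nchar(b) - stringdist(a, b, method = "lcs")) / 2
}

# Common length with characters of shared common phrases discounted
discounted_lcs <- function(a, b) {
  a <- tolower(a)
  b <- tolower(b)
  in_both <- vapply(
    common_phrases,
    function(p) grepl(p, a, fixed = TRUE) && grepl(p, b, fixed = TRUE),
    logical(1)
  )
  shared <- common_phrases[in_both]
  lcs_length(a, b) - (1 - phrase_weight) * sum(nchar(shared))
}

discounted_lcs("University of Utah", "University of Ohio")
discounted_lcs("University of Ohio", "Ohio State")
```

A higher value means more similar here, so this is a similarity score rather than a distance.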


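I had another idea: using the tidytext package and unnest_tokens, one could generate a list of the most common words across all the strings being matched. It might be interesting to down-weight these words in proportion to how common they are in the dataset, since the more common a word is, the less distinguishing power it has...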
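
A minimal sketch of that frequency-based weighting, using base R tokenization in place of tidytext::unnest_tokens to keep it dependency-free. The sample vector `names_vec`, the inverse-document-frequency weights, and the `weighted_token_dist` helper are all assumptions chosen for illustration, and the distance works on whole words rather than on characters.

```r
# Small sample of names; in practice this would be the full set being matched
names_vec <- c("University of Ohio", "University of Utah", "Ohio State",
               "Utah State", "Alabama State", "University of Colorado")

# One lower-cased token vector per string
tokens <- strsplit(tolower(names_vec), "\\s+")
vocab  <- unique(unlist(tokens))

# Document frequency: the number of strings in which each word appears
doc_freq <- vapply(
  vocab,
  function(w) sum(vapply(tokens, function(tk) w %in% tk, logical(1))),
  numeric(1)
)

# Inverse document frequency: the more common a word, the smaller its weight
idf <- log(length(names_vec) / doc_freq)

# Weighted Jaccard-style distance on words:
# 1 - (weight of shared words) / (weight of all words in either string).
# Assumes every word of a and b appears in names_vec.
weighted_token_dist <- function(a, b) {
  ta <- unique(strsplit(tolower(a), "\\s+")[[1]])
  tb <- unique(strsplit(tolower(b), "\\s+")[[1]])
  1 - sum(idf[intersect(ta, tb)]) / sum(idf[union(ta, tb)])
}

sort(idf)
weighted_token_dist("University of Utah", "University of Ohio")
weighted_token_dist("University of Ohio", "Ohio State")
```

Sharing only "university" and "of" then buys less similarity than sharing a rare word would, which is the effect asked for, although exact-word matching does not tolerate typos the way a character-level distance does.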

Maybe one idea would be to group similar terms before computing the string distances, so that "Ohio State" and "University of Ohio" are never compared at all.

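# Packages (the tm package must also be installed for tm::stopwords())
library(purrr)       # map(), map_chr(), modify_at()
library(magrittr)    # %>% pipe
library(stringdist)  # stringdistmatrix()

# Strings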
v1 <- c("University of Ohio", "University of Utah", "Ohio State", "Utah State",
        "University Of North Alabama", "University of South Alabama", "Alabama State",
        "Arizona State University Polytechnic", "Arizona State University Tempe", 
        "Arizona State", "Metropolitan State University of Denver", 
        "Metropolitan University Of The State Of Denver", "University Of Colorado", 
        "Western State Colorado University", "The Dalton College", "The Colorado State", 
        "The Dalton State College", "Columbus State University", "Dalton College")

# Remove stop words
v2 <- strsplit(v1, " ") %>% 
  map_chr(~ paste(.x[!tolower(.x) %in% tm::stopwords()], collapse = " "))

# Define groups
groups <- c(Group1 = "state", 
            Group2 = "university", 
            Group3 = "college",
            # Groups 4-5 must contain BOTH terms
            Group4 = ".*(state.*university|university.*state).*", 
            Group5 = ".*(state.*college|college.*state).*")

# Iterate over the list and assign groups
dat <- list(words = v2, pattern = groups)
lst <- dat$pattern %>% map(~ grepl(.x, dat$words, ignore.case = TRUE))

lst %>%
  # Make sure groups 1 to 3 and 4-5 are mutually exclusive
  # i.e: if a string contains "state" AND "university" (Group4), it must not be in Group1
  modify_at(c("Group1", "Group2", "Group3"), 
            ~ ifelse(lst$Group4 & .x | lst$Group5 & .x, !.x, .x)) %>%
  # Return matches from strings 
  map(~ v2[.x]) %>%
  # Compute the stringdistance for each group
  map(~ stringdistmatrix(.x, .x)) ## Maybe using method = "jw" ?