How to return a list of pairs of strings from a large matrix that mutually satisfy a maximum string-distance criterion?

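I'm trying to find a way to present manually entered words so that groups of them can be more easily recognized as referring to the same thing. Essentially a spell checker. I've built a large matrix (the real one is roughly 250 × 250). The code for the matrix is the same as in the reproducible example below. (I filled it using a random word generator; the real values make more sense but are confidential.)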

strings <- c("domineering","curl","axiomatic","root","gratis","secretary","lopsided","cumbersome","oval","mighty","thaw","troubled","furniture","round","soak","callous","melted","wealthy","sweltering","verdant","fence","eyes","ugliest","card","quickest","harm","brake","alarm","report","glue","eyes","hollow","quince","pack","twig","knot")

library(stringdist)
matrix <- stringdistmatrix(strings, strings, useNames = TRUE)

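Now I want to create a new table with two variables. The first column must contain pairs of elements of 'strings' whose string distance is below some threshold (for this example, stringdist < 7, non-zero), and the second column must contain the stringdist. The table should also not show the mirrored results that are present in the matrix, e.g. (oval, curl: 3), (curl, oval: 3).

I have a feeling this needs some kind of apply function, but I don't have a clue.

Cheers.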

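The following tidyverse-based solution should do the trick.

Note that the last line is only there to make the results easier to read. I don't think it's necessary for your purposes. If you do want to keep it, I'd suggest merging it into the initial creation of 'pair'.
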
library(stringdist)
library(dplyr)
library(tibble)
library(tidyr)
library(purrr)
library(stringr)

matrix %>%
  as_tibble() %>%
  mutate(X = colnames(.), .before = 1) %>%
  pivot_longer(-X) %>%
  # note: 1:7 also keeps distances of exactly 7; use 1:6 for "stringdist < 7"
  filter(value %in% 1:7) %>%
  transmute(pair = map2(X, name, ~ sort(c(.x, .y))),
            stringDist = value) %>%
  distinct(pair, stringDist) %>%
  mutate(pair = map_chr(pair, ~ str_c(., collapse = '_')))

# A tibble: 451 x 2
#   pair                   stringDist
#   <chr>                       <dbl>
# 1 domineering_sweltering          6
# 2 curl_root                       4
# 3 curl_gratis                     6
# 4 curl_secretary                  7
# 5 cumbersome_curl                 7
# 6 curl_oval                       3
# 7 curl_mighty                     6
# 8 curl_thaw                       4
# 9 curl_troubled                   6
# 10 curl_furniture                 7
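
If the final formatting step is folded into the creation of 'pair', as suggested above, the pipeline might look like the following sketch. It is not the answer's own code: the toy word list and the names `words`, `d`, `long` and `pairs` are made up for illustration, and the distances come from base R's adist() (Levenshtein) so the snippet runs without stringdist, whose default method (optimal string alignment) can give different values.

```r
library(dplyr)
library(purrr)
library(stringr)

# Hypothetical toy input; adist() stands in for stringdistmatrix() here.
words <- c("curl", "oval", "root", "thaw")
d <- adist(words)
dimnames(d) <- list(words, words)

# Long format: one row per cell of the distance matrix.
long <- as.data.frame(as.table(d), stringsAsFactors = FALSE)
names(long) <- c("X", "name", "value")

pairs <- long %>%
  filter(value %in% 1:7) %>%
  # Sort the two words and paste them in one step, so (a, b) and (b, a)
  # produce the same character key.
  transmute(pair = map2_chr(X, name, ~ str_c(sort(c(.x, .y)), collapse = "_")),
            stringDist = value) %>%
  # Mirrored cells now carry identical keys and collapse to one row.
  distinct(pair, stringDist)
```

Because 'pair' is a plain character column from the start, no list column and no trailing mutate() are needed.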

This also does the job:

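library(reshape2)  # melt() for matrices; assuming reshape2 is the intended package
library(dplyr)
# zero out the lower triangle so that each pair is kept only once
matrix[lower.tri(matrix)] <- 0
# long format: one row per (Var1, Var2, value) cell
matrix_melt <- melt(matrix)
matrix_melt %>%
    filter(value %in% 1:7)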
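
The same upper-triangle idea can also be sketched in base R without reshaping. This is only an illustration under stated assumptions: `close_pairs` is a hypothetical helper, not part of any package, and the toy matrix is built with base R's adist() (Levenshtein) rather than stringdistmatrix(), whose default method is optimal string alignment. The threshold is applied as strictly less than `max_dist`, matching the "stringdist < 7, non-zero" wording of the question.

```r
# Hypothetical helper: list each unordered pair once, keeping only
# distances d with 0 < d < max_dist.
close_pairs <- function(dist_mat, max_dist) {
  # Only cells strictly above the diagonal are considered, so mirrored
  # pairs and self-comparisons never appear.
  keep <- upper.tri(dist_mat) & dist_mat > 0 & dist_mat < max_dist
  idx <- which(keep, arr.ind = TRUE)
  data.frame(
    pair = paste(rownames(dist_mat)[idx[, "row"]],
                 colnames(dist_mat)[idx[, "col"]], sep = "_"),
    stringDist = dist_mat[idx],
    stringsAsFactors = FALSE
  )
}

# Toy input (made up for this sketch)
words <- c("curl", "oval", "root", "thaw")
d <- adist(words)
dimnames(d) <- list(words, words)

res <- close_pairs(d, max_dist = 4)
```

With the real data the call would be along the lines of close_pairs(matrix, 7). Since the matrix itself is not modified, it stays available for other uses.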