Extract substring match from agrep

Question

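My goal is to determine whether a given text contains a target string, but I want to allow for typos / small variations and extract the substring that "caused" the match (to use it for further text analysis).

Example: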

target <- "target string"
text <- "the target strlng: Butter. this text i dont want to extract."

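Desired output:

I would like to get target strlng as the output, since it is very close to the target (edit distance of 1). Next I want to use target strlng to extract the word Butter (I already have this part covered; I only mention it to give a complete specification).

What I tried: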

Using adist did not work, because it compares two whole strings, not substrings.
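As a side note, adist also accepts a partial argument that measures the distance to the best-matching substring instead of the whole string. A minimal sketch, assuming base R's adist and the example data from above (the variable names full and part are only illustrative). It still returns a distance rather than the matched substring, so it does not solve the problem on its own:

```r
target <- "target string"
text <- "the target strlng: Butter. this text i dont want to extract."

# Whole-string distance: every extra character in `text` counts as an edit.
full <- adist(target, text)

# Partial distance: `target` only has to match some substring of `text`.
part <- adist(target, text, partial = TRUE)

# The partial distance is much smaller than the whole-string distance.
part < full
```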

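Next I took a look at agrep, which seems very close. I can get the information that my target was found, but not the substring that "caused" the match.

I tried value = TRUE, but it seems to work at the array level: it returns the whole matching element, not the matching part of it. I don't think I can switch to an array type, because I cannot split on spaces (my target string might contain spaces, ...).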

agrep(
  pattern = target, 
  x = text,
  value = TRUE
)

Answer 1

Use aregexec. It works much like using regexpr/regmatches (or gregexpr) to extract exact matches.

m <- aregexec('string', 'text strlng wrong')
regmatches('text strlng wrong', m)
#[[1]]
#[1] "strlng"
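If the position of the match is needed for further analysis, it can be read from the match data itself. A small sketch, assuming the same example string as above; the names x, start, len, matched and after are only illustrative:

```r
x <- "text strlng wrong"
m <- aregexec("string", x)

# aregexec returns a list with one element per input string; the first
# entry is the start of the match, its length is kept as an attribute.
start <- m[[1]][1]
len <- attr(m[[1]], "match.length")[1]

# Rebuild the matched substring and the text that follows it.
matched <- substring(x, start, start + len - 1)
after <- substring(x, start + len)
```

Here matched is the same substring that regmatches returns, and after holds everything behind the match.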

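This can be wrapped in a function that accepts the arguments of both aregexec and regmatches. Note that the regmatches argument invert comes after the dots argument ..., so it must be passed by name.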

aregextract <- function(pattern, text, ..., invert = FALSE){
  m <- aregexec(pattern, text, ...)
  regmatches(text, m, invert = invert)
}

aregextract(target, text)
#[[1]]
#[1] "target strlng"

aregextract(target, text, invert = TRUE)
#[[1]]
#[1] "the "
#[2] ": Butter. this text i dont want to extract."
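The wrapper can also be used for the follow-up step from the question. A rough sketch, assuming the wrapper and data shown above and that the word of interest is the first run of letters after the match; rest, word and strict are illustrative names, and max.distance is the aregexec argument forwarded through the dots:

```r
aregextract <- function(pattern, text, ..., invert = FALSE) {
  m <- aregexec(pattern, text, ...)
  regmatches(text, m, invert = invert)
}

target <- "target string"
text <- "the target strlng: Butter. this text i dont want to extract."

# Extra arguments travel through `...` to aregexec, e.g. a tighter
# bound on the number of edits allowed.
strict <- aregextract(target, text, max.distance = 1)

# invert = TRUE returns the pieces around the match; the second piece
# is the text that follows the approximate match.
rest <- aregextract(target, text, invert = TRUE)[[1]]

# Take the first run of letters in that remainder.
word <- regmatches(rest[2], regexpr("[[:alpha:]]+", rest[2]))
word
#[1] "Butter"
```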
