Extract a string of words between two specific words, but allow for mismatches in R

Question

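I have the following string.

string = c("today is Oscar")

I want to extract everything between today and Oscar, while allowing up to two mismatches/typos in the words today and Oscar.

In this case the expected result is "is", but some strings have a different word between today and Oscar. A typo can occur in any letter of the words today and Oscar.

I am currently looking at agrep. Any help or guidance is appreciated.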

Answer 1

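If I understand correctly, you want to extract the verb (i.e. the middle substring) from your vector if and only if the words to its left and right are at most 2 insertions, deletions, etc. away from the pattern today \w+ Oscar.

If that premise is correct, you can first use agrep (or agrepl) to subset the vector to the strings that meet this condition, then capture the middle substring in a capture group (...) and refer to it with the backreference \1 (written "\\1" inside an R string) in the replacement argument of sub: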

sub("\\w+ (\\w+) \\w+", "\\1", string[agrepl("today \\w+ Oscar", string, max.distance = list(all = 2), ignore.case = TRUE, fixed = FALSE)])
[1] "IS"    "drive" "goes"
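
For readers who find the one-liner dense, the same two steps can be unpacked as below. This is only a sketch: it repeats the mock data from this answer so that it runs on its own, and the names keep and middle are made up for illustration.

```r
# Mock data from this answer, repeated so the snippet is self-contained.
string <- c("today IS Oscar", "today drive car", "tody goes Oscar", "tomorrow was Oscar")

# Step 1: flag the strings that are within 2 edits of the pattern.
# fixed = FALSE makes agrepl treat the pattern as a regular expression.
keep <- agrepl("today \\w+ Oscar", string,
               max.distance = list(all = 2), ignore.case = TRUE, fixed = FALSE)

# Step 2: on the kept strings only, replace the three-word match
# with the contents of the capture group (the middle word).
middle <- sub("\\w+ (\\w+) \\w+", "\\1", string[keep])
print(middle)
```

Keeping the logical vector around also makes it easy to inspect which strings were dropped, via string[!keep].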

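Note: the all component specifies the "maximal number/fraction of all transformations (insertions, deletions and substitutions)"; alternatively, limit the individual components insertions, deletions and substitutions.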
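
To illustrate how the components interact, here is a small sketch that is not part of the original answer. The test word "todxy" and the variable names are assumptions chosen for the example; it only shows that a per-type limit can veto a match that the overall budget would allow.

```r
# "todxy" differs from "today" by exactly one substitution (a -> x).
typo <- "todxy"

# An overall budget of one edit is enough to cover that substitution.
within_total <- agrepl("today", typo, max.distance = list(all = 1))

# Same overall budget, but substitutions are forbidden. A single insertion
# or deletion cannot bridge "today" and "todxy", so no match is expected.
no_subs <- agrepl("today", typo,
                  max.distance = list(all = 1, substitutions = 0))

print(c(within_total = within_total, no_subs = no_subs))
```

Values of 1 or more are read as absolute counts, while values below 1 are read as a fraction of the pattern length, so list(all = 2) in the answer means two edits in total.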

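Mock data:

string = c("today IS Oscar", "today drive car", "tody goes Oscar", "tomorrow was Oscar")

"today IS Oscar" is an exact match, because the ignore.case argument makes case irrelevant
"today drive car" is a fuzzy match, because car is 2 deletions away from Oscar
"tody goes Oscar" is a fuzzy match, because tody is 1 deletion away from today, and
"tomorrow was Oscar" does not match at all, because tomorrow is more than 2 edits away from today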

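The distances behind these explanations can be checked with utils::adist, which computes the generalized Levenshtein distance. The sketch below also includes a hypothetical helper, extract_between, that is not from the answer: it swaps in a different technique, checking the first and last word separately (each with its own budget of 2 edits, rather than 2 edits across the whole pattern) and returning everything in between, so strings with more than one middle word are handled too. The extra test string "todya is really Oscar" is made up for illustration.

```r
# Edit distances for the words discussed above.
word_dist <- c(
  tody     = utils::adist("tody", "today")[1, 1],
  car      = utils::adist("car", "Oscar")[1, 1],
  tomorrow = utils::adist("tomorrow", "today")[1, 1]
)
print(word_dist)

# Hypothetical helper: return the words between the first and last word
# when both outer words are within max_dist edits of their targets,
# and NA otherwise.
extract_between <- function(x, left = "today", right = "Oscar", max_dist = 2) {
  vapply(strsplit(x, "\\s+"), function(words) {
    n <- length(words)
    if (n < 3) return(NA_character_)  # nothing between the outer words
    left_ok  <- utils::adist(words[1], left,  ignore.case = TRUE)[1, 1] <= max_dist
    right_ok <- utils::adist(words[n], right, ignore.case = TRUE)[1, 1] <= max_dist
    if (left_ok && right_ok) {
      paste(words[2:(n - 1)], collapse = " ")
    } else {
      NA_character_
    }
  }, character(1))
}

string <- c("today IS Oscar", "today drive car", "tody goes Oscar",
            "tomorrow was Oscar", "todya is really Oscar")
result <- extract_between(string)
print(result)
```

Unlike the agrepl approach, this assumes the two target words are always the first and last word of the string; whether a per-word or a whole-pattern budget is the right reading of "up to two mismatches" depends on the data.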