Position of Approximate Substring Matches in R

I am using R for string processing. I have a data frame with a column of strings, for example:

 df <- data.frame(textcol=c("I would like to find the position of this substring",
 "I would also like to find the position of thes substring",
 "No match here","No mention of this substrangy thing"),
 stringsAsFactors=FALSE)

 matchPattern <- "this substring"

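I am looking for a function (depending on some distance parameter, e.g. Jaro-Winkler) that takes my matchPattern, compares it against each row of the data frame's text column, and returns the exact position of the match within the matched string, i.e. 36 for the first element (unless I miscounted), (probably) 43 for the second, NA for the third, and 14(?) for the fourth.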

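You can use aregexec. It does approximate matching based on generalized Levenshtein (edit) distance rather than Jaro-Winkler. A fractional max.distance is taken relative to the pattern length and rounded up, so 0.1 with this 14-character pattern allows up to 2 edits.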

## Get positions (-1 instead of NA)
positions <- aregexec(matchPattern, df$textcol, max.distance = 0.1)
unlist(positions)
# [1] 38 43 -1 15

## Extract matches
regmatches(df$textcol, positions)
# [[1]]
# [1] "this substring"
# 
# [[2]]
# [1] "thes substring"
# 
# [[3]]
# character(0)
# 
# [[4]]
# [1] "this substrang"
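
The question asked for NA rather than -1 for rows without a match, so the result may need a small post-processing step. The following is a minimal sketch, not part of the original answer: it uses a self-contained copy of the example data, and the names starts and lens are chosen here for illustration. It relies on the fact that each list element returned by aregexec carries a "match.length" attribute.

```r
# Self-contained copy of the example data
df <- data.frame(textcol = c("I would like to find the position of this substring",
                             "I would also like to find the position of thes substring",
                             "No match here",
                             "No mention of this substrangy thing"),
                 stringsAsFactors = FALSE)
matchPattern <- "this substring"

positions <- aregexec(matchPattern, df$textcol, max.distance = 0.1)

# The first element of each list entry is the start of the overall match
starts <- vapply(positions, function(p) as.integer(p[1]), integer(1))
# The length of the match is stored as an attribute on each entry
lens <- vapply(positions,
               function(p) as.integer(attr(p, "match.length")[1]),
               integer(1))

# Report NA instead of -1 for rows without a match
no_match <- starts == -1L
starts[no_match] <- NA_integer_
lens[no_match] <- NA_integer_

starts
# [1] 38 43 NA 15
```

Keeping lens alongside starts gives the end of each match as starts + lens - 1, which is handy for substr().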

Edit

## One possibility for replacing matches (or maybe use `regmatches<-`)
res <- regmatches(df$textcol, positions)
res[lengths(res)==0] <- "XXXX"  # deal with 0 length matches somehow
df$out <- Vectorize(gsub)(unlist(res), "Censored", df$textcol)
df$out
# [1] "I would like to find the position of Censored"     
# [2] "I would also like to find the position of Censored"
# [3] "No match here"                                     
# [4] "No mention of Censoredy thing"
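
Note that gsub treats each extracted match as a regular expression and replaces every occurrence of it, and that approach needs a placeholder for rows without a match. As an alternative, the replacement can be spliced in by position using the start and match.length values that aregexec already returns. This is only a sketch under those assumptions: censor_at is a hypothetical helper name, and the data is a self-contained copy of the example.

```r
# Self-contained copy of the example data
df <- data.frame(textcol = c("I would like to find the position of this substring",
                             "I would also like to find the position of thes substring",
                             "No match here",
                             "No mention of this substrangy thing"),
                 stringsAsFactors = FALSE)
matchPattern <- "this substring"

positions <- aregexec(matchPattern, df$textcol, max.distance = 0.1)

# Hypothetical helper: replace the matched span of one string,
# or return the string unchanged when there is no match
censor_at <- function(x, pos, replacement = "Censored") {
  start <- pos[1]
  len <- attr(pos, "match.length")[1]
  if (start == -1L) return(x)               # no match: leave as is
  paste0(substr(x, 1, start - 1),           # text before the match
         replacement,
         substr(x, start + len, nchar(x)))  # text after the match
}

# Pair each string with its own match data; unname() drops the names mapply adds
df$out <- unname(mapply(censor_at, df$textcol, positions))
df$out
```

Because only the matched span is touched, a row without a match passes through unchanged and no "XXXX" placeholder is needed.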