Using find and replace for multiple string criteria in R

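I recently ran a survey and left the nationality field as an open text box (an obvious mistake). Now that I have the results, I am left with several strings that mean the same thing, and I wonder whether there is a function that would let me search and replace using some loose criteria. For example, I have a lot of French participants and got answers like francaise, france, french or france from x territory. Is there any R function that would let me do the following (matching only part of the string):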

if data$nationality contains 'franc' or 'frenc', convert to 'france'

gsub can do this:

df <- data.frame(strings = c("France", "Francais", "French"), stringsAsFactors = FALSE)

# Replace every occurrence of "Francais" or "French" with "France"
df$New_Strings <- gsub("Francais|French", "France", df$strings)

The | operator acts like 'or', so you can chain in as many alternatives as you need.
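Note that gsub only rewrites the matched part of each string, so an answer such as "france from x territory" would keep its trailing text. A minimal sketch of the "contains" rule from the question, using base R's grepl to flag matching rows and then overwriting the whole value, could look like this. The sample data frame is hypothetical and only stands in for the real survey data.

```r
# Hypothetical sample data standing in for the survey results
data <- data.frame(
  nationality = c("francaise", "France", "french",
                  "france from x territory", "german"),
  stringsAsFactors = FALSE
)

# TRUE wherever the answer contains "franc" or "frenc", ignoring case
is_french <- grepl("franc|frenc", data$nationality, ignore.case = TRUE)

# Overwrite the whole answer, not just the matched part
data$nationality[is_french] <- "france"
```

Answers that do not match the pattern are left unchanged, so the same step can be repeated with a different pattern for each country.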

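You can use a similarity measure to compute how close each string is to the target strings "Franc" and "Frenc", then decide what to keep based on a threshold.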

I will use the stringdist package.

library(stringdist)

x <- scan(what = character(), text = '
I got a lot of french participants and got answers like francaise, france, french or france from x territory, as an example. Is there any R function that would let me do the following
')
pattern <- c('Franc', 'Frenc')

Now sapply the function stringsim over each pattern, using two different measures, "soundex" and "jw".

sim1 <- sapply(pattern, stringsim, x, method = 'soundex')
sim1 <- apply(sim1, 1, max)

sim2 <- sapply(pattern, stringsim, x, method = 'jw')
sim2 <- apply(sim2, 1, max)

Decide what to keep.

thresh <- 0.75

x[sim1 >= thresh]
#[1] "french"     "francaise," "france,"    "french"     "france"

x[sim2 >= thresh]
#[1] "french"  "france," "french"  "france"

The threshold can be lowered.

thresh <- 0.70

x[sim1 >= thresh]
#[1] "french"     "francaise," "france,"    "french"     "france"

x[sim2 >= thresh]
#[1] "french"     "francaise," "france,"    "french"     "france"
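The steps above only select the matching words. As a rough sketch of how the same idea might be used to actually recode a column, the snippet below compares each answer with two target spellings and replaces those above the threshold. The answers vector, the targets, and the 0.75 cutoff are assumptions chosen for illustration, and the answers are lower-cased first because the "jw" measure is case sensitive.

```r
library(stringdist)

# Hypothetical survey answers and target spellings
answers <- c("francaise", "France", "french", "german")
targets <- c("france", "french")

# One row per answer, one column per target
sim <- sapply(targets, function(t) stringsim(tolower(answers), t, method = "jw"))

# Best similarity of each answer to any target
best <- apply(sim, 1, max)

# Recode answers that are close enough; leave the rest as they are
thresh <- 0.75
cleaned <- ifelse(best >= thresh, "france", answers)
```

The threshold needs checking against the real data: set it too low and unrelated nationalities get recoded as well.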