Using find and replace for multiple string criteria in R
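I recently ran a survey and left the nationality field as an open text box (an obvious mistake). Now that I have the results, I am left with several strings that mean the same thing, and I would like to know whether there is a function that lets me search and replace using some loose criteria. For example, I had a lot of French participants and got answers like francaise, france, french, or france from x territory. Is there an R function that would let me do the following (matching on only part of the string):
If data$nationality contains 'franc' or 'frenc', convert it to 'france'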
gsub can do this:
df <- data.frame(strings = c("France", "Francais", "French"), stringsAsFactors = FALSE)
df$New_Strings <- gsub("Francais|French", "France", df$strings)
The | operator acts like 'or', so you can chain as many alternatives into the pattern as you need.
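Note that gsub replaces only the matched portion of each string, so an answer such as "france from x territory" would keep its trailing words. A minimal sketch of the rule as stated in the question (contains 'franc' or 'frenc', in any case, then recode the whole value) could use grepl with ignore.case = TRUE instead. The sample data frame below is hypothetical; only the column name nationality comes from the question.

```r
# Hypothetical sample data; `nationality` mirrors the column named in the question.
data <- data.frame(
  nationality = c("francaise", "France", "french",
                  "france from x territory", "German"),
  stringsAsFactors = FALSE
)

# TRUE wherever the answer contains "franc" or "frenc", in any letter case.
is_french <- grepl("franc|frenc", data$nationality, ignore.case = TRUE)

# Overwrite the whole answer (not just the matched part) with one label.
data$nationality[is_french] <- "france"

print(data$nationality)
```

Separating the match (grepl) from the assignment also makes it easy to inspect which rows will be changed before overwriting anything.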
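You can use a similarity measure to compute how close each string is to the target strings "Franc" and "Frenc", then decide what to keep based on a threshold.
I will use the package stringdist.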
library(stringdist)
x <- scan(what = character(), text = '
I got a lot of french participants and got answers like francaise, france, french or france from x territory, as an example. Is there any R function that would let me do the following
')
pattern <- c('Franc', 'Frenc')
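Now sapply the function stringsim over each pattern, using two different measures, "soundex" and "jw" (Jaro-Winkler). For each word, keep the higher of its two similarity scores.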
sim1 <- sapply(pattern, stringsim, x, method = 'soundex')
sim1 <- apply(sim1, 1, max)
sim2 <- sapply(pattern, stringsim, x, method = 'jw')
sim2 <- apply(sim2, 1, max)
Decide what to keep.
thresh <- 0.75
x[sim1 >= thresh]
#[1] "french" "francaise," "france," "french" "france"
x[sim2 >= thresh]
#[1] "french" "france," "french" "france"
The threshold can be lowered.
thresh <- 0.70
x[sim1 >= thresh]
#[1] "french" "francaise," "france," "french" "france"
x[sim2 >= thresh]
#[1] "french" "francaise," "france," "french" "france"
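The filtering above only shows which words pass the threshold. To apply the same idea to the survey column itself, one possible sketch follows. It assumes the stringdist package is installed; the nationality vector is hypothetical sample data, and lowercasing both the answers and the targets is an added step (not in the answer above) so that letter case does not affect the Jaro-Winkler score.

```r
library(stringdist)

# Hypothetical survey answers, for illustration only.
nationality <- c("French", "francaise", "France", "german")
targets <- c("franc", "frenc")

# Compare in lower case so "French" and "french" score the same.
cleaned <- tolower(nationality)

# One column of similarity scores per target, one row per answer.
sim <- sapply(targets, function(p) stringsim(p, cleaned, method = "jw"))

# Best score of each answer against any target.
best <- apply(sim, 1, max)

# Recode answers at or above the threshold; leave the rest untouched.
thresh <- 0.75
recoded <- ifelse(best >= thresh, "france", nationality)
print(data.frame(nationality, best, recoded))
```

Printing the scores next to the recoded values makes it easier to choose a threshold by eye. Long free-text answers tend to score lower against short targets, so borderline cases are worth reviewing manually.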