Group similar strings and change their values to something common while retaining the individual rows

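I have receipt data with item descriptions, but some of the descriptions are very similar. I would like to encode those similar items with the same value, to increase the chance of finding associations in the data. For example: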

Strawberries
Premium Strawberries
Premium Strawberries 
Hass Avocado
Mini Avocado

I would like:

Strawberries
Strawberries
Strawberries
Avocado
Avocado

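Something to that effect, but I am open to suggestions for sure. All I can think of is that some kind of fuzzy search might be what I need, I just don't know how to implement it.

Thanks again!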

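One possible way is string distances. Be careful, though: they carry no meaning, they only measure how similar the actual strings are. The example below can work as a heuristic, but pay attention to the last item ("Not Avocado" ends up grouped with "Avocado"). The higher the threshold passed to cutree, the fewer groups you get, and the more likely some items are misclassified. A lower threshold is stricter, so you may miss good matches: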

th <- 0.35  # cut height, between 0 and 1
roles <- c("Strawberies","strawberries","Mini strawberries","Avocado","Hass avocado","Not Avocado")
# Pairwise Jaro-Winkler distances (0 = identical, 1 = completely different)
mat <- stringdist::stringdistmatrix(roles, roles, method = "jw", p = 0.025, nthread = parallel::detectCores())
colnames(mat) <- roles
rownames(mat) <- roles
# Single-linkage clustering, then cut the tree at height th
tree <- hclust(as.dist(mat), method = "single")
memb <- cutree(tree, h = th)
df <- data.frame(a = roles, b = memb, stringsAsFactors = FALSE)
# Label every group with its first member
df$to <- plyr::mapvalues(df$b, from = 1:length(unique(memb)), to = df$a[!duplicated(df$b)])

prior <- data.frame(str = roles, to = df$to, stringsAsFactors = FALSE)
prior
                str          to
1       Strawberies Strawberies
2      strawberries Strawberies
3 Mini strawberries Strawberies
4           Avocado     Avocado
5      Hass avocado     Avocado
6       Not Avocado     Avocado
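
To get a feel for the threshold before settling on one, a rough sketch like the following counts the groups at several cut heights. It assumes the stringdist package is installed; the candidate thresholds and the names tree, thresholds and n_groups are only illustrative, not part of the approach above.

```r
library(stringdist)

roles <- c("Strawberies", "strawberries", "Mini strawberries",
           "Avocado", "Hass avocado", "Not Avocado")

# Same distance and clustering setup as in the example above
mat  <- stringdistmatrix(roles, roles, method = "jw", p = 0.025)
tree <- hclust(as.dist(mat), method = "single")

# Illustrative cut heights (assumption: pick values that suit your data)
thresholds <- c(0.05, 0.2, 0.35, 0.6)

# Number of distinct groups obtained at each cut height
n_groups <- sapply(thresholds, function(h) length(unique(cutree(tree, h = h))))

data.frame(threshold = thresholds, groups = n_groups)
```

The group count can only stay the same or drop as the threshold rises, so a table like this shows roughly where groups start to merge. Since these distances are case-sensitive, it may also be worth passing tolower(roles) instead of roles.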

Suppose your receipts are in a data frame df with a column name, i.e.

df <- data.frame(name = c("Strawberries",
"Premium Strawberries",
"Premium Strawberries",
"Hass Avocado",
"Mini Avocado"), stringsAsFactors = FALSE)

then perhaps you can get there by keeping only the last word of each description:

res <- gsub(".*\\s(\\w)", "\\1", df$name)

which yields

> res
[1] "Strawberries" "Strawberries" "Strawberries" "Avocado"     
[5] "Avocado"
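
Since the question asks to keep the individual rows, here is a minimal sketch of the same last-word idea that writes the result to a new column instead of a separate vector. The column name item is hypothetical, and the trailing space in the third value is added on purpose to show why trimming first may help. It assumes the common name is always the last word of the description, which will not hold for every receipt.

```r
# Example receipts; the third entry has a trailing space on purpose
df <- data.frame(name = c("Strawberries",
                          "Premium Strawberries",
                          "Premium Strawberries ",
                          "Hass Avocado",
                          "Mini Avocado"),
                 stringsAsFactors = FALSE)

# Trim surrounding whitespace, then drop everything up to the last
# whitespace character, which leaves only the final word
df$item <- sub(".*\\s", "", trimws(df$name))

df
```

Every original row is still there, with the original description in name and the shared label in item.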