R dictionary: create a many-to-one mapping

Question

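Consider the following MWE in a text-mining exercise using R{tm}: Toyota has several SUV models in the US. models<-c("highlander","land cruiser","rav4","sequoia","4runner"). The general media refers to these not as "toyota rav4" (the corpus has already been converted to lower case) but simply as "rav4". To get a single column for Toyota SUVs in the DocumentTermMatrix, I need to convert all of these model names to one generic term, "toyota_suv". What I am doing right now is repeating mycorpus<-tm_map(mycorpus, gsub, pattern="rav4", replacement="toyota_suv") once per model, length(models) times. A hack would be to set model_names<-rep("toyota_suv",length(models)) and get on with life. How do I set up a dictionary with a many-to-one mapping so that all models are replaced with "toyota_suv" in one expression? Many thanks.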

Answer 1

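You can use a vectorized replacement function. The stringi package provides such functions through the stri_replace_all family. Here I use stri_replace_all_fixed, which matches the patterns as literal strings; adjust case sensitivity and other options as needed.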

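library(tm)
library(stringi)

# Model names that should all collapse into one term
toyota_suvs <- c("highlander", "land cruiser", "rav4", "sequoia", "4runner")

# toyCorp is built under "Data" below.
# vectorize_all = FALSE applies every pattern to every document,
# each time with the same single replacement.
tm_map(toyCorp, stri_replace_all_fixed,
    pattern = toyota_suvs, replacement = "toyota_suv",
    vectorize_all = FALSE)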
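
To see what vectorize_all = FALSE does independently of tm, here is a minimal sketch that calls stri_replace_all_fixed on a plain character vector. The vector docs and its contents are made up for illustration and are not part of the original answer.

```r
library(stringi)

# Illustrative input; these two strings are invented for this sketch.
docs <- c("the rav4 and the highlander", "a land cruiser or a 4runner")
toyota_suvs <- c("highlander", "land cruiser", "rav4", "sequoia", "4runner")

# With vectorize_all = FALSE, each pattern in turn is replaced in every
# string, so one call handles the whole many-to-one mapping.
replaced <- stri_replace_all_fixed(docs, toyota_suvs, "toyota_suv",
                                   vectorize_all = FALSE)
print(replaced)
```

With the default vectorize_all = TRUE, stringi would instead pair strings and patterns element by element, which is not what a many-to-one dictionary needs.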

Data:

toyExample <- c("you don't know about the rav4, John Snow",
    "the highlander is a great car",
    "I want a land cruiser")

toyCorp <- Corpus(VectorSource(toyExample))
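
As an alternative that is not part of the answer above, the same many-to-one replacement can probably be done in base R by joining the model names into a single regular expression. This is only a sketch: the variable names pattern and result are introduced here for illustration, and it assumes the model names contain no regex metacharacters.

```r
# Example documents, copied from the Data section.
toyExample <- c("you don't know about the rav4, John Snow",
                "the highlander is a great car",
                "I want a land cruiser")
models <- c("highlander", "land cruiser", "rav4", "sequoia", "4runner")

# Build one alternation pattern, e.g. \b(highlander|land cruiser|...)\b,
# so that any model name matches as a whole word.
pattern <- paste0("\\b(", paste(models, collapse = "|"), ")\\b")

# A single gsub call then maps every model to the generic term.
result <- gsub(pattern, "toyota_suv", toyExample, perl = TRUE)
print(result)
```

The word boundaries keep the pattern from matching inside longer words. If the model list ever includes characters such as "." or "+", they would need escaping before being pasted into the pattern.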

Tags: r, text-mining, tm