Use mapply to replace patterns in a vector with replacements in a vector in tm

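Hello: I am using the tm package for some text analysis, and I need to substitute a vector of patterns with the pairwise replacements from a replacement vector. The pattern/replacement dictionary looks like this.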

# pattern-replacement dictionary (kept as character, not factor)
df <- data.frame(replace = c('crude', 'oil', 'price'),
                 with = c('xcrude', 'xoil', 'xprice'),
                 stringsAsFactors = FALSE)
# load tm
library(tm)
# load the crude corpus
data('crude')

I tried this, but got an error message:

tm_map(crude, mapply, gsub, df$replace, df$with)

Warning message:
In mclapply(content(x), FUN, ...) :
all scheduled cores encountered errors in user code

Based on this answer, you can use stringi and wrap the call in content_transformer() to preserve the corpus structure:

library(stringi)

corp <- tm_map(crude, content_transformer(
  function(x) {
    # apply every pattern/replacement pair in turn to each document
    stri_replace_all_fixed(x, df$replace, df$with, vectorize_all = FALSE)
  })
)
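To see what vectorize_all = FALSE changes, here is a minimal sketch on a plain string rather than a corpus. The sample sentence and the variable names (x, patterns, replacements, all_in_one, one_per_pattern) are made up for illustration and are not part of the crude data.

```r
library(stringi)

# Hypothetical sample input, not taken from the crude corpus
x <- "crude oil price"
patterns <- c("crude", "oil", "price")
replacements <- c("xcrude", "xoil", "xprice")

# vectorize_all = FALSE: each pattern/replacement pair is applied in turn
# to the same string, so a single string comes back
all_in_one <- stri_replace_all_fixed(x, patterns, replacements,
                                     vectorize_all = FALSE)
print(all_in_one)

# vectorize_all = TRUE (the default): x is recycled against the patterns,
# so one result per pattern comes back, each with only one pair applied
one_per_pattern <- stri_replace_all_fixed(x, patterns, replacements)
print(one_per_pattern)
```

The first call returns one fully substituted string, which is what each document needs inside content_transformer(); the default would return three partially substituted copies instead.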

Or use multigsub from qdap:

library(qdap)

corp <- tm_map(crude, content_transformer(
  function(x) {
    # x is matched positionally to multigsub's text argument
    multigsub(df$replace, df$with, fixed = FALSE, x)
  })
)

Which gives:

> corp[[1]][1]

"Diamond Shamrock Corp said that\neffective today it had cut its contract xprices for xcrude xoil by\n1.50 dlrs a barrel.\n The reduction brings its posted xprice for West Texas\nIntermediate to 16.00 dlrs a barrel, the copany said.\n \"The xprice reduction today was made in the light of falling\nxoil product xprices and a weak xcrude xoil market,\" a company\nspokeswoman said.\n
Diamond is the latest in a line of U.S. xoil companies that\nhave cut its contract, or posted, xprices over the last two days\nciting weak xoil markets.\n Reuter"

You can then apply other tm functions to the resulting corpus:

> DocumentTermMatrix(corp)
#<<DocumentTermMatrix (documents: 20, terms: 1269)>>
#Non-/sparse entries: 2262/23118
#Sparsity           : 91%
#Maximal term length: 17
#Weighting          : term frequency (tf)