Use mapply to replace patterns in a vector with replacements in a vector in tm
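Hello: I am using the tm package for some text analysis, and I need to substitute the terms in one vector with their paired replacements from another vector. The pattern/replacement dictionary looks like this: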
# pattern-replacement dictionary
df <- data.frame(replace = c('crude', 'oil', 'price'),
                 with = c('xcrude', 'xoil', 'xprice'),
                 stringsAsFactors = FALSE)  # keep both columns as character, not factor
# load tm
library(tm)
# load the crude corpus
data('crude')
I tried the following, but I get this message:
tm_map(crude, mapply, gsub, df$replace, df$with)
Warning message:
In mclapply(content(x), FUN, ...) :
all scheduled cores encountered errors in user code
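A likely reason the call fails: tm_map() hands each document to the supplied function as its first argument, so the document lands in the FUN slot of mapply() instead of being treated as text. The pairwise replacement itself can be sketched in base R without tm. The names replace_pairs and txt below are hypothetical and used only for illustration, and fixed = TRUE assumes the patterns are literal strings rather than regular expressions.

```r
# Pattern/replacement dictionary, kept as character columns
dict <- data.frame(replace = c("crude", "oil", "price"),
                   with = c("xcrude", "xoil", "xprice"),
                   stringsAsFactors = FALSE)

# Apply every pattern/replacement pair, in order, to each element of `text`
replace_pairs <- function(text, patterns, replacements) {
  for (i in seq_along(patterns)) {
    text <- gsub(patterns[i], replacements[i], text, fixed = TRUE)
  }
  text
}

# Hypothetical sample text
txt <- c("crude oil price", "the price of oil")
result <- replace_pairs(txt, dict$replace, dict$with)
print(result)
```

Because the pairs are applied one after another, their order matters whenever a replacement could itself match a later pattern.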
Based on this answer, you can use stringi and wrap it in content_transformer() to preserve the corpus structure:
library(stringi)
# vectorize_all = FALSE applies every pattern/replacement pair to each document
corp <- tm_map(crude, content_transformer(
  function(x) {
    stri_replace_all_fixed(x, df$replace, df$with, vectorize_all = FALSE)
  })
)
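To see what vectorize_all changes, here is a small sketch on a plain character vector, outside of any corpus. The sample vector txt and the variable names are made up for illustration, and the stringi package is assumed to be installed.

```r
library(stringi)

patterns <- c("crude", "oil", "price")
replacements <- c("xcrude", "xoil", "xprice")
txt <- rep("crude oil price", 3)  # hypothetical sample text

# Default (vectorize_all = TRUE): the i-th pattern is applied only to the i-th string
elementwise <- stri_replace_all_fixed(txt, patterns, replacements)

# vectorize_all = FALSE: every pattern is applied, in order, to every string
sequential <- stri_replace_all_fixed(txt, patterns, replacements,
                                     vectorize_all = FALSE)

print(elementwise)
print(sequential)
```

With the default, only one pair is applied to each string, which is why the dictionary-style replacement needs vectorize_all = FALSE.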
Or multigsub() from qdap:
library(qdap)
corp <- tm_map(crude, content_transformer(
  function(x) {
    multigsub(df$replace, df$with, x, fixed = FALSE)
  })
)
Which gives:
> corp[[1]][1]
"Diamond Shamrock Corp said that\neffective today it had cut its contract xprices for xcrude xoil by\n1.50 dlrs a barrel.\n The reduction brings its posted xprice for West Texas\nIntermediate to 16.00 dlrs a barrel, the copany said.\n \"The xprice reduction today was made in the light of falling\nxoil product xprices and a weak xcrude xoil market,\" a company\nspokeswoman said.\n
Diamond is the latest in a line of U.S. xoil companies that\nhave cut its contract, or posted, xprices over the last two days\nciting weak xoil markets.\n Reuter"
You can then apply other tm functions to the resulting corpus:
> DocumentTermMatrix(corp)
#<<DocumentTermMatrix (documents: 20, terms: 1269)>>
#Non-/sparse entries: 2262/23118
#Sparsity : 91%
#Maximal term length: 17
#Weighting : term frequency (tf)
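As a quick sanity check, one could confirm that the replaced words actually show up as terms in the matrix. The sketch below is self-contained and swaps in a base R gsub() loop for the stringi/qdap replacers, purely to keep dependencies down. The helper replace_pairs and the data frame name dict are hypothetical, and literal (fixed) matching is an assumption.

```r
library(tm)
data("crude")

dict <- data.frame(replace = c("crude", "oil", "price"),
                   with = c("xcrude", "xoil", "xprice"),
                   stringsAsFactors = FALSE)

# Base R stand-in for the replacers used above (literal matching assumed)
replace_pairs <- function(x) {
  for (i in seq_len(nrow(dict))) {
    x <- gsub(dict$replace[i], dict$with[i], x, fixed = TRUE)
  }
  x
}

# content_transformer() keeps each document's metadata and class intact
corp <- tm_map(crude, content_transformer(replace_pairs))
dtm <- DocumentTermMatrix(corp)

# Which dictionary replacements appear as terms in the matrix?
found <- dict$with %in% Terms(dtm)
print(found)
```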