Remove Stop Words Using tm package (Gsub Error)

Question

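I am trying to remove a list of stop words that I created from my corpus. I am not sure what is going wrong, because I have already removed all special characters from the stop word list and finished the text cleaning of the corpus. Any help would be much appreciated. The code and error messages are below. The csv with the user-defined stop words is listed here: Stop Words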

library(tm)
library(stringr)  # provides str_replace_all(), used for the stop word cleanup

myCorpus <- Corpus(VectorSource(c("blank", "blank", "blank", "blank", "blank", "blank", "blank",
"blank", "blank", "blank", "blank", "blank", "blank", "<br />Key skills:<br />Octopus Deploy, MS Build, PowerShell, Azure, NuGet, CI / CD concepts, release management<br /><br /> * Minimum 5 years plus relevant experience in Application Development lifecycle, Automation and Release and Configuration Management<br /> * Considerable experience in the following disciplines - TFS (Team Foundation Server), DevOps, Continuous Delivery, Release Engineering, Application Architect, Database Architect, Information Modeling, Service Oriented Architecture (SOA), Quality Assurance, Branch Management, Network setup and troubleshooting, Server setup, configuration, maintenance and patching<br /> * Solid understanding of Software Development Life Cycle, Test Driven Development, Continuous Integration and Continuous Delivery<br /> * Solid understanding and experience working with high availability and high performance, multi-data center systems and hybrid cloud environments.<br /> * Proficient with Agile methodologies and working closely within small teams and vendors<br /> * Knowledge of Deployment and configuration automation platforms<br /> * Extensive PowerShell experience<br /> * Extensive knowledge of Windows based systems including hardware, software and .NET applications<br /> * Strong ability to troubleshoot complex issues ranging from system resources to application stack traces<br /><br />REQUIRED SKILLS:<br />Bachelor's degree & 5-10 years of relevant work experience.", 
    "blank")))

for (j in seq(myCorpus)) {
  myCorpus[[j]] <- gsub("<.*>", " ", myCorpus[[j]])
  myCorpus[[j]] <- gsub("\\b[[:alnum:]]{20,}\\b", " ", myCorpus[[j]], perl = TRUE)
  myCorpus[[j]] <- gsub("[[:punct:]]", " ", myCorpus[[j]])
}

#Clean Corpus
myCorpus <- tm_map(myCorpus, PlainTextDocument)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, stripWhitespace)

#User defined stop words
manualStopwords <- read.csv("r_stop.csv", header = TRUE)
myStopwords <- paste(manualStopwords[,1])
myStopwords <- str_replace_all(myStopwords, "[[:punct:]]", "")
myStopwords <- gsub("\\+", "plus", myStopwords)
myStopwords <- gsub("\\$", "dollars", myStopwords)

myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

First error

Error in gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), : invalid regular expression '(*UCP)\b(zimmermann|yrs|yr|youve|.....the rest of the Stop Words

Additional error

In addition: Warning message: In gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), : PCRE pattern compilation error 'regular expression is too large' at ''
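
For context, the call shown in the error message joins every stop word into a single alternation pattern, so the pattern's length grows with the size of the list. The following is only a rough base-R sketch of that growth; the `words` vector is a made-up stand-in for the CSV, and the template string is copied from the error message rather than from any documented interface.

```r
# Hypothetical stop word list standing in for the asker's CSV
words <- sprintf("word%05d", 1:20000)

# Same template that appears in the error message: one large alternation
pattern <- sprintf("(*UCP)\\b(%s)\\b",
                   paste(sort(words, decreasing = TRUE), collapse = "|"))

# Length of the pattern built from the whole list
nchar(pattern)

# The same template applied to only the first 500 words
small <- sprintf("(*UCP)\\b(%s)\\b",
                 paste(sort(words[1:500], decreasing = TRUE), collapse = "|"))
nchar(small)
```

The sketch only builds the pattern strings and never compiles them, so it does not reproduce the error itself; the size at which PCRE refuses a pattern depends on the installation.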

Answer 1

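I was able to split my stop words into smaller chunks, and the code then ran. It was possibly a memory issue; the warning above suggests that the single pattern built from the full list was too large for PCRE to compile.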

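chunk <- 500                                       # stop words per removeWords() call
n <- length(myStopwords)
r <- rep(1:ceiling(n / chunk), each = chunk)[1:n]  # chunk index for each stop word
d <- split(myStopwords, r)                         # list of chunks, at most 500 words each

# Remove one chunk at a time so that each generated pattern stays small
for (i in seq_along(d)) {
  myCorpus <- tm_map(myCorpus, removeWords, d[[i]])
}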
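
The same chunking idea can be tried on a plain character vector, which may help when checking it outside of a corpus. This is a minimal sketch under several assumptions: `remove_words_chunked` is a hypothetical helper and not part of tm, it reuses the pattern template visible in the error message, and it assumes the stop words contain no regex metacharacters.

```r
# Hypothetical helper (not part of tm): removes stop words from a character
# vector chunk by chunk, building one modest-sized pattern per chunk.
remove_words_chunked <- function(x, words, chunk_size = 500) {
  words <- unique(words[nzchar(words)])   # drop empty strings and duplicates
  if (length(words) == 0) return(x)

  # Chunk index for each word: 1, 1, ..., 2, 2, ...
  groups <- ceiling(seq_along(words) / chunk_size)

  for (chunk in split(words, groups)) {
    # Template copied from the error message; assumes plain-text words
    pattern <- sprintf("(*UCP)\\b(%s)\\b",
                       paste(sort(chunk, decreasing = TRUE), collapse = "|"))
    x <- gsub(pattern, "", x, perl = TRUE)
  }
  x
}

# Usage on a toy input, with a tiny chunk size to force two passes
remove_words_chunked(c("the quick brown fox", "a lazy dog"),
                     words = c("the", "a", "lazy"), chunk_size = 2)
```

Like the loop above, this leaves the whitespace around removed words in place, so a whitespace-stripping step afterwards may still be wanted.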

