
Remove special characters from a corpus


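library(tm)
library(stringr)

newpapers1 <- tm_map(newpapers, removePunctuation)

# Attempt to strip the two leftover characters
punremove <- function(x){gsub(c('¡'|'¯'),"",x)}
punremove1 <- lapply(newpapers1, punremove)

# Count the punctuation characters still present after removePunctuation
my.check.func <- function(x){str_extract_all(x, "[[:punct:]]")}
my.check1 <- lapply(newpapers1, my.check.func)
p <- as.data.frame(table(unlist(my.check1)))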


  Var1 Freq
1    ¡   25


Edit: After checking the file, the punctuation is still there:

> newpapers1[[24]]$content

"This study employs a crosscultural perspective to examine how local audiences perceive and enjoy foreign dramas and how this psychological process differs depending on the cultural distance between the media and the viewing audience Using a convenience sample of young Korean college students this study as predicted by cultural discount theory shows that cultural distance decreases Korean audiences¡¯ perceived identification with dramatic characters which erodes their enjoyment of foreign dramas Unlike cultural discount theory however cultural distance arouses Korean audiences¡¯ perception of novelty which heightens their enjoyment of foreign dramas This study discusses the theoretical and practical implications of these findings as well as their potential limitations"

You can use gsub to remove the punctuation, like this.

newpapers1 <- tm_map(newpapers, removePunctuation)

# Strip any run of punctuation that removePunctuation left behind
my.check.func <- function(x){gsub('[[:punct:]]+','',x)}
my.check1 <- lapply(newpapers1, my.check.func)
p <- as.data.frame(table(unlist(my.check1)))
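
One caveat: what `[[:punct:]]` matches can depend on the locale and the regex engine, and the check in the question only reported `¡`, so `¯` may not count as punctuation at all. If that is the case, a possible workaround is to name the two characters explicitly by code point. The following is only a minimal sketch under that assumption: `strip_leftovers` and `sample_text` are made-up names for illustration, and the sample string is a shortened piece of the document shown in the question.

```r
# Hypothetical helper (not from the original post): removes U+00A1 (¡) and
# U+00AF (¯) by code point, so it does not rely on [[:punct:]].
strip_leftovers <- function(x) {
  gsub("[\u00a1\u00af]", "", x)
}

# Shortened sample taken from the document in the question
sample_text <- "Korean audiences\u00a1\u00af perceived identification"
cleaned <- strip_leftovers(sample_text)
print(cleaned)

# On a tm corpus, the same function would be applied via content_transformer(),
# which keeps the result a corpus instead of a plain list:
# newpapers2 <- tm_map(newpapers1, content_transformer(strip_leftovers))
```

Going through `tm_map` with `content_transformer()` rather than `lapply` means the cleaned documents stay inside the corpus, so later tm steps can keep working on them.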
