Remove special characters from a corpus

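I built a data frame that lists every term containing punctuation along with its frequency. Next I want to remove the punctuation from the corpus and check whether any punctuation is left.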

newpapers1 <- tm_map(newpapers, removePunctuation)

punremove <- function(x){gsub(c('¡'|'¯'),"",x)}
punremove1 <- lapply(newpapers1, punremove)
my.check.func <- function(x){str_extract_all(x, "[[:punct:]]")}
my.check1 <- lapply(newpapers1, my.check.func)
p <- as.data.frame(table(unlist(my.check1)))
p

But I still get this special character:

  Var1 Freq
1    ¡   25

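Is there a way to write a function that removes all punctuation at once, or a function that removes just this character?

Edit: After checking the documents, the punctuation is still there: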

> newpapers1[[24]]$content

"This study employs a crosscultural perspective to examine how local audiences perceive and enjoy foreign dramas and how this psychological process differs depending on the cultural distance between the media and the viewing audience Using a convenience sample of young Korean college students this study as predicted by cultural discount theory shows that cultural distance decreases Korean audiences¡¯ perceived identification with dramatic characters which erodes their enjoyment of foreign dramas Unlike cultural discount theory however cultural distance arouses Korean audiences¡¯ perception of novelty which heightens their enjoyment of foreign dramas This study discusses the theoretical and practical implications of these findings as well as their potential limitations"


You can remove the punctuation with gsub, like this.

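library(tm)       # tm_map, removePunctuation
library(stringr)  # str_extract_all

newpapers1 <- tm_map(newpapers, removePunctuation)

# Strip any punctuation that is still present in each document
my.clean.func <- function(x){gsub('[[:punct:]]+','',x)}
my.clean1 <- lapply(newpapers1, my.clean.func)
# Tabulate whatever punctuation remains after cleaning
p <- as.data.frame(table(unlist(str_extract_all(unlist(my.clean1), "[[:punct:]]"))))
p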
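
If `[[:punct:]]` still leaves the `¡` and `¯` characters behind, one possible fallback is to name them explicitly in a bracket expression. In base R the meaning of `[[:punct:]]` depends on the locale, so it may not cover these two characters on every system. The `¡¯` pair may also be a mis-decoded apostrophe, but that is only a guess from the sample text. The sketch below uses a made-up sample vector `docs` and an illustrative helper named `remove_stray` (neither is part of tm), and writes the characters as Unicode escapes so that the script does not depend on the file encoding.

```r
# Hypothetical sample text; \u00a1 and \u00af are the escapes for the two stray characters.
docs <- c("Korean audiences\u00a1\u00af perceived identification",
          "no stray characters here")

# remove_stray is an illustrative helper (an assumption, not a tm function).
# The bracket expression matches either character wherever it occurs.
remove_stray <- function(x) gsub("[\u00a1\u00af]", "", x)

cleaned <- remove_stray(docs)

# Collect any remaining occurrences; an empty result means none are left.
remaining <- unlist(regmatches(cleaned, gregexpr("[\u00a1\u00af]", cleaned)))
print(cleaned)
print(length(remaining))
```

To apply the same helper to a tm corpus, it should be possible to wrap it with `content_transformer`, for example `tm_map(newpapers1, content_transformer(remove_stray))`, assuming `newpapers1` is the corpus from the question.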

Hope this helps.