Removing punctuation marks except smileys - R, tm package

I am using the tm package in R. I want to remove all punctuation from this text except the smileys.

data <- c("conflict need resolved :<. turned conversation exchange ideas richer environment one tricky concepts :D , “conflict” always top business agendas :>. maybe different ideas/opinions different :) ")

I tried:

library(tm)
data <- gsub("[^a-z]", " ", data, ignore.case = TRUE)

but this removes all punctuation, including the smileys, and gives the output

data <- conflict need resolved turned conversation exchange ideas richer environment one tricky concepts conflict always top business agendas maybe different ideas opinions different

whereas what I need is

data <- conflict need resolved :< turned conversation exchange ideas richer environment one tricky concepts :D conflict always top business agendas :> maybe different ideas opinions different :) 

Any suggestions, please?

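I would write a dictionary of smileys, replace each smiley with a word, remove the punctuation, and then swap the words back for the smileys.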

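# Make the dictionary. The replacement strings must not already occur in the text,
# which can be checked with something like stri_match(str = data, regex = smiles$r).
# "unhappyParen" is listed before "happyParen" so that, when swapping back, the longer
# string is restored first and ":(" does not come back as "un:)".
smiles <- data.frame(s = c(":<", ":>", ":(", ":)", ";)", ":D"),
                     r = c("unhappyBracket", "happyBracket", "unhappyParen", "happyParen", "winkSmiley", "DSmiley"),
                     stringsAsFactors = FALSE)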

library(stringi)
## replace smiley with text
data <- stri_replace_all_fixed(data,pattern = smiles$s,replacement = smiles$r,vectorize_all = FALSE)
## remove punctuation
data <- gsub("[^a-z]", " ", data, ignore.case = TRUE)
## replace text-smiley with punctuation smiley
data <- stri_replace_all_fixed(data,pattern = smiles$r,replacement = smiles$s,vectorize_all = FALSE)
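
The three steps above can also be wrapped in one reusable function. The following is only a sketch in base R (no stringi needed); the function name `strip_punct_keep_smileys` and its arguments are made up for illustration and are not part of tm or stringi. It assumes the placeholder words contain letters only and do not already occur in the text, and it additionally collapses the runs of spaces left behind by the removed punctuation.

```r
# Hypothetical helper: remove everything except letters, but keep the smileys.
strip_punct_keep_smileys <- function(text, smileys, placeholders) {
  stopifnot(length(smileys) == length(placeholders))
  # Placeholders must survive the punctuation step, so letters only.
  stopifnot(!any(grepl("[^A-Za-z]", placeholders)))
  # Placeholders must not already be present in the text.
  clash <- vapply(placeholders,
                  function(p) any(grepl(p, text, fixed = TRUE)),
                  logical(1))
  if (any(clash)) stop("placeholder already occurs in the text")

  # 1. smiley -> placeholder word
  for (i in seq_along(smileys)) {
    text <- gsub(smileys[i], placeholders[i], text, fixed = TRUE)
  }
  # 2. everything that is not a letter becomes a space
  text <- gsub("[^a-z]", " ", text, ignore.case = TRUE)
  # 3. placeholder word -> smiley, longest placeholder first, so that
  #    "unhappyParen" is restored before "happyParen" can match inside it
  for (i in order(nchar(placeholders), decreasing = TRUE)) {
    text <- gsub(placeholders[i], smileys[i], text, fixed = TRUE)
  }
  # 4. tidy up the whitespace left behind by step 2
  trimws(gsub(" +", " ", text))
}

smileys      <- c(":<", ":>", ":)", ":(", ";)", ":D")
placeholders <- c("unhappyBracket", "happyBracket", "happyParen",
                  "unhappyParen", "winkSmiley", "DSmiley")

x <- "conflict need resolved :<. tricky concepts :D , different ideas/opinions different :) "
result <- strip_punct_keep_smileys(x, smileys, placeholders)
print(result)
```

Restoring the longest placeholder first means the order of the dictionary no longer matters, even when one placeholder is a substring of another.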

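Note that if the smileys matter for your analysis, you should keep them as words, since they are easier to work with that way. You may also want to look at tm::removePunctuation() and tm::tm_map to handle the punctuation-removal step.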
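
As a rough sketch of that last suggestion, the smileys can be turned into words with a custom transformation and then left that way, with tm::removePunctuation() applied through tm_map. This assumes the tm package is installed; the placeholder words and the name `smileys_to_words` are illustrative choices, not anything tm provides.

```r
library(tm)

# Illustrative dictionary; pick placeholder words that cannot occur in your text.
smileys      <- c(":<", ":>", ":(", ":)", ";)", ":D")
placeholders <- c("unhappyBracket", "happyBracket", "unhappyParen",
                  "happyParen", "winkSmiley", "DSmiley")

# Custom transformation: swap each smiley for its placeholder word,
# padded with spaces so it stays a separate token.
smileys_to_words <- content_transformer(function(x) {
  for (i in seq_along(smileys)) {
    x <- gsub(smileys[i], paste0(" ", placeholders[i], " "), x, fixed = TRUE)
  }
  x
})

doc    <- "conflict need resolved :<. tricky concepts :D"
corpus <- VCorpus(VectorSource(doc))
corpus <- tm_map(corpus, smileys_to_words)   # smileys become words
corpus <- tm_map(corpus, removePunctuation)  # remaining punctuation is dropped
corpus <- tm_map(corpus, stripWhitespace)    # collapse repeated spaces

cleaned <- as.character(corpus[[1]])
print(cleaned)
```

Keep in mind that removePunctuation() deletes punctuation rather than replacing it with a space, so a string such as "ideas/opinions" would be joined into one word; the gsub() call from the answer avoids that by substituting spaces.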