replace_emoticon function incorrectly replaces characters within a word - R

I am working in R and using the replace_emoticon function from the textclean package to replace emoticons with their corresponding words:

library(textclean)
test_text <- "i had a great experience xp :P"
replace_emoticon(test_text)

[1] "i had a great e tongue sticking out erience tongue sticking out tongue sticking out "

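As shown above, the function works, but it also replaces characters that look like an emoticon when they sit inside a word (e.g. "xp" in "experience"). I looked for a fix and found the following function override, which claims to solve the problem: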

 replace_emoticon <- function(x, emoticon_dt = lexicon::hash_emoticons, ...){

     trimws(gsub(
         "\\s+", 
         " ", 
         mgsub_regex(x, paste0('\\b\\Q', emoticon_dt[['x']], '\\E\\b'), paste0(" ", emoticon_dt[['y']], " "))
     ))

 }

replace_emoticon(test_text)

[1] "i had a great experience tongue sticking out :P"

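However, while it does fix the problem for the word "experience", it creates a whole new one: it stops replacing ":P", which is an emoticon and should normally be replaced by the function.

Also, the bug with the characters "xp" is known, but I am not sure whether characters other than "xp" are also wrongly replaced when they are part of a word.

Is there a way to tell the replace_emoticon function to replace "emoticons" only when they are not part of a word?

Thanks!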

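Wiktor is right, the word-boundary check is what causes the problem. I adjusted it slightly in the function below. One issue remains: an emoticon that is immediately followed by a word, with no space between the emoticon and the word, is not replaced. The question is whether that last case matters. See the examples below.

Note: I added this information to the textclean issue tracker.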
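The word-boundary behaviour can be checked directly in base R. The sketch below is only an illustration and assumes PCRE semantics (`perl = TRUE`); `has_match` is a hypothetical helper, not part of textclean. `\b` matches only between a word character and a non-word character, so there is no boundary between a space and `:`.

```r
# Hypothetical helper: does the PCRE pattern match anywhere in the text?
has_match <- function(pattern, text) grepl(pattern, text, perl = TRUE)

# "xp" as a standalone token: boundaries on both sides, so it matches.
xp_standalone <- has_match("\\b\\Qxp\\E\\b", "great xp :P")

# "xp" inside "experience": letters on both sides, so no boundary, no match.
xp_in_word <- has_match("\\b\\Qxp\\E\\b", "experience")

# ":P" after a space: the space and ":" are both non-word characters,
# so the leading \b cannot match and the emoticon is skipped.
colon_p_both <- has_match("\\b\\Q:P\\E\\b", "great xp :P")

# Without the leading \b, ":P" matches; the trailing \b holds at end of string.
colon_p_trailing <- has_match("\\Q:P\\E\\b", "great xp :P")

print(c(xp_standalone, xp_in_word, colon_p_both, colon_p_trailing))
```

The same reasoning explains the remaining limitation: a trailing `\b` fails whenever the emoticon ends in a letter and is directly followed by another letter.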

replace_emoticon2 <- function(x, emoticon_dt = lexicon::hash_emoticons, ...){
  trimws(gsub(
    "\\s+", 
    " ", 
    mgsub_regex(x, paste0('\\Q', emoticon_dt[['x']], '\\E\\b'), paste0(" ", emoticon_dt[['y']], " "))
  ))
}

# works
replace_emoticon2("i had a great experience xp :P")
[1] "i had a great experience tongue sticking out tongue sticking out"
replace_emoticon2("i had a great experiencexp:P:P")
[1] "i had a great experience tongue sticking out tongue sticking out tongue sticking out"


# does not work:
replace_emoticon2("i had a great experience xp :Pnewword")
[1] "i had a great experience tongue sticking out :Pnewword"
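If emoticons should only count when they stand alone as whitespace-delimited tokens, one possible alternative is to use lookarounds instead of `\b`. This is a minimal base R sketch under that assumption, not the textclean implementation: the `emoticons` table is a small hypothetical stand-in for `lexicon::hash_emoticons`, and `replace_emoticon_tokens` is a made-up name. Note the trade-off: glued cases such as `:Pnewword` or `experiencexp:P` are deliberately left untouched.

```r
# Hypothetical two-column lookup standing in for lexicon::hash_emoticons.
emoticons <- data.frame(
  x = c(":P", "xp", ":)"),
  y = c("tongue sticking out", "tongue sticking out", "smiley"),
  stringsAsFactors = FALSE
)

# Replace an emoticon only when it is a whole whitespace-delimited token.
replace_emoticon_tokens <- function(text, table = emoticons) {
  for (i in seq_len(nrow(table))) {
    # (?<!\S): not preceded by a non-space character.
    # (?!\S):  not followed by a non-space character.
    pattern <- paste0("(?<!\\S)\\Q", table$x[i], "\\E(?!\\S)")
    text <- gsub(pattern, table$y[i], text, perl = TRUE)
  }
  text
}

res_plain <- replace_emoticon_tokens("i had a great experience xp :P")
res_glued <- replace_emoticon_tokens("i had a great experience xp :Pnewword")
print(res_plain)
print(res_glued)
```

Because start and end of string also satisfy the lookarounds, an emoticon at the very beginning or end of the text is still replaced.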

Addition:

A new function based on stringi and Wiktor's regex escape function:
replace_emoticon_new <- function (x, emoticon_dt = lexicon::hash_emoticons, ...) 
{
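  # Escape regex metacharacters so each emoticon is matched literally
  regex_escape <- function(string) {
    gsub("([][{}()+*^${|\\\\?.])", "\\\\\\1", string)
  }

  # Replace only emoticons that are preceded by whitespace
  stringi::stri_replace_all(x, 
                            regex = paste0("\\s+", regex_escape(emoticon_dt[["x"]])),
                            replacement = paste0(" ", emoticon_dt[['y']]),   
                            vectorize_all = FALSE)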
}

test_text <- "Hello :) Great experience! xp :) :P"
replace_emoticon_new(test_text)
[1] "Hello smiley Great experience! tongue sticking out smiley tongue sticking out"
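On the open question of which entries besides "xp" could be matched inside ordinary words, a rough heuristic is to list the emoticon keys made up only of letters and digits. The sketch below uses a small hypothetical sample of keys; with the lexicon package installed, `lexicon::hash_emoticons[["x"]]` could be passed instead. It is only a screening aid, not an exhaustive check.

```r
# Hypothetical sample of emoticon keys, for illustration only.
keys <- c(":P", "xp", ":)", "XD", "8D", ";-)")

# Keys consisting purely of letters/digits can occur inside normal words
# (for example "xp" in "experience"), so they are the risky ones.
find_word_like <- function(keys) {
  keys[grepl("^[[:alnum:]]+$", keys)]
}

risky <- find_word_like(keys)
print(risky)
```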