replace_emoticon function incorrectly replaces characters within a word - R
I am using R and the replace_emoticon function from the textclean package to replace emoticons with their corresponding words:
library(textclean)
test_text <- "i had a great experience xp :P"
replace_emoticon(test_text)
[1] "i had a great e tongue sticking out erience tongue sticking out tongue sticking out "
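As shown above, the function works, but it also replaces characters that look like an emoticon but sit inside a word (e.g. the "xp" in "experience"). I looked for a fix and found the following override of the function, which claims to solve the problem: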
replace_emoticon <- function(x, emoticon_dt = lexicon::hash_emoticons, ...){
    trimws(gsub(
        "\\s+",
        " ",
        # \\Q...\\E quotes each emoticon literally; \\b requires a word boundary on both sides
        mgsub_regex(x, paste0('\\b\\Q', emoticon_dt[['x']], '\\E\\b'), paste0(" ", emoticon_dt[['y']], " "))
    ))
}
replace_emoticon(test_text)
[1] "i had a great experience tongue sticking out :P"
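However, while it does fix the problem with the word "experience", it creates a brand-new one: it no longer replaces ":P", which is an emoticon that the function should normally replace.
Moreover, the error with the characters "xp" is known, but I am not sure whether there are other characters besides "xp" that would also be wrongly replaced when they are part of a word.
Is there a way to tell the replace_emoticon function to replace "emoticons" only when they are not part of a word?
Thanks!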
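On the question of which other emoticons might collide with ordinary words, one rough check is to list the emoticon keys that consist only of word characters, since only those can appear inside a word without any punctuation. The following is a minimal sketch in base R; `word_like_keys`, `sample_keys` and `words` are hypothetical names and made-up data for illustration, not part of textclean or lexicon.

```r
# Hypothetical helper: keep only keys made entirely of letters, digits or "_",
# since only those can sit inside a word without any punctuation.
word_like_keys <- function(keys) {
  keys[grepl("^[[:alnum:]_]+$", keys)]
}

# Small made-up sample; with lexicon installed, one could instead pass
# lexicon::hash_emoticons[["x"]] (an assumption, not run here).
sample_keys <- c("xp", ":P", ":)", "XD", "<3")
candidates <- word_like_keys(sample_keys)
print(candidates)  # "xp" "XD"

# Which candidates actually occur inside some ordinary words?
words <- c("experience", "expect", "great")
hits <- vapply(
  candidates,
  function(k) any(grepl(k, words, fixed = TRUE)),
  logical(1)
)
print(hits)  # xp: TRUE, XD: FALSE
```

Running the helper on the real emoticon table would give the full list of keys worth testing against a word list.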
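Wiktor is right, the word-boundary check is what causes the problem. I adjusted it slightly in the function below. One issue remains: an emoticon that is immediately followed by a word, with no space between the emoticon and the word, is not replaced. The question is whether that last case matters. See the examples below.
Note: I added this information to the textclean issue tracker.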
replace_emoticon2 <- function(x, emoticon_dt = lexicon::hash_emoticons, ...){
    trimws(gsub(
        "\\s+",
        " ",
        # only the trailing \\b is kept; the leading \\b is dropped
        mgsub_regex(x, paste0('\\Q', emoticon_dt[['x']], '\\E\\b'), paste0(" ", emoticon_dt[['y']], " "))
    ))
}
# works
replace_emoticon2("i had a great experience xp :P")
[1] "i had a great experience tongue sticking out tongue sticking out"
replace_emoticon2("i had a great experiencexp:P:P")
[1] "i had a great experience tongue sticking out tongue sticking out tongue sticking out"
# does not work:
replace_emoticon2("i had a great experience xp :Pnewword")
[1] "i had a great experience tongue sticking out :Pnewword"
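The results above follow from how `\b` works: a word boundary exists only between a word character (letter, digit, underscore) and a non-word character or the edge of the string. The sketch below illustrates this with base R only, assuming PCRE semantics (`perl = TRUE`), which the `\Q...\E` quoting relies on. `has_match` is a hypothetical helper for illustration, and textclean is not called here.

```r
# Base R only; perl = TRUE is needed for \Q...\E quoting.
has_match <- function(pattern, x) grepl(pattern, x, perl = TRUE)

# Leading \b: ":" is not a word character, so there is no boundary
# between a space and ":". A boundary exists only when ":P" is glued to a word.
leading <- "\\b\\Q:P\\E"
lead_space <- has_match(leading, "xp :P")   # FALSE: space then ":" is no boundary
lead_glued <- has_match(leading, "xp:P")    # TRUE: "p" then ":" is a boundary

# Trailing \b: "P" is a word character, so a boundary needs a non-word
# character (or the end of the string) right after it.
trailing <- "\\Q:P\\E\\b"
trail_end  <- has_match(trailing, "xp :P")      # TRUE: end of string after "P"
trail_word <- has_match(trailing, ":Pnewword")  # FALSE: "P" then "n" is no boundary

# A trailing \b alone still matches "xp" at the end of a longer word.
xp_inside <- has_match("\\Qxp\\E\\b", "experience")  # FALSE
xp_at_end <- has_match("\\Qxp\\E\\b", "exp")         # TRUE
```

So the trailing boundary protects "experience", but a word that merely ends in "xp" would still be affected, which matches the "experiencexp" example above.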
New function:
Based on stringi and a regex escape function by Wiktor:
replace_emoticon_new <- function (x, emoticon_dt = lexicon::hash_emoticons, ...)
{
  # escape regex metacharacters so each emoticon is matched literally
  regex_escape <- function(string) {
    gsub("([][{}()+*^${|\\\\?.])", "\\\\\\1", string)
  }

  # an emoticon is replaced only when it is preceded by whitespace
  stringi::stri_replace_all(x,
                            regex = paste0("\\s+", regex_escape(emoticon_dt[["x"]])),
                            replacement = paste0(" ", emoticon_dt[['y']]),
                            vectorize_all = FALSE)
}
test_text <- "Hello :) Great experience! xp :) :P"
replace_emoticon_new(test_text)
[1] "Hello smiley Great experience! tongue sticking out smiley tongue sticking out"
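A different option, not taken from the answers above, is to replace the `\b` checks with lookarounds so that an emoticon is replaced only when no word character touches it on either side. This is a minimal sketch under stated assumptions: `emoticons` is a hypothetical three-row stand-in for `lexicon::hash_emoticons` (same `x` and `y` columns), `replace_emoticon_lookaround` is an illustrative name, and keys are assumed not to contain the sequence `\E`, which would end the `\Q...\E` quoting early.

```r
# Hypothetical three-row stand-in for lexicon::hash_emoticons (columns x and y).
emoticons <- data.frame(
  x = c("xp", ":P", ":)"),
  y = c("tongue sticking out", "tongue sticking out", "smiley"),
  stringsAsFactors = FALSE
)

replace_emoticon_lookaround <- function(x, emoticon_dt = emoticons) {
  # Longest keys first so a short key cannot split a longer one.
  ord <- order(nchar(emoticon_dt[["x"]]), decreasing = TRUE)
  keys <- emoticon_dt[["x"]][ord]
  vals <- emoticon_dt[["y"]][ord]
  for (i in seq_along(keys)) {
    # (?<!\w) and (?!\w): no word character directly before or after the key.
    pattern <- paste0("(?<!\\w)\\Q", keys[i], "\\E(?!\\w)")
    x <- gsub(pattern, paste0(" ", vals[i], " "), x, perl = TRUE)
  }
  # Collapse the extra spaces introduced by the padded replacements.
  trimws(gsub("\\s+", " ", x))
}

replace_emoticon_lookaround("i had a great experience xp :P")
# [1] "i had a great experience tongue sticking out tongue sticking out"
```

The trade-off is that this variant is stricter than replace_emoticon2: an emoticon glued to a word, as in ":Pnewword" or "xp:P", is left untouched.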