How can I replace emojis with text and treat them as single words?

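I have to do topic modeling in R on pieces of text that contain emojis. The replace_emoji() and replace_emoticon() functions let me analyze them, but the results are problematic.

A red heart emoji is translated to "red heart ufef". These words are then treated separately during the analysis, which distorts the results.

A word like "heart" can mean very different things, as in "red heart ufef" versus "broken heart". The function replace_emoji_identifier() does not help either, because the identifiers make the analysis difficult.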
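To make the problem concrete, here is a minimal base R sketch. The two strings are hypothetical stand-ins for what a default replacement and a collapsed replacement would produce, and splitting on whitespace is only an assumed approximation of how a bag-of-words tokenizer behaves.

```r
# Hypothetical strings: a multi-word emoji name vs. a collapsed one
default_style   <- "red heart i love it"
collapsed_style <- "redheart i love it"

# Split on whitespace, roughly as a simple bag-of-words tokenizer would
tokens_default   <- strsplit(default_style, "[[:space:]]+")[[1]]
tokens_collapsed <- strsplit(collapsed_style, "[[:space:]]+")[[1]]

tokens_default    # "red" and "heart" become two unrelated tokens
tokens_collapsed  # "redheart" stays a single token
```

With the multi-word form, "heart" from a red heart and "heart" from a broken heart end up as the same token, which is exactly what blurs the topics.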

A dummy data set, reproducible via dput() (including the step that forces the text to lowercase):

Emoji_struct <- c(
      list(content = "🔥🔥 wow", "😮 look at that", "😤this makes me angry😤", "😍❤\ufe0f, i love it!"),
      list(content = "😍😍", "😊 thanks for helping", "😢 oh no, why? 😢", "careful, challenging ❌❌❌")
)

Current code (data_orig is a list of several files):

library(textclean)
library(tm)  # provides stripWhitespace(), used in the last step
# The rest should be standard R packages for pre-processing

#pre-processing:
data <- gsub("'", "", data) 
data <- replace_contraction(data)
data <- replace_emoji(data) # replace emoji with words
data <- replace_emoticon(data) # replace emoticon with words
data <- replace_hash(data, replacement = "")
data <- replace_word_elongation(data)
data <- gsub("[[:punct:]]", " ", data)  #replace punctuation with space
data <- gsub("[[:cntrl:]]", " ", data) 
data <- gsub("[[:digit:]]", "", data)  #remove digits
data <- gsub("^[[:space:]]+", "", data) #remove whitespace at beginning of documents
data <- gsub("[[:space:]]+$", "", data) #remove whitespace at end of documents
data <- stripWhitespace(data)

Desired output:

[1] list(content = c("fire fire wow", 
                     "facewithopenmouth look at that", 
                     "facewithsteamfromnose this makes me angry facewithsteamfromnose", 
                     "smilingfacewithhearteyes redheart \ufe0f, i love it!"), 
         content = c("smilingfacewithhearteyes smilingfacewithhearteyes", 
                     "smilingfacewithsmilingeyes thanks for helping", 
                     "cryingface oh no, why? cryingface", 
                     "careful, challenging crossmark crossmark crossmark"))

Any ideas? Lowercase is fine too. Best regards. Stay safe. Stay healthy.

Answer

Replace the default conversion table in replace_emoji with a version in which the spaces and punctuation have been removed:

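library(textclean)

# Copy the default lookup table and strip spaces/punctuation from the emoji names
hash2 <- lexicon::hash_emojis
hash2$y <- gsub("[[:space:]]|[[:punct:]]", "", hash2$y)

# Emoji_struct as defined above is a plain list, so apply the replacement element-wise
# (for a data frame, pass the text column instead, e.g. Emoji_struct[, 1])
lapply(Emoji_struct, replace_emoji, emoji_dt = hash2)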

Examples

A single string:

replace_emoji("wow!😮 that is cool!", emoji_dt = hash2)
#[1] "wow! facewithopenmouth that is cool!"

A character vector:

replace_emoji(c("1: 😊", "2: 😍"), emoji_dt = hash2)
#[1] "1: smilingfacewithsmilingeyes "
#[2] "2: smilingfacewithhearteyes "

A list:

library(magrittr)  # provides %>%
list("list_element_1: 🔥", "list_element_2: ❌") %>%
  lapply(replace_emoji, emoji_dt = hash2)
#[[1]]
#[1] "list_element_1: fire "
#
#[[2]]
#[1] "list_element_2: crossmark "

Rationale

To convert emojis to text, replace_emoji uses lexicon::hash_emojis as its conversion table (a hash table):

head(lexicon::hash_emojis)
#              x                        y
#1: <e2><86><95>            up-down arrow
#2: <e2><86><99>          down-left arrow
#3: <e2><86><a9> right arrow curving left
#4: <e2><86><aa> left arrow curving right
#5: <e2><8c><9a>                    watch
#6: <e2><8c><9b>           hourglass done

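This is an object of class data.table. We can simply modify the y column of this hash table so that all spaces and punctuation are removed. Note that this approach also lets you add new byte representations (written in ASCII, as in column x) together with their accompanying strings.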
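As a rough sketch of both ideas, the base R snippet below works on a small toy table rather than the real lexicon::hash_emojis, so it runs without any packages. The toy rows are copied from the head() output above. The assumption that column x holds the kind of byte string produced by iconv(..., sub = "byte") is inferred from that output and is not confirmed here, and the label "crossmark" is chosen purely for illustration.

```r
# Toy stand-in for lexicon::hash_emojis (same two columns, rows from head() above)
toy <- data.frame(
  x = c("<e2><86><95>", "<e2><8c><9b>"),
  y = c("up-down arrow", "hourglass done"),
  stringsAsFactors = FALSE
)

# Collapse each name into a single token by dropping spaces and punctuation
toy$y <- gsub("[[:space:]]|[[:punct:]]", "", toy$y)

# Byte representation of a symbol (U+274C), written in the same style as column x.
# Assumption: this matches how the keys in the real table were generated.
new_x <- iconv("\u274c", from = "UTF-8", to = "ASCII", sub = "byte")

# Append a custom entry; the label is illustrative only
toy <- rbind(toy, data.frame(x = new_x, y = "crossmark", stringsAsFactors = FALSE))
toy
```

The same steps should carry over to a copy of the real table such as hash2, but that has not been checked here against replace_emoji.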