How can I replace emojis with text and treat them as single words?
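I have to do topic modeling in R on text snippets that contain emojis. The replace_emoji() and replace_emoticon() functions let me analyze them, but the results are problematic.
A red heart emoji is translated as "red heart ufef". These words are then treated separately during the analysis and distort the results.
A word like "heart" can carry very different meanings, as in "red heart ufef" versus "broken heart".
The function replace_emoji_identifier() does not help either, because the identifiers make the analysis difficult.
A dummy data set, reproduced with dput() (including the step that forces the text to lowercase):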
Emoji_struct <- c(
  list(content = "🔥🔥 wow", "😮 look at that", "😤this makes me angry😤", "😍❤\ufe0f, i love it!"),
  list(content = "😍😍", "😊 thanks for helping", "😢 oh no, why? 😢", "careful, challenging ❌❌❌")
)
Current code (data_orig is a list of several files):
library(textclean)
library(tm) # provides stripWhitespace()
# The rest should be standard R packages for pre-processing
# Pre-processing:
data <- gsub("'", "", data)
data <- replace_contraction(data)
data <- replace_emoji(data) # replace emoji with words
data <- replace_emoticon(data) # replace emoticon with words
data <- replace_hash(data, replacement = "")
data <- replace_word_elongation(data)
data <- gsub("[[:punct:]]", " ", data) #replace punctuation with space
data <- gsub("[[:cntrl:]]", " ", data)
data <- gsub("[[:digit:]]", "", data) #remove digits
data <- gsub("^[[:space:]]+", "", data) #remove whitespace at beginning of documents
data <- gsub("[[:space:]]+$", "", data) #remove whitespace at end of documents
data <- stripWhitespace(data)
Desired output:
[1] list(content = c("fire fire wow",
"facewithopenmouth look at that",
"facewithsteamfromnose this makes me angry facewithsteamfromnose",
"smilingfacewithhearteyes redheart \ufe0f, i love it!"),
content = c("smilingfacewithhearteyes smilingfacewithhearteyes",
"smilingfacewithsmilingeyes thanks for helping",
"cryingface oh no, why? cryingface",
"careful, challenging crossmark crossmark crossmark"))
Any ideas? Lowercase output would be fine too.
Best regards. Stay safe. Stay healthy.
Answer
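Replace the default conversion table used by replace_emoji with a version in which spaces and punctuation have been removed:
hash2 <- lexicon::hash_emojis
# Collapse each description ("red heart") into a single token ("redheart")
hash2$y <- gsub("[[:space:]]|[[:punct:]]", "", hash2$y)
# Emoji_struct is a list, so apply the replacement to each element
lapply(Emoji_struct, replace_emoji, emoji_dt = hash2)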
Examples
Single string:
replace_emoji("wow!😮 that is cool!", emoji_dt = hash2)
#[1] "wow! facewithopenmouth that is cool!"
Character vector:
replace_emoji(c("1: 😊", "2: 😍"), emoji_dt = hash2)
#[1] "1: smilingfacewithsmilingeyes "
#[2] "2: smilingfacewithhearteyes "
List:
library(magrittr) # provides %>%
list("list_element_1: 🔥", "list_element_2: ❌") %>%
  lapply(replace_emoji, emoji_dt = hash2)
#[[1]]
#[1] "list_element_1: fire "
#
#[[2]]
#[1] "list_element_2: crossmark "
Rationale
To convert emojis to text, replace_emoji uses lexicon::hash_emojis as its conversion table (a hash table):
head(lexicon::hash_emojis)
# x y
#1: <e2><86><95> up-down arrow
#2: <e2><86><99> down-left arrow
#3: <e2><86><a9> right arrow curving left
#4: <e2><86><aa> left arrow curving right
#5: <e2><8c><9a> watch
#6: <e2><8c><9b> hourglass done
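This is an object of class data.table. We can simply modify the y column of this hash table so that all spaces and punctuation are removed. Note that this approach also lets you add new ASCII byte representations together with their accompanying strings.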
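To see why collapsing the descriptions matters for tokenization, here is a minimal base R sketch. It does not use textclean or lexicon: the table emoji_tbl, its placeholder keys such as ":red_heart:", and the helper replace_keys() are all hypothetical and only mimic the two-column shape (x, y) of lexicon::hash_emojis.

```r
# Hypothetical mini lookup table with the same two-column shape as
# lexicon::hash_emojis (x = key to find, y = replacement text).
# The keys are made-up placeholders, not the real byte strings.
emoji_tbl <- data.frame(
  x = c(":red_heart:", ":broken_heart:"),
  y = c("red heart", "broken heart"),
  stringsAsFactors = FALSE
)

# Add a custom entry (also hypothetical) before collapsing.
emoji_tbl <- rbind(
  emoji_tbl,
  data.frame(x = ":thumbs_up:", y = "thumbs-up", stringsAsFactors = FALSE)
)

# Collapse every description into a single token, as in the answer.
emoji_tbl$y <- gsub("[[:space:]]|[[:punct:]]", "", emoji_tbl$y)

# Replace each key with its space-padded description, then tidy whitespace.
replace_keys <- function(text, tbl) {
  for (i in seq_len(nrow(tbl))) {
    text <- gsub(tbl$x[i], paste0(" ", tbl$y[i], " "), text, fixed = TRUE)
  }
  trimws(gsub("[[:space:]]+", " ", text))
}

cleaned <- replace_keys(
  "i love it :red_heart: but :broken_heart: hurts :thumbs_up:",
  emoji_tbl
)
tokens <- strsplit(cleaned, " ", fixed = TRUE)[[1]]
print(tokens)
```

Each emoji now yields exactly one token ("redheart", "brokenheart", "thumbsup"), so a later punctuation-stripping or tokenizing step can no longer split the description or merge "heart" counts across different emojis.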