Reading Unicode Emoji into R correctly
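I have a set of comments from Facebook (pulled via a system like Sprinkr) that contain both text and emoji, and I'm trying to run a variety of analyses on them in R, but I'm having difficulty ingesting the emoji characters correctly.
For example: I have a .csv (encoded in UTF-8) whose message rows contain something like this:
"IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups💚💚💚"
I then ingest it into R as follows: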
library(tidyverse)
library(janitor)

raw.fb.comments <- read_csv("data.csv",
                            locale = locale(encoding="UTF-8"))

fb.comments <- raw.fb.comments %>%
    clean_names() %>%
    filter(senderscreenname != "Reese's") %>%
    select(c(message,messagetype,sentiment)) %>%
    mutate(type = "Facebook")

fb.comments$message[5]
[1] "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups\xf0\u009f\u0092\u009a\xf0\u009f\u0092\u009a\xf0\u009f\u0092\u009a\n\n"
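Now, based on what I've read from other sources, I need to convert this UTF-8 into ASCII, which I can then use to link up with other emoji resources (like the wonderful emojidictionary). To make the join work, I need to get the text into an R byte encoding like this: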
<e2><9d><a4><ef><b8><8f>
However, adding the usual step (using iconv) doesn't get me there:
fb.comments <- raw.fb.comments %>%
    clean_names() %>%
    filter(senderscreenname != "Reese's") %>%
    select(c(message,messagetype,sentiment)) %>%
    mutate(type = "Facebook") %>%
    mutate(message = iconv(message, from="UTF-8", to="ascii",sub="byte"))

fb.comments$message[5]
[1] "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups<f0><9f><92><9a><f0><9f><92><9a><f0><9f><92><9a>\n\n"
Can anyone shed light on what I'm missing, or do I need to find a different emoji mapping resource? Thanks!
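As a point of reference, the byte-escaped form the question asks for can be reproduced in base R on a single emoji. The sketch below is an illustration only: the variable names and the two sample emoji are made up for this example, and it assumes the strings are marked as UTF-8 (which string literals written with `\U` or `\u` escapes are). It suggests that the `iconv(..., sub = "byte")` output is already in the same `<xx>` format as the target, and that the two byte strings differ simply because they encode different emoji.

```r
# Hypothetical sample values, not taken from the original data set
green_heart <- "\U0001F49A"    # U+1F49A GREEN HEART
red_heart   <- "\u2764\uFE0F"  # U+2764 HEAVY BLACK HEART + U+FE0F variation selector

# Raw UTF-8 bytes behind each string
green_bytes <- charToRaw(green_heart)
red_bytes   <- charToRaw(red_heart)

# iconv() with sub = "byte" writes every non-ASCII byte as <xx>
green_ascii <- iconv(green_heart, from = "UTF-8", to = "ASCII", sub = "byte")
red_ascii   <- iconv(red_heart,   from = "UTF-8", to = "ASCII", sub = "byte")

print(green_ascii)  # "<f0><9f><92><9a>"
print(red_ascii)    # "<e2><9d><a4><ef><b8><8f>"
```

If a dictionary keyed on this byte form does not match, one thing worth checking is whether it actually contains an entry for the emoji in question.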
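The goal isn't entirely clear, but I suspect that giving up on representing the emoji correctly and representing them only as bytes is not the best approach. For example, if you want to convert emoji into their descriptions, you could do something like this: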
library(stringi)
library(magrittr)  # provides %>%

x <- "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups💚💚💚"

## read emoji info and get rid of documentation lines
readLines("https://unicode.org/Public/emoji/5.0/emoji-test.txt",
          encoding="UTF-8") %>%
    stri_subset_regex(pattern = "^[^#]") %>%
    stri_subset_regex(pattern = ".+") -> emoji

## get the emoji characters and clean them up
emoji %>%
    stri_extract_all_regex(pattern = "# *.{1,2} *") %>%
    stri_replace_all_fixed(pattern = c("*", "#"),
                           replacement = "",
                           vectorize_all=FALSE) %>%
    stri_trim_both() -> emoji.chars

## get the emoji character descriptions
emoji %>%
    stri_extract_all_regex(pattern = "#.*$") %>%
    stri_replace_all_regex(pattern = "# *.{1,2} *",
                           replacement = "") %>%
    stri_trim_both() -> emoji.descriptions

## replace emoji characters with their descriptions.
stri_replace_all_regex(x,
                       pattern = emoji.chars,
                       replacement = emoji.descriptions,
                       vectorize_all=FALSE)
## [1] "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cupsgreen heartgreen heartgreen heart"
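The same replacement idea can be tried without downloading the full Unicode file. The sketch below assumes a tiny hand-written lookup table (`emoji_lookup` and its two rows are hypothetical, made up for illustration) and uses `stri_replace_all_fixed()` instead of the regex variant, so the emoji are matched as literal text. It is not meant as a substitute for the full table built from emoji-test.txt.

```r
library(stringi)

# Hypothetical two-row lookup table; a real one would come from emoji-test.txt
emoji_lookup <- data.frame(
  char = c("\U0001F49A", "\u2764\uFE0F"),
  description = c("green heart", "red heart"),
  stringsAsFactors = FALSE
)

x <- "Peanut Butter Cups\U0001F49A\U0001F49A\U0001F49A"

# vectorize_all = FALSE applies every pattern/replacement pair to x in turn
described <- stri_replace_all_fixed(
  x,
  pattern = emoji_lookup$char,
  replacement = paste0(" [", emoji_lookup$description, "]"),
  vectorize_all = FALSE
)

print(described)
# "Peanut Butter Cups [green heart] [green heart] [green heart]"
```

Wrapping each description in brackets with a leading space keeps adjacent emoji from running together, as they do in "green heartgreen heartgreen heart".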