HTML encode text in R

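I'm working with Twitter data that I then feed into an HTML document. The text often contains special characters, such as emoji, that are not properly encoded for HTML. For example, the tweet: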

If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be 🔥 🔥 🔥

becomes:

If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be 🔥 🔥 🔥

when it is placed in an HTML document.

Working manually, I can use a tool such as https://www.textfixer.com/html/html-character-encoding.php to encode the tweet as:

If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be &#55357;&#56613; &#55357;&#56613; &#55357;&#56613;

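I can then feed this to the HTML document and the emoji are displayed. Is there a package or function in R that takes text and HTML-encodes it the way the web tool above does?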

Here is a function that encodes non-ASCII characters as HTML entities.

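entity_encode <- function(x) {
  # x must be a single string; utf8ToInt() returns its Unicode code points
  cp <- utf8ToInt(x)
  rr <- character(length(cp))
  # ASCII ends at 127, so code point 128 must be encoded as well
  ucp <- cp > 127
  # non-ASCII code points become decimal numeric character references
  rr[ucp] <- paste0("&#", cp[ucp], ";")
  # ASCII code points are converted back to characters unchanged
  rr[!ucp] <- intToUtf8(cp[!ucp], multiple = TRUE)
  paste0(rr, collapse = "")
}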

For your input, this returns

[1] "If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be &#128293; &#128293; &#128293;"

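The web tool's output looks different, but the two encodings appear to be equivalent: the tool emits the UTF-16 surrogate pair of the emoji (55357 and 56613), while this function emits its single code point, 128293 (U+1F525).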
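
To see how the two numeric forms relate, the standard UTF-16 surrogate arithmetic can be sketched in a few lines of R. The helper name `to_surrogates` is hypothetical and is not part of the function above; it only assumes a single code point above U+FFFF as input.

```r
# Hypothetical helper: split a code point above U+FFFF into its
# UTF-16 high and low surrogates (standard UTF-16 arithmetic).
to_surrogates <- function(cp) {
  stopifnot(length(cp) == 1, cp > 0xFFFF, cp <= 0x10FFFF)
  v <- cp - 0x10000
  c(high = 0xD800 + v %/% 0x400,   # top 10 bits
    low  = 0xDC00 + v %% 0x400)    # bottom 10 bits
}

fire <- 128293        # code point emitted by the function for the emoji
to_surrogates(fire)   # high = 55357, low = 56613
```

The result matches the pair of numbers produced by the web tool, so both outputs refer to the same character; the single code point form is the shorter of the two.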