Converting from NCR to Unicode in R

Question

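I have some HTML files (which I process as plain text) that use decimal NCRs (numeric character references) to encode special characters. Is there a convenient way to convert them to Unicode using R?
The NCR codes do not seem to match Unicode one-to-one, and it gets quite confusing, because ѣ (written as &#1123;) is not equal to \u1123 but to \u0463: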

> stri_unescape_unicode("\\u1123")
[1] "ᄣ"

and

> stri_unescape_unicode("\\u0463")
[1] "ѣ"

Answer 1

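1123 is the decimal equivalent of hexadecimal 0463, and Unicode escapes use hexadecimal. So to convert, you need to strip out the non-digit characters, convert the numbers to hexadecimal, prepend "\u" to them, and then use stri_unescape_unicode.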
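The decimal-to-hexadecimal relationship can be checked with a few lines of base R. This is only a small illustrative sketch: the variable names are made up for the example, and it uses intToUtf8, which is a base R alternative to the stringi route and not part of the approach described above.

```r
code <- 1123L                      # decimal value taken from "&#1123;"

hex <- sprintf("%04X", code)       # decimal -> zero-padded hexadecimal: "0463"
back <- strtoi(hex, base = 16L)    # hexadecimal -> decimal again: 1123

# intToUtf8() takes the numeric code point directly, so no hex step is needed
char <- intToUtf8(code)            # the character U+0463
```

So the number inside an NCR and the number inside a \u escape name the same code point; only the base differs.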

This function does all of those steps:

ncr2uni <- function(x)
{
  # Strip out non-digits and convert the remaining numbers to hex
  # (the backslash must be doubled inside an R string literal)
  x <- as.hexmode(as.numeric(gsub("\\D", "", x)))

  # Left pad with zeros to length 4 so escape sequence is recognised as Unicode
  x <- stringi::stri_pad_left(x, 4, "0")

  # Convert to Unicode: "\\u" yields a literal backslash followed by "u"
  stringi::stri_unescape_unicode(paste0("\\u", x))
}

Now you can do

ncr2uni(c("&#1123;", "&#1124;", "&#1125;"))
# [1] "ѣ" "Ѥ" "ѥ"
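Because ncr2uni strips every non-digit from its input, it expects each element to be a single, isolated NCR. For NCRs embedded in longer text, as in the HTML files from the question, something along the following lines might work. This is a minimal sketch in base R, not part of the original answer: decode_ncr_text is a hypothetical helper name, it handles only decimal references of the form &#NNNN;, and it assumes the session runs in a UTF-8 locale.

```r
# Hypothetical helper: replace every decimal NCR (&#NNNN;) found inside text
decode_ncr_text <- function(text) {
  # Locate all decimal NCRs in each element of the character vector
  hits <- gregexpr("&#[0-9]+;", text)

  # Swap each match for the character with that code point
  regmatches(text, hits) <- lapply(regmatches(text, hits), function(ncr) {
    codes <- as.integer(gsub("[^0-9]", "", ncr))
    # intToUtf8() accepts decimal code points, so no hex conversion is needed
    intToUtf8(codes, multiple = TRUE)
  })

  text
}

decoded <- decode_ncr_text(c("old spelling: &#1123; and &#1124;", "no references here"))
```

Text without any NCRs passes through unchanged, and since intToUtf8 works from the numeric code point, this sketch should also cope with code points above U+FFFF, which would not fit in a four-digit \u escape.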


Tags: r, unicode, stringi, ncr