How to find the "real" number of characters in a Unicode string in R

Question

I know how to find the length of a non-Unicode string in R.

nchar("ABC")

(Thanks to everyone who answered here: How to find the length of a string in R?)

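But what about Unicode strings?

How do I find the length of a Unicode string, meaning the number of characters it contains? More specifically, how can I get both the length in bytes and the number of characters (runes, symbols) of a Unicode string in R?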

Answer 1

You can use nchar to get both the number of characters and the number of bytes:

nchar("bi\u00dfchen", type="chars")
#[1] 7
nchar("bi\u00dfchen", type="bytes")
#[1] 8

Indeed, the help page (?nchar) explains in detail how the size of a string is measured:

The ‘size’ of a character string can be measured in one of three ways (corresponding to the type argument):

bytes: The number of bytes needed to store the string (plus in C a final terminator which is not counted).

chars: The number of human-readable characters.

width: The number of columns cat will use to print the string in a monospaced font. The same as chars if this cannot be calculated.
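To see how the three measures differ, here is a minimal sketch in base R that puts them side by side. The helper name string_sizes is made up for this illustration, and the byte counts assume the strings are stored as UTF-8 (hence the enc2utf8 call). The width column depends on how R treats each character in your setup, so no expected values are given for it.

```r
# Hypothetical helper (not part of base R): report the three nchar() measures
# for each element of a character vector.
string_sizes <- function(x) {
  x <- enc2utf8(x)  # assumption: we want byte counts for the UTF-8 encoding
  data.frame(
    string = x,
    bytes  = nchar(x, type = "bytes"),
    chars  = nchar(x, type = "chars"),
    width  = nchar(x, type = "width"),
    stringsAsFactors = FALSE
  )
}

# ASCII, a string with one two-byte character, and two CJK characters
sizes <- string_sizes(c("ABC", "bi\u00dfchen", "\u4e2d\u6587"))

sizes$bytes  # 3 8 6
sizes$chars  # 3 7 2
print(sizes)
```

The last string shows the gap most clearly: two characters, but six bytes in UTF-8.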

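If you want the number of "symbols" in a string that may (or may not) contain Unicode, i.e. counting the escape sequences literally instead of interpreting them as Unicode characters, you can use the function stri_escape_unicode from the stringi package: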

library(stringi)
nchar(stri_escape_unicode("bi\u00dfchen")) # same as stri_length(stri_escape_unicode("bi\u00dfchen"))
# [1] 12
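If by "real" characters you mean what a reader perceives as one character, combining marks are worth a look. The sketch below is only an illustration of that reading of the question: it assumes stringi is installed and uses stri_numbytes, stri_length, stri_count_boundaries and stri_trans_nfc. The example string is chosen here and is not taken from the question.

```r
library(stringi)

# "e" followed by U+0301 COMBINING ACUTE ACCENT: displayed as a single "é"
x <- "e\u0301"

n_bytes     <- stri_numbytes(x)                              # 3 (UTF-8 bytes)
n_points    <- stri_length(x)                                # 2 (code points)
n_graphemes <- stri_count_boundaries(x, type = "character")  # 1 (perceived character)

# Normalizing to NFC composes the pair into the single code point U+00E9
n_points_nfc <- stri_length(stri_trans_nfc(x))               # 1
```

Which of these counts is the "right" one depends on what you need: storage size, code points, or characters as a reader sees them.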
