Erlang equivalent of JavaScript codePointAt?

Question

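Is there an Erlang equivalent of JavaScript's codePointAt? Something that reads the code point starting at a given byte offset, without modifying the underlying string/binary?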

Answer 1

You can use bit syntax pattern matching to skip the first N bytes and decode the first character of the remaining bytes as UTF-8:

1> CodePointAt = fun(Binary, Offset) ->
  <<_:Offset/binary, Char/utf8, _/binary>> = Binary,
  Char
end.

Testing:

2> CodePointAt(<<"πr²"/utf8>>, 0).
960
3> CodePointAt(<<"πr²"/utf8>>, 1).
** exception error: no match of right hand side value <<207,128,114,194,178>>
4> CodePointAt(<<"πr²"/utf8>>, 2).
114
5> CodePointAt(<<"πr²"/utf8>>, 3).
178
6> CodePointAt(<<"πr²"/utf8>>, 4).
** exception error: no match of right hand side value <<207,128,114,194,178>>
7> CodePointAt(<<"πr²"/utf8>>, 5).
** exception error: no match of right hand side value <<207,128,114,194,178>>

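As you can see, the function raises an error if the offset does not fall on a valid UTF-8 character boundary (or, as with offset 5, lies at the end of the binary). If needed, you can use a case expression to handle that differently.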
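A minimal sketch of such a case-based variant, assuming you prefer a tagged result over an exception. The name SafeCodePointAt and the {ok, Char} / error return shape are illustrative choices, not part of the answer above:

```erlang
%% Returns {ok, CodePoint} when Offset points at the start of a
%% UTF-8 encoded character, and the atom error otherwise.
SafeCodePointAt = fun(Binary, Offset) ->
  case Binary of
    <<_:Offset/binary, Char/utf8, _/binary>> -> {ok, Char};
    _ -> error
  end
end.

SafeCodePointAt(<<"πr²"/utf8>>, 0).  % {ok,960}
SafeCodePointAt(<<"πr²"/utf8>>, 1).  % error (middle of the 2-byte π)
SafeCodePointAt(<<"πr²"/utf8>>, 5).  % error (end of the binary)
```

The caller can then pattern match on the result instead of wrapping the call in try ... catch.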

Answer 2

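First, keep in mind that only binary strings use UTF-8 in Erlang. Plain double-quoted strings are already just lists of code points (much like UTF-32). The unicode:chardata() type covers both kinds of strings, including mixed lists like ["Hello", $\s, [<<"Filip"/utf8>>, $!]]. If needed, you can use unicode:characters_to_list(Chardata) or unicode:characters_to_binary(Chardata) to get a flattened version.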
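As a small illustration of the flattening described above (the variable names are only for this example), both conversions accept the mixed list, and the flattened list can then be indexed by code point rather than by byte:

```erlang
%% Mixed chardata: a string, a character, and a nested list with a binary.
Chardata = ["Hello", $\s, [<<"Filip"/utf8>>, $!]].

%% Flatten to a list of code points or to a UTF-8 binary.
"Hello Filip!" = unicode:characters_to_list(Chardata).
<<"Hello Filip!">> = unicode:characters_to_binary(Chardata).

%% In list form every element is one code point, so lists:nth/2
%% (1-based) indexes by code point, not by byte.
CodePoints = unicode:characters_to_list(<<"πr²"/utf8>>).
[960, 114, 178] = CodePoints.
178 = lists:nth(3, CodePoints).
```

Note that lists:nth/2 walks the list from the head, so this is a linear-time lookup.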

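Meanwhile, the JS codePointAt function works on UTF-16 encoded strings, which is what JavaScript uses. Note that the index in this case is not a byte position but the index of a 16-bit code unit. And UTF-16 is also a variable-length encoding: code points that need more than 16 bits, such as emoji, are encoded with a kind of escape sequence called a "surrogate pair". So if such characters can appear, the index is misleading: in a JavaScript string made of "a", one emoji, and "z", a is at index 0, but z is at index 3 rather than 2.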
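To make the UTF-16 indexing concrete, here is a rough sketch that mimics it in Erlang by converting chardata to big-endian UTF-16 and indexing by 16-bit unit. Utf16CodePointAt is a hypothetical helper written for this illustration, and U+1F600 is just an arbitrary emoji outside the 16-bit range:

```erlang
%% Index is counted in 16-bit code units, as in JavaScript.
Utf16CodePointAt = fun(Chardata, Index) ->
  Utf16 = unicode:characters_to_binary(Chardata, unicode, {utf16, big}),
  Skip = Index * 2,  % two bytes per code unit
  case Utf16 of
    %% A valid character (possibly a surrogate pair) starts here.
    <<_:Skip/binary, Char/utf16, _/binary>> -> Char;
    %% Index points into the second half of a surrogate pair:
    %% return the raw code unit.
    <<_:Skip/binary, Unit:16, _/binary>> -> Unit;
    _ -> undefined
  end
end.

Str = [$a, 16#1F600, $z].
Utf16CodePointAt(Str, 0).  % 97      ($a)
Utf16CodePointAt(Str, 1).  % 128512  (16#1F600, the emoji)
Utf16CodePointAt(Str, 2).  % 56832   (16#DE00, the low surrogate)
Utf16CodePointAt(Str, 3).  % 122     ($z)
```

This converts the whole string on every call, so it is only meant to show where the indexes land, not to be used as is.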

What you want is probably the so-called "grapheme clusters": the things that look like a single unit when printed (see the documentation for Erlang's string module: https://www.erlang.org/doc/man/string.html). You can't really use numerical indexes to dig grapheme clusters out of a string; you need to iterate over the string from the start, taking them out one at a time. This can be done with string:next_grapheme(Chardata) (see https://www.erlang.org/doc/man/string.html#next_grapheme-1). If for some reason you really need to index them numerically, you could put the individual cluster substrings in an array (see https://www.erlang.org/doc/man/array.html). For example: array:from_list(string:to_graphemes(Chardata)).
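A short sketch of both approaches, using an input chosen for this example: the letter e followed by U+0301 (combining acute accent) forms one grapheme cluster made of two code points, followed by "!":

```erlang
%% Two code points that print as one character, then "!".
Chardata = [$e, 16#0301, $!].

%% next_grapheme/1 returns the first cluster consed onto the rest.
[First | _Rest] = string:next_grapheme(Chardata).
[$e, 16#0301] = First.

%% to_graphemes/1 splits the whole string; multi-code-point clusters
%% come back as lists, single code points as plain integers.
Graphemes = string:to_graphemes(Chardata).
2 = length(Graphemes).

%% Arrays are 0-based, so index 1 is the second cluster.
Arr = array:from_list(Graphemes).
[$e, 16#0301] = array:get(0, Arr).
$! = array:get(1, Arr).
```

Building the array costs one pass over the string; after that, array:get/2 gives indexed access per cluster.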
