Difference between codePointAt and charCodeAt

What is the difference between String.prototype.codePointAt() and String.prototype.charCodeAt() in JavaScript?

'A'.codePointAt(); // 65
'A'.charCodeAt();  // 65

From the MDN page on charCodeAt:

The charCodeAt() method returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index.

The UTF-16 code unit matches the Unicode code point for code points which can be represented in a single UTF-16 code unit. If the Unicode code point cannot be represented in a single UTF-16 code unit (because its value is greater than 0xFFFF) then the code unit returned will be the first part of a surrogate pair for the code point. If you want the entire code point value, use codePointAt().

TL;DR:

  • charCodeAt() returns a UTF-16 code unit.
  • codePointAt() returns a Unicode code point.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/codePointAt

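The page at this URL shows the difference: the two methods do almost the same thing and differ only in their return values and in how they handle invalid arguments.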
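As a minimal sketch of the "invalid arguments" difference mentioned above: for an index outside the string, charCodeAt() returns NaN while codePointAt() returns undefined. The variable names are only illustrative.

```javascript
// Out-of-range index: the two methods signal "nothing here" differently.
const shortString = "A";

const outOfRangeUnit = shortString.charCodeAt(5);   // NaN
const outOfRangePoint = shortString.codePointAt(5); // undefined

console.log(outOfRangeUnit, outOfRangePoint);
```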

To add to ToxicTeacakes's answer, here is another example that helps show the difference:

"𠮷".charCodeAt(0).toString(16); // d842
"𠮷".charCodeAt(1).toString(16); // dfb7

"𠮷".codePointAt(0).toString(16); // 20bb7
"𠮷".codePointAt(1).toString(16); // dfb7

console.log("\ud842\udfb7"); // 𠮷, written as two four-digit hexadecimal escapes (a surrogate pair)
console.log("\u20bb7\udfb7"); // ₻7�, because \u only consumes four hex digits
console.log("\u{20bb7}"); // 𠮷, a Unicode code point escape equivalent to "\ud842\udfb7"

The following is information about JavaScript string literals:

"\uXXXX"
The Unicode character specified by the four hexadecimal digits XXXX. For example, \u00A9 is the Unicode sequence for the copyright symbol.

"\u{XXXXX}"
Unicode code point escapes. For example, \u{2F804} is the same as the simple Unicode escapes \uD87E\uDC04.
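The equivalence of the two escape forms can be checked directly. This is a small sketch; the variable names are only illustrative.

```javascript
// The same character written with a code point escape and with two \uXXXX escapes.
const viaCodePointEscape = "\u{2F804}";
const viaSurrogateEscapes = "\uD87E\uDC04";

console.log(viaCodePointEscape === viaSurrogateEscapes);      // true
console.log(viaCodePointEscape.length);                       // 2 (two UTF-16 code units)
console.log(viaCodePointEscape.codePointAt(0).toString(16));  // 2f804
```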

See also MSDN.

Example in JS

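In this example with strings and emojis, I show what goes wrong when you are not aware that some characters take up more than one code unit. Consider using codePointAt() instead of charCodeAt(), unless you are sure that your characters lie between 0 and 65535 (2^16 - 1).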

more about code units here

// charCodeAt() is UTF-16
// codePointAt() is Unicode

/* UTF-16 is generally considered a bad idea today */

const strings = ["o", "four", "to"];
const emojis = ["🐎", "👟"];

function printItemsLength(arr) {
  for (const item of arr) {
    console.log(item, item.length);
  }
}

printItemsLength(strings);
console.log('================================');
printItemsLength(emojis);
console.log('================================');
console.log("i.charCodeAt(0)", "i".charCodeAt(0)); // 105
console.log("i.charCodeAt(1)", "i".charCodeAt(1)); // NaN (index 1 is out of range)
console.log("i.codePointAt(0)", "i".codePointAt(0)); // 105
console.log('=============EMOJIS=============');
// getting the decimal (dec) by which you can find them

console.log('===========charCodeAt===========');
// "surrogate pair"
console.log(emojis[0] + '.charCodeAt(0)', emojis[0].charCodeAt(0)); // only half of the character (high surrogate) - 55357
console.log(emojis[0] + '.charCodeAt(1)', emojis[0].charCodeAt(1)); // only half of the character (low surrogate) - 56334

console.log('===========codePointAt===========');
console.log(emojis[0] + '.codePointAt(0)', emojis[0].codePointAt(0)); // 128014

console.log('===========charCodeAt===========');
// "surrogate pair"
console.log(emojis[1] + '.charCodeAt(0)', emojis[1].charCodeAt(0)); // only half of the character (high surrogate) - 55357
console.log(emojis[1] + '.charCodeAt(1)', emojis[1].charCodeAt(1)); // only half of the character (low surrogate) - 56415

console.log('===========codePointAt===========');
// full-character
console.log(emojis[1] + '.codePointAt(0)', emojis[1].codePointAt(0)); // 128095
console.log(emojis[1] + '.codePointAt(1)', emojis[1].codePointAt(1)); // returns the low surrogate (not displayable on its own) - 56415
// to find these emojis, have a look here: https://www.w3schools.com/charsets/ref_emoji.asp

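Some of you may have noticed that I tried to convert the character codes back into emojis, but it did not work for one of the symbols (that is because its code point does not fit in a single UTF-16 code unit).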
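A minimal sketch of that round trip, assuming the horse emoji (code point 128014) from the example above: String.fromCharCode() keeps only the low 16 bits of its argument, while String.fromCodePoint() accepts the full code point. The variable names are only illustrative.

```javascript
const horseCodePoint = 128014; // U+1F40E

// fromCharCode truncates to 16 bits: 128014 % 65536 === 0xF40E (a private-use character).
const viaFromCharCode = String.fromCharCode(horseCodePoint);

// fromCodePoint produces the full character as a surrogate pair.
const viaFromCodePoint = String.fromCodePoint(horseCodePoint);

console.log(viaFromCharCode === "\uF40E");     // true
console.log(viaFromCodePoint === "\u{1F40E}"); // true
console.log(viaFromCodePoint.length);          // 2
```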

Introduction to Unicode and UTF-16

Skip this section if you are already familiar with it.

Unicode is a set of characters used around the world. UTF-16 encodes "$" as 00000000 00100100 (one 16-bit code unit) and "𤭢" as 11011000 01010010 11011111 01100010 (two 16-bit code units). read more

"Surrogate pair" characters are emojis and some letters that are made up of more than one code unit, as explained here:

The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme. In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF. read more
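The arithmetic behind a surrogate pair can be sketched as follows. This follows the standard UTF-16 encoding rule for code points above 0xFFFF; the helper name toSurrogatePair is hypothetical and is not a built-in.

```javascript
// Hypothetical helper: split a code point above 0xFFFF into its two UTF-16 code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point must be in 0x10000..0x10FFFF");
  }
  const offset = codePoint - 0x10000;       // 20-bit value
  const high = 0xD800 + (offset >> 10);     // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);    // bottom 10 bits
  return [high, low];
}

const [highUnit, lowUnit] = toSurrogatePair(0x24B62);
console.log(highUnit.toString(16), lowUnit.toString(16)); // d852 df62
```

The result matches the two 16-bit values shown for that character in the UTF-16 example above.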

Unicode assigns each character a unique number, called a code point.

Distinguishing charCodeAt() and codePointAt()

charCodeAt(pos) returns a single code unit (not a full character).

If you need a whole character (which can be one or two code units), you can use codePointAt(pos) to get its code.

charCodeAt() - returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index (link)

codePointAt() - returns a non-negative integer that is the Unicode code point value at the given position (link)

Here pos is the index of the character you want to check. Quoting from the book:

UTF-16 is generally considered a bad idea today. It seems almost intentionally designed to invite mistakes. It’s easy to write programs that pretend code units and characters are the same things.

read more
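One way to avoid treating code units as characters is to iterate with for...of, which walks a string by code points. This is a sketch only; the helper name codePointsOf and the sample string are hypothetical.

```javascript
// Hypothetical helper: collect the code point of every character in a string.
function codePointsOf(str) {
  const result = [];
  for (const ch of str) {            // for...of yields whole code points, not code units
    result.push(ch.codePointAt(0));
  }
  return result;
}

const sample = "a\u{1F40E}b";        // "a", horse emoji, "b"
console.log(sample.length);          // 4 (code units)
console.log(codePointsOf(sample));   // [ 97, 128014, 98 ]
```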

jsfiddle sandbox

Sources

  1. What is Unicode, UTF-8, UTF-16?
  2. Marijn Haverbeke, Eloquent JavaScript, 3rd edition: A Modern Introduction to Programming. No Starch Press, 2018. 447 pp. Can be found here
  3. What is "surrogate pair"
  4. To find these emojis, see w3schools.com/charsets/ref_emoji

Chapter 5, p. 91 => Strings and character codes