Why do Unicode emoji property escapes match numbers?

I found this neat way to detect emoji with a regular expression that uses a Unicode property escape:

console.log(/\p{Emoji}/u.test('flowers 🌼🌺🌸')) // true
console.log(/\p{Emoji}/u.test('flowers')) // false

But when I shared this knowledge in this answer, @Bronzdragon noticed that \p{Emoji} also matches numbers! Why is that? Numbers are not emoji, are they?

console.log(/\p{Emoji}/u.test('flowers 123')) // unexpectedly true

// regex-only workaround by @Bronzdragon
const regex = /(?=\p{Emoji})(?!\p{Number})/u;
console.log(
  regex.test('flowers'), // false, as expected
  regex.test('flowers 123'), // false, as expected
  regex.test('flowers 123 🌼🌺🌸'), // true, as expected
  regex.test('flowers 🌼🌺🌸'), // true, as expected
)

// more readable workaround
const hasEmoji = str => {
  const nbEmojiOrNumber = (str.match(/\p{Emoji}/gu) || []).length;
  const nbNumber = (str.match(/\p{Number}/gu) || []).length;
  return nbEmojiOrNumber > nbNumber;
}
console.log(
  hasEmoji('flowers'), // false, as expected
  hasEmoji('flowers 123'), // false, as expected
  hasEmoji('flowers 123 🌼🌺🌸'), // true, as expected
  hasEmoji('flowers 🌼🌺🌸'), // true, as expected
)

According to this post, digits, #, *, ZWJ, and some other characters have the Emoji property set to Yes, which means digits are considered valid emoji characters:

0023          ; Emoji_Component      #  1.1  [1] (#️)       number sign
002A          ; Emoji_Component      #  1.1  [1] (*️)       asterisk
0030..0039    ; Emoji_Component      #  1.1 [10] (0️..9️)    digit zero..digit nine
200D          ; Emoji_Component      #  1.1  [1] (‍)        zero width joiner
20E3          ; Emoji_Component      #  3.0  [1] (⃣)       combining enclosing keycap
FE0F          ; Emoji_Component      #  3.2  [1] ()        VARIATION SELECTOR-16
1F1E6..1F1FF  ; Emoji_Component      #  6.0 [26] (🇦..🇿)    regional indicator symbol letter a..regional indicator symbol letter z
1F3FB..1F3FF  ; Emoji_Component      #  8.0  [5] (🏻..🏿)    light skin tone..dark skin tone
1F9B0..1F9B3  ; Emoji_Component      # 11.0  [4] (🦰..🦳)    red-haired..white-haired
E0020..E007F  ; Emoji_Component      #  3.1 [96] (..)      tag space..cancel tag
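
To see how these properties apply to individual characters, here is a minimal sketch that tests one character against several emoji-related property escapes. The helper name emojiProps is hypothetical, and the results depend on the Unicode data shipped with your JavaScript engine.

```javascript
// Hypothetical helper: report which emoji-related binary properties
// a single character has, using anchored property-escape regexes.
const emojiProps = ch => ({
  Emoji: /^\p{Emoji}$/u.test(ch),
  Emoji_Component: /^\p{Emoji_Component}$/u.test(ch),
  Emoji_Presentation: /^\p{Emoji_Presentation}$/u.test(ch),
  Extended_Pictographic: /^\p{Extended_Pictographic}$/u.test(ch),
});

// A digit has Emoji and Emoji_Component, but is not pictographic.
console.log(emojiProps('1'));
// Same for the number sign.
console.log(emojiProps('#'));
// A real pictographic emoji (U+1F33C BLOSSOM), written as an escape.
console.log(emojiProps('\u{1F33C}'));
```

The digit reports Emoji as true but Extended_Pictographic as false, which is why the two property escapes behave differently on plain numbers.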

For example, 1 is a digit, but it becomes an emoji when followed by the U+FE0F and U+20E3 characters: 1️⃣:

console.log("1\uFE0F\u20E3 2\uFE0F\u20E3 3\uFE0F\u20E3 4\uFE0F\u20E3 5\uFE0F\u20E3 6\uFE0F\u20E3 7\uFE0F\u20E3 8\uFE0F\u20E3 9\uFE0F\u20E3 0\uFE0F\u20E3")
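
As a rough illustration of how such a keycap sequence is put together, the sketch below lists its code points and tests it with a small pattern. The names keycapOne and keycap are made up for this example, and treating U+FE0F as optional is an assumption made here for leniency, not something the post states.

```javascript
// The keycap "1" emoji: digit, VARIATION SELECTOR-16, COMBINING ENCLOSING KEYCAP.
const keycapOne = '1\uFE0F\u20E3';

// Spread iterates by code point, so each element is one character.
const codePoints = [...keycapOne].map(
  ch => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
);
console.log(codePoints); // [ 'U+0031', 'U+FE0F', 'U+20E3' ]

// Illustrative pattern for a whole keycap sequence:
// base character (#, * or a digit), optional VS16, then the enclosing keycap.
const keycap = /[#*0-9]\uFE0F?\u20E3/u;
console.log(keycap.test(keycapOne)); // true
console.log(keycap.test('1'));       // false: a bare digit is not a keycap sequence
```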

If you want to avoid matching digits, use the Extended_Pictographic Unicode property instead:

The Extended_Pictographic characters contain all the Emoji characters except for some Emoji_Components.

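So you can use /\p{Extended_Pictographic}/gu to match most emoji, /\p{Extended_Pictographic}/u to test for a single emoji, or /[\p{Extended_Pictographic}\u{1F3FB}-\u{1F3FF}\u{1F9B0}-\u{1F9B3}]/u to also match the skin tone modifiers (light skin tone through dark skin tone) and the hair components (red-haired through white-haired):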

const regex_emoji = /[\p{Extended_Pictographic}\u{1F3FB}-\u{1F3FF}\u{1F9B0}-\u{1F9B3}]/u;
console.log( regex_emoji.test('flowers 123') );     // => false
console.log( regex_emoji.test('flowers 🌼🌺🌸') ); // => true
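
If you need the matched characters rather than a yes/no answer, the global form can be used with String.prototype.match. This is only a sketch: extractEmoji is a hypothetical helper, and it returns individual pictographic code points, so multi-code-point sequences (ZWJ sequences, flags, keycaps) are not kept together.

```javascript
// Hypothetical helper: collect every Extended_Pictographic code point in a string.
// match() returns null when nothing matches, so fall back to an empty array.
const extractEmoji = str => str.match(/\p{Extended_Pictographic}/gu) || [];

// Emoji written as escapes: U+1F33C blossom, U+1F33A hibiscus, U+1F338 cherry blossom.
const sample = 'flowers 123 \u{1F33C}\u{1F33A}\u{1F338}';

console.log(extractEmoji(sample).length);   // 3: the digits are ignored
console.log(extractEmoji('flowers 123'));   // []
```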