In PHP PCRE syntax, how does one specify a multi-codepoint Unicode character/"emoji"?
Code:
var_dump(preg_replace('#\x{1F634}#u', '', 'This is the sleeping emoji: 😴'));
var_dump(preg_replace('#\x{1F1FB 1F1F3}#u', '', 'This is the Vietnam flag: 🇻🇳'));
Expected output:
string(28) "This is the sleeping emoji: "
string(26) "This is the Vietnam flag: "
Actual output:
string(28) "This is the sleeping emoji: "
string(34) "This is the Vietnam flag: 🇻🇳"
Analysis:
The single-code-point emoji is removed successfully, but the multi-code-point emoji is not detected.
Research done:
I read the following: https://www.php.net/manual/en/regexp.reference.escape.php
After "\x", up to two hexadecimal digits are read (letters can be in upper or lower case). In UTF-8 mode, "\x{...}" is allowed, where the contents of the braces is a string of hexadecimal digits. It is interpreted as a UTF-8 character whose code number is the given hexadecimal number. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if the value is greater than 127.
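Unfortunately, it says nothing about multi-code-point Unicode characters.
Question:
How do I specify a multi-code-point emoji/Unicode character in PHP PCRE syntax?
Clarification:
This is not a range! I am able to detect and remove ranges. This is a single emoji/Unicode character that consists of multiple "code points". Many of them are listed here: https://www.unicode.org/Public/emoji/13.1/emoji-sequences.txt
You quoted the passage saying that \x{...} "is interpreted as a UTF-8 character". That wording is a bit odd, because what it denotes is a Unicode code point (encoded in UTF-8), not a character. Each \x{...} escape takes exactly one code point, so since you need two code points, you also need two such sequences:
var_dump(preg_replace('#\x{1F1FB}\x{1F1F3}#u', '', 'This is the Vietnam flag: 🇻🇳'));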
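If you have many sequences to handle (for example from emoji-sequences.txt), writing the escapes by hand gets tedious. The following is only a minimal sketch of one way to build the pattern from a list of code points. The helper name codepointsToPattern is hypothetical (it is not part of PHP), and the "\u{...}" string escapes assume PHP 7.0 or later.

```php
<?php
// Hypothetical helper (not a PHP built-in): emits one \x{...} escape
// per code point, because each escape can hold only a single code point.
function codepointsToPattern(array $codepoints): string
{
    $fragment = '';
    foreach ($codepoints as $cp) {
        // Single quotes keep the backslash literal; %X prints uppercase hex.
        $fragment .= sprintf('\x{%X}', $cp);
    }
    return $fragment;
}

// Vietnam flag: two regional indicator code points, 4 bytes each in UTF-8.
$flag    = "\u{1F1FB}\u{1F1F3}";
$subject = 'This is the Vietnam flag: ' . $flag;

// The /u modifier is needed so \x{...} may exceed 0xFF.
$pattern = '#' . codepointsToPattern([0x1F1FB, 0x1F1F3]) . '#u';
$result  = preg_replace($pattern, '', $subject);

var_dump($pattern);          // string(20) "#\x{1F1FB}\x{1F1F3}#u"
var_dump(strlen($subject));  // int(34)
var_dump($result);           // string(26) "This is the Vietnam flag: "
```

The byte counts also explain the numbers in the question: the text alone is 26 bytes, and the flag adds 8 bytes (two 4-byte code points), giving 34 when nothing is removed.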