How to find Unicode planes for emojis in Python

I have a pandas DataFrame containing emojis, and I want to classify them by Unicode plane.
emoji | unicode
---------------
 😂   |  1F602
 😊   |  1F60A

Expected output:

emoji | unicode | Plane
-----------------------
 😂   |  1F602  |   1
 😊   |  1F60A  |   1
 ⛹   |  26F9   |   0

Here Plane 0 refers to the Basic Multilingual Plane (BMP), and Plane 1 refers to the Supplementary Multilingual Plane (SMP).

[Note: please use Safari on Mac, Firefox on Linux, or Chrome on Windows to view this question with the emojis rendered correctly]

Please always provide a minimal reproducible example; it helps others help you.

According to your link on Unicode planes:

There are 17 planes, identified by the numbers 0 to 16, which corresponds with the possible values 00–10 (in base 16) of the first two positions in six position hexadecimal format (U+hhhhhh).

Based on that explanation, let's write a function that returns the plane of a character.

# in the comments, we can use char = '😀'
def unicode_to_plane(char: str) -> int:
    unicode_codepoint = ord(char)       # 128512
    hex_repr = hex(unicode_codepoint)   # '0x1f600'
    hex_digits = hex_repr[2:]           # '1f600'
    plane = 0                           # Assume plane is 0 until proven otherwise
    if len(hex_digits) > 4:             # The plane is 0 if hex representation is four hex digits or less
        hex_plane = hex_digits[:-4]     # '1' (take away the last four characters)
        plane = int(hex_plane, 16)      # 1 (convert hex characters to integer)
    return plane                        # 1
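
The table in the question stores code points as hex strings rather than as characters, so a variant that works from that column may be handy. This is only a minimal sketch: `hex_to_plane` is a hypothetical helper name, and the range check against U+10FFFF is an added assumption rather than part of the function above.

```python
def hex_to_plane(code: str) -> int:
    """Return the Unicode plane for a hex code point string such as '1F602'."""
    # int(..., 16) tolerates surrounding whitespace and an optional '0x' prefix.
    codepoint = int(code, 16)
    # Unicode code points run from U+0000 to U+10FFFF (planes 0 to 16).
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("not a valid Unicode code point: {!r}".format(code))
    # Each plane holds 0x10000 code points, so integer division gives the plane.
    return codepoint // 0x10000


codes = ["1F602", "1F60A", "26F9"]
print([hex_to_plane(c) for c in codes])  # [1, 1, 0]
```

With pandas, something like `df["Plane"] = df["unicode"].map(hex_to_plane)` should give the expected column, assuming `df` has a `unicode` column of hex strings as in the question.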

Note that, according to the wiki on Emoji:

Most, but not all, emoji are included in the Supplementary Multilingual Plane (SMP) of Unicode.

and the SMP is plane 1.

😂 and 😊 both belong to Plane 1, the Supplementary Multilingual Plane (SMP).

The following snippet illustrates how to get the Unicode plane number: it is ord(ch)>>16 (see bitwise right shift), because each plane spans 0x10000 code points.

for ch in '✌⛹☹☺☻😂😊':
    print( ch, '\t{:04x}\t'.format(ord(ch)), ord(ch)>>16)
✌       270c     0
⛹       26f9     0
☹       2639     0
☺       263a     0
☻       263b     0
😂       1f602    1
😊       1f60a    1
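
Some emoji stored in a DataFrame cell are sequences of several code points (for example a base character followed by a variation selector), and `ord` only accepts a single character. The sketch below applies the same right shift to every code point in a string and maps plane numbers to their names. `planes_of`, `plane_name` and `PLANE_NAMES` are hypothetical names chosen for illustration, and the "Unassigned" label for planes 4 to 13 is my own wording.

```python
from typing import List

# Names of the planes that currently have one; planes 4-13 are unassigned.
PLANE_NAMES = {
    0: "Basic Multilingual Plane (BMP)",
    1: "Supplementary Multilingual Plane (SMP)",
    2: "Supplementary Ideographic Plane (SIP)",
    3: "Tertiary Ideographic Plane (TIP)",
    14: "Supplementary Special-purpose Plane (SSP)",
    15: "Supplementary Private Use Area-A",
    16: "Supplementary Private Use Area-B",
}


def planes_of(text: str) -> List[int]:
    """Return the plane of every code point in text, in order."""
    return [ord(ch) >> 16 for ch in text]


def plane_name(plane: int) -> str:
    """Return a readable name for a plane number."""
    return PLANE_NAMES.get(plane, "Unassigned plane {}".format(plane))


# U+26F9 followed by variation selector U+FE0F: both code points are in plane 0.
print(planes_of("\u26f9\ufe0f"))
# U+1F602 is a single code point in plane 1.
print([plane_name(p) for p in planes_of("\U0001F602")])
```

If a cell may mix planes, one option is to keep the whole list per cell; another is to classify by the first code point only, which is a design choice the question does not settle.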