How to properly iterate over unicode characters in Python

Question

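I want to iterate over a string and output all of the emojis in it.

I'm trying to iterate over the characters and check them against an emoji list.

However, Python seems to split the unicode characters into smaller ones, which breaks my code. Example: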

>>> list(u'Test \U0001f60d')
[u'T', u'e', u's', u't', u' ', u'\ud83d', u'\ude0d']

Any idea why u'\U0001f60d' gets split?

Or what would be a better way to extract all the emojis? This is my original extraction code:

def get_emojis(text):
  emojis = []
  for character in text:
    if character in EMOJI_SET:
      emojis.append(character)
  return emojis

Answer 1

Try this:

import re
re.findall(r'[^\w\s,]', my_list[0])

The regular expression r'[^\w\s,]' matches any character that is not a word character, whitespace, or a comma.
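To see what this pattern actually picks up, here is a minimal sketch on Python 3. The sample string is made up for illustration. Note that the character class is not emoji-specific: it also matches punctuation such as '!'.

```python
import re

# Hypothetical sample text; \U0001f60d is the emoji from the question.
sample = 'Hi, there! \U0001f60d'

# On Python 3, \w and \s are Unicode-aware for str patterns, so letters,
# digits, underscores and whitespace are excluded; the comma is excluded
# explicitly. Everything else, including '!' and the emoji, is matched.
matches = re.findall(r'[^\w\s,]', sample)
print(matches)
```

If only emojis are wanted, the matches would still need to be filtered against an emoji list.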

Answer 2

Before version 3.3, Python stored Unicode internally as UTF-16LE (narrow build) or UTF-32LE (wide build), and because of a leaky abstraction it exposes this detail to the user. UTF-16LE uses surrogate pairs to represent Unicode characters above U+FFFF as two code units. Use a wide Python build or switch to Python 3.3 or later to fix the problem.
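As an illustration of where the two values in the question come from, the sketch below applies the standard UTF-16 surrogate arithmetic to U+1F60D. The helper name to_surrogate_pair is made up for this example and is not part of any library.

```python
def to_surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not in a supplementary plane')
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F60D)
print(hex(high), hex(low))  # 0xd83d 0xde0d

# Cross-check against Python's own UTF-16LE encoder (Python 3).
encoded = '\U0001f60d'.encode('utf-16-le')
```

These are the same u'\ud83d' and u'\ude0d' that list() returned on the narrow build.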

One way to deal with a narrow build is to match the surrogate pairs:

Python 2.7 (narrow build):

>>> import re
>>> s = u'Test \U0001f60d'
>>> len(s)
7
>>> re.findall(u'(?:[\ud800-\udbff][\udc00-\udfff])|.',s)
[u'T', u'e', u's', u't', u' ', u'\U0001f60d']
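Applied to the function from the question, the same pattern could be used roughly as follows. This is only a sketch: EMOJI_SET here is a one-element stand-in for the asker's real set, and the code is written so that it should behave the same on a narrow Python 2.7 build and on Python 3.

```python
import re

# Matches a surrogate pair as one unit, otherwise any single character
# except a newline (pattern taken from the session above).
CHAR_RE = re.compile(u'(?:[\ud800-\udbff][\udc00-\udfff])|.')

# Hypothetical stand-in for the asker's EMOJI_SET.
EMOJI_SET = {u'\U0001f60d'}


def get_emojis(text):
    # Iterate over regex matches instead of over the string itself, so a
    # surrogate pair is compared against the set as a single item.
    return [ch for ch in CHAR_RE.findall(text) if ch in EMOJI_SET]


result = get_emojis(u'Test \U0001f60d')
print(result)
```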

Python 3.6:

>>> s = 'Test \U0001f60d'
>>> len(s)
6
>>> list(s)
['T', 'e', 's', 't', ' ', '😍']

Answer 3

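I've been fighting with Unicode myself, and it's not as easy as it seems. This emoji library wraps up all the caveats (I'm not affiliated with it).

If you want to list all the emojis that appear in a string, I recommend emoji.emoji_lis.

Just take a look at the source code of emoji.emoji_lis to see how complex it actually is.

Example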

>>> emoji.emoji_lis('')
>>> [{'location': 0, 'emoji': ''}, {'location': 1, 'emoji': ''}, {'location': 2, 'emoji': ''}]

List example (does not always work)

>>> list('')
>>> ['', '', '', '']
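One likely reason list() is not always enough, even on Python 3, is that some emojis are built from several code points. The sketch below uses two such sequences, chosen here for illustration: a thumbs-up with a skin-tone modifier and a flag made of two regional indicator symbols.

```python
# Thumbs-up (U+1F44D) followed by a skin-tone modifier (U+1F3FD):
# displayed as one emoji, but stored as two code points.
thumbs_up = '\U0001F44D\U0001F3FD'

# Flags are pairs of regional indicator symbols (here the letters U and S).
flag = '\U0001F1FA\U0001F1F8'

for emoji_text in (thumbs_up, flag):
    # list() splits on code points, so each sequence comes apart.
    print(len(emoji_text), [hex(ord(ch)) for ch in emoji_text])
```

Each sequence has length 2, so a per-character membership test would see the pieces rather than the emoji as displayed.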
