提取列表中的 Unicode-Emoticons，Python 3.x

Question

我处理一些推特数据，我想过滤列表中的表情符号。数据本身是用 utf8 编码的。我像这三个示例行一样逐行读取文件：

['This', 'is', 'a', 'test', 'tweet', 'with', 'two', 'emoticons', '', '⚓️']
['This', 'is', 'another', 'tweet', 'with', 'a', 'emoticon', '']
['This', 'tweet', 'contains', 'no', 'emoticon']

我想收集这样的每行表情：

['', '⚓️']

等等。

我已经研究并发现 python 中有一个 'emoji' 包。我试着像那样在我的代码中使用它

import emoji

with open("file.txt", "r", encoding='utf-8') as f:
    for line in f:
        elements = []
        col = line.strip('\n')
        cols = col.split('\t')
        elements.append(cols)

        emoji_list = []
        data = re.findall(r'\X', elements)
        for word in data:
            if any(char in emoji.UNICODE_EMOJI for char in word):
                emoji_list.append(word)

第一次尝试

import emoji

with open("file.txt", "r", encoding='utf-8') as f:
    for line in f:
        elements = []
        col = line.strip('\n')
        cols = col.split('\t')
        elements.append(cols)

        emoji_list = []

        for c in elements:
            if c in emoji.UNICODE_EMOJI:
                emojilist.append(c)

第二次尝试

我尝试了此处给出的示例，但它们有点不适合我，我不确定我做错了什么。

非常感谢任何帮助提取表情符号的帮助，在此先感谢！ :)

Answer 1

表情符号存在于几个 Unicode 范围内，由这个正则表达式模式表示：

>>> import re
>>> emoji = re.compile('[\u203C-\u3299\U0001F000-\U0001F644]')

您可以使用它来过滤您的列表：

>>> list(filter(emoji.match, ['This', 'is', 'a', 'test', 'tweet', 'with', 'two', 'emoticons', '', '⚓️']))
['', '⚓️']

N.B.: 该模式是一个近似值，可能会捕获一些额外的字符。

提取列表中的 Unicode-Emoticons，Python 3.x

Extract Unicode-Emoticons in list, Python 3.x

python

unicode

emoticons

python-3.x