Sklearn CountVectorizer: keeping emojis as words

Question

I'm using scikit-learn's CountVectorizer on strings, but it discards all the emojis in the text.

For example, 👋 Welcome should give us: ["\xf0\x9f\x91\x8b", "welcome"]

However, when I run:

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit_transform(['👋 Welcome'])

I only get: ["welcome"]

This has to do with token_pattern, which does not count the encoded emoji as a word. Is there a custom token_pattern that handles emojis?

Answer 1

Try using the parameters CountVectorizer(analyzer='char', binary=True). This tokenizes by character rather than by word, so token_pattern no longer applies.

The documentation says: "token_pattern: Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'". See https://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html.

See also this notebook: https://www.kaggle.com/kmader/toxic-emojis

Answer 2

There are also a couple of packages that convert emojis/emoticons directly into words, for example:

>>> import emot
>>> text = "I love python 👨 :-)"
>>> emot.emoji(text)
[{'value': '👨', 'mean': ':man:', 'location': [14, 14], 'flag': True}]

>>> import emoji
>>> print(emoji.demojize('Python is 👍'))
Python is :thumbs_up:
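
If installing an extra package is not an option, a similar idea can be sketched with the standard library alone. This is not what emot or emoji do internally: the helper name emoji_to_word and the emoji_ prefix are made up for illustration, and treating every character in Unicode category "So" as an emoji is a simplifying assumption that ignores multi-codepoint emojis. The point is only that once an emoji is replaced by a word-like token, the default token_pattern keeps it.

```python
import re
import unicodedata

def emoji_to_word(text):
    """Replace 'Symbol, other' characters with a word-like token (hypothetical helper)."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == 'So':
            # Turn the Unicode character name into an underscore-joined word
            name = unicodedata.name(ch, 'unknown').lower()
            name = re.sub(r'\W+', '_', name)
            out.append(f' emoji_{name} ')
        else:
            out.append(ch)
    return ''.join(out)

converted = emoji_to_word('👋 Welcome').lower()
# CountVectorizer's default token_pattern, applied here with plain re
tokens = re.findall(r'(?u)\b\w\w+\b', converted)
print(tokens)
```

The converted strings can then be passed to CountVectorizer unchanged, since each emoji now looks like an ordinary word.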

Answer 3

Yes, you're right! token_pattern has to be changed. Instead of matching only alphanumeric characters, we can make it match any character other than whitespace.

Try this!

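from sklearn.feature_extraction.text import TfidfVectorizer
s = ['👋 Welcome', '👋 Welcome']

# Treat any run of non-whitespace characters as a token
v = TfidfVectorizer(token_pattern=r'[^\s]+')
v.fit(s)
list(v.get_feature_names_out())  # get_feature_names() before scikit-learn 1.0

# ['welcome', '👋']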
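
To see why the pattern change matters, the two regular expressions can be compared directly with Python's re module, without scikit-learn. The first pattern is CountVectorizer's documented default token_pattern; the sample string with a trailing "!" is an added illustration, not from the question.

```python
import re

text = '👋 welcome!'

default_pattern = r'(?u)\b\w\w+\b'   # scikit-learn's default token_pattern
non_space_pattern = r'[^\s]+'        # pattern suggested in this answer

default_tokens = re.findall(default_pattern, text)
non_space_tokens = re.findall(non_space_pattern, text)

print(default_tokens)    # ['welcome']
print(non_space_tokens)  # ['👋', 'welcome!']
```

Note the trade-off: [^\s]+ keeps emojis, but it also leaves punctuation attached to words, so "welcome" and "welcome!" would become separate features.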
