Counter for words and emoji

Question

I have a dataframe with a column "clear_message", and I created a column that counts all the words in each row.

history['word_count'] = history.clear_message.apply(lambda x: Counter(x.split(' ')))

For example, if the row's message is: Hello my name is Hello, then the counter for that row will be Counter({'Hello': 2, 'is': 1, 'my': 1, 'name': 1})

The problem

My text contains emoji, and I also want a counter for the emoji.

For example:

test = 'here sasdsa'
test_counter = Counter(test.split(' '))

The output is:

Counter({'sasdsa': 1, 'here': 1})

But I want:

Counter({'sasdsa': 1, '': 5, 'here':1})

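The problem is clearly that I am using split(' '), which only separates tokens at spaces, so emoji that touch each other or a word are never split apart.

What I have thought of:

Adding a space before and after each emoji. Like: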

test = '     here sasdsa'

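and then using split, which would work.

I am not sure this approach is the best, and I don't know how to implement it. (I know that if i is an emoji, then i in emoji.UNICODE_EMOJI returns True, using the emoji package.)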

Answer 1

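I think your idea of adding a space after every emoji is a good approach. You will also need to strip whitespace in case there is already a space between an emoji and the next character, but that is simple enough. Something like: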

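import emoji  # third-party package; UNICODE_EMOJI exists in older releases (newer ones expose emoji.EMOJI_DATA)

def emoji_splitter(text):
    new_string = ""
    for char in text:
        if char in emoji.UNICODE_EMOJI:
            # Pad each emoji with spaces so it becomes its own token.
            new_string += " {} ".format(char)
        else:
            new_string += char
    # Split on spaces, strip each piece, and drop the empty strings left by doubled spaces.
    return [v for v in map(lambda x: x.strip(), new_string.split(" ")) if v != ""]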

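Maybe you could improve on this by using a sliding window to check for spaces after emoji and only adding spaces where necessary, but that would assume there is only ever one space, whereas this solution should account for 0 to n spaces between emoji.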
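To illustrate the "0 to n spaces" point, here is a minimal sketch of the same padding idea. As an assumption, it replaces the emoji.UNICODE_EMOJI lookup with a standard-library stand-in (the Unicode category "So"), so it runs without the emoji package. The is_emoji parameter and the sample strings are made up for illustration.

```python
import unicodedata


def emoji_splitter(text, is_emoji=lambda c: unicodedata.category(c) == "So"):
    # Assumption: category "So" (Symbol, other) approximates
    # `char in emoji.UNICODE_EMOJI`; it also matches some non-emoji
    # symbols, so pass a stricter predicate if you need one.
    padded = "".join(" {} ".format(c) if is_emoji(c) else c for c in text)
    # split() with no argument collapses any run of whitespace.
    return padded.split()


# The same tokens with zero, one, and several spaces around the emoji.
samples = [
    "a\U0001F600\U0001F600b",
    "a \U0001F600 \U0001F600 b",
    "a   \U0001F600   \U0001F600   b",
]
results = [emoji_splitter(s) for s in samples]
print(results)
```

All three inputs should produce the same four tokens, since the padding is added unconditionally and the extra whitespace is discarded by the split. Because the check works one character at a time, emoji made of several code points (for example with skin-tone modifiers) would not be kept together as a single token.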

Answer 2

@con--'s answer has some issues, so I fixed it.

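import emoji  # third-party package; UNICODE_EMOJI exists in older releases

def emoji_splitter(text):
    new_string = ""
    # Collapse runs of whitespace so words stay separated by single spaces.
    for char in ' '.join(text.split()):
        if char in emoji.UNICODE_EMOJI:
            # Pad on both sides so an emoji glued to a word is still split off.
            new_string += " " + char + " "
        else:
            new_string += char
    # split() with no argument discards the extra spaces added above.
    return new_string.split()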

Example:

emoji_splitter(' a ads')
Out[7]: ['a', '', '', '', 'ads']
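To tie this back to the word_count column in the question, here is a rough sketch of feeding the split tokens into Counter for each message. It is only an illustration under stated assumptions: is_symbol is a standard-library stand-in for the emoji.UNICODE_EMOJI check, count_tokens is a hypothetical helper name, and the messages list stands in for history.clear_message.

```python
from collections import Counter
import unicodedata


def is_symbol(char):
    # Assumption: approximates `char in emoji.UNICODE_EMOJI` using the
    # Unicode category "So"; swap in the emoji package's check if installed.
    return unicodedata.category(char) == "So"


def count_tokens(message):
    # Pad every emoji-like character, then split on whitespace and count.
    padded = "".join(" {} ".format(c) if is_symbol(c) else c for c in message)
    return Counter(padded.split())


# Hypothetical sample messages standing in for history.clear_message.
messages = ["Hello my name is Hello", "\U0001F600\U0001F600 here\U0001F600"]
word_counts = [count_tokens(m) for m in messages]
print(word_counts[1])

# With pandas, the same function could be passed to apply:
# history['word_count'] = history.clear_message.apply(count_tokens)
```

Words and emoji end up in the same Counter here. If you want them separate, you could filter the keys with the same predicate afterwards.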
