Python emoji search and replace not working as expected

Question

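I am trying to separate the emojis in a given text from the other characters/words/emojis. I want to use the emojis later as features for text classification, so it is important that each emoji in a sentence is treated as a separate character.

The code: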

import re

text = "I am very #happy man but my wife is not "
print(text) #line a

reg = re.compile(u'['
    u'\U0001F300-\U0001F64F'
    u'\U0001F680-\U0001F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]+', 
    re.UNICODE)

#padding the emoji with space at both the ends
new_text = reg.sub(' \1 ',text) 
print(new_text) #line b

# this is just to test if it can still identify the emoji in new_text
new_text2 = reg.sub('#\1#', new_text) 
print(new_text2) # line c

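Here is the actual output:

(I had to paste a screenshot, because copy-pasting the output from the terminal distorts the already distorted emojis in lines b and c even further.)

Here is my expected output: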

I am very #happy man but my wife is not 
I am very #happy man but     my wife   is not     
I am very #happy man but ##  ##  my wife ##  is not  ##  ##

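Questions:

1) Why is the search and replace not working as expected? What is each emoji being replaced with (line b)? It is definitely not the Unicode of the original emoji, otherwise line c would print the emoji with a # on both sides.

2) I am not sure I am right about this, but why are grouped emojis replaced with a single emoji/Unicode character (line b)?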

Answer 1

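There are several issues here.

1) There is no capturing group in the regex pattern, but the replacement pattern uses \1, a backreference to group 1. The most natural fix is to use a backreference to group 0, i.e. the whole match: \g<0>.

2) The \1 in the replacement is not actually parsed as a backreference at all, but as the character with octal value 1, because a backslash in a regular (non-raw) string literal forms an escape sequence. Here, it is an octal escape. Use a raw string literal (r'...') for the replacement.

3) The + after the ] means that the regex engine must match one or more occurrences of the character class, so you match sequences of emojis rather than each individual emoji.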
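The three points above can be checked in isolation. The following is a minimal sketch, assuming Python 3; the sample string and the variable names are made up for illustration, and the emojis are written as escapes so the example does not depend on how they are rendered.

```python
import re

# Hypothetical sample: two adjacent emojis (U+1F618) between two letters.
sample = "a \U0001F618\U0001F618 b"

# In a regular (non-raw) string literal, '\1' is an octal escape for the
# character U+0001, not a regex backreference.
assert ' \1 ' == ' \x01 '
# A raw string keeps the backslash and the digit as two characters.
assert len(r'\1') == 2

emoji_run = re.compile('[\U0001F300-\U0001F64F]+')  # with '+'
emoji_one = re.compile('[\U0001F300-\U0001F64F]')   # without '+'

# With '+', the two adjacent emojis form one match; without it, two.
runs = emoji_run.findall(sample)
singles = emoji_one.findall(sample)
print(len(runs), len(singles))

# \g<0> refers to the whole match, so no capturing group is needed.
padded = emoji_one.sub(r' \g<0> ', sample)
print(ascii(padded))
```

Printing with ascii() shows the code points rather than the glyphs, which helps when the terminal distorts the emojis.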

Use

import re

text = "I am very #happy man but my wife is not "
print(text) #line a

reg = re.compile(u'['
    u'\U0001F300-\U0001F64F'
    u'\U0001F680-\U0001F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]', 
    re.UNICODE)

#padding the emoji with space at both ends
new_text = reg.sub(r' \g<0> ',text) 
print(new_text) #line b

# this is just to test if it can still identify the emojis in new_text
new_text2 = reg.sub(r'#\g<0>#', new_text) 
print(new_text2) # line c

See the Python demo, which prints

I am very #happy man but my wife is not 
I am very #happy man but     my wife   is not     
I am very #happy man but ##  ##  my wife ##  is not  ##  ##
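Since the stated goal is to use each emoji as a separate feature for text classification, the padded text can be split on whitespace, or the emojis can be collected directly. This is only a sketch under assumptions: the helper names tokenize and emoji_counts and the sample text are hypothetical, and the character ranges are the ones from the pattern above, which do not cover every emoji.

```python
import re
from collections import Counter

# Same ranges as the pattern above, without the '+' quantifier,
# so that every emoji is matched on its own.
EMOJI = re.compile(
    '['
    '\U0001F300-\U0001F64F'
    '\U0001F680-\U0001F6FF'
    '\u2600-\u26FF\u2700-\u27BF]'
)

def tokenize(text):
    """Pad every emoji with spaces, then split on whitespace."""
    return EMOJI.sub(r' \g<0> ', text).split()

def emoji_counts(text):
    """Count how often each individual emoji occurs in the text."""
    return Counter(EMOJI.findall(text))

# Hypothetical sample text; the emojis are written as escapes.
sample = "not happy \U0001F61E\U0001F618 today\U0001F618"
tokens = tokenize(sample)
counts = emoji_counts(sample)
print(len(tokens), sum(counts.values()))
```

Here tokens holds the words and the emojis as separate items, and counts can be fed into a bag-of-words style feature vector.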

python

regex

string

emoji