Replace a particular string pattern in a string (Python)

Question

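I have some sentences containing emoji Unicode values, written as Unicode patterns such as U0001. I need to extract every string that contains U0001 into an array. This is the code I have tried: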

    import re
    
    pattern = re.compile(r"^U0001")
    sentence = 'U0001f308 U0001f64b The dark clouds disperse the hail subsides and one neon lit rainbow with a faint second arches across the length of the A u2026'
    print(pattern.match(sentence).group()) #this prints U0001 every time but what i want is ['U0001f308']

    matches = re.findall(r"^\w+", sentence)
    print(matches) # This only prints the first match which is 'U0001f308'

Is there a way to extract these strings into an array? I don't have much experience with regular expressions.

Answer 1

'U0001f308' is not an emoji code point! It is a 9-character string that begins with the letter 'U'.

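The way to enter a Unicode code point with more than 4 hex digits is \U0001f308 (a capital U followed by exactly 8 hex digits). Similarly, a code point with 4 hex digits is entered as \u0001.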
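A quick check makes the difference concrete. This is only a sketch using built-ins; the variable names are made up for illustration.

```python
# An escaped code point is one character; the same text without the
# backslash is just nine ordinary ASCII characters.
emoji = '\U0001f308'      # a single code point (RAINBOW)
literal = 'U0001f308'     # the letter 'U' followed by eight more characters

emoji_len = len(emoji)        # number of characters in the escaped form
literal_len = len(literal)    # number of characters in the plain text form
code_point = hex(ord(emoji))  # numeric value of the single code point

print(emoji_len, literal_len, code_point)
```

A regex that looks for the text "U0001" can therefore only ever match the second kind of string.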

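However, you cannot search for code points that "start with 0001" the way you would search an ordinary string. It looks to me like you want either the 4-hex-digit code point \u0001 or anything in the range \U00010000 - \U0001FFFF: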

    import re

    sentence = '\U0001f308 \U0001f64b The dark clouds disperse the hail subsides and one neon lit rainbow with a faint second arches across the length of the A \u2026'

    # Character class: U+0001 itself, or any code point in U+10000..U+1FFFF
    matches = re.findall('[\u0001\U00010000-\U0001FFFF]', sentence)
    print(matches)
    # matches -> ['\U0001f308', '\U0001f64b']

If for some reason you really do have strings that begin with 'U' rather than actual code points, then:

    matches = re.findall('U0001(?:[0-9a-fA-F]{4})?', sentence)

I am also assuming that the emoji can appear anywhere in the string and be adjacent to any other character.
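If the input really does hold literal text such as 'U0001f308', one possible follow-up is to extract those tokens and convert each one into the character it names. The sketch below assumes every token is 'U0001' followed by exactly four hex digits; `TOKEN` and `literal_to_char` are hypothetical names introduced here, not part of the original code.

```python
import re

sentence = ('U0001f308 U0001f64b The dark clouds disperse the hail subsides '
            'and one neon lit rainbow with a faint second arches across '
            'the length of the A u2026')

# Assumption: each token is 'U0001' followed by exactly four hex digits.
# There is no '^' anchor, so tokens are found anywhere in the string.
TOKEN = re.compile(r'U0001[0-9a-fA-F]{4}')

def literal_to_char(token):
    """Turn a literal such as 'U0001f308' into the character it names."""
    # Drop the leading 'U', parse the rest as hex, then map to a character.
    return chr(int(token[1:], 16))

tokens = TOKEN.findall(sentence)              # the literal strings
chars = [literal_to_char(t) for t in tokens]  # the real code points

print(tokens)
print(chars)
```

Note that the attempts in the question only ever matched at the start of the string because of the `^` anchor; dropping it is what lets `findall` return every occurrence.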


python

regex

unicode