Regex for a special character (S with a line on top)

Question

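I'm trying to write a regex in Python that replaces every non-ASCII character with an underscore, but when one of the characters is "S̄" (an 'S' with a line on top), the output contains an extra 'S'. Is there a way to account for this character as well? I believe it is a valid UTF-8 character, just not an ASCII one.

Here is the code: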

import re
line = "ra*ndom wordS̄"
print(re.sub('[\W]', '_', line))

I expect it to output:

ra_ndom_word_

But what I get is:

ra_ndom_wordS__

Answer 1

Python behaves this way because you are actually looking at two distinct characters: an S followed by a combining macron, U+0304.
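To see the two code points for yourself, a minimal sketch along these lines may help. The helper name `describe` is made up for this illustration; it only uses the standard `unicodedata` module.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, combining class) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in text
    ]

# "S" followed by the combining macron, written with an explicit escape
for row in describe("S\u0304"):
    print(row)
# ('U+0053', 'LATIN CAPITAL LETTER S', 0)
# ('U+0304', 'COMBINING MACRON', 230)
```

A non-zero combining class marks a character that attaches to the base character before it.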

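In the general case, if you want to replace a base character together with its sequence of combining characters with an underscore, try something like: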

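import unicodedata

def cleanup(line):
    cleaned = []
    for char in line:
        if unicodedata.combining(char):
            # A combining mark modifies the preceding base character,
            # so replace that base character and drop the mark itself.
            if cleaned:
                cleaned[-1] = "_"
            continue
        # Anything that is not a lowercase or uppercase letter becomes "_".
        if unicodedata.category(char) not in ("Ll", "Lu"):
            char = "_"
        cleaned.append(char)
    return ''.join(cleaned)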

By the way, \W does not need square brackets; it is already a regex character class.
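A quick check of that point, as a small sketch on an ASCII-only input:

```python
import re

line = "ra*ndom word"
with_brackets = re.sub(r'[\W]', '_', line)
without_brackets = re.sub(r'\W', '_', line)

print(with_brackets)     # ra_ndom_word
print(without_brackets)  # ra_ndom_word
```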

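Python's re module lacks support for many important Unicode properties, but if you really want to use a regex for this specifically, the third-party regex library has proper support for Unicode categories.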

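"Ll" is the lowercase letter category and "Lu" is the uppercase letter category. There are other Unicode L categories, so you may want to tweak this to suit your requirements (unicodedata.category(char).startswith("L"), perhaps?); see also https://www.fileformat.info/info/unicode/category/index.htm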
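As a rough sketch of that tweak, the variant below keeps every character whose category starts with "L" and still replaces a base character that carries a combining mark. The function name `cleanup_any_letter` is made up for this example, and whether you want to keep non-ASCII letters at all depends on your requirements.

```python
import unicodedata

def cleanup_any_letter(line):
    """Keep characters in any Unicode letter category; replace the rest with "_"."""
    cleaned = []
    for char in line:
        if unicodedata.combining(char):
            # The mark belongs to the previous character: replace that one
            # with "_" and drop the mark itself.
            if cleaned:
                cleaned[-1] = "_"
            continue
        if not unicodedata.category(char).startswith("L"):
            char = "_"
        cleaned.append(char)
    return "".join(cleaned)

print(cleanup_any_letter("ra*ndom wordS\u0304"))  # ra_ndom_word_
print(cleanup_any_letter("abc123"))               # abc___
```

Note that precomposed letters such as "Ä" (a single code point in category "Lu") are kept by this variant, because no combining mark is involved.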

Answer 2

You can get the desired output with the following script:

import re

line="ra*ndom wordS̄"
print(re.sub('[^[-~]+]*','_',line))

Output:

ra_ndom_word_

This approach also works for other non-ASCII characters:

import re

line="ra*ndom ¡¢£Ä wordS̄.  another non-ascii: Ä and Ï"
print(re.sub('[^[-~]+]*','_',line))

Output:

ra_ndom_word_another_non_ascii_and_
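It may help to see exactly which characters this pattern keeps. The class [^[-~] excludes only the range U+005B ("[") through U+007E ("~"), so every run of characters outside that range is collapsed into one underscore, including uppercase ASCII letters and digits. The sketch below uses the same class with the inner "[" escaped for readability; the name `squash` is made up for this example.

```python
import re

# Same character class as in the answer, with the inner "[" escaped.
PATTERN = re.compile(r'[^\[-~]+\]*')

def squash(line):
    """Collapse each run of characters outside U+005B..U+007E into one "_"."""
    return PATTERN.sub('_', line)

print(squash("ra*ndom wordS\u0304"))  # ra_ndom_word_
print(squash("Hello World 42"))       # _ello_orld_
```

The "S" in the question's input disappears because it is uppercase, not because the pattern recognises the combining macron.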


python

regex

ascii