Maintaining the consistency of strings before and after converting to ASCII

Question

I have many Unicode strings, such as carbon copolymers—III\n12- Géotechnique\n and many more containing a variety of Unicode characters, stored in a string variable named txtWords.

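My goal is to remove all non-ASCII characters while keeping the strings consistent. For example, I want the first sentence to become carbon copolymers III or carbon copolymers iii (case does not matter here), the second to become geotechnique\n, and so on.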

I am currently using the following code, but it does not give me what I expect. It turns carbon copolymers—III into carbon copolymersiii, which is definitely not what it should be:

import unicodedata, re
txtWords = unicodedata.normalize('NFKD', txtWords.lower()).encode('ascii','ignore')
txtWords = re.sub(r'[^a-z^\n]',r' ',txtWords)

If I apply the regular expression first, I get an even worse result (relative to what I expect):

    import unicodedata, re
    txtWords = re.sub(r'[^a-z^\n]',r' ',txtWords)
    txtWords = unicodedata.normalize('NFKD', txtWords.lower()).encode('ascii','ignore')

With this, for the string Géotechnique\n I get otechnique!

How can I fix this?

Answer 1

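Use the \w regular expression to replace non-alphanumeric characters with spaces before applying the decomposition trick, so that a separator such as the em dash leaves a space behind instead of being silently dropped by encode: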

#coding:utf8
from __future__ import unicode_literals,print_function
import unicodedata as ud
import re
txtWords = 'carbon copolymers—III\n12- Géotechnique\n'
txtWords = re.sub(r'[^\w\n]',r' ',txtWords.lower(),flags=re.U)
txtWords = ud.normalize('NFKD',txtWords).encode('ascii','ignore').decode()
print(txtWords)

Output (Python 2 and 3):

carbon copolymers iii
12  geotechnique
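
The same steps can be wrapped in a small helper for reuse. This is only a minimal sketch, assuming Python 3 (where str patterns are Unicode-aware by default, so flags=re.U is not needed); the function name to_ascii_words is hypothetical and not part of the original answer.

```python
import re
import unicodedata


def to_ascii_words(text):
    """Lowercase text, turn punctuation into spaces, then strip accents.

    Hypothetical helper name; the order of the steps follows the answer above.
    """
    # Replace everything that is neither a word character nor a newline with
    # a space *before* encoding, so separators such as the em dash leave a gap.
    spaced = re.sub(r'[^\w\n]', ' ', text.lower())
    # NFKD splits 'é' into 'e' plus a combining accent; the accent is not
    # ASCII, so encode(..., 'ignore') drops it and keeps the base letter.
    decomposed = unicodedata.normalize('NFKD', spaced)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


sample = 'carbon copolymers—III\n12- Géotechnique\n'
print(repr(to_ascii_words(sample)))
```

Note that \w also matches digits and the underscore, so the 12 is kept. Letters that have no ASCII decomposition are still dropped by the final encode step without leaving a space.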


Tags: regex, string, unicode, consistency, python-2.7