Using re to sanitize a word file, allowing letters with hyphens and apostrophes

Question

This is what I have so far:

import re

def read_file(file):
    words = []
    for line in file:
        for word in line.split():
            # Strip every character that is not a lowercase ASCII letter
            words.append(re.sub("[^a-z]", "", word.lower()))
    return words

As it stands, this reads "can't" as "cant" and "co-ordinate" as "coordinate". I want to read the words so that these two punctuation marks are kept. How can I modify my code to do this?

Answer 1

There are two approaches. The first is the one ritisht93 suggested in a comment on the question, although I would use

words.append(re.sub("[^-'a-z]+", "", word.lower()))
                       ^^    ^ - One or more occurrences to remove in one go
                        | - Apostrophe and hyphen added

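The + quantifier makes each match consume a whole run of consecutive unwanted characters, so the run is removed in one go instead of one character at a time.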

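Note that the hyphen is placed at the start of the negated character class, right after the ^, so it does not need to be escaped. That said, if developers who are less familiar with regular expressions will maintain this later, escaping it anyway is still advisable.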
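Putting the first approach together, here is a minimal sketch for Python 3. The names CLEAN_ASCII and clean_words are made up for illustration, and skipping tokens that end up empty is my own assumption, not something the question asked for.

```python
import re

# Hypothetical name: matches runs of anything that is not a hyphen,
# an apostrophe, or a lowercase ASCII letter.
CLEAN_ASCII = re.compile(r"[^-'a-z]+")

def clean_words(lines):
    """Return the cleaned, lowercased words from an iterable of text lines."""
    words = []
    for line in lines:
        for word in line.split():
            cleaned = CLEAN_ASCII.sub("", word.lower())
            if cleaned:  # assumption: drop tokens that were only digits/punctuation
                words.append(cleaned)
    return words

print(clean_words(["I can't co-ordinate this, sorry!", "42 is OK."]))
# ['i', "can't", 'co-ordinate', 'this', 'sorry', 'is', 'ok']
```

An open file object can be passed directly as lines, since iterating over it yields one line at a time.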

The second approach helps if your text contains Unicode letters.

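ur"((?![-'])[\W\d_])+"

See the regex demo (compile the pattern with the re.UNICODE flag). The ur prefix is Python 2 syntax; in Python 3, use a plain r"..." string, since str patterns are Unicode-aware by default.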

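The pattern matches one or more characters that are not letters: any non-word character, digit, or underscore ([\W\d_]), except a hyphen or an apostrophe, which the negative lookahead (?![-']) excludes.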
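As a rough illustration of the second approach, here is a sketch for Python 3, where str patterns already treat \W and \d as Unicode-aware, so no u prefix or re.UNICODE flag is needed. The names NON_LETTERS and clean_unicode_word are hypothetical.

```python
import re

# Hypothetical name: runs of non-letters, excluding hyphen and apostrophe.
# The lookahead (?![-']) is checked before each character is consumed.
NON_LETTERS = re.compile(r"((?![-'])[\W\d_])+")

def clean_unicode_word(word):
    """Lowercase a word and strip everything except letters, '-' and "'"."""
    return NON_LETTERS.sub("", word.lower())

print(clean_unicode_word("Café-au-lait!"))  # café-au-lait
print(clean_unicode_word("naïve_42"))       # naïve
print(clean_unicode_word("l'été,"))         # l'été
```

This assumes accented letters are stored as single precomposed code points; a separate combining accent is not a word character and would be stripped.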
