Python regex on a binary text file - how to use a range of numbers and a word boundary?

Question

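I have a text file that I need to read in as binary and write back out as binary. No problem. I need to mask Social Security numbers with Xs, which is usually easy: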

text = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "XXX-XX-XXXX", text)

Here is a sample of the text I am parsing:

more stuff here CHILDREN�S 001-02-0003 get rid of that stuff goes here not001-02-0003 but ssn:001-02-0003

I need to turn it into this:

more stuff here CHILDREN�S XXX-XX-XXXX get rid of that stuff goes here not001-02-0003 but ssn:XXX-XX-XXXX

Super! So now I am trying to write the same regex "in binary". Here is what I came up with, and it "works", but gosh, it does not feel right at all:

line = re.sub(b"\B(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)\x00-(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)\x00-(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)(\x000|\x001|\x002|\x003|\x004|\x005|\x006|\x007|\x008|\x009)\B", b"\x00X\x00X\x00X\x00-\x00X\x00X\x00-\x00X\x00X\x00X\x00X", line)

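Notes:

The junk in CHILDREN�S has to stay exactly as it is
Word boundaries are needed so that not001-02-0003 does not get masked

Shouldn't my regex be able to use a range of digits? I just don't know how to do that in binary. And my word boundary only works backwards, as \B instead of \b, ugh... what is going on there?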

Update: I also tried this:

line = re.sub(b"[\x30-\x39]", b"\x58", line)

That works for every single digit, but if I try something as simple as:

line = re.sub(b"[\x30-\x39][\x30-\x39]", b"\x58\x58", line)

it no longer matches anything. Any idea why?
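A likely explanation, assuming the file stores a \x00 byte before every character (as the \x00 prefixes in the long pattern above suggest, and as UTF-16-BE text would): two digit bytes are never adjacent, so [\x30-\x39][\x30-\x39] has nothing to match. The following is only a minimal sketch under that assumption. The names SSN_UTF16BE, MASK and mask_ssn_utf16be are made up for illustration, and lookarounds stand in for \b because a NUL byte sits between the characters.

```python
import re

# Assumption: every ASCII character is stored as a NUL byte followed by
# the character byte (UTF-16-BE). The sample text is shortened.
sample = "x 001-02-0003 not001-02-0003 ssn:001-02-0003".encode("utf-16-be")

# Two digit bytes in a row never occur, because a NUL sits between them.
adjacent = re.search(rb"[\x30-\x39][\x30-\x39]", sample)
interleaved = re.search(rb"\x00[0-9]\x00[0-9]", sample)

SSN_UTF16BE = re.compile(
    rb"(?<!\x00\w)"         # previous character is not a word character
    rb"(?:\x00\d){3}\x00-"  # three digits and a dash, each preceded by NUL
    rb"(?:\x00\d){2}\x00-"  # two digits and a dash
    rb"(?:\x00\d){4}"       # four digits
    rb"(?!\x00\w)"          # next character is not a word character
)
MASK = "XXX-XX-XXXX".encode("utf-16-be")


def mask_ssn_utf16be(data: bytes) -> bytes:
    """Mask SSN-shaped runs in NUL-prefixed (UTF-16-BE style) bytes."""
    return SSN_UTF16BE.sub(MASK, data)


masked = mask_ssn_utf16be(sample)
```

In a raw bytes pattern, \d and \w match the ASCII digit and word bytes, so the digit range does not have to be spelled out as an alternation.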

Answer 1

You could try:

import re

# The file is read as bytes, so the pattern and the replacement must be bytes too
rx = re.compile(rb'\b\d{3}-\d{2}-\d{4}\b')

with open("test.txt", "rb") as fr, open("test2.txt", "wb") as fp:
    repl = rx.sub(b'XXX-XX-XXXX', fr.read())
    fp.write(repl)

This preserves all the junk characters and writes them to test2.txt.
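To check the behaviour without touching any files, here is a small in-memory sketch. It assumes the data is single-byte, ASCII-compatible text, and it uses \x92 as a stand-in for the junk byte in CHILDREN�S (the real byte is not shown in the question). The names SSN and mask_ssn are only illustrative.

```python
import re

# Bytes pattern: \d and \b work on ASCII digits and word characters
SSN = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")


def mask_ssn(data: bytes) -> bytes:
    """Replace SSN-shaped runs with Xs, leaving every other byte untouched."""
    return SSN.sub(b"XXX-XX-XXXX", data)


# \x92 is an assumed placeholder for the junk byte in the sample text
sample = (b"more stuff here CHILDREN\x92S 001-02-0003 get rid of that "
          b"stuff goes here not001-02-0003 but ssn:001-02-0003")
masked = mask_ssn(sample)
```

The not001-02-0003 entry is left alone because there is no word boundary between the t and the first digit.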
Note that you can use r'string here' in Python when you do not want to escape every backslash.
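The raw prefix matters for \b in particular. In a non-raw literal Python itself turns \b into a backspace byte before the regex engine ever sees it, whereas \B is not a Python escape and reaches re unchanged (newer Python versions warn about such unrecognized escapes). That may be part of why \B and \b seemed to behave so differently in the question. A short sketch with made-up sample data:

```python
import re

# Non-raw literal: Python converts \b to a single backspace byte (0x08)
non_raw = b"\b"
# Raw literal: backslash and "b" stay as two bytes, which re reads as a word boundary
raw = rb"\b"

data = b"ssn:001-02-0003"

# The pattern starts with a literal backspace byte, so nothing matches
no_match = re.search(b"\b[0-9]{3}-", data)
# re sees \b as a word boundary between ":" and the first digit
match = re.search(rb"\b[0-9]{3}-", data)
```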

python

regex

binaryfiles

python-3.6