Is it possible to tag an already masked text in Python?

Question

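I'm trying to preprocess a text file for NLP. As part of this work we tag various items such as dates, addresses, and sensitive personal information (SPI). The problem is that some of this information is already masked in the text. For example:

January 6, xxxx or (xxx)xxx-1234

My question is whether I can use regular expressions in Python to unmask these values so that we can go on to tag them correctly. So I need something like this:

January 6, 1111 or (111)111-1234

and then tag them as #US_DATE and #PHONE.

I have already tried simple solutions such as: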

re.sub(r'xx', '11', '(xxx)xxx-1234')
re.sub(r'xx+', '11', 'January 9 xxxx')

But neither gives me the right pattern! Thanks in advance.

Answer 1

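One option is to use an alternation to match the different formats, and to call re.sub with a callback that replaces every x character inside a match with 1.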
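To make the difference concrete, here is a small sketch of what the two attempts from the question actually produce, next to a callback-based replacement. The variable names are only illustrative. The bare `x+` pattern in the last call is a simplification for this comparison; on real text it would also rewrite the letter x inside ordinary words, which is why the pattern below restricts matches to date and phone formats.

```python
import re

# Attempt 1: each pair "xx" becomes "11", so an odd x is left behind.
attempt_1 = re.sub(r'xx', '11', '(xxx)xxx-1234')
print(attempt_1)  # (11x)11x-1234

# Attempt 2: the whole run of x's collapses into a fixed "11".
attempt_2 = re.sub(r'xx+', '11', 'January 9 xxxx')
print(attempt_2)  # January 9 11

# A callback receives the match object, so the replacement can keep
# the same length as the run that was matched.
with_callback = re.sub(r'x+', lambda m: '1' * len(m.group()), '(xxx)xxx-1234')
print(with_callback)  # (111)111-1234
```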

For the pattern, I used character classes with quantifiers to specify what is allowed to match, but you can update it to make it more specific.

\b[A-Za-z]{3,} [a-zA-Z\d]{1,2},? [a-zA-Z\d]{4}\b|\([a-zA-Z\d]+\)[a-zA-Z\d]{3}-[a-zA-Z\d]{4}\b
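As one possible way of making it more specific, the sketch below only allows digits or the mask character x in the day, year, and phone positions, and fixes the area code at three characters. These restrictions and the name `STRICT` are assumptions for illustration, not part of the original pattern; adjust them to the formats that occur in your data.

```python
import re

# Hypothetical stricter variant: masked positions accept only digits or "x".
STRICT = re.compile(r"""
    \b[A-Za-z]{3,}\ [\dx]{1,2},?\ [\dx]{4}\b   # date: month name, day, 4-character year
  | \([\dx]{3}\)[\dx]{3}-[\dx]{4}\b            # phone: 3-character area code, 3 + 4 characters
""", re.VERBOSE)

def unmask(text):
    # Replace x with 1 only inside matched dates and phone numbers.
    return STRICT.sub(lambda m: m.group().replace('x', '1'), text)

print(unmask("Jan 6, xxxx or (xxx)xxx-1234"))  # Jan 6, 1111 or (111)111-1234
print(unmask("box is next"))                   # box is next
```

The second call shows the motivation: the looser pattern above treats "box is next" as a date, because letters are allowed in every position, and would turn it into "bo1 is ne1t". The looser pattern is still the one used in the rest of this answer.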


For example:

import re

# Alternation: a date such as "Jan 6, xxxx" or a phone number such as "(xxx)xxx-1234"
regex = r"\b[A-Za-z]{3,} [a-zA-Z\d]{1,2},? [a-zA-Z\d]{4}\b|\([a-zA-Z\d]+\)[a-zA-Z\d]{3}-[a-zA-Z\d]{4}\b"
test_str = ("Jan 6, xxxx or (xxx)xxx-1234\n"
    "Jan 16, xxxx or (xxx)xxx-1234\n"
    "January 9 xxxx\n"
    "(xxx)xxx-1234")
# Replace x with 1 only inside each matched date or phone number
result = re.sub(regex, lambda m: m.group().replace('x', '1'), test_str)
print(result)

Output

Jan 6, 1111 or (111)111-1234
Jan 16, 1111 or (111)111-1234
January 9 1111
(111)111-1234
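The question also asks for the unmasked values to be tagged as #US_DATE and #PHONE. A rough sketch of one way to do that in the same pass is to wrap each alternative in a named group and read `Match.lastgroup` inside the callback. The names `TAGGER` and `unmask_and_tag`, and the choice to return the tags as a list of (label, value) pairs, are assumptions here, since the question does not say how the tags should be stored.

```python
import re

# Same two alternatives as above, each wrapped in a named group.
TAGGER = re.compile(
    r"(?P<US_DATE>\b[A-Za-z]{3,} [a-zA-Z\d]{1,2},? [a-zA-Z\d]{4}\b)"
    r"|(?P<PHONE>\([a-zA-Z\d]+\)[a-zA-Z\d]{3}-[a-zA-Z\d]{4}\b)"
)

def unmask_and_tag(text):
    """Return (unmasked_text, tags); tags is a list of (label, value) pairs."""
    tags = []

    def repl(m):
        value = m.group().replace("x", "1")
        # lastgroup is the name of the alternative that matched.
        tags.append(("#" + m.lastgroup, value))
        return value

    return TAGGER.sub(repl, text), tags

unmasked, tags = unmask_and_tag("Jan 6, xxxx or (xxx)xxx-1234")
print(unmasked)  # Jan 6, 1111 or (111)111-1234
print(tags)      # [('#US_DATE', 'Jan 6, 1111'), ('#PHONE', '(111)111-1234')]
```

Because each alternative has its own group name, adding another format later only means adding another named alternative to the pattern.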

python

regex

text-processing