Text Replacement Using RE - Allow the First Occurrence, Replace the Rest
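I'm looking for some ideas on how to accomplish these tasks:
- Allow the first occurrence of a problem_word, but ban any later use of it, as well as every use of the remaining problem words.
- Make no modifications to the original document (the .txt file). Only the text passed to print() should change.
- Keep the structure of the email the same. If there are line breaks, tabs, or odd spacing, leave them intact.
Here is the code sample: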
import re

# Sample email is "Hello, banned1. This is banned2. What is going on with
# banned3? Hopefully banned1 is alright."
sample_email = open('email.txt', 'r').read()

# First use of any of these words is allowed; those following are banned
problem_words = ['banned1', 'banned2', 'banned3']

# TODO: Filter negative_words into overused_negative_words
banned_problem_words = []
for w in problem_words:
    if sample_email.count(f'\b{w}s?\b') > 1:
        banned_problem_words.append(w)

pattern = '|'.join(f'\b{w}s?\b' for w in banned_problem_words)

def list_check(email, pattern):
    return re.sub(pattern, 'REDACTED', email, flags=re.IGNORECASE)

print(list_check(sample_email, pattern))
# Result should be: "Hello, banned1. This is REDACTED. What is going on with
# REDACTED? Hopefully REDACTED is alright."
The repl argument of re.sub can be a function that takes a match object and returns the replacement string. Here is my solution:
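import re

# Read the email; the file itself is never modified
with open('email.txt', 'r') as f:
    sample_email = f.read()

# First use of any of these words is allowed; those following are banned
problem_words = ['banned1', 'banned2', 'banned3']
# Raw f-string so that \b is a regex word boundary, not a backspace character
pattern = '|'.join(rf'\b{w}\b' for w in problem_words)

occurrences = 0  # number of matches seen so far, across all problem words

def redact(match):
    global occurrences
    occurrences += 1
    # Keep the very first match as-is; redact every later one
    if occurrences > 1:
        return "REDACTED"
    return match.group(0)

replaced = re.sub(pattern, redact, sample_email, flags=re.IGNORECASE)
print(replaced)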
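Because the counter in the solution is a module-level global, it has to be reset by hand before a second email is processed. A minimal sketch of one way around that is to keep the counter inside a closure. The helper name redact_after_first, the re.escape call, and the inline sample string (used here instead of email.txt) are my own assumptions for illustration, not part of the original solution:

```python
import re


def redact_after_first(text, words, replacement="REDACTED"):
    """Keep the first match of any word in `words`; replace every later match."""
    # re.escape guards against words that contain regex metacharacters.
    pattern = re.compile(
        "|".join(rf"\b{re.escape(w)}\b" for w in words),
        flags=re.IGNORECASE,
    )
    seen = 0  # matches encountered so far, local to this call

    def repl(match):
        nonlocal seen
        seen += 1
        # Only the very first match is returned unchanged.
        return match.group(0) if seen == 1 else replacement

    return pattern.sub(repl, text)


# Inline stand-in for the contents of email.txt (assumed for this sketch).
sample = (
    "Hello, banned1. This is banned2. What is going on with\n"
    "banned3? Hopefully banned1 is alright."
)
print(redact_after_first(sample, ["banned1", "banned2", "banned3"]))
```

Each call starts with a fresh counter, so the function can be applied to several emails in a row. Only the matched words are touched, so line breaks, tabs, and spacing pass through unchanged.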
(As a further note, str.count does not support regular expressions, but there is no need to count anyway.)
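To illustrate the point about counting: str.count searches for a literal substring, so a pattern such as \bbanned1s?\b passed to it is not interpreted as a regular expression. If a count were ever needed, re.findall is one option. A small sketch, with an inline sample string assumed for illustration:

```python
import re

sample = "Hello, banned1. This is banned2. Hopefully Banned1 is alright."

# str.count looks for these exact characters, backslashes included.
literal_count = sample.count(r"\bbanned1s?\b")

# re.findall treats the same text as a regular expression.
regex_count = len(re.findall(r"\bbanned1s?\b", sample, flags=re.IGNORECASE))

print(literal_count, regex_count)  # prints: 0 2
```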