Remove all numbers except the ones combined with a string, using Python regex

Question

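I am trying to use a regex function to remove standalone words, whitespace, special characters and numbers, but not a digit that is part of a word/string. For example: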

ORIGIN
    1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
    61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn

//

My pattern, which includes \W+, removes every digit, including the 1 in malwmrllp1:

import re

text_file = open('mytext.txt').read()
new_txt = re.sub('[\b\d+\b\s*$+\sORIGIN$\W+]', '', text_file)

print(new_txt, len(new_txt))

My output is:

malwmrllplallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 109

The expected output should be: malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110

Answer 1

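Depending on whether you want underscores to show up in your result, try re.findall with raw-string notation. The character class you are currently using makes no sense: inside [...] every escape or character is just one member of a set, so \d there removes every digit wherever it appears, and \b means a backspace rather than a word boundary. Try this pattern instead: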

\b(?!(?:ORIGIN|[_\d]+)\b)\w+

See an online demo.

\b - Word boundary;
(?!(?:ORIGIN|[_\d]+)\b) - Negative lookahead with a nested non-capturing group; the match is rejected if what follows is ORIGIN or a run of 1+ underscores/digits that ends at a word boundary;
\w+ - 1+ word characters.

import re
  
text_file = """ORIGIN
    1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
    61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn

//"""

new_txt = ''.join(re.findall(r'\b(?!(?:ORIGIN|[_\d]+)\b)\w+', text_file))
print(new_txt, len(new_txt))

Prints:

malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110
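
To make the difference concrete, here is a minimal sketch comparing the question's character class with the lookahead pattern on a shortened sample. The variable names (sample, char_class, token_based) are illustrative and not part of the original post, and the question's pattern is written as a raw string here.

```python
import re

# Shortened version of the input from the question (illustrative only).
sample = "ORIGIN\n    1 malwmrllp1 lallalwgpd\n//"

# Inside [...], \d, \s and \W are set members, \b is a backspace, and
# O, R, I, G, N, +, * and $ are literal characters. Every digit is
# therefore removed, including the one inside a word.
char_class = re.sub(r'[\b\d+\b\s*$+\sORIGIN$\W+]', '', sample)

# The lookahead pattern skips only whole tokens that are ORIGIN or
# consist entirely of digits/underscores.
token_based = ''.join(re.findall(r'\b(?!(?:ORIGIN|[_\d]+)\b)\w+', sample))

print(char_class)
print(token_based)
```

The first result loses the 1 in malwmrllp1, while the second keeps it because the digit is not a token on its own.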

Answer 2

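Using an RE for this is an interesting academic exercise, but extending the functionality is fraught with danger unless you are very familiar with the technique.

This answer may look long-winded, but you should be able to see how easy it is to extend so that other tokens/patterns can be excluded or included. It is also easy to maintain, because anyone else who has to modify the code won't get a migraine trying to work out how the RE works.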

FILENAME = 'mytext.txt'

def keep(t):
    if t.isdigit() or t == 'ORIGIN' or t == '//':
        return False
    return True

with open(FILENAME) as f:
    new_txt = ''.join(filter(keep, f.read().split()))
    print(new_txt, len(new_txt))

Output:

malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110
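
As a rough illustration of the extensibility claim, the filter could be driven by a set of excluded tokens. This is only a sketch: EXCLUDED_TOKENS, clean_sequence and the HEADER marker are hypothetical names introduced here, not part of the original answer.

```python
# Hypothetical names: EXCLUDED_TOKENS and clean_sequence are illustrative only.
EXCLUDED_TOKENS = {'ORIGIN', '//'}

def clean_sequence(text, excluded=EXCLUDED_TOKENS):
    """Join whitespace-separated tokens, skipping pure numbers and excluded markers."""
    return ''.join(
        t for t in text.split()
        if not t.isdigit() and t not in excluded
    )

# Shortened version of the input from the question (illustrative only).
sample = """ORIGIN
    1 malwmrllp1 lallalwgpd
    61 lqvgqvelgg
//"""

result = clean_sequence(sample)
print(result, len(result))

# Excluding another marker only requires passing a different set.
print(clean_sequence("HEADER abc 12", excluded={'HEADER'}))
```

Adding a new marker to skip means adding one string to the set, with no pattern to re-verify.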

Answer 3

Another idea:

new_txt = re.sub(r'[\W_]+|\b(?:\d+|ORIGIN)\b', '', text_file)

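This removes every run of non-word characters and underscores, as well as any run of digits or the word "ORIGIN" that sits between word boundaries.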

See this demo at tio.run (the regex is very basic, explanation at regex101)
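
One detail worth checking when trying this approach: the pattern has to be a raw string, because in an ordinary Python string literal '\b' is the backspace character rather than a regex word boundary. The sketch below uses a shortened sample; the variable names are illustrative.

```python
import re

# In a normal string literal '\b' is the backspace character (0x08);
# in a raw string it stays as backslash + b, which the regex engine
# interprets as a word boundary.
assert '\b' == '\x08'
assert r'\b' == '\\b'

# Shortened version of the input from the question (illustrative only).
text_file = """ORIGIN
    1 malwmrllp1 lallalwgpd
    61 lqvgqvelgg
//"""

pattern = r'[\W_]+|\b(?:\d+|ORIGIN)\b'
new_txt = re.sub(pattern, '', text_file)
print(new_txt, len(new_txt))
```

The 1 in malwmrllp1 survives because there is no word boundary between p and 1, so the second alternative cannot match there.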

