Python: How to skip lines that have extra characters while using Regular Expressions?

Question

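When using regular expressions, how do you extract text only from lines that have no extra text after the part you are interested in?

For the input text below, I want to select only string1 through string10, skipping any string that has "blah" after it on the same line.

Input text file: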

[random lines of text]
DATE/USER: 07/01/15   string1
[random lines of text]
DATE/USER: 07/12/15   string2
[random lines of text]
DATE/USER: 07/04/15   string3
[random lines of text]
DATE/USER: 07/12/15   string4
[random lines of text]
DATE/USER: 07/05/15   string5      * blah1 *
[random lines of text]
DATE/USER: 07/02/15   string6
[random lines of text]
DATE/USER: 07/08/15   string7
[random lines of text]
DATE/USER: 07/11/15   string8      * blah2 *
[random lines of text]
DATE/USER: 07/03/15   string9
[random lines of text]
DATE/USER: 07/10/15   string10      * blah3 *
[random lines of text]

My current code:

rphfind = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d).+', line)
if rphfind:
    print(rphfind[0].strip())

Output:

string1
string2
string3
string4
string5      * blah1 *
string6
string7
string8      * blah2 *
string9
string10      * blah3 *

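Again, I am only trying to grab the strings and skip those that have "blah" on the same line. My output should exclude string5, string8, and string10.

Edit: Sorry. I made some edits to clarify what I am trying to achieve.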

Answer 1

Based on your edit, you can simply split each line:

with open("in.txt") as f:
    for line in f:
        if line.startswith("DATE/USER:"):
            spl = line.split()
            if len(spl) == 3:
                print(spl[2])

Output:

string1
string2
string3
string4
string6
string7
string9
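
The length check works because str.split() with no arguments treats any run of whitespace as a single separator, so a clean line yields exactly three fields while a line with trailing "* blah1 *" yields more. A minimal sketch of that behaviour on two lines taken from the question (the helper name split_fields is only illustrative):

```python
def split_fields(line):
    """Return the whitespace-separated fields of one input line."""
    return line.split()


clean = "DATE/USER: 07/04/15   string3"
noisy = "DATE/USER: 07/05/15   string5      * blah1 *"

# A clean line splits into the label, the date, and the value.
print(split_fields(clean))

# The trailing "* blah1 *" adds three extra fields, so len(...) != 3.
print(split_fields(noisy))
```

Note that this check assumes the value itself never contains whitespace; a value with embedded spaces would also be skipped.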

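Using re:

import re

# \w+$ requires the value to be the last thing on the line,
# so lines with trailing "* blah *" text do not match.
r = re.compile(r'(^DATE/USER:\s+\d+/\d+/\d+\s+(\w+$))')
with open("in.txt") as f:
    for line in f:
        match = r.search(line)
        if match:
            print(match.group(2))

Output: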

string1
string2
string3
string4
string6
string7
string9
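
If the whole file fits in memory, the same idea can be applied in one pass with re.MULTILINE, which makes ^ and $ match at every line boundary. This is a sketch rather than part of the original answer: the names SAMPLE, PATTERN, and extract_clean are made up for illustration, and [ \t] is used instead of \s so that a match cannot spill over onto the next line.

```python
import re

# A shortened stand-in for the input file from the question.
SAMPLE = """\
[random lines of text]
DATE/USER: 07/01/15   string1
[random lines of text]
DATE/USER: 07/05/15   string5      * blah1 *
[random lines of text]
DATE/USER: 07/02/15   string6
"""

# The value must be followed only by optional spaces/tabs up to end of line.
PATTERN = re.compile(
    r"^DATE/USER:[ \t]+\d\d/\d\d/\d\d[ \t]+(\w+)[ \t]*$",
    re.MULTILINE,
)


def extract_clean(text):
    """Return the values from DATE/USER lines that have no trailing text."""
    return PATTERN.findall(text)


print(extract_clean(SAMPLE))
```

For very large files, the line-by-line loop above is likely the better choice, since it never holds the whole file in memory.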

Answer 2

re.findall(r'DATE/USER: \d\d/\d\d/\d\d\s+([A-Z])', line)

Answer 3

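The "$" in the pattern below will actually exclude any line that has * blah * after the value:

rphfind = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d)\s+([A-Z])$', line)

This matches only A, B, C, D, F, G, I.

The capture group ([A-Z]) captures only a single uppercase letter, but without the "$" the pattern still lets every line match (printing A through J in your example):

rphfind = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d)\s+([A-Z])', line)

I am not sure which version you are looking for.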
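
Both patterns above capture a single uppercase letter. For the multi-character values in the question as it now stands (string1 through string10), one possible adaptation is to replace [A-Z] with \w+. The sketch below contrasts the anchored and unanchored forms; the names ANCHORED, UNANCHORED, and the sample lines are assumptions made for illustration, not part of the original answer.

```python
import re

# Anchored: only optional whitespace may follow the value before end of line.
ANCHORED = re.compile(r"(?<=DATE/USER: \d\d/\d\d/\d\d)\s+(\w+)\s*$")

# Unanchored: the value is captured no matter what follows it.
UNANCHORED = re.compile(r"(?<=DATE/USER: \d\d/\d\d/\d\d)\s+(\w+)")

clean = "DATE/USER: 07/01/15   string1\n"
noisy = "DATE/USER: 07/05/15   string5      * blah1 *\n"

for line in (clean, noisy):
    print(ANCHORED.findall(line), UNANCHORED.findall(line))
```

With the anchor, the noisy line produces an empty list and can be skipped; without it, string5 is still captured.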

python

regex

text-processing