Python: How to skip lines that have extra characters while using Regular Expressions?
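When using regular expressions, how do I extract text only from lines that have no extra text after the part I'm interested in?
For the input text below, I only want to select string1 through string10, skipping any string that has "blah" on the same line.
Input text file: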
[random lines of text]
DATE/USER: 07/01/15 string1
[random lines of text]
DATE/USER: 07/12/15 string2
[random lines of text]
DATE/USER: 07/04/15 string3
[random lines of text]
DATE/USER: 07/12/15 string4
[random lines of text]
DATE/USER: 07/05/15 string5 * blah1 *
[random lines of text]
DATE/USER: 07/02/15 string6
[random lines of text]
DATE/USER: 07/08/15 string7
[random lines of text]
DATE/USER: 07/11/15 string8 * blah2 *
[random lines of text]
DATE/USER: 07/03/15 string9
[random lines of text]
DATE/USER: 07/10/15 string10 * blah3 *
[random lines of text]
My current code:
rphfind = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d).+', line)
if rphfind:
    print(rphfind[0].strip())
Output:
string1
string2
string3
string4
string5 * blah1 *
string6
string7
string8 * blah2 *
string9
string10 * blah3 *
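Again, I'm only trying to grab the strings and skip those that have "blah" on the same line. My output should exclude string5, string8, and string10.
Edit: Sorry. I made some edits to refine what I'm trying to achieve.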
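Based on your edit, you can definitely just split:
with open("in.txt") as f:
    for line in f:
        if line.startswith("DATE/USER:"):
            # A wanted line has exactly three whitespace-separated fields.
            spl = line.split()
            if len(spl) == 3:
                print(spl[2])
Output: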
string1
string2
string3
string4
string6
string7
string9
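A self-contained sketch of the same split-based idea, using an in-memory sample instead of in.txt so it runs without the file. The function name `users_without_extras` and the `SAMPLE` text are illustrative assumptions, not part of the original answer.

```python
SAMPLE = """\
some unrelated line
DATE/USER: 07/01/15 string1
DATE/USER: 07/05/15 string5 * blah1 *
another unrelated line
DATE/USER: 07/02/15 string6
"""


def users_without_extras(lines):
    """Return the third field of DATE/USER lines that have exactly three fields."""
    result = []
    for line in lines:
        if line.startswith("DATE/USER:"):
            fields = line.split()  # split on any run of whitespace
            if len(fields) == 3:   # marker, date, string and nothing else
                result.append(fields[2])
    return result


print(users_without_extras(SAMPLE.splitlines()))
```

Note that this relies on the wanted string containing no whitespace; a string with internal spaces would produce more than three fields and be skipped.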
Using re:
import re

# Group 2 is the trailing word; $ rejects lines with anything after that word.
r = re.compile(r'(^DATE/USER:\s+\d+/\d+/\d+\s+(\w+$))')
with open("in.txt") as f:
    for line in f:
        match = r.search(line)
        if match:
            print(match.group(2))
Output:
string1
string2
string3
string4
string6
string7
string9
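If the whole file is already in memory, a variant of the same regex idea can scan it in one pass with `re.MULTILINE`. This is only a sketch under that assumption; the names `PATTERN` and `text` and the sample data are made up for illustration.

```python
import re

# With re.MULTILINE, ^ and $ anchor at the start and end of every line.
# [ \t]+ is used instead of \s+ so that a match cannot run across a newline.
PATTERN = re.compile(
    r'^DATE/USER:[ \t]+\d+/\d+/\d+[ \t]+(\w+)[ \t]*$',
    re.MULTILINE,
)

text = (
    "noise\n"
    "DATE/USER: 07/01/15 string1\n"
    "DATE/USER: 07/05/15 string5 * blah1 *\n"
    "DATE/USER: 07/02/15 string6\n"
)

# findall returns the contents of the single capture group for each match.
matches = PATTERN.findall(text)
print(matches)
```

The single capture group also means there is no need to remember which group number holds the string.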
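re.findall(r'DATE/USER: \d\d/\d\d/\d\d\s+([A-Z])', line)
The "$" below will actually exclude any line that has * blah * after the string:
rphfind = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d)\s+([A-Z])$', line)
This will only match A, B, C, D, F, G, I.
The capture group ([A-Z]) captures only a single uppercase letter, but without the "$" it still lets any line match (printing A through J in your example):
rphfind = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d)\s+([A-Z])', line)
I'm not sure which version you're looking for.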
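Since `[A-Z]` matches only one uppercase letter, it will not match values such as string1 from the sample input in the question. A possible adaptation of the lookbehind-plus-"$" version, assuming the wanted value is a single run of word characters, could look like this (the helper name `extract` is hypothetical):

```python
import re


def extract(line):
    """Return the trailing word of a DATE/USER line, or None if extra text follows."""
    # The lookbehind is fixed width, which Python's re module requires.
    # (\w+) captures the whole word; \s*$ allows only whitespace up to the end.
    found = re.findall(r'(?<=DATE/USER: \d\d/\d\d/\d\d)\s+(\w+)\s*$', line)
    return found[0] if found else None


print(extract("DATE/USER: 07/01/15 string1\n"))
print(extract("DATE/USER: 07/05/15 string5 * blah1 *\n"))
```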