Read a text file line by line and store values in variables when they match specific patterns in Python
We have a large log file containing lines of the following two kinds:
00 LOG | Cycles Run: 120001
00 LOG ! Virtual: Max> ?????????? bytes (?.???? gb), Current> 640733184 bytes (?.???? gb).
00 LOG ! Virtual: Max> 1082470400 bytes (?.???? gb), Current> ????????? bytes (?.???? gb).
00 LOG ! Actual: Max> ????????? bytes (?.???? gb), Current> 472154112 bytes (?.???? gb).
00 LOG ! Actual: Max> 861736960 bytes (?.???? gb), Current> ????????? bytes (?.???? gb).
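Since the log file is large, we want to read it line by line (rather than loading the whole text into a buffer at once), match a specific set of patterns, and pick out the values into separate variables.
For example, from
00 LOG | Cycles Run: 120001
we want to pick 120001 and store it in a variable, say cycle.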
On the other hand, we also parse these lines:
00 LOG ! Virtual: Max> ?????????? bytes (?.???? gb), Current> 640733184 bytes (?.???? gb).
00 LOG ! Virtual: Max> 1082470400 bytes (?.???? gb), Current> ????????? bytes (?.???? gb).
00 LOG ! Actual: Max> ????????? bytes (?.???? gb), Current> 472154112 bytes (?.???? gb).
00 LOG ! Actual: Max> 861736960 bytes (?.???? gb), Current> ????????? bytes (?.???? gb).
Characters marked ? can be any digit.
We want to store the values in variables like this:
640733184 in var virtual_cur
1082470400 in var virtual_max
472154112 in var actual_cur
861736960 in var actual_max
I wrote a snippet in Python 3.6, but it prints an empty list:
import re
filename = "test.txt"
with open(filename) as fp:
    line = fp.readline()
    while line:
        cycle_num = re.findall(r'00 LOG | Cycles Run: (.*?)',line,re.DOTALL)
        line = fp.readline()

print (cycle_num[0])
NOTE: I want to pick each value into a separate variable and use it
later on. I need to set up 5 patterns one by one, pick the value if a line
matches a specific pattern, and put it in the respective variable.
I am not sure about the wildcard matching for the second pattern.
Please suggest an efficient way to do this.
You can use an alternation of two lookbehinds here:
(?<=Cycles Run: )\d+|(?<= Current> )\d+
Python example:
import re
text = '''
00 LOG | Cycles Run: 120001
00 LOG ! Virtual: Max> 1082470400 bytes (1.0081 gb), Current> 640733184 bytes (0.5967 gb)
'''
pattern = re.compile(r'(?<=Cycles Run: )\d+|(?<= Current> )\d+')
matches = re.findall(pattern,text)
num_cycle = matches[0]
current = matches[1]
print(num_cycle,current)
Prints:
120001 640733184
Since the process is repeated in a loop, it is advisable to use re.compile to compile the pattern only once, before the loop.
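As a rough sketch of that advice, the pattern can be compiled once at module level and then applied to each line as it is read. The helper name iter_values is made up for illustration, and io.StringIO stands in for the real file so the snippet runs on its own; with a real log you would pass the object returned by open('test.txt') instead. Note that Python's re module requires lookbehinds to have a fixed width, which both lookbehinds here satisfy.

```python
import io
import re

# Compiled once, outside the loop; this is the lookbehind pattern from the answer.
PATTERN = re.compile(r'(?<=Cycles Run: )\d+|(?<= Current> )\d+')

def iter_values(lines):
    """Yield (line_number, value) for every match, one line at a time.

    `lines` can be any iterable of strings, e.g. an open file object,
    so the whole log never has to be held in memory.
    """
    for line_num, line in enumerate(lines, start=1):
        for value in PATTERN.findall(line):
            yield line_num, int(value)

# io.StringIO is a stand-in for open('test.txt'); both yield lines lazily.
sample = io.StringIO(
    "00 LOG | Cycles Run: 120001\n"
    "00 LOG ! Virtual: Max> 1082470400 bytes (1.0081 gb), "
    "Current> 640733184 bytes (0.5967 gb).\n"
)

result = list(iter_values(sample))
print(result)  # [(1, 120001), (2, 640733184)]
```

The Max> value is skipped on purpose, because the pattern only looks behind for " Current> ".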
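Here we search for an identifier (e.g. Cycles) and apply a different regex depending on whether it is found:
import re
with open('test.txt','r') as f:
    for line in f:
        if re.search(r'Cycles',line):
            # Cycles line: the number is the last thing on the line
            m=re.findall(r'\d+$',line)
        else:
            # Memory line: capture the digits after "Current> "
            m=re.findall(r'Current> (\d+)',line)
        print(m)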
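For comparison, a small check on the sample line from the question suggests why the original pattern gave empty results. This is a sketch of one likely cause, not a full diagnosis: the unescaped | acts as alternation, and the lazy group (.*?) at the end of a pattern is satisfied by zero characters.

```python
import re

line = "00 LOG | Cycles Run: 120001\n"

# Unescaped '|' is alternation, so this reads as
#   "00 LOG "   OR   " Cycles Run: (.*?)"
# The first alternative has no group, and the lazy group in the second
# one matches as little as possible, i.e. nothing.
broken = re.findall(r'00 LOG | Cycles Run: (.*?)', line, re.DOTALL)

# Escaping the pipe and capturing digits explicitly returns the number.
fixed = re.findall(r'00 LOG \| Cycles Run: (\d+)', line)

print(broken)  # ['', '']
print(fixed)   # ['120001']
```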
Using the regex
(?:(?:Cycles Run:[ \t]+)|(?:Current>[ \t]+))(\d+)
you can do something along these lines:
import re
pat=re.compile(r'(?:(?:Cycles Run:[ \t]+)|(?:Current>[ \t]+))(\d+)')
with open('test.txt','r') as f:
    for line_num, line in enumerate(f):
        m=pat.search(line)
        if m:
            # group(1) is the captured number; group(0) would also include the label
            print(line_num, m.group(1))
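None of the snippets above actually fills the five variables named in the question, so here is one possible sketch of that step. It assumes each memory line carries both a Max> and a Current> number, and that the most recent value seen should win when a label appears more than once. The names parse_log, CYCLE_RE and MEM_RE are hypothetical, and the gb figures in the sample are illustrative fill-ins for the ?.???? placeholders.

```python
import io
import re

# Both patterns are compiled once, before the loop.
CYCLE_RE = re.compile(r'Cycles Run:\s+(\d+)')
MEM_RE = re.compile(
    r'(Virtual|Actual): Max>\s+(\d+) bytes.*?Current>\s+(\d+) bytes'
)

def parse_log(lines):
    """Scan lines one at a time and return the last value seen per name."""
    values = {}
    for line in lines:
        m = CYCLE_RE.search(line)
        if m:
            values['cycle'] = int(m.group(1))
            continue
        m = MEM_RE.search(line)
        if m:
            kind = m.group(1).lower()            # 'virtual' or 'actual'
            values[kind + '_max'] = int(m.group(2))
            values[kind + '_cur'] = int(m.group(3))
    return values

# Stand-in for open('test.txt'); a real file object works the same way.
sample = io.StringIO(
    "00 LOG | Cycles Run: 120001\n"
    "00 LOG ! Virtual: Max> 1082470400 bytes (1.0081 gb), "
    "Current> 640733184 bytes (0.5967 gb).\n"
    "00 LOG ! Actual: Max> 861736960 bytes (0.8026 gb), "
    "Current> 472154112 bytes (0.4397 gb).\n"
)

values = parse_log(sample)
cycle = values['cycle']
virtual_cur, virtual_max = values['virtual_cur'], values['virtual_max']
actual_cur, actual_max = values['actual_cur'], values['actual_max']
print(cycle, virtual_cur, virtual_max, actual_cur, actual_max)
```

Collecting into a dict first keeps the loop short; if a label might be missing from the file, values.get('cycle') avoids a KeyError.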