Trying to restrict regex match scope
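Python newbie here, so please forgive the silly question.
I'm trying to extract log data from a set of gzipped tar files.
Each record spans multiple lines, so I extract every file from the compressed tar archive and read it as a single object, like this:
The regex and the loop: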
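import re
import tarfile

# Pattern as posted, with "manage_event" spelled the way it appears in the log data below
first_match = re.compile(r"(?P<date>\d{4}[-]?\d{1,2}[-]?\d{1,2} \d{1,2}:\d{1,2}:\d{1,2}).*?http://servername:99999/chargeit.*?manage_event=first.*?\bwantThisUser=([^&]*).*?\b_operator=(\w+).*?request\:.*?Want-To-Have-This\:\s\*123\*0\#")

linecount = 0
tfile = tarfile.open("logfile-year-month-day.number.log.tar.gz", "r")
for member in tfile.getmembers():
    # Read the whole member at once; str() on bytes yields the b'...' repr,
    # in which each newline shows up as a literal backslash followed by "n"
    f = str(tfile.extractfile(member).read())
    for match in first_match.finditer(f):
        linecount = linecount + 1
        print(linecount, match.group(1), match.group(2), match.group(3))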
I'm trying to match the timestamp and two other groups in the log file.
Printed line by line, the log data looks roughly like this:
2016-12-16 20:43:47 DEBUG[ispatcher-12570] this.is.the.api.Api - http://servername:99999/chargeit?session_id=1d7cb257e22946abbb3a14b17f232505&manage_event=first&wantThisUser=4119057000083&_source=operator3 b90e7798-8abd-4cf4-9660-45d6527e2804 request:
HEADERS:
this-is-a-header: 200
Want-To-Have-This: *123*200#
Host: servername:99999
Accept: */*
User-Agent: AHC/2.0
Timeout-Access: <function1>
CONTENT:
2016-12-16 20:43:47 DEBUG[ispatcher-12571] this.is.the.api.Api - http://servername:99999/chargeit?session_id=20111&manage_event=first&wantThisUser=4119185011005&_operator=operator4 926fa104-e72f-46e8-a5fc-912ef9707a01 request:
HEADERS:
this-is-a-header: 0
Want-To-Have-This: *123*0#
Host: servername:99999
Accept: */*
User-Agent: AHC/2.0
Timeout-Access: <function1>
CONTENT:
2016-12-16 20:26:29 DEBUG[ispatcher-12563] this.is.the.api.Api - http://servername:99999/chargeit?session_id=a5e456ad2f5645c39a580463630cd3db&manage_event=first&wantThisUser=4119023107960&_source=operator2 1021c087-1918-40a3-a7c1-4b7c37690471 request:
HEADERS:
this-is-a-header: 1000*0111111111
Want-To-Have-This: *123*1000*0111111111#
Host: servername:99999
Accept: */*
User-Agent: AHC/2.0
Timeout-Access: <function1>
CONTENT:
I expected to see this:
2016-12-16 20:43:47 DEBUG[ispatcher-12571] this.is.the.api.Api - http://servername:99999/chargeit?session_id=20111&manage_event=first&wantThisUser=4119185011005&_operator=operator4 926fa104-e72f-46e8-a5fc-912ef9707a01 request:
HEADERS:
this-is-a-header: 0
Want-To-Have-This: *123*0#
The groups I want to capture are the timestamp (2016-12-16 20:43:47), wantThisUser= (4119185011005) and _operator= (operator4).
Instead, the regex captures the target record together with the record above it:
2016-12-16 20:43:47 DEBUG[ispatcher-12570] this.is.the.api.Api - http://servername:99999/chargeit?session_id=1d7cb257e22946abbb3a14b17f232505&manage_event=first&wantThisUser=4119057000083&_source=operator3 b90e7798-8abd-4cf4-9660-45d6527e2804 request:
HEADERS:
this-is-a-header: 200
Want-To-Have-This: *123*200#
Host: servername:99999
Accept: */*
User-Agent: AHC/2.0
Timeout-Access: <function1>
CONTENT:
2016-12-16 20:43:47 DEBUG[ispatcher-12571] this.is.the.api.Api - http://servername:99999/chargeit?session_id=20111&manage_event=first&wantThisUser=4119185011005&_operator=operator4 926fa104-e72f-46e8-a5fc-912ef9707a01 request:
HEADERS:
this-is-a-header: 0
Want-To-Have-This: *123*0#
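It pulls the timestamp and the other two groups from the record above the one I actually want.
How can I restrict the match to its own record? Or am I approaching this the wrong way?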
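Thanks @blubberdibulb!
You helped me narrow the block-matching regex down to first_match = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}.*?(?=^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\Z)", re.DOTALL|re.MULTILINE)
This splits the log into more manageable blocks to parse.
Everything works much better now.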
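For anyone landing here later, a minimal sketch of that two-step idea might look like the following: first split the text into one block per timestamp with the block-matching regex, then search each block on its own so a match cannot spill into a neighbouring record. The names block_re, field_re, extract_records and the shortened sample_log are made up for illustration, and field_re is only an assumption about which fields matter, not necessarily the pattern that was finally used.

```python
import re

# Block splitter from the post: each match starts at a timestamp line and
# runs lazily up to the next timestamp line (or the end of the text).
block_re = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}.*?"
    r"(?=^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\Z)",
    re.DOTALL | re.MULTILINE,
)

# Hypothetical per-block pattern. It only ever sees one block at a time,
# so its lazy ".*?" parts cannot reach into another record.
field_re = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
    r".*?\bwantThisUser=(?P<user>[^&\s]*)"
    r".*?\b_operator=(?P<operator>\w+)"
    r".*?^Want-To-Have-This: \*123\*0#",
    re.DOTALL | re.MULTILINE,
)

# Shortened stand-in for the log data shown in the question.
sample_log = """\
2016-12-16 20:43:47 DEBUG[ispatcher-12570] this.is.the.api.Api - http://servername:99999/chargeit?session_id=1d7c&manage_event=first&wantThisUser=4119057000083&_source=operator3 b90e request:
HEADERS:
Want-To-Have-This: *123*200#
CONTENT:
2016-12-16 20:43:47 DEBUG[ispatcher-12571] this.is.the.api.Api - http://servername:99999/chargeit?session_id=20111&manage_event=first&wantThisUser=4119185011005&_operator=operator4 926f request:
HEADERS:
Want-To-Have-This: *123*0#
CONTENT:
"""


def extract_records(text):
    """Return (date, user, operator) for every block that matches field_re."""
    records = []
    for block in block_re.finditer(text):
        m = field_re.search(block.group(0))
        if m:
            records.append((m.group("date"), m.group("user"), m.group("operator")))
    return records


print(extract_records(sample_log))
```

When the text comes from the tar file, decoding the bytes (for example with .decode("utf-8", errors="replace"), assuming the logs are UTF-8) instead of calling str() on them keeps real newlines in the string, which the ^ anchors under re.MULTILINE depend on.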