Parsing a log into record blocks with PowerShell regex DOTALL and a lookbehind assertion

This question concerns the task of parsing very large, unstructured log files with PowerShell 4.0, applying a regular expression that uses a lookbehind assertion and the DOTALL modifier.

A record in the log documents a process that spans multiple lines of various transaction attempts. I want to split the log into discrete records using start and end lines that can be identified by a success message. The success message marks the end of the record being processed, and the line that follows it is always the start of a new record.

Once the log is broken into a series of discrete records, I can more confidently pull key data from each record. That is the current plan, anyway; I am not concerned with that part of the process right now and will tackle it later.

A highly simplified log block looks like this:

20151120 11:10:31 UPDATE ARI has value [].
20151120 11:10:31 ERROR returning from process_updid with invalid NICS query - no ARI code: []..
20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 UPDATE Tag SSN has value [].
20151120 11:10:31 UPDATE Tag SOC has value [].

20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 ONE This is some random text that I just made up.
20151120 11:10:31 TWO This is more random text that I just made up.

20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 THREE This is additional random text that I just made up.
20151120 11:10:31 FOUR This is still more random text that I just made up.

The message line that tells the reader that one process has ended and a new record begins looks like this:

20151120 11:10:31 INFO transaction processed successfully.

Everything after that line, up to the next success message, constitutes a complete record.

The regex pattern I currently have is:

(?<=\d{8}\s\d{2}:\d{2}:\d{2}\sINFO transaction processed successfully\.)(?s)(.+)

This pattern correctly identifies the first success message, but it then includes the subsequent success messages inside that first record and repeats the same record for the second match. The (.+) expression consumes too much. I tried a reluctant (.+?) quantifier, which produced no match at all, and a lookahead assertion to establish a stopping point at the next success message, again without luck.
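The over-consumption is easy to reproduce. PowerShell uses the .NET regex engine, but Python's `re` module handles these constructs the same way, so here is a minimal sketch (an illustration, not the original code) of why the greedy `(.+)` under DOTALL swallows everything after the first success message:

```python
import re

log = (
    "20151120 11:10:31 UPDATE ARI has value [].\n"
    "20151120 11:10:31 INFO transaction processed successfully.\n"
    "20151120 11:10:31 UPDATE Tag SSN has value [].\n"
    "20151120 11:10:31 INFO transaction processed successfully.\n"
    "20151120 11:10:31 ONE This is some random text.\n"
)

# Fixed-width lookbehind anchored at the success message, then greedy (.+).
# re.S is the DOTALL flag, the equivalent of the inline (?s) in the question.
pattern = r"(?<=INFO transaction processed successfully\.)(.+)"

matches = re.findall(pattern, log, re.S)

# Only ONE match comes back, and it runs from the first success message all
# the way to the end of the string, swallowing the second success message
# and its record along the way.
print(len(matches))
print("processed successfully" in matches[0])
```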

The full PowerShell code is:

Clear-Host

$s = @"
20151120 11:10:31 UPDATE ARI has value [].
20151120 11:10:31 ERROR returning from process_updid with invalid NICS query - no ARI code: []..
20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 UPDATE Tag SSN has value [].
20151120 11:10:31 UPDATE Tag SOC has value [].

20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 ONE This is some random text that I just made up.
20151120 11:10:31 TWO This is more random text that I just made up.

20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 THREE This is additional random text that I just made up.
20151120 11:10:31 FOUR This is still more random text that I just made up.
"@

$p = "(?<=\d{8}\s\d{2}:\d{2}:\d{2}\sINFO transaction processed successfully\.)(?s)(.+)"

$s | Select-String $p -AllMatches | Foreach {$_.Matches}

Thanks for your guidance.

Forget the lookbehind and just use this:

(?:\d{8}\s\d{2}:\d{2}:\d{2}\s(?!INFO transaction processed successfully\.).+\n?)+


It matches one or more consecutive lines that do not match the success-message pattern. A lookbehind should never be the first tool you reach for when you are unsure how to solve a problem; it usually just makes the job harder. DOTALL/Singleline mode matters too, though to a lesser degree, and it leaves you more vulnerable to runaway matches.
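As a quick check, here is the same pattern at work, sketched with Python's `re` since the syntax carries over unchanged from the .NET engine PowerShell uses:

```python
import re

log = (
    "20151120 11:10:31 INFO transaction processed successfully.\n"
    "20151120 11:10:31 ONE This is some random text.\n"
    "20151120 11:10:31 TWO This is more random text.\n"
    "20151120 11:10:31 INFO transaction processed successfully.\n"
    "20151120 11:10:31 THREE This is additional random text.\n"
    "20151120 11:10:31 FOUR This is still more random text.\n"
)

# Each line must begin with a timestamp whose following text is NOT the
# success message; one or more such consecutive lines form one record block.
pattern = (r"(?:\d{8}\s\d{2}:\d{2}:\d{2}\s"
           r"(?!INFO transaction processed successfully\.).+\n?)+")

records = re.findall(pattern, log)
print(len(records))   # 2
```

One caveat: because every repeated line must match the timestamp prefix, continuation lines that do not begin with a timestamp (like the multi-line RESPONSE bodies in the real log later in this thread) would terminate a block early under this pattern.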

Another option is to split on the pattern that does match the success message:

\s*\d{8}\s\d{2}:\d{2}:\d{2}\sINFO transaction processed successfully\.\s*
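Splitting on that pattern discards the success lines and leaves one array element per record. A sketch of the same split in Python's `re`, using the simplified log from the question (the behavior matches .NET's `Regex.Split` here):

```python
import re

log = (
    "20151120 11:10:31 INFO transaction processed successfully.\n"
    "20151120 11:10:31 ONE This is some random text.\n"
    "20151120 11:10:31 TWO This is more random text.\n"
    "20151120 11:10:31 INFO transaction processed successfully.\n"
    "20151120 11:10:31 THREE This is additional random text.\n"
)

sep = (r"\s*\d{8}\s\d{2}:\d{2}:\d{2}\s"
       r"INFO transaction processed successfully\.\s*")

# The separator (the success line plus surrounding whitespace) is consumed,
# so only the record bodies remain; drop the empty leading element that
# results from the log starting with a success line.
records = [r for r in re.split(sep, log) if r]
print(len(records))   # 2
```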

@alan-moore, thank you for the suggestion to split the multiline text block into an array. I could not get that example to work, but I went back to the documentation, where I found a multiline -split example that I could build on.

https://technet.microsoft.com/en-us/library/hh847811(v=wps.630).aspx

$p = "(.+INFO.+?processed successfully\..+\n)"

$s -split $p, 0, "multiline"

This solution appears to work. The text block below is an anonymized fragment of the actual log I am trying to parse.
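What makes this version work is that `-split` (like .NET's `Regex.Split`) returns the text captured by any group in the separator pattern as additional array elements, so the success lines are kept rather than discarded. Python's `re.split` has the same capturing-group behavior, sketched below. One adjustment: the original pattern's trailing `.+\n` needs at least one character, such as the `\r` of a CRLF line ending, between the final period and the newline, so `\s*\n` is used here with an LF-only string:

```python
import re

log = (
    "20151120 11:10:31 UPDATE ARI has value [].\n"
    "20151120 11:10:31 INFO transaction processed successfully.\n"
    "20151120 11:10:31 UPDATE Tag SSN has value [].\n"
    "20151120 11:10:31 INFO transaction processed successfully.\n"
    "20151120 11:10:31 ONE This is some random text.\n"
)

# The parentheses make the separator a capturing group, so re.split
# returns each matched success line as an element of the result list.
pattern = r"(.+INFO.+?processed successfully\.\s*\n)"

parts = [p for p in re.split(pattern, log) if p]
print(len(parts))   # 5: record, success line, record, success line, record
```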

<#
Working on the example pattern provided here by Alan Moore:

#>

Clear-Host

<#
This is an anonymized, unstructured actual log in which a record is comprised of 
returns from various transactions, and those return messages can span multiple lines.
For ease in identifying starting and ending lines, in record blocks, 
## START ##  and ## END ## text was inserted.
#>

$s = @"
20151120 11:00:01 INFO Nightly NICS Criminal Synchronization began execution.
20151120 11:00:01 INFO Connected to Database OK.
20151120 11:00:01 INFO Connected to SMEL.
20151120 11:00:01 INFO In criminal_sequence_sync_protocolin state 1.
20151120 11:10:24 INFO {call dbo.lems_check_identifiers} procedure call successfull.
20151120 11:10:24 INFO process_checkid processed successfully.
20151120 11:10:24 ## START: 1 ## INFO criminal_sequence_sync_protocol new state=2, transaction count=564.
20151120 11:10:57 UPDATE Tag NSS has value [].
20151120 11:10:57 UPDATE Tag COS has value [].
20151120 11:10:57 UPDATE Tag DIS has value [0000123456].
20151120 11:10:57 UPDATE Tag ARI has value [MIRC1234567].
20151120 11:10:57 UPDATE SMIC.C5.LA012345J.DIS/0000123456.PUR/C.REQ/SMIC MIRC1234567.
20151120 11:10:58 RESPONSE SMIC-NC.ACK 123456B021 SMIC-NC  755B4F SMIC-NC  37B021 20151120 11:10:51

.
20151120 11:10:58 INFO MSN of message to SMEL is <123456B021>.
20151120 11:10:58 RESPONSE SMIC-NC.MSG 123456B021 SMIC-NC  755B50 CCHC     89140F 20151120 11:10:51
LA012345J.
CTL/
ATN/

C5.LA012345J..123456B021..
                     *****REQUESTED HCC DIS NOT FOUND*****.
20151120 11:10:58 INFO MSN of message from SMEL is <123456B021>.
20151120 11:10:58 INFO In criminal_sequence_sync_protocolin state 2.
20151120 11:10:58 UPDATE Tag ATN has value [].
20151120 11:10:58 INFO ATN read from query: [], w/ length: 0.
20151120 11:10:58 UPDATE ATN from C5 response has value [].
20151120 11:10:58 INFO ATN read from C5 response: [], w/ length: 0.
20151120 11:10:58 UPDATE ARI has value [].
20151120 11:10:58 ## END 1 ## ERROR returning from process_updid with invalid SCIN query - no ARI code: []..
20151120 11:10:58 INFO process_updid processed successfully.
20151120 11:35:13 ## START 2 ## UPDATE SMIC-NC.EDP.LA012345J.NAM/SADDLER,BELL.SEX/M.RAC/B.DOB/19700101.PCA/A1.ARI/MIRC1234567.DNY/CONVICTED OF 15 1403 B.OCA/0123456.SOR/LA.MIS/CONFIRM RECORD AT CSAL 123-456-7890 OR PAGER 123-456-7890.
20151120 11:35:13 RESPONSE SMIC-NC.ACK 123456B247 SMIC-NC  755FA8 SMIC-NC  37B247 20151120 11:35:07

.
20151120 11:35:13 INFO MSN of message to SMEL is <123456B247>.
20151120 11:35:14 RESPONSE SMIC-NC.MSG 123456B247 SMIC-NC  755FA9 NC2K     8E261D 20151120 11:35:07
LA012345J
CTL/
ATN/

6L01123456B2472EDP 
LA012345J
REJECT    MKE/EDP
NAM/SADDLER,BELL.SEX/M.RAC/B.DOB/19700101.SOR/LA.PCA/A1.
ARI/MIRC1234567.OCA/0123456.DNY/CONVICTED OF 15 1403 B.
MIS/CONFIRM RECORD AT CSAL 123-456-7890 OR PAGER 123-456-7890

FOR THE FOLLOWING REASON(S)
 DUPLICATE RECORD
SCIN-END.
20151120 11:35:14 INFO MSN of message from SMEL is <123456B247>.
20151120 11:35:14 INFO In criminal_sequence_sync_protocolin state 4.
20151120 11:35:14 ## END 2 ## ERROR SCIN REJECT msg received by process_updc - move onto next msg.
20151120 11:10:58 INFO process_updid processed successfully.
20151120 11:10:58 ## START 3 ## UPDATE Tag SSN has value [123456789].
20151120 11:10:58 UPDATE Tag ARI has value [CRIM1234567].
20151120 11:10:58 UPDATE SIMC.C4.LA012345J.COS/123456789.PUR/C.ATN/SIMC MIRC1234567.
20151120 11:10:58 RESPONSE SIMC-NC.ACK 123456B022 SIMC-NC  755B51 SIMC-NC  37B022 20151120 11:10:51

.
20151120 11:10:58 INFO MSN of message to SMEL is <123456B022>.
20151120 11:10:58 RESPONSE SIMC-NC.MSG 123456B022 SIMC-NC  755B52 CCHC     123450 20151120 11:10:52
LA012345J.
CTL/
ATN/SIMC MIRC1234567

C4.LA012345J..123456B022.SIMC MIRC1234567.
 LNM/ SMITH             FNM/ JENNIE          MIN/ A    NS/ ---
 RAC/ W SEX/ F HGT/ 501 WGT/ 120  HAI/ BRO EYE/ BRO POB/ LA DOB/ 01-01-1970
 AUTO/ Y   COS/           OLN/                           OLS/ 
 LID/  ORI/ LA0530000 FBI/ 123456FA8
 DIS/ 0001234567 STAT/ 
 FPH1/ 

 FPH2/                            008 ALIASES
       001 LNM/ SMITH             FNM/ JENNIE          MIN/ A        SUF/ 
       002 LNM/ STEVENS              FNM/ JANE          MIN/ A        SUF/ 
       003 LNM/ SMITH             FNM/ JENNIE          MIN/          SUF/ 
       004 LNM/ SMITH             FNM/ JENNIE          MIN/ A        SUF/ 
       005 LNM/ SMYTH             FNM/ JENNIFER          MIN/ A        SUF/ 
       006 LNM/ SMITH             FNM/ JENNIE          MIN/          SUF/ 
       007 LNM/ SMITH             FNM/ JENNIE          MIN/          SUF/ 
       008 LNM/ SMITH             FNM/ JENNIE          MIN/ A        SUF/ 
.
C5.LA012345J..123456B030.. 11/20/2015 11:11:43                                  
REQUESTED BY: SIMC MIRC1234560

                           S T A T E  CRIMINAL HISTORY
                            *FOR AUTHORIZED USE ONLY*
                 (FINGERPRINTS ARE NECESSARY FOR A POSITIVE ID)

INVESTIGATIVE REPORT                                      CONFIDENTIAL RECORDS
--------------------------------------------------------------------------------
CRIMINAL RECORD OF: SMITH, JENNIE A                                 FBI: 123456HC2
STATE ID: 0001234567    BIRTH DATE: 01/01/1970     PLACE: TN      DOC: 
RACE: W         HEIGHT:  5' 5"          HAIR: BLK                 DNA ON FILE:YES
SEX:  F         WEIGHT: 145             EYES: BRN
SSN: 123456789  OLS/OLN:                                          III: SSO
STATUS: 
--------------------------------------------------------------------------------
ARREST DATE: 01/12/2005                                    LID: 01234567
AGENCY: CLARKSTOWN, MS PD (LA0123456)                         AFIS ATN: 123456789012
   NAME: SMITH, JENNIE A

CHARGE 1                                                   COUNTS 1
   R.S. 14:67A(FELONY) F THEFT CHARGE

--------------------------------------------------------------------------------
** TO BE CONTINUED **.
20151120 11:10:58 INFO MSN of message from SMEL is <123456B022>.
20151120 11:10:58 INFO In criminal_sequence_sync_protocolin state 2.
20151120 11:10:58 UPDATE Tag ATN has value [SIMC MIRC1234567].
20151120 11:10:58 INFO ATN read from query: [SIMC MIRC1234567], w/ length: 20.
20151120 11:10:58 UPDATE ARI has value [CRIM1234567].
20151120 11:10:58 UPDATE Tag COS has value [].
20151120 11:10:58 UPDATE Tag DIS has value [0001234567].
20151120 11:10:59 UPDATE {call dbo.lems_update_id_check_fields ('CRIM1234567', 'COS', 'N', '20151120')} update procedure call successfull.
20151120 11:10:59 UPDATE {call dbo.lems_update_id_check_fields ('CRIM1234567', 'DIS', 'N', '20151120')} update procedure call successfull.
20151120 11:10:59 UPDATE LNM has value [SMITH].
20151120 11:10:59 INFO LNM read from SNIC response: [SMITH], w/ length: 8.
20151120 11:10:59 UPDATE LNM has value [SMITH].
20151120 11:10:59 INFO LNM read from SNIC response: [SMITH], w/ length: 8.
20151120 11:10:59 UPDATE LNM has value [STEVENS].
20151120 11:10:59 INFO LNM read from SNIC response: [STEVENS], w/ length: 7.
20151120 11:10:59 UPDATE LNM has value [SMITH].
20151120 11:10:59 INFO LNM read from SNIC response: [SMITH], w/ length: 8.
20151120 11:10:59 UPDATE LNM has value [SMITH].
20151120 11:10:59 INFO LNM read from SNIC response: [SMITH], w/ length: 8.
20151120 11:10:59 UPDATE LNM has value [SMITH].
20151120 11:10:59 INFO LNM read from SNIC response: [SMITH], w/ length: 8.
20151120 11:10:59 UPDATE LNM has value [SMITH].
20151120 11:10:59 INFO LNM read from SNIC response: [SMITH], w/ length: 8.
20151120 11:10:59 UPDATE LNM has value [SMITH].
20151120 11:10:59 INFO LNM read from SNIC response: [SMITH], w/ length: 8.
20151120 11:10:59 UPDATE LNM has value [SMITH].
20151120 11:10:59 INFO LNM read from SNIC response: [SMITH], w/ length: 8.
20151120 11:10:59 ERROR No data found for given names.
20151120 11:10:59 INFO  SELECT DIS FROM CRIMINALS WHERE  DIS = '0001234567'  AND UPPER(LNAME) IN ( 'SMITH' , 'SMITH' , 'STEVENS' , 'SMITH' , 'SMITH' , 'SMITH' , 'SMITH' , 'SMITH' , 'SMITH' )  
 name and DIS check done, name_match N .
20151120 11:10:59 ## END 3 ## UPDATE {call dbo.lems_update_id_check_fields ('CRIM1234567', 'LNM', 'N', '20151120')} update procedure call successfull.
20151120 11:10:59 INFO process_updid processed successfully.
20151120 11:10:59 ## START 4 ## UPDATE Tag NSS has value [].
20151120 11:10:59 UPDATE Tag COS has value [].
"@

$p = "(.+INFO.+?processed successfully\..+\n)"

$s -split $p, 0, "multiline"

Here is one extra tip about splitting text blocks. Suppose the use case requires keeping the line the block is split on rather than discarding it, because that line is essential to the record rather than expendable. A lookahead expression is exactly what you need when you want to keep the line at which the text block is split.

Here is a very simple example. The use case is that a "start" line is the first line of a new record within a larger text block:

Clear-Host

$s = @"
1 one
2 two
3 three
4 start
5 five
6 six
7 seven
"@

$p = "(?=\d\sstart)"
# The "MULTILINE" variant below gives the same result; see the note at the end.
#$records = [regex]::split($s, $p, "MULTILINE")
$records = [regex]::split($s, $p)

Foreach($r in $records) 
{
    $r
}

The code sample above produces two text blocks; in my use case, each split block is a record block comprised of multiple lines of related information.

1 one
2 two
3 three

4 start
5 five
6 six
7 seven

The magic sauce is the (?=) lookahead expression in the pattern. It splits the text block at the position just before the text matched by the lookahead, so the start line is captured as part of the second text block. Note one potentially confusing bit: the \s whitespace expression in front of the literal "start" text:

$p = "(?=\d\sstart)"

Removing the lookahead but keeping the grouping parentheses yields three text blocks, the second of which is the start line itself, now separated from the rest of the lines below it.

$p = "(\d\sstart)"

Removing both the lookahead and the grouping parentheses produces two text blocks, with the start line now removed entirely.

$p = "\d\sstart"
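The three variants side by side, sketched with Python's `re.split` (whose handling of lookaheads and capturing groups mirrors the .NET engine in this case):

```python
import re

s = "1 one\n2 two\n3 three\n4 start\n5 five\n6 six\n7 seven"

# Zero-width lookahead: the split happens just before the start line,
# which stays attached to the beginning of the second block.
keep = re.split(r"(?=\d\sstart)", s)

# Capturing group: the start line comes back as its own element,
# detached from the lines below it.
detached = re.split(r"(\d\sstart)", s)

# Plain pattern: the separator is consumed, so the start line is dropped.
dropped = re.split(r"\d\sstart", s)

print(len(keep), len(detached), len(dropped))   # 2 3 2
```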

Note that, as @alan-moore pointed out, the "multiline" modifier does not help here. It does no harm in this example, but it does no good either, so it is best simply to drop it. There is a commented-out line in the code block above to verify that conclusion.