Regex issue when the log format has multiple IPs

Question

I have a question about the Fluentd log parser. The following configuration works fine when there are 2 IPs.

expression  /^(?<client_ip>[^ ]*)(?:, (?<lb_ip>[^ ]*))? (?<ident>[^ ]*) (?<user>[^ ]*) \[(?<time>[^ ]* [^ ]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) (?<protocol>[A-Z]{1,}[^ ]*)+\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)/

This matches:

148.165.41.129, 10.25.1.120 - - [09/Dec/2019:16:22:23 +0000] "GET /comet_request/44109669162/F1551019433002Y5MYEP?F155101943300742PMLG=1551019433877&_=1575904426457 HTTP/1.1" 200 0 0 0

When there are 3 IPs, I get a "pattern not matched" warning.

This does not match:

176.30.235.70, 165.225.70.200, 10.25.1.120 - - [09/Dec/2019:13:30:57 +0000] \"GET /comet_request/71142769981/F1551018730440IY5YNF?F1551018721447ZVKYZ4=1551018733078&_=1575898029473 HTTP/1.1\" 200 0 0 0

I tried the following regex, but it does not work. Can someone help?

expression /^(?<client_ip>[^ ]*)(?:, (?<proxy_ip>[^ ]*))? (?:, (?<lb_ip>[^ ]*))? (?<ident>[^ ]*) (?<user>[^ ]*) \[(?<time>[^ ]* [^ ]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) (?<protocol>[A-Z]{1,}[^ ]*)+\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)$/

Answer 1

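You need to use a more specific pattern to match the IPs, such as [\d.]+ or [^, ]+, and make sure you also match the last two fields: your pattern does not match them, and $ requires the end of the line/string.

Use a pattern like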

^(?<client_ip>[^ ,]+)(?:, +(?<proxy_ip>[^ ,]+))?(?:, +(?<lb_ip>[^ ,]+))? (?<ident>[^ ]+) (?<user>[^ ]+) \[(?<time>[^\]\[ ]* [^\]\[ ]*)\] "(?<method>\S+)(?: +(?<path>\S+) (?<protocol>[A-Z][^" ]*)[^"]*)?" (?<code>\S+) (?<size>\S+) \S+ \S+$

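See the regex demo.

The IP matching part is ^(?<client_ip>[^ ,]+)(?:, +(?<proxy_ip>[^ ,]+))?(?:, +(?<lb_ip>[^ ,]+))?. Note that [^ ,]+ matches one or more characters other than a space and a comma, and that \S+ \S+ has been added to the end of the pattern (if these fields are numbers, you can use \d+ \d+ and capture them if needed).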
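As a quick check, the pattern above can be tried in plain Ruby, whose regex syntax Fluentd uses. This is only a sketch: the two log lines are copied from the question, on the assumption that the \" in the three-IP line is display escaping and the raw log contains plain double quotes. The names PATTERN, two_ips and three_ips are illustrative.

```ruby
# Pattern from the answer, written with %r{} so the double quotes need no escaping.
PATTERN = %r{^(?<client_ip>[^ ,]+)(?:, +(?<proxy_ip>[^ ,]+))?(?:, +(?<lb_ip>[^ ,]+))? (?<ident>[^ ]+) (?<user>[^ ]+) \[(?<time>[^\]\[ ]* [^\]\[ ]*)\] "(?<method>\S+)(?: +(?<path>\S+) (?<protocol>[A-Z][^" ]*)[^"]*)?" (?<code>\S+) (?<size>\S+) \S+ \S+$}

# Sample lines from the question (assumed to contain plain double quotes).
two_ips = '148.165.41.129, 10.25.1.120 - - [09/Dec/2019:16:22:23 +0000] "GET /comet_request/44109669162/F1551019433002Y5MYEP?F155101943300742PMLG=1551019433877&_=1575904426457 HTTP/1.1" 200 0 0 0'
three_ips = '176.30.235.70, 165.225.70.200, 10.25.1.120 - - [09/Dec/2019:13:30:57 +0000] "GET /comet_request/71142769981/F1551018730440IY5YNF?F1551018721447ZVKYZ4=1551018733078&_=1575898029473 HTTP/1.1" 200 0 0 0'

m2 = PATTERN.match(two_ips)
m3 = PATTERN.match(three_ips)

# Print the three IP captures for each line.
[m2, m3].each do |m|
  puts "client=#{m[:client_ip].inspect} proxy=#{m[:proxy_ip].inspect} lb=#{m[:lb_ip].inspect}"
end
```

One consequence worth noting: with only two IPs, the second address is captured as proxy_ip and lb_ip is left nil, because the optional groups are filled from left to right.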

Answer 2

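Example strings

Let's consider a simplified version of your problem, focusing on the first four named captures (handling the remaining ones is straightforward).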

str1 = "148.165.41.129, 10.25.1.120 - - [09/Dec/2019:16:22:23 +0000]"

str2 = "176.30.235.70, 165.225.70.200, 10.25.1.120 - - [09/Dec/2019:13:30:57 +0000]"

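Regular expression written in free-spacing mode

If the strings have a valid structure, the following regular expression can be used to extract the contents of the named captures. Note that it requires the IPv4 addresses and the date-time string to have the expected shape (rather than merely matching [^ ]+ and [^ ]+ [^ ]+). I have written the regular expression in free-spacing mode to make it self-documenting.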

r = /
    \A              # match the beginning of the string 
    (?<client_ip>   # begin a capture group named client_ip
      \g<user_ip>   # evaluate the subexpression (capture group) named user_ip
    )               # end capture group client_ip
    (?:             # begin a non-capture group
      ,[ ]          # match the string ', '
      (?<lb_ip>     # begin a capture group named lb_ip
        \g<user_ip> # evaluate the subexpression (capture group) named user_ip
      )             # end capture group lb_ip
    )?              # end non-capture group and optionally execute it
    (?:             # begin a non-capture group
      ,[ ]          # match the string ', '
      (?<user_ip>   # begin a capture group named user_ip
        \d{1,3}     # match 1-3 digits 
        (?:         # begin a non-capture group
          \.\d{1,3} # match a period followed by 1-3 digits
        ){3}        # end the non-capture group and execute 3 times
      )             # end capture group user_ip
    )               # end non-capture group
    [ ]-[ ]-[ ]\[   # match the string ' - - ['
    (?<time>        # begin a capture group named time 
      \d{2}\/\p{L}{3}\/\d{4}:\d{2}:\d{2}:\d{2}[ ]\+\d{4}
                    # match a time string
    )               # end capture group time                    
    \]              # match string ']'
    \z              # match end of string
    /x              # free-spacing regex definition mode

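Matching the strings against the regular expression

We now confirm that both strings match this regular expression and extract the contents of the capture groups.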

    m1 = str1.match(r)
    m1.named_captures
      #=> {"client_ip"=>"148.165.41.129",
      #    "lb_ip"=>nil,
      #    "user_ip"=>"10.25.1.120",
      #    "time"=>"09/Dec/2019:16:22:23 +0000"}

    m2 = str2.match(r)
    m2.named_captures
      #=> {"client_ip"=>"176.30.235.70",
      #    "lb_ip"=>"165.225.70.200",
      #    "user_ip"=>"10.25.1.120",
      #    "time"=>"09/Dec/2019:13:30:57 +0000"}
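The pattern \d{1,3}(?:\.\d{1,3}){3} checks only the shape of an address, so a string such as 999.1.1.1 would pass. If stricter checking matters, one option (not part of this answer) is to validate each captured value afterwards with Ruby's standard IPAddr class. The helper name ipv4? below is hypothetical.

```ruby
require "ipaddr"

# Hypothetical helper (not from the answer): true when text parses as an IPv4 address.
def ipv4?(text)
  IPAddr.new(text).ipv4?
rescue IPAddr::InvalidAddressError
  false
end

# The same address shape used inside the regular expression above.
shape = /\A\d{1,3}(?:\.\d{1,3}){3}\z/

candidates = ["10.25.1.120", "999.1.1.1"]

# For each candidate, record whether it has the right shape and whether it is a real IPv4 address.
results = candidates.map { |ip| [ip, shape.match?(ip), ipv4?(ip)] }
results.each { |ip, shape_ok, valid| puts "#{ip}: shape=#{shape_ok} valid=#{valid}" }
```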

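Subexpression calls

Rather than duplicating the contents of capture group user_ip in each of the first two named capture groups, I simply used \g<user_ip>, which in effect tells the regex engine to evaluate the contents of the capture group (subexpression) user_ip at the location where \g<user_ip> appears. Search for "Subexpression Calls" in the documentation for Regexp.

Note that the subexpression calls refer forward, to a group defined later in the pattern. Suppose we had instead written: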

r = /
    \A 
    (?<client_ip>\d{1,3}(?:\.\d{1,3}){3})
    (?:,[ ](?<lb_ip>\g<client_ip>))?
    (?:,[ ](?<user_ip>\g<client_ip>))
    [ ]-[ ]-[ ]\[
    (?<time>\d{2}\/\p{L}{3}\/\d{4}:\d{2}:\d{2}:\d{2}[ ]\+\d{4}) 
    \]
    \z
    /x

Then

    m1 = str1.match(r)
    m1.named_captures
      #=> {"client_ip"=>"10.25.1.120",
      #    "lb_ip"=>nil,
      #    "user_ip"=>"10.25.1.120", 
      #    "time"=>"09/Dec/2019:16:22:23 +0000"}

    m2 = str2.match(r)
    m2.named_captures
      #=> {"client_ip"=>"10.25.1.120",
      #    "lb_ip"=>"165.225.70.200",
      #    "user_ip"=>"10.25.1.120",
      #    "time"=>"09/Dec/2019:13:30:57 +0000"}

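As shown above, the contents of capture group client_ip are set equal to the contents of user_ip, because each later call \g<client_ip> overwrites what client_ip captured. The reason for this behaviour is explained in the linked reference (look for "In PCRE but not Perl, one interesting twist is..." and the other sections of that document it refers to).

Regular expression written conventionally

Written conventionally, the regular expression is as follows: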
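The overwriting effect can be reproduced with a much smaller pattern. The sketch below uses made-up group names (d and first) and a made-up input string; it only illustrates the behaviour described above in Ruby's regex engine.

```ruby
# Calling a group that was defined earlier: the call re-captures into that group,
# so the group ends up holding the text matched by the call.
backward = /(?<d>\d+)-\g<d>/.match("12-34")
puts backward[:d]

# Calling a group that is defined later: the definition runs last, so the called
# group keeps its own text and the wrapping group keeps the earlier text.
forward = /(?<first>\g<d>)-(?<d>\d+)/.match("12-34")
puts forward[:first]
puts forward[:d]
```

This is why the answer defines user_ip last and calls it from the earlier groups: each group then ends up holding its own address.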

/\A(?<client_ip>\g<user_ip>)(?:, (?<lb_ip>\g<user_ip>))?(?:, (?<user_ip>\d{1,3}(?:\.\d{1,3}){3})) - - \[(?<time>\d{2}\/\p{L}{3}\/\d{4}:\d{2}:\d{2}:\d{2} \+\d{4})\]\z/

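Note that when the regular expression is written in free-spacing mode, each space above is placed in a character class containing a single space. This is necessary because, in free-spacing mode, unprotected spaces are removed before the expression is parsed. Another way to protect spaces is to escape them (\ ). If you want to match any whitespace rather than only spaces, you can use \s.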
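A short sketch of how spaces behave under the x flag, using throwaway patterns and inputs that are not taken from the log format:

```ruby
# Each entry pairs a description with the result of one match attempt.
checks = {
  "bare space is dropped, so 'a b' is not matched" => /a b/x.match?("a b"),
  "bare space is dropped, so 'ab' is matched"      => /a b/x.match?("ab"),
  "space in a character class is kept"             => /a[ ]b/x.match?("a b"),
  "escaped space is kept"                          => /a\ b/x.match?("a b"),
  "\\s also matches a tab"                         => /a\sb/x.match?("a\tb")
}

checks.each { |description, matched| puts "#{description}: #{matched}" }
```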
