Extract strings after specified characters on BigQuery

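I am trying to extract data from one column into multiple columns. The raw data is a long text in which the parameters are separated by spaces, as in the example below.

Query: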

SELECT * FROM `networkmanage.syslog`
WHERE DATE(msg_time) between DATE_SUB(current_date(), INTERVAL 1 DAY) AND current_date()
  and message like '%Rayong%'
  and message like '%urls%'
  and message like '%10.1.1.155%'
LIMIT 10

Output:

[
  {
    "message": "1 1641525138.935169636 Rayong_1 urls src=10.1.1.155:57977 dst=23.1.1.2:443 mac=XX:XX:XX:XX:XX:XX request: UNKNOWN https://example.lan/...",
    "msg_time": "2022-01-07 03:12:18.993264 UTC",
    "rcv_time": "2022-01-07 03:12:19.050126 UTC",
    "client_addr": "10.158.81.1"
  },
  {
    "message": "1 1641525883.268370959 Rayong1 urls src=10.1.1.155:58199 dst=23.1.1.2:80 mac=XX:XX:XX:XX:XX:XX agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36' request: POST http://example.lan/jsrpc.php?output=json-rpc",
    "msg_time": "2022-01-07 03:24:43.327320 UTC",
    "rcv_time": "2022-01-07 03:24:43.600830 UTC",
    "client_addr": "10.158.81.1"
  },
  {
    "message": "1 1641525892.720006714 Rayong_1 urls src=10.1.1.155:58207 dst=23.1.1.2:443 mac=XX:XX:XX:XX:XX:XX request: UNKNOWN https://acp-ss-an1.adobe.io/...",
    "msg_time": "2022-01-07 03:24:52.772515 UTC",
    "rcv_time": "2022-01-07 03:24:52.895756 UTC",
    "client_addr": "10.158.81.1"
  },
  {
    "message": "1 1641525894.263687469 Rayong_1 urls src=10.1.1.155:58199 dst=23.1.1.2:80 mac=XX:XX:XX:XX:XX:XX agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36' request: POST http://example.lan/jsrpc.php?output=json-rpc",
    "msg_time": "2022-01-07 03:24:54.331499 UTC",
    "rcv_time": "2022-01-07 03:24:54.620822 UTC",
    "client_addr": "10.158.81.1"
  }, ...

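I want to achieve two things:

  1. Extract the strings that follow "src=", "dst=", and "mac=", up to the next space.
  2. Extract the rest of the string after those parameters, including its spaces.

So the ideal output would look like this: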

[
  {
    "src": "10.1.1.155:57977",
    "dst": "23.1.1.2:443",
    "info": "request: UNKNOWN https://example.lan/...",
    "msg_time": "2022-01-07 03:12:18.993264 UTC",
    "rcv_time": "2022-01-07 03:12:19.050126 UTC",
    "client_addr": "10.158.81.1"
  }, ...

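Is it possible to do this directly with BigQuery syntax? Any help and guidance would be greatly appreciated. Thank you very much.

Consider the approach below: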

SELECT 
  REGEXP_EXTRACT(message, r' src=([^ ]+) ') src,
  REGEXP_EXTRACT(message, r' dst=([^ ]+) ') dst,
  REGEXP_EXTRACT(message, r' mac=([^ ]+) ') mac,
  REGEXP_EXTRACT(message, r' mac=[^ ]+ (.*)$') info,
  * EXCEPT(message) 
FROM `networkmanage.syslog`
WHERE DATE(msg_time) between DATE_SUB(current_date(), INTERVAL 1 DAY) AND current_date()
  and message like '%Rayong%'
  and message like '%urls%'
  and message like '%10.1.1.155%'
LIMIT 10        
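
To try the patterns without access to the table, a minimal sketch along these lines runs them against two of the sample messages supplied inline. The CTE name sample_syslog and the request_only column are made up for illustration, and request_only assumes every message contains a " request: " marker, which the sample rows suggest but the question does not guarantee.

```sql
-- Hypothetical inline rows copied from the sample output in the question,
-- so the query does not depend on the networkmanage.syslog table.
WITH sample_syslog AS (
  SELECT message
  FROM UNNEST([
    "1 1641525138.935169636 Rayong_1 urls src=10.1.1.155:57977 dst=23.1.1.2:443 mac=XX:XX:XX:XX:XX:XX request: UNKNOWN https://example.lan/...",
    "1 1641525883.268370959 Rayong1 urls src=10.1.1.155:58199 dst=23.1.1.2:80 mac=XX:XX:XX:XX:XX:XX agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36' request: POST http://example.lan/jsrpc.php?output=json-rpc"
  ]) AS message
)
SELECT
  -- Each capture group takes the run of non-space characters after the key.
  REGEXP_EXTRACT(message, r' src=([^ ]+) ') AS src,
  REGEXP_EXTRACT(message, r' dst=([^ ]+) ') AS dst,
  REGEXP_EXTRACT(message, r' mac=([^ ]+) ') AS mac,
  -- Everything after the mac value; this keeps agent='...' when it is present.
  REGEXP_EXTRACT(message, r' mac=[^ ]+ (.*)$') AS info,
  -- Assumption: drops the optional agent part by starting at "request: ".
  REGEXP_EXTRACT(message, r' (request: .*)$') AS request_only
FROM sample_syslog
```

For the first row this should give src = 10.1.1.155:57977, dst = 23.1.1.2:443, and info = request: UNKNOWN https://example.lan/..., matching the ideal output in the question. For the second row, info also contains the agent='...' part, while request_only starts at "request:". REGEXP_EXTRACT returns NULL for a row where a pattern does not match.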

The output looks like this: