Extract strings after specified characters in BigQuery
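I am trying to extract data from one column into several columns. The raw data is one long text string whose parameters are separated by spaces, as in the example below.
Query: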
SELECT * FROM `networkmanage.syslog`
WHERE DATE(msg_time) between DATE_SUB(current_date(), INTERVAL 1 DAY) AND current_date()
and message like '%Rayong%'
and message like '%urls%'
and message like '%10.1.1.155%'
LIMIT 10
Output:
[
{
"message": "1 1641525138.935169636 Rayong_1 urls src=10.1.1.155:57977 dst=23.1.1.2:443 mac=XX:XX:XX:XX:XX:XX request: UNKNOWN https://example.lan/...",
"msg_time": "2022-01-07 03:12:18.993264 UTC",
"rcv_time": "2022-01-07 03:12:19.050126 UTC",
"client_addr": "10.158.81.1"
},
{
"message": "1 1641525883.268370959 Rayong1 urls src=10.1.1.155:58199 dst=23.1.1.2:80 mac=XX:XX:XX:XX:XX:XX agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36' request: POST http://example.lan/jsrpc.php?output=json-rpc",
"msg_time": "2022-01-07 03:24:43.327320 UTC",
"rcv_time": "2022-01-07 03:24:43.600830 UTC",
"client_addr": "10.158.81.1"
},
{
"message": "1 1641525892.720006714 Rayong_1 urls src=10.1.1.155:58207 dst=23.1.1.2:443 mac=XX:XX:XX:XX:XX:XX request: UNKNOWN https://acp-ss-an1.adobe.io/...",
"msg_time": "2022-01-07 03:24:52.772515 UTC",
"rcv_time": "2022-01-07 03:24:52.895756 UTC",
"client_addr": "10.158.81.1"
},
{
"message": "1 1641525894.263687469 Rayong_1 urls src=10.1.1.155:58199 dst=23.1.1.2:80 mac=XX:XX:XX:XX:XX:XX agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36' request: POST http://example.lan/jsrpc.php?output=json-rpc",
"msg_time": "2022-01-07 03:24:54.331499 UTC",
"rcv_time": "2022-01-07 03:24:54.620822 UTC",
"client_addr": "10.158.81.1"
}, ...
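I want to achieve two things:
- Extract the strings that follow "src=", "dst=", and "mac=", up to the next space.
- Extract the rest of the string, spaces included, that comes after those parameters.
So the ideal output would look like this: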
[
{
"src": "10.1.1.155:57977",
"dst": "23.1.1.2:443",
"info": "request: UNKNOWN https://example.lan/...",
"msg_time": "2022-01-07 03:12:18.993264 UTC",
"rcv_time": "2022-01-07 03:12:19.050126 UTC",
"client_addr": "10.158.81.1"
}, ...
Is it possible to do this directly in BigQuery syntax? Any help or guidance would be much appreciated.
Consider the approach below:
SELECT
REGEXP_EXTRACT(message, r' src=([^ ]+) ') src,
REGEXP_EXTRACT(message, r' dst=([^ ]+) ') dst,
REGEXP_EXTRACT(message, r' mac=([^ ]+) ') mac,
REGEXP_EXTRACT(message, r' mac=[^ ]+ (.*)$') info,
* EXCEPT(message)
FROM `networkmanage.syslog`
WHERE DATE(msg_time) between DATE_SUB(current_date(), INTERVAL 1 DAY) AND current_date()
and message like '%Rayong%'
and message like '%urls%'
and message like '%10.1.1.155%'
LIMIT 10
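The output contains the extracted src, dst, mac, and info columns, followed by the remaining columns of the table (msg_time, rcv_time, client_addr).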
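One thing to watch with the patterns above: each requires a literal space on both sides of the value, so a message that ends right after the mac= value would return NULL for mac. A minimal sketch of a slightly more tolerant variant follows, assuming the fields are always written as key=value with no whitespace inside the value. The `sample` CTE and its second row are made up for illustration and stand in for `networkmanage.syslog`.

```sql
-- Sample rows are inlined so the query runs without access to the real table.
WITH sample AS (
  SELECT '1 1641525138.935169636 Rayong_1 urls src=10.1.1.155:57977 dst=23.1.1.2:443 mac=XX:XX:XX:XX:XX:XX request: UNKNOWN https://example.lan/...' AS message
  UNION ALL
  -- Hypothetical row: mac= is the last token, with nothing after it.
  SELECT '1 1641525138.935169636 Rayong_1 urls src=10.1.1.155:57977 dst=23.1.1.2:443 mac=XX:XX:XX:XX:XX:XX'
)
SELECT
  -- \b anchors the key at a word boundary; \S+ captures up to the next whitespace or end of string.
  REGEXP_EXTRACT(message, r'\bsrc=(\S+)') AS src,
  REGEXP_EXTRACT(message, r'\bdst=(\S+)') AS dst,
  REGEXP_EXTRACT(message, r'\bmac=(\S+)') AS mac,
  -- Everything after the mac value; NULL when nothing follows it.
  REGEXP_EXTRACT(message, r'\bmac=\S+\s+(.*)$') AS info
FROM sample
```

For the second row, mac is still extracted and only info comes back as NULL. If the real messages always have text after mac=, the original patterns behave the same way and this variant is optional.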