AWS Athena regexp_extract() broken

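I am using AWS Athena to extract some statistics from CloudWatch logs. However, using the Presto regexp_extract() function produces an empty result set, even though the regex looks fine according to online regex testers.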

A sample of the source CloudWatch log follows:

2021-10-04 00:10:56.201 INFO 10711 --- [io-5000-exec-31] au.com.crecy.VP4CStatistics : {"atlassianLicense" : {"key" : "visio-publisher-for-confluence","version" : "1.1.5-AC","state" : "ENABLED","installedDate" : 1619695028000,"lastUpdated" : 1632692975000,"license" : {"active" : true,"type" : "COMMERCIAL","evaluation" : false,"supportEntitlementNumber" : "SEN-0123456789"},"valid" : true,"host" : {"product" : "Confluence","contacts" : [ ]},"links" : {"marketplace" : [{"href" : "https://marketplace.atlassian.com/plugins/visio-publisher-for-confluence"}],"self" : [{"href" : "https://acme.atlassian.net/wiki/rest/atlassian-connect/1/addons/visio-publisher-for-confluence"}]}},"viewAttachments" : [{"height" : "1000","width" : "100%","scrolling" : "no","frameBorder" : "hide","url" : "/download/attachments/574160906/Foo.html.zip?version=22&modificationDate=1632311039065&cacheVersion=1&api=v2","space" : "VM","page" : 574160906,"id" : "att568885320","frameBorderStyle" : "border:none;"}],"durations" : {"1" : {"method" : "ModelGenAtlassianConnectPlugin.loadHtmlAttachment","startTime" : 2542004145837271,"endTime" : 2542005346331840,"durationMillis" : 1200,"durationNanos" : 1200494569},"2" : {"method" : "AtlassianHostRestClientsHelper.getLicense","startTime" : 2542004145845740,"endTime" : 2542004523777555,"durationMillis" : 377,"durationNanos" : 377931815},"3" : {"method" : "AtlassianHostRestClientsHelper.processJwt","startTime" : 2542004523813282,"endTime" : 2542004525757229,"durationMillis" : 1,"durationNanos" : 1943947},"4" : {"method" : "AttachmentLoaderHelper.loadAttachment","startTime" : 2542004525774026,"endTime" : 2542005346321184,"durationMillis" : 820,"durationNanos" : 820547158},"5" : {"method" : "AtlassianHostRestClientsHelper.getAttachments","startTime" : 2542004525784513,"endTime" : 2542004796450920,"durationMillis" : 270,"durationNanos" : 270666407},"6" : {"method" : "AtlassianHostRestClientsHelper.getCompressedPageSource","startTime" : 2542004796503557,"endTime" : 
2542005341655641,"durationMillis" : 545,"durationNanos" : 545152084},"7" : {"method" : "ChecksumHelper.checksumValid","startTime" : 2542005341695482,"endTime" : 2542005341889382,"durationMillis" : 0,"durationNanos" : 193900},"8" : {"method" : "UnzipCompressionHelper.unzipCompressedPageSource","startTime" : 2542005341899585,"endTime" : 2542005346303984,"durationMillis" : 4,"durationNanos" : 4404399},"9" : {"method" : "VP4CResponseHeaderFilter.doFilter","startTime" : 2542008074147431,"endTime" : 2542008074514454,"durationMillis" : 0,"durationNanos" : 367023}},"uncompressedSize" : 520631,"compressedSize" : 48836}

The AWS Athena/Presto query is as follows:

select regexp_extract(message, '(au.com.crecy.VP4CStatistics : )({.*}$)', 2)
FROM "VP4C_Statistics_Catalog"."/aws/elasticbeanstalk/vp4c-prod/var/log/web.stdout.log"."all_log_streams" 
where message LIKE '%visio-publisher-for-confluence%'
order by time desc

In short, I want to extract the JSON payload at the end of the log message. The query above produces an empty result set.

Thanks and regards, Andrew

Note that {, } and . are regex metacharacters and may need to be escaped with a backslash.

SELECT REGEXP_EXTRACT(message, 'au\.com\.crecy\.VP4CStatistics : (\{.*\})$', 1)
FROM "VP4C_Statistics_Catalog"."/aws/elasticbeanstalk/vp4c-prod/var/log/web.stdout.log"."all_log_streams" 
WHERE message LIKE '%visio-publisher-for-confluence%'
ORDER BY time DESC;
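
To see what escaping the dots changes, here is a minimal Python sketch. Python's re module stands in for Athena's regex engine, which is an assumption: the engines are not identical (Python, for example, treats an unescaped { that is not a valid quantifier as a literal, while other engines may reject it), so this only illustrates the unescaped-dot behaviour. Both sample strings are made up.

```python
import re

# Hypothetical sample strings; only the first contains the real logger name.
real = 'au.com.crecy.VP4CStatistics : {"valid" : true}'
lookalike = 'auXcomYcrecyZVP4CStatistics : {"valid" : true}'

unescaped = re.compile(r'au.com.crecy.VP4CStatistics : (\{.*\})$')
escaped = re.compile(r'au\.com\.crecy\.VP4CStatistics : (\{.*\})$')

# An unescaped dot matches any character, so the lookalike also matches.
unescaped_hits = [bool(unescaped.search(s)) for s in (real, lookalike)]
# Escaped dots match only literal dots.
escaped_hits = [bool(escaped.search(s)) for s in (real, lookalike)]

print(unescaped_hits)  # [True, True]
print(escaped_hits)    # [True, False]
```

Unescaped dots make the pattern looser rather than stricter, so on their own they would not explain an empty result set.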

OK, got it: the regex needs to match multiple spaces or tabs (even though the sample log appears to contain only a single space). The following pattern works:

select regexp_extract(message, '(au\.com\.crecy\.VP4CStatistics[ \t]*:[ \t]*)(\{.*\})', 2)
FROM "VP4C_Statistics_Catalog"."/aws/elasticbeanstalk/vp4c-prod/var/log/web.stdout.log"."all_log_streams" 
where message LIKE '%visio-publisher-for-confluence%'
order by time desc
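
The whitespace finding can be checked outside Athena with a small Python sketch. Two assumptions apply: Python's re module stands in for Athena's regex engine (as far as I know, these particular constructs behave the same way in both), and the shortened messages below are invented, with two spaces and a tab placed around the colon to mimic what the real log stream apparently contains. The helper name extract_payload is hypothetical.

```python
import json
import re

# Strict pattern: exactly one space on each side of the colon.
STRICT = re.compile(r'au\.com\.crecy\.VP4CStatistics : (\{.*\})$')
# Tolerant pattern: any run of spaces or tabs around the colon.
TOLERANT = re.compile(r'au\.com\.crecy\.VP4CStatistics[ \t]*:[ \t]*(\{.*\})')


def extract_payload(pattern, message):
    """Return the parsed JSON payload, or None if the pattern does not match."""
    match = pattern.search(message)
    return json.loads(match.group(1)) if match else None


# Invented, shortened messages (not real log data).
single_space = 'INFO au.com.crecy.VP4CStatistics : {"compressedSize" : 48836}'
mixed_space = 'INFO au.com.crecy.VP4CStatistics  :\t{"compressedSize" : 48836}'

for message in (single_space, mixed_space):
    print(extract_payload(STRICT, message), extract_payload(TOLERANT, message))
```

The strict pattern returns None for the message with extra whitespace, while the tolerant one still recovers the payload. That matches the symptom of an empty result set when the stored messages differ from the pasted sample only in whitespace.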