How to get last run Datetime of Crawler in Athena?

I have an AWS Glue Crawler that runs twice a day and populates data in Athena.

QuickSight picks up the data from Athena and displays it in a dashboard.

I am implementing a LastDataRefresh (datetime) value to show in the QuickSight dashboard. Is there a way to get the crawler's last run datetime, so that I can store it in an Athena table and display it in QuickSight?

Any other suggestions are also welcome.

TL;DR: extract the crawlers' last runs from Glue's CloudWatch logs.

Glue sends a series of events to CloudWatch during each crawler run. Extract and process the "finished running" logs from the /aws-glue/crawlers log group to get the most recent one for each crawler.

Logs of a single crawler run:

2021-12-15T12:08:54.448+01:00   [7dd..] BENCHMARK : Running Start Crawl for Crawler lorawan_datasets_bucket_crawler
2021-12-15T12:09:12.559+01:00   [7dd..] BENCHMARK : Classification complete, writing results to database jokerman_events_database
2021-12-15T12:09:12.560+01:00   [7dd..] INFO : Crawler configured with SchemaChangePolicy {"UpdateBehavior":"UPDATE_IN_DATABASE","DeleteBehavior":"DEPRECATE_IN_DATABASE"}.
2021-12-15T12:09:27.064+01:00   [7dd..] BENCHMARK : Finished writing to Catalog
2021-12-15T12:12:13.768+01:00   [7dd..] BENCHMARK : Crawler has finished running and is in state READY

Extract and process the BENCHMARK : Crawler has finished running and is in state READY logs:

import boto3
from datetime import datetime, timedelta

def get_last_runs():
    session = boto3.Session(profile_name='sandbox', region_name='us-east-1')
    logs = session.client('logs')

    # only look at events from the last 14 days
    startTime = datetime.now() - timedelta(days=14)

    # https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs.html#CloudWatchLogs.Client.filter_log_events
    # note: filter_log_events returns a single page; follow nextToken if you expect more events
    filtered_events = logs.filter_log_events(
        logGroupName="/aws-glue/crawlers",
        # https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/FilterAndPatternSyntax.html#matching-terms-events
        filterPattern="BENCHMARK state READY", # match "BENCHMARK : Crawler has finished running and is in state READY" messages
        startTime=int(startTime.timestamp()*1000)
    )

    # each crawler writes to a log stream named after itself, so the stream name identifies the crawler
    completed_runs = [
        {"crawler": m.get("logStreamName"), "timestamp": datetime.fromtimestamp(m.get("timestamp")/1000).isoformat()}
        for m in filtered_events["events"]
    ]

    # rework the list to get a dictionary of the last runs by crawler
    crawlers = set(r['crawler'] for r in completed_runs)
    last_runs = dict()

    for n in crawlers:
        last_runs[n] = max(d["timestamp"] for d in completed_runs if d["crawler"] == n)

    return last_runs

print(get_last_runs())

Output:

{
  'lorawan_datasets_bucket_crawler': '2021-12-15T12:12:13.768000',
  'jokerman_lorawan_events_table_crawler': '2021-12-15T12:12:12.007000'
}
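
To store these timestamps where Athena (and therefore QuickSight) can see them, one option is to write the dictionary to S3 as JSON Lines and define an Athena external table (e.g. with the JSON SerDe) over that location. Below is a minimal sketch, assuming a hypothetical bucket (my-analytics-bucket) and prefix (crawler_last_runs/); the bucket, key and profile names are placeholders to adjust for your environment.

import json
import boto3

def publish_last_runs(last_runs, bucket="my-analytics-bucket", prefix="crawler_last_runs/"):
    # one JSON object per line, so Athena's JSON SerDe can read the file directly
    body = "\n".join(
        json.dumps({"crawler": name, "last_run": ts}) for name, ts in last_runs.items()
    )
    s3 = boto3.Session(profile_name='sandbox', region_name='us-east-1').client('s3')
    s3.put_object(Bucket=bucket, Key=prefix + "last_runs.json", Body=body.encode("utf-8"))

publish_last_runs(get_last_runs())

With an external table pointing at that prefix, QuickSight can read the last-run timestamps through an Athena dataset, for example as a LastDataRefresh field on the dashboard.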