DAG cli and catchup

I have a DAG:

dag = DAG(
    dag_id='example_bash_operator',
    default_args=args,
    schedule_interval='0 0 * * *',
    start_date=days_ago(2),
    dagrun_timeout=timedelta(minutes=60),
    tags=['example']
)

What is the purpose of dag.cli()? What does cli() do?

if __name__ == "__main__":
    dag.cli()

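Today is October 14. When I add catchup=False, it executes for October 13. Shouldn't it execute only for the 14th? Without it, it executes for the 12th and 13th, which makes sense because it backfills. But with catchup set to False, why does it execute for October 13?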

dag = DAG(
    dag_id='example_bash_operator',
    default_args=args,
    schedule_interval='0 0 * * *',
    start_date=days_ago(2),
    catchup=False,
    dagrun_timeout=timedelta(minutes=60),
    tags=['example']
)

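You should avoid setting start_date to a relative value - this can lead to unexpected behavior, because the value is re-evaluated every time the DAG file is parsed.

There is a long description in the Airflow FAQ: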

We recommend against using dynamic values as start_date, especially datetime.now() as it can be quite confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour after now as now() moves along.
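To make the start_date point concrete, here is a minimal sketch in plain Python (no Airflow import). The helper relative_start is a hypothetical stand-in that mimics what days_ago(2) does as I understand it (midnight of the current day minus n days), and the year 2020 is just an assumed value for illustration. It shows that a relative start date moves whenever the file is parsed on a different day, while a fixed one does not.

```python
from datetime import datetime, timedelta


def relative_start(parse_time, n=2):
    """Hypothetical stand-in for days_ago(n): midnight of the parse day minus n days."""
    midnight = parse_time.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=n)


# A fixed start date is the same no matter when the DAG file is parsed.
FIXED_START = datetime(2020, 10, 12)  # assumed year, for illustration only

# Two parses of the same DAG file, one day apart (assumed timestamps).
first_parse = datetime(2020, 10, 14, 8, 0)
next_day_parse = datetime(2020, 10, 15, 8, 0)

start_seen_first = relative_start(first_parse)
start_seen_next_day = relative_start(next_day_parse)

print(start_seen_first)     # 2020-10-12 00:00:00
print(start_seen_next_day)  # 2020-10-13 00:00:00
print(FIXED_START)          # 2020-10-12 00:00:00
```

With a fixed value such as datetime(2020, 10, 12) passed as start_date, the set of schedule intervals stays stable between parses, which makes the scheduler's behavior much easier to reason about.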

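Regarding dag.cli(), I would remove that whole part - it is definitely not needed for the DAG to be executed by the Airflow scheduler, see

Regarding catchup=False and why it executes for October 13 - take a look at the scheduler documentation: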

The scheduler won’t trigger your tasks until the period it covers has ended e.g., A job with schedule_interval set as @daily runs after the day has ended. This technique makes sure that whatever data is required for that period is fully available before the dag is executed. In the UI, it appears as if Airflow is running your tasks a day late

Note If you run a DAG on a schedule_interval of one day, the run with execution_date 2019-11-21 triggers soon after 2019-11-21T23:59. Let’s Repeat That, the scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
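Applied to the question, the rule quoted above can be sketched with plain datetime arithmetic. This is not Airflow's scheduler code, only a simplified model of it: the function names are hypothetical, the year 2020 and the time of day are assumed, and the start date is taken as October 12 (what days_ago(2) would give on October 14).

```python
from datetime import datetime, timedelta


def completed_execution_dates(now, start_date, interval=timedelta(days=1)):
    """Execution dates of every interval that has fully closed by `now`."""
    completed = max((now - start_date) // interval, 0)  # number of closed intervals
    return [start_date + i * interval for i in range(completed)]


def latest_execution_date(now, start_date, interval=timedelta(days=1)):
    """Execution date of the most recent closed interval, or None if none has closed."""
    dates = completed_execution_dates(now, start_date, interval)
    return dates[-1] if dates else None


start = datetime(2020, 10, 12)      # assumed: days_ago(2) evaluated on October 14
now = datetime(2020, 10, 14, 9, 30)  # assumed: some time during October 14

# With catchup enabled, every closed interval gets a run: October 12 and 13.
for execution_date in completed_execution_dates(now, start):
    print(execution_date)

# With catchup=False, only the most recent closed interval gets a run.
print(latest_execution_date(now, start))  # 2020-10-13 00:00:00
```

The run stamped October 13 covers the period from October 13 00:00 to October 14 00:00, so it can only start once October 14 has begun. A run stamped October 14 would not trigger until that day has ended, which is why catchup=False still gives you October 13.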

Also, the article Scheduling Tasks in Airflow may be worth a read.