From 19744420c43fb207e70dd34aaf2cdb589847a428 Mon Sep 17 00:00:00 2001 From: Wei Lee Date: Thu, 1 Aug 2024 16:20:56 +0800 Subject: [PATCH] docs(dataset): add more detail on when dataset alias is resolved (#41152) --- docs/apache-airflow/authoring-and-scheduling/datasets.rst | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/apache-airflow/authoring-and-scheduling/datasets.rst b/docs/apache-airflow/authoring-and-scheduling/datasets.rst index 53b4ad38cd1ef..b424324c8f634 100644 --- a/docs/apache-airflow/authoring-and-scheduling/datasets.rst +++ b/docs/apache-airflow/authoring-and-scheduling/datasets.rst @@ -472,6 +472,8 @@ Scheduling based on dataset aliases ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Since dataset events added to an alias are just simple dataset events, a downstream depending on the actual dataset can read dataset events of it normally, without considering the associated aliases. A downstream can also depend on a dataset alias. The authoring syntax is referencing the ``DatasetAlias`` by name, and the associated dataset events are picked up for scheduling. Note that a DAG can be triggered by a task with ``outlets=DatasetAlias("xxx")`` if and only if the alias is resolved into ``Dataset("s3://bucket/my-task")``. The DAG runs whenever a task with outlet ``DatasetAlias("out")`` gets associated with at least one dataset at runtime, regardless of the dataset's identity. The downstream DAG is not triggered if no datasets are associated to the alias for a particular given task run. This also means we can do conditional dataset-triggering. +The dataset alias is resolved to the datasets during DAG parsing. Thus, if the "min_file_process_interval" configuration is set to a high value, there is a possibility that the dataset alias may not be resolved. To resolve this issue, you can trigger DAG parsing. + .. code-block:: python with DAG(dag_id="dataset-producer"):