Skip to content

Commit

Permalink
Allow committer builds to use scripts/ci, dev and actions from the PR (
Browse files Browse the repository at this point in the history
…apache#37057)

In our process, we generally do not let the scripts in the "build
images" workflow to use `scripts/ci`, `dev` and `action` scripts to come
from the PR. This is a security feature that prevent Pull Requests from
forks to run code on a worker that can potentially access sensitive
information - such as GITHUB_TOKEN with write access to Github
Registry.

This, however, causes troubles, because in order to test any changes
in those scripts affecting building image, you have to close your
PR from the fork and push one directly to Apache repository (there
in-line build workflows are used from "Test" workflow and those
PRs are safe to run, because only committers can push directly to the
`apache/airflow` repository branches.

This PR changes default behaviour for committer PRs. Rather than
do the same as "regular" PRs, those PRs will not use scripts from
the target branch, instead they will use scripts from the incoming
PRs of the committers. This is equally safe as running PRs from
the `apache/airflow` branch - because we have a reviewed list
of committers in our code and "selective checks" job that
checks it is run always in the context and with the code of
the "target" branch, which means that you cannot manipulate the
list of actors.

The Girhub actor is retrieved from pull requests github
context (event/pull_request/user/login) so it is impossible to
spoof it by the incoming PR.

As part of this PR - list of available selective checks and
documentation of PR labels and selective checks (wrongly named
as "static checks") were reviewed and updated.

While impolementing this, we also realised that we can simplify
branch information retrieval. The code that we had in workflow
was written a long time ago, when the target branch was always
"main" - so we had to check-out the target commit to be able to
retrieve branch_defaults.py and get the branches from there. However
it's already for quite some time that "pull request workflow"
uses "base_ref" as the base commit, which means that in `main` it
is `main` and in `v2-8-test` it is `v2-8-stable`, which means that
we already have the correct `AIRFLOW_BRANCH` and the
`DEFAULT_AIRFLOW_CONSTRAINTS_BRANCH` without having to check
the incoming commit. Which means that we do not need to override
scripts in the build-info step, we only need to check it out
temporarily to fetch the incoming PR and it's parent to see
what files are changed in the incoming PR.
  • Loading branch information
potiuk authored Jan 28, 2024
1 parent 4fa8e45 commit 07fd364
Show file tree
Hide file tree
Showing 12 changed files with 298 additions and 170 deletions.
165 changes: 93 additions & 72 deletions .github/workflows/build-images.yml

Large diffs are not rendered by default.

28 changes: 0 additions & 28 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -146,34 +146,6 @@ jobs:
persist-credentials: false
- name: "Install Breeze"
uses: ./.github/actions/breeze
- name: "Retrieve defaults from branch_defaults.py"
id: defaults
# We could retrieve it differently here - by just importing the variables and
# printing them from python code, however we want to have the same code as used in
# the build-images.yml (there we cannot import python code coming from the PR - we need to
# treat the python code as text and extract the variables from there.
run: |
python - <<EOF >> ${GITHUB_ENV}
from pathlib import Path
import re
import sys
DEFAULTS_CONTENT = Path('dev/breeze/src/airflow_breeze/branch_defaults.py').read_text()
BRANCH_PATTERN = r'^AIRFLOW_BRANCH = "(.*)"$'
CONSTRAINTS_BRANCH_PATTERN = r'^DEFAULT_AIRFLOW_CONSTRAINTS_BRANCH = "(.*)"$'
branch = re.search(BRANCH_PATTERN, DEFAULTS_CONTENT, re.MULTILINE).group(1)
constraints_branch = re.search(CONSTRAINTS_BRANCH_PATTERN, DEFAULTS_CONTENT, re.MULTILINE).group(1)
output = f"""
DEFAULT_BRANCH={branch}
DEFAULT_CONSTRAINTS_BRANCH={constraints_branch}
""".strip()
print(output)
# Stdout is redirected to GITHUB_ENV but we also print it to stderr to see it in ci log
print(output, file=sys.stderr)
EOF
- name: "Get information about the Workflow"
id: source-run-info
run: breeze ci get-workflow-info 2>> ${GITHUB_OUTPUT}
Expand Down
2 changes: 1 addition & 1 deletion dev/breeze/doc/08_ci_tasks.rst
Original file line number Diff line number Diff line change
Expand Up @@ -91,7 +91,7 @@ The selective-check command will produce the set of ``name=value`` pairs of outp
from the context of the commit/PR to be merged via stderr output.

More details about the algorithm used to pick the right tests and the available outputs can be
found in `Selective Checks <dev/breeze/SELECTIVE_CHECKS.md>`_.
found in `Selective Checks <ci/04_selective_checks.md>`_.

These are all available flags of ``selective-check`` command:

Expand Down

Large diffs are not rendered by default.

30 changes: 30 additions & 0 deletions dev/breeze/doc/ci/05_workflows.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@
- [Workflows](#workflows)
- [Build Images Workflow](#build-images-workflow)
- [Differences for main and release branches](#differences-for-main-and-release-branches)
- [Committer vs. non-committer PRs](#committer-vs-non-committer-prs)
- [Tests Workflow](#tests-workflow)
- [CodeQL scan](#codeql-scan)
- [Publishing documentation](#publishing-documentation)
Expand Down Expand Up @@ -84,6 +85,13 @@ CI/CD scripts or changes to the CI/CD workflows). In this case the PR is
run in the context of the "apache/airflow" repository and has WRITE
access to the GitHub Container Registry.

When the PR changes important files (for example `generated/provider_depdencies.json` or
`pyproject.toml`), the PR is run in "upgrade to newer dependencies" mode - where instead
of using constraints to build images, attempt is made to upgrade all dependencies to latest
versions and build images with them. This way we check how Airflow behaves when the
dependencies are upgraded. This can also be forced by setting the `upgrade to newer dependencies`
label in the PR if you are a committer and want to force dependency upgrade.

## Canary run

This workflow is triggered when a pull request is merged into the "main"
Expand Down Expand Up @@ -184,6 +192,28 @@ the new branch and there are a few places where selection of tests is
based on whether this output is `main`. They are marked as - in the
"Release branches" column of the table below.

## Committer vs. non-committer PRs

There is a difference in how the CI jobs are run for committer and non-committer PRs from forks.
Main reason is security - we do not want to run untrusted code on our infrastructure for self-hosted runners,
but also we do not want to run unverified code during the `Build imaage` workflow, because that workflow has
access to GITHUB_TOKEN that has access to write to the Github Registry of ours (which is used to cache
images between runs). Also those images are build on self-hosted runners and we have to make sure that
those runners are not used to (fore example) mine cryptocurrencies on behalf of the person who opened the
pull request from their newly opened fork of airflow.

This is why the `Build Images` workflow checks if the actor of the PR (GITHUB_ACTOR) is one of the committers,
and if not, then workflows and scripts used to run image building are coming only from the ``target`` branch
of the repository, where such scripts were reviewed and approved by the committers before being merged.

This is controlled by `Selective checks <04_selective_checks.md>`__ that set appropriate output in
the build-info job of the workflow (see`is-committer-build` to `true`) if the actor is in the committer's
list and can be overridden by `non committer build` label in the PR.

Also, for most of the jobs, committer builds by default use "Self-hosted" runners, while non-committer
builds use "Public" runners. For committers, this can be overridden by setting the
`use public runners` label in the PR.

## Tests Workflow

This workflow is a regular workflow that performs all checks of Airflow
Expand Down
5 changes: 3 additions & 2 deletions dev/breeze/src/airflow_breeze/commands/ci_commands.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@

import click

from airflow_breeze.branch_defaults import AIRFLOW_BRANCH, DEFAULT_AIRFLOW_CONSTRAINTS_BRANCH
from airflow_breeze.commands.common_options import (
option_answer,
option_dry_run,
Expand Down Expand Up @@ -197,14 +198,14 @@ def get_changed_files(commit_ref: str | None) -> tuple[str, ...]:
@click.option(
"--default-branch",
help="Branch against which the PR should be run",
default="main",
default=AIRFLOW_BRANCH,
envvar="DEFAULT_BRANCH",
show_default=True,
)
@click.option(
"--default-constraints-branch",
help="Constraints Branch against which the PR should be run",
default="constraints-main",
default=DEFAULT_AIRFLOW_CONSTRAINTS_BRANCH,
envvar="DEFAULT_CONSTRAINTS_BRANCH",
show_default=True,
)
Expand Down
6 changes: 2 additions & 4 deletions dev/breeze/src/airflow_breeze/params/common_build_params.py
Original file line number Diff line number Diff line change
Expand Up @@ -46,10 +46,8 @@ class CommonBuildParams:
additional_dev_apt_env: str | None = None
additional_python_deps: str | None = None
additional_pip_install_flags: str | None = None
airflow_branch: str = os.environ.get("DEFAULT_BRANCH", AIRFLOW_BRANCH)
default_constraints_branch: str = os.environ.get(
"DEFAULT_CONSTRAINTS_BRANCH", DEFAULT_AIRFLOW_CONSTRAINTS_BRANCH
)
airflow_branch: str = AIRFLOW_BRANCH
default_constraints_branch: str = DEFAULT_AIRFLOW_CONSTRAINTS_BRANCH
airflow_constraints_location: str | None = None
builder: str = "autodetect"
build_progress: str = ALLOWED_BUILD_PROGRESS[0]
Expand Down
6 changes: 2 additions & 4 deletions dev/breeze/src/airflow_breeze/params/shell_params.py
Original file line number Diff line number Diff line change
Expand Up @@ -115,7 +115,7 @@ class ShellParams:
Shell parameters. Those parameters are used to determine command issued to run shell command.
"""

airflow_branch: str = os.environ.get("DEFAULT_BRANCH", AIRFLOW_BRANCH)
airflow_branch: str = AIRFLOW_BRANCH
airflow_constraints_location: str = ""
airflow_constraints_mode: str = ALLOWED_CONSTRAINTS_MODES_CI[0]
airflow_constraints_reference: str = ""
Expand All @@ -130,9 +130,7 @@ class ShellParams:
collect_only: bool = False
database_isolation: bool = False
db_reset: bool = False
default_constraints_branch: str = os.environ.get(
"DEFAULT_CONSTRAINTS_BRANCH", DEFAULT_AIRFLOW_CONSTRAINTS_BRANCH
)
default_constraints_branch: str = DEFAULT_AIRFLOW_CONSTRAINTS_BRANCH
dev_mode: bool = False
docker_host: str | None = os.environ.get("DOCKER_HOST")
downgrade_sqlalchemy: bool = False
Expand Down
12 changes: 10 additions & 2 deletions dev/breeze/src/airflow_breeze/utils/selective_checks.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@
from functools import cached_property, lru_cache
from typing import Any, Dict, List, TypeVar

from airflow_breeze.branch_defaults import AIRFLOW_BRANCH, DEFAULT_AIRFLOW_CONSTRAINTS_BRANCH
from airflow_breeze.global_constants import (
ALL_PYTHON_MAJOR_MINOR_VERSIONS,
APACHE_AIRFLOW_GITHUB_REPOSITORY,
Expand Down Expand Up @@ -62,6 +63,7 @@
FULL_TESTS_NEEDED_LABEL = "full tests needed"
DEBUG_CI_RESOURCES_LABEL = "debug ci resources"
USE_PUBLIC_RUNNERS_LABEL = "use public runners"
NON_COMMITTER_BUILD_LABEL = "non committer build"
UPGRADE_TO_NEWER_DEPENDENCIES_LABEL = "upgrade to newer dependencies"

ALL_CI_SELECTIVE_TEST_TYPES = (
Expand Down Expand Up @@ -350,8 +352,8 @@ class SelectiveChecks:
def __init__(
self,
files: tuple[str, ...] = (),
default_branch="main",
default_constraints_branch="constraints-main",
default_branch=AIRFLOW_BRANCH,
default_constraints_branch=DEFAULT_AIRFLOW_CONSTRAINTS_BRANCH,
commit_ref: str | None = None,
pr_labels: tuple[str, ...] = (),
github_event: GithubEvents = GithubEvents.PULL_REQUEST,
Expand Down Expand Up @@ -1039,3 +1041,9 @@ def providers_compatibility_checks(self) -> str:
if check["python-version"] in self.python_versions
]
)

@cached_property
def is_committer_build(self):
if NON_COMMITTER_BUILD_LABEL in self._pr_labels:
return False
return self._github_actor in COMMITTERS
49 changes: 49 additions & 0 deletions dev/breeze/tests/test_selective_checks.py
Original file line number Diff line number Diff line change
Expand Up @@ -1719,3 +1719,52 @@ def test_mypy_matches(
pr_labels=pr_labels,
)
assert_outputs_are_printed(expected_outputs, str(stderr))


@pytest.mark.parametrize(
"files, expected_outputs, github_actor, pr_labels",
[
pytest.param(
("README.md",),
{
"is-committer-build": "false",
"runs-on": '["ubuntu-22.04"]',
},
"",
(),
id="Regular pr",
),
pytest.param(
("README.md",),
{
"is-committer-build": "true",
"runs-on": '["self-hosted", "Linux", "X64"]',
},
"potiuk",
(),
id="Committer regular PR",
),
pytest.param(
("README.md",),
{
"is-committer-build": "false",
"runs-on": '["self-hosted", "Linux", "X64"]',
},
"potiuk",
("non committer build",),
id="Committer regular PR - forcing non-committer build",
),
],
)
def test_pr_labels(
files: tuple[str, ...], expected_outputs: dict[str, str], github_actor: str, pr_labels: tuple[str, ...]
):
stderr = SelectiveChecks(
files=files,
commit_ref="HEAD",
default_branch="main",
github_actor=github_actor,
github_event=GithubEvents.PULL_REQUEST,
pr_labels=pr_labels,
)
assert_outputs_are_printed(expected_outputs, str(stderr))
3 changes: 1 addition & 2 deletions scripts/in_container/install_airflow_and_providers.py
Original file line number Diff line number Diff line change
Expand Up @@ -335,8 +335,7 @@ def find_installation_spec(
)
@click.option(
"--default-constraints-branch",
default="constraints-main",
show_default=True,
required=True,
envvar="DEFAULT_CONSTRAINTS_BRANCH",
help="Default constraints branch to use",
)
Expand Down
14 changes: 7 additions & 7 deletions scripts/in_container/run_generate_constraints.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,10 +30,11 @@
from click import Choice
from in_container_utils import click, console, run_command

AIRFLOW_SOURCES = Path(__file__).resolve().parents[2]
AIRFLOW_SOURCE_DIR = Path(__file__).resolve().parents[2]

DEFAULT_BRANCH = os.environ.get("DEFAULT_BRANCH", "main")
PYTHON_VERSION = os.environ.get("PYTHON_MAJOR_MINOR_VERSION", "3.8")
GENERATED_PROVIDER_DEPENDENCIES_FILE = AIRFLOW_SOURCES / "generated" / "provider_dependencies.json"
GENERATED_PROVIDER_DEPENDENCIES_FILE = AIRFLOW_SOURCE_DIR / "generated" / "provider_dependencies.json"

ALL_PROVIDER_DEPENDENCIES = json.loads(GENERATED_PROVIDER_DEPENDENCIES_FILE.read_text())

Expand Down Expand Up @@ -145,7 +146,7 @@ def install_local_airflow_with_eager_upgrade(
"eager",
],
github_actions=config_params.github_actions,
cwd=AIRFLOW_SOURCES,
cwd=AIRFLOW_SOURCE_DIR,
check=True,
)

Expand Down Expand Up @@ -240,7 +241,7 @@ def uninstall_all_packages(config_params: ConfigParams):
result = run_command(
["pip", "freeze"],
github_actions=config_params.github_actions,
cwd=AIRFLOW_SOURCES,
cwd=AIRFLOW_SOURCE_DIR,
text=True,
check=True,
capture_output=True,
Expand All @@ -253,7 +254,7 @@ def uninstall_all_packages(config_params: ConfigParams):
run_command(
["pip", "uninstall", "--root-user-action", "ignore", "-y", *all_installed_packages],
github_actions=config_params.github_actions,
cwd=AIRFLOW_SOURCES,
cwd=AIRFLOW_SOURCE_DIR,
text=True,
check=True,
)
Expand Down Expand Up @@ -396,8 +397,7 @@ def generate_constraints_no_providers(config_params: ConfigParams) -> None:
)
@click.option(
"--default-constraints-branch",
default="constraints-main",
show_default=True,
required=True,
envvar="DEFAULT_CONSTRAINTS_BRANCH",
help="Branch to get constraints from",
)
Expand Down

0 comments on commit 07fd364

Please sign in to comment.