Customizing parts of the package build process

Fromager supports customizing most aspects of the build process, including acquiring the source, applying local patches, passing arguments to the build, and providing a custom build command.

Variants

It is frequently necessary to build the same packages in different ways, especially for binary wheels with code compiled for different operating systems or hardware architectures. In fromager, the set of build configuration options for each scenario is called a "variant".

The default variant is cpu, to differentiate from variants for hardware accelerators.

A variant name can be any string, but since the variant name shows up in filesystem paths it is often easier to avoid including whitespace.

Set the variant using the --variant command line option or the FROMAGER_VARIANT environment variable (for ease of use in containers).
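
For example, either of these hypothetical invocations builds the cuda variant (the bootstrap command and package name here are placeholders):

$ fromager --variant cuda bootstrap vllm
$ FROMAGER_VARIANT=cuda fromager bootstrap vllm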

Build environment variables

Most Python packages with configurable builds use environment variables to pass build parameters. The environment variables set when fromager runs are passed to the build process for each wheel. Sometimes this does not provide sufficient opportunity for customization, so fromager also supports setting build variables for each package using environment files. The --envs-dir command line argument specifies a directory with environment files for passing variables to the builds. The default is overrides/envs.

Environment files are named using the canonical distribution name and the suffix .env.

Settings common to all variants of a given package can be placed in an environment file at the root of the --envs-dir.

The most common reason for passing different environment variables is to support different build variants, such as to enable different hardware accelerators. Therefore, the environment file directory is organized with a subdirectory per variant. Settings in the variant-specific file override and extend settings in the base directory.

$ tree overrides/envs/
overrides/envs/
├── flash_attn.env
├── vllm.env
├── cpu
│   └── vllm.env
├── cuda
│   ├── flash_attn.env
│   └── llama_cpp_python.env
└── test
    └── testenv.env

Environment files use shell-like syntax to set variables: the variable name, followed by =, followed by the value. Whitespace around the three parts is ignored, but whitespace inside the value is preserved. It is not necessary to quote values, and quotes inside a value are passed through as-is.

Environment files support the simple parameter expansions $NAME and ${NAME}. Values are taken from previous lines first, then from the process environment. Subshell expressions like $(cmd) and extended parameter expansions like ${NAME:-default} are not implemented.

$ cat overrides/envs/cuda/llama_cpp_python.env
LLAMA_NATIVE=off
CMAKE_ARGS=-DLLAMA_CUBLAS=on -DCMAKE_CUDA_ARCHITECTURES=all-major -DLLAMA_NATIVE=${LLAMA_NATIVE}
CFLAGS=-mno-avx
FORCE_CMAKE=1
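
In this example, ${LLAMA_NATIVE} in the CMAKE_ARGS value expands to off, taken from the previous line, so the build receives -DLLAMA_NATIVE=off.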

Patching source

The --patches-dir command line argument specifies a directory containing patches to be applied after the source code is in place and before evaluating any further dependencies or building the project. The default directory is overrides/patches.

Patch files should be placed in a subdirectory matching the source directory name and use the suffix .patch. The filenames are sorted lexicographically, so any text between the prefix and suffix can be used to ensure the patches are applied in a specific order.

Patches are applied by running patch -p1 < filename from the root of the source tree.

$ ls -1 overrides/patches/*
clarifai-10.2.1/fix-sdist.patch
flash_attn-2.5.7/pyproject-toml.patch
jupyterlab_pygments-0.3.0/pyproject-remove-jupyterlab.patch
ninja-1.11.1.1/wrap-system-ninja.patch
pytorch-v2.2.1/001-remove-cmake-build-requirement.patch
pytorch-v2.2.1/002-dist-info-no-run-build-deps.patch
pytorch-v2.2.1/003-fbgemm-no-maybe-uninitialized.patch
pytorch-v2.2.1/004-fix-release-version.patch
pytorch-v2.2.2/001-remove-cmake-build-requirement.patch
pytorch-v2.2.2/002-dist-info-no-run-build-deps.patch
pytorch-v2.2.2/003-fbgemm-no-maybe-uninitialized.patch
pytorch-v2.2.2/004-fix-release-version.patch
xformers-0.0.26.post1/pyproject.toml.patch

Note: A legacy patch organization is also supported, with patches placed directly in the patches directory (not in subdirectories) and filenames prefixed with the source directory name. The newer format, using subdirectories, is preferred because it avoids name collisions between variant source trees.
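
For example, the first pytorch patch shown above would be named something like this in the legacy layout (a hypothetical illustration of the prefix convention):

$ ls -1 overrides/patches
pytorch-v2.2.1-001-remove-cmake-build-requirement.patch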

Override plugins

For more complex customization requirements, create an override plugin.

Plugins are registered using entry points so they can be discovered and loaded at runtime. In pyproject.toml, configure the entry point in the project.entry-points."fromager.project_overrides" namespace to link the canonical distribution name to an importable module.

[project.entry-points."fromager.project_overrides"]
flit_core = "package_plugins.flit_core"
pyarrow = "package_plugins.pyarrow"
torch = "package_plugins.torch"
triton = "package_plugins.triton"

The plugins are treated as providing overriding implementations of functions that have default implementations, so a plugin only needs to implement the functions required to make its package build.
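
For example, a plugin module can implement just one hook and rely on the defaults for everything else. This sketch assumes the package_plugins module layout from the entry point configuration above and uses the expected_source_archive_name() hook described below:

# package_plugins/flit_core.py (hypothetical module; the filename is illustrative)
def expected_source_archive_name(req, dist_version):
    # Only this hook is overridden; all other build steps use the defaults.
    return f"flit_core-{dist_version}.tar.gz"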

download_source

The download_source() function is responsible for resolving a requirement and acquiring the source for that version of a package. The default is to use pypi.org to resolve the requirement and then download the sdist file for the package.

The arguments are the WorkContext, the Requirement being evaluated, and the URL to the server where sdists may be found.

The return value should be a tuple containing the location where the source archive file was written and the version that was resolved.

The function may be invoked more than once if multiple sdist servers are being used. Returning a valid response prevents multiple invocations.

def download_source(ctx, req, sdist_server_url):
    ...
    return source_filename, version
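
A minimal override might pin a known release URL and download the archive directly. This is only a sketch: the version pin, the URL, and the use of ctx.work_dir as the download destination are assumptions, not fromager's default behavior (which resolves the requirement against a package index).

import pathlib
import urllib.request

from packaging.version import Version


def download_source(ctx, req, sdist_server_url):
    # Hypothetical: pin a single known-good version instead of resolving it.
    version = Version("2.2.1")
    url = f"https://example.com/releases/{req.name}-{version}.tar.gz"
    # Writing into ctx.work_dir is an assumption made for this sketch.
    source_filename = pathlib.Path(ctx.work_dir) / f"{req.name}-{version}.tar.gz"
    urllib.request.urlretrieve(url, str(source_filename))
    return source_filename, version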

get_resolver_provider

The get_resolver_provider() function allows an override to change the way requirement specifications are converted to fixed versions. The default implementation looks for published versions on a Python package index. Most overrides do not need to implement this hook unless they are building versions of packages not released to https://pypi.org.

For examples, refer to fromager.resolver.PyPIProvider and fromager.resolver.GitHubTagProvider.

The arguments are the WorkContext, the Requirement being evaluated, a boolean indicating whether source distributions should be included, a boolean indicating whether built wheels should be included, and the URL for the sdist server.

The return value must be an instance of a class that implements the resolvelib.providers.AbstractProvider API.

The expectation is that a download_source() override will call sources.resolve_dist(), which calls get_resolver_provider(); the result of the resolution is then passed back to download_source() as a tuple of URL and version. The provider can therefore use any value as the "URL" that helps it decide what to download. For example, the GitHubTagProvider returns the actual tag name, in case it differs from the version number encoded within that tag name.

def get_resolver_provider(ctx, req, include_sdists, include_wheels, sdist_server_url):
    ...

The GenericProvider is a convenient base class; it can also be instantiated directly when given a version_source callable that returns an iterable of version values as str or Version objects.

import typing

from fromager import resolver

VERSION_MAP = {'1.0': 'first-release', '2.0': 'second-release'}


def _version_source(
    identifier: str,
    requirements: resolver.RequirementsMap,
    incompatibilities: resolver.CandidatesMap,
) -> typing.Iterable[str]:
    # GenericProvider accepts version values as str or Version objects.
    return VERSION_MAP.keys()


def get_resolver_provider(ctx, req, include_sdists, include_wheels, sdist_server_url):
    return resolver.GenericProvider(version_source=_version_source, constraints=ctx.constraints)

prepare_source

The prepare_source() function is responsible for setting up a tree of source files in a format that is ready to be built. The default implementation unpacks the source archive and applies patches.

The arguments are the WorkContext, the Requirement being evaluated, the Path to the source archive, and the version.

The return value should be the Path to the root of the source tree, ideally inside the ctx.work_dir directory.

def prepare_source(ctx, req, source_filename, version):
    ...
    return output_dir_name
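
A minimal override might unpack the archive with the standard library. This is only a sketch: it assumes a tar archive that expands to a single top-level <name>-<version> directory, which is typical for sdists, and it omits the patch application the default implementation performs.

import pathlib
import tarfile


def prepare_source(ctx, req, source_filename, version):
    work_dir = pathlib.Path(ctx.work_dir)
    with tarfile.open(source_filename) as archive:
        # Assumes the archive expands to a <name>-<version> directory.
        archive.extractall(work_dir)
    return work_dir / f"{req.name}-{version}"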

expected_source_archive_name

The expected_source_archive_name() function is used to re-discover a source archive downloaded by a previous step, especially if the filename does not match the standard naming scheme for an sdist.

The arguments are the Requirement being evaluated and the version to look for.

The return value should be a string with the base filename (no paths) for the archive.

def expected_source_archive_name(req, dist_version):
    return f'apache-arrow-{dist_version}.tar.gz'

expected_source_directory_name

The expected_source_directory_name() function is used to re-discover the location of a source tree prepared by a previous step, especially if the name does not match the standard naming scheme for an sdist.

The arguments are the Requirement being evaluated and the version to look for.

The return value should be a string with the name of the source root directory relative to the ctx.work_dir where it was prepared.

def expected_source_directory_name(req, dist_version):
    return f'apache-arrow-{dist_version}/arrow-apache-arrow-{dist_version}'

get_build_system_dependencies

The get_build_system_dependencies() function should return the PEP 517 build dependencies for a package.

The arguments are the WorkContext, the Requirement being evaluated, and the Path to the root of the source tree.

The return value is an iterable of requirement specification strings for build system dependencies for the package. The caller is responsible for evaluating the requirements with the current build environment settings to determine if they are actually needed.

# pyarrow.py
def get_build_system_dependencies(ctx, req, sdist_root_dir):
    # The _actual_ directory with our requirements is different than
    # the source root directory detected for the build because the
    # source tree doesn't just include the python package.
    return dependencies.default_get_build_system_dependencies(
        ctx, req,
        sdist_root_dir / 'python',
    )

get_build_backend_dependencies

The get_build_backend_dependencies() function should return the PEP 517 build backend dependencies for a package.

The arguments are the WorkContext, the Requirement being evaluated, and the Path to the root of the source tree.

The return value is an iterable of requirement specification strings for build backend dependencies for the package. The caller is responsible for evaluating the requirements with the current build environment settings to determine if they are actually needed.

# pyarrow.py
def get_build_backend_dependencies(ctx, req, sdist_root_dir):
    # The _actual_ directory with our requirements is different than
    # the source root directory detected for the build because the
    # source tree doesn't just include the python package.
    return dependencies.default_get_build_backend_dependencies(
        ctx, req,
        sdist_root_dir / 'python',
    )

get_build_sdist_dependencies

The get_build_sdist_dependencies() function should return the PEP 517 dependencies for building the source distribution for a package.

The arguments are the WorkContext, the Requirement being evaluated, and the Path to the root of the source tree.

The return value is an iterable of requirement specification strings for the dependencies needed to build the source distribution for the package. The caller is responsible for evaluating the requirements with the current build environment settings to determine if they are actually needed.

# pyarrow.py
def get_build_sdist_dependencies(ctx, req, sdist_root_dir):
    # The _actual_ directory with our requirements is different than
    # the source root directory detected for the build because the
    # source tree doesn't just include the python package.
    return dependencies.default_get_build_sdist_dependencies(
        ctx, req,
        sdist_root_dir / 'python',
    )

build_sdist

The build_sdist() function is responsible for creating a new source distribution from the prepared source tree and placing it in ctx.sdists_build.

The arguments are the WorkContext, the Requirement being evaluated, and the Path to the root of the source tree.

The return value is the Path to the newly created source distribution.

# pyarrow.py
def build_sdist(ctx, req, sdist_root_dir):
    return sources.build_sdist(
        ctx, req,
        sdist_root_dir / 'python',
    )

build_wheel

The build_wheel() function is responsible for creating a wheel from the prepared source tree and placing it in ctx.wheels_build. The default implementation invokes pip wheel in a temporary directory, passing the path to the source tree as an argument.

The arguments are the WorkContext, the Path to a virtualenv prepared with the build dependencies, a dict with extra environment variables to pass to the build, the Requirement being evaluated, and the Path to the root of the source tree.

The return value is ignored.

def build_wheel(ctx, build_env, extra_environ, req, sdist_root_dir):
    ...
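
A sketch mirroring the documented default behavior follows. The pip executable path assumes a POSIX virtualenv layout, and the specific pip flags are illustrative rather than the exact ones fromager uses.

import os
import subprocess
import tempfile


def build_wheel(ctx, build_env, extra_environ, req, sdist_root_dir):
    env = dict(os.environ)
    env.update(extra_environ)
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Run pip from the prepared virtualenv (bin/ layout assumed).
        subprocess.run(
            [
                str(build_env / "bin" / "pip"),
                "wheel",
                "--no-deps",
                "--wheel-dir", str(ctx.wheels_build),
                str(sdist_root_dir),
            ],
            cwd=tmp_dir,
            env=env,
            check=True,
        )
    # The return value is ignored.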

add_extra_metadata_to_wheels

The add_extra_metadata_to_wheels() function is responsible for returning any data the user would like to include in the wheels that fromager builds. This data is added to the fromager-build-settings file under the .dist-info directory of the wheel. That file already contains the settings used to build the package.

The arguments are the WorkContext, the Requirement being evaluated, the resolved Version of that requirement, a dict with extra environment variables, the Path to the root directory of the source distribution, and the Path to the .dist-info directory of the wheel.

The return value must be a dict; any other return type is ignored.
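
For example, a hook might record build provenance. The argument names in this sketch are illustrative; fromager passes the values described above, and the returned dict contents are entirely up to the user:

def add_extra_metadata_to_wheels(
    ctx, req, version, extra_environ, sdist_root_dir, dist_info_dir
):
    # Hypothetical provenance data; any dict can be returned.
    return {
        "builder": "example-ci",
        "source_root": str(sdist_root_dir),
    }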

Canonical distribution names

The Python packaging ecosystem is flexible in how source distribution, wheel, and Python package names are represented in filenames and requirement specifications. To standardize and ensure that build customizations are recognized correctly, fromager always uses a canonical form of the name, computed using packaging.utils.canonicalize_name() and then replacing hyphens (-) with underscores (_). For convenience, the canonicalize command prints the correct form of a name.

$ tox -e cli -- canonicalize flit-core
flit_core
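
The same transformation can be computed in Python; fromager_name here is a hypothetical helper, not part of the fromager API:

from packaging.utils import canonicalize_name


def fromager_name(dist_name: str) -> str:
    # Canonicalize, then replace hyphens with underscores as described above.
    return canonicalize_name(dist_name).replace("-", "_")


print(fromager_name("flit-core"))  # flit_core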

Process hooks

Fromager supports plugging in Python hooks to be run after build events.

post_build

The post_build hook runs after a wheel is successfully built and can be used to publish that wheel to a package index or take other post-build actions.

Configure a post_build hook in your pyproject.toml like this:

[project.entry-points."fromager.hooks"]
post_build = "package_plugins.module:function"

The input arguments to the post_build hook are the WorkContext, Requirement being built, the distribution name and version, and the sdist and wheel filenames.

NOTE: The files should not be renamed or moved.

import logging
import pathlib

from packaging.requirements import Requirement

from fromager import context  # module path assumed for the annotation below

logger = logging.getLogger(__name__)


def post_build(
    ctx: context.WorkContext,
    req: Requirement,
    dist_name: str,
    dist_version: str,
    sdist_filename: pathlib.Path,
    wheel_filename: pathlib.Path,
):
    logger.info(
        f"{req.name}: running post build hook for {sdist_filename} and {wheel_filename}"
    )

Customizations using settings.yaml

To download sources from a predefined URL, instead of overriding the entire download_source function, a mapping of package to download source URL can be provided directly in settings.yaml. Optionally, the downloaded sdist can be renamed. Both the URL and the destination filename support templating. The only supported template variables are:

  • version - replaced by the version returned by the resolver
  • canonicalized_name - replaced by the canonicalized name of the package specified in the requirement, specifically the result of canonicalize_name(req.name)

The source distribution index server used by the package resolver can be overridden for a particular package. The resolver can also be told whether to include wheels or sdists while resolving the package. Templating is not supported here.

A build_dir field can also be defined to tell fromager where the package should be built, relative to the sdist root directory.

packages:
    torch:
        download_source:
            url: "https://github.com/pytorch/pytorch/releases/download/v${version}/pytorch-v${version}.tar.gz"
            destination_filename: "${canonicalized_name}-${version}.tar.gz"
        resolver_dist:
            sdist_server_url: "https://pypi.org/simple"
            include_wheels: true
            include_sdists: false
        build_dir: ""  # directory name relative to the sdist root; the default empty string means the sdist root itself

Customizations using package-specific settings files

In addition to the unified settings file, package settings can be saved in their own file in the directory specified with --settings-dir. Files should be named using the canonicalized form of the package name with the suffix .yaml. The data from the file is treated as though it appears in the packages.package_name section of the main settings file. Files are read in lexicographical order. Settings in package-specific files override all settings in the global file.

# settings/torch.yaml
download_source:
    url: "https://github.com/pytorch/pytorch/releases/download/v${version}/pytorch-v${version}.tar.gz"
    destination_filename: "${canonicalized_name}-${version}.tar.gz"
resolver_dist:
    sdist_server_url: "https://pypi.org/simple"
    include_wheels: true
    include_sdists: false
build_dir: ""  # directory name relative to the sdist root; the default empty string means the sdist root itself

Speeding up build-sequence

The build-sequence command can be sped up by passing the --skip-existing option along with the URL of the wheel index that hosts previously built wheels. When running in skip-existing mode, build-sequence checks the index to see whether the version of the wheel it is about to build already exists. If it does, that package is not rebuilt.

However, there are cases where a package needs to be rebuilt even though its version has not changed, for example after a change in plugin code or environment variables, or when additional patches are applied to the sdist. For such situations, the user can define a changelog in the settings file for that particular version of the package:

# settings/torch.yaml
changelog:
    "2.3.1":
        - Changed env variable MAX_JOBS to 10

Changelogs help fromager control the build tag of a wheel. If the version matches but the build tag does not, the package is rebuilt with the new build tag, even in skip-existing mode. By default, all packages are built with build tag 0 when no changelog is specified. Build tags are added to every wheel fromager builds, regardless of the command; wheels built by the bootstrap and step commands include build tags as well. The order and content of the changelog entries are irrelevant to fromager and only help the user keep track of why a new build tag was needed; only the number of entries matters. The user is responsible for maintaining the correct number of entries, and decreasing the number of entries may result in an error when using build-sequence in skip-existing mode.
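
For example, adding a second entry for the same version would presumably bump the build tag again, since only the entry count matters (a hypothetical continuation of the example above; the entry text is free-form):

# settings/torch.yaml
changelog:
    "2.3.1":
        - Changed env variable MAX_JOBS to 10
        - Applied an additional patch to the sdist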

Sometimes it may be necessary to rebuild all packages even in skip-existing mode, for example after wholesale changes to the build environment or a breaking change in a core build dependency such as setuptools. Instead of defining a changelog for every package, the user can define a global changelog, which bumps the build tag for all packages:

# global settings.yaml
changelog:
    variant:
        - Changed env variable MAX_JOBS to 10