Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality 
Group Project for the Spatiotemporal Exposures and Toxicology group with help from friends 😃 🤠 🌎
remotes::install_github("NIEHS/beethoven")
`beethoven` is a `targets` reproducible analysis pipeline with the following workflow.
Version 0.4.4 of `beethoven` has stable `targets` for downloading data files, calculating features at AQS sites, and merging to a base learner-ready data.table (`dt_feat_calc_xyt`). Ongoing changes relate to calculating features for the prediction grid, computationally managing the prediction grid, base learner hyperparameter tuning, and meta learner function development.
targets::tar_visnetwork()
Here, we describe the structure of the repository, important files, and the `targets` object naming conventions.
- `R/` is where the `beethoven` functions are stored. Only ".R" files should be in this folder (e.g., `targets` helpers, post-processing, model fitting functions).
- `inst/` is a directory for arbitrary files outside of the main `R/` directory.
  - `targets/` is a sub-directory within `inst/` which contains the pipeline files (e.g., "targets_aqs.R"). These files declare the `targets::tar_target` objects which constitute the `beethoven` pipeline.
- `tests/` stores unit and integration tests (`testthat/`) and test data (`testdata/`) according to the testthat package's standard structure for unit testing. `testthat.R` is created and maintained by `testthat`, and is not to be edited manually.
- `container/` stores definition files and build scripts to build covariate- and model-specific Apptainer container images (`container_covariates.def` and `container_models.def`).
- `man/` contains function documentation files (".Rd") which are generated by the roxygen2 package. These files are not to be edited manually.
- `vignettes/` contains ".Rmd" narrative text and code files. These are rendered by pkgdown into the Articles section of the `beethoven` webpage.
- `.github/workflows/` is a hidden directory which stores the GitHub CI/CD "yaml" files.
- `tools/` is dedicated to educational or demonstration material (e.g. Rshiny), but is not excluded from the package build.
- `_targets.R` configures `targets` settings, creates computational resource controllers, and structures the `beethoven` pipeline.
  - To run `beethoven`, users must review and update the following parameters for their user profile and computing system:
    - `controller_*`: Ensure the local controllers do not request more CPUs than are available on your machine or high performance system.
    - `#SBATCH --partition`: Utilization of NVIDIA GPUs (within `glue::glue` command)
    - `--bind /USER_PATH_TO_INPUT/input:/input` (within `glue::glue` command)
- `_targets.yaml` is created and updated by running `targets::tar_make` and is not to be edited manually.
- `run.sh` submits separate `SBATCH` jobs for the covariate, CPU- and GPU-enabled base learner, and meta learner `targets` (see `/inst/scripts/`). This setup ensures that each stage utilizes the proper container image and computational resources. To run `beethoven`, users must review and update the following parameters for their user profile and computing system in each of the `inst/scripts/run_*` files:
  - `#SBATCH --mail-user`
  - `#SBATCH --partition`
  - `#SBATCH --mem`
  - `#SBATCH --cpus-per-task`
  - `--bind /USER_PATH_TO_INPUT/input:/input`
  - `--bind /USER_PATH_TO_SLURM/slurm:/USER_PATH_TO_SLURM/slurm`
The `beethoven` pipeline is configured for SLURM with defaults for NIEHS HPC settings. To adapt the settings to your environment, consult the documentation for your platform and edit the requested resources in the stage-specific run files (`/inst/scripts/`, lines 3-11) and `_targets.R` (lines 41-45; individual `crew` and `crew.cluster` controller workers).
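When editing the requested resources described above, it can help to compare a planned CPU request against what the machine reports. The following is a minimal sketch, not part of `beethoven`: `check_cpu_request` is a hypothetical helper, and it assumes that `nproc` (GNU coreutils) or `getconf _NPROCESSORS_ONLN` is available on your system.

```shell
# Hypothetical helper (not part of beethoven): compare a planned CPU request
# against an available CPU count.
# Usage: check_cpu_request REQUESTED [AVAILABLE]
# If AVAILABLE is omitted, it is detected with nproc or getconf (assumption:
# at least one of these commands exists on your system).
check_cpu_request() {
  requested=$1
  available=${2:-}
  if [ -z "$available" ]; then
    available=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
  fi
  if [ "$requested" -le "$available" ]; then
    echo "ok: requesting $requested of $available CPUs"
  else
    echo "too many: requesting $requested but only $available CPUs available"
    return 1
  fi
}
```

For example, `check_cpu_request 4 8` reports that the request fits, while `check_cpu_request 16 8` returns a non-zero status. On an HPC system the detected count describes the node where the command runs (often a login node), so pass the partition's CPU count explicitly as the second argument when checking a SLURM request.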
There are 5 "critical" `targets` that users may want to change to run `beethoven`.
- `chr_daterange`
  - Controls all time-related targets for the entire pipeline. This is the only `target` that needs to be changed to update the pipeline with a new temporal range. Month- and year-specific arguments are derived from the time range defined by `chr_daterange`.
- `chr_nasa_token`
  - Sets the file path to the user's NASA Earthdata account credentials. These credentials expire at ~90 day intervals and therefore must be updated regularly.
- `chr_mod06_links`
  - The file path to the MOD06 links file. These links must be manually downloaded per the `amadeus::download_modis` function. The links are then stored in a CSV file that is read by the function. The new file with links must be updated to match the new date range.
- `chr_input_dir`
  - The file path to the input directory. This target controls where the raw data files are downloaded to and imported from. This file path must be mounted to the container at run time in the `run.sh` script.
- `num_dates_split`
  - Controls the size of temporal splits. Splitting the temporal range into smaller chunks allows for parallel processing across multiple workers. It also allows for dispatching new dynamic branches when the temporal range is updated.
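The chunking idea behind `num_dates_split` can be illustrated with plain arithmetic. The sketch below is not `beethoven` code, and the package's own splitting logic may differ: `split_days` is a hypothetical helper that divides a run of consecutive days into chunks of a fixed maximum size, using 1-based day offsets instead of real dates.

```shell
# Hypothetical illustration (not beethoven code): split N_DAYS consecutive days
# into chunks of at most SPLIT_SIZE days and print each chunk as
# "chunk <index>: day <first>-<last>" using 1-based day offsets.
split_days() {
  n_days=$1
  split_size=$2
  # Guard against a non-positive chunk size, which would never advance.
  if [ "$split_size" -lt 1 ]; then
    echo "split size must be at least 1" >&2
    return 1
  fi
  start=1
  chunk=1
  while [ "$start" -le "$n_days" ]; do
    end=$((start + split_size - 1))
    # The final chunk may be shorter than split_size.
    if [ "$end" -gt "$n_days" ]; then
      end=$n_days
    fi
    echo "chunk $chunk: day $start-$end"
    start=$((end + 1))
    chunk=$((chunk + 1))
  done
}
```

Running `split_days 10 4` prints three chunks (days 1-4, 5-8, and 9-10). Extending the range to 14 days leaves the first two chunks unchanged, lengthens the third, and adds a fourth, which is the property that allows earlier chunks to be reused when the temporal range grows.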
The current implementation of `beethoven` utilizes Apptainer images to run the pipeline with consistent package versions and custom installations. Users must build these images before running `beethoven`.
cd container/ # must be working in the `container/` directory
sh build_container_covariates.sh # build "covariates" stage image
sh build_container_models.sh # build "models" image
mv *sif ../ # move images to `beethoven/` root directory
Note: `.sif` files are omitted from GitHub due to size (>5 GB each).
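Because the images are not distributed with the repository, a quick check that they were built and moved can catch a missing image before a job is submitted. This is a minimal sketch with a hypothetical helper, `count_sif_images`, that is not part of `beethoven`; it assumes a `find` that supports `-maxdepth` (GNU and BSD versions do).

```shell
# Hypothetical helper (not part of beethoven): count Apptainer image files
# (*.sif) located directly inside a directory (default: current directory).
count_sif_images() {
  dir=${1:-.}
  find "$dir" -maxdepth 1 -type f -name '*.sif' | wc -l | tr -d ' '
}
```

Run from the project root, `count_sif_images .` should print `2` if each of the two build scripts above produced one image and both were moved.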
After switching back to the project root directory, users can run the pipeline with the `run.sh` shell script. The following lines of `/inst/scripts/run_*.sh` must be updated with user-specific settings before running the pipeline:
#SBATCH --mail-user=[USER_EMAIL] # email address for job notifications
#SBATCH --partition=[PARTITION_NAME] # HPC partition to run on
#SBATCH --mem=[###G] # Total memory for the job
#SBATCH --cpus-per-task=[###] # Total CPUs for the job
...
--bind [USER_INPUT_DIRECTORY]/input:/input \
...
--bind [USER_SYSTEM_PATH/munge]:/run/munge \
--bind [USER_SYSTEM_PATH/slurm]:[USER_SYSTEM_PATH/slurm] \
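If you keep the bracketed placeholders shown above in a script until you have filled them in, a small search can flag any that remain. The sketch below is an illustration only: `find_placeholders` is a hypothetical helper that is not part of `beethoven`, and the pattern it searches for is an assumption based on the placeholder names in this README.

```shell
# Hypothetical helper (not part of beethoven): list lines in a script that
# still contain bracketed template placeholders such as [USER_EMAIL],
# [PARTITION_NAME], or [###G]. The pattern is an assumption based on the
# placeholder names used in this README.
# Returns 0 when no placeholders remain, 1 otherwise.
find_placeholders() {
  script=$1
  if grep -n -E '\[(USER_|PARTITION_NAME|###)[^]]*\]' "$script"; then
    return 1
  fi
  return 0
}
```

A loop such as `for f in inst/scripts/run_*.sh; do find_placeholders "$f" || echo "edit $f"; done` prints each offending line with its line number, followed by the name of the file that still needs editing.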
Once configured, the pipeline can be run with a SLURM batch job.
cd ../ # assuming still in the `container/` directory
sbatch run.sh
The SLURM batch job can also be submitted from an R session with the `batch` helper function.
source("R/helpers.R")
batch()
The Developer's Guide provides detailed instructions for how to develop or update `beethoven` settings or individual `targets` objects.
To contribute developments or modifications, open a pull request into the `dev` branch with a detailed description of the proposed changes. Pull requests must pass all status checks, and then will be approved or rejected by `beethoven`'s authors.
Utilize Issues to notify the authors of bugs, questions, or recommendations. Identify each issue with the appropriate label to help ensure a timely response.