The data pipeline automates a series of tasks for processing and analyzing geographical and environmental data.
```
.
├── config.yml      # Configuration file containing settings and paths
├── jobs            # Scripts for various jobs to be executed
│   ├── *.exp       # Job scripts (SLURM/BASH executable)
│   └── README.md
├── launcher.sh     # Main script to launch the jobs
├── scripts         # Python scripts that perform the core functionalities of the pipeline
└── utility         # Additional utilities and helper functions
```
Before running the pipeline:
- Ensure that all dependencies specified in `config.yml` are installed and available.
- Set the appropriate paths in the `config.yml` file (a sketch of typical path entries follows this list).
- Check that the necessary data is in place, as specified in `config.yml`.
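The path entries in `config.yml` might look like the following sketch. The directory values are placeholders for illustration only, and the exact nesting may differ in the actual file:

```yaml
# Illustrative sketch only -- replace each path with your own environment.
WORK_DIR: /path/to/workdir            # main working directory (placeholder)
PYTHON_ENV: /path/to/python_env       # Python virtual environment (placeholder)
RAW_DATA: /path/to/raw_data           # raw input data (placeholder)
DATA_DIR: /path/to/processed_data     # processed data output (placeholder)
LOG_DIR: /path/to/logs                # log files (placeholder)
SHAPEFILE: /path/to/amazon_biome.shp  # shapefile with the Amazon biome borders (placeholder)
```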
To execute the pipeline, you can use the `launcher.sh` script. It accepts a mode, either `SLURM` or `BASH`, which dictates how jobs are executed.

```bash
./launcher.sh <MODE>
```

Replace `<MODE>` with `SLURM` if you are running on a cluster with the SLURM workload manager, or `BASH` to execute the scripts directly.
Example:
```bash
./launcher.sh SLURM
```
Note: When using the `SLURM` mode, the launcher submits jobs using the `sbatch` command and monitors their completion before moving to the next job. In `BASH` mode, the jobs are executed directly.
The pipeline consists of several jobs, each dedicated to a specific processing or analysis task. Jobs are located in the `jobs` directory and can be executed individually or as part of the pipeline using the `launcher.sh` script.
The jobs included are:
- standardize_tifs: Standardizes TIF files for processing.
- proximity: Calculates proximity measurements for the data.
- calculate_global_stats: Calculates global statistics for the dataset.
- stack: Stacks layers of data.
- sample_deforestation: Samples deforestation tiles.
- cut_tiles_distributed: Cuts tiles, distributing the computations across different nodes.
You can customize the pipeline by:
- Modifying the `config.yml` for different parameters or paths.
- Adding or removing job scripts in the `jobs` directory.
- Updating the `launcher.sh` script to include or exclude jobs as needed.
The `config.yml` configuration file serves as the central settings repository for the data pipeline. Its global settings are listed below, followed by a sketch of how they might fit together.
- `RESOLUTION`: The spatial resolution of the data, in meters. Set to `30` for 30-meter resolution.
- `WINDOW_SIZE`: The dimension size for splitting datasets into smaller windows. For instance, a window size of `256` creates 256x256-pixel windows.
- `DST_CRS`: The Coordinate Reference System to be used for output datasets, specified as an EPSG code.
- `TARGET_EXTENT`: The geographical extent for output datasets. This specifies the bounding box for processing data.
- `TARGET_VARIABLE`: The variable of interest in the datasets.
- `MODULES`: Lists the software modules required for the pipeline, which will be loaded before processing.
- `WORK_DIR`: The main working directory, where all processing takes place.
- `PYTHON_ENV`: Path to the Python virtual environment to be used.
- `RAW_DATA`: Directory containing the raw input data.
- `DATA_DIR`: Directory where processed data will be stored.
- `SHAPEFILE`: Path to a shapefile used in the processing, in this case specifying the Amazon biome borders.
- `LOG_DIR`: Directory where log files will be stored; useful for tracking progress and debugging.
- `DEFAULT`: Contains default settings for running tasks on a SLURM cluster.
  - `NODES`: The number of nodes to be used for SLURM jobs.
  - `NTASKS`: The number of tasks to run.
  - `CPUS_PER_TASK`: How many CPUs will be used per task.
  - `RUN_TIME`: The maximum allowed runtime for the SLURM job.
  - `MEM_PER_CPU`: Memory allocated per CPU.
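As a rough illustration of how these settings might fit together, here is a minimal sketch assuming a flat layout with `DEFAULT` as a nested mapping. All values are examples only; the actual `config.yml` is authoritative:

```yaml
# Minimal sketch of the global settings -- all values are examples only.
RESOLUTION: 30                        # spatial resolution in meters
WINDOW_SIZE: 256                      # split datasets into 256x256-pixel windows
DST_CRS: "EPSG:4326"                  # output CRS as an EPSG code (example value)
TARGET_EXTENT: [xmin, ymin, xmax, ymax]  # bounding box placeholder
TARGET_VARIABLE: deforestation        # variable of interest (example value)
MODULES:                              # software modules loaded before processing (examples)
  - gdal
  - python

DEFAULT:                              # default SLURM settings
  NODES: 1
  NTASKS: 1
  CPUS_PER_TASK: 4
  RUN_TIME: "04:00:00"                # HH:MM:SS (example)
  MEM_PER_CPU: 4G
```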
Detailed settings and paths for each of the tasks/jobs in the pipeline:
- standardize_tifs: This job processes raw data to standardize TIF files.
- stack_xtest: Stacks different layers of data together.
- sample_deforestation: Samples tiles from areas that have undergone deforestation.
- proximity: Calculates proximity measurements for specified features.
- cut_tiles_distributed: Splits larger datasets into smaller tiles, distributed across resources.
- clip_tifs: Clips TIF files based on a specified shape or boundary.
- calculate_global_stats: Computes global statistics for the datasets over the years specified.
Each job section contains specific settings such as log file locations, input data paths, and output directories.
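The exact keys differ from job to job. As a purely hypothetical illustration (the key names and paths below are assumptions, not taken from the actual file), a job section could look like:

```yaml
# Hypothetical sketch of a per-job section -- key names and paths are assumptions.
standardize_tifs:
  LOG_FILE: /path/to/logs/standardize_tifs.log      # log file location (placeholder)
  INPUT_DIR: /path/to/raw_data                      # input data path (placeholder)
  OUTPUT_DIR: /path/to/processed_data/standardized  # output directory (placeholder)
```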
To ensure that the pipeline runs smoothly, periodically review and update the `config.yml` file, especially when introducing new datasets or changing directory structures.