Data Pipeline

The data pipeline automates a series of tasks related to the processing and analysis of geographical and environmental data.

Directory Structure

.
├── config.yml      # Configuration file containing settings and paths
├── jobs            # Scripts for various jobs to be executed
│   ├── *.exp       # Job scripts (SLURM/BASH executable)
│   └── README.md
├── launcher.sh     # Main script to launch the jobs
├── scripts         # Python scripts that perform the core functionality of the pipeline
└── utility         # Additional utilities and helper functions

Setup

Before running the pipeline:

  1. Ensure that all dependencies specified in the config.yml are installed and available.
  2. Set the appropriate paths in the config.yml file.
  3. Check that the necessary data is in place, as specified in the config.yml.
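
A quick pre-flight check can catch missing paths before any jobs are submitted. The snippet below is only a sketch; the paths are placeholders and should be replaced with the PYTHON_ENV and RAW_DATA values from your config.yml.

# Minimal pre-flight check (paths are placeholders taken from config.yml)
test -f config.yml || echo "config.yml not found"
source /path/to/python_env/bin/activate    # PYTHON_ENV
test -d /path/to/raw_data || echo "RAW_DATA directory is missing"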

Usage

To execute the pipeline, use the launcher.sh script. It accepts a single argument, the execution mode, either "SLURM" or "BASH", which dictates how the jobs are executed.

./launcher.sh <MODE>

Replace <MODE> with SLURM if you're running on a cluster managed by the SLURM workload manager, or with BASH to execute the scripts directly.

Example:

./launcher.sh SLURM

Note: When using the SLURM mode, the launcher will submit jobs using the sbatch command and monitor their completion before moving to the next job. In BASH mode, the jobs are executed directly.
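
The submit-and-wait behaviour can be pictured with the simplified sketch below; the actual logic in launcher.sh may differ, and the job script name is a placeholder.

# Submit a job and block until it leaves the SLURM queue (illustrative only)
job_id=$(sbatch --parsable jobs/my_job.exp)
while squeue -j "$job_id" | grep -q "$job_id"; do
    sleep 30
done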

Jobs

The pipeline consists of several jobs, each dedicated to a specific processing or analysis task. Jobs are located in the jobs directory and can be executed individually or as part of the pipeline using the launcher.sh script.

The jobs included are:

  • standardize_tifs: Standardizes TIF files for processing.
  • proximity: Calculates proximity measurements for the data.
  • calculate_global_stats: Calculates global statistics for the dataset.
  • stack: Stacks layers of data.
  • sample_deforestation: Samples deforestation tiles.
  • cut_tiles_distributed: Cuts tiles, distributing the computations across different nodes.
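
Individual jobs can also be run outside the launcher. The script name below is illustrative only; the actual job scripts are the *.exp files in the jobs directory.

sbatch jobs/standardize_tifs.exp    # submit a single job to SLURM
bash jobs/standardize_tifs.exp      # or run it directly, as in BASH mode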

Customizing the Pipeline

You can customize the pipeline by:

  1. Modifying the config.yml for different parameters or paths.
  2. Adding or removing job scripts in the jobs directory.
  3. Updating the launcher.sh script to include or exclude jobs as needed, as shown in the sketch below.
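
As an illustration of the last point, launchers commonly keep their job list in one place. Assuming launcher.sh defines such a list in a shell array (an assumption; check the script itself for the actual structure), skipping a job could look like this:

# Hypothetical job list inside launcher.sh -- variable name and layout are assumed
JOBS=(
    "standardize_tifs"
    "proximity"
    "calculate_global_stats"
    "stack"
    # "sample_deforestation"    # commented out to skip this job
    "cut_tiles_distributed"
)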

Configuration Details: config.yml

This configuration file serves as the central settings repository for the data pipeline.

GLOBAL Settings:

  • RESOLUTION: The spatial resolution of the data, in meters. Set to 30 for 30-meter resolution.

  • WINDOW_SIZE: The dimension size for splitting datasets into smaller windows. For instance, a window size of 256 would create 256x256 pixel windows.

  • DST_CRS: The Coordinate Reference System to be used for output datasets, specified as an EPSG code.

  • TARGET_EXTENT: The geographical extent for output datasets. This specifies the bounding box for processing data.

  • TARGET_VARIABLE: The variable of interest in the datasets.

  • MODULES: Lists the software modules required for the pipeline, which will be loaded before processing.

  • WORK_DIR: Specifies the main working directory. This is where all processing will take place.

  • PYTHON_ENV: Path to the Python virtual environment to be used.

  • RAW_DATA: Directory containing the raw input data.

  • DATA_DIR: Directory where processed data will be stored.

  • SHAPEFILE: The path to a shapefile used during processing; in this case it specifies the Amazon biome borders.

  • LOG_DIR: Directory where log files will be stored; logs help track progress and aid debugging.
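
Put together, the GLOBAL block might look like the sketch below. The keys mirror the list above, while every value is a placeholder or an assumed example and should be adapted to your setup.

GLOBAL:
  RESOLUTION: 30
  WINDOW_SIZE: 256
  DST_CRS: "EPSG:4326"                        # placeholder EPSG code
  TARGET_EXTENT: [xmin, ymin, xmax, ymax]     # placeholder bounding box
  TARGET_VARIABLE: "deforestation"            # assumed example variable
  MODULES: ["gdal", "python"]                 # assumed module names
  WORK_DIR: "/path/to/work_dir"
  PYTHON_ENV: "/path/to/python_env"
  RAW_DATA: "/path/to/raw_data"
  DATA_DIR: "/path/to/data"
  SHAPEFILE: "/path/to/amazon_biome_border.shp"   # placeholder path
  LOG_DIR: "/path/to/logs"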

SLURM Settings:

  • DEFAULT: Contains default settings for running tasks on a SLURM cluster.
    • NODES: The number of nodes to be used for SLURM jobs.
    • NTASKS: Number of tasks to run.
    • CPUS_PER_TASK: Specifies how many CPUs will be used per task.
    • RUN_TIME: Maximum allowed runtime for the SLURM job.
    • MEM_PER_CPU: Memory allocated per CPU.
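
For example, a DEFAULT block might look like this (the values below are placeholders, not recommendations):

SLURM:
  DEFAULT:
    NODES: 1
    NTASKS: 1
    CPUS_PER_TASK: 8
    RUN_TIME: "04:00:00"
    MEM_PER_CPU: "4G"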

Jobs:

Detailed settings and paths for each of the tasks/jobs in the pipeline:

  • standardize_tifs: This job processes raw data to standardize TIF files.
  • stack_xtest: Stacks different layers of data together.
  • sample_deforestation: Samples tiles from areas that have undergone deforestation.
  • proximity: Calculates proximity measurements for specified features.
  • cut_tiles_distributed: Splits larger datasets into smaller tiles, distributed across resources.
  • clip_tifs: Clips TIF files based on a specified shape or boundary.
  • calculate_global_stats: Computes global statistics for the datasets over the years specified.

Each job section contains specific settings such as log file locations, input data paths, and output directories.
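
A per-job section might therefore look like the sketch below; the key names are assumptions chosen for illustration, so consult the actual config.yml for the exact structure.

standardize_tifs:
  LOG_FILE: "/path/to/logs/standardize_tifs.log"
  INPUT_DIR: "/path/to/raw_data"
  OUTPUT_DIR: "/path/to/data/standardized"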


To ensure that the pipeline runs smoothly, review and update the config.yml file periodically, especially when introducing new datasets or changing the directory structure.