ML Pipeline for River Flow Estimation

This repository provides a machine learning pipeline designed to estimate mean and low reference flows for Brazilian river stretches. It includes scripts for data collection, preprocessing, model training, and evaluation.


Data Generation Workflow

This section outlines how the dataset was generated. The steps below provide details about each part of the workflow.

Step 1: Data Collection

  • Description: Data was collected using the Google Earth Engine Python API to extract hydrological and environmental metrics for Brazilian river stretches.

  • File: src/data_treatment/gee_data_extract.py

Step 2: Data Pre-processing

  • Description: The raw data was processed using topological information from the Brazilian Hydrography Ottocodified (BHO) to generate features. The scripts are run in the following order:

  • 1. Structure the flow data: src/data_treatment/org_flow.py

  • 2. Aggregate all input data: src/data_treatment/agg_att.py

  • 3. Accumulate the aggregated attributes over each upstream catchment: src/data_treatment/acc_att.py

  • 4. Structure all data for use by the ML models: src/data_treatment/to_ml.py
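The accumulation in item 3 can be pictured with a minimal sketch. It is not the code in acc_att.py: it assumes the river topology is given as an explicit "next reach downstream" mapping (BHO itself encodes topology through Otto codes), and the function name, reach ids, and areas are all hypothetical.

```python
from collections import defaultdict, deque

def accumulate_downstream(downstream, local_values):
    """Sum a local attribute over each reach and every reach upstream of it.

    downstream: dict of reach id -> id of the next reach downstream (None at the outlet).
    local_values: dict of reach id -> value for that reach's own catchment.
    """
    # Count how many reaches drain directly into each reach.
    n_upstream = defaultdict(int)
    for reach, down in downstream.items():
        if down is not None:
            n_upstream[down] += 1

    accumulated = dict(local_values)
    # Start at headwater reaches (nothing upstream) and move downstream.
    queue = deque(r for r in downstream if n_upstream[r] == 0)
    while queue:
        reach = queue.popleft()
        down = downstream[reach]
        if down is None:
            continue
        accumulated[down] += accumulated[reach]
        n_upstream[down] -= 1
        # A reach is complete once all of its tributaries have been added.
        if n_upstream[down] == 0:
            queue.append(down)
    return accumulated

# Toy network (hypothetical): reaches 1 and 2 join into 3, which drains to outlet 4.
downstream = {1: 3, 2: 3, 3: 4, 4: None}
area_km2 = {1: 10.0, 2: 5.0, 3: 2.0, 4: 1.0}
acc_area = accumulate_downstream(downstream, area_km2)
```

Processing reaches in headwater-to-outlet order means each reach is visited once, so the cost grows linearly with the number of reaches. Attributes that are means rather than totals (for example mean precipitation) would need an area-weighted version of the same traversal.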

Step 3: Model Processing

  • Description: Six ML models were trained. For every model, k-fold cross-validation was used to produce predictions at the gauging sites, and predictions at ungauged sites came from a model trained on all gauging data.

  • File: src/process_modelig/model_run.py
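The two prediction modes described in this step can be sketched as follows. This is an illustration only, not the logic of model_run.py: a one-variable least-squares fit stands in for the six ML models, the fold splitter is hand-rolled to keep the example dependency-free, and all names and numbers are made up.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_linear(x, y):
    """Least squares for y = a + b * x (a stand-in for a real ML model)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def cross_val_predict(x, y, k=5, seed=0):
    """Predict each gauged site with a model that never saw that site."""
    predictions = [None] * len(x)
    for test_idx in k_fold_indices(len(x), k, seed):
        held_out = set(test_idx)
        train = [i for i in range(len(x)) if i not in held_out]
        a, b = fit_linear([x[i] for i in train], [y[i] for i in train])
        for i in test_idx:
            predictions[i] = a + b * x[i]
    return predictions

# Synthetic gauged sites: flow proportional to drainage area (made-up numbers).
gauged_area = [float(a) for a in range(1, 21)]
gauged_flow = [0.02 * a for a in gauged_area]
cv_pred = cross_val_predict(gauged_area, gauged_flow, k=5)

# Ungauged sites: fit once on every gauged site, then predict.
a, b = fit_linear(gauged_area, gauged_flow)
ungauged_pred = [a + b * area for area in [25.0, 40.0]]
```

The cross-validated predictions give an out-of-sample estimate of skill at gauged sites, while the final fit uses every observation because no held-out comparison is possible where there is no gauge.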

Step 4: Post-processing

  • Description: The trained models were evaluated, and performance metrics were saved.

  • 1. Evaluate the averaged ensemble combinations: src/process_post/ens_eval.py

  • 2. Apply the best ensemble combination to all data: src/process_post/ens_run.py

  • 3. Uncertainty estimation: src/process_post/unc_run.py

  • 4. Final dataset production: src/process_post/data_gen.py
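One way to read items 1 and 2 is an exhaustive search over model subsets, scoring the plain average of each subset against observations. The sketch below follows that reading under stated assumptions: RMSE as the score, equal weights, and the model names and values are all hypothetical, so it may differ from what ens_eval.py actually does.

```python
from itertools import combinations
from math import sqrt

def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

def best_averaged_ensemble(observed, model_predictions):
    """Score the mean prediction of every model subset and return the best one."""
    best_combo, best_score = None, float("inf")
    names = sorted(model_predictions)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Average the member predictions site by site.
            averaged = [sum(vals) / size
                        for vals in zip(*(model_predictions[m] for m in combo))]
            score = rmse(observed, averaged)
            if score < best_score:
                best_combo, best_score = combo, score
    return best_combo, best_score

# Made-up cross-validated predictions at four gauged sites.
observed = [1.0, 2.0, 3.0, 4.0]
model_predictions = {
    "model_a": [1.5, 2.5, 3.5, 4.5],  # biased high
    "model_b": [0.5, 1.5, 2.5, 3.5],  # biased low
    "model_c": [1.0, 2.0, 3.0, 5.0],  # one large miss
}
best_combo, best_score = best_averaged_ensemble(observed, model_predictions)
```

With six models there are only 63 non-empty subsets, so a brute-force search of this kind is cheap. In the toy data the two oppositely biased models cancel each other, which is why their average beats any single model.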


How to Use

Clone the Repository:

    git clone https://github.com/barbedorafael/ml_pipeline.git
    cd ml_pipeline

Install Dependencies: Install the required libraries with:

    pip install -r requirements.txt


Dataset

The dataset generated by this pipeline will be publicly available on Zenodo. [Link pending]. The dataset includes:

  • Raw input data collected via the pipeline.
  • Processed features used in model training.
  • Output predictions and uncertainty estimates.

Requirements

  • Python 3.10+
  • Google Earth Engine Python API
  • Additional Python libraries (see requirements.txt)

Contributing

Suggestions, bug reports, and contributions are welcome! Open an issue or submit a pull request to improve the workflow.


License

This project is licensed under the MIT License. See the LICENSE file for details.