ML Pipeline for River Flow Estimation

This repository provides a machine learning pipeline designed to estimate mean and low reference flows for Brazilian river stretches. It includes scripts for data collection, preprocessing, model training, and evaluation.


Data Generation Workflow

This section outlines how the dataset was generated. The steps below provide details about each part of the workflow.

Step 1: Data Collection

  • Description: Data was collected using the Google Earth Engine Python API to extract hydrological and environmental metrics for Brazilian river stretches.

  • File: src/data_treatment/gee_data_extract.py
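
Running gee_data_extract.py requires Earth Engine authentication, but the general shape of its output can be illustrated: reduceRegions-style results arrive as a FeatureCollection dictionary whose feature properties hold the zonal statistics. A minimal, stdlib-only sketch of flattening such a result into CSV (the property names below are placeholders, not the repository's actual schema):

```python
import csv
import io

# Example of what a GEE `reduceRegions(...).getInfo()` result looks like: a
# FeatureCollection dict whose features carry the reduced metrics as properties.
# Property names here are illustrative placeholders, not the repository's schema.
fc = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"reach_id": "001", "mean_precip": 1450.2, "mean_elev": 312.0}},
        {"type": "Feature", "properties": {"reach_id": "002", "mean_precip": 1380.7, "mean_elev": 128.5}},
    ],
}

def features_to_csv(fc):
    """Flatten each feature's properties into one CSV row per river reach."""
    rows = [f["properties"] for f in fc["features"]]
    header = sorted(rows[0])
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=header)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(features_to_csv(fc))
```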

Step 2: Data Pre-processing

  • Description: The raw data was processed using topological information from the Brazilian Ottocoded Hydrography Base (BHO) to generate features. The scripts were run in the following order:

  • 1. Structure the flow data: src/data_treatment/org_flow.py

  • 2. Aggregate all input data: src/data_treatment/agg_att.py

  • 3. Accumulate the aggregated attributes over each catchment's upstream area: src/data_treatment/acc_att.py

  • 4. Structure all the data for use by the ML models: src/data_treatment/to_ml.py
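
The four scripts above can be run in order with a small driver; this is an illustrative sketch, not a script shipped with the repository:

```python
import subprocess

# Preprocessing scripts, in the order the README describes.
STEPS = [
    "src/data_treatment/org_flow.py",
    "src/data_treatment/agg_att.py",
    "src/data_treatment/acc_att.py",
    "src/data_treatment/to_ml.py",
]

def run_pipeline(steps=STEPS, dry_run=False):
    """Run each script in order, stopping at the first failure (check=True)."""
    for script in steps:
        if dry_run:
            print(f"would run: python {script}")
            continue
        subprocess.run(["python", script], check=True)

run_pipeline(dry_run=True)
```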

Step 3: Model Processing

  • Description: Six ML models were trained and evaluated. K-fold cross-validation was used at the gauging sites, and all gauging data was used to generate predictions at the ungauged sites, for every model.

  • File: src/process_modelig/model_run.py
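
A minimal sketch of the K-fold split used at the gauging sites (fold construction only; the six models themselves are fit inside model_run.py, and the function name here is illustrative, not the repository's):

```python
import random

def kfold_indices(n, k=5, seed=42):
    """Shuffle site indices once, then carve out k disjoint test folds.

    Returns a list of (train_indices, test_indices) pairs; the test folds
    partition all n gauging sites.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]

# Each model would be fit on `train` and scored on `test`, fold by fold.
for train, test in kfold_indices(20, k=5):
    pass  # e.g. model.fit(X[train], y[train]); model.score(X[test], y[test])
```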

Step 4: Post-processing

  • Description: The trained models were evaluated, and performance metrics were saved. The scripts were run in the following order:

  • 1. Evaluation of averaged ensemble combinations: src/process_post/ens_eval.py

  • 2. Application of the best ensemble combination to all data: src/process_post/ens_run.py

  • 3. Uncertainty estimation: src/process_post/unc_run.py

  • 4. Final dataset production: src/process_post/data_gen.py
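
How an averaged ensemble combination can be selected, as a stdlib-only sketch (the function names and the exhaustive subset search are assumptions for illustration, not necessarily what ens_eval.py does):

```python
import math
from itertools import combinations

def rmse(pred, obs):
    """Root-mean-square error between predictions and observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def best_average_ensemble(preds, obs):
    """Average member predictions for every subset of models; keep the lowest-RMSE subset.

    preds maps model name -> list of predictions aligned with obs.
    """
    names = sorted(preds)
    best = None
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            avg = [sum(preds[m][i] for m in combo) / len(combo) for i in range(len(obs))]
            score = rmse(avg, obs)
            if best is None or score < best[1]:
                best = (combo, score)
    return best

# Two oppositely biased models: their average cancels the bias.
obs = [1.0, 2.0, 3.0]
preds = {"bias_high": [1.2, 2.2, 3.2], "bias_low": [0.8, 1.8, 2.8]}
print(best_average_ensemble(preds, obs))
```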


How to Use

Clone the repository:

```bash
git clone https://github.com/barbedorafael/ml_pipeline.git
cd ml_pipeline
```

Install dependencies:

```bash
pip install -r requirements.txt
```


Dataset

The dataset generated by this pipeline will be publicly available on Zenodo. [Link pending]. The dataset includes:

  • Raw input data collected via the pipeline.
  • Processed features used in model training.
  • Output predictions and uncertainty estimates.

Requirements

  • Python 3.10+
  • Google Earth Engine Python API
  • Additional Python libraries (see requirements.txt)

Contributing

Suggestions, bug reports, and contributions are welcome! Open an issue or submit a pull request to improve the workflow.


License

This project is licensed under the MIT License. See the LICENSE file for details.
