AquaViva is a project aimed at addressing one of the most pressing global humanitarian challenges of our time: the lack of access to clean water (Sustainable Development Goal 6). To accomplish this, we use machine learning models, trained on various datasets including satellite imagery, climatic variables, and geological features, to produce near real-time, high-resolution maps of groundwater level.
We believe that this tool has great potential to help communities mitigate water scarcity, monitor groundwater, and efficiently identify suitable sources of clean water. As such, we are committed to keeping our project open-source and free-to-use, and we welcome any contributors to build off of what we have done. This project is part of NASA's Pale Blue Dot Challenge, which shares our deep commitment to using technology for environmental and social good. We are ecstatic to say that we have been recognized as Best Overall in the challenge!
Created by Team Viva Aqua | Francisco Furey, Adam Zheng, Malena Vildoza, El Hadji Malick DIEYE (AKA Jay) 😊
For our training data, we conducted an extensive literature review of past studies, as well as key concepts such as the water balance equation, to determine which variables would provide a comprehensive set of information for predicting groundwater level. We then collected, cleaned, preprocessed, and integrated the datasets using Python scripts (see scripts/preprocessing) and Jupyter Notebooks (see notebooks/preprocessing).
- Data Collection: First, we used IGRAC/GGIS to obtain piezometric (groundwater level) data from 2015-2022 for 36 wells distributed across Gambia. We then gathered corresponding data for our 13 input variables (see Features), sourced from AρρEEARS, ClimateSERV, BGS, and GGIS (see Data Sources). Most of the raw data is available under data/original_data (except for a few files that were too large to upload).
- Data Cleaning and Preprocessing: We used Jupyter notebooks (see notebooks/preprocessing) to handle the various data formats (.nc4, .nc, .csv), visualize and analyze the raw data, and account for missing or erroneous data using nearest neighbor algorithms and linear interpolation. QGIS was also used to process hydrogeological region and topographical data. All processed data is available under data/processed_data.
- Data Integration: Using pandas and geopandas, we merged the datasets on date, latitude, and longitude to form our primary dataset, which consists of ~6600 rows (see data/processed_data/wells_data_gambia_for_machine_learning.csv).
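The cleaning and integration steps above can be illustrated with a minimal sketch. It is not the project's actual preprocessing code: the column names (`groundwater_level`, `precipitation`) and the sample values are hypothetical, and it only shows linear interpolation of a gap followed by a merge on date, latitude, and longitude.

```python
import pandas as pd

# Hypothetical well observations with one missing reading.
# Column names and values are illustrative, not the project's schema.
wells = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "latitude": [13.45, 13.45, 13.45],
    "longitude": [-16.57, -16.57, -16.57],
    "groundwater_level": [12.0, None, 13.0],
})

# Hypothetical feature table sampled at the same dates and coordinates.
precip = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "latitude": [13.45, 13.45, 13.45],
    "longitude": [-16.57, -16.57, -16.57],
    "precipitation": [0.0, 4.2, 1.1],
})

# Fill gaps by linear interpolation, separately for each well location,
# so values from one well never leak into another.
wells = wells.sort_values(["latitude", "longitude", "date"])
wells["groundwater_level"] = (
    wells.groupby(["latitude", "longitude"])["groundwater_level"]
    .transform(lambda s: s.interpolate(method="linear"))
)

# Merge target and feature tables on the shared keys.
merged = wells.merge(precip, on=["date", "latitude", "longitude"], how="inner")
print(merged)
```

Merging on floating-point coordinates only works when both tables carry exactly the same values, so in practice it may be necessary to round coordinates or join on a well identifier instead.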
- Global Groundwater Information System (GGIS): An interactive portal by IGRAC that compiles data on global groundwater resources. We use it to access groundwater level data as well as data on hydrogeological regions.
- British Geological Survey (BGS): This research project by BGS focused on the resilience of African groundwater to climate change. We incorporate their depth to groundwater data, which classifies data into 6 categories (0-7, 7-25, 25-50, 50-100, 100-250, >250 meters) - significantly lower resolution & precision than our targets, but still potentially useful.
- Application for Extracting and Exploring Analysis Ready Samples (AρρEEARS): We used this tool to extract various parameters such as NDVI, MIR, EVI, Elevation, Curvature, Drainage Density, and Slope.
- ClimateSERV: A tool by SERVIR, NASA, & USAID that provides climatic and vegetation data. We wrote a custom Python library (climateservAccess) for accessing the ClimateSERV API and used it to gather soil moisture, evapotranspiration, streamflow, and precipitation data.
Datatype | Description | Data Source | Resolution |
---|---|---|---|
LIS_Soil_Moisture_Combined | Soil Moisture | ClimateSERV/LIS | 3 km |
LIS_Streamflow | Streamflow | ClimateSERV/LIS | 3 km |
LIS_ET | Evapotranspiration | ClimateSERV/LIS | 3 km |
MOD13Q1_061__250m_16_days_EVI | Enhanced Vegetation Index (EVI) | AρρEEARS/MODIS | 250 m |
MOD13Q1_061__250m_16_days_MIR_reflectance | Mid-Infrared Reflectance | AρρEEARS/MODIS | 250 m |
MOD13Q1_061__250m_16_days_NDVI | Normalized Difference Vegetation Index (NDVI) | AρρEEARS/MODIS | 250 m |
NASA_IMERG_Late | Precipitation | ClimateSERV/IMERG | 10 km |
DepthToGroundwater | Estimated Groundwater Level Range | BGS | 5 km |
Curvatu_tif2 | Curvature | AρρEEARS/NASADEM | 30 m |
Drainage_density | Drainage Density | AρρEEARS/NASADEM | 30 m |
Slope_tif2 | Slope | AρρEEARS/NASADEM | 30 m |
Hydrogeo | Hydrogeological Region | IGRAC/GGIS | N/A |
NASADEM_HGT | Elevation | AρρEEARS/NASADEM | 30 m |
Datatype | Description | Data Source |
---|---|---|
GROUNDWATER_LEVEL | Groundwater Level | IGRAC/GGIS |
All relevant Jupyter Notebooks are located in notebooks/machine_learning.
- Model Selection and Training: First, we split our dataset by well ID, so that no well appears in both the training and test sets and the test score is not inflated by overfitting to individual wells. We allocated 83% of the data for training and 17% for testing. We trained six regression models using scikit-learn: SVR, AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor, SGDRegressor, and LinearSVR. Our computational resources limited our ability to test more computationally intensive models such as neural networks; with access to more powerful machines, exploring these models could yield better results.
- Model Evaluation: We assessed performance using Mean Squared Error (MSE), Mean Absolute Error (MAE), and the Coefficient of Determination (R²), achieving our best result (MAE = 2.6 m, R² = 0.42) with LinearSVR.
- Model Optimization: We also applied cross-validation and GridSearchCV for hyperparameter tuning, and combined LinearSVR with a Nystroem kernel approximation so that the linear model could capture non-linear relationships.
- Visualization Data: We first defined an area of interest within QGIS, and then split it up into ~2874 points, each representing a 500 m pixel. We then gathered feature data for each of these points (see data/final_dataset), processed and compiled it as before (see notebooks/gambia_dataset), and ran it through the LinearSVR model (see notebooks/gambia_dataset/LinearSVR_final_dataset.ipynb) to get predicted groundwater levels. Note: we only used 500 m resolution due to time constraints; higher resolutions would otherwise have been entirely feasible.
- Visualization Creation: Once we had the ML model results, we used IDW (Inverse Distance Weighting) interpolation in QGIS to increase the resolution to about 177 m, and exported the data to a CSV file. We then uploaded the CSV to kepler.gl, put together our interactive visualization, exported it to an HTML file, and customized it to create our Aqua Viva website.
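The model selection, evaluation, and optimization steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from notebooks/machine_learning: the data is synthetic, and the scaler, the `rbf` kernel, the parameter grid, and the 5-fold `GroupKFold` are assumptions chosen for the example. Only the overall shape (well-based split, Nystroem plus LinearSVR, GridSearchCV, MAE and R²) follows the description.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, GroupKFold, GroupShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

rng = np.random.default_rng(0)

# Synthetic stand-in for the real dataset: 36 wells, 13 features.
n_wells, n_obs, n_features = 36, 20, 13
X = rng.normal(size=(n_wells * n_obs, n_features))
well_ids = np.repeat(np.arange(n_wells), n_obs)
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=len(X))

# Split by well ID (roughly 83% / 17%) so no well is in both sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.17, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=well_ids))

# Nystroem approximates a kernel feature map; LinearSVR is then fit on it.
pipeline = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", n_components=100, random_state=0),
    LinearSVR(max_iter=10000, random_state=0),
)

# Illustrative grid only; the project's actual grid is not stated.
param_grid = {
    "nystroem__gamma": [0.01, 0.1],
    "linearsvr__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(
    pipeline,
    param_grid,
    cv=GroupKFold(n_splits=5),  # folds also respect well boundaries
    scoring="neg_mean_absolute_error",
)
search.fit(X[train_idx], y[train_idx], groups=well_ids[train_idx])

# Evaluate on wells the model has never seen.
y_pred = search.predict(X[test_idx])
mae = mean_absolute_error(y[test_idx], y_pred)
r2 = r2_score(y[test_idx], y_pred)
print(search.best_params_, f"MAE={mae:.2f}", f"R2={r2:.2f}")
```

Using group-aware splitting for both the hold-out set and the cross-validation folds keeps every score an estimate of performance on unseen wells, which is the situation the model faces when predicting across a map.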
Given the enormous potential scale of this project, and the fact that we were just 4 people who worked on this for about a month, there is much else that remains to be done:
- Model verification. Although our model was trained on the best open-source data we could find, it was still limited (6600 data points across 36 wells). Despite our best efforts and what we believe to be reasonably accurate results, groundwater level is still a very complex variable to predict and this project would benefit greatly from more data to verify/improve our model.
- Streamline model usage. This was just a rough first pass for the process of getting feature data, running it through our model, and visualizing the results. So an important next step would be to create some sort of tool (perhaps a single Jupyter notebook) that streamlines this process and allows the user to adjust parameters easily.
- Time series data. Due to time constraints, we only visualized data for one day (December 1, 2023). Especially once model usage is streamlined, it will be much easier to visualize time series data, which would be very useful for evaluating changes in groundwater level over time.
- Near real-time data. It is entirely possible to create a tool that automatically retrieves near real-time data, runs it through our model, and outputs data for visualization. Such a tool could be used for groundwater monitoring.
- Expand area of interest. Again, due to time constraints, we narrowed our focus to a smaller (but still high-impact) region of Gambia. With more time, it would be relatively straightforward to create a visualization for all of Gambia. We do not know whether the model can be extrapolated to other regions of the world, but we think it may work in regions with a biome similar to Gambia's. More work should be done to verify this.
Whether you would like to help with any of the future work outlined above, add your own data/ML models, or have any other ideas/suggestions - all contributions are welcome and encouraged! Simply fork the repo and create a pull request. You can also open an issue with the tag "enhancement". Thanks in advance for your contributions, and feel free to contact us with questions!
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the MIT License. See LICENSE.txt for more information.
- Thank you to TANGO (The Association of Non-Governmental Organizations in the Gambia) for their insights and for connecting us with community leaders in the Gambia
- Thank you to Jun Yuan Zhang for his advice regarding groundwater level prediction
- A Machine Learning Approach to Predict Groundwater Levels in California Reveals Ecosystems at Risk
- Groundwater Prediction Using Machine-Learning Tools
- Prediction of groundwater level fluctuations under climate change based on machine learning algorithms in the Mashhad aquifer, Iran
- Global Groundwater Dependent Ecosystems
- Our World in Data SDG 6 Tracker
- CodePen Back Button
- CodePen Info on Hover