GeoPre is a Python library designed to streamline common geospatial data operations, offering a unified interface for handling raster and vector datasets. It simplifies preprocessing tasks essential for GIS analysis, machine learning workflows, and remote sensing applications.
- Data Scaling:
  - Normalization (Z-Score) and Min-Max scaling for raster bands.
  - Prepares data for ML models while preserving geospatial metadata.
- CRS Management:
  - Retrieve and compare Coordinate Reference Systems (CRS) across raster (Rasterio/Xarray) and vector (GeoPandas) datasets.
  - Ensure consistency between datasets with automated CRS checks.
- Reprojection:
  - Reproject vector data (GeoDataFrames) and raster data (Rasterio/Xarray) to any target CRS.
  - Supports EPSG codes, WKT, and Proj4 strings.
- No-Data Masking:
  - Handle missing values in raster datasets (NumPy/Xarray) with flexible masking.
  - Integrates seamlessly with raster metadata for error-free workflows.
- Cloud Masking:
  - Identify and mask clouds in Sentinel-2 and Landsat imagery.
  - Supports multiple methods: QA bands, scene classification layers (SCL), probability bands, and OmniCloudMask AI-based detection.
  - Optionally mask cloud shadows for improved accuracy.
- Band Stacking:
  - Stack multiple raster bands from a folder into a single multi-band raster for analysis.
  - Supports automatic band detection and resampling for different resolutions.
- Raster: NumPy arrays, Rasterio DatasetReader, Xarray DataArray (via rioxarray).
- Vector: GeoPandas GeoDataFrame.
- Unified Workflow: Eliminates boilerplate code by providing consistent functions for raster and vector data.
- Interoperability: Bridges gaps between GeoPandas, Rasterio, and Xarray, ensuring smooth data transitions.
- Robust Error Handling: Automatically detects CRS mismatches and missing metadata to prevent silent failures.
- Efficiency: Optimized reprojection and masking operations reduce preprocessing time for large datasets.
- ML-Ready Outputs: Scaling functions preserve data structure, making outputs directly usable in machine learning pipelines.
Ideal for researchers and developers working with geospatial data, GeoPre enhances productivity by standardizing preprocessing steps and ensuring compatibility across diverse geospatial tools.
GeoPre is available on PyPI and can be installed with:
pip install geopre
This will automatically install all required dependencies.
Description: Z-score scaling (Z_score_scaling) centers the data around zero by subtracting the mean and dividing by the standard deviation. It is useful for machine learning models that are sensitive to the scale of their inputs, and it can standardize a band of pixel values before clustering or classification.
Parameters:
- data (numpy.ndarray): Input array to normalize.
Returns:
- numpy.ndarray: Standardized data with mean 0 and standard deviation 1.
Description: Min-max scaling (Min_Max_Scaling) rescales the pixel values to a fixed range, typically [0, 1] or [-1, 1], while preserving the relative spacing of the values. For example, GeoTIFF image values stored as 0 to 65535 can be scaled to [0, 1].
Parameters:
- data (numpy.ndarray): Input array to normalize.
Returns:
- numpy.ndarray: Scaled data with values between 0 and 1, or -1 and 1.
import numpy as np
import geopre as gp
data = np.array([[10, 20, 30], [40, 50, 60]])
z_scaled = gp.Z_score_scaling(data)
minmax_scaled = gp.Min_Max_Scaling(data)
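For readers who want to see the arithmetic behind the two scalings, here is a minimal NumPy sketch. It is not GeoPre's implementation: the helper names `z_score` and `min_max` are hypothetical, and it assumes the statistics are computed over the whole array, which the descriptions above do not specify.

```python
import numpy as np


def z_score(data: np.ndarray) -> np.ndarray:
    """Subtract the mean and divide by the standard deviation (hypothetical helper)."""
    data = data.astype("float64")
    return (data - data.mean()) / data.std()


def min_max(data: np.ndarray, low: float = 0.0, high: float = 1.0) -> np.ndarray:
    """Linearly rescale values to the range [low, high] (hypothetical helper)."""
    data = data.astype("float64")
    unit = (data - data.min()) / (data.max() - data.min())
    return unit * (high - low) + low


data = np.array([[10, 20, 30], [40, 50, 60]])
z = z_score(data)
mm = min_max(data)
print(z.mean(), z.std())    # mean close to 0, standard deviation close to 1
print(mm.min(), mm.max())   # smallest and largest scaled values
```

Both helpers divide by a quantity that is zero for a constant array, so a real pipeline would need to guard against that case.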
Description: Retrieve CRS from geospatial data objects.
Parameters:
- data: GeoPandas GeoDataFrames (vector), Rasterio DatasetReaders (raster) or Xarray DataArrays with rio accessor (raster)
Returns:
- pyproj.CRS: Coordinate reference system or None if undefined
Description: Compare CRS between raster and vector datasets.
Parameters:
- raster_obj (DatasetReader/xarray.DataArray): Raster data source.
- vector_gdf (gpd.GeoDataFrame): Vector data source.
Returns:
- dict: Comparison results with keys:
  - raster_crs: Formatted CRS string
  - vector_crs: Formatted CRS string
  - same_crs: Boolean comparison result
  - error: Exception message if any
import geopandas as gpd
import rasterio
import geopre as gp
vector = gpd.read_file("data.shp")
raster = rasterio.open("image.tif")
print(gp.get_crs(vector)) # EPSG:4326
print(gp.compare_crs(raster, vector)) # CRS comparison results
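The comparison dict is most useful when a script acts on it before doing any spatial operation. The sketch below shows one way to do that. The helper `crs_check_message` is hypothetical, the example dict is hand-written rather than real GeoPre output, and it assumes that `error` is empty or None when the comparison succeeded.

```python
def crs_check_message(result: dict) -> str:
    """Turn a comparison dict into a one-line status message (hypothetical helper)."""
    if result.get("error"):
        return f"CRS check failed: {result['error']}"
    if result.get("same_crs"):
        return f"CRS match: {result['raster_crs']}"
    return (
        f"CRS mismatch: raster is {result['raster_crs']}, "
        f"vector is {result['vector_crs']}; reproject one of them first"
    )


# Hand-written dict standing in for a comparison result (values are illustrative).
example = {
    "raster_crs": "EPSG:32633",
    "vector_crs": "EPSG:4326",
    "same_crs": False,
    "error": None,
}
print(crs_check_message(example))
```

Checking `error` first keeps a failed comparison from being misread as a plain mismatch.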
Description: Reproject geospatial data to target CRS.
Parameters:
- data: GeoDataFrames (vector reprojection), or Rasterio datasets (returns array + metadata), or Xarray objects (rioxarray reprojection)
- target_crs: CRS to reproject to (EPSG code/WKT/proj4 string)
Returns:
- Reprojected data in format matching input type
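import geopandas as gpd
import rasterio
import rioxarray
import geopre as gp
# Vector reprojection
vector = gpd.read_file("data.shp")
reprojected_vector = gp.reproject_data(vector, "EPSG:3857")
# Raster reprojection (Rasterio): returns the reprojected array and its metadata
with rasterio.open("input.tif") as src:
    array, metadata = gp.reproject_data(src, "EPSG:32633")
# Xarray reprojection (xarray.open_rasterio has been removed from xarray; rioxarray provides the replacement)
da = rioxarray.open_rasterio("image.tif")
reprojected_da = gp.reproject_data(da, "EPSG:4326")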
Description: Mask no-data values in raster datasets. Handles both rasterio (numpy) and rioxarray (xarray) workflows.
Parameters:
- data: Raster data (numpy.ndarray or xarray.DataArray)
- profile: Rasterio metadata dict (required for numpy arrays)
- no_data_value: Override for metadata's nodata value
- return_mask: Whether to return boolean mask
Returns:
- Masked data array. For numpy inputs, returns a tuple: (masked_array, profile). For xarray inputs, returns a DataArray.
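import rasterio
import rioxarray
import geopre as gp
# Rasterio workflow: pass the array together with the dataset profile
with rasterio.open("data.tif") as src:
    data = src.read(1)
    masked, profile = gp.mask_raster_data(data, src.profile)
# rioxarray workflow: no profile argument is needed
# (xarray.open_rasterio has been removed from xarray; rioxarray provides the replacement)
da = rioxarray.open_rasterio("data.tif")
masked_da = gp.mask_raster_data(da)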
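Conceptually, no-data masking means keeping flagged cells out of later statistics. The following NumPy sketch illustrates two common ways of doing that. It does not reproduce the behavior of `mask_raster_data`, and the nodata value and pixel values are made up for illustration.

```python
import numpy as np

nodata = -9999  # illustrative nodata value, normally read from the raster metadata
band = np.array([[12, -9999, 7], [3, 5, -9999]], dtype="int32")

# Option 1: a masked array keeps the original dtype and hides the nodata cells.
masked = np.ma.masked_equal(band, nodata)
print(masked.mean())

# Option 2: cast to float, replace nodata with NaN, then use NaN-aware reductions.
as_float = np.where(band == nodata, np.nan, band.astype("float64"))
print(np.nanmean(as_float))
```

Both options give the mean of the four valid cells only; an unmasked `band.mean()` would be dominated by the -9999 entries.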
Description: Masks clouds and optionally shadows in a Sentinel-2 raster image using various methods.
Parameters:
- image_path (str): Path to the input raster image.
- output_path (str, optional): Path to save the masked output raster. Defaults to the same directory as the input with '_masked' appended to the filename.
- method (str, optional): The method for masking. Options are:
  - 'auto': Automatically chooses the best available method.
  - 'qa': Uses the QA60 band to mask clouds. WARNING: QA60 is masked between 2022-01-25 and 2024-02-28. Results for images in that date range could be wrong.
  - 'probability': Uses the cloud probability band MSK_CLDPRB with a threshold for masking.
  - 'omnicloudmask': Utilizes OmniCloudMask for AI-based cloud detection. Might take a long time for big images.
  - 'scl': Leverages the Scene Classification Layer (SCL) for masking.
  - 'standard': Similar to 'auto', but avoids the OmniCloudMask method.
- mask_shadows (bool): Whether to mask cloud shadows. Defaults to False.
- threshold (int, optional): Cloud probability threshold (if using a cloud probability band), from 0 to 100. Defaults to 20.
- qa60_idx (int, optional): Index of the QA60 band (1-based). Auto-detected if not provided.
- qa60_path (str, optional): Path to the QA60 band (if in a separate file).
- prob_band_idx (int, optional): Index of the cloud probability band (1-based). Auto-detected if not provided.
- prob_band_path (str, optional): Path to the cloud probability band (if in a separate file).
- scl_idx (int, optional): Index of the SCL band (1-based). Auto-detected if not provided.
- scl_path (str, optional): Path to the SCL band (if in a separate file).
- red_idx, green_idx, nir_idx (int, optional): Indices of the red, green, and NIR bands, respectively. Auto-detected if not provided.
- nodata_value (float): Value for no-data regions. Defaults to np.nan.
Returns:
- (str): The path to the saved masked output raster.
import geopre as gp
output_s2 = gp.mask_clouds_S2("sentinel2_image.tif", method='auto', mask_shadows=True)
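To make the 'qa', 'scl', and 'probability' options more concrete, the sketch below shows how cloud masks are commonly derived from those Sentinel-2 layers with NumPy. It is an illustration only, not GeoPre's code. The tiny arrays are invented, and treating values strictly above the threshold as cloud is an assumption.

```python
import numpy as np

# Tiny illustrative arrays standing in for real Sentinel-2 layers.
qa60 = np.array([[0, 1 << 10], [1 << 11, 0]], dtype="uint16")
scl = np.array([[4, 9], [3, 5]], dtype="uint8")
cloud_prob = np.array([[5, 80], [25, 10]], dtype="uint8")

# QA60: bit 10 flags opaque clouds, bit 11 flags cirrus.
qa_mask = (qa60 & ((1 << 10) | (1 << 11))) != 0

# SCL: classes 8, 9 and 10 are cloud classes; class 3 is cloud shadow.
scl_cloud = np.isin(scl, [8, 9, 10])
scl_shadow = scl == 3

# Probability band: values above the threshold count as cloud (assumed comparison).
threshold = 20
prob_mask = cloud_prob > threshold

# Apply a mask to a reflectance band by writing NaN into the flagged pixels.
red = np.array([[0.10, 0.45], [0.12, 0.08]])
red_masked = np.where(scl_cloud | scl_shadow, np.nan, red)
print(qa_mask)
print(prob_mask)
print(red_masked)
```

Combining `scl_cloud` with `scl_shadow` corresponds to the idea behind `mask_shadows=True`; leaving `scl_shadow` out masks clouds only.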
Description: Masks clouds and optionally shadows in a Landsat raster image using various methods.
Parameters:
- image_path (str): Path to the input multi-band raster image.
- output_path (str, optional): Path to save the masked output raster. Defaults to the same directory as the input with a '_masked' suffix.
- method (str): The method for masking. Options are:
  - 'auto': Automatically chooses the best available method.
  - 'qa': Uses the QA_PIXEL band to mask clouds.
  - 'omnicloudmask': Utilizes OmniCloudMask for AI-based cloud detection.
- mask_shadows (bool): Whether to mask cloud shadows. Defaults to False.
- qa_pixel_path (str, optional): Path to the separate QA_PIXEL raster file.
- qa_pixel_idx (int, optional): Index of the QA_PIXEL band (1-based).
- confidence_threshold (str, optional): Confidence threshold for cloud masking (e.g., 'Low', 'Medium', 'High'). Defaults to 'High'. WARNING: according to the official Landsat documentation, the confidence bands are still under development; always use the default 'High' until further notice.
- red_idx, green_idx, nir_idx (int, optional): Indices of the red, green, and NIR bands, respectively. Auto-detected if not provided.
- nodata_value (float): Value for no-data regions. Defaults to np.nan.
Returns:
- (str): The path to the saved masked output raster.
import geopre as gp
output_landsat = gp.mask_clouds_landsat("landsat_image.tif", method='auto', mask_shadows=True)
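The 'qa' method relies on bit flags packed into QA_PIXEL. The sketch below decodes them with NumPy, assuming the Landsat Collection 2 bit layout (bit 3 cloud, bit 4 cloud shadow, bits 8-9 cloud confidence). The helper `landsat_cloud_mask` is hypothetical and is not GeoPre's implementation; the QA values are constructed by hand for illustration.

```python
import numpy as np

# Bit positions assumed from the Landsat Collection 2 QA_PIXEL layout.
CLOUD_BIT = 3          # bit 3: cloud
SHADOW_BIT = 4         # bit 4: cloud shadow
CONFIDENCE_SHIFT = 8   # bits 8-9: cloud confidence (1 = low, 2 = medium, 3 = high)
LEVELS = {"Low": 1, "Medium": 2, "High": 3}


def landsat_cloud_mask(qa_pixel, confidence="High", mask_shadows=False):
    """Boolean mask of cloudy pixels (hypothetical helper, not GeoPre's code)."""
    cloud_flag = ((qa_pixel >> CLOUD_BIT) & 1) == 1
    cloud_conf = (qa_pixel >> CONFIDENCE_SHIFT) & 0b11
    mask = cloud_flag & (cloud_conf >= LEVELS[confidence])
    if mask_shadows:
        mask |= ((qa_pixel >> SHADOW_BIT) & 1) == 1
    return mask


# Hand-built values: high-confidence cloud, medium-confidence cloud, shadow, clear.
qa = np.array([[776, 520], [272, 320]], dtype="uint16")
print(landsat_cloud_mask(qa))
print(landsat_cloud_mask(qa, mask_shadows=True))
```

With the default 'High' level only the first pixel is flagged; lowering the level to 'Medium' would also flag the second one.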
Description: Stacks multiple raster bands from a folder into a single multi-band raster. .SAFE folders are also supported.
Parameters:
- input_path (str or Path): Path to the folder containing band files.
- required_bands (list of str): List of band name identifiers (e.g., ["B4", "B3", "B2"]).
- output_path (str or Path, optional): Path to save the stacked raster. Defaults to "stacked.tif" in the input folder.
- resolution (float, optional): Target resolution for resampling. Defaults to the highest available resolution.
Returns:
- (str): The path to the saved stacked output raster.
import geopre as gp
stacked_image = gp.stack_bands("/path/to/folder/containing/bands", ["B4", "B3", "B2"])
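Before bands can be stacked, each identifier has to be resolved to a file on disk. The sketch below shows one plausible way to do that lookup with the standard library. The helper `find_band_files`, the file naming pattern, and the accepted suffixes are all assumptions made for illustration; how GeoPre detects bands internally is not described here.

```python
import re
import tempfile
from pathlib import Path


def find_band_files(folder, required_bands, suffixes=(".tif", ".jp2")):
    """Map each band identifier to the first matching file (hypothetical helper)."""
    files = sorted(p for p in Path(folder).rglob("*") if p.suffix.lower() in suffixes)
    found = {}
    for band in required_bands:
        # Match the identifier as a whole token so that "B1" does not match "B12".
        pattern = re.compile(rf"(?<![A-Za-z0-9]){re.escape(band)}(?![A-Za-z0-9])")
        matches = [p for p in files if pattern.search(p.stem)]
        if not matches:
            raise FileNotFoundError(f"No file found for band {band!r} in {folder}")
        found[band] = matches[0]
    return found


# Demonstrate on a throwaway folder with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("scene_B2.tif", "scene_B3.tif", "scene_B4.tif", "scene_B12.tif"):
        (Path(tmp) / name).touch()
    bands = find_band_files(tmp, ["B4", "B3", "B2"])
    names = [path.name for path in bands.values()]
    print(names)
```

The returned dict preserves the order of `required_bands`, which is the order the bands would take in the stacked output. Resampling to a common resolution is a separate step that this sketch does not cover.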
We provide two example Jupyter notebooks demonstrating the usage of GeoPre:
- example_usage.ipynb – Demonstrates scaling, reprojecting, and masking operations.
- example_usage_2.ipynb – Covers cloud masking and band stacking.
- Fork the repository: click the "Fork" button at the top-right of this repository to create your copy.
- Create your feature branch:
  git checkout -b feature/your-feature
- Commit changes:
  git commit -am 'Add some feature'
- Push to branch:
  git push origin feature/your-feature
- Open a Pull Request: navigate to the Pull Requests tab in the original repository and click "New Pull Request" to submit your changes.
See the full release notes in the CHANGELOG.md.
This project is licensed under the MIT License. See LICENSE for more information.
Liang Zhongyou – GitHub Profile
Matteo Gobbi Frattini – GitHub Profile