Job splitters should retain original geometries #153
Merged
Changes from 2 commits
```diff
@@ -1,26 +1,94 @@
 """Test the job splitters and managers of OpenEO GFMAP."""
 
-from pathlib import Path
-
 import geopandas as gpd
+from shapely.geometry import Point, Polygon
 
-from openeo_gfmap.manager.job_splitters import split_job_hex
+from openeo_gfmap.manager.job_splitters import split_job_hex, split_job_s2grid
 
 
-# TODO can we instead assert on exact numbers ?
-# would remove the print statement
-def test_split_jobs():
-    dataset_path = Path(__file__).parent / "resources/wc_extraction_dataset.gpkg"
+def test_split_job_s2grid():
+    # Create a mock GeoDataFrame with points
+    # The points are located in two different S2 tiles
+    data = {
+        "id": [1, 2, 3, 4, 5],
+        "geometry": [
+            Point(60.02, 4.57),
+            Point(59.6, 5.04),
+            Point(59.92, 3.37),
+            Point(59.07, 4.11),
+            Point(58.77, 4.87),
+        ],
+    }
+    polygons = gpd.GeoDataFrame(data, crs="EPSG:4326")
 
-    # Load the dataset
-    dataset = gpd.read_file(dataset_path)
+    # Define expected number of split groups
+    max_points = 2
 
-    # Split the dataset
-    split_dataset = split_job_hex(dataset, max_points=500)
+    # Call the function
+    result = split_job_s2grid(polygons, max_points)
 
-    # Check the number of splits
-    assert len(split_dataset) > 1
+    assert (
+        len(result) == 3
+    ), "The number of GeoDataFrames returned should match the number of splits needed."
 
-    for ds in split_dataset:
-        print(len(ds))
-        assert len(ds) <= 500
+    # Check if the geometries are preserved
+    for gdf in result:
+        assert (
+            "geometry" in gdf.columns
+        ), "Each GeoDataFrame should have a geometry column."
+        assert gdf.crs == 4326, "The original CRS should be preserved."
+        assert all(
+            gdf.geometry.geom_type == "Point"
+        ), "Original geometries should be preserved."
+
+
+def test_split_job_hex():
+    # Create a mock GeoDataFrame with points
+    # The points/polygons are located in three different h3 hexes of size 3
+    data = {
+        "id": [1, 2, 3, 4, 5, 6],
+        "geometry": [
+            Point(60.02, 4.57),
+            Point(58.34, 5.06),
+            Point(59.92, 3.37),
+            Point(58.85, 4.90),
+            Point(58.77, 4.87),
+            Polygon(
+                [
+                    (58.78, 4.88),
+                    (58.78, 4.86),
+                    (58.76, 4.86),
+                    (58.76, 4.88),
+                    (58.78, 4.88),
+                ]
+            ),
+        ],
+    }
+    polygons = gpd.GeoDataFrame(data, crs="EPSG:4326")
+
+    max_points = 3
+
+    result = split_job_hex(polygons, max_points)
+
+    assert (
+        len(result) == 4
+    ), "The number of GeoDataFrames returned should match the number of splits needed."
+
+    for idx, gdf in enumerate(result):
+        assert (
+            "geometry" in gdf.columns
+        ), "Each GeoDataFrame should have a geometry column."
+        assert gdf.crs == 4326, "The CRS should be preserved."
+        if idx == 1:
+            assert all(
+                gdf.geometry.geom_type == "Polygon"
+            ), "Original geometries should be preserved."
+        else:
+            assert all(
+                gdf.geometry.geom_type == "Point"
+            ), "Original geometries should be preserved."
+
+    assert (
+        len(result[0]) == 3
+    ), "The number of geometries in the first split should be 3."
```
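The invariant these tests pin down can be stated independently of the splitter: every input row must reappear in some split with its CRS and geometry type untouched. A minimal sketch of that invariant, using a hand-rolled two-way split as a stand-in for `split_job_s2grid` / `split_job_hex` (the coordinates and the split itself are illustrative assumptions, not the library's grouping logic):

```python
import geopandas as gpd
from shapely.geometry import Point

# Mock input mirroring the test data above.
polygons = gpd.GeoDataFrame(
    {"id": [1, 2, 3]},
    geometry=[Point(60.02, 4.57), Point(59.6, 5.04), Point(59.92, 3.37)],
    crs="EPSG:4326",
)

# Stand-in for a job splitter that retains the original rows.
splits = [polygons.iloc[:2], polygons.iloc[2:]]

# Invariant checked by the new tests: row count, CRS, and geometry
# types all survive the split untouched.
assert sum(len(s) for s in splits) == len(polygons)
assert all(s.crs == polygons.crs for s in splits)
assert all((s.geometry.geom_type == "Point").all() for s in splits)
```

Any splitter that copies rows from the input instead of rebuilding geometries (e.g. replacing polygons by their centroids) satisfies this round-trip property.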
Is this not a slow step for the entire S2 grid?
@VincentVerelst Maybe if this is a speed issue you could upload a second S2 grid to the artifactory with the CRS reprojected, and then change the `load_s2_grid()` function in gfmap to accept a parameter `web_mercator: bool = False`.
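The suggested change could look like the sketch below. This is only an illustration of the proposed `web_mercator` flag: the mock two-tile grid, the tile ids, and the module-level caching are assumptions, standing in for the artifactory-hosted grid files that gfmap would actually load.

```python
import geopandas as gpd
from shapely.geometry import box

# A two-tile mock grid standing in for the real S2 tiling grid
# (tile ids and extents are made up for the example).
_GRID_4326 = gpd.GeoDataFrame(
    {"tile": ["31UFS", "31UGS"]},
    geometry=[box(3.0, 50.0, 4.0, 51.0), box(4.0, 50.0, 5.0, 51.0)],
    crs="EPSG:4326",
)
# Pre-reprojected copy, mimicking a second grid hosted in EPSG:3857.
_GRID_3857 = _GRID_4326.to_crs(epsg=3857)


def load_s2_grid(web_mercator: bool = False) -> gpd.GeoDataFrame:
    # Select the grid that is already stored in the requested CRS
    # instead of reprojecting the full grid on every call.
    return _GRID_3857 if web_mercator else _GRID_4326
```

The win is that the (potentially large) grid is reprojected once when the second copy is published, rather than at every `load_s2_grid()` call.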
May be worth the effort indeed.
Thanks for the suggestion, I'll go ahead and update that.
I do have to note that the largest bottleneck is typically not the CRS conversion of the S2 grid, but rather the CRS conversion of the dataframe to split itself. For example, for `2018_BEL_LPIS-Flanders_POLY_110` the old job splitter takes 10 s to run, while the new one takes 20 s.
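The claimed bottleneck is easy to reproduce in isolation: `to_crs` on the dataframe being split scales with the number of input geometries, while the grid reprojection is a fixed cost. A small self-contained timing sketch (the 10,000-point dataframe is synthetic, not the LPIS dataset mentioned above):

```python
import time

import geopandas as gpd
from shapely.geometry import Point

# Synthetic stand-in for a large dataframe to split.
points = gpd.GeoDataFrame(
    geometry=[Point(4.0 + i * 1e-4, 51.0) for i in range(10_000)],
    crs="EPSG:4326",
)

# Reprojecting the whole input is the per-run cost that grows with
# the dataset; the S2 grid conversion would be a one-off instead.
t0 = time.perf_counter()
reprojected = points.to_crs(epsg=3857)
print(f"to_crs on {len(points)} geometries: {time.perf_counter() - t0:.3f}s")
```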