feat: improve performance of functions used when bulk loading dataset #557
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Improves performance of the
buildings_bulk_load.bulk_load_outlines_insert_supplied
function, by optimising three other functions inbuildings_reference
which this calls:suburb_locality_intersect_polygon
,town_city_intersect_polygon
, andterritorial_authority_grid_intersect_polygon
.These three functions perform equivalent tasks: finding the suburb, town, and TA which each building outline overlaps the most. It is possible for an outline to overlap none, one, or multiple of each of these. If an outline overlaps more than one, we want to choose the one with the most overlap. This requires calculating the area of the intersection, which takes a non-trivial amount of processing time. Most outlines however will only intersect with a single suburb, town, or TA.
By first finding which and how many areas an outline intersects, we can then avoid having to take the time to calculate the area of the intersection in the case where there is only a single candidate. This avoids significant amounts of computation.
A test bulk load into a database running in docker saw the time taken to execute
buildings_bulk_load.bulk_load_outlines_insert_supplied
when loading the Ecopia Hawkes Bay dataset with 134,209 outlines drop from 1h 44m 23s to 17m 59s.Notes on Sqitch
I am unsure if the way I have added the changes through Sqitch is correct, and have pushed this branch up so others can check what I have done, though I am not confident it is fully correct.
This pull request should be seen more as a "here is what I did, is this the right way to do it?" than fully finished changes ready to be merged.
Steps I followed using Sqitch were:
db/sql
directory had no pending changes.sqitch rework <file-identifier> -n '<Note text>'
.suburb_locality.sql
,territorial_authority.sql
andtown_city.sql
.suburb_locality.sql
,territorial_authority.sql
andtown_city.sql
to add a block to check for text added in the new version of each function.Things I'm not sure about with Sqitch:
@4.0.0-dev1.sql
files fortown_city
andterritorial_authority
seem to have the entirity of what the non-suffixed file was before these changes, butsuburb_locality
only has the previous version of the one function that was changed. That doesn't seem quite right, given the changes were equivalent across the three files.suburb_locality
only had a singleSELECT has_function_privilege
generated, when the others had one generated for each function. I didn't add the others, but it seems inconsistent.