Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gazetteer model #3979

Open
wants to merge 254 commits into
base: development
Choose a base branch
from
Open

Conversation

kleintom
Copy link
Contributor

I'm submitting a first draft of

  • a somewhat reworked GeographicItem model
  • a new Gazetteer model
  • a 'New Gazetteer' task

All of those are unfinished (and possibly somewhat broken), but I wanted to put something out to make sure I'm on the right track. All/any feedback welcome, but I'm not looking for a full review at this point - I consider this all to be in the early stages, so 'big' changes are fine with me.

Some notes/issues/questions I've encountered so far:

  • I doubt all the specs pass, though GeographicItem currently should. GeographicItem specs have been duplicated, but I haven't changed the second copy at all yet to cover the new geography column.
  • Some changes in GeographicItem:
    • shape_column_sql(shape) returns, e.g. the polygon column if that's present, or the geography column if that column holds a polygon. This will replace many of the existing CASE statements once everything switches over to a single geography column.
    • More st_* methods for working with st_* statements (feedback welcome), see also select_self.
  • In GeographicItem I changed all ST_Contains to ST_Covers. I couldn't tell why one was meant over the other in particular cases, and ST_Covers seems easier to reason about (to me). (The difference is ST_Contains doesn't contain things in its boundary, so e.g. a polygon wouldn't contain its vertices, etc.) That could be a breaking change, and would be easy to revert.
    • One question would be which should be used in new code (maybe it depends on situation).
    • If there's no builtin reason for one over the other then a couple other spots in the code could be changed to ST_Cover instead of ST_Contains, if we wanted to settle on one over the other.
  • There are nearly 500 lines of code in GeographicItem that are either
    • not referenced at all
    • are only used in specs to test themselves
    • are only used in specs to test other things
      I put all of those in a separate GeographicItem::Deprecated module for now, but it's still included in GeographicItem (for now) and I don't have any real opinion on whether that's the way forward or not.
    • Personally my thought was to put the used-only-for-specs methods somewhere else where they would only be visible in specs, but I wasn't seeing any clean way to do that.
    • I am now assuming that PostGIS >= 3.0 so that we should be able to handle all GeometryCollections in ST_* functions.

The 'New Gazetteer' task allows the user to add multiple shapes to a new gazetteer before saving (i.e. they can draw multiple shapes in leaflet, for example, for one gazetteer) - they all get bundled into a single collection behind the scenes. That differs from the georeference model I think, but fits more closely with GeographicArea shapes (I think) - this could also be a path to combining existing GeographicItems into a Gazetteer, depending on how we want to store that. Leaflet shapes and WKT shapes (and eventually other input methods) can be combined in one gazetteer - I don't keep track of which shapes came from which input types on the backend.

  • One question is how to deal with intersections of multiple shapes. I'm not sure if that may cause issues or not. One leaflet polygon drawn inside another combines to a donut (inside leaflet iirc), which is probably a useful thing, allowing users to create donuts, but maybe in other cases it's unexpected/more tricky/doesn't-combine-well-with-wkt, I'm not sure.
  • I've turned off circle creation in leaflet for now at least - we'd need to approximate by a polygon on the backend I think, which is a whole thing (is it already covered in a library somewhere?).

As @mjy suggested, I've gone with a write once read/replace pattern for the geographic item shape of a gazetteer.

I've yet to take any position on meaning of Gazetteer parent in the code.

Most of the crud create/list/show stuff should be mostly working, but it's not all finished yet.

Currently thinking a task will be best for these.
…vs st_distance_spheroid

Apologies for getting somewhat opinionated here, I hope I'm not too far off...

The name and the description of `select_geography_sql` (as well as comparison with `select_geometry_sql`) suggest that the return value should be of type `geography` (it was previously being cast to type geometry).

But then I'm confused about `st_distance` vs `st_distance_spheroid`:
* in `st_distance_spheroid` we cast geography types to geometry types, but then do distance calculations on a spheroid.
* in `st_distance`, prior to the `select_geography_sql` change here, we were calculating planar "distance" between two geometry arguments and then converting it to meters - but the use of `select_geography_sql` in that function suggests that what was intended was to calculate `ST_Distance` between two geography points, which is a spheroidal calculation.

So I would expect the same answer in both cases, which is what I'm getting here. As to which is faster or slower, I can't say, but `st_distance` seems more natural to me, though `st_distance_spheroid` is the one that's used in other places besides specs.

If we really wanted a faster, less accurate calculation I think it would be `st_distance` between two geometry types (and then converting to meters? I'm unclear there) - should one of these be converted to that?
These are function that are either completely unreferenced or only referenced in specs that only test the deprecated function itself (i.e. not used to test some aspect of GeoItem other than the function itself).

I haven't done anything with "spec helpers", i.e. functions that are used only in specs, in service of testing aspects of GeometricItem other than the helper function itself:
* `ordered_by_shortest_distance_from
* `ordered_by_longest_distance_from
* `with_is_valid_geometry_column`
* `start_point`
* `st_distance_spheroid`
* `st_npoints`
…se with gazetteer

Get geographic_item.spec tests to pass

TODO: still (many?) more `type`-related changes needed for the new type.
containing_where_sql_geog is currently unused, so this is kind of moot, but:

ST_CoveredBy does in fact "work" when you pass mixed geography and geometry parameters (see below for a guess of how).

Oddly though, the results for both parameters geographic vs both parameters geometric differs - I would have thought ST_CoveredBy(a::geography, b::geography) iff ST_CoveredBy(a::geometry, b::geometry).

select ST_CoveredBy(multi_polygon::geometry, multi_polygon::geometry) from geographic_items where id=421; => t

select ST_CoveredBy(multi_polygon, multi_polygon) from geographic_items where id=421; => f
select ST_CoveredBy(multi_polygon::geography, multi_polygon::geography) from geographic_items where id=421; => f

Mixed parameters:
select ST_CoveredBy(multi_polygon::geometry, multi_polygon) from geographic_items where id=421; => f
suggesting mixed parameters are cast to geography

select count(*) from (select ST_CoveredBy(multi_polygon::geometry, multi_polygon::geometry) from geographic_items) as a where a.st_coveredby = 't'; => 34713 (that's all multi polygons)

select count(*) from (select ST_CoveredBy(multi_polygon, multi_polygon) from geographic_items) as a where a.st_coveredby = 't'; => 0 (took about half an hour to complete)
…y type

Still not sure what the right way to expose the underlying shape of the geography column (is it point, polygon, etc). Right now there are lots of competing types and geo_types and object_types and multiple versions of those names; hopefully when it's just geography that can be simplified.
…raphy column

Note that the public `geo_object_type` returns the underlying shape of the :geography column in this case (point, polygon, etc), i.e. callers should be unaware that the underlying type is geography.
`geo_object` returns the actual shape column stored in GeographicItem, which now may be a geography shape - that fits with the new geo_object_type since if the shape of the :geography column is a polygon e.g. (i.e. geo_object_type returned :polygon) then the :geography column can be treated as a polygon.

This also addresses in part the difficulty I was having absorbing the various geo/type/object references: geo_object, geo_object_type, geo_type, SHAPE_COLUMN, all kind of interdependent on one another. To me at least this seems simpler (though possibly slightly less performant if that's a concern).
`valid_geometry?` is another method only used in specs
…efs in GeographicItem

A polygon, e.g., can now be found in either the polygon column or the geography column.  GeographyItem currently synonymizes the column name with shape name, but now we need to be more general. This commit is just a test case, there are more changes needed.

Once there's only a geography column then `shape_column_sql` will either be adjusted to return NULL in the ELSE case, or those queries where it's used can be turned into `WHERE ST_GeometryType(geography::geometry) = 'ST_shape'`
…on throughout

Prior to 3.0 ST_Contains and siblings did not support GeometryCollection.

There's still one change to be made, which I'll do when I add geography support at the same time.
…eographicItem

The distinction being that SHAPE_TYPES includes point, polygon, etc. but NOT `geography` (which is a column name in DATA_TYPES).

The point of doing that is that if you want to perform some geometric comparison on a given x = geographic_item, say you want to find all GeographicItems of shape polygon that intersect x, then you need to check all GeographicItems whose polygon column intersects x and all GeographicItems whose geography column contains a polygon that intersects x.

The decision here was to push that reality as far back as possible, by pretending we're unaware of the distinction between polygon and geography-with-polygon-shape as long as possible.
…_id with are_contained_in_item

Is there any reason not to do this? It avoids an extra GeographicItem load, and are_contained_in_item was just a thin wrapper around are_contained_in_item_by_id.

I changed all of the are_contained_in_item_by_id specs over to are_contained_in_item - some of the old are_contained_in_item specs were marked as deprecated but those all existed in the form of are_contained_in_item_by_id specs (which again is the same as are_contained_in_item), so I kept them all.
…phicItem

As far as I can tell these seem to be currently unused, though I haven't marked them deprecated in the other shapes or among the private methods of GeographicItem.
…ir own module

There are 350 lines worth. Some are used only in specs, mostly testing only themselves - feels like there should be a particular place for those, but I'm not finding it.
… to not use GEOMETRY_SQL

Don't want to be using GEOMETRY_SQL outside of GeographicItem.

I may well be missing something, but in regards to the comment about not wanting to load the entire GeographicItem, my thinking is that the size of the shape dwarfs the rest of a geoitem in general, and the rewrite here fetches that shape as wkb instead of as geojson (which i think would be larger?).

Also I copied the snippet here from the same function in Georeference :)
…methods

I think keeping the naming (st_covers) for the new methods would have been more confusing since there's now only one parameter; hopefully the new names `within(shape_sql)` and `covering(shape_sql)` are easily readable/meaning-guessable.

The lack of `covering_union_of_sql` is due to the 'covering' case requiring that the input shape(s) are not included in the result (only an issue when there's only one shape/all shapes are the same). In the 'within' case input shapes are all included in the result. I'm not sure why the difference (since ST_Covers(A, B) iff ST_Coveredby(B, A)).
@kleintom
Copy link
Contributor Author

kleintom commented Nov 4, 2024

What fields are require to create a gazetteer? I'm trying to create one but I get error 500

Based on the error message I think/hope I fixed the issue, though it wasn't causing an error for me running in docker before or after the upgrade to pg16 (maybe you're running a pg with different config?).

Gazetteer should only require a shape and a (non-empty) name.

@jlpereira
Copy link
Member

Thanks, It works!

@mjy
Copy link
Member

mjy commented Nov 4, 2024

@kleintom The blocker to us getting to Rails 7.2 has landed (rgeo/activerecord-postgis-adapter#405 (comment) ). We're still seeing issues on our test PR: https://github.com/SpeciesFileGroup/taxonworks/actions/runs/11672461015/job/32501118816, but I haven't poked at what might be the problem.

Given that we're likely a little away from landing this branch, I think we should target it landing on the 7.2 PR. Can you please update your branch to reflect this (should largely be pointing to a couple new gems)?

If there are spatial issues, which where the blockers in the adapter getting to 7.2, then your geo work/context might be the best place to discover these.

…removed class

Actual behavior should be unchanged.
So that the new geo code can be tested against the 7.2 activerecord-postgis-adapter

https://github.com/SpeciesFileGroup/taxonworks/pull/4006/files
I'm not sure what's changed in 7.2 to cause this error:
PG::IndeterminateDatatype: ERROR:  could not determine data type of parameter $1

Changing `point.x` to `point.x.to_s` didn't help, changing `?` to `?::text` didn't help.
@kleintom
Copy link
Contributor Author

kleintom commented Nov 7, 2024

Can you please update your branch to reflect this (should largely be pointing to a couple new gems)?

I copied @LocoDelAssembly's code to update to 7.2 and added what seems like a reasonable fix to get specs passing again (though I don't know exactly what changed in 7.2 to require it). It looks like ci won't run automatically anymore though since Gemfile.lock is going to be in conflict until the 7.2 pr lands.

@mjy
Copy link
Member

mjy commented Nov 7, 2024

@LocoDelAssembly if you have time can you look into the Gemfile/CI issue please.

@kleintom This might be the thread re the issue: rgeo/activerecord-postgis-adapter#402 (comment); and the patch: rgeo/activerecord-postgis-adapter#417 ... maybe

The make_valid process on a polygon, e.g., can result in line_strings or points - here we choose to drop any of those lower dimensional shapes created during the make_valid process. I'd rather remove this as an option so that we have consistent behavior across all shape creation (at least for now).

At least that's what I thought would happen, but on EcoRegions EcoID=753 - which is a multipolygon - ST_MakeValid still returns a geometry collection containing a multipolygon and a multilinestring...
…mping

This is in addition to previously documented longitude shifting.
…for GEOS >= 3.10

I think we're probably already at sufficient postgis, which is 3.2.0 (from https://git.osgeo.org/gitea/postgis/postgis/raw/tag/3.2.0/NEWS: 'ST_MakeValid(geometry, options) allows alternative validity building
    algorithms with GEOS 3.10')
@mjy
Copy link
Member

mjy commented Nov 13, 2024

@kleintom We saw that there is a migration to remove type from geographic_items (I think). Is it possible to leave that there until after this PR lands, or does it raise/get in the way? The issue is switching dev branches back and forth requires re-adding it.

…icItem for now

 Its absence is an issue when switching back and forth between dev branches, and there's no reason it needs to be removed right now.
@kleintom
Copy link
Contributor Author

Is it possible to leave that there until after this PR lands

No problem, I just needed to make type nullable since we don't set it anymore.

@kleintom
Copy link
Contributor Author

I was going to refactor code I added to normalize longitudes of new Points - to get this GI validation to pass:

def check_point_limits
unless point.nil?
errors.add(:point_limit, 'Longitude exceeds limits: 180.0 to -180.0.') if point.x > 180.0 || point.x < -180.0
end
end
- when I realized that the same issue arises with Multipoints and Points/Multipoints contained in GeometryCollections: rgeo normalizes longitudes for any shapes that have edges, but not for those mentioned above. Is point-longitude normalization something we want to pursue for Multipoints and Points/Multipoints in GeometryCollections? And/or do we want to keep it for Points? (Is the normalization mainly for UI display purposes?)

@mjy
Copy link
Member

mjy commented Nov 13, 2024

Reading this I had a momement of panic thinking maybe we had swtiched GeographicItem to Geometry type :( !, which would have been bad.

Multipoints and Points/Multipoints contained in GeometryCollections: rgeo normalizes longitudes for any shapes that have edges, but not for those mentioned above. Is point-longitude normalization something we want to pursue for Multipoints and Points/Multipoints in GeometryCollections?

I don't forsee needing either Multipoints or Collections thereof. Let's let users request it before we move on that front (I can't wrap my mind on what it would mean to assign such, but maybe, like in a Lot of fish lumped into a single barrel).

And/or do we want to keep it for Points? (Is the normalization mainly for UI display purposes?)

I don't recall why. It might be related to downstream checks at GBIF requiring such. Mapping might also have been an issue. If we can normalize on render (e.g. to map, or DwC, or...) that's a viable solution too, as long as we persist meaning unambiguosly.

If the type column exists, rails assumes its value is used for STI and then it wants all of the STI types that go along with, e.g., still-existing GA GIs. `inheritance_column` is the rails mechanism for changing that behavior.
…select_one

Simpler cleaner more explicit calls
…sing shape

Not knowing what the caller might do with the circle (or if they'll be aware of the issue), I feel like this is prudent (though in the only current case where circle is used it means make_valid_non_anti_meridian_crossing_shape will be called twice).
…crosses_anti_meridian?

The xspecify of mysterious (to me) sometimes failures of crosses_anti_meridian? when one of the vertices has longitude (exactly, and only) 0 is disturbing, and I don't know exactly when it happens.

If it turns out to cause problems, one option would be to implement crosses_anti_meridian? with the following pseudo-code (written here for polygons):

def crosses_anti_meridian?(poly: p)
  prev = nil
  for each |vertex of exterior ring of p: v|
    if prev and line(prev, v) intersects anti-meridian
      # work around a false positive:
      if v.lon == 0 || prev.lon == 0
        continue
      end
      return true
    end
    prev = p
  end
  return false
end
@kleintom
Copy link
Contributor Author

kleintom commented Nov 18, 2024

I'll have more to say, probably tomorrow, on the last new commit, but I wanted to get the first new commit out today since it fixes my previous attempt to restore GI type, which I think actually causes an exception on any read of an existing GA that references a GI with a non-null type (specs all passed since no spec references such a GA (apparently)).

@kleintom
Copy link
Contributor Author

I finally got some notes written up (and figured out, I think) for the Ecoregions2017 shapefile - I put them here for now: https://gist.github.com/kleintom/c4c851de52380b0d08b6950e198aa8ba
People just wanting a version of that shapefile that imports cleanly into TaxonWorks can just check the first couple paragraphs and skip everything else, which is more technical regarding the issues I encountered when importing the original shapefile (not all having to do with problems with the shapefile itself). There's a summary at the end so the longer account can be skipped. If there are any issues I missed please let me know.

I don't see a way around the first two RGeo-related issues (given our factory), that said I think those issues are both going to be quite rare, involving shapes crossing +/- 85.051127 latitude (two shapes involving Antarctica in the Ecoregions case) and lines intended to take the long way around the globe.

The third issue regarding what happens when imported shapes are made_valid has more wiggle room I think, maybe more reporting/warnings would be helpful there.

Lastly I'll point out the following postgis oddity (x)spec'ed in my last commit above:

select ST_Intersects(ST_GeographyFromText('LINESTRING (91 10, 0 0)'), ST_GeographyFromText('LINESTRING (180 89.0, 180 -89.0)'));

select ST_Intersects(ST_GeographyFromText('LINESTRING (91 0, 0 10)'), ST_GeographyFromText('LINESTRING (180 89.0, 180 -89.0)'));

The first returns false, as expected (for me at least): the line LINESTRING (91 10, 0 0) does not cross the anti-meridian, the second returns true: the line LINESTRING (91 0, 0 10) does cross the anti-meridian. The issue only occurs when one of the vertices is at exactly longitude 0, and the second vertex seems to have to be longitude-span > 90 away from 0, other than that I'm not seeing the pattern (feel free to shout if you know what's going on :-). Again, I think this is unlikely to be run into, but if it does become an issue I think the workaround outlined in the last commit message might work and not be too bad (or maybe it can get fixed in postgis if it is a bug).

I'm ready for more shapefile issues is anybody runs into some :-)

@mjy
Copy link
Member

mjy commented Nov 19, 2024

@kleintom Thanks very much. I just got my hands on some shapefiles and hope to hit the interfaces a little today. We also will collectively explore the UI during tomorrows help sessions, so a first wave of feedback should be coming shortly.

@LocoDelAssembly
Copy link
Contributor

@kleintom can you please add import_gazetteers queue in exe/delayed_job and any other new queue this PR is adding?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants