-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
not a code bug, but a data issue #223
Comments
thanks for the report @tphilippi I also see the problem. Seems like for pairs where gbif and idibio disagree, gbif and bison data agree for the same point, which is nice: filter(dat2, key == "f9784d28-6462-4175-bb2b-9a584a0aa1bf" | key == "1145396883") %>% data.frame
#> name longitude latitude prov date key
#> 1 Rana cascadae Slater, 1939 -121.5609 40.35793 gbif 1924-06-20 1145396883
#> 2 Rana cascadae -121.5609 40.35793 bison <NA> 1145396883
#> 3 rana cascadae -121.5601 40.35813 idigbio 1924-06-20 f9784d28-6462-4175-bb2b-9a584a0aa1bf
filter(dat2, key == "523cbddb-1f32-41e0-a5aa-94a25d004029" | key == "1145665190") %>% data.frame
#> name longitude latitude prov date key
#> 1 Rana cascadae Slater, 1939 -121.5770 40.35226 gbif 1924-05-17 1145665190
#> 2 Rana cascadae -121.5770 40.35226 bison <NA> 1145665190
#> 3 rana cascadae -121.5762 40.35247 idigbio 1924-05-17 523cbddb-1f32-41e0-a5aa-94a25d004029 If you look at the issues for GBIF occurrence 1145396883 - http://api.gbif.org/v1/occurrence/1145396883 - it has COORDINATE_ROUNDED and COORDINATE_REPROJECTED. Same for 1145665190: http://api.gbif.org/v1/occurrence/1145665190/ I imagine its that COORDINATE_REPROJECTED that makes it different from idigbio? For both of those verbatim data has NAD1927 while non-verbatim data has WGS84 |
Right: the issue is that the idigbio results come with verbatim latitude longitude, which may or may not be WGS84 epsg:4326. And, as far as I can tell, the iDigBio API doesn't vend a variable for datum or CRS, even though their web page for an occurrence has that information. Rather, there is a cryptic flag (in the list of flags for each record) that appears to indicate "not WGS84 epsg:4326" but not what the correct datum/geoid/CRS is. I need to dig into their API documentation again. I tracked the issue via the separate web interfaces: Looking at out$idigbio in the example code above, out$idigbio$data$flags does note "geopoint_datum_error" for most of these records. I'm going to be parsing the flags from idigbio records delivered as Darwin Core and probably via ridigbio::idig_search_records(). I'll look at that part of the spocc code to see if it will be easy to pull out. I already parse those list-valued variables in the iDigBio DwC-A multimedia.csv files. Perhaps the man page or vignette should not that users need to check for CRS issues if they need accuracy within a few hundred meters? As for USGS Bison: Bison is almost certainly harvesting that record from gbif, as that's what it does and the Bison folks are the US node of gbif. I don't trust any location info in Bison, as it isn't easy to see where it is coming from. If you look at the map of bison records, you'll also see one occurrence out there at 40.486264 -121.400602. That's the centroid of Lassen Volcanic National Park. Bison can't harvest our protected individual occurrence records behind NPSpecies, so it harvests the park-level species lists and georeferences them to the park centroids. Once our NPS IT folks get authentication working for our protected data, and we get the full tables of what is considered protected by each park populated, I'm attempting to push NPS occurrence records out to gbif, either directly or via USGS/Bison & Stinger. |
|
interesting about BISON, didn't know that. Anything we should add to the docs about BISON? |
Issue #38 on ridigbio has a couple of workarounds. Passing a vector of field names via the fields = parameter rather than using the default fields = "all" can get the geodeticDatum value. The field name is "data.dwc:geodeticDatum" which isn't obvious from the documentation. I only used spocc to test my gbif vs idigbio offset, so don't have time to code & test a solution, but it may be simple enough to change how spocc calls ridigbio::idig_get_records() to something like: fieldlist <- c('uuid',"canonicalname", "basisofrecord", "catalognumber",
"collectioncode", "fieldnumber", "occurrenceid", "recordset",
"kingdom", "order", "class", "family", "genus",
"specificepithet", "infraspecificepithet",
"scientificname", "taxonrank", "taxonid", "taxonomicstatus", "typestatus",
"collectionid", "collectionname", "collector",
"institutioncode", "institutionid", "institutionname",
"datemodified",
"datecollected" , "geopoint", "coordinateuncertainty",
"locality", "verbatimlocality", "data.dwc:geodeticDatum")
test <- idig_search_records(rq = list(scientificname = "Rana cascadae",
geopoint = list(type="geo_bounding_box",
top_left=list(lat=40.85288, lon=-121.9347),
bottom_right=list(lat=-40.14134, lon= -120.8895)
)
),
fields = fieldlist) That fieldlist is the fields I get back from fields = "all" plus the datum. Note that fields with all missing values are dropped, so fields = "all" might have other fields I haven't seen yet in my use, and I think that is on their API server side, not in the R code. As for any explicit warnings about BISON, I don't think it's needed. eBird is based on picklists of named birding sites, so ~80% of the observations in Cabrillo National Monument are the same spot in bay 2 of the parking lot: the park centroid. The rOpenSci CoordinateCleaner package might flag such BISON records, scrubr tried to do it for political polygons. My de-duplicaton and data QC flagging code tries to do "reverse georeferencing", but that is hard because different polygons datasets give different values for centroids, and for National Parks, the boundaries slightly change over time. |
Hello from the iDigBio (and new maintainer of ridigbio) side of things. I'd like to look things over more before I comment on specifics, but I wanted to chime in and say that I'm interested in working together to see how we can make things better. |
we hardcode the in spocc: fieldlist <- c('uuid',"canonicalname", "basisofrecord", "catalognumber",
"collectioncode", "fieldnumber", "occurrenceid", "recordset",
"kingdom", "order", "class", "family", "genus",
"specificepithet", "infraspecificepithet",
"scientificname", "taxonrank", "taxonid", "taxonomicstatus", "typestatus",
"collectionid", "collectionname", "collector",
"institutioncode", "institutionid", "institutionname",
"datemodified", "datecollected" , "geopoint", "coordinateuncertainty",
"locality", "verbatimlocality", "data.dwc:geodeticDatum")
x <- occ(query = "Rana cascadae", from = "idigbio", limit = 300,
geometry = c(-121.9347, -40.14134, -120.8895, 40.85288),
idigbioopts = list(fields = fieldlist)) Which gives 4 unique values, so pretty easy to handle these unique(x$idigbio$data$Rana_cascadae$`data.dwc:geodeticDatum`)
#> [1] "NAD27" "North American Datum 1927" NA "World Geodetic System 1984" "WGS84" What would we do with these values once we have them in spocc? Or do you leave that up to the user? |
@sckott As I explore further my goal has gone from using data.dwc:geodeticDatum as a check on lon lat geoPoint coordinates to a flag to simply informing the user; now I'm not sure it will be informative. Currently, lon lat values are given for the geopoint without any CRS or Datum in the data or the metadata. That's very bad practice. Most but not geoPoints from idigbio are WGS84 (which gbif and other sources use). As far as I can tell, this is an issue in iDigBio (due to their georeferencing starting 20 years ago so they were bleeding edge), not an issue caused by spocc, but spocc returning a bunch of records in a silent mixture of CRS will lead to problems and bad science at some point. I'll try to follow up with @roncanepa of iDigBio. There's some slight probability that I happened to trip over a couple of dozen errors in coordinate processing out of the 43M records, and all are meant to be in WGS84, and the web page for each individual record shows the original, raw values, not the same as provided by the API. For my Rana cascadae example, based on wetlands & streams in imagery as well as where gbif puts the same occurrence, some of the geopoints with dwc:geodeticDatum NAD27 or North American Datum 1927 are really in NAD27 epsg:4267, while some are actually in WGS84. dwc:geodeticDatum is for the raw, original lon lat values (e.g., from the 2002 georeferencing of a 1924 specimen), and at least some coordinate values in the API data have been transformed to WGS84. Therefore, data.dwc:geodeticDatum isn't a silver bullet, or even flag for which values are suspect, or even the Datum for the lon lat values returned. Another part of the information is the flags that come back as a list x$idigbio$data$Rana_cascadae$flags. They're also in the DwC-A occurrence.csv from the iDigBio portal as field idigbio:flags. I'm looking at several values: "geopoint_datum_missing", "geopoint_datum_error", "rev_geocode_failure". I have code that parses those out to a set of 70 logical values. Would you like that as a stand-alone function (taking the data object x$idigbio$data$Rana_cascadae and returning a data.frame with uuid and the 70 logical flags, or just those with at least 1 TRUE in that dataset)? It turns out that my R. cascadae use case is a bad example. As an endangered species, the coordinate uncertainties are very inflated (conservative) for the old records georeferenced from named places like lakes or ponds, and new iNaturalist records have locations fuzzed and a coordinate uncertainty of 27km in gbif:
Most of those lon lat coordinates I've checked land in the correct pond, so within 100-250m, hence my description of uncertainties 1km and greater as "inflated". This particular use case is fun: R. cascadae appears to be extirpated from Lassen NP; the park needs to identify locations to reintroduce captive bred frogs. Historic records can identify what was good wetland habitat, both explicit locations and SDM to identify similar un-collected wetlands. Then an eDNA survey will check for relic populations, and maybe for predatory leeches & fungal diseases, identifying good places to reintroduce. My lesson learned is that lon lat coordinates in these extant data sources aren't suitable for that use in identifying wetlands, but I can provide the verbatim location descriptions to the park. [The more recent rangewide USGS surveys (not in gbif) only provide location descriptions in the public data, not coordinates, but the park itself can obtain the lat lon from USGS.] As I explore these data for other species and other management issues, I'll get a better sense of whether this issue is specific to sensitive species, or if most of the data are only usable at scales of 1-2km or broader. |
I'd definitely be in support of exposing data quality issues in outputs so users can get to them more easily. I guess though that it'd be good to have those data in a common as possible format across data sources (for those data sources that have data quality flags). those that have them that I know of: gbif, idigbio (i'm sure theres others, just don't know off the top). In terms of how that's done, i'm not sure. I guess it'd be ideal if the data flags were in a column or more than one that will fit into a data.frame workflow to make it easy to filter data downstream similar to what we do in occ_issues Thanks for the details on R. cascadae, very interesting stuff. |
Whether using spocc or rgbif & ridigbio separately, the same occurrence records show up ~71m shifted in northern California.
Reproducible example:
If you zoom in you will find pairs of occurrences with the same date and the idigbio location ~71m ENE of the gbif or other provider version. Note that both gbif and iDigBio assert that data from their APIs come as epsg:4326.
I'm putting this issue here to make others aware. I will continue to explore the discrepancy: is it the same in Florida and Alaska? can I find very narrow point locations so I can verify that the idigbio coordinates are the problem (as I suspect)? I've already reproduced this with rgbif and ridigbio, and can show that there is not a difference based on the museum behind the records in iDigBio. I can try to get the verbatim location info from the data sources to further compare.
Session Info
The text was updated successfully, but these errors were encountered: