
get_mapunit_from_SDA() fails for some mukeys #305

Closed
kevinwolz opened this issue Sep 30, 2023 · 8 comments
@kevinwolz

kevinwolz commented Sep 30, 2023

I have a script that pulls mapunit data for every county in CONUS. There are just 3 counties in all of CONUS that fail. Here is one of them:
Bannock_ID.kml.zip

aoi <- sf::st_read("Bannock_ID.kml") # Bannock County, ID

APPROACH 1 (fails)

mukeys1 <- soilDB::mukey.wcs(aoi = aoi, db = "gssurgo", res = 30, quiet = TRUE) %>%
  unique() %>%
  .$mukey

where1 <- mukeys1 %>% 
  paste0("'", ., "'") %>% 
  toString() %>% 
  paste0("mukey IN (", ., ")")

mapunit.data1 <- soilDB::get_mapunit_from_SDA(WHERE = where1)
empty result set
Error in UseMethod("droplevels") : 
  no applicable method for 'droplevels' applied to an object of class "list"

APPROACH 2 (works)

mukeys2 <- soilDB::SDA_spatialQuery(aoi, what = "mukey", db = "SSURGO") %>% 
  dplyr::pull(mukey)

where2 <- mukeys2 %>% 
  paste0("'", ., "'") %>% 
  toString() %>% 
  paste0("mukey IN (", ., ")")

mapunit.data2 <- soilDB::get_mapunit_from_SDA(WHERE = where2)
single result set, returning a data.frame

Why does Approach 1 fail, when Approach 2 works?

length(mukeys1)
# 1110
length(mukeys2)
# 474
all(mukeys2 %in% mukeys1)
# TRUE
length(unique(mukeys1))
# 1110

Approach 1 seems to add many extra mukeys. None are duplicates. Something about those extra mukeys must be causing the problem.

When I run the same approaches above for any other county (except the 2 others that act the same: Brown County, NE and Caribou County, ID), length(mukeys1) is always greater than length(mukeys2), but Approach 1 does not produce an error. For example, here is Champaign, IL:
Champaign_IL.kml.zip

Why does Approach 1 produce so many extra mukeys compared to Approach 2 for all regions?

@brownag
Member

brownag commented Sep 30, 2023

Hi @kevinwolz thanks for reporting this issue. Or, rather, I see two different issues.

Approach 1 (the one that fails) breaks because scientific notation formatting is applied to a small subset of the mukeys (only one that I can see, actually). Specifically, the problem mukey in this case is 600000, which gets abbreviated to "6e+05". This approach works for other counties/survey areas because they lack mukeys that can be compactly represented in scientific notation.

To avoid this problem, use explicit conversion to integer with as.integer():

where1 <- as.integer(mukeys1) %>% 
  paste0("'", ., "'") %>% 
  toString() %>% 
  paste0("mukey IN (", ., ")")

We have tried to handle this type of input sanitization in places that take mukey as an argument directly, but it can't be fixed in general for custom WHERE clauses. Just be aware that some mukeys will be abbreviated in scientific notation; you can set e.g. options(scipen = 99) to discourage scientific notation altogether.
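For illustration, here is a minimal base-R sketch of the pitfall (the mukey value is the one from this thread; whether the double is abbreviated depends on your R version and options):

```r
# mukey.wcs() extraction can yield doubles; paste() coerces them with
# as.character(), which may abbreviate round numbers like 600000.
mukey <- 600000
paste0("'", mukey, "'")               # may yield "'6e+05'" for a double

# Explicit integer conversion guarantees fixed notation:
paste0("'", as.integer(mukey), "'")   # "'600000'"
```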


As for why Approach 2 returns only a fraction of the mukeys that Approach 1 does: the two approaches query different extents. The mukey.wcs() call returns all mukeys in a rectangular bounding box around the AOI, with no masking to the area of interest.

In contrast, the SDA_spatialQuery() approach returns only the mukeys that occur within the bounds of the AOI. You can compare them visually by projecting aoi to "EPSG:5070" (the CRS of the mukey.wcs() result) and plotting both: many more mukeys appear in the whole raster than fall within the polygon.

aoi <- sf::st_read("/vsizip/Bannock_ID.kml.zip") # Bannock County, ID
x <- soilDB::mukey.wcs(aoi = aoi, db = "gssurgo", res = 30, quiet = TRUE)
plot(x)
plot(terra::project(terra::vect(aoi), x), add=TRUE)

[image: gSSURGO mukey grid for the bounding box, with the Bannock County polygon overlaid]

@dylanbeaudette
Member

@kevinwolz, if you are interested in spatial data for all of CONUS, I'd highly recommend using either / both of

  • gSSURGO / gNATSGO mukey grids
  • gSSURGO / gNATSGO source data

The mukey grids are kind of like a gSSURGO/gNATSGO-lite, as most of the tabular data can be accessed via SDA. Note that raster soil survey (RSS) data are not yet in SDA.

I'm glad that you are interested in using the mukey WCS. More information on that here:

@kevinwolz
Author

Thank you @brownag for explaining both cases! Very helpful. It would have taken me forever to figure out the scientific notation piece. I've added integer conversion to my script. And for the second part, I forgot that SDA_spatialQuery() doesn't also use the bounding-box approach. Got it!

@kevinwolz
Author

@dylanbeaudette thank you for your suggestions. I think I'm doing what you're suggesting, but maybe I'm missing something. Here's my general process:

  • Get county AOI using tigris
  • Download gSSURGO mukey grid for the AOI using soilDB::mukey.wcs()
  • Download various tabular data using soilDB::SDA_query()
  • Run my analysis
  • Write county TIF to disk
  • Merge all county TIFs together

Are you suggesting that I start with the full CONUS mukey grid that you linked to? I wish I could do that, but, at the 30m resolution that I need, I think it just takes too much memory to work with the full CONUS grid. My analysis step is complex, memory intensive, and requires ultimately saving data in EPSG:3857. So far it seems easiest for me to do the analysis at the county level (which I can then also parallelize nicely), and then merge all the counties together at the end.

Does this seem logical? Am I missing something?

@kevinwolz
Author

@dylanbeaudette one thing that I think is not good with my method is that there will be large gaps in the CONUS data where SSURGO does not exist and only STATSGO is available. The SoilWeb Soil Properties map does not have these gaps, but perhaps it is not using SSURGO data, since the resolution there is much lower than 30 m.

How would you recommend dealing with this in a way that balances the 30 m requirement and the processing intensity of the analysis against needing gap-free CONUS coverage? Perhaps I need to check whether soilDB::mukey.wcs(db = "gssurgo") has returned data for the full AOI and, if not, call soilDB::mukey.wcs() again with a different db? Then I would somehow merge these together using terra, giving priority to the gSSURGO data?
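Something like this toy sketch of the merge step, assuming terra::cover() (which fills NA cells of the first raster with values from the second) behaves as I expect; r1 and r2 are toy stand-ins for the gSSURGO grid and an already-aligned fallback grid, not real mukey data:

```r
library(terra)
# r1 stands in for the gSSURGO mukey grid (with NA gaps),
# r2 for a fallback grid resampled to the same geometry.
r1 <- rast(nrows = 2, ncols = 2, vals = c(100, NA, 300, NA))
r2 <- rast(nrows = 2, ncols = 2, vals = c(1, 2, 3, 4))
merged <- cover(r1, r2)  # keep r1 where defined, fill its NAs from r2
values(merged)           # 100, 2, 300, 4
```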

@dylanbeaudette
Member

@kevinwolz I'll reply in detail later today. Quick answer:

  • use gNATSGO 30m grid, it is about 2.5GB
  • iterate over pieces sourcing summaries from SDA or local copies of the associated tabular data (these exist as CSV files)
  • mosaic when done

I have examples and further discussion to share later on today. Good luck!

@dylanbeaudette
Member

Hi, @kevinwolz here are some follow-up questions and suggestions. First off, are you willing to use STATSGO data in place of missing SSURGO data? If so, then the gNATSGO mukey grid may solve some of your problems. The presence of RSS mukey values (and the lack of tabular RSS data on SDA) creates a couple of new problems.

You could "roll your own" SSURGO/STATSGO hybrid using map unit key grids. The STATSGO2 mukey grid is available via mukey.wcs(), but the source data aren't official in any way. You can easily create a STATSGO2 mukey grid on your own, though. Thankfully, the RSS tabular data will be on SDA "soon".

The SoilWeb Soil Properties maps are gap-free because "holes" in SSURGO (coarsened to an 800 m grid) are back-filled with STATSGO.

What is the final product that you have in mind? While this may not solve all of your problems, take a look at this experimental tiled approach to making CONUS maps. I still need to get in there and optimize a couple steps but it is pretty efficient.

https://github.com/ncss-tech/tiledConusProcessing

@brownag
Member

brownag commented Oct 3, 2023

I think the main questions in this issue (the failure with abbreviated mukeys, and the difference in the number of unique mukeys between methods) have been answered.

We are delving into different, important topics. I see two main discussions that could continue here:

  1. how to efficiently process spatially extensive soil survey data (i.e. all of CONUS)
  2. how to deal with "gaps" in the SSURGO dataset

For item 1, larger/local tabular data are reasonably well covered by soilDB methods, e.g. by specifying the dsn argument and pointing at "snapshot" databases stored locally rather than at web services. Corresponding methods for wrangling the spatial data lack dedicated functions in the package, and functions like mukey.wcs() are limited to relatively small grid sizes.
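A rough sketch of that dsn pattern (file names and paths are hypothetical; this assumes a soilDB version in which createSSURGO() and the dsn argument of get_mapunit_from_SDA() are available, so check your version):

```r
# Build a local SSURGO snapshot database from Web Soil Survey downloads
# (the paths here are placeholders), then point tabular helpers at it
# via `dsn` instead of the SDA web service.
soilDB::createSSURGO("ssurgo.sqlite", exdir = "path/to/WSS/downloads")
mu <- soilDB::get_mapunit_from_SDA(WHERE = "areasymbol = 'ID077'",
                                   dsn = "ssurgo.sqlite")
```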

For item 2, there are official methods for how SSURGO is backfilled with STATSGO2 and/or Raster Soil Survey to create continuous coverage (gNATSGO), but these backfilled areas aren't necessarily suitable replacements for detailed SSURGO. Currently, STATSGO data can be summarized like SSURGO, but they are often less complete, lack interpretations, etc.

Raster Soil Survey mukeys are usually quite detailed, but not yet available through any web service. Stitching together a queryable SQLite database that works for RSS mukeys can be done, but currently createSSURGO() (for example) operates only on the standard vector and tabular data associated with Web Soil Survey downloads.

In some cases there would be value added to having soilDB functions process local spatial data, but it may be that certain types of processes, especially for large spatial data workflows, will be implemented using machinery outside the package.

@brownag brownag closed this as completed Oct 28, 2023