
Ensure that new rock fragment sorting code gives similar results as old code #1

Closed
dylanbeaudette opened this issue Apr 29, 2016 · 11 comments

dylanbeaudette commented Apr 29, 2016

The old code (SQL) may have missed some data that spanned multiple size classes.

Test:

library(soilDB)

# CRAN soilDB -- SQL-based RF sorting (previously saved; loads x.old)
load('old-RF-code.rda')
# GH soilDB -- R based RF sorting
x.new <- fetchNASIS()

# sanity check: rows in same order?
# YES
all(x.old$phiid == x.new$phiid)


# compare proportion of differences old vs. new
prop.table(table(x.old$fine_gravel == x.new$fine_gravel))
prop.table(table(x.old$gravel == x.new$gravel))
prop.table(table(x.old$cobbles == x.new$cobbles))
prop.table(table(x.old$stones == x.new$stones))
prop.table(table(x.old$boulders == x.new$boulders))
prop.table(table(x.old$parafine_gravel == x.new$parafine_gravel))
prop.table(table(x.old$paragravel == x.new$paragravel))
prop.table(table(x.old$paracobbles == x.new$paracobbles))
prop.table(table(x.old$channers == x.new$channers))
prop.table(table(x.old$flagstones == x.new$flagstones))
prop.table(table(x.old$parachanners == x.new$parachanners))
prop.table(table(x.old$paraflagstones == x.new$paraflagstones))
prop.table(table(x.old$total_frags_pct == x.new$total_frags_pct))

# ... all are <= 1% differences

# compare magnitude of differences
sqrt(mean((x.old$fine_gravel - x.new$fine_gravel)^2, na.rm=TRUE)) # 2%
sqrt(mean((x.old$gravel - x.new$gravel)^2, na.rm=TRUE)) # 5%
sqrt(mean((x.old$cobbles - x.new$cobbles)^2, na.rm=TRUE)) # < 1%
sqrt(mean((x.old$total_frags_pct - x.new$total_frags_pct)^2, na.rm=TRUE)) # 0


# investigate differences in gravel
summary(x.old$gravel - x.new$gravel)

# ... new gravel volume is always smaller than old gravel

# investigate trouble horizons / pedons
idx <- which((x.old$gravel - x.new$gravel) > 5)
h.old <- horizons(x.old)[idx, c('peiid', 'phiid', 'hzname', 'fine_gravel', 'gravel', 'cobbles')]
h.new <- horizons(x.new)[idx, c('fine_gravel', 'gravel', 'cobbles', 'total_frags_pct')]

cbind(h.old, h.new)

p <- unique(h.old$peiid)
x.old$pedon_id[which(profile_id(x.old) %in% p)]
@dylanbeaudette (Member Author)

Fixed: results are almost identical (all size fractions: < 1-2% difference), apart from bugs in the original code related to odd fragment populations spanning multiple size classes. The new values are more accurate than the old values.

@smroecker (Member)

I've implemented a new SQL version that I think is preferable to the new R version. The most compelling reason is that it should be more efficient to implement with SDA and NASIS Web Reports: SDA has a limit of 100k (?) records, and NASIS Web Reports are already quite slow. My SQL code uses temporary tables, which eliminate some of the redundancy in the original SQL code. Currently I've added the results to get_extended_data_from_NASIS_db(). See the test example below, from usiteiid LIKE '%CA071%' and areasymbol LIKE 'IN%'.

# install the development version of soilDB from GitHub
remotes::install_github("ncss-tech/soilDB", dependencies=FALSE, upgrade=FALSE, build=FALSE)

library(soilDB)
library(daff)

# frag_summary: current R-based summary; frag_summary_v2: new SQL-based summary
ed <- get_extended_data_from_NASIS_db()

# align rows by phiid before comparing
ed$frag_summary <- ed$frag_summary[order(ed$frag_summary$phiid), ]
ed$frag_summary_v2 <- ed$frag_summary_v2[order(ed$frag_summary_v2$phiid), ]

# cell-by-cell diff of the two summaries
test <- diff_data(ed$frag_summary, ed$frag_summary_v2)
render_diff(test)

[screenshot: render_diff() output comparing frag_summary and frag_summary_v2]

  • It appears the majority of the differences stem from the current version misclassifying (?) different size classes. The current version assumes fragsize_h is populated and classifies on that first (then fragsize_r if the H is missing); however, there are many records where the L and H values span more than an individual class. There also appear to be a handful of other records where ONLY the fragsize_l is populated, so those records are currently being classified as 'unspecified'. My SQL version is slightly more robust: it checks fragsize_r > (fragsize_h + fragsize_l) / 2 > fragsize_h > fragsize_l, and the first TRUE statement is used (sketched in R after this list). There also appear to be several misclassified records, where it appears the thresholds used by the simplifyFragmentData() function are wrong (e.g. < vs. <=).
  • There are also a handful of cases in the current version where pararock fragments are getting miscounted in the 'total_frags_pct_nopf' column.
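
To make that precedence explicit, here is a minimal R sketch of the rule described in the first bullet (pickFragsize() is a hypothetical helper, not part of soilDB):

# hypothetical helper illustrating the SQL version's precedence:
# RV, then (L + H) / 2, then H, then L -- the first populated value wins
pickFragsize <- function(l, r, h) {
  if (!is.na(r)) return(r)
  if (!is.na(l) && !is.na(h)) return((l + h) / 2)
  if (!is.na(h)) return(h)
  if (!is.na(l)) return(l)
  NA_real_
}

# example: only L and H populated, so the midpoint is used
pickFragsize(l = 2, r = NA, h = 76)  # 39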

If others could test with their data, I think we could safely update to this new version.

Stephen

@smroecker reopened this Jan 30, 2019

dylanbeaudette commented Jan 30, 2019

Thanks for writing an SQL version of this code. It will be really helpful when applying these concepts to SDA and similar remote databases. I like the idea of using SQL when appropriate but I have a hard time transitioning without further discussion of each point.

First of all, I have edited soilDB:::.sieve() to use 75mm (previously 76mm) as the cutoff for gravel. I wasn't aware of the change.

Here is the difference breakdown among ~ 4k pedons, after the latest changes to both R and SQL versions:
[screenshot: table of differences between the R-based and SQL-based summaries for ~ 4k pedons]

I suspect that most of the differences have to do with things we all need to decide on:

  • the importance of user-defined "sieves"; this is something planned for the R version
  • algorithms used by an R function should be coded in R so that they can be tested
  • assumptions about precedence of RV vs. H
  • interpretation of L when RV and H are missing
  • assumptions made in the absence of flat / hardness
  • are the fragment size classes inclusive on the high end (i.e. <= 75mm)?

Here are my suggestions (sketched in R after this list):

  • fragsize H should take precedence over RV
  • therefore, midpoint (H+L)/2 is not appropriate
  • RV should be used when H is missing
  • L should not be interpreted (how common is this?)
  • fragment size classes are inclusive on the high end
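
As a concrete illustration of these suggestions (not the current soilDB implementation), a hypothetical classifier might take H first, fall back to the RV, ignore L, and treat class limits as inclusive on the high end. Nonflat / hard class limits are assumed from NSSH part 618:

# hypothetical sketch: H takes precedence, RV used when H is missing, L ignored
classifyFrag <- function(l, r, h) {
  d <- if (!is.na(h)) h else r
  if (is.na(d)) return('unspecified')
  # class limits inclusive on the high end (e.g. a 75mm fragment is still GR)
  if (d <= 5) 'fine_gravel'
  else if (d <= 75) 'gravel'
  else if (d <= 250) 'cobbles'
  else if (d <= 600) 'stones'
  else 'boulders'
}

classifyFrag(l = 2, r = 20, h = 75)  # "gravel"
classifyFrag(l = 2, r = 20, h = NA)  # "gravel" (RV used when H is missing)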

I'll post some specific examples Wednesday mid-morning.

@dylanbeaudette (Member Author)

Additions

There are 175 (of 23,031 total) rows in the SQL-based summary that are not present in the R-based summary. It looks like these may be an artifact of NA fragment volume? It is hard to say, because all of the phiid values are NULL and all fragment classes are 0.
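
For reference, one way to isolate those extra rows (assuming the ed object from the get_extended_data_from_NASIS_db() example above):

# rows present in the SQL-based summary but absent from the R-based summary
extra <- ed$frag_summary_v2[!ed$frag_summary_v2$phiid %in% ed$frag_summary$phiid, ]
nrow(extra)

# check whether these are all NULL phiid with 0 fragment volume
summary(extra)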

Modifications

Use of 76mm as fragsize_h to represent GR
According to the current fragment size class definitions, this is now a data population error.

| phiid | fragvol | fragsize_l | fragsize_r | fragsize_h | fragshp | fraghard |
|---|---|---|---|---|---|---|
| 1196586 | 36 | 2 | 20 | 76 | nonflat | strongly |
| 1196586 | 11 | 76 | 125 | 250 | nonflat | strongly |
| 1196586 | 10 | 250 | 350 | 600 | nonflat | strongly |
| 1196586 | 2 | 2 | 20 | 76 | nonflat | indurated |
| 1196586 | 5 | 2 | 20 | 76 | nonflat | moderately |

The R-based summary pushes these fragments into CB (not ideal).

| phiid | fine_gravel | gravel | cobbles | stones | boulders | channers | flagstones | parafine_gravel | paragravel | paracobbles | parastones | paraboulders | parachanners | paraflagstones | unspecified | total_frags_pct_nopf | total_frags_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1196586 | 0 | 0 | 49 | 10 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 59 | 64 |

The SQL-based summary uses the RV and keeps them as GR.

| phiid | fine_gravel | gravel | cobbles | stones | boulders | channers | flagstones | parafine_gravel | paragravel | paracobbles | parastones | paraboulders | parachanners | paraflagstones | unspecified | total_frags_pct_nopf | total_frags_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1196586 | 0 | 38 | 11 | 10 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 59 | 64 |

In my example (~ 4k pedons) 30% of what should be GR use 76mm as fragsize_h. Only 3% of those with the 76mm fragsize_h have an RV that could be used to infer the correct size class.

This seems like a data population problem that should be fixed vs. worked-around.
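
A rough QC query along those lines might look like this (a sketch only; rf is assumed to be a data.frame of phfrags records with the columns shown above):

# flag records using the old 76mm convention for the GR upper limit
idx <- which(rf$fragsize_h == 76)
rf[idx, c('phiid', 'fragvol', 'fragsize_l', 'fragsize_r', 'fragsize_h')]

# what fraction of those have an RV that could disambiguate the size class?
prop.table(table(has.rv = !is.na(rf$fragsize_r[idx])))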

Fragment size spans more than 1 class
The second record is classified as a ST in the R-based summary, vs CB in the SQL-based summary. Data population problem?

| phiid | fragvol | fragsize_l | fragsize_r | fragsize_h | fragshp | fraghard |
|---|---|---|---|---|---|---|
| 717529 | 45 | 2 | 50 | 76 | NA | indurated |
| 717529 | 4 | 76 | 200 | 300 | NA | indurated |
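
A quick way to find records like this (again using the hypothetical rf data.frame of phfrags records; class breaks for nonflat fragments assumed from NSSH part 618):

# class breaks (mm): gravel / cobbles / stones / boulders
breaks <- c(2, 75, 250, 600, Inf)

# flag records whose L-H range crosses a class break
spans <- !is.na(rf$fragsize_l) & !is.na(rf$fragsize_h) &
  findInterval(rf$fragsize_l, breaks) != findInterval(rf$fragsize_h, breaks)

rf[which(spans), ]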

Interpretation of fragsize_l and assumptions about missing fragshp
Missing fragsize_r and fragsize_h. Missing fragshp, and a size class that suggests a channer?

| phiid | fragvol | fragsize_l | fragsize_r | fragsize_h | fragshp | fraghard |
|---|---|---|---|---|---|---|
| 564492 | 2 | 150 | NA | NA | NA | NA |
| 564492 | 5 | 2 | NA | 150 | NA | NA |

R-based:

| phiid | fine_gravel | gravel | cobbles | stones | boulders | channers | flagstones | parafine_gravel | paragravel | paracobbles | parastones | paraboulders | parachanners | paraflagstones | unspecified | total_frags_pct_nopf | total_frags_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 564492 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 7 | 7 |

SQL-based:

| phiid | fine_gravel | gravel | cobbles | stones | boulders | channers | flagstones | parafine_gravel | paragravel | paracobbles | parastones | paraboulders | parachanners | paraflagstones | unspecified | total_frags_pct_nopf | total_frags_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 564492 | 0 | 5 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 7 |

Summary

The differences are slight (< 1% of 8,460,509 cells). I suggest we decide on the assumptions of the underlying algorithm (see above) and what fraction of data errors we are willing to leave for QC activities.

@dylanbeaudette (Member Author)

Some more ideas after a conversation with Jay.

  • code() / uncode() are a good example of doing in R what was previously done in SQL
  • regional staff use fetchNASIS in a very different way than MLRA staff; "fixing" the data isn't so simple from the regional perspective
  • attempting to load / fix the data in one pass is likely too much for any single function
  • we should move more of the QC into dedicated functions (new functions for common QC of pedon / component data #89), leaving only the most basic to fetchNASIS or making it optional
  • when / why was the gravel threshold changed from 76 to 75mm? (I'll ask Curtis)
  • can we bulk-fix some of the more common phfrags errors in NASIS? (I'll ask Cathy)

I am going to write a very basic test for the 75/76mm issue and add it to soilDB today.
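
A minimal version of such a test might look like the following (a sketch only: the exact arguments and return format of simplifyFragmentData() are assumptions here, and the column names follow the NASIS phfrags table):

library(soilDB)

# two otherwise identical fragments, one using 75mm and one using 76mm as the upper limit
rf <- data.frame(
  phiid = c('a', 'b'),
  fragvol = c(10, 10),
  fragsize_l = c(2, 2),
  fragsize_r = c(NA, NA),
  fragsize_h = c(75, 76),
  fragshp = NA,
  fraghard = NA
)

s <- simplifyFragmentData(rf, id.var = 'phiid')

# inspect which class each record lands in (gravel vs. cobbles)
s[, c('phiid', 'gravel', 'cobbles')]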


dylanbeaudette commented Jan 31, 2019

Link to NSSH Part 618.

Current standards aren't all that clear.

[screenshot: fragment size class definitions from NSSH Part 618]

@dylanbeaudette (Member Author)

A thought: we might consider including some "slop" in our evaluation of fragsize to catch values that are within 1mm of a class limit.
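
For example, a sieve-style comparison with a 1mm tolerance might look like this (a hypothetical helper, not the soilDB implementation):

# values within `tol` mm above the class limit still pass the sieve
passesSieve <- function(d, limit, tol = 1) {
  d <= (limit + tol)
}

passesSieve(75, 75)  # TRUE
passesSieve(76, 75)  # TRUE (caught by the 1mm "slop")
passesSieve(77, 75)  # FALSE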

@dylanbeaudette (Member Author)

I just heard from Curtis Monger that the 75mm gravel/cobble cutoff may be a typo. I'll post back when I hear more.

@smroecker (Member)

It's unclear to me why the default for .sieve() is fragsize_h instead of fragsize_r. Can you explain your rationale? I'm unaware of another instance in soilDB or NASIS where the H is selected before the RV. We've spent a lot of time discussing the L, RV, and H issue in the past, and I don't recall any preference for the H.

It seems to me that if the fragsize range overlaps a class, or there is confusion as to the class limits, then the RV would be a more robust estimate. Also, if the user purposely entered an RV, then they're in effect saying that is the most common fragsize. If using the H results in a misclassification, then the effect is a shift in the particle size distribution.

@dylanbeaudette (Member Author)

I agree that the use of low-rv-high in my argument is the opposite of what I typically advocate. I think that there is a distinction to be made here:

  • the low-rv-high values in the phfrags table aren't describing a summary: these are typically class limits with additional flexibility for precision
  • does anyone use nested sieves in the field? would those data go into a different table?
  • is it reasonable to mix several size classes into a single record in phfrags?
  • my typical argument about the use/interpretation of low-rv-high is in the context of aggregate data: components, map units, series, MLRA, and so on

The use of H vs. RV (when present) is largely by convention, as inherited from the original SQL and (I think) data population practices in the West. It is very rare for folks to describe an RV for records in the phfrags table for any reason other than their supervisor "told them to". This convention may not hold in other parts of the country.

Using the RV first (when present) might be a reasonable compromise, but I can think of cases where the RV is completely arbitrary:

  • (H+L)/2
  • some number that was between L and H
  • whatever the regional staff said to put in there

Either way, the choice of precedence (RV over H, H over RV) will lead to some (< 1%?) errors. How about we decide and then go from there? There will be very little effect on pedons in SSR2, since the RV is very rarely populated.

I do not think that there is a case to be made for computing the RV from L/H and then using that.

Overlapping ranges are (in my opinion) a data population error.

Again, this line of reasoning only applies to pedon data.

@dylanbeaudette (Member Author)

Settling on a compromise, due to the various upper ends used to define gravels (74mm, 75mm, 76mm). The latest NSSH part 618 defines gravel as 2mm <= x < 75mm. Relevant changes (sketched below):

  • The internal function .sieve now uses < for all evaluations (via particle diameter).
  • The internal function .rockFragmentSieve uses the populated RV when present, and a calculated RV when missing.
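
A compact sketch of the resulting behavior (illustrative only; the actual implementation lives in the internal soilDB functions named above):

# effective fragment size: RV when populated, otherwise the midpoint of L and H
effectiveFragsize <- function(l, r, h) {
  ifelse(!is.na(r), r, (l + h) / 2)
}

# strict < on the upper limit, per NSSH part 618 (gravel: 2mm <= x < 75mm)
isGravel <- function(d) {
  !is.na(d) & d >= 2 & d < 75
}

isGravel(effectiveFragsize(l = 2, r = 20, h = 76))    # TRUE: RV = 20
isGravel(effectiveFragsize(l = 2, r = NA, h = 76))    # TRUE: midpoint = 39
isGravel(effectiveFragsize(l = 76, r = NA, h = 250))  # FALSE: midpoint = 163 (CB)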
