
Ensure that new rock fragment sorting code gives similar results as old code #1

Closed
dylanbeaudette opened this issue Apr 29, 2016 · 11 comments

dylanbeaudette commented Apr 29, 2016

The old code (SQL) may have missed some data that spanned multiple size classes.

Test:

library(soilDB)

# CRAN soilDB -- SQL-based RF sorting (previously saved; loads x.old)
load('old-RF-code.rda')
# GH soilDB -- R based RF sorting
x.new <- fetchNASIS()

# sanity check: rows in same order?
# YES
all(x.old$phiid == x.new$phiid)


# compare proportion of differences old vs. new
prop.table(table(x.old$fine_gravel == x.new$fine_gravel))
prop.table(table(x.old$gravel == x.new$gravel))
prop.table(table(x.old$cobbles == x.new$cobbles))
prop.table(table(x.old$stones == x.new$stones))
prop.table(table(x.old$boulders == x.new$boulders))
prop.table(table(x.old$parafine_gravel == x.new$parafine_gravel))
prop.table(table(x.old$paragravel == x.new$paragravel))
prop.table(table(x.old$paracobbles == x.new$paracobbles))
prop.table(table(x.old$channers == x.new$channers))
prop.table(table(x.old$flagstones == x.new$flagstones))
prop.table(table(x.old$parachanners == x.new$parachanners))
prop.table(table(x.old$paraflagstones == x.new$paraflagstones))
prop.table(table(x.old$total_frags_pct == x.new$total_frags_pct))

# ... all are <= 1% differences

# compare magnitude of differences
sqrt(mean((x.old$fine_gravel - x.new$fine_gravel)^2, na.rm=TRUE)) # 2%
sqrt(mean((x.old$gravel - x.new$gravel)^2, na.rm=TRUE)) # 5%
sqrt(mean((x.old$cobbles - x.new$cobbles)^2, na.rm=TRUE)) # < 1%
sqrt(mean((x.old$total_frags_pct - x.new$total_frags_pct)^2, na.rm=TRUE)) # 0


# investigate differences in gravel
summary(x.old$gravel - x.new$gravel)

# ... new gravel volume is always smaller than old gravel

# investigate trouble horizons / pedons
idx <- which((x.old$gravel - x.new$gravel) > 5)
h.old <- horizons(x.old)[idx, c('peiid', 'phiid', 'hzname', 'fine_gravel', 'gravel', 'cobbles')]
h.new <- horizons(x.new)[idx, c('fine_gravel', 'gravel', 'cobbles', 'total_frags_pct')]

cbind(h.old, h.new)

p <- unique(h.old$peiid)
x.old$pedon_id[which(profile_id(x.old) %in% p)]
@dylanbeaudette (Member Author)

Fixed: results are almost identical (all size fractions: < 1-2% difference), apart from bugs in the original code related to odd fragment populations spanning multiple size classes. The new values are more accurate than the old values.

@smroecker (Member)

I've implemented a new SQL version that I think is preferable to the new R version. The most compelling reason is that it should be more efficient to implement with SDA and NASIS Web Reports: SDA has a limit of 100k (?) records, and NASIS Web Reports are already quite slow. My SQL code uses temporary tables, which eliminate some of the redundancy in the original SQL code. Currently I've added the results to get_extended_data_from_NASIS_db(). See the test example below, from usiteiid LIKE '%CA071%' and areasymbol LIKE 'IN%'.

# install the development version of soilDB from GitHub
remotes::install_github("ncss-tech/soilDB", dependencies=FALSE, upgrade=FALSE, build=FALSE)

library(soilDB)
library(daff)

# frag_summary: current R-based summary; frag_summary_v2: new SQL-based summary
ed <- get_extended_data_from_NASIS_db()

# align rows by phiid before comparing
ed$frag_summary <- ed$frag_summary[order(ed$frag_summary$phiid), ]
ed$frag_summary_v2 <- ed$frag_summary_v2[order(ed$frag_summary_v2$phiid), ]

# cell-by-cell diff of the two summaries
test <- diff_data(ed$frag_summary, ed$frag_summary_v2)
render_diff(test)

[screenshot: render_diff() output comparing frag_summary and frag_summary_v2]

  • It appears the majority of the differences stem from the current version misclassifying (?) different size classes. The current version assumes fragsize_h is populated and classifies on that first (then fragsize_r if the H is missing); however, there are many records where the L and H values span more than an individual class. There also appear to be a handful of other records where ONLY the fragsize_l is populated, so those records are currently being classified as 'unspecified'. My SQL version is slightly more robust: it checks fragsize_r > (fragsize_h + fragsize_l) / 2 > fragsize_h > fragsize_l, and the first TRUE statement is used (sketched in R after this list). There also appear to be several misclassified records, where it appears the thresholds used by the simplifyFragmentData() function are wrong (e.g. < vs. <=).
  • There are also a handful of cases in the current version where pararock fragments are getting miscounted in the 'total_frags_pct_nopf' column.
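
To make that precedence explicit, here is a minimal R sketch of the rule described in the first bullet (pickFragsize() is a hypothetical helper, not part of soilDB):

# hypothetical helper illustrating the SQL version's precedence:
# RV, then (L + H) / 2, then H, then L -- the first populated value wins
pickFragsize <- function(l, r, h) {
  if (!is.na(r)) return(r)
  if (!is.na(l) && !is.na(h)) return((l + h) / 2)
  if (!is.na(h)) return(h)
  if (!is.na(l)) return(l)
  NA_real_
}

# example: only L and H populated, so the midpoint is used
pickFragsize(l = 2, r = NA, h = 76)  # 39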

If others could test with their data, I think we could safely update to this new version.

Stephen

@smroecker reopened this Jan 30, 2019

dylanbeaudette commented Jan 30, 2019

Thanks for writing an SQL version of this code. It will be really helpful when applying these concepts to SDA and similar remote databases. I like the idea of using SQL when appropriate but I have a hard time transitioning without further discussion of each point.

First of all, I have edited soilDB:::.sieve() to use 75mm (previously 76mm) as the cutoff for gravel. I wasn't aware of the change.

Here is the difference breakdown among ~ 4k pedons, after the latest changes to both R and SQL versions:
[screenshot: table of differences between the R-based and SQL-based summaries for ~ 4k pedons]

I suspect that most of the differences have to do with things we all need to decide on:

  • the importance of user-defined "sieves"; this is something planned for the R version
  • algorithms used by an R function should be coded in R so that they can be tested
  • assumptions about precedence of RV vs. H
  • interpretation of L when RV and H are missing
  • assumptions made in the absence of flat / hardness
  • are the fragment size classes inclusive on the high end (i.e. <= 75mm)?

Here are my suggestions (sketched in R after this list):

  • fragsize H should take precedence over RV
  • therefore, midpoint (H+L)/2 is not appropriate
  • RV should be used when H is missing
  • L should not be interpreted (how common is this?)
  • fragment size classes are inclusive on the high end
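
As a concrete illustration of these suggestions (not the current soilDB implementation), a hypothetical classifier might take H first, fall back to the RV, ignore L, and treat class limits as inclusive on the high end. Nonflat / hard class limits are assumed from NSSH part 618:

# hypothetical sketch: H takes precedence, RV used when H is missing, L ignored
classifyFrag <- function(l, r, h) {
  d <- if (!is.na(h)) h else r
  if (is.na(d)) return('unspecified')
  # class limits inclusive on the high end (e.g. a 75mm fragment is still GR)
  if (d <= 5) 'fine_gravel'
  else if (d <= 75) 'gravel'
  else if (d <= 250) 'cobbles'
  else if (d <= 600) 'stones'
  else 'boulders'
}

classifyFrag(l = 2, r = 20, h = 75)  # "gravel"
classifyFrag(l = 2, r = 20, h = NA)  # "gravel" (RV used when H is missing)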

I'll post some specific examples Wednesday mid-morning.

@dylanbeaudette (Member Author)

Additions

There are 175 (of 23,031 total) rows in the SQL-based summary that are not present in the R-based summary. It looks like these may be an artifact of NA fragment volume? It is hard to say, because all of the phiid values are NULL and all fragment classes are 0.
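
For reference, one way to isolate those extra rows (assuming the ed object from the get_extended_data_from_NASIS_db() example above):

# rows present in the SQL-based summary but absent from the R-based summary
extra <- ed$frag_summary_v2[!ed$frag_summary_v2$phiid %in% ed$frag_summary$phiid, ]
nrow(extra)

# check whether these are all NULL phiid with 0 fragment volume
summary(extra)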

Modifications

Use of 76mm as fragsize_h to represent GR
According to the current fragment size class definitions, this is now a data population error.

| phiid | fragvol | fragsize_l | fragsize_r | fragsize_h | fragshp | fraghard |
|---|---|---|---|---|---|---|
| 1196586 | 36 | 2 | 20 | 76 | nonflat | strongly |
| 1196586 | 11 | 76 | 125 | 250 | nonflat | strongly |
| 1196586 | 10 | 250 | 350 | 600 | nonflat | strongly |
| 1196586 | 2 | 2 | 20 | 76 | nonflat | indurated |
| 1196586 | 5 | 2 | 20 | 76 | nonflat | moderately |

The R-based summary pushes these fragments into CB (not ideal).

| phiid | fine_gravel | gravel | cobbles | stones | boulders | channers | flagstones | parafine_gravel | paragravel | paracobbles | parastones | paraboulders | parachanners | paraflagstones | unspecified | total_frags_pct_nopf | total_frags_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1196586 | 0 | 0 | 49 | 10 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 59 | 64 |

The SQL-based summary uses the RV and keeps them as GR.

| phiid | fine_gravel | gravel | cobbles | stones | boulders | channers | flagstones | parafine_gravel | paragravel | paracobbles | parastones | paraboulders | parachanners | paraflagstones | unspecified | total_frags_pct_nopf | total_frags_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1196586 | 0 | 38 | 11 | 10 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 59 | 64 |

In my example (~ 4k pedons) 30% of what should be GR use 76mm as fragsize_h. Only 3% of those with the 76mm fragsize_h have an RV that could be used to infer the correct size class.

This seems like a data population problem that should be fixed vs. worked-around.
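
A rough QC query along those lines might look like this (a sketch only; rf is assumed to be a data.frame of phfrags records with the columns shown above):

# flag records using the old 76mm convention for the GR upper limit
idx <- which(rf$fragsize_h == 76)
rf[idx, c('phiid', 'fragvol', 'fragsize_l', 'fragsize_r', 'fragsize_h')]

# what fraction of those have an RV that could disambiguate the size class?
prop.table(table(has.rv = !is.na(rf$fragsize_r[idx])))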

Fragment size spans more than 1 class
The second record is classified as a ST in the R-based summary, vs CB in the SQL-based summary. Data population problem?

| phiid | fragvol | fragsize_l | fragsize_r | fragsize_h | fragshp | fraghard |
|---|---|---|---|---|---|---|
| 717529 | 45 | 2 | 50 | 76 | NA | indurated |
| 717529 | 4 | 76 | 200 | 300 | NA | indurated |
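
A quick way to find records like this (again using the hypothetical rf data.frame of phfrags records; class breaks for nonflat fragments assumed from NSSH part 618):

# class breaks (mm): gravel / cobbles / stones / boulders
breaks <- c(2, 75, 250, 600, Inf)

# flag records whose L-H range crosses a class break
spans <- !is.na(rf$fragsize_l) & !is.na(rf$fragsize_h) &
  findInterval(rf$fragsize_l, breaks) != findInterval(rf$fragsize_h, breaks)

rf[which(spans), ]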

Interpretation of fragsize_l and assumptions about missing fragshp
Missing fragsize_r and fragsize_h. Missing fragshp, and a size class that suggests a channer?

| phiid | fragvol | fragsize_l | fragsize_r | fragsize_h | fragshp | fraghard |
|---|---|---|---|---|---|---|
| 564492 | 2 | 150 | NA | NA | NA | NA |
| 564492 | 5 | 2 | NA | 150 | NA | NA |

R-based:

| phiid | fine_gravel | gravel | cobbles | stones | boulders | channers | flagstones | parafine_gravel | paragravel | paracobbles | parastones | paraboulders | parachanners | paraflagstones | unspecified | total_frags_pct_nopf | total_frags_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 564492 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 7 | 7 |

SQL-based:

| phiid | fine_gravel | gravel | cobbles | stones | boulders | channers | flagstones | parafine_gravel | paragravel | paracobbles | parastones | paraboulders | parachanners | paraflagstones | unspecified | total_frags_pct_nopf | total_frags_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 564492 | 0 | 5 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 7 |

Summary

The differences are slight (< 1% of 8,460,509 cells). I suggest we decide on the assumptions of the underlying algorithm (see above) and what fraction of data errors we are willing to leave for QC activities.

@dylanbeaudette (Member Author)

Some more ideas after a conversation with Jay.

  • code() / uncode() are a good example of doing in R what was previously done in SQL
  • regional staff use fetchNASIS in a very different way than MLRA staff; "fixing" the data isn't so simple from the regional perspective
  • attempting to load / fix the data in one pass is likely too much for any single function
  • we should move more of the QC into dedicated functions (new functions for common QC of pedon / component data #89), leaving only the most basic to fetchNASIS or making it optional
  • when / why was the gravel threshold changed from 76 to 75mm? (I'll ask Curtis)
  • can we bulk-fix some of the more common phfrags errors in NASIS? (I'll ask Cathy)

I am going to write a very basic test for the 75/76mm issue and add it to soilDB today.
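
A minimal version of such a test might look like the following (a sketch only: the exact arguments and return format of simplifyFragmentData() are assumptions here, and the column names follow the NASIS phfrags table):

library(soilDB)

# two otherwise identical fragments, one using 75mm and one using 76mm as the upper limit
rf <- data.frame(
  phiid = c('a', 'b'),
  fragvol = c(10, 10),
  fragsize_l = c(2, 2),
  fragsize_r = c(NA, NA),
  fragsize_h = c(75, 76),
  fragshp = NA,
  fraghard = NA
)

s <- simplifyFragmentData(rf, id.var = 'phiid')

# inspect which class each record lands in (gravel vs. cobbles)
s[, c('phiid', 'gravel', 'cobbles')]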


dylanbeaudette commented Jan 31, 2019

Link to NSSH Part 618.

Current standards aren't all that clear.

[screenshot: fragment size class definitions from NSSH Part 618]

@dylanbeaudette (Member Author)

A thought: we might consider including some "slop" in our evaluation of fragsize to catch values that are within 1mm of a class limit.
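
For example, a sieve-style comparison with a 1mm tolerance might look like this (a hypothetical helper, not the soilDB implementation):

# values within `tol` mm above the class limit still pass the sieve
passesSieve <- function(d, limit, tol = 1) {
  d <= (limit + tol)
}

passesSieve(75, 75)  # TRUE
passesSieve(76, 75)  # TRUE (caught by the 1mm "slop")
passesSieve(77, 75)  # FALSE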

@dylanbeaudette (Member Author)

I just heard from Curtis Monger that the 75mm gravel/cobble cutoff may be a typo. I'll post back when I hear more.

@smroecker (Member)

It's unclear to me why the default for .sieve() is fragsize_h instead of fragsize_r. Can you explain your rationale? I'm unaware of another instance in soilDB or NASIS where the H is selected before the RV. We've spent a lot of time discussing the L, RV, and H issue in the past, and I don't recall any preference for the H.

It seems to me that if the fragsize range overlaps a class, or there is confusion as to the class limits, then the RV would be a more robust estimate. Also, if the user purposely entered an RV, then they're in effect saying that is the most common fragsize. If using the H results in a misclassification, then the effect is a shift in the particle size distribution.

@dylanbeaudette (Member Author)

I agree that the use of low-rv-high in my argument is the opposite of what I typically advocate. I think that there is a distinction to be made here:

  • the low-rv-high values in the phfrags table aren't describing a summary: these are typically class limits with additional flexibility for precision
  • does anyone use nested sieves in the field? would those data go into a different table?
  • is it reasonable to mix several size classes into a single record in phfrags?
  • my typical argument about the use/interpretation of low-rv-high is in the context of aggregate data: components, map units, series, MLRA, and so on

The use of H vs. RV (when present) is largely by convention, as inherited from the original SQL and (I think) data population practices in the West. It is very rare for folks to describe an RV for records in the phfrags table for any reason other than their supervisor "told them to". This convention may not hold in other parts of the country.

Using the RV first (when present) might be a reasonable compromise, but I can think of cases where the RV is completely arbitrary:

  • (H+L)/2
  • some number that was between L and H
  • whatever the regional staff said to put in there

Either way, the choice of precedence (RV over H, H over RV) will lead to some (< 1%?) errors. How about we decide and then go from there? There will be very little effect on pedons in SSR2, since the RV is very rarely populated.

I do not think that there is a case to be made for computing the RV from L/H and then using that.

Overlapping ranges are (in my opinion) a data population error.

Again, this line of reasoning only applies to pedon data.

@dylanbeaudette (Member Author)

Settling on a compromise, due to the various upper ends used to define gravels (74mm, 75mm, 76mm). The latest NSSH part 618 defines gravel as 2mm <= x < 75mm. Relevant changes (sketched below):

  • The internal function .sieve now uses < for all evaluations (via particle diameter).
  • The internal function .rockFragmentSieve uses the populated RV when present, and a calculated RV when missing.
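
A compact sketch of the resulting behavior (illustrative only; the actual implementation lives in the internal soilDB functions named above):

# effective fragment size: RV when populated, otherwise the midpoint of L and H
effectiveFragsize <- function(l, r, h) {
  ifelse(!is.na(r), r, (l + h) / 2)
}

# strict < on the upper limit, per NSSH part 618 (gravel: 2mm <= x < 75mm)
isGravel <- function(d) {
  !is.na(d) & d >= 2 & d < 75
}

isGravel(effectiveFragsize(l = 2, r = 20, h = 76))    # TRUE: RV = 20
isGravel(effectiveFragsize(l = 2, r = NA, h = 76))    # TRUE: midpoint = 39
isGravel(effectiveFragsize(l = 76, r = NA, h = 250))  # FALSE: midpoint = 163 (CB)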
