Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stringsAsFactors default is FALSE on R 4.0+ / not generic solution for factor levels #130

Closed
41 tasks
dylanbeaudette opened this issue May 11, 2020 · 4 comments · Fixed by #240
Closed
41 tasks
Labels
NASIS-local This tag is used for pull requests, issues, discussions etc. for soilDB local NASIS functions

Comments

@dylanbeaudette
Copy link
Member

dylanbeaudette commented May 11, 2020

We are passing around default.stringsAsFactors() to a lot of functions, each with its own arguments, flow-control, and documentation to manage. As far as I can tell (@smroecker please correct me if I am wrong), this is all to handle how uncode() converts or does not convert characters to factors, using the NASIS domains as the default levels. We have discussed at great length in #53, but some issues remain:

  • using the NASIS domains to set levels doesn't always result in logical ordering
  • using the NASIS domains to set levels often generates far more levels than the data support (i.e. droplevels() usually required to clean-up output)
  • uncode() should probably use an option to reduce (41 functions below) argument / documentation overhead
  • some characters (e.g. taxonname) should never be converted to factors
  • it may be more efficient to explicitly encode specific variables as factors, vs. explicitly not encoding (current behavior of fetchNASIS)

Functions to Fix

  • fetchNASIS_components.R:.fetchNASIS_components <- function(SS=TRUE, rmHzErrors=TRUE, fill = FALSE, stringsAsFactors = default.stringsAsFactors()) {

  • fetchNASIS_pedons.R:.fetchNASIS_pedons <- function(SS=TRUE, rmHzErrors=TRUE, nullFragsAreZero=TRUE, soilColorState='moist', lab=FALSE, stringsAsFactors = default.stringsAsFactors()) {

  • fetchNASIS_report.R:.get_site_from_NASISReport <- function(url = NULL, nullFragsAreZero = TRUE, stringsAsFactors = default.stringsAsFactors()

  • fetchNASIS_report.R:.get_pediagfeatures_from_NASISTemp <- function(stringsAsFactors = default.stringsAsFactors()

  • fetchVegdata.R:fetchVegdata <- function(SS=TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_component_data_from_NASIS_db.R:get_component_data_from_NASIS_db <- function(SS=TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_component_data_from_NASIS_db.R:get_legend_from_NASIS <- function(SS = TRUE, droplevels = TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_component_data_from_NASIS_db.R:get_lmuaoverlap_from_NASIS <- function(SS = TRUE, droplevels = TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_component_data_from_NASIS_db.R:get_mapunit_from_NASIS <- function(SS = TRUE, droplevels = TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_component_data_from_NASIS_db.R:get_component_correlation_data_from_NASIS_db <- function(SS=TRUE, dropAdditional=TRUE, dropNotRepresentative=TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_component_data_from_NASIS_db.R:get_component_copm_data_from_NASIS_db <- function(SS=TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_component_data_from_NASIS_db.R:get_component_esd_data_from_NASIS_db <- function(SS=TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_component_data_from_NASIS_db.R:get_comonth_from_NASIS_db <- function(SS=TRUE, fill=FALSE, stringsAsFactors = default.stringsAsFactors()) {

  • get_component_from_LIMS.R:get_component_from_NASISWebReport <- function(projectname, stringsAsFactors = default.stringsAsFactors()) {

  • get_component_from_LIMS.R:get_chorizon_from_NASISWebReport <- function(projectname, fill = FALSE, stringsAsFactors = default.stringsAsFactors()) {

  • get_component_from_LIMS.R:get_legend_from_NASISWebReport <- function(areasymbol, droplevels = TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_component_from_LIMS.R:get_lmuaoverlap_from_NASISWebReport <- function(areasymbol, droplevels = TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_component_from_LIMS.R:get_mapunit_from_NASISWebReport <- function(areasymbol, droplevels = TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_component_from_LIMS.R:get_projectmapunit_from_NASISWebReport <- function(projectname, stringsAsFactors = default.stringsAsFactors()) {

  • get_component_from_LIMS.R:get_projectmapunit2_from_NASISWebReport <- function(mlrassoarea, fiscalyear, projectname, stringsAsFactors = default.stringsAsFactors()) {

  • get_component_from_SDA.R:get_legend_from_SDA <- function(WHERE = NULL, droplevels = TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_component_from_SDA.R:get_lmuaoverlap_from_SDA <- function(WHERE = NULL, droplevels = TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_concentrations_from_NASIS_db.R:get_concentrations_from_NASIS_db <- function(SS=TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_cosoilmoist_from_LIMS.R:get_cosoilmoist_from_NASISWebReport <- function(projectname, impute = TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_cosoilmoist_from_NASIS.R:get_cosoilmoist_from_NASIS <- function(impute = TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_extended_data_from_NASIS_db.R:get_extended_data_from_NASIS_db <- function(SS=TRUE, nullFragsAreZero=TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_hz_data_from_NASIS_db.R:get_hz_data_from_NASIS_db <- function(SS=TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_phfmp_from_NASIS_db.R:get_phfmp_from_NASIS_db <- function(SS = TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_projectmapunit_from_NASIS.R:get_projectmapunit_from_NASIS <- function(SS = TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_site_data_from_NASIS_db.R:get_site_data_from_NASIS_db <- function(SS=TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_soilseries_from_NASIS.R:get_soilseries_from_NASIS <- function(stringsAsFactors = default.stringsAsFactors()) {

  • get_soilseries_from_NASIS.R:get_soilseries_from_NASISWebReport <- function(soils, stringsAsFactors = default.stringsAsFactors()) {

  • get_vegplot_data_from_NASIS_db.R:get_vegplot_from_NASIS_db <- function(SS=TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_vegplot_data_from_NASIS_db.R: get_vegplot_location_from_NASIS_db <- function(SS=TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_vegplot_data_from_NASIS_db.R: get_vegplot_trhi_from_NASIS_db <- function(SS=TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_vegplot_data_from_NASIS_db.R: get_vegplot_species_from_NASIS_db <- function(SS=TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_vegplot_data_from_NASIS_db.R: get_vegplot_transect_from_NASIS_db <- function(SS=TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_vegplot_data_from_NASIS_db.R: get_vegplot_transpecies_from_NASIS_db <- function(SS=TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_vegplot_data_from_NASIS_db.R: get_vegplot_tree_si_summary_from_NASIS_db <- function(SS=TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_vegplot_data_from_NASIS_db.R: get_vegplot_tree_si_details_from_NASIS_db <- function(SS=TRUE, stringsAsFactors = default.stringsAsFactors()) {

  • get_vegplot_data_from_NASIS_db.R: get_vegplot_textnote_from_NASIS_db <- function(SS=TRUE, fixLineEndings=TRUE, stringsAsFactors = default.stringsAsFactors()) {

@brownag
Copy link
Member

brownag commented May 11, 2020

Saw this coming a couple months ago.

Relevant background:
https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/index.html

@smroecker
Copy link
Member

smroecker commented May 28, 2020

Yes I saw that blog also, and expected we would have to update soilDB shortly. uncode() only converts those columns that match the ColumnPhysicalNames from within NASIS's metadata domains into factors. All other character data columns, such as taxonname, are left as characters.

I'm not sure converting the stringsAsFactors argument into an option is the way to go, as that obscures the functionality from your typical user, but I'd be open to if that is the consensus. We should probably still document that functionality in each functions help file details section. If so it won't really reduce the documentation overhead.

I really think it's preferable to maintain parity with base R, therefore I suggest we change the argument name from stringsAsFactors to as.is. Although I think in issue #53 it was decided that the default should be TRUE.

As to the issue of the resulting factor ordering produced by uncode(). I think the best long-term solution would be to create a subroutine within uncode() to correct the ordering for those factors deemed to be 'out of order'. Rather than modify within each and every function where it occurs. The same solution might be used for those variables that should be excluded. I had considered the possibility of going to the source, and fixing the ordering in NASIS with the help of George or Kyle, but... decided that probably wasn't a good idea as it would destroy any backward compatibility with old data.

@brownag brownag changed the title stringsAsFactors will be deprecated in R >= 4.0 stringsAsFactors default is FALSE on R 4.0+ / not generic solution for factor levels Jan 16, 2021
@brownag brownag added the NASIS-local This tag is used for pull requests, issues, discussions etc. for soilDB local NASIS functions label Jan 16, 2021
@brownag
Copy link
Member

brownag commented Mar 16, 2021

default.stringsAsFactors() returns FALSE, which means default result of get_cosoilmoist_from_NASISWebReport "status" column value is NULL / all NA. I fixed an example that was breaking in the CI since last week with this stack trace:

  Error: Assigned data `l` must be compatible with existing data.
  x Existing data has 119 rows.
  x Element 9 of assigned data has 0 rows.
  i Only vectors of size 1 are recycled.
  Backtrace:
       x
    1. +-soilDB::get_cosoilmoist_from_NASISWebReport("EVAL - MLRA 111A - Ross silt loam, 0 to 2 percent slopes, frequently flooded")
    2. | \-soilDB:::.cosoilmoist_prep(d.cosoilmoist, impute = impute, stringsAsFactors = stringsAsFactors)
    3. |   +-base::within(...)
    4. |   \-base::within.data.frame(...)
    5. |     +-base::`[<-`(...)
    6. |     \-tibble:::`[<-.tbl_df`(...)
    7. |       \-tibble:::tbl_subassign(x, i, j, value, i_arg, j_arg, substitute(value))
    8. |         \-tibble:::vectbl_recycle_rhs(...)
    9. |           +-base::withCallingHandlers(...)
   10. |           \-vctrs::vec_recycle(value[[j]], nrow)
   11. +-vctrs:::stop_recycle_incompatible_size(...)
   12. | \-vctrs:::stop_vctrs(...)
   13. |   \-rlang::abort(message, class = c(class, "vctrs_error"), ...)
   14. |     \-rlang:::signal_abort(cnd)
   15. |       \-base::signalCondition(cnd)
   16. \-(function (cnd) ...
  Execution halted

Not sure how it suddenly causes a problem, since stringsAsFactors has been FALSE by default since R 4+, and this just started testing bad last week.

Comparison toggling stringsAsFactors back to former default TRUE

daff::diff_data(
  soilDB::get_cosoilmoist_from_NASISWebReport(
    "EVAL - MLRA 111A - Ross silt loam, 0 to 2 percent slopes, frequently flooded",
    stringsAsFactors = FALSE
  ),
  
  soilDB::get_cosoilmoist_from_NASISWebReport(
    "EVAL - MLRA 111A - Ross silt loam, 0 to 2 percent slopes, frequently flooded",
    stringsAsFactors = TRUE
  )
)
#> Loading required namespace: rvest
#> Warning in eval(substitute(expr), e): NAs introduced by coercion

#> Warning in eval(substitute(expr), e): NAs introduced by coercion
#> Daff Comparison: 'soilDB::get_cosoilmoist_from_NASISWebReport("EVAL - MLRA 111A - Ross silt loam, 0 to 2 percent slopes, frequently flooded", ' '    stringsAsFactors = FALSE)' vs. 'soilDB::get_cosoilmoist_from_NASISWebReport("EVAL - MLRA 111A - Ross silt loam, 0 to 2 percent slopes, frequently flooded", ' '    stringsAsFactors = TRUE)' 
#>   First 6 and last 6 patch lines:
#>     nationalmusym muname                                                   
#> ->  5d8l          Ross silt loam                                           
#> ->  5d8l          Ross silt loam                                           
#> ->  5d8l          Ross silt loam                                           
#> ->  5d8l          Ross silt loam                                           
#> ->  5d8l          Ross silt loam                                           
#> ->  5d8l          Ross silt loam                                           
#> ... ...           ...                                                      
#> ->  2w56h         Ross silt loam, 0 to 2 percent slopes, frequently flooded
#> ->  2w56h         Ross silt loam, 0 to 2 percent slopes, frequently flooded
#> ->  2w56h         Ross silt loam, 0 to 2 percent slopes, frequently flooded
#> ->  2w56h         Ross silt loam, 0 to 2 percent slopes, frequently flooded
#> ->  2w56h         Ross silt loam, 0 to 2 percent slopes, frequently flooded
#> ->  2w56h         Ross silt loam, 0 to 2 percent slopes, frequently flooded
#>     dmuiid coiid   compname comppct_r drainagecl  month
#> ->  119696 238422  Ross     100       well        jan  
#> ->  119696 238422  Ross     100       well        feb  
#> ->  119696 238422  Ross     100       well        mar  
#> ->  119696 238422  Ross     100       well        apr  
#> ->  119696 238422  Ross     100       well        may  
#> ->  119696 238422  Ross     100       well        jun  
#> ... ...    ...     ...      ...       ...         ...  
#> ->  749825 2494482 Sloan    5         very poorly sep  
#> ->  749825 2494482 Sloan    5         very poorly oct  
#> ->  749825 2494482 Sloan    5         very poorly nov  
#> ->  749825 2494482 Sloan    5         very poorly nov  
#> ->  749825 2494482 Sloan    5         very poorly dec  
#> ->  749825 2494482 Sloan    5         very poorly dec  
#>     flodfreqcl               floddurcl  pondfreqcl                ponddurcl
#> ->  Not populated->frequent  very brief Not populated->none       <NA>     
#> ->  Not populated->frequent  very brief Not populated->none       <NA>     
#> ->  Not populated->frequent  very brief Not populated->none       <NA>     
#> ->  Not populated->frequent  very brief Not populated->none       <NA>     
#> ->  Not populated->frequent  very brief Not populated->none       <NA>     
#> ->  Not populated->very rare very brief Not populated->none       <NA>     
#> ... ...                      ...        ...                       ...      
#> ->  Not populated            <NA>       Not populated             <NA>     
#> ->  Not populated            <NA>       Not populated             <NA>     
#> ->  Not populated->frequent  brief      Not populated->occasional brief    
#> ->  Not populated->frequent  brief      Not populated->occasional brief    
#> ->  Not populated->frequent  brief      Not populated->occasional brief    
#> ->  Not populated->frequent  brief      Not populated->occasional brief    
#>     cosoilmoistiid dept_l  dept_r  dept_h  depb_l  depb_r  depb_h 
#> ->  NA             NA->201 NA->201 NA->201 NA->201 NA->201 NA->201
#> ->  NA             NA->201 NA->201 NA->201 NA->201 NA->201 NA->201
#> ->  NA             NA->201 NA->201 NA->201 NA->201 NA->201 NA->201
#> ->  NA             NA->201 NA->201 NA->201 NA->201 NA->201 NA->201
#> ->  NA             NA->201 NA->201 NA->201 NA->201 NA->201 NA->201
#> ->  NA             NA->201 NA->201 NA->201 NA->201 NA->201 NA->201
#> ... ...            ...     ...     ...     ...     ...     ...    
#> ->  23327632       0       0       0       200     200     200    
#> ->  23327616       0       0       0       200     200     200    
#> ->  23327627       0       0       0       0       7       15     
#> ->  23327628       0       7       15      200     200     200    
#> ->  23327630       0       0       0       0       7       15     
#> ->  23327629       0       7       15      200     200     200    
#>     status             
#> ->  NULL->Not populated
#> ->  NULL->Not populated
#> ->  NULL->Not populated
#> ->  NULL->Not populated
#> ->  NULL->Not populated
#> ->  NULL->Not populated
#> ... ...                
#> ->  NULL->moist        
#> ->  NULL->moist        
#> ->  NULL->moist        
#> ->  NULL->wet          
#> ->  NULL->moist        
#> ->  NULL->wet

@smroecker
Copy link
Member

Thanks Andrew. I looked into and fixed. I'll try and examine how do deal with all of the other functions shortly.

brownag added a commit that referenced this issue Feb 24, 2022
 - default.stringsAsFactors() has been deprecated;
@brownag brownag linked a pull request Feb 25, 2022 that will close this issue
@brownag brownag reopened this Mar 4, 2022
brownag added a commit that referenced this issue Mar 4, 2022
* Deprecate `stringsAsFactors` argument to `uncode()` #130
 - default.stringsAsFactors() has been deprecated;

* fetchHenry: backward compatibility with R<4.0

* Add soilDB.NASIS.DomainsAsFactor option and helper function NASISDomainsAsFactor() to facilitate deprecation of base R stringsAsFactors option

* Docs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NASIS-local This tag is used for pull requests, issues, discussions etc. for soilDB local NASIS functions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants