Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate stringsAsFactors argument to uncode() #240

Merged
merged 4 commits into from
Mar 4, 2022
Merged

Conversation

brownag
Copy link
Member

@brownag brownag commented Feb 25, 2022

  • R 4.0 brought a new default value (FALSE) for stringsAsFactors where strings are converted to factors
  • Overall, stringsAsFactors is not a generic solution for deciding where and how soilDB functions are assigning factor levels. In latest version of R-devel default.stringsAsFactors() has been deprecated which is now breaking in our CI with warnings -- we can no longer use this in our function definitions (stringsAsFactors default is FALSE on R 4.0+ / not generic solution for factor levels #130).

In this draft PR, all default stringsAsFactors arguments have been converted to NULL. Going forward no functions internal to soilDB package will specify stringsAsFactors or rely on it being anything but FALSE for the sake of reproducibility. If the argument is specified (not missing) deprecation message will be issued.

Current cases that depend on stringsAsFactors = TRUE

  • fetchNASISWebReport() and get_chorizon_from_SDA() use this argument to convert texcl to factor
    • these could probably be replaced with aqp::SoilTextureLevels() or similar.
  • get_mapunit_from_NASIS() conversion of farmlndcl to have labels as factor levels / character values
    • this is a special case that differs from most other uncoding because the domain names are numbers. My preference would be to have a calculated field e.g. farmland_class with the label and have the default farmlndcl match the choice list options (which are numbers, not class names).

From my view deviating from the NASIS schema and putting character values in farmlndcl query results is not the way to go. Those values can't be re-coded using the source domains and therefore should be in a column of a different name.

uncode() does a good job of doing the conversions to/from domains when columns match NASIS, but doesn't allow for customization beyond the levels used in the data / metadata choice lists.

There could be a more generic interface to it that allows for targeting only explicitly named columns, custom ordering, dropping of some / unused levels, etc. This could be implemented as a function that operates on data.frame or SoilProfileCollection and takes a generic category/metadata structure

Creating draft PR to verify that CI issues are resolved.

@brownag brownag self-assigned this Feb 25, 2022
@brownag brownag linked an issue Feb 25, 2022 that may be closed by this pull request
41 tasks
@smroecker
Copy link
Member

This is clearly an issue that can't be ignored any longer.

We've discussed this topic in several issues. My thougts are best captured in #53. Since then I haven't seen any need to change my opinon. The primary reason being that having the default be stringsAsFactors = TRUE makes plotting easier. Most users like easier and plotting. So that seems like a sensable default. Therefore I don't support converting the default to FALSE, but I'm happy to go along with the concensus.

Otherwise the rest of the changes look good.

Thanks for forcing the issue.

@hammerly hammerly marked this pull request as ready for review March 2, 2022 14:19
@brownag
Copy link
Member Author

brownag commented Mar 3, 2022

I think that it is best to follow the base R default for the data.frame of stringsAsFactors FALSE. Despite the fact that we use factors to facilitate "uncoding" NASIS domains it is always safer to return the character label rather than the factor()ed code. options(stringsAsFactors = TRUE) is deprecated for good reason and will be disabled; running that expression in R 4.1.2 will issue a warning. Going forward I think FALSE is what users should expect for essentially all R functions that historically would have responded to stringsAsFactors TRUE.

The primary reason being that having the default be stringsAsFactors = TRUE makes plotting easier.

Most plotting tools treat character inputs as categories as needed. For instance, aqp::plotSPC and groupedProfilePlot, graphics::boxplot, ggplot2::ggplot. It's not hard to convert character to factor, and doing so gives the user an opportunity to customize the levels/order if they choose. In general, I think as.factor() needs to be called less frequently with stringsAsFactors FALSE than as.character() when stringsAsFactors is TRUE, so its actually probably "easier" in terms of nested calls and keystrokes to go with the R 4 default (FALSE).

Since we do not convert ALL strings to factors, just those in the NASIS metadata definition and a matching column name/levels, I argue it was misleading to co-opt stringsAsFactors in the first place.

@dylanbeaudette
Copy link
Member

dylanbeaudette commented Mar 3, 2022

Thanks all for the input / discussion / and proposed changes. I will assume that the results pre/post PR have been checked to ensure that they are identical--given the large number of modifications.

A couple of thoughts:

  • split functionality for further / better customization
  • uncode() performs the de-coding of values in NASIS, using the latest version of the metadata table from the local database (if possible)
  • a new function or suite of functions would convert specific variables to factors, set the desired levels, and upgrade to ordered factors when appropriate. these functions will include an argument for dropping unused levels
  • behavior of this "second-pass" over the uncoded data can be controlled via argument to fetchNASIS or global preference set with option()

Apart from the compatibility issue with a pending version of R, there are no reasons why we can't all get what we want out of NASIS. The factor-conversion code can be written to look for NASIS column names, and encode levels according to either the metadata or a manually-specified vector. An invert argument can be added to reverse factor levels which is sometimes handy. That said, I don't think that we should attempt to convert all character data → factors (e.g. parent material origin) by default, just those that are most commonly used as factors (texture class, hillslope position, drainage class, etc.).

The new function / functions will likely be internal to soilDB, and will "know" how to exclude IDs.

# x: data.frame
# all: encode all character data, or just those manually defined in the function
# invert: invert factor levels / ordering
# drop: drop unused levels
.encode_NASIS_factors <- function(x, all = FALSE, invert = FALSE, drop = TRUE) {
  
  # all = TRUE
  # use NASIS metadata

  # all = FALSE
  # use column-specific rules as follows
  # ...
  
  # drop = TRUE
  # drop unused levels, no matter the encoding strategy above

  # modified data.frame is returned
  return(res)
}

Finally, I suggest that fetchNASIS() should default to:

  • convert most of the commonly used nominal / ordinal data to factors / ordered factors
  • this would exclude such things as IDs, date/time, names, taxonomic information, or cases with >n unique values
  • factor levels should be set manually in the to-be-written function whenever possible
  • unused levels should be dropped

All of this and more described in #241

…insAsFactor() to facilitate deprecation of base R stringsAsFactors option
@brownag
Copy link
Member Author

brownag commented Mar 4, 2022

This PR was just a draft to point out the upcoming change in R and attempt to get the package back into a passing state.
While it would be great to have said generic functionality and I will work on a prototype, I am not going to be implementing all that at this time/in this PR.

We are about due to make a CRAN submission, so for the purposes of that and in response to the request from CRAN I have made a new option and helper function (soilDB.NASIS.NASISDomainsAsFactor and NASISDomainsAsFactor() 0267e49) that will allow us to retain the current factor behavior when stringsAsFactors = TRUE (with a deprecation message) without passing that argument around all over the place. Eventually it will not be necessary/allowed to pass any stringsAsFactors arguments, but for the time being if arg is set will trigger the setting of the package option and ensure coding as factors continues to work.

This function will also check the base R "stringsAsFactors" option in case it is set to TRUE. As I mentioned above eventually people will not be able to set this so we need to provide our own option to be able to control factor levels across all the different database query functions in a high-level call to e.g. fetchNASIS.

@dylanbeaudette
Copy link
Member

That sounds like a reasonable solution, thanks for ensuring a smoother transition.

Suggestions:

  1. merge this PR
  2. close stringsAsFactors default is FALSE on R 4.0+ / not generic solution for factor levels #130
  3. I'll open a new issue outlining some options for encoding factors in a way that meets all of our needs
  4. I'll schedule some time with all of us to discuss on the phone / teams the new issue

@smroecker
Copy link
Member

I've been tinkering with the results of this change, and have some questions.

  1. The warning message that occurs when stringsAsFactors = NULL implies that the argument will be deprecated. Is this saying that we will be completely removing the argument, or simply that we're removing the previous default of default.stringsAsFactors()?
  2. The NASISDomainsAsFactor() seems to be a temporary solution to transition to the new default, however it doesn't appear to be working as I would have expected. For example, I tested on both fetchNASIS(from = "pedons") and fetchNASIS(from = "pedon_report") and set stringsAsFactors to NULL, TRUE, and FALSE. All versions returned factors. Is that expected or a bug?

@brownag
Copy link
Member Author

brownag commented Mar 9, 2022

  1. The issue was to deprecate the argument with the goal of removing it entirely.
  2. NASISDomainsAsFactor() isn't intended to be temporary. I think it is working as it should, but open to suggestions if you have an example where it isn't working right. I did have to fix something that had it not playing right if you had the base R stringsAsFactors option set and there was a typo in one of the options() calls.

The key thing to know in how its currently set up is if you set stringsAsFactors = TRUE it will set the option soilDB.NASIS.DomainsAsFactors behind the scenes (to allow for item 1 / removal of arguments from all the calls)

See below, which is what I was expecting:

library(soilDB)

f <- fetchNASIS()
#> Loading required namespace: odbc
#> NOTICE: multiple `labsampnum` values / horizons; see pedon IDs:
#> S2017CA039001
#> NOTE: some records are missing rock fragment volume
#> -> QC: some fragsize_h values == 76mm, may be mis-classified as cobbles [6 / 96 records]
#> NOTE: all records are missing artifact volume
#> -> QC: horizon errors detected:
#>  Use `get('bad.pedon.ids', envir=soilDB.env)` for pedon record IDs (peiid)
#>  Use `get('bad.horizons', envir=soilDB.env)` for horizon designations
class(f$texcl)
#> [1] "character"

f <- fetchNASIS(stringsAsFactors = TRUE)
#> Warning: stringsAsFactors = TRUE argument is deprecated.
#> Setting package option with `NASISDomainsAsFactor(TRUE)`
#> NOTICE: multiple `labsampnum` values / horizons; see pedon IDs:
#> S2017CA039001
#> NOTE: some records are missing rock fragment volume
#> -> QC: some fragsize_h values == 76mm, may be mis-classified as cobbles [6 / 96 records]
#> NOTE: all records are missing artifact volume
#> -> QC: horizon errors detected:
#>  Use `get('bad.pedon.ids', envir=soilDB.env)` for pedon record IDs (peiid)
#>  Use `get('bad.horizons', envir=soilDB.env)` for horizon designations
f$texcl
#>  [1] sl   sl   sl   <NA> <NA> lcos ls   lcos <NA> <NA> lcos lcos lcos cos  <NA>
#> [16] sl   cosl cosl sl   <NA> ls   ls   ls   ls   <NA> <NA> cosl sl   scl  <NA>
#> [31] <NA> ls   ls   sl   scl  sl   <NA> ls   ls   ls   <NA> ls   ls   <NA> <NA>
#> [46] lcos ls   lcos lcos s    <NA> cosl ls   lcos ls   <NA> lcos lcos s    <NA>
#> 21 Levels: cos s fs vfs lcos ls lfs lvfs cosl sl fsl vfsl l sil si scl ... c

NASISDomainsAsFactor(FALSE)

f <- fetchNASIS(stringsAsFactors = FALSE)
#> NOTICE: multiple `labsampnum` values / horizons; see pedon IDs:
#> S2017CA039001
#> NOTE: some records are missing rock fragment volume
#> -> QC: some fragsize_h values == 76mm, may be mis-classified as cobbles [6 / 96 records]
#> NOTE: all records are missing artifact volume
#> -> QC: horizon errors detected:
#>  Use `get('bad.pedon.ids', envir=soilDB.env)` for pedon record IDs (peiid)
#>  Use `get('bad.horizons', envir=soilDB.env)` for horizon designations
f$texcl
#>  [1] "sl"   "sl"   "sl"   NA     NA     "lcos" "ls"   "lcos" NA     NA    
#> [11] "lcos" "lcos" "lcos" "cos"  NA     "sl"   "cosl" "cosl" "sl"   NA    
#> [21] "ls"   "ls"   "ls"   "ls"   NA     NA     "cosl" "sl"   "scl"  NA    
#> [31] NA     "ls"   "ls"   "sl"   "scl"  "sl"   NA     "ls"   "ls"   "ls"  
#> [41] NA     "ls"   "ls"   NA     NA     "lcos" "ls"   "lcos" "lcos" "s"   
#> [51] NA     "cosl" "ls"   "lcos" "ls"   NA     "lcos" "lcos" "s"    NA

NASISDomainsAsFactor(TRUE)

f <- fetchNASIS()
#> NOTICE: multiple `labsampnum` values / horizons; see pedon IDs:
#> S2017CA039001
#> NOTE: some records are missing rock fragment volume
#> -> QC: some fragsize_h values == 76mm, may be mis-classified as cobbles [6 / 96 records]
#> NOTE: all records are missing artifact volume
#> -> QC: horizon errors detected:
#>  Use `get('bad.pedon.ids', envir=soilDB.env)` for pedon record IDs (peiid)
#>  Use `get('bad.horizons', envir=soilDB.env)` for horizon designations
class(f$texcl)
#> [1] "factor"

@smroecker
Copy link
Member

See below. The stringsAsFactors = FALSE option doesn't work, unless NASISDomainsAsFactor(FALSE) is set. That doesn't seem very intiutive.

I don't support removing this argument. My original assumption was that you were deprecating the usage of default.stringsAsFactors(), not deprecating the argument all together. So now users will need to set NASISDomainsAsFactor() in order to return factors. This seems needlessly complex. The package already has a lot of complexity which many of our internal users already struggle with. Base R doesn't appear to be deprecating the stringsAsFactor argument, so why would we? Keeping the argument seems more initiutive and would require less familiarity of soilDB to operate. I can handle switching the default to FALSE, but removing the argument altogether is a bridge too far.

f <- fetchNASIS()
f_T <- fetchNASIS(stringsAsFactors = TRUE)
f_F <- fetchNASIS(stringsAsFactors = FALSE)
sum(sapply(horizons(f), is.factor))
[1] 0
sum(sapply(horizons(f_T), is.factor))
[1] 10
sum(sapply(horizons(f_F), is.factor))
[1] 10

sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] soilDB_2.6.14 aqp_1.41

loaded via a namespace (and not attached):
[1] Rcpp_1.0.8 knitr_1.37 cluster_2.1.2 xml2_1.3.3 raster_3.5-15
[6] magrittr_2.0.2 hms_1.1.1 odbc_1.3.3 bit_4.0.4 lattice_0.20-45
[11] rlang_1.0.1 fastmap_1.1.0 blob_1.2.2 stringr_1.4.0 plyr_1.8.6
[16] tools_4.1.2 rgdal_1.5-28 grid_4.1.2 data.table_1.14.2 xfun_0.29
[21] terra_1.5-21 cli_3.2.0 DBI_1.1.2 ellipsis_0.3.2 htmltools_0.5.2
[26] bit64_4.0.5 yaml_2.3.5 digest_0.6.29 lifecycle_1.0.1 farver_2.1.0
[31] vctrs_0.3.8 codetools_0.2-18 curl_4.3.2 evaluate_0.15 rmarkdown_2.11
[36] sp_1.4-6 stringi_1.7.6 compiler_4.1.2 pkgconfig_2.0.3

@brownag
Copy link
Member Author

brownag commented Mar 9, 2022

Using that argument is not good practice and is confusing because it does not do what base R stringsAsFactors does. It has all sorts of extra logic that comes in when dealing with stuff beyond just simple uncoding of NASIS results e.g. SDA functions.
If we want to add additional arguments to fetchNASIS and other high level functions that are more specific to the functionality we provide around factors, ideally with more customization options, then we can do so.

I would strongly suggest new users not use stringsAsFactors because it has potential to give results that are inconsistent with default behavior in R, can be problematic and will be removed from all 40ish of those functions in a subsequent release... so passing it will someday throw an error. That is what the title of this PR meant--we are easing the transition without having a hard error by removing it entirely... but we are not supporting stringsAsFactors in the long term

I do agree that stringsAsFactors = FALSE passed to a function should turn off factors regardless of the option settings, update the option in this interim period while we allow the argument to continue working, and also issue a deprecation message, so I have a fix for that

@dylanbeaudette
Copy link
Member

Stephen and I discussed this and here is the strategy that we would like to pursue.

Before the next push to CRAN:

  1. Use arguments + option (see below) to invoke selective encoding of factors: encodeFactors = c('none', 'all', 'some') (or pick a better name). It is clear that stringsAsFactors as an argument name is fraught.
  2. No invocation of default.stringsAsFactors() as this PR nicely dealt with
  3. Propagation of encodeFactors to low-level functions via argument, with default value set by option when appropriate. We may have to enumerate / agree on how many functions this will affect.
  4. NASISDomainsAsFactor() used to set options, but not required by high-level functions where encoding can be explicitly set by argument fetchNASIS(encodeFactors = 'none'). Asking users to set options adds complexity: see rms package and common problems people encounter there. Advanced users can make the choice option vs. argument.
  5. Selective encoding of factors only in high level functions for now, encoding factors in fetch* functions #241 can wait until the next release

brownag added a commit that referenced this pull request Mar 9, 2022
brownag added a commit that referenced this pull request Mar 9, 2022
brownag added a commit that referenced this pull request Mar 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

stringsAsFactors default is FALSE on R 4.0+ / not generic solution for factor levels
3 participants