Deprecate stringsAsFactors argument to uncode() #240
- default.stringsAsFactors() has been deprecated;
This is clearly an issue that can't be ignored any longer. We've discussed this topic in several issues. My thoughts are best captured in #53. Since then I haven't seen any need to change my opinion, the primary reason being that having the default be stringsAsFactors = TRUE makes plotting easier. Most users like plotting, and like it to be easy, so that seems like a sensible default. Therefore I don't support converting the default to FALSE, but I'm happy to go along with the consensus. Otherwise the rest of the changes look good. Thanks for forcing the issue.
I think it is best to follow the base R default for the stringsAsFactors argument of data.frame()
Most plotting tools treat character inputs as categories as needed. For instance, aqp::plotSPC and groupedProfilePlot, graphics::boxplot, ggplot2::ggplot. It's not hard to convert character to factor, and doing so gives the user an opportunity to customize the levels/order if they choose. In general, I think Since we do not convert ALL strings to factors, just those in the NASIS metadata definition and a matching column name/levels, I argue it was misleading to co-opt |
Thanks all for the input / discussion / and proposed changes. I will assume that the results pre/post PR have been checked to ensure that they are identical--given the large number of modifications. A couple of thoughts:
Apart from the compatibility issue with a pending version of R, there are no reasons why we can't all get what we want out of NASIS. The factor-conversion code can be written to look for NASIS column names, and encode levels according to either the metadata or a manually-specified vector. An The new function / functions will likely be internal to soilDB, and will "know" how to exclude IDs.

# x: data.frame
# all: encode all character data, or just those manually defined in the function
# invert: invert factor levels / ordering
# drop: drop unused levels
.encode_NASIS_factors <- function(x, all = FALSE, invert = FALSE, drop = TRUE) {
  # placeholder so the skeleton runs: start from the unmodified input
  res <- x

  # all = TRUE
  #   use NASIS metadata
  # all = FALSE
  #   use column-specific rules as follows
  #   ...
  # drop = TRUE
  #   drop unused levels, no matter the encoding strategy above

  # modified data.frame is returned
  return(res)
}

Finally, I suggest that
All of this and more described in #241 |
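A minimal sketch of how the encoding helper outlined above might be filled in, using base R only. The function name `encode_factors_sketch()`, the `columns` and `domains` arguments, and the `demo_domains` list are hypothetical stand-ins introduced for illustration; the levels shown are not the NASIS metadata, and this is not the soilDB implementation.

```r
# Hypothetical domain definitions: column name -> ordered choice list.
# Illustrative values only, not the NASIS metadata.
demo_domains <- list(
  texcl = c("s", "ls", "sl", "l", "c"),
  drainagecl = c("poorly", "moderately well", "well")
)

# x: data.frame
# columns: columns to encode; NULL means every column that has a domain
# domains: named list of level vectors (stand-in for metadata or manual rules)
# invert: reverse the level order
# drop: drop unused levels
encode_factors_sketch <- function(x, columns = NULL, domains = demo_domains,
                                  invert = FALSE, drop = TRUE) {
  if (is.null(columns)) columns <- names(domains)
  # only touch columns that are requested, present in x, and have a domain;
  # ID columns are excluded simply because they have no domain entry
  target <- intersect(intersect(columns, names(domains)), names(x))
  for (nm in target) {
    if (!is.character(x[[nm]])) next
    lv <- domains[[nm]]
    if (invert) lv <- rev(lv)
    # values that are not in the choice list become NA
    f <- factor(x[[nm]], levels = lv)
    if (drop) f <- droplevels(f)
    x[[nm]] <- f
  }
  x
}

d <- data.frame(
  pedon_id = c("P1", "P2", "P3"),
  texcl = c("sl", "c", "sl"),
  stringsAsFactors = FALSE
)
res <- encode_factors_sketch(d)
levels(res$texcl)    # "sl" "c": domain order kept, unused levels dropped
class(res$pedon_id)  # "character": no domain entry, so left alone
```

Passing an explicit `columns` vector restricts encoding to named columns, while a user-supplied `domains` list allows a custom level order.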
…insAsFactor() to facilitate deprecation of base R stringsAsFactors option
This PR was just a draft to point out the upcoming change in R and attempt to get the package back into a passing state. We are about due to make a CRAN submission, so for the purposes of that and in response to the request from CRAN I have made a new option and helper function (soilDB.NASIS.NASISDomainsAsFactor and This function will also check the base R "stringsAsFactors" option in case it is set to TRUE. As I mentioned above eventually people will not be able to set this so we need to provide our own option to be able to control factor levels across all the different database query functions in a high-level call to e.g. fetchNASIS. |
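The option-plus-helper pattern described above could look roughly like the following. This is a sketch under stated assumptions: the option name `demo.DomainsAsFactor` and the function `demo_domains_as_factor()` are hypothetical, chosen so as not to imply anything about the actual soilDB names or internals.

```r
# Getter/setter for a package-level option (hypothetical name).
# Called with no argument it reports the effective setting;
# called with TRUE/FALSE it updates the option.
demo_domains_as_factor <- function(x = NULL) {
  if (!is.null(x)) {
    options(demo.DomainsAsFactor = isTRUE(x))
    return(invisible(isTRUE(x)))
  }
  # also honour the base R option in case a user has set it to TRUE
  base_opt <- isTRUE(getOption("stringsAsFactors", default = FALSE))
  pkg_opt <- isTRUE(getOption("demo.DomainsAsFactor", default = FALSE))
  pkg_opt || base_opt
}

demo_domains_as_factor(TRUE)
demo_domains_as_factor()       # TRUE
demo_domains_as_factor(FALSE)
demo_domains_as_factor()       # FALSE unless the base option is TRUE
```

Query functions can then call the getter internally instead of each carrying its own argument, which is what lets a single high-level call control the behavior everywhere.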
That sounds like a reasonable solution, thanks for ensuring a smoother transition. Suggestions:
|
I've been tinkering with the results of this change, and have some questions.
|
The key thing to know about how it's currently set up is that if you set stringsAsFactors = TRUE, it will set the option soilDB.NASIS.DomainsAsFactors behind the scenes (to allow for item 1 / removal of arguments from all the calls). See below, which is what I was expecting:
library(soilDB)
f <- fetchNASIS()
#> Loading required namespace: odbc
#> NOTICE: multiple `labsampnum` values / horizons; see pedon IDs:
#> S2017CA039001
#> NOTE: some records are missing rock fragment volume
#> -> QC: some fragsize_h values == 76mm, may be mis-classified as cobbles [6 / 96 records]
#> NOTE: all records are missing artifact volume
#> -> QC: horizon errors detected:
#> Use `get('bad.pedon.ids', envir=soilDB.env)` for pedon record IDs (peiid)
#> Use `get('bad.horizons', envir=soilDB.env)` for horizon designations
class(f$texcl)
#> [1] "character"
f <- fetchNASIS(stringsAsFactors = TRUE)
#> Warning: stringsAsFactors = TRUE argument is deprecated.
#> Setting package option with `NASISDomainsAsFactor(TRUE)`
#> NOTICE: multiple `labsampnum` values / horizons; see pedon IDs:
#> S2017CA039001
#> NOTE: some records are missing rock fragment volume
#> -> QC: some fragsize_h values == 76mm, may be mis-classified as cobbles [6 / 96 records]
#> NOTE: all records are missing artifact volume
#> -> QC: horizon errors detected:
#> Use `get('bad.pedon.ids', envir=soilDB.env)` for pedon record IDs (peiid)
#> Use `get('bad.horizons', envir=soilDB.env)` for horizon designations
f$texcl
#> [1] sl sl sl <NA> <NA> lcos ls lcos <NA> <NA> lcos lcos lcos cos <NA>
#> [16] sl cosl cosl sl <NA> ls ls ls ls <NA> <NA> cosl sl scl <NA>
#> [31] <NA> ls ls sl scl sl <NA> ls ls ls <NA> ls ls <NA> <NA>
#> [46] lcos ls lcos lcos s <NA> cosl ls lcos ls <NA> lcos lcos s <NA>
#> 21 Levels: cos s fs vfs lcos ls lfs lvfs cosl sl fsl vfsl l sil si scl ... c
NASISDomainsAsFactor(FALSE)
f <- fetchNASIS(stringsAsFactors = FALSE)
#> NOTICE: multiple `labsampnum` values / horizons; see pedon IDs:
#> S2017CA039001
#> NOTE: some records are missing rock fragment volume
#> -> QC: some fragsize_h values == 76mm, may be mis-classified as cobbles [6 / 96 records]
#> NOTE: all records are missing artifact volume
#> -> QC: horizon errors detected:
#> Use `get('bad.pedon.ids', envir=soilDB.env)` for pedon record IDs (peiid)
#> Use `get('bad.horizons', envir=soilDB.env)` for horizon designations
f$texcl
#> [1] "sl" "sl" "sl" NA NA "lcos" "ls" "lcos" NA NA
#> [11] "lcos" "lcos" "lcos" "cos" NA "sl" "cosl" "cosl" "sl" NA
#> [21] "ls" "ls" "ls" "ls" NA NA "cosl" "sl" "scl" NA
#> [31] NA "ls" "ls" "sl" "scl" "sl" NA "ls" "ls" "ls"
#> [41] NA "ls" "ls" NA NA "lcos" "ls" "lcos" "lcos" "s"
#> [51] NA "cosl" "ls" "lcos" "ls" NA "lcos" "lcos" "s" NA
NASISDomainsAsFactor(TRUE)
f <- fetchNASIS()
#> NOTICE: multiple `labsampnum` values / horizons; see pedon IDs:
#> S2017CA039001
#> NOTE: some records are missing rock fragment volume
#> -> QC: some fragsize_h values == 76mm, may be mis-classified as cobbles [6 / 96 records]
#> NOTE: all records are missing artifact volume
#> -> QC: horizon errors detected:
#> Use `get('bad.pedon.ids', envir=soilDB.env)` for pedon record IDs (peiid)
#> Use `get('bad.horizons', envir=soilDB.env)` for horizon designations
class(f$texcl)
#> [1] "factor"
See below. The stringsAsFactors = FALSE option doesn't work unless NASISDomainsAsFactor(FALSE) is set. That doesn't seem very intuitive. I don't support removing this argument. My original assumption was that you were deprecating the usage of default.stringsAsFactors(), not deprecating the argument altogether. So now users will need to set NASISDomainsAsFactor() in order to return factors. This seems needlessly complex. The package already has a lot of complexity, which many of our internal users already struggle with. Base R doesn't appear to be deprecating the stringsAsFactors argument, so why would we? Keeping the argument seems more intuitive and would require less familiarity with soilDB to operate. I can handle switching the default to FALSE, but removing the argument altogether is a bridge too far.
Using that argument is not good practice and is confusing because it does not do what base R stringsAsFactors does. It has all sorts of extra logic that comes in when dealing with stuff beyond just simple uncoding of NASIS results e.g. SDA functions. I would strongly suggest new users not use I do agree that stringsAsFactors = FALSE passed to a function should turn off factors regardless of the option settings, update the option in this interim period while we allow the argument to continue working, and also issue a deprecation message, so I have a fix for that |
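The interim behavior agreed on above (an explicitly passed value wins over the option, the option is updated to match, and a deprecation message is issued) might be sketched as follows. `resolve_factor_setting()` and the option name `demo.DomainsAsFactor` are hypothetical and only illustrate the precedence logic; the actual fix may differ.

```r
# arg: value of a deprecated stringsAsFactors-style argument (NULL when unset)
# option_name: hypothetical package option consulted when `arg` is unset
resolve_factor_setting <- function(arg = NULL,
                                   option_name = "demo.DomainsAsFactor") {
  if (!is.null(arg)) {
    # an explicit TRUE or FALSE takes precedence over the option
    message("stringsAsFactors argument is deprecated; updating option '",
            option_name, "'")
    new_opt <- list(isTRUE(arg))
    names(new_opt) <- option_name
    options(new_opt)
    return(isTRUE(arg))
  }
  # argument not supplied: fall back to the option (FALSE when unset)
  isTRUE(getOption(option_name, default = FALSE))
}

options(demo.DomainsAsFactor = TRUE)
resolve_factor_setting()        # TRUE, taken from the option
resolve_factor_setting(FALSE)   # FALSE, with a deprecation message
resolve_factor_setting()        # FALSE, because the option was updated
```

With this ordering, passing FALSE turns factors off even when the option was previously set to TRUE.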
Stephen and I discussed this and here is the strategy that we would like to pursue. Before the next push to CRAN:
|
default.stringsAsFactors() has been deprecated which is now breaking in our CI with warnings -- we can no longer use this in our function definitions (stringsAsFactors default is FALSE on R 4.0+ / not generic solution for factor levels #130).

In this draft PR, all default stringsAsFactors arguments have been converted to NULL. Going forward no functions internal to soilDB package will specify stringsAsFactors or rely on it being anything but FALSE for the sake of reproducibility. If the argument is specified (not missing) deprecation message will be issued.

Current cases that depend on stringsAsFactors = TRUE
- fetchNASISWebReport() and get_chorizon_from_SDA() use this argument to convert texcl to factor aqp::SoilTextureLevels() or similar.
- get_mapunit_from_NASIS() conversion of farmlndcl to have labels as factor levels / character values farmland_class with the label and have the default farmlndcl match the choice list options (which are numbers, not class names).

From my view deviating from the NASIS schema and putting character values in farmlndcl query results is not the way to go. Those values can't be re-coded using the source domains and therefore should be in a column of a different name.

uncode() does a good job of doing the conversions to/from domains when columns match NASIS, but doesn't allow for customization beyond the levels used in the data / metadata choice lists.

There could be a more generic interface to it that allows for targeting only explicitly named columns, custom ordering, dropping of some / unused levels, etc. This could be implemented as a function that operates on data.frame or SoilProfileCollection and takes a generic category/metadata structure
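To illustrate the point above about keeping coded values in the schema column and putting labels in a separately named column, here is a small base R sketch. The choice list `demo_choices`, its codes and labels, and the helpers `decode_sketch()` / `encode_sketch()` are hypothetical; they do not reproduce the NASIS domain for farmlndcl or the behavior of uncode().

```r
# Hypothetical choice list: numeric codes and their labels
demo_choices <- data.frame(
  code = c(1L, 2L, 3L),
  label = c("Class A", "Class B", "Class C"),
  stringsAsFactors = FALSE
)

# codes -> labels
decode_sketch <- function(codes, choices = demo_choices) {
  choices$label[match(codes, choices$code)]
}

# labels -> codes
encode_sketch <- function(labels, choices = demo_choices) {
  choices$code[match(labels, choices$label)]
}

mu <- data.frame(farmlndcl = c(2L, 1L, NA, 3L))

# labels go into a new column, so the coded column still matches the domain
mu$farmland_class <- decode_sketch(mu$farmlndcl)

# the labels can be re-coded because they came from the same choice list
identical(encode_sketch(mu$farmland_class), mu$farmlndcl)  # TRUE
```

Because the original column is untouched, the round trip through the choice list remains possible, which is not the case once free-form character values replace the codes in place.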
Creating draft PR to verify that CI issues are resolved.