collapse version 1.7.0
collapse 1.7.0, released mid-January 2022, brings major improvements to the computational backend of the package and its data manipulation capabilities, and adds a whole set of new functions that enable more flexible and memory-efficient R programming, significantly enhancing the language itself. For the vast majority of code, updating to 1.7 should not cause any problems.
Changes to functionality
- `num_vars` is now implemented in C, yielding a massive performance increase over checking columns using `vapply(x, is.numeric, logical(1))`. It selects columns where `(is.double(x) || is.integer(x)) && !is.object(x)`. This provides the same results for most common classes found in data frames (e.g. factors and date columns are not numeric). However, users can define methods for `is.numeric` for other objects, and `num_vars` will no longer respect these. A prominent example is base R's 'ts' objects: `is.numeric(AirPassengers)` returns `TRUE`, but `is.object(AirPassengers)` is also `TRUE`, so the above condition yields `FALSE`. This implies that, if you happened to work with data frames of 'ts' columns, `num_vars` will no longer select them. Please make me aware if there are other important classes that are found in data frames and for which `is.numeric` returns `TRUE`. `num_vars` is also used internally in `collap`, so this might affect your aggregations.
- In `flag`, `fdiff` and `fgrowth`, if a plain numeric vector is passed to the `t` argument such that `is.double(t) && !is.object(t)`, it is coerced to integer using `as.integer(t)` and directly used as the time variable, rather than applying ordered grouping first. This avoids the inefficiency of grouping, and reflects the fact that in most data imported into R with various packages, the time (year) variables are coded as double although they should be integer (I also don't know of any cases where time needs to be indexed by a non-date variable with decimal places). Note that the algorithm internally handles irregularity in the time variable, so this is not a problem. Should this break any code, kindly raise an issue on GitHub.
- The function `setrename` now truly renames objects by reference (without creating a shallow copy). The same is true for `vlabels<-` (which was rewritten in C) and a new function `setrelabel`. Thus additional care needs to be taken (with use inside functions etc.), as the renaming will take global effect unless a shallow copy of the data was created by some prior operation inside the function. If in doubt, better use `frename` or `relabel`, which do create a shallow copy.
- Some improvements to the `BY` function, both in terms of performance and security. Performance is enhanced through a new C function `gsplit`, providing split-apply-combine computing speeds competitive with dplyr on a much broader range of R objects. Regarding security: if the result of the computation has the same length as the original data, names / rownames and grouping columns (for grouped data) are only added to the result object if known to be valid, i.e. if the data was originally sorted by the grouping columns (information recorded by `GRP.default(..., sort = TRUE)`, which is called internally on non-factor/GRP/qG objects). This is because `BY` does not reorder data after the split-apply-combine step (unlike `dplyr::mutate`); data are simply recombined in the order of the groups. Because of this, in general, `BY` should be used to compute summary statistics (unless data are sorted before grouping). The added security makes this explicit.
- Added a method `length.GRP` giving the length of a grouping object. This could break code that previously called `length` on a grouping object (which just returned the length of the list).
- Functions renamed in collapse 1.6.0 will now print a message telling you to use the updated names. The functions under the old names will stay around for 1-3 more years.
- Passing the argument `order` instead of `sort` to the function `GRP` (supported since a very early version of collapse) is now disabled.
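To make the `num_vars` change concrete, the selection rule quoted in this section can be written out in base R and compared with `is.numeric`. This is only an illustrative sketch: the helper `is_plain_numeric` and the example list `cols` are made up for this note and are not part of collapse.

```r
# Hypothetical helper: the condition quoted in the changelog, in base R.
is_plain_numeric <- function(x) (is.double(x) || is.integer(x)) && !is.object(x)

# A few column types one might find in a data frame (example data only).
cols <- list(
  dbl  = c(1.5, 2.5),
  int  = 1:2,
  fct  = factor(c("a", "b")),
  date = as.Date(c("2022-01-01", "2022-01-02")),
  ts   = AirPassengers            # base R 'ts' object from the datasets package
)

old_rule <- vapply(cols, is.numeric, logical(1))        # rule used before 1.7.0
new_rule <- vapply(cols, is_plain_numeric, logical(1))  # rule described above

# Only the 'ts' element differs: TRUE under the old rule, FALSE under the new one.
print(rbind(old_rule, new_rule))
```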
Bug Fixes
- Fixed a bug in some functions using Welford's online algorithm (`fvar`, `fsd`, `fscale` and `qsu`) to calculate variances, occurring when initial or final zero weights caused the running sum of weights in the algorithm to be zero, yielding a division by zero and `NA` as output although a value was expected. These functions now skip zero weights alongside missing weights, which also implies that you can pass a logical vector to the weights argument to very efficiently calculate statistics on a subset of data (e.g. using `qsu`).
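The zero-weight problem can be illustrated with a small base R sketch of a weighted Welford-style update. It is not the package's C implementation: the function `welford_wvar`, its `skip_zero` switch, and the frequency-weight denominator `sum(w) - 1` are assumptions made for illustration.

```r
# Hypothetical weighted online variance (Welford/West style update).
welford_wvar <- function(x, w, skip_zero = TRUE) {
  sumw <- 0; m <- 0; M2 <- 0
  for (i in seq_along(x)) {
    if (is.na(x[i]) || is.na(w[i])) next      # skip missing values and weights
    if (skip_zero && w[i] == 0) next          # skip zero weights (the fix)
    sumw <- sumw + w[i]
    d <- x[i] - m
    m <- m + d * w[i] / sumw                  # 0 / 0 here if sumw is still zero
    M2 <- M2 + w[i] * d * (x[i] - m)
  }
  M2 / (sumw - 1)                             # assumed frequency-weight denominator
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
w <- c(FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)  # logical weights select a subset

v_fixed  <- welford_wvar(x, w)                    # matches var(x[w])
v_broken <- welford_wvar(x, w, skip_zero = FALSE) # leading zero weight propagates NaN
```

With logical weights, skipping zeros reduces the computation to the variance of the selected subset, which is why a logical vector can serve as a cheap subsetting device.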
Additions
Basic Computational Infrastructure
- Function `group` was added, providing a low-level interface to a new unordered grouping algorithm based on hashing in C and optimized for R's data structures. The algorithm was heavily inspired by the great `kit` package of Morgan Jacob, and now feeds into the package through multiple central functions (including `GRP` / `fgroup_by`, `funique` and `qF`) when invoked with argument `sort = FALSE`. It is also used in internal groupings performed in data transformation functions such as `fwithin` (when no factor or 'GRP' object is provided to the `g` argument). The speed of the algorithm is very promising (often superior to `radixorder`), and it could be used in more places still. I welcome any feedback on its performance on different datasets.
- Function `gsplit` provides an efficient alternative to `split` based on grouping objects. It is used as a new backend to `rsplit` (which also supports data frames) as well as `BY`, `collap`, `fsummarise` and `fmutate`, for more efficient grouped operations with functions external to the package.
- Added multiple functions to facilitate memory-efficient programming (written in C). These include elementary mathematical operations by reference (`setop`, `%+=%`, `%-=%`, `%*=%`, `%/=%`), supporting computations involving integers and doubles on vectors, matrices and data frames (including row-wise operations via `setop`) with no copies at all. Furthermore, there is a set of functions which check a single value against a vector without generating logical vectors: `whichv`, `whichNA` (operators `%==%` and `%!=%`, which return indices and are significantly faster than `==`, especially inside functions like `fsubset`), `anyv` and `allv` (`allNA` was already added before). Finally, functions `setv` and `copyv` speed up operations involving the replacement of a value (`x[x == 5] <- 6`) or of a sequence of values from an equally sized object (`x[x == 5] <- y[x == 5]`, or `x[ind] <- y[ind]`, where `ind` could be pre-computed vectors or indices) in vectors and data frames without generating any logical vectors or materializing vector subsets.
- Function `vlengths` was added as a more efficient alternative to `lengths` (without method dispatch, simply coded in C).
- Function `massign` provides a multivariate version of `assign` (written in C, and supporting all basic vector types). In addition, the operator `%=%` was added as an efficient multiple assignment operator. (It is called `%=%` and not `%<-%` to facilitate the translation of Matlab or Python code into R, and because the zeallot package already provides multiple-assignment operators (`%<-%` and `%->%`), which are significantly more versatile, but orders of magnitude slower than `%=%`.)
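A brief sketch of how some of these helpers might be used, assuming collapse >= 1.7.0 is installed. The vectors and the variable names `a` and `b` are made up for the example; the comments give the base R expression each call stands in for.

```r
library(collapse)

x <- c(1, 5, 3, 5, 2)
y <- c(10, 20, 30, 40, 50)

idx  <- whichv(x, 5)   # like which(x == 5), without building a logical vector
any5 <- anyv(x, 5)     # like any(x == 5)
all5 <- allv(x, 5)     # like all(x == 5)

setv(x, 5, y)          # like x[x == 5] <- y[x == 5], but modifies x in place
setv(x, 3, 0)          # like x[x == 3] <- 0, in place
x %+=% 1               # like x <- x + 1, in place

# Multiple assignment: creates `a` and `b` in the calling environment.
c("a", "b") %=% list(1:3, "text")
```

Because these functions modify their first argument by reference, any other name bound to the same vector would see the change as well, so they are best applied to objects you own.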
High-Level Features
- Fully fledged `fmutate` function that provides functionality analogous to `dplyr::mutate` (sequential evaluation of arguments, including arbitrary tagged expressions and `across` statements). `fmutate` is optimized to work together with the package's Fast Statistical and Data Transformation Functions, yielding fast, vectorized execution, but also benefits from `gsplit` for other operations.
- `across()` function implemented for use inside `fsummarise` and `fmutate`. It is also optimized for Fast Statistical and Data Transformation Functions, but performs well with other functions too. It has an additional argument `.apply = FALSE`, which will apply functions to the entire subset of the data instead of individual columns, and thus allows for nesting tibbles and estimating models or correlation matrices by groups etc. `across()` also supports an arbitrary number of additional arguments, which are split and evaluated by groups if necessary. Multiple `across()` statements can be combined with tagged vector expressions in a single call to `fsummarise` or `fmutate`. Thus the computational framework is pretty general and similar to data.table, although less efficient with big datasets.
- Added functions `relabel` and `setrelabel` to make interactive dealing with variable labels a bit easier. Note that both functions operate by reference (through `vlabels<-`, which is implemented in C; taking a shallow copy of the data frame is useless in this case because variable labels are attributes of the columns, not of the frame). The only difference between the two is that `setrelabel` returns the result invisibly.
- Function shortcuts `rnm` and `mtt` added for `frename` and `fmutate`. `across` can also be abbreviated using `acr`.
- Added two options that can be invoked before loading of the package to change the namespace: `options(collapse_mask = c(...))` can be set to export copies of selected (or all) functions in the package that start with `f`, removing the leading `f`, e.g. `fsubset` -> `subset` (both `fsubset` and `subset` will be exported). This allows masking base R and dplyr functions (even basic functions such as `sum`, `mean`, `unique` etc. if desired) with collapse's fast functions, facilitating the optimization of existing code and allowing you to work with collapse using a more natural namespace. The package has been internally insulated against such changes, but of course they might have major effects on existing code. Also, `options(collapse_F_to_FALSE = FALSE)` can be invoked to get rid of the lead operator `F`, which masks `base::F` (an issue raised by some people who like to use `T` / `F` instead of `TRUE` / `FALSE`). Read the help page `?collapse-options` for more information.
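As a rough illustration of the `fmutate` / `fsummarise` / `across()` workflow described in this section, assuming collapse >= 1.7.0 is installed. The derived column names (`kpl`, `kpl_sq`, `max_wt`) and the conversion factor are arbitrary choices for the example.

```r
library(collapse)

# Sequential evaluation: `kpl_sq` is computed from `kpl`, defined in the same call.
out <- fmutate(mtcars, kpl = mpg * 0.425, kpl_sq = kpl^2)

# One across() statement combined with a tagged expression in a single call.
g     <- fgroup_by(mtcars, cyl)
stats <- fsummarise(g, across(c(mpg, hp), fmean), max_wt = fmax(wt))

print(stats)
```

Here `fmean` and `fmax` are the package's own fast functions, so the grouped computation should be able to run vectorized rather than through a split-apply-combine step.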
Improvements
- Package loads faster (because I don't fetch functions from some other C/C++ heavy packages in `.onLoad` anymore, which implied unnecessary loading of a lot of DLLs).
- `fsummarise` is now also fully featured, supporting evaluation of arbitrary expressions and `across()` statements. Note that mixing Fast Statistical Functions with other functions in a single expression can yield unintended outcomes; read more at `?fsummarise`.
- `funique` benefits from `group` in the default `sort = FALSE` configuration, providing extra speed and unique values in first-appearance order in both the default and the data frame method, for all data types.
- Function `ss` supports both empty `i` or `j`.
- The printout of `fgroup_by` also shows minimum and maximum group size for unbalanced groupings.
- In `ftransformv/settransformv` and `fcomputev`, the `vars` argument is also evaluated inside the data frame environment, allowing NSE specifications using column names, e.g. `ftransformv(data, c(col1, col2:coln), FUN)`.
- `qF` with option `sort = FALSE` now generates factors with levels in first-appearance order (instead of a random order assigned by the hash function), and can also be called on an existing factor to recast the levels in first-appearance order. It is also faster with `sort = FALSE` (thanks to `group`).
- `finteraction` has argument `sort = FALSE` to also take advantage of `group`.
- `rsplit` has improved performance through `gsplit`, and an additional argument `use.names`, which can be used to return an unnamed list.
- Speedup in `vtypes` and functions `num_vars`, `cat_vars`, `char_vars`, `logi_vars` and `fact_vars`. Note that `num_vars` behaves slightly differently, as discussed above.
- `vlabels(<-)` / `setLabels` rewritten in C, giving a ~20x speed improvement. Note that they now operate by reference.
- `vlabels`, `vclasses` and `vtypes` have a `use.names` argument. The default is `TRUE` (as before).
- `colorder` can rename columns on the fly and also has a new mode `pos = "after"` to place all selected columns after the first selected one, e.g. `colorder(mtcars, cyl, vs_new = vs, am, pos = "after")`. The `pos = "after"` option was also added to `roworderv`.
- `add_stub` and `rm_stub` have an additional `cols` argument to apply a stub to certain columns only, e.g. `add_stub(mtcars, "new_", cols = 6:9)`.
- `namlab` has additional arguments `N` and `Ndistinct`, allowing the number of observations and distinct values to be displayed next to variable names, labels and classes, to get a nice and quick overview of the variables in a large dataset.
- `copyMostAttrib` only copies the `"row.names"` attribute when known to be valid.
- `na_rm` can now be used to efficiently remove empty or `NULL` elements from a list.
- `flag`, `fdiff` and `fgrowth` produce fewer messages (i.e. no message if you don't use a time variable in grouped operations, and messages about computations on highly irregular panel data only if data length exceeds 10 million obs.).
- The print methods of `pwcor` and `pwcov` now have a `return` argument, allowing users to obtain the formatted correlation matrix, for exporting purposes.
- `replace_NA`, `recode_num` and `recode_char` have improved performance and an additional argument `set` to take advantage of `setv` to change (some) data by reference. For `replace_NA`, this feature is mature and setting `set = TRUE` will modify all selected columns in place and return the data invisibly. For `recode_num` and `recode_char`, only a part of the transformations are done by reference, thus users will still have to assign the data to preserve changes. In the future, this will be improved so that `set = TRUE` toggles all transformations to be done by reference.
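A few of the improvements listed in this section can be tried out with a short sketch, assuming collapse >= 1.7.0 is installed. The input vector and list are made-up examples; the `colorder` call is the one quoted in the list.

```r
library(collapse)

x <- c("b", "a", "b", "c", "a")

f <- qF(x, sort = FALSE)   # factor with levels in first-appearance order
u <- funique(x)            # unique values in first-appearance order (default sort = FALSE)

# Drop NULL and empty elements from a list.
cleaned <- na_rm(list(1, NULL, "a", integer(0)))

# Place the selected columns after the first selected one, renaming vs on the fly.
reordered <- colorder(mtcars, cyl, vs_new = vs, am, pos = "after")
print(names(reordered))
```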