v0.0.0.9010
Pre-release
jlmelville released this 01 Apr 01:19
New features
- New parameter: `pcg_rand`. If `TRUE` (the default), then a random number generator from the PCG family is used during the stochastic optimization phase. The old PRNG, a direct translation of an implementation of the Tausworthe "taus88" PRNG used in the Python version of UMAP, can be obtained by setting `pcg_rand = FALSE`. The new PRNG is slower, but is likely superior in its statistical randomness. This change in behavior will break backwards compatibility: you will now get slightly different results even with the same seed.
- New parameter: `fast_sgd`. If `TRUE`, then the following combination of parameters is set: `n_sgd_threads = "auto"`, `pcg_rand = FALSE` and `approx_pow = TRUE`. These will result in a substantially faster optimization phase, at the cost of being slightly less accurate and results not being exactly repeatable. `fast_sgd = FALSE` by default, but if you are only interested in visualization, then `fast_sgd = TRUE` gives perfectly good results. For more generic dimensionality reduction and reproducibility, keep `fast_sgd = FALSE`.
- New parameter: `init_sdev`, which specifies how large the standard deviation of each column of the initial coordinates should be. This will scale any input coordinates (including user-provided matrix coordinates). `init = "spca"` can now be thought of as an alias of `init = "pca", init_sdev = 1e-4`. This may be too aggressive scaling for some datasets. The typical UMAP spectral initializations tend to result in standard deviations of around `2` to `5`, so an `init_sdev` in that range might be more appropriate in some cases. If spectral initialization detects multiple components in the affinity graph and falls back to scaled PCA, it uses `init_sdev = 1`.
- As a result of adding `init_sdev`, the `init` options `sspectral`, `slaplacian` and `snormlaplacian` have been removed (they weren't around for very long anyway). You can get the same behavior by e.g. `init = "spectral", init_sdev = 1e-4`. `init = "spca"` is sticking around because I use it a lot.
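
The new parameters might be exercised roughly as in the sketch below. The helper `rescale_columns` is hypothetical (it is not part of uwot) and only mimics what `init_sdev` is described as doing: rescaling each column of the initial coordinates to a target standard deviation. The calls to `umap` assume the uwot package is installed and use the built-in `iris` data purely as a stand-in dataset.

```r
# Hypothetical helper (not part of uwot): rescale every column of a
# coordinate matrix so that its standard deviation equals target_sdev.
rescale_columns <- function(coords, target_sdev) {
  col_sdevs <- apply(coords, 2, stats::sd)
  # Center each column, then divide by (sd / target) so the new sd is target
  scaled <- scale(coords, center = TRUE, scale = col_sdevs / target_sdev)
  # Drop the bookkeeping attributes that scale() attaches
  attr(scaled, "scaled:center") <- NULL
  attr(scaled, "scaled:scale") <- NULL
  scaled
}

set.seed(42)
init_coords <- matrix(rnorm(200, sd = 3), ncol = 2)
small <- rescale_columns(init_coords, 1e-4)
print(apply(small, 2, stats::sd))

# The same settings passed to uwot itself (skipped if uwot is not installed)
if (requireNamespace("uwot", quietly = TRUE)) {
  X <- as.matrix(iris[, 1:4])

  # Defaults: pcg_rand = TRUE, fast_sgd = FALSE
  set.seed(1337)
  emb_pcg <- uwot::umap(X)

  # Old taus88-style PRNG
  set.seed(1337)
  emb_taus <- uwot::umap(X, pcg_rand = FALSE)

  # Faster optimization for visualization; results are not exactly repeatable
  emb_fast <- uwot::umap(X, fast_sgd = TRUE)

  # Spelled-out equivalent of init = "spca"
  emb_spca <- uwot::umap(X, init = "pca", init_sdev = 1e-4)
}
```

If `init_sdev = 1e-4` looks too aggressive for a dataset, the same pattern with a value such as `init_sdev = 2` is one way to try the larger scale mentioned above.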
Bug fixes and minor improvements
- Spectral initialization (the default) was sometimes generating coordinates that had too large a range, due to an erroneous scale factor that failed to account for negative coordinate values. This could give rise to embeddings with very noticeable outliers distant from the main clusters.
- Also during spectral initialization, the amount of noise being added had a standard deviation an order of magnitude too large compared to the Python implementation (this probably didn't make any difference though).
- If requesting a spectral initialization, but multiple disconnected components are present, fall back to `init = "spca"`.
- Removed dependency on C++ `<random>` header. This breaks backwards compatibility even if you set `pcg_rand = FALSE`.
- `metric = "cosine"` results were incorrectly using the unmodified Annoy angular distance.
- Numeric matrix columns can be specified as the target for the `categorical` metric (fixes #20).
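
To illustrate why the unmodified Annoy angular distance is not the cosine distance: Annoy documents its angular distance as the Euclidean distance between the normalized vectors, which equals `sqrt(2 * (1 - cos))`. The sketch below checks that relationship in base R. The function names are made up for this example, and the conversion shown is an assumption about the kind of correction involved, not a copy of the package's code.

```r
# Cosine distance: 1 - cosine similarity
cosine_distance <- function(u, v) {
  1 - sum(u * v) / sqrt(sum(u^2) * sum(v^2))
}

# Angular distance as documented by Annoy: Euclidean distance between
# the two vectors after normalizing each to unit length
annoy_angular_distance <- function(u, v) {
  un <- u / sqrt(sum(u^2))
  vn <- v / sqrt(sum(v^2))
  sqrt(sum((un - vn)^2))
}

u <- c(1, 2, 3)
v <- c(4, 5, 6)

ang <- annoy_angular_distance(u, v)
cos_d <- cosine_distance(u, v)

# Assumed conversion: squaring and halving the angular distance
# recovers the cosine distance
cos_from_ang <- ang^2 / 2

print(c(angular = ang, cosine = cos_d, converted = cos_from_ang))
```

The two distances are related monotonically, so nearest-neighbor rankings are unaffected, but the distance values themselves differ unless the conversion is applied.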