v0.0.0.9010

Pre-release
@jlmelville jlmelville released this 01 Apr 01:19
· 514 commits to master since this release

New features

  • New parameter: pcg_rand. If TRUE (the default), a random number generator from the PCG family is used during the stochastic optimization phase. The old PRNG, a direct translation of the Tausworthe "taus88" implementation used in the Python version of UMAP, can be restored by setting pcg_rand = FALSE. The new PRNG is slower, but its output is likely to be of higher statistical quality. This change in behavior breaks backwards compatibility: you will now get slightly different results even with the same seed.
  • New parameter: fast_sgd. If TRUE, the following combination of parameters is set: n_sgd_threads = "auto", pcg_rand = FALSE and approx_pow = TRUE. This results in a substantially faster optimization phase, at the cost of slightly lower accuracy and results that are not exactly repeatable. fast_sgd = FALSE is the default, but if you are only interested in visualization, fast_sgd = TRUE gives perfectly good results. For more generic dimensionality reduction and for reproducibility, keep fast_sgd = FALSE.
  • New parameter: init_sdev, which specifies how large the standard deviation of each column of the initial coordinates should be. This will scale any input coordinates (including user-provided matrix coordinates). init = "spca" can now be thought of as an alias of init = "pca", init_sdev = 1e-4. This scaling may be too aggressive for some datasets. The typical UMAP spectral initializations tend to result in standard deviations of around 2 to 5, so a value in that range might be more appropriate in some cases. If spectral initialization detects multiple components in the affinity graph and falls back to scaled PCA, it uses init_sdev = 1.
  • As a result of adding init_sdev, the init options sspectral, slaplacian and snormlaplacian have been removed (they weren't around for very long anyway). You can get the same behavior by e.g. init = "spectral", init_sdev = 1e-4. init = "spca" is sticking around because I use it a lot.
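The new parameters above might be exercised roughly as follows. This is a sketch only: scale_to_sdev is a hypothetical helper written for illustration (it is not the package's internal code), the iris data is just a convenient built-in example, and the umap() calls assume the uwot package is installed and accepts the parameters named in these notes.

```r
# Hypothetical helper, not the package's internal code: rescale each column
# of a coordinate matrix so that its standard deviation equals `sdev`.
scale_to_sdev <- function(coords, sdev = 1e-4) {
  col_sd <- apply(coords, 2, stats::sd)
  sweep(coords, 2, col_sd / sdev, "/")
}

X <- as.matrix(iris[, 1:4])

# Roughly what "scaled PCA" means: take the first two principal components,
# then shrink each column to the requested standard deviation.
pca_coords <- stats::prcomp(X, rank. = 2)$x
init_coords <- scale_to_sdev(pca_coords, sdev = 1e-4)
col_sds <- apply(init_coords, 2, stats::sd)  # each column is now ~1e-4

# The calls below only run if uwot is available.
if (requireNamespace("uwot", quietly = TRUE)) {
  # Default behavior: PCG random number generator.
  set.seed(42)
  emb_pcg <- uwot::umap(X, pcg_rand = TRUE)

  # Previous behavior: taus88 PRNG (results still differ from older
  # releases, per the notes on the <random> header below).
  set.seed(42)
  emb_taus <- uwot::umap(X, pcg_rand = FALSE)

  # Faster optimization for visualization; not exactly repeatable.
  emb_fast <- uwot::umap(X, fast_sgd = TRUE)

  # Per the notes, these two initializations should behave alike.
  emb_spca <- uwot::umap(X, init = "spca")
  emb_pca <- uwot::umap(X, init = "pca", init_sdev = 1e-4)

  # A less aggressive scaling of the spectral initialization.
  emb_spec <- uwot::umap(X, init = "spectral", init_sdev = 1)
}
```

The explicit init_sdev form is handy when 1e-4 turns out to be too small for a dataset, since the same call can be repeated with a larger value.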

Bug fixes and minor improvements

  • Spectral initialization (the default) was sometimes generating coordinates that had too large a range, due to an erroneous scale factor that failed to account for negative coordinate values. This could give rise to embeddings with very noticeable outliers distant from the main clusters.
  • Also during spectral initialization, the amount of noise being added had a standard deviation an order of magnitude too large compared to the Python implementation (this probably didn't make any difference though).
  • If requesting a spectral initialization, but multiple disconnected components are present, fall back to init = "spca".
  • Removed dependency on C++ <random> header. This breaks backwards compatibility even if you set pcg_rand = FALSE.
  • metric = "cosine" results were incorrectly using the unmodified Annoy angular distance.
  • Numeric matrix columns can be specified as the target for the categorical metric (fixes #20).
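To illustrate why the unmodified Annoy angular distance is not the cosine distance: Annoy documents its angular distance as the Euclidean distance between the normalized vectors, i.e. sqrt(2(1 - cos(u, v))). The base R sketch below checks that relationship on two arbitrary vectors. The function names are hypothetical, and the conversion shown is one way to recover the cosine distance, not necessarily the exact fix applied in this release.

```r
# Cosine distance: 1 minus the cosine similarity.
cosine_dist <- function(u, v) {
  1 - sum(u * v) / sqrt(sum(u^2) * sum(v^2))
}

# Annoy-style angular distance: Euclidean distance between unit vectors.
annoy_angular <- function(u, v) {
  un <- u / sqrt(sum(u^2))
  vn <- v / sqrt(sum(v^2))
  sqrt(sum((un - vn)^2))
}

u <- c(1, 2, 3)
v <- c(-2, 0.5, 4)

d_ang <- annoy_angular(u, v)
d_cos <- cosine_dist(u, v)

# The two distances differ, but are related by d_cos = d_ang^2 / 2.
same_after_conversion <- isTRUE(all.equal(d_cos, d_ang^2 / 2))
```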