List of improvements for big datasets #53

grst · 2020-04-09T14:58:18Z

speed up group_abundance
alignment distance as a CLI that reads and writes anndata, so that it can easily be submitted to a cluster
parallelize chain pairing
increase chunk size in multiprocessing (less I/O wait for pickling; maybe we can also reduce the amount of data sent to each worker)
sensible defaults for graph layout. How would ForceAtlas work?
increase point size and set edges to False, when plotting many clonotypes
~~optimize/parallelize construction and reduction of coord dictionary~~ (superseded by Refactor CDR3-network construction. #191)
~~further optimize matrix housekeeping in tcr_neighbors~~ (superseded by Refactor CDR3-network construction. #191)
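
To illustrate the chunk-size item in the list above: a minimal sketch, not scirpy's actual implementation. The function names (`mismatches`, `chunked`, `distances_for_chunk`, `pairwise_distances`) and the toy distance are hypothetical stand-ins. The idea is that each worker receives one large batch of sequence pairs rather than many single pairs, so fewer objects need to be pickled and sent.

```python
from itertools import combinations
from multiprocessing import Pool


def mismatches(a, b):
    # Toy stand-in for a real sequence distance (assumption, not the scirpy metric):
    # differing positions plus the difference in length.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def chunked(items, size):
    # Split a list into consecutive batches of at most `size` elements.
    return [items[i:i + size] for i in range(0, len(items), size)]


def distances_for_chunk(chunk):
    # One task = one whole batch of pairs, pickled and sent to a worker once.
    return [mismatches(a, b) for a, b in chunk]


def pairwise_distances(seqs, n_jobs=1, chunk_size=1000):
    pairs = list(combinations(seqs, 2))
    chunks = chunked(pairs, chunk_size)
    if n_jobs == 1:
        results = [distances_for_chunk(c) for c in chunks]
    else:
        with Pool(n_jobs) as pool:
            results = pool.map(distances_for_chunk, chunks)
    # Flatten the per-chunk results back into one list, in pair order.
    return [d for chunk in results for d in chunk]


if __name__ == "__main__":
    print(pairwise_distances(["AAAA", "AAAT", "TTTT"], n_jobs=2, chunk_size=2))
```

A larger `chunk_size` means fewer, bigger messages per worker; whether that actually reduces I/O wait would need to be measured on real data.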


grst · 2020-04-09T14:58:21Z

In GitLab by @grst on Apr 1, 2020, 14:43

grst · 2020-04-09T14:58:24Z

In GitLab by @grst on Apr 6, 2020, 08:26

changed title from List of improvements for {-many cell-}s to List of improvements for {+big dataset+}s

grst · 2020-04-09T14:58:27Z

In GitLab by @grst on Apr 6, 2020, 08:26

changed the description

grst · 2020-05-03T07:58:28Z

Maybe this can be abused to compute Levenshtein distance in linear time:
https://github.com/wolfgarbe/SymSpell

and/or to prefilter which alignments to compute
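
A rough sketch of how the prefiltering could look, assuming the symmetric-delete idea that SymSpell is built on: two sequences within edit distance `max_dist` always share at least one variant obtained by deleting up to `max_dist` characters from each. All function names here are hypothetical, and the pure-Python `levenshtein` is only a placeholder for an optimized implementation.

```python
from collections import defaultdict
from itertools import combinations


def deletion_variants(seq, max_dist):
    # All strings obtainable from `seq` by deleting up to `max_dist` characters.
    variants = {seq}
    frontier = {seq}
    for _ in range(max_dist):
        frontier = {s[:i] + s[i + 1:] for s in frontier for i in range(len(s))}
        variants |= frontier
    return variants


def candidate_pairs(seqs, max_dist):
    # Index every sequence under each of its deletion variants; sequences
    # that share a variant are the only ones that can be within `max_dist`.
    index = defaultdict(set)
    for idx, seq in enumerate(seqs):
        for variant in deletion_variants(seq, max_dist):
            index[variant].add(idx)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def levenshtein(a, b):
    # Plain dynamic-programming edit distance (placeholder implementation).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def neighbors_within(seqs, max_dist):
    # Compute the exact distance only for the prefiltered candidates.
    result = {}
    for i, j in candidate_pairs(seqs, max_dist):
        d = levenshtein(seqs[i], seqs[j])
        if d <= max_dist:
            result[(i, j)] = d
    return result
```

The number of deletion variants grows quickly with sequence length and `max_dist`, so this is only likely to pay off for small distance thresholds.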

grst · 2020-05-31T16:04:15Z

Consider storing clonotype x clonotype matrices rather than cell x cell matrices.
This could save a lot of space and computation time.
The clonotype network plot would then be updated to show a single dot per
clonotype (potentially connected into clusters), where the dot size represents the number of cells.
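
A minimal sketch of the clonotype x clonotype idea, under the simplifying assumption that a clonotype is just a set of cells with an identical sequence. The function names and the toy distance are hypothetical. Distances are computed once per pair of unique clonotypes, and a cell-level distance is recovered through a cell-to-clonotype lookup.

```python
from collections import Counter


def mismatches(a, b):
    # Toy stand-in for a real sequence distance (assumption).
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def collapse_to_clonotypes(cell_sequences):
    # Unique sequences, the number of cells per sequence (dot size in the
    # network plot), and for each cell the index of its clonotype.
    counts = Counter(cell_sequences)
    clonotypes = sorted(counts)
    sizes = [counts[c] for c in clonotypes]
    position = {c: i for i, c in enumerate(clonotypes)}
    cell_to_clonotype = [position[s] for s in cell_sequences]
    return clonotypes, sizes, cell_to_clonotype


def clonotype_distance_matrix(clonotypes, dist):
    # Symmetric matrix over unique clonotypes only.
    n = len(clonotypes)
    mat = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            mat[i][j] = mat[j][i] = dist(clonotypes[i], clonotypes[j])
    return mat


def cell_distance(i, j, cell_to_clonotype, mat):
    # Cell-level distance looked up from the smaller clonotype-level matrix.
    return mat[cell_to_clonotype[i]][cell_to_clonotype[j]]


cells = ["CASSL", "CASSF", "CASSL", "CASSL", "CASSF", "CAWT"]
clonotypes, sizes, cell_to_clonotype = collapse_to_clonotypes(cells)
mat = clonotype_distance_matrix(clonotypes, mismatches)
```

In this toy example, six cells collapse to three clonotypes, so the matrix shrinks from 6 x 6 to 3 x 3.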

grst · 2020-08-04T13:47:26Z

I tried to compute edit distance with symspellpy, but it is actually 3x slower than the current Levenshtein implementation. The reason for this is probably that the Levenshtein package is a highly optimized C library, while symspellpy is a pure Python port of the SymSpell library.
We would need a Python wrapper for a C implementation of SymSpell for maximum performance.

grst · 2020-09-28T10:18:41Z

Maybe WFA provides even better alignment performance than parasail?
https://github.com/smarco/WFA
Need to try it out at some point. In any case, we would need a Python wrapper for that library first.

grst · 2020-11-12T15:20:00Z

MMseqs2 should do a great job for this.
Just tried it for some bacterial sequences (many-against-many). It's super fast and I don't see why we couldn't apply it here.

grst · 2021-01-13T17:10:39Z

superseded by #190 (faster chain_pairing) and #230.

grst added the future label Apr 9, 2020

grst removed the future label Apr 9, 2020

grst self-assigned this May 19, 2020

grst mentioned this issue Aug 4, 2020

Optimize parallel sequence distance calculation. #171

Merged

grst mentioned this issue Sep 22, 2020

Use adata.obsp instead of adata.uns for graphs. #142

Closed

grst closed this as completed Jan 13, 2021

grst added this to scirpy-dev May 28, 2024

grst moved this to Done in scirpy-dev May 28, 2024
