List of improvements for big datasets #53

Closed
5 of 8 tasks
grst opened this issue Apr 9, 2020 · 9 comments
Collaborator

grst commented Apr 9, 2020

  • speed up group_abundance
  • expose alignment distance as a CLI that reads and writes AnnData, so it can be submitted easily to a cluster
  • parallelize chain pairing
  • increase chunk size in multiprocessing (less IOwait for pickling... maybe can reduce the amount of data sent to each worker)
  • sensible defaults for graph layout. How would a force atlas layout work?
  • increase point size and set edges to False, when plotting many clonotypes
  • optimize/parallelize construction and reduction of coord dictionary. (superseded by Refactor CDR3-network construction. #191)
  • further optimize matrix housekeeping in tcr_neighbors (superseded by Refactor CDR3-network construction. #191)
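
A minimal sketch of the chunk-size idea from the list above, assuming the work is a flat list of sequence pairs handed to a `multiprocessing.Pool`. The function names and the Hamming-style toy distance are hypothetical stand-ins, not the project's code; the only point is that `chunksize` batches many pairs into one pickled task, so fewer round trips go through the pipe.

```python
from itertools import combinations
from multiprocessing import Pool


def toy_distance(pair):
    """Hypothetical stand-in for an expensive alignment distance.

    Returns the Hamming distance for equal-length strings, None otherwise.
    """
    a, b = pair
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(seqs, n_jobs=1, chunksize=1000):
    """Distances for all unordered pairs of seqs, in combinations() order."""
    pairs = combinations(seqs, 2)
    if n_jobs == 1:
        # Serial fallback: no pickling at all.
        return list(map(toy_distance, pairs))
    with Pool(n_jobs) as pool:
        # A larger chunksize sends many pairs per task instead of one,
        # which reduces the per-task pickling and IPC overhead.
        return pool.map(toy_distance, pairs, chunksize=chunksize)
```

Calling `pairwise_distances(seqs, n_jobs=4, chunksize=5000)` from under an `if __name__ == "__main__":` guard would run the same computation in four workers. Which chunk size actually helps depends on how expensive a single distance is, so it would need to be measured.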
@grst grst added the future label Apr 9, 2020
Collaborator Author

grst commented Apr 9, 2020

In GitLab by @grst on Apr 1, 2020, 14:43

Collaborator Author

grst commented Apr 9, 2020

In GitLab by @grst on Apr 6, 2020, 08:26

changed title from "List of improvements for many cells" to "List of improvements for big datasets"

Collaborator Author

grst commented Apr 9, 2020

In GitLab by @grst on Apr 6, 2020, 08:26

changed the description

@grst grst removed the future label Apr 9, 2020
Collaborator Author

grst commented May 3, 2020

Maybe this can be abused to compute Levenshtein distance in linear time:
https://github.com/wolfgarbe/SymSpell

and/or to prefilter which alignments to compute
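
A rough sketch of what such a prefilter could look like, assuming the symmetric-delete idea behind SymSpell: two sequences within Levenshtein distance `max_dist` always share at least one string obtainable from each by deleting up to `max_dist` characters. The function names are hypothetical, and this is pure Python for illustration only. Candidates can include false positives, so they would still need to be verified with a real distance computation.

```python
from collections import defaultdict
from itertools import combinations


def deletion_variants(seq, max_dist):
    """All strings obtainable from seq by deleting up to max_dist characters."""
    variants = {seq}
    frontier = {seq}
    for _ in range(max_dist):
        # Delete one more character from every string in the frontier.
        frontier = {s[:i] + s[i + 1:] for s in frontier for i in range(len(s))}
        variants |= frontier
    return variants


def candidate_pairs(seqs, max_dist=1):
    """Index pairs (i, j), i < j, that share at least one deletion variant."""
    index = defaultdict(set)
    for idx, seq in enumerate(seqs):
        for variant in deletion_variants(seq, max_dist):
            index[variant].add(idx)
    pairs = set()
    for members in index.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs
```

For `["CASSL", "CASSF", "CASL", "CQQQQ"]` with `max_dist=1`, only the pairs (0, 1) and (0, 2) survive, so four of the six alignments would be skipped.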

@grst grst self-assigned this May 19, 2020
Collaborator Author

grst commented May 31, 2020

Consider storing clonotype x clonotype matrices rather than cell x cell matrices.
This could save a lot of space and computation time.
The clonotype network plot would then be updated to show a single dot per
clonotype (potentially connected into clusters), where the dot size represents the number of cells.
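
To illustrate the bookkeeping this would need, here is a small sketch, assuming each cell carries a clonotype label. All names are hypothetical and the matrix is a plain nested list rather than whatever sparse format would be used in practice. Cell-level distances become a lookup into the much smaller clonotype-level matrix, and the per-clonotype cell counts give the dot sizes for the plot.

```python
from collections import Counter


def collapse_to_clonotypes(cell_clonotypes):
    """Map per-cell clonotype labels to rows of a clonotype-level matrix.

    Returns (clonotypes, sizes, cell_to_row):
      clonotypes  - sorted unique labels, one per matrix row
      sizes       - number of cells in each clonotype (dot size in the plot)
      cell_to_row - for each cell, the row index of its clonotype
    """
    counts = Counter(cell_clonotypes)
    clonotypes = sorted(counts)
    row_of = {ct: i for i, ct in enumerate(clonotypes)}
    sizes = [counts[ct] for ct in clonotypes]
    cell_to_row = [row_of[ct] for ct in cell_clonotypes]
    return clonotypes, sizes, cell_to_row


def cell_distance(i, j, clonotype_dist, cell_to_row):
    """Distance between cells i and j, looked up at clonotype level."""
    return clonotype_dist[cell_to_row[i]][cell_to_row[j]]
```

With n cells and k clonotypes, the stored matrix shrinks from n x n to k x k entries, plus one integer per cell for the mapping.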

Collaborator Author

grst commented Aug 4, 2020

I tried to compute edit distance with symspellpy, but it is actually 3x slower than the current Levenshtein implementation. The reason is probably that the Levenshtein package is a highly optimized C library, while symspellpy is a pure Python port of the SymSpell library.
We would need a Python wrapper for a C implementation of SymSpell for maximum performance.

Collaborator Author

grst commented Sep 28, 2020

Maybe WFA provides even better alignment performance than parasail?
https://github.com/smarco/WFA
Need to try it out at some point. In any case, we would need a Python wrapper for that library first.

Collaborator Author

grst commented Nov 12, 2020

mmseqs2 should do a great job for this.
Just tried it for some bacterial sequences (many-against-many). It's super fast and I don't see why we couldn't apply it here.

Collaborator Author

grst commented Jan 13, 2021

Superseded by #190 (faster chain_pairing) and #230.

@grst grst closed this as completed Jan 13, 2021
@grst grst added this to scirpy-dev May 28, 2024
@grst grst moved this to Done in scirpy-dev May 28, 2024