Runtime improvements for deduplication #43

Open
wants to merge 28 commits into
base: dev

atrull314
Collaborator

We found an issue with the current PR: UMI-tools took a very long time to process the transcriptome BAMs. We've implemented two solutions:

1.) Add Picard as an alternative. We have seen it used for single-cell data, but it may not be the best choice: it does not account for UMIs and it filters out considerably more reads.

2.) Split the BAM for UMI-tools. This was already implemented for the genome alignment. For the transcriptome alignment, we group transcripts by chromosome and then split, so that not too many files are created.
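
As a rough illustration of the grouping idea in point 2, here is a minimal Python sketch. It is not the pipeline's actual code: the function name `group_transcripts_by_chromosome` and the example mapping are hypothetical, and it assumes a transcript-to-chromosome lookup is already available (for example, parsed from the annotation). The point is only that splitting by chromosome group yields one chunk per chromosome instead of one per transcript.

```python
from collections import defaultdict


def group_transcripts_by_chromosome(transcript_to_chrom):
    """Return {chromosome: sorted list of transcript IDs}.

    Each group would correspond to one split BAM, so the number of
    files is bounded by the number of chromosomes, not transcripts.
    """
    groups = defaultdict(list)
    for transcript_id, chrom in transcript_to_chrom.items():
        groups[chrom].append(transcript_id)
    return {chrom: sorted(ids) for chrom, ids in groups.items()}


# Hypothetical mapping; real IDs would come from the annotation file.
example = {
    "tx1": "chr1",
    "tx2": "chr1",
    "tx3": "chr2",
    "tx4": "chrX",
}

groups = group_transcripts_by_chromosome(example)
for chrom, ids in sorted(groups.items()):
    print(chrom, ids)
```

With four transcripts on three chromosomes, this produces three groups rather than four per-transcript splits; on a real transcriptome the reduction would be far larger.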

@nf-core-bot
Member

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.0.2.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.


nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 5ecc97d

+| ✅ 187 tests passed       |+
#| ❔   5 tests were ignored |#
!| ❗   4 tests had warnings |!

❗ Test warnings:

  • nextflow_config - Config manifest.version should end in dev: 1.1.0
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: .github/workflows/linting.yml
  • files_unchanged - File ignored due to lint config: docs/images/nf-core-scnanoseq_logo_light.png
  • files_unchanged - File ignored due to lint config: docs/images/nf-core-scnanoseq_logo_dark.png
  • files_unchanged - File ignored due to lint config: .gitignore or .prettierignore
  • template_strings - template_strings

✅ Tests passed:

Run details

  • nf-core/tools version 3.0.2
  • Run at 2025-02-19 14:35:39

@atrull314 atrull314 mentioned this pull request Feb 19, 2025
@atrull314 atrull314 requested a review from lianov February 19, 2025 14:55
Member

@lianov lianov left a comment

Adding this for the record: as @atrull314 stated, this PR has been reviewed in depth (code and results). It addresses the performance gap in UMI-Tools processing of transcriptome files and also adds Picard MarkDuplicates as an alternative deduplication method. UMI-Tools remains the default because it is more sensitive than Picard: it uses the UMI sequences to track unique molecules instead of just the cell barcodes.
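
To make the sensitivity difference concrete, here is a toy Python sketch. It is an assumption-laden simplification, not how UMI-Tools or Picard MarkDuplicates are implemented: reads are modelled as hypothetical `(cell_barcode, umi, position)` tuples, and duplicates are detected by exact key match only (no UMI error correction). It only shows that dropping the UMI from the deduplication key collapses distinct molecules that share a cell barcode and position.

```python
def dedup(reads, key):
    """Keep the first read seen for each distinct key."""
    seen = set()
    kept = []
    for read in reads:
        k = key(read)
        if k not in seen:
            seen.add(k)
            kept.append(read)
    return kept


# Hypothetical reads: (cell_barcode, umi, position)
reads = [
    ("AAAC", "GGTT", 100),
    ("AAAC", "GGTT", 100),  # same barcode, UMI and position: a duplicate
    ("AAAC", "CCAA", 100),  # different UMI: a distinct molecule
    ("TTTG", "GGTT", 100),  # different cell
]

# UMI-aware key: barcode + UMI + position.
umi_aware = dedup(reads, key=lambda r: (r[0], r[1], r[2]))

# Key without the UMI: barcode + position only.
barcode_only = dedup(reads, key=lambda r: (r[0], r[2]))

print(len(umi_aware), len(barcode_only))
```

In this toy input the UMI-aware key keeps three reads while the barcode-and-position key keeps two, which matches the observation above that the Picard route filters more.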

3 participants