merge with master

CGATOxford · Oct 1, 2024 · 03b42b8 · 03b42b8
2 parents 5c0d427 + bcce5e6
commit 03b42b8
Show file tree

Hide file tree

Showing 39 changed files with 143,350 additions and 1,971 deletions.
diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,25 @@
+---
+name: Bug report
+about: Create a bug report to help us improve
+title: ''
+labels: ''
+assignees: ''
+
+---
+
+**Describe the bug**
+A clear and concise description of what the bug is.
+
+**To Reproduce**
+Steps to reproduce the behavior. If possible, include example data. At the very least, explain what the input data was and provide the `umi_tools` command used.
+
+**Expected behavior**
+A clear and concise description of what you expected to happen.
+
+**Environment**
+ - OS: [e.g. iOS]
+ - UMI-tools version [e.g. 1.1.3]
+ - How was UMI-tools installed (e.g pip/conda etc)
+
+**Additional context**
+Add any other context about the problem here. Was UMI-tools being run within another pipeline, etc.
diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,22 @@
+---
+name: Feature request
+about: Suggest an idea for this project
+title: ''
+labels: ''
+assignees: ''
+
+---
+
+**Is this something you propose to solve/implement, or is this a request?**
+
+**Is your feature request related to a problem? Please describe.**
+A clear and concise description of what the problem is. 
+
+**Describe the solution you'd like**
+A clear and concise description of what you want to happen.
+
+**Describe alternatives you've considered**
+A clear and concise description of any alternative solutions or features you've considered or have implemented. What are the limitations of this approach compared to what you are requesting/implementing?
+
+**Additional context**
+Add any other context or screenshots about the feature request here.
diff --git a/.github/ISSUE_TEMPLATE/support_request.md b/.github/ISSUE_TEMPLATE/support_request.md
@@ -0,0 +1,24 @@
+---
+name: Support request
+about: Help with running UMI-tools
+title: ''
+labels: ''
+assignees: ''
+
+---
+:warning: :warning: :warning:
+**Often you can get help more quickly from existing online material. Before you raise an issue to request help with running UMI-tools, please refer to:**
+
+- [Frequently asked questions](https://umi-tools.readthedocs.io/en/latest/faq.html)
+- [Documentation](https://umi-tools.readthedocs.io/en/latest/)
+- Search within the existing [open/closed issues](https://github.com/CGATOxford/UMI-tools/issues?q=is%3Aissue)
+- [Biostars](https://www.biostars.org/post/search/?query=umi_tools) 
+- Web search engine
+
+
+If the above links don't provide the information/advice required, please consider posting on [Biostars](https://www.biostars.org/new/post/). You're likely to get a response just as quick and your question will be visible to a larger community who can help and/or benefit from the post
+
+If you believe the support request is best made via a github issue, please delete this text and go ahead :smile:
+
+
+
diff --git a/.github/workflows/nosetests.yml b/.github/workflows/nosetests.yml
@@ -9,7 +9,7 @@ jobs:
     strategy:
       matrix:
         python-version: ["3.7", "3.8", "3.9"]
-        os: [ubuntu-latest, macOS-latest]
+        os: [ubuntu-latest]
 
     steps:
       - uses: actions/checkout@v3

diff --git a/doc/faq.md b/doc/faq.md
@@ -1,23 +1,29 @@
 # FAQ
  - **Why is my `umi_tool group`/`dedup` command taking so long?**
-The time taken to resolve each position is dependent on how many unique UMIs there are and how closely related they are since these factors will affect how large the network of connected UMIs is. Some of the factors which can affect run time are:
-    1. UMI length (shorter => fewer possible UMIs => increased connectivity between UMIs => larger networks => longer run time)
-    2. Sequencing error rate (higher => more error UMIs => larger networks => longer run time)
-    3. Sequencing depth (higher = greater proportion of possible UMIs observed => larger network => longer run time)
-    4. Running `umi_tools dedup --output-stats` requires a considerable amount of time and memory to generate the null distributions. If you want these stats, consider obtaining stats for just a single contig, eg. `--chrom=chr22`.
-    5. If you are doing single-cell RNA-seq and you have reads from more than one cell in your BAM file, make sure you are running with the `--per-cell` swtich.
+
+	The time taken to resolve each position is dependent on how many unique UMIs there are and how closely related they are since these factors will affect how large the network of connected UMIs is. Some of the factors which can affect run time are:
+   1. UMI length (shorter => fewer possible UMIs => increased connectivity between UMIs => larger networks => longer run time)
+   2. Sequencing error rate (higher => more error UMIs => larger networks => longer run time)
+   3. Sequencing depth (higher = greater proportion of possible UMIs observed => larger network => longer run time)
+   4. Running `umi_tools dedup --output-stats` requires a considerable amount of time and memory to generate the null distributions. If you want these stats, consider obtaining stats for just a single contig, eg. `--chrom=chr22`.
+   5. If you are doing single-cell RNA-seq and you have reads from more than one cell in your BAM file, make sure you are running with the `--per-cell` swtich.
 &nbsp;
+
 - **Why is my `umi_tool group`/`dedup` command taking so much memory?**
-There are a few reasons why your command could require an excessive amount of memory:
-    1. Most of the above factors increasing run time can also increase memory
-    2. Running `umi_tools dedup` with `chimeric-reads=use` (default) can take a large amount of memory. This is because `dedup` will wait until a whole contig has been deduplicated before re-parsing the reads from the contig and writing out the read2s which match the retained read1s. For chimeric read pairs, the read2s will not be found on the same contig and will be kept in a buffer of "orphan" read2s which may take up a lot of memory. Consider using `chimeric-reads=discard` instead to discard chimeric read pairs. Similarly, use of `--unmapped-reads=use` with `--paired` can also increase memory requirements.
+
+	There are a few reasons why your command could require an excessive amount of memory:
+   1. Most of the above factors increasing run time can also increase memory
+   2. Running `umi_tools dedup` with `chimeric-reads=use` (default) can take a large amount of memory. This is because `dedup` will wait until a whole contig has been deduplicated before re-parsing the reads from the contig and writing out the read2s which match the retained read1s. For chimeric read pairs, the read2s will not be found on the same contig and will be kept in a buffer of "orphan" read2s which may take up a lot of memory. Consider using `chimeric-reads=discard` instead to discard chimeric read pairs. Similarly, use of `--unmapped-reads=use` with `--paired` can also increase memory requirements.
 &nbsp;
+
  - **Can I run `umi_tools` with parallel threads?**
-Not yet! This is something we have been discussing for a while (see [#203](https://github.com/CGATOxford/UMI-tools/issues/203) & [#257](https://github.com/CGATOxford/UMI-tools/issues/257)) but haven't found the time to actually implement. If you'd like to help us out, get in touch!
-&nbsp;
+
+	Not yet! This is something we have been discussing for a while (see [#203](https://github.com/CGATOxford/UMI-tools/issues/203) & [#257](https://github.com/CGATOxford/UMI-tools/issues/257)) but haven't found the time to actually implement. If you'd like to help us out, get in touch!
 &nbsp;
+
 - **What's the correct regex to use for technique X?**
-Here is a table of the techniques we have come across that are not easily processed with the basic barcode pattern syntax, and the correct regex's to use with them:
+
+	Here is a table of the techniques we have come across that are not easily processed with the basic barcode pattern syntax, and the correct regex's to use with them:
 
    | Technique | regex |
    | --------- | ------ |
@@ -27,15 +33,50 @@ Here is a table of the techniques we have come across that are not easily proces
    | BD-Rhapsody | `(?<cell_1>.{9})(?<discard_1>.{12})(?<cell_2>.{9})(?<discard_2>.{13})(?<cell_3>.{9})(?<umi_1>.{8})T+` |
 
    If you know of other, please drop us a PR with them!
+&nbsp;
 
 - **Can I use `umi_tools` to determine consensus sequences?**
-Right now, you can use `umi_tools group` to identify the duplicated read groups. From this, you can then derive consensus sequences as you wish. We have discussed adding consense sequence calling as a separate `umi_tools` command (see [#203](https://github.com/CGATOxford/UMI-tools/issues/181)). If you'd like to help us out, get in touch!
-&nbsp;
+
+	Right now, you can use `umi_tools group` to identify the duplicated read groups. From this, you can then derive consensus sequences as you wish. We have discussed adding consense sequence calling as a separate `umi_tools` command (see [#203](https://github.com/CGATOxford/UMI-tools/issues/181)). If you'd like to help us out, get in touch!
 &nbsp;
+
 - **What do the `--per-gene`, `--gene-tag` and `--per-contig` options do in `umi_tools group`/`dedup`/`count`?**
-These options are designed to handle samples from sequence library protocols where fragmentation occurs post amplification, e.g CEL-Seq for single cell RNA-Seq. For such sample, mapping coordinates of duplicate reads will no longer be identical and `umi_tools` can instead use the gene to which the read is mapped.
-This behaviour is switched on with `--per-gene`. `umi_tools` then needs to be told if the reads are directly mapped to a transcriptome (`--per-contig`), or mapped to a genome and the transcript/gene assignment is contained in a read tag (`--gene-tag=[TAG]`). If you have assigned reads to transcripts but want to use the gene IDs to determine UMI groups, there is a further option to provide a file mapping transcript IDs to gene IDs (`--gene-transcript-map`). Finally, for single cell RNA-Seq, you must specify `--per-cell`. See `umi_tools dedup`/`group`/`count --help` for details of further related options and the [UMI-tools single cell RNA-Seq guide](https://github.com/CGATOxford/UMI-tools/blob/%7BTS%7D-AddFAQ/doc/Single_cell_tutorial.md).
+
+	These options are designed to handle samples from sequence library protocols where fragmentation occurs post amplification, e.g CEL-Seq for single cell RNA-Seq. For such sample, mapping coordinates of duplicate reads will no longer be identical and `umi_tools` can instead use the gene to which the read is mapped. This behaviour is switched on with `--per-gene`. `umi_tools` then needs to be told if the reads are directly mapped to a transcriptome (`--per-contig`), or mapped to a genome and the transcript/gene assignment is contained in a read tag (`--gene-tag=[TAG]`). If you have assigned reads to transcripts but want to use the gene IDs to determine UMI groups, there is a further option to provide a file mapping transcript IDs to gene IDs (`--gene-transcript-map`). Finally, for single cell RNA-Seq, you must specify `--per-cell`. See `umi_tools dedup`/`group`/`count --help` for details of further related options and the [UMI-tools single cell RNA-Seq guide](https://github.com/CGATOxford/UMI-tools/blob/%7BTS%7D-AddFAQ/doc/Single_cell_tutorial.md).
+&nbsp;
+
+- **Should I use `--per-gene` with my sequencing from method X?**
+
+	It can be difficult to work this out sometimes! So far we have come across the following technqies that require the use of `--per-gene`: CEL-seq2, SCRB-seq, 10x Chromium, inDrop, Drop-seq and SPLiT-seq. Let us know if you know of more
 &nbsp;
+
+
+- **How are reads/read pairs defined as having the same alignment coordinates?**
+    Defining which reads have the same alignment coordinates is more difficult that one might intuitively expect. For single-end reads, `umi_tools` uses the position of the start of the read (_not the alignment start position_ - see below) and the strand. For paired-end read, `umi_tools` additionally uses the template length (this can be turned off with `--ignore-tlen`.
+
+    When calculating the start of the read, `umi_tools` takes into account the soft-clipping and the strand of the read. Softclipped bases are part of the read, and on the forward strand UMI-tools calculates the read start position as the alignment start position minus the number of clipped bases. This is to avoid base miss-calling errors at the start of a read that could make two reads appear to have unique alignment coordinates. If the read is on the reverse strand, then the read start position is actually the alignment end position (adjust for soft-clipping). The following reads all have the same alignment start position, as defined by the "pos" column in the BAM/SAM.
+
+   ![Calculating the read start](./read_start_calculation.png)
+
+    The black read has the same read start and alignment start. The red read has 6 bases soft-clipped at the start. While these bases don't align to the genome, and the part of the read that aligns to the genome starts at the same sequence as the black read, we take these six additional nucleotides at the start to make it unlikely that the red read is a PCR duplicate of the black read.  The blue read is on the reverse strand - while according to the SAM format specification its start position is the same as the black read, it's sequences is actually completely different, and again is unlikely to be a PCR duplicate. The orange read is also on the reverse strand. It has a different alignment start position to the blue read (perhaps due to quality trimming). It also has a different alignment end. But the last two bases are soft-clipped. If you add back on these two bases (perhaps they are sequencing errors) you find that the organe and blue reads have the same outer fragment coordinates. 
+
+    The optimality of these choices was confirmed by looking at how well they accounted for identical UMIs of reads near each other. 
+
+    `umi_tools` can additionally use the 'spliced' status of a read to define possible duplicates. This behaviour is turned on with the `--spliced-is-unique` option. This is obtained by inspecting the cigar string to identify `N` anywhere within the cigar (skipped regions within the reference) or, alternatively, `S` at the 3' end of the cigar (soft-clipped at the end of the read). By default, 4 bases of `S` at the 3' end is the threshold for a read to be considered spliced. This can be controlled with the `--soft-clip-threshold` option.
+
+- **Why do I have reads with the same alignment coordinates and UMIs post deduplication?**
+    It's possible for reads/read pairs with the same or very similar UMIs and seemingly the same alignment coordinates when inspecting the BAM to be put into separate UMI groups. For `umi_tools dedup`, this would mean multiple output reads which look like duplicates. Refering to the question above about how alignment coordinates are defined, an inspection of the strand, alignment start, 5' softclipping and template length (if paired end) +/- the splicing status (if `--spliced-is-unique`  has been used), will likely clarify why these reads/read pairs were not considered duplicates.
+
+- **Has the whitelist command been peer-reviewed and compared to alternatives?**
+
+	No. At the time of the [UMI-tools publication](http://genome.cshlp.org/content/early/2017/01/18/gr.209601.116.abstract`) on 18 Jan '17, the only tools available were `extract`, `dedup` and `group`. The `count` and `whitelist` commands were added later. With `count`, the deduplication part is identical to `dedup`, so it's reasonable to say the underlying agorithm has been peer-reviewed. However, `whitelist` is using an entirely different approach (see [here](https://github.com/CGATOxford/UMI-tools/pull/317) which has not been rigourously tested, compared to alternative algorithms or peer-reviewed. We recommend users to explore other options for whitelisting also.
 &nbsp;
-- **Should I use `--per-gene` with my sequencing from method X**
-It can be difficult to work this out sometimes! So far we have come across the following technqies that require the use of `--per-gene`: CEL-seq2, SCRB-seq, 10x Chromium, inDrop, Drop-seq and SPLiT-seq. Let us know if you know of more.
+
+- **Can I use whitelist without UMIs?**
+
+	Strickly speaking, yes, but only with `--extract-method string`. If you use `--extract-method regex` and don't provide both a UMI and Cell barcode position in the regex, you'll get an error.
+
+&nbsp;
+- **Can I extract barcodes from just read2?**
+
+    Yes, you can use --read2-only with `umi_tools extract` & `whitelist`. Previously, we had been recommending users achieve this by simply swapping the read1 and read2 input files around. That's still a viable approach. 
diff --git a/doc/read_start_calculation.png b/doc/read_start_calculation.png