Counts.txt file not created when using .bam files as input #150

mikrzol · 2024-04-23T09:46:49Z

I am trying to use ShortStack for analyzing a large sRNA-seq experiment (132 samples). The alignment part takes a very long time (almost a full week) on a supercomputer server I am working on, despite specifying 12 threads, 20 GB RAM per thread.

The organism is barley (Hordeum vulgare). I use the toplevel genome without any masking and added mitochondrion, nonchromosomal and chloroplast sequences, which makes the reference rather large, but I got poor results when using a masked genome.

Since I can only submit a job for max a week on the server, I only managed to get the .bam files. I thought I could use them for further analysis by using the --bamfiles rather than --readfiles option to get the results and it works fine, but I don't get the Counts.txt file I would need for DE Seq analysis.

Is this an expected limitation when using .bam files as input?

The text was updated successfully, but these errors were encountered:

MikeAxtell · 2024-04-23T12:31:14Z

Thanks Michał for the feedback and questions. About slow alignments: The large number of libraries and the highly repetitive nature of the barley genome are the sources of the slowness. ShortStack's alignments take care to make best guesses at where to place multi-mapped reads (see Johnson et al. G3). But that care comes at the cost of extra time, especially when there are a lot of repeats. You can make it easier in at least two ways: 1. Run multiple --align_only jobs with just small numbers of libraries at a time. Then combine all of the resulting BAM files. 2. Run in --mmap -r mode ... just guess at where to drop multi-mapped reads. Will be much faster, at the cost of accuracy. 3. Align with another tool (like sensible bowtie defaults) and then merge the BAMs before running ShortStack. About the slow performance with a single BAM file: With 128 libraries and a giant, repetitive genome, I'm not surprised that it is slow. You may want to call cluster using ShortStack with just a subset of the libraries, instead of all 128. If your input BAM file has read groups specified, yes, you should receive a Counts.txt file. Hope this helps. From: Michał Stanoch ***@***.***> Date: Tuesday, April 23, 2024 at 5:47 AM To: MikeAxtell/ShortStack ***@***.***> Cc: Subscribed ***@***.***> Subject: [MikeAxtell/ShortStack] Counts.txt file not created when using .bam files as input (Issue #150) I am trying to use ShortStack for analyzing a large sRNA-seq experiment (132 samples). The alignment part takes a very long time (almost a full week) on a supercomputer server I am working on, despite specifying 12 threads, 20 GB RAM per thread. The organism is barley (Hordeum vulgare). I use the toplevel genome without any masking and added mitochondrion, nonchromosomal and chloroplast sequences, which makes the reference rather large, but I got poor results when using a masked genome. Since I can only submit a job for max a week on the server, I only managed to get the .bam files. I thought I could use them for further analysis by using the --bamfiles rather than --readfiles option to get the results and it works fine, but I don't get the Counts.txt file I would need for DE Seq analysis. Is this an expected limitation when using .bam files as input? — Reply to this email directly, view it on GitHub<#150>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ABUJPCK4Z6JNL5HFD3KVIGLY6YU2BAVCNFSM6AAAAABGURAHHKVHI2DSMVQWIX3LMV43ASLTON2WKOZSGI2TQNBSHA3TGNQ>. You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

MikeAxtell · 2024-05-08T18:52:28Z

I took another look at this.

When more than one BAM is given to --bamfile no Counts.txt is created. This doesn't seem like the behavior that I intended. But there may have been a technical reason ... I will look into it and report back on this thread.

The workaround is simple: First merge your multiple bamfiles making sure to mark readgroups per file names using the -r switch on samtools merge :

samtools merge --threads [n_threads] -r -o merged.bam in1.bam in2.bam ...

You can then input the merged.bam file as a single argument to ShortStack's --bamfile option, and Counts.txt file will be created.

MikeAxtell · 2024-05-09T14:50:20Z

As of commit 5335d2e this enhancement has been added. Now, when user inputs more than one BAM file to --bamfile, a Counts.txt will be created.

This will be included in the next release, 4.0.4, which should be out soon.

mikrzol · 2024-05-10T10:30:45Z

Hi Prof. Axtell,

thanks for the update. Glad I could point out a bug and it'll get fixed soon!

Thanks also for the suggestion to split the alignment and merge the alignments later. I submitted the jobs to perform in parallel on the server I'm using for the analysis and I ran into some issues, but it was probably because of wrong versions of tools being used for some reason (e.g., the default server samtools rather than the one installed in the conda environment with ShortStack).

I'll keep you updated.

MikeAxtell · 2024-05-16T18:49:15Z

Resolved in release 4.0.4

mikrzol · 2024-05-20T14:40:11Z

Hi Prof. Axtell,

thanks for resolving the issue. I tried using ShortStack (4.0.3) to first align files in batches, then merging them into one file with samtools merge -r, and finally using the resulting file as the input, but I still didn't get the Counts.txt file (despite using only one merged.bam file).

I hope updating ShortStack will solve the issue. Would you kindly add the newest version to bioconda for ease of access?

MikeAxtell · 2024-05-20T16:34:31Z

Yes, version 4.0.4 should solve this.
New releases are automatically bumped by bionconda. Except that right now the bioconda system for updating recipes is broken ... see bioconda/bioconda-recipes#41025
ShortStack v4.0.4 will show up on bioconda as soon as bioconda's team solves their current technical issues. That is out of my hands.

In the meantime you can just download the ShortStack v4.0.4 script directly from this repo and execute it in a suitable environment.

dirkjanvw · 2024-05-23T14:50:45Z

Hi! Just a quick note that ShortStack v4.0.4 is now available via bioconda :) The issues with the bioconda system seem solved.

MikeAxtell added the question label May 8, 2024

MikeAxtell self-assigned this May 8, 2024

MikeAxtell added the bug label May 8, 2024

MikeAxtell added enhancement and removed bug labels May 9, 2024

MikeAxtell closed this as completed May 16, 2024

MikeAxtell reopened this May 20, 2024

MikeAxtell closed this as completed May 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Counts.txt file not created when using .bam files as input #150

Counts.txt file not created when using .bam files as input #150

mikrzol commented Apr 23, 2024

MikeAxtell commented Apr 23, 2024 via email

MikeAxtell commented May 8, 2024

MikeAxtell commented May 9, 2024

mikrzol commented May 10, 2024 •

edited

Loading

MikeAxtell commented May 16, 2024

mikrzol commented May 20, 2024

MikeAxtell commented May 20, 2024

dirkjanvw commented May 23, 2024

Counts.txt file not created when using .bam files as input #150

Counts.txt file not created when using .bam files as input #150

Comments

mikrzol commented Apr 23, 2024

MikeAxtell commented Apr 23, 2024 via email

MikeAxtell commented May 8, 2024

MikeAxtell commented May 9, 2024

mikrzol commented May 10, 2024 • edited Loading

MikeAxtell commented May 16, 2024

mikrzol commented May 20, 2024

MikeAxtell commented May 20, 2024

dirkjanvw commented May 23, 2024

mikrzol commented May 10, 2024 •

edited

Loading