-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Counts.txt file not created when using .bam files as input #150
Comments
Thanks Michał for the feedback and questions.
About slow alignments: The large number of libraries and the highly repetitive nature of the barley genome are the sources of the slowness. ShortStack's alignments take care to make best guesses at where to place multi-mapped reads (see Johnson et al. G3). But that care comes at the cost of extra time, especially when there are a lot of repeats. You can make it easier in at least two ways:
1. Run multiple --align_only jobs with just small numbers of libraries at a time. Then combine all of the resulting BAM files.
2. Run in --mmap -r mode ... just guess at where to drop multi-mapped reads. Will be much faster, at the cost of accuracy.
3. Align with another tool (like sensible bowtie defaults) and then merge the BAMs before running ShortStack.
About the slow performance with a single BAM file: With 128 libraries and a giant, repetitive genome, I'm not surprised that it is slow. You may want to call cluster using ShortStack with just a subset of the libraries, instead of all 128.
If your input BAM file has read groups specified, yes, you should receive a Counts.txt file.
Hope this helps.
From: Michał Stanoch ***@***.***>
Date: Tuesday, April 23, 2024 at 5:47 AM
To: MikeAxtell/ShortStack ***@***.***>
Cc: Subscribed ***@***.***>
Subject: [MikeAxtell/ShortStack] Counts.txt file not created when using .bam files as input (Issue #150)
I am trying to use ShortStack for analyzing a large sRNA-seq experiment (132 samples). The alignment part takes a very long time (almost a full week) on a supercomputer server I am working on, despite specifying 12 threads, 20 GB RAM per thread.
The organism is barley (Hordeum vulgare). I use the toplevel genome without any masking and added mitochondrion, nonchromosomal and chloroplast sequences, which makes the reference rather large, but I got poor results when using a masked genome.
Since I can only submit a job for max a week on the server, I only managed to get the .bam files. I thought I could use them for further analysis by using the --bamfiles rather than --readfiles option to get the results and it works fine, but I don't get the Counts.txt file I would need for DE Seq analysis.
Is this an expected limitation when using .bam files as input?
—
Reply to this email directly, view it on GitHub<#150>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ABUJPCK4Z6JNL5HFD3KVIGLY6YU2BAVCNFSM6AAAAABGURAHHKVHI2DSMVQWIX3LMV43ASLTON2WKOZSGI2TQNBSHA3TGNQ>.
You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>
|
I took another look at this. When more than one BAM is given to The workaround is simple: First merge your multiple bamfiles making sure to mark readgroups per file names using the
You can then input the |
As of commit 5335d2e this enhancement has been added. Now, when user inputs more than one BAM file to This will be included in the next release, 4.0.4, which should be out soon. |
Hi Prof. Axtell, thanks for the update. Glad I could point out a bug and it'll get fixed soon! Thanks also for the suggestion to split the alignment and merge the alignments later. I submitted the jobs to perform in parallel on the server I'm using for the analysis and I ran into some issues, but it was probably because of wrong versions of tools being used for some reason (e.g., the default server samtools rather than the one installed in the conda environment with ShortStack). I'll keep you updated. |
Resolved in release 4.0.4 |
Hi Prof. Axtell, thanks for resolving the issue. I tried using ShortStack (4.0.3) to first align files in batches, then merging them into one file with samtools merge -r, and finally using the resulting file as the input, but I still didn't get the Counts.txt file (despite using only one merged.bam file). I hope updating ShortStack will solve the issue. Would you kindly add the newest version to bioconda for ease of access? |
Yes, version 4.0.4 should solve this. In the meantime you can just download the ShortStack v4.0.4 script directly from this repo and execute it in a suitable environment. |
Hi! Just a quick note that ShortStack v4.0.4 is now available via bioconda :) The issues with the bioconda system seem solved. |
I am trying to use ShortStack for analyzing a large sRNA-seq experiment (132 samples). The alignment part takes a very long time (almost a full week) on a supercomputer server I am working on, despite specifying 12 threads, 20 GB RAM per thread.
The organism is barley (Hordeum vulgare). I use the toplevel genome without any masking and added mitochondrion, nonchromosomal and chloroplast sequences, which makes the reference rather large, but I got poor results when using a masked genome.
Since I can only submit a job for max a week on the server, I only managed to get the .bam files. I thought I could use them for further analysis by using the
--bamfiles
rather than--readfiles
option to get the results and it works fine, but I don't get the Counts.txt file I would need for DE Seq analysis.Is this an expected limitation when using .bam files as input?
The text was updated successfully, but these errors were encountered: