This repository shares quick tips for better data management and terminal efficiency on servers and HPC systems, along with a collection of commands and scripts commonly used when analyzing sequencing data. The content is drawn from my daily usage and experience.
- How Do I Organize My Data on the Server?
- How to count the number of reads in a FASTQ file?
- How to calculate the average read depth after aligning reads to a reference genome?
- How to fetch mapped and unmapped reads from a BAM file using samtools?
- How to Remove Lines in a Text File Above a Specific Sentence using sed in Bash? Example sentence: ">>>>>>> Coverage per contig"
- How to check the delimiter of a file in bash?
- How to verify data integrity using md5sum?
- How to convert a multi-line fasta to a single-line fasta?
- How to convert BAM to FASTQ and split into paired reads?
- How to count reads in FASTQ files across multiple subdirectories?
I create unique project folders for each project. Each project folder contains a set of subfolders that organize the different stages of analysis. This structured approach ensures that all data and results related to a specific project are kept together, making it easier to manage, access, and navigate through the project's lifecycle.
Example: Genome assembly project: Ma_gasm; Population genomics project: Ma_popgen
Using numerical prefixes before folder names helps maintain order and enables efficient use of the TAB key for quick access to any folder in the terminal.
Example: 01_Ma_gasm, 02_Ma_popgen
Using spaces in folder names can lead to issues with command-line operations and scripts. Instead, use underscores (_) as word separators.
Example: Genome Assembly should be Genome_Assembly.
I prefer using short forms instead of lengthy titles where possible.
Example: 01_Maethiopoides_genome_assembly (very long) becomes 01_Ma_gasm (short and simple).
Within each project folder, I organize the analysis steps as individual folders. Raw data comes first, followed by subsequent analysis steps.
Example:
Within 01_Ma_gasm:
01_raw_data
02_Base_calling
03_ncgnm_asm
Further, within 03_ncgnm_asm:
01_flye
02_medaka
Within 01_flye, each sample is placed in an individual folder, starting with a QC folder:
00_QC
01_Magm1
02_Magm2
03_Magm3
Quality check (QC) files are the files generated while assessing the output of a particular dataset. For example, FASTQ files may undergo QC using tools like FastQC and MultiQC. The files generated by these tools should be organized into a 00_QC folder and separated into individual subfolders for each sample within the QC folder. This ensures that QC results are clearly organized and easily accessible.
Within 02_Illumina:
00_QC/
├── 01_Sample1/
│   ├── fastqc_report.html
│   └── multiqc_report.html
├── 02_Sample2/
│   ├── fastqc_report.html
│   └── multiqc_report.html
└── 03_Sample3/
    ├── fastqc_report.html
    └── multiqc_report.html
In a FASTQ file, each read is represented by four lines, and the header line of each read begins with the "@" symbol. The commands below count lines that start with "@". Note that "@" is also a valid quality-score character, so quality lines can occasionally begin with it and inflate the count slightly; dividing the total line count by four (see the alternative below) is more robust.
For a single file:
grep -c "^@" input.fastq
For multiple files:
grep -c "^@" *.fastq > read_count.txt
For zipped files (note that this counts reads across all files combined):
zcat *.fastq.gz | grep -c "^@" > read_count.txt
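Because of the "@" ambiguity mentioned above, a line-count based alternative is often safer (a minimal sketch; replace the filenames with your own):
echo $(( $(wc -l < input.fastq) / 4 ))
For a gzipped file:
echo $(( $(zcat input.fastq.gz | wc -l) / 4 ))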
Depth generally refers to the number of times a particular nucleotide in the genome is read during sequencing, i.e., the number of reads aligned over a specific position. Evaluating read depth helps determine whether a sufficient number of reads align over a given base position, which is crucial for downstream analysis. Coverage, by contrast, refers to the percentage of the genome that is covered by aligned reads. Checking read depth after alignment is a common way to evaluate sample quality.
Step 1: Convert SAM file to BAM file
samtools view -bS --threads <number_of_threads> input.sam > output.bam
Step 2: Sort BAM file based on position
samtools sort input.bam -o output_sorted.bam -@ <number_of_threads>
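Note: recent versions of samtools sort detect the input format automatically, so steps 1 and 2 can usually be combined into a single command (assuming a reasonably current samtools):
samtools sort -@ <number_of_threads> -o output_sorted.bam input.sam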
Step 3: Calculate the average depth
There are two common options: samtools depth or mosdepth.
samtools depth -a output_sorted.bam | awk '{sum+=$3} END { print "Average = ",sum/NR}'
OR
Index the sorted BAM file and use mosdepth to calculate the average read depth:
samtools index -@ <number_of_threads> input_sorted.bam
mosdepth sample_depth input_sorted.bam
awk '$1=="total"{print $4;}' sample_depth.mosdepth.summary.txt
To fetch mapped reads:
samtools view -b -F 4 input.bam > output.bam
The -F 4 flag excludes reads with the unmapped flag set (i.e., keeps mapped reads), and -b produces BAM output.
To fetch unmapped reads:
samtools view -b -f 4 input.bam > output.bam
The -f 4 flag keeps only reads with the unmapped flag set.
OR
Fetch mapped reads from a BAM file and output them as a FASTQ file:
samtools fastq -F 4 <input.bam> > <output.fq>
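As a quick check before extracting, samtools flagstat summarizes how many reads in the file are mapped and unmapped:
samtools flagstat input.bam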
How to Remove Lines in a Text File Above a Specific Sentence using sed in Bash? [Example sentence: ">>>>>>> Coverage per contig"]
- Edit within the same file:
sed -i '1,/>>>>>>> Coverage per contig/d' yourfile.txt
- Create a new file with the edits:
sed '1,/>>>>>>> Coverage per contig/d' yourfile.txt > newfile.txt
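The address range 1,/pattern/ deletes everything from the first line through the first line matching the pattern, inclusive. If you instead want to keep the matching sentence itself, printing from the match to the end of the file works (a small alternative sketch):
sed -n '/>>>>>>> Coverage per contig/,$p' yourfile.txt > newfile.txt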
- Save the following code as "delim_check.awk":
BEGIN {
    # Candidate delimiters to test; the pipe is escaped because gsub() treats it as a regex
    sep[","] = "comma"
    sep["\\|"] = "pipe"
    sep["\t"] = "tab"
}
{
    # Count occurrences of each delimiter on this line; c occurrences imply c+1 columns
    for (x in sep) {
        c = gsub(x, "&", $0)
        if (c) cnt[sep[x] " " (c+1)]++
    }
}
END {
    # Report the delimiter/column-count combination seen on the most lines
    for (x in cnt) {
        if (max == "" || cnt[x] > max) {
            max = cnt[x]
            est = x
        }
    }
    print est
}
- Run the code as follows:
awk -f delim_check.awk <file_to_check>
Prints the most likely delimiter (tab, pipe, or comma) followed by the estimated number of columns.
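For a quick sanity check, feeding in a small comma-separated line should report the comma delimiter with three columns:
echo "a,b,c" | awk -f delim_check.awk
# prints: comma 3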
To verify integrity, check files against a list of MD5 checksums:
md5sum -c <list_of_files.md5>
Example: FASTQ files from SRA:
- Compute md5sums of the FASTQ files and store them in a ".md5" file:
md5sum *.fastq.gz > md5sum_check.md5
- Verify the files:
md5sum -c md5sum_check.md5
Results: files reporting "OK" are intact; a "FAILED" entry indicates a corrupted or incomplete file.
Save the following code as a Perl script (e.g., multi2singlefa.pl) and run it as:
perl multi2singlefa.pl <input_multi.fasta> > <output_single.fasta>
#!/usr/bin/perl -w
use strict;
# Print the first header line as-is, then join each record's sequence lines
# onto a single line, starting a new line at every subsequent header.
my $input_fasta = $ARGV[0];
open(IN, "<", $input_fasta) || die("Error opening $input_fasta $!");
my $line = <IN>;
print $line;    # first header line
while ($line = <IN>) {
    chomp $line;
    if ($line =~ m/^>/) {
        # New header: terminate the previous sequence, then print the header
        print "\n", $line, "\n";
    } else {
        # Sequence line: append without a line break
        print $line;
    }
}
print "\n";
close(IN);
Use samtools to convert your BAM file to a FASTQ file (bam2fq is the older name for the samtools fastq command).
samtools bam2fq SAMPLE.bam > SAMPLE.fastq
With the above command, paired-end reads typically have /1 or /2 appended to the end of their read names. To split a single FASTQ file of paired-end reads into two separate files:
- Extract Reads Ending with /1 (Forward Reads):
grep '^@.*/1$' -A 3 --no-group-separator SAMPLE.fastq > SAMPLE_r1.fastq
- Extract Reads Ending with /2 (Reverse Reads):
grep '^@.*/2$' -A 3 --no-group-separator SAMPLE.fastq > SAMPLE_r2.fastq
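Alternatively, recent samtools versions can write the two mates to separate files directly, skipping the intermediate single FASTQ and the grep step (a sketch; samtools collate groups mates together by read name first):
samtools collate -u -O SAMPLE.bam | samtools fastq -1 SAMPLE_r1.fastq -2 SAMPLE_r2.fastq -0 /dev/null -s /dev/null -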
This part demonstrates how to efficiently count the number of reads in FASTQ files located in multiple subdirectories. This can be particularly useful when working with sequencing data organized by sample or experiment.
- Access to a Unix/Linux terminal.
- The wc command (available on most Unix/Linux systems).
- The zcat command if working with gzipped files (.fastq.gz).
- Your data should be organized in subdirectories.
Assume the following structure with subdirectories containing FASTQ files:
project_directory/
├── HAM_01/
│ ├── sample1.fastq
│ ├── sample2.fastq
├── INV_01/
│ ├── sample3.fastq.gz
│ ├── sample4.fastq.gz
├── ...
Use the following script to count reads in all .fastq files across subdirectories:
for dir in HAM_* INV_* IRE_* LIN_* NOR_*; do
for file in "$dir"/*.fastq; do
[ -e "$file" ] || continue # Skip if no files are found
echo "$file: $(( $(wc -l < "$file") / 4 )) reads"
done
done > read_counts.txt
- for dir in HAM_* INV_* IRE_* LIN_* NOR_*; do : Loops through subdirectories matching the patterns (e.g., HAM_01, INV_01).
- for file in "$dir"/*.fastq; do : Loops through .fastq files in each subdirectory.
- [ -e "$file" ] || continue : Skips the iteration when no matching files are found.
- wc -l < "$file" : Counts the number of lines in the file.
- / 4 : Divides the line count by 4 (each read in a FASTQ file spans 4 lines).
- > read_counts.txt : Saves the output to a file named read_counts.txt.
If your files are gzipped (.fastq.gz), use this modified script:
for dir in HAM_* INV_* IRE_* LIN_* NOR_*; do
for file in "$dir"/*.fastq.gz; do
[ -e "$file" ] || continue # Skip if no files are found
echo "$file: $(( $(zcat "$file" | wc -l) / 4 )) reads"
done
done > read_counts.txt
zcat "$file"
: Decompresses the gzipped file on-the-fly.- All other steps remain the same.
Both scripts will produce a read_counts.txt file containing output in this format:
HAM_01/sample1.fastq: 50000 reads
HAM_01/sample2.fastq: 60000 reads
INV_01/sample3.fastq.gz: 75000 reads
INV_01/sample4.fastq.gz: 80000 reads
...
- Replace HAM_* INV_* IRE_* LIN_* NOR_* with patterns that match your directory names if they differ.
- Ensure your shell supports pattern matching (enabled by default in Bash).
- For very large files, these commands might take some time to execute.
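If your samples are nested more deeply or the directory names do not follow a predictable pattern, a find-based variant may be more convenient (a sketch for gzipped files; adjust the path and extension to your data):
find project_directory -name "*.fastq.gz" | while read -r file; do
  echo "$file: $(( $(zcat "$file" | wc -l) / 4 )) reads"
done > read_counts.txt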