Markdown References
Table of Contents
Click to display the content of this site:
- Select region from 1KGP chr.vcf.gz, extract, compress, and tabix
- Remove "chr" string in variants in a VCF
- Add "chr" string in variants in VCF
- Sort karyotipically a VCF (version 1)
- Sort karyotypically a VCF (version 2). Use '-V', natural sort of (version) numbers within text
- Sort karyotypically a VCF (version 3): using vcf-sort (from vcftools)
- Sort karyotypically a VCF (version 4): using PICARD
- Replace spaces with a single tab
- Compute BAM coverage with BEDtools
- Find duplicated lines in a VCF matching the whole line
- Find duplicated lines in a VCF matching files 1, 2, and 5
- Find a line by a field on it, delete it, and save the result
- Count the number of mapped reads for each read in PE reads
- Replace white spaces with tabs
- Add rs from INFO field (avsnp150) to ID column in a VCF
- Remove duplicated lines in a file while keeping the original order
- Number of files per extension type in the current directory
- Parse file with AWK, sum column values in each line, and shows the result
- Count genotypes in a VCF imputed at Michigan Imputation Server
- Replace spaces with tab
- Prepare a header
- Extract variants that have a genotype count equal or higher than a genotype count threshold
- Count the lines of each file in a dir and get the sum of all lines
- Count the total number of lines in the selected files of a dir
- Grab the body of a file excluding the header
- Count number of lines in a file
- Count number of columns in a file
- Replace many spaces with single tabs, and specially the leading spaces of a PLINK '*.frq' file
- Recode (0/0, 0/1, 1/0, 1/1) genotypes into (0,1,2) genotypes
- Convert a VCF into a table of variants combining PERL and GATK4 VariantsToTable
- RefSeq gene annotation with ANNOVAR from a VCF file
- Split multiallelic variants (MNPs) into biallelic variants
- Annotate the ID field of each variant in a VCF file using dbSNP database
- Annotate the ID field of each variant in a VCF file using dbSNP database
- Parse a VCF file and count the number of HomRef, HomAlt, and Het genotypes from dose data
- Transposing the row-vector of individuals into a column-vector
- Change the sample name in a BAM header using SAMtools
- Sort a file using a certain numerical column while keeping the header at the top of the output
- Convert a multisample columnar FASTA file into a multisample single-line FASTA file
- Extract a region of interest from a BAM file with SAMtools and index
- Compress files without directory structure
- Shorten the current directory path on terminal
- Check if directory/file exist/does not exist
- Determine strand orientation and allele switch in a VCF
- Get the file with the smallest number of lines
- Get the shared positions by three lists
- Reredirect .out and .err outputs to a log file
- Filter a VCF file using certain values in the FILTER tag
- Check that a VCF is sorted (via indexing)
- String Functions in GNU
- Sort numerically a list of files within a directory and save the list into a new file
- Replace './.' or unknown genotypes with '0/0' or homozygous-reference genotypes
- Replace metadata ids in a FASTA file with new ids
- Avoid the '--' group-separator in a grep output providing multiple matches
- Iterate over a list of elements within a tuple
- Annotate fields (AF, AC, MAF, etc.) in a VCF
- Concatenate VCF files into a single-all-chrs VCF file
- Monitor processes in your HPC (squeue, sacct, etc.)
- List file recursively by last modification time
- Remove extension from a file name
- Create a template 'Empy Document' in order to show this option in the contextual menu in Ubuntu
- Set an infinite history length of bashrc in Ubuntu using HISTSIZE and HISTFILESIZE in bash
- Set a string to lowercase or uppercase in AWK
- Mount a network NTFS volume into a local folder from the shell
- Configure a workstation folder mounting on a remote NAS volume from the shell
- Grab the header of a file excluding the last line
- Break MNPs and exclude spanning deletions (SD) from a VCF/gVCF file using BCFtools
- Split a file with a list in a number of sublists using BASH
- Add new line ('\n') at the end of a file using BASH
- Keep only SNPs from a VCF using BCFtools
- Grab the list of variants from a VCF using BCFtools
- Merge a list of VCF files using BCFtools
- List the content of a file, line by line, showing line numbers with BASH
Select region from 1KGP chr.vcf.gz, extract, compress, and tabix
#!/bin/bash
region="3:10000-11000"
infile="ALL.chr3.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
outfile="1KGP.chr3.region"
tabix -h ${infile} ${region} > ${outfile}.vcf
bgzip -c ${outfile}.vcf > ${outfile}.vcf.gz
tabix -p vcf ${outfile}.vcf.gz
# End of script
Remove "chr" string in variants in a VCF
#!/bin/bash
awk '{gsub(/^chr/,""); print}' infile.vcf > infile.no_chr.vcf
# End of script
Add "chr" string in variants in VCF
#!/bin/bash
awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' infile.no_chr.vcf > infile.vcf
# End of script
Sort karyotipically a VCF (version 1)
#!/bin/bash
##Notice that header must be removed at some point
grep '^#' in.vcf > out.vcf && grep -v '^#' in.vcf | sort -V -k1,1 -k2,2n >> out.vcf
# End of script
Sort karyotypically a VCF (version 2). Use '-V', natural sort of (version) numbers within text
#!/bin/bash
sort -V -k1,1 -k2,2n infile.vcf > outfile.vcf
# End of script
Sort karyotypically a VCF (version 3): using vcf-sort (from vcftools)
#!/bin/bash
cat infile.vcf | vcf-sort --chromosomal-order > infile.sorted.vcf
# End of script
Sort karyotypically a VCF (version 4): using PICARD
#!/bin/bash
java -jar picard.jar SortVcf I=unsorted.infile.vcf O=sorted.infile.vcf
# End of script
Replace spaces with a single tab
#!/bin/bash
sed 's/ \+/\t/g' infile > outfile
# End of script
Compute BAM coverage with BEDtools
#!/bin/bash
bedtools genomecov -ibam infile.bam -bg > coverage.txt
# End of script
Find duplicated lines in a VCF matching the whole line
#!/bin/bash
awk ' !uniq[$0]++ ' infile.vcf
# End of script
Find duplicated lines in a VCF matching files 1, 2, and 5
#!/bin/bash
awk ' !uniq[$1 FS $2 FS $5]++ ' infile.vcf
# End of script
Find a line by a field on it, delete it, and save the result
#!/bin/bash
string=abc
grep -v $string infile > outfile
# End of script
Count the number of mapped reads for each read in PE reads
#!/bin/bash
samtools view -F 0x40 infile.bam | cut -f1 | sort | uniq | wc -l
#Left read: view the BAM content filtering by the provided SAM flags, cut the first column,
# sort the data on that column, keep only the uniq data on that column, and count the number of lines.
samtools view -f 0x40 -F 0x4 infile.bam | cut -f1 | sort | uniq | wc -l
#Right read<br>
samtools view -f 0x80 -F 0x4 infile.bam | cut -f1 | sort | uniq | wc -l
Note: for flags information, see page 5 of https://samtools.github.io/hts-specs/SAMv1.pdf
# End of script
Replace white spaces with tabs
#!/bin/bash
awk -v OFS="\t" '$1=$1' infile > outfile
# End of script
Add rs from INFO field (avsnp150) to ID column in a VCF
#!/bin/bash
cat infile | awk '
BEGIN { FS=OFS="\t" }
{
if ($3 == ".") {
$3=match($8, /avsnp150=((rs[0-9]+)|.)/, arr)
$3=arr[0]
sub(/avsnp150=/, "", $3)
}
print
}' > outfile
# End of script
Remove duplicated lines in a file while keeping the original order
#!/bin/bash
awk '!visited[$0]++' infile > deduplicated_infile
#Note: see https://iridakos.com/how-to/2019/05/16/remove-duplicate-lines-preserving-order-linux.html
# Or equivalently (using cat, sort and cut: cat showing numbers, sort using the second column and
# keep the first ocurrence of number, sort using the number, and cut the second column; clever!):
cat -n infile | sort -uk2 | sort -nk1 | cut -f2-
# End of script
Number of files per extension type in the current directory
#!/bin/bash
nfiletypes () { find . -maxdepth 1 -type f | sed 's/.*\.//' | sort | uniq -c \
| sed 's/^ *//g' | sed 's/ /\t/g'; }
# Note: run this code and then write "nfiletypes" at the prompt and will see the count of
# files per extension at the current directory.
# End of script
Parse file with AWK, sum column values in each line, and shows the result
#!/bin/bash
awk -F'[\t]' 'BEGIN{sum=0; OFS="\t"} { for (i=1;i<=NF;i++) a[i]+=$i } END{ for (i in a) print a[i] }' infile
# End of script
Count genotypes in a VCF imputed at Michigan Imputation Server
Based on script by tommycarstensen
In each line of your infile, split the line by ":" and the split again by genotype delimiters (unphased, "/"; phased, "|"). Then, split again the genotype field and assign its values (REF, delimiter, and ALT) to GT awk-variable. Then, parse the GT variable according to the expected genotype values (./. for missing or untyped genotype; 0/0 for hom-ref; ref=alt for hom-alt; and heterozygous for the rest of conditions: 1/0 or 0/1). Finally, show the count of genotypes and, possible, the count of variant genotypes (Het+HomAlt).
#!/bin/bash
zgrep -v "^#" infile.vcf.gz | awk '{
unknown=0; homref=0; het=0; homalt=0; for(i=10;i<=NF;i++) {
split($i,a,":"); split(a[1],GT,"[/|]");
if(GT[1]=="."&>[2]==".") {unknown++}
else if(GT[1]==0&>[2]==0) {homref++}
else if(GT[1]==GT[2]) {homalt++}
else {het++}};
print $1,$2,$3":"$4"-"$5,$4,$5,unknown,homref,het,homalt,het+homalt}' > tmp1
# End of script
Replace spaces with tab
#!/bin/bash
awk -v OFS="\t" '$1=$1' tmp1 > tmp2
# End of script
Prepare a header
#!/bin/bash
echo -e "#CHR\tPOS\tID\tREF\tALT\tnumGT.Unknowns\tnumGT.HomRef\tnumGT.Het \
\tnumGT.HomAlt\tnumGT.(Het+HomAlt)" > header<br>
cat header tmp2 > infile.variant-genotypes.counts
# End of script
Extract variants that have a genotype count equal or higher than a genotype count threshold
#!/bin/bash
awk -F'[ ]' '{ if ($10 >= 5) print $3 }' infile.variant-genotypes.counts > variant-list
# End of script
Count the lines of each file in a dir and get the sum of all lines
Credits: http://stackoverflow.com/questions/13727917/ddg#13728131
#!/bin/bash
find ${indir} -type f -name "*.selected-files" -print0 | wc -l --files0-from=-
# End of script
Count the total number of lines in the selected files of a dir
Credits: http://stackoverflow.com/questions/13727917/ddg#13728131
#!/bin/bash
find ${indir} -type f -name "*.selected-files" -exec cat {} + | wc -l
# End of script
Grab the body of a file excluding the header
#!/bin/bash
tail -n +2 ${infile} > file-without-header
# End of script
Count number of lines in a file
#!/bin/bash
wc -l ${infile}
# End of script
Count number of columns in a file
#!/bin/bash
head -n 1 ${infile} | awk '{print NF}'
#Or
awk '{print NF; exit}' ${infile}
# End of script
Replace many spaces with single tabs, and specially the leading spaces of a PLINK '*.frq' file
#!/bin/bash
sed 's/ \+/\t/g' ${infile} | sed -e 's/^[\t]*//' >${infile}.no-trailing-spaces.tabs
# End of script
Recode (0/0, 0/1, 1/0, 1/1) genotypes into (0,1,2) genotypes
#!/bin/bash
sed 's/0|0/0/g' test | sed 's/0|1/1/g' | sed 's/1|0/1/g' | sed 's/1|1/2/g
# End of script
Convert a VCF into a table of variants combining PERL and GATK4 VariantsToTable
Credits: https://gatkforums.broadinstitute.org/gatk/profile/ericminikel && https://gatkforums.broadinstitute.org/gatk/profile/dobenshain
#!/bin/bash
# Grab INFO tags
zcat infile.vcf.gz | perl -ne '/^.*INFO.*ID=([a-zA-Z0-9_]+),/ && print "-F $1 \\ "' | uniq
# Example of output (allele frequencies from gnomAD):
-F AC -F AF -F AF_afr -F AF_amr -F AF_asj \
# Grab FORMAT tags
zcat infile.vcf.gz | perl -ne '/^.*FORMAT.*ID=([a-zA-Z0-9_]+),/ && print "-GF $1 \\ "' | uniq
# Example of output:
-GF AD \ -GF DP \ -GF GQ \ -GF GT \ -GF MIN_DP \ -GF PGT \ -GF PID \ -GF PL \ -GF RGQ \ -GF SB \
# Complex example using GATK4 (notice that INFO tags are written after "-F" and FORMAT tags are written after "GF").
# GATK3.8 does not produce a proper header in the output and does not allow for a filtering control as GATK4.
module load java-jre/1.8.0_77
module load gatk/4.1.4.0
gatk VariantsToTable \
--variant ${infile} \
--show-filtered true \
--showHidden true \
--verbosity INFO \
-F CHROM \
-F POS \
-F ID \
-F REF \
-F ALT \
-F QUAL \
-F FILTER \
-F AAChange.refGene -F AC -F AF -F AF_afr -F AF_amr -F AF_asj -F AF_eas -F AF_female -F AF_fin -F AF_male -F AF_nfe -F AF_oth -F AF_popmax \
-F AF_raw -F AF_sas -F AN -F BA1 -F BP1 -F BP2 -F BP3 -F BP4 -F BP5 -F BP6 -F BP7 -F BS1 -F BS2 -F BS3 -F BS4 -F BaseQRankSum -F CADD_phred \
-F CADD_raw -F CADD_raw_rankscore -F CLNALLELEID -F CLNDISDB -F CLNDN -F CLNREVSTAT -F CLNSIG -F ClippingRankSum -F DANN_rankscore \
-F DANN_score -F DB -F DP -F DS -F Eigen-PC-raw -F Eigen-raw -F Eigen_coding_or_noncoding -F ExAC_AFR -F ExAC_ALL -F ExAC_AMR -F ExAC_EAS \
-F ExAC_FIN -F ExAC_NFE -F ExAC_OTH -F ExAC_SAS -F ExcessHet -F ExonicFunc.refGene -F FATHMM_converted_rankscore -F FATHMM_pred -F FATHMM_score \
-F FS -F Func.refGene -F GERP++_RS -F GERP++_RS_rankscore -F GTEx_V6_gene -F GTEx_V6_tissue -F Gene.refGene -F GeneDetail.refGene \
-F GenoCanyon_score -F GenoCanyon_score_rankscore -F HaplotypeScore -F InbreedingCoeff -F InterVar_automated -F Interpro_domain \
-F LRT_converted_rankscore -F LRT_pred -F LRT_score -F M-CAP_pred -F M-CAP_rankscore -F M-CAP_score -F MLEAC -F MLEAF -F MQ -F MQRankSum \
-F MetaLR_pred -F MetaLR_rankscore -F MetaLR_score -F MetaSVM_pred -F MetaSVM_rankscore -F MetaSVM_score -F MutationAssessor_pred \
-F MutationAssessor_score -F MutationAssessor_score_rankscore -F MutationTaster_converted_rankscore -F MutationTaster_pred \
-F MutationTaster_score -F NEGATIVE_TRAIN_SITE -F PM1 -F PM2 -F PM3 -F PM4 -F PM5 -F PM6 -F POSITIVE_TRAIN_SITE -F PP1 -F PP2 -F PP3 \
-F PP4 -F PP5 -F PROVEAN_converted_rankscore -F PROVEAN_pred -F PROVEAN_score -F PS1 -F PS2 -F PS3 -F PS4 -F PVS1 -F Polyphen2_HDIV_pred \
-F Polyphen2_HDIV_rankscore -F Polyphen2_HDIV_score -F Polyphen2_HVAR_pred -F Polyphen2_HVAR_rankscore -F Polyphen2_HVAR_score -F QD \
-F RAW_MQ -F ReadPosRankSum -F SIFT_converted_rankscore -F SIFT_pred -F SIFT_score -F SNPEFF_AMINO_ACID_CHANGE -F SNPEFF_CODON_CHANGE \
-F SNPEFF_EFFECT -F SNPEFF_EXON_ID -F SNPEFF_FUNCTIONAL_CLASS -F SNPEFF_GENE_BIOTYPE -F SNPEFF_GENE_NAME -F SNPEFF_IMPACT \
-F SNPEFF_TRANSCRIPT_ID -F SOR -F SiPhy_29way_logOdds -F SiPhy_29way_logOdds_rankscore -F VEST3_rankscore -F VEST3_score -F VQSLOD \
-F avsnp150 -F controls_AF_popmax -F culprit -F cytoBand -F fathmm-MKL_coding_pred -F fathmm-MKL_coding_rankscore -F fathmm-MKL_coding_score \
-F integrated_confidence_value -F integrated_fitCons_score -F integrated_fitCons_score_rankscore -F non_cancer_AF_popmax -F non_neuro_AF_popmax \
-F non_topmed_AF_popmax -F phastCons100way_vertebrate -F phastCons100way_vertebrate_rankscore -F phastCons20way_mammalian \
-F phastCons20way_mammalian_rankscore -F phyloP100way_vertebrate -F phyloP100way_vertebrate_rankscore -F phyloP20way_mammalian \
-F phyloP20way_mammalian_rankscore \
-GF GT \
-GF AD \
-GF DP \
-GF GQ \
-GF PL \
--output ${infile}.Variants-to-Table.txt
# End of script
RefSeq gene annotation with ANNOVAR from a VCF file
Credits: ANNOVAR, https://annovar.openbioinformatics.org/en/latest/user-guide/gene/
The output generates two files starting from "infile.vcf": 'infile.ANNOVAR.gene_annotation.variant_function' and 'infile.ANNOVAR.gene_annotation.exonic_variant_function'
Output "infile.ANNOVAR.gene_annotation.variant_function". The first column in the file output tells whether the variant hit exons or hit intergenic regions, or hit introns, or hit a non-coding RNA genes. If the variant is exonic/intronic/ncRNA, the second column gives the gene name (if multiple genes are hit, comma will be added between gene names); if not, the second column will give the two neighboring genes and the distance to these neighboring genes.
Output "infile.ANNOVAR.gene_annotation.exonic_variant_function". The second output file contains the amino acid changes as a result of the exonic variant. The exact format of the output below may change slightly between different versions of ANNOVAR.
#!/bin/bash
module load annovar/18.04.16
infile="${workdir}/infile.vcf"
outfile="${workdir}/infile.ANNOVAR.out.avinput"
convert2annovar.pl -format vcf4 ${infile} -outfile ${outfile} -allsample -withfreq -include
infile="${workdir}/infile.ANNOVAR.out.avinput"
outfile="${workdir}/infile.ANNOVAR.gene_annotation"
humandb="..ANNOVAR/humandb/"
annotate_variation.pl -out ${outfile} -build hg19 ${infile} ${humandb}
# Outputs two files: 'infile.ANNOVAR.gene_annotation.variant_function' and 'infile.ANNOVAR.gene_annotation.exonic_variant_function'
# End of script
Split multiallelic variants (MNPs) into biallelic variants
Source: http://samtools.github.io/bcftools/bcftools.html#norm
SNPs and indels will be merged separately into two records.
#!/bin/bash
module load bcftools
bcftools norm -m -both infile.vcf.gz -Oz -o outfile.vcf.gz
# End of script
Annotate the ID field of each variant in a VCF file using dbSNP database
Source: http://samtools.github.io/bcftools/bcftools.html#annotate
#!/bin/bash
module load htslib
module load bcftools
# dbSNP database index tbi file should be place within the same folder
dbsnp="../dbSNP/GCF_000001405.25.gz"
indir=".."
infile="${indir}/infile.vcf"
outfile="${infile}.dbsnp.vcf"
# First, compress and tabix the VCF file
bgzip -c ${infile} > ${infile}.gz
tabix -p vcf ${infile}.gz
# Then annotate the ID field
bcftools annotate -c ID -a ${dbsnp} -o ${outfile} ${infile}.gz
# End of script
Annotate the ID field of each variant in a VCF file using dbSNP database
Source: http://samtools.github.io/bcftools/bcftools.html#annotate
#!/bin/bash
# Get the union of variants from two files
awk 'FNR==NR{a[$1];next}$1 on a' ${infile1} ${infile2} > ${union}
# End of script
Parse a VCF file and count the number of HomRef, HomAlt, and Het genotypes from dose data
In each line of a VCF file from Michigan Imputation Server (FORMAT field use to be GT:DS:GP), split the fields from the 10th column by "." (GT:DS:etc.). Then, split the "DS" but without using any delimiter... in other words, do not split. And then take the second field of the array, and assign it to DS variable. Then filter by DS value. From the sums "het+homalt" and "hethomref", we keep the minumum to filter out the variant afterwards.
#!/bin/bash
zgrep -v "^#" ${indir}/${infile} \
| awk 'BEGIN{OFS="\t"} {
homref=0; het=0; homalt=0; for(i=10;i<=NF;i++) {
split($i,a,":"); split(a[2],DS);
if(DS[1]<=0.5) {homref++}
else if(DS[1]>1.5) {homalt++}
else {het++}};
if(het+homalt<het+homref) print $1,$2,$3,$4,$5,homref,het,homalt,het+homalt
else if(het+homalt>=het+homref) print $1,$2,$3,$4,$5,homref,het,homalt,het+homref}' > ${outdir}/${outfile}
# End of script
Transposing the row-vector of individuals into a column-vector
#!/bin/bash
awk '
{
for (i=1; i<=NF; i++) {
a[NR,i] = $i
}
}
NF>p { p = NF }
END {
for(j=1; j<=p; j++) {
str=a[1,j]
for(i=2; i<=NR; i++){
str=str"\t"a[i,j];
}
print j"->"str" MLDOSE" # Prepare a matrix with columns 1, 2, and 3 following probABEL specifications
}
}' $output/individuals > $output/t_individuals
# End of script
Change the sample name in a BAM header using SAMtools
#!/bin/bash
# Define inbam and outbam files
samtools view -H ${inbam} | sed "s/SM:[^\t]*/SM:new-sample-name/g" | samtools reheader - ${bamfile} > ${outbam}
# End of script
Sort a file using a certain numerical column while keeping the header at the top of the output
#!/bin/bash
# The header is the first line at the top of the file
# Column3 in the input file is numeric (i.e. physical position) and will be used as the index to order the dataset
awk 'NR == 1; NR > 1 {print $0 | "sort -k 3"}' ${infile} > ${outfile}
# End of script
Convert a multisample columnar FASTA file into a multisample single-line FASTA file
#!/bin/bash
infile="columnar.fasta"
outfile="singleline.fasta"
awk '{
if (NR==1) { print $0 }
else
{
if ($0 ~ /^>/)
{ print "\n"$0 }
else
{ printf $0 }
}
}' ${infile} > ${outfile}
# End of script
Extract a region of interest from a BAM file with SAMtools and index
#!/bin/bash
# Use at HPC:
module load xz/5.2.5/gcc htslib/1.12/gcc samtools/1.12/gcc
#Define region
region="chr2:163140072-163160072"
indir="path-to-input-dir"
infile="input-file"
outfile="${infile}.${region}"
#Extract Region of Interest
samtools view -b -h ${indir}/${infile}.bam ${region} > ${indir}/${outfile}.bam
#Index
samtools index ${indir}/${outfile}.bam
# End of script
Compress files without directory structure
#!/bin/bash
zip -r -j *.zip
# End of script
Shorten the current directory path on terminal
#!/bin/bash
#See (1): https://www.gnu.org/software/bash/manual/html_node/Controlling-the-Prompt.html
#See (2): https://unix.stackexchange.com/questions/381113/how-do-i-shorten-the-current-directory-path-shown-on-terminal
#Open the shell, write this, and press ENTER:
PS1='\u:\W\$ '
#Done!
# End of script
Check if directory/file exist/does not exist
#!/bin/bash
##### DIRECTORIES
#Check whether a directory exists or not.
indir="your-directory"
#Check if a directory exists
if [ -d "${indir}" ]; then
echo "${indir} directory exists."
fi
#Check if a directory does not exist
if [ ! -d "${indir}" ]; then
echo "${indir} directory does not exist."
fi
##### FILES
#Check whether file exists or not.
infile="your-file"
#Check if a file exists
if [ -f "${infile}" ]; then
echo "${infile} exists."
fi
#Check is a file does not exist
if [ ! -f "${infile}"]; then
echo "${infile} does not exist."
fi
#Check if a file is not empty
if [ -s "${infile}" ]; then
echo "${infile} is not empty."
fi
#Together
if [ -f "${infile}" ]; then
echo "${infile} exists."
else
echo "${infile} does not exist."
fi
#Check if multiple files exist
#Use '-a' or '&&' together with [[
#With '-a'
if [ -f "${infile1}" -a -f "${infile2}" ]; then
echo "${infile1} and ${infile2} exist."
fi
#With '&&'
if [[ -f "${infile1}" && -f "${infile2}" ]]; then
echo "${infile1} and ${infile2} exist."
fi
# End of script
Determine strand orientation and allele switch in a VCF
#!/bin/bash
Source: http://samtools.github.io/bcftools/howtos/plugin.fixref.html
Use BCFtools plugins.
reference="path-to-reference-fasta"
infile="infile.vcf.gz"
#Stats to assess the number of REF allele mismatches and the number of non-biallelic sites
bcftools +fixref ${infile} -- -f ${reference}
#Check the reference allele mismatches
bcftools norm --check-ref e -f ${reference} ${infile} -Ou -o /dev/null
# End of script
Get the file with the smallest number of lines
#!/bin/bash
wc -l file1 > count1
wc -l file2 > count2
wc -l file3 > count3
if [ ${count1} -lt ${count2} ] && [ ${count1} -lt ${count3} ]
then
cod=1
count=${count1}
file="${infile1}"
echo "File number: ${cod}"
echo "Number of lines: "${count}
elif [ ${count2} -lt ${count1} ] && [ ${count2} -lt ${count3} ]
then
cod=2
count=${count2}
file="${infile2}"
echo "File number: ${cod}"
echo "Number of lines: "${count}
else
cod=3
count=${count3}
file="${infile3}"
echo "File number: ${cod}"
echo "Number of lines: "${count}
fi
# End of script
Get the shared positions by three lists
#!/bin/bash
#cat list1
#1
#2
#3
#4
#cat list2
#1
#2
#3
#4
#5
#cat list3
#2
#3
#4
# Cat list of positions
# Then sort them
# Count the number of duplicated lines
# Trim any leading and trailing white spaces
# Grab only the lines starting by 3 (3 lines duplicated)
# Keep only the selected lines and save into 'list123' file
cat list1 list2 list3 | sort -k1,1n | uniq -c | awk '{$1=$1;print}' | grep "^3" | awk '{ print $2 }' > list123
#cat list1 list2 list3 | sort -k1,1n | uniq -c
# 2 1
# 3 2
# 3 3
# 2 4
# 2 5
# 1 6
# 1 7
# 1 8
#cat list1 list2 list3 | sort -k1,1n | uniq -c | awk '{$1=$1;print}'
#2 1
#3 2
#3 3
#2 4
#2 5
#1 6
#1 7
#1 8
#cat list1233
#2
#3
# End of script
Reredirect .out and .err outputs to a log file
#!/bin/bash
log="your-log-file"
./your-script.sh 2>&1 | tee -a ${log}
'your-line-commands-here' 2>&1 | tee -a ${log}
# End of script
Filter a VCF file using certain values in the FILTER tag
#!/bin/bash
bcftools view -i "%FILTER='PASS' | %FILTER='.'" infile.vcf.gz -Oz -o outfile.vcf.gz
# End of script
Check that a VCF is sorted (via indexing)
#!/bin/bash
bgzip -c infile.vcf > infile.vcf.gz
bcftools index -t -f infile.vcf.gz
#If the latter command fails, the file is not sorted.
# End of script
String Functions in GNU
Sort numerically a list of files within a directory and save the list into a new file
Source: https://stackoverflow.com/questions/13360925/sort-files-numerically-in-bash
#!/bin/bash
#Search for and sort '*.chr22:a-b.Fst' files, where 'a' and 'b' are numbers (e.g., typically, a chromosome region)
#Use '-v1' parameter:
# -v: natural sort of (version) numbers within text
# -1: list one file per line
#List files numerically and save the list into ${list}
ls -v1 "*.Fst" > ${list}
# End of script
Replace './.' or unknown genotypes with '0/0' or homozygous-reference genotypes
#!/bin/bash
sed 's/\.\/\./0\/0/g' infile.vcf > infile.with-replaced-genotypes.vcf
# End of script
Replace metadata ids in a FASTA file with new ids
#!/bin/bash
#Files:
#'map' = a two-column file with old and new ids
#'old-ids.fasta' = a FASTA file with old-ids
#'new-ids.fasta' = a new FASTA file with new ids replacing old-ids
#'map' file is a two column file as follows:
#old-id1 [tab] new-id1
#old-id2 [tab] new-id2
#old-id3 [tab] new-id3
#... [tab] ...
#Example of 'map' file:
#HUNSC_ITER_150265155 hCoV-19/Spain/CN-HUNSC_ITER_150265155/2022
#HUNSC_ITER_900188949 hCoV-19/Spain/CN-HUNSC_ITER_900188949/2022
#HUNSC_ITER_180111581 hCoV-19/Spain/CN-HUNSC_ITER_180111581/2022
#HUNSC_ITER_402054223 hCoV-19/Spain/CN-HUNSC_ITER_402054223/2022
#... ...
#Example of 'old-ids.fasta' file:
#>HUNSC_ITER_150265155
#TCTTGTAGATCTGTTCTCTAAACGAACTTTAAAAT...
#>HUNSC_ITER_900188949
#TTGTAGATCTGTTCTCTAAACGAACTTTAAAATCT...
#>HUNSC_ITER_180111581
#TTGTAGATCTGTTCTCTAAACGAACTTTAAAATCT...
awk -F'\t' '
NR==FNR { map[">"$1] = ">"$2; next }
$0 in map { $0 = map[$0] }
{ print }
' map old-ids.fasta > new-ids.fasta
#Example of output in 'new-ids.fasta' file:
#>hCoV-19/Spain/CN-HUNSC_ITER_150265155/2022
#TCTTGTAGATCTGTTCTCTAAACGAACTTTAAAAT...
#>hCoV-19/Spain/CN-HUNSC_ITER_900188949/2022
#TTGTAGATCTGTTCTCTAAACGAACTTTAAAATCT...
#>hCoV-19/Spain/CN-HUNSC_ITER_180111581/2022
#TTGTAGATCTGTTCTCTAAACGAACTTTAAAATCT...
# End of script
Avoid the '--' group-separator in a grep output providing multiple matches
#!/bin/bash
# According to 'man grep':
#-A NUM, --after-context=NUM
# Print NUM lines of trailing context after matching lines.
# Places a line containing a group separator (--) between
# contiguous groups of matches. With the -o or --only-matching
# option, this has no effect and a warning is given.
#Grep a multisample linear FASTA file witn '-A' and get rid of the annoying
#'--' lines when multiple matches are found
grep -A 1 --no-group-separator infile.fasta > outfile.fasta
#Or alternatively clean the output avoind the '--' lines
my-grep-command | sed '/^--$/d'
# End of script
Iterate over a list of elements within a tuple
Source: https://stackoverflow.com/questions/9713104/loop-over-tuples-in-bash
#!/bin/bash
#Example:
#Instead of nesting two for-loops (one for population and one for number of individuals
#within the selected population), define an array of tuples with two elements each.
for tuple in "CEU 100" "IBS 121" "GBR 106"
do
# convert the "tuple" into the param args $1 $2...
set -- ${tuple}
pop=${1}
n=${2}
done
#Alternatively, define explicitely the separator between tuple values using IFS
for tuple in CEU,100 IBS,121 GBR,106
do
IFS=',' read -a strarr <<< "${tuple}"
pop=${strarr[0]}
n=${strarr[1]}
done
# End of script
Annotate fields (AF, AC, MAF, etc.) in a VCF
Source: https://www.biostars.org/p/180894/
#!/bin/bash
infile="in.vcf"
outfile="out.vcf"
bcftools +fill-tags ${infile} -- -t AF,AC,MAF > ${outfile}
# End of script
Concatenate VCF files into a single-all-chrs VCF file
#!/bin/bash
#Prepare a list of individuals VCF.gz files sorted numerically
ls -v1 *.vcf.gz > vcf.gz.list
cat vcf.gz.list
#Concatenate VCF files into a single-all-chrs VCF file
bcftools concat -f vcf.gz.list -Oz -o all-chrs-VCF.vcf.gz
# End of script
Monitor processes in your HPC (squeue, sacct, etc.)
#!/bin/bash
#Use the shell and connect to your HPC running SLURM
#Example1: monitor partitions and jobs
watch squeue
#Example2: monitor running jobs and subjobs
watch sacct
# End of script
List file recursively by last modification time
#!/bin/bash
ls -Rt
# End of script
Remove extension from a file name
#!/bin/bash
filename="basename.ext"
n=${filename%.*}
echo ${n}`
# End of script
Create a template 'Empy Document' in order to show this option in the contextual menu in Ubuntu
#!/bin/bash
#Open a shell and run this:
touch ~/Templates/Empty\ Document
#Then go to any folder and right-click to see that the contextual
#item 'New Document' > 'Empty Document' has been created.
# End of script
Set an infinite history length of bashrc in Ubuntu using HISTSIZE and HISTFILESIZE in bash
#!/bin/bash
#Edit the '.bashrc' file in your /home/user directory
#Find the HISTSIZE and HISTFILESIZE sections and replace with these values:
# Numeric values less than zero result in every command being saved on the history list (there is no limit).
# Old value: HISTSIZE=1000
HISTSIZE=-1
# Non-numeric values and numeric values less than zero inhibit truncation.
# Old value: HISTFILESIZE=2000
HISTFILESIZE=-1
# End of script
Set a string to lowercase or uppercase in AWK
#!/bin/bash
#To lowercase field in position 1
awk -F'[\t]' '{ tolower($1) }' infile > outfile
#To uppercase field in position 2
awk -F'[\t]' '{ tolower($2) }' infile > outfile
# End of script
Mount a network NTFS volume into a local folder from the shell
#!/bin/bash
sudo mount.cifs -v //192.168.xx.yy/<remote-folder> \
/mnt/cifs1 -o sec=ntlmv2,vers=1.0,domain=<your-domain>,username=<username>
# End of script
Configure a workstation folder mounting on a remote NAS volume from the shell
#!/bin/bash
sudo mount -t cifs \
-o user=<username>,password=<password>,iocharset=utf8,noperm \
//192.168.xx.yy/<remote-folder> \
/mnt/<local-folder>
# End of script
Grab the header of a file excluding the last line
#!/bin/bash
#Example: keep the header of a VCF file except the last line (i.e., the line with CHR, POS, ID,
#REF, ALT, and the rest of fields).
head -n -1 ${infile} > file-without-the-last-line
# End of script
Break MNPs and exclude spanning deletions (SD) from a VCF/gVCF file using BCFtools
#!/bin/bash
#Define your files
infile="file.vcf.gz"
outfile="file.noMNP.noSD.vcf.gz"
# For each single VCF or gVCF, split MNPS and avoid spanning-deletions (marked as ALT=*)(SD), and index
bcftools norm -m -both ${infile} -Ou | \
bcftools view -e 'ALT="*"' -Oz -o ${outfile}
bcftools index -f -t ${outfile}
# End of script
Split a file with a list in a number of sublists using BASH
#!/bin/bash
# Split a 'list' file (i.e., with sample codes in rows) into 10 new sublists
# 'sublist' is the name preffix for new files
split -d list sublist -n 10
# End of script
Add new line ('\n') at the end of a file using BASH
#!/bin/bash
sed -i -z 's/$/\n/g' infile > outfile
# End of script
Keep only SNPs from a VCF using BCFtools
#!/bin/bash
infile="in.vcf.gz"
outfile="out.onlySNPs.vcf.gz"
bcftools view --types snps -m 2 -M 2 ${infile} -Oz -o ${outfile}
# End of script
Grab the list of variants from a VCF using BCFtools
#!/bin/bash
infile="in.vcf.gz"
outfile="variants.list"
bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\n' ${infile} > ${outfile}
# End of script
Merge a list of VCF files using BCFtools
#!/bin/bash
list="list.of.vcf.files"
outfile="merged.vcf.gz"
#Merge
bcftools merge -m none -l ${list} -Oz -o ${outfile}
#Index
bcftools index -f -t ${outfile}
# End of script
List the content of a file, line by line, showing line numbers with BASH
#!/bin/bash
infile="filename-here"
#Option1
awk '{print NR "> " $s}' ${infile}
#Option2
#See 'nl --help' for options
nl -b a ${infile}
#Option3
#See 'pr --help' for options
pr -t -n ${infile}
# End of script