Transposons from Long DNA Reads
tldr requires python > 3.6 and has the following dependencies:
- HTSLIB/Samtools
- minimap2
- MAFFT
- Exonerate
- some python dependencies in the background.
There is a pre-baked Conda (or mamba) environment file provided (tldr.yml) that can be used to create a tldr Conda environment with all of the necessary dependencies.
git clone https://github.com/adamewing/tldr.git
cd tldr
conda env create -f tldr.yml
conda activate tldr
pip install -e $PWD
tldr -h
If you use the above method, make sure to activate the Conda environment first with conda activate tldr whenever using tldr.
Easiest method is via conda:
conda install -c bioconda tabix
conda install -c bioconda samtools
Manual installation:
git clone https://github.com/samtools/htslib.git
git clone https://github.com/samtools/samtools.git
make -C htslib && sudo make install -C htslib
make -C samtools && sudo make install -C samtools
Via conda:
conda install -c bioconda minimap2
For manual installation see the minimap2 GitHub repository
Via conda:
conda install -c bioconda mafft
For manual installation see the mafft website
Note: different versions of MAFFT may yield different results from tldr. We currently recommend MAFFT v7.480.
Via conda:
conda install -c bioconda exonerate
For manual installation see the exonerate website
Install tldr package + python dependencies:
python setup.py install
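After installation, it can help to confirm that the external tools are reachable before running tldr. The following is a minimal sketch, not part of tldr itself: the executable names are assumptions based on the dependency list above, and `missing_tools` is a hypothetical helper.

```python
import shutil

# Executable names are assumptions inferred from the dependency list
# (HTSLIB/Samtools, minimap2, MAFFT, Exonerate); adjust if yours differ.
REQUIRED_TOOLS = ("samtools", "minimap2", "mafft", "exonerate")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All external dependencies were found on PATH.")
```

Run it inside the activated Conda environment if you installed the dependencies that way, since the check only sees what is on the current PATH.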
Synopsis (minimal input requirements), assuming reads aligned to hg38 using minimap2:
tldr -b aligned_reads.bam -e /path/to/tldr/ref/teref.ont.human.fa -r /path/to/minimap2-indexed/reference/genome.fasta --color_consensus
Multiple .bam files can be provided in a comma-delimited list.
Reference elements in .fasta format. The header for each should be formatted as >Superfamily:Subfamily, e.g. >ALU:AluYb9.
If none is specified instead of a filename, tldr will run without a reference TE collection. This is useful for genomes where active mobile element content is not well understood, for unbiased identification of inserted sequences, and for identifying viral integration and gene retrocopy insertions.
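A custom TE reference can be checked against the header convention before it is passed to tldr. This is a minimal sketch under the assumption that each header is exactly `>Superfamily:Subfamily` with no whitespace or trailing description; `check_te_headers` is a hypothetical helper and tldr's own parsing may be more permissive.

```python
import re

# Assumed strict form: ">Superfamily:Subfamily", one colon, no whitespace.
HEADER_RE = re.compile(r"^>([^\s:]+):([^\s:]+)$")


def check_te_headers(fasta_lines):
    """Split FASTA header lines into valid and invalid ones.

    Returns (valid, invalid), where `valid` is a list of
    (superfamily, subfamily) tuples and `invalid` is a list of the
    header lines that do not match the assumed pattern.
    """
    valid, invalid = [], []
    for line in fasta_lines:
        line = line.rstrip()
        if not line.startswith(">"):
            continue  # sequence line, not a header
        match = HEADER_RE.match(line)
        if match:
            valid.append((match.group(1), match.group(2)))
        else:
            invalid.append(line)
    return valid, invalid
```

An open file handle can be passed directly, for example `check_te_headers(open("my_elements.fa"))`, where the filename is a placeholder.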
Reference genome .fasta, expects a samtools index, i.e. samtools faidx.
Spread work over p processes. Uses python multiprocessing.
Minimum supporting read count to trigger a consensus / insertion call (default = 3)
Minimum number of reads completely embedding the insertion (default = 1, requires at least 1).
Specify a base name for output files. The default is to use the name of the input bam(s) without the .bam extension, joined with "_" if more than one .bam file is given.
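The default naming rule can be illustrated with a short sketch. This is not tldr's code: `default_outbase` is a hypothetical function, and it assumes that directory components are dropped from each input path, which the description above does not state.

```python
import os


def default_outbase(bams):
    """Derive an output base name from a comma-delimited list of .bam paths.

    Assumption: only the file name (not the directory) is used.
    """
    names = []
    for path in bams.split(","):
        name = os.path.basename(path.strip())
        if name.endswith(".bam"):
            name = name[: -len(".bam")]
        names.append(name)
    return "_".join(names)
```

Under these assumptions, `default_outbase("/data/a.bam,/data/b.bam")` gives `a_b`.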
Specify a text file of chromosome names (one per line) and tldr will focus only on these.
Maximum insertion size (default = 10000)
Minimum insertion size (default = 200)
Parameter for allowing base changes in consensus cleanup (default = 0.5)
Parameter for allowing base changes in consensus cleanup (default = 3)
Parameter for allowing base changes in consensus cleanup (default = 0.25)
Limit cluster size and downsample clusters larger than the cutoff (default = no limit). Downsampling is biased such that reads completely embedding the inserted sequence are preferred.
Allows for sloppy breakpoints in initial breakpoint search (default = 50)
Trim reads to contain at most --flanksize bases on either side of the insertion. Setting this too large makes consensus building slower and more error-prone.
Annotate insertion with known non-reference insertion sites (examples provided in /path/to/tldr/ref).
This will annotate the consensus sequence with ANSI escape characters that yield coloured text on the command-line:
red = TSD, blue = TE insertion sequence, yellow = non-TE insertion sequence
While this looks nice on the command line (try less -R) and is helpful for evaluating insertion calls, the output may not translate well to other applications as the escape sequences for the ANSI colours will be embedded in the sequence.
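If a coloured consensus needs to be passed to another application, the escape sequences can be removed afterwards. The sketch below assumes the colours are standard ANSI SGR sequences (of the form `ESC[...m`); `strip_ansi` is a hypothetical helper, not part of tldr.

```python
import re

# Matches ANSI SGR (colour/style) sequences such as "\x1b[31m" and "\x1b[0m".
ANSI_SGR_RE = re.compile(r"\x1b\[[0-9;]*m")


def strip_ansi(text):
    """Return `text` with ANSI colour escape sequences removed."""
    return ANSI_SGR_RE.sub("", text)
```

Applying this to the consensus field of each row should recover a plain sequence string that other tools can read.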
Creates a directory (name is the output base name) with extended consensus sequences, per-insertion read mapping information and per-insertion .bam files. Required for mCpG analysis.
If the --detail_output option is enabled, extend the per-sample output consensus by n bases (default 0). This is useful in the analysis of CpG methylation to add context on either end of the insertion.
Adds 5' and 3' transduction columns needed by the call_transductions.py script, if you're into that kind of thing.
Saves pickles for later.
Search specified folder for .pickle files and use them instead of clustering reads. Faster for re-running with different options, requires --keep_pickles.
Some fields in the output table (basename.table.txt) may not be self-explanatory:
Start / end position relative to TE consensus provided via -e/--elts
Length of actual inserted sequence. Not necessarily the same as EndTE-StartTE
Internal inversion detected in TE
Fraction of inserted sequence covered by TE sequence
Median mapping quality score from input .bam(s)
Overall mean identity to TE in reference library (-e/--elts)
Number of reads used in consensus generation
Number of reads which completely embed the insertion
Number of samples (.bam files) in which the insertion was detected
Per-sample accounting of supporting reads
Number of reads spanning both TSDs (+/- the --wiggle parameter) with no evidence for insertion. Useful for inferring genotype.
If -n/--nonref is given, annotate whether the insertion is a known non-reference insertion ("NA" otherwise)
Target site duplication (based on reference genome)
Upper case bases = reference genome sequence, lower case bases = insertion sequence. If --color_consensus is given, the TSD will be red, the TE will be blue, and other inserted sequence (e.g. transduction) will be yellow using ANSI terminal colours (may be affected by specific terminal config)
Annotate whether an insertion call is problematic; "PASS" otherwise (similar to VCF filter column).
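The fields described above can be used to post-process the output table. The following is a minimal sketch, assuming the table is tab-delimited with a header row and that the filter and consensus columns are named `Filter` and `Consensus`; those column names and both helper functions are assumptions for illustration, so check them against the header of your own table.

```python
import csv


def pass_insertions(table_lines, filter_col="Filter"):
    """Return rows (as dicts) whose filter column equals "PASS".

    `table_lines` is any iterable of lines, e.g. an open file handle.
    The column name "Filter" is an assumption.
    """
    reader = csv.DictReader(table_lines, delimiter="\t")
    return [row for row in reader if row[filter_col] == "PASS"]


def inserted_sequence(consensus):
    """Return only the lower-case (inserted) bases of a consensus.

    Assumes a plain consensus without ANSI colour codes, i.e. one
    produced without --color_consensus.
    """
    return "".join(base for base in consensus if base.islower())
```

For example, `pass_insertions(open("basename.table.txt"))` followed by `inserted_sequence(row["Consensus"])` for each row would give the inserted bases of every passing call, under the naming assumptions above.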
Non-reference methylation can be assessed using the scripts located in the scripts/ directory:
script | description |
---|---|
tldr_callmeth.sh | Must be run from within the directory where nanopolish index was run to index a .fastq file against a set of ONT .fast5 files. Takes as input a .fastq (indexed via nanopolish index), an output directory created via the --detail_output option, a UUID and a sample name. Creates a tabix-indexed table from the output of nanopolish call-methylation on the sample+uuid combination. Can be automated via xargs or GNU parallel. |
tablemeth_nonref.py | Creates a table with per-element mCpG summary data given a tldr output table and the directory created by --detail_output . Only considers element + sample combinations from the tldr table where tldr_callmeth.sh has been run. Requires pysam, pandas, numpy, and scipy. |
plotmeth_nonref.py | Makes a plot of a TE (requires running tldr_callmeth.sh first) plus the surrounding region if --extend_consensus is specified. Tracks include translation to CpG space, raw log-likelihood, and smoothed methylation fraction. Requires pysam, pandas, numpy, scipy, matplotlib, and seaborn. |
See https://github.com/adamewing/te-nanopore-tools
Adam D. Ewing, Nathan Smits, Francisco J. Sanchez-Luque, Sandra R. Richardson, Seth W. Cheetham, Geoffrey J. Faulkner. Nanopore Sequencing Enables Comprehensive Transposable Element Epigenomic Profiling. 2020. Molecular Cell, Online ahead of print: https://doi.org/10.1016/j.molcel.2020.10.024
Reporting issues and questions through GitHub is preferred over e-mail.