flair

FLAIR (Full-Length Alternative Isoform analysis of RNA) for the correction, isoform definition, and alternative splicing analysis of noisy reads. FLAIR has primarily been used for nanopore cDNA, native RNA, and PacBio sequencing reads.

Overview

FLAIR can optionally be run with short-read data to help increase the splice site accuracy of the long-read splice junctions. FLAIR uses multiple alignment steps and splice site filters to increase confidence in the set of isoforms defined from noisy data. FLAIR was designed to be able to sense subtle splicing changes in the nanopore data from Tang et al. (2018); please read that paper for further description of some of the methods.

[Figure: flair workflow]

It is recommended to combine all samples prior to running the FLAIR modules for isoform assembly, and then to assign the reads of each sample individually to isoforms of the combined assembly for downstream analyses. Note that bed12 and PSL files can be converted into one another with kentUtils bedToPsl or pslToBed, or with bin/bed_to_psl.py.

Requirements

  1. python v2.7+ and python modules: Cython, intervaltree, kerneltree, tqdm, pysam
  2. bedtools, samtools
  3. minimap2

FLAIR modules

flair.py is a wrapper script with modules for running various processing scripts located in bin/. Modules are assumed to be run in order (align, correct, collapse), but the user can forgo the wrapper if a more custom build is desired.

flair align

Aligns reads to the genome using minimap2, and converts the aligned minimap2 sam output to BED12 and optionally PSL. Aligned reads in psl format can be visualized in IGV or the UCSC Genome browser.

Alternatively, the user can align the reads themselves with their aligner of choice and convert the bam output to bed12 using bin/bam2Bed12.py to supply to flair-correct. This step smooths gaps in the alignment.

Usage:

python flair.py align -r <reads.fq>/<reads.fa> -g genome.fa [options]

Run with --help for a description of optional arguments. Outputs (1) sam of raw aligned reads and (2) smoothed bed12 file of aligned reads to be supplied to flair-correct.
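As an illustration of what smoothing gaps can mean, here is a minimal Python 3 sketch that merges alignment blocks separated by small gaps. It assumes that smoothing amounts to joining neighbouring blocks; the function name `smooth_blocks` and the `max_gap` threshold of 30 are hypothetical and are not taken from the FLAIR code.

```python
def smooth_blocks(blocks, max_gap=30):
    """Merge (start, end) alignment blocks whose separating gap is at most max_gap bases.

    Assumption: small gaps are alignment noise (e.g. deletions), while larger
    gaps are introns and must be kept. The threshold is illustrative only.
    """
    merged = []
    for start, end in sorted(blocks):
        if merged and start - merged[-1][1] <= max_gap:
            # Small gap: extend the previous block instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    # The 5 bp gap is smoothed away; the 1000 bp gap is kept as an intron.
    print(smooth_blocks([(0, 100), (105, 200), (1200, 1300)]))
```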

flair correct

Corrects misaligned splice sites using genome annotations.

Usage:

python flair.py correct -f annotation.gtf -c chromsizes.tsv -q query.bed12 [options]

Run with --help for a description of optional arguments. Outputs (1) bed12 of corrected reads, (2) bed12 of reads that could not be corrected, (3) psl of corrected reads to be supplied to flair-collapse.
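The idea of correcting a misaligned splice site against annotation can be sketched as a nearest-neighbour lookup. The Python 3 example below is only an illustration under stated assumptions: the names `snap_to_annotated` and `correct_junctions` and the 10 bp `window` are hypothetical, and the actual correction logic in FLAIR may differ.

```python
from bisect import bisect_left


def snap_to_annotated(site, annotated_sites, window=10):
    """Return the annotated site nearest to `site`, or None if none is within `window` bp.

    `annotated_sites` must be a sorted list of coordinates on one chromosome.
    """
    i = bisect_left(annotated_sites, site)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = annotated_sites[max(i - 1, 0):i + 1]
    if not candidates:
        return None
    best = min(candidates, key=lambda s: abs(s - site))
    return best if abs(best - site) <= window else None


def correct_junctions(junctions, intron_starts, intron_ends, window=10):
    """Correct (start, end) intron pairs; return None if any site cannot be corrected."""
    corrected = []
    for start, end in junctions:
        new_start = snap_to_annotated(start, intron_starts, window)
        new_end = snap_to_annotated(end, intron_ends, window)
        if new_start is None or new_end is None:
            return None  # such a read would go to the uncorrected output
        corrected.append((new_start, new_end))
    return corrected


if __name__ == "__main__":
    print(correct_junctions([(103, 198)], [100], [200]))
```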

Short-read junctions

To use short-read splice sites to aid with correction, use junctionsFromSam.py to extract splice junctions.

Usage:

python junctionsFromSam.py -s shortreads.sam -n outname -o outdir

The junctions file to supply to flair-correct with -j is written to outname_junctions.bed.

Alternatively, splice junctions from STAR 2-pass alignment of short-reads (SJ.out.tab) can also be supplied for junctions.
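If low-support junctions are a concern, the STAR file can be filtered before it is supplied. The Python 3 sketch below keeps the SJ.out.tab format unchanged and only drops junctions with few uniquely mapping reads (column 7 of SJ.out.tab). The function name `filter_sj_lines` and the threshold of 3 reads are assumptions for illustration and are not part of FLAIR.

```python
def filter_sj_lines(lines, min_unique_reads=3):
    """Keep SJ.out.tab lines supported by at least `min_unique_reads` unique reads.

    SJ.out.tab columns (tab-separated): chromosome, intron start (1-based),
    intron end (1-based), strand (0/1/2), intron motif, annotated flag,
    uniquely mapping reads, multi-mapping reads, maximum overhang.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed or empty lines
        if int(fields[6]) >= min_unique_reads:
            kept.append(line)
    return kept


if __name__ == "__main__":
    example = [
        "chr1\t1000\t2000\t1\t1\t1\t12\t0\t40\n",
        "chr1\t3000\t4000\t2\t2\t0\t1\t3\t15\n",
    ]
    for line in filter_sj_lines(example):
        print(line, end="")
```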

flair collapse

Defines isoforms from corrected reads. By default, redundant isoforms (those that are proper subsets of another isoform in the set) are filtered out, an option that can be toggled with -e. As FLAIR does not use annotations to define isoforms, within a set of reads that define an isoform, FLAIR will pick the name of a read to be the isoform name. It is recommended to provide a GTF with -f, which is used to rename FLAIR isoforms that match isoforms in existing annotation according to their Ensembl ID. This can help with sorting for/against annotated isoforms just by grep [-v] ENST. Again, isoforms in psl format can be visualized in IGV or the UCSC genome browser if columns after 21 (1-indexed) are removed.
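To make the subset filter concrete, here is a minimal Python 3 sketch. It assumes that "proper subset" means an isoform's splice junction chain appears as a consecutive run within a longer isoform's chain; that interpretation, the function names, and the dictionary input are illustrative assumptions and not FLAIR's implementation. Single-exon isoforms (empty chains) are left untouched here.

```python
def is_contiguous_subchain(short, long):
    """True if junction chain `short` occurs as a consecutive run inside the longer `long`."""
    n, m = len(short), len(long)
    if n == 0 or n >= m:
        return False
    return any(long[i:i + n] == short for i in range(m - n + 1))


def filter_redundant(isoforms):
    """Drop isoforms whose junction chain is contained in another isoform's chain.

    `isoforms` maps an isoform name to a list of (intron_start, intron_end) tuples.
    """
    kept = {}
    for name, chain in isoforms.items():
        redundant = any(
            is_contiguous_subchain(chain, other)
            for other_name, other in isoforms.items()
            if other_name != name
        )
        if not redundant:
            kept[name] = chain
    return kept


if __name__ == "__main__":
    isoforms = {
        "full": [(10, 20), (30, 40), (50, 60)],
        "fragment": [(30, 40), (50, 60)],   # contained in "full"
        "skipping": [(10, 20), (50, 60)],   # not a consecutive run of "full"
    }
    print(sorted(filter_redundant(isoforms)))
```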

Usage:

python flair.py collapse -r <reads.fq>/<reads.fa> -q query.psl -g genome.fa [options]

Run with --help for a description of optional arguments. Outputs (1) extended psl containing the data-specific isoforms and (2) fasta file of isoform sequences.
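Because the extended psl carries extra columns, it has to be trimmed to the standard 21 columns before loading it into IGV or the UCSC genome browser. A minimal Python 3 sketch of that trimming follows; the helper name `trim_psl_line` and the file names in the usage comment are hypothetical.

```python
def trim_psl_line(line, n_columns=21):
    """Return a tab-separated PSL line cut down to its first `n_columns` fields."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(fields[:n_columns])


def trim_psl_file(in_path, out_path, n_columns=21):
    """Write a browser-compatible copy of an extended PSL file."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(trim_psl_line(line, n_columns) + "\n")


# Usage (file names are placeholders):
# trim_psl_file("isoforms.psl", "isoforms.21col.psl")
```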

Quantification

To quantify the expression of each isoform for a specific sample for use in other scripts:

  1. Align read sequences to the isoform sequences using minimap2 (--secondary=no option recommended, alternatively primary alignments can be selectively retained with samtools view -F 256 -S on the resulting sam)
  2. Count read-isoform assignments - bin/count_sam_genes.py sam counts.txt
  3. Append a new column to the isoform file containing the sample-specific isoform expression - bin/match_counts.py counts.txt isoforms.psl 1 isoforms.out.psl
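The counting in step 2 can be pictured with the following Python 3 sketch, which tallies primary alignments per isoform from SAM text. It is not the code of bin/count_sam_genes.py; the function name is hypothetical, and skipping supplementary alignments (flag 0x800) in addition to secondary ones (flag 0x100, i.e. 256) is an assumption made here.

```python
from collections import Counter


def count_primary_alignments(sam_lines):
    """Count reads per reference (isoform) name, using primary alignments only."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # 0x4 = unmapped, 0x100 = secondary, 0x800 = supplementary
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        counts[fields[2]] += 1  # RNAME is the third SAM column
    return counts


if __name__ == "__main__":
    sam = [
        "@SQ\tSN:iso1\tLN:1000\n",
        "read1\t0\tiso1\t1\t60\t100M\t*\t0\t0\t*\t*\n",
        "read1\t256\tiso2\t1\t0\t100M\t*\t0\t0\t*\t*\n",
        "read2\t16\tiso1\t5\t60\t100M\t*\t0\t0\t*\t*\n",
        "read3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*\n",
    ]
    print(dict(count_primary_alignments(sam)))
```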

Scripts

We have also provided standalone scripts for splicing and productivity analysis of quantified isoforms from FLAIR output.

mark_intron_retention.py

Requires three positional arguments to identify intron retentions in isoforms: (1) a psl of isoforms, (2) psl file output name, (3) txt file output name for coordinates of introns found.

Usage:

python mark_intron_retention.py isoforms.psl isoforms.ir.psl coords.txt

Outputs (1) an extended psl with an additional column containing a value of 0 or 1, classifying the isoform as spliced or intron-retaining, respectively; (2) a txt file of intron retentions with the columns: isoform name, chrom, intron 5', intron 3'.
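One way to think about intron retention is that an exon of one isoform fully spans an intron that is spliced out in another isoform. The Python 3 sketch below illustrates that idea only; it assumes all isoforms are on the same chromosome and strand, uses half-open (start, end) exon coordinates, and its function names are hypothetical rather than those of mark_intron_retention.py.

```python
def introns_of(exons):
    """Return the introns between consecutive exons as (start, end) pairs."""
    exons = sorted(exons)
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]


def mark_intron_retention(isoforms):
    """Map each isoform name to (flag, retained_introns); flag is 1 if any intron is retained.

    `isoforms` maps a name to a list of (exon_start, exon_end) tuples.
    """
    all_introns = set()
    for exons in isoforms.values():
        all_introns.update(introns_of(exons))

    result = {}
    for name, exons in isoforms.items():
        # An intron is retained if it lies entirely inside one exon of this isoform.
        hits = sorted(
            (i_start, i_end)
            for i_start, i_end in all_introns
            if any(e_start <= i_start and i_end <= e_end for e_start, e_end in exons)
        )
        result[name] = (1 if hits else 0, hits)
    return result


if __name__ == "__main__":
    print(mark_intron_retention({
        "spliced": [(0, 100), (200, 300)],
        "retaining": [(0, 300)],
    }))
```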

mark_productivity.py

Requires three positional arguments to classify isoforms according to productivity: (1) reads or isoforms in psl format, (2) gtf genome annotation, (3) fasta genome sequences.

Usage:

python mark_productivity.py psl annotation.gtf genome.fa > productivity.psl

Outputs an extended psl with an additional column containing a value of 0, 1, or 2, corresponding to productive, unproductive (premature stop codon), and lncRNA (no start codon) classifications, respectively.
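The three classes can be illustrated with a short Python 3 sketch. It works on an already spliced transcript sequence and assumes that a stop codon counts as premature when it ends at or before the last splice junction; that rule, the function name `classify_productivity`, and its parameters are assumptions for illustration and may not match what mark_productivity.py does.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_productivity(transcript_seq, start_pos, last_junction_pos):
    """Return 0 (productive), 1 (premature stop codon) or 2 (no start codon).

    transcript_seq    -- spliced transcript sequence, 5' to 3'
    start_pos         -- 0-based transcript position of the annotated start codon,
                         or None if no annotated start codon falls in the isoform
    last_junction_pos -- 0-based transcript position of the last exon-exon junction
    """
    if start_pos is None:
        return 2
    seq = transcript_seq.upper()
    for pos in range(start_pos, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOP_CODONS:
            # Assumed rule: a stop upstream of the last junction is premature.
            return 1 if pos + 3 <= last_junction_pos else 0
    # No in-frame stop found: not premature under this simplified rule.
    return 0


if __name__ == "__main__":
    seq = "CCATGAAATAAGGGGGG"  # ATG at position 2, in-frame TAA at position 8
    print(classify_productivity(seq, 2, 14))
    print(classify_productivity(seq, 2, 5))
    print(classify_productivity(seq, None, 5))
```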

find_alt3prime_5prime_ss.py

Requires two positional arguments to identify and calculate significance of alternative 5' and 3' splicing between two samples using Fisher's exact tests, and two arguments specifying output files: (1) an extended psl of isoforms containing two extra columns for read counts of each isoform per sample type, (2) the 0-indexed column number of the two extra columns (assumed to be last two), (3) txt file output name for alternative 3' SS, (4) txt file output name for alternative 5' SS. See quantification for obtaining (1).

Usage:

python find_alt3prime_5prime_ss.py isoforms.psl annotation.gtf colnum alt_acceptor.txt alt_donor.txt 

Output file format: chrom intron 5' coordinate intron 3' coordinate p-value strand sample1 intron count sample2 intron count sample1 alternative introns counts sample2 alternative introns counts isoform name canonical SS distance from predominant alternative SS canonical SS

diff_iso_usage.py

Requires three positional arguments to identify and calculate significance of differential isoform usage between two samples using Fisher's exact tests: (1) an extended psl of isoforms containing two extra columns for read counts of each isoform per sample type, (2) the 0-indexed column number of the two extra columns (assumed to be last two), (3) txt file output name for differentially used isoforms. See quantification for obtaining (1).

Usage:

python diff_iso_usage.py isoforms.psl colnum diff_isos.txt

Output file format: gene name, isoform name, p-value, sample1 isoform count, sample2 isoform count, sample1 alternative isoforms for gene count, sample2 alternative isoforms for gene count
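For readers unfamiliar with the test, here is a small pure-Python sketch (Python 3.8+ for `math.comb`) of a two-sided Fisher's exact test on a 2x2 table. Arranging the table as isoform counts versus alternative-isoform counts for the two samples follows the output columns above, but that arrangement and the function name are assumptions; this is not the code of diff_iso_usage.py.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Assumed layout: a, b = sample1, sample2 counts of the isoform;
    c, d = sample1, sample2 counts of the gene's alternative isoforms.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_ways = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_ways

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(p_value, 1.0)


if __name__ == "__main__":
    # Isoform seen 30 vs 5 times; alternative isoforms of the gene 20 vs 45 times.
    print(fisher_exact_two_sided(30, 5, 20, 45))
```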

NanoSim_Wrapper.py

A wrapper script written for simulating nanopore transcriptome data using Nanosim.

Example Files

We have provided the following example files:

  • na12878.cdna.200k.fa, containing 200,000 nanopore cDNA sequencing reads subsampled from the Native RNA Consortium. This can be run through the FLAIR workflow starting from alignment.
  • cll_shortread_junctions.gp, a genepred-formatted file of splice junctions observed from short read sequencing of CLL samples that can be used in the correction step. Junctions from short read sequencing are optional (deprecated)
  • gencode_v24_complete.gp, splice junctions from GENCODE v24 annotation that is supplied to the correction step (deprecated)

Other downloads:

  • promoter BED file to supplement in FLAIR-collapse for better TSS-calling for GM12878 cells