Skip to content

Counting genomic features

Damien Farrell edited this page Feb 23, 2017 · 1 revision

Counting features means counting the overlap of your reads with the locations of gene annotations (the features) in a reference genome. A short read aligner is the first step. Note that if you are aligning to a human-sized reference genome, creating the index from the fasta file can take some time. Also this process can use a lot of memory. See Aligners for more information.

After read alignment to a reference genome the next step in transcriptomics is usually to count the features that each read intersects. This done using the sam or bam file generated by the aligner (e.g. bowtie) along with a gtf or gff file that stores the annotated genes - the features. These files are available for many spedcies from Ensembl. These will include annotations for non coding RNAs.

Counting features

import smallrnaseq as smrna
import pandas as pd
readcounts = pd.read_csv('counts.csv')
fcounts = smrna.count_features(samfile, gtffile, truecounts=readcounts)

This returns a pandas dataframe of the form:

                   name   reads
259         _no_feature  290705
50            _unmapped  145281
645  ENSBTAT00000060484   15162
287  ENSBTAT00000042309    6284
538  ENSBTAT00000051764    5575
...

This will merge the gtf file fields with the read counts, which can also be done explicitly using:

res = base.merge_features(fcounts, gtffile)

References

HTSeq feature counting

Ensembl ftp downloads

Ensembl genome annotation

Clone this wiki locally