-
Notifications
You must be signed in to change notification settings - Fork 19
Counting genomic features
Counting features means counting the overlap of your reads with the locations of gene annotations (the features) in a reference genome. A short read aligner is the first step. Note that if you are aligning to a human-sized reference genome, creating the index from the fasta file can take some time. Also this process can use a lot of memory. See Aligners for more information.
After read alignment to a reference genome the next step in transcriptomics is usually to count the features that each read intersects. This done using the sam or bam file generated by the aligner (e.g. bowtie) along with a gtf or gff file that stores the annotated genes - the features. These files are available for many spedcies from Ensembl. These will include annotations for non coding RNAs.
import smallrnaseq as smrna
import pandas as pd
readcounts = pd.read_csv('counts.csv')
fcounts = smrna.count_features(samfile, gtffile, truecounts=readcounts)
This returns a pandas dataframe of the form:
name reads
259 _no_feature 290705
50 _unmapped 145281
645 ENSBTAT00000060484 15162
287 ENSBTAT00000042309 6284
538 ENSBTAT00000051764 5575
...
This will merge the gtf file fields with the read counts, which can also be done explicitly using:
res = base.merge_features(fcounts, gtffile)