-
Notifications
You must be signed in to change notification settings - Fork 19
Command line interface
Installing the package provides the command smallrnaseq
in your path. This allows users is a command line interface to the library without the need for any Python coding at all. It provides a set of pre-defined functions with parameters specified in a text configuration file.
Usage largely involves setting up the config file and having your input files prepared. Running the command smallrnaseq -c default.conf
will create a new config file for you to work from if none exists. Just edit this with a text editor and then to execute:
smallrnaseq -c default.conf -r
Several other functions are available from the command line without the config file, i.e. to collapse or trim reads. Type smallrnaseq -h
to get a list of options.
The advantage of configuration files is in avoiding long commands that have to be remembered or are prone to mistakes. Also the config files can be kept to recall what setting we used or to copy them for another set of files. The current options available in the file are shown below. The meaning of each option is explained explained below. If you are unsure or don't require an option value, leave it at the default. Options can be excluded from the file completely and defaults will be used but it's recommended to just leave unused options blank. You can also comment out lines with a '#' at the start if you want it to be ignored. The [base] heading should always be present an indicates a section of the file. The [aligner] section is for setting alignment parameters on a per library basis if you need to. The [novel] section is for a few novel mirna prediction settings. The [de] section is used for differential expression which can be ignored if you don't need that.
[base]
filenames =
path =
overwrite = 0
index_path = indexes
libraries =
ref_fasta =
features =
output = results
add_labels = 1
aligner = bowtie
mirna = 0
species = hsa
pad5 = 3
pad3 = 5
[aligner]
default_params = -v 1 --best
mirna_params = -v 1 -a --best --strata --norc
[novel]
score_cutoff = 0.8
read_cutoff = 50
[de]
sample_labels =
sep = ,
sample_col =
factors_col =
conditions =
logfc_cutoff =
Settings explained:
name | example value | meaning |
---|---|---|
filenames | test.fastq | input fastq file(s) with reads |
path | testfiles | folder containing fastq files instead of listing files |
index_path | indexes | location of bowtie or subread indexes |
aligner | bowtie | which aligner to use, bowtie or subread |
ref_fasta | hg19 | reference genome fasta file, optional |
libraries | RFAM_human,mirbase-hsa | names of annotated library indexes to map to |
features | Homo_sapiens.GRCh37.75.gtf | genome annotation file |
output | smrna_results | output folder for temp files |
add_labels | 1 | whether to add labels to replace the file names in the results (0 or 1) |
mirna | 0 | run mirna counting workflow (0 or 1) |
species | bta | mirbase species to use |
pad5 | 3 | 3' flanking bases to add when generating mature mirbase sequences |
pad3 | 5 | 5' flanking bases to add |
sample_labels | samplefile.txt | csv file with sample labels |
default_params | -v 1 --best | default alignment parameters |
mirna_params | -v 1 -a --best --strata --norc | default miRNA alignment parameters |
This will simply consist of setting the libraries
option to match the names of files you have created aligner indexes for. You can download RNA annotations in fasta format from http://rnacentral.org/. The index files should be placed in the index_path
folder. To create an index using bowtie you supply a fasta file and the indexes are placed in a folder called 'indexes'. You can move them to another folder if needed:
smallrnaseq -b myrnas.fa
By default file names are replaced with short ids of the form s01, s02, etc. This also writes out a file called sample_labels.csv
in the output folder. Set add_labels = 0
if you don't want this behaviour.
Say we have a set of fastq files in the folder 'testfiles' that we want to count miRNAs in. We would set the options mirna = 1
and path = testfiles
. Note if just mapping to mirbase mature/precursor sequence you don't have to create an index file since it is generated automatically. If you are using any other libraries you should create the aligner index first. For novel miRNA discovery we need the reference genome and index for that.
There are some advantages to using a full reference genome in that it allows us to see of the reads align outside the target transcripts and we might therefore exclude them. Also we can normalise the read counts by the sum of all mapped reads. This depends on what features you have in the gtf file. To count our reads using features in a reference genome, provide the ref_genome
name which should correspond to the index name of the reference sequence. We also need to set features
, a set of features stored in a gtf, gff or bed file. Remember that the coordinates in this file must correspond to the reference sequence. See Counting features for more information.
The main outputs are csv files with the counts for each sample in a column, also produced is a 'long form' csv file with a separate row for every sample. These csv files can be opened in a spreadsheet. Temporary files such as collapsed read files are placed in a 'temp' folder inside the output folder.
This workflow is done using a separate command and via some extra options not shown above for clarity. To execute this type of analysis you must have done gene counting (e.g. miRNAs) and have the results in a csv file. The analysis is then run using:
smallrnaseq -c default.conf -d
In the default file the additional options needed are in a separate [de] section. Most are blank by default. If you want to do differential expression analysis of the genes from your results the other main thing you need to provide is a sample label file that matches the filenames you have analysed to the conditions. You then choose which conditions you want to compare. The options are explained below.
name | example value | meaning |
---|---|---|
count_file | mirna_counts.csv | csv file with gene counts |
sample_labels | labels.txt | csv/text file with sample labels |
sep | , | separator for sample labels text file |
sample_col | file_name | column for the sample file names |
factors_col | status | column for factor labels |
conditions | control,infected | conditions to compare from factor column (each replicate will have the same condition) |
logfc_cutoff | 1.5 | cutoff for log fold changes (lower are ignored) |
This analysis is run using the R edgeR package, so R must be installed. See Installing R
The sample label file below shows how we use the above options. Our filenames are in the Run_s column. We want to compare the conditions 3 and 6 months samples in the age_s (the 'factor') column. mirna_mature_counts.csv is the file where counts from our mirna analysis are stored. This should have a column for each sample. Note that you could use counts output from any program as long as they are csv and have the right column names, see Result file format. It is assumed you are using replicates. Note that column names are case sensitive.
Run_s age_s isolate_s
SRR3457948 3 months animal1
SRR3457949 6 months animal1
SRR3457950 6 months animal4
SRR3457951 15 months animal4
SRR3457952 3 months animal5
SRR3457953 6 months animal5
SRR3457954 15 months animal5
SRR3457955 3 months animal6
SRR3457956 6 months animal6
...
So the config file will look like this:
[de]
sample_labels = SraRunTable.txt
sep = tab #tab delimiter
count_file = mirna_mature_counts.csv
sample_col = Run_s
factors_col = age_s
conditions = 3 months,6 months
logfc_cutoff = 1.5