FakeSequence is an artificial non-functional realistic DNA sequence generator.
caballero/Garlic
GARLIC - An artificial non-functional realistic DNA sequence generator.

Copyright (C) 2011-2015 Juan Caballero [Institute for Systems Biology]

DESCRIPTION

GARLIC is an artificial non-functional realistic DNA sequence generator.

Why do we need it? With current sequencing technology we can sequence any organism and start mining its genome. Many genomic analyses rely on strong statistical tests, but until now a good negative control has never been developed. For example, we can use several programs (Genscan, Augustus, GlimmerHMM, ...) to predict coding genes. All of these programs have been trained on well-characterized genes to recognize what a gene looks like, so you can expect a low false negative rate. None of them implements a negative control, however, so you will find many false positive genes. This is currently true in many model organisms (including human), where the number of predicted coding genes exceeds the number of coding genes supported by evidence.

Of course you can use intergenic regions as a negative control, but these regions are limited in number and size, and you cannot be 100% sure that they contain no genes.

So we developed this tool to recreate realistic sequences based on the properties of the background genome. We define the background genome as the sequence that remains after removing genes (coding and non-coding), pseudogenes, interspersed repeats and low complexity sequences (600 Mb in hg19).

We modeled:
  (1) composition of the background genome
  (2) interspersed repeats
  (3) low complexity sequences

The current algorithm creates a base sequence, then bombards it with artificially evolved elements (interspersed repeats, low complexity sequences) at the frequencies expected in the reference genome.

The final output is a Fasta file with the newly generated sequence.

REQUIREMENTS

  - Perl
  - Genome models. You can download them from: http://www.repeatmasker.org/garlic/ or create your own, see below.
  - RepBase consensus sequences. You need to download the EMBL file from RepBase [http://www.girinst.org/repbase] (registration required) and put the file in data/repbase [suggested].

USAGE

1. Create a new sequence

    perl createFakeSequence.pl -m hg19 -s 1Mb -o fake.fa

   For more options, please read the documentation using:

    perl createFakeSequence.pl --help

2. Train a new model

   You can obtain the models from our website. We currently support some organisms with complete annotation in the UCSC Genome Database: [http://hgdownload.cse.ucsc.edu/downloads.html]

   You can create your own model by fetching the data from the UCSC site:

    perl createModel.pl -m hg19

   You can also use your own sequences and annotations to create a model:

    perl createModel.pl -m myOrg -f myOrg.fa -r RM.out -t TRF.out -g Genes.table

   Please read the documentation using:

    perl createModel.pl --help

CITATION

Realistic artificial DNA sequences as negative controls for computational genomics. Caballero J, Smit AF, Hood L, Glusman G. Nucl. Acids Res. 2014 doi: 10.1093/nar/gku356

LICENSE

All the code is under the GPLv3 license, see the LICENSE file for details.

CHANGES

1.5 :
  - Fixed "uninitialized value" warnings caused by UCSC model lookups.
  - TRF data should be in UCSC BED format, which uses 0-based/half-open coordinates. Fixed a bug where 1-based coordinates were assumed, causing a zero-valued start coordinate to go negative.
  - createModel.pl: Added support for relative paths in input parameters.
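
The coordinate issue in the 1.5 change notes can be illustrated with a short sketch. The Python below is not GARLIC code (GARLIC is written in Perl); the function names are hypothetical and only show how a 0-based/half-open BED interval relates to a 1-based/closed one, and why subtracting 1 from a BED start that is already 0-based can produce a negative coordinate.

```python
def bed_to_one_based(start, end):
    """Convert a 0-based, half-open BED interval [start, end)
    to a 1-based, closed interval [start + 1, end]."""
    if start < 0 or end < start:
        raise ValueError("invalid BED interval: %d-%d" % (start, end))
    return start + 1, end


def bed_length(start, end):
    """Length of a 0-based, half-open interval is simply end - start."""
    return end - start


def wrongly_shifted_start(start):
    """Illustrates the class of bug described in the change notes:
    treating a BED start as 1-based and shifting it down by one.
    For a feature at the very beginning of a sequence this goes negative."""
    return start - 1


# A hypothetical TRF feature covering the first ten bases of a sequence.
bed_start, bed_end = 0, 10
one_based = bed_to_one_based(bed_start, bed_end)
length = bed_length(bed_start, bed_end)
bad_start = wrongly_shifted_start(bed_start)
```

Here the first ten bases are (0, 10) in BED terms and (1, 10) in 1-based terms, with a length of 10 in both conventions. Only the mistaken extra shift yields a start of -1.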