Home
Welcome to the AccessingGenbank wiki!
The online databases for biological sequence data are begging to be mined, but doing so by hand is infeasible. Even downloading the files through the web interfaces is a hopeless task. I therefore "mined" Stack Exchange for a solution.
What I was looking for was a Python solution for:
- downloading a list of GenBank files (automatically)
- mining the GenBank files for a certain field (automatically)
- reporting a list of unique entries in the field
Why would you do this? For example, to search for habitats where a certain group of microorganisms can be found.
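The first of the steps above, automated downloading, could be sketched as follows. This is a minimal illustration using only the Python standard library and NCBI's E-utilities `efetch` endpoint; it is not necessarily the approach this wiki ends up using. The function names `build_efetch_url` and `download_genbank`, the batch size, and the pause between requests are assumptions made for the example.

```python
import time
import urllib.parse
import urllib.request

# NCBI E-utilities "efetch" endpoint; with rettype=gb and retmode=text it
# returns GenBank flat files as plain text.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accessions, email):
    """Return an efetch URL requesting the given accessions in GenBank format.

    Hypothetical helper for this sketch, not part of any library.
    """
    params = {
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": "gb",
        "retmode": "text",
        "email": email,  # NCBI asks callers to identify themselves
    }
    return EFETCH_URL + "?" + urllib.parse.urlencode(params)


def download_genbank(accessions, email, batch_size=100, pause=0.5):
    """Fetch records in batches and return the concatenated flat-file text.

    batch_size and pause are assumed values chosen to be gentle on the server.
    """
    chunks = []
    for start in range(0, len(accessions), batch_size):
        batch = accessions[start:start + batch_size]
        with urllib.request.urlopen(build_efetch_url(batch, email)) as response:
            chunks.append(response.read().decode("utf-8"))
        time.sleep(pause)  # wait between requests to respect NCBI rate limits
    return "".join(chunks)
```

Calling `download_genbank(["AB290396"], "you@example.org")` (with your own e-mail address) should return the flat-file text for that accession, provided the network and the NCBI service are reachable.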
GenBank files use a "flat" file format (http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) that holds information such as the sequence data, accession IDs, authors, and publication.
`
LOCUS       AB290396                1145 bp    DNA     linear   ENV 04-JUL-2007
DEFINITION  Uncultured bacterium gene for 16S ribosomal RNA, partial sequence,
            clone: MBF16_34.
ACCESSION   AB290396
VERSION     AB290396.1  GI:139004549
KEYWORDS    ENV.
SOURCE      uncultured bacterium
  ORGANISM  uncultured bacterium
            Bacteria; environmental samples.
REFERENCE   1
  AUTHORS   Hatamoto,M., Imachi,H., Yashiro,Y., Ohashi,A. and Harada,H.
  TITLE     Diversity of Anaerobic Microorganisms Involved in Long-Chain Fatty
            Acid Degradation in Methanogenic Sludges as Revealed by RNA-Based
            Stable Isotope Probing
  JOURNAL   Appl. Environ. Microbiol. 73 (13), 4119-4127 (2007)
   PUBMED   17483279
REFERENCE   2  (bases 1 to 1145)
  AUTHORS   Hatamoto,M. and Imachi,H.
  TITLE     Direct Submission
  JOURNAL   Submitted (16-JAN-2007) Hiroyuki Imachi, Japan Agency for
            Marine-Earth Science & Technology (JAMSTEC), Extremobiosphere
            Research Center; 2-15 Natsushima-cho, Yokosuka, Kanagawa 237-0061,
            Japan (E-mail:[email protected], Tel:81-46-867-9709,
            Fax:81-46-867-9715)
FEATURES             Location/Qualifiers
     source          1..1145
                     /organism="uncultured bacterium"
                     /mol_type="genomic DNA"
                     /isolation_source="Mesophilic anaerobic sludge treating
                     palm oil mill effluent"
                     /db_xref="taxon:77133"
                     /clone="MBF16_34"
                     /environmental_sample
     rRNA            <1..>1145
                     /product="16S ribosomal RNA"
ORIGIN
        1 agtcgagaat cttccccaat gggcgaaagc ctgagggagc gacgccgcgt gagggatgaa
       61 ggccctttgg gttgtaaacc tctgttaggg ggaaagaaaa gcagtggaag caatatgtcc
      121 attgcctgac gttaccccca gagaaagctc cggccaactc cgtgccagca gccgcggtaa
      181 tacgggggga gcaagcgttg tccggaatca ttgggcgtaa agggcgtgta ggcggcttgg
      241 caagtcgaat gtgaaatccc acggctcaac cgtggaactg cgttcgaaac tgccttgctt
      301 gagtgcggga gaggtgtgcg gaattcctgg tgtagcggtg gaatgcgtag atatcaggaa
      361 gaacaccggt ggcgaaggcg gcacactggc ccagcactga cgctgaggcg cgaaagcgtg
      421 gggagcgaac gggattagat accccggtag tccacgctgt aaactttggg cactaggtat
      481 tggaggtctc aaccccttca gtgccgtagc taacgcgtta agtgccccgc ctggggagta
      541 cggtcgcaag gctgaaactc aaaggaattg acgggggccc gcacaagcgg tggagcatgt
      601 ggtttaattc gatgcaacgc gaagaacctt accggggttt gacatgggag cctcgccgca
      661 aggcgaggtc agccctatga aagtagggtg tgtccacaca ggtgctgcat ggctgtcgtc
      721 agctcgtgtc gtgagatgtt gggttaagtc ccgcaacgag cgcaaccctc gccgatagtt
      781 accaacgggt catgccgggg actctatcgg gactgccggt gataaaccgg aggaaggtgg
      841 ggatgatgtc aagtcatcat ggcccttaca tcccgggcta cacacgtgct acaatggtcg
      901 gtacagcggg ttgcaatacc gcgaggtgga gcaaatcctc aaagccggcc tcagtacgga
      961 ttggagtctg caactcgact ctatgaagcc ggaatcgcta gtaatcgcgg atcagaatgc
     1021 cgcggtgaat acgttcccgg gccttgtaca caccgcccgt caagccatgg gaatcgccag
     1081 cactcgaagt cgctggccta accgcaaggg gggaggcgcc gaaagtgaag ccgatgactg
     1141 gggct
//
`
Luckily, some nice people have made packages that let you search a record by "field names"; for example, you can extract the "isolation_source" field. Extracting that field from the record above yields the text string "Mesophilic anaerobic sludge treating palm oil mill effluent", which tells you something about where this organism might be found.
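To make the idea concrete, here is a rough standard-library sketch of pulling a qualifier such as `isolation_source` out of flat-file text and reporting the unique values. It is a simplification, not the API of any of those packages: the function names `extract_qualifier` and `unique_sorted` are hypothetical, and the regular expression assumes quoted values that contain no embedded double quotes. A dedicated parser (for instance Biopython's `SeqIO` with the `"genbank"` format) is likely to be more robust on real data.

```python
import re


def extract_qualifier(flat_text, name):
    """Return every value of /name="..." found in GenBank flat-file text.

    Hypothetical helper; assumes values contain no embedded double quotes.
    """
    pattern = re.compile(r'/' + re.escape(name) + r'="([^"]*)"')
    # A value may wrap over several lines, so collapse each run of
    # whitespace (including the line break) into a single space.
    return [" ".join(value.split()) for value in pattern.findall(flat_text)]


def unique_sorted(values):
    """Return the distinct values in alphabetical order."""
    return sorted(set(values))


# A fragment of the feature table from the record shown above.
sample = '''     source          1..1145
                     /organism="uncultured bacterium"
                     /isolation_source="Mesophilic anaerobic sludge treating
                     palm oil mill effluent"
                     /db_xref="taxon:77133"
'''

for habitat in unique_sorted(extract_qualifier(sample, "isolation_source")):
    print(habitat)
```

Because the function scans the whole text, it can be run unchanged on many records concatenated into one string, and `unique_sorted` then removes the repeated habitats.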