RasmusKirkegaard edited this page Jan 27, 2015 · 8 revisions

Welcome to the AccessingGenbank wiki!

The online databases for biological sequence data are begging to be mined, but doing so by hand is infeasible. Even downloading the files through the web interfaces is a hopeless task, so I "mined" Stack Exchange for a solution.

What I was looking for was a Python solution for:

  • downloading a list of GenBank files (automatically)
  • mining the GenBank files for a certain field (automatically)
  • reporting a list of the unique entries in that field
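
As a rough idea of what the download step can look like, here is a minimal sketch that uses only the Python standard library and the NCBI E-utilities `efetch` endpoint. The function names `build_efetch_url` and `download_genbank` are made up for this example, and the choice of `db=nuccore` assumes the accessions are nucleotide records; it is not necessarily the approach the packages found on Stack Exchange take.

```python
import urllib.parse
import urllib.request

# NCBI E-utilities endpoint for fetching records.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accessions):
    """Build an efetch URL that returns GenBank flat files for the accessions."""
    params = {
        "db": "nuccore",              # assumption: nucleotide records
        "id": ",".join(accessions),   # several IDs can go into one request
        "rettype": "gb",              # GenBank flat-file format
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urllib.parse.urlencode(params)


def download_genbank(accessions, path):
    """Fetch the records over the network, save them to `path`, return the text."""
    with urllib.request.urlopen(build_efetch_url(accessions)) as response:
        text = response.read().decode("utf-8")
    with open(path, "w") as handle:
        handle.write(text)
    return text
```

Calling `download_genbank(["AB290396"], "records.gb")` would then save that record to a local file. For long accession lists it is probably wise to send the IDs in batches and pause between requests, in line with NCBI's usage guidelines.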

Why would you do this? For example, to find the habitats where a certain group of microorganisms occurs.

GenBank files use a "flat file" format (http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) that holds information such as the sequence data, accession IDs, authors and publications.


`LOCUS       AB290396                1145 bp    DNA     linear   ENV 04-JUL-2007
DEFINITION  Uncultured bacterium gene for 16S ribosomal RNA, partial sequence,
            clone: MBF16_34.
ACCESSION   AB290396
VERSION     AB290396.1  GI:139004549
KEYWORDS    ENV.
SOURCE      uncultured bacterium
  ORGANISM  uncultured bacterium
            Bacteria; environmental samples.
REFERENCE   1
  AUTHORS   Hatamoto,M., Imachi,H., Yashiro,Y., Ohashi,A. and Harada,H.
  TITLE     Diversity of Anaerobic Microorganisms Involved in Long-Chain Fatty
            Acid Degradation in Methanogenic Sludges as Revealed by RNA-Based
            Stable Isotope Probing
  JOURNAL   Appl. Environ. Microbiol. 73 (13), 4119-4127 (2007)
   PUBMED   17483279
REFERENCE   2  (bases 1 to 1145)
  AUTHORS   Hatamoto,M. and Imachi,H.
  TITLE     Direct Submission
  JOURNAL   Submitted (16-JAN-2007) Hiroyuki Imachi, Japan Agency for
            Marine-Earth Science & Technology (JAMSTEC), Extremobiosphere
            Research Center; 2-15 Natsushima-cho, Yokosuka, Kanagawa 237-0061,
            Japan (E-mail:[email protected], Tel:81-46-867-9709,
            Fax:81-46-867-9715)
FEATURES             Location/Qualifiers
     source          1..1145
                     /organism="uncultured bacterium"
                     /mol_type="genomic DNA"
                     /isolation_source="Mesophilic anaerobic sludge treating
                     palm oil mill effluent"
                     /db_xref="taxon:77133"
                     /clone="MBF16_34"
                     /environmental_sample
     rRNA            <1..>1145
                     /product="16S ribosomal RNA"
ORIGIN
        1 agtcgagaat cttccccaat gggcgaaagc ctgagggagc gacgccgcgt gagggatgaa
       61 ggccctttgg gttgtaaacc tctgttaggg ggaaagaaaa gcagtggaag caatatgtcc
      121 attgcctgac gttaccccca gagaaagctc cggccaactc cgtgccagca gccgcggtaa
      181 tacgggggga gcaagcgttg tccggaatca ttgggcgtaa agggcgtgta ggcggcttgg
      241 caagtcgaat gtgaaatccc acggctcaac cgtggaactg cgttcgaaac tgccttgctt
      301 gagtgcggga gaggtgtgcg gaattcctgg tgtagcggtg gaatgcgtag atatcaggaa
      361 gaacaccggt ggcgaaggcg gcacactggc ccagcactga cgctgaggcg cgaaagcgtg
      421 gggagcgaac gggattagat accccggtag tccacgctgt aaactttggg cactaggtat
      481 tggaggtctc aaccccttca gtgccgtagc taacgcgtta agtgccccgc ctggggagta
      541 cggtcgcaag gctgaaactc aaaggaattg acgggggccc gcacaagcgg tggagcatgt
      601 ggtttaattc gatgcaacgc gaagaacctt accggggttt gacatgggag cctcgccgca
      661 aggcgaggtc agccctatga aagtagggtg tgtccacaca ggtgctgcat ggctgtcgtc
      721 agctcgtgtc gtgagatgtt gggttaagtc ccgcaacgag cgcaaccctc gccgatagtt
      781 accaacgggt catgccgggg actctatcgg gactgccggt gataaaccgg aggaaggtgg
      841 ggatgatgtc aagtcatcat ggcccttaca tcccgggcta cacacgtgct acaatggtcg
      901 gtacagcggg ttgcaatacc gcgaggtgga gcaaatcctc aaagccggcc tcagtacgga
      961 ttggagtctg caactcgact ctatgaagcc ggaatcgcta gtaatcgcgg atcagaatgc
     1021 cgcggtgaat acgttcccgg gccttgtaca caccgcccgt caagccatgg gaatcgccag
     1081 cactcgaagt cgctggccta accgcaaggg gggaggcgcc gaaagtgaag ccgatgactg
     1141 gggct
//

`


Luckily, some nice people have written packages that let you search a record by field name, e.g. you can extract the "isolation_source" field. For the file above, this returns the string "Mesophilic anaerobic sludge treating palm oil mill effluent", which tells you something about where this organism might be found.
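
To make the idea concrete, here is a small sketch of the extraction and the "unique entries" step. It uses a plain regular expression from the standard library rather than one of the packages mentioned above, so treat it as an illustration only; the helper names `isolation_sources` and `unique_sources` are invented for this example. A dedicated parser (Biopython, for instance) is likely the more robust choice, since this regex does not handle quotation marks embedded in a value.

```python
import re

# Matches /isolation_source="..." including values wrapped over several lines.
QUALIFIER = re.compile(r'/isolation_source="([^"]*)"')


def isolation_sources(genbank_text):
    """Return every isolation_source value found in GenBank flat-file text."""
    values = []
    for match in QUALIFIER.finditer(genbank_text):
        # Collapse the line break and indentation of a wrapped value.
        values.append(" ".join(match.group(1).split()))
    return values


def unique_sources(genbank_text):
    """Return the distinct isolation_source values, sorted alphabetically."""
    return sorted(set(isolation_sources(genbank_text)))


# Excerpt of the FEATURES table from the record shown above.
sample = '''     source          1..1145
                     /organism="uncultured bacterium"
                     /isolation_source="Mesophilic anaerobic sludge treating
                     palm oil mill effluent"
                     /db_xref="taxon:77133"
'''

print(unique_sources(sample))
```

Because the text of many records can simply be concatenated before calling `unique_sources`, the same function covers both a single file and a whole downloaded batch.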
