Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add SILVA/greengenes DB to 16S amptk taxonomy #44

Open
nextgenusfs opened this issue Oct 3, 2018 · 11 comments
Open

add SILVA/greengenes DB to 16S amptk taxonomy #44

nextgenusfs opened this issue Oct 3, 2018 · 11 comments

Comments

@nextgenusfs
Copy link
Owner

problem is reformatting the taxonomy information in format needed by UTAX/SINTAX. I've previously looked at SILVA taxonomy -- the taxonomy appeared to be a hot mess (I'm not a bacteriologist).... So the challenge will be convert the taxonomy strings to proper format

example taxonomy strings:

>BOLD:ACI6695;tax=k:Animalia,p:Arthropoda,c:Insecta,o:Coleoptera,f:Elateridae,g:Nipponoelater,s:Nipponoelater babai
>S004604051;tax=k:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Hymenochaetales,f:Hymenochaetaceae,g:Inonotus,s:Sanghuangporus zonatus
>S004127186;tax=k:Fungi,p:Ascomycota
>S004061552;tax=k:Fungi,p:Ascomycota,c:Eurotiomycetes,s:Pyrenula sanguinea
@nextgenusfs nextgenusfs changed the title add SILVA/greengenes DB to amptk taxonomy add SILVA/greengenes DB to 16S amptk taxonomy Oct 3, 2018
@nextgenusfs
Copy link
Owner Author

update: I've tried this, but the size of the database is large and searches are incredibly slow, need to figure out why before doing this. There seems to be a lot of redundancy in the database, one solution would be to dereplicate and then LCA, but not sure that will then be any better.

@AlexGaithuma
Copy link

AlexGaithuma commented Apr 12, 2019

Failed to create database UTAX and SINTAX but usearch created.
my command:

#!/bin/bash
SILVA="/media/bioadmin/Think/MiseqDATA/SILVA_Database"

download the SILVA_132_SSURef_NR99 database and extract the V4 region

mkdir $SILVA && cd $SILVA
RELEASE=132
URL="https://www.arb-silva.de/fileadmin/silva_databases/release_${RELEASE}/Exports"
INPUT="SILVA_${RELEASE}_SSURef_Nr99_tax_silva.fasta.gz"

Download and check

wget -c ${URL}/${INPUT}{,.md5} && md5sum -c ${INPUT}.md5

Define variables and output files

Primers V3-V4 region (~465 bp) 341F & 805R

OUTPUT="${INPUT/.fasta.gz/_341F_805R.fasta}"
LOG="${INPUT/.fasta.gz/_341F_805.log}"
PRIMER_F="CCTACGGGNGGCWGCAG"
PRIMER_R="GACTACHVGGGTATCTAATCC"
ANTI_PRIMER_R="GGATTAGATACCCBDGTAGTC"
MIN_LENGTH=32
MIN_F=$(( ${#PRIMER_F} * 2 / 3 ))
MIN_R=$(( ${#PRIMER_R} * 2 / 3 ))

Extract target region using forward & reverse primers

zcat "${INPUT}" | sed '/^>/ ! s/U/T/g' |
cutadapt --discard-untrimmed --minimum-length ${MIN_LENGTH} -g "${PRIMER_F}" -O "${MIN_F}" - 2> "${LOG}" |
cutadapt --discard-untrimmed --minimum-length ${MIN_LENGTH} -a "${ANTI_PRIMER_R}" -O "${MIN_R}" - 2>> "${LOG}" |
sed '/^>/ s/;/|/g ; /^>/ s/ //g ; /^>/ s// /1' > "${OUTPUT}"

Format extracted SILVA database

sed 's/://g' $SILVA/${OUTPUT} |
sed 's/-/
/g' | sed 's/(//g' | sed 's/)//g' |
sed 's/ Bacteria/;tax=d:Bacteria,/g' |
sed 's/|/p:"/1' | sed 's/|/",c:/1' | sed 's/|/,o:/1' |
sed 's/|/,f:/1' | sed 's/|/,g:/1' | sed 's/|/,s:/1' > $SILVA/AMPtk_${OUTPUT}

amptk database = This is my output:

amptk database -i $SILVA/AMPtk_${OUTPUT} -o 16S --format off --create_db usearch
--skip_trimming --install --primer_required none --derep_fulllength

[12:50:36 PM]: OS: Ubuntu 14.04, 8 cores, ~ 8 GB RAM. Python: 3.4.3
[12:50:36 PM]: AMPtk v1.4.0, USEARCH v9.2.64, VSEARCH v2.9.0
[12:50:36 PM]: Base name set to: /usr/local/lib/python3.4/dist-packages/amptk/DB/16S
[12:50:36 PM]: Working on file: /media/bioadmin/Think/MiseqDATA/SILVA_Database/AMPtk_SILVA_132_SSURef_Nr99_tax_silva_341F_805R.fasta
[12:50:49 PM]: 594,027 records loaded
[12:50:49 PM]: Using 8 cpus to process data
[12:51:29 PM]: 593,752 records passed (99.95%)
[12:51:29 PM]: Errors: 0 no taxonomy info, 0 no ID, 0 length out of range, 275 too many ambiguous bases, 0 no primers found
[12:51:29 PM]: Now dereplicating sequences (collapsing identical sequences)
[12:51:40 PM]: 359,733 records passed (60.56%)
[12:51:41 PM]: Creating USEARCH Database (VSEARCH)
[12:52:08 PM]: Database /usr/local/lib/python3.4/dist-packages/amptk/DB/16S.udb created successfully

amptk database -i $SILVA/AMPtk_${OUTPUT} -o 16S_SINTAX --format off --create_db sintax
--skip_trimming --install --primer_required none --derep_fulllength

[12:52:46 PM]: OS: Ubuntu 14.04, 8 cores, ~ 8 GB RAM. Python: 3.4.3
[12:52:46 PM]: AMPtk v1.4.0, USEARCH v9.2.64, VSEARCH v2.9.0
[12:52:46 PM]: Base name set to: /usr/local/lib/python3.4/dist-packages/amptk/DB/16S_SINTAX
[12:52:46 PM]: Working on file: /media/bioadmin/Think/MiseqDATA/SILVA_Database/AMPtk_SILVA_132_SSURef_Nr99_tax_silva_341F_805R.fasta
[12:52:59 PM]: 594,027 records loaded
[12:52:59 PM]: Using 8 cpus to process data
[12:53:36 PM]: 593,752 records passed (99.95%)
[12:53:36 PM]: Errors: 0 no taxonomy info, 0 no ID, 0 length out of range, 275 too many ambiguous bases, 0 no primers found
[12:53:36 PM]: Now dereplicating sequences (collapsing identical sequences)
[12:53:47 PM]: 359,733 records passed (60.56%)
[12:53:47 PM]: Creating SINTAX Database, this may take awhile
Traceback (most recent call last):
File "/usr/local/bin/amptk", line 735, in
main()
File "/usr/local/bin/amptk", line 726, in main
mod.main(arguments)
File "/usr/local/lib/python3.4/dist-packages/amptk/extract_region.py", line 623, in main
makeDB(OutName, args=args)
File "/usr/local/lib/python3.4/dist-packages/amptk/extract_region.py", line 418, in makeDB
amptklib.log.error("There was a problem creating the DB, check the UTAX log file %s" % utax_log)
UnboundLocalError: local variable 'utax_log' referenced before assignment

amptk database -i $SILVA/AMPtk_${OUTPUT} -o 16S_UTAX --format off --create_db sintax
--skip_trimming --install --primer_required none --derep_fulllength

[12:55:22 PM]: OS: Ubuntu 14.04, 8 cores, ~ 8 GB RAM. Python: 3.4.3
[12:55:22 PM]: AMPtk v1.4.0, USEARCH v9.2.64, VSEARCH v2.9.0
[12:55:22 PM]: Base name set to: /usr/local/lib/python3.4/dist-packages/amptk/DB/16S_UTAX
[12:55:22 PM]: Working on file: /media/bioadmin/Think/MiseqDATA/SILVA_Database/AMPtk_SILVA_132_SSURef_Nr99_tax_silva_341F_805R.fasta
[12:55:35 PM]: 594,027 records loaded
[12:55:35 PM]: Using 8 cpus to process data
[12:56:12 PM]: 593,752 records passed (99.95%)
[12:56:12 PM]: Errors: 0 no taxonomy info, 0 no ID, 0 length out of range, 275 too many ambiguous bases, 0 no primers found
[12:56:12 PM]: Now dereplicating sequences (collapsing identical sequences)
[12:56:22 PM]: 359,733 records passed (60.56%)
[12:56:22 PM]: Creating SINTAX Database, this may take awhile
Traceback (most recent call last):
File "/usr/local/bin/amptk", line 735, in
main()
File "/usr/local/bin/amptk", line 726, in main
mod.main(arguments)
File "/usr/local/lib/python3.4/dist-packages/amptk/extract_region.py", line 623, in main
makeDB(OutName, args=args)
File "/usr/local/lib/python3.4/dist-packages/amptk/extract_region.py", line 418, in makeDB
amptklib.log.error("There was a problem creating the DB, check the UTAX log file %s" % utax_log)
UnboundLocalError: local variable 'utax_log' referenced before assignment

@nextgenusfs
Copy link
Owner Author

Likely running out of memory or the taxonomy labels not formatted correctly. Check the logfile to see what usearch died with.

@AlexGaithuma
Copy link

AlexGaithuma commented Apr 12, 2019

usearch v9.2.64_i86linux32, 4.0Gb RAM (1040Gb total), 160 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

00:00 37Mb 0.1% Reading 16S_SINTAX.extracted.fa
00:01 84Mb 21.5% Reading 16S_SINTAX.extracted.fa
00:02 174Mb 65.5% Reading 16S_SINTAX.extracted.fa
00:02 247Mb 100.0% Reading 16S_SINTAX.extracted.fa
00:02 214Mb 0.1% Converting to upper case
00:03 214Mb 65.2% Converting to upper case
00:03 214Mb 100.0% Converting to upper case
00:03 215Mb 0.1% Word stats
00:04 215Mb 8.9% Word stats
00:05 215Mb 22.3% Word stats
00:06 215Mb 35.9% Word stats
00:07 215Mb 47.4% Word stats
00:08 215Mb 59.2% Word stats
00:09 215Mb 70.6% Word stats
00:10 215Mb 81.7% Word stats
00:11 215Mb 93.5% Word stats
00:11 215Mb 100.0% Word stats
00:11 215Mb 100.0% Alloc rows
00:11 798Mb 0.1% Build index
00:12 798Mb 2.0% Build index
00:13 798Mb 10.2% Build index
00:14 798Mb 17.5% Build index
00:15 798Mb 24.9% Build index
00:16 798Mb 32.7% Build index
00:17 798Mb 41.0% Build index
00:18 798Mb 49.3% Build index
00:19 798Mb 57.7% Build index
00:20 798Mb 65.7% Build index
00:21 798Mb 73.7% Build index
00:22 798Mb 81.9% Build index
00:23 798Mb 88.4% Build index
00:24 798Mb 96.4% Build index
00:24 798Mb 100.0% Build index
00:24 801Mb 0.0% Initialize taxonomy data
00:25 805Mb 13.9% Initialize taxonomy data
00:26 810Mb 33.0% Initialize taxonomy data
00:27 814Mb 54.0% Initialize taxonomy data
00:28 820Mb 75.6% Initialize taxonomy data
00:29 823Mb 94.2% Initialize taxonomy data
00:29 824Mb 100.0% Initialize taxonomy data
00:29 824Mb 0.0% Building name table
00:29 826Mb 100.0% Building name table
00:29 826Mb 58553 names, tax levels min 0, avg 6.5, max 7

WARNING: 1535 taxonomy nodes have >1 parent

usearch9 -makeudb_sintax 16S_SINTAX.extracted.fa -output 16S_SINTAX.udb -notrunclabels

---Fatal error---
../utaxdata.cpp(1176) assert failed: Size > 0 && Ids != 0

@nextgenusfs
Copy link
Owner Author

Hopefully that error makes sense - you have sequences (at least one) that doesn’t have any taxonomy information.

@FilipeMatteoli
Copy link

FilipeMatteoli commented Nov 1, 2019

I am commenting here because it is related, I am trying to build a tailored V4 database out of SILVA NR 132. I guess I've managed to format the headers properly, the fasta has now around 200 Mb. However, after running:

amptk database -i '/media/filipe/84CAA4DCCAA4CC2C/databases/SILVA_fasta/amptk_SILVA_132' -o V4_NR132 --format off --create_db utax --skip_trimming --install --primer_required none --derep_fulllength

I got:
[06:30:23 PM]: OS: debian buster/sid, 16 cores, ~ 66 GB RAM. Python: 3.7.3
[06:30:23 PM]: AMPtk v1.4.0, USEARCH v9.2.64, VSEARCH v2.13.6
[06:30:23 PM]: Base name set to: /home/filipe/miniconda3/envs/amptk/lib/python3.7/site-packages/amptk/DB/V4_NR132
[06:30:23 PM]: Working on file: /media/filipe/84CAA4DCCAA4CC2C/databases/SILVA_fasta/amptk_SILVA_132
[06:30:33 PM]: 556,586 records loaded
[06:30:33 PM]: Using 16 cpus to process data
[06:30:46 PM]: 556,389 records passed (99.96%)
[06:30:46 PM]: Errors: 0 no taxonomy info, 0 no ID, 1 length out of range, 196 too many ambiguous bases, 0 no primers found
[06:30:46 PM]: Now dereplicating sequences (collapsing identical sequences)
[06:30:50 PM]: 255,268 records passed (45.86%)
[06:30:50 PM]: Creating UTAX Database, this may take awhile
[07:03:52 PM]: There was a problem creating the DB, check the UTAX log file /home/filipe/miniconda3/envs/amptk/lib/python3.7/site-packages/amptk/DB/V4_NR132.utax.log

The utax logfile says:

00:00 37Mb 0.1% Reading /home/filipe/miniconda3/envs/amptk/lib/python3.7/site-packages/amptk/DB/V4_NR132.extracted.fa^M00:01 112Mb 7
0.6% Reading /home/filipe/miniconda3/envs/amptk/lib/python3.7/site-packages/amptk/DB/V4_NR132.extracted.fa^M00:01 142Mb 100.0% Reading /ho
me/filipe/miniconda3/envs/amptk/lib/python3.7/site-packages/amptk/DB/V4_NR132.extracted.fa
00:01 108Mb 0.1% Converting to upper case ^M00:01 108Mb 10
0.0% Converting to upper case
00:01 109Mb 0.1% Word stats ^M00:02 109Mb 55.9% Word stats^M00:02 109Mb 100.0% Word stats
00:02 109Mb 100.0% Alloc rows
00:02 358Mb 0.1% Build index^M00:03 358Mb 19.3% Build index^M00:04 358Mb 66.9% Build index^M00:04 358Mb 100.0% Build index
00:04 360Mb 0.0% Initialize taxonomy data^M00:05 364Mb 17.6% Initialize taxonomy data^M00:06 373Mb 75.9% Initialize taxonomy data
^M00:06 377Mb 100.0% Initialize taxonomy data
00:06 377Mb 0.0% Building name table ^M00:06 378Mb 100.0% Building name table
00:06 378Mb 41947 names, tax levels min 2, avg 6.6, max 7

WARNING: 1247 taxonomy nodes have >1 parent
00:06 491Mb 0.1% Word stats^M00:07 491Mb 33.8% Word stats^M00:07 491Mb 100.0% Word stats
00:07 491Mb 100.0% Alloc rows
00:07 740Mb 0.1% Build index^M00:08 740Mb 7.0% Build index^M00:09 740Mb
......
27.6% Distance matrix/usort^M33:02 4.3Gb 27.6% Distance matrix/usort
---Fatal error---
Memory limit of 32-bit process exceeded, 64-bit build required

I guess this is related to the 32bit version of usearch being limited to 4 Gb, unfortunately, 64bit version is expensive. My system has 64 Gb.

Is there any way to create this database using vsearch?
Is it possible to split the fasta, loop in the database creation then join files at the end?
Any suggestions would be good.

Best,

@nextgenusfs
Copy link
Owner Author

UTAX is specific to usearch. The memory limit doesn’t seem to always be consistent, technically it should be 4 GB but that doesn’t mean it can work on files equal to 4 GB. Anyway the current hybrid method will use the entire database with vsearch for global alignment. So you can randomly sub sample the input data to generate the UTAX and sintax databases whole keeping the whole thing for the usearch database.

@theo-allnutt-bioinformatics
Copy link

theo-allnutt-bioinformatics commented Dec 3, 2019

Not sure if this will help now but just in case anyone else wants to convert silva to sintax fasta format, I wrote this.. it won't give the proper taxonomic levels to eukaryotes in silva132 but it will work with -makeudb_sintax.
Theo

#!/usr/bin/env python

#reformats SILVA formatted taxonomy in the header of fasta files to be compatible with usearch and sintax dbs.
#useage:
#reformat_silva.py infile.fasta outfile.fasta

import sys

f= open(sys.argv[1],'r')
h= open(sys.argv[2],'w')
tax=['d','p','c','o','f','g','s']


for i in f:
	out=""
	if i[0]==">":
		k=i.split(";")
		if len(k)>7:
			k=k[-7:]
		acc=k[0].split(" ")[0]
		k[-1]=k[-1].rstrip("\n")
		if " " in k[0]:
			k[0]=k[0].split(" ")[1]
		c=-1
		for x in k:
			c=c+1
			out=out+tax[c]+":"+x+","
		
		out=out.rstrip(",")
		h.write(acc+";tax="+out+"\n")
			
		
	else:
		i=i.replace('U','T')
		h.write(i)

@nicolereynolds1
Copy link

nicolereynolds1 commented Jan 31, 2020

I am trying to make the SILVA LSU fasta file into a new LSU database in amptk. I think I have the taxonomy fixed (partially using the script from theo-allnutt-bioinformatics above), but now when I try to create the database, amptk gives me errors. In the command line it says

[Jan 31 03:57 PM]: Creating SINTAX Database, this may take awhile
Traceback (most recent call last):
  File "/Users/smithlab/miniconda/envs/amptk/bin/amptk", line 736, in <module>
    main()
  File "/Users/smithlab/miniconda/envs/amptk/bin/amptk", line 727, in main
    mod.main(arguments)
  File "/Users/smithlab/miniconda/envs/amptk/lib/python3.7/site-packages/amptk/extract_region.py", line 623, in main
    makeDB(OutName, args=args)
  File "/Users/smithlab/miniconda/envs/amptk/lib/python3.7/site-packages/amptk/extract_region.py", line 418, in makeDB
    amptklib.log.error("There was a problem creating the DB, check the UTAX log file %s" % utax_log)
UnboundLocalError: local variable 'utax_log' referenced before assignment

in the log file it says

usearch9(16829,0xa98521c0) malloc: *** mach_vm_map(size=2963820544) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug


usearch9 -makeudb_sintax /Users/smithlab/miniconda/envs/amptk/lib/python3.7/site-packages/amptk/DB/LSU_SINTAX.extracted.fa -output /Users/smithlab/miniconda/envs/amptk/lib/python3.7/site-packages/amptk/DB/LSU_SINTAX.udb -notrunclabels

---Fatal error---
Memory limit of 32-bit process exceeded, 64-bit build required

I tried Googling malloc_error_break and the results I found suggested that the script needs to be debugged to see where the error is occurring and then insert the breakpoint. I do not know how to do this, so do you have any suggestions?

@nextgenusfs
Copy link
Owner Author

Malloc errors are memory related, so the free version of usearch is 32 but and thus the file you are trying to make a database out of is too large and it errors. So need to reduce the size to build the sintax and utax databases. The global alignment or usearch uses vsearch which doesn’t have the memory limit. Look at the readthedocs manual on how I subsampled the data for other databases to build sintax and utax.

@nicolereynolds2
Copy link

Going back to the Green Genes question, I have made an AMPtk formatted Green Genes database (too big to upload here, so I've put it here (gg_13_5_replace), which is also where the 28S database I made (RDP_SILVA_LSU_update) is stored). I added a few endosymbiont sequences (for the specific project I'm working on) and some fungal 18S sequences because the primers we've been using sometimes amplify fungal sequences.

I was able to successfully install it into AMPtk.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants