add SILVA/greengenes DB to 16S amptk taxonomy #44
Comments
Update: I've tried this, but the database is large and searches are incredibly slow, so I need to figure out why before doing this. There seems to be a lot of redundancy in the database. One solution would be to dereplicate and then assign each group its LCA (lowest common ancestor), but I'm not sure that will be any better.
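To make the "dereplicate and then LCA" idea concrete, here is a minimal sketch, not AMPtk's implementation. It assumes each record is a (sequence, lineage) pair where the lineage is a list of taxon names from highest to lowest rank; the function names and the toy records are hypothetical.

```python
from collections import defaultdict


def lca(lineages):
    """Return the longest prefix shared by all lineages (lists of taxon names)."""
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return shared


def dereplicate_with_lca(records):
    """Collapse identical sequences and keep only the taxonomy they agree on.

    records: iterable of (sequence, lineage) pairs.
    Returns a dict mapping each unique (uppercased) sequence to its LCA lineage.
    """
    groups = defaultdict(list)
    for seq, lineage in records:
        groups[seq.upper()].append(lineage)
    return {seq: lca(lins) for seq, lins in groups.items()}


# Toy data, made up for illustration only.
records = [
    ("ACGT", ["Bacteria", "Firmicutes", "Bacilli"]),
    ("acgt", ["Bacteria", "Firmicutes", "Clostridia"]),
    ("TTGA", ["Bacteria", "Proteobacteria"]),
]
merged = dereplicate_with_lca(records)
```

In this toy case the two identical sequences collapse into one entry labelled only down to Firmicutes, which shows the trade-off: the database shrinks, but conflicting labels lose resolution.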
Creating the UTAX and SINTAX databases failed, but the usearch database was created.
#!/bin/bash
# download the SILVA_132_SSURef_NR99 database and extract the V4 region
mkdir $SILVA && cd $SILVA
# Download and check
wget -c ${URL}/${INPUT}{,.md5} && md5sum -c ${INPUT}.md5
# Define variables and output files
# Primers V3-V4 region (~465 bp) 341F & 805R
OUTPUT="${INPUT/.fasta.gz/_341F_805R.fasta}"
# Extract target region using forward & reverse primers
zcat "${INPUT}" | sed '/^>/ ! s/U/T/g' |
# Format extracted SILVA database
sed 's/://g' $SILVA/${OUTPUT} |
amptk database =
This is my output:
amptk database -i
[12:50:36 PM]: OS: Ubuntu 14.04, 8 cores, ~ 8 GB RAM. Python: 3.4.3
amptk database -i
[12:52:46 PM]: OS: Ubuntu 14.04, 8 cores, ~ 8 GB RAM. Python: 3.4.3
amptk database -i
[12:55:22 PM]: OS: Ubuntu 14.04, 8 cores, ~ 8 GB RAM. Python: 3.4.3
It is likely running out of memory, or the taxonomy labels are not formatted correctly. Check the logfile to see what error usearch died with.
usearch v9.2.64_i86linux32, 4.0Gb RAM (1040Gb total), 160 cores
00:00 37Mb 0.1% Reading 16S_SINTAX.extracted.fa
WARNING: 1535 taxonomy nodes have >1 parent
usearch9 -makeudb_sintax 16S_SINTAX.extracted.fa -output 16S_SINTAX.udb -notrunclabels
---Fatal error---
Hopefully that error makes sense: you have at least one sequence that doesn't have any taxonomy information.
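One way to track down such records is to scan the FASTA headers before building the database. The sketch below assumes headers carry the `;tax=...;` annotation that usearch expects for SINTAX/UTAX reference files; the helper name and the example headers are hypothetical.

```python
import re

# Matches a non-empty taxonomy annotation such as ";tax=d:Bacteria,p:Firmicutes;"
TAX_RE = re.compile(r";tax=([^;]+)")


def headers_missing_taxonomy(fasta_lines):
    """Return the headers (without '>') that have no or an empty tax= field."""
    missing = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, nothing to check
        header = line[1:].strip()
        match = TAX_RE.search(header)
        if match is None or not match.group(1).strip():
            missing.append(header)
    return missing


# Made-up records for illustration.
example = [
    ">seq1;tax=d:Bacteria,p:Firmicutes;",
    "ACGT",
    ">seq2;tax=;",
    "ACGT",
    ">seq3",
    "ACGT",
]
bad = headers_missing_taxonomy(example)
```

For a real file, pass an open file handle instead of the list, for example `headers_missing_taxonomy(open("16S_SINTAX.extracted.fa"))`, and remove or fix whatever it reports before rerunning the build.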
I am commenting here because it is related: I am trying to build a tailored V4 database from SILVA NR 132. I think I've managed to format the headers properly; the FASTA file is now around 200 MB. However, after running:
amptk database -i '/media/filipe/84CAA4DCCAA4CC2C/databases/SILVA_fasta/amptk_SILVA_132' -o V4_NR132 --format off --create_db utax --skip_trimming --install --primer_required none --derep_fulllength
I got:
The utax logfile says:
00:00 37Mb 0.1% Reading /home/filipe/miniconda3/envs/amptk/lib/python3.7/site-packages/amptk/DB/V4_NR132.extracted.fa
00:01 112Mb 7 WARNING: 1247 taxonomy nodes have >1 parent
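The "taxonomy nodes have >1 parent" warning can be investigated by listing which labels appear under more than one parent. The following is a rough sketch under the assumption that the warning refers to the same rank-prefixed name (for example `g:Bacillus`) occurring beneath different parents in different lineages; that reading is not confirmed by this thread. The function name and the sample strings are hypothetical.

```python
from collections import defaultdict


def nodes_with_multiple_parents(tax_strings):
    """Map each taxonomy label to its parents, keeping labels with >1 parent.

    tax_strings: iterable of strings like "d:Bacteria,p:Firmicutes,g:Bacillus".
    """
    parents = defaultdict(set)
    for tax in tax_strings:
        nodes = [n for n in tax.strip(";").split(",") if n]
        # Pair every node with the node directly above it ("root" for the top).
        for parent, child in zip(["root"] + nodes, nodes):
            parents[child].add(parent)
    return {child: sorted(ps) for child, ps in parents.items() if len(ps) > 1}


# Made-up lineages for illustration.
conflicts = nodes_with_multiple_parents([
    "d:Bacteria,p:Firmicutes,g:Bacillus",
    "d:Bacteria,p:Proteobacteria,g:Bacillus",
])
```

Labels reported here are candidates for renaming or for trimming the lineage back to the last consistent rank.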
UTAX is specific to usearch. The memory limit doesn't always seem to be consistent: technically it should be 4 GB, but that doesn't mean it can work on files of 4 GB. In any case, the current hybrid method will use the entire database with vsearch for global alignment. So you can randomly subsample the input data to generate the UTAX and SINTAX databases while keeping the whole thing for the usearch database.
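Random subsampling of a FASTA file can be sketched as below. This is not the procedure from the AMPtk manual, only a generic reservoir sample written for illustration; `read_fasta`, `reservoir_sample`, the sample size, and the seed are all assumptions.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reservoir_sample(records, k, seed=0):
    """Pick k records uniformly at random in one pass (reservoir sampling)."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample


# Made-up input for illustration.
fasta = [">a;tax=d:Bacteria;", "ACGT", "ACGT", ">b;tax=d:Bacteria;", "TTGA",
         ">c;tax=d:Archaea;", "GGCC"]
subset = reservoir_sample(read_fasta(fasta), k=2, seed=1)
```

The subset would then be written back out as FASTA and used only for the UTAX and SINTAX builds, with the full file kept for the usearch (global alignment) database as described above.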
Not sure if this will help now, but in case anyone else wants to convert SILVA to SINTAX FASTA format, I wrote this. It won't give the proper taxonomic levels to eukaryotes in SILVA 132, but it will work with -makeudb_sintax.
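For readers who want a starting point, here is a minimal sketch of such a conversion. It is not the script referred to above. It assumes SILVA-style headers of the form `>ACCESSION Rank1;Rank2;...` and assigns the SINTAX rank letters d, p, c, o, f, g, s purely by position, which is why it shares the limitation with eukaryotes mentioned above. The function name and the example accession are hypothetical.

```python
RANKS = "dpcofgs"  # domain, phylum, class, order, family, genus, species


def silva_to_sintax(header):
    """Convert one SILVA-style FASTA header to a SINTAX-style header.

    Ranks are assigned by position only; lineages deeper than seven levels
    are truncated, and a header without any lineage raises ValueError.
    """
    header = header.lstrip(">").strip()
    accession, _, lineage = header.partition(" ")
    names = [n.strip() for n in lineage.split(";") if n.strip()]
    labels = []
    for rank, name in zip(RANKS, names):
        # Spaces, commas and colons would break the tax= field, so replace them.
        clean = name.replace(" ", "_").replace(",", "_").replace(":", "_")
        labels.append(f"{rank}:{clean}")
    if not labels:
        raise ValueError(f"no taxonomy in header: {header!r}")
    return f">{accession};tax={','.join(labels)};"


# Made-up header for illustration.
converted = silva_to_sintax(
    ">XX000000.1.1500 Bacteria;Firmicutes;Bacilli;Lactobacillales;"
    "Streptococcaceae;Streptococcus;Streptococcus mutans"
)
```

Raising on an empty lineage is a deliberate choice here: records without taxonomy are better caught during conversion than later by a failed database build.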
I am trying to make the SILVA LSU fasta file into a new LSU database in amptk. I think I have the taxonomy fixed (partially using the script from theo-allnutt-bioinformatics above), but now when I try to create the database, amptk gives me errors. In the command line it says
in the log file it says
I tried Googling malloc_error_break, and the results I found suggested that the script needs to be debugged to see where the error is occurring and then insert the breakpoint. I do not know how to do this, so do you have any suggestions?
Malloc errors are memory related. The free version of usearch is 32-bit, so the file you are trying to build a database from is too large and it errors out. You need to reduce the size of the input to build the SINTAX and UTAX databases. The global alignment (usearch) method uses vsearch, which doesn't have the memory limit. Look at the readthedocs manual to see how I subsampled the data for other databases to build SINTAX and UTAX.
Going back to the Green Genes question, I have made an AMPtk-formatted Green Genes database. It is too big to upload here, so I've put it here (gg_13_5_replace), which is also where the 28S database I made (RDP_SILVA_LSU_update) is stored. I added a few endosymbiont sequences (for the specific project I'm working on) and some fungal 18S sequences, because the primers we've been using sometimes amplify fungal sequences. I was able to successfully install it into AMPtk.
The problem is reformatting the taxonomy information into the format needed by UTAX/SINTAX. I've previously looked at the SILVA taxonomy, and it appeared to be a hot mess (I'm not a bacteriologist), so the challenge will be to convert the taxonomy strings to the proper format.
example taxonomy strings: