Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing gene - blastn evalue rounded to 0.0 #60

Open
crcardenas opened this issue Jun 20, 2024 · 2 comments
Open

Missing gene - blastn evalue rounded to 0.0 #60

crcardenas opened this issue Jun 20, 2024 · 2 comments

Comments

@crcardenas
Copy link

crcardenas commented Jun 20, 2024

I am running mitofinder on assembled contigs to search for mtDNA/Genomes. However, I am missing CO1 from the output. I thought maybe I just didn't capture that mtGene. However, to confirm my suspicions, I ran blastn using the same 5 reference genes and found a contig or two that should contain that gene (three hits actually, but the longest blast alignment fits the expected length of 1.5k bp). For some reason the tmp files directory generated by mitofinder has the CO1 blast output empty (actually all of them are) so I cannot check the blast scores used by mitofinder.

Here is an example of my sanity check where all CO1 evalues are 0.0

contig_2018	NC_036507.1@COX1	91.063	884	77	2	1	883	364	1246	0.0	1194
contig_2018	NC_046469.1@COX1	90.611	884	81	2	1	883	364	1246	0.0	1171
contig_2018	NC_016469.1@COX1	90.498	884	82	2	1	883	364	1246	0.0	1166
contig_2018	NC_018339.1@COX1	89.581	883	92	0	1	883	355	1237	0.0	1122
contig_2018	NC_044759.1@COX1	89.052	886	91	3	1	883	364	1246	0.0	1094
...
contig_1535	NC_036507.1@COX1	90.537	1173	102	4	15804	16968	1	1172	0.0	1543
contig_1535	NC_044759.1@COX1	89.608	1174	115	6	15804	16970	1	1174	0.0	1485
contig_1535	NC_016469.1@COX1	89.632	1167	112	4	15804	16962	1	1166	0.0	1476
contig_1535	NC_046469.1@COX1	89.632	1167	112	4	15804	16962	1	1166	0.0	1476
contig_1535	NC_018339.1@COX1	89.261	1164	116	5	15813	16968	1	1163	0.0	1448
...
contig_2876	NC_016469.1@COX1	90.944	1546	138	2	3549	5093	1	1545	0.0	2078
contig_2876	NC_036507.1@COX1	90.980	1541	137	2	3549	5088	1	1540	0.0	2074
contig_2876	NC_046469.1@COX1	90.962	1538	137	2	3549	5085	1	1537	0.0	2069
contig_2876	NC_044759.1@COX1	90.201	1541	145	3	3549	5086	1	1538	0.0	2004
contig_2876	NC_018339.1@COX1	90.137	1531	151	0	3558	5088	1	1531	0.0	1991

My understanding is that blast will round a very small e-value to 0 if its a good match.

However, I noticed that this could be an issue with the way the max value is added in the python script:
line 84 mitofinder

parser.add_argument('-e', '--blast-eval', help='e-value of blast program used for contig identification and annotation. Default = 0.00001', type=float,
						default=0.00001, dest='blasteVal')

I was also missing some other genes before adding mtDNA reference from a sister genus, which I think might explain why some of the genes were missing. Here is a good example, ND4: ~ 1.3k bp reference genes. Which now appear with near zero evalues for the longest alignment length contig.

contig_2359	NC_016469.1@ND4	90.280	751	73	0	259262	260012	1290	540	0.0	983
contig_2359	NC_036507.1@ND4	90.280	751	73	0	259262	260012	1290	540	0.0	983
contig_2359	NC_046469.1@ND4	89.614	751	78	0	259262	260012	1290	540	0.0	955
contig_2359	NC_044759.1@ND4	89.418	756	76	4	259262	260017	1290	539	0.0	950
contig_2359	NC_018339.1@ND4	89.390	754	78	2	259262	260015	1290	539	0.0	948
...
contig_3886	NC_016469.1@ND4	89.389	311	32	1	837341	837651	839	1148	1.64e-107	390
contig_3886	NC_018339.1@ND4	89.389	311	32	1	837341	837651	839	1148	1.64e-107	390
contig_3886	NC_036507.1@ND4	89.644	309	29	3	837341	837648	839	1145	1.64e-107	390
contig_3886	NC_046469.1@ND4	88.746	311	34	1	837341	837651	839	1148	3.56e-104	379
contig_3886	NC_044759.1@ND4	87.539	321	36	4	837335	837653	832	1150	7.70e-101	368

I would normally try tweaking the code myself, but I am using the singularity container and don't know how to edit anything there.

Any advice would be greatly appreciated!


To clarify, I am using oxford nanopore output, so this might be part of the issue. However, I have kept the max length for contigs as default (25k bp); so I do not think it would be in conflict with the framework of using mitofinder for UCE assemblies. It may still not be appropriate, but I thought you would want to be made aware of this issue.

system information:
ubuntu server, mitofinder version: mitofinder_v1.4.2.sif, singularity version: 3.7.1
for my own blastn: BLAST 2.12.0+
@RemiAllio
Copy link
Owner

Hi!

Thank you for your feedback.
I am a little surprised by what you found.

Is it possible for you to share the listed contigs, your command line, and your reference so I can check all of this?

Best,
Rémi

@crcardenas
Copy link
Author

crcardenas commented Jun 20, 2024

Hello Rémi,

Interestingly, when I subset it like this, I end up with COX1. However, I still see disagreement between my blastn output and mitofinder:
I see the following genes in my blast: COX1, COX2, COX3, ND2, ND3, ND4, ND5, ND6, and rrnS. However, mitofinder recovers these genes in addition to ND1,ND4L, and CYTB; but not rrnS.

It looks like blastx is used at some point, so that might explain part of this but; so I think my initial prediction about the 0.0 value was inaccurate.

However, I cannot be certain given that the mitofinder blast files are inconsistently empty ( those found in CBX1139_subset/CBX1139_subset_tmp/*_blast_out.txt) and there is inconsistencies between some of those missing genes.

Please see the attached files: CBX1139_subset_files.zip

reference.fasta
-this is the file found in the CBX1139_subset/CBX1139_subset_tmp/contig_id_database.fasta its what I used to create the blast database.
CBX1139_subset.fasta contans any contigs below the default 25k threshold.

Hopefully everything I have provided is clear and I didn't miss anything.

Best,
Cody

Example mitofinder script

#!/bin/bash
source /local/anaconda3/bin/activate
conda activate singularity
cd /home/cody/CALOSOMA_Genomes/NANOPORE/blobtools/CBX1139/mitofinder
for i in 1; do
# can be setup to iterate over a directory of files; cp assemblies over (singularity does not like symlinks) and adjust -B,-j,-a, etc.
/Calosoma_WORKING/MitoFinder/ mitofinder_v1.4.1.sif \
singularity run  \
-B /home/cody/mito/:/home/cody/mito/,/data/raw/Calosoma/NANOPORE/blobtools/CBX1139/mitofinder/:/data/raw/Calosoma/NANOPORE/blobtools/CBX1139/mitofinder/ mitofinder_v1.4.2.sif \
-j CBX1139_subset \
-a /data/raw/Calosoma/NANOPORE/blobtools/CBX1139/mitofinder/CBX1139_subset.fasta \
-r /data/raw/Calosoma/NANOPORE/blobtools/CBX1139/mitofinder/reference.gb \
-m 64 -p 8 -o 5 --rename-contig no \
--min-contig-size 400 --max-contig-size 30000  
done |& tee -a ./mitofinder_`date +%Y%m%d-%H%M%S`.log

example blastn script

for i in 1; do 
#see previous comment about iteration, less necessary here.
DB="../CBX1139/mitofinder/find_COI"; 
blastn -db ${DB}/test.fasta -query CBX1139_subset.fasta \
-outfmt "6" -max_target_seqs 10  -evalue 1e-45 -num_threads 3 \
-out blastn_mito_CBX1139_blast.out; 
done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants