Genome names do not match annotation #13

Open
Marlinski95 opened this issue Apr 19, 2022 · 6 comments

@Marlinski95

Hi,
I am trying to convert my emapper annotations into GenBank format using your tool. I have the following directories set up:

ANNOTATION/ FASTAPROT/ FASTNUCLEIC/ GENBANK/ GFF/ HITS/ ORTHOLOGS/

(emapper2gbk) [mjensen2$] ls FASTNUCLEIC/
BC-1_bin.100.fna BC-1_bin.116.fna BC-1_bin.14.fna etc.

(emapper2gbk) [mjensen2$] ls FASTAPROT/
BC-1_bin.100.emapper.genepred.faa BC-1_bin.116.emapper.genepred.faa BC-1_bin.14.emapper.genepred.faa etc.

(emapper2gbk) [mjensen2$] ls ANNOTATION/
BC-1_bin.100.emapper.annotations BC-1_bin.116.emapper.annotations BC-1_bin.14.emapper.annotations etc.

When I run the following command, however, I get an error saying that the genome names do not match the annotation names.

(emapper2gbk) [mjensen2$] emapper2gbk genes -fn ./FASTNUCLEIC/ -fp ./FASTAPROT/ -o ./GENBANK/ -a ./ANNOTATION/ -c 10 -n BC-1 -go gobasic -g ./GFF/

Since the filenames did not seem to be the problem, I checked the file contents and noticed that emapper appended an additional number to the identifier when it predicted and annotated the genes, e.g.

Contig ID: >bin.1.fak127_1021
Prot ID: >bin.1.fak127_1021_1
Annotation ID: bin.1.fak127_1021_1

I believe this is the problem but I don't know how to work around this as this is something emapper added. Have you encountered this before? I might just be missing a flag of some sort but I am unsure and would appreciate your help!

Cheers,
Marlene

@ArnaudBelcour

Hi @Marlinski95,

When a directory is given as input to emapper2gbk, the tool expects the files for the same organism in the ANNOTATION/FASTAPROT/FASTNUCLEIC folders to have the same name. It seems that this is not the case for your data.

Your input seems to be:

FASTNUCLEIC
    ├── BC-1_bin.100.fna
    ├── BC-1_bin.116.fna
    ├── ...
FASTAPROT
    ├── BC-1_bin.100.emapper.genepred.faa
    ├── BC-1_bin.116.emapper.genepred.faa
    ├── ...
ANNOTATION
    ├── BC-1_bin.100.emapper.annotations
    ├── BC-1_bin.116.emapper.annotations
    ├── ...

With these names, emapper2gbk will not be able to map the different files to the same organism and will return an error. They should be named like this:

FASTNUCLEIC
    ├── BC-1_bin.100.fna
    ├── BC-1_bin.116.fna
    ├── ...
FASTAPROT
    ├── BC-1_bin.100.faa
    ├── BC-1_bin.116.faa
    ├── ...
ANNOTATION
    ├── BC-1_bin.100.tsv
    ├── BC-1_bin.116.tsv
    ├── ...

Renaming the files should fix this issue.
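
The renaming could be scripted roughly as follows. This is only a sketch, not part of emapper2gbk: the two input suffixes are taken from the file listings in this thread, the target extensions follow the layout shown above, and the helper names `plan_renames` and `apply_renames` are made up for illustration.

```python
from pathlib import Path

# Suffixes as they appear in the listings in this thread, mapped to the
# extensions used in the suggested layout. Adjust if your files differ.
SUFFIX_MAP = {
    ".emapper.genepred.faa": ".faa",
    ".emapper.annotations": ".tsv",
}


def plan_renames(folder):
    """Return (old_path, new_path) pairs for files ending in a known emapper suffix."""
    renames = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        for old_suffix, new_suffix in SUFFIX_MAP.items():
            if path.name.endswith(old_suffix):
                new_name = path.name[: -len(old_suffix)] + new_suffix
                renames.append((path, path.with_name(new_name)))
                break
    return renames


def apply_renames(renames, dry_run=True):
    """Print the planned renames and perform them unless dry_run is True."""
    for old, new in renames:
        print(f"{old.name} -> {new.name}")
        if dry_run:
            continue
        if new.exists():
            raise FileExistsError(f"refusing to overwrite {new}")
        old.rename(new)
```

Running `apply_renames(plan_renames("FASTAPROT"))` first only prints what would change; passing `dry_run=False` performs the renames. The same call works for the ANNOTATION folder.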

Best Regards,
Arnaud Belcour.

@Marlinski95
Author

Marlinski95 commented Apr 19, 2022

Oh I see! Thanks so much - so it is always required to reformat the emapper output.
Thanks for the quick help. I fixed the file names now but it still has an issue with the .gff files.

When I run the command above it now gives me the following error:
[image: screenshot of the error message, not preserved]

My gff files are renamed as well and now look like this:

(emapper2gbk) [mjensen2@kleinerserver BC-1]$ ls GFF
BC-1_bin.100.gff BC-1_bin.116.gff BC-1_bin.14.gff etc.

The instructions on the user page are not entirely clear: "the GFF file corresponding to the genome or a folder containing multiple GFF files (must be the same name as the nucleotide folder)." Does this mean the gff directory has to be in the nucleotide directory (when I have anything but the .fna files in there it complains)? Could you clarify?

Thanks again!

@ArnaudBelcour

Can you give me the complete error message and the command you used?
It seems that the path ./gff/ has been associated with the -go option (option used to select/download the go-basic.obo file to process Gene Ontology Terms) instead of the -g option (to handle GFF).

> My gff files are renamed as well and now look like this:
>
> (emapper2gbk) [mjensen2@kleinerserver BC-1]$ ls GFF
> BC-1_bin.100.gff BC-1_bin.116.gff BC-1_bin.14.gff etc.

This GFF folder seems to be correct and should not produce an error.

> The instructions on the user page are not entirely clear: "the GFF file corresponding to the genome or a folder containing multiple GFF files (must be the same name as the nucleotide folder)." Does this mean the gff directory has to be in the nucleotide directory (when I have anything but the .fna files in there it complains)? Could you clarify?

Sorry, there is a typo in it, I will fix it. The correct sentence is: the GFF file corresponding to the genome or a folder containing multiple GFF files (each GFF file must have the same name as the corresponding nucleotide file).
What this means is that, as for the FASTAPROT and ANNOTATION folders, the names of the files in the GFF folder must be the same as the names of the files in the FASTNUCLEIC folder.

So something like this:

FASTNUCLEIC
    ├── BC-1_bin.100.fna
    ├── BC-1_bin.116.fna
    ├── ...
FASTAPROT
    ├── BC-1_bin.100.faa
    ├── BC-1_bin.116.faa
    ├── ...
ANNOTATION
    ├── BC-1_bin.100.tsv
    ├── BC-1_bin.116.tsv
    ├── ...
GFF
    ├── BC-1_bin.100.gff
    ├── BC-1_bin.116.gff
    ├── ...

The GFF folder is an independent folder (like FASTAPROT and ANNOTATION), so it must not be inside the nucleotide folder. The location of the GFF folder is given to emapper2gbk with the -g option when using emapper2gbk genomes.
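
Before launching a run, the naming rule can be checked with a few lines of Python. This is a rough sketch and not something emapper2gbk provides: the folder names and extensions are the ones from the layout in this thread, and `organism_names` and `report_mismatches` are hypothetical helpers.

```python
from pathlib import Path

# Folder name -> file extension, as used in the layout in this thread.
LAYOUT = {
    "FASTNUCLEIC": ".fna",
    "FASTAPROT": ".faa",
    "ANNOTATION": ".tsv",
    "GFF": ".gff",
}


def organism_names(folder, extension):
    """Return the file names in folder with the given extension stripped."""
    return {
        path.name[: -len(extension)]
        for path in Path(folder).iterdir()
        if path.is_file() and path.name.endswith(extension)
    }


def report_mismatches(root, layout=LAYOUT, reference="FASTNUCLEIC"):
    """Compare every folder against the reference folder.

    Returns a dict mapping each folder that disagrees to the names that are
    missing from it and the names that have no counterpart in the reference.
    """
    root = Path(root)
    expected = organism_names(root / reference, layout[reference])
    problems = {}
    for folder, extension in layout.items():
        if folder == reference:
            continue
        found = organism_names(root / folder, extension)
        missing = sorted(expected - found)
        unexpected = sorted(found - expected)
        if missing or unexpected:
            problems[folder] = {"missing": missing, "unexpected": unexpected}
    return problems
```

An empty result from `report_mismatches(".")` means every genome in FASTNUCLEIC has a protein, annotation, and GFF file with the same base name. A file still called `BC-1_bin.100.emapper.genepred.faa` would show up under "unexpected" for FASTAPROT.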

@Marlinski95
Author

Hello,
thanks for the clarification and extensive response! I truly appreciate it. I now realize that the reason it wasn't working was that I ran the "genes" mode instead of the "genomes" mode. My apologies - rookie mistake.
It ran now, but I received this error message for every single bin:

Creating GFF database (gffutils) for BC-1_bin.15
/!\ Error with BC-1 this taxa has not been found in https://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/scientific-name/
/!\ Check the name of the taxa and its presence in the EBI taxonomy database.
/!\ No genbank will be created for BC-1.
/!\ Only 0 on 127 genbanks have been created, check the logs for error.
--- Total runtime 32.14 seconds ---

Am I still missing something? I know that my taxonomic resolution isn't very high since we suspect a lot of Candidate species in my samples, but I don't entirely understand how this is tied to reformatting the data.

Thanks a thousand for your help and time!
Best,

@ArnaudBelcour

Hi,

The issue here is that BC-1 is too precise a taxonomic resolution for the taxonomy database.

The search on https://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/scientific-name/BC-1 shows no results.

You should use a higher taxonomic rank (either species or genus). You can check whether a name is found by appending the taxon name to the address https://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/scientific-name/.

For example https://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/scientific-name/Escherichia%20coli
(The %20 here is to replace a space between the genus and the species names.)
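
To build such addresses for several candidate names without encoding spaces by hand, something like the following could be used. It is a minimal sketch: the base address is the one quoted above, `taxon_lookup_url` is a made-up helper name, and the code only constructs the address; it does not query the service or interpret its answer.

```python
from urllib.parse import quote

# Base address quoted in this thread.
ENA_TAXON_URL = "https://www.ebi.ac.uk/ena/data/taxonomy/v1/taxon/scientific-name/"


def taxon_lookup_url(name):
    """Return the lookup address for a taxon name, percent-encoding spaces."""
    return ENA_TAXON_URL + quote(name.strip())


for candidate in ["Escherichia coli", "BC-1"]:
    print(taxon_lookup_url(candidate))
```

Each printed address can then be opened in a browser to see whether the name is known to the database.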

Also, do you have only one taxon for all your data?
If not, you can provide a taxon file with the option -nf. This option takes as input a .tsv file with 2 columns (the first is the name of the organism and the second is the name of the corresponding taxon).
For example:

BC-1_bin.100 Genus species
BC-1_bin.116 Escherichia coli
... ...

In this way you can give the specific taxon associated with each genome.
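
Such a file could be generated from a mapping, for instance as sketched below. The assumptions here are that the columns are tab-separated, as the .tsv extension suggests, and that no header row is written; whether emapper2gbk expects a header is not stated in this thread, so check the README. The helper name `write_taxon_file`, the output file name, and the mapping itself are placeholders for illustration.

```python
import csv


def write_taxon_file(path, taxon_by_organism):
    """Write one 'organism<TAB>taxon' line per genome, sorted by organism name."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for organism, taxon in sorted(taxon_by_organism.items()):
            writer.writerow([organism, taxon])


# Placeholder mapping: organism names must match the file names (without
# extension) in the input folders; the taxon names are examples only.
example = {
    "BC-1_bin.100": "Genus species",
    "BC-1_bin.116": "Escherichia coli",
}
write_taxon_file("taxon_names.tsv", example)
```

The resulting file is what would be passed to the -nf option.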

Best regards,
Arnaud Belcour.

@Marlinski95
Author

Marlinski95 commented Apr 20, 2022 via email
