Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[convert_genbank_to_gff3.py] key_error: locus_tag #60

Open
Piplopp opened this issue May 13, 2019 · 5 comments
Open

[convert_genbank_to_gff3.py] key_error: locus_tag #60

Piplopp opened this issue May 13, 2019 · 5 comments
Assignees
Labels

Comments

@Piplopp
Copy link

Piplopp commented May 13, 2019

Hello !
I'm trying to use the genbank to gff3 converter on this Genbank file: NC_008724

But I get a KeyError: 'locus_tag', some features doesn't have the locus_tag qualifier and it seems that's why the error is raised.

@Piplopp
Copy link
Author

Piplopp commented May 13, 2019

I just added a quick and dirty try/except on all the locus_tag getters and the conversion went fine and skipped the features that didn't have this qualifier.

Something like: (~ line 122 in the file)

elif feat.type == 'tRNA':
    try:
        locus_tag = feat.qualifiers['locus_tag'][0]
    except:
        print("WARNING: The following feature was skipped due to lack of locus_tag:\n{0}".format(feat))
        features_skipped_count += 1

Maybe an if statement would have been prettier tho...

@jorvis
Copy link
Owner

jorvis commented May 13, 2019

Good catch. Would you like to submit a pull request for attribution?

@Piplopp
Copy link
Author

Piplopp commented May 13, 2019

Sure thing, I'll submit tomorrow

@Piplopp
Copy link
Author

Piplopp commented May 14, 2019

Just a quick glance over the genbank file that lead to the error (for readability I removed some useless informations but you can find the whole file here):

On this first example, you can see that the tRNA feature does not have the locus_tag and that they are located outside the gene.

     gene            83175..83459
                     /gene="Z254R"
                     /locus_tag="ATCV1_Z254R"
                     /db_xref="GeneID:5470877"
     CDS             83175..83459
                     /gene="Z254R"
                     /locus_tag="ATCV1_Z254R"
                     [...]
     misc_feature    <83208..>83435
                     /gene="Z254R"
                     /locus_tag="ATCV1_Z254R"
                     [...]
     tRNA            83727..83793
                     /product="tRNA-Ser"
     tRNA            83799..83872
                     /product="tRNA-Arg"
     [...]
     tRNA            84456..84528
                     /product="tRNA-Lys"

On this second example, we have the tRNA feature without a locus_tag but, its coordinates are within the range of the previous gene:

     gene            complement(84467..84664)
                     /gene="Z255L"
                     /locus_tag="ATCV1_Z255L"
                     /db_xref="GeneID:5470549"
     CDS             complement(84467..84664)
                     /gene="Z255L"
                     /locus_tag="ATCV1_Z255L"
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="YP_001426736.1"
                     /db_xref="GeneID:5470549"
                     /translation="MTSSLPLRLESADSALIGIEPMTPWLTAKCSNRLSYSALHLKSA
                     DLPDVGLEPTTTRLRALRSAD"
     tRNA            84551..84624
                     /product="tRNA-Asn"

And finally we have the best behavior possible, the one that you already covered: both the locus_tag and within the range:

     gene            84655..84966
                     /gene="Z256R"
                     /locus_tag="ATCV1_Z256R"
                     /db_xref="GeneID:5470266"
     CDS             84655..84966
                     /gene="Z256R"
                     /locus_tag="ATCV1_Z256R"
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="YP_001426737.1"
                     /db_xref="GeneID:5470266"
                     /translation="MKSFVILRSPVRSWLGPKYQGPIAQLGERKTEVFGISPIRQHRL
                     VRSRPGVLRTSPLMRAWVRIPLLPFLIYHICAVSNPNQLCNHSIKNAQKQTYEVLGEK
                     R"
     tRNA            84773..84858
                     /gene="Z256R"
                     /locus_tag="ATCV1_Z256R"
                     /product="tRNA-Leu"
                     /db_xref="GeneID:5470266"

Now, knowing that, what behavior do you prefer between those two ?

1st: If the locus_tag qualifier is absent, just skip the whole feature. I've tried it only for tRNA but I apply it to the mRNA and rRNA aswell pretty easily.

elif feat.type == 'tRNA':
    try:
        locus_tag = feat.qualifiers['locus_tag'][0]
    except KeyError:
        print("WARNING: The following feature was skipped due to missing locus_tag qualifier:\n{0}".format(feat))
        features_skipped_count += 1
        continue

2nd: Or I've tried a second version which is checking if the feature is located inside the current_gene, if it is then we'll add it, and if it's not we'll skip it:

elif feat.type == 'tRNA':
    try:
        locus_tag = feat.qualifiers['locus_tag'][0]
    except KeyError:
        # If the feature does not have a locus_tag, we can check if
        # it's inside the current_gene or not. If it is then we can
        # add it using the last locus_tag in memory. If it isn't we
        # can skip the feature.
        pass
    rna_count_by_gene[locus_tag] += 1
    feat_id = "{0}.tRNA.{1}".format( locus_tag, rna_count_by_gene[locus_tag] )
    tRNA = things.tRNA(id=feat_id, parent=current_gene)
    tRNA.locate_on( target=current_assembly, fmin=fmin, fmax=fmax, strand=strand )

    if not tRNA.contained_within(current_gene):
        print("WARNING: The following feature was skipped:\n{0}".format(feat))
        features_skipped_count += 1
        rna_count_by_gene[locus_tag] -= 1  # remove the feature from the count since we drop it
        continue
    gene.add_tRNA(tRNA)
    current_RNA = tRNA
    ...

In this solution I'm using the last locus_tag in memory which should be the same as the current gene.

@jorvis
Copy link
Owner

jorvis commented May 17, 2019

Well isn't this ugly. According to the Genbank flat file spec for these features:

RNA features (rRNA, tRNA, ncRNA) must include a corresponding gene feature 
with a locus_tag qualifier.

In their examples the gene and ncRNA feature coordinates match. I'd rather annotations not be lost in the process, so the safest might be to print warnings anytime the gene and ncRNA coordinates don't match, then save the gene/ncRNA pairing if they overlap or if there are no other features the previous gene id is associated with.

The more I look though, the uglier this gets. That full annotation file you linked has an entire block of tRNAs with no associated genes:

     tRNA            83727..83793
                     /product="tRNA-Ser"
     tRNA            83799..83872
                     /product="tRNA-Arg"
     tRNA            83898..83968
                     /product="tRNA-Gly"
     tRNA            83991..84062
                     /product="tRNA-Asp"
     tRNA            84086..84158
                     /product="tRNA-Val"
     tRNA            84181..84253
                     /product="tRNA-Val"
     tRNA            84276..84349
                     /product="tRNA-Asn"
     tRNA            84372..84453
                     /product="tRNA-Tyr"
     tRNA            84456..84528
                     /product="tRNA-Lys"

Also, the one you mention that I already had covered 'ATCV1_Z256R' is actually a protein coding gene whose coordinates happen to overlap their annotated tRNA. They don't have anything to do with each other, so the current version would actually erroneously assign the locus tag to that tRNA and it shouldn't have.

The strict solution here would be erroring if there's an ncRNA feature without matching coordinates on a gene feature, and referencing the NCBI spec.

The little bit nicer one is to use the locus_tag for an ncRNA feature only if the coordinates match, else export the tRNA/gene stack to GFF3 without a locus tag for these free-floating annotations.

Happy to hear opinions.

@jorvis jorvis self-assigned this May 17, 2019
@jorvis jorvis added the bug label May 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants