rankishore edited this page Apr 25, 2018 · 8 revisions

Gene descriptions project of the Alliance of Genome Resources (AGR)

Aim: To generate a brief text summary of the current knowledge about a gene, based on data present in AGR. This documentation records what has been done and how. For current tasks not yet completed, see the AGR JIRA site: https://agr-jira.atlassian.net/browse/AGR-787

Description of algorithm

The algorithm writes a gene description based on several types of annotations, e.g. GO annotations and DO annotations. If more than three terms are present in the original annotation set, it trims the terms by finding the Lowest Common Ancestor (LCA) that best represents the terms in the annotation set.

  1. If both parent and child terms are present in the annotation set, the parents are removed.
  2. If more than three terms are present, the most representative Lowest Common Ancestor (LCA) is found.
  3. Thresholds for distance from the root: 3 for Function, 2 for Process, and 5 for Component.
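The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the toy `PARENTS` ontology, the function names, and the reading of "distance from root" as a minimum depth that a candidate LCA must reach are all assumptions made for the example.

```python
from typing import Dict, Optional, Set

# Hypothetical toy ontology: term -> set of direct parents.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "binding": {"root"},
    "ion binding": {"binding"},
    "metal ion binding": {"ion binding"},
    "zinc ion binding": {"metal ion binding"},
    "calcium ion binding": {"metal ion binding"},
}


def ancestors(term: str) -> Set[str]:
    """Return every ancestor of `term`, excluding the term itself."""
    seen: Set[str] = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def depth(term: str) -> int:
    """Shortest distance from the root (the root has depth 0)."""
    if not PARENTS[term]:
        return 0
    return 1 + min(depth(parent) for parent in PARENTS[term])


def remove_parents(terms: Set[str]) -> Set[str]:
    """Step 1: drop any term that is an ancestor of another term in the set."""
    redundant: Set[str] = set()
    for term in terms:
        redundant |= ancestors(term)
    return set(terms) - redundant


def lowest_common_ancestor(terms: Set[str], min_depth: int) -> Optional[str]:
    """Step 2 (simplified): deepest term shared by all lineages.

    Returns None when no shared term is at least `min_depth` from the root
    (assumed meaning of the thresholds in step 3).
    """
    if not terms:
        return None
    common: Optional[Set[str]] = None
    for term in terms:
        lineage = ancestors(term) | {term}
        common = lineage if common is None else common & lineage
    candidates = [c for c in common if depth(c) >= min_depth]
    return max(candidates, key=depth) if candidates else None
```

With this toy data, `remove_parents` keeps only the most specific terms, and `lowest_common_ancestor` merges sibling terms into their shared parent only when that parent is far enough from the root. Choosing the "most representative" LCA among several candidates is not modeled here.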

Data categories planned for the gene description

  1. Molecular function/identity (based on GO MF annotations)
  2. Biological processes involved in (based on GO BP annotations)
  3. Cellular localization (based on GO CC annotations)
  4. Disease associations (based on Disease Ontology (DO) term annotations)
  5. Tissue expression

Data sources

General data source: https://s3.amazonaws.com/mod-datadumps

GO module

For a given set of GO terms, the algorithm finds the Lowest Common Ancestor (LCA) that is most representative of the original set of annotation terms. It trims the number of terms to at most three GO terms per aspect (F, P and C) in the sentences that describe the function, process and cellular localization of a gene.

Annotation priority by evidence code

Annotation terms are included according to the following evidence code priority:

  1. Experimental: EXP, IDA, IPI, IMP, IGI, IEP
  2. High-throughput experimental: HTP, HDA, HMP, HGI, HEP
  3. Phylogenetic and sequence-based analysis: IBA, IBD, ISS, ISO, ISA, ISM, TSA
  4. Electronic and computational analysis: IEA, RCA
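As an illustration of how this priority could be applied, the sketch below ranks annotations by the group their evidence code belongs to. The group labels, the `(term, evidence_code)` tuple shape, and the function names are hypothetical; the codes are copied from the list above, and unknown codes are assumed to sort last.

```python
from typing import List, Tuple

# Evidence code groups in priority order, copied from the list above.
# The group labels are illustrative names, not taken from the project.
EVIDENCE_PRIORITY: List[Tuple[str, List[str]]] = [
    ("experimental", ["EXP", "IDA", "IPI", "IMP", "IGI", "IEP"]),
    ("high_throughput", ["HTP", "HDA", "HMP", "HGI", "HEP"]),
    ("phylogenetic_sequence", ["IBA", "IBD", "ISS", "ISO", "ISA", "ISM", "TSA"]),
    ("electronic", ["IEA", "RCA"]),
]

# Map each code to the index of its group (lower index = higher priority).
CODE_RANK = {
    code: rank
    for rank, (_, codes) in enumerate(EVIDENCE_PRIORITY)
    for code in codes
}


def evidence_rank(code: str) -> int:
    """Return the priority rank of a code; unknown codes sort last (assumption)."""
    return CODE_RANK.get(code, len(EVIDENCE_PRIORITY))


def sort_by_priority(annotations: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Order (term, evidence_code) pairs by priority, keeping input order within a group."""
    return sorted(annotations, key=lambda annotation: evidence_rank(annotation[1]))
```

Because Python's `sorted` is stable, annotations with the same rank keep their original relative order.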

Sentence templates for GO

Function

Experimental evidence codes:

  1. Exhibits
  2. A <structural></structural>

Non-experimental evidence codes:

  1. Predicted to have
  2. Predicted to be a <structural></structural>

Process

Experimental evidence codes: Involved in

Non-experimental evidence codes: Predicted to be involved in

Component

Experimental evidence codes: Localizes to the <component></component>

Non-experimental evidence codes: Predicted to localize to the <component></component>

For the GO term "intracellular":

Experimental evidence codes: Is <intracellular></intracellular>

Non-experimental evidence codes: Predicted to be <intracellular></intracellular>
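The templates in this section could be applied along the following lines. This is only a sketch: the `TEMPLATES` table, the aspect keys `F`, `P` and `C`, the boolean `experimental` flag, and the way term names are joined are assumptions for illustration. The structural and "intracellular" special cases are not covered.

```python
from typing import Dict, List, Tuple

# Sentence prefixes keyed by (aspect, has_experimental_evidence),
# taken from the basic templates for Function, Process and Component.
TEMPLATES: Dict[Tuple[str, bool], str] = {
    ("F", True): "Exhibits",
    ("F", False): "Predicted to have",
    ("P", True): "Involved in",
    ("P", False): "Predicted to be involved in",
    ("C", True): "Localizes to the",
    ("C", False): "Predicted to localize to the",
}


def join_terms(terms: List[str]) -> str:
    """Join term names as 'a', 'a and b', or 'a, b, and c' (assumed style)."""
    if len(terms) <= 2:
        return " and ".join(terms)
    return ", ".join(terms[:-1]) + ", and " + terms[-1]


def go_sentence(aspect: str, experimental: bool, terms: List[str]) -> str:
    """Build one sentence fragment for a GO aspect from its template prefix."""
    return f"{TEMPLATES[(aspect, experimental)]} {join_terms(terms)}"
```

For example, `go_sentence("P", True, ["growth", "embryo development"])` yields "Involved in growth and embryo development".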

Annotation exclusions (not yet implemented)

  1. 'NOT' annotations
  2. terms in the 'do not annotate' file at http://geneontology.org/ontology/subsets/gocheck_do_not_annotate.obo

Term exclusions

  • GO:0008150 biological_process
  • GO:0003674 molecular_function
  • GO:0005575 cellular_component
  • GO:0005488 binding
  • GO:0005515 protein binding
  • GO:0044877 protein-containing complex binding
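A filter for the excluded terms listed above might look like this. The `(go_id, evidence_code)` tuple shape and the function name are hypothetical; only the GO identifiers come from the list.

```python
from typing import List, Tuple

# GO identifiers from the exclusion list above.
EXCLUDED_GO_IDS = {
    "GO:0008150",  # biological_process
    "GO:0003674",  # molecular_function
    "GO:0005575",  # cellular_component
    "GO:0005488",  # binding
    "GO:0005515",  # protein binding
    "GO:0044877",  # protein-containing complex binding
}


def drop_excluded(annotations: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Remove (go_id, evidence_code) pairs whose GO id is on the exclusion list."""
    return [a for a in annotations if a[0] not in EXCLUDED_GO_IDS]
```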

Term replacements

  1. molting cycle, collagen and cuticulin-based cuticle: molting cycle
  2. molting cycle, chitin-based cuticle: molting cycle
  3. multicellular organism growth: growth
  4. embryo development ending in birth or egg hatching: embryo development
  5. synaptic transmission, <some></some>: <some></some> synaptic transmission
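The replacements above could be expressed as a lookup table plus one pattern rule. This is a sketch under assumptions: the names `EXACT_REPLACEMENTS`, `SYNAPTIC_PATTERN` and `replace_term` are illustrative, and the pattern assumes the qualifier is everything after "synaptic transmission, ".

```python
import re

# Exact term-name replacements from items 1 to 4 above.
EXACT_REPLACEMENTS = {
    "molting cycle, collagen and cuticulin-based cuticle": "molting cycle",
    "molting cycle, chitin-based cuticle": "molting cycle",
    "multicellular organism growth": "growth",
    "embryo development ending in birth or egg hatching": "embryo development",
}

# Item 5: "synaptic transmission, X" becomes "X synaptic transmission".
SYNAPTIC_PATTERN = re.compile(r"^synaptic transmission, (.+)$")


def replace_term(name: str) -> str:
    """Return the display name for a GO term, or the name unchanged if no rule applies."""
    if name in EXACT_REPLACEMENTS:
        return EXACT_REPLACEMENTS[name]
    match = SYNAPTIC_PATTERN.match(name)
    if match:
        return f"{match.group(1)} synaptic transmission"
    return name
```

For instance, `replace_term("synaptic transmission, cholinergic")` gives "cholinergic synaptic transmission", while a term with no matching rule is returned as-is.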

Disease module