matiollipt/GO-graph-EDA

Exploratory Data Analysis (EDA): Gene Ontology (GO) Graph

Many of the complex systems around us have entities connected by their relations in a network:

  • Friends are connected by friendships
  • Neurons are connected by synapses
  • Websites are connected by hyperlinks
  • Addresses are connected by roads, streets, and walking paths
  • Proteins are connected by molecular interactions and pathways

Left: yeast PPI network, right: human PPI network

Figure 1. Protein-Protein Interaction (PPI) networks of Saccharomyces cerevisiae (left) and Homo sapiens (right). Each node represents one protein, and each edge indicates an interaction between two proteins. Both interactomes (i.e. proteins connected by their molecular interactions and organized in a network) were obtained using the yeast two-hybrid (Y2H) technique (Jeong et al., Nature 2001: 411(3); Rual et al., Nature 2005: 437(4); Macmillan Publishers Ltd.).

Networks are an organizing principle of Nature. The basic idea for abstracting real-world networks, with their entities and relations, into computationally tractable data structures is as follows:

  • entities are represented by nodes, also known as vertices
  • relations are represented by links, also known as edges
  • nodes are connected by edges in many possible ways in a graph

We will use the terms network and graph interchangeably in the notebook, and node and edge for entity and relation, respectively.
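As a minimal sketch of this abstraction, the snippet below stores a small undirected network as an adjacency list using only the Python standard library. The names in the toy friendship network are hypothetical, and the notebook itself may rely on a dedicated graph library instead of plain dictionaries.

```python
from collections import defaultdict

# Hypothetical toy friendship network: each pair is one undirected edge.
edges = [("Ana", "Bob"), ("Bob", "Cleo"), ("Ana", "Cleo"), ("Cleo", "Dan")]

# Adjacency list: node -> set of neighboring nodes.
adjacency = defaultdict(set)
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)  # undirected, so store the edge in both directions

nodes = sorted(adjacency)
# Degree of a node = number of edges attached to it.
degrees = {node: len(adjacency[node]) for node in nodes}

print(nodes)
print(degrees)
```

The same two ingredients, a set of nodes and a collection of edges, are enough to describe any of the networks listed above; only the meaning attached to nodes and edges changes.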

Simple network example

Figure 2. Undirected Graph, Directed Cyclic Graph (DCG) and Directed Acyclic Graph (DAG). In undirected graphs, edges are bidirectional and equivalent. In directed graphs, each edge has a direction, depicted as an arrow departing from one node and arriving at another (an edge can also connect a node to itself, which is called a self-loop). A directed graph is cyclic when at least one path starts at a node and returns to the same node, as indicated by the red arrows; a self-loop counts as such a path. In directed acyclic graphs, no such closed paths exist. There are also multigraphs, in which multiple edges (parallel edges) connect the same two nodes. This is the case for the Gene Ontology graph.
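The distinction between cyclic and acyclic directed graphs can be checked programmatically. The sketch below uses Kahn's algorithm (repeatedly removing nodes that have no incoming edges) on hypothetical toy graphs; it is one common way to test acyclicity, not necessarily the approach taken in the notebook.

```python
from collections import defaultdict, deque

def is_acyclic(nodes, edges):
    """Return True if the directed graph has no cycle (Kahn's algorithm)."""
    in_degree = {node: 0 for node in nodes}
    successors = defaultdict(list)
    for source, target in edges:
        successors[source].append(target)
        in_degree[target] += 1

    # Start from the nodes that no edge points to.
    queue = deque(node for node in nodes if in_degree[node] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for nxt in successors[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                queue.append(nxt)

    # Nodes that lie on a cycle never reach in-degree zero.
    return removed == len(nodes)

# Hypothetical toy graphs, given as (source, target) pairs.
dag = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
cyclic = [("A", "B"), ("B", "C"), ("C", "A")]
self_loop = [("A", "B"), ("B", "B")]
parallel = [("A", "B"), ("A", "B")]  # multigraph: two edges, same node pair

print(is_acyclic("ABCD", dag))
print(is_acyclic("ABC", cyclic))
print(is_acyclic("AB", self_loop))
print(is_acyclic("AB", parallel))
```

Note that parallel edges alone do not create a cycle, which is why a graph can be both a multigraph and acyclic.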

Real-world networks can be quite large and complex, with thousands to millions of entities (e.g. social networks, PPI networks, internet addresses, etc.). To analyze such large networks, we can focus on specific features and ignore the others. All examples above can be abstracted into entities (users, neurons, websites, ...) represented by nodes and connected by edges denoting their relations (friendships, synapses, links, ...). These networks differ in size and in node/edge composition, but the basic building blocks and the math used to analyze them are the same. Networks therefore follow the same organizing principles and can be analyzed using similar tools.

Data scientists are likely to work with graph data structures a great deal. Structuring data as a graph lets us exploit its relational structure to make better predictions about the behavior of the network over time (e.g. predicting new nodes, new edges, or new graphs). Graph Neural Networks (GNNs) are a 'hot' subfield of machine learning and a new frontier of Deep Learning. Graph Theory and Geometric Deep Learning provide tools and approaches to construct and analyze graphs, as well as to predict the appearance of new entities or relations in a graph. For machine learning (ML) applications, frameworks such as Deep Graph Library (DGL) and PyTorch Geometric (PyG) provide classes and functions to create and manipulate graphs easily.

DGL and PyG also provide out-of-the-box algorithms for learning on graphs (e.g. Graph Convolutional Networks (GCNs)), based on the idea that using graphs for prediction tasks can be more efficient than traditional deep learning algorithms. The topological information encoded in the graph structure tells the algorithm what is essential to look at during training: the topology and the attributes of nodes/edges are embedded into vectors, which are then used for prediction.
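To give a rough intuition for how topology and node attributes end up in the same vector, the sketch below performs one round of mean neighborhood aggregation in plain Python. This is a deliberately simplified illustration: it does not use the DGL or PyG APIs, it has no learned weights or nonlinearity, and the graph and feature values are hypothetical.

```python
def aggregate_once(adjacency, features):
    """Average each node's vector with its neighbors' vectors (no learned weights)."""
    updated = {}
    for node, neighbors in adjacency.items():
        # The node's own vector plus the vectors of its direct neighbors.
        group = [features[node]] + [features[n] for n in neighbors]
        dim = len(features[node])
        updated[node] = [sum(vec[i] for vec in group) / len(group) for i in range(dim)]
    return updated

# Hypothetical star-shaped graph: A is connected to B and C.
adjacency = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
features = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.0, 1.0]}

embeddings = aggregate_once(adjacency, features)
print(embeddings)
```

After one round, each node's vector depends on who its neighbors are, so two nodes with identical attributes but different neighborhoods can receive different embeddings.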

Gene Ontology Directed Acyclic Multigraph

The Gene Ontology (GO) terms are organized in a hierarchical directed acyclic multigraph. Each GO term (e.g. GO:0000001) is a node, and the edges represent the relationships between terms (e.g. "is_a", "regulates"). Because the graph is directed, each node has ancestors and descendants; direct ancestors and direct descendants are referred to as parent and child nodes in the GO graph. Parents are closer to the root of the GO graph, and child terms are more specific regarding the annotation.
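A minimal sketch of this structure, using only the standard library, is shown below. Edges are stored as (child, parent, relation) triples so that two terms can be linked by more than one relation type. The relation labels and the placeholder terms "term_x" and "term_y" are illustrative assumptions, not records read from a GO release.

```python
from collections import defaultdict

# Toy edges as (child, parent, relation). The relation labels and the
# "term_x"/"term_y" entries are illustrative assumptions.
edges = [
    ("mitochondrion", "organelle", "is_a"),
    ("mitochondrion", "cytoplasm", "part_of"),
    ("organelle membrane", "organelle", "part_of"),
    ("term_x", "term_y", "is_a"),       # parallel edges: same node pair,
    ("term_x", "term_y", "regulates"),  # two different relation types
]

parents = defaultdict(set)    # child -> parent terms
children = defaultdict(set)   # parent -> child terms
relations = defaultdict(set)  # (child, parent) -> relation types
for child, parent, relation in edges:
    parents[child].add(parent)
    children[parent].add(child)
    relations[(child, parent)].add(relation)

print(sorted(parents["mitochondrion"]))
print(sorted(children["organelle"]))
print(sorted(relations[("term_x", "term_y")]))
```

Keeping the relation type on each edge is what makes the structure a multigraph: collapsing the two "term_x" edges into one would lose the information that the terms are related in two distinct ways.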

Mitochondrion parents

Figure 3. Parent and child example: mitochondrion. In this figure from Relations in the Gene Ontology, mitochondrion has two parents, cytoplasm and organelle, and the parent term organelle has two children, mitochondrion and organelle membrane.

Unlike taxonomy trees, where each child entity has exactly one parent, a child GO term can have more than one parent. For example, a chloroplast 'is an' organelle and 'is part of' the cytoplasm. The GO graph does not have a single root but three, one for each of the major ontologies: Cellular Component (CC), Molecular Function (MF) and Biological Process (BP). These ontologies are is_a disjoint, meaning that no is_a relation connects terms from different ontologies.
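The idea of several roots can be made concrete with a small traversal. The sketch below walks is_a links upward to collect a term's ancestors and identifies roots as terms without parents. The hierarchy is a heavily simplified, hand-written toy (intermediate terms are omitted), so it should be read as an assumption rather than as real GO data.

```python
# Toy is_a hierarchy: term -> list of parent terms (simplified by hand).
is_a = {
    "biological_process": [],
    "cellular_component": [],
    "molecular_function": [],
    "metabolic process": ["biological_process"],
    "biosynthetic process": ["metabolic process"],
    "hexose biosynthetic process": ["biosynthetic process"],
    "organelle": ["cellular_component"],
    "mitochondrion": ["organelle"],
    "catalytic activity": ["molecular_function"],
}

def ancestors(term, parent_map):
    """Collect every term reachable by following parent links upward."""
    found = set()
    stack = list(parent_map[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parent_map[current])
    return found

def roots_of(term, parent_map):
    """Return the parentless terms among a term and its ancestors."""
    candidates = ancestors(term, parent_map) | {term}
    return {t for t in candidates if not parent_map[t]}

# Roots are the terms that have no parent at all.
roots = sorted(term for term, parent_terms in is_a.items() if not parent_terms)

print(roots)
print(sorted(ancestors("hexose biosynthetic process", is_a)))
print(roots_of("mitochondrion", is_a))
```

Because the ontologies are is_a disjoint, following is_a links from any term in this toy example leads to exactly one of the three roots.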

Hexose Biosynthetic Process

Figure 4. The hexose biosynthetic process. The biosynthetic process is a subtype of metabolic process and hexose is a subtype of monosaccharide. (Source: Gene Ontology Consortium)

Here we perform an Exploratory Data Analysis (EDA) of the GO graph. The EDA presented in this notebook was motivated by the Critical Assessment of Functional Annotation (CAFA) competition, which is hosted by the CAFA initiative to engage the data science community in finding new insights on how to improve the prediction of protein function.

Before that, we will walk through some basic operations on graphs:

1. Basic operations on graphs:

  • Creating graphs
  • Adding nodes and edges
  • Adding graph, node and edge attributes
  • Analyzing graph and node degrees
  • Analyzing graph connectivity
  • Visualizing graphs

2. Exploratory Data Analysis - Gene Ontology Graph

  • Reading and parsing GO graph data
  • Analyzing graph and nodes' degrees
  • Extracting and visualizing nodes' attributes
  • Analyzing graph connectivity
  • Exploring Parents and Children GO terms
  • Visualizing the GO graph
  • Splitting the GO graph into major sub-ontologies
  • Extracting nodes' attributes into a dataframe
