Exploratory Data Analysis (EDA): Gene Ontology (GO) Graph

Many of the complex systems around us have entities connected by their relations in a network:

Friends are connected by friendships
Neurons are connected by synapses
Websites are connected by hyperlinks
Addresses are connected by roads, streets, walking paths
Proteins are connected by molecular interactions, pathways

Figure 1. Protein-Protein Interaction (PPI) networks of Saccharomyces cerevisiae (left) and Homo sapiens (right). Each node represents one protein, and each edge indicates an interaction between two proteins. Both interactomes (i.e. proteins connected by their molecular interactions, organized in a network) were obtained using the yeast two-hybrid (Y2H) technique (Jeong et al., 2001: 411(3); Rual et al., Nature 2005: 437(4); Macmillan Publishers Ltd.).

Networks are an organizing principle of Nature. The basic idea for abstracting real-world networks, with their entities and relations, into computationally tractable data structures is as follows:

entities are represented by nodes, also known as vertices
relations are represented by links, also known as edges
nodes are connected by edges in many possible ways in a graph

We will use the terms networks and graphs interchangeably in the notebook, and nodes and edges for entities and relations, respectively.
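The node/edge abstraction above can be sketched with nothing more than a dictionary. This is a minimal illustration using only the Python standard library; the friend names and the `adjacency` and `degree` variables are made up for the example and do not come from the notebook.

```python
from collections import defaultdict

# Toy undirected network: each pair is one edge (a friendship) between two nodes.
edges = [("Ana", "Bob"), ("Bob", "Cid"), ("Ana", "Cid"), ("Cid", "Dee")]

# Adjacency map: node -> set of neighbouring nodes.
adjacency = defaultdict(set)
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)  # undirected, so the edge is stored in both directions

# Degree of a node = number of edges attached to it.
degree = {node: len(neighbours) for node, neighbours in adjacency.items()}

print(sorted(adjacency["Cid"]))  # ['Ana', 'Bob', 'Dee']
print(degree["Cid"])             # 3
```

Storing neighbours in sets keeps lookups cheap and silently ignores duplicate edges, which is fine for a simple graph but would lose information in a multigraph.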

Figure 2. Undirected Graph, Directed Cyclic Graph (DCG) and Directed Acyclic Graph (DAG). In undirected graphs, edges have no direction: the relation holds equally both ways. In directed graphs, each edge has a direction, depicted as an arrow departing from one node and arriving at another (an edge that connects a node to itself is called a self-loop). A directed graph is cyclic when we can trace a path from a node back to the same node, as indicated by the red arrows; a single cyclic path, even a self-loop, is enough. In directed acyclic graphs, no such cycles exist. There are also multigraphs, in which multiple edges (parallel edges) connect the same two nodes. This is the case for the Gene Ontology graph.
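To make the cyclic/acyclic distinction concrete, here is a small sketch of cycle detection by depth-first search. The function name `has_cycle` and the three toy graphs are assumptions made for illustration; the notebook itself may rely on a graph library instead.

```python
def has_cycle(graph):
    """Return True if a directed graph (node -> list of successors) contains a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    colour = {n: WHITE for n in nodes}

    def visit(node):
        colour[node] = GREY
        for nxt in graph.get(node, []):
            if colour[nxt] == GREY:  # edge back into the current path: a cycle
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    # Recursive DFS: fine for toy graphs, but deep graphs may need an explicit stack.
    return any(colour[n] == WHITE and visit(n) for n in nodes)


dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}   # two paths to d, no cycle
cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}     # a -> b -> c -> a
self_loop = {"a": ["a"]}                          # a node pointing to itself

print(has_cycle(dag), has_cycle(cyclic), has_cycle(self_loop))  # False True True
```

Note that node `d` in the first graph is reachable by two different paths, which is allowed in a DAG as long as no path leads back to its starting node.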

Real-world networks can be quite large and complex, with thousands to millions of entities (e.g. social networks, PPI networks, internet addresses, etc.). To analyze such large networks, we can focus on specific features and ignore the others. All examples above can be abstracted into entities (users, neurons, websites, ...) represented by nodes and connected by edges denoting their relations (friendships, synapses, links, ...). These networks differ in size and in node/edge composition, but the basic building blocks and the math used to analyze them are the same. Thus, networks follow the same organizing principles and can be analyzed using similar tools.

Data scientists are likely to work with graph data structures a great deal. With data structured as graphs, we can exploit the relational structure to make better predictions about how the network behaves over time (predicting new nodes, new edges, or new graphs). Graph Neural Networks (GNNs) are an active subfield of machine learning and a current frontier of Deep Learning. Graph Theory and Geometric Deep Learning provide tools and approaches to construct and analyze graphs, as well as to predict the appearance of new entities or relations in a graph. For machine learning (ML) applications, frameworks such as Deep Graph Library (DGL) and PyTorch Geometric (PyG) provide classes and functions to create and manipulate graphs easily.

DGL and PyG also provide out-of-the-box algorithms for learning on graphs (e.g. Graph Convolutional Networks (GCNs)), based on the idea that using graphs for prediction tasks can be more efficient than traditional deep learning algorithms. The topological information encoded in the graph structure tells the algorithm what is essential to look at during training. The topological information and the attributes of nodes and edges are embedded into vectors that are used for prediction.
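As a rough intuition for how topology and node attributes end up in one vector, the sketch below runs a single round of mean aggregation over neighbours in plain Python. It is a deliberately simplified assumption: it has no learned weights or non-linearity, it is not the DGL or PyG API, and the `aggregate` function and toy data are invented for this example.

```python
# Toy undirected graph and one small feature vector per node (illustrative values).
neighbours = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 3.0]}


def aggregate(neighbours, features):
    """One round: each node's new vector is the mean of its own and its neighbours' vectors."""
    updated = {}
    for node, nbrs in neighbours.items():
        group = [features[node]] + [features[n] for n in nbrs]
        # zip(*group) walks the vectors column by column.
        updated[node] = [sum(column) / len(group) for column in zip(*group)]
    return updated


embedding = aggregate(neighbours, features)
print(embedding["b"])  # [0.5, 0.5]
```

After one round, each node's vector already reflects its immediate neighbourhood; repeating the step would let information from nodes further away flow in.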

Gene Ontology Directed Acyclic Multigraph

The Gene Ontology (GO) terms are organized in a hierarchical directed acyclic multigraph. Each GO term (e.g. GO:0000001) is a node, and the edges represent the relationships between terms (e.g. "is_a", "regulates"). Because the graph is directed, each node has ancestors and descendants; the direct ancestors and descendants of a node are referred to as its parent and child nodes, respectively. Parents are closer to the root of the GO graph, and child terms are more specific annotations.
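A directed multigraph with typed edges can be sketched as a list of (child, relation, parent) triples. The term names below (`term_A` to `term_D`) are placeholders rather than real GO terms, and the `parents` and `children` lookups are hypothetical helpers for this illustration only.

```python
from collections import defaultdict

# (child, relation, parent) triples; names are placeholders, not real GO terms.
triples = [
    ("term_B", "is_a", "term_A"),
    ("term_C", "is_a", "term_A"),
    ("term_C", "regulates", "term_B"),
    ("term_D", "is_a", "term_B"),
    ("term_D", "part_of", "term_B"),  # parallel edge: same node pair, different relation
]

parents = defaultdict(lambda: defaultdict(set))  # child -> parent -> set of relations
children = defaultdict(set)                      # parent -> set of children
for child, relation, parent in triples:
    parents[child][parent].add(relation)
    children[parent].add(child)

print(sorted(parents["term_C"]))            # ['term_A', 'term_B']
print(sorted(parents["term_D"]["term_B"]))  # ['is_a', 'part_of']
print(sorted(children["term_A"]))           # ['term_B', 'term_C']
```

Keeping a set of relations per node pair is what distinguishes this from a simple directed graph: two nodes can be linked more than once, as long as each link carries a different relation.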

Figure 3. Parent and child example: mitochondrion. In this figure from Relations in the Gene Ontology, mitochondrion has two parents: *cytoplasm* and *organelle*, and the parent term *organelle* has two children: *mitochondrion* and *organelle membrane*.

Unlike taxonomy trees, where each child entity has exactly one parent, a child GO term can have more than one parent. For example, a chloroplast 'is an' organelle and 'is part of' the cytoplasm. The GO graph has not one root but three, one for each major ontology: Cellular Component (CC), Molecular Function (MF) and Biological Process (BP). These ontologies are is_a disjoint: no is_a relation connects terms that belong to different ontologies.
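The multiple-parent and multiple-root structure can be explored with a short traversal. The sketch below uses a hand-built toy hierarchy with term names instead of GO identifiers; the relation types are collapsed into a single "parent" link, and the `ancestors` helper is an assumption for illustration, not code taken from the notebook.

```python
from collections import deque

# Toy child -> parents map; a simplification that ignores relation types.
parent_map = {
    "mitochondrion": ["organelle", "cytoplasm"],  # one child, two parents
    "organelle membrane": ["organelle"],
    "organelle": ["cellular_component"],
    "cytoplasm": ["cellular_component"],
    "biosynthetic process": ["metabolic process"],
    "metabolic process": ["biological_process"],
}


def ancestors(term, parent_map):
    """Collect every term reachable by following parent links upward (breadth-first)."""
    seen = set()
    queue = deque(parent_map.get(term, []))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(parent_map.get(current, []))
    return seen


# Roots are terms that have no parents of their own.
all_terms = set(parent_map) | {p for ps in parent_map.values() for p in ps}
roots = {t for t in all_terms if not parent_map.get(t)}

print(sorted(ancestors("mitochondrion", parent_map)))
# ['cellular_component', 'cytoplasm', 'organelle']
print(sorted(roots))  # ['biological_process', 'cellular_component']
```

In this toy data the ancestors of "mitochondrion" and of "biosynthetic process" never overlap, which mirrors the separation between the sub-ontologies described above.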

Figure 4. The hexose biosynthetic process. The biosynthetic process is a subtype of metabolic process and hexose is a subtype of monosaccharide. (Source: Gene Ontology Consortium)

Here we perform an Exploratory Data Analysis (EDA) of the GO graph. The EDA presented in this notebook was motivated by the Critical Assessment of Functional Annotation (CAFA) competition, which the CAFA initiative hosts to engage the data science community in finding new ways to improve the prediction of protein function.

Before that, we will walk through some basic operations on graphs:

1. Basic operations on graphs:

Creating graphs
Adding nodes and edges
Adding graph, node and edge attributes
Analyzing graph and node degrees
Analyzing graph connectivity
Visualizing graphs

2. Exploratory Data Analysis - Gene Ontology Graph

Reading and parsing GO graph data
Analyzing graph and nodes' degrees
Extracting and visualizing nodes' attributes
Analyzing graph connectivity
Exploring Parents and Children GO terms
Visualizing the GO graph
Splitting the GO graph into major sub-ontologies
Extracting nodes' attributes into a dataframe
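As a rough preview of the reading, splitting and attribute-extraction steps, the sketch below parses a few hand-written stanzas, assuming the GO data comes in the OBO flat-file format. The sample text, the `EX:` identifiers and the `parse_terms` function are invented for illustration; a real analysis would more likely use a dedicated OBO reader.

```python
# Hand-written stanzas in OBO style; identifiers and names are made up.
SAMPLE_OBO = """\
[Term]
id: EX:0000001
name: example root process
namespace: biological_process

[Term]
id: EX:0000002
name: example child process
namespace: biological_process
is_a: EX:0000001 ! example root process

[Term]
id: EX:0000003
name: example component
namespace: cellular_component
"""


def parse_terms(text):
    """Parse [Term] stanzas into a list of dicts: one row per term, plus an is_a list."""
    terms, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):  # stanza header; anything other than [Term] is skipped
            current = {"is_a": []} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            value = value.split(" ! ")[0]  # drop the trailing human-readable comment
            if tag == "is_a":
                current["is_a"].append(value)
            else:
                current[tag] = value  # simplification: other repeated tags are overwritten
    return terms


terms = parse_terms(SAMPLE_OBO)

# Split term identifiers by sub-ontology using the namespace attribute.
by_namespace = {}
for term in terms:
    by_namespace.setdefault(term["namespace"], []).append(term["id"])

print(by_namespace)
```

Because each term is a flat dictionary, the resulting list has a row-per-term shape that converts naturally into a dataframe.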
