Skip to content

Latest commit

 

History

History
538 lines (433 loc) · 28.7 KB

README.md

File metadata and controls

538 lines (433 loc) · 28.7 KB

Build Status DOI

Table of contents

About the Translator project, Team Expander Agent, and ARAX

We are Expander Agent, a team of researchers and software experts working within a consortium effort called the Biomedical Data Translator program ("Translator program"). Initiated by the NIH National Center for Advancing Translational Sciences (NCATS), the Translator program's goal is to accelerate the development of disease therapies by harnessing artificial intelligence, Web-based distributed computing, and computationally-assisted biomedical knowledge exploration. Although the Translator program is only in its third year, Translator tools are already being used by biomedical researchers for hypothesis generation and by analysts who are supporting clinical management of rare disease cases. Key intended applications of Translator include repositioning already-approved drugs for new indications (thus accelerating time-to-market), identifying molecular targets for developing new therapeutic agents, and providing software infrastructure that could enable development of more powerful clinical decision support tools. The 18 teams on the translator project are working with NCATS to achieve this goal by building the Translator software, a modular system of Web services for biomedical knowledge exploration, reasoning, and hypothesis generation.

After a multiyear feasibility assessment (2017-2019), during which our team built a prototype Translator reasoning tool called RTX, the Translator program is moving into a prototype development phase during early 2020. In this new phase, our team is building a modular web-based software system, ARAX, that enables expressive, automatable, and reproducible exploration and analysis of biomedical knowledge graphs without requiring computer programming expertise. As we develop ARAX, we will provide up-to-date software code for ARAX and RTX to the scientific community via this software repository. While ARAX is currently under development, our team will demonstrate ARAX for NCATS in mid-March 2020; at that time we will make ARAX publicly available on the Web and we will provide demonstration web pages and notebooks that illustrate how to access and operate it.

ARAX analyzes knowledge graphs to answer biomedical questions

ARAX is a tool for querying, manipulating, filtering, and exploring biomedical knowledge graphs. It is designed to be a type of middleware—an autonomous relay agent—within the Translator system. The top-level layer of Translator (which is called the autonomous relay system) will issue structured queries to ARAX via ARAX's web application programming interface. Then, based on the query type, ARAX will determine which knowledge providers it needs to consult in order to be able to answer the query; ARAX will then query the required knowledge providers, synthesize the information that it gets from those queries, and respond to the top-level layer in a standardized structured data format. When completed, ARAX will contribute to and advance the Translator program in four key ways:

  1. ARAX provides a powerful domain-specific language (DSL), called ARAXi (technical documentation on ARAXi can be found here), that is designed to enable researchers and clinicians to formulate, reuse, comprehend, and share workflows for biomedical knowledge exploration. One of the key advantages of ARAXi is that it is not a general-purpose programming language; it is purpose-built for the task of describing—in user friendly syntax—a knowledge graph manipulation workflow in terms of ARAX's modular capabilities. All of ARAX's capabilities are exposed through ARAXi. Using ARAXi, an analyst can:
  • define a small query graph; query for entities (e.g., proteins, pathways, or diseases) that match the search criteria represented in the query graph
  • expand a knowledge graph, pulling in concepts that are related to concepts that are already in the knowledge graph
  • filter a knowledge graph, eliminating concepts or relationships that do not match a given set of search criteria
  • overlay contextual information from large datasets (such as co-occurrence of terms in clinical health records or in abstracts of articles in the biomedical literature)
  • resultify: enumerate and return matches of a query graph against a larger knowledge graph; as a sub-case of this step, ARAX can return as a single result, all concepts from the knowledge graph that match a given concept type and that match a given pattern of neighbor-concept-type relationships.
  1. ARAX is based on a modular architecture. It provides distinct, orthogonal, and human-understandable knowledge graph manipulation and analysis capabilities via five operations (query graph, expand, overlay, filter, and resultify) that can be accessed individually or in combination by the Translator top-level layer, other Translator tools, or by individual researchers directly using ARAX. Due to this transparent and modular design, the five ARAX operations are easy-to-use in isolation and easy to compose into workflows. Unifying these modules within a single service framework (ARAX) also provides significant speed benefits for workflows that are implemented end-to-end within ARAX, because the knowledge graph is stored on the server and does not need to round-trip to the client with each operation. Through example ARAX-powered analysis vignettes linked below, we describe how an ARAX workflow using ARAXi can be much more powerful than the sum of its individual parts.

  2. ARAX is a Web service that speaks the information standard—called the Reasoners Standard Application Programming Interface—that Translator has adopted for data interchange between Translator components. Team Expander Agent has been at the forefront of the development and stewardship of the Reasoners Standard API (as described below), and with this perspective, ARAX was built from the ground up to seamlessly interoperate with other Translator software components. In addition to complying with the Translator standard for inter-reasoner communication, ARAX uses knowledge sources (see below) that comply with the Biomedical Data Translator Knowledge Graph Standard (for which our team has been an active participant in the standards development process) which is based on the Biolink Model.

  3. ARAX is integrated with the RTX reasoning tool's knowledge graphs and graph visualization capabilities. During the NCATS Translator feasibility assessment phase, our team built a prototype reasoning tool system called RTX, whose knowledge graphs (both the first-generation knowledge graph RTX-KG1 and the second-generation knowledge graph RTX-KG2) and user interface capabilities are now available through ARAX. This integration enables a user of ARAX to seamlessly refer a server-side knowledge graph or result-set for RTX-based graphical visualization within a web browser. It also provides ARAX with significant speed efficiencies for graph expansion and identifier mapping.

How does ARAX work?

When the ARAX server is queried by the Autonomous Relay System or by another application, four things happen in sequence:

  1. From the query data structure that is provided to ARAX in accordance with the Reasoners Standard API, ARAX extracts a series of ARAXi commands, a query graph, or a natural-language question that has been interpreted by ARAX to be of a specific question type.

  2. ARAX chooses—based on the ARAXi commands, the query graph, or the interpreted question—which upstream knowledge providers to query in order to obtain the information required to answer the question

  3. ARAX integrates and processes the information returned from the knowledge providers, in accordance with the query type.

  4. ARAX responds to the question with an answer that complies with the Reasoners Standard API. ARAX's responses to questions or queries typically contain three parts:

    • a recapitulation of the original query (which may involve a restatement of the question or a structured representation of a small query graph of biomedical concepts)
    • a list of results, each of which may be a concept (e.g., "imatinib" or "BCR-Abl tyrosine kinase") or a small graph of concepts and relationships between the concepts
    • a knowledge graph representing the union of concepts in all of the results, along with all known relationships among the concepts in the union.

The Reasoners Standard Application Programming Interface

During the Feasibility Assessment Phase of the Biomedical Data Translator program, a data interchange standard (the Reasoners Standard Application Programming Interface) for communication to/from Translator reasoning tools was developed and ratified by the Translator stakeholders. The Reasoners Standard API is formally defined by a Yet Another Markup Language (YAML) file that is in OpenAPI 3.0.1 format and that is versioned on GitHub. The current version of the Reasoners Standard API is 0.9.1. The GitHub repository for the Reasoners Standard API contains an issue list (the primary forum for documented and archived discussion of issues and proposed changes with the standard), automated regression tests, some presentation slide decks that provide information about the standard, and (of course) the YAML file that defines the standard. Team Expander Agent (and under our previous name during the Feasibility Assessment phase, Team X-ray) is an active participant in the ongoing development process for the Reasoners Standard API.

What knowledge providers does ARAX use?

Currently, ARAX/RTX directly accesses four main knowledge providers in order to handle queries, along with several additional APIs for identifier mapping.

RTX-KG2

RTX-KG2 (GitHub project area is RTXteam/RTX-KG2) is a knowledge graph comprising 7.5M nodes and 34.3M relationships that is built by integrating concepts and concept-predicate-concept triples obtained from:

  1. Unified Medical Language System (UMLS; including SNOMED CT)
  2. NCBI Genes
  3. Ensembl Genes
  4. UniChem
  5. Semantic Medline Database (SemMedDB)

RTX-KG2 complies with the Biomedical Data Translator Knowledge Graph object model standard, which is based on the Biolink model. RTX-KG2 is hosted in a Neo4j graph database server and can be accessed at kg2endpoint2.rtx.ai:7474 (username is neo4j; contact Team Expander Agent for the password). Alternatively, a JSON dump of KG2 is available from the RTX-KG2 S3 bucket. A version of KG2 that is formatted and indexed for the knowledge graph exploration tool mediKanren is available; contact Team Expander Agent for details. For extensive technical documentation on RTX-KG2, see this repository's KG2 subdirectory.

Columbia Open Health Data (COHD)

ARAX accesses the Columbia Open Health Data (COHD) resource (provided by the Red Team and the Tatonetti Lab from the NCATS Translator Feasibility Assessment Phase) for overlaying clinical health record co-occurrence significance information for biomedical concepts in a knowledge graph, via the overlay feature (for more information, see code/ARAX). ARAX accesses COHD via a web API.

PubMed

ARAX accesses PubMed for overlaying biomedical literature abstract co-occurrence significance information for biomedical concepts in a knowledge graph, via the overlay feature (for more information, see code/ARAX). For overlaying literature co-occurrence information, ARAX uses a pre-indexed version of PubMed (indexed for Medical Subject Heading or MeSH terms). For any concepts that cannot be mapped to MeSH, ARAX queries PubMed via a web API.

Identifier mapping

RTX's reasoning code uses several different web services for on-the-fly mapping between certain identifier types:

  1. Ontology Lookup Service
  2. MyChem.info
  3. Disease Ontology
  4. PubChem
  5. NCBI eUtils
  6. Human Metabolome Database

A computable file enumerating and summarizing the external APIs that are used by ARAX/RTX, in YAML format, can be found here.

Team Expander Agent: who we are

Our team includes investigators from Oregon State University, the Pennsylvania State University, Institute for Systems Biology, and Radboud University in the Netherlands.

Principal investigators

Name Role Email GitHub username Areas of relevant expertise
Stephen Ramsey OSU [email protected] saramsey compbio, systems biology
David Koslicki PSU [email protected] dkoslicki compbio, graph algorithms
Eric Deutsch ISB [email protected] edeutsch bioinformatics, data management, standards development

Team members

Name Affiliation Email GitHub username Areas of relevant expertise
Jared Roach ISB [email protected] genomics, genetics, medicine, systems biology
Luis Mendoza ISB [email protected] isbluis software engineering, proteomics, systems biology
Finn Womack OSU [email protected] finnagin drug repositioning, Neo4j
Amy Glen OSU [email protected] amykglen knowledge graphs
Arun Muluka PSU [email protected] aruntejam1 knowledge graphs
Chunyu Ma PSU [email protected] chunyuma programmer/analyst
Sundareswar Pullela OSU [email protected] sundareswarpullela programmer, knowledge graphs

For our work on the Translator program, we also extensively collaborate and cooperate with investigators at Oregon Health & Science University, Lawrence Berkeley National Laboratory, University of North Carolina Chapel Hill, and the University of Alabama Birmingham.

What is RTX? How does it differ from ARAX?

During the Translator program's feasibility assessment phase (2017-2019), our team—under the name "X-ray" that was assigned in accordance with the feasibility assessment's team-naming scheme based on the electromagnetic spectrum—built and released a prototype biomedical reasoning tool called RTX, which is why this software repository is called RTX. RTX's capabilities center around answering questions from a list of natural-language question templates (e.g., what proteins does acetaminophen target? or what drugs target proteins associated with the glycolysis pathway?) and around graphical construction of a "query graph" that is used as a template for finding subgraphs of the RTX-KG1 knowledge graph. The design for the ARAX software system relates to RTX in four ways:

  1. ARAX builds on the code-base for RTX and leverages the already-built user interface and knowledge graphs for RTX (RTX-KG1 and RTX-KG2).
  2. Through the expressive (but user-friendly) domain-specific language ARAXi, ARAX exposes RTX's graph exploration and analysis capabilities so that they can be used (in combination or individually) by Translator tools or workflows in accordance with the Translator standard for inter-tool communication (Reasoners Standard API).
  3. ARAX adds new and powerful graph exploration and analysis capabilities, such as expand, overlay and filter, that make ARAX significantly more flexible (in terms of the types of graph exploration/analysis workflows that it can implement) than RTX.
  4. RTX could produce results in the Reasoners Standard API format. However, its more extensive reasoning capabilities could not be queried in this API format without specialized knowledge of the RTX system. In contrast, ARAX can now perform its complex reasoning capabilities upon receiving any input Reasoners Standard API, while still producing such a standardized output format. As such, ARAX and its reasoning capabilities will be accessible to any automated reasoning agent, automated reasoning system, or knowledge provider capable of sending a Reasoners Standard API message.

Organization of the ARAX/RTX software repository

ARAX and RTX are mostly written in the Python programming language and a small amount of JavaScript and bash shell. Yet Another Markup Language (YAML) and JavaScript Object Notation (JSON) are extensively used for configuration files. Many examples of analysis workflow code that access RTX and/or ARAX are provided in Jupyter notebook format, in several places in the code-base.

subdirectory code

All software code files for ARAX and RTX are stored under this directory link.

subdirectory code/ARAX

Contains the core software code for ARAX link.

subdirectory code/ARAX/Examples

Contains example Jupyter notebooks for using ARAX from software link

subdirectory code/UI/OpenAPI

Contains (1) the YAML code that defines the Reasoners Standard API and (2) the code for the Reasoners Standard API python object model that is used to describe a knowledge graph, query nodes, and results (link).

subdirectory code/UI/Feedback

Contains the code for the server-side logging system for the RTX web browser-based user interface (link).

subdirectory code/UI/interactive

Contains the code for the RTX web browser-based user interface (link).

subdirectory code/kg2

Contains the code for building the RTX second-generation knowledge graph (RTX-KG2) and hosting it in Neo4j (link).

subdirectory code/kg2/mediKanren

Contains the code for exporting a version of the RTX-KG2 knowledge graph that is formatted and indexed for use with the mediKanren knowledge graph exploration tool (link).

subdirectory code/reasoningtool/kg-construction

Contains the code for building the RTX first-generation knowledge graph (RTX-KG1) (link).

subdirectory code/reasoningtool/SemMedDB

Contains the code for a python interface to an instance of the Semantic Medline Database (SemMedDB) that is being hosted in a MySQL database (link).

subdirectory code/reasoningtool/QuestionAnswering

Contains the code for parsing and answering questions posed to the RTX reasoning tool (link).

subdirectory code/reasoningtool/MLDrugRepurposing

Contains the code that is used for the machine-learning model for drug repositioning that was described in the article Leveraging distributed biomedical knowledge sources to discover novel uses for known drugs article by Finn Womack, Jason McClelland, and David Koslicki (link).

subdirectory code/autocomplete

Contains the code for the concept autocomplete feature in the RTX web browser-based user interface (link).

subdirectory data

Text data files for the RTX system that are deployed using git are stored under this subdirectory. There are only a few such files because the RTX software obtains most of the information that makes up the RTX-KG1 knowledge graph by querying external knowledge providers via web APIs, rather than by loading flat files (link).

Key repository branches

The most up-to-date branch of the RTX repository (including the latest code for the ARAX system) is demo. The master branch contains the most stable recent release of RTX.

License

ARAX and RTX are furnished under the MIT open-source software license; see the LICENSE file for details. For the copyright on the code in the code/NLPCode subdirectory, see the LICENSE file in that subdirectory.

Disclaimer

Per the MIT license, the ARAX and RTX software are provided "as-is" without warranty of any kind. The content of this site and the RTX and ARAX software is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Credits

Many people contributed to the development of ARAX and RTX. A list of code contributors can be found under the contributors tab for this repository, in addition to the current team members listed above. In addition to the code contributors, we gratefully acknowledge technical assistance, contributions, and helpful feedback from NCATS staff (Christine Colvis, Noel Southall, Mark Williams, Trung Nguyen, Tyler Beck, Sarah Stemann, Debbi Adelakun, Dena Procaccini, and Tyler Peryea), and Will Byrd, Greg Rosenblatt, Michael Patton, Chunlei Wu, Kevin Xin, Tom Conlin, Harold Solbrig, Matt Brush, Karamarie Fecho, Julie McMurray, Kent Shefchek, Chris Bizon, Steve Cox, Deepak Unni, Tim Putman, Patrick Wang, Sui Huang, Theo Knijnenburg, Gwênlyn Glusman, John Earls, Andrew Su, Chris Mungall, Marcin Joachimiak, Michel Dumontier, Richard Bruskiewich, and Melissa Haendel. Support for the development of RTX was provided by NCATS through the Translator program award OT2TR002520. Support for the development of ARAX was provided by NCATS through the Translator program award OT2TR003428.

Installation and dependencies

ARAX is designed to be installed on an Amazon Web Services Elastic Compute Cloud (EC2) instance with the following minimum requirements (we use a m5a.4xlarge instance):

  • 16 vCPUs
  • 64 GiB of RAM
  • 1,023 GiB of elastic block storage
  • host OS Ubuntu v18.04.

The host OS has nginx v1.14.0 installed and configured (see notes/ARAX/rtx-host-os-nginx-config for configuration details) for SSL/TLS termination and proxying of HTTP traffic to localhost:8080. The SSL site certificate was generated using Letsencrypt (certbot v0.27.0). ARAX and all of its database dependencies run inside a Docker container (Docker v19.03.5) that is configured to map TCP ports as follows (host-port:container-port):

  • 7473:7473
  • 7474:7474
  • 7687:7687
  • 8080:80

(for the specific Docker run command, see notes/ARAX/arax-run-container-nodes.md). Within the Docker container, ARAX uses

  • Ubuntu v16.04
  • Apache v2.4.18
  • python v3.7.3
  • Neo4j v3.2.6 (see code/reasoningtool/kg-construction on how to set up Neo4j for running ARAX/RTX)
  • OpenJDK v1.8.0_131
  • mysql v5.7.19-0ubuntu0.16.04.1

The python package requirements for ARAX are described in the top-level requirements.txt file. RTX makes extensive use of internal caching via SQLite v3.11.0.

Contact us

The best way to contact Team Expander Agent is by

See also the contact information for the Team Expander Agent PIs above.

Try out ARAX/RTX...

...in your web browser

Here is the link to access the web browser interface to RTX: arax.rtx.ai

...using our web API

Here is the link to documentation on the web API interface to RTX: arax.rtx.ai/api/arax/v1.2/ui/.

...by customizing a Jupyter Notebook

Three Jupyter notebooks that demonstrate how to programmatically use ARAX are provided here.

Links

General links for the Biomedical Data Translator project

ARAX- and RTX-specific links