
Chapter 5: Building bioinformatics infrastructure


Step 2: Setting up compute infrastructure

If the public health laboratory decides to hire a bioinformatician, it should have in place, or be ready to obtain, compute infrastructure, which includes i) computer(s) with enough power, ii) a network with appropriate speeds, iii) adequate storage space for data, and iv) a data policy. If the bioinformatician will need IT support (Table 1), the laboratory should have the necessary contacts in place.

Need to Obtain | Purpose(s)
Computer(s)
  • Run bioinformatics analyses
Network with appropriate speeds
  • Download/upload data from public databases (NCBI) and collaborators
  • Move data between sequencers and compute engine
Data policy
  • Governs the safety and security of data by determining:
    • How sequencing data and metadata are separated within the state
    • What sequences or metadata are uploaded to public databases such as NCBI, or used in research
    • Material transfer agreements (MTAs) or sequencing transfers with external partners
Personnel
  • Depending on the skills the bioinformatician has (outlined in Table 1), s/he may need to work with laboratory staff, epidemiologists, and IT.

Table 2. General setup requirements for in-house bioinformatics analyses.

A bioinformatician will need compute solutions for analyses and storage. A minimal computer will ideally have a 64-bit Linux operating system (OS) with at least 16 GB of RAM (Oakeson, Wagner, Mendenhall, Rohrwasser, & Atkinson-Dunn, 2017). Danny Park (of the Sabeti Lab) begins exploratory analyses on Google Cloud with the n1-standard-2 machine type, which consists of 2 cores and 7.5 GB of RAM (personal communication). Generally, the speed and power of the computer depend on:

  1. Random access memory (RAM)
  2. Number of central processing units (CPUs)
     a. Processor speed and memory cache of each CPU
     b. Note on terminology: Sometimes there is confusion between the terms “socket”, “CPU”, “core”, and “thread”. A computer can have multiple sockets, each of which holds a CPU. A single CPU may have multiple cores, and a single core may run multiple threads (if hyperthreading is enabled).
  3. System bus
     a. Determines how quickly the CPU can communicate with other computer components
  4. Hard drive
     a. The larger and less fragmented the drive, the faster the computer will be

The information above is summarized from the article “What determines the speed and power of a PC?”. As an example, the CLC Genomics Workbench recommends 4 GB of RAM and an Intel or AMD CPU. However, there are special requirements for read mapping (8 GB of RAM is recommended for mapping human reads) and different requirements for de novo assembly. Specifically for E. coli, the CLC white paper states that peak memory usage was 129 MB, but 8 cores/CPUs were used.
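As a quick sanity check against rough minimums like those above, the RAM and core count of a Linux machine can be queried directly. Below is a minimal Python sketch; the 16 GB and 4-core thresholds are illustrative assumptions rather than requirements from any particular tool.

```python
# Minimal sketch: check a Linux machine against illustrative minimum specs.
import os

def read_total_ram_gb():
    """Read total RAM from /proc/meminfo (Linux only) and return it in GB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                kb = int(line.split()[1])
                return kb / (1024 ** 2)
    raise RuntimeError("MemTotal not found in /proc/meminfo")

MIN_RAM_GB = 16   # minimum suggested above for a bioinformatics workstation
MIN_CORES = 4     # illustrative assumption, not a hard requirement

ram_gb = read_total_ram_gb()
cores = os.cpu_count()  # logical cores (threads), which may exceed physical cores

print(f"RAM: {ram_gb:.1f} GB (want >= {MIN_RAM_GB} GB): {'OK' if ram_gb >= MIN_RAM_GB else 'LOW'}")
print(f"CPU cores: {cores} (want >= {MIN_CORES}): {'OK' if cores >= MIN_CORES else 'LOW'}")
```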

As illustrated above, the technical specifications of the computer needed will depend on i) the quantity of data and ii) the type of analyses. Assembly of sequencing reads into genomes is much more computationally intensive than alignment of reads to genomes. Storage will depend both on the rate at which data is generated and on the number of databases stored. Both the compute engine and storage can be either bought or rented: the former involves 1) buying the physical machine and setting it up in-house, while the latter may involve 2) purchasing cloud services or 3) collaborating with a university or institution and using their compute cluster (see section “Three approaches for obtaining compute infrastructure”). In either case, the PHL will have to write a data policy to determine what data is allowed on the compute machine (whether it is internal or external), as well as how sequencing data and metadata are separated.
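To make the storage question concrete, a back-of-the-envelope estimate can be built from the sequencing throughput and the retention period set in the data policy. The Python sketch below uses purely illustrative assumptions (roughly 1 GB of sequence data per bacterial isolate and 100 GB of reference databases); real figures will vary by instrument, organism, and policy.

```python
# Back-of-the-envelope storage estimate; every number below is an assumption.
isolates_per_week = 20    # assumed sequencing throughput
gb_per_isolate = 1.0      # assumed raw reads + assembly size per isolate
retention_years = 3       # assumed retention period from the data policy
database_gb = 100         # assumed space for reference databases (e.g., from NCBI)

sequence_data_gb = isolates_per_week * 52 * retention_years * gb_per_isolate
total_gb = sequence_data_gb + database_gb

print(f"Sequence data: {sequence_data_gb:.0f} GB over {retention_years} years")
print(f"Total storage needed: {total_gb / 1024:.1f} TB (including {database_gb} GB of databases)")
```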

Network speeds should be at minimum 10 Mbps, but preferably 100 Mbps to 1 Gbps (BioNumerics 7.6 Training, July 2018). Increased speeds help with i) transferring sequence data from the sequencer to the compute engine, and ii) transferring data from the World Wide Web onto the compute engine. The former matters regardless of where the compute engine is; the latter matters much less if the compute engine is in the cloud.
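To see why these speeds matter, it helps to estimate how long one run's worth of data would take to move at each speed. The sketch below assumes a 15 GB run, which is an illustrative figure rather than a fixed output size for any particular sequencer.

```python
# Estimate transfer time for one sequencing run at the network speeds quoted above.
run_size_gb = 15  # assumed size of a single run's output (illustrative)

for speed_mbps in (10, 100, 1000):
    # 1 GB is roughly 8,000 megabits; dividing by link speed (Mbps) gives seconds.
    seconds = run_size_gb * 8000 / speed_mbps
    print(f"{speed_mbps:>4} Mbps: {seconds / 3600:.2f} hours to move a {run_size_gb} GB run")
```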

Three approaches for obtaining compute infrastructure

So far, most states have taken one of three approaches:

  1. Purchasing in-house computer and network hardware
  2. Utilizing cloud services
  3. Collaborating with another institution, such as a university or institute

Each has its own pros and cons, as listed below:

1. In-House Compute

General process: An in-house compute system should include the compute engines (for analyses), storage (for data), and the network that connects all the components (to the MiSeq and the World Wide Web). The state will need to purchase and maintain these parts.

Pros
  • Complete control over security and data policies, as well as machine maintenance

Cons
  • Will need IT staff for system and network administration to maintain the compute infrastructure
  • May need to update compute or storage as more data is generated

Life of system: Depends on what is purchased

Costs
  • Personnel
    • System administrator
    • Network administrator
    • Bioinformatician
  • Equipment
    • Server (computers)
    • Tape backup + tapes
    • Network (if a new one is needed)
      • Firewall
      • Switches
      • Cables
  • Internet
  • Storage

2. Cloud Services

General process: Utilizing cloud services involves creating a service contract between the company (Google, Amazon, Microsoft) and the state. In exchange, the state has access to servers run by the company.

Pros
  • Access to “unlimited” compute resources as long as the state is willing or able to pay for them
  • Compute and network maintenance is outsourced to the company
  • Easy to deploy or share compute environments with collaborators

Cons
  • Difficult to convince state personnel and/or IT to approve of cloud services
  • May be more expensive with heavy usage, since costs are calculated by machine power and time (see the cost sketch after this comparison)

Life of system: Likely no limit (as long as the company continues to exist)

Costs
  • Personnel
    • Bioinformatician
  • Equipment
    • Standard computer
    • Compute time (on the cloud)
    • Network (if a new one is needed)
    • Storage (in-house, if not on the cloud)
  • Software
    • API for cloud services
    • SSH client

3. Collaborating with Another Institution

General process: Collaboration also involves creating a contract between the institution and the state. In exchange, the state can use the institution’s compute cluster.

Pros
  • Access to a compute cluster (rather than a single machine)
  • Compute and network maintenance is outsourced to the institution

Cons
  • Security and data policies are dictated by the institution
  • Need to determine the life of the compute cluster

Life of system: Depends on the institution or university (e.g., the Broad has begun deprecating their compute cluster, while Harvard has maintained theirs since 2007)

Costs
  • Personnel
    • Bioinformatician
    • A contact at the institution
  • Equipment
    • Fee for using the compute cluster
    • Network (if a new one is needed)
    • Storage (in-house or with the cluster)
  • Software
    • VPN
    • SSH client
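To illustrate the usage-based pricing trade-off noted in the cons for cloud services, the sketch below compares an assumed hourly cloud rate against an amortized one-time hardware purchase. Every number is a placeholder assumption rather than a quote from any provider or vendor.

```python
# Rough cost comparison; all figures below are illustrative assumptions.
cloud_rate_per_hour = 0.50    # assumed hourly rate for a higher-memory cloud instance
hours_per_analysis = 6        # assumed compute time per isolate analysis
analyses_per_year = 1500      # assumed annual workload

in_house_hardware = 10000     # assumed one-time server purchase
in_house_lifetime_years = 5   # assumed hardware refresh cycle

cloud_per_year = cloud_rate_per_hour * hours_per_analysis * analyses_per_year
in_house_per_year = in_house_hardware / in_house_lifetime_years

print(f"Cloud (usage-based):  ${cloud_per_year:,.0f} per year")
print(f"In-house (amortized): ${in_house_per_year:,.0f} per year for hardware alone,")
print("excluding the IT staff, storage, and network costs listed above")
```

With heavy usage the cloud total can exceed the amortized hardware cost, but the in-house figure omits the personnel and maintenance costs that often dominate in practice, so the comparison should be rerun with a laboratory's own numbers.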

There are different challenges to buying versus renting compute solutions. Buying the compute hardware will give the PHL complete control over their system. However, the physical components (for both computers and networks) may need to be upgraded every few years. As an example, the Broad Institute maintained their own compute cluster for many years, but finally ran out of both power and space for the machines and data storage. Their solution was to move to and pay for cloud services (personal communication with Danny Park). Furthermore, it should be noted that even though the laboratory is using the compute machines, it will be state IT that “owns” them. Maintaining the machines and the associated network(s) will require significant support from IT, and IT will be essential to setting up the computers, network, and security policies unless the state chooses to outsource to another IT team.

In contrast, the most difficult part of “renting” compute hardware is drawing up service contracts between the state and the institution. For example, the Minnesota State Public Health Laboratory utilizes the compute cluster at the Minnesota Supercomputing Institute (MSI); the contract took about 6 months to negotiate. The Colorado Department of Public Health and Environment (CDPHE) utilizes Google Compute Engine, since the state had a pre-existing contract with Google.

The most general cloud options include Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). In addition, multiple other services build on these cloud options to provide more specialized offerings, including BaseSpace and DNAnexus. The advantage of the latter is that they may already have contracts with AWS, Microsoft, or Google to ensure that their cloud environment is HIPAA or Code of Federal Regulations (CFR) compliant.