# Chapter 5: Building bioinformatics infrastructure
If the public health laboratory (PHL) decides to hire a bioinformatician, it should have in place, or be ready to obtain, compute infrastructure: i) computer(s) with enough power, ii) a network with appropriate speeds, iii) adequate storage space for data, and iv) a data policy. If the bioinformatician needs IT support (Table 1), the laboratory should have the necessary contacts in place.
| Need to Obtain | Purpose(s) |
|---|---|
| Computer(s) | |
| Network with appropriate speeds | |
| Data policy | |
| Personnel | |

Table 2. General set-up requirements for in-house bioinformatics analyses.
A bioinformatician will need compute solutions for analyses and storage. A minimal computer will ideally have a 64-bit Linux operating system (OS) with at least 16 GB of RAM (Oakeson, Wagner, Mendenhall, Rohrwasser, & Atkinson-Dunn, 2017). Danny Park (of the Sabeti Lab) begins exploratory analyses on Google Cloud with the n1-standard-2 machine type, which has 2 cores and 7.5 GB of RAM (personal communication). Generally, the speed and power of a computer depend on:
- Random access memory (RAM)
- Number of central processing units (CPUs)
  - Processor speed and memory cache of each CPU
  - Note on terminology: the terms “socket”, “CPU”, “core”, and “thread” are sometimes confused. A computer can have multiple sockets, each of which holds a CPU. A single CPU may have multiple cores, and a single core may run multiple threads (if hyperthreading is enabled).
- System bus
  - Determines how quickly the CPU can communicate with other computer components
- Hard drive
  - The larger and less cluttered the drive, the faster the computer will be
The information above is summarized from the article “What determines the speed and power of a PC?” As an example, the CLC Genomics Workbench recommends 4 GB of RAM with an Intel or AMD CPU. However, read mapping has special requirements, with 8 GB of RAM recommended for human reads, and de novo assembly has different requirements again. For E. coli specifically, the white paper states that peak memory use was 129 MB, but 8 cores/CPUs were used.
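To make the guideline above concrete, here is a minimal sketch of how a laboratory might check a Linux machine against the ~16 GB RAM recommendation cited earlier. The function names, the 2-core default, and the `/proc/meminfo` parsing are illustrative assumptions, not a standard tool.

```python
# Sketch: check a Linux machine against the ~16 GB RAM guideline above.
# The 16 GB threshold comes from the text; names and defaults are
# illustrative assumptions.
import os

def meets_minimum(ram_gb: float, cores: int,
                  min_ram_gb: float = 16.0, min_cores: int = 2) -> bool:
    """True if the machine meets the suggested RAM and core minimums."""
    return ram_gb >= min_ram_gb and cores >= min_cores

def linux_specs():
    """Read total RAM (in GiB) from /proc/meminfo and count CPU threads."""
    with open("/proc/meminfo") as fh:
        for line in fh:
            if line.startswith("MemTotal:"):
                kib = int(line.split()[1])   # /proc/meminfo reports KiB
                return kib / 1024 ** 2, os.cpu_count() or 1
    raise RuntimeError("MemTotal not found in /proc/meminfo")

# A 16 GB / 8-thread workstation passes; an 8 GB laptop does not:
print(meets_minimum(16, 8))   # True
print(meets_minimum(8, 4))    # False
```

On the actual machine, `linux_specs()` can supply the two arguments directly.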
As illustrated above, the technical specifications of the computer needed will depend on the i) quantity of data and ii) type of analyses. Assembly of sequencing reads into genomes is much more computationally intensive than alignment of reads to genomes. Storage needs will depend both on the rate at which data are generated and on the number of databases stored. Both the compute engine and storage can either be bought or rented: the former involves 1) buying the physical machine and setting it up in-house, while the latter may involve 2) purchasing cloud services or 3) collaborating with a university or institution and using their compute cluster (see section, “Three approaches for obtaining compute infrastructure”). In either case, the PHL will have to write a data policy that determines what data are allowed on the compute machine (whether internal or external) and how sequencing data and metadata are kept separate.
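A back-of-the-envelope calculation can turn the data-generation rate into a storage budget. The per-run size, run frequency, and database size below are hypothetical placeholders; actual values vary by instrument and workload.

```python
# Rough yearly storage estimate: raw sequencing output plus fixed
# reference databases. All example numbers are hypothetical.
def yearly_storage_gb(runs_per_week: int, gb_per_run: float,
                      database_gb: float = 0.0) -> float:
    """GB of storage needed per year of sequencing, plus databases."""
    return runs_per_week * 52 * gb_per_run + database_gb

# e.g. one ~15 GB run per week, plus ~100 GB of reference databases:
print(yearly_storage_gb(1, 15, database_gb=100))  # 880.0
```

Note this counts only raw output; intermediate analysis files and backups can multiply the total severalfold.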
Network speeds should be at minimum 10 Mbps, but preferably 100 Mbps to 1 Gbps (BioNumerics 7.6 Training, July 2018). Increased speeds help with i) transferring sequence data from the sequencer to the compute engine, and ii) transferring data from the Internet onto the compute engine. The former matters regardless of where the compute engine is; the latter matters much less if the compute engine is in the cloud.
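The practical impact of these speeds is easy to estimate. The sketch below assumes sustained throughput equal to the nominal link speed (real transfers are slower), and the 15 GB run size is a hypothetical example.

```python
# Estimate transfer time for the network speeds quoted above.
# Link speeds are in megabits/s; file sizes are in (decimal) gigabytes.
def transfer_hours(size_gb: float, speed_mbps: float) -> float:
    """Hours to move size_gb gigabytes over a speed_mbps link."""
    bits = size_gb * 8 * 1000 ** 3            # GB -> bits
    return bits / (speed_mbps * 1_000_000) / 3600

# A hypothetical 15 GB sequencing run:
print(round(transfer_hours(15, 10), 2))    # 3.33 hours at 10 Mbps
print(round(transfer_hours(15, 1000), 3))  # 0.033 hours (~2 min) at 1 Gbps
```

The hundredfold difference between the minimum and preferred speeds is the difference between an overnight transfer and a coffee break.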
So far, most states have taken one of three approaches:
- Purchasing in-house computer and network hardware
- Utilizing cloud services
- Collaborating with another institution, such as a university or institute
Each has its own pros and cons, as listed below:
| | 1. In-House Compute | 2. Cloud Services | 3. Collaborating with another institution |
|---|---|---|---|
| General Process | An in-house compute system should include the compute engines (for analyses), storage (for data), and the network that connects all the components (to the MiSeq and World Wide Web). The state will need to purchase and maintain these parts. | Utilizing cloud services involves creating a service contract between the company (Google, Amazon, Microsoft) and the state. In exchange, the state has access to servers run by the company. | Collaboration also involves creating a contract between the institution and state. In exchange, the state can use the institution’s compute cluster. |
| Pros | | | |
| Cons | | | |
| Life of System | Depends on what is purchased | Likely no limit for life of system (as long as the companies continue to exist) | Depends on the institution or university (e.g., Broad has begun deprecating their compute cluster, while Harvard has maintained theirs since 2007). |
| Costs | | | |
There are different challenges to buying versus renting compute solutions. Buying the compute hardware gives the PHL complete control over its system; however, the physical components (for both computers and networks) may need to be upgraded every few years. As an example, the Broad Institute maintained its own compute cluster for many years but eventually ran out of both power and space for the machines and data storage; its solution was to move to, and pay for, cloud services (personal communication with Danny Park). Furthermore, it should be noted that even though the laboratory uses the compute machines, it is state IT that “owns” them. Maintaining the machines and associated network(s) will require significant support from IT, and IT will be essential to setting up the computers, network, and security policies unless the state chooses to outsource to another IT team.
In contrast, the most difficult part of “renting” compute hardware is drawing up service contracts between the state and the institution. For example, the Minnesota State Public Health Laboratory uses the compute cluster at the Minnesota Supercomputing Institute (MSI); the contract took about six months to negotiate. The Colorado Department of Public Health and Environment (CDPHE) uses Google Compute Engine, since the state had a pre-existing contract with Google.
The most general cloud options include Amazon Web Services (AWS), Microsoft Azure, and Google Cloud. In addition, multiple other services build on these clouds to provide more specialized offerings, including BaseSpace and DNAnexus. The advantage of the latter is that they may already have contracts with AWS, Microsoft, or Google ensuring that their cloud environment is HIPAA- or Code of Federal Regulations (CFR)-compliant.