awosome-bioinformatics

Abstract: A curated list of resources for learning bioinformatics. Some of the resources in this repository were collected by the BioInstaller project. You can use BioInstaller to download the source code or database files directly, or fetch the meta information with BioInstaller::get.meta()$item.

Purpose:

Provide bioinformatics learning resources for beginners
Provide an overview of bioinformatics

Field:

Next generation sequencing (NGS)
Bioinformatics Data Analysis

Table of contents

Resources
Skills
Organization
Institute
People
Blog
Contributors

Resources

General

Wikipedia
Org

Journal

Bioinformatics

Bioinformatics
BMC Bioinformatics
Nucleic Acids Research
bioRxiv Bioinformatics
Current Bioinformatics
Advances in Bioinformatics
Briefings in Bioinformatics
Current Protocols in Bioinformatics
Journal of Bioinformatics and Computational Biology
Evolutionary Bioinformatics
Bioinformatics and Biology Insights
Advances and Applications in Bioinformatics and Chemistry
Genomics, Proteomics and Bioinformatics
PLOS Computational Biology

Genomics

Genomics
Human Genomics
Current Genomics
Genome Research
Nature Genetics
Nature Methods
BMC Genomics
Marine Genomics
BMC Medical Genomics
Briefings in Functional Genomics
Cancer Genomics & Proteomics

Proteomics

Journal of Proteomics
Molecular & Cellular Proteomics
Clinical Proteomics
Expert Review of Proteomics

Transcriptomics

Transcription

Metabolomics

Metabolomics

Epigenomics

Epigenomics

Sequencing Technology

This section is mainly copied from Enseqlopedia.

Thanks to this work: Hadfield, J. & Retief, J. A profusion of confusion in NGS methods naming. Nat Methods 15, 7-8 (2018).

RNA Sequencing Methods

Low-Level RNA Detection

CEL-Seq
CirSeq
CLaP
CytoSeq
Digital RNA Sequencing
DP-Seq
Drop-Seq
Hi-SCL
InDrop
MARS-Seq
Nuc-Seq
PAIR
Quartz-Seq
scM&T-Seq
SCRB-Seq
scRNA-Seq
scTrio-seq
Smart-Seq
Smart-Seq2
snRNA-Seq
STRT-Seq
SUPeR-Seq
TCR-LA-MC PCR
TIVA
UMI
5C
Div-Seq
FRISCR
TCR Chain Pairing
AbPair

RNA Modifications

ICE
MeRIP-Seq
miCLIP-m6A
Pseudo-Seq
PSI-Seq

RNA Structure

CAP-seq
Cap-Seq
CIP-TAP
PARS-Seq
SPARE
Structure-Seq/DMS-Seq
CIRS-Seq
icSHAPE
SHAPE-MaP
SHAPE-Seq

RNA Transcription

2P-Seq
3'NT Method
3P-Seq
3Seq
3′-Seq
5′-GRO-Seq
BruChase-Seq
BruDRB-Seq
Bru-Seq
CAGE
CHART
ChIRP
ClickSeq
GRO-seq
NET-Seq
PAL-Seq
PARE-Seq
PEAT
PRO-Cap
PRO-Seq
RAP
RARseq
RASL-Seq
RNA-Seq
SMORE-Seq
TAIL-Seq
TATL-Seq
TIF-Seq
TL-Seq
4sUDRB-Seq
CaptureSeq
cP-RNA-Seq
FRT-Seq
GMUCT
mNET-Seq

RNA-Protein Interactions

AGO-CLIP
CLASH
CLIP-Seq or HITS-CLIP
DLAF
eCLIP
hiCLIP
iCLIP
miR-CLIP
miTRAP
PAR-CLIP
PIP-Seq
Pol II CLIP
RBNS
Ribo-Seq or ARTSeq
RIP-Seq
TRAP-Seq
TRIBE
BrdU-CLIP
HiTS-RAP
irCLIP

DNA Sequencing Methods

Protein-Protein Interaction

PD-Seq
ProP-PD/PDZ-Seq

Sequence Rearrangements

2b-RAD
CPT-seq
ddRADseq
Digenome-seq
EC-seq
hyRAD
RAD-Seq
Rapture
RC-Seq
Repli-Seq
SLAF-seq
TC-Seq
Tn-Seq/INSeq
Bubble-Seq
NSCR
NS-Seq
Rep-Seq/Ig-Seq/MAF

DNA Break Mapping

BLESS
DSB-Seq
GUIDE-seq
HTGTS
LAM-HTGTS
Break-seq
SSB-Seq

DNA Protein Interactions

DNaseI Seq or DNase-Seq
Pu-seq
3-C/Capture-C/Hi-C
4C-seq
5C
ATAC-Seq/Fast-ATAC
CATCH_IT
Chem-seq
ChIA-PET
ChIPmentation
ChIP-Seq/HT-ChIP/ChIP-exo/Mint-ChIP
DamID
DNase I SIM
FAIRE-seq/Sono-Seq
FiT-Seq
HiTS-FLIP
MINCE-seq
MNase-Seq/MAINE-Seq/Nucleo-Seq/Nuc-seq
MPE-seq
NG Capture-C
NOMe-Seq
ORGANIC
PAT-ChIP
PB_seq
SELEX or SELEX-seq / HT-SELEX
THS-seq
UMI-4C
X-ChIP-seq

Epigenetics

Aba-seq
BisChIP-Seq/ChIP-BS-Seq/ChIP-BMS
BSAS
BSPP
BS-Seq/Bisulfite-Seq/WGBS
CAB-Seq
EpiRADseq
fCAB-seq
fC-CET
fC-Seal
hMeDIP-seq
JBP1-seq
MAB-seq
MBDCap-seq/MethylCap-Seq/MiGS
MeDIP-Seq/DIP-seq
MIRA
MRE-Seq and Methyl-Seq
xBS-Seq
PBAT
redBS-Seq/caMAB-seq
RRBS-Seq
RRMAB-seq
TAB-Seq
TAmC-Seq
T-WGBS

Low-Level DNA Detection

Safe-SeqS
scAba-seq
scATAC-Seq (Cell index variation)
scATAC-Seq (Microfluidics variation)
scBS-Seq
scM&T-Seq
scRC-Seq
SMDB
smMIP
G&T-Seq
5C
DR-Seq
MALBAC
MDA
MIDAS/IMS-MDA/ddMDA
Drop-ChIP/scChIP-seq
Duplex-Seq
MIPSTR
nuc-seq/SNES
OS-Seq

Tools

Package management

conda
Bioconductor
CRAN
CPAN
PyPi
npm
bower
gradle
ant
maven
Spack

Web Application Development Framework

Galaxy
Bootstrap
Django
Yi

Web-based Service

Hiplot: a simple and user-friendly visualization platform for scientific data.
UCSC
NCBI
- CDD
ExPASy
EMBL-EBI
TCGA
COSMIC
- COSMIC-3D: a comprehensive integration of cancer mutations with protein structure across the human genome and structural proteome, seeking to support the identification and characterization of protein targets for novel drug design in precision oncology
St. Jude PeCan Data Portal
BIG Data Center
DAVID Bioinformatics Resources
cBioPortal
- Oncoprinter
- MutationMapper
Oncotator
QIAGEN Analysis Platform
Wordcloud
Omictools
iCoMut
UniProt
Pfam
SMART
STRING
DiseaseEnhancer
SEECancer
eQTL Browser
Cistrome Project
- Cistrome Data Browser
- Cistrome Cancer
- Chromatin Regulator Cistrome
- TIMER
VarCards
superdrug2
MeDReaders
ECOdrug
rSNPBase3.0
MNDR
MSDD
funcoup
proteinatlas
DGIdb
Drugbank
InterPro
ncbi-biosystems
denovo-db
The Human Phenotype Ontology (HPO)
FANTOM
dbNSFP
regSNP-intron
RADAR
DARNED
REDIportal
LNCediting
EggNOG
MiSTIC
DTMiner
PDBFlex
Cancer3d
Dsysmap
CBS Prediction Servers
wANNOVAR: Public web service of ANNOVAR
Harmonizome: Search for genes or proteins and their functional terms extracted and organized from over a hundred publicly available resources
GDA: a web-based tool that combines the uniquely large NCI60 collection of drug sensitivity data with CCLE and NCI60 gene mutation and expression profiles
CLUE: Unravel biology with the world’s largest perturbation-driven gene expression dataset
CMAP: The Connectivity Map (also known as cmap) is a collection of genome-wide transcriptional expression data from cultured human cells treated with bioactive small molecules and simple pattern-matching algorithms that together enable the discovery of functional connections between drugs, genes and diseases through the transitory feature of common gene-expression changes.
pssmsearch: a web application to discover novel protein motifs (SLiMs, mORFs, miniMotifs) and PTM sites
bammmotif: Bayesian Markov Models (BaMMs), a web server for de-novo motif discovery and regulatory sequence analysis
LOLAweb: a containerized web server for interactive genomic locus overlap enrichment analysis
GeNets: a unified web platform for network-based genomic analyses
HiCExplorer: a web server for reproducible Hi-C data analysis, quality control and visualization
paintomics: a web resource for the pathway analysis and visualization of multi-omics data
kinact: a computational approach for predicting activating missense mutations in protein kinases
VAReporter: VAReporter can provide comprehensive annotation by integrating a wide variety of biomedical databases
SNPnexus: SNPnexus was designed to simplify and assist in the selection of functionally relevant Single Nucleotide Polymorphisms (SNP) for large-scale genotyping studies of multifactorial disorders
Oncoscape: an online open-access data analysis and visualization platform that empowers researchers and clinicians to discover novel patterns and relationships between linked clinical and molecular data
cellmarker: a manually curated resource of cell markers in human and mouse
awesome: a database of SNPs that affect protein post-translational modifications
hmdb: an online database of small molecule metabolites found in the human body, which facilitates human metabolomics research including the identification and characterization of human metabolites using NMR and MS
redoxdb: a curated database of protein oxidative modification
instruct: a database of 3D protein interactome networks with structural resolution
consensuspathdb: integrates interaction networks in Homo sapiens including binary and complex protein-protein, genetic, metabolic, signaling, gene regulatory and drug-target interactions, as well as biochemical pathways
phosphonetworks: a database for experimentally determined kinase-substrate relationships
dbsno: protein S-nitrosylation (SNO) is a reversible post-translational modification (PTM) and involves the covalent attachment of nitric oxide (NO) to the thiol group of cysteine (Cys) residues. Given the increasing number of proteins reported to be regulated by this modification, S-nitrosylation is considered to act, in a manner analogous to phosphorylation, as a pleiotropic regulator that elicits dual effects to regulate diverse pathophysiological processes by altering protein function, stability, and conformation change in various cancers and human disorders
hpdi: Human Protein-DNA Interactome (hPDI)
islandviewer: an integrated interface for computational identification and visualization of genomic islands
appris: a system that deploys a range of computational methods to provide annotations of alternative splice isoforms and identify principal isoforms for vertebrate species
rbpdb: a collection of RNA-binding proteins linked to a curated database of published observations of RNA binding
type2diabetesgenetics: providing data and tools to promote understanding and treatment of type 2 diabetes and its complications
pepquery: a peptide-centric search engine for novel peptide identification and validation
Gene Info eXtension (GIX): a browser extension that allows you to retrieve information about a gene product directly on any webpage simply by double clicking an official gene name, synonym or supported accession.
cancermine: a literature-mined resource for drivers, oncogenes and tumor suppressors in cancer.
gpcrdb: contains data, diagrams and web tools for G protein-coupled receptors (GPCRs). Users can browse all GPCR structures and the largest collections of receptor mutants. Diagrams can be produced and downloaded to illustrate receptor residues (snake-plot and helix box diagrams) and relationships (phylogenetic trees). Reference (structure) structure-based sequence alignments take into account helix bulges and constrictions, display statistics of amino acid conservation and have been assigned generic residue numbering for equivalent residues in different receptors.
FPbase: a free, open-source, web-based, community-editable database for fluorescent proteins (FPs) and their properties.
Image Data Resource (IDR): Image Data Resource (IDR) is a public repository of image datasets from published scientific studies, where the community can submit, search and access high-quality bio-image data.
Allen Brain Atlases and Data: The Allen Institute for Brain Science uses a unique approach to generate data, tools and knowledge for researchers to explore the biological complexity of the mammalian brain. This portal provides access to high quality data and web-based applications created for the benefit of the global research community.
Allen Cell Explorer: a python-based, open-source toolkit that combines classic 3D image segmentation with artificial intelligence to detect cellular structures.
Mitotic Cell Atlas: Provides a comprehensive and quantitative 4D model of the mitotic protein localization network in a dividing human cell. Mitotic Cell Atlas is an integrated experimental and computational framework that provides a standardized yet dynamic spatio-temporal reference system for the mitotic cell. It can be used to integrate quantitative information on any number of protein distributions sampled in thousands of different experiments.
Broad Bioimage Benchmark Collection: a collection of freely downloadable microscopy image sets. In addition to the images themselves, each set includes a description of the biological application and some type of "ground truth" (expected results).
Cell Image Library: a repository for images and movies of cells from a variety of organisms. It demonstrates cellular architecture and functions with high quality images, videos, and animations. This comprehensive and easily accessible Library is designed as a public resource first and foremost for research, and secondarily as a tool for education. The long-term goal is the construction of a library of images that will serve as primary data for research.
Mitocheck: the goal of this resource is to integrate information on cellular functions of human genes while also giving access to supporting information such as microscopy images of phenotypes. Although its primary focus is on the biology of mitosis, the resource also integrates data relevant to many other cellular functions.
ssbd: the Systems Science of Biological Dynamics (SSBD) database provides a rich set of open resources for analyzing quantitative data and microscopy images of biological objects, such as single-molecule, cell, gene expression nuclei, etc. Quantitative biological data and microscopy images are collected from a variety of species, sources and methods. These include data obtained from both experiment and computational simulation.
IMPC: the International Mouse Phenotyping Consortium (IMPC) is an international effort by 19 research institutions to identify the function of every protein-coding gene in the mouse genome. The entire genome of many species has now been published and whole genome sequencing is becoming relatively quick and cheap to complete. Despite these advancements the function of the majority of genes remains unknown.
elixir: ELIXIR unites Europe’s leading life science organisations in managing and safeguarding the increasing volume of data being generated by publicly funded research. It coordinates, integrates and sustains bioinformatics resources across its member states and enables users in academia and industry to access services that are vital for their research.
Global BioImaging Project: the imaging landscape has changed significantly in the last 10 years as the concept of open user access to cutting-edge technologies became valued and well recognized. In Europe, imaging experts from 25 countries joined forces and drew up the vision of a pan-European imaging infrastructure, which gave momentum to the project of founding a Euro-BioImaging European Research Infrastructure Consortium (the EuBI ERIC).

Clinical Annotation

CIViC
DoCM
ClinVar
Intogen
Cancer Hotspots
DisGeNET
Cancer Biomarkers database
OncoKB: Precision Oncology Knowledge Base
LncRNADisease: a resource that curates experimentally supported lncRNA-disease association data and a platform that integrates tool(s) for predicting novel lncRNA-disease associations
fusiongdb: Fusion Gene annotation DataBase, which collects 48,117 fusion genes (FGs) across pan-cancer from three representative fusion gene resources: the improved database of chimeric transcripts and RNA-seq data (ChiTaRS 3.1), an integrative resource for cancer-associated transcript fusions (TumorFusions), and The Cancer Genome Atlas (TCGA) fusions by Gao et al.
sedb: the comprehensive human Super-Enhancer database.
pmkb: the cancer precision medicine knowledge base for structured clinical-grade mutations and interpretations
ewasdb: epigenome-wide association study database
dcdb: the Drug Combination Database (DCDB). Accumulating scientific and clinical evidence suggests that drug combinations are a safe and effective approach to treating complicated and refractory diseases. DCDB is devoted to the research and development of multi-component drugs. The current version of DCDB collects 1363 drug combinations (330 approved and 1033 investigational, including 237 unsuccessful usages), involving 904 individual drugs and 805 targets

Noncoding RNA Related Database

CSCD
AtCircDB
CircNet
circBase
circRNADb
exoRBase
EVLncRNAs
NONCODE: an integrated knowledge database dedicated to non-coding RNAs (excluding tRNAs and rRNAs)
MiTranscriptome: a catalog of human long poly-adenylated RNA transcripts derived from computational analysis of high-throughput RNA sequencing (RNA-Seq) data from over 6,500 samples spanning diverse cancer and tissue types
FANTOM CAT: an atlas of human long non-coding RNAs with accurate 5’ ends
lnc2cancer2: an updated database that provides comprehensive experimentally supported associations between lncRNAs and human cancers
sm2mir: a manually curated database that collects experimentally validated effects of small molecules on miRNA expression in 20 species from published papers. Each entry contains detailed information about small molecules, miRNAs and their relationships, including species, small molecule name, DrugBank accession number, PubChem CID, FDA approval status, miRNA name, miRBase accession number, expression pattern of the miRNA, experimental detection method, tissues or conditions for detection, evidence in the reference, PubMed ID and the publication year of the reference
oncomirdb: a Database for Oncogenic & Tumor-Suppressive MicroRNAs
mircancer: provides a comprehensive collection of microRNA (miRNA) expression profiles in various human cancers, automatically extracted from the published literature in PubMed. It uses text mining techniques for information collection. Manual revision is applied after auto-extraction to provide 100% precision
lncipedia: a public database for long non-coding RNA (lncRNA) sequence and annotation. The current release contains 127,802 transcripts and 56,946 genes
mirnest: an integrative collection of animal, plant and virus microRNA data
mirtarbase: the experimentally validated microRNA-target interactions database
mirdb: an online resource for microRNA target prediction and functional annotations

eQTL Related Database

exsnp
rVarBase
seeQTL
cancersplicingqtl: a database for genome-wide identification of splicing QTLs in human cancer

Sequencing Data Portal

GDC
EGA
dbGaP
DDBJ
GEO
ICGC

Plant-related platforms

Plant Regulomics

Local tools

Quality Control

FastQC
PRINSEQ
SolexaQA
fastx_toolkit
picard
ngsqctoolkit
MultiQC
mosdepth
fastp
ChronQC
cutadapt
trimmomatic
SOAPnuke
sickle

Alignment And Assembly

BWA
STAR
TMAP
NovoAlign
GMAP
bowtie
bowtie2
tophat2
hisat2
Edena
ABySS
SSAHA2
oases
Velvet
Trinity
MapSplice2
RUM
MECAT
DART
rHAT
taxmaps: large DNA/RNA metagenomics samples
MARVEL: consists of a set of tools that facilitate the overlapping, patching, correction and assembly of noisy (not so noisy ones as well) long reads.
vg: tools for working with genome variation graphs
TransLiG: a de novo transcriptome assembler that uses line graph iteration.
stringtie: Transcript assembly and quantification for RNA-Seq

Variant Detection (SNVs, INDELs, SVs)

GATK
MuTect
lofreq
VarScan2
freebayes
TVC
SomaticSniper
speedseq
FusionCatcher
svtoolkit
pindel
breakdancer
delly
CNVkit
GRIDSS
PancanQTL
TumorFusions
SVScore
SVTools
RDDpred
iseq
deepvariant
SV2
facets
MutScan
svaba: structural variation and indel detection by local assembly
manta: structural variant and indel caller using mapped sequencing data
JAFFA: a multi-step pipeline that takes either raw RNA-Seq reads, or pre-assembled transcripts, then searches for gene fusions
Picky: structural variants pipeline for long reads
CREST: an algorithm for detecting genomic structural variations at base-pair resolution using next-generation sequencing data
Control-FREEC: a tool for detection of copy-number changes and allelic imbalances (including LOH) using deep-sequencing data
Strelka: accurate somatic small-variant calling from sequenced tumor–normal sample pairs
GISTIC2: facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers
BreaKmer: A method to identify structural variation from sequencing data in target regions
deTiN: DeTiN is designed to measure tumor-in-normal contamination and improve somatic variant detection sensitivity when using a contaminated matched control.
vadir: an integrated approach to Variant Detection in RNA
CN_Learn: a framework to integrate Copy Number Variant (CNV) predictions made by multiple algorithms using exome sequencing datasets
SVseq2
SoftSV: a tool for the detection of small and large deletions, inversions, tandem duplications and translocations from paired-end sequencing data.
wham: consists of two programs, wham and whamg. wham, the original tool, is a very sensitive method with a high false discovery rate. The second program, whamg, is more accurate and better suited for general structural variant (SV) discovery.

Variant Annotation

ANNOVAR
SnpEff
gemini
VEP
Variant Annotation Integrator
vcfanno
pcgr
annovarR
OncodriveCLUST: exploiting the positional clustering of somatic mutations to identify cancer genes
bystro: Bystro genetic analysis (annotation, filtering, statistics)
contest: a tool (and method) for estimating the amount of cross-sample contamination in next generation sequencing data. Using a Bayesian framework, contamination levels are estimated from array based genotypes and sequencing reads
pathopredictor: Predict pathogenic and benign missense variant status.

Variant Visualization (SNVs, INDELs, SVs)

ProteinPaint
AGFusion
GenomeUPlot
BreakPointSurveyor
chimeraviz
Oncoprinter
MutationMapper
pv: 3D structure visualization in WEB
g2s: mappings between protein sequence positions and PDB 3D protein structure models
NGB: structural Variations (SVs) visualization capabilities, high performance, scalability, and cloud data support

Variant Screen

LARVA
DANN
NCBoost: Classifier of pathogenic non-coding variants in Mendelian diseases

Alternative Splicing

LeafCutter: annotation-free quantification of RNA splicing.
rMATS
MMSplice: variant effect predictions on splicing
pram: predict intergenic transcript models from RNA-seq (Genome Res 2020)
shark: mapping-free filtering of irrelevant RNA-Seq reads (Bioinformatics 202)
PAIRADISE: Paired Replicate Analysis of Allelic Differential Splicing Events (AJHG 2020)
IRFinder: detecting intron retention from RNA-Seq experiments
iread: detect intron retention (IR) events from RNA-seq datasets
DARTS: Deep-learning Augmented RNA-seq analysis of Transcript Splicing
SpliceAI: a deep learning-based tool to identify splice variants
DEXSeq: detecting differential usage of exons from RNA-seq data
MATS
cash: comprehensive alternative splicing hunting
tappas: a comprehensive computational framework for the analysis of the functional impact of differential splicing
dsreg: dSreg is a library to perform joint inference of differential splicing and regulatory mechanisms using RNA-seq data.
PSI-Sigma: a comprehensive splicing detection method for short-read and long-read RNA-seq analysis.
PsiCLASS: simultaneous multi-sample transcript assembler for RNA-seq data
IsoformSwitchAnalyzeR: identify, annotate and visualize alternative splicing and isoform switches with functional consequences from both short- and long-read RNA-seq data.
yanagi: transcript segment library construction for RNA-Seq quantification
AStrap: identification of alternative splicing from transcript sequences without a reference genome
DSC: a deep learning approach for classification of alternative splicing events
CATANA: Comprehensive Alternative Transcripts Atlas based oN Annotation (CATANA) to identify all 10 known AS and AT events.
benchmarkingDiffExprAndSpl: a benchmarking of workflows for detecting differential splicing and differential expression at isoform level in human RNA-seq studies
psichomics: graphical application for alternative splicing quantification and analysis
matt: a Unix toolkit for analyzing genomic sequences with focus on down-stream analysis of alternative splicing events
PathwaySplice: an R package for unbiased splicing pathway analysis

Gene Expression Data Analysis

Cufflinks
DESeq2
edgeR
HTSeq
RSEM: RNA-Seq by Expectation-Maximization, accurate quantification of gene and isoform expression from RNA-Seq data.
sRNAnalyzer
mrnn: an implementation of a Gated Recurrent Unit (GRU) network for classification of transcripts as either coding or noncoding
prada: pipeline for RNA-Sequencing Data Analysis
ballgown: a software package designed to facilitate flexible differential expression analysis of RNA-Seq data. It also provides functions to organize, visualize, and analyze the expression measurements for your transcriptome assembly.
subread: comprises a suite of software programs for processing next-gen sequencing read data, i.e. featureCounts: a software program developed for counting reads to genomic features such as genes, exons, promoters and genomic bins. High-performance read alignment, quantification and mutation discovery.
kallisto: a program for quantifying abundances of transcripts from bulk and single-cell RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads. It is based on the novel idea of pseudoalignment for rapidly determining the compatibility of reads with targets, without the need for alignment.
salmon: a tool for quantifying the expression of transcripts using RNA-seq data. Salmon uses new algorithms (specifically, coupling the concept of quasi-mapping with a two-phase inference procedure) to provide accurate expression estimates very quickly (i.e. wicked-fast) and while using little memory. Salmon performs its inference using an expressive and realistic model of RNA-seq data that takes into account experimental attributes and biases commonly observed in real RNA-seq data.
mixcr: a universal software for fast and accurate extraction of T- and B- cell receptor repertoires from any type of sequencing data. Free for academic use only
trust: Tcr Receptor Utilities for Solid Tissue (TRUST) is a computational tool to analyze TCR and BCR sequences using unselected RNA sequencing data, profiled from solid tissues, including tumors. TRUST performs de novo assembly on the hypervariable complementarity-determining region 3 (CDR3) and reports contigs containing the CDR3 DNA and amino acid sequences. TRUST then realigns the contigs to IMGT reference gene sequences to report the corresponding variable (V) or joining (J) genes.
topconfects: intended for RNA-seq or microarray differential expression analysis and similar, where we are interested in placing confidence bounds on many effect sizes (one per gene) from few samples.
PLIER: Pathway-Level Information Extractor (PLIER): a generative model for gene expression data.

Virus and Microbial Related

viral-ngs
qap
ROP: discovering the source of all RNA-seq reads, including those originating from repeat sequences, recombinant B and T cell receptors, and microbial communities
ViFi: pipeline for identifying viral integration and fusion mRNA reads from NGS data
hgtid: an efficient and sensitive workflow to detect human-viral insertion sites using next-generation sequencing data
MicroPro: software for profiling both known and unknown microbial organisms in metagenomic datasets.
FEAST: a scalable algorithm for quantifying the origins of complex microbial communities.
mcorr: inferring bacterial recombination rates from large-scale sequencing datasets.
VirusFinder2: a new software tool for characterizing intra-host viruses through next generation sequencing (NGS) data.
VirusSeq: an algorithmic tool for detecting known viruses and their integration sites using next-generation sequencing of human cancer tissue.
BatVI: a fast and sensitive method to determine viral integrations.

Single Cell

seurat
SCnorm
dropClust
scran: batch effect adjustment
trendsceek: spatial expression trends in single-cell gene expression data
scRNA-tools: a database of software tools for the analysis of single-cell RNA-seq data.
awesome-single-cell: list of software packages (and the people developing these methods) for single-cell data analysis, including RNA-seq, ATAC-seq, etc.
SAVER: SAVER (Single-cell Analysis Via Expression Recovery) implements a regularized regression prediction and empirical Bayes method to recover the true gene expression profile in noisy and sparse single-cell RNA-seq data.
CellSIUS: an R package enabling the identification and characterization of (rare) cell sub-populations from complex scRNA-seq datasets: it takes as input expression values of N cells grouped into M(>1) clusters. Within each cluster, genes with a bimodal distribution are selected and only genes with cluster-specific expression are retained. Among these candidate marker genes, sets with correlated expression patterns are identified by graph-based clustering. Finally, cells are assigned to subgroups based on their average expression of each gene set. The CellSIUS algorithm output provides the rare/ sub cell types by cell indices and their transcriptomic signatures.
SCRABBLE: Single Cell RNA-Seq imputAtion constrained By BuLk RNAsEq data (SCRABBLE)
Melissa: a Bayesian hierarchical method to quantify spatially-varying methylation profiles across genomic regions from single-cell bisulfite sequencing data (scBS-seq). Melissa clusters individual cells based on local methylation patterns, enabling the discovery of epigenetic diversities and commonalities among individual cells. The clustering also acts as an effective regularisation method for imputation of methylation on unassayed CpG sites, enabling transfer of information between individual cells.
paga: mapping out the coarse-grained connectivity structures of complex manifolds.
clonealign: Bayesian inference of clone-specific gene expression estimates by integrating single-cell RNA-seq and single-cell DNA-seq data
CellFishing.jl: (cell finder via hashing) is a tool to find similar cells of query cells based on their transcriptome expression profiles.
VIPER: variability-preserving imputation for accurate gene expression recovery in single-cell RNA sequencing studies.
scgen: a tensorflow implementation of scGen. scGen is a generative model to predict single-cell perturbation response across cell types, studies and species.
conos: a package to wire together large collections of single-cell RNA-seq datasets. It focuses on uniform mapping of homologous cell types across heterogeneous sample collections. For instance, a collection of dozens of peripheral blood samples from cancer patients, combined with dozens of controls. And perhaps also including samples of a related tissue, such as lymph nodes.
MAGIC: Markov Affinity-based Graph Imputation of Cells (MAGIC) is an algorithm for denoising high-dimensional data most commonly applied to single-cell RNA sequencing data. MAGIC learns the manifold data, using the resultant graph to smooth the features and restore the structure of the data.
zinbwave: a zero-inflated negative binomial model for single-cell RNA-seq data, with latent factors.
SIMLR_PY: Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning.
dca: a deep count autoencoder network to denoise scRNA-seq data and remove the dropout effect by taking the count structure, overdispersed nature and sparsity of the data into account using a deep autoencoder with zero-inflated negative binomial (ZINB) loss function.
scVI: deep generative modeling for single-cell transcriptomics.
PhenoGraph: a clustering method designed for high-dimensional single-cell data. It works by creating a graph ("network") representing phenotypic similarities between cells and then identifying communities in this graph.
splatter: simulation of Single-cell RNA sequencing data.
DeepNovo-DIA: de novo peptide sequencing for DDA and DIA by deep learning.

Protein Data Related

interproscan
effusion: prediction of Protein Function from Sequence Similarity Networks

Expression Quantitative Trait Loci, eQTL

CaVEMaN

ChIP-seq analysis

MACS
CEAS
MDSeqPos
conservation_plot

Primer Design

CEMAsuite
Primer3plus

The following is copied from https://pcrprimerdesign.github.io/.

Primer3

Tool	Publications	Availability	Interface	Language
Primer3	Rozen and Skaletsky, Koressaar and Remm, Untergasser et al.	Open	WUI/CUI	C

Sanger

Tool	Publication	Availability	Primer3-Based?	Interface	Language
PrimerZ	Tsai et al.	Free	Yes	WUI	Java
JCVI Primer Designer	Li et al.	Open	No	CUI	Perl
ExonPrimer	N/P	Free	Yes	WUI	Perl
MPDP3 (abbr.)	N/P	Free	Yes	WUI	?
ConservedPrimer2.0	You et al.	Open	Yes	WUI/CUI	Java
PrimerDesign-M	Yoon and Leitner	Free	No	WUI	?

RT-qPCR

	Publication	Availability	Primer3-Based?	Interface	Language
AutoPrime	Wrobel et al.	Open	Yes	WUI	Perl
QuantPrime	Arvidsson et al.	Free	Yes	WUI	Python/PHP
Primer-BLAST	Ye et al.	Open	Yes	WUI	C++

SNP

	Publication	Availability	Primer3-Based?	Interface	Language
PIRA PCR designer	Ke et al.	Free	No	WUI	Java
PRIMER1	Ye	Free	No	WUI	Java
PCR designer	Ke et al.	Free	No	WUI	?

Splicing Variant

	Publication	Availability	Primer3-Based?	Interface	Language
RASE	Brosseau et al.	Free	Yes	WUI	Perl
PRIMEGENS-v2	Srivastava et al.	Open	Yes	WUI/CUI	C
PrimerSeq	Tokheim et al.	Open	Yes	GUI	Java

Methylation

	Publication	Availability	Primer3-Based?	Interface	Language
Methprimer	Li and Dahiya	Free	Yes	WUI	C/Perl
BiSearch	Tusnady et al.	Free	No	WUI	?
Bisprimer	Kovacova and Janousek	Free	No	GUI	?
MSP-HTPrimer	Pandey et al.	Open	Yes	WUI	Python

Microsatellite

	Publication	Availability	Primer3-Based?	Interface	Language
MSATCOMMANDER	Faircloth	Open	Yes	CUI/GUI	Python
WebSat	Martins et al.	Free	Yes	WUI	Javascript/PHP
QDD	Meglécz et al.	Free	Yes	CUI/Galaxy	Perl

Conserved/Degenerate

	Publication	Availability	Primer3-Based?	Interface	Language
HYDEN	Linhart and Shamir	Free	No	CUI	C++
Amplicon	Jarman	Open	No	GUI	Python
Primaclade	Gadberry et al.	Free	Yes	WUI	Bioperl
PriFi	Fredslund et al.	Free	No	WUI	?
GeneFisher2	Lamprecht et al.	Free	No	WUI	Javascript/XML
PrimerIdent	Pessoa et al.	Free	Yes	WUI	Perl
TOPSI	Vijaya Satya et al.	Open	Yes	WUI/CUI	BioPerl
Gemi	Sobhy and Colson	Open	Yes	GUI	C#
easyPAC	Rosenkranz	Free	No	CUI	Perl

Multiplex

	Publication	Availability	Primer3-Based?	Interface	Language
MultiPLX	Kaplinski and Remm	Free	No	CUI/WUI	C++
MuPlex	Rachlin et al.	Free	No	WUI	Java
PrimerStation	Yamada et al.	Free	No	WUI	?
MPprimer	Shen et al.	Open	Yes	CUI/WUI	Python
Optimus Primer	Brown et al.	Free	Yes	WUI	?
MCMC-ODPR	Kitchen et al.	Free	No	CUI	Perl/Java
MPD	Wingo et al.	Open	No	WUI/CUI	C/Perl
Oli2go	Hendling et al.	Free	Yes	WUI	Python

Multifunctional

	Publication	Availability	Primer3-Based?	Interface	Language
Primo	Li et al.	Free	No	WUI	C
PerlPrimer	Marshall	Open	No	GUI	Perl/TK
The PCR suite	Baren and Heutink	Open	Yes	WUI	Perl
BatchPrimer3	You et al.	Free	Yes	WUI	Perl
jPCR	Kalendar et al.	Free	No	GUI	Java

Niche Applications

	Function	Publication	Availability	Primer3-Based?	Interface	Language
RJPrimers	Transposon	You et al.	Open	Yes	WUI/CUI	Perl/Java
PrimerX	Mutagenesis	N/P	Free	?	WUI	?
AcePrimer	C. elegans	Mckay and Jones	Free	Yes	WUI	Perl
PrecisePrimer	Cloning	Pauthenier and Faulon	Free	No	WUI	?
MultiMPrimer3	Pathogen	Koressaar et al.	Free	Yes	WUI	Perl
AmplifX	Management	N/P	Free	No	GUI	?

Workflow

bcbio-nextgen
nextflow
orange3
sequana
snakemake
WDL
cromwell
CWL
bpipe

Unclassified

biopython
IRanges
org.Hs.eg.db
Biobase
GenomicAlignments
GenomicRanges
Rsamtools
jvarkit
htslib
samtools
bedtools
bedops: a suite of tools to address common questions raised in genomic studies — mostly with regard to overlap and proximity relationships between data sets. It aims to be scalable and flexible, facilitating the efficient and accurate analysis and management of large-scale genomic data.
vcftools
bcftools
bamtools
maftools
bamUtil
vcflib
samstat
seqtk
sratools
bcl2fastq2
ucsc_utils
MeQA
IdCheck
SAMBLASTER
ngstk
BioInstaller
ChromHMM
ABSOLUTE
HAPSEG
Atlas-SNP, Atlas2 Suite
Beagle
CIBERSORT
biobloom
APAtrap
phenopredict: predicting phenotype sample information using gene expression
recount
bart: predicting functional transcription factors using gene set or a ChIP-seq dataset as input
LSMM (Latent Sparse Mixed Model): integrating functional annotations with genome-wide association studies
vcf2maf: Convert a VCF into a MAF, where each variant is annotated to only one of all possible gene isoforms
r2d3: R Interface to D3 Visualizations
liteq: Serverless R message queue using SQLite
ReLaXed: Create PDF documents using web technologies
dash: RStudio Addin to Run a Selection as a Background Job
threadpool: Parallel Processing in R using a Thread Pool
marina: master Regulator Inference Algorithm
paradigm: PAthway Representation and Analysis by Direct Inference on Graphical Models
hupan: a pan-genome analysis pipeline for human genomes.
RaPID: an ultra-fast tool for the identification of identity-by-descent segments among genotyped individuals.
gemini: a variational Bayesian approach to identify genetic interactions from combinatorial CRISPR screens.
CONFINED: a method for capturing replicable sources of biological variability in methylation data. These sources include, for example, age, sex, and cell-type composition. Importantly, the variation captured by CONFINED does not include any variability from technical or batch effects.
marginPhase: a program for simultaneous haplotyping and genotyping.
osca: (OmicS-data-based Complex trait Analysis) is a software tool written in C/C++ for the analysis of complex traits using multi-omics data.
ChiCMaxima: a pipeline for analyzing promoter CHi-C data and identifying chromatin loops.
circBrain: Detection of circular RNA expression and related quantitative trait loci in the human dorsolateral prefrontal cortex.
bazam: A read extraction and realignment tool for next generation sequencing data.
DegNorm: short for degradation normalization, is a bioinformatics pipeline designed to correct for bias due to the heterogeneous patterns of transcript degradation in RNA-seq data. DegNorm helps improve the accuracy of the differential expression analysis by accounting for this degradation.
conbase: software for unsupervised discovery of clonal somatic mutations in single cells through read phasing
3DChromatin_ReplicateQC: Software to compute reproducibility and quality scores for Hi-C data.
rnbeads: an R package for comprehensive analysis of DNA methylation data obtained with any experimental protocol that provides single-CpG resolution. Supported assays include Infinium and EPIC microarrays and bisulfite sequencing protocols, and also MeDIP-seq and MBD-seq once the data have been preprocessed with DNA methylation level inference software.
I-Boost: a statistical boosting method that integrates multiple types of high-dimensional genomics data with clinical data for predicting survival time.
bin3C: extract metagenome-assembled genomes (MAGs) from metagenomic data using Hi-C.
dStruct: method for identifying differential reactive regions from RNA structurome profiling data.
Skmer: a fast tool for estimating distances between genomes from low-coverage sequencing reads (genome-skims), without needing any assembly or alignment step.
iGUIDE: a pipeline written in snakemake for processing and analyzing double-strand DNA break events. These events may be induced, such as by designer nucleases like Cas9, or spontaneous, as produced through DNA replication or ionizing radiation.
plyranges: provides a consistent interface for importing and wrangling genomics data from a variety of sources. The package defines a grammar of genomic data manipulation based on dplyr and the Bioconductor packages IRanges, GenomicRanges, and rtracklayer.
FORGe: tool for ranking variants and building an optimal graph genome.
SE-MEI: tools for finding mobile element insertions from single-end datasets.
Anchor: trans-cell-type prediction of transcription factor binding sites
adVNTR: a tool for genotyping Variable Number Tandem Repeats (VNTR) from sequence data. It works with both NGS short reads (Illumina HiSeq) and SMRT reads (PacBio) and finds diploid repeating counts for VNTRs and identifies possible mutations in the VNTR sequences.
ldsc: a command line tool for estimating heritability and genetic correlation from GWAS summary statistics. ldsc also computes LD Scores.
BigStitcher: ImgLib2/BDV implementation of Stitching for large datasets.
ivtnmr: In Vitro Transcription NMR. Protocol, code and examples for the co-transcriptional RNA folding network reconstruction.
DIVERS: Decomposition of Variance Using Replicate Sampling, including absolute abundance estimation from spike-in sequencing and the variance/covariance decomposition of absolute bacterial abundances.
prosit: offers high quality MS2 predicted spectra for any organism and protease as well as iRT prediction
DeepCell: Software library for deep-learning-enabled single-cell analysis in the cloud. Users manage their own cloud deployment; model training and deployment are performed through a web interface.
CDeep3M: Amazon machine image for training and deploying deep learning models for 2D and 3D image segmentation
U-Net: ImageJ plug-in for single-cell image segmentation with U-Net.
CellProfiler: Python-based software for single-cell segmentation and morphological profiling. Single-cell segmentation with U-Net available through a REST API.
Mask R-CNN: Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow.
Cell Cognition Explorer: an open-source image processing tool for the analysis of cellular phenotypes in microscopy. CellCognition Explorer enables phenotype classification by supervised machine learning. To detect rare phenotypes, outlier morphologies can be automatically found by novelty detection methods. A key feature of CellCognition Explorer is an improved classifier training procedure based on automated pre-processing of the full data set into cell gallery images, which can be automatically sorted based on phenotype similarity for efficient iterative classifier training.
DeepLabCut: a toolbox for markerless pose estimation of animals performing various tasks.
LEAP: LEAP Estimates Animal Pose, a framework for animal body part position estimation via deep learning.
idtracker.ai: software that tracks and identifies animals in collectives from videos.
In silico labeling: Predicting fluorescent labels in unlabeled images.
Image restoration: a toolbox for Content-aware Image Restoration (CARE).
trackViewer: a Bioconductor package for interactive and integrative visualization of multi-omics data
cistopic: probabilistic modelling of cis-regulatory topics from single cell epigenomics data
selene: a framework for training sequence-level deep learning networks.
sirius: a rapid tool for turning tandem mass spectra into metabolite structure information.
SDA: Segmental Duplication Assembler (SDA).
fmriprep: a robust and easy-to-use pipeline for preprocessing of diverse fMRI data. The transparent workflow dispenses with manual intervention, thereby ensuring the reproducibility of the results.
unifrac: for high-performance phylogenetic diversity calculations
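
Several tools in this list (bedtools, bedops, GenomicRanges, plyranges) answer overlap questions between genomic intervals. The following is a minimal sketch of that core test, assuming BED-style coordinates (0-based start, exclusive end). The helper names `overlaps` and `intersect` are hypothetical and are not the API of any listed tool.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def intersect(a, b):
    """Return the shared (chrom, start, end) region, or None if there is none."""
    if not overlaps(a, b):
        return None
    return (a[0], max(a[1], b[1]), min(a[2], b[2]))


if __name__ == "__main__":
    exon = ("chr1", 100, 200)
    peak = ("chr1", 150, 300)
    print(intersect(exon, peak))
    # Adjacent intervals do not overlap when the end coordinate is exclusive.
    print(intersect(("chr1", 100, 200), ("chr1", 200, 300)))
```

Real tools avoid comparing every pair by sorting the intervals or indexing them in a tree, which matters once the inputs reach millions of records.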

Statistical and Visualization

medcalc
GraphPad
ImageJ
SPSS
R
gvmap
easySVG
hexmapr
clustergrammer
chromVAR
echarts
plotly
qvalue: estimating q-values and false discovery rate quantities
GenVisR: genome data visualizations
r-color-palettes: Comprehensive list of color palettes available in r
sequenza: a set of tools providing a fast Python script to genotype cancer samples, and an R package to estimate cancer cellularity, ploidy, and genome-wide copy number profile and to infer mutated alleles
opencpu: A system for embedded scientific computing and reproducible research with R
ggthemr: Themes for ggplot2
paletter: Build your ggplot2 palette from a picture
ggdag: An R package for working with causal directed acyclic graphs (DAGs)
ggseqlogo: Publication-quality sequence logos in R.
threejs: JavaScript 3D library
higlass: Fast contact matrix visualization for the web, homepage: http://higlass.io

Text editor and IDE

Vim
Emacs
Atom
Sublime
RStudio
Eclipse
PyCharm
Visual Studio

Remote Connection (SSH)

MobaXterm
Cygwin
Xshell & Xftp
PuTTY
babun
cmder

Remote Connection (Desktop)

TeamViewer
Sunlogin
Splashtop
Chrome Remote Desktop app
LogMeIn
PC Anywhere
GoToMyPC
Radmin
UltraVNC

Other

igraph
root
boost
libtbb
docker

Books & Tutorials

R

R packages
stringr
Bioconductor Tutorial
limma
Learn ggplot2 in 30 Minutes (30分钟学会ggplot2)
R Graphics Cookbook
Introduction to data.table
RSQLite
R Graphics
Wordcloud2

Linux & Shell

The Linux Command Line
Advanced Bash-Scripting Guide
Wicked Cool Shell Scripts
VBird's Linux Private Kitchen (鸟哥的 Linux 私房菜)
Runoob Tutorials (菜鸟教程)

Python

Learning Python, 5th Edition
Python Examples
Learning Python
Learning Python, Chinese edition (Python学习手册)

C/C++

C Primer Plus
C++ Primer Plus 6th Edition

Java

The Java™ Tutorials

Statistics and Deep learning

SPSS Beginners Tutorials
Machine learning
Deep learning
Loss function
Maximum likelihood estimation
Bayes' theorem
Perceptron
SVM
k-nearest neighbors algorithm
Convolutional Neural Network
K-Means
HMM
STAT115 - HMM PPT
Common machine learning algorithms (机器学习常用算法)
Machine learning resource list (机器学习资源列表)
Review: Deep learning, genomics, and precision medicine
ML book list:

│  李航.统计学习方法.pdf
│  机器学习及其应用.pdf
│  All of Statistics - A Concise Course in Statistical Inference - Larry Wasserman - Springer.pdf
│  Machine Learning - Tom Mitchell.pdf
│  PRML.pdf
│  PRML读书会合集打印版.pdf
│  Programming Collective Intelligence.pdf
│  [奥莱理] Machine Learning for Hackers.pdf
│  [机器学习]Tom.Mitchell.pdf
│  《大数据：互联网大规模数据挖掘与分布式处理》迷你书.pdf
│  推荐系统实践.pdf
│  数据挖掘-实用机器学习技术（中文第二版）.pdf
│  数据挖掘_概念与技术.pdf
│  机器学习-Mitchell-中文-清晰版.pdf
│  机器学习导论.pdf
│  模式分类第二版中文版Duda.pdf（全）.pdf
│  深入搜索引擎--海量信息的压缩、索引和查询.pdf
│  矩阵分析.美国 Roger.A.Horn.扫描版.pdf
│  统计学习基础 数据挖掘、推理与预测.pdf
│  
├─机器学习实战
│      machinelearninginaction.zip
│      机器学习实战 单页.pdf
│      机器学习实战.pdf
│      
└─论文文集
    └─LDA
            LDA-wangyi.pdf
            LDA数学八卦.pdf
            text-est.pdf

Git

Git tutorials
Git Tutorial (Git 教程)
GitHub Guides

Cloud

Cloud Computing
GCP-for-Bioinformatics: https://github.com/lynnlangit/gcp-for-bioinformatics
Docker Getting Started Tutorial (Docker入门教程)

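Bioinformatics

BGI Bioinformatics Training Materials (华大基因生物信息学培训教材)
Introduction to Bioinformatics (生物信息学入门)
Best Practices for Getting Started in Bioinformatics (《生物信息学入门最佳实践》)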
The Biostar Handbook: A Beginner's Guide to Bioinformatics
Bioinformatics Data Skills
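Bioinformatics Rookies Group blog (生信菜鸟团博客)
Biotrainee forum (生信技能树论坛)
Biotrainee open-source Yuque knowledge base (生信技能树开源语雀知识库)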

Paper

Basics of high-throughput sequencing technology

Hadfield, J. & Retief, J. A profusion of confusion in NGS methods naming. Nat Methods 15, 7-8 (2018): http://enseqlopedia.com/enseqlopedia/
Schuster S C. Next-generation sequencing transforms today's biology[J]. Nature methods, 2008, 5(1): 16-18.
Ozsolak F, Milos P M. RNA sequencing: advances, challenges and opportunities.[J]. Nature Reviews Genetics, 2011, 12(2):87-98.
Stark R, Grzelak M, Hadfield J. RNA sequencing: the teenage years[J]. Nature Reviews Genetics, 2019, 20(11): 631-656.
Ansorge W J. Next-generation DNA sequencing techniques[J]. New biotechnology, 2009, 25(4): 195-203.
Heather J M, Chain B. The sequence of sequencers: The history of sequencing DNA[J]. Genomics, 2016, 107(1): 1-8.
Schneider G F, Dekker C. DNA sequencing with nanopores[J]. Nature biotechnology, 2012, 30(4): 326.
Restrepo-Pérez L, Joo C, Dekker C. Paving the way to single-molecule protein sequencing[J]. Nature nanotechnology, 2018, 13(9): 786-796.

Large research project

Cancer Genome Atlas Research Network, et al., The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet, 2013. 45(10): p. 1113-20.
International Cancer Genome Consortium, et al., International network of cancer genome projects. Nature, 2010. 464(7291): p. 993-8.
GTEx Consortium, The Genotype-Tissue Expression (GTEx) project. Nat Genet, 2013. 45(6): p. 580-5.
eGTEx Project, Enhancing GTEx by bridging the gaps between genotype, gene expression, and disease. Nat Genet, 2017. 49(12): p. 1664-1670.
GTEx Consortium, et al., Genetic effects on gene expression across human tissues. Nature, 2017. 550(7675): p. 204-213.

Precision medicine

Byron, S.A., et al., Translating RNA sequencing into clinical diagnostics: opportunities and challenges. Nat Rev Genet, 2016. 17(5): p. 257-71.
Price, N.D., et al., A wellness study of 108 individuals using personal, dense, dynamic data clouds. Nat Biotechnol, 2017. 35(8): p. 747-756.
Kumar-Sinha, C. and A.M. Chinnaiyan, Precision oncology in the age of integrative genomics. Nat Biotechnol, 2018. 36(1): p. 46-60.
Torkamani, A., N.E. Wineinger, and E.J. Topol, The personal and clinical utility of polygenic risk scores. Nat Rev Genet, 2018.
Berdasco, M. and M. Esteller, Clinical epigenetics: seizing opportunities for translation. Nat Rev Genet, 2018.

Tumor biology

Stratton, M.R., P.J. Campbell, and P.A. Futreal, The cancer genome. Nature, 2009. 458(7239): p. 719-24.
Sanchez-Vega, F., et al., Oncogenic Signaling Pathways in The Cancer Genome Atlas. Cell, 2018. 173(2): p. 321-337 e10.
Huang, K.L., et al., Pathogenic Germline Variants in 10,389 Adult Cancers. Cell, 2018. 173(2): p. 355-370 e14.
Kahles, A., et al., Comprehensive Analysis of Alternative Splicing Across Tumors from 8,705 Patients. Cancer Cell, 2018.
Castro-Giner, F., P. Ratcliffe, and I. Tomlinson, The mini-driver model of polygenic cancer evolution. Nat Rev Cancer, 2015. 15(11): p. 680-5.
Salk, J.J., M.W. Schmitt, and L.A. Loeb, Enhancing the accuracy of next-generation sequencing for detecting rare and subclonal mutations. Nat Rev Genet, 2018.
Winters, I.P., C.W. Murray, and M.M. Winslow, Towards quantitative and multiplexed in vivo functional cancer genomics. Nat Rev Genet, 2018. 19(12): p. 741-755.
Pesavento, P.A., et al., Cancer in wildlife: patterns of emergence. Nat Rev Cancer, 2018.
Maman, S. and I.P. Witz, A history of exploring cancer in context. Nat Rev Cancer, 2018. 18(6): p. 359-376.
Hamidi, H. and J. Ivaska, Every step of the way: integrins in cancer progression and metastasis. Nat Rev Cancer, 2018.
Archetti, M. and K.J. Pienta, Cooperation among cancer cells: applying game theory to cancer. Nat Rev Cancer, 2018.

Bioinformatics databases and tools

Ding, L., et al., Expanding the computational toolbox for mining cancer genomes. Nat Rev Genet, 2014. 15(8): p. 556-70.
Cheng, F., J. Zhao, and Z. Zhao, Advances in computational approaches for prioritizing driver mutations and significantly mutated genes in cancer genomes. Brief Bioinform, 2016. 17(4): p. 642-56.
Zhang, Z., et al., A survey and evaluation of Web-based tools/databases for variant analysis of TCGA data. Brief Bioinform, 2018.
Casper J, Zweig A S, Villarreal C, et al. The UCSC genome browser database: 2018 update[J]. Nucleic acids research, 2017, 46(D1): D762-D769.
Afgan, E., et al., The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res, 2018. 46(W1): p. W537-W544.
Sondka, Z., et al., The COSMIC Cancer Gene Census: describing genetic dysfunction across all human cancers. Nat Rev Cancer, 2018. 18(11): p. 696-705.

Applications of machine learning in bioinformatics

Zou, J., et al., A primer on deep learning in genomics. Nat Genet, 2019. 51(1): p. 12-18.
Eraslan, G., et al., Deep learning: new computational modelling techniques for genomics. Nat Rev Genet, 2019.
Wainberg, M., et al., Deep learning in biomedicine. Nat Biotechnol, 2018. 36(9): p. 829-838.
Ching, T., et al., Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface, 2018. 15(141).
Min, S., B. Lee, and S. Yoon, Deep learning in bioinformatics. Brief Bioinform, 2017. 18(5): p. 851-869.
Jones, W., et al., Computational biology: deep learning. Emerging Topics in Life Sciences, 2017. 1(3): p. 257-274.
Angermueller, C., et al., Deep learning for computational biology. Mol Syst Biol, 2016. 12(7): p. 878.
Zhou, J., et al., Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat Genet, 2018. 50(8): p. 1171-1179.
Sundaram, L., et al., Predicting the clinical impact of human mutation with deep neural networks. Nat Genet, 2018.
Libbrecht, M.W. and W.S. Noble, Machine learning applications in genetics and genomics. Nat Rev Genet, 2015. 16(6): p. 321-32.
Camacho, D.M., et al., Next-Generation Machine Learning for Biological Networks. Cell, 2018. 173(7): p. 1581-1592.

Whole-genome sequencing

Kosugi, Shunichi, et al. "Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing." Genome biology 20.1 (2019): 117.

Single cell sequencing

Kiselev, V.Y., T.S. Andrews, and M. Hemberg, Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet, 2019. 20(5): p. 273-282.
McInnes, L., J. Healy, and J. Melville, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv e-prints, 2018.
van der Maaten, L. and G. Hinton, Visualizing Data using t-SNE. Journal of Machine Learning Research, 2008. 9: p. 2579-2605.
Lake, B.B., et al., Integrative single-cell analysis of transcriptional and epigenetic states in the human adult brain. Nat Biotechnol, 2018. 36(1): p. 70-80.
Cusanovich, D.A., et al., A Single-Cell Atlas of In Vivo Mammalian Chromatin Accessibility. Cell, 2018. 174(5): p. 1309-1324 e18.
Haghverdi, L., et al., Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol, 2018.
Raj, B., et al., Simultaneous single-cell profiling of lineages and cell types in the vertebrate brain. Nat Biotechnol, 2018. 36(5): p. 442-450.
Edsgard, D., P. Johnsson, and R. Sandberg, Identification of spatial expression trends in single-cell gene expression data. Nat Methods, 2018. 15(5): p. 339-342.

Non-coding region and synonymous mutation

Fredriksson, N.J., et al., Systematic analysis of noncoding somatic mutations and gene expression alterations across 14 tumor types. Nat Genet, 2014. 46(12): p. 1258-63.
Weinhold, N., et al., Genome-wide analysis of noncoding regulatory mutations in cancer. Nat Genet, 2014. 46(11): p. 1160-5.
Uszczynska-Ratajczak, B., et al., Towards a complete map of the human long non-coding RNA transcriptome. Nat Rev Genet, 2018.
Chamary J V, Parmley J L, Hurst L D. Hearing silence: non-neutral evolution at synonymous sites in mammals[J]. Nature Reviews Genetics, 2006, 7(2): 98.
Sauna Z E, Kimchi-Sarfaty C. Understanding the contribution of synonymous mutations to human disease[J]. Nature Reviews Genetics, 2011, 12(10): 683.
Supek F, Miñana B, Valcárcel J, et al. Synonymous mutations frequently act as driver mutations in human cancers[J]. Cell, 2014, 156(6): 1324-1335.
Sharma, Y., et al., A pan-cancer analysis of synonymous mutations. Nat Commun, 2019. 10(1): p. 2569.

Pan-genome

Li, R., et al., Building the sequence map of the human pan-genome. Nat Biotechnol, 2010. 28(1): p. 57-63.
Duan Z, Qiao Y, Lu J, et al. HUPAN: a pan-genome analysis pipeline for human genomes[J]. Genome biology, 2019, 20(1): 149.

3D genome

Spielmann, M., D.G. Lupianez, and S. Mundlos, Structural variation in the 3D genome. Nat Rev Genet, 2018. 19(7): p. 453-467.

Skills

Programming language

Shell
Python
R
HTML/CSS
JavaScript
PHP
SQL
C/C++
Java
Perl

Statistics

t-test
Chi-squared test
ANOVA
Normal distribution
Wilcoxon signed-rank test
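
As a small worked example for the first item above, the sketch below computes a two-sample t statistic with the Python standard library only. It assumes Welch's unequal-variance form (the list does not say which variant is meant), and `welch_t` is a hypothetical helper name. Turning the statistic into a p-value needs the t distribution, which is left to a statistics package.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, df) for Welch's two-sample t-test on two lists of numbers."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Sample variances (n - 1 in the denominator), each scaled by its sample size.
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


if __name__ == "__main__":
    control = [1, 2, 3, 4, 5]
    treated = [2, 3, 4, 5, 6]
    t, df = welch_t(control, treated)
    print(f"t = {t:.3f}, df = {df:.1f}")
```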

Code Management

Git
GitHub

Organization

Google Summer of Code Registered

Open Bioinformatics Foundation: Promoting practice & philosophy of OSS & Open Science in biological research.
National Resource for Network Biology (NRNB): The National Resource for Network Biology (NRNB) organizes the development of free, open source software to enable network-based visualization, analysis, and biomedical discovery.
INCF: INCF advances data reuse and reproducibility in brain research by coordinating the development of Open, FAIR, and Citable tools and resources for neuroscience.
Computational Biology @ University of Nebraska-Lincoln: Our organization develops tools for bioinformatics and computational biology research. Our goal is to further knowledge in health through data visualization and analysis.
Biomedical Informatics, Emory University: Big Data for Healthcare and Biomedical Research
Ensembl: The Ensembl project maintains and updates databases that annotate a wide number of genome sequences and distributes them freely to the worldwide research community.
R project for statistical computing: R provides a wide variety of statistical and graphical techniques, and is highly extensible. R is often the tool of choice for research in statistical methodology.
InterMine: InterMine integrates biological data sources and makes it easy to query, visualise, and analyse the data via a graphical user interface or via APIs in Python, R, Perl, and more.
NumFOCUS: NumFOCUS supports and promotes world-class, innovative, open source scientific software.
PEcAn Project: PEcAn is an integrated ecoinformatics toolbox that consists of a set of scientific workflows that wrap around ecosystem models and manage flow of information in and out of models

Project-based community

galaxyproject: Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research.
bioconda: A channel for the conda package manager specializing in bioinformatics software.
biopython: An international association of developers of freely available Python tools for computational molecular biology.
samtools: Tools (written in C using htslib) for manipulating next-generation sequencing data.
opengene: Open source tools for NGS data analysis.
MultiQC: Aggregate results from bioinformatics analyses across many samples into a single report.
GATK: GATK4 aims to bring together well-established tools from the GATK and Picard codebases under a streamlined framework, and to enable selected tools to be run in a massively parallel way on local clusters or in the cloud using Apache Spark. It also contains many newly developed tools not present in earlier releases of the toolkit.
nextflow: A bioinformatics workflow manager that enables the development of portable and reproducible workflows.
spack: A flexible package manager that supports multiple versions, configurations, platforms, and compilers.
omicX: Reap the rewards of a biological insight engine.

Communication-based community

Biotrainee: Chinese Community in Bioinformatics
bioinformatics.org: Bioinformatics community open to all people.
Zhihu | Bioinformatics: Chinese Q&A Community.
muchong: Chinese BBS for scientific research.

Institute or business company

Broad Institute
The European Bioinformatics Institute
Harbin Institute of Technology | Center for Bioinformatics
Illumina
Life Technologies
QIAGEN

People

Eric Lander
Leroy Hood
Mark Gerstein
Shirley Liu
Chuan He
Bing Ren
Job Dekker
Michael Snyder
Howard Chang
Mitch Guttman
John Rinn
Bradley E. Bernstein
Richard Michael Durbin
Pavel A. Pevzner
Brendan J. Frey
Jinghui Zhang
Ira M. Hall

Blog

Jianfeng Li's blog
RNA-seq Blog
Jianming Zeng's blog
Yihui Xie's blog
Fei Zhao's blog
Mengyuan Shen's blog
Boqiang Hu's blog
Bob's Blog
Homolog.us - Frontier in Bioinformatics
r-bloggers
DataTau
Bits of DNA, Lior Pachter
Next Generation Technologist
Simply Statistics
Massgenomics
OpenHelix
QIAGEN
Loman Labs Blog
Living in an Ivory Basement: Stochastic thoughts on science, testing, and programming
Neil Saunders
Mike Love’s blog
Ewan Birney
In between lines of code
Heng Li's blog
MacArthur Lab
Blue Collar Bioinformatics
Simpson Lab
Bits of Bioinformatics
Shixiang Wang's blog

Contributors

Jianfeng Li
Bowen Cui
Shixiang Wang
l0o0
