Workshop session #1 of the "Systems Biology of Infectious Diseases" workshop (Feb 24-26, 2020)
We will introduce a general machine learning workflow for systems serology datasets.
Before the workshop, please install the required programs and packages so that we can get started right away. In particular:
- Install R and RStudio. If you use macOS, please follow the instructions found here to also install XQuartz and Xcode.
- Install the following packages:
  - readxl
  - ggpubr
  - corrr
  - ropls
  - glmnet
  - DMwR
  - pheatmap
  - ggplot2
  - ggrepel
  - RColorBrewer
  - igraph
  - ggraph
  - tidyverse
install.packages(c("readxl", "ggpubr", "corrr", "glmnet", "DMwR", "pheatmap", "ggplot2",
                   "ggrepel", "RColorBrewer", "igraph", "ggraph", "tidyverse"))
The package ropls needs to be installed from Bioconductor using:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ropls")
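To confirm that the installation steps above worked, a quick check along the following lines may help. This is only a sketch: the variable names `required`, `installed`, and `missing_pkgs` are made up for illustration, and the package list is copied from the list above.

```r
# Packages listed above; ropls comes from Bioconductor, the others from CRAN.
required <- c("readxl", "ggpubr", "corrr", "ropls", "glmnet", "DMwR", "pheatmap",
              "ggplot2", "ggrepel", "RColorBrewer", "igraph", "ggraph", "tidyverse")

# requireNamespace() returns TRUE if a package can be loaded, without attaching it.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

# Report anything that still needs to be installed.
missing_pkgs <- required[!installed]
if (length(missing_pkgs) == 0) {
  message("All required packages are installed.")
} else {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
}
```

Any package reported as missing can be installed again with the corresponding command above.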
If you have little or no experience with R, consider working through a general R workshop beforehand, for example one of those found here: https://github.com/nuitrcs/rworkshops. The hands-on exercises are nevertheless offered in several versions, so there is one for every experience level.
The data for this session is taken from the following publication:
To import the data and get an overview of it, run Notebook part 1.
The basic machine learning workflow for systems serology data consists of feature selection using LASSO (Least Absolute Shrinkage and Selection Operator), followed by PLS-DA (partial least squares discriminant analysis). There are different versions of the exercises for different levels of programming skill:
The solution notebook provides code for the whole workflow including more detailed explanations of the results.
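For orientation, the two-step workflow described above (LASSO feature selection, then PLS-DA on the selected features) could look roughly like the sketch below. It is not the code from the workshop notebooks: the data are simulated, the names `X`, `y`, `selected`, and `pls_model` are placeholders, and choices such as `lambda.min`, five cross-validation folds, and two PLS components are assumptions made for illustration.

```r
library(glmnet)  # LASSO
library(ropls)   # PLS-DA

set.seed(1)

# Simulated stand-in for a serology data matrix: 40 samples x 20 features.
n <- 40
p <- 20
X <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(paste0("sample", 1:n), paste0("feature", 1:p)))
y <- factor(rep(c("groupA", "groupB"), each = n / 2))

# Make the first three features informative by shifting them in groupB.
X[y == "groupB", 1:3] <- X[y == "groupB", 1:3] + 1.5

# Step 1: LASSO (alpha = 1) logistic regression, with lambda chosen by cross-validation.
cv_fit <- cv.glmnet(X, y, family = "binomial", alpha = 1, nfolds = 5)
coefs <- as.matrix(coef(cv_fit, s = "lambda.min"))[, 1]

# Keep the features with non-zero coefficients, excluding the intercept.
selected <- setdiff(names(coefs)[coefs != 0], "(Intercept)")
stopifnot(length(selected) >= 2)  # two PLS components need at least two features

# Step 2: PLS-DA on the selected features (a factor response makes opls() fit PLS-DA).
pls_model <- opls(X[, selected, drop = FALSE], y, predI = 2)

# Sample scores on the two predictive components, e.g. for a score plot.
scores <- getScoreMN(pls_model)
head(scores)
```

Using `lambda.1se` instead of `lambda.min` would typically give a sparser feature set. Because the cross-validation folds are random, the selected features can vary between runs unless the seed is fixed.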