mixOmics is a large R package that provides statistical methods to integrate omics data sets (e.g. transcriptomics, proteomics, metabolomics, metagenomics) that simultaneously measure the activity of thousands of biological features (e.g. transcripts, proteins, metabolites, bacteria). Data integration enables the identification of specific biological relationships between these features (e.g. genes and proteins), creating new insights into the molecular processes involved in health and disease. mixOmics includes 19 data integration methods, 13 of which were developed in our lab. These methods are all based on dimension reduction using Projection to Latent Structures (PLS).
Our users include computational biologists, molecular biologists and bioinformaticians who wish to integrate their data and identify signatures of genes, proteins, etc. that explain or predict a disease outcome. The package (ranked in the top 5% of Bioconductor packages) is easy to use because all methods rely on the same underlying PLS principles and produce numerous graphics for interpretation (Fig. 1). We continuously improve the mixOmics package based on community feedback.
As this is a large project, the internship calls for complementary skill sets to:
- improve specific aspects of the package (e.g. increase unit test coverage, troubleshoot bugs, provide new features requested by users, improve code quality, develop new graphics)
- improve our existing tutorials and develop new ones on www.mixOmics.org
- if there is an appropriate opportunity and motivation, respond to users' questions on our discussion forum at https://mixomics-users.discourse.group. This requires a good mastery of the methods and would therefore only apply towards the end of the internship.
After the (steep) learning phase, there will be opportunities for students to propose new features and functionalities in the package if they wish.
Figure 1. Overview of the methods in mixOmics for data exploration and integration of multiple omics data sets (courtesy of Prof. Lê Cao)
Skills and Pre-requisites:
- Very good knowledge of R and the Linux command line
- Ability to learn and understand high-level statistical concepts quickly
- Ability to work independently, report to a group, and discuss theories and results
- Excellent skills in statistical analysis of complex data
- Ability to work with GitHub
- Ability to interact with users
- Interest in biological applications
Benefits for students whilst undertaking the internship include:
- Gain hands-on experience working in an emerging research software environment
- Gain an understanding of how real-world software is assessed and developed, and how priorities and requirements are established within a research environment
- Gain an understanding of the importance of maintainable, scalable and extensible code
- Improve oral and written communication skills in a team environment
- Learn about new statistical methods for mining large data sets
- Learn about high-throughput biology