The goal of this project is to predict ClinVar-style clinical significance for variants that lack existing ClinVar data.
The final model presented can be found at notebooks/experiments/AIGenomics_Model_Quickstart.ipynb. For a quick start, see notebooks/QuickStart.txt, or clone and run notebooks/experiments/AIGenomics_Model_Quickstart.ipynb directly.
- ClinVar is an aggregation of various sources that provide clinical data for variants.
- In-silico predictors of pathogenicity already exist and are reasonably good (see results).
- This project is intended to help our users make informed decisions: this means we are interested in feature extraction (which features drive a prediction) as well as accurate predictions.
- Our usage would not be to select pathogenic variants, but to prioritize them. This is a critical distinction: it means a false positive is worse here than in other clinical applications, while a false negative is not as bad (we are not removing variants on the basis of this prediction).
- Calling variants results in hundreds of potentially clinically important mutations, even after filtering. We need to reduce the number of false positives, so precision is more important than recall: the need is for a small number of highly recommended mutations, and if the critical variant is not found we can easily relax the thresholds (see the sketch below).
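A minimal sketch of what this probability-based prioritization could look like with a scikit-learn classifier; the names (`clf`, `X_candidates`, `variant_ids`) and the 0.9 cutoff are illustrative assumptions, not values taken from the notebooks.

```python
import numpy as np

# Minimal sketch of probability-based prioritization: rank candidate variants
# by predicted pathogenicity and keep only the most confident calls.
# `clf`, `X_candidates`, `variant_ids`, and the 0.9 cutoff are placeholders.
def prioritize(clf, X_candidates, variant_ids, threshold=0.9):
    proba = clf.predict_proba(X_candidates)[:, 1]   # assumes column 1 = pathogenic class
    order = np.argsort(proba)[::-1]                 # most confident first
    shortlist = [(variant_ids[i], proba[i]) for i in order if proba[i] >= threshold]
    # If the critical variant is not in the shortlist, simply lower `threshold`.
    return shortlist
```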
- Experimental table of >20 experiments. Notebook naming is id.featureset.chromosomes.model (e.g., 1.1.2_4.svm):
Notebook | Input data (total # of examples) | Model (details) | Baseline (primary class = 1) | Accuracy | Precision | Recall | F1 | AUC |
---|---|---|---|---|---|---|---|---|
1.1.2_4.svm | 13767 | SVM (rbf) | 0.71 | 0.74 | 0.64 | 0.16 | 0.25 | 0.64 |
2.1.2_4.RF | 13767 | RF (n=10) | 0.71 | 0.80 | 0.64 | 0.65 | 0.64 | 0.85 |
3.1.2_4.nn | 13767 | 3 hidden layers | 0.71 | 0.73 | 0.36 | 0.5 | 0.42 | |
3.1.2_4.nn | 13767 | MLP | 0.71 (but 0.68 in test set of 3442 examples) | 0.68 | 0.0 | 0.0 | nan | |
4.2.2_4.svm | 13767 | SVM (rbf) | 0.71 | 0.74 | 0.65 | 0.21 | 0.32 | 0.62 |
5.2.2_4.RF | 13767 | RF(n=10) | 0.71 | 0.80 | 0.64 | 0.63 | 0.68 | 0.84 |
6.2.2_4.NN | 13767 | MLP | 0.71 (but 0.72 in test set of 3434 examples) | 0.72* | 0.0 | 0.0 | nan | |
7.3.2_4.svm | 13767 | SVM (rbf) | 0.71 | 0.74 | 0.68 | 0.18 | 0.28 | 0.70 |
8.3.2_4.RF | 13767 | RF(n=10) | 0.71 | 0.81 | 0.64 | 0.68 | 0.66 | 0.85 |
9.3.2_4.nn | 13767 | MLP | 0.71 (but 0.68 in test set of 3442 examples ) | 0.68 | 0.0 | 0.0 | nan | |
10.4.2_4.svm | 13767 | SVM (rbf) | 0.71 | 0.75 | 0.76 | 0.16 | 0.27 | 0.69 |
11.4.2_4.RF | 13767 | RF(n=10) | 0.71 | 0.86 | 0.75 | 0.74 | 0.74 | 0.90 |
12.4.2_4.nn | 13767 | 3 hidden layers | 0.71 | 0.72 | 0 | 0 | nan | |
13.1.2_4u.svm | 40905 | svm(rbf), balanced class | 0.66u | 0.39 | 0.80 | 0.26 | 0.39 | |
14.1.2_4u.RF | 40905 | RF(n=10) balanced class | 0.66u | 0.67 | 0.76 | 0.77 | 0.77 | |
15.4.2_4.nn | 40905 | Balanced dataloader | 0.66u | 0.62 | 0.44 | 0.68 | 0.53 | |
16.4.2_4.GB | 40905 | Gradient Boost (n=10) | 0.71 | 0.82 | 0.82 | 0.47 | 0.59 | |
17.5.2_4.svm | 13767 | svm(rbf) | 0.71 | 0.75 | 0.76 | 0.16 | 0.27 | 0.69 |
18.5.2_4.RF | 13767 | RF(n=1000) balanced | 0.71 | 0.87 | 0.79 | 0.73 | 0.76 | 0.93 |
19.5.2_4u.RF | 40905 | RF(n=1000) balanced | 0.66u | 0.73 | 0.77 | 0.86 | 0.81 | |
20.6.2_4u.RF | 40905 | RF(n=100) balanced | 0.66u | 0.74 | 0.78 | 0.88 | 0.82 | |
21.7.2_4u.RF | 40905 | RF(n=100) | 0.66u | 0.75 | 0.89 | 0.78 | 0.83 | |
22.8.5_7u.RF | 40905 | RF(n=100) | 0.66u | 0.76 | 0.77 | 0.91 | 0.84 | |
23.8.5_7uc.RF | 40905 | RF(n=100) | 0.66u | 0.75 | 0.77 | 0.91 | 0.84 | |
24.8.5_7c.RF | 13767 | RF(n=100) | 0.66u | 0.84 | 0.74 | 0.80 | 0.77 | |
25.5.2_4.RFunk | 13820, then 17808 (unk) | unk boosted RF(n=1000) | 0.71, then 0.57 | 0.88 | 0.84 | 0.76 | 0.80 | |
26.4.2_4.LR | 13820 | Logistic Regression | 0.71 | 0.71 | 0.49 | 0.76 | 0.60 | |
27.5.2_4u.KNN | 13820 | Nearest Neighbour (n=3) | 0.66u | 0.67 | 0.72 | 0.84 | 0.80 | |
28.5.5_7.RF | 13820 | RF(n=100) | 0.70 | 0.87 | 0.68 | 0.78 | 0.73 | 0.92 |
29.5.8_10.RF | 9472 | RF(n=100) | 0.69 | 0.85 | 0.70 | 0.81 | 0.75 | 0.92 |
30.5.11_14.RF | 7836 | RF(n=100) | 0.75 | 0.87 | 0.80 | 0.67 | 0.73 | 0.91 |
31.5.15_19.RF | 17361 | RF(n=100) | 0.75 | 0.88 | 0.80 | 0.69 | 0.74 | 0.92 |
32.5.20_Y.RF | 10154 | RF(n=100) | 0.78 | 0.87 | 0.78 | 0.62 | 0.69 | 0.91 |
33.5.2_4.GB | 13767 | RF(n=100) | 0.71 | 0.87 | 0.80 | 0.74 | 0.77 | 0.93 |
34.polyphen.2_4.RF | 13825 | RF(n=100) | 0.71 | 0.76 | 0.73 | 0.25 | 0.38 | 0.66 |
35.genocanyon.2_4.RF | 13825 | RF(n=100) | 0.71 | 0.71 | 0.46 | 0.27 | 0.34 | 0.57 |
36.fathmm.2_4.RF | 13825 | RF(n=100) | 0.71 | 0.79 | 0.65 | 0.54 | 0.59 | 0.79 |
37.mutationtaster.2_4.RF | 13825 | RF(n=100) | 0.71 | 0.77 | 0.68 | 0.34 | 0.45 | 0.71 |
38.metasvm.2_4.RF | 13825 | RF(n=100) | 0.71 | 0.86 | 0.76 | 0.71 | 0.74 | 0.90 |
39.7.2_4.RF | 13825 | RF(n=100) | 0.71 | 0.88 | 0.90 | 0.94 | 0.92 | 0.94 |
40.8.2_4.RF | 13825 | RF(n=100) | 0.71 | 0.89 | 0.90 | 0.94 | 0.92 | 0.94 |
41.5.2_4.RF | 13825 | RF(n=1000) | 0.71 | 0.87 | 0.80 | 0.73 | 0.76 | 0.93 |
42.5.1_22.RF | 65424 | RF(n=1000) | 0.722 | 0.81±0.1 | 0.80±0.08 | 0.5±0.11 | 0.60±0.1 | |
- Models: tested three different model architectures with increasing hyperparameter adjustments and increasing amounts of data.
- Input data:
- Labeled output: begin with binary class prediction, then expand to include unknown labels so as to harness as much information from ClinVar as possible. A description of the class distribution is below; models were trained with classes balanced by proportion of class label.
- Features: largely taken from dbNSFP, which is an aggregation of >350 features including in-silico predictors, other high-level aggregated metrics, and low-level information about the chemistry and biological annotations at the position of interest. Feature selection is critical for our project because we seek to learn which features are most informative.
- Novel feature: in order to incorporate a measure of genomic context, we used the position of the variant to assess nearby variants. Initially a convolution was applied, but to use it successfully alongside the other features we decided to simplify and count mutations at various kb distances from the variant in question. This is added in a separate data processing step (a sketch is shown below).
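A minimal sketch of how such a context count could be computed from a variant table; the column names (`chrom`, `pos`) and the window sizes are assumptions, not the exact ones used in the data-processing notebooks.

```python
import pandas as pd

# Sketch of the genomic-context feature: for each variant, count other variants
# on the same chromosome within +/- a few kb. Column names and window sizes
# are assumptions.
def add_context_counts(df, windows_kb=(1, 10, 100)):
    df = df.sort_values(["chrom", "pos"]).reset_index(drop=True)
    for kb in windows_kb:
        half = kb * 1000
        per_chrom = []
        for _, grp in df.groupby("chrom"):
            pos = grp["pos"].to_numpy()
            left = pos.searchsorted(pos - half, side="left")
            right = pos.searchsorted(pos + half, side="right")
            # number of neighbours inside the window, excluding the variant itself
            per_chrom.append(pd.Series(right - left - 1, index=grp.index))
        df[f"n_variants_{kb}kb"] = pd.concat(per_chrom).sort_index()
    return df
```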
- Ran various models to test which would work best for our data. Selected Random Forest as it performed favorably.
- Experimented with various feature sets that would work well together. Selected feature set 5 as a trade-off between performance and easily interpretable features.
- Fixing the model (RF) and features (set 5), we fine-tuned the parameters of the model (a minimal sketch of this setup is shown below).
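A minimal sketch of the fixed setup, assuming scikit-learn; the function arguments are placeholders, and n_estimators=1000 mirrors experiments 18/19 rather than a definitive final value. The feature-importance loop reflects our interest in feature extraction as well as accuracy.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Minimal sketch: random forest on feature set 5 with balanced classes
# (as in experiments 13 onward). All arguments are placeholders.
def train_and_inspect(X_train, y_train, X_test, y_test, feature_names):
    rf = RandomForestClassifier(n_estimators=1000, class_weight="balanced",
                                random_state=0, n_jobs=-1)
    rf.fit(X_train, y_train)
    print(classification_report(y_test, rf.predict(X_test)))

    # Which feature-set-5 features drive the predictions?
    for name, importance in sorted(zip(feature_names, rf.feature_importances_),
                                   key=lambda pair: -pair[1]):
        print(f"{name}: {importance:.3f}")
    return rf
```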
- Cross validation is a work in progress.
- Unknown labels were complex to work with, as "unknown" is not quite its own class; using these examples is a work in progress.
- We are bound by the assumptions, biases, and data of ClinVar. Given the constraints we faced, we preferentially put our effort into selecting the best feature + model combination possible, using conventional training, validation, and test splits of the labeled ClinVar data. We would like to explore alternative sources of pathogenicity-labeled variants to evaluate the generalizability of our findings. For example, the Human Mendelian Genomic Database (name and link to be confirmed) or manually curated clinical findings would show how well a model trained on ClinVar performs on other datasets. The attempts in this direction were meaningful, but the strict requirement of disjoint data to avoid biases was beyond the scope of what we could finish.
- There was a preliminary effort to evaluate variants using the model from experiment 19 on unlabeled data, with the required features from set 5. A handful of variants the model predicted to be pathogenic were manually evaluated and found to be consistent with the prediction. These can be found in the supplemental notes below.
- Giving a finer resolution to the DGI (gene-drug interaction) feature if it proves to be informative.
- Better handling of the unknown label; exploration of methods such as out-of-distribution detection was not easily incorporated, though we suspect it would likely improve accuracy.
- Early experiments were run with a subset of all genomic examples, using chromosomes 2, 3, and 4. Initially, only examples labeled pathogenic (including likely-pathogenic) or benign (including likely-benign) were selected. These were at a balance of 71% pathogenic.
- Models used balanced classes from experiment 13 onward, without a large increase in accuracy.
- Later experiments allowed for multi-class classification, which included the "unknown clinical significance" class. The distributions of the classes (a sketch of how these summaries are produced follows below):
  - Binary labels (chromosomes 2-4): 1: 9863 (0.713676), 2: 3957 (0.286324); total 13820
  - Including unknown labels (chromosomes 2-4): 0: 27177 (0.662902), 1: 9863 (0.240579), 2: 3957 (0.096519); total 40997
  - Binary labels (chromosomes 1-22): 1: 47278 (0.72264), 0: 18146 (0.27736); total 65424
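A minimal sketch of how these label summaries are produced, assuming a processed pandas dataframe with a `label` column (the actual column name in the processing notebooks may differ).

```python
import pandas as pd

# Summarize label counts and proportions for a processed variant dataframe;
# the column name "label" is an assumption.
def label_summary(df: pd.DataFrame, col: str = "label") -> None:
    counts = df[col].value_counts()
    print("TOTAL SUM OF LABELS:")
    print(counts)
    print("PERCENT OF LABELS:")
    print(counts / len(df))
    print("TOTAL:", len(df))
```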
- Generate the dataframe that has the 9 features of feature set 5 for chromosomes 2, 3, 4 for pathogenic, benign, and unknown variants (data_processing/Dataprocessing_2-4u_featureset5.ipynb); pickle the dataframe →
- Use the pickled dataframe to train a model (19.5.2_4u.RF.ipynb); pickle the model → pickled_models/19.5.2_4u.RF.pkl
- Generate the dataframe that has the 9 features of feature set 5 for unlabeled data from chromosomes 1, 2, 3, 4 → store in pickled/chr1-4_nolabels.pkl
- Use the trained model on different data: chromosomes 1, 2, 3, 4 with no ClinVar labels (model_predictions/dbNSFP_chr1_4_nolabels.ipynb). A minimal sketch of this prediction step is shown below.
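A minimal sketch of that prediction step, assuming scikit-learn and the pickled artifacts named above; using `feature_names_in_` to recover the nine feature-set-5 columns and treating class 1 as "pathogenic" are assumptions about the stored artifacts, not confirmed details of the notebooks.

```python
import pickle
import pandas as pd

# Load the model pickled by 19.5.2_4u.RF.ipynb and the pickled unlabeled
# dataframe, then score variants. Feature-column recovery and the pathogenic
# class index (label 1) are assumptions.
with open("pickled_models/19.5.2_4u.RF.pkl", "rb") as f:
    model = pickle.load(f)

unlabeled = pd.read_pickle("pickled/chr1-4_nolabels.pkl")
X = unlabeled[list(model.feature_names_in_)]

proba = model.predict_proba(X)
unlabeled["pathogenic_proba"] = proba[:, list(model.classes_).index(1)]
print(unlabeled.sort_values("pathogenic_proba", ascending=False).head(10))
```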
- Using features from well-performing in-silico predictors was very challenging: the publications do not release the data used to create the models. Some features were well described, while others were quite obscure; of those available through other sources, few were ready to use and would have required significant data preprocessing.
- Integrating a pharma database (gene-drug associations) with our dataset did not perform adequately in the model. Adding more depth to the associations did not help either.
- Manually curating the training data did not help either: we selected the data points with the most complete data, and the results did not change.
- Multiple lncRNA databases exist; however, their data formats were incompatible with ours. Reliable standards have not yet emerged to the point where we can use the data.