The goal of this project is to predict ClinVar-style clinical significance for variants that lack existing ClinVar data.
The final model presented can be found at notebooks/experiments/AIGenomics_Model_Quickstart.ipynb. For a quick start, see notebooks/QuickStart.txt, or clone and run notebooks/experiments/AIGenomics_Model_Quickstart.ipynb directly.
- ClinVar is an aggregation of various sources that provide clinical data for variants.
- In-silico predictors of pathogenicity already exist and are reasonably good (see results).
- This project is intended to help our users make informed decisions: this means we are interested in feature extraction (which features drive a prediction) as well as accurate predictions.
- Our usage would not be to select pathogenic variants, but to prioritize them. This is a critical distinction: it means a false positive is worse here than in other clinical applications, while a false negative is not as bad (we are not removing variants on the basis of this prediction).
- Calling variants results in hundreds of potentially clinically important mutations, even after filtering. We need to reduce the number of false positives, so precision is more important than recall: the need is for a small number of highly recommended mutations, and if the critical variant is not found we can easily relax the thresholds (see the sketch below).
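A minimal sketch of what this probability-based prioritization could look like with a scikit-learn classifier; the names (`clf`, `X_candidates`, `variant_ids`) and the 0.9 cutoff are illustrative assumptions, not values taken from the notebooks.

```python
import numpy as np

# Minimal sketch of probability-based prioritization: rank candidate variants
# by predicted pathogenicity and keep only the most confident calls.
# `clf`, `X_candidates`, `variant_ids`, and the 0.9 cutoff are placeholders.
def prioritize(clf, X_candidates, variant_ids, threshold=0.9):
    proba = clf.predict_proba(X_candidates)[:, 1]   # assumes column 1 = pathogenic class
    order = np.argsort(proba)[::-1]                 # most confident first
    shortlist = [(variant_ids[i], proba[i]) for i in order if proba[i] >= threshold]
    # If the critical variant is not in the shortlist, simply lower `threshold`.
    return shortlist
```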
- Experimental table of >20 experiments. Notebook naming is id.featureset.chromosomes.model (e.g., 1.1.2_4.svm):
Notebook | Input data (total # of examples) | Model (details) | Baseline (primary class = 1) | Accuracy | Precision | Recall | F1 | AUC |
---|---|---|---|---|---|---|---|---|
1.1.2_4.svm | 13767 | SVM (rbf) | 0.71 | 0.74 | 0.64 | 0.16 | 0.25 | 0.64 |
2.1.2_4.RF | 13767 | RF (n=10) | 0.71 | 0.80 | 0.64 | 0.65 | 0.64 | 0.85 |
3.1.2_4.nn | 13767 | 3 hidden layers | 0.71 | 0.73 | 0.36 | 0.5 | 0.42 | |
3.1.2_4.nn | 13767 | MLP | 0.71 (but 0.68 in test set of 3442 examples) | 0.68 | 0.0 | 0.0 | nan | |
4.2.2_4.svm | 13767 | SVM (rbf) | 0.71 | 0.74 | 0.65 | 0.21 | 0.32 | 0.62 |
5.2.2_4.RF | 13767 | RF(n=10) | 0.71 | 0.80 | 0.64 | 0.63 | 0.68 | 0.84 |
6.2.2_4.NN | 13767 | MLP | 0.71 (but 0.72 in test set of 3434 examples) | 0.72* | 0.0 | 0.0 | nan | |
7.3.2_4.svm | 13767 | SVM (rbf) | 0.71 | 0.74 | 0.68 | 0.18 | 0.28 | 0.70 |
8.3.2_4.RF | 13767 | RF(n=10) | 0.71 | 0.81 | 0.64 | 0.68 | 0.66 | 0.85 |
9.3.2_4.nn | 13767 | MLP | 0.71 (but 0.68 in test set of 3442 examples ) | 0.68 | 0.0 | 0.0 | nan | |
10.4.2_4.svm | 13767 | SVM (rbf) | 0.71 | 0.75 | 0.76 | 0.16 | 0.27 | 0.69 |
11.4.2_4.RF | 13767 | RF(n=10) | 0.71 | 0.86 | 0.75 | 0.74 | 0.74 | 0.90 |
12.4.2_4.nn | 13767 | 3 hidden layers | 0.71 | 0.72 | 0 | 0 | nan | |
13.1.2_4u.svm | 40905 | svm(rbf), balanced class | 0.66u | 0.39 | 0.80 | 0.26 | 0.39 | |
14.1.2_4u.RF | 40905 | RF(n=10) balanced class | 0.66u | 0.67 | 0.76 | 0.77 | 0.77 | |
15.4.2_4.nn | 40905 | Balanced dataloader | 0.66u | 0.62 | 0.44 | 0.68 | 0.53 | |
16.4.2_4.GB | 40905 | Gradient Boost (n=10) | 0.71 | 0.82 | 0.82 | 0.47 | 0.59 | |
17.5.2_4.svm | 13767 | svm(rbf) | 0.71 | 0.75 | 0.76 | 0.16 | 0.27 | 0.69 |
18.5.2_4.RF | 13767 | RF(n=1000) balanced | 0.71 | 0.87 | 0.79 | 0.73 | 0.76 | 0.93 |
19.5.2_4u.RF | 40905 | RF(n=1000) balanced | 0.66u | 0.73 | 0.77 | 0.86 | 0.81 | |
20.6.2_4u.RF | 40905 | RF(n=100) balanced | 0.66u | 0.74 | 0.78 | 0.88 | 0.82 | |
21.7.2_4u.RF | 40905 | RF(n=100) | 0.66u | 0.75 | 0.89 | 0.78 | 0.83 | |
22.8.5_7u.RF | 40905 | RF(n=100) | 0.66u | 0.76 | 0.77 | 0.91 | 0.84 | |
23.8.5_7uc.RF | 40905 | RF(n=100) | 0.66u | 0.75 | 0.77 | 0.91 | 0.84 | |
24.8.5_7c.RF | 13767 | RF(n=100) | 0.66u | 0.84 | 0.74 | 0.80 | 0.77 | |
25.5.2_4.RFunk | 13820, then 17808 (unk) | unk boosted RF(n=1000) | 0.71, then 0.57 | 0.88 | 0.84 | 0.76 | 0.80 | |
26.4.2_4.LR | 13820 | Logistic Regression | 0.71 | 0.71 | 0.49 | 0.76 | 0.60 | |
27.5.2_4u.KNN | 13820 | Nearest Neighbour (n=3) | 0.66u | 0.67 | 0.72 | 0.84 | 0.80 | |
28.5.5_7.RF | 13820 | RF(n=100) | 0.70 | 0.87 | 0.68 | 0.78 | 0.73 | 0.92 |
29.5.8_10.RF | 9472 | RF(n=100) | 0.69 | 0.85 | 0.70 | 0.81 | 0.75 | 0.92 |
30.5.11_14.RF | 7836 | RF(n=100) | 0.75 | 0.87 | 0.80 | 0.67 | 0.73 | 0.91 |
31.5.15_19.RF | 17361 | RF(n=100) | 0.75 | 0.88 | 0.80 | 0.69 | 0.74 | 0.92 |
32.5.20_Y.RF | 10154 | RF(n=100) | 0.78 | 0.87 | 0.78 | 0.62 | 0.69 | 0.91 |
33.5.2_4.GB | 13767 | RF(n=100) | 0.71 | 0.87 | 0.80 | 0.74 | 0.77 | 0.93 |
34.polyphen.2_4.RF | 13825 | RF(n=100) | 0.71 | 0.76 | 0.73 | 0.25 | 0.38 | 0.66 |
35.genocanyon.2_4.RF | 13825 | RF(n=100) | 0.71 | 0.71 | 0.46 | 0.27 | 0.34 | 0.57 |
36.fathmm.2_4.RF | 13825 | RF(n=100) | 0.71 | 0.79 | 0.65 | 0.54 | 0.59 | 0.79 |
37.mutationtaster.2_4.RF | 13825 | RF(n=100) | 0.71 | 0.77 | 0.68 | 0.34 | 0.45 | 0.71 |
38.metasvm.2_4.RF | 13825 | RF(n=100) | 0.71 | 0.86 | 0.76 | 0.71 | 0.74 | 0.90 |
39.7.2_4.RF | 13825 | RF(n=100) | 0.71 | 0.88 | 0.90 | 0.94 | 0.92 | 0.94 |
40.8.2_4.RF | 13825 | RF(n=100) | 0.71 | 0.89 | 0.90 | 0.94 | 0.92 | 0.94 |
41.5.2_4.RF | 13825 | RF(n=1000) | 0.71 | 0.87 | 0.80 | 0.73 | 0.76 | 0.93 |
42.5.1_22.RF | 65424 | RF(n=1000) | 0.722 | 0.81±0.1 | 0.80±0.08 | 0.5±0.11 | 0.60±0.1 | |
- Models: tested three different model architectures with increasing hyperparameter adjustments and increasing amounts of data.
- Input data:
- Labeled output: begin with binary class prediction, then expand to include unknown labels so as to harness as much information from ClinVar as possible. A description of the class distribution is below; models were trained with classes balanced by proportion of class label.
- Features: largely taken from dbNSFP, which is an aggregation of >350 features including in-silico predictors, other high-level aggregated metrics, and low-level information about the chemistry and biological annotations at the position of interest. Feature selection is critical for our project because we seek to learn which features are most informative.
- Novel feature: in order to incorporate a measure of genomic context, we used the position of the variant to assess nearby variants. Initially a convolution was applied, but to use it successfully alongside the other features we decided to simplify and count mutations at various kb distances from the variant in question. This is added in a separate data processing step (a sketch is shown below).
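A minimal sketch of how such a context count could be computed from a variant table; the column names (`chrom`, `pos`) and the window sizes are assumptions, not the exact ones used in the data-processing notebooks.

```python
import pandas as pd

# Sketch of the genomic-context feature: for each variant, count other variants
# on the same chromosome within +/- a few kb. Column names and window sizes
# are assumptions.
def add_context_counts(df, windows_kb=(1, 10, 100)):
    df = df.sort_values(["chrom", "pos"]).reset_index(drop=True)
    for kb in windows_kb:
        half = kb * 1000
        per_chrom = []
        for _, grp in df.groupby("chrom"):
            pos = grp["pos"].to_numpy()
            left = pos.searchsorted(pos - half, side="left")
            right = pos.searchsorted(pos + half, side="right")
            # number of neighbours inside the window, excluding the variant itself
            per_chrom.append(pd.Series(right - left - 1, index=grp.index))
        df[f"n_variants_{kb}kb"] = pd.concat(per_chrom).sort_index()
    return df
```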
- Ran various models to test which would work best for our data. Selected Random Forest as it performed favorably.
- Experimented with various feature sets that would work well together. Selected feature set 5 as a trade-off between performance and easily interpretable features.
- Fixing the model (RF) and features (set 5), we fine-tuned the parameters of the model (a minimal sketch of this setup is shown below).
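A minimal sketch of the fixed setup, assuming scikit-learn; the function arguments are placeholders, and n_estimators=1000 mirrors experiments 18/19 rather than a definitive final value. The feature-importance loop reflects our interest in feature extraction as well as accuracy.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Minimal sketch: random forest on feature set 5 with balanced classes
# (as in experiments 13 onward). All arguments are placeholders.
def train_and_inspect(X_train, y_train, X_test, y_test, feature_names):
    rf = RandomForestClassifier(n_estimators=1000, class_weight="balanced",
                                random_state=0, n_jobs=-1)
    rf.fit(X_train, y_train)
    print(classification_report(y_test, rf.predict(X_test)))

    # Which feature-set-5 features drive the predictions?
    for name, importance in sorted(zip(feature_names, rf.feature_importances_),
                                   key=lambda pair: -pair[1]):
        print(f"{name}: {importance:.3f}")
    return rf
```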
- Cross validation is a work in progress.
- Unknown labels were complex to work with, as "unknown" is not quite its own class; using these examples is a work in progress.
- We are bound by the assumptions, biases, and data of ClinVar. Given the constraints we faced, we preferentially put our effort into selecting the best feature + model combination possible, using conventional training, validation, and test splits of the labeled ClinVar data. We would like to explore alternative sources of pathogenicity-labeled variants to evaluate the generalizability of our findings. For example, the Human Mendelian Genomic Database (name and link to be confirmed) or manually curated clinical findings would show how well a model trained on ClinVar performs on other datasets. The attempts in this direction were meaningful, but the strict requirement of disjoint data to avoid biases was beyond the scope of what we could finish.
- There was a preliminary effort to evaluate variants using the model from experiment 19 on unlabeled data, with the required features from set 5. A handful of variants the model predicted to be pathogenic were manually evaluated and found to be consistent with the prediction. These can be found in the supplemental notes below.
- Giving a finer resolution to the DGI (gene-drug interaction) feature if it proves to be informative.
- Better handling of the unknown label; exploration of methods such as out-of-distribution detection was not easily incorporated, though we suspect it would likely improve accuracy.
- Early experiments were run with a subset of all genomic examples, using chromosomes 2, 3, and 4. Initially, only examples labeled pathogenic (including likely-pathogenic) or benign (including likely-benign) were selected. These were at a balance of 71% pathogenic.
- Models used balanced classes from experiment 13 onward, without a large increase in accuracy.
- Later experiments allowed for multi-class classification, which included the "unknown clinical significance" class. The distributions of the classes (a sketch of how these summaries are produced follows below):
  - Binary labels (chromosomes 2-4): 1: 9863 (0.713676), 2: 3957 (0.286324); total 13820
  - Including unknown labels (chromosomes 2-4): 0: 27177 (0.662902), 1: 9863 (0.240579), 2: 3957 (0.096519); total 40997
  - Binary labels (chromosomes 1-22): 1: 47278 (0.72264), 0: 18146 (0.27736); total 65424
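A minimal sketch of how these label summaries are produced, assuming a processed pandas dataframe with a `label` column (the actual column name in the processing notebooks may differ).

```python
import pandas as pd

# Summarize label counts and proportions for a processed variant dataframe;
# the column name "label" is an assumption.
def label_summary(df: pd.DataFrame, col: str = "label") -> None:
    counts = df[col].value_counts()
    print("TOTAL SUM OF LABELS:")
    print(counts)
    print("PERCENT OF LABELS:")
    print(counts / len(df))
    print("TOTAL:", len(df))
```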
- Generate the dataframe that has the 9 features of feature set 5 for chromosomes 2, 3, 4 for pathogenic, benign, and unknown variants (data_processing/Dataprocessing_2-4u_featureset5.ipynb); pickle the dataframe →
- Use the pickled dataframe to train a model (19.5.2_4u.RF.ipynb); pickle the model → pickled_models/19.5.2_4u.RF.pkl
- Generate the dataframe that has the 9 features of feature set 5 for unlabeled data from chromosomes 1, 2, 3, 4 → store in pickled/chr1-4_nolabels.pkl
- Use the trained model on different data: chromosomes 1, 2, 3, 4 with no ClinVar labels (model_predictions/dbNSFP_chr1_4_nolabels.ipynb). A minimal sketch of this prediction step is shown below.
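A minimal sketch of that prediction step, assuming scikit-learn and the pickled artifacts named above; using `feature_names_in_` to recover the nine feature-set-5 columns and treating class 1 as "pathogenic" are assumptions about the stored artifacts, not confirmed details of the notebooks.

```python
import pickle
import pandas as pd

# Load the model pickled by 19.5.2_4u.RF.ipynb and the pickled unlabeled
# dataframe, then score variants. Feature-column recovery and the pathogenic
# class index (label 1) are assumptions.
with open("pickled_models/19.5.2_4u.RF.pkl", "rb") as f:
    model = pickle.load(f)

unlabeled = pd.read_pickle("pickled/chr1-4_nolabels.pkl")
X = unlabeled[list(model.feature_names_in_)]

proba = model.predict_proba(X)
unlabeled["pathogenic_proba"] = proba[:, list(model.classes_).index(1)]
print(unlabeled.sort_values("pathogenic_proba", ascending=False).head(10))
```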
- Using features from well-performing in-silico predictors was very challenging: the publications do not release the data used to create the models. Some features were well described, while others were quite obscure; of those available through other sources, few were ready to use and would have required significant data preprocessing.
- Integrating a pharma database (gene-drug associations) with our dataset did not perform adequately in the model. Adding more depth to the associations did not help either.
- Manually curating the training data did not help either: we selected the data points with the most complete data, and the results did not change.
- Multiple lncRNA databases exist; however, their data formats were incompatible with ours. Reliable standards have not yet emerged to the point where we can use the data.