2. Making Predictions on the Data
In this case, we would like to predict the age of the individuals in the study. We just trained an autoencoder on the data; now we can discard the decoder and append a few neural network layers (an MLP) to our latent embeddings that are geared towards making predictions. As we optimize the model, the parameters of both the encoder and the final prediction layers are updated, each with its own learning rate.
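Conceptually, the two-learning-rate fine-tuning looks something like the following PyTorch sketch. This is a minimal illustration, not MethylNet's actual code; the encoder, MLP, and layer sizes here are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the trained VAE encoder and the new MLP head;
# MethylNet's real classes differ -- this only illustrates the idea.
n_cpgs, latent_dim = 1000, 100  # illustrative sizes
encoder = nn.Sequential(nn.Linear(n_cpgs, latent_dim), nn.ReLU())  # pretrained, kept
mlp_head = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                         nn.Dropout(p=0.2), nn.Linear(32, 1))      # new, appended

# One optimizer, two parameter groups: the pretrained encoder is fine-tuned
# with a small learning rate (cf. -lr_vae), while the freshly initialized
# MLP trains with a larger one (cf. -lr_mlp).
optimizer = torch.optim.Adam(
    [{"params": encoder.parameters(), "lr": 1e-5},
     {"params": mlp_head.parameters(), "lr": 1e-3}],
    weight_decay=1e-4,
)
```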
Here's the command to accomplish this:
methylnet-predict make_prediction -h
Usage: methylnet-predict make_prediction [OPTIONS]
Train prediction model by fine-tuning VAE and appending/training MLP to make
classification/regression predictions on MethylationArrays.
Options:
-i, --train_pkl PATH Input database for beta and phenotype data. [default:
./train_val_test_sets/train_methyl_array.pkl]
-tp, --test_pkl PATH Test database for beta and phenotype data. [default:
./train_val_test_sets/test_methyl_array.pkl]
-vae, --input_vae_pkl PATH Trained VAE. [default: ./embeddings/output_model.p]
-o, --output_dir PATH Output directory for predictions. [default:
./predictions/]
-c, --cuda Use GPUs.
-ic, --interest_cols TEXT Specify columns looking to make predictions on.
[default: disease]
-cat, --categorical Multi-class prediction. [default: False]
-do, --disease_only Only look at disease, or text before
subtype_delimiter.
-hlt, --hidden_layer_topology PATH
Topology of hidden layers, comma delimited, leave
empty for one layer encoder, eg. 100,100 is example of
a 2-hidden-layer topology. [default: ]
-lr_vae, --learning_rate_vae FLOAT
Learning rate VAE. [default: 1e-05]
-lr_mlp, --learning_rate_mlp FLOAT
Learning rate MLP. [default: 0.001]
-wd, --weight_decay FLOAT Weight decay of adam optimizer. [default: 0.0001]
-dp, --dropout_p FLOAT Dropout Percentage. [default: 0.2]
-e, --n_epochs INTEGER Number of epochs to train over. [default: 50]
-s, --scheduler [null|exp|warm_restarts]
Type of learning rate scheduler. [default: null]
-d, --decay FLOAT Learning rate scheduler decay for exp selection.
[default: 0.5]
-t, --t_max INTEGER Number of epochs before cosine learning rate restart.
[default: 10]
-eta, --eta_min FLOAT Minimum cosine LR. [default: 1e-06]
-m, --t_mult FLOAT Multiply current restart period times this number
given number of restarts. [default: 2.0]
-bs, --batch_size INTEGER Batch size. [default: 50]
-vp, --val_pkl PATH Validation Set Methylation Array Location. [default:
./train_val_test_sets/val_methyl_array.pkl]
-w, --n_workers INTEGER Number of workers. [default: 9]
-v, --add_validation_set Evaluate validation set.
-l, --loss_reduction [sum|elementwise_mean|none]
Type of reduction on loss function. [default: sum]
-hl, --hyperparameter_log PATH CSV file containing prior runs. [default:
predictions/predict_hyperparameters_log.csv]
-j, --job_name PATH Embedding job name. [default: predict_job]
-sft, --add_softmax Add softmax for predicting probability distributions.
Experimental.
-h, --help Show this message and exit.
Here's an example command for training the prediction model:
methylnet-predict make_prediction -c -ic Age -v -j 45557456 -hl predictions/predict_hyperparameters_log.csv --learning_rate_vae 0.1 --learning_rate_mlp 0.05 --weight_decay 0.0001 --n_epochs 200 --scheduler null --batch_size 512 --dropout_p 0.5 --n_workers 4 --loss_reduction sum --hidden_layer_topology 200,300,3000
We make a prediction on age (-ic Age) using the GPU via CUDA (-c), adding the validation set as before so training can stop before the model overfits. There are two separate learning rates, one for the VAE and one for the MLP. The null scheduler simply means that a single learning rate is used throughout training. Dropout (--dropout_p) specifies the fraction of neural network nodes the model ignores at each training step, which makes the model more generalizable, though too much of it can make the model difficult to train. The other options are largely analogous to the embedding ones. Add the -cat option if you're training on a categorical variable. Note also that if you're predicting cell-type proportions, multiple targets can be predicted simultaneously (akin to a multinomial logistic regression) by supplying multiple -ic options, each followed by a pheno column of interest. The data returned is similar to that of the embedding model, but is stored in the predictions folder.
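To make the --hidden_layer_topology and --dropout_p options concrete, here is a hedged sketch of the kind of MLP head a topology of 200,300,3000 might correspond to. The layer names and the latent size are illustrative assumptions, not MethylNet's actual construction code.

```python
import torch.nn as nn

latent_dim = 100   # size of the VAE latent embedding (assumed)
dropout_p = 0.5    # fraction of units randomly zeroed each step (--dropout_p)

# --hidden_layer_topology 200,300,3000 parsed into hidden-layer widths:
topology = [200, 300, 3000]

layers, in_dim = [], latent_dim
for width in topology:
    layers += [nn.Linear(in_dim, width), nn.ReLU(), nn.Dropout(p=dropout_p)]
    in_dim = width
layers.append(nn.Linear(in_dim, 1))  # single regression output, e.g. Age
mlp_head = nn.Sequential(*layers)
```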
Once you are satisfied with your model's performance (check the hyperparameter logs in the predictions folder), you can again plot the data as follows:
pymethyl-visualize transform_plot -i predictions/vae_mlp_methyl_arr.pkl -o visualizations/45557456_Age_mlp_embed.html -c Age -nn 8
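Under the hood, this kind of plot boils down to reducing the fine-tuned embeddings to a few dimensions and coloring the points by the phenotype column. Here is a rough, hypothetical sketch of that step; it assumes the pickle holds 'beta' and 'pheno' DataFrames, which may not match the actual MethylationArray on-disk format, so verify before relying on it.

```python
import pickle
import pandas as pd
import plotly.express as px
from sklearn.manifold import TSNE

# Assumption: the pickle stores embeddings and phenotype data as DataFrames
# keyed 'beta' and 'pheno'; check your MethylationArray format first.
with open("predictions/vae_mlp_methyl_arr.pkl", "rb") as f:
    arr = pickle.load(f)

# Reduce the fine-tuned embeddings to 2D and color by Age.
coords = TSNE(n_components=2).fit_transform(arr["beta"].values)
df = pd.DataFrame(coords, columns=["x", "y"])
df["Age"] = arr["pheno"]["Age"].values

px.scatter(df, x="x", y="y", color="Age").write_html(
    "visualizations/45557456_Age_mlp_embed.html")
```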
It's easy to see that these fine-tuned embeddings capture age much better than the raw VAE embeddings, but the VAE got us pointed in the right direction: https://github.com/Christensen-Lab-Dartmouth/MethylNet/blob/master/methylnet_results/embeddings/age/finetune_embed.html
You can plot the training curves and output a regression report with:
methylnet-predict regression_report
methylnet-visualize plot_training_curve -thr 2e6
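If you want to compute your own quick regression report from the predictions, something like this sklearn snippet works. The CSV path and column names here are assumptions for illustration; the actual regression_report output lives under the predictions folder.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical results file: adjust the path and columns to match your run.
results = pd.read_csv("predictions/results.csv")
y_true, y_pred = results["Age"], results["Age_pred"]

print(f"MAE: {mean_absolute_error(y_true, y_pred):.2f} years")
print(f"R^2: {r2_score(y_true, y_pred):.3f}")
```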
You can also compare these results to other models; for instance, we can estimate age using other established estimators:
pymethyl-utils est_age -a epitoc -a horvath -a hannum -ac Age -i age/test_set/test_methyl_array.pkl
Or, if using the TCGA data, try a support vector machine model with a hyperparameter scan (TCGA_SVC.py is in ./example_scripts; please submit an issue if any of the components are not fully functional, as there could be package dependency issues):
python TCGA_SVC.py -n 24 -o disease -tr train_val_test_sets/train_methyl_array.pkl -v train_val_test_sets/val_methyl_array.pkl -tt train_val_test_sets/test_methyl_array.pkl -s &
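If TCGA_SVC.py gives you trouble, the core of such a baseline is just an sklearn SVC wrapped in a hyperparameter scan. A minimal sketch follows; the feature/label extraction from the pickle is an assumption, not MethylNet's actual loading code.

```python
import pickle
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Assumption: the train pickle exposes 'beta' (features) and 'pheno' (labels);
# verify against your MethylationArray serialization.
with open("train_val_test_sets/train_methyl_array.pkl", "rb") as f:
    train = pickle.load(f)
X, y = train["beta"].values, train["pheno"]["disease"].values

# Small illustrative grid; TCGA_SVC.py's actual scan may differ.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
search = GridSearchCV(SVC(), param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```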
Again, note that many of these MethylNet hyperparameters can be difficult to tune, so we provide a hyperparameter tuning framework, which will be demonstrated in the next tutorial.