2. Making Predictions on the Data

Joshua Levy edited this page Jul 11, 2019 · 8 revisions

In this case, we would like to predict the age of the individuals in the study. We just trained a variational autoencoder (VAE) on the data. Now we can discard the decoder and attach a few neural network layers to the latent embeddings that are geared toward making predictions. As the model is optimized, the parameters of both the encoder and the final prediction layers are updated, each with its own learning rate.
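
To make the idea of two learning rates concrete, here is a minimal pure-Python sketch of one gradient step applied to two parameter groups. The group names, parameter values, and gradients are made up for illustration, and this is not MethylNet's implementation. The two rates simply mirror the defaults of --learning_rate_vae and --learning_rate_mlp.

```python
# Illustrative sketch only: plain gradient descent with one learning rate per
# parameter group. Names and numbers are hypothetical, not MethylNet internals.


def sgd_step(param_groups):
    """Replace each group's parameters using that group's own learning rate."""
    for group in param_groups:
        lr = group["lr"]
        group["params"] = [
            p - lr * g for p, g in zip(group["params"], group["grads"])
        ]
    return param_groups


param_groups = [
    # Pretrained encoder: small steps, so the learned embedding is only nudged.
    {"name": "encoder", "lr": 1e-5, "params": [0.50, -0.20], "grads": [1.0, -2.0]},
    # Newly added prediction layers: larger steps, so they can learn quickly.
    {"name": "mlp", "lr": 1e-3, "params": [0.10, 0.30], "grads": [1.0, -2.0]},
]

sgd_step(param_groups)
for group in param_groups:
    print(group["name"], group["params"])
```

With identical gradients, the encoder parameters move a hundred times less than the prediction-layer parameters, which is the point of fine-tuning with a smaller encoder learning rate.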

Here's the command that fine-tunes the model and makes predictions, along with its options:

methylnet-predict make_prediction -h
Usage: methylnet-predict make_prediction [OPTIONS]

  Train prediction model by fine-tuning VAE and appending/training MLP to make
  classification/regression predictions on MethylationArrays.

Options:
  -i, --train_pkl PATH            Input database for beta and phenotype data.  [default:
                                  ./train_val_test_sets/train_methyl_array.pkl]
  -tp, --test_pkl PATH            Test database for beta and phenotype data.  [default:
                                  ./train_val_test_sets/test_methyl_array.pkl]
  -vae, --input_vae_pkl PATH      Trained VAE.  [default: ./embeddings/output_model.p]
  -o, --output_dir PATH           Output directory for predictions.  [default:
                                  ./predictions/]
  -c, --cuda                      Use GPUs.
  -ic, --interest_cols TEXT       Specify columns looking to make predictions on.
                                  [default: disease]
  -cat, --categorical             Multi-class prediction.  [default: False]
  -do, --disease_only             Only look at disease, or text before
                                  subtype_delimiter.
  -hlt, --hidden_layer_topology PATH
                                  Topology of hidden layers, comma delimited, leave
                                  empty for one layer encoder, eg. 100,100 is example of
                                  5-hidden layer topology.  [default: ]
  -lr_vae, --learning_rate_vae FLOAT
                                  Learning rate VAE.  [default: 1e-05]
  -lr_mlp, --learning_rate_mlp FLOAT
                                  Learning rate MLP.  [default: 0.001]
  -wd, --weight_decay FLOAT       Weight decay of adam optimizer.  [default: 0.0001]
  -dp, --dropout_p FLOAT          Dropout Percentage.  [default: 0.2]
  -e, --n_epochs INTEGER          Number of epochs to train over.  [default: 50]
  -s, --scheduler [null|exp|warm_restarts]
                                  Type of learning rate scheduler.  [default: null]
  -d, --decay FLOAT               Learning rate scheduler decay for exp selection.
                                  [default: 0.5]
  -t, --t_max INTEGER             Number of epochs before cosine learning rate restart.
                                  [default: 10]
  -eta, --eta_min FLOAT           Minimum cosine LR.  [default: 1e-06]
  -m, --t_mult FLOAT              Multiply current restart period times this number
                                  given number of restarts.  [default: 2.0]
  -bs, --batch_size INTEGER       Batch size.  [default: 50]
  -vp, --val_pkl PATH             Validation Set Methylation Array Location.  [default:
                                  ./train_val_test_sets/val_methyl_array.pkl]
  -w, --n_workers INTEGER         Number of workers.  [default: 9]
  -v, --add_validation_set        Evaluate validation set.
  -l, --loss_reduction [sum|elementwise_mean|none]
                                  Type of reduction on loss function.  [default: sum]
  -hl, --hyperparameter_log PATH  CSV file containing prior runs.  [default:
                                  predictions/predict_hyperparameters_log.csv]
  -j, --job_name PATH             Embedding job name.  [default: predict_job]
  -sft, --add_softmax             Add softmax for predicting probability distributions.
                                  Experimental.
  -h, --help                      Show this message and exit.
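
The scheduler options above (-s, -d, -t, -eta, -m) are easier to reason about with numbers. The sketch below computes a learning rate per epoch for each scheduler choice using textbook formulas: exponential decay and cosine annealing with warm restarts. It is an assumption that MethylNet's schedulers follow exactly these formulas; the function and its argument names are hypothetical and only borrow the option names and defaults from the help text.

```python
import math


def learning_rate(epoch, base_lr, scheduler="null", decay=0.5,
                  t_max=10, eta_min=1e-6, t_mult=2.0):
    """Return the learning rate at a 0-indexed epoch under a textbook schedule."""
    if scheduler == "null":
        # One constant learning rate for the whole run.
        return base_lr
    if scheduler == "exp":
        # Multiply the learning rate by `decay` once per epoch.
        return base_lr * decay ** epoch
    if scheduler == "warm_restarts":
        # Find the restart period containing this epoch; each period is
        # `t_mult` times longer than the previous one.
        period, start = float(t_max), 0.0
        while epoch >= start + period:
            start += period
            period *= t_mult
        progress = (epoch - start) / period
        # Cosine curve from base_lr (at a restart) down toward eta_min.
        return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * progress))
    raise ValueError(f"unknown scheduler: {scheduler}")


for name in ("null", "exp", "warm_restarts"):
    rates = [learning_rate(e, 1e-3, scheduler=name) for e in (0, 5, 10)]
    print(name, rates)
```

Under these assumptions, the warm-restart schedule returns to the base learning rate at epoch 10 and then anneals over a period twice as long.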

Here's an example command for training the prediction model:

methylnet-predict make_prediction -c -ic Age -v  -j 45557456 -hl predictions/predict_hyperparameters_log.csv --learning_rate_vae 0.1 --learning_rate_mlp 0.05 --weight_decay 0.0001 --n_epochs 200 --scheduler null --batch_size 512 --dropout_p 0.5 --n_workers 4 --loss_reduction sum --hidden_layer_topology 200,300,3000 

We predict age (-ic Age) on a GPU with CUDA (-c). As before, we add the validation set (-v) so that training can be stopped before the model overfits. The VAE and the MLP have separate learning rates (--learning_rate_vae and --learning_rate_mlp). The null scheduler means that a single learning rate is used throughout training. Dropout (--dropout_p) sets the fraction of neural network nodes that are randomly ignored during training, which makes the model more generalizable; with too much dropout, though, the model can have a difficult time learning. The remaining options are, more or less, similar to those for the embedding command.

Add the -cat option if you're training on a categorical variable. Also note that if you're predicting cell-type proportions, multiple targets can be predicted simultaneously (like a multinomial logistic regression) by passing multiple -ic options, each followed by a pheno column of interest. The data returned is similar to that for the embedding model, but is stored in the predictions folder.
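
As an illustration of passing several -ic options, the sketch below assembles the command line in Python. The helper function and the cell-type column names are hypothetical; only the flags (-c, -v, -cat, -ic) come from the help text above, and your phenotype columns will have their own names.

```python
import shlex


def build_predict_command(interest_cols, categorical=False, use_gpu=True,
                          add_validation_set=True):
    """Assemble a methylnet-predict make_prediction call as an argument list."""
    args = ["methylnet-predict", "make_prediction"]
    if use_gpu:
        args.append("-c")
    if add_validation_set:
        args.append("-v")
    if categorical:
        args.append("-cat")
    # Repeat -ic once per target column.
    for col in interest_cols:
        args += ["-ic", col]
    return args


cell_type_cols = ["CD4T", "CD8T", "NK"]  # hypothetical phenotype column names
command = build_predict_command(cell_type_cols)
print(shlex.join(command))  # shlex.join needs Python 3.8+
```

The resulting list can be handed to subprocess.run, or the printed string can be pasted into a shell after adding the remaining hyperparameter options.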

Once you are satisfied with your model's performance (check the hyperparameter logs in the predictions folder), you can plot the embeddings again:

pymethyl-visualize transform_plot -i predictions/vae_mlp_methyl_arr.pkl -o visualizations/45557456_Age_mlp_embed.html -c Age -nn 8

It's easy to see that these fine-tuned embeddings reflect age much more clearly than the VAE embeddings did, but the VAE pointed us in the right direction: https://github.com/Christensen-Lab-Dartmouth/MethylNet/blob/master/methylnet_results/embeddings/age/finetune_embed.html

You can output a regression report and plot the training curves with:

methylnet-predict regression_report
methylnet-visualize plot_training_curve -thr 2e6
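
If you want to sanity-check predictions by hand, common regression metrics are simple to compute. The sketch below assumes you already have true and predicted ages as plain Python lists; the values are invented for illustration, and the metrics actually written by regression_report may differ from these two.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented example values, not results from the tutorial data.
true_age = [30.0, 40.0, 50.0, 60.0]
pred_age = [32.0, 38.0, 53.0, 59.0]

print("MAE:", mean_absolute_error(true_age, pred_age))
print("R^2:", r_squared(true_age, pred_age))
```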

You can also compare these results to other models. For instance, we can estimate age using other estimators:

pymethyl-utils est_age -a epitoc -a horvath -a hannum -ac Age  -i age/test_set/test_methyl_array.pkl

Or, if you are using the TCGA data, you can try a support vector machine model with a hyperparameter scan (TCGA_SVC.py is in ./example_scripts; please submit an issue if any of the components are not fully functional, as there could be package dependency issues):

python TCGA_SVC.py -n 24 -o disease -tr train_val_test_sets/train_methyl_array.pkl -v train_val_test_sets/val_methyl_array.pkl -tt train_val_test_sets/test_methyl_array.pkl -s &

Again, note that many of these MethylNet hyperparameters can be difficult to tune, so we provide a hyperparameter tuning framework, which is demonstrated in the next tutorial.