This package allows you to easily train and hyperparameter-tune your model on a cluster of GPU-equipped VMs in GCP. It also lets you visualize the automatically generated reports.
This package was developed on an Ubuntu 20.04 machine.
First, download the Google Cloud SDK following the steps here and enable the REST API following the steps here.

Next, install this package with `pip install clouDL`.

Lastly, to configure your VM cluster, run `clouDL_create -f PATH`. This will create a folder containing the necessary configuration files in PATH.
Some typical next steps (all associated files are in the newly created folder):

- Create a new `access_token` file to clone private repos from a VM
- Update `user_startup.sh` to specify the bash script you want to execute once a VM is ready
- Update `hyperparameters.json` to search the desired hyperparameter space
- Create a `data.tar.zip` if you have training, validation, and testing data to move to the cloud
- Update `quick_start.sh` if you want to change the number of workers, the project_id for GCP, or the top N to archive

`quick_start.sh` sets some default values to get you up and running easily, but more control is available by using the `clouDL` entrypoint directly. Use `clouDL -h` for more.
Make sure to incorporate the `Manager` class into your code when training. `Manager` is essentially the interface you will interact with to enable training/hyperparameter tuning on a cluster of VMs. Remember to do the following when using `Manager`:
- Set the compare and goal keys for `Manager` using `Manager.set_compare_goal(compare, goal)`. This will allow `Manager` to compute the "best" params
- Start the epochs at the value returned by the `Manager.start_epochs()` method
- Make sure to call `Manager.finished(param_dict)` once the model is done training
- Use `Manager.save_progress(param_dict, best_param_dict)` sparingly since it is expensive; `Manager.save_progress()` can be used to track the current params and the best params
- Use `Manager.add_progress(key, value)` freely
- Use `Manager.track_model(model)` to automatically track the best params and load params when training is interrupted
Progress should be saved at the end of an epoch instead of the beginning. This is not mandatory but prevents unnecessary saving. The key "epochs" should be used and managed via `Manager.add_progress(key, value)`. If it is not used, an approximate start epoch will be calculated when resuming training, which relies on existing progress and on epochs starting at 0. A sketch of how these calls fit together is shown below.
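For concreteness, here is a minimal sketch of how these calls might fit into a training loop. The import path, the `Manager` constructor arguments, and the names `model`, `max_epochs`, `train_one_epoch`, `evaluate`, `param_dict`, and `best_param_dict` are illustrative assumptions; only the `Manager` methods themselves come from the list above.

```python
# Sketch of a training loop wired into Manager. The module path and
# Manager() constructor arguments are assumptions; train_one_epoch,
# evaluate, model, and max_epochs are hypothetical placeholders.
from clouDL.manager import Manager  # assumed module path

manager = Manager()  # constructor arguments are assumed
manager.set_compare_goal("val_loss", "min")  # compare key and goal
manager.track_model(model)  # auto-track best params, reload on resume

# Resume from the epoch Manager reports instead of always starting at 0
for epoch in range(manager.start_epochs(), max_epochs):
    train_one_epoch(model)       # hypothetical training step
    val_loss = evaluate(model)   # hypothetical validation step

    manager.add_progress("epochs", epoch)      # cheap; call freely
    manager.add_progress("val_loss", val_loss)

    # Expensive; call sparingly, at the end of an epoch
    manager.save_progress(param_dict, best_param_dict)

manager.finished(param_dict)  # signal that training is done
```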
For a complete example, visit here.
From `clouDL_create`, a `quick_start.sh` file is provided with four modes. The `new` mode does the following:
- Move your archived and compressed training data, access token (for VMs to access private repos), and hyperparameter configs to cloud storage
- Spin up a cluster of VMs, each with hardware specified by `configs.json`
- Run the training on the VMs, which manages four things: progress, results, best models, and errors
- Once finished, the VMs will shut down automatically
The `analyze` mode does the following:
- View any errors
- Plot the results grouped by hyperparameter sections
- Plot the progress for the best model in this iteration
- Archive results and maintain an overall top N models
- Plot archived data to view the top N best models
- Plot metadata to track the trend of your hyperparameter search
Edit your `hyperparameters.json` file to assign each VM a new portion of the hyperparameter grid. The available search options for a given hyperparameter are:

- Uniform random search
- Step search
- Multiple/exponential search
- Predetermined list
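As an illustration, a `hyperparameters.json` covering these options might look something like the sketch below. The exact schema (key names such as `min`, `max`, `step`, `multiple`, and `type`) is an assumption here; consult the file generated by `clouDL_create` for the real format.

```json
{
    "lr": {"min": 0.0001, "max": 0.1, "type": "uniform"},
    "hidden_units": {"min": 64, "max": 512, "step": 64},
    "batch_size": {"min": 32, "max": 256, "multiple": 2},
    "optimizer": ["sgd", "adam"]
}
```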
Then use the `resume` mode from `quick_start.sh` to update the hyperparameter JSON and spin up a new cluster.
Execute `bash quick_start.sh manual project_id bucket_name` to move everything to cloud storage without spinning up a cluster. This is helpful for running manual tests on the cloud.
An early stopping module is also provided to reduce boilerplate training code. The module can be accessed using `from clouDL.earlystop import EarlyStopping`.
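A sketch of how an early stopping helper is typically wired into a training loop follows; the constructor parameter `patience` and the methods `step` and `should_stop` are assumed names for illustration and may differ from the actual `EarlyStopping` API.

```python
from clouDL.earlystop import EarlyStopping

# patience, step(), and should_stop() are assumed names; check the
# actual EarlyStopping API before relying on this sketch.
stopper = EarlyStopping(patience=5)

for epoch in range(max_epochs):
    val_loss = evaluate(model)  # hypothetical validation step
    stopper.step(val_loss)      # record this epoch's metric
    if stopper.should_stop():   # no improvement for `patience` epochs
        break
```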