This sample package helps you run scikit-learn's GridSearchCV and RandomizedSearchCV, and scikit-optimize's BayesSearchCV on Google Container Engine.
The workflow is designed to stay entirely within a Jupyter notebook, with the necessary boilerplate code abstracted away in the helpers. Below we highlight some key steps of the workflow. If you are ready to get started, skip ahead to Requirements.
For instance, to use Google Cloud Container Builder to build the Docker image that will carry out the fitting tasks on a cluster:
from helpers.cloudbuild_helper import build
build(project_id, source_dir, bucket_name, image_name)
To create a cluster on Google Kubernetes Engine:
from helpers.gke_helper import create_cluster
create_cluster(project_id, zone, cluster_id, n_nodes=1, machine_type='n1-standard-64')
To make better use of resources, choose as few nodes as possible for a given total number of cores. For example, 1 node with 64 cores is better than 2 nodes with 32 cores each: in the latter case the fitting task must be deployed as two (or more) separate jobs, and a node that finishes its job first is likely to sit idle while the others are still running.
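The node-count guideline above can be sketched as a small helper. This is an illustration only and not part of the package's helpers: `fewest_nodes` is a hypothetical name, and the list of per-machine core counts is an assumption you would replace with the machine sizes actually available in your zone.

```python
def fewest_nodes(total_cores, cores_per_machine_options):
    """Pick the layout with the fewest nodes that gives exactly total_cores.

    Hypothetical helper: cores_per_machine_options is an assumed list of
    core counts per machine (for example 16, 32, 64).
    Returns a tuple (n_nodes, cores_per_node).
    """
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")

    # Keep only machine sizes that divide the target evenly, so the
    # total number of cores stays the same across candidate layouts.
    candidates = [c for c in cores_per_machine_options
                  if c > 0 and total_cores % c == 0]
    if not candidates:
        raise ValueError("no machine size divides total_cores evenly")

    # The largest machine size yields the smallest number of nodes.
    cores_per_node = max(candidates)
    return total_cores // cores_per_node, cores_per_node


if __name__ == "__main__":
    print(fewest_nodes(64, [16, 32, 64]))
```

The resulting node count could then be passed as `n_nodes` to `create_cluster`, along with a machine type that has the chosen number of cores.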
Once the Docker image is built and a cluster started, you can create a SearchCV object in the notebook:
from sklearn.ensemble import RandomForestClassifier
from skopt import BayesSearchCV
from skopt.space import Integer, Real
rfc = RandomForestClassifier(n_jobs=-1)
search_spaces = {
    'max_features': Real(0.5, 1.0),
    'n_estimators': Integer(10, 200),
    'max_depth': Integer(5, 45),
    'min_samples_split': Real(0.01, 0.1)
}
search = BayesSearchCV(estimator=rfc, search_spaces=search_spaces, n_jobs=-1, verbose=3, n_iter=100)
Calling search.fit would fit the SearchCV object on the local machine. To fit the object on the cluster, we first wrap it in a helper object:
from gke_parallel import GKEParallel
gke_search = GKEParallel(search, project_id, zone, cluster_id, bucket_name, image_name)
Now we can fit the object on the cluster:
gke_search.fit(X_train, y_train)
To check whether the fitting tasks have completed:
gke_search.done()
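If you prefer not to check by hand, one option is a simple polling loop. The sketch below is not part of the package: `wait_until_done` is a hypothetical helper, and it assumes only that the object exposes a `done()` method returning a truthy value once the fitting tasks have completed. The default polling interval and timeout are arbitrary.

```python
import time


def wait_until_done(search, poll_seconds=30.0, timeout_seconds=3600.0):
    """Poll search.done() until it is truthy or the timeout elapses.

    Hypothetical helper: assumes search.done() returns a truthy value
    when all fitting tasks have completed.
    Returns True if the search finished, False if the timeout was hit.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if search.done():
            return True
        if time.monotonic() >= deadline:
            return False
        # Sleep between checks to avoid polling in a tight loop.
        time.sleep(poll_seconds)
```

In the notebook this might be called as `wait_until_done(gke_search, poll_seconds=60)` before moving on to prediction.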
Once fitting is complete, the helper object can be used for some of the tasks supported by the original SearchCV object, for example prediction:
y_predicted = gke_search.predict(X_test)
You will need a Google Cloud Platform project which has the following products enabled:
In addition, to follow the steps of the sample, we recommend working in a Jupyter notebook running Python v2.7.10 or newer.
- Install Google Cloud Platform SDK.
- Install kubectl.
- Run git clone https://github.com/GoogleCloudPlatform/ml-on-gcp.git
- Run cd ml-on-gcp/sklearn/hpsearch
- Run pip install -r requirements.txt
Now you are ready to follow the steps in one of the notebooks to go through the workflow: