For all model parameter search methods and `cross_val_score`, you have the choice of running the jobs locally or remotely. By default, jobs are scheduled to run locally and asynchronously; this is called a LocalAsync environment.
You may also run jobs on an EC2 cluster or a Hadoop cluster, which is especially useful when you want to perform a larger-scale parameter search.
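As a point of reference, here is a minimal sketch of a purely local run, assuming `train`, `valid`, `my_model`, and `my_params` are defined as in the cluster examples below. With no `environment` argument, the job is scheduled in the default LocalAsync environment:

```python
import graphlab

# No environment argument: the job runs asynchronously
# in the default LocalAsync environment on this machine.
j = graphlab.model_parameter_search.create((train, valid),
                                           my_model, my_params)
```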
To launch a job on an EC2 cluster, you first create an EC2 environment and pass it into the `environment` argument:
```python
import graphlab

# Configure and launch a four-node EC2 cluster.
ec2config = graphlab.deploy.Ec2Config()
ec2 = graphlab.deploy.ec2_cluster.create(name='mps',
                                         s3_path='s3://bucket/path',
                                         ec2_config=ec2config,
                                         num_hosts=4)

# Run the parameter search on the cluster.
j = graphlab.model_parameter_search.create((train, valid),
                                           my_model, my_params,
                                           environment=ec2)
```
To launch jobs on a Hadoop cluster, you instead create a Hadoop environment and pass this object into the `environment` argument:
```python
# Configure a Hadoop cluster environment. Replace <path to installation>
# with the location of your Turi Distributed installation.
hd = graphlab.deploy.hadoop_cluster.create(name='hadoop-cluster',
                                           turi_dist_path=<path to installation>)

# Run the parameter search on the Hadoop cluster.
j = graphlab.model_parameter_search.create((train, valid),
                                           my_model, my_params,
                                           environment=hd)
```
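In any of these environments, `create` returns a job object that runs asynchronously. A rough sketch of checking on it, using the `j` handle from the examples above and the standard `get_status` and `get_results` methods on job objects:

```python
# Check on the asynchronous job, then fetch the search results
# once it has completed.
print(j.get_status())
results = j.get_results()
```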
For more details on creating EC2 and Hadoop environments, check out the API docs or the Deployment chapter of the user guide.
When getting started, it is useful to keep `perform_trial_run=True` to make sure you are creating your models properly.
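For instance, a sketch of a search with the trial run left on, with the other arguments following the earlier examples:

```python
# With perform_trial_run=True, a quick trial run first verifies that
# models can be created with the given parameters, so configuration
# errors surface before the full search is launched.
j = graphlab.model_parameter_search.create((train, valid),
                                           my_model, my_params,
                                           perform_trial_run=True)
```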