Turi Distributed supports several methods for model parameter search.
You specify the set of parameters to search over with a dictionary. The dictionary keys are the parameter names and the values are the candidate parameter values. Any value that is a str, int, or float is treated as a list containing a single value.
For example, specifying {"target": "y"} means that "y" will be the chosen target every time the model is fit. Some arguments are list-typed; in particular, features is a list of features to be used in the model. To search over a list-typed argument, you must provide an iterable over valid argument values. For example, {"features": [["col_a"], ["col_a", "col_b"]]} would search over the two feature sets. To use the same set of features for every model, you would specify {"features": [["col_a"]]}.
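The scalar-versus-list rule above can be pictured with a small sketch. The helper name normalize_params is hypothetical and is not part of the Turi API; it only illustrates how a str, int, or float might be wrapped in a one-element list while list-typed candidates are kept as given.

```python
def normalize_params(params):
    """Wrap scalar values (str, int, float) in a one-element list.

    Hypothetical helper for illustration only; not a Turi function.
    """
    normalized = {}
    for name, value in params.items():
        if isinstance(value, (str, int, float)):
            # A scalar means "use this same value for every model".
            normalized[name] = [value]
        else:
            # Any other iterable is taken as the list of candidate values.
            normalized[name] = list(value)
    return normalized


params = {"target": "y", "features": [["col_a"], ["col_a", "col_b"]]}
normalized = normalize_params(params)
print(normalized)
```

Under this reading, "target" has one candidate value and "features" has two, which is why a single fixed feature set has to be written as a list containing one list.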
Grid searches are especially useful when you have a relatively small set of parameters over which to search.
You may define a grid of parameters by specifying the possible values for each parameter. The method grid_search.create will then train a model for each unique combination.
The collection of all combinations of valid parameter values defines a grid of model parameters that will be considered. For example, providing the following params dictionary
params = {'target': 'label',
          'step_size': 0.3,
          'features': [['a'], ['a', 'b']],
          'max_depth': [.1, .2]}
will create the following set of combinations:
[{'target': 'label', 'step_size': 0.3, 'features': ['a'], 'max_depth': .1},
 {'target': 'label', 'step_size': 0.3, 'features': ['a'], 'max_depth': .2},
 {'target': 'label', 'step_size': 0.3, 'features': ['a', 'b'], 'max_depth': .1},
 {'target': 'label', 'step_size': 0.3, 'features': ['a', 'b'], 'max_depth': .2}]
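The expansion above is a Cartesian product over the candidate values of each parameter. The following is a minimal sketch of that idea using itertools.product; the function name expand_grid is hypothetical, and this is not how grid_search.create is necessarily implemented.

```python
import itertools


def expand_grid(params):
    """Return one dict per unique combination of candidate values.

    Hypothetical helper for illustration only; not a Turi function.
    """
    names = sorted(params)
    value_lists = []
    for name in names:
        value = params[name]
        if isinstance(value, (str, int, float)):
            # Scalars are treated as a single candidate value.
            value_lists.append([value])
        else:
            value_lists.append(list(value))
    # itertools.product yields every combination, one value per parameter.
    return [dict(zip(names, combo))
            for combo in itertools.product(*value_lists)]


params = {'target': 'label',
          'step_size': 0.3,
          'features': [['a'], ['a', 'b']],
          'max_depth': [.1, .2]}
combos = expand_grid(params)
for combo in combos:
    print(combo)
```

The grid size is the product of the number of candidates per parameter (here 1 x 1 x 2 x 2 = 4), which is why grid search suits small search spaces.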
You may not always know which areas of a search space are most promising.
In such situations, it can be useful to pick parameter combinations from random distributions.
The top-level method, model_parameter_search, currently chooses random_search.create by default.
For example, for a real-valued parameter such as step_size, you might want to draw values from an exponential distribution.
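In the following example, each parameter combination will contain
- a target value of 'label'
- a max_depth value of either 5 or 7 (chosen randomly)
- a step_size value drawn randomly from an exponential distribution shifted by 0.1. The first positional argument of scipy's expon is loc, so every draw is at least 0.1; use expon(scale=.1) if you want a mean of 0.1 instead.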
import graphlab as gl
import scipy.stats

# Load the mushroom data set and convert the label to a 0/1 target.
url = 'https://static.turi.com/datasets/xgboost/mushroom.csv'
data = gl.SFrame.read_csv(url)
data['label'] = (data['label'] == 'p')
train, valid = data.random_split(.8)

# Scalars are fixed, lists are chosen from, distributions are sampled.
params = {'target': 'label',
          'max_depth': [5, 7],
          'step_size': scipy.stats.distributions.expon(.1)}
job = gl.random_search.create((train, valid),
                              gl.boosted_trees_regression.create,
                              params)
job.get_results()
Columns:
model_id int
max_depth int
step_size float
target str
training_rmse float
validation_rmse float
Rows: 8
Data:
+----------+-----------+----------------+--------+-------------------+
| model_id | max_depth | step_size | target | training_rmse |
+----------+-----------+----------------+--------+-------------------+
| 9 | 7 | 0.742280945789 | label | 0.000562821322042 |
| 8 | 5 | 0.37544111673 | label | 0.00963600115039 |
| 1 | 5 | 0.138909527035 | label | 0.11368970605 |
| 0 | 7 | 0.977843893103 | label | 0.000269710408328 |
| 3 | 7 | 0.32559648473 | label | 0.0110626696535 |
| 2 | 5 | 0.330703633987 | label | 0.0137912720349 |
| 5 | 7 | 0.408652318249 | label | 0.00367912426229 |
| 4 | 7 | 0.295146249231 | label | 0.0162840474088 |
+----------+-----------+----------------+--------+-------------------+
+-------------------+
| validation_rmse |
+-------------------+
| 0.000790839939725 |
| 0.0123972020261 |
| 0.114722098681 |
| 0.000369491390958 |
| 0.0120762185507 |
| 0.0169411827805 |
| 0.00439583387505 |
| 0.0171414864358 |
+-------------------+
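The way each row's parameters are drawn can be sketched in plain Python. This is only an illustration under stated assumptions: sample_params and ShiftedExponential are hypothetical names, the class is a standard-library stand-in for a frozen scipy.stats distribution, and the assumption that the search samples any value exposing an rvs() method is not confirmed by the text above.

```python
import random


class ShiftedExponential:
    """Stand-in for a frozen scipy.stats distribution (exposes rvs())."""

    def __init__(self, loc, scale=1.0, seed=None):
        self.loc = loc
        self.scale = scale
        self._rng = random.Random(seed)

    def rvs(self):
        # random.expovariate takes the rate, which is 1 / scale.
        return self.loc + self._rng.expovariate(1.0 / self.scale)


def sample_params(params, rng):
    """Draw one parameter combination (hypothetical helper)."""
    combo = {}
    for name, value in params.items():
        if hasattr(value, "rvs"):
            # Distribution-like object: draw a fresh random value.
            combo[name] = value.rvs()
        elif isinstance(value, (str, int, float)):
            # Scalar: the same value is used for every model.
            combo[name] = value
        else:
            # List of candidates: pick one uniformly at random.
            combo[name] = rng.choice(list(value))
    return combo


rng = random.Random(0)
params = {'target': 'label',
          'max_depth': [5, 7],
          'step_size': ShiftedExponential(0.1, seed=0)}
samples = [sample_params(params, rng) for _ in range(8)]
for sample in samples:
    print(sample)
```

Unlike a grid, the number of models here is chosen independently of the number of candidate values, so continuous ranges can be explored without enumerating them.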
If you want full control over your parameter search, you can use the manual_search.create function. Pass in a list of parameter dictionaries; a model will be fit for each parameter set.
factory = gl.boosted_trees_classifier.create
params = [{'target': 'label', 'max_depth': 3},
          {'target': 'label', 'max_depth': 6}]
job = gl.manual_search.create((train, valid),
                              factory, params)
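Conceptually, a manual search calls the model factory once per dictionary. The sketch below shows that loop with stand-ins: fit_all and toy_factory are hypothetical names, the "model" is just a dict, and nothing here reflects how manual_search.create schedules or evaluates jobs.

```python
def fit_all(factory, datasets, param_sets):
    """Fit one model per parameter dictionary (hypothetical helper)."""
    train, _valid = datasets  # the validation set would be used for scoring
    results = []
    for model_id, params in enumerate(param_sets):
        # Each dictionary is passed to the factory as keyword arguments.
        model = factory(train, **params)
        results.append({'model_id': model_id,
                        'params': params,
                        'model': model})
    return results


def toy_factory(data, target, max_depth):
    """Stand-in for a model factory; it only records its arguments."""
    return {'target': target, 'max_depth': max_depth, 'num_rows': len(data)}


param_sets = [{'target': 'label', 'max_depth': 3},
              {'target': 'label', 'max_depth': 6}]
results = fit_all(toy_factory, ([1, 2, 3], [4]), param_sets)
for result in results:
    print(result['model_id'], result['params'])
```

Because every combination is written out explicitly, this form is useful for re-running a few hand-picked configurations or for combinations that a grid or random search cannot express.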