5. Automated Machine Learning with Scikit-Learn

In this tutorial, we show how to automatically search among different machine learning algorithms from Scikit-Learn. Automated machine learning only requires the user to link the data with a predefined problem and a run function that we provide.

Let us start by installing DeepHyper.

!pip install deephyper["popt"]
!pip install ray

5.1. Classification

In this part of the tutorial, we focus on the classification case.

Create a run function to train and evaluate the model corresponding to the configuration generated by the search. This function has to return a scalar value (typically, validation accuracy), which will be maximized by the search algorithm. For automated machine learning, we use the run function provided at deephyper.sklearn.classifier.run_autosklearn1 and wrap it with our data as follows:

from deephyper.sklearn.classifier import run_autosklearn1

def load_data():
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True)

    return X, y

def run(config):
    return run_autosklearn1(config, load_data)
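
The tutorial does not show what run_autosklearn1 does internally. As a rough mental model only, the sketch below builds a scikit-learn classifier from a configuration dictionary, trains it on one training split, and returns the validation accuracy. The names build_classifier and run_sketch, the split ratio, the fixed random seeds, and the restriction to three classifiers are all assumptions made for illustration; this is not DeepHyper's implementation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def build_classifier(config):
    """Map a configuration dictionary to an untrained scikit-learn model.

    Hypothetical helper: only three of the seven classifiers are covered.
    """
    name = config["classifier"]
    if name == "RandomForest":
        return RandomForestClassifier(
            n_estimators=config["n_estimators"],
            max_depth=config["max_depth"],
            random_state=0,
        )
    if name == "Logistic":
        return LogisticRegression(C=config["C"], max_iter=5000)
    if name == "KNeighbors":
        return KNeighborsClassifier(n_neighbors=config["n_neighbors"])
    raise ValueError(f"classifier not covered by this sketch: {name}")


def run_sketch(config):
    """Train on one split and return validation accuracy (higher is better)."""
    X, y = load_breast_cancer(return_X_y=True)
    # Assumed split: one third of the data is held out for validation.
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.33, random_state=42
    )
    model = build_classifier(config)
    model.fit(X_train, y_train)
    return model.score(X_valid, y_valid)


score = run_sketch({"classifier": "RandomForest", "n_estimators": 51, "max_depth": 48})
print(f"validation accuracy: {score:.3f}")
```

The important part is the contract: the function receives one configuration and returns one scalar that the search maximizes.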

We are ready to go! But first, let us look at the problem provided by DeepHyper in deephyper.sklearn.classifier.problem_autosklearn1 to better understand what is happening under the hood.

from deephyper.sklearn.classifier import problem_autosklearn1

Configuration space object:
    C, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    alpha, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    classifier, Type: Categorical, Choices: {RandomForest, Logistic, AdaBoost, KNeighbors, MLP, SVC, XGBoost}, Default: RandomForest
    gamma, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    kernel, Type: Categorical, Choices: {linear, poly, rbf, sigmoid}, Default: linear
    max_depth, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale
    n_estimators, Type: UniformInteger, Range: [1, 2000], Default: 45, on log-scale
    n_neighbors, Type: UniformInteger, Range: [1, 100], Default: 50
    (C | classifier == 'Logistic' || C | classifier == 'SVC')
    (gamma | kernel == 'rbf' || gamma | kernel == 'poly' || gamma | kernel == 'sigmoid')
    (n_estimators | classifier == 'RandomForest' || n_estimators | classifier == 'AdaBoost')
    alpha | classifier == 'MLP'
    kernel | classifier == 'SVC'
    max_depth | classifier == 'RandomForest'
    n_neighbors | classifier == 'KNeighbors'
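
The conditions at the end of this listing mean that a hyperparameter is only active when its parent takes a given value. The sketch below re-expresses those conditions as a plain Python function, which can help when reading the NaN entries in the results later on. The helper active_hyperparameters is hypothetical and is not part of DeepHyper or ConfigSpace.

```python
def active_hyperparameters(classifier, kernel=None):
    """Return the hyperparameters that are active for a classifier choice,
    following the conditions printed in the configuration space listing."""
    active = {"classifier"}
    if classifier in ("Logistic", "SVC"):
        active.add("C")
    if classifier in ("RandomForest", "AdaBoost"):
        active.add("n_estimators")
    if classifier == "RandomForest":
        active.add("max_depth")
    if classifier == "KNeighbors":
        active.add("n_neighbors")
    if classifier == "MLP":
        active.add("alpha")
    if classifier == "SVC":
        active.add("kernel")
        # gamma depends on kernel, which itself depends on classifier == 'SVC'
        if kernel in ("rbf", "poly", "sigmoid"):
            active.add("gamma")
    return sorted(active)


print(active_hyperparameters("RandomForest"))
print(active_hyperparameters("SVC", kernel="linear"))
print(active_hyperparameters("SVC", kernel="rbf"))
```

Any hyperparameter not returned for a given choice is inactive and shows up as NaN in the results table.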

Create an Evaluator object using the ray backend to distribute the evaluation of the run-function defined previously.

from deephyper.evaluator import Evaluator
from deephyper.evaluator.callback import TqdmCallback

evaluator = Evaluator.create(
    run,
    method="ray",
    method_kwargs={
        "address": None,
        "num_cpus": 1,
        "num_cpus_per_task": 1,
        "callbacks": [TqdmCallback()],
    },
)

print("Number of workers: ", evaluator.num_workers)
Number of workers:  1

Finally, you can define a Bayesian optimization search called CBO (for Centralized Bayesian Optimization) and link to it the defined problem_autosklearn1 and evaluator.

from deephyper.search.hps import CBO

search = CBO(problem_autosklearn1, evaluator, log_dir="exp-automl-2")
results = search.search(100)

Once the search is over, a file named results.csv is saved in the directory given by log_dir (here, exp-automl-2). The same dataframe is returned by the search.search(...) call. It contains the hyperparameter configurations evaluated during the search and their corresponding objective value (i.e., validation accuracy), together with m:timestamp_submit, the time when the evaluator submitted the configuration for evaluation, and m:timestamp_gather, the time when the evaluator received the result of that evaluation (both are relative to the creation of the Evaluator instance).

     p:classifier       p:C   p:alpha  p:kernel  p:max_depth  p:n_estimators  p:n_neighbors   p:gamma  objective  job_id  m:timestamp_submit  m:timestamp_gather
  0      Logistic  0.000986       NaN       NaN          NaN             NaN            NaN       NaN   0.893617       0            1.847845            4.197699
  1    KNeighbors       NaN       NaN       NaN          NaN             NaN           41.0       NaN   0.946809       1            4.348262            4.359891
  2  RandomForest       NaN       NaN       NaN         48.0            51.0            NaN       NaN   0.957447       2            4.588290            4.632172
  3      Logistic  0.000341       NaN       NaN          NaN             NaN            NaN       NaN   0.819149       3            4.764461            4.772226
  4           SVC  0.000063       NaN    linear          NaN             NaN            NaN       NaN   0.643617       4            4.907278            4.917221
...           ...       ...       ...       ...          ...             ...            ...       ...        ...     ...                 ...                 ...
 95           MLP       NaN  1.782067       NaN          NaN             NaN            NaN       NaN   0.989362      95           37.920646           38.037480
 96           MLP       NaN  1.742599       NaN          NaN             NaN            NaN       NaN   0.989362      96           38.265214           38.381637
 97           MLP       NaN  1.769931       NaN          NaN             NaN            NaN       NaN   0.989362      97           38.609994           38.726015
 98           MLP       NaN  2.019310       NaN          NaN             NaN            NaN       NaN   0.989362      98           39.031003           39.148036
 99           MLP       NaN  1.862691       NaN          NaN             NaN            NaN       NaN   0.989362      99           39.378585           39.494249

100 rows × 12 columns
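
Because both timestamps are measured from the creation of the Evaluator, their difference gives the time each configuration spent between submission and gathering. The sketch below computes it for the first five rows of the table above, copied by hand, using plain Python to stay dependency-free. The variable names are illustrative, and the unit is assumed to be seconds.

```python
# (job_id, classifier, timestamp_submit, timestamp_gather) copied from the
# first five rows of the results table above.
rows = [
    (0, "Logistic", 1.847845, 4.197699),
    (1, "KNeighbors", 4.348262, 4.359891),
    (2, "RandomForest", 4.588290, 4.632172),
    (3, "Logistic", 4.764461, 4.772226),
    (4, "SVC", 4.907278, 4.917221),
]

# Time between submission and gathering for each job (assumed seconds).
durations = {job_id: gather - submit for job_id, _, submit, gather in rows}

for job_id, name, _, _ in rows:
    print(f"job {job_id} ({name}): {durations[job_id]:.3f} s")

# Job with the longest submit-to-gather time among these rows.
slowest = max(durations, key=durations.get)
print("slowest job:", slowest)
```

On the returned dataframe, the same quantity can be obtained for all rows with results["m:timestamp_gather"] - results["m:timestamp_submit"].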

Now that we have the full list of results we can display the top-3.

results.nlargest(n=3, columns="objective")

     p:classifier       p:C   p:alpha  p:kernel  p:max_depth  p:n_estimators  p:n_neighbors   p:gamma  objective  job_id  m:timestamp_submit  m:timestamp_gather
 10           MLP       NaN  3.685755       NaN          NaN             NaN            NaN       NaN   0.989362      10            6.240106            6.363712
 12           MLP       NaN  3.717318       NaN          NaN             NaN            NaN       NaN   0.989362      12            7.013312            7.129871
 13           MLP       NaN  2.902145       NaN          NaN             NaN            NaN       NaN   0.989362      13            7.361486            7.477184
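
To reuse the best configuration, for example to retrain a final model, the p: prefix has to be stripped and the inactive (NaN) hyperparameters dropped. The helper below is a hypothetical sketch that works on one row converted to a dictionary (for instance with the standard pandas call results.iloc[i].to_dict()); the example row is copied from job 10 above.

```python
import math


def row_to_config(row):
    """Keep only active hyperparameters from a results row (a dict) and
    strip the 'p:' prefix used for hyperparameter columns."""
    config = {}
    for key, value in row.items():
        if not key.startswith("p:"):
            continue  # skip objective, job_id and m:* metadata columns
        if isinstance(value, float) and math.isnan(value):
            continue  # NaN marks a hyperparameter that was inactive
        config[key[2:]] = value
    return config


nan = float("nan")
best_row = {
    "p:classifier": "MLP", "p:C": nan, "p:alpha": 3.685755, "p:kernel": nan,
    "p:max_depth": nan, "p:n_estimators": nan, "p:n_neighbors": nan,
    "p:gamma": nan, "objective": 0.989362, "job_id": 10,
    "m:timestamp_submit": 6.240106, "m:timestamp_gather": 6.363712,
}
print(row_to_config(best_row))
```

Integer hyperparameters such as max_depth appear as floats (48.0) in the dataframe because their columns also contain NaN, so they may need to be cast back to int before being passed to a model.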

5.2. Regression

In this part of the tutorial, we focus on the regression case.

Create a run function to train and evaluate the model corresponding to the configuration generated by the search. This function has to return a scalar value (typically, validation \(R^2\)), which will be maximized by the search algorithm. For automated machine learning, we use the run function provided at deephyper.sklearn.regressor.run_autosklearn1 and wrap it with our data as follows:

from deephyper.sklearn.regressor import run_autosklearn1

def load_data():
    from sklearn.datasets import fetch_california_housing

    X, y = fetch_california_housing(return_X_y=True)
    return X, y

def run(config):
    return run_autosklearn1(config, load_data)

We are ready to go! But first, let us look at the problem provided by DeepHyper to better understand what is happening under the hood.

from deephyper.sklearn.regressor import problem_autosklearn1

Configuration space object:
    C, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    alpha, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    gamma, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    kernel, Type: Categorical, Choices: {linear, poly, rbf, sigmoid}, Default: linear
    max_depth, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale
    n_estimators, Type: UniformInteger, Range: [1, 2000], Default: 45, on log-scale
    n_neighbors, Type: UniformInteger, Range: [1, 100], Default: 50
    regressor, Type: Categorical, Choices: {RandomForest, Linear, AdaBoost, KNeighbors, MLP, SVR, XGBoost}, Default: RandomForest
    (gamma | kernel == 'rbf' || gamma | kernel == 'poly' || gamma | kernel == 'sigmoid')
    (n_estimators | regressor == 'RandomForest' || n_estimators | regressor == 'AdaBoost')
    C | regressor == 'SVR'
    alpha | regressor == 'MLP'
    kernel | regressor == 'SVR'
    max_depth | regressor == 'RandomForest'
    n_neighbors | regressor == 'KNeighbors'
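
Several ranges above are marked "on log-scale". A common interpretation, and the one assumed here, is that values are drawn uniformly in log space, so that each decade of the range [1e-05, 10.0] is equally likely. The sketch below illustrates that idea with the standard library only; sample_log_uniform is a hypothetical helper, not the sampling code used by DeepHyper or ConfigSpace.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(1e-5, 10.0, rng) for _ in range(10_000)]

# log10 of the range runs from -5 to 1, so 0.01 (log10 = -2) is the midpoint
# in log space and roughly half of the samples should fall below it.
below = sum(s < 0.01 for s in samples) / len(samples)
print(f"fraction below 0.01: {below:.2f}")
```

Under this interpretation, the default value 0.01 listed for C, alpha and gamma sits at the middle of the range in log space, whereas a plain uniform draw over [1e-05, 10.0] would almost never produce values that small.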

Create an Evaluator object using the ray backend to distribute the evaluation of the run-function defined previously.

from deephyper.evaluator import Evaluator
from deephyper.evaluator.callback import TqdmCallback

evaluator = Evaluator.create(
    run,
    method="ray",
    method_kwargs={
        "address": None,
        "num_cpus": 1,
        "num_cpus_per_task": 1,
        "callbacks": [TqdmCallback()],
    },
)

print("Number of workers: ", evaluator.num_workers)
Number of workers:  1

Finally, you can define a Bayesian optimization search called CBO (for Centralized Bayesian Optimization) and link to it the defined Problem and evaluator.

from deephyper.search.hps import CBO

search = CBO(problem_autosklearn1, evaluator)
results = search.search(10)

Once the search is over, a file named results.csv is saved in the current directory. The same dataframe is returned by the search.search(...) call. It contains the hyperparameter configurations evaluated during the search and their corresponding objective value (i.e., validation \(R^2\)), together with m:timestamp_submit, the time when the evaluator submitted the configuration for evaluation, and m:timestamp_gather, the time when the evaluator received the result of that evaluation (both are relative to the creation of the Evaluator instance).

      p:regressor       p:C   p:alpha  p:kernel  p:max_depth  p:n_estimators  p:n_neighbors   p:gamma       objective  job_id  m:timestamp_submit  m:timestamp_gather
  0        Linear       NaN       NaN       NaN          NaN             NaN            NaN       NaN        0.597049       0            0.135866            0.154057
  1    KNeighbors       NaN       NaN       NaN          NaN             NaN           41.0       NaN        0.666496       1            0.363004            0.664028
  2  RandomForest       NaN       NaN       NaN         48.0            51.0            NaN       NaN        0.802510       2            0.788540            3.497786
  3  RandomForest       NaN       NaN       NaN          7.0           245.0            NaN       NaN        0.719056       3            3.625416           10.249726
  4           SVR  0.000063       NaN    linear          NaN             NaN            NaN       NaN        0.322115       4           10.450442           13.565911
  5           SVR  0.000016       NaN   sigmoid          NaN             NaN            NaN  0.004180       -0.059354       5           13.690964           17.727409
  6           SVR  0.422234       NaN   sigmoid          NaN             NaN            NaN  2.779419  -321050.500503       6           17.852672           24.922506
  7  RandomForest       NaN       NaN       NaN         91.0            15.0            NaN       NaN        0.796552       7           25.120666           25.918885
  8           MLP       NaN  1.350762       NaN          NaN             NaN            NaN       NaN        0.708333       8           26.042982           28.260683
  9           MLP       NaN  0.033863       NaN          NaN             NaN            NaN       NaN        0.771833       9           28.383097           31.670986
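
The strongly negative objective of job 6 is possible because \(R^2\) has no lower bound. With the standard definition \(R^2 = 1 - \mathrm{SS}_{\mathrm{res}} / \mathrm{SS}_{\mathrm{tot}}\), a model that predicts worse than the constant mean of the targets gets a negative score. A small dependency-free sketch of that definition follows; the function name r2 and the toy numbers are illustrative.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    # Residual sum of squares: squared errors of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviations from the mean target.
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2(y_true, y_true)                     # predictions equal targets
constant = r2(y_true, [2.5, 2.5, 2.5, 2.5])      # always predicting the mean
far_off = r2(y_true, [40.0, 30.0, 20.0, 10.0])   # predictions far from targets
print(perfect, constant, far_off)
```

Since the search maximizes the objective, such configurations are simply ranked last.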

Now that we have the full list of results we can display the top-3.

results.nlargest(n=3, columns="objective")

      p:regressor       p:C   p:alpha  p:kernel  p:max_depth  p:n_estimators  p:n_neighbors   p:gamma  objective  job_id  m:timestamp_submit  m:timestamp_gather
  2  RandomForest       NaN       NaN       NaN         48.0            51.0            NaN       NaN   0.802510       2            0.788540            3.497786
  7  RandomForest       NaN       NaN       NaN         91.0            15.0            NaN       NaN   0.796552       7           25.120666           25.918885
  9           MLP       NaN  0.033863       NaN          NaN             NaN            NaN       NaN   0.771833       9           28.383097           31.670986