# Automated Machine Learning with Scikit-Learn

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deephyper/tutorials/blob/main/tutorials/colab/AutoML_with_Sklearn.ipynb)

In this tutorial, we show how to automatically search among different machine learning algorithms from [Scikit-Learn](https://scikit-learn.org/stable/). Automated machine learning only requires the user to link the data with a predefined problem and a run-function that we provide.

Let us start by installing DeepHyper.

In [1]:
!pip install deephyper["popt"]
!pip install ray



## Classification

In this part of the tutorial, we focus on classification.

Create a `run` function to train and evaluate the model corresponding to the configuration generated by the search. This function has to return a scalar value (typically, validation accuracy), which will be maximized by the search algorithm. In the case of *automated machine learning*, we use the `run` function provided at `deephyper.sklearn.classifier.run_autosklearn1` and wrap it with our data as follows:

In [1]:
from deephyper.sklearn.classifier import run_autosklearn1


def load_data():
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True)

    return X, y


def run(config):
    return run_autosklearn1(config, load_data)
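
To make the contract of a run-function concrete, here is a minimal hand-written sketch that takes a configuration dictionary and returns a validation accuracy. It is not the implementation of `run_autosklearn1`: the name `run_knn`, the 33% validation split, and the random seed are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def run_knn(config):
    """Hypothetical run-function: train one model, return validation accuracy."""
    X, y = load_breast_cancer(return_X_y=True)

    # Hold out part of the data for validation (split ratio is an assumption).
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.33, random_state=42
    )

    # The configuration is a plain dictionary of hyperparameter values.
    model = KNeighborsClassifier(n_neighbors=config["n_neighbors"])
    model.fit(X_train, y_train)

    # The returned scalar is what a search would try to maximize.
    return model.score(X_valid, y_valid)


score = run_knn({"n_neighbors": 41})
print(f"validation accuracy: {score:.3f}")
```

Any function with this shape (a dictionary in, a single number out) matches the description of a run-function given above.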

We are ready to go! But, let us look at the problem provided by DeepHyper in `deephyper.sklearn.classifier.problem_autosklearn1` to understand better what is happening under the hood.

In [2]:
from deephyper.sklearn.classifier import problem_autosklearn1

problem_autosklearn1

Configuration space object:
  Hyperparameters:
    C, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    alpha, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    classifier, Type: Categorical, Choices: {RandomForest, Logistic, AdaBoost, KNeighbors, MLP, SVC, XGBoost}, Default: RandomForest
    gamma, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    kernel, Type: Categorical, Choices: {linear, poly, rbf, sigmoid}, Default: linear
    max_depth, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale
    n_estimators, Type: UniformInteger, Range: [1, 2000], Default: 45, on log-scale
    n_neighbors, Type: UniformInteger, Range: [1, 100], Default: 50
  Conditions:
    (C | classifier == 'Logistic' || C | classifier == 'SVC')
    (gamma | kernel == 'rbf' || gamma | kernel == 'poly' || gamma | kernel == 'sigmoid')
    (n_estimators | classifier == 'RandomForest' || n_estimators | classifier == 'AdaBoost')
    a
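
The printout above mentions two ideas that are easy to illustrate in plain Python: conditional hyperparameters (a parameter is only active for some choices of another parameter) and sampling "on log-scale". The sketch below encodes only the three conditions that are visible in the truncated output, and it assumes that log-scale sampling means drawing uniformly in the logarithm of the value. The helper names are hypothetical and are not part of DeepHyper.

```python
import math
import random

# Only the three conditions visible in the printout are encoded here;
# the rest of the output is truncated, so this table is deliberately partial.
CONDITIONS = {
    "C": lambda cfg: cfg.get("classifier") in {"Logistic", "SVC"},
    "gamma": lambda cfg: cfg.get("kernel") in {"rbf", "poly", "sigmoid"},
    "n_estimators": lambda cfg: cfg.get("classifier") in {"RandomForest", "AdaBoost"},
}


def active_hyperparameters(cfg):
    """Return the conditional hyperparameters that are active for cfg."""
    return sorted(name for name, is_active in CONDITIONS.items() if is_active(cfg))


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
c_value = sample_log_uniform(1e-5, 10.0, rng)  # same range as C above
print(f"sampled C: {c_value:.6f}")

print(active_hyperparameters({"classifier": "SVC", "kernel": "rbf"}))
print(active_hyperparameters({"classifier": "RandomForest"}))
print(active_hyperparameters({"classifier": "KNeighbors"}))
```

Inactive hyperparameters are simply absent from a configuration, which is why many cells are empty in the results table of a search.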

Create an `Evaluator` object using the `ray` backend to distribute the evaluation of the run-function defined previously.

In [3]:
from deephyper.evaluator import Evaluator
from deephyper.evaluator.callback import TqdmCallback

evaluator = Evaluator.create(run, 
                 method="ray", 
                 method_kwargs={
                     "address": None, 
                     "num_cpus": 1,
                     "num_cpus_per_task": 1,
                     "callbacks": [TqdmCallback()]
                 })

print("Number of workers: ", evaluator.num_workers)

Number of workers:  1
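
As an analogy for what an evaluator does, the following sketch submits configurations to a worker pool and gathers the results, recording a submit time and a gather time for each job. It uses only the standard library and a toy objective, so it is an illustration of the submit/gather pattern under those assumptions, not DeepHyper's or Ray's implementation. The name `toy_run` and the record keys are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def toy_run(config):
    """Stand-in run-function (hypothetical): returns a scalar objective."""
    return -((config["x"] - 2.0) ** 2)


configs = [{"x": float(x)} for x in range(5)]
start = time.perf_counter()  # times below are relative to this instant
records = []

with ThreadPoolExecutor(max_workers=1) as pool:  # a single worker
    futures = {}
    for job_id, config in enumerate(configs):
        submitted = time.perf_counter() - start
        futures[pool.submit(toy_run, config)] = (job_id, config, submitted)

    # Gather each result as soon as its job finishes.
    for future in as_completed(futures):
        job_id, config, submitted = futures[future]
        records.append(
            {
                "job_id": job_id,
                **config,
                "objective": future.result(),
                "timestamp_submit": submitted,
                "timestamp_gather": time.perf_counter() - start,
            }
        )

records.sort(key=lambda record: record["job_id"])
best = max(records, key=lambda record: record["objective"])
print("best job:", best["job_id"], "with x =", best["x"])
```

Raising `max_workers` lets several jobs run at once, which is the role played by the number of workers reported by the evaluator.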


Finally, define a Bayesian optimization search called `CBO` (for Centralized Bayesian Optimization) and pass it the `problem_autosklearn1` and `evaluator` defined above.

In [4]:
from deephyper.search.hps import CBO

search = CBO(problem_autosklearn1, evaluator, log_dir="exp-automl-2")

In [5]:
results = search.search(100)

[2m[36m(pid=35708)[0m   from pandas import MultiIndex, Int64Index


  0%|          | 0/100 [00:00<?, ?it/s]

[2m[36m(run pid=35708)[0m   mode, _ = stats.mode(_y[neigh_ind, k], axis=1)


Once the search is over, a file named `results.csv` is saved in the directory given by `log_dir` (`exp-automl-2` here). The same dataframe is returned by the `search.search(...)` call. It contains the hyperparameter configurations evaluated during the search and their corresponding `objective` value (i.e., validation accuracy), along with `timestamp_submit`, the time when the evaluator submitted the configuration for evaluation, and `timestamp_gather`, the time when the evaluator received the result of that evaluation. Both timestamps are relative to the creation of the `Evaluator` instance.

In [6]:
results

Unnamed: 0,p:classifier,p:C,p:alpha,p:kernel,p:max_depth,p:n_estimators,p:n_neighbors,p:gamma,objective,job_id,m:timestamp_submit,m:timestamp_gather
0,Logistic,0.000986,,,,,,,0.893617,0,1.847845,4.197699
1,KNeighbors,,,,,,41.0,,0.946809,1,4.348262,4.359891
2,RandomForest,,,,48.0,51.0,,,0.957447,2,4.588290,4.632172
3,Logistic,0.000341,,,,,,,0.819149,3,4.764461,4.772226
4,SVC,0.000063,,linear,,,,,0.643617,4,4.907278,4.917221
...,...,...,...,...,...,...,...,...,...,...,...,...
95,MLP,,1.782067,,,,,,0.989362,95,37.920646,38.037480
96,MLP,,1.742599,,,,,,0.989362,96,38.265214,38.381637
97,MLP,,1.769931,,,,,,0.989362,97,38.609994,38.726015
98,MLP,,2.019310,,,,,,0.989362,98,39.031003,39.148036
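
The two timestamp columns can be combined to estimate how long each evaluation took. The sketch below parses the first five rows of the table above (a subset of the columns, copied by hand) with the standard `csv` module and computes the duration of each job as well as the best objective per classifier. The variable names are illustrative only.

```python
import csv
import io

# First five rows of the results table, restricted to a few columns.
RESULTS_CSV = """\
p:classifier,objective,job_id,m:timestamp_submit,m:timestamp_gather
Logistic,0.893617,0,1.847845,4.197699
KNeighbors,0.946809,1,4.348262,4.359891
RandomForest,0.957447,2,4.588290,4.632172
Logistic,0.819149,3,4.764461,4.772226
SVC,0.643617,4,4.907278,4.917221
"""

rows = list(csv.DictReader(io.StringIO(RESULTS_CSV)))

# Duration of a job = time it was gathered - time it was submitted.
durations = {
    int(row["job_id"]): float(row["m:timestamp_gather"]) - float(row["m:timestamp_submit"])
    for row in rows
}

# Best objective observed so far for each classifier.
best_per_classifier = {}
for row in rows:
    name, objective = row["p:classifier"], float(row["objective"])
    best_per_classifier[name] = max(objective, best_per_classifier.get(name, objective))

for job_id, seconds in durations.items():
    print(f"job {job_id}: {seconds:.3f} s")
print(best_per_classifier)
```

With the dataframe itself, the same durations are obtained with `results["m:timestamp_gather"] - results["m:timestamp_submit"]`.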


Now that we have the full list of results we can display the top-3.

In [7]:
results.nlargest(n=3, columns="objective")

Unnamed: 0,p:classifier,p:C,p:alpha,p:kernel,p:max_depth,p:n_estimators,p:n_neighbors,p:gamma,objective,job_id,m:timestamp_submit,m:timestamp_gather
10,MLP,,3.685755,,,,,,0.989362,10,6.240106,6.363712
12,MLP,,3.717318,,,,,,0.989362,12,7.013312,7.129871
13,MLP,,2.902145,,,,,,0.989362,13,7.361486,7.477184


## Regression

In this part of the tutorial, we focus on regression.

Create a `run` function to train and evaluate the model corresponding to the configuration generated by the search. This function has to return a scalar value (typically, validation $R^2$), which will be maximized by the search algorithm. In the case of *automated machine learning*, we use the `run` function provided at `deephyper.sklearn.regressor.run_autosklearn1` and wrap it with our data as follows:

In [8]:
from deephyper.sklearn.regressor import run_autosklearn1


def load_data():
    from sklearn.datasets import fetch_california_housing

    X, y = fetch_california_housing(return_X_y=True)
    return X, y


def run(config):
    return run_autosklearn1(config, load_data)

We are ready to go! But, let us look at the problem provided by DeepHyper to understand better what is happening under the hood. 

In [9]:
from deephyper.sklearn.regressor import problem_autosklearn1

problem_autosklearn1

Configuration space object:
  Hyperparameters:
    C, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    alpha, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    gamma, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    kernel, Type: Categorical, Choices: {linear, poly, rbf, sigmoid}, Default: linear
    max_depth, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale
    n_estimators, Type: UniformInteger, Range: [1, 2000], Default: 45, on log-scale
    n_neighbors, Type: UniformInteger, Range: [1, 100], Default: 50
    regressor, Type: Categorical, Choices: {RandomForest, Linear, AdaBoost, KNeighbors, MLP, SVR, XGBoost}, Default: RandomForest
  Conditions:
    (gamma | kernel == 'rbf' || gamma | kernel == 'poly' || gamma | kernel == 'sigmoid')
    (n_estimators | regressor == 'RandomForest' || n_estimators | regressor == 'AdaBoost')
    C | regressor == 'SVR'
    alpha | regressor == 'MLP'
    kernel | r

Create an `Evaluator` object using the `ray` backend to distribute the evaluation of the run-function defined previously.

In [10]:
from deephyper.evaluator import Evaluator
from deephyper.evaluator.callback import TqdmCallback

evaluator = Evaluator.create(run, 
                 method="ray", 
                 method_kwargs={
                     "address": None, 
                     "num_cpus": 1,
                     "num_cpus_per_task": 1,
                     "callbacks": [TqdmCallback()]
                 })

print("Number of workers: ", evaluator.num_workers)

Number of workers:  1


Finally, define a Bayesian optimization search called `CBO` (for Centralized Bayesian Optimization) and pass it the `problem_autosklearn1` and `evaluator` defined above.

In [11]:
from deephyper.search.hps import CBO

search = CBO(problem_autosklearn1, evaluator)

In [12]:
results = search.search(10)

  0%|          | 0/10 [00:00<?, ?it/s]

Once the search is over, a file named `results.csv` is saved in the current directory. The same dataframe is returned by the `search.search(...)` call. It contains the hyperparameter configurations evaluated during the search and their corresponding `objective` value (i.e., validation $R^2$), along with `timestamp_submit`, the time when the evaluator submitted the configuration for evaluation, and `timestamp_gather`, the time when the evaluator received the result of that evaluation. Both timestamps are relative to the creation of the `Evaluator` instance.

In [13]:
results

Unnamed: 0,p:regressor,p:C,p:alpha,p:kernel,p:max_depth,p:n_estimators,p:n_neighbors,p:gamma,objective,job_id,m:timestamp_submit,m:timestamp_gather
0,Linear,,,,,,,,0.597049,0,0.135866,0.154057
1,KNeighbors,,,,,,41.0,,0.666496,1,0.363004,0.664028
2,RandomForest,,,,48.0,51.0,,,0.80251,2,0.78854,3.497786
3,RandomForest,,,,7.0,245.0,,,0.719056,3,3.625416,10.249726
4,SVR,6.3e-05,,linear,,,,,0.322115,4,10.450442,13.565911
5,SVR,1.6e-05,,sigmoid,,,,0.00418,-0.059354,5,13.690964,17.727409
6,SVR,0.422234,,sigmoid,,,,2.779419,-321050.500503,6,17.852672,24.922506
7,RandomForest,,,,91.0,15.0,,,0.796552,7,25.120666,25.918885
8,MLP,,1.350762,,,,,,0.708333,8,26.042982,28.260683
9,MLP,,0.033863,,,,,,0.771833,9,28.383097,31.670986
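
One row in the table above has a very large negative objective. This is possible because $R^2$ is bounded above by 1 but has no lower bound: a model that predicts worse than the constant mean of the targets gets a negative score. The sketch below computes $R^2 = 1 - SS_{res} / SS_{tot}$ by hand on toy data to show the three regimes. The function name `r_squared` and the data are illustrative assumptions.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_true, y_true)          # predictions equal the targets
baseline = r_squared(y_true, [2.5] * 4)      # always predict the mean
far_off = r_squared(y_true, [100.0] * 4)     # predictions far from the targets

print(perfect, baseline, far_off)
```

A strongly negative value therefore signals a configuration whose predictions are far off scale, rather than an error in the search.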


Now that we have the full list of results we can display the top-3.

In [14]:
results.nlargest(n=3, columns="objective")

Unnamed: 0,p:regressor,p:C,p:alpha,p:kernel,p:max_depth,p:n_estimators,p:n_neighbors,p:gamma,objective,job_id,m:timestamp_submit,m:timestamp_gather
2,RandomForest,,,,48.0,51.0,,,0.80251,2,0.78854,3.497786
7,RandomForest,,,,91.0,15.0,,,0.796552,7,25.120666,25.918885
9,MLP,,0.033863,,,,,,0.771833,9,28.383097,31.670986
