Hyperparameter Search for Machine Learning (Advanced)

In this tutorial, we show how to treat the learning method itself as a hyperparameter in a hyperparameter search. We consider the Random Forest (RF) and Gradient Boosting (GB) classifiers from scikit-learn on the Airlines data set. Each of these methods has its own set of hyperparameters as well as some common parameters. We model the resulting search space with ConfigSpace, a Python package for expressing conditional hyperparameters and more.

Let us start by creating a DeepHyper project and a problem for our application:

Note

If you already have a DeepHyper project you do not need to create a new one for each problem.

bash
$ deephyper start-project dhproj
$ cd dhproj/dhproj/
$ deephyper new-problem hps advanced_hpo
$ cd advanced_hpo/

Create a mapping.py script that records the classification algorithms of interest (run $ touch mapping.py in the terminal, then edit the file):

advanced_hpo/mapping.py
"""Mapping of available classifiers for automl.
"""

from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
)


CLASSIFIERS = {
    "RandomForest": RandomForestClassifier,
    "GradientBoosting": GradientBoostingClassifier,
}
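
If you want to compare additional scikit-learn methods later, simply extend this dictionary. For illustration only (this is not part of the tutorial's search space), adding ExtraTreesClassifier, which shares most of its hyperparameters with RandomForestClassifier, would look like this:

python
from sklearn.ensemble import ExtraTreesClassifier

# Hypothetical extension: register a third classifier under a new key.
CLASSIFIERS["ExtraTrees"] = ExtraTreesClassifier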

Create a script to test the accuracy of the default configuration for both models:

advanced_hpo/test_default_configs.py
from dhproj.advanced_hpo.mapping import CLASSIFIERS
from deephyper.benchmark.datasets import airlines as dataset
from sklearn.utils import check_random_state

rs_clf = check_random_state(42)

rs_data = check_random_state(42)

ratio_test = 0.33
ratio_valid = (1 - ratio_test) * 0.33

train, valid, test = dataset.load_data(
    random_state=rs_data, test_size=ratio_test, valid_size=ratio_valid,
)

for clf_name, clf_class in CLASSIFIERS.items():
    print(clf_name)

    clf = clf_class(random_state=rs_clf)

    clf.fit(*train)

    acc_train = clf.score(*train)
    acc_valid = clf.score(*valid)
    acc_test = clf.score(*test)

    print(f"Accuracy on Training: {acc_train:.3f}")
    print(f"Accuracy on Validation: {acc_valid:.3f}")
    print(f"Accuracy on Testing: {acc_test:.3f}\n")

Run the script and record the training, validation, and test accuracy as follows:

bash
$ python test_default_configs.py

Running the script gives the following output:

[Out]
RandomForest
Accuracy on Training: 0.879
Accuracy on Validation: 0.621
Accuracy on Testing: 0.620

GradientBoosting
Accuracy on Training: 0.649
Accuracy on Validation: 0.648
Accuracy on Testing: 0.649

The accuracy values show that the RandomForest classifier with default hyperparameters overfits: it reaches a high accuracy on the training data but generalizes poorly to the validation and test data. In contrast, GradientBoosting shows no sign of overfitting and reaches a higher accuracy on the validation and test sets, i.e., it generalizes better than RandomForest.

Next, we run a hyperparameter search in which we seek both the right classifier and its corresponding hyperparameters, in order to improve the accuracy on the validation and test data. Create a load_data.py file that loads and returns the training and validation data:

advanced_hpo/load_data.py
import numpy as np
from sklearn.utils import resample
from deephyper.benchmark.datasets import airlines as dataset


def load_data():

    # Passing a random state here is critical: it ensures that the same data
    # split is produced every time and that the test set is never mixed with
    # the training or validation set. Prefer a dedicated RandomState instance
    # over setting a global seed.
    random_state = np.random.RandomState(seed=42)

    # Proportion of the test set on the full dataset
    ratio_test = 0.33

    # Proportion of the valid set on "dataset \ test set"
    # here we want the test and validation set to have same number of elements
    ratio_valid = (1 - ratio_test) * 0.33

    # The 3rd result is ignored with "_" because it corresponds to the test set
    # which is not interesting for us now.
    (X_train, y_train), (X_valid, y_valid), _ = dataset.load_data(
        random_state=random_state, test_size=ratio_test, valid_size=ratio_valid
    )

    X_train, y_train = resample(X_train, y_train, n_samples=int(1e4))
    X_valid, y_valid = resample(X_valid, y_valid, n_samples=int(1e4))

    print(f"X_train shape: {np.shape(X_train)}")
    print(f"y_train shape: {np.shape(y_train)}")
    print(f"X_valid shape: {np.shape(X_valid)}")
    print(f"y_valid shape: {np.shape(y_valid)}")
    return (X_train, y_train), (X_valid, y_valid)


if __name__ == "__main__":
    load_data()

Note

Subsampling with X_train, y_train = resample(X_train, y_train, n_samples=int(1e4)) can be useful if you want to speed up your search: training each model on fewer samples reduces the time of every evaluation.
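
If the classes are imbalanced, a stratified variant keeps the class proportions of the subsample close to those of the full set. As a sketch, the two resample calls inside load_data() could be replaced with the following, using the stratify argument of sklearn.utils.resample:

python
# Subsample while preserving the class distribution of the labels.
X_train, y_train = resample(
    X_train, y_train, n_samples=int(1e4), stratify=y_train, random_state=42
)
X_valid, y_valid = resample(
    X_valid, y_valid, n_samples=int(1e4), stratify=y_valid, random_state=42
)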

To test this code:

bash
$ python load_data.py

The expected output is:

[Out]
X_train shape: (10000, 7)
y_train shape: (10000,)
X_valid shape: (10000, 7)
y_valid shape: (10000,)

Create a model_run.py file defining a run function that trains and evaluates a given hyperparameter configuration. This function has to return a scalar value (typically, the validation accuracy), which the search algorithm will maximize.

advanced_hpo/model_run.py
from deephyper.problem import filter_parameters
from dhproj.advanced_hpo.mapping import CLASSIFIERS
from dhproj.advanced_hpo.load_data import load_data
from sklearn.metrics import accuracy_score
from sklearn.utils import check_random_state


def run(config: dict) -> float:
    """Run function which can be used for AutoML classification.

    Args:
        config (dict): hyperparameter configuration sampled by the search.

    Returns:
        float: validation accuracy of the trained classifier, or -1.0 if
            the training failed.
    """
    seed = 42
    config["random_state"] = check_random_state(seed)

    (X_train, y_train), (X_valid, y_valid) = load_data()

    clf_class = CLASSIFIERS[config["classifier"]]

    # keep only the hyperparameters accepted by the current classifier
    config["n_jobs"] = 4
    clf_params = filter_parameters(clf_class, config)

    try:  # good practice to manage the fail value yourself...
        clf = clf_class(**clf_params)

        clf.fit(X_train, y_train)

        fit_is_complete = True
    except Exception:
        fit_is_complete = False

    if fit_is_complete:
        y_pred = clf.predict(X_valid)
        acc = accuracy_score(y_valid, y_pred)
    else:
        acc = -1.0

    return acc
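
Before launching the search, you can sanity-check the run function locally with a hand-written configuration. The hyperparameter values below are arbitrary examples, not recommendations:

python
from dhproj.advanced_hpo.model_run import run

config = {
    "classifier": "RandomForest",
    "n_estimators": 100,
    "max_depth": 10,
    "min_samples_split": 2,
    "min_samples_leaf": 1,
    "criterion": "gini",
}
# filter_parameters() inside run() keeps only the keys accepted by the classifier
print("Validation accuracy:", run(config))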

Create problem.py to define the search space of hyperparameters for each model:

advanced_hpo/problem.py
import ConfigSpace as cs
from deephyper.problem import HpProblem


Problem = HpProblem(seed=45)

#! Default values are very important when adding conditional and forbidden clauses.
#! Otherwise the creation of the problem can fail if the default configuration
#! is not acceptable.
classifier = Problem.add_hyperparameter(
    name="classifier",
    value=["RandomForest", "GradientBoosting"],
    default_value="RandomForest",
)

# For both
Problem.add_hyperparameter(name="n_estimators", value=(1, 1000, "log-uniform"))
Problem.add_hyperparameter(name="max_depth", value=(1, 50))
Problem.add_hyperparameter(
    name="min_samples_split", value=(2, 10),
)
Problem.add_hyperparameter(name="min_samples_leaf", value=(1, 10))
criterion = Problem.add_hyperparameter(
    name="criterion",
    value=["friedman_mse", "mse", "mae", "gini", "entropy"],
    default_value="gini",
)

# GradientBoosting
loss = Problem.add_hyperparameter(name="loss", value=["deviance", "exponential"])
learning_rate = Problem.add_hyperparameter(name="learning_rate", value=(0.01, 1.0))
subsample = Problem.add_hyperparameter(name="subsample", value=(0.01, 1.0))

gradient_boosting_hp = [loss, learning_rate, subsample]
for hp_i in gradient_boosting_hp:
    Problem.add_condition(cs.EqualsCondition(hp_i, classifier, "GradientBoosting"))

forbidden_criterion_rf = cs.ForbiddenAndConjunction(
    cs.ForbiddenEqualsClause(classifier, "RandomForest"),
    cs.ForbiddenInClause(criterion, ["friedman_mse", "mse", "mae"]),
)
Problem.add_forbidden_clause(forbidden_criterion_rf)

forbidden_criterion_gb = cs.ForbiddenAndConjunction(
    cs.ForbiddenEqualsClause(classifier, "GradientBoosting"),
    cs.ForbiddenInClause(criterion, ["gini", "entropy"]),
)
Problem.add_forbidden_clause(forbidden_criterion_gb)

if __name__ == "__main__":
    print(Problem)

Run problem.py with $ python problem.py in your shell. The output will be:

[Out]
Configuration space object:
Hyperparameters:
    classifier, Type: Categorical, Choices: {RandomForest, GradientBoosting}, Default: RandomForest
    criterion, Type: Categorical, Choices: {friedman_mse, mse, mae, gini, entropy}, Default: gini
    learning_rate, Type: UniformFloat, Range: [0.01, 1.0], Default: 0.505
    loss, Type: Categorical, Choices: {deviance, exponential}, Default: deviance
    max_depth, Type: UniformInteger, Range: [1, 50], Default: 26
    min_samples_leaf, Type: UniformInteger, Range: [1, 10], Default: 6
    min_samples_split, Type: UniformInteger, Range: [2, 10], Default: 6
    n_estimators, Type: UniformInteger, Range: [1, 1000], Default: 32, on log-scale
    subsample, Type: UniformFloat, Range: [0.01, 1.0], Default: 0.505
Conditions:
    learning_rate | classifier == 'GradientBoosting'
    loss | classifier == 'GradientBoosting'
    subsample | classifier == 'GradientBoosting'
Forbidden Clauses:
    (Forbidden: classifier == 'RandomForest' && Forbidden: criterion in {'friedman_mse', 'mae', 'mse'})
    (Forbidden: classifier == 'GradientBoosting' && Forbidden: criterion in {'entropy', 'gini'})
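
To see what the conditions do in isolation, here is a minimal standalone ConfigSpace sketch (independent of DeepHyper and of the files above): a hyperparameter conditioned on classifier == 'GradientBoosting' is simply absent from any sampled configuration in which the condition does not hold.

python
import ConfigSpace as cs

space = cs.ConfigurationSpace(seed=45)
classifier = cs.CategoricalHyperparameter(
    "classifier", ["RandomForest", "GradientBoosting"], default_value="RandomForest"
)
learning_rate = cs.UniformFloatHyperparameter("learning_rate", 0.01, 1.0)
space.add_hyperparameters([classifier, learning_rate])

# learning_rate is only active when GradientBoosting is selected
space.add_condition(cs.EqualsCondition(learning_rate, classifier, "GradientBoosting"))

for config in space.sample_configuration(5):
    # configurations with classifier == "RandomForest" contain no learning_rate key
    print(config.get_dictionary())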

Run the search for 20 model evaluations using the following command line:

bash
$ deephyper hps ambs --problem dhproj.advanced_hpo.problem.Problem --run dhproj.advanced_hpo.model_run.run --max-evals 20 --evaluator ray --n-jobs 4

Once the search is over, the results.csv file contains the hyperparameter configurations evaluated during the search and their corresponding objective values (validation accuracy). Create test_best_config.py as given below. It extracts the best configuration from results.csv, retrains the corresponding model, and evaluates it on the training, validation, and test sets.

advanced_hpo/test_best_config.py
from pprint import pprint

import pandas as pd
from deephyper.benchmark.datasets import airlines as dataset
from deephyper.problem import filter_parameters
from dhproj.advanced_hpo.mapping import CLASSIFIERS
from sklearn.utils import check_random_state

df = pd.read_csv("results.csv")
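# select the row with the best objective; [:-2] drops the last two columns,
# which are not hyperparameters (the objective and the elapsed time)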
config = df.iloc[df.objective.argmax()][:-2].to_dict()
print("Best config is:")
pprint(config)

config["random_state"] = check_random_state(42)

rs_data = check_random_state(42)

ratio_test = 0.33
ratio_valid = (1 - ratio_test) * 0.33

train, valid, test = dataset.load_data(
    random_state=rs_data, test_size=ratio_test, valid_size=ratio_valid,
)

clf_class = CLASSIFIERS[config["classifier"]]
config["n_jobs"] = 4
clf_params = filter_parameters(clf_class, config)

clf = clf_class(**clf_params)

clf.fit(*train)

acc_train = clf.score(*train)
acc_valid = clf.score(*valid)
acc_test = clf.score(*test)

print(f"Accuracy on Training: {acc_train:.3f}")
print(f"Accuracy on Validation: {acc_valid:.3f}")
print(f"Accuracy on Testing: {acc_test:.3f}")

Compared with the default configurations, the best configuration found by the search improves the accuracy on both the validation and test sets:

[Out]
Accuracy on Training: 0.754
Accuracy on Validation: 0.664
Accuracy on Testing: 0.664