Hyperparameter Search for Machine Learning (Advanced)

In this tutorial, we will show how to treat a learning method itself as a hyperparameter in a hyperparameter search. We will consider the Random Forest (RF) and Gradient Boosting (GB) classifiers from scikit-learn on the Airlines data set. Each of these methods has its own set of hyperparameters as well as some common parameters. We model them with ConfigSpace, a Python package for expressing conditional hyperparameters and more.
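
Before diving into the DeepHyper problem definition, here is a minimal, standalone ConfigSpace sketch (independent of the rest of the tutorial) of the core idea: a categorical "classifier" choice with a hyperparameter that is only active for one of the choices. The calls shown follow the ConfigSpace API used at the time of writing; treat them as an illustration rather than part of the tutorial files.

python
import ConfigSpace as cs
import ConfigSpace.hyperparameters as csh

space = cs.ConfigurationSpace(seed=42)

classifier = csh.CategoricalHyperparameter(
    "classifier", choices=["RandomForest", "GradientBoosting"]
)
learning_rate = csh.UniformFloatHyperparameter("learning_rate", 0.01, 1.0)
space.add_hyperparameters([classifier, learning_rate])

# learning_rate is only sampled when classifier == "GradientBoosting"
space.add_condition(cs.EqualsCondition(learning_rate, classifier, "GradientBoosting"))

print(space.sample_configuration())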

Let us start by creating a DeepHyper project and a problem for our application:

Note

If you already have a DeepHyper project you do not need to create a new one for each problem.

bash
$ deephyper start-project dhproj
$ cd dhproj/dhproj/
$ deephyper new-problem hps advanced_hpo
$ cd advanced_hpo/

Create a mapping.py script in which you will record the classification algorithms of interest (run $ touch mapping.py in the terminal, then edit the file):

advanced_hpo/mapping.py
 1"""Mapping of available classifiers for automl.
 2"""
 3
 4from sklearn.ensemble import (
 5    RandomForestClassifier,
 6    GradientBoostingClassifier,
 7)
 8
 9
10CLASSIFIERS = {
11    "RandomForest": RandomForestClassifier,
12    "GradientBoosting": GradientBoostingClassifier,
13}

Create a script to test the accuracy of the default configuration of both models:

advanced_hpo/test_default_configs.py
from dhproj.advanced_hpo.mapping import CLASSIFIERS
from deephyper.benchmark.datasets import airlines as dataset
from sklearn.utils import check_random_state

rs_clf = check_random_state(42)

rs_data = check_random_state(42)

ratio_test = 0.33
ratio_valid = (1 - ratio_test) * 0.33

train, valid, test = dataset.load_data(
    random_state=rs_data,
    test_size=ratio_test,
    valid_size=ratio_valid,
    categoricals_to_integers=True,
)

for clf_name, clf_class in CLASSIFIERS.items():
    print(clf_name)

    clf = clf_class(random_state=rs_clf)

    clf.fit(*train)

    acc_train = clf.score(*train)
    acc_valid = clf.score(*valid)
    acc_test = clf.score(*test)

    print(f"Accuracy on Training: {acc_train:.3f}")
    print(f"Accuracy on Validation: {acc_valid:.3f}")
    print(f"Accuracy on Testing: {acc_test:.3f}\n")

Run the script and record the training, validation, and test accuracy as follows:

bash
$ python test_default_configs.py

Running the script will give the following output:

[Out]
RandomForest
Accuracy on Training: 0.879
Accuracy on Validation: 0.621
Accuracy on Testing: 0.620

GradientBoosting
Accuracy on Training: 0.649
Accuracy on Validation: 0.648
Accuracy on Testing: 0.649

The accuracy values show that the RandomForest classifier with default hyperparameters overfits and therefore generalizes poorly (high accuracy on the training data but much lower accuracy on the validation and test data). On the contrary, GradientBoosting does not show any sign of overfitting and achieves better accuracy on the validation and test sets, i.e., better generalization than RandomForest.

Next, we optimize the hyperparameters: we seek to find the right classifier and its corresponding hyperparameters in order to improve the accuracy on the validation and test data. Create a load_data.py file to load and return the training and validation data:

advanced_hpo/load_data.py
import numpy as np
from sklearn.utils import resample
from deephyper.benchmark.datasets import airlines as dataset


def load_data():

    # In this case passing a random state is critical to make sure
    # that the same data are loaded all the time and that the test set
    # is not mixed with either the training or validation set.
    # It is also important to avoid setting a global seed, for safety reasons.
    random_state = np.random.RandomState(seed=42)

    # Proportion of the test set on the full dataset
    ratio_test = 0.33

    # Proportion of the valid set on "dataset \ test set"
    # here we want the test and validation sets to have the same number of elements
    ratio_valid = (1 - ratio_test) * 0.33

    # The remaining results are ignored with "_" because they correspond to the
    # test set, which is not interesting for us now.
    (X_train, y_train), (X_valid, y_valid), _, _ = dataset.load_data(
        random_state=random_state, test_size=ratio_test, valid_size=ratio_valid
    )

    X_train, y_train = resample(X_train, y_train, n_samples=int(1e4))
    X_valid, y_valid = resample(X_valid, y_valid, n_samples=int(1e4))

    print(f"X_train shape: {np.shape(X_train)}")
    print(f"y_train shape: {np.shape(y_train)}")
    print(f"X_valid shape: {np.shape(X_valid)}")
    print(f"y_valid shape: {np.shape(y_valid)}")
    return (X_train, y_train), (X_valid, y_valid)


if __name__ == "__main__":
    load_data()

Note

Subsampling with X_train, y_train = resample(X_train, y_train, n_samples=int(1e4)) can be useful if you want to speed up your search: with fewer training samples, each model evaluation is faster (a stratified variant is sketched below).
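
If the class distribution is imbalanced, random subsampling can distort it. A possible variant of the two resample calls in load_data.py is sketched below; the replace, stratify, and random_state arguments are standard options of sklearn.utils.resample, but this variant is an assumption of this note rather than part of the original tutorial.

python
from sklearn.utils import resample

# Inside load_data(), as a drop-in replacement for the two resample calls:
# sample without replacement, preserve class proportions, and reuse the
# local random_state for reproducibility.
X_train, y_train = resample(
    X_train,
    y_train,
    n_samples=int(1e4),
    replace=False,
    stratify=y_train,
    random_state=random_state,
)
X_valid, y_valid = resample(
    X_valid,
    y_valid,
    n_samples=int(1e4),
    replace=False,
    stratify=y_valid,
    random_state=random_state,
)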

To test this code:

bash
$ python load_data.py

The expected output is:

[Out]
X_train shape: (10000, 7)
y_train shape: (10000,)
X_valid shape: (10000, 7)
y_valid shape: (10000,)

Create a model_run.py file to train and evaluate a given hyperparameter configuration. Its run function has to return a scalar value (here, the validation accuracy), which will be maximized by the search algorithm.

advanced_hpo/model_run.py
from deephyper.problem import filter_parameters
from dhproj.advanced_hpo.mapping import CLASSIFIERS
from dhproj.advanced_hpo.load_data import load_data
from sklearn.metrics import accuracy_score
from sklearn.utils import check_random_state


def run(config: dict) -> float:
    """Run function which can be used for AutoML classification.

    Args:
        config (dict): hyperparameter configuration sampled by the search.

    Returns:
        float: accuracy on the validation set (-1.0 if the fit failed).
    """
    seed = 42
    config["random_state"] = check_random_state(seed)

    (X_train, y_train), (X_valid, y_valid) = load_data()

    clf_class = CLASSIFIERS[config["classifier"]]

    # keep only the parameters that are valid for the current classifier
    config["n_jobs"] = 4
    clf_params = filter_parameters(clf_class, config)

    try:  # good practice to manage the fail value yourself...
        clf = clf_class(**clf_params)

        clf.fit(X_train, y_train)

        fit_is_complete = True
    except Exception:
        fit_is_complete = False

    if fit_is_complete:
        y_pred = clf.predict(X_valid)
        acc = accuracy_score(y_valid, y_pred)
    else:
        acc = -1.0

    return acc
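
Before launching a full search, it can be convenient to check that the run function executes end to end. The snippet below is a hypothetical smoke test (not part of the tutorial files): the configuration dictionary is written by hand to mimic what the search would pass to run.

python
from dhproj.advanced_hpo.model_run import run

# Hand-written configuration mimicking a point sampled by the search.
config = {
    "classifier": "RandomForest",
    "n_estimators": 10,
    "max_depth": 10,
    "min_samples_split": 2,
    "min_samples_leaf": 1,
    "criterion": "gini",
}

objective = run(config)
print(f"Validation accuracy: {objective:.3f}")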

Create a problem.py file to define the search space of hyperparameters for each model:

advanced_hpo/problem.py
import ConfigSpace as cs
from deephyper.problem import HpProblem


Problem = HpProblem(seed=45)

#! Default values are very important when adding conditional and forbidden clauses.
#! Otherwise the creation of the problem can fail if the default configuration is
#! not acceptable.
classifier = Problem.add_hyperparameter(
    name="classifier",
    value=["RandomForest", "GradientBoosting"],
    default_value="RandomForest",
)

# Common to both classifiers
Problem.add_hyperparameter(name="n_estimators", value=(1, 1000, "log-uniform"))
Problem.add_hyperparameter(name="max_depth", value=(1, 50))
Problem.add_hyperparameter(name="min_samples_split", value=(2, 10))
Problem.add_hyperparameter(name="min_samples_leaf", value=(1, 10))
criterion = Problem.add_hyperparameter(
    name="criterion",
    value=["friedman_mse", "mse", "mae", "gini", "entropy"],
    default_value="gini",
)

# GradientBoosting only
loss = Problem.add_hyperparameter(name="loss", value=["deviance", "exponential"])
learning_rate = Problem.add_hyperparameter(name="learning_rate", value=(0.01, 1.0))
subsample = Problem.add_hyperparameter(name="subsample", value=(0.01, 1.0))

gradient_boosting_hp = [loss, learning_rate, subsample]
for hp_i in gradient_boosting_hp:
    Problem.add_condition(cs.EqualsCondition(hp_i, classifier, "GradientBoosting"))

forbidden_criterion_rf = cs.ForbiddenAndConjunction(
    cs.ForbiddenEqualsClause(classifier, "RandomForest"),
    cs.ForbiddenInClause(criterion, ["friedman_mse", "mse", "mae"]),
)
Problem.add_forbidden_clause(forbidden_criterion_rf)

forbidden_criterion_gb = cs.ForbiddenAndConjunction(
    cs.ForbiddenEqualsClause(classifier, "GradientBoosting"),
    cs.ForbiddenInClause(criterion, ["gini", "entropy"]),
)
Problem.add_forbidden_clause(forbidden_criterion_gb)

if __name__ == "__main__":
    print(Problem)

Run problem.py with $ python problem.py in your shell. The output will be:

[Out]
Configuration space object:
Hyperparameters:
    classifier, Type: Categorical, Choices: {RandomForest, GradientBoosting}, Default: RandomForest
    criterion, Type: Categorical, Choices: {friedman_mse, mse, mae, gini, entropy}, Default: gini
    learning_rate, Type: UniformFloat, Range: [0.01, 1.0], Default: 0.505
    loss, Type: Categorical, Choices: {deviance, exponential}, Default: deviance
    max_depth, Type: UniformInteger, Range: [1, 50], Default: 26
    min_samples_leaf, Type: UniformInteger, Range: [1, 10], Default: 6
    min_samples_split, Type: UniformInteger, Range: [2, 10], Default: 6
    n_estimators, Type: UniformInteger, Range: [1, 1000], Default: 32, on log-scale
    subsample, Type: UniformFloat, Range: [0.01, 1.0], Default: 0.505
Conditions:
    learning_rate | classifier == 'GradientBoosting'
    loss | classifier == 'GradientBoosting'
    subsample | classifier == 'GradientBoosting'
Forbidden Clauses:
    (Forbidden: classifier == 'RandomForest' && Forbidden: criterion in {'friedman_mse', 'mae', 'mse'})
    (Forbidden: classifier == 'GradientBoosting' && Forbidden: criterion in {'entropy', 'gini'})
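
As an optional sanity check, you can sample a few configurations from the problem and verify that the GradientBoosting-only hyperparameters stay inactive when RandomForest is selected. The sketch below assumes that the HpProblem object exposes its underlying ConfigSpace configuration space as Problem.space and that Configuration.get_dictionary() is available; both are assumptions to verify against your DeepHyper and ConfigSpace versions.

python
from dhproj.advanced_hpo.problem import Problem

# Sample a handful of configurations; conditional hyperparameters that are
# inactive for a given "classifier" value simply do not appear in the dict.
for config in Problem.space.sample_configuration(5):
    print(config.get_dictionary())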

Run the search for 20 model evaluations using the following command line:

bash
$ deephyper hps ambs --problem dhproj.advanced_hpo.problem.Problem --run dhproj.advanced_hpo.model_run.run --max-evals 20 --evaluator ray --n-jobs 4

Once the search is over, the results.csv file contains the hyperparameter configurations evaluated during the search and their corresponding objective value (validation accuracy). Create test_best_config.py as given below. It will extract the best configuration from results.csv and evaluate it on the training, validation, and test sets.

advanced_hpo/test_best_config.py
from pprint import pprint

import pandas as pd
from deephyper.benchmark.datasets import airlines as dataset
from deephyper.problem import filter_parameters
from dhproj.advanced_hpo.mapping import CLASSIFIERS
from sklearn.utils import check_random_state

df = pd.read_csv("results.csv")
config = df.iloc[df.objective.argmax()][:-2].to_dict()
print("Best config is:")
pprint(config)

config["random_state"] = check_random_state(42)

rs_data = check_random_state(42)

ratio_test = 0.33
ratio_valid = (1 - ratio_test) * 0.33

train, valid, test = dataset.load_data(
    random_state=rs_data,
    test_size=ratio_test,
    valid_size=ratio_valid,
    categoricals_to_integers=True,
)

clf_class = CLASSIFIERS[config["classifier"]]
config["n_jobs"] = 4
clf_params = filter_parameters(clf_class, config)

clf = clf_class(**clf_params)

clf.fit(*train)

acc_train = clf.score(*train)
acc_valid = clf.score(*valid)
acc_test = clf.score(*test)

print(f"Accuracy on Training: {acc_train:.3f}")
print(f"Accuracy on Validation: {acc_valid:.3f}")
print(f"Accuracy on Testing: {acc_test:.3f}")

Compared to the default configurations, we can see an accuracy improvement on the validation and test data sets.

[Out]
Accuracy on Training: 0.754
Accuracy on Validation: 0.664
Accuracy on Testing: 0.664