Hyperparameter optimization and overfitting


In this example, you will learn how to treat the choice of a learning method as just another hyperparameter. We consider the Random Forest (RF) and Gradient Boosting (GB) classifiers from Scikit-Learn on the Airlines dataset.

Each classifier has both unique and shared hyperparameters. We use ConfigSpace, a Python package for defining conditional hyperparameters and more, to model them.

By using the hyperparameter optimization objective properly, and by treating hyperparameter optimization as an optimized model selection method, you will also learn how to fight overfitting.

Installation and imports

We recommend installing the dependencies with pip. This requires Python >= 3.10.

%%bash
pip install "deephyper[ray]" "openml==0.15.1"
Code (Import statements)
from inspect import signature

import ConfigSpace as cs
import matplotlib.pyplot as plt
import numpy as np
import openml
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import check_random_state, resample

from deephyper.analysis.hpo import plot_search_trajectory_single_objective_hpo
from deephyper.evaluator import Evaluator
from deephyper.evaluator.callback import TqdmCallback
from deephyper.hpo import CBO, HpProblem

WIDTH_PLOTS = 8
HEIGHT_PLOTS = WIDTH_PLOTS / 1.618

We start by creating a function which loads the data of interest. Here we use the “Airlines” dataset from OpenML where the task is to predict whether a given flight will be delayed, given the information of the scheduled departure.

Code (Loading the data)
def load_data(
    random_state=42,
    verbose=False,
    test_size=0.33,
    valid_size=0.33,
    categoricals_to_integers=False,
):
    """Load the "Airlines" dataset from OpenML.

    Args:
        random_state (int or RandomState, optional): A seed or a numpy `RandomState`. Defaults to 42.
        verbose (bool, optional): Print information about the dataset. Defaults to False.
        test_size (float, optional): The proportion of the test dataset out of the whole data. Defaults to 0.33.
        valid_size (float, optional): The proportion of the validation dataset out of the whole data. It is rescaled internally to a proportion of the data left after removing the test set. Defaults to 0.33.
        categoricals_to_integers (bool, optional): Convert categorical features to integer values. Defaults to False.

    Returns:
        tuple: Numpy arrays as, `(X_train, y_train), (X_valid, y_valid), (X_test, y_test)`.
    """
    random_state = (
        np.random.RandomState(random_state) if type(random_state) is int else random_state
    )

    dataset = openml.datasets.get_dataset(
        dataset_id=1169,
        download_data=True,
        download_qualities=True,
        download_features_meta_data=True,
    )

    if verbose:
        print(
            f"This is dataset '{dataset.name}', the target feature is "
            f"'{dataset.default_target_attribute}'"
        )
        print(f"URL: {dataset.url}")
        print(dataset.description[:500])

    X, y, categorical_indicator, ft_names = dataset.get_data(
        target=dataset.default_target_attribute
    )

    # encode categoricals as integers
    if categoricals_to_integers:
        for ft_ind, ft_name in enumerate(ft_names):
            if categorical_indicator[ft_ind]:
                labenc = LabelEncoder().fit(X[ft_name])
                X[ft_name] = labenc.transform(X[ft_name])
                n_classes = len(labenc.classes_)
            else:
                n_classes = -1
            categorical_indicator[ft_ind] = (
                categorical_indicator[ft_ind],
                n_classes,
            )

    X, y = X.to_numpy(), y.to_numpy()

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, shuffle=True, random_state=random_state
    )

    # relative valid_size on Train set
    r_valid_size = valid_size / (1.0 - test_size)
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_train,
        y_train,
        test_size=r_valid_size,
        shuffle=True,
        random_state=random_state,
    )

    return (X_train, y_train), (X_valid, y_valid), (X_test, y_test)

Then, we create a mapping to record the classification algorithms of interest:

CLASSIFIERS = {
    "RandomForest": RandomForestClassifier,
    "GradientBoosting": GradientBoostingClassifier,
}

Next, we write a baseline that measures the accuracy of each candidate model with its default hyperparameters:

Code (Evaluate baseline models)
def evaluate_baseline():
    rs_clf = check_random_state(42)
    rs_data = check_random_state(42)

    ratio_test = 0.33
    ratio_valid = (1 - ratio_test) * 0.33

    train, valid, test = load_data(
        random_state=rs_data,
        test_size=ratio_test,
        valid_size=ratio_valid,
        categoricals_to_integers=True,
    )

    for clf_name, clf_class in CLASSIFIERS.items():
        print("Scoring model:", clf_name)

        clf = clf_class(random_state=rs_clf)

        clf.fit(*train)

        acc_train = clf.score(*train)
        acc_valid = clf.score(*valid)
        acc_test = clf.score(*test)

        print(f"\tAccuracy on Training: {acc_train:.3f}")
        print(f"\tAccuracy on Validation: {acc_valid:.3f}")
        print(f"\tAccuracy on Testing: {acc_test:.3f}\n")


evaluate_baseline()
Scoring model: RandomForest
        Accuracy on Training: 0.879
        Accuracy on Validation: 0.620
        Accuracy on Testing: 0.619

Scoring model: GradientBoosting
        Accuracy on Training: 0.649
        Accuracy on Validation: 0.648
        Accuracy on Testing: 0.649

The accuracy values show that the RandomForest classifier with default hyperparameters overfits and therefore generalizes poorly: its accuracy is high on the training data (0.879) but much lower on the validation and test data (about 0.62). In contrast, GradientBoosting shows no sign of overfitting and reaches a higher accuracy on the validation and test sets (about 0.65), which indicates better generalization than RandomForest with default hyperparameters.

Next, we optimize the hyperparameters: we search for the best classifier and its best hyperparameters to improve the accuracy on the validation and test data. We create a load_subsampled_data function that loads and returns subsampled training and validation data in order to speed up the evaluation of candidate models and hyperparameters:

def load_subsampled_data(verbose=0, subsample=True, random_state=None):
    # Passing a random state explicitly is critical here: it guarantees
    # that the same data are loaded every time and that the test set
    # is never mixed with either the training or validation set.
    # Relying on a global seed instead would be unsafe.
    random_state = np.random.RandomState(random_state)

    # Proportion of the test set on the full dataset
    ratio_test = 0.33

    # Proportion of the valid set on the full dataset, chosen so that
    # the valid set is 33% of "dataset \ test set"
    ratio_valid = (1 - ratio_test) * 0.33

    # The 3rd result is ignored with "_" because it corresponds to the test set
    # which is not interesting for us now.
    (X_train, y_train), (X_valid, y_valid), _ = load_data(
        random_state=42,
        test_size=ratio_test,
        valid_size=ratio_valid,
        categoricals_to_integers=True,
    )

    # Optionally sub-sample the training data to speed up the search;
    # "n_samples" controls the size of the new training data
    if subsample:
        X_train, y_train = resample(
            X_train, y_train, n_samples=int(1e4), random_state=random_state
        )

    if verbose:
        print(f"X_train shape: {np.shape(X_train)}")
        print(f"y_train shape: {np.shape(y_train)}")
        print(f"X_valid shape: {np.shape(X_valid)}")
        print(f"y_valid shape: {np.shape(y_valid)}")

    return (X_train, y_train), (X_valid, y_valid)


print("Without subsampling")
_ = load_subsampled_data(verbose=1, subsample=False)
print()
print("With subsampling")
_ = load_subsampled_data(verbose=1)
Without subsampling
X_train shape: (242128, 7)
y_train shape: (242128,)
X_valid shape: (119258, 7)
y_valid shape: (119258,)

With subsampling
X_train shape: (10000, 7)
y_train shape: (10000,)
X_valid shape: (119258, 7)
y_valid shape: (119258,)
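
The shapes printed above can be checked by hand. The following sketch redoes the second split of load_data in plain Python, starting from the number of samples left once the test set is removed (the sum of the two printed sizes). It assumes that a fractional test_size is rounded up to a whole number of samples, which is how scikit-learn's train_test_split is understood to behave; the variable names are illustrative only.

```python
import math

# Sizes printed above for the full (non-subsampled) split.
n_train_printed = 242128
n_valid_printed = 119258

# Samples left after the test set has been removed.
n_remaining = n_train_printed + n_valid_printed

ratio_test = 0.33
ratio_valid = (1 - ratio_test) * 0.33  # proportion of the full dataset

# load_data rescales valid_size to a proportion of the non-test data.
r_valid_size = ratio_valid / (1.0 - ratio_test)

# Assumption: the fractional size is rounded up to an integer sample count.
n_valid = math.ceil(r_valid_size * n_remaining)
n_train = n_remaining - n_valid

print(f"relative validation size: {r_valid_size:.2f}")
print(f"n_valid={n_valid}, n_train={n_train}")
```

The validation set is therefore 33% of the data that remains after the test split, not 33% of the full dataset.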

Then, we create a run function to train and evaluate a given hyperparameter configuration. This function has to return a scalar value (typically, a validation accuracy) that is maximized by the hyperparameter optimization algorithm.

Code (Utility function that filters a dictionary based on the signature of an object)
def filter_parameters(obj, config: dict) -> dict:
    """Filter the incoming configuration dict based on the signature of obj.

    Args:
        obj (Callable): the object for which the signature is used.
        config (dict): the configuration to filter.

    Returns:
        dict: the filtered configuration dict.
    """
    sig = signature(obj)
    clf_allowed_params = list(sig.parameters.keys())
    clf_params = {(k[2:] if k.startswith("p:") else k): v for k, v in config.items()}
    clf_params = {
        k: v
        for k, v in clf_params.items()
        if (k in clf_allowed_params and (v not in ["nan", "NA"]))
    }
    return clf_params


def run(job) -> float:
    config = job.parameters.copy()
    config["random_state"] = check_random_state(42)

    (X_train, y_train), (X_valid, y_valid) = load_subsampled_data(subsample=True)

    clf_class = CLASSIFIERS[config["classifier"]]

    # keep parameters possible for the current classifier
    config["n_jobs"] = 4
    clf_params = filter_parameters(clf_class, config)

    try:  # good practice to manage the fail value yourself...
        clf = clf_class(**clf_params)

        clf.fit(X_train, y_train)

        fit_is_complete = True
    except Exception:
        fit_is_complete = False

    if fit_is_complete:
        y_pred = clf.predict(X_valid)
        acc = accuracy_score(y_valid, y_pred)
    else:
        acc = "F_fit_failed"

    return acc
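
To see what filter_parameters keeps and drops, here is a small self-contained sketch. The helper is repeated so the block runs on its own, and ToyForest is a hypothetical stand-in for a scikit-learn classifier (it is not part of the tutorial). The configuration mimics one row of search results: keys carry the "p:" prefix, and some entries do not belong to the classifier at all.

```python
from inspect import signature


def filter_parameters(obj, config: dict) -> dict:
    """Same logic as the helper above, repeated so this sketch is standalone."""
    allowed = list(signature(obj).parameters.keys())
    # Strip the "p:" prefix used in the results dataframe.
    stripped = {(k[2:] if k.startswith("p:") else k): v for k, v in config.items()}
    # Keep only arguments the constructor accepts and that carry a real value.
    return {
        k: v for k, v in stripped.items() if k in allowed and v not in ["nan", "NA"]
    }


class ToyForest:
    """Hypothetical classifier with a RandomForest-like constructor."""

    def __init__(self, n_estimators=100, max_depth=None, criterion="gini", n_jobs=None):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.criterion = criterion
        self.n_jobs = n_jobs


config = {
    "p:classifier": "RandomForest",  # not a constructor argument
    "p:n_estimators": 888,
    "p:max_depth": 39,
    "p:criterion": "gini",
    "p:learning_rate": 0.01,  # GradientBoosting only
    "p:loss": "NA",  # placeholder value
    "n_jobs": 4,
    "objective": 0.642,  # bookkeeping column
}

params = filter_parameters(ToyForest, config)
print(params)
clf = ToyForest(**params)
```

Because the filter is driven by the constructor signature, the same configuration dict can be passed to either classifier without a hand-written list of valid arguments.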

Then, we create the HpProblem to define the search space of hyperparameters for each model.

The first hyperparameter is "classifier", the selected model.

Then, we use Condition and Forbidden to define constraints on the hyperparameters.

Default values are very important when adding Condition and Forbidden clauses: the creation of the problem can fail if the default configuration violates one of them.

problem = HpProblem()

classifier = problem.add_hyperparameter(
    ["RandomForest", "GradientBoosting"], "classifier", default_value="RandomForest"
)

# For both
problem.add_hyperparameter((1, 1000, "log-uniform"), "n_estimators")
problem.add_hyperparameter((1, 50), "max_depth")
problem.add_hyperparameter((2, 10), "min_samples_split")
problem.add_hyperparameter((1, 10), "min_samples_leaf")
criterion = problem.add_hyperparameter(
    ["friedman_mse", "squared_error", "gini", "entropy"],
    "criterion",
    default_value="gini",
)

# GradientBoosting
loss = problem.add_hyperparameter(["log_loss", "exponential"], "loss")
learning_rate = problem.add_hyperparameter((0.01, 1.0), "learning_rate")
subsample = problem.add_hyperparameter((0.01, 1.0), "subsample")

gradient_boosting_hp = [loss, learning_rate, subsample]
for hp_i in gradient_boosting_hp:
    problem.add_condition(cs.EqualsCondition(hp_i, classifier, "GradientBoosting"))

forbidden_criterion_rf = cs.ForbiddenAndConjunction(
    cs.ForbiddenEqualsClause(classifier, "RandomForest"),
    cs.ForbiddenInClause(criterion, ["friedman_mse", "squared_error"]),
)
problem.add_forbidden_clause(forbidden_criterion_rf)

forbidden_criterion_gb = cs.ForbiddenAndConjunction(
    cs.ForbiddenEqualsClause(classifier, "GradientBoosting"),
    cs.ForbiddenInClause(criterion, ["gini", "entropy"]),
)
problem.add_forbidden_clause(forbidden_criterion_gb)

problem
Configuration space object:
  Hyperparameters:
    classifier, Type: Categorical, Choices: {RandomForest, GradientBoosting}, Default: RandomForest
    criterion, Type: Categorical, Choices: {friedman_mse, squared_error, gini, entropy}, Default: gini
    learning_rate, Type: UniformFloat, Range: [0.01, 1.0], Default: 0.505
    loss, Type: Categorical, Choices: {log_loss, exponential}, Default: log_loss
    max_depth, Type: UniformInteger, Range: [1, 50], Default: 26
    min_samples_leaf, Type: UniformInteger, Range: [1, 10], Default: 6
    min_samples_split, Type: UniformInteger, Range: [2, 10], Default: 6
    n_estimators, Type: UniformInteger, Range: [1, 1000], Default: 32, on log-scale
    subsample, Type: UniformFloat, Range: [0.01, 1.0], Default: 0.505
  Conditions:
    learning_rate | classifier == 'GradientBoosting'
    loss | classifier == 'GradientBoosting'
    subsample | classifier == 'GradientBoosting'
  Forbidden Clauses:
    (Forbidden: classifier == 'GradientBoosting' && Forbidden: criterion in {'gini', 'entropy'})
    (Forbidden: classifier == 'RandomForest' && Forbidden: criterion in {'friedman_mse', 'squared_error'})
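
The conditions and forbidden clauses listed above can be restated in plain Python to make their effect concrete. This is only an illustrative sketch, not the ConfigSpace API: is_forbidden and active_hyperparameters are hypothetical helpers that mirror the printed constraints.

```python
# Hyperparameters that are only active when classifier == "GradientBoosting".
GB_ONLY = {"loss", "learning_rate", "subsample"}

# Criteria that remain allowed for each classifier once the
# two forbidden conjunctions are applied.
ALLOWED_CRITERIA = {
    "RandomForest": {"gini", "entropy"},
    "GradientBoosting": {"friedman_mse", "squared_error"},
}


def is_forbidden(config: dict) -> bool:
    """Return True when the (classifier, criterion) pair is forbidden."""
    return config["criterion"] not in ALLOWED_CRITERIA[config["classifier"]]


def active_hyperparameters(config: dict) -> dict:
    """Drop conditional hyperparameters whose condition is not met."""
    if config["classifier"] == "GradientBoosting":
        return dict(config)
    return {k: v for k, v in config.items() if k not in GB_ONLY}


rf = {"classifier": "RandomForest", "criterion": "gini", "max_depth": 39, "learning_rate": 0.01}
bad = {"classifier": "GradientBoosting", "criterion": "gini", "max_depth": 5}

print(is_forbidden(rf))
print(is_forbidden(bad))
print(active_hyperparameters(rf))
```

The default configuration (RandomForest with gini) passes this check, which is why setting default_value explicitly matters.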

Then, we create an Evaluator object using the ray backend to distribute the evaluation of the run-function defined previously.

evaluator = Evaluator.create(
    run,
    method="ray",
    method_kwargs={
        "num_cpus_per_task": 1,
        "callbacks": [TqdmCallback()],
    },
)

print("Number of workers: ", evaluator.num_workers)
2025-08-18 14:14:14,435 INFO worker.py:1852 -- Started a local Ray instance.
Number of workers:  8
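
Conceptually, the evaluator hands configurations to a pool of workers and gathers the objectives as they finish. The sketch below illustrates that idea with the standard library only; it uses a thread pool as a stand-in for the ray backend, and toy_run and its parameter x are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_run(parameters: dict) -> int:
    """Hypothetical objective to maximize; it peaks at x == 3."""
    return -((parameters["x"] - 3) ** 2)


# Eight candidate configurations, evaluated by four workers.
configs = [{"x": x} for x in range(8)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns results in submission order, even if they finish out of order.
    objectives = list(pool.map(toy_run, configs))

best_objective, best_config = max(zip(objectives, configs), key=lambda pair: pair[0])
print(best_objective, best_config)
```

A real evaluator also records when each job was submitted and gathered, which is where the timestamp columns of the results come from.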

Finally, we define a Bayesian optimization search called CBO (for Centralized Bayesian Optimization) and link it to the problem and evaluator defined above.
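
The call that starts the search does not appear in this extract. Given the later reference to search.search(...) and the 100-step progress bar, it presumably builds the search with CBO(problem, evaluator) and runs search.search(max_evals=100); treat that as an assumption rather than the exact code. To show the shape of such a loop without depending on DeepHyper, here is a minimal sketch that swaps Bayesian optimization for plain random search on a hypothetical one-dimensional objective (toy_run, x and its bounds are made up). It tracks the best objective seen so far, which is what the progress bar below appears to report.

```python
import random


def toy_run(parameters: dict) -> float:
    """Hypothetical objective to maximize; its optimum is at x == 0.3."""
    return 1.0 - (parameters["x"] - 0.3) ** 2


rng = random.Random(42)  # local generator, no global seed
best_objective = float("-inf")
best_parameters = None
trajectory = []  # best objective after each evaluation

for _ in range(100):
    # Random search: sample a candidate instead of asking a surrogate model.
    parameters = {"x": rng.uniform(0.0, 1.0)}
    objective = toy_run(parameters)
    if objective > best_objective:
        best_objective, best_parameters = objective, parameters
    trajectory.append(best_objective)

print(f"best objective: {best_objective:.3f} at x={best_parameters['x']:.3f}")
```

A Bayesian optimizer differs in how it proposes the next candidate: it fits a surrogate model to past evaluations instead of sampling blindly.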

  0%|          | 0/100 [00:00<?, ?it/s]
  1%|          | 1/100 [00:00<00:00, 3938.31it/s, failures=0, objective=0.593]
  4%|▍         | 4/100 [00:01<00:37,  2.56it/s, failures=0, objective=0.631]
 10%|█         | 10/100 [00:04<00:43,  2.06it/s, failures=0, objective=0.637]
 27%|██▋       | 27/100 [00:27<02:16,  1.87s/it, failures=0, objective=0.639]
 35%|███▌      | 35/100 [00:42<02:07,  1.97s/it, failures=0, objective=0.64]
 49%|████▉     | 49/100 [01:03<01:40,  1.97s/it, failures=0, objective=0.642]
100%|██████████| 100/100 [02:57<00:00,  1.78s/it, failures=0, objective=0.642]

Once the search is over, a file named results.csv is saved in the current directory. The same dataframe is returned by the search.search(...) call. It contains the hyperparameter configurations evaluated during the search together with their objective value (i.e., validation accuracy), timestamp_submit (the time when the evaluator submitted the configuration for evaluation) and timestamp_gather (the time when the evaluator received the configuration once evaluated). Both timestamps are relative to the creation of the Evaluator instance.

         p:classifier    p:criterion  p:max_depth  p:min_samples_leaf  p:min_samples_split  p:n_estimators  p:learning_rate    p:loss  p:subsample  objective  job_id  job_status  m:timestamp_submit  m:timestamp_gather
  0  GradientBoosting   friedman_mse           22                   9                    9               3         0.644010  log_loss     0.808648   0.593486       3        DONE            1.467354            3.604347
  1      RandomForest        entropy           17                   5                    6               2         0.010000  log_loss     0.010000   0.589680       5        DONE            1.469836            4.116830
  2  GradientBoosting  squared_error           41                   1                    6               5         0.602586  log_loss     0.677687   0.579005       0        DONE            1.463099            4.121034
  3      RandomForest           gini            5                   8                    8               8         0.010000  log_loss     0.010000   0.631245      10        DONE            4.669049            5.093934
  4  GradientBoosting   friedman_mse           12                   3                    5              80         0.451887  log_loss     0.335074   0.567551       1        DONE            1.464540            5.524779
...               ...            ...          ...                 ...                  ...             ...              ...       ...          ...        ...     ...         ...                 ...                 ...
100  GradientBoosting  squared_error           26                   2                    6             555         0.184134  log_loss     0.548685   0.601922     104        DONE          181.565745          205.574851
101  GradientBoosting  squared_error           30                   1                    4             297         0.119764  log_loss     0.567435   0.604387     103        DONE          181.564635          205.577222
102  GradientBoosting   friedman_mse           22                   5                    3              52         0.024548  log_loss     0.951610   0.612722     100        DONE          181.560469          205.579115
103  GradientBoosting  squared_error           11                   1                    2             405         0.039474  log_loss     0.031607   0.587667     102        DONE          181.563470          205.580877
104  GradientBoosting  squared_error           45                   5                    2             460         0.038718  log_loss     0.091673   0.605200     101        DONE          181.561892          205.582460

105 rows × 14 columns
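
The two timestamp columns make it easy to estimate how long each job took from submission to gathering. The sketch below uses the first five rows shown above, copied by hand as (job_id, timestamp_submit, timestamp_gather); on the real dataframe the same subtraction would apply column-wise. Note that this elapsed time presumably includes any queueing or communication overhead, not just model fitting.

```python
# (job_id, timestamp_submit, timestamp_gather) from the first rows shown above.
rows = [
    (3, 1.467354, 3.604347),
    (5, 1.469836, 4.116830),
    (0, 1.463099, 4.121034),
    (10, 4.669049, 5.093934),
    (1, 1.464540, 5.524779),
]

# Elapsed time between submission and gathering, in seconds.
durations = {job_id: gather - submit for job_id, submit, gather in rows}

for job_id, seconds in sorted(durations.items()):
    print(f"job {job_id:>2}: {seconds:.3f} s")

slowest = max(durations, key=durations.get)
fastest = min(durations, key=durations.get)
print(f"slowest job: {slowest}, fastest job: {fastest}")
```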



Code (Plot results from hyperparameter optimization)
fig, ax = plt.subplots(figsize=(WIDTH_PLOTS, HEIGHT_PLOTS))
plot_search_trajectory_single_objective_hpo(results, mode="max", ax=ax)
_ = plt.title("Search Trajectory")

# Remember that these results only used a subsample of the training data!
# The baseline with the full dataset reached about the same performance, 0.64 in validation accuracy.
Search Trajectory

We can now look at the top-3 hyperparameter configurations.

results.nlargest(n=3, columns="objective")
    p:classifier  p:criterion  p:max_depth  p:min_samples_leaf  p:min_samples_split  p:n_estimators  p:learning_rate    p:loss  p:subsample  objective  job_id  job_status  m:timestamp_submit  m:timestamp_gather
96  RandomForest         gini           39                  10                    4             888             0.01  log_loss         0.01   0.642422      93        DONE          162.587231          173.139025
48  RandomForest      entropy           14                   7                    4             303             0.01  log_loss         0.01   0.641743      44        DONE           60.722285           67.346927
72  RandomForest      entropy           46                   6                    2             627             0.01  log_loss         0.01   0.641575      69        DONE          104.765836          114.384126


Let us define a function that evaluates the best configuration on the training, validation and test sets.

def evaluate_config(config):
    config["random_state"] = check_random_state(42)

    rs_data = check_random_state(42)

    ratio_test = 0.33
    ratio_valid = (1 - ratio_test) * 0.33

    train, valid, test = load_data(
        random_state=rs_data,
        test_size=ratio_test,
        valid_size=ratio_valid,
        categoricals_to_integers=True,
    )

    print("Scoring model:", config["p:classifier"])
    clf_class = CLASSIFIERS[config["p:classifier"]]
    config["n_jobs"] = 4
    clf_params = filter_parameters(clf_class, config)

    clf = clf_class(**clf_params)

    clf.fit(*train)

    acc_train = clf.score(*train)
    acc_valid = clf.score(*valid)
    acc_test = clf.score(*test)

    print(f"\tAccuracy on Training: {acc_train:.3f}")
    print(f"\tAccuracy on Validation: {acc_valid:.3f}")
    print(f"\tAccuracy on Testing: {acc_test:.3f}")


config = results.iloc[results.objective.argmax()][:-2].to_dict()
print(f"Best config is:\n {config}")
evaluate_config(config)
Best config is:
 {'p:classifier': 'RandomForest', 'p:criterion': 'gini', 'p:max_depth': 39, 'p:min_samples_leaf': 10, 'p:min_samples_split': 4, 'p:n_estimators': 888, 'p:learning_rate': 0.01, 'p:loss': 'log_loss', 'p:subsample': 0.01, 'objective': 0.6424223112914856, 'job_id': 93, 'job_status': 'DONE'}
Scoring model: RandomForest
        Accuracy on Training: 0.751
        Accuracy on Validation: 0.666
        Accuracy on Testing: 0.666

In conclusion, compared to the default configuration, the accuracy on test data improves from 0.619 to 0.666, and overfitting between the training and the validation/test sets is reduced. The baseline RandomForest went from 0.879 training accuracy to 0.619 test accuracy. With the best hyperparameters, which again select the RandomForest classifier, the model goes from 0.751 training accuracy to 0.666 test accuracy.
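
One simple way to quantify this is the gap between training and test accuracy. The sketch below computes it from the accuracies reported in this example; the dictionary layout and labels are illustrative only.

```python
# Accuracies reported in the outputs of this example.
scores = {
    "RandomForest (default)": {"train": 0.879, "valid": 0.620, "test": 0.619},
    "GradientBoosting (default)": {"train": 0.649, "valid": 0.648, "test": 0.649},
    "RandomForest (tuned)": {"train": 0.751, "valid": 0.666, "test": 0.666},
}

# Generalization gap: how much accuracy is lost from training to test data.
gaps = {name: round(s["train"] - s["test"], 3) for name, s in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: test={scores[name]['test']:.3f}, train-test gap={gap:.3f}")
```

The tuned RandomForest still shows a gap, but it is roughly a third of the default one while the test accuracy is the highest of the three models.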

Total running time of the script: (5 minutes 27.192 seconds)
