Hyperparameter optimization and overfitting

In this example, you will learn how to treat the choice of a learning method as just another hyperparameter. We consider the Random Forest (RF) and Gradient Boosting (GB) classifiers from Scikit-Learn on the Airlines dataset.

Each classifier has both unique and shared hyperparameters. We use ConfigSpace, a Python package for defining conditional hyperparameters and more, to model them.

By defining the objective of hyperparameter optimization properly, and by treating hyperparameter optimization as an optimized model selection method, you will also learn how to fight overfitting.

Installation and imports

We recommend installing the dependencies with pip. Python >= 3.10 is required.

%%bash
pip install "deephyper[ray]" "openml==0.15.1"

We start by creating a function which loads the data of interest. Here we use the “Airlines” dataset from OpenML where the task is to predict whether a given flight will be delayed, given the information of the scheduled departure.

Code (Loading the data)

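import numpy as np
import openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


def load_data(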
    random_state=42,
    verbose=False,
    test_size=0.33,
    valid_size=0.33,
    categoricals_to_integers=False,
):
    """Load the "Airlines" dataset from OpenML.

    Args:
        random_state (int or RandomState, optional): A seed or a numpy `RandomState`. Defaults to 42.
        verbose (bool, optional): Print information about the dataset. Defaults to False.
        test_size (float, optional): The proportion of the test dataset out of the whole data. Defaults to 0.33.
        valid_size (float, optional): The proportion of the validation dataset out of the whole data. Defaults to 0.33.
        categoricals_to_integers (bool, optional): Convert categorical features to integer values. Defaults to False.

    Returns:
        tuple: Numpy arrays as `(X_train, y_train), (X_valid, y_valid), (X_test, y_test)`.
    """
    random_state = (
        np.random.RandomState(random_state) if type(random_state) is int else random_state
    )

    dataset = openml.datasets.get_dataset(
        dataset_id=1169,
        download_data=True,
        download_qualities=True,
        download_features_meta_data=True,
    )

    if verbose:
        print(
            f"This is dataset '{dataset.name}', the target feature is "
            f"'{dataset.default_target_attribute}'"
        )
        print(f"URL: {dataset.url}")
        print(dataset.description[:500])

    X, y, categorical_indicator, ft_names = dataset.get_data(
        target=dataset.default_target_attribute
    )

    # encode categoricals as integers
    if categoricals_to_integers:
        for ft_ind, ft_name in enumerate(ft_names):
            if categorical_indicator[ft_ind]:
                labenc = LabelEncoder().fit(X[ft_name])
                X[ft_name] = labenc.transform(X[ft_name])
                n_classes = len(labenc.classes_)
            else:
                n_classes = -1
            categorical_indicator[ft_ind] = (
                categorical_indicator[ft_ind],
                n_classes,
            )

    X, y = X.to_numpy(), y.to_numpy()

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, shuffle=True, random_state=random_state
    )

    # relative valid_size on Train set
    r_valid_size = valid_size / (1.0 - test_size)
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_train,
        y_train,
        test_size=r_valid_size,
        shuffle=True,
        random_state=random_state,
    )

    return (X_train, y_train), (X_valid, y_valid), (X_test, y_test)

Then, we create a mapping to record the classification algorithms of interest:

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

CLASSIFIERS = {
    "RandomForest": RandomForestClassifier,
    "GradientBoosting": GradientBoostingClassifier,
}

We then create baseline code to test the accuracy of each candidate model with its default hyperparameters:

Scoring model: RandomForest
        Accuracy on Training: 0.879
        Accuracy on Validation: 0.620
        Accuracy on Testing: 0.619

Scoring model: GradientBoosting
        Accuracy on Training: 0.649
        Accuracy on Validation: 0.648
        Accuracy on Testing: 0.649

The accuracy values show that the RandomForest classifier with default hyperparameters overfits and therefore generalizes poorly (i.e., high accuracy on training data but not on the validation or test data). In contrast, GradientBoosting shows no sign of overfitting and reaches a better accuracy on the validation and test sets, so it generalizes better than RandomForest (for the default hyperparameters).
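The code that produced the baseline scores is not reproduced on this page. The following is a minimal sketch of how such a baseline could be written, assuming each entry of the classifier mapping can be instantiated without arguments and exposes `fit` and `score`. The names `score_default_models` and `MajorityClassifier` are hypothetical; the stand-in estimator and the toy data are there only so that the sketch runs without downloading anything.

```python
class MajorityClassifier:
    """Stand-in estimator (hypothetical) exposing the fit/score methods used below."""

    def fit(self, X, y):
        # Remember the most frequent label seen during training.
        self.label_ = max(set(y), key=list(y).count)
        return self

    def score(self, X, y):
        # Accuracy obtained by always predicting the remembered label.
        return sum(1 for label in y if label == self.label_) / len(y)


def score_default_models(classifiers, train, valid, test):
    """Fit each classifier with its default hyperparameters and report accuracies."""
    scores = {}
    for name, clf_class in classifiers.items():
        clf = clf_class()  # no arguments: default hyperparameters
        clf.fit(*train)
        scores[name] = {
            "Training": clf.score(*train),
            "Validation": clf.score(*valid),
            "Testing": clf.score(*test),
        }
        print("Scoring model:", name)
        for split, acc in scores[name].items():
            print(f"\tAccuracy on {split}: {acc:.3f}")
    return scores


# Toy data so that the sketch is runnable on its own.
toy_train = ([[0], [1], [2], [3]], [0, 0, 0, 1])
toy_valid = ([[4], [5]], [0, 1])
toy_test = ([[6], [7]], [1, 1])
toy_scores = score_default_models(
    {"Majority": MajorityClassifier}, toy_train, toy_valid, toy_test
)
```

With the real data, the same helper could be called as `score_default_models(CLASSIFIERS, *load_data(categoricals_to_integers=True))`, since `load_data` returns the training, validation and test tuples in that order.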

Then, we optimize the hyperparameters: we seek the best classifier and its corresponding best hyperparameters to improve the accuracy on the validation and test data. We create a load_subsampled_data function to load and return subsampled training and validation data in order to speed up the evaluation of candidate models and hyperparameters:

from sklearn.utils import resample


def load_subsampled_data(verbose=0, subsample=True, random_state=None):
    # In this case passing a random state is critical to make sure
    # that the same data are loaded all the time and that the test set
    # is not mixed with either the training or validation set.
    # For safety reasons, it is important to avoid setting a global seed.
    random_state = np.random.RandomState(random_state)

    # Proportion of the test set on the full dataset
    ratio_test = 0.33

    # Proportion of the validation set on the full dataset, chosen so that
    # the validation set is 33% of "dataset \ test set"
    ratio_valid = (1 - ratio_test) * 0.33

    # The 3rd result is ignored with "_" because it corresponds to the test set
    # which is not interesting for us now.
    (X_train, y_train), (X_valid, y_valid), _ = load_data(
        random_state=42,
        test_size=ratio_test,
        valid_size=ratio_valid,
        categoricals_to_integers=True,
    )

    # Optionally sub-sample the training data to speed up the search;
    # "n_samples" controls the size of the new training data
    if subsample:
        X_train, y_train = resample(
            X_train, y_train, n_samples=int(1e4), random_state=random_state
        )

    if verbose:
        print(f"X_train shape: {np.shape(X_train)}")
        print(f"y_train shape: {np.shape(y_train)}")
        print(f"X_valid shape: {np.shape(X_valid)}")
        print(f"y_valid shape: {np.shape(y_valid)}")

    return (X_train, y_train), (X_valid, y_valid)


print("Without subsampling")
_ = load_subsampled_data(verbose=1, subsample=False)
print()
print("With subsampling")
_ = load_subsampled_data(verbose=1)

Without subsampling
X_train shape: (242128, 7)
y_train shape: (242128,)
X_valid shape: (119258, 7)
y_valid shape: (119258,)

With subsampling
X_train shape: (10000, 7)
y_train shape: (10000,)
X_valid shape: (119258, 7)
y_valid shape: (119258,)
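The shapes printed above follow from the two-step split performed by `load_data`. The sketch below redoes that arithmetic, assuming the dataset has 539383 rows and that the held-out part of each split is rounded up; both are assumptions, chosen because they reproduce the printed shapes. The helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_size, valid_size):
    """Return (n_train, n_valid, n_test) for the two-step split used by load_data."""
    # First split: hold out the test set (the held-out count is rounded up).
    n_test = math.ceil(test_size * n_samples)
    n_rest = n_samples - n_test
    # Second split: valid_size is expressed on the full data, so it is rescaled
    # to a proportion of what is left after removing the test set.
    r_valid_size = valid_size / (1.0 - test_size)
    n_valid = math.ceil(r_valid_size * n_rest)
    n_train = n_rest - n_valid
    return n_train, n_valid, n_test


ratio_test = 0.33
ratio_valid = (1 - ratio_test) * 0.33
# 539383 is the assumed number of rows in the dataset.
sizes = split_sizes(539383, ratio_test, ratio_valid)
print(sizes)  # (242128, 119258, 177997)
```

With these ratios the validation set is 33% of the data left after removing the test set, so it is smaller than the test set itself (119258 rows against 177997).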

Then, we create a run function to train and evaluate a given hyperparameter configuration. This function has to return a scalar value (typically, a validation accuracy) that is maximized by the hyperparameter optimization algorithm.

from sklearn.metrics import accuracy_score
from sklearn.utils import check_random_state


def run(job) -> float:
    config = job.parameters.copy()
    config["random_state"] = check_random_state(42)

    (X_train, y_train), (X_valid, y_valid) = load_subsampled_data(subsample=True)

    clf_class = CLASSIFIERS[config["classifier"]]

    # keep parameters possible for the current classifier
    config["n_jobs"] = 4
    clf_params = filter_parameters(clf_class, config)

    try:  # good practice to manage the fail value yourself...
        clf = clf_class(**clf_params)

        clf.fit(X_train, y_train)

        fit_is_complete = True
    except Exception:
        fit_is_complete = False

    if fit_is_complete:
        y_pred = clf.predict(X_valid)
        acc = accuracy_score(y_valid, y_pred)
    else:
        # DeepHyper treats an objective string starting with "F" as a failed evaluation
        acc = "F_fit_failed"

    return acc
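The run function relies on a `filter_parameters` helper whose definition is not shown on this page. The following is a minimal sketch of what such a helper could look like, assuming its job is to keep only the entries that the classifier constructor accepts and to strip the `p:` prefix used in the columns of the results table. `ToyClassifier` is a hypothetical stand-in for a Scikit-Learn estimator.

```python
from inspect import signature


def filter_parameters(obj, config):
    """Keep only the entries of config that obj accepts as keyword arguments."""
    allowed = signature(obj).parameters
    filtered = {}
    for key, value in config.items():
        # Columns of the results table are prefixed with "p:"; strip it if present.
        name = key[2:] if key.startswith("p:") else key
        if name in allowed:
            filtered[name] = value
    return filtered


class ToyClassifier:
    """Hypothetical stand-in for a Scikit-Learn estimator."""

    def __init__(self, n_estimators=100, max_depth=None, random_state=None):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.random_state = random_state


toy_config = {
    "classifier": "Toy",
    "p:n_estimators": 8,
    "max_depth": 5,
    "learning_rate": 0.01,
}
toy_params = filter_parameters(ToyClassifier, toy_config)
print(toy_params)  # {'n_estimators': 8, 'max_depth': 5}
```

Filtering of this kind is what lets a single configuration dictionary serve both classifiers: entries such as `learning_rate` are dropped for RandomForestClassifier, and `n_jobs` is dropped for GradientBoostingClassifier, which does not accept it.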

Then, we create the HpProblem to define the search space of hyperparameters for each model.

The first hyperparameter is "classifier", the selected model.

Then, we use Condition and Forbidden to define constraints on the hyperparameters.

Default values are very important when adding Condition and Forbidden clauses. Otherwise, the creation of the problem can fail if the default configuration is not acceptable.

import ConfigSpace as cs
from deephyper.hpo import HpProblem

problem = HpProblem()

classifier = problem.add_hyperparameter(
    ["RandomForest", "GradientBoosting"], "classifier", default_value="RandomForest"
)

# For both
problem.add_hyperparameter((1, 1000, "log-uniform"), "n_estimators")
problem.add_hyperparameter((1, 50), "max_depth")
problem.add_hyperparameter((2, 10), "min_samples_split")
problem.add_hyperparameter((1, 10), "min_samples_leaf")
criterion = problem.add_hyperparameter(
    ["friedman_mse", "squared_error", "gini", "entropy"],
    "criterion",
    default_value="gini",
)

# GradientBoosting
loss = problem.add_hyperparameter(["log_loss", "exponential"], "loss")
learning_rate = problem.add_hyperparameter((0.01, 1.0), "learning_rate")
subsample = problem.add_hyperparameter((0.01, 1.0), "subsample")

gradient_boosting_hp = [loss, learning_rate, subsample]
for hp_i in gradient_boosting_hp:
    problem.add_condition(cs.EqualsCondition(hp_i, classifier, "GradientBoosting"))

forbidden_criterion_rf = cs.ForbiddenAndConjunction(
    cs.ForbiddenEqualsClause(classifier, "RandomForest"),
    cs.ForbiddenInClause(criterion, ["friedman_mse", "squared_error"]),
)
problem.add_forbidden_clause(forbidden_criterion_rf)

forbidden_criterion_gb = cs.ForbiddenAndConjunction(
    cs.ForbiddenEqualsClause(classifier, "GradientBoosting"),
    cs.ForbiddenInClause(criterion, ["gini", "entropy"]),
)
problem.add_forbidden_clause(forbidden_criterion_gb)

problem

Configuration space object:
  Hyperparameters:
    classifier, Type: Categorical, Choices: {RandomForest, GradientBoosting}, Default: RandomForest
    criterion, Type: Categorical, Choices: {friedman_mse, squared_error, gini, entropy}, Default: gini
    learning_rate, Type: UniformFloat, Range: [0.01, 1.0], Default: 0.505
    loss, Type: Categorical, Choices: {log_loss, exponential}, Default: log_loss
    max_depth, Type: UniformInteger, Range: [1, 50], Default: 26
    min_samples_leaf, Type: UniformInteger, Range: [1, 10], Default: 6
    min_samples_split, Type: UniformInteger, Range: [2, 10], Default: 6
    n_estimators, Type: UniformInteger, Range: [1, 1000], Default: 32, on log-scale
    subsample, Type: UniformFloat, Range: [0.01, 1.0], Default: 0.505
  Conditions:
    learning_rate | classifier == 'GradientBoosting'
    loss | classifier == 'GradientBoosting'
    subsample | classifier == 'GradientBoosting'
  Forbidden Clauses:
    (Forbidden: classifier == 'GradientBoosting' && Forbidden: criterion in {'gini', 'entropy'})
    (Forbidden: classifier == 'RandomForest' && Forbidden: criterion in {'friedman_mse', 'squared_error'})
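To make the constraints concrete, the sketch below restates them in plain Python. It is an illustration only, not how ConfigSpace evaluates them, and the names `is_forbidden` and `active_parameters` are hypothetical. It also shows why the default values matter: the default pair (RandomForest, gini) is acceptable, whereas a default criterion of friedman_mse would make the default configuration forbidden.

```python
GB_ONLY = {"loss", "learning_rate", "subsample"}
ALLOWED_CRITERIA = {
    "RandomForest": {"gini", "entropy"},
    "GradientBoosting": {"friedman_mse", "squared_error"},
}


def is_forbidden(config):
    """True if the criterion does not belong to the selected classifier."""
    return config["criterion"] not in ALLOWED_CRITERIA[config["classifier"]]


def active_parameters(config):
    """Drop the GradientBoosting-only hyperparameters when they are inactive."""
    if config["classifier"] == "GradientBoosting":
        return dict(config)
    return {k: v for k, v in config.items() if k not in GB_ONLY}


default_config = {"classifier": "RandomForest", "criterion": "gini"}
bad_default = {"classifier": "RandomForest", "criterion": "friedman_mse"}
print(is_forbidden(default_config), is_forbidden(bad_default))  # False True

rf_config = {"classifier": "RandomForest", "criterion": "gini", "learning_rate": 0.01}
print(active_parameters(rf_config))
```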

Then, we create an Evaluator object using the ray backend to distribute the evaluation of the run function defined previously.

from deephyper.evaluator import Evaluator
from deephyper.evaluator.callback import TqdmCallback

evaluator = Evaluator.create(
    run,
    method="ray",
    method_kwargs={
        "num_cpus_per_task": 1,
        "callbacks": [TqdmCallback()],
    },
)

print("Number of workers: ", evaluator.num_workers)

2025-08-18 14:14:14,435 INFO worker.py:1852 -- Started a local Ray instance.
Number of workers:  8

Finally, you can define a Bayesian optimization search called CBO (for Centralized Bayesian Optimization) and link it to the problem and evaluator defined above.

from deephyper.hpo import CBO

max_evals = 100

search = CBO(
    problem,
    random_state=42,
)
results = search.search(evaluator, max_evals=max_evals)

  0%|          | 0/100 [00:00<?, ?it/s]
  1%|          | 1/100 [00:00<00:00, 3938.31it/s, failures=0, objective=0.593]
  4%|▍         | 4/100 [00:01<00:37,  2.56it/s, failures=0, objective=0.631]
 10%|█         | 10/100 [00:04<00:43,  2.06it/s, failures=0, objective=0.637]
 27%|██▋       | 27/100 [00:27<02:16,  1.87s/it, failures=0, objective=0.639]
 35%|███▌      | 35/100 [00:42<02:07,  1.97s/it, failures=0, objective=0.64]
 49%|████▉     | 49/100 [01:03<01:40,  1.97s/it, failures=0, objective=0.642]
100%|██████████| 100/100 [02:57<00:00,  1.78s/it, failures=0, objective=0.642]

Once the search is over, a file named results.csv is saved in the current directory. The same dataframe is returned by the search.search(...) call. It contains the hyperparameter configurations evaluated during the search and their corresponding objective value (i.e., validation accuracy), along with timestamp_submit, the time when the evaluator submitted the configuration for evaluation, and timestamp_gather, the time when the evaluator received the evaluated configuration. Both are times relative to the creation of the Evaluator instance.

results

	p:classifier	p:criterion	p:max_depth	p:min_samples_leaf	p:min_samples_split	p:n_estimators	p:learning_rate	p:loss	p:subsample	objective	job_id	job_status	m:timestamp_submit	m:timestamp_gather
0	GradientBoosting	friedman_mse	22	9	9	3	0.644010	log_loss	0.808648	0.593486	3	DONE	1.467354	3.604347
1	RandomForest	entropy	17	5	6	2	0.010000	log_loss	0.010000	0.589680	5	DONE	1.469836	4.116830
2	GradientBoosting	squared_error	41	1	6	5	0.602586	log_loss	0.677687	0.579005	0	DONE	1.463099	4.121034
3	RandomForest	gini	5	8	8	8	0.010000	log_loss	0.010000	0.631245	10	DONE	4.669049	5.093934
4	GradientBoosting	friedman_mse	12	3	5	80	0.451887	log_loss	0.335074	0.567551	1	DONE	1.464540	5.524779
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
100	GradientBoosting	squared_error	26	2	6	555	0.184134	log_loss	0.548685	0.601922	104	DONE	181.565745	205.574851
101	GradientBoosting	squared_error	30	1	4	297	0.119764	log_loss	0.567435	0.604387	103	DONE	181.564635	205.577222
102	GradientBoosting	friedman_mse	22	5	3	52	0.024548	log_loss	0.951610	0.612722	100	DONE	181.560469	205.579115
103	GradientBoosting	squared_error	11	1	2	405	0.039474	log_loss	0.031607	0.587667	102	DONE	181.563470	205.580877
104	GradientBoosting	squared_error	45	5	2	460	0.038718	log_loss	0.091673	0.605200	101	DONE	181.561892	205.582460

105 rows × 14 columns
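The objective value displayed in the progress bar during the search appears to be the best validation accuracy found so far, not the accuracy of the latest evaluation. The sketch below recomputes that running best from the objectives of the first five rows of the table above, which are listed in the order they were gathered.

```python
from itertools import accumulate

# Objectives of the first five rows of the results table, in gather order.
objectives = [0.593486, 0.589680, 0.579005, 0.631245, 0.567551]

# Running maximum: the best validation accuracy found after each evaluation.
best_so_far = list(accumulate(objectives, max))
print([f"{value:.3f}" for value in best_so_far])
# ['0.593', '0.593', '0.593', '0.631', '0.631']
```

These values agree with the progress bar, which reports 0.593 after the first evaluations and 0.631 from the fourth one onwards.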

We can now look at the top-3 hyperparameter configurations.

results.nlargest(n=3, columns="objective")

	p:classifier	p:criterion	p:max_depth	p:min_samples_leaf	p:min_samples_split	p:n_estimators	p:learning_rate	p:loss	p:subsample	objective	job_id	job_status	m:timestamp_submit	m:timestamp_gather
96	RandomForest	gini	39	10	4	888	0.01	log_loss	0.01	0.642422	93	DONE	162.587231	173.139025
48	RandomForest	entropy	14	7	4	303	0.01	log_loss	0.01	0.641743	44	DONE	60.722285	67.346927
72	RandomForest	entropy	46	6	2	627	0.01	log_loss	0.01	0.641575	69	DONE	104.765836	114.384126

Let us define a function to evaluate the best configuration on the training, validation and test data sets.

def evaluate_config(config):
    config["random_state"] = check_random_state(42)

    rs_data = check_random_state(42)

    ratio_test = 0.33
    ratio_valid = (1 - ratio_test) * 0.33

    train, valid, test = load_data(
        random_state=rs_data,
        test_size=ratio_test,
        valid_size=ratio_valid,
        categoricals_to_integers=True,
    )

    print("Scoring model:", config["p:classifier"])
    clf_class = CLASSIFIERS[config["p:classifier"]]
    config["n_jobs"] = 4
    clf_params = filter_parameters(clf_class, config)

    clf = clf_class(**clf_params)

    clf.fit(*train)

    acc_train = clf.score(*train)
    acc_valid = clf.score(*valid)
    acc_test = clf.score(*test)

    print(f"\tAccuracy on Training: {acc_train:.3f}")
    print(f"\tAccuracy on Validation: {acc_valid:.3f}")
    print(f"\tAccuracy on Testing: {acc_test:.3f}")


config = results.iloc[results.objective.argmax()][:-2].to_dict()
print(f"Best config is:\n {config}")
evaluate_config(config)

Best config is:
 {'p:classifier': 'RandomForest', 'p:criterion': 'gini', 'p:max_depth': 39, 'p:min_samples_leaf': 10, 'p:min_samples_split': 4, 'p:n_estimators': 888, 'p:learning_rate': 0.01, 'p:loss': 'log_loss', 'p:subsample': 0.01, 'objective': 0.6424223112914856, 'job_id': 93, 'job_status': 'DONE'}
Scoring model: RandomForest
        Accuracy on Training: 0.751
        Accuracy on Validation: 0.666
        Accuracy on Testing: 0.666

In conclusion, compared to the default configuration, the accuracy on test data improves from 0.619 to 0.666, and the overfitting gap between the training and the validation/test data sets shrinks. The baseline RandomForest went from 0.879 training accuracy to 0.619 test accuracy. The best configuration, which also selects the RandomForest classifier, goes from 0.751 training accuracy to 0.666 test accuracy.
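The reduction in overfitting can be quantified as the gap between training and test accuracy. The short computation below uses the RandomForest accuracies printed earlier on this page; the helper name `generalization_gap` is only illustrative.

```python
# Accuracies reported above for the RandomForest classifier.
baseline = {"train": 0.879, "test": 0.619}
tuned = {"train": 0.751, "test": 0.666}


def generalization_gap(scores):
    """Training accuracy minus test accuracy (larger means more overfitting)."""
    return scores["train"] - scores["test"]


gap_baseline = generalization_gap(baseline)
gap_tuned = generalization_gap(tuned)
test_gain = tuned["test"] - baseline["test"]
print(f"gap: {gap_baseline:.3f} -> {gap_tuned:.3f}, test gain: {test_gain:.3f}")
# gap: 0.260 -> 0.085, test gain: 0.047
```

The gap drops from about 0.26 to about 0.085 while the test accuracy gains about 0.047, so the tuned model fits the training data less tightly and generalizes better.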

Total running time of the script: (5 minutes 27.192 seconds)
