4. Hyperparameter Search to reduce overfitting in Machine Learning (Scikit-Learn)#

In this tutorial, we will show how to treat a learning method as a hyperparameter in the hyperparameter search. We will consider Random Forest (RF) classifier and Gradient Boosting (GB) classifier methods in Scikit-Learn for the Airlines data set. Each of these methods have its own set of hyperparameters and some common parameters. We model them using ConfigSpace a python package to express conditional hyperparameters and more.

Let us start by installing DeepHyper.

[1]:

!pip install deephyper
!pip install ray
!pip install openml

Requirement already satisfied: deephyper in /usr/local/lib/python3.7/dist-packages (0.3.3)
Requirement already satisfied: pydot in /usr/local/lib/python3.7/dist-packages (from deephyper) (1.3.0)
Requirement already satisfied: openml==0.10.2 in /usr/local/lib/python3.7/dist-packages (from deephyper) (0.10.2)
Requirement already satisfied: tensorflow-probability in /usr/local/lib/python3.7/dist-packages (from deephyper) (0.14.1)
Requirement already satisfied: Jinja2 in /usr/local/lib/python3.7/dist-packages (from deephyper) (2.11.3)
Requirement already satisfied: xgboost in /usr/local/lib/python3.7/dist-packages (from deephyper) (0.90)
Requirement already satisfied: matplotlib>=3.0.3 in /usr/local/lib/python3.7/dist-packages (from deephyper) (3.2.2)
Requirement already satisfied: scikit-learn>=0.23.1 in /usr/local/lib/python3.7/dist-packages (from deephyper) (1.0.1)
Requirement already satisfied: dh-scikit-optimize==0.9.4 in /usr/local/lib/python3.7/dist-packages (from deephyper) (0.9.4)
Requirement already satisfied: pandas>=0.24.2 in /usr/local/lib/python3.7/dist-packages (from deephyper) (1.1.5)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from deephyper) (4.62.3)
Requirement already satisfied: tensorflow>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from deephyper) (2.7.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from deephyper) (1.19.5)
Requirement already satisfied: networkx in /usr/local/lib/python3.7/dist-packages (from deephyper) (2.6.3)
Requirement already satisfied: ray[default]>=1.3.0 in /usr/local/lib/python3.7/dist-packages (from deephyper) (1.8.0)
Requirement already satisfied: ConfigSpace>=0.4.18 in /usr/local/lib/python3.7/dist-packages (from deephyper) (0.4.20)
Requirement already satisfied: typeguard in /usr/local/lib/python3.7/dist-packages (from deephyper) (2.7.1)
Requirement already satisfied: statsmodels in /usr/local/lib/python3.7/dist-packages (from deephyper) (0.10.2)
Requirement already satisfied: joblib>=0.10.3 in /usr/local/lib/python3.7/dist-packages (from deephyper) (1.1.0)
Requirement already satisfied: scipy>=0.19.1 in /usr/local/lib/python3.7/dist-packages (from dh-scikit-optimize==0.9.4->deephyper) (1.4.1)
Requirement already satisfied: pyaml>=16.9 in /usr/local/lib/python3.7/dist-packages (from dh-scikit-optimize==0.9.4->deephyper) (21.10.1)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from openml==0.10.2->deephyper) (2.23.0)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.7/dist-packages (from openml==0.10.2->deephyper) (2.8.2)
Requirement already satisfied: xmltodict in /usr/local/lib/python3.7/dist-packages (from openml==0.10.2->deephyper) (0.12.0)
Requirement already satisfied: liac-arff>=2.4.0 in /usr/local/lib/python3.7/dist-packages (from openml==0.10.2->deephyper) (2.5.0)
Requirement already satisfied: pyparsing in /usr/local/lib/python3.7/dist-packages (from ConfigSpace>=0.4.18->deephyper) (2.4.7)
Requirement already satisfied: cython in /usr/local/lib/python3.7/dist-packages (from ConfigSpace>=0.4.18->deephyper) (0.29.24)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.0.3->deephyper) (1.3.2)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=3.0.3->deephyper) (0.11.0)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.24.2->deephyper) (2018.9)
Requirement already satisfied: PyYAML in /usr/local/lib/python3.7/dist-packages (from pyaml>=16.9->dh-scikit-optimize==0.9.4->deephyper) (3.13)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil->openml==0.10.2->deephyper) (1.15.0)
Requirement already satisfied: grpcio>=1.28.1 in /usr/local/lib/python3.7/dist-packages (from ray[default]>=1.3.0->deephyper) (1.41.1)
Requirement already satisfied: click>=7.0 in /usr/local/lib/python3.7/dist-packages (from ray[default]>=1.3.0->deephyper) (7.1.2)
Requirement already satisfied: jsonschema in /usr/local/lib/python3.7/dist-packages (from ray[default]>=1.3.0->deephyper) (2.6.0)
Requirement already satisfied: msgpack<2.0.0,>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from ray[default]>=1.3.0->deephyper) (1.0.2)
Requirement already satisfied: attrs in /usr/local/lib/python3.7/dist-packages (from ray[default]>=1.3.0->deephyper) (21.2.0)
Requirement already satisfied: redis>=3.5.0 in /usr/local/lib/python3.7/dist-packages (from ray[default]>=1.3.0->deephyper) (4.0.1)
Requirement already satisfied: protobuf>=3.15.3 in /usr/local/lib/python3.7/dist-packages (from ray[default]>=1.3.0->deephyper) (3.17.3)
Requirement already satisfied: filelock in /usr/local/lib/python3.7/dist-packages (from ray[default]>=1.3.0->deephyper) (3.3.2)
Requirement already satisfied: prometheus-client>=0.7.1 in /usr/local/lib/python3.7/dist-packages (from ray[default]>=1.3.0->deephyper) (0.12.0)
Requirement already satisfied: py-spy>=0.2.0 in /usr/local/lib/python3.7/dist-packages (from ray[default]>=1.3.0->deephyper) (0.3.11)
Requirement already satisfied: aiohttp-cors in /usr/local/lib/python3.7/dist-packages (from ray[default]>=1.3.0->deephyper) (0.7.0)
Requirement already satisfied: aioredis<2 in /usr/local/lib/python3.7/dist-packages (from ray[default]>=1.3.0->deephyper) (1.3.1)
Requirement already satisfied: colorful in /usr/local/lib/python3.7/dist-packages (from ray[default]>=1.3.0->deephyper) (0.5.4)
Requirement already satisfied: gpustat>=1.0.0b1 in /usr/local/lib/python3.7/dist-packages (from ray[default]>=1.3.0->deephyper) (1.0.0b1)
Requirement already satisfied: aiohttp>=3.7 in /usr/local/lib/python3.7/dist-packages (from ray[default]>=1.3.0->deephyper) (3.8.1)
Requirement already satisfied: opencensus in /usr/local/lib/python3.7/dist-packages (from ray[default]>=1.3.0->deephyper) (0.8.0)
Requirement already satisfied: charset-normalizer<3.0,>=2.0 in /usr/local/lib/python3.7/dist-packages (from aiohttp>=3.7->ray[default]>=1.3.0->deephyper) (2.0.7)
Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.7/dist-packages (from aiohttp>=3.7->ray[default]>=1.3.0->deephyper) (1.2.0)
Requirement already satisfied: typing-extensions>=3.7.4 in /usr/local/lib/python3.7/dist-packages (from aiohttp>=3.7->ray[default]>=1.3.0->deephyper) (3.10.0.2)
Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.7/dist-packages (from aiohttp>=3.7->ray[default]>=1.3.0->deephyper) (1.7.2)
Requirement already satisfied: asynctest==0.13.0 in /usr/local/lib/python3.7/dist-packages (from aiohttp>=3.7->ray[default]>=1.3.0->deephyper) (0.13.0)
Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /usr/local/lib/python3.7/dist-packages (from aiohttp>=3.7->ray[default]>=1.3.0->deephyper) (4.0.1)
Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.7/dist-packages (from aiohttp>=3.7->ray[default]>=1.3.0->deephyper) (1.2.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.7/dist-packages (from aiohttp>=3.7->ray[default]>=1.3.0->deephyper) (5.2.0)
Requirement already satisfied: hiredis in /usr/local/lib/python3.7/dist-packages (from aioredis<2->ray[default]>=1.3.0->deephyper) (2.0.0)
Requirement already satisfied: blessed>=1.17.1 in /usr/local/lib/python3.7/dist-packages (from gpustat>=1.0.0b1->ray[default]>=1.3.0->deephyper) (1.19.0)
Requirement already satisfied: psutil in /usr/local/lib/python3.7/dist-packages (from gpustat>=1.0.0b1->ray[default]>=1.3.0->deephyper) (5.4.8)
Requirement already satisfied: nvidia-ml-py3>=7.352.0 in /usr/local/lib/python3.7/dist-packages (from gpustat>=1.0.0b1->ray[default]>=1.3.0->deephyper) (7.352.0)
Requirement already satisfied: wcwidth>=0.1.4 in /usr/local/lib/python3.7/dist-packages (from blessed>=1.17.1->gpustat>=1.0.0b1->ray[default]>=1.3.0->deephyper) (0.2.5)
Requirement already satisfied: deprecated in /usr/local/lib/python3.7/dist-packages (from redis>=3.5.0->ray[default]>=1.3.0->deephyper) (1.2.13)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.23.1->deephyper) (3.0.0)
Requirement already satisfied: absl-py>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow>=2.0.0->deephyper) (0.12.0)
Requirement already satisfied: h5py>=2.9.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow>=2.0.0->deephyper) (3.1.0)
Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.21.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow>=2.0.0->deephyper) (0.22.0)
Requirement already satisfied: flatbuffers<3.0,>=1.12 in /usr/local/lib/python3.7/dist-packages (from tensorflow>=2.0.0->deephyper) (2.0)
Requirement already satisfied: google-pasta>=0.1.1 in /usr/local/lib/python3.7/dist-packages (from tensorflow>=2.0.0->deephyper) (0.2.0)
Requirement already satisfied: astunparse>=1.6.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow>=2.0.0->deephyper) (1.6.3)
Requirement already satisfied: opt-einsum>=2.3.2 in /usr/local/lib/python3.7/dist-packages (from tensorflow>=2.0.0->deephyper) (3.3.0)
Requirement already satisfied: tensorflow-estimator<2.8,~=2.7.0rc0 in /usr/local/lib/python3.7/dist-packages (from tensorflow>=2.0.0->deephyper) (2.7.0)
Requirement already satisfied: libclang>=9.0.1 in /usr/local/lib/python3.7/dist-packages (from tensorflow>=2.0.0->deephyper) (12.0.0)
Requirement already satisfied: termcolor>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow>=2.0.0->deephyper) (1.1.0)
Requirement already satisfied: wheel<1.0,>=0.32.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow>=2.0.0->deephyper) (0.37.0)
Requirement already satisfied: wrapt>=1.11.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow>=2.0.0->deephyper) (1.13.3)
Requirement already satisfied: keras-preprocessing>=1.1.1 in /usr/local/lib/python3.7/dist-packages (from tensorflow>=2.0.0->deephyper) (1.1.2)
Requirement already satisfied: gast<0.5.0,>=0.2.1 in /usr/local/lib/python3.7/dist-packages (from tensorflow>=2.0.0->deephyper) (0.4.0)
Requirement already satisfied: tensorboard~=2.6 in /usr/local/lib/python3.7/dist-packages (from tensorflow>=2.0.0->deephyper) (2.7.0)
Requirement already satisfied: keras<2.8,>=2.7.0rc0 in /usr/local/lib/python3.7/dist-packages (from tensorflow>=2.0.0->deephyper) (2.7.0)
Requirement already satisfied: cached-property in /usr/local/lib/python3.7/dist-packages (from h5py>=2.9.0->tensorflow>=2.0.0->deephyper) (1.5.2)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /usr/local/lib/python3.7/dist-packages (from tensorboard~=2.6->tensorflow>=2.0.0->deephyper) (0.4.6)
Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in /usr/local/lib/python3.7/dist-packages (from tensorboard~=2.6->tensorflow>=2.0.0->deephyper) (1.8.0)
Requirement already satisfied: setuptools>=41.0.0 in /usr/local/lib/python3.7/dist-packages (from tensorboard~=2.6->tensorflow>=2.0.0->deephyper) (57.4.0)
Requirement already satisfied: google-auth<3,>=1.6.3 in /usr/local/lib/python3.7/dist-packages (from tensorboard~=2.6->tensorflow>=2.0.0->deephyper) (1.35.0)
Requirement already satisfied: tensorboard-data-server<0.7.0,>=0.6.0 in /usr/local/lib/python3.7/dist-packages (from tensorboard~=2.6->tensorflow>=2.0.0->deephyper) (0.6.1)
Requirement already satisfied: werkzeug>=0.11.15 in /usr/local/lib/python3.7/dist-packages (from tensorboard~=2.6->tensorflow>=2.0.0->deephyper) (1.0.1)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.7/dist-packages (from tensorboard~=2.6->tensorflow>=2.0.0->deephyper) (3.3.4)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.7/dist-packages (from google-auth<3,>=1.6.3->tensorboard~=2.6->tensorflow>=2.0.0->deephyper) (0.2.8)
Requirement already satisfied: cachetools<5.0,>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from google-auth<3,>=1.6.3->tensorboard~=2.6->tensorflow>=2.0.0->deephyper) (4.2.4)
Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.7/dist-packages (from google-auth<3,>=1.6.3->tensorboard~=2.6->tensorflow>=2.0.0->deephyper) (4.7.2)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /usr/local/lib/python3.7/dist-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard~=2.6->tensorflow>=2.0.0->deephyper) (1.3.0)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from markdown>=2.6.8->tensorboard~=2.6->tensorflow>=2.0.0->deephyper) (4.8.2)
Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /usr/local/lib/python3.7/dist-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard~=2.6->tensorflow>=2.0.0->deephyper) (0.4.8)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->openml==0.10.2->deephyper) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->openml==0.10.2->deephyper) (2021.10.8)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->openml==0.10.2->deephyper) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->openml==0.10.2->deephyper) (2.10)
Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.7/dist-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard~=2.6->tensorflow>=2.0.0->deephyper) (3.1.1)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->markdown>=2.6.8->tensorboard~=2.6->tensorflow>=2.0.0->deephyper) (3.6.0)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.7/dist-packages (from Jinja2->deephyper) (2.0.1)
Requirement already satisfied: opencensus-context==0.1.2 in /usr/local/lib/python3.7/dist-packages (from opencensus->ray[default]>=1.3.0->deephyper) (0.1.2)
Requirement already satisfied: google-api-core<3.0.0,>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from opencensus->ray[default]>=1.3.0->deephyper) (1.26.3)
Requirement already satisfied: packaging>=14.3 in /usr/local/lib/python3.7/dist-packages (from google-api-core<3.0.0,>=1.0.0->opencensus->ray[default]>=1.3.0->deephyper) (21.2)
Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in /usr/local/lib/python3.7/dist-packages (from google-api-core<3.0.0,>=1.0.0->opencensus->ray[default]>=1.3.0->deephyper) (1.53.0)
Requirement already satisfied: patsy>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from statsmodels->deephyper) (0.5.2)
Requirement already satisfied: dm-tree in /usr/local/lib/python3.7/dist-packages (from tensorflow-probability->deephyper) (0.1.6)
Requirement already satisfied: decorator in /usr/local/lib/python3.7/dist-packages (from tensorflow-probability->deephyper) (4.4.2)
Requirement already satisfied: cloudpickle>=1.3 in /usr/local/lib/python3.7/dist-packages (from tensorflow-probability->deephyper) (1.3.0)

We start by creating a function which loads the data of interest. Here we use the “Airlines” dataset from OpenML where the task is to predict whether a given flight will be delayed, given the information of the scheduled departure.

[3]:

import numpy as np
import openml
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split


def load_data(
    random_state=42,
    verbose=False,
    test_size=0.33,
    valid_size=0.33,
    categoricals_to_integers=False,
):
    """Load the "Airlines" dataset from OpenML.

    Args:
        random_state (int, optional): A numpy `RandomState`. Defaults to 42.
        verbose (bool, optional): Print informations about the dataset. Defaults to False.
        test_size (float, optional): The proportion of the test dataset out of the whole data. Defaults to 0.33.
        valid_size (float, optional): The proportion of the train dataset out of the whole data without the test data. Defaults to 0.33.
        categoricals_to_integers (bool, optional): Convert categoricals features to integer values. Defaults to False.

    Returns:
        tuple: Numpy arrays as, `(X_train, y_train), (X_valid, y_valid), (X_test, y_test)`.
    """
    random_state = (
        np.random.RandomState(random_state) if type(random_state) is int else random_state
    )

    dataset = openml.datasets.get_dataset(1169)

    if verbose:
        print(
            f"This is dataset '{dataset.name}', the target feature is "
            f"'{dataset.default_target_attribute}'"
        )
        print(f"URL: {dataset.url}")
        print(dataset.description[:500])

    X, y, categorical_indicator, ft_names = dataset.get_data(
        target=dataset.default_target_attribute
    )

    # encode categoricals as integers
    if categoricals_to_integers:
        for (ft_ind, ft_name) in enumerate(ft_names):
            if categorical_indicator[ft_ind]:
                labenc = LabelEncoder().fit(X[ft_name])
                X[ft_name] = labenc.transform(X[ft_name])
                n_classes = len(labenc.classes_)
            else:
                n_classes = -1
            categorical_indicator[ft_ind] = (
                categorical_indicator[ft_ind],
                n_classes,
            )

    X, y = X.to_numpy(), y.to_numpy()

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, shuffle=True, random_state=random_state
    )

    # relative valid_size on Train set
    r_valid_size = valid_size / (1.0 - test_size)
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_train, y_train, test_size=r_valid_size, shuffle=True, random_state=random_state
    )

    return (X_train, y_train), (X_valid, y_valid), (X_test, y_test)

Then, we create a mapping to record the classification algorithms of interest:

[4]:

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier


CLASSIFIERS = {
    "RandomForest": RandomForestClassifier,
    "GradientBoosting": GradientBoostingClassifier,
}

Create a baseline code to test the accuracy of the default configuration for both models:

[5]:

from sklearn.utils import check_random_state

rs_clf = check_random_state(42)
rs_data = check_random_state(42)

ratio_test = 0.33
ratio_valid = (1 - ratio_test) * 0.33

train, valid, test = load_data(
    random_state=rs_data,
    test_size=ratio_test,
    valid_size=ratio_valid,
    categoricals_to_integers=True,
)

for clf_name, clf_class in CLASSIFIERS.items():
    print(clf_name)

    clf = clf_class(random_state=rs_clf)

    clf.fit(*train)

    acc_train = clf.score(*train)
    acc_valid = clf.score(*valid)
    acc_test = clf.score(*test)

    print(f"Accuracy on Training: {acc_train:.3f}")
    print(f"Accuracy on Validation: {acc_valid:.3f}")
    print(f"Accuracy on Testing: {acc_test:.3f}\n")

RandomForest
Accuracy on Training: 0.879
Accuracy on Validation: 0.620
Accuracy on Testing: 0.620

GradientBoosting
Accuracy on Training: 0.649
Accuracy on Validation: 0.648
Accuracy on Testing: 0.649

The accuracy values show that the RandomForest classifier with default hyperparameters results in overfitting and thus poor generalization (high accuracy on training data but not on the validation and test data). On the contrary GradientBoosting does not show any sign of overfitting and has a better accuracy on the validation and testing set, which shows a better generalization than RandomForest.

Next, we optimize the hyperparameters, where we seek to find the right classifier and its corresponding hyperparameters to improve the accuracy on the vaidation and test data. Create a load_data function to load and return training and validation data:

[6]:

import numpy as np
from sklearn.utils import resample

def load_subsampled_data(verbose=0, subsample=True):

    # In this case passing a random state is critical to make sure
    # that the same data are loaded all the time and that the test set
    # is not mixed with either the training or validation set.
    # It is important to not avoid setting a global seed for safety reasons.
    random_state = np.random.RandomState(seed=42)

    # Proportion of the test set on the full dataset
    ratio_test = 0.33

    # Proportion of the valid set on "dataset \ test set"
    # here we want the test and validation set to have same number of elements
    ratio_valid = (1 - ratio_test) * 0.33

    # The 3rd result is ignored with "_" because it corresponds to the test set
    # which is not interesting for us now.
    (X_train, y_train), (X_valid, y_valid), _ = load_data(
        random_state=random_state,
        test_size=ratio_test,
        valid_size=ratio_valid,
        categoricals_to_integers=True,
    )

    # Uncomment the next line if you want to sub-sample the training data to speed-up
    # the search, "n_samples" controls the size of the new training data
    if subsample:
        X_train, y_train = resample(X_train, y_train, n_samples=int(1e4))

    if verbose:
        print(f"X_train shape: {np.shape(X_train)}")
        print(f"y_train shape: {np.shape(y_train)}")
        print(f"X_valid shape: {np.shape(X_valid)}")
        print(f"y_valid shape: {np.shape(y_valid)}")
    return (X_train, y_train), (X_valid, y_valid)

print("With subsampling")
_ = load_subsampled_data(verbose=1)
print("\nWithout subsampling")
_ = load_subsampled_data(verbose=1, subsample=False)

With subsampling
X_train shape: (10000, 7)
y_train shape: (10000,)
X_valid shape: (119258, 7)
y_valid shape: (119258,)

Without subsampling
X_train shape: (242128, 7)
y_train shape: (242128,)
X_valid shape: (119258, 7)
y_valid shape: (119258,)

Tip

Subsampling with X_train, y_train = resample(X_train, y_train, n_samples=int(1e4)) can be useful if you want to speed-up your search. By subsampling the training time will reduce.

Create a run function to train and evaluate a given hyperparameter configuration. This function has to return a scalar value (typically, validation accuracy), which will be maximized by the search algorithm.

[17]:

from inspect import signature

def filter_parameters(obj, config: dict) -> dict:
    """Filter the incoming configuration dict based on the signature of obj.
    Args:
        obj (Callable): the object for which the signature is used.
        config (dict): the configuration to filter.
    Returns:
        dict: the filtered configuration dict.
    """
    sig = signature(obj)
    clf_allowed_params = list(sig.parameters.keys())
    clf_params = {
        k[2:]: v
        for k, v in config.items()
        if k[2:] in clf_allowed_params and not (v in ["nan", "NA"])
    }
    return clf_params

[8]:

from sklearn.metrics import accuracy_score
from sklearn.utils import check_random_state


def run(config: dict) -> float:

    config["random_state"] = check_random_state(42)

    (X_train, y_train), (X_valid, y_valid) = load_subsampled_data()

    clf_class = CLASSIFIERS[config["classifier"]]

    # keep parameters possible for the current classifier
    config["n_jobs"] = 4
    clf_params = filter_parameters(clf_class, config)

    try:  # good practice to manage the fail value yourself...
        clf = clf_class(**clf_params)

        clf.fit(X_train, y_train)

        fit_is_complete = True
    except:
        fit_is_complete = False

    if fit_is_complete:
        y_pred = clf.predict(X_valid)
        acc = accuracy_score(y_valid, y_pred)
    else:
        acc = -1.0

    return acc

Create the HpProblem to define the search space of hyperparameters for each model:

[9]:

import ConfigSpace as cs
from deephyper.problem import HpProblem


problem = HpProblem()

#! Default value are very important when adding conditional and forbidden clauses
#! Otherwise the creation of the problem can fail if the default configuration is not
#! Acceptable
classifier = problem.add_hyperparameter(
    ["RandomForest", "GradientBoosting"],
    "classifier",
    default_value="RandomForest"
)

# For both
problem.add_hyperparameter((1, 1000, "log-uniform"), "n_estimators")
problem.add_hyperparameter((1, 50), "max_depth")
problem.add_hyperparameter((2, 10), "min_samples_split")
problem.add_hyperparameter((1, 10), "min_samples_leaf")
criterion = problem.add_hyperparameter(
    ["friedman_mse", "squared_error", "gini", "entropy"],
    "criterion",
    default_value="gini",
)

# GradientBoosting
loss = problem.add_hyperparameter(["log_loss", "exponential"], "loss")
learning_rate = problem.add_hyperparameter((0.01, 1.0), "learning_rate")
subsample = problem.add_hyperparameter((0.01, 1.0), "subsample")

gradient_boosting_hp = [loss, learning_rate, subsample]
for hp_i in gradient_boosting_hp:
    problem.add_condition(cs.EqualsCondition(hp_i, classifier, "GradientBoosting"))

forbidden_criterion_rf = cs.ForbiddenAndConjunction(
    cs.ForbiddenEqualsClause(classifier, "RandomForest"),
    cs.ForbiddenInClause(criterion, ["friedman_mse", "squared_error"]),
)
problem.add_forbidden_clause(forbidden_criterion_rf)

forbidden_criterion_gb = cs.ForbiddenAndConjunction(
    cs.ForbiddenEqualsClause(classifier, "GradientBoosting"),
    cs.ForbiddenInClause(criterion, ["gini", "entropy"]),
)
problem.add_forbidden_clause(forbidden_criterion_gb)

problem

[9]:

Configuration space object:
  Hyperparameters:
    classifier, Type: Categorical, Choices: {RandomForest, GradientBoosting}, Default: RandomForest
    criterion, Type: Categorical, Choices: {friedman_mse, squared_error, gini, entropy}, Default: gini
    learning_rate, Type: UniformFloat, Range: [0.01, 1.0], Default: 0.505
    loss, Type: Categorical, Choices: {log_loss, exponential}, Default: log_loss
    max_depth, Type: UniformInteger, Range: [1, 50], Default: 26
    min_samples_leaf, Type: UniformInteger, Range: [1, 10], Default: 6
    min_samples_split, Type: UniformInteger, Range: [2, 10], Default: 6
    n_estimators, Type: UniformInteger, Range: [1, 1000], Default: 32, on log-scale
    subsample, Type: UniformFloat, Range: [0.01, 1.0], Default: 0.505
  Conditions:
    learning_rate | classifier == 'GradientBoosting'
    loss | classifier == 'GradientBoosting'
    subsample | classifier == 'GradientBoosting'
  Forbidden Clauses:
    (Forbidden: classifier == 'RandomForest' && Forbidden: criterion in {'friedman_mse', 'squared_error'})
    (Forbidden: classifier == 'GradientBoosting' && Forbidden: criterion in {'entropy', 'gini'})

Create an Evaluator object using the ray backend to distribute the evaluation of the run-function defined previously.

[10]:

from deephyper.evaluator import Evaluator
from deephyper.evaluator.callback import TqdmCallback

evaluator = Evaluator.create(run,
                 method="ray",
                 method_kwargs={
                     "address": None,
                     "num_cpus": 1,
                     "num_cpus_per_task": 1,
                     "callbacks": [TqdmCallback()]

                 })

print("Number of workers: ", evaluator.num_workers)

/Users/romainegele/Documents/Argonne/deephyper/deephyper/evaluator/_evaluator.py:126: UserWarning: Applying nest-asyncio patch for IPython Shell!
  warnings.warn(
2023-01-30 16:20:45,873 INFO worker.py:1518 -- Started a local Ray instance.

Number of workers:  1

Finally, you can define a Bayesian optimization search called CBO (for Centralized Bayesian Optimization) and link to it the defined problem and evaluator.

[12]:

from deephyper.search.hps import CBO

search = CBO(problem, evaluator)

[13]:

results = search.search(30)

Once the search is over, a file named results.csv is saved in the current directory. The same dataframe is returned by the search.search(...) call. It contains the hyperparameters configurations evaluated during the search and their corresponding objective value (i.e, validation accuracy), timestamp_submit the time when the evaluator submitted the configuration to be evaluated and timestamp_gather the time when the evaluator received the configuration once evaluated (both are relative times with respect to the creation of the Evaluator instance).

[14]:

results

[14]:

	p:classifier	p:criterion	p:max_depth	p:min_samples_leaf	p:min_samples_split	p:n_estimators	p:learning_rate	p:loss	p:subsample	objective	job_id	m:timestamp_submit	m:timestamp_gather
0	GradientBoosting	friedman_mse	48	8	6	1	0.034711	exponential	0.557088	0.556206	0	20.238832	21.725218
1	GradientBoosting	squared_error	25	6	7	83	0.648618	log_loss	0.845242	-1.000000	1	22.025568	22.304188
2	RandomForest	entropy	10	2	2	80	NaN	NaN	NaN	0.639538	2	22.677139	23.422508
3	RandomForest	entropy	3	5	5	3	NaN	NaN	NaN	0.624428	3	23.701458	24.147202
4	GradientBoosting	friedman_mse	40	8	7	159	0.396828	exponential	0.185283	0.577957	4	24.432544	27.467635
5	GradientBoosting	squared_error	16	9	8	201	0.675658	exponential	0.093068	0.546345	5	27.832241	30.280387
6	RandomForest	gini	31	7	3	2	NaN	NaN	NaN	0.588355	6	30.560062	31.008324
7	GradientBoosting	squared_error	27	7	7	120	0.035546	log_loss	0.959819	-1.000000	7	31.376341	31.655923
8	GradientBoosting	friedman_mse	7	7	4	5	0.738049	exponential	0.428511	0.616135	8	31.930076	32.409772
9	GradientBoosting	squared_error	23	10	5	5	0.716434	log_loss	0.982813	-1.000000	9	32.682877	32.958425
10	RandomForest	entropy	18	2	2	197	NaN	NaN	NaN	0.633844	10	33.440294	34.996482
11	RandomForest	entropy	8	1	2	40	NaN	NaN	NaN	0.638322	11	35.406363	35.961062
12	RandomForest	entropy	10	6	2	73	NaN	NaN	NaN	0.640854	12	36.374761	37.096683
13	RandomForest	entropy	11	7	2	54	NaN	NaN	NaN	0.635546	13	37.589636	38.244692
14	RandomForest	entropy	9	6	2	353	NaN	NaN	NaN	0.641450	14	38.662550	40.431650
15	RandomForest	entropy	7	9	2	845	NaN	NaN	NaN	0.638028	15	40.852464	43.896614
16	RandomForest	entropy	9	5	2	719	NaN	NaN	NaN	0.639781	16	44.397726	47.393177
17	RandomForest	entropy	9	6	3	284	NaN	NaN	NaN	0.639244	17	47.811217	49.270515
18	RandomForest	entropy	47	6	2	398	NaN	NaN	NaN	0.637727	18	49.685115	52.192047
19	RandomForest	entropy	8	5	2	339	NaN	NaN	NaN	0.640737	19	52.602165	54.149691
20	RandomForest	entropy	9	8	2	446	NaN	NaN	NaN	0.639924	20	54.642916	56.724973
21	RandomForest	entropy	4	6	2	325	NaN	NaN	NaN	0.630599	21	57.133088	58.310939
22	RandomForest	entropy	14	6	8	351	NaN	NaN	NaN	0.641165	22	58.728417	60.834408
23	RandomForest	entropy	4	6	7	362	NaN	NaN	NaN	0.629803	23	61.345774	62.587422
24	RandomForest	entropy	9	8	2	399	NaN	NaN	NaN	0.641726	24	62.999582	64.910603
25	RandomForest	entropy	22	8	2	389	NaN	NaN	NaN	0.641148	25	65.415612	67.825691
26	RandomForest	entropy	9	9	2	373	NaN	NaN	NaN	0.640376	26	68.238418	70.019041
27	RandomForest	entropy	9	8	2	292	NaN	NaN	NaN	0.641399	27	70.435441	71.970712
28	RandomForest	entropy	12	8	6	378	NaN	NaN	NaN	0.639286	28	72.483597	74.873468
29	RandomForest	entropy	6	9	2	394	NaN	NaN	NaN	0.633861	29	75.531141	77.583019

The deephyper-analytics command line is a way of analyzing this type of file. For example, we want to output the best configuration we can use the topk functionnality.

[15]:

results.nlargest(n=3, columns="objective")

[15]:

	p:classifier	p:criterion	p:max_depth	p:min_samples_leaf	p:min_samples_split	p:n_estimators	p:learning_rate	p:loss	p:subsample	objective	job_id	m:timestamp_submit	m:timestamp_gather
24	RandomForest	entropy	9	8	2	399	NaN	NaN	NaN	0.641726	24	62.999582	64.910603
14	RandomForest	entropy	9	6	2	353	NaN	NaN	NaN	0.641450	14	38.662550	40.431650
27	RandomForest	entropy	9	8	2	292	NaN	NaN	NaN	0.641399	27	70.435441	71.970712

Let us define a test to evaluate the best configuration on the training, validation and test data sets.

[18]:

from pprint import pprint
import pandas as pd


config = results.iloc[results.objective.argmax()][:-2].to_dict()
print("Best config is:")
pprint(config)

config["random_state"] = check_random_state(42)

rs_data = check_random_state(42)

ratio_test = 0.33
ratio_valid = (1 - ratio_test) * 0.33

train, valid, test = load_data(
    random_state=rs_data,
    test_size=ratio_test,
    valid_size=ratio_valid,
    categoricals_to_integers=True,
)

clf_class = CLASSIFIERS[config["p:classifier"]]
config["n_jobs"] = 4
clf_params = filter_parameters(clf_class, config)

clf = clf_class(**clf_params)

clf.fit(*train)

acc_train = clf.score(*train)
acc_valid = clf.score(*valid)
acc_test = clf.score(*test)

print(f"Accuracy on Training: {acc_train:.3f}")
print(f"Accuracy on Validation: {acc_valid:.3f}")
print(f"Accuracy on Testing: {acc_test:.3f}")

Best config is:
{'job_id': 24,
 'objective': 0.641726341209814,
 'p:classifier': 'RandomForest',
 'p:criterion': 'entropy',
 'p:learning_rate': nan,
 'p:loss': nan,
 'p:max_depth': 9,
 'p:min_samples_leaf': 8,
 'p:min_samples_split': 2,
 'p:n_estimators': 399,
 'p:subsample': nan}
Accuracy on Training: 0.654
Accuracy on Validation: 0.649
Accuracy on Testing: 0.649

Compared to the default configuration, we can see the accuracy improvement and the reduction of overfitting between the training and the validation/test data sets.