Hyperparameter Search for Machine Learning (Advanced)

In this tutorial, we show how to treat the learning method itself as a hyperparameter in the hyperparameter search. We consider the Random Forest (RF) and Gradient Boosting (GB) classifiers in scikit-learn for the Airlines data set. Each of these methods has its own set of hyperparameters as well as some common ones. We model them using ConfigSpace, a Python package for expressing conditional hyperparameters and more.

Let us start by creating a DeepHyper project and a problem for our application:

Note

If you already have a DeepHyper project you do not need to create a new one for each problem.

bash
$ deephyper start-project dhproj
$ cd dhproj/dhproj/
$ deephyper new-problem hps advanced_hpo
$ cd advanced_hpo/


Create a mapping.py script where you will record the classification algorithms of interest (run $ touch mapping.py in the terminal, then edit the file):

advanced_hpo/mapping.py
"""Mapping of available classifiers for automl.
"""

from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
)


CLASSIFIERS = {
    "RandomForest": RandomForestClassifier,
    "GradientBoosting": GradientBoostingClassifier,
}

Create a script to test the accuracy of the default configuration for both models:

advanced_hpo/test_default_configs.py
from dhproj.advanced_hpo.mapping import CLASSIFIERS
from deephyper.benchmark.datasets import airlines as dataset
from sklearn.utils import check_random_state

rs_clf = check_random_state(42)
rs_data = check_random_state(42)

ratio_test = 0.33
ratio_valid = (1 - ratio_test) * 0.33

train, valid, test = dataset.load_data(
    random_state=rs_data,
    test_size=ratio_test,
    valid_size=ratio_valid,
)

for clf_name, clf_class in CLASSIFIERS.items():
    print(clf_name)

    clf = clf_class(random_state=rs_clf)

    clf.fit(*train)

    acc_train = clf.score(*train)
    acc_valid = clf.score(*valid)
    acc_test = clf.score(*test)

    print(f"Accuracy on Training: {acc_train:.3f}")
    print(f"Accuracy on Validation: {acc_valid:.3f}")
    print(f"Accuracy on Testing: {acc_test:.3f}\n")

Run the script and record the training, validation, and test accuracy as follows:

bash
$ python test_default_configs.py


Running the script gives the following output:

[Out]
RandomForest
Accuracy on Training: 0.879
Accuracy on Validation: 0.621
Accuracy on Testing: 0.620

GradientBoosting
Accuracy on Training: 0.649
Accuracy on Validation: 0.648
Accuracy on Testing: 0.649


The accuracy values show that the RandomForest classifier with default hyperparameters overfits and therefore generalizes poorly (high accuracy on the training data but not on the validation and test data). In contrast, GradientBoosting shows no sign of overfitting and reaches better accuracy on the validation and test sets, which indicates better generalization than RandomForest.
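
The overfitting argument above comes down to simple arithmetic on the reported accuracies. The following is a minimal sketch, assuming the gap between training and validation accuracy is an acceptable proxy for overfitting; the helper name generalization_gap is hypothetical and not part of DeepHyper or scikit-learn.

```python
# Accuracies copied from the default-configuration output above.
default_scores = {
    "RandomForest": {"train": 0.879, "valid": 0.621, "test": 0.620},
    "GradientBoosting": {"train": 0.649, "valid": 0.648, "test": 0.649},
}


def generalization_gap(scores: dict) -> float:
    """Return training accuracy minus validation accuracy (hypothetical helper)."""
    return scores["train"] - scores["valid"]


# A large positive gap suggests the model memorizes the training data.
gaps = {
    name: round(generalization_gap(scores), 3)
    for name, scores in default_scores.items()
}

for name, gap in gaps.items():
    print(f"{name}: train - valid = {gap:.3f}")
```

The RandomForest gap is two orders of magnitude larger than the GradientBoosting gap, which is the pattern the paragraph above describes.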

Next, we optimize the hyperparameters: we seek the right classifier together with its corresponding hyperparameters in order to improve the accuracy on the validation and test data. Create a load_data.py file to load and return the training and validation data:

import numpy as np
from sklearn.utils import resample
from deephyper.benchmark.datasets import airlines as dataset


def load_data():

    # In this case passing a random state is critical to make sure
    # that the same data are loaded all the time and that the test set
    # is not mixed with either the training or validation set.
    # Using a local RandomState rather than a global seed is the safer choice.
    random_state = np.random.RandomState(seed=42)

    # Proportion of the test set on the full dataset
    ratio_test = 0.33

    # Proportion of the valid set on "dataset \ test set"
    # here we want the test and validation set to have same number of elements
    ratio_valid = (1 - ratio_test) * 0.33

    # The 3rd result is ignored with "_" because it corresponds to the test set
    # which is not interesting for us now.
    (X_train, y_train), (X_valid, y_valid), _ = dataset.load_data(
        random_state=random_state, test_size=ratio_test, valid_size=ratio_valid
    )

    X_train, y_train = resample(X_train, y_train, n_samples=int(1e4))
    X_valid, y_valid = resample(X_valid, y_valid, n_samples=int(1e4))

    print(f"X_train shape: {np.shape(X_train)}")
    print(f"y_train shape: {np.shape(y_train)}")
    print(f"X_valid shape: {np.shape(X_valid)}")
    print(f"y_valid shape: {np.shape(y_valid)}")
    return (X_train, y_train), (X_valid, y_valid)


if __name__ == "__main__":
    load_data()

Note

Subsampling with X_train, y_train = resample(X_train, y_train, n_samples=int(1e4)) can be useful if you want to speed up your search: training on fewer samples takes less time.
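
To make the subsampling step concrete, here is a pure-Python stand-in for the idea, not scikit-learn's implementation. It assumes sampling with replacement (the default behaviour of sklearn.utils.resample) and uses a hypothetical helper name and toy data. The point to notice is that the same indices are applied to both X and y so that features and labels stay aligned.

```python
import random


def subsample_with_replacement(X, y, n_samples, seed=42):
    """Draw n_samples (row, label) pairs with replacement, keeping X and y aligned.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # local generator, no global seed is touched
    indices = [rng.randrange(len(X)) for _ in range(n_samples)]
    return [X[i] for i in indices], [y[i] for i in indices]


# Toy data: 100 rows, label is the parity of the first feature.
X = [[i, i * 2] for i in range(100)]
y = [i % 2 for i in range(100)]

X_small, y_small = subsample_with_replacement(X, y, n_samples=10)
print(len(X_small), len(y_small))
```

Passing a fixed seed makes the subsample reproducible across runs, which matters when several evaluations should see the same data.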

To test this code:

bash
$ python load_data.py

The expected output is:

[Out]
X_train shape: (10000, 7)
y_train shape: (10000,)
X_valid shape: (10000, 7)
y_valid shape: (10000,)

Create a model_run.py file to train and evaluate a given hyperparameter configuration. The run function has to return a scalar value (typically, validation accuracy), which will be maximized by the search algorithm.

advanced_hpo/model_run.py
from deephyper.problem import filter_parameters
from dhproj.advanced_hpo.mapping import CLASSIFIERS
from dhproj.advanced_hpo.load_data import load_data
from sklearn.metrics import accuracy_score
from sklearn.utils import check_random_state


def run(config: dict) -> float:
    """Run function which can be used for AutoML classification.

    Args:
        config (dict): hyperparameter configuration to evaluate; it must
            contain the "classifier" key.

    Returns:
        float: validation accuracy, or -1.0 if fitting the classifier failed.
    """
    seed = 42
    config["random_state"] = check_random_state(seed)

    (X_train, y_train), (X_valid, y_valid) = load_data()

    clf_class = CLASSIFIERS[config["classifier"]]

    # keep parameters possible for the current classifier
    config["n_jobs"] = 4
    clf_params = filter_parameters(clf_class, config)

    try:  # good practice to manage the fail value yourself...
        clf = clf_class(**clf_params)

        clf.fit(X_train, y_train)

        fit_is_complete = True
    except Exception:
        fit_is_complete = False

    if fit_is_complete:
        y_pred = clf.predict(X_valid)
        acc = accuracy_score(y_valid, y_pred)
    else:
        acc = -1.0

    return acc

Create problem.py to define the search space of hyperparameters for each model:

advanced_hpo/problem.py
import ConfigSpace as cs
from deephyper.problem import HpProblem


Problem = HpProblem(seed=45)

#! Default values are very important when adding conditional and forbidden clauses.
#! Otherwise the creation of the problem can fail if the default configuration is not
#! acceptable.
classifier = Problem.add_hyperparameter(
    name="classifier",
    value=["RandomForest", "GradientBoosting"],
    default_value="RandomForest",
)

# For both
Problem.add_hyperparameter(name="n_estimators", value=(1, 1000, "log-uniform"))
Problem.add_hyperparameter(name="max_depth", value=(1, 50))
Problem.add_hyperparameter(
    name="min_samples_split", value=(2, 10),
)
Problem.add_hyperparameter(name="min_samples_leaf", value=(1, 10))
criterion = Problem.add_hyperparameter(
    name="criterion",
    value=["friedman_mse", "mse", "mae", "gini", "entropy"],
    default_value="gini",
)

# GradientBoosting
loss = Problem.add_hyperparameter(name="loss", value=["deviance", "exponential"])
learning_rate = Problem.add_hyperparameter(name="learning_rate", value=(0.01, 1.0))
subsample = Problem.add_hyperparameter(name="subsample", value=(0.01, 1.0))

gradient_boosting_hp = [loss, learning_rate, subsample]
for hp_i in gradient_boosting_hp:
    Problem.add_condition(cs.EqualsCondition(hp_i, classifier, "GradientBoosting"))

forbidden_criterion_rf = cs.ForbiddenAndConjunction(
    cs.ForbiddenEqualsClause(classifier, "RandomForest"),
    cs.ForbiddenInClause(criterion, ["friedman_mse", "mse", "mae"]),
)
Problem.add_forbidden_clause(forbidden_criterion_rf)

forbidden_criterion_gb = cs.ForbiddenAndConjunction(
    cs.ForbiddenEqualsClause(classifier, "GradientBoosting"),
    cs.ForbiddenInClause(criterion, ["gini", "entropy"]),
)
Problem.add_forbidden_clause(forbidden_criterion_gb)

if __name__ == "__main__":
    print(Problem)

Run problem.py with $ python problem.py in your shell. The output will be:

[Out]
Configuration space object:
Hyperparameters:
classifier, Type: Categorical, Choices: {RandomForest, GradientBoosting}, Default: RandomForest
criterion, Type: Categorical, Choices: {friedman_mse, mse, mae, gini, entropy}, Default: gini
learning_rate, Type: UniformFloat, Range: [0.01, 1.0], Default: 0.505
loss, Type: Categorical, Choices: {deviance, exponential}, Default: deviance
max_depth, Type: UniformInteger, Range: [1, 50], Default: 26
min_samples_leaf, Type: UniformInteger, Range: [1, 10], Default: 6
min_samples_split, Type: UniformInteger, Range: [2, 10], Default: 6
n_estimators, Type: UniformInteger, Range: [1, 1000], Default: 32, on log-scale
subsample, Type: UniformFloat, Range: [0.01, 1.0], Default: 0.505
Conditions:
Forbidden Clauses:
(Forbidden: classifier == 'RandomForest' && Forbidden: criterion in {'friedman_mse', 'mae', 'mse'})
(Forbidden: classifier == 'GradientBoosting' && Forbidden: criterion in {'entropy', 'gini'})
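
The two forbidden clauses and the GradientBoosting-only conditions can be read as plain predicates on a configuration dictionary. The sketch below restates them in pure Python for illustration; it does not use ConfigSpace, and the function names is_forbidden and active_hyperparameters are hypothetical.

```python
# Criterion values grouped by the classifier for which the search space allows them.
RF_CRITERIA = {"gini", "entropy"}
GB_CRITERIA = {"friedman_mse", "mse", "mae"}

# Hyperparameters that are only active when the classifier is GradientBoosting.
GB_ONLY = ("loss", "learning_rate", "subsample")


def is_forbidden(config: dict) -> bool:
    """Return True if the (classifier, criterion) pair matches a forbidden clause."""
    clf, crit = config["classifier"], config["criterion"]
    if clf == "RandomForest" and crit in GB_CRITERIA:
        return True
    if clf == "GradientBoosting" and crit in RF_CRITERIA:
        return True
    return False


def active_hyperparameters(config: dict) -> dict:
    """Drop GradientBoosting-only entries unless that classifier is selected."""
    if config["classifier"] == "GradientBoosting":
        return dict(config)
    return {k: v for k, v in config.items() if k not in GB_ONLY}


rf_config = {"classifier": "RandomForest", "criterion": "gini", "loss": "deviance"}
bad_config = {"classifier": "RandomForest", "criterion": "mse"}

print(is_forbidden(rf_config), is_forbidden(bad_config))
print(active_hyperparameters(rf_config))
```

In the real search space, ConfigSpace enforces these rules while sampling, so the run function never receives a forbidden combination.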


Run the search for 20 model evaluations using the following command line:

bash
$ deephyper hps ambs --problem dhproj.advanced_hpo.problem.Problem --run dhproj.advanced_hpo.model_run.run --max-evals 20 --evaluator ray --n-jobs 4
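
The contract between the search and the run function can be illustrated with a much simpler stand-in. The sketch below uses plain random search, not AMBS, and a toy objective instead of model_run.run; both toy_run and random_search are hypothetical names. It only shows the loop shape: propose a configuration, call run, record the scalar objective, and keep the maximum after a fixed number of evaluations.

```python
import random


def toy_run(config: dict) -> float:
    """Hypothetical objective that peaks when max_depth equals 10."""
    return 1.0 - abs(config["max_depth"] - 10) / 50


def random_search(run, max_evals: int, seed: int = 0):
    """Evaluate max_evals random configurations and return the best one.

    This is a random-search stand-in, not the AMBS algorithm.
    """
    rng = random.Random(seed)
    history = []
    for _ in range(max_evals):
        config = {"max_depth": rng.randint(1, 50)}  # same range as the search space
        history.append((run(config), config))
    best_objective, best_config = max(history, key=lambda item: item[0])
    return best_config, best_objective, history


best_config, best_objective, history = random_search(toy_run, max_evals=20)
print(best_config, round(best_objective, 3))
```

A model-based search differs in how it proposes the next configuration, but it relies on the same requirement: run must return a single number to maximize.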


Once the search is over, the results.csv file contains the hyperparameter configurations evaluated during the search and their corresponding objective values (validation accuracy). Create test_best_config.py as given below. It extracts the best configuration from results.csv and evaluates it on the training, validation, and test sets.

from pprint import pprint

import pandas as pd
from deephyper.benchmark.datasets import airlines as dataset
from deephyper.problem import filter_parameters
from dhproj.advanced_hpo.mapping import CLASSIFIERS
from sklearn.utils import check_random_state

df = pd.read_csv("results.csv")
config = df.iloc[df.objective.argmax()][:-2].to_dict()
print("Best config is:")
pprint(config)

config["random_state"] = check_random_state(42)

rs_data = check_random_state(42)

ratio_test = 0.33
ratio_valid = (1 - ratio_test) * 0.33

train, valid, test = dataset.load_data(
    random_state=rs_data,
    test_size=ratio_test,
    valid_size=ratio_valid,
)

clf_class = CLASSIFIERS[config["classifier"]]
config["n_jobs"] = 4
clf_params = filter_parameters(clf_class, config)

clf = clf_class(**clf_params)

clf.fit(*train)

acc_train = clf.score(*train)
acc_valid = clf.score(*valid)
acc_test = clf.score(*test)

print(f"Accuracy on Training: {acc_train:.3f}")
print(f"Accuracy on Validation: {acc_valid:.3f}")
print(f"Accuracy on Testing: {acc_test:.3f}")
Accuracy on Training: 0.754
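
The best-row selection in test_best_config.py (argmax over the objective column, then dropping the last two columns) can be sketched with the standard library alone. The CSV content below is toy data invented for illustration. The column layout, in particular the name of the last column, is an assumption, and best_config is a hypothetical helper.

```python
import csv
import io

# Toy stand-in for results.csv; values and the elapsed_sec column name are assumptions.
TOY_RESULTS = """classifier,max_depth,objective,elapsed_sec
RandomForest,12,0.640,3.1
GradientBoosting,4,0.655,5.7
RandomForest,30,0.622,8.2
"""


def best_config(csv_text: str, n_trailing: int = 2) -> dict:
    """Return the hyperparameters of the row with the highest objective.

    The last n_trailing columns are treated as bookkeeping and dropped,
    mirroring the [:-2] slice in the script above.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    best = max(rows, key=lambda row: float(row["objective"]))
    hyperparameter_names = list(best)[:-n_trailing]
    return {name: best[name] for name in hyperparameter_names}


print(best_config(TOY_RESULTS))
```

Unlike pandas, csv.DictReader returns every value as a string, so numeric hyperparameters would need an explicit conversion before being passed to a classifier.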