3. Execution on the ThetaGPU supercomputer (with MPI)#
In this tutorial we are going to learn how to use DeepHyper on the ThetaGPU supercomputer at the ALCF using MPI. ThetaGPU is a 3.9 petaflops system based on NVIDIA DGX A100 nodes.
3.1. Submission Script#
This section of the tutorial shows how to submit a script to the COBALT scheduler of ThetaGPU. To execute DeepHyper on ThetaGPU with a submission script you need to:

1. Define a Bash script to initialize the environment (e.g., load a module, activate a conda environment).
2. Define an execution script, which calls the Bash script defined in step 1 and launches the Python script using mpirun.
Start by creating a script named activate-dhenv.sh to initialize your environment. Replace $CONDA_ENV_PATH with your personal conda environment (e.g., it can be replaced by base if no virtual environment is used):
#!/bin/bash
# Necessary for Bash shells
. /etc/profile
# Tensorflow optimized for A100 with CUDA 11
module load conda/2022-07-01
# Activate conda env
conda activate $CONDA_ENV_PATH
Tip
This activate-dhenv.sh script can be very useful to tailor the execution environment to your needs. Here are a few useful additions:

- To enable XLA-optimized compilation, add export TF_XLA_FLAGS=--tf_xla_enable_xla_devices
- To change the log level of TensorFlow, add export TF_CPP_MIN_LOG_LEVEL=3
Then create a new file named job-deephyper.sh and make it executable. It will correspond to your submission script.
$ touch job-deephyper.sh && chmod +x job-deephyper.sh
#!/bin/bash
#COBALT -q full-node
#COBALT -n 2
#COBALT -t 20
#COBALT -A $PROJECT_NAME
#COBALT --attrs filesystems=home,grand,eagle,theta-fs0
# User Configuration
INIT_SCRIPT=$PWD/activate-dhenv.sh
COBALT_JOBSIZE=2
RANKS_PER_NODE=8
# Initialization of environment
source $INIT_SCRIPT
mpirun -x LD_LIBRARY_PATH -x PYTHONPATH -x PATH -n $(( $COBALT_JOBSIZE * $RANKS_PER_NODE )) -N $RANKS_PER_NODE --hostfile $COBALT_NODEFILE python myscript.py
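Before launching the full search, it can be useful to verify that this mpirun line really spawns the expected number of ranks across the allocated nodes. Below is a minimal sanity-check sketch; the file name mpi_check.py is only an example, point the mpirun command at it in place of myscript.py:

import socket

from mpi4py import MPI

# Each rank prints its rank, the total number of ranks, and the host it runs on,
# which should add up to COBALT_JOBSIZE * RANKS_PER_NODE lines spread over the nodes.
comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()}/{comm.Get_size()} on {socket.gethostname()}")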
Note
About the COBALT directives:

- #COBALT -q full-node: the queue your job is submitted to. For ThetaGPU it can be single-gpu, full-node, or bigmem; see the ALCF documentation for the specifics of these queues.
- #COBALT -n 2: the number of nodes requested for your job.
- #COBALT -t 20: the duration of the job, in minutes.
- #COBALT -A $PROJECT_NAME: your current project, e.g., #COBALT -A datascience.
- #COBALT --attrs filesystems=home,grand,eagle,theta-fs0: the filesystems your application needs access to. DeepHyper only requires home and theta-fs0; do not keep filesystems in this list that neither your application nor DeepHyper needs.
Note
COBALT_JOBSIZE=2
RANKS_PER_NODE=8

COBALT_JOBSIZE and RANKS_PER_NODE correspond respectively to the number of nodes allocated and the number of processes per node. Unlike Theta, where COBALT_JOBSIZE is automatically set to the correct value, on ThetaGPU it has to be set by hand: COBALT_JOBSIZE should always match the number of nodes your application was submitted to (the number after #COBALT -n). For example, with #COBALT -n 4 you should set COBALT_JOBSIZE=4.
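As a quick sanity check of these two values, the arithmetic below shows how many processes the mpirun command launches. The split between search and workers is an assumption based on the manager/worker design of the mpicomm evaluator (one rank drives the search, the others evaluate configurations); compare it with the num_workers value printed by your script.

# Back-of-the-envelope check of the launch configuration (values from job-deephyper.sh).
COBALT_JOBSIZE = 2   # nodes requested with "#COBALT -n 2"
RANKS_PER_NODE = 8   # one rank per A100 GPU on a full node
total_ranks = COBALT_JOBSIZE * RANKS_PER_NODE  # 16 ranks launched by mpirun
num_workers = total_ranks - 1                  # assumption: rank 0 runs the search
print(total_ranks, num_workers)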
Adapt the executed Python application to your needs. Here is an example application running a CBO search with the mpicomm evaluator:
import pathlib
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
import mpi4py
mpi4py.rc.initialize = False
mpi4py.rc.threads = True
mpi4py.rc.thread_level = "multiple"
mpi4py.rc.recv_mprobe = False
from deephyper.evaluator import Evaluator
from deephyper.hpo import CBO
from mpi4py import MPI
if not MPI.Is_initialized():
    MPI.Init_thread()
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
from deephyper.hpo import HpProblem
hp_problem = HpProblem()
hp_problem.add_hyperparameter((-10.0, 10.0), "x")
def run(config):
    return -config["x"] ** 2
timeout = 10
search_log_dir = "search_log/"
pathlib.Path(search_log_dir).mkdir(parents=False, exist_ok=True)
if rank == 0:
    # Evaluator creation
    print("Creation of the Evaluator...")
    with Evaluator.create(
        run,
        method="mpicomm",
    ) as evaluator:
        if evaluator is not None:
            print(f"Creation of the Evaluator done with {evaluator.num_workers} worker(s)")

            # Search creation
            print("Creation of the search instance...")
            search = CBO(
                hp_problem,
                evaluator,
            )
            print("Creation of the search done")

            # Search execution
            print("Starting the search...")
            results = search.search(timeout=timeout)
            print("Search is done")

            results.to_csv(os.path.join(search_log_dir, "results.csv"))
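Once the search completes, the results file written by rank 0 can be inspected with pandas. This is a minimal sketch; the objective column is always present, but the exact name of the hyperparameter column (x or p:x) depends on your DeepHyper version:

import pandas as pd

# Load the results written by rank 0 and display the best configurations found.
results = pd.read_csv("search_log/results.csv")
print(results.sort_values("objective", ascending=False).head())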
Note
If you are using TensorFlow, you might want to add this snippet to your script to ensure that each worker is restricted to its own GPU (and does not touch the memory used by other workers):
import logging

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Map each MPI rank to one GPU of its node
gpu_local_idx = rank % gpu_per_node
node = rank // gpu_per_node

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Restrict TensorFlow to only use its assigned local GPU
    try:
        tf.config.set_visible_devices(gpus[gpu_local_idx], "GPU")
        tf.config.experimental.set_memory_growth(gpus[gpu_local_idx], True)
        logical_gpus = tf.config.list_logical_devices("GPU")
        logging.info(f"[r={rank}]: {len(gpus)} Physical GPUs, {len(logical_gpus)} Logical GPU")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        logging.info(f"{e}")
With gpu_per_node being equal to the RANKS_PER_NODE value specified in job-deephyper.sh.
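Since the submission script above does not forward RANKS_PER_NODE to the Python process (mpirun -x only forwards LD_LIBRARY_PATH, PYTHONPATH, and PATH), gpu_per_node has to be defined in myscript.py itself. A minimal sketch, assuming a ThetaGPU full node (DGX A100 with 8 GPUs):

import os

# Default to 8 GPUs per node; override by exporting RANKS_PER_NODE yourself if needed.
gpu_per_node = int(os.environ.get("RANKS_PER_NODE", 8))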
Note
If you are using PyTorch, most of the code of myscript.py starts off similarly:
import mpi4py

# The rc options must be set before importing the MPI submodule
mpi4py.rc.initialize = False
mpi4py.rc.threads = True
mpi4py.rc.thread_level = "multiple"

from mpi4py import MPI

if not MPI.Is_initialized():
    MPI.Init_thread()

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
To use PyTorch properly while maintaining flexibility, it is important to define the device each evaluation is sent to. Because every rank initially runs the same script, each worker selects its device from its rank inside the run function. Three scenarios can occur on any given machine:
if is_gpu_available and n_gpus > 1:
    device = torch.device("cuda", rank)
    print("Running on GPU ", rank)
elif is_gpu_available and n_gpus == 1:
    device = torch.device("cuda", 0)
    print("Running on the GPU")
else:
    device = torch.device("cpu")
    print("Running on the CPU")
After defining the black-box function, we compute the result on the assigned device. Note that while in this instance the run function returns objective.item(), all that is needed is a valid float or integer. A portion of myscript.py should resemble the following:
def run(config: dict):
    ### Change as needed
    if is_gpu_available and n_gpus > 1:
        device = torch.device("cuda", rank)
        print("Running on GPU ", rank)
    elif is_gpu_available and n_gpus == 1:
        device = torch.device("cuda", 0)
        print("Running on the GPU")
    else:
        device = torch.device("cpu")
        print("Running on the CPU")

    objective = torch.tensor([-config["x"] ** 2]).to(device)

    return objective.item()
if __name__ == "__main__":
    is_gpu_available = torch.cuda.is_available()
    n_gpus = torch.cuda.device_count() if is_gpu_available else 0

    hp_problem = HpProblem()
    hp_problem.add_hyperparameter((-10.0, 10.0), "x")

    if rank == 0:
        # Evaluator creation
        print("Creation of the Evaluator...")
        with Evaluator.create(
            run,
            method="mpicomm",
        ) as evaluator:
            if evaluator is not None:
                print(f"Creation of the Evaluator done with {evaluator.num_workers} worker(s)")

                # Search creation
                print("Creation of the search instance...")
                search = CBO(
                    hp_problem,
                    evaluator,
                )
                print("Creation of the search done")

                # Search execution
                print("Starting the search...")
                results = search.search(timeout=timeout)
                print("Search is done")

                results.to_csv(os.path.join(search_log_dir, "results.csv"))
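The snippet above assumes the same header as the CBO example earlier in this section, plus the PyTorch import. A minimal sketch of the missing top of this PyTorch myscript.py, reusing the same timeout and log directory values:

import os
import pathlib

import torch

from deephyper.evaluator import Evaluator
from deephyper.hpo import CBO, HpProblem

# Same constants as in the CBO example above.
timeout = 10
search_log_dir = "search_log/"
pathlib.Path(search_log_dir).mkdir(parents=False, exist_ok=True)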
Finally, submit the script using:
qsub-gpu job-deephyper.sh