2. Execution on the ThetaGPU supercomputer (with Ray)
In this tutorial we are going to learn how to use DeepHyper on the ThetaGPU supercomputer at the ALCF with Ray. ThetaGPU is a 3.9-petaflops system based on the NVIDIA DGX A100.
2.1. Submission Script
This section of the tutorial shows how to submit a script to the COBALT scheduler of ThetaGPU. Executing DeepHyper on ThetaGPU through a submission script requires you to:
Define a Bash script to initialize the environment (e.g., load a module, activate a conda environment).
Define a submission script composed of 3 steps: (1) launch a Ray cluster on the available resources, (2) execute a Python application which connects to the Ray cluster, and (3) stop the Ray cluster.
Start by creating a script named activate-dhenv.sh to initialize your environment. It will be used to initialize each allocated compute node. Replace $CONDA_ENV_PATH with your personal conda environment (e.g., it can be replaced by base if no virtual environment is used):
activate-dhenv.sh
#!/bin/bash
# Necessary for Bash shells
. /etc/profile
# Tensorflow optimized for A100 with CUDA 11
module load conda/2022-07-01
# Activate conda env
conda activate $CONDA_ENV_PATH
Tip
This activate-dhenv.sh script can be very useful to tailor the execution environment to your needs. Here are a few additions that can be useful:
To enable XLA-optimized compilation, add
export TF_XLA_FLAGS=--tf_xla_enable_xla_devices
To change the log level of TensorFlow, add
export TF_CPP_MIN_LOG_LEVEL=3
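These variables can also be set directly from your Python application, as long as it is done before TensorFlow is first imported. A minimal sketch of this alternative:
import os
# Must be set before the first `import tensorflow` to take effect.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
os.environ["TF_XLA_FLAGS"] = "--tf_xla_enable_xla_devices"
import tensorflow as tf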
Then create a new file named job-deephyper.sh and make it executable. It will be your submission script.
$ touch job-deephyper.sh && chmod +x job-deephyper.sh
Add the following content:
job-deephyper.sh
#!/bin/bash
#COBALT -q full-node
#COBALT -n 2
#COBALT -t 20
#COBALT -A $PROJECT_NAME
#COBALT --attrs filesystems=home,grand,eagle,theta-fs0
# User Configuration
EXP_DIR=$PWD
INIT_SCRIPT=$PWD/activate-dhenv.sh
CPUS_PER_NODE=8
GPUS_PER_NODE=8
# Initialization of environment
source $INIT_SCRIPT
# Getting the node names
mapfile -t nodes_array < $COBALT_NODEFILE
head_node=${nodes_array[0]}
head_node_ip=$(dig $head_node a +short | awk 'FNR==2')
# Starting the Ray Head Node
port=6379
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"
echo "Starting HEAD at $head_node"
ssh -tt $head_node_ip "source $INIT_SCRIPT; cd $EXP_DIR; \
ray start --head --node-ip-address=$head_node_ip --port=$port \
--num-cpus $CPUS_PER_NODE --num-gpus $GPUS_PER_NODE --block" &
# Optional, though may be useful in certain versions of Ray < 1.0.
sleep 10
# Number of nodes other than the head node
nodes_num=$((${#nodes_array[*]} - 1))
echo "$nodes_num nodes"
for ((i = 1; i <= nodes_num; i++)); do
node_i=${nodes_array[$i]}
node_i_ip=$(dig $node_i a +short | awk 'FNR==1')
echo "Starting WORKER $i at $node_i with ip=$node_i_ip"
ssh -tt $node_i_ip "source $INIT_SCRIPT; cd $EXP_DIR; \
ray start --address $ip_head \
--num-cpus $CPUS_PER_NODE --num-gpus $GPUS_PER_NODE --block" &
sleep 5
done
# Check the status of the Ray cluster
ssh -tt $head_node_ip "source $INIT_SCRIPT && ray status"
# Run the search
ssh -tt $head_node_ip "source $INIT_SCRIPT && cd $EXP_DIR && python myscript.py"
# Stop the Ray cluster
ssh -tt $head_node_ip "source $INIT_SCRIPT && ray stop"
Note
About the COBALT directives:
#COBALT -q full-node
The queue the job is submitted to. For ThetaGPU it can either be single-gpu, full-node, or bigmem; the specifics of these queues are described in the ALCF documentation.
#COBALT -n 2
The number of nodes allocated to the job.
#COBALT -t 20
The walltime of the job, in minutes.
#COBALT -A $PROJECT_NAME
The project to charge, e.g., #COBALT -A datascience.
#COBALT --attrs filesystems=home,grand,eagle,theta-fs0
The filesystems your application needs access to. DeepHyper itself only requires home and theta-fs0; it is unnecessary to list filesystems your application (and DeepHyper) does not need.
Adapt the executed Python application to your needs. Here is an example application running CBO with the ray evaluator:
myscript.py
import pathlib
import os

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

from deephyper.evaluator import Evaluator
from deephyper.search.hps import CBO
from deephyper.problem import HpProblem
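# Define the search space: a single hyperparameter "x" in [-10.0, 10.0]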
hp_problem = HpProblem()
hp_problem.add_hyperparameter((-10.0, 10.0), "x")
def run(config):
    # DeepHyper maximizes the objective; -x**2 is maximal at x = 0.
    return -config["x"] ** 2
timeout = 10  # duration of the search, in seconds
search_log_dir = "search_log/"
pathlib.Path(search_log_dir).mkdir(parents=False, exist_ok=True)
# Evaluator creation
print("Creation of the Evaluator...")
evaluator = Evaluator.create(
    run,
    method="ray",
    method_kwargs={
        "address": "auto",  # connect to the Ray cluster started by job-deephyper.sh
        "num_gpus_per_task": 1,  # reserve 1 GPU per evaluation
    },
)
print(f"Creation of the Evaluator done with {evaluator.num_workers} worker(s)")
# Search creation
print("Creation of the search instance...")
search = CBO(
    hp_problem,
    evaluator,
)
print("Creation of the search done")
# Search execution
print("Starting the search...")
results = search.search(timeout=timeout)
print("Search is done")
results.to_csv(os.path.join(search_log_dir, "results.csv"))
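Once the job has completed, the collected results can be inspected with pandas. A minimal sketch, assuming the default objective column name (the hyperparameter column, e.g. x or p:x, depends on your DeepHyper version):
import pandas as pd

# Each row of results.csv is one evaluated configuration and its objective.
results = pd.read_csv("search_log/results.csv")
print(results.sort_values("objective", ascending=False).head())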
Finally, submit the script using:
$ qsub-gpu job-deephyper.sh
Note
The ssh -tt $head_node_ip "source $INIT_SCRIPT && ray status" command is used to check that the Ray cluster was correctly initialized. Once the job starts running, check the *.output file and verify that the number of detected GPUs is correct.
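The same verification can also be done programmatically. A minimal sketch, assuming it is executed on the head node while the Ray cluster is running (for instance at the top of myscript.py):
import ray

# Connect to the existing Ray cluster started by job-deephyper.sh.
ray.init(address="auto")

# With 2 nodes of 8 GPUs each, the "GPU" entry should be 16.0.
print(ray.cluster_resources())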