2. Execution on the ThetaGPU supercomputer
In this tutorial we are going to learn how to use DeepHyper on the ThetaGPU supercomputer at the ALCF. ThetaGPU is a 3.9-petaflops system based on NVIDIA DGX A100 nodes.
2.1. Submission Script
This section of the tutorial shows how to submit a script to the COBALT scheduler of ThetaGPU. To execute DeepHyper on ThetaGPU with a submission script you need to:
Define a Bash script to initialize the environment (e.g., load a module, activate a conda environment).
Define a submission script composed of 3 steps: (1) launch a Ray cluster on the available resources, (2) execute a Python application which connects to the Ray cluster (sketched just below), and (3) stop the Ray cluster.
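For step (2), the Python application only needs to attach to the Ray cluster started in step (1). Below is a minimal sketch of such an application (a hypothetical check_ray.py, not part of this tutorial, using only the Ray API); in the submission script shown later this role is played by the deephyper command line, which attaches to the cluster through --ray-address auto.
check_ray.py
"""Minimal sketch (hypothetical): attach to the running Ray cluster and list its resources."""
import ray

# Connect to the cluster started with `ray start --head` in step (1)
ray.init(address="auto")

# Show the aggregated resources (CPUs/GPUs) visible across all nodes
print(ray.cluster_resources())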
Start by creating a script named init-dh-environment.sh to initialize your environment. It will be run on each compute node used by the job. Replace $CONDA_ENV_PATH with your personal conda environment (e.g., it can be replaced by base if no virtual environment is used):
init-dh-environment.sh
#!/bin/bash
# Necessary for Bash shells
. /etc/profile
# Tensorflow optimized for A100 with CUDA 11
module load conda/2021-09-22
# Activate conda env
conda activate $CONDA_ENV_PATH
Tip
This init-dh-environment script can be used to tailor the execution environment to your needs. Here are a few useful tips:
To activate XLA-optimized compilation, add
export TF_XLA_FLAGS=--tf_xla_enable_xla_devices
To change the log level of TensorFlow, add
export TF_CPP_MIN_LOG_LEVEL=3
Then create a new file named deephyper-job.qsub and make it executable. This will be your submission script.
$ touch deephyper-job.qsub && chmod +x deephyper-job.qsub
Add the following content:
deephyper-job.qsub
#!/bin/bash
#COBALT -A $PROJECT_NAME
#COBALT -n 2
#COBALT -q full-node
#COBALT -t 20
# User Configuration
EXP_DIR=$PWD
INIT_SCRIPT=$PWD/init-dh-environment.sh
CPUS_PER_NODE=8
GPUS_PER_NODE=8
# Initialization of environment
source $INIT_SCRIPT
# Getting the node names
mapfile -t nodes_array < $COBALT_NODEFILE
head_node=${nodes_array[0]}
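# Resolve the head node's IP address (dig can return several records; the second one is kept)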
head_node_ip=$(dig $head_node a +short | awk 'FNR==2')
# Starting the Ray Head Node
port=6379
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"
echo "Starting HEAD at $head_node"
ssh -tt $head_node_ip "source $INIT_SCRIPT; cd $EXP_DIR; \
ray start --head --node-ip-address=$head_node_ip --port=$port \
--num-cpus $CPUS_PER_NODE --num-gpus $GPUS_PER_NODE --block" &
# Optional, though may be useful in certain versions of Ray < 1.0.
sleep 10
# Number of nodes other than the head node
worker_num=$((${#nodes_array[*]} - 1))
echo "$worker_num workers"
for ((i = 1; i <= worker_num; i++)); do
node_i=${nodes_array[$i]}
node_i_ip=$(dig $node_i a +short | awk 'FNR==1')
echo "Starting WORKER $i at $node_i with ip=$node_i_ip"
ssh -tt $node_i_ip "source $INIT_SCRIPT; cd $EXP_DIR; \
ray start --address $ip_head \
--num-cpus $CPUS_PER_NODE --num-gpus $GPUS_PER_NODE --block" &
sleep 5
done
# Check the status of the Ray cluster
ssh -tt $head_node_ip "source $INIT_SCRIPT && ray status"
# Run the search
ssh -tt $head_node_ip "source $INIT_SCRIPT && cd $EXP_DIR && \
deephyper nas random \
--problem deephyper.benchmark.nas.linearRegHybrid.Problem \
--evaluator ray \
--run-function deephyper.nas.run.run_base_trainer \
--num-workers -1 \
--ray-address auto \
--ray-num-cpus-per-task 1 \
--ray-num-gpus-per-task 1 \
--verbose 1"
# Stop the Ray cluster
ssh -tt $head_node_ip "source $INIT_SCRIPT && ray stop"
Edit the #COBALT ... directives:
#COBALT -A $PROJECT_NAME
#COBALT -n 2
#COBALT -q full-node
#COBALT -t 20
and adapt the executed DeepHyper command to your needs:
ssh -tt $head_node_ip "source $INIT_SCRIPT && cd $EXP_DIR && \
deephyper nas random \
--problem deephyper.benchmark.nas.linearRegHybrid.Problem \
--evaluator ray \
--run-function deephyper.nas.run.run_base_trainer \
--num-workers -1 \
--ray-address auto \
--ray-num-cpus-per-task 1 \
--ray-num-gpus-per-task 1 \
--verbose 1"
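If you prefer to drive the search from your own Python script instead of the deephyper command line, the script can be launched the same way (e.g., python myscript.py in place of the deephyper ... command above). The following is only an illustrative sketch: the file name myscript.py, the toy run function, and the search space are hypothetical, and the HpProblem / Evaluator.create / AMBS calls assume the DeepHyper 0.3.x Python API, so adapt them to your installed version.
myscript.py
"""Illustrative sketch (assumes the DeepHyper 0.3.x Python API): run a hyperparameter
search on the Ray cluster started by deephyper-job.qsub."""
from deephyper.problem import HpProblem
from deephyper.evaluator import Evaluator
from deephyper.search.hps import AMBS

# Toy search space and run function: replace them with your own problem and training code
problem = HpProblem()
problem.add_hyperparameter((1e-4, 1e-1, "log-uniform"), "lr")

def run(config):
    # DeepHyper maximizes the returned objective
    return -config["lr"]

if __name__ == "__main__":
    # Attach to the already running Ray cluster and use 1 CPU and 1 GPU per evaluation
    evaluator = Evaluator.create(
        run,
        method="ray",
        method_kwargs={
            "address": "auto",
            "num_cpus_per_task": 1,
            "num_gpus_per_task": 1,
        },
    )
    search = AMBS(problem, evaluator)
    results = search.search(max_evals=30)
    print(results)
The method_kwargs above are meant to mirror the --ray-address, --ray-num-cpus-per-task, and --ray-num-gpus-per-task flags of the command shown earlier.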
Finally, submit the script from a ThetaGPU service node (e.g., thetagpusn1):
qsub deephyper-job.qsub
Note
The ssh -tt $head_node_ip "source $INIT_SCRIPT && ray status"
command is used to verify that the Ray cluster was initialized correctly. Once the job starts running, check the *.output file and verify that the number of detected GPUs is correct.
2.2. Jupyter Notebook
This section of the tutorial shows how to run an interactive Jupyter notebook on ThetaGPU. After logging in to Theta:
1. From a thetalogin node, run ssh thetagpusn1 to log in to a ThetaGPU service node.
2. From thetagpusn1, start an interactive job (note that the thetagpuXX node you get placed onto will vary) after replacing your $PROJECT_NAME and $QUEUE_NAME (examples of available queues are full-node and single-gpu):
(thetagpusn1) $ qsub -I -A $PROJECT_NAME -n 1 -q $QUEUE_NAME -t 60
Job routed to queue "full-node".
Wait for job 10003623 to start...
Opening interactive session to thetagpu21
3. Wait for the interactive session to start. Then, from the ThetaGPU compute node (thetagpuXX), execute the following commands to initialize your DeepHyper environment (adapt them to your needs):
$ . /etc/profile
$ module load conda/2021-09-22
$ conda activate $CONDA_ENV_PATH
4. Start the Jupyter notebook server:
$ jupyter notebook &
Note
In the case of a multi-GPU node, the Jupyter notebook process may lock one of the available GPUs. Therefore, launch the notebook with the following command instead:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 jupyter notebook &
5. Take note of the hostname of the current compute node (e.g., thetagpuXX):
echo $HOSTNAME
6. Leave the interactive session running and open a new terminal window on your local machine.
7. In the new terminal window, execute the SSH command to link the local port to the ThetaGPU compute node, after replacing your $USERNAME and the corresponding thetagpuXX:
$ ssh -tt -L 8888:localhost:8888 $USERNAME@theta.alcf.anl.gov "ssh -L 8888:localhost:8888 thetagpuXX"
8. Open the Jupyter URL (http://localhost:8888/?token=…) in a local browser. This URL was printed out at step 4.