Tuning
Hyperparameter tuning is powered by Ray Tune. We utilize a wrapper library, lightray, that simplifies the use of Ray Tune with PyTorch Lightning LightningCLI’s.
Initialize a Tune Experiment
A new tuning experiment can be initialized using the amplfi-init command.
For example, to initialize a directory to train a flow, run
amplfi-init --mode flow --pipeline tune --directory ~/amplfi/my-first-tune/
This will create a directory at ~/amplfi/my-first-tune/, and populate it with
configuration files for the run. The train.yaml contains the main configuration for the training.
datagen.cfg controls the configuration for querying training and testing strain data.
tune.yaml configures parameters that control how Ray will perform the hyperparameter tuning.
Configuring an Experiment
A key ingredient in the tuning job is the parameter space that is searched over. This can be configured
via the param_space parameter in the tune.yaml configuration file.
# tune.yaml
param_space:
model.learning_rate: tune.loguniform(1e-3, 4)
data.kernel_length: tune.choice([1, 2])
the parameter names should be python “dot paths” to attributes in the train.yaml. Any
parameters set in the search space will be sampled from the distribution
when each trial is launched, and override the value set in train.yaml.
Most of the parameters from the ray.tune.Tuner are also configurable, including the tuning scheduler and search algorithm. Please see the ray tune documentation for more information.
You can see a full list of configuration by running
lightray --help
Launching a Run
The entrypoint to the tuning pipeline is the run.sh file generated in the experiment directory.
# run.sh
#!/bin/bash
# Export environment variables
export AMPLFI_DATADIR=/home/albert.einstein/amplfi/my-first-tune
export AMPLFI_OUTDIR=/home/albert.einstein/amplfi/my-first-tune/runs/
export AMPLFI_CONDORDIR=/home/albert.einstein/amplfi/my-first-tune/condor
CUDA_VISIBLE_DEVICES=0
# launch the data generation pipeline
LAW_CONFIG_FILE=/home/albert.einstein/amplfi/my-first-tune/datagen.cfg law run amplfi.law.DataGeneration --workers 5
# launch training or tuning pipeline
lightray --config tune.yaml -- --config cbc.yaml
If you’ve run the training pipeline this should look familiar: environment variables control the location where data is stored and where the tuning runs will be stored. There’s a command to launch the data generation pipeline, followed by a command to launch the tuning job.
Local Tuning
If the address parameter in the tune.yaml is set to null (the default), then a local Ray cluster will be initialized.
The tuning will then use local resources. The amount of resources to be alloated per trial can be controlled by the
gpus_per_worker, and cpus_per_gpu arguments. The CUDA_VISIBLE_DEVICES environment variable will control the available GPU resources
exposed to the job.
Remote Tuning
Tuning can also be performed via a remote Ray cluster. Assuming you have properly set up your cluster worker nodes with access
to a remote data directory on s3, and weights and biases (more on this below), then launching a remote tuning job is as simple as passing the
ip address of your Ray clusters head node to the address variable.
Running tuning remotely will require that your data directory live on an s3 storage system. To generate data
that is autmoatically moved to an s3 bucket, you can simply set the AMPLFI_DATADIR environment variable to an s3 path
in the run.sh! You’ll also need to set the AMPLFI_OUTDIR to an s3 location.
# run.sh
export AMPLFI_DATADIR=s3://my-bucket/my-first-tune/data
export AMPLFI_OUTDIR=s3://my-bucket/my-first-tune/runs
...
Kubernetes Ray Cluster
Note
Please see the ml4gw quickstart for help installing the necessary tools (
helm,kubernetes,s3cmd) and configuration (weights and biases, s3 credentials) to run remote tuning. This quickstart includes a comprehensive Makefile to install this tooling in a fresh conda environment, and instructions on settting up necessary credentials.
lightray ships a helm chart that can be used to launch a ray head and worker nodes on a remote kubernetes cluster.
First, add the helm repository
helm repo add lightray https://ethanmarx.github.io/lightray/
The helm chart comes with some configuration you’ll need to set. To pull the “values” configuration template, run
helm show values lightray/ray-cluster >> values.yaml
Specifically, you’ll need to set the container to the remote amplfi image
image: ghcr.io/ml4gw/amplfi/amplfi:main
And you’ll also need to set your WANDB_API_KEY, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY
to the corresponding variable so that the remote cluster can access your data on s3, and upload to weights and biases.
Then, you can install the cluster. You can name the installation anything. Here we name it my-ray-cluster
helm install my-ray-cluster lightray/ray-cluster -f values.yaml
To monitor the status of your pods, run
kubectl get pods
You should see something like
NAME READY STATUS RESTARTS AGE
my-ray-cluster-head-7b9597fdd8-brrlm 0/1 ContainerCreating 0 2s
my-ray-cluster-worker-bd6698d67-49p6x 0/1 ContainerCreating 0 2s
Once the head and at least one worker pod are in the RUNNING state, you can query the
kubernetes Service corresponding to the head node for it’s ip address:
$ kubectl get service my-ray-cluster-head-loadbalancer -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
pass this ip address, to the address parameter in tune.yaml with the format ray://{ip}:10001. For example,
if the ip address was 11.22.10.27 you would set
address = ray://11.22.10.27:10001
Now, launch the run!
lightray --config tune.yaml -- --config cbc.yaml
Note
Remember to clean up your kubernetes jobs! You can uninstall all resources
created by the helm chart with helm uninstall {chart-name}
Syncing Remote Code
In some cases, it is necessary to launch a tuning job with code changes that haven’t been integrated into the AMPLFI main branch,
and thus have not been pushed to the remote container.
To allow this, the lightray/ray-cluster chart supports an optional git-sync initContainer
that will clone and mount remote code inside the kubernetes pods.
To use this with AMPLFI, you will need to configure the following in the charts values.yaml file
# set dev to true
dev: true
gitRepo:
# name must be set to amplfi
name: amplfi
# set to repo you want to mount
url: git@github.com:albert.einstein/amplfi.git
# set ref to branch name or commit hash
ref: my-branch
# mountPath must be set to /opt
mountPath: /opt
SLURM Ray cluster
In order to use compute resources that are managed via SLURM, the
steps to start the Ray cluster is different. Once started, the rest of the
steps using lightray is similar to that mentioned above. Note that the steps below closely
resembles the deploy on SLURM
in the Ray documentation.
The following example has been created using NCSA Delta.
Also, ensure that the apptainer image already built, referred to as ${AMPLFI_CONTAINER_ROOT}/amplfi.sif.
SBATCH directives
#!/bin/bash
#SBATCH --nodes=10
#SBATCH --tasks-per-node=1
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=3
#SBATCH --job-name=my-tuner
#SBATCH --account=<account-name>
#SBATCH --output=%x.out
#SBATCH --error=%x.err
#SBATCH --time=6:00:00
#SBATCH --partition=gpuA100x4
#SBATCH --mem-per-cpu=10GB
The first four lines are the relevant ones. In this case, the resources reserved by SLURM will be used for 10 Ray workers each with one GPU and three CPUs.
Note
For the CPU resources, providing one more than that used by a single worker is recommended.
In this case this implies tune.yaml should have: cpus_per_trial: 2,
gpus_per_trial: 1. Adjust based on the workers that the dataloaders use.
Head worker
Out of the 10 nodes, the first landing machine will be used for the head worker.
head_node=$(hostname | cut -d. -f1)
# the cut step is specific to Delta and may not be needed in general
head_node_ip=$(hostname --ip-address)
port=6379
echo "#### STARTING HEAD at $head_node ####"
echo "#### HEAD NODE IP: $head_node_ip ####"
srun --nodes=1 --ntasks=1 -w "$head_node" \
apptainer run --bind ${AMPLFI_DATADIR},${AMPLFI_OUTDIR} --nv \
${AMPLFI_CONTAINER_ROOT}/amplfi.sif \
ray start --head --node-ip-address="$head_node_ip" --port=$port \
--num-cpus "${SLURM_CPUS_PER_TASK}" \
--num-gpus 1 --block &
sleep 10
echo "#### HEAD NODE ASSUMED TO HAVE STARTED ####"
Adjust the sleep 10 statement in case you find the next step start before the head is up.
Remaining workers
Use the address of the head node to start the worker nodes.
worker_num=$(($SLURM_JOB_NUM_NODES - 1))
srun --ntasks=$worker_num --nodes=$worker_num --ntasks-per-node=1 --exclude=$head_node \
apptainer run --bind ${AMPLFI_DATADIR},${AMPLFI_OUTDIR} --nv \
${AMPLFI_CONTAINER_ROOT}/amplfi.sif \
ray start --address $head_node_ip:$port \
--num-cpus "${SLURM_CPUS_PER_TASK}" \
--num-gpus 1 --block &
echo "#### SLEEPING FOR 60s BEFORE CALLING SCRIPT ####"
sleep 60
The --ntasks=$worker_num and --ntasks-per-node=1 will ensure only one instance of Ray is started on
the remaining nodes, and they find the head through --address $head_node_ip:$port. Adjust
the sleep duration based on whether the full Ray cluster becomes active.
Note
Because the individual training runs will be sent to the nodes above, ensure that all the necessary mounts are bound
using the --bind for every apptainer run entrypoint. If you are getting a permission issue, check if you missed binding a mount.
Launch HPO using lightray
Finally, launch the hyperparameter tuning using lightray as above.
echo "#### ASSUMING RAY CLUSTER IS UP, CALLING SCRIPT ####"
apptainer run --bind ${AMPLFI_DATADIR},${AMPLFI_OUTDIR} \
--nv ${AMPLFI_CONTAINER_ROOT}/amplfi.sif \
lightray --config tune.yaml --ray_init.configure_logging false -- \
--config cbc.yaml
We have to set configure_logging=False in ray.init to since by default the logging
is done under /tmp which may point to different filesystems on different nodes. This is fine
since the logs will be directed to the stdout and stderr files in the SBATCH directives.
Put all the steps above in a single file called tune.slurm and submit it.
$ sbatch tune.slurm