Tuning

Hyperparameter tuning is powered by Ray Tune. We use a wrapper library, lightray, that simplifies using Ray Tune with PyTorch Lightning’s LightningCLI.

Initialize a Tune Experiment

A new tuning experiment can be initialized using the amplfi-init command. For example, to initialize a directory to train a flow, run

amplfi-init --mode flow --pipeline tune --directory ~/amplfi/my-first-tune/

This will create a directory at ~/amplfi/my-first-tune/, and populate it with configuration files for the run. The train.yaml contains the main configuration for the training. datagen.cfg controls the configuration for querying training and testing strain data. tune.yaml configures parameters that control how Ray will perform the hyperparameter tuning.

Configuring an Experiment

A key ingredient in the tuning job is the parameter space that is searched over. This can be configured via the param_space parameter in the tune.yaml configuration file.

# tune.yaml
param_space:
  model.learning_rate: tune.loguniform(1e-3, 4)
  data.kernel_length: tune.choice([1, 2])

The parameter names should be Python “dot paths” to attributes in train.yaml. Each parameter in the search space is sampled from its distribution when a trial is launched, and the sampled value overrides the value set in train.yaml.
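
As a rough illustration of what sampling and overriding mean here, the following is a minimal standalone sketch and not lightray's or Ray Tune's actual implementation. The helper names (sample_loguniform, set_by_dot_path) and the default values standing in for train.yaml are hypothetical.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def set_by_dot_path(config, dot_path, value):
    """Set config["a"]["b"] = value for dot_path "a.b", creating levels as needed."""
    *parents, leaf = dot_path.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


# Stand-in for values loaded from train.yaml (hypothetical defaults).
train_config = {
    "model": {"learning_rate": 1e-3},
    "data": {"kernel_length": 1},
}

# One trial's draw from the search space shown in tune.yaml above.
rng = random.Random(0)
trial_overrides = {
    "model.learning_rate": sample_loguniform(1e-3, 4, rng),
    "data.kernel_length": rng.choice([1, 2]),
}

# The sampled values replace the defaults for this trial only.
for path, value in trial_overrides.items():
    set_by_dot_path(train_config, path, value)

print(train_config)
```

A log-uniform distribution spreads samples evenly across orders of magnitude, which is why it is a common choice for learning rates.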

Most parameters of ray.tune.Tuner are also configurable, including the tuning scheduler and search algorithm. Please see the Ray Tune documentation for more information.

You can see the full list of configuration options by running

lightray --help

Launching a Run

The entrypoint to the tuning pipeline is the run.sh file generated in the experiment directory.

# run.sh

#!/bin/bash
# Export environment variables
export AMPLFI_DATADIR=/home/albert.einstein/amplfi/my-first-tune
export AMPLFI_OUTDIR=/home/albert.einstein/amplfi/my-first-tune/runs/
export AMPLFI_CONDORDIR=/home/albert.einstein/amplfi/my-first-tune/condor

# export so that the commands launched below inherit the setting
export CUDA_VISIBLE_DEVICES=0

# launch the data generation pipeline
LAW_CONFIG_FILE=/home/albert.einstein/amplfi/my-first-tune/datagen.cfg law run amplfi.law.DataGeneration --workers 5

# launch training or tuning pipeline
lightray --config tune.yaml -- --config cbc.yaml

If you’ve run the training pipeline this should look familiar: environment variables control the location where data is stored and where the tuning runs will be stored. There’s a command to launch the data generation pipeline, followed by a command to launch the tuning job.

Local Tuning

If the address parameter in tune.yaml is set to null (the default), a local Ray cluster is initialized and tuning uses local resources. The resources allocated per trial are controlled by the gpus_per_worker and cpus_per_gpu arguments. The CUDA_VISIBLE_DEVICES environment variable controls which GPUs are exposed to the job.
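
To get a feel for how these settings bound the number of trials that can run at once, here is a back-of-the-envelope sketch. It assumes each trial needs gpus_per_worker GPUs and gpus_per_worker * cpus_per_gpu CPUs; that reading of the two arguments is an assumption, and both helper functions are hypothetical rather than part of lightray.

```python
import os


def visible_gpu_count(env=None):
    """Count the device IDs listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset, in which case all GPUs on the
    machine are visible and the count cannot be inferred from the variable.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return len([device for device in value.split(",") if device.strip()])


def max_concurrent_trials(num_gpus, num_cpus, gpus_per_worker, cpus_per_gpu):
    """Upper bound on simultaneous trials under the assumption stated above."""
    # Assumption: CPUs per trial scale with the GPUs assigned to the trial.
    cpus_per_worker = gpus_per_worker * cpus_per_gpu
    return min(num_gpus // gpus_per_worker, num_cpus // cpus_per_worker)


gpus = visible_gpu_count({"CUDA_VISIBLE_DEVICES": "0,1,2,3"})
print(max_concurrent_trials(gpus, num_cpus=16, gpus_per_worker=1, cpus_per_gpu=2))
```

With four visible GPUs, 16 CPUs, one GPU per trial, and two CPUs per GPU, the GPU count is the limiting resource.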

Remote Tuning

Tuning can also be performed on a remote Ray cluster. Assuming your cluster’s worker nodes are set up with access to a remote data directory on s3 and to weights and biases (more on this below), launching a remote tuning job is as simple as passing the ip address of your Ray cluster’s head node to the address parameter.

Remote tuning requires that your data directory live on an s3 storage system. To generate data that is automatically moved to an s3 bucket, set the AMPLFI_DATADIR environment variable in run.sh to an s3 path. You’ll also need to set AMPLFI_OUTDIR to an s3 location.

# run.sh
export AMPLFI_DATADIR=s3://my-bucket/my-first-tune/data
export AMPLFI_OUTDIR=s3://my-bucket/my-first-tune/runs
...
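
Before launching a remote run, it can help to sanity-check that both variables really hold s3 paths rather than local ones. The helper below is a hypothetical standalone sketch using only the Python standard library; it is not part of AMPLFI.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key), rejecting anything else."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://my-bucket/my-first-tune/data")
print(bucket, key)
```

A local path such as /home/albert.einstein/amplfi/my-first-tune has no s3 scheme, so the helper raises ValueError for it.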

Kubernetes Ray Cluster

Note

Please see the ml4gw quickstart for help installing the necessary tools (helm, kubernetes, s3cmd) and configuration (weights and biases, s3 credentials) to run remote tuning. The quickstart includes a comprehensive Makefile to install this tooling in a fresh conda environment, and instructions on setting up the necessary credentials.

lightray ships a helm chart that can be used to launch a ray head and worker nodes on a remote kubernetes cluster.

First, add the helm repository

helm repo add lightray https://ethanmarx.github.io/lightray/

The helm chart comes with some configuration you’ll need to set. To pull the “values” configuration template, run

helm show values lightray/ray-cluster > values.yaml

Specifically, you’ll need to set the container to the remote amplfi image

image: ghcr.io/ml4gw/amplfi/amplfi:main

You’ll also need to set the WANDB_API_KEY, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY values so that the remote cluster can access your data on s3 and upload to weights and biases.

Then, you can install the cluster. You can name the installation anything; here we name it my-ray-cluster:

helm install my-ray-cluster lightray/ray-cluster -f values.yaml

To monitor the status of your pods, run

kubectl get pods

You should see something like

NAME                                   READY   STATUS              RESTARTS   AGE
my-ray-cluster-head-7b9597fdd8-brrlm    0/1     ContainerCreating   0          2s
my-ray-cluster-worker-bd6698d67-49p6x   0/1     ContainerCreating   0          2s

Once the head pod and at least one worker pod are in the Running state, you can query the kubernetes Service corresponding to the head node for its ip address:

$ kubectl get service my-ray-cluster-head-loadbalancer -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

Pass this ip address to the address parameter in tune.yaml using the format ray://{ip}:10001. For example, if the ip address was 11.22.10.27, you would set

address: ray://11.22.10.27:10001
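
The formatting step can be sketched as follows. The helper name ray_client_address is hypothetical; the port 10001 and the ray:// prefix come from the format given above. Validating the input guards against the case where the load balancer has not yet been assigned an ip and the kubectl query returns an empty string.

```python
import ipaddress


def ray_client_address(ip, port=10001):
    """Build the value for the address parameter from the head node's IPv4 address."""
    # Raises ValueError for empty or malformed input.
    ipaddress.IPv4Address(ip)
    return f"ray://{ip}:{port}"


print(ray_client_address("11.22.10.27"))
```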

Now, launch the run!

lightray --config tune.yaml -- --config cbc.yaml

Note

Remember to clean up your kubernetes resources! You can uninstall everything created by the helm chart with helm uninstall {release-name}, for example helm uninstall my-ray-cluster.

Syncing Remote Code

In some cases, it is necessary to launch a tuning job with code changes that haven’t been integrated into the AMPLFI main branch, and thus have not been pushed to the remote container.

To allow this, the lightray/ray-cluster chart supports an optional git-sync initContainer that will clone and mount remote code inside the kubernetes pods.

To use this with AMPLFI, you will need to configure the following in the chart’s values.yaml file

# set dev to true
dev: true

gitRepo:
    # name must be set to amplfi
    name: amplfi
    # set to repo you want to mount
    url: git@github.com:albert.einstein/amplfi.git
    # set ref to branch name or commit hash
    ref: my-branch
    # mountPath must be set to /opt
    mountPath: /opt

SLURM Ray cluster

To use compute resources managed by SLURM, the steps to start the Ray cluster are different. Once the cluster is started, the remaining steps using lightray are similar to those described above. Note that the steps below closely resemble the guide to deploying on SLURM in the Ray documentation.

The following example was created on NCSA Delta. Also, ensure that the apptainer image, referred to below as ${AMPLFI_CONTAINER_ROOT}/amplfi.sif, is already built.

SBATCH directives

#!/bin/bash
#SBATCH --nodes=10
#SBATCH --tasks-per-node=1
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=3
#SBATCH --job-name=my-tuner
#SBATCH --account=<account-name>
#SBATCH --output=%x.out
#SBATCH --error=%x.err
#SBATCH --time=6:00:00
#SBATCH --partition=gpuA100x4
#SBATCH --mem-per-cpu=10GB

The first four directives (--nodes, --tasks-per-node, --gpus-per-task, and --cpus-per-task) are the relevant ones. In this case, the resources reserved by SLURM will be used for 10 Ray workers, each with one GPU and three CPUs.

Note

For CPU resources, reserving one more CPU per task than a single trial uses is recommended. In this case, that implies tune.yaml should have cpus_per_trial: 2 and gpus_per_trial: 1. Adjust these based on the number of workers the dataloaders use.
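
The arithmetic behind this recommendation is small enough to write down. The helper below is a hypothetical sketch that maps the SBATCH values to the per-trial settings, keeping one CPU per task in reserve as recommended above.

```python
def trial_resources(cpus_per_task, gpus_per_task, reserve_cpus=1):
    """Derive per-trial settings for tune.yaml from the SBATCH per-task values."""
    if cpus_per_task <= reserve_cpus:
        raise ValueError("cpus_per_task must exceed the number of reserved CPUs")
    return {
        "cpus_per_trial": cpus_per_task - reserve_cpus,
        "gpus_per_trial": gpus_per_task,
    }


# Values from the SBATCH directives above: --cpus-per-task=3, --gpus-per-task=1
print(trial_resources(cpus_per_task=3, gpus_per_task=1))
```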

Head worker

Of the 10 nodes, the one on which the batch script first lands will be used for the head worker.

head_node=$(hostname | cut -d. -f1)
# the cut step is specific to Delta and may not be needed in general
head_node_ip=$(hostname --ip-address)
port=6379

echo "#### STARTING HEAD at $head_node ####"
echo "#### HEAD NODE IP: $head_node_ip ####"
srun --nodes=1 --ntasks=1 -w "$head_node" \
    apptainer run --bind ${AMPLFI_DATADIR},${AMPLFI_OUTDIR} --nv \
    ${AMPLFI_CONTAINER_ROOT}/amplfi.sif \
      ray start --head --node-ip-address="$head_node_ip" --port=$port \
      --num-cpus "${SLURM_CPUS_PER_TASK}" \
      --num-gpus 1 --block &
sleep 10
echo "#### HEAD NODE ASSUMED TO HAVE STARTED ####"

Increase the sleep 10 duration if you find that the next step starts before the head is up.
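
An alternative to tuning a fixed sleep is to poll until the head node's port accepts connections. The following is a minimal sketch using only the Python standard library. The function name wait_for_port is hypothetical, and an open TCP port is only a proxy for the head being ready, so treat this as an assumption rather than a guarantee.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connection means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not reachable yet; wait briefly and retry.
            time.sleep(interval)
    return False
```

In the script above, the host and port would be the values of head_node_ip and port (6379).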

Remaining workers

Use the address of the head node to start the worker nodes.

worker_num=$(($SLURM_JOB_NUM_NODES - 1))
srun --ntasks=$worker_num --nodes=$worker_num --ntasks-per-node=1 --exclude=$head_node \
  apptainer run --bind ${AMPLFI_DATADIR},${AMPLFI_OUTDIR} --nv \
  ${AMPLFI_CONTAINER_ROOT}/amplfi.sif \
    ray start --address $head_node_ip:$port \
    --num-cpus "${SLURM_CPUS_PER_TASK}" \
    --num-gpus 1 --block &

echo "#### SLEEPING FOR 60s BEFORE CALLING SCRIPT ####"
sleep 60

The --ntasks=$worker_num and --ntasks-per-node=1 options ensure that only one instance of Ray is started on each of the remaining nodes, and the workers find the head through --address $head_node_ip:$port. Increase the sleep duration if the full Ray cluster is not active by the time the tuning script is called.
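
Rather than relying on a fixed 60 second sleep, one could wait until the expected number of nodes has joined. The sketch below is generic: it takes any callable that reports the current number of live nodes. The function name wait_for_nodes is hypothetical, and the Ray-based callable shown in the comment is an assumption about how one might count nodes, not something taken from lightray.

```python
import time


def wait_for_nodes(count_alive, expected, timeout=300.0, interval=5.0):
    """Poll count_alive() until it reports at least `expected` nodes.

    Returns the observed node count, or raises TimeoutError if the
    deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        alive = count_alive()
        if alive >= expected:
            return alive
        if time.monotonic() >= deadline:
            raise TimeoutError(f"only {alive} of {expected} nodes joined")
        time.sleep(interval)


# Assumed usage with Ray, after connecting with ray.init(address="auto"):
#   count_alive = lambda: sum(1 for node in ray.nodes() if node["Alive"])
#   wait_for_nodes(count_alive, expected=10)
```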

Note

Because the individual training runs will be sent to the nodes above, ensure that all the necessary mounts are bound using --bind in every apptainer run command. If you encounter a permission error, check whether you missed binding a mount.

Launch HPO using lightray

Finally, launch the hyperparameter tuning using lightray as above.

echo "#### ASSUMING RAY CLUSTER IS UP, CALLING SCRIPT ####"
apptainer run --bind ${AMPLFI_DATADIR},${AMPLFI_OUTDIR} \
  --nv ${AMPLFI_CONTAINER_ROOT}/amplfi.sif \
  lightray --config tune.yaml --ray_init.configure_logging false -- \
  --config cbc.yaml

We have to set configure_logging=False in ray.init since, by default, logging is done under /tmp, which may point to different filesystems on different nodes. This is fine since the logs will be directed to the stdout and stderr files specified in the SBATCH directives.

Put all the steps above in a single file called tune.slurm and submit it.

$ sbatch tune.slurm
