Tuning
======
Hyperparameter tuning is powered by [Ray Tune](https://docs.ray.io/en/latest/tune/index.html). We use a wrapper library, [lightray](https://github.com/ethanmarx/lightray), that simplifies the use of Ray Tune with PyTorch Lightning `LightningCLI`s.

## Initialize a Tune Experiment
A new tuning experiment can be initialized using the `amplfi-init` command. For example, to initialize a directory to train a flow, run

```console
amplfi-init --mode flow --pipeline tune --directory ~/amplfi/my-first-tune/
```

This will create a directory at `~/amplfi/my-first-tune/` and populate it with configuration files for the run. `train.yaml` contains the main configuration for the training. `datagen.cfg` controls the configuration for querying training and testing strain data. `tune.yaml` configures how Ray will perform the hyperparameter tuning.

## Configuring an Experiment
A key ingredient in the tuning job is the parameter space that is searched over. This can be configured via the `param_space` parameter in the `tune.yaml` configuration file.

```yaml
# tune.yaml

param_space:
  model.learning_rate: tune.loguniform(1e-3, 4)
  data.kernel_length: tune.choice([1, 2])
```

The parameter names should be Python "dot paths" to attributes in `train.yaml`. Each parameter in the search space is sampled from its distribution when a trial is launched, and the sampled value overrides the value set in `train.yaml`.

Most of the parameters of [`ray.tune.Tuner`](https://docs.ray.io/en/latest/tune/api/doc/ray.tune.Tuner.html) are also configurable, including the tuning scheduler and search algorithm. Please see the Ray Tune [documentation](https://docs.ray.io/en/latest/tune/index.html) for more information.

You can see the full list of configuration options by running

```
lightray --help
```

## Launching a Run
The entrypoint to the tuning pipeline is the `run.sh` file generated in the experiment directory.
```bash
# run.sh
#!/bin/bash

# Export environment variables
export AMPLFI_DATADIR=/home/albert.einstein/amplfi/my-first-tune
export AMPLFI_OUTDIR=/home/albert.einstein/amplfi/my-first-tune/runs/
export AMPLFI_CONDORDIR=/home/albert.einstein/amplfi/my-first-tune/condor
# exported so that the GPU selection reaches the processes launched below
export CUDA_VISIBLE_DEVICES=0

# launch the data generation pipeline
LAW_CONFIG_FILE=/home/albert.einstein/amplfi/my-first-tune/datagen.cfg law run amplfi.law.DataGeneration --workers 5

# launch training or tuning pipeline
lightray --config tune.yaml -- --config cbc.yaml
```

If you've run the [training pipeline](first_pipeline.md) this should look familiar: environment variables control where the data is stored and where the tuning runs will be written. There's a command to launch the data generation pipeline, followed by a command to launch the tuning job.

## Local Tuning
If the `address` parameter in `tune.yaml` is set to `null` (the default), a local Ray cluster will be initialized and the tuning will use local resources. The resources allocated per trial can be controlled by the `gpus_per_worker` and `cpus_per_gpu` arguments. The `CUDA_VISIBLE_DEVICES` environment variable controls which GPUs are exposed to the job.

## Remote Tuning
Tuning can also be performed on a remote Ray cluster. Assuming your cluster worker nodes have access to a remote data directory on `s3` and to Weights & Biases (more on this below), launching a remote tuning job is as simple as passing the IP address of your Ray cluster's head node to the `address` parameter.

Running tuning remotely requires that your data directory live on an `s3` storage system. To generate data that is automatically moved to an `s3` bucket, set the `AMPLFI_DATADIR` environment variable to an `s3` path in `run.sh`. You'll also need to set `AMPLFI_OUTDIR` to an `s3` location.
```bash
# run.sh
export AMPLFI_DATADIR=s3://my-bucket/my-first-tune/data
export AMPLFI_OUTDIR=s3://my-bucket/my-first-tune/runs
...
```

### Kubernetes Ray Cluster
```{eval-rst}
.. note::
    Please see the `ml4gw quickstart `_ for help installing the necessary tools ( :code:`helm`, :code:`kubernetes`, :code:`s3cmd`) and configuration (weights and biases, s3 credentials) to run remote tuning. This quickstart includes a comprehensive Makefile to install this tooling in a fresh conda environment, and instructions on setting up the necessary credentials.
```

`lightray` ships a `helm` chart that can be used to launch a Ray head node and worker nodes on a remote kubernetes cluster. First, add the helm repository

```console
helm repo add lightray https://ethanmarx.github.io/lightray/
```

The helm chart comes with some configuration you'll need to set. To pull the "values" configuration template, run

```console
helm show values lightray/ray-cluster >> values.yaml
```

Specifically, you'll need to set the container to the remote `amplfi` image

```yaml
image: ghcr.io/ml4gw/amplfi/amplfi:main
```

You'll also need to set the `WANDB_API_KEY`, `AWS_ACCESS_KEY_ID`, and `AWS_SECRET_ACCESS_KEY` variables so that the remote cluster can access your data on `s3` and upload to Weights & Biases.

Then, you can install the cluster. You can name the installation anything.
Here we name it `my-ray-cluster`

```console
helm install my-ray-cluster lightray/ray-cluster -f values.yaml
```

To monitor the status of your pods, run

```console
kubectl get pods
```

You should see something like

```console
NAME                                    READY   STATUS              RESTARTS   AGE
my-ray-cluster-head-7b9597fdd8-brrlm    0/1     ContainerCreating   0          2s
my-ray-cluster-worker-bd6698d67-49p6x   0/1     ContainerCreating   0          2s
```

Once the head and at least one worker pod are in the `Running` state, you can query the kubernetes Service corresponding to the head node for its IP address:

```console
$ kubectl get service my-ray-cluster-head-loadbalancer -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```

Pass this IP address to the `address` parameter in `tune.yaml` in the format `ray://{ip}:10001`. For example, if the IP address were `11.22.10.27` you would set

```yaml
address: ray://11.22.10.27:10001
```

Now, launch the run!

```console
lightray --config tune.yaml -- --config cbc.yaml
```

```{eval-rst}
.. note::
    Remember to clean up your kubernetes jobs! You can uninstall all resources created by the helm chart with :code:`helm uninstall {release-name}`, where the release name is the one passed to :code:`helm install` (:code:`my-ray-cluster` above).
```

#### Syncing Remote Code
In some cases, it is necessary to launch a tuning job with code changes that haven't been integrated into the `AMPLFI` `main` branch, and thus have not been pushed to the remote container. To allow this, the `lightray/ray-cluster` chart supports an optional [git-sync](https://github.com/kubernetes/git-sync) `initContainer` that will clone and mount remote code inside the kubernetes pods.
To use this with `AMPLFI`, you will need to configure the following in the chart's `values.yaml` file

```yaml
# set dev to true
dev: true
gitRepo:
  # name must be set to amplfi
  name: amplfi
  # set to repo you want to mount
  url: git@github.com:albert.einstein/amplfi.git
  # set ref to branch name or commit hash
  ref: my-branch
  # mountPath must be set to /opt
  mountPath: /opt
```

### SLURM Ray cluster
To use compute resources that are managed via [SLURM](https://slurm.schedmd.com/), the steps to start the Ray cluster are different. Once the cluster is started, the remaining steps using `lightray` are similar to those described above. Note that the steps below closely resemble the [deploy on SLURM](https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html) guide in the Ray documentation. The following example was created using [NCSA Delta](https://docs.ncsa.illinois.edu/systems/delta/). Also, ensure that the apptainer image has already been built; it is referred to below as `${AMPLFI_CONTAINER_ROOT}/amplfi.sif`.

#### SBATCH directives
```bash
#!/bin/bash
#SBATCH --nodes=10
#SBATCH --tasks-per-node=1
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=3
#SBATCH --job-name=my-tuner
#SBATCH --account=
#SBATCH --output=%x.out
#SBATCH --error=%x.err
#SBATCH --time=6:00:00
#SBATCH --partition=gpuA100x4
#SBATCH --mem-per-cpu=10GB
```

The first four `#SBATCH` directives are the relevant ones. In this case, the resources reserved by SLURM will be used for 10 Ray workers, each with one GPU and three CPUs.

```{eval-rst}
.. note::
    For the CPU resources, reserving one more than is used by a single worker is recommended. In this case this implies ``tune.yaml`` should have: ``cpus_per_trial: 2``, ``gpus_per_trial: 1``. Adjust based on the number of workers that the dataloaders use.
```

#### Head worker
Out of the 10 nodes, the first landing machine will be used for the head worker.

```bash
head_node=$(hostname | cut -d. -f1)  # the cut step is specific to Delta and may not be needed in general
head_node_ip=$(hostname --ip-address)
port=6379
echo "#### STARTING HEAD at $head_node ####"
echo "#### HEAD NODE IP: $head_node_ip ####"

# start the Ray head inside the container, in the background
srun --nodes=1 --ntasks=1 -w "$head_node" \
    apptainer run --bind ${AMPLFI_DATADIR},${AMPLFI_OUTDIR} --nv \
    ${AMPLFI_CONTAINER_ROOT}/amplfi.sif \
    ray start --head --node-ip-address="$head_node_ip" --port=$port \
    --num-cpus "${SLURM_CPUS_PER_TASK}" \
    --num-gpus 1 --block &
sleep 10
echo "#### HEAD NODE ASSUMED TO HAVE STARTED ####"
```

Increase the `sleep 10` duration if you find that the next step starts before the head is up.

#### Remaining workers
Use the address of the head node to start the worker nodes.

```bash
worker_num=$(($SLURM_JOB_NUM_NODES - 1))

# start one Ray worker on each remaining node, in the background
srun --ntasks=$worker_num --nodes=$worker_num --ntasks-per-node=1 --exclude=$head_node \
    apptainer run --bind ${AMPLFI_DATADIR},${AMPLFI_OUTDIR} --nv \
    ${AMPLFI_CONTAINER_ROOT}/amplfi.sif \
    ray start --address $head_node_ip:$port \
    --num-cpus "${SLURM_CPUS_PER_TASK}" \
    --num-gpus 1 --block &
echo "#### SLEEPING FOR 60s BEFORE CALLING SCRIPT ####"
sleep 60
```

The `--ntasks=$worker_num` and `--ntasks-per-node=1` options ensure that only one instance of Ray is started on each of the remaining nodes, and the workers find the head through `--address $head_node_ip:$port`. Adjust the `sleep` duration depending on how long the full Ray cluster takes to become active.

```{eval-rst}
.. note::
    Because the individual training runs will be sent to the nodes above, ensure that all the necessary mounts are bound using ``--bind`` for every ``apptainer run`` entrypoint. If you are getting a permission issue, check whether you missed binding a mount.
```

#### Launch HPO using `lightray`
Finally, launch the hyperparameter tuning using `lightray` as above.
```bash
echo "#### ASSUMING RAY CLUSTER IS UP, CALLING SCRIPT ####"
apptainer run --bind ${AMPLFI_DATADIR},${AMPLFI_OUTDIR} \
    --nv ${AMPLFI_CONTAINER_ROOT}/amplfi.sif \
    lightray --config tune.yaml --ray_init.configure_logging false -- \
    --config cbc.yaml
```

We have to set `configure_logging=False` in `ray.init` because by default the logging is done under `/tmp`, which may point to different filesystems on different nodes. This is fine since the logs will be directed to the `stdout` and `stderr` files named in the SBATCH directives.

Put all the steps above in a single file called `tune.slurm` and submit it.

```bash
$ sbatch tune.slurm
```
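
The fixed `sleep` calls used while starting the head and worker nodes are only a guess at how long startup takes. One possible alternative, shown here as a minimal sketch that is not part of the AMPLFI or `lightray` tooling, is to poll the head node's port until it accepts connections. The `wait_for_port` helper name and its default timeout are hypothetical, and an open port only indicates that something is listening on it, not that every worker has joined the cluster.

```bash
#!/bin/bash
# Hypothetical helper (not provided by AMPLFI, lightray, or Ray):
# poll until HOST:PORT accepts a TCP connection, or give up after a timeout.
# Usage: wait_for_port HOST PORT [TIMEOUT_SECONDS]
wait_for_port() {
    local host=$1 port=$2 timeout=${3:-120}
    local elapsed=0
    # bash's /dev/tcp pseudo-device attempts a TCP connection; running the
    # redirection in a subshell closes the connection as soon as the check ends
    until (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out waiting for ${host}:${port}" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}
```

If the function is defined near the top of `tune.slurm`, a call such as `wait_for_port "$head_node_ip" "$port" 120 || exit 1` could stand in for the `sleep 10` after the head node is started, so the job fails early instead of launching workers against a head that never came up.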