Remote Training
Note
Please see the ml4gw quickstart for help installing the necessary tools (helm, kubernetes, s3cmd) and setting up the configuration (Weights & Biases, S3 credentials) needed to run remote training. The quickstart includes a comprehensive Makefile that installs this tooling in a fresh conda environment, and instructions on setting up the necessary credentials.
Initialize a Remote Training Experiment
A remote training experiment can be initialized with the amplfi-init command
by supplying the optional --remote-train and --s3-bucket flags.
For example, to initialize a directory to train a flow, run
> amplfi-init --mode flow --pipeline train --directory ~/amplfi/ -n my-first-remote-train --remote-train true --s3-bucket s3://my_bucket/my-first-remote-train/
INFO - Initialized a flow train pipeline at /home/albert.einstein/amplfi/my-first-remote-train
The directory contents will look similar to those created for local training jobs.
For example, you will see a train.yaml training configuration file and a run.sh file
for launching the job.
You will also now see a kubernetes.yaml file that contains the Kubernetes configuration
for launching the Kubernetes pod on Nautilus. This file will be filled out with configuration based on
the specified s3_bucket. For example, AWS_ENDPOINT_URL and WANDB_API_KEY will be set inside the remote
container based on your local environment variables.
In addition, AMPLFI_OUTDIR and AMPLFI_DATADIR environment variables will be set inside the container
based on the specified s3_bucket:
# snip
env:
  # snip
  - name: AMPLFI_OUTDIR
    value: s3://my_bucket/my-first-remote-train
  - name: AMPLFI_DATADIR
    value: s3://my_bucket/my-first-remote-train/data
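Because AWS_ENDPOINT_URL and WANDB_API_KEY are taken from your local environment, it can be useful to confirm that both are set before initializing the experiment. The following is a minimal sketch of such a check; check_remote_env is a hypothetical helper written for illustration and is not part of AMPLFI.

```shell
# check_remote_env: hypothetical helper, not part of AMPLFI.
# Prints the name of each required variable that is unset or empty
# and returns non-zero if any of them is missing.
check_remote_env() {
    missing=0
    for name in AWS_ENDPOINT_URL WANDB_API_KEY; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${${name}:-}"
        if [ -z "${value}" ]; then
            echo "missing: ${name}"
            missing=1
        fi
    done
    return "${missing}"
}
```

Calling check_remote_env before amplfi-init reports which variables still need to be exported in your shell.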
Note
If you already have a remote data directory you wish to train with, you can specify the
AMPLFI_DATADIR environment variable in the run.sh and kubernetes.yaml to point to your data directory.
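As a rough sketch of that edit for run.sh, assuming the file sets the variable with a line of the form `export AMPLFI_DATADIR=...`, the line can be rewritten in place with sed. set_run_datadir is a hypothetical helper, not an AMPLFI command.

```shell
# set_run_datadir: hypothetical helper, not part of AMPLFI.
# Rewrites the `export AMPLFI_DATADIR=...` line of a run.sh file in place.
set_run_datadir() {
    run_file="$1"       # path to run.sh
    new_datadir="$2"    # e.g. an existing s3:// data directory
    tmp_file="${run_file}.tmp"
    # '|' is used as the sed delimiter because S3 URLs contain '/'.
    sed "s|^export AMPLFI_DATADIR=.*|export AMPLFI_DATADIR=${new_datadir}|" \
        "${run_file}" > "${tmp_file}" || return 1
    # Write back through the original file so its permissions are kept.
    cat "${tmp_file}" > "${run_file}" && rm -f "${tmp_file}"
}
```

This only touches run.sh; the value of the AMPLFI_DATADIR entry in kubernetes.yaml still needs to be changed to the same location.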
The run.sh file will look slightly different from the one for a local training job:
# run.sh
#!/bin/bash
export AMPLFI_DATADIR=s3://my_bucket/my-first-remote-train/data
# launch data generation pipeline
LAW_CONFIG_FILE=/home/ethan.marx/amplfi/my-first-remote-train/datagen.cfg law run amplfi.data.DataGeneration --workers 5
# move config file to remote s3 location
s3cmd put /home/ethan.marx/amplfi/my-first-remote-train/cbc.yaml s3://my_bucket/my-first-remote-train/cbc.yaml
# launch job
kubectl apply -f /home/ethan.marx/amplfi/kubernetes.yaml
The first step generates strain data for training and testing. As usual, if data already exists at the specified AMPLFI_DATADIR,
this step is skipped automatically. Next, the training configuration file is uploaded to remote storage
so that it can be accessed by the Kubernetes job. Finally, the Kubernetes job is launched.
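The three steps depend on one another, so it may be preferable to stop as soon as one of them fails. The sketch below chains the same commands that appear in run.sh with `&&`, so the Kubernetes job is only launched if data generation and the upload both succeed. launch_remote_train and its three arguments are hypothetical and only for illustration.

```shell
# launch_remote_train: hypothetical wrapper around the three run.sh steps.
# Each step runs only if the previous one succeeded.
launch_remote_train() {
    run_dir="$1"      # local experiment directory
    bucket="$2"       # remote location, e.g. s3://my_bucket/my-first-remote-train
    kube_file="$3"    # path to the generated kubernetes.yaml
    # 1. generate data, 2. upload the config file, 3. launch the job
    LAW_CONFIG_FILE="${run_dir}/datagen.cfg" law run amplfi.data.DataGeneration --workers 5 \
        && s3cmd put "${run_dir}/cbc.yaml" "${bucket}/cbc.yaml" \
        && kubectl apply -f "${kube_file}"
}
```

The function returns the exit status of the first failing command, which makes it easier to see which stage needs attention.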
To monitor the job, you can run
kubectl get pods
to get the pod name and inspect the status of the pod, and
kubectl logs <pod-name>
to inspect any logs from the pod once it's running.
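If you would rather not poll by hand, the two commands can be combined with `kubectl wait` and `kubectl logs -f`, both standard kubectl subcommands. The following is a minimal sketch; follow_pod_logs and its default timeout are assumptions made for illustration.

```shell
# follow_pod_logs: hypothetical helper, not part of AMPLFI.
# Blocks until the pod reports Ready (or the timeout expires),
# then streams its logs.
follow_pod_logs() {
    pod_name="$1"
    timeout="${2:-600s}"    # assumed default; adjust to expected scheduling time
    kubectl wait --for=condition=Ready "pod/${pod_name}" --timeout="${timeout}" \
        && kubectl logs -f "${pod_name}"
}
```

For example, `follow_pod_logs <pod-name> 30m` waits up to 30 minutes for scheduling before giving up. A pod that fails before becoming Ready will cause the wait to time out, in which case `kubectl get pods` shows its status.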
Configuring Kubernetes Job
By default, the job will utilize the remote AMPLFI image at ghcr.io/ml4gw/amplfi/amplfi:main.
If for some reason you wish to utilize another image with AMPLFI installed, you can change
the image parameter.
The number of GPUs and CPUs available in the pod can also be configured by editing the kubernetes.yaml file:
# kubernetes.yaml
# snip
...
image: ghcr.io/ml4gw/amplfi/amplfi:main
imagePullPolicy: Always
name: train
resources:
  limits:
    cpu: "96"
    memory: 416G
    nvidia.com/gpu: "8"
  requests:
    cpu: "96"
    memory: 416G
    nvidia.com/gpu: "8"
By default, 8 GPUs are requested. Jobs that request 8 GPUs can take a while to be scheduled; requesting fewer GPUs typically shortens the wait.
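As a sketch of how the GPU count could be changed without opening an editor, the helper below rewrites every `nvidia.com/gpu: "N"` entry in a kubernetes.yaml file, so that limits and requests stay equal. set_gpu_count is a hypothetical helper, and it assumes the quoted-integer format shown in the listing above.

```shell
# set_gpu_count: hypothetical helper, not part of AMPLFI.
# Sets every `nvidia.com/gpu: "N"` entry in the given file to the new count.
set_gpu_count() {
    kube_file="$1"
    count="$2"
    # Reject anything that is not a plain non-negative integer.
    case "${count}" in
        ''|*[!0-9]*) echo "GPU count must be an integer" >&2; return 1 ;;
    esac
    tmp_file="${kube_file}.tmp"
    sed "s|\(nvidia\.com/gpu: \)\"[0-9][0-9]*\"|\1\"${count}\"|" \
        "${kube_file}" > "${tmp_file}" || return 1
    # Write back through the original file so its permissions are kept.
    cat "${tmp_file}" > "${kube_file}" && rm -f "${tmp_file}"
}
```

For example, `set_gpu_count kubernetes.yaml 2` requests two GPUs. The cpu and memory values are left untouched; you may want to lower them by hand as well so the pod fits on more nodes.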