Remote Training

Note

Please see the ml4gw quickstart for help installing the necessary tools (helm, kubernetes, s3cmd) and setting up the configuration (Weights & Biases, S3 credentials) needed to run remote training. The quickstart includes a comprehensive Makefile that installs this tooling in a fresh conda environment, along with instructions for setting up the necessary credentials.

Initialize a Remote Training Experiment

A remote training experiment can be initialized with the amplfi-init command by supplying the optional --remote-train and --s3-bucket flags.

For example, to initialize a directory to train a flow, run

> amplfi-init --mode flow --pipeline train --directory ~/amplfi/ -n my-first-remote-train --remote-train true --s3-bucket s3://my_bucket/my-first-remote-train/
INFO - Initialized a flow train pipeline at /home/albert.einstein/amplfi/my-first-remote-train

The directory contents will look similar to those created for local training jobs. For example, you will see a train.yaml training configuration file and a run.sh file for launching the job.

You will also see a kubernetes.yaml file that contains the Kubernetes configuration for launching the pod on Nautilus. This file is filled in based on the S3 bucket you specified and on your local environment. For example, AWS_ENDPOINT_URL and WANDB_API_KEY will be set inside the remote container based on your local environment variables.

In addition, AMPLFI_OUTDIR and AMPLFI_DATADIR environment variables will be set inside the container based on the specified s3_bucket:

# snip
  env:
  # snip
  - name: AMPLFI_OUTDIR
    value: s3://my_bucket/my-first-remote-train
  - name: AMPLFI_DATADIR
    value: s3://my_bucket/my-first-remote-train/data
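
The relationship between the bucket and these two variables can be sketched in a few lines of Python. This is an illustration only: the helper name `remote_env` is hypothetical and not part of AMPLFI, and the `/data` suffix is inferred from the example above rather than taken from the AMPLFI source.

```python
def remote_env(s3_bucket: str) -> dict[str, str]:
    """Hypothetical helper: derive the container env vars from an S3 URL."""
    if not s3_bucket.startswith("s3://"):
        raise ValueError(f"expected an s3:// URL, got {s3_bucket!r}")
    # Drop any trailing slash so the joined path has no double slash
    root = s3_bucket.rstrip("/")
    return {
        "AMPLFI_OUTDIR": root,
        # Assumption: data lives under a `data` prefix of the output directory
        "AMPLFI_DATADIR": f"{root}/data",
    }


env = remote_env("s3://my_bucket/my-first-remote-train/")
print(env["AMPLFI_OUTDIR"])   # s3://my_bucket/my-first-remote-train
print(env["AMPLFI_DATADIR"])  # s3://my_bucket/my-first-remote-train/data
```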

Note

If you already have a remote data directory you wish to train with, set the AMPLFI_DATADIR environment variable in both run.sh and kubernetes.yaml to point to it.

The run.sh file will differ slightly from the one generated for a local training job:

# run.sh

#!/bin/bash
export AMPLFI_DATADIR=s3://my_bucket/my-first-remote-train/data

# launch data generation pipeline
LAW_CONFIG_FILE=/home/albert.einstein/amplfi/my-first-remote-train/datagen.cfg law run amplfi.data.DataGeneration --workers 5

# upload config file to remote s3 location
s3cmd put /home/albert.einstein/amplfi/my-first-remote-train/cbc.yaml s3://my_bucket/my-first-remote-train/cbc.yaml

# launch job
kubectl apply -f /home/albert.einstein/amplfi/my-first-remote-train/kubernetes.yaml

The script first generates strain data for training and testing. As usual, this step is skipped automatically if data already exists at the specified AMPLFI_DATADIR. Next, the training configuration file is uploaded to remote storage so that the Kubernetes job can access it. Finally, the Kubernetes job is launched.

To monitor the job, you can run

kubectl get pods

to get the pod name and inspect the status of the pod, and

kubectl logs <pod-name>

to inspect any logs from the pod once it's running.
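
If you prefer to check pod status from a script, one option is to parse the JSON form of the same listing (`kubectl get pods -o json`). The sketch below is a minimal illustration, not part of AMPLFI: the function name `pod_phases` and the sample pod name are hypothetical, and the sample JSON is a hand-written, heavily trimmed stand-in for real output.

```python
import json


def pod_phases(get_pods_json: str) -> dict[str, str]:
    """Map each pod name to its phase (Pending, Running, Succeeded, ...)."""
    pods = json.loads(get_pods_json)["items"]
    return {pod["metadata"]["name"]: pod["status"]["phase"] for pod in pods}


# Hand-written, trimmed stand-in for real kubectl output;
# the pod name is hypothetical.
sample = """
{"items": [
  {"metadata": {"name": "train-abc12"}, "status": {"phase": "Pending"}}
]}
"""
print(pod_phases(sample))  # {'train-abc12': 'Pending'}
```

To run this against a live cluster, the JSON string could come from `subprocess.run(["kubectl", "get", "pods", "-o", "json"], capture_output=True, text=True, check=True).stdout`. A pod that stays in the Pending phase is usually still waiting to be scheduled.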

Configuring Kubernetes Job

By default, the job uses the remote AMPLFI image at ghcr.io/ml4gw/amplfi/amplfi:main. To use a different image with AMPLFI installed, change the image parameter in kubernetes.yaml.

The number of GPUs and CPUs available in the pod can also be configured by editing the kubernetes.yaml file:

# kubernetes.yaml

# snip
...
    image: ghcr.io/ml4gw/amplfi/amplfi:main
    imagePullPolicy: Always
    name: train
    resources:
        limits:
            cpu: "96"
            memory: 416G
            nvidia.com/gpu: "8"
        requests:
            cpu: "96"
            memory: 416G
            nvidia.com/gpu: "8"

By default, 8 GPUs are requested. Jobs that request 8 GPUs can take a while to be scheduled; requesting fewer GPUs typically shortens the wait.
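
When lowering the GPU count, keep in mind that Kubernetes expects the GPU request and limit to match if both are given. The sketch below shows that edit on a plain Python dictionary mirroring the resources block of the default kubernetes.yaml. The helper name `set_gpu_count` is hypothetical and not part of AMPLFI, and loading or writing the actual YAML file is left out.

```python
import copy


def set_gpu_count(resources: dict, gpus: int) -> dict:
    """Return a copy of a container `resources` block requesting `gpus` GPUs."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    updated = copy.deepcopy(resources)
    for section in ("limits", "requests"):
        # Keep the GPU request and limit equal, as Kubernetes requires
        updated[section]["nvidia.com/gpu"] = str(gpus)
    return updated


# Values copied from the default kubernetes.yaml
default = {
    "limits": {"cpu": "96", "memory": "416G", "nvidia.com/gpu": "8"},
    "requests": {"cpu": "96", "memory": "416G", "nvidia.com/gpu": "8"},
}
smaller = set_gpu_count(default, 2)
print(smaller["requests"]["nvidia.com/gpu"])  # 2
```

This sketch leaves the CPU and memory values untouched. Whether to lower them along with the GPU count depends on your job and on what the cluster's nodes offer.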