First Pipeline
==============

:::{note}
Running AMPLFI out-of-the-box requires access to one or more enterprise-grade GPUs (e.g. P100, V100, T4, A30, A40, A100, H100, H200). Several nodes on the LIGO Data Grid meet these requirements.
:::

After [installing](./installation.md) `AMPLFI`, you will have access to the `amplfi-init` command for initializing experiment directories:

```console
> amplfi-init --help
usage: amplfi-init [-h] [--mode {flow,similarity}] [--pipeline {tune,train}] [-n NAME] [-d DIRECTORY] [--s3-bucket S3_BUCKET]

Initialize a directory with configuration files for running end-to-end amplfi training or tuning pipelines

options:
  -h, --help            Show this help message and exit.
  --mode {flow,similarity}
                        Either 'flow' or 'similarity'. Whether to setup a flow or similarity training (default: flow)
  --pipeline {tune,train}
                        Either 'train' or 'tune'. Whether to setup a tune or train pipeline (default: train)
  -n NAME, --name NAME  The name of the run. This will be used to create the run subdirectory. (required, type: str)
  -d DIRECTORY, --directory DIRECTORY
                        The parent directory where the data and subdirectories for runs will be stored. If not provided, the environment variable AMPLFI_RUNDIR will be used. (type: , default: null)
  --s3-bucket S3_BUCKET
                        (default: null)
```

For example, let's initialize a run directory under `~/amplfi/my-runs` for training a normalizing flow, and name it `first-flow-run`:

```console
amplfi-init --mode flow --pipeline train --directory ~/amplfi/my-runs --name first-flow-run
```

Alternatively, the `--directory` argument can be omitted by defining the `AMPLFI_RUNDIR` environment variable, which is then used as the parent directory for all runs.

```console
export AMPLFI_RUNDIR=~/amplfi/my-runs
amplfi-init --mode flow --pipeline train --name first-flow-run
```

A `run.sh` script will be created in the run directory. It will look like this:

```bash
#!/bin/bash
# Export environment variables
export AMPLFI_DATADIR=/home/albert.einstein/amplfi/my-runs/data/
export AMPLFI_OUTDIR=/home/albert.einstein/amplfi/my-runs/first-flow-run/
export AMPLFI_CONDORDIR=/home/albert.einstein/amplfi/my-runs/data/condor

# launch the data generation pipeline
LAW_CONFIG_FILE=/home/albert.einstein/amplfi/my-runs/first-flow-run/datagen.cfg law run amplfi.data.DataGeneration --workers 5

# launch training pipeline
amplfi-flow-cli fit --config cbc.yaml
```

This bash script consists of two steps:

1. Querying gravitational wave strain data using a [law](https://github.com/riga/law) workflow
2. Training a normalizing flow using [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/)

The data querying step is configured by the `datagen.cfg` file. It queries segments of science-mode strain data and saves them in the directory specified by the `AMPLFI_DATADIR` environment variable. This step uses HTCondor for parallelization and saves any condor log files to `AMPLFI_CONDORDIR`.

:::{note}
If you already have a data directory consistent with the settings in `datagen.cfg`, you can point `AMPLFI_DATADIR` to it and the data generation step will automatically be skipped.
:::

Once data querying is complete, training will begin. Training is configured by the YAML file passed to `--config` (`cbc.yaml` in the script above). It's important to get familiar with the training parameters, but the defaults should suffice for your first run. The training job looks in `AMPLFI_DATADIR` for strain data and saves checkpoints and other training artifacts in `AMPLFI_OUTDIR`.

Once training is complete, sample corner plots, skymaps, probability-probability plots, and searched-area plots can be generated by running the `test` subcommand.
Remember to pass your trained model weights, which are saved in the `AMPLFI_OUTDIR` directory. In this case, we pass the weights corresponding to the best validation score, which are automatically saved at `$AMPLFI_OUTDIR/train_logs/best.ckpt`:

```console
amplfi-flow-cli test --config /path/to/config.yaml --model.checkpoint=$AMPLFI_OUTDIR/train_logs/best.ckpt
```

Plots will be saved under `$AMPLFI_OUTDIR`.
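
Before launching the `test` subcommand, it can help to confirm that the checkpoint actually exists, since training may have been interrupted before one was written. The following is a minimal sketch, assuming the `train_logs/best.ckpt` layout described above. `best_checkpoint` is a hypothetical helper written for this page, not part of AMPLFI.

```bash
#!/bin/bash
# Hypothetical helper (not part of AMPLFI): print the path to the best
# checkpoint, or fail with a message on stderr if it cannot be found.
best_checkpoint() {
    # Use the first argument if given, otherwise fall back to AMPLFI_OUTDIR
    local outdir="${1:-${AMPLFI_OUTDIR:-}}"
    if [ -z "$outdir" ]; then
        echo "AMPLFI_OUTDIR is not set" >&2
        return 1
    fi
    # Strip a trailing slash so the printed path has no double slash
    local ckpt="${outdir%/}/train_logs/best.ckpt"
    if [ ! -f "$ckpt" ]; then
        echo "no checkpoint found at $ckpt" >&2
        return 1
    fi
    echo "$ckpt"
}

# Usage (commented out so that sourcing this file has no side effects):
# ckpt="$(best_checkpoint)" && \
#     amplfi-flow-cli test --config /path/to/config.yaml --model.checkpoint="$ckpt"
```

With this guard, the `test` command only runs when the checkpoint file is present. Otherwise the helper prints the path it looked for, which makes a misconfigured `AMPLFI_OUTDIR` easier to spot.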