Parallel Bilby Analysis

The Basic Idea

Once everything is set up, parallel_bilby is ready to go, and you might be tempted to simply run parallel_bilby_analysis to finish the job. However, in a realistic gravitational-wave inference setting, the run will hardly converge in an acceptable amount of time on your local machine. Instead, you will need to work on a computer cluster that handles expensive computations. To allocate its resources efficiently, most modern clusters use the Slurm workload manager. Think of this as visiting a not-so-nice restaurant: there is no chance for you to cook anything yourself, and you will get exactly what you asked for, but only at a time that suits the kitchen's workflow. And changing your order can result in further delays, no matter how hungry you are.

The batch script

The way to specify your demands is to submit a batch script to Slurm; let's call it analysis.sh. Instead of you typing all your commands on the command line, the computer will execute each line of the batch file consecutively. See below for an example script:

#!/bin/bash
#SBATCH --qos=regular
#SBATCH --time=24:00:00
#SBATCH --nodes=30
#SBATCH --ntasks-per-node=68
#SBATCH --constraint=knl
#SBATCH -o  [PATH_TO_YOUR_OUTDIR]/log_data_analysis/log
#SBATCH -e  [PATH_TO_YOUR_OUTDIR]/log_data_analysis/err
#SBATCH --no-requeue
#SBATCH --account=[YOUR_COST_CENTRE]
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=mail@me.com
#SBATCH --job-name=my_fancy_jobname
module load openmpi/4.0.2 
module load python
conda activate [YOUR_PBILBY_ENVIRONMENT]
  
export MKL_NUM_THREADS="1"
export MKL_DYNAMIC="FALSE"
export OMP_NUM_THREADS=1
export MPI_PER_NODE=68
export PMI_MMAP_SYNC_WAIT_TIME=600
srun -n $SLURM_NTASKS parallel_bilby_analysis [PATH_TO_YOUR_OUTDIR]/data/inj_data_dump.pickle --nlive 2048 --nact 30 --maxmcmc 10000 --sampling-seed 10130134 --no-plot --check-point-deltaT 36000 --outdir  [PATH_TO_YOUR_OUTDIR]/result

The core command is given in the last line: we ask Slurm to run (srun), using $SLURM_NTASKS parallel tasks, parallel_bilby_analysis on the information stored in the specified .pickle file, with all subsequent arguments being passed on to parallel_bilby_analysis. Rather than typing this command directly into the command line, we let the preceding lines of the batch file prepare the environment first.

Required and Useful Slurm Options

bash is a very common shell on Linux systems; essentially, it lets you communicate with the operating system. The initial #!/bin/bash is a shebang, telling the computer to run all subsequent commands using bash. You could simply execute the script as bash analysis.sh, but Slurm takes a slightly different approach to batch scripts: to allow for efficient management, it provides its own sbatch command. Lines immediately after the shebang that start with #SBATCH are read as options to sbatch, up to the first executable line.
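Concretely, once analysis.sh is written, submission looks like this (a minimal sketch; the job ID that sbatch prints will differ on your cluster):

sbatch analysis.sh          # Slurm reads the #SBATCH lines and queues the job
# bash analysis.sh          # would run the script immediately, ignoring all #SBATCH directives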

Necessary options are:

  • #SBATCH --account=[YOUR_COST_CENTRE] – specifies the cluster account to be charged
  • #SBATCH --qos=regular – the quality of service; the available options are specific to your cluster and affect how your account is charged
  • #SBATCH --time=24:00:00 – the maximum wall time your job is allowed to run
  • #SBATCH --nodes=30 – the number of nodes
  • #SBATCH --ntasks-per-node=68 – the number of tasks allocated to each node

Note that the latter two are subject to cluster rules, and a poor choice can leave a job pending indefinitely. Slurm generally favours jobs that use more tasks for a shorter duration. This needs to be balanced against parallel_bilby's recommendation that the total number of tasks (the product of nodes and ntasks-per-node) should not significantly exceed the number of live points.
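As a concrete check for the example above: 30 nodes × 68 tasks per node gives 2040 tasks in total, which is comparable to, and does not significantly exceed, the 2048 live points requested via --nlive.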

Useful options include:

  • #SBATCH --mail-type=BEGIN,END,FAIL – automatic emails are sent when your run begins, ends or fails
  • #SBATCH --mail-user=mail@me.com – the address to which these emails are sent
  • #SBATCH --job-name=my_fancy_jobname – a unique job name that helps you monitor your jobs
  • #SBATCH --no-requeue – keep control over your balance by preventing administrators from restarting your job, e.g. after a node failure
  • #SBATCH -D [YOUR_DIR] – the directory relative to which other paths are interpreted; defaults to the directory you submit from

Many more options are available.
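The full list can be browsed with man sbatch, and the job name chosen above can be used to keep an eye on the submission; a small sketch (my_fancy_jobname is the name set via --job-name):

man sbatch                                     # documents every #SBATCH option
squeue --user=$USER --name=my_fancy_jobname    # shows the state of this particular job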

Batch commands

Your batch commands should include these:

module load openmpi
module load python
conda activate [YOUR_PBILBY_ENVIRONMENT]

These commands make sure that your environment is prepared to run parallel_bilby: they load an MPI implementation and Python, and activate the conda environment in which parallel_bilby is installed.
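Before queuing a long job, it can be worth checking on the login node that the environment is actually complete; a minimal sketch, assuming parallel_bilby is installed in the activated conda environment:

which parallel_bilby_analysis              # should resolve to a path inside [YOUR_PBILBY_ENVIRONMENT]
python -c "import bilby, parallel_bilby"   # fails loudly if the packages are missing
mpirun --version                           # confirms that an MPI implementation is available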

  
export MKL_NUM_THREADS="1"
export MKL_DYNAMIC="FALSE"
export OMP_NUM_THREADS=1
export MPI_PER_NODE=[YOUR_ntasks-per-node]
export PMI_MMAP_SYNC_WAIT_TIME=600

These environment variables affect computational efficiency and should be edited only with care. In particular, setting MKL_NUM_THREADS and OMP_NUM_THREADS to 1 prevents each MPI task from spawning additional threads, and MPI_PER_NODE should match the ntasks-per-node value requested from Slurm.

Options to the Analysis

There are many options available to parallel_bilby_analysis. Some important ones used in the above example include:

  • --nlive 2048 – the number of live points governs the overall precision of your inference. It should significantly exceed the expected number of modes in the posterior distribution and is typically on the order of a thousand or more.
  • --nact 30 --maxmcmc 10000 – these quantities control the algorithm that generates new live points. Increasing them results in better convergence at the cost of longer runtime; regard the values above as convenient defaults.
  • --sampling-seed 12345 – a sampling seed makes the stochastic sampling reproducible.
  • --check-point-deltaT 36000 – clusters can be subject to failures, so you will want your data saved at regular checkpoints. Since bilby is computationally expensive, even saving can cost significant runtime, so checkpoints should be taken on the order of hours (but well below your time limit).
  • --outdir [PATH_TO_YOUR_OUTDIR]/result – the directory into which results are written.
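The full set of accepted arguments, including their defaults, can be listed with the built-in help (assuming the environment prepared above is activated):

parallel_bilby_analysis --help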

Fixing your Batch

Every once in a while you will find yourself in a situation where you need to make changes to your submission. Sometimes there is no way around cancelling the job entirely and resubmitting it, but some attributes of a pending job can often be adjusted in place with scontrol.
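The standard Slurm commands for this are sketched below; [JOBID] stands for the ID reported by sbatch, and which fields you are allowed to update depends on your cluster's configuration:

scontrol show job [JOBID]                           # inspect the current settings of the job
scontrol update JobId=[JOBID] TimeLimit=12:00:00    # e.g. reduce the requested wall time of a pending job
scancel [JOBID]                                     # cancel the job entirely if it cannot be fixed in place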
