Basics of Inference

The General Idea

Statistical inference refers to any method that tries to infer knowledge about an underlying distribution, e.g. neutron star masses, from the analysis of a limited dataset, e.g. observed neutron stars. In the specific case of Bayesian Inference this is done by applying Bayes' theorem: the sought-after posterior distribution results from reweighting a prior distribution by how well it satisfies the data constraints, which are expressed by a likelihood function. For compact binary coalescences, the likelihood could evaluate how well a waveform generated from a certain set of binary parameters matches an observed waveform.
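In symbols (the notation here is our choice, with $\theta$ the parameters, $d$ the data, $\pi$ the prior, $\mathcal{L}$ the likelihood and $\mathcal{Z}$ the evidence), Bayes' theorem reads

$$ p(\theta \mid d) = \frac{\mathcal{L}(d \mid \theta)\, \pi(\theta)}{\mathcal{Z}}, \qquad \mathcal{Z} = \int \mathcal{L}(d \mid \theta)\, \pi(\theta)\, \mathrm{d}\theta . $$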

Bilby

In the context of astrophysical parameter estimation, the prior and posterior may refer to the full parameter space of compact systems, comprising masses, kinetic and angular quantities, and many more. These distributions are usually continuous and thus need to be discretized for numerical treatment. As the number of parameters increases, the numerical cost of covering the parameter space quickly becomes prohibitive, even for a very coarse discretization. Bilby, the Bayesian inference library, provides convenient routines to circumvent this problem by implementing the ideas of Nested Sampling, tailored to the needs of compact binary coalescence research.
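To give a feeling for the workflow, the following minimal sketch runs bilby's nested sampling on a toy Gaussian problem. The data, prior ranges, class name and labels are made up for this illustration; in a real analysis the compact-binary likelihoods and priors shipped with bilby would take their place.

```python
import bilby
import numpy as np

# Toy problem: recover the mean and width of a Gaussian from fake data
# (a cheap stand-in for the far more expensive compact-binary likelihood).
data = np.random.normal(loc=3.0, scale=1.0, size=100)

class ToyGaussianLikelihood(bilby.Likelihood):
    def __init__(self, data):
        super().__init__(parameters={"mu": None, "sigma": None})
        self.data = data

    def log_likelihood(self):
        mu, sigma = self.parameters["mu"], self.parameters["sigma"]
        return np.sum(
            -0.5 * ((self.data - mu) / sigma) ** 2
            - np.log(sigma * np.sqrt(2 * np.pi))
        )

# Priors: the region of parameter space the sampler has to cover.
priors = {
    "mu": bilby.core.prior.Uniform(0, 10, name="mu"),
    "sigma": bilby.core.prior.Uniform(0.1, 5, name="sigma"),
}

# Nested sampling run; 'dynesty' is the sampler that parallel bilby builds on.
result = bilby.run_sampler(
    likelihood=ToyGaussianLikelihood(data),
    priors=priors,
    sampler="dynesty",
    nlive=500,          # number of live points
    outdir="outdir",
    label="toy_gaussian",
)
result.plot_corner()
```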

The fundamental idea is that for any meaningful inference, the posterior should peak in a much narrower region of parameter space than the prior. Instead of drawing samples that cover prior space uniformly, nested sampling algorithms draw samples from nested shells of increasing likelihood that naturally contract around the posterior distribution's modes.
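Formally, nested sampling recasts the evidence (in the notation introduced above) as a one-dimensional integral over the enclosed prior volume $X$,

$$ \mathcal{Z} = \int \mathcal{L}(\theta)\, \pi(\theta)\, \mathrm{d}\theta = \int_0^1 \mathcal{L}(X)\, \mathrm{d}X, \qquad X(\lambda) = \int_{\mathcal{L}(\theta) > \lambda} \pi(\theta)\, \mathrm{d}\theta , $$

where $\mathcal{L}(X)$ is the likelihood on the contour enclosing prior volume $X$. Stepping through the nested shells of increasing likelihood amounts to moving inwards along this single axis.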

Live Points

For that purpose a set of live points is generated, each representing a point in n-dimensional parameter space. They are then ranked by the associated likelihood, e.g. a measure of how well a waveform generated from this parameter set matches the observed waveform. The point of least likelihood is then set on death row, to be replaced by a new live point that is found by random walks starting at the soon-to-be dead point. This forms a sequence of dead points with increasing likelihood. In a sufficiently well-behaved parameter space (and only there does it make sense to do inference, after all) one is justified in assuming that they represent exponentially shrinking regions of prior space. One says that the dead points trace nested isocontours of the likelihood.
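Quantitatively, under this assumption the prior volume enclosed by the contour of the $i$-th dead point shrinks on average as

$$ X_i \approx \exp(-i / n_{live}) , $$

so each dead point can be assigned a shell of prior volume $\Delta X_i = X_{i-1} - X_i$.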

The product of a dead point's likelihood and its associated prior volume, i.e. the shell between its isocontour and the previous one, is called its weight. The sum of the weights is a numerical approximation to the evidence, the integral of the likelihood over prior space. The algorithm continues until it estimates that it has accounted for most of the evidence. More specifically, it computes a pragmatic estimate of the remaining evidence as the product of the highest known likelihood and the volume of prior space that has not yet been accounted for (assuming exponential shrinkage of the isocontours). The posterior is then delivered (almost) free of charge as the dead points' weights divided by the evidence.
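The self-contained sketch below illustrates this loop on a one-dimensional toy problem. It replaces the rejected point by simple rejection sampling from the prior rather than by the random walks used in practice, and all numbers (prior range, live-point count, stopping threshold) are arbitrary choices; it is meant to show the bookkeeping of weights, evidence and posterior, not to compete with bilby's samplers.

```python
import numpy as np

rng = np.random.default_rng(42)

def log_likelihood(theta):
    # Toy 1-D Gaussian likelihood centred at 0 with width 1.
    return -0.5 * theta**2 - 0.5 * np.log(2 * np.pi)

def sample_prior(n=1):
    # Uniform prior on [-10, 10].
    return rng.uniform(-10, 10, size=n)

n_live = 100
live = sample_prior(n_live)
live_logl = log_likelihood(live)

log_x_prev = 0.0          # ln of the enclosed prior volume, starts at ln(1)
dead_logl, dead_logw = [], []
logz = -np.inf            # running log-evidence

for i in range(1, 2001):
    # 1. Find the worst live point (lowest likelihood) and put it on death row.
    worst = np.argmin(live_logl)
    logl_min = live_logl[worst]

    # 2. Assign it the shell between the previous and the current volume,
    #    assuming exponential shrinkage X_i ~ exp(-i / n_live).
    log_x = -i / n_live
    log_dx = np.log(np.exp(log_x_prev) - np.exp(log_x))
    dead_logl.append(logl_min)
    dead_logw.append(logl_min + log_dx)          # weight = L_i * dX_i
    logz = np.logaddexp(logz, logl_min + log_dx)
    log_x_prev = log_x

    # 3. Replace it by a new point with L > L_min (rejection sampling here;
    #    bilby's samplers use smarter constrained proposals).
    while True:
        candidate = sample_prior(1)[0]
        if log_likelihood(candidate) > logl_min:
            break
    live[worst] = candidate
    live_logl[worst] = log_likelihood(candidate)

    # 4. Stop once the pragmatic estimate of the remaining evidence
    #    (highest known likelihood times unaccounted prior volume)
    #    drops below 10% of the evidence accumulated so far.
    logz_remain = np.max(live_logl) + log_x
    if logz_remain < logz + np.log(0.1):
        break

# Posterior weights: the dead points' weights divided by the evidence.
post_weights = np.exp(np.array(dead_logw) - logz)
print(f"ln Z ~ {logz:.3f} after {len(dead_logl)} iterations")
```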

Parallel Bilby

Although computationally much more efficient than naive uniform sampling over prior space, the routines provided by bilby still do not converge in a reasonable amount of time when a large number of dimensions is considered. For realistic parameter estimation of a compact binary, the runtime of a single inference run is on the order of a million CPU-hours, predominantly spent on likelihood evaluations. A tenable runtime therefore requires parallelisation. Parallel bilby serves exactly this purpose. Its main idea is that, since the live points are independent of each other, one does not need to wait for the replacement of an old point on death row before proposing a new live point candidate.

Instead, 'worker' or 'slave' cores simultaneously try to replace the currently worst live point, i.e. the one with the lowest likelihood, by performing a usual bilby step. A 'head' or 'master' core then uses the suitable candidates to update as many live points as possible. This procedure results in a speedup of about $n_{live} \ln(1 + n_{cores}/n_{live})$. In other words, the scaling is almost linear as long as the number of cores does not significantly exceed the number of live points. This gain comes at the cost of clustering effects, though, so strongly multi-modal distributions tend to be washed out quickly. In most astrophysical applications, however, one should expect $n_{modes} \ll n_{live}$, and parallel bilby provides acceptable approximations.
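To get a feeling for this scaling, the snippet below simply evaluates the quoted formula for a fixed number of live points; the chosen $n_{live}$ and core counts are illustrative, not a benchmark.

```python
import numpy as np

n_live = 1000  # illustrative choice
for n_cores in (16, 128, 512, 1000, 4000):
    # speedup ~ n_live * ln(1 + n_cores / n_live)
    speedup = n_live * np.log(1 + n_cores / n_live)
    print(f"{n_cores:5d} cores -> speedup ~ {speedup:6.0f} "
          f"(parallel efficiency ~ {speedup / n_cores:.0%})")
```

For this choice, the efficiency stays above roughly 80% as long as the core count remains below about half the number of live points, and drops noticeably once it exceeds $n_{live}$.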

The following pages will show how to install the Bilby Family and how to prepare, run and post-process parameter estimation runs in parallel bilby.

Last modified: 2022/08/19 08:55