Calculating Stability Metrics

High Level Overview

By default, stability compares a model trained on an extra week of data with the previous version of the model; however, any two models can be compared. Stability is measured as the proportion of overlap between the two models’ confidence/credible regions for each estimate. The more stable a model’s estimates are, the more overlap there will tend to be, even for narrow confidence/credible regions.

Technical Details

Overlap Function

For a single timestep for a single parameter, overlap is measured as:

Where

  • a_ub is the upper bound of Model A’s confidence/credible region
  • b_ub is the upper bound of Model B’s confidence/credible region
  • a_lb is the lower bound of Model A’s confidence/credible region
  • b_lb is the lower bound of Model B’s confidence/credible region
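
As an illustration of how the four bounds combine, here is a minimal sketch of an interval overlap in Python. The function name `interval_overlap` is hypothetical, and clipping disjoint regions to 0 is an assumption rather than something stated in this document.

```python
def interval_overlap(a_lb, a_ub, b_lb, b_ub):
    """Length of the intersection of [a_lb, a_ub] and [b_lb, b_ub].

    Assumption: regions that do not intersect get an overlap of 0
    rather than a negative value.
    """
    # The intersection runs from the larger lower bound
    # to the smaller upper bound.
    return max(0.0, min(a_ub, b_ub) - max(a_lb, b_lb))


# Model A spans [0, 2], Model B spans [1, 3]: they share [1, 2].
shared = interval_overlap(0.0, 2.0, 1.0, 3.0)
```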

The default confidence/credible region spans the 25th to the 75th percentiles (the middle 50% of our estimates). Each time the model samples a draw from the posterior, we record the parameter value; the percentiles are then taken from the distribution of those values across all draws.
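
The bounds for one parameter at one timestep could be derived from its posterior draws roughly as follows. This is a sketch using only the Python standard library; the helper name `credible_region` and the choice of the "inclusive" interpolation method are assumptions, since the exact percentile method is not specified here.

```python
import statistics


def credible_region(draws):
    """Return (lower, upper) bounds at the 25th and 75th percentiles.

    `draws` holds the recorded parameter values, one per posterior draw.
    Assumption: quartiles use linear interpolation ("inclusive" method).
    """
    # quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
    q1, _median, q3 = statistics.quantiles(draws, n=4, method="inclusive")
    return q1, q3


lower, upper = credible_region([1.0, 2.0, 3.0, 4.0, 5.0])
```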

This overlap value is then turned into a proportion by dividing it by the sum of the widths of the two confidence/credible regions, (a_ub - a_lb) + (b_ub - b_lb).
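
A short sketch of the normalization step, following the description above literally. The function name `overlap_proportion`, the clipping of disjoint regions to 0, and the value returned when both regions have zero width are all assumptions made for illustration.

```python
def overlap_proportion(a_lb, a_ub, b_lb, b_ub):
    """Overlap divided by the sum of the two region widths."""
    # Assumption: disjoint regions contribute an overlap of 0.
    overlap = max(0.0, min(a_ub, b_ub) - max(a_lb, b_lb))
    total_width = (a_ub - a_lb) + (b_ub - b_lb)
    if total_width == 0:
        # Assumption: two zero-width regions are scored as 0.
        return 0.0
    return overlap / total_width


partial = overlap_proportion(0.0, 2.0, 1.0, 3.0)    # shared width 1, widths 2 + 2
identical = overlap_proportion(0.0, 2.0, 0.0, 2.0)  # shared width 2, widths 2 + 2
```

Taken literally, this normalization yields 0.5 for two identical regions, because the shared width is counted once in the numerator and twice in the denominator.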


Weighted Overlap

Next, we calculate the overall stability across all parameters and all timesteps, as well as the stability for intercept, spend, lower funnel (if include_lower_funnel is True), and spikes (if include_spikes is True) individually. Parameters that are large in magnitude (far from 0) may have a bigger impact on our depvar. To account for that, each estimate’s contribution to the overall weighted overlap is weighted more heavily the larger its magnitude: when weighted overlap is calculated, each overlap is weighted by the average value of the 4 bounds (b_ub, b_lb, a_ub, a_lb). Thus for all parameters across all chains:
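
One way to read the weighting described above is as a weighted average of the per-estimate overlap proportions. The sketch below takes that reading; the function name `weighted_overlap`, the input layout, the division by the total weight, and the handling of zero totals are assumptions, not a statement of the actual implementation. The weight is the plain average of the four bounds, as described.

```python
def weighted_overlap(regions):
    """Weighted average of overlap proportions.

    `regions` is an iterable of (a_lb, a_ub, b_lb, b_ub) tuples,
    one per parameter and timestep.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for a_lb, a_ub, b_lb, b_ub in regions:
        # Assumption: disjoint regions contribute an overlap of 0.
        overlap = max(0.0, min(a_ub, b_ub) - max(a_lb, b_lb))
        total_width = (a_ub - a_lb) + (b_ub - b_lb)
        proportion = overlap / total_width if total_width else 0.0
        # Weight: average value of the 4 bounds.
        weight = (a_lb + a_ub + b_lb + b_ub) / 4
        weighted_sum += weight * proportion
        weight_total += weight
    # Assumption: normalize by the total weight; return 0 if it is 0.
    return weighted_sum / weight_total if weight_total else 0.0


# A small estimate near 0 and a large estimate near 11.
score = weighted_overlap([(0.0, 2.0, 1.0, 3.0), (10.0, 12.0, 10.0, 12.0)])
```

In this example the second estimate has a weight of 11 against 1.5 for the first, so its proportion dominates the result.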

Note: While spend and intercept are always included in overall stability, spikes are only included if include_spikes is True. Lower funnel channels are never included in the overall stability, but stability for lower funnel channels is calculated separately if include_lower_funnel is True.
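
The inclusion rules in this note can be summarized in a few lines of Python. This is only an illustrative sketch: the function name `stability_groups`, the group labels, and the returned dictionary layout are hypothetical; only the two flags come from the text above.

```python
def stability_groups(include_spikes=False, include_lower_funnel=False):
    """List which parameter groups feed each stability figure."""
    # Intercept and spend always count toward overall stability.
    overall = ["intercept", "spend"]
    if include_spikes:
        overall.append("spikes")
    # Every group in the overall figure is also reported individually.
    individual = list(overall)
    # Lower funnel is reported individually but never enters the overall figure.
    if include_lower_funnel:
        individual.append("lower_funnel")
    return {"overall": overall, "individual": individual}


groups = stability_groups(include_spikes=True, include_lower_funnel=True)
```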