
Mean-Zero Gaussian Processes: An Overview

One available option for the MMM’s time-varying parameters is our so-called mean-zero Gaussian process.

Straight to the definition

We’ll explain the definition of a mean-zero Gaussian process (GP) by contrasting it with a typical GP.

Typical Gaussian processes

A typical (non-mean-zero) GP can be defined as

$$y \sim \mathcal{N}(0, K)$$

where 'y' is a parameter vector with one entry for each time step and 'K' is a covariance matrix built from the GP's kernel function.[1] We've used a prior mean of zero here for simplicity.

Instead of sampling from that multivariate normal directly, we can first compute the Cholesky decomposition $K = LL^\top$, where $L$ is a lower triangular matrix with positive diagonal entries and $L^\top$ is its transpose. Multiplying $L$ by a vector of parameters sampled from independent standard normal distributions is equivalent to sampling from the multivariate normal above. In other words, we have this alternative implementation of the typical GP:

$$x \sim \mathcal{N}(0, I), \qquad y = Lx$$

where 'x' is now the parameter vector. This is equivalent to the $y \sim \mathcal{N}(0, K)$ definition above.
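As a quick illustration (a minimal sketch, not the MMM's actual implementation), here's how both constructions look in Python/NumPy. The squared exponential kernel and its settings are assumptions for the example; this note doesn't specify which kernel the MMM uses.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(52.0)                      # one entry per time step (e.g. weekly)

# Assumed squared exponential kernel; any valid kernel yields a covariance matrix K
diffs = t[:, None] - t[None, :]
K = np.exp(-0.5 * (diffs / 5.0) ** 2) + 1e-9 * np.eye(len(t))   # jitter for stability

# Direct definition: y ~ N(0, K)
y_direct = rng.multivariate_normal(mean=np.zeros(len(t)), cov=K)

# Cholesky implementation: K = L L^T, x ~ N(0, I), y = L x
L = np.linalg.cholesky(K)                # lower triangular with positive diagonal
x = rng.standard_normal(len(t))
y_chol = L @ x

# Both are draws from the same prior, since Cov(L x) = L I L^T = K
assert np.allclose(L @ L.T, K)
```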

Note that, even though the prior mean of 'y' is zero, its posterior mean (after fitting it to data) is not necessarily zero.

Mean-zero Gaussian processes

We define a mean-zero GP starting from the Cholesky decomposition implementation

$$x \sim \mathcal{N}(0, I), \qquad y = Lx$$


First we place a sum-to-zero constraint on 'x':

$$\sum_i x_i = 0$$


We then define

$$y = Lx$$

This is the mean-zero GP.
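For intuition, here's a minimal sketch (same assumed kernel as above) of drawing from the mean-zero GP prior. Centering a standard normal draw is one simple way to get an 'x' that sums to zero when simulating from the prior; how the constraint is imposed inside the fitted model isn't covered here.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 52
t = np.arange(float(n))
diffs = t[:, None] - t[None, :]
K = np.exp(-0.5 * (diffs / 5.0) ** 2) + 1e-9 * np.eye(n)   # assumed kernel, as above
L = np.linalg.cholesky(K)

# x ~ N(0, I) subject to sum(x) = 0; centering a standard normal draw is one way
# to get an x that sums to zero for prior simulation
x = rng.standard_normal(n)
x = x - x.mean()
assert abs(x.sum()) < 1e-9

# The mean-zero GP
y = L @ x
```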

Applications in the MMM

We use these mean-zero GPs to build the MMM’s time-varying parameters.

Suppose we want to build a channel's beta parameter, $\beta$, with one entry per time step.

First we build a vector 'z' on the unconstrained space by mixing the mean-zero GP with a scalar parameter 'w':

$$z = \sqrt{p}\,w + \sqrt{1 - p}\,y$$

where

  • 'p' is a scalar between 0 and 1,

  • 'w' is a scalar with a standard normal prior, $w \sim \mathcal{N}(0, 1)$,

  • and 'y' is a mean-zero GP. We scale it so that each entry of 'y' has a prior variance of 1.

We construct the final beta parameter using the inverse logit transform:

$$\beta = a + (b - a)\,\operatorname{logit}^{-1}(z)$$

where 'a' and 'b' are constants we can use to set appropriate priors on $\beta$.

We use this construction for essentially every time-varying parameter in the MMM.
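Putting the pieces together, here's a hedged end-to-end sketch of the construction. The kernel, the value of 'p', the prior on 'w', and the bounds 'a' and 'b' are placeholder assumptions for illustration, and the entry-wise rescaling of 'y' uses the unconstrained prior variances as an approximation.

```python
import numpy as np
from scipy.special import expit            # inverse logit

rng = np.random.default_rng(2)
n = 52
t = np.arange(float(n))
diffs = t[:, None] - t[None, :]
K = 0.5 * np.exp(-0.5 * (diffs / 5.0) ** 2) + 1e-9 * np.eye(n)   # assumed kernel
L = np.linalg.cholesky(K)

# Mean-zero GP, rescaled so each entry has a prior variance of roughly 1
x = rng.standard_normal(n)
x = x - x.mean()
y = L @ x
y = y / np.sqrt(np.diag(K))                # approximate: uses the unconstrained variances

# Mix with the scalar w; p controls how constant-like z is
p = 0.7                                    # placeholder, between 0 and 1
w = rng.standard_normal()                  # assumed prior w ~ N(0, 1)
z = np.sqrt(p) * w + np.sqrt(1.0 - p) * y

# Final time-varying beta via the inverse logit, bounded by the constants a and b
a, b = 0.0, 2.0                            # placeholder bounds
beta = a + (b - a) * expit(z)
```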

Why?

See the appendix below for the motivation behind this construction.

Appendix: Motivation for the construction

The original motivation for mean-zero GPs came from wanting a single parameter 'w' that acted as the mean value of the GP.

Naively we tried taking a typical GP 'y' and a scalar 'w' and defining a time-varying parameter 'z' like

$$z = w + y$$

but ran into a major issue. It turns out that when you use the Cholesky decomposition method to build the 'y' for this, 'x' becomes nonidentifiable.

If $y = Lx$ with $x \sim \mathcal{N}(0, I)$, then we can define alternate versions of 'x', 'y', and 'w' -- we'll call them $\tilde{x}$, $\tilde{y}$, and $\tilde{w}$ -- by

$$\tilde{x} = x + c\,L^{-1}\mathbf{1}, \qquad \tilde{w} = w - c,$$

where 'c' is any scalar and $\mathbf{1}$ is the vector of ones. Then

$$\tilde{y} = L\tilde{x} = Lx + c\,\mathbf{1} = y + c\,\mathbf{1},$$

and so

$$\tilde{z} = \tilde{w} + \tilde{y} = (w - c) + (y + c\,\mathbf{1}) = w + y = z.$$

Since $\tilde{z} = z$, we've found infinitely many different values of 'x' (since 'c' was arbitrary) that produce the same 'z'.
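For the curious, here's a quick numeric check of this argument, using the same assumed kernel as the earlier sketches:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 52
t = np.arange(float(n))
diffs = t[:, None] - t[None, :]
K = np.exp(-0.5 * (diffs / 5.0) ** 2) + 1e-9 * np.eye(n)
L = np.linalg.cholesky(K)
ones = np.ones(n)

# The naive construction z = w + y, with y = L x
w = 0.4
x = rng.standard_normal(n)
z = w + L @ x

# Shift x along the problem vector L^{-1} 1 and compensate with w
c = 1.7                                    # any scalar
x_alt = x + c * np.linalg.solve(L, ones)   # x-tilde = x + c * L^{-1} 1
w_alt = w - c                              # w-tilde = w - c
z_alt = w_alt + L @ x_alt

assert np.allclose(z, z_alt)               # same z from different (w, x)
```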


You might be interested to see what that vector $L^{-1}\mathbf{1}$ looks like. Here's a plot using one of our usual kernels:

[Figure: the vector $L^{-1}\mathbf{1}$ plotted over the time steps for one of our usual kernels]

What can we do to constrain 'x' so that it can't move along this problem vector? Well, since this problem vector changes the sum of 'x', we could try restricting 'x' so that its sum is zero: if both 'x' and $x + c\,L^{-1}\mathbf{1}$ must sum to zero, then 'c' has to be zero.

If we force 'x' to sum to zero, we effectively restrict our GP 'y' to a subset of all typical GPs. How much does this constrain 'y'? Not much, it turns out. We generated some data and fit it with these constrained GPs, and found that they behave quite similarly to unconstrained GPs.
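As one way to see this for yourself (an illustration, not the original experiment), you can compare prior draws from the constrained and unconstrained GPs:

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_draws = 52, 20_000
t = np.arange(float(n))
diffs = t[:, None] - t[None, :]
K = np.exp(-0.5 * (diffs / 5.0) ** 2) + 1e-9 * np.eye(n)
L = np.linalg.cholesky(K)

x = rng.standard_normal((n_draws, n))
y_unconstrained = x @ L.T                         # rows are draws of y = L x
x_centered = x - x.mean(axis=1, keepdims=True)    # impose sum(x) = 0 on each draw
y_constrained = x_centered @ L.T

# Compare the entry-wise prior standard deviations of the two versions
print(y_unconstrained.std(axis=0).round(2))
print(y_constrained.std(axis=0).round(2))
```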

The final step, going from

$$z = w + y$$

to

$$z = \sqrt{p}\,w + \sqrt{1 - p}\,y$$

gives us a new hyperparameter 'p' that controls how constant-like 'z' is. We could achieve a similar result by changing the variance hyperparameter in the GP kernel, but by doing it this way we don't need to worry about balancing the prior variances of 'y' and 'w'.


Footnotes

[1] A great intro to Gaussian processes is Yuge Shi’s “Gaussian Processes, not quite for dummies”, The Gradient, 2019.