This vignette describes mathematically the statistical models of `lgpr`. We go through the different arguments of the `lgp()` or `create_model()` modeling functions and the parts of the probabilistic model that they customize. This is a concise description; the original publication (Timonen et al. (2021)) has more information about the motivation for the used modeling approaches, and the tutorials have code examples.
The models in `lgpr` are models for the conditional distribution \[p(y \mid f(\textbf{x}), \theta_{\text{obs}})\] of a response variable \(y\) given covariates \(\textbf{x}\), where \(\theta_{\text{obs}}\) is a possible parameter of the observation model (like the magnitude of observation noise). The function \(f\) has a Gaussian Process (GP) prior \[f \sim \mathcal{GP}\left(0, k(\textbf{x}, \textbf{x}' \mid \theta_{\text{GP}})\right),\]
with covariance (kernel) function \(k(\textbf{x}, \textbf{x}' \mid \theta_{\text{GP}})\) that has hyperparameters \(\theta_{\text{GP}}\). In addition to the GP prior for \(f\), there is a parameter prior distribution \(p(\theta)\) for \(\theta = \left\{ \theta_{\text{GP}}, \theta_{\text{obs}} \right\}\). Given \(N\) observations \(\mathcal{D} = \{y_n, \textbf{x}_n\}_{n=1}^N\), the probabilistic models in `lgpr` have the form \[\begin{align}p\left(\theta, \textbf{f}\right) &= p\left(\textbf{f} \mid \theta\right) \cdot p(\theta) & \text{(prior)} \\ p(\textbf{y} \mid \textbf{f}, \theta) &= \prod_{n=1}^N p(y_n \mid f(\textbf{x}_n), \theta_{\text{obs}}) & \text{(likelihood)},\end{align}\] where \(\textbf{f} = \left[ f(\textbf{x}_1), \ldots, f(\textbf{x}_N) \right]^{\top}\) and \(\textbf{y} = \left[y_1, \ldots, y_N\right]^{\top}\). The parameter prior density \(p(\theta)\) is the product of the prior densities of each parameter, and the GP prior means that the prior for \(\textbf{f}\) is the multivariate normal \[\begin{equation}p\left(\textbf{f} \mid \theta\right) = \mathcal{N}\left(\textbf{f} \mid \textbf{0}, \textbf{K} \right),\end{equation}\] where the \(N \times N\) matrix \(\textbf{K}\) has entries \(\{ \textbf{K} \}_{in} = k(\textbf{x}_i, \textbf{x}_n \mid \theta_{\text{GP}})\).
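As a concrete illustration of the GP prior density \(p(\textbf{f} \mid \theta) = \mathcal{N}(\textbf{f} \mid \textbf{0}, \textbf{K})\), here is a minimal base-R sketch; the toy inputs, the latent values, and the EQ kernel with a fixed lengthscale are all arbitrary choices for this example, not part of the `lgpr` API:

```r
# Toy inputs and an EQ kernel with lengthscale 0.3 (arbitrary choices)
x <- c(0.1, 0.5, 0.9)
K <- outer(x, x, function(a, b) exp(-(a - b)^2 / (2 * 0.3^2)))
f <- c(0.2, -0.1, 0.4)  # hypothetical latent function values

# Evaluate log N(f | 0, K) via a Cholesky factorization
U <- chol(K + 1e-8 * diag(3))           # jitter for numerical stability
z <- backsolve(U, f, transpose = TRUE)  # solves t(U) %*% z = f
log_p <- -0.5 * sum(z^2) - sum(log(diag(U))) - (3 / 2) * log(2 * pi)
log_p
```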
The table below shows which parts of the above mathematical description are affected by which arguments to `lgp()` or `create_model()`. You can read more about them in the documentation of these functions.
| Argument | Affected model part |
|---|---|
| `formula` | \(k(\textbf{x}, \textbf{x}')\) |
| `data` | \(\mathcal{D}\) |
| `likelihood` | \(p(y_n \mid f(\textbf{x}_n), \theta_{\text{obs}})\) |
| `prior` | \(p(\theta)\) |
| `c_hat` | \(p(y_n \mid f(\textbf{x}_n), \theta_{\text{obs}})\) |
| `num_trials` | \(\mathcal{D}\) |
| `options` | \(k(\textbf{x}, \textbf{x}')\) |
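To make this mapping concrete, below is a hedged sketch of a model definition. The data frame and the prior specification are hypothetical; see the documentation of `lgp()` and `create_model()` for the exact `prior` syntax:

```r
library(lgpr)

# Hypothetical longitudinal data
dat <- data.frame(
  id  = factor(rep(1:4, each = 5)),
  age = rep(seq(10, 50, by = 10), times = 4),
  y   = rnorm(20)
)

model <- create_model(
  formula    = y ~ gp(age) + zs(id),        # determines k(x, x')
  data       = dat,                         # D
  likelihood = "gaussian",                  # p(y_n | f(x_n), theta_obs)
  prior      = list(alpha = student_t(20))  # part of p(theta)
)
```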
## `likelihood` argument and observation models

The terms *observation model* and *likelihood* are used to refer to the same formula, i.e. \(p(y_n \mid f(\textbf{x}_n), \theta_{\text{obs}})\), though the former views it as a function of \(\textbf{y}\) and the latter as a function of \(\theta\). There are currently five observation models available, and they all involve an inverse link function transformation \[h_n = g^{-1}\left( f(\textbf{x}_n) + \hat{c}_n \right),\] where \(g\) is determined by the `likelihood` argument and \(\hat{c}_n\) by the `c_hat` argument. The table below shows the link function and the parameter of the corresponding observation model in each case.
| `likelihood` | Link function \(g\) | Parameter \(\theta_{\text{obs}}\) |
|---|---|---|
| `gaussian` | identity | \(\sigma\) |
| `poisson` | logarithm | - |
| `nb` | logarithm | \(\phi\) |
| `binomial` | logit | - |
| `bb` | logit | \(\gamma\) |
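For intuition, the inverse links \(g^{-1}\) in the table correspond to the following plain-R transformations (the latent value is arbitrary):

```r
u <- 0.7                         # f(x_n) + c_hat_n, an arbitrary latent value
h_gaussian <- u                  # identity link: h_n = u
h_count    <- exp(u)             # poisson, nb: inverse of logarithm
h_prob     <- 1 / (1 + exp(-u))  # binomial, bb: inverse of logit, in (0, 1)
```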
In the Gaussian observation model (`likelihood = "gaussian"`), \[p(y_n \mid f(\textbf{x}_n), \theta_{\text{obs}}) = \mathcal{N}(y_n \mid h_n, \sigma^2),\] where \(\theta_{\text{obs}} = \sigma\) is a noise magnitude parameter.
The Poisson observation model (`likelihood = "poisson"`) for count data is \[y_n \sim \text{Poisson}\left(\lambda_n \right),\] where the rate is \(\lambda_n = h_n\).
In the negative binomial model (`likelihood = "nb"`), \(\lambda_n\) is gamma-distributed with parameters \[\begin{cases}\text{shape} &= \phi \\ \text{rate} &= \frac{\phi}{h_n}\end{cases}\] so that \(\text{E}[\lambda_n] = h_n\), and \(\phi > 0\) controls overdispersion so that \(\phi \rightarrow \infty\) corresponds to the Poisson model.
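This Poisson-gamma mixture can be checked by simulation; the following sketch (with arbitrary values for \(h_n\) and \(\phi\)) shows the overdispersed mean-variance relationship \(\text{Var}[y_n] = h_n + h_n^2/\phi\):

```r
h <- 4; phi <- 2                                    # arbitrary example values
lambda <- rgamma(1e5, shape = phi, rate = phi / h)  # E[lambda] = h
y <- rpois(1e5, lambda)
mean(y)  # approximately h
var(y)   # approximately h + h^2 / phi > h (overdispersion)
```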
When selecting the binomial or beta-binomial observation model for count data, the number of trials \(\eta_n\) for each \(n = 1, \ldots, N\) has to be supplied using the `num_trials` argument. The binomial model (`likelihood = "binomial"`) is \[y_n \sim \text{Binomial}(\eta_n, \rho_n),\] where the success probability is \(\rho_n = h_n\).
In the beta-binomial model (`likelihood = "bb"`), \(\rho_n\) is random so that \[\rho_n \sim \text{Beta}\left(h_n \cdot \frac{1 - \gamma}{\gamma},\ (1-h_n) \cdot \frac{1 - \gamma}{\gamma}\right),\] and the parameter \(\gamma \in [0,1]\) controls overdispersion so that \(\gamma \rightarrow 0\) corresponds to the binomial model.
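As a hedged sketch (the count data frame below is hypothetical, and `library(lgpr)` is assumed as in the earlier sketch), a binomial model with ten trials per observation could be defined like this:

```r
# Hypothetical count data: y successes out of 10 trials per observation
dat_counts <- data.frame(
  id  = factor(rep(1:4, each = 5)),
  age = rep(seq(10, 50, by = 10), times = 4),
  y   = rbinom(20, size = 10, prob = 0.3)
)

model_bin <- create_model(
  formula    = y ~ gp(age) + zs(id),
  data       = dat_counts,
  likelihood = "binomial",
  num_trials = rep(10, nrow(dat_counts))  # eta_n for each observation n
)
```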
When using the Gaussian observation model with `sample_f = TRUE`, the continuous response \(y\) is normalized to zero mean and unit variance, and \(\hat{c}_n = 0\) is set for all \(n\); in this case the `c_hat` argument has no effect. With the other observation models (which always use `sample_f = TRUE`), sensible defaults are used for \(\hat{c}_n\) if `c_hat` is not given. See the documentation of the `c_hat` argument of `lgp()` for exact details, and the Model inference section below for information about the `sample_f` argument.
## `formula` argument and kernel functions

The GP models of `lgpr` are additive, so that \[\begin{equation}k(\textbf{x}, \textbf{x}' \mid \theta_{\text{GP}}) = \sum_{j=1}^J \alpha_j^2 k_j(\textbf{x}, \textbf{x}' \mid \theta_{\text{GP}}).\end{equation}\] This is equivalent to saying that we have \(f = f^{(1)} + \ldots + f^{(J)}\), modeled so that each component \(j = 1, \ldots, J\) has a GP prior \[\begin{equation}f^{(j)} \sim \mathcal{GP}\left(0, \alpha_j^2 k_j(\textbf{x}, \textbf{x}' \mid \theta_{\text{GP}}) \right),\end{equation}\] independently from the other components.
The number of components \(J\) is equal to the number of terms in your `formula`. Terms are separated by a plus sign. An example formula with three terms could be
```r
y ~ gp(age) + gp(age)*zs(id) + categ(group)
```

where `y`, `age`, `id` and `group` have to be columns of `data`. Each formula term defines what the corresponding kernel \(k_j\) will be like, and what covariates and parameters it depends on. Each term adds one \(\alpha\) parameter to the GP parameter vector \(\theta_{\text{GP}}\), and possibly additional parameters depending on the term.
Each term is a product (separated by `*`) of what we call *expressions*. At this point we are not using standard R formula terminology, because terms in `lgpr` are parsed in a custom way. Each expression corresponds to one kernel, and the kernel \(k_j\) is the product of all the kernels in term \(j\). Inside parentheses, each expression must contain the name of one `data` variable, as in `gp(age)`. This determines what variable the kernel depends on. Most of the allowed expressions, their corresponding kernels, and allowed variable types are listed below.
| Expression | Corresponding kernel | Allowed variable type |
|---|---|---|
| `gp()` | Exponentiated quadratic (EQ) | Continuous |
| `zs()` | Zero-sum (ZS) | Categorical |
| `categ()` | Categorical (CAT) | Categorical |
| `gp_ns()` | Nonstationary (NS) | Continuous |
| `gp_vm()` | Variance-mask (VM) | Continuous |
Continuous covariates should be represented in `data` as numeric and categorical covariates as factors. Equations for the different kernels are listed here briefly. See Timonen et al. (2021) for more motivation and details about what kind of effects they can model alone and in combinations.
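For example, a hypothetical data frame with correctly typed covariates (names chosen to match the earlier example formula) could look like this:

```r
dat <- data.frame(
  id    = factor(rep(1:3, each = 4)),            # categorical: zs(), categ()
  age   = rep(seq(12, 48, by = 12), times = 3),  # continuous: gp(), gp_ns(), gp_vm()
  group = factor(rep(c("A", "B", "C"), each = 4)),
  y     = rnorm(12)
)
str(dat)  # verify that the types match what the formula expressions expect
```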
The EQ kernel is \[k_{\text{EQ}}(x, x' \mid \theta_{\text{GP}}) = \exp \left(-\frac{(x-x')^2}{2 \ell^2}\right),\] and it has the lengthscale parameter \(\ell\). Each EQ kernel adds one lengthscale parameter to \(\theta_{\text{GP}}\).
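A one-line R sketch of the EQ kernel (for scalar inputs) makes its behavior easy to probe:

```r
k_eq <- function(x1, x2, ell) exp(-(x1 - x2)^2 / (2 * ell^2))
k_eq(1.0, 1.2, ell = 2)  # near 1: close inputs are highly correlated
k_eq(1.0, 9.0, ell = 2)  # near 0: distant inputs are nearly uncorrelated
```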
The ZS kernel is \[\begin{equation} k_{\text{ZS}}(z, z') = \begin{cases} 1 & \text{ if } z = z' \\ \frac{1}{1 - M} & \text{ if } z \neq z', \end{cases}\end{equation}\] where \(M\) is the number of different categories for covariate \(z\).
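A small R sketch demonstrates the defining zero-sum property: for a fixed category, the kernel values sum to zero over all \(M\) categories:

```r
k_zs <- function(z1, z2, M) ifelse(z1 == z2, 1, 1 / (1 - M))
M <- 4
sum(sapply(1:M, function(zp) k_zs(1, zp, M)))  # 0, hence "zero-sum"
```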
The CAT kernel is \[\begin{equation} k_{\text{CAT}}(z, z') = \begin{cases} 1 & \text{ if } z = z' \\ 0 & \text{ if } z \neq z'. \end{cases}\end{equation}\]
The NS kernel is \[\begin{equation} k_{\text{NS}}(x, x' \mid a, \ell) = k_{\text{EQ}}(\omega_a(x), \omega_a(x') \mid \ell),\end{equation}\] where \(\omega_a: \mathbb{R} \rightarrow ]-1,1[\) is the input warping function \[\begin{equation} \omega_a(x) = 2 \cdot \left(\frac{1}{1 + e^{-a x}} - \frac{1}{2}\right).\end{equation}\] Each NS kernel adds one lengthscale parameter \(\ell\) and one warping steepness parameter \(a\) to \(\theta_{\text{GP}}\).
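A sketch of the warping and the resulting NS kernel in R (redefining `k_eq` from the EQ sketch so the block is self-contained):

```r
k_eq  <- function(x1, x2, ell) exp(-(x1 - x2)^2 / (2 * ell^2))
omega <- function(x, a) 2 * (1 / (1 + exp(-a * x)) - 0.5)  # maps R to (-1, 1)
k_ns  <- function(x1, x2, a, ell) k_eq(omega(x1, a), omega(x2, a), ell)
k_ns(-10, 10, a = 1, ell = 1)  # inputs saturate near -1 and 1 after warping
```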
The VM kernel is \[\begin{equation} k_{\text{VM}}(x, x' \mid a, \ell) = f^a_{\text{VM}}(x) \cdot f^a_{\text{VM}}(x') \cdot k_{\text{NS}}(x, x' \mid a, \ell),\end{equation}\] where \(f^a_{\text{VM}}(x) = \frac{1}{1 + e^{-a h_2 (x-r)}}\) and \(r = \frac{1}{a}\text{logit}(h_1)\). The parameters \(h_1\) and \(h_2\) are determined by `opt$vm_params[1]` and `opt$vm_params[2]`, respectively, where `opt` is the `options` argument given to `lgp()`. Each VM kernel adds one lengthscale parameter \(\ell\) and one warping steepness parameter \(a\) to \(\theta_{\text{GP}}\).
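Continuing the sketch (this reuses `k_ns` from the previous block; the values for \(h_1\) and \(h_2\) are arbitrary examples, passed explicitly here rather than via `options`):

```r
f_vm <- function(x, a, h1, h2) {
  r <- (1 / a) * log(h1 / (1 - h1))  # r = logit(h1) / a
  1 / (1 + exp(-a * h2 * (x - r)))
}
k_vm <- function(x1, x2, a, ell, h1, h2) {
  f_vm(x1, a, h1, h2) * f_vm(x2, a, h1, h2) * k_ns(x1, x2, a, ell)
}
# The mask shrinks the variance far below r: k_vm(x, x, ...) is near 0 there
k_vm(-8, -8, a = 1, ell = 1, h1 = 0.025, h2 = 1)
```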
All kernels that work with continuous covariates are actually also multiplied by a binary mask (BIN) kernel \(k_{\text{BIN}}(x, x')\), which returns \(0\) if either \(x\) or \(x'\) is missing and \(1\) if both are available. Missing data should be encoded as `NA` or `NaN` in `data`.
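In R terms, the mask is simply (a sketch for scalar inputs):

```r
k_bin <- function(x1, x2) as.numeric(!is.na(x1) && !is.na(x2))
k_bin(2.5, NA)   # 0: pairs involving a missing value contribute nothing
k_bin(2.5, 3.0)  # 1: fully observed pairs are unaffected
```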
There are also the `het()` and `unc()` expressions. They cannot be alone in a term but have to be multiplied by EQ, NS or VM. They are not actually kernels alone, but they edit the covariate or kernel of their term and add additional parameters. See the tutorials for example use cases and Timonen et al. (2021) for their mathematical definitions.
## Model inference

After the model is defined, `lgpr` uses the MCMC methods of Stan to obtain draws from either the joint posterior \(p\left(\theta, \textbf{f} \mid \mathcal{D}\right)\) or the marginal posterior of the parameters, i.e. \(p\left(\theta \mid \mathcal{D}\right)\). Which one is done is determined by the `sample_f` argument of `lgp()` or `create_model()`.
### `sample_f = TRUE`

This option is always possible but not recommended with `likelihood = "gaussian"`. The joint posterior that is sampled from is \[\begin{equation}p\left(\theta, \textbf{f} \mid \mathcal{D}\right) \propto p\left(\theta, \textbf{f}\right) \cdot p(\textbf{y} \mid \textbf{f}, \theta),\end{equation}\] and sampling requires evaluating the right-hand side and its gradient thousands of times.
### `sample_f = FALSE`

This option is only possible (and is automatically selected by default) if `likelihood = "gaussian"`. This is because \[\begin{equation}p\left(\textbf{y} \mid \theta\right) = \mathcal{N}\left(\textbf{y} \mid \textbf{0}, \textbf{K} + \sigma^2 \textbf{I} \right)\end{equation}\] is analytically available only in this case. The distribution that is sampled from is \[\begin{equation}p\left(\theta \mid \mathcal{D}\right) \propto p\left(\theta\right) \cdot p(\textbf{y} \mid \theta),\end{equation}\] and now sampling requires repeatedly evaluating the right-hand side of this equation and its gradient. This analytical marginalization reduces the MCMC dimension by \(N\) and likely improves sampling efficiency. The conditional posterior \(p\left(\textbf{f} \mid \theta, \mathcal{D}\right)\) also has an analytical form for a fixed \(\theta\), so draws from the marginal posterior \(p\left(\textbf{f} \mid \mathcal{D}\right)\) can be obtained by first drawing \(\theta\) and then \(\textbf{f}\), according to the process \[\begin{align}\theta &\sim p\left(\theta \mid \mathcal{D}\right) \\ \textbf{f} &\sim p\left(\textbf{f} \mid \theta, \mathcal{D}\right).\end{align}\] By combining these, we again have draws from the joint posterior \(p\left(\theta, \textbf{f} \mid \mathcal{D}\right)\), but likely with less computational effort.
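As a final hedged sketch (reusing the hypothetical `dat` from the earlier sketches), the marginalized parameterization is selected like this; note that this call actually runs MCMC:

```r
fit <- lgp(
  formula    = y ~ gp(age) + zs(id),
  data       = dat,          # hypothetical data frame from the earlier sketch
  likelihood = "gaussian",   # required for sample_f = FALSE
  sample_f   = FALSE         # sample p(theta | D); f is handled analytically
)
```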