| Title: | Clustering and Prediction using Multi-Task Gaussian Processes with Common Mean |
| Version: | 1.2.1 |
| Description: | An implementation of the multi-task Gaussian processes with common mean framework. Two main algorithms, called 'Magma' and 'MagmaClust', are available to perform predictions for supervised learning problems, in particular for time series or any functional/continuous data applications. The corresponding articles have been proposed, respectively, by Arthur Leroy, Pierre Latouche, Benjamin Guedj and Servane Gey (2022) <doi:10.1007/s10994-022-06172-1>, and Arthur Leroy, Pierre Latouche, Benjamin Guedj and Servane Gey (2023) <https://jmlr.org/papers/v24/20-1321.html>. These approaches leverage the learning of cluster-specific mean processes, which are common across similar tasks, to provide enhanced prediction performances (even far from data) at a linear computational cost (in the number of tasks). 'MagmaClust' is a generalisation of 'Magma' where the tasks are simultaneously clustered into groups, each being associated with a specific mean process. User-oriented functions in the package are decomposed into training, prediction and plotting functions. Some basic features (classic kernels, training, prediction) of standard Gaussian processes are also implemented. |
| License: | MIT + file LICENSE |
| URL: | https://github.com/ArthurLeroy/MagmaClustR, https://arthurleroy.github.io/MagmaClustR/ |
| BugReports: | https://github.com/ArthurLeroy/MagmaClustR/issues |
| Imports: | broom, dplyr, ggplot2, magrittr, methods, mvtnorm, plyr, purrr, Rcpp, rlang, stats, tibble, tidyr, tidyselect |
| Suggests: | gganimate, gifski, gridExtra, knitr, plotly, png, rmarkdown, testthat (≥ 3.0.0), transformr |
| LinkingTo: | Rcpp |
| Encoding: | UTF-8 |
| LazyData: | true |
| RoxygenNote: | 7.2.3 |
| Depends: | R (≥ 2.10) |
| NeedsCompilation: | yes |
| Packaged: | 2024-06-28 20:01:23 UTC; Arthur Leroy |
| Author: | Arthur Leroy |
| Maintainer: | Arthur Leroy <arthur.leroy.pro@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2024-06-28 20:20:02 UTC |
MagmaClustR : Clustering and Prediction using Multi-Task Gaussian Processes
Description
The MagmaClustR package implements two main algorithms, called Magma and MagmaClust, using a multi-task GPs model to perform predictions for supervised learning problems. These approaches leverage the learning of cluster-specific mean processes, which are common across similar tasks, to provide enhanced prediction performances (even far from data) at a linear computational cost (in the number of tasks). MagmaClust is a generalisation of Magma where the tasks are simultaneously clustered into groups, each being associated with a specific mean process. User-oriented functions in the package are decomposed into training, prediction and plotting functions. Some basic features of standard GPs are also implemented.
Details
For a quick introduction to MagmaClustR, please refer to the README at https://github.com/ArthurLeroy/MagmaClustR
Author(s)
Arthur Leroy, Pierre Pathe and Pierre Latouche
Maintainer: Arthur Leroy - arthur.leroy.pro@gmail.com
References
Arthur Leroy, Pierre Latouche, Benjamin Guedj, and Servane Gey.
MAGMA: Inference and Prediction with Multi-Task Gaussian Processes. Machine Learning, 2022, https://link.springer.com/article/10.1007/s10994-022-06172-1
Arthur Leroy, Pierre Latouche, Benjamin Guedj, and Servane Gey.
Cluster-Specific Predictions with Multi-Task Gaussian Processes. Journal of Machine Learning Research, 2023, https://jmlr.org/papers/v24/20-1321.html
Examples
Simulate a dataset, train and predict with Magma
set.seed(4242)
data_magma <- simu_db(M = 11, N = 10, K = 1)
magma_train <- data_magma %>% subset(ID %in% 1:10)
magma_test <- data_magma %>% subset(ID == 11) %>% head(7)
magma_model <- train_magma(data = magma_train)
magma_pred <- pred_magma(data = magma_test, trained_model = magma_model, grid_inputs = seq(0, 10, 0.01))
Simulate a dataset, train and predict with MagmaClust
set.seed(4242)
data_magmaclust <- simu_db(M = 4, N = 10, K = 3)
list_ID <- unique(data_magmaclust$ID)
magmaclust_train <- data_magmaclust %>% subset(ID %in% list_ID[1:11])
magmaclust_test <- data_magmaclust %>% subset(ID == list_ID[12]) %>% head(5)
magmaclust_model <- train_magmaclust(data = magmaclust_train)
magmaclust_pred <- pred_magmaclust(data = magmaclust_test,
trained_model = magmaclust_model, grid_inputs = seq(0, 10, 0.01))
Author(s)
Maintainer: Arthur Leroy arthur.leroy.pro@gmail.com (ORCID)
Authors:
Pierre Latouche pierre.latouche@gmail.com
Other contributors:
Pierre Pathé pathepierre@gmail.com [contributor]
Alexia Grenouillat grenouil@insa-toulouse.fr [contributor]
Hugo Lelievre lelievre@insa-toulouse.fr [contributor]
See Also
Useful links:
Report bugs at https://github.com/ArthurLeroy/MagmaClustR/issues
Pipe operator
Description
See magrittr::%>% for details.
Usage
lhs %>% rhs
Arguments
lhs | A value or the magrittr placeholder. |
rhs | A function call using the magrittr semantics. |
Value
The result of calling rhs(lhs).
Round a matrix to make it symmetric
Description
If a matrix is non-symmetric due to numerical errors, round with a decreasing number of digits until the matrix becomes symmetric.
Usage
check_symmetric(mat, digits = 10)
Arguments
mat | A matrix, possibly non-symmetric. |
digits | A number, the starting number of digits to round from if |
Value
A matrix, rounded approximation of mat that is symmetric.
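The rounding strategy described above can be sketched in base R as follows. This is an illustrative re-implementation under stated assumptions, not the package source: the name round_until_symmetric is hypothetical, and the stopping rule (give up once 0 digits is reached) is an assumption.

```r
# Hypothetical sketch, not the MagmaClustR implementation.
round_until_symmetric <- function(mat, digits = 10) {
  # Try progressively coarser rounding until the matrix equals its transpose.
  for (d in seq(digits, 0)) {
    rounded <- round(mat, d)
    if (all(rounded == t(rounded))) {
      return(rounded)
    }
  }
  stop("Matrix is still non-symmetric after rounding to 0 digits.")
}

s <- matrix(c(2, 0.5, 0.5, 1), nrow = 2)
s[1, 2] <- s[1, 2] + 1e-13          # inject a tiny numerical asymmetry
s_fixed <- round_until_symmetric(s)
all(s_fixed == t(s_fixed))
```

Starting from a large number of digits keeps the rounded matrix as close as possible to the original values.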
Examples
TRUE
Invert a matrix using an adaptive jitter term
Description
Invert a matrix from its Cholesky decomposition. If (nearly-)singular, increase the order of magnitude of the jitter term added to the diagonal until the matrix becomes non-singular.
Usage
chol_inv_jitter(mat, pen_diag)
Arguments
mat | A matrix, possibly singular. |
pen_diag | A number, a jitter term to add on the diagonal. |
Value
A matrix, inverse of mat plus an adaptive jitter term added on the diagonal.
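A minimal base R sketch of the adaptive jitter idea is given below. It is not the package source: the function name, the max_tries argument and the choice of multiplying the jitter by 10 at each failure are assumptions made for illustration.

```r
# Hypothetical sketch, not the MagmaClustR implementation.
chol_inverse_with_jitter <- function(mat, jitter = 1e-10, max_tries = 10) {
  for (i in seq_len(max_tries)) {
    # Add the current jitter on the diagonal and attempt a Cholesky factorisation.
    jittered <- mat + diag(jitter, nrow(mat))
    chol_factor <- tryCatch(chol(jittered), error = function(e) NULL)
    if (!is.null(chol_factor)) {
      # chol2inv() computes the inverse from the upper-triangular Cholesky factor.
      return(chol2inv(chol_factor))
    }
    jitter <- jitter * 10   # raise the jitter by one order of magnitude
  }
  stop("Matrix could not be inverted within 'max_tries' attempts.")
}

singular <- matrix(1, nrow = 2, ncol = 2)     # rank 1, not invertible as is
inv_singular <- chol_inverse_with_jitter(singular)
inv_regular <- chol_inverse_with_jitter(diag(2, 2))
```

The returned matrix is the inverse of the jittered matrix, so it only approximates the inverse of the input when the jitter is small relative to its entries.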
Examples
TRUE
Allocate training data into the most probable cluster
Description
Allocate training data into the most probable cluster
Usage
data_allocate_cluster(trained_model)
Arguments
trained_model | A list, containing the information coming from a MagmaClust model, previously trained using the |
Value
The original dataset used to train the MagmaClust model, with additional 'Cluster' and associated 'Proba' columns, indicating the most probable cluster for each individual/task at the end of the training procedure.
Examples
TRUE
Compute the Multivariate Gaussian likelihood
Description
Modification of the function dmvnorm() from the package mvtnorm, providing an implementation of the Multivariate Gaussian likelihood. This version takes the inverse of the covariance matrix as argument instead of the covariance matrix itself.
Usage
dmnorm(x, mu, inv_Sigma, log = FALSE)
Arguments
x | A vector, containing values the likelihood is evaluated on. |
mu | A vector or matrix, specifying the mean parameter. |
inv_Sigma | A matrix, specifying the inverse of covariance parameter. |
log | A logical value, indicating whether we return the log-likelihood. |
Value
A number, corresponding to the Multivariate Gaussian likelihood (or log-likelihood when log = TRUE).
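To illustrate why passing the inverse covariance is convenient, the sketch below evaluates a Gaussian log-density directly from a precision matrix, so no inversion is needed inside the function. It is a hedged base R illustration, not the package code; the function name log_gauss_precision is hypothetical.

```r
# Hypothetical sketch, not the MagmaClustR implementation.
log_gauss_precision <- function(x, mu, inv_sigma) {
  centred <- x - mu
  # log|Sigma| = -log|Sigma^{-1}|, computed on the log scale for stability.
  log_det_inv <- as.numeric(determinant(inv_sigma, logarithm = TRUE)$modulus)
  quad_form <- as.numeric(t(centred) %*% inv_sigma %*% centred)
  -0.5 * (length(x) * log(2 * pi) - log_det_inv + quad_form)
}

# With a diagonal covariance, the result must match a sum of univariate densities.
x <- c(1, 2)
sigma <- diag(c(4, 9))
ll <- log_gauss_precision(x, mu = c(0, 0), inv_sigma = solve(sigma))
ll_check <- sum(dnorm(x, mean = 0, sd = c(2, 3), log = TRUE))
```

Exponentiating the result gives the likelihood on the natural scale.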
Examples
TRUE
Draw a number
Description
Draw uniformly a number within a specified interval
Usage
draw(int)
Arguments
int | An interval of values we want to draw uniformly in. |
Value
A random number, rounded to 2 decimals.
Examples
TRUE
E-Step of the EM algorithm
Description
Expectation step of the EM algorithm to compute the parameters of the hyper-posterior Gaussian distribution of the mean process in Magma.
Usage
e_step(db, m_0, kern_0, kern_i, hp_0, hp_i, pen_diag)
Arguments
db | A tibble or data frame. Columns required: ID, Input, Output. Additional columns for covariates can be specified. |
m_0 | A vector, corresponding to the prior mean of the mean GP. |
kern_0 | A kernel function, associated with the mean GP. |
kern_i | A kernel function, associated with the individual GPs. |
hp_0 | A named vector, tibble or data frame of hyper-parameters associated with |
hp_i | A tibble or data frame of hyper-parameters associated with |
pen_diag | A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A named list, containing the elements mean, a tibble containing the Input and associated Output of the hyper-posterior's mean parameter, and cov, the hyper-posterior's covariance matrix.
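The hyper-posterior parameters can be illustrated with standard Gaussian conjugacy in a simplified setting. The sketch below assumes that all individuals are observed on the same inputs and share one covariance matrix, which is a simplification of the general case; the function name and its arguments are hypothetical and this is not the package source.

```r
# Hypothetical sketch (not the package source), assuming a common grid of
# inputs and a single covariance matrix shared by all individuals.
hyperposterior_common_grid <- function(outputs, m_0, cov_0, cov_i) {
  inv_0 <- solve(cov_0)            # prior precision of the mean process
  inv_i <- solve(cov_i)            # precision of each individual process
  n_indiv <- ncol(outputs)         # one column of outputs per individual
  # Precisions add up: one prior term plus one term per individual.
  post_cov <- solve(inv_0 + n_indiv * inv_i)
  # Precision-weighted combination of the prior mean and the observations.
  post_mean <- post_cov %*% (inv_0 %*% m_0 + inv_i %*% rowSums(outputs))
  list(mean = as.numeric(post_mean), cov = post_cov)
}

outputs <- cbind(c(1, 2, 3), c(3, 4, 5))   # 3 inputs, 2 individuals
post <- hyperposterior_common_grid(
  outputs,
  m_0 = rep(0, 3),
  cov_0 = diag(1e6, 3),   # vague prior on the mean process
  cov_i = diag(1, 3)
)
post$mean   # close to the row means of 'outputs' under this vague prior
```

With a vague prior, the hyper-posterior mean approaches the average of the individual outputs, which matches the intuition of a mean process shared across tasks.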
Examples
TRUE
Penalised elbo for multiple mean GPs with common HPs
Description
Penalised elbo for multiple mean GPs with common HPs
Usage
elbo_GP_mod_common_hp_k(hp, db, mean, kern, post_cov, pen_diag)
Arguments
hp | A tibble, data frame or named vector containing hyper-parameters. |
db | A tibble containing values we want to compute elbo on. Required columns: Input, Output. Additional covariate columns are allowed. |
mean | A list of the K mean GPs at union of observed timestamps. |
kern | A kernel function used to compute the covariance matrix at corresponding timestamps. |
post_cov | A list of the K posterior covariances of the mean GPs (mu_k). Used to compute the correction term (cor_term). |
pen_diag | A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
The value of the penalised Gaussian elbo for the sum of the k mean GPs with common HPs.
Examples
TRUE
Evidence Lower Bound for a mixture of GPs
Description
Evidence Lower Bound for a mixture of GPs
Usage
elbo_clust_multi_GP(hp, db, hyperpost, kern, pen_diag)
Arguments
hp | A tibble, data frame or named vector containing hyper-parameters. |
db | A tibble containing the values we want to compute the elbo on. Required columns: Input, Output. Additional covariate columns are allowed. |
hyperpost | List of parameters for the K mean GPs. |
kern | A kernel function used to compute the covariance matrix at corresponding timestamps. |
pen_diag | A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
The value of the penalised Gaussian elbo for a mixture of GPs.
Examples
TRUE
Penalised elbo for multiple individual GPs with common HPs
Description
Penalised elbo for multiple individual GPs with common HPs
Usage
elbo_clust_multi_GP_common_hp_i(hp, db, hyperpost, kern, pen_diag)
Arguments
hp | A tibble, data frame or named vector containing hyper-parameters. |
db | A tibble containing values we want to compute elbo on. Required columns: Input, Output. Additional covariate columns are allowed. |
hyperpost | List of parameters for the K mean Gaussian processes. |
kern | A kernel function used to compute the covariance matrix at corresponding timestamps. |
pen_diag | A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
The value of the penalised Gaussian elbo for the sum of the M individual GPs with common HPs.
Examples
TRUE
Evidence Lower Bound maximised in MagmaClust
Description
Evidence Lower Bound maximised in MagmaClust
Usage
elbo_monitoring_VEM(hp_k, hp_i, db, kern_i, kern_k, hyperpost, m_k, pen_diag)
Arguments
hp_k | A tibble, data frame or named vector of hyper-parameters for each cluster. |
hp_i | A tibble, data frame or named vector of hyper-parameters for each individual. |
db | A tibble containing values we want to compute elbo on. Required columns: Input, Output. Additional covariate columns are allowed. |
kern_i | Kernel used to compute the covariance matrix of the individual GPs at corresponding inputs. |
kern_k | Kernel used to compute the covariance matrix of the mean GPs at corresponding inputs. |
hyperpost | A list of parameters for the variational distributions of the K mean GPs. |
m_k | Prior value of the mean parameter of the mean GPs (mu_k). Length = 1 or nrow(db). |
pen_diag | A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
Value of the elbo that is maximised during the VEM algorithm used for training in MagmaClust.
Examples
TRUE
Expand a grid of inputs
Description
Expand a grid of inputs
Usage
expand_grid_inputs(Input, ...)
Arguments
Input | A vector of inputs. |
... | As many vectors of covariates as desired. We advise giving explicit names when using the function. |
Value
A tibble containing all the combinations of values of the parameters.
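The notion of "all combinations" can be pictured with base R's expand.grid(), which builds the Cartesian product of its arguments. This is only an analogy under the assumption that the package function behaves similarly; it returns a data frame rather than a tibble, and the column name Covariate is a hypothetical example.

```r
# Base R analogy, not a call to the package function.
grid <- expand.grid(
  Input = seq(0, 1, by = 0.5),   # 3 reference inputs: 0, 0.5, 1
  Covariate = c(10, 20)          # 2 values of a hypothetical covariate
)
nrow(grid)   # 3 * 2 = 6 combinations
```

Naming each argument explicitly, as advised above, gives readable column names in the resulting grid.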
Examples
TRUE
Gradient of the logLikelihood of a Gaussian Process
Description
Gradient of the logLikelihood of a Gaussian Process
Usage
gr_GP(hp, db, mean, kern, post_cov, pen_diag)
Arguments
hp | A tibble, data frame or named vector containing hyper-parameters. |
db | A tibble containing the values we want to compute the logL on. Required columns: Input, Output. Additional covariate columns are allowed. |
mean | A vector, specifying the mean of the GP at the reference inputs. |
kern | A kernel function. |
post_cov | (optional) A matrix, corresponding to covariance parameter of the hyper-posterior. Used to compute the hyper-prior distribution of a new individual in Magma. |
pen_diag | A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
A named vector, corresponding to the value of the hyper-parameters' gradients for the Gaussian log-Likelihood (where the covariance can be the sum of the individual and the hyper-posterior's mean process covariances).
Examples
TRUE
Gradient of the modified logLikelihood for GPs in Magma
Description
Gradient of the modified logLikelihood for GPs in Magma
Usage
gr_GP_mod(hp, db, mean, kern, post_cov, pen_diag)
Arguments
hp | A tibble, data frame or named vector containing hyper-parameters. |
db | A tibble containing the values we want to compute the logL on. Required columns: Input, Output. Additional covariate columns are allowed. |
mean | A vector, specifying the mean of the GPs at the reference inputs. |
kern | A kernel function. |
post_cov | A matrix, covariance parameter of the hyper-posterior. Used to compute the correction term. |
pen_diag | A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
A named vector, corresponding to the value of the hyper-parameters' gradients for the modified Gaussian log-Likelihood involved in Magma.
Examples
TRUE
Gradient of the modified logLikelihood with common HPs for GPs in Magma
Description
Gradient of the modified logLikelihood with common HPs for GPs in Magma
Usage
gr_GP_mod_common_hp(hp, db, mean, kern, post_cov, pen_diag)
Arguments
hp | A tibble or data frame containing hyper-parameters for all individuals. |
db | A tibble containing the values we want to compute the logL on. Required columns: ID, Input, Output. Additional covariate columns are allowed. |
mean | A vector, specifying the mean of the GPs at the reference inputs. |
kern | A kernel function. |
post_cov | A matrix, covariance parameter of the hyper-posterior. Used to compute the correction term. |
pen_diag | A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
A named vector, corresponding to the value of the hyper-parameters' gradients for the modified Gaussian log-Likelihood involved in Magma with the 'common HP' setting.
Examples
TRUE
Gradient of the penalised elbo for multiple mean GPs with common HPs
Description
Gradient of the penalised elbo for multiple mean GPs with common HPs
Usage
gr_GP_mod_common_hp_k(hp, db, mean, kern, post_cov, pen_diag)
Arguments
hp | A tibble, data frame or named vector containing hyper-parameters. |
db | A tibble containing the values we want to compute the elbo on. Required columns: Input, Output. Additional covariate columns are allowed. |
mean | A list of the k means of the GPs at union of observed timestamps. |
kern | A kernel function. |
post_cov | A list of the k posterior covariances of the mean GPs (mu_k). Used to compute the correction term (cor_term). |
pen_diag | A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
The gradient of the penalised Gaussian elbo for the sum of the k mean GPs with common HPs.
Examples
TRUE
Gradient of the elbo for a mixture of GPs
Description
Gradient of the elbo for a mixture of GPs
Usage
gr_clust_multi_GP(hp, db, hyperpost, kern, pen_diag)
Arguments
hp | A tibble, data frame or named vector containing hyper-parameters. |
db | A tibble containing the values we want to compute the elbo on. Required columns: Input, Output. Additional covariate columns are allowed. |
hyperpost | List of parameters for the K mean Gaussian processes. |
kern | A kernel function. |
pen_diag | A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
The gradient of the penalised Gaussian elbo for a mixture of GPs.
Examples
TRUE
Gradient of the penalised elbo for multiple individual GPs with common HPs
Description
Gradient of the penalised elbo for multiple individual GPs with common HPs
Usage
gr_clust_multi_GP_common_hp_i(hp, db, hyperpost, kern, pen_diag = NULL)
Arguments
hp | A tibble, data frame or named vector of hyper-parameters. |
db | A tibble containing values we want to compute elbo on. Required columns: Input, Output. Additional covariate columns are allowed. |
hyperpost | List of parameters for the K mean Gaussian processes. |
kern | A kernel function used to compute the covariance matrix at corresponding timestamps. |
pen_diag | A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
The gradient of the penalised Gaussian elbo for the sum of the M individual GPs with common HPs.
Examples
TRUE
Gradient of the mixture of Gaussian likelihoods
Description
Compute the gradient of a sum of Gaussian log-likelihoods, weighted by their mixture probabilities.
Usage
gr_sum_logL_GP_clust(hp, db, mixture, mean, kern, post_cov, pen_diag)
Arguments
hp | A tibble, data frame or named vector of hyper-parameters. |
db | A tibble containing data we want to evaluate the logL on. Required columns: Input, Output. Additional covariate columns are allowed. |
mixture | A tibble or data frame, indicating the mixture probabilities of each cluster for the new individual/task. |
mean | A list of hyper-posterior mean parameters for all clusters. |
kern | A kernel function. |
post_cov | A list of hyper-posterior covariance parameters for all clusters. |
pen_diag | A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
A named vector, corresponding to the value of the hyper-parameters' gradients for the mixture of Gaussian log-likelihoods involved in the prediction step of MagmaClust.
Examples
TRUE
Generate random hyper-parameters
Description
Generate a set of random hyper-parameters, specific to the chosen type of kernel, under the format that is used in Magma.
Usage
hp(kern = "SE", list_ID = NULL, list_hp = NULL, noise = FALSE, common_hp = FALSE)
Arguments
kern | A function, or a character string indicating the chosen type of kernel among:
In case of a custom kernel function, the argument |
list_ID | A vector, associating an |
list_hp | A vector of characters, providing the name of each hyper-parameter, in case where |
noise | A logical value, indicating whether a 'noise' hyper-parameter should be included. |
common_hp | A logical value, indicating whether the set of hyper-parameters is assumed to be common to all individuals. |
Value
A tibble, providing a set of random hyper-parameters associated with the kernel specified through the argument kern.
Examples
TRUE
Compute the hyper-posterior distribution in Magma
Description
Compute the parameters of the hyper-posterior Gaussian distribution of the mean process in Magma (similarly to the expectation step of the EM algorithm used for learning). This hyper-posterior distribution, evaluated on a grid of inputs provided through the grid_inputs argument, is a key component for making predictions in Magma, and is required in the function pred_magma.
Usage
hyperposterior(trained_model = NULL, data = NULL, hp_0 = NULL, hp_i = NULL, kern_0 = NULL, kern_i = NULL, prior_mean = NULL, grid_inputs = NULL, pen_diag = 1e-10)
Arguments
trained_model | A list, containing the information coming from a Magma model, previously trained using the |
data | A tibble or data frame. Required columns: 'Input', 'Output'. Additional columns for covariates can be specified. The 'Input' column should define the variable that is used as reference for the observations (e.g. time for longitudinal data). The 'Output' column specifies the observed values (the response variable). The data frame can also provide as many covariates as desired, with no constraints on the column names. These covariates are additional inputs (explanatory variables) of the models that are also observed at each reference 'Input'. Recovered from |
hp_0 | A named vector, tibble or data frame of hyper-parameters associated with |
hp_i | A tibble or data frame of hyper-parameters associated with |
kern_0 | A kernel function, associated with the mean GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
kern_i | A kernel function, associated with the individual GPs. ("SE", "PERIO" and "RQ" are also available here). Recovered from |
prior_mean | Hyper-prior mean parameter of the mean GP. This argument can be specified under various formats, such as:
|
grid_inputs | A vector or a data frame, indicating the grid of additional reference inputs on which the mean process' hyper-posterior should be evaluated. |
pen_diag | A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A list gathering the parameters of the mean processes' hyper-posterior distributions, namely:
mean: A tibble, the hyper-posterior mean parameter evaluated at each training Input.
cov: A matrix, the covariance parameter for the hyper-posterior distribution of the mean process.
pred: A tibble, the predicted mean and variance at Input for the mean process' hyper-posterior distribution under a format that allows the direct visualisation as a GP prediction.
Examples
TRUE
Compute the hyper-posterior distribution for each cluster in MagmaClust
Description
Recompute the E-step of the VEM algorithm in MagmaClust for a new set of reference Input. Once training is completed, it can be necessary to evaluate the hyper-posterior distributions of the mean processes at specific locations, for which we want to make predictions. This process is directly implemented in the pred_magmaclust function but the user might want to use hyperpost_clust for a tailored control of the prediction procedure.
Usage
hyperposterior_clust(trained_model = NULL, data = NULL, mixture = NULL, hp_k = NULL, hp_i = NULL, kern_k = NULL, kern_i = NULL, prior_mean_k = NULL, grid_inputs = NULL, pen_diag = 1e-10)
Arguments
trained_model | A list, containing the information coming from a Magma model, previously trained using the |
data | A tibble or data frame. Required columns: |
mixture | A tibble or data frame, indicating the mixture probabilities of each cluster for each individual. Required column: |
hp_k | A tibble or data frame of hyper-parameters associated with |
hp_i | A tibble or data frame of hyper-parameters associated with |
kern_k | A kernel function, associated with the mean GPs. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
kern_i | A kernel function, associated with the individual GPs. ("SE", "LIN", "PERIO" and "RQ" are also available here). Recovered from |
prior_mean_k | The set of hyper-prior mean parameters (m_k) for the K mean GPs, one value for each cluster. This argument can be specified under various formats, such as:
|
grid_inputs | A vector or a data frame, indicating the grid of additional reference inputs on which the mean process' hyper-posterior should be evaluated. |
pen_diag | A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A list containing the parameters of the mean processes' hyper-posterior distribution, namely:
mean: A list of tibbles containing, for each cluster, the hyper-posterior mean parameters evaluated at each Input.
cov: A list of matrices containing, for each cluster, the hyper-posterior covariance parameter of the mean process.
mixture: A tibble, indicating the mixture probabilities in each cluster for each individual.
Examples
TRUE
Run a k-means algorithm to initialise clusters' allocation
Description
Run a k-means algorithm to initialise clusters' allocation
Usage
ini_kmeans(data, k, nstart = 50, summary = FALSE)
Arguments
data | A tibble containing common Input and associated Output values to cluster. |
k | A number, the number of clusters assumed when running the kmeans algorithm. |
nstart | A number, indicating how many re-starts of kmeans are set. |
summary | A boolean, indicating whether we want an outcome summary. |
Value
A tibble containing the initial clustering obtained through kmeans.
Examples
TRUE
Mixture initialisation with kmeans
Description
Provide an initial kmeans allocation of the individuals/tasks in a dataset into a definite number of clusters, and return the associated mixture probabilities.
Usage
ini_mixture(data, k, name_clust = NULL, nstart = 50)
Arguments
data | A tibble or data frame. Required columns: |
k | A number, indicating the number of clusters. |
name_clust | A vector of characters. Each element should correspond to the name of one cluster. |
nstart | A number, the number of restarts used in the underlying kmeans algorithm. |
Value
A tibble indicating, for each ID, the cluster it belongs to after a kmeans initialisation.
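The idea of turning a kmeans allocation into initial mixture probabilities can be sketched with stats::kmeans(). The example below assumes that every task is observed on the same inputs and that the initial probabilities are one-hot (1 for the allocated cluster, 0 elsewhere); both are assumptions for illustration, and the data and column names K1 and K2 are hypothetical.

```r
# Hypothetical sketch, not the MagmaClustR implementation.
set.seed(42)
# Six tasks observed on 5 common inputs: three curves near 0, three near 10.
outputs <- rbind(
  matrix(rnorm(15, mean = 0, sd = 0.1), nrow = 3),
  matrix(rnorm(15, mean = 10, sd = 0.1), nrow = 3)
)
rownames(outputs) <- paste0("ID_", 1:6)

# Each row (task) is treated as one point to cluster.
fit <- kmeans(outputs, centers = 2, nstart = 50)

# One-hot mixture probabilities derived from the hard allocation.
mixture <- data.frame(
  ID = rownames(outputs),
  K1 = as.numeric(fit$cluster == 1),
  K2 = as.numeric(fit$cluster == 2)
)
```

Using several restarts (nstart) reduces the risk of a poor initial allocation caused by an unlucky choice of starting centres.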
Examples
TRUE
Create covariance matrix from a kernel
Description
kern_to_cov() creates a covariance matrix between input values (that could be either scalars or vectors) evaluated within a kernel function, which is characterised by specified hyper-parameters. This matrix is a finite-dimensional evaluation of the infinite-dimensional covariance structure of a GP, defined thanks to this kernel.
Usage
kern_to_cov(input, kern = "SE", hp, deriv = NULL, input_2 = NULL)
Arguments
input | A vector, matrix, data frame or tibble containing all inputs for one individual. If a vector, the elements are used as reference, otherwise, one column should be named 'Input' to indicate that it represents the reference (e.g. 'Input' would contain the timestamps in time-series applications). The other columns are considered as being covariates. If no column is named 'Input', the first one is used by default. |
kern | A kernel function. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
hp | A list, data frame or tibble containing the hyper-parameters used in the kernel. The name of the elements (or columns) should correspond exactly to those used in the kernel definition. If |
deriv | A character, indicating according to which hyper-parameter the derivative should be computed. If NULL (default), the function simply returns the covariance matrix. |
input_2 | (optional) A vector, matrix, data frame or tibble under the same format as |
Value
A covariance matrix, where elements are evaluations of the associated kernel for each pair of reference inputs.
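As an illustration of evaluating a kernel on every pair of inputs, the sketch below builds a covariance matrix from a textbook squared exponential kernel. The parameterisation (variance and lengthscale on the natural scale) and the function name se_cov are assumptions; the package's own "SE" kernel may be parameterised differently.

```r
# Hypothetical sketch with a textbook squared exponential kernel.
se_cov <- function(input, variance = 1, lengthscale = 1) {
  # Pairwise squared distances between all reference inputs.
  sq_dist <- outer(input, input, function(a, b) (a - b)^2)
  variance * exp(-0.5 * sq_dist / lengthscale^2)
}

cov_mat <- se_cov(c(0, 1, 2))
cov_mat
```

Entry (i, j) only depends on the distance between inputs i and j, so nearby inputs get a covariance close to the variance and distant inputs a covariance close to zero.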
Examples
TRUE
Create inverse of a covariance matrix from a kernel
Description
kern_to_inv() creates the inverse of a covariance matrix between input values (that could be either scalars or vectors) evaluated within a kernel function, which is characterised by specified hyper-parameters. This matrix is a finite-dimensional evaluation of the infinite-dimensional covariance structure of a GP, defined thanks to this kernel.
Usage
kern_to_inv(input, kern, hp, pen_diag = 1e-10, deriv = NULL)
Arguments
input | A vector, matrix, data frame or tibble containing all inputs for one individual. If a vector, the elements are used as reference, otherwise, one column should be named 'Input' to indicate that it represents the reference (e.g. 'Input' would contain the timestamps in time-series applications). The other columns are considered as being covariates. If no column is named 'Input', the first one is used by default. |
kern | A kernel function. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
hp | A list, data frame or tibble containing the hyper-parameters used in the kernel. The name of the elements (or columns) should correspond exactly to those used in the kernel definition. |
pen_diag | A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
deriv | A character, indicating according to which hyper-parameter the derivative should be computed. If NULL (default), the function simply returns the inverse covariance matrix. |
Value
The inverse of a covariance matrix, whose elements are evaluations of the associated kernel for each pair of reference inputs.
Examples
TRUE
Linear Kernel
Description
Linear Kernel
Usage
lin_kernel(x, y, hp, deriv = NULL, vectorized = FALSE)
Arguments
x | A vector (or matrix if vectorized = TRUE) of inputs. |
y | A vector (or matrix if vectorized = TRUE) of inputs. |
hp | A tibble, data frame or named vector, containing the kernel's hyperparameters. Required columns: 'lin_slope' and 'lin_offset'. |
deriv | A character, indicating according to which hyper-parameter the derivative should be computed. If NULL (default), the function simply returns the evaluation of the kernel. |
vectorized | A logical value, indicating whether the function provides a vectorized version for faster calculations. If TRUE, the |
Value
A scalar, corresponding to the evaluation of the kernel.
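For intuition, a generic linear kernel can be written as a scaled inner product plus an offset. The sketch below uses that textbook form with hypothetical arguments slope and offset; how the package maps 'lin_slope' and 'lin_offset' onto the kernel (for instance through a transformation of these values) is not stated here, so this should not be read as its exact formula.

```r
# Hypothetical textbook form, not the MagmaClustR implementation.
linear_kernel <- function(x, y, slope = 1, offset = 0) {
  # Scaled inner product of the two inputs, shifted by a constant offset.
  slope * sum(x * y) + offset
}

linear_kernel(c(1, 2), c(3, 4), slope = 2, offset = 1)   # 2 * 11 + 1 = 23
```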
Examples
TRUE
Compute a covariance matrix for multiple individuals
Description
Compute the covariance matrices associated with all individuals in the database, taking into account their specific inputs and hyper-parameters.
Usage
list_kern_to_cov(data, kern, hp, deriv = NULL)
Arguments
data | A tibble or data frame of input data. Required column: 'ID'. Suggested column: 'Input' (for indicating the reference input). |
kern | A kernel function. |
hp | A tibble or data frame, containing the hyper-parameters associated with each individual. |
deriv | A character, indicating according to which hyper-parameter the derivative should be computed. If NULL (default), the function simply returns the list of covariance matrices. |
Value
A named list containing all of the covariance matrices.
Examples
TRUE
Compute an inverse covariance matrix for multiple individuals
Description
Compute the inverse covariance matrices associated with all individuals in the database, taking into account their specific inputs and hyper-parameters.
Usage
list_kern_to_inv(db, kern, hp, pen_diag, deriv = NULL)
Arguments
db | A tibble or data frame of input data. Required column: 'ID'. Suggested column: 'Input' (for indicating the reference input). |
kern | A kernel function. |
hp | A tibble or data frame, containing the hyper-parameters associated with each individual. |
pen_diag | A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
deriv | A character, indicating according to which hyper-parameter the derivative should be computed. If NULL (default), the function simply returns the list of inverse covariance matrices. |
Value
A named list containing all of the inverse covariance matrices.
Examples
TRUELog-Likelihood function of a Gaussian Process
Description
Log-Likelihood function of a Gaussian Process
Usage
logL_GP(hp, db, mean, kern, post_cov, pen_diag)
Arguments
hp | A tibble, data frame or named vector containing hyper-parameters. |
db | A tibble containing the values we want to compute the logL on. Required columns: Input, Output. Additional covariate columns are allowed. |
mean | A vector, specifying the mean of the GP at the reference inputs. |
kern | A kernel function. |
post_cov | (optional) A matrix, corresponding to the covariance parameter of the hyper-posterior. Used to compute the hyper-prior distribution of a new individual in Magma. |
pen_diag | A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
A number, corresponding to the value of the Gaussian log-Likelihood (where the covariance can be the sum of the individual and the hyper-posterior's mean process covariances).
Examples
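The quantity described above is a multivariate Gaussian log-density. A minimal base R sketch, assuming the covariance matrix has already been built from the kernel (the helper name `gauss_loglik` is hypothetical and is not part of the package):

```r
# Gaussian log-likelihood of y under N(mean, cov), computed via Cholesky.
# pen_diag is a jitter added on the diagonal before factorising.
gauss_loglik <- function(y, mean, cov, pen_diag = 1e-10) {
  n <- length(y)
  R <- chol(cov + diag(pen_diag, n))               # t(R) %*% R = cov + jitter
  z <- backsolve(R, y - mean, transpose = TRUE)    # solves t(R) z = y - mean
  -0.5 * (n * log(2 * pi) + 2 * sum(log(diag(R))) + sum(z^2))
}

# Toy usage: individual covariance plus a hyper-posterior covariance term.
y <- c(0.3, -1.2, 0.8)
m <- c(0, 0, 0.5)
K_indiv <- diag(c(1, 4, 0.25))
post_cov <- diag(0.1, 3)
ll <- gauss_loglik(y, m, K_indiv + post_cov)
```

Working with the Cholesky factor avoids forming the inverse explicitly and gives the log-determinant directly from its diagonal.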
TRUE
Modified log-Likelihood function for GPs
Description
Log-Likelihood function involved in Magma during the maximisation step of the training. The log-Likelihood is defined as a simple Gaussian log-likelihood plus a correction trace term.
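In symbols, this objective can be sketched as follows. The notation is an assumption introduced here, not taken from this manual: y stands for the 'Output' values in db, m for the mean argument, K_θ for the covariance built from kern and hp (with pen_diag on the diagonal), and K̂ for post_cov. The sign and the factor 1/2 of the trace term follow the usual EM derivation and should be checked against the Magma article.

```latex
\[
\mathcal{L}_{\mathrm{mod}}(\theta)
  = \log \mathcal{N}\!\left(y;\, m,\, K_\theta\right)
  - \frac{1}{2}\,\operatorname{tr}\!\left(\widehat{K}\, K_\theta^{-1}\right)
\]
```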
Usage
logL_GP_mod(hp, db, mean, kern, post_cov, pen_diag)
Arguments
hp | A tibble, data frame or named vector of hyper-parameters. |
db | A tibble containing values we want to compute logL on. Required columns: Input, Output. Additional covariate columns are allowed. |
mean | A vector, specifying the mean of the GP at the reference inputs. |
kern | A kernel function. |
post_cov | A matrix, covariance parameter of the hyper-posterior. Used to compute the correction term. |
pen_diag | A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
A number, corresponding to the value of the modified Gaussian log-Likelihood defined in Magma.
Examples
TRUE
Modified log-Likelihood function with common HPs for GPs
Description
Log-Likelihood function involved in Magma during the maximisation step of the training, in the particular case where the hyper-parameters are shared by all individuals. The log-Likelihood is defined as a sum over all individuals of Gaussian log-likelihoods, each with a correction trace term.
Usage
logL_GP_mod_common_hp(hp, db, mean, kern, post_cov, pen_diag)
Arguments
hp | A tibble or data frame of hyper-parameters. |
db | A tibble containing the values we want to compute the logL on. Required columns: ID, Input, Output. Additional covariate columns are allowed. |
mean | A vector, specifying the mean of the GP at the reference inputs. |
kern | A kernel function. |
post_cov | A matrix, covariance parameter of the hyper-posterior. Used to compute the correction term. |
pen_diag | A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
A number, corresponding to the value of the modified Gaussian log-Likelihood with common hyper-parameters defined in Magma.
Examples
TRUE
Log-Likelihood for monitoring the EM algorithm in Magma
Description
Log-Likelihood for monitoring the EM algorithm in Magma
Usage
logL_monitoring(hp_0, hp_i, db, m_0, kern_0, kern_i, post_mean, post_cov, pen_diag)
Arguments
hp_0 | A named vector, tibble or data frame, containing the hyper-parameters associated with the mean GP. |
hp_i | A tibble or data frame, containing the hyper-parameters associated with the individual GPs. |
db | A tibble or data frame. Columns required: ID, Input, Output. Additional columns for covariates can be specified. |
m_0 | A vector, corresponding to the prior mean of the mean GP. |
kern_0 | A kernel function, associated with the mean GP. |
kern_i | A kernel function, associated with the individual GPs. |
post_mean | A tibble, coming out of the E step, containing the Input and associated Output of the hyper-posterior mean parameter. |
post_cov | A matrix, coming out of the E step, being the hyper-posterior covariance parameter. |
pen_diag | A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
A number, the expectation of the joint log-likelihood of the model. This quantity is supposed to increase at each step of the EM algorithm, and is thus used for monitoring the procedure.
Examples
TRUE
M-Step of the EM algorithm
Description
Maximisation step of the EM algorithm to compute hyper-parameters of all the kernels involved in Magma.
Usage
m_step(db, m_0, kern_0, kern_i, old_hp_0, old_hp_i, post_mean, post_cov, common_hp, pen_diag)
Arguments
db | A tibble or data frame. Columns required: ID, Input, Output. Additional columns for covariates can be specified. |
m_0 | A vector, corresponding to the prior mean of the mean GP. |
kern_0 | A kernel function, associated with the mean GP. |
kern_i | A kernel function, associated with the individual GPs. |
old_hp_0 | A named vector, tibble or data frame, containing the hyper-parameters from the previous M-step (or initialisation) associated with the mean GP. |
old_hp_i | A tibble or data frame, containing the hyper-parameters from the previous M-step (or initialisation) associated with the individual GPs. |
post_mean | A tibble, coming out of the E step, containing the Input and associated Output of the hyper-posterior mean parameter. |
post_cov | A matrix, coming out of the E step, being the hyper-posterior covariance parameter. |
common_hp | A logical value, indicating whether the set of hyper-parameters is assumed to be common to all individuals. |
pen_diag | A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A named list, containing the elements hp_0, a tibble containing the hyper-parameters associated with the mean GP, and hp_i, a tibble containing the hyper-parameters associated with the individual GPs.
Examples
TRUE
Periodic Kernel
Description
Periodic Kernel
Usage
perio_kernel(x, y, hp, deriv = NULL, vectorized = FALSE)
Arguments
x | A vector (or matrix if vectorized = TRUE) of inputs. |
y | A vector (or matrix if vectorized = TRUE) of inputs. |
hp | A tibble, data frame or named vector, containing the kernel's hyperparameters. Required columns: 'perio_variance', 'perio_lengthscale', and 'period'. |
deriv | A character, indicating according to which hyper-parameter the derivative should be computed. If NULL (default), the function simply returns the evaluation of the kernel. |
vectorized | A logical value, indicating whether the function provides a vectorized version for faster calculations. If TRUE, the |
Value
A scalar, corresponding to the evaluation of the kernel.
Examples
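For orientation, the textbook form of a periodic kernel can be sketched in base R as below. This is an assumption-laden illustration: the helper name `perio_sketch` and its raw-scale arguments are hypothetical, and the package's own `perio_kernel` may parameterise its hyper-parameters differently (for instance on a log scale).

```r
# Textbook periodic kernel:
#   k(x, y) = variance * exp(-2 * sin(pi * |x - y| / period)^2 / lengthscale^2)
# Returns the full matrix of pairwise evaluations between x and y.
perio_sketch <- function(x, y, variance, lengthscale, period) {
  d <- abs(outer(x, y, "-"))
  variance * exp(-2 * sin(pi * d / period)^2 / lengthscale^2)
}

# Inputs one period apart are as correlated as identical inputs.
k_same <- perio_sketch(0, 0, variance = 2, lengthscale = 1, period = 3)
k_shift <- perio_sketch(0, 3, variance = 2, lengthscale = 1, period = 3)
```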
TRUE
Plot smoothed curves of raw data
Description
Display raw data under the Magma format as smoothed curves.
Usage
plot_db(data, cluster = FALSE, legend = FALSE)
Arguments
data | A data frame or tibble with format: ID, Input, Output. |
cluster | A boolean indicating whether data should be coloured by cluster. Requires a column named 'Cluster'. |
legend | A boolean indicating whether the legend should be displayed. |
Value
Graph of smoothed curves of raw data.
Examples
TRUE
Create a GIF of Magma or GP predictions
Description
Create a GIF animation displaying how Magma or classic GP predictions evolve and improve when the number of data points increases.
Usage
plot_gif(pred_gp, x_input = NULL, data = NULL, data_train = NULL, prior_mean = NULL, y_grid = NULL, heatmap = FALSE, prob_CI = 0.95, size_data = 3, size_data_train = 1, alpha_data_train = 0.5, export_gif = FALSE, path = "gif_gp.gif", ...)
Arguments
pred_gp | A tibble, typically coming from the |
x_input | A vector of character strings, indicating which input should be displayed. If NULL (default), the 'Input' column is used for the x-axis. If providing a 2-dimensional vector, the corresponding columns are used for the x-axis and y-axis. |
data | (Optional) A tibble or data frame. Required columns: 'Input', 'Output'. Additional columns for covariates can be specified. The 'Input' column should define the variable that is used as reference for the observations (e.g. time for longitudinal data). The 'Output' column specifies the observed values (the response variable). The data frame can also provide as many covariates as desired, with no constraints on the column names. These covariates are additional inputs (explanatory variables) of the models that are also observed at each reference 'Input'. |
data_train | (Optional) A tibble or data frame, containing the training data of the Magma model. The data set should have the same format as the |
prior_mean | (Optional) A tibble or a data frame, containing the 'Input' and associated 'Output' prior mean parameter of the GP prediction. |
y_grid | A vector, indicating the grid of values on the y-axis for which probabilities should be computed for heatmaps of 1-dimensional predictions. If NULL (default), a vector of length 50 is defined, ranging between the min and max 'Output' values contained in |
heatmap | A logical value indicating whether the GP prediction should be represented as a heatmap of probabilities for 1-dimensional inputs. If FALSE (default), the mean curve and associated 95% CI are displayed. |
prob_CI | A number between 0 and 1 (default is 0.95), indicating the level of the Credible Interval associated with the posterior mean curve. |
size_data | A number, controlling the size of the |
size_data_train | A number, controlling the size of the |
alpha_data_train | A number, between 0 and 1, controlling transparency of the |
export_gif | A logical value indicating whether the animation should be exported as a .gif file. |
path | A character string defining the path where the GIF file should be exported. |
... | Any additional parameters that can be passed to the function |
Value
Visualisation of a Magma or GP prediction (optional: display data points, training data points and the prior mean function), where data points are added sequentially for visualising changes in prediction as information increases.
Examples
TRUE
Plot Magma or GP predictions
Description
Display Magma or classic GP predictions. According to the dimension of the inputs, the graph may be a mean curve + Credible Interval or a heatmap of probabilities.
Usage
plot_gp(pred_gp, x_input = NULL, data = NULL, data_train = NULL, prior_mean = NULL, y_grid = NULL, heatmap = FALSE, samples = FALSE, nb_samples = 50, plot_mean = TRUE, alpha_samples = 0.3, prob_CI = 0.95, size_data = 3, size_data_train = 1, alpha_data_train = 0.5)
plot_magma(pred_gp, x_input = NULL, data = NULL, data_train = NULL, prior_mean = NULL, y_grid = NULL, heatmap = FALSE, samples = FALSE, nb_samples = 50, plot_mean = TRUE, alpha_samples = 0.3, prob_CI = 0.95, size_data = 3, size_data_train = 1, alpha_data_train = 0.5)
Arguments
pred_gp | A tibble or data frame, typically coming from |
x_input | A vector of character strings, indicating which input should be displayed. If NULL (default), the 'Input' column is used for the x-axis. If providing a 2-dimensional vector, the corresponding columns are used for the x-axis and y-axis. |
data | (Optional) A tibble or data frame. Required columns: 'Input', 'Output'. Additional columns for covariates can be specified. This argument corresponds to the raw data on which the prediction has been performed. |
data_train | (Optional) A tibble or data frame, containing the training data of the Magma model. The data set should have the same format as the |
prior_mean | (Optional) A tibble or a data frame, containing the 'Input' and associated 'Output' prior mean parameter of the GP prediction. |
y_grid | A vector, indicating the grid of values on the y-axis for which probabilities should be computed for heatmaps of 1-dimensional predictions. If NULL (default), a vector of length 50 is defined, ranging between the min and max 'Output' values contained in |
heatmap | A logical value indicating whether the GP prediction should be represented as a heatmap of probabilities for 1-dimensional inputs. If FALSE (default), the mean curve and associated Credible Interval are displayed. |
samples | A logical value indicating whether the GP prediction should be represented as a collection of samples drawn from the posterior. If FALSE (default), the mean curve and associated Credible Interval are displayed. |
nb_samples | A number, indicating the number of samples to be drawn from the predictive posterior distribution. For two-dimensional graphs, only one sample can be displayed. |
plot_mean | A logical value, indicating whether the mean prediction should be displayed on the graph when |
alpha_samples | A number, controlling transparency of the sample curves. |
prob_CI | A number between 0 and 1 (default is 0.95), indicating the level of the Credible Interval associated with the posterior mean curve. If this argument is set to 1, the Credible Interval is not displayed. |
size_data | A number, controlling the size of the |
size_data_train | A number, controlling the size of the |
alpha_data_train | A number, between 0 and 1, controlling transparency of the |
Value
Visualisation of a Magma or GP prediction (optional: display data points, training data points and the prior mean function). For 1-D inputs, the prediction is represented as a mean curve and its associated 95% Credible Interval, as a collection of samples drawn from the posterior if samples = TRUE, or as a heatmap of probabilities if heatmap = TRUE. For 2-D inputs, the prediction is represented as a heatmap, where each pair of inputs on the x-axis and y-axis is associated with a gradient of colours for the posterior mean values, whereas the uncertainty is indicated by the transparency (the narrower the Credible Interval, the more opaque the associated colour, and vice versa).
Examples
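To make the role of prob_CI concrete, the sketch below derives interval bounds from the 'Mean' and 'Var' columns of a prediction table, assuming a Gaussian predictive distribution at each input. The toy `pred` values and the column names `CI_inf` and `CI_sup` are assumptions for this example and are not claimed to match what the plotting functions compute internally.

```r
# Toy prediction table with the documented 'Input', 'Mean' and 'Var' columns.
pred <- data.frame(
  Input = c(0, 1, 2),
  Mean = c(0.5, 1.0, 0.2),
  Var = c(0.04, 0.25, 1)
)
prob_CI <- 0.95

# Two-sided Gaussian quantile for the requested credibility level.
z <- stats::qnorm((1 + prob_CI) / 2)

# Lower and upper bounds of the pointwise Credible Interval.
pred$CI_inf <- pred$Mean - z * sqrt(pred$Var)
pred$CI_sup <- pred$Mean + z * sqrt(pred$Var)
```

Note that `stats::qnorm(1)` is infinite, so a level of exactly 1 does not yield a finite interval under this construction.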
TRUE
Plot MagmaClust predictions
Description
Display MagmaClust predictions. According to the dimension of the inputs, the graph may be a mean curve (dim inputs = 1) or a heatmap (dim inputs = 2) of probabilities. Moreover, MagmaClust can provide credible intervals only by visualising cluster-specific predictions (e.g. for the most probable cluster). When visualising the full mixture-of-GPs prediction, which can be multimodal, the user should choose between the simple mean function or the full heatmap of probabilities (more informative but slower).
Usage
plot_magmaclust(pred_clust, cluster = "all", x_input = NULL, data = NULL, data_train = NULL, col_clust = FALSE, prior_mean = NULL, y_grid = NULL, heatmap = FALSE, samples = FALSE, nb_samples = 50, plot_mean = TRUE, alpha_samples = 0.3, prob_CI = 0.95, size_data = 3, size_data_train = 1, alpha_data_train = 0.5)
Arguments
pred_clust | A list of predictions, typically coming from |
cluster | A character string, indicating which cluster to plot from. If 'all' (default), the mixture of GPs prediction is displayed as a mean curve (1-D inputs) or a mean heatmap (2-D inputs). Alternatively, if the name of one cluster is provided, the classic mean curve + credible interval is displayed (1-D inputs), or a heatmap with colour gradient for the mean and transparency gradient for the Credible Interval (2-D inputs). |
x_input | A vector of character strings, indicating which input should be displayed. If NULL (default), the 'Input' column is used for the x-axis. If providing a 2-dimensional vector, the corresponding columns are used for the x-axis and y-axis. |
data | (Optional) A tibble or data frame. Required columns: |
data_train | (Optional) A tibble or data frame, containing the training data of the MagmaClust model. The data set should have the same format as the |
col_clust | A boolean indicating whether backward points are coloured according to the individuals or to their most probable cluster. If one wants to colour by clusters, a column |
prior_mean | (Optional) A list providing, for each cluster, a tibble containing prior mean parameters of the prediction. This argument typically comes as an outcome |
y_grid | A vector, indicating the grid of values on the y-axis for which probabilities should be computed for heatmaps of 1-dimensional predictions. If NULL (default), a vector of length 50 is defined, ranging between the min and max 'Output' values contained in |
heatmap | A logical value indicating whether the GP mixture should be represented as a heatmap of probabilities for 1-dimensional inputs. If FALSE (default), the mean curve (and associated Credible Interval if available) are displayed. |
samples | A logical value indicating whether the GP mixture should be represented as a collection of samples drawn from the posterior. If FALSE (default), the mean curve (and associated Credible Interval if available) are displayed. |
nb_samples | A number, indicating the number of samples to be drawn from the predictive posterior distribution. For two-dimensional graphs, only one sample can be displayed. |
plot_mean | A logical value, indicating whether the mean prediction should be displayed on the graph when |
alpha_samples | A number, controlling transparency of the sample curves. |
prob_CI | A number between 0 and 1 (default is 0.95), indicating the level of the Credible Interval associated with the posterior mean curve. If this argument is set to 1, the Credible Interval is not displayed. |
size_data | A number, controlling the size of the |
size_data_train | A number, controlling the size of the |
alpha_data_train | A number, between 0 and 1, controlling transparency of the |
Value
Visualisation of a MagmaClust prediction (optional: display data points, training data points and the prior mean functions). For 1-D inputs, the prediction is represented as a mean curve (and its associated 95% Credible Interval for cluster-specific predictions), or as a heatmap of probabilities if heatmap = TRUE. In the case of MagmaClust, the heatmap representation should be preferred for clarity, although the default display remains the mean curve for quicker execution. For 2-D inputs, the prediction is represented as a heatmap, where each pair of inputs on the x-axis and y-axis is associated with a gradient of colours for the posterior mean values, whereas the uncertainty is indicated by the transparency (the narrower the Credible Interval, the more opaque the associated colour, and vice versa). As for 1-D inputs, Credible Interval information is only available for cluster-specific predictions.
Examples
TRUE
Display realisations from a (mixture of) GP prediction
Description
Display samples drawn from the posterior of a GP, Magma or MagmaClust prediction. According to the dimension of the inputs, the graph may represent curves or a heatmap.
Usage
plot_samples(pred = NULL, samples = NULL, nb_samples = 50, x_input = NULL, plot_mean = TRUE, alpha_samples = 0.3)
Arguments
pred | A list, typically coming from |
samples | A tibble or data frame, containing the samples generated from a GP, Magma, or MagmaClust prediction. Required columns: |
nb_samples | A number, indicating the number of samples to be drawn from the predictive posterior distribution. For two-dimensional graphs, only one sample can be displayed. |
x_input | A vector of character strings, indicating which 'column' should be displayed in the case of multidimensional inputs. If NULL (default), the 'Input' column is used for the x-axis. If providing a 2-dimensional vector, the corresponding columns are used for the x-axis and the y-axis. |
plot_mean | A logical value, indicating whether the mean prediction should be displayed on the graph. |
alpha_samples | A number, controlling transparency of the sample curves. |
Value
Graph of samples drawn from a posterior distribution of a GP, Magma, or MagmaClust prediction.
Examples
TRUE
Magma prediction for plotting GIFs
Description
Generate a Magma or classic GP prediction under a format that is compatible with a further GIF visualisation of the results. For a Magma prediction, either the trained_model or hyperpost argument is required. Otherwise, a classic GP prediction is applied and the prior mean can be specified through the mean argument.
Usage
pred_gif(data, trained_model = NULL, grid_inputs = NULL, hyperpost = NULL, mean = NULL, hp = NULL, kern = "SE", pen_diag = 1e-10)
Arguments
data | A tibble or data frame. Required columns: 'Input', 'Output'. Additional columns for covariates can be specified. The 'Input' column should define the variable that is used as reference for the observations (e.g. time for longitudinal data). The 'Output' column specifies the observed values (the response variable). The data frame can also provide as many covariates as desired, with no constraints on the column names. These covariates are additional inputs (explanatory variables) of the models that are also observed at each reference 'Input'. |
trained_model | A list, containing the information coming from a Magma model, previously trained using the |
grid_inputs | The grid of inputs (reference Input and covariates) values on which the GP should be evaluated. Ideally, this argument should be a tibble or a data frame, providing the same columns as |
hyperpost | A list, containing the elements 'mean' and 'cov', the parameters of the hyper-posterior distribution of the mean process. Typically, this argument should come from a previous learning using |
mean | Mean parameter of the GP. This argument can be specified under various formats, such as:
|
hp | A named vector, tibble or data frame of hyper-parameters associated with |
kern | A kernel function, defining the covariance structure of the GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
pen_diag | A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A tibble, representing Magma or GP predictions as two columns 'Mean' and 'Var', evaluated on the grid_inputs. The column 'Input' and additional covariates columns are associated with each predicted value. An additional 'Index' column is created for the sake of GIF creation using the function plot_gif.
Examples
TRUE
Gaussian Process prediction
Description
Compute the posterior distribution of a standard GP, using the formalism of Magma. By providing observed data, the prior mean and covariance matrix (by defining a kernel and its associated hyper-parameters), the mean and covariance parameters of the posterior distribution are computed on the grid of inputs that has been specified. This predictive distribution can be evaluated on any arbitrary inputs since a GP is an infinite-dimensional object.
Usage
pred_gp(data = NULL, grid_inputs = NULL, mean = NULL, hp = NULL, kern = "SE", get_full_cov = FALSE, plot = TRUE, pen_diag = 1e-10)
Arguments
data | A tibble or data frame. Required columns: 'Input', 'Output'. Additional columns for covariates can be specified. The 'Input' column should define the variable that is used as reference for the observations (e.g. time for longitudinal data). The 'Output' column specifies the observed values (the response variable). The data frame can also provide as many covariates as desired, with no constraints on the column names. These covariates are additional inputs (explanatory variables) of the models that are also observed at each reference 'Input'. If NULL, the prior GP is returned. |
grid_inputs | The grid of inputs (reference Input and covariates) values on which the GP should be evaluated. Ideally, this argument should be a tibble or a data frame, providing the same columns as |
mean | Mean parameter of the GP. This argument can be specified under various formats, such as:
|
hp | A named vector, tibble or data frame of hyper-parameters associated with |
kern | A kernel function, defining the covariance structure of the GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
get_full_cov | A logical value, indicating whether the full posterior covariance matrix should be returned. |
plot | A logical value, indicating whether a plot of the results is automatically displayed. |
pen_diag | A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A tibble, representing the GP predictions as two columns 'Mean' and 'Var', evaluated on the grid_inputs. The column 'Input' and additional covariates columns are associated with each predicted value. If the get_full_cov argument is TRUE, the function returns a list, in which the tibble described above is defined as 'pred' and the full posterior covariance matrix is defined as 'cov'.
Examples
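The computation described in this entry is the standard GP conditioning step. The base R sketch below is a simplified stand-in, not the package's implementation: the helpers `se_kern` and `gp_posterior`, the fixed unit hyper-parameters, and the `noise` argument are all assumptions made for illustration. It returns the posterior of the latent function, without adding noise to the predictive variance.

```r
# Illustrative squared-exponential kernel (hypothetical helper).
se_kern <- function(x, y, variance = 1, lengthscale = 1) {
  variance * exp(-0.5 * outer(x, y, "-")^2 / lengthscale^2)
}

# Posterior mean and pointwise variance of a GP on a grid of inputs.
gp_posterior <- function(data, grid, prior_mean = 0, noise = 0.01,
                         pen_diag = 1e-10) {
  x <- data$Input
  y <- data$Output
  # Covariance of observations: kernel + noise + jitter on the diagonal.
  K <- se_kern(x, x) + diag(noise + pen_diag, length(x))
  K_inv <- chol2inv(chol(K))
  # Cross-covariance between grid points and observed inputs.
  K_star <- se_kern(grid, x)
  post_mean <- prior_mean + as.vector(K_star %*% K_inv %*% (y - prior_mean))
  post_cov <- se_kern(grid, grid) - K_star %*% K_inv %*% t(K_star)
  data.frame(Input = grid, Mean = post_mean, Var = pmax(diag(post_cov), 0))
}

# Toy usage on three observations and a regular grid.
data <- data.frame(Input = c(0, 1, 2), Output = c(1, 0, -1))
pred <- gp_posterior(data, grid = seq(0, 2, by = 0.5))
```

The output mirrors the documented 'Input', 'Mean' and 'Var' layout so that it can be compared against an actual prediction by eye.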
TRUE
Magma prediction
Description
Compute the posterior predictive distribution in Magma. Providing data of any new individual/task, its trained hyper-parameters and a previously trained Magma model, the predictive distribution is evaluated on any arbitrary inputs that are specified through the 'grid_inputs' argument.
Usage
pred_magma(data = NULL, trained_model = NULL, grid_inputs = NULL, hp = NULL, kern = "SE", hyperpost = NULL, get_hyperpost = FALSE, get_full_cov = FALSE, plot = TRUE, pen_diag = 1e-10)
Arguments
data | A tibble or data frame. Required columns: 'Input', 'Output'. Additional columns for covariates can be specified. The 'Input' column should define the variable that is used as reference for the observations (e.g. time for longitudinal data). The 'Output' column specifies the observed values (the response variable). The data frame can also provide as many covariates as desired, with no constraints on the column names. These covariates are additional inputs (explanatory variables) of the models that are also observed at each reference 'Input'. If NULL, the mean process from |
trained_model | A list, containing the information coming from a Magma model, previously trained using the |
grid_inputs | The grid of inputs (reference Input and covariates) values on which the GP should be evaluated. Ideally, this argument should be a tibble or a data frame, providing the same columns as |
hp | A named vector, tibble or data frame of hyper-parameters associated with |
kern | A kernel function, defining the covariance structure of the GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
hyperpost | A list, containing the elements 'mean' and 'cov', the parameters of the hyper-posterior distribution of the mean process. Typically, this argument should come from a previous learning using |
get_hyperpost | A logical value, indicating whether the hyper-posterior distribution of the mean process should be returned. This can be useful when planning to perform several predictions on the same grid of inputs, since recomputation of the hyper-posterior can be prohibitive for high dimensional grids. |
get_full_cov | A logical value, indicating whether the full posterior covariance matrix should be returned. |
plot | A logical value, indicating whether a plot of the results is automatically displayed. |
pen_diag | A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A tibble, representing Magma predictions as two columns 'Mean' and 'Var', evaluated on the grid_inputs. The column 'Input' and additional covariates columns are associated with each predicted value. If the get_full_cov or get_hyperpost arguments are TRUE, the function returns a list, in which the tibble described above is defined as 'pred_gp', the full posterior covariance matrix is defined as 'cov', and the hyper-posterior distribution of the mean process is defined as 'hyperpost'.
Examples
TRUE
MagmaClust prediction
Description
Compute the posterior predictive distribution in MagmaClust. Providing data from any new individual/task, its trained hyper-parameters and a previously trained MagmaClust model, the multi-task posterior distribution is evaluated on any arbitrary inputs that are specified through the 'grid_inputs' argument. Due to the nature of the model, the prediction is defined as a mixture of Gaussian distributions. Therefore the present function computes the parameters of the predictive distribution associated with each cluster, as well as the posterior mixture probabilities for this new individual/task.
Usage
pred_magmaclust(data = NULL, trained_model = NULL, grid_inputs = NULL, mixture = NULL, hp = NULL, kern = "SE", hyperpost = NULL, prop_mixture = NULL, get_hyperpost = FALSE, get_full_cov = TRUE, plot = TRUE, pen_diag = 1e-10)
Arguments
data | A tibble or data frame. Required columns: |
trained_model | A list, containing the information coming from a MagmaClust model, previously trained using the |
grid_inputs | The grid of inputs (reference Input and covariates) values on which the GP should be evaluated. Ideally, this argument should be a tibble or a data frame, providing the same columns as |
mixture | A tibble or data frame, indicating the mixture probabilities of each cluster for the new individual/task. If NULL, the |
hp | A named vector, tibble or data frame of hyper-parameters associated with |
kern | A kernel function, defining the covariance structure of the GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
hyperpost | A list, containing the elements |
prop_mixture | A tibble or a named vector of the mixture proportions. Each name of column or element should refer to a cluster. The value associated with each cluster is a number between 0 and 1. If both |
get_hyperpost | A logical value, indicating whether the hyper-posterior distributions of the mean processes should be returned. This can be useful when planning to perform several predictions on the same grid of inputs, since recomputation of the hyper-posterior can be prohibitive for high dimensional grids. |
get_full_cov | A logical value, indicating whether the full posterior covariance matrices should be returned. |
plot | A logical value, indicating whether a plot of the results is automatically displayed. |
pen_diag | A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A list of GP prediction results composed of:
- pred: A sub-list containing, for each cluster:
  - pred_gp: A tibble, representing the GP predictions as two columns Mean and Var, evaluated on the grid_inputs. The column Input and additional covariates columns are associated with each predicted value.
  - proba: A number, the posterior probability associated with this cluster.
  - cov (if get_full_cov = TRUE): A matrix, the full posterior covariance matrix associated with this cluster.
- mixture: A tibble, indicating the mixture probabilities of each cluster for the predicted individual/task.
- hyperpost (if get_hyperpost = TRUE): A list, containing the hyper-posterior distributions information useful for visualisation purposes.
Examples
TRUE
Indicates the most probable cluster
Description
Indicates the most probable cluster
Usage
proba_max_cluster(mixture)
Arguments
mixture | A tibble or data frame containing mixture probabilities. |
Value
A tibble, retaining only the most probable cluster. The column Cluster indicates the cluster's name whereas Proba refers to its associated probability. If ID is initially a column of mixture (optional), the function returns the most probable cluster for all the different ID values.
Examples
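The selection described in the Value section can be sketched in base R as follows. The wide layout of `mixture` (one probability column per cluster, here named K1 to K3, plus an optional ID column) is an assumption made for this example, and the code is not the package's implementation.

```r
# Toy mixture table: one row per individual, one column per cluster
# (hypothetical cluster names K1, K2, K3).
mixture <- data.frame(
  ID = c("1", "2"),
  K1 = c(0.7, 0.2),
  K2 = c(0.1, 0.5),
  K3 = c(0.2, 0.3),
  stringsAsFactors = FALSE
)

# Columns holding cluster probabilities (everything except ID).
clust_cols <- setdiff(names(mixture), "ID")
probs <- as.matrix(mixture[clust_cols])

# Index of the most probable cluster on each row.
best <- max.col(probs, ties.method = "first")

# Keep only the winning cluster and its probability for each ID.
most_probable <- data.frame(
  ID = mixture$ID,
  Cluster = clust_cols[best],
  Proba = probs[cbind(seq_len(nrow(probs)), best)],
  stringsAsFactors = FALSE
)
```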
TRUE
Regularise a grid of inputs in a dataset
Description
Modify the original grid of inputs to make it more 'regular' (in the sense that the interval between each observation is constant, or corresponds to a specific pattern defined by the user). In particular, this function can also be used to summarise several data points into one, at a specific location. In this case, the output values are averaged according to the 'summarise_fct' argument.
Usage
regularize_data(data, size_grid = 30, grid_inputs = NULL, summarise_fct = base::mean)
regularise_data(data, size_grid = 30, grid_inputs = NULL, summarise_fct = base::mean)
Arguments
data | A tibble or data frame. Required columns: |
size_grid | An integer, which indicates the number of equispaced points each column must contain. Each original input value will be collapsed to the closest point of the new regular grid, and the associated outputs are averaged using the 'summarise_fct' function. This argument is used when 'grid_inputs' is left to 'NULL'. Default value is 30. |
grid_inputs | A data frame, corresponding to a pre-defined grid of inputs according to which we want to regularise a dataset. Column names must be similar to those appearing in |
summarise_fct | A character string or a function. If several similar inputs are associated with different outputs, the user can choose the summarising function for the output among the following: min, max, mean, median. A custom function can be defined if necessary. Default is "mean". |
Value
A data frame, where input columns have been regularised as desired.
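The collapsing behaviour described for 'size_grid' can be illustrated with a one-dimensional base R sketch. The helper snap_to_grid is hypothetical and only mirrors the description (snap each input to the closest point of an equispaced grid, then summarise the outputs sharing a grid point); the package's own implementation may differ.

```r
# Hypothetical 1D sketch (not the package's code): snap each input to the
# nearest point of an equispaced grid, then summarise outputs per grid point.
snap_to_grid = function(data, size_grid = 30, summarise_fct = mean) {
  grid = seq(min(data$Input), max(data$Input), length.out = size_grid)
  nearest = vapply(
    data$Input,
    function(x) grid[which.min(abs(grid - x))],
    numeric(1)
  )
  snapped = data.frame(ID = data$ID, Input = nearest, Output = data$Output)
  # One row per (ID, grid point), outputs summarised with 'summarise_fct'.
  aggregate(Output ~ ID + Input, data = snapped, FUN = summarise_fct)
}

data = data.frame(ID = 1, Input = 0:100, Output = -50:50)
reg = snap_to_grid(data, size_grid = 10)
```

With 101 original inputs and a grid of 10 points, reg contains 10 rows, one per grid point.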
Examples
data = tibble::tibble(ID = 1, Input = 0:100, Output = -50:50)

## Define a 1D input grid of 10 points
regularize_data(data, size_grid = 10)

## Define a 1D custom grid
my_grid = tibble::tibble(Input = c(5, 10, 25, 50, 100))
regularize_data(data, grid_inputs = my_grid)

## Define a 2D input grid of 5x5 points
data_2D = cbind(ID = 1, expand.grid(Input=1:10, Input2=1:10), Output = 1:100)
regularize_data(data_2D, size_grid = 5)

## Define a 2D custom input grid
my_grid_2D = MagmaClustR::expand_grid_inputs(c(2, 4, 8), 'Input2' = c(3, 5))
regularize_data(data_2D, grid_inputs = my_grid_2D)
Rational Quadratic Kernel
Description
Rational Quadratic Kernel
Usage
rq_kernel(x, y, hp, deriv = NULL, vectorized = FALSE)
Arguments
x | A vector (or matrix if vectorized = T) of inputs. |
y | A vector (or matrix if vectorized = T) of inputs. |
hp | A tibble, data frame or named vector, containing the kernel's hyperparameters. Required columns: 'rq_variance', 'rq_lengthscale', and 'rq_scale'. |
deriv | A character, indicating according to which hyper-parameter the derivative should be computed. If NULL (default), the function simply returns the evaluation of the kernel. |
vectorized | A logical value, indicating whether the function provides a vectorized version for faster calculations. If TRUE, the |
Value
A scalar, corresponding to the evaluation of the kernel.
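For reference, the textbook form of the rational quadratic kernel can be written in a few lines of base R. This is a sketch under assumptions: the function name rq_kernel_sketch and its plain arguments are hypothetical, and the package may parameterise 'rq_variance', 'rq_lengthscale' and 'rq_scale' differently (for instance on a transformed scale).

```r
# Textbook rational quadratic kernel (assumed form, not the package's code):
# k(x, y) = variance * (1 + ||x - y||^2 / (2 * scale * lengthscale^2))^(-scale)
rq_kernel_sketch = function(x, y, variance, lengthscale, scale) {
  d2 = sum((x - y)^2)
  variance * (1 + d2 / (2 * scale * lengthscale^2))^(-scale)
}

k_same = rq_kernel_sketch(0, 0, variance = 2, lengthscale = 1, scale = 1)
k_far = rq_kernel_sketch(0, 1, variance = 1, lengthscale = 1, scale = 1)
```

At identical inputs the kernel returns the variance, and its value decays as the inputs move apart.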
Examples
TRUE
Draw samples from a posterior GP/Magma distribution
Description
Draw samples from a posterior GP/Magma distribution
Usage
sample_gp(pred_gp, nb_samples = 50)

sample_magma(pred_gp, nb_samples = 50)
Arguments
pred_gp | A list, typically coming from |
nb_samples | A number, indicating the number of samples to be drawn from the predictive posterior distribution. For two-dimensional graphs, only one sample can be displayed. |
Value
A tibble or data frame, containing the samples generated from a GP prediction. Format: Input, Sample, Output.
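Drawing from a Gaussian posterior with known mean vector and full covariance matrix can be sketched with a Cholesky factor. The helper draw_gp_samples and its jitter argument are hypothetical; the sketch only reproduces the documented output format (Input, Sample, Output) and is not the package's implementation.

```r
# Hypothetical helper: draw samples from N(mean, cov) via a Cholesky factor.
draw_gp_samples = function(input, mean, cov, nb_samples = 50, jitter = 1e-10) {
  n = length(mean)
  # chol() returns an upper-triangular R such that t(R) %*% R = cov.
  R = chol(cov + diag(jitter, n))
  z = matrix(rnorm(n * nb_samples), nrow = n)
  draws = mean + t(R) %*% z  # one column per sample
  data.frame(
    Input = rep(input, times = nb_samples),
    Sample = rep(seq_len(nb_samples), each = n),
    Output = as.vector(draws)
  )
}

set.seed(1)
input = seq(0, 1, length.out = 5)
cov = exp(-0.5 * outer(input, input, "-")^2)  # toy SE covariance
samples = draw_gp_samples(input, mean = rep(0, 5), cov = cov, nb_samples = 3)
```

The result is in long format: each of the 3 samples contributes one row per input location.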
Examples
TRUE
Draw samples from a MagmaClust posterior distribution
Description
Draw samples from a MagmaClust posterior distribution
Usage
sample_magmaclust(pred_clust, nb_samples = 50)
Arguments
pred_clust | A list, typically coming from |
nb_samples | A number, indicating the number of samples to be drawn from the predictive posterior distribution. For two-dimensional graphs, only one sample can be displayed. |
Value
A tibble or data frame, containing the samples generated from a GP prediction. Format: Cluster, Proba, Input, Sample, Output.
Examples
TRUE
Squared Exponential Kernel
Description
Squared Exponential Kernel
Usage
se_kernel(x, y, hp, deriv = NULL, vectorized = FALSE)
Arguments
x | A vector (or matrix if vectorized = T) of inputs. |
y | A vector (or matrix if vectorized = T) of inputs. |
hp | A tibble, data frame or named vector, containing the kernel's hyperparameters. Required columns: 'se_variance', 'se_lengthscale'. |
deriv | A character, indicating according to which hyper-parameter the derivative should be computed. If NULL (default), the function simply returns the evaluation of the kernel. |
vectorized | A logical value, indicating whether the function provides a vectorized version for faster calculations. If TRUE, the |
Value
A scalar, corresponding to the evaluation of the kernel.
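As an illustration, the textbook squared exponential kernel and the covariance matrix it induces on a small set of inputs can be sketched as follows. The name se_kernel_sketch and its plain variance and lengthscale arguments are assumptions; the package's handling of 'se_variance' and 'se_lengthscale' may use a different parameterisation.

```r
# Textbook squared exponential kernel (assumed form, not the package's code):
# k(x, y) = variance * exp(-||x - y||^2 / (2 * lengthscale^2))
se_kernel_sketch = function(x, y, variance, lengthscale) {
  variance * exp(-0.5 * sum((x - y)^2) / lengthscale^2)
}

# Covariance matrix obtained by evaluating the kernel on all pairs of inputs.
inputs = c(0, 1, 2)
K = outer(
  inputs, inputs,
  Vectorize(function(a, b) se_kernel_sketch(a, b, variance = 1, lengthscale = 1))
)
```

The resulting matrix K is symmetric, with the variance on its diagonal.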
Examples
TRUE
Select the optimal number of clusters
Description
In MagmaClust, as for any clustering method, the number K of clusters has to be provided as a hypothesis of the model. This function implements a model selection procedure, by maximising a variational BIC criterion, computed for different values of K. A heuristic for a fast approximation of the procedure is proposed as well, although the corresponding models would not be properly trained.
Usage
select_nb_cluster(
  data,
  fast_approx = TRUE,
  grid_nb_cluster = 1:10,
  ini_hp_k = NULL,
  ini_hp_i = NULL,
  kern_k = "SE",
  kern_i = "SE",
  plot = TRUE,
  ...
)
Arguments
data | A tibble or data frame. Columns required: |
fast_approx | A boolean, indicating whether a fast approximation should be used for selecting the number of clusters. If TRUE, each Magma or MagmaClust model will perform only one E-step of the training, using the same fixed values for the hyper-parameters ( |
grid_nb_cluster | A vector of integers, corresponding to the grid of values that will be tested for the number of clusters. |
ini_hp_k | A tibble or data frame of hyper-parameters associated with |
ini_hp_i | A tibble or data frame of hyper-parameters associated with |
kern_k | A kernel function associated to the mean processes. |
kern_i | A kernel function associated to the individuals/tasks. |
plot | A boolean indicating whether the plot of V-BIC values for all numbers of clusters should be displayed. |
... | Any additional argument that could be passed to |
Value
A list, containing the results of the model selection procedure for selecting the optimal number of clusters by maximising a V-BIC criterion. The elements of the list are:
best_k: An integer, indicating the resulting optimal number of clusters.
seq_vbic: A vector, corresponding to the sequence of the V-BIC values associated with the models trained for each provided cluster's number in grid_nb_cluster.
trained_models: A list, named by associated number of clusters, of Magma or MagmaClust models that have been trained (or approximated if fast_approx = T) during the model selection procedure.
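The selection logic itself (score each candidate K, keep the maximiser) can be sketched independently of the model. In the snippet below, pick_best_k and toy_score are hypothetical placeholders: toy_score stands in for the V-BIC of a trained model and is not a real criterion.

```r
# Hypothetical sketch: evaluate a criterion on a grid of cluster numbers and
# keep the value of K that maximises it.
pick_best_k = function(grid_nb_cluster, score_fct) {
  seq_vbic = vapply(grid_nb_cluster, score_fct, numeric(1))
  names(seq_vbic) = grid_nb_cluster
  list(
    best_k = grid_nb_cluster[which.max(seq_vbic)],
    seq_vbic = seq_vbic
  )
}

# Placeholder criterion, peaking at K = 3 (NOT a V-BIC).
toy_score = function(k) -(k - 3)^2
res = pick_best_k(1:10, toy_score)
```

The returned list mimics the best_k and seq_vbic elements described above.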
Examples
TRUE
Simulate a dataset tailored for MagmaClustR
Description
Simulate a complete training dataset, which may be representative of various applications. Several flexible arguments allow adjustment of the number of individuals, of observed inputs, and the values of many parameters controlling the data generation.
Usage
simu_db(
  M = 10,
  N = 10,
  K = 1,
  covariate = FALSE,
  grid = seq(0, 10, 0.05),
  grid_cov = seq(0, 10, 0.5),
  common_input = TRUE,
  common_hp = TRUE,
  add_hp = FALSE,
  add_clust = FALSE,
  int_mu_v = c(4, 5),
  int_mu_l = c(0, 1),
  int_i_v = c(1, 2),
  int_i_l = c(0, 1),
  int_i_sigma = c(0, 0.2),
  lambda_int = c(30, 40),
  m_int = c(0, 10),
  lengthscale_int = c(30, 40),
  m0_slope = c(-5, 5),
  m0_intercept = c(-50, 50)
)
Arguments
M | An integer. The number of individuals per cluster. |
N | An integer. The number of observations per individual. |
K | An integer. The number of underlying clusters. |
covariate | A logical value indicating whether the dataset should include an additional input covariate named 'Covariate'. |
grid | A vector of numbers defining a grid of observations (i.e. the reference inputs). |
grid_cov | A vector of numbers defining a grid of observations (i.e. the covariate reference inputs). |
common_input | A logical value indicating whether the reference inputs are common to all individuals. |
common_hp | A logical value indicating whether the hyper-parameters are common to all individuals. If TRUE and K>1, the hyper-parameters remain different between the clusters. |
add_hp | A logical value indicating whether the values of hyper-parameters should be added as columns in the dataset. |
add_clust | A logical value indicating whether the name of the clusters should be added as a column in the dataset. |
int_mu_v | A vector of 2 numbers, defining an interval of admissible values for the variance hyper-parameter of the mean process' kernel. |
int_mu_l | A vector of 2 numbers, defining an interval of admissible values for the lengthscale hyper-parameter of the mean process' kernel. |
int_i_v | A vector of 2 numbers, defining an interval of admissible values for the variance hyper-parameter of the individual process' kernel. |
int_i_l | A vector of 2 numbers, defining an interval of admissible values for the lengthscale hyper-parameter of the individual process' kernel. |
int_i_sigma | A vector of 2 numbers, defining an interval of admissible values for the noise hyper-parameter. |
lambda_int | A vector of 2 numbers, defining an interval of admissible values for the lambda parameter of the 2D exponential. |
m_int | A vector of 2 numbers, defining an interval of admissible values for the mean of the 2D exponential. |
lengthscale_int | A vector of 2 numbers, defining an interval of admissible values for the lengthscale parameter of the 2D exponential. |
m0_slope | A vector of 2 numbers, defining an interval of admissible values for the slope of m0. |
m0_intercept | A vector of 2 numbers, defining an interval of admissible values for the intercept of m0. |
Value
A full dataset of simulated training data.
Examples
## Generate a dataset with 3 clusters of 4 individuals, observed at 10 inputs
data = simu_db(M = 4, N = 10, K = 3)

## Generate a 2-D dataset with an additional input 'Covariate'
data = simu_db(covariate = TRUE)

## Generate a dataset where input locations are different among individuals
data = simu_db(common_input = FALSE)

## Generate a dataset with an additional column indicating the true clusters
data = simu_db(K = 3, add_clust = TRUE)
Simulate a batch of data
Description
Simulate a batch of output data, corresponding to one individual, coming from a GP with the Squared Exponential kernel as covariance structure, and specified hyper-parameters and input.
Usage
simu_indiv_se(ID, input, mean, v, l, sigma)
Arguments
ID | An identification code, whether numeric or character. |
input | A vector of numbers. The input variable that is used as 'reference' for input and outputs. |
mean | A vector of numbers. Prior mean values of the GP. |
v | A number. The variance hyper-parameter of the SE kernel. |
l | A number. The lengthscale hyper-parameter of the SE kernel. |
sigma | A number. The noise hyper-parameter. |
Value
A tibble containing a batch of output data along with input and additional information for a simulated individual.
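A minimal base R sketch of this kind of simulation is given below. It assumes that v is used directly as the kernel variance, l as the lengthscale, and sigma as the noise standard deviation; the helper simulate_indiv is hypothetical and the package may apply a different parameterisation to these arguments.

```r
# Hypothetical sketch: simulate one individual from a GP with an SE kernel.
# Assumptions: 'v' is a variance, 'l' a lengthscale, 'sigma' a noise std dev.
simulate_indiv = function(ID, input, mean, v, l, sigma) {
  n = length(input)
  K = v * exp(-0.5 * outer(input, input, "-")^2 / l^2)
  # Add the noise variance (plus a small jitter) on the diagonal.
  K = K + diag(sigma^2 + 1e-10, n)
  # t(chol(K)) is a lower-triangular square root of K.
  output = mean + as.vector(t(chol(K)) %*% rnorm(n))
  data.frame(ID = ID, Input = input, Output = output)
}

set.seed(2)
indiv = simulate_indiv("A", input = seq(0, 10, 0.5), mean = 0,
                       v = 2, l = 1, sigma = 0.1)
```

The output is one row per input location, in the ID, Input, Output layout used throughout the package.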
Examples
TRUE
Compute a mixture of Gaussian log-likelihoods
Description
During the prediction step of MagmaClust, an EM algorithm is used to compute the maximum likelihood estimator of the hyper-parameters along with mixture probabilities for the new individual/task. This function implements the quantity that is maximised (i.e. a sum of Gaussian log-likelihoods, weighted by their mixture probabilities). It can also be used to monitor the EM algorithm when providing the 'prop_mixture' argument, for proper penalisation of the full log-likelihood.
Usage
sum_logL_GP_clust(
  hp,
  db,
  mixture,
  mean,
  kern,
  post_cov,
  prop_mixture = NULL,
  pen_diag
)
Arguments
hp | A tibble, data frame or named vector of hyper-parameters. |
db | A tibble containing data we want to evaluate the logL on. Required columns: Input, Output. Additional covariate columns are allowed. |
mixture | A tibble or data frame, indicating the mixture probabilities of each cluster for the new individual/task. |
mean | A list of hyper-posterior mean parameters for all clusters. |
kern | A kernel function. |
post_cov | A list of hyper-posterior covariance parameters for all clusters. |
prop_mixture | A tibble or a named vector. Each name of column or element should refer to a cluster. The value associated with each cluster is a number between 0 and 1, corresponding to the mixture proportions. |
pen_diag | A jitter term that is added to the covariance matrix to avoid numerical issues when inverting, in cases of nearly singular matrices. |
Value
A number, expectation of mixture of Gaussian log-likelihoods in the prediction step of MagmaClust. This quantity is supposed to increase at each step of the EM algorithm, and can be used for monitoring the procedure.
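The weighted sum described above can be sketched in base R for given cluster-specific means and covariance matrices. The helpers log_dmvnorm and weighted_sum_logL are hypothetical; in particular, how the package builds each covariance from kern, hp, post_cov and pen_diag is not reproduced here, and the matrices are taken as given.

```r
# Gaussian log-density computed through a Cholesky factor.
log_dmvnorm = function(y, mean, cov) {
  R = chol(cov)  # upper triangular, t(R) %*% R = cov
  z = backsolve(R, y - mean, transpose = TRUE)  # solves t(R) z = y - mean
  -0.5 * (length(y) * log(2 * pi) + 2 * sum(log(diag(R))) + sum(z^2))
}

# Hypothetical sketch: sum over clusters of the Gaussian log-likelihoods,
# each weighted by its mixture probability.
weighted_sum_logL = function(y, mixture, mean, cov) {
  sum(vapply(
    names(mixture),
    function(k) mixture[[k]] * log_dmvnorm(y, mean[[k]], cov[[k]]),
    numeric(1)
  ))
}

mixture = c(K1 = 0.5, K2 = 0.5)
mean = list(K1 = 0, K2 = 0)
cov = list(K1 = matrix(1), K2 = matrix(1))
value = weighted_sum_logL(y = 0, mixture = mixture, mean = mean, cov = cov)
```

In this toy case both clusters are standard normal, so the value reduces to the standard normal log-density at 0.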
Examples
TRUE
French swimmers performances data on 100m freestyle events
Description
A subset of data from reported performances of French swimmers during 100m freestyle competitions between 2002 and 2016. See https://link.springer.com/article/10.1007/s10994-022-06172-1 and https://www.mdpi.com/2076-3417/8/10/1766 for dedicated description and analysis.
Usage
swimmers
Format
swimmers
A data frame with 76,832 rows and 4 columns:
- ID
Identifying number associated to each swimmer
- Input
Age in years
- Output
Performance in seconds on a 100m freestyle event
- Gender
Competition gender
Source
https://ffn.extranat.fr/webffn/competitions.php?idact=nat
Learning hyper-parameters of a Gaussian Process
Description
Learning hyper-parameters of any new individual/task in Magma is required in the prediction procedure. This function can also be used to learn hyper-parameters of a simple GP (just leave the hyperpost argument set to NULL, and use prior_mean instead). When used within Magma, by providing data for the new individual/task, the hyper-posterior mean and covariance parameters, and initialisation values for the hyper-parameters, the function computes maximum likelihood estimates of the hyper-parameters.
Usage
train_gp(
  data,
  prior_mean = NULL,
  ini_hp = NULL,
  kern = "SE",
  hyperpost = NULL,
  pen_diag = 1e-10
)
Arguments
data | A tibble or data frame. Required columns: |
prior_mean | Mean parameter of the GP. This argument can be specified under various formats, such as:
|
ini_hp | A named vector, tibble or data frame of hyper-parameters associated with the |
kern | A kernel function, defining the covariance structure of the GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
hyperpost | A list, containing the elements 'mean' and 'cov', the parameters of the hyper-posterior distribution of the mean process. Typically, this argument should come from a previous learning using |
pen_diag | A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A tibble, containing the trained hyper-parameters for the kernel of the new individual/task.
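Maximum likelihood estimation of GP hyper-parameters can be illustrated with stats::optim on a toy dataset. The sketch below is an assumption-laden illustration, not the package's training routine: it uses a plain SE kernel with Gaussian noise, optimises log-transformed hyper-parameters within arbitrary bounds, and the function name neg_logL is hypothetical.

```r
# Negative Gaussian log-likelihood of a GP with SE kernel plus noise.
# Hyper-parameters are optimised on the log scale to keep them positive.
neg_logL = function(log_hp, input, output, prior_mean = 0, pen_diag = 1e-10) {
  v = exp(log_hp[1])
  l = exp(log_hp[2])
  s = exp(log_hp[3])
  n = length(input)
  K = v * exp(-0.5 * outer(input, input, "-")^2 / l^2) + diag(s + pen_diag, n)
  R = chol(K)
  z = backsolve(R, output - prior_mean, transpose = TRUE)
  0.5 * (n * log(2 * pi) + 2 * sum(log(diag(R))) + sum(z^2))
}

set.seed(42)
input = seq(0, 10, length.out = 30)
output = sin(input) + rnorm(30, sd = 0.1)

ini = c(0, 0, -2)  # arbitrary initialisation (log scale)
fit = optim(ini, neg_logL, input = input, output = output,
            method = "L-BFGS-B", lower = rep(-5, 3), upper = rep(5, 3))
hp = exp(fit$par)
names(hp) = c("variance", "lengthscale", "noise")
```

The vector hp then holds the estimated variance, lengthscale and noise variance for this toy dataset.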
Examples
TRUE
Prediction in MagmaClust: learning new HPs and mixture probabilities
Description
Learning hyper-parameters and mixture probabilities of any new individual/task is required in MagmaClust in the prediction procedure. By providing data for the new individual/task, the hyper-posterior mean and covariance parameters, the mixture proportions, and initialisation values for the hyper-parameters, train_gp_clust uses an EM algorithm to compute maximum likelihood estimates of the hyper-parameters and hyper-posterior mixture probabilities of the new individual/task.
Usage
train_gp_clust(
  data,
  prop_mixture = NULL,
  ini_hp = NULL,
  kern = "SE",
  hyperpost = NULL,
  pen_diag = 1e-10,
  n_iter_max = 25,
  cv_threshold = 0.001
)
Arguments
data | A tibble or data frame. Required columns: |
prop_mixture | A tibble or a named vector. Each name of column or element should refer to a cluster. The value associated with each cluster is a number between 0 and 1, corresponding to the mixture proportions. |
ini_hp | A tibble or data frame of hyper-parameters associated with |
kern | A kernel function, defining the covariance structure of the GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
hyperpost | A list, containing the elements |
pen_diag | A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
n_iter_max | A number, indicating the maximum number of iterations of the EM algorithm to proceed while not reaching convergence. |
cv_threshold | A number, indicating the threshold of the likelihood gain under which the EM algorithm will stop. |
Value
A list, containing the results of the EM algorithm used during the prediction step of MagmaClust. The elements of the list are:
hp: A tibble of optimal hyper-parameters for the new individual's GP.
mixture: A tibble of mixture probabilities for the new individual.
Examples
TRUE
Training Magma with an EM algorithm
Description
The hyper-parameters and the hyper-posterior distribution involved in Magma can be learned thanks to an EM algorithm implemented in train_magma. By providing a dataset, the model hypotheses (hyper-prior mean parameter and covariance kernels) and initialisation values for the hyper-parameters, the function computes maximum likelihood estimates of the HPs as well as the mean and covariance parameters of the Gaussian hyper-posterior distribution of the mean process.
Usage
train_magma(
  data,
  prior_mean = NULL,
  ini_hp_0 = NULL,
  ini_hp_i = NULL,
  kern_0 = "SE",
  kern_i = "SE",
  common_hp = TRUE,
  grid_inputs = NULL,
  pen_diag = 1e-10,
  n_iter_max = 25,
  cv_threshold = 0.001,
  fast_approx = FALSE
)
Arguments
data | A tibble or data frame. Required columns: |
prior_mean | Hyper-prior mean parameter (m_0) of the mean GP. This argument can be specified under various formats, such as:
|
ini_hp_0 | A named vector, tibble or data frame of hyper-parameters associated with |
ini_hp_i | A tibble or data frame of hyper-parameters associated with |
kern_0 | A kernel function, associated with the mean GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
kern_i | A kernel function, associated with the individual GPs. ("SE", "PERIO" and "RQ" are also available here). |
common_hp | A logical value, indicating whether the set of hyper-parameters is assumed to be common to all individuals. |
grid_inputs | A vector, indicating the grid of additional reference inputs on which the mean process' hyper-posterior should be evaluated. |
pen_diag | A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
n_iter_max | A number, indicating the maximum number of iterations of the EM algorithm to proceed while not reaching convergence. |
cv_threshold | A number, indicating the threshold of the likelihood gain under which the EM algorithm will stop. The convergence condition is defined as the difference of likelihoods between two consecutive steps, divided by the absolute value of the last one ( |
fast_approx | A boolean, indicating whether the EM algorithm should stop after only one iteration of the E-step. This advanced feature is mainly used to provide a faster approximation of the model selection procedure, by preventing any optimisation over the hyper-parameters. |
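The stopping rule described for cv_threshold can be sketched as a relative-gain test. The helper has_converged is hypothetical, and comparing the absolute value of the relative gain to the threshold is an assumption made for this illustration.

```r
# Hypothetical sketch of the stopping rule: relative gain between two
# consecutive log-likelihood values, compared to 'cv_threshold'.
has_converged = function(logL_new, logL_old, cv_threshold = 0.001) {
  gain = (logL_new - logL_old) / abs(logL_new)
  cv_threshold > abs(gain)
}

small_step = has_converged(-100.01, -100.05)  # relative gain of about 4e-4
large_step = has_converged(-90, -100)         # relative gain of about 0.11
```

With the default threshold, the first call signals convergence while the second does not.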
Details
The user can specify custom kernel functions for the arguments kern_0 and kern_i. The hyper-parameters used in the kernel should have explicit names, and be contained within the hp argument. hp should typically be defined as a named vector or a data frame. Although it is not mandatory for the train_magma function to run, gradients can be provided within kernel function definition. See for example se_kernel to create a custom kernel function displaying an adequate format to be used in Magma.
Value
A list, gathering the results of the EM algorithm used for training in Magma. The elements of the list are:
hp_0: A tibble of the trained hyper-parameters for the mean process' kernel.
hp_i: A tibble of all the trained hyper-parameters for the individual processes' kernels.
hyperpost: A sub-list gathering the parameters of the mean process' hyper-posterior distribution, namely:
mean: A tibble, the hyper-posterior mean parameter (Output) evaluated at each training reference Input.
cov: A matrix, the covariance parameter for the hyper-posterior distribution of the mean process.
pred: A tibble, the predicted mean and variance at Input for the mean process' hyper-posterior distribution under a format that allows the direct visualisation as a GP prediction.
ini_args: A list containing the initial function arguments and values for the hyper-prior mean and the hyper-parameters. In particular, if those arguments were set to NULL, ini_args allows us to retrieve the (randomly chosen) initialisations used during training.
seq_loglikelihood: A vector, containing the sequence of log-likelihood values associated with each iteration.
converged: A logical value indicating whether the EM algorithm converged or not.
training_time: Total running time of the complete training.
Examples
TRUE
Training MagmaClust with a Variational EM algorithm
Description
The hyper-parameters and the hyper-posterior distributions involved in MagmaClust can be learned thanks to a VEM algorithm implemented in train_magmaclust. By providing a dataset, the model hypotheses (hyper-prior mean parameters, covariance kernels and number of clusters) and initialisation values for the hyper-parameters, the function computes maximum likelihood estimates of the HPs as well as the mean and covariance parameters of the Gaussian hyper-posterior distributions of the mean processes.
Usage
train_magmaclust(
  data,
  nb_cluster = NULL,
  prior_mean_k = NULL,
  ini_hp_k = NULL,
  ini_hp_i = NULL,
  kern_k = "SE",
  kern_i = "SE",
  ini_mixture = NULL,
  common_hp_k = TRUE,
  common_hp_i = TRUE,
  grid_inputs = NULL,
  pen_diag = 1e-10,
  n_iter_max = 25,
  cv_threshold = 0.001,
  fast_approx = FALSE
)
Arguments
data | A tibble or data frame. Columns required: |
nb_cluster | A number, indicating the number of clusters of individuals/tasks that are assumed to exist among the dataset. |
prior_mean_k | The set of hyper-prior mean parameters (m_k) for the K mean GPs, one value for each cluster. This argument can be specified under various formats, such as:
|
ini_hp_k | A tibble or data frame of hyper-parameters associated with |
ini_hp_i | A tibble or data frame of hyper-parameters associated with |
kern_k | A kernel function, associated with the mean GPs. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
kern_i | A kernel function, associated with the individual GPs. (See details above in |
ini_mixture | Initial values of the probability to belong to each cluster for each individual ( |
common_hp_k | A boolean indicating whether hyper-parameters are common among the mean GPs. |
common_hp_i | A boolean indicating whether hyper-parameters are common among the individual GPs. |
grid_inputs | A vector, indicating the grid of additional reference inputs on which the mean processes' hyper-posteriors should be evaluated. |
pen_diag | A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
n_iter_max | A number, indicating the maximum number of iterations of the VEM algorithm to proceed while not reaching convergence. |
cv_threshold | A number, indicating the threshold of the likelihood gain under which the VEM algorithm will stop. The convergence condition is defined as the difference of ELBO between two consecutive steps, divided by the absolute value of the last one ( |
fast_approx | A boolean, indicating whether the VEM algorithm should stop after only one iteration of the VE-step. This advanced feature is mainly used to provide a faster approximation of the model selection procedure, by preventing any optimisation over the hyper-parameters. |
Details
The user can specify custom kernel functions for the arguments kern_k and kern_i. The hyper-parameters used in the kernel should have explicit names, and be contained within the hp argument. hp should typically be defined as a named vector or a data frame. Although it is not mandatory for the train_magmaclust function to run, gradients can be provided within kernel function definition. See for example se_kernel to create a custom kernel function displaying an adequate format to be used in MagmaClust.
Value
A list, containing the results of the VEM algorithm used in the training step of MagmaClust. The elements of the list are:
hp_k: A tibble containing the trained hyper-parameters for the mean process' kernel and the mixture proportions for each cluster.
hp_i: A tibble containing the trained hyper-parameters for the individual processes' kernels.
hyperpost: A sub-list containing the parameters of the mean processes' hyper-posterior distribution, namely:
mean: A list of tibbles containing, for each cluster, the hyper-posterior mean parameters evaluated at each Input.
cov: A list of matrices containing, for each cluster, the hyper-posterior covariance parameter of the mean process.
mixture: A tibble, indicating the mixture probabilities in each cluster for each individual.
ini_args: A list containing the initial function arguments and values for the hyper-prior means and the hyper-parameters. In particular, if those arguments were set to NULL, ini_args allows us to retrieve the (randomly chosen) initialisations used during training.
seq_elbo: A vector, containing the sequence of ELBO values associated with each iteration.
converged: A logical value indicating whether the algorithm converged.
training_time: Total running time of the complete training.
Examples
TRUE
Update the mixture probabilities for each individual and each cluster
Description
Update the mixture probabilities for each individual and each cluster
Usage
update_mixture(db, mean_k, cov_k, hp, kern, prop_mixture, pen_diag)
Arguments
db | A tibble or data frame. Columns required: |
mean_k | A list of the K hyper-posterior mean parameters. |
cov_k | A list of the K hyper-posterior covariance matrices. |
hp | A named vector, tibble or data frame of hyper-parameters associated with |
kern | A kernel function, defining the covariance structure of the individual GPs. |
prop_mixture | A tibble containing the hyper-parameters associated with each individual, indicating to which cluster it belongs. |
pen_diag | A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
Compute the hyper-posterior multinomial distributions by updating mixture probabilities.
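For one individual, a generic mixture update of this kind multiplies each cluster's proportion by the corresponding likelihood and renormalises. The sketch below illustrates that standard computation on the log scale; the helper update_mixture_sketch is hypothetical, and the cluster-specific log-likelihoods are taken as given rather than computed from db, mean_k, cov_k, hp and kern as the package does.

```r
# Hypothetical sketch: posterior cluster probabilities for one individual,
# from prior proportions and cluster-specific log-likelihoods.
update_mixture_sketch = function(logL_k, prop_mixture) {
  log_w = log(prop_mixture) + logL_k
  # Subtract the maximum before exponentiating to avoid numerical underflow.
  w = exp(log_w - max(log_w))
  w / sum(w)
}

logL_k = c(K1 = -10, K2 = -10)          # equal likelihood under both clusters
prop_mixture = c(K1 = 0.25, K2 = 0.75)
tau = update_mixture_sketch(logL_k, prop_mixture)
```

When both clusters explain the data equally well, the updated probabilities reduce to the mixture proportions.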
Examples
TRUE
E-Step of the VEM algorithm
Description
Expectation step of the Variational EM algorithm used to compute the parameters of the hyper-posterior distributions for the mean processes and mixture variables involved in MagmaClust.
Usage
ve_step(db, m_k, kern_k, kern_i, hp_k, hp_i, old_mixture, iter, pen_diag)
Arguments
db | A tibble or data frame. Columns required: ID, Input, Output. Additional columns for covariates can be specified. |
m_k | A named list of vectors, corresponding to the prior mean parameters of the K mean GPs. |
kern_k | A kernel function, associated with the K mean GPs. |
kern_i | A kernel function, associated with the M individual GPs. |
hp_k | A named vector, tibble or data frame of hyper-parameters associated with |
hp_i | A named vector, tibble or data frame of hyper-parameters associated with |
old_mixture | A list of mixture values from the previous iteration. |
iter | A number, indicating the current iteration of the VEM algorithm. |
pen_diag | A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A named list, containing the elements mean, a tibble containing the Input and associated Output of the hyper-posterior mean parameters, cov, the hyper-posterior covariance matrices, and mixture, the probabilities to belong to each cluster for each individual.
Examples
TRUE
V-Step of the VEM algorithm
Description
Maximization step of the Variational EM algorithm used to compute hyper-parameters of all the kernels involved in MagmaClust.
Usage
vm_step(
  db,
  old_hp_k,
  old_hp_i,
  list_mu_param,
  kern_k,
  kern_i,
  m_k,
  common_hp_k,
  common_hp_i,
  pen_diag
)
Arguments
db | A tibble or data frame. Columns required: ID, Input, Output. Additional columns for covariates can be specified. |
old_hp_k | A named vector, tibble or data frame, containing the hyper-parameters from the previous M-step (or initialisation) associated with the mean GPs. |
old_hp_i | A named vector, tibble or data frame, containing the hyper-parameters from the previous M-step (or initialisation) associated with the individual GPs. |
list_mu_param | List of parameters of the K mean GPs. |
kern_k | A kernel used to compute the covariance matrix of the mean GPs at corresponding timestamps. |
kern_i | A kernel used to compute the covariance matrix of the individual GPs at corresponding timestamps. |
m_k | A named list of prior mean parameters for the K mean GPs. Length = 1 or nrow(unique(db$Input)) |
common_hp_k | A boolean indicating whether hyper-parameters are common among mean GPs (for each mu_k) |
common_hp_i | A boolean indicating whether hyper-parameters are common among individual GPs (for each y_i) |
pen_diag | A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
Value
A named list, containing the elements hp_k, a tibble containing the hyper-parameters associated with each cluster, hp_i, a tibble containing the hyper-parameters associated with the individual GPs, and prop_mixture_k, a tibble containing the hyper-parameters associated with each individual, indicating the probabilities to belong to each cluster.
Examples
TRUE
Weight follow-up data of children in Singapore
Description
A subset of data from the GUSTO project (https://www.gusto.sg/) collecting the weight over time of several children in Singapore. See https://arxiv.org/abs/2011.07866 for dedicated description and analysis.
Usage
weight
Format
weight
A data frame with 3,629 rows and 4 columns:
- ID
Identifying number associated to each child
- sex
Biological gender
- Input
Age in months
- Output
Weight in kilograms