
Type:

Package

Title:

Memory-Based Learning in Spectral Chemometrics

Version:

2.2.5

Date:

2025-10-16

Maintainer:

Leonardo Ramirez-Lopez <ramirez.lopez.leo@gmail.com>

BugReports:

https://github.com/l-ramirez-lopez/resemble/issues

Description:

Functions for dissimilarity analysis and memory-based learning (MBL, a.k.a. local modeling) in complex spectral data sets. Most of these functions are based on the methods presented in Ramirez-Lopez et al. (2013) <doi:10.1016/j.geoderma.2012.12.014>.

License:

MIT + file LICENSE

URL:

http://l-ramirez-lopez.github.io/resemble/

Depends:

R (≥ 3.5.0)

Imports:

foreach, iterators, Rcpp (≥ 1.0.3), mathjaxr (≥ 1.0), magrittr (≥ 1.5.0), lifecycle (≥ 0.2.0), data.table (≥ 1.9.8)

Suggests:

prospectr, parallel, doParallel, testthat, formatR, rmarkdown, bookdown, knitr

LinkingTo:

Rcpp, RcppArmadillo

RdMacros:

mathjaxr

VignetteBuilder:

knitr

NeedsCompilation:

yes

Repository:

CRAN

RoxygenNote:

7.3.2

Encoding:

UTF-8

Config/VersionName:

dstatements

Packaged:

2025-10-17 18:47:43 UTC; leo

Author:

Leonardo Ramirez-Lopez [aut, cre], Antoine Stevens [aut, ctb], Claudio Orellano [ctb], Raphael Viscarra Rossel [ctb], Alex Wadoux [ctb]

Date/Publication:

2025-10-17 19:20:02 UTC

Overview of the functions in the resemble package

Description

Functions for memory-based learning


Details

This is version 2.2.5 (dstatements) of the package. It implements a number of functions useful for modeling complex spectral data (e.g. NIR, IR). The package includes functions for dimensionality reduction, computing spectral dissimilarity matrices, nearest neighbor search, and modeling spectral data using memory-based learning. This package builds upon the methods presented in Ramirez-Lopez et al. (2013) doi:10.1016/j.geoderma.2012.12.014.

Development versions can be found in the GitHub repository of the package at https://github.com/l-ramirez-lopez/resemble.

The functions available for dimensionality reduction are:

The functions available for computing dissimilarity matrices are:

The functions available for evaluating dissimilarity matrices are:

sim_eval

The functions available for nearest neighbor search:

search_neighbors

The functions available for modeling spectral data:

mbl
mbl_control

Other supplementary functions:

Author(s)

Maintainer / Creator: Leonardo Ramirez-Lopez <ramirez.lopez.leo@gmail.com>

Authors:

Leonardo Ramirez-Lopez (ORCID)
Antoine Stevens (ORCID)
Claudio Orellano
Raphael Viscarra Rossel (ORCID)
Zefang Shen
Craig Lobsey (ORCID)
Alex Wadoux (ORCID)

References

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.

Print method for an object of class `local_ortho_diss`

Description

Prints the subsets of local_ortho_diss objects.

Usage

## S3 method for class 'local_ortho_diss'
x[rows, columns, drop = FALSE, ...]

Arguments

x

local_ortho_diss matrix

rows

the indices of the rows

columns

the indices of the columns

drop

drop argument

...

not used

checks the pc_selection argument

Description

internal

Usage

check_pc_arguments(
  n_rows_x,
  n_cols_x,
  pc_selection,
  default_max_comp = 40,
  default_max_cumvar = 0.99,
  default_max_var = 0.01
)

Correlation and moving correlation dissimilarity measurements (cor_diss)

Description

Computes correlation and moving correlation dissimilarity matrices.

Usage

cor_diss(Xr, Xu = NULL, ws = NULL,
         center = TRUE, scale = FALSE)

Arguments

Xr

a matrix.

Xu

an optional matrix containing data of a second set of observations.

ws

for moving correlation dissimilarity, an odd integer value which specifies the window size. If ws = NULL, then the window size will be equal to the number of variables (columns), i.e. instead of moving correlation, the normal correlation will be used. See details.

center

a logical indicating if the spectral data Xr (and Xu if specified) must be centered. If Xu is provided, the data is centered on the basis of $Xr \cup Xu$.

scale

a logical indicating if Xr (and Xu if specified) must be scaled. If Xu is provided, the data is scaled on the basis of $Xr \cup Xu$.

Details

The correlation dissimilarity $d$ between two observations $x_i$ and $x_j$ is based on Pearson's correlation coefficient ($\rho$) and it can be computed as follows:

\[d(x_i, x_j) = \frac{1}{2}(1 - \rho(x_i, x_j))\]

The above formula is used when ws = NULL. On the other hand (when ws != NULL), the moving correlation dissimilarity between two observations $x_i$ and $x_j$ is computed as follows:

\[d(x_i, x_j; ws) = \frac{1}{2 ws}\sum_{k=1}^{p-ws}\left(1 - \rho(x_{i,(k:k+ws)}, x_{j,(k:k+ws)})\right)\]

where $ws$ represents a given window size which rolls sequentially from 1 up to $p - ws$ and $p$ is the number of variables of the observations.

The function does not accept input data containing missing values.
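The two formulas above can be sketched in base R for a single pair of observations. This is only an illustration written term by term from the printed equations; the helper names `cor_dissimilarity` and `moving_cor_dissimilarity` are hypothetical, and the package's compiled code may differ in details such as window handling. Note that the index range k:(k + ws), taken literally, covers ws + 1 variables.

```r
# Correlation dissimilarity for one pair of observations (case ws = NULL)
cor_dissimilarity <- function(xi, xj) {
  0.5 * (1 - cor(xi, xj))
}

# Moving correlation dissimilarity, following the printed formula literally
moving_cor_dissimilarity <- function(xi, xj, ws) {
  p <- length(xi)
  stopifnot(length(xj) == p, ws %% 2 == 1, ws < p)
  terms <- vapply(
    seq_len(p - ws),
    function(k) 1 - cor(xi[k:(k + ws)], xj[k:(k + ws)]),
    numeric(1)
  )
  sum(terms) / (2 * ws)
}

# Toy "spectra": a random walk and a noisy copy of it
set.seed(1)
xi <- cumsum(rnorm(50))
xj <- xi + rnorm(50, sd = 0.1)

d_self <- cor_dissimilarity(xi, xi)    # identical observations
d_anti <- cor_dissimilarity(xi, -xi)   # perfectly anti-correlated observations
d_full <- cor_dissimilarity(xi, xj)
d_moving <- moving_cor_dissimilarity(xi, xj, ws = 11)
```

With the full-window formula the dissimilarity is bounded between 0 (perfect positive correlation) and 1 (perfect negative correlation).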

Value

a matrix of the computed dissimilarities.

Author(s)

Antoine Stevens and Leonardo Ramirez-Lopez

Examples

library(prospectr)
data(NIRsoil)
Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]
cor_diss(Xr = Xr)
cor_diss(Xr = Xr, Xu = Xu)
cor_diss(Xr = Xr, ws = 41)
cor_diss(Xr = Xr, Xu = Xu, ws = 41)

From dissimilarity matrix to neighbors

Description

internal

Usage

diss_to_neighbors(
  diss_matrix,
  k = NULL,
  k_diss = NULL,
  k_range = NULL,
  spike = NULL,
  return_dissimilarity = FALSE,
  skip_first = FALSE
)

Arguments

diss_matrix

a matrix representing the dissimilarities between observations in a matrix Xu and observations in another matrix Xr. Xr in rows, Xu in columns.

k

an integer value indicating the k-nearest neighbors of each observation in Xu that must be selected from Xr.

k_diss

an integer value indicating a dissimilarity threshold. For each observation in Xu, its nearest neighbors in Xr are selected as those for which their dissimilarity to Xu is below this k_diss threshold. This threshold depends on the corresponding dissimilarity metric specified in diss_method. Either k or k_diss must be specified.

k_range

an integer vector of length 2 which specifies the minimum (first value) and the maximum (second value) number of neighbors to be retained when k_diss is given.

spike

a vector of integers indicating what observations in Xr (and Yr) must be 'forced' to always be part of all the neighborhoods.

return_dissimilarity

a logical indicating if the input dissimilarity must be mirrored in the output.

skip_first

a logical indicating whether to skip the first neighbor or not. Default is FALSE. This is used when the search is being conducted in a symmetric matrix of distances (i.e. to avoid that the nearest neighbor of each observation is itself).
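As a rough illustration of how the k, k_diss and k_range arguments described above interact, the following base R sketch selects neighbors from a toy dissimilarity matrix (Xr in rows, Xu in columns). The helpers `nearest_k` and `nearest_by_threshold` are hypothetical and are not the internal implementation; in particular, the way the threshold count is clamped to k_range is an assumption.

```r
# Toy dissimilarity matrix: 8 reference observations (rows), 3 targets (columns)
set.seed(42)
diss <- matrix(runif(8 * 3), nrow = 8, ncol = 3)

# Hypothetical helper: row indices of the k nearest references per column
nearest_k <- function(diss, k) {
  apply(diss, 2, function(d) order(d)[seq_len(k)])
}

# Hypothetical helper: keep the references below k_diss, but never fewer
# than k_range[1] and never more than k_range[2] neighbors
nearest_by_threshold <- function(diss, k_diss, k_range) {
  lapply(seq_len(ncol(diss)), function(j) {
    d <- diss[, j]
    n_below <- sum(d < k_diss)
    n_keep <- min(max(n_below, k_range[1]), k_range[2])
    order(d)[seq_len(n_keep)]
  })
}

nn_k <- nearest_k(diss, k = 3)
nn_thr <- nearest_by_threshold(diss, k_diss = 0.3, k_range = c(2, 5))
```

Here `nn_k` has one column of neighbor indices per target observation, while `nn_thr` is a list because the neighborhood size can vary from one target to the next.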

Dissimilarity computation between matrices

Description

This is a wrapper that integrates the different dissimilarity functions offered by the package. It computes the dissimilarities between observations in numerical matrices by using a specified dissimilarity measure.

Usage

dissimilarity(Xr, Xu = NULL,
              diss_method = c("pca", "pca.nipals", "pls", "mpls",
                              "cor", "euclid", "cosine", "sid"),
              Yr = NULL, gh = FALSE, pc_selection = list("var", 0.01),
              return_projection = FALSE, ws = NULL,
              center = TRUE, scale = FALSE, documentation = character(),
              ...)

Arguments

Xr

a matrix containing n observations/rows and p variables/columns.

Xu

an optional matrix containing data of a second set of observations with p variables/columns.

diss_method

a character string indicating the method to be used to compute the dissimilarities between observations. Options are:

"pca": Mahalanobis distance computed on the matrix of scores of a Principal Component (PC) projection of Xr (and Xu if provided). PC projection is done using the singular value decomposition (SVD) algorithm. See ortho_diss function.
"pca.nipals": Mahalanobis distance computed on the matrix of scores of a Principal Component (PC) projection of Xr (and Xu if provided). PC projection is done using the non-linear iterative partial least squares (nipals) algorithm. See ortho_diss function.
"pls": Mahalanobis distance computed on the matrix of scores of a partial least squares projection of Xr (and Xu if provided). In this case, Yr is always required. See ortho_diss function.
"mpls": Mahalanobis distance computed on the matrix of scores of a modified partial least squares projection (Shenk and Westerhaus, 1991; Westerhaus, 2014) of Xr (and Xu if provided). In this case, Yr is always required. See ortho_diss function.
"cor": based on the correlation coefficient between observations. See cor_diss function.
"euclid": Euclidean distance between observations. See f_diss function.
"cosine": Cosine distance between observations. See f_diss function.
"sid": spectral information divergence between observations. See sid function.

Yr

a numeric matrix of n observations used as side information of Xr for the ortho_diss methods (i.e. pca, pca.nipals or pls). It is required when:

diss_method = "pls"
diss_method = "pca" with "opc" used as the method in the pc_selection argument. See ortho_diss.
gh = TRUE

gh

a logical indicating if the Mahalanobis distance (in the pls score space) between each observation and the pls centre/mean must be computed.

pc_selection

a list of length 2 to be passed onto the ortho_diss methods. It is required if the method selected in diss_method is any of "pca", "pca.nipals" or "pls" or if gh = TRUE. This argument is used for optimizing the number of components (principal components or pls factors) to be retained. This list must contain two elements in the following order: method (a character indicating the method for selecting the number of components) and value (a numerical value that complements the selected method). The methods available are:

"opc": optimized principal component selection based on Ramirez-Lopez et al. (2013a, 2013b). The optimal number of components (of a set of observations) is the one for which its distance matrix minimizes the differences between the Yr value of each observation and the Yr value of its closest observation. In this case value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the maximum number of principal components to be tested. See the ortho_projection function for more details.
"cumvar": selection of the principal components based on a given cumulative amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of cumulative variance that the combination of retained components should explain.
"var": selection of the principal components based on a given amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of variance that a single component should explain in order to be retained.
"manual": for manually specifying a fixed number of principal components. In this case, value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the number of components to be retained.

The default is list(method = "var", value = 0.01).

Optionally, the pc_selection argument admits "opc" or "cumvar" or "var" or "manual" as a single character string. In such a case the default "value" when either "opc" or "manual" are used is 40. When "cumvar" is used the default "value" is set to 0.99 and when "var" is used, the default "value" is set to 0.01.
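The selection rules described for pc_selection can be sketched with base R's prcomp(). This is a minimal illustration on simulated data and not the package's implementation; in particular, using the RMSE between each observation's response and that of its nearest neighbor as the "opc" criterion, and the variable `y` standing in for Yr, are assumptions made here for the example.

```r
set.seed(7)
n <- 60
X <- matrix(rnorm(n * 10), nrow = n) %*% matrix(rnorm(100), nrow = 10)
y <- X[, 1] + rnorm(n, sd = 0.1)  # toy side information (stands in for Yr)

pca <- prcomp(X, center = TRUE, scale. = FALSE)
explained <- pca$sdev^2 / sum(pca$sdev^2)

# "var": components that individually explain at least `value` of the variance
n_var <- sum(explained >= 0.01)

# "cumvar": fewest leading components whose cumulative variance reaches `value`
n_cumvar <- which(cumsum(explained) >= 0.99)[1]

# "opc" (sketch): for each candidate number of components, compute Mahalanobis
# distances in the score space, find each observation's closest neighbor and
# compare their y values; retain the number of components with the lowest RMSE
max_comp <- 8
rmse_by_n <- vapply(seq_len(max_comp), function(n_comp) {
  z <- scale(pca$x[, seq_len(n_comp), drop = FALSE])  # unit-variance scores
  d <- as.matrix(dist(z))
  diag(d) <- Inf                                      # exclude self-matches
  nearest <- apply(d, 1, which.min)
  sqrt(mean((y - y[nearest])^2))
}, numeric(1))
n_opc <- which.min(rmse_by_n)
```

Because PCA scores are uncorrelated, scaling them to unit variance and taking Euclidean distances gives the Mahalanobis distance in the score space.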

return_projection

a logical indicating if the projection(s) must be returned. Projections are used if the ortho_diss methods are called (i.e. diss_method = "pca", diss_method = "pca.nipals" or diss_method = "pls") or when gh = TRUE. In case gh = TRUE and an ortho_diss method is used (in the diss_method argument), both projections are returned.

ws

an odd integer value which specifies the window size, when diss_method = "cor" (cor_diss method) for moving correlation dissimilarity. If ws = NULL (default), then the window size will be equal to the number of variables (columns), i.e. instead of moving correlation, the normal correlation will be used. See cor_diss function.

center

a logical indicating if Xr (and Xu if provided) must be centered. If Xu is provided the data is centered around the mean of the pooled Xr and Xu matrices ($Xr \cup Xu$). For dissimilarity computations based on diss_method = "pls", the data is always centered.

scale

a logical indicating if Xr (and Xu if provided) must be scaled. If Xu is provided the data is scaled based on the standard deviation of the pooled Xr and Xu matrices ($Xr \cup Xu$). If center = TRUE, scaling is applied after centering.

documentation

an optional character string that can be used to describe anything related to the mbl call (e.g. description of the input data). Default: character(). NOTE: this is an experimental argument.

...

other arguments passed to the dissimilarity functions (ortho_diss, cor_diss, f_diss or sid).

Details

This function is a wrapper for ortho_diss, cor_diss, f_diss, sid. Check the documentation of these functions for further details.

Value

A list with the following components:

dissimilarity: the resulting dissimilarity matrix.
projection: an ortho_projection object. Only output if return_projection = TRUE and if diss_method = "pca", diss_method = "pca.nipals", diss_method = "pls" or diss_method = "mpls". This object contains the projection used to compute the dissimilarity matrix. In case of local dissimilarity matrices, the projection corresponds to the global projection used to select the neighborhoods (see ortho_diss function for further details).
gh: a list containing the GH distances as well as the pls projection used to compute the GH.

Author(s)

Leonardo Ramirez-Lopez

References

Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.

Westerhaus, M. 2014. Eastern Analytical Symposium Award for outstanding achievements in near infrared spectroscopy: my contributions to near infrared spectroscopy. NIR news, 25(8), 16-20.

Examples

library(prospectr)
data(NIRsoil)
# Filter the data using the first derivative with Savitzky and Golay
# smoothing filter and a window size of 15 spectral variables and a
# polynomial order of 4
sg <- savitzkyGolay(NIRsoil$spc, m = 1, p = 4, w = 15)
# Replace the original spectra with the filtered ones
NIRsoil$spc <- sg
Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]
Xu <- Xu[!is.na(Yu), ]
Xr <- Xr[!is.na(Yr), ]
Yu <- Yu[!is.na(Yu)]
Yr <- Yr[!is.na(Yr)]
dsm_pca <- dissimilarity(
  Xr = Xr, Xu = Xu,
  diss_method = c("pca"),
  Yr = Yr, gh = TRUE,
  pc_selection = list("opc", 30),
  return_projection = TRUE
)

A function for transforming a matrix from its Euclidean space to its Mahalanobis space

Description

For internal use only

Usage

euclid_to_mahal(X, sm_method = c("svd", "eigen"))

evaluation of multiple distances obtained with multiple PCs

Description

internal

Usage

eval_multi_pc_diss(
  scores,
  side_info,
  from = 1,
  to = ncol(scores),
  steps = 1,
  method = c("pc", "pls"),
  check_dims = TRUE
)

Euclidean, Mahalanobis and cosine dissimilarity measurements

Description

This function is used to compute the dissimilarity between observations based on Euclidean or Mahalanobis distance measures or on cosine dissimilarity measures (a.k.a. spectral angle mapper).

Usage

f_diss(Xr, Xu = NULL, diss_method = "euclid",
       center = TRUE, scale = FALSE)

Arguments

Xr

a matrix containing the (reference) data.

Xu

an optional matrix containing data of a second set of observations (samples).

diss_method

the method for computing the dissimilarity between observations. Options are "euclid" (Euclidean distance), "mahalanobis" (Mahalanobis distance) and "cosine" (cosine distance, a.k.a. spectral angle mapper). See details.

center

a logical indicating if the spectral data Xr (and Xu if specified) must be centered. If Xu is provided, the data is centered on the basis of $Xr \cup Xu$.

scale

a logical indicating if Xr (and Xu if specified) must be scaled. If Xu is provided, the data is scaled on the basis of $Xr \cup Xu$.

Details

The results obtained for Euclidean dissimilarity are equivalent to those returned by the stats::dist() function, but are scaled differently. However, f_diss is considerably faster (which can be advantageous when computing dissimilarities for very large matrices). In f_diss, the number of variables is used to scale the squared dissimilarity scores. See the examples section for a comparison between stats::dist() and f_diss.

In the case of both the Euclidean and Mahalanobis distances, the scaled dissimilarity matrix $D$ between observations in a given matrix $X$ is computed as follows:

\[d(x_i, x_j)^{2} = (x_i - x_j)M^{-1}(x_i - x_j)^{\mathrm{T}}\]

\[d_{scaled}(x_i, x_j) = \sqrt{\frac{1}{p}d(x_i, x_j)^{2}}\]

where $p$ is the number of variables in $X$, $M$ is the identity matrix in the case of the Euclidean distance and the variance-covariance matrix of $X$ in the case of the Mahalanobis distance. The Mahalanobis distance can also be viewed as the Euclidean distance after applying a linear transformation of the original variables. Such a linear transformation is done by using a factorization of the inverse covariance matrix as $M^{-1} = W^{T}W$, where $W$ is merely the square root of $M^{-1}$, which can be found by using a singular value decomposition.

Note that when attempting to compute the Mahalanobis distance on a dataset with highly correlated variables (e.g. spectral variables), the variance-covariance matrix may result in a singular matrix which cannot be inverted, and therefore the distance cannot be computed. This is also the case when the number of observations in the dataset is smaller than the number of variables.

For the computation of the Mahalanobis distance, the mentioned method is used.
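The relation between the two distances can be checked numerically with base R. The sketch below is only an illustration of the formulas above on simulated data: it builds $W$ as the symmetric inverse square root of the covariance matrix through an eigendecomposition (one of several valid factorizations, chosen here for brevity), transforms the data, and compares the result with stats::mahalanobis().

```r
set.seed(3)
X <- matrix(rnorm(40 * 4), nrow = 40)
p <- ncol(X)

# Scaled Euclidean dissimilarity: M is the identity matrix
d_euclid <- sqrt(as.matrix(dist(X))^2 / p)

# Mahalanobis dissimilarity: factorize the inverse covariance as t(W) %*% W
M <- cov(X)
e <- eigen(M, symmetric = TRUE)
W <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)

# Euclidean distances on the transformed data equal Mahalanobis distances
Z <- X %*% W
d_mahal <- sqrt(as.matrix(dist(Z))^2 / p)

# Reference value for the pair (1, 2); mahalanobis() returns squared distances
d_ref <- sqrt(mahalanobis(X[1, ], center = X[2, ], cov = M) / p)
abs(d_mahal[1, 2] - d_ref)
```

The last line should print a value at the level of floating point error, which confirms that the transformation reproduces the Mahalanobis distance.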

The cosine dissimilarity $c$ between two observations $x_i$ and $x_j$ is computed as follows:

\[c(x_i, x_j) = \cos^{-1}{\frac{\sum_{k=1}^{p}x_{i,k} x_{j,k}}{\sqrt{\sum_{k=1}^{p} x_{i,k}^{2}} \sqrt{\sum_{k=1}^{p} x_{j,k}^{2}}}}\]

where $p$ is the number of variables of the observations. The function does not accept input data containing missing values. NOTE: The computed distances are divided by the number of variables/columns in Xr.
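For a single pair of observations, the cosine dissimilarity formula above can be written in base R as follows. The helper `cosine_dissimilarity` is hypothetical and applies no additional scaling; the clamping step is added here only to guard against floating point values marginally outside [-1, 1].

```r
# Angle (in radians) between two observation vectors
cosine_dissimilarity <- function(xi, xj) {
  cos_sim <- sum(xi * xj) / (sqrt(sum(xi^2)) * sqrt(sum(xj^2)))
  acos(min(max(cos_sim, -1), 1))  # clamp to the valid domain of acos()
}

d_parallel <- cosine_dissimilarity(c(1, 2, 3), c(2, 4, 6))  # same direction
d_orthogonal <- cosine_dissimilarity(c(1, 0), c(0, 1))      # right angle
```

Two observations that differ only by a multiplicative factor have a dissimilarity of (numerically) zero, which is why this measure is insensitive to overall intensity differences between spectra.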

Value

a matrix of the computed dissimilarities.

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

Examples

library(prospectr)
data(NIRsoil)
Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]
# Euclidean distances between all the observations in Xr
ed <- f_diss(Xr = Xr, diss_method = "euclid")
# Equivalence with the dist() function of R base
ed_dist <- (as.matrix(dist(Xr))^2 / ncol(Xr))^0.5
round(ed_dist - ed, 5)
# Comparing the computational time
iter <- 20
tm <- proc.time()
for (i in 1:iter) {
  f_diss(Xr)
}
f_diss_time <- proc.time() - tm
tm_2 <- proc.time()
for (i in 1:iter) {
  dist(Xr)
}
dist_time <- proc.time() - tm_2
f_diss_time
dist_time
# Euclidean distances between observations in Xr and observations in Xu
ed_xr_xu <- f_diss(Xr, Xu)
# Mahalanobis distance computed on the first 20 spectral variables
md_xr_xu <- f_diss(Xr[, 1:20], Xu[, 1:20], "mahalanobis")
# Cosine dissimilarity matrix
cdiss_xr_xu <- f_diss(Xr, Xu, "cosine")

A fast distance algorithm for two matrices written in C++

Description

Computes distances between two data matrices using "euclid", "cor" or "cosine".

Usage

fast_diss(X, Y, method)

Arguments

X

a matrix

Y

a matrix

method

a string with possible values "euclid", "cor", "cosine"

Value

a distance matrix

Author(s)

Antoine Stevens and Leonardo Ramirez-Lopez

A fast algorithm of (squared) Euclidean cross-distance for vectors written in C++

Description

A fast (parallel for linux) algorithm of (squared) Euclidean cross-distance for vectors written in C++

Usage

fast_diss_vector(X)

Arguments

X

a vector.

Details

used internally in ortho_projection

Value

a vector of distances (lower triangle of the distance matrix, stored by column)
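The output layout described above (lower triangle of the distance matrix, stored by column) is the same layout used by stats::dist(). A base R equivalent of the squared Euclidean cross-distance for a small vector, shown only to illustrate the ordering of the returned values, is:

```r
x <- c(1, 3, 6)

# dist() stores the lower triangle by column: pairs (2,1), (3,1), (3,2)
sq_cross_dist <- as.vector(dist(x))^2
sq_cross_dist
# 4 25 9
```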

Author(s)

Antoine Stevens

Local multivariate regression

Description

internal

Usage

fit_and_predict(
  x,
  y,
  pred_method,
  scale = FALSE,
  weights = NULL,
  newdata,
  pls_c = NULL,
  CV = FALSE,
  tune = FALSE,
  number = 10,
  p = 0.75,
  group = NULL,
  noise_variance = 0.001,
  range_prediction_limits = TRUE,
  pls_max_iter = 1,
  pls_tol = 1e-06,
  modified = FALSE,
  seed = NULL
)

format internal messages

Description

internal

Usage

format_xr_xu_indices(xr_xu_names)

Arguments

xr_xu_names

the names of Xr and Xu

Cross validation for Gaussian process regression

Description

internal

Usage

gaussian_pr_cv(
  x,
  y,
  scale,
  weights = NULL,
  p = 0.75,
  number = 10,
  group = NULL,
  noise_variance = 0.001,
  retrieve = c("final_model", "none"),
  seed = NULL
)

Gaussian process regression with linear kernel (gaussian_process)

Description

Carries out a Gaussian process regression with a linear kernel (dot product). For internal use only!

Usage

gaussian_process(X, Y, noisev, scale)

Arguments

X

a matrix of predictor variables

Y

a matrix with a single response variable

noisev

a value indicating the variance of the noise for Gaussian process regression. Default is 0.001.

scale

a logical indicating whether both the predictors and the response variable must be scaled to zero mean and unit variance.

Value

a list containing the following elements:

b: the regression coefficients.
Xz: the (final transformed) matrix of predictor variables.
alpha: the alpha matrix.
is.scaled: logical indicating whether both the predictors and response variable were scaled to zero mean and unit variance.
Xcenter: if matrix of predictors was scaled, the centering vector used for X.
Xscale: if matrix of predictors was scaled, the scaling vector used for X.
Ycenter: if matrix of predictors was scaled, the centering vector used for Y.
Yscale: if matrix of predictors was scaled, the scaling vector used for Y.
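To relate the b and alpha elements listed above, here is a minimal base R sketch of the textbook Gaussian process regression with a dot product kernel. It is an assumption that the compiled function follows these same steps; the helper `gp_linear` is hypothetical. The check at the end uses the known identity that, for a linear kernel, the resulting coefficients coincide with ridge regression coefficients with penalty equal to the noise variance.

```r
# Hypothetical helper: GP regression with a linear (dot product) kernel
gp_linear <- function(X, y, noisev = 0.001) {
  K <- X %*% t(X)                               # kernel matrix
  alpha <- solve(K + diag(noisev, nrow(X)), y)  # (K + noisev * I)^-1 y
  b <- crossprod(X, alpha)                      # coefficients for the columns of X
  list(alpha = alpha, b = b)
}

set.seed(11)
X <- scale(matrix(rnorm(30 * 5), nrow = 30))
y <- as.vector(scale(X %*% c(1, -2, 0.5, 0, 3) + rnorm(30, sd = 0.1)))

fit <- gp_linear(X, y, noisev = 0.001)

# Ridge regression with the same penalty gives the same coefficients
ridge_b <- solve(crossprod(X) + diag(0.001, ncol(X)), crossprod(X, y))
max(abs(fit$b - ridge_b))

# Predictions for new (identically scaled) observations: new_X %*% fit$b
```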

Author(s)

Leonardo Ramirez-Lopez

Internal Cpp function for performing leave-group-out cross-validations for Gaussian process

Description

For internal use only!

Usage

gaussian_process_cv(X, Y, mindices, pindices, noisev = 0.001,
  scale = TRUE, statistics = TRUE)

Arguments

X

a matrix of predictor variables.

Y

a matrix of a single response variable.

mindices

a matrix with n rows and m columns where m is equivalent to the number of resampling iterations. The elements of each column indicate the indices of the observations to be used for modeling at each iteration.

pindices

a matrix with k rows and m columns where m is equivalent to the number of resampling iterations. The elements of each column indicate the indices of the observations to be used for predicting at each iteration.

scale

a logical indicating whether both the predictors and the response variable must be scaled to zero mean and unit variance.

statistics

a logical value indicating whether the precision and accuracy statistics are to be returned, otherwise the predictions for each validation segment are retrieved.

Value

a list containing the following one-row matrices:

rmse.seg: the RMSEs.
st.rmse.seg: the standardized RMSEs.
rsq.seg: the coefficients of determination.
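The layout of mindices and pindices (one column per resampling iteration) can be illustrated with a small base R leave-group-out loop. This is a sketch under stated assumptions: the data are simulated, the model is a linear-kernel Gaussian process fitted without scaling, and the coefficient of determination is taken as the squared correlation between observed and predicted values. The standardized RMSE is omitted because its definition is not given here.

```r
set.seed(5)
n <- 40
X <- matrix(rnorm(n * 4), nrow = n)
y <- as.vector(X %*% c(2, -1, 0.5, 1) + rnorm(n, sd = 0.2))

m <- 10                      # number of resampling iterations
n_train <- floor(0.75 * n)   # 75 % of the observations used for modeling

# One column per iteration: modeling indices and the complementary prediction indices
mindices <- replicate(m, sample(n, n_train))
pindices <- apply(mindices, 2, function(idx) setdiff(seq_len(n), idx))

cv_stats <- vapply(seq_len(m), function(i) {
  tr <- mindices[, i]
  te <- pindices[, i]
  alpha <- solve(tcrossprod(X[tr, ]) + diag(0.001, length(tr)), y[tr])
  b <- crossprod(X[tr, ], alpha)
  pred <- as.vector(X[te, ] %*% b)
  c(rmse = sqrt(mean((y[te] - pred)^2)), rsq = cor(y[te], pred)^2)
}, c(rmse = 0, rsq = 0))

rowMeans(cv_stats)  # average RMSE and R2 across the validation segments
```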

Author(s)

Leonardo Ramirez-Lopez

Function for identifying the column in a matrix with the largest standard deviation

Description

Identifies the column with the largest standard deviation. For internal use only!

Usage

get_col_largest_sd(X)

Arguments

X

a matrix.

Value

a value indicating the index of the column with the largest standard deviation.

Author(s)

Leonardo Ramirez-Lopez

Standard deviation of columns

Description

For internal use only!

Usage

get_col_sds(x)

Function for computing the mean of each column in a matrix

Description

Computes the mean of each column in a matrix. For internal use only!

Usage

get_column_means(X)

Arguments

X

a matrix.

Value

a vector of mean values.

Author(s)

Leonardo Ramirez-Lopez

Function for computing the standard deviation of each column in a matrix

Description

Computes the standard deviation of each column in a matrix. For internal use only!

Usage

get_column_sds(X)

Arguments

X

a matrix.

Value

a vector of standard deviation values.

Author(s)

Leonardo Ramirez-Lopez

Function for computing sum of each column in a matrix

Description

Computes the sum of each column in a matrix. For internal use only!

Usage

get_column_sums(X)

Arguments

X

a matrix.

Value

a vector of column sums.
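For reference, the quantities returned by these internal column helpers have direct base R counterparts. The snippet below is not the compiled code, only an equivalent computation on a small matrix that can be used to sanity-check results:

```r
# Columns: (1, 2, 3, 4), (10, 20, 30, 40), (5, 5, 5, 5)
X <- matrix(c(1, 2, 3, 4,
              10, 20, 30, 40,
              5, 5, 5, 5), ncol = 3)

col_means <- colMeans(X)              # mean of each column
col_sds <- apply(X, 2, sd)            # standard deviation of each column
col_sums <- colSums(X)                # sum of each column
largest_sd_col <- which.max(col_sds)  # index of the column with the largest sd
```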

Author(s)

Leonardo Ramirez-Lopez

get the evaluation results for categorical data

Description

internal

Usage

get_eval_categorical(y, indices_closest)

get the evaluation results for continuous data

Description

internal

Usage

get_eval_continuous(y, indices_closest)

A function to obtain the local neighbors based on dissimilarity matrices from orthogonal projections.

Description

internal function. This function is used to obtain the local neighbors based on dissimilarity matrices from orthogonal projections. These neighbors are obtained from an orthogonal projection on a set of precomputed neighbors. This function is used internally by the mbl function. ortho_diss(, .local = TRUE) operates in the same way; however, for mbl, it is more efficient to do the re-search of the neighbors inside its main for loop.

Usage

get_ith_local_neighbors(
  ith_xr,
  ith_xu,
  ith_yr,
  ith_yu = NULL,
  diss_usage = "none",
  ith_neig_indices,
  k = NULL,
  k_diss = NULL,
  k_range = NULL,
  spike = NULL,
  diss_method,
  pc_selection,
  ith_group = NULL,
  center,
  scale,
  ...
)

Arguments

ith_xr

the set of neighbors of a Xu observation found in Xr

ith_xu

the Xu observation

ith_yr

the response values of the set of neighbors of the Xu observation found in Xr

ith_yu

the response value of the Xu observation

diss_usage

a character string indicating if the dissimilarity data will be used as predictors ("predictors") or not ("none").

ith_neig_indices

a vector of the original indices of the Xr neighbors.

k

the number of nearest neighbors to select from the already identified neighbors

k_diss

the distance threshold to select the neighbors from the already identified neighbors

k_range

a min and max number of allowed neighbors when k_diss is used

spike

a vector with the indices of the observations forced to be retained as neighbors. They have to be present in all the neighborhoods and at the top of neighbor_indices.

diss_method

the ortho_diss() method

pc_selection

the pc_selection argument as in ortho_diss()

ith_group

the vector containing the group labels of ith_xr.

center

center the data in the local diss computation?

scale

scale the data in the local diss computation?

Value

a list:

ith_xr: the new Xr data of the neighbors for the ith observation (if diss_usage = "predictors", this data is combined with the local dissimilarity scores of the neighbors of Xu)
ith_yr: the new Yr data of the neighbors for the ith observation
ith_xu: the ith Xu observation (if diss_usage = "predictors", this data is combined with the local dissimilarity scores to its Xr neighbors)
ith_yu: the ith Yu observation
ith_neigh_diss: the new dissimilarity scores of the neighbors for the ith observation
ith_group: the group labels for the new ith_xr
n_k: the number of neighbors
ith_components: the number of components used

Author(s)

Leonardo Ramirez-Lopez

Internal Cpp function for computing the weights of the PLS components necessary for weighted average PLS

Description

For internal use only!

Usage

get_local_pls_weights(projection_mat,
           xloadings,
           coefficients,
           new_x,
           min_component,
           max_component,
           scale,
           Xcenter,
           Xscale)

Arguments

projection_mat

the projection matrix generated by the opls function.

xloadings

coefficients

the matrix of regression coefficients.

new_x

a matrix of one new spectrum to be predicted.

min_component

an integer indicating the minimum number of pls components.

max_component

an integer indicating the maximum number of pls components.

scale

a logical indicating whether the matrix of predictors used to create the regression model was scaled.

Xcenter

a matrix of one row with the values that must be used for centering newdata.

Xscale

if scale = TRUE a matrix of one row with the values that must be used for scaling newdata.

Value

a matrix of one row with the weights for each component between the max. and min. specified.

Author(s)

Leonardo Ramirez-Lopez

A function to get the neighbor information

Description

This function gathers information of all neighborhoods of the Xu observations found in Xr. This information is required during local regressions.

Usage

get_neighbor_info(
  Xr,
  Xu,
  diss_method,
  Yr = NULL,
  k = NULL,
  k_diss = NULL,
  k_range = NULL,
  spike = NULL,
  pc_selection,
  return_dissimilarity,
  center,
  scale,
  gh,
  diss_usage,
  allow_parallel = FALSE,
  ...
)

Details

For local pca and pls distances, the local dissimilarity matrices are not computed, as it is cheaper to compute them during the local regressions. Instead, the global distances (required for later local dissimilarity matrix computation) are output.

Extract predictions from an object of class `mbl`

Description

Extract predictions from an object of class mbl

Usage

get_predictions(object)

Arguments

object

an object of class mbl as returned by mbl

Value

a data.table of predicted values according to either k or k_dist

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

A function to assign values to sample distribution strata

Description

for internal use only! This function takes a continuous variable, creates n strata based on its distribution and assigns the corresponding strata to every value.

Usage

get_sample_strata(y, n = NULL, probs = NULL)

Arguments

y

a matrix of one column with the response variable.

n

the number of strata.

Value

a data table with the input y and the corresponding strata to every value.
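One plausible way to build such distribution-based strata in base R is to cut the variable at its empirical quantiles. The sketch below is only an assumption about the general approach; the helper `assign_strata` is hypothetical and returns a data.frame rather than a data.table to stay within base R.

```r
# Hypothetical helper: assign each value of y to one of n quantile-based strata
assign_strata <- function(y, n) {
  breaks <- quantile(y, probs = seq(0, 1, length.out = n + 1))
  data.frame(
    y = y,
    strata = cut(y, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  )
}

set.seed(9)
y <- rnorm(100)
strata_tbl <- assign_strata(y, n = 4)
table(strata_tbl$strata)  # number of observations per stratum
```

With a continuous response and quantile-based breaks, the strata hold roughly the same number of observations, which is what makes them useful for stratified calibration/validation sampling.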

A function for stratified calibration/validation sampling

Description

for internal use only! This function selects samples based on provided strata.

Usage

get_samples_from_strata(
  y,
  original_order,
  strata,
  samples_per_strata,
  sampling_for = c("calibration", "validation"),
  replacement = FALSE
)

Arguments

original_order

a matrix of one column with the response variable.

strata

the number of strata.

sampling_for

sampling to select the calibration samples ("calibration") or sampling to select the validation samples ("validation").

replacement

logical indicating if sampling with replacement must be done.

Value

a list with the indices of the calibration and validation samples.

Internal function for computing the weights of the PLS components necessary for weighted average PLS

Description

internal

Usage

get_wapls_weights(pls_model, original_x, type = "w1", new_x = NULL, pls_c)

Arguments

pls_model

either an object returned by the pls_cv function or an object as returned by the opls_get_basics function which contains a pls model.

original_x

the original spectral matrix which was used for calibrating the pls model.

type

type of weight to be computed. The only available option (for the moment) is "w1". See details on the mbl function where it is explained how "w1" is computed within the "wapls" regression.

new_x

a vector of a new spectral observation. When "w1" is selected, new_x must be specified.

pls_c

a vector of length 2 which contains both the minimum and maximum number of PLS components for which the weights must be computed.

Value

get_wapls_weights returns a vector of weights for each PLS component specified

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

Computes the weights for pls regressions

Description

This is an internal function that computes the weights required for obtaining each vector of pls scores. Implementation is done in C++ for improved performance.

Usage

get_weights(X, Y, algorithm = "pls", xls_min_w = 3L, xls_max_w = 15L)

Arguments

X

a numeric matrix of spectral data.

Y

a matrix of one column with the response variable.

algorithm

a character string indicating what method to use. Options are: 'pls' for pls (using covariance between X and Y), 'mpls' for modified pls (using correlation between X and Y as in Shenk and Westerhaus, 1991; Westerhaus 2014) or 'xls' for extended pls (as implemented in BUCHI NIRWise PLUS software).

xls_min_w

an integer indicating the minimum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 3 (as in BUCHI NIRWise PLUS software).

xls_max_w

an integer indicating the maximum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 15 (as in BUCHI NIRWise PLUS software).

Value

a matrix of one column containing the weights.

Author(s)

Leonardo Ramirez-Lopez and Claudio Orellano

References

Shenk, J. S., & Westerhaus, M. O. (1991). Populations structuring of near infrared spectra and modified partial least squares regression. Crop Science, 31(6), 1548-1555.

Westerhaus, M. (2014). Eastern Analytical Symposium Award for outstanding achievements in near infrared spectroscopy: my contributions to near infrared spectroscopy. NIR news, 25(8), 16-20.

An iterator for local prediction data in mbl

Description

Internal function. It collects only the data necessary to execute a local prediction for the mbl function based on a list of neighbors. Not valid for local dissimilarity (e.g. for ortho_diss(...., .local = TRUE)).

Usage

ith_mbl_neighbor(  Xr,  Xu = NULL,  Yr,  Yu = NULL,  diss_usage = "none",  neighbor_indices,  neighbor_diss = NULL,  diss_xr_xr = NULL,  group = NULL)

Arguments

Xr

the Xr matrix in mbl.

Xu

the Xu matrix in mbl. Default NULL. If not provided, the function will iterate for each {Yr, Xr} to get the respective neighbors.

Yr

the Yr matrix in mbl.

Yu

the Yu matrix in mbl. Default NULL.

diss_usage

a character string indicating if the dissimilarity data will be used as predictors ("predictors") or not ("none").

neighbor_indices

a matrix with the indices of neighbors of every Xu found in Xr.

neighbor_diss

a matrix with the dissimilarity scores for the neighbors of every Xu found in Xr. This matrix is organized in the same way as neighbor_indices.

diss_xr_xr

a dissimilarity matrix between samples in Xr.

group

a factor representing the group labels of Xr.

Details

isubset will look at the order of knn in each col of D and re-organize the rows of x accordingly.

Value

an object of class iterator giving the following list:

ith_xr: the Xr data of the neighbors for the ith observation (if diss_usage = "predictors", this data is combined with the local dissimilarity scores of the neighbors of Xu (or Xr if Xu was not provided)).
ith_yr: the Yr data of the neighbors for the ith observation.
ith_xu: the ith Xu observation (or Xr if Xu was not provided). If diss_usage = "predictors", this data is combined with the local dissimilarity scores to its Xr neighbors.
ith_yu: the ith Yu observation (or Yr observation if Xu was not provided).
ith_neigh_diss: the dissimilarity scores of the neighbors for the ith observation.
ith_group: the group labels for ith_xr.
n_k: the number of neighbors.
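The idea of collecting only the local data for the ith observation can be sketched with plain matrix indexing. This is not the iterator itself: the helper get_ith_neighborhood is hypothetical, and the layout of neighbor_indices (one column per Xu observation, rows ordered from nearest to farthest) is an assumption.

```r
# Toy data: 6 reference observations, 2 observations to predict
set.seed(42)
Xr <- matrix(rnorm(6 * 4), nrow = 6)
Yr <- matrix(rnorm(6), ncol = 1)
Xu <- matrix(rnorm(2 * 4), nrow = 2)

# Assumed layout: one column per Xu observation, rows ordered from the
# nearest to the farthest neighbor (indices refer to rows of Xr)
neighbor_indices <- cbind(c(2, 5, 1), c(6, 3, 4))

# Hypothetical helper (not part of resemble) that gathers the local data
get_ith_neighborhood <- function(i) {
  idx <- neighbor_indices[, i]
  list(
    ith_xr = Xr[idx, , drop = FALSE],
    ith_yr = Yr[idx, , drop = FALSE],
    ith_xu = Xu[i, , drop = FALSE],
    n_k = length(idx)
  )
}

local_1 <- get_ith_neighborhood(1)
```

Only the rows needed for one local model are copied, which keeps the memory footprint of each iteration small.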

Author(s)

Leonardo Ramirez-Lopez

iterator for nearest neighbor subsets

Description

internal

Usage

ith_subsets_ortho_diss(x, xu = NULL, y, kindx, na_rm = FALSE)

Arguments

x

a reference matrix

xu

a second matrix

y

a matrix of side information

kindx

a matrix of nearest neighbor indices

na_rm

logical indicating whether NAs must be removed.

Local fit functions

Description

These functions define the way in which each local fit/prediction is done within each iteration in the mbl function.

Usage

local_fit_pls(pls_c, modified = FALSE, max_iter = 100, tol = 1e-6)

local_fit_wapls(min_pls_c, max_pls_c, modified = FALSE,
                max_iter = 100, tol = 1e-6)

local_fit_gpr(noise_variance = 0.001)

Arguments

pls_c

an integer indicating the number of pls components to be used in the local regressions when the partial least squares (local_fit_pls) method is used.

modified

a logical indicating whether the modified version of the pls algorithm (Shenk and Westerhaus, 1991 and Westerhaus, 2014) is to be used. Default is FALSE.

max_iter

an integer indicating the maximum number of iterations in case tol is not reached. Default is 100.

tol

a numeric value indicating the convergence tolerance for calculating the scores. Default is 1e-6.

min_pls_c

an integer indicating the minimum number of pls components to be used in the local regressions when the weighted average partial least squares (local_fit_wapls) method is used. See details.

max_pls_c

an integer indicating the maximum number of pls components to be used in the local regressions when the weighted average partial least squares (local_fit_wapls) method is used. See details.

noise_variance

a numeric value indicating the variance of the noise for Gaussian process local regressions (local_fit_gpr). Default is 0.001.

Details

These functions are used to indicate how to fit the regression models within the mbl function.

There are three possible options for performing these regressions:

Partial least squares (pls, local_fit_pls): It uses the orthogonal scores (non-linear iterative partial least squares, nipals) algorithm. The only parameter which needs to be optimized is the number of pls components.
Weighted average pls (local_fit_wapls): This method was developed by Shenk et al. (1997) and it is used as the regression method in the widely known LOCAL algorithm. It uses multiple models generated by multiple pls components (i.e. between a minimum and a maximum number of pls components). At each local partition the final predicted value is an ensemble (weighted average) of all the predicted values generated by the multiple pls models. The weight for each component is calculated as follows:
\[w_{j} = \frac{1}{s_{1:j}\times g_{j}}\]
where $s_{1:j}$ is the root mean square of the spectral reconstruction error of the unknown (or target) observation(s) when a total of $j$ pls components are used and $g_{j}$ is the root mean square of the squared regression coefficients corresponding to the $j$th pls component (see Shenk et al., 1997 for more details).
Gaussian process with dot product covariance (local_fit_gpr): Gaussian process regression is a probabilistic and non-parametric Bayesian method. It is commonly described as a collection of random variables which have a joint Gaussian distribution and it is characterized by both a mean and a covariance function (Rasmussen and Williams, 2006). The covariance function used in the implemented method is the dot product. The only parameter to be taken into account in this method is the noise. In this method, the process for predicting the response variable of a new sample ($y_u$) from its predictor variables ($x_u$) is carried out first by computing a prediction vector ($A$). It is derived from a set of reference/training observations containing both a response vector ($Y$) and predictors ($X$) as follows:
\[A = (X X^{T} + \sigma^2 I)^{-1} Y\]
where $\sigma^2$ denotes the variance of the noise and $I$ the identity matrix (with dimensions equal to the number of observations in $X$). The prediction of $y_{u}$ is then done as follows:
\[\hat{y}_{u} = (x_{u}X^{T}) A\]
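The Gaussian process option can be worked through on simulated data. The sketch below is a plain base R illustration, not the package's C++ implementation: the data are simulated, no centring or scaling is applied, and the prediction step uses the dot-product covariance between the new observation and the rows of the reference matrix ($x_{u}X^{T}$).

```r
set.seed(10)
n <- 30
p <- 5
X <- matrix(rnorm(n * p), nrow = n)
beta <- c(1, -0.5, 0.25, 0, 2)
Y <- X %*% beta + rnorm(n, sd = 0.01)
x_u <- matrix(rnorm(p), nrow = 1)

noise_variance <- 0.001

# Prediction vector: A = (X X^T + sigma^2 I)^{-1} Y
K <- tcrossprod(X)
A <- solve(K + noise_variance * diag(n), Y)

# Dot-product covariance between the new observation and the reference set
k_u <- x_u %*% t(X)
y_u_hat <- drop(k_u %*% A)

# Value that the simulated (noise-free) model assigns to x_u, for comparison
y_u_true <- drop(x_u %*% beta)
```

With a dot-product covariance, this prediction coincides with a ridge regression whose penalty equals the noise variance, which is why noise_variance is the only parameter to choose.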

The modified argument in the pls methods (local_fit_pls() and local_fit_wapls()) is used to indicate if a modified version of the pls algorithm (modified pls or mpls) is to be used. The modified pls was proposed by Shenk and Westerhaus (1991, see also Westerhaus, 2014) and it differs from the standard pls method in the way the weights of the predictors (used to compute the matrix of scores) are obtained. While pls uses the covariance between response(s) and predictors (and later their deflated versions corresponding to each pls component iteration) to obtain these weights, the modified pls uses the correlation as weights. The authors indicate that by using correlation, a larger portion of the response variable(s) can be explained.
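The difference between the two weighting schemes can be seen for the first component with a few lines of base R. This is a sketch on simulated data, not the package's C++ routine; normalising the weights to unit length is an assumption, and the deflation steps for later components are omitted.

```r
set.seed(3)
X <- matrix(rnorm(50 * 6), nrow = 50)
# Give the variables different scales so that the two weightings differ
X <- sweep(X, 2, c(1, 2, 5, 10, 0.5, 0.1), `*`)
Y <- matrix(X[, 1] + 0.2 * X[, 3] + rnorm(50, sd = 0.1), ncol = 1)

Xc <- scale(X, center = TRUE, scale = FALSE)
Yc <- scale(Y, center = TRUE, scale = FALSE)

# Standard pls: weights proportional to the covariance between X and Y
w_pls <- drop(cov(Xc, Yc))
w_pls <- w_pls / sqrt(sum(w_pls^2))

# Modified pls: weights proportional to the correlation between X and Y
w_mpls <- drop(cor(Xc, Yc))
w_mpls <- w_mpls / sqrt(sum(w_mpls^2))

# Scores of the first component under each weighting
t_pls <- Xc %*% w_pls
t_mpls <- Xc %*% w_mpls
```

Because correlation removes the effect of each variable's scale, variables with small variance but a strong relationship with the response get more weight under the modified version.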

Value

An object of class local_fit mirroring the input arguments.

Author(s)

Leonardo Ramirez-Lopez

References

Shenk, J. S., & Westerhaus, M. O. 1991. Populations structuring of near infrared spectra and modified partial least squares regression. Crop Science, 31(6), 1548-1555.

Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.

Rasmussen, C.E., Williams, C.K. Gaussian Processes for Machine Learning. Massachusetts Institute of Technology: MIT-Press, 2006.

Westerhaus, M. 2014. Eastern Analytical Symposium Award for outstanding achievements in near infrared spectroscopy: my contributions to near infrared spectroscopy. NIR news, 25(8), 16-20.

Examples

local_fit_wapls(min_pls_c = 3, max_pls_c = 12)

local ortho dissimilarity matrices initialized by a global dissimilarity matrix

Description

internal

Usage

local_ortho_diss(  k_index_matrix,  Xr,  Yr,  Xu,  diss_method,  pc_selection,  center,  scale,  allow_parallel,  ...)

Arguments

k_index_matrix

a matrix of nearest neighbor indices

Xr

argument passed to ortho_projection

Yr

argument passed to ortho_projection

Xu

argument passed to ortho_projection

diss_method

argument passed to ortho_projection

pc_selection

argument passed to ortho_projection

center

argument passed to ortho_projection

scale

argument passed to ortho_projection

A function for memory-based learning (mbl)

Description

This function is implemented for memory-based learning (a.k.a. instance-based learning or local regression) which is a non-linear lazy learning approach for predicting a given response variable from a set of predictor variables. For each observation in a prediction set, a specific local regression is carried out based on a subset of similar observations (nearest neighbors) selected from a reference set. The local model is then used to predict the response value of the target (prediction) observation. Therefore this function does not yield a global regression model.

Usage

mbl(Xr, Yr, Xu, Yu = NULL, k, k_diss, k_range, spike = NULL,
    method = local_fit_wapls(min_pls_c = 3, max_pls_c = min(dim(Xr), 15)),
    diss_method = "pca", diss_usage = "predictors", gh = TRUE,
    pc_selection = list(method = "opc", value = min(dim(Xr), 40)),
    control = mbl_control(), group = NULL, center = TRUE, scale = FALSE,
    verbose = TRUE, documentation = character(), seed = NULL, ...)

Arguments

Xr

a matrix of predictor variables of the reference data (observations in rows and variables in columns).

Yr

a numeric matrix of one column containing the values of the response variable corresponding to the reference data.

Xu

a matrix of predictor variables of the data to be predicted (observations in rows and variables in columns).

Yu

an optional matrix of one column containing the values of the response variable corresponding to the data to be predicted. Default is NULL.

k

a vector of integers specifying the sequence of k-nearest neighbors to be tested. Either k or k_diss must be specified. This vector will be automatically sorted into ascending order. If non-integer numbers are passed, they will be coerced to the next upper integers.

k_diss

a numeric vector specifying the sequence of dissimilarity thresholds to be tested for the selection of the nearest neighbors found in Xr around each observation in Xu. These thresholds depend on the corresponding dissimilarity measure specified in the object passed to control. Either k or k_diss must be specified.

k_range

an integer vector of length 2 which specifies the minimum (first value) and the maximum (second value) number of neighbors to be retained when k_diss is given.
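The interplay between a dissimilarity threshold and k_range can be sketched as follows. The helper select_neighbors is hypothetical and only illustrates the logic described above; in particular, treating the threshold as inclusive (<=) is an assumption.

```r
# Hypothetical helper (not part of resemble): pick the neighbors of one
# observation from its dissimilarities to all reference observations
select_neighbors <- function(d, k_diss, k_range) {
  ord <- order(d)
  # number of reference observations within the dissimilarity threshold
  k_original <- sum(d <= k_diss)
  # bound the neighborhood size to [k_range[1], k_range[2]]
  k <- min(max(k_original, k_range[1]), k_range[2])
  list(indices = ord[seq_len(k)], k_original = k_original, k = k)
}

d <- c(0.9, 0.1, 0.4, 0.05, 0.7, 0.3, 1.2, 0.2)
few <- select_neighbors(d, k_diss = 0.15, k_range = c(3, 6))
many <- select_neighbors(d, k_diss = 1.0, k_range = c(3, 6))
```

In the first call only two observations fall within the threshold, so the neighborhood is enlarged to the minimum of 3; in the second call seven observations qualify and the neighborhood is trimmed to the maximum of 6.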

spike

an integer vector (with positive and/or negative values) indicating the indices of observations in Xr that must be either forced into or avoided in the neighborhoods of every Xu observation. Default is NULL (i.e. no observations are forced or avoided). Note that this argument is not intended for increasing or reducing the neighborhood size, which is only controlled by k or k_diss and k_range. By forcing observations into the neighborhood, some of the farthest observations may be forced out of the neighborhood. In contrast, by avoiding observations in the neighborhood, some of the farthest observations may be included into the neighborhood. See details.

method

an object of class local_fit which indicates the type of regression to conduct at each local segment as well as additional parameters affecting this regression. See local_fit function.

diss_method

a character string indicating the spectral dissimilarity metric to be used in the selection of the nearest neighbors of each observation. Options are:

"pca" (Default): Mahalanobis distance computed on the matrix of scores of a Principal Component (PC) projection of Xr and Xu. PC projection is done using the singular value decomposition (SVD) algorithm. See ortho_diss function.
"pca.nipals": Mahalanobis distance computed on the matrix of scores of a Principal Component (PC) projection of Xr and Xu. PC projection is done using the non-linear iterative partial least squares (nipals) algorithm. See ortho_diss function.
"pls": Mahalanobis distance computed on the matrix of scores of a partial least squares projection of Xr and Xu. In this case, Yr is always required. See ortho_diss function.
"cor": correlation coefficient between observations. See cor_diss function.
"euclid": Euclidean distance between observations. See f_diss function.
"cosine": Cosine distance between observations. See f_diss function.
"sid": spectral information divergence between observations. See sid function.

Alternatively, a matrix of dissimilarities can also be passed to this argument. This matrix is supposed to be a user-defined matrix representing the dissimilarities between observations in Xr and Xu. When diss_usage = "predictors", this matrix must be square (derived from a matrix of the form rbind(Xr, Xu)) with zeros on the diagonal (since the dissimilarity between an object and itself must be 0). On the other hand, if diss_usage is set to either "weights" or "none", it must be a matrix representing the dissimilarity of each observation in Xu to each observation in Xr. The number of columns of the input matrix must be equal to the number of rows in Xu and the number of rows equal to the number of rows in Xr.
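The two shapes expected for a user-defined dissimilarity matrix can be illustrated with base R. The Euclidean distance from dist() is used here only as a stand-in for whatever dissimilarity the user prefers, and the object names are hypothetical.

```r
set.seed(7)
Xr <- matrix(rnorm(10 * 5), nrow = 10)
Xu <- matrix(rnorm(3 * 5), nrow = 3)

# Case diss_usage = "predictors": square matrix over rbind(Xr, Xu),
# with zeros on the diagonal
d_square <- as.matrix(dist(rbind(Xr, Xu)))

# Case diss_usage = "weights" or "none": rows for Xr, columns for Xu
d_xr_xu <- d_square[seq_len(nrow(Xr)), nrow(Xr) + seq_len(nrow(Xu)), drop = FALSE]

dim(d_square)
dim(d_xr_xu)
```

The rectangular matrix is simply the block of the square matrix that relates the reference rows to the rows to be predicted.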

diss_usage

a character string specifying how the dissimilarity information shall be used. The possible options are: "predictors", "weights" and "none" (see details below). Default is "predictors".

gh

a logical indicating if the global Mahalanobis distance (in the pls score space) between each observation and the pls mean (centre) must be computed. This metric is known as the GH distance in the literature. Note that this computation is based on the number of pls components determined by using the pc_selection argument. See details.

pc_selection

a list of length 2 used for the computation of GH (if gh = TRUE) as well as in the computation of the dissimilarity methods based on ortho_diss (i.e. when diss_method is one of: "pca", "pca.nipals" or "pls"). This argument is used for optimizing the number of components (principal components or pls factors) to be retained for dissimilarity/distance computation purposes only (i.e. not for regression). This list must contain two elements in the following order: method (a character indicating the method for selecting the number of components) and value (a numerical value that complements the selected method). The methods available are:

"opc": optimized principal component selection based on Ramirez-Lopez et al. (2013a, 2013b). The optimal number of components (of a set of observations) is the one for which its distance matrix minimizes the differences between the Yr value of each observation and the Yr value of its closest observation. In this case value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the maximum number of principal components to be tested. See the ortho_projection function for more details.
"cumvar": selection of the principal components based on a given cumulative amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of cumulative variance that the combination of retained components should explain.
"var": selection of the principal components based on a given amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of variance that a single component should explain in order to be retained.
"manual": for manually specifying a fixed number of principal components. In this case, value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the number of components to be retained.

The list list(method = "opc", value = min(dim(Xr), 40)) is the default. Optionally, the pc_selection argument admits "opc" or "cumvar" or "var" or "manual" as a single character string. In such a case the default "value" when either "opc" or "manual" are used is 40. When "cumvar" is used the default "value" is set to 0.99 and when "var" is used, the default "value" is set to 0.01.
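The variance-based selection rules can be sketched with prcomp from base R. This is only an illustration of the "cumvar", "var" and "manual" criteria on simulated data; it is not the code used by ortho_projection, and the "opc" method (which needs Yr) is not covered.

```r
set.seed(11)
X <- matrix(rnorm(40 * 8), nrow = 40) %*% diag(c(5, 3, 2, 1, 0.5, 0.3, 0.2, 0.1))

pca <- prcomp(X, center = TRUE, scale. = FALSE)
explained <- pca$sdev^2 / sum(pca$sdev^2)

# "cumvar": smallest number of components whose cumulative explained
# variance reaches the requested value
n_cumvar <- which(cumsum(explained) >= 0.99)[1]

# "var": components that individually explain at least the requested value
n_var <- sum(explained >= 0.01)

# "manual": a fixed number of components
n_manual <- 3
```

The thresholds 0.99 and 0.01 match the defaults mentioned above for "cumvar" and "var".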

control

a list created with the mbl_control function which contains additional parameters that control some aspects of the mbl function (cross-validation, parameter tuning, etc). The default list is as returned by mbl_control(). See the mbl_control function for more details.

group

an optional factor (or character vector that can be coerced to factor by as.factor) that assigns a group/class label to each observation in Xr (e.g. groups can be given by spectra collected from the same batch of measurements, from the same observation, from observations with very similar origin, etc). This is taken into account for internal leave-group-out cross-validation for pls tuning (factor optimization) to avoid pseudo-replication. When one observation is selected for cross-validation, all observations of the same group are removed together and assigned to validation. The length of the vector must be equal to the number of observations in the reference/training set (i.e. nrow(Xr)). See details.

center

a logical indicating if the predictor variables must be centred at each local segment (before regression). In addition, if TRUE, Xr and Xu will be centred for dissimilarity computations.

scale

a logical indicating if the predictor variables must be scaled to unit variance at each local segment (before regression). In addition, if TRUE, Xr and Xu will be scaled for dissimilarity computations.

verbose

a logical indicating whether or not to print a progress bar for each observation to be predicted. Default is TRUE. Note: In case parallel processing is used, these progress bars will not be printed.

documentation

an optional character string that can be used to describe anything related to the mbl call (e.g. description of the input data). Default: character(). NOTE: this is an experimental argument.

seed

an integer value containing the random number generator (RNG) state for random number generation. This argument can be used for reproducibility purposes (for random sampling) in the cross-validation results. Default is NULL, i.e. no RNG state is set.

...

further arguments to be passed to the dissimilarity function. See details.

Details

The argument spike can be used to indicate what reference observations in Xr must be kept in the neighborhood of every single Xu observation. If a vector of length $m$ is passed to this argument, this means that the $m$ original neighbors with the largest dissimilarities to the target observations will be forced out of the neighborhood. Spiking might be useful in cases where some reference observations are known to be somehow related to the ones in Xu and therefore might be relevant for fitting the local models. See Guerrero et al. (2010) for an example on the benefits of spiking.
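The effect of spiking on a neighborhood of fixed size can be sketched as follows. The helper spiked_neighborhood is hypothetical, it handles only positive indices (observations forced in), and it assumes that k is larger than the number of spiked observations; it is not the logic used internally by mbl.

```r
# Hypothetical helper (not part of resemble): build a neighborhood of size k
# in which the reference observations listed in `spike` are always included
spiked_neighborhood <- function(d, k, spike) {
  ord <- order(d)
  # remaining places are filled with the nearest non-spiked observations
  others <- setdiff(ord, spike)
  c(spike, others[seq_len(k - length(spike))])
}

d <- c(0.9, 0.1, 0.4, 0.05, 0.7, 0.3, 1.2, 0.2)
plain <- order(d)[1:4]
spiked <- spiked_neighborhood(d, k = 4, spike = c(1, 7))
```

Without spiking the neighborhood holds observations 4, 2, 8 and 6. Forcing in observations 1 and 7 (m = 2) pushes out the two farthest original neighbors, 8 and 6, while the neighborhood size stays at 4.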

The mbl function uses the dissimilarity function to compute the dissimilarities between Xr and Xu. The dissimilarity method to be used is specified in the diss_method argument. Arguments to dissimilarity as well as further arguments to the functions used inside dissimilarity (i.e. ortho_diss, cor_diss, f_diss, sid) can be passed to those functions by using "...".

The diss_usage argument is used to specify whether the dissimilarity information must be used within the local regressions and, if so, how. When diss_usage = "predictors" the local (square symmetric) dissimilarity matrix corresponding to the selected neighborhood is used as a source of additional predictors (i.e. the columns of this local matrix are treated as predictor variables). In some cases this results in an improvement of the prediction performance (Ramirez-Lopez et al., 2013a). If diss_usage = "weights", the neighbors of the query point ($xu_{j}$) are weighted according to their dissimilarity to $xu_{j}$ before carrying out each local regression. The following tricubic function (Cleveland and Devlin, 1988; Naes et al., 1990) is used for computing the final weights based on the measured dissimilarities:

\[W_{j} = (1 - v^{3})^{3}\]

where, if $xr_{i}$ is among the neighbors of $xu_{j}$:

\[v_{j}(xu_{j}) = d(xr_{i}, xu_{j})\]

otherwise:

\[v_{j}(xu_{j}) = 0\]

In the above formulas $d(xr_{i}, xu_{j})$ represents the dissimilarity between the query point and each object in $Xr$. When diss_usage = "none" is chosen the dissimilarity information is not used.
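The tricubic weighting above can be tried on a small vector of dissimilarities. The sketch assumes that the dissimilarities are rescaled to the interval [0, 1] within the neighborhood (by dividing by the largest one) before the function is applied; the text does not state how this scaling is done, so that step is an assumption.

```r
# Dissimilarities between a query observation and its neighbors
d <- c(0.05, 0.10, 0.20, 0.30, 0.40)

# Assumption: dissimilarities are rescaled to [0, 1] within the neighborhood
v <- d / max(d)

# Tricubic weights: W = (1 - v^3)^3
w <- (1 - v^3)^3
round(w, 3)
```

The weights decrease slowly for the closest neighbors and drop to zero for the farthest one, so nearby observations dominate the local regression.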

The global Mahalanobis distance (a.k.a. GH) is computed based on the scores of a pls projection. A pls projection model is built for {Yr, Xr} and this model is used to obtain the pls scores of the Xu observations. The Mahalanobis distance between each Xu observation (in the pls space) and the centre of Xr is then computed. The number of pls components is optimized based on the parameters passed to the pc_selection argument. In addition, the mbl function also reports the GH distance for the observations in Xr.
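A distance of this kind can be sketched with base R. Note the substitutions: a PCA projection from prcomp is used as a stand-in for the pls projection (so Yr is not needed), the number of components is fixed by hand, and mahalanobis() returns squared distances; the exact scaling applied by the package is not reproduced here.

```r
set.seed(5)
Xr <- matrix(rnorm(60 * 10), nrow = 60)
Xu <- matrix(rnorm(5 * 10), nrow = 5)

# Stand-in projection: PCA fitted on Xr (mbl uses a pls projection instead)
n_comp <- 3
proj <- prcomp(Xr, center = TRUE, scale. = FALSE)
scores_xr <- proj$x[, seq_len(n_comp), drop = FALSE]
scores_xu <- predict(proj, newdata = Xu)[, seq_len(n_comp), drop = FALSE]

# Squared Mahalanobis distance of each observation to the centre of Xr scores
centre <- colMeans(scores_xr)
s <- cov(scores_xr)
gh_xr <- mahalanobis(scores_xr, center = centre, cov = s)
gh_xu <- mahalanobis(scores_xu, center = centre, cov = s)
```

Observations in Xu with values far above those observed for Xr lie outside the spectral space covered by the reference set, which is a hint that their predictions should be treated with care.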

Some aspects of the mbl process, such as the type of internal validation, parameter tuning, what extra objects to return, permission for parallel execution, prediction limits, etc, can be specified by using the mbl_control function.

By using the group argument one can specify groups of observations that have something in common (e.g. observations with very similar origin). The purpose of group is to avoid biased cross-validation results due to pseudo-replication. This argument makes it possible to select calibration points that are independent from the validation ones. In this regard, when validation_type = "local_cv" (used in the mbl_control function), the p argument refers to the percentage of groups of observations (rather than single observations) to be retained in each sampling iteration at each local segment.
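Sampling by groups rather than by single observations can be sketched in a few lines. The group labels are made up, and rounding the number of calibration groups down with floor() is an assumption; the snippet only illustrates why grouped sampling keeps calibration and validation sets independent.

```r
set.seed(21)
group <- factor(rep(c("a", "b", "c", "d", "e", "f", "g", "h"), each = 3))
p <- 0.75

# Sample groups (not single observations) for calibration
group_levels <- levels(group)
n_cal_groups <- floor(p * length(group_levels))
cal_groups <- sample(group_levels, n_cal_groups)

cal_idx <- which(group %in% cal_groups)
val_idx <- which(!group %in% cal_groups)
```

No group contributes observations to both sets, so replicated measurements of the same sample can never appear on both sides of the validation.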

Value

a list of class mbl with the following components (sorted either by k or k_diss):

call: the call to mbl.
cntrl_param: the list with the control parameters passed to control.
Xu_neighbors: a list containing two elements: a matrix of Xr indices corresponding to the neighbors of Xu and a matrix of dissimilarities between each Xu observation and its corresponding neighbor in Xr.
dissimilarities: a list with the method used to obtain the dissimilarity matrices and the dissimilarity matrix corresponding to $D(Xr, Xu)$. This object is returned only if the return_dissimilarity argument in the control list was set to TRUE.
n_predictions: the total number of observations predicted.
gh: if gh = TRUE, a list containing the global Mahalanobis distance values for the observations in Xr and Xu as well as the results of the global pls projection object used to obtain the GH values.
validation_results: a list of validation results for "local cross validation" (returned if the validation_type in the control list was set to "local_cv"), "nearest neighbor validation" (returned if the validation_type in the control list was set to "NNv") and "Yu prediction statistics" (returned if Yu was supplied).
results: a list of data tables containing the results of the predictions for each k or k_diss. Each data table contains the following columns:
- o_index: The index of the predicted observation.
- k_diss: This column is only output if the k_diss argument is used. It indicates the corresponding dissimilarity threshold for selecting the neighbors.
- k_original: This column is only output if the k_diss argument is used. It indicates the number of neighbors that were originally found when the given dissimilarity threshold is used.
- k: This column indicates the final number of neighbors used.
- npls: This column is only output if the pls regression method was used. It indicates the final number of pls components used.
- min_pls: This column is only output if the wapls regression method was used. It indicates the final number of minimum pls components used. If no optimization was set, it retrieves the original minimum pls components passed to the method argument.
- max_pls: This column is only output if the wapls regression method was used. It indicates the final number of maximum pls components used. If no optimization was set, it retrieves the original maximum pls components passed to the method argument.
- yu_obs: The input values given in Yu (the response variable corresponding to the data to be predicted). If Yu = NULL, then NAs are retrieved.
- pred: The predicted Yu values.
- yr_min_obs: The minimum reference value (of the response variable) in the neighborhood.
- yr_max_obs: The maximum reference value (of the response variable) in the neighborhood.
- index_nearest_in_Xr: The index of the nearest neighbor found in Xr.
- index_farthest_in_Xr: The index of the farthest neighbor found in Xr.
- y_nearest: The reference value (Yr) corresponding to the nearest neighbor found in Xr.
- y_nearest_pred: This column is only output if the validation method in the object passed to control was set to "NNv". It represents the predicted value of the nearest neighbor observation found in Xr. This prediction comes from a model fitted with the remaining observations in the neighborhood of the target observation in Xu.
- loc_rmse_cv: This column is only output if the validation method in the object passed to control was set to 'local_cv'. It represents the RMSE of the cross-validation computed for the neighborhood of the target observation in Xu.
- loc_st_rmse_cv: This column is only output if the validation method in the object passed to control was set to 'local_cv'. It represents the standardized RMSE of the cross-validation computed for the neighborhood of the target observation in Xu.
- dist_nearest: The distance to the nearest neighbor.
- dist_farthest: The distance to the farthest neighbor.
- loc_n_components: This column is only output if the dissimilarity method used is one of "pca", "pca.nipals" or "pls" and in addition the dissimilarities are requested to be computed locally by passing .local = TRUE to the mbl function. See the .local argument in the ortho_diss function.
seed: a value mirroring the one passed to seed.
documentation: a character string mirroring the one provided in the documentation argument.

When the k_diss argument is used, the printed results show a table with a column named 'p_bounded'. It represents the percentage of observations for which the neighbors selected by the given dissimilarity threshold were outside the boundaries specified in the k_range argument.

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

References

Cleveland, W. S., and Devlin, S. J. 1988. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association, 83, 596-610.

Guerrero, C., Zornoza, R., Gómez, I., Mataix-Beneyto, J. 2010. Spiking of NIR regional models using observations from target sites: Effect of model size on prediction accuracy. Geoderma, 158(1-2), 66-77.

Naes, T., Isaksson, T., Kowalski, B. 1990. Locally weighted regression and scatter correction for near-infrared reflectance data. Analytical Chemistry 62, 664-673.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

Rasmussen, C.E., Williams, C.K. Gaussian Processes for Machine Learning. Massachusetts Institute of Technology: MIT-Press, 2006.

Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.

Examples

library(prospectr)
data(NIRsoil)

# Preprocess the data using detrend plus first derivative with Savitzky and
# Golay smoothing filter
sg_det <- savitzkyGolay(
  detrend(NIRsoil$spc,
    wav = as.numeric(colnames(NIRsoil$spc))
  ),
  m = 1,
  p = 1,
  w = 7
)
NIRsoil$spc_pr <- sg_det

# split into training and testing sets
test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC), ]
test_y <- NIRsoil$CEC[NIRsoil$train == 0 & !is.na(NIRsoil$CEC)]

train_y <- NIRsoil$CEC[NIRsoil$train == 1 & !is.na(NIRsoil$CEC)]
train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC), ]

# Example 1
# A mbl implemented in Ramirez-Lopez et al. (2013,
# the spectrum-based learner)
# Example 1.1
# An example where Yu is supposed to be unknown, but the Xu
# (spectral variables) are known
my_control <- mbl_control(validation_type = "NNv")

## The neighborhood sizes to test
ks <- seq(40, 140, by = 20)

sbl <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  k = ks,
  method = local_fit_gpr(),
  control = my_control,
  scale = TRUE
)
sbl
plot(sbl)
get_predictions(sbl)

# Example 1.2
# If Yu is actually known...
sbl_2 <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  Yu = test_y,
  k = ks,
  method = local_fit_gpr(),
  control = my_control
)
sbl_2
plot(sbl_2)

# Example 2
# the LOCAL algorithm (Shenk et al., 1997)
local_algorithm <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  Yu = test_y,
  k = ks,
  method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
  diss_method = "cor",
  diss_usage = "none",
  control = my_control
)
local_algorithm
plot(local_algorithm)

# Example 3
# A variation of the LOCAL algorithm (using the optimized pc
# dissimilarity matrix) and dissimilarity matrix as source of
# additional predictors
local_algorithm_2 <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  Yu = test_y,
  k = ks,
  method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
  diss_method = "pca",
  diss_usage = "predictors",
  control = my_control
)
local_algorithm_2
plot(local_algorithm_2)

# Example 4
# Running the mbl function in parallel with example 2
n_cores <- 2

if (parallel::detectCores() < 2) {
  n_cores <- 1
}

# Alternatively:
# n_cores <- parallel::detectCores() - 1
# if (n_cores == 0) {
#  n_cores <- 1
# }

library(doParallel)
clust <- makeCluster(n_cores)
registerDoParallel(clust)

# Alternatively:
# library(doSNOW)
# clust <- makeCluster(n_cores, type = "SOCK")
# registerDoSNOW(clust)
# getDoParWorkers()

local_algorithm_par <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  Yu = test_y,
  k = ks,
  method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
  diss_method = "cor",
  diss_usage = "none",
  control = my_control
)
local_algorithm_par

registerDoSEQ()
try(stopCluster(clust))

# Example 5
# Using local pls distances
with_local_diss <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  Yu = test_y,
  k = ks,
  method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
  diss_method = "pls",
  diss_usage = "predictors",
  control = my_control,
  .local = TRUE,
  pre_k = 150
)
with_local_diss
plot(with_local_diss)

A function that controls some aspects of the memory-based learning process in the `mbl` function

Description

This function is used to further control some aspects of the memory-based learning process in the mbl function.

Usage

mbl_control(  return_dissimilarity = FALSE,  validation_type = c("NNv", "local_cv"),  tune_locally = TRUE,  number = 10,  p = 0.75,  range_prediction_limits = TRUE,  progress = TRUE,  allow_parallel = TRUE)

Arguments

return_dissimilarity

a logical indicating if the dissimilarity matrix between Xr and Xu must be returned.

validation_type

a character vector which indicates the (internal) validation method(s) to be used for assessing the global performance of the local models. Possible options are: "NNv" and "local_cv". Alternatively, "none" can be used when cross-validation is not required (see details below).

tune_locally

a logical. It only applies when validation_type = "local_cv" and "pls" or "wapls" fitting algorithms are used. If TRUE, the parameters of the local pls-based models (i.e. pls factors for the "pls" method and minimum and maximum pls factors for the "wapls" method) are tuned locally. Default is TRUE.

number

an integer indicating the number of sampling iterations at each local segment when "local_cv" is selected in the validation_type argument. Default is 10.

p

a numeric value indicating the proportion of observations to be retained at each sampling iteration at each local segment when "local_cv" is selected in the validation_type argument. Default is 0.75 (i.e. 75 %).

range_prediction_limits

a logical. It indicates whether the prediction limits at each local regression are determined by the range of the response variable within each neighborhood. When the predicted value is outside this range, it is automatically replaced with the nearest limit of the range. If FALSE, no prediction limits are imposed. Default is TRUE.

progress

a logical indicating whether or not to print a progress bar for each observation to be predicted. Default is TRUE. Note: In case parallel processing is used, these progress bars will not be printed.

allow_parallel

a logical indicating if parallel execution is allowed. If TRUE, this parallelism is applied to the loop in mbl in which each iteration takes care of a single observation in Xu. The parallelization of this for loop is implemented using the foreach function of the foreach package. Default is TRUE.
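The effect of range_prediction_limits can be pictured with a minimal base R sketch. The helper name limit_to_range and the toy values are hypothetical; the snippet only illustrates the clamping idea described for that argument and is not the code used inside mbl.

```r
# Hypothetical helper: clamp a prediction to the range of the response
# values observed in its neighborhood.
limit_to_range <- function(prediction, neighbor_y) {
  y_range <- range(neighbor_y, na.rm = TRUE)
  min(max(prediction, y_range[1]), y_range[2])
}

neighbor_y <- c(8.2, 10.5, 12.1, 9.7)  # toy response values of the neighbors

limit_to_range(13.4, neighbor_y)  # above the range: replaced by the maximum
limit_to_range(7.0, neighbor_y)   # below the range: replaced by the minimum
limit_to_range(10.0, neighbor_y)  # inside the range: left unchanged
```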

Details

The validation methods available for assessing the predictive performance of the memory-based learning method used are described as follows:

Leave-nearest-neighbor-out cross-validation ("NNv"): From the group of neighbors of each observation to be predicted, the nearest observation (i.e. the most similar observation) is excluded and then a local model is fitted using the remaining neighbors. This model is then used to predict the value of the target response variable of the nearest observation. These predicted values are finally compared against the actual values (see Ramirez-Lopez et al. (2013a) for additional details). This method is faster than "local_cv".
Local leave-group-out cross-validation ("local_cv"): The group of neighbors of each observation to be predicted is partitioned into different equal-size subsets. Each partition is selected based on a stratified random sampling which takes into account the values of the response variable of the corresponding set of neighbors. The selected local subset is used as local validation subset and the remaining observations are used for fitting a model. This model is used to predict the target response variable values of the local validation subset and the local root mean square error is computed. This process is repeated $m$ times and the final local error is computed as the average of the local root mean square error of all the $m$ iterations. In the mbl function, $m$ is controlled by the number argument and the size of the subsets is controlled by the p argument, which indicates the proportion of observations to be selected from the subset of nearest neighbours. The global error of the predictions is computed as the average of the local root mean square errors.
No validation ("none"): No validation is carried out. If "none" is selected along with "NNv" and/or "local_cv", then it will be ignored and the respective validation(s) will be carried out.
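The leave-nearest-neighbor-out idea can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated, neighbors are found with a plain Euclidean distance, and an ordinary linear model (lm.fit) stands in for the local fitting method. None of these choices is claimed to match the internals of mbl.

```r
set.seed(1)
n <- 60
x <- matrix(rnorm(n * 3), n, 3)                         # toy reference predictors
y <- drop(x %*% c(1, -0.5, 0.25)) + rnorm(n, sd = 0.1)  # toy reference response
xu <- matrix(rnorm(5 * 3), 5, 3)                        # toy observations to predict
k <- 20                                                 # neighborhood size

nnv <- t(sapply(seq_len(nrow(xu)), function(i) {
  # Euclidean distance from the i-th target to every reference observation
  d <- sqrt(colSums((t(x) - xu[i, ])^2))
  nn <- order(d)[seq_len(k)]   # indices of the k nearest neighbors
  nearest <- nn[1]             # the most similar observation is held out
  fit_idx <- nn[-1]            # the remaining neighbors are used for fitting
  fit <- lm.fit(cbind(1, x[fit_idx, , drop = FALSE]), y[fit_idx])
  pred <- sum(c(1, x[nearest, ]) * fit$coefficients)
  c(observed = y[nearest], predicted = pred)
}))

# Global NNv error: RMSE between the held-out values and their predictions
nnv_rmse <- sqrt(mean((nnv[, "observed"] - nnv[, "predicted"])^2))
nnv_rmse
```

One held-out prediction is produced per observation to be predicted, so the cost of this validation grows with the size of Xu rather than with the number of resampling iterations.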

Value

a list mirroring the specified parameters

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

References

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

Examples

# A control list with the default parameters
mbl_control()

Moving/rolling correlation distance of two matrices

Description

Computes a moving window correlation distance between two data matrices

Usage

moving_cor_diss(X,Y,w)

Arguments

X

a matrix

Y

a matrix

w

window size (must be odd)

Value

a matrix of correlation distance

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens
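A moving-window correlation dissimilarity can be sketched in base R as shown below. The definition used here is an assumption: for every window of w consecutive variables the correlation r between two observations is computed, and (1 - r) / 2 is averaged over all windows. The function name moving_cor_sketch is hypothetical, and the compiled moving_cor_diss may differ in details such as window alignment or scaling.

```r
# Hypothetical base R sketch of a moving-window correlation dissimilarity.
moving_cor_sketch <- function(X, Y, w) {
  stopifnot(ncol(X) == ncol(Y), w %% 2 == 1, w >= 3, w <= ncol(X))
  starts <- seq_len(ncol(X) - w + 1)   # first column of each window
  out <- matrix(0, nrow(X), nrow(Y))
  for (s in starts) {
    cols <- s:(s + w - 1)
    # correlation between rows of X and rows of Y inside the window
    r <- cor(t(X[, cols, drop = FALSE]), t(Y[, cols, drop = FALSE]))
    out <- out + (1 - r) / 2
  }
  out / length(starts)                 # average over all windows
}

set.seed(42)
X <- matrix(rnorm(4 * 30), nrow = 4)
Y <- matrix(rnorm(3 * 30), nrow = 3)
d <- moving_cor_sketch(X, Y, w = 11)
dim(d)  # one row per observation in X, one column per observation in Y
```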

orthogonal scores algorithm of partial least squares (opls)

Description

Computes orthogonal scores partial least squares (opls) regressions with the NIPALS algorithm. It allows multiple response variables. It does not return the variance information of the components. NOTE: For internal use only!

Usage

opls(X,
     Y,
     ncomp,
     scale,
     maxiter,
     tol,
     algorithm = "pls",
     xls_min_w = 3,
     xls_max_w = 15)

Arguments

X

a matrix of predictor variables.

Y

a matrix of either a single or multiple response variables.

ncomp

the number of pls components.

scale

logical indicating whether X must be scaled.

maxiter

maximum number of iterations.

tol

limit for convergence of the algorithm in the nipals algorithm.

algorithm

(for weights computation) a character string indicating what method to use. Options are: 'pls' for pls (using covariance between X and Y), 'mpls' for modified pls (using correlation between X and Y) or 'xls' for extended pls (as implemented in BUCHI NIRWise PLUS software).

xls_min_w

(for weights computation) an integer indicating the minimum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 3 (as in BUCHI NIRWise PLUS software).

xls_max_w

(for weights computation) an integer indicating the maximum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 15 (as in BUCHI NIRWise PLUS software).

Value

a list containing the following elements:

coefficients: the matrix of regression coefficients.
bo: a matrix of one row containing the intercepts for each component.
scores: the matrix of scores.
X_loadings: the matrix of X loadings.
Y_loadings: the matrix of Y loadings.
projection_mat: the projection matrix.
Y: the Y input.
transf: a list containing two objects: Xcenter and Xscale.
weights: the matrix of weights.

Author(s)

Leonardo Ramirez-Lopez
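For orientation, the orthogonal scores algorithm for a single response can be sketched in a few lines of base R. This is a simplified illustration, not the compiled opls routine: the function name pls1_sketch is hypothetical, only the 'pls' (covariance-based) weights are covered, no scaling is applied, and coefficients are returned for the last component only.

```r
# Hypothetical helper: single-response PLS with orthogonal scores.
pls1_sketch <- function(X, y, ncomp) {
  x_center <- colMeans(X)
  y_center <- mean(y)
  Xd <- sweep(X, 2, x_center)           # centered (later deflated) X
  yd <- y - y_center                    # centered (later deflated) y
  W <- P <- matrix(0, ncol(X), ncomp)   # weights and X loadings
  S <- matrix(0, nrow(X), ncomp)        # scores
  q <- numeric(ncomp)                   # Y loadings
  for (a in seq_len(ncomp)) {
    w <- drop(crossprod(Xd, yd))        # weights from the X-y covariance
    w <- w / sqrt(sum(w^2))
    s <- drop(Xd %*% w)                 # scores of component a
    p <- drop(crossprod(Xd, s)) / sum(s^2)
    q[a] <- sum(yd * s) / sum(s^2)
    Xd <- Xd - tcrossprod(s, p)         # deflate X
    yd <- yd - s * q[a]                 # deflate y
    W[, a] <- w
    P[, a] <- p
    S[, a] <- s
  }
  projection_mat <- W %*% solve(crossprod(P, W))
  coefficients <- drop(projection_mat %*% q)
  list(
    coefficients = coefficients,
    bo = y_center - sum(x_center * coefficients),
    scores = S, X_loadings = P, Y_loadings = q,
    projection_mat = projection_mat, weights = W
  )
}

set.seed(7)
X <- matrix(rnorm(50 * 4), 50, 4)
y <- drop(X %*% c(2, -1, 0.5, 0)) + rnorm(50, sd = 0.1)
fit <- pls1_sketch(X, y, ncomp = 4)

# With as many components as predictors, PLS reproduces least squares
cbind(pls = c(fit$bo, fit$coefficients), ols = coef(lm(y ~ X)))
```

The element names of the returned list were chosen to echo the Value section above; that correspondence is a convenience of the sketch, not a statement about the internal data layout.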

Internal Cpp function for performing leave-group-out cross-validations for pls regression

Description

For internal use only!

Usage

opls_cv_cpp(X, Y, scale, method,
            mindices, pindices,
            min_component, ncomp,
            new_x,
            maxiter, tol,
            wapls_grid,
            algorithm,
            statistics = TRUE)

Arguments

X

a matrix of predictor variables.

Y

a matrix of a single response variable.

scale

a logical indicating whether the matrix of predictors (X) must be scaled.

method

the method used for regression. One of the following options: 'pls' or 'wapls' or 'completewapls1p'.

mindices

a matrix with n rows and m columns where m is equivalent to the number of resampling iterations. The elements of each column indicate the indices of the observations to be used for modeling at each iteration.

pindices

a matrix with k rows and m columns where m is equivalent to the number of resampling iterations. The elements of each column indicate the indices of the observations to be used for predicting at each iteration.

min_component

an integer indicating the minimum number of pls components (if method = 'pls').

ncomp

an integer indicating the number of pls components.

new_x

a matrix of one row corresponding to the observation to be predicted (if method = 'wapls').

maxiter

maximum number of iterations.

tol

limit for convergence of the algorithm in the nipals algorithm.

wapls_grid

the grid on which the search for the best combination of minimum and maximum pls factors of 'wapls' is based in case method = 'completewapls1p'.

algorithm

either pls ('pls') or modified pls ('mpls'). See get_weigths function.

statistics

a logical value indicating whether the precision and accuracy statistics are to be returned, otherwise the predictions for each validation segment are retrieved.

Value

if statistics = TRUE, a list containing the following one-row matrices:

rmse_seg: the RMSEs.
st_rmse_seg: the standardized RMSEs.
rsq_seg: the coefficients of determination.

if statistics = FALSE, a list containing the following one-row matrices:

predictions: the predictions of each of the validation segments in pindices. Each column in pindices contains the validation indices of a segment.
st_rmse_seg: the standardized RMSEs.
rsq_seg: the coefficients of determination.

If method = "wapls", data of the pls weights are output in this list (compweights).

If method = "completewapls1", data of all the combinations of components passed in wapls_grid are output in this list (complete_compweights).

Author(s)

Leonardo Ramirez-Lopez

orthogonal scores algorithm of partial least squares (opls) projection

Description

Computes orthogonal scores partial least squares (opls) projection with the NIPALS algorithm. It allows multiple response variables. Although the main use of the function is for projection, it also retrieves regression coefficients. NOTE: For internal use only!

Usage

opls_for_projection(X, Y, ncomp, scale,
                    maxiter, tol,
                    pcSelmethod = "var",
                    pcSelvalue = 0.01,
                    algorithm = "pls",
                    xls_min_w = 3,
                    xls_max_w = 15)

Arguments

X

a matrix of predictor variables.

Y

a matrix of either a single or multiple response variables.

ncomp

the number of pls components.

scale

logical indicating whether X must be scaled.

maxiter

maximum number of iterations.

tol

limit for convergence of the algorithm in the nipals algorithm.

pcSelmethod

if regression = TRUE, the method for selecting the number of components. Options are: 'manual', 'cumvar' (for selecting the number of principal components based on a given cumulative amount of explained variance) and 'var' (for selecting the number of principal components based on a given amount of explained variance). Default is 'var' (as shown in Usage).

pcSelvalue

a numerical value that complements the selected method (pcSelmethod). If 'cumvar' is chosen, pcSelvalue must be a value (larger than 0 and below 1) indicating the maximum amount of cumulative variance that the retained components should explain. If 'var' is chosen, pcSelvalue must be a value (larger than 0 and below 1) indicating that components that explain (individually) a variance lower than this threshold must be excluded. If 'manual' is chosen, pcSelvalue has no effect and the number of components retrieved is the one specified in ncomp. Default is 0.01 (as shown in Usage).

algorithm

xls_min_w

(for weights computation) an integer indicating the minimum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 3 (as in BUCHI NIRWise PLUS software).

xls_max_w

(for weights computation) an integer indicating the maximum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 15 (as in BUCHI NIRWise PLUS software).

Value

a list containing the following elements:

coefficients: the matrix of regression coefficients.
bo: a matrix of one row containing the intercepts for each component.
scores: the matrix of scores.
X_loadings: the matrix of X loadings.
Y_loadings: the matrix of Y loadings.
projection_mat: the projection matrix.
Y: the Y input.
variance: a list containing two objects: x_var and y_var. These objects contain information on the explained variance for the X and Y matrices respectively.
transf: a list containing two objects: Xcenter and Xscale.
weights: the matrix of weights.

Author(s)

Leonardo Ramirez-Lopez

orthogonal scores algorithm of partial least squares (opls_get_all)

Description

Computes orthogonal scores partial least squares (opls_get_all) regressions with the NIPALS algorithm. It retrieves a comprehensive set of pls outputs (e.g. vip and selectivity ratio). It allows multiple response variables. NOTE: For internal use only!

Usage

opls_get_all(X,
             Y,
             ncomp,
             scale,
             maxiter,
             tol,
             algorithm = "pls",
             xls_min_w = 3,
             xls_max_w = 15)

Arguments

X

a matrix of predictor variables.

Y

a matrix of either a single or multiple response variables.

ncomp

the number of pls components.

scale

logical indicating whether X must be scaled.

maxiter

maximum number of iterations.

tol

limit for convergence of the algorithm in the nipals algorithm.

algorithm

xls_min_w

(for weights computation) an integer indicating the minimum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 3 (as in BUCHI NIRWise PLUS software).

xls_max_w

(for weights computation) an integer indicating the maximum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 15 (as in BUCHI NIRWise PLUS software).

Value

a list containing the following elements:

ncomp: the number of components used.
coefficients: the matrix of regression coefficients.
bo: a matrix of one row containing the intercepts for each component.
scores: the matrix of scores.
X_loadings: the matrix of X loadings.
Y_loadings: the matrix of Y loadings.
vip: the matrix of variable importance in projection (VIP) values.
selectivity_ratio: the matrix of selectivity ratio (see Rajalahti, Tarja, et al. 2009).
Y: the Y input.
variance: a list containing two objects: x_var and y_var. These objects contain information on the explained variance for the X and Y matrices respectively.
transf: a list containing two objects: Xcenter and Xscale.
weights: the matrix of weights.

Author(s)

Leonardo Ramirez-Lopez

fast orthogonal scores algorithm of partial least squares (opls)

Description

Computes orthogonal scores partial least squares (opls) regressions with the NIPALS algorithm. It allows multiple response variables. In contrast to the opls function, this one does not compute unnecessary data for (local) regression. For internal use only!

Usage

opls_get_basics(X, Y, ncomp, scale,
                maxiter, tol,
                algorithm = "pls",
                xls_min_w = 3,
                xls_max_w = 15)

Arguments

X

a matrix of predictor variables.

Y

a matrix of either a single or multiple response variables.

ncomp

the number of pls components.

scale

logical indicating whether X must be scaled.

maxiter

maximum number of iterations.

tol

limit for convergence of the algorithm in the nipals algorithm.

algorithm

xls_min_w

(for weights computation) an integer indicating the minimum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 3 (as in BUCHI NIRWise PLUS software).

xls_max_w

(for weights computation) an integer indicating the maximum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 15 (as in BUCHI NIRWise PLUS software).

Value

a list containing the following elements:

coefficients: the matrix of regression coefficients.
bo: a matrix of one row containing the intercepts for each component.
Y_loadings: the matrix of Y loadings.
projection_mat: the projection matrix.
transf: a list containing two objects: Xcenter and Xscale.

Author(s)

Leonardo Ramirez-Lopez

orthogonal scores algorithm of partial least squares (opls)

Description

Computes orthogonal scores partial least squares (opls) regressions with the NIPALS algorithm. It allows multiple response variables. It does not return the variance information of the components. NOTE: For internal use only!

Usage

opls_gs(Xr,
        Yr,
        Xu,
        ncomp,
        scale,
        response = FALSE,
        reconstruction = TRUE,
        similarity = TRUE,
        fresponse = TRUE,
        algorithm = "pls")

Arguments

Xr

a matrix of predictor variables for the training set.

Yr

a matrix of a single response variable for the training set.

Xu

a matrix of predictor variables for the test set.

ncomp

the number of pls components.

scale

logical indicating whether X must be scaled.

response

logical indicating whether to compute the prediction of Yu.

reconstruction

logical indicating whether to compute the reconstruction error of Xu.

similarity

logical indicating whether to compute the distance score between Xr and Xu (in the pls space).

fresponse

logical indicating whether to compute the score of the variance not explained for Yu.

algorithm

(for weights computation) a character string indicating what method to use. Options are: 'pls' for pls (using covariance between X and Y) or 'mpls' for modified pls (using correlation between X and Y).

Value

a list containing the following elements:

ncomp: the number of components.
pred_response: the response predictions for Xu.
rmse_reconstruction: the rmse of the reconstruction for Xu.
score_dissimilarity: the distance score between Xr and Xu.

Author(s)

Leonardo Ramirez-Lopez

A function to construct optimal strata for the samples, based on the distribution of the given y.

Description

For internal use only! This function computes the optimal strata from the distribution of the given y.

Usage

optim_sample_strata(y, n)

Arguments

y

a matrix of one column with the response variable.

n

number of samples that must be sampled.

Value

a list with two data.table objects: sample_strata contains the optimal strata, whereas samples_to_get contains information on how many samples per stratum are supposed to be drawn.
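The general idea of drawing samples from strata defined on the distribution of y can be sketched in base R. The rule used below (equal-frequency strata built from quantiles, one random draw per stratum) is an assumption made for illustration only; it is not claimed to be the rule implemented in optim_sample_strata.

```r
set.seed(3)
y <- rlnorm(40)   # toy, skewed response values
n <- 10           # number of samples to draw

# Assumption: strata are n equal-frequency (quantile) intervals of y
breaks <- unique(quantile(y, probs = seq(0, 1, length.out = n + 1)))
strata <- cut(y, breaks = breaks, include.lowest = TRUE)

# Draw one observation at random from each stratum
pick_one <- function(idx) idx[sample.int(length(idx), 1)]
selected <- vapply(split(seq_along(y), strata), pick_one, integer(1))

table(strata)       # number of candidates per stratum
sort(y[selected])   # sampled values spread over the distribution of y
```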

A function for computing dissimilarity matrices from orthogonal projections (ortho_diss)

Description

This function computes dissimilarities (in an orthogonal space) between either observations in a given set or between observations in two different sets. The dissimilarities are computed based on either principal component projection or partial least squares projection of the data. After projecting the data, the Mahalanobis distance is applied.
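The two steps described above (projection followed by a Mahalanobis distance) can be sketched in base R with simulated data. The number of retained components is fixed by hand here, and any additional scaling of the final dissimilarities that ortho_diss may apply is not reproduced; the snippet is an illustration only.

```r
set.seed(11)
Xr <- matrix(rnorm(30 * 8), 30, 8)   # toy "spectra"
n_comp <- 3                          # components retained (fixed here)

# Project the centered data onto its first principal components
Xc <- scale(Xr, center = TRUE, scale = FALSE)
sv <- svd(Xc)
scores <- Xc %*% sv$v[, seq_len(n_comp)]

# PCA scores are uncorrelated, so the Mahalanobis distance in the score space
# reduces to the Euclidean distance between scores scaled to unit variance
scaled_scores <- scale(scores, center = FALSE, scale = apply(scores, 2, sd))
d_scaled <- as.matrix(dist(scaled_scores))

# Cross-check against the explicit quadratic form for the first observation
d_check <- sqrt(mahalanobis(scores, center = scores[1, ], cov = cov(scores)))
all.equal(unname(d_scaled[1, ]), unname(d_check))
```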

Usage

ortho_diss(Xr, Xu = NULL,
           Yr = NULL,
           pc_selection = list(method = "var", value = 0.01),
           diss_method = "pca",
           .local = FALSE,
           pre_k,
           center = TRUE,
           scale = FALSE,
           compute_all = FALSE,
           return_projection = FALSE,
           allow_parallel = TRUE, ...)

Arguments

Xr

a matrix containing n reference observations/rows and p variables/columns.

Xu

an optional matrix containing data of a second set of observations with p variables/columns.

Yr

a matrix of n rows and one or more columns (variables) with side information corresponding to the observations in Xr (e.g. response variables). It can be numeric with multiple variables/columns, or character with one single column. This argument is required if:

diss_method == 'pls': Yr is required to project the variables to orthogonal directions such that the covariance between the extracted pls components and Yr is maximized.
pc_selection$method == 'opc': Yr is required to optimize the number of components. The optimal number of projected components is the one for which its distance matrix minimizes the differences between the Yr value of each observation and the Yr value of its closest observation. See sim_eval.

pc_selection

a list of length 2 which specifies the method to be used for optimizing the number of components (principal components or pls factors) to be retained. This list must contain two elements (in the following order): method (a character indicating the method for selecting the number of components) and value (a numerical value that complements the selected method). The methods available are:

"opc": optimized principal component selection based on Ramirez-Lopez et al. (2013a, 2013b). The optimal number of components (of a given set of observations) is the one for which its distance matrix minimizes the differences between the Yr value of each observation and the Yr value of its closest observation. In this case, value must be a value (larger than 0 and below min(nrow(Xr) + nrow(Xu), ncol(Xr))) indicating the maximum number of principal components to be tested. See the ortho_projection function for more details.
"cumvar": selection of the principal components based on a given cumulative amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of cumulative variance that the combination of retained components should explain.
"var": selection of the principal components based on a given amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of variance that a single component should explain in order to be retained.
"manual": for manually specifying a fixed number of principal components. In this case, value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the number of components to be retained.

Default is list(method = "var", value = 0.01).

Optionally, the pc_selection argument admits "opc" or "cumvar" or "var" or "manual" as a single character string. In such a case, the default "value" when either "opc" or "manual" are used is 40. When "cumvar" is used the default "value" is set to 0.99 and when "var" is used, the default "value" is set to 0.01.
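The variance-based selection rules ("cumvar", "var" and "manual") can be illustrated with a short base R sketch on simulated data. The thresholds are the default values mentioned above; the way ties and boundary cases are handled inside the package is not covered here.

```r
set.seed(5)
X <- matrix(rnorm(40 * 6), 40, 6) %*% diag(c(5, 3, 1, 0.5, 0.2, 0.1))
Xc <- scale(X, center = TRUE, scale = FALSE)

# Proportion of variance explained by each principal component
explained <- svd(Xc)$d^2
explained <- explained / sum(explained)

# "cumvar": smallest number of components reaching the cumulative threshold
n_cumvar <- which(cumsum(explained) >= 0.99)[1]

# "var": components that individually explain at least the threshold
n_var <- sum(explained >= 0.01)

# "manual": the number of components is simply fixed by the user
n_manual <- 2

c(cumvar = n_cumvar, var = n_var, manual = n_manual)
```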

diss_method

a character value indicating the type of projection on which the dissimilarities must be computed. This argument is equivalent to the method argument in the ortho_projection function. Options are:

"pca": principal component analysis using the singular value decomposition algorithm.
"pca.nipals": principal component analysis using the non-linear iterative partial least squares algorithm.
"pls": partial least squares.
"mpls": modified partial least squares (Shenk and Westerhaus, 1991 and Westerhaus, 2014).

See the ortho_projection function for further details on the projection methods.

.local

a logical indicating whether or not to compute the dissimilarities locally (i.e. projecting locally the data) by using the pre_k nearest neighbor observations of each target observation. Default is FALSE. See details.

pre_k

if .local = TRUE, an integer value which indicates the number of nearest neighbors to (pre-)retain for each observation to compute the (local) orthogonal dissimilarities to each observation in its neighborhood.

center

a logical indicating if Xr and Xu must be centered. If Xu is provided the data is centered around the mean of the pooled Xr and Xu matrices ($Xr \cup Xu$). For dissimilarity computations based on pls, the data is always centered for the projections.

scale

a logical indicating if Xr and Xu must be scaled. If Xu is provided the data is scaled based on the standard deviation of the pooled Xr and Xu matrices ($Xr \cup Xu$). If center = TRUE, scaling is applied after centering.

compute_all

a logical. In case Xu is specified, it indicates whether or not the distances between all the elements resulting from the pooled Xr and Xu matrices ($Xr \cup Xu$) must be computed.

return_projection

a logical. If TRUE, the ortho_projection object on which the dissimilarities are computed will be returned. Default is FALSE. Note that for .local = TRUE only the initial projection is returned (i.e. local projections are not).

allow_parallel

a logical (default TRUE). It allows parallel computing of the local distance matrices (i.e. when .local = TRUE). This is done via the foreach function of the 'foreach' package.

...

additional arguments to be passed to the ortho_projection function.

Details

When .local = TRUE, first a global dissimilarity matrix is computed based on the parameters specified. Then, by using this matrix for each target observation, a given set of nearest neighbors (pre_k) are identified. These neighbors (together with the target observation) are projected (from the original data space) onto a (local) orthogonal space (using the same parameters specified in the function). In this projected space the Mahalanobis distance between the target observation and its neighbors is recomputed. A missing value is assigned to the observations that do not belong to this set of neighbors (non-neighbor observations). In this case the dissimilarity matrix cannot be considered as a distance metric since it does not necessarily satisfy the symmetry condition for distance matrices (i.e. given two observations $x_i$ and $x_j$, the local dissimilarity ($d$) between them is relative since generally $d(x_i, x_j) \neq d(x_j, x_i)$). On the other hand, when .local = FALSE, the dissimilarity matrix obtained can be considered as a distance matrix.

In the cases where "Yr" is required to compute the dissimilarities and if .local = TRUE, care must be taken as some neighborhoods might not have enough observations with non-missing "Yr" values, which might lead to unreliable dissimilarity computations.

If "opc" or "manual" are used in pc_selection$method and .local = TRUE, the minimum number of observations with non-missing "Yr" values at each neighborhood is determined by pc_selection$value (i.e. the maximum number of components to compute).

Value

a list of class ortho_diss with the following elements:

n_components: the number of components (either principal components or partial least squares components) used for computing the global dissimilarities.
global_variance_info: the information about the explained variance(s) of the projection. When .local = TRUE, the information corresponds to the global projection done prior to computing the local projections.
local_n_components: if .local = TRUE, a data.table which specifies the number of local components (either principal components or partial least squares components) used for computing the dissimilarity between each target observation and its neighbor observations.
dissimilarity: the computed dissimilarity matrix. If .local = FALSE, a distance matrix. If .local = TRUE, a matrix of class local_ortho_diss. In this case, each column represents the dissimilarity between a target observation and its neighbor observations.
projection: if return_projection = TRUE, an ortho_projection object.

Author(s)

Leonardo Ramirez-Lopez

References

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

Examples

library(prospectr)
data(NIRsoil)

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil[!as.logical(NIRsoil$train), "CEC", drop = FALSE]
Yr <- NIRsoil[as.logical(NIRsoil$train), "CEC", drop = FALSE]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

Xu <- Xu[!is.na(Yu), ]
Yu <- Yu[!is.na(Yu), , drop = FALSE]

Xr <- Xr[!is.na(Yr), ]
Yr <- Yr[!is.na(Yr), , drop = FALSE]

# Computation of the orthogonal dissimilarity matrix using the
# default parameters
pca_diss <- ortho_diss(Xr, Xu)

# Computation of a principal component dissimilarity matrix using
# the "opc" method for the selection of the principal components
pca_diss_optim <- ortho_diss(
  Xr, Xu, Yr,
  pc_selection = list("opc", 40),
  compute_all = TRUE
)

# Computation of a partial least squares (PLS) dissimilarity
# matrix using the "opc" method for the selection of the PLS
# components
pls_diss_optim <- ortho_diss(
  Xr = Xr, Xu = Xu,
  Yr = Yr,
  pc_selection = list("opc", 40),
  diss_method = "pls"
)

Orthogonal projections using principal component analysis and partial least squares

Description

Functions to perform orthogonal projections of high dimensional data matrices using principal component analysis (pca) and partial least squares (pls).

Usage

ortho_projection(Xr, Xu = NULL,
                 Yr = NULL,
                 method = "pca",
                 pc_selection = list(method = "var", value = 0.01),
                 center = TRUE, scale = FALSE, ...)

pc_projection(Xr, Xu = NULL, Yr = NULL,
              pc_selection = list(method = "var", value = 0.01),
              center = TRUE, scale = FALSE,
              method = "pca",
              tol = 1e-6, max_iter = 1000, ...)

pls_projection(Xr, Xu = NULL, Yr,
               pc_selection = list(method = "opc", value = min(dim(Xr), 40)),
               scale = FALSE, method = "pls",
               tol = 1e-6, max_iter = 1000, ...)

## S3 method for class 'ortho_projection'
predict(object, newdata, ...)

Arguments

Xr

a matrix of observations.

Xu

an optional matrix containing data of a second set of observations.

Yr

if the method used in the pc_selection argument is "opc" or if method = "pls", then it must be a matrix containing the side information corresponding to the spectra in Xr. It is equivalent to the side_info parameter of the sim_eval function. In case method = "pca", a matrix (with one or more continuous variables) can also be used as input. The root mean square of differences (rmsd) is used for assessing the similarity between the observations and their corresponding most similar observations in terms of the side information provided. A single discrete variable of class factor can also be passed. In that case, the kappa index is used. See the sim_eval function for more details.

method

the method for projecting the data. Options are:

"pca": principal component analysis using the singular value decomposition algorithm.
"pca.nipals": principal component analysis using the non-linear iterative partial least squares algorithm.
"pls": partial least squares.
"mpls": modified partial least squares. See details.

pc_selection

a list of length 2 which specifies the method to be used for optimizing the number of components (principal components or pls factors) to be retained. This list must contain two elements (in the following order): method (a character indicating the method for selecting the number of components) and value (a numerical value that complements the selected method). The methods available are:

"opc": optimized principal component selection based on Ramirez-Lopez et al. (2013a, 2013b). The optimal number of components of a given set of observations is the one for which its distance matrix minimizes the differences between the Yr value of each observation and the Yr value of its closest observation. In this case, value must be a value (larger than 0 and below min(nrow(Xr) + nrow(Xu), ncol(Xr))) indicating the maximum number of principal components to be tested. See details.
"cumvar": selection of the principal components based on a given cumulative amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of cumulative variance that the combination of retained components should explain.
"var": selection of the principal components based on a given amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of variance that a single component should explain in order to be retained.
"manual": for manually specifying a fixed number of principal components. In this case, value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the number of components to be retained.

The list list(method = "var", value = 0.01) is the default. Optionally, the pc_selection argument admits "opc" or "cumvar" or "var" or "manual" as a single character string. In such a case, the default "value" when either "opc" or "manual" are used is 40. When "cumvar" is used the default "value" is set to 0.99 and when "var" is used, the default "value" is set to 0.01.

center

a logical indicating if the data Xr (and Xu if specified) must be centered. If Xu is specified the data is centered on the basis of $Xr \cup Xu$. NOTE: This argument only applies to the principal components projection. For pls projections the data is always centered.

scale

a logical indicating if Xr (and Xu if specified) must be scaled. If Xu is specified the data is scaled on the basis of $Xr \cup Xu$.

...

additional arguments to be passed to pc_projection or pls_projection.

tol

tolerance limit for convergence of the algorithm in the nipals algorithm (default is 1e-06). In the case of PLS this applies only to Yr with more than one variable.

max_iter

maximum number of iterations (default is 1000). In the case of method = "pls" this applies only to Yr matrices with more than one variable.

object

object of class "ortho_projection".

newdata

an optional data frame or matrix in which to look for variables with which to predict. If omitted, the scores are used. It must contain the same number of columns, to be used in the same order.

Details

In the case of method = "pca", the algorithm used is the singular value decomposition in which a given data matrix ($X$) is factorized as follows:

\[X = UDV^{T}\]

where $U$ and $V$ are orthogonal matrices containing the left and right singular vectors of $X$ respectively, and $D$ is a diagonal matrix containing the singular values of $X$. The matrix of principal component scores is obtained by a matrix multiplication of $U$ and $D$, and the matrix of principal component loadings is equivalent to the matrix $V$.
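The factorization can be checked numerically with the base R svd function. This is a minimal sketch on simulated data (the matrix and object names are illustrative assumptions), not the package's own implementation.

```r
set.seed(42)
X <- matrix(rnorm(30 * 6), nrow = 30)
Xc <- scale(X, center = TRUE, scale = FALSE)

s <- svd(Xc)
scores <- s$u %*% diag(s$d)   # U D
loadings <- s$v               # V

# The scores are equivalently the centered data projected onto the loadings
max(abs(scores - Xc %*% loadings))

# Per-component variance of the scores equals d^2 / (n - 1)
apply(scores, 2, var)
s$d^2 / (nrow(Xc) - 1)
```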

When method = "pca.nipals", the algorithm used for principal component analysis is the non-linear iterative partial least squares (nipals).
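As a rough illustration of the iterative idea, the sketch below extracts only the first principal component with a NIPALS-style loop and compares it with the SVD result. The function name nipals_pc1, the starting vector and the stopping rule are assumptions for this example; deflation for further components is omitted and the package's compiled routine may differ in its details.

```r
nipals_pc1 <- function(X, tol = 1e-06, max_iter = 1000) {
  # Start from the column with the largest variance
  t_old <- X[, which.max(apply(X, 2, var)), drop = FALSE]
  for (i in seq_len(max_iter)) {
    p <- crossprod(X, t_old) / drop(crossprod(t_old))  # loadings
    p <- p / sqrt(drop(crossprod(p)))                  # unit length
    t_new <- X %*% p                                   # scores
    # Relative squared change of the score vector between iterations
    converged <- drop(crossprod(t_new - t_old)) < tol * drop(crossprod(t_new))
    t_old <- t_new
    if (converged) break
  }
  list(scores = t_new, loadings = p, iterations = i)
}

set.seed(7)
X <- scale(matrix(rnorm(40 * 5), nrow = 40) %*% diag(c(4, 2, 1, 0.5, 0.2)),
           center = TRUE, scale = FALSE)
pc1 <- nipals_pc1(X)

# Up to sign, the first NIPALS loading matches the first right singular vector
abs(drop(crossprod(pc1$loadings, svd(X)$v[, 1])))
```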

In the case of the partial least squares projection (a.k.a. projection to latent structures) the nipals regression algorithm is used by default. Details on the "nipals" algorithm are presented in Martens (1991). Another method called modified pls ('mpls') can also be used. The modified pls was proposed by Shenk and Westerhaus (1991, see also Westerhaus, 2014) and it differs from the standard pls method in the way the weights of the Xr (used to compute the matrix of scores) are obtained. While pls uses the covariance between Yr and Xr (and later their deflated versions corresponding to each pls component iteration) to obtain these weights, the modified pls uses the correlation as weights. The authors indicate that by using correlation, a larger portion of the response variable(s) can be explained.
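The difference between the two weighting schemes can be sketched for the first component only. The data, the helper unit and the unit-length normalization are assumptions made for illustration; deflation and later components are left out, and this is not the package's internal routine.

```r
set.seed(3)
X <- scale(matrix(rnorm(60 * 8), nrow = 60) %*% diag(1:8),
           center = TRUE, scale = FALSE)
y <- scale(X[, 2] + 0.5 * X[, 7] + rnorm(60), center = TRUE, scale = FALSE)

# Helper (hypothetical): rescale a vector to unit length
unit <- function(v) v / sqrt(sum(v^2))

# Standard pls: weights proportional to the covariance between X and y
w_pls <- unit(drop(cov(X, y)))

# Modified pls: weights proportional to the correlation between X and y
w_mpls <- unit(drop(cor(X, y)))

# First score vector under each weighting
t_pls <- X %*% w_pls
t_mpls <- X %*% w_mpls

round(cbind(w_pls, w_mpls), 3)
```

Because the correlation divides each covariance by the standard deviation of the corresponding variable, high-variance variables weigh less in the modified version.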

When method = "opc", the selection of the components is carried out by using an iterative method based on the side information concept (Ramirez-Lopez et al. 2013a, 2013b). First, let $P$ be a sequence of numbers of retained components (so that $P = 1, 2, ..., k$). At each iteration, the function computes a dissimilarity matrix retaining $p_i$ components. The side information value of each observation is then compared against the side information value of its most spectrally similar observation (closest Xr observation). The optimal number of components retrieved by the function is the one that minimizes the root mean squared differences (RMSD) in the case of continuous variables, or maximizes the kappa index in the case of categorical variables. In this process, the sim_eval function is used. Note that for the "opc" method Yr is required (i.e. the side information of the observations).
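The search can be sketched in base R for a continuous side information variable. Everything below (simulated data, object names, the use of scaled scores as a Mahalanobis-type distance) is an assumption for illustration; the package relies on its own routines and on sim_eval.

```r
set.seed(11)
n <- 80
latent <- matrix(rnorm(n * 3), nrow = n)
X <- latent %*% matrix(rnorm(3 * 20), nrow = 3) +
  matrix(rnorm(n * 20, sd = 0.1), nrow = n)
y <- latent[, 1] + rnorm(n, sd = 0.1)   # side information (continuous)

s <- svd(scale(X, center = TRUE, scale = FALSE))
max_pc <- 10

rmsd <- sapply(seq_len(max_pc), function(p) {
  # Scores scaled to unit variance, so the Euclidean distance on them is a
  # Mahalanobis distance in the space of the first p components
  z <- scale(s$u[, seq_len(p), drop = FALSE] %*% diag(s$d[seq_len(p)], p))
  d <- as.matrix(dist(z))
  diag(d) <- Inf                        # exclude self-matches
  nearest <- apply(d, 1, which.min)     # closest observation of each row
  sqrt(mean((y - y[nearest])^2))        # RMSD against the neighbor's value
})

optimal_pc <- which.min(rmsd)
optimal_pc
```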

Value

a list of class ortho_projection with the following components:

scores: a matrix of scores corresponding to the observations in Xr (and Xu if it was provided). The components retrieved correspond to the ones optimized or specified.
X_loadings: a matrix of loadings corresponding to the explanatory variables. The components retrieved correspond to the ones optimized or specified.
Y_loadings: a matrix of partial least squares loadings corresponding to Yr. The components retrieved correspond to the ones optimized or specified. This object is only returned if the partial least squares algorithm was used.
weights: a matrix of partial least squares ("pls") weights. This object is only returned if the "pls" algorithm was used.
projection_mat: a matrix that can be used to project new data onto a "pls" space. This object is only returned if the "pls" algorithm was used.
variance: a list with information on the original variance and the explained variances. This list contains a matrix indicating the amount of variance explained by each component (var), the ratio between explained variance by each single component and the original variance (explained_var) and the cumulative ratio of explained variance (cumulative_explained_var). The amount of variance explained by each component is computed by multiplying its score vector by its corresponding loading vector and calculating the variance of the result. These values are computed based on the observations used to create the projection matrices. For example if the "pls" method was used, then these values are computed based only on the data that contains information on Yr (i.e. the Xr data). If the principal component method is used, then these values are computed on the basis of Xr and Xu (if it applies) since both matrices are employed in the computation of the projection matrix (loadings in this case).
sdv: the standard deviation of the retrieved scores. This vector can be different from the "sd" in variance.
n_components: the number of components (either principal components or partial least squares components) used for computing the global dissimilarity scores.
opc_evaluation: a matrix containing the statistics computed for optimizing the number of principal components based on the variable(s) specified in the Yr argument. If Yr was a continuous vector or matrix then this object indicates the root mean square of differences (rmse) for each number of components. If Yr was a categorical variable this object indicates the kappa values for each number of components. This object is returned only if "opc" was used within the pc_selection argument. See the sim_eval function for more details.
method: the ortho_projection method used.

predict.ortho_projection returns a matrix of scores projected for newdata.
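For a principal component projection, obtaining scores for new observations amounts to removing the training means and multiplying by the loadings. The sketch below shows this with base R only; the helper project and all object names are hypothetical, and for pls projections the projection_mat component would play the role of the loadings.

```r
set.seed(5)
X_train <- matrix(rnorm(25 * 4), nrow = 25)
X_new <- matrix(rnorm(5 * 4), nrow = 5)

center <- colMeans(X_train)
s <- svd(sweep(X_train, 2, center))
loadings <- s$v[, 1:2]

# Same columns, same order: subtract the training means, then project
project <- function(newdata, center, loadings) {
  stopifnot(ncol(newdata) == length(center))
  sweep(newdata, 2, center) %*% loadings
}

new_scores <- project(X_new, center, loadings)
dim(new_scores)
```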

Author(s)

Leonardo Ramirez-Lopez

References

Martens, H. (1991). Multivariate calibration. John Wiley & Sons.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

Shenk, J. S., & Westerhaus, M. O. 1991. Populations structuring of near infrared spectra and modified partial least squares regression. Crop Science, 31(6), 1548-1555.

Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.

Westerhaus, M. 2014. Eastern Analytical Symposium Award for outstanding achievements in near infrared spectroscopy: my contributions to near infrared spectroscopy. NIR news, 25(8), 16-20.

Examples

library(resemble)
library(prospectr)
data(NIRsoil)

# Preprocess the data using detrend plus first derivative with Savitzky and
# Golay smoothing filter
sg_det <- savitzkyGolay(
  detrend(NIRsoil$spc,
    wav = as.numeric(colnames(NIRsoil$spc))
  ),
  m = 1,
  p = 1,
  w = 7
)
NIRsoil$spc_pr <- sg_det

# split into training and testing sets
test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC), ]
test_y <- NIRsoil$CEC[NIRsoil$train == 0 & !is.na(NIRsoil$CEC)]
train_y <- NIRsoil$CEC[NIRsoil$train == 1 & !is.na(NIRsoil$CEC)]
train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC), ]

# A principal component analysis using 5 components
pca_projected <- ortho_projection(train_x, pc_selection = list("manual", 5))
pca_projected

# A principal components projection using the "opc" method
# for the selection of the optimal number of components
pca_projected_2 <- ortho_projection(
  Xr = train_x, Xu = test_x, Yr = train_y,
  method = "pca",
  pc_selection = list("opc", 40)
)
pca_projected_2
plot(pca_projected_2)

# A partial least squares projection using the "opc" method
# for the selection of the optimal number of components
pls_projected <- ortho_projection(
  Xr = train_x, Xu = test_x, Yr = train_y,
  method = "pls",
  pc_selection = list("opc", 40)
)
pls_projected
plot(pls_projected)

# A partial least squares projection using the "cumvar" method
# for the selection of the optimal number of components
pls_projected_2 <- ortho_projection(
  Xr = train_x, Xu = test_x, Yr = train_y,
  method = "pls",
  pc_selection = list("cumvar", 0.99)
)

Function for computing the overall variance of a matrix

Description

Computes the variance of a matrix. For internal use only!

Usage

overall_var(X)

Arguments

X

a matrix.

Value

a vector of standard deviation values.

Author(s)

Leonardo Ramirez-Lopez

Principal components based on the non-linear iterative partial least squares (nipals) algorithm

Description

Computes orthogonal scores partial least squares (opls) regressions with the NIPALS algorithm. It allows multiple response variables. For internal use only!

Usage

pca_nipals(X, ncomp, center, scale,
           maxiter, tol,
           pcSelmethod = "var",
           pcSelvalue = 0.01)

Arguments

X

a matrix of predictor variables.

ncomp

the number of pls components.

scale

logical indicating whether X must be scaled.

maxiter

maximum number of iterations.

tol

limit for convergence of the nipals algorithm.

pcSelmethod

the method for selecting the number of components. Options are: "cumvar" (for selecting the number of principal components based on a given cumulative amount of explained variance) and "var" (for selecting the number of principal components based on a given amount of explained variance). Default is "var".

pcSelvalue

a numerical value that complements the selected method (pcSelmethod). If "cumvar" is chosen, it must be a value (larger than 0 and below 1) indicating the maximum amount of cumulative variance that the retained components should explain. If "var" is chosen, it must be a value (larger than 0 and below 1) indicating that components that explain (individually) a variance lower than this threshold must be excluded. If "manual" is chosen, it must be a value specifying the desired number of principal components to retain. Default is 0.01.

Value

a list containing the following elements:

pc_scores: a matrix of principal component scores.
pc_loadings: a matrix of principal component loadings.
variance: a matrix of the variance of the principal components.
scale: a list containing two objects: center and scale, which correspond to the vectors used to center and scale the input matrix.

Author(s)

Leonardo Ramirez-Lopez

Get the package version info

Description

returns package info.

Usage

pkg_info(pkg = "resemble")

Arguments

pkg

the package name, i.e. "resemble".

Plot method for an object of class `mbl`

Description

Plots the content of an object of class mbl

Usage

## S3 method for class 'mbl'
plot(x, g = c("validation", "gh"), param = "rmse", pls_c = c(1,2), ...)

Arguments

x

an object of class mbl (as returned by mbl).

g

a character vector indicating what results shall be plotted. Options are: "validation" (for plotting the validation results) and/or "gh" (for plotting the pls scores used to compute the GH distance. See details).

param

a character string indicating what validation statistics shall be plotted. The following options are available: "rmse", "st_rmse" or "r2". These options are only available if the mbl object contains validation results.

pls_c

a numeric vector of length one or two indicating the pls factors to be plotted. Default is c(1, 2). It is only available if "gh" is specified in the g argument.

...

some arguments to be passed to the plot methods.

Details

For plotting the pls scores from the pls score matrix (of more than one column), this matrix is first transformed from the Euclidean space to the Mahalanobis space. This is done by multiplying the score matrix by the square root of its covariance matrix. The square root of this matrix is estimated using a singular value decomposition.

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

Examples

library(resemble)
library(prospectr)
data(NIRsoil)

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

Xu <- Xu[!is.na(Yu), ]
Yu <- Yu[!is.na(Yu)]

Xr <- Xr[!is.na(Yr), ]
Yr <- Yr[!is.na(Yr)]

ctrl <- mbl_control(validation_type = "NNv")

# The control object is passed through the control argument of mbl
ex_1 <- mbl(
  Yr = Yr, Xr = Xr, Xu = Xu,
  diss_method = "cor",
  diss_usage = "none",
  gh = TRUE,
  control = ctrl,
  k = seq(50, 250, 30)
)

plot(ex_1)
plot(ex_1, g = "gh", pls_c = c(2, 3))

Plot method for an object of class `ortho_projection`

Description

Plots objects of class ortho_projection

Usage

## S3 method for class 'ortho_projection'
plot(x, col = "dodgerblue", ...)

Arguments

x

an object of class ortho_projection (as returned by ortho_projection).

col

the color of the plots (default is "dodgerblue")

...

arguments to be passed to methods.

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

Cross validation for PLS regression

Description

for internal use only!

Usage

pls_cv(
  x,
  y,
  ncomp,
  method = c("pls", "wapls"),
  center = TRUE,
  scale,
  min_component = 1,
  new_x = matrix(0, 1, 1),
  weights = NULL,
  p = 0.75,
  number = 10,
  group = NULL,
  retrieve = TRUE,
  tune = TRUE,
  max_iter = 1,
  tol = 1e-06,
  seed = NULL,
  modified = FALSE
)

Prediction function for the `gaussian_process` function (Gaussian process regression with dot product covariance)

Description

Predicts response values based on a model generated by the gaussian_process function (Gaussian process regression with dot product covariance). For internal use only!

Usage

predict_gaussian_process(Xz, alpha, newdata, scale, Xcenter, Xscale, Ycenter, Yscale)

Arguments

newdata

a matrix containing the predictor variables

scale

a logical indicating whether the matrix of predictors used to create the regression model (in the gaussian_process function) was scaled

Xcenter

if center = TRUE a matrix of one row with the values that must be used for centering newdata.

Xscale

if scale = TRUE a matrix of one row with the values that must be used for scaling newdata.

Ycenter

if center = TRUE a matrix of one row with the values that must be used for accounting for the centering of the response variable.

Yscale

if scale = TRUE a matrix of one row with the values that must be used for accounting for the scaling of the response variable.

Value

a matrix of predicted values

Author(s)

Leonardo Ramirez-Lopez

Prediction function for the `opls` and `fopls` functions

Description

Predicts response values based on a model generated by either the opls or the fopls functions. For internal use only!

Usage

predict_opls(bo, b, ncomp, newdata, scale, Xscale)

Arguments

bo

a numeric value indicating the intercept.

b

the matrix of regression coefficients.

ncomp

an integer value indicating how many components must be used in the prediction.

newdata

a matrix containing the predictor variables.

scale

a logical indicating whether the matrix of predictors used to create the regression model was scaled.

Xscale

if scale = TRUE a matrix of one row with the values that must be used for scaling newdata.

Value

a matrix of predicted values.

Author(s)

Leonardo Ramirez-Lopez

Print method for an object of class `local_fit`

Description

Prints the contents of an object of class local_fit

Usage

## S3 method for class 'local_fit'
print(x, ...)

Arguments

x

an object of class local_fit

...

not yet functional.

Author(s)

Leonardo Ramirez-Lopez

Print method for an object of class `ortho_diss`

Description

Prints the content of an object of class ortho_diss

Usage

## S3 method for class 'local_ortho_diss'
print(x, ...)

Arguments

x

an object of class local_ortho_diss (returned by ortho_diss when it uses .local = TRUE).

...

arguments to be passed to methods (not yet functional).

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

Print method for an object of class `mbl`

Description

Prints the content of an object of class mbl

Usage

## S3 method for class 'mbl'
print(x, ...)

Arguments

x

an object of class mbl (as returned by the mbl function).

...

arguments to be passed to methods (not functional).

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

Print method for an object of class `ortho_projection`

Description

Prints the contents of an object of class ortho_projection

Usage

## S3 method for class 'ortho_projection'
print(x, ...)

Arguments

x

an object of class ortho_projection (as returned by the ortho_projection function).

...

arguments to be passed to methods (not yet functional).

Author(s)

Leonardo Ramirez-Lopez

Projection function for the `opls` function

Description

Projects new spectra onto a PLS space based on a model generated by either the opls or the opls2 functions. For internal use only!

Usage

project_opls(projection_mat, ncomp, newdata, scale, Xcenter, Xscale)

Arguments

projection_mat

the projection matrix generated by the opls function.

ncomp

an integer value indicating how many components must be used in the prediction.

newdata

a matrix containing the predictor variables.

scale

a logical indicating whether the matrix of predictors used to create the regression model was scaled.

Xcenter

a matrix of one row with the values that must be used for centering newdata.

Xscale

if scale = TRUE a matrix of one row with the values that must be used for scaling newdata.

Value

a matrix corresponding to the new spectra projected onto the PLS space

Author(s)

Leonardo Ramirez-Lopez

Projection to pls and then re-construction

Description

Projects spectra onto a PLS space and then reconstructs it back.

Usage

reconstruction_error(x,
                     projection_mat,
                     xloadings,
                     scale,
                     Xcenter,
                     Xscale,
                     scale_back = FALSE)

Arguments

x

a matrix to project.

projection_mat

the projection matrix generated by the opls_get_basics function.

xloadings

the loadings matrix generated by the opls_get_basics function.

scale

logical indicating if scaling is required

Xcenter

a matrix of one row with the centering values

Xscale

a matrix of one row with the scaling values

scale_back

compute the reconstruction error after de-centering the data and de-scaling it.

Value

a matrix of 1 row and 1 column.

Author(s)

Leonardo Ramirez-Lopez

A function to create calibration and validation sample sets for leave-group-out cross-validation

Description

For internal use only! This is stratified sampling based on the values of a continuous response variable (y). If group is provided, the sampling is done based on the groups and the average of y per group. This function is used to create calibration and validation groups for leave-group-out cross-validations (or leave-group-of-groups-out cross-validation if group argument is provided).

Usage

sample_stratified(y, p, number, group = NULL, replacement = FALSE, seed = NULL)

Arguments

y

a matrix of one column with the response variable.

p

the percentage of samples (or groups if group argument is used) to retain in the validation_indices set

number

the number of sample groups to be created

group

the labels for each sample in y indicating the group each observation belongs to.

replacement

a logical indicating whether sampling with replacement is required for the calibration set.

seed

an integer for random number generator (default NULL).

Value

a list with two matrices (hold_in and hold_out) giving the indices of the observations in each column. The number of columns represents the number of sampling repetitions.
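The idea of stratifying on a continuous response can be sketched in base R. This is only an illustration of the technique: the function stratified_split, its argument p_hold_in (taken here as the proportion of observations kept in hold_in) and the way strata are formed are assumptions, not the internal implementation of sample_stratified. Groups and replacement are not covered.

```r
stratified_split <- function(y, p_hold_in, number, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n <- length(y)
  n_out <- n - floor(p_hold_in * n)     # observations held out per repetition
  # Strata: consecutive blocks of the observations ordered by y
  strata <- split(order(y), cut(seq_len(n), breaks = n_out, labels = FALSE))
  hold_out <- sapply(seq_len(number), function(i) {
    # One observation drawn at random from each stratum
    vapply(strata, function(idx) idx[sample.int(length(idx), 1)], integer(1))
  })
  hold_in <- apply(hold_out, 2, function(out) setdiff(seq_len(n), out))
  list(hold_in = hold_in, hold_out = hold_out)
}

set.seed(2)
y <- rnorm(40)
folds <- stratified_split(y, p_hold_in = 0.75, number = 10, seed = 123)

dim(folds$hold_in)    # one column per repetition
dim(folds$hold_out)
```

Drawing one observation per stratum keeps the distribution of y in each hold-out set close to that of the full set.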

A function for searching in a given reference set the neighbors of another given set of observations (search_neighbors)

Description

This function searches in a reference set the neighbors of the observations provided in another set.

Usage

search_neighbors(Xr, Xu, diss_method = c("pca", "pca.nipals", "pls", "mpls",
                                         "cor", "euclid", "cosine", "sid"),
                 Yr = NULL, k, k_diss, k_range, spike = NULL,
                 pc_selection = list("var", 0.01),
                 return_projection = FALSE, return_dissimilarity = FALSE,
                 ws = NULL,
                 center = TRUE, scale = FALSE,
                 documentation = character(), ...)

Arguments

Xr

a matrix of reference (spectral) observations where the neighbor search is to be conducted. See details.

Xu

an optional matrix of (spectral) observations for which its neighbors are to be searched in Xr. Default is NULL. See details.

diss_method

a character string indicating the spectral dissimilarity metric to be used in the selection of the nearest neighbors of each observation.

"pca": Mahalanobis distance computed on the matrix of scores of a Principal Component (PC) projection of Xr (and Xu if supplied). PC projection is done using the singular value decomposition (SVD) algorithm. See ortho_diss function.
"pca.nipals": Mahalanobis distance computed on the matrix of scores of a Principal Component (PC) projection of Xr (and Xu if supplied). PC projection is done using the non-linear iterative partial least squares (nipals) algorithm. See ortho_diss function.
"pls": Mahalanobis distance computed on the matrix of scores of a partial least squares projection of Xr (and Xu if supplied). In this case, Yr is always required. See ortho_diss function.
"mpls": Mahalanobis distance computed on the matrix of scores of a modified partial least squares projection (Shenk and Westerhaus, 1991; Westerhaus, 2014) of Xr (and Xu if provided). In this case, Yr is always required. See ortho_diss function.
"cor": correlation coefficient between observations. See cor_diss function.
"euclid": Euclidean distance between observations. See f_diss function.
"cosine": Cosine distance between observations. See f_diss function.
"sid": spectral information divergence between observations. See sid function.

Yr

a numeric matrix of n observations used as side information of Xr for the ortho_diss methods (i.e. pca, pca.nipals or pls). It is required when:

diss_method = "pls"
diss_method = "pca" with "opc" used as the method in the pc_selection argument. See ortho_diss().

k

an integer value indicating the k-nearest neighbors of each observation in Xu that must be selected from Xr.

k_diss

k_range

an integer vector of length 2 which specifies the minimum (first value) and the maximum (second value) number of neighbors to be retained when the k_diss is given.

spike

a vector of integers (with positive and/or negative values) indicating what observations in Xr (and Yr) must be forced into or avoided in the neighborhoods.

pc_selection

a list of length 2 to be passed onto the ortho_diss methods. It is required if the method selected in diss_method is any of "pca", "pca.nipals" or "pls". This argument is used for optimizing the number of components (principal components or pls factors) to be retained. This list must contain two elements in the following order: method (a character indicating the method for selecting the number of components) and value (a numerical value that complements the selected method). The methods available are:

"opc": optimized principal component selection based on Ramirez-Lopez et al. (2013a, 2013b). The optimal number of components (of a set of observations) is the one for which its distance matrix minimizes the differences between the Yr value of each observation and the Yr value of its closest observation. In this case value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the maximum number of principal components to be tested. See the ortho_projection function for more details.
"cumvar": selection of the principal components based on a given cumulative amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of cumulative variance that the combination of retained components should explain.
"var": selection of the principal components based on a given amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of variance that a single component should explain in order to be retained.
"manual": for manually specifying a fixed number of principal components. In this case, value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the number of components to retain.

The default is list(method = "var", value = 0.01).

return_projection

a logical indicating if the projection(s) must be returned. Projections are used if the ortho_diss methods are called (i.e. method = "pca", method = "pca.nipals" or method = "pls").

return_dissimilarity

a logical indicating if the dissimilarity matrix used for neighbor search must be returned.

ws

an odd integer value which specifies the window size, when diss_method = cor (cor_diss method) for moving correlation dissimilarity. If ws = NULL (default), then the window size will be equal to the number of variables (columns), i.e. instead of moving correlation, the normal correlation will be used. See cor_diss function.

center

a logical indicating if the Xr and Xu matrices must be centered. If Xu is provided the data is centered around the mean of the pooled Xr and Xu matrices ($Xr \cup Xu$). For dissimilarity computations based on diss_method = pls, the data is always centered.

scale

a logical indicating if the Xr and Xu matrices must be scaled. If Xu is provided the data is scaled based on the standard deviation of the pooled Xr and Xu matrices ($Xr \cup Xu$). If center = TRUE, scaling is applied after centering.

documentation

an optional character string that can be used to describe anything related to the mbl call (e.g. description of the input data). Default: character(). NOTE: this is an experimental argument.

...

further arguments to be passed to the dissimilarity function. See details.

Details

This function may be especially useful when the reference set (Xr) is very large. In some cases the number of observations in the reference set can be reduced by removing irrelevant observations (i.e. observations that are not neighbors of a particular target set). For example, this function can be used to reduce the size of the reference set before running the mbl function.

This function uses the dissimilarity function to compute the dissimilarities between Xr and Xu. Arguments to dissimilarity as well as further arguments to the functions used inside dissimilarity (i.e. ortho_diss, cor_diss, f_diss, sid) can be passed to those functions as additional arguments (i.e. ...).

If no matrix is passed to Xu, the neighbor search is conducted for the observations in Xr that are found within that matrix. If a matrix is passed to Xu, the neighbors of Xu are searched in the Xr matrix.

Value

a list containing the following elements:

neighbors_diss: a matrix of the Xr dissimilarity scores corresponding to the neighbors of each Xr observation (or Xu observation, in case Xu was supplied). The neighbor dissimilarity scores are organized by columns and are sorted in ascending order.
neighbors: a matrix of the Xr indices corresponding to the neighbors of each observation in Xu. The neighbor indices are organized by columns and are sorted in ascending order by their dissimilarity score.
unique_neighbors: a vector of the indices in Xr identified as neighbors of any observation in Xr (or in Xu, in case it was supplied). This is obtained by converting the neighbors matrix into a vector and applying the unique function.
k_diss_info: a data.table that is returned only if the k_diss argument was used. It comprises three columns, the first one (Xr_index or Xu_index) indicates the index of the observations in Xr (or in Xu, in case it was supplied), the second column (n_k) indicates the number of neighbors found in Xr and the third column (final_n_k) indicates the final number of neighbors selected bounded by the k_range argument.
dissimilarity: if return_dissimilarity = TRUE, the dissimilarity object used (as computed by the dissimilarity function).
projection: an ortho_projection object. Only output if return_projection = TRUE and if diss_method = "pca", diss_method = "pca.nipals" or diss_method = "pls". This object contains the projection used to compute the dissimilarity matrix. In case of local dissimilarity matrices, the projection corresponds to the global projection used to select the neighborhoods (see ortho_diss function for further details).
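The structure of these outputs can be mimicked with a plain Euclidean dissimilarity in base R. The sketch is illustrative only: the simulated matrices are arbitrary, and treating k_diss as a dissimilarity threshold (with the value 2.5 chosen arbitrarily) is an assumption made here to show how k_range bounds the neighborhood size.

```r
set.seed(21)
Xr <- matrix(rnorm(30 * 5), nrow = 30)   # reference set
Xu <- matrix(rnorm(4 * 5), nrow = 4)     # target set

# Euclidean dissimilarities: rows are Xr observations, columns are Xu observations
sq <- outer(rowSums(Xr^2), rowSums(Xu^2), "+") - 2 * Xr %*% t(Xu)
diss <- sqrt(pmax(sq, 0))

# Fixed k: neighbor indices and dissimilarities, one column per Xu observation
k <- 5
neighbors <- apply(diss, 2, function(d) order(d)[seq_len(k)])
neighbors_diss <- apply(diss, 2, function(d) sort(d)[seq_len(k)])
unique_neighbors <- sort(unique(as.vector(neighbors)))

# Threshold-based selection bounded by k_range (assumed behaviour)
k_diss <- 2.5
k_range <- c(3, 10)
n_k <- colSums(diss <= k_diss)
final_n_k <- pmin(pmax(n_k, k_range[1]), k_range[2])
data.frame(Xu_index = seq_len(ncol(diss)), n_k = n_k, final_n_k = final_n_k)
```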

Author(s)

Leonardo Ramirez-Lopez.

References

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

Examples

library(resemble)
library(prospectr)
data(NIRsoil)

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

Xu <- Xu[!is.na(Yu), ]
Yu <- Yu[!is.na(Yu)]

Xr <- Xr[!is.na(Yr), ]
Yr <- Yr[!is.na(Yr)]

# Identify the neighbor observations using the correlation dissimilarity and
# default parameters
# (In this example all the observations in Xr belong at least to the
# first 100 neighbors of one observation in Xu)
ex1 <- search_neighbors(
  Xr = Xr, Xu = Xu,
  diss_method = "cor",
  k = 40
)

# Identify the neighbor observations using principal component (PC)
# and partial least squares (PLS) dissimilarities, and using the "opc"
# approach for selecting the number of components
ex2 <- search_neighbors(
  Xr = Xr, Xu = Xu,
  diss_method = "pca",
  Yr = Yr, k = 50,
  pc_selection = list("opc", 40),
  scale = TRUE
)

# Observations that do not belong to any neighborhood
seq(1, nrow(Xr))[!seq(1, nrow(Xr)) %in% ex2$unique_neighbors]

ex3 <- search_neighbors(
  Xr = Xr, Xu = Xu,
  diss_method = "pls",
  Yr = Yr, k = 50,
  pc_selection = list("opc", 40),
  scale = TRUE
)

# Observations that do not belong to any neighborhood
seq(1, nrow(Xr))[!seq(1, nrow(Xr)) %in% ex3$unique_neighbors]

# Identify the neighbor observations using local PC dissimilarities
# Here, 150 neighbors are used to compute a local dissimilarity matrix
# and then this matrix is used to select 50 neighbors
ex4 <- search_neighbors(
  Xr = Xr, Xu = Xu,
  diss_method = "pls",
  Yr = Yr, k = 50,
  pc_selection = list("opc", 40),
  scale = TRUE,
  .local = TRUE,
  pre_k = 150
)

A function for computing the spectral information divergence between spectra (sid)

Description

This function computes the spectral information divergence/dissimilarity between spectra based on the Kullback-Leibler divergence algorithm (see details).

Usage

sid(Xr, Xu = NULL,
    mode = "density",
    center = FALSE, scale = FALSE,
    kernel = "gaussian",
    n = if(mode == "density") round(0.5 * ncol(Xr)),
    bw = "nrd0",
    reg = 1e-04,
    ...)

Arguments

Xr

a matrix containing the spectral (reference) data.

Xu

an optional matrix containing the spectral data of a second set of observations.

mode

the method to be used for computing the spectral information divergence. Options are "density" (default) for computing the divergence values on the density distributions of the spectral observations, and "feature" for computing the divergence values on the spectral variables. See details.

center

a logical indicating if the computations must be carried out on the centred X and Xu (if specified) matrices. If mode = "feature" centring is not carried out since this option does not accept negative values which are generated after centring the matrices. Default is FALSE. See details.

scale

a logical indicating if the computations must be carried out on the variance scaled X and Xu (if specified) matrices. Default is FALSE.

kernel

if mode = "density" a character string indicating the smoothing kernel to be used. It must be one of "gaussian" (default), "rectangular", "triangular", "epanechnikov", "biweight", "cosine" or "optcosine". See the density function of the stats package.

n

if mode = "density" a numerical value indicating the number of equally spaced points at which the density is to be estimated. See the density function of the stats package for further details. Default is round(0.5 * ncol(X)).

bw

if mode = "density" a numerical value indicating the smoothing kernel bandwidth to be used. Optionally the character string "nrd0" can be used, it computes the bandwidth using the bw.nrd0 function of the stats package (see bw.nrd0). See the density and the bw.nrd0 functions for more details. By default "nrd0" is used, in this case the bandwidth is computed as bw.nrd0(as.vector(X)), if Xu is specified the bandwidth is computed as bw.nrd0(as.vector(rbind(X, Xu))).

reg

a numerical value larger than 0 which indicates a regularization parameter. Values (probabilities) below this threshold are replaced by this value for numerical stability. Default is 1e-4.

...

additional arguments to be passed to the density function of the stats package.

Details

This function computes the spectral information divergence (distance)between spectra.Whenmode = "density", the function first computes the probabilitydistribution of each spectrum which result in a matrix of densitydistribution estimates. The density distributions of all the observations inthe datasets are compared based on the kullback-leibler divergence algorithm.Whenmode = "feature", the kullback-leibler divergence between allthe observations is computed directly on the spectral variables.The spectral information divergence (SID) algorithm (Chang, 2000) uses theKullback-Leibler divergence ($KL$) or relative entropy(Kullback and Leibler, 1951) to account for the vis-NIR information providedby each spectrum. The SID between two spectra ($x_{i}$ and$x_{j}$) is computed as follows:

\[sid(x_{i},x_{j}) = KL(x_{i} \parallel x_{j}) + KL(x_{j} \parallel x_{i})\]

\[sid(x_{i},x_{j}) = \sum_{l=1}^{k} p_l \log\left(\frac{p_l}{q_l}\right) + \sum_{l=1}^{k} q_l \log\left(\frac{q_l}{p_l}\right)\]

where $k$ represents the number of variables or spectral features, and $p$ and $q$ are the probability vectors of $x_{i}$ and $x_{j}$ respectively, which are calculated as:

\[p = \frac{x_i}{\sum_{l=1}^{k} x_{i,l}}\]

\[q = \frac{x_j}{\sum_{l=1}^{k} x_{j,l}}\]
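The two expressions above can be sketched in a few lines of base R. This is only an illustration for a single pair of nonnegative spectra: `sid_pair` is a hypothetical helper and not a function of the package, and the way values below `reg` are replaced here is an assumption about how the regularization is applied.

```r
# Hypothetical helper (not part of resemble): SID between two nonnegative spectra
sid_pair <- function(xi, xj, reg = 1e-4) {
  # Convert each spectrum into a probability vector
  p <- xi / sum(xi)
  q <- xj / sum(xj)
  # Replace values below the regularization threshold (assumed behaviour)
  p[p < reg] <- reg
  q[q < reg] <- reg
  # Sum of the two Kullback-Leibler divergences, KL(p || q) + KL(q || p)
  sum(p * log(p / q)) + sum(q * log(q / p))
}

xi <- c(0.2, 0.4, 0.6, 0.8)
xj <- c(0.3, 0.3, 0.7, 0.7)

sid_pair(xi, xj) # divergence between two different spectra
sid_pair(xi, xi) # a spectrum compared with itself
```

Because the two divergences are added, the result does not depend on the order of the arguments, and a spectrum compared with itself gives zero.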

From the above equations it can be seen that the original SID algorithm assumes that all the components in the data matrices are nonnegative. Therefore centering cannot be applied when mode = "feature". If a data matrix with negative values is provided and mode = "feature", the sid function automatically scales the matrix as follows:

\[X_s = \frac{X-min(X)}{max(X)-min(X)}\]

or, if Xu is specified, as:

\[X_{s} = \frac{X-min(X, Xu)}{max(X, Xu)-min(X, Xu)}\]

\[Xu_{s} = \frac{Xu-min(X, Xu)}{max(X, Xu)-min(X, Xu)}\]

The 0 values are replaced by a regularization parameter (reg argument) for numerical stability.

The default of the sid function is to compute the SID based on the density distributions of the spectra (mode = "density"). For each spectrum in X the density distribution is computed using the density function of the stats package. The 0 values of the estimated density distributions of the spectra are replaced by a regularization parameter (reg argument) for numerical stability. Finally, the divergence between the computed spectral histograms is computed using the SID algorithm. Note that if mode = "density", the sid function accepts negative values and matrix centering is possible.
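The density-based mode can be illustrated with base R and the stats package only. The sketch below is not the package's implementation: evaluating all densities on one common grid (the `from` and `to` arguments) and rescaling each density so that it sums to one are assumptions made here so that the rows are comparable probability vectors.

```r
set.seed(1)
X <- matrix(rnorm(3 * 50), nrow = 3) # toy "spectra", negative values allowed

reg <- 1e-4
bw <- bw.nrd0(as.vector(X))  # default bandwidth described for bw = "nrd0"
n <- round(0.5 * ncol(X))    # default number of evaluation points
grid_range <- range(X)       # common grid for all rows (assumption)

# One density estimate per spectrum (rows of dens)
dens <- t(apply(X, 1, function(x) {
  d <- density(
    x, bw = bw, kernel = "gaussian", n = n,
    from = grid_range[1], to = grid_range[2]
  )$y
  d <- d / sum(d)    # rescale to a probability vector (assumption)
  d[d < reg] <- reg  # regularize (near-)zero values
  d
}))

# SID between every pair of density distributions
sid_dens <- matrix(0, nrow(dens), nrow(dens))
for (i in seq_len(nrow(dens))) {
  for (j in seq_len(nrow(dens))) {
    p <- dens[i, ]
    q <- dens[j, ]
    sid_dens[i, j] <- sum(p * log(p / q)) + sum(q * log(q / p))
  }
}
sid_dens
```

Since the divergence is computed on the estimated distributions rather than on the spectral values themselves, negative or centered inputs pose no problem in this mode.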

Value

a list with the following components:

sid: if only "X" is specified (i.e. Xu = NULL), a square symmetric matrix of SID distances between all the components in "X". If both "X" and "Xu" are specified, a matrix of SID distances between the components in "X" and the components in "Xu", where the rows represent the objects in "X" and the columns represent the objects in "Xu"
Xr: the (centered and/or scaled if specified) spectral X matrix
Xu: the (centered and/or scaled if specified) spectral Xu matrix
densityDisXr: if mode = "density", the computed density distributions of Xr
densityDisXu: if mode = "density", the computed density distributions of Xu

Author(s)

Leonardo Ramirez-Lopez

References

Chang, C.I. 2000. An information theoretic-based approach to spectral variability, similarity and discriminability for hyperspectral image analysis. IEEE Transactions on Information Theory 46, 1927-1932.

Examples

library(prospectr)
data(NIRsoil)

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

Xu <- Xu[!is.na(Yu), ]
Xr <- Xr[!is.na(Yr), ]

# Example 1
# Compute the SID distance between all the observations in Xr
xr_sid <- sid(Xr)
xr_sid

# Example 2
# Compute the SID distance between the observations in Xr and the observations
# in Xu
xr_xu_sid <- sid(Xr, Xu)
xr_xu_sid

A function for evaluating dissimilarity matrices (sim_eval)

Description

This function searches for the most similar observation (closest neighbor) of each observation in a given dataset based on a dissimilarity (e.g. a distance matrix). The observations are compared against their corresponding closest observations in terms of the side information provided. The root mean square of differences and the correlation coefficient are used for continuous variables, and the kappa index is used for discrete variables.

Usage

sim_eval(d, side_info)

Arguments

d

a symmetric matrix of dissimilarity scores between observations of a given dataset. Alternatively, a vector with the dissimilarity scores of the lower triangle (without the diagonal values) can be used (see details).

side_info

a matrix containing the side information corresponding to the observations in the dataset from which the dissimilarity matrix was computed. It can be either a numeric matrix with one or multiple columns/variables or a matrix with one character variable (discrete variable). If it is numeric, the root mean square of differences is used for assessing the similarity between the observations and their corresponding most similar observations in terms of the side information provided. If it is a character variable, then the kappa index is used. See details.

Details

For the evaluation of dissimilarity matrices this function uses side information (information about one variable which is available for a group of observations, Ramirez-Lopez et al., 2013). It is assumed that there is a (direct or indirect) correlation between this side informative variable and the variables from which the dissimilarity was computed.

If side_info is numeric, the root mean square of differences (RMSD) is used for assessing the similarity between the observations and their corresponding most similar observations in terms of the side information provided. It is computed as follows:

\[j(i) = NN(xr_i, Xr^{-i})\]

\[RMSD = \sqrt{\frac{1}{m} \sum_{i=1}^{m} {(y_i - y_{j(i)})^2}}\]

where $NN(xr_i, Xr^{-i})$ represents a function to obtain the index of the nearest neighbor observation found in $Xr$ (excluding the $i$th observation) for $xr_i$, $y_{i}$ is the value of the side variable of the $i$th observation, $y_{j(i)}$ is the value of the side variable of the nearest neighbor of the $i$th observation and $m$ is the total number of observations.
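The nearest-neighbor search and the RMSD can be sketched with base R as follows. The data are simulated, a plain Euclidean distance is used only for illustration, and the object names are hypothetical; this is not the code used inside sim_eval.

```r
set.seed(42)
Xr <- matrix(rnorm(20 * 5), nrow = 20)  # toy predictor matrix
y <- rowSums(Xr) + rnorm(20, sd = 0.1)  # toy side information

d <- as.matrix(dist(Xr))      # symmetric dissimilarity matrix
diag(d) <- Inf                # exclude each observation from its own search
nn <- apply(d, 1, which.min)  # j(i): index of the nearest neighbor of i

rmsd <- sqrt(mean((y - y[nn])^2)) # root mean square of differences
r <- cor(y, y[nn])                # correlation with the neighbors' values

c(rmsd = rmsd, r = r)
```

A dissimilarity that reflects the side variable well tends to give a low RMSD and a high correlation between each observation and its nearest neighbor.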

If side_info is a factor, the kappa index ($\kappa$) is used instead of the RMSD. It is computed as follows:

\[\kappa = \frac{p_{o}-p_{e}}{1-p_{e}}\]

where $p_o$ and $p_e$ are two different agreement indices between the side information of the observations and the side information of their corresponding nearest observations (i.e. most similar observations). While $p_o$ is the relative agreement, $p_e$ is the agreement expected by chance.
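As an illustration of the kappa index, the sketch below computes $p_o$ and $p_e$ from a cross-tabulation of a made-up categorical side variable against the values of its nearest neighbors. Both vectors are invented for the example, and the chance agreement follows the usual Cohen's kappa definition, which is assumed to be the one intended here.

```r
# Made-up side information and the values found at the nearest neighbors
side <- c("a", "a", "b", "b", "b", "c", "c", "a")
nn_side <- c("a", "b", "b", "b", "c", "c", "c", "a")

# Cross-tabulate with a common set of levels so the table is square
lev <- sort(unique(c(side, nn_side)))
tab <- table(factor(side, levels = lev), factor(nn_side, levels = lev))

p_o <- sum(diag(tab)) / sum(tab)                      # relative agreement
p_e <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # agreement by chance
kappa_index <- (p_o - p_e) / (1 - p_e)

kappa_index
```

Values close to 1 indicate that observations and their nearest neighbors mostly share the same class, while values near 0 indicate agreement no better than chance.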

This function also accepts a vector for the argument d. In this case, the vector must represent the lower triangle of a dissimilarity matrix (e.g. as returned by the dist() function of the stats package).
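The relation between a full dissimilarity matrix and its lower-triangle vector can be checked with base R. The sketch uses simulated data and only the stats package; it shows the layout of the vector, not how sim_eval processes it internally.

```r
set.seed(7)
X <- matrix(rnorm(6 * 3), nrow = 6)

d_obj <- dist(X)          # "dist" object: lower triangle, no diagonal
d_vec <- as.vector(d_obj) # plain vector, stored column by column
d_mat <- as.matrix(d_obj) # full symmetric matrix

length(d_vec) # 6 * 5 / 2 entries
all.equal(d_vec, d_mat[lower.tri(d_mat)])
```

For a dataset with $m$ observations the vector therefore holds $m(m-1)/2$ values.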

Value

sim_eval returns a list with the following components:

eval: either the RMSD (and the correlation coefficient) or the kappa index
first_nn: a matrix containing the original side informative variable in the first half of the columns, and the side informative values of the corresponding nearest neighbors in the second half of the columns.

Author(s)

Leonardo Ramirez-Lopez

References

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

Examples

library(prospectr)
data(NIRsoil)

sg <- savitzkyGolay(NIRsoil$spc, p = 3, w = 11, m = 0)

# Replace the original spectra with the filtered ones
NIRsoil$spc <- sg

Yr <- NIRsoil$Nt[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

# Example 1
# Compute a principal components distance
pca_d <- ortho_diss(Xr, pc_selection = list("manual", 8))$dissimilarity

# Example 1.1
# Evaluate the distance matrix on the basis of the
# side information (Yr) associated with Xr
se <- sim_eval(pca_d, side_info = as.matrix(Yr))

# The final evaluation results
se$eval

# The final values of the side information (Yr) and the values of
# the side information corresponding to the first nearest neighbors
# found by using the distance matrix
se$first_nn

# Example 1.2
# Evaluate the distance matrix on the basis of two side
# information variables (Yr and Yr_2) associated with Xr
Yr_2 <- NIRsoil$CEC[as.logical(NIRsoil$train)]
se_2 <- sim_eval(d = pca_d, side_info = cbind(Yr, Yr_2))

# The final evaluation results
se_2$eval

# The final values of the side information variables and the values
# of the side information variables corresponding to the first
# nearest neighbors found by using the distance matrix
se_2$first_nn

# Example 2
# Evaluate the distances produced by retaining different number of
# principal components (this is the same principle used in the
# optimized principal components approach ("opc"))

# first project the data
pca_2 <- ortho_projection(Xr, pc_selection = list("manual", 30))

results <- matrix(NA, pca_2$n_components, 3)
colnames(results) <- c("pcs", "rmsd", "r")
results[, 1] <- 1:pca_2$n_components
for (i in 1:pca_2$n_components) {
  ith_d <- f_diss(pca_2$scores[, 1:i, drop = FALSE], scale = TRUE)
  ith_eval <- sim_eval(ith_d, side_info = as.matrix(Yr))
  results[i, 2:3] <- as.vector(ith_eval$eval)
}
plot(results)

# Example 3
# Example 3.1
# Evaluate a dissimilarity matrix computed using the correlation
# method
cd <- cor_diss(Xr)
eval_corr_diss <- sim_eval(cd, side_info = as.matrix(Yr))
eval_corr_diss$eval

Square root of (square) symmetric matrices

Description

For internal use only

Usage

sqrt_sm(X, method = c("svd", "eigen"))

A function to compute row-wise index of minimum values of a square distance matrix

Description

For internal use only

Usage

which_min(X)

Arguments

X

a square matrix of distances

Details

Used internally to find the nearest neighbors

Value

a vector of the indices of the minimum value in each row of the input matrix

Author(s)

Antoine Stevens

A function to compute indices of minimum values of a distance vector

Description

For internal use only

Usage

which_min_vector(X)

Arguments

X

a vector of distances

Details

Used internally to find the nearest neighbors. It searches in a lower (or upper) triangular matrix, therefore this must be the format of the input data. The piece of code int len = (sqrt(X.size()*8+1)+1)/2 generated an error in CRAN since sqrt cannot be applied to integers.
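The expression quoted above recovers the number of observations from the length of the triangular vector. The base R sketch below illustrates this with a made-up vector, assuming the column-by-column lower-triangle layout produced by dist(); it rebuilds the full matrix for clarity, which the internal C++ routine may well avoid.

```r
# Lower triangle of a 4 x 4 dissimilarity matrix, stored column by column:
# (2,1), (3,1), (4,1), (3,2), (4,2), (4,3)
d_vec <- c(2, 5, 1, 4, 3, 6)

# Number of observations recovered from the vector length
n <- (sqrt(8 * length(d_vec) + 1) + 1) / 2

# Rebuild the symmetric matrix and search for the nearest neighbors
d_mat <- matrix(0, n, n)
d_mat[lower.tri(d_mat)] <- d_vec
d_mat <- d_mat + t(d_mat)
diag(d_mat) <- Inf            # an observation is not its own neighbor
nn <- apply(d_mat, 1, which.min)

n
nn
```

The formula is the positive root of $len = n(n-1)/2$ solved for $n$.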

Value

a vector of the indices of the nearest neighbors

Author(s)

Antoine Stevens
