Type: Package
Title: Memory-Based Learning in Spectral Chemometrics
Version: 2.2.5
Date: 2025-10-16
Maintainer: Leonardo Ramirez-Lopez <ramirez.lopez.leo@gmail.com>
BugReports: https://github.com/l-ramirez-lopez/resemble/issues
Description: Functions for dissimilarity analysis and memory-based learning (MBL, a.k.a. local modeling) in complex spectral data sets. Most of these functions are based on the methods presented in Ramirez-Lopez et al. (2013) <doi:10.1016/j.geoderma.2012.12.014>.
License: MIT + file LICENSE
URL: http://l-ramirez-lopez.github.io/resemble/
Depends: R (≥ 3.5.0)
Imports: foreach, iterators, Rcpp (≥ 1.0.3), mathjaxr (≥ 1.0), magrittr (≥ 1.5.0), lifecycle (≥ 0.2.0), data.table (≥ 1.9.8)
Suggests: prospectr, parallel, doParallel, testthat, formatR, rmarkdown, bookdown, knitr
LinkingTo: Rcpp, RcppArmadillo
RdMacros: mathjaxr
VignetteBuilder: knitr
NeedsCompilation: yes
Repository: CRAN
RoxygenNote: 7.3.2
Encoding: UTF-8
Config/VersionName: dstatements
Packaged: 2025-10-17 18:47:43 UTC; leo
Author: Leonardo Ramirez-Lopez [aut, cre], Antoine Stevens [aut, ctb], Claudio Orellano [ctb], Raphael Viscarra Rossel [ctb], Alex Wadoux [ctb]
Date/Publication: 2025-10-17 19:20:02 UTC

Overview of the functions in the resemble package

Description

Maturing lifecycle

Functions for memory-based learning

Details

This is version 2.2.5 (dstatements) of the package. It implements a number of functions useful for modeling complex spectral data (e.g. NIR, IR). The package includes functions for dimensionality reduction, computing spectral dissimilarity matrices, nearest neighbor search, and modeling spectral data using memory-based learning. This package builds upon the methods presented in Ramirez-Lopez et al. (2013) doi:10.1016/j.geoderma.2012.12.014.

Development versions can be found in the GitHub repository of the package at https://github.com/l-ramirez-lopez/resemble.

The functions available for dimensionality reduction are:

The functions available for computing dissimilarity matrices are:

The functions available for evaluating dissimilarity matrices are:

The functions available for nearest neighbor search:

The functions available for modeling spectral data:

Other supplementary functions:

Author(s)

Maintainer / Creator: Leonardo Ramirez-Lopez ramirez.lopez.leo@gmail.com

Authors:

References

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.

See Also

Useful links:


Print method for an object of class local_ortho_diss

Description

Prints subsets of local_ortho_diss objects.

Usage

## S3 method for class 'local_ortho_diss'
x[rows, columns, drop = FALSE, ...]

Arguments

x

local_ortho_diss matrix

rows

the indices of the rows

columns

the indices of the columns

drop

drop argument

...

not used


checks the pc_selection argument

Description

internal

Usage

check_pc_arguments(
  n_rows_x,
  n_cols_x,
  pc_selection,
  default_max_comp = 40,
  default_max_cumvar = 0.99,
  default_max_var = 0.01
)

Correlation and moving correlation dissimilarity measurements (cor_diss)

Description

Stable lifecycle

Computes correlation and moving correlation dissimilarity matrices.

Usage

cor_diss(Xr, Xu = NULL, ws = NULL,
         center = TRUE, scale = FALSE)

Arguments

Xr

a matrix.

Xu

an optional matrix containing data of a second set of observations.

ws

for moving correlation dissimilarity, an odd integer value which specifies the window size. If ws = NULL, then the window size will be equal to the number of variables (columns), i.e. the normal correlation will be used instead of the moving correlation. See details.

center

a logical indicating if the spectral data Xr (and Xu if specified) must be centered. If Xu is provided, the data is centered on the basis of \(Xr \cup Xu\).

scale

a logical indicating if Xr (and Xu if specified) must be scaled. If Xu is provided, the data is scaled on the basis of \(Xr \cup Xu\).

Details

The correlation dissimilarity \(d\) between two observations \(x_i\) and \(x_j\) is based on Pearson's correlation coefficient (\(\rho\)) and it can be computed as follows:

\[d(x_i, x_j) = \frac{1}{2}(1 - \rho(x_i, x_j))\]

The above formula is used when ws = NULL. On the other hand (when ws != NULL) the moving correlation dissimilarity between two observations \(x_i\) and \(x_j\) is computed as follows:

\[d(x_i, x_j; ws) = \frac{1}{2 ws}\sum_{k=1}^{p-ws}\left(1 - \rho(x_{i,(k:k+ws)}, x_{j,(k:k+ws)})\right)\]

where \(ws\) represents a given window size which rolls sequentially from 1 up to \(p - ws\) and \(p\) is the number of variables of the observations.

The function does not accept input data containing missing values.
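The two formulas above can be transcribed almost literally into base R. The following is only a sketch for two single observations: the helper names cor_dissimilarity and moving_cor_dissimilarity are hypothetical, and it is not the compiled implementation used by cor_diss.

```r
# Correlation dissimilarity between two observations (the ws = NULL case)
cor_dissimilarity <- function(xi, xj) {
  0.5 * (1 - cor(xi, xj))
}

# Moving correlation dissimilarity, written exactly as in the formula:
# windows k:(k + ws) for k = 1, ..., p - ws, normalised by 2 * ws
moving_cor_dissimilarity <- function(xi, xj, ws) {
  p <- length(xi)
  stopifnot(length(xj) == p, ws < p)
  terms <- vapply(seq_len(p - ws), function(k) {
    idx <- k:(k + ws)
    1 - cor(xi[idx], xj[idx])
  }, numeric(1))
  sum(terms) / (2 * ws)
}

set.seed(1)
xi <- cumsum(rnorm(50))          # a smooth-ish synthetic "spectrum"
xj <- xi + rnorm(50, sd = 0.1)   # a noisy copy of it

d_same <- cor_dissimilarity(xi, xi)       # identical observations
d_opposite <- cor_dissimilarity(xi, -xi)  # perfectly anti-correlated
d_moving <- moving_cor_dissimilarity(xi, xj, ws = 11)
```

Identical observations give a dissimilarity of 0 and perfectly anti-correlated ones give 1, which is why the factor 1/2 appears in the definition.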

Value

a matrix of the computed dissimilarities.

Author(s)

Antoine Stevens and Leonardo Ramirez-Lopez

Examples

library(prospectr)
data(NIRsoil)

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

cor_diss(Xr = Xr)

cor_diss(Xr = Xr, Xu = Xu)

cor_diss(Xr = Xr, ws = 41)

cor_diss(Xr = Xr, Xu = Xu, ws = 41)

From dissimilarity matrix to neighbors

Description

internal

Usage

diss_to_neighbors(
  diss_matrix,
  k = NULL,
  k_diss = NULL,
  k_range = NULL,
  spike = NULL,
  return_dissimilarity = FALSE,
  skip_first = FALSE
)

Arguments

diss_matrix

a matrix representing the dissimilarities between observations in a matrix Xu and observations in another matrix Xr. Xr in rows, Xu in columns.

k

an integer value indicating the k-nearest neighbors of each observation in Xu that must be selected from Xr.

k_diss

an integer value indicating a dissimilarity threshold. For each observation in Xu, its nearest neighbors in Xr are selected as those for which their dissimilarity to Xu is below this k_diss threshold. This threshold depends on the corresponding dissimilarity metric specified in diss_method. Either k or k_diss must be specified.

k_range

an integer vector of length 2 which specifies the minimum (first value) and the maximum (second value) number of neighbors to be retained when k_diss is given.

spike

a vector of integers indicating which observations in Xr (and Yr) must be 'forced' to always be part of all the neighborhoods.

return_dissimilarity

a logical indicating if the input dissimilarity must be mirrored in the output.

skip_first

a logical indicating whether to skip the first neighbor or not. Default is FALSE. This is used when the search is being conducted in a symmetric matrix of distances (i.e. to avoid that the nearest neighbor of each observation is itself).
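As an illustration of how k, k_diss and k_range interact, the sketch below selects neighbors from a small dissimilarity matrix with Xr in rows and Xu in columns. The function select_neighbors is a hypothetical helper written for this example, and clamping the neighborhood size to k_range is an assumption based on the argument descriptions above, not the internal code of diss_to_neighbors.

```r
# Hypothetical helper: indices of the Xr neighbors (rows) of every Xu
# observation (columns), either by a fixed k or by a threshold k_diss
select_neighbors <- function(diss_matrix, k = NULL, k_diss = NULL,
                             k_range = NULL) {
  lapply(seq_len(ncol(diss_matrix)), function(j) {
    ord <- order(diss_matrix[, j])
    if (!is.null(k)) {
      return(ord[seq_len(k)])
    }
    n_below <- sum(diss_matrix[, j] < k_diss)
    # Assumption: the number of neighbors is clamped to k_range
    n_keep <- min(max(n_below, k_range[1]), k_range[2])
    ord[seq_len(n_keep)]
  })
}

d <- matrix(
  c(0.1, 0.5, 0.9, 0.2,   # dissimilarities of xu1 to xr1..xr4
    0.8, 0.3, 0.4, 0.7),  # dissimilarities of xu2 to xr1..xr4
  nrow = 4,
  dimnames = list(paste0("xr", 1:4), paste0("xu", 1:2))
)

by_k <- select_neighbors(d, k = 2)
by_threshold <- select_neighbors(d, k_diss = 0.35, k_range = c(2, 3))
```

For xu2 only one reference observation lies below the 0.35 threshold, so the lower bound in k_range forces a second neighbor to be retained.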


Dissimilarity computation between matrices

Description

This is a wrapper that integrates the different dissimilarity functions offered by the package. It computes the dissimilarities between observations in numerical matrices by using a specified dissimilarity measure.

Usage

dissimilarity(Xr, Xu = NULL,
              diss_method = c("pca", "pca.nipals", "pls", "mpls",
                              "cor", "euclid", "cosine", "sid"),
              Yr = NULL, gh = FALSE, pc_selection = list("var", 0.01),
              return_projection = FALSE, ws = NULL,
              center = TRUE, scale = FALSE, documentation = character(),
              ...)

Arguments

Xr

a matrix containing n observations/rows and p variables/columns.

Xu

an optional matrix containing data of a second set of observations with p variables/columns.

diss_method

a character string indicating the method to be used to compute the dissimilarities between observations. Options are:

  • "pca": Mahalanobis distance computed on the matrix of scores of a Principal Component (PC) projection of Xr (and Xu if provided). PC projection is done using the singular value decomposition (SVD) algorithm. See ortho_diss function.

  • "pca.nipals": Mahalanobis distance computed on the matrix of scores of a Principal Component (PC) projection of Xr (and Xu if provided). PC projection is done using the non-linear iterative partial least squares (nipals) algorithm. See ortho_diss function.

  • "pls": Mahalanobis distance computed on the matrix of scores of a partial least squares projection of Xr (and Xu if provided). In this case, Yr is always required. See ortho_diss function.

  • "mpls": Mahalanobis distance computed on the matrix of scores of a modified partial least squares projection (Shenk and Westerhaus, 1991; Westerhaus, 2014) of Xr (and Xu if provided). In this case, Yr is always required. See ortho_diss function.

  • "cor": based on the correlation coefficient between observations. See cor_diss function.

  • "euclid": Euclidean distance between observations. See f_diss function.

  • "cosine": Cosine distance between observations. See f_diss function.

  • "sid": spectral information divergence between observations. See sid function.

Yr

a numeric matrix of n observations used as side information of Xr for the ortho_diss methods (i.e. pca, pca.nipals or pls). It is required when:

  • diss_method = "pls"

  • diss_method = "pca" with "opc" used as the method in the pc_selection argument. See ortho_diss.

  • gh = TRUE

gh

a logical indicating if the Mahalanobis distance (in the pls score space) between each observation and the pls centre/mean must be computed.

pc_selection

a list of length 2 to be passed on to the ortho_diss methods. It is required if the method selected in diss_method is any of "pca", "pca.nipals" or "pls" or if gh = TRUE. This argument is used for optimizing the number of components (principal components or pls factors) to be retained. This list must contain two elements in the following order: method (a character indicating the method for selecting the number of components) and value (a numerical value that complements the selected method). The methods available are:

  • "opc": optimized principal component selection based on Ramirez-Lopez et al. (2013a, 2013b). The optimal number of components (of a set of observations) is the one for which its distance matrix minimizes the differences between the Yr value of each observation and the Yr value of its closest observation. In this case value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the maximum number of principal components to be tested. See the ortho_projection function for more details.

  • "cumvar": selection of the principal components based on a given cumulative amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of cumulative variance that the combination of retained components should explain.

  • "var": selection of the principal components based on a given amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of variance that a single component should explain in order to be retained.

  • "manual": for manually specifying a fixed number of principal components. In this case, value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the number of components to be retained.

The default is list(method = "var", value = 0.01).

Optionally, the pc_selection argument admits "opc" or "cumvar" or "var" or "manual" as a single character string. In such a case the default "value" when either "opc" or "manual" are used is 40. When "cumvar" is used the default "value" is set to 0.99 and when "var" is used, the default "value" is set to 0.01.
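To make the "var", "cumvar" and "manual" rules concrete, the sketch below applies them to the explained variance of a plain SVD-based PCA on simulated data. The helper n_components is hypothetical and only mirrors the descriptions above; the "opc" method is left out because it additionally needs the side information Yr.

```r
# Hypothetical helper: number of components retained for a vector of
# explained variance proportions (one entry per component, decreasing)
n_components <- function(explained,
                         method = c("var", "cumvar", "manual"),
                         value) {
  method <- match.arg(method)
  switch(method,
    # keep components that individually explain at least `value`
    var = sum(explained >= value),
    # keep the first components that jointly explain at least `value`
    cumvar = which(cumsum(explained) >= value)[1],
    # keep a fixed number of components
    manual = min(value, length(explained))
  )
}

set.seed(42)
# 10 simulated variables with decreasing variance
X <- matrix(rnorm(200 * 10), nrow = 200) %*% diag(10:1)
sv <- svd(scale(X, center = TRUE, scale = FALSE))$d
explained <- sv^2 / sum(sv^2)

n_var <- n_components(explained, "var", 0.01)
n_cumvar <- n_components(explained, "cumvar", 0.99)
n_manual <- n_components(explained, "manual", 3)
```

With "var" each component is judged on its own, whereas with "cumvar" components are accumulated until the requested share of variance is reached, so the two rules can return different numbers of components for the same data.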

return_projection

a logical indicating if the projection(s) must be returned. Projections are used if the ortho_diss methods are called (i.e. diss_method = "pca", diss_method = "pca.nipals" or diss_method = "pls") or when gh = TRUE. In case gh = TRUE and an ortho_diss method is used (in the diss_method argument), both projections are returned.

ws

an odd integer value which specifies the window size, when diss_method = "cor" (cor_diss method) for moving correlation dissimilarity. If ws = NULL (default), then the window size will be equal to the number of variables (columns), i.e. the normal correlation will be used instead of the moving correlation. See cor_diss function.

center

a logical indicating if Xr (and Xu if provided) must be centered. If Xu is provided the data is centered around the mean of the pooled Xr and Xu matrices (\(Xr \cup Xu\)). For dissimilarity computations based on diss_method = "pls", the data is always centered.

scale

a logical indicating if Xr (and Xu if provided) must be scaled. If Xu is provided the data is scaled based on the standard deviation of the pooled Xr and Xu matrices (\(Xr \cup Xu\)). If center = TRUE, scaling is applied after centering.

documentation

an optional character string that can be used to describe anything related to the mbl call (e.g. description of the input data). Default: character(). NOTE: this is an experimental argument.

...

other arguments passed to the dissimilarity functions (ortho_diss, cor_diss, f_diss or sid).

Details

This function is a wrapper for ortho_diss, cor_diss, f_diss and sid. Check the documentation of these functions for further details.

Value

A list with the following components:

Author(s)

Leonardo Ramirez-Lopez

References

Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.

Westerhaus, M. 2014. Eastern Analytical Symposium Award for outstanding achievements in near infrared spectroscopy: my contributions to near infrared spectroscopy. NIR news, 25(8), 16-20.

See Also

ortho_diss, cor_diss, f_diss, sid.

Examples

library(prospectr)
data(NIRsoil)

# Filter the data using the first derivative with Savitzky and Golay
# smoothing filter and a window size of 15 spectral variables and a
# polynomial order of 4
sg <- savitzkyGolay(NIRsoil$spc, m = 1, p = 4, w = 15)

# Replace the original spectra with the filtered ones
NIRsoil$spc <- sg

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]

Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

# Remove the observations with missing response values
Xu <- Xu[!is.na(Yu), ]
Xr <- Xr[!is.na(Yr), ]

Yu <- Yu[!is.na(Yu)]
Yr <- Yr[!is.na(Yr)]

dsm_pca <- dissimilarity(
  Xr = Xr, Xu = Xu,
  diss_method = c("pca"),
  Yr = Yr, gh = TRUE,
  pc_selection = list("opc", 30),
  return_projection = TRUE
)

A function for transforming a matrix from its Euclidean space to its Mahalanobis space

Description

For internal use only

Usage

euclid_to_mahal(X, sm_method = c("svd", "eigen"))

evaluation of multiple distances obtained with multiple PCs

Description

internal

Usage

eval_multi_pc_diss(
  scores,
  side_info,
  from = 1,
  to = ncol(scores),
  steps = 1,
  method = c("pc", "pls"),
  check_dims = TRUE
)

Euclidean, Mahalanobis and cosine dissimilarity measurements

Description

Stable lifecycle

This function is used to compute the dissimilarity between observations based on Euclidean or Mahalanobis distance measures or on cosine dissimilarity measures (a.k.a. spectral angle mapper).

Usage

f_diss(Xr, Xu = NULL, diss_method = "euclid",
       center = TRUE, scale = FALSE)

Arguments

Xr

a matrix containing the (reference) data.

Xu

an optional matrix containing data of a second set of observations (samples).

diss_method

the method for computing the dissimilarity between observations. Options are "euclid" (Euclidean distance), "mahalanobis" (Mahalanobis distance) and "cosine" (cosine distance, a.k.a. spectral angle mapper). See details.

center

a logical indicating if the spectral data Xr (and Xu if specified) must be centered. If Xu is provided, the data is centered on the basis of \(Xr \cup Xu\).

scale

a logical indicating if Xr (and Xu if specified) must be scaled. If Xu is provided, the data is scaled on the basis of \(Xr \cup Xu\).

Details

The results obtained for Euclidean dissimilarity are equivalent to those returned by the stats::dist() function, but are scaled differently: in f_diss, the squared dissimilarity scores are divided by the number of variables. In addition, f_diss is considerably faster (which can be advantageous when computing dissimilarities for very large matrices). See the examples section for a comparison between stats::dist() and f_diss.

In the case of both the Euclidean and Mahalanobis distances, the scaled dissimilarity matrix \(D\) between observations in a given matrix \(X\) is computed as follows:

\[d(x_i, x_j)^{2} = (x_i - x_j)M^{-1}(x_i - x_j)^{\mathrm{T}}\]

\[d_{scaled}(x_i, x_j) = \sqrt{\frac{1}{p}d(x_i, x_j)^{2}}\]

where \(p\) is the number of variables in \(X\), \(M\) is the identity matrix in the case of the Euclidean distance and the variance-covariance matrix of \(X\) in the case of the Mahalanobis distance. The Mahalanobis distance can also be viewed as the Euclidean distance after applying a linear transformation of the original variables. Such a linear transformation is done by using a factorization of the inverse covariance matrix as \(M^{-1} = W^{\mathrm{T}}W\), where \(W\) is merely the square root of \(M^{-1}\), which can be found by using a singular value decomposition.

Note that when attempting to compute the Mahalanobis distance on a dataset with highly correlated variables (e.g. spectral variables) the variance-covariance matrix may result in a singular matrix which cannot be inverted and therefore the distance cannot be computed. This is also the case when the number of observations in the dataset is smaller than the number of variables.

For the computation of the Mahalanobis distance, the mentioned method is used.

The cosine dissimilarity \(c\) between two observations \(x_i\) and \(x_j\) is computed as follows:

\[c(x_i, x_j) = \cos^{-1}\left(\frac{\sum_{k=1}^{p}x_{i,k} x_{j,k}}{\sqrt{\sum_{k=1}^{p} x_{i,k}^{2}} \sqrt{\sum_{k=1}^{p} x_{j,k}^{2}}}\right)\]

where \(p\) is the number of variables of the observations. The function does not accept input data containing missing values. NOTE: The computed distances are divided by the number of variables/columns in Xr.
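The view of the Mahalanobis distance as a Euclidean distance in a transformed space can be checked with a few lines of base R. This is a sketch only: the helper names are hypothetical, the square root of the inverse covariance matrix is obtained here with an eigendecomposition rather than an SVD, and none of it is the compiled code behind f_diss.

```r
# Scaled Euclidean dissimilarity: squared distances divided by p
scaled_euclid <- function(X) {
  sqrt(as.matrix(dist(X))^2 / ncol(X))
}

# Project X so that Euclidean distances in the new space equal
# Mahalanobis distances in the original one (W = M^{-1/2})
to_mahalanobis_space <- function(X) {
  e <- eigen(cov(X), symmetric = TRUE)
  W <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
  X %*% W
}

# Cosine dissimilarity (spectral angle) between two observations
cosine_diss <- function(xi, xj) {
  acos(sum(xi * xj) / (sqrt(sum(xi^2)) * sqrt(sum(xj^2))))
}

set.seed(7)
# 60 observations of 4 correlated variables
X <- matrix(rnorm(60 * 4), nrow = 60) %*% matrix(runif(16), 4, 4)

md <- scaled_euclid(to_mahalanobis_space(X))

# Reference value for one pair, from stats::mahalanobis (squared distance)
ref <- sqrt(mahalanobis(X[1, ], center = X[2, ], cov = cov(X)) / ncol(X))

angle <- cosine_diss(c(1, 0), c(0, 1))  # orthogonal vectors
```

The entry md[1, 2] agrees with ref up to numerical precision, which is the equivalence stated in the text. With as many correlated variables as a real spectrum, cov(X) would typically be singular and this transformation would fail, as the note above warns.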

Value

a matrix of the computed dissimilarities.

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

Examples

library(prospectr)
data(NIRsoil)

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

# Euclidean distances between all the observations in Xr
ed <- f_diss(Xr = Xr, diss_method = "euclid")

# Equivalence with the dist() function of base R
ed_dist <- (as.matrix(dist(Xr))^2 / ncol(Xr))^0.5
round(ed_dist - ed, 5)

# Comparing the computational time
iter <- 20
tm <- proc.time()
for (i in 1:iter) {
  f_diss(Xr)
}
f_diss_time <- proc.time() - tm

tm_2 <- proc.time()
for (i in 1:iter) {
  dist(Xr)
}
dist_time <- proc.time() - tm_2

f_diss_time
dist_time

# Euclidean distances between observations in Xr and observations in Xu
ed_xr_xu <- f_diss(Xr, Xu)

# Mahalanobis distance computed on the first 20 spectral variables
md_xr_xu <- f_diss(Xr[, 1:20], Xu[, 1:20], "mahalanobis")

# Cosine dissimilarity matrix
cdiss_xr_xu <- f_diss(Xr, Xu, "cosine")

A fast distance algorithm for two matrices written in C++

Description

Computes distances between two data matrices using "euclid", "cor" or "cosine".

Usage

fast_diss(X, Y, method)

Arguments

X

a matrix

Y

a matrix

method

a string with possible values "euclid", "cor", "cosine"

Value

a distance matrix

Author(s)

Antoine Stevens and Leonardo Ramirez-Lopez


A fast algorithm of (squared) Euclidean cross-distance for vectors written in C++

Description

A fast (parallel on Linux) algorithm of (squared) Euclidean cross-distance for vectors written in C++

Usage

fast_diss_vector(X)

Arguments

X

a vector.

Details

used internally in ortho_projection

Value

a vector of distance (lower triangle of the distance matrix, stored by column)

Author(s)

Antoine Stevens


Local multivariate regression

Description

internal

Usage

fit_and_predict(
  x,
  y,
  pred_method,
  scale = FALSE,
  weights = NULL,
  newdata,
  pls_c = NULL,
  CV = FALSE,
  tune = FALSE,
  number = 10,
  p = 0.75,
  group = NULL,
  noise_variance = 0.001,
  range_prediction_limits = TRUE,
  pls_max_iter = 1,
  pls_tol = 1e-06,
  modified = FALSE,
  seed = NULL
)

format internal messages

Description

internal

Usage

format_xr_xu_indices(xr_xu_names)

Arguments

xr_xu_names

the names of Xr and Xu


Cross validation for Gaussian process regression

Description

internal

Usage

gaussian_pr_cv(
  x,
  y,
  scale,
  weights = NULL,
  p = 0.75,
  number = 10,
  group = NULL,
  noise_variance = 0.001,
  retrieve = c("final_model", "none"),
  seed = NULL
)

Gaussian process regression with linear kernel (gaussian_process)

Description

Carries out a gaussian process regression with a linear kernel (dot product). For internal use only!
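For readers unfamiliar with the technique, a Gaussian process regression with a dot-product kernel can be sketched in base R as follows. The functions gpr_linear_fit and gpr_linear_predict are hypothetical names used only here, scaling is omitted, and the code is an assumption about the standard formulation (kernel \(K = XX^{\mathrm{T}}\), weights \((K + \sigma^{2}I)^{-1}Y\)), not a copy of the internal C++ routine.

```r
# Fit: solve (K + noisev * I) alpha = Y with the linear kernel K = X X^T
gpr_linear_fit <- function(X, Y, noisev = 0.001) {
  K <- tcrossprod(X)
  alpha <- solve(K + diag(noisev, nrow(X)), Y)
  list(alpha = alpha, X = X)
}

# Predict: kernel between new and training observations times alpha
gpr_linear_predict <- function(fit, newdata) {
  tcrossprod(newdata, fit$X) %*% fit$alpha
}

set.seed(3)
X <- matrix(rnorm(40 * 3), nrow = 40)
beta <- c(1.5, -2, 0.5)
Y <- X %*% beta + rnorm(40, sd = 0.01)

fit <- gpr_linear_fit(X, Y)
pred <- gpr_linear_predict(fit, X[1:5, , drop = FALSE])

# With a linear kernel the model is equivalent to ridge regression
coef_gpr <- crossprod(X, fit$alpha)
coef_ridge <- solve(crossprod(X) + diag(0.001, 3), crossprod(X, Y))
```

The last two lines illustrate a known property of the linear kernel: the implied regression coefficients coincide with those of a ridge regression whose penalty equals the noise variance.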

Usage

gaussian_process(X, Y, noisev, scale)

Arguments

X

a matrix of predictor variables

Y

a matrix with a single response variable

noisev

a value indicating the variance of the noise for Gaussian process regression. Default is 0.001.

scale

a logical indicating whether both the predictors and the response variable must be scaled to zero mean and unit variance.

Value

a list containing the following elements:

Author(s)

Leonardo Ramirez-Lopez


Internal Cpp function for performing leave-group-out cross-validations for Gaussian process regression

Description

For internal use only!

Usage

gaussian_process_cv(X, Y, mindices, pindices, noisev = 0.001,
  scale = TRUE, statistics = TRUE)

Arguments

X

a matrix of predictor variables.

Y

a matrix of a single response variable.

mindices

a matrix with n rows and m columns where m is equivalent to the number of resampling iterations. The elements of each column indicate the indices of the observations to be used for modeling at each iteration.

pindices

a matrix with k rows and m columns where m is equivalent to the number of resampling iterations. The elements of each column indicate the indices of the observations to be used for predicting at each iteration.

scale

a logical indicating whether both the predictors and the response variable must be scaled to zero mean and unit variance.

statistics

a logical value indicating whether the precision and accuracy statistics are to be returned, otherwise the predictions for each validation segment are retrieved.
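The layout of mindices and pindices may be easier to picture with a small example. The sketch below builds two such matrices by simple random sampling; make_cv_indices is a hypothetical helper, and plain random sampling is an assumption made for brevity (the package also documents stratified sampling helpers, which are not reproduced here).

```r
# Hypothetical helper: index matrices for leave-group-out cross-validation.
# Each column corresponds to one resampling iteration.
make_cv_indices <- function(n, p = 0.75, number = 10, seed = NULL) {
  if (!is.null(seed)) {
    set.seed(seed)
  }
  n_model <- floor(n * p)
  # observations used for modeling at each iteration
  mindices <- replicate(number, sort(sample.int(n, n_model)))
  # the remaining observations are used for predicting
  pindices <- apply(mindices, 2, function(idx) setdiff(seq_len(n), idx))
  list(mindices = mindices, pindices = pindices)
}

idx <- make_cv_indices(n = 20, p = 0.75, number = 10, seed = 1)
dim(idx$mindices)  # 15 rows (modeling), 10 columns (iterations)
dim(idx$pindices)  #  5 rows (predicting), 10 columns (iterations)
```

Within every column the two index sets are disjoint and together cover all observations.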

Value

a list containing the following one-row matrices:

Author(s)

Leonardo Ramirez-Lopez


Function for identifying the column in a matrix with the largest standard deviation

Description

Identifies the column with the largest standard deviation. For internal use only!

Usage

get_col_largest_sd(X)

Arguments

X

a matrix.

Value

a value indicating the index of the column with the largest standard deviation.

Author(s)

Leonardo Ramirez-Lopez


Standard deviation of columns

Description

For internal use only!

Usage

get_col_sds(x)

Function for computing the mean of each column in a matrix

Description

Computes the mean of each column in a matrix. For internal use only!

Usage

get_column_means(X)

Arguments

X

a matrix.

Value

a vector of mean values.

Author(s)

Leonardo Ramirez-Lopez


Function for computing the standard deviation of each column in a matrix

Description

Computes the standard deviation of each column in a matrix. For internal use only!

Usage

get_column_sds(X)

Arguments

X

a matrix.

Value

a vector of standard deviation values.

Author(s)

Leonardo Ramirez-Lopez


Function for computing sum of each column in a matrix

Description

Computes the sum of each column in a matrix. For internal use only!

Usage

get_column_sums(X)

Arguments

X

a matrix.

Value

a vector of column sums.

Author(s)

Leonardo Ramirez-Lopez


get the evaluation results for categorical data

Description

internal

Usage

get_eval_categorical(y, indices_closest)

get the evaluation results for continuous data

Description

internal

Usage

get_eval_continuous(y, indices_closest)

A function to obtain the local neighbors based on dissimilarity matrices from orthogonal projections.

Description

internal function. This function is used to obtain the local neighbors based on dissimilarity matrices from orthogonal projections. These neighbors are obtained from an orthogonal projection on a set of precomputed neighbors. This function is used internally by the mbl function. ortho_diss(, .local = TRUE) operates in the same way; however, for mbl it is more efficient to do the re-search of the neighbors inside its main for loop.

Usage

get_ith_local_neighbors(
  ith_xr,
  ith_xu,
  ith_yr,
  ith_yu = NULL,
  diss_usage = "none",
  ith_neig_indices,
  k = NULL,
  k_diss = NULL,
  k_range = NULL,
  spike = NULL,
  diss_method,
  pc_selection,
  ith_group = NULL,
  center,
  scale,
  ...
)

Arguments

ith_xr

the set of neighbors of a Xu observation found in Xr

ith_xu

the Xu observation

ith_yr

the response values of the set of neighbors of the Xu observation found in Xr

ith_yu

the response value of the Xu observation

diss_usage

a character string indicating if the dissimilarity data will be used as predictors ("predictors") or not ("none").

ith_neig_indices

a vector of the original indices of the Xr neighbors.

k

the number of nearest neighbors to select from the already identified neighbors

k_diss

the distance threshold to select the neighbors from the already identified neighbors

k_range

a min and max number of allowed neighbors when k_diss is used

spike

a vector with the indices of the observations forced to be retained as neighbors. They have to be present in all the neighborhoods and at the top of neighbor_indices.

diss_method

the ortho_diss() method

pc_selection

the pc_selection argument as in ortho_diss()

ith_group

the vector containing the group labels of ith_xr.

center

center the data in the local diss computation?

scale

scale the data in the local diss computation?

Value

a list:

Author(s)

Leonardo Ramirez-Lopez


Internal Cpp function for computing the weights of the PLS components necessary for weighted average PLS

Description

For internal use only!

Usage

get_local_pls_weights(projection_mat,
                      xloadings,
                      coefficients,
                      new_x,
                      min_component,
                      max_component,
                      scale,
                      Xcenter,
                      Xscale)

Arguments

projection_mat

the projection matrix generated by the opls function.

xloadings

.

coefficients

the matrix of regression coefficients.

new_x

a matrix of one new spectra to be predicted.

min_component

an integer indicating the minimum number of pls components.

max_component

an integer indicating the maximum number of pls components.

scale

a logical indicating whether the matrix of predictors used to create the regression model was scaled.

Xcenter

a matrix of one row with the values that must be used for centering newdata.

Xscale

if scale = TRUE a matrix of one row with the values that must be used for scaling newdata.

Value

a matrix of one row with the weights for each component between the max. and min. specified.

Author(s)

Leonardo Ramirez-Lopez


A function to get the neighbor information

Description

This function gathers information of all neighborhoods of the Xu observations found in Xr. This information is required during local regressions.

Usage

get_neighbor_info(
  Xr,
  Xu,
  diss_method,
  Yr = NULL,
  k = NULL,
  k_diss = NULL,
  k_range = NULL,
  spike = NULL,
  pc_selection,
  return_dissimilarity,
  center,
  scale,
  gh,
  diss_usage,
  allow_parallel = FALSE,
  ...
)

Details

For local pca and pls distances, the local dissimilarity matrices are not computed as it is cheaper to compute them during the local regressions. Instead, the global distances (required for later local dissimilarity matrix computation) are output.


Extract predictions from an object of class mbl

Description

Stable lifecycle

Extract predictions from an object of class mbl

Usage

get_predictions(object)

Arguments

object

an object of class mbl as returned by mbl

Value

a data.table of predicted values according to either k or k_diss

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

See Also

mbl


A function to assign values to sample distribution strata

Description

for internal use only! This function takes a continuous variable, creates n strata based on its distribution and assigns the corresponding stratum to every value.

Usage

get_sample_strata(y, n = NULL, probs = NULL)

Arguments

y

a matrix of one column with the response variable.

n

the number of strata.

Value

a data table with the input y and the corresponding stratum for every value.


A function for stratified calibration/validation sampling

Description

for internal use only! This function selects samples based on provided strata.

Usage

get_samples_from_strata(
  y,
  original_order,
  strata,
  samples_per_strata,
  sampling_for = c("calibration", "validation"),
  replacement = FALSE
)
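One common way to build strata from the distribution of a continuous variable is to cut it at equally spaced quantiles. The sketch below does this in base R; assign_strata is a hypothetical helper and the use of quantile breaks is an assumption, since the internal rule of get_sample_strata is not described here.

```r
# Hypothetical helper: assign each value of y to one of n quantile strata
assign_strata <- function(y, n) {
  breaks <- unique(quantile(y, probs = seq(0, 1, length.out = n + 1)))
  data.frame(
    y = y,
    strata = cut(y, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  )
}

set.seed(11)
y <- rnorm(100)
strata <- assign_strata(y, n = 4)
counts <- table(strata$strata)  # number of observations per stratum
```

Sampling the same number of observations from every stratum then yields calibration or validation sets that follow the distribution of the response.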

Arguments

original_order

a matrix of one column with the response variable.

strata

the number of strata.

sampling_for

sampling to select the calibration samples ("calibration") or sampling to select the validation samples ("validation").

replacement

logical indicating if sampling with replacement must be done.

Value

a list with the indices of the calibration and validation samples.


Internal function for computing the weights of the PLS components necessary for weighted average PLS

Description

internal

Usage

get_wapls_weights(pls_model, original_x, type = "w1", new_x = NULL, pls_c)

Arguments

pls_model

either an object returned by the pls_cv function or an object as returned by the opls_get_basics function which contains a pls model.

original_x

the original spectral matrix which was used for calibrating the pls model.

type

type of weight to be computed. The only available option (for the moment) is "w1". See details on the mbl function where it is explained how "w1" is computed within the "wapls" regression.

new_x

a vector of a new spectral observation. When "w1" is selected, new_x must be specified.

pls_c

a vector of length 2 which contains both the minimum and maximum number of PLS components for which the weights must be computed.

Value

get_wapls_weights returns a vector of weights for each PLS component specified

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens


Computes the weights for pls regressions

Description

This is an internal function that computes the weights required for obtaining each vector of pls scores. Implementation is done in C++ for improved performance.

Usage

get_weights(X, Y, algorithm = "pls", xls_min_w = 3L, xls_max_w = 15L)

Arguments

X

a numeric matrix of spectral data.

Y

a matrix of one column with the response variable.

algorithm

a character string indicating what method to use. Options are: 'pls' for pls (using covariance between X and Y), 'mpls' for modified pls (using correlation between X and Y as in Shenk and Westerhaus, 1991; Westerhaus 2014) or 'xls' for extended pls (as implemented in BUCHI NIRWise PLUS software).

xls_min_w

an integer indicating the minimum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 3 (as in BUCHI NIRWise PLUS software).

xls_max_w

an integer indicating the maximum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 15 (as in BUCHI NIRWise PLUS software).

Value

a matrix of one column containing the weights.
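The difference between the 'pls' and 'mpls' options can be illustrated for a single component. In the sketch below, pls_weights is a hypothetical base R function, the normalisation of the weights to unit length is an assumption, and the 'xls' option is left out because its windowing rule is not described here.

```r
# Hypothetical helper: weight vector for one PLS component.
# "pls" uses the covariance between X and Y, "mpls" the correlation.
pls_weights <- function(X, Y, algorithm = c("pls", "mpls")) {
  algorithm <- match.arg(algorithm)
  w <- if (algorithm == "pls") cov(X, Y) else cor(X, Y)
  # Assumption: weights are normalised to unit length
  w / sqrt(sum(w^2))
}

set.seed(5)
X <- matrix(rnorm(50 * 6), nrow = 50) %*% diag(c(5, 3, 2, 1, 1, 0.5))
Y <- X[, 1, drop = FALSE] + 2 * X[, 4, drop = FALSE] + rnorm(50, sd = 0.1)

w_pls <- pls_weights(X, Y, "pls")
w_mpls <- pls_weights(X, Y, "mpls")

# On variables scaled to unit variance both options give the same direction
Xs <- scale(X)
w_pls_scaled <- pls_weights(Xs, Y, "pls")
w_mpls_scaled <- pls_weights(Xs, Y, "mpls")
```

Because the correlation removes the effect of each variable's variance, the modified weights are not dominated by high-variance variables; once the predictors are scaled to unit variance the two options coincide.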

Author(s)

Leonardo Ramirez-Lopez and Claudio Orellano

References

Shenk, J. S., & Westerhaus, M. O. (1991). Populations structuring of near infrared spectra and modified partial least squares regression. Crop Science, 31(6), 1548-1555.

Westerhaus, M. (2014). Eastern Analytical Symposium Award for outstanding achievements in near infrared spectroscopy: my contributions to near infrared spectroscopy. NIR news, 25(8), 16-20.


An iterator for local prediction data in mbl

Description

internal function. It collects only the data necessary to execute a local prediction for the mbl function based on a list of neighbors. Not valid for local dissimilarity (e.g. for ortho_diss(...., .local = TRUE))

Usage

ith_mbl_neighbor(
  Xr,
  Xu = NULL,
  Yr,
  Yu = NULL,
  diss_usage = "none",
  neighbor_indices,
  neighbor_diss = NULL,
  diss_xr_xr = NULL,
  group = NULL
)

Arguments

Xr

the Xr matrix in mbl.

Xu

the Xu matrix in mbl. Default NULL. If not provided, the function will iterate for each {Yr, Xr} to get the respective neighbors.

Yr

the Yr matrix in mbl.

Yu

the Yu matrix in mbl. Default NULL.

diss_usage

a character string indicating if the dissimilarity data will be used as predictors ("predictors") or not ("none").

neighbor_indices

a matrix with the indices of neighbors of every Xu found in Xr.

neighbor_diss

a matrix with the dissimilarity scores for the neighbors of every Xu found in Xr. This matrix is organized in the same way as neighbor_indices.

diss_xr_xr

a dissimilarity matrix between samples in Xr.

group

a factor representing the group labels of Xr.

Details

isubset will look at the order of knn in each col of D and re-organize the rows of x accordingly

Value

an object of class iterator giving the following list:

Author(s)

Leonardo Ramirez-Lopez
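
The idea of iterating over neighborhoods can be illustrated with a small closure-based iterator in base R. This is only a sketch and not the package's implementation: make_neighbor_iterator() and the element names ith_xr, ith_yr and ith_neighbors are hypothetical, and it assumes that each column of neighbor_indices holds the neighbors of one observation to be predicted.

```r
# Hypothetical stand-in for an iterator over local neighborhoods
# (make_neighbor_iterator() is NOT a resemble function).
make_neighbor_iterator <- function(Xr, Yr, neighbor_indices) {
  i <- 0L
  n <- ncol(neighbor_indices)
  list(
    has_next = function() i < n,
    next_element = function() {
      if (i >= n) stop("no more neighborhoods to iterate over")
      i <<- i + 1L
      idx <- neighbor_indices[, i]
      idx <- idx[!is.na(idx)]
      # only the rows needed for the i-th local prediction are returned
      list(
        ith_xr = Xr[idx, , drop = FALSE],
        ith_yr = Yr[idx, , drop = FALSE],
        ith_neighbors = idx
      )
    }
  )
}

set.seed(1)
Xr <- matrix(rnorm(40), nrow = 10)
Yr <- matrix(rnorm(10), ncol = 1)
# assumption: one column per observation to predict
neighbor_indices <- cbind(c(3, 1, 7), c(10, 2, 5))

it <- make_neighbor_iterator(Xr, Yr, neighbor_indices)
first <- it$next_element()
dim(first$ith_xr)
```

Returning only the neighbor rows at each step keeps the memory footprint of each local fit small, which is the purpose described above.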


iterator for nearest neighbor subsets

Description

internal

Usage

ith_subsets_ortho_diss(x, xu = NULL, y, kindx, na_rm = FALSE)

Arguments

x

a reference matrix

xu

a second matrix

y

a matrix of side information

kindx

a matrix of nearest neighbor indices

na_rm

logical indicating whether NAs must be removed.


Local fit functions

Description

These functions define the way in which each local fit/prediction is done within each iteration in the mbl function.

Usage

local_fit_pls(pls_c, modified = FALSE, max_iter = 100, tol = 1e-6)

local_fit_wapls(min_pls_c, max_pls_c, modified = FALSE,
                max_iter = 100, tol = 1e-6)

local_fit_gpr(noise_variance = 0.001)

Arguments

pls_c

an integer indicating the number of pls components to be used in the local regressions when the partial least squares (local_fit_pls) method is used.

modified

a logical indicating whether the modified version of the pls algorithm (Shenk and Westerhaus, 1991 and Westerhaus, 2014) is to be used. Default is FALSE.

max_iter

an integer indicating the maximum number of iterations in case tol is not reached. Default is 100.

tol

a numeric value indicating the convergence threshold for calculating the scores. Default is 1e-6.

min_pls_c

an integer indicating the minimum number of pls components to be used in the local regressions when the weighted average partial least squares (local_fit_wapls) method is used. See details.

max_pls_c

an integer indicating the maximum number of pls components to be used in the local regressions when the weighted average partial least squares (local_fit_wapls) method is used. See details.

noise_variance

a numeric value indicating the variance of the noise for Gaussian process local regressions (local_fit_gpr). Default is 0.001.

Details

These functions are used to indicate how to fit the regression models within the mbl function.

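There are three possible options for performing these regressions:

  • Partial least squares (local_fit_pls).

  • Weighted average partial least squares (local_fit_wapls).

  • Gaussian process regression (local_fit_gpr).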

The modified argument in the pls methods (local_fit_pls() and local_fit_wapls()) is used to indicate if a modified version of the pls algorithm (modified pls or mpls) is to be used. The modified pls was proposed by Shenk and Westerhaus (1991, see also Westerhaus, 2014) and it differs from the standard pls method in the way the weights of the predictors (used to compute the matrix of scores) are obtained. While pls uses the covariance between response(s) and predictors (and later their deflated versions corresponding to each pls component iteration) to obtain these weights, the modified pls uses the correlation as weights. The authors indicate that by using correlation, a larger portion of the response variable(s) can be explained.
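
The difference between covariance-based and correlation-based weights can be sketched in base R for the first component of a single response. This is an illustration under simple assumptions (simulated data, weights scaled to unit length) and not the package's C++ code; unit_norm() is a hypothetical helper.

```r
set.seed(42)
X <- matrix(rnorm(200), nrow = 20, ncol = 10)
X[, 1] <- X[, 1] * 50                      # one predictor with a much larger variance
y <- X[, 1] / 50 + X[, 2] + rnorm(20, sd = 0.1)

Xc <- scale(X, center = TRUE, scale = FALSE)
yc <- y - mean(y)

# hypothetical helper: scale a vector to unit length
unit_norm <- function(v) v / sqrt(sum(v^2))

# standard pls: weights proportional to the covariance between X and y
w_pls <- unit_norm(drop(crossprod(Xc, yc)))

# modified pls: weights given by the correlation between X and y
w_mpls <- unit_norm(drop(cor(X, y)))

round(cbind(pls = w_pls, mpls = w_mpls), 3)
```

In this simulated case the covariance-based weights are dominated by the high-variance predictor, whereas the correlation-based weights are not affected by the scale of each predictor.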

Value

An object of class local_fit mirroring the input arguments.

Author(s)

Leonardo Ramirez-Lopez

References

Shenk, J. S., & Westerhaus, M. O. 1991. Populations structuring of near infrared spectra and modified partial least squares regression. Crop Science, 31(6), 1548-1555.

Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.

Rasmussen, C.E., Williams, C.K. Gaussian Processes for Machine Learning. Massachusetts Institute of Technology: MIT-Press, 2006.

Westerhaus, M. 2014. Eastern Analytical Symposium Award for outstanding achievements in near infrared spectroscopy: my contributions to near infrared spectroscopy. NIR news, 25(8), 16-20.

See Also

mbl

Examples

local_fit_wapls(min_pls_c = 3, max_pls_c = 12)

local ortho dissimilarity matrices initialized by a global dissimilarity matrix

Description

internal

Usage

local_ortho_diss(
  k_index_matrix,
  Xr,
  Yr,
  Xu,
  diss_method,
  pc_selection,
  center,
  scale,
  allow_parallel,
  ...
)

Arguments

k_index_matrix

a matrix of nearest neighbor indices

Xr

argument passed to ortho_projection

Yr

argument passed to ortho_projection

Xu

argument passed to ortho_projection

diss_method

argument passed to ortho_projection

pc_selection

argument passed to ortho_projection

center

argument passed to ortho_projection

scale

argument passed to ortho_projection


A function for memory-based learning (mbl)

Description

This function is implemented for memory-based learning (a.k.a. instance-based learning or local regression) which is a non-linear lazy learning approach for predicting a given response variable from a set of predictor variables. For each observation in a prediction set, a specific local regression is carried out based on a subset of similar observations (nearest neighbors) selected from a reference set. The local model is then used to predict the response value of the target (prediction) observation. Therefore this function does not yield a global regression model.

Usage

mbl(Xr, Yr, Xu, Yu = NULL, k, k_diss, k_range, spike = NULL,
    method = local_fit_wapls(min_pls_c = 3, max_pls_c = min(dim(Xr), 15)),
    diss_method = "pca", diss_usage = "predictors", gh = TRUE,
    pc_selection = list(method = "opc", value = min(dim(Xr), 40)),
    control = mbl_control(), group = NULL, center = TRUE, scale = FALSE,
    verbose = TRUE, documentation = character(), seed = NULL, ...)

Arguments

Xr

a matrix of predictor variables of the reference data (observations in rows and variables in columns).

Yr

a numeric matrix of one column containing the values of the response variable corresponding to the reference data.

Xu

a matrix of predictor variables of the data to be predicted (observations in rows and variables in columns).

Yu

an optional matrix of one column containing the values of the response variable corresponding to the data to be predicted. Default is NULL.

k

a vector of integers specifying the sequence of k-nearest neighbors to be tested. Either k or k_diss must be specified. This vector will be automatically sorted into ascending order. If non-integer numbers are passed, they will be coerced to the next upper integers.

k_diss

a numeric vector specifying the sequence of dissimilarity thresholds to be tested for the selection of the nearest neighbors found in Xr around each observation in Xu. These thresholds depend on the corresponding dissimilarity measure specified in the object passed to control. Either k or k_diss must be specified.

k_range

an integer vector of length 2 which specifies the minimum (first value) and the maximum (second value) number of neighbors to be retained when the k_diss is given.

spike

an integer vector (with positive and/or negative values) indicating the indices of observations in Xr that must either be forced into or avoided in the neighborhoods of every Xu observation. Default is NULL (i.e. no observations are forced or avoided). Note that this argument is not intended for increasing or reducing the neighborhood size which is only controlled by k or k_diss and k_range. By forcing observations into the neighborhood, some of the farthest observations may be forced out of the neighborhood. In contrast, by avoiding observations in the neighborhood, some of the farthest observations may be included into the neighborhood. See details.

method

an object of class local_fit which indicates the type of regression to conduct at each local segment as well as additional parameters affecting this regression. See local_fit function.

diss_method

a character string indicating the spectral dissimilarity metric to be used in the selection of the nearest neighbors of each observation. Options are:

  • "pca" (Default): Mahalanobis distance computed on the matrix of scores of a Principal Component (PC) projection of Xr and Xu. PC projection is done using the singular value decomposition (SVD) algorithm. See ortho_diss function.

  • "pca.nipals": Mahalanobis distance computed on the matrix of scores of a Principal Component (PC) projection of Xr and Xu. PC projection is done using the non-linear iterative partial least squares (nipals) algorithm. See ortho_diss function.

  • "pls": Mahalanobis distance computed on the matrix of scores of a partial least squares projection of Xr and Xu. In this case, Yr is always required. See ortho_diss function.

  • "cor": correlation coefficient between observations. See cor_diss function.

  • "euclid": Euclidean distance between observations. See f_diss function.

  • "cosine": Cosine distance between observations. See f_diss function.

  • "sid": spectral information divergence between observations. See sid function.

Alternatively, a matrix of dissimilarities can also be passed to this argument. This matrix is supposed to be a user-defined matrix representing the dissimilarities between observations in Xr and Xu. When diss_usage = "predictors", this matrix must be square (derived from a matrix of the form rbind(Xr, Xu)) and its diagonal values must be zeros (since the dissimilarity between an object and itself must be 0). On the other hand, if diss_usage is set to either "weights" or "none", it must be a matrix representing the dissimilarity of each observation in Xu to each observation in Xr. The number of columns of the input matrix must be equal to the number of rows in Xu and the number of rows equal to the number of rows in Xr.
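
The two layouts accepted for a user-defined dissimilarity matrix can be sketched with base R. The Euclidean distance from dist() is used here only as an arbitrary stand-in for whatever dissimilarity the user computes, and the data are simulated.

```r
set.seed(1)
Xr <- matrix(rnorm(8 * 5), nrow = 8)   # 8 reference observations
Xu <- matrix(rnorm(3 * 5), nrow = 3)   # 3 observations to be predicted

# Layout for diss_usage = "predictors": square matrix derived from
# rbind(Xr, Xu), with zeros on the diagonal
d_square <- as.matrix(dist(rbind(Xr, Xu)))
dim(d_square)

# Layout for diss_usage = "weights" or "none": one row per observation
# in Xr and one column per observation in Xu
d_xr_xu <- d_square[
  seq_len(nrow(Xr)),
  nrow(Xr) + seq_len(nrow(Xu)),
  drop = FALSE
]
dim(d_xr_xu)
```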

diss_usage

a character string specifying how the dissimilarity information shall be used. The possible options are: "predictors", "weights" and "none" (see details below). Default is "predictors".

gh

a logical indicating if the global Mahalanobis distance (in the pls score space) between each observation and the pls mean (centre) must be computed. This metric is known as the GH distance in the literature. Note that this computation is based on the number of pls components determined by using the pc_selection argument. See details.

pc_selection

a list of length 2 used for the computation of GH (if gh = TRUE) as well as in the computation of the dissimilarity methods based on ortho_diss (i.e. when diss_method is one of: "pca", "pca.nipals" or "pls"). This argument is used for optimizing the number of components (principal components or pls factors) to be retained for dissimilarity/distance computation purposes only (i.e. not for regression). This list must contain two elements in the following order: method (a character indicating the method for selecting the number of components) and value (a numerical value that complements the selected method). The methods available are:

  • "opc": optimized principal component selection based on Ramirez-Lopez et al. (2013a, 2013b). The optimal number of components (for a given set of observations) is the one for which its distance matrix minimizes the differences between the Yr value of each observation and the Yr value of its closest observation. In this case value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the maximum number of principal components to be tested. See the ortho_projection function for more details.

  • "cumvar": selection of the principal components based on a given cumulative amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of cumulative variance that the combination of retained components should explain.

  • "var": selection of the principal components based on a given amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of variance that a single component should explain in order to be retained.

  • "manual": for manually specifying a fixed number of principal components. In this case, value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the number of components to be retained.

The list list(method = "opc", value = min(dim(Xr), 40)) is the default. Optionally, the pc_selection argument admits "opc" or "cumvar" or "var" or "manual" as a single character string. In such a case the default "value" when either "opc" or "manual" are used is 40. When "cumvar" is used the default "value" is set to 0.99 and when "var" is used, the default "value" is set to 0.01.
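
The "cumvar" and "var" rules described above can be sketched with a singular value decomposition in base R. This is only an illustration on simulated data using the default thresholds mentioned in the text (0.99 and 0.01); it is not the package's own selection code.

```r
set.seed(7)
# simulated predictors with decreasing variance along 6 directions
X <- matrix(rnorm(30 * 6), nrow = 30) %*% diag(c(5, 3, 2, 1, 0.5, 0.1))

# proportion of variance explained by each principal component
sv <- svd(scale(X, center = TRUE, scale = FALSE))$d
explained <- sv^2 / sum(sv^2)

# "cumvar": smallest number of components whose cumulative explained
# variance reaches the threshold
n_cumvar <- which(cumsum(explained) >= 0.99)[1]

# "var": components that individually explain at least the threshold
n_var <- sum(explained >= 0.01)

c(cumvar = n_cumvar, var = n_var)
```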

control

a list created with the mbl_control function which contains additional parameters that control some aspects of the mbl function (cross-validation, parameter tuning, etc). The default list is as returned by mbl_control(). See the mbl_control function for more details.

group

an optional factor (or character vector that can be coerced to factor by as.factor) that assigns a group/class label to each observation in Xr (e.g. groups can be given by spectra collected from the same batch of measurements, from the same observation, from observations with very similar origin, etc). This is taken into account for internal leave-group-out cross validation for pls tuning (factor optimization) to avoid pseudo-replication. When one observation is selected for cross-validation, all observations of the same group are removed together and assigned to validation. The length of the vector must be equal to the number of observations in the reference/training set (i.e. nrow(Xr)). See details.

center

a logical indicating if the predictor variables must be centred at each local segment (before regression). In addition, if TRUE, Xr and Xu will be centred for dissimilarity computations.

scale

a logical indicating if the predictor variables must be scaled to unit variance at each local segment (before regression). In addition, if TRUE, Xr and Xu will be scaled for dissimilarity computations.

verbose

a logical indicating whether or not to print a progress bar for each observation to be predicted. Default is TRUE. Note: In case parallel processing is used, these progress bars will not be printed.

documentation

an optional character string that can be used to describe anything related to the mbl call (e.g. description of the input data). Default: character(). NOTE: this is an experimental argument.

seed

an integer value containing the random number generator (RNG) state for random number generation. This argument can be used for reproducibility purposes (for random sampling) in the cross-validation results. Default is NULL, i.e. no RNG is applied.

...

further arguments to be passed to the dissimilarity function. See details.

Details

The argument spike can be used to indicate what reference observations in Xr must be kept in the neighborhood of every single Xu observation. If a vector of length \(m\) is passed to this argument, this means that the \(m\) original neighbors with the largest dissimilarities to the target observations will be forced out of the neighborhood. Spiking might be useful in cases where some reference observations are known to be somehow related to the ones in Xu and therefore might be relevant for fitting the local models. See Guerrero et al. (2010) for an example on the benefits of spiking.
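
The effect of spiking on a neighborhood of fixed size can be sketched as follows. spiked_neighbors() is a hypothetical helper written for illustration (it is not a resemble function); it assumes positive indices are forced into the neighborhood and negative indices are avoided, as described for the spike argument.

```r
# Hypothetical helper: select k neighbors from a vector of dissimilarities,
# forcing (positive) or avoiding (negative) the indices given in spike.
spiked_neighbors <- function(d, k, spike = integer(0)) {
  forced <- spike[spike > 0]
  avoided <- abs(spike[spike < 0])
  # remaining candidates, ordered from nearest to farthest
  candidates <- setdiff(order(d), c(forced, avoided))
  # forced observations take slots, so the farthest candidates drop out
  c(forced, head(candidates, k - length(forced)))
}

# dissimilarities between one target observation and 6 reference observations
d <- c(0.9, 0.1, 0.5, 0.3, 0.7, 0.2)

spiked_neighbors(d, k = 3)               # 2 6 4
spiked_neighbors(d, k = 3, spike = 1)    # 1 2 6 (observation 4 is forced out)
spiked_neighbors(d, k = 3, spike = -2)   # 6 4 3 (observation 3 is included)
```

In both spiked cases the neighborhood size stays at k, in line with the note that spike is not meant to enlarge or shrink the neighborhood.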

The mbl function uses the dissimilarity function to compute the dissimilarities between Xr and Xu. The dissimilarity method to be used is specified in the diss_method argument. Arguments to dissimilarity as well as further arguments to the functions used inside dissimilarity (i.e. ortho_diss, cor_diss, f_diss, sid) can be passed to those functions by using ....

The diss_usage argument is used to specify whether the dissimilarity information must be used within the local regressions and, if so, how. When diss_usage = "predictors" the local (square symmetric) dissimilarity matrix corresponding to the selected neighborhood is used as source of additional predictors (i.e. the columns of this local matrix are treated as predictor variables). In some cases this results in an improvement of the prediction performance (Ramirez-Lopez et al., 2013a). If diss_usage = "weights", the neighbors of the query point (\(xu_{j}\)) are weighted according to their dissimilarity to \(xu_{j}\) before carrying out each local regression. The following tricubic function (Cleveland and Devlin, 1988; Naes et al., 1990) is used for computing the final weights based on the measured dissimilarities:

\[W_{j} = (1 - v^{3})^{3}\]

where if \({xr_{i} \in }\) neighbors of \(xu_{j}\):

\[v_{j}(xu_{j}) = d(xr_{i}, xu_{j})\]

otherwise:

\[v_{j}(xu_{j}) = 0\]

In the above formulas \(d(xr_{i}, xu_{j})\) represents the dissimilarity between the query point and each object in \(Xr\). When diss_usage = "none" is chosen the dissimilarity information is not used.
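
A minimal sketch of the tricubic weighting for the neighbors of a single query point is given below. tricubic_weights() is a hypothetical helper, and dividing the dissimilarities by their maximum (so that \(v\) lies between 0 and 1) is an assumption made here for illustration; it is not stated in the formulas above.

```r
# Hypothetical helper: tricubic weights W = (1 - v^3)^3 for the neighbors
# of one query point.
tricubic_weights <- function(d) {
  v <- d / max(d)   # assumption: rescale so that v lies in [0, 1]
  (1 - v^3)^3
}

# dissimilarities between a query point and its 4 neighbors
d <- c(0.05, 0.10, 0.20, 0.40)
w <- tricubic_weights(d)
round(w, 3)
```

Under this scaling the closest neighbors receive weights near 1 and the farthest neighbor receives a weight of 0.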

The global Mahalanobis distance (a.k.a. GH) is computed based on the scores of a pls projection. A pls projection model is built for {Yr, Xr} and this model is used to obtain the pls scores of the Xu observations. The Mahalanobis distance between each Xu observation (in the pls space) and the centre of Xr is then computed. The number of pls components is optimized based on the parameters passed to the pc_selection argument. In addition, the mbl function also reports the GH distance for the observations in Xr.
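
The distance-to-centre computation can be sketched in base R. To keep the example self-contained, a principal component projection (prcomp) is used as a stand-in for the pls projection described above, and the number of components is fixed by hand; both are assumptions of this sketch. Note that stats::mahalanobis() returns squared distances.

```r
set.seed(3)
Xr <- matrix(rnorm(50 * 8), nrow = 50)
Xu <- matrix(rnorm(5 * 8), nrow = 5)

# Stand-in projection (PCA); the text describes a pls projection instead
proj <- prcomp(Xr, center = TRUE, scale. = FALSE)
n_comp <- 3   # fixed by hand here; mbl derives it from pc_selection

scores_xr <- proj$x[, seq_len(n_comp), drop = FALSE]
# project Xu with the centre and rotation estimated on Xr
scores_xu <- sweep(Xu, 2, proj$center) %*%
  proj$rotation[, seq_len(n_comp), drop = FALSE]

center_xr <- colMeans(scores_xr)
cov_xr <- cov(scores_xr)

# squared Mahalanobis distance of each observation to the centre of Xr
gh_xr <- mahalanobis(scores_xr, center_xr, cov_xr)
gh_xu <- mahalanobis(scores_xu, center_xr, cov_xr)
round(gh_xu, 2)
```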

Some aspects of the mbl process, such as the type of internal validation, parameter tuning, what extra objects to return, permission for parallel execution, prediction limits, etc, can be specified by using the mbl_control function.

By using the group argument one can specify groups of observations that have something in common (e.g. observations with very similar origin). The purpose of group is to avoid biased cross-validation results due to pseudo-replication. This argument allows selecting calibration points that are independent from the validation ones. In this regard, when validation_type = "local_cv" (used in the mbl_control function), then the p argument refers to the percentage of groups of observations (rather than single observations) to be retained in each sampling iteration at each local segment.
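
Group-wise resampling of this kind can be sketched in base R. The group labels are made up, and rounding the number of retained groups down with floor() is an assumption of this sketch, not something stated in the text.

```r
set.seed(11)
# 8 groups with 3 observations each (hypothetical labels)
group <- factor(rep(c("a", "b", "c", "d", "e", "f", "g", "h"), each = 3))
p <- 0.75      # proportion of groups retained for calibration
number <- 4    # number of sampling iterations

groups <- levels(group)
n_keep <- floor(p * length(groups))   # assumption: round down

splits <- lapply(seq_len(number), function(i) {
  kept <- sample(groups, n_keep)
  list(
    calibration = which(group %in% kept),
    validation = which(!group %in% kept)
  )
})

# observations from the same group never end up on both sides of a split
splits[[1]]
```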

Value

a list of class mbl with the following components (sorted either by k or k_diss):

When the k_diss argument is used, the printed results show a table with a column named 'p_bounded'. It represents the percentage of observations for which the neighbors selected by the given dissimilarity threshold were outside the boundaries specified in the k_range argument.
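
How a dissimilarity threshold interacts with k_range, and how a percentage like p_bounded could be derived, can be sketched as follows. threshold_neighbors() is a hypothetical helper, the use of "less than or equal to" for the threshold is an assumption, and the dissimilarities are simulated.

```r
# Hypothetical helper: neighbors of one target observation selected by a
# dissimilarity threshold and bounded by k_range.
threshold_neighbors <- function(d, k_diss, k_range) {
  n_below <- sum(d <= k_diss)   # assumption: inclusive threshold
  n_used <- min(max(n_below, k_range[1]), k_range[2])
  list(
    indices = order(d)[seq_len(n_used)],
    bounded = n_below != n_used   # TRUE if k_range had to intervene
  )
}

set.seed(5)
# simulated dissimilarities: rows are Xr observations, columns are Xu
d_xr_xu <- matrix(runif(30 * 4), nrow = 30)

res <- lapply(seq_len(ncol(d_xr_xu)), function(j) {
  threshold_neighbors(d_xr_xu[, j], k_diss = 0.2, k_range = c(5, 8))
})

# percentage of target observations whose neighborhood was bounded
p_bounded <- 100 * mean(vapply(res, function(r) r$bounded, logical(1)))
p_bounded
```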

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

References

Cleveland, W. S., and Devlin, S. J. 1988. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association, 83, 596-610.

Guerrero, C., Zornoza, R., Gómez, I., Mataix-Beneyto, J. 2010. Spiking of NIR regional models using observations from target sites: Effect of model size on prediction accuracy. Geoderma, 158(1-2), 66-77.

Naes, T., Isaksson, T., Kowalski, B. 1990. Locally weighted regression and scatter correction for near-infrared reflectance data. Analytical Chemistry 62, 664-673.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

Rasmussen, C.E., Williams, C.K. Gaussian Processes for Machine Learning. Massachusetts Institute of Technology: MIT-Press, 2006.

Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.

See Also

mbl_control, f_diss, cor_diss, sid, ortho_diss, search_neighbors, local_fit

Examples

library(prospectr)
data(NIRsoil)

# Preprocess the data using detrend plus first derivative with Savitzky and
# Golay smoothing filter
sg_det <- savitzkyGolay(
  detrend(NIRsoil$spc,
    wav = as.numeric(colnames(NIRsoil$spc))
  ),
  m = 1,
  p = 1,
  w = 7
)
NIRsoil$spc_pr <- sg_det

# split into training and testing sets
test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC), ]
test_y <- NIRsoil$CEC[NIRsoil$train == 0 & !is.na(NIRsoil$CEC)]
train_y <- NIRsoil$CEC[NIRsoil$train == 1 & !is.na(NIRsoil$CEC)]
train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC), ]

# Example 1
# A mbl implemented in Ramirez-Lopez et al. (2013,
# the spectrum-based learner)

# Example 1.1
# An example where Yu is supposed to be unknown, but the Xu
# (spectral variables) are known
my_control <- mbl_control(validation_type = "NNv")

## The neighborhood sizes to test
ks <- seq(40, 140, by = 20)

sbl <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  k = ks,
  method = local_fit_gpr(),
  control = my_control,
  scale = TRUE
)
sbl
plot(sbl)
get_predictions(sbl)

# Example 1.2
# If Yu is actually known...
sbl_2 <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  Yu = test_y,
  k = ks,
  method = local_fit_gpr(),
  control = my_control
)
sbl_2
plot(sbl_2)

# Example 2
# the LOCAL algorithm (Shenk et al., 1997)
local_algorithm <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  Yu = test_y,
  k = ks,
  method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
  diss_method = "cor",
  diss_usage = "none",
  control = my_control
)
local_algorithm
plot(local_algorithm)

# Example 3
# A variation of the LOCAL algorithm (using the optimized pc
# dissimilarity matrix) and dissimilarity matrix as source of
# additional predictors
local_algorithm_2 <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  Yu = test_y,
  k = ks,
  method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
  diss_method = "pca",
  diss_usage = "predictors",
  control = my_control
)
local_algorithm_2
plot(local_algorithm_2)

# Example 4
# Running the mbl function in parallel with example 2
n_cores <- 2
if (parallel::detectCores() < 2) {
  n_cores <- 1
}

# Alternatively:
# n_cores <- parallel::detectCores() - 1
# if (n_cores == 0) {
#  n_cores <- 1
# }

library(doParallel)
clust <- makeCluster(n_cores)
registerDoParallel(clust)

# Alternatively:
# library(doSNOW)
# clust <- makeCluster(n_cores, type = "SOCK")
# registerDoSNOW(clust)
# getDoParWorkers()

local_algorithm_par <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  Yu = test_y,
  k = ks,
  method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
  diss_method = "cor",
  diss_usage = "none",
  control = my_control
)
local_algorithm_par

registerDoSEQ()
try(stopCluster(clust))

# Example 5
# Using local pls distances
with_local_diss <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  Yu = test_y,
  k = ks,
  method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
  diss_method = "pls",
  diss_usage = "predictors",
  control = my_control,
  .local = TRUE,
  pre_k = 150
)
with_local_diss
plot(with_local_diss)

A function that controls some aspects of the memory-based learning process in the mbl function

Description

Experimental lifecycle

This function is used to further control some aspects of the memory-based learning process in the mbl function.

Usage

mbl_control(
  return_dissimilarity = FALSE,
  validation_type = c("NNv", "local_cv"),
  tune_locally = TRUE,
  number = 10,
  p = 0.75,
  range_prediction_limits = TRUE,
  progress = TRUE,
  allow_parallel = TRUE
)

Arguments

return_dissimilarity

a logical indicating if the dissimilarity matrix between Xr and Xu must be returned.

validation_type

a character vector which indicates the (internal) validation method(s) to be used for assessing the global performance of the local models. Possible options are: "NNv" and "local_cv". Alternatively "none" can be used when cross-validation is not required (see details below).

tune_locally

a logical. It only applies when validation_type = "local_cv" and "pls" or "wapls" fitting algorithms are used. If TRUE, the parameters of the local pls-based models (i.e. pls factors for the "pls" method and minimum and maximum pls factors for the "wapls" method) are tuned locally. Default is TRUE.

number

an integer indicating the number of sampling iterations at each local segment when "local_cv" is selected in the validation_type argument. Default is 10.

p

a numeric value indicating the percentage of observations to be retained at each sampling iteration at each local segment when "local_cv" is selected in the validation_type argument. Default is 0.75 (75 %).

range_prediction_limits

a logical. It indicates whether the prediction limits at each local regression are determined by the range of the response variable within each neighborhood. When the predicted value is outside this range, it will be automatically replaced with the value of the nearest range value. If FALSE, no prediction limits are imposed. Default is TRUE.

progress

a logical indicating whether or not to print a progress bar for each observation to be predicted. Default is TRUE. Note: In case parallel processing is used, these progress bars will not be printed.

allow_parallel

a logical indicating if parallel execution is allowed. If TRUE, this parallelism is applied to the loop in mbl in which each iteration takes care of a single observation in Xu. The parallelization of this for loop is implemented using the foreach function of the foreach package. Default is TRUE.
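
The behaviour described for range_prediction_limits amounts to clamping each local prediction to the range of the response values in its neighborhood. A minimal sketch follows; limit_prediction() is a hypothetical helper and the response values are made up.

```r
# Hypothetical helper: replace a prediction that falls outside the range of
# the neighbors' response values with the nearest limit of that range.
limit_prediction <- function(pred, y_neighbors) {
  rng <- range(y_neighbors)
  min(max(pred, rng[1]), rng[2])
}

# response values of the neighbors of one target observation
y_neighbors <- c(12.1, 15.4, 9.8, 14.0)

limit_prediction(17.2, y_neighbors)   # 15.4 (above the range)
limit_prediction(8.0, y_neighbors)    # 9.8  (below the range)
limit_prediction(13.3, y_neighbors)   # 13.3 (inside the range, unchanged)
```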

Details

The validation methods available for assessing the predictive performance of the memory-based learning method used are described as follows:

Value

a list mirroring the specified parameters

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

References

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

See Also

f_diss, cor_diss, sid, ortho_diss, mbl

Examples

# A control list with the default parameters
mbl_control()

Moving/rolling correlation distance of two matrices

Description

Computes a moving window correlation distance between two data matrices

Usage

moving_cor_diss(X,Y,w)

Arguments

X

a matrix

Y

a matrix

w

window size (must be odd)

Value

a matrix of correlation distance

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens


orthogonal scores algorithm of partial least squares (opls)

Description

Computes orthogonal scores partial least squares (opls) regressions with the NIPALS algorithm. It allows multiple response variables. It does not return the variance information of the components. NOTE: For internal use only!

Usage

opls(X, Y, ncomp, scale, maxiter, tol,
     algorithm = "pls", xls_min_w = 3, xls_max_w = 15)

Arguments

X

a matrix of predictor variables.

Y

a matrix of either a single or multiple response variables.

ncomp

the number of pls components.

scale

logical indicating whether X must be scaled.

maxiter

maximum number of iterations.

tol

limit for convergence of the algorithm in the nipals algorithm.

algorithm

(for weights computation) a character string indicating what method to use. Options are: 'pls' for pls (using covariance between X and Y), 'mpls' for modified pls (using correlation between X and Y) or 'xls' for extended pls (as implemented in BUCHI NIRWise PLUS software).

xls_min_w

(for weights computation) an integer indicating the minimum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 3 (as in BUCHI NIRWise PLUS software).

xls_max_w

(for weights computation) an integer indicating the maximum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 15 (as in BUCHI NIRWise PLUS software).

Value

a list containing the following elements:

Author(s)

Leonardo Ramirez-Lopez
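
For a single response variable, the orthogonal scores algorithm can be sketched in a few lines of base R. This is an illustration only and not the package's C++ implementation: simple_opls() and the names of its outputs are hypothetical, only the standard 'pls' weights are covered, and no inner iteration (maxiter, tol) is needed because with one response the weights are obtained in a single step.

```r
# Hypothetical sketch of orthogonal scores pls for one response variable.
simple_opls <- function(X, y, ncomp) {
  X <- scale(X, center = TRUE, scale = FALSE)
  y <- y - mean(y)
  W <- P <- matrix(0, ncol(X), ncomp)
  Tm <- matrix(0, nrow(X), ncomp)
  q <- numeric(ncomp)
  for (a in seq_len(ncomp)) {
    w <- drop(crossprod(X, y))              # weights: covariance with y
    w <- w / sqrt(sum(w^2))
    t_a <- drop(X %*% w)                    # scores
    tt <- sum(t_a^2)
    p_a <- drop(crossprod(X, t_a)) / tt     # X loadings
    q[a] <- sum(y * t_a) / tt               # y loading
    X <- X - tcrossprod(t_a, p_a)           # deflate X
    y <- y - q[a] * t_a                     # deflate y
    W[, a] <- w
    P[, a] <- p_a
    Tm[, a] <- t_a
  }
  # regression coefficients on the centred predictors
  b <- drop(W %*% solve(crossprod(P, W), q))
  list(scores = Tm, weights = W, loadings = P, y_loadings = q, coefficients = b)
}

set.seed(9)
X <- matrix(rnorm(30 * 4), nrow = 30)
y <- drop(X %*% c(1, -2, 0.5, 0)) + rnorm(30, sd = 0.1)

fit <- simple_opls(X, y, ncomp = 3)
# the score vectors are mutually orthogonal (off-diagonal entries are ~0)
round(crossprod(fit$scores), 6)
```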


Internal Cpp function for performing leave-group-out cross-validations for pls regression

Description

For internal use only!

Usage

opls_cv_cpp(X, Y, scale, method,
            mindices, pindices,
            min_component, ncomp,
            new_x,
            maxiter, tol,
            wapls_grid,
            algorithm,
            statistics = TRUE)

Arguments

X

a matrix of predictor variables.

Y

a matrix of a single response variable.

scale

a logical indicating whether the matrix of predictors (X) must be scaled.

method

the method used for regression. One of the following options: 'pls' or 'wapls' or 'completewapls1p'.

mindices

a matrix with n rows and m columns where m is equivalent to the number of resampling iterations. The elements of each column indicate the indices of the observations to be used for modeling at each iteration.

pindices

a matrix with k rows and m columns where m is equivalent to the number of resampling iterations. The elements of each column indicate the indices of the observations to be used for predicting at each iteration.

min_component

an integer indicating the minimum number of pls components (if method = 'pls').

ncomp

an integer indicating the number of pls components.

new_x

a matrix of one row corresponding to the observation to be predicted (if method = 'wapls').

maxiter

maximum number of iterations.

tol

limit for convergence of the algorithm in the nipals algorithm.

wapls_grid

the grid on which the search for the best combination of minimum and maximum pls factors of 'wapls' is based in case method = 'completewapls1p'.

algorithm

either pls ('pls') or modified pls ('mpls'). See get_weights function.

statistics

a logical value indicating whether the precision and accuracy statistics are to be returned, otherwise the predictions for each validation segment are retrieved.

Value

If statistics = TRUE, a list containing the following one-row matrices:

If statistics = FALSE, a list containing the following one-row matrices:

If method = "wapls", data of the pls weights are output in this list (compweights).

If method = "completewapls1p", data of all the combinations of components passed in wapls_grid are output in this list (complete_compweights).

Author(s)

Leonardo Ramirez-Lopez


orthogonal scores algorithm of partial least squares (opls) projection

Description

Computes orthogonal scores partial least squares (opls) projection with the NIPALS algorithm. It allows multiple response variables. Although the main use of the function is for projection, it also retrieves regression coefficients. NOTE: For internal use only!

Usage

opls_for_projection(X, Y, ncomp, scale,
                    maxiter, tol,
                    pcSelmethod = "var",
                    pcSelvalue = 0.01,
                    algorithm = "pls",
                    xls_min_w = 3,
                    xls_max_w = 15)

Arguments

X

a matrix of predictor variables.

Y

a matrix of either a single or multiple response variables.

ncomp

the number of pls components.

scale

logical indicating whether X must be scaled.

maxiter

maximum number of iterations.

tol

limit for convergence of the algorithm in the nipals algorithm.

pcSelmethod

ifregression = TRUE, the method for selecting thenumber of components.Options are:'manual','cumvar' (for selecting the number ofprincipal components based on a given cumulative amount of explainedvariance) and'var' (for selecting the number of principal componentsbased on a given amount of explained variance). Default is'cumvar'.

pcSelvalue

a numerical value that complements the selected method(pcSelmethod).If'cumvar' is chosen (default),pcSelvalue must be a value(larger than 0 and below 1) indicating the maximum amount of cumulativevariance that the retained components should explain. Default is 0.99.If'var' is chosen,pcSelvalue must be a value (larger than 0and below 1) indicating that components that explain (individually)a variance lower than this threshold must be excluded. If'manual'is chosen,pcSelvalue has no effect and the number of componentsretrieved are the one specified inncomp.

algorithm

(for weights computation) a character string indicating what method to use. Options are: 'pls' for pls (using covariance between X and Y), 'mpls' for modified pls (using correlation between X and Y) or 'xls' for extended pls (as implemented in BUCHI NIRWise PLUS software).

xls_min_w

(for weights computation) an integer indicating the minimum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 3 (as in BUCHI NIRWise PLUS software).

xls_max_w

(for weights computation) an integer indicating the maximum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 15 (as in BUCHI NIRWise PLUS software).

Value

a list containing the following elements:

Author(s)

Leonardo Ramirez-Lopez


Orthogonal scores algorithm of partial least squares (opls_get_all)

Description

Computes orthogonal scores partial least squares (opls_get_all) regressions with the NIPALS algorithm. It retrieves a comprehensive set of pls outputs (e.g. vip and selectivity ratio). It allows multiple response variables. NOTE: For internal use only!

Usage

opls_get_all(X,
             Y,
             ncomp,
             scale,
             maxiter,
             tol,
             algorithm = "pls",
             xls_min_w = 3,
             xls_max_w = 15)

Arguments

X

a matrix of predictor variables.

Y

a matrix of either a single or multiple response variables.

ncomp

the number of pls components.

scale

logical indicating whether X must be scaled.

maxiter

maximum number of iterations.

tol

tolerance limit for convergence of the nipals algorithm.

algorithm

(for weights computation) a character string indicating what method to use. Options are: 'pls' for pls (using covariance between X and Y), 'mpls' for modified pls (using correlation between X and Y) or 'xls' for extended pls (as implemented in BUCHI NIRWise PLUS software).

xls_min_w

(for weights computation) an integer indicating the minimum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 3 (as in BUCHI NIRWise PLUS software).

xls_max_w

(for weights computation) an integer indicating the maximum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 15 (as in BUCHI NIRWise PLUS software).

Value

a list containing the following elements:

Author(s)

Leonardo Ramirez-Lopez


Fast orthogonal scores algorithm of partial least squares (opls)

Description

Computes orthogonal scores partial least squares (opls) regressions with the NIPALS algorithm. It allows multiple response variables. In contrast to the opls function, this one does not compute unnecessary data for (local) regression. For internal use only!

Usage

opls_get_basics(X, Y, ncomp, scale,
                maxiter, tol,
                algorithm = "pls",
                xls_min_w = 3,
                xls_max_w = 15)

Arguments

X

a matrix of predictor variables.

Y

a matrix of either a single or multiple response variables.

ncomp

the number of pls components.

scale

logical indicating whether X must be scaled.

maxiter

maximum number of iterations.

tol

tolerance limit for convergence of the nipals algorithm.

algorithm

(for weights computation) a character string indicating what method to use. Options are: 'pls' for pls (using covariance between X and Y), 'mpls' for modified pls (using correlation between X and Y) or 'xls' for extended pls (as implemented in BUCHI NIRWise PLUS software).

xls_min_w

(for weights computation) an integer indicating the minimum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 3 (as in BUCHI NIRWise PLUS software).

xls_max_w

(for weights computation) an integer indicating the maximum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 15 (as in BUCHI NIRWise PLUS software).

Value

a list containing the following elements:

Author(s)

Leonardo Ramirez-Lopez


Orthogonal scores algorithm of partial least squares (opls)

Description

Computes orthogonal scores partial least squares (opls) regressions with the NIPALS algorithm. It allows multiple response variables. It does not return the variance information of the components. NOTE: For internal use only!

Usage

opls_gs(Xr,
        Yr,
        Xu,
        ncomp,
        scale,
        response = FALSE,
        reconstruction = TRUE,
        similarity = TRUE,
        fresponse = TRUE,
        algorithm = "pls")

Arguments

Xr

a matrix of predictor variables for the training set.

Yr

a matrix of a single response variable for the training set.

Xu

a matrix of predictor variables for the test set.

ncomp

the number of pls components.

scale

logical indicating whether X must be scaled.

response

logical indicating whether to compute the prediction of Yu.

reconstruction

logical indicating whether to compute the reconstruction error of Xu.

similarity

logical indicating whether to compute the distance score between Xr and Xu (in the pls space).

fresponse

logical indicating whether to compute the score of the variance not explained for Yu.

algorithm

(for weights computation) a character string indicating what method to use. Options are: 'pls' for pls (using covariance between X and Y) or 'mpls' for modified pls (using correlation between X and Y).

Value

a list containing the following elements:

Author(s)

Leonardo Ramirez-Lopez


A function to construct optimal strata for the samples, based on the distribution of the given y.

Description

For internal use only! This function computes the optimal strata from the distribution of the given y.

Usage

optim_sample_strata(y, n)

Arguments

y

a matrix of one column with the response variable.

n

number of samples that must be sampled.

Value

a list with two data.table objects: sample_strata contains the optimal strata, whereas samples_to_get contains information on how many samples per stratum are supposed to be drawn.


A function for computing dissimilarity matrices from orthogonal projections (ortho_diss)

Description

This function computes dissimilarities (in an orthogonal space) between either observations in a given set or between observations in two different sets. The dissimilarities are computed based on either principal component projection or partial least squares projection of the data. After projecting the data, the Mahalanobis distance is applied.

Usage

ortho_diss(Xr, Xu = NULL,
           Yr = NULL,
           pc_selection = list(method = "var", value = 0.01),
           diss_method = "pca",
           .local = FALSE,
           pre_k,
           center = TRUE,
           scale = FALSE,
           compute_all = FALSE,
           return_projection = FALSE,
           allow_parallel = TRUE, ...)

Arguments

Xr

a matrix containing n reference observations (rows) and p variables (columns).

Xu

an optional matrix containing data of a second set of observations with p variables/columns.

Yr

a matrix of n rows and one or more columns (variables) with side information corresponding to the observations in Xr (e.g. response variables). It can be numeric with multiple variables/columns, or character with one single column. This argument is required if:

  • diss_method == 'pls': Yr is required to project the variables to orthogonal directions such that the covariance between the extracted pls components and Yr is maximized.

  • pc_selection$method == 'opc': Yr is required to optimize the number of components. The optimal number of projected components is the one for which its distance matrix minimizes the differences between the Yr value of each observation and the Yr value of its closest observation. See sim_eval.

pc_selection

a list of length 2 which specifies the method to be used for optimizing the number of components (principal components or pls factors) to be retained. This list must contain two elements (in the following order): method (a character indicating the method for selecting the number of components) and value (a numerical value that complements the selected method). The methods available are:

  • "opc": optimized principal component selection based on Ramirez-Lopez et al. (2013a, 2013b). The optimal number of components (of a given set of observations) is the one for which its distance matrix minimizes the differences between the Yr value of each observation and the Yr value of its closest observation. In this case, value must be a value (larger than 0 and below min(nrow(Xr) + nrow(Xu), ncol(Xr))) indicating the maximum number of principal components to be tested. See the ortho_projection function for more details.

  • "cumvar": selection of the principal components based on a given cumulative amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of cumulative variance that the combination of retained components should explain.

  • "var": selection of the principal components based on a given amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of variance that a single component should explain in order to be retained.

  • "manual": for manually specifying a fixed number of principal components. In this case, value must be a value (larger than 0 and below the minimum dimension of Xr, or of Xr and Xu combined) indicating the number of components to be retained.

Default is list(method = "var", value = 0.01).

Optionally, the pc_selection argument admits "opc", "cumvar", "var" or "manual" as a single character string. In such a case, the default "value" when either "opc" or "manual" is used is 40. When "cumvar" is used, the default "value" is set to 0.99, and when "var" is used, the default "value" is set to 0.01.

diss_method

a character value indicating the type of projection on which the dissimilarities must be computed. This argument is equivalent to the method argument in the ortho_projection function. Options are:

  • "pca": principal component analysis using the singular value decomposition algorithm.

  • "pca.nipals": principal component analysis using the non-linear iterative partial least squares algorithm.

  • "pls": partial least squares.

  • "mpls": modified partial least squares (Shenk and Westerhaus, 1991 and Westerhaus, 2014).

See the ortho_projection function for further details on the projection methods.

.local

a logical indicating whether or not to compute the dissimilarities locally (i.e. projecting locally the data) by using the pre_k nearest neighbor observations of each target observation. Default is FALSE. See details.

pre_k

if .local = TRUE, an integer value which indicates the number of nearest neighbors to (pre-)retain for each observation to compute the (local) orthogonal dissimilarities to each observation in its neighborhood.

center

a logical indicating if Xr and Xu must be centered. If Xu is provided, the data is centered around the mean of the pooled Xr and Xu matrices (\(Xr \cup Xu\)). For dissimilarity computations based on pls, the data is always centered for the projections.

scale

a logical indicating if Xr and Xu must be scaled. If Xu is provided, the data is scaled based on the standard deviation of the pooled Xr and Xu matrices (\(Xr \cup Xu\)). If center = TRUE, scaling is applied after centering.

compute_all

a logical. In case Xu is specified, it indicates whether or not the distances between all the elements resulting from the pooled Xr and Xu matrices (\(Xr \cup Xu\)) must be computed.

return_projection

a logical. If TRUE, the ortho_projection object on which the dissimilarities are computed will be returned. Default is FALSE. Note that for .local = TRUE only the initial projection is returned (i.e. local projections are not).

allow_parallel

a logical (default TRUE). It allows parallel computing of the local distance matrices (i.e. when .local = TRUE). This is done via the foreach function of the 'foreach' package.

...

additional arguments to be passed to the ortho_projection function.

Details

When .local = TRUE, first a global dissimilarity matrix is computed based on the parameters specified. Then, by using this matrix, a given set of nearest neighbors (pre_k) is identified for each target observation. These neighbors (together with the target observation) are projected (from the original data space) onto a (local) orthogonal space (using the same parameters specified in the function). In this projected space the Mahalanobis distance between the target observation and its neighbors is recomputed. A missing value is assigned to the observations that do not belong to this set of neighbors (non-neighbor observations). In this case the dissimilarity matrix cannot be considered as a distance metric since it does not necessarily satisfy the symmetry condition for distance matrices (i.e. given two observations \(x_i\) and \(x_j\), the local dissimilarity (\(d\)) between them is relative since generally \(d(x_i, x_j) \neq d(x_j, x_i)\)). On the other hand, when .local = FALSE, the dissimilarity matrix obtained can be considered as a distance matrix.

In the cases where "Yr" is required to compute the dissimilarities and .local = TRUE, care must be taken as some neighborhoods might not have enough observations with non-missing "Yr" values, which might lead to unreliable dissimilarity computations.

If "opc" or "manual" are used in pc_selection$method and .local = TRUE, the minimum number of observations with non-missing "Yr" values at each neighborhood is determined by pc_selection$value (i.e. the maximum number of components to compute).
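The local procedure described above can be sketched in base R. This is only an illustration under stated assumptions and not the package's implementation: the helpers pc_mahalanobis_sketch and local_diss_sketch are hypothetical names, the toy data is random, a PCA projection with a manually fixed number of components is assumed, and no side information is involved.

```r
# Hypothetical helper (not a resemble function): Mahalanobis dissimilarity
# computed on the first `ncomp` principal component scores
pc_mahalanobis_sketch <- function(X, ncomp) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  sv <- svd(Xc, nu = ncomp, nv = 0)
  scores <- sv$u %*% diag(sv$d[seq_len(ncomp)], ncomp)
  # PC scores are uncorrelated, so scaling each column to unit variance
  # makes the Euclidean distance equal to the Mahalanobis distance
  z <- sweep(scores, 2, apply(scores, 2, sd), "/")
  unname(as.matrix(dist(z)))
}

# Hypothetical helper: local dissimilarities based on pre_k neighbors
local_diss_sketch <- function(X, ncomp, pre_k) {
  n <- nrow(X)
  global_d <- pc_mahalanobis_sketch(X, ncomp)   # step 1: global matrix
  local_d <- matrix(NA_real_, n, n)             # non-neighbors stay NA
  for (i in seq_len(n)) {
    # step 2: pre_k nearest neighbors of the target (excluding itself)
    neighbors <- setdiff(order(global_d[i, ]), i)[seq_len(pre_k)]
    idx <- c(i, neighbors)
    # step 3: re-project the neighborhood and recompute the distances
    d_i <- pc_mahalanobis_sketch(X[idx, , drop = FALSE], ncomp)
    local_d[i, neighbors] <- d_i[1, -1]
    local_d[i, i] <- 0
  }
  local_d
}

set.seed(1)
X_toy <- matrix(rnorm(40 * 8), nrow = 40)       # toy data: 40 obs, 8 vars
d_local <- local_diss_sketch(X_toy, ncomp = 3, pre_k = 15)

# Each row only holds distances to its own neighbors, so the matrix is
# generally not symmetric
isSymmetric(d_local)
```

In this sketch the targets are stored in rows; the orientation used by ortho_diss itself is not implied here.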

Value

a list of class ortho_diss with the following elements:

Author(s)

Leonardo Ramirez-Lopez

References

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

See Also

ortho_projection, sim_eval

Examples

library(prospectr)
data(NIRsoil)
Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil[!as.logical(NIRsoil$train), "CEC", drop = FALSE]
Yr <- NIRsoil[as.logical(NIRsoil$train), "CEC", drop = FALSE]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]
Xu <- Xu[!is.na(Yu), ]
Yu <- Yu[!is.na(Yu), , drop = FALSE]
Xr <- Xr[!is.na(Yr), ]
Yr <- Yr[!is.na(Yr), , drop = FALSE]
# Computation of the orthogonal dissimilarity matrix using the
# default parameters
pca_diss <- ortho_diss(Xr, Xu)
# Computation of a principal component dissimilarity matrix using
# the "opc" method for the selection of the principal components
pca_diss_optim <- ortho_diss(
  Xr, Xu, Yr,
  pc_selection = list("opc", 40),
  compute_all = TRUE
)
# Computation of a partial least squares (PLS) dissimilarity
# matrix using the "opc" method for the selection of the PLS
# components
pls_diss_optim <- ortho_diss(
  Xr = Xr, Xu = Xu,
  Yr = Yr,
  pc_selection = list("opc", 40),
  diss_method = "pls"
)

Orthogonal projections using principal component analysis and partial least squares

Description

Functions to perform orthogonal projections of high dimensional data matrices using principal component analysis (pca) and partial least squares (pls).

Usage

ortho_projection(Xr, Xu = NULL,
                 Yr = NULL,
                 method = "pca",
                 pc_selection = list(method = "var", value = 0.01),
                 center = TRUE, scale = FALSE, ...)
pc_projection(Xr, Xu = NULL, Yr = NULL,
              pc_selection = list(method = "var", value = 0.01),
              center = TRUE, scale = FALSE,
              method = "pca",
              tol = 1e-6, max_iter = 1000, ...)
pls_projection(Xr, Xu = NULL, Yr,
               pc_selection = list(method = "opc", value = min(dim(Xr), 40)),
               scale = FALSE, method = "pls",
               tol = 1e-6, max_iter = 1000, ...)
## S3 method for class 'ortho_projection'
predict(object, newdata, ...)

Arguments

Xr

a matrix of observations.

Xu

an optional matrix containing data of a second set of observations.

Yr

if the method used in the pc_selection argument is "opc" or if method = "pls", then it must be a matrix containing the side information corresponding to the spectra in Xr. It is equivalent to the side_info parameter of the sim_eval function. In case method = "pca", a matrix (with one or more continuous variables) can also be used as input. The root mean square of differences (rmsd) is used for assessing the similarity between the observations and their corresponding most similar observations in terms of the side information provided. A single discrete variable of class factor can also be passed. In that case, the kappa index is used. See the sim_eval function for more details.

method

the method for projecting the data. Options are:

  • "pca": principal component analysis using the singular value decomposition algorithm.

  • "pca.nipals": principal component analysis using the non-linear iterative partial least squares algorithm.

  • "pls": partial least squares.

  • "mpls": modified partial least squares. See details.

pc_selection

a list of length 2 which specifies the method to be used for optimizing the number of components (principal components or pls factors) to be retained. This list must contain two elements (in the following order): method (a character indicating the method for selecting the number of components) and value (a numerical value that complements the selected method). The methods available are:

  • "opc": optimized principal component selection based on Ramirez-Lopez et al. (2013a, 2013b). The optimal number of components of a given set of observations is the one for which its distance matrix minimizes the differences between the Yr value of each observation and the Yr value of its closest observation. In this case, value must be a value (larger than 0 and below min(nrow(Xr) + nrow(Xu), ncol(Xr))) indicating the maximum number of principal components to be tested. See details.

  • "cumvar": selection of the principal components based on a given cumulative amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of cumulative variance that the combination of retained components should explain.

  • "var": selection of the principal components based on a given amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of variance that a single component should explain in order to be retained.

  • "manual": for manually specifying a fixed number of principal components. In this case, value must be a value (larger than 0 and below the minimum dimension of Xr, or of Xr and Xu combined) indicating the number of components to be retained.

The list list(method = "var", value = 0.01) is the default. Optionally, the pc_selection argument admits "opc", "cumvar", "var" or "manual" as a single character string. In such a case, the default "value" when either "opc" or "manual" is used is 40. When "cumvar" is used, the default "value" is set to 0.99, and when "var" is used, the default "value" is set to 0.01.
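As an illustration of the "var" and "cumvar" rules, the following base R sketch applies both thresholds to the explained variance of a toy matrix. The data and object names are hypothetical, the thresholds are the documented defaults, and the exact comparison operators used internally by the package are an assumption.

```r
# Toy data with decreasing column variances (hypothetical, not package data)
set.seed(42)
X_toy <- matrix(rnorm(50 * 10), nrow = 50) %*%
  diag(seq(3, 0.3, length.out = 10))
Xc <- scale(X_toy, center = TRUE, scale = FALSE)

# Proportion of variance explained by each principal component
ev <- svd(Xc)$d^2
explained <- ev / sum(ev)

# "var" rule: keep the components that individually explain at least `value`
n_var <- sum(explained >= 0.01)

# "cumvar" rule: smallest number of components whose cumulative explained
# variance reaches `value`
n_cumvar <- which(cumsum(explained) >= 0.99)[1]

c(var = n_var, cumvar = n_cumvar)
```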

center

a logical indicating if the data Xr (and Xu if specified) must be centered. If Xu is specified, the data is centered on the basis of \(Xr \cup Xu\). NOTE: This argument only applies to the principal components projection. For pls projections the data is always centered.

scale

a logical indicating if Xr (and Xu if specified) must be scaled. If Xu is specified, the data is scaled on the basis of \(Xr \cup Xu\).

...

additional arguments to be passed to pc_projection or pls_projection.

tol

tolerance limit for convergence of the nipals algorithm (default is 1e-06). In the case of PLS this applies only to Yr with more than one variable.

max_iter

maximum number of iterations (default is 1000). In the case of method = "pls" this applies only to Yr matrices with more than one variable.

object

object of class "ortho_projection".

newdata

an optional data frame or matrix in which to look for variables with which to predict. If omitted, the scores are used. It must contain the same number of columns, to be used in the same order.

Details

In the case of method = "pca", the algorithm used is the singular value decomposition in which a given data matrix (\(X\)) is factorized as follows:

\[X = UDV^{T}\]

where \(U\) and \(V\) are orthogonal matrices containing the left and right singular vectors of \(X\) respectively, and \(D\) is a diagonal matrix containing the singular values of \(X\). The matrix of principal component scores is obtained by a matrix multiplication of \(U\) and \(D\), and the matrix of principal component loadings is equivalent to the matrix \(V\).
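The relation between the factorization and the scores and loadings can be checked with a few lines of base R. This is a minimal sketch on random toy data (an assumption), using only svd() from base R rather than the package's own routines.

```r
set.seed(7)
X_toy <- matrix(rnorm(20 * 5), nrow = 20)        # toy data (assumption)
Xc <- scale(X_toy, center = TRUE, scale = FALSE) # centered matrix

sv <- svd(Xc)                                    # Xc = U D V^T
pc_scores <- sv$u %*% diag(sv$d)                 # scores = U D
pc_loadings <- sv$v                              # loadings = V

# The factorization reproduces the centered data, and the scores can
# equivalently be obtained by projecting the data onto the loadings
max(abs(pc_scores %*% t(pc_loadings) - Xc))
max(abs(Xc %*% pc_loadings - pc_scores))
```

Both differences should be zero up to floating point error.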

When method = "pca.nipals", the algorithm used for principal component analysis is the non-linear iterative partial least squares (nipals).
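For readers unfamiliar with nipals, the textbook version of the algorithm for principal components can be sketched as below. The function nipals_pca_sketch is a hypothetical helper written for illustration; its starting vector and convergence check are assumptions and may differ from what the package's compiled pca_nipals routine does.

```r
# Hypothetical helper (not the package's pca_nipals): textbook NIPALS for PCA
nipals_pca_sketch <- function(X, ncomp, tol = 1e-6, max_iter = 1000) {
  E <- scale(X, center = TRUE, scale = FALSE)  # residual matrix, deflated below
  scores <- matrix(0, nrow(E), ncomp)
  loadings <- matrix(0, ncol(E), ncomp)
  for (a in seq_len(ncomp)) {
    t_a <- E[, which.max(apply(E, 2, var))]    # start from the most variable column
    for (i in seq_len(max_iter)) {
      p_a <- crossprod(E, t_a) / sum(t_a^2)    # regress the columns of E on the score
      p_a <- p_a / sqrt(sum(p_a^2))            # normalize the loading vector
      t_new <- E %*% p_a                       # updated score vector
      converged <- sum((t_new - t_a)^2) < tol * sum(t_new^2)
      t_a <- t_new
      if (converged) break
    }
    E <- E - t_a %*% t(p_a)                    # deflate: remove this component
    scores[, a] <- t_a
    loadings[, a] <- p_a
  }
  list(scores = scores, loadings = loadings)
}

set.seed(3)
X_toy <- matrix(rnorm(40 * 6), nrow = 40) %*% diag(c(5, 3, 2, 1, 0.5, 0.2))
fit <- nipals_pca_sketch(X_toy, ncomp = 2, tol = 1e-12)

# Up to sign, the first score vector should agree with the one from the SVD
sv <- svd(scale(X_toy, center = TRUE, scale = FALSE))
abs(cor(fit$scores[, 1], sv$u[, 1]))
```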

In the case of the partial least squares projection (a.k.a. projection to latent structures), the nipals regression algorithm is used by default. Details on the "nipals" algorithm are presented in Martens (1991). Another method called modified pls ('mpls') can also be used. The modified pls was proposed by Shenk and Westerhaus (1991, see also Westerhaus, 2014) and it differs from the standard pls method in the way the weights of Xr (used to compute the matrix of scores) are obtained. While pls uses the covariance between Yr and Xr (and later their deflated versions corresponding to each pls component iteration) to obtain these weights, the modified pls uses the correlation as weights. The authors indicate that by using correlation, a larger portion of the response variable(s) can be explained.
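The difference between the two weighting schemes can be illustrated for the first component of a single-response model. The sketch below uses base R only; the toy data, the unit_norm helper and the object names are hypothetical, and the deflation steps for later components are omitted.

```r
set.seed(11)
Xr_toy <- matrix(rnorm(30 * 6), nrow = 30)             # toy predictors
Yr_toy <- Xr_toy[, 2] - 0.5 * Xr_toy[, 5] + rnorm(30, sd = 0.1)

Xc <- scale(Xr_toy, center = TRUE, scale = FALSE)
yc <- Yr_toy - mean(Yr_toy)

unit_norm <- function(w) w / sqrt(sum(w^2))            # hypothetical helper

# Standard pls: weights proportional to the covariance between each
# X variable and the response
w_pls <- unit_norm(drop(crossprod(Xc, yc)))

# Modified pls: weights proportional to the correlation instead
w_mpls <- unit_norm(drop(cor(Xc, yc)))

# First score vector under each weighting
t_pls <- Xc %*% w_pls
t_mpls <- Xc %*% w_mpls
```

Since the covariance equals the correlation multiplied by the standard deviations, the two weight vectors differ only by a per-variable rescaling; they coincide when all predictors have the same variance.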

When the "opc" method is used in pc_selection, the selection of the components is carried out by using an iterative method based on the side information concept (Ramirez-Lopez et al. 2013a, 2013b). First, let \(P\) be a sequence of retained components (so that \(P = 1, 2, ..., k\)). At each iteration, the function computes a dissimilarity matrix retaining \(p_i\) components. The values of the side information variable are compared against the side information values of their most spectrally similar observations (closest Xr observation). The optimal number of components retrieved by the function is the one that minimizes the root mean squared differences (RMSD) in the case of continuous variables, or maximizes the kappa index in the case of categorical variables. In this process, the sim_eval function is used. Note that for the "opc" method Yr is required (i.e. the side information of the observations).
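The iterative search can be sketched in base R for a continuous side information variable. The helper opc_sketch below is hypothetical and only mirrors the idea described above (nearest neighbor in the projected space, RMSD of the side information); it assumes a PCA projection with Mahalanobis distances and does not call sim_eval.

```r
# Hypothetical helper (not resemble's implementation): choose the number of
# principal components that minimizes the RMSD between the side information
# of each observation and that of its nearest neighbor
opc_sketch <- function(X, y, max_comp) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  sv <- svd(Xc, nu = max_comp, nv = 0)
  scores <- sv$u %*% diag(sv$d[seq_len(max_comp)], max_comp)
  z <- sweep(scores, 2, apply(scores, 2, sd), "/")  # Mahalanobis space
  rmsd <- numeric(max_comp)
  for (k in seq_len(max_comp)) {
    d <- as.matrix(dist(z[, seq_len(k), drop = FALSE]))
    diag(d) <- Inf                                  # exclude self-matches
    nn <- apply(d, 1, which.min)                    # closest observation
    rmsd[k] <- sqrt(mean((y - y[nn])^2))
  }
  list(rmsd = rmsd, n_components = which.min(rmsd))
}

set.seed(5)
X_toy <- matrix(rnorm(60 * 10), nrow = 60)          # toy spectra (assumption)
y_toy <- X_toy[, 1] + 0.5 * X_toy[, 2] + rnorm(60, sd = 0.1)
opc_toy <- opc_sketch(X_toy, y_toy, max_comp = 8)
opc_toy$n_components
```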

Value

a list of class ortho_projection with the following components:

predict.ortho_projection returns a matrix of scores projected for newdata.
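For a PCA projection, projecting new observations amounts to centering them with the reference means and multiplying by the loadings. The base R sketch below illustrates this idea on toy data; the object names are hypothetical and it does not reproduce the internals of predict.ortho_projection (for instance, scaling is left out).

```r
set.seed(21)
X_ref <- matrix(rnorm(30 * 5), nrow = 30)   # toy reference data
X_new <- matrix(rnorm(4 * 5), nrow = 4)     # toy new data, same columns

center <- colMeans(X_ref)
sv <- svd(sweep(X_ref, 2, center, "-"))
loadings <- sv$v[, 1:2]                     # keep two components

# New scores: center with the reference means, then project on the loadings
new_scores <- sweep(X_new, 2, center, "-") %*% loadings
dim(new_scores)
```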

Author(s)

Leonardo Ramirez-Lopez

References

Martens, H. (1991). Multivariate calibration. John Wiley & Sons.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

Shenk, J. S., & Westerhaus, M. O. 1991. Populations structuring of near infrared spectra and modified partial least squares regression. Crop Science, 31(6), 1548-1555.

Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.

Westerhaus, M. 2014. Eastern Analytical Symposium Award for outstanding achievements in near infrared spectroscopy: my contributions to near infrared spectroscopy. NIR news, 25(8), 16-20.

See Also

ortho_diss, sim_eval, mbl

Examples

library(prospectr)
data(NIRsoil)
# Preprocess the data using detrend plus first derivative with Savitzky and
# Golay smoothing filter
sg_det <- savitzkyGolay(
  detrend(NIRsoil$spc,
    wav = as.numeric(colnames(NIRsoil$spc))
  ),
  m = 1,
  p = 1,
  w = 7
)
NIRsoil$spc_pr <- sg_det
# split into training and testing sets
test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC), ]
test_y <- NIRsoil$CEC[NIRsoil$train == 0 & !is.na(NIRsoil$CEC)]
train_y <- NIRsoil$CEC[NIRsoil$train == 1 & !is.na(NIRsoil$CEC)]
train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC), ]
# A principal component analysis using 5 components
pca_projected <- ortho_projection(train_x, pc_selection = list("manual", 5))
pca_projected
# A principal components projection using the "opc" method
# for the selection of the optimal number of components
pca_projected_2 <- ortho_projection(
  Xr = train_x, Xu = test_x, Yr = train_y,
  method = "pca",
  pc_selection = list("opc", 40)
)
pca_projected_2
plot(pca_projected_2)
# A partial least squares projection using the "opc" method
# for the selection of the optimal number of components
pls_projected <- ortho_projection(
  Xr = train_x, Xu = test_x, Yr = train_y,
  method = "pls",
  pc_selection = list("opc", 40)
)
pls_projected
plot(pls_projected)
# A partial least squares projection using the "cumvar" method
# for the selection of the optimal number of components
pls_projected_2 <- ortho_projection(
  Xr = train_x, Xu = test_x, Yr = train_y,
  method = "pls",
  pc_selection = list("cumvar", 0.99)
)

Function for computing the overall variance of a matrix

Description

Computes the variance of a matrix. For internal use only!

Usage

overall_var(X)

Arguments

X

a matrix.

Value

a vector of standard deviation values.

Author(s)

Leonardo Ramirez-Lopez


Principal components based on the non-linear iterative partial least squares (nipals) algorithm

Description

Computes principal components with the non-linear iterative partial least squares (NIPALS) algorithm. For internal use only!

Usage

pca_nipals(X, ncomp, center, scale,
           maxiter, tol,
           pcSelmethod = "var",
           pcSelvalue = 0.01)

Arguments

X

a matrix of predictor variables.

ncomp

the number of principal components.

scale

logical indicating whether X must be scaled.

maxiter

maximum number of iterations.

tol

tolerance limit for convergence of the nipals algorithm.

pcSelmethod

the method for selecting the number of components. Options are: 'cumvar' (for selecting the number of principal components based on a given cumulative amount of explained variance) and 'var' (for selecting the number of principal components based on a given amount of explained variance). Default is 'var'.

pcSelvalue

a numerical value that complements the selected method (pcSelmethod). If "cumvar" is chosen, it must be a value (larger than 0 and below 1) indicating the maximum amount of cumulative variance that the retained components should explain. If "var" is chosen, it must be a value (larger than 0 and below 1) indicating that components that explain (individually) a variance lower than this threshold must be excluded. If "manual" is chosen, it must be a value specifying the desired number of principal components to retain. Default is 0.01.

Value

a list containing the following elements:

Author(s)

Leonardo Ramirez-Lopez


Get the package version info

Description

returns package info.

Usage

pkg_info(pkg = "resemble")

Arguments

pkg

the package name, i.e. "resemble".


Plot method for an object of class mbl

Description

Plots the content of an object of class mbl

Usage

## S3 method for class 'mbl'
plot(x, g = c("validation", "gh"), param = "rmse", pls_c = c(1,2), ...)

Arguments

x

an object of class mbl (as returned by mbl).

g

a character vector indicating what results shall be plotted. Options are: "validation" (for plotting the validation results) and/or "gh" (for plotting the pls scores used to compute the GH distance. See details).

param

a character string indicating what validation statistics shall be plotted. The following options are available: "rmse", "st_rmse" or "r2". These options are only available if the mbl object contains validation results.

pls_c

a numeric vector of length one or two indicating the pls factors to be plotted. Default is c(1, 2). It is only available if "gh" is specified in the g argument.

...

some arguments to be passed to the plot methods.

Details

For plotting the pls scores from the pls score matrix (of more than one column), this matrix is first transformed from the Euclidean space to the Mahalanobis space. This is done by multiplying the score matrix by the inverse of the square root of its covariance matrix. The square root of this matrix is estimated using a singular value decomposition.
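The transformation can be sketched in base R as follows. This is a minimal illustration on toy scores, assuming that the intended operation is the usual whitening step, i.e. multiplication by the inverse square root of the covariance matrix obtained from its singular value decomposition; the object names are hypothetical.

```r
set.seed(9)
# Correlated toy scores (hypothetical stand-in for a pls score matrix)
scores_toy <- matrix(rnorm(50 * 3), nrow = 50) %*%
  matrix(c(2, 0.5, 0, 0, 1, 0.3, 0, 0, 0.5), nrow = 3)

S <- cov(scores_toy)
sv <- svd(S)                                   # S = U D U^T (S is symmetric)
S_inv_sqrt <- sv$u %*% diag(1 / sqrt(sv$d)) %*% t(sv$u)

# Scores in the Mahalanobis space: their covariance is the identity matrix,
# so Euclidean distances between rows are Mahalanobis distances
m_scores <- scores_toy %*% S_inv_sqrt
round(cov(m_scores), 6)
```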

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

See Also

mbl

Examples

library(prospectr)
data(NIRsoil)
Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]
Xu <- Xu[!is.na(Yu), ]
Yu <- Yu[!is.na(Yu)]
Xr <- Xr[!is.na(Yr), ]
Yr <- Yr[!is.na(Yr)]
ctrl <- mbl_control(validation_type = "NNv")
ex_1 <- mbl(
  Yr = Yr, Xr = Xr, Xu = Xu,
  diss_method = "cor",
  diss_usage = "none",
  gh = TRUE,
  control = ctrl,
  k = seq(50, 250, 30)
)
plot(ex_1)
plot(ex_1, g = "gh", pls_c = c(2, 3))

Plot method for an object of class ortho_projection

Description

Plots objects of class ortho_projection

Usage

## S3 method for class 'ortho_projection'
plot(x, col = "dodgerblue", ...)

Arguments

x

an object of class ortho_projection (as returned by ortho_projection).

col

the color of the plots (default is "dodgerblue").

...

arguments to be passed to methods.

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

See Also

ortho_projection


Cross validation for PLS regression

Description

for internal use only!

Usage

pls_cv(
  x,
  y,
  ncomp,
  method = c("pls", "wapls"),
  center = TRUE,
  scale,
  min_component = 1,
  new_x = matrix(0, 1, 1),
  weights = NULL,
  p = 0.75,
  number = 10,
  group = NULL,
  retrieve = TRUE,
  tune = TRUE,
  max_iter = 1,
  tol = 1e-06,
  seed = NULL,
  modified = FALSE
)

Prediction function for the gaussian_process function (Gaussian process regression with dot product covariance)

Description

Predicts response values based on a model generated by the gaussian_process function (Gaussian process regression with dot product covariance). For internal use only!

Usage

predict_gaussian_process(Xz, alpha, newdata, scale, Xcenter, Xscale, Ycenter, Yscale)

Arguments

newdata

a matrix containing the predictor variables

scale

a logical indicating whether the matrix of predictors used to create the regression model (in the gaussian_process function) was scaled.

Xcenter

if center = TRUE a matrix of one row with the values that must be used for centering newdata.

Xscale

if scale = TRUE a matrix of one row with the values that must be used for scaling newdata.

Ycenter

if center = TRUE a matrix of one row with the values that must be used for accounting for the centering of the response variable.

Yscale

if scale = TRUE a matrix of one row with the values that must be used for accounting for the scaling of the response variable.

Value

a matrix of predicted values
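As background, prediction with a Gaussian process that uses a dot product covariance can be sketched in base R as below. This is a generic illustration and not the package's routine: the toy data, the noise variance and all object names are assumptions, and the centering and scaling handled through Xcenter, Xscale, Ycenter and Yscale are left out.

```r
set.seed(2)
X_train <- matrix(rnorm(25 * 4), nrow = 25)          # toy training predictors
y_train <- drop(X_train %*% c(1, -2, 0.5, 0)) + rnorm(25, sd = 0.05)
X_new <- matrix(rnorm(5 * 4), nrow = 5)              # toy new observations
noise_var <- 0.001                                   # assumed noise variance

K <- tcrossprod(X_train)                             # dot product covariance
alpha <- solve(K + diag(noise_var, nrow(K)), y_train)

# Predictions: covariance between new and training data times alpha
y_pred <- drop(tcrossprod(X_new, X_train) %*% alpha)
y_pred
```

With this covariance function the predictions coincide with those of a ridge regression that uses the noise variance as penalty.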

Author(s)

Leonardo Ramirez-Lopez


Prediction function for the opls and fopls functions

Description

Predicts response values based on a model generated by either the opls or the fopls functions. For internal use only!

Usage

predict_opls(bo, b, ncomp, newdata, scale, Xscale)

Arguments

bo

a numeric value indicating the intercept.

b

the matrix of regression coefficients.

ncomp

an integer value indicating how many components must be used in the prediction.

newdata

a matrix containing the predictor variables.

scale

a logical indicating whether the matrix of predictors used to create the regression model was scaled.

Xscale

if scale = TRUE a matrix of one row with the values that must be used for scaling newdata.

Value

a matrix of predicted values.

Author(s)

Leonardo Ramirez-Lopez


Print method for an object of class local_fit

Description

Prints the contents of an object of class local_fit

Usage

## S3 method for class 'local_fit'
print(x, ...)

Arguments

x

an object of class local_fit

...

not yet functional.

Author(s)

Leonardo Ramirez-Lopez


Print method for an object of class ortho_diss

Description

Prints the content of an object of class ortho_diss

Usage

## S3 method for class 'local_ortho_diss'
print(x, ...)

Arguments

x

an object of class local_ortho_diss (returned by ortho_diss when it uses .local = TRUE).

...

arguments to be passed to methods (not yet functional).

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens


Print method for an object of class mbl

Description

Prints the content of an object of class mbl

Usage

## S3 method for class 'mbl'
print(x, ...)

Arguments

x

an object of class mbl (as returned by the mbl function).

...

arguments to be passed to methods (not functional).

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens


Print method for an object of class ortho_projection

Description

Prints the contents of an object of class ortho_projection

Usage

## S3 method for class 'ortho_projection'
print(x, ...)

Arguments

x

an object of class ortho_projection (as returned by the ortho_projection function).

...

arguments to be passed to methods (not yet functional).

Author(s)

Leonardo Ramirez-Lopez


Projection function for the opls function

Description

Projects new spectra onto a PLS space based on a model generated by either the opls or the opls2 function. For internal use only.

Usage

project_opls(projection_mat, ncomp, newdata, scale, Xcenter, Xscale)

Arguments

projection_mat

the projection matrix generated by the opls function.

ncomp

an integer value indicating how many components must be used in the prediction.

newdata

a matrix containing the predictor variables.

scale

a logical indicating whether the matrix of predictors used to create the regression model was scaled.

Xcenter

a matrix of one row with the values that must be used for centering newdata.

Xscale

if scale = TRUE a matrix of one row with the values that must be used for scaling newdata.

Value

a matrix corresponding to the new spectra projected onto the PLS space

Author(s)

Leonardo Ramirez-Lopez


Projection to pls and then re-construction

Description

Projects spectra onto a PLS space and then reconstructs it back.

Usage

reconstruction_error(x,
                     projection_mat,
                     xloadings,
                     scale,
                     Xcenter,
                     Xscale,
                     scale_back = FALSE)

Arguments

x

a matrix to project.

projection_mat

the projection matrix generated by the opls_get_basics function.

xloadings

the loadings matrix generated by the opls_get_basics function.

scale

logical indicating if scaling is required

Xcenter

a matrix of one row with the centering values

Xscale

a matrix of one row with the scaling values

scale_back

compute the reconstruction error after de-centering the data and de-scaling it.

Value

a matrix of 1 row and 1 column.
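
The idea of projecting and then reconstructing can be sketched in base R as below. This is only an illustration under stated assumptions: the helper reconstruction_rmse is hypothetical, principal component loadings obtained with svd stand in for the PLS projection and loadings matrices, and the root mean square error is assumed as the error measure (this section does not state which measure the package uses). The sketch returns a plain number rather than a 1 by 1 matrix.

```r
# Hypothetical helper, not part of resemble: project x onto a latent space
# and measure how well the reconstruction matches the original data.
reconstruction_rmse <- function(x, projection_mat, xloadings, Xcenter, Xscale) {
  xz <- sweep(sweep(x, 2, Xcenter, "-"), 2, Xscale, "/")
  scores <- xz %*% projection_mat   # projection onto the latent space
  xz_hat <- scores %*% xloadings    # reconstruction from the scores
  sqrt(mean((xz - xz_hat)^2))
}

set.seed(1)
x <- matrix(rnorm(40 * 6), 40, 6)
x_center <- colMeans(x)
x_scale <- rep(1, ncol(x))          # no scaling in this example
v <- svd(sweep(x, 2, x_center, "-"))$v  # PCA loadings (variables x components)

# Assumption: for PCA the projection matrix is v and the loadings are t(v)
err_3 <- reconstruction_rmse(x, v[, 1:3], t(v[, 1:3]), x_center, x_scale)
err_6 <- reconstruction_rmse(x, v, t(v), x_center, x_scale)
```

When all six components are retained the reconstruction is exact up to rounding, so err_6 is practically zero, while err_3 reflects the information lost by keeping only three components.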

Author(s)

Leonardo Ramirez-Lopez


A function to create calibration and validation sample sets for leave-group-out cross-validation

Description

For internal use only! This is stratified sampling based on the values of a continuous response variable (y). If group is provided, the sampling is done based on the groups and the average of y per group. This function is used to create calibration and validation groups for leave-group-out cross-validation (or leave-group-of-groups-out cross-validation if the group argument is provided).

Usage

sample_stratified(y, p, number, group = NULL, replacement = FALSE, seed = NULL)

Arguments

y

a matrix of one column with the response variable.

p

the percentage of samples (or groups if the group argument is used) to retain in the validation_indices set

number

the number of sample groups to be created

group

the labels for each sample in y indicating the group each observation belongs to.

replacement

a logical indicating whether sampling with replacement is required for the calibration set.

seed

an integer for the random number generator (default NULL).

Value

a list with two matrices (hold_in and hold_out) giving the indices of the observations in each column. The number of columns represents the number of sampling repetitions.
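
The following base R sketch shows one way in which stratified splits on a continuous response could be drawn. It is not the package's implementation: the helper stratified_split is hypothetical, groups and replacement are ignored, and p is assumed here to be the fraction of observations placed in hold_in.

```r
# Hypothetical helper, not part of resemble: stratified splits of a
# continuous response for leave-group-out cross-validation.
stratified_split <- function(y, p = 0.75, number = 10, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n <- length(y)
  n_in <- floor(p * n)
  # Strata: consecutive blocks of the observations sorted by y
  strata <- split(order(y), cut(seq_len(n), breaks = n_in, labels = FALSE))
  # One observation is drawn from each stratum in every repetition
  hold_in <- sapply(seq_len(number), function(i) {
    sort(vapply(strata, function(s) s[sample.int(length(s), 1)], integer(1)))
  })
  # The remaining observations are held out
  hold_out <- apply(hold_in, 2, function(idx) setdiff(seq_len(n), idx))
  list(hold_in = hold_in, hold_out = hold_out)
}

set.seed(1)
y <- rnorm(40)
splits <- stratified_split(y, p = 0.75, number = 5, seed = 42)
dim(splits$hold_in)
dim(splits$hold_out)
```

Because every stratum contributes one observation, each hold_in column covers the whole range of y, which is the purpose of stratifying on the response.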


A function for searching in a given reference set the neighbors of another given set of observations (search_neighbors)

Description

This function searches in a reference set the neighbors of the observations provided in another set.

Usage

search_neighbors(Xr, Xu, diss_method = c("pca", "pca.nipals", "pls", "mpls",
                                         "cor", "euclid", "cosine", "sid"),
                 Yr = NULL, k, k_diss, k_range, spike = NULL,
                 pc_selection = list("var", 0.01),
                 return_projection = FALSE, return_dissimilarity = FALSE,
                 ws = NULL,
                 center = TRUE, scale = FALSE,
                 documentation = character(), ...)

Arguments

Xr

a matrix of reference (spectral) observations where the neighbor search is to be conducted. See details.

Xu

an optional matrix of (spectral) observations for which the neighbors are to be searched in Xr. Default is NULL. See details.

diss_method

a character string indicating the spectral dissimilarity metric to be used in the selection of the nearest neighbors of each observation.

  • "pca": Mahalanobis distance computed on the matrix of scores of a Principal Component (PC) projection of Xr (and Xu if supplied). PC projection is done using the singular value decomposition (SVD) algorithm. See the ortho_diss function.

  • "pca.nipals": Mahalanobis distance computed on the matrix of scores of a Principal Component (PC) projection of Xr (and Xu if supplied). PC projection is done using the non-linear iterative partial least squares (NIPALS) algorithm. See the ortho_diss function.

  • "pls": Mahalanobis distance computed on the matrix of scores of a partial least squares projection of Xr (and Xu if supplied). In this case, Yr is always required. See the ortho_diss function.

  • "mpls": Mahalanobis distance computed on the matrix of scores of a modified partial least squares projection (Shenk and Westerhaus, 1991; Westerhaus, 2014) of Xr (and Xu if provided). In this case, Yr is always required. See the ortho_diss function.

  • "cor": correlation coefficient between observations. See the cor_diss function.

  • "euclid": Euclidean distance between observations. See the f_diss function.

  • "cosine": Cosine distance between observations. See the f_diss function.

  • "sid": spectral information divergence between observations. See the sid function.

Yr

a numeric matrix of n observations used as side information of Xr for the ortho_diss methods (i.e. pca, pca.nipals or pls). It is required when:

  • diss_method = "pls"

  • diss_method = "pca" with "opc" used as the method in the pc_selection argument. See ortho_diss().

k

an integer value indicating the k-nearest neighbors of each observation in Xu that must be selected from Xr.

k_diss

an integer value indicating a dissimilarity threshold. For each observation in Xu, its nearest neighbors in Xr are selected as those for which their dissimilarity to Xu is below this k_diss threshold. This threshold depends on the corresponding dissimilarity metric specified in diss_method. Either k or k_diss must be specified.

k_range

an integer vector of length 2 which specifies the minimum (first value) and the maximum (second value) number of neighbors to be retained when k_diss is given.

spike

a vector of integers (with positive and/or negative values) indicating what observations in Xr (and Yr) must be forced into or avoided in the neighborhoods.

pc_selection

a list of length 2 to be passed onto the ortho_diss methods. It is required if the method selected in diss_method is any of "pca", "pca.nipals" or "pls". This argument is used for optimizing the number of components (principal components or pls factors) to be retained. This list must contain two elements in the following order: method (a character indicating the method for selecting the number of components) and value (a numerical value that complements the selected method). The methods available are:

  • "opc": optimized principal component selection based on Ramirez-Lopez et al. (2013a, 2013b). The optimal number of components (of a set of observations) is the one for which its distance matrix minimizes the differences between the Yr value of each observation and the Yr value of its closest observation. In this case value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the maximum number of principal components to be tested. See the ortho_projection function for more details.

  • "cumvar": selection of the principal components based on a given cumulative amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of cumulative variance that the combination of retained components should explain.

  • "var": selection of the principal components based on a given amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of variance that a single component should explain in order to be retained.

  • "manual": for manually specifying a fixed number of principal components. In this case, value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the number of components to be retained.

The default is list(method = "var", value = 0.01).

Optionally, the pc_selection argument admits "opc" or "cumvar" or "var" or "manual" as a single character string. In such a case the default "value" when either "opc" or "manual" are used is 40. When "cumvar" is used the default "value" is set to 0.99 and when "var" is used, the default "value" is set to 0.01.
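
The "cumvar" and "var" rules described above can be illustrated with a few lines of base R. The helper n_components is hypothetical and only mimics the selection logic on a principal component decomposition; it is not the code used by the package.

```r
# Hypothetical helper, not part of resemble: number of principal components
# retained under the "cumvar" and "var" rules.
n_components <- function(X, method = c("cumvar", "var"), value) {
  method <- match.arg(method)
  Xc <- sweep(X, 2, colMeans(X), "-")
  explained <- svd(Xc)$d^2
  explained <- explained / sum(explained)  # variance share per component
  if (method == "cumvar") {
    # Smallest number of components whose cumulative share reaches 'value'
    which(cumsum(explained) >= value)[1]
  } else {
    # Components that individually explain at least 'value'
    sum(explained >= value)
  }
}

# Orthogonal, centered columns with sums of squares 72, 32, 8 and 2
X <- cbind(
  3 * rep(c(1, -1), 4),
  2 * rep(c(1, 1, -1, -1), 2),
  1 * rep(c(1, -1), each = 4),
  0.5 * c(1, -1, -1, 1, 1, -1, -1, 1)
)
n_cumvar <- n_components(X, "cumvar", value = 0.95)
n_var <- n_components(X, "var", value = 0.1)
```

In this toy matrix the components explain about 63%, 28%, 7% and 2% of the variance, so the two rules retain a different number of components for the thresholds used.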

return_projection

a logical indicating if the projection(s) must be returned. Projections are used if the ortho_diss methods are called (i.e. method = "pca", method = "pca.nipals" or method = "pls").

return_dissimilarity

a logical indicating if the dissimilarity matrix used for neighbor search must be returned.

ws

an odd integer value which specifies the window size, when diss_method = "cor" (cor_diss method) for moving correlation dissimilarity. If ws = NULL (default), then the window size will be equal to the number of variables (columns), i.e. instead of moving correlation, the normal correlation will be used. See the cor_diss function.

center

a logical indicating if the Xr and Xu matrices must be centered. If Xu is provided the data is centered around the mean of the pooled Xr and Xu matrices (\(Xr \cup Xu\)). For dissimilarity computations based on diss_method = "pls", the data is always centered.

scale

a logical indicating if the Xr and Xu matrices must be scaled. If Xu is provided the data is scaled based on the standard deviation of the pooled Xr and Xu matrices (\(Xr \cup Xu\)). If center = TRUE, scaling is applied after centering.

documentation

an optional character string that can be used to describe anything related to the mbl call (e.g. description of the input data). Default: character(). NOTE: this is an experimental argument.

...

further arguments to be passed to the dissimilarity function. See details.

Details

This function may be especially useful when the reference set (Xr) is very large. In some cases the number of observations in the reference set can be reduced by removing irrelevant observations (i.e. observations that are not neighbors of a particular target set). For example, this function can be used to reduce the size of the reference set before running the mbl function.

This function uses the dissimilarity function to compute the dissimilarities between Xr and Xu. Arguments to dissimilarity as well as further arguments to the functions used inside dissimilarity (i.e. ortho_diss, cor_diss, f_diss, sid) can be passed to those functions as additional arguments (i.e. ...).

If no matrix is passed to Xu, the neighbor search is conducted for the observations in Xr that are found within that matrix. If a matrix is passed to Xu, the neighbors of Xu are searched in the Xr matrix.
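
The way k, k_diss and k_range interact can be sketched on a plain Euclidean distance matrix with base R. The helper select_neighbors is hypothetical and only illustrates the selection rules; the package computes the dissimilarities through its own dissimilarity function.

```r
# Hypothetical helper, not part of resemble: pick the neighbors of each Xu
# observation in Xr from a dissimilarity matrix (rows: Xr, columns: Xu).
select_neighbors <- function(d, k = NULL, k_diss = NULL, k_range = NULL) {
  lapply(seq_len(ncol(d)), function(j) {
    ranked <- order(d[, j])
    if (!is.null(k)) {
      return(ranked[seq_len(k)])
    }
    n_below <- sum(d[, j] < k_diss)
    # Keep the neighborhood size within the bounds given by k_range
    n_keep <- min(max(n_below, k_range[1]), k_range[2])
    ranked[seq_len(n_keep)]
  })
}

set.seed(1)
Xr <- matrix(rnorm(30 * 4), 30, 4)
Xu <- matrix(rnorm(5 * 4), 5, 4)
# Euclidean distances between every Xr row and every Xu row
d <- as.matrix(dist(rbind(Xr, Xu)))[1:30, 31:35]

by_k <- select_neighbors(d, k = 10)
by_diss <- select_neighbors(d, k_diss = 1.5, k_range = c(3, 15))
# Xr observations that appear in at least one neighborhood
unique_neighbors <- sort(unique(unlist(by_k)))
```

Reference observations that never appear in unique_neighbors are the candidates for removal when the aim is to shrink a large reference set.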

Value

a list containing the following elements:

Author(s)

Leonardo Ramirez-Lopez.

References

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J.A.M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

See Also

dissimilarity, ortho_diss, cor_diss, f_diss, sid, mbl

Examples

library(prospectr)

data(NIRsoil)

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

Xu <- Xu[!is.na(Yu), ]
Yu <- Yu[!is.na(Yu)]

Xr <- Xr[!is.na(Yr), ]
Yr <- Yr[!is.na(Yr)]

# Identify the neighbor observations using the correlation dissimilarity and
# default parameters
# (In this example all the observations in Xr belong at least to the
# first 100 neighbors of one observation in Xu)
ex1 <- search_neighbors(
  Xr = Xr, Xu = Xu,
  diss_method = "cor",
  k = 40
)

# Identify the neighbor observations using principal component (PC)
# and partial least squares (PLS) dissimilarities, and using the "opc"
# approach for selecting the number of components
ex2 <- search_neighbors(
  Xr = Xr, Xu = Xu,
  diss_method = "pca",
  Yr = Yr, k = 50,
  pc_selection = list("opc", 40),
  scale = TRUE
)

# Observations that do not belong to any neighborhood
seq(1, nrow(Xr))[!seq(1, nrow(Xr)) %in% ex2$unique_neighbors]

ex3 <- search_neighbors(
  Xr = Xr, Xu = Xu,
  diss_method = "pls",
  Yr = Yr, k = 50,
  pc_selection = list("opc", 40),
  scale = TRUE
)

# Observations that do not belong to any neighborhood
seq(1, nrow(Xr))[!seq(1, nrow(Xr)) %in% ex3$unique_neighbors]

# Identify the neighbor observations using local PC dissimilarities
# Here, 150 neighbors are used to compute a local dissimilarity matrix
# and then this matrix is used to select 50 neighbors
ex4 <- search_neighbors(
  Xr = Xr, Xu = Xu,
  diss_method = "pls",
  Yr = Yr, k = 50,
  pc_selection = list("opc", 40),
  scale = TRUE,
  .local = TRUE,
  pre_k = 150
)

A function for computing the spectral information divergence between spectra (sid)

Description

Experimental lifecycle

This function computes the spectral information divergence/dissimilarity between spectra based on the Kullback-Leibler divergence algorithm (see details).

Usage

sid(Xr, Xu = NULL,
    mode = "density",
    center = FALSE, scale = FALSE,
    kernel = "gaussian",
    n = if(mode == "density") round(0.5 * ncol(Xr)),
    bw = "nrd0",
    reg = 1e-04,
    ...)

Arguments

Xr

a matrix containing the spectral (reference) data.

Xu

an optional matrix containing the spectral data of a second set of observations.

mode

the method to be used for computing the spectral information divergence. Options are "density" (default) for computing the divergence values on the density distributions of the spectral observations, and "feature" for computing the divergence values on the spectral variables. See details.

center

a logical indicating if the computations must be carried out on the centred Xr and Xu (if specified) matrices. If mode = "feature" centring is not carried out since this option does not accept negative values which are generated after centring the matrices. Default is FALSE. See details.

scale

a logical indicating if the computations must be carried out on the variance scaled Xr and Xu (if specified) matrices. Default is FALSE.

kernel

if mode = "density" a character string indicating the smoothing kernel to be used. It must be one of "gaussian" (default), "rectangular", "triangular", "epanechnikov", "biweight", "cosine" or "optcosine". See the density function of the stats package.

n

if mode = "density" a numerical value indicating the number of equally spaced points at which the density is to be estimated. See the density function of the stats package for further details. Default is round(0.5 * ncol(Xr)).

bw

if mode = "density" a numerical value indicating the smoothing kernel bandwidth to be used. Optionally the character string "nrd0" can be used, in which case the bandwidth is computed using the bw.nrd0 function of the stats package. See the density and the bw.nrd0 functions for more details. By default "nrd0" is used, in this case the bandwidth is computed as bw.nrd0(as.vector(Xr)), if Xu is specified the bandwidth is computed as bw.nrd0(as.vector(rbind(Xr, Xu))).

reg

a numerical value larger than 0 which indicates a regularization parameter. Values (probabilities) below this threshold are replaced by this value for numerical stability. Default is 1e-4.

...

additional arguments to be passed to the density function of the stats package.

Details

This function computes the spectral information divergence (distance) between spectra. When mode = "density", the function first computes the probability distribution of each spectrum, which results in a matrix of density distribution estimates. The density distributions of all the observations in the datasets are then compared based on the Kullback-Leibler divergence algorithm. When mode = "feature", the Kullback-Leibler divergence between all the observations is computed directly on the spectral variables.

The spectral information divergence (SID) algorithm (Chang, 2000) uses the Kullback-Leibler divergence (\(KL\)) or relative entropy (Kullback and Leibler, 1951) to account for the vis-NIR information provided by each spectrum. The SID between two spectra (\(x_{i}\) and \(x_{j}\)) is computed as follows:

\[sid(x_{i},x_{j}) = KL(x_{i} \parallel x_{j}) + KL(x_{j} \parallel x_{i})\]

\[sid(x_{i},x_{j}) = \sum_{l=1}^{k} p_l \log\left(\frac{p_l}{q_l}\right) + \sum_{l=1}^{k} q_l \log\left(\frac{q_l}{p_l}\right)\]

where \(k\) represents the number of variables or spectral features, \(p\) and \(q\) are the probability vectors of \(x_{i}\) and \(x_{j}\) respectively, which are calculated as:

\[p = \frac{x_i}{\sum_{l=1}^{k} x_{i,l}}\]

\[q = \frac{x_j}{\sum_{l=1}^{k} x_{j,l}}\]

From the above equations it can be seen that the original SID algorithm assumes that all the components in the data matrices are nonnegative. Therefore centering cannot be applied when mode = "feature". If a data matrix with negative values is provided and mode = "feature", the sid function automatically scales the matrix as follows:

\[X_s = \frac{X-\min(X)}{\max(X)-\min(X)}\]

or

\[X_{s} = \frac{X-\min(X, Xu)}{\max(X, Xu)-\min(X, Xu)}\]

\[Xu_{s} = \frac{Xu-\min(X, Xu)}{\max(X, Xu)-\min(X, Xu)}\]

if Xu is specified. The 0 values are replaced by a regularization parameter (reg argument) for numerical stability.

The default of the sid function is to compute the SID based on the density distributions of the spectra (mode = "density"). For each spectrum in X the density distribution is computed using the density function of the stats package. The 0 values of the estimated density distributions of the spectra are replaced by a regularization parameter (reg argument) for numerical stability. Finally the divergence between the computed spectral histograms is computed using the SID algorithm. Note that if mode = "density", the sid function will accept negative values and matrix centering will be possible.
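
The equations for mode = "feature" translate almost directly into base R. The sketch below is only an illustration of those formulas: the helper sid_feature is hypothetical, it handles a single matrix, and it applies the regularization to the probability vectors, which is an assumption about where the replacement takes place.

```r
# Hypothetical helper, not part of resemble: SID between the rows of X
# computed directly on the spectral variables ("feature" mode).
sid_feature <- function(X, reg = 1e-4) {
  if (any(X < 0)) {
    X <- (X - min(X)) / (max(X) - min(X))  # min-max scaling
  }
  P <- X / rowSums(X)                       # probability vector per spectrum
  P[P < reg] <- reg                         # regularization of tiny values
  n <- nrow(P)
  out <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      p <- P[i, ]
      q <- P[j, ]
      # Symmetric sum of the two Kullback-Leibler divergences
      out[i, j] <- sum(p * log(p / q)) + sum(q * log(q / p))
    }
  }
  out
}

X <- matrix(c(1, 2, 3,
              2, 4, 6,
              3, 2, 1), nrow = 3, byrow = TRUE)
sid_x <- sid_feature(X)
```

Rows 1 and 2 are proportional to each other, so they share the same probability vector and their divergence is zero, which shows that SID is insensitive to a multiplicative change in intensity.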

Value

a list with the following components:

Author(s)

Leonardo Ramirez-Lopez

References

Chang, C.I. 2000. An information theoretic-based approach to spectral variability, similarity and discriminability for hyperspectral image analysis. IEEE Transactions on Information Theory 46, 1927-1932.

See Also

density

Examples

library(prospectr)

data(NIRsoil)

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

Xu <- Xu[!is.na(Yu), ]
Xr <- Xr[!is.na(Yr), ]

# Example 1
# Compute the SID distance between all the observations in Xr
xr_sid <- sid(Xr)
xr_sid

# Example 2
# Compute the SID distance between the observations in Xr and the observations
# in Xu
xr_xu_sid <- sid(Xr, Xu)
xr_xu_sid

A function for evaluating dissimilarity matrices (sim_eval)

Description

Stable lifecycle

This function searches for the most similar observation (closest neighbor) of each observation in a given dataset based on a dissimilarity matrix (e.g. a distance matrix). The observations are compared against their corresponding closest observations in terms of the side information provided. The root mean square of differences and the correlation coefficient are used for continuous variables, and the kappa index is used for discrete variables.

Usage

sim_eval(d, side_info)

Arguments

d

a symmetric matrix of dissimilarity scores between observations of a given dataset. Alternatively, a vector with the dissimilarity scores of the lower triangle (without the diagonal values) can be used (see details).

side_info

a matrix containing the side information corresponding to the observations in the dataset from which the dissimilarity matrix was computed. It can be either a numeric matrix with one or multiple columns/variables or a matrix with one character variable (discrete variable). If it is numeric, the root mean square of differences is used for assessing the similarity between the observations and their corresponding most similar observations in terms of the side information provided. If it is a character variable, then the kappa index is used. See details.

Details

For the evaluation of dissimilarity matrices this function uses side information (information about one variable which is available for a group of observations, Ramirez-Lopez et al., 2013). It is assumed that there is a (direct or indirect) correlation between this side informative variable and the variables from which the dissimilarity was computed. If side_info is numeric, the root mean square of differences (RMSD) is used for assessing the similarity between the observations and their corresponding most similar observations in terms of the side information provided. It is computed as follows:

\[j(i) = NN(xr_i, Xr^{-i})\]

\[RMSD = \sqrt{\frac{1}{m} \sum_{i=1}^{m} {(y_i - y_{j(i)})^2}}\]

where \(NN(xr_i, Xr^{-i})\) represents a function to obtain the index of the nearest neighbor observation found in \(Xr\) (excluding the \(i\)th observation) for \(xr_i\), \(y_{i}\) is the value of the side variable of the \(i\)th observation, \(y_{j(i)}\) is the value of the side variable of the nearest neighbor of the \(i\)th observation and \(m\) is the total number of observations.
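
A small worked example may help to follow the RMSD computation. The helper nn_rmsd below is hypothetical and written in base R; it follows the two equations above on a toy dataset of four observations.

```r
# Hypothetical helper, not part of resemble: RMSD between the side
# information of each observation and that of its nearest neighbor.
nn_rmsd <- function(d, y) {
  d <- as.matrix(d)
  diag(d) <- Inf                # exclude each observation from its own search
  nn <- apply(d, 1, which.min)  # j(i): index of the nearest neighbor
  sqrt(mean((y - y[nn])^2))
}

x <- c(0, 1, 3, 7)              # one variable used to compute the distances
y <- c(10, 12, 11, 20)          # side information
rmsd <- nn_rmsd(dist(x), y)
```

Here the nearest neighbors are observations 2, 1, 2 and 3 respectively, the differences in y are -2, 2, -1 and 9, and the RMSD is therefore the square root of 90 / 4 = 22.5.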

If side_info is a factor the kappa index (\(\kappa\)) is used instead of the RMSD. It is computed as follows:

\[\kappa = \frac{p_{o}-p_{e}}{1-p_{e}}\]

where both \(p_o\) and \(p_e\) are two different agreement indices between the side information of the observations and the side information of their corresponding nearest observations (i.e. most similar observations). While \(p_o\) is the relative agreement, \(p_e\) is the agreement expected by chance.
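
The kappa formula can be checked by hand with the base R sketch below. The helper kappa_index is hypothetical; it assumes the usual Cohen's kappa definitions of \(p_o\) and \(p_e\) computed from the cross-tabulation of the classes of the observations against the classes of their nearest neighbors.

```r
# Hypothetical helper, not part of resemble: kappa index between the class
# of each observation and the class of its nearest neighbor.
kappa_index <- function(observed, neighbor) {
  lev <- union(observed, neighbor)
  tab <- table(factor(observed, lev), factor(neighbor, lev))
  p_o <- sum(diag(tab)) / sum(tab)                      # relative agreement
  p_e <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # agreement by chance
  (p_o - p_e) / (1 - p_e)
}

classes <- c("a", "a", "b", "b", "a", "b")     # side information
nn_classes <- c("a", "b", "b", "b", "a", "a")  # classes of nearest neighbors
kappa <- kappa_index(classes, nn_classes)
```

In this example four of the six pairs agree (\(p_o = 4/6\)) and the agreement expected by chance is 0.5, which gives a kappa of 1/3.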

This function accepts vectors to be passed to argument d. In this case, the vector must represent the lower triangle of a dissimilarity matrix (e.g. as returned by the stats::dist() function of stats).

Value

sim_eval returns a list with the following components:

Author(s)

Leonardo Ramirez-Lopez

References

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex datasets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J.A.M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

Examples

library(prospectr)

data(NIRsoil)

sg <- savitzkyGolay(NIRsoil$spc, p = 3, w = 11, m = 0)

# Replace the original spectra with the filtered ones
NIRsoil$spc <- sg

Yr <- NIRsoil$Nt[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

# Example 1
# Compute a principal components distance
pca_d <- ortho_diss(Xr, pc_selection = list("manual", 8))$dissimilarity

# Example 1.1
# Evaluate the distance matrix on the basis of the
# side information (Yr) associated with Xr
se <- sim_eval(pca_d, side_info = as.matrix(Yr))

# The final evaluation results
se$eval

# The final values of the side information (Yr) and the values of
# the side information corresponding to the first nearest neighbors
# found by using the distance matrix
se$first_nn

# Example 1.2
# Evaluate the distance matrix on the basis of two side
# information variables (Yr and Yr_2) associated with Xr
Yr_2 <- NIRsoil$CEC[as.logical(NIRsoil$train)]
se_2 <- sim_eval(d = pca_d, side_info = cbind(Yr, Yr_2))

# The final evaluation results
se_2$eval

# The final values of the side information variables and the values
# of the side information variables corresponding to the first
# nearest neighbors found by using the distance matrix
se_2$first_nn

# Example 2
# Evaluate the distances produced by retaining different number of
# principal components (this is the same principle used in the
# optimized principal components approach ("opc"))

# first project the data
pca_2 <- ortho_projection(Xr, pc_selection = list("manual", 30))

results <- matrix(NA, pca_2$n_components, 3)
colnames(results) <- c("pcs", "rmsd", "r")
results[, 1] <- 1:pca_2$n_components
for (i in 1:pca_2$n_components) {
  ith_d <- f_diss(pca_2$scores[, 1:i, drop = FALSE], scale = TRUE)
  ith_eval <- sim_eval(ith_d, side_info = as.matrix(Yr))
  results[i, 2:3] <- as.vector(ith_eval$eval)
}
plot(results)

# Example 3
# Example 3.1
# Evaluate a dissimilarity matrix computed using the correlation
# method
cd <- cor_diss(Xr)
eval_corr_diss <- sim_eval(cd, side_info = as.matrix(Yr))
eval_corr_diss$eval

Square root of (square) symmetric matrices

Description

For internal use only

Usage

sqrt_sm(X, method = c("svd", "eigen"))
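
For reference, the square root of a symmetric matrix can be obtained from either decomposition named in the method argument. The base R sketch below is an illustration only: the helper sqrt_symmetric is hypothetical and assumes that X is symmetric and positive semi-definite.

```r
# Hypothetical helper, not part of resemble: square root S of a symmetric
# positive semi-definite matrix X, so that S %*% S equals X.
sqrt_symmetric <- function(X, method = c("svd", "eigen")) {
  method <- match.arg(method)
  if (method == "svd") {
    dec <- svd(X)
    dec$u %*% diag(sqrt(dec$d), nrow(X)) %*% t(dec$v)
  } else {
    dec <- eigen(X, symmetric = TRUE)
    # pmax guards against tiny negative eigenvalues caused by rounding
    dec$vectors %*% diag(sqrt(pmax(dec$values, 0)), nrow(X)) %*% t(dec$vectors)
  }
}

set.seed(1)
A <- matrix(rnorm(20), 5, 4)
X <- crossprod(A)  # symmetric positive definite 4 x 4 matrix
S_eigen <- sqrt_symmetric(X, "eigen")
S_svd <- sqrt_symmetric(X, "svd")
```

Both variants return the same matrix for a positive definite input, and multiplying the result by itself recovers X.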

A function to compute row-wise index of minimum values of a square distance matrix

Description

For internal use only

Usage

which_min(X)

Arguments

X

a square matrix of distances

Details

Used internally to find the nearest neighbors

Value

a vector of the indices of the minimum value in each row of the input matrix

Author(s)

Antoine Stevens


A function to compute indices of minimum values of a distance vector

Description

For internal use only

Usage

which_min_vector(X)

Arguments

X

a vector of distances

Details

Used internally to find the nearest neighbors. It searches in a lower (or upper) triangular matrix, therefore this must be the format of the input data. The piece of code int len = (sqrt(X.size()*8+1)+1)/2 generated an error in CRAN since sqrt cannot be applied to integers.
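
The expression quoted above recovers the number of observations from the length of the triangular vector. The base R sketch below illustrates that relation and the nearest neighbor search on such a vector; the helper nearest_from_vector is hypothetical and is not the compiled routine used by the package.

```r
# Hypothetical helper, not part of resemble: nearest neighbor of each
# observation from the lower triangle of a distance matrix.
nearest_from_vector <- function(v) {
  # A vector of length n * (n - 1) / 2 corresponds to n observations
  n <- (sqrt(8 * length(v) + 1) + 1) / 2
  stopifnot(n == round(n))
  d <- matrix(Inf, n, n)        # Inf on the diagonal excludes self-matches
  d[lower.tri(d)] <- v          # column-wise fill, as stored by stats::dist
  d[upper.tri(d)] <- t(d)[upper.tri(d)]
  apply(d, 1, which.min)
}

x <- c(0, 1, 3, 7)
nn <- nearest_from_vector(as.vector(dist(x)))
```

For four observations the vector has 6 elements, and (sqrt(8 * 6 + 1) + 1) / 2 gives 4 as expected.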

Value

a vector of the indices of the nearest neighbors

Author(s)

Antoine Stevens

