Movatterモバイル変換

Title:

Robust Data Analysis Through Monitoring and DynamicVisualization

Version:

0.9-0

VersionNote:

Released 0.8-1 on 2023-03-09 on CRAN

Description:

Provides interface to the 'MATLAB' toolbox 'Flexible Statistical Data Analysis (FSDA)' which is comprehensive and computationally efficient software package for robust statistics in regression, multivariate and categorical data analysis. The current R version implements tools for regression: (forward search, S- and MM-estimation, least trimmed squares (LTS) and least median of squares (LMS)), for multivariate analysis (forward search, S- and MM-estimation), for cluster analysis and cluster-wise regression. The distinctive feature of our package is the possibility of monitoring the statistics of interest as a function of breakdown point, efficiency or subset size, depending on the estimator. This is accompanied by a rich set of graphical features, such as dynamic brushing, linking, particularly useful for exploratory data analysis.

Depends:

R (≥ 3.5.0)

Imports:

rJava, methods, stats4, ggplot2

Suggests:

robustbase, rrcov, MASS

SystemRequirements:

(license-free) MATLAB Runtime (MCR) V 9.12, Java(>=8)

LazyLoad:

yes

LazyData:

yes

License:

GPL (≥ 3)

URL:

https://github.com/UniprJRC/fsdaR

BugReports:

https://github.com/UniprJRC/fsdaR/issues

Packaged:

2023-12-05 21:23:26 UTC; valen

Author:

Valentin Todorov

[aut, cre], Emmanuele Sordini [aut], Aldo Corbellini [ctb], Francesca Torti [ctb], Marco Riani [ctb], Domenico Perrotta [ctb], Andrea Cerioli [ctb]

Maintainer:

Valentin Todorov <valentin.todorov@chello.at>

NeedsCompilation:

Repository:

CRAN

RoxygenNote:

7.2.3

Date/Publication:

2023-12-06 03:40:02 UTC

Creates an`FSR_control` object

Description

Creates an object of classFSR_control to be used with thefsreg() function,containing various control parameters.

Usage

FSR_control(intercept = TRUE, h, nsamp = 1000, lms = 1, init, nocheck = FALSE,     bonflev = "", msg = TRUE, bsbmfullrank = TRUE,     plot = FALSE, bivarfit = FALSE, multivarfit = FALSE,     labeladd = FALSE, nameX, namey, ylim, xlim)

Arguments

intercept

Indicator for constant term. Scalar. Ifintercept=TRUE, a model with constant term will be fitted (default), else, no constant term will be included.

h

The number of observations that have determined the least trimmed squares estimator, scalar.h is an integer greater or equal thanp but smaller thenn. Generally if the purpose is outlier detectionh=[0.5*(n+p+1)] (default value).h can be smaller than this threshold if the purpose is to find subgroups of homogeneous observations. In this function the LTS/LMS estimator is used just to initialize the search.

nsamp

Number of subsamples which will be extracted to find the robust estimator, scalar. Ifnsamp=0 all subsets will be extracted. They will be(n choose p). If the number of all possible subset is<1000 the default is to extract all subsets otherwise just 1000.

lms

Criterion to use to find the initial subset to initialize the search (LMS, LTS with concentration steps, LTS without concentration steps or subset supplied directly by the user). The default value is 1 (Least Median of Squares is computed to initialize the search). On the other hand, if the user wants to initialze the search with LTS with all the default options for concentration steps then lms=2. If the user wants to use LTS without concentration steps, lms can be a scalar different from 1 or 2. If lms is a list it is possible to control a series of options for concentration steps (for more details see optionlms insideLXS_control). If, on the other hand, the user wants to initialize the search with a prespecified set of units there are two possibilities:

lms can be a vector with length greater than 1 which contains the list of units forming the initial subset. For example, if the user wants to initialize the search with units 4, 6 and 10 thenlms=c(4, 6, 10);
lms is a struct which contains a field named bsb which contains the list of units to initialize the search. For example, in the case of simple regression through the origin with just one explanatory variable, if the user wants to initialize the search with unit 3 thenlms=list(bsb=3).

init

Search initialization, scalar. It specifies the initial subset size to start monitoring exceedances of minimum deletion residual, if init is not specified it set equal to:p+1, if the sample size is smaller than 40 ormin(3*p+1,floor(0.5*(n+p+1))), otherwise. For example, ifinit=100, the procedure starts monitoring from stepm=100.

nocheck

Check input arguments, scalar. Ifnocheck=TRUE no check is performed on matrixy and matrixX. Notice thaty andX are left unchanged. In other words the additional column of ones for the intercept is not added. As defaultnocheck=FALSE.

bonflev

Option to be used if the distribution of the data is strongly non normal and, thus, the general signal detection rule based on consecutive exceedances cannot be used. In this case bonflev can be:

a scalar smaller than 1 which specifies the confidence level for a signal and a stopping rule based on the comparison of the minimum MD with a Bonferroni bound. For example if bonflev=0.99 the procedure stops when the trajectory exceeds for the first time the 99% bonferroni bound.
A scalar value greater than 1. In this case the procedure stops when the residual trajectory exceeds for the first time this value.

Default value is ”, which means to rely on general rules based on consecutive exceedances.

msg

Controls whether to display or not messages on the screen Ifmsg==1 (default) messages are displayed on the screen about step in which signal took place else no message is displayed on the screen.

bsbmfullrank

How to behave in case subset at step m (say bsbm) produces a singular X. In other words, this options controls what to do whenrank(X[bsbm, ]) is smaller then number of explanatory variables. Ifbsbmfullrank=1 (default) these units (whose number is say mnofullrank) are constrained to enter the search in the final n-mnofullrank steps else the search continues using as estimate of beta at step m the estimate of beta found in the previous step.

plot

Plot on the screen. Scalar. Ifplot=TRUE the plot of minimum deletion residual with envelopes based on n observations and the scatterplot matrix with the outliers highlighted is produced. Ifplot=2 the user can also monitor the intermediate plots based on envelope superimposition. Ifplot=FALSE (default) no plot is produced.

bivarfit

Wheather to superimpose bivariate least square lines on the plot (ifplot=TRUE.This option adds one or more least squares lines, based on SIMPLE REGRESSION of y on Xi, to the plots of y|Xi. The default isbivarfit=FALSE: no line is fitted. Ifbivarfit=1, a single OLS line is fitted to all points of each bivariate plot in the scatter matrix y|X. Ifbivarfit=2, two OLS lines are fitted: one to all points and another to the group of the genuine observations. The group of the potential outliers is not fitted. Ifbivarfit=0 one OLS line is fitted to each group. This is useful for the purpose of fitting mixtures of regression lines. Ifbivarfit='i1' orbivarfit='i2', etc. an OLS line is fitted to a specific group, the one with index 'i' equal to 1, 2, 3 etc. Again, useful in case of mixtures.

multivarfit

Wheather to superimpose multivariate least square lines. This option adds one or more least square lines, based on MULTIVARIATE REGRESSION of y on X, to the plots of y|Xi. The default ismultivarfit=FALSE: no line is fitted. Ifbivarfit=1, a single OLS line is fitted to all points of each bivariate plot in the scatter matrix y|X. The line added to the scatter plot y|Xi is avconst + Ci*Xi, where Ci is the coefficient of Xi in the multivariate regression and avconst is the effect of all the other explanatory variables different from Xi evaluated at their centroid (that is overline(y)'C)). Ifmultivarfit=2, same action as withmultivarfit=1 but this time we also add the line based on the group of unselected observations (i.e. the normal units).

labeladd

Add outlier labels in plot. Iflabeladd=TRUE, we label the outliers with the unit row index in matrices X and y. The default value islabeladd=FALSE, i.e. no label is added.

nameX

Add variable labels in plot. A vector of strings of lengthp containing the labels of the variables of the regression dataset. If it is empty (default) the sequenceX1, ..., Xp will be created automatically

namey

Add response label. A string containing the label of the response

ylim

Controly scale in plot. Vector with two elements controlling minimum and maximum on the y axis. Default is to use automatic scale.

xlim

Controlx scale in plot. Vector with two elements controlling minimum and maximum on the x axis. Default is to use automatic scale.

Details

Creates an object of classFSR_control to be used with thefsreg() function,containing various control parameters.

Value

An object of class"FSR_control" which is basically alist with components the input arguments of the function mapped accordingly to the corresponding Matlab function.

Author(s)

FSDA team

Examples

## Not run: data(hbk, package="robustbase")(out <- fsreg(Y~., data=hbk, method="FS", control=FSR_control(h=56, nsamp=500, lms=2)))summary(out)## End(Not run)

Creates an`FSReda_control` object

Description

Creates an object of classFSReda_control to be used with thefsreg() function,containing various control parameters.

Usage

FSReda_control(intercept = TRUE, init, nocheck = FALSE,     tstat = c("trad", "scal"), conflev = c(0.95, 0.99))

Arguments

intercept

Indicator for constant term. Scalar. Ifintercept=TRUE, a model with constant term will be fitted (default), else, no constant term will be included.

init

nocheck

tstat

The kind of t-statistics which have to be monitored.tstat="trad" implies monitoring of traditional t statistics (out$Tols). In this case the estimate of s2 at step m is based on s2m (notice that s2m<<s2 when m/n is small)tstat="scal" (default) implies monitoring of rescaled t statistics. In this case the estimate of s2 at step m is based on s2m/vartruncnorm(m/n) where vartruncnorm(m/n) is the variance of the truncated normal distribution.

conflev

Confidence level which is used to declare units as outliers. Usually conflev=0.95, 0.975, 0.99 (individual alpha) or conflev=1-0.05/n, 1-0.025/n, 1-0.01/n (simultaneous alpha). Default value is 0.975.

Details

Creates an object of classFSReda_control to be used with thefsreg() function,containing various control parameters.

Value

An object of class"FSReda_control" which is basically alist with components the input arguments of the function mapped accordingly to the corresponding Matlab function.

Author(s)

FSDA team

Examples

## Not run: data(hbk, package="robustbase")(out <- fsreg(Y~., data=hbk, method="FS", monitoring=TRUE,     control=FSReda_control(tstat="scal")))## End(Not run)

Income1

Description

Income data taken from the United States Census Bureau. The data area random sample of 200 observations referred to four variables.The goal is to predict HTOTVAL.

Usage

data(Income1)

Format

A data frame with 200 rows and 4 variables.The variables are as follows:

H_NUMPER: Number of persons in household
HOTHVAL: All other types of income except HEARNVAL Recode - Total other household income
HSSVAL: household income - social security
HTOTVAL: total household income (dollar amount)

Source

United States Census Bureau (2021). Annual Social and Economic Supplements

Examples

 data(Income1) head(Income1)

Income2

Description

A sample of 200 observations of full time employees from a municipalityin Northern Italy who have declared extra income from investmentsources. The variables are as follows.The goal is the possibility in predicting income level based onthe individual's personal information.

Usage

data(Income2)

Format

A data frame with 200 rows and 6 variables.The variables are as follows:

Age: Age of the person (the minimum is 19 and the maximum is 73).
Education: Number of years of education (the minimum value of 5 isprimary school, and the maximum value is 16 bachelor degree)
Gender: A factor - Male or Female
ExtraGain: Income from investment sources (profit-losses)apart from wages/salary
Hours: total number of declared hours worked during the week.The minimum value is 35 and the maximum is 99
Income: total yearly income (Euro amount)

Examples

 data(Income2) head(Income2)

Creates an`LSX_control` object

Description

Creates an object of classLXS_control to be used with thefsreg() function,containing various control parameters.

Usage

    LXS_control(intercept = TRUE, lms, h, bdp, nsamp, rew = FALSE, conflev = 0,         msg = TRUE, nocheck = FALSE, nomes = FALSE, plot = FALSE)

Arguments

intercept

Indicator for constant term. Scalar. Ifintercept=TRUE, a model with constant term will be fitted (default), else, no constant term will be included.

lms

lms can be a vector with length greater than 1 which contains the list of units forming the initial subset. For example, if the user wants to initialize the search with units 4, 6 and 10 thenlms=c(4, 6, 10);
lms is a struct which contains a field named bsb which contains the list of units to initialize the search. For example, in the case of simple regression through the origin with just one explanatory variable, if the user wants to initialize the search with unit 3 thenlms=list(bsb=3).

h

bdp

Breakdown point. It measures the fraction of outliers the algorithm should resist. In this case any value greater than 0 but smaller or equal than 0.5 will do fine. If on the other hand the purpose is subgroups detection then bdp can be greater than 0.5. In any case however n*(1-bdp) must be greater than p. If this condition is not fulfilled an error will be given. Please specify h or bdp not both.

nsamp

rew

LXS reweighted - if rew=1 the reweighted version of LTS (LMS) is used and the output quantities refer to the reweighted version else no reweighting is performed (default).

conflev

Confidence level which is used to declare units as outliers, usuallyconflev=0.95, 0.975, 0.99(individual alpha) or 1-0.05/n, 1-0.025/n, 1-0.01/n (simultaneous alpha). Default value is 0.975.

msg

Controls whether to display or not messages on the screen Ifmsg==1 (default) messages are displayed on the screen about step in which signal took place else no message is displayed on the screen.

nocheck

nomes

It controls whether to display or not on the screen messages about estimated time to compute LMS (LTS). If nomes is equal to 1 no message about estimated time to compute LMS (LTS) is displayed, else if nomes is equal to 0 (default), a message about estimated time is displayed.

plot

Plot on the screen. Scalar. Ifplots=TRUE the plot of minimum deletion residual with envelopes based on n observations and the scatterplot matrix with the outliers highlighted is produced. Ifplots=2 the user can also monitor the intermediate plots based on envelope superimposition. Ifplots=FALSE (default) no plot is produced.

Details

Creates an object of classFSR_control to be used with thefsreg() function,containing various control parameters.

Value

An object of class"LXS_control" which is basically alist with components the input arguments of the function mapped accordingly to the corresponding Matlab function.

Author(s)

FSDA team

Examples

## Not run:   data(hbk, package="robustbase")(out <- fsreg(Y~., data=hbk, method="LMS", control=LXS_control(h=56, nsamp=500, lms=2)))## End(Not run)

Mixture M5 Data.

Description

A bivariate data set obtained from three normal bivariate distributions with different scales andproportions 1:2:2. One of the components is strongly overlapping with another one. A 10noise is added uniformly distributed in a rectangle containing the three normal components and notstrongly overlapping with the three mixture components. A precise description of the M5 data setcan be found in Garcia-Escudero et al. (2008).

Usage

data(M5data)

Format

A data frame with 2000 rows and 3 variablesThe first two columns are the two variables. The last column is the true classification vector where symbol "0" stands for the contaminating data points.

Source

Garcia-Escudero, L.A., Gordaliza, A., Matran, C. and Mayo-Iscar, A. (2008). A General Trimming Approach to Robust Cluster Analysis,Annals of Statistics, Vol.36, 1324-1345.doi:10.1214/07-AOS515.

Creates an`MMreg_control` object

Description

Creates an object of classMMreg_control to be used with thefsreg() function,containing various control parameters for calling the MATLAB functionMMreg().

Usage

MMreg_control(intercept = TRUE, InitialEst, eff, effshape,     rhofunc = c("bisquare", "optimal", "hyperbolic", "hampel", "mdpd", "AS"),     rhofuncparam, refsteps = 3, tol = 1e-07, conflev,     nocheck = FALSE, Smsg = TRUE, plot = FALSE)

Arguments

intercept

Indicator for constant term. Scalar. Ifintercept=TRUE, a model with constant term will be fitted (default), else, no constant term will be included.

InitialEst

Starting values of the MM-estimator, a list with the fiollowingelements:loc, a $p x 1$ vector, location vector estimate andscale, a scaler, estimate of the scale. If empty (default) the program will use S estimators. In this last case it is possible to specify the options given in function Sreg.

eff

Scalar defining nominal efficiency (i.e. a number between 0.5 and 0.99). The default value is 0.95.

effshape

Location or scale efficiency. Ifeffshape=1 efficiency refers to shape efficiency else (default) efficiency refers to location efficiency.

rhofunc

Specifies the rho function which must be used to weight the residuals. Possible values are 'bisquare' 'optimal' 'hyperbolic' 'hampel'.

'bisquare' uses Tukey's rho and psi functions. See TBrho and TBpsi.
'optimal' uses optimal rho and psi functions. See OPTrho and OPTpsi.
'hyperbolic' uses hyperbolic rho and psi functions. See HYPrho and HYPpsi.
'hampel' uses Hampel rho and psi functions. See HArho and HApsi.

The default is 'bisquare'.

rhofuncparam

Additional parameters for the specified rho function. For hyperbolic rho function it is possible to set up the value of k = sup CVC (the default value of k is 4.5).

For Hampel rho function it is possible to define parameters a, b and c (the default values are a=2, b=4, c=8)

refsteps

Number of refining iterations in each subsample (default isrefsteps=3).refsteps = 0 means "raw-subsampling" without iterations.

tol

Scalar controlling tolerance in the MM loop. The default value istol=1e-6

conflev

Confidence level which is used to declare units as outliers. Usually conflev=0.95, 0.975, 0.99 (individual alpha) or conflev=1-0.05/n, 1-0.025/n, 1-0.01/n (simultaneous alpha). Default value is 0.975

nocheck

Smsg

Controls whether to display or not messages on the screen IfSmsg==TRUE (default) messages are displayed on the screen about step in which signal took place else no message is displayed on the screen.

plot

Details

Creates an object of classMMreg_control to be used with thefsreg() function,containing various control parameters.

Value

An object of class"MMreg_control" which is basically alist with components the input arguments of the function mapped accordingly to the corresponding Matlab function.

Author(s)

FSDA team

Examples

## Not run:   data(hbk, package="robustbase")(out <- fsreg(Y~., data=hbk, method="MM", control=MMreg_control(eff=0.99, rhofunc="optimal")))## End(Not run)

Creates an`MMregeda_control` object

Description

Creates an object of classMMregeda_control to be used with thefsreg() function,containing various control parameters.

Usage

    MMregeda_control(intercept = TRUE, InitialEst, Soptions, eff, effshape,     rhofunc = c("bisquare", "optimal", "hyperbolic", "hampel", "mdpd", "AS"),     rhofuncparam, refsteps = 3, tol = 1e-07, conflev, nocheck = FALSE, plot = FALSE)

Arguments

intercept

Indicator for constant term. Scalar. Ifintercept=TRUE, a model with constant term will be fitted (default), else, no constant term will be included.

InitialEst

Soptions

Options to pass to Sreg, anSreg_control object. The options are: Srhofunc, Snsamp, Srefsteps, Sreftol, Srefstepsbestr, Sreftolbestr, Sminsctol, Sbestr.See functionSreg_control for more details on these options.

It is necessary to add to the S options the letter S at the beginning. For example, if you want to use the optimal rho function the supplied option is 'Srhofunc','optimal'. For example, if you want to use 3000 subsets, the supplied option is 'Snsamp',3000

eff

Vector defining nominal efficiency (i.e. a series of numbers between 0.5 and 0.99). The default value is the sequenceseq(0.5, 0.99, 0.01)

effshape

Location or scale efficiency. Ifeffshape=1 efficiency refers to shape efficiency else (default) efficiency refers to location efficiency.

rhofunc

Specifies the rho function which must be used to weight the residuals. Possible values are 'bisquare' 'optimal' 'hyperbolic' 'hampel'.

'bisquare' uses Tukey's rho and psi functions. See TBrho and TBpsi.
'optimal' uses optimal rho and psi functions. See OPTrho and OPTpsi.
'hyperbolic' uses hyperbolic rho and psi functions. See HYPrho and HYPpsi.
'hampel' uses Hampel rho and psi functions. See HArho and HApsi.

The default is 'bisquare'.

rhofuncparam

Additional parameters for the specified rho function. For hyperbolic rho function it is possible to set up the value of k = sup CVC (the default value of k is 4.5).

For Hampel rho function it is possible to define parameters a, b and c (the default values are a=2, b=4, c=8)

refsteps

Number of refining iterations in each subsample (default isrefsteps=3).refsteps = 0 means "raw-subsampling" without iterations.

tol

Scalar controlling tolerance in the MM loop. The default value istol=1e-6.

conflev

Confidence level which is used to declare units as outliers. Usually conflev=0.95, 0.975, 0.99 (individual alpha) or conflev=1-0.05/n, 1-0.025/n, 1-0.01/n (simultaneous alpha). Default value is 0.975

nocheck

plot

Details

Creates an object of classMMregeda_control to be used with thefsreg() function,containing various control parameters.

Value

An object of class"MMregeda_control" which is basically alist with components the input arguments of the function mapped accordingly to the corresponding Matlab function.

Author(s)

FSDA team

Examples

## Not run:   data(hbk, package="robustbase")(out <- fsreg(Y~., data=hbk, method="MM", monitoring=TRUE,     control=MMregeda_control(eff=seq(0.75, 0.99, 0.01))))## End(Not run)

Creates an`Sreg_control` object

Description

Creates an object of classSreg_control to be used with thefsreg() function,containing various control parameters for calling the MATLAB functionSreg().

Usage

Sreg_control(intercept = TRUE, bdp = 0.5,     rhofunc = c("bisquare", "optimal", "hyperbolic", "hampel", "mdpd", "AS"),     rhofuncparam, nsamp = 1000, refsteps = 3, reftol = 1e-06,     refstepsbestr = 50, reftolbestr = 1e-08,     minsctol = 1e-07, bestr = 5,     conflev, msg = TRUE, nocheck = FALSE, plot = FALSE)

Arguments

intercept

Indicator for constant term. Scalar. Ifintercept=TRUE, a model with constant term will be fitted (default), else, no constant term will be included.

bdp

Breakdown point. It measures the fraction of outliers the algorithm should resist. In this case any value greater than 0 but smaller or equal than 0.5 will do fine.

Note that given bdp nominal efficiency is automatically determined.

rhofunc

Specifies the rho function which must be used to weight the residuals. Possible values are 'bisquare' 'optimal' 'hyperbolic' 'hampel'.

'bisquare' uses Tukey's rho and psi functions. See TBrho and TBpsi.
'optimal' uses optimal rho and psi functions. See OPTrho and OPTpsi.
'hyperbolic' uses hyperbolic rho and psi functions. See HYPrho and HYPpsi.
'hampel' uses Hampel rho and psi functions. See HArho and HApsi.

The default is 'bisquare'.

rhofuncparam

Additional parameters for the specified rho function. For hyperbolic rho function it is possible to set up the value of k = sup CVC (the default value of k is 4.5).

For Hampel rho function it is possible to define parameters a, b and c (the default values are a=2, b=4, c=8)

nsamp

refsteps

Number of refining iterations in each subsample (default isrefsteps=3).refsteps = 0 means "raw-subsampling" without iterations.

reftol

Tolerance for the refining steps. The default value is 1e-6

refstepsbestr

Scalar defining number of refining iterations for each best subset (default = 50).

reftolbestr

Tolerance for the refining steps for each of the best subsets. The default value isreftolbestr=1e-8.

minsctol

Value of tolerance for the iterative procedure for finding the minimum value of the scale for each subset and each of the best subsets (It is used by subroutineminscale.m).The default value isminsctol=1e-7.

bestr

Defins the number of "best betas" to remember from the subsamples. These will be later iterated until convergence (default isbestr=5).

conflev

Confidence level which is used to declare units as outliers. Usually conflev=0.95, 0.975, 0.99 (individual alpha) or conflev=1-0.05/n, 1-0.025/n, 1-0.01/n (simultaneous alpha). Default value is 0.975

msg

Controls whether to display or not messages on the screen Ifmsg==1 (default) messages are displayed on the screen about step in which signal took place else no message is displayed on the screen.

nocheck

plot

Details

Creates an object of classSreg_control to be used with thefsreg() function,containing various control parameters.

Value

An object of class"Sreg_control" which is basically alist with components the input arguments of the function mapped accordingly to the corresponding Matlab function.

Author(s)

FSDA team

Examples

## Not run:       data(hbk, package="robustbase")    (out <- fsreg(Y~., data=hbk, method="S", control=Sreg_control(bdp=0.25, nsamp=500)))## End(Not run)

Creates an`Sregeda_control` object

Description

Creates an object of classSregeda_control to be used with thefsreg() function,containing various control parameters.

Usage

Sregeda_control(intercept = TRUE, bdp = seq(0.5, 0.01, -0.01),     rhofunc = c("bisquare", "optimal", "hyperbolic", "hampel", "mdpd", "AS"),     rhofuncparam, nsamp = 1000, refsteps = 3, reftol = 1e-06,     refstepsbestr = 50, reftolbestr = 1e-08,     minsctol = 1e-07, bestr = 5,     conflev, msg = TRUE, nocheck = FALSE, plot = FALSE)

Arguments

intercept

Indicator for constant term. Scalar. Ifintercept=TRUE, a model with constant term will be fitted (default), else, no constant term will be included.

bdp

Breakdown point. It measures the fraction of outliers the algorithm should resist. In this case any value greater than 0 but smaller or equal than 0.5 will do fine.

The default value of bdp is a sequence from 0.5 to 0.01 with step 0.01

rhofunc

Specifies the rho function which must be used to weight the residuals. Possible values are 'bisquare' 'optimal' 'hyperbolic' 'hampel'.

'bisquare' uses Tukey's rho and psi functions. See TBrho and TBpsi.
'optimal' uses optimal rho and psi functions. See OPTrho and OPTpsi.
'hyperbolic' uses hyperbolic rho and psi functions. See HYPrho and HYPpsi.
'hampel' uses Hampel rho and psi functions. See HArho and HApsi.

The default is 'bisquare'.

rhofuncparam

Additional parameters for the specified rho function. For hyperbolic rho function it is possible to set up the value of k = sup CVC (the default value of k is 4.5).

For Hampel rho function it is possible to define parameters a, b and c (the default values are a=2, b=4, c=8)

nsamp

refsteps

Number of refining iterations in each subsample (default isrefsteps=3).refsteps = 0 means "raw-subsampling" without iterations.

reftol

Tolerance for the refining steps. The default value is 1e-6

refstepsbestr

Scalar defining number of refining iterations for each best subset (default = 50).

reftolbestr

Tolerance for the refining steps for each of the best subsets. The default value isreftolbestr=1e-8.

minsctol

bestr

Defins the number of "best betas" to remember from the subsamples. These will be later iterated until convergence (default isbestr=5).

conflev

Confidence level which is used to declare units as outliers. Usually conflev=0.95, 0.975, 0.99 (individual alpha) or conflev=1-0.05/n, 1-0.025/n, 1-0.01/n (simultaneous alpha). Default value is 0.975

msg

Controls whether to display or not messages on the screen Ifmsg==1 (default) messages are displayed on the screen about step in which signal took place else no message is displayed on the screen.

nocheck

plot

Details

Creates an object of classSregeda_control to be used with thefsreg() function,containing various control parameters.

Value

An object of class"Sregeda_control" which is basically alist with components the input arguments of the function mapped accordingly to the corresponding Matlab function.

Author(s)

FSDA team

Examples

## Not run: data(hbk, package="robustbase")(out <- fsreg(Y~., data=hbk, method="S", monitoring=TRUE,     control=Sregeda_control(nsamp=500, rhofunc='hyperbolic')))## End(Not run)

Simulated data X.

Description

The X dataset has been simulated by Gordaliza, Garcia-Escudero and Mayo-Iscar during the WorkshopADVANCES IN ROBUST DATA ANALYSIS AND CLUSTERING held in Ispra on October 21st-25th 2013. It isa bivariate dataset of 200 observations. It presents two parallel components without contamination.

Usage

data(X)

Format

A data frame with 200 rows and 2 variables

Bank data (Riani et al., 2014).

Description

There are 60 observations on a response y with the values of three explanatory variables.The scatter plot matrix of the data shows y increasing with each of x1, x2 and x3.The plot of residuals against fitted values shows no obvious pattern. However theFS finds that there are 6 masked outliers.

Usage

data(bank_data)

Format

A data frame with 1949 rows and 14 variables.The variables are as follows:

x1: Personal loans
x2: Financing and hire-purchase
x3: Mortgages
x4: Life insurance
x5: Share amount
x6: Bond account
x7: Current account
x8: Salary deposits
x9: Debit cards
x10: Credit cards
x11: Telephone banking
x12: Domestic direct debits
x13: Money transfers
y: Profit/loss

Source

Riani, M., Cerioli, A., Atkinson, A. C., and Perrotta, D. (2014). Supplement to ”Monitoring robust regression”. doi:10.1214/14-EJS897SUPP.

References

Riani, M., Cerioli, A., Atkinson, A. C., and Perrotta, D. (2014). Monitoring robust regression.Electronic Journal of Statistics, 8, 642-673.

Produces the carbike plot to find best relevant clustering solutions obtained by`tclustICsol`

Description

Takes as input the output of functiontclustICsol(that is a structure containing the best relevant solutions) and produces thecar-bike plot. This plot provides a concise summary of the best relevant solutions.This plot shows on the horizontal axis the value ofc and on the vertical axisthe value ofk. For each solution we draw a rectangle for the interval ofvalues for which the solution is best and stable and a horizontal line which departsfrom the rectangle for the values ofc in which the solution is only stable.Finally, for the best value ofc associated to the solution, we show a circlewith two numbers, the first number indicates the ranked solution among those which arenot spurious and the second one the ranked number including the spurious solutions.This plot has been baptized 'car-bike', because the first best solutions (in general2 or 3) are generally best and stable for a large number of values ofc andtherefore will have large rectangles. In addition, these solutions are likely tobe stable for additional values ofc and therefore are likely to have horizontallines departing from the rectangles (from here the name 'cars'). Finally, local minorsolutions (which are associated with particular values ofc andk) do notgenerally present rectangles or lines and are shown with circles (from here thename 'bikes').

Usage

carbikeplot(out, SpuriousSolutions = FALSE, trace = FALSE, ...)

Arguments

out

An S3 object of classtclusticsol.object,(output oftclustICsol) containing the relevant solutions.

SpuriousSolutions

Wheather to include or not spurious solutions. By default spuriossolutions are not included into the plot.

trace

Whether to print intermediate results. Default istrace=FALSE.

...

potential further arguments passed to lower level functions.

Author(s)

FSDA team,valentin.todorov@chello.at

References

Cerioli, A., Garcia-Escudero, L.A., Mayo-Iscar, A. and Riani M. (2017).Finding the Number of Groups in Model-Based Clustering via Constrained Likelihoods,Journal of Computational and Graphical Statistics, pp. 404-416,https://doi.org/10.1080/10618600.2017.1390469.

Examples

 ## Not run:  ##  Car-bike plot for the geyser data ======================== data(geyser2) out <- tclustIC(geyser2, whichIC="MIXMIX", plot=FALSE, alpha=0.1) ## Find the best solutions using as Information criterion MIXMIX print("Best solutions using MIXMIX") outMIXMIX <- tclustICsol(out, whichIC="MIXMIX", plot=FALSE, NumberOfBestSolutions=6) print(outMIXMIX$MIXMIXbs) carbikeplot(outMIXMIX) ##  Car-bike plot for the flea data ========================== data(flea) Y <- as.matrix(flea[, 1:(ncol(flea)-1)])    # select only the numeric variables rownames(Y) <- 1:nrow(Y) head(Y) out <- tclustIC(Y, whichIC="CLACLA", plot=FALSE, alpha=0.1, nsamp=100) ##  Find the best solutions using as Information criterion CLACLA print("Best solutions using CLACLA") outCLACLA <- tclustICsol(out,whichIC="CLACLA", plot=FALSE, NumberOfBestSolutions=66) ##  Produce the car-bike plot carbikeplot(outCLACLA) ## End(Not run)

Monitoring the correlations between consecutive distances or residuals

Description

Provides a method for obtaining the maximum empirical efficiency(in case of MM estimates) or maximum empirical breakdownplot (in case of S estimates) ormaximum subset size (in case of forward search),using various measures of correlation between then Mahalanobis distances or residuals atadjacent values of efficiecy, breakdown point or subset size.

Usage

corfwdplot(out, trace = FALSE, ...)

Arguments

out

An object of S3 class returned by one of the estimation functions with themonitoring option selected (monitoring=TRUE):fsreda.object,sregeda.object,mmregeda.object,fsmeda.object,smulteda.object ormmmulteda.object.This is a list containing the monitoring of minimum Mahalanobis distance in case ofmultivariate analysis or the monitoring of residuals in case of regression.

The needed elements ofout are

MAL: matrix containing the squared Mahalanobis distances monitored in eachstep of the forward search. Every row is associated with a unit (row of data matrix Y).This matrix can be created using one of the functionsfsmult,smult ormmmult with the monitoring option selected, i.e.monitoring=TRUE.
RES: matrix containing the residuals monitored in eachstep of the forward search. Every row is associated with a unit (row of data matrix Y).This matrix can be created using the functionfsreg withthe monitoring option selected, i.e.monitoring=TRUE .
bdp: a vector containing breakdown point values that have been used, in case of S estimates.
eff: a vector containing efficiency values that have been used, in case of MM estimates.

trace

Whether to print intermediate results. Default istrace=FALSE.

...

potential further arguments passed to lower level functions.

Value

Aggplot plot object which can be printed on screen or to a file.

Author(s)

FSDA team,valentin.todorov@chello.at

Examples

 ## Not run:  data(hbk, package="robustbase") (out <- fsmult(hbk[,1:3], monitoring=TRUE)) corfwdplot(out) (out1 <- smult(hbk, monitoring=TRUE, trace=TRUE)) corfwdplot(out1) (out2 <- mmmult(hbk[,1:3], monitoring=TRUE, trace=TRUE)) corfwdplot(out2) (out3 <- fsreg(hbk[,1:3], hbk[,4], monitoring=TRUE, trace=TRUE, method="FS")) corfwdplot(out3) (out4 <- fsreg(hbk[,1:3], hbk[,4], monitoring=TRUE, trace=TRUE, method="S")) corfwdplot(out4) (out5 <- fsreg(hbk[,1:3], hbk[,4], monitoring=TRUE, trace=TRUE, method="MM")) corfwdplot(out5) ## End(Not run)

Monitoring of the covariance matrix

Description

Plots the trajectories of the elements of the covariance (correlation) matrix monitored

Usage

covplot(  out,  xlim,  ylim,  xlab,  ylab,  main,  lwd,  lty,  col,  cex.lab,  cex.axis,  subsize,  fg.thresh,  fg.unit,  fg.labstep,  fg.lwd,  fg.lty,  fg.col,  fg.mark,  fg.cex,  standard,  fground,  tag,  datatooltip,  trace = FALSE,  ...)

Arguments

out

An object of S3 classfsmeda.object returned byfsmult withmonitoring=TRUE -a list containing the monitoring of minimum Mahalanobis distance.

The needed elements ofout are

S2cov: matrix containing the monitoring of the elementsof the covariance matrix in each step of the forward search:
Un: matrix containing the order of entry of each unit(necessary if datatooltip or databrush is selected).
X: The data matrix.

xlim

Controls thex scale in the plot.xlim is a vector with two elements controllingminimum and maximum on thex-axis. Default is to use automatic scale.

ylim

Controls they scale in the plot.ylim is a vector with two elements controllingminimum and maximum on they-axis. Default is to use automatic scale.

xlab

A title for the x axis

ylab

A title for the y axis

main

An overall title for the plot

lwd

The line width, a positive number, defaulting to 1

lty

The line type. Line types can either be specified as an integer (1=solid (default), 2=dashed,3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings "solid","dashed", "dotted", "dotdash", "longdash", or "twodash".The latter two are not supported by Matlab.

col

Colors to be used for the highlighted units

cex.lab

The magnification to be used for x and y labels relative to the current setting of cex

cex.axis

The magnification to be used for axis annotation relative to the current setting of cex

subsize

Numeric vector containing the subset size with length equal to the number of columns ofmatrix of mahalanobis distances. The default value of subsize is(nrow(MAL) - ncol(MAL) + 1):nrow(MAL)

fg.thresh

(alternative to fg.unit) numeric vector of length 1 or 2 which specifiesthe highlighted trajectories.Iflength(fg.thresh) == 1 the highlighted trajectories are those of units that throughtoutthe search had at leat once a mahalanobis distance greater thanfg.thresh.The default value isfg.thresh=2.5. Iflength(fg.thresh) == 2 the highlightedtrajectories are those of units that throughtout the search had a mahalanobis distance atleast once bigger thanfg.thresh[2] or smaller thanfg.thresh[1].

fg.unit

(alternative to fg.thresh), vector containing the list of the units to be highlighted.Iffg.unit is supplied,fg.thresh is ignored.

fg.labstep

numeric vector which specifies the steps of the search where to put labels forthe highlighted trajectories (units). The default is to put the labels at theinitial and final steps of the search.fg.labstep='' means no label.

fg.lwd

The line width for the highlighted trajectories (units). Default is 1.

fg.lty

The line type for the highlighted trajectories (units). Line types caneither be specified as an integer (1=solid (default), 2=dashed,3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings "solid","dashed", "dotted", "dotdash", "longdash", or "twodash".The latter two are not supported by Matlab.

fg.col

colors to be used for the highlighted units.

fg.mark

Controlls whether to plot highlighted trajectories as symbols.iffg.mark==TRUE each line is plotted using a differentsymbol else no marker is used (default).

fg.cex

Controls the font size of the labels of the trajectories in foreground. Iffg.cex=0 no labels will be shown - equivalent tofg.labstop="".

standard

MATLAB-style arguments - appearance of the plot in terms of xlim, ylim, axes labelsand their font size style, color of the lines, etc.

fground

MATLAB-style arguments - for the trajectories in foregroud.

tag

Plot handle. String which identifies the handle of the plot which is about to be created.The default istag='pl_mmd'. Notice that if the program finds a plot which hasa tag equal to the one specified by the user, then the output of the new plot overwritesthe existing one in the same window else a new window is created.

datatooltip

If datatooltip is not empty the user can use the mouse in order to haveinformation about the unit selected, the step in which the unit enters the search andthe associated label. If datatooltip is a list, it is possible to control the aspectof the data cursor (see MATLAB functiondatacursormode() for more details orsee the examples below). The default options areDisplayStyle="Window" andSnapToDataVertex="on".

trace

Whether to print intermediate results. Default istrace=FALSE.

...

potential further arguments passed to lower level functions.

Value

none

Author(s)

FSDA team,valentin.todorov@chello.at

Examples

 ## Not run:  X <- iris[,1:4] out <- fsmult(X, monitoring=TRUE) ##  Produce monitoring covariances plot with all the default options covplot(out) ## End(Not run)

Diabetes data

Description

The diabetes dataset, introduced by Reaven and Miller (1979),consists of 145 observations (patients). For each patient threemeasurements are reported: plasma glucose response to oral glucose,plasma insulin response to oral glucose, degree of insulin resistance.

Usage

    data("diabetes")

Format

A data frame with the following variables:

glucose: Area under plasma glucose curve after a three hour oral glucose tolerance test (OGTT).
insulin: Area under plasma insulin curve after a three hour oral glucose tolerance test (OGTT).
sspg: Steady state plasma glucose.
class: The type of diabete:Normal,Overt, andChemical.

Source

Reaven, G. M. and Miller, R. G. (1979). An attempt to define the nature of chemical diabetes using a multidimensional analysis.Diabetologia 16:17-24.

Examples

library(rrcov)data(diabetes)head(diabetes)plot(CovMcd(diabetes[, 1:3]), which="pairs", col=diabetes$class)

Demographic data from the 341 miniciplaities in Emilia Romagna (an Italian region).

Description

A data set containing 28 demographic variables for 341 municipalities in Emilia Romagna (an Italian region).

Usage

data(emilia2001)

Format

A data frame with 341 rows and 28 variablesThe variables are as follows:

less10: population aged less than 10
more75: population aged more than 75
single single-member families
divorced": divorsed
widows: widows and widowers
graduates: population aged more than 25 who are graduates
no_education: of those aged over 6 having no education
employed: activity rate
unemployed: unemployment rate
increase_popul: standardised natural increase in population
migration: standardised change in population due to migration
birth_92_94: average birth rate over 1992-94
fecundity: three-year average birth rate amongst women of child-bearing age
houses: occupied houses built since 1982
houses_2WCs: occupied houses with 2 or more WCs
houses_heating: occupied houses with fixed heating system
TV: TV licence holders
cars: number of cars for 100 inhabitants
luxury_cars: luxury cars
hotels: working in hotels and restaurants
banking: working in banking and finance
income: average declared income amongst those filing income tax returns
income_tax_returns: inhabitants filing income tax returns
factories: residents employed in factories and public services
factories_more10: employees employed in factories withy more tha 10 employees
factories_more50: employees employed in factories withy more tha 50 employees
artisanal: artisanal enterprises
entrepreneurs: enterpreneous and skilled self-employed among those of working age

@referencesAtkinson, A. C., Riani, M., and Cerioli, A. (2004).Exploring Multivariate Data with the Forward Search. Springer-Verlag, New York.

Fishery data.

Description

The fishery data consist of 677 transactions of a fishery product in Europe.For each transaction the Value in 1000 euro and the quantity in Tons are reported.

Usage

data(fishery)

Format

A data frame with 677 rows and 2 variables

Flea

Description

Flea-beetle measurements

Usage

data(flea)

Format

A data frame with 74 rows and 7 variables: six explanatory and one response variable -species.The variables are as follows:

tars1: width of the first joint of the first tarsus in microns (the sum of measurements for both tarsi)
tars2: the same for the second joint
head: the maximal width of the head between the external edges of the eyes in 0.01 mm
ade1: the maximal width of the aedeagus in the fore-part in microns
ade2: the front angle of the aedeagus ( 1 unit = 7.5 degrees)
ade3: the aedeagus width from the side in microns
species, which species is being examined -Concinna,Heptapotamica,Heikertingeri

References

A. A. Lubischew (1962), On the Use of Discriminant Functions in Taxonomy,Biometrics,184 pp.455–477.

Examples

 data(flea) head(flea)

Forbes' data on air pressure in the Alps and the boiling point of water (Weisberg, 1985).

Description

A data set on air pressure in the Alps and the boiling point of water (Weisberg, 1985).There are 17 observations on the boiling point of water at different pressures, obtainedfrom measurements at a variety of elevations in the Alps. The purpose of the experiment wasto allow prediction of pressure from boiling point, which is easily measured, and so toprovide an estimate of altitude: the higher the altitude, the lower the pressure.The dataset is characterized by one clear outlier.

Usage

data(forbes)

Format

A data frame with 17 rows and 2 variablesThe variables are as follows:

x: boiling point
y: 100 x log(pressure)

References

Weisberg, S. (1985).Applied Linear Regression. Wiley, New York.

Examples

 data(forbes) plot(y~x, data=forbes)

Description of`fsdalms` Objects

Description

An object of classfsdalms.object holds information about the result of a call tofsreg.

Value

The object itself is basically alist with the following components:

rew

Ifrew=TRUE all subsequent output refers to reweighted else no reweighting is done.

beta

p-by-1 vector containing the estimated regression parameters.

bs

p x 1 vector containing the units forming subset associated with bLMS (bLTS).

residuals

residuals.

scale

scale estimate of the residuals.

weights

Vector like y containing weights. The elements of this vector are 0 or 1. These weights identify the h observations which are used to compute the final LTS (LMS) estimate. sum(weights)=h if there is not a perfect fit otherwise sum(weights) can be greater than h

h

The number of observations that have determined the LTS (LMS) estimator, i.e. the value of h.

outliers

vector containing the list of the units declared as outliers using confidence level specified in input scalar conflev.

conflev

confidence level which is used to declare outliers. Remark:conflev will be used to draw the horizontal lines (confidence bands) in the plots. Default value is 0.975

singsub

Number of subsets wihtout full rank. Notice that if this number is greater than 0.1*(number of subsamples) a warning is produced

X

the data matrix X

y

the response vector y

The object has class"fsdalms".

Examples

## Not run:       data(hbk, package="robustbase")    (out <- fsreg(Y~., data=hbk, method="LMS"))    class(out)    summary(out)## End(Not run)

Description of`fsdalts` Objects

Description

An object of classfsdalts.object holds information about the result of a call tofsreg.

Value

The object itself is basically alist with the following components:

rew

Ifrew=TRUE all subsequent output refers to reweighted else no reweighting is done.

beta

p-by-1 vector containing the estimated regression parameters.

bs

p x 1 vector containing the units forming subset associated with bLMS (bLTS).

residuals

residuals.

scale

scale estimate of the residuals.

weights

h

The number of observations that have determined the LTS (LMS) estimator, i.e. the value of h.

outliers

vector containing the list of the units declared as outliers using confidence level specified in input scalar conflev.

conflev

confidence level which is used to declare outliers. Remark:conflev will be used to draw the horizontal lines (confidence bands) in the plots. Default value is 0.975

singsub

Number of subsets wihtout full rank. Notice that if this number is greater than 0.1*(number of subsamples) a warning is produced

X

the data matrix X

y

the response vector y

The object has class"fsdalts".

Examples

## Not run:       data(hbk, package="robustbase")    (out <- fsreg(Y~., data=hbk, method="LTS"))    class(out)    summary(out)## End(Not run)

Description of`fsmeda.object` Objects

Description

An object of classfsmeda.object holds information about the result of a call tofsmult when called with parametermonitoring=TRUE.

Value

The object itself is basically alist with the following components:

MAL: n x (n-init+1) matrix containing the monitoring of Each row represents the distance Mahalanobis distance for the corresponding unit.

BB: n x (n-init+1) matrix containing the information about the units belongingto the subset at each step of the forward search. The first column contains the indexes of the units forming subset in the initial step and each further column - the indexes of the units forming the corresponding step. The last column contains the units forming subset in the final step (all units).

md: n-by-1 vector containing the estimates of the robust Mahalanobis distances (in squared units). This vector contains the distances of each observation from the location of the data, relative to the scatter matrix cov.

mmd: (n-init) x 3 matrix. which contains the monitoring of minimumMD or (m+1)th ordered MD at each step ofthe forward search.

1st column = fwd search index (from init to n-1)
2nd column = minimum MD
3rd column = (m+1)th-ordered MD

msr: (n-init+1) x 3 matrix which contains the monitoring of maximumMD or m-th ordered MD at each step of the forward search.

1st column = fwd search index (from init to n)
2nd column = maximum MD
3rd column = mth-ordered MD

gap: (n-init+1) x 3 matrix which contains the monitoring of the gap(difference between minMD outside subset and max inside).

1st column = fwd search index (from init to n)
2nd column = min MD - max MD
3rd column = (m+1)th-ordered MD - mth ordered distance

Loc: (n-init+1) x (p+1) matrix which contains the monitoring of the estimated means at each step of the fwd search.

S2cov: (n-init+1) x (p*(p+1)/2+1) matrix which contains the monitoring of the of the elements of the covariance matrix in each step of the forward search.

1st column = fwd search index (from init to n)
2nd column = monitoring of S[1,1]
3rd column = monitoring of S[1,2]
...
last column = monitoring of S[p,p]

detS: (n-init+1) x 2 matrix which contains the monitoring of the determinant of the covariance matrix in each step of the forward search.

Un: (n-init)-by-11 matrix which contains the unit(s) includedin the subset at each step of the fwd search.REMARK: in every step the new subset is compared with theold subset. Un contains the unit(s) present in the newsubset but not in the old one. Un[1 ,2] for example contains the unit included in step init+1.Un[end, 2] contains the units included in the final stepof the search.

X: the data matrix X.

The object has class"fsmeda".

Examples

## Not run:       data(hbk, package="robustbase")    (out <- fsmult(hbk[,1:3], monitoring=TRUE))    class(out)    summary(out)## End(Not run)

Performs random start monitoring of minimum Mahalanobis distance

Description

The trajectories originate from many different random initial subsetsand provide information on the presence of groups in the data. Groups areinvestigated by monitoring the minimum Mahalanobis distance outsidethe forward search subset.

Usage

fsmmmdrs(  x,  plot = FALSE,  init,  bsbsteps,  nsimul = 200,  nocheck = FALSE,  numpool,  cleanpool = FALSE,  msg = FALSE,  trace = FALSE,  ...)

Arguments

x

An n x p data matrix (n observations and p variables).Rows of x represent observations, and columns represent variables.

Missing values (NA's) and infinite values (Inf's) are allowed,since observations (rows) with missing or infinite values willautomatically be excluded from the computations.

plot

Plots the random starts minimum Mahalanobis distance with 1Ifplot=FALSE (default) orplot=0 no plot is produced.The scale (ylim) for the y axis is defined as follows:

ylim[2] is the maximum between the values ofmmd in steps[n*0.2 n] and the finalvalue of the 99 per cent envelope multiplied by 1.1.
ylim[1] is the minimum between the values ofmmd in steps[n*0.2 n] and the 1 per cent envelope multiplied by 0.9.

Remark: the plot which is produced is very simple. In order to control a series of optionsin this plot (including the y scale) and in order to connect it dynamically to the otherforward plots it is necessary to use functionmmdrsplot.

init

Point where to start monitoring required diagnostics.Ifinit is not specified it will be set equal to(p+1).

bsbsteps

A vector which specifies for which steps of the forwardsearch it is necessary to save the units forming subset for eachrandom start. ifbsbsteps = 0 for each random start we storethe units forming subset in all steps. The default is store theunits forming subset in all steps ifn <= 500 else to storethe units forming subset at step init and steps which are multipleof 100. For example, ifn = 753 andinit = 6,units forming subset are stored form=init, 100, 200, 300, 400, 500 and 600.

REMARK: The vector bsbsteps must contain numbers from init to n.ifmin(bsbsteps) < init a warning message will be issued.

nsimul

Number of random starts. Default value isnsimul=200.

nocheck

It controls whether to perform checks on matrix Y. Ifnocheck=TRUE,no check is performed.

numpool

Ifnumpool > 1, the routine automatically checks if theParallel Computing Toolbox is installed and distributes the random starts overnumpool parallel processes. Ifnumpool <= 1, the random starts are runsequentially. By default, numpool is set equal to the number of physical coresavailable in the CPU (this choice may be inconvenient if other applicationsare running concurrently). The same happens if the numpool value chosen bythe user exceeds the available number of cores.

REMARK: up to R2013b, there was a limitation on the maximum number of cores thatcould be addressed by the parallel processing toolbox (8 and, more recently, 12).From R2014a, it is possible to run a local cluster of more than 12 workers.

REMARK: Unless you adjust the cluster profile, the default maximum number ofworkers is the same as the number of computational (physical) cores on the machine.

REMARK: In modern computers the number of logical cores is larger than the numberof physical cores. By default, MATLAB is not using all logical cores because,normally, hyper-threading is enabled and some cores are reserved to this feature.

REMARK: It is because of Remarks 3 that we have chosen as default value fornumpool the number of physical cores rather than the number of logical ones.The user can increase the number of parallel pool workers allocated to themultiple start monitoring by:

setting the NumWorkers option in the local cluster profile settings tothe number of logical cores (Remark 2). To do so go on the menuHome|Parallel|Manage Cluster Profile and set the desired"Number of workers to start on your local machine".
setting numpool to the desired number of workers

Therefore, *if a parallel pool is not already open*, UserOption numpool (if set)overwrites the number of workers set in the local/current profile. Similarly,the number of workers in the local/current profile overwrites default value ofnumpool obtained as feature('numCores') (i.e. the number of physical cores).

cleanpool

Set cleanpoolcleanpool=TRUE if the parallel pool has to be cleanedafter the execution of the random starts. Otherwise (default)cleanpool=FALSE.Clearly this option has an effect just if previous optionnumpool > 1.

msg

Level of output to sidplay. It controls whether to display or not messagesabout random start progress. More precisely, if previous optionnumpool > 1,then a progress bar is displayed, on the other hand a message will be displayed onthe screen when 10

REMARK: in order to create the progress bar whennparpool > 1 the programwrites on a temporary .txt file in the folder where the user is working.Therefore it is necessary to work in a folder where the user has writepermission. If this is not the case and the user (say) is working withoutwrite permission in folder C:/Program Files/MATLAB the following messagewill appear on the screen:

Error using ProgressBar (line 57) Do you have write permissions for C:/Program Files/MATLAB?"

trace

Whether to print intermediate results. Default istrace=FALSE.

...

potential further arguments passed to lower level functions.

Value

Returns an object of classfsmmmdrs.object.

Author(s)

FSDA team,valentin.todorov@chello.at

References

Atkinson, A.C., Riani, M., and Cerioli, A. (2006), Random Start ForwardSearches with Envelopes for Detecting Clusters in Multivariate Data,in: Zani S., Cerioli A., Riani M., Vichi M., Eds.,Data Analysis,Classification and the Forward Search, pp. 163-172, Springer Verlag.

Atkinson, A.C. and Riani, M., (2007), Exploratory Tools for ClusteringMultivariate Data,Computational Statistics and Data Analysis,Vol. 52, pp. 272-285, doi:10.1016/j.csda.2006.12.034

Riani, M., Cerioli, A., Atkinson, A.C., Perrotta, D. and Torti, F. (2008),Fitting Mixtures of Regression Lines with the Forward Search, in:Mining Massive Data Sets for Security, F. Fogelman-Soulieet al. Eds., pp. 271-286, IOS Press.

Examples

 ## Not run:  data(hbk, package="robustbase") out <- fsmmmdrs(hbk[,1:3]) class(out) summary(out) ## End(Not run)

Description of`fsmmmdrs.object` Objects

Description

An object of classfsmmmdrs.object holds information about the result of a call tofsmmmdrs.

Value

The object itself is basically alist with the following components:

mmdrs: Minimum Mahalanobis distance, (n-init) by (nsimul+1) matrix which contains the monitoring of minimumMahalanobis distance at each step of the forward search.

1st column = fwd search index (from init to n-1)
2nd column = minimum Mahalanobis distance for random start 1
3rd column ...
nsimul+1 column minimum Mahalanobis distance for random start nsimul

BBrs: Units belonging to the subset at the steps specified by input option bsbsteps. Ifbsbsteps=0BBrs has sizen-by-(n-init+1)-by-nsimul. In this caseBBrs[,,j] with j=1, 2, ..., nsimul has the following structure:

1st row = has number 1 in correspondence of the steps in which unit 1 is included inside subset and a missing value for the other steps
...
(n-1)-th row = has number n-1 in correspondence of the steps in which unit n-1 is included inside subset and a missing value for the other steps
n-th row = has the number n in correspondence of the steps in which unit n is included inside subset and a missing value for the other steps

If, on the other hand, bsbsteps is a vector which specifies the steps of the search in which it is necessary to store subset, BBrs has sizen-by-length(bsbsteps)-by-nsimul.In other words,BBrs[,,j] withj=1, 2, ..., nsimul has the same structure as before, but now contains justlength(bsbsteps) columns.

X: the data matrix X.

The object has class"fsmmmdrs".

Examples

## Not run:     data(hbk, package="robustbase")    out <- fsmmmdrs(hbk[,1:3])    class(out)    summary(out)## End(Not run)

Gives an automatic outlier detection procedure in multivariate analysis

Description

Gives an automatic outlier detection procedure in multivariate analysis andperforms forward search in multivariate analysis with exploratory data

Usage

fsmult(  x,  bsb,  monitoring = FALSE,  crit = c("md", "biv", "uni"),  rf = 0.95,  init,  plot = FALSE,  bonflev,  msg = TRUE,  nocheck = FALSE,  scaled = FALSE,  trace = FALSE,  ...)

Arguments

x

An n x p data matrix (n observations and p variables).Rows of x represent observations, and columns represent variables.

Missing values (NA's) and infinite values (Inf's) are allowed,since observations (rows) with missing or infinite values willautomatically be excluded from the computations.

bsb

List of units forming the initial subset or size of the initial subset.Ifmonitoring=FALSE the default is to start the search withp+1units, containing those observationswhich are not outlying on any scatterplot, found as the intersection of all points lyingwithin a robust contour containing a specified portion of the data (Riani and Zani 1997)and inside the univariate boxplot.

Remark: if bsb is a vector, the option crit is ignored.

monitoring

Wheather to perform monitoring of Mahalanobis distances and other specific quantities

crit

If specified, the criterion to be used to initialize the search.

Ifcrit="md" the units which form initial subset are those which have thesmallestm0 pseudo Mahalanobis distances computed using procedureunibiv() (bivariate robust ellipses).
Ifcrit="biv" sorting is done first in terms of times units fell outside robust bivariate ellipsesand then in terms of pseudoMD. In other words, the units forming initial subset are chosen firstamong the set of those which never fell outside robust bivariate ellipses then among those whichfell only once outside bivariate ellipses ... up to reachm0.
Ifcrit="uni" sorting is done first in terms of times units fell outside univariate boxplots andthen in terms of pseudoMD. In other words, the units forming initial subset are chosen first amongthe set of those which never fell outside univariate boxplots then among those which fell onlyonce outside univariate boxplots... up to reachm0.

Remark: as the user can see the starting point of the search is not going to affect at all the resultsof the analysis. The user can explore this point with his own datasets.

Remark: ifcrit="biv" the user can also supply in scalar rf (see below) the confidence level ofthe bivariate ellipses.

rf

Confidence level for bivariate ellipses. The default is 0.95. This option is useful only ifcrit='biv'.

init

Point where to start monitoring required diagnostics. Note that if a vectorm0 issupplied,init >= length(m0). Ifinit is not specified it will be set equal tofloor(n*0.6).

plot

Plots the minimum Mahalanobis distance. Ifplot=FALSE (default) orplot=0 no plot is produced.Ifplot=TRUE the plot of minimum MD with envelopes based on n observations and thescatterplot matrix with the outliers highlighted is produced.Ifplot=2 the additional plots of envelope resuperimposition are produced.If plot is a list it may contain the following fields:

ylim vector with two elements controlling minimum and maximum on the y axis.Default value is ” (automatic scale)
xlim vector with two elements controlling minimum and maximum on the x axis.Default value is ” (automatic scale)
resuper vector which specifies for which steps it is necessary to show the plotsof resuperimposed envelopes if resuper is not supplied a plot of each step in whichthe envelope is resuperimposed is shown. Example: ifresuper = c(85 87)plots of resuperimposedenvelopes are shown at stepsm=85 andm=87
ncoord Ifncoord=1 plots are shown in normal coordinates else (default)plots are shown in traditional mmd coordinates
labeladd Iflabeladd=1, the outliers in the spm are labelled with the unitrow index. The default value islabeladd="", i.e. no label is added
nameY character vector containing the labels of the variables. As default value,the labels which are added are Y1, ...Yp.
lwd controls line width of the curve which contains the monitoring of minimumMahalanobis distance. Default islwd=2.
lwdenv Controls linewidth of the envelopes. Default islwdenv=2.

bonflev

Option that might be used to identify extreme outliers when the distribution ofthe data is strongly non normal. In these circumstances, the general signal detection rulebased on consecutive exceedances cannot be used. In this casebonflev can be:

a scalar smaller than 1, which specifies the confidence level for a signal and a stopping rulebased on the comparison of the minimum deletion residual with a Bonferroni bound.For example ifbonflev=0.99 the procedure stops when the trajectory exceedsfor the first time the 99 per cent bonferroni bound.
a scalar value greater than 1. In this case the procedure stops when theresidual trajectory exceeds for the first time this value.

Default value is empty, which means to rely on general rules based on consecutive exceedances.

msg

It controls whether to display or not messages on the screen. Ifmsg=TRUE(default) messages about the progression of the search are displayed on the screenotherwise only error messages will be displayed.

nocheck

It controls whether to perform checks on matrix Y. Ifnocheck=TRUE, no check is performed.

scaled

Controls whether to monitor scaled Mahalanobis distances (only ifmonitoring=TRUE). Ifscaled=TRUEMahalanobis distances monitored during the search are scaled using ratio of determinant.Ifscaled=2 Mahalanobis distances monitored during the search are scaled usingasymptotic consistency factor. The default isscaled=FALSE, that is Mahalanobis distances are not scaled.

trace

Whether to print intermediate results. Default istrace=FALSE.

...

potential further arguments passed to lower level functions.

Value

Depending on the input parametermonitoring, one ofthe following objects will be returned:

fsmult.object
fsmeda.object

Author(s)

FSDA team,valentin.todorov@chello.at

References

Riani, M., Atkinson A.C., Cerioli A. (2009). Finding an unknownnumber of multivariate outliers. Journal of the Royal StatisticalSociety Series B, Vol. 71, pp. 201-221.

Cerioli A., Farcomeni A., Riani M., (2014). Strong consistency and robustnessof the Forward Search estimator of multivariate location and scatter,Journal of Multivariate Analysis, Vol. 126, pp. 167-183,http://dx.doi.org/10.1016/j.jmva.2013.12.010.

Atkinson Riani and Cerioli (2004),Exploring multivariate data with theforward search Springer Verlag, New York.

Examples

 ## Not run:  data(hbk, package="robustbase") (out <- fsmult(hbk[,1:3])) class(out) summary(out) ##  Generate contaminated data (200,3) n <- 200 p <- 3 set.seed(123456) X <- matrix(rnorm(n*p), nrow=n) Xcont <- X Xcont[1:5, ] <- Xcont[1:5,] + 3 out1 <- fsmult(Xcont, trace=TRUE)           # no plots (plot defaults to FALSE) names(out1) (out1 <- fsmult(Xcont, trace=TRUE, plot=TRUE))    # identical to plot=1 ## plot=1 - minimum MD with envelopes based on n observations ##  and the scatterplot matrix with the outliers highlighted (out1 <- fsmult(Xcont, trace=TRUE, plot=1)) ## plot=2 - additional plots of envelope resuperimposition (out1 <- fsmult(Xcont, trace=TRUE, plot=2)) ## plots is a list: plots showing envelope superimposition in normal coordinates. (out1 <- fsmult(Xcont, trace=TRUE, plot=list(ncoord=1))) ##  Choosing an initial subset formed by the three observations with ##  the smallest Mahalanobis Distance. (out1 <- fsmult(Xcont, m0=5, crit="md", trace=TRUE)) ## fsmult() with monitoring (out2 <- fsmult(Xcont, monitoring=TRUE, trace=TRUE)) names(out2) ## Monitor the exceedances from m=200 without showing plots. n <- 1000 p <- 10 Y <- matrix(rnorm(10000), ncol=10) (out <- fsmult(Y, init=200)) ##  Forgery Swiss banknotes examples. data(swissbanknotes) ## Monitor the exceedances of Minimum Mahalanobis Distance (out1 <- fsmult(swissbanknotes[101:200,], plot=1)) ##  Control minimum and maximum on the x axis (out1 <- fsmult(swissbanknotes[101:200,], plot=list(xlim=c(60,90)))) ##  Monitor the exceedances of Minimum Mahalanobis Distance using ##  normal coordinates for mmd. (out1 <- fsmult(swissbanknotes[101:200,], plot=list(ncoord=1))) ## End(Not run)

Description of`fsmult.object` Objects

Description

An object of classfsmult.object holds information about the result of a call tofsmult.

Value

The object itself is basically alist with the following components:

outliers

kx1 vector containing the list of the k units declared as outliers or NULL if the sample is homogeneous.

loc

p-by-1 vector containing location of the data.

cov

p-by-p robust estimate of covariance matrix.

md

n-by-1 vector containing the estimates of the robust Mahalanobis distances (in squared units). This vector contains the distances of each observation from the location of the data, relative to the scatter matrix cov.

mmd

(n-init)-by-2 matrix. 1st col is the forward search index; 2nd col is the value of minimum Mahalanobis Distance in each step of the fwd search.

Un

(n-init)-by-11 matrix which contains the unit(s) includedin the subset at each step of the fwd search.REMARK: in every step the new subset is compared with theold subset. Un contains the unit(s) present in the newsubset but not in the old one. Un[1 ,2] for example contains the unit included in step init+1.Un[end, 2] contains the units included in the final stepof the search.

nout

2 x 5 matrix containing the number of times mdr went outof particular quantiles.First row contains quantiles 1 99 99.9 99.99 99.999.Second row contains the frequency distribution.It is NULL if bonflev threshold is used.

constr

This output is produced only if the search found at acertain step is a non singular matrix X. In this case thesearch run in a constrained mode, that is including theunits which produced a singular matrix in the last n-constrsteps. out.constr is a vector which contains the list ofunits which produced a singular X matrix.

X

the data matrix X

The object has class"fsmult".

Examples

## Not run:       data(hbk, package="robustbase")    (out <- fsmult(hbk[,1:3]))    class(out)    summary(out)## End(Not run)

Description of`fsr` Objects

Description

An object of classfsr.object holds information about the result of a call tofsreg.

Value

The object itself is basically alist with the following components:

beta

p-by-1 vector containing the estimated regression parameters (in step n-k).

scale

scalar containing the estimate of the scale (sigma).

residuals

residuals.

fittedvalues

fitted values.

outliers

kx1 vector containing the list of the k units declared as outliers or NULL if the sample is homogeneous.

mdr

(n-init) x 2 matrix 1st col = fwd search index, 2nd col = value of minimum deletion residual in each step of the fwd search

Un

(n-init) x 11 matrix which contains the unit(s) includedin the subset at each step of the fwd search.REMARK: in every step the new subset is compared with theold subset. Un contains the unit(s) present in the newsubset but not in the old one.Un(1,2) for example contains the unit included in stepinit+1.Un(end,2) contains the units included in the final stepof the search.

nout

2 x 5 matrix containing the number of times mdr went outof particular quantiles.First row contains quantiles 1 99 99.9 99.99 99.999.Second row contains the frequency distribution.

constr

X

the data matrix X

y

the response vector y

The object has class"fsr".

Examples

## Not run:       data(hbk, package="robustbase")    (out <- fsreg(Y~., data=hbk, method="FS"))    class(out)    summary(out)## End(Not run)

fsrbase: an automatic outlier detection procedure in linear regression

Description

An automatic outlier detection procedure in linear regression

Usage

fsrbase(x, ...) ## S3 method for class 'formula'fsrbase(formula, data, subset, weights, na.action,       model = TRUE, x.ret = FALSE, y.ret = FALSE,       contrasts = NULL, offset, ...)## Default S3 method:fsrbase(x, y, bsb, intercept = TRUE,         monitoring = FALSE, control, trace = FALSE,        ...)

Arguments

formula

aformula of the formy ~ x1 + x2 + ....

data

data frame from which variables specified informula are to be taken.

subset

an optional vector specifying a subset of observationsto be used in the fitting process.

weights

an optional vector of weights to be usedin the fitting process.NOT USED YET.

na.action

a function which indicates what should happenwhen the data containNAs. The default is set bythena.action setting ofoptions, and isna.fail if that is unset. The “factory-fresh”default isna.omit. Another possible value isNULL, no action. Valuena.exclude can be useful.

model,x.ret,y.ret

logicals indicating if themodel frame, the model matrix and the response are to be returned,respectively.

contrasts

an optional list. See thecontrasts.argofmodel.matrix.default.

offset

this can be used to specify ana prioriknown component to be included in the linear predictorduring fitting. Anoffset term can be included in theformula instead or as well, and if both are specified their sum is used.

x

Predictor variables. Matrix. Matrix of explanatoryvariables (also called 'regressors') of dimension n x (p-1)where p denotes the number of explanatory variablesincluding the intercept.Rows of X represent observations, and columns representvariables. By default, there is a constant term in themodel, unless you explicitly remove it using input optionintercept=FALSE, so do not include a column of 1s in X. Missingvalues (NA's) and infinite values (Inf's) are allowed,since observations (rows) with missing or infinite valueswill automatically be excluded from the computations.

y

Response variable. Vector. Response variable, specified asa vector of length n, where n is the number ofobservations. Each entry in y is the response for thecorresponding row of X.Missing values (NA's) and infinite values (Inf's) areallowed, since observations (rows) with missing or infinitevalues will automatically be excluded from thecomputations.

bsb

Initial subset - vector of indices. Ifbsb=0 (default) then the procedure starts with p units randomly chosen. If bsb is not 0 the search will start withm0=length(bsb).

intercept

Indicator for constant term. Scalar. Ifintercept=TRUE, a model with constant term will be fitted (default), else, no constant term will be included.

monitoring

wheather to perform monitoring for several quantities in each step of the forward search. Deafault ismonitoring=FALSE.

control

A control object (S3) containing estimation options, as returned byFSR_control. Use the functionFSR_control and see its help page. If the control object is supplied, the parameters from itwill be used. If parameters are passed also in the invocation statement, they will override the corresponding elements of the control object.

trace

Whether to print intermediate results. Default istrace=FALSE.

...

Potential further optional arguments, see the help of the functionFSR_control.

Value

Depending on the input parametermonitoring, one of the following objects will be returned:

fsr.object
fsreda.object

Author(s)

FSDA team

References

Riani, M., Atkinson A.C., Cerioli A. (2009). Finding an unknownnumber of multivariate outliers. Journal of the Royal StatisticalSociety Series B, Vol. 71, pp. 201-221.

Examples

    ## Not run:     n <- 200    p <- 3        X <- matrix(data=rnorm(n*p), nrow=n, ncol=p)    y <- matrix(data=rnorm(n*1), nrow=n, ncol=1)    (out = fsrbase(X, y))    ## Now we use the formula interface:    (out1 = fsrbase(y~X, control=FSR_control(plot=FALSE)))    ## Or use the variables in a data frame    (out2 = fsrbase(Y~., data=hbk, control=FSR_control(plot=FALSE)))    ## let us compare to the LTS solution    (out3 = ltsReg(Y~., data=hbk))        ## Now compute the model without intercept    (out4 = fsrbase(Y~.-1, data=hbk, control=FSR_control(plot=FALSE)))        ## And compare again with the LTS solution    (out5 = ltsReg(Y~.-1, data=hbk))    ## using default (optional arguments)            (out6 = fsrbase(Y~.-1, data=hbk, control=FSR_control(plot=FALSE, nsamp=1500, h=50)))    ## End(Not run)

Description of`fsreda` Objects

Description

An object of classfsreda.object holds information about the result of a call tofsreg.

Value

The object itself is basically alist with the following components:

RES

n x (n-init+1) matrix containing the monitoring of scaled residuals: the first row is the residual for the first unit, ..., n-th row is the residual for the n-th unit.

LEV

(n+1) x (n-init+1) matrix containing the monitoring of leverage: the first row is the leverage for the first unit, ..., n-th row is the leverage for the n-th unit.

BB

n x (n-init+1) matrix containing the information about the units belonging to the subset at each step of the forward search: first col contains indexes of the units forming subset in the initial step; ...; last column contains units forming subset in the final step (all units).

mdr

n-init x 3 matrix which contains the monitoring of minimum deletion residual or (m+1)-ordered residual at each step of the forward search: first col is the fwd search index (from init to n-1);2nd col = minimum deletion residual; 3rd col = (m+1)-ordered residual.

Remark: these quantities are stored with sign, that is the min deletion residual is stored with negative sign if it corresponds to a negative residual.

msr

n-init+1 x 3 matrix which contains the monitoring of maximum studentized residual or m-th ordered residual: first col is the fwd search index (from init to n); 2nd col = maximum studentized residual; 3rd col = (m)-ordered studentized residual.

nor

(n-init+1) x 4 matrix containing the monitoring of normality test in each step of the forward search: first col = fwd search index (from init to n); 2nd col = Asymmetry test; 3rd col = Kurtosis test; 4th col = Normality test.

Bols

(n-init+1) x (p+1) matrix containing the monitoring of estimated beta coefficients in each step of the forward search.

S2

(n-init+1) x 5 matrix containing the monitoring of S2 or R2 and F-test in each step of the forward search:

1st col = fwd search index (from init to n);
2nd col = monitoring of S2;
3rd col = monitoring of R2;
4th col = monitoring of rescaled S2. In this case theestimated of sigma^2 at step m is divided by theconsistency factor (to make the estimate asymptoticallyunbiased)
5th col = monitoring of F test. Note that an asymptoticunbiased estimate of sigma^2 is used.

In this case the estimated of s2 at step m is divided by the consistency factor (to make the estimate asymptotically unbiased).

coo

(n-init+1) x 3 matrix containing the monitoring of Cook or modified Cook distance in each step of the forward search:

1st col = fwd search index (from init to n);
2nd col = monitoring of Cook distance;
3rd col = monitoring of modified Cook distance.

Tols

(n-init+1) x (p+1) matrix containing the monitoring of estimated t-statistics (as specified in option input 'tstat') in each step of the forward search

Un

(n-init) x 11 matrix which contains the unit(s) included in the subset at each step of the fwd search.

REMARK: in every step the new subset is compared with the old subset. Un contains the unit(s) present in the new subset but not in the old one Un(1,2) for example contains the unit included in step init+1 Un(end,2) contains the units included in the final step of the search.

betaINT

Confidence intervals for the elements of β. betaINT is a (n-init+1)-by-2*length(confint)-by-p 3D array. Each third dimension refers to an element of beta:

betaINT[,,1] is associated with first element of beta;
...;
betaINT[,,p] is associated with last element of beta.

The first two columns contain the lower and upper confidence limits associated with conflev(1). Columns three and four contain the lower and upper confidence limits associated with conflev(2); ...;The last two columns contain the lower and upper confidence limits associated with conflev(end).For examplebetaINT[,3:4,5] contain the lower and upper confidence limits for the fifth element of beta using confidence level specified in the second element of input optionconflev.

sigma2INT

confidence interval for s2.

1st col = fwd search index;
2nd col = lower confidence limit based on conflev(1);
3rd col = upper confidence limit based on conflev(1);
4th col = lower confidence limit based on conflev(2);
5th col = upper confidence limit based on conflev(2);
...
penultimate col = lower confidence limit based on conflev(end);
last col = upper confidence limit based on conflev(end).

X

the data matrix X

y

the response vector y

The object has class"fsreda".

Examples

 ## Not run:     data(hbk, package="robustbase")    (out <- fsreg(Y~., data=hbk, method="FS", monitoring=TRUE))    class(out)    summary(out)## End(Not run)

fsreg: an automatic outlier detection procedure in linear regression

Description

An automatic outlier detection procedure in linear regression

Usage

fsreg(x, ...) ## S3 method for class 'formula'fsreg(formula, data, subset, weights, na.action,       model = TRUE, x.ret = FALSE, y.ret = FALSE,       contrasts = NULL, offset, ...)## Default S3 method:fsreg(x, y, bsb, intercept = TRUE,         family = c("homo", "hetero", "bayes"),method = c("FS", "S", "MM", "LTS", "LMS"),        monitoring = FALSE, control, trace = FALSE,        ...)

Arguments

formula

aformula of the formy ~ x1 + x2 + ....

data

data frame from which variables specified informula are to be taken.

subset

an optional vector specifying a subset of observationsto be used in the fitting process.

weights

an optional vector of weights to be usedin the fitting process.NOT USED YET.

na.action

model,x.ret,y.ret

logicals indicating if themodel frame, the model matrix and the response are to be returned,respectively.

contrasts

an optional list. See thecontrasts.argofmodel.matrix.default.

offset

family

family of robust regression models, can be 'homo' for homoscedastic (same variance) regression model, 'hetero' for heteroskedastic regression model or 'bayes' Bayesian linear regression. The default isfamily='homo' for homoscedastic regression model.

method

robust regression estimation model, can be 'FS' for Forward search, 'S' for S regression, 'MM' for MM regression, 'LMS' or 'LTS'. The default ismethod='FS' for forward search estimation.

monitoring

wheather to perform monitoring for several quantities in each step of the forward search or for series of values of the breakdown point in case of S estimates or for series of values of the efficiency in case of MM estimates. Deafault ismonitoring=FALSE.

y

x

bsb

Initial subset - vector of indices. Ifbsb=0 (default) then the procedure starts with p units randomly chosen. If bsb is not 0 the search will start withm0=length(bsb).

intercept

Indicator for constant term. Scalar. Ifintercept=TRUE, a model with constant term will be fitted (default), else, no constant term will be included.

control

A control object (S3) containing estimation options. If the control object is supplied, the parameters from itwill be used. If parameters are passed also in the invocation statement, they will override the corresponding elements of the control object.

trace

Whether to print intermediate results. Default istrace=FALSE.

...

potential further arguments passed to lower level functions.

Value

Depending on the input parametersfamily andmethod, one of the following objects will be returned:

Author(s)

FSDA team

References

Riani, M., Atkinson A.C., Cerioli A. (2009). Finding an unknownnumber of multivariate outliers. Journal of the Royal StatisticalSociety Series B, Vol. 71, pp. 201-221.

Examples

    ## Not run:     library(robustbase)        n <- 200    p <- 3        X <- matrix(data=rnorm(n*p), nrow=n, ncol=p)    y <- matrix(data=rnorm(n*1), nrow=n, ncol=1)    (out = fsreg(X, y))    ## Now we use the formula interface:    (out1 = fsreg(y~X, control=FSR_control(plot=FALSE)))    ## Or use the variables in a data frame    (out2 = fsreg(Y~., data=hbk, control=FSR_control(plot=FALSE)))    ## let us compare to the LTS solution    library(robustbase)    (out3 = ltsReg(Y~., data=hbk))        ## Now compute the model without intercept    (out4 = fsreg(Y~.-1, data=hbk, control=FSR_control(plot=FALSE)))        ## And compare again with the LTS solution    (out5 = ltsReg(Y~.-1, data=hbk))    ## using default (optional arguments)            (out6 = fsreg(Y~.-1, data=hbk, control=FSR_control(plot=FALSE, nsamp=1500, h=50)))    ## End(Not run)

Robust transformations for regression

Description

The transformations for negative and positive responses were determinedby Yeo and Johnson (2000) by imposing the smoothness condition that the secondderivative of zYJ (\lambda) with respect to y be smooth at y = 0. However some authors,for example Weisberg (2005), query the physical interpretability of this constraintwhich is oftern violated in data analysis. Accordingly, Atkinson et al. (2019) and (2020)extend the Yeo-Johnson transformation to allow two values of the transformationsparameter:\lambda_N for negative observations and\lambda_P for non-negative ones.

FSRfan monitors:

the t test associated with the constructed variable computed assuming the same transformationparameter for positive and negative observations fixed. In short we call this test,"global score test for positive observations".
the t test associated with the constructed variable computed assuming a differenttransformation for positive observations keeping the value of the transformation parameterfor negative observations fixed. In short we call this test, "test for positive observations".
the t test associated with the constructed variable computed assuming a differenttransformation for negative observations keeping the value of the transformation parameterfor positive observations fixed. In short we call this test, "test for negative observations".
the F test for the joint presence of the two constructed variables described in points 2) and 3).
the F likelihood ratio test based on the MLE of\lambda_P and\lambda_N.In this case the residual sum of squares of the null model bsaed on a single trasnformationparameter\lambda is compared with the residual sum of squares of the model basedon data transformed data using MLE of\lambda_P and\lambda_N.

Usage

fsrfan(x, ...)## S3 method for class 'formula'fsrfan(  formula,  data,  subset,  weights,  na.action,  model = TRUE,  x.ret = FALSE,  y.ret = FALSE,  contrasts = NULL,  offset,  ...)## Default S3 method:fsrfan(  x,  y,  intercept = TRUE,  plot = FALSE,  family = c("BoxCox", "YJ", "YJpn", "YJall"),  la = c(-1, -0.5, 0, 0.5, 1),  lms,  alpha = 0.75,  h,  init,  msg = FALSE,  nocheck = FALSE,  nsamp = 1000,  conflev = 0.99,  xlab,  ylab,  main,  xlim,  ylim,  lwd = 2,  lwd.env = 1,  trace = FALSE,  ...)## S3 method for class 'fsrfan'plot(  x,  conflev = 0.99,  xlim,  ylim,  xlab = "Subset of size m",  ylab = "Score test statistic",  main = "Fan plot",  col,  lty,  lwd = 2.5,  lwd.env = 1,  ...)

Arguments

x

Ann x p data matrix (n observations andp variables).Rows ofx represent observations, and columns represent variables.

Missing values (NA's) and infinite values (Inf's) are allowed,since observations (rows) with missing or infinite values willautomatically be excluded from the computations.

...

potential further arguments passed to lower level functions.

formula

aformula of the formy ~ x1 + x2 + ....

data

data frame from which variables specified informula are to be taken.

subset

an optional vector specifying a subset of observationsto be used in the fitting process.

weights

an optional vector of weights to be usedNOT USED YET.

na.action

model

logical indicating if themodel frame, is to be returned.

x.ret

logical indicating if thethe model matrixis to be returned.

y.ret

logical indicating if theresponse is to be returned.

contrasts

an optional list. See thecontrasts.argofmodel.matrix.default.

offset

this can be used to specify ana prioriknown component to be included in the linear predictorduring fitting. Anoffset term can be included in the

y

Response variable. A vector withn elements thatcontains the response variable.

intercept

wheather to use constant term (default isintercept=TRUE

plot

Ifplot=FALSE (default) no plot is produced.Ifplot=TRUE a fan plot is shown.

family

string which identifies the family of transformations which must be used. Possible values arec('BoxCox', 'YJ', 'YJpn', 'YJall'). Default is'BoxCox'. The Box-Cox family of powertransformations equals(y^{\lambda}-1)/\lambda for\lambda not equal to zero, and\log(y)if\lambda = 0.The Yeo-Johnson (YJ) transformation is the Box-Cox transformation ofy+1 for nonnegative values, and of|y|+1 with parameter2-\lambda fory negative. Remember that BoxCox can be used onlyif input y is positive. Yeo-Johnson family of transformations does not have this limitation.Iffamily='YJpn' Yeo-Johnson family is applied but in this case it is also possibleto monitor (in the output argumentsScorep andScoren) the score test forpositive and negative observations respectively. Iffamily='YJall', it is alsopossible to monitor the joint F test for the presence of the two constructed variablesfor positive and negative observations.

la

values of the transformation parameter for which it is necessaryto compute the score test. Default value of lambda isla=c(-1, -0.5, 0, 0.5, 1), i.e., the five most common values of lambda.

lms

how to find the initlal subset to initialize the search. Iflms=1 (default)Least Median of Squares (LMS) is computed, else Least Trimmed Squares (LTS) is computed.If,lms is matrix of sizep - 1 + intercept X length(la) it contains in columnj=1,..., lenght(la) the list of units forming the initial subset for the searchassociated withla(j). In this case the input optionnsamp is ignored.

alpha

the percentage (roughly) of squared residuals whose sum will be minimized,by defaultalpha=0.5. In general, alpha must between 0.5 and 1.

h

The number of observations that have determined the least trimmed squaresestimator, scalar.h is an integer greater or equal thanp but smallerthenn. Generallyh=[0.5*(n+p+1)] (default value).

init

Search initialization. It specifies the initial subset size to startmonitoring the value of the score test. Ifinit is not specified it willbe set equal to:p+1, if the sample size is smaller than 40 ormin(3 * p + 1, floor(0.5 * (n+p+1))), otherwise.

msg

Controls whether to display or not messages on the screen. Ifmsg==TRUEmessages are displayed on the screen. Ifmsg=2, detailed messages are displayed,for example the information at iteration level.

nocheck

Whether to check input arguments. Ifnocheck=TRUE no check is performedon matrixy and matrixX. Notice thaty andXare left unchanged. In other words the additional column of ones for theintercept is not added. The default isnocheck=FALSE.

nsamp

number of subsamples which will be extracted to find the robust estimator. Ifnsamp=0all subsets will be extracted. They will ben choose p.

Remark: if the number of all possible subset is <1000 the default is to extract all subsetsotherwise just 1000. Ifnsamp is a matrix of sizer-by-p, it contains in the rowsthe subsets which sill have to be extracted. For example, ifp=3 andnsamp=c(2,4,9; 23, 45, 49; 90, 34, 1)the first subset is made up of unitsc(2, 4, 9), the second subset of unitsc(23, 45, 49)and the third subset of unitsc(90 34 1).

conflev

Confidence level for the bands (default is 0.99, that is,we plot two horizontal lines corresponding to values -2.58 and 2.58).

xlab

A label for the X-axis, default is 'Subset size m'

ylab

A label for the Y-axis, default is 'Score test statistic'

main

A label for the title, default is 'Fan plot'

xlim

Minimum and maximum for the X-axis

ylim

Minimum and maximum for the Y-axis

lwd

The line width of the curves which contain the score test, a positive number, default islwd=2

lwd.env

The line width of the lines associated with the envelopes, a positive number, default islwd.env=1

trace

Whether to print intermediate results. Default istrace=FALSE.

col

a vector specifying the colors for the lines, eachone corresponding to ala value. iflength(col) < length(la),the colors will be recycled.

lty

a vector specifying the line types for the lines, eachone corresponding to ala value. iflength(col) < length(la),the colors will be recycled.

Value

An S3 object of classfsrfan.object will be returned which is basically a listcontaining the following elements:

la: vector containing the values of lambda for which fan plot is constructed
bs: matrix of sizep X length(la) containing the units formingthe initial subset for each value of lambda
Score: a matrix containing the values of the score test foreach value of the transformation parameter:
- 1st col = fwd search index;
- 2nd col = value of the score test in each step of the fwd search for la[1]
- ...
Scorep: matrix containing the values of the score test for positiveobservations for each value of the transformation parameter.
Note: this output is present only if input optionfamily='YJpn' orfamily='YJall'.
Scoren: matrix containing the values of the score test for negative observationsfor each value of the transformation parameter.
Note: this output is present only if input option 'family' is 'YJpn' or 'YJall'.
Scoreb: matrix containing the values of the score test for the jointpresence of both constructed variables (associated with positive and negativeobservations) for each value of the transformation parameter. In this casethe reference distribution is theF with 2 andsubset_size - pdegrees of freedom.
Note: this output is present only if input optionfamily='YJall'.
Un: a three-dimensional array containinglength(la) matrices ofsizeretnUn=(n-init) X retpUn=11. Each matrix containsthe unit(s) included in the subset at each step in the search associatedwith the corresponding element ofla.
REMARK: at each step the new subset is compared with the old subset.Un contains the unit(s) present in the new subset but not in the old one.

Author(s)

FSDA team,valentin.todorov@chello.at

References

Atkinson, A.C. and Riani, M. (2000),Robust Diagnostic Regression Analysis Springer Verlag, New York.

Atkinson, A.C. and Riani, M. (2002), Tests in the fan plot for robust, diagnostic transformations in regression,Chemometrics and Intelligent Laboratory Systems,60, pp. 87–100.

Atkinson, A.C. Riani, M. and Corbellini A. (2019), The analysis of transformations for profit-and-loss data,Journal of the Royal Statistical Society, Series C, "Applied Statistics",69, pp. 251–275.doi:10.1111/rssc.12389

Atkinson, A.C. Riani, M. and Corbellini A. (2021), The Box-Cox Transformation: Review and Extensions,Statistical Science,36(2), pp. 239–255.doi:10.1214/20-STS778.

Examples

## Not run:    data(wool)   XX <- wool   y <- XX[, ncol(XX)]   X <- XX[, 1:(ncol(XX)-1), drop=FALSE]   out <- fsrfan(X, y)                   # call 'fsrfan' with all default parameters   out <- fsrfan(cycles~., data=wool)    # use the formula interface   set.seed(10)   out <- fsrfan(cycles~., data=wool, plot=TRUE) # call 'fsrfan' and produce the plot   plot(out)                               # use the plot method on the fsrfan object   plot(out, conflev=c(0.9, 0.95, 0.99))   # change the confidence leel in the plot method ##====================== ## ##  fsrfan() with all default options. ##  Store values of the score test statistic for the five most common ##  values of $\lambda$. Produce also a fan plot and display it on the screen. ##  Common part to all examples: load 'wool' data set. data(wool) head(wool) dim(wool) ##  The function fsrfan() stores the score test statistic. ##  In this case we use the five most common values of lambda are considered out <- fsrfan(cycles~., data=wool) plot(out) ##  fanplot(out)                # Not yet implemented in fsdaR ##  The fan plot shows the log transformation is diffused throughout the data ##  and does not depend on the presence of particular observations. ##====================== ## ##  Example specifying 'lambda'. ##      Produce a fan plot for each value of 'lambda' in the vector 'la'. ##      Extract in matrix 'Un' the units which entered the search in each step     data(wool)     out <- fsrfan(cycles~., data=wool, la=c(-1, -0.5, 0, 0.5), plot=TRUE)     plot(out)     out$Un[,2,] ##====================== ##  Example specifying the confidence level and the initial starting point for monitoring. ##  Construct the fan plot specifying the confidence level and the initial starting point ##  for monitoring.     data(wool)     out <- fsrfan(cycles~., data=wool, init=ncol(wool)+1, nsamp=0, conflev=0.95, plots=TRUE)     plot(out, conflev=0.95) ##===================== ##  Example with starting point based on LTS. ##  Extract all subsamples, construct a fan plot specifying the confidence level ##  and the initial starting point for monitoring based on p+2 observations, ##  strong line width for lines associated with the confidence bands.     data(wool)     out <- fsrfan(cycles~., data=wool, init=ncol(wool)+1, nsamp=0, lms=0,         lwd.env=3, plot=TRUE)     plot(out, lwd.env=3) ##===================== ##  Fan plot using the loyalty cards data. ##  In this example, 'la' is the vector contanining the most common values ##  of the transformation parameter. ##  Store the score test statistics for the specified values of lambda ##  and automatically produce the fan plot     data(loyalty)     head(loyalty)     dim(loyalty) ##  la is a vector contanining the most common values of the transformation parameter     out <- fsrfan(amount_spent~., data=loyalty, la=c(0, 1/3, 0.4, 0.5),           init=ncol(loyalty)+1, plot=TRUE, lwd=3)     plot(out, lwd=3) ##  The fan plot shows that even if the third root is the best value of the transformation ##  parameter at the end of the search, in earlier steps it lies very close to the upper ##  rejection region. The best value of the transformation parameter seems to be the one ##  associated with la=0.4, which is always the confidence bands but at the end of search, ##  due to the presence of particular observations it goes below the lower rejection line. ##===================== ##  Compare BoxCox with Yeo and Johnson transformation. ##  Store values of the score test statistic for the five most common ##  values of lambda. Produce also a fan plot and display it on the screen. ##  Common part to all examples: load wool dataset.     data(wool)     ##  Store the score test statistic using Box Cox transformation.     outBC <- fsrfan(cycles~., data=wool, nsamp=0)     ##  Store the score test statistic using Yeo and Johnson transformation.     outYJ <- fsrfan(cycles~., data=wool, family="YJ", nsamp=0)     ## Not yet fully implemented     ##  fanplot(outBC, main="Box Cox")     ##  fanplot(outYJ,main="Yeo and Johnson")     plot(outBC, main="Box Cox")     plot(outYJ, main="Yeo and Johnson")     cat("\nMaximum difference in absolute value: ",         max(max(abs(outYJ$Score - outBC$Score), na.rm=TRUE)), "\n") ##======================   ## Call 'fsrfan' with Yeo-Johnson (YJ) transformation   out <- fsrfan(cycles~., data=wool, family="YJ")   plot(out)## End(Not run)

Objects returned by the function`fsrfan`

Description

An object of classfsrfan.object holds information aboutthe result of a call tofsrfan.

Value

The functionsprint() andsummary() are used to obtain and print asummary of the results. An object of classfsrfan is a list containing at least the following components:

la vector containing the values of lambda for which fan plot is constructed
bs matrix of sizep X length(la) containing the units formingthe initial subset for each value of lambda
Score a matrix containing the values of the score test foreach value of the transformation parameter:
- 1st col = fwd search index;
- 2nd col = value of the score test in each step of the fwd search for la[1]
- ...
Scorep matrix containing the values of the score test for positiveobservations for each value of the transformation parameter.
Note: this output is present only if input optionfamily='YJpn' orfamily='YJall'.
Scoren matrix containing the values of the score test for negative observationsfor each value of the transformation parameter.
Note: this output is present only if input option 'family' is 'YJpn' or 'YJall'.
Scoreb matrix containing the values of the score test for the jointpresence of both constructed variables (associated with positive and negativeobservations) for each value of the transformation parameter. In this casethe reference distribution is theF with 2 andsubset_size - pdegrees of freedom.
Note: this output is present only if input optionfamily='YJall'.
Un a three-dimensional array containinglength(la) matrices ofsizeretnUn=(n-init) X retpUn=11. Each matrix containsthe unit(s) included in the subset at each step in the search associatedwith the corresponding element ofla.
REMARK: at each step the new subset is compared with the old subset.Un contains the unit(s) present in the new subset but not in the old one.

Examples

 ## Not run:    data(wool)   XX <- wool   y <- XX[, ncol(XX)]   X <- XX[, 1:(ncol(XX)-1), drop=FALSE]   out <- fsrfan(X, y)   class(out)   summary(out) ## End(Not run)

Old Faithful Geyser Data.

Description

A bivariate data set obtained from the Old Faithful Geyser, containing the eruptionlength and the length of the previous eruption for 271 eruptions of this geyser in minutes.

Usage

data(geyser2)

Format

A data frame with 271 rows and 2 variablesThe variables are as follows:

Eruption length: The eruption length in minutes.
Previous eruption length: The length of the previous eruption in minutes.

References

Garcia-Escudero, L.A., Gordaliza, A. (1999). Robustness properties of k-means and trimmed k-means,Journal of the American Statistical Assoc., Vol.94, No.447, 956-969.

Haerdle, W. (1991).Smoothing Techniques with Implementation in S, New York: Springer.

Hawkins data.

Description

These data, simulated by Hawkins, consist of 128 observationsand eight explanatory variables(X1, ..., X8) and one dependent variable,y.

Usage

data(hawkins)

Format

A data frame with 128 rows and 9 variables

Hospital data (Neter et al., 1996)

Description

Data on the logged survival time of 108 patients undergoing liver surgery,together with four potential explanatory variables. Data are composed of54 observations plus other 54 observations, introduced to check the modelfitted to the first 54. Their comparison suggests there is no systematicdifference between the two sets. However by looking at some FS plots(Riani and Atkinson, 2007), we conclude that these two groups are significantlydifferent

Usage

data(hospital)

Format

A data frame with 108 rows and 5 variablesThe variables are as follows:

@sourceJ. NETER, M. H. KUTNER, C. J. NACHTSHEIM, W.WASSERMAN,Applied Linear Statistical Models (4th edition). McGraw-Hill, New York, 1996.

@referencesA. C. ATKINSON, M. RIANI,Robust Diagnostic Regression Analysis. Springer-Verlag, New York, 2000.

Plots the trajectories of the monitored scaled (squared) residuals

Description

Plots the trajectories of the monitored scaled (squared) residuals

Usage

levfwdplot(out,     xlim, ylim, xlab, ylab, main, lwd, lty, col, cex.lab, cex.axis,     xvalues,     fg.thresh, fg.unit, fg.labstep, fg.lwd, fg.lty, fg.col, fg.mark, fg.cex,     bg.thresh, bg.style,     xground=c("lev", "res"), tag, datatooltip, label, nameX, namey, msg, databrush,     standard, fground, bground, ...)

Arguments

out

An object containing monitoring of leverage,fsreda.object.

The needed elements ofout are

LEV: matrix containing the leverage monitored in each step of the forward search. Every row is associated with a unit.This matrix can be created using functionfsreg() withmethod="FS", monitoring=TRUE.
Un: (for FSR only) - matrix containing the order of entry in the subset of each unit (required only when datatooltip is true or databrush is not empty).
y: a vector containing the response (required only when option databrush is requested).
X: a matrix containing the explanatory variables (required only when option databrush is requested).
Bols: (n-init+1) x (p+1) matrix containing the estimated beta coefficients monitored in each step of the robust procedure (required only when option databrush is requested and suboption multivarfit is requested).

ylim

Controly scale in plot. Vector with two elements controlling minimum and maximum on the y axis. Default is to use automatic scale.

xlim

Controlx scale in plot. Vector with two elements controlling minimum and maximum on the x axis. Default is to use automatic scale.

xlab

a title for the x axis

ylab

a title for the y axis

main

an overall title for the plot

lwd

The line width, a positive number, defaulting to 1

lty

The line type. Line types can either be specified as an integer (1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash". The latter two are not supported by Matlab.

col

colors to be used for the highlighted units

cex.lab

The magnification to be used for x and y labels relative to the current setting of cex

cex.axis

The magnification to be used for axis annotation relative to the current setting of cex

xvalues

values for the x axis. Numeric vector ofncol(RES) controlling the x axis coordinates. The default value of xvalues is(nrow(RES) - ncol(RES) + 1):nrow(RES)

fg.thresh

(alternative to fg.unit) numeric vector of length 1 or 2 which specifies the highlighted trajectories.Iflength(fthresh) == 1 the highlighted trajectories are those of units that throughtout the search had at leat once a residual greater (in absolute value) than thresh. The default value isfg.thresh=2.5. Iflength(fthresh) == 2 the highlighted trajectories are those of units that throughtout the search had a residual at leat once bigger thanfg.thresh[2] or smaller thanfg.thresh[1].

fg.unit

(alternative to fg.thresh), vector containing the list of the units to be highlighted. Iffg.unit is supplied,fg.thresh is ignored.

fg.labstep

numeric vector which specifies the steps of the search where to put labels for the highlighted trajectories (units). The default is to put the labels at the initial and final steps of the search.flabstep='' means no label.

fg.lwd

The line width for the highlighted trajectories (units). Default is 1.

fg.lty

The line type for the highlighted trajectories (units). Line types can either be specified as an integer (1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash". The latter two are not supported by Matlab.

fg.col

colors to be used for the highlighted units.

fg.mark

Controlls whether to plot highlighted trajectories as symbols.iffg.mark==TRUE each line is plotted using a different symbol else no marker is used (default).

fg.cex

controls the font size of the labels of the trajectories in foreground.

bg.thresh

numeric vector of length 1 or 2 which specifies how to define the unimmportant trajectories.Unimmportant trajectories will be plotted using a colormap, in greysh or will be hidden.Iflength(thresh) == 1 the irrelevant units are those which always had a residual smaller (in absolute value) than thresh.Iflength(bthresh) == 2 the irrelevant units are those which always had a residual greater than bthresh(1) and smaller than bthresh(2). The default is:bg.thresh=2.5 ifn > 100 andbg.thresh=-Inf ifn <= 100 i.e. to treat all trajectories as important ifn <= 100 and, ifn > 100, to reduce emphasis only to trajectories having in all steps of the search a value of scaled residual smaller than 2.5.

bg.style

specifies how to plot the unimportant trajectories as defined in option bthresh.

bg.style="faint": unimportant trajectories are plotted using a colormap.
bg.style="hide": unimportant trajectories are hidden.
bg.style="greyish": unimportant trajectories are displayed in a faint grey.

Whenn>100 the default option isbg.style='faint'. Whenn <= 100 andbg.thresh == -Inf option bstyle is ignored.Remark: bground=” is equivalent to -Inf that is all trajectories are considered relevant.

tag

Plot handle. String which identifies the handle of the plot which is about to be created. The default is to use tag 'pl_resfwd'. Notice that if the program finds a plot which has a tag equal to the one specified by the user, then the output of the new plot overwrites the existing one in the same window else a new window is created.

xground

trajectories to highlight in connection with resfwdplot. Ifxground="lev" (default), the levfwdplot trajectories are put in foreground or in background depending on the leverage values. Ifxground="res", the levfwdplot trajectories are put in foreground or in background depending on the residual values. See optionsbg.thresh andfg.thresh.

datatooltip

Interactive clicking. It is inactive if this parameter is missing or empty. The default isdatatooltip=TRUE, i.e. the user can select with the mouse an individual residual trajectory in order to have information about the corresponding unit. The information displayed depends on the estimator in use.

For example for classfsreda.object the information concerns the label and the step of the search in which the unit enters the subset. If datatooltip is a list it may contain the following fields:

DisplayStyle determines how the data cursor displays. Possible values are'datatip' and'window' (default).'datatip' displays data cursor information in a small yellow text box attached to a black square marker at a data point you interactively select.'window' displays data cursor information for the data point you interactively select in a floating window within the figure.
SnapToDataVertex: specifies whether the data cursor snaps to the nearest data value or is located at the actual pointer position. Possible values areSnapToDataVertex='on' (default) andSnapToDataVertex='off'.
LineColor: controls the color of the trajectory selected with the mouse. It can be an RGB triplet of values between 0 and 1, or character vector indicating a color name. Note that a RGB vector can be conveniently chosen with our MATLAB class FSColor, see documentation.
SubsetLinesColor: enables to control the color of the trajectories of the units that are in the subset at a given step of the search (iflevfwdplot() is applied to an object of classfsreda.object) or have a weight greater than 0.9 (iflevfwdplot() is applied to an object of classsregeda.object ormmregeda.object). This can be done (repeatedly) with a left mouse click in proximity of the step of interest. A right mouse click will terminate the selection by marking with a up-arrow the step corresponding to the highlighted lines. The highlighted lines by default are in red, but a different color can be specified as RGB triplet or character of color name. Note that a RGB vector can be conveniently chosen with our MATLAB class FSColor, see documentation. By defaultSubsetLinesColor="", i.e. the modality is not active.Any initialization forSubsetLinesColor which cannot be interpreted as RGB vector will be converted to blue, i.e.SubsetLinesColor will be forced to be [0 0 1].IfSubsetLinesColor is not empty the previous optionLineColor is ignored.

label

Character vector containing the labels of the units (optional argument used whendatatooltip=TRUE. If this field is not present labels row1, ..., rown will be automatically created and included in the pop up datatooltip window).

nameX

namey

Add response label. A string containing the label of the response

msg

Controls whether to display or not messages on the screen Ifmsg==1 (default) messages are displayed on the screen about step in which signal took place else no message is displayed on the screen.

databrush

interactive mouse brushing. If databrush is missing or empty (default), no brushing is done.The activation of this option (databrush is a scalar or a list) enables the user to select a set of trajectories in the current plot and to see them highlighted in the y|X plot, i.e. a matrix of scatter plots of y against each column of X, grouped according to the selection(s) done by brushing. If the plot y|X does not exist it is automatically created.In addition, brushed units are automatically highlighted in the minimum deletion residual plot if it is already open. The extension to the following plots will be available in future versions of the toolbox:

monitoring leverage plot;
maximum studentized residual;
s^2 and R^2;
Cook distance and modified Cook distance;
deletion t statistics.

Note that the window style of the other figures is set equal to that which contains the monitoring residual plot. In other words, if the monitoring residual plot is docked all the other figures will be docked too

Ifdatabrush=TRUE the default selection tool is a rectangular brush and it is possible to brush only once (that is persist=”).

Ifdatabrush=list(...), it is possible to use all optional arguments of functionselectdataFS() and the following optional argument:

persist. Persist is an empty value or a character containing 'on' or 'off'. The default value ispersist="", that is brushing is allowed only once. Ifpersist="on" orpersis="off" brushing can be done as many time as the user requires. Ifpersist='on' then the unit(s) currently brushed are added to those previously brushed. It is possible, every time a new brushing is done, to use a different color for the brushed units. Ifpersist='off' every time a new brush is performed units previously brushed are removed.
bivarfit. Wheather to superimpose bivariate least square lines on the plot (ifplot=TRUE.This option adds one or more least squares lines, based on SIMPLE REGRESSION of y on Xi, to the plots of y|Xi. The default isbivarfit=FALSE: no line is fitted. Ifbivarfit=1, a single OLS line is fitted to all points of each bivariate plot in the scatter matrix y|X. Ifbivarfit=2, two OLS lines are fitted: one to all points and another to the group of the genuine observations. The group of the potential outliers is not fitted. Ifbivarfit=0 one OLS line is fitted to each group. This is useful for the purpose of fitting mixtures of regression lines. Ifbivarfit='i1' orbivarfit='i2', etc. an OLS line is fitted to a specific group, the one with index 'i' equal to 1, 2, 3 etc. Again, useful in case of mixtures.
multivarfit. Wheather to superimpose multivariate least square lines. This option adds one or more least square lines, based on MULTIVARIATE REGRESSION of y on X, to the plots of y|Xi. The default ismultivarfit=FALSE: no line is fitted. Ifbivarfit=1, a single OLS line is fitted to all points of each bivariate plot in the scatter matrix y|X. The line added to the scatter plot y|Xi is avconst + Ci*Xi, where Ci is the coefficient of Xi in the multivariate regression and avconst is the effect of all the other explanatory variables different from Xi evaluated at their centroid (that is overline(y)'C)). Ifmultivarfit=2, same action as withmultivarfit=1 but this time we also add the line based on the group of unselected observations (i.e. the normal units).
labeladd. Add outlier labels in plot. Iflabeladd=TRUE, we label the outliers with the unit row index in matrices X and y. The default value islabeladd=FALSE, i.e. no label is added.

standard

(MATLAB-style arguments) appearance of the plot in terms of xlim, ylim, axes labels and their font size style, color of the lines, etc.

fground

MATLAB-style arguments for the fground trajectories in foregroud.

bground

MATLAB-style arguments for the fground trajectories in backgroud.

...

potential further arguments passed to lower level functions.

Details

No details

Value

No value returned

Author(s)

FSDA team

Examples

## Not run: n <- 100y <- rnorm(n)X <- matrix(rnorm(n*4), nrow=n)out <- fsreg(y~X, method="LTS")out <- fsreg(y~X, method="FS", bsb=out$bs, monitoring=TRUE)levfwdplot(out)## End(Not run)

Loyalty data

Description

The loyalty data consist of 509 observations on the behaviour of customers with loyalty cards from a supermarket chain in Northern Italy. The responsey is the amount in euros spent at the shop over six months and the explanatory variables are: X1, the number of visits to the supermarket in the six month period; X2, the age of the customer; X3, the number of members of the customers' family. To find out more about this data set please see Atkinson and Riani (2006), JCGS

Usage

data("loyalty")

Format

A data frame with 509 observations on the following 4 variables.

visits: the number of visits to the supermarket in the six month period
age: the age of the customer
family: the number of members of the customers' family
amount_spent: the amount in euros spent at the shop over six months

Details

To find out more about this data set please see Atkinson and Riani (2006), JCGS

Source

The data are themselves a random sample from a larger database. The sample of 509observations is available athttp://www.riani.it/trimmed/.

References

Atkinson, A. and Riani, M (2006) Distribution Theory and Simulations for Tests of Outliers in Regression,Journal of Computational and Graphical Statistics,15 2, pp 460–476.

Examples

data(loyalty)

Plots the trajectories of scaled Mahalanobis distances along the search

Description

Plots the trajectories of scaled Mahalanobis distances along the forward search

Usage

malfwdplot(  out,  xlim,  ylim,  xlab,  ylab,  main,  lwd,  lty,  col,  cex.lab,  cex.axis,  subsize,  fg.thresh,  fg.unit,  fg.labstep,  fg.lwd,  fg.lty,  fg.col,  fg.mark,  fg.cex,  bg.thresh,  bg.style,  standard,  fground,  bground,  tag,  datatooltip,  label,  nameX,  databrush,  conflev,  trace = FALSE,  ...)

Arguments

out

An object of S3 classfsmeda.object returned byfsmult withmonitoring=TRUE -a list containing the monitoring of minimum Mahalanobis distance.

The needed elements ofout are

MAL: matrix containing the squared Mahalanobis distances monitored in eachstep of the forward search. Every row is associated with a unit (row of data matrix X).
Un: matrix containing the order of entry of each unit(necessary if datatooltip or databrush is selected).
X: The data matrix.

xlim

Controls thex scale in the plot.xlim is a vector with two elements controllingminimum and maximum on thex-axis. Default is to use automatic scale.

ylim

Controls they scale in the plot.ylim is a vector with two elements controllingminimum and maximum on they-axis. Default is to use automatic scale.

xlab

A title for the x axis

ylab

A title for the y axis, deafult is "Squared Mahalanobis distances".

main

An overall title for the plot

lwd

The line width, a positive number, defaulting to 1

lty

col

Colors to be used for the highlighted units

cex.lab

The magnification to be used for x and y labels relative to the current setting of cex

cex.axis

The magnification to be used for axis annotation relative to the current setting of cex

subsize

Numeric vector containing the subset size with length equal to the number of columns ofmatrix of mahalanobis distances. The default value of subsize is(nrow(MAL) - ncol(MAL) + 1):nrow(MAL)

fg.thresh

fg.unit

(alternative to fg.thresh), vector containing the list of the units to be highlighted.Iffg.unit is supplied,fg.thresh is ignored.

fg.labstep

fg.lwd

The line width for the highlighted trajectories (units). Default is 1.

fg.lty

fg.col

colors to be used for the highlighted units.

fg.mark

Controlls whether to plot highlighted trajectories as symbols.iffg.mark==TRUE each line is plotted using a differentsymbol else no marker is used (default).

fg.cex

Controls the font size of the labels of the trajectories in foreground. Iffg.cex=0 no labels will be shown - equivalent tofg.labstop="".

bg.thresh

Numeric vector of length 1 or 2 which specifies how to define theunimmportant trajectories.Unimmportant trajectories will be plotted using a colormap, in greysh or will be hidden.Iflength(bg.thresh) == 1 the irrelevant units are those which alwayshad a mahalanobis distance smaller thanbg.thresh.Iflength(bg.thresh) == 2 the irrelevant units are those which alwayshad a mahalanobis distance greater thanbg.thresh[1] and smaller thanbg.thresh[2].The default isbg.thresh=2.5 ifn > 100 andbg.thresh=-Inf ifn <= 100i.e. to treat all trajectories as important ifn <= 100 and, ifn > 100,to reduce emphasis only to trajectories having in all steps of the search a valueof mahalanobis distance smaller than 2.5.

bg.style

Specifies how to plot the unimportant trajectories as defined in option bg.thresh.

bg.style="faint": unimportant trajectories are plotted using a colormap.
bg.style="hide": unimportant trajectories are hidden.
bg.style="greyish": unimportant trajectories are displayed in a faint grey.

Whenn > 100 the default option isbg.style='faint'. Whenn <= 100andbg.thresh == -Inf option bg.style is ignored.Remark: bground=” is equivalent to -Inf that is all trajectories are considered relevant.

standard

MATLAB-style arguments - appearance of the plot in terms of xlim, ylim, axes labelsand their font size style, color of the lines, etc.

fground

MATLAB-style arguments - for the trajectories in foregroud.

bground

MATLAB-style arguments - for the trajectories in backgroud.

tag

Plot handle. String which identifies the handle of the plot which is about to be created.The default is to use tag 'pl_resfwd'. Notice that if the program finds a plot which hasa tag equal to the one specified by the user, then the output of the new plot overwritesthe existing one in the same window else a new window is created.

datatooltip

Interactive clicking. It is inactive if this parameter is set to FALSE.The default isdatatooltip=TRUE, the user can select with the mouse an individualmahalanobis distance trajectory in order to have information about the corresponding unit, the associatedlabel and the step of the search in which the unit enters the subset.If datatooltip is a list it may contain the following fields:

DisplayStyle determines how the data cursor displays. Possible values are'datatip' and'window' (default).'datatip' displaysdata cursor information in a small yellow text box attached to a blacksquare marker at a data point you interactively select.'window'displays data cursor information for the data point you interactivelyselect in a floating window within the figure.
SnapToDataVertex: specifies whether the data cursor snaps to the nearest data value oris located at the actual pointer position.Possible values areSnapToDataVertex='on' (default) andSnapToDataVertex='off'.
LineColor: controls the color of the trajectory selected with the mouse. It can be an RGB tripletof values between 0 and 1, or character vector indicating a color name. Note that a RGB vectorcan be conveniently chosen with our MATLAB class FSColor, see documentation.
SubsetLinesColor: enables to control the color of the trajectories of the units that arein the subset at a given step of the search (ifresfwdplot() is applied to anobject of classfsreda.object) or have a weight greater than 0.9 (ifresfwdplot() is applied to an object of classsregeda.objectormmregeda.object). This can be done (repeatedly) with a left mouseclick in proximity of the step of interest. A right mouse click will terminate theselection by marking with a up-arrow the step corresponding to the highlightedlines. The highlighted lines by default are in red, but a different color can bespecified as RGB triplet or character of color name. Note that a RGB vector canbe conveniently chosen with our MATLAB class FSColor, see documentation.By defaultSubsetLinesColor="", i.e. the modality is not active.Any initialization forSubsetLinesColor which cannot be interpreted asRGB vector will be converted to blue, i.e.SubsetLinesColor will be forced to be [0 0 1].IfSubsetLinesColor is not empty the previous optionLineColor is ignored.

label

Character vector containing the labels of the units (optional argument used whendatatooltip=TRUE. If this field is not present labelsrow1, ..., rown will be automatically created and includedin the pop up datatooltip window).

nameX

Add variable labels in plot. A vector of strings of lengthpcontaining the labels of the variables in the dataset.If it is empty (default) the sequenceX1, ..., Xp will be created automatically

databrush

Interactive mouse brushing. If databrush is missing or empty (default), no brushing is done.The activation of this option (databrush isTRUE or a list) enables the user to selecta set of trajectories in the current plot and to see them highlighted in the scatterplot matrix.If the scatterplot matrix does not exist it is automatically created.In addition, brushed units can be highlighted in the monitoring mahalanobis distance plot.Note that the window style of the other figures is set equal to that which contains the monitoring mahalanobis distance plot.In other words, if the monitoring mahalanobis distance plot is docked all the other figures will be docked too.

Ifdatabrush=TRUE the default selection tool is a rectangular brush and it ispossible to brush only once (that is persist=”).

Ifdatabrush=list(...), it is possible to use all optional arguments ofthe MATLAB functionselectdataFS() and the following optional arguments:

persist. Persist is an empty value or a character containing 'on' or 'off'.The default value ispersist="", that is brushing is allowed only once.Ifpersist="on" orpersis="off" brushing can be done as many time asthe user requires. Ifpersist='on' then the unit(s) currently brushed areadded to those previously brushed. It is possible, every time a new brushing isdone, to use a different color for the brushed units. Ifpersist='off'every time a new brush is performed units previously brushed are removed.
label: add labels of brushed units in the monitoring plot.
labeladd: add labels of brushed units in the scatterplot matrix.If this option is '1', we label the units of the last selected group withthe unit row index in the matrix X. The default value is labeladd=”,i.e. no label is added.

conflev

confidence interval for the horizontal bands. It can be a vector ofdifferent confidence level values, e.g. c(0.95, 0.99, 0.999).The confidence interval is based on the $chi^2$ distribution.

trace

Whether to print intermediate results. Default istrace=FALSE.

...

potential further arguments passed to lower level functions.

Value

none

Author(s)

FSDA team,valentin.todorov@chello.at

References

Atkinson A.C., Riani M. and Cerioli A. (2004), Exploring Multivariate Data with the Forward Search, Springer Verlag, New York.

Examples

 ## Not run:  ## Produce monitoring MD plot with all the default options. ##  Generate input structure for malfwdplot n <- 100 p <- 4 Y <- matrix(rnorm(n*p), ncol=p) Y[1:10,] <- Y[1:10,] + 4 out <- fsmult(Y, monitoring=TRUE, init=30) ##  Produce monitoring MD plot with all the default options malfwdplot(out) ## End(Not run)

Plots the trajectory of minimum Mahalanobis distance (mmd)

Description

Plots the trajectory of minimum Mahalanobis distance (mmd)

Usage

malindexplot(  out,  p,  xlab,  ylab,  main,  nameX,  conflev,  numlab,  tag,  trace = FALSE,  ...)

Arguments

out

a numeric vector or an object of S3 class (one offsmult.object,smult.object ormmmult.object) returned byone of the functionsfsmult orsmult ormmmult -a list containing the monitoring of minimum Mahalanobis distance

p

Ifout is a vector, p is the number of variables of theoriginal data matrix which have been used to compute md.

xlab

A title for the x axis

ylab

A title for the y axis

main

An overall title for the plot

nameX

Add variable labels in the plot. A vector of strings of lengthpcontaining the labels of the variables of the original data matrixX.If it is empty (default) the sequenceX1, ..., Xp will be created automatically

conflev

confidence interval for the horizontal bands. It can be a vector ofdifferent confidence level values, e.g. c(0.95, 0.99, 0.999).The confidence interval is based on the chi^2 distribution.

numlab

Number of points to be labeled in the plot. Ifnumlab is asingle number, e.g.numlab]k, the units with thek largestmd are labelled in the plots. Ifnumlab is a vector, the units indexedby the vector are labelled in the plot. Default isnumlab=5, i.e. the 5 unitsunits with the largest md are labelled. Usenumlab="" for no labelling.

tag

Tag of the figure which will host the malindexplot. The default tag istag="pl_malindex".

trace

Whether to print intermediate results. Default istrace=FALSE.

...

potential further arguments passed to lower level functions.

Value

none

Author(s)

FSDA team,valentin.todorov@chello.at

References

Atkinson and Riani (2000), Robust Diagnostic Regression Analysis, Springer Verlag, New York.

Examples

 ## Not run:  ##  Mahalanobis distance plot of 100 random numbers. ##  Numbers are from from the chisq with 5 degrees of freedom malindexplot(rchisq(100, 5), 5) ## End(Not run)

Plots the trajectory of minimum deletion residual (mdr)

Description

Plots the trajectory of minimum deletion residual (mdr).

Usage

    mdrplot(out, quant = c(0.01, 0.5, 0.99), sign = TRUE,         mplus1 = FALSE, envm,         xlim, ylim, xlab, ylab, main,         lwdenv, lwd, cex.lab, cex.axis,         tag, datatooltip, label, nameX, namey, databrush,         ...)

Arguments

out

An object returned by FSReda() (seeFSReda_control).

The needed elements ofout are

mdr: Minimum deletion residual. A matrix containing the monitoring of minimum deletion residual in each step of the forward search. The first column of mdr must contain the fwd search index.
Un: (for FSR only) - matrix containing the order of entry in the subset of each unit (required only when datatooltip is true or databrush is not empty).
y: a vector containing the response (required only when option databrush is requested).
X: a matrix containing the explanatory variables (required only when option databrush is requested).
Bols: (n-init+1) x (p+1) matrix containing the estimated beta coefficients monitored in each step of the robust procedure (required only when option databrush is requested and suboption multivarfit is requested).

quant

Quantiles for which envelopes have to be computed. The default is to produce 1%, 50% and 99% envelopes. In other words the default isquant=c(0.01, 0.5, 0.99).

sign

Wheather to use MDR with sign: ifsign=TRUE (default) we distinguish steps for which minimum deletion residual was associated with positive or negative value of the residual. Steps associated with positive values of mdr are plotted in black, while other steps are plotted in red.

mplus1

Wheather to plot the (m+1)-th order statistic. Specifies if it is necessary to plot the curve associated with (m+1)-th order statistic.

envm

Sample size for drawing enevlopes. Specifies the size of the sample which is used to superimpose the envelope. The default is to add an envelope based on all the observations (size n envelope).

ylim

Controly scale in plot. Vector with two elements controlling minimum and maximum on the y axis. Default is to use automatic scale.

xlim

Controlx scale in plot. Vector with two elements controlling minimum and maximum on the x axis. Default is to use automatic scale.

xlab

a title for the x axis

ylab

a title for the y axis

main

an overall title for the plot

lwdenv

Controls the width of the lines associated with the envelopes, default islvdenv=1.

lwd

Controls the linewidth of the curve which contains the monitoring of minimum deletion residual.

cex.lab

The magnification to be used for x and y labels relative to the current setting of cex

cex.axis

The magnification to be used for axis annotation relative to the current setting of cex

tag

Plot handle. String which identifies the handle of the plot which is about to be created. The default is to use tag 'pl_mdr'. Notice that if the program finds a plot which has a tag equal to the one specified by the user, then the output of the new plot overwrites the existing one in the same window else a new window is created.

datatooltip

label

nameX

namey

Add response label. A string containing the label of the response

databrush

monitoring leverage plot;
maximum studentized residual;
s^2 and R^2;
Cook distance and modified Cook distance;
deletion t statistics.

Ifdatabrush=TRUE the default selection tool is a rectangular brush and it is possible to brush only once (that is persist=”).

Ifdatabrush=list(...), it is possible to use all optional arguments of functionselectdataFS() and the following optional argument:

persist. Persist is an empty value or a character containing 'on' or 'off'. The default value ispersist="", that is brushing is allowed only once. Ifpersist="on" orpersis="off" brushing can be done as many time as the user requires. Ifpersist='on' then the unit(s) currently brushed are added to those previously brushed. It is possible, every time a new brushing is done, to use a different color for the brushed units. Ifpersist='off' every time a new brush is performed units previously brushed are removed.
bivarfit. Wheather to superimpose bivariate least square lines on the plot (ifplot=TRUE.This option adds one or more least squares lines, based on SIMPLE REGRESSION of y on Xi, to the plots of y|Xi. The default isbivarfit=FALSE: no line is fitted. Ifbivarfit=1, a single OLS line is fitted to all points of each bivariate plot in the scatter matrix y|X. Ifbivarfit=2, two OLS lines are fitted: one to all points and another to the group of the genuine observations. The group of the potential outliers is not fitted. Ifbivarfit=0 one OLS line is fitted to each group. This is useful for the purpose of fitting mixtures of regression lines. Ifbivarfit='i1' orbivarfit='i2', etc. an OLS line is fitted to a specific group, the one with index 'i' equal to 1, 2, 3 etc. Again, useful in case of mixtures.
multivarfit. Wheather to superimpose multivariate least square lines. This option adds one or more least square lines, based on MULTIVARIATE REGRESSION of y on X, to the plots of y|Xi. The default ismultivarfit=FALSE: no line is fitted. Ifbivarfit=1, a single OLS line is fitted to all points of each bivariate plot in the scatter matrix y|X. The line added to the scatter plot y|Xi is avconst + Ci*Xi, where Ci is the coefficient of Xi in the multivariate regression and avconst is the effect of all the other explanatory variables different from Xi evaluated at their centroid (that is overline(y)'C)). Ifmultivarfit=2, same action as withmultivarfit=1 but this time we also add the line based on the group of unselected observations (i.e. the normal units).
labeladd. Add outlier labels in plot. Iflabeladd=TRUE, we label the outliers with the unit row index in matrices X and y. The default value islabeladd=FALSE, i.e. no label is added.

...

potential further arguments passed to lower level functions.

Details

No details

Value

No value returned

Author(s)

FSDA team

Examples

## Not run: n <- 100y <- rnorm(n)X <- matrix(rnorm(n*4), nrow=n)out <- fsreg(y~X, method="LTS")out <- fsreg(y~X, method="FS", bsb=out$bs, monitoring=TRUE)mdrplot(out)## End(Not run)

Plots the trajectory of minimum Mahalanobis distance (mmd)

Description

Plots the trajectory of minimum Mahalanobis distance (mmd)

Usage

mmdplot(  out,  quant = c(0.01, 0.5, 0.99),  mplus1 = FALSE,  envm,  lwd,  lwdenv,  xlim,  ylim,  tag,  datatooltip,  label,  xlab,  ylab,  main,  nameX,  cex.lab,  cex.axis,  databrush,  trace = FALSE,  ...)

Arguments

out

An object of S3 classfsmeda.object returned byfsmult withmonitoring=TRUE -a list containing the monitoring of minimum Mahalanobis distance

quant

Quantiles for which envelopes have to be computed.The default is to produce 1%, 50% and 99% envelopes. In otherwords the default isquant=c(0.01, 0.5, 0.99).

mplus1

Wheather to plot the (m+1)-th order statistic.

envm

Sample size for drawing enevlopes. Specifies the size of the sample which isused to superimpose the envelope. The default is to add an envelope based onall the observations (sizen envelope).

lwd

Controls the line width of the curve which contains the monitoringof minimum deletion residual.

lwdenv

Controls the width of the lines associated with the envelopes. Default islwdenv=1

xlim

Control the x scale in plot. Vector with two elements controllingminimum and maximum on the x axis. Default is to use automatic scale.

ylim

Control the y scale in plot. Vector with two elements controllingminimum and maximum on the y axis. Default is to use automatic scale.

tag

datatooltip

label

Row labels. Character vector containing the labels of the units (optional argument usedwhendatatooltip=TRUE. If this field is not present labelsrow1, ..., rown will be automatically created and included in the pop up datatooltip window).

xlab

A title for the x axis

ylab

A title for the y axis

main

An overall title for the plot

nameX

cex.lab

The magnification to be used for x and y labels relative to the current setting of cex

cex.axis

The magnification to be used for axis annotation relative to the current setting of cex

databrush

Interactive mouse brushing. If databrush is missing or empty (default), no brushing is done.The activation of this option (databrush is TRUE or a list) enables the user to selecta set of trajectories in the current plot and to see them highlighted in the scatterplot matrix.If the scatterplot matrix does not exist it is automatically created.In addition, brushed units can be highlighted in the monitoring MD plot. Note that the windowstyle of the other figures is set equal to that which contains the monitoring residual plot.In other words, if the monitoring residual plot is docked all the other figures will be docked too.

Ifdatabrush=TRUE the default selection tool is a rectangular brush and it ispossible to brush only once (that is persist=”).

Note that the window style of the other figures is set equal to that which contains themonitoring residual plot. In other words, if the monitoring residual plot is docked allthe other figures will be docked too

Ifdatabrush=TRUE the default selection tool is a rectangular brush andit is possible to brush only once (that is persist=”).

Ifdatabrush=list(...), it is possible to use all optional arguments ofthe MATLAB functionselectdataFS() and the following optional arguments:

persist: This option can be an empty value or a character containing 'on' or 'off'.The default value ispersist="", that is brushing is allowed only once.Ifpersist="on" orpersis="off" brushing can be done as many time asthe user requires. Ifpersist='on' then the unit(s) currently brushed areadded to those previously brushed. It is possible, every time a new brushing isdone, to use a different color for the brushed units. Ifpersist='off'every time a new brush is performed units previously brushed are removed.
labeladd: add labels of brushed units in the scatterplot matrix.If this option is '1', we label the units of the last selected group withthe unit row index in the matrix X. The default value is labeladd=”,i.e. no label is added.

trace

Whether to print intermediate results. Default istrace=FALSE.

...

potential further arguments passed to lower level functions.

Value

none

Author(s)

FSDA team,valentin.todorov@chello.at

References

Atkinson and Riani (2000), Robust Diagnostic Regression Analysis, Springer Verlag, New York.

Examples

 ## Not run:  data(hbk, package="robustbase") (out <- fsmult(hbk[,1:3], monitoring=TRUE)) mmdplot(out) ## End(Not run)

Plots the trajectories of minimum Mahalanobis distances from different starting points

Description

Plots the trajectories of minimum Mahalanobis distances from different starting points

Usage

mmdrsplot(  out,  quant = c(0.01, 0.5, 0.99),  envm,  lwd,  lwdenv,  xlim,  ylim,  tag,  datatooltip,  label,  xlab,  ylab,  envlab = TRUE,  main,  nameX,  cex.lab,  cex.axis,  databrush,  scaled = FALSE,  trace = FALSE,  ...)

Arguments

out

An object of S3 classfsmmmdrs.object returned byfsmmmdrs -a list containing the following elements:

mmdrs = a matrix of size (n-ninit)-by-(nsimul+1) containing the monitoringof minimum Mahalanobis distance in each step of the forward search for eachof the nsimul random starts. The first column of mmdrs must contain the forward search index.This matrix can be created using functionfsmmmdrs.
BBrs = 3D array of size n-by-n-(init)-by-nsimul containingunits forming subset for rach random start. This field is necessaryif datatooltip is true or databrush is not empty.
X = n-by-v matrix containing the original data matrix. Thisfield is necessary if datatooltip is true or databrush is not empty.

quant

Quantiles for which envelopes have to be computed.The default is to produce 1%, 50% and 99% envelopes. In otherwords the default isquant=c(0.01, 0.5, 0.99).

envm

Sample size for drawing enevlopes. Specifies the size of the sample which isused to superimpose the envelope. The default is to add an envelope based onall the observations (sizen envelope).

lwd

Controls the linewidth of the curve which contains the monitoringof minimum deletion residual.

lwdenv

line width: a scalar which controls the width of the lines associatedwith the envelopes. Default islwdenv=1

xlim

Control the x scale in plot. Vector with two elements controllingminimum and maximum on the x axis. Default is to use automatic scale.

ylim

Control the y scale in plot. Vector with two elements controllingminimum and maximum on the y axis. Default is to use automatic scale.

tag

Plot handle. String which identifies the handle of the plot which is about to be created.The default istag='pl_mmdrs'. Notice that if the program finds a plot which hasa tag equal to the one specified by the user, then the output of the new plot overwritesthe existing one in the same window else a new window is created.

datatooltip

label

xlab

A title for the x axis

ylab

A title for the y axis

envlab

wheather to label the envelopes. Ifenvlab is true (default)labels of the confidence envelopes which are used are added on the y axis.

main

An overall title for the plot

nameX

cex.lab

The magnification to be used for x and y labels relative to the current setting of cex

cex.axis

The magnification to be used for axis annotation relative to the current setting of cex

databrush

Ifdatabrush=TRUE the default selection tool is a rectangular brush and it ispossible to brush only once (that is persist=”).

Note that the window style of the other figures is set equal to that which contains themonitoring residual plot. In other words, if the monitoring residual plot is docked allthe other figures will be docked too

Ifdatabrush=TRUE the default selection tool is a rectangular brush andit is possible to brush only once (that is persist=”).

Ifdatabrush=list(...), it is possible to use all optional arguments ofthe MATLAB functionselectdataFS() and the following optional arguments:

persist. Persist is an empty value or a character containing 'on' or 'off'.The default value ispersist="", that is brushing is allowed only once.Ifpersist="on" orpersis="off" brushing can be done as many time asthe user requires. Ifpersist='on' then the unit(s) currently brushed areadded to those previously brushed. It is possible, every time a new brushing isdone, to use a different color for the brushed units. Ifpersist='off'every time a new brush is performed units previously brushed are removed.

scaled

Wheather to use scaled or unscaled envelopes. Ifscaled=TRUEthe envelopes are produced for scaled Mahalanobis distances (no consistency factoris applied) else the traditional consistency factor is applied.Default isscaled=FALSE

trace

Whether to print intermediate results. Default istrace=FALSE.

...

potential further arguments passed to lower level functions.

Value

none

Author(s)

FSDA team,valentin.todorov@chello.at

References

Atkinson, A.C., Riani, M. and Cerioli, A. (2004), 'Exploring multivariate data with the forward search, Springer Verlag, New York.

Examples

 ## Not run:  data(hbk, package="robustbase") out <- fsmmmdrs(hbk[,1:3]) mmdrsplot(out) ## End(Not run)

Computes MM estimators in multivariate analysis with auxiliary S-scale

Description

Computes MM estimators in multivariate analysis with auxiliary S-scale

Usage

mmmult(  x,  monitoring = FALSE,  plot = FALSE,  eff,  conflev = 0.975,  nocheck = FALSE,  trace = FALSE,  ...)

Arguments

x

An n x p data matrix (n observations and p variables).Rows of x represent observations, and columns represent variables.

Missing values (NA's) and infinite values (Inf's) are allowed,since observations (rows) with missing or infinite values willautomatically be excluded from the computations.

monitoring

Wheather to perform monitoring of Mahalanobis distances and other specific quantities

plot

Plots the Mahalanobis distances against index number. Ifplot=FALSE(default) orplot=0 no plot is produced. The confidencelevel used to draw the confidence bands for the MD is given by the input option conflev.If conflev is not specified a nominal 0.975 confidence interval will be used.Ifplot=2 a scatter plot matrix with the outliers highlighted is produced.If plot is a list it may contain the following fields:

labeladd Iflabeladd=1, the outliers in the spm are labelled with the unitrow index. The default value islabeladd="", i.e. no label is added
nameY character vector containing the labels of the variables. As default value,the labels which are added are Y1, ...Yp.

eff

Defining the nominal efficiency (i.e. a number between 0.5 and 0.99). The default value iseff=0.95.

conflev

Confidence level which is used to declare units as outliers (scalar).Usuallyconflev=0.95,conflev=0.975 orconflev=0.99 (individual alpha)conflev=1-0.05/n,conflev=1-0.025/n orconflev=1-0.01/n (simultaneous alpha).Default value isconvlev=0.975.

nocheck

It controls whether to perform checks on matrix Y. Ifnocheck=TRUE, no check is performed.

trace

Whether to print intermediate results. Default istrace=FALSE.

...

potential further arguments passed to lower level functions.

Details

This function follows the lines of MATLAB/R code developed during the years by many authors.For more details see http://www.econ.kuleuven.be/public/NDBAE06/programs/ andthe R packageCovMMestThe core of these routines, e.g. the resampling approach, however, has beencompletely redesigned, with considerable increase of the computational performance.

Value

Depending on the input parametermonitoring, one ofthe following objects will be returned:

Author(s)

FSDA team,valentin.todorov@chello.at

References

Maronna, R.A., Martin D. and Yohai V.J. (2006), Robust Statistics, Theory and Methods, Wiley, New York.

Examples

 ## Not run:  data(hbk, package="robustbase") (out <- mmmult(hbk[,1:3])) class(out) summary(out) ##  Generate contaminated data (200,3) n <- 200 p <- 3 set.seed(123456) X <- matrix(rnorm(n*p), nrow=n) Xcont <- X Xcont[1:5, ] <- Xcont[1:5,] + 3 out1 <- mmmult(Xcont, trace=TRUE)           # no plots (plot defaults to FALSE) names(out1) ## plot=TRUE - generates: (1) a plot of Mahalanobis distances against ##    index number. The confidence level used to draw the confidence bands for ##    the MD is given by the input option conflev. If conflev is ##    not specified a nominal 0.975 confidence interval will be used and ##    (2) a scatter plot matrix with the outliers highlighted. (out1 <- mmmult(Xcont, trace=TRUE, plot=TRUE)) ## plots is a list: the spm shows the labels of the outliers. (out1 <- mmmult(Xcont, trace=TRUE, plot=list(labeladd="1"))) ## plots is a list: the spm uses the variable names provided by 'nameY'. (out1 <- mmmult(Xcont, trace=TRUE, plot=list(nameY=c("A", "B", "C")))) ## mmmult() with monitoring (out2 <- mmmult(Xcont, monitoring=TRUE, trace=TRUE)) names(out2) ##  Forgery Swiss banknotes examples. data(swissbanknotes) (out1 <- mmmult(swissbanknotes[101:200,], plot=TRUE)) (out1 <- mmmult(swissbanknotes[101:200,], plot=list(labeladd="1"))) ## End(Not run)

Description of`mmmult.object` Objects

Description

An object of classmmmult.object holds information about the result of a call tommmult.

Value

The object itself is basically alist with the following components:

loc

p-by-1 vector containing MM estimate of location.

shape

p-by-p matrix with MM estimate of the shape matrix.

cov

matrix with MM estimate of the covariance matrix.Remark:covariance = auxscale^2 * shape.

weights

A vector containing the estimates of the weights.

outliers

A vector containing the list of the units declared as outliers using confidence level specified in input scalarconflev.

Sloc

A vector with S estimate of location.

Sshape

A matrix with S estimate of the shape matrix.

Scov

A matrix with S estimate of the covariance matrix.

auxscale

S estimate of the scale.

md

n-by-1 vector containing the estimates of the robust Mahalanobis distances (in squared units).

conflev

Confidence level that was used to declare outliers.

X

the data matrix X

The object has class"mmmult".

Examples

    ## Not run:     data(hbk, package="robustbase")    (out <- mmmult(hbk[,1:3]))    class(out)    summary(out)    ## End(Not run)

Description of`mmmulteda.object` Objects

Description

An object of classmmmulteda.object holds information about the result of a call tommmult withmonitoring=TRUE.

Value

The object itself is basically alist with the following components:

Loc

length(eff)-by-p matrix containing MM estimate of location for each value ofeff.

Shape

p-by-p-by-length(eff) 3D array containing robust estimate of the shape for each value of eff. Remark: det|shape|=1.

Scale

length(eff) vector containing robust estimate of the scalefor each value of eff.

Cov

p-by-p-by-length(eff) 3D array containing robust estimate of covariance matrix for each value ofeff.Note thatscale(i)^2 * shape[,,i] = robust estimate of covariance matrix.

Bs

(p+1)-by-length(eff) matrix containing the units forming best subset for each value of eff.

MAL

n-by-length(eff) matrix containing the estimates of the robust Mahalanobis distances (in squared units) for each value of eff.

Outliers

n-by-length(eff) matrix containing flags for the outliers. Boolean matrix containing the list of the units declared as outliers for each value of eff using confidence level specified in input scalarconflev

Weights

n x length(eff) matrix containing the weights for each value of eff.

conflev

Confidence level that was used to declare outliers.

singsub

Number of subsets without full rank. Notice thatsingsub > 0.1*(number of subsamples) produces a warning.

eff

vector which contains the values of eff which have been used.

X

the data matrix X.

The object has class"mmmulteda".

Examples

## Not run:     data(hbk, package="robustbase")    (out <- mmmult(hbk[,1:3], monitoring=TRUE))    class(out)    summary(out)## End(Not run)

Description of mmreg Objects

Description

An object of classmmreg.object holds information about the result of a call tofsreg withmethod="MM".

Value

The object itself is basically alist with the following components:

beta

p-by-1 vector containing the MM estimate of regression coefficients.

auxscale

scalar, S estimate of the scale (or supplied external estimate of scale, if option InitialEst is not empty).

residuals

residuals.

fittedvalues

fitted values.

weights

n x 1 vector. Weights assigned to each observation.

Sbeta

p x 1 vector containing S estimate of regression coefficients (or supplied initial external estimate of regression coefficients, if option InitialEst is not empty)

Ssingsub

Number of subsets without full rank in the S preliminary part. Notice that out.singsub > 0.1*(number of subsamples) produces a warning.

outliers

kx1 vector containing the list of the k units declared as outliers or NULL if the sample is homogeneous.

conflev

Confidence level which is used to declare units as outliers. Usuallyconflev=0.95, 0.975, 0.99 (individual alpha) orconflev=1-0.05/n, 1-0.025/n, 1-0.01/n (simultaneous alpha). Default value is 0.975

rhofunc

Specifies the rho function which has been used to weight the residuals. If a different rho function is specified for S and MM loop then insted ofrhofunc we will haverhofuncS andrhofuncMM.

rhofuncparam

Vector which contains the additional parameters for the specified rho function which has been used. For hyperbolic rho function the value of k =sup CVC. For Hampel rho function the parameters a, b and c. If a different rho function is specified for S and MM loop then insted ofrhofuncparam we will haverhofuncparamS andrhofuncparamMM.

X

the data matrix X

y

the response vector y

The object has class"mmreg".

Examples

## Not run:     data(hbk, package="robustbase")    (out <- fsreg(Y~., data=hbk, method="MM"))    class(out)    summary(out)## End(Not run)

Description of`mmregeda` Objects

Description

An object of classmmregeda.object holds information about the result of a call tofsreg whenmethod="MM" andmonitoring=TRUE.

Value

The object itself is basically alist with the following components:

auxscale

scalar, S estimate of the scale (or supplied external estimate of scale, if option InitialEst is not empty).

Beta

p x length(eff) matrix containing MM estimate of regression coefficients for each value of eff.

RES

n x length(eff) matrix containing the monitoring of scaled residuals for each value ofeff.

Weights

n x length(eff) matrix containing the estimates of the weights for each value of eff

Outliers

Boolean matrix containing the list of the units declared as outliers for each value of eff using confidence level specified in input scalar conflev.

conflev

Confidence level which is used to declare units as outliers. Remark: conflev will be used to draw the horizontal line (confidence band) in the plot.

Ssingsub

Number of subsets wihtout full rank. Notice that Notice that singsub > 0.1*(number of subsamples) produces a warning

rhofunc

string identifying the rho function which has been used.

rhofuncparam

vector which contains the additional parameters for the specified rho function which have been used. For hyperbolic rho function the value of k =sup CVC. For Hampel rho function the parameters a, b and c.

eff

vector containing the value of eff which have been used.

X

the data matrix X

y

the response vector y

The object has class"mmregeda".

Examples

## Not run:     data(hbk, package="robustbase")    (out <- fsreg(Y~., data=hbk, method="MM", monitoring=TRUE))    class(out)    summary(out)## End(Not run)

Multiple regression data showing the effect of masking (Atkinson and Riani, 2000).

Description

Usage

data(multiple_regression)

Format

A data frame with 60 rows and 4 variablesThe variables are as follows:

@referencesAtkinson, A. C., and Riani, M. (2000).Robust Diagnostic Regression Analysis. Springer-Verlag, New York.

Mussels data.

Description

These data, introduced by Cook and Weisberg (1994), consist of 82 observations on horse mussels fromNew Zeland. The variables are shell length, width, height, mass and muscle mass

Usage

data(mussels)

Format

A data frame with 82 rows and 5 variables

Set seed for the MATLAB random number generator

Description

Initializes the MATLAB random generator

Usage

myrng(seed)

Arguments

seed

a single value, interpreted as an integer

Value

Integer, the seed value with which the MATLAB random number generator was initialized.

Author(s)

FSDA team,valentin.todorov@chello.at

Examples

 ## Not run:    data(wool)   XX <- wool   y <- XX[, ncol(XX)]   X <- XX[, 1:(ncol(XX)-1), drop=FALSE]   seed <- myrng()                       #i nitialized the RNG and keep the seed   myrng(seed)                           # repeat the computations with the same seed   (out2 <- fsreg(X, y, method="LTS"))   all.equal(out1, out2) ## End(Not run)

Poison

Description

The poison data (by Box and Cox, 1964) are about the time to death of animals in a3 x 4factorial experiment with four observations at each factor combination. There are no outliersor influential observations that cannot be reconciled with the greater part of the data by asuitable transformation.

Usage

data(poison)

Format

A data frame with 48 rows and 7 variables: six explanatory and one response variable.

Source

G. E. P. Box and D. R. Cox (1964). An Analysis of Transformations,Journal of the Royal Statistical Society. Series B,262 pp. 211–252.

Examples

 data(poison) head(poison)

Finds the tuning constant(s) associated to the supplied breakdown point or asymptotic efficiency for different psi functions

Description

Finds the tuning constant(s) associated to the suppliedbreakdown point or asymptotic efficiency or computes breakdown point andefficiency associated with the supplied constant(s) for the following psi functions:TB=Tukey biweight, HA=Hampel, HU=Huber, HYP=Hyperbolic, OPT=Optimal, PD=mdpd.

Usage

psifun(  u = vector(mode = "double", length = 0),  p = 1,  fun = c("TB", "bisquare", "biweight", "HA", "hampel", "HU", "huber", "HYP",    "hyperbolic", "OPT", "optimal", "PD", "mdpd"),  bdp,  eff,  const,  param,  trace = FALSE)

Arguments

u

optional vector containing scaled residuals or Mahalanobisdistances for then units of the sample. If not provided,rho, psi, psider, psix and weights are not computed

p

number of variables (p=1 for regression)

fun

psi function class. One of TB, HA, HU, HP, OPT or PD.

bdp

requested breakdown point

eff

requested asymptotic efficiency

const

tuning constant c

param

additional parameters

trace

whether to print intermediate results. Default istrace=FALSE.

Value

A list will be returnedcontaining the following elements:

class: link function which has be used. Possible values are'bisquare', 'optimal', 'hyperbolic', 'hampel', 'huber' or 'mdpd'
bdp: breakdown point
eff: asymptotic efficiency
c1: consistency factor (and other parameters) associatedto required breakdown point or nominal efficiency.
kc1: Expectation of rho associated withc1 to get aconsistent estimator at the model distributionkc1 = E(rho) = sup(rho)*bdp
rho: vector of lengthn which contains the rhofunction associated to the residuals or Mahalanobis distancesfor then units of the sample. Empty ifu is not provided.
psi: vector of lengthn which contains the psifunction associated with the residuals or Mahalanobis distancesfor then units of the sample. Empty ifu is not provided.
psider: vector of lengthn which contains the derivative of thepsi function associated with the residuals or Mahalanobis distancesfor then units of the sample. Empty ifu is not provided.
psix: vector of lengthn which containsthe psi function mutiplied by uassociated with the residuals or Mahalanobis distancesfor then units of the sample. Empty ifu is not provided.
wei: vector of lengthn which contains the weightsassociated with the residuals or Mahalanobis distancesfor then units of the sample. Empty ifu is not provided.

Author(s)

FSDA team,valentin.todorov@chello.at

References

Hoaglin, D.C. and Mosteller, F. and Tukey, J.W. (1982),Understanding Robust and Exploratory Data Analysis, Wiley, New York.

Huber, P.J. (1981),Robust Statistics, Wiley.

Huber, P.J. and Ronchetti, E.M. (2009),Robust Statistics, 2nd Edition, Wiley.

Hampel, F.R. and Rousseeuw, P.J. and Ronchetti E. (1981), The Change-of-Variance Curve and Optimal Redescending M-Estimators,Journal of the American Statistical Association,76, pp. 643–648.

Maronna, R.A. and Martin D. and Yohai V.J. (2006),Robust Statistics, Theory and Methods, Wiley, New York.

Riani, M. and Atkinson, A. C. and Corbellini, A. and Perrotta, D. (2020)Robust regression with density power divergence: Theory, comparisons,and data analysis,Entropy22. doi:10.3390/e22040399.

Examples

 ## Not run:  ##  Find c for given bdp for the Tukey biweight function ##  The constant c associated to a breakdown point of ##  50 percent in regression is ##      c=1.547644980928226     psifun(bdp=0.5)     psifun(c=1.547644980928226) ##  Find c for given bdp for the Hampel function psifun(bdp=0.5, fun="hampel") ## Plot Huber rho function. x <- seq(-3, 3, 0.001) c <- 1.345; HUc1 <- psifun(u=x, p=1, fun="HU", const=c) rhoHU <- HUc1$rho plot(x, rhoHU, type="l", lty="solid", lwd=2, col="blue",     xlab="u", ylab="rho (u,1.345)", ylim=c(0.16, 4.5)) lines(x, x^2/2, type="l", lty="dotted", lwd=1.5, col="red") legend(-1, 4.6, legend=c("Huber rho function", "u^2/2"),     lty=c("solid", "dotted"), lwd=c(2,1.5), col=c("blue", "red")) yc <- 0.13; text(-c, yc, paste0("-c=", -c), adj=1) text(c,yc, paste0("c=",c), adj=0) segments(c, 0, c, c**2/2, col="red") segments(-c, 0, -c, c**2/2, col="red") points(c, c**2/2, col="red") points(-c, c**2/2, col="red") ## End(Not run)

Interactive scatterplot matrix for regression

Description

Produces an interactive scatterplot of the responceyagainst each variable of the predictor matrixX.

Usage

regspmplot(  y,  X,  group,  plot,  namey,  nameX,  col,  cex,  pch,  labeladd,  legend,  xlim,  ylim,  tag,  datatooltip,  databrush,  subsize,  selstep,  selunit,  trace = FALSE,  ...)

Arguments

y

responce variable or an object containing the responce,the predictors and possibly other variable resulting from monitoringof regression.

Ify is a vector, a data matrixX must be present as an argumentIfy is a list containing justy andX, the call is equivallent toregspmplot(y, X). Otherwisey must be an an object of S3 classfsreda.object returned byfsreg withmonitoring=TRUE - a list containingthe monitoring along a search

X

Predictor variables. Data matrix of explanatory variables (also called 'regressors') ofdimensionn byp if the argumenty is a vector. The rows ofXrepresent observations, and the columns represent variables.

group

grouping variable. Vector withn elements. Specifiesa grouping variable defined as a categorical variable (factor), numeric, or array of strings,or string matrix, and it must have the same number of rows asX.This grouping variable determines the marker and color assigned to each point.Remark: ifgroup is used to distinguish a set of outliers froma set of good units, the id number for the outliers should be the larger(see optional fieldlabeladd of parameterplo for details).

plot

This option controls the names which are displayed in the marginsof the scatterplot matrix as well as the labels of the legend.Ifplot=FALSE, then namey, nameX and labeladd are both set tothe empty string (default), and no label and no name is added to the plot.Ifplot=TRUE the names y, and X1,..., Xp are added to the margins ofthe the scatter plot matrix else nothing is shown.Ifplot is a list, it is possible to control not only the names but also,point labels, colors and symbols. More precisely listplot may contain the following elements:

labeladd - see parameterlabeladd
namey - a character string containing the response variable name. See parameternamey.
nameX - a vector of character strings containing the labels ofthe explanatory variables. As default value, the labels which areadded areY1, ..., Yp. See parameternameX.
clr - see parametercol
sym - see parameterpch
siz - see parametercex
doleg - see parameterlegend
xlimx - see parameterxlim
ylimy - see parameterylim

namey

a character string with the name of the responce variable

nameX

a vector of character strings with the names of the explanatory variables

col

color specification for the data point. Can be different for each group.By default, the order of the colors isblue,red,black,magenta,green,cyan andyelow.

cex

the size of the symbols used for plotting. By defaultcex=1the symbol size depends on the number of plots and the size of thefigure window. Values larger than 1 will increase the size andvalues smaller than 1 will decrease the size.

pch

specification of the symbols to use. For example, ifthere are three groups, andpch=c(1, 3, 4), the first group will beplotted with a circle, the second with a plus, and the third with a 'x' (see?pch or?points fora list of symbols. NOTE: not all symbols available in R can be mapped to the symbols in MATLAB.

labeladd

logical, controls wheather the elements belonging to the last groupin the scatterplot matrix are labelled with their unit row indexor their rowname. The rowname is taken from the parameterlabelor if it is missing, from the sequence1:n. The default value islabeladd=FALSE, i.e. no label is added.

legend

logical, controls where a legend is shown or not.

xlim

x limits. A vector with two elements controlling minimumand maximum on the x axis. By defaul automatic scale is used.

ylim

y limits. A vector with two elements controlling minimumand maximum on the y axis. By defaul automatic scale is used.

tag

datatooltip

databrush

Ifdatabrush=TRUE the default selection tool is a rectangular brush and it ispossible to brush only once (that is persist=”).

Note that the window style of the other figures is set equal to that which contains themonitoring residual plot. In other words, if the monitoring residual plot is docked allthe other figures will be docked too

Ifdatabrush=TRUE the default selection tool is a rectangular brush andit is possible to brush only once (that is persist=”).

Ifdatabrush=list(...), it is possible to use all optional arguments ofthe MATLAB functionselectdataFS() and the following optional arguments:

persist: This option can be an empty value or a character containing 'on' or 'off'.The default value ispersist="", that is brushing is allowed only once.Ifpersist="on" orpersis="off" brushing can be done as many time asthe user requires. Ifpersist='on' then the unit(s) currently brushed areadded to those previously brushed. It is possible, every time a new brushing isdone, to use a different color for the brushed units. Ifpersist='off'every time a new brush is performed units previously brushed are removed.
labeladd: add labels of brushed units in the scatterplot matrix.If this option is '1', we label the units of the last selected group withthe unit row index in the matrix X. The default value is labeladd=”,i.e. no label is added.

subsize

x axis control, a numeric vector containing the subset sizewith length equal to the number of columns of matrix residuals. If it isnot specified it will be set equal to(nrow(residuals) - ncol(residuals) + 1) : nrow(residuals).

selstep

Text shown in selected steps, a numeric vector which specifiesfor which steps of the forward search textlabels are added in the monitoringresidual plot after a brushing action in the yXplot. The default is towrite the labels at the initial and final step. The default isselstep=c(m0, n) wherem0 andn are respectivelythe first and final step of the search.

selunit

Unit labelling. A vector of strings, a string, or a numericvector for labelling units. If out is an object the threshold is associatedwith the trajectories of the residuals monitored along the search else itrefers to the values of the response variable. If it is a vector of strings,only the lines associated with the units that in at least one step of thesearch had a residual smaller thanselunit[1] or greater thansellunit[2] will have a textbox. If it is a string it specifiesthe threshold above which labels have to be put. For exampleselunit='2.6'means that the text labels are written only for the units which havein at least one step of the search a value of the scaled residualgreater than 2.6 in absolute value. If it is a numeric vector itcontains the list of the units for which it is necessary to putthe text labels. The default value of selunit is string'2.5'ify is an object else it is an empty value.

trace

Whether to print intermediate results. Default istrace=FALSE.

...

potential further arguments passed to lower level functions.

Value

none

Author(s)

FSDA team,valentin.todorov@chello.at

Examples

 ## Not run:  ##  Example of the use of function regspmplot with all the default options ##  regsmplot() with first argument vector y and no option. ##  In the first example as input there are two matrices: y and X respectively ##  A simple plot is created n <- 100 p <- 3 X <- matrix(data=rnorm(n*p), nrow=n, ncol=p) y <- matrix(data=rnorm(n*1), nrow=n, ncol=1) regspmplot(y, X) ##  Example of the use of function regspmplot with first argument ##  vector y and third argument group. ##  Different groups are shown in the yXplot group <- rep(0, n) group[1:(n/2)] <- rep(1, n/2) regspmplot(y, X, group) ##  Example of the use of function regspmplot with first argument ##  vector y, third argument group and fourth argument plot ##  (Ex1) plot=TRUE regspmplot(y, X, group, plot=TRUE) ##  (Ex1) Set the scale for the x axes, the y axis and control symbol type regspmplot(y, X, group, xlim=c(-1,2), ylim=c(0,2), pch=c(10,11), trace=TRUE) ##  When the first input argument is an object. ##  In the following example the input is an object which also contains ##  information about the forward search.     (out <- fsreg(y~X, method="LMS", control=LXS_control(nsamp=1000)))     (out <- fsreg(y~X, bsb=out$bs, monitoring=TRUE))     regspmplot(out, plot=0) ## End(Not run)

Plots the trajectories of the monitored scaled (squared) residuals

Description

Plots the trajectories of the monitored scaled (squared) residuals

Usage

resfwdplot(out,     xlim, ylim, xlab, ylab, main, lwd, lty, col, cex.lab, cex.axis,     xvalues,     fg.thresh, fg.unit, fg.labstep, fg.lwd, fg.lty, fg.col, fg.mark, fg.cex,     bg.thresh, bg.style,     tag, datatooltip, label, nameX, namey, msg, databrush,     standard, fground, bground, ...)

Arguments

out

An object returned by one of the monitoring functions (seeFSReda_control,Sregeda_control andMMregeda_control). The object is one offsreda.object,sregeda.object ormmregeda.object.

The needed elements ofout are

RES: matrix containing the residuals monitored in each step of the forward search or any other robust procedure. Every row is associated with a residual (unit). This matrix can be created using function FSReda, Sregeda, MMregeda.
Un: (for FSR only) - matrix containing the order of entry in the subset of each unit (required only when datatooltip is true or databrush is not empty).
bdp: (for Sreg only) - vector containing a sequence of breakdown point values to monitor on.
eff: (for MMreg only) - vector containing a sequence of efficiency values to monitor on.
y: a vector containing the response (required only when option databrush is requested).
X: a matrix containing the explanatory variables (required only when option databrush is requested).
Bols: (n-init+1) x (p+1) matrix containing the estimated beta coefficients monitored in each step of the robust procedure (required only when option databrush is requested and suboption multivarfit is requested).

ylim

Controly scale in plot. Vector with two elements controlling minimum and maximum on the y axis. Default is to use automatic scale.

xlim

Controlx scale in plot. Vector with two elements controlling minimum and maximum on the x axis. Default is to use automatic scale.

xlab

a title for the x axis

ylab

a title for the y axis

main

an overall title for the plot

lwd

The line width, a positive number, defaulting to 1

lty

The line type. Line types can either be specified as an integer (1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash". The latter two are not supported by Matlab.

col

colors to be used for the highlighted units

cex.lab

The magnification to be used for x and y labels relative to the current setting of cex

cex.axis

The magnification to be used for axis annotation relative to the current setting of cex

xvalues

values for the x axis. Numeric vector ofncol(RES) controlling the x axis coordinates. The default value of xvalues is(nrow(RES) - ncol(RES) + 1):nrow(RES)

fg.thresh

fg.unit

(alternative to fg.thresh), vector containing the list of the units to be highlighted. Iffg.unit is supplied,fg.thresh is ignored.

fg.labstep

numeric vector which specifies the steps of the search where to put labels for the highlighted trajectories (units). The default is to put the labels at the initial and final steps of the search.fg.labstep='' means no label.

fg.lwd

The line width for the highlighted trajectories (units). Default is 1.

fg.lty

The line type for the highlighted trajectories (units). Line types can either be specified as an integer (1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash". The latter two are not supported by Matlab.

fg.col

colors to be used for the highlighted units.

fg.mark

Controlls whether to plot highlighted trajectories as symbols.iffg.mark==TRUE each line is plotted using a different symbol else no marker is used (default).

fg.cex

controls the font size of the labels of the trajectories in foreground.

bg.thresh

bg.style

specifies how to plot the unimportant trajectories as defined in option bthresh.

bg.style="faint": unimportant trajectories are plotted using a colormap.
bg.style="hide": unimportant trajectories are hidden.
bg.style="greyish": unimportant trajectories are displayed in a faint grey.

tag

datatooltip

For example for classfsreda.object the information concerns the label and the step of the search in which the unit enters the subset. If datatooltip is a list it may contain the following fields:

DisplayStyle determines how the data cursor displays. Possible values are'datatip' and'window' (default).'datatip' displays data cursor information in a small yellow text box attached to a black square marker at a data point you interactively select.'window' displays data cursor information for the data point you interactively select in a floating window within the figure.
SnapToDataVertex: specifies whether the data cursor snaps to the nearest data value or is located at the actual pointer position. Possible values areSnapToDataVertex='on' (default) andSnapToDataVertex='off'.
LineColor: controls the color of the trajectory selected with the mouse. It can be an RGB triplet of values between 0 and 1, or character vector indicating a color name. Note that a RGB vector can be conveniently chosen with our MATLAB class FSColor, see documentation.
SubsetLinesColor: enables to control the color of the trajectories of the units that are in the subset at a given step of the search (ifresfwdplot() is applied to an object of classfsreda.object) or have a weight greater than 0.9 (ifresfwdplot() is applied to an object of classsregeda.object ormmregeda.object). This can be done (repeatedly) with a left mouse click in proximity of the step of interest. A right mouse click will terminate the selection by marking with a up-arrow the step corresponding to the highlighted lines. The highlighted lines by default are in red, but a different color can be specified as RGB triplet or character of color name. Note that a RGB vector can be conveniently chosen with our MATLAB class FSColor, see documentation. By defaultSubsetLinesColor="", i.e. the modality is not active.Any initialization forSubsetLinesColor which cannot be interpreted as RGB vector will be converted to blue, i.e.SubsetLinesColor will be forced to be [0 0 1].IfSubsetLinesColor is not empty the previous optionLineColor is ignored.

label

Character vector containing the labels of the units (optional argument used whendatatooltip=TRUE). If this field is not present labels row1, ..., rown will be automatically created and included in the pop up datatooltip window).

nameX

namey

Add response label. A string containing the label of the response

msg

Controls whether to display or not messages on the screen Ifmsg==1 (default) messages are displayed on the screen about step in which signal took place else no message is displayed on the screen.

databrush

monitoring leverage plot;
maximum studentized residual;
s^2 and R^2;
Cook distance and modified Cook distance;
deletion t statistics.

Ifdatabrush=TRUE the default selection tool is a rectangular brush and it is possible to brush only once (that is persist=”).

Ifdatabrush=list(...), it is possible to use all optional arguments of functionselectdataFS() and the following optional argument:

persist. Persist is an empty value or a character containing 'on' or 'off'. The default value ispersist="", that is brushing is allowed only once. Ifpersist="on" orpersis="off" brushing can be done as many time as the user requires. Ifpersist='on' then the unit(s) currently brushed are added to those previously brushed. It is possible, every time a new brushing is done, to use a different color for the brushed units. Ifpersist='off' every time a new brush is performed units previously brushed are removed.
bivarfit. Wheather to superimpose bivariate least square lines on the plot (ifplot=TRUE.This option adds one or more least squares lines, based on SIMPLE REGRESSION of y on Xi, to the plots of y|Xi. The default isbivarfit=FALSE: no line is fitted. Ifbivarfit=1, a single OLS line is fitted to all points of each bivariate plot in the scatter matrix y|X. Ifbivarfit=2, two OLS lines are fitted: one to all points and another to the group of the genuine observations. The group of the potential outliers is not fitted. Ifbivarfit=0 one OLS line is fitted to each group. This is useful for the purpose of fitting mixtures of regression lines. Ifbivarfit='i1' orbivarfit='i2', etc. an OLS line is fitted to a specific group, the one with index 'i' equal to 1, 2, 3 etc. Again, useful in case of mixtures.
multivarfit. Wheather to superimpose multivariate least square lines. This option adds one or more least square lines, based on MULTIVARIATE REGRESSION of y on X, to the plots of y|Xi. The default ismultivarfit=FALSE: no line is fitted. Ifbivarfit=1, a single OLS line is fitted to all points of each bivariate plot in the scatter matrix y|X. The line added to the scatter plot y|Xi is avconst + Ci*Xi, where Ci is the coefficient of Xi in the multivariate regression and avconst is the effect of all the other explanatory variables different from Xi evaluated at their centroid (that is overline(y)'C)). Ifmultivarfit=2, same action as withmultivarfit=1 but this time we also add the line based on the group of unselected observations (i.e. the normal units).
labeladd. Add outlier labels in plot. Iflabeladd=TRUE, we label the outliers with the unit row index in matrices X and y. The default value islabeladd=FALSE, i.e. no label is added.

standard

(MATLAB-style arguments) appearance of the plot in terms of xlim, ylim, axes labels and their font size style, color of the lines, etc.

fground

MATLAB-style arguments for the fground trajectories in foregroud.

bground

MATLAB-style arguments for the fground trajectories in backgroud.

...

potential further arguments passed to lower level functions.

Details

No details

Value

No value returned

Author(s)

FSDA team

Examples

## Not run: n <- 100y <- rnorm(n)X <- matrix(rnorm(n*4), nrow=n)out <- fsreg(y~X, method="LTS")out <- fsreg(y~X, method="FS", bsb=out$bs, monitoring=TRUE)resfwdplot(out)## End(Not run)

Plots the residuals from a regression analysis versus index number or any other variable

Description

The functionresindexplot() plots the residuals from a regression analysisversus index number or any other variable. The residuals come from an outputobject of any of the regression fucntions or a simply a vector of values.In order to use the databrush option, the residuals must come from one ofthe fsdaR regression functions.

Usage

resindexplot(out, x, xlim, ylim, xlab, ylab, main, numlab, indlab, conflev, cex.axis,     cex.lab, lwd, nameX, namey, tag, col, cex, databrush, ...)

Arguments

out

A vector containing the residuals from a regression analysis or an object returned by one of the regression functions (seeFSR_control,LXS_control,Sreg_control andMMreg_control). The object is one offsr.object,fsdalts.object,fsdalms.object,sreg.object ormmreg.object. The needed elements ofout are at leastresiduals, but if the optiondatabrush is used, alsoX amdy will be needed.

x

The vector to be plotted on the x-axis. As default the sequence 1:length(residuals) will be used

xlim

Controlx scale in plot. Vector with two elements controlling minimum and maximum on the x axis. Default is to use automatic scale.

ylim

Controly scale in plot. Vector with two elements controlling minimum and maximum on the y axis. Default is to use automatic scale.

xlab

a title for the x axis

ylab

a title for the y axis

main

an overall title for the plot

numlab

Number of points to be identified in plots (see alsoindlab) . By default the five points with largest values will be identified.If numlab is a single number containing scalar k, the units with the k largest residuals are labelled in the plots.If numlab is a vector, the units inside vector numlab are labelled in the plots. The default value ofnumlab=5 and the units with the 5 largest residuals will be labelled. Ifnumlib=0 ornumlib=NULL no labelling will be done.

indlab

Which points to be identified in plots (see alsonumlab) - the units with indexes in the vector indlab are labelled in the plots.

conflev

Confidence interval for the horizontal bands (a numeric vector). It can be a vector of different confidence level values.

Remark: confidence interval is based on the chi^2 distribution

cex.axis

The magnification to be used for axis annotation relative to the current setting of cex

cex.lab

The magnification to be used for x and y labels relative to the current setting of cex

lwd

The line width, a positive number, defaulting to 1

tag

Figure tag (character). Tag of the figure which will host theresindexplot. The default tag ispl_resindex.

col

Fill color for markers that are closed shapes (circle, square, diamond, pentagram, hexagram, and the four triangles).Can be'none' or'auto' or color name(string) or RGB triplet.

cex

Size of the point symbols. The magnification to be used relative to the current setting of cex.

nameX

namey

Add response label. A string containing the label of the response

databrush

Interactive mouse brushing. If databrush is missing or empty (default) ordatabrush=FALSE, no brushing is done.The activation of this option (databrush is a scalar or a list) enables the user to select a set of trajectories in the current plot and to see them highlighted in the y|X plot, i.e. a matrix of scatter plots of y against each column of X, grouped according to the selection(s) done by brushing. If the plot y|X does not exist it is automatically created.In addition, brushed units are automatically highlighted in the minimum deletion residual plot if it is already open. The extension to the following plots will be available in future versions of the package:

monitoring leverage plot;
maximum studentized residual;
s^2 and R^2;
Cook distance and modified Cook distance;
deletion t statistics.

Ifdatabrush=TRUE the default selection tool is a rectangular brush and it is possible to brush only once (that is persist=”).

Ifdatabrush=list(...), it is possible to use all optional arguments of functionselectdataFS() and the following optional argument:

persist. Persist is an empty value or a character containing 'on' or 'off'. The default value ispersist="", that is brushing is allowed only once. Ifpersist="on" orpersis="off" brushing can be done as many time as the user requires. Ifpersist='on' then the unit(s) currently brushed are added to those previously brushed. It is possible, every time a new brushing is done, to use a different color for the brushed units. Ifpersist='off' every time a new brush is performed units previously brushed are removed.
bivarfit. This option adds one or more least square lines based on SIMPLE REGRESSION to the plots of y|X, depending on the selected groups.The default isbivarfit=FALSE: no line is fitted. Ifbivarfit=1, a single OLS line is fitted to all points of each bivariate plot in the scatter matrix y|X. Ifbivarfit=2, two OLS lines are fitted: one to all points and another to the group of the genuine observations. The group of the potential outliers is not fitted. Ifbivarfit=0 one OLS line is fitted to each group. This is useful for the purpose of fitting mixtures of regression lines. Ifbivarfit='i1' orbivarfit='i2', etc. an OLS line is fitted to a specific group, the one with index 'i' equal to 1, 2, 3 etc. Again, useful in case of mixtures.
multivarfit. Wheather to superimpose multivariate least square lines. This option adds one or more least square lines, based on MULTIVARIATE REGRESSION of y on X, to the plots of y|Xi. The default ismultivarfit=FALSE: no line is fitted. Ifbivarfit=1, a single OLS line is fitted to all points of each bivariate plot in the scatter matrix y|X. The line added to the scatter plot y|Xi is avconst + Ci*Xi, where Ci is the coefficient of Xi in the multivariate regression and avconst is the effect of all the other explanatory variables different from Xi evaluated at their centroid (that is overline(y)'C)). Ifmultivarfit=2, same action as withmultivarfit=1 but this time we also add the line based on the group of unselected observations (i.e. the normal units).
labeladd. Add outlier labels in plot. Iflabeladd=TRUE, we label the outliers with the unit row index in matrices X and y. The default value islabeladd=FALSE, i.e. no label is added.

...

potential further arguments passed to lower level functions.

Details

No details

Value

No value returned

Author(s)

FSDA team

Examples

## Not run: out <- fsreg(stack.loss~., data=stackloss)resindexplot(out, conflev=c(0.95,0.99), col="green")## End(Not run)

Computes the score test for transformation in regression

Description

Computes the score test for transformation in regression

Usage

score(x, ...)## S3 method for class 'formula'score(  formula,  data,  subset,  weights,  na.action,  model = TRUE,  contrasts = NULL,  offset,  ...)## Default S3 method:score(  x,  y,  intercept = TRUE,  la = c(-1, -0.5, 0, 0.5, 1),  lik = FALSE,  nocheck = FALSE,  trace = FALSE,  ...)

Arguments

x

Ann x p data matrix (n observations andp variables).Rows ofx represent observations, and columns represent variables.

Missing values (NA's) and infinite values (Inf's) are allowed,since observations (rows) with missing or infinite values willautomatically be excluded from the computations.

...

potential further arguments passed to lower level functions.

formula

aformula of the formy ~ x1 + x2 + ....

data

data frame from which variables specified informula are to be taken.

subset

an optional vector specifying a subset of observationsto be used in the fitting process.

weights

an optional vector of weights to be usedNOT USED YET.

na.action

model

logical indicating if themodel frame, is to be returned.

contrasts

an optional list. See thecontrasts.argofmodel.matrix.default.

offset

this can be used to specify ana prioriknown component to be included in the linear predictorduring fitting. Anoffset term can be included in the

y

Response variable. A vector withn elements thatcontains the response variable.

intercept

wheather to use constant term (default isintercept=TRUE

la

values of the transformation parameter for which it is necessaryto compute the score test. Default value of lambda isla=c(-1, -0.5, 0, 0.5, 1), i.e., the five most common values of lambda.

lik

likelihood for the augmented model. If true the value of the likelihoodfor the augmented model will be calculated and returend otherwise (default) onlythe value of the score test will be given

nocheck

trace

Whether to print intermediate results. Default istrace=FALSE.

Value

An S3 object of classscore.object will be returned which is basically a listcontaining the following elements:

la: vector containing the values of lambda for which fan plot is constructed
Score: a vector containing the values of the score test foreach value of the transformation parameter.
Lik: value of the likelihood. This output is produced only if lik=TRUE.

Author(s)

FSDA team,valentin.todorov@chello.at

References

Atkinson, A.C. and Riani, M. (2000),Robust Diagnostic Regression Analysis Springer Verlag, New York.

Examples

 ## Not run:    data(wool)   XX <- wool   y <- XX[, ncol(XX)]   X <- XX[, 1:(ncol(XX)-1), drop=FALSE]   (out <- score(X, y))                          # call 'score' with all default parameters   (out <- score(cycles~., data=wool))           # use the formula interface   (out <- score(cycles~., data=wool, lik=TRUE)) # return the likelihood   data(loyalty)   head(loyalty)   ##    la is a vector containing the values of \lambda which have to be tested   (out <- score(amount_spent~., data=loyalty, la=c(0.25, 1/3, 0.4, 0.5)))   (out <- score(amount_spent~., data=loyalty, la=c(0.25, 1/3, 0.4, 0.5), lik=TRUE)) ## End(Not run)

Objects returned by the function`score`

Description

An object of classscore.object holds information aboutthe result of a call toscore.

Value

The functionsprint() andsummary() are used to obtain and print asummary of the results. An object of classscore is a list containing at least the following components:

la: vector containing the values of lambda for which fan plot is constructed
Score: a vector containing the values of the score test foreach value of the transformation parameter.
Lik: value of the likelihood. This output is produced only if lik=TRUE.

Examples

 ## Not run:    data(wool)   (out <- score(cycles~., data=wool, lik=TRUE))   class(out)   summary(out) ## End(Not run)

Computes S estimators in multivariate analysis

Description

Computes S estimators in multivariate analysis

Usage

smult(  x,  monitoring = FALSE,  plot = FALSE,  bdp,  nsamp,  conflev = 0.975,  nocheck = FALSE,  trace = FALSE,  ...)

Arguments

x

An n x p data matrix (n observations and p variables).Rows of x represent observations, and columns represent variables.

Missing values (NA's) and infinite values (Inf's) are allowed,since observations (rows) with missing or infinite values willautomatically be excluded from the computations.

monitoring

Wheather to perform monitoring of Mahalanobis distances and other specific quantities

plot

labeladd Iflabeladd=1, the outliers in the spm are labelled with the unitrow index. The default value islabeladd="", i.e. no label is added
nameY character vector containing the labels of the variables. As default value,the labels which are added are Y1, ...Yp.

bdp

Measures the fraction of outliers the algorithm should resist.In this case any value greater than 0 but smaller or equal than 0.5 will do fine (default isbdp=0.5).Note that given bdp nominal efficiency is automatically determined.

nsamp

Number of subsamples which will be extracted to find the robust estimator.Ifnsamp=0 all subsets will be extracted. They will be(n choose p). If the number of all possible subset is<1000the default is to extract all subsets otherwise just 1000.

conflev

nocheck

It controls whether to perform checks on matrix Y. Ifnocheck=TRUE, no check is performed.

trace

Whether to print intermediate results. Default istrace=FALSE.

...

potential further arguments passed to lower level functions.

Details

This function follows the lines of MATLAB/R code developed during the years by many authors.For more details see http://www.econ.kuleuven.be/public/NDBAE06/programs/ andthe R packageCovSestThe core of these routines, e.g. the resampling approach, however, has beencompletely redesigned, with considerable increase of the computational performance.

Value

Depending on the input parametermonitoring, one ofthe following objects will be returned:

smult.object
smulteda.object

Author(s)

FSDA team,valentin.todorov@chello.at

References

Maronna, R.A., Martin D. and Yohai V.J. (2006), Robust Statistics, Theory and Methods, Wiley, New York.

Examples

 ## Not run:  data(hbk, package="robustbase") (out <- smult(hbk[,1:3])) class(out) summary(out) ##  Generate contaminated data (200,3) n <- 200 p <- 3 set.seed(123456) X <- matrix(rnorm(n*p), nrow=n) Xcont <- X Xcont[1:5, ] <- Xcont[1:5,] + 3 out1 <- smult(Xcont, trace=TRUE)           # no plots (plot defaults to FALSE) names(out1) ## plot=TRUE - generates: (1) a plot of Mahalanobis distances against ##    index number. The confidence level used to draw the confidence bands for ##    the MD is given by the input option conflev. If conflev is ##    not specified a nominal 0.975 confidence interval will be used and ##    (2) a scatter plot matrix with the outliers highlighted. (out1 <- smult(Xcont, trace=TRUE, plot=TRUE)) ## plots is a list: the spm shows the labels of the outliers. (out1 <- smult(Xcont, trace=TRUE, plot=list(labeladd="1"))) ## plots is a list: the spm uses the variable names provided by 'nameY'. (out1 <- smult(Xcont, trace=TRUE, plot=list(nameY=c("A", "B", "C")))) ## smult() with monitoring (out2 <- smult(Xcont, monitoring=TRUE, trace=TRUE)) names(out2) ##  Forgery Swiss banknotes examples. data(swissbanknotes) (out1 <- smult(swissbanknotes[101:200,], plot=TRUE)) (out1 <- smult(swissbanknotes[101:200,], plot=list(labeladd="1"))) ## End(Not run)

Description of`smult.object` Objects

Description

An object of classsmult.object holds information about the result of a call tosmult.

Value

The object itself is basically alist with the following components:

loc

p-by-1 vector containing S estimate of location.

shape

p-by-p matrix containing robust estimate of the shapematrix. Remark: det|shape|=1.

scale

robust estimate of the scale.

cov

scale^2 * shape: robust estimate of covariance matrix.

bs

a (p+1) vector containing the units forming best subset associated with S estimate of location.

md

outliers

A vector containing the list of the units declared as outliers using confidence level specified in input scalarconflev.

conflev

Confidence level that was used to declare outliers.

singsub

Number of subsets without full rank. Notice thatsingsub > 0.1*(number of subsamples) produces a warning.

weights

n x 1 vector containing the estimates of the weights.

X

the data matrix X

The object has class"smult".

Examples

    ## Not run:     data(hbk, package="robustbase")    (out <- smult(hbk[,1:3]))    class(out)    summary(out)    ## End(Not run)

Description of`smulteda.object` Objects

Description

An object of classsmulteda.object holds information about the result of a call tosmult withmonitoring=TRUE.

Value

The object itself is basically alist with the following components:

Loc

length(bdp)-by-p matrix containing S estimate of location for each value ofbdp.

Shape

p-by-p-by-length(bdp) 3D array containing robust estimate of the shape for each value of bdp. Remark: det|shape|=1.

Scale

length(bdp) vector containing robust estimate of the scalefor each value of bdp.

Cov

p-by-p-by-length(bdp) 3D array containing robust estimate of covariance matrix for each value ofbdp.Note thatscale(i)^2 * shape[,,i] = robust estimate of covariance matrix.

Bs

(p+1)-by-length(bdp) matrix containing the units forming best subset for each value of bdp.

MAL

n-by-length(bdp) matrix containing the estimates of the robust Mahalanobis distances (in squared units) for each value of bdp.

Outliers

n-by-length(bdp) matrix containing flags for the outliers. Boolean matrix containing the list of the units declared as outliers for each value of bdp using confidence level specified in input scalarconflev

Weights

n x length(bdp) matrix containing the weights for each value of bdp.

conflev

Confidence level that was used to declare outliers.

singsub

Number of subsets without full rank. Notice thatsingsub > 0.1*(number of subsamples) produces a warning.

bdp

vector which contains the values of bdp which have been used.

X

the data matrix X.

The object has class"smulteda".

Examples

## Not run:     data(hbk, package="robustbase")    (out <- smult(hbk[,1:3], monitoring=TRUE))    class(out)    summary(out)## End(Not run)

Interactive scatterplot matrix

Description

Produces an interactive scatterplot matrix with boxplots or histograms on the maindiagonal and possibly robust bivariate contours

Usage

spmplot(  X,  group,  plot,  variables,  col,  cex,  pch,  labeladd,  label,  legend,  dispopt = c("hist", "box"),  tag,  datatooltip,  databrush,  trace = FALSE,  ...)

Arguments

X

data matrix (2D array) containingn observations onp variablesor an object of S3 classfsmeda.object returned byfsmult withmonitoring=TRUE - a list containingthe monitoring of minimum Mahalanobis distance

group

plot

controls the names which are displayed in the margins of thescatter-plot matrix, the labels of the legend the colors and the symbols.Ifplot isempty (plot=FALSE orplot=0 orplot=c()orplot=NULL) empty strings are displayed and no label and no name is addedto the plot. Ifplot=TRUE orplot=1, the namesY1,..., Ypare added to the margins of the the scatter plot matrix else nothing is added.Ifplot is a list, it is possible to control not only the names but also,point labels, colors and symbols. More precisely listplot may contain the following elements:

labeladd - see parameterlabeladd
nameY - a character string containing the labels ofthe variables. As default value, the labels which areadded areY1, ..., Yp. See parametervariables.
clr - see parametercol
sym - see parameterpch
siz - see parametercex
doleg - see parameterlegend
label - see parameterlabel

variables

a character string with the names of the variables

col

color specification for the data point. Can be different for each group.By default, the order of the colors isblue,red,black,magenta,green,cyan andyelow.

cex

pch

labeladd

label

a character vector of lengthn (the number of rows in the data matrix)containing the labels of the units. If this field is emptythe sequence1:n will be used to label the units.

legend

logical, controls where a legend is shown or not.

dispopt

controls how to fill the diagonals in the plot (main diagonal ofthe scatter plot matrix). Setdispopt='hist' (default) to plot histograms,ordispopt='box' to plot boxplots. The style which is used for univariateboxplots is traditional, if the number of groups is less or equal 5, else it is'compact' (plot boxes using a smaller box style designed for plots with many groups).

tag

datatooltip

databrush

Interactive mouse brushing. If databrush is missing or empty (default), no brushing is done.The activation of this option (databrush is TRUE or a list) enables the user to selecta set of trajectories in the current plot and to see them highlighted in the scatterplot matrix.If the scatterplot matrix does not exist it is automatically created.In addition, brushed units can be highlighted in the monitoring MD plot.Note that the windowstyle of the other figures is set equal to that which contains the monitoring residual plot.In other words, if the monitoring residual plot is docked all the other figures will be docked too.

Ifdatabrush=TRUE the default selection tool is a rectangular brush and it ispossible to brush only once (that is persist=”).

Ifdatabrush=list(...), it is possible to use all optional arguments ofthe MATLAB functionselectdataFS() and the following optional arguments:

persist: This option can be an empty value or a character containing 'on' or 'off'.The default value ispersist="", that is brushing is allowed only once.Ifpersist="on" orpersis="off" brushing can be done as many time asthe user requires. Ifpersist='on' then the unit(s) currently brushed areadded to those previously brushed. It is possible, every time a new brushing isdone, to use a different color for the brushed units. Ifpersist='off'every time a new brush is performed units previously brushed are removed.
labeladd: add labels of brushed units in the scatterplot matrix.If this option is '1', we label the units of the last selected group withthe unit row index in the matrix X. The default value is labeladd=”,i.e. no label is added.

trace

Whether to print intermediate results. Default istrace=FALSE.

...

potential further arguments passed to lower level functions.

Value

none

Author(s)

FSDA team,valentin.todorov@chello.at

Examples

 ## Not run:  ##  Call of spmplot() without optional parameters. ##  Iris data: scatter plot matrix with univariate boxplots on the main ##  diagonal. X <- iris[,1:4] group <- iris[,5] spmplot(X, group, variables=c('SL','SW','PL','PW'), dispopt="box") ##  Example of spmplot() called by routine fsmult(). ##  Generate contaminated data.     n <- 200; p <- 3     X <- matrix(rnorm(n*p), ncol=3)     Xcont <- X     Xcont[1:5,] <- Xcont[1:5,] + 3 ##  spmplot is called automatically by all outlier detection methods, e.g. fsmult()     out <- fsmult(Xcont, plot=TRUE); ##  Now test the direct use of fsmult(). Set two groups, e.g. those obtained ##  from fsmult().     group = rep(0, n)     group[out$outliers] <- 1 ##  option 'labeladd' is used to label the outliers ##  By default, the legend identifies the groups with the identifiers ##  given in vector 'group'. ##  Set the colors for the two groups to blue and red.     spmplot(Xcont, group, col=c("blue", "red"), labeladd=1, dispopt="box") ## End(Not run)

Description of sreg Objects

Description

An object of classsreg.object holds information about the result of a call tofsreg.

Value

The object itself is basically alist with the following components:

beta

p-by-1 vector containing the estimated regression parameters (in step n-k).

scale

scalar containing the estimate of the scale (sigma).

bs

p x 1 vector containing the units forming best subset associated with S estimate of regression coefficient.

residuals

residuals.

fittedvalues

fitted values.

outliers

kx1 vector containing the list of the k units declared as outliers or NULL if the sample is homogeneous.

conflev

singsub

Number of subsets wihtout full rank. Notice thatsingsub > 0.1*(number of subsamples) produces a warning

weights

n x 1 vector containing the estimates of the weights

rhofunc

Specifies the rho function which has been used to weight the residuals.

rhofuncparam

X

the data matrix X

y

the response vector y

The object has class"sreg".

Examples

## Not run:     data(hbk, package="robustbase")    (out <- fsreg(Y~., data=hbk, method="S"))    class(out)    summary(out)## End(Not run)

Description of`sregeda` Objects

Description

An object of classsregeda.object holds information about the result of a call tofsreg whenmethod="S" andmonitoring=TRUE.

Value

The object itself is basically alist with the following components:

Beta

matrix containing the S estimator of regression coefficients for each value of bdp.

Scale

vector containing the estimate of the scale (sigma) for each value of bdp. This is the value of the objective function.

BS

p x 1 vector containing the units forming best subset associated with S estimate of regression coefficient.

RES

n x length(bdp) matrix containing the monitoring of scaled residuals for each value ofbdp.

Weights

n x length(bdp) matrix containing the estimates of the weights for each value of bdp

Outliers

Boolean matrix containing the list of the units declared as outliers for each value of bdp using confidence level specified in input scalar conflev.

conflev

Confidence level which is used to declare units as outliers. Remark: conflev will be used to draw the horizontal line (confidence band) in the plot.

Singsub

Number of subsets wihtout full rank. Notice that singsub[bdp[jj]] > 0.1*(number of subsamples) produces a warning

rhofunc

Specifies the rho function which has been used to weight the residuals.

rhofuncparam

X

the data matrix X

y

the response vector y

The object has class"sregeda".

Examples

## Not run:     data(hbk, package="robustbase")    (out <- fsreg(Y~., data=hbk, method="S", monitoring=TRUE))    class(out)    summary(out)## End(Not run)

Summary Method for`fsdalms` objects

Description

summary method for class"fsdalms".

Usage

## S3 method for class 'fsdalms'summary(object, correlation = FALSE, ...)## S3 method for class 'summary.fsdalms'print(x, digits = max(3, getOption("digits") - 3),     signif.stars = getOption("show.signif.stars"), ...)

Arguments

object,x

an object of class"fsdalms" (or"summary.fsdalms");usually, a result of a call tofsreg.

correlation

logical; ifTRUE, the correlation matrix of the estimated parameters is returned and printed.

digits

the number of significant digits to use when printing.

signif.stars

logical indicating if “significance stars”should be printer, seeprintCoefmat.

...

further arguments passed to or from other methods.

Details

summary.fsdalms(), the S3 method, simply returns an (S3) object of class"summary.fsdalms"for which there's aprint method:

print.summary.fsdalms prints summary statistics for the forward search (FS) regression estimates.While the functionprint.fsdalms prints only the robust estimatesof the coefficients,print.summary.fsdalms will print also the regression table.

Value

summary.fsdalms returns ansummary.fsdalms object, whereas theprint methods returns its first argument viainvisible, as allprint methods do.

Examples

## Not run:     data(Animals, package = "MASS")    brain <- Animals[c(1:24, 26:25, 27:28),]    lbrain <- log(brain)    (fs <- fsreg(brain~body, data=lbrain, method="LTS"))    summary(fs)    ## compare to the result of ltsReg() from 'robustbase'    library(robustbase)    (lts <- ltsReg(brain~body, data=lbrain))    summary(lts)## End(Not run)

Summary Method for`fsdalts` objects

Description

summary method for class"fsdalts".

Usage

## S3 method for class 'fsdalts'summary(object, correlation = FALSE, ...)## S3 method for class 'summary.fsdalts'print(x, digits = max(3, getOption("digits") - 3),     signif.stars = getOption("show.signif.stars"), ...)

Arguments

object,x

an object of class"fsdalts" (or"summary.fsdalts");usually, a result of a call tofsreg.

correlation

logical; ifTRUE, the correlation matrix of the estimated parameters is returned and printed.

digits

the number of significant digits to use when printing.

signif.stars

logical indicating if “significance stars”should be printer, seeprintCoefmat.

...

further arguments passed to or from other methods.

Details

summary.fsdalts(), the S3 method, simply returns an (S3) object of class"summary.fsdalts"for which there's aprint method:

print.summary.fsdalts prints summary statistics for the forward search (FS) regression estimates.While the functionprint.fsdalts prints only the robust estimatesof the coefficients,print.summary.fsdalts will print also the regression table.

Value

summary.fsdalts returns ansummary.fsdalts object, whereas theprint methods returns its first argument viainvisible, as allprint methods do.

Examples

## Not run:     data(Animals, package = "MASS")    brain <- Animals[c(1:24, 26:25, 27:28),]    lbrain <- log(brain)    (fs <- fsreg(brain~body, data=lbrain, method="LTS"))    summary(fs)    ## compare to the result of ltsReg() from 'robustbase'    library(robustbase)    (lts <- ltsReg(brain~body, data=lbrain))    summary(lts)## End(Not run)

Summary Method for FSR objects

Description

summary method for class"fsr".

Usage

## S3 method for class 'fsr'summary(object, correlation = FALSE, ...)## S3 method for class 'summary.fsr'print(x, digits = max(3, getOption("digits") - 3),     signif.stars = getOption("show.signif.stars"), ...)

Arguments

object,x

an object of class"fsr" (or"summary.fsr");usually, a result of a call tofsreg.

correlation

logical; ifTRUE, the correlation matrix of the estimated parameters is returned and printed.

digits

the number of significant digits to use when printing.

signif.stars

logical indicating if “significance stars”should be printer, seeprintCoefmat.

...

further arguments passed to or from other methods.

Details

summary.fsr(), the S3 method, simply returns an (S3) object of class"summary.fsr"for which there's aprint method:

print.summary.fsr prints summary statistics for the forward search (FS) regression estimates.While the functionprint.fsr prints only the robust estimatesof the coefficients,print.summary.fsr will print also the regression table.

Value

summary.fsr returns ansummary.fsr object, whereas theprint methods returns its first argument viainvisible, as allprint methods do.

Examples

## Not run:     data(Animals, package = "MASS")    brain <- Animals[c(1:24, 26:25, 27:28),]    lbrain <- log(brain)    (fs <- fsreg(brain~body, data=lbrain, method="FS"))    summary(fs)## End(Not run)

Swiss banknote data

Description

Six variables measured on 100 genuine and 100 counterfeit old (printed before the second world war) Swiss 1000-franc bank notes (Flury and Riedwyl, 1988).

Usage

data(swissbanknotes)

Format

A data frame with 200 observations on the following 7 variables.

length: Length of bill, mm
left: Width of left edge, mm
right: Width of right edge, mm
bottom: Bottom margin width, mm
top: Top margin width, mm
diagonal: Length of image diagonal, mm
class: 1 = genuine, 2 = counterfeit

Source

Flury, B. and Riedwyl, H. (1988).Multivariate Statistics: A practical approach. London: Chapman & Hall.

References

Weisberg, S. (2005).Applied Linear Regression, 3rd edition. New York: Wiley, Problem 12.5.

Examples

library(rrcov)data(swissbanknotes)head(swissbanknotes)plot(CovMcd(swissbanknotes[, 1:6]), which="pairs", col=swissbanknotes$class)

Swiss heads data

Description

Six dimensions in millimetres of the heads of 200 twenty year old Swiss soldiers (Flury and Riedwyl, 1988, p. 218 and also Flury, 1997, p. 6).

The data were collected to determine the variability in size and shapeof heads of young men in order to help in the design of a new protectionmask for the Swiss army.

Usage

data(swissheads)

Format

A data frame with 200 observations on the following 6 variables.

minimal_frontal_breadth: Minimal frontal breadth, mm
breadth_angulus_mandibulae: Breadth of angulus mandibulae, mm
true_facial_height: True facial height, mm
length_glabella_nasi: Length from glabella to apex nasi, mm
length_tragion_nasion: Length from tragion to nasion, mm
length_tragion_gnathion: Length from tragion to gnathion, mm

Source

Flury, B. and Riedwyl, H. (1988).Multivariate Statistics: A practical approach. London: Chapman & Hall.

References

Atkinson, A. C., Riani, M. and Cerioli, A. (2004)Exploring multivariate data with the forward search, New York: Springer-Verlag.

Examples

library(rrcov)data(swissheads)head(swissheads)plot(CovMcd(swissheads), which="pairs")

Performs cluster analysis by calling`tclustfsda` for differentnumber of groups`k` and restriction factors`c`

Description

Computes the values of BIC (MIXMIX), ICL (MIXCLA) or CLA (CLACLA),for different values ofk (number of groups) and different values ofc(restriction factor), for a prespecified level of trimming (the last two letters in the namestand for 'Information Criterion'). In order to minimizerandomness, givenk, the same subsets are used for each value ofc.

Usage

tclustIC(  x,  kk = 1:5,  cc = c(1, 2, 4, 8, 16, 32, 64, 128),  alpha = 0,  whichIC = c("ALL", "MIXMIX", "MIXCLA", "CLACLA"),  nsamp,  refsteps = 15,  reftol = 1e-14,  equalweights = FALSE,  msg = TRUE,  nocheck = FALSE,  plot = FALSE,  startv1 = 1,  restrtype = c("eigen", "deter"),  UnitsSameGroup,  numpool,  cleanpool,  trace = FALSE,  ...)

Arguments

x

An n x p data matrix (n observations and p variables).Rows of x represent observations, and columns represent variables.

Missing values (NA's) and infinite values (Inf's) are allowed,since observations (rows) with missing or infinite values willautomatically be excluded from the computations.

kk

an integer vector specifying the number of mixture components (clusters) for which the BIC is to be calculated. By defaultkk=1:5.

cc

an vector specifying the values of the restriction factor which have to be considered. By defaultcc=c(1, 2, 4, 8, 16, 32, 64, 128).

alpha

Global trimming level. A scalar between 0 and 0.5 or an integer specifying the number ofobservations which have to be trimmed. Ifalpha=0 all observations are considered. By defaultalpha=0.

More in detail, if0 < alpha < 1 clustering is based onh = fix(n * (1-alpha))observations, else if alpha is an integer greater than 1 clustering is based onh = n - floor(alpha).

whichIC

A character value which specifies which information criteria must be computedfor eachk (number of groups) and each value of the restriction factorc. Possible values forwhichIC are:

"MIXMIX": a mixture model is fitted and for computing the information criterionthe mixture likelihood is used. This option corresponds to the use of the BayesianInformation criterion (BIC). In output just the matrixMIXMIX is given.
"MIXCLA": a mixture model is fitted but to compute the information criterionthe classification likelihood is used. This option corresponds to the use of theIntegrated Complete Likelihood (ICL). In the output just the matrixMIXCLA is given.
"CLACLA": everything is based on the classification likelihood. This informationcriterion will be called CLA. In the output just the matrixCLACLA is given.
"ALL": both classification and mixture likelihood are used. In this case allthree information criteria CLA, ICL and BIC are computed. In the output allthree matricesMIXMIX,MIXCLA andCLACLA are given.

nsamp

If a scalar, it contains the number of subsamples which will be extracted.Ifnsamp = 0 all subsets will be extracted. Remark - if the number of all possiblesubset is greater than 300 the default is to extract all subsets, otherwise just 300.Ifnsamp is a matrix it contains in the rows the indexes of the subsets whichhave to be extracted.nsamp in this case can be conveniently generated byfunctionsubsets().nsamp can havek columns ork * (p + 1)columns. Ifnsamp hask columns thek initial centroids eachiteration i are given byX[nsamp[i,] ,] and the covariance matrices are equalto the identity.

Ifnsamp hask * (p + 1) columns, the initial centroids and covariancematrices in iterationi are computed as follows:

X1 <- X[nsamp[i ,] ,]
mean(X1[1:p + 1, ]) contains the initial centroid for group 1
cov(X1[1:p + 1, ]) contains the initial cov matrix for group 1
mean(X1[(p + 2):(2*p + 2), ]) contains the initial centroid for group 2
cov(X1[(p + 2):(2*p + 2), ]) contains the initial cov matrix for group 2
...
mean(X1[(k-1)*p+1):(k*(p+1), ]) contains the initial centroids for group k
cov(X1[(k-1)*p+1):(k*(p+1), ]) contains the initial cov matrix for group k.

REMARK: Ifnsamp is not a scalar, the optionstartv1 given below is ignored.More precisely, ifnsamp hask columnsstartv1 = 0 else ifnsamp hask*(p+1) columns optionstartv1=1.

refsteps

Number of refining iterations in each subsample. Default isrefsteps=15.refsteps = 0 means "raw-subsampling" without iterations.

reftol

Tolerance of the refining steps. The default value is 1e-14

equalweights

A logical specifying wheather cluster weights in the concentrationand assignment steps shall be considered. Ifequalweights=TRUE we are (ideally)assuming equally sized groups, else ifequalweights = false (default) we allow fordifferent group weights. Please, check in the given references which functionsare maximized in both cases.

msg

Controls whether to display or not messages on the screen Ifmsg==TRUE (default)messages are displayed on the screen. Ifmsg=2, detailed messages are displayed,for example the information at iteration level.

nocheck

Check input arguments. Ifnocheck=TRUE no check is performedon matrixX. The defaultnocheck=FALSE.

plot

Ifplot=TRUE, a plot of the BIC (MIXMIX), ICL (MIXCLA) curveand CLACLA is shown on the screen. The plots which are shown depend onthe input optionwhichIC.

startv1

How to initialize centroids and covariance matrices. Scalar.Ifstartv1=1 then initial centroids and covariance matrices are basedon(p+1) observations randomly chosen, else each centroid is initializedtaking a random row of input data matrix and covariance matrices are initializedwith identity matrices. The default value isstartv1=1.

Remark 1: in order to start with a routine which is in the required parameter space,eigenvalue restrictions are immediately applied.

Remark 2 - optionstartv1 is used just ifnsamp is a scalar(see for more details the help associated withnsamp).

restrtype

Type of restriction to be applied on the cluster scatter matrices.Valid values are'eigen' (default), or'deter'."eigen" implies restriction on the eigenvalues while"deter"implies restriction on the determinants.

UnitsSameGroup

List of the units which must (whenever possible) havea particular label. For exampleUnitsSameGroup=c(20, 26), means thatgroup which contains unit 20 is always labelled with number 1. Similarly,the group which contains unit 26 is always labelled with number 2, (unlessit is found that unit 26 already belongs to group 1).In general, group which contains unitUnitsSameGroup(r) wherer=2, ...length(kk)-1is labelled with numberr (unless it is found that unitUnitsSameGroup(r)has already been assigned to groups1, 2, ..., r-1.

numpool

The number of parallel sessions to open. If numpool is not defined,then it is set equal to the number of physical cores in the computer.

cleanpool

Logical, indicating if the open pool must be closed or not.It is useful to leave it open if there are subsequent parallel sessions to execute,so that to save the time required to open a new pool.

trace

Whether to print intermediate results. Default istrace=FALSE.

...

potential further arguments passed to lower level functions.

Value

An S3 object of classtclustic.object

Author(s)

FSDA team,valentin.todorov@chello.at

References

Examples

 ## Not run:  data(geyser2) (out <- tclustIC(geyser2, whichIC="MIXMIX", plot=FALSE, alpha=0.1)) summary(out) ## End(Not run) ## Not run:  data(flea) Y <- as.matrix(flea[, 1:(ncol(flea)-1)])    # select only the numeric variables rownames(Y) <- 1:nrow(Y) head(Y) (out <- tclustIC(Y, whichIC="CLACLA", plot=FALSE, alpha=0.1, nsamp=100, numpool=1)) summary(out) ## End(Not run)

Plots information criterion as a function of`c` and`k`, based on the solutions obtained by`tclustIC`

Description

The functiontclustICplot() takes as input an object of classtclustic.object, the outputof functiontclustIC (that is a series of matrices which containthe values of the information criteria BIC/ICL/CLA for different values ofkandc) and plots them as function ofc or ofk. The plot enablesinteraction in the sense that, if optiondatabrush has been activated, it ispossible to click on a point in the plot and to see the associated classificationin the scatter plot matrix.

Usage

tclustICplot(  out,  whichIC = c("ALL", "MIXMIX", "MIXCLA", "CLACLA"),  tag,  datatooltip,  databrush,  nameY,  trace = FALSE,  ...)

Arguments

out

An S3 object of classtclustic.object(output oftclustIC) containing the valuesof the information criteria BIC (MIXMIX), ICL (MIXCLA) or CLA (CLACLA),for different values of k (number of groups) and differentvalues of c (restriction factor), for a prespecified level of trimming.

whichIC

Specifies the information criterion to use for the plot.SeetclustIC() for the possible values of whichIC.

tag

plot handle. String which identifies the handle of the plot which is about to be created.The default is to use tag 'pl_IC'. Notice that if the program finds a plot which hasa tag equal to the one specified by the user, then the output of the new plot overwritesthe existing one in the same window else a new window is created.

datatooltip

Interactive clicking. It is inactive if this parameter is set to FALSE.Ifdatatooltip=TRUE, the user can select with the mouse a solutionin order to have the following information:

1) value ofk which has been selected
2) value ofc which has been selected
3) values of the information criterion
4) frequency distribution of the associated classification.

If datatooltip is a list it may contain the following fields:

DisplayStyle determines how the data cursor displays. Possible values are'datatip' and'window' (default).'datatip' displaysdata cursor information in a small yellow text box attached to a blacksquare marker at a data point you interactively select.'window'displays data cursor information for the data point you interactivelyselect in a floating window within the figure.
SnapToDataVertex: specifies whether the data cursor snaps to the nearest data value oris located at the actual pointer position.Possible values areSnapToDataVertex='on' (default) andSnapToDataVertex='off'.

databrush

Interactive mouse brushing. If databrush is missing or empty (default), no brushing is done.The activation of this option (databrush isTRUE or a list) enables the user to selecta set of values of IC in the current plot and to see thecorresponding classification highlighted in the scatterplot matrix.If the scatterplot matrix does not exist it is automatically created.Note that the window style of the other figures is set equal to that which contains the IC plot.In other words, if the IC plot is docked all the other figures will be docked too.

Ifdatabrush=TRUE the default selection tool is a rectangular brush and it ispossible to brush only once (that is persist=”).

Ifdatabrush=list(...), it is possible to use all optional arguments ofthe MATLAB functionselectdataFS() and the following optional arguments:

persist: Persist is an empty value or a character containing 'on' or 'off'.The default value ispersist="", that is brushing is allowed only once.Ifpersist="on" orpersis="off" brushing can be done as many time asthe user requires. Ifpersist='on' then the unit(s) currently brushed areadded to those previously brushed. It is possible, every time a new brushing isdone, to use a different color for the brushed units. Ifpersist='off'every time a new brush is performed units previously brushed are removed.
label: add labels of brushed units in the monitoring plot.
dispopt: controls how to fill the diagonals in the scatterplot matrixof the brushed solutions. Setdispopt="hist" (default) to plot histograms,ordispopt="box" to plot boxplots.

nameY

Add variable labels in plot. A vector of strings of lengthpcontaining the labels of the variables in the dataset.If it is empty (default) the sequenceX1, ..., Xp will be created automatically

trace

Whether to print intermediate results. Default istrace=FALSE.

...

potential further arguments passed to lower level functions.

Author(s)

FSDA team,valentin.todorov@chello.at

References

Hubert L. and Arabie P. (1985), Comparing Partitions,Journal of Classification,Vol. 2, pp. 193-218.

Examples

 ## Not run:  data(geyser2) out <- tclustIC(geyser2, whichIC="MIXMIX", plot=FALSE, alpha=0.1) tclustICplot(out, whichIC="MIXMIX") ## End(Not run)

Extracts a set of best relevant solutions obtained by`tclustIC`

Description

The functiontclustICsol() takes as input an object of classtclustic.object, the outputof functiontclustIC (that is a series of matrices which containthe values of the information criteria BIC/ICL/CLA for different values ofkandc) and extracts the first best solutions. Two solutions are consideredequivalent if the value of the adjusted Rand index (or the adjusted Fowlkes andMallows index) is above a certain threshold. For each tentative solution the programchecks the adjacent values ofc for which the solution is stable.A matrix with adjusted Rand indexes is given for the extracted solutions.

Usage

tclustICsol(  out,  NumberOfBestSolutions = 5,  ThreshRandIndex = 0.7,  whichIC = c("ALL", "CLACLA", "MIXMIX", "MIXCLA"),  Rand = TRUE,  msg = TRUE,  plot = FALSE,  trace = FALSE,  ...)

Arguments

out

NumberOfBestSolutions

Number of best solutions to extract from BIC/ICL matrix.The default value of NumberOfBestSolutions is 5

ThreshRandIndex

Threshold to identify spurious solutions - the thresholdof the adjusted Rand index to use to consider two solutions as equivalent.The default value of ThreshRandIndex is 0.7

whichIC

Specifies the information criterion to use to extract best solutions.Possible values for whichIC are:

CLACLA = in this case best solutions are referred to the classification likelihood.
MIXMIX = in this case in this case best solutions are referred to the mixture likelihood (BIC).
MIXCLA = in this case in this case best solutions are referred to ICL.
ALL = in this case best solutions both three solutions using classificationand mixture likelihood are produced. In the output classout all thethree matricesMIXMIXbs,CLACLAbs andMIXCLAbs are given.

The default value iswhichIC="ALL".

Rand

Index to use to compare partitions. IfRand=TRUE (default) the adjusted Randindex is used, else the adjusted Fowlkes and Mallows index is used.

msg

It controls whether to display or not messages (from MATLAB) on the screen. Ifmsg=TRUE(default) messages about the progression of the search are displayed on the screenotherwise only error messages will be displayed.

plot

Ifplot=TRUE, the best solutions which have been found are shown on the screen.

trace

Whether to print intermediate results. Default istrace=FALSE.

...

potential further arguments passed to lower level functions.

Value

An S3 object of classtclusticsol.object

Author(s)

FSDA team,valentin.todorov@chello.at

References

Hubert L. and Arabie P. (1985), Comparing Partitions,Journal of Classification,Vol. 2, pp. 193-218.

Examples

 ## Not run:  data(geyser2) out <- tclustIC(geyser2, whichIC="MIXMIX", plot=FALSE, alpha=0.1) ## Plot first two best solutions using as Information criterion MIXMIX print("Best solutions using MIXMIX") outMIXMIX <- tclustICsol(out, whichIC="MIXMIX", plot=TRUE, NumberOfBestSolutions=2) print(outMIXMIX$MIXMIXbs) ## End(Not run)

Objects returned by the function`tclustfsda` with the option`monitoring=TRUE`

Description

An object of classtclusteda.object holds information aboutthe result of a call totclustfsda with the optionmonitoring=TRUE.

Value

The functionsprint() andsummary() are used to obtain and print asummary of the results. An object of classtclusteda is a list containing at least the following components:

call

the matched call

k

number of groups

alpha

trimming level

restrfactor

restriction factor

IDX

ann-by-length(alpha) vector containing assignment of each unit to each of thekgroups. Cluster names are integer numbers from 1 to k.0 indicates trimmed observations. The first column refers toalpha[1], the second columnrefers toalpha[2], ..., the last column refers toalpha[length(alpha)].

MU

a 3 dimensional array of size k-by-p-by-length(alpha) containing the monitoringof the centroid for each value of alpha.MU[,,1], refers toalpha[1] ...,MU(,,length(alpha)] refers toalpha[length(alpha)].The first row in each slice refers to group 1, second row refers to group 2, etc.

SIGMA

A list of lengthlength(alpha) containing in elementj,withj=1, 2, ..., length(alpha), the 3D array of size p-by-p-by-k containingthek (constrained) estimated covariance matrices associated withalpha[j].

Amon

Amon stands for alpha monitoring. Matrix of size(length(alpha)-1)-by-4 whichcontains for two consecutive values of alpha the monitoring of three quantities(change in classification, change in centroid location, change in covariance location).

1st col = value of alpha.
2nd col = ARI index.
3rd col = squared Euclidean distance between centroids.
4th col = squared Euclidean distance between covariance matrices.

Examples

 ## Not run:  data(hbk, package="robustbase") (out <- tclustfsda(hbk[, 1:3], k=2, monitoring=TRUE)) class(out) summary(out) ## End(Not run)

Computes trimmed clustering with scatter restrictions

Description

Partitions the points in the n-by-v data matrixY intok clusters. This partition minimizes the trimmed sum,over all clusters, of the within-cluster sums of point-to-cluster-centroiddistances. Rows of Y correspond to points, columns correspond to variables.Returns in the output object of classtclustfsda.object an n-by-1 vectoridx containing the cluster indices of each point. By default,tclustfsda() uses (squared), possibly constrained, Mahalanobis distances.

Usage

tclustfsda(  x,  k,  alpha,  restrfactor = 12,  monitoring = FALSE,  plot = FALSE,  nsamp,  refsteps = 15,  reftol = 1e-13,  equalweights = FALSE,  mixt = 0,  msg = FALSE,  nocheck = FALSE,  startv1 = 1,  RandNumbForNini,  restrtype = c("eigen", "deter"),  UnitsSameGroup,  numpool,  cleanpool,  trace = FALSE,  ...)

Arguments

x

An n x p data matrix (n observations and p variables).Rows of x represent observations, and columns represent variables.

Missing values (NA's) and infinite values (Inf's) are allowed,since observations (rows) with missing or infinite values willautomatically be excluded from the computations.

k

Number of groups.

alpha

A scalar between 0 and 0.5 or an integer specifying the number ofobservations which have to be trimmed. Ifalpha=0,tclust reduces totraditional model based or mixture clustering (mclust): see for example theMatlab functiongmdistribution.

More in detail, if0 < alpha < 1 clustering is based onh = floor(n * (1-alpha))observations, else if alpha is an integer greater than 1 clustering is based onh = n - floor(alpha).Ifmonitoring=TRUE,alpha is a vector which specifies the values oftrimming levels which have to be considered - contains decresing elements whichlie in the interval 0 and 0.5.For example ifalpha=c(0.1, 0.05, 0),tclust() considers these 3 values of trimming level.The default for alpha is vectoralpha=c(0.1, 0.05, 0). The sequence isforced to be monotonically decreasing.

restrfactor

Positive scalar which constrains the allowed differences among group scatters.Larger values imply larger differences of group scatters. On the other handa value of 1 specifies the strongest restriction forcing alleigenvalues/determinants to be equal and so the method looksfor similarly scattered (respectively spherical) clusters.The default is to applyrestrfactor to eigenvalues. In order toapplyrestrfactor to determinants it is necessary to use theoptional input argumentrestrtype.

monitoring

Ifmonitoring=TRUE TCLUST is performed for a series of valuesof the trimming factoralpha givenk (number of groups) and givenc (restriction factor).

plot

Ifplot=FALSE (default) orplot=0 no plot is produced.Ifplot=TRUE andmonitoring=FALSE a plot with the classificationis shown (using the spmplot function). The plot can be:

forp = 1, a histogram of the univariate data,
forp = 2, a bivariate scatterplot,
forp > 2, a scatterplot matrix generated by the MATLAB functionspmplot().

Whenp >= 2 the following additional features are offered(forp = 1 the behaviour is forced to be as forplots=TRUE):

plot = 'contourf' adds in the background of the bivariate scatterplots afilled contour plot. The colormap of the filled contour is based on grey levels as default.This argument may also be inserted in a field named 'type' of a list. In the latter caseit is possible to specify the additional field 'cmap', which changes the default colorsof the color map used. The field 'cmap' may be a three-column matrix of values in therange [0,1] where each row is an RGB triplet that defines one color. Check the colormapfunction for additional informations.
plot = 'contour' adds in the background of the bivariate scatterplots a contour plot.The colormap of the contour is based on grey levels as default. This argument may also beinserted in a field namedtype of a list. In the latter case it is possible to specifythe additional fieldcmap, which changes the default colors of the color map used. The fieldcmap may be a three-column matrix of values in the range [0,1] where each row is an RGBtriplet that defines one color. Check thecolormap() (MATLAB) function for additional information.
plot = 'ellipse' superimposes confidence ellipses to each group in the bivariatescatterplots. The size of the ellipse isqchisq(0.95, 2), i.e. the confidence level usedby default is 95 percent. This argument may also be inserted in a field namedtype ofa list. In the latter case it is possible to specify the additional fieldconflev,which specifies the confidence level to use and it is a value between 0 and 1.
plot = 'boxplotb' superimposes on the bivariate scatterplots the bivariate boxplotsfor each group, using the boxplotb function. This argument may also be inserted in a fieldnamedtype of a list.

The parameterplot can be also a list and in this case its elements are:

type - specifies the type of plot as when plot option is a character.Therefore, plots$type can be one of 'contourf', 'contour', 'ellipse' or 'boxplotb'.
cmap - used to set a colormap for the plot type (MATLAB style). For example, plot$cmap = 'autumn'.See the MATLAB help of colormap for a list of colormap possiblilites.
conflev - this is the confidence level for the confidence ellipses.It must me a scalar between 0 and 1.

Ifplot=TRUE andmonitoring=TRUE two plots are shown.The first plot (monitor plot) shows three panels with the monitoring between twoconsecutive values of alpha: (i) the change in classification using ARI index (top panel),(ii) the change in centroids using squared euclidean distances (central panel) and(iii) the change in covariance matrices using squared euclidean distance (bottom panel).

The second plot (gscatter plot) shows a series of subplots which monitor the classificationfor each value ofalpha. In order to make sure that consistent labels are used for thegroups, between two consecutive values ofalpha, we assign labelr to a group ifthis group shows the smallest distance with groupr for the previous value ofalpha.The type of plot which is used to monitor the stability of the classification depends on thedata dimensionalityp.

forp = 1, a histogram of the univariate data (the MATLAB functionhistFS() is called),
forp = 2, a bivariate scatterplot (the MATLAB functiongscatter() is called),
forp > 2, a scatterplot of the first two principal components (functiongscatter()is called and we show on the axes titles the percentage of variance explained by the firsttwo principal components).

Also in the case ofmonitoring=TRUE the parameterplot can be a list and its elements are:

name: character vector which enables to specify which plot to display.name = "gscatter" produces a figure with a series of subplots which show theclassification for each value ofalpha.name = "monitor" shows a figurewith three panels which monitor between two consecutive values of alpha the change inclassification using ARI index (top panel), the change in centroids using squaredeuclidean distances (central panel), the change in covariance matrices using squaredeuclidean distance (bottom panel). If this field is not specified, by defaultname=c("gscatter", "monitor") and both figures will be shown.
alphasel: a numeric vector which specifies for which values of alpha itis possible to see the classification. For example ifalphasel = c(0.05, 0.02),the classification will be shown just foralpha=0.05 andalpha=0.02.If this field is not specifiedalphasel=alpha and therefore the classificationis shown for each value of alpha.

nsamp

Ifnsamp hask * (p + 1) columns, the initial centroids and covariancematrices in iterationi are computed as follows:

X1 <- X[nsamp[i ,] ,]
mean(X1[1:p + 1, ]) contains the initial centroid for group 1
cov(X1[1:p + 1, ]) contains the initial cov matrix for group 1
mean(X1[(p + 2):(2*p + 2), ]) contains the initial centroid for group 2
cov(X1[(p + 2):(2*p + 2), ]) contains the initial cov matrix for group 2
...
mean(X1[(k-1)*p+1):(k*(p+1), ]) contains the initial centroids for group k
cov(X1[(k-1)*p+1):(k*(p+1), ]) contains the initial cov matrix for group k.

REMARK: Ifnsamp is not a scalar, the optionstartv1 given below is ignored.More precisely, ifnsamp hask columnsstartv1 = 0 else ifnsamp hask*(p+1) columns optionstartv1=1.

refsteps

Number of refining iterations in each subsample. Default isrefsteps=15.refsteps = 0 means "raw-subsampling" without iterations.

reftol

Tolerance of the refining steps. The default value is 1e-14

equalweights

mixt

Specifies whether mixture modelling or crisp assignment approach to modelbased clustering must be used. In the case of mixture modelling parameter mixt alsocontrols which is the criterion to find the untrimmed units in each step of the maximization.Ifmixt >=1 mixture modelling is assumed else crisp assignment.The default value ismixt=0, i.e. crisp assignment. Please seefor details the provided references.The parametermixt also controls the criterion to select the units to trim.Ifmixt = 2 theh units are those which give the largest contributionto the likelihood, else ifmixt=1 the criterion to select theh units isexactly the same as the one which is used in crisp assignment.

msg

nocheck

Check input arguments. Ifnocheck=TRUE no check is performedon matrixX. The defaultnocheck=FALSE.

startv1

Remark 1: in order to start with a routine which is in the required parameter space,eigenvalue restrictions are immediately applied.

Remark 2 - optionstartv1 is used just ifnsamp is a scalar(see for more details the help associated withnsamp).

RandNumbForNini

pre-extracted random numbers to initialize proportions.Matrix of size k-by-nrow(nsamp) containing the random numbers whichare used to initialize the proportions of the groups. This option is effective just ifnsamp is a matrix which contains pre-extracted subsamples. The purpose of thisoption is to enable to user to replicate the results in case routinetclustreg*()is called using a parfor instruction (as it happens for example in routine IC, wheretclustreg() is called through a parfor for different values of the restriction factor).The default is thatRandNumbForNini is empty - then uniform random numbers are used.

restrtype

UnitsSameGroup

numpool

The number of parallel sessions to open. If numpool is not defined,then it is set equal to the number of physical cores in the computer.

cleanpool

Logical, indicating if the open pool must be closed or not.It is useful to leave it open if there are subsequent parallel sessions to execute,so that to save the time required to open a new pool.

trace

Whether to print intermediate results. Default istrace=FALSE.

...

potential further arguments passed to lower level functions.

Details

This iterative algorithm initializesk clusters randomly and performsconcentration steps in order to improve the current cluster assignment. The number ofmaximum concentration steps to be performed is given by input parameterrefsteps.For approximately obtaining the global optimum, the system is initializednsamptimes and concentration steps are performed until convergence orrefsteps isreached. When processing more complex data sets higher values ofnsamp andrefsteps have to be specified (obviously implying extra computation time).However, if more then 10 per cent of the iterations do not converge, a warning messageis issued, indicating thatnsamp has to be increased.

Value

Depending on the input parametermonitoring, one ofthe following objects will be returned:

Author(s)

FSDA team,valentin.todorov@chello.at

References

Garcia-Escudero, L.A., Gordaliza, A., Matran, C. and Mayo-Iscar, A. (2008).A General Trimming Approach to Robust Cluster Analysis. Annals of Statistics,Vol. 36, 1324-1345.doi:10.1214/07-AOS515.

Examples

 ## Not run:  data(hbk, package="robustbase") (out <- tclustfsda(hbk[, 1:3], k=2)) class(out) summary(out) ##  TCLUST of Gayser data with three groups (k=3), 10%% trimming (alpha=0.1) ##      and restriction factor (c=10000) data(geyser2) (out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000)) ## Use the plot options to produce more complex plots ---------- ##  Plot with all default options out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=TRUE) ##  Default confidence ellipses. out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot="ellipse") ##  Confidence ellipses specified by the user: confidence ellipses set to 0.5 plots <- list(type="ellipse", conflev=0.5) out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=plots) ##  Confidence ellipses set to 0.9 plots <- list(type="ellipse", conflev=0.9) out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot=plots) ##  Contour plots out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, plot="contour") ##  Filled contour plots with additional options: contourf plot with a named colormap. ##  Here we define four MATLAB-like colormaps, but the user can define anything else, ##  presented by a matrix with three columns which are the RGB triplets. summer <- as.matrix(data.frame(x1=seq(from=0, to=1, length=65),                                x2=seq(from=0.5, to=1, length=65),                                x3=rep(0.4, 65))) spring <- as.matrix(data.frame(x1=rep(1, 65),                                x2=seq(from=0, to=1, length=65),                                x3=seq(from=1, to=0, length=65))) winter <- as.matrix(data.frame(x1=rep(0, 65),                                x2=seq(from=0, to=1, length=65),                                x3=seq(from=1, to=0, length=65))) autumn <- as.matrix(data.frame(x1=rep(1, 65),                                x2=seq(from=0, to=1, length=65),                                x3=rep(0, 65))) out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000,       plot=list(type="contourf", cmap=autumn)) out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000,       plot=list(type="contourf", cmap=winter)) out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000,       plot=list(type="contourf", cmap=spring)) out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000,       plot=list(type="contourf", cmap=summer)) ##  We compare the output using three different values of restriction factor ##      nsamp is the number of subsamples which will be extracted data(geyser2) out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10000, nsamp=500, plot="ellipse") out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=10, nsamp=500, refsteps=10, plot="ellipse") out <- tclustfsda(geyser2, k=3, alpha=0.1, restrfactor=1, nsamp=500, refsteps=10, plot="ellipse") ##  TCLUST applied to M5 data: A bivariate data set obtained from three normal ##  bivariate distributions with different scales and proportions 1:2:2. One of the ##  components is very overlapped with another one. A 10 per cent background noise is ##  added uniformly distributed in a rectangle containing the three normal components ##  and not very overlapped with the three mixture components. A precise description ##  of the M5 data set can be found in Garcia-Escudero et al. (2008). ## data(M5data) plot(M5data[, 1:2]) ##  Scatter plot matrix library(rrcov) plot(CovClassic(M5data[,1:2]), which="pairs") out <- tclustfsda(M5data[,1:2], k=3, alpha=0, restrfactor=1000, nsamp=100, plot=TRUE) out <- tclustfsda(M5data[,1:2], k=3, alpha=0, restrfactor=10, nsamp=100, plot=TRUE) out <- tclustfsda(M5data[,1:2], k=3, alpha=0.1, restrfactor=1, nsamp=1000,         plot=TRUE, equalweights=TRUE) out <- tclustfsda(M5data[,1:2], k=3, alpha=0.1, restrfactor=1000, nsamp=100, plot=TRUE) ##  TCLUST with simulated data: 5 groups and 5 variables ## n1 <- 100 n2 <- 80 n3 <- 50 n4 <- 80 n5 <- 70 p <- 5 Y1 <- matrix(rnorm(n1 * p) + 5, ncol=p) Y2 <- matrix(rnorm(n2 * p) + 3, ncol=p) Y3 <- matrix(rnorm(n3 * p) - 2, ncol=p) Y4 <- matrix(rnorm(n4 * p) + 2, ncol=p) Y5 <- matrix(rnorm(n5 * p), ncol=p) group <- c(rep(1, n1), rep(2, n2), rep(3, n3), rep(4, n4), rep(5, n5)) Y <- Y1 Y <- rbind(Y, Y2) Y <- rbind(Y, Y3) Y <- rbind(Y, Y4) Y <- rbind(Y, Y5) dim(Y) table(group) out <- tclustfsda(Y, k=5, alpha=0.05, restrfactor=1.3, refsteps=20, plot=TRUE) ##  Automatic choice of best number of groups for Geyser data ------------------------ ## data(geyser2) maxk <- 6 CLACLA <- matrix(0, nrow=maxk, ncol=2) CLACLA[,1] <- 1:maxk MIXCLA <- MIXMIX <- CLACLA for(j in 1:maxk) {     out <- tclustfsda(geyser2, k=j, alpha=0.1, restrfactor=5)     CLACLA[j, 2] <- out$CLACLA } for(j in 1:maxk) {     out <- tclustfsda(geyser2, k=j, alpha=0.1, restrfactor=5, mixt=2)     MIXMIX[j ,2] <- out$MIXMIX     MIXCLA[j, 2] <- out$MIXCLA } oldpar <- par(mfrow=c(1,3)) plot(CLACLA[,1:2], type="l", xlim=c(1, maxk), xlab="Number of groups", ylab="CLACLA") plot(MIXMIX[,1:2], type="l", xlim=c(1, maxk), xlab="Number of groups", ylab="MIXMIX") plot(MIXCLA[,1:2], type="l", xlim=c(1, maxk), xlab="Number of groups", ylab="MIXCLA") par(oldpar) ##  Monitoring examples ------------------------------------------ ##  Monitoring using Geyser data ##  Monitoring using Geyser data (all default options) ##  alpha and restriction factor are not specified therefore alpha=c(0.10, 0.05, 0) ##  is used while the restriction factor is set to c=12 out <- tclustfsda(geyser2, k=3, monitoring=TRUE) ##  Monitoring using Geyser data with alpha and c specified. out <- tclustfsda(geyser2, k=3, restrfac=100, alpha=seq(0.10, 0, by=-0.01), monitoring=TRUE) ##  Monitoring using Geyser data with plot argument specified as a list. ##      The trimming levels to consider in this case are 31 values of alpha ## out <- tclustfsda(geyser2, k=3, restrfac=100, alpha=seq(0.30, 0, by=-0.01), monitoring=TRUE,         plot=list(alphasel=c(0.2, 0.10, 0.05, 0.01)), trace=TRUE) ##  Monitoring using Geyser data with argument UnitsSameGroup ## ##      Make sure that group containing unit 10 is in a group which is labelled "group 1" ##      and group containing unit 12 is in group which is labelled "group 2" ## ##      Mixture model is used (mixt=2) ## out <- tclustfsda(geyser2, k=3, restrfac=100, alpha=seq(0.30, 0, by=-0.01), monitoring=TRUE,         mixt=2, UnitsSameGroup=c(10, 12), trace=TRUE) ##  Monitoring using M5 data data(M5data) ##  alphavec=vector which contains the trimming levels to consider ##  in this case 31 values of alpha are considered alphavec <- seq(0.10, 0, by=-0.02) out <- tclustfsda(M5data[, 1:2], 3, alpha=alphavec, restrfac=1000, monitoring=TRUE,     nsamp=1000, plots=TRUE) ## End(Not run)

Objects returned by the function`tclustfsda`

Description

An object of classtclustfsda.object holds information aboutthe result of a call totclustfsda.

Value

The functionsprint() andsummary() are used to obtain and print asummary of the results. An object of classtclustfsda is a list containing at least the following components:

call

the matched call

muopt

a k-by-p matrix containing cluster centroid locations. Robust estimate of final centroids of the groups

sigmaopt

a p-by-p-by-k array rray containing estimated constrained covariance for the k groups

idx

a vector of length n containing assignment of each unit to each of the k groups. Cluster names are integer numbers from 1 to k. 0 indicates trimmed observations.

size

a matrix of size (k+1)-by-3. The 1st col is sequence from 0 to k (cluster name); the 2nd col is the number of observations in each cluster; the 3rd col is the percentage of observations in each cluster.

Remark: 0 denotes unassigned units.

postprob

n-by-k matrix containing posterior probabilities.postprob[i, j] contains posterior probabilitiy of uniti from component (cluster)j. For the trimmed units posterior probabilities are 0.

emp

"Empirical" statistics computed on final classification. When convergence is reached,emp=0. When convergence is not obtained, this field is a list which contains the statistics of interest:idxemp (ordered from 0 to k*, k* being the number of groups with at least one observation and 0 representing the possible group of outliers),muemp,sigmaemp andsizemp, which are the empirical counterparts ofidx,muopt,sigmaopt andsize.

MIXMIX

BIC which uses parameters estimated using the mixture loglikelihood and the maximized mixture likelihood as goodness of fit measure.

Remark: this output is present just ifmixt > 0.

MIXCLA

BIC which uses parameters estimated using the mixture loglikelihood and the maximized mixture likelihood as goodness of fit measure.

Remark: this output is present just ifmixt > 0.

CLACLA

BIC which uses the classification likelihood based on parameters estimated using the classification likelihood.

Remark: this output is present just ifmixt > 0.

notconver

number of subsets without convergence

bs

a vector of lengthk containing the units forming initial subset associated with muopt.

obj

value of the objective function which is minimized (value of the best returned solution).

equalweights

ifequalweights=TRUE means that in the clustering procedure we (ideally) assumed equal cluster weights else (equalweitghts=FALSE means that we allowed for different cluster sizes.

h

number of observations that have determined the centroids (number of untrimmed units).

fullsol

a vector of sizensamp which contains the value of the objective function at the end of the iterative process for each extracted subsample.

X

the original data matrix X.

Examples

 ## Not run:  data(hbk, package="robustbase") (out <- tclustfsda(hbk[, 1:3], k=2)) class(out) summary(out) ## End(Not run)

Objects returned by the function`tclustIC`

Description

An object of classtclustic.object holds information aboutthe result of a call totclustIC.

Value

The functionsprint() andsummary() are used to obtain and print asummary of the results. An object of classtclustic is a list containing at least the following components:

call

the matched call

kk

a vector containing the values ofk (number of components) which have been considered.This vector is identical to the optional argumentkk (default iskk=1:5.

cc

a vector containing the values ofc (values of the restriction factor) whichhave been considered. This vector is identical to the optional argumentcc (defalt iscc=c(1, 2, 4, 8, 16, 32, 64, 128).

alpha

trimming level

whichIC

Information criteria used

CLACLA

a matrix of sizelength(kk)-times-length(cc) containinig the value ofthe penalized classification likelihood. This output is present only ifwhichIC="CLACLA" orwhichIC="ALL".

IDXCLA

a matrix of lists of sizelength(kk)-times-length(cc) containinig the assignment of each unitusing the classification model. This output is present only ifwhichIC="CLACLA" orwhichIC="ALL".

MIXMIX

a matrix of sizelength(kk)-times-length(cc) containinig the value ofthe penalized mixtrue likelihood. This output is present only ifwhichIC="MIXMIX" orwhichIC="ALL".

IDXMIX

a matrix of lists of sizelength(kk)-times-length(cc) containinig the assignment of each unitusing the classification model. This output is present only ifwhichIC="MIXMIX" orwhichIC="ALL".

MIXCLA

a matrix of sizelength(kk)-times-length(cc) containinig the value ofthe ICL criterion. This output is present only ifwhichIC="MIXCLA" orwhichIC="ALL".

Examples

 ## Not run:  data(hbk, package="robustbase") (out <- tclustIC(hbk[, 1:3])) class(out) summary(out) ## End(Not run)

Objects returned by the function`tclustICsol`

Description

An object of classtclusticsol.object holds information aboutthe result of a call totclustICsol.

Value

The functionsprint() andsummary() are used to obtain and print asummary of the results. An object of classtclusticsol is a list containing at least the following components:

call

the matched call

kk

a vector containing the values ofk (number of components) which have been considered.This vector is identical to the optional argumentkk (default iskk=1:5.

cc

a vector containing the values ofc (values of the restriction factor) whichhave been considered. This vector is identical to the optional argumentcc (defalt iscc=c(1, 2, 4, 8, 16, 32, 64, 128).

alpha

trimming level

whichIC

Information criteria used

MIXMIXbs

a matrix of lists of sizeNumberOfBestSolutions-times-5 whichcontains the details of the best solutions for MIXMIX (BIC). Each row refers to asolution. The information which is stored in the columns is as follows.

1st col = value of k for which solution takes place
2nd col = value of c for which solution takes place;
3rd col = a vector of lengthd which contains the values ofcfor which the solution is uniformly better.
4th col = a vector of lengthd + r which contains the values ofcfor which the solution is considered stable (i.e. for which the valueof the adjusted Rand index (or the adjusted Fowlkes and Mallows index)does not go below the threshold defined in input optionThreshRandIndex).
5th col = string which contains 'true' or 'spurious'. The solution is labelledspurious if the value of the adjusted Rand index with the previous solutionsis greater than ThreshRandIndex.

Remark: the field MIXMIXbs is present only ifwhichIC=ALL orwhichIC="MIXMIX".

MIXMIXbsari

a matrix of adjusted Rand indexes (or Fowlkes and Mallows indexes)associated with the best solutions for MIXMIX. A matrix of sizeNumberOfBestSolutions-times-NumberOfBestSolutionswhosei,j-th entry contains the adjusted Rand index between classification produced by solutioni and solutionj,i,j=1,2, ...,NumberOfBestSolutions.

Remark: the fieldMIXMIXbsari is present only ifwhichIC=ALL orwhichIC="MIXMIX".

ARIMIX

a matrix of adjusted Rand indexes between two consecutive value ofc.Matrix of sizek-by-length(cc)-1. The first column contains the ARI indexesbetweencc[2] andcc[1] givenk.The second column contains the the ARI indexes betweencc[3] andcc[2] givenk.

Remark: the fieldARIMIX is present only ifwhichIC=ALL orwhichIC="MIXMIX" orwhichIC="MIXCLA".

MIXCLAbs

has the same structure asMIXMIXbs but referres to MIXCLA.

Remark: the field MIXCLAbs is present only ifwhichIC=ALL orwhichIC="MIXCLA".

MIXCLAbsari

has the same structure asMIXMIXbsari but referres to MIXCLA.

Remark: the fieldMIXMIXbsari is present only ifwhichIC=ALL orwhichIC="MIXCLA".

CLACLAbs

has the same structure asMIXMIXbs but referres to CLACLA.

Remark: the field CLACLAbs is present only ifwhichIC=ALL orwhichIC="CLACLA".

CLACLAbsari

has the same structure asMIXMIXbsari but referres to CLACLA.

Remark: the fieldCLACLAbsari is present only ifwhichIC=ALL orwhichIC="CLACLA".

ARICLA

Remark: the fieldARICLA is present only ifwhichIC=ALL orwhichIC="CLACLA".

Examples

 ## Not run:  data(hbk, package="robustbase") (out <- tclustIC(hbk[, 1:3]))  ## Plot first two best solutions using as Information criterion MIXMIX  print("Best solutions using MIXMIX")  outMIXMIX <- tclustICsol(out, whichIC="MIXMIX", plot=TRUE, NumberOfBestSolutions=2)  class(outMIXMIX)  summary(outMIXMIX)  print(outMIXMIX$MIXMIXbs) ## End(Not run)

Computes robust linear grouping analysis

Description

Performs robust linear grouping analysis.

Usage

tclustreg(  y,  x,  k,  alphaLik,  alphaX,  restrfactor = 12,  intercept = TRUE,  plot = FALSE,  nsamp,  refsteps = 10,  reftol = 1e-13,  equalweights = FALSE,  mixt = 0,  wtrim = 0,  we,  msg = TRUE,  RandNumbForNini,  trace = FALSE,  ...)

Arguments

y

Response variable. A vector withn elements thatcontains the response variable.

x

An n x p data matrix (n observations and p variables).Rows of x represent observations, and columns represent variables.

Missing values (NA's) and infinite values (Inf's) are allowed,since observations (rows) with missing or infinite values willautomatically be excluded from the computations.

k

Number of groups.

alphaLik

Trimming level, a scalar between 0 and 0.5 or aninteger specifying the number of observations which have to be trimmed.IfalphaLik=0, there is no trimming. More in detail, if0 < alphaLik < 1clustering is based onh = floor(n * (1 - alphaLik)) observations.IfalphaLik is an integer greater than 1 clustering isbased onh = n - floor(alphaLik). More in detail, likelihoodcontributions are sorted and the units associated with the smallestn - hcontributions are trimmed.

alphaX

Second-level trimming or constrained weighted model forx.

restrfactor

Restriction factor for regression residuals andcovariance matrices of the explanatory variables. Scalar or vectorwith two elements. Ifrestrfactor is a scalar it controls the differencesamong group scatters of the residuals. The value 1 is the strongestrestriction. Ifrestrfactor is a vector with two elementsthe first element controls the differences among group scatters ofthe residuals and the second the differences among covariancematrices of the explanatory variables. Note thatrestrfactor[2]is used just ifalphaX=1, that is if constrained weighted modelforx is assumed.

intercept

wheather to use constant term (default isintercept=TRUE

plot

Ifplot=FALSE (default) orplot=0 no plot is produced.Ifplot=TRUE a plot with the final allocation is shown (using the spmplot function).IfX is 2-dimensional, the lines associated to the groups are shown too.

nsamp

If a scalar, it contains the number of subsamples which will be extracted.Ifnsamp = 0 all subsets will be extracted. Remark - if the number of all possiblesubset is greater than 300 the default is to extract all subsets, otherwise just 300.Ifnsamp is a matrix it contains in the rows the indexes of the subsets whichhave to be extracted.nsamp in this case can be conveniently generated byfunctionsubsets().nsamp must havek * p columns. The firstpcolumns are used to estimate the regression coefficient of group 1, ..., the lastpcolumns are used to estimate the regression coefficient of groupk.

refsteps

Number of refining iterations in each subsample. Default isrefsteps=10.refsteps = 0 means "raw-subsampling" without iterations.

reftol

Tolerance of the refining steps. The default value is 1e-14

equalweights

mixt

Specifies whether mixture modelling or crisp assignment approach to modelbased clustering must be used. In the case of mixture modelling parameter mixt alsocontrols which is the criterion to find the untrimmed units in each step of the maximization.Ifmixt>=1 mixture modelling is assumed else crisp assignment.The default value ismixt=0, i.e. crisp assignment. Please seefor details the provided references.The parametermixt also controls the criterion to select the units to trim.Ifmixt = 2 theh units are those which give the largest contributionto the likelihood, else ifmixt=1 the criterion to select theh units isexactly the same as the one which is used in crisp assignment.

wtrim

How to apply the weights on the observations - a flag taking values in c(0, 1, 2, 3, 4).

Ifwtrim==0 (no weights), the algorithm reduces to the standardtclustreg algorithm.
Ifwtrim==1, trimming is done by weighting the observations using values specified in vectorwe. In this case, vectorwe must be supplied by the user.
Ifwtrim==2, trimming is again done by weighting the observationsusing values specified in vectorwe. In this case, vectorweis computed from the data as a function of the density estimate pdfe.Specifically, the weight of each observation is the probability of retainingthe observation, computed as
pretain_{ig} = 1-pdfe_{ig}/max_{ig}(pdfe_{ig})
Ifwtrim==3, trimming is again done by weighting the observations usingvalues specified in vectorwe. In this case, each element wei of vectorwe is a Bernoulli random variable with probability of successpdfe_{ig}.In the clustering framework this is done under the constraint that no group is empty.
Ifwtrim==4, trimming is done with the tandem approach of Cerioli and Perrotta (2014).

we

Weights. A vector of size n-by-1 containing application-specific weightsDefault is a vector of ones.

msg

RandNumbForNini

pre-extracted random numbers to initialize proportions.Matrix of size k-by-nrow(nsamp) containing the random numbers whichare used to initialize the proportions of the groups. This option is effective only ifnsamp is a matrix which contains pre-extracted subsamples. The purpose of thisoption is to enable the user to replicate the results when the functiontclustreg()is called using a parfor instruction (as it happens for example in routine IC, wheretclustreg() is called through a parfor for different values of the restriction factor).The default is thatRandNumbForNini is empty - then uniform random numbers are used.

trace

Whether to print intermediate results. Default istrace=FALSE.

...

potential further arguments passed to lower level functions.

Value

An S3 object of classtclustreg.object

Author(s)

FSDA team,valentin.todorov@chello.at

References

Mayo-Iscar A. (2016). The joint role of trimming and constraints in robustestimation for mixtures of gaussian factor analyzers,Computational Statistics and Data Analysis", Vol. 99, pp. 131-147.

Garcia-Escudero, L.A., Gordaliza, A., Greselin, F., Ingrassia, S. and Mayo-Iscar, A. (2017),Robust estimation of mixtures of regressions with random covariates, via trimming and constraints,Statistics and Computing, Vol. 27, pp. 377-402.

Garcia-Escudero, L.A., Gordaliza A., Mayo-Iscar A., and San Martin R. (2010).Robust clusterwise linear regression through trimming,Computational Statistics and Data Analysis, Vol. 54, pp.3057-3069.

Cerioli, A. and Perrotta, D. (2014). Robust Clustering Around Regression Lines with High Density Regions.Advances in Data Analysis and Classification, Vol. 8, pp. 5-26.

Torti F., Perrotta D., Riani, M. and Cerioli A. (2019). Assessing Robust Methodologies for Clustering Linear Regression Data,Advances in Data Analysis and Classification, Vol. 13, pp 227-257.

Examples

 ## Not run:  ## The X data have been introduced by Gordaliza, Garcia-Escudero & Mayo-Iscar (2013). ## The dataset presents two parallel components without contamination. data(X) y1 = X[, ncol(X)] X1 = X[,-ncol(X), drop=FALSE] (out <- tclustreg(y1, X1, k=2, alphaLik=0.05, alphaX=0.01, restrfactor=5, plot=TRUE, trace=TRUE)) (out <- tclustreg(y1, X1, k=2, alphaLik=0.05, alphaX=0.01, restrfactor=2,         mixt=2, plot=TRUE, trace=TRUE)) ##  Examples with fishery data data(fishery) X <- fishery ## some jittering is necessary because duplicated units are not treated: ## this needs to be addressed X <- X + 10^(-8) * abs(matrix(rnorm(nrow(X)*ncol(X)), ncol=2)) y1 <- X[, ncol(X)] X1 <- X[, -ncol(X), drop=FALSE] (out <- tclustreg(y1, X1, k=3, restrfact=50, alphaLik=0.04, alphaX=0.01, trace=TRUE)) ## Example 2: ## Define some arbitrary weightssome arbitrary weights for the units     we <- sqrt(X1)/sum(sqrt(X1)) ##  tclustreg required parameters     k <- 2; restrfact <- 50; alpha1 <- 0.04; alpha2 <- 0.01 ##  Now tclust is run on each combination of mixt and wtrim options     cat("\nmixt=0; wtrim=0",         "\nStandard tclustreg, with classification likelihood and without thinning\n")     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,             mixt=0, wtrim=0, trace=TRUE))     cat("\nmixt=2; wtrim=0",         "\nMixture likelihood, no thinning\n")     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,             mixt=2, wtrim=0, trace=TRUE))     cat("\nmixt=0; wtrim=1",         "\nClassification likelihood, thinning based on user weights\n")     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,             mixt=0, we=we, wtrim=1, trace=TRUE))     cat("\nmixt=2; wtrim=1",         "\nMixture likelihood, thinning based on user weights\n")     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,             mixt=2, we=we, wtrim=1, trace=TRUE))     cat("\nmixt=0; wtrim=2",         "\nClassification likelihood, thinning based on retention probabilities\n")     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,             mixt=0, wtrim=2, trace=TRUE))     cat("\nmixt=2; wtrim=2",         "\nMixture likelihood, thinning based on retention probabilities\n")     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,             mixt=2, wtrim=2, trace=TRUE))     cat("\nmixt=0; wtrim=3",         "\nClassification likelihood, thinning based on bernoulli weights\n")     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,             mixt=0, wtrim=3, trace=TRUE))     cat("\nmixt=2; wtrim=3",         "\nMixture likelihood, thinning based on bernoulli weights\n")     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,             mixt=2, wtrim=3, trace=TRUE))     cat("\nmixt=0; wtrim=4",         "\nClassification likelihood, tandem thinning based on bernoulli weights\n")     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,             mixt=0, wtrim=4, trace=TRUE))     cat("\nmixt=2; wtrim=4",         "\nMixture likelihood, tandem thinning based on bernoulli weights\n")     (out <- tclustreg(y1, X1, k=k, restrfact=restrfact, alphaLik=alpha1, alphaX=alpha2,             mixt=2, wtrim=4, trace=TRUE)) ## End(Not run)

Objects returned by the function`tclustreg`

Description

An object of classtclustreg.object holds information aboutthe result of a call totclustreg.

Value

The functionsprint() andsummary() are used to obtain and print asummary of the results. An object of classtclustreg is a list containing at least the following components:

call

the matched call

Examples

 ## Not run:  ## The X data have been introduced by Gordaliza, Garcia-Escudero & Mayo-Iscar (2013). ## The dataset presents two parallel components without contamination. data(X) y1 = X[, ncol(X)] X1 = X[,-ncol(X), drop=FALSE] out <- tclustreg(y1, X1, k=2, alphaLik=0.05, alphaX=0.01, restrfactor=5, trace=TRUE) class(out) str(out) ## End(Not run)

Computes`tclustreg` for different number of groups`k`and restriction factors`c`.

Description

(the last two letters stand for 'Information Criterion') computesthe values of BIC (MIXMIX), ICL (MIXCLA) or CLA (CLACLA), for different valuesofk (number of groups) and different values ofc(restriction factor for the variances of the residuals), fora prespecified level of trimming. In order to minimize randomness, givenk,the same subsets are used for each value ofc.

Usage

tclustregIC(  y,  x,  alphaLik,  alphaX,  intercept = TRUE,  plot = FALSE,  nsamp,  refsteps = 10,  reftol = 1e-13,  equalweights = FALSE,  wtrim = 0,  we,  msg = TRUE,  RandNumbForNini,  trace = FALSE,  ...)

Arguments

y

Response variable. A vector withn elements thatcontains the response variable.

x

An n x p data matrix (n observations and p variables).Rows of x represent observations, and columns represent variables.

Missing values (NA's) and infinite values (Inf's) are allowed,since observations (rows) with missing or infinite values willautomatically be excluded from the computations.

alphaLik

alphaX

Second-level trimming or constrained weighted model forx.

intercept

wheather to use constant term (default isintercept=TRUE

plot

nsamp

If a scalar, it contains the number of subsamples which will be extracted.Ifnsamp = 0 all subsets will be extracted. Remark - if the number of all possiblesubset is greater than 300 the default is to extract all subsets, otherwise just 300.Ifnsamp is a matrix it contains in the rows the indexes of the subsets whichhave to be extracted.nsamp in this case can be conveniently generated byfunctionsubsets().nsamp must havek * p columns. The firstpcolumns are used to estimate the regression coefficient of group 1, ..., the lastpcolumns are used to estimate the regression coefficient of groupk.

refsteps

Number of refining iterations in each subsample. Default isrefsteps=10.refsteps = 0 means "raw-subsampling" without iterations.

reftol

Tolerance of the refining steps. The default value is 1e-14

equalweights

wtrim

How to apply the weights on the observations - a flag taking values in c(0, 1, 2, 3, 4).

Ifwtrim==0 (no weights), the algorithm reduces to the standardtclustreg algorithm.
Ifwtrim==1, trimming is done by weighting the observations using values specified in vectorwe. In this case, vectorwe must be supplied by the user.
Ifwtrim==2, trimming is again done by weighting the observationsusing values specified in vectorwe. In this case, vectorweis computed from the data as a function of the density estimate pdfe.Specifically, the weight of each observation is the probability of retainingthe observation, computed as
pretain_{ig} = 1-pdfe_{ig}/max_{ig}(pdfe_{ig})
Ifwtrim==3, trimming is again done by weighting the observations usingvalues specified in vectorwe. In this case, each element wei of vectorwe is a Bernoulli random variable with probability of successpdfe_{ig}.In the clustering framework this is done under the constraint that no group is empty.
Ifwtrim==4, trimming is done with the tandem approach of Cerioli and Perrotta (2014).

we

Weights. A vector of size n-by-1 containing application-specific weightsDefault is a vector of ones.

msg

RandNumbForNini

pre-extracted random numbers to initialize proportions.Matrix of size k-by-nrow(nsamp) containing the random numbers whichare used to initialize the proportions of the groups. This option is effective only ifnsamp is a matrix which contains pre-extracted subsamples. The purpose of thisoption is to enable the user to replicate the results when the functiontclustreg()is called using a parfor instruction (as it happens for example in routine IC, wheretclustreg() is called through a parfor for different values of the restriction factor).The default is thatRandNumbForNini is empty - then uniform random numbers are used.

trace

Whether to print intermediate results. Default istrace=FALSE.

...

potential further arguments passed to lower level functions.

Value

An S3 object of classtclustreg.object

Author(s)

FSDA team,valentin.todorov@chello.at

References

Torti F., Perrotta D., Riani, M. and Cerioli A. (2019). Assessing Robust Methodologies for Clustering Linear Regression Data,Advances in Data Analysis and Classification, Vol. 13, pp 227-257.

Wool data.

Description

The wool data give the number of cycles to failure of a worsted yarn undercycles of repeated loading. The variables are: length of test specimen;amplitude of loading cycle; load

Usage

data(wool)

Format

A data frame with 27 rows and 4 variables

z1

Description

Simulated data to test tclustIC() and tclustICsol(), carbike() functions

Usage

data(z1)

Format

A data frame with 150 rows and 2 variables.The variables are as follows:

References

Maitra, R. and Melnykov, V. (2010), Simulating data to study performanceof finite mixture modeling and clustering algorithms,The Journal ofComputational and Graphical Statistics, Vol. 19, pp. 354-376.

Examples

 data(z1) head(z1) ## Not run:  (out <- tclustIC(z1, plots=FALSE, whichIC="CLACLA")) (outCLACLA <- tclustICsol(out, whichIC="CLACLA", plot=FALSE)) carbikeplot(outCLACLA) ## End(Not run)

Movatterモバイル変換

Creates anFSR_control object

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Creates anFSReda_control object

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Income1

Description

Usage

Format

Source

Examples

Income2

Description

Usage

Format

Examples

Creates anLSX_control object

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Mixture M5 Data.

Description

Usage

Format

Source

Creates anMMreg_control object

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Creates anMMregeda_control object

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Creates anSreg_control object

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Creates anSregeda_control object

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Creates an`FSR_control` object

Creates an`FSReda_control` object

Creates an`LSX_control` object

Creates an`MMreg_control` object

Creates an`MMregeda_control` object

Creates an`Sreg_control` object

Creates an`Sregeda_control` object

Produces the carbike plot to find best relevant clustering solutions obtained by`tclustICsol`

Description of`fsdalms` Objects

Description of`fsdalts` Objects

Description of`fsmeda.object` Objects