
Version:

6.2.6

Title:

Visualization and Imputation of Missing Values

Depends:

R (≥ 4.1.0), colorspace, grid

Imports:

car, grDevices, robustbase, stats, sp, vcd, nnet, e1071, methods, Rcpp, utils, graphics, laeken, ranger, MASS, xgboost, data.table (≥ 1.9.4)

Suggests:

dplyr, tinytest, knitr, mgcv, rmarkdown, reactable, covr, withr, pdist, enetLTS, robmixglm, stringr

Description:

New tools for the visualization of missing and/or imputed values are introduced, which can be used for exploring the data and the structure of the missing and/or imputed values. Depending on the structure of the missing values, the corresponding methods may help to identify the mechanism generating the missing values and allow the data, including missing values, to be explored. In addition, the quality of imputation can be visually explored using various univariate, bivariate, multiple and multivariate plot methods. A graphical user interface available in the separate package VIMGUI allows easy handling of the implemented plot methods.

LazyData:

TRUE

ByteCompile:

TRUE

License:

GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]

URL:

https://github.com/statistikat/VIM

Repository:

CRAN

LinkingTo:

Rcpp

RoxygenNote:

7.3.2

Encoding:

UTF-8

VignetteBuilder:

knitr

NeedsCompilation:

yes

Packaged:

2025-09-17 14:04:34 UTC; matthias

Author:

Matthias Templ [aut, cre], Alexander Kowarik [aut], Andreas Alfons [aut], Gregor de Cillia [aut], Bernd Prantner [ctb], Wolfgang Rannetbauer [aut]

Maintainer:

Matthias Templ <matthias.templ@gmail.com>

Date/Publication:

2025-09-18 05:10:48 UTC

The VIM Package

Description

VIM provides tools for visualization, imputation, and exploration of missing and multivariate data.

Details

Visualization and Imputation of Missing Values

This package introduces new tools for the visualization of missing or imputed values in R, which can be used for exploring the data and the structure of the missing or imputed values. Depending on this structure, they may help to identify the mechanism generating the missing values or errors that may have occurred in the imputation process. This knowledge is necessary for selecting an appropriate imputation method in order to reliably estimate the missing values. Thus the visualization tools should be applied before imputation and the diagnostic tools afterwards.

Detecting missing value mechanisms is usually done by statistical tests or models. Visualization of missing and imputed values can support the test decision, but also reveals more details about the data structure. Most notably, statistical requirements for a test can be checked graphically, and problems like outliers or skewed data distributions can be discovered. Furthermore, the included plot methods may also be able to detect missing value mechanisms in the first place.

A graphical user interface available in the package VIMGUI allows easy handling of the plot methods. In addition, VIM can be used for data from essentially any field.

This package includes advanced imputation methods, robust statistics, and tools for data preprocessing and diagnostics.
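The order described above (inspect missingness first, impute, then diagnose) can be sketched in base R. This is only an illustration under stated assumptions: the toy data frame is hypothetical, and median imputation is a deliberately simple stand-in, not one of the package's imputation methods. The final block calls aggr() and kNN() only if VIM is installed.

```r
# Toy data with missing values (hypothetical, for illustration only)
set.seed(1)
df <- data.frame(x = rnorm(50), y = rnorm(50))
df$y[sample(50, 8)] <- NA

# Step 1 (before imputation): amount of missingness per variable
n_miss <- colSums(is.na(df))
print(n_miss)

# Step 2: impute; median imputation is a simple stand-in here
imp_flag <- is.na(df$y)
df_imp <- df
df_imp$y[imp_flag] <- median(df$y, na.rm = TRUE)

# Step 3 (after imputation): compare observed and imputed values
print(summary(df_imp$y[!imp_flag]))
print(summary(df_imp$y[imp_flag]))

# With VIM installed, the package's own tools can take over steps 1 and 2
if (requireNamespace("VIM", quietly = TRUE)) {
  print(VIM::aggr(df, plot = FALSE))
  df_knn <- VIM::kNN(df)
  print(head(df_knn))
}
```

The logical vector imp_flag plays the same role as the imputation-index columns used by the plot functions documented below.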

Author(s)

Matthias Templ, Andreas Alfons, Alexander Kowarik, Bernd Prantner

Maintainer: Matthias Templ <templ@tuwien.ac.at>

References

M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.

M. Templ, A. Kowarik, P. Filzmoser (2011) Iterative stepwise regression imputation using standard and robust methods. Journal of Computational Statistics and Data Analysis, Vol. 55, pp. 2793-2806.

Animals_na

Description

Average log brain and log body weights for 28 Species

Format

A data frame with 28 observations on the following 2 variables.

lbody: log body weight
lbrain: log brain weight

Details

The original data can be found in package MASS. 10 values on brain weight are set to be missing.

Source

P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection. Wiley, p. 57.

References

Venables, W. N. and Ripley, B. D. (1999) Modern Applied Statistics with S-PLUS. Third Edition. Springer.

Templ, M. (2022) Visualization and Imputation of Missing Values. Springer Publishing. Upcoming book.

Examples

data(Animals_na)
aggr(Animals_na)

Synthetic subset of the Austrian structural business statistics data

Description

Synthetic subset of the Austrian structural business statistics (SBS) data, namely NACE code 52.42 (retail sale of clothing).

Details

The Austrian SBS data set consists of more than 320,000 enterprises. Available raw (unedited) data set: 21669 observations in 90 variables, structured according to NACE revision 1.1, with 3891 missing values.

We investigate 9 variables of NACE 52.42 (retail sale of clothing).

From this confidential raw data set, a non-confidential, close-to-reality synthetic data set was generated.

Source

http://www.statistik.at

Examples

data(SBS5242)
aggr(SBS5242)

Aggregations for missing/imputed values

Description

Calculate or plot the amount of missing/imputed values in each variable and the amount of missing/imputed values in certain combinations of variables.

Print method for objects of class "aggr".

Summary method for objects of class "aggr".

Print method for objects of class "summary.aggr".

Usage

aggr(x, delimiter = NULL, plot = TRUE, ...)

## S3 method for class 'aggr'
plot(
  x,
  col = c("skyblue", "red", "orange"),
  bars = TRUE,
  numbers = FALSE,
  prop = TRUE,
  combined = FALSE,
  varheight = FALSE,
  only.miss = FALSE,
  border = par("fg"),
  sortVars = FALSE,
  sortCombs = TRUE,
  ylabs = NULL,
  axes = TRUE,
  labels = axes,
  cex.lab = 1.2,
  cex.axis = par("cex"),
  cex.numbers = par("cex"),
  gap = 4,
  ...
)

## S3 method for class 'aggr'
print(x, ..., digits = NULL)

## S3 method for class 'aggr'
summary(object, ...)

## S3 method for class 'summary.aggr'
print(x, ...)

Arguments

x

an object of class "summary.aggr".

delimiter

a character-vector to distinguish between variables and imputation-indices for imputed variables (therefore, x needs to have colnames()). If given, it is used to determine the corresponding imputation-index for any imputed variable (a logical-vector indicating which values of the variable have been imputed). If such imputation-indices are found, they are used for highlighting and the colors are adjusted according to the given colors for imputed variables (see col).

plot

a logical indicating whether the results should be plotted (the default is TRUE).

...

Further arguments, currently ignored.

col

a vector of length three giving the colors to be used for observed, missing and imputed data. If only one color is supplied, it is used for missing and imputed data and observed data is transparent. If only two colors are supplied, the first one is used for observed data and the second color is used for missing and imputed data.

bars

a logical indicating whether a small barplot for the frequencies of the different combinations should be drawn.

numbers

a logical indicating whether the proportion or frequencies of the different combinations should be represented by numbers.

prop

a logical indicating whether the proportion of missing/imputed values and combinations should be used rather than the total amount.

combined

a logical indicating whether the two plots should be combined. If FALSE, a separate barplot on the left hand side shows the amount of missing/imputed values in each variable. If TRUE, a small version of this barplot is drawn on top of the plot for the combinations of missing/imputed and non-missing values. See “Details” for more information.

varheight

a logical indicating whether the cell heights are given by the frequencies of occurrence of the corresponding combinations.

only.miss

a logical indicating whether the small barplot for the frequencies of the combinations should only be drawn for combinations including missing/imputed values (if bars is TRUE). This is useful if most observations are complete, in which case the corresponding bar would dominate the barplot such that the remaining bars are too compressed. The proportion or frequency of complete observations (as determined by prop) is then represented by a number instead of a bar.

border

the color to be used for the border of the bars and rectangles. Use border=NA to omit borders.

sortVars

a logical indicating whether the variables should be sorted by the number of missing/imputed values.

sortCombs

a logical indicating whether the combinations should be sorted by the frequency of occurrence.

ylabs

if combined is TRUE, a character string giving the y-axis label of the combined plot, otherwise a character vector of length two giving the y-axis labels for the two plots.

axes

a logical indicating whether axes should be drawn.

labels

either a logical indicating whether labels should be plotted on the x-axis, or a character vector giving the labels.

cex.lab

the character expansion factor to be used for the axis labels.

cex.axis

the character expansion factor to be used for the axis annotation.

cex.numbers

the character expansion factor to be used for the proportion or frequencies of the different combinations.

gap

if combined is FALSE, a numeric value giving the distance between the two plots in margin lines.

digits

the minimum number of significant digits to be used (see print.default()).

object

an object of class "aggr".

Details

Often it is of interest how many missing/imputed values are contained in each variable. Even more interesting, there may be certain combinations of variables with a high number of missing/imputed values.

If combined is FALSE, two separate plots are drawn for the missing/imputed values in each variable and the combinations of missing/imputed and non-missing values. The barplot on the left hand side shows the amount of missing/imputed values in each variable. In the aggregation plot on the right hand side, all existing combinations of missing/imputed and non-missing values in the observations are visualized. Available, missing and imputed data are color coded as given by col. Additionally, there are two possibilities to represent the frequencies of occurrence of the different combinations. The first option is to visualize the proportions or frequencies by a small bar plot and/or numbers. The second option is to let the cell heights be given by the frequencies of the corresponding combinations. Furthermore, variables may be sorted by the number of missing/imputed values and combinations by the frequency of occurrence to give more power to finding the structure of missing/imputed values.

If combined is TRUE, a small version of the barplot showing the amount of missing/imputed values in each variable is drawn on top of the aggregation plot.

The graphical parameter oma will be set unless supplied as an argument.
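The two quantities described above, the amount of missing values per variable and the frequency of each combination of missing and non-missing values, can be reproduced by hand in base R. The sketch below is not the implementation used by aggr(); the toy data and the "0:1:0" pattern encoding (1 = missing) are assumptions made for illustration.

```r
# Toy data (hypothetical)
df <- data.frame(
  a = c(1, NA, 3, NA, 5, 6),
  b = c(NA, 2, 3, NA, 5, 6),
  c = c(1, 2, 3, 4, 5, 6)
)

# Logical indicator matrix: TRUE where a value is missing
miss <- is.na(df)

# Amount of missing values in each variable
per_variable <- colSums(miss)

# Encode the missingness pattern of each row, e.g. "1:0:0" (1 = missing)
pattern <- apply(miss, 1, function(r) paste(as.integer(r), collapse = ":"))

# Frequency and percentage of each existing combination, most frequent first
combos <- sort(table(pattern), decreasing = TRUE)
percent <- 100 * combos / nrow(df)

print(per_variable)
print(combos)
print(percent)
```

Sorting the patterns by frequency mirrors what sortCombs = TRUE is described as doing; sorting per_variable would correspond to sortVars.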

Value

for aggr, a list of class "aggr" containing the following components:

x: the data used.
combinations: a character vector representing the combinations of variables.
count: the frequencies of these combinations.
percent: the percentage of these combinations.
missings: a data.frame containing the amount of missing/imputed values in each variable.
tabcomb: the indicator matrix for the combinations of variables.

a list of class "summary.aggr" containing the following components:

missings: a data.frame containing the amount of missing or imputed values in each variable.
combinations: a data.frame containing a character vector representing the combinations of variables along with their frequencies and percentages.

Note

Some of the argument names and positions have changed with version 1.3 due to extended functionality and for more consistency with other plot functions in VIM. For back compatibility, the arguments labs and names.arg can still be supplied to ... and are handled correctly. Nevertheless, they are deprecated and no longer documented. Use ylabs and labels instead.

Author(s)

Andreas Alfons, Matthias Templ, modifications for displaying imputed values by Bernd Prantner

Matthias Templ, modifications by Andreas Alfons and Bernd Prantner

Matthias Templ, modifications by Andreas Alfons

Andreas Alfons, modifications by Bernd Prantner

References

M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.

Examples

data(sleep, package="VIM")
## for missing values
a <- aggr(sleep)
a
summary(a)
## for imputed values
sleep_IMPUTED <- kNN(sleep)
a <- aggr(sleep_IMPUTED, delimiter="_imp")
a
summary(a)

data(sleep, package = "VIM")
a <- aggr(sleep, plot=FALSE)
a

data(sleep, package = "VIM")
summary(aggr(sleep, plot=FALSE))

data(sleep, package = "VIM")
s <- summary(aggr(sleep, plot=FALSE))
s

Alphablending for colors

Description

Convert colors to semitransparent colors.

Usage

alphablend(col, alpha = NULL, bg = NULL)

Arguments

col

a vector specifying colors.

alpha

a numeric vector containing the alpha values (between 0 and 1).

bg

the background color to be used for alphablending. This can be used as a workaround for graphics devices that do not support semitransparent colors.

Value

a vector containing the semitransparent colors.

Author(s)

Andreas Alfons

Examples

alphablend("red", 0.6)
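The general idea behind blending against a background color can be sketched with base grDevices functions. The helper blend_sketch() below is hypothetical and is not the package's implementation of alphablend(); it assumes a single alpha value and mixes each color channel linearly with an opaque background. For devices that do support transparency, grDevices::adjustcolor() attaches an alpha channel instead.

```r
# Hypothetical helper: mix `col` with an opaque background `bg`
# using a single alpha value between 0 and 1
blend_sketch <- function(col, alpha, bg = "white") {
  stopifnot(length(alpha) == 1, alpha >= 0, alpha <= 1)
  fg <- col2rgb(col) / 255              # 3 x n matrix of channels in [0, 1]
  back <- as.vector(col2rgb(bg)) / 255  # background channels (length 3)
  # `back` is recycled down each column, i.e. per channel
  mixed <- alpha * fg + (1 - alpha) * back
  rgb(mixed[1, ], mixed[2, ], mixed[3, ])
}

# Opaque color that imitates 60% red on a white background
print(blend_sketch("red", 0.6))

# True semitransparency: the same red with an alpha channel attached
print(adjustcolor("red", alpha.f = 0.6))
```

The first result is fully opaque and therefore safe on any device; the second relies on the device honoring the alpha channel.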

Barplot with information about missing/imputed values

Description

Barplot with highlighting of missing/imputed values in other variables by splitting each bar into two parts. Additionally, information about missing/imputed values in the variable of interest is shown on the right hand side.

Usage

barMiss(
  x,
  delimiter = NULL,
  pos = 1,
  selection = c("any", "all"),
  col = c("skyblue", "red", "skyblue4", "red4", "orange", "orange4"),
  border = NULL,
  main = NULL,
  sub = NULL,
  xlab = NULL,
  ylab = NULL,
  axes = TRUE,
  labels = axes,
  only.miss = TRUE,
  miss.labels = axes,
  interactive = TRUE,
  ...
)

Arguments

x

a vector, matrix or data.frame.

delimiter

pos

a numeric value giving the index of the variable of interest. Additional variables in x are used for highlighting.

selection

the selection method for highlighting missing/imputed values in multiple additional variables. Possible values are "any" (highlighting of missing/imputed values in any of the additional variables) and "all" (highlighting of missing/imputed values in all of the additional variables).

col

a vector of length six giving the colors to be used. If only one color is supplied, the bars are transparent and the supplied color is used for highlighting missing/imputed values. Else if two colors are supplied, they are recycled.

border

the color to be used for the border of the bars. Use border=NA to omit borders.

main,sub

main and sub title.

xlab,ylab

axis labels.

axes

a logical indicating whether axes should be drawn on the plot.

labels

either a logical indicating whether labels should be plotted below each bar, or a character vector giving the labels.

only.miss

logical; if TRUE, the missing/imputed values in the variable of interest are visualized by a single bar. Otherwise, a small barplot is drawn on the right hand side (see ‘Details’).

miss.labels

either a logical indicating whether label(s) should be plotted below the bar(s) on the right hand side, or a character string or vector giving the label(s) (see ‘Details’).

interactive

a logical indicating whether variables can be switched interactively (see ‘Details’).

...

further graphical parameters to be passed to graphics::title() and graphics::axis().

Details

If more than one variable is supplied, the bars for the variable of interest are split according to missingness/number of imputed missings in the additional variables.

If only.miss=TRUE, the missing/imputed values in the variable of interest are visualized by one bar on the right hand side. If additional variables are supplied, this bar is again split into two parts according to missingness/number of imputed missings in the additional variables.

Otherwise, a small barplot consisting of two bars is drawn on the right hand side. The first bar corresponds to observed values in the variable of interest and the second bar to missing/imputed values. Since these two bars are not on the same scale as the main barplot, a second y-axis is plotted on the right (if axes=TRUE). Each of the two bars is again split into two parts according to missingness/number of imputed missings in the additional variables. Note that this display does not make sense if only one variable is supplied, therefore only.miss is ignored in that case.

If interactive=TRUE, clicking in the left margin of the plot results in switching to the previous variable and clicking in the right margin results in switching to the next variable. Clicking anywhere else on the graphics device quits the interactive session. When switching to a continuous variable, a histogram is plotted rather than a barplot.
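The splitting of each bar by missingness in the additional variables, and the difference between the "any" and "all" selection methods, can be illustrated with a base R cross-tabulation. The toy data and variable names below are hypothetical, and the sketch only mirrors the counting logic described above, not the plotting done by barMiss().

```r
# Toy data (hypothetical): `grp` is the variable of interest,
# `v1` and `v2` are the additional variables used for highlighting
df <- data.frame(
  grp = factor(c("a", "a", "b", "b", "b", "c")),
  v1  = c(1, NA, 3, NA, 5, 6),
  v2  = c(NA, NA, 3, 4, 5, 6)
)

# Missingness indicators of the additional variables
others <- is.na(df[, c("v1", "v2")])

# selection = "any": highlight if at least one additional variable is missing
highlight_any <- rowSums(others) > 0
# selection = "all": highlight only if every additional variable is missing
highlight_all <- rowSums(others) == ncol(others)

# Each row of these tables gives the two parts of one bar
split_any <- table(df$grp, highlight_any)
split_all <- table(df$grp, highlight_all)
print(split_any)
print(split_all)

# A stacked barplot of the split could be drawn with:
# barplot(t(split_any), col = c("skyblue", "red"))
```

With more than one additional variable, "all" is the stricter criterion, so its highlighted parts can never exceed those obtained with "any".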

Value

a numeric vector giving the coordinates of the midpoints of the bars.

Note

Some of the argument names and positions have changed with version 1.3 due to extended functionality and for more consistency with other plot functions in VIM. For back compatibility, the arguments axisnames, names.arg and names.miss can still be supplied to ... and are handled correctly. Nevertheless, they are deprecated and no longer documented. Use labels and miss.labels instead.

Author(s)

Andreas Alfons, modifications to show imputed values by Bernd Prantner

References

M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.

Examples

data(sleep, package = "VIM")
## for missing values
x <- sleep[, c("Exp", "Sleep")]
barMiss(x)
barMiss(x, only.miss = FALSE)
## for imputed values
x_IMPUTED <- kNN(sleep[, c("Exp", "Sleep")])
barMiss(x_IMPUTED, delimiter = "_imp")
barMiss(x_IMPUTED, delimiter = "_imp", only.miss = FALSE)

Breast cancer Wisconsin data set

Description

Dataset containing the original Wisconsin breast cancer data.

Format

A data frame with 699 observations on the following 11 variables.

ID: Sample ID
clump_thickness: as integer from 1 - 10
uniformity_cellsize: as integer from 1 - 10
uniformity_cellshape: as integer from 1 - 10
adhesion: as integer from 1 - 10
epithelial_cellsize: as integer from 1 - 10
bare_nuclei: as integer from 1 - 10, includes 16 missings
chromatin: as integer from 1 - 10
normal_nucleoli: as integer from 1 - 10
mitoses: as integer from 1 - 10
class: benign or malignant

References

The data were downloaded and conditioned for R from the UCI machine learning repository, see https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)

This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg. If you publish results when using this database, then please include this information in your acknowledgements. Also, please cite one or more of:

O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.

William H. Wolberg and O. L. Mangasarian: "Multisurface method of pattern separation for medical diagnosis applied to breast cytology", Proceedings of the National Academy of Sciences, U.S.A., Volume 87, December 1990, pp 9193-9196.

O. L. Mangasarian, R. Setiono, and W. H. Wolberg: "Pattern recognition via linear programming: Theory and application to medical diagnosis", in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.

K. P. Bennett & O. L. Mangasarian: "Robust linear programming discrimination of two linearly inseparable sets", Optimization Methods and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).

Examples

data(bcancer)
aggr(bcancer)

Background map

Description

Plot a background map.

Usage

bgmap(map, add = FALSE, ...)

Arguments

map

either a matrix or data.frame with two columns, a list with components x and y, or an object of any class that can be used for maps and provides its own plot method (e.g., "SpatialPolygons" from package sp). A list of the previously mentioned types can also be provided.

add

a logical indicating whether map should be added to an already existing plot (the default is FALSE).

...

further arguments and graphical parameters to be passed to plot and/or graphics::lines().

Author(s)

Andreas Alfons

References

M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.

Examples

data(kola.background, package = "VIM")
bgmap(kola.background)

Brittleness index data set

Description

A plastic product is produced in three parallel reactors (TK104, TK105, or TK107). For each row in the dataset, we have the same batch of raw material that was split, and fed to the 3 reactors. These values are the brittleness index for the product produced in the reactor. A simulated data set.

Format

A data frame with 23 observations on the following 3 variables.

TK104: Brittleness for batches of raw material in reactor 104
TK105: Brittleness for batches of raw material in reactor 105
TK107: Brittleness for batches of raw material in reactor 107

Source

https://openmv.net/info/brittleness-index

Examples

data(brittleness)
aggr(brittleness)

C-horizon of the Kola data with missing values

Description

This data set is the same as chorizon in package mvoutlier, except that values below the detection limit are coded as NA.
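Coding values below a detection limit as NA can be sketched in base R as follows. The concentrations and the detection limit are hypothetical numbers chosen for illustration; they are not taken from the Kola data, and this is not a record of how chorizonDL was actually prepared.

```r
# Hypothetical concentrations and detection limit (illustration only)
conc <- c(0.5, 0.02, 1.3, 0.01, 0.8)
detection_limit <- 0.05

# Values below the detection limit are recoded as missing
conc_na <- replace(conc, conc < detection_limit, NA)
print(conc_na)

# Number of values that fell below the detection limit
print(sum(is.na(conc_na)))
```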

Format

A data frame with 606 observations on the following 110 variables.

*ID: a numeric vector
XCOO: a numeric vector
YCOO: a numeric vector
Ag: a numeric vector
Ag_INAA: a numeric vector
Al: a numeric vector
Al2O3: a numeric vector
As: a numeric vector
As_INAA: a numeric vector
Au_INAA: a numeric vector
B: a numeric vector
Ba: a numeric vector
Ba_INAA: a numeric vector
Be: a numeric vector
Bi: a numeric vector
Br_IC: a numeric vector
Br_INAA: a numeric vector
Ca: a numeric vector
Ca_INAA: a numeric vector
CaO: a numeric vector
Cd: a numeric vector
Ce_INAA: a numeric vector
Cl_IC: a numeric vector
Co: a numeric vector
Co_INAA: a numeric vector
EC: a numeric vector
Cr: a numeric vector
Cr_INAA: a numeric vector
Cs_INAA: a numeric vector
Cu: a numeric vector
Eu_INAA: a numeric vector
F_IC: a numeric vector
Fe: a numeric vector
Fe_INAA: a numeric vector
Fe2O3: a numeric vector
Hf_INAA: a numeric vector
Hg: a numeric vector
Hg_INAA: a numeric vector
Ir_INAA: a numeric vector
K: a numeric vector
K2O: a numeric vector
La: a numeric vector
La_INAA: a numeric vector
Li: a numeric vector
LOI: a numeric vector
Lu_INAA: a numeric vector
wt_INAA: a numeric vector
Mg: a numeric vector
MgO: a numeric vector
Mn: a numeric vector
MnO: a numeric vector
Mo: a numeric vector
Mo_INAA: a numeric vector
Na: a numeric vector
Na_INAA: a numeric vector
Na2O: a numeric vector
Nd_INAA: a numeric vector
Ni: a numeric vector
Ni_INAA: a numeric vector
NO3_IC: a numeric vector
P: a numeric vector
P2O5: a numeric vector
Pb: a numeric vector
pH: a numeric vector
PO4_IC: a numeric vector
Rb: a numeric vector
S: a numeric vector
Sb: a numeric vector
Sb_INAA: a numeric vector
Sc: a numeric vector
Sc_INAA: a numeric vector
Se: a numeric vector
Se_INAA: a numeric vector
Si: a numeric vector
SiO2: a numeric vector
Sm_INAA: a numeric vector
Sn_INAA: a numeric vector
SO4_IC: a numeric vector
Sr: a numeric vector
Sr_INAA: a numeric vector
SUM_XRF: a numeric vector
Ta_INAA: a numeric vector
Tb_INAA: a numeric vector
Te: a numeric vector
Th: a numeric vector
Th_INAA: a numeric vector
Ti: a numeric vector
TiO2: a numeric vector
U_INAA: a numeric vector
V: a numeric vector
W_INAA: a numeric vector
Y: a numeric vector
Yb_INAA: a numeric vector
Zn: a numeric vector
Zn_INAA: a numeric vector
ELEV: a numeric vector
*COUN: a numeric vector
*ASP: a numeric vector
TOPC: a numeric vector
LITO: a numeric vector
Al_XRF: a numeric vector
Ca_XRF: a numeric vector
Fe_XRF: a numeric vector
K_XRF: a numeric vector
Mg_XRF: a numeric vector
Mn_XRF: a numeric vector
Na_XRF: a numeric vector
P_XRF: a numeric vector
Si_XRF: a numeric vector
Ti_XRF: a numeric vector

Note

For a more detailed description of this data set, see the help file chorizon in package mvoutlier.

Source

Kola Project (1993-1998)

References

Reimann, C., Filzmoser, P., Garrett, R.G. and Dutter, R. (2008) Statistical Data Analysis Explained: Applied Environmental Statistics with R. Wiley.

Examples

data(chorizonDL, package = "VIM")
summary(chorizonDL)

HCL and RGB color sequences

Description

Compute color sequences by linear interpolation based on a continuous color scheme between certain start and end colors. Color sequences may thereby be computed in the HCL or RGB color space.

Usage

colSequence(p, start, end, space = c("hcl", "rgb"), ...)

colSequenceRGB(p, start, end, fixup = TRUE, ...)

colSequenceHCL(p, start, end, fixup = TRUE, ...)

Arguments

p

a numeric vector with values between 0 and 1 giving values to be used for interpolation between the start and end color (0 corresponds to the start color, 1 to the end color).

start,end

the start and end color, respectively. For HCL colors, each can be supplied as a vector of length three (hue, chroma, luminance) or an object of class "colorspace::polarLUV()". For RGB colors, each can be supplied as a character string, a vector of length three (red, green, blue) or an object of class "colorspace::RGB()".

space

character string; if start and end are both numeric, this determines whether they refer to HCL or RGB values. Possible values are "hcl" (for the HCL space) or "rgb" (for the RGB space).

...

for colSequence, additional arguments to be passed to colSequenceHCL or colSequenceRGB. For colSequenceHCL and colSequenceRGB, additional arguments to be passed to colorspace::hex().

fixup

a logical indicating whether the colors should be corrected to valid RGB values (see colorspace::hex()).

Value

A character vector containing hexadecimal strings of the form "#RRGGBB".
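Linear interpolation between a start and an end color can be sketched in base R for the RGB case. The helper rgb_sequence_sketch() is hypothetical and works directly on the 0 to 1 channel values returned by col2rgb(); it does not go through the colorspace classes or the gamma handling of colorspace::hex(), so its results need not match colSequenceRGB() exactly.

```r
# Hypothetical helper: interpolate channel-wise between two named colors
rgb_sequence_sketch <- function(p, start, end) {
  stopifnot(is.numeric(p), all(p >= 0 & p <= 1))
  s <- col2rgb(start)[, 1] / 255   # start color channels in [0, 1]
  e <- col2rgb(end)[, 1] / 255     # end color channels in [0, 1]
  # Row i holds (1 - p[i]) * start + p[i] * end
  mix <- outer(1 - p, s) + outer(p, e)
  rgb(mix[, 1], mix[, 2], mix[, 3])
}

# p = 0 gives the start color, p = 1 the end color
print(rgb_sequence_sketch(c(0, 0.5, 1), "white", "red"))
```

Interpolating in HCL space follows the same pattern on hue, chroma and luminance, followed by a conversion to hexadecimal strings.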

Author(s)

Andreas Alfons

References

Zeileis, A., Hornik, K., Murrell, P. (2009) Escaping RGBland: Selecting colors for statistical graphics. Computational Statistics & Data Analysis, 53 (9), 1259–1270.

Examples

p <- c(0, 0.3, 0.55, 0.8, 1)
## HCL colors
colSequence(p, c(0, 0, 100), c(0, 100, 50))
colSequence(p, polarLUV(L=90, C=30, H=90), c(0, 100, 50))
## RGB colors
colSequence(p, c(1, 1, 1), c(1, 0, 0), space="rgb")
colSequence(p, RGB(1, 1, 0), "red")

Colic horse data set

Description

This is a modified version of the original training data set taken from the UCI repository, see reference. The modifications are only related to having appropriate levels for factor variables. This data set is about horse diseases where the task is to determine whether the lesion of the horse was surgical or not.

Format

A training data frame with 300 observations on the following 31 variables.

surgery: yes or no
age: 1 equals an adult horse, 2 is a horse younger than 6 months
hospitalID: ID
temp_rectal: rectal temperature
pulse: heart rate in beats per minute
respiratory_rate: a normal rate is between 8 and 10
temp_extreme: temperature of extremities
pulse_peripheral: factor with four categories
capillayr_refill_time: a clinical judgement. The longer the refill, the poorer the circulation. Possible values are 1 = < 3 seconds and 2 = >= 3 seconds
pain: a subjective judgement of the horse's pain level
peristalsis: an indication of the activity in the horse's gut. As the gut becomes more distended or the horse becomes more toxic, the activity decreases
abdominal_distension: An animal with abdominal distension is likely to be painful and have reduced gut motility. A horse with severe abdominal distension is likely to require surgery just to relieve the pressure
nasogastric_tube: This refers to any gas coming out of the tube. A large gas cap in the stomach is likely to give the horse discomfort
nasogastric_reflux: possible values are 1 = none, 2 = > 1 liter, 3 = < 1 liter. The greater the amount of reflux, the greater the likelihood that there is some serious obstruction to the fluid passage from the rest of the intestine
nasogastric_reflux_PH: scale is from 0 to 14 with 7 being neutral. Normal values are in the 3 to 4 range
rectal_examination: Rectal examination. Absent feces probably indicates an obstruction
abdomen: abdomen. Possible values are 1 = normal, 2 = other, 3 = firm feces in the large intestine, 4 = distended small intestine, 5 = distended large intestine
cell_volume: packed cell volume. Normal range is 30 to 50. The level rises as the circulation becomes compromised or as the animal becomes dehydrated.
protein: total protein. Normal values lie in the 6-7.5 (gms/dL) range. The higher the value, the greater the dehydration
abdominocentesis_appearance: Abdominocentesis appearance. A needle is put in the horse's abdomen and fluid is obtained from the abdominal cavity
abdomcentesis_protein: abdominocentesis total protein. The higher the level of protein, the more likely it is to have a compromised gut. Values are in gms/dL
outcome: What eventually happened to the horse?
surgical_lesion: retrospectively, was the problem (lesion) surgical?
lesion_type1: type of lesion
lesion_type2: type of lesion
lesion_type3: type of lesion
cp_data
temp_extreme_ordered: temperature of extremities (ordered)
mucous_membranes_col: mucous membranes. A subjective measurement of colour
mucous_membranes_group: different recodings of mucous membranes

Source

https://archive.ics.uci.edu/ml/datasets/Horse+Colic

Creators: Mary McLeish & Matt Cecile, Department of Computer Science, University of Guelph, Guelph, Ontario, Canada N1G 2W1

Donor: Will Taylor

Examples

data(colic)
aggr(colic)

Subset of the collision data

Description

Subset of the collision data from December 20 to December 31, 2018, from NYCD.

Details

Each record represents a collision in NYC by city, borough, precinct and cross street.

Source

https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95

Examples

data(collisions)
aggr(collisions)

Colored map with information about missing/imputed values

Description

Colored map in which the proportion or amount of missing/imputed values in each region is coded according to a continuous or discrete color scheme. The sequential color palette may thereby be computed in the HCL or the RGB color space.

Usage

colormapMiss(
  x,
  region,
  map,
  imp_index = NULL,
  prop = TRUE,
  polysRegion = 1:length(x),
  range = NULL,
  n = NULL,
  col = c("red", "orange"),
  gamma = 2.2,
  fixup = TRUE,
  coords = NULL,
  numbers = TRUE,
  digits = 2,
  cex.numbers = 0.8,
  col.numbers = par("fg"),
  legend = TRUE,
  interactive = TRUE,
  ...
)

colormapMissLegend(
  xleft,
  ybottom,
  xright,
  ytop,
  cmap,
  n = 1000,
  horizontal = TRUE,
  digits = 2,
  cex.numbers = 0.8,
  col.numbers = par("fg"),
  ...
)

Arguments

x

a numeric vector.

region

a vector or factor of the same length as x giving the regions.

map

an object of any class that contains polygons and provides its own plot method (e.g., "SpatialPolygons" from package sp).

imp_index

a logical-vector indicating which values of ‘x’ have been imputed. If given, it is used for highlighting and the colors are adjusted according to the given colors for imputed variables (see col).

prop

a logical indicating whether the proportion of missing/imputed values should be used rather than the total amount.

polysRegion

a numeric vector specifying the region that each polygon belongs to.

range

a numeric vector of length two specifying the range (minimum and maximum) of the proportion or amount of missing/imputed values to be used for the color scheme.

n

for colormapMiss, the number of equally spaced cut-off points for a discretized color scheme. If this is not a positive integer, a continuous color scheme is used (the default). In the latter case, the number of rectangles to be drawn in the legend can be specified in colormapMissLegend. A reasonably large number makes it appear continuous.

col

the color range (start and end) to be used. RGB colors may be specified as character strings or as objects of class "colorspace::RGB()". HCL colors need to be specified as objects of class "colorspace::polarLUV()". If only one color is supplied, it is used as end color, while the start color is taken to be transparent for RGB or white for HCL.

gamma

numeric; the display gamma value (see colorspace::hex()).

fixup

a logical indicating whether the colors should be corrected to valid RGB values (see colorspace::hex()).

coords

a matrix or data.frame with two columns giving the coordinates for the labels.

numbers

a logical indicating whether the corresponding proportions or numbers of missing/imputed values should be used as labels for the regions.

digits

the number of digits to be used in the labels (in case of proportions).

cex.numbers

the character expansion factor to be used for the labels.

col.numbers

the color to be used for the labels.

legend

a logical indicating whether a legend should be plotted.

interactive

a logical indicating whether more detailed information about missing/imputed values should be displayed interactively (see ‘Details’).

...

further arguments to be passed to plot.

xleft

left x position of the legend.

ybottom

bottom y position of the legend.

xright

right x position of the legend.

ytop

top y position of the legend.

cmap

a list as returned by colormapMiss that contains the required information for the legend.

horizontal

a logical indicating whether the legend should be drawn horizontally or vertically.

Details

The proportion or amount of missing/imputed values in x of each region is coded according to a continuous or discrete color scheme in the color range defined by col. In addition, the proportions or numbers can be shown as labels in the regions.

If interactive is TRUE, clicking in a region displays more detailed information about missing/imputed values on the console. Clicking outside the borders quits the interactive session.

Value

colormapMiss returns a list with the following components:

nmiss: a numeric vector containing the number of missing/imputed values in each region.
nobs: a numeric vector containing the number of observations in each region.
pmiss: a numeric vector containing the proportion of missing values in each region.
prop: a logical indicating whether the proportion of missing/imputed values has been used rather than the total amount.
range: the range of the proportion or amount of missing/imputed values corresponding to the color range.
n: either a positive integer giving the number of equally spaced cut-off points for a discretized color scheme, or NULL for a continuous color scheme.
start: the start color of the color scheme.
end: the end color of the color scheme.
space: a character string giving the color space (either "rgb" for RGB colors or "hcl" for HCL colors).
gamma: numeric; the display gamma value (see colorspace::hex()).
fixup: a logical indicating whether the colors have been corrected to valid RGB values (see colorspace::hex()).
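The per-region quantities listed above (nmiss, nobs, pmiss) can be illustrated with a minimal base R sketch. The toy vectors x and region below are made up for illustration, and the code only mirrors the described components; it is not the package's internal implementation.

```r
# Toy data (hypothetical): one variable with NAs and a region label per observation.
x <- c(1.2, NA, 3.4, NA, 5.0, 6.1, NA, 8.3)
region <- factor(c("A", "A", "A", "B", "B", "C", "C", "C"))

# Number of missing values and number of observations in each region.
nmiss <- tapply(is.na(x), region, sum)
nobs <- tapply(x, region, length)

# Proportion of missing values in each region; with prop = TRUE this is
# the quantity that would be mapped to the color scheme.
pmiss <- nmiss / nobs
print(round(pmiss, 2))
```

With prop = FALSE, the counts in nmiss would be color-coded instead of the proportions.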

Note

Some of the argument names and positions have changed with versions 1.3 and 1.4 due to extended functionality and for more consistency with other plot functions in VIM. For back compatibility, the arguments cex.text and col.text can still be supplied to ... and are handled correctly. Nevertheless, they are deprecated and no longer documented. Use cex.numbers and col.numbers instead.

Author(s)

Andreas Alfons, modifications to show imputed values by Bernd Prantner

References

M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.

Count number of infinite or missing values

Description

Count the number of infinite or missing values in a vector.

Usage

countInf(x)

Arguments

x

a vector.

Value

countInf returns the number of infinite values in x. countNA returns the number of missing values in x.
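For orientation, the same kind of counts can be obtained in base R as sketched below. This is an illustrative equivalent under the assumption that the counts follow is.infinite() and is.na(); how the package functions treat NaN is not stated here.

```r
# Toy vector (hypothetical) with infinite, missing and NaN entries.
x <- c(1, Inf, NA, -Inf, NaN, 4)

# Count infinite values: NA and NaN are not infinite.
n_inf <- sum(is.infinite(x))

# Count missing values: note that is.na() is TRUE for both NA and NaN.
n_na <- sum(is.na(x))

print(c(infinite = n_inf, missing = n_na))
```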

Author(s)

Andreas Alfons

Examples

data(sleep, package="VIM")
countInf(log(sleep$Dream))
countNA(sleep$Dream)

Pima Indians Diabetes Data

Description

The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Format

A data frame with 768 observations on the following 9 variables.

Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age in years
Outcome: Diabetes (yes or no)

Details

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Source

https://www.kaggle.com/uciml/pima-indians-diabetes-database/data

References

Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261–265). IEEE Computer Society Press.

Examples

data(diabetes)
aggr(diabetes)

Error performance measures

Description

Various error measures evaluating the quality of imputations

Usage

evaluation(x, y, m, vartypes = "guess")

nrmse(x, y, m)

pfc(x, y, m)

msecov(x, y)

msecor(x, y)

Arguments

x

matrix or data frame

y

matrix or data frame of the same size as x

m

the indicator matrix for missing cells

vartypes

a vector of length ncol(x) specifying the variable types, like factor or numeric

Details

This function has been mainly written for procedures that evaluate imputation or replacement of rounded zeros. The ni parameter can thus, e.g., be used for expressing the number of rounded zeros.

Value

the value of the error measure
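To make the idea behind measures of this kind concrete, the sketch below computes a normalized root mean squared error for a numeric column and a proportion of falsely classified entries for a factor column, both restricted to the cells flagged in m. The normalization by the standard deviation of the true values is an assumption chosen for illustration; the exact definitions used by nrmse and pfc are not given in this section.

```r
# Toy "true" and imputed data (hypothetical values).
x_true <- data.frame(a = c(1, 2, 3, 4), g = factor(c("u", "v", "u", "v")))
x_imp <- data.frame(a = c(1, 2.5, 3, 4), g = factor(c("u", "v", "v", "v")))

# Indicator matrix of the cells that were missing and then imputed.
m <- matrix(c(FALSE, TRUE, FALSE, FALSE,
              FALSE, FALSE, TRUE, TRUE), ncol = 2)

# Numeric column: RMSE over the imputed cells, scaled by the sd of the
# true values (assumed normalization).
rmse_a <- sqrt(mean((x_true$a[m[, 1]] - x_imp$a[m[, 1]])^2))
nrmse_a <- rmse_a / sd(x_true$a)

# Factor column: share of imputed cells whose category is wrong.
pfc_g <- mean(as.character(x_true$g[m[, 2]]) != as.character(x_imp$g[m[, 2]]))

print(c(nrmse = nrmse_a, pfc = pfc_g))
```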

Author(s)

Matthias Templ

References

M. Templ, A. Kowarik, P. Filzmoser (2011) Iterative stepwise regression imputation using standard and robust methods. Journal of Computational Statistics and Data Analysis, Vol. 55, pp. 2793-2806.

Examples

data(iris)
iris_orig <- iris_imp <- iris
iris_imp$Sepal.Length[sample(1:nrow(iris), 10)] <- NA
iris_imp$Sepal.Width[sample(1:nrow(iris), 10)] <- NA
iris_imp$Species[sample(1:nrow(iris), 10)] <- NA
m <- is.na(iris_imp)
iris_imp <- kNN(iris_imp, imp_var = FALSE)
evaluation(iris_orig, iris_imp, m = m, vartypes = c(rep("numeric", 4), "factor"))
msecov(iris_orig[, 1:4], iris_imp[, 1:4])

Food consumption

Description

The relative consumption of certain food items in European and Scandinavian countries.

Format

A data frame with 16 observations on the following 21 variables.

Details

The numbers represent the percentage of the population consuming that food type.

Source

https://openmv.net/info/food-consumption

Examples

data(food)
str(food)
aggr(food)

Missing value gap statistics

Description

Computes the average missing value gap of a vector.

Usage

gapMiss(x, what = mean)

Arguments

x

a numeric vector

what

default is the arithmetic mean. One can supply a custom function that returns a vector of length 1 (e.g. median)

Details

The length of each sequence of missing values (gap) in a vector is calculated and the mean gap is reported.
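The gap computation described above can be sketched in base R with run-length encoding. This is an illustrative reading of the description, not the package's code; the vector v is made up.

```r
# Toy vector (hypothetical) with three runs of missing values.
v <- c(1, NA, 3, 4, NA, NA, NA, 8, NA, NA)

# Run-length encode the missingness pattern and keep the runs of NAs.
runs <- rle(is.na(v))
gaps <- runs$lengths[runs$values]

# Summarise the gap lengths; any function returning a single value
# could be used in place of mean().
mean_gap <- mean(gaps)
median_gap <- median(gaps)
print(gaps)
```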

Value

The gap statistics

Author(s)

Matthias Templ based on a suggestion and draft from Huang Tian Yuan.

Examples

v <- rnorm(20)
v[3] <- NA
v[6:9] <- NA
v[13:17] <- NA
v
gapMiss(v)
gapMiss(v, what = median)
gapMiss(v, what = function(x) mean(x, trim = 0.1))
gapMiss(v, what = var)

Computes the extended Gower distance of two data sets

Description

The function gowerD is used by kNN to compute the distances for numerical, factor, ordered and semi-continuous variables.

Usage

gowerD(
  data.x,
  data.y = data.x,
  weights = rep(1, ncol(data.x)),
  numerical = colnames(data.x),
  factors = vector(),
  orders = vector(),
  mixed = vector(),
  levOrders = vector(),
  mixed.constant = rep(0, length(mixed)),
  returnIndex = FALSE,
  nMin = 1L,
  returnMin = FALSE,
  methodStand = "range"
)

Arguments

data.x

data frame

data.y

data frame

weights

numeric vector providing weights for the variables in data.x (one weight per column, as in the default rep(1, ncol(data.x)))

numerical

names of numerical variables

factors

names of factor variables

orders

names of ordered variables

mixed

names of mixed variables

levOrders

vector with number of levels for each orders variable

mixed.constant

vector with length equal to the number of semi-continuous variables specifying the point of the semi-continuous distribution with non-zero probability

returnIndex

logical; if TRUE, return the index of the minimum distance

nMin

integer number of values with smallest distance to be returned

returnMin

logical indicating whether the computed distances for the indices should be returned

methodStand

character, either "range" or "iqr"; "iqr" is more robust to outliers

Details

returnIndex=FALSE: a numerical matrix n x m with the computed distances

returnIndex=TRUE: a named list with "ind" containing the requested indices and "mins" the computed distances
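As a rough illustration of the n x m distance matrix for purely numerical variables, the sketch below computes a range-standardized Gower-type distance in base R. The helper gower_numeric_sketch is hypothetical; it assumes complete data, equal weights, and ranges taken over both data sets combined, none of which is confirmed as the behaviour of gowerD.

```r
# Hypothetical helper: Gower-type distance for numeric columns only.
gower_numeric_sketch <- function(a, b) {
  a <- as.matrix(a)
  b <- as.matrix(b)
  # Range of each variable over both data sets (assumption).
  rng <- apply(rbind(a, b), 2, function(col) diff(range(col)))
  d <- matrix(0, nrow(a), nrow(b))
  for (j in seq_len(ncol(a))) {
    # Absolute difference, scaled to [0, 1] by the variable's range.
    d <- d + abs(outer(a[, j], b[, j], "-")) / rng[j]
  }
  # Average over variables (equal weights).
  d / ncol(a)
}

toy <- data.frame(x = c(0, 5, 10), y = c(1, 1, 3))
d <- gower_numeric_sketch(toy, toy)
print(d)
```

The index of the closest observation in each column could then be read off with apply(d, 2, which.min), which corresponds in spirit to what returnIndex = TRUE provides.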

Examples

data(sleep)
# all variables used as numerical
gowerD(sleep)
# split into numerical and ordered variables
gowerD(sleep, numerical = c("BodyWgt", "BrainWgt", "NonD", "Dream", "Sleep", "Span", "Gest"),
  orders = c("Pred","Exp","Danger"), levOrders = c(5,5,5))
# as before but only returning the index of the closest observation
gowerD(sleep, numerical = c("BodyWgt", "BrainWgt", "NonD", "Dream", "Sleep", "Span", "Gest"),
  orders = c("Pred","Exp","Danger"), levOrders = c(5,5,5), returnIndex = TRUE)

Growing dot map with information about missing/imputed values

Description

Map with dots whose sizes correspond to the values in a certain variable. Observations with missing/imputed values in additional variables are highlighted.

Usage

growdotMiss(
  x,
  coords,
  map,
  pos = 1,
  delimiter = NULL,
  selection = c("any", "all"),
  log = FALSE,
  col = c("skyblue", "red", "skyblue4", "red4", "orange", "orange4"),
  border = par("bg"),
  alpha = NULL,
  scale = NULL,
  size = NULL,
  exp = c(0, 0.95, 0.05),
  col.map = grey(0.5),
  legend = TRUE,
  legtitle = "Legend",
  cex.legtitle = par("cex"),
  cex.legtext = par("cex"),
  ncircles = 6,
  ndigits = 1,
  interactive = TRUE,
  ...
)

Arguments

x

a vector, matrix or data.frame.

coords

a matrix or data.frame with two columns giving the spatial coordinates of the observations.

map

a background map to be passed to bgmap().

pos

a numeric value giving the index of the variable determining the dot sizes.

delimiter

selection

log

a logical indicating whether the variable given by pos should be log-transformed.

col

a vector of length six giving the colors to be used in the plot. If only one color is supplied, it is used for the borders of non-highlighted dots and the surface area of highlighted dots. Else if two colors are supplied, they are recycled.

border

a vector of length four giving the colors to be used for the borders of the growing dots. Use NA to omit borders.

alpha

a numeric value between 0 and 1 giving the level of transparency of the colors, or NULL. This can be used to prevent overplotting.

scale

scaling factor of the map.

size

a vector of length two giving the sizes for the smallest and largest dots.

exp

a vector of length three giving the factors that define the shape of the exponential function (see ‘Details’).

col.map

the color to be used for the background map.

legend

a logical indicating whether a legend should be plotted.

legtitle

the title for the legend.

cex.legtitle

the character expansion factor to be used for the title of the legend.

cex.legtext

the character expansion factor to be used in the legend.

ncircles

the number of circles displayed in the legend.

ndigits

the number of digits displayed in the legend. Note that this is just a suggestion (see format()).

interactive

a logical indicating whether information about certain observations can be displayed interactively (see ‘Details’).

...

for growdotMiss, further arguments and graphical parameters to be passed to bgmap(). For bubbleMiss, the arguments to be passed to growdotMiss.

Details

The smallest dots correspond to the 10% quantile and the largest dots to the 99% quantile. In between, the dots grow exponentially, with exp defining the shape of the exponential function. Missings/imputed missings in the variable of interest will be drawn as rectangles.

If interactive=TRUE, detailed information for an observation can be printed on the console by clicking on the corresponding point. Clicking in a region that does not contain any points quits the interactive session.

Note

The function was renamed to growdotMiss in version 1.3. bubbleMiss is a (deprecated) wrapper for growdotMiss for back compatibility with older versions. However, due to extended functionality, some of the argument positions have changed.

The code is based on bubbleFIN from package StatDA (removed from CRAN).

Author(s)

Andreas Alfons, Matthias Templ, Peter Filzmoser, Bernd Prantner

References

M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.

Examples

data(chorizonDL, package = "VIM")
data(kola.background, package = "VIM")
coo <- chorizonDL[, c("XCOO", "YCOO")]
## for missing values
x <- chorizonDL[, c("Ca","As", "Bi")]
growdotMiss(x, coo, kola.background, border = "white")
## for imputed values
x_imp <- kNN(chorizonDL[,c("Ca","As","Bi" )])
growdotMiss(x_imp, coo, kola.background, delimiter = "_imp", border = "white")

Histogram with information about missing/imputed values

Description

Histogram with highlighting of missing/imputed values in other variables by splitting each bin into two parts. Additionally, information about missing/imputed values in the variable of interest is shown on the right hand side.

Usage

histMiss(
  x,
  delimiter = NULL,
  pos = 1,
  selection = c("any", "all"),
  breaks = "Sturges",
  right = TRUE,
  col = c("skyblue", "red", "skyblue4", "red4", "orange", "orange4"),
  border = NULL,
  main = NULL,
  sub = NULL,
  xlab = NULL,
  ylab = NULL,
  axes = TRUE,
  only.miss = TRUE,
  miss.labels = axes,
  interactive = TRUE,
  ...
)

Arguments

x

a vector, matrix or data.frame.

delimiter

pos

a numeric value giving the index of the variable of interest. Additional variables in x are used for highlighting.

selection

breaks

either a character string naming an algorithm to compute the breakpoints (see hist()), or a numeric value giving the number of cells.

right

logical; if TRUE, the histogram cells are right-closed (left-open) intervals.

col

border

the color to be used for the border of the cells. Use border=NA to omit borders.

main,sub

main and sub title.

xlab,ylab

axis labels.

axes

a logical indicating whether axes should be drawn on the plot.

only.miss

logical; if TRUE, the missing/imputed values in the first variable are visualized by a single bar. Otherwise, a small barplot is drawn on the right hand side (see ‘Details’).

miss.labels

either a logical indicating whether label(s) should be plotted below the bar(s) on the right hand side, or a character string or vector giving the label(s) (see ‘Details’).

interactive

a logical indicating whether the variables can be switched interactively (see ‘Details’).

...

further graphical parameters to be passed to graphics::title() and graphics::axis().

Details

If more than one variable is supplied, the bins for the variable of interest will be split according to missingness/number of imputed missings in the additional variables.
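The splitting of bins described above amounts to counting, per bin, all observations and the subset that is missing in the additional variables. A minimal base R sketch of that bookkeeping follows; the data and the fixed breaks are made up, and the plotting itself is left out.

```r
# Toy data (hypothetical): variable of interest and a flag telling whether
# the observation is missing in some other variable.
x <- c(1.5, 2.5, 2.7, 3.1, 3.9, NA, 4.5)
other_missing <- c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)

# Right-closed bins (1,2], (2,3], (3,4], (4,5].
breaks <- 1:5
bin <- cut(x, breaks = breaks, right = TRUE)

# Total count per bin and the highlighted part of each bin.
counts <- table(bin)
highlighted <- table(bin[other_missing])

print(rbind(counts, highlighted))
```

The observation with x missing falls into no bin; in histMiss such cases are what the bar on the right hand side summarises.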

If interactive=TRUE, clicking in the left margin of the plot results in switching to the previous variable and clicking in the right margin results in switching to the next variable. Clicking anywhere else on the graphics device quits the interactive session. When switching to a categorical variable, a barplot is produced rather than a histogram.

Value

a list with the following components:

breaks the breakpoints.
counts the number of observations in each cell.
missings the number of highlighted observations in each cell.
mids the cell midpoints.

Note

Some of the argument names and positions have changed with version 1.3 due to extended functionality and for more consistency with other plot functions in VIM. For back compatibility, the arguments axisnames and names.miss can still be supplied to ... and are handled correctly. Nevertheless, they are deprecated and no longer documented. Use miss.labels instead.

Author(s)

Andreas Alfons, Bernd Prantner

References

M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.

Examples

data(tao, package = "VIM")
## for missing values
x <- tao[, c("Air.Temp", "Humidity")]
histMiss(x)
histMiss(x, only.miss = FALSE)
## for imputed values
x_IMPUTED <- kNN(tao[, c("Air.Temp", "Humidity")])
histMiss(x_IMPUTED, delimiter = "_imp")
histMiss(x_IMPUTED, delimiter = "_imp", only.miss = FALSE)

Hot-Deck Imputation

Description

Implementation of the popular Sequential, Random (within a domain) hot-deck algorithm for imputation.

Usage

hotdeck(
  data,
  variable = NULL,
  ord_var = NULL,
  domain_var = NULL,
  makeNA = NULL,
  NAcond = NULL,
  impNA = TRUE,
  donorcond = NULL,
  imp_var = TRUE,
  imp_suffix = "imp"
)

Arguments

data

data.frame or matrix

variable

variables where missing values should be imputed (not overlapping with ord_var)

ord_var

variables for sorting the data set before imputation (not overlapping with variable)

domain_var

variables for building domains; imputation is carried out within these domains

makeNA

list of length equal to the number of variables, with values, that should be converted to NA for each variable

NAcond

list of length equal to the number of variables, with a condition for imputing a NA

impNA

TRUE/FALSE whether NA should be imputed

donorcond

list of length equal to the number of variables, with a donorcond condition as character string, e.g. ">5" or c(">5","<10"). If the list element for a variable is NULL, no condition will be applied for this variable.

imp_var

TRUE/FALSE; if TRUE, a TRUE/FALSE variable showing the imputation status is created for each imputed variable

imp_suffix

suffix for the TRUE/FALSE variables showing the imputation status

Value

the imputed data set.

Note

If the sequential hotdeck does not lead to a suitable donor, a random donor in the group will be used.
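The interplay of sorting, domains, sequential donors and the random fallback can be sketched for a single variable in base R. The function seq_hotdeck_sketch below is hypothetical and heavily simplified (one variable, one ordering variable, no donor conditions); it illustrates the idea and is not the package's implementation.

```r
# Hypothetical helper: sequential hot-deck for one vector y, sorted by ord
# within each level of domain.
seq_hotdeck_sketch <- function(y, ord, domain) {
  out <- y
  for (d in unique(domain)) {
    idx <- which(domain == d)
    idx <- idx[order(ord[idx])]  # sort the domain by the ordering variable
    last_donor <- NA
    for (i in idx) {
      if (is.na(out[i])) {
        out[i] <- last_donor     # take the preceding observed value
      } else {
        last_donor <- out[i]
      }
    }
    # Fallback: cells without a preceding donor get a random donor
    # from the same domain.
    still_na <- idx[is.na(out[idx])]
    donors <- y[idx][!is.na(y[idx])]
    if (length(still_na) > 0 && length(donors) > 0) {
      pick <- sample.int(length(donors), length(still_na), replace = TRUE)
      out[still_na] <- donors[pick]
    }
  }
  out
}

y <- c(10, NA, 30, NA, 50, NA)
ord <- c(1, 2, 3, 1, 2, 3)
domain <- c("a", "a", "a", "b", "b", "b")
res_hd <- seq_hotdeck_sketch(y, ord, domain)
print(res_hd)
```

In domain "b" the first sorted observation has no preceding donor, so the fallback draws from the only observed value in that domain.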

Author(s)

Alexander Kowarik

References

A. Kowarik, M. Templ (2016) Imputation with R package VIM. Journal of Statistical Software, 74(7), 1-16.

Examples

data(sleep)
sleepI <- hotdeck(sleep)
sleepI2 <- hotdeck(sleep,ord_var="BodyWgt",domain_var="Pred")
# Usage of donorcond in a simple example
sleepI3 <- hotdeck(
  sleep,
  variable = c("NonD", "Dream", "Sleep", "Span", "Gest"),
  ord_var = "BodyWgt", domain_var = "Pred",
  donorcond = list(">4", "<17", ">1.5", "%between%c(8,13)", ">5")
)
set.seed(132)
nRows <- 1e3
# Generate a data set with nRows rows and several variables
x <- data.frame(
  x = rnorm(nRows), y = rnorm(nRows),
  z = sample(LETTERS, nRows, replace = TRUE),
  d1 = sample(LETTERS[1:3], nRows, replace = TRUE),
  d2 = sample(LETTERS[1:2], nRows, replace = TRUE),
  o1 = rnorm(nRows), o2 = rnorm(nRows), o3 = rnorm(100)
)
origX <- x
x[sample(1:nRows,nRows/10), 1] <- NA
x[sample(1:nRows,nRows/10), 2] <- NA
x[sample(1:nRows,nRows/10), 3] <- NA
x[sample(1:nRows,nRows/10), 4] <- NA
xImp <- hotdeck(x,ord_var = c("o1", "o2", "o3"), domain_var = "d2")

Iterative EM PCA imputation

Description

Greedy algorithm for EM-PCA including robust methods

Usage

impPCA(
  x,
  method = "classical",
  m = 1,
  eps = 0.5,
  k = ncol(x) - 1,
  maxit = 100,
  boot = FALSE,
  verbose = TRUE
)

Arguments

x

data.frame or matrix

method

"classical" or "mcd" (robust estimation)

m

number of multiple imputations (only if parameter boot equals TRUE)

eps

threshold for convergence

k

number of principal components for reconstruction of x

maxit

maximum number of iterations

boot

residual bootstrap (if TRUE)

verbose

TRUE/FALSE if additional information about the imputation process should be printed

Value

the imputed data set. If boot = FALSE this is a data.frame. If boot = TRUE this is a list where each list element contains a data.frame.
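The general EM-PCA idea (fill in, fit a rank-k PCA, replace the missing cells by their reconstruction, repeat until the change is small) can be sketched in base R as follows. The function em_pca_sketch is hypothetical and covers only a classical variant via svd(); the robust "mcd" option, the bootstrap and the exact convergence criterion of impPCA are not reproduced.

```r
# Hypothetical helper: classical EM-type PCA imputation with k components.
em_pca_sketch <- function(x, k = 1, eps = 1e-10, maxit = 500) {
  x <- as.matrix(x)
  miss <- is.na(x)
  # Start from the column means of the observed values.
  for (j in seq_len(ncol(x))) {
    x[miss[, j], j] <- mean(x[, j], na.rm = TRUE)
  }
  for (it in seq_len(maxit)) {
    mu <- colMeans(x)
    s <- svd(sweep(x, 2, mu))
    # Rank-k reconstruction of the centered data, then add the means back.
    recon <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], k, k) %*%
      t(s$v[, 1:k, drop = FALSE])
    recon <- sweep(recon, 2, mu, "+")
    # Squared change in the imputed cells is used as stopping rule (assumption).
    change <- sum((x[miss] - recon[miss])^2)
    x[miss] <- recon[miss]
    if (change < eps) break
  }
  x
}

# Toy data (hypothetical): b is exactly 2 * a, with one value removed.
toy <- cbind(a = 1:6, b = 2 * (1:6))
toy[3, "b"] <- NA
res_pca <- em_pca_sketch(toy, k = 1)
print(res_pca)
```

Because the toy data lie on a line, the imputed cell is expected to move towards the value on that line as the iterations proceed.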

Author(s)

Matthias Templ

References

Serneels, Sven and Verdonck, Tim (2008). Principal component analysis for data containing outliers and missing elements. Computational Statistics and Data Analysis, Elsevier, vol. 52(3), pages 1712-1727.

Examples

data(Animals, package = "MASS")
Animals$brain[19] <- Animals$brain[19] + 0.01
Animals <- log(Animals)
colnames(Animals) <- c("log(body)", "log(brain)")
Animals_na <- Animals
probs <- abs(Animals$`log(body)`^2)
probs <- rep(0.5, nrow(Animals))
probs[c(6,16,26)] <- 0
set.seed(1234)
Animals_na[sample(1:nrow(Animals), 10, prob = probs), "log(brain)"] <- NA
w <- is.na(Animals_na$`log(brain)`)
impPCA(Animals_na)
impPCA(Animals_na, method = "mcd")
impPCA(Animals_na, boot = TRUE, m = 10)
impPCA(Animals_na, method = "mcd", boot = TRUE)[[1]]
plot(`log(brain)` ~ `log(body)`, data = Animals, type = "n", ylab = "", xlab="")
mtext(text = "impPCA robust", side = 3)
points(Animals$`log(body)`[!w], Animals$`log(brain)`[!w])
points(Animals$`log(body)`[w], Animals$`log(brain)`[w], col = "grey", pch = 17)
imputed <- impPCA(Animals_na, method = "mcd", boot = TRUE)[[1]]
colnames(imputed) <- c("log(body)", "log(brain)")
points(imputed$`log(body)`[w], imputed$`log(brain)`[w], col = "red", pch = 20, cex = 1.4)
segments(x0 = Animals$`log(body)`[w], x1 = imputed$`log(body)`[w], y0 = Animals$`log(brain)`[w],
  y1 = imputed$`log(brain)`[w], lty = 2, col = "grey")
legend("topleft", legend = c("non-missings", "set to missing", "imputed values"),
  pch = c(1,17,20), col = c("black","grey","red"), cex = 0.7)
mape <- round(100* 1/sum(is.na(Animals_na$`log(brain)`)) * sum(abs((Animals$`log(brain)` -
  imputed$`log(brain)`) / Animals$`log(brain)`)), 2)
s2 <- var(Animals$`log(brain)`)
nrmse <- round(sqrt(1/sum(is.na(Animals_na$`log(brain)`)) * sum(abs((Animals$`log(brain)` -
  imputed$`log(brain)`) / s2))), 2)
text(x = 8, y = 1.5, labels = paste("MAPE =", mape))
text(x = 8, y = 0.5, labels = paste("NRMSE =", nrmse))

Robust imputation

Description

Multiple imputation using classical and robust methods accounting for model and imputation uncertainty.

Usage

imputeRobust(
  form,
  data,
  boot = TRUE,
  robustboot = "stratified",
  method = "MM",
  takeAll = TRUE,
  alpha = 0.75,
  uncert = "pmm",
  family = "Gaussian",
  value_back = "all"
)

Arguments

form

Model formulas as a list.

data

Data set to impute

boot

Accounting for model uncertainty with a classical bootstrap, Default: TRUE

robustboot

Accounting for model uncertainty with robust bootstrap methods, Default: 'stratified'

method

Imputation method, Default: 'MM'

takeAll

Missing values are initialized when TRUE, Default: TRUE

alpha

Relative size of good data points. Used for the robust bootstrap methods, Default: 0.75

uncert

Imputation uncertainty method, Default: 'pmm'

family

Not supported and ignored. Foreseen for future versions, Default: 'Gaussian'

value_back

Only observations with imputed values as return object (ymiss), or the whole data set, Default: 'all'

Details

Complex formulas can be provided for each variable in your data set.

Value

Imputed data set.

Examples

## Not run:
if(interactive()){
 #EXAMPLE1
 }
## End(Not run)

FUNCTION_TITLE

Description

FUNCTION_DESCRIPTION

Usage

imputeRobustChain(
  formulas = vector("list", ncol(data)),
  data,
  boot = TRUE,
  robustboot = TRUE,
  method = "lts",
  multinom.method = "multinom",
  takeAll = TRUE,
  eps = 0.5,
  maxit = 4,
  alpha = 0.5,
  uncert = "pmm",
  familiy = "Gaussian",
  value_back = "matrix",
  trace = FALSE
)

Arguments

formulas

PARAM_DESCRIPTION, Default: vector("list", ncol(data))

data

PARAM_DESCRIPTION

boot

PARAM_DESCRIPTION, Default: TRUE

robustboot

PARAM_DESCRIPTION, Default: TRUE

method

PARAM_DESCRIPTION, Default: 'lts'

multinom.method

PARAM_DESCRIPTION, Default: 'multinom'

takeAll

PARAM_DESCRIPTION, Default: TRUE

eps

PARAM_DESCRIPTION, Default: 0.5

maxit

PARAM_DESCRIPTION, Default: 4

alpha

PARAM_DESCRIPTION, Default: 0.5

uncert

PARAM_DESCRIPTION, Default: 'pmm'

familiy

PARAM_DESCRIPTION, Default: 'Gaussian'

value_back

PARAM_DESCRIPTION, Default: 'matrix'

trace

PARAM_DESCRIPTION, Default: FALSE

Details

DETAILS

Value

OUTPUT_DESCRIPTION

Examples

## Not run:
if(interactive()){
 #EXAMPLE1
 }
## End(Not run)

Initialization of missing values

Description

Rough estimation of missing values in a vector according to its type.

Usage

initialise(x, mixed, method = "kNN", mixed.constant = NULL)

Arguments

x

a vector.

mixed

a character vector containing the names of variables of type mixed (semi-continuous).

method

Method used for Initialization (median or kNN)

mixed.constant

vector with length equal to the number of semi-continuous variables specifying the point of the semi-continuous distribution with non-zero probability

Details

Missing values are imputed with the mean for vectors of class "numeric", with the median for vectors of class "integer", and with the mode for vectors of class "factor". Hence, x should be prepared in the following way: assign class "numeric" to numeric vectors, assign class "integer" to ordinal vectors, and assign class "factor" to nominal or binary vectors.
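The class-dependent rule described above can be sketched in base R as shown below. The function init_sketch is hypothetical; it follows the mean/median/mode rule only, ignores the mixed, method and mixed.constant arguments, and rounds a non-integer median as an assumption.

```r
# Hypothetical helper: rough initialisation of NAs depending on the class of x.
init_sketch <- function(x) {
  miss <- is.na(x)
  if (is.factor(x)) {
    # Mode: the most frequent level among the observed values.
    tab <- table(x)
    x[miss] <- names(tab)[which.max(tab)]
  } else if (is.integer(x)) {
    # Median, rounded so that the vector stays integer (assumption).
    x[miss] <- as.integer(round(median(x, na.rm = TRUE)))
  } else {
    # Mean for numeric vectors.
    x[miss] <- mean(x, na.rm = TRUE)
  }
  x
}

num_res <- init_sketch(c(1, NA, 3))
int_res <- init_sketch(c(1L, NA, 5L, 7L))
fac_res <- init_sketch(factor(c("a", "b", NA, "b")))
print(list(num_res, int_res, fac_res))
```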

Value

the initialized vector.

Note

The function is used internally by some imputation algorithms.

Author(s)

Matthias Templ, modifications by Andreas Alfons

Iterative robust model-based imputation (IRMI)

Description

In each step of the iteration, one variable is used as a response variable and the remaining variables serve as the regressors.

Usage

irmi(
  x,
  eps = 5,
  maxit = 100,
  mixed = NULL,
  mixed.constant = NULL,
  count = NULL,
  step = FALSE,
  robust = FALSE,
  takeAll = TRUE,
  noise = TRUE,
  noise.factor = 1,
  force = FALSE,
  robMethod = "lmrob",
  force.mixed = TRUE,
  mi = 1,
  addMixedFactors = FALSE,
  trace = FALSE,
  init.method = "kNN",
  modelFormulas = NULL,
  multinom.method = "multinom",
  imp_var = TRUE,
  imp_suffix = "imp"
)

Arguments

x

data.frame or matrix

eps

threshold for convergence

maxit

maximum number of iterations

mixed

column index of the semi-continuous variables

mixed.constant

vector with length equal to the number of semi-continuous variables specifying the point of the semi-continuous distribution with non-zero probability

count

column index of count variables

step

a stepwise model selection is applied when the parameter is set to TRUE

robust

if TRUE, robust regression methods will be applied

takeAll

takes information of (initialised) missings in the response as well for regression imputation.

noise

irmi has the option to add a random error term to the imputed values; this creates the possibility for multiple imputation. The error term has mean 0 and variance corresponding to the variance of the regression residuals.

noise.factor

amount of noise.

force

if TRUE, the algorithm tries to find a solution in any case, possibly by using different robust methods automatically.

robMethod

regression method when the response is continuous. Default is MM-regression with lmrob.

force.mixed

if TRUE, the algorithm tries to find a solution in any case, possibly by using different robust methods automatically.

mi

number of multiple imputations.

addMixedFactors

if TRUE, add an additional factor variable for each mixed variable as X variable in the regression

trace

Additional information about the iterations when trace equals TRUE.

init.method

Method for initialization of missing values (kNN or median)

modelFormulas

a named list with the names of variables for the rhs of the formulas, which must contain a rhs formula for each variable with missing values; it should look like list(y1=c("x1","x2"),y2=c("x1","x3")) if factor variables for the mixed variables should be created for the regression models

multinom.method

Method for estimating the multinomial models (current default and only available method is multinom)

imp_var

TRUE/FALSE; if TRUE, a TRUE/FALSE variable showing the imputation status is created for each imputed variable

imp_suffix

suffix for the TRUE/FALSE variables showing the imputation status

Details

The method works sequentially and iteratively. The method can deal with a mixture of continuous, semi-continuous, ordinal and nominal variables including outliers.
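The sequential and iterative scheme can be sketched for continuous variables with ordinary least squares in base R. The function iter_reg_sketch below is hypothetical and deliberately reduced: it initialises with medians, fits lm() on the rows where the response is observed, and stops when the imputed values barely change. Robust regression, noise, semi-continuous or categorical responses, and the takeAll behaviour of irmi are not modelled.

```r
# Hypothetical helper: iterative regression imputation for numeric data frames.
iter_reg_sketch <- function(x, eps = 1e-8, maxit = 50) {
  miss <- is.na(x)
  # Initialise all missing cells with the column median.
  for (j in names(x)) {
    x[miss[, j], j] <- median(x[[j]], na.rm = TRUE)
  }
  with_na <- names(x)[colSums(miss) > 0]
  for (it in seq_len(maxit)) {
    old <- x
    for (j in with_na) {
      # Variable j is the response, all remaining variables are regressors.
      f <- reformulate(setdiff(names(x), j), response = j)
      fit <- lm(f, data = x[!miss[, j], , drop = FALSE])
      x[miss[, j], j] <- predict(fit, newdata = x[miss[, j], , drop = FALSE])
    }
    # Stop when the update no longer changes the data noticeably.
    if (sum((as.matrix(x) - as.matrix(old))^2) < eps) break
  }
  x
}

# Toy data (hypothetical) with missing values in two variables.
d <- data.frame(
  x1 = c(1, NA, 3, 4, 5, 6, 7, 8),
  x2 = c(3, 5, 7, 9, 11, NA, 15, 17),
  x3 = c(3, 1, 4, 1, 5, 9, 2, 6)
)
res_reg <- iter_reg_sketch(d)
print(res_reg)
```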

A full description of the method can be found in the mentioned reference.

Value

the imputed data set.

Author(s)

Matthias Templ, Alexander Kowarik

References

M. Templ, A. Kowarik, P. Filzmoser (2011) Iterative stepwise regression imputation using standard and robust methods. Journal of Computational Statistics and Data Analysis, Vol. 55, pp. 2793-2806.

A. Kowarik, M. Templ (2016) Imputation with R package VIM. Journal of Statistical Software, 74(7), 1-16.

Examples

data(sleep)
irmi(sleep)
data(testdata)
imp_testdata1 <- irmi(testdata$wna, mixed = testdata$mixed)
# mixed.constant != 0 (-10)
testdata$wna$m1[testdata$wna$m1 == 0] <- -10
testdata$wna$m2 <- log(testdata$wna$m2 + 0.001)
imp_testdata2 <- irmi(
  testdata$wna,
  mixed = testdata$mixed,
  mixed.constant = c(-10,log(0.001))
)
imp_testdata2$m2 <- exp(imp_testdata2$m2) - 0.001
# example with fixed formulas for the variables with missing
form = list(
  NonD  = c("BodyWgt", "BrainWgt"),
  Dream = c("BodyWgt", "BrainWgt"),
  Sleep = c("BrainWgt"           ),
  Span  = c("BodyWgt"            ),
  Gest  = c("BodyWgt", "BrainWgt")
)
irmi(sleep, modelFormulas = form, trace = TRUE)
# Example with ordered variable
td <- testdata$wna
td$c1 <- as.ordered(td$c1)
irmi(td)

k-Nearest Neighbour Imputation

Description

k-Nearest Neighbour Imputation based on a variation of the Gower Distance for numerical, categorical, ordered and semi-continuous variables.

Usage

kNN(
  data,
  variable = colnames(data),
  k = 5,
  dist_var = colnames(data),
  weights = NULL,
  numFun = median,
  catFun = maxCat,
  makeNA = NULL,
  NAcond = NULL,
  impNA = TRUE,
  donorcond = NULL,
  mixed = vector(),
  mixed.constant = NULL,
  trace = FALSE,
  imp_var = TRUE,
  imp_suffix = "imp",
  addRF = FALSE,
  onlyRF = FALSE,
  addRandom = FALSE,
  useImputedDist = TRUE,
  weightDist = FALSE,
  methodStand = "range",
  ordFun = medianSamp
)

Arguments

data

data.frame or matrix

variable

variables where missing values should be imputed

k

number of Nearest Neighbours used

dist_var

names of variables to be used for distance calculation

weights

weights for the variables for distance calculation. If weights = "auto", weights will be selected based on variable importance from random forest regression, using function ranger::ranger(). Weights are calculated for each variable separately.

numFun

function for aggregating the k Nearest Neighbours in the case of a numerical variable

catFun

function for aggregating the k Nearest Neighbours in the case of a categorical variable

makeNA

list of length equal to the number of variables, with values, that should be converted to NA for each variable

NAcond

list of length equal to the number of variables, with a condition for imputing a NA

impNA

TRUE/FALSE whether NA should be imputed

donorcond

list of length equal to the number of variables, with a donorcond condition as character string, e.g. a list element can be ">5" or c(">5","<10"). If the list element for a variable is NULL, no condition will be applied for this variable.

mixed

names of mixed variables

mixed.constant

vector with length equal to the number of semi-continuous variables specifying the point of the semi-continuous distribution with non-zero probability

trace

TRUE/FALSE if additional information about the imputation process should be printed

imp_var

TRUE/FALSE; if TRUE, a TRUE/FALSE variable showing the imputation status is created for each imputed variable

imp_suffix

suffix for the TRUE/FALSE variables showing the imputation status

addRF

TRUE/FALSE each variable will be modelled using random forest regression (ranger::ranger()) and used as additional distance variable.

onlyRF

TRUE/FALSE if TRUE only additional distance variables created from random forest regression will be used as distance variables.

addRandom

TRUE/FALSE if an additional random variable should be added for distance calculation

useImputedDist

TRUE/FALSE if an imputed value should be used for distance calculation for imputing another variable. Be aware that this results in a dependency on the ordering of the variables.

weightDist

TRUE/FALSE if the distances of the k nearest neighbours should be used as weights in the aggregation step

methodStand

either "range" or "iqr" to be used in the standardization of numeric variables in the Gower distance

ordFun

function for aggregating the k Nearest Neighbours in the case of an ordered factor variable

Value

the imputed data set.
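The general idea behind k-nearest-neighbour imputation of a single numeric variable can be sketched in base R. The sketch below is only an illustration under simplifying assumptions (complete numeric predictors, range standardization in the spirit of methodStand = "range", aggregation with the median as in the default numFun); it is not the implementation used by kNN(), and the helper name knn_impute_sketch as well as the toy data are hypothetical.

```r
# Hypothetical helper (not part of VIM): impute NAs in `target` using the
# k nearest donors, with distances on range-standardized numeric predictors.
# Assumes the predictors themselves contain no missing values.
knn_impute_sketch <- function(data, target, predictors, k = 5, agg = median) {
  x <- as.matrix(data[, predictors, drop = FALSE])
  # Divide each predictor by its range so that all contribute comparably
  rng <- apply(x, 2, function(v) diff(range(v)))
  rng[rng == 0] <- 1
  x <- sweep(x, 2, rng, "/")
  y <- data[[target]]
  donors <- which(!is.na(y))
  for (i in which(is.na(y))) {
    # Mean absolute difference to every donor (a Gower-type distance)
    d <- rowMeans(abs(sweep(x[donors, , drop = FALSE], 2, x[i, ], "-")))
    nearest <- donors[order(d)[seq_len(min(k, length(donors)))]]
    # Aggregate the donor values, here with the median
    data[[target]][i] <- agg(y[nearest])
  }
  data
}

toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(10, 20, 30, 40, 50, 60),
  y = c(1, 2, NA, 4, 5, 6)
)
imputed <- knn_impute_sketch(toy, target = "y", predictors = c("a", "b"), k = 2)
imputed$y
```

In this toy example the two nearest donors of the third observation are its direct neighbours, so the imputed value is the median of their y values.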

Author(s)

Alexander Kowarik, Statistik Austria

References

A. Kowarik, M. Templ (2016) Imputation with R package VIM. Journal of Statistical Software, 74(7), 1-16.

Examples

data(sleep)
kNN(sleep)
library(laeken)
kNN(sleep, numFun = weightedMean, weightDist = TRUE)

Background map for the Kola project data

Description

Coordinates of the Kola background map.

Source

Kola Project (1993-1998)

References

Reimann, C., Filzmoser, P., Garrett, R.G. and Dutter, R. (2008) Statistical Data Analysis Explained: Applied Environmental Statistics with R. Wiley, 2008.

Examples

data(kola.background, package = "VIM")
bgmap(kola.background)

Map with information about missing/imputed values

Description

Map of observed and missing/imputed values.

Usage

mapMiss(
  x,
  coords,
  map,
  delimiter = NULL,
  selection = c("any", "all"),
  col = c("skyblue", "red", "orange"),
  alpha = NULL,
  pch = c(19, 15),
  col.map = grey(0.5),
  legend = TRUE,
  interactive = TRUE,
  ...
)

Arguments

x

a vector, matrix or data.frame.

coords

a data.frame or matrix with two columns giving the spatial coordinates of the observations.

map

a background map to be passed to bgmap().

delimiter

selection

the selection method for displaying missing/imputed values in the map. Possible values are "any" (display missing/imputed values in any variable) and "all" (display missing/imputed values in all variables).

col

a vector of length three giving the colors to be used for observed, missing and imputed values. If a single color is supplied, it is used for all values.

alpha

a numeric value between 0 and 1 giving the level of transparency of the colors, or NULL. This can be used to prevent overplotting.

pch

a vector of length two giving the plot characters to be used for observed and missing/imputed values. If a single plot character is supplied, it will be used for both.

col.map

the color to be used for the background map.

legend

a logical indicating whether a legend should be plotted.

interactive

a logical indicating whether information about selected observations can be displayed interactively (see ‘Details’).

...

further graphical parameters to be passed to bgmap() and graphics::points().

Details

Author(s)

Matthias Templ, Andreas Alfons, modifications by Bernd Prantner

References

M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.

Examples

data(chorizonDL, package = "VIM")
data(kola.background, package = "VIM")
coo <- chorizonDL[, c("XCOO", "YCOO")]
## for missing values
x <- chorizonDL[, c("As", "Bi")]
mapMiss(x, coo, kola.background)
## for imputed values
x_imp <- kNN(chorizonDL[, c("As", "Bi")])
mapMiss(x_imp, coo, kola.background, delimiter = "_imp")

Marginplot Matrix

Description

Create a scatterplot matrix with information about missing/imputed values in the plot margins of each panel.

Usage

marginmatrix(
  x,
  delimiter = NULL,
  col = c("skyblue", "red", "red4", "orange", "orange4"),
  alpha = NULL,
  ...
)

Arguments

x

a matrix or data.frame.

delimiter

col

a vector of length five giving the colors to be used in the marginplots in the off-diagonal panels. The first color is used for the scatterplot and the boxplots for the available data, the second/fourth color for the univariate scatterplots and boxplots for the missing/imputed values in one variable, and the third/fifth color for the frequency of missing/imputed values in both variables (see ‘Details’). If only one color is supplied, it is used for the bivariate and univariate scatterplots and the boxplots for missing/imputed values in one variable, whereas the boxplots for the available data are transparent. Else if two colors are supplied, the second one is recycled.

alpha

a numeric value between 0 and 1 giving the level of transparency of the colors, or NULL. This can be used to prevent overplotting.

...

further arguments and graphical parameters to be passed to pairsVIM() and marginplot(). par("oma") will be set appropriately unless supplied (see graphics::par()).

Details

marginmatrix uses pairsVIM() with a panel function based on marginplot().

The graphical parameter oma will be set unless supplied as an argument.

Author(s)

Andreas Alfons, modifications by Bernd Prantner

References

M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.

Examples

data(sleep, package = "VIM")
## for missing values
x <- sleep[, 1:5]
x[,c(1,2,4)] <- log10(x[,c(1,2,4)])
marginmatrix(x)
## for imputed values
x_imp <- kNN(sleep[, 1:5])
x_imp[,c(1,2,4)] <- log10(x_imp[,c(1,2,4)])
marginmatrix(x_imp, delimiter = "_imp")

Scatterplot with additional information in the margins

Description

In addition to a standard scatterplot, information about missing/imputed values is shown in the plot margins. Furthermore, imputed values are highlighted in the scatterplot.

Usage

marginplot(
  x,
  delimiter = NULL,
  col = c("skyblue", "red", "red4", "orange", "orange4"),
  alpha = NULL,
  pch = c(1, 16),
  cex = par("cex"),
  numbers = TRUE,
  cex.numbers = par("cex"),
  zeros = FALSE,
  xlim = NULL,
  ylim = NULL,
  main = NULL,
  sub = NULL,
  xlab = NULL,
  ylab = NULL,
  ann = par("ann"),
  axes = TRUE,
  frame.plot = axes,
  ...
)

Arguments

x

a matrix or data.frame with two columns.

delimiter

col

a vector of length five giving the colors to be used in the plot. The first color is used for the scatterplot and the boxplots for the available data. In case of missing values, the second color is taken for the univariate scatterplots and boxplots for missing values in one variable and the third for the frequency of missing/imputed values in both variables (see ‘Details’). Otherwise, in case of imputed values, the fourth color is used for the highlighting, the frequency, the univariate scatterplot and the boxplots of imputed values in the first variable and the fifth color for the same applied to the second variable. A black color is used for the highlighting and the frequency of imputed values in both variables instead. If only one color is supplied, it is used for the bivariate and univariate scatterplots and the boxplots for missing/imputed values in one variable, whereas the boxplots for the available data are transparent. Else if two colors are supplied, the second one is recycled.

alpha

a numeric value between 0 and 1 giving the level of transparency of the colors, or NULL. This can be used to prevent overplotting.

pch

a vector of length two giving the plot symbols to be used for the scatterplot and the univariate scatterplots. If a single plot character is supplied, it is used for the scatterplot and the default value will be used for the univariate scatterplots (see ‘Details’).

cex

the character expansion factor to be used for the bivariate and univariate scatterplots.

numbers

a logical indicating whether the frequencies of missing/imputed values should be displayed in the lower left of the plot (see ‘Details’).

cex.numbers

the character expansion factor to be used for the frequencies of the missing/imputed values.

zeros

a logical vector of length two indicating whether the variables are semi-continuous, i.e., contain a considerable amount of zeros. If TRUE, only the non-zero observations are used for drawing the respective boxplot. If a single logical is supplied, it is recycled.

xlim,ylim

axis limits.

main,sub

main and sub title.

xlab,ylab

axis labels.

ann

a logical indicating whether plot annotation (main, sub, xlab, ylab) should be displayed.

axes

a logical indicating whether both axes should be drawn on the plot. Use graphical parameter "xaxt" or "yaxt" to suppress only one of the axes.

frame.plot

a logical indicating whether a box should be drawn around the plot.

...

further graphical parameters to be passed down (see graphics::par()).

Details

Boxplots for available and missing/imputed data, as well as univariate scatterplots for missing/imputed values in one variable are shown in the plot margins.

Imputed values in either of the variables are highlighted in the scatterplot.

Furthermore, the frequencies of the missing/imputed values can be displayed by a number (lower left of the plot). The number in the lower left corner is the number of observations that are missing/imputed in both variables.
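The frequencies mentioned above can be reproduced by hand for the missing-value case. The following base R sketch counts, for two hypothetical toy variables x and y, the observations that are missing in only one variable and those missing in both. It is merely meant to illustrate what the displayed numbers represent and makes no claim about how marginplot() computes them internally.

```r
# Toy data with missing values in both variables (hypothetical example)
toy <- data.frame(
  x = c(1, NA, 3, NA, 5, 6),
  y = c(NA, NA, 2, 4, 5, NA)
)

miss_x <- is.na(toy$x)
miss_y <- is.na(toy$y)

# Frequencies of missing values in one variable only and in both variables
miss_counts <- c(
  only_x = sum(miss_x & !miss_y),
  only_y = sum(!miss_x & miss_y),
  both   = sum(miss_x & miss_y)
)
miss_counts
```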

Note

Some of the argument names and positions have changed with versions 1.3 and 1.4 due to extended functionality and for more consistency with other plot functions in VIM. For backward compatibility, the argument cex.text can still be supplied to ... and is handled correctly. Nevertheless, it is deprecated and no longer documented. Use cex.numbers instead.

Author(s)

Andreas Alfons, Matthias Templ, modifications by Bernd Prantner

References

M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.

Examples

data(tao, package = "VIM")
data(chorizonDL, package = "VIM")
## for missing values
marginplot(tao[,c("Air.Temp", "Humidity")])
marginplot(log10(chorizonDL[,c("CaO", "Bi")]))
## for imputed values
marginplot(kNN(tao[,c("Air.Temp", "Humidity")]), delimiter = "_imp")
marginplot(kNN(log10(chorizonDL[,c("CaO", "Bi")])), delimiter = "_imp")

Fast matching/imputation based on categorical variable

Description

Suitable donors are searched based on matching of the categorical variables. The variables are dropped in reversed order, so that the last element of 'match_var' is dropped first and the first element of the vector is dropped last.

Usage

matchImpute(
  data,
  variable = colnames(data)[!colnames(data) %in% match_var],
  match_var,
  imp_var = TRUE,
  imp_suffix = "imp"
)

Arguments

data

data.frame, data.table or matrix

variable

variables to be imputed

match_var

variables used for matching

imp_var

TRUE/FALSE if a TRUE/FALSE variable should be created for each imputed variable to show the imputation status

imp_suffix

suffix for the TRUE/FALSE variables showing the imputation status

Details

The method works by sampling values from the suitable donors.
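The matching logic described above can be sketched in base R as follows. This is a simplified illustration, not the data.table-based implementation of matchImpute(): the helper name match_impute_sketch and the toy data are hypothetical, only one target variable is handled, and the matching variables are assumed to be complete.

```r
# Hypothetical helper (not part of VIM): impute `target` by sampling from
# donors that agree on the matching variables. If no donor agrees, matching
# variables are dropped from the end of `match_var` until donors are found.
match_impute_sketch <- function(data, target, match_var) {
  y <- data[[target]]
  for (i in which(is.na(y))) {
    vars <- match_var
    repeat {
      # Donors: observed target and identical values in all current `vars`
      ok <- !is.na(y)
      for (v in vars) ok <- ok & data[[v]] == data[[v]][i]
      if (any(ok) || length(vars) == 0) break
      vars <- vars[-length(vars)]  # drop the last matching variable first
    }
    pool <- y[ok]
    if (length(pool) > 0) {
      # Sample one value from the suitable donors
      data[[target]][i] <- pool[sample.int(length(pool), 1)]
    }
  }
  data
}

set.seed(1)
toy <- data.frame(
  g1 = c("a", "a", "a", "b", "b"),
  g2 = c("x", "x", "y", "x", "z"),
  y  = c(10, NA, 30, 40, NA)
)
res <- match_impute_sketch(toy, target = "y", match_var = c("g1", "g2"))
res
```

In the toy data, the second row finds a donor that matches on both g1 and g2, whereas the last row only finds a donor after g2 has been dropped.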

Value

the imputed data set.

Author(s)

Johannes Gussenbauer, Alexander Kowarik

Examples

data(sleep, package = "VIM")
imp_data <- matchImpute(sleep, variable = c("NonD","Dream","Sleep","Span","Gest"),
  match_var = c("Exp","Danger"))
data(testdata, package = "VIM")
imp_testdata1 <- matchImpute(testdata$wna, match_var = c("c1","c2","b1","b2"))
dt <- data.table::data.table(testdata$wna)
imp_testdata2 <- matchImpute(dt, match_var = c("c1","c2","b1","b2"))

Matrix plot

Description

Create a matrix plot, in which all cells of a data matrix are visualized by rectangles. Available data is coded according to a continuous color scheme, while missing/imputed data is visualized by a clearly distinguishable color.

Usage

matrixplot(
  x,
  delimiter = NULL,
  sortby = NULL,
  col = c("red", "orange"),
  fixup = TRUE,
  xlim = NULL,
  ylim = NULL,
  main = NULL,
  sub = NULL,
  xlab = NULL,
  ylab = NULL,
  axes = TRUE,
  labels = axes,
  xpd = NULL,
  interactive = TRUE,
  ...
)

Arguments

x

a matrix or data.frame.

delimiter

sortby

a numeric or character value specifying the variable to sort the data matrix by, or NULL to plot without sorting.

col

the colors to be used in the plot. RGB colors may be specified as character strings or as objects of class "colorspace::RGB()". HCL colors need to be specified as objects of class "colorspace::polarLUV()". If only one color is supplied, it is used for missing and imputed data and a greyscale is used for available data. If two colors are supplied, the first is used for missing and the second for imputed data and a greyscale for available data. If three colors are supplied, the first is used as end color for the available data, while the start color is taken to be transparent for RGB or white for HCL. Missing/imputed data is visualized by the second/third color in this case. If four colors are supplied, the first is used as start color and the second as end color for the available data, while the third/fourth color is used for missing/imputed data.

fixup

a logical indicating whether the colors should be corrected to valid RGB values (see colorspace::hex()).

xlim,ylim

axis limits.

main,sub

main and sub title.

xlab,ylab

axis labels.

axes

a logical indicating whether axes should be drawn on the plot.

labels

either a logical indicating whether labels should be plotted below each column, or a character vector giving the labels.

xpd

a logical indicating whether the rectangles should be allowed to go outside the plot region. If NULL, it defaults to TRUE unless axis limits are specified.

interactive

a logical indicating whether a variable to be used for sorting can be selected interactively (see ‘Details’).

...

for matrixplot and iimagMiss, further graphical parameters to be passed to graphics::plot.window(), graphics::title() and graphics::axis(). For TKRmatrixplot, further arguments to be passed to matrixplot.

Details

In a matrix plot, all cells of a data matrix are visualized by rectangles. Available data is coded according to a continuous color scheme. To compute the colors via interpolation, the variables are first scaled to the interval between 0 and 1. Missing/imputed values can then be visualized by a clearly distinguishable color. It is thereby possible to use colors in the HCL or RGB color space. A simple way of visualizing the magnitude of the available data is to apply a greyscale, which has the advantage that missing/imputed values can easily be distinguished by using a color such as red/orange. Note that -Inf and Inf are always assigned the begin and end color, respectively, of the continuous color scheme.
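The color coding described above can be illustrated with a small base R sketch. It scales one variable to the interval between 0 and 1, maps the scaled values to a greyscale and uses a separate color for missing cells. The helper names scale01 and cell_colors are hypothetical, and the direction of the greyscale (light for small values, dark for large values) is an assumption made for this illustration rather than a statement about the internals of matrixplot().

```r
# Hypothetical helper: scale a numeric vector to [0, 1] based on its finite
# values; -Inf and Inf are mapped to the two ends of the scale
scale01 <- function(v) {
  r <- range(v[is.finite(v)])
  out <- (v - r[1]) / (r[2] - r[1])
  out[which(v == -Inf)] <- 0
  out[which(v == Inf)] <- 1
  out
}

# Hypothetical helper: greyscale for available cells (assumed here: white for
# the smallest, black for the largest value), a distinct color for missing cells
cell_colors <- function(v, col_miss = "red") {
  s <- scale01(v)
  cols <- rep(col_miss, length(v))
  obs <- !is.na(s)
  cols[obs] <- grey(1 - s[obs])
  cols
}

v <- c(2, 4, NA, 6, Inf)
cols <- cell_colors(v)
cols
```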

Additionally, the observations can be sorted by the magnitude of a selected variable. If interactive is TRUE, clicking in a column redraws the plot with observations sorted by the corresponding variable. Clicking anywhere outside the plot region quits the interactive session.

Note

This is a much more powerful extension to the function imagmiss in the former CRAN package dprep.

iimagMiss is deprecated and may be omitted in future versions of VIM. Use matrixplot instead.

Author(s)

Andreas Alfons, Matthias Templ, modifications by Bernd Prantner

References

M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.

Examples

data(sleep, package = "VIM")
## for missing values
x <- sleep[, -(8:10)]
x[,c(1,2,4,6,7)] <- log10(x[,c(1,2,4,6,7)])
matrixplot(x, sortby = "BrainWgt")
## for imputed values
x_imp <- kNN(sleep[, -(8:10)])
x_imp[,c(1,2,4,6,7)] <- log10(x_imp[,c(1,2,4,6,7)])
matrixplot(x_imp, delimiter = "_imp", sortby = "BrainWgt")

Aggregation function for a factor variable

Description

The function maxCat chooses the level with the most occurrences, and chooses randomly among the most frequent levels if the maximum is not unique.

Usage

maxCat(x, weights = NULL)

Arguments

x

factor vector

weights

numeric vector providing weights for the observations in x
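The behaviour described for maxCat can be sketched in base R as follows. This is a hypothetical re-implementation for illustration only (the name max_cat_sketch is not part of VIM), and the way the weights enter, namely as a sum of weights per level, is an assumption.

```r
# Hypothetical re-implementation: level with the largest (weighted) frequency,
# ties broken at random
max_cat_sketch <- function(x, weights = NULL) {
  if (is.null(weights)) weights <- rep(1, length(x))
  # Sum of weights per level (plain counts if all weights are 1)
  totals <- tapply(weights, x, sum)
  totals[is.na(totals)] <- 0  # levels that do not occur
  best <- names(totals)[totals == max(totals)]
  if (length(best) > 1) best <- sample(best, 1)
  factor(best, levels = levels(x))
}

x <- factor(c("a", "b", "b", "c"), levels = c("a", "b", "c", "d"))
max_cat_sketch(x)                             # "b" occurs most often
max_cat_sketch(x, weights = c(5, 1, 1, 1))    # "a" has the largest weight sum
```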

Aggregation function for an ordinal variable

Description

The function medianSamp chooses the level as the median or randomly between two levels.

Usage

medianSamp(x, weights = NULL)

Arguments

x

ordered factor vector

weights

numeric vector providing weights for the observations in x
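A minimal base R sketch of this kind of aggregation is given below. It covers only the unweighted case, the name median_samp_sketch is hypothetical, and the handling of an even number of values (sampling one of the two middle levels) is an assumption about what "randomly between two levels" means.

```r
# Hypothetical re-implementation for an ordered factor (unweighted case)
median_samp_sketch <- function(x) {
  codes <- sort(as.integer(x))  # integer codes of the levels, NAs dropped
  n <- length(codes)
  if (n %% 2 == 1) {
    pick <- codes[(n + 1) / 2]
  } else {
    # Even number of values: choose one of the two middle levels at random
    mid <- codes[c(n / 2, n / 2 + 1)]
    pick <- mid[sample.int(2, 1)]
  }
  factor(levels(x)[pick], levels = levels(x), ordered = TRUE)
}

lv <- c("low", "medium", "high")
x_odd <- factor(c("low", "high", "medium", "high", "low"),
                levels = lv, ordered = TRUE)
median_samp_sketch(x_odd)

x_even <- factor(c("low", "low", "high", "high"), levels = lv, ordered = TRUE)
median_samp_sketch(x_even)  # either "low" or "high"
```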

Mosaic plot with information about missing/imputed values

Description

Create a mosaic plot with information about missing/imputed values.

Usage

mosaicMiss(
  x,
  delimiter = NULL,
  highlight = NULL,
  selection = c("any", "all"),
  plotvars = NULL,
  col = c("skyblue", "red", "orange"),
  labels = NULL,
  miss.labels = TRUE,
  ...
)

Arguments

x

a matrix or data.frame.

delimiter

highlight

a vector giving the variables to be used for highlighting. If NULL (the default), all variables are used for highlighting.

selection

the selection method for highlighting missing/imputed values in multiple highlight variables. Possible values are "any" (highlighting of missing/imputed values in any of the highlight variables) and "all" (highlighting of missing/imputed values in all of the highlight variables).

plotvars

a vector giving the categorical variables to be plotted. If NULL (the default), all variables are plotted.

col

a vector of length three giving the colors to be used for observed, missing and imputed data. If only one color is supplied, the tiles corresponding to observed data are transparent and the supplied color is used for highlighting.

labels

a list of arguments for the labeling function vcd::labeling_border().

miss.labels

either a logical indicating whether labels should be plotted for observed and missing/imputed (highlighted) data, or a character vector giving the labels.

...

additional arguments to be passed to vcd::mosaic().

Details

Mosaic plots are graphical representations of multi-way contingency tables. The frequencies of the different cells are visualized by area-proportional rectangles (tiles). Additional tiles are used to display the frequencies of missing/imputed values. Furthermore, missing/imputed values in a certain variable or combination of variables can be highlighted in order to explore their structure.

Value

An object of class "structable" is returned invisibly.

Note

This function uses the highly flexible strucplot framework of package vcd.

Author(s)

Andreas Alfons, modifications by Bernd Prantner

References

Meyer, D., Zeileis, A. and Hornik, K. (2006) The strucplot framework: Visualizing multi-way contingency tables with vcd. Journal of Statistical Software, 17 (3), 1–48.

M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.

Examples

data(sleep, package = "VIM")
## for missing values
mosaicMiss(sleep, highlight = 4,
    plotvars = 8:10, miss.labels = FALSE)
## for imputed values
mosaicMiss(kNN(sleep), highlight = 4,
    plotvars = 8:10, delimiter = "_imp", miss.labels = FALSE)

Scatterplot Matrices

Description

Create a scatterplot matrix.

Usage

pairsVIM(
  x,
  ...,
  delimiter = NULL,
  main = NULL,
  sub = NULL,
  panel = points,
  lower = panel,
  upper = panel,
  diagonal = NULL,
  labels = TRUE,
  pos.labels = NULL,
  cex.labels = NULL,
  font.labels = par("font"),
  layout = c("matrix", "graph"),
  gap = 1
)

Arguments

x

a matrix or data.frame.

...

further arguments and graphical parameters to be passed down. par("oma") will be set appropriately unless supplied (see graphics::par()).

delimiter

main,sub

main and sub title.

panel

a function(x, y, ...), which is used to plot the contents of each off-diagonal panel of the display.

lower,upper

separate panel functions to be used below and above the diagonal, respectively.

diagonal

optional function(x, ...) to be applied on the diagonal panels.

labels

either a logical indicating whether labels should be plotted in the diagonal panels, or a character vector giving the labels.

pos.labels

the vertical position of the labels in the diagonal panels.

cex.labels

the character expansion factor to be used for the labels.

font.labels

the font to be used for the labels.

layout

a character string giving the layout of the scatterplot matrix. Possible values are "matrix" (a matrix-like layout with the first row on top) and "graph" (a graph-like layout with the first row at the bottom).

gap

a numeric value giving the distance between the panels in margin lines.

Details

This function is the workhorse for marginmatrix() and scattmatrixMiss().

The graphical parameter oma will be set unless supplied as an argument.

A panel function should not attempt to start a new plot, since the coordinate system for each panel is set up by pairsVIM.

Note

The code is based on graphics::pairs(). Starting with version 1.4, infinite values are no longer removed before passing the x and y vectors to the panel functions.

Author(s)

Andreas Alfons, modifications by Bernd Prantner

References

M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.

Examples

data(sleep, package = "VIM")
x <- sleep[, -(8:10)]
x[,c(1,2,4,6,7)] <- log10(x[,c(1,2,4,6,7)])
pairsVIM(x)

Parallel coordinate plot with information about missing/imputed values

Description

Parallel coordinate plot with adjustments for missing/imputed values. Missing values in the plotted variables may be represented by a point above the corresponding coordinate axis to prevent disconnected lines. In addition, observations with missing/imputed values in selected variables may be highlighted.

Usage

parcoordMiss(
  x,
  delimiter = NULL,
  highlight = NULL,
  selection = c("any", "all"),
  plotvars = NULL,
  plotNA = TRUE,
  col = c("skyblue", "red", "skyblue4", "red4", "orange", "orange4"),
  alpha = NULL,
  lty = par("lty"),
  xlim = NULL,
  ylim = NULL,
  main = NULL,
  sub = NULL,
  xlab = NULL,
  ylab = NULL,
  labels = TRUE,
  xpd = NULL,
  interactive = TRUE,
  ...
)

Arguments

x

a matrix or data.frame.

delimiter

highlight

a vector giving the variables to be used for highlighting. If NULL (the default), all variables are used for highlighting.

selection

plotvars

a vector giving the variables to be plotted. If NULL (the default), all variables are plotted.

plotNA

a logical indicating whether missing values in the plot variables should be represented by a point above the corresponding coordinate axis to prevent disconnected lines.

col

if plotNA is TRUE, a vector of length six giving the colors to be used for observations with different combinations of observed and missing/imputed values in the plot variables and highlight variables (vectors of length one or two are recycled). Otherwise, a vector of length two giving the colors for non-highlighted and highlighted observations (if a single color is supplied, it is used for both).

alpha

a numeric value between 0 and 1 giving the level of transparency of the colors, or NULL. This can be used to prevent overplotting.

lty

if plotNA is TRUE, a vector of length four giving the line types to be used for observations with different combinations of observed and missing/imputed values in the plot variables and highlight variables (vectors of length one or two are recycled). Otherwise, a vector of length two giving the line types for non-highlighted and highlighted observations (if a single line type is supplied, it is used for both).

xlim,ylim

axis limits.

main,sub

main and sub title.

xlab,ylab

axis labels.

labels

either a logical indicating whether labels should be plotted below each coordinate axis, or a character vector giving the labels.

xpd

a logical indicating whether the lines should be allowed to go outside the plot region. If NULL, it defaults to TRUE unless axis limits are specified.

interactive

a logical indicating whether interactive features should be enabled (see ‘Details’).

...

for parcoordMiss, further graphical parameters to be passed down (see graphics::par()). For TKRparcoordMiss, further arguments to be passed to parcoordMiss.

Details

In parallel coordinate plots, the variables are represented by parallel axes. Each observation of the scaled data is shown as a line. Observations with missing/imputed values in selected variables may thereby be highlighted. However, plotting variables with missing values results in disconnected lines, making it impossible to trace the respective observations across the graph. As a remedy, missing values may be represented by a point above the corresponding coordinate axis, which is separated from the main plot by a small gap and a horizontal line, as determined by plotNA. Connected lines can then be drawn for all observations. Nevertheless, a caveat of this display is that it may draw attention away from the main relationships between the variables.

If interactive is TRUE, it is possible to switch between this display and the standard display without the separate level for missing values by clicking in the top margin of the plot. In addition, the variables to be used for highlighting can be selected interactively. Observations with missing/imputed values in any or in all of the selected variables are highlighted (as determined by selection). A variable can be added to the selection by clicking on a coordinate axis. If a variable is already selected, clicking on its coordinate axis removes it from the selection. Clicking anywhere outside the plot region (except the top margin, if missing/imputed values exist) quits the interactive session.

Note

Some of the argument names and positions have changed with versions 1.3 and 1.4 due to extended functionality and for more consistency with other plot functions in VIM. For backward compatibility, the arguments colcomb and xaxlabels can still be supplied to ... and are handled correctly. Nevertheless, they are deprecated and no longer documented. Use highlight and labels instead.

Author(s)

Andreas Alfons, Matthias Templ, modifications by Bernd Prantner

References

Wegman, E. J. (1990) Hyperdimensional data analysis using parallel coordinates. Journal of the American Statistical Association 85 (411), 664–675.

M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.

Examples

data(chorizonDL, package = "VIM")
## for missing values
parcoordMiss(chorizonDL[,c(15,101:110)],
    plotvars = 2:11, interactive = FALSE)
legend("top", col = c("skyblue", "red"), lwd = c(1,1),
    legend = c("observed in Bi", "missing in Bi"))
## for imputed values
parcoordMiss(kNN(chorizonDL[,c(15,101:110)]), delimiter = "_imp",
    plotvars = 2:11, interactive = FALSE)
legend("top", col = c("skyblue", "orange"), lwd = c(1,1),
    legend = c("observed in Bi", "imputed in Bi"))

Parallel boxplots with information about missing/imputed values

Description

Boxplot of one variable of interest plus information about missing/imputed values in other variables.

Usage

pbox(
  x,
  delimiter = NULL,
  pos = 1,
  selection = c("none", "any", "all"),
  col = c("skyblue", "red", "red4", "orange", "orange4"),
  numbers = TRUE,
  cex.numbers = par("cex"),
  xlim = NULL,
  ylim = NULL,
  main = NULL,
  sub = NULL,
  xlab = NULL,
  ylab = NULL,
  axes = TRUE,
  frame.plot = axes,
  labels = axes,
  interactive = TRUE,
  ...
)

Arguments

x

a vector, matrix or data.frame.

delimiter

pos

a numeric value giving the index of the variable of interest. Additional variables in x are used for grouping according to missingness/number of imputed missings.

selection

the selection method for grouping according to missingness/number of imputed missings in multiple additional variables. Possible values are "none" (grouping according to missingness/number of imputed missings in every other variable that contains missing/imputed values), "any" (grouping according to missingness/number of imputed missings in any of the additional variables) and "all" (grouping according to missingness/number of imputed missings in all of the additional variables).

col

a vector of length five giving the colors to be used in the plot. The first color is used for the boxplots of the available data, the second/fourth are used for missing/imputed data, respectively, and the third/fifth color for the frequencies of missing/imputed values in both variables (see ‘Details’). If only one color is supplied, it is used for the boxplots for missing/imputed data, whereas the boxplots for the available data are transparent. Else if two colors are supplied, the second one is recycled.

numbers

a logical indicating whether the frequencies of missing/imputed values should be displayed (see ‘Details’).

cex.numbers

the character expansion factor to be used for the frequencies of the missing/imputed values.

xlim,ylim

axis limits.

main,sub

main and sub title.

xlab,ylab

axis labels.

axes

a logical indicating whether axes should be drawn on the plot.

frame.plot

a logical indicating whether a box should be drawn around the plot.

labels

either a logical indicating whether labels should be plotted below each box, or a character vector giving the labels.

interactive

a logical indicating whether variables can be switched interactively (see ‘Details’).

...

for pbox, further arguments and graphical parameters to be passed to graphics::boxplot() and other functions. For TKRpbox, further arguments to be passed to pbox.

Details

This plot consists of several boxplots. First, a standard boxplot of the variable of interest is produced. Second, boxplots grouped by observed and missing/imputed values according to selection are produced for the variable of interest.

Additionally, the frequencies of the missing/imputed values can be represented by numbers. If so, the first line corresponds to the observed values of the variable of interest and their distribution in the different groups, the second line to the missing/imputed values.
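The grouping that underlies such a display can be illustrated in base R for the missing-value case with one additional variable. In the sketch below, the toy variables y (variable of interest) and z (additional variable) are hypothetical; the code only computes the boxplot statistics per group and the two lines of frequencies, and it is not meant to reproduce pbox() itself.

```r
# Toy data: `y` is the variable of interest, `z` contains missing values
toy <- data.frame(
  y = c(1, 2, 3, 4, 5, 6, NA, 8),
  z = c(NA, 1, 1, NA, 1, 1, NA, NA)
)

# Status of each observation in `z` and in `y`
z_status <- factor(ifelse(is.na(toy$z), "missing", "observed"),
                   levels = c("observed", "missing"))
y_status <- factor(ifelse(is.na(toy$y), "missing", "observed"),
                   levels = c("observed", "missing"))

# Five-number summaries of `y`, grouped by missingness in `z`
box_stats <- lapply(split(toy$y, z_status),
                    function(v) boxplot.stats(v[!is.na(v)])$stats)
box_stats

# Frequencies: first row for observed `y`, second row for missing `y`,
# columns for the groups defined by `z`
freq <- table(y = y_status, z = z_status)
freq
```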

Value

a list as returned by graphics::boxplot().

Note

Some of the argument names and positions have changed with version 1.3 due to extended functionality and for more consistency with other plot functions in VIM. For backward compatibility, the arguments names and cex.text can still be supplied to ... and are handled correctly. Nevertheless, they are deprecated and no longer documented. Use labels and cex.numbers instead.

Author(s)

Andreas Alfons, Matthias Templ, modifications by Bernd Prantner

References

M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.

Examples

data(chorizonDL, package = "VIM")
## for missing values
pbox(log(chorizonDL[, c(4,5,8,10,11,16:17,19,25,29,37,38,40)]))
## for imputed values
pbox(kNN(log(chorizonDL[, c(4,8,10,11,17,19,25,29,37,38,40)])),
     delimiter = "_imp")

Transformation and standardization

Description

This function is used by the VIM GUI for transformation and standardization of the data.

Usage

prepare(
  x,
  scaling = c("none", "classical", "MCD", "robust", "onestep"),
  transformation = c("none", "minus", "reciprocal", "logarithm", "exponential", "boxcox",
    "clr", "ilr", "alr"),
  alpha = NULL,
  powers = NULL,
  start = 0,
  alrVar
)

Arguments

x

a vector, matrix or data.frame.

scaling

the scaling to be applied to the data. Possible values are "none", "classical", "MCD", "robust" and "onestep".

transformation

the transformation of the data. Possible values are "none", "minus", "reciprocal", "logarithm", "exponential", "boxcox", "clr", "ilr" and "alr".

alpha

a numeric parameter controlling the size of the subset for the MCD (if scaling="MCD"). See robustbase::covMcd().

powers

a numeric vector giving the powers to be used in the Box-Cox transformation (if transformation="boxcox"). If NULL, the powers are calculated with function car::powerTransform().

start

a constant to be added prior to Box-Cox transformation (if transformation="boxcox").

alrVar

variable to be used as denominator in the additive logratio transformation (if transformation="alr").

Details

Transformation:

"none": no transformation is used.

"logarithm": compute the logarithm (to the base 10).

"boxcox": apply a Box-Cox transformation. Powers may be specified or calculated with the function car::powerTransform().

Standardization:

"none": no standardization is used.

"classical": apply a z-Transformation on each variable by using function scale().

"robust": apply a robustified z-Transformation by using median and MAD.

Value

Transformed and standardized data.
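The "logarithm" transformation and the "robust" standardization described under Details can be sketched in base R as follows. This is an illustration rather than the implementation of prepare(): the helper names are hypothetical, and it is assumed that the transformation is applied before the standardization and that the MAD is computed with stats::mad() and its default consistency constant.

```r
# Hypothetical helpers (not part of VIM) mirroring the description above.
log10_transform <- function(x) log10(x)

robust_scale <- function(x) {
  # centre by the median and scale by the MAD, ignoring missing values
  (x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE)
}

x <- c(1, 10, 100, 1000, NA, 1e6)
z <- robust_scale(log10_transform(x))
z
```

After this step the observed values have median 0 and MAD 1, while the missing value stays NA.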

Author(s)

Matthias Templ, modifications by Andreas Alfons

Examples

data(sleep, package = "VIM")
x <- sleep[, c("BodyWgt", "BrainWgt")]
prepare(x, scaling = "robust", transformation = "logarithm")

Pulp lignin content

Description

Pulp quality by lignin content remaining

Format

A data frame with 301 observations on the following 23 variables.

Details

Pulp quality is measured by the lignin content remaining in the pulp: the Kappa number. This data set is used to understand which variables in the process influence the Kappa number, and if it can be predicted accurately enough for an inferential sensor application. Variables with a number at the end have been lagged by that number of hours to line up the data.

Source

https://openmv.net/info/kamyr-digester

References

K. Walkush and R.R. Gustafson. "Application of feedforward neural networks and partial least squares regression for modelling Kappa number in a continuous Kamyr digester", Pulp and Paper Canada, 95, 1994, p T7-T13.

Examples

data(pulplignin)
str(pulplignin)
aggr(pulplignin)

Random Forest Imputation

Description

Impute missing values based on a random forest model using ranger::ranger().

Usage

rangerImpute(
  formula,
  data,
  imp_var = TRUE,
  imp_suffix = "imp",
  ...,
  verbose = FALSE,
  median = FALSE
)

Arguments

formula

model formula for the imputation

data

A data.frame containing the data

imp_var

TRUE/FALSE; if TRUE, a logical (TRUE/FALSE) variable is created for each imputed variable to show the imputation status

imp_suffix

suffix used for the TRUE/FALSE imputation variables

...

Arguments passed to ranger::ranger()

verbose

Show the number of observations used for training and evaluating the RF model. This parameter is also passed down to ranger::ranger() to show computation status.

median

Use the median (rather than the arithmetic mean) to average the values of individual trees for a more robust estimate.

Value

the imputed data set.
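The effect of the median argument can be illustrated without fitting a forest. The sketch below aggregates made-up per-tree predictions once with the arithmetic mean and once with the median. The matrix tree_pred and its values are assumptions for illustration and do not come from ranger::ranger().

```r
# Toy per-tree predictions for three observations (rows) from five trees (columns).
# The numbers are made up for illustration.
tree_pred <- rbind(
  c(2.0, 2.1, 1.9, 2.0, 9.0),   # one tree is far off
  c(5.0, 5.2, 4.8, 5.1, 4.9),
  c(0.5, 0.4, 0.6, 0.5, 0.5)
)

mean_agg   <- rowMeans(tree_pred)          # arithmetic mean over trees
median_agg <- apply(tree_pred, 1, median)  # median over trees

cbind(mean = mean_agg, median = median_agg)
```

In the first row a single outlying tree pulls the mean upwards, while the median stays close to the majority of the trees.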

Examples

data(sleep)
rangerImpute(Dream + NonD ~ BodyWgt + BrainWgt, data = sleep)

Regression Imputation

Description

Impute missing values based on a regression model.

Usage

regressionImp(
  formula,
  data,
  family = "AUTO",
  robust = FALSE,
  imp_var = TRUE,
  imp_suffix = "imp",
  mod_cat = FALSE
)

Arguments

formula

model formula to impute one variable

data

A data.frame containing the data

family

family argument for glm(). "AUTO" (the default) tries to choose the family automatically and is the only thoroughly tested option.

robust

TRUE/FALSE if robust regression should be used. See details.

imp_var

TRUE/FALSE; if TRUE, a logical (TRUE/FALSE) variable is created for each imputed variable to show the imputation status

imp_suffix

suffix used for the TRUE/FALSE imputation variables

mod_cat

TRUE/FALSE; if TRUE, for categorical variables the level with the highest prediction probability is selected, otherwise it is sampled according to the probabilities.

Details

lm() is used for family "normal" and glm() for all other families. If robust = TRUE, lmrob() is used for family "normal" and glmrob() for all other families.
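The basic idea for the "normal" family can be sketched with lm() alone. The sketch fits the model on the observed rows, predicts the missing entries, and records which values were imputed. It is a simplified illustration, not the code of regressionImp(): the toy data are made up, and naming the indicator column y_imp is an assumption that follows the "_imp" convention used in the examples of this manual.

```r
# Toy data with missing values in the response y (values are made up).
d <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2, 4, NA, 8, NA, 12)
)

miss <- is.na(d$y)

# Fit the model on the rows where y is observed ...
fit <- lm(y ~ x, data = d[!miss, ])

# ... and replace the missing entries by the model predictions.
d$y[miss] <- predict(fit, newdata = d[miss, , drop = FALSE])

# Logical indicator column showing the imputation status
d$y_imp <- miss
d
```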

Value

the imputed data set.

Author(s)

Alexander Kowarik

References

A. Kowarik, M. Templ (2016) Imputation with R package VIM. Journal of Statistical Software, 74(7), 1-16.

Examples

data(sleep)
sleepImp1 <- regressionImp(Dream+NonD~BodyWgt+BrainWgt,data=sleep)
sleepImp2 <- regressionImp(Sleep+Gest+Span+Dream+NonD~BodyWgt+BrainWgt,data=sleep)
data(testdata)
imp_testdata1 <- regressionImp(b1+b2~x1+x2,data=testdata$wna)
imp_testdata3 <- regressionImp(x1~x2,data=testdata$wna,robust=TRUE)

Rug representation of missing/imputed values

Description

Add a rug representation of missing/imputed values in only one of the variables to scatterplots.

Usage

rugNA(
  x,
  y,
  ticksize = NULL,
  side = 1,
  col = "red",
  alpha = NULL,
  miss = NULL,
  lwd = 0.5,
  ...
)

Arguments

x,y

numeric vectors.

ticksize

the length of the ticks. Positive lengths give inward ticks.

side

an integer giving the side of the plot to draw the rug representation.

col

the color to be used for the ticks.

alpha

the alpha value (between 0 and 1).

miss

a data.frame or matrix with two columns and logical values. If NULL, x and y are searched for missing values, otherwise, the first column of miss is used to determine the imputed values in x and the second one for the imputed values in y.

lwd

the line width to be used for the ticks.

...

further arguments to be passed to graphics::Axis().

Details

If side is 1 or 3, the rug representation consists of values available in x but missing/imputed in y. Else if side is 2 or 4, it consists of values available in y but missing/imputed in x.
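The selection rule above can be written out in a few lines of base R. The sketch only computes which values would be marked on each side; it does not draw anything and is not the code of rugNA(). The vectors are made up for illustration.

```r
# Toy vectors with missing values (made up for illustration).
x <- c(1.0, 2.0, NA, 4.0, 5.0)
y <- c(10,  NA, 30, NA, 50)

# side = 1 or 3: x values that are observed while y is missing
rug_x <- x[!is.na(x) & is.na(y)]

# side = 2 or 4: y values that are observed while x is missing
rug_y <- y[!is.na(y) & is.na(x)]

rug_x  # 2 4
rug_y  # 30
```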

Author(s)

Andreas Alfons, modifications by Bernd Prantner

Examples

data(tao, package = "VIM")
## for missing values
x <- tao[, "Air.Temp"]
y <- tao[, "Humidity"]
plot(x, y)
rugNA(x, y, side = 1)
rugNA(x, y, side = 2)
## for imputed values
x_imp <- kNN(tao[, c("Air.Temp","Humidity")])
x <- x_imp[, "Air.Temp"]
y <- x_imp[, "Humidity"]
miss <- x_imp[, c("Air.Temp_imp","Humidity_imp")]
plot(x, y)
rugNA(x, y, side = 1, col = "orange", miss = miss)
rugNA(x, y, side = 2, col = "orange", miss = miss)

Random aggregation function for a factor variable

Description

The function sampleCat samples with probabilities corresponding to the occurrence of the level in the nearest neighbours (NNs).

Usage

sampleCat(x, weights = NULL)

Arguments

x

factor vector

weights

numeric vector providing weights for the observations in x
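A minimal sketch of this kind of aggregation, assuming that the sampling probability of a level is proportional to the total weight of the neighbours carrying it. The function sample_level is a hypothetical re-implementation for illustration, not the VIM source.

```r
# Hypothetical re-implementation for illustration; not the VIM source.
sample_level <- function(x, weights = NULL) {
  x <- droplevels(as.factor(x))
  if (is.null(weights)) weights <- rep(1, length(x))
  # total weight per level acts as the sampling probability (up to scaling)
  w <- as.numeric(tapply(weights, x, sum))
  sample(levels(x), size = 1, prob = w)
}

set.seed(1)
nn <- factor(c("a", "a", "b", "a", "c"))      # levels observed among the neighbours
sample_level(nn)                              # "a" is drawn with probability 3/5
sample_level(nn, weights = c(0, 0, 1, 0, 0))  # only "b" has positive weight
```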

Bivariate jitter plot

Description

Create a bivariate jitter plot.

Usage

scattJitt(
  x,
  delimiter = NULL,
  col = c("skyblue", "red", "red4", "orange", "orange4"),
  alpha = NULL,
  cex = par("cex"),
  col.line = "lightgrey",
  lty = "dashed",
  lwd = par("lwd"),
  numbers = TRUE,
  cex.numbers = par("cex"),
  main = NULL,
  sub = NULL,
  xlab = NULL,
  ylab = NULL,
  axes = TRUE,
  frame.plot = axes,
  labels = c("observed", "missing", "imputed"),
  ...
)

Arguments

x

a data.frame or matrix with two columns.

delimiter

col

a vector of length five giving the colors to be used in the plot. The first color will be used for complete observations, the second/fourth color for missing/imputed values in only one variable, and the third/fifth color for missing/imputed values in both variables. If only one color is supplied, it is used for all. Else if two colors are supplied, the second one is recycled.

alpha

a numeric value between 0 and 1 giving the level of transparency of the colors, or NULL. This can be used to prevent overplotting.

cex

the character expansion factor for the plot characters.

col.line

the color for the lines dividing the plot region.

lty

the line type for the lines dividing the plot region (see graphics::par()).

lwd

the line width for the lines dividing the plot region.

numbers

a logical indicating whether the frequencies of observed and missing/imputed values should be displayed (see ‘Details’).

cex.numbers

the character expansion factor to be used for the frequencies of the observed and missing/imputed values.

main,sub

main and sub title.

xlab,ylab

axis labels.

axes

a logical indicating whether both axes should be drawn on the plot. Use graphical parameter "xaxt" or "yaxt" to suppress just one of the axes.

frame.plot

a logical indicating whether a box should be drawn around the plot.

labels

a vector of length three giving the axis labels for the regions for observed, missing and imputed values (see ‘Details’).

...

further graphical parameters to be passed down (see graphics::par()).

Details

The amount of observed and missing/imputed values is visualized by jittered points. Thereby the plot region is divided into up to four regions according to the existence of missing/imputed values in one or both variables. In addition, the amount of observed and missing/imputed values can be represented by a number.
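The frequencies behind the (up to) four regions can be obtained by cross-tabulating the missingness of the two variables. The base-R sketch below uses made-up data and only computes the counts; it does not reproduce the plot or the internals of scattJitt().

```r
# Toy two-column data with missing values (made up for illustration).
d <- data.frame(
  a = c(1, NA, 3, NA, 5, 6),
  b = c(NA, NA, 3, 4, 5, NA)
)

# Cross-tabulate missingness: each cell corresponds to one plot region.
regions <- table(a_missing = is.na(d$a), b_missing = is.na(d$b))
regions
```

The cell with both indicators FALSE counts the complete observations, and the cell with both TRUE counts the observations that are missing in both variables.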

Note

Some of the argument names and positions have changed with version 1.3 due to extended functionality and for more consistency with other plot functions in VIM. For back compatibility, the argument cex.text can still be supplied to ... and is handled correctly. Nevertheless, it is deprecated and no longer documented. Use cex.numbers instead.

Author(s)

Matthias Templ, modifications by Andreas Alfons and Bernd Prantner

References

M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.

Examples

data(tao, package = "VIM")
## for missing values
scattJitt(tao[, c("Air.Temp", "Humidity")])
## for imputed values
scattJitt(kNN(tao[, c("Air.Temp", "Humidity")]), delimiter = "_imp")

Scatterplot with information about missing/imputed values

Description

In addition to a standard scatterplot, lines are plotted for the missing values in one variable. If there are imputed values, they will be highlighted.

Usage

scattMiss(
  x,
  delimiter = NULL,
  side = 1,
  col = c("skyblue", "red", "orange", "lightgrey"),
  alpha = NULL,
  lty = c("dashed", "dotted"),
  lwd = par("lwd"),
  quantiles = c(0.5, 0.975),
  inEllipse = FALSE,
  zeros = FALSE,
  xlim = NULL,
  ylim = NULL,
  main = NULL,
  sub = NULL,
  xlab = NULL,
  ylab = NULL,
  interactive = TRUE,
  ...
)

Arguments

x

a matrix or data.frame with two columns.

delimiter

side

if side=1, a rug representation and vertical lines are plotted for the missing/imputed values in the second variable; if side=2, a rug representation and horizontal lines for the missing/imputed values in the first variable.

col

a vector of length four giving the colors to be used in the plot. The first color is used for the scatterplot, the second/third color for the rug representation for missing/imputed values. The second color is also used for the lines for missing values. Imputed values will be highlighted with the third color, and the fourth color is used for the ellipses (see ‘Details’). If only one color is supplied, it is used for the scatterplot, the rug representation and the lines, whereas the default color is used for the ellipses. Else if a vector of length two is supplied, the default color is used for the ellipses as well.

alpha

a numeric value between 0 and 1 giving the level of transparency of the colors, or NULL. This can be used to prevent overplotting.

lty

a vector of length two giving the line types for the lines and ellipses. If a single value is supplied, it will be used for both.

lwd

a vector of length two giving the line widths for the lines and ellipses. If a single value is supplied, it will be used for both.

quantiles

a vector giving the quantiles of the chi-square distribution to be used for the tolerance ellipses, or NULL to suppress plotting ellipses (see ‘Details’).

inEllipse

plot lines only inside the largest ellipse. Ignored if quantiles is NULL or if there are imputed values.

zeros

a logical vector of length two indicating whether the variables are semi-continuous, i.e., contain a considerable amount of zeros. If TRUE, only the non-zero observations are used for computing the tolerance ellipses. If a single logical is supplied, it is recycled. Ignored if quantiles is NULL.

xlim,ylim

axis limits.

main,sub

main and sub title.

xlab,ylab

axis labels.

interactive

a logical indicating whether the side argument can be changed interactively (see ‘Details’).

...

further graphical parameters to be passed down (see graphics::par()).

Details

Information about missing values in one variable is included as vertical or horizontal lines, as determined by the side argument. The lines are thereby drawn at the observed x- or y-value. In case of imputed values, they will additionally be highlighted in the scatterplot. In addition, percentage coverage ellipses can be drawn to give a clue about the shape of the bivariate data distribution.

If interactive is TRUE, clicking in the bottom margin redraws the plot with information about missing/imputed values in the first variable and clicking in the left margin redraws the plot with information about missing/imputed values in the second variable. Clicking anywhere else in the plot quits the interactive session.
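The link between the quantiles argument and the coverage ellipses can be sketched with squared Mahalanobis distances. A point lies inside the ellipse for quantile q when its squared distance does not exceed the q-quantile of the chi-square distribution with two degrees of freedom. The sketch uses simulated data and the classical mean and covariance, which is an assumption: this section does not state which location and scatter estimates scattMiss() uses.

```r
set.seed(42)
# Simulated bivariate data (an assumption for illustration).
xy <- cbind(x = rnorm(200), y = rnorm(200))

quantiles <- c(0.5, 0.975)

# Squared Mahalanobis distances from the classical centre and covariance
d2 <- mahalanobis(xy, center = colMeans(xy), cov = cov(xy))

# A point lies inside the ellipse for quantile q if d2 <= qchisq(q, df = 2)
cutoffs <- qchisq(quantiles, df = 2)
coverage <- sapply(cutoffs, function(cut) mean(d2 <= cut))
names(coverage) <- quantiles
coverage
```

For roughly bivariate normal data the empirical coverage should be close to the requested quantiles.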

Note

The argument zeros has been introduced in version 1.4. As a result, some of the argument positions have changed.

Author(s)

Andreas Alfons, modifications by Bernd Prantner

References

M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.

Examples

data(tao, package = "VIM")
## for missing values
scattMiss(tao[,c("Air.Temp", "Humidity")])
## for imputed values
scattMiss(kNN(tao[,c("Air.Temp", "Humidity")]), delimiter = "_imp")

Scatterplot matrix with information about missing/imputed values

Description

Scatterplot matrix in which observations with missing/imputed values in certain variables are highlighted.

Usage

scattmatrixMiss(
  x,
  delimiter = NULL,
  highlight = NULL,
  selection = c("any", "all"),
  plotvars = NULL,
  col = c("skyblue", "red", "orange"),
  alpha = NULL,
  pch = c(1, 3),
  lty = par("lty"),
  diagonal = c("density", "none"),
  interactive = TRUE,
  ...
)

Arguments

x

a matrix or data.frame.

delimiter

highlight

a vector giving the variables to be used for highlighting. If NULL (the default), all variables are used for highlighting.

selection

plotvars

a vector giving the variables to be plotted. If NULL (the default), all variables are plotted.

col

a vector of length three giving the colors to be used in the plot. The second/third color will be used for highlighting missing/imputed values.

alpha

a numeric value between 0 and 1 giving the level of transparency of the colors, or NULL. This can be used to prevent overplotting.

pch

a vector of length two giving the plot characters. The second plot character will be used for the highlighted observations.

lty

a vector of length two giving the line types for the density plots in the diagonal panels (if diagonal="density"). The second line type is used for the highlighted observations. If a single value is supplied, it is used for both non-highlighted and highlighted observations.

diagonal

a character string specifying the plot to be drawn in the diagonal panels. Possible values are "density" (density plots for non-highlighted and highlighted observations) and "none".

interactive

a logical indicating whether the variables to be used for highlighting can be selected interactively (see ‘Details’).

...

for scattmatrixMiss, further arguments and graphical parameters to be passed to pairsVIM(). par("oma") will be set appropriately unless supplied (see graphics::par()). For TKRscattmatrixMiss, further arguments to be passed to scattmatrixMiss.

Details

scattmatrixMiss uses pairsVIM() with a panel function that allows highlighting of missing/imputed values.

If interactive=TRUE, the variables to be used for highlighting can be selected interactively. Observations with missing/imputed values in any or in all of the selected variables are highlighted (as determined by selection). A variable can be added to the selection by clicking in a diagonal panel. If a variable is already selected, clicking on the corresponding diagonal panel removes it from the selection. Clicking anywhere else quits the interactive session.

The graphical parameter oma will be set unless supplied as an argument.

TKRscattmatrixMiss behaves like scattmatrixMiss, but uses tkrplot to embed the plot in a Tcl/Tk window. This is useful if the number of variables is large, because scrollbars allow moving from one part of the plot to another.
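The difference between selection = "any" and selection = "all" described above can be sketched in base R. The example only determines which observations would be highlighted for a given set of highlight variables; the data are made up and the code is not taken from scattmatrixMiss().

```r
# Toy data with missing values (made up for illustration).
d <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(NA, NA, 3, 4),
  c = c(1, 2, 3, 4)
)
highlight <- c("a", "b")   # variables used for highlighting

na_mat  <- is.na(d[, highlight, drop = FALSE])
any_sel <- rowSums(na_mat) > 0                   # selection = "any"
all_sel <- rowSums(na_mat) == length(highlight)  # selection = "all"

unname(which(any_sel))  # 1 2 4
unname(which(all_sel))  # 2
```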

Note

Some of the argument names and positions have changed with version 1.3 due to a re-implementation and for more consistency with other plot functions in VIM. For back compatibility, the argument colcomb can still be supplied to ... and is handled correctly. Nevertheless, it is deprecated and no longer documented. Use highlight instead. The arguments smooth, reg.line and legend.plot are no longer used and ignored if supplied.

Author(s)

Andreas Alfons, Matthias Templ, modifications by Bernd Prantner

References

M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.

Examples

data(sleep, package = "VIM")
## for missing values
x <- sleep[, 1:5]
x[,c(1,2,4)] <- log10(x[,c(1,2,4)])
scattmatrixMiss(x, highlight = "Dream")
## for imputed values
x_imp <- kNN(sleep[, 1:5])
x_imp[,c(1,2,4)] <- log10(x_imp[,c(1,2,4)])
scattmatrixMiss(x_imp, delimiter = "_imp", highlight = "Dream")

Mammal sleep data

Description

Sleep data with missing values.

Format

A data frame with 62 observations on the following 10 variables.

BodyWgt: a numeric vector
BrainWgt: a numeric vector
NonD: a numeric vector
Dream: a numeric vector
Sleep: a numeric vector
Span: a numeric vector
Gest: a numeric vector
Pred: a numeric vector
Exp: a numeric vector
Danger: a numeric vector

Source

Allison, T. and Cicchetti, D. (1976) Sleep in mammals: ecological and constitutional correlates. Science 194 (4266), 732–734.

The data set was imported from GGobi.

Examples

data(sleep, package = "VIM")
summary(sleep)
aggr(sleep)

Spineplot with information about missing/imputed values

Description

Spineplot or spinogram with highlighting of missing/imputed values in other variables by splitting each cell into two parts. Additionally, information about missing/imputed values in the variable of interest is shown on the right hand side.

Usage

spineMiss(
  x,
  delimiter = NULL,
  pos = 1,
  selection = c("any", "all"),
  breaks = "Sturges",
  right = TRUE,
  col = c("skyblue", "red", "skyblue4", "red4", "orange", "orange4"),
  border = NULL,
  main = NULL,
  sub = NULL,
  xlab = NULL,
  ylab = NULL,
  axes = TRUE,
  labels = axes,
  only.miss = TRUE,
  miss.labels = axes,
  interactive = TRUE,
  ...
)

Arguments

x

a vector, matrix or data.frame.

delimiter

pos

a numeric value giving the index of the variable of interest. Additional variables in x are used for highlighting.

selection

breaks

if the variable of interest is numeric, breaks controls the breakpoints (see graphics::hist() for possible values).

right

logical; if TRUE and the variable of interest is numeric, the spinogram cells are right-closed (left-open) intervals.

col

border

the color to be used for the border of the cells. Use border=NA to omit borders.

main,sub

main and sub title.

xlab,ylab

axis labels.

axes

a logical indicating whether axes should be drawn on the plot.

labels

if the variable of interest is categorical, either a logical indicating whether labels should be plotted below each cell, or a character vector giving the labels. This is ignored if the variable of interest is numeric.

only.miss

logical; if TRUE, the missing/imputed values in the variable of interest are also visualized by a cell in the spineplot or spinogram. Otherwise, a small spineplot is drawn on the right hand side (see ‘Details’).

miss.labels

either a logical indicating whether label(s) should be plotted below the cell(s) on the right hand side, or a character string or vector giving the label(s) (see ‘Details’).

interactive

a logical indicating whether the variables can be switched interactively (see ‘Details’).

...

further graphical parameters to be passed to graphics::title() and graphics::axis().

Details

A spineplot is created if the variable of interest is categorical and a spinogram if it is numerical. The horizontal axis is scaled according to relative frequencies of the categories/classes. If more than one variable is supplied, the cells are split according to missingness/number of imputed values in the additional variables. Thus the proportion of highlighted observations in each category/class is displayed on the vertical axis. Since the height of each cell corresponds to the proportion of highlighted observations, it is now possible to compare the proportions of missing/imputed values among the different categories/classes.

If only.miss=TRUE, the missing/imputed values in the variable of interest are also visualized by a cell in the spineplot or spinogram. If additional variables are supplied, this cell is again split into two parts according to missingness/number of imputed values in the additional variables.

Otherwise, a small spineplot that visualizes missing/imputed values in the variable of interest is drawn on the right hand side. The first cell corresponds to observed values and the second cell to missing/imputed values. Each of the two cells is again split into two parts according to missingness/number of imputed values in the additional variables. Note that this display does not make sense if only one variable is supplied, therefore only.miss is ignored in that case.
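The two quantities that define a spineplot cell, its width and the height of its highlighted part, can be computed directly. The base-R sketch below does this for a categorical variable of interest and one additional variable with missing values. The data are made up and the code illustrates the description above rather than the internals of spineMiss().

```r
# Toy data: a categorical variable of interest and one additional variable
# with missing values (made up for illustration).
d <- data.frame(
  group = factor(c("a", "a", "a", "b", "b", "c")),
  other = c(1, NA, NA, 4, NA, 6)
)

# Cell widths: relative frequency of each category
width <- prop.table(table(d$group))

# Highlighted part of each cell: proportion of observations in the category
# with a missing value in the additional variable
highlighted <- tapply(is.na(d$other), d$group, mean)

width
highlighted
```

Comparing the entries of highlighted across categories corresponds to comparing the heights of the highlighted parts in the plot.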

Value

a table containing the frequencies corresponding to the cells.

Note

Some of the argument names and positions have changed with version 1.3 due to extended functionality and for more consistency with other plot functions in VIM. For back compatibility, the arguments xaxlabels and missaxlabels can still be supplied to ... and are handled correctly. Nevertheless, they are deprecated and no longer documented. Use labels and miss.labels instead.

The code is based on the function graphics::spineplot() by Achim Zeileis.

Author(s)

Andreas Alfons, Matthias Templ, modifications by Bernd Prantner

References

M. Templ, A. Alfons, P. Filzmoser (2012) Exploring incomplete data using visualization tools. Journal of Advances in Data Analysis and Classification, Online first. DOI: 10.1007/s11634-011-0102-y.

Examples

data(tao, package = "VIM")
data(sleep, package = "VIM")
## for missing values
spineMiss(tao[, c("Air.Temp", "Humidity")])
spineMiss(sleep[, c("Exp", "Sleep")])
## for imputed values
spineMiss(kNN(tao[, c("Air.Temp", "Humidity")]), delimiter = "_imp")
spineMiss(kNN(sleep[, c("Exp", "Sleep")]), delimiter = "_imp")

Create table with highlighted missings/imputations

Description

Create a reactable table that highlights missing values and imputed values with the same colors as histMiss().

Usage

tableMiss(x, delimiter = "_imp")

Arguments

x

a vector, matrix or data.frame.

delimiter

Examples

data(tao)
x_IMPUTED <- kNN(tao[, c("Air.Temp", "Humidity")])
tableMiss(x_IMPUTED[105:114, ])
x_IMPUTED[106, 2] <- NA
x_IMPUTED[105, 1] <- NA
x_IMPUTED[107, "Humidity_imp"] <- TRUE
tableMiss(x_IMPUTED[105:114, ])

Tropical Atmosphere Ocean (TAO) project data

Description

A small subsample of the Tropical Atmosphere Ocean (TAO) project data, derived from the GGOBI project.

Format

A data frame with 736 observations on the following 8 variables.

Year: a numeric vector
Latitude: a numeric vector
Longitude: a numeric vector
Sea.Surface.Temp: a numeric vector
Air.Temp: a numeric vector
Humidity: a numeric vector
UWind: zonal wind, i.e. latitude-parallel wind
VWind: meridional wind, i.e. longitude-parallel wind

Details

All cases recorded for five locations and two time periods.

Source

http://www.pmel.noaa.gov/tao/

Examples

data(tao, package = "VIM")
summary(tao)
aggr(tao)

Simulated data set for testing purpose

Description

2 numeric, 2 binary, 2 nominal and 2 mixed (semi-continuous) variables

Format

The format is: List of 4

$wna: a data.frame with 500 obs. of 8 variables:
- x1: numeric 10.87 9.53 7.83 8.53 8.67 ...
- x2: numeric 10.9 9.32 7.68 8.2 8.41 ...
- c1: Factor w/ 4 levels "a","b","c","d": 3 2 2 1 2 2 1 3 3 2 ...
- c2: Factor w/ 4 levels "a","b","c","d": 2 3 2 2 2 2 2 4 2 2 ...
- b1: Factor w/ 2 levels "0","1": 2 2 1 2 1 2 1 2 1 1 ...
- b2: Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 2 2 2 ...
- m1: numeric 0 8.29 9.08 0 0 ...
- m2: numeric 10.66 9.39 7.8 8.11 7.33 ...
$wona: a data.frame with 500 obs. of 8 variables:
- x1: numeric 10.87 9.53 7.83 8.53 8.67 ...
- x2: numeric 10.9 9.32 7.68 8.2 8.41 ...
- c1: Factor w/ 4 levels "a","b","c","d": 3 2 2 1 2 2 1 3 3 2 ...
- c2: Factor w/ 4 levels "a","b","c","d": 2 3 2 2 2 2 2 4 2 2 ...
- b1: Factor w/ 2 levels "0","1": 2 2 1 2 1 2 1 2 1 1 ...
- b2: Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 2 2 2 ...
- m1: numeric 0 8.29 9.08 0 0 ...
- m2: numeric 10.66 9.39 7.8 8.11 7.33 ...
$mixed: c("m1", "m2")
$outlierInd: NULL

Examples

data(testdata)

Simulated toy data set for examples

Description

A 2-dimensional data set with additional information.

Format

data frame with 100 observations and 12 variables. The first two variables represent the fully observed data.

Examples

data(toydataMiss)

Wine tasting and price

Description

Wine reviews from France, Switzerland, Austria and Germany.

Format

A data frame with 9627 observations on the following 9 variables.

country: country of origin
points: the number of points WineEnthusiast rated the wine on a scale of 1-100 (though they say they only post reviews for wines that score >=80)
price: the cost for a bottle of the wine
province: the province or state that the wine is from
taster_name: name of the person who tasted and reviewed the wine
taster_twitter_handle: Twitter handle for the person who tasted and reviewed the wine
variety: the type of grapes used to make the wine (e.g. pinot noir)
winery: the winery that made the wine
variety_main: broader category of the variety

Details

The data was scraped from WineEnthusiast during the week of Nov 22nd, 2017. The code for the scraper can be found at https://github.com/zackthoutt/wine-deep-learning. This data set is slightly modified, i.e. only four countries are selected and broader categories on the variety have been added.

Source

https://www.kaggle.com/zynicide/wine-reviews

Examples

data(wine)
str(wine)
aggr(wine)

Xgboost Imputation

Description

Impute missing values based on a gradient boosting model using xgboost::xgboost().

Usage

xgboostImpute(
  formula,
  data,
  imp_var = TRUE,
  imp_suffix = "imp",
  verbose = FALSE,
  nrounds = 100,
  objective = NULL,
  ...
)

Arguments

formula

model formula for the imputation

data

A data.frame containing the data

imp_var

TRUE/FALSE; if TRUE, a logical (TRUE/FALSE) variable is created for each imputed variable to show the imputation status

imp_suffix

suffix used for the TRUE/FALSE imputation variables

verbose

Show the number of observations used for training and evaluating the model. This parameter is also passed down to xgboost::xgboost() to show computation status.

nrounds

maximum number of boosting iterations, argument passed to xgboost::xgboost()

objective

objective for xgboost, argument passed to xgboost::xgboost()

...

Arguments passed to xgboost::xgboost()

Value

the imputed data set.

Examples

data(sleep)
xgboostImpute(Dream~BodyWgt+BrainWgt,data=sleep)
xgboostImpute(Dream+NonD~BodyWgt+BrainWgt,data=sleep)
xgboostImpute(Dream+NonD+Gest~BodyWgt+BrainWgt,data=sleep)
sleepx <- sleep
sleepx$Pred <- as.factor(LETTERS[sleepx$Pred])
sleepx$Pred[1] <- NA
xgboostImpute(Pred~BodyWgt+BrainWgt,data=sleepx)
