Version: 5.2-4
Date: 2025-10-02
Title: Harrell Miscellaneous
Depends: R (≥ 4.2.0)
Imports: methods, ggplot2, cluster, rpart, nnet, foreign, gtable, grid, gridExtra, data.table, htmlTable (≥ 1.11.0), viridisLite, htmltools, base64enc, colorspace, rmarkdown, knitr, Formula
Suggests: survival, qreport, acepack, chron, rms, mice, rstudioapi, tables, plotly (≥ 4.5.6), rlang, VGAM, leaps, pcaPP, digest, parallel, polspline, abind, kableExtra, rio, lattice, latticeExtra, gt, sparkline, jsonlite, htmlwidgets, qs, getPass, keyring, safer, htm2txt, boot
Description: Contains many functions useful for data analysis, high-level graphics, utility operations, functions for computing sample size and power, simulation, importing and annotating datasets, imputing missing values, advanced table making, variable clustering, character string manipulation, conversion of R objects to LaTeX and html code, recoding variables, caching, simplified parallel computing, encrypting and decrypting data using a safe workflow, general moving window statistical estimation, and assistance in interpreting principal component analysis.
License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
LazyLoad: Yes
URL: https://hbiostat.org/R/Hmisc/
Encoding: UTF-8
RoxygenNote: 7.3.3
NeedsCompilation: yes
Packaged: 2025-10-03 20:24:42 UTC; harrelfe
Author: Frank E Harrell Jr [aut, cre], Cole Beck [ctb], Charles Dupont [ctb]
Maintainer: Frank E Harrell Jr <fh@fharrell.com>
Repository: CRAN
Date/Publication: 2025-10-05 06:50:02 UTC

Find Matching (or Non-Matching) Elements

Description

%nin% is a binary operator that returns a logical vector indicating, for each element of its left operand, whether there is a match in table. A TRUE element indicates that the corresponding element of the left operand has no match; FALSE indicates a match.

Usage

x %nin% table

Arguments

x

a vector (numeric, character, factor)

table

a vector (numeric, character, factor), matching the mode of x

Value

vector of logical values with length equal to length of x.

See Also

match, %in%

Examples

c('a','b','c') %nin% c('a','b')
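
As a rough illustration of the semantics described above (not necessarily how Hmisc implements the operator), the result agrees with negating base R's %in%. The variable name no_match is introduced here only for the example.

```r
x     <- c('a', 'b', 'c')
table <- c('a', 'b')

# Base-R equivalent of x %nin% table: TRUE where x has no match in table
no_match <- !(x %in% table)
no_match

# Typical use: keep only the elements that do not appear in table
x[no_match]
```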

Character strings from unquoted names

Description

Cs makes a vector of character strings from a list of valid R names. .q is similar but also makes use of names of arguments.

Usage

Cs(...)
.q(...)

Arguments

...

any number of names separated by commas. For .q any names of arguments will be used.

Value

character string vector. For .q there will be a names attribute to the vector if any names appeared in ....

See Also

sys.frame, deparse

Examples

Cs(a,cat,dog)
# subset.data.frame <- dataframe[,Cs(age,sex,race,bloodpressure,height)]
.q(a, b, c, 'this and that')
.q(dog=a, giraffe=b, cat=c)
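
One way to get this kind of behavior in base R is to capture the arguments unevaluated and deparse them. The helper cs_sketch below is hypothetical and is only a minimal sketch of the idea; it is not the Hmisc source and does not handle quoted strings or argument names the way .q does.

```r
# Hypothetical helper, not the Hmisc source: capture unevaluated arguments
# and convert each one to a character string.
cs_sketch <- function(...) {
  args <- as.list(substitute(list(...)))[-1]   # drop the leading `list`
  unname(vapply(args, deparse, character(1)))
}

cs_sketch(a, cat, dog)

# The names never need to exist as objects, because nothing is evaluated
vars <- cs_sketch(age, sex, race)
vars
```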

Empirical Cumulative Distribution Plot

Description

Computes coordinates of the cumulative distribution function of x, and by default plots it as a step function. A grouping variable may be specified so that stratified estimates are computed and (by default) plotted. If there is more than one group, the labcurve function is used (by default) to label the multiple step functions or to draw a legend defining line types, colors, or symbols by linking them with group labels. A weights vector may be specified to get weighted estimates. Specify normwt to make weights sum to the length of x (after removing NAs). Otherwise the total sample size is taken to be the sum of the weights.

Ecdf is actually a method, and Ecdf.default is what's called for a vector argument. Ecdf.data.frame is called when the first argument is a data frame. This function can automatically set up a matrix of ECDFs and wait for a mouse click if the matrix requires more than one page. Categorical variables, character variables, and variables having fewer than a set number of unique values are ignored. If par(mfrow=..) is not set up before Ecdf.data.frame is called, the function will try to figure the best layout depending on the number of variables in the data frame. Upon return the original mfrow is left intact.

When the first argument to Ecdf is a formula, a Trellis/Lattice function Ecdf.formula is called. This allows for multi-panel conditioning, superposition using a groups variable, and other Trellis features, along with the ability to easily plot transformed ECDFs using the fun argument. For example, if fun=qnorm, the inverse normal transformation will be used for the y-axis. If the transformed curves are linear this indicates normality. Like the xYplot function, Ecdf will create a function Key if the groups variable is used. This function can be invoked by the user to define the keys for the groups.

Usage

Ecdf(x, ...)

## Default S3 method:
Ecdf(x, what=c('F','1-F','f','1-f'),
     weights=rep(1, length(x)), normwt=FALSE,
     xlab, ylab, q, pl=TRUE, add=FALSE, lty=1,
     col=1, group=rep(1,length(x)), label.curves=TRUE, xlim,
     subtitles=TRUE, datadensity=c('none','rug','hist','density'),
     side=1,
     frac=switch(datadensity,none=NA,rug=.03,hist=.1,density=.1),
     dens.opts=NULL, lwd=1, log='', ...)

## S3 method for class 'data.frame'
Ecdf(x, group=rep(1,nrows),
     weights=rep(1, nrows), normwt=FALSE,
     label.curves=TRUE, n.unique=10, na.big=FALSE, subtitles=TRUE,
     vnames=c('labels','names'),...)

## S3 method for class 'formula'
Ecdf(x, data=sys.frame(sys.parent()), groups=NULL,
     prepanel=prepanel.Ecdf, panel=panel.Ecdf, ..., xlab,
     ylab, fun=function(x)x, what=c('F','1-F','f','1-f'), subset=TRUE)

Arguments

x

a numeric vector, data frame, or Trellis/Lattice formula

what

The default is "F" which results in plotting the fraction of values <= x. Set to "1-F" to plot the fraction > x or "f" to plot the cumulative frequency of values <= x. Use "1-f" to plot the cumulative frequency of values >= x.

weights

numeric vector of weights. Omit or specify a zero-length vector or NULL to get unweighted estimates.

normwt

see above

xlab

x-axis label. Default is label(x) or name of calling argument. For Ecdf.formula, xlab defaults to the label attribute of the x-axis variable.

ylab

y-axis label. Default is "Proportion <= x", "Proportion > x", or "Frequency <= x" depending on value of what.

q

a vector for quantiles for which to draw reference lines on the plot.Default is not to draw any.

pl

set to FALSE to omit the plot and just return the estimates

add

set to TRUE to add the cdf to an existing plot. Does not apply if using lattice graphics (i.e., if a formula is given as the first argument).

lty

integer line type for plot. If group is specified, this can be a vector.

lwd

line width for plot. Can be a vector corresponding to groups.

log

see plot. Set log='x' to use log scale for x-axis.

col

color for step function. Can be a vector.

group

a numeric, character, or factor categorical variable used for stratifying estimates. If group is present, as many ECDFs are drawn as there are non-missing group levels.

label.curves

applies if more than one group exists. Default is TRUE to use labcurve to label curves where they are farthest apart. Set label.curves to a list to specify options to labcurve, e.g., label.curves=list(method="arrow", cex=.8). These option names may be abbreviated in the usual way arguments are abbreviated. Use for example label.curves=list(keys=1:5) to draw symbols periodically (as in pch=1:5 - see points) on the curves and automatically position a legend in the most empty part of the plot. Set label.curves=FALSE to suppress drawing curve labels. The col, lty, and type parameters are automatically passed to labcurve, although you can override them here. You can set label.curves=list(keys="lines") to have different line types defined in an automatically positioned key.

xlim

x-axis limits. Default is entire range of x.

subtitles

set to FALSE to suppress putting a subtitle at the bottom left of each plot. The subtitle indicates the numbers of non-missing and missing observations, which are labeled n, m.

datadensity

If datadensity is not "none", either scat1d or histSpike is called to add a rug plot (datadensity="rug"), spike histogram (datadensity="hist"), or smooth density estimate ("density") to the bottom or top of the ECDF.

side

If datadensity is not "none", the default is to place the additional information on top of the x-axis (side=1). Use side=3 to place at the top of the graph.

frac

passed to histSpike

dens.opts

a list of optional arguments for histSpike

...

other parameters passed to plot if add=FALSE. For data frames, other parameters to pass to Ecdf.default. For Ecdf.formula, if groups is not used, you can also add data density information to each panel's ECDF by specifying the datadensity and optional frac, side, dens.opts arguments.

n.unique

minimum number of unique values before an ECDF is drawn for a variable in a data frame. Default is 10.

na.big

set to TRUE to draw the number of NAs in larger letters in the middle of the plot for Ecdf.data.frame

vnames

By default, variable labels are used to label x-axes. Set vnames="names" to instead use variable names.

method

method for computing the empirical cumulative distribution. See wtd.Ecdf. The default is to use the standard "i/n" method as is used by the non-Trellis versions of Ecdf.

fun

a function to transform the cumulative proportions, for the Trellis-type usage of Ecdf

data, groups, subset, prepanel, panel

the usual Trellis/Lattice parameters, with groups causing Ecdf.formula to overlay multiple ECDFs on one panel.

Value

for Ecdf.default an invisible list with elements x and y giving the coordinates of the cdf. If there is more than one group, a list of such lists is returned. An attribute, N, is in the returned object. It contains the elements n and m, the number of non-missing and missing observations, respectively.

Side Effects

plots

Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
fh@fharrell.com

See Also

wtd.Ecdf, label, table, cumsum, labcurve, xYplot, histSpike

Examples

set.seed(1)
ch <- rnorm(1000, 200, 40)
Ecdf(ch, xlab="Serum Cholesterol")
scat1d(ch)                       # add rug plot
histSpike(ch, add=TRUE, frac=.15)   # add spike histogram
# Better: add a data density display automatically:
Ecdf(ch, datadensity='density')
label(ch) <- "Serum Cholesterol"
Ecdf(ch)
other.ch <- rnorm(500, 220, 20)
Ecdf(other.ch,add=TRUE,lty=2)
sex <- factor(sample(c('female','male'), 1000, TRUE))
Ecdf(ch, q=c(.25,.5,.75))  # show quartiles
Ecdf(ch, group=sex,
     label.curves=list(method='arrow'))
# Example showing how to draw multiple ECDFs from paired data
pre.test <- rnorm(100,50,10)
post.test <- rnorm(100,55,10)
x <- c(pre.test, post.test)
g <- c(rep('Pre',length(pre.test)),rep('Post',length(post.test)))
Ecdf(x, group=g, xlab='Test Results', label.curves=list(keys=1:2))
# keys=1:2 causes symbols to be drawn periodically on top of curves
# Draw a matrix of ECDFs for a data frame
m <- data.frame(pre.test, post.test,
                sex=sample(c('male','female'),100,TRUE))
Ecdf(m, group=m$sex, datadensity='rug')
freqs <- sample(1:10, 1000, TRUE)
Ecdf(ch, weights=freqs)  # weighted estimates
# Trellis/Lattice examples:
region <- factor(sample(c('Europe','USA','Australia'),1000,TRUE))
year <- factor(sample(2001:2002,1000,TRUE))
Ecdf(~ch | region*year, groups=sex)
Key()           # draw a key for sex at the default location
# Key(locator(1)) # user-specified positioning of key
age <- rnorm(1000, 50, 10)
Ecdf(~ch | lattice::equal.count(age), groups=sex)  # use overlapping shingles
Ecdf(~ch | sex, datadensity='hist', side=3)  # add spike histogram at top
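
To make the role of weights and normwt more concrete, the following base-R sketch computes weighted cumulative proportions by hand for a tiny vector. It is only an illustration of the quantities described in the Description, under the assumption that the weights act as frequency weights; it is not the code Ecdf or wtd.Ecdf runs, and the object names (ux, F_w, w_norm, F_norm) are made up for the example.

```r
x <- c(3, 1, 2, 2)          # observed values
w <- c(1, 2, 1, 4)          # frequency weights (assumed for illustration)

ux <- sort(unique(x))       # distinct values, in increasing order

# Weighted proportion of observations <= each distinct value (what = "F")
F_w <- vapply(ux, function(v) sum(w[x <= v]) / sum(w), numeric(1))

# Rescaling the weights to sum to length(x), as normwt=TRUE is described
# to do, changes the implied sample size but not the proportions
w_norm <- w * length(x) / sum(w)
F_norm <- vapply(ux, function(v) sum(w_norm[x <= v]) / sum(w_norm), numeric(1))

cbind(x = ux, F = F_w, one_minus_F = 1 - F_w, F_normwt = F_norm)
```

The proportions are identical in both columns; only the total that the weights sum to (8 versus 4 here) differs.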

Debug Printing Function Generator

Description

Takes the name of a system options(opt=) and checks to see if option opt is set to TRUE, taking its default value to be FALSE. If TRUE, a function is created that calls prn() to print an object with the object's name in the description along with the option name and the name of the function within which the generated function was called, if any. If option opt is not set, a dummy function is generated instead. If options(debug_file=) is set when the generated function is called, prn() output will be appended to that file name instead of the console. At any time, set options(debug_file='') to resume printing to the console.

Usage

Fdebug(opt)

Arguments

opt

character string containing an option name

Value

a function

Author(s)

Frank Harrell

Examples

dfun <- Fdebug('my_option_name')   # my_option_name not currently set
dfun
dfun(sqrt(2))
options(my_option_name=TRUE)
dfun <- Fdebug('my_option_name')
dfun
dfun(sqrt(2))
# options(debug_file='/tmp/z') to append output to /tmp/z
options(my_option_name=NULL)
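
The general pattern of an option-gated function generator can be sketched in base R as follows. This is a simplified stand-in, not the Fdebug source: make_debug is a hypothetical name, print() is used in place of prn(), and the debug_file redirection and calling-function name are left out.

```r
# Hypothetical generator, not the Hmisc source; print() stands in for prn()
make_debug <- function(opt) {
  if (isTRUE(getOption(opt, FALSE))) {
    function(x) {
      cat(deparse(substitute(x)), " [option ", opt, "]\n", sep = "")
      print(x)
      invisible(x)
    }
  } else {
    function(x) invisible(NULL)   # dummy: does nothing
  }
}

options(my_option_name = NULL)
quiet <- make_debug("my_option_name")
quiet(sqrt(2))                    # prints nothing

options(my_option_name = TRUE)
loud <- make_debug("my_option_name")
loud(sqrt(2))                     # prints the expression, then its value
options(my_option_name = NULL)    # restore the option
```

Because the option is checked when the generator runs, not when the generated function is called, the cost of a disabled debug call is just an empty function call.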

Gini's Mean Difference

Description

GiniMd computes Gini's mean difference on a numeric vector. This index is defined as the mean absolute difference between any two distinct elements of a vector. For a Bernoulli (binary) variable with proportion of ones equal to $p$ and sample size $n$, Gini's mean difference is $2\frac{n}{n-1}p(1-p)$. For a trinomial variable (e.g., predicted values for a 3-level categorical predictor using two dummy variables) having (predicted) values $A, B, C$ with corresponding proportions $a, b, c$, Gini's mean difference is $2\frac{n}{n-1}[ab|A-B|+ac|A-C|+bc|B-C|]$.

Usage

GiniMd(x, na.rm=FALSE)

Arguments

x

a numeric vector (for GiniMd)

na.rm

set to TRUE if you suspect there may be NAs in x; these will then be removed. Otherwise an error will result.

Value

a scalar numeric

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

References

David HA (1968): Gini's mean difference rediscovered. Biometrika 55:573–575.

Examples

set.seed(1)
x <- rnorm(40)
# Test GiniMd against a brute-force solution
gmd <- function(x) {
  n <- length(x)
  sum(outer(x, x, function(a, b) abs(a - b))) / n / (n - 1)
}
GiniMd(x)
gmd(x)
z <- c(rep(0,17), rep(1,6))
n <- length(z)
GiniMd(z)
2*mean(z)*(1-mean(z))*n/(n-1)
a <- 12; b <- 13; c <- 7; n <- a + b + c
A <- -.123; B <- -.707; C <- 0.523
xx <- c(rep(A, a), rep(B, b), rep(C, c))
GiniMd(xx)
2*(a*b*abs(A-B) + a*c*abs(A-C) + b*c*abs(B-C))/n/(n-1)
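
The brute-force check above needs an n-by-n matrix. The same quantity can be obtained from the sorted values alone, using the identity that the sum of x_(j) - x_(i) over all i < j equals the sum over i of (2i - n - 1) x_(i). The sketch below uses only base R; gmd_sorted and gmd_brute are hypothetical helper names, and this is not claimed to be the formula GiniMd uses internally.

```r
# Hypothetical helper: Gini's mean difference from order statistics,
# using sum_{i<j} (x_(j) - x_(i)) = sum_i (2i - n - 1) x_(i)
gmd_sorted <- function(x) {
  n  <- length(x)
  xs <- sort(x)
  2 * sum((2 * seq_len(n) - n - 1) * xs) / (n * (n - 1))
}

# Brute-force definition: mean |a - b| over all ordered pairs of
# distinct elements
gmd_brute <- function(x) {
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (n * (n - 1))
}

set.seed(1)
x <- rnorm(40)
c(sorted = gmd_sorted(x), brute = gmd_brute(x))

# Binary case: both agree with 2 * n/(n-1) * p * (1-p)
z <- c(rep(0, 17), rep(1, 6))
n <- length(z); p <- mean(z)
c(gmd_sorted(z), 2 * n / (n - 1) * p * (1 - p))
```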

Internal Hmisc functions

Description

Internal Hmisc functions.

Details

These are not to be called by the user or are undocumented.


Overview of Hmisc Library

Description

The Hmisc library contains many functions useful for data analysis, high-level graphics, utility operations, functions for computing sample size and power, translating SAS datasets into R, imputing missing values, advanced table making, variable clustering, character string manipulation, conversion of R objects to LaTeX code, recoding variables, and bootstrap repeated measures analysis. Most of these functions were written by F Harrell, but a few were collected from statlib and from s-news; other authors are indicated below. This collection of functions includes all of Harrell's submissions to statlib other than the functions in the rms and display libraries. A few of the functions do not have “Help” documentation.

To make Hmisc load silently, issue options(Hverbose=FALSE) before library(Hmisc).

Functions

Function Name    Purpose
abs.error.pred Computes various indexes of predictive accuracy based
on absolute errors, for linear models
addMarginal Add marginal observations over selected variables
all.is.numeric Check if character strings are legal numerics
approxExtrap Linear extrapolation
aregImpute Multiple imputation based on additive regression,
bootstrapping, and predictive mean matching
areg.boot Nonparametrically estimate transformations for both
sides of a multiple additive regression, and
bootstrap these estimates and R^2
ballocation Optimum sample allocations in 2-sample proportion test
binconf Exact confidence limits for a proportion and more accurate
(narrower!) score stat.-based Wilson interval
(Rollin Brant, mod. FEH)
bootkm Bootstrap Kaplan-Meier survival or quantile estimates
bpower Approximate power of 2-sided test for 2 proportions
Includes bpower.sim for exact power by simulation
bpplot Box-Percentile plot
(Jeffrey Banfield, umsfjban@bill.oscs.montana.edu)
bpplotM Chart extended box plots for multiple variables
bsamsize Sample size requirements for test of 2 proportions
bystats Statistics on a single variable by levels of >=1 factors
bystats2 2-way statistics
character.table Shows numeric equivalents of all latin characters
Useful for putting many special chars. in graph titles
(Pierre Joyet, pierre.joyet@bluewin.ch)
ciapower Power of Cox interaction test
cleanup.import More compactly store variables in a data frame, and clean up
problem data when e.g. Excel spreadsheet had a non-
numeric value in a numeric column
combine.levels Combine infrequent levels of a categorical variable
confbar Draws confidence bars on an existing plot using multiple
confidence levels distinguished using color or gray scale
contents Print the contents (variables, labels, etc.) of a data frame
cpower Power of Cox 2-sample test allowing for noncompliance
Cs Vector of character strings from list of unquoted names
csv.get Enhanced importing of comma separated files labels
cut2 Like cut with better endpoint label construction and allows
construction of quantile groups or groups with given n
datadensity Snapshot graph of distributions of all variables in
a data frame. For continuous variables uses scat1d.
dataRep Quantify representation of new observations in a database
ddmmmyy SAS “date7” output format for a chron object
deff Kish design effect and intra-cluster correlation
describe Function to describe different classes of objects.
Invoke by saying describe(object). It calls one of the
following:
describe.data.frame Describe all variables in a data frame (generalization
of SAS UNIVARIATE)
describe.default Describe a variable (generalization of SAS UNIVARIATE)
dotplot3 A more flexible version of dotplot
Dotplot Enhancement of Trellis dotplot allowing for matrix
x-var., auto generation of Key function, superposition
drawPlot Simple mouse-driven drawing program, including a function
for fitting Bezier curves
Ecdf Empirical cumulative distribution function plot
errbar Plot with error bars (Charles Geyer, U. Chi., mod FEH)
event.chart Plot general event charts (Jack Lee, jjlee@mdanderson.org,
Ken Hess, Joel Dubin; Am Statistician 54:63-70, 2000)
event.history Event history chart with time-dependent cov. status
(Joel Dubin, jdubin@uwaterloo.ca)
find.matches Find matches (with tolerances) between columns of 2 matrices
first.word Find the first word in an R expression (R Heiberger)
fit.mult.impute Fit most regression models over multiple transcan imputations,
compute imputation-adjusted variances and avg. betas
format.df Format a matrix or data frame with much user control
(R Heiberger and FE Harrell)
ftupwr Power of 2-sample binomial test using Fleiss, Tytun, Ury
ftuss Sample size for 2-sample binomial test using " " " "
(Both by Dan Heitjan, dheitjan@biostats.hmc.psu.edu)
gbayes Bayesian posterior and predictive distributions when both
the prior and the likelihood are Gaussian
getHdata Fetch and list datasets on our web site
hdquantile Harrell-Davis nonparametric quantile estimator with s.e.
histbackback Back-to-back histograms (Pat Burns, Salomon Smith
Barney, London, pburns@dorado.sbi.com)
hist.data.frame Matrix of histograms for all numeric vars. in data frame
Use hist.data.frame(data.frame.name)
histSpike Add high-resolution spike histograms or density estimates
to an existing plot
hoeffd Hoeffding's D test (omnibus test of independence of X and Y)
impute Impute missing data (generic method)
interaction More flexible version of builtin function
is.present Tests for non-blank character values or non-NA numeric values
james.stein James-Stein shrinkage estimates of cell means from raw data
labcurve Optimally label a set of curves that have been drawn on
an existing plot, on the basis of gaps between curves.
Also position legends automatically at emptiest rectangle.
label Set or fetch a label for an R object
Lag Lag a vector, padding on the left with NA or ''
latex Convert an R object to LaTeX (R Heiberger & FE Harrell)
list.tree Pretty-print the structure of any data object
(Alan Zaslavsky, zaslavsk@hcp.med.harvard.edu)
Load Enhancement of load
mask 8-bit logical representation of a short integer value
(Rick Becker)
matchCases Match each case on one continuous variable
matxv Fast matrix * vector, handling intercept(s) and NAs
mgp.axis Version of axis() that uses appropriate mgp from
mgp.axis.labels and gets around bug in axis(2, ...)
that causes it to assume las=1
mgp.axis.labels Used by survplot and plot in rms library (and other
functions in the future) so that different spacing
between tick marks and axis tick mark labels may be
specified for x- and y-axes.
Use mgp.axis.labels('default') to set defaults.
Users can set values manually using
mgp.axis.labels(x,y) where x and y are 2nd value of
par('mgp') to use. Use mgp.axis.labels(type=w) to
retrieve values, where w='x', 'y', 'x and y', 'xy',
to get 3 mgp values (first 3 types) or 2 mgp.axis.labels.
minor.tick Add minor tick marks to an existing plot
mtitle Add outer titles and subtitles to a multiple plot layout
multLines Draw multiple vertical lines at each x
in a line plot
%nin% Opposite of %in%
nobsY Compute no. non-NA observations for left hand formula side
nomiss Return a matrix after excluding any row with an NA
panel.bpplot Panel function for trellis bwplot - box-percentile plots
panel.plsmo Panel function for trellis xyplot - uses plsmo
pBlock Block variables for certain lattice charts
pc1 Compute first prin. component and get coefficients on
original scale of variables
plotCorrPrecision Plot precision of estimate of correlation coefficient
plsmo Plot smoothed x vs. y with labeling and exclusion of NAs
Also allows a grouping variable and plots unsmoothed data
popower Power and sample size calculations for ordinal responses
(two treatments, proportional odds model)
prn prn(expression) does print(expression) but titles the
output with 'expression'. Do prn(expression,txt) to add
a heading (‘txt’) before the ‘expression’ title
pstamp Stamp a plot with date in lower right corner (pstamp())
Add ,pwd=T and/or ,time=T to add current directory
name or time
Put additional text for label as first argument, e.g.
pstamp('Figure 1') will draw 'Figure 1 date'
putKey Different way to use key()
putKeyEmpty Put key at most empty part of existing plot
rcorr Pearson or Spearman correlation matrix with pairwise deletion
of missing data
rcorr.cens Somers' Dxy rank correlation with censored data
rcorrp.cens Assess difference in concordance for paired predictors
rcspline.eval Evaluate restricted cubic spline design matrix
rcspline.plot Plot spline fit with nonparametric smooth and grouped estimates
rcspline.restate Restate restricted cubic spline in unrestricted form, and
create TeX expression to print the fitted function
reShape Reshape a matrix into 3 vectors, reshape serial data
rm.boot Bootstrap spline fit to repeated measurements model,
with simultaneous confidence region - least
squares using spline function in time
rMultinom Generate multinomial random variables with varying prob.
samplesize.bin Sample size for 2-sample binomial problem
(Rick Chappell, chappell@stat.wisc.edu)
sas.get Convert SAS dataset to S data frame
sasxport.get Enhanced importing of SAS transport dataset in R
Save Enhancement of save
scat1d Add 1-dimensional scatterplot to an axis of an existing plot
(like bar-codes, FEH/Martin Maechler,
maechler@stat.math.ethz.ch / Jens Oehlschlaegel-Akiyoshi,
oehl@psyres-stuttgart.de)
score.binary Construct a score from a series of binary variables or
expressions
sedit A set of character handling functions written entirely
in R. sedit() does much of what the UNIX sed
program does. Other functions included are
substring.location, substring<-, replace.string.wild,
and functions to check if a string is numeric or
contains only the digits 0-9
setTrellis Set Trellis graphics to use blank conditioning panel strips,
line thickness 1 for dot plot reference lines:
setTrellis(); 3 optional arguments
show.col Show colors corresponding to col=0,1,...,99
show.pch Show all plotting characters specified by pch=.
Just type show.pch() to draw the table on the
current device.
showPsfrag Use LaTeX to compile, and dvips and ghostview to
display a postscript graphic containing psfrag strings
solvet Version of solve with argument tol passed to qr
somers2 Somers' rank correlation and c-index for binary y
spearman Spearman rank correlation coefficient spearman(x,y)
spearman.test Spearman 1 d.f. and 2 d.f. rank correlation test
spearman2 Spearman multiple d.f. \rho^2, adjusted \rho^2, Wilcoxon-Kruskal-
Wallis test, for multiple predictors
spower Simulate power of 2-sample test for survival under
complex conditions
Also contains the Gompertz2, Weibull2, Lognorm2 functions.
spss.get Enhanced importing of SPSS files using read.spss function
src src(name) = source("name.s") with memory
store store an object permanently (easy interface to assign function)
strmatch Shortest unique identifier match
(Terry Therneau, therneau@mayo.edu)
subset More easily subset a data frame
substi Substitute one var for another when observations NA
summarize Generate a data frame containing stratified summary
statistics. Useful for passing to trellis.
summary.formula General table making and plotting functions for summarizing
data
summaryD Summarizing using user-provided formula and dotchart3
summaryM Replacement for summary.formula(..., method='reverse')
summaryP Multi-panel dot chart for summarizing proportions
summaryS Summarize multiple response variables for multi-panel
dot chart or scatterplot
summaryRc Summary for continuous variables using lowess
symbol.freq X-Y Frequency plot with circles' area prop. to frequency
sys Execute unix() or dos() depending on what's running
tabulr Front-end to tabular function in the tables package
tex Enclose a string with the correct syntax for using
with the LaTeX psfrag package, for postscript graphics
transace ace() packaged for easily automatically transforming all
variables in a matrix
transcan automatic transformation and imputation of NAs for a
series of predictor variables
trap.rule Area under curve defined by arbitrary x and y vectors,
using trapezoidal rule
trellis.strip.blank To make the strip titles in trellis more visible, you can
make the backgrounds blank by saying trellis.strip.blank().
Use before opening the graphics device.
t.test.cluster 2-sample t-test for cluster-randomized observations
uncbind Form individual variables from a matrix
upData Update a data frame (change names, labels, remove vars, etc.)
units Set or fetch "units" attribute - units of measurement for var.
varclus Graph hierarchical clustering of variables using squared
Pearson or Spearman correlations or Hoeffding D as similarities
Also includes the naclus function for examining similarities in
patterns of missing values across variables.
wtd.mean, wtd.var, wtd.quantile, wtd.Ecdf, wtd.table, wtd.rank, wtd.loess.noiter, num.denom.setup Set of functions for obtaining weighted estimates
xy.group Compute mean x vs. function of y by groups of x
xYplot Like trellis xyplot but supports error bars and multiple
response variables that are connected as separate lines
ynbind Combine a series of yes/no true/false present/absent variables into a matrix
zoom Zoom in on any graphical display
(Bill Dunlap, bill@statsci.com)

Copyright Notice

GENERAL DISCLAIMER
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

In short: You may use it any way you like, as long as you don't charge money for it, remove this notice, or hold anyone liable for its results. Also, please acknowledge the source and communicate changes to the author.

If this software is used in work presented for publication, kindly reference it using for example:
Harrell FE (2014): Hmisc: A package of miscellaneous R functions. Programs available from https://hbiostat.org/R/Hmisc/.
Be sure to reference R itself and other libraries used.

Author(s)

Frank E Harrell Jr
Professor of Biostatistics
Vanderbilt University School of Medicine
Nashville, Tennessee
fh@fharrell.com

References

See Alzola CF, Harrell FE (2004): An Introduction to S and the Hmisc and Design Libraries at https://hbiostat.org/R/doc/sintro.pdf for extensive documentation and examples for the Hmisc package.


Lag a Numeric, Character, or Factor Vector

Description

Shifts a vector shift elements later. Character or factor variables are padded with "", numerics with NA. The shift may be negative.

Usage

Lag(x, shift = 1)

Arguments

x

a vector

shift

integer specifying the number of observations to be shifted to the right. Negative values imply shifts to the left.

Details

Attributes of the original object are carried along to the new lagged one.

Value

a vector like x

Author(s)

Frank Harrell

See Also

lag

Examples

Lag(1:5,2)
Lag(letters[1:4],2)
Lag(factor(letters[1:4]),-2)
# Find which observations are the first for a given subject
id <- c('a','a','b','b','b','c')
id != Lag(id)
!duplicated(id)
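
For numeric vectors the shifting and padding described above can be sketched with head() and tail() in base R. The helper lag_sketch is hypothetical; it assumes abs(shift) is smaller than length(x), and unlike Lag it neither pads character or factor vectors with "" nor carries attributes along.

```r
# Hypothetical helper for numeric vectors only; assumes abs(shift) < length(x)
lag_sketch <- function(x, shift = 1) {
  if (shift == 0) return(x)
  pad <- rep(NA, abs(shift))
  if (shift > 0) c(pad, head(x, -shift))   # shift right: pad on the left
  else           c(tail(x, shift), pad)    # shift left: pad on the right
}

lag_sketch(1:5, 2)
lag_sketch(1:5, -2)
```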

Merge Multiple Data Frames or Data Tables

Description

Merges an arbitrarily large series of data frames or data tables containing common id variables. Information about number of observations and number of unique ids in individual and final merged datasets is printed. The first data frame/table has special meaning in that all of its observations are kept whether they match ids in other data frames or not. For all other data frames, by default non-matching observations are dropped. The first data frame is also the one against which counts of unique ids are compared. Sometimes merge drops variable attributes such as labels and units. These are restored by Merge.

Usage

Merge(..., id = NULL, all = TRUE, verbose = TRUE)

Arguments

...

two or more data frames or data tables

id

a formula containing all the identification variables such that the combination of these variables uniquely identifies subjects or records of interest. May be omitted for data tables; in that case the key function retrieves the id variables.

all

set to FALSE to drop observations not found in second and later data frames (only applies if not using data.table)

verbose

set to FALSE to not print information about observations

Examples

## Not run:
a <- data.frame(sid=1:3, age=c(20,30,40))
b <- data.frame(sid=c(1,2,2), bp=c(120,130,140))
d <- data.frame(sid=c(1,3,4), wt=c(170,180,190))
all <- Merge(a, b, d, id = ~ sid)
# First file should be the master file and must
# contain all ids that ever occur.  ids not in the master will
# not be merged from other datasets.
a <- data.table(a); setkey(a, sid)
# data.table also does not allow duplicates without allow.cartesian=TRUE
b <- data.table(sid=1:2, bp=c(120,130)); setkey(b, sid)
d <- data.table(d); setkey(d, sid)
all <- Merge(a, b, d)
## End(Not run)
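
The "first data frame is the master" idea from the Description can be approximated in base R by left-joining each later data frame onto a running result. The sketch below is an analogy only: left_join_all is a hypothetical helper, using all.x=TRUE for every step is an assumption, and it does not model the all argument, print any counts, or restore label and units attributes the way Merge does.

```r
a <- data.frame(sid = 1:3, age = c(20, 30, 40))
b <- data.frame(sid = c(1, 2, 2), bp = c(120, 130, 140))
d <- data.frame(sid = c(1, 3, 4), wt = c(170, 180, 190))

# Left-join each later data frame onto the running result, so every row of
# the first (master) data frame is kept and ids absent from it are dropped
left_join_all <- function(...) {
  Reduce(function(l, r) merge(l, r, by = "sid", all.x = TRUE), list(...))
}

m <- left_join_all(a, b, d)
m
```

Here sid 2 appears twice because b has two records for it, sid 4 is absent because it never occurs in a, and wt is NA wherever d has no matching id.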

Miscellaneous Functions

Description

This documents miscellaneous small functions in Hmisc that may be of interest to users.

clowess runs lowess but if the iter argument exceeds zero, sometimes wild values can result, in which case lowess is re-run with iter=0.

confbar draws multi-level confidence bars using small rectangles that may be of different colors.

getLatestSource fetches and sources the most recent source code for functions in GitHub.

grType retrieves the system option grType, which is forced to be "base" if the plotly package is not installed.

prType retrieves the system option prType, which is set to "plain" if the option is not set. print methods that allow for markdown/html/latex can be automatically invoked by setting options(prType="html") or options(prType='latex').

htmlSpecialType retrieves the system option htmlSpecialType, which is set to "unicode" if the option is not set. htmlSpecialType='unicode' causes html-generating functions in Hmisc and rms to use unicode for special characters, and htmlSpecialType='&' uses the older ampersand 3-digit format.

inverseFunction generates a function to find all inverses of a monotonic or nonmonotonic function that is tabulated at vectors (x,y), typically 1000 points. If the original function is monotonic, simple linear interpolation is used and the result is a vector, otherwise linear interpolation is used within each interval in which the function is monotonic and the result is a matrix with number of columns equal to the number of monotonic intervals. If a requested y is not within any interval, the extreme x that pertains to the nearest extreme y is returned. Specifying what='sample' to the returned function will cause a vector to be returned instead of a matrix, with elements taken as a random choice of the possible inverses.
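
For the monotonic case, the idea behind inverseFunction can be sketched with base R's approx() by swapping the roles of x and y. This is a minimal illustration, not the Hmisc implementation: inv_sketch is a hypothetical name, and the nonmonotonic (matrix-valued) case and what='sample' are not covered.

```r
x <- seq(0, 2, length.out = 1000)
y <- x^2                          # monotonic on [0, 2]

# Hypothetical helper: swap the roles of x and y and interpolate linearly;
# rule = 2 returns the extreme x when the requested y is out of range
inv_sketch <- function(ynew) approx(y, x, xout = ynew, rule = 2)$y

inv_sketch(c(0.25, 1, 2.25))      # close to 0.5, 1, 1.5
inv_sketch(10)                    # outside range(y): the largest x, 2
```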

james.stein computes James-Stein shrunken estimates of cell means given a response variable (which may be binary) and a grouping indicator.

keepHattrib for an input variable or a data frame, creates a list object saving special Hmisc attributes such as label and units that might be lost during certain operations such as running data.table. restoreHattrib restores these attributes.

km.quick provides a fast way to invoke survfitKM in the survival package to efficiently get Kaplan-Meier or Fleming-Harrington estimates for a single stratum for a vector of time points (if times is given) or to get a vector of survival time quantiles (if q is given). If neither is given, the whole curve is returned in a list with objects time and surv, and there is an option to consider an interval as pertaining to greater than or equal to a specific time instead of the traditional greater than. If the censoring is not right censoring, the more general survfit is called by km.quick.

latexBuild takes pairs of character strings and produces a single character string containing concatenation of all of them, plus an attribute "close" which is a character string containing the LaTeX closure that will balance LaTeX code with respect to parentheses, braces, brackets, or begin vs. end. When an even-numbered element of the vector is not a left parenthesis, brace, or bracket, the element is taken as a word that was surrounded by begin and braces, for which the corresponding end is constructed in the returned attribute.

lm.fit.qr.bare is a fast stripped-down function for computing regression coefficients, residuals, R^2, and fitted values. It uses lm.fit.

matxv multiplies a matrix by a vector, handling automatic addition of intercepts if the matrix does not have a column of ones. If the first argument is not a matrix, it will be converted to one. An optional argument allows the second argument to be treated as a matrix, useful when its rows represent bootstrap reps of coefficients. Then ab' is computed. matxv respects the "intercepts" attribute if it is stored on b by the rms package. This is used by orm fits that are bootstrap-repeated by bootcov where only the intercept corresponding to the median is retained. If kint has nonzero length, it is checked for consistency with the attribute.

makeSteps is a copy of the dostep function inside the survival package's plot.survfit function. It expands a series of points to include all the segments needed to plot step functions. This is useful for drawing polygons to shade confidence bands for step functions.

nomiss returns a data frame (if its argument is one) with rows corresponding to NAs removed, or it returns a matrix with rows with any element missing removed.

outerText uses axis() to put right-justified text strings in the right margin. Placement depends on par('mar')[4].

plotlyParm is a list of functions useful for specifying parameters to plotly graphics.

plotp is a generic to handle plotp methods to make plotly graphics.

rendHTML renders HTML in a character vector, first converting to one character string with newline delimiters. If knitr is currently running, runs this string through knitr::asis_output so that the user need not include results='asis' in the chunk header for R Markdown or Quarto. If knitr is not running, uses htmltools::browsable and htmltools::HTML and prints the result so that an RStudio viewer (if running inside RStudio) or separate browser window displays the rendered HTML. The HTML code is surrounded by yaml markup to make Pandoc not fiddle with the HTML. Set the argument html=FALSE to not add this, in case you are really rendering markdown. html=FALSE also invokes rmarkdown::render to convert the character vector to HTML before using htmltools to view, assuming the characters represent RMarkdown/Quarto text other than the YAML header. If options(rawmarkup=TRUE) is in effect, rendHTML will just cat() its first argument. This is useful when rendering is happening inside a Quarto margin, for example.

sepUnitsTrans converts character vectors containing values such as c("3 days","3day","4month","2 years","2weeks","7") to numeric vectors (here c(3,3,122,730,14,7)) in a flexible fashion. The user can specify a vector of units of measurements and conversion factors. The units with a conversion factor of 1 are taken as the target units, and if those units are present in the character strings they are ignored. The target units are added to the resulting vector as the "units" attribute.
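The conversion arithmetic can be illustrated with a small base R sketch. The helper to_days and its regular expressions are hypothetical and much less flexible than sepUnitsTrans; the conversion vector is the default shown in the Usage section. Without rounding, 4 months and 2 years come out as 121.75 and 730.5 days, which correspond to the rounded 122 and 730 quoted above.

```r
conv <- c(day = 1, month = 365.25 / 12, year = 365.25, week = 7)

# Hypothetical helper: split "<number><optional unit>" and rescale to days
to_days <- function(s, conversion = conv) {
  num  <- as.numeric(sub("^\\s*([0-9.]+).*$", "\\1", s))    # leading number
  unit <- trimws(sub("^\\s*[0-9.]+", "", s))                # what follows it
  unit <- sub("s$", "", unit)                               # drop plural "s"
  fac  <- ifelse(unit == "", 1, conversion[unit])           # no unit: target units
  unname(num * fac)
}

days <- to_days(c("3 days", "3day", "4month", "2 years", "2weeks", "7"))
days
```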

strgraphwrap is like strwrap but is for the current graphics environment.

tobase64image is a function written by Dirk Eddelbuettel that uses the base64enc package to convert a png graphic file to base64 encoding to include as an inline image in an html file.

trap.rule computes the area under a curve using the trapezoidal rule, assuming x is sorted.
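As a reminder of what the trapezoidal rule computes, here is a minimal base R sketch. The function name trap is an assumption for illustration and is not the Hmisc source; like trap.rule it assumes x is sorted.

```r
# Area under the piecewise-linear curve through (x, y), x sorted ascending
trap <- function(x, y) {
  sum(diff(x) * (y[-1] + y[-length(y)])) / 2
}

area <- trap(1:100, 1:100)   # integral of y = x from 1 to 100
area
```

Because the curve here is a straight line, the rule is exact and the result is (100^2 - 1^2) / 2.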

trellis.strip.blank sets up Trellis or Lattice graphs to have a clear background on the strips for panel labels.

unPaste provides a version of the S-Plus unpaste that works for R and S-Plus.

whichClosePW is a very fast function using weighted multinomial sampling to determine which element of a vector is "closest" to each element of another vector. whichClosest quickly finds the closest element without any randomness.

whichClosek is a slow function that finds, after jittering the lookup table, the k closest matches to each element of the other vector, and chooses from among these one at random.

xless is a function for Linux/Unix users to invoke the system xless command to pop up a window to display the result of printing an object. For MacOS xless uses the system open command to pop up a TextEdit window.

Usage

confbar(at, est, se, width, q = c(0.7, 0.8, 0.9, 0.95, 0.99),
        col = gray(c(0, 0.25, 0.5, 0.75, 1)),
        type = c("v", "h"), labels = TRUE, ticks = FALSE,
        cex = 0.5, side = "l", lwd = 5, clip = c(-1e+30, 1e+30),
        fun = function(x) x,
        qfun = function(x) ifelse(x == 0.5, qnorm(x),
                            ifelse(x < 0.5, qnorm(x/2),
                            qnorm((1 +  x)/2))))
getLatestSource(x=NULL, package='Hmisc', recent=NULL, avail=FALSE)
grType()
prType()
htmlSpecialType()
inverseFunction(x, y)
james.stein(y, group)
keepHattrib(obj)
km.quick(S, times, q,
        type = c("kaplan-meier", "fleming-harrington", "fh2"),
        interval = c(">", ">="), method=c('constant', 'linear'), fapprox=0, n.risk=FALSE)
latexBuild(..., insert, sep='')
lm.fit.qr.bare(x, y, tolerance, intercept=TRUE, xpxi=FALSE, singzero=FALSE)
matxv(a, b, kint=1, bmat=FALSE)
nomiss(x)
outerText(string, y, cex=par('cex'), ...)
plotlyParm
plotp(data, ...)
rendHTML(x, html=TRUE)
restoreHattrib(obj, attribs)
sepUnitsTrans(x, conversion=c(day=1, month=365.25/12, year=365.25, week=7),
              round=FALSE, digits=0)
strgraphwrap(x, width = 0.9 * getOption("width"),
             indent = 0, exdent = 0,
             prefix = "", simplify = TRUE, units='user', cex=NULL)
tobase64image(file, Rd = FALSE, alt = "image")
trap.rule(x, y)
trellis.strip.blank()
unPaste(str, sep="/")
whichClosest(x, w)
whichClosePW(x, w, f=0.2)
whichClosek(x, w, k)
xless(x, ..., title)

Arguments

a

a numeric matrix or vector

alt,Rd

see base64::img

at

x-coordinate for vertical confidence intervals, y-coordinate for horizontal

attribs

an object returned by keepHattrib

avail

set to TRUE to have getLatestSource return a data frame of available files and latest versions instead of fetching any

b

a numeric vector

cex

character expansion factor

clip

interval to truncate limits

col

vector of colors

conversion

a named numeric vector

data

an object having a plotp method

digits

number of digits used for round

est

vector of point estimates for confidence limits

f

a scaling constant

file

a file name

fun

function to transform scale

group

a categorical grouping variable

html

set to FALSE to tell rendHTML to not surround HTML code with yaml

insert

a list of 3-element lists for latexBuild. The first of each 3-element list is a character string with an environment name. The second specifies the order, "before" or "after": when the environment is found, the third element of the list is inserted before or after it, according to the second element.

intercept

set to FALSE to not automatically add a column of ones to the x matrix

k

get the k closest matches

kint

which element of b to add to the result if a does not contain a column for intercepts

bmat

set to TRUE to consider b a matrix of repeated coefficients, usually resampled estimates with rows corresponding to resamples

labels

set to FALSE to omit drawing confidence coefficients

lwd

line widths

package

name of package for getLatestSource, default is 'Hmisc'

obj

a variable, data frame, or data table

q

vector of confidence coefficients or quantiles

qfun

quantiles on transformed scale

recent

an integer telling getLatestSource to get the recent most recently modified files from the package

round

set to TRUE to round converted values

S

a Surv object

se

vector of standard errors

sep

a single character string specifying the delimiter. For latexBuild the default is "".

side

for confbar is "b", "l", "t", "r" for bottom, left, top, right.

str

a character string vector

string

a character string vector

ticks

set toTRUE to draw lines between rectangles

times

a numeric vector of times

title

a character string to title a window or plot. Ignored for xless under MacOS.

tolerance

tolerance for judging singularity in matrix

type

"v" for vertical, "h" for horizontal. For km.quick specifies the type of survival estimator.

w

a numeric vector

width

width of confidence rectangles in user units, or see strwrap

x

a numeric vector (matrix for lm.fit.qr.bare) or data frame. For xless may be any object that is sensible to print. For sepUnitsTrans is a character or factor variable. For getLatestSource is a character string or vector of character strings containing base file names to retrieve from CVS. Set x='all' to retrieve all source files. For clowess, x may also be a list with x and y components. For inverseFunction, x and y contain evaluations of the function whose inverse is needed. x is typically an equally-spaced grid of 1000 points. For strgraphwrap is a character vector. For rendHTML x is a character vector.

xpxi

set to TRUE to add an element to the result containing the inverse of X'X

singzero

set to TRUE to set coefficients corresponding to singular variables to zero instead of NA.

y

a numeric vector. For inverseFunction y is the evaluated function values at x.

indent,exdent,prefix

see strwrap

simplify

see sapply

units

see par

interval

specifies whether to deal with probabilities of exceeding a value (the default) or of exceeding or equalling the value

method,fapprox

see approx

n.risk

set to TRUE to include the number at risk in the result

...

arguments passed through to another function. For latexBuild represents pairs, with odd numbered elements being character strings containing LaTeX code or a zero-length object to ignore, and even-numbered elements representing LaTeX left parenthesis, left brace, or left bracket, or environment name.

Author(s)

Frank Harrell and Charles Dupont

Examples

trap.rule(1:100,1:100)

unPaste(c('a;b or c','ab;d','qr;s'), ';')

sepUnitsTrans(c('3 days','4 months','2 years','7'))

set.seed(1)
whichClosest(1:100, 3:5)
whichClosest(1:100, rep(3,20))

whichClosePW(1:100, rep(3,20))
whichClosePW(1:100, rep(3,20), f=.05)
whichClosePW(1:100, rep(3,20), f=1e-10)

x <- seq(-1, 1, by=.01)
y <- x^2
h <- inverseFunction(x,y)
formals(h)$turns   # vertex
a <- seq(0, 1, by=.01)
plot(0, 0, type='n', xlim=c(-.5,1.5))
lines(a, h(a)[,1])            ## first inverse
lines(a, h(a)[,2], col='red') ## second inverse
a <- c(-.1, 1.01, 1.1, 1.2)
points(a, h(a)[,1])

d <- data.frame(x=1:2, y=3:4, z=5:6)
d <- upData(d, labels=c(x='X', z='Z lab'), units=c(z='mm'))
a <- keepHattrib(d)

d <- data.frame(x=1:2, y=3:4, z=5:6)
d2 <- restoreHattrib(d, a)
sapply(d2, attributes)

## Not run: 
getLatestSource(recent=5)  # source() most recent 5 revised files in Hmisc
getLatestSource('cut2')    # fetch and source latest cut2.s
getLatestSource('all')     # get everything
getLatestSource(avail=TRUE) # list available files and latest versions
## End(Not run)

R2Measures

Description

Generalized R^2 Measures

Usage

R2Measures(lr, p, n, ess = NULL, padj = 1)

Arguments

lr

likelihood ratio chi-square statistic

p

number of non-intercepts in the model that achieved lr

n

raw number of observations

ess

if a single number, is the effective sample size. If a vector of numbers, it is assumed to be the frequency tabulation of all distinct values of the outcome variable, from which the effective sample size is computed.

padj

set to 2 to use the classical adjusted R^2 penalty, 1 (the default) to subtract p from lr

Details

Computes various generalized R^2 measures related to the Maddala-Cox-Snell (MCS) R^2 for regression models fitted with maximum likelihood. The original MCS R^2 is labeled R2 in the result. This measure uses the raw sample size n and does not penalize for the number of free parameters, so it can be rewarded for overfitting. A measure adjusted for the number of fitted regression coefficients p uses the analogy to R^2 in linear models by computing 1 - exp(- lr / n) * (n-1)/(n-p-1) if padj=2, which is approximately 1 - exp(- (lr - p) / n), the version used if padj=1 (the default). The latter measure is appealing because the expected value of the likelihood ratio chi-square statistic lr is p under the global null hypothesis of no predictors being associated with the response variable. See https://hbiostat.org/bib/r2.html for more details.
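The three formulas in this paragraph can be evaluated directly in base R. The sketch below is a worked calculation, not a substitute for R2Measures; the variable names are chosen for the example, and the inputs (lr = 138.6267, p = 1, n = 100) are those of the perfectly separated logistic example in the Examples section.

```r
lr <- 138.6267   # likelihood ratio chi-square
p  <- 1          # number of non-intercept parameters
n  <- 100        # raw sample size

r2_mcs   <- 1 - exp(-lr / n)                          # unadjusted MCS R^2
r2_padj2 <- 1 - exp(-lr / n) * (n - 1) / (n - p - 1)  # classical adjusted R^2 penalty
r2_padj1 <- 1 - exp(-(lr - p) / n)                    # subtract p from lr

c(R2 = r2_mcs, padj2 = r2_padj2, padj1 = r2_padj1)
```

The unadjusted value is about 0.75, and the two adjusted versions are slightly smaller and nearly identical to each other, which illustrates the approximation stated above.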

It is well known that in logistic regression the MCS R^2 cannot achieve a value of 1.0 even with a perfect model, which prompted Nagelkerke to divide the R^2 measure by its maximum attainable value. This is not necessarily the best recalibration of R^2 throughout its range. An alternative is to use the formulas above but to replace the raw sample size n with the effective sample size, which for data with many ties can be significantly lower than the number of observations. As used in the popower() and describe() functions, in the context of a Wilcoxon test or the proportional odds model, the effective sample size is n * (1 - f) where f is the sum of cubes of the proportions of observations at each distinct value of the response variable. Whitehead derived this from an approximation to the variance of a log odds ratio in a proportional odds model. To obtain R^2 measures using the effective sample size, either provide ess as a single number specifying the effective sample size, or specify a vector of frequencies of distinct Y values from which the effective sample size will be computed. In the context of survival analysis, the single number effective sample size you may wish to specify is the number of uncensored observations. This is exactly correct when estimating the hazard rate from a simple exponential distribution or when using the Cox PH/log-rank test. For failure time distributions with a very high early hazard, censored observations contain enough information that the effective sample size is greater than the number of events. See Benedetti et al, 1982.

If the effective sample size equals the raw sample size, measures involving the effective sample size are set to NA.
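The effective sample size n * (1 - f) is simple to compute from a frequency table. In this base R sketch the helper name ess_from_freq is an assumption made for illustration; the frequencies c(50, 50) and lr = 138.6267 are taken from the Examples section.

```r
# Effective sample size from frequencies of the distinct outcome values
ess_from_freq <- function(freq) {
  n <- sum(freq)
  f <- sum((freq / n)^3)   # sum of cubed proportions
  n * (1 - f)
}

ess <- ess_from_freq(c(50, 50))   # 100 * (1 - 2 * 0.5^3) = 75
lr  <- 138.6267
r2_ess <- 1 - exp(-lr / ess)      # MCS formula with n replaced by ess
r2_ess                            # about 0.84
```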

Value

named vector of R2 measures. The notation for results is R^2(p, n) where the p component is empty for unadjusted estimates and n is the sample size used (actual sample size for first measures, effective sample size for remaining ones). For indexes that are not adjusted, only n appears.

Author(s)

Frank Harrell

References

Smith TJ and McKenna CM (2013): A comparison of logistic regression pseudo R^2 indices. Multiple Linear Regression Viewpoints 39:17-26. https://www.glmj.org/archives/articles/Smith_v39n2.pdf

Benedetti JK, et al (1982): Effective sample size for tests of censored survival data. Biometrika 69:343–349.

Mittlbock M, Schemper M (1996): Explained variation for logistic regression. Stat in Med 15:1987-1997.

Date, S: R-squared, adjusted R-squared and pseudo R-squared. https://timeseriesreasoning.com/contents/r-squared-adjusted-r-squared-pseudo-r-squared/

UCLA: What are pseudo R-squareds? https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/

Allison P (2013): What's the best R-squared for logistic regression? https://statisticalhorizons.com/r2logistic/

Menard S (2000): Coefficients of determination for multiple logistic regression analysis. The Am Statistician 54:17-24.

Whitehead J (1993): Sample size calculations for ordered categorical data. Stat in Med 12:2257-2271. See errata (1994) 13:871 and letter to the editor by Julious SA, Campbell MJ (1996) 15:1065-1066 showing that for 2-category Y the Whitehead sample size formula agrees closely with the usual formula for comparing two proportions.

Examples

x <- c(rep(0, 50), rep(1, 50))
y <- x
# f <- lrm(y ~ x)
# f   # Nagelkerke R^2=1.0
# lr <- f$stats['Model L.R.']
# 1 - exp(- lr / 100)  # Maddala-Cox-Snell (MCS) 0.75
lr <- 138.6267  # manually so don't need rms package

R2Measures(lr, 1, 100, c(50, 50))  # 0.84 Effective n=75
R2Measures(lr, 1, 100, 50)         # 0.94
# MCS requires unreasonable effective sample size = minimum outcome
# frequency to get close to the 1.0 that Nagelkerke R^2 achieves

Facilitate Use of save and load to Remote Directories

Description

These functions are slightly enhanced versions of save and load that allow a target directory to be specified using options(LoadPath="pathname"). If the LoadPath option is not set, the current working directory is used.

Usage

# options(LoadPath='mypath')
Save(object, name=deparse(substitute(object)), compress=TRUE)
Load(object)

Arguments

object

the name of an object, usually a data frame. It must not be quoted.

name

an optional name to assign to the object and file name prefix, if the argument name is not used

compress

see save. Default is TRUE which corresponds to gzip.

Details

Save creates a temporary version of the object under the name given by the user, so that save will internalize this name. Then subsequent Load or load will cause an object of the original name to be created in the global environment. The name of the R data file is assumed to be the name of the object (or the value of name) appended with ".rda".
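The mechanism described here, storing an object under a chosen name so that load recreates it under that name, can be sketched with base R alone. The helper save_as is hypothetical and writes to tempdir() rather than consulting the LoadPath option; it is meant only to show the idea, not how Save is implemented.

```r
# Hypothetical helper: save `object` under `name` in <dir>/<name>.rda
save_as <- function(object, name, dir = tempdir()) {
  file <- file.path(dir, paste0(name, ".rda"))
  e <- new.env()
  assign(name, object, envir = e)    # temporary copy under the requested name
  save(list = name, envir = e, file = file, compress = TRUE)
  invisible(file)
}

d <- data.frame(x = 1:3, y = 11:13)
f <- save_as(d, "D")

e2 <- new.env()
load(f, envir = e2)                  # creates an object named D in e2
ls(e2)
```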

Author(s)

Frank Harrell

See Also

save, load

Examples

## Not run: 
d <- data.frame(x=1:3, y=11:13)
options(LoadPath='../data/rda')
Save(d)   # creates ../data/rda/d.rda
Load(d)   # reads   ../data/rda/d.rda
Save(d, 'D')   # creates object D and saves it in .../D.rda
## End(Not run)

Indexes of Absolute Prediction Error for Linear Models

Description

Computes the mean and median of various absolute errors related to ordinary multiple regression models. The mean and median absolute errors correspond to the mean square due to regression, error, and total. The absolute errors computed are derived from \hat{Y} - \mbox{median($\hat{Y}$)}, \hat{Y} - Y, and Y - \mbox{median($Y$)}. The function also computes ratios that correspond to R^2 and 1 - R^2 (but these ratios do not add to 1.0); the R^2 measure is the ratio of mean or median absolute \hat{Y} - \mbox{median($\hat{Y}$)} to the mean or median absolute Y - \mbox{median($Y$)}. The 1 - R^2 or SSE/SST measure is the mean or median absolute \hat{Y} - Y divided by the mean or median absolute \hat{Y} - \mbox{median($Y$)}.
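The three kinds of absolute error, and the ratio that plays the role of R^2, can be worked through by hand on a tiny made-up data set. This base R sketch only mirrors the definitions in the paragraph above; the data and object names are assumptions, and it is not the code used by abs.error.pred.

```r
y  <- c(1, 2, 3, 4, 10)       # observed responses (made-up data)
lp <- c(1.5, 2, 3, 4.5, 8)    # predicted values (Y hat)

abs_regr  <- abs(lp - median(lp))  # analogue of regression variation
abs_error <- abs(lp - y)           # analogue of error variation
abs_total <- abs(y - median(y))    # analogue of total variation

differences <- rbind(
  regression = c(mean = mean(abs_regr),  median = median(abs_regr)),
  error      = c(mean = mean(abs_error), median = median(abs_error)),
  total      = c(mean = mean(abs_total), median = median(abs_total))
)

# Ratio corresponding to R^2: regression relative to total
r2_like <- differences["regression", ] / differences["total", ]
differences
r2_like
```

Note that the median-based ratio exceeds 1 for these data, a reminder that these ratios behave differently from the usual R^2.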

Usage

abs.error.pred(fit, lp=NULL, y=NULL)

## S3 method for class 'abs.error.pred'
print(x, ...)

Arguments

fit

a fit object typically from lm or ols that contains a y vector (i.e., you should have specified y=TRUE to the fitting function) unless the y argument is given to abs.error.pred. If you do not specify the lp argument, fit must contain fitted.values or linear.predictors. You must specify fit or both of lp and y.

lp

a vector of predicted values (Y hat above) if fit is not given

y

a vector of response variable values if fit (with y=TRUE in effect) is not given

x

an object created by abs.error.pred

...

unused

Value

a list of class abs.error.pred (used by print.abs.error.pred) containing two matrices: differences and ratios.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com

References

Schemper M (2003): Stat in Med 22:2299-2308.

Tian L, Cai T, Goetghebeur E, Wei LJ (2007): Biometrika 94:297-311.

See Also

lm, ols, cor, validate.ols

Examples

set.seed(1)         # so can regenerate results
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- exp(x1+x2+rnorm(100))
f <- lm(log(y) ~ x1 + poly(x2,3), y=TRUE)
abs.error.pred(lp=exp(fitted(f)), y=y)
rm(x1,x2,y,f)

Add Marginal Observations

Description

Given a data frame and the names of variables, doubles the data frame for each variable with a new category "All" by default, or by the value of label. A new variable .marginal. is added to the resulting data frame, with value "" if the observation is an original one, and with value equal to the names of the variable being marginalized (separated by commas) otherwise. If there is another stratification variable besides the one in ..., and that variable is nested inside the variable in ..., specify nested=variable name to have the value of that variable set to label whenever marginal observations are created for .... See the state-city example below.
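The doubling operation for a single variable can be sketched in base R as follows. The helper add_all is hypothetical: it handles one variable named by a character string, puts the marginal category last, and does not support nested, so it should be read as an illustration of the idea rather than as addMarginal itself.

```r
# Hypothetical one-variable version of the doubling described above
add_all <- function(data, var, label = "All") {
  orig_levels <- unique(as.character(data[[var]]))
  data[[var]] <- as.character(data[[var]])

  marg <- data
  marg[[var]] <- rep(label, nrow(data))       # marginal copy of every row

  data$.marginal. <- rep("", nrow(data))      # original observations
  marg$.marginal. <- rep(var, nrow(data))     # name of marginalized variable

  out <- rbind(data, marg)
  out[[var]] <- factor(out[[var]], levels = c(orig_levels, label))
  out
}

d <- expand.grid(sex = c("female", "male"), country = c("US", "Romania"),
                 reps = 1:2)
m <- add_all(d, "sex")
table(m$sex, m$.marginal.)
```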

Usage

addMarginal(data, ..., label = "All", margloc=c('last', 'first'), nested)

Arguments

data

a data frame

...

a list of names of variables to marginalize

label

category name for added marginal observations

margloc

location for marginal category within factor variable specifying categories. Set to "first" to override the default - to put a category with value label as the first category.

nested

a single unquoted variable name if used

Examples

d <- expand.grid(sex=c('female', 'male'), country=c('US', 'Romania'),
                 reps=1:2)
addMarginal(d, sex, country)

# Example of nested variables
d <- data.frame(state=c('AL', 'AL', 'GA', 'GA', 'GA'),
                city=c('Mobile', 'Montgomery', 'Valdosto',
                       'Augusta', 'Atlanta'),
                x=1:5, stringsAsFactors=TRUE)
addMarginal(d, state, nested=city) # city set to 'All' when state is

addggLayers

Description

Add Spike Histograms and Extended Box Plots to ggplot

Usage

addggLayers(
  g,
  data,
  type = c("ebp", "spike"),
  ylim = layer_scales(g)$y$get_limits(),
  by = "variable",
  value = "value",
  frac = 0.065,
  mult = 1,
  facet = NULL,
  pos = c("bottom", "top"),
  showN = TRUE
)

Arguments

g

a ggplot object

data

data frame/table containing raw data

type

specifies either extended box plot or spike histogram. Both are horizontal, so they show the distribution of the x-axis variable.

ylim

y-axis limits to use for scaling the height of the added plots, if you don't want to use the limits that ggplot has stored

by

the name of a variable in data used to stratify raw data

value

name of x-variable

frac

fraction of y-axis range to devote to vertical aspect of the added plot

mult

fudge factor for scaling y aspect

facet

optional faceting variable

pos

position for added plot

showN

set to FALSE to not show sample sizes

Details

For an example see this. Note that it was not possible to just create the layers needed to be added, as creating these particular layers in isolation resulted in a ggplot error.

Value

the original ggplot object with more layers added

Author(s)

Frank Harrell

See Also

spikecomp()


Check if All Elements in Character Vector are Numeric

Description

Tests, without issuing warnings, whether all elements of a character vector are legal numeric values, or optionally converts the vector to a numeric vector. Leading and trailing blanks in x are ignored.
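The core of such a test can be sketched in base R with suppressWarnings and as.numeric. The helper all_numeric below is an assumption for illustration and covers only the default what="test" behavior; its extras default copies the one shown in the Usage section.

```r
# Hypothetical helper: TRUE if every non-blank, non-extra element is numeric
all_numeric <- function(x, extras = c(".", "NA")) {
  x <- trimws(x)                                  # ignore surrounding blanks
  x <- x[!is.na(x) & !(x %in% c("", extras))]     # drop missing and extras
  length(x) > 0 && !anyNA(suppressWarnings(as.numeric(x)))
}

all_numeric(c("1", "1.2", "3"))    # all elements convert cleanly
all_numeric(c("1", "1.2", "3a"))   # "3a" is not numeric
all_numeric(c("", " ."))           # nothing left to test
```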

Usage

all.is.numeric(x, what = c("test", "vector", "nonnum"), extras=c('.','NA'))

Arguments

x

a character vector

what

specify what="vector" to return a numeric vector if it passes the test, or the original character vector otherwise, the default "test" to return FALSE if there are no non-missing non-extra values of x or there is at least one non-numeric value of x, or "nonnum" to return the vector of non-extra, non-NA, non-numeric values of x.

extras

a vector of character strings to count as numeric values, other than "".

Value

a logical value if what="test" or a vector otherwise

Author(s)

Frank Harrell

See Also

as.numeric

Examples

all.is.numeric(c('1','1.2','3'))
all.is.numeric(c('1','1.2','3a'))
all.is.numeric(c('1','1.2','3'),'vector')
all.is.numeric(c('1','1.2','3a'),'vector')
all.is.numeric(c('1','',' .'),'vector')
all.is.numeric(c('1', '1.2', '3a'), 'nonnum')

Linear Extrapolation

Description

Works in conjunction with the approx function to do linear extrapolation. approx in R does not support extrapolation at all, and it is buggy in S-Plus 6.

Usage

approxExtrap(x, y, xout, method = "linear", n = 50, rule = 2, f = 0,
             ties = "ordered", na.rm = FALSE)

Arguments

x,y,xout,method,n,rule,f

see approx

ties

applies only to R. See approx

na.rm

set to TRUE to remove NAs in x and y before proceeding

Details

Duplicates in x (and corresponding y elements) are removed before using approx.
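One way to obtain linear extrapolation on top of approx is to interpolate inside the data range and extend the first and last line segments outside it. The base R sketch below illustrates that approach under the assumption that x has at least two distinct values; the name extrap is hypothetical and the code is not taken from approxExtrap.

```r
# Hypothetical helper: approx() inside the range, straight-line extension outside
extrap <- function(x, y, xout) {
  o <- order(x); x <- x[o]; y <- y[o]
  keep <- !duplicated(x)                    # drop duplicate x (and matching y)
  x <- x[keep]; y <- y[keep]
  n <- length(x)

  yout <- approx(x, y, xout = xout, rule = 2)$y
  lo <- xout < x[1]
  hi <- xout > x[n]
  yout[lo] <- y[1] + (xout[lo] - x[1]) * (y[2] - y[1]) / (x[2] - x[1])
  yout[hi] <- y[n] + (xout[hi] - x[n]) * (y[n] - y[n - 1]) / (x[n] - x[n - 1])
  yout
}

res <- extrap(1:3, 1:3, xout = c(0, 2.5, 4))
res
```

For points on the line y = x the extrapolated values simply continue the line.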

Value

a vector the same length as xout

Author(s)

Frank Harrell

See Also

approx

Examples

approxExtrap(1:3,1:3,xout=c(0,4))

Additive Regression with Optimal Transformations on Both Sides using Canonical Variates

Description

Expands continuous variables into restricted cubic spline bases and categorical variables into dummy variables and fits a multivariate equation using canonical variates. This finds optimum transformations that maximize R^2. Optionally, the bootstrap is used to estimate the covariance matrix of both left- and right-hand-side transformation parameters, and to estimate the bias in the R^2 due to overfitting and compute the bootstrap optimism-corrected R^2. Cross-validation can also be used to get an unbiased estimate of R^2 but this is not as precise as the bootstrap estimate. The bootstrap and cross-validation may also be used to get estimates of mean and median absolute error in predicted values on the original y scale. These two estimates are perhaps the best ones for gauging the accuracy of a flexible model, because it is difficult to compare R^2 under different y-transformations, and because R^2 allows for an out-of-sample recalibration (i.e., it only measures relative errors).
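The canonical-variate idea can be sketched with tools that ship with R. In this illustration splines::ns (natural cubic splines) stands in for the restricted cubic spline bases, and stats::cancor finds the linear combinations of the two bases with maximal correlation; the simulated data and object names are assumptions, and the sketch omits everything else areg does (categorical variables, bootstrap, cross-validation).

```r
library(splines)   # part of the standard R distribution

set.seed(1)
n <- 200
x <- rnorm(n)
y <- (x + rnorm(n, sd = 0.5))^2          # nonlinear relationship

# Expand each side into a spline basis (plain matrices for cancor)
X <- matrix(ns(x, df = 3), nrow = n)
Y <- matrix(ns(y, df = 3), nrow = n)

cc <- cancor(X, Y)
r2_transformed <- cc$cor[1]^2            # squared first canonical correlation
r2_linear      <- cor(x, y)^2            # no transformation on either side

c(linear = r2_linear, transformed = r2_transformed)
```

Because linear functions lie in the span of each spline basis plus a constant, the transformed R^2 cannot be smaller than the linear one, which is also why it is optimistic without bootstrap or cross-validation correction.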

Note that uncertainty about the proper transformation of y causes an enormous amount of model uncertainty. When the transformation for y is estimated from the data a high variance in predicted values on the original y scale may result, especially if the true transformation is linear. Comparing bootstrap or cross-validated mean absolute errors with and without restricting the y transform to be linear (ytype='l') may help the analyst choose the proper model complexity.

Usage

areg(x, y, xtype = NULL, ytype = NULL, nk = 4,
     B = 0, na.rm = TRUE, tolerance = NULL, crossval = NULL)

## S3 method for class 'areg'
print(x, digits=4, ...)

## S3 method for class 'areg'
plot(x, whichx = 1:ncol(x$x), ...)

## S3 method for class 'areg'
predict(object, x, type=c('lp','fitted','x'),
                       what=c('all','sample'), ...)

Arguments

x

A single predictor or a matrix of predictors. Categorical predictors are required to be coded as integers (as factor does internally). For predict, x is a data matrix with the same integer codes that were originally used for categorical variables.

y

a factor, categorical, character, or numeric response variable

xtype

a vector of one-letter character codes specifying how each predictor is to be modeled, in order of columns of x. The codes are "s" for smooth function (using restricted cubic splines), "l" for no transformation (linear), or "c" for categorical (to cause expansion into dummy variables). Default is "s" if nk > 0 and "l" if nk=0.

ytype

same coding as for xtype. Default is "s" for a numeric variable with more than two unique values, "l" for a binary numeric variable, and "c" for a factor, categorical, or character variable.

nk

number of knots, 0 for linear, or 3 or more. Default is 4 which will fit 3 parameters to continuous variables (one linear term and two nonlinear terms)

B

number of bootstrap resamples used to estimate covariance matrices of transformation parameters. Default is no bootstrapping.

na.rm

set to FALSE if you are sure that observations with NAs have already been removed

tolerance

singularity tolerance. List source code for lm.fit.qr.bare for details.

crossval

set to a positive integer k to compute k-fold cross-validated R-squared (square of first canonical correlation) and mean and median absolute error of predictions on the original scale

digits

number of digits to use in formatting for printing

object

an object created by areg

whichx

integer or character vector specifying which predictors are to have their transformations plotted (default is all). The y transformation is always plotted.

type

tells predict whether to obtain predicted untransformed y (type='lp', the default) or predicted y on the original scale (type='fitted'), or the design matrix for the right-hand side (type='x').

what

When the y-transform is non-monotonic you may specify what='sample' to predict to obtain a random sample of y values on the original scale instead of a matrix of all y-inverses. See inverseFunction.

...

arguments passed to the plot function.

Details

areg is a competitor of ace in the acepack package. Transformations from ace are seldom smooth enough and are often overfitted. With areg the complexity can be controlled with the nk parameter, and predicted values are easy to obtain because parametric functions are fitted.

If one side of the equation has a categorical variable with more than two categories and the other side has a continuous variable not assumed to act linearly, larger sample sizes are needed to reliably estimate transformations, as it is difficult to optimally score categorical variables to maximize R^2 against a simultaneously optimally transformed continuous variable.

Value

a list of class"areg" containing many objects

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

References

Breiman and Friedman, Journal of the American Statistical Association (September, 1985).

See Also

cancor, ace, transcan

Examples

set.seed(1)

ns <- c(30,300,3000)
for(n in ns) {
  y <- sample(1:5, n, TRUE)
  x <- abs(y-3) + runif(n)
  par(mfrow=c(3,4))
  for(k in c(0,3:5)) {
    z <- areg(x, y, ytype='c', nk=k)
    plot(x, z$tx)
    title(paste('R2=',format(z$rsquared)))
    tapply(z$ty, y, range)
    a <- tapply(x,y,mean)
    b <- tapply(z$ty,y,mean)
    plot(a,b)
    abline(lsfit(a,b))
    # Should get same result to within linear transformation if reverse x and y
    w <- areg(y, x, xtype='c', nk=k)
    plot(z$ty, w$tx)
    title(paste('R2=',format(w$rsquared)))
    abline(lsfit(z$ty, w$tx))
  }
}

par(mfrow=c(2,2))
# Example where one category in y differs from others but only in variance of x
n <- 50
y <- sample(1:5,n,TRUE)
x <- rnorm(n)
x[y==1] <- rnorm(sum(y==1), 0, 5)
z <- areg(x,y,xtype='l',ytype='c')
z
plot(z)
z <- areg(x,y,ytype='c')
z
plot(z)

## Not run: 
# Examine overfitting when true transformations are linear
par(mfrow=c(4,3))
for(n in c(200,2000)) {
  x <- rnorm(n); y <- rnorm(n) + x
  for(nk in c(0,3,5)) {
    z <- areg(x, y, nk=nk, crossval=10, B=100)
    print(z)
    plot(z)
    title(paste('n=',n))
  }
}
par(mfrow=c(1,1))

# Underfitting when true transformation is quadratic but overfitting
# when y is allowed to be transformed
set.seed(49)
n <- 200
x <- rnorm(n); y <- rnorm(n) + .5*x^2
#areg(x, y, nk=0, crossval=10, B=100)
#areg(x, y, nk=4, ytype='l', crossval=10, B=100)
z <- areg(x, y, nk=4) #, crossval=10, B=100)
z
# Plot x vs. predicted value on original scale.  Since y-transform is
# not monotonic, there are multiple y-inverses
xx <- seq(-3.5,3.5,length=1000)
yhat <- predict(z, xx, type='fitted')
plot(x, y, xlim=c(-3.5,3.5))
for(j in 1:ncol(yhat)) lines(xx, yhat[,j], col=j)
# Plot a random sample of possible y inverses
yhats <- predict(z, xx, type='fitted', what='sample')
points(xx, yhats, pch=2)

## End(Not run)

# True transformation of x1 is quadratic, y is linear
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n); y <- rnorm(n) + x1^2
z <- areg(cbind(x1,x2),y,xtype=c('s','l'),nk=3)
par(mfrow=c(2,2))
plot(z)

# y transformation is inverse quadratic but areg gets the same answer by
# making x1 quadratic
n <- 5000
x1 <- rnorm(n); x2 <- rnorm(n); y <- (x1 + rnorm(n))^2
z <- areg(cbind(x1,x2),y,nk=5)
par(mfrow=c(2,2))
plot(z)

# Overfit 20 predictors when no true relationships exist
n <- 1000
x <- matrix(runif(n*20),n,20)
y <- rnorm(n)
z <- areg(x, y, nk=5)  # add crossval=4 to expose the problem

# Test predict function
n <- 50
x <- rnorm(n)
y <- rnorm(n) + x
g <- sample(1:3, n, TRUE)
z <- areg(cbind(x,g),y,xtype=c('s','c'))
range(predict(z, cbind(x,g)) - z$linear.predictors)

Multiple Imputation using Additive Regression, Bootstrapping, and Predictive Mean Matching

Description

The transcan function creates flexible additive imputation models but provides only an approximation to true multiple imputation as the imputation models are fixed before all multiple imputations are drawn. This ignores variability caused by having to fit the imputation models. aregImpute takes all aspects of uncertainty in the imputations into account by using the bootstrap to approximate the process of drawing predicted values from a full Bayesian predictive distribution. Different bootstrap resamples are used for each of the multiple imputations, i.e., for the ith imputation of a sometimes missing variable, i=1,2,... n.impute, a flexible additive model is fitted on a sample with replacement from the original data and this model is used to predict all of the original missing and non-missing values for the target variable.

areg is used to fit the imputation models. By default, linearity is assumed for target variables (variables being imputed) and nk=3 knots are assumed for continuous predictors transformed using restricted cubic splines. If nk is three or greater and tlinear is set to FALSE, areg simultaneously finds transformations of the target variable and of all of the predictors, to get a good fit assuming additivity, maximizing R^2, using the same canonical correlation method as transcan. Flexible transformations may be overridden for specific variables by specifying the identity transformation for them. When a categorical variable is being predicted, the flexible transformation is Fisher's optimum scoring method. Nonlinear transformations for continuous variables may be nonmonotonic. If nk is a vector, areg's bootstrap and crossval=10 options will be used to help find the optimum validating value of nk over values of that vector, at the last imputation iteration. For the imputations, the minimum value of nk is used.

Instead of defaulting to taking random draws from fitted imputation models using random residuals as is done by transcan, aregImpute by default uses predictive mean matching with optional weighted probability sampling of donors rather than using only the closest match. Predictive mean matching works for binary, categorical, and continuous variables without the need for iterative maximum likelihood fitting for binary and categorical variables, and without the need for computing residuals or for curtailing imputed values to be in the range of actual data. Predictive mean matching is especially attractive when the variable being imputed is also being transformed automatically. Constraints may be placed on variables being imputed with predictive mean matching, e.g., a missing hospital discharge date may be required to be imputed from a donor observation whose discharge date is before the recipient subject's first post-discharge visit date. See Details below for more information about the algorithm. A "regression" method is also available that is similar to that used in transcan. This option should be used when mechanistic missingness requires the use of extrapolation during imputation.

A print method summarizes the results, and a plot method plots distributions of imputed values. Typically, fit.mult.impute will be called after aregImpute.

If a target variable is transformed nonlinearly (i.e., if nk is greater than zero and tlinear is set to FALSE) and the estimated target variable transformation is non-monotonic, imputed values are not unique. When type='regression', a random choice of possible inverse values is made.

The reformM function provides two ways of recreating a formula to give to aregImpute by reordering the variables in the formula. This is a modified version of a function written by Yong Hao Pua. One can specify nperm to obtain a list of nperm formulas in which the variables are randomly permuted. The list is converted to a single ordinary formula if nperm=1. If nperm is omitted, variables are sorted in descending order of the number of NAs. reformM also prints a recommended number of multiple imputations to use, which is the larger of 5 and the percent of incomplete observations.
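The ordering rule and the recommended number of imputations described above can be sketched in base R. This is only an illustration of the description, not reformM's own code: the data frame d and the object names are made up for the example, and rounding the percentage up with ceiling() is an assumption.

```r
# Hypothetical data: x1 complete, x2 missing in rows 1-10, x3 missing in rows 5-15
set.seed(2)
d <- data.frame(x1 = runif(50),
                x2 = c(rep(NA, 10), runif(40)),
                x3 = c(runif(4), rep(NA, 11), runif(35)))

# Count NAs per variable and sort the variables in descending order of that count
nmiss <- sort(colSums(is.na(d)), decreasing = TRUE)
ordered_formula <- reformulate(names(nmiss))

# Percent of observations having at least one NA
pct_incomplete <- 100 * sum(!complete.cases(d)) / nrow(d)

# Assumed rule: at least 5 imputations, otherwise the percent incomplete
n_recommended <- max(5, ceiling(pct_incomplete))

print(ordered_formula)
print(n_recommended)
```

The resulting one-sided formula has the same shape as the first argument of aregImpute.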

Usage

aregImpute(formula, data, subset, n.impute=5, group=NULL,
           nk=3, tlinear=TRUE, type=c('pmm','regression','normpmm'),
           pmmtype=1, match=c('weighted','closest','kclosest'),
           kclosest=3, fweighted=0.2,
           curtail=TRUE, constraint=NULL,
           boot.method=c('simple', 'approximate bayesian'),
           burnin=3, x=FALSE, pr=TRUE, plotTrans=FALSE, tolerance=NULL, B=75)

## S3 method for class 'aregImpute'
print(x, digits=3, ...)

## S3 method for class 'aregImpute'
plot(x, nclass=NULL, type=c('ecdf','hist'),
     datadensity=c("hist", "none", "rug", "density"),
     diagnostics=FALSE, maxn=10, ...)

reformM(formula, data, nperm)

Arguments

formula

an S model formula. You can specify restrictions for transformations of variables. The function automatically determines which variables are categorical (i.e., factor, category, or character vectors). Binary variables are automatically restricted to be linear. Force linear transformations of continuous variables by enclosing variables by the identity function (I()). It is recommended that factor() or as.factor() do not appear in the formula but instead variables be converted to factors as needed and stored in the data frame. That way imputations for factor variables (done using impute.transcan for example) will be correct. Currently reformM does not handle variables that are enclosed in functions such as I().

x

an object created by aregImpute. For aregImpute, set x to TRUE to save the data matrix containing the final (number n.impute) imputations in the result. This is needed if you want to later do out-of-sample imputation. Categorical variables are coded as integers in this matrix.

data

input raw data

subset

These may also be specified. You may not specify na.action as na.retain is always used.

n.impute

number of multiple imputations. n.impute=5 is frequently recommended but 10 or more doesn't hurt.

group

a character or factor variable the same length as the number of observations in data and containing no NAs. When group is present, causes a bootstrap sample of the observations corresponding to non-NAs of a target variable to have the same frequency distribution of group as that in the non-NAs of the original sample. This can handle k-sample problems as well as lower the chance that a bootstrap sample will have a missing cell when the original cell frequency was low.
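A bootstrap sample that preserves the frequency distribution of a grouping variable can be sketched by resampling within each level. This is a base R illustration of the idea only; strat_boot and the vector g are hypothetical names, and the sketch assumes the target variable is observed in every row.

```r
set.seed(1)
g <- factor(c(rep("a", 3), rep("b", 12), rep("c", 25)))

# Resample row indices with replacement separately within each level of g
strat_boot <- function(g) {
  idx_by_group <- split(seq_along(g), g)
  unlist(lapply(idx_by_group,
                function(i) i[sample.int(length(i), replace = TRUE)]),
         use.names = FALSE)
}

b <- strat_boot(g)
table(g[b])   # group counts in the bootstrap sample
table(g)      # group counts in the original sample
```

Because each level is resampled to its original size, a rare level such as "a" cannot vanish from the bootstrap sample.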

nk

number of knots to use for continuous variables. When both the target variable and the predictors are having optimum transformations estimated, there is more instability than with normal regression so the complexity of the model should decrease more sharply as the sample size decreases. Hence set nk to 0 (to force linearity for non-categorical variables) or 3 (minimum number of knots possible with a linear tail-restricted cubic spline) for small sample sizes. Simulated problems as in the examples section can assist in choosing nk. Set nk to a vector to get bootstrap-validated and 10-fold cross-validated R^2 and mean and median absolute prediction errors for imputing each sometimes-missing variable, with nk ranging over the given vector. The errors are on the original untransformed scale. The mean absolute error is the recommended basis for choosing the number of knots (or linearity).

tlinear

set to FALSE to allow a target variable (variable being imputed) to have a nonlinear left-hand-side transformation when nk is 3 or greater

type

The default is "pmm" for predictive mean matching, which is a more nonparametric approach that will work for categorical as well as continuous predictors. Alternatively, use "regression" when all variables that are sometimes missing are continuous and the missingness mechanism is such that entire intervals of population values are unobserved. See the Details section for more information. Another method, type="normpmm", only works when variables containing NAs are continuous and tlinear is TRUE (the default), meaning that the variable being imputed is not transformed when it is on the left hand model side. normpmm assumes that the imputation regression parameter estimates are multivariately normally distributed and that the residual variance has a scaled chi-squared distribution. For each imputation a random draw of the estimates is taken and a random draw from sigma is combined with those to get a random draw from the posterior predicted value distribution. Predictive mean matching is then done matching these predicted values from incomplete observations with predicted values from complete potential donor observations, where the latter predictions are based on the imputation model least squares parameter estimates and not on random draws from the posterior. For the plot method, specify type="hist" to draw histograms of imputed values with rug plots at the top, or type="ecdf" (the default) to draw empirical CDFs with spike histograms at the bottom.

pmmtype

type of matching to be used for predictive mean matching when type="pmm". pmmtype=2 means that predicted values for both target incomplete and complete observations come from a fit from the same bootstrap sample. pmmtype=1, the default, means that predicted values for complete observations are based on additive regression fits on original complete observations (using last imputations for non-target variables as with the other methods), and using fits on a bootstrap sample to get predicted values for missing target variables. See van Buuren (2012) section 3.4.2 where pmmtype=1 is said to work much better when the number of variables is small. pmmtype=3 means that complete observation predicted values come from a bootstrap sample fit whereas target incomplete observation predicted values come from a sample with replacement from the bootstrap fit (approximate Bayesian bootstrap).

match

Defaults to match="weighted" to do weighted multinomial probability sampling using the tricube function (similar to lowess) as the weights. The argument of the tricube function is the absolute difference in transformed predicted values of all the donors and of the target predicted value, divided by a scaling factor. The scaling factor in the tricube function is fweighted times the mean absolute difference between the target predicted value and all the possible donor predicted values. Set match="closest" to find as the donor the observation having the closest predicted transformed value, even if that same donor is found repeatedly. Set match="kclosest" to use a slower implementation that finds, after jittering the complete case predicted values, the kclosest complete cases on the target variable being imputed, then takes a random sample of one of these kclosest cases.
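The weighted matching rule described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code used inside aregImpute: tricube and draw_donor are hypothetical helpers, the predicted values are invented, and the two fallbacks (a uniform draw when all distances are zero, the closest donor when every weight is zero) are assumptions of the sketch.

```r
# Tricube weight: (1 - u^3)^3 for u < 1, zero otherwise
tricube <- function(u) (1 - pmin(u, 1)^3)^3

# Draw one donor index for a single target predicted value
draw_donor <- function(target, donors, fweighted = 0.2) {
  dist  <- abs(donors - target)
  scale <- fweighted * mean(dist)          # scaling factor described above
  if (scale == 0) return(sample.int(length(donors), 1))  # assumed fallback
  w <- tricube(dist / scale)
  if (sum(w) == 0) return(which.min(dist))               # assumed fallback
  sample.int(length(donors), 1, prob = w)  # multinomial draw with tricube weights
}

set.seed(4)
donors <- c(0.10, 0.48, 0.50, 0.53, 0.90, 1.40)  # donor predicted values
target <- 0.51                                    # recipient predicted value

dist <- abs(donors - target)
w <- tricube(dist / (0.2 * mean(dist)))
round(w, 3)

picks <- replicate(200, draw_donor(target, donors))
table(picks)
```

Donors whose distance exceeds the scaling factor get weight zero, so a smaller fweighted restricts sampling to donors nearer the target.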

kclosest

see match

fweighted

Smoothing parameter (multiple of mean absolute difference) used when match="weighted", with a default value of 0.2. Set fweighted to a number between 0.02 and 0.2 to force the donor to have a predicted value closer to the target, and set fweighted to larger values (but seldom larger than 1.0) to allow donor values to be less tightly matched. See the examples below to learn how to study the relationship between fweighted and the standard deviation of multiple imputations within individuals.

curtail

applies if type='regression', causing imputed values to be curtailed at the observed range of the target variable. Set to FALSE to allow extrapolation outside the data range.

constraint

for predictive mean matching constraint is a named list specifying R expression()s encoding constraints on which donor observations are allowed to be used, based on variables that are not missing, i.e., based on donor observations and/or recipient observations as long as the target variable being imputed is not used for the recipients. The expressions must evaluate to a logical vector with no NAs and whose length is the number of rows in the donor observations. The expressions refer to donor observations by prefixing variable names by d$, and to a single recipient observation by prefixing variable names by r$.
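To see what such an expression produces, it can be evaluated by hand with d bound to a set of donor rows and r bound to one recipient row. The sketch below uses base R eval() on made-up data; how aregImpute evaluates the expressions internally is not shown here, and donors, recipient, and con are hypothetical names.

```r
# Hypothetical donor rows (target x3 observed) and one recipient row (x3 missing)
donors    <- data.frame(x1 = c(0.9, 0.4, 0.7), x3 = c(0.8, 0.3, 0.2))
recipient <- data.frame(x1 = 0.5, x3 = NA)

# Constraint: a donor's x3 must be below the recipient's x1
con <- expression(d$x3 < r$x1)

# Evaluate with d bound to the donors and r to the single recipient
allowed <- eval(con, list(d = donors, r = recipient))
allowed
```

The result is a logical vector with one element per donor row and no NAs, which is the form the constraint argument requires.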

boot.method

By default, simple bootstrapping is used in which the target variable is predicted using a sample with replacement from the observations with non-missing target variable. Specify boot.method='approximate bayesian' to build the imputation models from a sample with replacement from a sample with replacement of the observations with non-missing targets. Preliminary simulations have shown this results in good confidence coverage of the final model parameters when type='regression' is used. Not implemented when group is used.
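The difference between the two resampling schemes can be illustrated with row indices alone. This is a base R sketch of the description above, with a made-up target vector and a hypothetical resample helper; it is not taken from the aregImpute source.

```r
set.seed(7)
target <- c(2.1, NA, 3.5, 4.0, NA, 1.2, 5.3, 2.8)
obs <- which(!is.na(target))     # rows with a non-missing target

# Sample the supplied indices with replacement, keeping the same size
resample <- function(i) i[sample.int(length(i), replace = TRUE)]

simple_boot <- resample(obs)            # one level of resampling
abb_boot    <- resample(resample(obs))  # resample of a resample

simple_boot
abb_boot
```

The second level of resampling tends to leave fewer distinct rows in the sample, which adds variability to the fitted imputation model.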

burnin

aregImpute does burnin + n.impute iterations of the entire modeling process. The first burnin imputations are discarded. More burn-in iterations may be required when multiple variables are missing on the same observations. When only one variable is missing, no burn-ins are needed and burnin is set to zero if unspecified.

pr

set to FALSE to suppress printing of iteration messages

plotTrans

set to TRUE to plot ace or avas transformations for each variable for each of the multiple imputations. This is useful for determining whether transformations are reasonable. If transformations are too noisy or have long flat sections (resulting in "lumps" in the distribution of imputed values), it may be advisable to place restrictions on the transformations (monotonicity or linearity).

tolerance

singularity criterion; list the source code in the lm.fit.qr.bare function for details

B

number of bootstrap resamples to use if nk is a vector

digits

number of digits for printing

nclass

number of bins to use in drawing histogram

datadensity

see Ecdf

diagnostics

Specify diagnostics=TRUE to draw plots of imputed values against sequential imputation numbers, separately for each missing observation and variable.

maxn

Maximum number of observations shown for diagnostics. Default is maxn=10, which limits the number of observations plotted to at most the first 10.

nperm

number of random formula permutations for reformM; omit to sort variables by descending missing count.

...

other arguments that are ignored

Details

The sequence of steps used by the aregImpute algorithm is the following.
(1) For each variable containing m NAs where m > 0, initialize the NAs to values from a random sample (without replacement if a sufficient number of non-missing values exist) of size m from the non-missing values.
(2) For burnin+n.impute iterations do the following steps. The first burnin iterations provide a burn-in, and imputations are saved only from the last n.impute iterations.
(3) For each variable containing any NAs, draw a sample with replacement from the observations in the entire dataset in which the current variable being imputed is non-missing. Fit a flexible additive model to predict this target variable while finding the optimum transformation of it (unless the identity transformation is forced). Use this fitted flexible model to predict the target variable in all of the original observations. Impute each missing value of the target variable with the observed value whose predicted transformed value is closest to the predicted transformed value of the missing value (if match="closest" and type="pmm"), or use a draw from a multinomial distribution with probabilities derived from distance weights, if match="weighted" (the default).
(4) After these imputations are computed, use these random draw imputations the next time the current target variable is used as a predictor of other sometimes-missing variables.
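Step (3) can be sketched for the simplest case of a single sometimes-missing variable, where no burn-in and no cycling over variables is needed. The sketch below is a deliberately reduced illustration in base R: an ordinary lm fit stands in for the flexible additive model fitted by areg, matching uses the closest donor only, and all object names are hypothetical.

```r
set.seed(11)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n, sd = 0.5)
y[1:15] <- NA                         # target variable with 15 NAs
miss <- is.na(y)
obs  <- which(!miss)

n.impute <- 5
imputed  <- matrix(NA_real_, sum(miss), n.impute)

for (i in seq_len(n.impute)) {
  # Bootstrap the rows in which the target is observed and refit the model
  b   <- obs[sample.int(length(obs), replace = TRUE)]
  fit <- lm(y ~ x, data = data.frame(x = x[b], y = y[b]))

  # Predict the target for every original observation
  pred <- predict(fit, newdata = data.frame(x = x))

  # Closest match: donor is the observed row with the nearest predicted value
  donor <- sapply(which(miss),
                  function(j) obs[which.min(abs(pred[obs] - pred[j]))])
  imputed[, i] <- y[donor]
}

imputed[1:3, ]
```

Every imputed value is an actually observed value of y, which is what keeps predictive mean matching within the range of the data.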

When match="closest", predictive mean matching does not work well when fewer than 3 variables are used to predict the target variable, because many of the multiple imputations for an observation will be identical. In the extreme case of one right-hand-side variable and assuming that only monotonic transformations of left and right-side variables are allowed, every bootstrap resample will give predicted values of the target variable that are monotonically related to predicted values from every other bootstrap resample. The same is true for Bayesian predicted values. This causes predictive mean matching to always match on the same donor observation.

When the missingness mechanism for a variable is so systematic that the distribution of observed values is truncated, predictive mean matching does not work. It will only yield imputed values that are near observed values, so intervals in which no values are observed will not be populated by imputed values. For this case, the only hope is to make regression assumptions and use extrapolation. With type="regression", aregImpute will use linear extrapolation to obtain a (hopefully) reasonable distribution of imputed values. The "regression" option causes aregImpute to impute missing values by adding a random sample of residuals (with replacement if there are more NAs than measured values) on the transformed scale of the target variable. After random residuals are added, predicted random draws are obtained on the original untransformed scale using reverse linear interpolation on the table of original and transformed target values (linear extrapolation when a random residual is large enough to put the random draw prediction outside the range of observed values). The bootstrap is used as with type="pmm" to factor in the uncertainty of the imputation model.

As model uncertainty is high when the transformation of a target variable is unknown, tlinear defaults to TRUE to limit the variance in predicted values when nk is positive.

Value

a list of class "aregImpute" containing the following elements:

call

the function call expression

formula

the formula specified to aregImpute

match

the match argument

fweighted

the fweighted argument

n

total number of observations in input dataset

p

number of variables

na

list of subscripts of observations for which values were originally missing

nna

named vector containing the numbers of missing values in the data

type

vector of types of transformations used for each variable ("s","l","c" for smooth spline, linear, or categorical with dummy variables)

tlinear

value of tlinear parameter

nk

number of knots used for smooth transformations

cat.levels

list containing character vectors specifying the levels of categorical variables

df

degrees of freedom (number of parameters estimated) for each variable

n.impute

number of multiple imputations per missing value

imputed

a list containing matrices of imputed values in the same format as those created by transcan. Categorical variables are coded using their integer codes. Variables having no missing values will have NULL matrices in the list.

x

if x is TRUE, the original data matrix with integer codes for categorical variables

rsq

for the last round of imputations, a vector containing the R-squares with which each sometimes-missing variable could be predicted from the others by ace or avas.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

References

van Buuren, Stef. Flexible Imputation of Missing Data. Chapman & Hall/CRC, Boca Raton FL, 2012.

Little R, An H. Robust likelihood-based analysis of multivariate data with missing values. Statistica Sinica 14:949-968, 2004.

van Buuren S, Brand JPL, Groothuis-Oudshoorn CGM, Rubin DB. Fully conditional specifications in multivariate imputation. J Stat Comp Sim 72:1049-1064, 2006.

de Groot JAH, Janssen KJM, Zwinderman AH, Moons KGM, Reitsma JB. Multiple imputation to correct for partial verification bias revisited. Stat Med 27:5880-5889, 2008.

Siddique J. Multiple imputation using an iterative hot-deck with distance-based donor selection. Stat Med 27:83-102, 2008.

White IR, Royston P, Wood AM. Multiple imputation using chained equations: Issues and guidance for practice. Stat Med 30:377-399, 2011.

Curnow E, Carpenter JR, Heron JE, et al: Multiple imputation of missing data under missing at random: compatible imputation models are not sufficient to avoid bias if they are mis-specified. J Clin Epi June 9, 2023. DOI:10.1016/j.jclinepi.2023.06.011.

See Also

fit.mult.impute, transcan, areg, naclus, naplot, mice, dotchart3, Ecdf, completer

Examples

# Check that aregImpute can almost exactly estimate missing values when
# there is a perfect nonlinear relationship between two variables
# Fit restricted cubic splines with 4 knots for x1 and x2, linear for x3
set.seed(3)
x1 <- rnorm(200)
x2 <- x1^2
x3 <- runif(200)
m <- 30
x2[1:m] <- NA
a <- aregImpute(~x1+x2+I(x3), n.impute=5, nk=4, match='closest')
a
matplot(x1[1:m]^2, a$imputed$x2)
abline(a=0, b=1, lty=2)

x1[1:m]^2
a$imputed$x2

# Multiple imputation and estimation of variances and covariances of
# regression coefficient estimates accounting for imputation
# Example 1: large sample size, much missing data, no overlap in
# NAs across variables
x1 <- factor(sample(c('a','b','c'),1000,TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(1000,0,2)
x3 <- rnorm(1000)
y  <- x2 + 1*(x1=='c') + .2*x3 + rnorm(1000,0,2)
orig.x1 <- x1[1:250]
orig.x2 <- x2[251:350]
x1[1:250] <- NA
x2[251:350] <- NA
d <- data.frame(x1,x2,x3,y, stringsAsFactors=TRUE)
# Find value of nk that yields best validating imputation models
# tlinear=FALSE means to not force the target variable to be linear
f <- aregImpute(~y + x1 + x2 + x3, nk=c(0,3:5), tlinear=FALSE,
                data=d, B=10) # normally B=75
f
# Try forcing target variable (x1, then x2) to be linear while allowing
# predictors to be nonlinear (could also say tlinear=TRUE)
f <- aregImpute(~y + x1 + x2 + x3, nk=c(0,3:5), data=d, B=10)
f

## Not run: 
# Use 100 imputations to better check against individual true values
f <- aregImpute(~y + x1 + x2 + x3, n.impute=100, data=d)
f
par(mfrow=c(2,1))
plot(f)
modecat <- function(u) {
 tab <- table(u)
 as.numeric(names(tab)[tab==max(tab)][1])
}
table(orig.x1,apply(f$imputed$x1, 1, modecat))
par(mfrow=c(1,1))
plot(orig.x2, apply(f$imputed$x2, 1, mean))
fmi <- fit.mult.impute(y ~ x1 + x2 + x3, lm, f,
                       data=d)
sqrt(diag(vcov(fmi)))
fcc <- lm(y ~ x1 + x2 + x3)
summary(fcc)   # SEs are larger than from mult. imputation

## End(Not run)

## Not run: 
# Example 2: Very discriminating imputation models,
# x1 and x2 have some NAs on the same rows, smaller n
set.seed(5)
x1 <- factor(sample(c('a','b','c'),100,TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(100,0,.4)
x3 <- rnorm(100)
y  <- x2 + 1*(x1=='c') + .2*x3 + rnorm(100,0,.4)
orig.x1 <- x1[1:20]
orig.x2 <- x2[18:23]
x1[1:20] <- NA
x2[18:23] <- NA
#x2[21:25] <- NA
d <- data.frame(x1,x2,x3,y, stringsAsFactors=TRUE)
n <- naclus(d)
plot(n); naplot(n)  # Show patterns of NAs
# 100 imputations to study them; normally use 5 or 10
f  <- aregImpute(~y + x1 + x2 + x3, n.impute=100, nk=0, data=d)
par(mfrow=c(2,3))
plot(f, diagnostics=TRUE, maxn=2)
# Note: diagnostics=TRUE makes graphs similar to those made by:
# r <- range(f$imputed$x2, orig.x2)
# for(i in 1:6) {  # use 1:2 to mimic maxn=2
#   plot(1:100, f$imputed$x2[i,], ylim=r,
#        ylab=paste("Imputations for Obs.",i))
#   abline(h=orig.x2[i],lty=2)
# }

table(orig.x1,apply(f$imputed$x1, 1, modecat))
par(mfrow=c(1,1))
plot(orig.x2, apply(f$imputed$x2, 1, mean))

fmi <- fit.mult.impute(y ~ x1 + x2, lm, f,
                       data=d)
sqrt(diag(vcov(fmi)))
fcc <- lm(y ~ x1 + x2)
summary(fcc)   # SEs are larger than from mult. imputation

## End(Not run)

## Not run: 
# Study relationship between smoothing parameter for weighting function
# (multiplier of mean absolute distance of transformed predicted
# values, used in tricube weighting function) and standard deviation
# of multiple imputations.  SDs are computed from average variances
# across subjects.  match="closest" same as match="weighted" with
# small value of fweighted.
# This example also shows problems with predicted mean
# matching almost always giving the same imputed values when there is
# only one predictor (regression coefficients change over multiple
# imputations but predicted values are virtually 1-1 functions of each
# other)

set.seed(23)
x <- runif(200)
y <- x + runif(200, -.05, .05)
r <- resid(lsfit(x,y))
rmse <- sqrt(sum(r^2)/(200-2))   # sqrt of residual MSE

y[1:20] <- NA
d <- data.frame(x,y)
f <- aregImpute(~ x + y, n.impute=10, match='closest', data=d)
# As an aside here is how to create a completed dataset for imputation
# number 3 as fit.mult.impute would do automatically.  In this degenerate
# case changing 3 to 1-2,4-10 will not alter the results.
imputed <- impute.transcan(f, imputation=3, data=d, list.out=TRUE,
                           pr=FALSE, check=FALSE)
sd <- sqrt(mean(apply(f$imputed$y, 1, var)))

ss <- c(0, .01, .02, seq(.05, 1, length=20))
sds <- ss; sds[1] <- sd

for(i in 2:length(ss)) {
  f <- aregImpute(~ x + y, n.impute=10, fweighted=ss[i])
  sds[i] <- sqrt(mean(apply(f$imputed$y, 1, var)))
}

plot(ss, sds, xlab='Smoothing Parameter', ylab='SD of Imputed Values',
     type='b')
abline(v=.2,  lty=2)  # default value of fweighted
abline(h=rmse, lty=2)  # root MSE of residuals from linear regression

## End(Not run)

## Not run: 
# Do a similar experiment for the Titanic dataset
getHdata(titanic3)
h <- lm(age ~ sex + pclass + survived, data=titanic3)
rmse <- summary(h)$sigma
set.seed(21)
f <- aregImpute(~ age + sex + pclass + survived, n.impute=10,
                data=titanic3, match='closest')
sd <- sqrt(mean(apply(f$imputed$age, 1, var)))

ss <- c(0, .01, .02, seq(.05, 1, length=20))
sds <- ss; sds[1] <- sd

for(i in 2:length(ss)) {
  f <- aregImpute(~ age + sex + pclass + survived, data=titanic3,
                  n.impute=10, fweighted=ss[i])
  sds[i] <- sqrt(mean(apply(f$imputed$age, 1, var)))
}

plot(ss, sds, xlab='Smoothing Parameter', ylab='SD of Imputed Values',
     type='b')
abline(v=.2,   lty=2)  # default value of fweighted
abline(h=rmse, lty=2)  # root MSE of residuals from linear regression

## End(Not run)

set.seed(2)
d <- data.frame(x1=runif(50), x2=c(rep(NA, 10), runif(40)),
                x3=c(runif(4), rep(NA, 11), runif(35)))
reformM(~ x1 + x2 + x3, data=d)
reformM(~ x1 + x2 + x3, data=d, nperm=2)
# Give result or one of the results as the first argument to aregImpute

# Constrain imputed values for two variables
# Require imputed values for x2 to be above 0.2
# Assume x1 is never missing and require imputed values for
# x3 to be less than the recipient's value of x1
a <- aregImpute(~ x1 + x2 + x3, data=d,
                constraint=list(x2 = expression(d$x2 > 0.2),
                                x3 = expression(d$x3 < r$x1)))
a

Bivariate Summaries Computed Separately by a Series of Predictors

Description

biVar is a generic function that accepts a formula and usual data, subset, and na.action parameters plus a list statinfo that specifies a function of two variables to compute along with information about labeling results for printing and plotting. The function is called separately with each right hand side variable and the same left hand variable. The result is a matrix of bivariate statistics and the statinfo list that drives printing and plotting. The plot method draws a dot plot with x-axis values by default sorted in order of one of the statistics computed by the function.

spearman2 computes the square of Spearman's rho rank correlation and a generalization of it in which x can relate non-monotonically to y. This is done by computing the Spearman multiple rho-squared between (rank(x), rank(x)^2) and y. When x is categorical, a different kind of Spearman correlation used in the Kruskal-Wallis test is computed (and spearman2 can do the Kruskal-Wallis test). This is done by computing the ordinary multiple R^2 between k-1 dummy variables and rank(y), where x has k categories. x can also be a formula, in which case each predictor is correlated separately with y, using non-missing observations for that predictor. biVar is used to do the looping and bookkeeping. By default the plot shows the adjusted rho^2, using the same formula used for the ordinary adjusted R^2. The F test uses the unadjusted R^2.
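The quadratic rank generalization can be reproduced approximately with ordinary least squares on ranks. The sketch below uses base R only and assumes that the "Spearman multiple rho-squared" is the R^2 from regressing rank(y) on rank(x) and rank(x)^2; it is meant to illustrate the idea, not to replicate spearman2's output exactly.

```r
x <- c(-2, -1, 0, 1, 2)
y <- c(4, 1, 0, 1, 4)          # y = x^2: a perfect but non-monotonic relationship

rx <- rank(x)                  # rank() assigns midranks to ties by default
ry <- rank(y)

# Ordinary squared Spearman correlation (analogous to p=1)
rho2_linear <- cor(rx, ry)^2

# Quadratic generalization (analogous to p=2): R^2 from rank(y) on rank(x), rank(x)^2
rho2_quadratic <- summary(lm(ry ~ rx + I(rx^2)))$r.squared

c(linear = rho2_linear, quadratic = rho2_quadratic)
```

Here the ordinary rho^2 misses the U-shaped relationship entirely, while the quadratic version detects most of it.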

spearman computes Spearman's rho on non-missing values of two variables. spearman.test is a simple version of spearman2.default.

chiSquare is set up like spearman2 except it is intended for a categorical response variable. Separate Pearson chi-square tests are done for each predictor, with optional collapsing of infrequent categories. Numeric predictors having more than g levels are categorized into g quantile groups. chiSquare uses biVar.

Usage

biVar(formula, statinfo, data=NULL, subset=NULL,
      na.action=na.retain, exclude.imputed=TRUE, ...)

## S3 method for class 'biVar'
print(x, ...)

## S3 method for class 'biVar'
plot(x, what=info$defaultwhat,
     sort.=TRUE, main, xlab,
     vnames=c('names','labels'), ...)

spearman2(x, ...)

## Default S3 method:
spearman2(x, y, p=1, minlev=0, na.rm=TRUE, exclude.imputed=na.rm, ...)

## S3 method for class 'formula'
spearman2(formula, data=NULL,
          subset, na.action=na.retain, exclude.imputed=TRUE, ...)

spearman(x, y)

spearman.test(x, y, p=1)

chiSquare(formula, data=NULL, subset=NULL, na.action=na.retain,
          exclude.imputed=TRUE, ...)

Arguments

formula

a formula with a single left side variable

statinfo

see spearman2.formula or chiSquare code

data, subset, na.action

the usual options for models. Default for na.action is to retain all values, NA or not, so that NAs can be deleted in only a pairwise fashion.

exclude.imputed

set to FALSE to include imputed values (created by impute) in the calculations.

...

other arguments that are passed to the function used to compute the bivariate statistics or to dotchart3 for plot.

na.rm

logical; delete NA values?

x

a numeric matrix with at least 5 rows and at least 2 columns (if y is absent). For spearman2, the first argument may be a vector of any type, including character or factor. The first argument may also be a formula, in which case spearman2.formula is invoked and each predictor on the right hand side of the formula is correlated separately with the response variable. For print or plot, x is an object produced by biVar. For spearman and spearman.test x is a numeric vector, as is y. For chiSquare, x is a formula.

y

a numeric vector

p

for numeric variables, specifies the order of the Spearman rho^2 to use. The default is p=1 to compute the ordinary rho^2. Use p=2 to compute the quadratic rank generalization to allow non-monotonicity. p is ignored for categorical predictors.

minlev

minimum relative frequency that a level of a categorical predictor should have before it is pooled with other categories (see combine.levels) in spearman2 and chiSquare (in which case it also applies to the response). The default, minlev=0 causes no pooling.

what

specifies which statistic to plot. Possibilities include the column names that appear when the print method is used.

sort.

set sort.=FALSE to suppress sorting variables by the statistic being plotted

main

main title for plot. Default title shows the name of the response variable.

xlab

x-axis label. Default constructed from what.

vnames

set to "labels" to use variable labels in place of names for plotting. If a variable does not have a label the name is always used.

Details

Uses midranks in case of ties, as described by Hollander and Wolfe. P-values for Spearman, Wilcoxon, or Kruskal-Wallis tests are approximated by using the t or F distributions.

Value

spearman2.default (the function that is called for a single x, i.e., when there is no formula) returns a vector of statistics for the variable. biVar, spearman2.formula, and chiSquare return a matrix with rows corresponding to predictors.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

References

Hollander M. and Wolfe D.A. (1973). Nonparametric Statistical Methods. New York: Wiley.

Press WH, Flannery BP, Teukolsky SA, Vetterling WT (1988): Numerical Recipes in C. Cambridge: Cambridge University Press.

See Also

combine.levels, varclus, dotchart3, impute, chisq.test, cut2.

Examples

x <- c(-2, -1, 0, 1, 2)
y <- c(4,   1, 0, 1, 4)
z <- c(1,   2, 3, 4, NA)
v <- c(1,   2, 3, 4, 5)

spearman2(x, y)
plot(spearman2(z ~ x + y + v, p=2))

f <- chiSquare(z ~ x + y + v)
f

Confidence Intervals for Binomial Probabilities

Description

Produces 1-alpha confidence intervals for binomial probabilities.

Usage

binconf(x, n, alpha=0.05,
        method=c("wilson","exact","asymptotic","all"),
        include.x=FALSE, include.n=FALSE, return.df=FALSE)

Arguments

x

vector containing the number of "successes" for binomial variates

n

vector containing the numbers of corresponding observations

alpha

probability of a type I error, so confidence coefficient = 1-alpha

method

character string specifying which method to use. The "all" method only works when x and n are length 1. The "exact" method uses the F distribution to compute exact (based on the binomial cdf) intervals; the "wilson" interval is score-test-based; and the "asymptotic" is the text-book, asymptotic normal interval. Following Agresti and Coull, the Wilson interval is to be preferred and so is the default.

include.x

logical flag to indicate whether x should be included in the returned matrix or data frame

include.n

logical flag to indicate whether n should be included in the returned matrix or data frame

return.df

logical flag to indicate that a data frame rather than a matrix be returned

Value

a matrix or data.frame containing the computed intervals and, optionally, x and n.

Author(s)

Rollin Brant, Modified by Frank Harrell and
Brad Biggerstaff
Centers for Disease Control and Prevention
National Center for Infectious Diseases
Division of Vector-Borne Infectious Diseases
P.O. Box 2087, Fort Collins, CO, 80522-2087, USA
bkb5@cdc.gov

References

A. Agresti and B.A. Coull, Approximate is better than "exact" for interval estimation of binomial proportions, American Statistician, 52:119–126, 1998.

R.G. Newcombe, Logit confidence intervals and the inverse sinh transformation, American Statistician, 55:200–202, 2001.

L.D. Brown, T.T. Cai and A. DasGupta, Interval estimation for a binomial proportion (with discussion), Statistical Science, 16:101–133, 2001.

Examples

binconf(0:10,10,include.x=TRUE,include.n=TRUE)
binconf(46,50,method="all")
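To make the default method concrete, the Wilson score interval can be written out in base R. This is a sketch of the standard score-interval formula under the usual normal approximation, not binconf's own source; the function name wilson_sketch is hypothetical, and the column names are chosen here only for readability.

```r
# Hypothetical helper: Wilson (score) interval for x successes in n trials
wilson_sketch <- function(x, n, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  p <- x / n
  denom  <- 1 + z^2 / n
  center <- (p + z^2 / (2 * n)) / denom
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
  cbind(PointEst = p, Lower = center - half, Upper = center + half)
}

ci <- wilson_sketch(46, 50)
# stats::prop.test without continuity correction reports the same
# score interval, which gives an independent check
ref <- prop.test(46, 50, correct = FALSE)$conf.int
```

Unlike the asymptotic interval, the Wilson limits always stay inside [0, 1], even when x is 0 or n.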

Bootstrap Kaplan-Meier Estimates

Description

Bootstraps Kaplan-Meier estimate of the probability of survival to at least a fixed time (times variable) or the estimate of the q quantile of the survival distribution (e.g., median survival time, the default).

Usage

bootkm(S, q=0.5, B=500, times, pr=TRUE)

Arguments

S

a Surv object for possibly right-censored survival time

q

quantile of survival time, default is 0.5 for median

B

number of bootstrap repetitions (default=500)

times

time vector (currently only a scalar is allowed) at which to compute survival estimates. You may specify only one of q and times, and if times is specified q is ignored.

pr

set to FALSE to suppress printing the iteration number every 10 iterations

Details

bootkm uses Therneau's survfitKM function to efficiently compute Kaplan-Meier estimates.

Value

a vector containing B bootstrap estimates

Side Effects

updates .Random.seed, and, if pr=TRUE, prints progress of simulations

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com

References

Akritas MG (1986): Bootstrapping the Kaplan-Meier estimator. JASA 81:1032–1038.

See Also

survfit, Surv, Survival.cph, Quantile.cph

Examples

# Compute 0.95 nonparametric confidence interval for the difference in
# median survival time between females and males (two-sample problem)
set.seed(1)
library(survival)
S <- Surv(runif(200))      # no censoring
sex <- c(rep('female',100),rep('male',100))
med.female <- bootkm(S[sex=='female',], B=100) # normally B=500
med.male   <- bootkm(S[sex=='male',],   B=100)
describe(med.female-med.male)
quantile(med.female-med.male, c(.025,.975), na.rm=TRUE)
# na.rm needed because some bootstrap estimates of median survival
# time may be missing when a bootstrap sample did not include the
# longer survival times
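The resampling idea behind bootkm can be sketched with the survival package alone. This is an illustrative, slower version that calls survfit on each bootstrap sample; it is not bootkm's implementation, and boot_km_median is a hypothetical name.

```r
library(survival)

# Hypothetical helper: bootstrap the Kaplan-Meier median by resampling
# subjects (rows of the Surv object) with replacement
boot_km_median <- function(S, B = 100) {
  n <- nrow(S)
  est <- numeric(B)
  for (b in seq_len(B)) {
    idx <- sample.int(n, n, replace = TRUE)
    Sb  <- S[idx, ]
    fit <- survfit(Sb ~ 1)
    # NA is returned when the curve never drops to 0.5
    est[b] <- as.numeric(quantile(fit, probs = 0.5, conf.int = FALSE))
  }
  est
}

set.seed(1)
S    <- Surv(runif(200))               # no censoring
meds <- boot_km_median(S, B = 50)
ci   <- quantile(meds, c(.025, .975), na.rm = TRUE)
```

The percentile limits in ci play the same role as the quantile() call in the example above, applied here to a single group.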

Power and Sample Size for Two-Sample Binomial Test

Description

Uses method of Fleiss, Tytun, and Ury (but without the continuity correction) to estimate the power (or the sample size to achieve a given power) of a two-sided test for the difference in two proportions. The two sample sizes are allowed to be unequal, but for bsamsize you must specify the fraction of observations in group 1. For power calculations, one probability (p1) must be given, and either the other probability (p2), an odds.ratio, or a percent.reduction must be given. For bpower or bsamsize, any or all of the arguments may be vectors, in which case they return a vector of powers or sample sizes. All vector arguments must have the same length.

Given p1, p2, ballocation uses the method of Brittain and Schlesselman to compute the optimal fraction of observations to be placed in group 1 that either (1) minimize the variance of the difference in two proportions, (2) minimize the variance of the ratio of the two proportions, (3) minimize the variance of the log odds ratio, or (4) maximize the power of the 2-tailed test for differences. For (4) the total sample size must be given, or the fraction optimizing the power is not returned. The fraction for (3) is one minus the fraction for (1).

bpower.sim estimates power by simulations, in minimal time. By using bpower.sim you can see that the formulas without any continuity correction are quite accurate, and that the power of a continuity-corrected test is significantly lower. That's why no continuity corrections are implemented here.

Usage

bpower(p1, p2, odds.ratio, percent.reduction,
       n, n1, n2, alpha=0.05)
bsamsize(p1, p2, fraction=.5, alpha=.05, power=.8)
ballocation(p1, p2, n, alpha=.05)
bpower.sim(p1, p2, odds.ratio, percent.reduction,
           n, n1, n2,
           alpha=0.05, nsim=10000)

Arguments

p1

population probability in the group 1

p2

probability for group 2

odds.ratio

odds ratio to detect

percent.reduction

percent reduction in risk to detect

n

total sample size over the two groups. If you omit this for ballocation, the fraction which optimizes power will not be returned.

n1

sample size in group 1

n2

sample size in group 2. For bpower, if n is given, n1 and n2 are set to n/2.

alpha

type I assertion probability

fraction

fraction of observations in group 1

power

the desired probability of detecting a difference

nsim

number of simulations of binomial responses

Details

For bpower.sim, all arguments must be of length one.

Value

for bpower, the power estimate; for bsamsize, a vector containing the sample sizes in the two groups; for ballocation, a vector with 4 fractions of observations allocated to group 1, optimizing the four criteria mentioned above. For bpower.sim, a vector with three elements is returned, corresponding to the simulated power and its lower and upper 0.95 confidence limits.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

References

Fleiss JL, Tytun A, Ury HK (1980): A simple approximation for calculating sample sizes for comparing independent proportions. Biometrics 36:343–6.

Brittain E, Schlesselman JJ (1982): Optimal allocation for the comparison of proportions. Biometrics 38:1003–9.

Gordon I, Watson R (1996): The myth of continuity-corrected sample size formulae. Biometrics 52:71–6.

See Also

samplesize.bin, chisq.test, binconf

Examples

bpower(.1, odds.ratio=.9, n=1000, alpha=c(.01,.05))
bpower.sim(.1, odds.ratio=.9, n=1000)
bsamsize(.1, .05, power=.95)
ballocation(.1, .5, n=100)
# Plot power vs. n for various odds ratios  (base prob.=.1)
n  <- seq(10, 1000, by=10)
OR <- seq(.2,.9,by=.1)
plot(0, 0, xlim=range(n), ylim=c(0,1), xlab="n", ylab="Power", type="n")
for(or in OR) {
  lines(n, bpower(.1, odds.ratio=or, n=n))
  text(350, bpower(.1, odds.ratio=or, n=350)-.02, format(or))
}
# Another way to plot the same curves, but letting labcurve do the
# work, including labeling each curve at points of maximum separation
pow <- lapply(OR, function(or,n)list(x=n,y=bpower(p1=.1,odds.ratio=or,n=n)),
              n=n)
names(pow) <- format(OR)
labcurve(pow, pl=TRUE, xlab='n', ylab='Power')
# Contour graph for various probabilities of outcome in the control
# group, fixing the odds ratio at .8 ([p2/(1-p2) / p1/(1-p1)] = .8)
# n is varied also
p1 <- seq(.01,.99,by=.01)
n  <- seq(100,5000,by=250)
pow <- outer(p1, n, function(p1,n) bpower(p1, n=n, odds.ratio=.8))
# This forms a length(p1)*length(n) matrix of power estimates
contour(p1, n, pow)
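The following base R sketch shows the kind of calculation involved: converting an odds ratio to p2 and evaluating the usual normal-approximation power of a two-sided test without continuity correction. It is written from the textbook formula and is not claimed to be bpower's exact code; or_to_p2 and power_sketch are hypothetical names.

```r
# Hypothetical helper: p2 implied by p1 and an odds ratio
or_to_p2 <- function(p1, odds.ratio) {
  p1 * odds.ratio / (1 - p1 + p1 * odds.ratio)
}

# Hypothetical helper: two-sided power, no continuity correction
power_sketch <- function(p1, p2, n1, n2, alpha = 0.05) {
  z    <- qnorm(1 - alpha / 2)
  pbar <- (n1 * p1 + n2 * p2) / (n1 + n2)            # pooled proportion
  se0  <- sqrt(pbar * (1 - pbar) * (1 / n1 + 1 / n2)) # SE under H0
  se1  <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  d    <- abs(p1 - p2)
  pnorm((d - z * se0) / se1) + pnorm((-d - z * se0) / se1)
}

p2 <- or_to_p2(0.1, 0.9)
pw <- power_sketch(0.1, 0.05, n1 = 500, n2 = 500)
# With equal group sizes this agrees with stats::power.prop.test
# when both rejection regions are counted (strict = TRUE)
ref <- power.prop.test(n = 500, p1 = 0.1, p2 = 0.05, strict = TRUE)$power
```

Because every operation is vectorized, power_sketch accepts vectors of equal length in the same way the Description says bpower does.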

Box-percentile plots

Description

Produces side-by-side box-percentile plots from several vectors or a list of vectors.

Usage

bpplot(..., name=TRUE, main="Box-Percentile Plot",
       xlab="", ylab="", srtx=0, plotopts=NULL)

Arguments

...

vectors or lists containing numeric components (e.g., the output of split).

name

character vector of names for the groups. Default is TRUE to put names on the x-axis. Such names are taken from the data vectors or the names attribute of the first argument if it is a list. Set name to FALSE to suppress names. If a character vector is supplied the names in the vector are used to label the groups.

main

main title for the plot.

xlab

x axis label.

ylab

y axis label.

srtx

rotation angle for x-axis labels. Default is zero.

plotopts

a list of other parameters to send to plot

Value

There are no returned values

Side Effects

A plot is created on the current graphics device.

BACKGROUND

Box-percentile plots are similar to boxplots, except box-percentile plots supply more information about the univariate distributions. At any height the width of the irregular "box" is proportional to the percentile of that height, up to the 50th percentile, and above the 50th percentile the width is proportional to 100 minus the percentile. Thus, the width at any given height is proportional to the percent of observations that are more extreme in that direction. As in boxplots, the median, 25th and 75th percentiles are marked with line segments across the box.
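The width rule described above can be sketched in a few lines of base R. The plotting position (i - 0.5)/n is an assumption made for this illustration and may differ from what bpplot uses; bp_halfwidth is a hypothetical name.

```r
# Hypothetical helper: relative half-width of the box-percentile shape
# at each observed height
bp_halfwidth <- function(x) {
  x <- sort(x)
  n <- length(x)
  p <- (seq_len(n) - 0.5) / n          # assumed percentile of each height
  data.frame(height = x, halfwidth = pmin(p, 1 - p))
}

set.seed(1)
w <- bp_halfwidth(rnorm(501))
# The shape is widest at the median and tapers toward both tails
widest <- w$height[which.max(w$halfwidth)]
```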

Author(s)

Jeffrey Banfield
umsfjban@bill.oscs.montana.edu
Modified by F. Harrell 30Jun97

References

Esty WW, Banfield J: The box-percentile plot. J Statistical Software 8 No. 17, 2003.

See Also

panel.bpplot, boxplot, Ecdf, bwplot

Examples

set.seed(1)
x1 <- rnorm(500)
x2 <- runif(500, -2, 2)
x3 <- abs(rnorm(500))-2
bpplot(x1, x2, x3)
g <- sample(1:2, 500, replace=TRUE)
bpplot(split(x2, g), name=c('Group 1','Group 2'))
rm(x1,x2,x3,g)

Statistics by Categories

Description

For any number of cross-classification variables, bystats returns a matrix with the sample size, number missing y, and fun(non-missing y), with the cross-classifications designated by rows. Uses Harrell's modification of the interaction function to produce cross-classifications. The default fun is mean, and if y is binary, the mean is labeled as Fraction. There is a print method as well as a latex method for objects created by bystats. bystats2 handles the special case in which there are 2 classification variables, and places the first one in rows and the second in columns. The print method for bystats2 uses the print.char.matrix function to organize statistics for cells into boxes.

Usage

bystats(y, ..., fun, nmiss, subset)
## S3 method for class 'bystats'
print(x, ...)
## S3 method for class 'bystats'
latex(object, title, caption, rowlabel, ...)
bystats2(y, v, h, fun, nmiss, subset)
## S3 method for class 'bystats2'
print(x, abbreviate.dimnames=FALSE,
   prefix.width=max(nchar(dimnames(x)[[1]])), ...)
## S3 method for class 'bystats2'
latex(object, title, caption, rowlabel, ...)

Arguments

y

a binary, logical, or continuous variable or a matrix or data frame of such variables. If y is a data frame it is converted to a matrix. If y is a data frame or matrix, computations are done on subsets of the rows of y, and you should specify fun so as to be able to operate on the matrix. For matrix y, any column with a missing value causes the entire row to be considered missing, and the row is not passed to fun.

...

For bystats, one or more classification variables separated by commas. For print.bystats, options passed to print.default such as digits. For latex.bystats and latex.bystats2, options passed to latex.default such as digits. If you pass cdec to latex.default, keep in mind that the first one or two positions (depending on nmiss) should have zeros since these correspond with frequency counts.

v

vertical variable for bystats2. Will be converted to factor.

h

horizontal variable for bystats2. Will be converted to factor.

fun

a function to compute on the non-missing y for a given subset. You must specify fun= in front of the function name or definition. fun may return a single number or a vector or matrix of any length. Matrix results are rolled out into a vector, with names preserved. When y is a matrix, a common fun is function(y) apply(y, 2, ff) where ff is the name of a function which operates on one column of y.

nmiss

A column containing a count of missing values is included if nmiss=TRUE or if there is at least one missing value.

subset

a vector of subscripts or logical values indicating the subset of data to analyze

abbreviate.dimnames

set to TRUE to abbreviate dimnames in output

prefix.width

see print.char.matrix

title

title to pass to latex.default. Default is the first word of the character string version of the first calling argument.

caption

caption to pass to latex.default. Default is the heading attribute from the object produced by bystats.

rowlabel

rowlabel to pass to latex.default. Default is the byvarnames attribute from the object produced by bystats. For bystats2 the default is "".

x

an object created by bystats or bystats2

object

an object created by bystats or bystats2

Value

for bystats, a matrix with row names equal to the classification labels and column names N, Missing, funlab, where funlab is determined from fun. A row is added to the end with the summary statistics computed on all observations combined. The class of this matrix is bystats. For bystats2, returns a 3-dimensional array with the last dimension corresponding to statistics being computed. The class of the array is bystats2.

Side Effects

latex produces a .tex file.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

See Also

interaction, cut, cut2, latex, print.char.matrix, translate

Examples

## Not run: 
bystats(sex==2, county, city)
bystats(death, race)
bystats(death, cut2(age,g=5), race)
bystats(cholesterol, cut2(age,g=4), sex, fun=median)
bystats(cholesterol, sex, fun=quantile)
bystats(cholesterol, sex, fun=function(x)c(Mean=mean(x),Median=median(x)))
latex(bystats(death,race,nmiss=FALSE,subset=sex=="female"), digits=2)
f <- function(y) c(Hazard=sum(y[,2])/sum(y[,1]))
# f() gets the hazard estimate for right-censored data from exponential dist.
bystats(cbind(d.time, death), race, sex, fun=f)
bystats(cbind(pressure, cholesterol), age.decile,
        fun=function(y) c(Median.pressure   =median(y[,1]),
                          Median.cholesterol=median(y[,2])))
y <- cbind(pressure, cholesterol)
bystats(y, age.decile,
        fun=function(y) apply(y, 2, median))   # same result as last one
bystats(y, age.decile, fun=function(y) apply(y, 2, quantile, c(.25,.75)))
# The last one computes separately the 0.25 and 0.75 quantiles of 2 vars.
latex(bystats2(death, race, sex, fun=table))
## End(Not run)
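Since the examples above rely on datasets that are not supplied, here is a small self-contained base R sketch of the same kind of table for one classification variable. It assumes that N counts the non-missing y values, which is an interpretation of the Description rather than something confirmed from the bystats source; the data are made up for illustration.

```r
# Made-up illustrative data: a binary response with missing values
y <- c(1, 0, NA, 1, 1, 0, NA, 0)
g <- c("a", "a", "a", "b", "b", "b", "b", "a")

N       <- tapply(!is.na(y), g, sum)           # non-missing count per group
Missing <- tapply(is.na(y), g, sum)            # missing count per group
Mean    <- tapply(y, g, mean, na.rm = TRUE)    # fun applied to non-missing y

stats <- cbind(N, Missing, Mean)
# Final row: the same statistics on all observations combined
stats <- rbind(stats,
               ALL = c(sum(!is.na(y)), sum(is.na(y)), mean(y, na.rm = TRUE)))
```

With several classification variables, interaction(g1, g2) could stand in for g in this sketch.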

capitalize the first letter of a string

Description

Capitalizes the first letter of each element of the string vector.

Usage

capitalize(string)

Arguments

string

String to be capitalized

Value

Returns a character vector with the first letter of each element capitalized

Author(s)

Charles Dupont

Examples

capitalize(c("Hello", "bob", "daN"))
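For readers without Hmisc loaded, an equivalent effect can be had in base R. This is a sketch of the same idea, not the package's implementation, and cap_first is a hypothetical name.

```r
# Hypothetical helper: upper-case the first character, keep the rest as is
cap_first <- function(s) paste0(toupper(substring(s, 1, 1)), substring(s, 2))

res <- cap_first(c("Hello", "bob", "daN"))
# res is c("Hello", "Bob", "DaN"); later characters are left untouched
```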

Power of Interaction Test for Exponential Survival

Description

Uses the method of Peterson and George to compute the power of an interaction test in a 2 x 2 setup in which all 4 distributions are exponential. This will be the same as the power of the Cox model test if assumptions hold. The test is 2-tailed. The duration of accrual is specified (constant accrual is assumed), as is the minimum follow-up time. The maximum follow-up time is then accrual + tmin. Treatment allocation is assumed to be 1:1.

Usage

ciapower(tref, n1, n2, m1c, m2c, r1, r2, accrual, tmin,
         alpha=0.05, pr=TRUE)

Arguments

tref

time at which mortalities are estimated

n1

total sample size, stratum 1

n2

total sample size, stratum 2

m1c

tref-year mortality, stratum 1 control

m2c

tref-year mortality, stratum 2 control

r1

% reduction in m1c by intervention, stratum 1

r2

% reduction in m2c by intervention, stratum 2

accrual

duration of accrual period

tmin

minimum follow-up time

alpha

type I error probability

pr

set to FALSE to suppress printing of details

Value

power

Side Effects

prints

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

References

Peterson B, George SL: Controlled Clinical Trials 14:511–522; 1993.

See Also

cpower, spower

Examples

# Find the power of a race x treatment test.  25% of patients will
# be non-white and the total sample size is 14000.
# Accrual is for 1.5 years and minimum follow-up is 5y.
# Reduction in 5-year mortality is 15% for whites, 0% or -5% for
# non-whites.  5-year mortality for control subjects is assumed to
# be 0.18 for whites, 0.23 for non-whites.
n <- 14000
for(nonwhite.reduction in c(0,-5)) {
  cat("\n\n\n% Reduction in 5-year mortality for non-whites:",
      nonwhite.reduction, "\n\n")
  pow <- ciapower(5,  .75*n, .25*n,  .18, .23,  15, nonwhite.reduction,
                  1.5, 5)
  cat("\n\nPower:",format(pow),"\n")
}
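The inputs in this example are mortalities and percent reductions, while an exponential model works with hazard rates. The sketch below shows the conversion implied by the exponential assumption, and expresses the interaction as a difference in log hazard ratios between strata. That parameterization is an assumption made here for illustration (it is how a Cox model interaction would be written); the power calculation of Peterson and George itself is not reproduced. mortality_to_hazard is a hypothetical name.

```r
# Under an exponential distribution, mortality m by time tref
# satisfies m = 1 - exp(-hazard * tref)
mortality_to_hazard <- function(m, tref) -log(1 - m) / tref

tref <- 5
m1c <- 0.18; m2c <- 0.23        # control mortalities, strata 1 and 2
r1  <- 15;   r2  <- 0           # % reductions by intervention

m1i <- m1c * (1 - r1 / 100)     # intervention mortality, stratum 1
m2i <- m2c * (1 - r2 / 100)     # intervention mortality, stratum 2

lhr1 <- log(mortality_to_hazard(m1i, tref) / mortality_to_hazard(m1c, tref))
lhr2 <- log(mortality_to_hazard(m2i, tref) / mortality_to_hazard(m2c, tref))

# Assumed interaction effect: difference in log hazard ratios
interaction_effect <- lhr1 - lhr2
```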

Convert between the 5 different coordinate systems on a graphical device

Description

Takes a set of coordinates in any of the 5 coordinate systems (usr, plt, fig, dev, or tdev) and returns the same points in all 5 coordinate systems.

Usage

cnvrt.coords(x, y = NULL, input = c("usr", "plt", "fig", "dev","tdev"))

Arguments

x

Vector, Matrix, or list of x coordinates (or x and y coordinates), NA's allowed.

y

y coordinates (if x is a vector), NA's allowed.

input

Character scalar indicating the coordinate system of the input points.

Details

Every plot has 5 coordinate systems:

usr (User): the coordinate system of the data, this is shown by the tick marks and axis labels.

plt (Plot): Plot area, coordinates range from 0 to 1 with 0 corresponding to the x and y axes and 1 corresponding to the top and right of the plot area. Margins of the plot correspond to plot coordinates less than 0 or greater than 1.

fig (Figure): Figure area, coordinates range from 0 to 1 with 0 corresponding to the bottom and left edges of the figure (including margins, label areas) and 1 corresponds to the top and right edges. fig and dev coordinates will be identical if there is only 1 figure area on the device (layout, mfrow, or mfcol has not been used).

dev (Device): Device area, coordinates range from 0 to 1 with 0 corresponding to the bottom and left of the device region within the outer margins and 1 is the top and right of the region within the outer margins. If the outer margins are all set to 0 then tdev and dev should be identical.

tdev (Total Device): Total Device area, coordinates range from 0 to 1 with 0 corresponding to the bottom and left edges of the device (piece of paper, window on screen) and 1 corresponds to the top and right edges.

Value

A list with 5 components, each component is a list with vectors named x and y. The 5 sublists are:

usr

The coordinates of the input points in usr (User) coordinates.

plt

The coordinates of the input points in plt (Plot) coordinates.

fig

The coordinates of the input points in fig (Figure) coordinates.

dev

The coordinates of the input points in dev (Device) coordinates.

tdev

The coordinates of the input points in tdev (Total Device) coordinates.

Note

You must provide both x and y, but one of them may be NA.

This function is becoming deprecated with the new functions grconvertX and grconvertY in R version 2.7.0 and beyond. These new functions use the correct coordinate system names and have more coordinate systems available; you should start using them instead.
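As a pointer toward the recommended replacements, here is a short sketch using grconvertX and grconvertY from the graphics package. The correspondence assumed in the comments ("npc" for plt, "ndc" for tdev) is a reading of the two sets of definitions, not something stated in this help page.

```r
pdf(NULL)                        # null device: nothing is written to disk
plot(1:10, 1:10)
usr <- par("usr")                # user-coordinate limits of the plot region

# Corners of the plot region (assumed to match 'plt' = 0 and 1)
corners_x <- grconvertX(c(0, 1), from = "npc", to = "user")
corners_y <- grconvertY(c(0, 1), from = "npc", to = "user")

# Horizontal middle of the plot region as a fraction of the whole
# device (assumed to match 'tdev')
mid_ndc <- grconvertX(mean(usr[1:2]), from = "user", to = "ndc")
invisible(dev.off())
```

Each call converts one axis at a time, so a point needs one grconvertX and one grconvertY call, whereas cnvrt.coords returns both axes for all five systems at once.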

Author(s)

Greg Snow greg.snow@imail.org

See Also

par specifically 'usr', 'plt', and 'fig'. Also 'xpd' for plotting outside of the plotting region and 'mfrow' and 'mfcol' for multi figure plotting. subplot, grconvertX and grconvertY in R 2.7.0 and later

Examples

old.par <- par(no.readonly=TRUE)
par(mfrow=c(2,2),xpd=NA)
# generate some sample data
tmp.x <- rnorm(25, 10, 2)
tmp.y <- rnorm(25, 50, 10)
tmp.z <- rnorm(25, 0, 1)
plot( tmp.x, tmp.y)
# draw a diagonal line across the plot area
tmp1 <- cnvrt.coords( c(0,1), c(0,1), input='plt' )
lines(tmp1$usr, col='blue')
# draw a diagonal line across figure region
tmp2 <- cnvrt.coords( c(0,1), c(1,0), input='fig')
lines(tmp2$usr, col='red')
# save coordinate of point 1 and y value near top of plot for future plots
tmp.point1 <- cnvrt.coords(tmp.x[1], tmp.y[1])
tmp.range1 <- cnvrt.coords(NA, 0.98, input='plt')
# make a second plot and draw a line linking point 1 in each plot
plot(tmp.y, tmp.z)
tmp.point2 <- cnvrt.coords( tmp.point1$dev, input='dev' )
arrows( tmp.y[1], tmp.z[1], tmp.point2$usr$x, tmp.point2$usr$y, col='green')
# draw another plot and add rectangle showing same range in 2 plots
plot(tmp.x, tmp.z)
tmp.range2 <- cnvrt.coords(NA, 0.02, input='plt')
tmp.range3 <- cnvrt.coords(NA, tmp.range1$dev$y, input='dev')
rect( 9, tmp.range2$usr$y, 11, tmp.range3$usr$y, border='yellow')
# put a label just to the right of the plot and
#  near the top of the figure region.
text( cnvrt.coords(1.05, NA, input='plt')$usr$x,
      cnvrt.coords(NA, 0.75, input='fig')$usr$y,
      "Label", adj=0)
par(mfrow=c(1,1))
## create a subplot within another plot (see also subplot)
plot(1:10, 1:10)
tmp <- cnvrt.coords( c( 1, 4, 6, 9), c(6, 9, 1, 4) )
par(plt = c(tmp$dev$x[1:2], tmp$dev$y[1:2]), new=TRUE)
hist(rnorm(100))
par(fig = c(tmp$dev$x[3:4], tmp$dev$y[3:4]), new=TRUE)
hist(rnorm(100))
par(old.par)

Miscellaneous ggplot2 and grid Helper Functions

Description

These functions are used on ggplot2 objects or as layers when building a ggplot2 object, and to facilitate use of gridExtra. colorFacet colors the thin rectangles used to separate panels created by facet_grid (and probably by facet_wrap). A better approach may be found at https://stackoverflow.com/questions/28652284/. arrGrob is a front-end to gridExtra::arrangeGrob that allows for proper printing. See https://stackoverflow.com/questions/29062766/store-output-from-gridextragrid-arrange-into-an-object/. The arrGrob print method calls grid::grid.draw.

Usage

colorFacet(g, col = adjustcolor("blue", alpha.f = 0.3))
arrGrob(...)
## S3 method for class 'arrGrob'
print(x, ...)

Arguments

g

a ggplot2 object that used faceting

col

color for facet separator rectangles

...

passed to arrangeGrob

x

an object created by arrGrob

Author(s)

Sandy Muspratt

Examples

## Not run: 
require(ggplot2)
s <- summaryP(age + sex ~ region + treatment)
colorFacet(ggplot(s))   # prints directly
# arrGrob is called by rms::ggplot.Predict and others
## End(Not run)

combine.levels

Description

Combine Infrequent Levels of a Categorical Variable

Usage

combine.levels(
  x,
  minlev = 0.05,
  m,
  ord = is.ordered(x),
  plevels = FALSE,
  sep = ","
)

Arguments

x

a factor, 'ordered' factor, or numeric or character variable that will be turned into a 'factor'

minlev

the minimum proportion of observations in a cell before that cell is combined with one or more cells. If more than one cell has fewer than minlev*n observations, all such cells are combined into a new cell labeled '"OTHER"'. Otherwise, the lowest frequency cell is combined with the next lowest frequency cell, and the level name is the combination of the two old levels. When 'ord=TRUE' combinations happen only for consecutive levels.

m

alternative to 'minlev', is the minimum number of observations in a cell before it will be combined with others

ord

set to 'TRUE' to treat 'x' as if it were an ordered factor, which allows only consecutive levels to be combined

plevels

by default 'combine.levels' pools low-frequency levels into a category named 'OTHER' when 'x' is not ordered and 'ord=FALSE'. To instead name this category the concatenation of all the pooled level names, separated by a comma, set 'plevels=TRUE'.

sep

the separator for concatenating levels when 'plevels=TRUE'

Details

After turning 'x' into a 'factor' if it is not one already, combines levels of 'x' whose frequency falls below a specified relative frequency 'minlev' or absolute count 'm'. When 'x' is not treated as ordered, all of the small frequency levels are combined into '"OTHER"', unless 'plevels=TRUE'. When 'ord=TRUE' or 'x' is an ordered factor, only consecutive levels are combined. New levels are constructed by concatenating the levels with 'sep' as a separator. This is useful when comparing ordinal regression with polytomous (multinomial) regression and there are too many categories for polytomous regression. 'combine.levels' is also useful when assumptions of ordinal models are being checked empirically by computing exceedance probabilities for various cutoffs of the dependent variable.

Value

a factor variable, or if 'ord=TRUE' an ordered factor variable

Author(s)

Frank Harrell

Examples

x <- c(rep('A', 1), rep('B', 3), rep('C', 4), rep('D',1), rep('E',1))
combine.levels(x, m=3)
combine.levels(x, m=3, plevels=TRUE)
combine.levels(x, ord=TRUE, m=3)
x <- c(rep('A', 1), rep('B', 3), rep('C', 4), rep('D',1), rep('E',1),
       rep('F',1))
combine.levels(x, ord=TRUE, m=3)
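The unordered pooling rule can be sketched in base R as follows. This covers only the case in which more than one level falls below the threshold m and ignores the ordered, plevels, and single-rare-level cases, so it is an illustration of the idea rather than a substitute for combine.levels; pool_rare is a hypothetical name.

```r
# Hypothetical helper: pool all levels with fewer than m observations
# into "OTHER" (unordered case with at least two rare levels only)
pool_rare <- function(x, m) {
  x <- as.factor(x)
  counts <- table(x)
  rare <- names(counts)[counts < m]
  if (length(rare) > 1) {
    # assigning a repeated label merges the corresponding levels
    levels(x)[levels(x) %in% rare] <- "OTHER"
  }
  x
}

x <- c(rep('A', 1), rep('B', 3), rep('C', 4), rep('D', 1), rep('E', 1))
pooled <- pool_rare(x, m = 3)
tab <- table(pooled)
```

Here A, D, and E each have one observation, so they are merged into a single OTHER level with three observations while B and C are kept.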

Combination Plot

Description

Generates a plotly attribute plot given a series of possibly overlapping binary variables

Usage

combplotp(
  formula,
  data = NULL,
  subset,
  na.action = na.retain,
  vnames = c("labels", "names"),
  includenone = FALSE,
  showno = FALSE,
  maxcomb = NULL,
  minfreq = NULL,
  N = NULL,
  pos = function(x) 1 * (tolower(x) %in% c("true", "yes", "y", "positive", "+",
    "present", "1")),
  obsname = "subjects",
  ptsize = 35,
  width = NULL,
  height = NULL,
  ...
)

Arguments

formula

a formula containing all the variables to be cross-tabulated, on the formula's right hand side. There is no left hand side variable. If formula is omitted, then all variables from data are analyzed.

data

input data frame. If none is specified the data are assumed to come from the parent frame.

subset

an optional subsetting expression applied to data

na.action

see lm etc.

vnames

set to "names" to use variable names to label axes instead of variable labels. When using the default labels, any variable not having a label will have its name used instead.

includenone

set to TRUE to include the combination where all conditions are absent

showno

set to TRUE to show a light dot for conditions that are not part of the currently tabulated combination

maxcomb

maximum number of combinations to display

minfreq

if specified, any combination having a frequency less than this will be omitted from the display

N

set to an integer to override the global denominator, instead of using the number of rows in the data

pos

a function of a vector returning a logical vector with TRUE values indicating positive

obsname

character string noun describing observations, default is "subjects"

ptsize

point size, defaults to 35

width

width of plotly plot

height

height of plotly plot

...

optional arguments to pass to table

Details

Similar to the UpSetR package, draws a special dot chart sometimes called an attribute plot that depicts all possible combinations of the binary variables. By default a positive value, indicating that a certain condition pertains for a subject, is any of logical TRUE, numeric 1, "yes", "y", "positive", "+" or "present" value, and all others are considered negative. The user can override this determination by specifying her own pos function. Case is ignored in the variable values.

The plot uses solid dots arranged in a vertical line to indicate which combination of conditions is being considered. Frequencies of all possible combinations are shown above the dot chart. Marginal frequencies of positive values for the input variables are shown to the left of the dot chart. More information for all three of these component symbols is provided in hover text.

Variables are sorted in descending order of marginal frequencies and likewise for combinations of variables.
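The tabulation that underlies such a plot can be sketched without plotly. The pos function below is copied from the default shown in Usage; the data frame and the way combinations are labeled are made up for illustration and are not taken from combplotp's code.

```r
# Default positivity rule, as given in the Usage section
pos <- function(x) 1 * (tolower(x) %in% c("true", "yes", "y", "positive", "+",
                                          "present", "1"))

# Made-up illustrative data mixing character and logical codings
d <- data.frame(pain    = c("yes", "no", "Yes", "y"),
                anxiety = c(TRUE, FALSE, TRUE, FALSE))

b <- sapply(d, pos)                               # 0/1 matrix of conditions
marginal <- sort(colSums(b), decreasing = TRUE)   # marginal frequencies

# Label each row by the set of conditions that are present
combo <- apply(b, 1, function(r) paste(colnames(b)[r == 1], collapse = " + "))
combo_freq <- sort(table(combo[combo != ""]), decreasing = TRUE)
```

Rows with no positive condition get an empty label and are dropped here, which corresponds to the default includenone=FALSE.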

Value

plotly object

Author(s)

Frank Harrell

Examples

if (requireNamespace("plotly")) {
  g <- function() sample(0:1, n, prob=c(1 - p, p), replace=TRUE)
  set.seed(2); n <- 100; p <- 0.5
  x1 <- g(); label(x1) <- 'A long label for x1 that describes it'
  x2 <- g()
  x3 <- g(); label(x3) <- 'This is<br>a label for x3'
  x4 <- g()
  combplotp(~ x1 + x2 + x3 + x4, showno=TRUE, includenone=TRUE)
  n <- 1500; p <- 0.05
  pain       <- g()
  anxiety    <- g()
  depression <- g()
  soreness   <- g()
  numbness   <- g()
  tiredness  <- g()
  sleepiness <- g()
  combplotp(~ pain + anxiety + depression + soreness + numbness +
            tiredness + sleepiness, showno=TRUE)
}

completer

Description

Create imputed dataset(s) using transcan and aregImpute objects

Usage

completer(a, nimpute, oneimpute = FALSE, mydata)

Arguments

a

An object of class transcan or aregImpute

nimpute

A numeric vector between 1 and a$n.impute. For a transcan object, this is set to 1. For an aregImpute object, returns a list of nimpute datasets when oneimpute is set to FALSE (default).

oneimpute

A logical vector. When set to TRUE, returns a single completed dataset for the imputation number specified by nimpute

mydata

A data frame whose missing values will be imputed.

Details

Similar in function to mice::complete, this function uses transcan and aregImpute objects to impute missing data and returns the completed dataset(s) as a dataframe or a list. It assumes that transcan is used for single regression imputation.

Value

A single or a list of completed dataset(s).

Author(s)

Yong-Hao Pua, Singapore General Hospital

Examples

## Not run: 
mtcars$hp[1:5]    <- NA
mtcars$wt[1:10]   <- NA
myrform <- ~ wt + hp + I(carb)
mytranscan  <- transcan( myrform,  data = mtcars, imputed = TRUE,
  pl = FALSE, pr = FALSE, trantab = TRUE, long = TRUE)
myareg      <- aregImpute(myrform, data = mtcars, x=TRUE, n.impute = 5)
completer(mytranscan)                    # single completed dataset
completer(myareg, 3, oneimpute = TRUE)
# single completed dataset based on the `n.impute`th set of multiple imputation
completer(myareg, 3)
# list of completed datasets based on first `nimpute` sets of multiple imputation
completer(myareg)
# list of completed datasets based on all available sets of multiple imputation
# To get a stacked data frame of all completed datasets use
# do.call(rbind, completer(myareg, mydata=mydata))
# or use rbindlist in data.table
## End(Not run)

Element Merging

Description

Merges an object by the names of its elements, inserting elements in value into x that do not exist in x and replacing elements in x that exist in value with value elements if protect is false.

Usage

consolidate(x, value, protect, ...)
## Default S3 method:
consolidate(x, value, protect=FALSE, ...)
consolidate(x, protect, ...) <- value

Arguments

x

named list or vector

value

named list or vector

protect

logical; should elements in x be kept instead of elements in value?

...

currently does nothing; included in case the function is ever made generic.

Author(s)

Charles Dupont

See Also

names

Examples

x <- 1:5
names(x) <- LETTERS[x]
y <- 6:10
names(y) <- LETTERS[y-2]
x                  # c(A=1,B=2,C=3,D=4,E=5)
y                  # c(D=6,E=7,F=8,G=9,H=10)
consolidate(x, y)      # c(A=1,B=2,C=3,D=6,E=7,F=8,G=9,H=10)
consolidate(x, y, protect=TRUE)      # c(A=1,B=2,C=3,D=4,E=5,F=8,G=9,H=10)
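The merging rule in the Description can be written directly in base R. This is a sketch of the behavior documented above for named vectors, not the package's method; merge_by_name is a hypothetical name.

```r
# Hypothetical helper: merge 'value' into 'x' by element name
merge_by_name <- function(x, value, protect = FALSE) {
  new <- setdiff(names(value), names(x))       # names only in value
  if (!protect) {
    shared <- intersect(names(x), names(value))
    x[shared] <- value[shared]                 # value overrides x
  }
  c(x, value[new])                             # append the new elements
}

x <- 1:5
names(x) <- LETTERS[x]
y <- 6:10
names(y) <- LETTERS[y - 2]

merged    <- merge_by_name(x, y)
protected <- merge_by_name(x, y, protect = TRUE)
```

Both results reproduce the values shown in the comments of the example above.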

Metadata for a Data Frame

Description

contents is a generic method for which contents.data.frame is currently the only method. contents.data.frame creates an object containing the following attributes of the variables from a data frame: names, labels (if any), units (if any), number of factor levels (if any), factor levels, class, storage mode, and number of NAs. print.contents.data.frame will print the results, with options for sorting the variables. html.contents.data.frame creates HTML code for displaying the results. This code has hyperlinks so that if the user clicks on the number of levels the browser jumps to the correct part of a table of factor levels for all the factor variables. If long labels are present ("longlabel" attributes on variables), these are printed at the bottom and the html method links to them through the regular labels. Variables having the same levels in the same order have the levels factored out for brevity.

contents.list prints a directory of datasets when sasxport.get imported more than one SAS dataset.

If options(prType='html') is in effect, calling print on an object that is the contents of a data frame will result in rendering the HTML version. If run from the console a browser window will open.
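To make the kind of per-variable metadata listed above concrete, here is a minimal base-R sketch that tabulates class, storage mode, number of factor levels, and number of NAs. It is an assumption-laden illustration, not how contents.data.frame is implemented, and the data frame dfr is made up.

```r
# Made-up data frame with one numeric variable and one factor
dfr <- data.frame(x = c(1.5, NA, 3),
                  g = factor(c("a", "b", "a")))

# One row of metadata per variable, computed with base R only
meta <- data.frame(
  name    = names(dfr),
  class   = unname(vapply(dfr, function(v) class(v)[1], character(1))),
  storage = unname(vapply(dfr, storage.mode, character(1))),
  levels  = unname(vapply(dfr, nlevels, integer(1))),
  NAs     = unname(vapply(dfr, function(v) sum(is.na(v)), integer(1)))
)
meta
```

Labels, units, and long labels are stored as attributes on each variable, so they could be added to such a table with attr() in the same way.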

Usage

contents(object, ...)

## S3 method for class 'data.frame'
contents(object, sortlevels=FALSE, id=NULL,
  range=NULL, values=NULL, ...)

## S3 method for class 'contents.data.frame'
print(x,
    sort=c('none','names','labels','NAs'), prlevels=TRUE, maxlevels=Inf,
    number=FALSE, ...)

## S3 method for class 'contents.data.frame'
html(object,
           sort=c('none','names','labels','NAs'), prlevels=TRUE, maxlevels=Inf,
           levelType=c('list','table'),
           number=FALSE, nshow=TRUE, ...)

## S3 method for class 'list'
contents(object, dslabels, ...)

## S3 method for class 'contents.list'
print(x,
    sort=c('none','names','labels','NAs','vars'), ...)

Arguments

object

a data frame. For html is an object created by contents. For contents.list is a list of data frames.

sortlevels

set to TRUE to sort levels of all factor variables into alphabetic order. This is especially useful when two variables use the same levels but in different orders. They will still be recognized by the html method as having identical levels if sorted.

id

an optional subject ID variable name that if present in object will cause the number of unique IDs to be printed in the contents header

range

an optional variable name that if present in object will cause its range to be printed in the contents header

values

an optional variable name that if present in object will cause its unique values to be printed in the contents header

x

an object created by contents

sort

Default is to print the variables in their original order in the data frame. Specify one of "names", "labels", or "NAs" to sort the variables, respectively, alphabetically by names, alphabetically by labels, or by increasing order of number of missing values. For contents.list, sort may also be the value "vars" to cause sorting by the number of variables in the dataset.

prlevels

set to FALSE to not print all levels of factor variables

maxlevels

maximum number of levels to print for a factor variable

number

set to TRUE to have the print and latex methods number the variables by their order in the data frame

nshow

set to FALSE to suppress outputting number of observations and number of NAs; useful when these counts would unblind information to blinded reviewers

levelType

By default, bullet lists of category levels are constructed in html. Set levelType='table' to put levels in html table format.

...

arguments passed from html to format.df, unused otherwise

dslabels

named vector of SAS dataset labels, created for example by sasdsLabels

Value

an object of class "contents.data.frame" or "contents.list". For the html method is an html character vector object.

Author(s)

Frank Harrell
Vanderbilt University
fh@fharrell.com

See Also

describe, html, upData, extractlabs, hlab

Examples

set.seed(1)
dfr <- data.frame(x=rnorm(400),y=sample(c('male','female'),400,TRUE),
                  stringsAsFactors=TRUE)
contents(dfr)
dfr <- upData(dfr, labels=c(x='Label for x', y='Label for y'))
attr(dfr$x, 'longlabel') <- 'A very long label for x that can continue onto multiple long lines of text'
k <- contents(dfr)
print(k, sort='names', prlevels=FALSE)
## Not run: 
html(k)
html(contents(dfr))            # same result
latex(k$contents)              # latex.default just the main information
## End(Not run)

Power of Cox/log-rank Two-Sample Test

Description

Assumes exponential distributions for both treatment groups. Uses the George-Desu method along with formulas of Schoenfeld that allow estimation of the expected number of events in the two groups. To allow for drop-ins (noncompliance to control therapy, crossover to intervention) and noncompliance of the intervention, the method of Lachin and Foulkes is used.

Usage

cpower(tref, n, mc, r, accrual, tmin, noncomp.c=0, noncomp.i=0,        alpha=0.05, nc, ni, pr=TRUE)

Arguments

tref

time at which mortalities are estimated

n

total sample size (both groups combined). If allocation is unequal so that there are not n/2 observations in each group, you may specify the sample sizes in nc and ni.

mc

tref-year mortality, control

r

% reduction in mc by intervention

accrual

duration of accrual period

tmin

minimum follow-up time

noncomp.c

% non-compliant in control group (drop-ins)

noncomp.i

% non-compliant in intervention group (non-adherers)

alpha

type I error probability. A 2-tailed test is assumed.

nc

number of subjects in control group

ni

number of subjects in intervention group. nc and ni are specified exclusive of n.

pr

set to FALSE to suppress printing of details

Details

For handling noncompliance, uses a modification of formula (5.4) of Lachin and Foulkes. Their method is based on a test for the difference in two hazard rates, whereas cpower is based on testing the difference in two log hazards. It is assumed here that the same correction factor can be approximately applied to the log hazard ratio as Lachin and Foulkes applied to the hazard difference.

Note that Schoenfeld approximates the variance of the log hazard ratio by 4/m, where m is the total number of events, whereas the George-Desu method uses the slightly better 1/m1 + 1/m2. Power from this function will thus differ slightly from that obtained with the SAS samsizc program.
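The two variance approximations can be compared with a small worked example. The sketch below uses only base R; the mortality inputs and the event counts m1 and m2 are hypothetical values chosen for illustration, and the code does not reproduce cpower's own event-count or noncompliance calculations.

```r
# Illustrative inputs (assumed, not taken from the documentation)
tref <- 5      # years
mc   <- 0.18   # tref-year mortality in the control group
r    <- 20     # % reduction in mc by intervention
mi   <- mc * (1 - r / 100)          # tref-year mortality, intervention

# Under an exponential model S(t) = exp(-lambda * t), so
# lambda = -log(1 - mortality) / tref
lambda_c <- -log(1 - mc) / tref
lambda_i <- -log(1 - mi) / tref
log_hr   <- log(lambda_i / lambda_c)  # log hazard ratio (negative here)

# Hypothetical expected numbers of events in the two groups
m1 <- 120
m2 <- 80
v_schoenfeld  <- 4 / (m1 + m2)      # 4/m with m = m1 + m2
v_george_desu <- 1 / m1 + 1 / m2    # per-group form
c(schoenfeld = v_schoenfeld, george_desu = v_george_desu)
```

The two expressions agree when m1 equals m2, and 1/m1 + 1/m2 is larger otherwise, so the 4/m form slightly understates the variance when the groups have unequal numbers of events.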

Value

power

Side Effects

prints

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

References

Peterson B, George SL: Controlled Clinical Trials 14:511–522; 1993.

Lachin JM, Foulkes MA: Biometrics 42:507–519; 1986.

Schoenfeld D: Biometrics 39:499–503; 1983.

See Also

spower, ciapower, bpower

Examples

#In this example, 4 plots are drawn on one page, one plot for each
#combination of noncompliance percentage.  Within a plot, the
#5-year mortality % in the control group is on the x-axis, and
#separate curves are drawn for several % reductions in mortality
#with the intervention.  The accrual period is 1.5y, with all
#patients followed at least 5y and some 6.5y.
par(mfrow=c(2,2),oma=c(3,0,3,0))
morts <- seq(10,25,length=50)
red <- c(10,15,20,25)
for(noncomp in c(0,10,15,-1)) {
  if(noncomp>=0) nc.i <- nc.c <- noncomp else {nc.i <- 25; nc.c <- 15}
  z <- paste("Drop-in ",nc.c,"%, Non-adherence ",nc.i,"%",sep="")
  plot(0,0,xlim=range(morts),ylim=c(0,1),
           xlab="5-year Mortality in Control Patients (%)",
           ylab="Power",type="n")
  title(z)
  cat(z,"\n")
  lty <- 0
  for(r in red) {
        lty <- lty+1
        power <- morts
        i <- 0
        for(m in morts) {
          i <- i+1
          power[i] <- cpower(5, 14000, m/100, r, 1.5, 5, nc.c, nc.i, pr=FALSE)
        }
        lines(morts, power, lty=lty)
  }
  if(noncomp==0)legend(18,.55,rev(paste(red,"% reduction",sep="")),
           lty=4:1,bty="n")
}
mtitle("Power vs Non-Adherence for Main Comparison",
           ll="alpha=.05, 2-tailed, Total N=14000",cex.l=.8)
#
# Point sample size requirement vs. mortality reduction
# Root finder (uniroot()) assumes needed sample size is between
# 1000 and 40000
#
nc.i <- 25; nc.c <- 15; mort <- .18
red <- seq(10,25,by=.25)
samsiz <- red
i <- 0
for(r in red) {
  i <- i+1
  samsiz[i] <- uniroot(function(x) cpower(5, x, mort, r, 1.5, 5,
                                          nc.c, nc.i, pr=FALSE) - .8,
                       c(1000,40000))$root
}
samsiz <- samsiz/1000
par(mfrow=c(1,1))
plot(red, samsiz, xlab='% Reduction in 5-Year Mortality',
     ylab='Total Sample Size (Thousands)', type='n')
lines(red, samsiz, lwd=2)
title('Sample Size for Power=0.80\nDrop-in 15%, Non-adherence 25%')
title(sub='alpha=0.05, 2-tailed', adj=0)

Read Comma-Separated Text Data Files

Description

Read comma-separated text data files, allowing optional translation to lower case for variable names after making them valid S names. There is a facility for reading long variable labels as one of the rows. If labels are not specified and a final variable name is not the same as that in the header, the original variable name is saved as a variable label. Uses read.csv if the data.table package is not in effect, otherwise calls fread.

Usage

csv.get(file, lowernames=FALSE, datevars=NULL, datetimevars=NULL,        dateformat='%F',        fixdates=c('none','year'), comment.char="", autodate=TRUE,        allow=NULL, charfactor=FALSE,        sep=',', skip=0, vnames=NULL, labels=NULL, text=NULL, ...)

Arguments

file

the file name for import.

lowernames

set this to TRUE to change variable names to lower case.

datevars

character vector of names (after lowernames is applied) of variables to consider as a factor or character vector containing dates in a format matching dateformat. The default is "%F" which uses the yyyy-mm-dd format.

datetimevars

character vector of names (after lowernames is applied) of variables to consider to be date-time variables, with date formats as described under datevars followed by a space followed by time in hh:mm:ss format. chron is used to store such variables. If all times in the variable are 00:00:00 the variable will be converted to an ordinary date variable.

dateformat

for cleanup.import is the input format (see strptime)

fixdates

for any of the variables listed in datevars that have a dateformat that cleanup.import understands, specifying fixdates allows corrections of certain formatting inconsistencies before the fields are attempted to be converted to dates (the default is to assume that the dateformat is followed for all observations for datevars). Currently fixdates='year' is implemented, which will cause 2-digit or 4-digit years to be shifted to the alternate number of digits when dateformat is the default "%F" or is "%y-%m-%d", "%m/%d/%y", or "%m/%d/%Y". Two-digit years are padded with 20 on the left. Set dateformat to the desired format, not the exceptional format.

comment.char

a character vector of length one containing a single character or an empty string. Use '""' to turn off the interpretation of comments altogether.

autodate

set to TRUE to allow the function to guess which variables are dates

allow

a vector of characters allowed by R that should not be converted to periods in variable names. By default, underscores in variable names are converted to periods as with R before version 1.9.

charfactor

set to TRUE to change character variables to factors if they have fewer than n/2 unique values. Blanks and null strings are converted to NAs.

sep

field separator, defaults to comma

skip

number of records to skip before data start. Required if vnames or labels is given.

vnames

number of the row containing variable names; default is one

labels

number of the row containing variable labels; default is no labels

text

a character string containing the .csv file to use instead of file=. Passed to read.csv as the text= argument.

...

arguments to pass to read.csv other than skip and sep.

Details

csv.get reads comma-separated text data files, allowing optional translation to lower case for variable names after making them valid S names. Original possibly non-legal names are taken to be variable labels if labels is not specified. Character or factor variables containing dates can be converted to date variables. cleanup.import is invoked to finish the job.
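The handling of a names row, a labels row, and skipped records can be sketched with base R alone. This is a hedged illustration of the idea, not what csv.get does internally; the inline text txt and all values in it are made up, and a plain "label" attribute stands in for Hmisc's label().

```r
# Made-up file contents: junk, names, labels, junk, then data
txt <- "junk line
Age,Sex Code
Age in years,Sex of subject
more junk
34,1
51,0"

rows   <- strsplit(txt, "\n", fixed = TRUE)[[1]]
vnames <- strsplit(rows[2], ",", fixed = TRUE)[[1]]  # row 2: variable names
labs   <- strsplit(rows[3], ",", fixed = TRUE)[[1]]  # row 3: long labels

# Skip the first four records, then apply legal, lower-case names
dat <- read.csv(text = txt, skip = 4, header = FALSE,
                col.names = make.names(vnames))
names(dat) <- tolower(names(dat))

# Keep the long labels as a plain attribute on each variable
for (i in seq_along(dat)) attr(dat[[i]], "label") <- labs[i]
str(dat)
```

Here "Sex Code" is not a legal name, so make.names turns it into Sex.Code before lower-casing, which is the situation in which an original name would be worth keeping as a label.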

Value

a new data frame.

Author(s)

Frank Harrell, Vanderbilt University

See Also

sas.get, data.frame, cleanup.import, read.csv, strptime, POSIXct, Date, fread

Examples

## Not run: 
dat <- csv.get('myfile.csv')
# Read a csv file with junk in the first row, variable names in the
# second, long variable labels in the third, and junk in the 4th row
dat <- csv.get('myfile.csv', vnames=2, labels=3, skip=4)
## End(Not run)

Representative Curves

Description

curveRep finds representative curves from a relatively large collection of curves. The curves usually represent time-response profiles as in serial (longitudinal or repeated) data with possibly unequal time points and greatly varying sample sizes per subject. After excluding records containing missing x or y, records are first stratified into kn groups having similar sample sizes per curve (subject). Within these strata, curves are next stratified according to the distribution of x points per curve (typically measurement times per subject). The clara clustering/partitioning function is used to do this, clustering on one, two, or three x characteristics depending on the minimum sample size in the current interval of sample size. If the interval has a minimum number of unique values of one, clustering is done on the single x values. If the minimum number of unique x values is two, clustering is done to create groups that are similar on both min(x) and max(x). For groups containing no fewer than three unique x values, clustering is done on the trio of values min(x), max(x), and the longest gap between any successive x. Then within sample size and x distribution strata, clustering of time-response profiles is based on p values of y all evaluated at the same p equally-spaced x's within the stratum. An option allows per-curve data to be smoothed with lowess before proceeding. Outer x values are taken as extremes of x across all curves within the stratum. Linear interpolation within curves is used to estimate y at the grid of x's. For curves within the stratum that do not extend to the most extreme x values in that stratum, extrapolation uses flat lines from the observed extremes in the curve unless extrap=TRUE. The p y values are clustered using clara.
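Two steps in the description above lend themselves to a small base-R illustration: the x summaries used for x-distribution clustering, and the evaluation of one curve at p equally spaced points with flat extrapolation. The curve, the stratum range of 0 to 1, and p = 5 are assumptions made for this sketch; it is not the curveRep source.

```r
# One hypothetical curve observed at three times
x <- c(0.2, 0.5, 0.8)
y <- c(1, 4, 2)

# x characteristics for a curve with at least three unique x values:
# min(x), max(x), and the longest gap between successive x
x_summary <- c(min = min(x), max = max(x), gap = max(diff(sort(x))))

# Assumed stratum-wide x range and p equally spaced evaluation points
p    <- 5
grid <- seq(0, 1, length.out = p)

# Linear interpolation inside the curve; rule = 2 carries the end
# values outward as flat lines beyond the observed extremes
y_grid <- approx(x, y, xout = grid, rule = 2)$y
rbind(grid, y_grid)
```

The resulting vector of p values per curve is the kind of input that would then be passed to a clustering function such as clara.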

print and plot methods show results. By specifying an auxiliary idcol variable to plot, other variables such as treatment may be depicted to allow the analyst to determine for example whether subjects on different treatments are assigned to different time-response profiles. To write the frequencies of a variable such as treatment in the upper left corner of each panel (instead of the grand total number of clusters in that panel), specify freq.

curveSmooth takes a set of curves and smooths them using lowess. If the number of unique x points in a curve is less than p, the smooth is evaluated at the unique x values. Otherwise it is evaluated at an equally spaced set of x points over the observed range. If fewer than 3 unique x values are in a curve, those points are used and smoothing is not done.
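The per-curve smoothing rule just described can be sketched for a single curve as follows. The function smooth_curve is a hypothetical helper written for this illustration with base R's lowess and approx; it follows the rules stated above but is not the curveSmooth code.

```r
# Hypothetical single-curve smoother following the stated rules
smooth_curve <- function(x, y, p = NULL) {
  ux <- sort(unique(x))
  # Fewer than 3 unique x values: return the points unsmoothed
  if (length(ux) < 3) return(list(x = x, y = y))
  s <- lowess(x, y)
  # Evaluate at the unique x values when p is omitted or too large,
  # otherwise at p equally spaced points over the observed range
  xout <- if (is.null(p) || length(ux) < p) ux
          else seq(min(x), max(x), length.out = p)
  list(x = xout, y = approx(s$x, s$y, xout = xout, ties = mean)$y)
}

set.seed(1)
x <- 1:10
y <- x + rnorm(10)
sm5   <- smooth_curve(x, y, p = 5)        # 5 equally spaced points
small <- smooth_curve(c(1, 2), c(3, 4))   # too few points to smooth
```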

Usage

curveRep(x, y, id, kn = 5, kxdist = 5, k = 5, p = 5,
         force1 = TRUE, metric = c("euclidean", "manhattan"),
         smooth=FALSE, extrap=FALSE, pr=FALSE)

## S3 method for class 'curveRep'
print(x, ...)

## S3 method for class 'curveRep'
plot(x, which=1:length(res),
                        method=c('all','lattice','data'),
                        m=NULL, probs=c(.5, .25, .75), nx=NULL, fill=TRUE,
                        idcol=NULL, freq=NULL, plotfreq=FALSE,
                        xlim=range(x), ylim=range(y),
                        xlab='x', ylab='y', colorfreq=FALSE, ...)

curveSmooth(x, y, id, p=NULL, pr=TRUE)

Arguments

x

a numeric vector, typically measurement times. For plot.curveRep is an object created by curveRep.

y

a numeric vector of response values

id

a vector of curve (subject) identifiers, the same length as x and y

kn

number of curve sample size groups to construct. curveRep tries to divide the data into equal numbers of curves across sample size intervals.

kxdist

maximum number of x-distribution clusters to derive using clara

k

maximum number of x-y profile clusters to derive using clara

p

number of x points at which to interpolate y for profile clustering. For curveSmooth is the number of equally spaced points at which to evaluate the lowess smooth, and if p is omitted the smooth is evaluated at the original x values (which will allow curveRep to still know the x distribution).

force1

By default if any curves have only one point, all curves consisting of one point will be placed in a separate stratum. To prevent this separation, set force1 = FALSE.

metric

see clara

smooth

By default, linear interpolation is used on raw data to obtain y values to cluster to determine x-y profiles. Specify smooth = TRUE to replace observed points with lowess before computing y points on the grid. Also, when smooth is used, it may be desirable to use extrap=TRUE.

extrap

set to TRUE to use linear extrapolation to evaluate y points for x-y clustering. Not recommended unless smoothing has been or is being done.

pr

set to TRUE to print progress notes

which

an integer vector specifying which sample size intervals to plot. Must be specified if method='lattice' and must be a single number in that case.

method

The default makes individual plots of possibly all x-distribution by sample size by cluster combinations. Fewer may be plotted by specifying which. Specify method='lattice' to show a lattice xyplot of a single sample size interval, with x distributions going across and clusters going down. To not plot but instead return a data frame for a single sample size interval, specify method='data'

m

the number of curves in a cluster to randomly sample if there are more than m in a cluster. Default is to draw all curves in a cluster. For method = "lattice" you can specify m = "quantiles" to use the xYplot function to show quantiles of y as a function of x, with the quantiles specified by the probs argument. This cannot be used to draw a group containing n = 1.

nx

applies if m = "quantiles". See xYplot.

probs

3-vector of probabilities with the central quantile first. Default uses quartiles.

fill

for method = "all", by default if a sample size x-distribution stratum did not have enough curves to stratify into k x-y profiles, empty graphs are drawn so that a matrix of graphs will have the next row starting with a different sample size range or x-distribution. See the example below.

idcol

a named vector to be used as a table lookup for color assignments (does not apply when m = "quantiles"). The names of this vector are curve ids and the values are color names or numbers.

freq

a named vector to be used as a table lookup for a grouping variable such as treatment. The names are curve ids and values are any values useful for grouping in a frequency tabulation.

plotfreq

set to TRUE to plot the frequencies from the freq variable as horizontal bars instead of printing them. Applies only to method = "lattice". By default the largest bar is 0.1 times the length of a panel's x-axis. Specify plotfreq = 0.5 for example to make the longest bar half this long.

colorfreq

set to TRUE to color the frequencies printed by plotfreq using the colors provided by idcol.

xlim, ylim, xlab, ylab

plotting parameters. Default ranges are the ranges in the entire set of raw data given to curveRep.

...

arguments passed to other functions.

Details

In the graph titles for the default graphic output, n refers to the minimum sample size, x refers to the sequential x-distribution cluster, and c refers to the sequential x-y profile cluster. Graphs from method = "lattice" are produced by xyplot and in the panel titles distribution refers to the x-distribution stratum and cluster refers to the x-y profile cluster.

Value

a list of class"curveRep" with the following elements

res

a hierarchical list first split by sample size intervals, then by x distribution clusters, then containing a vector of cluster numbers with id values as a names attribute

ns

a table of frequencies of sample sizes per curve after removing NAs

nomit

total number of records excluded due to NAs

missfreq

a table of frequencies of number of NAs excluded per curve

ncuts

cut points for sample size intervals

kn

number of sample size intervals

kxdist

number of clusters on x distribution

k

number of clusters of curves within sample size and distribution groups

p

number of points at which to evaluate each curve for clustering

x
y
id

input data after removing NAs

curveSmooth returns a list with elements x, y, id.

Note

The references describe other methods for deriving representative curves, but those methods were not used here. The last reference, which used a cluster analysis on principal components, motivated curveRep however. The kml package does k-means clustering of longitudinal data with imputation.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

References

Segal M. (1994): Representative curves for longitudinal data via regression trees. J Comp Graph Stat 3:214-233.

Jones MC, Rice JA (1992): Displaying the important features of large collections of similar curves. Am Statistician 46:140-145.

Zheng X, Simpson JA, et al (2005): Data from a study of effectiveness suggested potential prognostic factors related to the patterns of shoulder pain. J Clin Epi 58:823-830.

See Also

clara, dataRep

Examples

## Not run: 
# Simulate 200 curves with per-curve sample sizes ranging from 1 to 10
# Make curves with odd-numbered IDs have an x-distribution that is random
# uniform [0,1] and those with even-numbered IDs have an x-dist. that is
# half as wide but still centered at 0.5.  Shift y values higher with
# increasing IDs
set.seed(1)
N <- 200
nc <- sample(1:10, N, TRUE)
id <- rep(1:N, nc)
x <- y <- id
for(i in 1:N) {
  x[id==i] <- if(i %% 2) runif(nc[i]) else runif(nc[i], c(.25, .75))
  y[id==i] <- i + 10*(x[id==i] - .5) + runif(nc[i], -10, 10)
}
w <- curveRep(x, y, id, kxdist=2, p=10)
w
par(ask=TRUE, mfrow=c(4,5))
plot(w)                # show everything, profiles going across
par(mfrow=c(2,5))
plot(w,1)              # show n=1 results
# Use a color assignment table, assigning low curves to green and
# high to red.  Unique curve (subject) IDs are the names of the vector.
cols <- c(rep('green', N/2), rep('red', N/2))
names(cols) <- as.character(1:N)
plot(w, 3, idcol=cols)
par(ask=FALSE, mfrow=c(1,1))
plot(w, 1, 'lattice')  # show n=1 results
plot(w, 3, 'lattice')  # show n=4-5 results
plot(w, 3, 'lattice', idcol=cols)  # same but different color mapping
plot(w, 3, 'lattice', m=1)  # show a single "representative" curve
# Show median, 10th, and 90th percentiles of supposedly representative curves
plot(w, 3, 'lattice', m='quantiles', probs=c(.5,.1,.9))
# Same plot but with much less grouping of x variable
plot(w, 3, 'lattice', m='quantiles', probs=c(.5,.1,.9), nx=2)
# Use ggplot2 for one sample size interval
z <- plot(w, 2, 'data')
require(ggplot2)
ggplot(z, aes(x, y, color=curve)) + geom_line() +
       facet_grid(distribution ~ cluster) +
       theme(legend.position='none') +
       labs(caption=z$ninterval[1])
# Smooth data before profiling.  This allows later plotting to plot
# smoothed representative curves rather than raw curves (which
# specifying smooth=TRUE to curveRep would do, if curveSmooth was not used)
d <- curveSmooth(x, y, id)
w <- with(d, curveRep(x, y, id))
# Example to show that curveRep can cluster profiles correctly when
# there is no noise.  In the data there are four profiles - flat, flat
# at a higher mean y, linearly increasing then flat, and flat at the
# first height except for a sharp triangular peak
set.seed(1)
x <- 0:100
m <- length(x)
profile <- matrix(NA, nrow=m, ncol=4)
profile[,1] <- rep(0, m)
profile[,2] <- rep(3, m)
profile[,3] <- c(0:3, rep(3, m-4))
profile[,4] <- c(0,1,3,1,rep(0,m-4))
col <- c('black','blue','green','red')
matplot(x, profile, type='l', col=col)
xeval <- seq(0, 100, length.out=5)
s <- x %in% xeval      # rows of profile at the p=5 evaluation points
matplot(x[s], profile[s,], type='l', col=col)
id <- rep(1:100, each=m)
X <- Y <- id
cols <- character(100)
names(cols) <- as.character(1:100)
for(i in 1:100) {
  s <- id==i
  X[s] <- x
  j <- sample(1:4,1)
  Y[s] <- profile[,j]
  cols[i] <- col[j]
}
table(cols)
yl <- c(-1,4)
w <- curveRep(X, Y, id, kn=1, kxdist=1, k=4)
plot(w, 1, 'lattice', idcol=cols, ylim=yl)
# Found 4 clusters but two have same profile
w <- curveRep(X, Y, id, kn=1, kxdist=1, k=3)
plot(w, 1, 'lattice', idcol=cols, freq=cols, plotfreq=TRUE, ylim=yl)
# Incorrectly combined black and red because default value p=5 did
# not result in different profiles at x=xeval
w <- curveRep(X, Y, id, kn=1, kxdist=1, k=4, p=40)
plot(w, 1, 'lattice', idcol=cols, ylim=yl)
# Found correct clusters because evaluated curves at 40 equally
# spaced points and could find the sharp triangular peak in profile 4
## End(Not run)

Cut a Numeric Variable into Intervals

Description

cut2 is a function like cut but left endpoints are inclusive and labels are of the form [lower, upper), except that last interval is [lower,upper]. If cuts are given, will by default make sure that cuts include entire range of x. Also, if cuts are not given, will cut x into quantile groups (g given) or groups with a given minimum number of observations (m). Whereas cut creates a category object, cut2 creates a factor object. m is not guaranteed but is a target.

cutGn guarantees that the grouped variable will have a minimum of m observations in any group. This is done by an exhaustive algorithm that runs fast due to being coded in Fortran.
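The interval convention described above (left endpoints inclusive, last interval closed on both sides) can be reproduced with base R's cut for comparison. This is a sketch of the labeling convention and of quantile grouping only, under made-up data; it does not replicate cut2's formatting options or the cutGn algorithm.

```r
# Left-inclusive intervals with a closed final interval, using base cut()
x <- c(1, 10, 15, 20, 30)
f <- cut(x, breaks = c(1, 10, 20, 30), right = FALSE, include.lowest = TRUE)
levels(f)

# Quantile groups: g = 4 groups from cut points at the quartiles
v <- 1:100
g <- 4
q <- quantile(v, probs = seq(0, 1, length.out = g + 1))
fq <- cut(v, breaks = q, right = FALSE, include.lowest = TRUE)
table(fq)
```

A value equal to an interior cut point (10 or 20 here) falls in the interval that starts at that point, which is the behavior the [lower, upper) labels describe.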

Usage

cut2(x, cuts, m=150, g, levels.mean=FALSE, digits, minmax=TRUE,
oneval=TRUE, onlycuts=FALSE, formatfun=format, ...)

cutGn(x, m, what=c('mean', 'factor', 'summary', 'cuts', 'function'), rcode=FALSE)

Arguments

x

numeric vector to classify into intervals

cuts

cut points

m

desired minimum number of observations in a group. The algorithm does not guarantee that all groups will have at least m observations.

g

number of quantile groups

levels.mean

set to TRUE to make the new categorical vector have levels attribute that is the group means of x instead of interval endpoint labels

digits

number of significant digits to use in constructing levels. Default is 3 (5 if levels.mean=TRUE)

minmax

if cuts is specified but min(x)<min(cuts) or max(x)>max(cuts), augments cuts to include min and max x

oneval

if an interval contains only one unique value, the interval will be labeled with the formatted version of that value instead of the interval endpoints, unless oneval=FALSE

onlycuts

set to TRUE to only return the vector of computed cuts. This consists of the interior values plus outer ranges.

formatfun

formatting function, supports formula notation (if rlang is installed)

...

additional arguments passed to formatfun

what

specifies the kind of vector values to return from cutGn, the default being like 'levels.mean' of cut2. Specify 'summary' to return a numeric 3-column matrix that summarizes the intervals satisfying the m requirement. Use what='cuts' to only return the vector of computed cutpoints. To create a function that will recode the variable in play using the same intervals as computed by cutGn, specify what='function'. This function will have a what argument to allow the user to decide later whether to recode into interval means or into a factor variable.

rcode

set to TRUE to run the cutgn algorithm in R. This is useful for speed comparisons with the default compiled code.

Value

a factor variable with levels of the form [a,b) or formatted means (character strings) unless onlycuts is TRUE in which case a numeric vector is returned

See Also

cut, quantile, combine.levels

Examples

set.seed(1)
x <- runif(1000, 0, 100)
z <- cut2(x, c(10,20,30))
table(z)
table(cut2(x, g=10))      # quantile groups
table(cut2(x, m=50))      # group x into intervals with at least 50 obs.
table(cutGn(x, m=50, what='factor'))
f <- cutGn(x, m=50, what='function')
f
f(c(-1, 2, 10), what='mean')
f(c(-1, 2, 10), what='factor')
## Not run: 
  x <- round(runif(200000), 3)
  system.time(a <- cutGn(x, m=20))              # 0.02s
  system.time(b <- cutGn(x, m=20, rcode=TRUE))  # 1.51s
  identical(a, b)
## End(Not run)

Tips for Creating, Modifying, and Checking Data Frames

Description

This help file contains a template for importing data to create an R data frame, correcting some problems resulting from the import and making the data frame be stored more efficiently, modifying the data frame (including better annotating it and changing the names of some of its variables), and checking and inspecting the data frame for reasonableness of the values of its variables and to describe patterns of missing data. Various built-in functions and functions in the Hmisc library are used. At the end some methods for creating data frames “from scratch” within R are presented.

The examples below attempt to clarify the separation of operations that are done on a data frame as a whole, operations that are done on a small subset of its variables without attaching the whole data frame, and operations that are done on many variables after attaching the data frame in search position one. It also tries to clarify that for analyzing several separate variables using R commands that do not support a data argument, it is helpful to attach the data frame in a search position later than position one.

It is often useful to create, modify, and process datasets in the following order.

  1. Import external data into a data frame (if the raw data do not contain column names, provide these during the import if possible)

  2. Make global changes to a data frame (e.g., changing variable names)

  3. Change attributes or values of variables within a data frame

  4. Do analyses involving the whole data frame (without attaching it)
    (Data frame still in .Data)

  5. Do analyses of individual variables (after attaching the data frame in search position two or later)
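The five-step order above can be walked through on a tiny made-up data frame using base R only. This is a hedged sketch: the values are invented, a plain "units" attribute stands in for Hmisc's units(), and with() is used in step 5 as a substitute for attaching the data frame.

```r
# Step 1: stand-in for an import (values are made up)
FEV <- data.frame(AGE = c(9, 8, 7), FEV = c(1.71, 1.72, 1.72),
                  SEX = c(0, 1, 0), SMOKING = c(0, 0, 1))

# Step 2: global changes to the data frame (variable names)
names(FEV) <- tolower(names(FEV))
names(FEV)[names(FEV) == "smoking"] <- "smoke"

# Step 3: change attributes or values of individual variables
FEV$sex <- factor(FEV$sex, 0:1, c("female", "male"))
attr(FEV$age, "units") <- "years"   # plain attribute, not Hmisc units()

# Step 4: analysis involving the whole data frame, without attaching it
summary(FEV)

# Step 5: analysis of individual variables; with() avoids attach()
mean_fev <- with(FEV, tapply(fev, sex, mean))
mean_fev
```

Using with() keeps the search path unchanged, which sidesteps the question of which search position the data frame occupies.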

Details

The examples below use the FEV dataset from Rosner 1995. Almost any dataset would do. The jcetable data are taken from Galobardes, et al.

Presently, giving a variable the "units" attribute (using the Hmisc units function) only benefits the Hmisc describe function and the rms library's version of the Surv function. Variable labels defined with the Hmisc label function are used by describe, summary.formula, and many of the plotting functions in Hmisc and rms.

References

Alzola CF, Harrell FE (2006): An Introduction to S and the Hmisc and Design Libraries. Chapters 3 and 4, https://hbiostat.org/R/doc/sintro.pdf.

Galobardes, et al. (1998), J Clin Epi 51:875-881.

Rosner B (1995): Fundamentals of Biostatistics, 4th Edition. New York: Duxbury Press.

See Also

scan, read.table, cleanup.import, sas.get, data.frame, attach, detach, describe, datadensity, plot.data.frame, hist.data.frame, naclus, factor, label, units, names, expand.grid, summary.formula, summary.data.frame, casefold, edit, page, Cs, combine.levels, upData

Examples

## Not run: 
# First, we do steps that create or manipulate the data
# frame in its entirety.  For S-Plus, these are done with
# .Data in search position one (the default at the
# start of the session).
#
# -----------------------------------------------------------------------
# Step 1: Create initial draft of data frame
# 
# We usually begin by importing a dataset from
# another application.  ASCII files may be imported
# using the scan and read.table functions.  SAS
# datasets may be imported using the Hmisc sas.get
# function (which will carry more attributes from
# SAS than using File ... Import) from the GUI
# menus.  But for most applications (especially
# Excel), File ... Import will suffice.  If using
# the GUI, it is often best to provide variable
# names during the import process, using the Options
# tab, rather than renaming all fields later.  Of
# course, if the data to be imported already have
# field names (e.g., in Excel), let S use those
# automatically.  If using S-Plus, you can use a
# command to execute File ... Import, e.g.:
import.data(FileName = "/windows/temp/fev.asc",
            FileType = "ASCII", DataFrame = "FEV")
# Here we name the new data frame FEV rather than
# fev, because we wanted to distinguish a variable
# in the data frame named fev from the data frame
# name.  For S-Plus the command will look
# instead like the following:
FEV <- importData("/tmp/fev.asc")
# -----------------------------------------------------------------------
# Step 2: Clean up data frame / make it be more
# efficiently stored
# 
# Unless using sas.get to import your dataset
# (sas.get already stores data efficiently), it is
# usually a good idea to run the data frame through
# the Hmisc cleanup.import function to change
# numeric variables that are always whole numbers to
# be stored as integers, the remaining numerics to
# single precision, strange values from Excel to
# NAs, and character variables that always contain
# legal numeric values to numeric variables.
# cleanup.import typically halves the size of the
# data frame.  If you do not specify any parameters
# to cleanup.import, the function assumes that no
# numeric variable needs more than 7 significant
# digits of precision, so all non-integer-valued
# variables will be converted to single precision.
FEV <- cleanup.import(FEV)
# -----------------------------------------------------------------------
# Step 3: Make global changes to the data frame
# 
# A data frame has attributes that are "external" to
# its variables.  There are the vector of its
# variable names ("names" attribute), the
# observation identifiers ("row.names"), and the
# "class" (whose value is "data.frame").  The
# "names" attribute is the one most commonly in need
# of modification.  If we had wanted to change all
# the variable names to lower case, we could have
# specified lowernames=TRUE to the cleanup.import
# invocation above, or type
names(FEV) <- casefold(names(FEV))
# The upData function can also be used to change
# variable names in two ways (see below).
# To change names in a non-systematic way we use
# other options.  Under Windows/NT the most
# straightforward approach is to change the names
# interactively.  Click on the data frame in the
# left panel of the Object Browser, then in the
# right pane click twice (slowly) on a variable.
# Use the left arrow and other keys to edit the
# name.  Click outside that name field to commit the
# change.  You can also rename columns while in a
# Data Sheet.  To instead use programming commands
# to change names, use something like:
names(FEV)[6] <- 'smoke'   # assumes you know the positions!
names(FEV)[names(FEV)=='smoking'] <- 'smoke'
names(FEV) <- edit(names(FEV))
# The last example is useful if you are changing
# many names.  But none of the interactive
# approaches such as edit() are handy if you will be
# re-importing the dataset after it is updated in
# its original application.  This problem can be
# addressed by saving the new names in a permanent
# vector in .Data:
new.names <- names(FEV)
# Then if the data are re-imported, you can type
names(FEV) <- new.names
# to rename the variables.
# -----------------------------------------------------------------------
# Step 4: Delete unneeded variables
# 
# To delete some of the variables, you can
# right-click on variable names in the Object
# Browser's right pane, then select Delete.  You can
# also set variables to have NULL values, which
# causes the system to delete them.  We don't need
# to delete any variables from FEV but suppose we
# did need to delete some from mydframe.
mydframe$x1 <- NULL
mydframe$x2 <- NULL
mydframe[c('age','sex')] <- NULL   # delete 2 variables
mydframe[Cs(age,sex)]    <- NULL   # same thing
# The last example uses the Hmisc short-cut quoting
# function Cs.  See also the drop parameter to upData.
# -----------------------------------------------------------------------
# Step 5: Make changes to individual variables
#         within the data frame
# 
# After importing data, the resulting variables are
# seldom self-documenting, so we commonly need to
# change or enhance attributes of individual
# variables within the data frame.
# 
# If you are only changing a few variables, it is
# efficient to change them directly without
# attaching the entire data frame.
FEV$sex   <- factor(FEV$sex,   0:1, c('female','male'))
FEV$smoke <- factor(FEV$smoke, 0:1,
                    c('non-current smoker','current smoker'))
units(FEV$age)    <- 'years'
units(FEV$fev)    <- 'L'
label(FEV$fev)    <- 'Forced Expiratory Volume'
units(FEV$height) <- 'inches'
# When changing more than one or two variables it is
# more convenient to change the data frame using the
# Hmisc upData function.
FEV2 <- upData(FEV,
  rename=c(smoking='smoke'),   # omit if renamed above
  drop=c('var1','var2'),
  levels=list(sex  =list(female=0,male=1),
              smoke=list('non-current smoker'=0,
                         'current smoker'=1)),
  units=list(age='years', fev='L', height='inches'),
  labels=list(fev='Forced Expiratory Volume'))
# An alternative to levels=list(...) is for example
# upData(FEV, sex=factor(sex,0:1,c('female','male'))).
# 
# Note that we saved the changed data frame into a
# new data frame FEV2.  If we were confident of the
# correctness of our changes we could have stored
# the new data frame on top of the old one, under
# the original name FEV.
# -----------------------------------------------------------------------
# Step 6:  Check the data frame
# 
# The Hmisc describe function is perhaps the first
# function that should be used on the new data
# frame.  It provides documentation of all the
# variables and the frequency tabulation, counts of
# NAs,  and 5 largest and smallest values are
# helpful in detecting data errors.  
Typing# describe(FEV) will write the results to the# current output window.  To put the results in a# new window that can persist, even upon exiting# S, we use the page function.  The describe# output can be minimized to an icon but kept ready# for guiding later steps of the analysis.page(describe(FEV2), multi=TRUE) # multi=TRUE allows that window to persist while# control is returned to other windows# The new data frame is OK.  Store it on top of the# old FEV and then use the graphical user interface# to delete FEV2 (click on it and hit the Delete# key) or type rm(FEV2) after the next statement.FEV <- FEV2# Next, we can use a variety of other functions to# check and describe all of the variables.  As we# are analyzing all or almost all of the variables,# this is best done without attaching the data# frame.  Note that plot.data.frame plots inverted# CDFs for continuous variables and dot plots# showing frequency distributions of categorical# ones.summary(FEV)# basic summary function (summary.data.frame) plot(FEV)                # plot.data.frame datadensity(FEV)         # rug plots and freq. bar charts for all var.hist.data.frame(FEV)     # for variables having > 2 values by(FEV, FEV$smoke, summary)  # use basic summary function with stratification# -----------------------------------------------------------------------# Step 7:  Do detailed analyses involving individual#          variables# # Analyses based on the formula language can use# data= so attaching the data frame may not be# required.  This saves memory.  
Here we use the# Hmisc summary.formula function to compute 5# statistics on height, stratified separately by age# quartile and by sex.options(width=80) summary(height ~ age + sex, data=FEV,        fun=function(y)c(smean.sd(y),                         smedian.hilow(y,conf.int=.5)))# This computes mean height, S.D., median, outer quartilesfit <- lm(height ~ age*sex, data=FEV) summary(fit)# For this analysis we could also have attached the# data frame in search position 2.  For other# analyses, it is mandatory to attach the data frame# unless FEV$ prefixes each variable name.# Important: DO NOT USE attach(FEV, 1) or# attach(FEV, pos=1, \dots) if you are only analyzing# and not changing the variables, unless you really# need to avoid conflicts with variables in search# position 1 that have the same names as the# variables in FEV.  Attaching into search position# 1 will cause S-Plus to be more of a memory hog.attach(FEV)# Use e.g. attach(FEV[,Cs(age,sex)]) if you only# want to analyze a small subset of the variables# Use e.g. attach(FEV[FEV$sex=='male',]) to# analyze a subset of the observationssummary(height ~ age + sex,        fun=function(y)c(smean.sd(y),          smedian.hilow(y,conf.int=.5)))fit <- lm(height ~ age*sex)# Run generic summary function on height and fev, # stratified by sexby(data.frame(height,fev), sex, summary)# Cross-classify into 4 sex x smoke groupsby(FEV, list(sex,smoke), summary)# Plot 5 quantiless <- summary(fev ~ age + sex + height,              fun=function(y)quantile(y,c(.1,.25,.5,.75,.9)))plot(s, which=1:5, pch=c(1,2,15,2,1), #pch=c('=','[','o',']','='),      main='A Discovery', xlab='FEV')# Use the nonparametric bootstrap to compute a # 0.95 confidence interval for the population mean fevsmean.cl.boot(fev)    # in Hmisc# Use the Statistics \dots Compare Samples \dots One Sample # keys to get a normal-theory-based C.I.  Then do it # more manually.  
The following method assumes that # there are no NAs in fevsd <- sqrt(var(fev))xbar <- mean(fev)xbarsdn <- length(fev)qt(.975,n-1)     # prints 0.975 critical value of t dist. with n-1 d.f.xbar + c(-1,1)*sd/sqrt(n)*qt(.975,n-1)   # prints confidence limits# Fit a linear model# fit <- lm(fev ~ other variables \dots)detach()# The last command is only needed if you want to# start operating on another data frame and you want# to get FEV out of the way.# -----------------------------------------------------------------------# Creating data frames from scratch# # Data frames can be created from within S.  To# create a small data frame containing ordinary# data, you can use something likedframe <- data.frame(age=c(10,20,30),                      sex=c('male','female','male'),                     stringsAsFactors=TRUE)# You can also create a data frame using the Data# Sheet.  Create an empty data frame with the# correct variable names and types, then edit in the# data.dd <- data.frame(age=numeric(0),sex=character(0),                 stringsAsFactors=TRUE)# The sex variable will be stored as a factor, and# levels will be automatically added to it as you# define new values for sex in the Data Sheet's sex# column.# # When the data frame you need to create is defined# by systematically varying variables (e.g., all# possible combinations of values of each variable),# the expand.grid function is useful for quickly# creating the data.  Then you can add# non-systematically-varying variables to the object# created by expand.grid, using programming# statements or editing the Data Sheet.  
This# process is useful for creating a data frame# representing all the values in a printed table.# In what follows we create a data frame# representing the combinations of values from an 8# x 2 x 2 x 2 (event x method x sex x what) table,# and add a non-systematic variable percent to the# data.jcetable <- expand.grid( event=c('Wheezing at any time',         'Wheezing and breathless',         'Wheezing without a cold',         'Waking with tightness in the chest',         'Waking with shortness of breath',         'Waking with an attack of cough',         'Attack of asthma',         'Use of medication'), method=c('Mail','Telephone'),  sex=c('Male','Female'), what=c('Sensitivity','Specificity'))jcetable$percent <- c(756,618,706,422,356,578,289,333,  576,421,789,273,273,212,212,212,  613,763,713,403,377,541,290,226,  613,684,632,290,387,613,258,129,  656,597,438,780,732,679,938,919,  714,600,494,877,850,703,963,987,  755,420,480,794,779,647,956,941,  766,423,500,833,833,604,955,986) / 10# In jcetable, event varies most rapidly, then# method, then sex, and what.## End(Not run)

Representativeness of Observations in a Data Set

Description

These functions are intended to be used to describe how well a given set of new observations (e.g., new subjects) was represented in a dataset used to develop a predictive model. The dataRep function forms a data frame that contains all the unique combinations of variable values that existed in a given set of variable values. Cross-classifications of values are created using exact values of variables, so for continuous numeric variables it is often necessary to round them to the nearest v and to possibly curtail the values to some lower and upper limit before rounding. Here v denotes a numeric constant specifying the matching tolerance that will be used. dataRep also stores marginal distribution summaries for all the variables. For numeric variables, all 101 percentiles are stored, and for all variables, the frequency distributions are also stored (frequencies are computed after any rounding and curtailment of numeric variables). For the purposes of rounding and curtailing, the roundN function is provided. A print method will summarize the calculations made by dataRep, and if long=TRUE all unique combinations of values and their frequencies in the original dataset are printed.

The predict method for dataRep takes a new data frame having variables named the same as the original ones (but whose factor levels are not necessarily in the same order) and examines the collapsed cross-classifications created by dataRep to find how many observations were similar to each of the new observations after any rounding or curtailment of limits is done. predict also does some calculations to describe how the variable values of the new observations "stack up" against the marginal distributions of the original data. For categorical variables, the percent of observations having a given variable with the value of the new observation (after rounding for variables that were passed through roundN in the formula given to dataRep) is computed. For numeric variables, the percentile of the original distribution in which the current value falls will be computed. For this purpose, the data are not rounded because the 101 original percentiles were retained; linear interpolation is used to estimate percentiles for values between two tabulated percentiles. The lowest marginal frequency of matching values across all variables is also computed. For example, if an age, sex combination matches 10 subjects in the original dataset but the age value matches 100 ages (after rounding) and the sex value matches the sex code of 300 observations, the lowest marginal frequency is 100, which is a "best case" upper limit for multivariable matching. I.e., matching on all variables cannot result in a frequency higher than this amount. A print method for the output of predict.dataRep prints all calculations done by predict by default. Calculations can be selectively suppressed.
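
As a rough illustration of the rounding and curtailment described above, the following base R sketch clips values to a range and then rounds them to the nearest multiple of a tolerance. The helper name round_to_tol is hypothetical and is not the Hmisc implementation of roundN; it only mirrors the documented behavior.

```r
# Hypothetical helper (not part of Hmisc): curtail x to [clip[1], clip[2]],
# then round to the nearest multiple of tol.
round_to_tol <- function(x, tol = 1, clip = NULL) {
  if (length(clip) == 2) x <- pmin(pmax(x, clip[1]), clip[2])
  tol * round(x / tol)
}

x  <- c(0.03, 0.5, 1.5)
r1 <- round_to_tol(x, tol = 0.25)                  # round to nearest 0.25
r2 <- round_to_tol(x, tol = 0.25, clip = c(0, 1))  # curtail to [0, 1] first
```

With tol = 0.25, two values are treated as matching when they round to the same multiple of 0.25, which corresponds to a tolerance of about tol/2 on either side.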

Usage

dataRep(formula, data, subset, na.action)

roundN(x, tol=1, clip=NULL)

## S3 method for class 'dataRep'
print(x, long=FALSE, ...)

## S3 method for class 'dataRep'
predict(object, newdata, ...)

## S3 method for class 'predict.dataRep'
print(x, prdata=TRUE, prpct=TRUE, ...)

Arguments

formula

a formula with no left-hand-side. Continuous numeric variables in need of rounding should appear in the formula as e.g. roundN(x,5) to have a tolerance of e.g. +/- 2.5 in matching. Factor or character variables as well as numeric ones not passed through roundN are matched on exactly.

x

a numeric vector or an object created by dataRep

object

the object created by dataRep or predict.dataRep

data, subset, na.action

standard modeling arguments. Default na.action is na.delete, i.e., observations in the original dataset having any variables missing are deleted up front.

tol

rounding constant (tolerance is actually tol/2 as values are rounded to the nearest tol)

clip

a 2-vector specifying a lower and upper limit to curtail values of x before rounding

long

set to TRUE to see all unique combinations and frequency count

newdata

a data frame containing all the variables given to dataRep but not necessarily in the same order or having factor levels in the same order

prdata

set to FALSE to suppress printing newdata and the count of matching observations (plus the worst-case marginal frequency).

prpct

set to FALSE to not print percentiles and percents

...

unused

Value

dataRep returns a list of class "dataRep" containing the collapsed data frame and frequency counts along with marginal distribution information. predict returns an object of class "predict.dataRep" containing information determined by matching observations in newdata with the original (collapsed) data.

Side Effects

print.dataRep prints.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com

See Also

round, table

Examples

set.seed(13)
num.symptoms <- sample(1:4, 1000,TRUE)
sex <- factor(sample(c('female','male'), 1000,TRUE))
x    <- runif(1000)
x[1] <- NA
table(num.symptoms, sex, .25*round(x/.25))

d <- dataRep(~ num.symptoms + sex + roundN(x,.25))
print(d, long=TRUE)

predict(d, data.frame(num.symptoms=1:3, sex=c('male','male','female'),
                      x=c(.03,.5,1.5)))

Design Effect and Intra-cluster Correlation

Description

Computes the Kish design effect and corresponding intra-cluster correlation for a single cluster-sampled variable

Usage

deff(y, cluster)

Arguments

y

variable to analyze

cluster

a variable whose unique values indicate cluster membership. Any type of variable is allowed.

Value

a vector with named elements n (total number of non-missing observations), clusters (number of clusters after deleting missing data), rho (intra-cluster correlation), and deff (design effect).
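
For orientation, the textbook relationship between a design effect and an intra-cluster correlation for equal-sized clusters of size m is deff = 1 + (m - 1) * rho. The sketch below uses only that simplified relationship. The function name kish_deff is hypothetical, and the actual deff function estimates rho from the data and may handle unequal cluster sizes differently.

```r
# Hypothetical helper (not part of Hmisc): design effect for equal-sized
# clusters of size m, given an intra-cluster correlation rho.
kish_deff <- function(rho, m) 1 + (m - 1) * rho

d1 <- kish_deff(rho = 0.05, m = 21)  # 1 + 20 * 0.05
d0 <- kish_deff(rho = 0,    m = 21)  # no clustering effect when rho is 0
```

A design effect of 2 would mean that the variance of the estimate is about twice what a simple random sample of the same size would give.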

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

See Also

bootcov, robcov

Examples

set.seed(1)
blood.pressure <- rnorm(1000, 120, 15)
clinic <- sample(letters, 1000, replace=TRUE)
deff(blood.pressure, clinic)

Concise Statistical Description of a Vector, Matrix, Data Frame, or Formula

Description

describe is a generic method that invokes describe.data.frame, describe.matrix, describe.vector, or describe.formula. describe.vector is the basic function for handling a single variable. This function determines whether the variable is character, factor, category, binary, discrete numeric, or continuous numeric, and prints a concise statistical summary according to each. A numeric variable is deemed discrete if it has <= 10 distinct values. In this case, quantiles are not printed. A frequency table is printed for any non-binary variable if it has no more than 20 distinct values. For any variable for which the frequency table is not printed, the 5 lowest and highest values are printed. This behavior can be overridden for long character variables with many levels using the listunique parameter, to get a complete tabulation.

describe is especially useful for describing data frames created by *.get, as labels, formats, value labels, and (in the case of sas.get) frequencies of special missing values are printed.

For a binary variable, the sum (number of 1's) and mean (proportion of 1's) are printed. If the first argument is a formula, a model frame is created and passed to describe.data.frame. If a variable is of class "impute", a count of the number of imputed values is printed. If a date variable has an attribute partial.date (this is set up by sas.get), counts of how many partial dates are actually present (missing month, missing day, missing both) are also presented. If a variable was created by the special-purpose function substi (which substitutes values of a second variable if the first variable is NA), the frequency table of substitutions is also printed.

For numeric variables, describe adds an item called Info which is a relative information measure using the relative efficiency of a proportional odds/Wilcoxon test on the variable relative to the same test on a variable that has no ties. Info is related to how continuous the variable is, and ties are less harmful the more untied values there are. The formula for Info is one minus the sum of the cubes of relative frequencies of values divided by one minus the square of the reciprocal of the sample size. The lowest information comes from a variable having only one distinct value, followed by a highly skewed binary variable. Info is reported to two decimal places.
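
The Info formula stated above can be written out directly in base R. This is a minimal sketch that follows the verbal definition only; the function name info_measure is hypothetical and the code is not taken from the Hmisc source.

```r
# Hypothetical helper: (1 - sum of cubed relative frequencies) / (1 - 1/n^2)
info_measure <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  p <- as.vector(table(x)) / n   # relative frequency of each distinct value
  (1 - sum(p^3)) / (1 - 1 / n^2)
}

info_all_distinct <- info_measure(1:4)            # no ties
info_tied         <- info_measure(c(1, 1, 2, 2))  # two values, heavily tied
info_constant     <- info_measure(rep(7, 4))      # a single distinct value
```

In this sketch a variable with no ties gets the maximum value of 1, ties pull the value down, and a constant variable gets 0, which is consistent with the ordering described above.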

A latex method exists for converting the describe object to a LaTeX file. For numeric variables having more than 20 distinct values, describe saves in its returned object the frequencies of 100 evenly spaced bins running from minimum observed value to the maximum. When there are 20 or fewer distinct values, the original values are maintained. latex and html insert a spike histogram displaying these frequency counts in the tabular material using the LaTeX picture environment. For example output see https://hbiostat.org/doc/rms/book/chapter7edition1.pdf. Note that the latex method assumes you have the following styles installed in your latex installation: setspace and relsize.
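
To make the binning idea concrete, here is a small base R sketch that counts observations in 100 equal-width bins spanning the observed range. The helper name bin_counts and the exact bin-edge handling are assumptions for illustration; describe may treat bin boundaries and tails differently.

```r
# Hypothetical helper (not Hmisc code): count observations in nbins
# equal-width bins running from min(x) to max(x).
# Assumes x has at least two distinct non-missing values.
bin_counts <- function(x, nbins = 100) {
  x  <- x[!is.na(x)]
  lo <- min(x)
  hi <- max(x)
  # Bin index in 1..nbins; the maximum value is placed in the last bin
  idx <- pmin(floor((x - lo) / (hi - lo) * nbins) + 1, nbins)
  list(range = c(lo, hi), count = tabulate(idx, nbins = nbins))
}

set.seed(1)
x <- rnorm(500)
b <- bin_counts(x)
```

The returned list mimics the range and count layout described for the intervalFreq component, and b$count can be drawn as a spike histogram with, for example, plot(b$count, type = 'h').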

The html method mimics the LaTeX output. This is useful in the context of Quarto/Rmarkdown html and html notebook output. If options(prType='html') is in effect, calling print on an object that is the result of running describe on a data frame will result in rendering the HTML version. If run from the console a browser window will open. When which is specified to print, whether or not prType='html' is in effect, a gt package html table will be produced containing only the types of variables requested. When which='both' a list with element names Continuous and Categorical is produced, making it convenient for the user to print as desired, or to pass the list directly to the qreport maketabs function when using Quarto.

The plot method is for describe objects run on data frames. It produces spike histograms for a graphic of continuous variables and a dot chart for categorical variables, showing category proportions. The graphic format is ggplot2 if the user has not set options(grType='plotly') or has set the grType option to something other than 'plotly'. Otherwise plotly graphics that are interactive are produced, and these can be placed into an Rmarkdown html notebook. The user must install the plotly package for this to work. When the user hovers the mouse over a bin for a raw data value, the actual value will pop up (formatted using digits). When the user hovers over the minimum data value, most of the information calculated by describe will pop up. For each variable, the number of missing values is used to assign the color to the histogram or dot chart, and a legend is drawn. Color is not used if there are no missing values in any variable. For categorical variables, hovering over the leftmost point for a variable displays details, and for all points proportions, numerators, and denominators are displayed in the popup. If both continuous and categorical variables are present and which='both' is specified, the plot method returns an unclassed list containing two objects, named 'Categorical' and 'Continuous', in that order.

Sample weights may be specified to any of the functions, resulting in weighted means, quantiles, and frequency tables.

Note: As discussed in Cox and Longton (2008), Stata Technical Bulletin 8(4) pp. 557, the term "unique" has been replaced with "distinct" in the output (but not in parameter names).

When weights are not used, the pseudomedian and Gini's mean difference are computed for numeric variables. The pseudomedian is labeled pMedian and is the median of all possible pairwise averages. It is a robust and efficient measure of location that equals the mean and median for symmetric distributions. It is also called the Hodges-Lehmann one-sample estimator. Gini's mean difference is a robust measure of dispersion that is the mean absolute difference between any pairs of observations. In simple output Gini's difference is labeled Gmd.
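
Both statistics can be sketched with brute-force base R code for small samples. The helper names below are hypothetical, and the sketch assumes that each observation is also paired with itself when forming the pairwise averages (the usual Walsh-average convention). The Hmisc pMedian and GiniMd functions listed under See Also are the ones to use in practice.

```r
# Hypothetical helpers (not Hmisc code), O(n^2) in memory; small n only.

# Median of all pairwise averages (x[i] + x[j]) / 2 with i <= j
pseudomedian <- function(x) {
  x <- x[!is.na(x)]
  w <- outer(x, x, "+") / 2
  median(w[lower.tri(w, diag = TRUE)])
}

# Mean absolute difference over all pairs of distinct observations
gini_md <- function(x) {
  x <- x[!is.na(x)]
  d <- abs(outer(x, x, "-"))
  mean(d[lower.tri(d)])
}

x   <- c(1, 2, 4)
pm  <- pseudomedian(x)  # averages: 1, 1.5, 2, 2.5, 3, 4
gmd <- gini_md(x)       # differences: 1, 3, 2
```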

formatdescribeSingle is a service function for latex, html, and print methods for single variables that is not intended to be called by the user.

Usage

## S3 method for class 'vector'
describe(x, descript, exclude.missing=TRUE, digits=4,
         listunique=0, listnchar=12,
         weights=NULL, normwt=FALSE, minlength=NULL, shortmChoice=TRUE,
         rmhtml=FALSE, trans=NULL, lumptails=0.01, ...)

## S3 method for class 'matrix'
describe(x, descript, exclude.missing=TRUE, digits=4, ...)

## S3 method for class 'data.frame'
describe(x, descript, exclude.missing=TRUE,
    digits=4, trans=NULL, ...)

## S3 method for class 'formula'
describe(x, descript, data, subset, na.action,
    digits=4, weights, ...)

## S3 method for class 'describe'
print(x, which = c('both', 'categorical', 'continuous'), ...)

## S3 method for class 'describe'
latex(object, title=NULL,
      file=paste('describe',first.word(expr=attr(object,'descript')),'tex',sep='.'),
      append=FALSE, size='small', tabular=TRUE, greek=TRUE,
      spacing=0.7, lspace=c(0,0), ...)

## S3 method for class 'describe.single'
latex(object, title=NULL, vname,
      file, append=FALSE, size='small', tabular=TRUE, greek=TRUE,
      lspace=c(0,0), ...)

## S3 method for class 'describe'
html(object, size=85, tabular=TRUE,
      greek=TRUE, scroll=FALSE, rows=25, cols=100, ...)

## S3 method for class 'describe.single'
html(object, size=85,
      tabular=TRUE, greek=TRUE, ...)

formatdescribeSingle(x, condense=c('extremes', 'frequencies', 'both', 'none'),
           lang=c('plain', 'latex', 'html'), verb=0, lspace=c(0, 0),
           size=85, ...)

## S3 method for class 'describe'
plot(x, which=c('both', 'continuous', 'categorical'),
     what=NULL,
     sort=c('ascending', 'descending', 'none'),
     n.unique=10, digits=5, bvspace=2, ...)

Arguments

x

a data frame, matrix, vector, or formula. For a data frame, the describe.data.frame function is automatically invoked. For a matrix, describe.matrix is called. For a formula, describe.data.frame(model.frame(x)) is invoked. The formula may or may not have a response variable. For print, latex, html, or formatdescribeSingle, x is an object created by describe.

descript

optional title to print for x. The default is the name of the argument or the "label" attributes of individual variables. When the first argument is a formula, descript defaults to a character representation of the formula.

exclude.missing

set to TRUE to print the names of variables that contain only missing values. This list appears at the bottom of the printout, and no space is taken up for such variables in the main listing.

digits

number of significant digits to print. For plot.describe, digits is the number of significant digits to put in hover text for plotly when showing raw variable values.

listunique

For a character variable that is not an mChoice variable, that has its longest string length greater than listnchar, and that has no more than listunique distinct values, all values are listed in alphabetic order. Any value having more than one occurrence has the frequency of occurrence included. Specify listunique equal to some value at least as large as the number of observations to ensure that all character variables will have all their values listed. For purposes of tabulating character strings, multiple white spaces of any kind are translated to a single space, leading and trailing white space are ignored, and case is ignored.

listnchar

see listunique

weights

a numeric vector of frequencies or sample weights. Each observation will be treated as if it were sampled weights times.

minlength

value passed to summary.mChoice

shortmChoice

set to FALSE to have summary of mChoice variables use actual levels everywhere, instead of abbreviating to integers and printing of all original labels at the top

rmhtml

set to TRUE to strip html from variable labels

trans

for describe.vector is a list specifying how to transform x for constructing the frequency distribution used in spike histograms. The first element of the list is a character string describing the transformation, the second is the transformation function, and the third element is the inverse of this function that is used in labeling points on the original scale, e.g. trans=list('log', log, exp). For describe.data.frame trans is a list of such lists, with the name of each list being the name of the variable to which the transformation applies. See https://hbiostat.org/rmsc/impred.html#data for an example.

lumptails

specifies the quantile to use (its complement is also used) for grouping observations in the tails so that outliers have less chance of distorting the variable's range for sparkline spike histograms. The default is 0.01, i.e., observations below the 0.01 quantile are grouped together in the leftmost bin, and observations above the 0.99 quantile are grouped to form the last bin.

normwt

The default, normwt=FALSE, results in the use of weights as weights in computing various statistics. In this case the sample size is assumed to be equal to the sum of weights. Specify normwt=TRUE to divide weights by a constant so that weights sum to the number of observations (length of vectors specified to describe). In this case the number of observations is taken to be the actual number of records given to describe.

object

a result of describe

title

unused

data

a data frame, data table, or list

subset

a subsetting expression

na.action

These are used if a formula is specified. na.action defaults to na.retain which does not delete any NAs from the data frame. Use na.action=na.omit or na.delete to drop any observation with any NA before processing.

...

arguments passed to describe.default which are passed to calls to format for numeric variables. For example if using R POSIXct or Date date/time formats, specifying describe(d,format='%d%b%Y') will print date/time variables as "01Jan2000". This is useful for omitting the time component. See the help file for format.POSIXct or format.Date for more information. For plot methods, ... is ignored. For html and latex methods, ... is used to pass optional arguments to formatdescribeSingle, especially the condense argument. For the print method when which= is given, possible arguments to use for tabulating continuous variable output are sparkwidth (the width of the spike histogram sparkline in pixels, defaulting to 200), qcondense (set to FALSE to devote separate columns to all quantiles), extremes (set to TRUE to print the 5 lowest and highest values in the table of continuous variables). For categorical variable output, the argument freq can be used to specify how frequency tables are rendered: 'chart' (the default; an interactive sparkline frequency bar chart) or freq='table' for small tables. sort is another argument passed to html_describe_cat. For sparkline frequency charts the default is to sort non-numeric categories in descending order of frequency. Set code=FALSE to use the original data order. The w argument also applies to categorical variable output.

file

name of output file (should have a suffix of .tex). Default name is formed from the first word of the descript element of the describe object, prefixed by "describe". Set file="" to send LaTeX code to standard output instead of a file.

append

set to TRUE to have latex append text to an existing file named file

size

LaTeX text size ("small", the default, or "normalsize", "tiny", "scriptsize", etc.) for the describe output in LaTeX. For html, size is the percent of the prevailing font size to use for the output.

tabular

set to FALSE to use verbatim rather than tabular (or html table) environment for the summary statistics output. By default, tabular is used if the output is not too wide.

greek

By default, the latex and html methods will change names of greek letters that appear in variable labels to appropriate LaTeX symbols in math mode, or html symbols, unless greek=FALSE.

spacing

By default, the latex method for describe run on a matrix or data frame uses the setspace LaTeX package with a line spacing of 0.7 so as to not waste space. Specify spacing=0 to suppress the use of the setspace's spacing environment, or specify another positive value to use this environment with a different spacing.

lspace

extra vertical space, in character size units (i.e., "ex" as appended to the space). When using certain font sizes, there is too much space left around LaTeX verbatim environments. This two-vector specifies space to remove (i.e., the values are negated in forming the vspace command) before (first element) and after (second element of lspace) verbatims

scroll

set to TRUE to create an html scrollable box for the html output

rows, cols

the number of rows or columns to allocate for the scrollable box

vname

unused argument in latex.describe.single

which

specifies whether to plot numeric continuous or binary/categorical variables, or both. When "both" a list with two elements is created. Each element is a ggplot2 or plotly object. If there are no variables of a given type, a single ggplot2 or plotly object is returned, ready to print. For print.describe, which may be "categorical" or "continuous", causing a gt table to be created with the categorical or continuous variable describe results.

what

character or numeric vector specifying which variables to plot; default is to plot all

sort

specifies how and whether variables are sorted in order of the proportion of positives when which="categorical". Specify sort="none" to leave variables in the order they appear in the original data.

n.unique

the minimum number of distinct values a numeric variable must have before plot.describe uses it in a continuous variable plot

bvspace

the between-variable spacing for categorical variables. Defaults to 2, meaning twice the amount of vertical space as what is used for between-category spacing within a variable

condense

specifies whether to condense the output with regard to the 5 lowest and highest values ("extremes") and the frequency table

lang

specifies the markup language

verb

set to 1 if a verbatim environment is already in effect for LaTeX
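
The effect of normwt=TRUE described in the arguments above can be illustrated with a few lines of base R. This is only an arithmetic sketch of the documented rescaling, with made-up weights; it is not the code used inside describe.

```r
# Illustrative weights (made up for this sketch)
w <- c(2, 1, 1, 4)
n <- length(w)

# Rescale so that the weights sum to the number of observations
w_norm <- w * n / sum(w)

n_eff_default <- sum(w)       # sample size assumed when normwt=FALSE
n_eff_normwt  <- sum(w_norm)  # sample size assumed when normwt=TRUE
```

Relative weighting between observations is unchanged by the rescaling; only the implied sample size differs (8 versus 4 in this sketch).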

Details

Ifoptions(na.detail.response=TRUE)has been set andna.action is"na.delete" or"na.keep", summary statistics onthe response variable are printed separately for missing and non-missingvalues of each predictor. The default summary function returnsthe number of non-missing response values and the mean of the lastcolumn of the response values, with anames attribute ofc("N","Mean"). When the response is aSurv object and the mean is used, this willresult in the crude proportion of events being used to summarizethe response. The actual summary function can be designated throughoptions(na.fun.response = "function name").

If you are modifying LaTeX parskip or certain other parameters, you may need to shrink the area around tabular and verbatim environments produced by latex.describe. You can do this using for example \usepackage{etoolbox}\makeatletter\preto{\@verbatim}{\topsep=-1.4pt\partopsep=0pt}\preto{\@tabular}{\parskip=2pt\parsep=0pt}\makeatother in the LaTeX preamble.

Multiple choice (mChoice) variables' describe output renders well in html but not when included in a Quarto document.

Value

a list containing elements descript, counts, values. The list is of class describe. If the input object was a matrix or a data frame, the list is a list of lists, one list for each variable analyzed. latex returns a standard latex object. For numeric variables having at least 20 distinct values, an additional component intervalFreq is included. This component is a list with two elements, range (containing two values) and count, a vector of 100 integer frequency counts. print with which= returns a 'gt' table object. The user can modify the table by piping formatting changes, column removals, and other operations, before final rendering.
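To make the shape of the intervalFreq component concrete, here is a minimal base R sketch that builds a list with a two-value range and 100 integer counts. The equal-width binning rule and the name intervalFreqLike are assumptions made for illustration; the binning describe actually uses is not stated here.

```r
# Sketch only: equal-width binning is an assumption, not Hmisc's documented rule
set.seed(1)
x   <- rnorm(500)
rng <- range(x)

# Map each value to one of 100 equal-width intervals spanning the range;
# the maximum is folded into the last interval
bin   <- pmin(100, 1 + floor(100 * (x - rng[1]) / diff(rng)))
count <- tabulate(bin, nbins = 100)

intervalFreqLike <- list(range = rng, count = count)
str(intervalFreqLike)
```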

Author(s)

Frank Harrell
Vanderbilt University
fh@fharrell.com

See Also

spikecomp, sas.get, quantile, GiniMd, pMedian, table, summary, model.frame.default, naprint, lapply, tapply, Surv, na.delete, na.keep, na.detail.response, latex

Examples

set.seed(1)
describe(runif(200),dig=2)    #single variable, continuous
                              #get quantiles .05,.10,...

dfr <- data.frame(x=rnorm(400),y=sample(c('male','female'),400,TRUE))
describe(dfr)

## Not run: 
options(grType='plotly')
d <- describe(mydata)
p <- plot(d)   # create plots for both types of variables
p[[1]]; p[[2]] # or p$Categorical; p$Continuous
plotly::subplot(p[[1]], p[[2]], nrows=2)  # plot both in one
plot(d, which='categorical')    # categorical ones

d <- sas.get(".","mydata",special.miss=TRUE,recode=TRUE)
describe(d)      #describe entire data frame
attach(d, 1)
describe(relig)  #Has special missing values .D .F .M .R .T
                 #attr(relig,"label") is "Religious preference"

#relig : Religious preference  Format:relig
#    n missing  D  F M R T distinct 
# 4038     263 45 33 7 2 1        8
#
#0:none (251, 6%), 1:Jewish (372, 9%), 2:Catholic (1230, 30%) 
#3:Jehovah's Witnes (25, 1%), 4:Christ Scientist (7, 0%) 
#5:Seventh Day Adv (17, 0%), 6:Protestant (2025, 50%), 7:other (111, 3%) 

# Method for describing part of a data frame:
 describe(death.time ~ age*sex + rcs(blood.pressure))
 describe(~ age+sex)
 describe(~ age+sex, weights=freqs)  # weighted analysis

 fit <- lrm(y ~ age*sex + log(height))
 describe(formula(fit))
 describe(y ~ age*sex, na.action=na.delete)   # report on number deleted for each variable
 options(na.detail.response=TRUE)  # keep missings separately for each x, report on dist of y by x=NA
 describe(y ~ age*sex)
 options(na.fun.response="quantile")
 describe(y ~ age*sex)   # same but use quantiles of y by x=NA

 d <- describe(my.data.frame)
 d$age                   # print description for just age
 d[c('age','sex')]       # print description for two variables
 d[sort(names(d))]       # print in alphabetic order by var. names
 d2 <- d[20:30]          # keep variables 20-30
 page(d2)                # pop-up window for these variables

# Test date/time formats and suppression of times when they don't vary
 library(chron)
 d <- data.frame(a=chron((1:20)+.1),
                 b=chron((1:20)+(1:20)/100),
                 d=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
                               hour=rep(11,20),min=rep(17,20),sec=rep(11,20)),
                 f=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
                               hour=1:20,min=1:20,sec=1:20),
                 g=ISOdate(year=2001:2020,month=rep(3,20),day=1:20))
 describe(d)

# Make a function to run describe, latex.describe, and use the kdvi
# previewer in Linux to view the result and easily make a pdf file
 ldesc <- function(data) {
  options(xdvicmd='kdvi')
  d <- describe(data, desc=deparse(substitute(data)))
  dvi(latex(d, file='/tmp/z.tex'), nomargins=FALSE, width=8.5, height=11)
 }

 ldesc(d)

## End(Not run)

Discrete Vector tools

Description

discrete creates a discrete vector which is distinct from a continuous vector, or a factor/ordered vector. The other functions are tools for manipulating discrete vectors.

Usage

as.discrete(x, ...)
## Default S3 method:
as.discrete(x, ...)
discrete(x, levels = sort(unique.default(x), na.last = TRUE), exclude = NA)
## S3 replacement method for class 'discrete'
x[...] <- value
## S3 method for class 'discrete'
x[..., drop = FALSE]
## S3 method for class 'discrete'
x[[i]]
is.discrete(x)
## S3 replacement method for class 'discrete'
is.na(x) <- value
## S3 replacement method for class 'discrete'
length(x) <- value

Arguments

x

a vector

drop

Should unused levels be dropped.

exclude

logical: should NA be excluded.

i

indexing vector

levels

character: list of individual level values

value

index of elements to set to NA

...

arguments to be passed to other functions

Details

as.discrete converts a vector into a discrete vector.

discrete creates a discrete vector from provided values.

is.discrete tests to see if the vector is a discrete vector.

Value

as.discrete and discrete return a vector of discrete type.

is.discrete returns logical TRUE if the vector is of class discrete; otherwise it returns FALSE.

Author(s)

Charles Dupont

See Also

[[, [, factor

Examples

a <- discrete(1:25)
a
is.discrete(a)
b <- as.discrete(2:4)
b

Enhanced Dot Chart

Description

dotchart2 is an enhanced version of the dotchart function with several new options.

Usage

dotchart2(data, labels, groups=NULL, gdata=NA, horizontal=TRUE, pch=16,
          xlab='', ylab='', xlim=NULL, auxdata, auxgdata=NULL, auxtitle,
          lty=1, lines=TRUE, dotsize = .8,
          cex = par("cex"), cex.labels = cex,
          cex.group.labels = cex.labels*1.25, sort.=TRUE,
          add=FALSE, dotfont=par('font'), groupfont=2,
          reset.par=add, xaxis=TRUE, width.factor=1.1,
          lcolor='gray', leavepar=FALSE,
          axisat=NULL, axislabels=NULL, ...)

Arguments

data

a numeric vector whose values are shown on the x-axis

labels

a vector of labels for each point, corresponding to data. If omitted, names(data) are used, and if there are no names, integers prefixed by "#" are used.

groups

an optional categorical variable indicating how data values are grouped

gdata

data values for groups, typically summaries such as group medians

horizontal

set to FALSE to make the chart vertical instead of the default

pch

default character number or value for plotting dots in dot charts. The default is 16.

xlab

x-axis title

ylab

y-axis title

xlim

x-axis limits. Applies only to horizontal=TRUE.

auxdata

a vector of auxiliary data given to dotchart2, of the same length as the first (data) argument. If present, this vector of values will be printed outside the right margin of the dot chart. Usually auxdata represents cell sizes.

auxgdata

similar to auxdata but corresponding to the gdata argument. These usually represent overall sample sizes for each group of lines.

auxtitle

if auxdata is given, auxtitle specifies a column heading for the extra printed data in the chart, e.g., "N"

lty

line type for horizontal lines. Default is 1 for R, 2 for S-Plus

lines

set to FALSE to suppress drawing of reference lines

dotsize

cex value for drawing dots. Default is 0.8. Note that the original dotchart function used a default of 1.2.

cex

see par

cex.labels

cex parameter that applies only to the line labels for the dot chart. Defaults to cex.

cex.group.labels

cex parameter for the major grouping labels, i.e., those corresponding to gdata. Defaults to cex.labels*1.25.

sort.

set to FALSE to keep dotchart2 from sorting the input data, i.e., it will assume that the data are already properly arranged. This is especially useful when you are using gdata and groups and you want to control the order that groups appear on the chart (from top to bottom).

add

set to TRUE to add to an existing plot

dotfont

font number of plotting dots. Default is one. Use -1 to use "outline" fonts. For example, pch=183, dotfont=-1 plots an open circle for UNIX on postscript. pch=1 makes an open octagon under Windows.

groupfont

font number to use in drawing group labels for dotchart2. Default is 2 for boldface.

reset.par

set to FALSE to cause dotchart2 to not reset the par parameters when finished. This is useful when add=TRUE is about to be used in another call. The default is to reset the par parameters if add=TRUE and not if add=FALSE, i.e., the program assumes that only one set of points will be added to an existing set. If you fail to use reset.par=TRUE for the first of a series of plots, the next call to plot with add=TRUE will result in distorted x-axis scaling.

xaxis

set to FALSE to suppress drawing x-axis

width.factor

When the calculated left margin turns out to be faulty, specify a factor by which to multiply the left margin as width.factor to get the appropriate space for labels on horizontal charts.

lcolor

color for horizontal reference lines. Default is "gray" for R, par("col") for S-Plus.

leavepar

set to TRUE to leave par() unchanged. This assumes the user has allocated sufficient left and right margins for a horizontal dot chart.

axisat

a vector of tick mark locations to pass to axis. Useful if transforming the data axis

axislabels

a vector of strings specifying axis tick mark labels. Useful if transforming the data axis

...

arguments passed to plot.default
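The axisat and axislabels arguments work as a pair when the data axis is transformed. The following base R sketch, using made-up values and a square-root transformation chosen only for illustration, shows how the two vectors line up.

```r
# Sketch: tick positions and labels for a square-root-transformed data axis
y <- c(0, 1, 4, 9, 16, 25)          # toy data on the original scale

ticks      <- pretty(y)             # round values on the original scale
axisat     <- sqrt(ticks)           # where the ticks sit on the plotted (sqrt) scale
axislabels <- as.character(ticks)   # what the ticks say, in original units

data.frame(axisat, axislabels)
```

With these in hand one would plot sqrt(y) and pass axisat=axisat, axislabels=axislabels, so the axis reads in original units.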

Side Effects

dotchart2 will leave par altered if reset.par=FALSE.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

See Also

dotchart

Examples

set.seed(135)
maj <- factor(c(rep('North',13),rep('South',13)))
g <- paste('Category',rep(letters[1:13],2))
n <- sample(1:15000, 26, replace=TRUE)
y1 <- runif(26)
y2 <- pmax(0, y1 - runif(26, 0, .1))
dotchart2(y1, g, groups=maj, auxdata=n, auxtitle='n', xlab='Y')
dotchart2(y2, g, groups=maj, pch=17, add=TRUE)
## Compare with dotchart function (no superpositioning or auxdata allowed):
## dotchart(y1, g, groups=maj, xlab='Y')

## To plot using a transformed scale add for example
## axisat=sqrt(pretty(y)), axislabels=pretty(y)

Enhanced Version of dotchart Function

Description

These are adaptations of the R dotchart function that sort categories top to bottom, add auxdata and auxtitle arguments to put extra information in the right margin, and for dotchart3 add arguments cex.labels, cex.group.labels, and groupfont. By default, group headings are in a larger, bold font. dotchart3 also cuts a bit of white space from the top and bottom of the chart. The most significant change, however, is in how x is interpreted. Columns of x no longer provide an alternate way to define groups. Instead, they define superpositioned values. This is useful for showing three quartiles, for example. Going along with this change, for dotchart3 pch can now be a vector specifying symbols to use going across columns of x. x was changed in this way because putting multiple points on a line (e.g., quartiles) and keeping track of par() parameters when dotchart2 was called with add=TRUE was cumbersome. dotchart3 changes the margins to account for horizontal labels.

dotchartp is a version of dotchart3 for making the chart with the plotly package.

summaryD creates aggregate data using summarize and calls dotchart3 with suitable arguments to summarize data by major and minor categories. If options(grType='plotly') is in effect and the plotly package is installed, summaryD uses dotchartp instead of dotchart3.

summaryDp is a streamlined summaryD-like function that uses the dotchartpl function to render a plotly graphic. It is used to compute summary statistics stratified separately by a series of variables.
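Since columns of x now hold superpositioned values, a typical input is a matrix with one row per category and one column per statistic. The base R sketch below builds such a matrix of three quartiles from simulated data; the data frame d and its variable names are made up for illustration.

```r
# Sketch: a matrix whose columns are superpositioned values (three quartiles)
set.seed(2)
d <- data.frame(region = rep(c('North', 'South', 'West'), each = 20),
                y      = rexp(60))

# sapply returns quantiles in rows, so transpose to get one row per category
q <- t(sapply(split(d$y, d$region), quantile, probs = c(.25, .5, .75)))
colnames(q) <- c('Q1', 'Median', 'Q3')
q    # one row per category, one column per plotted symbol
```

A matrix like q could then be given as x, with pch supplied as a vector of three symbols, one per column.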

Usage

dotchart3(x, labels = NULL, groups = NULL, gdata = NULL,
          cex = par("cex"), pch = 21, gpch = pch, bg = par("bg"),
          color = par("fg"), gcolor = par("fg"), lcolor = "gray",
          xlim = range(c(x, gdata), na.rm=TRUE), main = NULL, xlab = NULL,
          ylab = NULL, auxdata = NULL, auxtitle = NULL, auxgdata=NULL,
          axisat=NULL, axislabels=NULL,
          cex.labels = cex, cex.group.labels = cex.labels * 1.25,
          cex.auxdata=cex, groupfont = 2,
          auxwhere=NULL, height=NULL, width=NULL, ...)

dotchartp(x, labels = NULL, groups = NULL, gdata = NULL,
            xlim = range(c(x, gdata), na.rm=TRUE), main=NULL,
            xlab = NULL, ylab = '', auxdata=NULL, auxtitle=NULL,
            auxgdata=NULL, auxwhere=c('right', 'hover'),
            symbol='circle', col=colorspace::rainbow_hcl,
            legendgroup=NULL,
            axisat=NULL, axislabels=NULL, sort=TRUE, digits=4, dec=NULL,
            height=NULL, width=700, layoutattr=FALSE, showlegend=TRUE, ...)

summaryD(formula, data=NULL, fun=mean, funm=fun,
         groupsummary=TRUE, auxvar=NULL, auxtitle='',
         auxwhere=c('hover', 'right'),
         vals=length(auxvar) > 0, fmtvals=format,
         symbol=if(use.plotly) 'circle' else 21,
         col=if(use.plotly) colorspace::rainbow_hcl else 1:10,
         legendgroup=NULL,
         cex.auxdata=.7, xlab=v[1], ylab=NULL,
         gridevery=NULL, gridcol=gray(.95), sort=TRUE, ...)

summaryDp(formula,
          fun=function(x) c(Mean=mean(x, na.rm=TRUE),
                            N=sum(! is.na(x))),
          overall=TRUE, xlim=NULL, xlab=NULL,
          data=NULL, subset=NULL, na.action=na.retain,
          ncharsmax=c(50, 30),
          digits=4, ...)

Arguments

x

a numeric vector or matrix

labels

labels for categories corresponding to rows of x. If not specified these are taken from row names of x.

groups, gdata, cex, pch, gpch, bg, color, gcolor, lcolor, xlim, main, xlab, ylab

see dotchart

auxdata

a vector of information to be put in the right margin, in the same order as x. May be numeric, character, or a vector of expressions containing plotmath markup. For dotchartp, auxdata may be a matrix to go along with the numeric x-axis variable, to result in point-specific hover text.

auxtitle

a column heading for auxdata

auxgdata

similar to auxdata but corresponding to the gdata argument. These usually represent overall sample sizes for each group of lines.

axisat

a vector of tick mark locations to pass to axis. Useful if transforming the data axis

axislabels

a vector of strings specifying axis tick mark labels. Useful if transforming the data axis

digits

number of significant digits for formatting numeric data in hover text for dotchartp and summaryDp

dec

for dotchartp only, overrides digits to specify the argument to round() for rounding values for hover labels

cex.labels

cex for labels

cex.group.labels

cex for group labels

cex.auxdata

cex for auxdata

groupfont

font number for group headings

auxwhere

for summaryD and dotchartp specifies whether auxdata and auxgdata are to be placed on the far right of the chart, or should appear as pop-up tooltips when hovering the mouse over the ordinary x data points on the chart. Ignored for dotchart3.

...

other arguments passed to some of the graphics functions, or to dotchart3 or dotchartp from summaryD. The auxwhere='hover' option is a useful argument to pass from summaryD to dotchartp. Also used to pass other arguments to dotchartpl from summaryDp.

layoutattr

set to TRUE to put plotly::layout information in a list as an attribute layout of the returned plotly object instead of running the plotly object through the layout function. This is useful if running dotchartp multiple times to later put together using plotly::subplot and only then running the result through plotly::layout.

showlegend

set to FALSE to suppress the plotly legend with dotchartp

formula

a formula with one variable on the left hand side (the variable to compute summary statistics on), and one or two variables on the right hand side. If there are two variables, the first is taken as the major grouping variable. If the left hand side variable is a matrix it has to be a legal R variable name, not an expression, and fun needs to be able to process a matrix. For summaryDp there may be more than two right-hand-side variables.

data

a data frame or list used to find the variables in formula. If omitted, the parent environment is used.

fun

a summarization function creating a single number from a vector. Default is the mean. For summaryDp, fun produces a named vector of summary statistics, with the default computing the Mean and N (number of non-missing values).

funm

applies if there are two right hand variables and groupsummary=TRUE and the marginal summaries over just the first x variable need to be computed differently than the summaries that are cross-classified by both variables. funm defaults to fun and should have the same structure as fun.

groupsummary

By default, when there are two right-hand variables, summarize(..., fun) is called a second time without the use of the second variable, to obtain marginal summaries for the major grouping variable and display the results as a dot (and optionally in the right margin). Set groupsummary=FALSE to suppress this information.

auxvar

when fun returns more than one statistic and the user names the elements in the returned vector, you can specify auxvar as a single character string naming one of them. This will cause the named element to be written in the right margin, and that element to be deleted when plotting the statistics.

vals

set to TRUE to show data values (dot locations) in the right margin. Defaults to TRUE if auxvar is specified.

fmtvals

an optional function to format values before putting them in the right margin. Default is the format function.

symbol

a scalar or vector of pch values for ordinary graphics or a character vector or scalar of plotly symbols. These correspond to columns of x or elements produced by fun.

col

a function or vector of colors to assign to multiple points plotted in one line. If a function it will be evaluated with an argument equal to the number of groups/columns.

legendgroup

see plotly documentation; corresponds to column names/fun output for plotly graphs only

gridevery

specify a positive number to draw very faint vertical grid lines every gridevery x-axis units; for non-plotly charts

gridcol

color for grid lines; default is very faint gray scale

sort

specify sort=FALSE to plot data in the original order, from top to bottom on the dot chart. For dotchartp, set sort to 'descending' to sort in descending order of the first column of x, or 'ascending' to do the reverse. These do not make sense if groups is present.

height, width

height and width in pixels for dotchartp if not using plotly defaults. Ignored for dotchart3. If set to "auto" the height is computed using Hmisc::plotlyHeightDotchart.

overall

set to FALSE to suppress plotting of unstratified estimates

subset

an observation subsetting expression

na.action

an NA action function

ncharsmax

a 2-vector specifying the number of characters after which an html new line character should be placed, respectively for the x-axis label and the stratification variable levels
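The relationship between a multi-statistic fun and auxvar, described in the arguments above, can be pictured with plain vectors. This is only a base R sketch of the bookkeeping (which element goes to the margin, which are plotted), not the code summaryD runs; the data values are made up.

```r
# Sketch of how a multi-statistic fun and auxvar relate (base R only)
fun <- function(y) c(Mean = mean(y), n = length(y))   # named summary vector

y      <- c(2.1, 3.4, 1.9, 4.0)
stats  <- fun(y)
auxvar <- 'n'

aux     <- stats[auxvar]                          # element destined for the right margin
plotted <- stats[setdiff(names(stats), auxvar)]   # what remains to be plotted as dots
aux
plotted
```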

Value

the function returns invisibly

Author(s)

Frank Harrell

See Also

dotchart, dotchart2, summarize, rlegend

Examples

set.seed(135)
maj <- factor(c(rep('North',13),rep('South',13)))
g <- paste('Category',rep(letters[1:13],2))
n <- sample(1:15000, 26, replace=TRUE)
y1 <- runif(26)
y2 <- pmax(0, y1 - runif(26, 0, .1))
dotchart3(cbind(y1,y2), g, groups=maj, auxdata=n, auxtitle='n',
          xlab='Y', pch=c(1,17))
## Compare with dotchart function (no superpositioning or auxdata allowed):
## dotchart(y1, g, groups=maj, xlab='Y')

## Not run: 
dotchartp(cbind(y1, y2), g, groups=maj, auxdata=n, auxtitle='n',
          xlab='Y', gdata=cbind(c(0,.1), c(.23,.44)), auxgdata=c(-1,-2),
          symbol=c('circle', 'line-ns-open'))

summaryDp(sbp ~ region + sex + race + cut2(age, g=5), data=mydata)

## End(Not run)

## Put options(grType='plotly') to have the following use dotchartp
## (rlegend will not apply)
## Add argument auxwhere='hover' to summaryD or dotchartp to put
## aux info in hover text instead of right margin
summaryD(y1 ~ maj + g, xlab='Mean')
summaryD(y1 ~ maj + g, groupsummary=FALSE)
summaryD(y1 ~ g, fmtvals=function(x) sprintf('%4.2f', x))
Y <- cbind(y1, y2)   # summaryD cannot handle cbind(...) ~ ...
summaryD(Y  ~ maj + g, fun=function(y) y[1,], symbol=c(1,17))
rlegend(.1, 26, c('y1','y2'), pch=c(1,17))

summaryD(y1 ~ maj, fun=function(y) c(Mean=mean(y), n=length(y)),
         auxvar='n', auxtitle='N')

Enhanced Version of dotchart Function for plotly

Description

This function produces a plotly interactive graphic and accepts a different format of data input than the other dotchart functions. It was written to handle a hierarchical data structure including strata that further subdivide the main classes. Strata, indicated by the mult variable, are shown on the same horizontal line, and if the variable big is FALSE will appear slightly below the main line, using smaller symbols, and having some transparency. This is intended to handle output such as that from the summaryP function when there is a superpositioning variable group and a stratification variable mult, especially when the data have been run through the addMarginal function to create mult categories labelled "All" for which the user will specify big=TRUE to indicate non-stratified estimates (stratified only on group) to emphasize.

When viewing graphics that used mult and big, the user can click on the legends for the small points for groups to hide the finely stratified estimates.

When group is used but mult and big are not, and when the group variable has exactly two distinct values, you can specify refgroup to get the difference between two proportions in addition to the individual proportions. The individual proportions are plotted, but confidence intervals for the difference are shown in hover text and half-width confidence intervals for the difference, centered at the midpoint of the proportions, are shown. These have the property of intersecting the two proportions if and only if there is no significant difference at the 1 - conf.int level.
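The half-width interval property can be checked numerically. The sketch below assumes a simple Wald standard error for the difference in proportions, which may not be what dotchartpl uses; halfWidthCI and covers are hypothetical helper names.

```r
# Sketch: half-width interval for a difference in two proportions.
# The Wald standard error is an assumption; Hmisc may compute limits differently.
halfWidthCI <- function(p1, n1, p2, n2, conf.int = 0.95) {
  z   <- qnorm((1 + conf.int) / 2)
  se  <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  d   <- p2 - p1
  mid <- (p1 + p2) / 2
  list(diff = c(lower = d - z * se,       upper = d + z * se),        # usual CI
       half = c(lower = mid - z * se / 2, upper = mid + z * se / 2))  # half-width CI
}
covers <- function(ci, p) all(p >= ci['lower'] & p <= ci['upper'])

a <- halfWidthCI(0.50, 100, 0.55, 100)   # small difference, not significant
b <- halfWidthCI(0.50, 100, 0.80, 100)   # large difference, significant
covers(a$half, c(0.50, 0.55))   # half-width interval spans both proportions
covers(b$half, c(0.50, 0.80))   # here it does not reach either proportion
```

The half-width interval reaches both proportions exactly when the full interval for the difference includes zero, because both conditions reduce to |p2 - p1| <= z * se.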

Specify fun=exp and ifun=log if estimates and confidence limits are on the log scale. Make sure that zeros were prevented in the original calculations. For exponential hazard rates this can be accomplished by replacing event counts of 0 with 0.5.

Usage

dotchartpl(x, major=NULL, minor=NULL, group=NULL, mult=NULL,
           big=NULL, htext=NULL, num=NULL, denom=NULL,
           numlabel='', denomlabel='',
           fun=function(x) x, ifun=function(x) x, op='-',
           lower=NULL, upper=NULL,
           refgroup=NULL, sortdiff=TRUE, conf.int=0.95,
           minkeep=NULL, xlim=NULL, xlab='Proportion',
           tracename=NULL, limitstracename='Limits',
           nonbigtracename='Stratified Estimates',
           dec=3, width=800, height=NULL,
           col=colorspace::rainbow_hcl)

Arguments

x

a numeric vector used for values on the x-axis

major

major vertical category, e.g., variable labels

minor

minor vertical category, e.g. category levels within variables

group

superpositioning variable such as treatment

mult

strata names for further subdivisions without groups

big

omit if all levels of mult are equally important or if mult is omitted. Otherwise denotes major (larger points, right on horizontal lines) vs. minor (smaller, transparent points slightly below the line).

htext

additional hover text per point

num

if x represents proportions, optionally specifies numerators to be used in fractions added to hover text. When num is given, x is automatically added to hover text, rounded to 3 digits after the decimal point.

denom

like num but for denominators

numlabel

character string to put to the right of the numerator in hover text

denomlabel

character string to put to the right of the denominator in hover text

fun

a transformation to make when printing estimates. For example, one may specify fun=exp to anti-log estimates and confidence limits that were computed on a log basis

ifun

inverse transformation of fun

op

set to for example '/' when fun=exp and effects are computed as ratios instead of differences. This is used in hover text.

lower

lower limits for optional error bars

upper

upper limits for optional error bars

refgroup

if group is specified and there are exactly two groups, specify the character string for the reference group in computing difference in proportions. For example if refgroup='A' and the group levels are 'A','B', you will get B - A.

sortdiff

minor categories are sorted by descending values of the difference in proportions when refgroup is used, unless you specify sortdiff=FALSE

conf.int

confidence level for computing confidence intervals for the difference in two proportions. Specify conf.int=FALSE to suppress confidence intervals.

minkeep

if refgroup and minkeep are both given, observations that are at or above minkeep for at least one of the groups are retained. The default is to keep all observations.

xlim

x-axis limits

xlab

x-axis label

tracename

plotly trace name if group is not used

limitstracename

plotly trace name for lower and upper if group is not used

nonbigtracename

plotly trace name used for non-big elements, which usually represent stratified versions of the "big" observations

col

a function or vector of colors to assign to group. If a function it will be evaluated with an argument equal to the number of distinct groups.

dec

number of places to the right of the decimal place for formatting numeric quantities in hover text

width

width of plot in pixels

height

height of plot in pixels; computed from number of strata by default

Value

a plotly object. An attribute levelsRemoved is added if minkeep is used and any categories were omitted from the plot as a result. This is a character vector with categories removed. If major is present, the strings are of the form major:minor

Author(s)

Frank Harrell

See Also

dotchartp

Examples

## Not run: 
set.seed(1)
d <- expand.grid(major=c('Alabama', 'Alaska', 'Arkansas'),
                 minor=c('East', 'West'),
                 group=c('Female', 'Male'),
                 city=0:2)
n <- nrow(d)
d$num   <- round(100*runif(n))
d$denom <- d$num + round(100*runif(n))
d$x     <- d$num / d$denom
d$lower <- d$x - runif(n)
d$upper <- d$x + runif(n)

with(d,
 dotchartpl(x, major, minor, group, city, lower=lower, upper=upper,
            big=city==0, num=num, denom=denom, xlab='x'))

# Show half-width confidence intervals for Female - Male differences
# after subsetting the data to have only one record per
# state/region/group
d <- subset(d, city == 0)
with(d,
 dotchartpl(x, major, minor, group, num=num, denom=denom,
            lower=lower, upper=upper, refgroup='Male'))

n <- 500
set.seed(1)
d <- data.frame(
  race         = sample(c('Asian', 'Black/AA', 'White'), n, TRUE),
  sex          = sample(c('Female', 'Male'), n, TRUE),
  treat        = sample(c('A', 'B'), n, TRUE),
  smoking      = sample(c('Smoker', 'Non-smoker'), n, TRUE),
  hypertension = sample(c('Hypertensive', 'Non-Hypertensive'), n, TRUE),
  region       = sample(c('North America','Europe','South America',
                          'Europe', 'Asia', 'Central America'), n, TRUE))

d <- upData(d, labels=c(race='Race', sex='Sex'))

dm <- addMarginal(d, region)
s <- summaryP(race + sex + smoking + hypertension ~
                region + treat,  data=dm)

s$region <- ifelse(s$region == 'All', 'All Regions', as.character(s$region))

with(s,
 dotchartpl(freq / denom, major=var, minor=val, group=treat, mult=region,
            big=region == 'All Regions', num=freq, denom=denom))

s2 <- s[- attr(s, 'rows.to.exclude1'), ]
with(s2,
     dotchartpl(freq / denom, major=var, minor=val, group=treat, mult=region,
                big=region == 'All Regions', num=freq, denom=denom))
# Note these plots can be created by plot.summaryP when options(grType='plotly')

# Plot hazard rates and ratios with confidence limits, on log scale
d <- data.frame(tx=c('a', 'a', 'b', 'b'),
                event=c('MI', 'stroke', 'MI', 'stroke'),
                count=c(10, 5, 5, 2),
                exposure=c(1000, 1000, 900, 900))
# There were no zero event counts in this dataset.  In general we
# want to handle that, hence the 0.5 below
d <- upData(d, hazard = pmax(0.5, count) / exposure,
               selog  = sqrt(1. / pmax(0.5, count)),
               lower  = log(hazard) - 1.96 * selog,
               upper  = log(hazard) + 1.96 * selog)
with(d,
     dotchartpl(log(hazard), minor=event, group=tx, num=count, denom=exposure,
                lower=lower, upper=upper,
                fun=exp, ifun=log, op='/',
                numlabel='events', denomlabel='years',
                refgroup='a', xlab='Events Per Person-Year'))
## End(Not run)

Dual Standard Deviations

Description

Computes one standard deviation for the lower half of the distribution of a numeric vector and another SD for the upper half. By default the center of the distribution for purposes of splitting into "halves" is the mean. The user may override this with center. When splitting into halves, observations equal to the center value are included in both subsets.

Usage

dualSD(x, na.rm = FALSE, nmin = 10, center = xbar)

Arguments

x

a numeric vector

na.rm

set to TRUE to find any NA values and remove them before computing SDs.

nmin

the minimum number of non-NA observations that must be present for two SDs to be computed. If the number of non-missing values falls below nmin, the regular SD is duplicated in the result.

center

center point for making the two subsets. The sample mean is used to compute the two SDs no matter what is specified for center.

Details

The purpose of dual SDs is to describe variability for asymmetric distributions. Symmetric distributions are also handled, though slightly less efficiently than a single SD does.
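The idea can be sketched in a few lines of base R. The function below, with the hypothetical name dualSDsketch, follows the description in this help page (split at center, deviations taken about the sample mean, ties placed in both halves, fallback to the ordinary SD below nmin). The divisor used within each half is an assumption, so its results need not match dualSD exactly.

```r
# Sketch of the dual-SD idea; the divisor used for each half is an assumption
dualSDsketch <- function(x, nmin = 10, center = mean(x)) {
  x    <- x[!is.na(x)]
  xbar <- mean(x)
  if (length(x) < nmin) {            # too few values: duplicate the regular SD
    s <- sd(x)
    return(c(bottom = s, top = s))
  }
  rms <- function(v) sqrt(mean((v - xbar)^2))   # deviations about the sample mean
  c(bottom = rms(x[x <= center]),               # values equal to center go to both halves
    top    = rms(x[x >= center]))
}

# Right-skewed toy data: the upper-half SD is the larger of the two
dualSDsketch(c(1, 1, 1, 2, 2, 3, 5, 9, 20, 50))
```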

Value

a 2-vector of SDs with names bottom and top

Author(s)

Frank Harrell

See Also

pMedian()

Examples

set.seed(1)
x <- rnorm(20000)
sd(x)
dualSD(x)
y <- exp(x)
s1 <- sd(y)
s2 <- dualSD(y)
s1
s2
quantile(y, c(0.025, 0.975))
mean(y) + 1.96 * c(-1, 1) * s1
mean(y) + 1.96 * c(- s2['bottom'], s2['top'])
c(mean=mean(y), pseudomedian=pMedian(y), median=median(y))

ebpcomp

Description

Computation of Coordinates of Extended Box Plots Elements

Usage

ebpcomp(x, qref = c(0.5, 0.25, 0.75), probs = c(0.05, 0.125, 0.25, 0.375))

Arguments

x

a numeric variable

qref

quantiles for major corners

probs

quantiles for minor corners

Details

Computes all the elements needed for plotting an extended box plot. This is typically used when adding to a ggplot2 plot.
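The coordinates are built from sample quantiles at qref and probs. The base R sketch below shows only that quantile step; pairing each element of probs with its mirror image 1 - probs is an assumption about how the outer corners are formed, and the structure of the list returned by ebpcomp is not reproduced.

```r
# Sketch: the quantiles an extended box plot is built from.
# Mirroring probs as 1 - probs is an assumption about how corners are paired.
x     <- 1:1000
qref  <- c(0.5, 0.25, 0.75)
probs <- c(0.05, 0.125, 0.25, 0.375)

major <- quantile(x, qref)                        # median and quartiles
minor <- quantile(x, sort(c(probs, 1 - probs)))   # symmetric outer/inner quantiles
major
minor
```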

Value

list with elements segments, lines, points, points2

Author(s)

Frank Harrell

Examples

ebpcomp(1:1000)

ecdfSteps

Description

Compute Coordinates of an Empirical Distribution Function

Usage

ecdfSteps(x, extend)

Arguments

x

numeric vector, possibly with NAs that are ignored

extend

a 2-vector to extend the range of x (low, high). Set extend=FALSE to not extend x, or leave it missing to extend it 1/20th of the observed range on either side.

Details

For a numeric vector uses the R built-in ecdf function to compute coordinates of the ECDF, with extension slightly below and above the range of x by default. This is useful for ggplot2 where the ECDF may need to be transformed. The returned object is suitable for creating stratified statistics using data.table and other methods.
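A base R sketch of the computation described above follows. It evaluates stats::ecdf at the distinct data values and pads the range by 1/20th on each side. The helper name ecdfCoords and the exact placement of the padded points are assumptions, so the output may differ in detail from ecdfSteps.

```r
# Sketch: step coordinates from stats::ecdf with a 1/20-range extension per side.
# The layout of the returned list is an assumption about what ecdfSteps produces.
ecdfCoords <- function(x) {
  x   <- x[!is.na(x)]
  u   <- sort(unique(x))
  Fn  <- ecdf(x)
  ext <- diff(range(u)) / 20
  list(x = c(u[1] - ext, u, u[length(u)] + ext),   # padded x coordinates
       y = c(0, Fn(u), 1))                         # ECDF heights, 0 before the first value
}

w <- ecdfCoords(0:10)
str(w)
```

Because the result is a plain list with equal-length x and y, it can be returned from a grouped data.table expression in the way the examples for this function show.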

Value

a list with components x and y

Author(s)

Frank Harrell

See Also

stats::ecdf()

Examples

ecdfSteps(0:10)
## Not run: 
# Use data.table for obtaining ECDFs by country and region
w <- d[, ecdfSteps(z, extend=c(1,11)), by=.(country, region)]  # d is a DT
# Use ggplot2 to make one graph with multiple regions' ECDFs
# and use faceting for countries
ggplot(w, aes(x, y, color=region)) + geom_step() +
       facet_wrap(~ country)
## End(Not run)

Multicolumn Formatting

Description

Expands the width of either the supercolumns or the subcolumns so that the sum of the supercolumn widths is the same as the sum of the subcolumn widths.

Usage

equalBins(widths, subwidths)

Arguments

widths

widths of the supercolumns.

subwidths

list of widths of the subcolumns for each supercolumn.

Details

This determines the correct subwidths of each of various columns in a table for printing. The correct width of the multicolumns is determined by summing the widths of its subcolumns.
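One direction of the adjustment, widening subcolumns until they fill their supercolumn, can be sketched as follows. The helper fillSub is hypothetical and ignores any space taken by column separators, which is an assumption; equalBins itself may account for them and so return different widths.

```r
# Simplified sketch: widen subcolumns until they fill their supercolumn.
# Column separators are ignored here (an assumption).
fillSub <- function(width, sub) {
  short <- width - sum(sub)
  if (short <= 0) return(sub)            # subcolumns already at least as wide
  k     <- length(sub)
  extra <- rep(short %/% k, k)           # even share for every subcolumn
  idx   <- seq_len(short %% k)
  extra[idx] <- extra[idx] + 1           # leftover characters go to the first ones
  sub + extra
}

fillSub(7,  c(1, 4, 2))   # widths already sum to 7, returned unchanged
fillSub(12, c(1, 4, 2))   # 5 extra characters spread over 3 subcolumns
```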

Value

widths of the columns for a table.

Author(s)

Charles Dupont

See Also

nchar, stringDims

Examples

mcols <- c("Group 1", "Group 2")
mwidth <- nchar(mcols, type="width")
spancols <- c(3,3)
ccols <- c("a", "deer", "ad", "cat", "help", "bob")
cwidth <- nchar(ccols, type="width")
subwidths <- partition.vector(cwidth, spancols)
equalBins(mwidth, subwidths)
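The balancing idea in Details can be sketched as follows. This is a simplified illustration, not the equalBins algorithm: equal_bins_sketch is a hypothetical name, column separators are ignored, and the rule for spreading extra width (left-most subcolumns first) is an assumption.

```r
# Hypothetical helper (not Hmisc::equalBins): widen subcolumns until each
# group of subcolumns is at least as wide as its supercolumn
equal_bins_sketch <- function(widths, subwidths) {
  stopifnot(length(widths) == length(subwidths))
  adj <- Map(function(w, s) {
    deficit <- w - sum(s)
    if (deficit <= 0) return(s)        # subcolumns already wide enough
    k <- length(s)
    # spread the deficit evenly; any remainder goes to the left-most columns
    s + deficit %/% k + (seq_len(k) <= deficit %% k)
  }, widths, subwidths)
  # supercolumn widths are then the sums of their subcolumn widths
  list(subwidths = adj, widths = vapply(adj, sum, numeric(1)))
}

res <- equal_bins_sketch(c(7, 7), list(c(1, 4, 2), c(1, 1, 1)))
res$subwidths
res$widths
```

In this sketch the first group is left alone because its subcolumns already sum to 7, while the second group is widened from a total of 3 to 7.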

Plot Error Bars

Description

Adds vertical error bars to an existing plot or makes a new plot with error bars.

Usage

errbar(x, y, yplus, yminus, cap=0.015, main = NULL,
       sub=NULL, xlab=as.character(substitute(x)),
       ylab=if(is.factor(x) || is.character(x)) ""
           else as.character(substitute(y)),
       add=FALSE, lty=1, type='p', ylim=NULL,
       lwd=1, pch=16, errbar.col, Type=rep(1, length(y)),
       ...)

Arguments

x

vector of numeric x-axis values (for vertical error bars) or a factor or character variable (for horizontal error bars, x representing the group labels)

y

vector of y-axis values.

yplus

vector of y-axis values: the tops of the error bars.

yminus

vector of y-axis values: the bottoms of the error bars.

cap

the width of the little lines at the tops and bottoms of the error bars in units of the width of the plot. Defaults to 0.015.

main

a main title for the plot, passed to plot, see also title.

sub

a sub title for the plot, passed to plot

xlab

optional x-axis labels if add=FALSE.

ylab

optional y-axis labels if add=FALSE. Defaults to blank for horizontal charts.

add

set to TRUE to add bars to an existing plot (available only for vertical error bars)

lty

type of line for error bars

type

type of point. Use type="b" to connect dots.

ylim

y-axis limits. Default is to use the range of y, yminus, and yplus. For horizontal charts, ylim is really the x-axis range, excluding differences.

lwd

line width for line segments (not main line)

pch

character to use as the point.

errbar.col

color to use for drawing error bars.

Type

used for horizontal bars only. Is an integer vector with values 1 if corresponding values represent simple estimates, 2 if they represent differences.

...

other parameters passed to all graphics functions.

Details

errbar adds vertical error bars to an existing plot or makes a new plot with error bars. It can also make a horizontal error bar plot that shows error bars for group differences as well as bars for groups. For the latter type of plot, the lower x-axis scale corresponds to group estimates and the upper scale corresponds to differences. The spacings of the two scales are identical but the scale for differences has its origin shifted so that zero may be included. If at least one of the confidence intervals includes zero, a vertical dotted reference line at zero is drawn.

Author(s)

Charles Geyer, University of Chicago. Modified by Frank Harrell, Vanderbilt University, to handle missing data, to add the parameters add and lty, and to implement horizontal charts with differences.

Examples

set.seed(1)
x <- 1:10
y <- x + rnorm(10)
delta <- runif(10)
errbar( x, y, y + delta, y - delta )

# Show bootstrap nonparametric CLs for 3 group means and for
# pairwise differences on same graph
group <- sample(c('a','b','d'), 200, TRUE)
y     <- runif(200) + .25*(group=='b') + .5*(group=='d')
cla <- smean.cl.boot(y[group=='a'],B=100,reps=TRUE)  # usually B=1000
a   <- attr(cla,'reps')
clb <- smean.cl.boot(y[group=='b'],B=100,reps=TRUE)
b   <- attr(clb,'reps')
cld <- smean.cl.boot(y[group=='d'],B=100,reps=TRUE)
d   <- attr(cld,'reps')
a.b <- quantile(a-b,c(.025,.975))
a.d <- quantile(a-d,c(.025,.975))
b.d <- quantile(b-d,c(.025,.975))
errbar(c('a','b','d','a - b','a - d','b - d'),
       c(cla[1],clb[1],cld[1],cla[1]-clb[1],cla[1]-cld[1],clb[1]-cld[1]),
       c(cla[3],clb[3],cld[3],a.b[2],a.d[2],b.d[2]),
       c(cla[2],clb[2],cld[2],a.b[1],a.d[1],b.d[1]),
       Type=c(1,1,1,2,2,2), xlab='', ylab='')
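For readers who want to see what the vertical case amounts to, the following base graphics sketch draws bars and caps with segments(). It is an illustration only: errbar_sketch is a hypothetical name, it covers neither horizontal charts nor add=TRUE, and the assumption that cap is the full cap width as a fraction of the plot width is taken from the description of cap above.

```r
# Hypothetical helper (not Hmisc::errbar): vertical error bars in base graphics
errbar_sketch <- function(x, y, yplus, yminus, cap = 0.015, ...) {
  plot(x, y, ylim = range(y, yplus, yminus, na.rm = TRUE), pch = 16, ...)
  # half of the cap width, converted to user coordinates of the x-axis
  half <- cap * diff(par("usr")[1:2]) / 2
  segments(x, yminus, x, yplus)                   # the bars themselves
  segments(x - half, yplus,  x + half, yplus)     # caps at the tops
  segments(x - half, yminus, x + half, yminus)    # caps at the bottoms
  invisible(half)
}

pdf(NULL)   # off-screen device so the sketch also runs non-interactively
set.seed(1)
x <- 1:10
y <- x + rnorm(10)
delta <- runif(10)
half <- errbar_sketch(x, y, y + delta, y - delta)
invisible(dev.off())
```

Drop the pdf(NULL) and dev.off() lines to draw on the current interactive device instead.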

Escapes any characters that would have special meaning in a regular expression.

Description

Escapes any characters that would have special meaning in a regular expression.

Usage

escapeRegex(string)
escapeBS(string)

Arguments

string

string being operated on.

Details

escapeRegex will escape any characters that would have special meaning in a regular expression. For any string, grep(escapeRegex(string), string) will always match.

escapeBS will escape any backslash ‘\’ in a string.

Value

The value of the string with any characters that would have special meaning in a regular expression escaped.

Author(s)

Charles Dupont
Department of Biostatistics
Vanderbilt University

See Also

grep

Examples

string <- "this\\(system) {is} [full]."
escapeRegex(string)
escapeBS(string)
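The escaping idea can be sketched in base R with gsub(). These are hypothetical re-implementations for illustration only (escape_regex_sketch and escape_bs_sketch are not Hmisc functions), and the set of characters treated as special here is an assumption; the Hmisc code may differ.

```r
# Hypothetical helper: put a backslash in front of each regex metacharacter.
# perl = TRUE is used so that backslash escapes inside [...] are well defined.
escape_regex_sketch <- function(string) {
  gsub("([.|()\\\\^{}+$*?\\[\\]])", "\\\\\\1", string, perl = TRUE)
}

# Hypothetical helper: double every backslash
escape_bs_sketch <- function(string) {
  gsub("\\\\", "\\\\\\\\", string)
}

s <- "this\\(system) {is} [full]."
e <- escape_regex_sketch(s)
grepl(e, s)                                  # the escaped pattern matches its source
grepl(escape_regex_sketch("a.b"), "axb")     # "." is no longer a wildcard
```

The second call shows why escaping matters when a user-supplied string is used as a pattern: without it, "a.b" would also match "axb".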

estSeqMarkovOrd

Description

Simulate Comparisons For Use in Sequential Markov Longitudinal Clinical Trial Simulations

Usage

estSeqMarkovOrd(
  y,
  times,
  initial,
  absorb = NULL,
  intercepts,
  parameter,
  looks,
  g,
  formula,
  ppo = NULL,
  yprevfactor = TRUE,
  groupContrast = NULL,
  cscov = FALSE,
  timecriterion = NULL,
  coxzph = FALSE,
  sstat = NULL,
  rdsample = NULL,
  maxest = NULL,
  maxvest = NULL,
  nsim = 1,
  progress = FALSE,
  pfile = ""
)

Arguments

y

vector of possible y values in order (numeric, character, factor)

times

vector of measurement times

initial

a vector of probabilities summing to 1.0 that specifies the frequency distribution of initial values to be sampled from. The vector must have names that correspond to values of y representing non-absorbing states.

absorb

vector of absorbing states, a subset of y. The default is no absorbing states. Observations are truncated when an absorbing state is simulated. May be numeric, character, or factor.

intercepts

vector of intercepts in the proportional odds model. There must be one fewer of these than the length of y.

parameter

vector of true parameter (effects; group differences) values. These are group 2:1 log odds ratios in the transition model, conditioning on the previous y.

looks

integer vector of ID numbers at which maximum likelihood estimates and their estimated variances are computed. For a single look specify a scalar value for looks equal to the number of subjects in the sample.

g

a user-specified function of three or more arguments which in order are yprev (the value of y at the previous time), the current time t, the gap between the previous time and the current time, an optional (usually named) covariate vector X, and optional arguments such as a regression coefficient value to simulate from. The function needs to allow yprev to be a vector, and yprev must not include any absorbing states. The g function returns the linear predictor for the proportional odds model aside from intercepts. The returned value must be a matrix with row names taken from yprev. If the model is a proportional odds model, the returned value must be one column. If it is a partial proportional odds model, the value must have one column for each distinct value of the response variable Y after the first one, with the levels of Y used as optional column names. So columns correspond to intercepts. The different columns are used for y-specific contributions to the linear predictor (aside from intercepts) for a partial or constrained partial proportional odds model. Parameters for partial proportional odds effects may be included in the ... arguments.

formula

a formula object given to the lrm() function using variables with these names: y, time, yprev, and group (factor variable having values '1' and '2'). The yprev variable is converted to a factor before fitting the model unless yprevfactor=FALSE.

ppo

a formula specifying the part of formula for which proportional odds is not to be assumed, i.e., that specifies a partial proportional odds model. Specifying ppo triggers the use of VGAM::vglm() instead of rms::lrm and will make the simulations run slower.

yprevfactor

see formula

groupContrast

omit this argument if group has only one regression coefficient in formula. Otherwise if ppo is omitted, provide groupContrast as a list of two lists that are passed to rms::contrast.rms() to compute the contrast of interest and its standard error. The first list corresponds to group 1, the second to group 2, to get a 2:1 contrast. If ppo is given and the group effect is not just a simple regression coefficient, specify as groupContrast a function of a vglm fit that computes the contrast of interest and its standard error and returns a list with elements named Contrast and SE. For the latter type you can optionally have formal arguments n1, n2, and parameter that are passed to groupContrast to compute the standard error of the group contrast, where n1 and n2 respectively are the sample sizes for the two groups and parameter is the true group effect parameter value.

cscov

applies if ppo is not used. Set to TRUE to use the cluster sandwich covariance estimator of the variance of the group comparison.

timecriterion

a function of a time-ordered vector of simulated ordinal responses y that returns a vector of FALSE or TRUE values denoting whether the current y level met the condition of interest. For example, estSeqMarkovOrd will compute the first time at which y >= 5 if you specify timecriterion=function(y) y >= 5. This function is only called at the last data look for each simulated study. To have more control, instead of timecriterion returning a logical vector have it return a numeric 2-vector containing, in order, the event/censoring time and the 1/0 event/censoring indicator.

coxzph

set to TRUE if timecriterion is specified and you want to compute a statistic for testing proportional hazards at the last look of each simulated dataset

sstat

set to a function of the time vector and the corresponding vector of ordinal responses for a single group if you want to compute a Wilcoxon test on a derived quantity such as the number of days in a given state.

rdsample

an optional function to do response-dependent sampling. It is a function of these arguments, which are vectors that stop at any absorbing state: times (ascending measurement times for one subject) and y (vector of ordinal outcomes at these times for one subject). The function returns NULL if no observations are to be dropped; otherwise it returns the vector of new times to sample.

maxest

maximum acceptable absolute value of the contrast estimate, ignored if NULL. Any values exceeding maxest will result in the estimate being set to NA.

maxvest

like maxest but for the estimated variance of the contrast estimate

nsim

number of simulations (default is 1)

progress

set to TRUE to send current iteration number to pfile every 10 iterations. Each iteration will really involve multiple simulations, if parameter has length greater than 1.

pfile

file to which to write progress information. Defaults to '' which is the console. Ignored if progress=FALSE.

Details

Simulates sequential clinical trials of longitudinal ordinal outcomes using a first-order Markov model. Looks are done sequentially after subject ID numbers given in the vector looks, with the earliest possible look being after subject 2. At each look, a subject's repeated records are either all used or all ignored depending on the subject's sequential ID number. For each true effect parameter value, simulation, and at each look, runs a function to compute the estimate of the parameter of interest along with its variance. For each simulation, data are first simulated for the last look, and these data are sequentially revealed for earlier looks. The user provides a function g that has extra arguments specifying the true effect parameter and the treatment group, expecting treatments to be coded 1 and 2. parameter is usually on the scale of a regression coefficient, e.g., a log odds ratio. Fitting is done using the rms::lrm() function, unless non-proportional odds is allowed, in which case VGAM::vglm() is used. If timecriterion is specified, the function also, for the last data look only, computes the first time at which the criterion is satisfied for the subject or uses the event time and event/censoring indicator computed by timecriterion. The Cox/logrank chi-square statistic for comparing groups on the derived time variable is saved. If coxzph=TRUE, the survival package correlation coefficient rho from the scaled partial residuals is also saved so that the user can later determine to what extent the Markov model resulted in the proportional hazards assumption being violated when analyzing on the time scale. vglm is accelerated by saving the first successful fit for the largest sample size and using its coefficients as starting values for further vglm fits for any sample size for the same setting of parameter.

Value

a data frame with number of rows equal to the product of nsim, the length of looks, and the length of parameter, with variables sim, parameter, look, est (log odds ratio for group), and vest (the variance of the latter). If timecriterion is specified the data frame also contains loghr (Cox log hazard ratio for group), lrchisq (chi-square from Cox test for group), and if coxzph=TRUE, phchisq, the chi-square for testing proportional hazards. The attribute etimefreq is also present if timecriterion is present, and it provides the frequency distribution of derived event times by group and censoring/event indicator. If sstat is given, the attribute sstat is also present, and it contains an array with dimensions corresponding to simulations, parameter values within simulations, id, and a two-column subarray with columns group and y, the latter being the summary measure computed by the sstat function. The returned data frame also has attribute lrmcoef which are the last-look logistic regression coefficient estimates over the nsim simulations and the parameter settings, and an attribute failures which is a data frame containing the variables reason and frequency cataloging the reasons for unsuccessful model fits.

Author(s)

Frank Harrell

See Also

gbayesSeqSim(), simMarkovOrd(), https://hbiostat.org/R/Hmisc/markov/
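
The contracts for the g and timecriterion arguments can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the coefficients 0.5 and -0.1 are arbitrary, X is assumed to be a named vector with an element "group", and the model is taken to be a plain proportional odds model so that g returns a single column.

```r
# Hypothetical transition function: linear predictor aside from intercepts,
# returned as a one-column matrix with row names taken from yprev
g <- function(yprev, t, gap, X, parameter = 0) {
  grp2 <- as.numeric(X[["group"]] == 2)          # 1 for group 2, 0 for group 1
  lp   <- 0.5 * (as.numeric(yprev) - 1) -        # effect of the previous state
          0.1 * t +                              # arbitrary linear time trend
          parameter * grp2                       # group 2:1 log odds ratio
  matrix(lp, ncol = 1, dimnames = list(as.character(yprev), NULL))
}

lp <- g(yprev = 1:3, t = 2, gap = 1, X = c(group = 2), parameter = log(0.7))
lp

# Criterion from the timecriterion description: first time at which y >= 5
timecriterion <- function(y) y >= 5
times <- 1:4
y_sim <- c(2, 4, 5, 6)
first_time <- times[which(timecriterion(y_sim))[1]]
first_time
```

The gap argument is accepted but unused in this sketch; a g that models unequal measurement spacing would use it.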


estSeqSim

Description

Simulate Comparisons For Use in Sequential Clinical Trial Simulations

Usage

estSeqSim(parameter, looks, gendat, fitter, nsim = 1, progress = FALSE)

Arguments

parameter

vector of true parameter (effects; group differences) values

looks

integer vector of observation numbers at which posterior probabilities are computed

gendat

a function of three arguments: true parameter value (scalar), sample size for first group, sample size for second group

fitter

a function of two arguments: 0/1 group indicator vector and the dependent variable vector

nsim

number of simulations (default is 1)

progress

set to TRUE to send current iteration number to the console

Details

Simulates sequential clinical trials. Looks are done sequentially at observation numbers given in the vector looks, with the earliest possible look being at observation 2. For each true effect parameter value, simulation, and at each look, runs a function to compute the estimate of the parameter of interest along with its variance. For each simulation, data are first simulated for the last look, and these data are sequentially revealed for earlier looks. The user provides a function gendat that, given a true effect of parameter and the two sample sizes (for treatment groups 1 and 2), returns a list with vectors y1 and y2 containing simulated data. The user also provides a function fitter with arguments x (group indicator 0/1) and y (response variable) that returns a 2-vector containing the effect estimate and its variance. parameter is usually on the scale of a regression coefficient, e.g., a log odds ratio.

Value

a data frame with number of rows equal to the product of nsim, the length of looks, and the length of parameter.

Author(s)

Frank Harrell

See Also

gbayesSeqSim(), simMarkovOrd(), estSeqMarkovOrd()

Examples

if (requireNamespace("rms", quietly = TRUE)) {
  # Run 100 simulations, 5 looks, 2 true parameter values
  # Total simulation time: 2s
  lfit <- function(x, y) {
    f <- rms::lrm.fit(x, y)
    k <- length(coef(f))
    c(coef(f)[k], vcov(f)[k, k])
  }
  gdat <- function(beta, n1, n2) {
    # Cell probabilities for a 7-category ordinal outcome for the control group
    p <- c(2, 1, 2, 7, 8, 38, 42) / 100
    # Compute cell probabilities for the treated group
    p2 <- pomodm(p=p, odds.ratio=exp(beta))
    y1 <- sample(1 : 7, n1, p,  replace=TRUE)
    y2 <- sample(1 : 7, n2, p2, replace=TRUE)
    list(y1=y1, y2=y2)
  }
  set.seed(1)
  est <- estSeqSim(c(0, log(0.7)), looks=c(50, 75, 95, 100, 200),
                    gendat=gdat,
                    fitter=lfit, nsim=100)
  head(est)
}
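
The gendat and fitter contracts do not depend on rms. As a minimal sketch, assuming a normally distributed outcome and a difference in means as the effect of interest, a pair of functions with the required signatures could look like the following. The names gdat_norm and fit_diff are hypothetical, and the functions are exercised directly here rather than through estSeqSim.

```r
# Hypothetical data generator: unit-variance normal outcome, group 2 shifted by beta
gdat_norm <- function(beta, n1, n2) {
  list(y1 = rnorm(n1), y2 = rnorm(n2, mean = beta))
}

# Hypothetical fitter: difference in means and its estimated variance
fit_diff <- function(x, y) {
  y0 <- y[x == 0]
  y1 <- y[x == 1]
  c(mean(y1) - mean(y0),
    var(y1) / length(y1) + var(y0) / length(y0))
}

set.seed(2)
d   <- gdat_norm(0.5, 50, 50)
x   <- rep(0:1, c(50, 50))            # 0/1 group indicator
est <- fit_diff(x, c(d$y1, d$y2))     # 2-vector: estimate, variance
est
```

Such a pair could then be supplied as gendat=gdat_norm and fitter=fit_diff in a call like the one in the example above.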

Flexible Event Chart for Time-to-Event Data

Description

Creates an event chart on the current graphics device. Also allows the user to plot a legend on the plot area or on a separate page. Contains features useful for plotting data with time-to-event outcomes, which arise in a variety of studies including randomized clinical trials and non-randomized cohort studies. This function can use as input a matrix or a data frame, although greater utility and ease of use will be seen with a data frame.

Usage

event.chart(data, subset.r = 1:dim(data)[1], subset.c = 1:dim(data)[2],
           sort.by = NA, sort.ascending = TRUE,
           sort.na.last = TRUE, sort.after.subset = TRUE,
           y.var = NA, y.var.type = "n",
           y.jitter = FALSE, y.jitter.factor = 1,
           y.renum = FALSE, NA.rm = FALSE, x.reference = NA,
           now = max(data[, subset.c], na.rm = TRUE),
           now.line = FALSE, now.line.lty = 2,
           now.line.lwd = 1, now.line.col = 1, pty = "m",
           date.orig = c(1, 1, 1960), titl = "Event Chart",
           y.idlabels = NA, y.axis = "auto",
           y.axis.custom.at = NA, y.axis.custom.labels = NA,
           y.julian = FALSE, y.lim.extend = c(0, 0),
           y.lab = ifelse(is.na(y.idlabels), "", as.character(y.idlabels)),
           x.axis.all = TRUE, x.axis = "auto",
           x.axis.custom.at = NA, x.axis.custom.labels = NA,
           x.julian = FALSE, x.lim.extend = c(0, 0), x.scale = 1,
           x.lab = ifelse(x.julian, "Follow-up Time", "Study Date"),
           line.by = NA, line.lty = 1, line.lwd = 1, line.col = 1,
           line.add = NA, line.add.lty = NA,
           line.add.lwd = NA, line.add.col = NA,
           point.pch = 1:length(subset.c),
           point.cex = rep(0.6, length(subset.c)),
           point.col = rep(1, length(subset.c)),
           point.cex.mult = 1., point.cex.mult.var = NA,
           extra.points.no.mult = rep(NA, length(subset.c)),
           legend.plot = FALSE, legend.location = "o", legend.titl = titl,
           legend.titl.cex = 3, legend.titl.line = 1,
           legend.point.at = list(x = c(5, 95), y = c(95, 30)),
           legend.point.pch = point.pch,
           legend.point.text = ifelse(rep(is.data.frame(data), length(subset.c)),
                                      names(data[, subset.c]),
                                      subset.c),
           legend.cex = 2.5, legend.bty = "n",
           legend.line.at = list(x = c(5, 95), y = c(20, 5)),
           legend.line.text = names(table(as.character(data[, line.by]),
                                          exclude = c("", "NA"))),
           legend.line.lwd = line.lwd, legend.loc.num = 1,
           ...)

Arguments

data

a matrix or data frame with rows corresponding to subjects and columns corresponding to variables. Note that for a data frame or matrix containing multiple time-to-event data (e.g., time to recurrence, time to death, and time to last follow-up), one column is required for each specific event.

subset.r

subset of rows of original matrix or data frame to place in event chart. Logical arguments may be used here (e.g., treatment.arm == 'a', if the data frame, data, has been attached to the search directory; otherwise, data$treatment.arm == "a").

subset.c

subset of columns of original matrix or data frame to place in event chart; if working with a data frame, a vector of data frame variable names may be used for subsetting purposes (e.g., c('randdate', 'event1')).

sort.by

column(s) or data frame variable name(s) with which to sort the chart's output. The default is NA, thereby resulting in a chart sorted by original row number.

sort.ascending

logical flag (which takes effect only if the argument sort.by is utilized). If TRUE (default), sorting is done in ascending order; if FALSE, descending order.

sort.na.last

logical flag (which takes effect only if the argument sort.by is utilized). If TRUE (default), NA values are considered as last values in ordering.

sort.after.subset

logical flag (which takes effect only if the argument sort.by is utilized). If FALSE, sorting data (via sort.by specified variables or columns) will be performed prior to row subsetting (via subset.r); if TRUE (default), row subsetting of original data will be done before sorting.

y.var

variable name or column number of original matrix or data frame with which to scale y-axis. Default is NA, which will result in equally spaced lines on y-axis (based on original data or sorted data if requested by sort.by). Otherwise, location of lines on y-axis will be dictated by specified variable or column. Examples of specified variables may be date of an event or a physiological covariate. Any observation which has a missing value for the y.var variable will not appear on the graph.

y.var.type

type of variable specified in y.var (which will only take effect if argument y.var is utilized). If "d", specified variable is a date (either numeric julian date or an S-Plus dates object); if "n", specified variable is numeric (e.g., systolic blood pressure level) although not a julian date.

y.jitter

logical flag (which takes effect only if the argument y.var is utilized). Due to potential ties in y.var variable, y.jitter (when TRUE) will jitter the data to allow discrimination between observations at the possible cost of producing slightly inaccurate dates or covariate values; if FALSE (the default), no jittering will be performed. The y.jitter algorithm assumes a uniform distribution of observations across the range of y.var. The algorithm is as follows:

size.jitter <- ( diff(range(y.var)) / (2 * (length(y.var) - 1)) ) * y.jitter.factor

The default of y.jitter.factor is 1. The entire product is then used as an argument to runif:

y.var <- y.var + runif(length(y.var), -size.jitter, size.jitter)

y.jitter.factor

an argument used with the y.jitter function to scale the range of added noise. Default is 1.

y.renum

logical flag. If TRUE, subset observations are listed on y-axis from 1 to length(subset.r); if FALSE (default), subset observations are listed on y-axis in original form. As an example, if subset.r = 301:340 and y.renum == TRUE, y-axis will be shown as 1 through 40. However, if y.renum == FALSE, y-axis will be shown as 301 through 340. The above examples assume the following argument, NA.rm, is set to FALSE.

NA.rm

logical flag. If TRUE, subset observations which have NA for each variable specified in subset.c will not have an entry on the y-axis. Also, if the following argument, x.reference, is specified, observations with missing x.reference values will also not have an entry on the y-axis. If FALSE (default), user can identify those observations which do have NA for every variable specified in subset.c (or, if x.reference is specified, also those observations which are missing only the x.reference value); this can easily be done by examining the resulting y-axis and recognizing the observations without any plotting symbols.

x.reference

column of original matrix or data frame with which to reference the x-axis. That is, if specified, x.reference will be subtracted from all columns specified in subset.c. An example may be to see the timing of events before and after treatment or to see time-to-event after entry into study. The event times will be aligned using the x.reference argument as the reference point.

now

the “now” date which will be used for top of y-axis when creating the Goldman eventchart (see reference below). Default is max(data[, subset.c], na.rm = TRUE).

now.line

logical flag. A feature utilized by the Goldman Eventchart. When x.reference is specified as the start of follow-up and y.var = x.reference, then the Goldman chart can be created. This argument, if TRUE, will cause the plot region to be square, and will draw a line with a slope of -1 from the top of the y-axis to the right end of the x-axis. Essentially, it denotes end of current follow-up period for looking at the time-to-event data. Default is FALSE.

now.line.lty

line type of now.line.

now.line.lwd

line width of now.line.

now.line.col

color of now.line.

pty

graph option, pty='m' is the default; use pty='s' for the square looking Goldman's event chart.

date.orig

date of origin to consider if dates are in julian, SAS, or S-Plus dates object format; default is January 1, 1960 (which is the default origin used by both S-Plus and SAS). Utilized when either y.julian = FALSE or x.julian = FALSE.

titl

title for event chart. Default is 'Event Chart'.

y.idlabels

column or data frame variable name used for y-axis labels. For example, if c('pt.no') is specified, patient ID (stored in pt.no) will be seen on y-axis labels instead of sequence specified by subset.r. This argument takes precedence over both y.axis = 'auto' and y.axis = 'custom' (see below). NOTE: Program will issue warning if this argument is specified and if is.na(y.var) == FALSE; y.idlabels will not be used in this situation. Also, attempting to plot too many patients on a single event chart will cause undesirable plotting of y.idlabels.

y.axis

character string specifying whether program will control labelling of y-axis (with argument "auto"), or if user will control labelling (with argument "custom"). If "custom" is chosen, user must specify location and text of labels using y.axis.custom.at and y.axis.custom.labels arguments, respectively, listed below. This argument will not be utilized if y.idlabels is specified.

y.axis.custom.at

user-specified vector of y-axis label locations. Must be used when y.axis = "custom"; will not be used otherwise.

y.axis.custom.labels

user-specified vector of y-axis labels. Must be used when y.axis = "custom"; will not be used otherwise.

y.julian

logical flag (which will only be considered if y.axis == "auto" and (!is.na(y.var) & y.var.type == "d")). If FALSE (default), will convert julian numeric dates or S-Plus dates objects into “mm/dd/yy” format for the y-axis labels. If TRUE, dates will be printed in julian (numeric) format.

y.lim.extend

two-dimensional vector representing the number of units that the user wants to increase ylim on bottom and top of y-axis, respectively. Default c(0,0). This argument will not take effect if the Goldman chart is utilized.

y.lab

single label to be used for entire y-axis. Default will be the variable name or column number of y.idlabels (if non-missing) and blank otherwise.

x.axis.all

logical flag. If TRUE (default), lower and upper limits of x-axis will be based on all observations (rows) in matrix or data frame. If FALSE, lower and upper limits will be based only on those observations specified by subset.r (either before or after sorting depending on specification of sort.by and value of sort.after.subset).

x.axis

character string specifying whether program will control labelling of x-axis (with argument "auto"), or if user will control labelling (with argument "custom"). If "custom" is chosen, user must specify location and text of labels using x.axis.custom.at and x.axis.custom.labels arguments, respectively, listed below.

x.axis.custom.at

user-specified vector of x-axis label locations. Must be used when x.axis == "custom"; will not be used otherwise.

x.axis.custom.labels

user-specified vector of x-axis labels. Must be used when x.axis == "custom"; will not be used otherwise.

x.julian

logical flag (which will only be considered if x.axis == "auto"). If FALSE (default), will convert julian dates or S-Plus dates objects into “mm/dd/yy” format for the x-axis labels. If TRUE, dates will be printed in julian (numeric) format. NOTE: This argument should remain TRUE if x.reference is specified.

x.lim.extend

two-dimensional vector representing the number of time units (usually in days) that the user wants to increase xlim on left-hand side and right-hand side of x-axis, respectively. Default is c(0,0). This argument will not take effect if the Goldman chart is utilized.

x.scale

a factor by which the original units of the x-axis are divided. For example, if the original data frame is in units of days, x.scale = 365 will result in units of years (notwithstanding leap years). Default is 1.

x.lab

single label to be used for entire x-axis. Default will be “Study Date” if x.julian = FALSE and “Follow-up Time” if x.julian = TRUE.

line.by

column or data frame variable name for plotting unique lines by unique values of vector (e.g., specify c('arm') to plot unique lines by treatment arm). Can take at most one column or variable name. Default is NA which produces identical lines for each patient.

line.lty

vector of line types corresponding to ascending order of line.by values. If line.by is specified, the vector should be the length of the number of unique values of line.by. If line.by is NA, only line.lty[1] will be used. The default is 1.

line.lwd

vector of line widths corresponding to ascending order of line.by values. If line.by is specified, the vector should be the length of the number of unique values of line.by. If line.by is NA, only line.lwd[1] will be used. The default is 1.

line.col

vector of line colors corresponding to ascending order of line.by values. If line.by is specified, the vector should be the length of the number of unique values of line.by. If line.by is NA, only line.col[1] will be used. The default is 1.

line.add

a 2xk matrix with k = number of pairs of additional line segments to add. For example, if it is of interest to draw additional line segments connecting events one and two, two and three, and four and five (possibly with different colors), an appropriate line.add argument would be matrix(c('first.event','second.event','second.event','third.event', 'fourth.event','fifth.event'), 2, 3). One line segment would be drawn between first.event and second.event, a second line segment would be drawn between second.event and third.event, and a third line segment would be drawn between fourth.event and fifth.event. Different line types, widths and colors can be specified (in arguments listed just below).

The convention used for subset.c and line.add must match (i.e., column name must be used for both or column number must be used for both).

If line.add != NA, length of line.add.lty, line.add.lwd, and line.add.col must be the same as number of pairs of additional line segments to add.

NOTE: The drawing of the original default line may be suppressed (with line.col = 0), and line.add can be used to do all the line plotting for the event chart.

line.add.lty

a kx1 vector corresponding to the columns of line.add; specifies the line types for the k line segments.

line.add.lwd

a kx1 vector corresponding to the columns of line.add; specifies the line widths for the k line segments.

line.add.col

a kx1 vector corresponding to the columns of line.add; specifies the line colors for the k line segments.

point.pch

vector of pch values for points representing each event. If similar events are listed in multiple columns (e.g., regular visits or a recurrent event), repeated pch values may be listed in the vector (e.g., c(2,4,rep(183,3))). If length(point.pch) < length(subset.c), point.pch will be repeated until lengths are equal; a warning message will verify this condition.

point.cex

vector of size of points representing each event. If length(point.cex) < length(subset.c), point.cex will be repeated until lengths are equal; a warning message will verify this condition.

point.col

vector of colors of points representing each event. If length(point.col) < length(subset.c), point.col will be repeated until lengths are equal; a warning message will verify this condition.

point.cex.mult

a single number (may be non-integer), which is the base multiplier for the value of the cex of the plotted points, when interest lies in a variable size allowed for certain points, as a function of the quantity of the variable(s) in the dataset specified in the point.cex.mult.var argument; multiplied by original point.cex value and then the value of interest (for an individual) from the point.cex.mult.var argument; used only when non-NA arguments are provided to point.cex.mult.var; default is 1.

point.cex.mult.var

vector of variables to be used in determining what point.cex.mult is multiplied byfor determining size of plotted points from (possibly a subset of)subset.c variables, when interest lies in a variable size allowed for certain points, as a function ofthe level of some variable(s) in the dataset;default isNA.

extra.points.no.mult

vector of variables in the dataset to ignore for purposes of usingpoint.cex.mult; for example, for some variables there may be interest inallowing a variable size allowed for the plotting of the points, whereasother variables (e.g., dropout time), there may be no interest in such manipulation;the vector should be the same size as the number of variables specified insubset.c,withNA entries where variable point size is of interest and the variable name (or location insubset.c) specified when the variablepoint size is not of interest; in this latter case, the associated argument inpoint.cex is instead used as the pointcex;used only when non-NA arguments are provided topoint.cex.mult.var;default isNA

legend.plot

logical flag; if TRUE, a legend will be plotted. Location of the legend will be based on the specification of legend.location along with values of other arguments listed below. Default is FALSE (i.e., no legend plotting).

legend.location

will be used only if legend.plot = TRUE. If "o" (default), a one-page legend will precede the output of the chart. The user will need to press Enter in order for the event chart to be displayed. This feature is possible due to the dev.ask option. If "i", an internal legend will be placed in the plot region based on legend.point.at. If "l", a legend will be placed in the plot region using the locator option. The legend will map points to events (via column names, by default) and, if line.by is specified, lines to groups (based on levels of line.by).

legend.titl

title for the legend; default is the title to be used for the main plot. Only used when legend.location = "o".

legend.titl.cex

size of text for legend title. Only used when legend.location = "o".

legend.titl.line

line location of legend title dictated by the mtext function with the outer = FALSE option; default is 1.0. Only used when legend.location = "o".

legend.point.at

location of upper left and lower right corners of the legend area to be utilized for describing events via points and text.

legend.point.pch

vector of pch values for points representing each event in the legend. Default is point.pch.

legend.point.text

text to be used for describing events; the default is set up for a data frame, as it will print the names of the columns specified by subset.c.

legend.cex

size of text for points and event descriptions. Default is 2.5, which is set up for legend.location = "o". A much smaller cex is recommended (possibly 0.75) for use with legend.location = "i" or legend.location = "l".

legend.bty

option to put a box around the legend(s); default is to have no box (legend.bty = "n"). Option legend.bty = "o" will produce a legend box.

legend.line.at

if line.by was specified (with legend.location = "o" or legend.location = "i"), this argument will dictate the location of the upper left and lower right corners of the legend area to be utilized for describing the different line.by values (e.g., treatment.arm). The default is set up for legend.location = "o".

legend.line.text

text to be used for describing line.by values; the default is the names of the unique non-missing line.by values as produced by the table function.

legend.line.lwd

vector of line widths corresponding toline.by values.

legend.loc.num

number used for the locator argument when legend.location = "l". If 1 (default), the user is to locate only the top left corner of the legend box. If 2, the user is to locate both the top left corner and the lower right corner. This will be done twice when line.by is specified (once for points and once for lines).

...

additional par arguments for use in main plot.
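The size rule described for point.cex.mult above can be worked through by hand. The following is a minimal base-R sketch, assuming (as the argument descriptions state) that the plotted size is the product of point.cex, point.cex.mult, and the individual's value of the point.cex.mult.var variable. It is an illustration of the arithmetic only, not the internal code of event.chart, and the object name plotted.cex is made up for this example.

```r
# Values borrowed from the viral-load example in this help file
point.cex      <- 1          # base size for the 'diagdate' symbol
point.cex.mult <- 0.00002    # base multiplier
viralload      <- c(13000, 36000, 70000, 140000)

# Assumed rule: size = point.cex * point.cex.mult * value of the variable
plotted.cex <- point.cex * point.cex.mult * viralload
plotted.cex
```

With this multiplier, viral loads between about 10,000 and 140,000 copies/ml map to symbol sizes between roughly 0.2 and 2.8, which is why such a small multiplier is used in the examples.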

Details

If you want to put, say, two event charts side by side in a plot region, you should not set up par(mfrow=c(1,2)) before running the first plot. Instead, you should add the argument mfg=c(1,1,1,2) to the first plot call followed by the argument mfg=c(1,2,1,2) to the second plot call.

If dates in the original data frame are in a specialized form (e.g., mm/dd/yy) of mode CHARACTER, the user must convert those columns to class dates or julian numeric mode (see Date for more information). For example, in a data frame called testdata, with specialized dates in columns 4 thru 10, the following code could be used: as.numeric(dates(testdata[,4:10])). This will convert the columns to numeric julian dates based on the function's default origin of January 1, 1960. If original dates are in class dates or julian form, no extra work is necessary.
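As an alternative to the dates() call mentioned above, the same kind of conversion can be sketched with base R alone. This is a hedged illustration, assuming character dates with four-digit years and an explicit origin of January 1, 1960; the object names chardates, d and julian are made up for the example. Two-digit years are avoided here because the %y format maps 00 to 68 onto 20xx.

```r
# Character dates in mm/dd/yyyy form (illustrative values)
chardates <- c("01/01/1960", "05/01/1998")

# Parse with base R, then count days since January 1, 1960
d      <- as.Date(chardates, format = "%m/%d/%Y")
julian <- as.numeric(d - as.Date("1960-01-01"))
julian
```

The second value agrees with the note in the Examples that julian = 14000 corresponds to May 1, 1998.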

In survival analysis, the data typically come in two columns: one column containing survival time and the other containing a censoring indicator or event code. The event.convert function converts this type of data into multiple columns of event times, one column for each event type, suitable for the event.chart function.

Side Effects

an event chart is created on the current graphics device. If legend.plot = TRUE and legend.location = 'o', a one-page legend will precede the event chart. Please note that par parameters on completion of the function will be reset to the par parameters existing prior to the start of the function.

Author(s)

J. Jack Lee and Kenneth R. Hess
Department of Biostatistics
University of Texas
M.D. Anderson Cancer Center
Houston, TX 77030
jjlee@mdanderson.org, khess@mdanderson.org

Joel A. Dubin
Department of Statistics
University of Waterloo
jdubin@uwaterloo.ca

References

Lee, J.J., Hess, K.R., Dubin, J.A. (2000). Extensions and applications of event charts. The American Statistician, 54:1, 63–70.

Dubin, J.A., Lee, J.J., Hess, K.R. (1997). The Utility of Event Charts. Proceedings of the Biometrics Section, American Statistical Association.

Dubin, J.A., Muller, H.-G., Wang, J.-L. (2001). Event history graphs for censored survival data. Statistics in Medicine, 20: 2951–2964.

Goldman, A.I. (1992). EVENTCHARTS: Visualizing Survival and Other Timed-Events Data. The American Statistician, 46:1, 13–18.

See Also

event.history, Date

Examples

# The sample data set is an augmented CDC AIDS dataset (ASCII)
# which is used in the examples in the help file.  This dataset is
# described in Kalbfleisch and Lawless (JASA, 1989).
# Here, we have included only children 4 years old and younger.
# We have also added a new field, dethdate, which
# represents a fictitious death date for each patient.  There was
# no recording of death date on the original dataset.  In addition, we have
# added a fictitious viral load reading (copies/ml) for each patient at time of AIDS diagnosis,
# noting viral load was also not part of the original dataset.
#
# All dates are julian with julian=0 being
# January 1, 1960, and julian=14000 being 14000 days beyond
# January 1, 1960 (i.e., May 1, 1998).

cdcaids <- data.frame(
age=c(4,2,1,1,2,2,2,4,2,1,1,3,2,1,3,2,1,2,4,2,2,1,4,2,4,1,4,2,1,1,3,3,1,3),
infedate=c(7274,7727,7949,8037,7765,8096,8186,7520,8522,8609,8524,8213,8455,8739,8034,8646,8886,8549,8068,8682,8612,9007,8461,8888,8096,9192,9107,9001,9344,9155,8800,8519,9282,8673),
diagdate=c(8100,8158,8251,8343,8463,8489,8554,8644,8713,8733,8854,8855,8863,8983,9035,9037,9132,9164,9186,9221,9224,9252,9274,9404,9405,9433,9434,9470,9470,9472,9489,9500,9585,9649),
diffdate=c(826,431,302,306,698,393,368,1124,191,124,330,642,408,244,1001,391,246,615,1118,539,612,245,813,516,1309,241,327,469,126,317,689,981,303,976),
dethdate=c(8434,8304,NA,8414,8715,NA,8667,9142,8731,8750,8963,9120,9005,9028,9445,9180,9189,9406,9711,9453,9465,9289,9640,9608,10010,9488,9523,9633,9667,9547,9755,NA,9686,10084),
censdate=c(NA,NA,8321,NA,NA,8519,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,10095,NA,NA),
viralload=c(13000,36000,70000,90000,21000,110000,75000,12000,125000,110000,13000,39000,79000,135000,14000,42000,123000,20000,12000,18000,16000,140000,16000,58000,11000,120000,85000,31000,24000,115000,17000,13100,72000,13500))

cdcaids <- upData(cdcaids,
 labels=c(age     ='Age, y', infedate='Date of blood transfusion',
          diagdate='Date of AIDS diagnosis',
          diffdate='Incubation period (days from HIV to AIDS)',
          dethdate='Fictitious date of death',
          censdate='Fictitious censoring date',
          viralload='Fictitious viral load'))

# Note that the style options listed with these
# examples are best suited for output to a postscript file (i.e., using
# the postscript function with horizontal=TRUE) as opposed to a graphical
# window (e.g., motif).

# To produce simple calendar event chart (with internal legend):
# postscript('example1.ps', horizontal=TRUE)
event.chart(cdcaids,
  subset.c=c('infedate','diagdate','dethdate','censdate'),
  x.lab = 'observation dates',
  y.lab='patients (sorted by AIDS diagnosis date)',
  titl='AIDS data calendar event chart 1',
  point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8),
  legend.plot=TRUE, legend.location='i', legend.cex=1.0,
  legend.point.text=c('transfusion','AIDS diagnosis','death','censored'),
  legend.point.at = list(c(7210, 8100), c(35, 27)), legend.bty='o')

# To produce simple interval event chart (with internal legend):
# postscript('example2.ps', horizontal=TRUE)
event.chart(cdcaids,
  subset.c=c('infedate','diagdate','dethdate','censdate'),
  x.lab = 'time since transfusion (in days)',
  y.lab='patients (sorted by AIDS diagnosis date)',
  titl='AIDS data interval event chart 1',
  point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8),
  legend.plot=TRUE, legend.location='i', legend.cex=1.0,
  legend.point.text=c('transfusion','AIDS diagnosis','death','censored'),
  x.reference='infedate', x.julian=TRUE,
  legend.bty='o', legend.point.at = list(c(1400, 1950), c(7, -1)))

# To produce simple interval event chart (with internal legend),
# but now with flexible diagdate symbol size based on viral load variable:
# postscript('example2a.ps', horizontal=TRUE)
event.chart(cdcaids,
  subset.c=c('infedate','diagdate','dethdate','censdate'),
  x.lab = 'time since transfusion (in days)',
  y.lab='patients (sorted by AIDS diagnosis date)',
  titl='AIDS data interval event chart 1a, with viral load at diagdate represented',
  point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8),
  point.cex.mult = 0.00002, point.cex.mult.var = 'viralload',
  extra.points.no.mult = c(1,NA,1,1),
  legend.plot=TRUE, legend.location='i', legend.cex=1.0,
  legend.point.text=c('transfusion','AIDS diagnosis','death','censored'),
  x.reference='infedate', x.julian=TRUE,
  legend.bty='o', legend.point.at = list(c(1400, 1950), c(7, -1)))

# To produce more complicated interval chart which is
# referenced by infection date, and sorted by age and incubation period:
# postscript('example3.ps', horizontal=TRUE)
event.chart(cdcaids,
  subset.c=c('infedate','diagdate','dethdate','censdate'),
  x.lab = 'time since diagnosis of AIDS (in days)',
  y.lab='patients (sorted by age and incubation length)',
  titl='AIDS data interval event chart 2 (sorted by age, incubation)',
  point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8),
  legend.plot=TRUE, legend.location='i',legend.cex=1.0,
  legend.point.text=c('transfusion','AIDS diagnosis','death','censored'),
  x.reference='diagdate', x.julian=TRUE, sort.by=c('age','diffdate'),
  line.by='age', line.lty=c(1,3,2,4), line.lwd=rep(1,4), line.col=rep(1,4),
  legend.bty='o', legend.point.at = list(c(-1350, -800), c(7, -1)),
  legend.line.at = list(c(-1350, -800), c(16, 8)),
  legend.line.text=c('age = 1', '       = 2', '       = 3', '       = 4'))

# To produce the Goldman chart:
# postscript('example4.ps', horizontal=TRUE)
event.chart(cdcaids,
  subset.c=c('infedate','diagdate','dethdate','censdate'),
  x.lab = 'time since transfusion (in days)', y.lab='dates of observation',
  titl='AIDS data Goldman event chart 1',
  y.var = c('infedate'), y.var.type='d', now.line=TRUE, y.jitter=FALSE,
  point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8), mgp = c(3.1,1.6,0),
  legend.plot=TRUE, legend.location='i',legend.cex=1.0,
  legend.point.text=c('transfusion','AIDS diagnosis','death','censored'),
  x.reference='infedate', x.julian=TRUE,
  legend.bty='o', legend.point.at = list(c(1500, 2800), c(9300, 10000)))

# To convert coded time-to-event data, then, draw an event chart:
surv.time <- c(5,6,3,1,2)
cens.ind   <- c(1,0,1,1,0)
surv.data  <- cbind(surv.time,cens.ind)
event.data <- event.convert(surv.data)
event.chart(cbind(rep(0,5),event.data),x.julian=TRUE,x.reference=1)

Event Conversion for Time-to-Event Data

Description

Convert a two-column data matrix with event time and event code into a multiple-column matrix of event times, with one event type in each column.

Usage

event.convert(data2, event.time = 1, event.code = 2)

Arguments

data2

a matrix or data frame with at least 2 columns; by default, the first column contains the event time and the second column contains the k event codes (e.g. 1=dead, 0=censored)

event.time

the number of the column in data2 that contains the event time

event.code

the number of the column in data2 that contains the event code

Details

In survival analysis, the data typically come in two columns: one column containing survival time and the other containing a censoring indicator or event code. The event.convert function converts this type of data into multiple columns of event times, one column for each event type, suitable for the event.chart function.
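The reshaping idea described above can be illustrated without Hmisc. The following is a minimal base-R sketch, assuming one output column per distinct event code with NA where a subject did not have that code. The helper split_by_code is hypothetical and is not part of Hmisc; the exact layout and column names returned by event.convert may differ.

```r
surv.time <- c(5, 6, 3, 1, 2)
cens.ind  <- c(1, 0, 1, 1, 0)

# Hypothetical helper (not part of Hmisc): one column of times per event code
split_by_code <- function(time, code) {
  codes <- sort(unique(code))
  # For each code, keep the time where the code matches, NA elsewhere
  out <- sapply(codes, function(k) ifelse(code == k, time, NA))
  colnames(out) <- paste0("code.", codes)
  out
}

wide <- split_by_code(surv.time, cens.ind)
wide
```

Each row of the result holds a subject's time in exactly one column, which is the wide form that an event chart needs in order to assign a separate plotting symbol to each event type.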

Author(s)

J. Jack Lee and Kenneth R. Hess
Department of Biostatistics
University of Texas
M.D. Anderson Cancer Center
Houston, TX 77030
jjlee@mdanderson.org, khess@mdanderson.org

Joel A. Dubin
Department of Statistics
University of Waterloo
jdubin@uwaterloo.ca

See Also

event.history, Date, event.chart

Examples

# To convert coded time-to-event data, then, draw an event chart:
surv.time <- c(5,6,3,1,2)
cens.ind   <- c(1,0,1,1,0)
surv.data  <- cbind(surv.time,cens.ind)
event.data <- event.convert(surv.data)
event.chart(cbind(rep(0,5),event.data),x.julian=TRUE,x.reference=1)

Produces event.history graph for survival data

Description

Produces an event history graph for right-censored survival data, including time-dependent covariate status, as described in Dubin, Muller, and Wang (2001). Effectively, a Kaplan-Meier curve is produced with supplementary information regarding individual survival information, censoring information, and status over time of an individual time-dependent covariate or time-dependent covariate function for both uncensored and censored individuals.

Usage

event.history(data, survtime.col, surv.col,
              surv.ind = c(1, 0), subset.rows = NULL,
              covtime.cols = NULL, cov.cols = NULL,
              num.colors = 1, cut.cov = NULL, colors = 1,
              cens.density = 10, mult.end.cens = 1.05,
              cens.mark.right = FALSE, cens.mark = "-",
              cens.mark.ahead = 0.5, cens.mark.cutoff = -1e-08,
              cens.mark.cex = 1,
              x.lab = "time under observation",
              y.lab = "estimated survival probability",
              title = "event history graph", ...)

Arguments

data

A matrix or data frame with rows corresponding to units (often individuals) and columns corresponding to survival time and event/censoring indicator. Also, multiple columns may be devoted to time-dependent covariate level and time of change.

survtime.col

Column (in data) representing the minimum of time-to-event or right-censoring time for an individual.

surv.col

Column (in data) representing the event indicator for an individual. Though, traditionally, such an indicator will be 1 for an event and 0 for a censored observation, this indicator can be represented by any two numbers, made explicit by the surv.ind argument.

surv.ind

Two-element vector representing, respectively, the number for an event, as listed in surv.col, followed by the number for a censored observation. Default is the traditional survival data representation, i.e., c(1,0).

subset.rows

Subset of rows of the original matrix or data frame (data) to place in the event history graph. Logical arguments may be used here (e.g., treatment.arm == "a", if the data frame, data, has been attached to the search directory).

covtime.cols

Column(s) (in data) representing the time when a change of the time-dependent covariate (or time-dependent covariate function) occurs. There should be a unique non-NA entry in the column for each such change (along with a corresponding cov.cols column entry representing the value of the covariate or function at that change time). Default is NULL, meaning no time-dependent covariate information will be presented in the graph.

cov.cols

Column(s) (in data) representing the level of the time-dependent covariate (or time-dependent covariate function). There should be a unique non-NA column entry representing each change in the level (along with a corresponding covtime.cols column entry representing the time of the change). Default is NULL, meaning no time-dependent covariate information will be presented in the graph.

num.colors

Colors are utilized for the time-dependent covariate level for an individual. This argument provides the number of unique covariate levels which will be displayed by mapping the number of colors (via num.colors) to the number of desired covariate levels. This will divide the covariate span into roughly equally-sized intervals, via the S-Plus cut function. Default is one color, meaning no time-dependent information will be presented in the graph. Note that this argument will be ignored/superseded if a non-NULL argument is provided for the cut.cov parameter.

cut.cov

This argument allows the user to explicitly state how to define the intervals for the time-dependent covariate, such that different colors will be allocated to the user-defined covariate levels. For example, for plotting five colors, six ordered points within the span of the data's covariate levels should be provided. Default is NULL, meaning that the num.colors argument value will dictate the number of breakpoints, with the covariate span divided into roughly equally-sized intervals via the S-Plus cut function. However, if is.null(cut.cov) == FALSE, then this argument supersedes any entry for the num.colors argument.

colors

This is a vector argument defining the actual colors used for the time-dependent covariate levels in the plot, with the index of this vector corresponding to the ordered levels of the covariate. The number of colors (i.e., the length of the colors vector) should correspond to the value provided to the num.colors argument or the number of ordered points - 1 as defined in the cut.cov argument (with cut.cov superseding num.colors if is.null(cut.cov) == FALSE). The function, as currently written, allows for as many as twenty distinct colors. This argument effectively feeds into the col argument for the S-Plus polygon function. Default is colors = 1. See the col argument for both the S-Plus par function and polygon function for more information.

cens.density

This will provide the shading density at the end of the individual bars for those who are censored. For more information on shading density, see the density argument in the S-Plus polygon function. Default is cens.density=10.

mult.end.cens

This is a multiplier that extends the length of the longest surviving individual bar (or bars, if a tie exists) if right-censored, presuming that no event times eventually follow this final censored time. Default extends the length 5 percent beyond the length of the observed right-censored survival time.

cens.mark.right

A logical argument that states whether an explicit mark should be placed to the right of the individual right-censored survival bars. This argument is most useful for large sample sizes, where it may be hard to detect the special shading via cens.density, particularly for the short-term survivors.

cens.mark

Character argument which describes the censored mark that should be used if cens.mark.right = TRUE. Default is "-".

cens.mark.ahead

A numeric argument, which specifies the absolute distance to be placed between the individual right-censored survival bars and the mark as defined in the above cens.mark argument. Default is 0.5 (that is, half a day, if survival time is measured in days), but may very well need adjusting depending on the maximum survival time observed in the dataset.

cens.mark.cutoff

A negative number very close to 0 (by default cens.mark.cutoff = -1e-8) to ensure that the censoring marks get plotted correctly. See the event.history code in order to see its usage. This argument typically will not need adjustment.

cens.mark.cex

Numeric argument defining the size of the mark defined in the cens.mark argument above. See more information by viewing the cex argument for the S-Plus par function. Default is cens.mark.cex = 1.0.

x.lab

Single label to be used for the entire x-axis. Default is "time under observation".

y.lab

Single label to be used for the entire y-axis. Default is "estimated survival probability".

title

Title for the event history graph. Default is "event history graph".

...

This allows arguments to be passed to the plot function call within the event.history function. So, for example, the axes representations can be manipulated with appropriate arguments, or particular areas of the event.history graph can be “zoomed”. See the details section for more comments about zooming.
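The relationship between num.colors, cut.cov and the number of intervals can be illustrated with the base R cut function. This is a minimal sketch with made-up covariate values, assuming that num.colors corresponds to equal-width binning and that cut.cov supplies explicit breakpoints; it does not reproduce the internals of event.history. The quantile-based breakpoints mirror the pattern used for cut.cov in the Examples.

```r
# Illustrative covariate values (not from any real dataset)
cov.value <- c(12, 35, 47, 58, 63, 71, 88, 95)

# Equal-width binning, as would be requested through num.colors = 4
by.width <- cut(cov.value, breaks = 4)
nlevels(by.width)

# User-defined breakpoints, as would be supplied through cut.cov:
# six ordered points give five intervals, hence five colors
brk <- as.numeric(quantile(cov.value, c(0, .2, .4, .6, .8, 1))) +
  c(-1, 0, 0, 0, 0, 1)
by.quant <- cut(cov.value, breaks = brk)
nlevels(by.quant)
```

Widening the first and last breakpoints by 1, as done here, keeps the minimum and maximum covariate values inside the outer intervals.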

Details

In order to focus on a particular area of the event history graph, zooming can be performed. This is best done by specifying appropriate xlim and ylim arguments at the end of the event.history function call, taking advantage of the ... argument link to the plot function. An example of zooming can be seen in Plate 4 of the paper referenced below.

Please read the reference below to understand how the individual covariate and survival information is provided in the plot, how ties are handled, how right-censoring is handled, etc.

WARNING

This function has been tested thoroughly, but only within a restricted version and environment, i.e., only within S-Plus 2000, Version 3, and within S-Plus 6.0, version 2, both on a Windows 2000 machine. Hence, we cannot currently vouch for the function's effectiveness in other versions of S-Plus (e.g., S-Plus 3.4) nor in other operating environments (e.g., Windows 95, Linux or Unix). The function has also been verified to work on R under Linux.

Note

The authors have found better control of the use of color by producing the graphs via the postscript plotting device in S-Plus. In fact, the provided examples utilize the postscript function. However, your past experiences may be different, and you may prefer to control color directly (to the graphsheet in a Windows environment, for example). The event.history function will work with either approach.

Author(s)

Joel Dubin
jdubin@uwaterloo.ca

References

Dubin, J.A., Muller, H.-G., and Wang, J.-L. (2001). Event history graphs for censored survival data. Statistics in Medicine, 20, 2951-2964.

See Also

plot, polygon, event.chart, par

Examples

# Code to produce event history graphs for SIM paper
#
# before generating plots, some pre-processing needs to be performed,
#  in order to get dataset in proper form for event.history function;
#  need to create one line per subject and sort by time under observation,
#  with those experiencing event coming before those tied with censoring time;
require('survival')
data(heart)

# creation of event.history version of heart dataset (call heart.one):
heart.one <- matrix(nrow=length(unique(heart$id)), ncol=8)
for(i in 1:length(unique(heart$id))) {
  if(length(heart$id[heart$id==i]) == 1)
   heart.one[i,] <- as.numeric(unlist(heart[heart$id==i, ]))
  else if(length(heart$id[heart$id==i]) == 2)
   heart.one[i,] <- as.numeric(unlist(heart[heart$id==i,][2,]))
 }

heart.one[,3][heart.one[,3] == 0] <- 2 ## converting censored events to 2, from 0
if(is.factor(heart$transplant))
 heart.one[,7] <- heart.one[,7] - 1
 ## getting back to correct transplantation coding
heart.one <- as.data.frame(heart.one[order(unlist(heart.one[,2]), unlist(heart.one[,3])),])
names(heart.one) <- names(heart)
# back to usual censoring indicator:
heart.one[,3][heart.one[,3] == 2] <- 0
# note: transplant says 0 (for no transplants) or 1 (for one transplant)
#        and event = 1 is death, while event = 0 is censored

# plot single Kaplan-Meier curve from heart data, first creating survival object
heart.surv <- survfit(Surv(stop, event) ~ 1, data=heart.one, conf.int = FALSE)

# figure 3: traditional Kaplan-Meier curve
# postscript('ehgfig3.ps', horiz=TRUE)
# omi <- par(omi=c(0,1.25,0.5,1.25))
plot(heart.surv, ylab='estimated survival probability',
     xlab='observation time (in days)')
title('Figure 3: Kaplan-Meier curve for Stanford data', cex=0.8)
# dev.off()

## now, draw event history graph for Stanford heart data; use as Figure 4
# postscript('ehgfig4.ps', horiz=TRUE, colors = seq(0, 1, len=20))
# par(omi=c(0,1.25,0.5,1.25))
event.history(heart.one,
  survtime.col=heart.one[,2], surv.col=heart.one[,3],
  covtime.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,1]),
  cov.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,7]),
  num.colors=2, colors=c(6,10),
  x.lab = 'time under observation (in days)',
  title='Figure 4: Event history graph for\nStanford data',
  cens.mark.right =TRUE, cens.mark = '-',
  cens.mark.ahead = 30.0, cens.mark.cex = 0.85)
# dev.off()

# now, draw age-stratified event history graph for Stanford heart data;
#  use as Figure 5
# two plots, stratified by age status
# postscript('c:\temp\ehgfig5.ps', horiz=TRUE, colors = seq(0, 1, len=20))
# par(omi=c(0,1.25,0.5,1.25))
par(mfrow=c(1,2))

event.history(data=heart.one, subset.rows = (heart.one[,4] < 0),
  survtime.col=heart.one[,2], surv.col=heart.one[,3],
  covtime.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,1]),
  cov.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,7]),
  num.colors=2, colors=c(6,10),
  x.lab = 'time under observation\n(in days)',
  title = 'Figure 5a:\nStanford data\n(age < 48)',
  cens.mark.right =TRUE, cens.mark = '-',
  cens.mark.ahead = 40.0, cens.mark.cex = 0.85,
  xlim=c(0,1900))

event.history(data=heart.one, subset.rows = (heart.one[,4] >= 0),
  survtime.col=heart.one[,2], surv.col=heart.one[,3],
  covtime.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,1]),
  cov.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,7]),
  num.colors=2, colors=c(6,10),
  x.lab = 'time under observation\n(in days)',
  title = 'Figure 5b:\nStanford data\n(age >= 48)',
  cens.mark.right =TRUE, cens.mark = '-',
  cens.mark.ahead = 40.0, cens.mark.cex = 0.85,
  xlim=c(0,1900))
# dev.off()
# par(omi=omi)

# we will not show liver cirrhosis data manipulation, as it was
#  a bit detailed; however, here is the
#  event.history code to produce Figure 7 / Plate 1

# Figure 7 / Plate 1 : prothrombin ehg with color
## Not run: 
second.arg <- 1   ### second.arg is for shading
third.arg <- c(rep(1,18),0,1)   ### third.arg is for intensity

# postscript('c:\temp\ehgfig7.ps', horiz=TRUE,
# colors = cbind(seq(0, 1, len = 20), second.arg, third.arg))
# par(omi=c(0,1.25,0.5,1.25), col=19)
event.history(cirrhos2.eh, subset.rows = NULL,
  survtime.col=cirrhos2.eh$time, surv.col=cirrhos2.eh$event,
  covtime.cols = as.matrix(cirrhos2.eh[, ((2:18)*2)]),
  cov.cols = as.matrix(cirrhos2.eh[, ((2:18)*2) + 1]),
  cut.cov =  as.numeric(quantile(as.matrix(cirrhos2.eh[, ((2:18)*2) + 1]),
    c(0,.2,.4,.6,.8,1), na.rm=TRUE) + c(-1,0,0,0,0,1)),
  colors=c(20,4,8,11,14),
  x.lab = 'time under observation (in days)',
  title='Figure 7: Event history graph for liver cirrhosis data (color)',
  cens.mark.right =TRUE, cens.mark = '-',
  cens.mark.ahead = 100.0, cens.mark.cex = 0.85)
# dev.off()

## End(Not run)

extractlabs

Description

Extract Labels and Units From Multiple Datasets

Usage

extractlabs(..., print = TRUE)

Arguments

...

one or more data frames or data tables

print

set to FALSE to not print details about variables with conflicting attributes

Details

For one or more data frames/tables, extracts all labels and units and combines them over datasets, dropping any variables not having either labels or units defined. The resulting data table is returned and is used by the hlab function if the user stores the result in an object named LabelsUnits. The result is NULL if no variable in any dataset has a non-blank label or units. Variables found in more than one dataset with duplicate label and units are consolidated. A warning message is issued when duplicate variables have conflicting labels or units, and by default, details are printed. No attempt is made to resolve these conflicts.
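The consolidation rule described above can be sketched in base R using plain "label" and "units" attributes. This is an illustration only: get_attr and collect_attrs are hypothetical helpers, not Hmisc functions, and the sketch returns an ordinary data frame rather than the data table that extractlabs produces.

```r
# Two illustrative data frames carrying 'label' and 'units' attributes
d1 <- data.frame(x = 1:3, y = c(0.1, 0.2, 0.3))
attr(d1$x, "label") <- "X"; attr(d1$x, "units") <- "mmHg"
attr(d1$y, "label") <- "Y"
d2 <- data.frame(x = 4:6, z = 7:9)
attr(d2$x, "label") <- "X"; attr(d2$x, "units") <- "cm"   # conflicts with d1$x

# Hypothetical helper: attribute value, or "" when the attribute is absent
get_attr <- function(v, what) {
  a <- attr(v, what, exact = TRUE)
  if (is.null(a)) "" else a
}

# Hypothetical helper: one row per distinct (name, label, units) combination
collect_attrs <- function(...) {
  rows <- lapply(list(...), function(d)
    data.frame(name  = names(d),
               label = unname(vapply(d, get_attr, "", what = "label")),
               units = unname(vapply(d, get_attr, "", what = "units"))))
  out <- unique(do.call(rbind, rows))          # consolidate exact duplicates
  out[out$label != "" | out$units != "", ]     # drop variables with neither
}

res <- collect_attrs(d1, d2)
res

# Variables that still appear more than once have conflicting attributes
conflict <- names(which(table(res$name) > 1))
conflict
```

Here z is dropped because it has neither a label nor units, and x is reported as a conflict because its units differ between the two data frames.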

Value

a data table

Author(s)

Frank Harrell

See Also

label(), contents(), units(), hlab()

Examples

d <- data.frame(x=1:10, y=(1:10)/10)
d <- upData(d, labels=c(x='X', y='Y'), units=c(x='mmHg'), print=FALSE)
d2 <- d
units(d2$x) <- 'cm'
LabelsUnits <- extractlabs(d, d2)
LabelsUnits

fImport

Description

General File Import Using rio

Usage

fImport(
  file,
  format,
  lowernames = c("not mixed", "no", "yes"),
  und. = FALSE,
  ...
)

Arguments

file

name of file to import, or full URL. rio determines the file type from the file suffix unless you override this with format

format

format of file to import, usually not needed. See rio::import() for details

lowernames

defaults to changing variable names to all lower case unless the name has mixed upper and lower case, which results in keeping the original characters in the name. Set lowernames='no' to leave variable names as they were created in the original file export, or set lowernames='yes' to set all names to lower case whether they have mixed case or not. For all options, a check is made to see if the name conversions would result in any duplicate names. If so, the original names are retained and a warning message is issued.

und.

set to TRUE to change all underscores in names to periods

...

more arguments to pass to rio::import()

Details

This is a front-end for the rio package's import function. fImport includes options for setting variable names to lower case and for changing underscores in names to periods. Variables on the imported data frame that have labels are converted to the Hmisc package labelled class so that subsetting the data frame will preserve the labels.

Value

a data frame created by rio, unless a rio option is given to use another format
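The name-handling rules documented for lowernames and und. can be sketched as a small base-R function. The helper convert_names below is hypothetical and is not part of Hmisc or rio. It assumes that "mixed case" means a name containing both upper- and lower-case letters and that the duplicate check is applied after all conversions; the actual order of operations inside fImport is not confirmed here.

```r
# Hypothetical helper (not part of Hmisc) mimicking the documented naming rule
convert_names <- function(nm, lowernames = c("not mixed", "no", "yes"),
                          und. = FALSE) {
  lowernames <- match.arg(lowernames)
  new <- nm
  if (lowernames == "yes") {
    new <- tolower(nm)
  } else if (lowernames == "not mixed") {
    # A name is mixed if it differs from both its all-lower and all-upper form
    mixed <- nm != tolower(nm) & nm != toupper(nm)
    new[!mixed] <- tolower(nm[!mixed])
  }
  if (und.) new <- gsub("_", ".", new, fixed = TRUE)
  if (anyDuplicated(new)) {
    warning("name conversion would create duplicates; keeping original names")
    return(nm)
  }
  new
}

converted <- convert_names(c("AGE", "SysBP", "body_wt"), und. = TRUE)
converted
```

In this sketch the all-capitals name is lower-cased, the mixed-case name is left alone, and the underscore becomes a period.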

Author(s)

Frank Harrell

See Also

upData, especially the moveUnits option

Examples

## Not run: 
# Get a Stata dataset
d <- fImport('http://www.principlesofeconometrics.com/stata/alcohol.dta')
contents(d)

## End(Not run)

Find Close Matches

Description

Compares each row in x against all the rows in y, finding rows in y with all columns within a tolerance of the values in a given row of x. The default tolerance tol is zero, i.e., an exact match is required on all columns. For qualifying matches, a distance measure is computed. This is the sum of squares of differences between x and y after scaling the columns. The default scaling values are tol, and for columns with tol=0 the scale values are set to 1.0 (since they are ignored anyway). Matches (up to maxmatch of them) are stored and listed in order of increasing distance.
The summary method prints a frequency distribution of the number of matches per observation in x, the median of the minimum distances for all matches per x, as a function of the number of matches, and the frequency of selection of duplicate observations as those having the smallest distance. The print method prints the entire matches and distance components of the result from find.matches.
matchCases finds all controls that match cases on a single variable x within a tolerance of tol. This is intended for prospective cohort studies that use matching for confounder adjustment (even though regression models usually work better).
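The matching rule for find.matches can be sketched for a single row of x in base R. The helper match_row is hypothetical and is not the Hmisc implementation. It assumes that a row of y qualifies only if every column is within its tolerance, that the distance is the sum of squared scaled differences, and that columns with zero tolerance are given a scale of 1 to avoid division by zero.

```r
# Hypothetical helper (not part of Hmisc): matches in y for one row of x
match_row <- function(xrow, y, tol, scale = tol, maxmatch = 10) {
  scale[scale == 0] <- 1                       # assumed handling of zero tolerance
  d    <- abs(sweep(y, 2, xrow))               # |y - x|, column by column
  ok   <- which(rowSums(sweep(d, 2, tol, "<=")) == ncol(y))  # all columns within tol
  dist <- rowSums(sweep(d[ok, , drop = FALSE], 2, scale, "/")^2)
  o    <- order(dist)[seq_len(min(length(ok), maxmatch))]
  list(matches = ok[o], distance = dist[o])    # closest matches first
}

y <- rbind(c(0.10, 0.20),
           c(0.13, 0.26),    # second column is outside the tolerance
           c(0.50, 0.50),
           c(0.11, 0.19))
res <- match_row(c(0.10, 0.20), y, tol = c(0.05, 0.05))
res
```

Rows 1 and 4 of y qualify; row 1 is an exact match with distance 0, and row 4 follows with the sum of two squared scaled differences of 0.2 each.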

Usage

find.matches(x, y, tol=rep(0, ncol(y)), scale=tol, maxmatch=10)

## S3 method for class 'find.matches'
summary(object, ...)

## S3 method for class 'find.matches'
print(x, digits, ...)

matchCases(xcase,    ycase,    idcase=names(ycase),
           xcontrol, ycontrol, idcontrol=names(ycontrol),
           tol=NULL,
           maxobs=max(length(ycase),length(ycontrol))*10,
           maxmatch=20, which=c('closest','random'))

Arguments

x

a numeric matrix or the result offind.matches

y

a numeric matrix with same number of columns asx

xcase

numeric vector to match on for cases

xcontrol

numeric vector to match on for controls, not necessarily the same length as xcase

ycase

a vector or matrix

ycontrol

ycase and ycontrol are vectors or matrices, not necessarily having the same number of rows, specifying a variable to carry along from cases and matching controls. If you instead want to carry along rows from a data frame, let ycase and ycontrol be non-overlapping integer subscripts of the donor data frame.

tol

a vector of tolerances with number of elements the same as the number of columns of y, for find.matches. For matchCases, tol is a scalar tolerance.

scale

a vector of scaling constants with number of elements the same as the number of columns of y.

maxmatch

maximum number of matches to allow. For matchCases, maximum number of controls to match with a case (default is 20). If more than maxmatch matching controls are available, a random sample without replacement of maxmatch controls is used (if which="random").

object

an object created by find.matches

digits

number of digits to use in printing distances

idcase

vector the same length as xcase

idcontrol

idcase and idcontrol are vectors the same length as xcase and xcontrol respectively, specifying the id of cases and controls. Defaults are integers specifying original element positions within each of cases and controls.

maxobs

maximum number of cases and all matching controls combined (maximum dimension of data frame resulting from matchCases). Default is ten times the maximum of the number of cases and number of controls. maxobs is used to allocate space for the resulting data frame.

which

set to "closest" (the default) to match cases with up to maxmatch controls that most closely match on x. Set which="random" to use randomly chosen controls. In either case, only those controls within tol on x are allowed to be used.

...

unused

Value

find.matches returns a list of class find.matches with elements matches and distance. Both elements are matrices with the number of rows equal to the number of rows in x, and with k columns, where k is the maximum number of matches (<= maxmatch) that occurred. The elements of matches are row identifiers of y that match, with zeros if fewer than maxmatch matches are found (blanks if y had row names).

matchCases returns a data frame with variables idcase (id of case currently being matched), type (factor variable with levels "case" and "control"), id (id of case if case row, or id of matching case), and y.
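
As a rough illustration of the single-variable tolerance rule that matchCases applies, the eligible controls for each case can be listed with base R alone. This sketch only mimics the idea of which="closest"; it is not the package code, and the object name eligible is an assumption made for the example.

```r
xcase     <- c(1, 3, 5, 12)
xcontrol  <- 1:6
idcase    <- c("A", "B", "C", "D")
idcontrol <- c("a", "b", "c", "d", "e", "f")
tol       <- 1

# For each case, ids of controls within tol on x, closest first
eligible <- lapply(setNames(xcase, idcase), function(xc) {
  d    <- abs(xcontrol - xc)
  keep <- which(d <= tol)
  idcontrol[keep[order(d[keep])]]
})
eligible
```

Case D (x = 12) has no control within the tolerance, so its entry is empty.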

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

References

Ming K, Rosenbaum PR (2001): A note on optimal matching with variable controls using the assignment algorithm. J Comp Graph Stat 10:455–463.

Cepeda MS, Boston R, Farrar JT, Strom BL (2003): Optimal matching with a variable number of controls vs. a fixed number of controls for a cohort study: trade-offs. J Clin Epidemiology 56:230-237.

Note: These papers were not used for the functions here but probably should have been.

See Also

scale, apply

Examples

y <- rbind(c(.1, .2),c(.11, .22), c(.3, .4), c(.31, .41), c(.32, 5))
x <- rbind(c(.09,.21), c(.29,.39))
y
x
w <- find.matches(x, y, maxmatch=5, tol=c(.05,.05))

set.seed(111)       # so can replicate results
x <- matrix(runif(500), ncol=2)
y <- matrix(runif(2000), ncol=2)
w <- find.matches(x, y, maxmatch=5, tol=c(.02,.03))
w$matches[1:5,]
w$distance[1:5,]
# Find first x with 3 or more y-matches
num.match <- apply(w$matches, 1, function(x)sum(x > 0))
j <- ((1:length(num.match))[num.match > 2])[1]
x[j,]
y[w$matches[j,],]

summary(w)

# For many applications would do something like this:
# attach(df1)
# x <- cbind(age, sex) # Just do as.matrix(df1) if df1 has no factor objects
# attach(df2)
# y <- cbind(age, sex)
# mat <- find.matches(x, y, tol=c(5,0)) # exact match on sex, 5y on age

# Demonstrate matchCases
xcase     <- c(1,3,5,12)
xcontrol  <- 1:6
idcase    <- c('A','B','C','D')
idcontrol <- c('a','b','c','d','e','f')
ycase     <- c(11,33,55,122)
ycontrol  <- c(11,22,33,44,55,66)
matchCases(xcase, ycase, idcase,
           xcontrol, ycontrol, idcontrol, tol=1)

# If y is a binary response variable, the following code
# will produce a Mantel-Haenszel summary odds ratio that 
# utilizes the matching.
# Standard variance formula will not work here because
# a control will match more than one case
# WARNING: The M-H procedure exemplified here is suspect 
# because of the small strata and widely varying number
# of controls per case.

x    <- c(1, 2, 3, 3, 3, 6, 7, 12,  1, 1:7)
y    <- c(0, 0, 0, 1, 0, 1, 1,  1,  1, 0, 0, 0, 0, 1, 1, 1)
case <- c(rep(TRUE, 8), rep(FALSE, 8))
id   <- 1:length(x)

m <- matchCases(x[case],  y[case],  id[case],
                x[!case], y[!case], id[!case], tol=1)
iscase <- m$type=='case'
# Note: the first tapply on insures that event indicators are
# sorted by case id.  The second actually does something.
event.case    <- tapply(m$y[iscase],  m$idcase[iscase],  sum)
event.control <- tapply(m$y[!iscase], m$idcase[!iscase], sum)
n.control     <- tapply(!iscase,      m$idcase,          sum)
n             <- tapply(m$y,          m$idcase,          length)
or <- sum(event.case * (n.control - event.control) / n) /
      sum(event.control * (1 - event.case) / n)
or

# Bootstrap this estimator by sampling with replacement from
# subjects.  Assumes id is unique when combine cases+controls
# (id was constructed this way above).  The following algorithms
# puts all sampled controls back with the cases to whom they were
# originally matched.

ids <- unique(m$id)
idgroups <- split(1:nrow(m), m$id)
B   <- 50   # in practice use many more
ors <- numeric(B)
# Function to order w by ids, leaving unassigned elements zero
align <- function(ids, w) {
  z <- structure(rep(0, length(ids)), names=ids)
  z[names(w)] <- w
  z
}
for(i in 1:B) {
  j <- sample(ids, replace=TRUE)
  obs <- unlist(idgroups[j])
  u <- m[obs,]
  iscase <- u$type=='case'
  n.case <- align(ids, tapply(u$type, u$idcase, 
                              function(v)sum(v=='case')))
  n.control <- align(ids, tapply(u$type, u$idcase,
                                 function(v)sum(v=='control')))
  event.case <- align(ids, tapply(u$y[iscase],  u$idcase[iscase],  sum))
  event.control <- align(ids, tapply(u$y[!iscase], u$idcase[!iscase], sum))
  n <- n.case + n.control
  # Remove sets having 0 cases or 0 controls in resample
  s             <- n.case > 0 & n.control > 0
  denom <- sum(event.control[s] * (n.case[s] - event.case[s]) / n[s])
  or <- if(denom==0) NA else 
   sum(event.case[s] * (n.control[s] - event.control[s]) / n[s]) / denom
  ors[i] <- or
}
describe(ors)

First Word in a String or Expression

Description

first.word finds the first word in an expression. A word is defined by unlisting the elements of the expression found by the S parser and then accepting any elements whose first character is either a letter or period. The principal intended use is for the automatic generation of temporary file names where it is important to exclude special characters from the file name. For Microsoft Windows, periods in names are deleted and only up to the first 8 characters of the word is returned.
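
To make the idea of a file-name-safe leading word concrete, here is a small base R sketch that keeps the leading run of letters, digits, and periods of a string. It is an approximation written for illustration only; leading_word is a hypothetical helper and does not reproduce first.word's parsing of expressions or its Windows-specific handling.

```r
# Hypothetical helper, not the Hmisc implementation
leading_word <- function(s) {
  # Match a word that starts with a letter or period
  m <- regmatches(s, regexpr("^[A-Za-z.][A-Za-z0-9.]*", s))
  if (length(m)) m else character(0)
}

leading_word("y ~ x + log(w)")
leading_word("age.group*sex")
```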

Usage

first.word(x, i=1, expr=substitute(x))

Arguments

x

any scalar character string

i

word number, default value = 1. Used when the second or ith word is wanted. Currently only the i=1 case is implemented.

expr

any S object of mode expression.

Value

a character string

Author(s)

Frank E. Harrell, Jr.,
Department of Biostatistics,
Vanderbilt University,
fh@fharrell.com

Richard M. Heiberger,
Department of Statistics,
Temple University, Philadelphia, PA.
rmh@temple.edu

Examples

first.word(expr=expression(y ~ x + log(w)))

Format a Data Frame or Matrix for LaTeX or HTML

Description

format.df does appropriate rounding and decimal alignment, and outputs a character matrix containing the formatted data. If x is a data.frame, then do each component separately. If x is a matrix, but not a data.frame, make it a data.frame with individual components for the columns. If a component x$x is a matrix, then do all columns the same.

Usage

format.df(x, digits, dec=NULL, rdec=NULL, cdec=NULL,
          numeric.dollar=!dcolumn, na.blank=FALSE, na.dot=FALSE,
          blank.dot=FALSE, col.just=NULL, cdot=FALSE,
          dcolumn=FALSE, matrix.sep=' ', scientific=c(-4,4),
          math.row.names=FALSE, already.math.row.names=FALSE,
          math.col.names=FALSE, already.math.col.names=FALSE,
          double.slash=FALSE, format.Date="%m/%d/%Y",
          format.POSIXt="%m/%d/%Y %H:%M:%OS", ...)

Arguments

x

a matrix (usually numeric) or data frame

digits

causes all values in the table to be formatted to digits significant digits. dec is usually preferred.

dec

If dec is a scalar, all elements of the matrix will be rounded to dec decimal places to the right of the decimal. dec can also be a matrix whose elements correspond to x, for customized rounding of each element. A matrix dec must have number of columns equal to number of columns of input x. A scalar dec is expanded to a vector cdec with number of items equal to number of columns of input x.

rdec

a vector specifying the number of decimal places to the right for each row (cdec is more commonly used than rdec). A vector rdec must have number of items equal to number of rows of input x. rdec is expanded to matrix dec.

cdec

a vector specifying the number of decimal places for each column. The vector must have number of items equal to number of columns or components of input x.

cdot

Set to TRUE to use centered dots rather than ordinary periods in numbers. The output uses a syntax appropriate for latex.

na.blank

Set to TRUE to use blanks rather than NA for missing values. This usually looks better in latex.

dcolumn

Set to TRUE to use David Carlisle's dcolumn style for decimal alignment in latex. Default is FALSE. You will probably want to use dcolumn if you use rdec, as a column may then contain varying number of places to the right of the decimal. dcolumn can line up all such numbers on the decimal point, with integer values right justified at the decimal point location of numbers that actually contain decimal places. When you use dcolumn = TRUE, numeric.dollar is set by default to FALSE, and the object attribute "style" is set to ‘dcolumn’, as the latex usepackage must reference [dcolumn]. The three files ‘dcolumn.sty’, ‘newarray.sty’, and ‘array.sty’ will need to be in a directory in your TEXINPUTS path. When you use dcolumn=TRUE, numeric.dollar should be set to FALSE.

numeric.dollar

logical, default !dcolumn. Set to TRUE to place dollar signs around numeric values when dcolumn = FALSE. This assures that latex will use minus signs rather than hyphens to indicate negative numbers. Set to FALSE when dcolumn = TRUE, as dcolumn.sty automatically uses minus signs.

math.row.names

logical, set true to place dollar signs around the row names.

already.math.row.names

set to TRUE to prevent any math mode changes to row names

math.col.names

logical, set true to place dollar signs around the column names.

already.math.col.names

set to TRUE to prevent any math mode changes to column names

na.dot

Set to TRUE to use periods rather than NA for missing numeric values. This works with the SAS convention that periods indicate missing values.

blank.dot

Set to TRUE to use periods rather than blanks for missing character values. This works with the SAS convention that periods indicate missing values.

col.just

Input vector col.just must have number of columns equal to number of columns of the output matrix. When NULL, the default, the col.just attribute of the result is set to ‘l’ for character columns and to ‘r’ for numeric columns. The user can override the default by an argument vector whose length is equal to the number of columns of the result matrix. When format.df is called by latex.default, the col.just is used as the cols argument to the tabular environment and the letters ‘l’, ‘r’, and ‘c’ are valid values. When format.df is called by SAS, the col.just is used to determine whether a ‘\$’ is needed on the ‘input’ line of the ‘sysin’ file, and the letters ‘l’ and ‘r’ are valid values. You can pass specifications other than l, r, c in col.just, e.g., "p{3in}" to get paragraph-formatted columns from latex().

matrix.sep

When x is a data frame containing a matrix, so that new column names are constructed from the name of the matrix object and the names of the individual columns of the matrix, matrix.sep specifies the character to use to separate object names from individual column names.

scientific

specifies ranges of exponents (or a logical vector) specifying values not to convert to scientific notation. See format.default for details.

double.slash

logical; whether escaping backslashes should themselves be escaped.

format.Date

String used to format objects of the Date class.

format.POSIXt

String used to format objects of the POSIXt class.

...

other arguments are accepted and passed to format.default. For latexVerbatim these arguments are passed to the print function.

Value

a character matrix with character images of properly rounded x. Matrix components of input x are now just sets of columns of character matrix. Object attribute "col.just" repeats the value of the argument col.just when provided, otherwise, it includes the recommended justification for columns of output. See the discussion of the argument col.just. The default justification is ‘l’ for characters and factors, ‘r’ for numeric. When dcolumn==TRUE, numerics will have ‘.’ as the justification character.
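
The per-column rounding controlled by cdec, and the dollar-sign wrapping that numeric.dollar requests, can be imitated with base R to see what kind of character matrix results. This is a simplified sketch using formatC, not what format.df does internally; the data frame and the objects out and dollar are made up for the example.

```r
x    <- data.frame(a = c(1.234, 10.5), b = c(0.5, 2.341))
cdec <- c(1, 2)   # decimal places for columns a and b

# Format each column with its own number of decimals; result is a character matrix
out <- mapply(function(col, d) formatC(col, format = "f", digits = d),
              x, cdec)
out

# Wrap every number in dollar signs so LaTeX typesets minus signs in math mode
dollar <- out
dollar[] <- paste0("$", out, "$")
dollar
```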

Author(s)

Frank E. Harrell, Jr.,
Department of Biostatistics,
Vanderbilt University,
fh@fharrell.com

Richard M. Heiberger,
Department of Statistics,
Temple University, Philadelphia, PA.
rmh@temple.edu

See Also

latex

Examples

## Not run: 
x <- data.frame(a=1:2, b=3:4)
x$m <- 10000*matrix(5:8,nrow=2)
names(x)
dim(x)
x
format.df(x, big.mark=",")
dim(format.df(x))

## End(Not run)

Format P Values

Description

format.pval is intended for formatting p-values.

Usage

format.pval(x, pv=x, digits = max(1, .Options$digits - 2),
            eps = .Machine$double.eps, na.form = "NA", ...)

Arguments

pv

a numeric vector.

x

argument for method compliance.

digits

how many significant digits are to be used.

eps

a numerical tolerance: see Details.

na.form

character representation of NAs.

...

arguments passed to format in the format.pval function body.

Details

format.pval is mainly an auxiliary function for print.summary.lm etc., and does separate formatting for fixed, floating point and very small values; those less than eps are formatted as ‘< [eps]’ (where ‘[eps]’ stands for format(eps, digits)).
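
The effect of eps described above can be seen with the base R function of the same name, called explicitly as base::format.pval so the result does not depend on which packages are attached. The input values are arbitrary.

```r
p   <- c(0.2, 0.0004, NA)
out <- base::format.pval(p, digits = 2, eps = 0.001)
out   # the second value is below eps, so it is reported as "< eps"
```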

Value

A character vector.

Note

This is the base format.pval function with the ability to pass the nsmall argument to format.

Examples

format.pval(c(runif(5), pi^-100, NA))
format.pval(c(0.1, 0.0001, 1e-27))
format.pval(c(0.1, 1e-27), nsmall=3)

Gaussian Bayesian Posterior and Predictive Distributions

Description

gbayes derives the (Gaussian) posterior and optionally the predictive distribution when both the prior and the likelihood are Gaussian, and when the statistic of interest comes from a 2-sample problem. This function is especially useful in obtaining the expected power of a statistical test, averaging over the distribution of the population effect parameter (e.g., log hazard ratio) that is obtained using pilot data. gbayes is also useful for summarizing studies for which the statistic of interest is approximately Gaussian with known variance. An example is given for comparing two proportions using the angular transformation, for which the variance is independent of unknown parameters except for very extreme probabilities. A plot method is also given. This plots the prior, posterior, and predictive distributions on a single graph using a nice default for the x-axis limits and using the labcurve function for automatic labeling of the curves.
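
For orientation, the textbook Gaussian-prior, Gaussian-likelihood update with known variance looks as follows in base R. It is assumed here, not quoted from the package source, that gbayes follows this standard form; gauss_post and var.future are names invented for the sketch.

```r
# Standard conjugate update; argument names mirror those of gbayes
gauss_post <- function(mean.prior, var.prior, stat, var.stat) {
  var.post  <- 1 / (1 / var.prior + 1 / var.stat)       # precisions add
  mean.post <- var.post * (mean.prior / var.prior + stat / var.stat)
  list(mean.post = mean.post, var.post = var.post)
}

p <- gauss_post(mean.prior = 0, var.prior = 1, stat = 2, var.stat = 1)
p

# Predictive distribution of a future statistic with variance var.future:
# same mean as the posterior, variance inflated by var.future
var.future <- 0.5
var.pred   <- p$var.post + var.future
var.pred
```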

gbayes2 uses the method of Spiegelhalter and Freedman (1986) to compute the probability of correctly concluding that a new treatment is superior to a control. By this we mean that a 1-alpha normal theory-based confidence interval for the new minus old treatment effect lies wholly to the right of delta.w, where delta.w is the minimally worthwhile treatment effect (which can be zero to be consistent with ordinary null hypothesis testing, a method not always making sense). This kind of power function is averaged over a prior distribution for the unknown treatment effect. This procedure is applicable to the situation where a prior distribution is not to be used in constructing the test statistic or confidence interval, but is only used for specifying the distribution of delta, the parameter of interest.

Even though gbayes2 assumes that the test statistic has a normal distribution with known variance (which is strongly a function of the sample size in the two treatment groups), the prior distribution function can be completely general. Instead of using a step-function for the prior distribution as Spiegelhalter and Freedman used in their appendix, gbayes2 uses the built-in integrate function for numerical integration. gbayes2 also allows the variance of the test statistic to be general as long as it is evaluated by the user. The conditional power given the parameter of interest delta is 1 - pnorm((delta.w - delta)/sd + z), where z is the normal critical value corresponding to 1 - alpha/2.
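
The averaging of conditional power over a prior can be sketched directly with integrate. The numbers below (sd, delta.w, and a Gaussian prior with mean 1 and standard deviation 1) are illustrative assumptions; the closed-form check works only because this particular prior is Gaussian.

```r
sd      <- sqrt(0.02)            # assumed s.d. of the test statistic
delta.w <- log(1.1)              # minimally worthwhile effect
alpha   <- 0.05
z       <- qnorm(1 - alpha / 2)
prior   <- function(delta) dnorm(delta, mean = 1, sd = 1)   # illustrative prior

# Conditional power given delta, as in the formula above
cond_power <- function(delta) 1 - pnorm((delta.w - delta) / sd + z)

# Average the conditional power over the prior
avg_power <- integrate(function(d) cond_power(d) * prior(d),
                       lower = -Inf, upper = Inf)$value

# Closed form for a N(1, 1) prior, used only as a check
closed <- pnorm((1 - delta.w - z * sd) / sqrt(sd^2 + 1))
c(integrated = avg_power, closed.form = closed)
```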

gbayesMixPredNoData derives the predictive distribution of a statistic that is Gaussian given delta when no data have yet been observed and when the prior is a mixture of two Gaussians.

gbayesMixPost derives the posterior density, cdf, or posterior mean of delta given the statistic x, when the prior for delta is a mixture of two Gaussians and when x is Gaussian given delta.
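
With a two-component Gaussian mixture prior, the posterior is again a two-component mixture: each component is updated as in the single-Gaussian case, and the mixing weight is reweighted by how well each component predicts x. The following base R sketch implements that standard result; mix_post_density is a hypothetical name and the function is not taken from the package.

```r
mix_post_density <- function(delta, x, v, mix, d0, v0, d1, v1) {
  # Posterior variance and mean under each prior component alone
  pv0 <- 1 / (1 / v0 + 1 / v); pm0 <- pv0 * (d0 / v0 + x / v)
  pv1 <- 1 / (1 / v1 + 1 / v); pm1 <- pv1 * (d1 / v1 + x / v)
  # Prior weight times marginal likelihood of x under each component
  w0 <- mix       * dnorm(x, d0, sqrt(v0 + v))
  w1 <- (1 - mix) * dnorm(x, d1, sqrt(v1 + v))
  pmix <- w0 / (w0 + w1)                       # posterior mixing probability
  pmix * dnorm(delta, pm0, sqrt(pv0)) + (1 - pmix) * dnorm(delta, pm1, sqrt(pv1))
}

# The density should integrate to 1
area <- integrate(mix_post_density, -Inf, Inf, x = 3, v = 0.4, mix = 0.5,
                  d0 = 0, v0 = 10000, d1 = 2, v1 = 0.3)$value
area
```

Setting mix = 1 in this sketch reduces it to the ordinary single-Gaussian posterior.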

gbayesMixPowerNP computes the power for a test for delta > delta.w for the case where (1) a Gaussian prior or mixture of two Gaussian priors is used as the prior distribution, (2) this prior is used in forming the statistical test or credible interval, (3) no prior is used for the distribution of delta for computing power but instead a fixed single delta is given (as in traditional frequentist hypothesis tests), and (4) the test statistic has a Gaussian likelihood with known variance (and mean equal to the specified delta). gbayesMixPowerNP is handy where you want to use an earlier study in testing for treatment effects in a new study, but you want to mix with this prior a non-informative prior. The mixing probability mix can be thought of as the "applicability" of the previous study. As with gbayes2, power here means the probability that the new study will yield a left credible interval that is to the right of delta.w. gbayes1PowerNP is a special case of gbayesMixPowerNP when the prior is a single Gaussian.
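
For the single-Gaussian-prior case the critical value has a closed form, which makes the definition of power above easy to check by hand. The sketch below derives the smallest x for which the posterior probability that delta exceeds delta.w reaches 1 - alpha/2, then evaluates the chance of observing such an x at a fixed true delta. The function name power_1gauss is hypothetical and the code is a derivation from the stated definition, not the package source.

```r
power_1gauss <- function(d0, v0, delta, v, delta.w = 0, alpha = 0.05) {
  z  <- qnorm(1 - alpha / 2)
  pv <- 1 / (1 / v0 + 1 / v)                 # posterior variance
  # Smallest x with P(delta > delta.w | x) >= 1 - alpha/2
  xcrit <- v * ((delta.w + z * sqrt(pv)) / pv - d0 / v0)
  c(critical = xcrit,
    power    = 1 - pnorm(xcrit, mean = delta, sd = sqrt(v)))
}

# With a nearly flat prior this approaches the usual frequentist power
r    <- power_1gauss(d0 = 0, v0 = 1e8, delta = 1, v = 0.2)
freq <- 1 - pnorm(qnorm(0.975) - 1 / sqrt(0.2))
r
freq
```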

Usage

gbayes(mean.prior, var.prior, m1, m2, stat, var.stat, 
       n1, n2, cut.prior, cut.prob.prior=0.025)

## S3 method for class 'gbayes'
plot(x, xlim, ylim, name.stat='z', ...)

gbayes2(sd, prior, delta.w=0, alpha=0.05, upper=Inf, prior.aux)

gbayesMixPredNoData(mix=NA, d0=NA, v0=NA, d1=NA, v1=NA,
                    what=c('density','cdf'))

gbayesMixPost(x=NA, v=NA, mix=1, d0=NA, v0=NA, d1=NA, v1=NA,
              what=c('density','cdf','postmean'))

gbayesMixPowerNP(pcdf, delta, v, delta.w=0, mix, interval,
                 nsim=0, alpha=0.05)

gbayes1PowerNP(d0, v0, delta, v, delta.w=0, alpha=0.05)

Arguments

mean.prior

mean of the prior distribution

cut.prior, cut.prob.prior, var.prior

variance of the prior. Use a large number such as 10000 to effectively use a flat (noninformative) prior. Sometimes it is useful to compute the variance so that the prior probability that stat is greater than some impressive value u is only alpha. The correct var.prior to use is then ((u-mean.prior)/qnorm(1-alpha))^2. You can specify cut.prior=u and cut.prob.prior=alpha (whose default is 0.025) in place of var.prior to have gbayes compute the prior variance in this manner.

m1

sample size in group 1

m2

sample size in group 2

stat

statistic comparing groups 1 and 2, e.g., log hazard ratio, differencein means, difference in angular transformations of proportions

var.stat

variance of stat, assumed to be known. var.stat should either be a constant (allowed if n1 is not specified), or a function of two arguments which specify the sample sizes in groups 1 and 2. Calculations will be approximate when the variance is estimated from the data.

x

an object returned by gbayes or the value of the statistic which is an estimator of delta, the parameter of interest

sd

the standard deviation of the treatment effect

prior

a function of possibly a vector of unknown treatment effects,returning the prior density at those values

pcdf

a function computing the posterior CDF of the treatment effect delta, such as a function created by gbayesMixPost with what="cdf".

delta

a true unknown single treatment effect to detect

v

the variance of the statistic x, e.g., s^2 * (1/n1 + 1/n2). Neither x nor v need to be defined to gbayesMixPost, as they can be defined at run time to the function created by gbayesMixPost.

n1

number of future observations in group 1, for obtaining a predictivedistribution

n2

number of future observations in group 2

xlim

vector of 2 x-axis limits. Default is the mean of the posterior plus orminus 6 standard deviations of the posterior.

ylim

vector of 2 y-axis limits. Default is the range over combined prior and posterior densities.

name.stat

label for x-axis. Default is "z".

...

optional arguments passed to labcurve from plot.gbayes

delta.w

the minimum worthwhile treatment difference to detect. The default is zero for a plain uninteresting null hypothesis.

alpha

type I error, or more accurately one minus the confidence level for atwo-sided confidence limit for the treatment effect

upper

upper limit of integration over the prior distribution multiplied bythe normal likelihood for the treatment effect statistic. Default isinfinity.

prior.aux

argument to pass to prior from integrate through gbayes2. Inside of power the argument must be named prior.aux if it exists. You can pass multiple parameters by passing prior.aux as a list and pulling off elements of the list inside prior. This setup was used because of difficulties in passing ... arguments through integrate for some situations.

mix

mixing probability or weight for the Gaussian prior having mean d0 and variance v0. mix must be between 0 and 1, inclusive.

d0

mean of the first Gaussian distribution (only Gaussian for gbayes1PowerNP and is a required argument)

v0

variance of the first Gaussian (only Gaussian for gbayes1PowerNP and is a required argument)

d1

mean of the second Gaussian (if mix < 1)

v1

variance of the second Gaussian (if mix < 1). Any of these last 5 arguments can be omitted to gbayesMixPredNoData as they can be provided at run time to the function created by gbayesMixPredNoData.

what

specifies whether the predictive density or the CDF is to be computed. Default is "density".

interval

a 2-vector containing the lower and upper limit for possible values of the test statistic x that would result in a left credible interval exceeding delta.w with probability 1-alpha/2

nsim

defaults to zero, causing gbayesMixPowerNP to solve numerically for the critical value of x, then to compute the power accordingly. Specify a nonzero number such as 20000 for nsim to instead have the function estimate power by simulation. In this case 0.95 confidence limits on the estimated power are also computed. This approach is sometimes necessary if uniroot can't solve the equation for the critical value.
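
The rule for deriving var.prior from cut.prior and cut.prob.prior, given under those arguments, can be verified numerically. The values of u, mean.prior, and alpha below are illustrative.

```r
mean.prior <- 0
u          <- 1        # an "impressive" value of the statistic (illustrative)
alpha      <- 0.025    # desired prior probability of exceeding u

var.prior <- ((u - mean.prior) / qnorm(1 - alpha))^2
var.prior

# Check: under this prior, the probability of exceeding u equals alpha
tail <- 1 - pnorm(u, mean.prior, sqrt(var.prior))
tail
```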

Value

gbayes returns a list of class "gbayes" containing the following named elements: mean.prior, var.prior, mean.post, var.post, and if n1 is specified, mean.pred and var.pred. Note that mean.pred is identical to mean.post. gbayes2 returns a single number which is the probability of correctly rejecting the null hypothesis in favor of the new treatment. gbayesMixPredNoData returns a function that can be used to evaluate the predictive density or cumulative distribution. gbayesMixPost returns a function that can be used to evaluate the posterior density or cdf. gbayesMixPowerNP returns a vector containing two values if nsim = 0. The first value is the critical value for the test statistic that will make the left credible interval > delta.w, and the second value is the power. If nsim > 0, it returns the power estimate and confidence limits for it. The examples show how to use these functions.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com

References

Spiegelhalter DJ, Freedman LS, Parmar MKB (1994): Bayesian approaches to randomized trials. JRSS A 157:357–416. Results for gbayes are derived from Equations 1, 2, 3, and 6.

Spiegelhalter DJ, Freedman LS (1986): A predictive approach to selecting the size of a clinical trial, based on subjective clinical opinion. Stat in Med 5:1–13.

Joseph, Lawrence and Belisle, Patrick (1997): Bayesian sample size determination for normal means and differences between normal means. The Statistician 46:209–226.

Grouin, JM, Coste M, Bunouf P, Lecoutre B (2007): Bayesian sample size determination in non-sequential clinical trials: Statistical aspects and some regulatory considerations. Stat in Med 26:4914–4924.

See Also

gbayesSeqSim

Examples

# Compare 2 proportions using the var stabilizing transformation
# arcsin(sqrt((x+3/8)/(n+3/4))) (Anscombe), which has variance 
# 1/[4(n+.5)]

m1 <- 100;     m2 <- 150
deaths1 <- 10; deaths2 <- 30

f <- function(events,n) asin(sqrt((events+3/8)/(n+3/4)))
stat <- f(deaths1,m1) - f(deaths2,m2)
var.stat <- function(m1, m2) 1/4/(m1+.5) + 1/4/(m2+.5)
cat("Test statistic:",format(stat),"  s.d.:",
    format(sqrt(var.stat(m1,m2))), "\n")
#Use unbiased prior with variance 1000 (almost flat)
b <- gbayes(0, 1000, m1, m2, stat, var.stat, 2*m1, 2*m2)
print(b)
plot(b)
#To get posterior Prob[parameter > w] use 
# 1-pnorm(w, b$mean.post, sqrt(b$var.post))

#If g(effect, n1, n2) is the power function to
#detect an effect of 'effect' with samples size for groups 1 and 2
#of n1,n2, estimate the expected power by getting 1000 random
#draws from the posterior distribution, computing power for
#each value of the population effect, and averaging the 1000 powers
#This code assumes that g will accept vector-valued 'effect'
#For the 2-sample proportion problem just addressed, 'effect'
#could be taken approximately as the change in the arcsin of
#the square root of the probability of the event

g <- function(effect, n1, n2, alpha=.05) {
  sd <- sqrt(var.stat(n1,n2))
  z <- qnorm(1 - alpha/2)
  effect <- abs(effect)
  1 - pnorm(z - effect/sd) + pnorm(-z - effect/sd)
}

effects <- rnorm(1000, b$mean.post, sqrt(b$var.post))
powers <- g(effects, 500, 500)
hist(powers, nclass=35, xlab='Power')
describe(powers)

# gbayes2 examples
# First consider a study with a binary response where the
# sample size is n1=500 in the new treatment arm and n2=300
# in the control arm.  The parameter of interest is the 
# treated:control log odds ratio, which has variance
# 1/[n1 p1 (1-p1)] + 1/[n2 p2 (1-p2)].  This is not
# really constant so we average the variance over plausible
# values of the probabilities of response p1 and p2.  We
# think that these are between .4 and .6 and we take a 
# further short cut

v <- function(n1, n2, p1, p2) 1/(n1*p1*(1-p1)) + 1/(n2*p2*(1-p2))
n1 <- 500; n2 <- 300
ps <- seq(.4, .6, length=100)
vguess <- quantile(v(n1, n2, ps, ps), .75)
vguess
#        75% 
# 0.02183459

# The minimally interesting treatment effect is an odds ratio
# of 1.1.  The prior distribution on the log odds ratio is
# a 50:50 mixture of a vague Gaussian (mean 0, sd 100) and
# an informative prior from a previous study (mean 1, sd 1)

prior <- function(delta) 
  0.5*dnorm(delta, 0, 100)+0.5*dnorm(delta, 1, 1)
deltas <- seq(-5, 5, length=150)
plot(deltas, prior(deltas), type='l')

# Now compute the power, averaged over this prior
gbayes2(sqrt(vguess), prior, log(1.1))
# [1] 0.6133338

# See how much power is lost by ignoring the previous
# study completely

gbayes2(sqrt(vguess), function(delta)dnorm(delta, 0, 100), log(1.1))
# [1] 0.4984588

# What happens to the power if we really don't believe the treatment
# is very effective?  Let's use a prior distribution for the log
# odds ratio that is uniform between log(1.2) and log(1.3).
# Also check the power against a true null hypothesis

prior2 <- function(delta) dunif(delta, log(1.2), log(1.3))
gbayes2(sqrt(vguess), prior2, log(1.1))
# [1] 0.1385113

gbayes2(sqrt(vguess), prior2, 0)
# [1] 0.3264065

# Compare this with the power of a two-sample binomial test to
# detect an odds ratio of 1.25
bpower(.5, odds.ratio=1.25, n1=500, n2=300)
#     Power 
# 0.3307486

# For the original prior, consider a new study with equal
# sample sizes n in the two arms.  Solve for n to get a
# power of 0.9.  For the variance of the log odds ratio
# assume a common p in the center of a range of suspected
# probabilities of response, 0.3.  For this example we
# use a zero null value and the uniform prior above

v   <- function(n) 2/(n*.3*.7)
pow <- function(n) gbayes2(sqrt(v(n)), prior2)
uniroot(function(n) pow(n)-0.9, c(50,10000))$root
# [1] 2119.675
# Check this value
pow(2119.675)
# [1] 0.9

# Get the posterior density when there is a mixture of two priors,
# with mixing probability 0.5.  The first prior is almost
# non-informative (normal with mean 0 and variance 10000) and the
# second has mean 2 and variance 0.3.  The test statistic has a value
# of 3 with variance 0.4.

f <- gbayesMixPost(3, 4, mix=0.5, d0=0, v0=10000, d1=2, v1=0.3)

args(f)

# Plot this density
delta <- seq(-2, 6, length=150)
plot(delta, f(delta), type='l')

# Add to the plot the posterior density that used only
# the almost non-informative prior
lines(delta, f(delta, mix=1), lty=2)

# The same but for an observed statistic of zero
lines(delta, f(delta, mix=1, x=0), lty=3)

# Derive the CDF instead of the density
g <- gbayesMixPost(3, 4, mix=0.5, d0=0, v0=10000, d1=2, v1=0.3,
                   what='cdf')
# Had mix=0 or 1, gbayes1PowerNP could have been used instead
# of gbayesMixPowerNP below

# Compute the power to detect an effect of delta=1 if the variance
# of the test statistic is 0.2
gbayesMixPowerNP(g, 1, 0.2, interval=c(-10,12))

# Do the same thing by simulation
gbayesMixPowerNP(g, 1, 0.2, interval=c(-10,12), nsim=20000)

# Compute by what factor the sample size needs to be larger
# (the variance needs to be smaller) so that the power is 0.9
ratios <- seq(1, 4, length=50)
pow <- single(50)
for(i in 1:50) 
  pow[i] <- gbayesMixPowerNP(g, 1, 0.2/ratios[i], interval=c(-10,12))[2]

# Solve for ratio using reverse linear interpolation
approx(pow, ratios, xout=0.9)$y

# Check this by computing power
gbayesMixPowerNP(g, 1, 0.2/2.1, interval=c(-10,12))
# So the study will have to be 2.1 times as large as earlier thought

gbayesSeqSim

Description

Simulate Bayesian Sequential Treatment Comparisons Using a Gaussian Model

Usage

gbayesSeqSim(est, asserts)

Arguments

est

data frame created by estSeqSim()

asserts

list of lists. The first element of each list is the user-specified name for each assertion/prior combination, e.g., "efficacy". The other elements are, in order, a character string equal to "<", ">", or "in", a parameter value cutoff (for "<" and ">") or a 2-vector specifying an interval for "in", and either a prior distribution mean and standard deviation named mu and sigma respectively, or a parameter value ("cutprior") and tail area "tailprob". If the latter is used, mu is assumed to be zero and sigma is solved for such that P(parameter > 'cutprior') = P(parameter < - 'cutprior') = tailprob.

Details

Simulate a sequential trial under a Gaussian model for parameter estimates, and Gaussian priors using simulated estimates and variances returned by estSeqSim. For each row of the data frame est and for each prior/assertion combination, computes the posterior probability of the assertion.
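
For a single simulated look, the calculation described above amounts to a Gaussian conjugate update followed by a normal tail or interval probability. The sketch below walks through it in base R under that assumption; the estimate, its variance, and all object names are made up for illustration and are not output of estSeqSim.

```r
# Prior s.d. from a tail-area specification, with mu = 0:
# P(parameter > cutprior) = tailprob
cutprior <- log(2); tailprob <- 0.05
mu       <- 0
sigma    <- cutprior / qnorm(1 - tailprob)

# One simulated look: parameter estimate and its variance (made-up numbers)
est <- -0.3; vest <- 0.04

# Gaussian posterior for the parameter
vpost  <- 1 / (1 / sigma^2 + 1 / vest)
mupost <- vpost * (mu / sigma^2 + est / vest)

# Posterior probabilities of two kinds of assertion
p_less <- pnorm(0, mupost, sqrt(vpost))          # "<" with cutoff 0
int    <- log(c(0.9, 1 / 0.9))
p_in   <- diff(pnorm(int, mupost, sqrt(vpost)))  # "in" the interval int
c(p_less = p_less, p_in = p_in)
```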

Value

a data frame with number of rows equal to that of est with a number of new columns equal to the number of assertions added. The new columns are named p1, p2, p3, ... (posterior probabilities), mean1, mean2, ... (posterior means), and sd1, sd2, ... (posterior standard deviations). The returned data frame also has an attribute asserts added which is the original asserts augmented with any derived mu and sigma and converted to a data frame, and another attribute alabels which is a named vector used to map p1, p2, ... to the user-provided labels in asserts.

Author(s)

Frank Harrell

See Also

gbayes(), estSeqSim(), simMarkovOrd(), estSeqMarkovOrd()

Examples

## Not run: 
# Simulate Bayesian operating characteristics for an unadjusted
# proportional odds comparison (Wilcoxon test)
# For 100 simulations, 5 looks, 2 true parameter values, and
# 2 assertion/prior combinations, compute the posterior probability
# Use a low-level logistic regression call to speed up simulations
# Use data.table to compute various summary measures
# Total simulation time: 2s
lfit <- function(x, y) {
  f <- rms::lrm.fit(x, y)
  k <- length(coef(f))
  c(coef(f)[k], vcov(f)[k, k])
}
gdat <- function(beta, n1, n2) {
  # Cell probabilities for a 7-category ordinal outcome for the control group
  p <- c(2, 1, 2, 7, 8, 38, 42) / 100
  # Compute cell probabilities for the treated group
  p2 <- pomodm(p=p, odds.ratio=exp(beta))
  y1 <- sample(1 : 7, n1, p,  replace=TRUE)
  y2 <- sample(1 : 7, n2, p2, replace=TRUE)
  list(y1=y1, y2=y2)
}
# Assertion 1: log(OR) < 0 under prior with prior mean 0.1 and sigma 1 on log OR scale
# Assertion 2: OR between 0.9 and 1/0.9 with prior mean 0 and sigma computed so that
# P(OR > 2) = 0.05
asserts <- list(list('Efficacy', '<', 0, mu=0.1, sigma=1),
                list('Similarity', 'in', log(c(0.9, 1/0.9)),
                     cutprior=log(2), tailprob=0.05))
set.seed(1)
est <- estSeqSim(c(0, log(0.7)), looks=c(50, 75, 95, 100, 200),
                   gendat=gdat,
                   fitter=lfit, nsim=100)
z <- gbayesSeqSim(est, asserts)
head(z)
attr(z, 'asserts')
# Compute the proportion of simulations that hit targets (different target posterior
# probabilities for efficacy vs. similarity)
# For the efficacy assessment compute the first look at which the target
# was hit (set to infinity if never hit)
require(data.table)
z <- data.table(z)
u <- z[, .(first=min(look[p1 > 0.95])), by=.(parameter, sim)]
# Compute the proportion of simulations that ever hit the target and
# that hit it by the 100th subject
u[, .(ever=mean(first < Inf)),  by=.(parameter)]
u[, .(by75=mean(first <= 100)), by=.(parameter)]
## End(Not run)

Step function confidence intervals for ggplot2

Description

Produces a step function confidence interval for survival curves. This function is taken from the utile.visuals package by Eric Finnesgard. That package is not used because of its strong dependencies.

Usage

geom_stepconfint(
  mapping = NULL,
  data = NULL,
  stat = "identity",
  position = "identity",
  na.rm = FALSE,
  ...
)

Arguments

mapping

Aesthetic mappings with aes() function. Like geom_ribbon(), you must provide columns for x, ymin (lower limit), ymax (upper limit).

data

The data to be displayed in this layer. Can inherit from ggplot parent.

stat

The statistical transformation to use on the data for this layer, as a string. Defaults to 'identity'.

position

Position adjustment, either as a string, or the result of a call to a position adjustment function.

na.rm

If FALSE, the default, missing values are removed with a warning. If TRUE, missing values are silently removed.

...

Optional. Any other ggplot geom_ribbon() arguments.

Note

Originally adapted from the survminer package <https://github.com/kassambara/survminer>.

Author(s)

Eric Finnesgard

Examples

require(survival)
require(ggplot2)
f <- survfit(Surv(time, status) ~ trt, data = diabetic)
d <- with(f, data.frame(time, surv, lower, upper, trt=rep(names(f$strata), f$strata)))
ggplot(d, aes(x = time, y=surv)) +
  geom_step(aes(color = trt)) +
  geom_stepconfint(aes(ymin = lower, ymax = upper, fill = trt), alpha = 0.3) +
  coord_cartesian(c(0, 50)) +
  scale_x_continuous(expand = c(0.02,0)) +
  labs(x = 'Time', y = 'Freedom From Event') +
  scale_color_manual(
    values = c('#d83641', '#1A45A7'),
    name = 'Treatment',
    labels = c('None', 'Laser'),
    aesthetics = c('colour', 'fill'))

Download and Install Datasets for Hmisc, rms, and Statistical Modeling

Description

This function downloads and makes ready to use datasets from the main web site for the Hmisc and rms libraries. For R, the datasets were stored in compressed save format and getHdata makes them available by running load after download. For S-Plus, the datasets were stored in data.dump format and are made available by running data.restore after import. The dataset is run through the cleanup.import function. Calling getHdata with no file argument provides a character vector of names of available datasets that are currently on the web site. For R, R's default browser can optionally be launched to view html files that were already prepared using the Hmisc command html(contents()) or to view ‘.txt’ or ‘.html’ data description files when available.

If options(localHfiles=TRUE) the scripts are read from local directory ~/web/data/repo instead of from the web server.

Usage

getHdata(file, what = c("data", "contents", "description", "all"),
         where="https://hbiostat.org/data/repo")

Arguments

file

an unquoted name of a dataset on the web site, e.g. ‘prostate’. Omit file to obtain a list of available datasets.

what

specify what="contents" to browse the contents (metadata) for the dataset rather than fetching the data themselves. Specify what="description" to browse a data description file if available. Specify what="all" to retrieve the data and see the metadata and description.

where

URL containing the data and metadata files

Value

getHdata() without a file argument returns a character vector of dataset base names. When a dataset is downloaded, the data frame is placed in search position one and is not returned as value of getHdata.

Author(s)

Frank Harrell

See Also

download.file, cleanup.import, data.restore, load

Examples

## Not run: 
getHdata()          # download list of available datasets
getHdata(prostate)  # downloads, load( ) or data.restore( )
                    # runs cleanup.import for S-Plus 6
getHdata(valung, "contents")   # open browser (options(browser="whatever"))
                    # after downloading valung.html
                    # (result of html(contents()))
getHdata(support, "all")  # download and open one browser window
datadensity(support)
attach(support)     # make individual variables available
getHdata(plasma,  "all")  # download and open two browser windows
                          # (description file is available for plasma)
## End(Not run)

Interact with github rscripts Project

Description

The github rscripts project at https://github.com/harrelfe/rscripts contains R scripts that are primarily analysis templates for teaching with RStudio. This function allows the user to print an organized list of available scripts, to download a script and source() it into the current session (the default), to download a script and load it into an RStudio script editor window, to list scripts whose major category contains a given string (ignoring case), or to list all major and minor categories. If options(localHfiles=TRUE) the scripts are read from local directory ~/R/rscripts instead of from github.

Usage

getRs(file=NULL, guser='harrelfe', grepo='rscripts', gdir='raw/master',
      dir=NULL, browse=c('local', 'browser'), cats=FALSE,
      put=c('source', 'rstudio'))

Arguments

file

a character string containing a script file name. Omit file to obtain a list of available scripts with major and minor categories.

guser

GitHub user name, default is 'harrelfe'

grepo

GitHub repository name, default is 'rscripts'

gdir

GitHub directory under which to find retrievable files

dir

directory under grepo in which to find files

browse

When showing the rscripts contents directory, the default is to list in tabular form in the console. Specify browse='browser' to open the online contents in a web browser.

cats

Leave at the default (FALSE) to list whole contents or download a script. Specify cats=TRUE to list major and minor categories available. Specify a character string to list all scripts whose major category contains the string (ignoring case).

put

Leave at the default ('source') to source() the file. This is useful when the file just defines a function you want to use in the session. Use put='rstudio' to load the file into the RStudio script editor window using the rstudioapi navigateToFile function. If RStudio is not running, file.edit() is used instead.

Value

a data frame or list, depending on arguments

Author(s)

Frank Harrell and Cole Beck

See Also

download.file

Examples

## Not run: 
getRs()             # list available scripts
scripts <- getRs()  # likewise, but store in an object that can easily
                    # be viewed on demand in RStudio
getRs('introda.r', put='rstudio')  # download introda.r and put in script editor
getRs(cats=TRUE)    # list available major and minor categories
categories <- getRs(cats=TRUE)
# likewise but store results in a list for later viewing
getRs(cats='reg')   # list all scripts in a major category containing 'reg'
getRs('importREDCap.r')   # source() to define a function
# source() a new version of the Hmisc package's cut2 function:
getRs('cut2.s', grepo='Hmisc', dir='R')
## End(Not run)

Open a Zip File From a URL

Description

Allows downloading and reading of a zip file containing one file

Usage

getZip(url, password=NULL)

Arguments

url

either a path to a local file or a valid URL.

password

required to decode password-protected zip files

Details

Allows downloading and reading of a zip file containing one file. The file may be password protected. If a password is needed then one will be requested unless given.

Note: to make password-protected zip file z.zip, do zip -e z myfile

Value

Returns a file I/O pipe.

Author(s)

Frank E. Harrell

See Also

pipe

Examples

## Not run: 
read.csv(getZip('http://test.com/z.zip'))
## End(Not run)

getabd

Description

Data from The Analysis of Biological Data by Whitlock and Schluter

Usage

getabd(name = "", lowernames = FALSE, allow = "_")

Arguments

name

name of dataset to fetch. Omit to get a data table listing all available datasets.

lowernames

set to TRUE to change variable names to lower case

allow

set to NULL to convert underscores in variable names to periods

Details

Fetches csv files for exercises in the book

Value

data frame with attributes label and url

Author(s)

Frank Harrell


Frequency Scatterplot

Description

Uses ggplot2 to plot a scatterplot or dot-like chart for the case where there is a very large number of overlapping values. This works for continuous and categorical x and y. For continuous variables it serves the same purpose as hexagonal binning. Counts for overlapping points are grouped into quantile groups and level of transparency and rainbow colors are used to provide count information.

Instead, you can specify stick=TRUE to not use color but to encode cell frequencies with the height of a black line y-centered at the middle of the bins. Relative frequencies are not transformed, and the maximum cell frequency is shown in a caption. Every point with at least a frequency of one is depicted with a full-height light gray vertical line, scaled to the above overall maximum frequency. In this way the relative frequency is the proportion of each light gray line that is black, and one can see points whose frequencies are too low to see the black lines.

The result can also be passed to ggplotly. Actual cell frequencies are added to the hover text in that case using the label ggplot2 aesthetic.

Usage

ggfreqScatter(x, y, by=NULL, bins=50, g=10, cuts=NULL,
              xtrans = function(x) x,
              ytrans = function(y) y,
              xbreaks = pretty(x, 10),
              ybreaks = pretty(y, 10),
              xminor  = NULL, yminor = NULL,
              xlab = as.character(substitute(x)),
              ylab = as.character(substitute(y)),
              fcolors = viridisLite::viridis(10), nsize=FALSE,
              stick=FALSE, html=FALSE, prfreq=FALSE, ...)

Arguments

x

x-variable

y

y-variable

by

an optional vector used to make separate plots for each distinct value using facet_wrap()

bins

for continuous x or y is the number of bins to create by rounding. Ignored for categorical variables. If a 2-vector, the first element corresponds to x and the second to y.

g

number of quantile groups to make for frequency counts. Use g=0 to use frequencies continuously for color coding. This is recommended only when using plotly.

cuts

instead of using g, specify cuts to provide the vector of cuts for categorizing frequencies for assignment to colors

xtrans, ytrans

functions specifying transformations to be made before binning and plotting

xbreaks, ybreaks

vectors of values to label on axis, on original scale

xminor, yminor

values at which to put minor tick marks, on original scale

xlab, ylab

axis labels. If not specified and variable has a label, that label will be used.

fcolors

colors argument to pass to scale_color_gradientn to color code frequencies. Use fcolors=gray.colors(10, 0.75, 0) to show grayscale, for example. Another good choice is fcolors=hcl.colors(10, 'Blue-Red').

nsize

set to TRUE to not vary color or transparency but instead to size the symbols in relation to the number of points. Best when both x and y are discrete. ggplot2 size is taken as the fourth root of the frequency. If there are 15 or fewer unique frequencies all the unique frequencies are used, otherwise g quantile groups of frequencies are used.

stick

set to TRUE to not use colors but instead use varying-height black vertical lines to depict cell frequencies.

html

set to TRUE to use html in axis labels instead of plotmath

prfreq

set to TRUE to print the frequency distributions of the binned coordinate frequencies

...

arguments to pass to geom_point such as shape and size

Value

a ggplot object

Author(s)

Frank Harrell

See Also

cut2

Examples

require(ggplot2)
set.seed(1)
x <- rnorm(1000)
y <- rnorm(1000)
count <- sample(1:100, 1000, TRUE)
x <- rep(x, count)
y <- rep(y, count)
# color=alpha=NULL below makes loess smooth over all points
g <- ggfreqScatter(x, y) +   # might add g=0 if using plotly
      geom_smooth(aes(color=NULL, alpha=NULL), se=FALSE) +
      ggtitle("Using Deciles of Frequency Counts, 2500 Bins")
g
# plotly::ggplotly(g, tooltip='label')  # use plotly, hover text = freq. only
# Plotly makes it somewhat interactive, with hover text tooltips
# Instead use varying-height sticks to depict frequencies
ggfreqScatter(x, y, stick=TRUE) +
 labs(subtitle='Relative height of black lines to gray lines
is proportional to cell frequency.
Note that points with even tiny frequency are visible
(gray line with no visible black line).')
# Try with x categorical
x1 <- sample(c('cat', 'dog', 'giraffe'), length(x), TRUE)
ggfreqScatter(x1, y)
# Try with y categorical
y1 <- sample(LETTERS[1:10], length(x), TRUE)
ggfreqScatter(x, y1)
# Both categorical, larger point symbols, box instead of circle
ggfreqScatter(x1, y1, shape=15, size=7)
# Vary box size instead
ggfreqScatter(x1, y1, nsize=TRUE, shape=15)

ggplotlyr

Description

Render plotly Graphic from a ggplot2 Object

Usage

ggplotlyr(ggobject, tooltip = "label", remove = "txt: ", ...)

Arguments

ggobject

an object produced by ggplot

tooltip

attribute specified to ggplot to hold hover text

remove

extraneous text to remove from hover text. The default assumes tooltip='label' and that the user specified aes(..., label=txt). If you instead specified aes(..., label=myvar) use remove='myvar: '.

...

other arguments passed to ggplotly

Details

Uses plotly::ggplotly() to render a plotly graphic with a specified tooltip attribute, removing extraneous text that ggplotly puts in hover text when tooltip='label'

Value

a plotly object

Author(s)

Frank Harrell


hashCheck

Description

Check for Changes in List of Objects

Usage

hashCheck(..., file, .print. = TRUE, .names. = NULL)

Arguments

...

a list of objects including data frames, vectors, functions, and all other types of R objects that represent dependencies of a certain calculation

file

name of file in which results are stored

.print.

set to FALSE to suppress printing information messages about what has changed

.names.

vector of names of original arguments if not calling hashCheck directly

Details

Given an RDS file name and a list of objects, does the following:

Set options(debughash=TRUE) to trace results in /tmp/debughash.txt

Value

a list with elements result (the computations), hash (the new hash), and changed which details what changed to make computations need to be run

Author(s)

Frank Harrell


Harrell-Davis Distribution-Free Quantile Estimator

Description

Computes the Harrell-Davis (1982) quantile estimator and jackknife standard errors of quantiles. The quantile estimator is a weighted linear combination of order statistics in which the order statistics used in traditional nonparametric quantile estimators are given the greatest weight. In small samples the H-D estimator is more efficient than traditional ones, and the two methods are asymptotically equivalent. The H-D estimator is the limit of a bootstrap average as the number of bootstrap resamples becomes infinitely large.
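The weighted combination of order statistics can be sketched in base R as follows. The weights are increments of a Beta(p(n+1), (1-p)(n+1)) CDF, following the formula in the cited Harrell and Davis paper; hd_sketch is a hypothetical helper written for illustration and is not the package's implementation (it computes no standard errors and does no input checking).

```r
# Hypothetical helper, not the package's implementation
hd_sketch <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  a <- p * (n + 1)
  b <- (1 - p) * (n + 1)
  # Weight for the i-th order statistic: increment of the Beta(a, b) CDF
  w <- pbeta((1:n) / n, a, b) - pbeta((0:(n - 1)) / n, a, b)
  sum(w * x)
}

set.seed(1)
x <- runif(100)
hd_x <- hd_sketch(x, 0.5)     # compare with quantile(x, 0.5)
hd_9 <- hd_sketch(1:9, 0.5)   # symmetric data: estimate is the middle value
```

The weights sum to one and peak at the order statistics nearest the requested quantile, which is why the estimate stays close to the traditional sample quantile.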

Usage

hdquantile(x, probs = seq(0, 1, 0.25),
           se = FALSE, na.rm = FALSE, names = TRUE, weights=FALSE)

Arguments

x

a numeric vector

probs

vector of quantiles to compute

se

set to TRUE to also compute standard errors

na.rm

set to TRUE to remove NAs from x before computing quantiles

names

set to FALSE to prevent names attributes from being added to quantiles and standard errors

weights

set to TRUE to return a "weights" attribute with the matrix of weights used in the H-D estimator corresponding to order statistics, with columns corresponding to quantiles.

Details

A Fortran routine is used to compute the jackknife leave-out-one quantile estimates. Standard errors are not computed for quantiles 0 or 1 (NAs are returned).

Value

A vector of quantiles. If se=TRUE this vector will have an attribute se added to it, containing the standard errors. If weights=TRUE, also has a "weights" attribute which is a matrix.

Author(s)

Frank Harrell

References

Harrell FE, Davis CE (1982): A new distribution-free quantile estimator. Biometrika 69:635-640.

Hutson AD, Ernst MD (2000): The exact bootstrap mean and variance of an L-estimator. J Roy Statist Soc B 62:89-94.

See Also

quantile

Examples

set.seed(1)
x <- runif(100)
hdquantile(x, (1:3)/4, se=TRUE)
## Not run: 
# Compare jackknife standard errors with those from the bootstrap
library(boot)
boot(x, function(x,i) hdquantile(x[i], probs=(1:3)/4), R=400)
## End(Not run)

Moving and Hiding Table of Contents

Description

Moving and hiding table of contents for Rmd HTML documents

Usage

hidingTOC(
  buttonLabel = "Contents",
  levels = 3,
  tocSide = c("right", "left"),
  buttonSide = c("right", "left"),
  posCollapse = c("margin", "top", "bottom"),
  hidden = FALSE
)

Arguments

buttonLabel

the text on the button that hides and unhides the table of contents. Defaults to Contents.

levels

the maximum depth of the table of contents whose display can be controlled. Defaults to 3.

tocSide

which side of the page should the table of contents be placed on. Can be either 'right' or 'left'. Defaults to 'right'

buttonSide

which side of the page should the button that hides the TOC be placed on. Can be either 'right' or 'left'. Defaults to 'right'

posCollapse

if 'margin' then display the depth select buttons vertically along the side of the page chosen by buttonSide. If 'top' then display the depth select buttons horizontally under the button that hides the TOC. Defaults to 'margin'. 'bottom' is currently unimplemented.

hidden

Logical: should the table of contents be hidden at page load? Defaults to FALSE

Details

hidingTOC creates a table of contents in an Rmd document that can be hidden at the press of a button. It also generates buttons that allow the hiding or unhiding of the different level depths of the table of contents.

Value

an HTML formatted text string to be inserted into a markdown document

Author(s)

Thomas Dupont

Examples

## Not run: 
hidingTOC()
## End(Not run)

Histograms for Variables in a Data Frame

Description

This function tries to compute the maximum number of histograms that will fit on one page, then it draws a matrix of histograms. If there are more qualifying variables than will fit on a page, the function waits for a mouse click before drawing the next page.

Usage

## S3 method for class 'data.frame'
hist(x, n.unique = 3, nclass = "compute",
                na.big = FALSE, rugs = FALSE, freq=TRUE, mtitl = FALSE, ...)

Arguments

x

a data frame

n.unique

minimum number of unique values a variable must have before a histogram is drawn

nclass

number of bins. Default is max(2,trunc(min(n/10,25*log(n,10))/2)), where n is the number of non-missing values for a variable.
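To see what the default works out to, the formula above can be evaluated for a few sample sizes. The helper name default_nclass is an assumption made for this illustration; only the expression inside it comes from the documentation.

```r
# Default number of bins as a function of the number of non-missing values n
default_nclass <- function(n) max(2, trunc(min(n / 10, 25 * log(n, 10)) / 2))

nbins <- sapply(c(10, 50, 200, 1000), default_nclass)
nbins
```

For small n the n/10 term governs and the floor of 2 bins applies; for large n the logarithmic term takes over and slows the growth in the number of bins.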

na.big

set to TRUE to draw the number of missing values on the top of the histogram in addition to in a subtitle. In the subtitle, n is the number of non-missing values and m is the number of missing values

rugs

set to TRUE to add rug plots at the top of each histogram

freq

see hist. Default is to show frequencies.

mtitl

set to a character string to set aside extra outside top margin and to use the string for an overall title

...

arguments passed to scat1d

Value

the number of pages drawn

Author(s)

Frank E Harrell Jr

See Also

hist, scat1d

Examples

d <- data.frame(a=runif(200), b=rnorm(200),
                w=factor(sample(c('green','red','blue'), 200, TRUE)))
hist.data.frame(d)   # in R, just say hist(d)

Back to Back Histograms

Description

Takes two vectors or a list with x and y components, and produces back to back histograms of the two datasets.

Usage

histbackback(x, y, brks=NULL, xlab=NULL, axes=TRUE, probability=FALSE,
             xlim=NULL, ylab='', ...)

Arguments

x, y

either two vectors or a list given as x with two components. If the components have names, they will be used to label the axis (modification FEH).

brks

vector of the desired breakpoints for the histograms.

xlab

a vector of two character strings naming the two datasets.

axes

logical flag stating whether or not to label the axes.

probability

logical flag: if TRUE, then the x-axis corresponds to the units for a density. If FALSE, then the units are counts.

xlim

x-axis limits. First value must be negative, as the left histogram is placed at negative x-values. Second value must be positive, for the right histogram. To make the limits symmetric, use e.g. xlim=c(-20,20).

ylab

label for y-axis. Default is no label.

...

additional graphics parameters may be given.

Value

a list is returned invisibly with the following components:

left

the counts for the dataset plotted on the left.

right

the counts for the dataset plotted on the right.

breaks

the breakpoints used.

Side Effects

a plot is produced on the current graphics device.

Author(s)

Pat Burns
Salomon Smith Barney
London
pburns@dorado.sbi.com

See Also

hist, histogram

Examples

options(digits=3)
set.seed(1)
histbackback(rnorm(20), rnorm(30))
fool <- list(x=rnorm(40), y=rnorm(40))
histbackback(fool)
age <- rnorm(1000,50,10)
sex <- sample(c('female','male'),1000,TRUE)
histbackback(split(age, sex))
agef <- age[sex=='female']; agem <- age[sex=='male']
histbackback(list(Female=agef,Male=agem), probability=TRUE, xlim=c(-.06,.06))

Use plotly to Draw Stratified Spike Histogram and Box Plot Statistics

Description

Uses plotly to draw horizontal spike histograms stratified by group, plus the mean (solid dot) and vertical bars for these quantiles: 0.05 (red, short), 0.25 (blue, medium), 0.50 (black, long), 0.75 (blue, medium), and 0.95 (red, short). The robust dispersion measure Gini's mean difference and the SD may optionally be added. These are shown as horizontal lines starting at the minimum value of x having a length equal to the mean difference or SD. Even when Gini's and SD are computed, they are not drawn unless the user clicks on their legend entry.

Spike histograms have the advantage of effectively showing the raw data for both small and huge datasets, and unlike box plots allow multi-modality to be easily seen.

histboxpM plots multiple histograms stacked vertically, for variables in a data frame having a common group variable (if any) and combined using plotly::subplot.

dhistboxp is like histboxp but no plotly graphics are actually drawn. Instead, a data frame suitable for use with plotlyM is returned. For dhistboxp an additional level of stratification strata is implemented. group causes a different result here to produce back-to-back histograms (in the case of two groups) for each level of strata.
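Gini's mean difference, mentioned above as the robust dispersion measure, is the mean absolute difference between all pairs of distinct observations. A minimal base R sketch follows; gmd_sketch is a hypothetical helper for illustration, uses an O(n^2) computation, and is not the routine the plotting functions call.

```r
# Hypothetical helper: mean absolute difference over all pairs i != j
gmd_sketch <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (n * (n - 1))
}

gmd_small <- gmd_sketch(c(1, 2, 4))   # pairwise differences are 1, 3, and 2
gmd_small
```

Unlike the SD, this measure does not square deviations, so a single extreme value inflates it less.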

Usage

histboxp(p = plotly::plot_ly(height=height), x, group = NULL,
         xlab=NULL, gmd=TRUE, sd=FALSE, bins = 100, wmax=190, mult=7,
         connect=TRUE, showlegend=TRUE)
dhistboxp(x, group = NULL, strata=NULL, xlab=NULL,
          gmd=FALSE, sd=FALSE, bins = 100, nmin=5, ff1=1, ff2=1)
histboxpM(p=plotly::plot_ly(height=height, width=width), x, group=NULL,
          gmd=TRUE, sd=FALSE, width=NULL, nrows=NULL, ncols=NULL, ...)

Arguments

p

plotly graphics object if already begun

x

a numeric vector, or for histboxpM a numeric vector or a data frame of numeric vectors, hopefully with label and units attributes

group

a discrete grouping variable. If omitted, defaults to a vector of ones

strata

a discrete numeric stratification variable. Values are also used to space out different spike histograms. Defaults to a vector of ones.

xlab

x-axis label; defaults to the labelled version including units of measurement, if any

gmd

set to FALSE to not compute Gini's mean difference

sd

set to TRUE to compute the SD

width

width in pixels

nrows

number of rows for layout of multiple plots

ncols

number of columns for layout of multiple plots. At most one of nrows, ncols should be specified.

bins

number of equal-width bins to use for spike histogram. If the number of distinct values of x is less than bins, the actual values of x are used.

nmin

minimum number of non-missing observations for a group-stratum combination before the spike histogram and quantiles are drawn

ff1, ff2

fudge factors for position and bar length for spike histograms

wmax, mult

tweaks for margin to allocate

connect

set to FALSE to suppress lines connecting quantiles

showlegend

used if producing multiple plots to be combined with subplot; set to FALSE for all but one plot

...

other arguments for histboxpM that are passed to histboxp

Value

a plotly object. For dhistboxp a data frame as expected by plotlyM

Author(s)

Frank Harrell

See Also

histSpike, plot.describe, scat1d

Examples

## Not run: 
dist <- c(rep(1, 500), rep(2, 250), rep(3, 600))
Distribution <- factor(dist, 1 : 3, c('Unimodal', 'Bimodal', 'Trimodal'))
x <- c(rnorm(500, 6, 1),
       rnorm(200, 3, .7), rnorm(50, 7, .4),
       rnorm(200, 2, .7), rnorm(300, 5.5, .4), rnorm(100, 8, .4))
histboxp(x=x, group=Distribution, sd=TRUE)
X <- data.frame(x, x2=runif(length(x)))
histboxpM(x=X, group=Distribution, ncols=2)  # separate plots
## End(Not run)

hlab

Description

Easy Extraction of Labels/Units Expressions for Plotting

Usage

hlab(x, name = NULL, html = FALSE, plotmath = TRUE)

Arguments

x

a single variable name, unquoted

name

a single character string providing an alternate way to name x that is useful when hlab is called from another function such as hlabs

html

set to TRUE to return HTML strings instead of plotmath expressions

plotmath

set to FALSE to use plain text instead of plotmath

Details

Given a single unquoted variable, first looks to see if a non-NULL LabelsUnits object exists (produced by extractlabs()). When LabelsUnits does not exist or is NULL, looks up the attributes in the current dataset, which defaults to d or may be specified by options(current_ds='name of the data frame/table'). Finally the existence of a variable of the given name in the global environment is checked. When a variable is not found in any of these three sources or has a blank label and units, an expression() with the variable name alone is returned. If html=TRUE, HTML strings are constructed instead, suitable for plotly graphics.

The result is useful for xlab and ylab in base plotting functions or in ggplot2, along with being useful for labs in ggplot2. See example.

Value

an expression created by labelPlotmath with plotmath=TRUE

Author(s)

Frank Harrell

See Also

label(), units(), contents(), hlabs(), extractlabs(), plotmath

Examples

d <- data.frame(x=1:10, y=(1:10)/10)
d <- upData(d, labels=c(x='X', y='Y'), units=c(x='mmHg'), print=FALSE)
hlab(x)
hlab(x, html=TRUE)
hlab(z)
require(ggplot2)
ggplot(d, aes(x, y)) + geom_point() + labs(x=hlab(x), y=hlab(y))
# Can use xlab(hlab(x)) + ylab(hlab(y)) also
# Store names, labels, units for all variables in d in object
LabelsUnits <- extractlabs(d)
# Remove d; labels/units still found
rm(d)
hlab(x)
# Remove LabelsUnits and use a current dataset named
# d2 instead of the default d
rm(LabelsUnits)
options(current_ds='d2')

hlabs

Description

Front-end to ggplot2 labs Function

Usage

hlabs(x, y, html = FALSE)

Arguments

x

a single variable name, unquoted

y

a single variable name, unquoted

html

set to TRUE to render in html (for plotly), otherwise the result is plotmath expressions

Details

Runs x, y, or both through hlab() and passes the constructed labels to the ggplot2::labs function to specify x- and y-axis labels specially formatted for units of measurement

Value

result of ggplot2::labs()

Author(s)

Frank Harrell

Examples

# Name the current dataset d, or specify a name with
# options(current_ds='...') or run `extractlabs`, then
# ggplot(d, aes(x,y)) + geom_point() + hlabs(x,y)
# to specify only the x-axis label use hlabs(x), or to
# specify only the y-axis label use hlabs(y=...)

Matrix of Hoeffding's D Statistics

Description

Computes a matrix of Hoeffding's (1948) D statistics for all possible pairs of columns of a matrix. D is a measure of the distance between F(x,y) and G(x)H(y), where F(x,y) is the joint CDF of X and Y, and G and H are marginal CDFs. Missing values are deleted in pairs rather than deleting all rows of x having any missing variables. The D statistic is robust against a wide variety of alternatives to independence, such as non-monotonic relationships. The larger the value of D, the more dependent are X and Y (for many types of dependencies). D used here is 30 times Hoeffding's original D, and ranges from -0.5 to 1.0 if there are no ties in the data. print.hoeffd prints the information derived by hoeffd. hoeffd also computes the mean and maximum absolute values of the difference between the joint empirical CDF and the product of the marginal empirical CDFs.
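The mean and maximum absolute differences between the joint empirical CDF and the product of the marginal empirical CDFs can be sketched directly in base R. This is an illustration only: cdf_diff_sketch is a hypothetical helper, it evaluates the CDFs at the observed points using "less than or equal", and the package's exact conventions (for example its handling of ties) may differ.

```r
# Hypothetical helper: differences evaluated at the observed (x, y) points
cdf_diff_sketch <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)          # pairwise deletion of missing values
  x <- x[ok]; y <- y[ok]
  joint <- sapply(seq_along(x), function(i) mean(x <= x[i] & y <= y[i]))
  margx <- sapply(x, function(v) mean(x <= v))
  margy <- sapply(y, function(v) mean(y <= v))
  d <- abs(joint - margx * margy)
  c(mean = mean(d), max = max(d))
}

dep   <- cdf_diff_sketch(1:5, 1:5)                      # perfectly dependent
indep <- cdf_diff_sketch(c(1, 2, 1, 2), c(1, 1, 2, 2))  # balanced 2 x 2 layout
```

In the balanced 2 x 2 layout the joint CDF factors exactly into the marginals, so both summaries are zero, while the perfectly dependent pair gives clearly positive values.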

Usage

hoeffd(x, y)
## S3 method for class 'hoeffd'
print(x, ...)

Arguments

x

a numeric matrix with at least 5 rows and at least 2 columns (if y is absent), or an object created by hoeffd

y

a numeric vector or matrix which will be concatenated to x

...

ignored

Details

Uses midranks in case of ties, as described by Hollander and Wolfe. P-values are approximated by linear interpolation on the table in Hollander and Wolfe, which uses the asymptotically equivalent Blum-Kiefer-Rosenblatt statistic. For P<.0001 or >0.5, P values are computed using a well-fitting linear regression function in log P vs. the test statistic. Ranks (but not bivariate ranks) are computed using efficient algorithms (see reference 3).

Value

a list with elements D, the matrix of D statistics, n the matrix of number of observations used in analyzing each pair of variables, and P, the asymptotic P-values. Pairs with fewer than 5 non-missing values have the D statistic set to NA. The diagonals of n are the number of non-NAs for the single variable corresponding to that row and column.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

References

Hoeffding W. (1948): A non-parametric test of independence. Ann Math Stat 19:546–57.

Hollander M. and Wolfe D.A. (1973). Nonparametric Statistical Methods, pp. 228–235, 423. New York: Wiley.

Press WH, Flannery BP, Teukolsky SA, Vetterling, WT (1988): Numerical Recipes in C. Cambridge: Cambridge University Press.

See Also

rcorr, varclus

Examples

x <- c(-2, -1, 0, 1, 2)
y <- c(4,   1, 0, 1, 4)
z <- c(1,   2, 3, 4, NA)
q <- c(1,   2, 3, 4, 5)
hoeffd(cbind(x,y,z,q))
# Hoeffding's test can detect even one-to-many dependency
set.seed(1)
x <- seq(-10,10,length=200)
y <- x*sign(runif(200,-1,1))
plot(x,y)
hoeffd(x,y)

Convert an S object to HTML

Description

html is a generic function, for which only two methods are currently implemented, html.latex and a rudimentary html.data.frame. The former uses the HeVeA LaTeX to HTML translator by Maranget to create an HTML file from a LaTeX file like the one produced by latex. html.default just runs html.data.frame. htmlVerbatim prints all of its arguments to the console in an html verbatim environment, using a specified percent of the prevailing character size. This is useful for R Markdown with knitr.

Most of the html-producing functions in the Hmisc and rms packages return a character vector passed through htmltools::HTML so that knitr will correctly format the result without the user needing to put results='asis' in the chunk header.

Usage

html(object, ...)

## S3 method for class 'latex'
html(object, file, where=c('cwd', 'tmp'),
     method=c('hevea', 'htlatex'),
     rmarkdown=FALSE, cleanup=TRUE, ...)

## S3 method for class 'data.frame'
html(object,
     file=paste(first.word(deparse(substitute(object))),'html',sep='.'), header,
     caption=NULL, rownames=FALSE, align='r', align.header='c',
     bold.header=TRUE, col.header='Black',
     border=2, width=NULL, size=100, translate=FALSE,
     append=FALSE, link=NULL, linkCol=1,
     linkType=c('href','name'), disableq=FALSE, ...)

## Default S3 method:
html(object,
     file=paste(first.word(deparse(substitute(object))),'html',sep='.'),
     append=FALSE, link=NULL, linkCol=1, linkType=c('href','name'), ...)

htmlVerbatim(..., size=75, width=85, scroll=FALSE, rows=10, cols=100,
             propts=NULL, omit1b=FALSE)

Arguments

object

a data frame or an object created by latex. For the generic html, any object for which an html method exists.

file

name of the file to create. The default file name is object.html where object is the first word in the name of the argument for object. For html.latex specify file='' or file=character(0) to print html code to the console, as when using knitr. For the data.frame method, file may be set to FALSE which causes a character vector enclosed in htmltools::HTML to be returned instead of writing to the console.

where

for html. The default is to put output files in the current working directory. Specify where='tmp' to put them in a system temporary directory area.

method

default is to use system command hevea to convert from LaTeX to html. Specify method='htlatex' to use system command htlatex, assuming the system package TeX4ht is installed.

rmarkdown

set to TRUE if using RMarkdown (usually under knitr and RStudio). This causes html to be packaged for RMarkdown and output to go into the console stream. file is ignored when rmarkdown=TRUE.

cleanup

if using method='htlatex' set to FALSE if where='cwd' to prevent deletion of auxiliary files created by htlatex that are not needed when using the final html document (only the .css file is needed in addition to .html). If using method='hevea', cleanup=TRUE causes deletion of the generated .haux file.

header

vector of column names. Defaults to names in object. Set to NULL to suppress column names.

caption

a character string to be used as a caption before the table

rownames

set to FALSE to ignore row names even if they are present

align

alignment for table columns (all are assumed to have the same alignment if align is a scalar). Specify "c", "r", "l" for center, right, or left alignment.

align.header

same coding as for align but pertains to header

bold.header

set to FALSE to not bold face column headers

col.header

color for column headers

border

set to 0 to not include table cell borders, 1 to include only outer borders, or 2 (the default) to put borders around cells too

translate

set to TRUE to run header and table cell text through the htmlTranslate function

width

optional table width for html.data.frame. For full page width use width="100%", for use in options() for printing objects.

size

a number between 0 and 100 representing the percent of the prevailing character size to be used by htmlVerbatim and the data frame method.

append

set to TRUE to append to an existing file

link

character vector specifying hyperlink names to attach to selected elements of the matrix or data frame. No hyperlinks are used if link is omitted or for elements of link that are "". To allow multiple links per link, link may also be a character matrix shaped as object in which case linkCol is ignored.

linkCol

column number of object to which hyperlinks are attached. Defaults to first column.

linkType

defaults to "href"

disableq

set to TRUE to add code to the html table tag that makes Quarto not use its usual table style

...

ignored except for htmlVerbatim, where it is a list of objects to print()

scroll

set to TRUE to put the html in a scrollable textarea

rows, cols

the number of rows and columns to devote to the visible part of the scrollable box

propts

options, besides quote=FALSE, to pass to the print method, for htmlVerbatim

omit1b

for htmlVerbatim, if TRUE causes an initial and a final line of output that is all blank to be deleted

Author(s)

Frank E. Harrell, Jr.
Department of Biostatistics,
Vanderbilt University,
fh@fharrell.com

References

Maranget, Luc. HeVeA: a LaTeX to HTML translator. URL: http://para.inria.fr/~maranget/hevea/

See Also

latex

Examples

## Not run: 
x <- matrix(1:6, nrow=2, dimnames=list(c('a','b'),c('c','d','e')))
w <- latex(x)
h <- html(w) # run HeVeA to convert .tex to .html
h <- html(x) # convert x directly to html
w <- html(x, link=c('','B'))   # hyperlink second row first col to B

# Assuming system package tex4ht is installed, easily convert advanced
# LaTeX tables to html
getHdata(pbc)
s <- summaryM(bili + albumin + stage + protime + sex + age + spiders ~ drug,
              data=pbc, test=TRUE)
w <- latex(s, npct='slash', file='s.tex')
z <- html(w)
browseURL(z$file)

d <- describe(pbc)
w <- latex(d, file='d.tex')
z <- html(w)
browseURL(z$file)

## End(Not run)

htmltabv

Description

Simple HTML Table of Verbatim Output

Usage

htmltabv(..., cols = 2, propts = list(quote = FALSE))

Arguments

...

objects to print(). The arguments must be named with the labels you want to print before the verbatim print().

cols

number of columns in the html table

propts

an option list of arguments to pass to the print() methods; default is to not quote character strings

Details

Uses capture.output to capture as character strings the results of running print() on each element of .... If an element of ... has length of 1 and is a blank string, nothing is printed for that cell other than its name (not in verbatim).
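The capture step described above can be sketched in base R. capture_cells() is a hypothetical helper written for illustration; it shows only how printed output becomes character strings and does not build the html table that htmltabv returns.

```r
# Sketch of the capture step only; capture_cells() is a hypothetical helper
capture_cells <- function(...) {
  objs <- list(...)            # named arguments become named cells
  lapply(objs, function(obj) {
    # A single blank string yields an empty cell body (only the name remains)
    if (is.character(obj) && length(obj) == 1 && obj == "") return(character(0))
    capture.output(print(obj, quote = FALSE))   # printed lines as strings
  })
}

cells <- capture_cells(Letters = c("a", "b"), Blank = "")
```

Each element of cells holds the lines that print() would have written to the console, keyed by the argument name used as the cell label.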

Value

character string of html

Author(s)

Frank Harrell


Generic Functions and Methods for Imputation

Description

These functions do simple and transcan imputation and print, summarize, and subscript variables that have NAs filled-in with imputed values. The simple imputation method involves filling in NAs with constants, with a specified single-valued function of the non-NAs, or from a sample (with replacement) from the non-NA values (this is useful in multiple imputation). More complex imputations can be done with the transcan function, which also works with the generic methods shown here, i.e., impute can take a transcan object and use the imputed values created by transcan (with imputed=TRUE) to fill-in NAs. The print method places * after variable values that were imputed. The summary method summarizes all imputed values and then uses the next summary method available for the variable. The subscript method preserves attributes of the variable and subsets the list of imputed values corresponding with how the variable was subsetted. The is.imputed function is for checking if observations are imputed.

Usage

impute(x, ...)

## Default S3 method:
impute(x, fun=median, ...)

## S3 method for class 'impute'
print(x, ...)

## S3 method for class 'impute'
summary(object, ...)

is.imputed(x)

Arguments

x

a vector or an object created by transcan, or a vector needing basic unconditional imputation. If there are no NAs and x is a vector, it is returned unchanged.

fun

the name of a function to use in computing the (single) imputed value from the non-NAs. The default is median. If instead of specifying a function as fun, a single value or vector (numeric, or character if object is a factor) is specified, those values are used for insertion. fun can also be the character string "random" to draw random values for imputation, with the random values not forced to be the same if there are multiple NAs. For a vector of constants, the vector must be of length one (indicating the same value replaces all NAs) or must be as long as the number of NAs, in which case the values correspond to consecutive NAs to replace. For a factor object, constants for imputation may include character values not in the current levels of object. In that case new levels are added. If object is of class "factor", fun is ignored and the most frequent category is used for imputation.
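The three ways of specifying fun for a numeric vector can be sketched in base R as follows. fill_na() is a hypothetical helper that mirrors the rules described above; it is not the Hmisc implementation, and it ignores factors and the "impute" class.

```r
# Sketch of the simple imputation rules; fill_na() is a hypothetical helper
fill_na <- function(x, fun = median) {
  miss <- is.na(x)
  if (!any(miss)) return(x)                       # no NAs: return unchanged
  if (is.function(fun)) {
    x[miss] <- fun(x[!miss])                      # single value from non-NAs
  } else if (identical(fun, "random")) {
    # draws are independent, so multiple NAs may get different values
    x[miss] <- sample(x[!miss], sum(miss), replace = TRUE)
  } else {
    stopifnot(length(fun) %in% c(1, sum(miss)))   # one constant or one per NA
    x[miss] <- fun
  }
  x
}

age <- c(1, 2, NA, 4, NA)
by_median   <- fill_na(age)              # both NAs replaced by the median
by_constant <- fill_na(age, c(10, 20))   # consecutive NAs get 10, then 20
```

With the real function, the corresponding calls would be impute(age), impute(age, c(10, 20)), and impute(age, "random").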

object

an object of class "impute"

...

ignored

Value

a vector with class "impute" placed in front of existing classes. For is.imputed, a vector of logical values is returned (all TRUE if object is not of class impute).

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

See Also

transcan, impute.transcan, describe, na.include, sample

Examples

age <- c(1,2,NA,4)
age.i <- impute(age)
# Could have used impute(age,2.5), impute(age,mean), impute(age,"random")
age.i
summary(age.i)
is.imputed(age.i)

intMarkovOrd

Description

Compute Parameters for Proportional Odds Markov Model

Usage

intMarkovOrd(
  y,
  times,
  initial,
  absorb = NULL,
  intercepts,
  extra = NULL,
  g,
  target,
  t,
  ftarget = NULL,
  onlycrit = FALSE,
  constraints = NULL,
  printsop = FALSE,
  ...
)

Arguments

y

vector of possible y values in order (numeric, character, factor)

times

vector of measurement times

initial

initial value of y (baseline state; numeric, character, or factor matching y). If length 1 this value is used for all subjects, otherwise it is a vector of length n.

absorb

vector of absorbing states, a subset of y (numeric, character, or factor matching y). The default is no absorbing states. Observations are truncated when an absorbing state is simulated.

intercepts

vector of initial guesses for the intercepts

extra

an optional vector of initial guesses for other parameters passed to g such as regression coefficients for previous states and for general time trends. Name the elements of extra for more informative output.

g

a user-specified function of three or more arguments which in order are yprev - the value of y at the previous time, the current time t, the gap between the previous time and the current time, an optional (usually named) covariate vector X, and optional arguments such as a regression coefficient value to simulate from. The function needs to allow yprev to be a vector and yprev must not include any absorbing states. The g function returns the linear predictor for the proportional odds model aside from intercepts. The returned value must be a matrix with row names taken from yprev. If the model is a proportional odds model, the returned value must be one column. If it is a partial proportional odds model, the value must have one column for each distinct value of the response variable Y after the first one, with the levels of Y used as optional column names. So columns correspond to intercepts. The different columns are used for y-specific contributions to the linear predictor (aside from intercepts) for a partial or constrained partial proportional odds model. Parameters for partial proportional odds effects may be included in the ... arguments.
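A g function for the plain proportional odds case (one column) might look like the sketch below. The argument names beta_prev and beta_time and their values are illustrative assumptions, not defaults used by Hmisc.

```r
# Hypothetical g for a plain proportional odds model (one-column result).
# beta_prev and beta_time are made-up coefficients for illustration only.
g <- function(yprev, t, gap, X = NULL, beta_prev = 0.5, beta_time = -0.1) {
  yprev_num <- as.numeric(yprev)
  lp <- beta_prev * yprev_num + beta_time * t   # linear predictor, no intercepts
  # One column, with row names taken from yprev as the description requires
  matrix(lp, ncol = 1, dimnames = list(as.character(yprev), NULL))
}

lp <- g(yprev = 1:3, t = 2, gap = 1)   # vectorized over previous states
```

The function accepts a vector yprev and returns one row per previous state, which is the shape described above for a proportional odds model.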

target

vector of target state occupancy probabilities at time t. If extra is specified, target must be a matrix where row names are character versions of t and columns represent occupancy probabilities corresponding to values of y at the time given in the row.
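When extra is given, the matrix form of target can be laid out as in this sketch. The states, times, and probabilities are made-up values, and the column names are added only for readability (the description above requires only the row names).

```r
y <- 1:3                     # possible states (assumed for illustration)
t <- c(2, 4)                 # target times
# One row per target time, one column per value of y
target <- rbind(c(0.5, 0.3, 0.2),
                c(0.3, 0.3, 0.4))
rownames(target) <- as.character(t)   # row names are character versions of t
colnames(target) <- as.character(y)   # optional labels, an assumption here
```

Each row is a set of state occupancy probabilities at one time, so each row sums to 1.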

t

target times. Can have more than one element only if extra is given.

ftarget

an optional function defining constraints that relate to transition probabilities. The function returns a penalty which is a sum of absolute differences in probabilities from target probabilities over possibly multiple targets. The ftarget function must have two arguments: intercepts and extra.

onlycrit

set to TRUE to only return the achieved objective criterion and not print anything

constraints

a function of two arguments: the vector of current intercept values and the vector of extra parameters, returning TRUE if that vector meets the constraints and FALSE otherwise
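A constraints function could be written along the lines of this sketch. The particular rules (strictly decreasing intercepts and a non-negative first element of extra) are assumptions chosen for illustration, not requirements stated by intMarkovOrd.

```r
# Hypothetical constraint: intercepts must be strictly decreasing and the
# first element of extra (if any) must be non-negative
constraints <- function(intercepts, extra) {
  all(diff(intercepts) < 0) && (length(extra) == 0 || extra[1] >= 0)
}

ok  <- constraints(c(2, 0, -1), extra = c(tau = 0.3))   # meets both rules
bad <- constraints(c(0, 2, -1), extra = c(tau = 0.3))   # intercepts not decreasing
```

The function takes exactly the two arguments described above and returns a single logical value.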

printsop

set to TRUE to print solved-for state occupancy probabilities for groups 1 and 2 and log odds ratios corresponding to them

...

optional arguments to pass to stats::nlm(). If this is specified, the arguments that intMarkovOrd normally sends to nlm are not used.

Details

Given a vector intercepts of initial guesses at the intercepts in a Markov proportional odds model, and a vector extra if there are other parameters, solves for the intercepts and extra vectors that yield a set of occupancy probabilities at time t that equal, as closely as possible, a vector of target values.

Value

list containing two vectors named intercepts and extra unless onlycrit=TRUE in which case the best achieved sum of absolute errors is returned

Author(s)

Frank Harrell

See Also

https://hbiostat.org/R/Hmisc/markov/


knitr Setup and plotly Service Function

Description

knitrSet sets up knitr to use better default parameters for base graphics, better code formatting, and to allow several arguments to be passed from code chunk headers, such as bty, mfrow, ps, bot (extra bottom margin for base graphics), top (extra top margin), left (extra left margin), rt (extra right margin), lwd, mgp, las, tcl, axes, xpd, h (usually fig.height in knitr), w (usually fig.width in knitr), wo (out.width in knitr), ho (out.height in knitr), cap (character string containing figure caption), scap (character string containing short figure caption for table of figures). The capfile argument facilitates auto-generating a table of figures for certain Rmarkdown report themes. This is done by the addition of a hook function that appends data to the capfile file each time a chunk runs that has a long or short caption in the chunk header.

plotlySave saves a plotly graphic with name foo.png where foo is the name of the current chunk. You must have a free plotly account from plot.ly to use this function, and you must have run Sys.setenv(plotly_username="your_plotly_username") and Sys.setenv(plotly_api_key="your_api_key"). The API key can be found in one's profile settings.

Usage

knitrSet(basename=NULL, w=if(! bd) 4, h=if(! bd) 3, wo=NULL, ho=NULL,
         fig.path=if(length(basename)) basename else '',
         fig.align=if(! bd) 'center', fig.show='hold',
         fig.pos=if(! bd) 'htbp',
         fig.lp    = if(! bd) paste('fig', basename, sep=':'),
         dev=switch(lang, latex='pdf', markdown='png',
                    blogdown=NULL, quarto=NULL),
         tidy=FALSE, error=FALSE,
         messages=c('messages.txt', 'console'),
         width=61, decinline=5, size=NULL, cache=FALSE,
         echo=TRUE, results='markup', capfile=NULL,
         lang=c('latex','markdown','blogdown','quarto'))

plotlySave(x, ...)

Arguments

basename

base name to be added in front of graphics file names. basename is followed by a minus sign.

w, h

default figure width and height in inches

wo, ho

default figure rendering width and height, in integer pixels or percent as a character string, e.g. '40%'

fig.path

path for figures. To put figures in a subdirectory specify e.g. fig.path='folder/'. Ignored for blogdown.

fig.align, fig.show, fig.pos, fig.lp, tidy, cache, echo, results, error, size

see knitr documentation

dev

graphics device, with the default determined from lang

messages

By default warning and other messages such as those from loading packages are sent to file 'messages.txt' in the current working directory. You can specify messages='console' to send them directly to the console.

width

text output width for R code and output

decinline

number of digits to the right of the decimal point to round numeric values appearing inside Sexpr

capfile

the name of a file in the current working directory that is used to accumulate chunk labels, figure cross-reference tags, and figure short captions (long captions if no short caption is defined) for the purpose of using markupSpecs$markdown$tof() to insert a table of figures in a report. The file is appended to, which is useful if cache=TRUE is used since this will keep some chunks from running. The tof function will remove earlier duplicated figure tags if this is the case. If not caching, the user should initialize the file to empty at the top of the script.
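Initializing the caption file to empty at the top of a script can be done with base R as in this sketch. The file name captions.txt is a hypothetical choice, and the knitrSet call is shown as a comment because it requires Hmisc and a knitr session.

```r
capfile <- "captions.txt"   # hypothetical file name in the working directory
file.create(capfile)        # creates the file, or truncates it if it exists

# Later, in the setup chunk (requires Hmisc):
# knitrSet(lang = "markdown", capfile = capfile)
```

Skipping this step is appropriate when cache=TRUE is used, since cached chunks do not rerun and their earlier entries need to stay in the file.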

lang

Default is 'latex' to use LaTeX. Set to 'markdown' when using R Markdown or 'blogdown' or 'quarto'. For 'blogdown' and 'quarto', par and knitr graphics-related hooks are not called as this would prevent writing graphics files in the correct directory for the blog system.

x

a plotly graphics object or a named list of such objects. The resulting png file will go in the file path given by the knitr fig.path value, and have a base name equal to the current knitr chunk name. If x is a list, a minus sign followed by the chunk name are inserted before .png.

...

additional arguments passed to plotly::plotly_IMAGE

Author(s)

Frank Harrell

See Also

knit

Examples

## Not run: 
# Typical call (without # comment symbols):
# <<echo=FALSE>>=
# require(Hmisc)
# knitrSet()
# @

knitrSet()    # use all defaults and don't use a graphics file prefix
knitrSet('modeling')   # use modeling- prefix for a major section or chapter
knitrSet(cache=TRUE, echo=FALSE)  # global default to cache and not print code
knitrSet(w=5,h=3.75)   # override default figure width, height

# ```{r chunkname}
# p <- plotly::plot_ly(...)
# plotlySave(p)   # creates fig.path/chunkname.png

## End(Not run)

Label Curves, Make Keys, and Interactively Draw Points and Curves

Description

labcurve optionally draws a set of curves then labels the curves. A variety of methods for drawing labels are implemented, ranging from positioning using the mouse to automatic labeling to automatic placement of key symbols with manual placement of key legends to automatic placement of legends. For automatic positioning of labels or keys, a curve is labeled at a point that is maximally separated from all of the other curves. Gaps occurring when curves do not start or end at the same x-coordinates are given preference for positioning labels. If labels are offset from the curves (the default behaviour), if the closest curve to curve i is above curve i, curve i is labeled below its line. If the closest curve is below curve i, curve i is labeled above its line. These directions are reversed if the resulting labels would appear outside the plot region.

Both ordinary lines and step functions are handled, and there is an option to draw the labels at the same angle as the curve within a local window.

Unless the mouse is used to position labels or plotting symbols are placed along the curves to distinguish them, curves are examined at 100 (by default) equally spaced points over the range of x-coordinates in the current plot area. Linear interpolation is used to get y-coordinates to line up (step function or constant interpolation is used for step functions). There is an option to instead examine all curves at the set of unique x-coordinates found by unioning the x-coordinates of all the curves. This option is especially useful when plotting step functions. By setting adj="auto" you can have labcurve try to optimally left- or right-justify labels depending on the slope of the curves at the points at which labels would be centered (plus a vertical offset). This is especially useful when labels must be placed on steep curve sections.
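The grid-based search for a well-separated labeling point can be sketched in base R. best_label_x() is a hypothetical helper: it handles only ordinary (non-step) curves over a shared x-range, and it does not implement the preference for gaps, the offsets, or the justification logic of labcurve.

```r
# Sketch of the grid search; best_label_x() is a hypothetical helper
best_label_x <- function(curves, i, npts = 100) {
  xr   <- range(unlist(lapply(curves, `[[`, "x")))
  grid <- seq(xr[1], xr[2], length.out = npts)   # equally spaced x-coordinates
  # Linear interpolation puts every curve on the common grid
  ymat <- sapply(curves, function(cv) approx(cv$x, cv$y, xout = grid)$y)
  # Vertical distance from curve i to the nearest other curve at each grid point
  gaps <- apply(abs(ymat[, -i, drop = FALSE] - ymat[, i]), 1, min)
  grid[which.max(gaps)]                          # x with maximal separation
}

curves <- list(a = list(x = c(0, 10), y = c(0, 10)),   # y = x
               b = list(x = c(0, 10), y = c(0, 0)))    # y = 0
x_lab <- best_label_x(curves, i = 1)
```

Here the two lines diverge steadily, so the chosen point for curve a is the right end of the x-range.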

You can use the on top method to write (short) curve names directly on the curves (centered on the y-coordinate). This is especially useful when there are many curves whose full labels would run into each other. You can plot letters or numbers on the curves, for example (using the keys option), and have labcurve use the key function to provide long labels for these short ones (see the end of the example). There is another option for connecting labels to curves using arrows. When keys is a vector of integers, it is taken to represent plotting symbols (pchs), and these symbols are plotted at equally-spaced x-coordinates on each curve (by default, using 5 points per curve). The points are offset in the x-direction between curves so as to minimize the chance of collisions.

To add a legend defining line types, colors, or line widths with no symbols, specify keys="lines", e.g., labcurve(curves, keys="lines", lty=1:2).

putKey provides a different way to use key() by allowing the user to specify vectors for labels, line types, plotting characters, etc. Elements that do not apply (e.g., pch for lines (type="l")) may be NA. When a series of points is represented by both a symbol and a line, the corresponding elements of both pch and lty, col., or lwd will be non-missing.

putKeyEmpty, given vectors of all the x-y coordinates that have been plotted, uses largest.empty to find the largest empty rectangle large enough to hold the key, and draws the key using putKey.

drawPlot is a simple mouse-driven function for drawing series of lines, step functions, polynomials, Bezier curves, and points, and automatically labeling the point groups using labcurve or putKeyEmpty. When drawPlot is invoked it creates temporary functions Points, Curve, and Abline. The user calls these functions inside the call to drawPlot to define groups of points in the order they are defined with the mouse. Abline is used to call abline and not actually create a group of points. For some curve types, the curve generated to represent the corresponding series of points is drawn after all points are entered for that series, and this curve may be different than the simple curve obtained by connecting points at the mouse clicks. For example, to draw a general smooth Bezier curve the user need only click on a few points, and she must overshoot the final curve coordinates to define the curve. The originally entered points are not erased once the curve is drawn. The same goes for step functions and polynomials. If you plot() the object returned by drawPlot, however, only final curves will be shown. The last examples show how to use drawPlot.

The largest.empty function finds the largest rectangle that is large enough to hold a rectangle of a given height and width, such that the rectangle does not contain any of a given set of points. This is used by labcurve and putKeyEmpty to position keys at the most empty part of an existing plot. The default method was created by Hans Borchers.
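The idea of searching for a point-free rectangle can be sketched with a brute-force search over bins in base R. largest_empty_bins() is a hypothetical helper in the spirit of a binning search by area; it is not the algorithm behind the default 'exhaustive' method, and it omits the minimum width and height screening.

```r
# Brute-force binning sketch; largest_empty_bins() is a hypothetical helper
largest_empty_bins <- function(x, y, xlim, ylim, numbins = 10) {
  xb <- seq(xlim[1], xlim[2], length.out = numbins + 1)   # bin edges in x
  yb <- seq(ylim[1], ylim[2], length.out = numbins + 1)   # bin edges in y
  # occ[i, j] is TRUE when bin (i, j) contains at least one point
  occ <- unclass(table(cut(x, xb, include.lowest = TRUE),
                       cut(y, yb, include.lowest = TRUE))) > 0
  best <- NULL; best_area <- 0
  for (i1 in 1:numbins) for (i2 in i1:numbins)
    for (j1 in 1:numbins) for (j2 in j1:numbins) {
      if (any(occ[i1:i2, j1:j2])) next          # rectangle holds a point
      area <- (xb[i2 + 1] - xb[i1]) * (yb[j2 + 1] - yb[j1])
      if (area > best_area) {
        best_area <- area
        best <- list(x = c(xb[i1], xb[i2 + 1]),
                     y = c(yb[j1], yb[j2 + 1]), area = area)
      }
    }
  best
}

# Points fill the left half of the unit square, so the right half is empty
set.seed(2)
r <- largest_empty_bins(runif(300, 0, 0.5), runif(300), c(0, 1), c(0, 1))
```

The result gives the x and y limits of the empty rectangle, which a caller could use to position a key.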

Usage

labcurve(curves, labels=names(curves),
         method=NULL, keys=NULL, keyloc=c("auto","none"),
         type="l", step.type=c("left", "right"),
         xmethod=if(any(type=="s")) "unique" else "grid",
         offset=NULL, xlim=NULL,
         tilt=FALSE, window=NULL, npts=100, cex=NULL,
         adj="auto", angle.adj.auto=30,
         lty=pr$lty, lwd=pr$lwd, col.=pr$col, transparent=TRUE,
         arrow.factor=1, point.inc=NULL, opts=NULL, key.opts=NULL,
         empty.method=c('area','maxdim'), numbins=25,
         pl=!missing(add), add=FALSE,
         ylim=NULL, xlab="", ylab="",
         whichLabel=1:length(curves),
         grid=FALSE, xrestrict=NULL, ...)

putKey(z, labels, type, pch, lty, lwd,
       cex=par('cex'), col=rep(par('col'),nc),
       transparent=TRUE, plot=TRUE, key.opts=NULL, grid=FALSE)

putKeyEmpty(x, y, labels, type=NULL,
            pch=NULL, lty=NULL, lwd=NULL,
            cex=par('cex'), col=rep(par('col'),nc),
            transparent=TRUE, plot=TRUE, key.opts=NULL,
            empty.method=c('area','maxdim'),
            numbins=25,
            xlim=pr$usr[1:2], ylim=pr$usr[3:4], grid=FALSE)

drawPlot(..., xlim=c(0,1), ylim=c(0,1), xlab='', ylab='',
         ticks=c('none','x','y','xy'),
         key=FALSE, opts=NULL)

# Points(label=' ', type=c('p','r'),
#        n, pch=pch.to.use[1], cex=par('cex'), col=par('col'),
#        rug = c('none','x','y','xy'), ymean)

# Curve(label=' ',
#       type=c('bezier','polygon','linear','pol','loess','step','gauss'),
#       n=NULL, lty=1, lwd=par('lwd'), col=par('col'), degree=2,
#       evaluation=100, ask=FALSE)

# Abline(...)

## S3 method for class 'drawPlot'
plot(x, xlab, ylab, ticks,
     key=x$key, keyloc=x$keyloc, ...)

largest.empty(x, y, width=0, height=0,
              numbins=25, method=c('exhaustive','rexhaustive','area','maxdim'),
              xlim=pr$usr[1:2], ylim=pr$usr[3:4],
              pl=FALSE, grid=FALSE)

Arguments

curves

a list of lists, each of which have at least two components: a vector of x values and a vector of corresponding y values. curves is mandatory except when method="mouse" or "locator", in which case labels is mandatory. Each list in curves may optionally have any of the parameters type, lty, lwd, or col for that curve, as defined below (see one of the last examples).

z

a two-element list specifying the coordinate of the center of the key, e.g. locator(1) to use the mouse for positioning

labels

For labcurve, a vector of character strings used to label curves (which may contain newline characters to stack labels vertically). The default labels are taken from the names of the curves list. Setting labels=FALSE will suppress drawing any labels (for labcurve only). For putKey and putKeyEmpty is a vector of character strings specifying group labels

x

see below

y

for putKeyEmpty and largest.empty, x and y are same-length vectors specifying points that have been plotted. x can also be an object created by drawPlot.

...

For drawPlot is a series of invocations of Points and Curve (see example). Any number of point groups can be defined in this way. For Abline these may be any arguments to abline. For labcurve, other parameters to pass to text.

width

see below

height

for largest.empty, specifies the minimum allowable width in x units and the minimum allowable height in y units

method

"offset" (the default) offsets labels at largest gaps between curves, and draws labels beside curves. "on top" draws labels on top of the curves (especially good when using keys). "arrow" draws arrows connecting labels to the curves. "mouse" or "locator" positions labels according to mouse clicks. If keys is specified and is an integer vector or is "lines", method defaults to "on top". If keys is character, method defaults to "offset". Set method="none" to suppress all curve labeling and key drawing, which is useful when pl=TRUE and you only need labcurve to draw the curves and the rest of the basic graph.

For largest.empty specifies the method used to find a rectangle that does not collide with any of the (x, y) points. The default method, 'exhaustive', uses a Fortran translation of an R function and algorithm developed by Hans Borchers. The same result, more slowly, may be obtained by using pure R code by specifying method='rexhaustive'. The original algorithms using binning (and the only methods supported for S-Plus) are still available. For all methods, screening of candidate rectangles having at least a given width in x-units of width or having at least a given height in y-units of height is possible. Use method="area" to use the binning method to find the rectangle having the largest area, or method="maxdim" to use the binning method to return the last rectangle searched that had both the largest width and largest height over all previous rectangles.

keys

This causes keys (symbols or short text) to be drawn on or beside curves, and if keyloc is not equal to "none", a legend to be automatically drawn. The legend links keys with full curve labels and optionally with colors and line types. Set keys to a vector of character strings, or a vector of integers specifying plotting character (pch values - see points). For the latter case, the default behavior is to plot the symbols periodically, at equally spaced x-coordinates.

keyloc

When keys is specified, keyloc specifies how the legend is to be positioned for drawing using the key function in trellis. The default is "auto", for which the largest.empty function is used to find the most empty part of the plot. If no empty rectangle large enough to hold the key is found, no key will be drawn. Specify keyloc="none" to suppress drawing a legend, or set keyloc to a 2-element list containing the x and y coordinates for the center of the legend. For example, use keyloc=locator(1) to click the mouse at the center. keyloc specifies the coordinates of the center of the key to be drawn with plot.drawPlot when key=TRUE.

type

for labcurve, a scalar or vector of character strings specifying the method that the points in the curves were connected. "l" means ordinary connections between points and "s" means step functions. For putKey and putKeyEmpty is a vector of plotting types, "l" for regular line, "p" for point, "b" for both point and line, and "n" for none. For Points is either "p" (the default) for regular points, or "r" for rugplot (one-dimensional scatter diagram to be drawn using the scat1d function). For Curve, type is "bezier" (the default) for drawing a smooth Bezier curve (which can represent a non-1-to-1 function such as a circle), "polygon" for ordinary line segments, "linear" for a straight line defined by two endpoints, "pol" for a degree-degree polynomial to be fitted to the mouse-clicked points, "step" for a left-step-function, "gauss" to plot a Gaussian density fitted to 3 clicked points, "loess" to use the lowess function to smooth the clicked points, or a function to draw a user-specified function, evaluated at evaluation points spanning the whole x-axis. For the density the user must click in the left tail, at the highest value (at the mean), and in the right tail, with the two tail values being approximately equidistant from the mean. The density is scaled to fit in the highest value regardless of its area.

step.type

type of step functions used (default is "left")

xmethod

method for generating the unique set of x-coordinates to examine (see above). Default is "grid" for type="l" or "unique" for type="s".

offset

distance in y-units between the center of the label and the line being labeled. Default is 0.75 times the height of an "m" that would be drawn in a label. For R grid/lattice you must specify offset using the grid unit function, e.g., offset=unit(2,"native") or offset=unit(.25,"cm") ("native" means data units)

xlim

limits for searching for label positions, and is also used to set up plots when pl=TRUE and add=FALSE. Default is total x-axis range for current plot (par("usr")[1:2]). For largest.empty, xlim limits the search for largest rectangles, but it has the same default as above. For pl=TRUE, add=FALSE you may want to extend xlim somewhat to allow large keys to fit, when using keyloc="auto". For drawPlot default is c(0,1). When using largest.empty with ggplot2, xlim and ylim are mandatory.

tilt

set to TRUE to tilt labels to follow the curves, for method="offset" when keys is not given.

window

width of a window, in x-units, to use in determining the local slope for tilting labels. Default is 0.5 times number of characters in the label times the x-width of an "m" in the current character size and font.

npts

number of points to use if xmethod="grid"

cex

character size to pass to text and key. Default is current par("cex"). For putKey, putKeyEmpty, and Points is the size of the plotting symbol.

adj

Default is "auto" which has labcurve figure justification automatically when method="offset". This will cause centering to be used when the local angle of the curve is less than angle.adj.auto in absolute value, left justification if the angle is larger and either the label is under a curve of positive slope or over a curve of negative slope, and right justification otherwise. For step functions, left justification is used when the label is above the curve and right justification otherwise. Set adj=.5 to center labels at computed coordinates. Set to 0 for left-justification, 1 for right. Set adj to a vector to vary adjustments over the curves.

angle.adj.auto

see adj. Does not apply to step functions.

lty

vector of line types which were used to draw the curves. This is only used when keys are drawn. If all of the line types, line widths, and line colors are the same, lines are not drawn in the key.

lwd

vector of line widths which were used to draw the curves. This is only used when keys are drawn. See lty also.

col.

vector of integer color numbers

col

vector of integer color numbers for use in curve labels, symbols, lines, and legends. Default is par("col") for all curves. See lty also.

transparent

Default is TRUE to make key draw transparent legends, i.e., to suppress drawing a solid rectangle background for the legend. Set to FALSE otherwise.

arrow.factor

factor by which to multiply default arrow lengths

point.inc

When keys is a vector of integers, point.inc specifies the x-increment between the point symbols that are overlaid periodically on the curves. By default, point.inc is equal to the range for the x-axis divided by 5.

opts

an optional list which can be used to specify any of the options to labcurve, with the usual element name abbreviations allowed. This is useful when labcurve is being called from another function. Example: opts=list(method="arrow", cex=.8, np=200). For drawPlot, a list of labcurve options to pass as labcurve(..., opts=).

key.opts

a list of extra arguments you wish to pass to key(), e.g., key.opts=list(background=1, between=3). The argument names must be spelled out in full.

empty.method

see below

numbins

These two arguments are passed to the largest.empty function's method and numbins arguments (see below). For largest.empty, numbins specifies the number of bins in which to discretize both the x and y directions for searching for rectangles. Default is 25.

pl

set to TRUE (or specify add) to cause the curves in curves to be drawn, under the control of the type, lty, lwd, col parameters defined either in the curves lists or in the separate arguments given to labcurve or through opts. For largest.empty, set pl=TRUE to show the rectangle the function found by drawing it with a solid color. May not be used under ggplot2.

add

By default, when curves are actually drawn by labcurve a new plot is started. To add to an existing plot, set add=TRUE.

ylim

When a plot has already been started, ylim defaults to par("usr")[3:4]. When pl=TRUE, ylim and xlim are determined from the ranges of the data. Specify ylim yourself to take control of the plot construction. In some cases it is advisable to make ylim larger than usual to allow for automatically-positioned keys. For largest.empty, ylim specifies the limits on the y-axis within which to search for a rectangle. Here ylim defaults to the same as above, i.e., the range of the y-axis of an open plot from par. For drawPlot the default is c(0,1).

xlab

see below

ylab

x-axis and y-axis labels when pl=TRUE and add=FALSE or for drawPlot. Defaults to "" unless the first curve has names for its first two elements, in which case the names of these elements are taken as xlab and ylab.

whichLabel

integer vector corresponding to curves specifying which curves are to be labelled or have a legend

grid

set to TRUE if the R grid package was used to draw the current plot. This prevents labcurve from using par("usr") etc. If using R grid you can pass coordinates and lengths having arbitrary units, as documented in the unit function. This is especially useful for offset.

xrestrict

When having labcurve label curves where they are most separated, you can restrict the search for this separation point to a range of the x-axis, specified as a 2-vector xrestrict. This is useful when one part of the curve is very steep. Even though steep regions may have maximum separation, the labels will collide when curves are steep.

pch

vector of plotting characters for putKey and putKeyEmpty. Can be any value including NA when only a line is used to identify the group. Is a single plotting character for Points, with the default being the next unused value from among 1, 2, 3, 4, 16, 17, 5, 6, 15, 18, 19.

plot

set to FALSE to keep putKey or putKeyEmpty from actually drawing the key. Instead, the size of the key will be returned by putKey, or the coordinates of the key by putKeyEmpty.

ticks

tells drawPlot for which axes to draw tick marks and tick labels. Default is "none".

key

for drawPlot and plot.drawPlot. Default is FALSE so that labcurve is used to label points or curves. Set to TRUE to use putKeyEmpty.

Details

The internal functions Points, Curve, Abline have unique arguments as follows.

label:

for Points and Curve is a single character string to label that group of points

n:

number of points to accept from the mouse. Default is to input points until a right mouse click.

rug:

for Points. Default is "none" to not show the marginal x or y distributions as rug plots, for the points entered. Other possibilities are used to execute scat1d to show the marginal distribution of x, y, or both as rug plots.

ymean:

for Points, subtracts a constant from each y-coordinate entered to make the overall mean ymean

degree:

degree of polynomial to fit to points by Curve

evaluation:

number of points at which to evaluate Bezier curves, polynomials, and other functions in Curve

ask:

set ask=TRUE to give the user the opportunity to try again at specifying points for Bezier curves, step functions, and polynomials

The labcurve function used some code from the function plot.multicurve written by Rod Tjoelker of The Boeing Company (tjoelker@espresso.rt.cs.boeing.com).

If there is only one curve, a label is placed at the middle x-value, and no fancy features such as angle or positive/negative offsets are used.

key is called once (with the argument plot=FALSE) to find the key dimensions. Then an empty rectangle with at least these dimensions is searched for using largest.empty. Then key is called again to draw the key there, using the argument corner=c(.5,.5) so that the center of the rectangle can be specified to key.

If you want to plot the data, an easier way to use labcurve is through xYplot as shown in some of its examples.

Value

labcurve returns an invisible list with components x, y, offset, adj, cex, col, and if tilt=TRUE, angle. offset is the amount to add to y to draw a label. offset is negative if the label is drawn below the line. adj is a vector containing the values 0, .5, 1.

largest.empty returns a list with elements x and y specifying the coordinates of the center of the rectangle which was found, and element rect containing the 4 x and y coordinates of the corners of the found empty rectangle. The area of the rectangle is also returned.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

See Also

approx, text, legend, scat1d, xYplot, abline

Examples

n <- 2:8
m <- length(n)
type <- c('l','l','l','l','s','l','l')
# s=step function l=ordinary line (polygon)
curves <- vector('list', m)

plot(0,1,xlim=c(0,1),ylim=c(-2.5,4),type='n')

set.seed(39)

for(i in 1:m) {
  x <- sort(runif(n[i]))
  y <- rnorm(n[i])
  lines(x, y, lty=i, type=type[i], col=i)
  curves[[i]] <- list(x=x,y=y)
}

labels <- paste('Label for',letters[1:m])
labcurve(curves, labels, tilt=TRUE, type=type, col=1:m)

# Put only single letters on curves at points of
# maximum space, and use key() to define the letters,
# with automatic positioning of the key in the most empty
# part of the plot
# Have labcurve do the plotting, leaving extra space for key

names(curves) <- labels
labcurve(curves, keys=letters[1:m], type=type, col=1:m,
         pl=TRUE, ylim=c(-2.5,4))

# Put plotting symbols at equally-spaced points,
# with a key for the symbols, ignoring line types

labcurve(curves, keys=1:m, lty=1, type=type, col=1:m,
         pl=TRUE, ylim=c(-2.5,4))

# Plot and label two curves, with line parameters specified with data
set.seed(191)
ages.f <- sort(rnorm(50,20,7))
ages.m <- sort(rnorm(40,19,7))
height.f <- pmin(ages.f,21)*.2+60
height.m <- pmin(ages.m,21)*.16+63
labcurve(list(Female=list(ages.f,height.f,col=2),
              Male  =list(ages.m,height.m,col=3,lty='dashed')),
         xlab='Age', ylab='Height', pl=TRUE)
# add ,keys=c('f','m') to label curves with single letters
# For S-Plus use lty=2

# Plot power for testing two proportions vs. n for various odds ratios,
# using 0.1 as the probability of the event in the control group.
# A separate curve is plotted for each odds ratio, and the curves are
# labeled at points of maximum separation

n  <- seq(10, 1000, by=10)
OR <- seq(.2,.9,by=.1)
pow <- lapply(OR, function(or,n)list(x=n,y=bpower(p1=.1,odds.ratio=or,n=n)),
              n=n)
names(pow) <- format(OR)
labcurve(pow, pl=TRUE, xlab='n', ylab='Power')

# Plot some random data and find the largest empty rectangle
# that is at least .1 wide and .1 tall

x <- runif(50)
y <- runif(50)
plot(x, y)
z <- largest.empty(x, y, .1, .1)
z
points(z,pch=3)  # mark center of rectangle, or
polygon(z$rect, col='blue')  # to draw the rectangle, or
#key(z$x, z$y, ... stuff for legend)

# Use the mouse to draw a series of points using one symbol, and
# two smooth curves or straight lines (if two points are clicked),
# none of these being labeled

# d <- drawPlot(Points(), Curve(), Curve())
# plot(d)

## Not run:
# Use the mouse to draw a Gaussian density, two series of points
# using 2 symbols, one Bezier curve, a step function, and raw data
# along the x-axis as a 1-d scatter plot (rug plot).  Draw a key.
# The density function is fit to 3 mouse clicks
# Abline draws a dotted horizontal reference line

d <- drawPlot(Curve('Normal',type='gauss'),
              Points('female'), Points('male'),
              Curve('smooth',ask=TRUE,lty=2), Curve('step',type='s',lty=3),
              Points(type='r'), Abline(h=.5, lty=2),
              xlab='X', ylab='y', xlim=c(0,100), key=TRUE)
plot(d, ylab='Y')
plot(d, key=FALSE)  # label groups using labcurve

## End(Not run)

Label Attribute of an Object

Description

label(x) retrieves the label attribute of x. label(x) <- "a label" stores the label attribute, and also puts the class labelled as the first class of x (for S-Plus this class is not used and methods for handling this class are not defined so the "label" and "units" attributes are lost upon subsetting). The reason for having this class is so that the subscripting method for labelled, [.labelled, can preserve the label attribute in S. Also, the print method for labelled objects prefaces the print with the object's label (and units if there). If the variable is also given a "units" attribute using the units function, subsetting the variable (using [.labelled) will also retain the "units" attribute.

label can optionally append a "units" attribute to the string, and it can optionally return a string or expression (for R's plotmath facility) suitable for plotting. labelPlotmath is a function that does the same thing when the input arguments are the 'label' and 'units' rather than a vector having those attributes. When plotmath mode is used to construct labels, the 'label' or 'units' may contain math expressions but they are typed verbatim if they contain percent signs, blanks, or underscores. labelPlotmath can optionally create the expression as a character string, which is useful in building ggplot commands.

For Surv objects, label first looks to see if there is an overall "label" attribute for the object, then it looks for saved attributes that Surv put in the "inputAttributes" object, looking first at the event variable, then time2, and finally time. You can restrict the search by specifying type.

labelLatex constructs suitable LaTeX labels from a variable or from the label and units arguments, optionally right-justifying units if hfill=TRUE. This is useful when making tables when the variable in question is not a column heading. If x is specified, label and units values are extracted from its attributes instead of from the other arguments.

Label (actually Label.data.frame) is a function which generates S source code that makes the labels in all the variables in a data frame easy to edit.

llist is like list except that it preserves the names or labels of the component variables in the variables' label attribute. This can be useful when looping over variables or using sapply or lapply. By using llist instead of list one can annotate the output with the current variable's name or label. llist also defines a names attribute for the list and pulls the names from the arguments' expressions for non-named arguments.

prList prints a list with element names (without the dollar sign as in default list printing) and if an element of the list is an unclassed list with a name, all of those elements are printed, with titles of the form "primary list name : inner list name". This is especially useful for Rmarkdown html notebooks when a user-written function creates multiple html and graphical outputs to all be printed in a code chunk. Optionally the names can be printed after the object, and the htmlfig option provides more capabilities when making html reports. prList does not work for regular html documents.

putHfig is similar to prList but for a single graphical object that is rendered with a print method, making it easy to specify long captions, and short captions for the table of contents in HTML documents. Table of contents entries are generated with the short caption, which is taken as the long caption if there is none. One can optionally not make a table of contents entry. If argument table=TRUE, table captions will be produced instead. When expcoll is specified, the markupSpecs html function expcoll will be used to make tables expand upon clicking an arrow rather than always appear.

putHcap is like putHfig except that it assumes that users render the graphics or table outside of the putHcap call. This allows things to work in ordinary html documents. putHcap does not handle collapsed text.

plotmathTranslate is a simple function that translates certain character strings to character strings that can be used as part of R plotmath expressions. If the input string has a space or percent inside, the string is surrounded by a call to plotmath's paste function.

as.data.frame.labelled is a utility function that is called by [.data.frame. It is just a copy of as.data.frame.vector. data.frame.labelled is another utility function, that adds a class "labelled" to every variable in a data frame that has a "label" attribute but not a "labelled" class.

relevel.labelled is a method for preserving labels with the relevel function.

reLabelled is used to add a 'labelled' class back to variables in a data frame that have a 'label' attribute but no 'labelled' class. Useful for changing cleanup.import()'d S-Plus data frames back to general form for R and old versions of S-Plus.

Usage

label(x, default=NULL, ...)

## Default S3 method:
label(x, default=NULL, units=plot, plot=FALSE,
      grid=FALSE, html=FALSE, ...)

## S3 method for class 'Surv'
label(x, default=NULL, units=plot, plot=FALSE,
      grid=FALSE, html=FALSE, type=c('any', 'time', 'event'), ...)

## S3 method for class 'data.frame'
label(x, default=NULL, self=FALSE, ...)

label(x, ...) <- value

## Default S3 replacement method:
label(x, ...) <- value

## S3 replacement method for class 'data.frame'
label(x, self=TRUE, ...) <- value

labelPlotmath(label, units=NULL, plotmath=TRUE, html=FALSE, grid=FALSE,
              chexpr=FALSE)

labelLatex(x=NULL, label='', units='', size='smaller[2]',
           hfill=FALSE, bold=FALSE, default='', double=FALSE)

## S3 method for class 'labelled'
print(x, ...)   ## or x - calls print.labelled

Label(object, ...)

## S3 method for class 'data.frame'
Label(object, file='', append=FALSE, ...)

llist(..., labels=TRUE)

prList(x, lcap=NULL, htmlfig=0, after=FALSE)

putHfig(x, ..., scap=NULL, extra=NULL, subsub=TRUE, hr=TRUE,
        table=FALSE, file='', append=FALSE, expcoll=NULL)

putHcap(..., scap=NULL, extra=NULL, subsub=TRUE, hr=TRUE,
        table=FALSE, file='', append=FALSE)

plotmathTranslate(x)

data.frame.labelled(object)

## S3 method for class 'labelled'
relevel(x, ...)

reLabelled(object)

combineLabels(...)

Arguments

x

any object (for plotmathTranslate is a character string). For relevel is a factor variable. For prList is a named list. For putHfig is a graphical object for which a print method will render the graphic (e.g., a ggplot2 or plotly object).

self

logical; whether to interact with the object itself or with its components

units

set to TRUE to append the 'units' attribute (if present) to the returned label. The 'units' are surrounded by brackets. For labelPlotmath and labelLatex is a character string containing the units of measurement. When plot is TRUE, units defaults to TRUE.

plot

set to TRUE to return a label suitable for R's plotmath facility (returns an expression instead of a character string) if R is in effect. If units is also TRUE, and if both 'label' and 'units' attributes are present, the 'units' will appear after the label but in smaller type and will not be surrounded by brackets.

default

if x does not have a 'label' attribute and default (a character string) is specified, the label will be taken as default. For labelLatex the default is the name of the first argument if it is a variable and not a label.

grid

Currently R's lattice and grid functions do not support plotmath expressions for xlab and ylab arguments. When using lattice functions in R, set the argument grid to TRUE so that labelPlotmath can return an ordinary character string instead of an expression.

html

set to TRUE to use HTML formatting instead of plotmath expressions for constructing labels with units

type

for Surv objects specifies the type of element for which to restrict the search for a label

label

a character string containing a variable's label

plotmath

set to TRUE to have labelPlotmath return an expression for plotting using R's plotmath facility. If R is not in effect, an ordinary character string is returned.

chexpr

set to TRUE to have labelPlotmath return a character string of the form "expression(...)"

size

LaTeX size for units. Default is two sizes smaller than label, which assumes that the LaTeX relsize package is in use.

hfill

set to TRUE to right-justify units in the field. This is useful when multiple labels are being put into rows in a LaTeX tabular environment, and will cause a problem if the label is used in an environment where hfill is not appropriate.

bold

set to TRUE to have labelLatex put the label in bold face.

double

set to TRUE to represent backslash in LaTeX as four backslashes in place of two. This is needed if, for example, you need to convert the result using as.formula

value

the label of the object, or "".

object

a data frame

...

a list of variables or expressions to be formed into a list. Ignored for print.labelled. For relevel is the level (a single character string) to become the new reference (first) category. For putHfig and putHcap represents one or more character strings that are pasted together, separated by a blank.

file

the name of a file to which to write S source code. Default is "", meaning standard output. For putHcap, set file to FALSE to return a character vector instead of writing to file.

append

set to TRUE to append code generated by Label to file file. Also used for putHfig, putHcap.

labels

set to FALSE to make llist ignore the variables' label attribute and use the variables' names.

lcap

an optional vector of character strings corresponding to elements in x for prList. These contain long captions that do not appear in the table of contents but which are printed right after the short caption in the body, in the same font.

htmlfig

for prList set to 1 to use HTML markup by running the object names through markupSpecs$html$cap for figure captions. Set htmlfig=2 to also preface the figure caption with "### " so that it will appear in the table of contents.

after

set to TRUE to have prList put names after the printed object instead of before

scap

a character string specifying the short (or possibly only)caption.

extra

an optional vector of character strings. When present the long caption will be put in the first column of an HTML table and the elements of extra in subsequent columns. This allows extra information to appear in the long caption in a way that is right-justified to the right of the flowing caption text.

subsub

set to FALSE to suppress "### " from being placed in front of the short caption. Set it to a different character string to use that instead. Set it to "" to ignore short captions entirely. For example, to use second-level headings for the table of contents specify subsub="## ".

hr

applies if a caption is present. Specify FALSE to not put a horizontal line before the caption and figure.

table

set to TRUE to produce table captions instead of figure captions

expcoll

character string to be visible, with a clickable arrow following to allow initial hiding of a table and its captions. Cannot be used with table=FALSE.

Value

label returns the label attribute of x, if any; otherwise, "". label is used most often for the individual variables in data frames. The function sas.get copies labels over from SAS if they exist.

See Also

sas.get, describe, extractlabs, hlab

Examples

age <- c(21,65,43)
y   <- 1:3
label(age) <- "Age in Years"
plot(age, y, xlab=label(age))

data <- data.frame(age=age, y=y)
label(data)

label(data, self=TRUE) <- "A data frame"
label(data, self=TRUE)

x1 <- 1:10
x2 <- 10:1
label(x2) <- 'Label for x2'
units(x2) <- 'mmHg'
x2
x2[1:5]
dframe <- data.frame(x1, x2)
Label(dframe)

labelLatex(x2, hfill=TRUE, bold=TRUE)
labelLatex(label='Velocity', units='m/s')

##In these examples of llist, note that labels are printed after
##variable names, because of print.labelled
a <- 1:3
b <- 4:6
label(b) <- 'B Label'
llist(a,b)
llist(a,b,d=0)
llist(a,b,0)

w <- llist(a, b>5, d=101:103)
sapply(w, function(x){
  hist(as.numeric(x), xlab=label(x))
  # locator(1)   ## wait for mouse click
})

# Or: for(u in w) {hist(u); title(label(u))}
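
The following short sketch illustrates three behaviours described in this entry: retention of the label and units attributes upon subsetting, the units=TRUE option, and the default argument. It assumes that Hmisc is installed; the variable names sbp, sub, and z are made up for illustration.

```r
library(Hmisc)

sbp <- c(120, 135, 118, 142)
label(sbp) <- "Systolic blood pressure"
units(sbp) <- "mmHg"

# Subsetting dispatches to `[.labelled`, so both attributes are retained
sub <- sbp[1:2]
label(sub)
units(sub)

# units=TRUE appends the units (in brackets) to the returned label
with_units <- label(sbp, units=TRUE)
with_units

# default= supplies a fallback when the object has no label attribute
z <- 1:3
fallback <- label(z, default="no label yet")
fallback
```

The same pattern applies to the variables of a data frame, where label is used most often.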

latestFile

Description

Find File With Latest Modification Time

Usage

latestFile(pattern, path = ".", verbose = TRUE)

Arguments

pattern

a regular expression; see base::list.files()

path

full path, defaulting to current working directory

verbose

set to FALSE to not report on total number of matching files

Details

Subject to matching on pattern, finds the last modified file, and if verbose is TRUE reports on how many total files matched pattern.

Value

the name of the last modified file

Author(s)

Frank Harrell

See Also

base::list.files()
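
The idea behind latestFile can be sketched with base R alone. The helper latest_by_mtime below is hypothetical and is not the Hmisc implementation; it only shows one plausible way to pick the most recently modified file among those matching a pattern, using list.files and file.info.

```r
# Hypothetical helper (not part of Hmisc): return the matching file
# with the most recent modification time, or character(0) if none match
latest_by_mtime <- function(pattern, path = ".") {
  f <- list.files(path, pattern = pattern, full.names = TRUE)
  if (!length(f)) return(character(0))
  f[which.max(as.numeric(file.info(f)$mtime))]
}

# Demonstration in a temporary directory
d <- tempfile("latestdemo")
dir.create(d)
old <- file.path(d, "run1.log"); writeLines("a", old)
new <- file.path(d, "run2.log"); writeLines("b", new)
Sys.setFileTime(old, Sys.time() - 3600)   # make run1.log one hour older

res <- latest_by_mtime("\\.log$", d)
basename(res)
```

With Hmisc loaded, the corresponding call would be latestFile("\\.log$", path=d).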


Convert an S object to LaTeX, and Related Utilities

Description

latex converts its argument to a ‘.tex’ file appropriate for inclusion in a LaTeX2e document. latex is a generic function that calls one of latex.default, latex.function, latex.list.

latex.default does appropriate rounding and decimal alignment and produces a file containing a LaTeX tabular environment to print the matrix or data.frame x as a table.

latex.function prepares an S function for printing by issuing sed commands that are similar to those in the S.to.latex procedure in the s.to.latex package (Chambers and Hastie, 1993). latex.function can also produce verbatim output or output that works with the Sweavel LaTeX style.

latex.list calls latex recursively for each element in the argument.

latexTranslate translates particular items in character strings to LaTeX format, e.g., converts ‘a^2’ to ‘a$^2$’ for superscripts within variable labels. LaTeX names of greek letters (e.g., "alpha") will have backslashes added if greek==TRUE. Math mode is inserted as needed. latexTranslate assumes that input text always has matching delimiters, e.g. [) [] (] (), and that surrounding by ‘$$’ is OK.

htmlTranslate is similar to latexTranslate but for html translation. It doesn't need math mode and assumes dollar signs are just that.

latexSN converts a vector of floating point numbers to character strings using LaTeX exponents. Dollar signs to enter math mode are not added. Similarly, htmlSN converts to scientific notation in html.
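
As a minimal sketch of the translation helpers described above, assuming Hmisc is installed, the calls below translate a superscript and a Greek letter name and convert numbers to scientific notation. The exact strings returned may differ between Hmisc versions, so the results are printed rather than stated here.

```r
library(Hmisc)

# Superscripts, and Greek letter names when greek=TRUE
tr <- latexTranslate(c("a^2", "alpha"), greek=TRUE)
print(tr)

# Scientific notation for LaTeX (math-mode dollar signs are not added)
sn_tex <- latexSN(c(1.2e-10, 3.5e10))
print(sn_tex)

# The html counterpart
sn_html <- htmlSN(1.2e-10)
print(sn_html)
```

Because latexSN does not add dollar signs, the caller is responsible for placing its result inside math mode.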

latexVerbatim on an object executes the object's print method, capturing the output for a file inside a LaTeX verbatim environment.

dvi uses the system latex command to compile LaTeX code produced by latex, including any needed styles. dvi will put a ‘\documentclass{report}’ and ‘\end{document}’ wrapper around a file produced by latex. By default, the ‘geometry’ LaTeX package is used to omit all margins and to set the paper size to a default of 5.5in wide by 7in tall. The result of dvi is a .dvi file. To both format and screen display a non-default size, use for example print(dvi(latex(x), width=3, height=4),width=3,height=4). Note that you can use something like ‘xdvi -geometry 460x650 -margins 2.25in file’ without changing LaTeX defaults to emulate this.

dvips will use the system dvips command to print the .dvi file to the default system printer, or create a postscript file if file is specified.

dvigv uses the system latex command to convert the input object to a .dvi file, and uses the system dvips command to convert it to postscript. Then the postscript file is displayed using Ghostview (assumed to be the system command gv).

There are show methods for displaying typeset LaTeX on the screen using the system xdvi command. If you show a LaTeX file created by latex without running it through dvi using show.dvi(object), the show method will run it through dvi automatically. These show methods are not S Version 4 methods so you have to use full names such as show.dvi and show.latex. Use the print methods for more automatic display of typesetting, e.g. typing latex(x) will invoke xdvi to view the typeset document.

Usage

latex(object, ...)

## Default S3 method:
latex(object,
    title=first.word(deparse(substitute(object))),
    file=paste(title, ".tex", sep=""),
    append=FALSE, label=title,
    rowlabel=title, rowlabel.just="l",
    cgroup=NULL, n.cgroup=NULL,
    rgroup=NULL, n.rgroup=NULL,
    cgroupTexCmd="bfseries",
    rgroupTexCmd="bfseries",
    rownamesTexCmd=NULL,
    colnamesTexCmd=NULL,
    cellTexCmds=NULL,
    rowname, cgroup.just=rep("c",length(n.cgroup)),
    colheads=NULL,
    extracolheads=NULL, extracolsize='scriptsize',
    dcolumn=FALSE, numeric.dollar=!dcolumn, cdot=FALSE,
    longtable=FALSE, draft.longtable=TRUE, ctable=FALSE, booktabs=FALSE,
    table.env=TRUE, here=FALSE, lines.page=40,
    caption=NULL, caption.lot=NULL, caption.loc=c('top','bottom'),
    star=FALSE,
    double.slash=FALSE,
    vbar=FALSE, collabel.just=rep("c",nc), na.blank=TRUE,
    insert.bottom=NULL, insert.bottom.width=NULL,
    insert.top=NULL,
    first.hline.double=!(booktabs | ctable),
    where='!tbp', size=NULL,
    center=c('center','centering','centerline','none'),
    landscape=FALSE,
    multicol=TRUE,
    math.row.names=FALSE, already.math.row.names=FALSE,
    math.col.names=FALSE, already.math.col.names=FALSE,
    hyperref=NULL, continued='continued',
    ...) # x is a matrix or data.frame

## S3 method for class 'function'
latex(object,
    title=first.word(deparse(substitute(object))),
    file=paste(title, ".tex", sep=""),
    append=FALSE,
    assignment=TRUE, type=c('example','verbatim','Sinput'),
    width.cutoff=70, size='', ...)

## S3 method for class 'list'
latex(object,
    title=first.word(deparse(substitute(object))),
    file=paste(title, ".tex", sep=""),
    append=FALSE,
    label,
    caption,
    caption.lot,
    caption.loc=c('top','bottom'),
    ...)

## S3 method for class 'latex'
print(x, ...)

latexTranslate(object, inn=NULL, out=NULL, pb=FALSE, greek=FALSE, na='',
               ...)

htmlTranslate(object, inn=NULL, out=NULL, greek=FALSE, na='',
              code=htmlSpecialType(), ...)

latexSN(x)

htmlSN(x, pretty=TRUE, ...)

latexVerbatim(x, title=first.word(deparse(substitute(x))),
    file=paste(title, ".tex", sep=""),
    append=FALSE, size=NULL, hspace=NULL,
    width=.Options$width, length=.Options$length, ...)

dvi(object, ...)

## S3 method for class 'latex'
dvi(object, prlog=FALSE, nomargins=TRUE, width=5.5, height=7, ...)

## S3 method for class 'dvi'
print(x, ...)

dvips(object, ...)

## S3 method for class 'latex'
dvips(object, ...)

## S3 method for class 'dvi'
dvips(object, file, ...)

## S3 method for class 'latex'
show(object)  # or show.dvi(object) or just object

dvigv(object, ...)

## S3 method for class 'latex'
dvigv(object, ...)       # or gvdvi(dvi(object))

## S3 method for class 'dvi'
dvigv(object, ...)

Arguments

object

For latex, any S object. For dvi or dvigv, an object created by latex. For latexTranslate is a vector of character strings to translate. Any NAs are set to blank strings before conversion.

x

any object to be printed verbatim for latexVerbatim. For latexSN or htmlSN, x is a numeric vector.

title

name of file to create without the ‘.tex’ extension. If this option is not set, value/string of x (see above) is printed in the top left corner of the table. Set title='' to suppress this output.

file

name of the file to create. The default file name is ‘x.tex’ where x is the first word in the name of the argument for x. Set file="" to have the generated LaTeX code just printed to standard output. This is especially useful when running under Sweave in R using its ‘results=tex’ tag, to save having to manage many small external files. When file="", latex keeps track of LaTeX styles that are called for by creating or modifying an object latexStyles (in .GlobalTemp in R or in frame 0 in S-Plus). latexStyles is a vector containing the base names of all the unique LaTeX styles called for so far in the current session. See the end of the examples section for a way to use this object to good effect. For dvips, file is the name of an output postscript file.

append

defaults to FALSE. Set to TRUE to append output to an existing file.

label

a text string representing a symbolic label for the table for referencing in the LaTeX ‘\label’ and ‘\ref’ commands. label is only used if caption is given.

rowlabel

If x has row dimnames, rowlabel is a character string containing the column heading for the row dimnames. The default is the name of the argument for x.

rowlabel.just

If x has row dimnames, specifies the justification for printing them. Possible values are "l", "r", "c". The heading (rowlabel) itself is left justified if rowlabel.just="l", otherwise it is centered.

cgroup

a vector of character strings defining major column headings. The default is to have none.

n.cgroup

a vector containing the number of columns for which each element in cgroup is a heading. For example, specify cgroup=c("Major 1","Major 2"), n.cgroup=c(3,3) if "Major 1" is to span columns 1-3 and "Major 2" is to span columns 4-6. rowlabel does not count in the column numbers. You can omit n.cgroup if all groups have the same number of columns.

rgroup

a vector of character strings containing headings for row groups. n.rgroup must be present when rgroup is given. The first n.rgroup[1] rows are sectioned off and rgroup[1] is used as a bold heading for them. The usual row dimnames (which must be present if rgroup is) are indented. The next n.rgroup[2] rows are treated likewise, etc.

n.rgroup

integer vector giving the number of rows in each grouping. If rgroup is not specified, n.rgroup is just used to divide off blocks of rows by horizontal lines. If rgroup is given but n.rgroup is omitted, n.rgroup will default so that each row group contains the same number of rows.
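
The column-group and row-group arguments can be combined as in the following sketch, which assumes Hmisc is installed. The matrix x and the caption and label strings are made up for illustration; file="" sends the generated LaTeX to standard output, which is captured here so that it can be inspected.

```r
library(Hmisc)

# A 4 x 6 matrix with row and column names (row names are required by rgroup)
x <- matrix(1:24, nrow=4,
            dimnames=list(paste0("row", 1:4), paste0("c", 1:6)))

# Two major column headings spanning 3 columns each,
# and two row groups of 2 rows each
out <- capture.output(
  w <- latex(x, file="",
             cgroup=c("Major 1", "Major 2"), n.cgroup=c(3, 3),
             rgroup=c("Group A", "Group B"), n.rgroup=c(2, 2),
             caption="Grouped table", label="tab:grouped")
)
cat(out, sep="\n")
```

Dropping capture.output and calling latex(x, file="", ...) directly is the usual form inside a Sweave or knitr chunk whose results are passed through as LaTeX.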

cgroupTexCmd

A character string specifying a LaTeX command to be used to format column group labels. The default, "bfseries", sets the current font to ‘bold’. It is possible to supply a vector of strings so that each column group label is formatted differently. Please note that the first item of the vector is used to format the title (even if a title is not used). Currently the user needs to handle these issues. Multiple effects can be achieved by creating custom LaTeX commands; for example, "\providecommand{\redscshape}{\color{red}\scshape}" creates a LaTeX command called ‘\redscshape’ that formats the text in red small-caps.

rgroupTexCmd

A character string specifying a LaTeX command to be used to format row group labels. The default, "bfseries", sets the current font to ‘bold’. A vector of strings can be supplied to format each row group label differently. Normal recycling applies if the vector is shorter than n.rgroup. See also cgroupTexCmd above regarding multiple effects.

rownamesTexCmd

A character string specifying a LaTeX command to be used to format rownames. The default, NULL, applies no command. A vector of different commands can also be supplied. See also cgroupTexCmd above regarding multiple effects.

colnamesTexCmd

A character string specifying a LaTeX command to be used to format column labels. The default, NULL, applies no command. It is possible to supply a vector of strings to format each column label differently. If column groups are not used, the first item in the vector will be used to format the title. Please note that if column groups are used the first item of cgroupTexCmd and not colnamesTexCmd is used to format the title. The user needs to allow for these issues when supplying a vector of commands. See also cgroupTexCmd above regarding multiple effects.

cellTexCmds

A matrix of character strings which are LaTeX commands to be used to format each element, or cell, of the object. The matrix must have the same NROW() and NCOL() as the object. The default, NULL, applies no formats. Empty strings also apply no formats, and one way to start might be to create a matrix of empty strings with matrix(rep("", NROW(x) * NCOL(x)), nrow=NROW(x)) and then selectively change appropriate elements of the matrix. Note that you might need to set numeric.dollar=FALSE (to disable math mode) for some effects to work. See also cgroupTexCmd above regarding multiple effects.

na.blank

Set to TRUE to use blanks rather than NA for missing values. This usually looks better in latex.

insert.bottom

an optional character string to typeset at the bottom of the table. For "ctable" style tables, this is placed in an unmarked footnote.

insert.bottom.width

character string; a TeX width controlling the width of the insert.bottom text. Currently this has an effect only when using longtable=TRUE.

insert.top

a character string to insert as a heading right before beginning the tabular environment. Useful for multiple sub-tables.

first.hline.double

set to FALSE to use single horizontal rules for styles other than "booktabs" or "ctable"

rowname

rownames for the tabular environment. Default is rownames of matrix or data.frame. Specify rowname=NULL to suppress the use of row names.

cgroup.just

justification for labels for column groups. Defaults to "c".

colheads

a character vector of column headings if you don't want to use dimnames(object)[[2]]. Specify colheads=FALSE to suppress column headings.

extracolheads

an optional vector of extra column headings that will appear under the main headings (e.g., sample sizes). This character vector does not need to include an empty space for any rowname in effect, as this will be added automatically. You can also form subheadings by splitting character strings defining the column headings using the usual \n newline character.

extracolsize

size for extracolheads or for any second lines in column names; default is "scriptsize"

dcolumn

see format.df

numeric.dollar

logical, default !dcolumn. Set to TRUE to place dollar signs around numeric values when dcolumn=FALSE. This assures that latex will use minus signs rather than hyphens to indicate negative numbers. Set to FALSE when dcolumn=TRUE, as dcolumn.sty automatically uses minus signs.

math.row.names

logical, set true to place dollar signs around the row names.

already.math.row.names

set to TRUE to prevent any math mode changes to row names

math.col.names

logical, set true to place dollar signs around the column names.

already.math.col.names

set to TRUE to prevent any math mode changes to column names

hyperref

if table.env=TRUE, a character string used to generate a LaTeX hyperref enclosure

continued

a character string used to indicate pages after the first when making a long table

cdot

see format.df

longtable

Set to TRUE to use David Carlisle's LaTeX longtable style, allowing long tables to be split over multiple pages with headers repeated on each page. The "style" element is set to "longtable". The latex ‘\usepackage’ must reference ‘[longtable]’. The file ‘longtable.sty’ will need to be in a directory in your TEXINPUTS path.

draft.longtable

I forgot what this does.

ctable

set to TRUE to use Wybo Dekker's ‘ctable’ style from CTAN. Even though for historical reasons it is not the default, it is generally the preferred method. Thicker but not doubled ‘\hline’s are used to start a table when ctable is in effect.

booktabs

set booktabs=TRUE to use the ‘booktabs’ style of horizontal rules for better tables. In this case, double ‘\hline’s are not used to start a table.

table.env

Set table.env=FALSE to suppress enclosing the table in a LaTeX ‘table’ environment. table.env only applies when longtable=FALSE. You may not specify a caption if table.env=FALSE.

here

Set to TRUE if you are using table.env=TRUE with longtable=FALSE and you have installed David Carlisle's ‘here.sty’ LaTeX style. This will cause the LaTeX ‘table’ environment to be set up with option ‘H’ to guarantee that the table will appear exactly where you think it will in the text. The "style" element is set to "here". The latex ‘\usepackage’ must reference ‘[here]’. The file ‘here.sty’ will need to be in a directory in your TEXINPUTS path. ‘here’ is largely obsolete with LaTeX2e.

lines.page

Applies if longtable=TRUE. No more than lines.page lines in the body of a table will be placed on a single page. Page breaks will only occur at rgroup boundaries.

caption

a text string to use as a caption to print at the top of the first page of the table. Default is no caption.

caption.lot

a text string representing a short caption to be used in the “List of Tables”. By default, LaTeX will use caption. If you get inexplicable ‘latex’ errors, you may need to supply caption.lot to make the errors go away.

caption.loc

set to "bottom" to position a caption below the table instead of the default of "top".

star

apply the star option for ctables to allow a table to spread over two columns when in twocolumn mode.

double.slash

set to TRUE to output ‘"\"’ as ‘"\\"’ in LaTeX commands. Useful when you are reading the output file back into an S vector for later output.

vbar

logical. When vbar==TRUE, columns in the tabular environment are separated with vertical bar characters. When vbar==FALSE, columns are separated with white space. The default, vbar==FALSE, produces tables consistent with the style sheet for the Journal of the American Statistical Association.

collabel.just

justification for column labels.

assignment

logical. When TRUE, the default, the name of the function and the assignment arrow are printed to the file.

where

specifies placement of floats if a table environment is used. Default is "!tbp". To allow tables to appear in the middle of a page of text you might specify where="!htbp" to latex.default.

size

size of table text if a size change is needed (default is no change). For example you might specify size="small" to use LaTeX font size “small”. For latex.function, size is a character string that will be appended to "Sinput", such as "small".

center

default is "center" to enclose the table in a ‘center’ environment. Use center="centering" or "centerline" to instead use LaTeX ‘centering’ or centerline directives, or center="none" to use no centering. centerline can be useful when objects besides a tabular are enclosed in a single table environment. This option was implemented by Markus Jäntti (markus.jantti@iki.fi) of Abo Akademi University.

landscape

set to TRUE to enclose the table in a ‘landscape’ environment. When ctable is TRUE, will use the rotate argument to ctable.

type

The default uses the S alltt environment for latex.function. Set type="verbatim" to instead use the LaTeX ‘verbatim’ environment. Use type="Sinput" if using Sweave, especially if you have customized the Sinput environment, for example using the Sweavel style which uses the listings LaTeX package.

width.cutoff

width of function text output in columns; see deparse

...

other arguments are accepted and ignored except that latex passes arguments to format.df (e.g., col.just and other formatting options like dec, rdec, and cdec). For latexVerbatim these arguments are passed to the print function. Ignored for latexTranslate and htmlTranslate. For htmlSN, these arguments are passed to prettyNum or format.

inn,out

specify additional input and translated strings over the usual defaults

pb

If pb=TRUE, latexTranslate also translates ‘[()]’ to math mode using ‘\left, \right’.

greek

set to TRUE to have latexTranslate put names for greek letters in math mode and add backslashes. For htmlTranslate, translates greek letters to corresponding html characters, ignoring "modes".

na

single character string to translate NA values to for latexTranslate and htmlTranslate

code

set to 'unicode' to use HTML unicode characters or '&' to use the ampersand pound number format

pretty

set to FALSE to have htmlSN use format instead of prettyNum

hspace

horizontal space, e.g., extra left margin for verbatim text. Default is none. Use e.g. hspace="10ex" to add 10 extra spaces to the left of the text.

length

for S-Plus only; is the length of the output page for printing and capturing verbatim text

width,height

are the options( ) to have in effect only for when print is executed. Defaults are current options. For dvi these specify the paper width and height in inches if nomargins=TRUE, with defaults of 5.5 and 7, respectively.

prlog

set to TRUE to have dvi print, to the S-Plus session, the LaTeX .log file.

multicol

set to FALSE to not use ‘\multicolumn’ in header of table

nomargins

set to FALSE to use default LaTeX margins when making the .dvi file

Details

latex.default optionally outputs a LaTeX comment containing the calling statement. To output this comment, run options(omitlatexcom=FALSE) before running. The default behavior of suppressing the comment is helpful when running RMarkdown to produce pdf output using LaTeX, as this uses pandoc, which is fooled into trying to escape the percent comment symbol.

If running under Windows and using MikTeX, latex and yap must be in your system path, and yap is used to browse ‘.dvi’ files created by latex. You should install the ‘geometry.sty’ and ‘ctable.sty’ styles in MikTeX to make optimum use of latex().

On Mac OS X, you may have to append the ‘/usr/texbin’ directory to the system path. Thanks to Kevin Thorpe (kevin.thorpe@utoronto.ca), one way to set up Mac OS X is to install ‘X11’ and ‘X11SDK’ if not already installed, start ‘X11’ within the R GUI, and issue the command Sys.setenv( PATH=paste(Sys.getenv("PATH"),"/usr/texbin",sep=":") ). To avoid any complications of using ‘X11’ under MacOS, users can install the ‘TeXShop’ package, which will associate ‘.dvi’ files with a viewer that displays a ‘pdf’ version of the file after a hidden conversion from ‘dvi’ to ‘pdf’.

System options can be used to specify external commands to be used. Defaults are given by options(xdvicmd='xdvi') or options(xdvicmd='yap'), options(dvipscmd='dvips'), options(latexcmd='latex'). For MacOS specify options(xdvicmd='MacdviX') or, if TeXShop is installed, options(xdvicmd='open').

To use ‘pdflatex’ rather than ‘latex’, set options(latexcmd='pdflatex'), options(dviExtension='pdf'), and set options('xdvicmd') to your chosen PDF previewer.
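The three settings above can be sketched together in one call. This is a minimal illustration, not a required setup; 'open' is an assumed viewer command (it suits macOS) and should be replaced by whatever PDF previewer is on your system.

```r
# Sketch: route latex()/dvi() through pdflatex instead of latex.
# 'open' is an assumption; substitute your own PDF previewer command.
old <- options(latexcmd     = 'pdflatex',
               dviExtension = 'pdf',
               xdvicmd      = 'open')

# Inspect the three options that the latex functions consult
str(options()[c('latexcmd', 'dviExtension', 'xdvicmd')])

# options(old) restores the previous values when you are done
```

Saving the value returned by options() makes it easy to undo the change at the end of a session.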

If running S-Plus and your directory for temporary files is not ‘/tmp’ (Unix/Linux) or ‘\windows\temp’ (Windows), add your own tempdir function such as tempdir <- function() "/yourmaindirectory/yoursubdirectory"

To prevent the latex file from being displayed, store the result of latex in an object, e.g. w <- latex(object, file='foo.tex').

Value

latex and dvi return a list of class latex or dvi containing character string elements file and style. file contains the name of the generated file, and style is a vector (possibly empty) of styles to be included using the LaTeX2e ‘\usepackage’ command.

latexTranslate returns a vector of character strings

Side Effects

creates various system files and runs various Linux/UNIX system commands which are assumed to be in the system path.

Author(s)

Frank E. Harrell, Jr.,
Department of Biostatistics,
Vanderbilt University,
fh@fharrell.com

Richard M. Heiberger,
Department of Statistics,
Temple University, Philadelphia, PA.
rmh@temple.edu

David R. Whiting,
School of Clinical Medical Sciences (Diabetes),
University of Newcastle upon Tyne, UK.
david.whiting@ncl.ac.uk

See Also

html, format.df, texi2dvi

Examples

x <- matrix(1:6, nrow=2, dimnames=list(c('a','b'),c('c','d','this that')))
## Not run: 
latex(x)   # creates x.tex in working directory
# The result of the above command is an object of class "latex"
# which here is automatically printed by the latex print method.
# The latex print method prepends and appends latex headers and
# calls the latex program in the PATH.  If the latex program is
# not in the PATH, you will get error messages from the operating
# system.
w <- latex(x, file='/tmp/my.tex')
# Does not call the latex program as the print method was not invoked
print.default(w)
# Shows the contents of the w variable without attempting to latex it.
d <- dvi(w)  # compile LaTeX document, make .dvi
             # latex assumed to be in path
d            # or show(d) : run xdvi (assumed in path) to display
w            # or show(w) : run dvi then xdvi
dvips(d)     # run dvips to print document
dvips(w)     # run dvi then dvips
library(tools)
texi2dvi('/tmp/my.tex')   # compile and produce pdf file in working dir.
## End(Not run)
latex(x, file="")   # just write out LaTeX code to screen
## Not run: 
# Use paragraph formatting to wrap text to 3 in. wide in a column
d <- data.frame(x=1:2,
                y=c(paste("a",
                    paste(rep("very",30),collapse=" "),"long string"),
                "a short string"))
latex(d, file="", col.just=c("l", "p{3in}"), table.env=FALSE)
## End(Not run)
## Not run: 
# After running latex( ) multiple times with different special styles in
# effect, make a file that will call for the needed LaTeX packages when
# latex is run (especially when using Sweave with R)
if(exists("latexStyles"))
  cat(paste('\\usepackage{',latexStyles,'}',sep=''),
      file='stylesused.tex', sep='\n')
# Then in the latex job have something like:
# \documentclass{article}
# \input{stylesused}
# \begin{document}
# ...
## End(Not run)
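The cellTexCmds argument described under Arguments can be sketched as follows. This is a hedged illustration that assumes Hmisc is installed and that cell commands are given without the leading backslash, in the same form as the "bfseries" default of cgroupTexCmd; the choice of cell is arbitrary.

```r
# Sketch: bold one cell of a small matrix via cellTexCmds.
x <- matrix(1:6, nrow = 2,
            dimnames = list(c('a', 'b'), c('c', 'd', 'this that')))

# Start from a matrix of empty strings with the same shape as x
cmds <- matrix('', nrow = NROW(x), ncol = NCOL(x))
cmds[2, 3] <- 'bfseries'   # assumed form: command name without the backslash

if (requireNamespace('Hmisc', quietly = TRUE)) {
  # file = '' writes the LaTeX code to the console instead of creating x.tex;
  # numeric.dollar = FALSE disables math mode so the font command can act
  Hmisc::latex(x, file = '', cellTexCmds = cmds, numeric.dollar = FALSE)
}
```

Building the command matrix from the dimensions of x keeps it consistent with the requirement that both have the same NROW() and NCOL().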

Check whether the options for latex functions have been specified.

Description

Check whether the options for latex functions have been specified. If any of options()[c("latexcmd","dviExtension","xdvicmd")] are NULL, an error message is displayed.

Usage

latexCheckOptions(...)

Arguments

...

Any arguments are ignored.

Value

If any NULL options are detected, the invisible text of the error message. If all three options have non-NULL values, NULL.
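The kind of check described above can be sketched in base R. The helper name missing_latex_options is hypothetical and is not part of Hmisc; it only reports which of the three options are unset and does not reproduce the error message text of latexCheckOptions.

```r
# Hypothetical helper (not Hmisc code): list the latex-related options
# that are still NULL.
missing_latex_options <- function() {
  needed <- c('latexcmd', 'dviExtension', 'xdvicmd')
  is_unset <- vapply(needed, function(nm) is.null(getOption(nm)), logical(1))
  needed[is_unset]
}

# Set two of the options and clear the third, then run the check
options(latexcmd = 'pdflatex', dviExtension = 'pdf', xdvicmd = NULL)
missing_latex_options()   # names of the options that still need a value
```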

Author(s)

Richard M. Heiberger <rmh@temple.edu>

See Also

latex


Enhanced Dot Chart for LaTeX Picture Environment with epic

Description

latexDotchart is a translation of the dotchart3 function for producing a vector of character strings containing LaTeX picture environment markup that mimics dotchart3 output. The LaTeX epic and color packages are required. The add and horizontal=FALSE options are not available for latexDotchart, however.

Usage

latexDotchart(data, labels, groups=NULL, gdata=NA,
  xlab='', auxdata, auxgdata=NULL, auxtitle,
  w=4, h=4, margin,
  lines=TRUE, dotsize = .075, size='small', size.labels='small',
  size.group.labels='normalsize', ttlabels=FALSE, sort.=TRUE,
  xaxis=TRUE, lcolor='gray', ...)

Arguments

data

a numeric vector whose values are shown on the x-axis

labels

a vector of labels for each point, corresponding to data. If omitted, names(data) are used, and if there are no names, integers prefixed by "#" are used.

groups

an optional categorical variable indicating how data values are grouped

gdata

data values for groups, typically summaries such as group medians

xlab

x-axis title

auxdata

a vector of auxiliary data, of the same length as the first (data) argument. If present, this vector of values will be printed outside the right margin of the dot chart. Usually auxdata represents cell sizes.

auxgdata

similar to auxdata but corresponding to the gdata argument. These usually represent overall sample sizes for each group of lines.

auxtitle

if auxdata is given, auxtitle specifies a column heading for the extra printed data in the chart, e.g., "N"

w

width of picture in inches

h

height of picture in inches

margin

a 4-vector representing, in inches, the margin to the left of the x-axis, below the y-axis, to the right of the x-axis, and above the y-axis. By default these are computed by making educated guesses about how to accommodate auxdata etc.

lines

set to FALSE to suppress drawing of reference lines

dotsize

diameter of filled circles, in inches, for drawing dots

size

size of text in picture. This and the next two arguments are LaTeX font commands without the opening backslash, e.g., 'normalsize', 'small', 'large', smaller[2].

size.labels

size of labels

size.group.labels

size of labels corresponding to groups

ttlabels

set to TRUE to use typewriter monospaced font for labels

sort.

set to FALSE to keep latexDotchart from sorting the input data, i.e., it will assume that the data are already properly arranged. This is especially useful when you are using gdata and groups and you want to control the order that groups appear on the chart (from top to bottom).

xaxis

set to FALSE to suppress drawing x-axis

lcolor

color for horizontal reference lines. Default is "gray"

...

ignored

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

See Also

dotchart3

Examples

## Not run: 
z <- latexDotchart(c(.1,.2), c('a','bbAAb'), xlab='This Label',
                   auxdata=c(.1,.2), auxtitle='Zcriteria')
f <- '/tmp/t.tex'
cat('\\documentclass{article}\n\\usepackage{epic,color}\n\\begin{document}\n', file=f)
cat(z, sep='\n', file=f, append=TRUE)
cat('\\end{document}\n', file=f, append=TRUE)
set.seed(135)
maj <- factor(c(rep('North',13),rep('South',13)))
g <- paste('Category',rep(letters[1:13],2))
n <- sample(1:15000, 26, replace=TRUE)
y1 <- runif(26)
y2 <- pmax(0, y1 - runif(26, 0, .1))
z <- latexDotchart(y1, g, groups=maj, auxdata=n, auxtitle='n', xlab='Y',
                   size.group.labels='large', ttlabels=TRUE)
f <- '/tmp/t2.tex'
cat('\\documentclass{article}\n\\usepackage{epic,color}\n\\begin{document}\n\\framebox{', file=f)
cat(z, sep='\n', file=f, append=TRUE)
cat('}\\end{document}\n', file=f, append=TRUE)
## End(Not run)

Convert a Data Frame or Matrix to a LaTeX Tabular

Description

latexTabular creates a character vector representing a matrix or data frame in a simple ‘tabular’ environment.

Usage

latexTabular(x, headings=colnames(x),
             align =paste(rep('c',ncol(x)),collapse=''),
             halign=paste(rep('c',ncol(x)),collapse=''),
             helvetica=TRUE, translate=TRUE, hline=0, center=FALSE, ...)

Arguments

x

a matrix or data frame, or a vector that is automatically converted to a matrix

headings

a vector of character strings specifying column headings for ‘latexTabular’, defaulting to x's colnames. To make multi-line headers use the newline character inside elements of headings.

align

a character string specifying column alignments for ‘latexTabular’, defaulting to paste(rep('c',ncol(x)),collapse='') to center. You may specify align='c|c' and other LaTeX tabular formatting.

halign

a character string specifying alignment for column headings, defaulting to centered.

helvetica

set to FALSE to use default LaTeX font in ‘latexTabular’ instead of helvetica.

translate

set to FALSE if column headings and table entries are already in LaTeX format, otherwise latexTabular will run them through latexTranslate

hline

set to 1 to put hline after heading, 2 to also put hlines before and after heading and at table end

center

set to TRUE to enclose the tabular in a LaTeX center environment

...

if present, x is run through format.df with those extra arguments

Value

a character string containing LaTeX markup

Author(s)

Frank E. Harrell, Jr.,
Department of Biostatistics,
Vanderbilt University,
fh@fharrell.com

See Also

latex.default, format.df

Examples

x <- matrix(1:6, nrow=2, dimnames=list(c('a','b'),c('c','d','this that')))
latexTabular(x)   # a character string with LaTeX markup
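The align, hline, and helvetica arguments can be combined as in the following sketch. It assumes Hmisc is installed; the alignment string 'r|rr' is only an illustrative choice for the three columns of x.

```r
# Sketch: right-aligned columns, a rule after the heading, default LaTeX font.
x <- matrix(1:6, nrow = 2,
            dimnames = list(c('a', 'b'), c('c', 'd', 'this that')))

if (requireNamespace('Hmisc', quietly = TRUE)) {
  tab <- Hmisc::latexTabular(x, align = 'r|rr', hline = 1, helvetica = FALSE)
  cat(tab, '\n')   # print the LaTeX markup to the console
}
```

Because the result is ordinary character data, it can be written to a file with cat(tab, file=...) and pulled into a document with \input.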

Create LaTeX Thermometers and Colored Needles

Description

latexTherm creates a LaTeX picture environment for drawing a series of thermometers whose heights depict the values of a variable y assumed to be scaled from 0 to 1. This is useful for showing fractions of sample analyzed in any table or plot, intended for a legend. For example, four thermometers might be used to depict the fraction of enrolled patients included in the current analysis, the fraction randomized, the fraction of patients randomized to treatment A being analyzed, and the fraction randomized to B being analyzed. The picture is placed inside a LaTeX macro definition for a macro variable named name, to be invoked by the user later in the LaTeX file using name preceded by a backslash.

If y has an attribute "table", it is assumed to contain a character string with LaTeX code. This code is used as a tooltip popup for PDF using the LaTeX ocgtools package or using style tooltips. Typically the code will contain a tabular environment. The user must define a LaTeX macro tooltipn that takes two arguments (original object and pop-up object) that does the pop-up.

latexNeedle is similar to latexTherm except that vertical needles are produced and each may have its own color. A grayscale box is placed around the needles and provides the 0-1 y-axis reference. Horizontal grayscale grid lines may be drawn.

pngNeedle is similar to latexNeedle but is for generating small png graphics. The full graphics file name is returned invisibly.

Usage

latexTherm(y, name, w = 0.075, h = 0.15, spacefactor = 1/2, extra = 0.07,
           file = "", append = TRUE)
latexNeedle(y, x=NULL, col='black', href=0.5, name, w=.05, h=.15,
            extra=0, file = "", append=TRUE)
pngNeedle(y, x=NULL, col='black', href=0.5, lwd=3.5, w=6, h=18,
          file=tempfile(fileext='.png'))

Arguments

y

a vector of 0-1 scaled values. Boxes and their frames are omitted for NA elements

x

a vector corresponding to y giving x-coordinates. Scaled accordingly, or defaults to equally-spaced values.

name

name of LaTeX macro variable to be defined

w

width of a single box (thermometer) in inches. For latexNeedle and pngNeedle, w is the spacing between needles, the latter being in pixels.

h

height of a single box in inches. For latexNeedle and pngNeedle, h is the height of the frame, the latter in pixels.

spacefactor

fraction of w added for extra space between boxes for latexTherm

extra

extra space in inches to set aside to the right of and above the series of boxes or frame

file

name of file to which to write LaTeX code. Default is the console. Also used as base file name for png graphic. Default for that is from tempfile.

append

set to FALSE to write over file

col

a vector of colors corresponding to positions in y. col is repeated if too short.

href

values of y (0-1) for which horizontal grayscale reference lines are drawn for latexNeedle and pngNeedle. Set to NULL to not draw any reference lines

lwd

line width of needles for pngNeedle

Author(s)

Frank Harrell

Examples

## Not run: 
# The following is in the Hmisc tests directory
# For a knitr example see latexTherm.Rnw in that directory
ct <- function(...) cat(..., sep='')
ct('\\documentclass{report}\\begin{document}\n')
latexTherm(c(1, 1, 1, 1), name='lta')
latexTherm(c(.5, .7, .4, .2), name='ltb')
latexTherm(c(.5, NA, .75, 0), w=.3, h=1, name='ltc', extra=0)
latexTherm(c(.5, NA, .75, 0), w=.3, h=1, name='ltcc')
latexTherm(c(0, 0, 0, 0), name='ltd')
ct('This is the first:\\lta and the second:\\ltb\\\\ and the third without extra:\\ltc END\\\\\nThird with extra:\\ltcc END\\\\ \\vspace{2in}\\\\ All data = zero, frame only:\\ltd\\\\\\end{document}\n')
w <- pngNeedle(c(.2, .5, .7))
cat(tobase64image(w))  # can insert this directly into an html file
## End(Not run)

Legend Creation Functions

Description

Wrappers for legend plotting functions defined by plot functions

Usage

Key(...)
Key2(...)
sKey(...)

Arguments

...

arguments to pass to wrapped functions


Pretty-print the Structure of a Data Object

Description

This is a function to pretty-print the structure of any data object (usually a list). It is similar to the R function str.

Usage

list.tree(struct, depth=-1, numbers=FALSE, maxlen=22, maxcomp=12,
          attr.print=TRUE, front="", fill=". ", name.of, size=TRUE)

Arguments

struct

The object to be displayed

depth

Maximum depth of recursion (of lists within lists ...) to be printed; negative value means no limit on depth.

numbers

If TRUE, use numbers in leader instead of dots to represent position in structure.

maxlen

Approximate maximum length (in characters) allowed on each line to give the first few values of a vector. maxlen=0 suppresses printing any values.

maxcomp

Maximum number of components of any list that will be described.

attr.print

Logical flag, determining whether a description of attributes will be printed.

front

Front material of a line, for internal use.

fill

Fill character used for each level of indentation.

name.of

Name of object, for internal use (deparsed version of struct by default).

size

Logical flag, should the size of the object in bytes be printed?

A description of the structure of struct will be printed in outline form, with indentation for each level of recursion, showing the internal storage mode, length, class(es) if any, attributes, and first few elements of each data vector. By default each level of list recursion is indicated by a "." and attributes by "A".

Author(s)

Alan Zaslavsky, zaslavsk@hcp.med.harvard.edu

See Also

str

Examples

X <- list(a=ordered(c(1:30,30:1)),b=c("Rick","John","Allan"),
          c=diag(300),e=cbind(p=1008:1019,q=4))
list.tree(X)
# In R you can say str(X)

Apply a Function to Rows of a Matrix or Vector

Description

mApply is like tapply except that the first argument can be a matrix or a vector, and the output is cleaned up if simplify=TRUE. It uses code adapted from Tony Plate (tplate@blackmesacapital.com) to operate on grouped submatrices.

As mApply can be much faster than using by, it is often worth the trouble of converting a data frame to a numeric matrix for processing by mApply. asNumericMatrix will do this, and matrix2dataFrame will convert a numeric matrix back into a data frame.

Usage

mApply(X, INDEX, FUN, ..., simplify=TRUE, keepmatrix=FALSE)

Arguments

X

a vector or matrix capable of being operated on by the function specified as the FUN argument

INDEX

list of factors, each of same number of rows as 'X' has.

FUN

the function to be applied. In the case of functions like'+', '

...

optional arguments to 'FUN'.

simplify

set to 'FALSE' to suppress simplification of the result into an array, matrix, etc.

keepmatrix

set to TRUE to keep result as a matrix even if simplify is TRUE, in the case of only one stratum

Value

For mApply, the returned value is a vector, matrix, or list. If FUN returns more than one number, the result is an array if simplify=TRUE and is a list otherwise. If a matrix is returned, its rows correspond to unique combinations of INDEX. If INDEX is a list with more than one vector, FUN returns more than one number, and simplify=FALSE, the returned value is a list that is an array with the first dimension corresponding to the last vector in INDEX, the second dimension corresponding to the next to last vector in INDEX, etc., and the elements of the list-array correspond to the values computed by FUN. In this situation the returned value is a regular array if simplify=TRUE. The order of dimensions is as previously but the additional (last) dimension corresponds to values computed by FUN.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

See Also

asNumericMatrix, matrix2dataFrame, tapply, sapply, lapply, mapply, by.

Examples

require(datasets, TRUE)
a <- mApply(iris[,-5], iris$Species, mean)
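The matrix case, in which FUN receives the sub-matrix of rows for each group, can be sketched as follows. This assumes Hmisc is installed; colMeans is only an illustrative choice of FUN, and the tapply call is a base R cross-check for a single column.

```r
# Sketch: per-species column means computed on a numeric matrix.
X <- as.matrix(iris[, 1:4])

if (requireNamespace('Hmisc', quietly = TRUE)) {
  # FUN returns one number per column, so a matrix with one row
  # per species is expected
  m <- Hmisc::mApply(X, iris$Species, colMeans)
  print(m)
}

# Base R cross-check for one column
base_means <- tapply(iris$Sepal.Length, iris$Species, mean)
base_means
```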

Methods for Storing and Analyzing Multiple Choice Variables

Description

mChoice is a function that is useful for grouping variables that represent individual choices on a multiple choice question. These choices are typically factor or character values but may be of any type. Levels of component factor variables need not be the same; all unique levels (or unique character values) are collected over all of the multiple variables. Then a new character vector is formed with integer choice numbers separated by semicolons. Optimally, a database system would have exported the semicolon-separated character strings with a levels attribute containing strings defining value labels corresponding to the integer choice numbers. mChoice is a function for creating a multiple-choice variable after the fact. mChoice variables are explicitly handled by the describe and summary.formula functions. NAs or blanks in input variables are ignored.

format.mChoice will convert the multiple choice representation to text form by substituting levels for integer codes. as.double.mChoice converts the mChoice object to a binary numeric matrix, one column per used level (or all levels if drop=FALSE). This is called by the user by invoking as.numeric. There is a print method and a summary method, and a print method for the summary.mChoice object. The summary method computes frequencies of all two-way choice combinations, the frequencies of the top 5 combinations, information about which other choices are present when each given choice is present, and the frequency distribution of the number of choices per observation. This summary output is used in the describe function. When options(prType='html') is in effect, the print method returns an html character string if render=FALSE and renders the html otherwise. This is used by print.describe and is most effective when short=TRUE is specified to summary.

inmChoice creates a logical vector the same length as x whose elements are TRUE when the observation in x contains at least one of the codes or value labels in the second argument.

match.mChoice creates an integer vector of the indexes of all elements in table which contain any of the specified levels

nmChoice returns an integer vector of the number of choices that were made

is.mChoice returns TRUE if the argument is a multiple choice variable.

Usage

mChoice(..., label='',
        sort.levels=c('original','alphabetic'),
        add.none=FALSE, drop=TRUE, ignoreNA=TRUE)
## S3 method for class 'mChoice'
format(x, minlength=NULL, sep=";", ...)
## S3 method for class 'mChoice'
as.double(x, drop=FALSE, ...)
## S3 method for class 'mChoice'
print(x, quote=FALSE, max.levels=NULL,
       width=getOption("width"), ...)
## S3 method for class 'mChoice'
as.character(x, ...)
## S3 method for class 'mChoice'
summary(object, ncombos=5, minlength=NULL,
  drop=TRUE, short=FALSE, ...)
## S3 method for class 'summary.mChoice'
print(x, prlabel=TRUE, render=TRUE, ...)
## S3 method for class 'mChoice'
x[..., drop=FALSE]
match.mChoice(x, table, nomatch=NA, incomparables=FALSE)
inmChoice(x, values, condition=c('any', 'all'))
inmChoicelike(x, values, condition=c('any', 'all'),
              ignore.case=FALSE, fixed=FALSE)
nmChoice(object)
is.mChoice(x)
## S3 method for class 'mChoice'
Summary(..., na.rm)

Arguments

na.rm

Logical: remove NA's from data

table

a vector (mChoice) of values to be matched against.

nomatch

value to return if a value for x does not exist in table.

incomparables

logical whether incomparable values should be compared.

...

a series of vectors

label

a character string label attribute to attach to the matrix created by mChoice

sort.levels

set sort.levels="alphabetic" to sort the columns of the matrix created by mChoice alphabetically by category rather than by the original order of levels in component factor variables (if there were any input variables that were factors)

add.none

Set add.none to TRUE to make a new category 'none' if it doesn't already exist and if there are observations with no choices selected.

drop

set drop=FALSE to keep unused factor levels as columns of the matrix produced by mChoice

ignoreNA

set to FALSE to keep any NAs present in data as a real level. Prior to Hmisc 4.7-2 FALSE was the default.

x

an object of class "mchoice" such as that created by mChoice. For is.mChoice, any object.

object

an object of class "mchoice" such as that created by mChoice

ncombos

maximum number of combos.

width

Width of a line of text to be formatted

quote

quote the output

max.levels

max levels to be displayed

minlength

By default no abbreviation of levels is done in format and summary. Specify a positive integer to use abbreviation in those functions. See abbreviate.

short

set to TRUE to have summary.mChoice use integer choice numbers in its tables, and to print the choice level definitions at the top

sep

character to use to separate levels when formatting

prlabel

set to FALSE to keep print.summary.mChoice from printing the variable label and number of unique values. Ignored for html output.

render

applies ofoptions(prType='html') is ineffect. Set toFALSE to return the html text instead ofrendering the html.

values

a scalar or vector. Ifvalues is integer, it isthe choice codes, and if it is a character vector, it is assumed tobe value labels. ForinmChoicelikevalues must becharacter strings which are pieces of choice labels.

condition

set to'all' forinmChoice to requirethat all choices invalues be present instead of the default ofany of them present.

ignore.case

set toTRUE to haveinmChoicelikeignore case in the data when matching onvalues

fixed

seegrep

Value

mChoice returns a character vector of class"mChoice"plus attributes"levels" and"label".summary.mChoice returns an object of class"summary.mChoice".inmChoice andinmChoicelikereturn a logical vector.format.mChoice returns a character vector, andas.double.mChoice returns a binary numeric matrix.nmChoice returns an integer vector.print.summary.mChoice returns an html character string ifoptions(prType='html') is in effect.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

See Also

label, combplotp

Examples

options(digits=3)
set.seed(3)
n <- 20
sex <- factor(sample(c("m","f"), n, rep=TRUE))
age <- rnorm(n, 50, 5)
treatment <- factor(sample(c("Drug","Placebo"), n, rep=TRUE))

# Generate a 3-choice variable; each of 3 variables has 5 possible levels
symp <- c('Headache','Stomach Ache','Hangnail',
          'Muscle Ache','Depressed')
symptom1 <- sample(symp, n, TRUE)
symptom2 <- sample(symp, n, TRUE)
symptom3 <- sample(symp, n, TRUE)
cbind(symptom1, symptom2, symptom3)[1:5,]
Symptoms <- mChoice(symptom1, symptom2, symptom3, label='Primary Symptoms')
Symptoms
print(Symptoms, long=TRUE)
format(Symptoms[1:5])
inmChoice(Symptoms,'Headache')
inmChoicelike(Symptoms, 'head', ignore.case=TRUE)
levels(Symptoms)
inmChoice(Symptoms, 3)
# Find all subjects with either of two symptoms
inmChoice(Symptoms, c('Headache','Hangnail'))
# Note: In this example, some subjects have the same symptom checked
# multiple times; in practice these redundant selections would be NAs
# mChoice will ignore these redundant selections
# Find all subjects with both symptoms
inmChoice(Symptoms, c('Headache', 'Hangnail'), condition='all')

meanage <- N <- numeric(5)
for(j in 1:5) {
  meanage[j] <- mean(age[inmChoice(Symptoms,j)])
  N[j] <- sum(inmChoice(Symptoms,j))
}
names(meanage) <- names(N) <- levels(Symptoms)
meanage
N

# Manually compute mean age for 2 symptoms
mean(age[symptom1=='Headache' | symptom2=='Headache' | symptom3=='Headache'])
mean(age[symptom1=='Hangnail' | symptom2=='Hangnail' | symptom3=='Hangnail'])

summary(Symptoms)

#Frequency table sex*treatment, sex*Symptoms
summary(sex ~ treatment + Symptoms, fun=table)
# Check:
ma <- inmChoice(Symptoms, 'Muscle Ache')
table(sex[ma])

# could also do:
# summary(sex ~ treatment + mChoice(symptom1,symptom2,symptom3), fun=table)

#Compute mean age, separately by 3 variables
summary(age ~ sex + treatment + Symptoms)

summary(age ~ sex + treatment + Symptoms, method="cross")

f <- summary(treatment ~ age + sex + Symptoms, method="reverse", test=TRUE)
f
# trio of numbers represent 25th, 50th, 75th percentile
print(f, long=TRUE)

creates a string that is a repeat of a substring

Description

Takes a character and creates a string that is the character repeated len times.

Usage

makeNstr(char, len)

Arguments

char

character to be repeated

len

number of times to repeat char.

Value

A string that is char repeated len times.

Author(s)

Charles Dupont

See Also

paste, rep

Examples

makeNstr(" ", 5)

Read Tables in a Microsoft Access Database

Description

Assuming the mdbtools package has been installed on your system and is in the system path, mdb.get imports one or more tables in a Microsoft Access database. Date-time variables are converted to dates or chron package date-time variables. The csv.get function is used to import automatically exported csv files. If tables is unspecified all tables in the database are retrieved. If more than one table is imported, the result is a list of data frames.

Usage

mdb.get(file, tables=NULL, lowernames=FALSE, allow=NULL,
        dateformat='%m/%d/%y', mdbexportArgs='-b strip', ...)

Arguments

file

the file name containing the Access database

tables

character vector specifying the names of tables to import. Default is to import all tables. Specify tables=TRUE to return the list of available tables.

lowernames

set this to TRUE to change variable names to lower case

allow

a vector of characters allowed by R that should not be converted to periods in variable names. By default, underscores in variable names are converted to periods as with R before version 1.9.

dateformat

see cleanup.import. Default is the usual Access format used in the U.S.

mdbexportArgs

command line arguments to issue to mdb-export. Set to '' to omit '-b strip'.

...

arguments to pass to csv.get

Details

Uses the mdbtools package executables mdb-tables, mdb-schema, and mdb-export (with by default option -b strip to drop any binary output). In Debian/Ubuntu Linux run apt-get install mdbtools. cleanup.import is invoked by csv.get to transform variables and store them as efficiently as possible.

Value

a new data frame or a list of data frames

Author(s)

Frank Harrell, Vanderbilt University

See Also

data.frame, cleanup.import, csv.get, Date, chron

Examples

## Not run: 
# Read all tables in the Microsoft Access database Nwind.mdb
d <- mdb.get('Nwind.mdb')
contents(d)
for(z in d) print(contents(z))
# Just print the names of tables in the database
mdb.get('Nwind.mdb', tables=TRUE)
# Import one table
Orders <- mdb.get('Nwind.mdb', tables='Orders')

## End(Not run)

meltData

Description

Melt a Dataset To Examine All Xs vs Y

Usage

meltData(
  formula,
  data,
  tall = c("right", "left"),
  vnames = c("labels", "names"),
  sepunits = FALSE,
  ...
)

Arguments

formula

a formula

data

data frame or table

tall

see Details

vnames

set to names to always use variable names instead of labels for X

sepunits

set to TRUE to create a separate variable Units to hold units of measurement. The variable is not created if no original variables have a non-blank units attribute.

...

passed to label()

Details

Uses a formula with one or more left hand side variables (Y) and one or more right hand side variables (X). Uses data.table::melt() to melt data so that each X is played against the same Y if tall='right' (the default) or each Y is played against the same X combination if tall='left'. The resulting data table has variables Y with their original names (if tall='right') or variables X with their original names (if tall='left'), variable, and value. By default variable is taken as label()s of the tall variables.

Value

data table

Author(s)

Frank Harrell

See Also

label()

Examples

d <- data.frame(y1=(1:10)/10, y2=(1:10)/100, x1=1:10, x2=101:110)
label(d$x1) <- 'X1'
units(d$x1) <- 'mmHg'
m=meltData(y1 + y2 ~ x1 + x2, data=d, units=TRUE) # consider also html=TRUE
print(m)
m=meltData(y1 + y2 ~ x1 + x2, data=d, tall='left')
print(m)

Draw Axes With Side-Specific mgp Parameters

Description

mgp.axis is a version of axis that uses the appropriate side-specific mgp parameter (see par) to account for different space requirements for axis labels vertical vs. horizontal tick marks. mgp.axis also fixes a bug in axis(2,...) that causes it to assume las=1.

mgp.axis.labels is used so that different spacing between tick marks and axis tick mark labels may be specified for x- and y-axes. Use mgp.axis.labels('default') to set defaults. Users can set values manually using mgp.axis.labels(x,y) where x and y are 2nd value of par('mgp') to use. Use mgp.axis.labels(type=w) to retrieve values, where w='x', 'y', 'x and y', 'xy', to get 3 mgp values (first 3 types) or 2 mgp.axis.labels.

Usage

mgp.axis(side, at = NULL, ...,
         mgp = mgp.axis.labels(type = if (side == 1 | side == 3) "x"
                               else "y"),
         axistitle = NULL, cex.axis=par('cex.axis'), cex.lab=par('cex.lab'))

mgp.axis.labels(value,type=c('xy','x','y','x and y'))

Arguments

side,at

see par

...

arguments passed through to axis

mgp,cex.axis,cex.lab

see par

axistitle

if specified will cause axistitle to be drawn on the appropriate axis as a title

value

vector of values to which to set system option mgp.axis.labels

type

see above

Value

mgp.axis.labels returns the value of mgp (only the second element of mgp if type="xy" or a list with elements x and y if type="x and y", each list element being a 3-vector) for the appropriate axis if value is not specified, otherwise it returns nothing but the system option mgp.axis.labels is set.

mgp.axis returns nothing.

Side Effects

mgp.axis.labels stores the value in the system option mgp.axis.labels

Author(s)

Frank Harrell

See Also

par

Examples

## Not run: 
mgp.axis.labels(type='x')  # get default value for x-axis
mgp.axis.labels(type='y')  # get value for y-axis
mgp.axis.labels(type='xy') # get 2nd element of both mgps
mgp.axis.labels(type='x and y')  # get a list with 2 elements
mgp.axis.labels(c(3,.5,0), type='x')  # set
options('mgp.axis.labels')            # retrieve

plot(..., axes=FALSE)
mgp.axis(1, "X Label")
mgp.axis(2, "Y Label")

## End(Not run)

Miscellaneous Functions for Epidemiology

Description

The mhgr function computes the Cochran-Mantel-Haenszel stratified risk ratio and its confidence limits using the Greenland-Robins variance estimator.

The lrcum function takes the results of a series of 2x2 tables representing the relationship between test positivity and diagnosis and computes positive and negative likelihood ratios (with all their deficiencies) and the variance of their logarithms. Cumulative likelihood ratios and their confidence intervals (assuming independence of tests) are computed, assuming a string of all positive tests or a string of all negative tests. The method of Simel et al as described in Altman et al is used.

Usage

mhgr(y, group, strata, conf.int = 0.95)

## S3 method for class 'mhgr'
print(x, ...)

lrcum(a, b, c, d, conf.int = 0.95)

## S3 method for class 'lrcum'
print(x, dec=3, ...)

Arguments

y

a binary response variable

group

a variable with two unique values specifying comparison groups

strata

the stratification variable

conf.int

confidence level

x

an object created by mhgr or lrcum

a

frequency of true positive tests

b

frequency of false positive tests

c

frequency of false negative tests

d

frequency of true negative tests

dec

number of places to the right of the decimal to print for lrcum

...

additional arguments to be passed to other print functions

Details

Uses equations 4 and 13 from Greenland and Robins.
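As a rough illustration of the kind of estimator described here, the stratified risk ratio and a variance for its logarithm can be computed in base R from per-stratum counts. This is a sketch only, not the package's code: the function name mh_rr and its argument layout are hypothetical, and the variance is the standard textbook form of the Greenland-Robins estimator, which is assumed (not verified) to correspond to the equations cited above.

```r
# Hypothetical helper, not part of Hmisc: Mantel-Haenszel risk ratio
# with a Greenland-Robins-type variance estimate for log(RR).
# x1 = events in group 1, n1 = size of group 1 (one element per stratum)
# x2 = events in group 2, n2 = size of group 2
mh_rr <- function(x1, n1, x2, n2, conf.int = 0.95) {
  N   <- n1 + n2                        # stratum totals
  num <- sum(x1 * n2 / N)               # numerator of the MH risk ratio
  den <- sum(x2 * n1 / N)               # denominator of the MH risk ratio
  rr  <- num / den
  # Variance of log(RR): textbook Greenland-Robins form (an assumption here)
  v   <- sum(((x1 + x2) * n1 * n2 - x1 * x2 * N) / N^2) / (num * den)
  z   <- qnorm((1 + conf.int) / 2)
  c(rr = rr,
    lower = exp(log(rr) - z * sqrt(v)),
    upper = exp(log(rr) + z * sqrt(v)))
}

# Migraine counts used in the Examples section: 'Better' responses for
# Active vs Placebo within each sex (female, male)
mh_rr(x1 = c(16, 12), n1 = c(27, 28), x2 = c(5, 7), n2 = c(25, 26))
```

With a single stratum the estimate reduces to the ordinary risk ratio (x1/n1)/(x2/n2), which gives a quick sanity check on the arithmetic.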

Value

a list of class "mhgr" or of class "lrcum".

Author(s)

Frank E Harrell Jr
fh@fharrell.com

References

Greenland S, Robins JM (1985): Estimation of a common effect parameter from sparse follow-up data. Biometrics 41:55-68.

Altman DG, Machin D, Bryant TN, Gardner MJ, Eds. (2000): Statistics with Confidence, 2nd Ed. Bristol: BMJ Books, 105-110.

Simel DL, Samsa GP, Matchar DB (1991): Likelihood ratios with confidence: sample size estimation for diagnostic test studies. J Clin Epi 44:763-770.

See Also

logrank

Examples

# Create Migraine dataset used in Example 28.6 in the SAS PROC FREQ guide
d <- expand.grid(response=c('Better','Same'),
                 treatment=c('Active','Placebo'),
                 sex=c('female','male'))
d$count <- c(16, 11, 5, 20, 12, 16, 7, 19)
d
# Expand data frame to represent raw data
r <- rep(1:8, d$count)
d <- d[r,]
with(d, mhgr(response=='Better', treatment, sex))

# Discrete survival time example, to get Cox-Mantel relative risk and CL
# From Stokes ME, Davis CS, Koch GG, Categorical Data Analysis Using the
# SAS System, 2nd Edition, Section 17.3, p. 596-599
#
# Input data in Table 17.5
d <- expand.grid(treatment=c('A','P'), center=1:3)
d$healed2w    <- c(15,15,17,12, 7, 3)
d$healed4w    <- c(17,17,17,13,17,17)
d$notHealed4w <- c( 2, 7,10,15,16,18)
d
# Reformat to the way most people would collect raw data
d1 <- d[rep(1:6, d$healed2w),]
d1$time <- '2'
d1$y <- 1
d2 <- d[rep(1:6, d$healed4w),]
d2$time <- '4'
d2$y <- 1
d3 <- d[rep(1:6, d$notHealed4w),]
d3$time <- '4'
d3$y <- 0
d <- rbind(d1, d2, d3)
d$healed2w <- d$healed4w <- d$notHealed4w <- NULL
d
# Finally, duplicate appropriate observations to create 2 and 4-week
# risk sets.  Healed and not healed at 4w need to be in the 2-week
# risk set as not healed
d2w      <- subset(d, time=='4')
d2w$time <- '2'
d2w$y    <- 0
d24      <- rbind(d, d2w)
with(d24, table(y, treatment, time, center))
# Matches Table 17.6

with(d24, mhgr(y, treatment, interaction(center, time, sep=';')))

# Get cumulative likelihood ratios and their 0.95 confidence intervals
# based on the following two tables
#
#          Disease       Disease
#          +     -       +     -
# Test +   39    3       20    5
# Test -   21   17       22   15

lrcum(c(39,20), c(3,5), c(21,22), c(17,15))
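The likelihood ratio calculation that lrcum performs can be approximated by hand, which may help when checking its output. The sketch below is not the package's code: lr_sketch is a hypothetical name, and the log-scale variance formulas are the usual large-sample approximations, assumed here to match the Simel et al method. Cumulative values multiply the likelihood ratios and add the variances of their logs, which relies on the tests being independent.

```r
# Hypothetical helper, not part of Hmisc: likelihood ratios from 2x2 tables.
# tp, fp, fn, tn are vectors with one element per diagnostic test.
lr_sketch <- function(tp, fp, fn, tn, conf.int = 0.95) {
  sens  <- tp / (tp + fn)
  spec  <- tn / (fp + tn)
  lrpos <- sens / (1 - spec)            # LR for a positive test
  lrneg <- (1 - sens) / spec            # LR for a negative test
  # Approximate variances of the log likelihood ratios (assumed formulas)
  vpos  <- 1 / tp - 1 / (tp + fn) + 1 / fp - 1 / (fp + tn)
  vneg  <- 1 / fn - 1 / (tp + fn) + 1 / tn - 1 / (fp + tn)
  z     <- qnorm((1 + conf.int) / 2)
  # On the log scale, LRs and their variances add across independent tests
  cum <- function(lr, v) {
    l <- cumsum(log(lr))
    s <- sqrt(cumsum(v))
    cbind(LR = exp(l), lower = exp(l - z * s), upper = exp(l + z * s))
  }
  list(positive = cum(lrpos, vpos), negative = cum(lrneg, vneg))
}

# Same two tables as in the lrcum example above
res <- lr_sketch(tp = c(39, 20), fp = c(3, 5), fn = c(21, 22), tn = c(17, 15))
res
```

Row 1 of each matrix refers to the first test alone; row 2 is the cumulative value after both tests.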

Minor Tick Marks

Description

Adds minor tick marks to an existing plot. All minor tick marks that will fit on the axes will be drawn.

Usage

minor.tick(nx=2, ny=2, tick.ratio=0.5, x.args = list(), y.args = list())

Arguments

nx

number of intervals in which to divide the area between major tick marks on the X-axis. Set to 1 to suppress minor tick marks.

ny

same as nx but for the Y-axis.

tick.ratio

ratio of lengths of minor tick marks to major tick marks. The length of major tick marks is retrieved from par("tck").

x.args

additional arguments (e.g. pos, lwd) used by the axis() function when rendering the X-axis.

y.args

same as x.args but for Y-axis.

Side Effects

plots

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
Earl Bellinger
Max Planck Institute
earlbellinger@gmail.com
Viktor Horvath
Brandeis University
vhorvath@brandeis.edu

See Also

axis

Examples

# Plot with default settings
plot(runif(20), runif(20))
minor.tick()

# Plot with arguments passed to axis()
plot(c(0,1), c(0,1), type = 'n', axes = FALSE, ann = FALSE)
# setting up a plot without axes and annotation
points(runif(20), runif(20))                       # plotting data
axis(1, pos = 0.5, lwd = 2)                        # showing X-axis at Y = 0.5 with formatting
axis(2, col = 2)                                   # formatted Y-axis
minor.tick( nx = 4, ny = 4, tick.ratio = 0.3,
            x.args = list(pos = 0.5, lwd = 2),     # X-minor tick format arguments
            y.args = list(col = 2))                # Y-minor tick format arguments

movStats

Description

Moving Estimates Using Overlapping Windows

Usage

movStats(
  formula,
  stat = NULL,
  discrete = FALSE,
  space = c("n", "x"),
  eps = if (space == "n") 15,
  varyeps = FALSE,
  nignore = 10,
  xinc = NULL,
  xlim = NULL,
  times = NULL,
  tunits = "year",
  msmooth = c("smoothed", "raw", "both"),
  tsmooth = c("supsmu", "lowess"),
  bass = 8,
  span = 1/4,
  maxdim = 6,
  penalty = NULL,
  trans = function(x) x,
  itrans = function(x) x,
  loess = FALSE,
  ols = FALSE,
  qreg = FALSE,
  lrm = FALSE,
  orm = FALSE,
  hare = FALSE,
  ordsurv = FALSE,
  lrm_args = NULL,
  family = "logistic",
  k = 5,
  tau = (1:3)/4,
  melt = FALSE,
  data = environment(formula),
  pr = c("none", "kable", "plain", "margin")
)

Arguments

formula

a formula with the analysis variable on the left and the x-variable on the right, followed by optional stratification variables

stat

function of one argument that returns a named list of computed values. Defaults to computing mean and quartiles + N except when y is binary in which case it computes moving proportions. If y has two columns the default statistics are Kaplan-Meier estimates of cumulative incidence at a vector of times.

discrete

set to TRUE if x-axis variable is discrete and no intervals should be created for windows

space

defines whether intervals use fixed width or fixed sample size

eps

tolerance for window (half width of window). For space='x' is in data units, otherwise is the sample size for half the window, not counting the middle target point.

varyeps

applies to space='n' and causes a smaller eps to be used in strata with fewer than “ observations so as to arrive at three x points

nignore

see description, default is to exclude nignore=10 points on the left and right tails from estimation and plotting

xinc

increment in x to evaluate stats, default is xlim range/100 for space='x'. For space='n' xinc defaults to m observations, where m = max(n/200, 1).

xlim

2-vector of limits to evaluate if space='x' (default is nignore smallest to nignore largest)

times

vector of times for evaluating one minus Kaplan-Meier estimates

tunits

time units when times is given

msmooth

set to 'smoothed' or 'both' to compute lowess-smooth moving estimates. msmooth='both' will display both. 'raw' will display only the moving statistics. msmooth='smoothed' (the default) will display only the smoothed moving estimates.

tsmooth

defaults to the super-smoother 'supsmu' for after-moving smoothing. Use tsmooth='lowess' to instead use lowess.

bass

the supsmu bass parameter used to smooth the moving statistics if tsmooth='supsmu'. The default of 8 represents quite heavy smoothing.

span

the lowess span used to smooth the moving statistics

maxdim

passed to hare, default is 6

penalty

passed to hare, default is to use BIC. Specify 2 to use AIC.

trans

transformation to apply to x

itrans

inverse transformation

loess

set to TRUE to also compute loess estimates

ols

set to TRUE to include rcspline estimate of mean using ols

qreg

set to TRUE to include quantile regression estimates w rcspline

lrm

set to TRUE to include logistic regression estimates w rcspline

orm

set to TRUE to include ordinal logistic regression estimates w rcspline (mean + quantiles in tau)

hare

set to TRUE to include hazard regression estimates of incidence at times, using the polspline package

ordsurv

set to TRUE to include ordinal regression estimates of incidence at times, using the rms package adapt_orm and survest.orm functions

lrm_args

a list of optional arguments to pass to lrm when lrm=TRUE, e.g., list(maxit=20)

family

link function for ordinal regression (see rms::orm)

k

number of knots to use for ols, lrm, qreg restricted cubic splines. Linearity is forced for binary y when the minimum of the number of events and number of non-events is below 10 for a by-group. For ordsurv=TRUE is the maximum number of knots tried and is passed as argument maxk to the rms adapt_orm function.

tau

quantile numbers to estimate with quantile regression

melt

set to TRUE to melt data table and derive Type and Statistic

data

data.table or data.frame, default is calling frame

pr

defaults to no printing of window information. Use pr='plain' to print in the ordinary way, pr='kable' to convert the object to knitr::kable and print, or pr='margin' to convert to kable and place in the Quarto right margin. For the latter two results='asis' must be in the chunk header.

Details

Function to compute moving averages and other statistics as a function of a continuous variable, possibly stratified by other variables. Estimates are made by creating overlapping moving windows and computing the statistics defined in the stat function for each window. The default method, space='n', creates varying-width intervals each having a sample size of 2*eps + 1, and the smooth estimates are made every xinc observations. Outer intervals are not symmetric in sample size (but the mean x in those intervals will reflect that) unless eps=nignore, as outer intervals are centered at observations nignore and n - nignore + 1 where the default for nignore is 10. The mean x-variable within each window is taken to represent that window. If trans and itrans are given, x means are computed on the trans(x) scale and then itrans'd. For space='x', by default estimates are made from the nignore smallest to the nignore largest observed values of the x variable to avoid extrapolation and to help get the moving statistics off to an adequate start for the left tail. Also by default the moving estimates are smoothed using supsmu. When melt=TRUE you can feed the result into ggplot like this:
ggplot(w, aes(x=age, y=crea, col=Type)) + geom_line() + facet_wrap(~ Statistic)

See here for several examples.
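The fixed-sample-size windowing idea (space='n') can be sketched in a few lines of base R. This is a simplified illustration rather than the movStats algorithm: moving_mean is a hypothetical helper, it computes only the mean, it evaluates only interior targets with full symmetric windows (so nignore and the asymmetric outer windows are not reproduced), and it handles no strata.

```r
# Hypothetical sketch, not Hmisc code: moving mean of y over windows of
# 2*eps + 1 observations ordered by x, evaluated every xinc observations.
moving_mean <- function(x, y, eps = 15, xinc = 1) {
  o <- order(x)
  x <- x[o]
  y <- y[o]
  n <- length(x)
  centers <- seq(eps + 1, n - eps, by = xinc)   # interior targets only
  t(sapply(centers, function(i) {
    w <- (i - eps):(i + eps)                    # indices in this window
    # the window is represented by the mean of its x values
    c(x = mean(x[w]), mean = mean(y[w]), N = length(w))
  }))
}

set.seed(1)
x <- runif(200)
y <- x^2 + rnorm(200, sd = 0.1)
m <- moving_mean(x, y, eps = 15, xinc = 5)
head(m)

# Optional after-moving smoothing with the super smoother
s <- supsmu(m[, "x"], m[, "mean"], bass = 8)
```

Each row of m is one window; the N column confirms that every interior window holds 2*eps + 1 observations.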

Value

a data table, with attribute infon which is a data frame with rows corresponding to strata and columns N, Wmean, Wmin, Wmax if stat computed N. These summarize the number of observations used in the windows. If varyeps=TRUE there is an additional column eps with the computed per-stratum eps. When space='n' and xinc is not given, the computed xinc also appears as a column. An additional attribute info is a kable object ready for printing to describe the window characteristics.

Author(s)

Frank Harrell


Margin Titles

Description

Writes overall titles and subtitles after a multiple image plot is drawn. If par()$oma==c(0,0,0,0), title is used instead of mtext, to draw titles or subtitles that are inside the plotting region for a single plot.

Usage

mtitle(main, ll, lc,
       lr=format(Sys.time(),'%d%b%y'),
       cex.m=1.75, cex.l=.5, ...)

Arguments

main

main title to be centered over entire figure, default is none

ll

subtitle for lower left of figure, default is none

lc

subtitle for lower center of figure, default is none

lr

subtitle for lower right of figure, default is today's date in format 23Jan91 for UNIX or R (Thu May 30 09:08:13 1996 format for Windows). Set to "" to suppress lower right title.

cex.m

character size for main, default is 1.75

cex.l

character size for subtitles

...

other arguments passed to mtext

Value

nothing

Side Effects

plots

Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
fh@fharrell.com

See Also

par, mtext, title, unix, pstamp

Examples

#Set up for 1 plot on figure, give a main title,
#use date for lr
plot(runif(20),runif(20))
mtitle("Main Title")

#Set up for 2 x 2 matrix of plots with a lower left subtitle and overall title
par(mfrow=c(2,2), oma=c(3,0,3,0))
plot(runif(20),runif(20))
plot(rnorm(20),rnorm(20))
plot(exp(rnorm(20)),exp(rnorm(20)))
mtitle("Main Title",ll="n=20")

Plot Multiple Lines

Description

Plots multiple lines based on a vector x and a matrix y, draws thin vertical lines connecting limits represented by columns of y beyond the first. It is assumed that either (1) the second and third columns of y represent lower and upper confidence limits, or that (2) there is an even number of columns beyond the first and these represent ascending quantiles that are symmetrically arranged around 0.5. If options(grType='plotly') is in effect, uses plotly graphics instead of grid or base graphics. For plotly you may want to set the list of possible colors, etc. using pobj=plot_ly(colors=...). lwd, lty, lwd.vert are ignored under plotly.

Usage

multLines(x, y, pos = c('left', 'right'), col='gray',
          lwd=1, lty=1, lwd.vert = .85, lty.vert = 1,
          alpha = 0.4, grid = FALSE,
          pobj=plotly::plot_ly(), xlim, name=colnames(y)[1], legendgroup=name,
          showlegend=TRUE, ...)

Arguments

x

a numeric vector

y

a numeric matrix with number of rows equal to the number of x elements

pos

when pos='left' the vertical lines are drawn, right to left, to the left of the point (x, y[,1]). Otherwise lines are drawn left to right to the right of the point.

col

a color used to connect (x, y[,1]) pairs. The same color but with transparency given by the alpha argument is used to draw the vertical lines

lwd

line width for main lines

lty

line types for main lines

lwd.vert

line width for vertical lines

lty.vert

line type for vertical lines

alpha

transparency

grid

set to TRUE when using grid/lattice

pobj

an already started plotly object to add to

xlim

global x-axis limits (required if using plotly)

name

trace name if using plotly

legendgroup

legend group name if using plotly

showlegend

whether or not to show traces in legend, if using plotly

...

passed to add_lines or add_segments if using plotly

Author(s)

Frank Harrell

Examples

if (requireNamespace("plotly")) {
  x <- 1:4
  y <- cbind(x, x-3, x-2, x-1, x+1, x+2, x+3)
  plot(NA, NA, xlim=c(1,4), ylim=c(-2, 7))
  multLines(x, y, col='blue')
  multLines(x, y, col='red', pos='right')
}

nCoincident

Description

Number of Coincident Points

Usage

nCoincident(x, y, bins = 400)

Arguments

x

numeric vector

y

numeric vector

bins

number of bins in both directions

Details

Computes the number of x,y pairs that are likely to be obscured in a regular scatterplot, in the sense of overlapping pairs after binning into bins x bins squares where bins defaults to 400. NAs are removed first.

Value

integer count

Author(s)

Frank Harrell

Examples

nCoincident(c(1:5, 4:5), c(1:5, 4:5)/10)
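One plausible reading of the binning idea is sketched below in base R. It is an assumption about the counting rule, not the package's implementation: the hypothetical function n_coincident_sketch scales each axis to bins equal-width cells and counts every point that lands in a cell already occupied by an earlier point. The value returned by nCoincident itself may be defined differently.

```r
# Hypothetical sketch, not Hmisc code: count points that fall in a grid
# cell already occupied by an earlier point.
n_coincident_sketch <- function(x, y, bins = 400) {
  ok <- !is.na(x) & !is.na(y)                 # drop incomplete pairs first
  x  <- x[ok]
  y  <- y[ok]
  cell <- function(v) {
    r <- range(v)
    if (r[1] == r[2]) return(rep(0, length(v)))
    # map to integer cell numbers 0 .. bins - 1
    pmin(floor((v - r[1]) / (r[2] - r[1]) * bins), bins - 1)
  }
  sum(duplicated(paste(cell(x), cell(y))))
}

n_coincident_sketch(c(1:5, 4:5), c(1:5, 4:5) / 10)
```

In this input the pairs (4, 0.4) and (5, 0.5) each appear twice, so under the counting rule assumed here two points are reported as coincident.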

Row-wise Deletion na.action

Description

Does row-wise deletion as na.omit, but adds frequency of missing values for each predictor to the "na.action" attribute of the returned model frame. Optionally stores further details if options(na.detail.response=TRUE).

Usage

na.delete(frame)

Arguments

frame

a model frame

Value

a model frame with rows deleted and the "na.action" attribute added.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

See Also

na.omit, na.keep, na.detail.response, model.frame.default, naresid, naprint

Examples

# options(na.action="na.delete")
# ols(y ~ x)
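The bookkeeping described above, row-wise deletion plus a per-variable count of missing values, can be mimicked on an ordinary data frame. The sketch is illustrative only: na_delete_sketch is a hypothetical function, and the layout of the list stored in the "na.action" attribute (elements omit and nmiss) is an assumption rather than the documented structure.

```r
# Hypothetical sketch, not Hmisc code: delete incomplete rows and record
# how many values were missing for each variable.
na_delete_sketch <- function(frame) {
  nmiss <- colSums(is.na(frame))              # per-variable NA counts
  keep  <- stats::complete.cases(frame)
  out   <- frame[keep, , drop = FALSE]
  # the element names below are assumptions made for illustration
  attr(out, "na.action") <- list(omit = which(!keep), nmiss = nmiss)
  out
}

d <- data.frame(y   = c(1, 0, 1, NA),
                age = c(NA, 41, 23, 30),
                sex = c("m", "f", "f", "m"))
r <- na_delete_sketch(d)
nrow(r)
attr(r, "na.action")$nmiss
```

The counts are taken before deletion, so they describe why rows were lost rather than what remains.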

Detailed Response Variable Information

Description

This function is called by certain na.action functions if options(na.detail.response=TRUE) is set. By default, this function returns a matrix of counts of non-NAs and the mean of the response variable computed separately by whether or not each predictor is NA. The default action uses the last column of a Surv object, in effect computing the proportion of events. Other summary functions may be specified by using options(na.fun.response="name of function").

Usage

na.detail.response(mf)

Arguments

mf

a model frame

Value

a matrix, with rows representing the different statistics that arecomputed for the response, and columns representing the differentsubsets for each predictor (NA and non-NA value subsets).

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

See Also

na.omit, na.delete, model.frame.default, naresid, naprint, describe

Examples

# sex
# [1] m f f m f f m m m m m m m m f f f m f m
# age
# [1] NA 41 23 30 44 22 NA 32 37 34 38 36 36 50 40 43 34 22 42 30
# y
# [1] 0 1 0 0 1 0 1 0 0 1 1 1 0 0 1 1 0 1 0 0
# options(na.detail.response=TRUE, na.action="na.delete", digits=3)
# lrm(y ~ age*sex)
#
# Logistic Regression Model
# 
# lrm(formula = y ~ age * sex)
#
#
# Frequencies of Responses
#   0 1 
#  10 8
#
# Frequencies of Missing Values Due to Each Variable
#  y age sex 
#  0   2   0
#
#
# Statistics on Response by Missing/Non-Missing Status of Predictors
#
#     age=NA age!=NA sex!=NA Any NA  No NA 
#   N    2.0  18.000   20.00    2.0 18.000
# Mean    0.5   0.444    0.45    0.5  0.444
#
# ...
# options(na.action="na.keep")
# describe(y ~ age*sex)
# Statistics on Response by Missing/Non-Missing Status of Predictors
#
#      age=NA age!=NA sex!=NA Any NA  No NA 
#    N    2.0  18.000   20.00    2.0 18.000
# Mean    0.5   0.444    0.45    0.5  0.444
#
# ...
# options(na.fun.response="table")  #built-in function table()
# describe(y ~ age*sex)
#
# Statistics on Response by Missing/Non-Missing Status of Predictors
#
#   age=NA age!=NA sex!=NA Any NA No NA 
# 0      1      10      11      1    10
# 1      1       8       9      1     8
#
# ...

Do-nothing na.action

Description

Does not delete rows containing NAs, but does add details concerning the distribution of the response variable if options(na.detail.response=TRUE). This na.action is primarily for use with describe.formula.

Usage

na.keep(mf)

Arguments

mf

a model frame

Value

the same model frame with the "na.action" attribute

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

See Also

na.omit, na.delete, model.frame.default, na.detail.response, naresid, naprint, describe

Examples

options(na.action="na.keep", na.detail.response=TRUE)
x1 <- runif(20)
x2 <- runif(20)
x2[1:4] <- NA
y <- rnorm(20)
describe(y ~ x1*x2)

Compute Number of Observations for Left Hand Side of Formula

Description

After removing any artificial observations added by addMarginal, computes the number of non-missing observations for all left-hand-side variables in formula. If formula contains a term id(variable), variable is assumed to be a subject ID variable, and only unique subject IDs are counted. If group is given and its value is the name of a variable in the right-hand-side of the model, an additional object nobsg is returned that is a matrix with as many columns as there are left-hand variables, and as many rows as there are levels to the group variable. This matrix has the further breakdown of unique non-missing observations by group. The concatenation of all ID variables is returned in a list element id.

Usage

nobsY(formula, group=NULL, data = NULL, subset = NULL,
      na.action = na.retain, matrixna=c('all', 'any'))

Arguments

formula

a formula object

group

character string containing optional name of a stratification variable for computing sample sizes

data

a data frame

subset

an optional subsetting criterion

na.action

an optional NA-handling function

matrixna

set to "all" if an observation is to be considered NA if all the columns of the variable are NA, otherwise use matrixna="any" to consider the row missing if any of the columns are missing

Value

an integer, with an attribute "formula" containing the original formula but with an id variable (if present) removed

Examples

d <- expand.grid(sex=c('female', 'male', NA),
                 country=c('US', 'Romania'),
                 reps=1:2)
d$subject.id <- c(0, 0, 3:12)
dm <- addMarginal(d, sex, country)
dim(dm)
nobsY(sex + country ~ 1, data=d)
nobsY(sex + country ~ id(subject.id), data=d)
nobsY(sex + country ~ id(subject.id) + reps, group='reps', data=d)
nobsY(sex ~ 1, data=d)
nobsY(sex ~ 1, data=dm)
nobsY(sex ~ id(subject.id), data=dm)

Creates a string of arbitrary length

Description

Creates a vector of strings which consists of the string segment given in each element of the string vector repeated times.

Usage

nstr(string, times)

Arguments

string

character: vector of string segments to be repeated. Will be recycled if argument times is longer.

times

integer: vector of number of times to repeat the corresponding segment. Will be recycled if argument string is longer.

Value

returns a character vector the same length as the longest of the two arguments.

Note

Will throw a warning if the length of the longer argument is not an even multiple of the shorter argument.

Author(s)

Charles Dupont

See Also

paste, rep

Examples

nstr(c("a"), c(0,3,4))
nstr(c("a", "b", "c"), c(1,2,3))
nstr(c("a", "b", "c"), 4)
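For comparison, base R's strrep follows the same repeat-and-recycle pattern and can be used to see what output shape to expect from the calls above. That the two functions agree element for element on these inputs is an assumption, not something stated in this documentation.

```r
# Base R analogue of the examples above; strrep recycles its arguments.
r1 <- strrep("a", c(0, 3, 4))
r2 <- strrep(c("a", "b", "c"), c(1, 2, 3))
r3 <- strrep(c("a", "b", "c"), 4)
r1   # "" "aaa" "aaaa"
r2   # "a" "bb" "ccc"
r3   # "aaaa" "bbbb" "cccc"
```

The single-string case, strrep(" ", 5), likewise mirrors the makeNstr example of repeating one character a fixed number of times.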

Extract number of intercepts

Description

Extract the number of intercepts from a model

Usage

num.intercepts(fit, type=c('fit', 'var', 'coef'))

Arguments

fit

a model fit object

type

the default is to return the formal number of intercepts used when fitting the model. Set type='var' to return the actual number of intercepts stored in the var object, or type='coef' to return the actual number in the fitted coefficients. The former will be less than the number fitted for orm fits, and the latter for orm fits passed through fit.mult.impute. If the var object is not present, the number of intercepts is determined from the ab element of the info.matrix object if it is present.

Value

num.intercepts returns an integer with the number of intercepts in the model.

See Also

orm, fit.mult.impute


Minimally Group an Ordinal Variable So Bootstrap Samples Will Contain All Distinct Values

Description

When bootstrapping models for ordinal Y when Y is fairly continuous, it is frequently the case that one or more bootstrap samples will not include one or more of the distinct original Y values. When fitting an ordinal model (including a Cox PH model), this means that an intercept cannot be estimated, and the parameter vectors will not align over bootstrap samples. To prevent this from happening, some grouping of Y may be necessary. TheordGroupBoot function usescutGn() to group Y so that the minimum number in any group is guaranteed to not exceed a certain integerm.ordGroupBoot tries a range ofm and stops at the lowestm such that either allB tested bootstrap samples contain all the original distinct values of Y (ifB>0), or that the probability that a given sample of sizen with replacement will contain all the distinct original values exceedsaprob (B=0). This probability is computed approximately using an approximation to the probability of complete sample coverage from thecoupon collector's problem and is quite accurate for our purposes.

Usage

ordGroupBoot(
  y,
  B = 0,
  m = 7:min(15, floor(n/3)),
  what = c("mean", "factor", "m"),
  aprob = 0.9999,
  pr = TRUE
)

Arguments

y

a numeric vector

B

number of bootstrap samples to test, or zero to use a coverage probability approximation

m

range of minimum group sizes to test; the default range is usually adequate

what

specifies what to return: the mean of y in each group, a factor version of this with interval endpoints in the levels, or the computed value of m

aprob

minimum coverage probability sought

pr

set to FALSE to not print the computed value of the minimum m satisfying the needed condition

Value

a numeric vector corresponding to y but grouped, containing either the mean of y in each group or a factor variable representing grouped y, or the minimum m that satisfied the required sample coverage

Author(s)

Frank Harrell

See Also

cutGn()

Examples

set.seed(1)
x <- c(1:6, NA, 7:22)
ordGroupBoot(x, m=5:10)
ordGroupBoot(x, m=5:10, B=5000, what='factor')

pMedian

Description

Pseudomedian

Usage

pMedian(
  x,
  na.rm = FALSE,
  conf.int = 0,
  B = 1000,
  type = c("percentile", "bca")
)

Arguments

x

a numeric vector

na.rm

set to TRUE to exclude NAs before computing the pseudomedian

conf.int

confidence level, defaulting to 0 so that no confidence limits are computed. Set to a number between 0 and 1 to compute bootstrap confidence limits

B

number of bootstrap samples if conf.int > 0

type

type of bootstrap interval, defaulting to 'percentile' for n >= 150 or 'bca' for n < 150

Details

Uses fast Fortran code to compute the pseudomedian of a numeric vector. The pseudomedian is the median of all possible midpoints of two observations. The pseudomedian is also called the Hodges-Lehmann one-sample estimator. The Fortran code was originally from JF Monahan, and was converted to C++ in the DescTools package. It has been converted to Fortran 2018 here. Bootstrap confidence intervals are optionally computed.

If n > 250,000 a random sample of 250,000 values of x is used to limit execution time. For n > 1,000 only the percentile bootstrap confidence interval is computed.

Bootstrapping uses the Fortran subroutine directly, for efficiency.
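For small n the definition can be checked by brute force in base R. The sketch below uses hypothetical helper names (walsh_pseudomedian, boot_pseudomedian) and a plain percentile bootstrap. It illustrates the definition only, does not reproduce the Fortran algorithm, and omits the bca interval.

```r
# Hypothetical helpers, not part of Hmisc: O(n^2) pseudomedian for small n
walsh_pseudomedian <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  s <- outer(x, x, "+")                      # all pairwise sums x[i] + x[j]
  median(s[lower.tri(s, diag = TRUE)]) / 2   # median of midpoints with i >= j
}

# Simple percentile bootstrap limits around the brute-force estimate
boot_pseudomedian <- function(x, conf.int = 0.95, B = 1000) {
  est <- replicate(B, walsh_pseudomedian(sample(x, replace = TRUE)))
  a <- (1 - conf.int) / 2
  c(estimate = walsh_pseudomedian(x),
    lower    = unname(quantile(est, a)),
    upper    = unname(quantile(est, 1 - a)))
}

x <- c(1:4, 10000)
walsh_pseudomedian(x)   # 3: the single outlier barely moves the estimate
set.seed(1)
boot_pseudomedian(x)
```

The quadratic memory use of outer() is what makes a dedicated compiled routine worthwhile for large vectors.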

Value

a scalar numeric value if conf.int = 0, or a 3-vector otherwise, with named elements estimate, lower, upper and attribute type. If the number of non-missing values is less than 5, NA is returned for both lower and upper limits.

See Also

https://dl.acm.org/toc/toms/1984/10/3/, https://www4.stat.ncsu.edu/~monahan/jul10/, https://www.fharrell.com/post/aci/

Examples

x <- c(1:4, 10000)
pMedian(x)
pMedian(x, conf.int=0.95)
# Compare with brute force calculation and with wilcox.test
w <- outer(x, x, '+')
median(w[lower.tri(w, diag=TRUE)]) / 2
wilcox.test(x, conf.int=TRUE)

pairUpDiff

Description

Pair-up and Compute Differences

Usage

pairUpDiff(
  x,
  major = NULL,
  minor = NULL,
  group,
  refgroup,
  lower = NULL,
  upper = NULL,
  minkeep = NULL,
  sortdiff = TRUE,
  conf.int = 0.95
)

Arguments

x

a numeric vector

major

an optional factor or character vector

minor

an optional factor or character vector

group

a required factor or character vector with two levels

refgroup

a character string specifying which level of group is to be subtracted

lower

an optional numeric vector giving the lower conf.int confidence limit for x

upper

similar to lower but for the upper limit

minkeep

the minimum value of x required to keep the observation. An observation is kept if either group has x exceeding or equalling minkeep. Default is to keep all observations.

sortdiff

set to FALSE to avoid sorting observations by descending between-group differences

conf.int

confidence level; must have been the value used to compute lower and upper if they are provided

Details

This function sets up for plotting half-width confidence intervals for differences, sorting by descending order of differences within major categories, especially for dot charts as produced by dotchartpl(). Given a numeric vector x and a grouping (superpositioning) vector group with exactly two levels, it computes differences in possibly transformed x between levels of group for the two observations that are equal on major and minor. If lower and upper are specified, it uses conf.int and approximate normality on the transformed scale to backsolve for the standard errors of estimates, and uses approximate normality to get confidence intervals on differences, taking the standard error of a difference to be the square root of the sum of squares of the two standard errors. Coordinates for plotting half-width confidence intervals are also computed. These intervals may be plotted on the same scale as x, having the property that they overlap the two x values if and only if there is no "significant" difference at the conf.int level.
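The arithmetic described above can be traced by hand for a single pair of estimates. The following base R sketch uses made-up numbers and assumes the half-width interval is the midpoint plus or minus half of z times the standard error of the difference. That form is inferred from the overlap property stated above, not taken from the pairUpDiff source.

```r
conf.int <- 0.95
z <- qnorm((1 + conf.int) / 2)

# Made-up estimates and confidence limits for one major/minor combination
x     <- c(a = 10, b = 13)
lower <- c(a =  8, b = 11.5)
upper <- c(a = 12, b = 14.5)

se      <- (upper - lower) / (2 * z)         # backsolve the two standard errors
diff    <- unname(x["b"] - x["a"])           # 'a' plays the role of refgroup
sd_diff <- sqrt(sum(se^2))                   # standard error of the difference
ci_diff <- diff + c(-1, 1) * z * sd_diff     # confidence limits for the difference

mid  <- mean(x)                              # midpoint of the two estimates
half <- mid + c(-1, 1) * 0.5 * z * sd_diff   # assumed form of lowermid, uppermid

# The half-width interval covers both estimates exactly when the
# confidence interval for the difference includes zero
covers_both <- half[1] <= min(x) && max(x) <= half[2]
not_signif  <- ci_diff[1] <= 0 && 0 <= ci_diff[2]
c(covers_both = covers_both, not_signif = not_signif)
```

Both flags agree by construction, since each condition reduces to the absolute difference being no larger than z times sd_diff.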

Value

a list containing two data frames, X and D, both sorted by descending values of differences in x, plus an element dropped. The X object is a data frame that contains the original variables sorted by descending differences across group and in addition a variable subscripts denoting the subscripts of original observations with possible re-sorting and dropping depending on sortdiff and minkeep. The D data frame contains sorted differences (diff), major, minor, sd of difference, lower and upper confidence limits for the difference, mid, the midpoint of the two x values involved in the difference, lowermid, the midpoint minus 1/2 the width of the confidence interval, and uppermid, the midpoint plus 1/2 the width of the confidence interval. The dropped element is a vector of major / minor combinations dropped due to minkeep.

Author(s)

Frank Harrell

Examples

x <- c(1, 4, 7, 2, 5, 3, 6)
pairUpDiff(x, c(rep('A', 4), rep('B', 3)),
  c('u','u','v','v','z','z','q'),
  c('a','b','a','b','a','b','a'), 'a', x-.1, x+.1)

Box-Percentile Panel Function for Trellis

Description

For all their good points, box plots have a high ink/information ratio in that they mainly display 3 quartiles. Many practitioners have found that the "outer values" are difficult to explain to non-statisticians and many feel that the notion of "outliers" is too dependent on (false) expectations that data distributions should be Gaussian.

panel.bpplot is a panel function for use with trellis, especially for bwplot. It draws box plots (without the whiskers) with any number of user-specified "corners" (corresponding to different quantiles), but it also draws box-percentile plots similar to those drawn by Jeffrey Banfield's (umsfjban@bill.oscs.montana.edu) bpplot function. To quote from Banfield, "box-percentile plots supply more information about the univariate distributions. At any height the width of the irregular 'box' is proportional to the percentile of that height, up to the 50th percentile, and above the 50th percentile the width is proportional to 100 minus the percentile. Thus, the width at any given height is proportional to the percent of observations that are more extreme in that direction. As in boxplots, the median, 25th and 75th percentiles are marked with line segments across the box."
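The width rule quoted from Banfield can be written down directly. The base R sketch below uses a hypothetical helper, bp_width, built on the empirical CDF. It illustrates the rule only and is not the drawing code used by panel.bpplot.

```r
# Hypothetical helper, not part of Hmisc:
# relative width of a box-percentile plot at the given heights
bp_width <- function(x, at) {
  p <- ecdf(x)(at)   # fraction of observations at or below each height
  pmin(p, 1 - p)     # p up to the median, 1 - p above it
}

x <- 1:100
bp_width(x, c(10, 25, 50, 75, 90))   # widest at the median, symmetric in the tails

# "Corners" of an extended box plot: each probability and its mirror image
probs <- c(.05, .125, .25, .375)
quantile(x, sort(c(probs, 1 - probs)))
1 - 2 * probs        # fraction of the data between each pair of corners
```

With these probs the nested intervals contain 0.9, 0.75, 0.5, and 0.25 of the data, so a small set of corners is a coarse version of the full box-percentile outline.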

panel.bpplot can also be used with base graphics to add extended box plots to an existing plot, by specifying nogrid=TRUE, height=....

panel.bpplot is a generalization of bpplot and panel.bwplot in that it works with trellis (making the plots horizontal so that category labels are more visible), it allows the user to specify the quantiles to connect and those for which to draw reference lines, and it displays means (by default using dots).

bpplt draws horizontal box-percentile plots much like those drawn by panel.bpplot but taking as the starting point a matrix containing quantiles summarizing the data. bpplt is primarily intended to be used internally by plot.summary.formula.reverse or plot.summaryM but when used with no arguments has a general purpose: to draw an annotated example box-percentile plot with the default quantiles used and with the mean drawn with a solid dot. This schematic plot is rendered nicely in postscript with an image height of 3.5 inches.

bppltp is like bpplt but for plotly graphics, and it does not draw an annotated extended box plot example.

bpplotM uses the lattice bwplot function to depict multiple numeric continuous variables with varying scales in a single lattice graph, after reshaping the dataset into a tall and thin format.

Usage

panel.bpplot(x, y, box.ratio=1, means=TRUE, qref=c(.5,.25,.75),
             probs=c(.05,.125,.25,.375), nout=0,
             nloc=c('right lower', 'right', 'left', 'none'), cex.n=.7,
             datadensity=FALSE, scat1d.opts=NULL,
             violin=FALSE, violin.opts=NULL,
             font=box.dot$font, pch=box.dot$pch,
             cex.means=box.dot$cex, col=box.dot$col,
             nogrid=NULL, height=NULL, ...)

# E.g. bwplot(formula, panel=panel.bpplot, panel.bpplot.parameters)

bpplt(stats, xlim, xlab='', box.ratio = 1, means=TRUE,
      qref=c(.5,.25,.75), qomit=c(.025,.975),
      pch=16, cex.labels=par('cex'), cex.points=if(prototype)1 else 0.5,
      grid=FALSE)

bppltp(p=plotly::plot_ly(),
       stats, xlim, xlab='', box.ratio = 1, means=TRUE,
       qref=c(.5,.25,.75), qomit=c(.025,.975),
       teststat=NULL, showlegend=TRUE)

bpplotM(formula=NULL, groups=NULL, data=NULL, subset=NULL, na.action=NULL,
        qlim=0.01, xlim=NULL,
        nloc=c('right lower','right','left','none'),
        vnames=c('labels', 'names'), cex.n=.7, cex.strip=1,
        outerlabels=TRUE, ...)

Arguments

x

continuous variable whose distribution is to be examined

y

grouping variable

box.ratio

see panel.bwplot

means

set to FALSE to suppress drawing a character at the mean value

qref

vector of quantiles for which to draw reference lines. These do not need to be included in probs.

probs

vector of quantiles to display in the box plot. These should all be less than 0.5; the mirror-image quantiles are added automatically. By default, probs is set to c(.05,.125,.25,.375) so that intervals contain 0.9, 0.75, 0.5, and 0.25 of the data. To draw all 99 percentiles, i.e., to draw a box-percentile plot, set probs=seq(.01,.49,by=.01). To make a more traditional box plot, use probs=.25.

nout

tells the function to use scat1d to draw tick marks showing the nout smallest and nout largest values if nout >= 1, or to show all values less than the nout quantile or greater than the 1-nout quantile if 0 < nout <= 0.5. If nout is a whole number, only the first n/2 observations are shown on either side of the median, where n is the total number of observations.

nloc

location to plot number of non-NA observations next to each box. Specify nloc='none' to suppress. For panel.bpplot, the default nloc is 'none' if nogrid=TRUE.

cex.n

character size for nloc

datadensity

set to TRUE to invoke scat1d to draw a data density (one-dimensional scatter diagram or rug plot) inside each box plot.

scat1d.opts

a list containing named arguments (without abbreviations) to pass to scat1d when datadensity=TRUE or nout > 0

violin

set to TRUE to invoke panel.violin in addition to drawing box-percentile plots

violin.opts

a list of options to pass to panel.violin

cex.means

character size for dots representing means

font, pch, col

see panel.bwplot

nogrid

set to TRUE to use in base graphics

height

if nogrid=TRUE, specifies the height of the box in user y units

...

arguments passed to points or panel.bpplot or bwplot

stats, xlim, xlab, qomit, cex.labels, cex.points, grid

undocumented arguments to bpplt. For bpplotM, xlim is a list with elements named as the x-axis variables, to override the qlim calculations with user-specified x-axis limits for selected variables. Example: xlim=list(age=c(20,60)).

p

an already-started plotly object

teststat

an html expression containing a test statistic

showlegend

set to TRUE to have plotly include a legend. Not recommended when plotting more than one variable.

formula

a formula with continuous numeric analysis variables on the left hand side and stratification variables on the right. The first variable on the right is the one that will vary the fastest, forming the y-axis. formula may be omitted, in which case all numeric variables with more than 5 unique values in data will be analyzed. Or formula may be a vector of variable names in data to analyze. In the latter two cases (and only those cases), groups must be given, representing a character vector with names of stratification variables.

groups

see above

data

an optional data frame

subset

an optional subsetting expression or logical vector

na.action

specifies a function to possibly subset the data according to NAs (default is no such subsetting).

qlim

the outer quantiles to use for scaling each panel in bpplotM

vnames

default is to use variable label attributes when they exist, or use variable names otherwise. Specify vnames='names' to always use variable names for panel labels in bpplotM

cex.strip

character size for panel strip labels

outerlabels

if TRUE, pass the lattice graphics through the latticeExtra package's useOuterStrips function if there are two conditioning (paneling) variables, to put panel labels in outer margins.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com

References

Esty WW, Banfield J: The box-percentile plot. J Statistical Software 8 No. 17, 2003.

See Also

bpplot, panel.bwplot, scat1d, quantile, Ecdf, summaryP, useOuterStrips

Examples

set.seed(13)
x <- rnorm(1000)
g <- sample(1:6, 1000, replace=TRUE)
x[g==1][1:20] <- rnorm(20)+3   # contaminate 20 x's for group 1

# default trellis box plot
require(lattice)
bwplot(g ~ x)

# box-percentile plot with data density (rug plot)
bwplot(g ~ x, panel=panel.bpplot, probs=seq(.01,.49,by=.01), datadensity=TRUE)
# add ,scat1d.opts=list(tfrac=1) to make all tick marks the same size
# when a group has > 125 observations

# small dot for means, show only .05,.125,.25,.375,.625,.75,.875,.95 quantiles
bwplot(g ~ x, panel=panel.bpplot, cex.means=.3)

# suppress means and reference lines for lower and upper quartiles
bwplot(g ~ x, panel=panel.bpplot, probs=c(.025,.1,.25), means=FALSE, qref=FALSE)

# continuous plot up until quartiles ("Tootsie Roll plot")
bwplot(g ~ x, panel=panel.bpplot, probs=seq(.01,.25,by=.01))

# start at quartiles then make it continuous ("coffin plot")
bwplot(g ~ x, panel=panel.bpplot, probs=seq(.25,.49,by=.01))

# same as previous but add a spike to give 0.95 interval
bwplot(g ~ x, panel=panel.bpplot, probs=c(.025,seq(.25,.49,by=.01)))

# decile plot with reference lines at outer quintiles and median
bwplot(g ~ x, panel=panel.bpplot, probs=c(.1,.2,.3,.4), qref=c(.5,.2,.8))

# default plot with tick marks showing all observations outside the outer
# box (.05 and .95 quantiles), with very small ticks
bwplot(g ~ x, panel=panel.bpplot, nout=.05, scat1d.opts=list(frac=.01))

# show 5 smallest and 5 largest observations
bwplot(g ~ x, panel=panel.bpplot, nout=5)

# Use a scat1d option (preserve=TRUE) to ensure that the right peak extends
# to the same position as the extreme scat1d
bwplot(~x , panel=panel.bpplot, probs=seq(.00,.5,by=.001),
       datadensity=TRUE, scat1d.opt=list(preserve=TRUE))

# Add an extended box plot to an existing base graphics plot
plot(x, 1:length(x))
panel.bpplot(x, 1070, nogrid=TRUE, pch=19, height=15, cex.means=.5)

# Draw a prototype showing how to interpret the plots
bpplt()

# Example for bpplotM
set.seed(1)
n <- 800
d <- data.frame(treatment=sample(c('a','b'), n, TRUE),
                sex=sample(c('female','male'), n, TRUE),
                age=rnorm(n, 40, 10),
                bp =rnorm(n, 120, 12),
                wt =rnorm(n, 190, 30))
label(d$bp) <- 'Systolic Blood Pressure'
units(d$bp) <- 'mmHg'
bpplotM(age + bp + wt ~ treatment, data=d)
bpplotM(age + bp + wt ~ treatment * sex, data=d, cex.strip=.8)
bpplotM(age + bp + wt ~ treatment*sex, data=d,
        violin=TRUE,
        violin.opts=list(col=adjustcolor('blue', alpha.f=.15),
                         border=FALSE))

bpplotM(c('age', 'bp', 'wt'), groups='treatment', data=d)
# Can use Hmisc Cs function, e.g. Cs(age, bp, wt)
bpplotM(age + bp + wt ~ treatment, data=d, nloc='left')

# Without treatment: bpplotM(age + bp + wt ~ 1, data=d)

## Not run: 
# Automatically find all variables that appear to be continuous
getHdata(support)
bpplotM(data=support, group='dzgroup',
        cex.strip=.4, cex.means=.3, cex.n=.45)

# Separate displays for categorical vs. continuous baseline variables
getHdata(pbc)
pbc <- upData(pbc, moveUnits=TRUE)

s <- summaryM(stage + sex + spiders ~ drug, data=pbc)
plot(s)
Key(0, .5)
s <- summaryP(stage + sex + spiders ~ drug, data=pbc)
plot(s, val ~ freq | var, groups='drug', pch=1:3, col=1:3,
     key=list(x=.6, y=.8))

bpplotM(bili + albumin + protime + age ~ drug, data=pbc)

## End(Not run)

Partitions an object into different sets

Description

Partitions an object into subsets of length defined in the sep argument.

Usage

partition.vector(x, sep, ...)
partition.matrix(x, rowsep, colsep, ...)

Arguments

x

object to be partitioned.

sep

determines how many elements should go into each set. The sum of sep should be equal to the length of x.

rowsep

determines how many rows should go into each set. The sum of rowsep must equal the number of rows in x.

colsep

determines how many columns should go into each set. The sum of colsep must equal the number of columns in x.

...

arguments used in other methods of partition.

Value

A list with the same length as sep containing the partitioned objects.
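The vector case can be imitated with split from base R. The helper name partition_by_sizes below is hypothetical, and the sketch does not claim to match the names or attributes that partition.vector attaches to its result.

```r
# Hypothetical helper, not part of Hmisc
partition_by_sizes <- function(x, sep) {
  stopifnot(sum(sep) == length(x))
  # One group index per element; the factor levels keep empty sets when sep has zeros
  g <- factor(rep(seq_along(sep), times = sep), levels = seq_along(sep))
  unname(split(x, g))
}

partition_by_sizes(1:7, sep = c(1, 3, 2, 1))
```

The call returns a list of four pieces holding 1, 2:4, 5:6, and 7, one piece per element of sep.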

Author(s)

Charles Dupont

See Also

split

Examples

a <- 1:7
partition.vector(a, sep=c(1,3,2,1))

First Principal Component

Description

Given a numeric matrix which may or may not contain NAs, pc1 standardizes the columns to have mean 0 and variance 1 and computes the first principal component using prcomp. The proportion of variance explained by this component is printed, and so are the coefficients of the original (not scaled) variables. These coefficients may be applied to the raw data to obtain the first PC.
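The relation between the standardized PCA and coefficients on the raw variables can be sketched with prcomp alone. This is an illustration for data without NAs. Whether the "coef" attribute returned by pc1 is laid out the same way (for example, whether it includes an intercept) is an assumption not confirmed here.

```r
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100)
x  <- cbind(x1, x2)

p    <- prcomp(x, center = TRUE, scale. = TRUE)   # standardize columns, then PCA
pc   <- p$x[, 1]                                  # first principal component
prop <- p$sdev[1]^2 / sum(p$sdev^2)               # proportion of variance explained

# Coefficients on the original (unscaled) variables, plus an intercept
b         <- p$rotation[, 1] / p$scale
intercept <- -sum(b * p$center)
pc_raw    <- as.vector(intercept + x %*% b)
all.equal(pc_raw, unname(pc))                     # TRUE

# Optional rescaling to [0, hi], mirroring the role of the hi argument
hi <- 10
pc_scaled <- hi * (pc - min(pc)) / diff(range(pc))
range(pc_scaled)
```

Dividing each loading by the column standard deviation and absorbing the centering into an intercept is what lets the coefficients be applied to unstandardized data.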

Usage

pc1(x, hi)

Arguments

x

numeric matrix

hi

if specified, the first PC is scaled so that its maximum value is hi and its minimum value is zero

Value

The vector of observations with the first PC. An attribute "coef" is attached to this vector. "coef" contains the raw-variable coefficients.

Author(s)

Frank Harrell

See Also

prcomp

Examples

set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100)
w <- pc1(cbind(x1,x2))
attr(w,'coef')

plot.princmp

Description

Plot Method for princmp

Usage

## S3 method for class 'princmp'
plot(
  x,
  which = c("scree", "loadings"),
  k = x$k,
  offset = 0.8,
  col = 1,
  adj = 0,
  ylim = NULL,
  add = FALSE,
  abbrev = 25,
  nrow = NULL,
  ...
)

Arguments

x

results of 'princmp'

which

'scree' or 'loadings'

k

number of components to show, default is 'k' specified to 'princmp'

offset

controls positioning of text labels for cumulative fraction of variance explained

col

color of plotted text in scree plot

adj

angle for plotting text in scree plot

ylim

y-axis scree plotting limits, a 2-vector

add

set to 'TRUE' to add a line to an existing scree plot without drawing axes

abbrev

an integer specifying the variable name length above which names are passed through abbreviate(..., minlength=abbrev)

nrow

number of rows to use in plotting loadings. Defaults to the 'ggplot2' 'facet_wrap' default.

...

unused

Details

By default, uses base graphics to draw the scree plot from a princmp() result, showing the cumulative proportion of variance explained. Alternatively, the standardized PC loadings are shown in a 'ggplot2' bar chart.

Value

'ggplot2' object if which='loadings'

Author(s)

Frank Harrell


plotCorrM

Description

Plot Correlation Matrix and Correlation vs. Time Gap

Usage

plotCorrM(
  r,
  what = c("plots", "data"),
  type = c("rectangle", "circle"),
  xlab = "",
  ylab = "",
  maxsize = 12,
  xangle = 0
)

Arguments

r

correlation matrix

what

specifies whether to return plots or the data frame used in making the plots

type

specifies whether to use bottom-aligned rectangles (the default) or centered circles

xlab

x-axis label for correlation matrix

ylab

y-axis label for correlation matrix

maxsize

maximum circle size if type='circle'

xangle

angle for placing x-axis labels, defaulting to 0. Consider using xangle=45 when labels are long.

Details

Constructs two ggplot2 graphics. The first is a half matrix of rectangles where the height of the rectangle is proportional to the absolute value of the correlation coefficient, with positive and negative coefficients shown in different colors. The second graphic is a variogram-like graph of correlation coefficients on the y-axis and absolute time gap on the x-axis, with a loess smoother added. The times are obtained from the correlation matrix's row and column names if these are numeric. If any names are not numeric, the times are taken as the integers 1, 2, 3, ... The two graphics are ggplotly-ready if you use plotly::ggplotly(..., tooltip='label').
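The data behind the second, variogram-like graphic can be assembled in base R. The helper corr_gap_data and its column names are hypothetical. The data frame returned by plotCorrM(r, what='data') may be organized differently.

```r
# Hypothetical helper, not part of Hmisc:
# one row per pair of variables, with the correlation and the absolute time gap
corr_gap_data <- function(r) {
  nm    <- rownames(r)
  times <- suppressWarnings(as.numeric(nm))
  # Fall back to 1, 2, 3, ... when names are missing or not all numeric
  if (is.null(nm) || anyNA(times)) times <- seq_len(nrow(r))
  idx <- which(lower.tri(r), arr.ind = TRUE)   # positions below the diagonal
  i <- unname(idx[, "row"])
  j <- unname(idx[, "col"])
  data.frame(row = i, col = j,
             r   = r[cbind(i, j)],
             gap = abs(times[i] - times[j]))
}

set.seed(1)
r <- cor(matrix(rnorm(100), ncol = 10))
d <- corr_gap_data(r)
head(d)
# A smoother over d$gap and d$r, e.g. lowess(d$gap, d$r), gives the trend line
```

Only the lower triangle is used because a correlation matrix is symmetric and its diagonal carries no information about time gaps.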

Value

a list containing two ggplot2 objects if what='plots', or a data frame if what='data'

Author(s)

Frank Harrell

Examples

set.seed(1)
r <- cor(matrix(rnorm(100), ncol=10))
g <- plotCorrM(r)
g[[1]]  # plot matrix
g[[2]]  # plot correlation vs gap time
# ggplotlyr(g[[2]])
# ggplotlyr uses ggplotly with tooltip='label' then removes
# txt: from hover text

Plot Precision of Estimate of Pearson Correlation Coefficient

Description

This function plots the precision (margin of error) of the product-moment linear correlation coefficient r vs. sample size, for a given vector of correlation coefficients rho. Precision is defined as the larger of the upper confidence limit minus rho and rho minus the lower confidence limit. labcurve is used to automatically label the curves.
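The definition of precision can be made concrete with a small base R sketch. It assumes confidence limits based on Fisher's z transformation, which the text above does not specify, and the helper name corr_precision is hypothetical.

```r
# Hypothetical helper, not part of Hmisc.
# Assumes Fisher z-based confidence limits for the correlation coefficient.
corr_precision <- function(rho, n, conf.int = 0.95) {
  zcrit <- qnorm((1 + conf.int) / 2)
  z  <- atanh(rho)                       # Fisher's z transformation
  lo <- tanh(z - zcrit / sqrt(n - 3))    # lower confidence limit
  hi <- tanh(z + zcrit / sqrt(n - 3))    # upper confidence limit
  pmax(hi - rho, rho - lo)               # larger of the two distances from rho
}

n <- c(10, 50, 100, 400)
round(cbind(n,
            rho0   = corr_precision(0,   n),
            rho0.5 = corr_precision(0.5, n)), 3)
```

Because the interval is asymmetric around a nonzero rho, taking the larger of the two distances gives a conservative margin of error.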

Usage

plotCorrPrecision(rho = c(0, 0.5), n = seq(10, 400, length.out = 100),
                  conf.int = 0.95, offset=0.025, ...)

Arguments

rho

single or vector of true correlations. A worst-case precision graph results from rho=0

n

vector of sample sizes to use on the x-axis

conf.int

confidence coefficient; default uses 0.95 confidence limits

offset

see labcurve

...

other arguments to labcurve

Author(s)

Xing Wang and Frank Harrell

See Also

rcorr, cor, cor.test

Examples

plotCorrPrecision()
plotCorrPrecision(rho=0)

plotly Multiple

Description

Generates multiple plotly graphics, driven by specs in a data frame

Usage

plotlyM(
  data,
  x = ~x,
  y = ~y,
  xhi = ~xhi,
  yhi = ~yhi,
  htext = NULL,
  multplot = NULL,
  strata = NULL,
  fitter = NULL,
  color = NULL,
  size = NULL,
  showpts = !length(fitter),
  rotate = FALSE,
  xlab = NULL,
  ylab = NULL,
  ylabpos = c("top", "y"),
  xlim = NULL,
  ylim = NULL,
  shareX = TRUE,
  shareY = FALSE,
  height = NULL,
  width = NULL,
  nrows = NULL,
  ncols = NULL,
  colors = NULL,
  alphaSegments = 1,
  alphaCline = 0.3,
  digits = 4,
  zeroline = TRUE
)

Arguments

data

input data frame

x

formula specifying the x-axis variable

y

formula for y-axis variable

xhi

formula for upper x variable limits (x taken to be lower value)

yhi

formula for upper y variable limit (y taken to be lower value)

htext

formula for hovertext variable

multplot

formula specifying a variable in data that when stratified on produces a separate plot

strata

formula specifying an optional stratification variable

fitter

a fitting function such as loess that comes with a predict method. Alternatively specify fitter='ecdf' to use an internal function for computing and displaying ECDFs, which moves the analysis variable from the y-axis to the x-axis

color

plotly formula specifying a color variable or e.g. ~ I('black'). To keep colors constant over multiple plots you will need to specify an AsIs color when you don't have a variable representing color groups.

size

plotly formula specifying a symbol size variable or AsIs

showpts

if fitter is given, set to TRUE to show raw data points in addition to smooth fits

rotate

set to TRUE to reverse the roles of x and y, for example to get horizontal dot charts with error bars

xlab

x-axis label. May contain html.

ylab

a named vector of y-axis labels, possibly containing html (see example below). The names of the vector must correspond to levels of the multplot variable. ylab can be unnamed if multplot is not used.

ylabpos

position of y-axis labels. Default is on top left of plot. Specify ylabpos='y' for usual y-axis placement.

xlim

2-vector of x-axis limits, optional

ylim

2-vector of y-axis limits, optional

shareX

specifies whether x-axes should be shared when they align vertically over multiple plots

shareY

specifies whether y-axes should be shared when they align horizontally over multiple plots

height

height of the combined image in pixels

width

width of the combined image in pixels

nrows

the number of rows to produce using subplot

ncols

the number of columns to produce using subplot (specify at most one of nrows, ncols)

colors

the color palette. Leave unspecified to use the default plotly palette

alphaSegments

alpha transparency for line segments (when xhi or yhi is not NA)

alphaCline

alpha transparency for lines used to connect points

digits

number of significant digits to use in constructing hovertext

zeroline

set to FALSE to suppress vertical line at x=0

Details

Generates multiple plotly traces and combines them with plotly::subplot. The traces are controlled by specifications in data frame data plus various arguments. data must contain these variables: x, y, and tracename (if color is not an "AsIs" color such as ~ I('black')), and can contain these optional variables: xhi, yhi (rows containing NA for both xhi and yhi represent points, and those with non-NA xhi or yhi represent segments), connect (set to TRUE for rows for points, to connect the symbols), legendgroup (see plotly documentation), and htext (hovertext). If the color argument is given and it is not an "AsIs" color, the variable named in the color formula must also be in data. Likewise for size. If multplot is given, the variable given in the formula must be in data. If strata is present, another level of separate plots is generated by levels of strata, within levels of multplot.

If fitter is specified, x,y coordinates for an individual plot are run through fitter, and a line plot is made instead of showing data points. Alternatively you can specify fitter='ecdf' to compute and plot empirical cumulative distribution functions.

Value

plotly object produced by subplot

Author(s)

Frank Harrell

Examples

## Not run: 
set.seed(1)
pts     <- expand.grid(v=c('y1', 'y2', 'y3'), x=1:4, g=c('a', 'b'), yhi=NA,
                       tracename='mean', legendgroup='mean',
                       connect=TRUE, size=4)
pts$y   <- round(runif(nrow(pts)), 2)
segs     <- expand.grid(v=c('y1', 'y2', 'y3'), x=1:4, g=c('a', 'b'),
                        tracename='limits', legendgroup='limits',
                        connect=NA, size=6)
segs$y   <- runif(nrow(pts))
segs$yhi <- segs$y + runif(nrow(pts), .05, .15)
z <- rbind(pts, segs)
xlab <- labelPlotmath('X<sub>12</sub>', 'm/sec<sup>2</sup>', html=TRUE)
ylab <- c(y1=labelPlotmath('Y1', 'cm', html=TRUE),
          y2='Y2',
          y3=labelPlotmath('Y3', 'mm', html=TRUE))
W=plotlyM(z, multplot=~v, color=~g, xlab=xlab, ylab=ylab, ncols=2,
          colors=c('black', 'blue'))
W2=plotlyM(z, multplot=~v, color=~I('black'), xlab=xlab, ylab=ylab,
           colors=c('black', 'blue'))
## End(Not run)

Plot smoothed estimates

Description

Plot smoothed estimates of x vs. y, handling missing data for lowess or supsmu, and adding axis labels. Optionally suppresses plotting extrapolated estimates. An optional group variable can be specified to compute and plot the smooth curves by levels of group. When group is present, the datadensity option will draw tick marks showing the location of the raw x-values, separately for each curve. plsmo has an option to plot connected points for raw data, with no smoothing. The non-panel version of plsmo allows y to be a matrix, for which smoothing is done separately over its columns. If both group and multi-column y are used, the number of curves plotted is the product of the number of groups and the number of y columns.

method='intervals' is often used when y is binary, as it may be tricky to specify a reasonable smoothing parameter to lowess or supsmu in this case. The 'intervals' method uses the cutGn function to form intervals of x containing a minimum of mobs observations. For each interval the ifun function summarizes y, with the default being the mean (proportions for binary y). The results are plotted as step functions, with vertical discontinuities drawn with a saturation of 0.15 of the original color. A plus sign is drawn at the mean x within each interval. For this approach, the default x-range is the entire raw data range, and trim and evaluate are ignored. For panel.plsmo it is best to specify type='l' when using 'intervals'.
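The interval summaries can be imitated in base R. The helper interval_means below is hypothetical and uses a deliberately simple grouping rule (consecutive runs of sorted x, with any short final run merged into the previous one). It is not how cutGn forms its intervals, and it ignores ties in x.

```r
# Hypothetical helper, not part of Hmisc
interval_means <- function(x, y, mobs = 30, ifun = mean) {
  ok <- !is.na(x) & !is.na(y)
  x <- x[ok]; y <- y[ok]
  o <- order(x); x <- x[o]; y <- y[o]
  n <- length(x)
  # Consecutive groups of mobs sorted observations; leftovers join the last full group
  g <- pmin((seq_len(n) - 1) %/% mobs + 1, max(1, n %/% mobs))
  data.frame(xmean    = as.vector(tapply(x, g, mean)),   # where a plus sign would go
             ysummary = as.vector(tapply(y, g, ifun)),   # step height for the interval
             n        = as.vector(table(g)))
}

set.seed(1)
x <- runif(100)
y <- rbinom(100, 1, x)          # binary outcome whose probability rises with x
interval_means(x, y, mobs = 30)
```

With a binary y the ysummary column holds the proportion of ones in each interval, which is the quantity a step-function display would show.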

panel.plsmo is a panel function for trellis for the xyplot function that uses plsmo and its options to draw one or more nonparametric function estimates on each panel. This has advantages over using xyplot with panel.xyplot and panel.loess: (1) by default it will invoke labcurve to label the curves where they are most separated, (2) the datadensity option will put rug plots on each curve (instead of a single rug plot at the bottom of the graph), and (3) when panel.plsmo invokes plsmo it can use the "super smoother" (supsmu function) instead of lowess, or pass method='intervals'. panel.plsmo senses when a group variable is specified to xyplot so that it can invoke panel.superpose instead of panel.xyplot. Using panel.plsmo through trellis has some advantages over calling plsmo directly in that conditioning variables are allowed and trellis uses nicer fonts etc.

When a group variable is used, panel.plsmo creates a function Key in the session frame that the user can invoke to draw a key for individual data point symbols used for the groups. By default, the key is positioned at the upper right corner of the graph. If Key(locator(1)) is specified, the key will appear so that its upper left corner is at the coordinates of the mouse click.

For ggplot2 graphics the counterparts are stat_plsmo and histSpikeg.

Usage

plsmo(x, y, method=c("lowess","supsmu","raw","intervals"), xlab, ylab,
      add=FALSE, lty=1 : lc, col=par("col"), lwd=par("lwd"),
      iter=if(length(unique(y))>2) 3 else 0, bass=0, f=2/3, mobs=30, trim,
      fun, ifun=mean, group, prefix, xlim, ylim,
      label.curves=TRUE, datadensity=FALSE, scat1d.opts=NULL,
      lines.=TRUE, subset=TRUE,
      grid=FALSE, evaluate=NULL, ...)

#To use panel function:
#xyplot(formula=y ~ x | conditioningvars, groups,
#       panel=panel.plsmo, type='b',
#       label.curves=TRUE,
#       lwd = superpose.line$lwd,
#       lty = superpose.line$lty,
#       pch = superpose.symbol$pch,
#       cex = superpose.symbol$cex,
#       font = superpose.symbol$font,
#       col = NULL, scat1d.opts=NULL, ...)

Arguments

x

vector of x-values, NAs allowed

y

vector or matrix of y-values, NAs allowed

method

"lowess" (the default),"supsmu","raw" to notsmooth at all, or"intervals" to use intervals (see above)

xlab

x-axis label iff add=F. Defaults of label(x) or argument name.

ylab

y-axis label, like xlab.

add

Set to T to call lines instead of plot. Assumes axes already labeled.

lty

line type, default=1,2,3,..., corresponding to columns ofy andgroup combinations

col

color for each curve, corresponding togroup. Default iscurrentpar("col").

lwd

vector of line widths for the curves, corresponding togroup.Default is currentpar("lwd").lwd can also be specified as an element oflabel.curves iflabel.curves is a list.

iter

iter parameter ifmethod="lowess", default=0 ify is binary, and 3 otherwise.

bass

bass parameter ifmethod="supsmu", default=0.

f

passed to thelowess function, formethod="lowess"

mobs

formethod='intervals', the minimum number ofobservations per interval

trim

only plots smoothed estimates between trim and 1-trim quantiles of x. Default is to use 10th smallest to 10th largest x in the group if the number of observations in the group exceeds 200 (0 otherwise). Specify trim=0 to plot over entire range.

fun

after computing the smoothed estimates, if fun is given the y-values are transformed by fun()

ifun

a summary statistic function to apply to the y-variable for method='intervals'. Default is mean.

group

a variable, either a factor vector or one that will be converted to factor by plsmo, that is used to stratify the data so that separate smooths may be computed

prefix

a character string to appear in front of group labels. The presence of prefix ensures that labcurve will be called even when add=TRUE.

xlim

a vector of 2 x-axis limits. Default is observed range.

ylim

a vector of 2 y-axis limits. Default is observed range.

label.curves

set to FALSE to prevent labcurve from being called to label multiple curves corresponding to groups. Set to a list to pass options to labcurve. lty and col are passed to labcurve automatically.

datadensity

set to TRUE to draw tick marks on each curve, using x-coordinates of the raw data x values. This is done using scat1d.

scat1d.opts

a list of options to hand to scat1d

lines.

set to FALSE to suppress smoothed curves from being drawn. This can make sense if datadensity=TRUE.

subset

a logical or integer vector specifying a subset to use for processing, with respect to all variables being analyzed

grid

set to TRUE if the R grid package drew the current plot

evaluate

number of points to keep from smoother. If specified, an equally-spaced grid of evaluate x values will be obtained from the smoother using linear interpolation. This will keep from plotting an enormous number of points if the dataset contains a very large number of unique x values.

...

optional arguments that are passed to scat1d, or optional parameters to pass to plsmo from panel.plsmo. See optional arguments for plsmo above.

type

set to p to have panel.plsmo plot points (and not call plsmo), l to call plsmo and not plot points, or use the default b to plot both.

pch,cex,font

vectors of graphical parameters corresponding to the groups (scalars if group is absent). By default, the parameters set up by trellis will be used.

Value

plsmo returns a list of curves (x and y coordinates) that was passed to labcurve

Side Effects

plots, and panel.plsmo creates the Key function in the session frame.

See Also

lowess, supsmu, label, quantile, labcurve, scat1d, xyplot, panel.superpose, panel.xyplot, stat_plsmo, histSpikeg, cutGn

Examples

set.seed(1)
x <- 1:100
y <- x + runif(100, -10, 10)
plsmo(x, y, "supsmu", xlab="Time of Entry")
#Use label(y) or "y" for ylab

plsmo(x, y, add=TRUE, lty=2)
#Add lowess smooth to existing plot, with different line type

age <- rnorm(500, 50, 15)
survival.time <- rexp(500)
sex <- sample(c('female','male'), 500, TRUE)
race <- sample(c('black','non-black'), 500, TRUE)

plsmo(age, survival.time < 1, fun=qlogis, group=sex) # plot logit by sex

#Bivariate Y
sbp <- 120 + (age - 50)/10 + rnorm(500, 0, 8) + 5 * (sex == 'male')
dbp <-  80 + (age - 50)/10 + rnorm(500, 0, 8) - 5 * (sex == 'male')
Y <- cbind(sbp, dbp)
plsmo(age, Y)
plsmo(age, Y, group=sex)

#Plot points and smooth trend line using trellis
# (add type='l' to suppress points or type='p' to suppress trend lines)
require(lattice)
xyplot(survival.time ~ age, panel=panel.plsmo)

#Do this for multiple panels
xyplot(survival.time ~ age | sex, panel=panel.plsmo)

#Repeat this using equal sample size intervals (n=25 each) summarized by
#the median, then a proportion (mean of binary y)
xyplot(survival.time ~ age | sex, panel=panel.plsmo, type='l',
       method='intervals', mobs=25, ifun=median)
ybinary <- ifelse(runif(length(sex)) < 0.5, 1, 0)
xyplot(ybinary ~ age, groups=sex, panel=panel.plsmo, type='l',
       method='intervals', mobs=75, ifun=mean, xlim=c(0, 120))

#Do this for subgroups of points on each panel, show the data
#density on each curve, and draw a key at the default location
xyplot(survival.time ~ age | sex, groups=race, panel=panel.plsmo,
       datadensity=TRUE)
Key()

#Use wtd.loess.noiter to do a fast weighted smooth
plot(x, y)
lines(wtd.loess.noiter(x, y))
lines(wtd.loess.noiter(x, y, weights=c(rep(1,50), 100, rep(1,49))), col=2)
points(51, y[51], pch=18)   # show overly weighted point
#Try to duplicate this smooth by replicating 51st observation 100 times
lines(wtd.loess.noiter(c(x,rep(x[51],99)),c(y,rep(y[51],99)),
      type='ordered all'), col=3)
#Note: These two don't agree exactly
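
The evaluate argument is described as thinning the smoother's output onto an equally-spaced grid by linear interpolation. The following is a minimal base R sketch of that idea only; it is not plsmo's internal code. It calls stats::lowess and stats::approx directly, and the names sm, n_keep, grid_x, and thin are illustrative assumptions.

```r
set.seed(1)
x <- 1:100
y <- x + runif(100, -10, 10)

# Smooth with the same f and iter values that plsmo lists as defaults
# for a non-binary y (f=2/3, iter=3)
sm <- lowess(x, y, f = 2/3, iter = 3)

# Keep only n_keep points (hypothetical name), equally spaced in x
n_keep <- 20
grid_x <- seq(min(sm$x), max(sm$x), length.out = n_keep)

# Linear interpolation of the smooth onto the grid
thin <- approx(sm$x, sm$y, xout = grid_x)

length(thin$x)   # number of points that would be plotted
```

With a very large number of unique x values, plotting thin instead of sm draws far fewer line segments while following the same curve.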

Power and Sample Size for Ordinal Response

Description

popower computes the power for a two-tailed two sample comparison of ordinal outcomes under the proportional odds ordinal logistic model. The power is the same as that of the Wilcoxon test but with ties handled properly. posamsize computes the total sample size needed to achieve a given power. Both functions compute the efficiency of the design compared with a design in which the response variable is continuous. print methods exist for both functions. Any of the input arguments may be vectors, in which case a vector of powers or sample sizes is returned. These functions use the methods of Whitehead (1993).

pomodm is a function that assists in translating odds ratios to differences in mean or median on the original scale.

simPOcuts simulates simple unadjusted two-group comparisons under a PO model to demonstrate the natural sampling variability that causes estimated odds ratios to vary over cutoffs of Y.

propsPO uses ggplot2 to plot a stacked bar chart of proportions stratified by a grouping variable (and optionally a stratification variable), with an optional additional graph showing what the proportions would be had proportional odds held and an odds ratio was applied to the proportions in a reference group. If the result is passed to ggplotly, customized tooltip hover text will appear.

propsTrans uses ggplot2 to plot all successive transition proportions. formula has the state variable on the left hand side, the first right-hand variable is time, and the second right-hand variable is a subject ID variable.

multEventChart uses ggplot2 to plot event charts showing state transitions, accounting for absorbing states/events. It is based on code written by Lucy D'Agostino McGowan posted at https://livefreeordichotomize.com/posts/2020-05-21-survival-model-detective-1/.

Usage

popower(p, odds.ratio, n, n1, n2, alpha=0.05)
## S3 method for class 'popower'
print(x, ...)
posamsize(p, odds.ratio, fraction=.5, alpha=0.05, power=0.8)
## S3 method for class 'posamsize'
print(x, ...)
pomodm(x=NULL, p, odds.ratio=1)
simPOcuts(n, nsim=10, odds.ratio=1, p)
propsPO(formula, odds.ratio=NULL, ref=NULL, data=NULL, ncol=NULL, nrow=NULL )
propsTrans(formula, data=NULL, labels=NULL, arrow='\u2794',
           maxsize=12, ncol=NULL, nrow=NULL)
multEventChart(formula, data=NULL, absorb=NULL, sortbylast=FALSE,
   colorTitle=label(y), eventTitle='Event',
   palette='OrRd',
   eventSymbols=c(15, 5, 1:4, 6:10),
   timeInc=min(diff(unique(x))/2))

Arguments

p

a vector of marginal cell probabilities which must add up to one. For popower and posamsize, the ith element specifies the probability that a patient will be in response level i, averaged over the two treatment groups. For pomodm and simPOcuts, p is the vector of cell probabilities to be translated under a given odds ratio. For simPOcuts, if p has names, those names are taken as the ordered distinct Y-values. Otherwise Y-values are taken as the integers 1, 2, ... up to the length of p.

odds.ratio

the odds ratio to be able to detect. It doesn't matter which group is in the numerator. For propsPO, odds.ratio is a function of the grouping (right hand side) variable value. The value of the function specifies the odds ratio to apply to the reference group to get all other groups' expected proportions were proportional odds to hold against the first group. Normally the function should return 1.0 when its x argument corresponds to the ref group. For pomodm and simPOcuts, odds.ratio is the odds ratio to apply to convert the given cell probabilities.

n

total sample size for popower. You must specify either n or n1 and n2. If you specify n, n1 and n2 are set to n/2. For simPOcuts, n is a single number equal to the combined sample sizes of two groups.

n1

for popower, the number of subjects in treatment group 1

n2

for popower, the number of subjects in group 2

nsim

number of simulated studies to create by simPOcuts

alpha

type I error

x

an object created by popower or posamsize, or a vector of data values given to pomodm that corresponds to the vector p of probabilities. If x is omitted for pomodm, the odds.ratio will be applied and the new vector of individual probabilities will be returned. Otherwise if x is given to pomodm, a 2-vector with the mean and median x after applying the odds ratio is returned.

fraction

for posamsize, the fraction of subjects that will be allocated to group 1

power

for posamsize, the desired power (default is 0.8)

formula

an R formula expression for propsPO where the outcome categorical variable is on the left hand side and the grouping variable is on the right. It is assumed that the left hand variable is either already a factor or will have its levels in the right order for an ordinal model when it is converted to factor. For multEventChart the left hand variable is a categorical status variable, the first right hand side variable represents time, and the second right side variable is a unique subject ID. One line is produced per subject.

ref

for propsPO specifies the reference group (value of the right hand side formula variable) to use in computing proportions on which to translate proportions in other groups, under the proportional odds assumption.

data

a data frame or data.table

labels

for propsTrans is an optional character vector corresponding to y=1,2,3,... that is used to construct plotly hovertext as a label attribute in the ggplot2 aesthetic. Used when y is integer on axes but you want long labels in hovertext.

arrow

character to use as the arrow symbol for transitions in propsTrans. The default is the dingbats heavy wide-headed rightwards arrow.

nrow,ncol

see facet_wrap

maxsize

maximum symbol size

...

unused

absorb

character vector specifying the subset of levels of the left hand side variable that are absorbing states such as death or hospital discharge

sortbylast

set to TRUE to sort the subjects by the severity of the status at the last time point

colorTitle

label for legend for status

eventTitle

label for legend for absorb

palette

a single character string specifying the scale_fill_brewer color palette

eventSymbols

vector of symbol codes. Default for first two symbols is a solid square and an open diamond.

timeInc

time increment for the x-axis. Default is 1/2 the shortest gap between any two distinct times in the data.

Value

a list containing power, eff (relative efficiency), and approx.se (approximate standard error of log odds ratio) for popower, or containing n and eff for posamsize.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com

References

Whitehead J (1993): Sample size calculations for ordered categorical data. Stat in Med 12:2257–2271.

Julious SA, Campbell MJ (1996): Letter to the Editor. Stat in Med 15:1065–1066. Shows accuracy of formula for binary response case.

See Also

simRegOrd, bpower, cpower, impactPO

Examples

# For a study of back pain (none, mild, moderate, severe) here are the
# expected proportions (averaged over 2 treatments) that will be in
# each of the 4 categories:

p <- c(.1,.2,.4,.3)
popower(p, 1.2, 1000)   # OR=1.2, total n=1000
posamsize(p, 1.2)
popower(p, 1.2, 3148)
# If p was the vector of probabilities for group 1, here's how to
# compute the average over the two groups:
# p2   <- pomodm(p=p, odds.ratio=1.2)
# pavg <- (p + p2) / 2

# Compare power to test for proportions for binary case,
# proportion of events in control group of 0.1
p <- 0.1; or <- 0.85; n <- 4000
popower(c(1 - p, p), or, n)    # 0.338
bpower(p, odds.ratio=or, n=n)  # 0.320
# Add more categories, starting with 0.1 in middle
p <- c(.8, .1, .1)
popower(p, or, n)   # 0.543
p <- c(.7, .1, .1, .1)
popower(p, or, n)   # 0.67
# Continuous scale with final level have prob. 0.1
p <- c(rep(1 / n, 0.9 * n), 0.1)
popower(p, or, n)   # 0.843

# Compute the mean and median x after shifting the probability
# distribution by an odds ratio under the proportional odds model
x <- 1 : 5
p <- c(.05, .2, .2, .3, .25)
# For comparison make up a sample that looks like this
X <- rep(1 : 5, 20 * p)
c(mean=mean(X), median=median(X))
pomodm(x, p, odds.ratio=1)  # still have to figure out the right median
pomodm(x, p, odds.ratio=0.5)

# Show variation of odds ratios over possible cutoffs of Y even when PO
# truly holds.  Run 5 simulations for a total sample size of 300.
# The two groups have 150 subjects each.
s <- simPOcuts(300, nsim=5, odds.ratio=2, p=p)
round(s, 2)

# An ordinal outcome with levels a, b, c, d, e is measured at 3 times
# Show the proportion of values in each outcome category stratified by
# time.  Then compute what the proportions would be had the proportions
# at times 2 and 3 been the proportions at time 1 modified by two odds ratios

set.seed(1)
d   <- expand.grid(time=1:3, reps=1:30)
d$y <- sample(letters[1:5], nrow(d), replace=TRUE)
propsPO(y ~ time, data=d, odds.ratio=function(time) c(1, 2, 4)[time])
# To show with plotly, save previous result as object p and then:
# plotly::ggplotly(p, tooltip='label')

# Add a stratification variable and don't consider an odds ratio
d   <- expand.grid(time=1:5, sex=c('female', 'male'), reps=1:30)
d$y <- sample(letters[1:5], nrow(d), replace=TRUE)
propsPO(y ~ time + sex, data=d)  # may add nrow= or ncol=

# Show all successive transition proportion matrices
d   <- expand.grid(id=1:30, time=1:10)
d$state <- sample(LETTERS[1:4], nrow(d), replace=TRUE)
propsTrans(state ~ time + id, data=d)

pt1 <- data.frame(pt=1, day=0:3,
   status=c('well', 'well', 'sick', 'very sick'))
pt2 <- data.frame(pt=2, day=c(1,2,4,6),
   status=c('sick', 'very sick', 'coma', 'death'))
pt3 <- data.frame(pt=3, day=1:5,
   status=c('sick', 'very sick', 'sick', 'very sick', 'discharged'))
pt4 <- data.frame(pt=4, day=c(1:4, 10),
   status=c('well', 'sick', 'very sick', 'well', 'discharged'))
d <- rbind(pt1, pt2, pt3, pt4)
d$status <- factor(d$status, c('discharged', 'well', 'sick',
                               'very sick', 'coma', 'death'))
label(d$day) <- 'Day'
require(ggplot2)
multEventChart(status ~ day + pt, data=d,
               absorb=c('death', 'discharged'),
               colorTitle='Status', sortbylast=TRUE) +
               theme_classic() +
               theme(legend.position='bottom')
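
pomodm is described as applying an odds ratio to a vector of cell probabilities under the proportional odds model. The base R sketch below shows one way such a translation can be carried out; it is not taken from the Hmisc source. It assumes the odds ratio acts on the exceedance probabilities P(Y >= j), which is a convention chosen here for illustration, and the function name shift_probs_po is hypothetical.

```r
# Hypothetical helper: shift cell probabilities p by an odds ratio,
# assuming the odds ratio multiplies the odds of P(Y >= j) for every cutoff j
shift_probs_po <- function(p, odds.ratio) {
  stopifnot(abs(sum(p) - 1) < 1e-8)
  # Exceedance probabilities P(Y >= j) for j = 2, ..., length(p)
  ex <- rev(cumsum(rev(p)))[-1]
  # Apply the odds ratio on the logit scale
  ex_new <- plogis(qlogis(ex) + log(odds.ratio))
  # Convert back to individual cell probabilities
  -diff(c(1, ex_new, 0))
}

x <- 1 : 5
p <- c(.05, .2, .2, .3, .25)

p_same <- shift_probs_po(p, 1)     # odds ratio of 1 leaves p unchanged
p_half <- shift_probs_po(p, 0.5)   # mass moves toward lower categories

# Means on the original scale before and after the shift
c(before = sum(x * p), after = sum(x * p_half))
```

Under this convention an odds ratio below 1 lowers the mean of x, and an odds ratio above 1 raises it; with the opposite convention (odds of P(Y <= j)) the direction reverses.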

princmp

Description

Enhanced Output for Principal and Sparse Principal Components

Usage

princmp(
  formula,
  data = environment(formula),
  method = c("regular", "sparse"),
  k = min(5, p - 1),
  kapprox = min(5, k),
  cor = TRUE,
  sw = FALSE,
  nvmax = 5
)

Arguments

formula

a formula with no left hand side, or a numeric matrix

data

a data frame or table. By default variables come from the calling environment.

method

specifies whether regular or sparse principal components are computed

k

the number of components to plot, display, and return

kapprox

the number of components to approximate with stepwise regression when sw=TRUE

cor

set to FALSE to compute PCs on the original data scale, which is useful if all variables have the same units of measurement

sw

set to TRUE to run stepwise regression PC prediction/approximation

nvmax

maximum number of predictors to allow in stepwise regression PC approximations

Details

Expands any categorical predictors into indicator variables, and calls princomp (if method='regular' (the default)) or sPCAgrid in the pcaPP package (method='sparse') to compute lasso-penalized sparse principal components. By default all variables are first scaled by their standard deviation after observations with any NAs on any variables in formula are removed. Loadings of standardized variables, and if orig=TRUE loadings on the original data scale are printed. If pl=TRUE a scree plot is drawn with text added to indicate cumulative proportions of variance explained. If sw=TRUE, the leaps package regsubsets function is used to approximate the PCs using forward stepwise regression with the original variables as individual predictors.

A print method prints the results and a plot method plots the scree plot of variance explained.

Value

a list of class princmp with elements scores, a k-column matrix with principal component scores, with NAs when the input data had an NA, and other components useful for printing and plotting. If k=1 scores is a vector. Other components include vars (vector of variances explained), method, k.

Author(s)

Frank Harrell
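
The Details above describe the method='regular' path as expanding categorical predictors into indicators and calling princomp on standardized variables. The following base R sketch reproduces only that general idea with stats::princomp and model.matrix; it does not call princmp, and the data frame d and the names X, vars, and cumprop are made up for illustration.

```r
set.seed(1)
d <- data.frame(age = rnorm(50, 50, 10),
                sbp = rnorm(50, 120, 15),
                sex = factor(sample(c('female', 'male'), 50, TRUE)))

# Expand the categorical predictor into an indicator variable
# and drop the intercept column
X <- model.matrix(~ age + sbp + sex, data = d)[, -1]

# cor=TRUE works on the correlation matrix, i.e. on standardized variables
pc <- princomp(X, cor = TRUE)

vars    <- pc$sdev ^ 2                  # variance explained by each component
cumprop <- cumsum(vars) / sum(vars)     # cumulative proportion, as on a scree plot
round(cumprop, 3)

k <- 2
scores <- pc$scores[, 1 : k]            # first k principal component scores
```

Because the components are computed from a correlation matrix, the variances in vars sum to the number of columns of X.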


prints a list of lists in a visually readable format.

Description

Takes a list that is composed of other lists and matrices and prints it in a visually readable format.

Usage

## S3 method for class 'char.list'
print(x, ..., hsep = c("|"), vsep = c("-"), csep = c("+"), print.it = TRUE,
      rowname.halign = c("left", "centre", "right"),
      rowname.valign = c("top", "centre", "bottom"),
      colname.halign = c("centre", "left", "right"),
      colname.valign = c("centre", "top", "bottom"),
      text.halign = c("right", "centre", "left"),
      text.valign = c("top", "centre", "bottom"),
      rowname.width, rowname.height,
      min.colwidth = .Options$digits, max.rowheight = NULL,
      abbreviate.dimnames = TRUE, page.width = .Options$width,
      colname.width, colname.height, prefix.width,
      superprefix.width = prefix.width)

Arguments

x

list object to be printed

...

place for extra arguments to reside.

hsep

character used to separate horizontal fields

vsep

character used to separate vertical fields

csep

character used where horizontal and vertical separators meet.

print.it

should the value be printed to the console or returned as a string.

rowname.halign

horizontal justification of row names.

rowname.valign

vertical justification of row names.

colname.halign

horizontal justification of column names.

colname.valign

vertical justification of column names.

text.halign

horizontal justification of cell text.

text.valign

vertical justification of cell text.

rowname.width

minimum width of row name strings.

rowname.height

minimum height of row name strings.

min.colwidth

minimum column width.

max.rowheight

maximum row height.

abbreviate.dimnames

should the row and column names be abbreviated.

page.width

width of the page being printed on.

colname.width

minimum width of the column names.

colname.height

minimum height of the column names

prefix.width

maximum width of the rowname columns

superprefix.width

maximum width of the super rowname columns

Value

String containing the formatted table of the list object.

Author(s)

Charles Dupont


Function to print a matrix with stacked cells

Description

Prints a dataframe or matrix in stacked cells. Line break characters in a matrix element will result in a line break in that cell, but tab characters are not supported.

Usage

## S3 method for class 'char.matrix'
print(x, file = "", col.name.align = "cen", col.txt.align = "right",
      cell.align = "cen", hsep = "|", vsep = "-", csep = "+", row.names = TRUE,
      col.names = FALSE, append = FALSE,
      top.border = TRUE, left.border = TRUE, ...)

Arguments

x

a matrix or dataframe

file

name of file if file output is desired. If left empty,output will be to the screen

col.name.align

if column names are used, they can be aligned right, left or centre. Default "cen" results in names centred between the sides of the columns they name. If the width of the text in the columns is less than the width of the name, col.name.align will have no effect. Other options are "right" and "left".

col.txt.align

how character columns are aligned. Options are the same as for col.name.align with no effect when the width of the column is greater than its name.

cell.align

how numbers are displayed in columns

hsep

character string to use as horizontal separator, i.e. what separates columns

vsep

character string to use as vertical separator, i.e. what separates rows. Length cannot be more than one.

csep

character string to use where vertical and horizontal separators cross. If hsep is more than one character, csep will need to be the same length. There is no provision for multiple vertical separators

row.names

logical: are we printing the names of the rows?

col.names

logical: are we printing the names of the columns?

append

logical: if file is not "", are we appending to the file or overwriting?

top.border

logical: do we want a border along the top above the columns?

left.border

logical: do we want a border along the left of the first column?

...

unused

Details

If any column of x is a mixture of character and numeric, the distinction between character and numeric columns will be lost. This is especially so if the matrix is of a form where you would not want to print the column names, the column information being in the rows at the beginning of the matrix.

Row names, if not specified in the making of the matrix will simply be numbers. To prevent printing them, set row.names = FALSE.

Value

No value is returned. The matrix or dataframe will be printed to file or to the screen.

Author(s)

Patrick Connolly p.connolly@hortresearch.co.nz

See Also

write, write.table

Examples

data(HairEyeColor)
print.char.matrix(HairEyeColor[ , , "Male"], col.names = TRUE)
print.char.matrix(HairEyeColor[ , , "Female"], col.txt.align = "left", col.names = TRUE)

z <- rbind(c("", "N", "y"),
           c("[ 1.34,40.3)\n[40.30,48.5)\n[48.49,58.4)\n[58.44,87.8]",
             " 50\n 50\n 50\n 50",
             "0.530\n0.489\n0.514\n0.507"),
           c("female\nmale", " 94\n106", "0.552\n0.473"  ),
           c("", "200", "0.510"))
dimnames(z) <- list(c("", "age", "sex", "Overall"),NULL)

print.char.matrix(z)

print.princmp

Description

Print Results of princmp

Usage

## S3 method for class 'princmp'
print(x, which = c("none", "standardized", "original", "both"), k = x$k, ...)

Arguments

x

results of princmp

which

specifies which loadings to print, the default being 'none' and other values being 'standardized', 'original', or 'both'

k

number of components to show, defaults to k specified to princmp

...

unused

Details

Simple print method for princmp()

Value

nothing

Author(s)

Frank Harrell


printL

Description

Print an object or a named list of objects. When multiple objects are given, their names are printed before their contents. When an object is a vector that is not longer than maxoneline and its elements are not named, all the elements will be printed on one line separated by commas. When dec is given, numeric vectors or numeric columns of data frames or data tables are rounded to dec decimal places before printing. This function is especially helpful when printing objects in a Quarto or RMarkdown document when the code is not being shown to place the output in context.

Usage

printL(..., dec = NULL, maxoneline = 5)

Arguments

...

any number of objects to print()

dec

optional decimal places to the right of the decimal point for rounding

maxoneline

controls how many elements may be printed on a single line for vector objects

Value

nothing

Author(s)

Frank Harrell

See Also

prn()

Examples

w <- pi + 1 : 2
printL(w=w)
printL(w, dec=3)
printL('this is it'=c(pi, pi, 1, 2),
       yyy=pi,
       z=data.frame(x=pi+1:2, y=3:4, z=c('a', 'b')),
       qq=1:10,
       dec=4)

Print an Object with its Name

Description

Prints an object with its name and with an optional descriptive text string. This is useful for annotating analysis output files and for debugging.

Usage

prn(x, txt, file, head=deparse(substitute(x), width.cutoff=500)[1])

Arguments

x

any object

txt

optional text string

file

optional file name. By default, writes to console. append=TRUE is assumed.

head

optional heading. Default is derived from the user's expression for x

Side Effects

prints

See Also

print, cat, printL

Examples

x <- 1:5
prn(x)
# prn(fit, 'Full Model Fit')

Selectively Print Lines of a Text Vector

Description

Given one or two regular expressions or exact text matches, removes elements of the input vector that match these specifications. Omitted lines are replaced by .... This is useful for selectively suppressing some of the printed output of R functions such as regression fitting functions, especially in the context of making statistical reports using Sweave or Odfweave.

Usage

prselect(x, start = NULL, stop = NULL, i = 0, j = 0, pr = TRUE)

Arguments

x

input character vector

start

text or regular expression to look for starting line to omit. If omitted, deletions start at the first line.

stop

text or regular expression to look for ending line to omit. If omitted, deletions proceed until the last line.

i

increment in number of first line to delete after match is found

j

increment in number of last line to delete after match is found

pr

set to FALSE to suppress printing

Value

an invisible vector of retained lines of text

Author(s)

Frank Harrell

See Also

Sweave

Examples

x <- c('the','cat','ran','past','the','dog')
prselect(x, 'big','bad')     # omit nothing- no match
prselect(x, 'the','past')    # omit first 4 lines
prselect(x,'the','junk')     # omit nothing- no match for stop
prselect(x,'ran','dog')      # omit last 4 lines
prselect(x,'cat')            # omit lines 2-
prselect(x,'cat',i=1)        # omit lines 3-
prselect(x,'cat','past')     # omit lines 2-4
prselect(x,'cat','past',j=1) # omit lines 2-5
prselect(x,'cat','past',j=-1)# omit lines 2-3
prselect(x,'t$','dog')       # omit lines 2-6; t must be at end

# Example for Sweave: run a regression analysis with the rms package
# then selectively output only a portion of what print.ols prints.
# (Thanks to romain.francois@dbmail.com)
# <<z,eval=FALSE,echo=T>>=
# library(rms)
# y <- rnorm(20); x1 <- rnorm(20); x2 <- rnorm(20)
# ols(y ~ x1 + x2)
# <<echo=F>>=
# z <- capture.output( {
# <<z>>
#    } )
# prselect(z, 'Residuals:') # keep only summary stats; or:
# prselect(z, stop='Coefficients', j=-1)  # keep coefficients, rmse, R^2; or:
# prselect(z, 'Coefficients', 'Residual standard error', j=-1) # omit coef
# @

Date/Time/Directory Stamp the Current Plot

Description

Date-time stamp the current plot in the extreme lower right corner. Optionally add the current working directory and arbitrary other text to the stamp.

Usage

pstamp(txt, pwd = FALSE, time. = TRUE)

Arguments

txt

an optional single text string

pwd

set to TRUE to add the current working directory name to the stamp

time.

set to FALSE to use the date without the time

Details

Certain functions are not supported for S-Plus under Windows. For R, results may not be satisfactory if par(mfrow=) is in effect.

Author(s)

Frank Harrell

Examples

plot(1:20)
pstamp(pwd=TRUE, time=FALSE)

qcrypt

Description

Store and Encrypt R Objects or Files or Read and Decrypt Them

Usage

qcrypt(obj, base, service = "R-keyring-service", file, pw)

Arguments

obj

an R object to write to disk and encrypt (if base is specified) or the base file name to read and decrypt (if base is not specified). Not used when file is given.

base

base file name when creating a file. Not used when file is given.

service

a fairly arbitrary keyring service name. The default is almost always OK unless you need to use different passwords for different files. service is ignored if pw is specified as an argument.

file

full name of file to encrypt or decrypt

pw

a single character string containing an actual password

Details

qcrypt is used to protect sensitive information on a user's computer or when transmitting a copy of the file to another R user. Unencrypted information only exists for a moment, and the encryption password does not appear in the user's script but instead is managed by the keyring package to remember the password across R sessions, and the getPass package, which pops up a password entry window and does not allow the password to be visible. The password is requested only once, except perhaps when the user logs out of their operating system session or reboots.

The keyring can be bypassed and the password entered in a popup window by specifying service=NA. This is the preferred approach when sending an encrypted file to a user on a different computer.

qcrypt writes R objects to disk in a temporary file using the qs package qsave function. The file is quickly encrypted using the safer package, and the temporary unencrypted qs file is deleted. When reading an encrypted file the process is reversed.

To save an object in an encrypted file, specify the object as the first argument obj and specify a base file name as a character string in the second argument base. The full qs file name will be of the form base.qs.encrypted in the user's current working directory. To unencrypt the file into a short-lived temporary file and use qs::qread to read it, specify the base file name as a character string with the first argument, and do not specify the base argument.

Alternatively, qcrypt can be used to encrypt or decrypt existing files of any type using the same password and keyring mechanism. The former is done by specifying file that does not end in '.encrypted' and the latter is done by ending file with '.encrypted'. When file does not contain a path it is assumed to be in the current working directory. When a file is encrypted the original file is removed. Files are decrypted into a temporary directory created by tempdir(), with the name of the file being the value of file with '.encrypted' removed.

Interactive password provision works when running R, Rscript, RStudio, or Quarto but does not work when running R CMD BATCH. getPass fails under RStudio on Macs.

It is also possible to pass the password as the pw argument. This is only safe if running interactively and the password is defined by typing e.g. pw <- 'whateverpassword' in the console, then running the script interactively with pw=pw added to the qcrypt call.

See R Workflow for more information.

Value

(invisibly) the full encrypted file name if writing the file, or the restored R object if reading the file. When decrypting a general file with file=, the returned value is the full path to a temporary file containing the decrypted data.

Author(s)

Frank Harrell

Examples

## Not run: 
# Suppose x is a data.table or data.frame
# The first time qcrypt is run with a service a password will
# be requested.  It will be remembered across sessions thanks to
# the keyring package
qcrypt(x, 'x')   # creates x.qs.encrypted in current working directory
x <- qcrypt('x') # unencrypts x.qs.encrypted into a temporary
                 # directory, uses qs::qread to read it, and
                 # stores the result in x
# Encrypt a general file using a different password
qcrypt(file='report.pdf', service='pdfkey')
# Decrypt that file
fi <- qcrypt(file='report.pdf.encrypted', service='pdfkey')
# fi contains the full unencrypted file name which is in a temporary directory
# Encrypt without using a keyring
qcrypt(x, 'x', service=NA)
x <- qcrypt('x', service=NA)
pw <- 'somepassword'     # run this in the console
x <- qcrypt('x', pw=pw)  # interactively run this in a script

## End(Not run)

qrxcenter

Description

Mean-center a data matrix and QR transform it

Usage

qrxcenter(x, ...)

Arguments

x

a numeric matrix or vector with at least 2 rows

...

passed to base::qr()

Details

For a numeric matrix x (or a numeric vector that is automatically changed to a one-column matrix), computes column means and subtracts them from x columns, and passes this matrix to base::qr() to orthogonalize columns. Columns of the transformed x are negated as needed so that original directions are preserved (which are arbitrary with QR decomposition). Instead of the default qr operation for which sums of squares of column values are 1.0, qrxcenter makes all the transformed columns have standard deviation of 1.0.

Value

a list with components x (transformed data matrix), R (the matrix that can be used to transform raw x and to transform regression coefficients computed on transformed x back to the original space), Ri (transforms transformed x back to original scale except for xbar), and xbar (vector of means of original x columns)

Examples

set.seed(1)
age <- 1:10
country <- sample(c('Slovenia', 'Italy', 'France'), 10, TRUE)
x <- model.matrix(~ age + country)[, -1]
x
w <- qrxcenter(x)
w
# Reproduce w$x
sweep(x, 2, w$xbar) %*% w$R
# Reproduce x from w$x
sweep(w$x %*% w$Ri, 2, w$xbar, FUN='+')
# See also https://hbiostat.org/r/examples/gtrans/gtrans#sec-splinebasis
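
The steps in Details (center, QR-orthogonalize, fix signs, rescale to unit standard deviation) can be sketched in base R as follows. This is an illustration written for this page, not the qrxcenter source. The sign rule used here (flip columns so the diagonal of the triangular factor is positive) is an assumption about what "original directions are preserved" means, and the names Rmat and Ri merely mirror the roles of the documented R and Ri components.

```r
set.seed(1)
x <- cbind(a = rnorm(20), b = rnorm(20), c = runif(20))
n <- nrow(x)

xbar <- colMeans(x)
xc   <- sweep(x, 2, xbar)        # mean-center each column

qrx <- qr(xc)
Q   <- qr.Q(qrx)                 # orthonormal columns, sums of squares 1.0
Rq  <- qr.R(qrx)                 # upper triangular, xc = Q %*% Rq

# Assumed sign convention: make the diagonal of Rq positive
s  <- sign(diag(Rq))
Q  <- sweep(Q, 2, s, '*')        # negate columns of Q as needed
Rq <- Rq * s                     # negate the matching rows of Rq

z    <- Q * sqrt(n - 1)          # transformed columns now have SD 1.0
Ri   <- Rq / sqrt(n - 1)         # z %*% Ri gives back the centered x
Rmat <- solve(Ri)                # xc %*% Rmat gives z

apply(z, 2, sd)
```

Since the columns of the centered matrix have mean zero, so do the columns of Q, which is why multiplying by sqrt(n - 1) turns unit sums of squares into unit standard deviations.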

r2describe

Description

Summarize Strength of Relationships Using R-Squared From Linear Regression

Usage

r2describe(x, nvmax = 10)

Arguments

x

numeric matrix with 2 or more columns

nvmax

maximum number of columns of x to use in predicting a given column

Details

Function to use leaps::regsubsets() to briefly describe which variables more strongly predict another variable. Variables are in a numeric matrix and are assumed to be transformed so that relationships are linear (e.g., using redun() or transcan()).

Value

nothing

Author(s)

Frank Harrell

Examples

## Not run: 
r <- redun(...)
r2describe(r$scores)

## End(Not run)

Generate Multinomial Random Variables with Varying Probabilities

Description

Given a matrix of multinomial probabilities where rows correspond to observations and columns to categories (and each row sums to 1), generates a matrix with the same number of rows as has probs and with m columns. The columns represent multinomial cell numbers, and within a row the columns are all samples from the same multinomial distribution. The code is a modification of that in the impute.polyreg function in the MICE package.
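
One way to draw cell numbers with row-specific probabilities is to compare uniform random numbers with each row's cumulative probabilities. The sketch below illustrates that idea in base R; the helper name multinom_rows_sketch is hypothetical, and it is not claimed to match the rMultinom source line for line.

```r
# Hypothetical base-R sketch; not the Hmisc source
multinom_rows_sketch <- function(probs, m) {
  probs <- as.matrix(probs)
  n     <- nrow(probs)
  k     <- ncol(probs)
  cum   <- t(apply(probs, 1, cumsum))   # row-wise cumulative probabilities (n x k)
  out   <- matrix(NA_integer_, n, m)
  for (j in seq_len(m)) {
    u <- runif(n)                       # one uniform draw per row
    # cell number = 1 + number of cumulative probabilities that u exceeds;
    # pmin guards against rounding error in the last cumulative sum
    out[, j] <- pmin(1L + as.integer(rowSums(u > cum)), k)
  }
  out
}

set.seed(1)
p <- rbind(c(.1, .2, .3, .4), c(.4, .3, .2, .1))
w <- multinom_rows_sketch(p, 200)
t(apply(w, 1, function(r) table(factor(r, levels = 1:4)) / 200))  # observed proportions by row
```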

Usage

rMultinom(probs, m)

Arguments

probs

matrix of probabilities

m

number of samples for each row of probs

Value

an integer matrix having m columns

See Also

rbinom

Examples

set.seed(1)
w <- rMultinom(rbind(c(.1,.2,.3,.4),c(.4,.3,.2,.1)),200)
t(apply(w, 1, table)/200)

Matrix of Correlations and P-values

Description

rcorr computes a matrix of Pearson's r or Spearman's rho rank correlation coefficients for all possible pairs of columns of a matrix. Missing values are deleted in pairs rather than deleting all rows of x having any missing variables. Ranks are computed using efficient algorithms (see reference 2), using midranks for ties.

Usage

rcorr(x, y, type=c("pearson","spearman"))

## S3 method for class 'rcorr'
print(x, ...)

Arguments

x

a numeric matrix with at least 5 rows and at least 2 columns (if y is absent). For print, x is an object produced by rcorr.

y

a numeric vector or matrix which will be concatenated to x. If y is omitted for rcorr, x must be a matrix.

type

specifies the type of correlations to compute. Spearman correlations are the Pearson linear correlations computed on the ranks of non-missing elements, using midranks for ties.

...

argument for method compatibility.

Details

Uses midranks in case of ties, as described by Hollander and Wolfe. P-values are approximated by using the t or F distributions.
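
For a single pair of variables, pairwise deletion, midranks, and a t-based P-value can be sketched in base R as follows. This is an illustration only: the data are made up, and the t approximation with n - 2 degrees of freedom is the textbook form, assumed here rather than taken from the rcorr source.

```r
# Made-up data with missing values in different positions
x <- c(1, 2, 3, 4, 5, NA, 7)
y <- c(2, 2, 5, NA, 4, 6, 9)

ok  <- !is.na(x) & !is.na(y)     # keep only pairs observed on both variables
n   <- sum(ok)
rx  <- rank(x[ok])               # default ties.method = "average" gives midranks
ry  <- rank(y[ok])
rho <- cor(rx, ry)               # Spearman rho = Pearson r computed on the ranks

tstat <- rho * sqrt((n - 2) / (1 - rho^2))
p     <- 2 * pt(-abs(tstat), df = n - 2)   # two-sided P-value from the t distribution
c(n = n, rho = rho, P = p)
```

Repeating this for every pair of columns gives matrices analogous to the r, n, and P elements described under Value.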

Value

rcorr returns a list with elements r, the matrix of correlations, n, the matrix of number of observations used in analyzing each pair of variables, P, the asymptotic P-values, and type. Pairs with fewer than 2 non-missing values have the r values set to NA. The diagonals of n are the number of non-NAs for the single variable corresponding to that row and column.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

References

Hollander M. and Wolfe D.A. (1973). Nonparametric Statistical Methods. New York: Wiley.

Press WH, Flannery BP, Teukolsky SA, Vetterling WT (1988): Numerical Recipes in C. Cambridge: Cambridge University Press.

See Also

hoeffd, cor, combine.levels, varclus, dotchart3, impute, chisq.test, cut2.

Examples

x <- c(-2, -1, 0, 1, 2)
y <- c(4,   1, 0, 1, 4)
z <- c(1,   2, 3, 4, NA)
v <- c(1,   2, 3, 4, 5)
rcorr(cbind(x,y,z,v))

Rank Correlation for Censored Data

Description

Computes the c index and the corresponding generalization of Somers' Dxy rank correlation for a censored response variable. Also works for uncensored and binary responses, although its use of all possible pairings makes it slow for this purpose. Dxy and c are related by Dxy = 2(c - 0.5).

rcorr.cens handles one predictor variable. rcorrcens computes rank correlation measures separately for each of a series of predictors. In addition, rcorrcens has a rough way of handling categorical predictors. If a categorical (factor) predictor has two levels, it is converted to a numeric having values 1 and 2. If it has more than 2 levels, an indicator variable is formed for the most frequent level vs. all others, and another indicator for the second most frequent level vs. all others. The correlation is taken as the maximum of the two (in absolute value).
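
To make the relation Dxy = 2(c - 0.5) concrete, here is a brute-force sketch for an uncensored response. The helper name c_index_sketch is hypothetical, censoring is not handled, and giving pairs tied on x a credit of one half is an assumption based on the usual convention rather than on the rcorr.cens source.

```r
# Hypothetical brute-force helper for an uncensored response; not rcorr.cens itself
c_index_sketch <- function(x, y, outx = FALSE) {
  relevant <- concordant <- 0
  n <- length(x)
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    if (y[i] == y[j]) next               # response cannot be ordered: not a relevant pair
    if (outx && x[i] == x[j]) next       # optionally drop pairs tied on x
    relevant <- relevant + 1
    s <- sign(x[i] - x[j]) * sign(y[i] - y[j])
    concordant <- concordant + (s > 0) + 0.5 * (s == 0)  # assumed: x ties count one half
  }
  cidx <- concordant / relevant
  c(C = cidx, Dxy = 2 * (cidx - 0.5), relevant = relevant)
}

c_index_sketch(c(1, 1, 2), c(0, 1, 1))               # one x tie among the relevant pairs
c_index_sketch(c(1, 1, 2), c(0, 1, 1), outx = TRUE)  # same data, tied pair excluded
```

The double loop visits every pair once, which is why this kind of calculation is slow for large samples.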

Usage

rcorr.cens(x, S, outx=FALSE)

## S3 method for class 'formula'
rcorrcens(formula, data=NULL, subset=NULL,
          na.action=na.retain, exclude.imputed=TRUE, outx=FALSE,
          ...)

Arguments

x

a numeric predictor variable

S

a Surv object or a vector. If a vector, assumes that every observation is uncensored.

outx

set to TRUE to not count pairs of observations tied on x as a relevant pair. This results in a Goodman–Kruskal gamma type rank correlation.

formula

a formula with a Surv object or a numeric vector on the left-hand side

data, subset, na.action

the usual options for models. Default for na.action is to retain all values, NA or not, so that NAs can be deleted in only a pairwise fashion.

exclude.imputed

set to FALSE to include imputed values (created by impute) in the calculations.

...

extra arguments passed to biVar.

Value

rcorr.cens returns a vector with the following named elements: C Index, Dxy, S.D., n, missing, uncensored, Relevant Pairs, Concordant, and Uncertain

n

number of observations not missing on any input variables

missing

number of observations missing on x or S

relevant

number of pairs of non-missing observations for which S could be ordered

concordant

number of relevant pairs for which x and S are concordant.

uncertain

number of pairs of non-missing observations for which censoring prevents classification of concordance of x and S.

rcorrcens.formula returns an object of class biVar which is documented with the biVar function.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

References

Newson R: Confidence intervals for rank statistics: Somers' D and extensions. Stata Journal 6:309-334; 2006.

See Also

concordance, somers2, biVar, rcorrp.cens

Examples

set.seed(1)
x <- round(rnorm(200))
y <- rnorm(200)
rcorr.cens(x, y, outx=TRUE)   # can correlate non-censored variables
library(survival)
age <- rnorm(400, 50, 10)
bp  <- rnorm(400,120, 15)
bp[1]  <- NA
d.time <- rexp(400)
cens   <- runif(400,.5,2)
death  <- d.time <= cens
d.time <- pmin(d.time, cens)
rcorr.cens(age, Surv(d.time, death))
r <- rcorrcens(Surv(d.time, death) ~ age + bp)
r
plot(r)

# Show typical 0.95 confidence limits for ROC areas for a sample size
# with 24 events and 62 non-events, for varying population ROC areas
# Repeat for 138 events and 102 non-events
set.seed(8)
par(mfrow=c(2,1))
for(i in 1:2) {
 n1 <- c(24,138)[i]
 n0 <- c(62,102)[i]
 y <- c(rep(0,n0), rep(1,n1))
 deltas <- seq(-3, 3, by=.25)
 C <- se <- deltas
 j <- 0
 for(d in deltas) {
  j <- j + 1
  x <- c(rnorm(n0, 0), rnorm(n1, d))
  w <- rcorr.cens(x, y)
  C[j]  <- w['C Index']
  se[j] <- w['S.D.']/2
 }
 low <- C-1.96*se; hi <- C+1.96*se
 print(cbind(C, low, hi))
 errbar(deltas, C, C+1.96*se, C-1.96*se,
        xlab='True Difference in Mean X',
        ylab='ROC Area and Approx. 0.95 CI')
 title(paste('n1=',n1,'  n0=',n0,sep=''))
 abline(h=.5, v=0, col='gray')
 true <- 1 - pnorm(0, deltas, sqrt(2))
 lines(deltas, true, col='blue')
}
par(mfrow=c(1,1))

Rank Correlation for Paired Predictors with a Possibly Censored Response, and Integrated Discrimination Index

Description

Computes U-statistics to test for whether predictor X1 is more concordant than predictor X2, extending rcorr.cens. For method=1, estimates the fraction of pairs for which the x1 difference is more impressive than the x2 difference. For method=2, estimates the fraction of pairs for which x1 is concordant with S but x2 is not.

For binary responses the function improveProb provides several assessments of whether one set of predicted probabilities is better than another, using the methods described in Pencina et al (2007). This involves NRI and IDI to test for whether predictions from model x1 are significantly different from those obtained from predictions from model x2. This is a distinct improvement over comparing ROC areas, sensitivity, or specificity.

Usage

rcorrp.cens(x1, x2, S, outx=FALSE, method=1)

improveProb(x1, x2, y)

## S3 method for class 'improveProb'
print(x, digits=3, conf.int=.95, ...)

Arguments

x1

first predictor (a probability, for improveProb)

x2

second predictor (a probability, for improveProb)

S

a possibly right-censored Surv object. If S is a vector instead, it is converted to a Surv object and it is assumed that no observations are censored.

outx

set to TRUE to exclude pairs tied on x1 or x2 from consideration

method

see above

y

a binary 0/1 outcome variable

x

the result from improveProb

digits

number of significant digits for use in printing the result of improveProb

conf.int

level for confidence limits

...

unused

Details

If x1, x2 represent predictions from models, these functions assume either that you are using a separate sample from the one used to build the model, or that the amount of overfitting in x1 equals the amount of overfitting in x2. An example of the latter is giving both models equal opportunity to be complex so that both models have the same number of effective degrees of freedom, whether a predictor was included in the model or was screened out by a variable selection scheme.

Note that in the first part of their paper, Pencina et al. presented measures that required binning the predicted probabilities. Those measures were then replaced with better continuous measures that are implemented here.

Value

a vector of statistics for rcorrp.cens, or a list with class improveProb of statistics for improveProb:

n

number of cases

na

number of events

nb

number of non-events

pup.ev

mean of pairwise differences in probabilities for those with events and a pairwise difference of probabilities > 0

pup.ne

mean of pairwise differences in probabilities for those without events and a pairwise difference of probabilities > 0

pdown.ev

mean of pairwise differences in probabilities for those with events and a pairwise difference of probabilities < 0

pdown.ne

mean of pairwise differences in probabilities for those without events and a pairwise difference of probabilities < 0

nri

Net Reclassification Index = (pup.ev - pdown.ev) - (pup.ne - pdown.ne)

se.nri

standard error of NRI

z.nri

Z score for NRI

nri.ev

Net Reclassification Index for events = pup.ev - pdown.ev

se.nri.ev

SE of NRI of events

z.nri.ev

Z score for NRI of events

nri.ne

Net Reclassification Index for non-events = pup.ne - pdown.ne

se.nri.ne

SE of NRI of non-events

z.nri.ne

Z score for NRI of non-events

improveSens

improvement in sensitivity

improveSpec

improvement in specificity

idi

Integrated Discrimination Index

se.idi

SE of IDI

z.idi

Z score of IDI
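
The NRI formula listed above can be illustrated with a small base-R sketch. Several things here are assumptions rather than statements about improveProb: the helper name nri_idi_sketch is hypothetical, pup and pdown are read as the proportions of subjects whose predicted probability moves up or down when going from x1 to x2 (the category-free NRI of Pencina et al), and IDI is taken as the difference between events and non-events in the mean change in predicted probability. Standard errors and Z scores are omitted.

```r
# Hypothetical sketch of the point estimates only; not the improveProb source
nri_idi_sketch <- function(x1, x2, y) {
  d  <- x2 - x1                       # change in predicted probability
  ev <- y == 1                        # events
  ne <- y == 0                        # non-events
  pup.ev   <- mean(d[ev] > 0)         # assumed: proportion of events moving up
  pdown.ev <- mean(d[ev] < 0)         # assumed: proportion of events moving down
  pup.ne   <- mean(d[ne] > 0)
  pdown.ne <- mean(d[ne] < 0)
  c(nri = (pup.ev - pdown.ev) - (pup.ne - pdown.ne),
    idi = mean(d[ev]) - mean(d[ne]))
}

y  <- c(1, 1, 0, 0)
x1 <- c(.5, .5, .5, .5)
x2 <- c(.7, .6, .3, .4)               # x2 moves every subject in the right direction
nri_idi_sketch(x1, x2, y)
```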

Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
fh@fharrell.com

Scott Williams
Division of Radiation Oncology
Peter MacCallum Cancer Centre, Melbourne, Australia
scott.williams@petermac.org

References

Pencina MJ, D'Agostino Sr RB, D'Agostino Jr RB, Vasan RS (2008): Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Stat in Med 27:157-172. DOI: 10.1002/sim.2929

Pencina MJ, D'Agostino Sr RB, D'Agostino Jr RB, Vasan RS: Rejoinder: Comments on Integrated discrimination and net reclassification improvements-Practical advice. Stat in Med 2007; DOI: 10.1002/sim.3106

Pencina MJ, D'Agostino RB, Steyerberg EW (2011): Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat in Med 30:11-21; DOI: 10.1002/sim.4085

See Also

rcorr.cens, somers2, Surv, val.prob, concordance

Examples

set.seed(1)
library(survival)

x1 <- rnorm(400)
x2 <- x1 + rnorm(400)
d.time <- rexp(400) + (x1 - min(x1))
cens   <- runif(400,.5,2)
death  <- d.time <= cens
d.time <- pmin(d.time, cens)
rcorrp.cens(x1, x2, Surv(d.time, death))
#rcorrp.cens(x1, x2, y) ## no censoring

set.seed(1)
x1 <- runif(1000)
x2 <- runif(1000)
y  <- sample(0:1, 1000, TRUE)
rcorrp.cens(x1, x2, y)
improveProb(x1, x2, y)

Restricted Cubic Spline Design Matrix

Description

Computes matrix that expands a single variable into the terms needed to fit a restricted cubic spline (natural spline) function using the truncated power basis. Two normalization options are given for somewhat reducing problems of ill-conditioning. The antiderivative function can be optionally created. If knot locations are not given, they will be estimated from the marginal distribution of x.

Usage

rcspline.eval(x, knots, nk=5, inclx=FALSE, knots.only=FALSE,
              type="ordinary", norm=2, rpm=NULL, pc=FALSE,
              fractied=0.05)

Arguments

x

a vector representing a predictor variable

knots

knot locations. If not given, knots will be estimated using default quantiles of x. For 3 knots, the outer quantiles used are 0.10 and 0.90. For 4-6 knots, the outer quantiles used are 0.05 and 0.95. For nk > 6, the outer quantiles are 0.025 and 0.975. The knots are equally spaced between these on the quantile scale. For fewer than 100 non-missing values of x, the outer knots are the 5th smallest and largest x.

nk

number of knots. Default is 5. The minimum value is 3.

inclx

set to TRUE to add x as the first column of the returned matrix

knots.only

return the estimated knot locations but not the expanded matrix

type

"ordinary" to fit the function, "integral" to fit its anti-derivative.

norm

0 to use the terms as originally given by Devlin and Weeks (1986), 1 to normalize non-linear terms by the cube of the spacing between the last two knots, 2 to normalize by the square of the spacing between the first and last knots (the default). norm=2 has the advantage of making all nonlinear terms be on the x-scale.

rpm

If given, any NAs in x will be replaced with the value rpm after estimating any knot locations.

pc

Set to TRUE to replace the design matrix with orthogonal (uncorrelated) principal components computed on the scaled, centered design matrix

fractied

If the fraction of observations tied at the lowest and/or highest values of x is greater than or equal to fractied, the function attempts to use a different algorithm for knot finding based on quantiles of x after excluding the one or two values with excessive ties. And if the number of unique x values excluding these values is small, the unique values will be used as the knots. If the number of knots to use other than these exterior values is only one, that knot will be at the median of the non-extreme x. This algorithm is not used if any interior values of x also have a proportion of ties equal to or exceeding fractied.
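
The default knot placement described for the knots argument can be sketched in base R. The helper name default_knots_sketch is hypothetical; it follows only the quantile rules stated above, ignores the fractied handling of heavy ties, and assumes that with fewer than 100 values only the two outer knots are replaced by the 5th smallest and 5th largest observations.

```r
# Hypothetical sketch of the default knot rule; ties handling (fractied) is ignored
default_knots_sketch <- function(x, nk = 5) {
  x <- x[!is.na(x)]
  outer <- if (nk > 6) 0.025 else if (nk > 3) 0.05 else 0.10
  p <- seq(outer, 1 - outer, length.out = nk)   # equally spaced on the quantile scale
  knots <- quantile(x, p, names = FALSE)
  if (length(x) < 100) {                        # small samples: use order statistics
    xs <- sort(x)
    knots[c(1, nk)] <- xs[c(5, length(xs) - 4)] # 5th smallest and 5th largest
  }
  knots
}

default_knots_sketch(1:1000, nk = 5)
default_knots_sketch(1:50,   nk = 3)
```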

Value

If knots.only=TRUE, returns a vector of knot locations. Otherwise returns a matrix with x (if inclx=TRUE) followed by nk-2 nonlinear terms. The matrix has an attribute knots which is the vector of knots used. When pc is TRUE, an additional attribute is stored: pcparms, which contains the center and scale vectors and the rotation matrix.
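
The nk-2 nonlinear terms can be written out with the truncated power basis. The sketch below uses the widely published form of the restricted cubic spline terms with the norm=2 scaling (division by the square of the distance between the first and last knots); the helper name rcs_basis_sketch is hypothetical and the code is an illustration, not the rcspline.eval source.

```r
# Hypothetical sketch of the nonlinear restricted cubic spline terms (norm = 2 scaling)
rcs_basis_sketch <- function(x, knots) {
  k    <- length(knots)
  tk   <- knots[k]                       # last knot
  tk1  <- knots[k - 1]                   # next-to-last knot
  pos3 <- function(u) pmax(u, 0)^3       # truncated cubic (u)_+^3
  sapply(seq_len(k - 2), function(j) {
    tj <- knots[j]
    (pos3(x - tj) -
       pos3(x - tk1) * (tk - tj)  / (tk - tk1) +
       pos3(x - tk)  * (tk1 - tj) / (tk - tk1)) / (tk - knots[1])^2
  })
}

knots <- c(5, 25, 50, 75, 95)
b <- rcs_basis_sketch(0:110, knots)
dim(b)                                   # one column per nonlinear term
range(diff(b[97:111, ], differences = 2))  # second differences beyond the last knot
```

The two correction terms cancel the cubic and quadratic parts beyond the last knot, which is what restricts the function to be linear in the tails.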

References

Devlin TF and Weeks BJ (1986): Spline functions for logistic regression modeling. Proc 11th Annual SAS Users Group Intnl Conf, p. 646–651. Cary NC: SAS Institute, Inc.

See Also

ns, rcspline.restate, rcs

Examples

x <- 1:100
rcspline.eval(x, nk=4, inclx=TRUE)
#lrm.fit(rcspline.eval(age,nk=4,inclx=TRUE), death)
x <- 1:1000
attributes(rcspline.eval(x))
x <- c(rep(0, 744),rep(1,6), rep(2,4), rep(3,10),rep(4,2),rep(6,6),
  rep(7,3),rep(8,2),rep(9,4),rep(10,2),rep(11,9),rep(12,10),rep(13,13),
  rep(14,5),rep(15,5),rep(16,10),rep(17,6),rep(18,3),rep(19,11),rep(20,16),
  rep(21,6),rep(22,16),rep(23,17), 24, rep(25,8), rep(26,6),rep(27,3),
  rep(28,7),rep(29,9),rep(30,10),rep(31,4),rep(32,4),rep(33,6),rep(34,6),
  rep(35,4), rep(36,5), rep(38,6), 39, 39, 40, 40, 40, 41, 43, 44, 45)
attributes(rcspline.eval(x, nk=3))
attributes(rcspline.eval(x, nk=5))
u <- c(rep(0,30), 1:4, rep(5,30))
attributes(rcspline.eval(u))

Plot Restricted Cubic Spline Function

Description

Provides plots of the estimated restricted cubic spline function relating a single predictor to the response for a logistic or Cox model. The rcspline.plot function does not allow for interactions as do lrm and cph, but it can provide detailed output for checking spline fits. This function uses the rcspline.eval, lrm.fit, and Therneau's coxph.fit functions and plots the estimated spline regression and confidence limits, placing summary statistics on the graph. If there are no adjustment variables, rcspline.plot can also plot two alternative estimates of the regression function when model="logistic": proportions or logit proportions on grouped data, and a nonparametric estimate. The nonparametric regression estimate is based on smoothing the binary responses and taking the logit transformation of the smoothed estimates, if desired. The smoothing uses supsmu.

Usage

rcspline.plot(x,y,model=c("logistic", "cox", "ols"), xrange, event, nk=5,
              knots=NULL, show=c("xbeta","prob"), adj=NULL, xlab, ylab,
              ylim, plim=c(0,1), plotcl=TRUE, showknots=TRUE, add=FALSE,
              subset, lty=1, noprint=FALSE, m, smooth=FALSE, bass=1,
              main="auto", statloc)

Arguments

x

a numeric predictor

y

a numeric response. For binary logistic regression, y should be either 0 or 1.

model

"logistic" or "cox". For "cox", uses the coxph.fit function with method="efron" argument set.

xrange

range for evaluating x, default is f and 1 - f quantiles of x, where f = 10 / max(n, 200)

event

event/censoring indicator if model="cox". If event is present, model is assumed to be "cox"

nk

number of knots

knots

knot locations, default based on quantiles of x (by rcspline.eval)

show

"xbeta" or "prob" - what is plotted on y-axis

adj

optional matrix of adjustment variables

xlab

x-axis label, default is the “label” attribute of x

ylab

y-axis label, default is the “label” attribute of y

ylim

y-axis limits for logit or log hazard

plim

y-axis limits for probability scale

plotcl

plot confidence limits

showknots

show knot locations with arrows

add

add this plot to an already existing plot

subset

subset of observations to process, e.g. sex == "male"

lty

line type for plotting estimated spline function

noprint

suppress printing regression coefficients and standard errors

m

for model="logistic", plot grouped estimates with triangles. Each group contains m ordered observations on x.

smooth

plot nonparametric estimate if model="logistic" and adj is not specified

bass

smoothing parameter (see supsmu)

main

main title, default is "Estimated Spline Transformation"

statloc

location of summary statistics. Default positioning by clicking left mouse button where upper left corner of statistics should appear. Alternative is "ll" to place below the graph on the lower left, or the actual x and y coordinates. Use "none" to suppress statistics.
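
As a small worked example of the xrange default given in the argument list (the f and 1 - f quantiles with f = 10 / max(n, 200)), the following base-R sketch computes that range. The helper name xrange_default_sketch is hypothetical, and taking n to be the number of non-missing values of x is an assumption.

```r
# Hypothetical sketch of the default plotting range for x
xrange_default_sketch <- function(x) {
  x <- x[!is.na(x)]
  f <- 10 / max(length(x), 200)      # assumed: n counts non-missing x values
  quantile(x, c(f, 1 - f), names = FALSE)
}

set.seed(1)
x <- rnorm(1000)
r <- xrange_default_sketch(x)        # here f = 0.01
r
```

For samples of 200 or fewer the rule always trims the outer 5 percent on each side; for larger samples it trims roughly the 10 most extreme observations in each tail.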

Value

list with components (knots, x, xbeta, lower, upper) which are respectively the knot locations, design matrix, linear predictor, and lower and upper confidence limits

Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
fh@fharrell.com

See Also

lrm, cph, rcspline.eval, plot, supsmu, coxph.fit, lrm.fit

Examples

#rcspline.plot(cad.dur, tvdlm, m=150)
#rcspline.plot(log10(cad.dur+1), tvdlm, m=150)

Re-state Restricted Cubic Spline Function

Description

This function re-states a restricted cubic spline function in the un-linearly-restricted form. Coefficients for that form are returned, along with an R functional representation of this function and a LaTeX character representation of the function. rcsplineFunction is a fast function that creates a function to compute a restricted cubic spline function with given coefficients and knots, without reformatting the function to be pretty (i.e., into unrestricted form).

Usage

rcspline.restate(knots, coef,
                 type=c("ordinary","integral"),
                 x="X", lx=nchar(x),
                 norm=2, columns=65, before="& &", after="\\",
                 begin="", nbegin=0, digits=max(8, .Options$digits))

rcsplineFunction(knots, coef, norm=2, type=c('ordinary', 'integral'))

Arguments

knots

vector of knots used in the regression fit

coef

vector of coefficients from the fit. If the length of coef is k-1, where k is equal to length(knots), the first coefficient must be for the linear term and remaining k-2 coefficients must be for the constructed terms (e.g., from rcspline.eval). If the length of coef is k, an intercept is assumed to be in the first element (or a zero is prepended to coef for rcsplineFunction).

type

The default is to represent the cubic spline function corresponding to the coefficients and knots. Set type = "integral" to instead represent its anti-derivative.

x

a character string to use as the variable name in the LaTeX expression for the formula.

lx

length of x to count with respect to columns. Default is length of character string contained by x. You may want to set lx smaller than this if it includes non-printable LaTeX commands.

norm

normalization that was used in deriving the original nonlinear terms used in the fit. See rcspline.eval for definitions.

columns

maximum number of symbols in the LaTeX expression to allow before inserting a newline (\\) command. Set to a very large number to keep text all on one line.

before

text to place before each line of LaTeX output. Use "& &" for an equation array environment in LaTeX where you want to have a left-hand prefix e.g. "f(X) & = &" or using "\lefteqn".

after

text to place at the end of each line of output.

begin

text with which to start the first line of output. Useful when adding LaTeX output to part of an existing formula

nbegin

number of columns of printable text in begin

digits

number of significant digits to write for coefficients and knots

Value

rcspline.restate returns a vector of coefficients. The coefficients are un-normalized and two coefficients are added that are linearly dependent on the other coefficients and knots. The vector of coefficients has four attributes. knots is a vector of knots, latex is a vector of text strings with the LaTeX representation of the formula. columns.used is the number of columns used in the output string since the last newline command. function is an R function, which is also returned in character string format as the text attribute. rcsplineFunction returns an R function with arguments x (a user-supplied numeric vector at which to evaluate the function), and some automatically-supplied other arguments.
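
The idea of un-normalizing the coefficients and adding two linearly dependent ones can be checked numerically. The sketch below assumes the standard truncated power form of the restricted terms with norm=2 scaling; the coefficient values are made up, only the nonlinear part is shown (no intercept or linear term), and none of this is taken from the rcspline.restate source.

```r
knots <- c(5, 25, 50, 75, 95)
beta  <- c(0.8, -1.5, 0.6)            # made-up coefficients of the k-2 nonlinear terms
k     <- length(knots)
tk    <- knots[k]
tk1   <- knots[k - 1]
tj    <- knots[1:(k - 2)]
pos3  <- function(u) pmax(u, 0)^3     # truncated cubic (u)_+^3

theta <- beta / (tk - knots[1])^2     # undo the norm = 2 scaling
theta <- c(theta,
           sum(theta * (tj - tk))  / (tk - tk1),   # coefficient of (x - t[k-1])_+^3
           sum(theta * (tk1 - tj)) / (tk - tk1))   # coefficient of (x - t[k])_+^3

x <- seq(0, 120, by = 0.5)
# Restricted (normalized) form: k-2 constructed terms times beta
restricted <- sapply(1:(k - 2), function(j)
  (pos3(x - knots[j]) - pos3(x - tk1) * (tk - knots[j]) / (tk - tk1) +
     pos3(x - tk) * (tk1 - knots[j]) / (tk - tk1)) / (tk - knots[1])^2) %*% beta
# Unrestricted form: one simple truncated cubic per knot times theta
unrestricted <- sapply(1:k, function(j) pos3(x - knots[j])) %*% theta

max(abs(restricted - unrestricted))   # the two forms agree up to rounding
```

The k coefficients in theta sum to zero, which is another way of seeing that the cubic term vanishes beyond the last knot.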

Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
fh@fharrell.com

See Also

rcspline.eval, ns, rcs, latex, Function.transcan

Examples

set.seed(1)
x <- 1:100
y <- (x - 50)^2 + rnorm(100, 0, 50)
plot(x, y)
xx <- rcspline.eval(x, inclx=TRUE, nk=4)
knots <- attr(xx, "knots")
coef <- lsfit(xx, y)$coef
options(digits=4)
# rcspline.restate must ignore intercept
w <- rcspline.restate(knots, coef[-1], x="{\\rm BP}")
# could also have used coef instead of coef[-1], to include intercept
cat(attr(w,"latex"), sep="\n")

xtrans <- eval(attr(w, "function"))
# This is an S function of a single argument
lines(x, coef[1] + xtrans(x), type="l")
# Plots fitted transformation

xtrans <- rcsplineFunction(knots, coef)
xtrans
lines(x, xtrans(x), col='blue')

#x <- blood.pressure
xx.simple <- cbind(x, pmax(x-knots[1],0)^3, pmax(x-knots[2],0)^3,
                       pmax(x-knots[3],0)^3, pmax(x-knots[4],0)^3)
pred.value <- coef[1] + xx.simple %*% w
plot(x, pred.value, type='l')   # same as above

Reshape Matrices and Serial Data

Description

If the first argument is a matrix, reShape strings out its values and creates row and column vectors specifying the row and column each element came from. This is useful for sending matrices to Trellis functions, for analyzing or plotting results of table or crosstabs, or for reformatting serial data stored in a matrix (with rows representing multiple time points) into vectors. The number of observations in the new variables will be the product of the number of rows and number of columns in the input matrix. If the first argument is a vector, the id and colvar variables are used to restructure it into a matrix, with NAs for elements that corresponded to combinations of id and colvar values that did not exist in the data. When more than one vector is given, multiple matrices are created. This is useful for restructuring irregular serial data into regular matrices. It is also useful for converting data produced by expand.grid into a matrix (see the last example). The number of rows of the new matrices equals the number of unique values of id, and the number of columns equals the number of unique values of colvar.

When the first argument is a vector and the id is a data frame (even with only one variable), reShape will produce a data frame, and the unique groups are identified by combinations of the values of all variables in id. If a data frame constant is specified, the variables in this data frame are assumed to be constant within combinations of id variables (if not, an arbitrary observation in constant will be selected for each group). A row of constant corresponding to the target id combination is then carried along when creating the data frame result.

A different behavior of reShape is achieved when base and reps are specified. In that case x must be a list or data frame, and those data are assumed to contain one or more non-repeating measurements (e.g., baseline measurements) and one or more repeated measurements represented by variables named by pasting together the character strings in the vector base with the integers 1, 2, ..., reps. The input data are rearranged by repeating each value of the baseline variables reps times and by transposing each observation's values of one of the set of repeated measurements as reps observations under the variable whose name does not have an integer pasted to the end. If x has a row.names attribute, those observation identifiers are each repeated reps times in the output object. See the last example.
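
The first two behaviors (vector to matrix, and matrix to vectors) can be mimicked in base R to show what the restructuring amounts to. This is only an illustration of the idea with made-up data, not the reShape implementation; in particular the ordering of rows and columns by sort(unique()) is an assumption.

```r
# Irregular serial data: not every subject is observed at every month
id          <- c('a', 'a', 'b', 'b', 'b', 'd')
month       <- c(1, 2, 1, 2, 3, 2)
cholesterol <- c(225, 226, 320, 319, 318, 270)

# Vector -> matrix: one row per unique id, one column per unique month,
# NA where an id/month combination does not occur in the data
rows <- sort(unique(id))
cols <- sort(unique(month))
m <- matrix(NA_real_, length(rows), length(cols),
            dimnames = list(rows, as.character(cols)))
m[cbind(match(id, rows), match(month, cols))] <- cholesterol
m

# Matrix -> vectors: string out the values with their row and column identifiers
long <- data.frame(id    = rownames(m)[row(m)],
                   month = as.numeric(colnames(m)[col(m)]),
                   value = as.vector(m))
long
```

The long form has nrow(m) * ncol(m) observations, matching the product rule stated in the description.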

Usage

reShape(x, ..., id, colvar, base, reps, times=1:reps,
        timevar='seqno', constant=NULL)

Arguments

x

a matrix or vector, or, when base is specified, a list or data frame

...

other optional vectors, if x is a vector

id

A numeric, character, category, or factor variable containing subject identifiers, or a data frame of such variables that in combination form groups of interest. Required if x is a vector, ignored otherwise.

colvar

A numeric, character, category, or factor variable containing column identifiers. colvar is usually a "time of data collection" variable. Required if x is a vector, ignored otherwise.

base

vector of character strings containing base names of repeated measurements

reps

number of times variables named in base are repeated. This must be a constant.

times

when base is given, times is the vector of times to create if you do not want to use consecutive integers beginning with 1.

timevar

specifies the name of the time variable to create if times is given, if you do not want to use seqno

constant

a data frame with the same number of rows as id and x, containing auxiliary information to be merged into the resulting data frame. Logically, the rows of constant within each group should have the same value of all of its variables.

Details

In converting dimnames to vectors, the resulting variables are numeric if all elements of the matrix dimnames can be converted to numeric, otherwise the corresponding row or column variable remains character. When the dimnames of x have a names attribute, those two names become the new variable names. If x is a vector and another vector is also given (in ...), the matrices in the resulting list are named the same as the input vector calling arguments. You can specify customized names for these on-the-fly by using e.g. reShape(X=x, Y=y, id= , colvar= ). The new names will then be X and Y instead of x and y. A new variable named seqno is also added to the resulting object. seqno indicates the sequential repeated measurement number. When base and times are specified, this new variable is named the character value of timevar and the values are given by a table lookup into the vector times.

Value

If x is a matrix, returns a list containing the row variable, the column variable, and the as.vector(x) vector, named the same as the calling argument was called for x. If x is a vector and no other vectors were specified as ..., the result is a matrix. If at least one vector was given to ..., the result is a list containing k matrices, where k is one plus the number of vectors in .... If x is a list or data frame, the same type of object is returned. If x is a vector and id is a data frame, a data frame will be the result.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com

See Also

reshape, as.vector, matrix, dimnames, outer, table

Examples

set.seed(1)
Solder  <- factor(sample(c('Thin','Thick'),200,TRUE),c('Thin','Thick'))
Opening <- factor(sample(c('S','M','L'),  200,TRUE),c('S','M','L'))

tab <- table(Opening, Solder)
tab
reShape(tab)
# attach(tab)  # do further processing

# An example where a matrix is created from irregular vectors
follow <- data.frame(id=c('a','a','b','b','b','d'),
                     month=c(1, 2,  1,  2,  3,  2),
                     cholesterol=c(225,226, 320,319,318, 270))
follow
attach(follow)
reShape(cholesterol, id=id, colvar=month)
detach('follow')
# Could have done :
# reShape(cholesterol, triglyceride=trig, id=id, colvar=month)

# Create a data frame, reshaping a long dataset in which groups are
# formed not just by subject id but by combinations of subject id and
# visit number.  Also carry forward a variable that is supposed to be
# constant within subject-visit number combinations.  In this example,
# it is not constant, so an arbitrary visit number will be selected.
w <- data.frame(id=c('a','a','a','a','b','b','b','d','d','d'),
             visit=c(  1,  1,  2,  2,  1,  1,  2,  2,  2,  2),
                 k=c('A','A','B','B','C','C','D','E','F','G'),
               var=c('x','y','x','y','x','y','y','x','y','z'),
               val=1:10)
with(w,
     reShape(val, id=data.frame(id,visit),
             constant=data.frame(k), colvar=var))

# Get predictions from a regression model for 2 systematically
# varying predictors.  Convert the predictions into a matrix, with
# rows corresponding to the predictor having the most values, and
# columns corresponding to the other predictor
# d <- expand.grid(x2=0:1, x1=1:100)
# pred <- predict(fit, d)
# reShape(pred, id=d$x1, colvar=d$x2)  # makes 100 x 2 matrix

# Reshape a wide data frame containing multiple variables representing
# repeated measurements (3 repeats on 2 variables; 4 subjects)
set.seed(33)
n <- 4
w <- data.frame(age=rnorm(n, 40, 10),
                sex=sample(c('female','male'), n,TRUE),
                sbp1=rnorm(n, 120, 15),
                sbp2=rnorm(n, 120, 15),
                sbp3=rnorm(n, 120, 15),
                dbp1=rnorm(n,  80, 15),
                dbp2=rnorm(n,  80, 15),
                dbp3=rnorm(n,  80, 15), row.names=letters[1:n])
options(digits=3)
w

u <- reShape(w, base=c('sbp','dbp'), reps=3)
u
reShape(w, base=c('sbp','dbp'), reps=3, timevar='week', times=c(0,3,12))

Redundancy Analysis

Description

Uses flexible parametric additive models (see areg and its use of regression splines), or alternatively runs a regular regression after replacing continuous variables with ranks, to determine how well each variable can be predicted from the remaining variables. Variables are dropped in a stepwise fashion, removing the most predictable variable at each step. The remaining variables are used to predict. The process continues until no variable still in the list of predictors can be predicted with an R^2 or adjusted R^2 of at least r2 or until dropping the variable with the highest R^2 (adjusted or ordinary) would cause a variable that was dropped earlier to no longer be predicted at least at the r2 level from the now smaller list of predictors.

There is also an option qrank to expand each variable into two columns containing the rank and square of the rank. Whenever ranks are used, they are computed as fractional ranks for numerical reasons.
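
The stepwise logic can be sketched with ordinary linear models in base R. This is a deliberately simplified illustration, not the redun algorithm: the helper name redun_sketch is hypothetical, plain lm() fits replace the flexible additive models, all variables are assumed numeric, and the check that previously dropped variables remain predictable from the reduced set is omitted.

```r
# Hypothetical, simplified sketch: linear fits only, no re-check of dropped variables
redun_sketch <- function(d, r2 = 0.9) {
  keep    <- names(d)
  dropped <- character(0)
  repeat {
    if (length(keep) < 2) break
    # R^2 for predicting each remaining variable from all the other remaining ones
    rsq <- sapply(keep, function(v)
      summary(lm(reformulate(setdiff(keep, v), response = v), data = d))$r.squared)
    if (max(rsq) < r2) break               # nothing is predictable enough: stop
    worst   <- names(which.max(rsq))       # most predictable variable
    dropped <- c(dropped, worst)
    keep    <- setdiff(keep, worst)
  }
  list(kept = keep, dropped = dropped)
}

set.seed(2)
n <- 200
a <- rnorm(n)
b <- rnorm(n)
d <- data.frame(a = a, b = b, c = a + b + rnorm(n, sd = 0.01))  # c is nearly a + b
res <- redun_sketch(d, r2 = 0.9)
res
```

In this made-up data set any one of the three variables is almost perfectly predictable from the other two, so exactly one of them is removed and the remaining pair is no longer redundant.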

Usage

redun(formula, data=NULL, subset=NULL, r2 = 0.9,
      type = c("ordinary", "adjusted"), nk = 3, tlinear = TRUE,
      rank=qrank, qrank=FALSE,
      allcat=FALSE, minfreq=0, iterms=FALSE, pc=FALSE, pr = FALSE, ...)

## S3 method for class 'redun'
print(x, digits=3, long=TRUE, ...)

Arguments

formula

a formula. Enclose a variable in I() to force linearity. Alternately, can be a numeric matrix, in which case the data are not run through dataframeReduce. This is useful when running the data through transcan first for nonlinearly transforming the data.

data

a data frame, which must be omitted if formula is a matrix

subset

usual subsetting expression

r2

ordinary or adjusted R^2 cutoff for redundancy

type

specify "adjusted" to use adjusted R^2

nk

number of knots to use for continuous variables. Use nk=0 to force linearity for all variables.

tlinear

set to FALSE to allow a variable to be automatically nonlinearly transformed (see areg) while being predicted. By default, only continuous variables on the right hand side (i.e., while they are being predictors) are automatically transformed, using regression splines. Estimating transformations for target (dependent) variables causes more overfitting than doing so for predictors.

rank

set to TRUE to replace non-categorical variables with ranks before running the analysis. This causes nk to be set to zero.

qrank

set to TRUE to also include squares of ranks to allow for non-monotonic transformations

allcat

set to TRUE to ensure that all categories of categorical variables having more than two categories are redundant (see details below)

minfreq

For a binary or categorical variable, there must be at least two categories with at least minfreq observations or the variable will be dropped and not checked for redundancy against other variables. minfreq also specifies the minimum frequency of a category or its complement before that category is considered when allcat=TRUE.

iterms

set to TRUE to consider derived terms (dummy variables and nonlinear spline components) as separate variables. This will perform a redundancy analysis on pieces of the variables.

pc

if iterms=TRUE you can set pc to TRUE to replace the submatrix of terms corresponding to each variable with the orthogonal principal components before doing the redundancy analysis. The components are based on the correlation matrix.

pr

set to TRUE to monitor progress of the stepwise algorithm

...

arguments to pass to dataframeReduce to remove "difficult" variables from data if formula is ~. to use all variables in data (data must be specified when these arguments are used). Ignored for print.

x

an object created by redun

digits

number of digits to which to round R^2 values when printing

long

set to FALSE to prevent the print method from printing the R^2 history and the original R^2 with which each variable can be predicted from ALL other variables.

Details

A categorical variable is deemed redundant if a linear combination of dummy variables representing it can be predicted from a linear combination of other variables. For example, if there were 4 cities in the data and each city's rainfall was also present as a variable, with virtually the same rainfall reported for all observations for a city, city would be redundant given rainfall (or vice-versa; the one declared redundant would be the first one in the formula). If two cities had the same rainfall, city might be declared redundant even though tied cities might be deemed non-redundant in another setting. To ensure that all categories may be predicted well from other variables, use the allcat option. To ignore categories that are too infrequent or too frequent, set minfreq to a nonzero integer. When the number of observations in the category is below this number or the number of observations not in the category is below this number, no attempt is made to predict observations being in that category individually for the purpose of redundancy detection.

Value

an object of class "redun" including an element "scores", a numeric matrix with all transformed values when each variable was the dependent variable and the first canonical variate was computed

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

See Also

areg, dataframeReduce, transcan, varclus, r2describe, subselect::genetic

Examples

set.seed(1)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- x1 + x2 + runif(n)/10
x4 <- x1 + x2 + x3 + runif(n)/10
x5 <- factor(sample(c('a','b','c'),n,replace=TRUE))
x6 <- 1*(x5=='a' | x5=='c')
redun(~x1+x2+x3+x4+x5+x6, r2=.8)
redun(~x1+x2+x3+x4+x5+x6, r2=.8, minfreq=40)
redun(~x1+x2+x3+x4+x5+x6, r2=.8, allcat=TRUE)
# x5 is no longer redundant but x6 is
redun(~x1+x2+x3+x4+x5+x6, r2=.8, rank=TRUE)
redun(~x1+x2+x3+x4+x5+x6, r2=.8, qrank=TRUE)

# To help decode which variables made a particular variable redundant:
# r <- redun(...)
# r2describe(r$scores)

Special Version of legend for R

Description

rlegend is a version of legend for R that implements plot=FALSE, adds grid=TRUE, defaults lty, lwd, and pch to NULL, and checks for length>0 rather than missing(), so it's easier to deal with non-applicable parameters. But when grid is in effect, the preferred function to use is rlegendg, which calls the lattice draw.key function.

Usage

rlegend(x, y, legend, fill, col = "black", lty = NULL, lwd = NULL,
        pch = NULL, angle = NULL, density = NULL, bty = "o",
        bg = par("bg"), pt.bg = NA, cex = 1, xjust = 0, yjust = 1,
        x.intersp = 1, y.intersp = 1, adj = 0, text.width = NULL,
        merge = do.lines && has.pch, trace = FALSE, ncol = 1,
        horiz = FALSE, plot = TRUE, grid = FALSE, ...)

rlegendg(x, y, legend, col=pr$col[1], lty=NULL,
         lwd=NULL, pch=NULL, cex=pr$cex[1], other=NULL)

Arguments

x, y, legend, fill, col, lty, lwd, pch, angle, density, bty, bg, pt.bg, cex, xjust, yjust, x.intersp, y.intersp, adj, text.width, merge, trace, ncol, horiz

see legend

plot

set to FALSE to suppress drawing the legend. This is used to compute the size needed for when the legend is drawn with a later call to rlegend.

grid

set to TRUE if the grid package is in effect

...

see legend

other

a list containing other arguments to pass to draw.key. See the help file for xyplot.

Value

a list with elements rect and text. rect has elements w, h, left, top with size/position information.

Author(s)

Frank Harrell and R-Core

See Also

legend, draw.key, xyplot


Bootstrap Repeated Measurements Model

Description

For a dataset containing a time variable, a scalar response variable, and an optional subject identification variable, obtains least squares estimates of the coefficients of a restricted cubic spline function or a linear regression in time after adjusting for subject effects through the use of subject dummy variables. Then the fit is bootstrapped B times, either by treating time and subject ID as fixed (i.e., conditioning the analysis on them) or as random variables. For the former, the residuals from the original model fit are used as the basis of the bootstrap distribution. For the latter, samples are taken jointly from the time, subject ID, and response vectors to obtain unconditional distributions.

If a subject id variable is given, the bootstrap sampling will be based on samples with replacement from subjects rather than from individual data points. In other words, either none or all of a given subject's data will appear in a bootstrap sample. This cluster sampling takes into account any correlation structure that might exist within subjects, so that confidence limits are corrected for within-subject correlation. Assuming that ordinary least squares estimates, which ignore the correlation structure, are consistent (which is almost always true) and efficient (which would not be true for certain correlation structures or for datasets in which the number of observation times varies greatly from subject to subject), the resulting analysis will be a robust, efficient repeated measures analysis for the one-sample problem.
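A minimal base R sketch of the cluster (subject-level) resampling described above follows. The data frame d and the column boot.id are made up for illustration, and the code shows only the sampling idea, not the internals of rm.boot.

```r
# Illustration of cluster resampling: whole subjects are drawn with replacement
set.seed(2)
id   <- rep(1:5, each = 3)          # 5 subjects, 3 observations each
time <- rep(1:3, times = 5)
y    <- rnorm(15) + id
d    <- data.frame(id, time, y)

subjects <- unique(d$id)
drawn    <- sample(subjects, length(subjects), replace = TRUE)

# Stack all rows of each drawn subject; a subject drawn twice appears twice,
# and is given a distinct bootstrap ID each time
boot <- do.call(rbind, lapply(seq_along(drawn), function(j) {
  rows <- d[d$id == drawn[j], ]
  rows$boot.id <- j
  rows
}))
table(boot$boot.id)
```

Every bootstrap subject contributes all of its observations or none, so the within-subject correlation is carried into each resample.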

Predicted values of the fitted models are evaluated by default at a grid of 100 equally spaced time points ranging from the minimum to maximum observed time points. Predictions are for the average subject effect. Pointwise confidence intervals are optionally computed separately for each of the points on the time grid. However, simultaneous confidence regions that control the level of confidence for the entire regression curve lying within a band are often more appropriate, as they allow the analyst to draw conclusions about nuances in the mean time response profile that were not stated a priori. The method of Tibshirani (1997) is used to easily obtain simultaneous confidence sets for the set of coefficients of the spline or linear regression function as well as the average intercept parameter (over subjects). Here one computes the objective criterion (here both the -2 log likelihood evaluated at the bootstrap estimate of beta but with respect to the original design matrix and response vector, and the sum of squared errors in predicting the original response vector) for the original fit as well as for all of the bootstrap fits. The confidence set of the regression coefficients is the set of all coefficients that are associated with objective function values that are less than or equal to, say, the 0.95 quantile of the vector of B + 1 objective function values. For the coefficients satisfying this condition, predicted curves are computed at the time grid, and minima and maxima of these curves are computed separately at each time point to derive the final simultaneous confidence band.
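The construction of the simultaneous band can be sketched in base R for the simplest case: a straight-line fit, one observation per subject, residual ("x fixed") bootstrapping, and the sum of squared errors as the objective. All object names here are illustrative assumptions; rm.boot itself fits splines, handles subject effects, and also offers likelihood-based objectives.

```r
# Sketch of a simultaneous band from bootstrap coefficient vectors
set.seed(3)
n <- 50
B <- 200
x <- seq(0, 1, length.out = n)
y <- 1 + 2 * x + rnorm(n)
X <- cbind(1, x)

fit  <- lm.fit(X, y)
res  <- fit$residuals
yhat <- fit$fitted.values

coefs <- matrix(NA_real_, B + 1, 2)
coefs[1, ] <- fit$coefficients               # row 1 = original fit
for (b in 1:B) {
  ystar <- yhat + sample(res, n, replace = TRUE)
  coefs[b + 1, ] <- lm.fit(X, ystar)$coefficients
}

# Objective: SSE of each coefficient vector against the ORIGINAL response
sse   <- apply(coefs, 1, function(beta) sum((y - X %*% beta)^2))
inset <- sse <= quantile(sse, 0.95)          # coefficient confidence set

# Pointwise min and max over the curves in the set give the band
grid   <- cbind(1, seq(0, 1, length.out = 100))
curves <- grid %*% t(coefs[inset, , drop = FALSE])
lower  <- apply(curves, 1, min)
upper  <- apply(curves, 1, max)
fit0   <- drop(grid %*% coefs[1, ])          # original fitted curve
```

The original least squares fit has the smallest SSE, so it always belongs to the confidence set and its curve lies inside the band.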

By default, the log likelihoods that are computed for obtaining the simultaneous confidence band assume independence within subject. This will cause problems unless such log likelihoods have very high rank correlation with the log likelihood allowing for dependence. To allow for correlation or to estimate the correlation function, see the cor.pattern argument below.
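As an illustration of the two dependence structures that the rho and cor.pattern arguments describe, the following base R sketch builds the implied within-subject correlation matrices for 11 time points. The dampening function is the one quoted in the cor.pattern argument description; how rm.boot uses these matrices internally is not shown and is not assumed here.

```r
# Within-subject correlation matrices implied by two specifications
times <- seq(0, 1, by = 0.1)

# Serial correlation that dampens with absolute time difference
cp <- function(time1, time2) 0.2^(abs(time1 - time2) / 10)
R  <- outer(times, times, cp)

# Equal correlation rho between any two time points
rho <- 0.7
Req <- diag(rep(1 - rho, length(times))) + rho

round(R[1:3, 1:3], 3)
Req[1:3, 1:3]
```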

Usage

rm.boot(time, y, id=seq(along=time), subset,
        plot.individual=FALSE,
        bootstrap.type=c('x fixed','x random'),
        nk=6, knots, B=500, smoother=supsmu,
        xlab, xlim, ylim=range(y),
        times=seq(min(time), max(time), length=100),
        absorb.subject.effects=FALSE,
        rho=0, cor.pattern=c('independent','estimate'), ncor=10000,
        ...)

## S3 method for class 'rm.boot'
plot(x, obj2, conf.int=.95,
     xlab=x$xlab, ylab=x$ylab,
     xlim, ylim=x$ylim,
     individual.boot=FALSE,
     pointwise.band=FALSE,
     curves.in.simultaneous.band=FALSE,
     col.pointwise.band=2,
     objective=c('-2 log L','sse','dep -2 log L'), add=FALSE, ncurves,
     multi=FALSE, multi.method=c('color','density'),
     multi.conf   =c(.05,.1,.2,.3,.4,.5,.6,.7,.8,.9,.95,.99),
     multi.density=c( -1,90,80,70,60,50,40,30,20,10,  7,  4),
     multi.col    =c(  1, 8,20, 5, 2, 7,15,13,10,11,  9, 14),
     subtitles=TRUE, ...)

Arguments

time

numeric time vector

y

continuous numeric response vector of length the same as time. Subjects having multiple measurements have the measurements strung out.

x

an object returned from rm.boot

id

subject ID variable. If omitted, it is assumed that each time-response pair is measured on a different subject.

subset

subset of observations to process if not all the data

plot.individual

set to TRUE to plot nonparametrically smoothed time-response curves for each subject

bootstrap.type

specifies whether to treat the time and subject ID variables as fixed or random

nk

number of knots in the restricted cubic spline function fit. The number of knots may be 0 (denoting linear regression) or an integer greater than 2, in which case k knots result in k - 1 regression coefficients excluding the intercept. The default is 6 knots.

knots

vector of knot locations. May be specified if nk is omitted.

B

number of bootstrap repetitions. Default is 500.

smoother

a smoothing function that is used if plot.individual=TRUE. Default is supsmu.

xlab

label for x-axis. Default is the "units" attribute of the original time variable, or "Time" if no such attribute was defined using the units function.

xlim

specifies x-axis plotting limits. Default is to use the range of times specified to rm.boot.

ylim

for rm.boot this is a vector of y-axis limits used if plot.individual=TRUE. It is also passed along for later use by plot.rm.boot. For plot.rm.boot, ylim can be specified to override the value stored in the object created by rm.boot. The default is the actual range of y in the input data.

times

a sequence of times at which to evaluate fitted values and confidence limits. Default is 100 equally spaced points in the observed range of time.

absorb.subject.effects

If TRUE, adjusts the response vector y before re-sampling so that the subject-specific effects in the initial model fit are all zero. Then in re-sampling, subject effects are not used in the models. This will downplay one of the sources of variation. This option is used mainly for checking for consistency of results, as the re-sampling analyses are simpler when absorb.subject.effects=TRUE.

rho

The log-likelihood function that is used as the basis of simultaneous confidence bands assumes normality with independence within subject. To check the robustness of this assumption, if rho is not zero, the log-likelihood under multivariate normality within subject, with constant correlation rho between any two time points, is also computed. If the two log-likelihoods have the same ranks across re-samples, allowing for the correlation structure does not matter. The agreement in ranks is quantified using the Spearman rank correlation coefficient. The plot method allows the non-zero intra-subject correlation log-likelihood to be used in deriving the simultaneous confidence band. Note that this approach does assume homoscedasticity.

cor.pattern

More generally than using an equal-correlation structure, you can specify a function of two time vectors that generates as many correlations as the length of these vectors. For example, cor.pattern=function(time1,time2) 0.2^(abs(time1-time2)/10) would specify a dampening serial correlation pattern. cor.pattern can also be a list containing vectors x (a vector of absolute time differences) and y (a corresponding vector of correlations). To estimate the correlation function as a function of absolute time differences within subjects, specify cor.pattern="estimate". The products of all possible pairs of residuals (or at least up to ncor of them) within subjects will be related to the absolute time difference. The correlation function is estimated by computing the sample mean of the products of standardized residuals, stratified by absolute time difference. The correlation for a zero time difference is set to 1 regardless of the lowess estimate. NOTE: This approach fails in the presence of large subject effects; correcting for such effects removes too much of the correlation structure in the residuals.

ncor

the maximum number of pairs of time values used in estimating the correlation function if cor.pattern="estimate"

...

other arguments to pass to smoother if plot.individual=TRUE

obj2

a second object created by rm.boot that can also be passed to plot.rm.boot. This is used for two-sample problems for which the time profiles are allowed to differ between the two groups. The bootstrapped predicted y values for the second fit are subtracted from the fitted values for the first fit so that the predicted mean response for group 1 minus the predicted mean response for group 2 is what is plotted. The confidence bands that are plotted are also for this difference. For the simultaneous confidence band, the objective criterion is taken to be the sum of the objective criteria (-2 log L or sum of squared errors) for the separate fits for the two groups. The times vectors must have been identical for both calls to rm.boot, although NAs can be inserted by the user in one or both of the time vectors in the rm.boot objects so as to suppress certain sections of the difference curve from being plotted.

conf.int

the confidence level to use in constructing simultaneous, and optionally pointwise, bands. Default is 0.95.

ylab

label for y-axis. Default is the "label" attribute of the original y variable, or "y" if no label was assigned to y (using the label function, for example).

individual.boot

set to TRUE to plot the first 100 bootstrap regression fits

pointwise.band

set to TRUE to draw a pointwise confidence band in addition to the simultaneous band

curves.in.simultaneous.band

set to TRUE to draw all bootstrap regression fits that had a sum of squared errors (obtained by predicting the original y vector from the original time vector and id vector) that was less than or equal to the conf.int quantile of all bootstrapped models (plus the original model). This will show how the point by point max and min were computed to form the simultaneous confidence band.

col.pointwise.band

color for the pointwise confidence band. Default is ‘2’, which defaults to red for default Windows S-PLUS setups.

objective

the default is to use the -2 times log of the Gaussian likelihood for computing the simultaneous confidence region. If neither cor.pattern nor rho was specified to rm.boot, the independent homoscedastic Gaussian likelihood is used. Otherwise the dependent homoscedastic likelihood is used according to the specified or estimated correlation pattern. Specify objective="sse" to instead use the sum of squared errors.

add

set to TRUE to add curves to an existing plot. If you do this, titles and subtitles are omitted.

ncurves

when using individual.boot=TRUE or curves.in.simultaneous.band=TRUE, you can plot a random sample of ncurves of the fitted curves instead of plotting up to B of them.

multi

set to TRUE to draw multiple simultaneous confidence bands shaded with different colors. Confidence levels vary over the values in the multi.conf vector.

multi.method

specifies the method of shading when multi=TRUE. Default is to use colors, with the default colors chosen so that when the graph is printed under S-Plus for Windows 4.0 to an HP LaserJet printer, the confidence regions are naturally ordered by darkness of gray-scale. Regions closer to the point estimates (i.e., the center) are darker. Specify multi.method="density" to instead use densities of lines drawn per inch in the confidence regions, with all regions drawn with the default color. The polygon function is used to shade the regions.

multi.conf

vector of confidence levels, in ascending order. Default is to use 12 confidence levels ranging from 0.05 to 0.99.

multi.density

vector of densities in lines per inch corresponding to multi.conf. As is the convention in the polygon function, a density of -1 indicates a solid region.

multi.col

vector of colors corresponding to multi.conf. See multi.method for rationale.

subtitles

set to FALSE to suppress drawing subtitles for the plot

Details

Observations having missing time or y are excluded from the analysis.

As most repeated measurement studies consider the times as design points, the fixed covariable case is the default. Bootstrapping the residuals from the initial fit assumes that the model is correctly specified. Even if the covariables are fixed, doing an unconditional bootstrap is still appropriate, and for large sample sizes unconditional confidence intervals are only slightly wider than conditional ones. For moderate to small sample sizes, the bootstrap.type="x random" method can be fairly conservative.

If not all subjects have the same number of observations (after deleting observations containing missing values) and if bootstrap.type="x fixed", bootstrapped residual vectors may have a length m that is different from the number of original observations n. If m > n for a bootstrap repetition, the first n elements of the randomly drawn residuals are used. If m < n, the residual vector is appended with a random sample with replacement of length n - m from itself. A warning message is issued if this happens. If the number of time points per subject varies, the bootstrap results for bootstrap.type="x fixed" can still be invalid, as this method assumes that a vector (over subjects) of all residuals can be added to the original yhats, and varying number of points will cause mis-alignment.
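The length adjustment described above (truncate when m > n, pad by resampling when m < n) can be written as a small base R helper. The name adjust_length is hypothetical and the sketch omits the warning that rm.boot is documented to issue.

```r
# Force a bootstrapped residual vector r to have length n
adjust_length <- function(r, n) {
  m <- length(r)
  if (m > n) {
    r[seq_len(n)]                                # keep the first n elements
  } else if (m < n) {
    c(r, sample(r, n - m, replace = TRUE))       # pad with a resample of r
  } else {
    r
  }
}

set.seed(4)
adjust_length(1:10, 7)
adjust_length(1:5, 8)
```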

For bootstrap.type="x random" in the presence of significant subject effects, the analysis is approximate as the subjects used in any one bootstrap fit will not be the entire list of subjects. The average (over subjects used in the bootstrap sample) intercept is used from that bootstrap sample as a predictor of average subject effects in the overall sample.

Once the bootstrap coefficient matrix is stored by rm.boot, plot.rm.boot can be run multiple times with different options (e.g., different confidence levels).

See bootcov in the rms library for a general approach to handling repeated measurement data for ordinary linear models, binary and ordinal models, and survival models, using the unconditional bootstrap. bootcov does not handle bootstrapping residuals.

Value

an object of class rm.boot is returned by rm.boot. The principal object stored in the returned object is a matrix of regression coefficients for the original fit and all of the bootstrap repetitions (object Coef), along with vectors of the corresponding -2 log likelihoods and sums of squared errors. The original fit object from lm.fit.qr is stored in fit. For this fit, a cell means model is used for the id effects.

plot.rm.boot returns a list containing the vector of times used for plotting along with the overall fitted values, lower and upper simultaneous confidence limits, and optionally the pointwise confidence limits.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com

References

Feng Z, McLerran D, Grizzle J (1996): A comparison of statistical methods for clustered data analysis with Gaussian error. Stat in Med 15:1793–1806.

Tibshirani R, Knight K (1997): Model search and inference by bootstrap "bumping". Technical Report, Department of Statistics, University of Toronto. https://www.jstor.org/stable/1390820. Presented at the Joint Statistical Meetings, Chicago, August 1996.

Efron B, Tibshirani R (1993): An Introduction to the Bootstrap. New York: Chapman and Hall.

Diggle PJ, Verbyla AP (1998): Nonparametric estimation of covariance structure in longitudinal data. Biometrics 54:401–415.

Chapman IM, Hartman ML, et al (1997): Effect of aging on the sensitivity of growth hormone secretion to insulin-like growth factor-I negative feedback. J Clin Endocrinol Metab 82:2996–3004.

Li Y, Wang YG (2008): Smooth bootstrap methods for analysis of longitudinal data. Stat in Med 27:937–953. (potential improvements to cluster bootstrap; not implemented here)

See Also

rcspline.eval, lm, lowess, supsmu, bootcov, units, label, polygon, reShape

Examples

# Generate multivariate normal responses with equal correlations (.7)
# within subjects and no correlation between subjects
# Simulate realizations from a piecewise linear population time-response
# profile with large subject effects, and fit using a 6-knot spline
# Estimate the correlation structure from the residuals, as a function
# of the absolute time difference

# Function to generate n p-variate normal variates with mean vector u and
# covariance matrix S
# Slight modification of function written by Bill Venables
# See also the built-in function rmvnorm
mvrnorm <- function(n, p = 1, u = rep(0, p), S = diag(p)) {
  Z <- matrix(rnorm(n * p), p, n)
  t(u + t(chol(S)) %*% Z)
}

n     <- 20         # Number of subjects
sub   <- .5*(1:n)   # Subject effects

# Specify functional form for time trend and compute non-stochastic component
times <- seq(0, 1, by=.1)
g     <- function(times) 5*pmax(abs(times-.5),.3)
ey    <- g(times)

# Generate multivariate normal errors for 20 subjects at 11 times
# Assume equal correlations of rho=.7, independent subjects
nt    <- length(times)
rho   <- .7

set.seed(19)
errors <- mvrnorm(n, p=nt, S=diag(rep(1-rho,nt))+rho)
# Note:  first random number seed used gave rise to mean(errors)=0.24!

# Add E[Y], error components, and subject effects
y      <- matrix(rep(ey,n), ncol=nt, byrow=TRUE) + errors +
          matrix(rep(sub,nt), ncol=nt)

# String out data into long vectors for times, responses, and subject ID
y      <- as.vector(t(y))
times  <- rep(times, n)
id     <- sort(rep(1:n, nt))

# Show lowess estimates of time profiles for individual subjects
f <- rm.boot(times, y, id, plot.individual=TRUE, B=25, cor.pattern='estimate',
             smoother=lowess, bootstrap.type='x fixed', nk=6)
# In practice use B=400 or 500
# This will compute a dependent-structure log-likelihood in addition
# to one assuming independence.  By default, the dep. structure
# objective will be used by the plot method  (could have specified rho=.7)
# NOTE: Estimating the correlation pattern from the residual does not
# work in cases such as this one where there are large subject effects

# Plot fits for a random sample of 10 of the 25 bootstrap fits
plot(f, individual.boot=TRUE, ncurves=10, ylim=c(6,8.5))

# Plot pointwise and simultaneous confidence regions
plot(f, pointwise.band=TRUE, col.pointwise=1, ylim=c(6,8.5))

# Plot population response curve at average subject effect
ts <- seq(0, 1, length=100)
lines(ts, g(ts)+mean(sub), lwd=3)

## Not run: 
#
# Handle a 2-sample problem in which curves are fitted 
# separately for males and females and we wish to estimate the
# difference in the time-response curves for the two sexes.  
# The objective criterion will be taken by plot.rm.boot as the 
# total of the two sums of squared errors for the two models
#
knots <- rcspline.eval(c(time.f,time.m), nk=6, knots.only=TRUE)
# Use same knots for both sexes, and use a times vector that 
# uses a range of times that is included in the measurement 
# times for both sexes
#
tm <- seq(max(min(time.f),min(time.m)),
          min(max(time.f),max(time.m)),length=100)

f.female <- rm.boot(time.f, bp.f, id.f, knots=knots, times=tm)
f.male   <- rm.boot(time.m, bp.m, id.m, knots=knots, times=tm)
plot(f.female)
plot(f.male)
# The following plots female minus male response, with 
# a sequence of shaded confidence band for the difference
plot(f.female,f.male,multi=TRUE)

# Do 1000 simulated analyses to check simultaneous coverage 
# probability.  Use a null regression model with Gaussian errors
n.per.pt <- 30
n.pt     <- 10

null.in.region <- 0

for(i in 1:1000) {
  y    <- rnorm(n.pt*n.per.pt)
  time <- rep(1:n.per.pt, n.pt)
#  Add the following line and add ,id=id to rm.boot to use clustering
#  id   <- sort(rep(1:n.pt, n.per.pt))
#  Because we are ignoring patient id, this simulation is effectively
#  using 1 point from each of 300 patients, with times 1,2,3,,,30 

  f <- rm.boot(time, y, B=500, nk=5, bootstrap.type='x fixed')
  g <- plot(f, ylim=c(-1,1), pointwise=FALSE)
  null.in.region <- null.in.region + all(g$lower<=0 & g$upper>=0)
  prn(c(i=i,null.in.region=null.in.region))
}

# Simulation Results: 905/1000 simultaneous confidence bands 
# fully contained the horizontal line at zero

## End(Not run)

rmClose

Description

Remove close values from a numeric vector that are not at the outer limits. This is useful for removing axis breaks that overlap when plotting.

Usage

rmClose(x, minfrac = 0.05)

Arguments

x

a numeric vector with no NAs

minfrac

minimum allowed spacing between consecutive ordered x, as a fraction of the range of x

Value

a sorted numeric vector of non-close values of x

Author(s)

Frank Harrell

Examples

rmClose(c(1, 2, 4, 47, 48, 49, 50), minfrac=0.07)

runParallel

Description

parallel Package Easy Front-End

Usage

runParallel(
  onecore,
  reps,
  seed = round(runif(1, 0, 10000)),
  cores = max(1, parallel::detectCores() - 1),
  simplify = TRUE,
  along
)

Arguments

onecore

function to run the analysis on one core

reps

total number of repetitions

seed

specifies the base random number seed. The seed used for core i will be seed + i.

cores

number of cores to use, defaulting to one less than the number available

simplify

set to FALSE to not create an outer list if a onecore result has only one element

along

see Details

Details

Given a function onecore that runs the needed set of simulations on one CPU core, and given a total number of repetitions reps, determines the number of available cores and by default uses one less than that. reps is divided as evenly as possible over these cores, and batches are run on the cores using the parallel package mclapply function. The current per-core repetition number is continually updated in your system's temporary directory (/tmp for Linux and Mac, TEMP for Windows) in a file named progressX.log where X is the core number. The random number seed is set for each core and is equal to the scalar seed - core number + 1. The default seed is a random number between 0 and 10000 but it's best if the user provides the seed so the simulation is reproducible. The total run time is computed and printed. onecore must create a named list of all the results created during that one simulation batch. Elements of this list must be data frames, vectors, matrices, or arrays. Upon completion of all batches, all the results are rbind'd and saved in a single list.

onecore must have an argument reps that will tell the function how many simulations to run for one batch, another argument showprogress which is a function to be called inside onecore to write to the progress file for the current core and repetition, and an argument core which informs onecore which sequential core number (batch number) it is processing. When calling showprogress inside onecore, the arguments, in order, must be the integer value of the repetition to be noted, the number of reps, core, an optional 4th argument other that can contain a single character string to add to the output, and an optional 5th argument pr. You can set pr=FALSE to suppress printing and have showprogress return the file name for holding progress information if you want to customize printing.

If any of the objects appearing as list elements produced by onecore are multi-dimensional arrays, you must specify an integer value for along. This specifies to the abind package abind function the dimension along which to bind the arrays. For example, if the first dimension of the array corresponds to repetitions, you would specify along=1. All arrays present must use the same along unless along is a named vector and the names match elements of the simulation result object. Set simplify=FALSE if you don't want the result simplified if onecore produces only one list element. The default returns the first (and only) list element rather than the list if there is only one element.

When onecore returns a data.table, runParallel simplifies all this and merely rbinds all the per-core data tables into one large data table. In that case when you have onecore include a column containing a simulation number, it is wise to prepend that number with the core number so that you will have unique simulation IDs when all the cores' results are combined.
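The interface that onecore must follow can be sketched without Hmisc or parallel by running the batches serially. In this illustration the helper evenly, the no-op noprogress function, and the seed offset seed + core are assumptions made only so the example runs on its own; only the argument names reps, showprogress, and core, and the requirement to return a named list, come from the text above.

```r
# Hypothetical per-core worker following the documented onecore interface
onecore <- function(reps, showprogress, core) {
  means <- numeric(reps)
  for (i in seq_len(reps)) {
    showprogress(i, reps, core)       # repetition, number of reps, core
    means[i] <- mean(rnorm(20))
  }
  list(means = means)                 # named list of results for this batch
}

# Assumed helper: divide n repetitions as evenly as possible over m cores
evenly <- function(n, m) {
  c(rep(n %/% m + 1, n %% m), rep(n %/% m, m - n %% m))
}

# Serial stand-in for the parallel driver, with a do-nothing progress function
noprogress <- function(i, reps, core, other = "", pr = TRUE) invisible(NULL)
seed  <- 11
sizes <- evenly(10, 3)
batches <- lapply(seq_along(sizes), function(core) {
  set.seed(seed + core)               # illustrative per-core seed offset
  onecore(reps = sizes[core], showprogress = noprogress, core = core)
})
means <- unlist(lapply(batches, `[[`, "means"))
length(means)

# With Hmisc installed on a system supporting mclapply, the same worker
# would be passed along the lines of:
# r <- Hmisc::runParallel(onecore, reps = 10, seed = 11, cores = 3)
```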

See here for examples.

Value

result from combining all the parallel runs, formatting as similar to the result produced from one run as possible

Author(s)

Frank Harrell


runifChanged

Description

Re-run Code if an Input Changed

Usage

runifChanged(fun, ..., file = NULL, .print. = TRUE, .inclfun. = TRUE)

Arguments

fun

the (usually slow) function to run

...

input objects the result of running the function is dependent on

file

file in which to store the result of fun augmented by attributes containing hash digests

.print.

set to TRUE to list which objects changed that necessitated re-running fun

.inclfun.

set to FALSE to not include fun in the hash digest, i.e., to not require re-running fun if only fun itself has changed

Details

Uses hashCheck to run a function and save the results if specified inputs have changed, otherwise to retrieve results from a file. This makes it easy to see if any objects changed that require re-running a long simulation, and reports on any changes. The file name is taken as the chunk name appended with .rds unless it is given as file=. fun has no arguments. Set .inclfun.=FALSE to not include fun in the hash check (for legacy uses). The typical workflow is as follows.

f <- function(       ) {
  # . . . do the real work with multiple function calls ...
}
seed <- 3
set.seed(seed)
w <- runifChanged(f, seed, obj1, obj2, ....)

seed, obj1, obj2, ... are all the objects that f() uses that, if changed, would give a different result of f(). This can include functions such as those in a package, and f will be re-run if any of the function's code changes. f is also re-run if the code inside f changes. The result of f is stored with saveRDS by default in a file named xxx.rds where xxx is the label for the current chunk. To control this, add the file argument to runifChanged(...), i.e., use file=xxx.rds instead. If nothing has changed and the file already exists, the file is read to create the result object (e.g., w above). If f() needs to be run, the hashed input objects are stored as attributes for the result, then the enhanced result is written to the file.

See here for examples.
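The caching idea described in Details can be sketched in base R. This is only an illustration of the general technique (hash the inputs, compare with hashes stored alongside a saved result, re-run only on a mismatch) and is not the code used by runifChanged or hashCheck. The helper names md5_of and run_if_changed are hypothetical, and MD5 via tools::md5sum is an assumption chosen to avoid extra packages.

```r
# Illustrative stand-in for the caching idea; not Hmisc's implementation.
md5_of <- function(obj) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(serialize(obj, NULL), tmp)   # turn the object into bytes
  unname(tools::md5sum(tmp))            # hash those bytes
}

run_if_changed <- function(fun, ..., file) {
  hashes <- vapply(list(...), md5_of, character(1))
  if (file.exists(file)) {
    old <- readRDS(file)
    if (identical(attr(old, "hashes"), hashes)) return(old)  # inputs unchanged
  }
  result <- fun()
  attr(result, "hashes") <- hashes      # keep the digests with the result
  saveRDS(result, file)
  result
}

calls <- 0
slow  <- function() { calls <<- calls + 1; 42 }
cache <- tempfile(fileext = ".rds")
a  <- run_if_changed(slow, 3, "x", file = cache)   # runs slow()
b  <- run_if_changed(slow, 3, "x", file = cache)   # read back from the file
c2 <- run_if_changed(slow, 4, "x", file = cache)   # an input changed: re-runs
calls   # 2
```

Unlike the sketch, runifChanged can also include the function itself in the check (see .inclfun.) and reports which objects changed (see .print.).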

Value

the result of runningfun

Author(s)

Frank Harrell


Sample Size for 2-sample Binomial

Description

Computes sample size(s) for 2-sample binomial problem given vector or scalar probabilities in the two groups.

Usage

samplesize.bin(alpha, beta, pit, pic, rho=0.5)

Arguments

alpha

scalar ONE-SIDED test size, or two-sided size/2

beta

scalar or vector of powers

pit

hypothesized treatment probability of success

pic

hypothesized control probability of success

rho

proportion of the sample devoted to treated group (0 < rho < 1)

Value

TOTAL sample size(s)

AUTHOR

Rick Chappell
Dept. of Statistics and Human Oncology
University of Wisconsin at Madison
chappell@stat.wisc.edu

Examples

alpha <- .05
beta <- c(.70,.80,.90,.95)
# N1 is a matrix of total sample sizes whose
# rows vary by hypothesized treatment success probability and
# columns vary by power
# See Meinert's book for formulae.
N1 <- samplesize.bin(alpha, beta, pit=.55, pic=.5)
N1 <- rbind(N1, samplesize.bin(alpha, beta, pit=.60, pic=.5))
N1 <- rbind(N1, samplesize.bin(alpha, beta, pit=.65, pic=.5))
N1 <- rbind(N1, samplesize.bin(alpha, beta, pit=.70, pic=.5))
attr(N1,"dimnames") <- NULL
#Accounting for 5% noncompliance in the treated group
inflation <- (1/.95)**2
print(round(N1*inflation+.5,0))
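The inflation factor in the example can be checked by hand. The sketch below assumes (the source does not spell this out) that 5% noncompliance shrinks the detectable effect to 95% of its planned size, and that required sample size scales with the inverse square of the effect. n_planned is a hypothetical value used only for the arithmetic.

```r
noncompliance <- 0.05
inflation <- (1 / (1 - noncompliance))^2   # same quantity as (1/.95)**2
round(inflation, 4)                        # 1.108

n_planned <- 1000                          # hypothetical total sample size
ceiling(n_planned * inflation)             # 1109
```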

Convert a SAS Dataset to an S Data Frame

Description

Converts a SAS dataset into an S data frame. You may choose to extract only a subset of variables or a subset of observations in the SAS dataset. You may have the function automatically convert PROC FORMAT-coded variables to factor objects. The original SAS codes are stored in an attribute called sas.codes and these may be added back to the levels of a factor variable using the code.levels function. Information about special missing values may be captured in an attribute of each variable having special missing values. This attribute is called special.miss, and such variables are given class special.miss. There are print, [], format, and is.special.miss methods for such variables.

The chron function is used to set up date, time, and date-time variables. If using S-Plus 5 or 6 or later, the timeDate function is used instead. Under R, Dates is used for dates and chron for date-times. For times without dates, these still need to be stored in date-time format in POSIX. Such SAS time variables are given a major class of POSIXt and a format.POSIXt function so that the date portion (which will always be 1/1/1970) will not print by default. If a date variable represents a partial date (0.5 added if month missing, 0.25 added if day missing, 0.75 if both), an attribute partial.date is added to the variable, and the variable also becomes a class imputed variable. The describe function uses information about partial dates and special missing values. There is an option to automatically uncompress (or gunzip) compressed SAS datasets.

Usage

sas.get(libraryName, member, variables=character(0), ifs=character(0),
     format.library=libraryName, id,
     dates.=c("sas","yymmdd","yearfrac","yearfrac2"),
     keep.log=TRUE, log.file="_temp_.log", macro=sas.get.macro,
     data.frame.out=existsFunction("data.frame"), clean.up=FALSE, quiet=FALSE,
     temp=tempfile("SaS"), formats=TRUE, recode=formats,
     special.miss=FALSE, sasprog="sas",
     as.is=.5, check.unique.id=TRUE, force.single=FALSE,
     pos, uncompress=FALSE, defaultencoding="latin1", var.case="lower")

is.special.miss(x, code)

## S3 method for class 'special.miss'
x[..., drop=FALSE]

## S3 method for class 'special.miss'
print(x, ...)

## S3 method for class 'special.miss'
format(x, ...)

sas.codes(object)

code.levels(object)

Arguments

libraryName

character string naming the directory in which the dataset is kept.

drop

logical. If TRUE the result is coerced to the lowest possible dimension.

member

character string giving the second part of the two-part SAS dataset name. (The first part is irrelevant here - it is mapped to the UNIX directory name.)

x

a variable that may have been created by sas.get with special.miss=TRUE or with recode in effect.

variables

vector of character strings naming the variables in the SAS dataset. The S dataset will contain only those variables from the SAS dataset. To get all of the variables (the default), an empty string may be given. It is a fatal error if any one of the variables is not in the SAS dataset. You can use sas.contents to get the variables in the SAS dataset. If you have retrieved a subset of the variables in the SAS dataset and wish to retrieve the same list of variables from another dataset, you can program the value of variables - see one of the last examples.

ifs

a vector of character strings, each containing one SAS “subsetting if” statement. These will be used to extract a subset of the observations in the SAS dataset.

format.library

The UNIX directory containing the file ‘formats.sct’, which contains the definitions of the user defined formats used in this dataset. By default, we look for the formats in the same directory as the data. The user defined formats must be available (so SAS can read the data).

formats

Set formats to FALSE to keep sas.get from telling the SAS macro to retrieve value label formats from format.library. When you do not specify formats or recode, sas.get will set formats to TRUE if a SAS format catalog (‘.sct’ or ‘.sc2’) file exists in format.library. Value label formats if present are stored as the formats attribute of the returned object (see below). A format is used if it is referred to by one or more variables in the dataset, if it contains no ranges of values (i.e., it identifies value labels for single values), and if it is a character format or a numeric format that is not used just to label missing values. If you set recode to TRUE, 1, or 2, formats defaults to TRUE. To fetch the values and labels for variable x in the dataset d you could type:
f <- attr(d$x, "format")
formats <- attr(d, "formats")
formats[[f]]$values; formats[[f]]$labels

recode

This parameter defaults to TRUE if formats is TRUE. If it is TRUE, variables that have an appropriate format (see above) are recoded as factor objects, which map the values to the value labels for the format. Alternatively, set recode to 1 to use labels of the form value:label, e.g. 1:good 2:better 3:best. Set recode to 2 to use labels such as good(1) better(2) best(3). Since sas.codes and code.levels add flexibility, the usual choice for recode is TRUE.

special.miss

For numeric variables, any missing values are stored as NA in S. You can recover special missing values by setting special.miss to TRUE. This will cause the special.miss attribute and the special.miss class to be added to each variable that has at least one special missing value. Suppose that variable y was .E in observation 3 and .G in observation 544. The special.miss attribute for y then has the value
list(codes=c("E","G"),obs=c(3,544))
To fetch this information for variable y you would say for example
s <- attr(y, "special.miss")
s$codes; s$obs
or use is.special.miss(x) or the print.special.miss method, which will replace NA values for the variable with ‘E’ or ‘G’ if they correspond to special missing values. The describe function uses this information in printing a data summary.
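The layout of the special.miss attribute can be reproduced by hand in base R, with no SAS or Hmisc involved. The toy vector below is hypothetical (special codes at observations 3 and 5 rather than 3 and 544), and the manual substitution only mimics the kind of display the text attributes to the print method.

```r
vals <- c(1.5, 2.0, NA, 4.2, NA)           # toy data with two missing values
y <- vals
attr(y, "special.miss") <- list(codes = c("E", "G"), obs = c(3, 5))

s <- attr(y, "special.miss")
s$codes; s$obs                             # "E" "G" and 3 5

shown <- as.character(vals)
shown[s$obs] <- s$codes                    # show the codes in place of NA
shown                                      # "1.5" "2" "E" "4.2" "G"
```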

id

The name of the variable to be used as the row names of the S dataset. The id variable becomes the row.names attribute of a data frame, but the id variable is still retained as a variable in the data frame. (If data.frame.out is FALSE, this will be the attribute ‘id’ of the R dataset.) You can also specify a vector of variable names as the id parameter. After fetching the data from SAS, all these variables will be converted to character format and concatenated (with a space as a separator) to form a (hopefully) unique identification variable.
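As a rough base R picture of what concatenating several id variables with a space looks like (the data frame and its column names are hypothetical, and this is not the code sas.get runs):

```r
d <- data.frame(site = c(1, 1, 2), subj = c(10, 11, 10))
id <- paste(as.character(d$site), as.character(d$subj))  # space-separated
id                  # "1 10" "1 11" "2 10"
anyDuplicated(id)   # 0: the combined id is unique even though subj repeats
```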

dates.

specifies the format for storing SAS dates in the resulting data frame

as.is

If data.frame.out = TRUE, SAS character variables are converted to S factor objects if as.is = FALSE or if as.is is a number between 0 and 1 inclusive and the number of unique values of the variable is less than the number of observations (n) times as.is. The default of as.is is 0.5, so character variables are converted to factors only if they have fewer than n/2 unique values. The primary purpose of this is to keep unique identification variables as character values in the data frame instead of using more space to store both the integer factor codes and the factor labels.
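The as.is rule can be written out as a small base R function. maybe_factor is a hypothetical helper that mirrors the rule as stated in the text; it is not Hmisc's internal code, and treating as.is = TRUE as "never convert" is an assumption.

```r
maybe_factor <- function(x, as.is = 0.5) {
  if (isTRUE(as.is))  return(x)           # assumed: TRUE means leave as character
  if (isFALSE(as.is)) return(factor(x))   # FALSE: always convert
  # numeric as.is: convert only if unique values < n * as.is
  if (length(unique(x)) < length(x) * as.is) factor(x) else x
}

sex <- c("m", "f", "f", "m", "f", "m")        # 2 unique values, n = 6
id  <- c("a1", "a2", "a3", "a4", "a5", "a6")  # 6 unique values, n = 6
is.factor(maybe_factor(sex))   # TRUE:  2 < 6 * 0.5
is.factor(maybe_factor(id))    # FALSE: 6 is not < 3
```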

check.unique.id

If id is specified, the row names are checked for uniqueness if check.unique.id = TRUE. If any are duplicated, a warning is printed. Note that if a data frame is being created with duplicate row names, statements such as my.data.frame["B23",] will retrieve only the first row with a row name of B23.

force.single

By default, SAS numeric variables having LENGTH > 4 are stored as S double precision numerics, which allow for the same precision as a SAS LENGTH 8 variable. Set force.single = TRUE to store every numeric variable in single precision (7 digits of precision). This option is useful when the creator of the SAS dataset has failed to use a LENGTH statement. R does not have single precision, so no attempt is made to convert to single if running R.

dates

One of the character strings "sas", "yearfrac", "yearfrac2", "yymmdd". If a SAS variable has a date format (one of "DATE", "MMDDYY", "YYMMDD", "DDMMYY", "YYQ", "MONYY", "JULIAN"), it will be converted to the format specified by dates before being given to S. "sas" gives days from 1/1/1960 (from 1/1/1970 if using chron), "yearfrac" gives days from 1/1/1900 divided by 365.25, "yearfrac2" gives year plus fraction of current year, and "yymmdd" gives a 6 digit number YYMMDD (year%%100, month, day). Note that R will store these as numbers, not as character strings. If dates="sas" and a variable has one of the SAS date formats listed above, the variable will be given a class of ‘date’ to work with Terry Therneau's implementation of the ‘date’ class in S. If the chron package or timeDate function is available, these are used instead.
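The numeric encodings described for this argument can be worked through in base R for a single date. This is a sketch of the arithmetic only and does not call sas.get; the date 3 March 2002 is borrowed from the test data shown in the Examples.

```r
d <- as.Date("2002-03-03")

# "sas": days since 1960-01-01
sas_days <- as.numeric(d - as.Date("1960-01-01"))
sas_days                                   # 15402
as.Date(sas_days, origin = "1960-01-01")   # back to "2002-03-03"

# "yearfrac": days since 1900-01-01 divided by 365.25
yearfrac <- as.numeric(d - as.Date("1900-01-01")) / 365.25

# "yymmdd": built from (year %% 100, month, day), stored as a number
yymmdd <- as.numeric(format(d, "%y%m%d"))
yymmdd                                     # 20303, i.e. 020303 without its leading zero
```

The last line shows why the text warns that these values are numbers rather than character strings: the leading zero of 020303 is not kept.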

keep.log

logical flag: if FALSE, delete the SAS log file upon completion.

log.file

the name of the SAS log file.

macro

the name of an S object in the current search path that contains the text of the SAS macro called by R. The R object is a character vector that can be edited using for example sas.get.macro <- editor(sas.get.macro).

data.frame.out

logical flag: if TRUE, the return value will be an S data frame, otherwise it will be a list.

clean.up

logical flag: if TRUE, remove all temporary files when finished. You may want to keep these while debugging the SAS macro. Not needed for R.

quiet

logical flag: if FALSE, print the contents of the SAS log file if there has been an error.

temp

the prefix to use for the temporary files. Two characters will be added to this; the resulting name must fit on your file system.

sasprog

the name of the system command to invoke SAS

uncompress

set to TRUE to automatically invoke the UNIX gunzip command (if ‘member.ssd01.gz’ exists) or the uncompress command (if ‘member.ssd01.Z’ exists) to uncompress the SAS dataset before proceeding. This assumes you have the file permissions to allow uncompressing in place. If the file is already uncompressed, this option is ignored.

pos

by default, a list or data frame which contains all the variables is returned. If you specify pos, each individual variable is placed into a separate object (whose name is the name of the variable) using the assign function with the pos argument. For example, you can put each variable in its own file in a directory, which in some cases may save memory over attaching a data frame.

code

a special missing value code (‘A’ through ‘Z’ or ‘_’) to check against. If code is omitted, is.special.miss will return a TRUE for each observation that has any special missing value.

defaultencoding

encoding to assume if the SAS dataset does not specify one. Defaults to "latin1".

var.case

default is to change case of SAS variable names to lower case. Specify alternatively "upper" or "preserve".

object

a variable in a data frame created by sas.get

...

ignored

Details

If you specify special.miss = TRUE and there are no special missing values in the SAS dataset, the SAS step will fail.

For variables having a PROC FORMAT VALUE format with some of the levels undefined, sas.get will interpret those values as NA if you are using recode.

The SAS macro ‘sas_get’ uses record lengths of up to 4096 in two places. If you are exporting records that are very long (because of a large number of variables and/or long character variables), you may want to edit these LRECLs to quadruple them, for example.

Value

If data.frame.out is TRUE, the output will be a data frame resembling the SAS dataset. If id was specified, that column of the data frame will be used as the row names of the data frame. Each variable in the data frame or vector in the list will have the attributes label and format containing SAS labels and formats. Underscores in formats are converted to periods. Formats for character variables have $ placed in front of their names. If formats is TRUE and there are any appropriate format definitions in format.library, the returned object will have attribute formats containing lists named the same as the format names (with periods substituted for underscores and character formats prefixed by $). Each of these lists has a vector called values and one called labels with the PROC FORMAT; VALUE ... definitions.

If data.frame.out is FALSE, the output will be a list of vectors, each containing a variable from the SAS dataset. If id was specified, that element of the list will be used as the id attribute of the entire list.

Side Effects

If a SAS error occurs and quiet is FALSE, then the SAS log file will be printed under the control of the less pager.

BACKGROUND

The references cited below explain the structure of SAS datasets and how they are stored under UNIX. See SAS Language for a discussion of the “subsetting if” statement.

Note

You must be able to run SAS (by typing sas) on your system. If the S command !sas does not start SAS, then this function cannot work.

If you are reading time or date-time variables, you will need to execute the command library(chron) to print those variables or the data frame if the timeDate function is not available.

Author(s)

Terry Therneau, Mayo Clinic
Frank Harrell, Vanderbilt University
Bill Dunlap, University of Washington and Insightful Corporation
Michael W. Kattan, Cleveland Clinic Foundation
Reinhold Koch (encoding)

References

SAS Institute Inc. (1990). SAS Language: Reference, Version 6. First Edition. SAS Institute Inc., Cary, North Carolina.

SAS Institute Inc. (1988). SAS Technical Report P-176, Using the SAS System, Release 6.03, under UNIX Operating Systems and Derivatives. SAS Institute Inc., Cary, North Carolina.

SAS Institute Inc. (1985). SAS Introductory Guide. Third Edition. SAS Institute Inc., Cary, North Carolina.

See Also

data.frame, describe, label, upData, cleanup.import

Examples

## Not run:
sas.contents("saslib", "mice")
# [1] "dose"  "ld50"  "strain"  "lab_no"
# attr(, "n"):
# [1] 117
mice <- sas.get("saslib", mem="mice", var=c("dose", "strain", "ld50"))
plot(mice$dose, mice$ld50)

nude.mice <- sas.get(lib=unix("echo $HOME/saslib"), mem="mice",
                     ifs="if strain='nude'")

nude.mice.dl <- sas.get(lib=unix("echo $HOME/saslib"), mem="mice",
                        var=c("dose", "ld50"), ifs="if strain='nude'")

# Get a dataset from current directory, recode PROC FORMAT; VALUE ...
# variables into factors with labels of the form "good(1)" "better(2)",
# get special missing values, recode missing codes .D and .R into new
# factor levels "Don't know" and "Refused to answer" for variable q1
d <- sas.get(".", "mydata", recode=2, special.miss=TRUE)
attach(d)
nl <- length(levels(q1))
lev <- c(levels(q1), "Don't know", "Refused")
q1.new <- as.integer(q1)
q1.new[is.special.miss(q1,"D")] <- nl+1
q1.new[is.special.miss(q1,"R")] <- nl+2
q1.new <- factor(q1.new, 1:(nl+2), lev)
# Note: would like to use factor() in place of as.integer ... but
# factor in this case adds "NA" as a category level

d <- sas.get(".", "mydata")
sas.codes(d$x)    # for PROC FORMATted variables returns original data codes
d$x <- code.levels(d$x)   # or attach(d); x <- code.levels(x)
# This makes levels such as "good" "better" "best" into e.g.
# "1:good" "2:better" "3:best", if the original SAS values were 1,2,3

# Retrieve the same variables from another dataset (or an update of
# the original dataset)
mydata2 <- sas.get('mydata2', var=names(d))
# This only works if none of the original SAS variable names contained _
mydata2 <- cleanup.import(mydata2) # will make true integer variables

# Code from Don MacQueen to generate SAS dataset to test import of
# date, time, date-time variables
# data ssd.test;
#     d1='3mar2002'd ;
#     dt1='3mar2002 9:31:02'dt;
#     t1='11:13:45't;
#     output;
#
#     d1='3jun2002'd ;
#     dt1='3jun2002 9:42:07'dt;
#     t1='11:14:13't;
#     output;
#     format d1 mmddyy10. dt1 datetime. t1 time.;
# run;

## End(Not run)

Enhanced Importing of SAS Transport Files using read.xport

Description

Uses the read.xport and lookup.xport functions in the foreign library to import SAS datasets. SAS date, time, and date/time variables are converted respectively to Date, POSIX, or POSIXct objects in R, variable names are converted to lower case, SAS labels are associated with variables, and (by default) integer-valued variables are converted from storage mode double to integer. If the user ran PROC FORMAT CNTLOUT= in SAS and included the resulting dataset in the SAS version 5 transport file, variables having customized formats that do not include any ranges (i.e., variables having standard PROC FORMAT; VALUE label formats) will have their format labels looked up, and these variables are converted to S factors.

For those users having access to SAS, method='csv' is preferred when importing several SAS datasets. Run SAS macro exportlib.sas available from https://github.com/harrelfe/Hmisc/blob/master/src/sas/exportlib.sas to convert all SAS datasets in a SAS data library (from any engine supported by your system) into CSV files. If any customized formats are used, it is assumed that the PROC FORMAT CNTLOUT= dataset is in the data library as a regular SAS dataset, as above.

sasdsLabels reads a file containing PROC CONTENTS printed output to parse dataset labels, assuming that PROC CONTENTS was run on an entire library.
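For orientation, here is how a raw SAS datetime value relates to an R POSIXct object. This is a base R sketch of the arithmetic, not the conversion code inside sasxport.get. It relies on the SAS convention that datetimes count seconds from 1 January 1960, and the raw value is worked out by hand for '3mar2002 9:31:02'dt, the value used in the test data of the Examples.

```r
# 15402 days from 1960-01-01 to 2002-03-03, plus 9:31:02 in seconds
sas_datetime <- 15402 * 86400 + 9 * 3600 + 31 * 60 + 2
sas_datetime                                   # 1330767062

dt <- as.POSIXct(sas_datetime, origin = "1960-01-01", tz = "GMT")
format(dt, "%Y-%m-%d %H:%M:%S", tz = "GMT")    # "2002-03-03 09:31:02"
```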

Usage

sasxport.get(file, lowernames=TRUE, force.single = TRUE,
             method=c('read.xport','dataload','csv'), formats=NULL, allow=NULL,
             out=NULL, keep=NULL, drop=NULL, as.is=0.5, FUN=NULL)

sasdsLabels(file)

Arguments

file

name of a file containing the SAS transport file. file may be a URL beginning with https://. For sasdsLabels, file is the name of a file containing a PROC CONTENTS output listing. For method='csv', file is the name of the directory containing all the CSV files created by running the exportlib SAS macro.

lowernames

set to FALSE to keep from converting SAS variable names to lower case

force.single

set to FALSE to keep integer-valued variables not exceeding 2^31-1 in value from being converted to integer storage mode
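The conversion that force.single controls can be pictured with a small base R function. to_integer_if_safe is a hypothetical helper written for this illustration; it follows the condition stated above (integer-valued, magnitude not exceeding 2^31-1) and is not the code used by sasxport.get.

```r
to_integer_if_safe <- function(x) {
  ok <- is.numeric(x) &&
        all(is.na(x) | (x == round(x) & abs(x) <= 2^31 - 1))
  if (ok) storage.mode(x) <- "integer"   # doubles become integers, NA stays NA
  x
}

age <- c(30, 31, NA)
typeof(age)                              # "double"
typeof(to_integer_if_safe(age))          # "integer"
typeof(to_integer_if_safe(c(0.5, 2)))    # "double": not integer-valued
```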

method

set to "dataload" if you have the dataload executable installed and want to use it instead of read.xport. This seems to correct some errors in which rarely some factor variables are always missing when read by read.xport when in fact they have some non-missing values.

formats

a data frame or list (like that created by read.xport) containing PROC FORMAT output, if such output is not stored in the main transport file.

allow

a vector of characters allowed by R that should not be converted to periods in variable names. By default, underscores in variable names are converted to periods as with R before version 1.9.

out

a character string specifying a directory in which to write separate R save files (.rda files) for each regular dataset. Each file and the data frame inside it is named with the SAS dataset name translated to lower case and with underscores changed to periods. The default NULL value of out results in a data frame or a list of data frames being returned. When out is given, sasxport.get returns only metadata (see below), invisibly. out only works with method='csv'. out should not have a trailing slash.

keep

a vector of names of SAS datasets to process (original SAS upper case names). Must include PROC FORMAT dataset if it exists, and if the kept datasets use any of its value label formats.

drop

a vector of names of SAS datasets to ignore (original SAS upper case names)

as.is

SAS character variables are converted to S factor objects if as.is=FALSE or if as.is is a number between 0 and 1 inclusive and the number of unique values of the variable is less than the number of observations (n) times as.is. The default of as.is is .5, so character variables are converted to factors only if they have fewer than n/2 unique values. The primary purpose of this is to keep unique identification variables as character values in the data frame instead of using more space to store both the integer factor codes and the factor labels.

FUN

an optional function that will be run on each data frame created, when method='csv' and out are specified. The result of all the FUN calls is made into a list corresponding to the SAS datasets that are read. This list is the FUN attribute of the result returned by sasxport.get.

Details

See contents.list for a way to print the directory of SAS datasets when more than one was imported.

Value

If there is more than one dataset in the transport file other than the PROC FORMAT file, the result is a list of data frames containing all the non-PROC FORMAT datasets. Otherwise the result is the single data frame. There is an exception if out is specified; that causes separate R save files to be written and the returned value to be a list corresponding to the SAS datasets, with key PROC CONTENTS information in a data frame making up each part of the list. sasdsLabels returns a named vector of dataset labels, with names equal to the dataset names.

Author(s)

Frank E Harrell Jr

See Also

read.xport, label, sas.get, Dates, DateTimeClasses, lookup.xport, contents, describe

Examples

## Not run:
# SAS code to generate test dataset:
# libname y SASV5XPT "test2.xpt";
#
# PROC FORMAT; VALUE race 1=green 2=blue 3=purple; RUN;
# PROC FORMAT CNTLOUT=format;RUN;  * Name, e.g. 'format', unimportant;
# data test;
# LENGTH race 3 age 4;
# age=30; label age="Age at Beginning of Study";
# race=2;
# d1='3mar2002'd ;
# dt1='3mar2002 9:31:02'dt;
# t1='11:13:45't;
# output;
#
# age=31;
# race=4;
# d1='3jun2002'd ;
# dt1='3jun2002 9:42:07'dt;
# t1='11:14:13't;
# output;
# format d1 mmddyy10. dt1 datetime. t1 time. race race.;
# run;
# data z; LENGTH x3 3 x4 4 x5 5 x6 6 x7 7 x8 8;
#    DO i=1 TO 100;
#        x3=ranuni(3);
#        x4=ranuni(5);
#        x5=ranuni(7);
#        x6=ranuni(9);
#        x7=ranuni(11);
#        x8=ranuni(13);
#        output;
#        END;
#    DROP i;
#    RUN;
# PROC MEANS; RUN;
# PROC COPY IN=work OUT=y;SELECT test format z;RUN; *Creates test2.xpt;

w <- sasxport.get('test2.xpt')
# To use an existing copy of test2.xpt available on the web:
w <- sasxport.get('https://github.com/harrelfe/Hmisc/raw/master/inst/tests/test2.xpt')

describe(w$test)   # see labels, format names for dataset test
# Note: if only one dataset (other than format) had been exported,
# just do describe(w) as sasxport.get would not create a list for that
lapply(w, describe)# see descriptive stats for both datasets
contents(w$test)   # another way to see variable attributes
lapply(w, contents)# show contents of both datasets
options(digits=7)  # compare the following matrix with PROC MEANS output
t(sapply(w$z, function(x)
  c(Mean=mean(x),SD=sqrt(var(x)),Min=min(x),Max=max(x))))

## End(Not run)

One-Dimensional Scatter Diagram, Spike Histogram, or Density

Description

scat1d adds tick marks (bar codes, rug plot) on any of the four sides of an existing plot, corresponding with non-missing values of a vector x. This is used to show the data density. Can also place the tick marks along a curve by specifying y-coordinates to go along with the x values.

If any two values of x are within eps*w of each other, where eps defaults to .001 and w is the span of the intended axis, values of x are jittered by adding a value uniformly distributed in [-jitfrac*w, jitfrac*w], where jitfrac defaults to .008. Specifying preserve=TRUE invokes jitter2 with a different logic of jittering. Allows plotting random sub-segments to handle very large x vectors (see tfrac).
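The default jittering rule can be sketched in base R. This is an interpretation of the paragraph above rather than scat1d's source: the span w is taken here to be the range of the data (scat1d uses the span of the intended axis), and all values are jittered once any pair is too close, which is one reading of the text.

```r
set.seed(2)
x <- c(1, 1, 2, 5, 5.0001, 9)
eps <- 0.001
jitfrac <- 0.008

w <- diff(range(x))                          # assumed stand-in for the axis span
too_close <- any(diff(sort(x)) < eps * w)    # any two values within eps*w?

x_j <- if (too_close) {
  x + runif(length(x), -jitfrac * w, jitfrac * w)   # uniform noise
} else x

too_close                                    # TRUE: the two 1s coincide
max(abs(x_j - x)) <= jitfrac * w             # TRUE: shifts stay within jitfrac*w
```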

jitter2 is a generic method for jittering, which does not add random noise. It retains unique values and ranks, and randomly spreads duplicate values at equidistant positions within limits of enclosing values. jitter2 is especially useful for numeric variables with discrete values, like rating scales. Missing values are allowed and are returned. Currently implemented methods are jitter2.default for vectors and jitter2.data.frame which returns a data.frame with each numeric column jittered.

datadensity is a generic method used to show data densities in more complex situations. Here, another datadensity method is defined for data frames. Depending on the which argument, some or all of the variables in a data frame will be displayed, with scat1d used to display continuous variables and, by default, bars used to display frequencies of categorical, character, or discrete numeric variables. For such variables, when the total length of value labels exceeds 200, only the first few characters from each level are used. By default, datadensity.data.frame will construct one axis (i.e., one strip) per variable in the data frame. Variable names appear to the left of the axes, and the number of missing values (if greater than zero) appear to the right of the axes. An optional group variable can be used for stratification, where the different strata are depicted using different colors. If the q vector is specified, the desired quantiles (over all groups) are displayed with solid triangles below each axis.

When the sample size exceeds 2000 (this value may be modified using the nhistSpike argument), datadensity calls histSpike instead of scat1d to show the data density for numeric variables. This results in a histogram-like display that makes the resulting graphics file much smaller. In this case, datadensity uses the minf argument (see below) so that very infrequent data values will not be lost on the variable's axis, although this will slightly distort the histogram.

histSpike is another method for showing a high-resolution data distribution that is particularly good for very large datasets (say n > 1000). By default, histSpike bins the continuous x variable into 100 equal-width bins and then computes the frequency counts within bins (if n does not exceed 10, no binning is done). If add=FALSE (the default), the function displays either proportions or frequencies as in a vertical histogram. Instead of bars, spikes are used to depict the frequencies. If add=TRUE, the function assumes you are adding small density displays that are intended to take up a small amount of space in the margins of the overall plot. The frac argument is used as with scat1d to determine the relative length of the whole plot that is used to represent the maximum frequency. No jittering is done by histSpike.
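The binning step behind a spike histogram can be reproduced with base R tools. The sketch below only illustrates equal-width binning into nint = 100 bins and the resulting proportions; it does not call histSpike, and its handling of bin edges may differ from the package's.

```r
set.seed(3)
x <- rnorm(5000)
nint <- 100

breaks <- seq(min(x), max(x), length.out = nint + 1)   # equal-width bin edges
bin    <- cut(x, breaks, include.lowest = TRUE)
counts <- as.vector(table(bin))                        # frequency per bin
props  <- counts / sum(counts)                         # proportion per bin
mids   <- (breaks[-1] + breaks[-length(breaks)]) / 2   # bin midpoints

length(counts)   # 100
sum(counts)      # 5000
# A spike display could then be drawn with, e.g.,
# plot(mids, props, type = "h", xlab = "x", ylab = "Proportion")
```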

histSpike can also graph a kernel density estimate for x, or add a small density curve to any of 4 sides of an existing plot. When y or curve is specified, the density or spikes are drawn with respect to the curve rather than the x-axis.

histSpikeg is similar to histSpike but is for adding layers to a ggplot2 graphics object or traces to a plotly object. histSpikeg can also add lowess curves to the plot.

ecdfpM makes a plotly graph or series of graphs showing possibly superposed empirical cumulative distribution functions.

Usage

scat1d(x, side=3, frac=0.02, jitfrac=0.008, tfrac,
       eps=ifelse(preserve,0,.001),
       lwd=0.1, col=par("col"),
       y=NULL, curve=NULL,
       bottom.align=FALSE,
       preserve=FALSE, fill=1/3, limit=TRUE, nhistSpike=2000, nint=100,
       type=c('proportion','count','density'), grid=FALSE, ...)

jitter2(x, ...)

## Default S3 method:
jitter2(x, fill=1/3, limit=TRUE, eps=0,
        presorted=FALSE, ...)

## S3 method for class 'data.frame'
jitter2(x, ...)

datadensity(object, ...)

## S3 method for class 'data.frame'
datadensity(object, group,
            which=c("all","continuous","categorical"),
            method.cat=c("bar","freq"),
            col.group=1:10,
            n.unique=10, show.na=TRUE, nint=1, naxes,
            q, bottom.align=nint>1,
            cex.axis=sc(.5,.3), cex.var=sc(.8,.3),
            lmgp=NULL, tck=sc(-.009,-.002),
            ranges=NULL, labels=NULL, ...)
# sc(a,b) means default to a if number of axes <= 3, b if >=50, use
# linear interpolation within 3-50

histSpike(x, side=1, nint=100, bins=NULL, frac=.05, minf=NULL, mult.width=1,
          type=c('proportion','count','density'),
          xlim=range(x), ylim=c(0,max(f)), xlab=deparse(substitute(x)),
          ylab=switch(type,proportion='Proportion',
                           count     ='Frequency',
                           density   ='Density'),
          y=NULL, curve=NULL, add=FALSE, minimal=FALSE,
          bottom.align=type=='density', col=par('col'), lwd=par('lwd'),
          grid=FALSE, ...)

histSpikeg(formula=NULL, predictions=NULL, data, plotly=NULL,
           lowess=FALSE, xlim=NULL, ylim=NULL,
           side=1, nint=100,
           frac=function(f) 0.01 + 0.02*sqrt(f-1)/sqrt(max(f,2)-1),
           span=3/4, histcol='black', showlegend=TRUE)

ecdfpM(x, group=NULL, what=c('F','1-F','f','1-f'), q=NULL,
       extra=c(0.025, 0.025), xlab=NULL, ylab=NULL, height=NULL, width=NULL,
       colors=NULL, nrows=NULL, ncols=NULL, ...)

Arguments

x

a vector of numeric data, or a data frame (for jitter2 or ecdfpM)

object

a data frame or list (even with unequal number of observations per variable, as long as group is not specified)

side

axis side to use (1=bottom (default for histSpike), 2=left, 3=top (default for scat1d), 4=right)

frac

fraction of smaller of vertical and horizontal axes for tick mark lengths. Can be negative to move tick marks outside of plot. For histSpike, this is the relative y-direction length to be used for the largest frequency. When scat1d calls histSpike, it multiplies its frac argument by 2.5. For histSpikeg, frac is a function of f, the vector of all frequencies. The default function scales tick marks so that they are between 0.01 and 0.03 of the y range, linearly scaled in the square root of the frequency less one.
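The default frac function for histSpikeg, copied from the Usage section, can be evaluated directly to confirm the stated 0.01 to 0.03 range. The frequency vector f is hypothetical.

```r
frac <- function(f) 0.01 + 0.02 * sqrt(f - 1) / sqrt(max(f, 2) - 1)

f <- c(1, 2, 5, 10, 50)     # hypothetical bin frequencies
round(frac(f), 4)
range(frac(f))              # 0.01 0.03
```

A frequency of 1 maps to the minimum length 0.01 and the largest frequency maps to 0.03, with intermediate values growing with the square root of the frequency less one.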

jitfrac

fraction of axis for jittering. If jitfrac <= 0, no jittering is done. If preserve=TRUE, the amount of jittering is independent of jitfrac.

tfrac

Fraction of tick mark to actually draw. If tfrac < 1, will draw a random fraction tfrac of the line segment at each point. This is useful for very large samples or ones with some very dense points. The default value is 1 if the number of non-missing observations n is less than 125, and max(.1, 125/n) otherwise.
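The default for tfrac stated above can be written as a one-line function to see how it behaves as n grows. default_tfrac is a hypothetical name used for this illustration only.

```r
default_tfrac <- function(n) if (n < 125) 1 else max(0.1, 125 / n)

sapply(c(100, 125, 500, 5000), default_tfrac)   # 1.00 1.00 0.25 0.10
```

The fraction drawn shrinks in proportion to 125/n but never falls below 0.1.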

eps

fraction of axis for determining overlapping points in x. For preserve=TRUE the default is 0 and original unique values are retained. Bigger values of eps tend to bias observations from dense to sparse regions, but ranks are still preserved.

lwd

line width for tick marks, passed to segments

col

color for tick marks, passed to segments

y

specify a vector the same length as x to draw tick marks along a curve instead of by one of the axes. The y values are often predicted values from a model. The side argument is ignored when y is given. If the curve is already represented as a table look-up, you may specify it using the curve argument instead. y may be a scalar to use a constant vertical placement.

curve

a list containing elements x and y for which linear interpolation is used to derive y values corresponding to values of x. This results in tick marks being drawn along the curve. For histSpike, interpolated y values are derived for bin midpoints.

minimal

for histSpike set minimal=TRUE to draw a minimalist spike histogram with no y-axis. This works best when producing graphics images that are short, e.g., have a height of two inches. add is forced to be FALSE in this case so that a standalone graph is produced. Only base graphics are used.

bottom.align

set to TRUE to have the bottoms of tick marks (for side=1 or side=3) aligned at the y-coordinate. The default behavior is to center the tick marks. For datadensity.data.frame, bottom.align defaults to TRUE if nint>1. In other words, if you are only labeling the first and last axis tick mark, the scat1d tick marks are centered on the variable's axis.

preserve

set to TRUE to invoke jitter2

fill

maximum fraction of the axis filled by jittered values. If d are duplicated values between a lower value l and upper value u, then d will be spread within ± fill*min(u-d, d-l)/2.

limit

specifies a limit for maximum shift in jittered values. Duplicate values will be spread within ± fill*min(u-d, d-l)/2. The default TRUE restricts jittering to the smallest min(u-d, d-l)/2 observed and results in an equal amount of jittering for all d. Setting to FALSE allows for locally different amounts of jittering, using the maximum space available.

nhistSpike

If the number of observations equals or exceeds nhistSpike, scat1d will automatically call histSpike to draw the data density, to prevent the graphics file from being too large.

type

used by or passed to histSpike. Set to "count" to display frequency counts rather than relative frequencies, or "density" to display a kernel density estimate computed using the density function.

grid

set to TRUE if the R grid package is in effect for the current plot

nint

number of intervals to divide each continuous variable's axis for datadensity. For histSpike, this is the number of equal-width intervals for which to bin x, and if instead nint is a character string (e.g., nint="all"), the frequency tabulation is done with no binning. In other words, frequencies for all unique values of x are derived and plotted. For histSpikeg, if x has no more than nint unique values, all observed values are used, otherwise the data are rounded before tabulation so that there are no more than nint intervals. For histSpike, nint is ignored if bins is given.

bins

for histSpike specifies the actual cutpoints to use for binning x. The default is to use nint in conjunction with xlim.

...

optional arguments passed to scat1d from datadensity or to histSpike from scat1d. For histSpikep, they are passed to the lines list to add_trace. For ecdfpM these arguments are passed to add_lines.

presorted

set to TRUE to prevent sorting for determining the order l<d<u. This is useful if an existing meaningful local order would be destroyed by sorting, as in sin(pi*sort(round(runif(1000,0,10),1))).

group

an optional stratification variable, which is converted to a factor vector if it is not one already

which

set which="continuous" to only plot continuous variables, or which="categorical" to only plot categorical, character, or discrete numeric ones. By default, all types of variables are depicted.

method.cat

set method.cat="freq" to depict frequencies of categorical variables with digits representing the cell frequencies, with size proportional to the square root of the frequency. By default, vertical bars are used.

col.group

colors representing the group strata. The vector of colors is recycled to be the same length as the levels of group.

n.unique

number of unique values a numeric variable must have before it is considered to be a continuous variable

show.na

set to FALSE to suppress drawing the number of NAs to the right of each axis

naxes

number of axes to draw on each page before starting a new plot. You can set naxes larger than the number of variables in the data frame if you want to compress the plot vertically.

q

a vector of quantiles to display. By default, quantiles are not shown.

extra

a two-vector specifying the fraction of the x range to add on the left and the fraction to add on the right

cex.axis

character size for drawing labels for axis tick marks

cex.var

character size for variable names and frequency of NAs

lmgp

spacing between numeric axis labels and axis (see par for mgp)

tck

see tck under par

ranges

a list containing ranges for some or all of the numeric variables. If ranges is not given or if a certain variable is not found in the list, the empirical range, modified by pretty, is used. Example: ranges=list(age=c(10,100), pressure=c(50,150)).

labels

a vector of labels to use in labeling the axes for datadensity.data.frame. Default is to use the names of the variables in the input data frame. Note: margin widths computed for setting aside names of variables use the names, and not these labels.

minf

For histSpike, if minf is specified, low bin frequencies are set to a minimum value of minf times the maximum bin frequency, so that rare data points will remain visible. A good choice of minf is 0.075. datadensity.data.frame passes minf=0.075 to scat1d to pass to histSpike. Note that specifying minf will cause the shape of the histogram to be distorted somewhat.

mult.width

multiplier for the smoothing window width computed by histSpike when type="density"

xlim

a 2-vector specifying the outer limits of x for binning (and plotting, if add=FALSE and nint is a number). For histSpikeg, observations outside the xlim range are ignored.

ylim

y-axis range for plotting (if add=FALSE). Often needed for histSpikeg to help scale the tick mark line segments.

xlab

x-axis label (add=FALSE or for ecdfpM); default is name of input argument, or for ecdfpM comes from label and units attributes of the analysis variable. For ecdfpM, xlab may be a vector if there is more than one analysis variable.

ylab

y-axis label (add=FALSE or for ecdfpM)

add

set to TRUE to add the spike-histogram to an existing plot, to show marginal data densities

formula

a formula of the form y ~ x1 or y ~ x1 + ... where y is the name of the y-axis variable being plotted with ggplot, x1 is the name of the x-axis variable, and optional ... are variables used by ggplot to produce multiple curves on a panel and/or facets.

predictions

the data frame being plotted by ggplot, containing x and y coordinates of curves. If omitted, spike histograms are drawn at the bottom (default) or top of the plot according to side.

data

for histSpikeg is a mandatory data frame containing raw data whose frequency distribution is to be summarized, using variables in formula.

plotly

an existing plotly object. If not NULL, histSpikeg uses plotly instead of ggplot.

lowess

set to TRUE to have histSpikeg add a geom_line layer to the ggplot2 graphic, containing lowess() nonparametric smoothers. This causes the returned value of histSpikeg to be a list with two components: "hist" and "lowess" each containing a layer. Fortunately, ggplot2 plots both layers automatically. If the dependent variable is binary, iter=0 is passed to lowess so that outlier detection is turned off; otherwise iter=3 is passed.

span

passed to lowess as the f argument

histcol

color of line segments (tick marks) for histSpikeg. Default is black. Set to any color or to "default" to use the prevailing colors for the graphic.

showlegend

set to FALSE to have the added plotly traces not have entries in the plot legend

what

set to "1-F" to plot 1 minus the ECDF instead of the ECDF, "f" to plot cumulative frequency, or "1-f" to plot the inverse cumulative frequency

height,width

passed to plot_ly

colors

a vector of colors to pass to add_lines

nrows,ncols

passed to plotly::subplot
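The two data-dependent defaults described above, frac for histSpikeg and tfrac for scat1d, can be checked with plain arithmetic. The sketch below copies the frac function from the Usage section and restates the tfrac rule from its argument description. The helper name tfrac_default and the example frequencies are illustrative assumptions and are not part of Hmisc.

```r
# Default 'frac' function for histSpikeg, copied from the Usage section
frac <- function(f) 0.01 + 0.02 * sqrt(f - 1) / sqrt(max(f, 2) - 1)

# Hypothetical bin frequencies: spike lengths run from 0.01 (f = 1)
# up to 0.03 (largest frequency) of the y range
f <- c(1, 2, 5, 10, 17)
spike_len <- frac(f)
print(spike_len)

# Restatement of the documented default for 'tfrac' in scat1d
# (tfrac_default is a hypothetical helper, not an Hmisc function)
tfrac_default <- function(n) if (n < 125) 1 else max(0.1, 125 / n)
tf <- sapply(c(100, 250, 5000), tfrac_default)
print(tf)
```

With these frequencies the spike lengths are evenly spaced in sqrt(f - 1), which is why a bin with 17 observations is only three times as tall as a bin with one.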

Details

For scat1d the length of line segments used is frac*min(par()$pin)/par()$uin[opp] data units, where opp is the index of the opposite axis and frac defaults to .02. Assumes that plot has already been called. Current par("usr") is used to determine the range of data for the axis of the current plot. This range is used in jittering and in constructing line segments.

Value

histSpike returns the actual range of x used in its binning. histSpikeg returns a list of ggplot2 layers that ggplot2 will easily add with +.

Side Effects

scat1d adds line segments to plot. datadensity.data.frame draws a complete plot. histSpike draws a complete plot or adds to an existing plot.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
Nashville TN, USA
fh@fharrell.com

Martin Maechler (improved scat1d)
Seminar fuer Statistik
ETH Zurich SWITZERLAND
maechler@stat.math.ethz.ch

Jens Oehlschlaegel-Akiyoshi (wrote jitter2)
Center for Psychotherapy Research
Christian-Belser-Strasse 79a
D-70597 Stuttgart Germany
oehl@psyres-stuttgart.de

See Also

segments, jitter, rug, plsmo, lowess, stripplot, hist.data.frame, Ecdf, hist, histogram, table, density, stat_plsmo, histboxp

Examples

plot(x <- rnorm(50), y <- 3*x + rnorm(50)/2 )
scat1d(x)                 # density bars on top of graph
scat1d(y, 4)              # density bars at right
histSpike(x, add=TRUE)       # histogram instead, 100 bins
histSpike(y, 4, add=TRUE)
histSpike(x, type='density', add=TRUE)  # smooth density at bottom
histSpike(y, 4, type='density', add=TRUE)

smooth <- lowess(x, y)    # add nonparametric regression curve
lines(smooth)             # Note: plsmo() does this
scat1d(x, y=approx(smooth, xout=x)$y) # data density on curve
scat1d(x, curve=smooth)   # same effect as previous command
histSpike(x, curve=smooth, add=TRUE) # same as previous but with histogram
histSpike(x, curve=smooth, type='density', add=TRUE)
# same but smooth density over curve

plot(x <- rnorm(250), y <- 3*x + rnorm(250)/2)
scat1d(x, tfrac=0)        # dots randomly spaced from axis
scat1d(y, 4, frac=-.03)   # bars outside axis
scat1d(y, 2, tfrac=.2)    # same bars with smaller random fraction

x <- c(0:3,rep(4,3),5,rep(7,10),9)
plot(x, jitter2(x))       # original versus jittered values
abline(0,1)               # unique values unjittered on abline
points(x+0.1, jitter2(x, limit=FALSE), col=2)
                          # allow locally maximum jittering
points(x+0.2, jitter2(x, fill=1), col=3); abline(h=seq(0.5,9,1), lty=2)
                          # fill 3/3 instead of 1/3

x <- rnorm(200,0,2)+1; y <- x^2
x2 <- round((x+rnorm(200))/2)*2
x3 <- round((x+rnorm(200))/4)*4
dfram <- data.frame(y,x,x2,x3)
plot(dfram$x2, dfram$y)   # jitter2 via scat1d
scat1d(dfram$x2, y=dfram$y, preserve=TRUE, col=2)
scat1d(dfram$x2, preserve=TRUE, frac=-0.02, col=2)
scat1d(dfram$y, 4, preserve=TRUE, frac=-0.02, col=2)

pairs(jitter2(dfram))     # pairs for jittered data.frame
# This gets reasonable pairwise scatter plots for all combinations of
# variables where
#
# - continuous variables (with unique values) are not jittered at all, thus
#   all relations between continuous variables are shown as they are,
#   extreme values have exact positions.
#
# - discrete variables get a reasonable amount of jittering, whether they
#   have 2, 3, 5, 10, 20 ... levels
#
# - different from adding noise, jitter2() will use the available space
#   optimally and no value will randomly mask another
#
# If you want a scatterplot with lowess smooths on the *exact* values and
# the point clouds shown jittered, you just need
#
pairs( dfram ,panel=function(x,y) { points(jitter2(x),jitter2(y))
                                    lines(lowess(x,y)) } )

datadensity(dfram)     # graphical snapshot of entire data frame
datadensity(dfram, group=cut2(dfram$x2,g=3))
                          # stratify points and frequencies by
                          # x2 tertiles and use 3 colors

# datadensity.data.frame(split(x, grouping.variable))
# need to explicitly invoke datadensity.data.frame when the
# first argument is a list

## Not run: 
require(rms)
require(ggplot2)
f <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
         data=d)
p <- Predict(f, cholesterol, sex)
g <- ggplot(p, aes(x=cholesterol, y=yhat, color=sex)) + geom_line() +
  xlab(xl2) + ylim(-1, 1)
# show.legend replaces the defunct show_guide argument of ggplot2
g <- g + geom_ribbon(data=p, aes(ymin=lower, ymax=upper), alpha=0.2,
                linetype=0, show.legend=FALSE)
g + histSpikeg(yhat ~ cholesterol + sex, p, d)

# colors <- c('red', 'blue')
# p <- plot_ly(x=x, y=y, color=g, colors=colors, mode='markers')
# histSpikep(p, x, y, z, color=g, colors=colors)

w <- data.frame(x1=rnorm(100), x2=exp(rnorm(100)))
g <- c(rep('a', 50), rep('b', 50))
ecdfpM(w, group=g, ncols=2)

## End(Not run)

Score a Series of Binary Variables

Description

Creates a new variable from a series of logical conditions. The new variable can be a hierarchical category or score derived from considering the rightmost TRUE value among the input variables, an additive point score, a union, or any of several others by specifying a function using the fun argument.

Usage

score.binary(..., fun=max, points=1:p,
             na.rm=funtext == "max", retfactor=TRUE)

Arguments

...

a list of variables or expressions which are considered to be binary or logical

fun

a function to compute on each row of the matrix represented by a specific observation of all the variables in ...

points

points to assign to successive elements of ... . The default is 1, 2, ..., p, where p is the number of elements. If you specify one number for points, that number will be duplicated (i.e., equal weights are assumed).

na.rm

set to TRUE to remove NAs from consideration when processing each row of the matrix of variables in ... . For fun=max, na.rm=TRUE is the default since score.binary assumes that a hierarchical scale is based on available information. Otherwise, na.rm=FALSE is assumed. For fun=mean you may want to specify na.rm=TRUE.

retfactor

applies if fun=max, in which case retfactor=TRUE makes score.binary return a factor object since a hierarchical scale implies a unique choice.

Value

a factor object if retfactor=TRUE and fun=max or a numeric vector otherwise. Will not contain NAs if na.rm=TRUE unless every variable in a row is NA. If a factor object is returned, it has levels "none" followed by character string versions of the arguments given in ... .
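The scoring rule described above (multiply each binary variable by its points, then apply fun across each row) can be sketched in base R. This is only an illustration of the documented behavior under the default points=1:p, not the package's own code, and the variable names a, b and pts are made up for the example.

```r
# Two hypothetical binary conditions observed on four subjects
a <- c(TRUE, FALSE, TRUE, FALSE)
b <- c(FALSE, FALSE, TRUE, TRUE)
x <- cbind(a, b)

# Default points 1:p: the first condition scores 1, the second scores 2
pts <- 1:2
weighted <- sweep(x, 2, pts, `*`)

# fun=max gives the hierarchical scale (rightmost TRUE wins);
# fun=sum gives the additive point score
hier     <- apply(weighted, 1, max)
additive <- apply(weighted, 1, sum)

# With retfactor=TRUE the hierarchical score is labeled, "none" first
hier_factor <- factor(hier, levels = 0:2, labels = c("none", "a", "b"))
print(hier_factor)
print(additive)
```

The third subject shows the difference between the two scales: both conditions hold, so the hierarchical score is the level of the rightmost condition (b) while the additive score is 1 + 2 = 3.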

See Also

any, sum, max, factor

Examples

set.seed(1)
age <- rnorm(25, 70, 15)
previous.disease <- sample(0:1, 25, TRUE)
#Hierarchical scale, highest of 1:age>70  2:previous.disease
score.binary(age>70, previous.disease, retfactor=FALSE)
#Same as above but return factor variable with levels "none" "age>70"
# "previous.disease"
score.binary(age>70, previous.disease)

#Additive scale with weights 1:age>70  2:previous.disease
score.binary(age>70, previous.disease, fun=sum)

#Additive scale, equal weights
score.binary(age>70, previous.disease, fun=sum, points=c(1,1))
#Same as saying points=1

#Union of variables, to create a new binary variable
score.binary(age>70, previous.disease, fun=any)

Character String Editing and Miscellaneous Character Handling Functions

Description

This suite of functions was written to implement many of the features of the UNIX sed program entirely within S (function sedit). The substring.location function returns the first and last position numbers that a sub-string occupies in a larger string. The substring2<- function does the opposite of the builtin function substring. It is named substring2 because for S-Plus there is a built-in function substring, but it does not handle multiple replacements in a single string. replace.substring.wild edits character strings in the fashion of "change xxxxANYTHINGyyyy to aaaaANYTHINGbbbb", if the "ANYTHING" passes an optional user-specified test function. Here, the "yyyy" string is searched for from right to left to handle balancing parentheses, etc. numeric.string and all.digits are two examples of test functions, to check, respectively, if each of a vector of strings is a legal numeric or if it contains only the digits 0-9. For the case where old="*$" or "^*", or for replace.substring.wild with the same values of old or with front=TRUE or back=TRUE, sedit (if wild.literal=FALSE) and replace.substring.wild will edit the largest substring satisfying test.

substring2 is just a copy of substring so that substring2<- will work.

Usage

sedit(text, from, to, test, wild.literal=FALSE)
substring.location(text, string, restrict)
# substring(text, first, last) <- setto   # S-Plus only
replace.substring.wild(text, old, new, test, front=FALSE, back=FALSE)
numeric.string(string)
all.digits(string)
substring2(text, first, last)
substring2(text, first, last) <- value

Arguments

text

a vector of character strings for sedit, substring2, substring2<- or a single character string for substring.location, replace.substring.wild.

from

a vector of character strings to translate from, for sedit. A single asterisk wild card, meaning allow any sequence of characters (subject to the test function, if any) in place of the "*". An element of from may begin with "^" to force the match to begin at the beginning of text, and an element of from can end with "$" to force the match to end at the end of text.

to

a vector of character strings to translate to, for sedit. If a corresponding element in from had an "*", the element in to may also have an "*". Only single asterisks are allowed. If to is not the same length as from, the rep function is used to make it the same length.

string

a single character string, for substring.location, numeric.string, all.digits

first

a vector of integers specifying the first position to replace for substring2<-. first may also be a vector of character strings that are passed to sedit to use as patterns for replacing substrings with setto. See one of the last examples below.

last

a vector of integers specifying the ending positions of the character substrings to be replaced. The default is to go to the end of the string. When first is character, last must be omitted.

setto

a character string or vector of character strings used as replacements, in substring2<-

old

a character string to translate from for replace.substring.wild. May be "*$" or "^*" or any string containing a single "*" but not beginning with "^" or ending with "$".

new

a character string to translate to for replace.substring.wild

test

a function of a vector of character strings returning a logical vector whose elements are TRUE or FALSE according to whether that string element qualifies as the wild card string for sedit, replace.substring.wild

wild.literal

set to TRUE to not treat asterisks as wild cards and to not look for "^" or "$" in old

restrict

a vector of two integers for substring.location which specifies a range to which the search for matches should be restricted

front

specifying front = TRUE and old = "*" is the same as specifying old = "^*"

back

specifying back = TRUE and old = "*" is the same as specifying old = "*$"

value

a character vector

Value

sedit returns a vector of character strings the same length as text. substring.location returns a list with components named first and last, each specifying a vector of character positions corresponding to matches. replace.substring.wild returns a single character string. numeric.string and all.digits return a single logical value.
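To make the first/last convention concrete, the base R sketch below finds every literal occurrence of a sub-string with gregexpr and reports starting and ending character positions in the same list shape. It is an analogue written for illustration under the assumption of fixed (non-regex) matching, and it is not how substring.location itself is implemented.

```r
# Base R analogue of the first/last positions described above
text   <- "abcdefgabc"
string <- "ab"

# fixed=TRUE treats 'string' literally rather than as a regular expression
m <- gregexpr(string, text, fixed = TRUE)[[1]]

loc <- list(
  first = as.integer(m),                                 # start of each match
  last  = as.integer(m) + attr(m, "match.length") - 1L   # end of each match
)
print(loc)
```

Here the sub-string occupies positions 1 to 2 and 8 to 9 of the larger string.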

Side Effects

substring2<- modifies its first argument

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com

See Also

grep, substring

Examples

x <- 'this string'
substring2(x, 3, 4) <- 'IS'
x
substring2(x, 7) <- ''
x

substring.location('abcdefgabc', 'ab')
substring.location('abcdefgabc', 'ab', restrict=c(3,999))

replace.substring.wild('this is a cat','this*cat','that*dog')
replace.substring.wild('there is a cat','is a*', 'is not a*')
replace.substring.wild('this is a cat','is a*', 'Z')

qualify <- function(x) x==' 1.5 ' | x==' 2.5 '
replace.substring.wild('He won 1.5 million $','won*million',
                       'lost*million', test=qualify)
replace.substring.wild('He won 1 million $','won*million',
                       'lost*million', test=qualify)
replace.substring.wild('He won 1.2 million $','won*million',
                       'lost*million', test=numeric.string)

x <- c('a = b','c < d','hello')
sedit(x, c('=','he*o'),c('==','he*'))

sedit('x23', '*$', '[*]', test=numeric.string)
sedit('23xx', '^*', 'Y_{*} ', test=all.digits)

replace.substring.wild("abcdefabcdef", "d*f", "xy")

x <- "abcd"
substring2(x, "bc") <- "BCX"
x
substring2(x, "B*d") <- "B*D"
x

seqFreq

Description

Find Sequential Exclusions Due to NAs

Usage

seqFreq(..., labels = NULL, noneNA = FALSE)

Arguments

...

any number of variables

labels

if specified, variable labels will be used in place of variable names

noneNA

set to TRUE to not include 'none' as a level in the result

Details

Finds the variable with the highest number of NAs. From the non-NAs on that variable find the next variable from those remaining with the highest number of NAs. Proceed in like fashion. The resulting variable summarizes sequential exclusions in a hierarchical fashion. See this for more information.
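The sequential procedure above can be sketched in a few lines of base R. The function seq_excl below is a hypothetical helper written for illustration. It returns exclusion counts per variable rather than the factor that seqFreq returns, and its handling of ties (first variable wins) is an assumption.

```r
# Hypothetical helper: count sequential exclusions due to NAs
seq_excl <- function(d) {
  keep <- rep(TRUE, nrow(d))   # rows not yet excluded
  vars <- names(d)             # variables not yet used
  out  <- integer(0)
  while (length(vars) > 0) {
    # NAs per remaining variable, among rows still retained
    nas <- sapply(d[keep, vars, drop = FALSE], function(v) sum(is.na(v)))
    if (max(nas) == 0) break
    v <- names(nas)[which.max(nas)]   # ties: first variable wins (assumption)
    out[v] <- nas[[v]]
    keep <- keep & !is.na(d[[v]])     # continue with the non-NAs on v
    vars <- setdiff(vars, v)
  }
  out
}

d <- data.frame(a = c(NA, 1, NA, 1, NA, 1),
                b = c(NA, NA, 1, 1, 1, 1),
                c = c(1, 1, 1, NA, 1, 1))
excl <- seq_excl(d)
print(excl)
```

Variable b has two NAs overall, but one of them falls on a row already excluded by a, so only one further row is attributed to b.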

Value

factor variable with obs.per.numcond attribute

Author(s)

Frank Harrell


Display Colors, Plotting Symbols, and Symbol Numeric Equivalents

Description

show.pch plots the definitions of the pch parameters. show.col plots definitions of integer-valued colors. character.table draws numeric equivalents of all latin characters; the character on line xy and column z of the table has numeric code "xyz", which you would surround in quotes and precede by a backslash.

Usage

show.pch(object = par("font"))
show.col(object=NULL)
character.table(font=1)

Arguments

object

font for show.pch, ignored for show.col.

font

font

Author(s)

Pierre Joyet pierre.joyet@bluewin.ch, Frank Harrell

See Also

points, text

Examples

## Not run: 
show.pch()
show.col()
character.table()

## End(Not run)

Display image from psfrag LaTeX strings

Description

showPsfrag is used to display (using ghostview) a postscript image that contained psfrag LaTeX strings, by building a small LaTeX script and running latex and dvips.

Usage

showPsfrag(filename)

Arguments

filename

name or character string or character vector specifying file prefix.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

References

Grant MC, Carlisle (1998): The PSfrag System, Version 3. Full documentation is obtained by searching www.ctan.org for ‘pfgguide.ps’.

See Also

postscript, par, ps.options, mgp.axis.labels, pdf, trellis.device, setTrellis


simMarkovOrd

Description

Simulate Ordinal Markov Process

Usage

simMarkovOrd(
  n = 1,
  y,
  times,
  initial,
  X = NULL,
  absorb = NULL,
  intercepts,
  g,
  carry = FALSE,
  rdsample = NULL,
  ...
)

Arguments

n

number of subjects to simulate

y

vector of possible y values in order (numeric, character, factor)

times

vector of measurement times

initial

initial value of y (baseline state; numeric, character, or factor matching y). If length 1 this value is used for all subjects, otherwise it is a vector of length n.

X

an optional vector or matrix of baseline covariate values passed to g. If a vector, X represents a set of single values for all the covariates and those values are used for every subject. Otherwise X is a matrix with rows corresponding to subjects and columns corresponding to covariates which g must know how to handle. g only sees one row of X at a time.

absorb

vector of absorbing states, a subset of y (numeric, character, or factor matching y). The default is no absorbing states. Observations are truncated when an absorbing state is simulated.

intercepts

vector of intercepts in the proportional odds model. There must be one fewer of these than the length of y.

g

a user-specified function of three or more arguments which in order are yprev - the value of y at the previous time, the current time t, the gap between the previous time and the current time, an optional (usually named) covariate vector X, and optional arguments such as a regression coefficient value to simulate from. The function needs to allow yprev to be a vector and yprev must not include any absorbing states. The g function returns the linear predictor for the proportional odds model aside from intercepts. The returned value must be a matrix with row names taken from yprev. If the model is a proportional odds model, the returned value must be one column. If it is a partial proportional odds model, the value must have one column for each distinct value of the response variable Y after the first one, with the levels of Y used as optional column names. So columns correspond to intercepts. The different columns are used for y-specific contributions to the linear predictor (aside from intercepts) for a partial or constrained partial proportional odds model. Parameters for partial proportional odds effects may be included in the ... arguments.

carry

set to TRUE to carry absorbing state forward after it is first hit; the default is to end records for the subject once the absorbing state is hit

rdsample

an optional function to do response-dependent sampling. It is a function of these arguments, which are vectors that stop at any absorbing state: times (ascending measurement times for one subject) and y (vector of ordinal outcomes at these times for one subject). The function returns NULL if no observations are to be dropped; otherwise it returns the vector of new times to sample.

...

additional arguments to pass to g such as a regression coefficient
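As an illustration of the contract for g described above, the sketch below defines a g for a plain proportional odds model and calls it directly, outside simMarkovOrd, to show the shape of its return value. The coefficients, the covariate name treatment, and the argument name parameter are all made up for this example.

```r
# Hypothetical g: the linear predictor (excluding intercepts) depends on the
# previous state, the current time, and a named treatment covariate.
# 'parameter' is a made-up treatment coefficient passed through '...'.
g <- function(yprev, t, gap, X, parameter = -0.5) {
  yprev <- as.numeric(yprev)
  lp <- 0.8 * (yprev - 2) - 0.1 * t + parameter * X[["treatment"]]
  # One column (proportional odds), row names taken from yprev
  matrix(lp, ncol = 1, dimnames = list(as.character(yprev), NULL))
}

# yprev may be a vector: one row of linear predictors per previous state
lp <- g(yprev = 1:3, t = 2, gap = 1, X = c(treatment = 1))
print(lp)
```

This g ignores gap; a model with unequally spaced times could let the effect of yprev depend on it. For a partial proportional odds model the returned matrix would instead need one column per intercept.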

Details

Simulates longitudinal data for subjects following a first-order Markov process under a proportional odds model. Optionally, response-dependent sampling can be done, e.g., if a subject hits a specified state at time t, measurements are removed for times t+1, t+3, t+5, ... This is applicable when for example a study of hospitalized patients samples every day, Y=1 denotes patient discharge to home, and sampling is less frequent outside the hospital. This example assumes that arriving home is not an absorbing state, i.e., a patient could return to the hospital.

Value

data frame with one row per subject per time, and columns id, time, gap, yprev, y

Author(s)

Frank Harrell

See Also

https://hbiostat.org/R/Hmisc/markov/


Simulate Power for Adjusted Ordinal Regression Two-Sample Test

Description

This function simulates the power of a two-sample test from a proportional odds ordinal logistic model for a continuous response variable, a generalization of the Wilcoxon test. The continuous data model is normal with equal variance. Nonlinear covariate adjustment is allowed, and the user can optionally specify discrete ordinal level overrides to the continuous response. For example, if the main response is systolic blood pressure, one can add two ordinal categories higher than the highest observed blood pressure to capture heart attack or death.

Usage

simRegOrd(n, nsim=1000, delta=0, odds.ratio=1, sigma,
          p=NULL, x=NULL, X=x, Eyx, alpha=0.05, pr=FALSE)

Arguments

n

combined sample size (both groups combined)

nsim

number of simulations to run

delta

difference in means to detect, for continuous portion of response variable

odds.ratio

odds ratio to detect for ordinal overrides of continuous portion

sigma

standard deviation for continuous portion of response

p

a vector of marginal cell probabilities which must add up to one. The ith element specifies the probability that a patient will be in response level i for the control arm for the discrete ordinal overrides.

x

optional covariate to adjust for - a vector of length n

X

a design matrix for the adjustment covariate x if present. This could represent for example x and x^2 or cubic spline components.

Eyx

a function of x that provides the mean response for the control arm treatment

alpha

type I error

pr

set to TRUE to see iteration progress

Value

a list containing n, delta, sigma, power, betas, se, pvals where power is the estimated power (scalar), and betas, se, pvals are nsim-vectors containing, respectively, the ordinal model treatment effect estimate, standard errors, and 2-tailed p-values. When a model fit failed, the corresponding entries in betas, se, pvals are NA and power is the proportion of non-failed iterations for which the treatment p-value is significant at the alpha level.
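The relationship between pvals and power stated above amounts to a proportion over the non-failed iterations. The sketch below uses made-up p-values, and it assumes that "significant" means strictly less than alpha.

```r
# Made-up p-values from five simulated fits; NA marks a failed model fit
pvals <- c(0.01, 0.20, NA, 0.04, 0.50)
alpha <- 0.05

# Proportion of non-failed iterations with p < alpha (2 of 4 here)
power <- mean(pvals < alpha, na.rm = TRUE)
print(power)
```

Failed fits are dropped from the denominator, so a run with many NAs estimates power from fewer than nsim iterations.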

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com

See Also

popower

Examples

## Not run: 
## First use no ordinal high-end category overrides, and compare power
## to t-test when there is no covariate

n <- 100
delta <- .5
sd <- 1
require(pwr)
power.t.test(n = n / 2, delta=delta, sd=sd, type='two.sample')  # 0.70
set.seed(1)
w <- simRegOrd(n, delta=delta, sigma=sd, pr=TRUE)     # 0.686

## Now do ANCOVA with a quadratic effect of a covariate

n <- 100
x <- rnorm(n)
w <- simRegOrd(n, nsim=400, delta=delta, sigma=sd, x=x,
               X=cbind(x, x^2),
               Eyx=function(x) x + x^2, pr=TRUE)
w$power  # 0.68

## Fit a cubic spline to some simulated pilot data and use the fitted
## function as the true equation in the power simulation
require(rms)
N <- 1000
set.seed(2)
x <- rnorm(N)
y <- x + x^2 + rnorm(N, 0, sd=sd)
f <- ols(y ~ rcs(x, 4), x=TRUE)

n <- 100
j <- sample(1 : N, n, replace=n > N)
x <-   x[j]
X <- f$x[j,]
w <- simRegOrd(n, nsim=400, delta=delta, sigma=sd, x=x,
               X=X,
               Eyx=Function(f), pr=TRUE)
w$power  ## 0.70

## Finally, add discrete ordinal category overrides and high end of y
## Start with no effect of treatment on these ordinal event levels (OR=1.0)

w <- simRegOrd(n, nsim=400, delta=delta, odds.ratio=1, sigma=sd,
               x=x, X=X, Eyx=Function(f),
               p=c(.98, .01, .01),
               pr=TRUE)
w$power  ## 0.61   (0.3 if p=.8 .1 .1, 0.37 for .9 .05 .05, 0.50 for .95 .025 .025)

## Now assume that odds ratio for treatment is 2.5
## First compute power for clinical endpoint portion of Y alone
or <- 2.5
p <- c(.9, .05, .05)
popower(p, odds.ratio=or, n=100)   # 0.275
## Compute power of t-test on continuous part of Y alone
power.t.test(n = 100 / 2, delta=delta, sd=sd, type='two.sample')  # 0.70
## Note this is the same as the p.o. model power from simulation above
## Solve for OR that gives the same power estimate from popower
popower(rep(.01, 100), odds.ratio=2.4, n=100)   # 0.706
## Compute power for continuous Y with ordinal override
w <- simRegOrd(n, nsim=400, delta=delta, odds.ratio=or, sigma=sd,
               x=x, X=X, Eyx=Function(f),
               p=c(.9, .05, .05),
               pr=TRUE)
w$power  ## 0.72

## End(Not run)

List Simplification

Description

Takes a list where each element is a group of rows that have been spanned by a multirow row and combines it into one large matrix.

Usage

simplifyDims(x)

Arguments

x

list of spanned rows

Details

All rows must have the same number of columns. This is used to format the list for printing.

Value

a matrix that contains all of the spanned rows.

Author(s)

Charles Dupont

See Also

rbind

Examples

a <- list(a = matrix(1:25, ncol=5), b = matrix(1:10, ncol=5), c = 1:5)
simplifyDims(a)

Compute Summary Statistics on a Vector

Description

A number of statistical summary functions are provided for use with summary.formula and summarize (as well as tapply and by themselves). smean.cl.normal computes 3 summary variables: the sample mean and lower and upper Gaussian confidence limits based on the t-distribution. smean.sd computes the mean and standard deviation. smean.sdl computes the mean plus or minus a constant times the standard deviation. smean.cl.boot is a very fast implementation of the basic nonparametric bootstrap for obtaining confidence limits for the population mean without assuming normality. These functions all delete NAs automatically. smedian.hilow computes the sample median and a selected pair of outer quantiles having equal tail areas.

Usage

smean.cl.normal(x, mult=qt((1+conf.int)/2,n-1), conf.int=.95, na.rm=TRUE)

smean.sd(x, na.rm=TRUE)

smean.sdl(x, mult=2, na.rm=TRUE)

smean.cl.boot(x, conf.int=.95, B=1000, na.rm=TRUE, reps=FALSE)

smedian.hilow(x, conf.int=.95, na.rm=TRUE)

Arguments

x

for summary functions smean.*, smedian.hilow, a numeric vector from which NAs will be removed automatically

na.rm

defaults to TRUE unlike built-in functions, so that by default NAs are automatically removed

mult

for smean.cl.normal is the multiplier of the standard error of the mean to use in obtaining confidence limits of the population mean (default is appropriate quantile of the t distribution). For smean.sdl, mult is the multiplier of the standard deviation used in obtaining a coverage interval about the sample mean. The default is mult=2 to use plus or minus 2 standard deviations.

conf.int

for smean.cl.normal and smean.cl.boot specifies the confidence level (0-1) for interval estimation of the population mean. For smedian.hilow, conf.int is the coverage probability the outer quantiles should target. When the default, 0.95, is used, the lower and upper quantiles computed are 0.025 and 0.975.

B

number of bootstrap resamples for smean.cl.boot

reps

set to TRUE to have smean.cl.boot return the vector of bootstrapped means as the reps attribute of the returned object

Value

a vector of summary statistics

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

See Also

summarize, summary.formula

Examples

set.seed(1)
x <- rnorm(100)
smean.sd(x)
smean.sdl(x)
smean.cl.normal(x)
smean.cl.boot(x)
smedian.hilow(x, conf.int=.5)  # 25th and 75th percentiles

# Function to compute 0.95 confidence interval for the difference in two means
# g is grouping variable
bootdif <- function(y, g) {
 g <- as.factor(g)
 a <- attr(smean.cl.boot(y[g==levels(g)[1]], B=2000, reps=TRUE),'reps')
 b <- attr(smean.cl.boot(y[g==levels(g)[2]], B=2000, reps=TRUE),'reps')
 meandif <- diff(tapply(y, g, mean, na.rm=TRUE))
 a.b <- quantile(b-a, c(.025,.975))
 res <- c(meandif, a.b)
 names(res) <- c('Mean Difference','.025','.975')
 res
}
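
To make the arithmetic behind the confidence limits concrete, here is a base R sketch of the two interval types described above. It assumes the t-based limits are mean plus or minus mult times the standard error, with the default mult from the Usage section, and it assumes the bootstrap limits are simple outer quantiles of the resampled means. The element names Mean, Lower and Upper are illustrative only; the package's internals may differ.

```r
set.seed(1)
x <- c(rnorm(100), NA)             # include a missing value on purpose
conf.int <- 0.95

xx   <- x[!is.na(x)]               # mimic na.rm=TRUE
n    <- length(xx)
xbar <- mean(xx)
se   <- sd(xx) / sqrt(n)           # standard error of the mean
mult <- qt((1 + conf.int) / 2, n - 1)

# Mean with lower and upper t-based confidence limits
cl_normal <- c(Mean = xbar, Lower = xbar - mult * se, Upper = xbar + mult * se)

# Basic nonparametric bootstrap: resample, take means, use outer quantiles
B <- 1000
boot_means <- replicate(B, mean(sample(xx, n, replace = TRUE)))
cl_boot <- c(Mean = xbar,
             quantile(boot_means, c((1 - conf.int) / 2, (1 + conf.int) / 2)))

cl_normal
cl_boot
```

The t-based limits computed this way agree with those reported by t.test(xx)$conf.int, which is a convenient cross-check.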

solve Function with tol argument

Description

A slightly modified version of solve that allows a tolerance argument for singularity (tol) which is passed to qr.

Usage

solvet(a, b, tol=1e-09)

Arguments

a

a square numeric matrix

b

a numeric vector or matrix

tol

tolerance for detecting linear dependencies in columns of a

See Also

solve
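
As a rough illustration of what passing a tolerance to qr means, the sketch below solves a linear system through a QR decomposition and refuses to proceed when the detected rank falls short. The helper name solve_with_tol is hypothetical, and this is only an approximation of the idea described above, not the source of solvet.

```r
# Hypothetical helper: solve a %*% x = b with an explicit singularity tolerance
solve_with_tol <- function(a, b, tol = 1e-9) {
  qa <- qr(a, tol = tol)                 # tol controls rank detection
  if (qa$rank < ncol(a))
    stop("matrix is singular at tolerance ", tol)
  qr.coef(qa, b)
}

a <- matrix(c(2, 1, 1, 3), nrow = 2)     # well-conditioned 2 x 2 matrix
b <- c(1, 2)
x <- solve_with_tol(a, b)
x

# A rank-deficient matrix is rejected instead of returning unstable values
singular <- matrix(c(1, 2, 2, 4), nrow = 2)
res <- tryCatch(solve_with_tol(singular, b), error = function(e) "singular")
res
```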


Somers' Dxy Rank Correlation

Description

Computes Somers' Dxy rank correlation between a variable x and a binary (0-1) variable y, and the corresponding receiver operating characteristic curve area c. Note that Dxy = 2(c-0.5). somers2 allows for a weights variable, which specifies frequencies to associate with each observation.

Usage

somers2(x, y, weights=NULL, normwt=FALSE, na.rm=TRUE)

Arguments

x

typically a predictor variable. NAs are allowed.

y

a numeric outcome variable coded 0-1. NAs are allowed.

weights

a numeric vector of observation weights (usually frequencies). Omit or specify a zero-length vector to do an unweighted analysis.

normwt

set to TRUE to make weights sum to the actual number of non-missing observations.

na.rm

set to FALSE to suppress checking for NAs.

Details

The rcorr.cens function, although slower than somers2 for large sample sizes, can also be used to obtain Dxy for non-censored binary y, and it has the advantage of computing the standard deviation of the correlation index.

Value

a vector with the named elements C, Dxy, n (number of non-missing pairs), and Missing. Uses the formula C = (mean(rank(x)[y == 1]) - (n1 + 1)/2)/(n - n1), where n1 is the frequency of y=1.
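
The rank formula above can be checked directly in base R. The sketch below is an unweighted illustration only; the helper name somers_sketch is hypothetical, and weights and the normwt option are not handled.

```r
# Unweighted C and Dxy from the rank formula given under Value
somers_sketch <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)           # drop pairs with any NA
  x <- x[ok]; y <- y[ok]
  n  <- length(x)
  n1 <- sum(y == 1)                     # frequency of y = 1
  C  <- (mean(rank(x)[y == 1]) - (n1 + 1) / 2) / (n - n1)
  c(C = C, Dxy = 2 * (C - 0.5), n = n, Missing = sum(!ok))
}

x <- 1:6
y <- c(0, 0, 1, 0, 1, 1)
res <- somers_sketch(x, y)
res
```

For these data, 8 of the 9 possible (y=1, y=0) pairs are concordant, so C is 8/9 and Dxy is 7/9.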

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com

See Also

concordance, rcorr.cens, rank, wtd.rank

Examples

set.seed(1)
predicted <- runif(200)
dead      <- sample(0:1, 200, TRUE)
roc.area <- somers2(predicted, dead)["C"]

# Check weights
x <- 1:6
y <- c(0,0,1,0,1,1)
f <- c(3,2,2,3,2,1)
somers2(x, y)
somers2(rep(x, f), rep(y, f))
somers2(x, y, f)

soprobMarkovOrd

Description

State Occupancy Probabilities for First-Order Markov Ordinal Model

Usage

soprobMarkovOrd(y, times, initial, absorb = NULL, intercepts, g, ...)

Arguments

y

a vector of possible y values in order (numeric, character, factor)

times

vector of measurement times

initial

initial value of y (baseline state; numeric, character, factor)

absorb

vector of absorbing states, a subset of y. The default is no absorbing states. (numeric, character, factor)

intercepts

vector of intercepts in the proportional odds model, with length one less than the length of y

g

a user-specified function of three or more arguments which in order are yprev - the value of y at the previous time, the current time t, the gap between the previous time and the current time, an optional (usually named) covariate vector X, and optional arguments such as a regression coefficient value to simulate from. The function needs to allow yprev to be a vector and yprev must not include any absorbing states. The g function returns the linear predictor for the proportional odds model aside from intercepts. The returned value must be a matrix with row names taken from yprev. If the model is a proportional odds model, the returned value must be one column. If it is a partial proportional odds model, the value must have one column for each distinct value of the response variable Y after the first one, with the levels of Y used as optional column names. So columns correspond to intercepts. The different columns are used for y-specific contributions to the linear predictor (aside from intercepts) for a partial or constrained partial proportional odds model. Parameters for partial proportional odds effects may be included in the ... arguments.

...

additional arguments to pass to g such as covariate settings

Value

matrix with rows corresponding to times and columns corresponding to states, with values equal to exact state occupancy probabilities
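
The recursion behind state occupancy probabilities can be sketched in base R. At each time, the occupancy vector is the previous occupancy vector pushed through the transition probabilities implied by the proportional odds model, with absorbing states keeping their mass. Everything in this sketch is an assumption made for illustration: the three states, the intercepts, the simplified scalar-returning g (the real g must return a matrix), and the convention that the initial state is observed at time 0.

```r
y          <- 1:3                 # ordered states; 3 is absorbing
absorb     <- 3
times      <- 1:4
initial    <- 1
intercepts <- c(1, -2)            # for P(Y >= 2) and P(Y >= 3)

# Hypothetical linear predictor aside from the intercepts
g <- function(yprev, t, gap) 0.8 * (yprev - 1) - 0.1 * t

# Cell probabilities P(Y = k | yprev) under proportional odds
trans_prob <- function(yprev, t, gap) {
  exceed <- plogis(intercepts + g(yprev, t, gap))  # P(Y >= 2), P(Y >= 3)
  -diff(c(1, exceed, 0))                           # P(Y = 1), P(Y = 2), P(Y = 3)
}

occ <- matrix(0, nrow = length(times), ncol = length(y),
              dimnames = list(times, y))
prev  <- as.numeric(y == initial)   # all mass on the initial state
tprev <- 0
for (i in seq_along(times)) {
  t   <- times[i]
  gap <- t - tprev
  cur <- numeric(length(y))
  for (j in seq_along(y)) {
    if (prev[j] == 0) next
    if (y[j] %in% absorb) {
      cur[j] <- cur[j] + prev[j]    # absorbing state keeps its mass
    } else {
      cur <- cur + prev[j] * trans_prob(y[j], t, gap)
    }
  }
  occ[i, ] <- cur
  prev  <- cur
  tprev <- t
}
occ
```

Each row of occ sums to 1, and the column for the absorbing state can only grow over time.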

Author(s)

Frank Harrell

See Also

https://hbiostat.org/R/Hmisc/markov/


soprobMarkovOrdm

Description

State Occupancy Probabilities for First-Order Markov Ordinal Model from a Model Fit

Usage

soprobMarkovOrdm(
  object,
  data,
  times,
  ylevels,
  absorb = NULL,
  tvarname = "time",
  pvarname = "yprev",
  gap = NULL
)

Arguments

object

a fit object created by blrm, lrm, orm, VGAM::vglm(), or VGAM::vgam()

data

a single observation list or data frame with covariate settings, including the initial state for Y

times

vector of measurement times

ylevels

a vector of ordered levels of the outcome variable (numeric or character)

absorb

vector of absorbing states, a subset of ylevels. The default is no absorbing states. (numeric, character, factor)

tvarname

name of time variable, defaulting to time

pvarname

name of previous state variable, defaulting to yprev

gap

name of time gap variable, defaults assuming that gap time is not in the model

Details

Computes state occupancy probabilities for a single setting of baseline covariates. If the model fit was from rms::blrm(), these probabilities are from all the posterior draws of the basic model parameters. Otherwise they are maximum likelihood point estimates.

Value

if object was not a Bayesian model, a matrix with rows corresponding to times and columns corresponding to states, with values equal to exact state occupancy probabilities. If object was created by blrm, the result is a 3-dimensional array with the posterior draws as the first dimension.

Author(s)

Frank Harrell

See Also

https://hbiostat.org/R/Hmisc/markov/


spikecomp

Description

Compute Elements of a Spike Histogram

Usage

spikecomp(
  x,
  method = c("tryactual", "simple", "grid"),
  lumptails = 0.01,
  normalize = TRUE,
  y,
  trans = NULL,
  tresult = c("list", "segments", "roundeddata")
)

Arguments

x

a numeric variable

method

specifies the binning and output method. The default is 'tryactual' and is intended to be used for spike histograms plotted in a way that allows for random x-coordinates and data gaps. No binning is done if there are fewer than 100 distinct values and the closest distinct x values are distinguishable (not within 1/500th of the data range of each other). Binning uses pretty. When trans is specified to transform x to reduce long tails due to outliers, pretty rounding is not done, and lumptails is ignored. method='grid' is intended for sparkline spike histograms drawn with bar charts, where plotting is done in a way that x-coordinates must be equally spaced. For this method, extensive binning information is returned. For either 'tryactual' or 'grid', the default if trans is omitted is to put all values beyond the 0.01 or 0.99 quantiles into a single bin so that outliers will not create long nearly empty tails. When y is specified, method is ignored.

lumptails

the quantile to use for lumping values into a single left and a single right bin for two of the methods. When outer quantiles using lumptails equal outer quantiles using 2*lumptails, lumptails is ignored as this indicates a large number of ties in the tails of the distribution.

normalize

set to FALSE to not divide frequencies by maximum frequency

y

a vector of frequencies corresponding to x if you want the (x,y) pairs to be taken as a possibly irregularly spaced frequency tabulation that you want to convert to a regularly spaced tabulation like the one count='tabulate' produces. If there is a constant gap between x values, the original pairs are returned, with possible removal of NAs.

trans

a list with three elements: the name of a transformation to make on x, the transformation function, and the inverse transformation function. The latter is used for method='grid'. When trans is given lumptails is ignored. trans applies only to method='tryactual'.

tresult

applies only to method='tryactual'. The default 'list' returns a list with elements x, y, and roundedTo. tresult='segments' returns a list suitable for drawing line segments, with elements x, y1, y2. tresult='roundeddata' returns a list with elements x (non-tabulated rounded data vector after excluding NAs) and vector roundedTo.

Details

Derives the line segment coordinates needed to draw a spike histogram. This is useful for adding elements to ggplot2 plots and for the describe function to construct spike histograms. Date/time variables are handled by doing calculations on the underlying numeric scale then converting back to the original class. For them the left endpoint of the first bin is taken as the minimal data value instead of rounded using pretty().
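
The general idea of lumping the tails, rounding to a pretty bin width, and normalizing frequencies can be illustrated in base R. This is a loose sketch under stated assumptions, not the algorithm used by spikecomp: the choice of about 100 bins, clamping tail values to the outer quantiles, and rounding to the nearest multiple of the bin width are all made up for illustration.

```r
set.seed(1)
x <- c(rnorm(1000), 50)             # one extreme outlier
lumptails <- 0.01

# Pull values beyond the outer quantiles in to those quantiles
q  <- quantile(x, c(lumptails, 1 - lumptails))
xl <- pmin(pmax(x, q[1]), q[2])

# Round to a pretty bin width (roughly 100 bins assumed here)
breaks <- pretty(xl, n = 100)
width  <- breaks[2] - breaks[1]
xr     <- round(xl / width) * width

# Tabulate and normalize so the tallest spike has height 1
tab  <- table(xr)
freq <- as.numeric(tab) / max(tab)
spikes <- list(x = as.numeric(names(tab)), y = freq, roundedTo = width)
str(spikes)
```

Without the lumping step the single value at 50 would stretch the x-axis and leave a long, nearly empty right tail.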

Value

when y is specified, a list with elements x and y. When method='tryactual' the returned value depends on tresult. For method='grid', a list with elements x and y and scalar element roundedTo containing the typical bin width. Here x is a character string.

Author(s)

Frank Harrell

Examples

spikecomp(1:1000)
spikecomp(1:1000, method='grid')
## Not run: 
# On a data.table d use ggplot2 to make spike histograms by country and sex groups
s <- d[, spikecomp(x, tresult='segments'), by=.(country, sex)]
ggplot(s) + geom_segment(aes(x=x, y=y1, xend=x, yend=y2, alpha=I(0.3))) +
   scale_y_continuous(breaks=NULL, labels=NULL) + ylab('') +
   facet_grid(country ~ sex)

## End(Not run)

Simulate Power of 2-Sample Test for Survival under Complex Conditions

Description

Given functions to generate random variables for survival times and censoring times, spower simulates the power of a user-given 2-sample test for censored data. By default, the logrank (Cox 2-sample) test is used, and a logrank function for comparing 2 groups is provided. Optionally a Cox model is fitted for each simulated dataset and the log hazard ratios are saved (this requires the survival package). A print method prints various measures from these. For composing R functions to generate random survival times under complex conditions, the Quantile2 function allows the user to specify the intervention:control hazard ratio as a function of time, the probability of a control subject actually receiving the intervention (dropin) as a function of time, and the probability that an intervention subject receives only the control agent as a function of time (non-compliance, dropout). Quantile2 returns a function that generates either control or intervention uncensored survival times subject to non-constant treatment effect, dropin, and dropout. There is a plot method for plotting the results of Quantile2, which will aid in understanding the effects of the two types of non-compliance and non-constant treatment effects. Quantile2 assumes that the hazard function for either treatment group is a mixture of the control and intervention hazard functions, with mixing proportions defined by the dropin and dropout probabilities. It computes hazards and survival distributions by numerical differentiation and integration using a grid of (by default) 7500 equally-spaced time points.

The logrank function is intended to be used with spower but it can be used by itself. It returns the 1 degree of freedom chi-square statistic, with the associated Pike hazard ratio estimate as an attribute.

The Weibull2 function accepts as input two vectors, one containing two times and one containing two survival probabilities, and it solves for the scale and shape parameters of the Weibull distribution (S(t) = e^{-\alpha t^{\gamma}}) which will yield those estimates. It creates an R function to evaluate survival probabilities from this Weibull distribution. Weibull2 is useful in creating functions to pass as the first argument to Quantile2.
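
Solving for the two Weibull parameters is a short calculation: taking log(-log S(t)) turns S(t) = exp(-alpha * t^gamma) into the straight line log(alpha) + gamma * log(t), so two (time, survival) points determine the slope gamma and then alpha. The sketch below works this out in base R. The helper name weibull_from_two_points is hypothetical, and the function it returns takes only a time vector, unlike the three-argument function returned by Weibull2.

```r
# Hypothetical helper mirroring the parameterization S(t) = exp(-alpha * t^gamma)
weibull_from_two_points <- function(times, surv) {
  z     <- log(-log(surv))                   # equals log(alpha) + gamma * log(t)
  gamma <- (z[2] - z[1]) / (log(times[2]) - log(times[1]))
  alpha <- -log(surv[1]) / times[1]^gamma
  function(t) exp(-alpha * t^gamma)
}

# 1-year survival of .95 and 3-year survival of .7
sc <- weibull_from_two_points(c(1, 3), c(.95, .7))
sc(c(1, 3))     # reproduces the two inputs
sc(0:5)
```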

The Lognorm2 and Gompertz2 functions are similar to Weibull2 except that they produce survival functions for the log-normal and Gompertz distributions.

When cox=TRUE is specified to spower, the analyst may wish to extract the two margins of error by using the print method for spower objects (see example below) and take the maximum of the two.

Usage

spower(rcontrol, rinterv, rcens, nc, ni,
       test=logrank, cox=FALSE, nsim=500, alpha=0.05, pr=TRUE)

## S3 method for class 'spower'
print(x, conf.int=.95, ...)

Quantile2(scontrol, hratio,
          dropin=function(times)0, dropout=function(times)0,
          m=7500, tmax, qtmax=.001, mplot=200, pr=TRUE, ...)

## S3 method for class 'Quantile2'
print(x, ...)

## S3 method for class 'Quantile2'
plot(x,
     what=c("survival", "hazard", "both", "drop", "hratio", "all"),
     dropsep=FALSE, lty=1:4, col=1, xlim, ylim=NULL,
     label.curves=NULL, ...)

logrank(S, group)

Gompertz2(times, surv)
Lognorm2(times, surv)
Weibull2(times, surv)

Arguments

rcontrol

a function of n which returns n random uncensored failure times for the control group. spower assumes that non-compliance (dropin) has been taken into account by this function.

rinterv

similar to rcontrol but for the intervention group

rcens

a function of n which returns n random censoring times. It is assumed that both treatment groups have the same censoring distribution.

nc

number of subjects in the control group

ni

number in the intervention group

scontrol

a function of a time vector which returns the survival probabilities for the control group at those times assuming that all patients are compliant.

hratio

a function of time which specifies the intervention:control hazard ratio (treatment effect)

x

an object of class “Quantile2” created by Quantile2, or of class “spower” created by spower

conf.int

confidence level for determining fold-change margins of error in estimating the hazard ratio

S

a Surv object or other two-column matrix for right-censored survival times

group

group indicators, with length equal to the number of rows in the S argument.

times

a vector of two times

surv

a vector of two survival probabilities

test

any function of a Surv object and a grouping variable which computes a chi-square for a two-sample censored data test. The default is logrank.

cox

If TRUE, the two margins of error are available by using the print method for spower objects (see example below) and taking the maximum of the two.

nsim

number of simulations to perform (default=500)

alpha

type I error (default=.05)

pr

If FALSE prevents spower from printing progress notes for simulations. If FALSE prevents Quantile2 from printing tmax when it calculates tmax.

dropin

a function of time specifying the probability that a control subject actually is treated with the new intervention at the corresponding time

dropout

a function of time specifying the probability of an intervention subject dropping out to control conditions. As a function of time, dropout specifies the probability that a patient is treated with the control therapy at time t. dropin and dropout form mixing proportions for control and intervention hazard functions.

m

number of time points used for approximating functions (default is 7500)

tmax

maximum time point to use in the grid of m times. Default is the time such that scontrol(time) is qtmax.

qtmax

survival probability corresponding to the last time point used for approximating survival and hazard functions. Default is 0.001. For the fraction qtmax of the time in which a simulated time is needed that corresponds to a survival probability of less than qtmax, the simulated value will be tmax.

mplot

number of points used for approximating functions for use in plotting (default is 200 equally spaced points)

...

optional arguments passed to the scontrol function when it's evaluated by Quantile2. Unused for print.spower.

what

a single character constant (may be abbreviated) specifying which functions to plot. The default is "both" meaning both survival and hazard functions. Specify what="drop" to just plot the dropin and dropout functions, what="hratio" to plot the hazard ratio functions, or "all" to make 4 separate plots showing all functions (6 plots if dropsep=TRUE).

dropsep

If TRUE makes plot.Quantile2 separate pure and contaminated functions onto separate plots

lty

vector of line types

col

vector of colors

xlim

optional x-axis limits

ylim

optional y-axis limits

label.curves

optional list which is passed as the opts argument to labcurve.

Value

spower returns the power estimate (fraction of simulated chi-squares greater than the alpha-critical value). If cox=TRUE, spower returns an object of class “spower” containing the power and various other quantities.
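
The power estimate itself is a simple proportion. The base R sketch below shows the calculation on a stand-in vector of chi-square statistics drawn from a noncentral chi-square distribution; in a real run these would come from applying the test function to each simulated dataset. The noncentrality value and the Monte Carlo standard error line are additions made for illustration.

```r
set.seed(3)
alpha <- 0.05
nsim  <- 500

# Stand-in for the 1 d.f. chi-squares a real simulation would produce
chisq <- rchisq(nsim, df = 1, ncp = 7.85)

crit  <- qchisq(1 - alpha, df = 1)           # alpha-critical value, about 3.84
power <- mean(chisq > crit)                  # fraction exceeding the critical value
se    <- sqrt(power * (1 - power) / nsim)    # Monte Carlo standard error

c(power = power, se = se)
```

The standard error shows why the examples note that nsim of 500 or 1000 is normally used: with nsim=50 a power near 0.8 is only estimated to within roughly plus or minus 0.11.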

Quantile2 returns an R function of class “Quantile2” with attributes that drive the plot method. The major attribute is a list containing several lists. Each of these sub-lists contains a Time vector along with one of the following: survival probabilities for either treatment group and with or without contamination caused by non-compliance, hazard rates in a similar way, intervention:control hazard ratio function with and without contamination, and dropin and dropout functions.

logrank returns a single chi-square statistic and an attribute hr which is the Pike hazard ratio estimate.

Weibull2, Lognorm2 and Gompertz2 return an R function with three arguments, only the first of which (the vector of times) is intended to be specified by the user.

Side Effects

spower prints the iteration number every 10 iterations if pr=TRUE.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com

References

Lakatos E (1988): Sample sizes based on the log-rank statistic in complex clinical trials. Biometrics 44:229–241 (Correction 44:923).

Cuzick J, Edwards R, Segnan N (1997): Adjusting for non-compliance and contamination in randomized clinical trials. Stat in Med 16:1017–1029.

Cook T (2003): Methods for mid-course corrections in clinical trials with survival outcomes. Stat in Med 22:3431–3447.

Barthel FMS, Babiker A et al (2006): Evaluation of sample size and power for multi-arm survival trials allowing for non-uniform accrual, non-proportional hazards, loss to follow-up and cross-over. Stat in Med 25:2521–2542.

See Also

cpower, ciapower, bpower, cph, coxph, labcurve

Examples

# Simulate a simple 2-arm clinical trial with exponential survival so
# we can compare power simulations of logrank-Cox test with cpower()
# Hazard ratio is constant and patients enter the study uniformly
# with follow-up ranging from 1 to 3 years
# Drop-in probability is constant at .1 and drop-out probability is
# constant at .175.  Two-year survival of control patients in absence
# of drop-in is .8 (mortality=.2).  Note that hazard rate is -log(.8)/2
# Total sample size (both groups combined) is 1000
# % mortality reduction by intervention (if no dropin or dropout) is 25
# This corresponds to a hazard ratio of 0.7283 (computed by cpower)
cpower(2, 1000, .2, 25, accrual=2, tmin=1,
       noncomp.c=10, noncomp.i=17.5)

ranfun <- Quantile2(function(x)exp(log(.8)/2*x),
                    hratio=function(x)0.7283156,
                    dropin=function(x).1,
                    dropout=function(x).175)

rcontrol <- function(n) ranfun(n, what='control')
rinterv  <- function(n) ranfun(n, what='int')
rcens    <- function(n) runif(n, 1, 3)

set.seed(11)   # So can reproduce results
spower(rcontrol, rinterv, rcens, nc=500, ni=500,
       test=logrank, nsim=50)  # normally use nsim=500 or 1000

## Not run: 
# Run the same simulation but fit the Cox model for each one to
# get log hazard ratios for the purpose of assessing the tightness
# of confidence intervals that are likely to result

set.seed(11)
u <- spower(rcontrol, rinterv, rcens, nc=500, ni=500,
       test=logrank, nsim=50, cox=TRUE)
u
v <- print(u)
v[c('MOElower','MOEupper','SE')]

## End(Not run)

# Simulate a 2-arm 5-year follow-up study for which the control group's
# survival distribution is Weibull with 1-year survival of .95 and
# 3-year survival of .7.  All subjects are followed at least one year,
# and patients enter the study with linearly increasing probability  after that
# Assume there is no chance of dropin for the first 6 months, then the
# probability increases linearly up to .15 at 5 years
# Assume there is a linearly increasing chance of dropout up to .3 at 5 years
# Assume that the treatment has no effect for the first 9 months, then
# it has a constant effect (hazard ratio of .75)

# First find the right Weibull distribution for compliant control patients
sc <- Weibull2(c(1,3), c(.95,.7))
sc

# Inverse cumulative distribution for case where all subjects are followed
# at least a years and then between a and b years the density rises
# as (time - a) ^ d is a + (b-a) * u ^ (1/(d+1))

rcens <- function(n) 1 + (5-1) * (runif(n) ^ .5)
# To check this, type hist(rcens(10000), nclass=50)

# Put it all together

f <- Quantile2(sc,
      hratio=function(x)ifelse(x<=.75, 1, .75),
      dropin=function(x)ifelse(x<=.5, 0, .15*(x-.5)/(5-.5)),
      dropout=function(x).3*x/5)

par(mfrow=c(2,2))
# par(mfrow=c(1,1)) to make legends fit
plot(f, 'all', label.curves=list(keys='lines'))

rcontrol <- function(n) f(n, 'control')
rinterv  <- function(n) f(n, 'intervention')

set.seed(211)
spower(rcontrol, rinterv, rcens, nc=350, ni=350,
       test=logrank, nsim=50)  # normally nsim=500 or more
par(mfrow=c(1,1))

# Compose a censoring time generator function such that at 1 year
# 5% of subjects are accrued, at 3 years 70% are accrued, and at 10
# years 100% are accrued.  The trial proceeds two years past the last
# accrual for a total of 12 years of follow-up for the first subject.
# Use linear interpolation between these 3 points

rcens <- function(n)
{
  times <- c(0,1,3,10)
  accrued <- c(0,.05,.7,1)
  # Compute inverse of accrued function at U(0,1) random variables
  accrual.times <- approx(accrued, times, xout=runif(n))$y
  censor.times <- 12 - accrual.times
  censor.times
}

censor.times <- rcens(500)
# hist(censor.times, nclass=20)
accrual.times <- 12 - censor.times
# Ecdf(accrual.times)
# lines(c(0,1,3,10), c(0,.05,.7,1), col='red')
# spower(..., rcens=rcens, ...)

## Not run: 
# To define a control survival curve from a fitted survival curve
# with coordinates (tt, surv) with tt[1]=0, surv[1]=1:

Scontrol <- function(times, tt, surv) approx(tt, surv, xout=times)$y
tt <- 0:6
surv <- c(1, .9, .8, .75, .7, .65, .64)
formals(Scontrol) <- list(times=NULL, tt=tt, surv=surv)

# To use a mixture of two survival curves, with e.g. mixing proportions
# of .2 and .8, use the following as a guide:
#
# Scontrol <- function(times, t1, s1, t2, s2)
#  .2*approx(t1, s1, xout=times)$y + .8*approx(t2, s2, xout=times)$y
# t1 <- ...; s1 <- ...; t2 <- ...; s2 <- ...;
# formals(Scontrol) <- list(times=NULL, t1=t1, s1=s1, t2=t2, s2=s2)

# Check that spower can detect a situation where generated censoring times
# are later than all failure times

rcens <- function(n) runif(n, 0, 7)
f <- Quantile2(scontrol=Scontrol, hratio=function(x).8, tmax=6)
cont <- function(n) f(n, what='control')
int  <- function(n) f(n, what='intervention')
spower(rcontrol=cont, rinterv=int, rcens=rcens, nc=300, ni=300, nsim=20)

# Do an unstratified logrank test
library(survival)
# From SAS/STAT PROC LIFETEST manual, p. 1801
days <- c(179,256,262,256,255,224,225,287,319,264,237,156,270,257,242,
          157,249,180,226,268,378,355,319,256,171,325,325,217,255,256,
          291,323,253,206,206,237,211,229,234,209)
status <- c(1,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,1,1,1,0,
            0,rep(1,19))
treatment <- c(rep(1,10), rep(2,10), rep(1,10), rep(2,10))
sex <- Cs(F,F,M,F,M,F,F,M,M,M,F,F,M,M,M,F,M,F,F,M,
          M,M,M,M,F,M,M,F,F,F,M,M,M,F,F,M,F,F,F,F)
data.frame(days, status, treatment, sex)
table(treatment, status)
logrank(Surv(days, status), treatment)  # agrees with p. 1807
# For stratified tests the picture is puzzling.
# survdiff(Surv(days,status) ~ treatment + strata(sex))$chisq
# is 7.246562, which does not agree with SAS (7.1609)
# But summary(coxph(Surv(days,status) ~ treatment + strata(sex)))
# yields 7.16 whereas summary(coxph(Surv(days,status) ~ treatment))
# yields 5.21 as the score test, not agreeing with SAS or logrank() (5.6485)

## End(Not run)

Enhanced Importing of SPSS Files

Description

spss.get invokes the read.spss function in the foreign package to read an SPSS file, with a default output format of "data.frame". The label function is used to attach labels to individual variables instead of to the data frame as done by read.spss. By default, integer-valued variables are converted to a storage mode of integer unless force.single=FALSE. Date variables are converted to R Date variables. By default, underscores in names are converted to periods.

Usage

spss.get(file, lowernames=FALSE, datevars = NULL,
         use.value.labels = TRUE, to.data.frame = TRUE,
         max.value.labels = Inf, force.single=TRUE,
         allow=NULL, charfactor=FALSE, reencode = NA)

Arguments

file

input SPSS save file. May be a file on the WWW, indicated by file starting with 'http://' or 'https://'.

lowernames

set to TRUE to convert variable names to lower case

datevars

vector of variable names containing dates to be converted to R internal format

use.value.labels

see read.spss

to.data.frame

see read.spss; default is TRUE for spss.get

max.value.labels

see read.spss

force.single

set to FALSE to prevent integer-valued variables from being converted from storage mode double to integer

allow

a vector of characters allowed by R that should not be converted to periods in variable names. By default, underscores in variable names are converted to periods as with R before version 1.9.

charfactor

set to TRUE to change character variables to factors if they have fewer than n/2 unique values. Blanks and null strings are converted to NAs.

reencode

see read.spss

Value

a data frame or list

Author(s)

Frank Harrell

See Also

read.spss, cleanup.import, sas.get

Examples

## Not run: 
w <- spss.get('/tmp/my.sav', datevars=c('birthdate','deathdate'))

## End(Not run)

Source a File from the Current Working Directory

Description

src concatenates ".s" to its argument, quotes the result, and sources in the file. It sets options(last.source) to this file name so that src() can be issued to re-source the file when it is edited.

Usage

src(x)

Arguments

x

an unquoted file name aside from ".s". This base file name must be a legal S name.

Side Effects

Sets system option last.source

Author(s)

Frank Harrell

See Also

source

Examples

## Not run: 
src(myfile)   # source("myfile.s")
src()         # re-source myfile.s

## End(Not run)

Add a lowess smoother without confidence bands

Description

Automatically selects iter=0 for lowess if y is binary, otherwise uses iter=3.
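
The effect of that rule can be tried outside ggplot2 by calling lowess directly. In the sketch below the helper choose_iter is hypothetical, and treating "binary" as "at most two distinct non-missing values" is an assumption about how the check might be done. With iter=0 the robustness iterations are skipped, which matters for 0/1 outcomes because those iterations can downweight the less frequent outcome value as if it were an outlier.

```r
set.seed(4)
x      <- runif(200)
y_bin  <- rbinom(200, 1, plogis(3 * x - 1.5))   # binary outcome
y_cont <- 2 * x + rnorm(200)                    # continuous outcome

# Hypothetical rule: no robustness iterations for binary y, 3 otherwise
choose_iter <- function(y) if (length(unique(y[!is.na(y)])) <= 2) 0 else 3

fit_bin  <- lowess(x, y_bin,  f = 2/3, iter = choose_iter(y_bin))
fit_cont <- lowess(x, y_cont, f = 2/3, iter = choose_iter(y_cont))

# Smoothed proportion of y = 1 as a function of x
plot(fit_bin, type = "l", xlab = "x", ylab = "Smoothed P(y = 1)")
```

The f argument of lowess corresponds to the span argument documented below.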

Usage

stat_plsmo(
  mapping = NULL,
  data = NULL,
  geom = "smooth",
  position = "identity",
  n = 80,
  fullrange = FALSE,
  span = 2/3,
  fun = function(x) x,
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE,
  ...
)

Arguments

mapping, data, geom, position, show.legend, inherit.aes

see ggplot2 documentation

n

number of points to evaluate smoother at

fullrange

should the fit span the full range of the plot, or just the data

span

see f argument to lowess

fun

a function to transform smoothed y

na.rm

If FALSE (the default), removes missing values with a warning. If TRUE silently removes missing values.

...

other arguments are passed to smoothing function

Value

a data.frame with additional columns

y

predicted value

See Also

lowess for loess smoother.

Examples

require(ggplot2)
c <- ggplot(mtcars, aes(qsec, wt))
c + stat_plsmo()
c + stat_plsmo() + geom_point()

c + stat_plsmo(span = 0.1) + geom_point()

# Smoothers for subsets
c <- ggplot(mtcars, aes(y=wt, x=mpg)) + facet_grid(. ~ cyl)
c + stat_plsmo() + geom_point()
c + stat_plsmo(fullrange = TRUE) + geom_point()

# Geoms and stats are automatically split by aesthetics that are factors
c <- ggplot(mtcars, aes(y=wt, x=mpg, colour=factor(cyl)))
c + stat_plsmo() + geom_point()
c + stat_plsmo(aes(fill = factor(cyl))) + geom_point()
c + stat_plsmo(fullrange=TRUE) + geom_point()

# Example with logistic regression
data("kyphosis", package="rpart")
qplot(Age, as.numeric(Kyphosis) - 1, data = kyphosis) + stat_plsmo()

Enhanced Importing of STATA Files

Description

Reads a file in Stata version 5-11 binary format into a data frame.

Usage

stata.get(file, lowernames = FALSE, convert.dates = TRUE,
          convert.factors = TRUE, missing.type = FALSE,
          convert.underscore = TRUE, warn.missing.labels = TRUE,
          force.single = TRUE, allow=NULL, charfactor=FALSE, ...)

Arguments

file

input Stata file. May be a file on the WWW, indicated by file starting with 'https://'.

lowernames

set to TRUE to convert variable names to lower case

convert.dates

see read.dta

convert.factors

see read.dta

missing.type

see read.dta

convert.underscore

see read.dta

warn.missing.labels

see read.dta

force.single

set to FALSE to prevent integer-valued variables from being converted from storage mode double to integer

allow

a vector of characters allowed by R that should not be converted to periods in variable names. By default, underscores in variable names are converted to periods as with R before version 1.9.

charfactor

set to TRUE to change character variables to factors if they have fewer than n/2 unique values. Blanks and null strings are converted to NAs.

...

arguments passed to read.dta.

Details

stata.get invokes the read.dta function in the foreign package to read a Stata file, with a default output format of data.frame. The label function is used to attach labels to individual variables instead of to the data frame as done by read.dta. By default, integer-valued variables are converted to a storage mode of integer unless force.single=FALSE. Date variables are converted to R Date variables. By default, underscores in names are converted to periods.

Value

A data frame

Author(s)

Charles Dupont

See Also

read.dta, cleanup.import, label, data.frame, Date

Examples

## Not run: 
w <- stata.get('/tmp/my.dta')

## End(Not run)

Determine Dimensions of Strings

Description

This determines the number of rows and maximum number of columns of each string in a vector.

Usage

string.bounding.box(string, type = c("chars", "width"))

Arguments

string

vector of strings

type

character: whether to count characters or screen columns

Value

rows

vector containing the number of character rows in each string

columns

vector containing the maximum number of character columns in each string
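
The two returned components can be reproduced in base R by splitting each string at newlines and measuring the pieces. This sketch covers only the character-count case (type "chars"); the helper name bounding_box_sketch is hypothetical, and treating an empty string as having zero rows and zero columns is a choice made here, not documented behavior.

```r
# Hypothetical base R equivalent for type = "chars"
bounding_box_sketch <- function(string) {
  parts <- strsplit(string, "\n", fixed = TRUE)     # one vector of lines per string
  list(rows    = lengths(parts),                     # number of lines
       columns = vapply(parts,                       # longest line, 0 if no lines
                        function(p) max(c(0L, nchar(p))),
                        integer(1)))
}

a  <- c("this is a single line string", "This is a\nmulti-line string")
bb <- bounding_box_sketch(a)
bb
```

For the second string the longest line is "multi-line string", so its column count is 17 while its row count is 2.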

Author(s)

Charles Dupont

See Also

nchar, stringDims

Examples

a <- c("this is a single line string", "This is a\nmulti-line string")
stringDims(a)

Break a String into Many Lines at Newlines

Description

Takes a string and breaks it into separate substrings where there are newline characters.

Usage

string.break.line(string)

Arguments

string

character vector to be separated into many lines.

Value

Returns a list of the same length as the string argument.

Each list element is a character vector.

The elements of each character vector are the split lines of the corresponding element in the string argument vector.
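The shape of the return value can be previewed with base R's strsplit, which is listed under See Also. This is a sketch of the structure only: it uses non-empty strings, and it is not claimed that string.break.line treats every edge case (such as empty strings) the same way.

```r
a <- c("this is a single line string", "This is a\nmulti-line string.")

# One list element per input string; each element holds that string's lines
b <- strsplit(a, "\n", fixed = TRUE)

length(b) == length(a)
b[[2]]
```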

Author(s)

Charles Dupont

See Also

strsplit

Examples

a <- c('', 'this is a single line string',
       'This is a\nmulti-line string.')

b <- string.break.line(a)

String Dimensions

Description

Finds the height and width of all the strings in a character vector.

Usage

stringDims(string)

Arguments

string

vector of strings

Details

stringDims finds the number of characters in width and number of lines in height for each string in the string argument.
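The height and width defined above can be computed by hand in base R. The function string_dims_sketch below is a hypothetical stand-in written for illustration, not the Hmisc implementation, and it is only exercised on non-empty strings.

```r
# Hypothetical re-implementation of the idea behind stringDims
string_dims_sketch <- function(string) {
  lines <- strsplit(string, "\n", fixed = TRUE)
  list(
    # number of lines in each string
    height = lengths(lines),
    # number of characters in the longest line of each string
    width  = vapply(lines,
                    function(l) if (length(l)) max(nchar(l)) else 0L,
                    integer(1))
  )
}

a <- c("this is a single line string", "This is a\nmulti-line string")
d <- string_dims_sketch(a)
d
```

For this input the heights are 1 and 2, and the widths are 28 and 17 (the second string's longest line is "multi-line string").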

Value

height

a vector of the number of lines in each string.

width

a vector with the number of character columns in the longest line.

Author(s)

Charles Dupont

See Also

string.bounding.box, nchar

Examples

a <- c("this is a single line string", "This is a\nmulty line string")
stringDims(a)

Embed a new plot within an existing plot

Description

Subplot will embed a new plot within an existing plot at the coordinates specified (in user units of the existing plot).

Usage

subplot(fun, x, y, size=c(1,1), vadj=0.5, hadj=0.5, pars=NULL)

Arguments

fun

an expression or function defining the new plot to be embedded.

x

x-coordinate(s) of the new plot (in user coordinates of the existing plot).

y

y-coordinate(s) of the new plot. x and y can be specified in any of the ways understood by xy.coords.

size

The size of the embedded plot in inches if x and y have length 1.

vadj

vertical adjustment of the plot when y is a scalar. The default is to center vertically; 0 means place the bottom of the plot at y, and 1 places the top of the plot at y.

hadj

horizontal adjustment of the plot when x is a scalar. The default is to center horizontally; 0 means place the left edge of the plot at x, and 1 means place the right edge of the plot at x.

pars

a list of parameters to be passed to par before running fun.

Details

The coordinates x and y can be scalars or vectors of length 2. If they are vectors of length 2, they determine the opposite corners of the rectangle for the embedded plot (and the parameters size, vadj, and hadj are all ignored).

If x and y are given as scalars, then the plot position relative to the point and the size of the plot will be determined by the arguments size, vadj, and hadj. The default is to center a 1 inch by 1 inch plot at x,y. Setting vadj and hadj to (0,0) will position the lower left corner of the plot at (x,y).

The rectangle defined by x, y, size, vadj, and hadj will be used as the plotting area of the new plot. Any tick marks, axis labels, main and sub titles will be outside of this rectangle.

Any graphical parameter settings that you would like to be in place before fun is evaluated can be specified in the pars argument (warning: specifying layout parameters here (plt, mfrow, etc.) may cause unexpected results).

After the function completes, the graphical parameters will have been reset to what they were before calling the function (so you can continue to augment the original plot).
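The way size, hadj, and vadj place the rectangle around a scalar anchor point can be written as simple arithmetic. The sketch below assumes the anchor has already been converted from user coordinates to inches, so that it shares units with size; subplot_rect is a hypothetical helper for illustration and is not part of Hmisc.

```r
# Hypothetical helper: rectangle (in inches) for an embedded plot anchored
# at (x, y), following the hadj/vadj conventions described in Details.
subplot_rect <- function(x, y, size = c(1, 1), hadj = 0.5, vadj = 0.5) {
  left   <- x - hadj * size[1]   # hadj = 0 puts the left edge at x
  bottom <- y - vadj * size[2]   # vadj = 0 puts the bottom edge at y
  c(left = left, right = left + size[1],
    bottom = bottom, top = bottom + size[2])
}

centered    <- subplot_rect(3, 2)                      # default: centered
lower_left  <- subplot_rect(3, 2, hadj = 0, vadj = 0)  # lower left corner at (3, 2)
upper_right <- subplot_rect(3, 2, hadj = 1, vadj = 1)  # upper right corner at (3, 2)
centered
```

With the defaults, a 1 inch by 1 inch rectangle is centered on the anchor, which matches the default behavior described above.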

Value

An invisible list with the graphical parameters that were in effect when the subplot was created. Passing this list to par will enable you to augment the embedded plot.

Author(s)

Greg Snow greg.snow@imail.org

See Also

cnvrt.coords, par, symbols

Examples

# make an original plot
plot( 11:20, sample(51:60) )

# add some histograms
subplot( hist(rnorm(100)), 15, 55)
subplot( hist(runif(100),main='',xlab='',ylab=''), 11, 51, hadj=0, vadj=0)
subplot( hist(rexp(100, 1/3)), 20, 60, hadj=1, vadj=1, size=c(0.5,2) )
subplot( hist(rt(100,3)), c(12,16), c(57,59), pars=list(lwd=3,ask=FALSE) )

tmp <- rnorm(25)
qqnorm(tmp)
qqline(tmp)
tmp2 <- subplot( hist(tmp,xlab='',ylab='',main=''),
                 cnvrt.coords(0.1,0.9,'plt')$usr, vadj=1, hadj=0 )
abline(v=0, col='red') # wrong way to add a reference line to histogram

# right way to add a reference line to histogram
op <- par(no.readonly=TRUE)
par(tmp2)
abline(v=0, col='green')
par(op)

Summarize Scalars or Matrices by Cross-Classification

Description

summarize is a fast version of summary.formula(formula, method="cross", overall=FALSE) for producing stratified summary statistics and storing them in a data frame for plotting (especially with trellis xyplot and dotplot and Hmisc xYplot). Unlike aggregate, summarize accepts a matrix as its first argument and a multi-valued FUN argument, and summarize also labels the variables in the new data frame using their original names. Unlike methods based on tapply, summarize stores the values of the stratification variables using their original types, e.g., a numeric by variable will remain a numeric variable in the collapsed data frame. summarize also retains "label" attributes for variables. summarize works especially well with the Hmisc xYplot function for displaying multiple summaries of a single variable on each panel, such as means and upper and lower confidence limits.

asNumericMatrix converts a data frame into a numeric matrix, saving attributes to reverse the process by matrix2dataFrame. It saves attributes that are commonly preserved across row subsetting (i.e., it does not save dim, dimnames, or names attributes).

matrix2dataFrame converts a numeric matrix back into a data frame if it was created by asNumericMatrix.
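A small sketch of two of the points made above: a multi-valued FUN, and a numeric stratification variable that stays numeric in the result. It assumes the Hmisc package is installed; the variables dose and resp and the function f are made up for illustration.

```r
library(Hmisc)

set.seed(3)
dose <- rep(c(0.5, 1.5, 2.5), each = 10)   # numeric stratification variable
resp <- rnorm(30, mean = dose)             # simulated response

# FUN returns two statistics per stratum
f <- function(v) c(Mean = mean(v), SD = sd(v))

s <- summarize(resp, llist(dose), f, stat.name = "avg")
s
is.numeric(s$dose)   # the by variable keeps its numeric type
```

The result has one row per distinct dose. As described under stat.name, the first statistic takes the name given there and the remaining ones keep the names returned by FUN.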

Usage

summarize(X, by, FUN, ...,
          stat.name=deparse(substitute(X)),
          type=c('variables','matrix'), subset=TRUE,
          keepcolnames=FALSE)

asNumericMatrix(x)

matrix2dataFrame(x, at=attr(x, 'origAttributes'), restoreAll=TRUE)

Arguments

X

a vector or matrix capable of being operated on by the function specified as the FUN argument

by

one or more stratification variables. If a single variable, by may be a vector, otherwise it should be a list. Using the Hmisc llist function instead of list will result in individual variable names being accessible to summarize. For example, you can specify llist(age.group,sex) or llist(Age=age.group,sex). The latter gives age.group a new temporary name, Age.

FUN

a function of a single vector argument, used to create the statistical summaries for summarize. FUN may compute any number of statistics.

...

extra arguments are passed to FUN

stat.name

the name to use when creating the main summary variable. By default, the name of the X argument is used. Set stat.name to NULL to suppress this name replacement.

type

Specify type="matrix" to store the summary variables (if there are more than one) in a matrix.

subset

a logical vector or integer vector of subscripts used to specify the subset of data to use in the analysis. The default is to use all observations in the data frame.

keepcolnames

by default when type="matrix", the first column of the computed matrix is named after the first argument to summarize. Set keepcolnames=TRUE to retain the name of the first column created by FUN

x

a data frame (for asNumericMatrix) or a numeric matrix (for matrix2dataFrame).

at

List containing attributes of original data frame that survive subsetting. Defaults to attribute "origAttributes" of the object x, created by the call to asNumericMatrix

restoreAll

set to FALSE to only restore attributes label, units, and levels instead of all attributes

Value

For summarize, a data frame containing the by variables and the statistical summaries (the first of which is named the same as the X variable unless stat.name is given). If type="matrix", the summaries are stored in a single variable in the data frame, and this variable is a matrix.

asNumericMatrix returns a numeric matrix and stores an object origAttributes as an attribute of the returned object, with the original attributes of the component variables, along with the storage.mode.

matrix2dataFrame returns a data frame.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

See Also

label, cut2, llist, by

Examples

## Not run: 
s <- summarize(ap>1, llist(size=cut2(sz, g=4), bone), mean,
               stat.name='Proportion')
dotplot(Proportion ~ size | bone, data=s)

## End(Not run)

set.seed(1)
temperature <- rnorm(300, 70, 10)
month <- sample(1:12, 300, TRUE)
year  <- sample(2000:2001, 300, TRUE)
g <- function(x)c(Mean=mean(x,na.rm=TRUE),Median=median(x,na.rm=TRUE))
summarize(temperature, month, g)
mApply(temperature, month, g)

mApply(temperature, month, mean, na.rm=TRUE)
w <- summarize(temperature, month, mean, na.rm=TRUE)
library(lattice)
xyplot(temperature ~ month, data=w) # plot mean temperature by month

w <- summarize(temperature, llist(year,month),
               quantile, probs=c(.5,.25,.75), na.rm=TRUE, type='matrix')
xYplot(Cbind(temperature[,1],temperature[,-1]) ~ month | year, data=w)
mApply(temperature, llist(year,month),
       quantile, probs=c(.5,.25,.75), na.rm=TRUE)

# Compute the median and outer quartiles.  The outer quartiles are
# displayed using "error bars"
set.seed(111)
dfr <- expand.grid(month=1:12, year=c(1997,1998), reps=1:100)
attach(dfr)
y <- abs(month-6.5) + 2*runif(length(month)) + year-1997
s <- summarize(y, llist(month,year), smedian.hilow, conf.int=.5)
s
mApply(y, llist(month,year), smedian.hilow, conf.int=.5)

xYplot(Cbind(y,Lower,Upper) ~ month, groups=year, data=s,
       keys='lines', method='alt')
# Can also do:
s <- summarize(y, llist(month,year), quantile, probs=c(.5,.25,.75),
               stat.name=c('y','Q1','Q3'))
xYplot(Cbind(y, Q1, Q3) ~ month, groups=year, data=s, keys='lines')
# To display means and bootstrapped nonparametric confidence intervals
# use for example:
s <- summarize(y, llist(month,year), smean.cl.boot)
xYplot(Cbind(y, Lower, Upper) ~ month | year, data=s)

# For each subject use the trapezoidal rule to compute the area under
# the (time,response) curve using the Hmisc trap.rule function
x <- cbind(time=c(1,2,4,7, 1,3,5,10),response=c(1,3,2,4, 1,3,2,4))
subject <- c(rep(1,4),rep(2,4))
trap.rule(x[1:4,1],x[1:4,2])
summarize(x, subject, function(y) trap.rule(y[,1],y[,2]))

## Not run: 
# Another approach would be to properly re-shape the mm array below
# This assumes no missing cells.  There are many other approaches.
# mApply will do this well while allowing for missing cells.
m <- tapply(y, list(year,month), quantile, probs=c(.25,.5,.75))
mm <- array(unlist(m), dim=c(3,2,12),
            dimnames=list(c('lower','median','upper'),c('1997','1998'),
                          as.character(1:12)))
# aggregate will help but it only allows you to compute one quantile
# at a time; see also the Hmisc mApply function
dframe <- aggregate(y, list(Year=year,Month=month), quantile, probs=.5)

# Compute expected life length by race assuming an exponential
# distribution - can also use summarize
g <- function(y) { # computations for one race group
  futime <- y[,1]; event <- y[,2]
  sum(futime)/sum(event)  # assume event=1 for death, 0=alive
}
mApply(cbind(followup.time, death), race, g)

# To run mApply on a data frame:
xn <- asNumericMatrix(x)
m <- mApply(xn, race, h)
# Here assume h is a function that returns a matrix similar to x
matrix2dataFrame(m)

# Get stratified weighted means
g <- function(y) wtd.mean(y[,1],y[,2])
summarize(cbind(y, wts), llist(sex,race), g, stat.name='y')
mApply(cbind(y,wts), llist(sex,race), g)

# Compare speed of mApply vs. by for computing
d <- data.frame(sex=sample(c('female','male'),100000,TRUE),
                country=sample(letters,100000,TRUE),
                y1=runif(100000), y2=runif(100000))
g <- function(x) {
  y <- c(median(x[,'y1']-x[,'y2']),
         med.sum =median(x[,'y1']+x[,'y2']))
  names(y) <- c('med.diff','med.sum')
  y
}

system.time(by(d, llist(sex=d$sex,country=d$country), g))
system.time({
             x <- asNumericMatrix(d)
             a <- subsAttr(d)
             m <- mApply(x, llist(sex=d$sex,country=d$country), g)
            })
system.time({
             x <- asNumericMatrix(d)
             summarize(x, llist(sex=d$sex, country=d$country), g)
            })

# An example where each subject has one record per diagnosis but sex of
# subject is duplicated for all the rows a subject has.  Get the cross-
# classified frequencies of diagnosis (dx) by sex and plot the results
# with a dot plot

count <- rep(1,length(dx))
d <- summarize(count, llist(dx,sex), sum)
Dotplot(dx ~ count | sex, data=d)

## End(Not run)

d <- list(x=1:10, a=factor(rep(c('a','b'), 5)),
          b=structure(letters[1:10], label='label for b'),
          d=c(rep(TRUE,9), FALSE), f=pi*(1 : 10))
x <- asNumericMatrix(d)
attr(x, 'origAttributes')
matrix2dataFrame(x)

detach('dfr')

# Run summarize on a matrix to get column means
x <- c(1:19,NA)
y <- 101:120
z <- cbind(x, y)
g <- c(rep(1, 10), rep(2, 10))
summarize(z, g, colMeans, na.rm=TRUE, stat.name='x')
# Also works on an all numeric data frame
summarize(as.data.frame(z), g, colMeans, na.rm=TRUE, stat.name='x')

Summarize Data for Making Tables and Plots

Description

summary.formula summarizes the variables listed in an S formula, computing descriptive statistics (including ones in a user-specified function). The summary statistics may be passed to print methods, plot methods for making annotated dot charts, and latex methods for typesetting tables using LaTeX. summary.formula has three methods for computing descriptive statistics on univariate or multivariate responses, subsetted by categories of other variables. The method of summarization is specified in the parameter method (see details below). For the response and cross methods, the statistics used to summarize the data may be specified in a very flexible way (e.g., the geometric mean, 33rd percentile, Kaplan-Meier 2-year survival estimate, mixtures of several statistics). The default summary statistic for these methods is the mean (the proportion of positive responses for a binary response variable). The cross method is useful for creating data frames which contain summary statistics that are passed to trellis as raw data (to make multi-panel dot charts, for example). The print methods use the print.char.matrix function to print boxed tables.
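As a rough orientation to the three methods, the sketch below runs each one on a small simulated data set and inspects the class of the result. It assumes the Hmisc package is installed; the data frame d and its variables are invented for illustration.

```r
library(Hmisc)

set.seed(5)
n <- 200
d <- data.frame(
  trt  = factor(sample(c("placebo", "active"), n, TRUE)),
  sex  = factor(sample(c("female", "male"), n, TRUE)),
  age  = round(rnorm(n, 60, 10)),
  died = rbinom(n, 1, 0.3)
)

# response (default): summarize died separately by each right hand side variable
s_resp  <- summary(died ~ age + sex, data = d)

# reverse: describe age and sex within levels of the left hand side variable
s_rev   <- summary(trt ~ age + sex, data = d, method = "reverse")

# cross: cross-classify the right hand side variables
s_cross <- summary(died ~ age + sex, data = d, method = "cross")

class(s_resp); class(s_rev); class(s_cross)
```

Each object can then be handed to the matching print, plot, or latex method listed under Usage.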

The right hand side of formula may contain mChoice ("multiple choice") variables. When test=TRUE each choice is tested separately as a binary categorical response.

The plot method for method="reverse" creates a temporary function Key in frame 0 as is done by the xYplot and Ecdf.formula functions. After plot runs, you can type Key() to put a legend in a default location, or e.g. Key(locator(1)) to draw a legend where you click the left mouse button. This key is for categorical variables, so to have the opportunity to put the key on the graph you will probably want to use the command plot(object, which="categorical"). A second function Key2 is created if continuous variables are being plotted. It is used the same as Key. If the which argument is not specified to plot, two pages of plots will be produced. If you don't define par(mfrow=) yourself, plot.summary.formula.reverse will try to lay out a multi-panel graph to best fit all the individual dot charts for continuous variables.

There is a subscripting method for objects created with method="response". This can be used to print or plot selected variables or summary statistics where there would otherwise be too many on one page.

cumcategory is a utility function useful when summarizing an ordinal response variable. It converts such a variable having k levels to a matrix with k-1 columns, where column i is a vector of zeros and ones indicating that the categorical response is in level i+1 or greater. When the left hand side of formula is cumcategory(y), the default fun will summarize it by computing all of the relevant cumulative proportions.
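The indicator matrix described for cumcategory can be built by hand in base R. This sketch only illustrates the k-1 column construction; it is not the Hmisc implementation, and the column labels used here are arbitrary.

```r
# Ordinal response with k = 3 levels
y <- factor(c("mild", "moderate", "severe", "moderate", "mild"),
            levels = c("mild", "moderate", "severe"))
k <- nlevels(y)

# Column i is 1 when the response is in level i + 1 or greater
ind <- outer(as.integer(y), 2:k, ">=") * 1L
colnames(ind) <- paste0(">=", levels(y)[-1])   # illustrative labels only
ind

# Column means are the cumulative proportions
colMeans(ind)
```

Here three of the five observations are moderate or worse and one is severe, so the column means are 0.6 and 0.2.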

Functions conTestkw, catTestchisq, ordTestpo are the default statistical test functions for summary.formula. These defaults are: Wilcoxon-Kruskal-Wallis test for continuous variables, Pearson chi-square test for categorical variables, and the likelihood ratio chi-square test from the proportional odds model for ordinal variables. These three functions serve also as templates for the user to create her own testing functions that are self-defining in terms of how the results are printed or rendered in LaTeX, or plotted.
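A user-written test function for continuous variables might look like the sketch below. It returns the list components named in the documentation of the conTest argument (P, stat, df, testname, statname, namefun) and leaves out the optional ones. The function name my_con_test is hypothetical, and it uses the chi-square approximation from stats::kruskal.test rather than the F-distribution approach of the default conTestkw.

```r
# Hypothetical user-defined test: Kruskal-Wallis via stats::kruskal.test
my_con_test <- function(group, x) {
  ok <- !is.na(group) & !is.na(x)              # drop incomplete pairs
  kw <- stats::kruskal.test(x[ok], factor(group[ok]))
  list(P        = unname(kw$p.value),
       stat     = unname(kw$statistic),
       df       = unname(kw$parameter),
       testname = "Kruskal-Wallis",
       statname = "Chi-square",
       namefun  = "chisq")
}

set.seed(4)
grp <- rep(c("a", "b", "c"), each = 15)
val <- rnorm(45) + (grp == "c")
r <- my_con_test(grp, val)
str(r)
```

Such a function would be supplied through the conTest argument together with test=TRUE and method="reverse".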

Usage

## S3 method for class 'formula'
summary(formula, data=NULL, subset=NULL,
        na.action=NULL, fun = NULL,
        method = c("response", "reverse", "cross"),
        overall = method == "response" | method == "cross",
        continuous = 10, na.rm = TRUE, na.include = method != "reverse",
        g = 4, quant = c(0.025, 0.05, 0.125, 0.25, 0.375, 0.5, 0.625,
                         0.75, 0.875, 0.95, 0.975),
        nmin = if (method == "reverse") 100
               else 0,
        test = FALSE, conTest = conTestkw, catTest = catTestchisq,
        ordTest = ordTestpo, ...)

## S3 method for class 'summary.formula.response'
x[i, j, drop=FALSE]

## S3 method for class 'summary.formula.response'
print(x, vnames=c('labels','names'), prUnits=TRUE,
      abbreviate.dimnames=FALSE,
      prefix.width, min.colwidth, formatArgs=NULL, markdown=FALSE, ...)

## S3 method for class 'summary.formula.response'
plot(x, which = 1, vnames = c('labels','names'), xlim, xlab,
     pch = c(16, 1, 2, 17, 15, 3, 4, 5, 0), superposeStrata = TRUE,
     dotfont = 1, add = FALSE, reset.par = TRUE, main, subtitles = TRUE,
     ...)

## S3 method for class 'summary.formula.response'
latex(object, title = first.word(deparse(substitute(object))), caption,
      trios, vnames = c('labels', 'names'), prn = TRUE, prUnits = TRUE,
      rowlabel = '', cdec = 2, ncaption = TRUE, ...)

## S3 method for class 'summary.formula.reverse'
print(x, digits, prn = any(n != N), pctdig = 0,
      what=c('%', 'proportion'),
      npct = c('numerator', 'both', 'denominator', 'none'),
      exclude1 = TRUE, vnames = c('labels', 'names'), prUnits = TRUE,
      sep = '/', abbreviate.dimnames = FALSE,
      prefix.width = max(nchar(lab)), min.colwidth, formatArgs=NULL, round=NULL,
      prtest = c('P','stat','df','name'), prmsd = FALSE, long = FALSE,
      pdig = 3, eps = 0.001, ...)

## S3 method for class 'summary.formula.reverse'
plot(x, vnames = c('labels', 'names'), what = c('proportion', '%'),
     which = c('both', 'categorical', 'continuous'),
     xlim = if(what == 'proportion') c(0,1)
            else c(0,100),
     xlab = if(what=='proportion') 'Proportion'
            else 'Percentage',
     pch = c(16, 1, 2, 17, 15, 3, 4, 5, 0), exclude1 = TRUE,
     dotfont = 1, main,
     prtest = c('P', 'stat', 'df', 'name'), pdig = 3, eps = 0.001,
     conType = c('dot', 'bp', 'raw'), cex.means = 0.5, ...)

## S3 method for class 'summary.formula.reverse'
latex(object, title = first.word(deparse(substitute(object))), digits,
      prn = any(n != N), pctdig = 0, what=c('%', 'proportion'),
      npct = c("numerator", "both", "denominator", "slash", "none"),
      npct.size = 'scriptsize', Nsize = "scriptsize", exclude1 = TRUE,
      vnames=c("labels", "names"), prUnits = TRUE, middle.bold = FALSE,
      outer.size = "scriptsize", caption, rowlabel = "",
      insert.bottom = TRUE, dcolumn = FALSE, formatArgs=NULL, round = NULL,
      prtest = c('P', 'stat', 'df', 'name'), prmsd = FALSE,
      msdsize = NULL, long = dotchart, pdig = 3, eps = 0.001,
      auxCol = NULL, dotchart=FALSE, ...)

## S3 method for class 'summary.formula.cross'
print(x, twoway = nvar == 2, prnmiss = any(stats$Missing > 0), prn = TRUE,
      abbreviate.dimnames = FALSE, prefix.width = max(nchar(v)),
      min.colwidth, formatArgs = NULL, ...)

## S3 method for class 'summary.formula.cross'
latex(object, title = first.word(deparse(substitute(object))),
      twoway = nvar == 2, prnmiss = TRUE, prn = TRUE,
      caption=attr(object, "heading"), vnames=c("labels", "names"),
      rowlabel="", ...)

stratify(..., na.group = FALSE, shortlabel = TRUE)

## S3 method for class 'summary.formula.cross'
formula(x, ...)

cumcategory(y)

conTestkw(group, x)
catTestchisq(tab)
ordTestpo(group, x)

Arguments

formula

An R formula with additive effects. For method="response" or "cross", the dependent variable has the usual connotation. For method="reverse", the dependent variable is what is usually thought of as an independent variable, and it is one that is used to stratify all of the right hand side variables. For method="response" (only), the formula may contain one or more invocations of the stratify function whose arguments are defined below. This causes the entire analysis to be stratified by cross-classifications of the combined list of stratification factors. This stratification will be reflected as major column groupings in the resulting table, or as more response columns for plotting. If formula has no dependent variable, method="reverse" is the only legal value and so method defaults to "reverse" in this case.

x

an object created by summary.formula. For conTestkw a numeric vector, and for ordTestpo, a numeric or factor variable that can be considered ordered

y

a numeric, character, category, or factor vector for cumcategory. It is converted to a categorical variable if needed.

drop

logical. If TRUE the result is coerced to the lowest possible dimension.

data

name or number of a data frame. Default is the current frame.

subset

a logical vector or integer vector of subscripts used to specify the subset of data to use in the analysis. The default is to use all observations in the data frame.

na.action

function for handling missing data in the input data. The default is a function defined here called na.retain, which keeps all observations for processing, with missing variables or not.

fun

function for summarizing data in each cell. Default is to take the mean of each column of the possibly multivariate response variable. You can specify fun="%" to compute percentages (100 times the mean of a series of logical or binary variables). User-specified functions can also return a matrix. For example, you might compute quartiles on a bivariate response. Does not apply to method="reverse".

method

The default is "response", in which case the response variable may be multivariate and any number of statistics may be used to summarize them. Here the responses are summarized separately for each of any number of independent variables. Continuous independent variables (see the continuous parameter below) are automatically stratified into g (see below) quantile groups (if you want to control the discretization for selected variables, use the cut2 function on them). Otherwise, the data are subsetted by all levels of discrete right hand side variables. For multivariate responses, subjects are considered to be missing if any of the columns is missing.

The method="reverse" option is typically used to make baseline characteristic tables, for example. The single left hand side variable must be categorical (e.g., treatment), and the right hand side variables are broken down one at a time by the "dependent" variable. Continuous variables are described by three quantiles (quartiles by default) along with outer quantiles (used only for scaling x-axes when plotting quartiles; all are used when plotting box-percentile plots), and categorical ones are described by counts and percentages. If there is no left hand side variable, summary assumes that there is only one group in the data, so that only one column of summaries will appear. If there is no dependent variable in formula, method defaults to "reverse" automatically.

The method="cross" option allows for a multivariate dependent variable and for up to three independents. Continuous independent variables (those with at least continuous unique values) are automatically divided into g quantile groups. The independents are cross-classified, and marginal statistics may optionally be computed. The output of summary.formula in this case is a data frame containing the independent variable combinations (with levels of "All" corresponding to marginals) and the corresponding summary statistics in the matrix S. The output data frame is suitable for direct use in trellis. The print and latex typesetting methods for this method allow for a special two-way format if there are two right hand variables.

overall

For method="reverse", setting overall=TRUE makes a new column with overall statistics for the whole sample. For method="cross", overall=TRUE (the default) results in all marginal statistics being computed. For trellis displays (usually multi-panel dot plots), these marginals just form other categories. For "response", the default is overall=TRUE, causing a final row of global summary statistics to appear in tables and dot charts. If test=TRUE these marginal statistics are ignored in doing statistical tests.

continuous

specifies the threshold for when a variable is considered to be continuous (when there are at least continuous unique values). factor variables are always considered to be categorical no matter how many levels they have.

na.rm

TRUE (the default) to exclude NAs before passing data to fun to compute statistics, FALSE otherwise. na.rm=FALSE is useful if the response variable is a matrix and you do not wish to exclude a row of the matrix if any of the columns in that row are NA. na.rm also applies to summary statistic functions such as smean.cl.normal. For these na.rm defaults to TRUE unlike built-in functions.

na.include

for method="response", set na.include=FALSE to exclude missing values from being counted as their own category when subsetting the response(s) by levels of a categorical variable. For method="reverse" set na.include=TRUE to keep missing values of categorical variables from being excluded from the table.

g

number of quantile groups to use when variables are automatically categorized with method="response" or "cross" using cut2

nmin

if fewer than nmin observations exist in a category for "response" (over all strata combined), that category will be ignored. For "reverse", for categories of the response variable in which there are less than or equal to nmin non-missing observations, the raw data are retained for later plotting in place of box plots.

test

applies if method="reverse". Set to TRUE to compute test statistics using tests specified in conTest and catTest.

conTest

a function of two arguments (grouping variable and a continuous variable) that returns a list with components P (the computed P-value), stat (the test statistic, either chi-square or F), df (degrees of freedom), testname (test name), statname (statistic name), namefun ("chisq", "fstat"), an optional component latexstat (LaTeX representation of statname), an optional component plotmathstat (for R - the plotmath representation of statname, as a character string), and an optional component note that contains a character string note about the test (e.g., "test not done because n < 5"). conTest is applied to continuous variables on the right-hand-side of the formula when method="reverse". The default uses the spearman2 function to run the Wilcoxon or Kruskal-Wallis test using the F distribution.

catTest

a function of a frequency table (an integer matrix) that returns a list with the same components as created by conTest. By default, the Pearson chi-square test is done, without continuity correction (the continuity correction would make the test conservative like the Fisher exact test).

ordTest

a function of a frequency table (an integer matrix) that returns a list with the same components as created by conTest. By default, the proportional odds likelihood ratio test is done.

...

for summary.formula these are optional arguments for cut2 when variables are automatically categorized. For plot methods these arguments are passed to dotchart2. For Key and Key2 these arguments are passed to key, text, or mtitle. For print methods these are optional arguments to print.char.matrix. For latex methods these are passed to latex.default. One of the most important of these is file. Specifying file="" will cause LaTeX code to just be printed to standard output rather than be stored in a permanent file.

object

an object created by summary.formula

quant

vector of quantiles to use for summarizing data with method="reverse". This must be numbers between 0 and 1 inclusive and must include the numbers 0.5, 0.25, and 0.75 which are used for printing and for plotting quantile intervals. The outer quantiles are used for scaling the x-axes for such plots. Specify outer quantiles as 0 and 1 to scale the x-axes using the whole observed data ranges instead of the default (a 0.95 quantile interval). Box-percentile plots are drawn using all but the outer quantiles.

vnames

By default, tables and plots are usually labeled with variable labels (see the label and sas.get functions). To use the shorter variable names, specify vnames="names".

pch

vector of plotting characters to represent different groups, in order of group levels. For method="response" the characters correspond to levels of the stratify variable if superposeStrata=TRUE, and if no strata are used or if superposeStrata=FALSE, the pch vector corresponds to the which argument for method="response".

superposeStrata

If stratify was used, set superposeStrata=FALSE to make separate dot charts for each level of the stratification variable, for method='response'. The default is to superpose all strata on one dot chart.

dotfont

font for plotting points

reset.par

set to FALSE to suppress the restoring of the old par values in plot.summary.formula.response

abbreviate.dimnames

see print.char.matrix

prefix.width

see print.char.matrix

min.colwidth

minimum column width to use for boxes printed with print.char.matrix. The default is the maximum of the minimum column label length and the minimum length of entries in the data cells.

formatArgs

a list containing other arguments to pass to format.default such as scientific, e.g., formatArgs=list(scientific=c(-5,5)). For print.summary.formula.reverse and format.summary.formula.reverse, formatArgs applies only to statistics computed on continuous variables, not to percents, numerators, and denominators. The round argument may be preferred.

markdown

for print.summary.formula.response set to TRUE to use knitr::kable to produce the table in markdown format rather than using raw text output created by print.char.matrix

digits

number of significant digits to print. Default is to use the current value of the digits system option.

prn

set to TRUE to print the number of non-missing observations on the current (row) variable. The default is to print these only if any of the counts of non-missing values differs from the total number of non-missing values of the left-hand-side variable. For method="cross" the default is to always print N.

prnmiss

set to FALSE to suppress printing counts of missing values for "cross"

what

for method="reverse" specifies whether proportions or percentages are to be plotted

pctdig

number of digits to the right of the decimal place for printing percentages. The default is zero, so percents will be rounded to the nearest percent.

npct

specifies which counts are to be printed to the right of percentages. The default is to print the frequency (numerator of the percent) in parentheses. You can specify "both" to print both numerator and denominator, "denominator", "slash" to typeset horizontally using a forward slash, or "none".

npct.size

the size for typesetting npct information which appears after percents. The default is "scriptsize".

Nsize

When a second row of column headings is added showing sample sizes, Nsize specifies the LaTeX size for these subheadings. Default is "scriptsize".

exclude1

by default, method="reverse" objects will be printed, plotted, or typeset by removing redundant entries from percentage tables for categorical variables. For example, if you print the percent of females, you don't need to print the percent of males. To override this, set exclude1=FALSE.

prUnits

set to FALSE to suppress printing or latexing units attributes of variables, when method='reverse' or 'response'

sep

character to use to separate quantiles when printing method="reverse" tables

prtest

a vector of test statistic components to print if test=TRUE was in effect when summary.formula was called. Defaults to printing all components. Specify prtest=FALSE or prtest="none" to not print any tests. This applies to print, latex, and plot methods for method='reverse'.

round

forprint.summary.formula.reverse andlatex.summary.formula.reverse specifyround to roundthe quantiles and optional mean and standard deviation toround digits after the decimal point

prmsd

set toTRUE to print mean and SD after the three quantiles, forcontinuous variables withmethod="reverse"

msdsize

defaults toNULL to use the current font size for the mean andstandard deviation ifprmsd isTRUE. Set to a characterstring to specify an alternate LaTeX font size.

long

set toTRUE to print the results for the first category on its ownline, not on the same line with the variable label (formethod="reverse" withprint andlatex methods)

pdig

number of digits to the right of the decimal place for printingP-values. Default is3. This is passed toformat.pval.

eps

P-values less thaneps will be printed as< eps. Seeformat.pval.
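
The thresholding that eps controls can be seen with base R's format.pval, used here only as an illustration. This is a minimal sketch: the P-values are made up, and the digits argument of base::format.pval counts significant digits, so it is not an exact stand-in for how the Hmisc methods apply pdig.

```r
# Hypothetical P-values, chosen only to straddle the eps threshold
p <- c(0.2, 0.04, 0.0004)

# base::format.pval prints any value below eps as "<eps"
out <- base::format.pval(p, digits = 3, eps = 0.001)
print(out)

# Flag which formatted entries were replaced by the "< eps" form
below <- grepl("<", out, fixed = TRUE)
print(below)
```

Only the last value falls below eps, so only it is shown in the "< eps" form.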

auxCol

an optional auxiliary column of information, right justified, to add in front of statistics typeset by latex.summary.formula.reverse. This argument is a list with a single element that has a name specifying the column heading. If this name includes a newline character, the portions of the string before and after the newline form the main heading and the subheading (typically set in smaller font), respectively. See the extracolheads argument to latex.default. auxCol is filled with blanks when a variable being summarized takes up more than one row in the output. This happens with categorical variables.

twoway

for method="cross" with two right hand side variables, twoway controls whether the resulting table will be printed in enumeration format or as a two-way table (the default)

which

For method="response" specifies the sequential number or a vector of subscripts of statistics to plot. If you had any stratify variables, these are counted as if more statistics were computed. For method="reverse" specifies whether to plot results for categorical variables, continuous variables, or both (the default).

conType

For plotting method="reverse" plots for continuous variables, dot plots showing quartiles are drawn by default. Specify conType='bp' to draw box-percentile plots using all the quantiles in quant except the outermost ones. Means are drawn with a solid dot and vertical reference lines are placed at the three quartiles. Specify conType='raw' to make a strip chart showing the raw data. This can only be used if the sample size for each left-hand-side group is less than or equal to nmin.

cex.means

character size for means in box-percentile plots; default is .5

xlim

vector of length two specifying x-axis limits. For method="reverse", this is only used for plotting categorical variables. Limits for continuous variables are determined by the outer quantiles specified in quant.

xlab

x-axis label

add

set to TRUE to add to an existing plot

main

a main title. For method="reverse" this applies only to the plot for categorical variables.

subtitles

set to FALSE to suppress automatic subtitles

caption

character string containing LaTeX table captions.

title

name of resulting LaTeX file omitting the .tex suffix. Default is the name of the summary object. If caption is specified, title is also used for the table's symbolic reference label.

trios

If, for method="response", you summarized the response(s) by using three quantiles, specify trios=TRUE or trios=v to group each set of three statistics into one column for latex output, using the format a B c, where the outer quantiles are in smaller font (scriptsize). For trios=TRUE, the overall column names are taken from the column names of the original data matrix. To give new column names, specify trios=v, where v is a vector of column names, of length m/3, where m is the original number of columns of summary statistics.

rowlabel

see latex.default (under the help file latex)

cdec

number of decimal places to the right of the decimal point for latex. This value should be a scalar (which will be properly replicated), or a vector with length equal to the number of columns in the table. For "response" tables, this length does not count the column for N.

ncaption

set to FALSE to not have latex.summary.formula.response put sample sizes in captions

i

a vector of integers, or character strings containing variable names to subset on. Note that each row subsetted on in a summary.formula.reverse object subsets on all the levels that make up the corresponding variable (automatically).

j

a vector of integers representing column numbers

middle.bold

set to TRUE to have LaTeX use bold face for the middle quantile for method="reverse"

outer.size

the font size for outer quantiles for "reverse" tables

insert.bottom

set to FALSE to suppress inclusion of definitions placed at the bottom of LaTeX tables for method="reverse"

dcolumn

see latex

na.group

set to TRUE to have missing stratification variables given their own category (NA)

shortlabel

set to FALSE to include stratification variable names and equal signs in labels for strata levels

dotchart

set to TRUE to output a dotchart in the latex table being generated.

group

for conTest and ordTest, a numeric or factor variable with length the same as x

tab

for catTest, a frequency table such as that created by table()

Value

summary.formula returns a data frame or list depending on method. plot.summary.formula.reverse returns the number of pages of plots that were made.

Side Effects

plot.summary.formula.reverse creates the functions Key and Key2 in frame 0 that will draw legends.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

References

Harrell FE (2007): Statistical tables and plots using S and LaTeX. Document available from https://hbiostat.org/R/Hmisc/summary.pdf.

See Also

mChoice, smean.sd, summarize, label, strata, dotchart2, print.char.matrix, update, formula, cut2, llist, format.default, latex, latexTranslate, bpplt, summaryM, summary

Examples

options(digits=3)
set.seed(173)
sex <- factor(sample(c("m","f"), 500, rep=TRUE))
age <- rnorm(500, 50, 5)
treatment <- factor(sample(c("Drug","Placebo"), 500, rep=TRUE))

# Generate a 3-choice variable; each of 3 variables has 5 possible levels
symp <- c('Headache','Stomach Ache','Hangnail',
          'Muscle Ache','Depressed')
symptom1 <- sample(symp, 500,TRUE)
symptom2 <- sample(symp, 500,TRUE)
symptom3 <- sample(symp, 500,TRUE)
Symptoms <- mChoice(symptom1, symptom2, symptom3, label='Primary Symptoms')
table(Symptoms)
# Note: In this example, some subjects have the same symptom checked
# multiple times; in practice these redundant selections would be NAs
# mChoice will ignore these redundant selections

#Frequency table sex*treatment, sex*Symptoms
summary(sex ~ treatment + Symptoms, fun=table)
# could also do summary(sex ~ treatment +
#  mChoice(symptom1,symptom2,symptom3), fun=table)

#Compute mean age, separately by 3 variables
summary(age ~ sex + treatment + Symptoms)

f <- summary(treatment ~ age + sex + Symptoms, method="reverse", test=TRUE)
f
# trio of numbers represent 25th, 50th, 75th percentile
print(f, long=TRUE)
plot(f)
plot(f, conType='bp', prtest='P')
bpplt()    # annotated example showing layout of bp plot

#Compute predicted probability from a logistic regression model
#For different stratifications compute receiver operating
#characteristic curve areas (C-indexes)
predicted <- plogis(.4*(sex=="m")+.15*(age-50))
positive.diagnosis <- ifelse(runif(500)<=predicted, 1, 0)
roc <- function(z) {
  x <- z[,1]
  y <- z[,2]
  n <- length(x)
  if(n<2) return(c(ROC=NA))
  n1 <- sum(y==1)
  c(ROC= (mean(rank(x)[y==1])-(n1+1)/2)/(n-n1) )
}
y <- cbind(predicted, positive.diagnosis)
options(digits=2)
summary(y ~ age + sex, fun=roc)

options(digits=3)
summary(y ~ age + sex, fun=roc, method="cross")

#Use stratify() to produce a table in which time intervals go down the
#page and going across 3 continuous variables are summarized using
#quartiles, and are stratified by two treatments

set.seed(1)
d <- expand.grid(visit=1:5, treat=c('A','B'), reps=1:100)
d$sysbp <- rnorm(100*5*2, 120, 10)
label(d$sysbp) <- 'Systolic BP'
d$diasbp <- rnorm(100*5*2, 80,  7)
d$diasbp[1] <- NA
d$age    <- rnorm(100*5*2, 50, 12)
g <- function(y) {
  N <- apply(y, 2, function(w) sum(!is.na(w)))
  h <- function(x) {
    qu <- quantile(x, c(.25,.5,.75), na.rm=TRUE)
    names(qu) <- c('Q1','Q2','Q3')
    c(N=sum(!is.na(x)), qu)
  }
  w <- as.vector(apply(y, 2, h))
  names(w) <- as.vector( outer(c('N','Q1','Q2','Q3'), dimnames(y)[[2]],
                               function(x,y) paste(y,x)))
  w
}
#Use na.rm=FALSE to count NAs separately by column
s <- summary(cbind(age,sysbp,diasbp) ~ visit + stratify(treat),
             na.rm=FALSE, fun=g, data=d)
#The result is very wide.  Re-do, putting treatment vertically
x <- with(d, factor(paste('Visit', visit, treat)))
summary(cbind(age,sysbp,diasbp) ~ x, na.rm=FALSE, fun=g, data=d)

#Compose LaTeX code directly
g <- function(y) {
  h <- function(x) {
    qu <- format(round(quantile(x, c(.25,.5,.75), na.rm=TRUE),1),nsmall=1)
    paste('{\\scriptsize(',sum(!is.na(x)),
          ')} \\hfill{\\scriptsize ', qu[1], '} \\textbf{', qu[2],
          '} {\\scriptsize ', qu[3],'}', sep='')
  }
  apply(y, 2, h)
}
s <- summary(cbind(age,sysbp,diasbp) ~ visit + stratify(treat),
             na.rm=FALSE, fun=g, data=d)
# latex(s, prn=FALSE)
## need option in latex to not print n
#Put treatment vertically
s <- summary(cbind(age,sysbp,diasbp) ~ x, fun=g, data=d, na.rm=FALSE)
# latex(s, prn=FALSE)

#Plot estimated mean life length (assuming an exponential distribution)
#separately by levels of 4 other variables.  Repeat the analysis
#by levels of a stratification variable, drug.  Automatically break
#continuous variables into tertiles.
#We are using the default, method='response'
## Not run: 
life.expect <- function(y) c(Years=sum(y[,1])/sum(y[,2]))
attach(pbc)
require(survival)
S <- Surv(follow.up.time, death)
s2 <- summary(S ~ age + albumin + ascites + edema + stratify(drug),
              fun=life.expect, g=3)

#Note: You can summarize other response variables using the same
#independent variables using e.g. update(s2, response~.), or you
#can change the list of independent variables using e.g.
#update(s2, response ~.- ascites) or update(s2, .~.-ascites)
#You can also print, typeset, or plot subsets of s2, e.g.
#plot(s2[c('age','albumin'),]) or plot(s2[1:2,])

s2    # invokes print.summary.formula.response

#Plot results as a separate dot chart for each of the 3 strata levels
par(mfrow=c(2,2))
plot(s2, cex.labels=.6, xlim=c(0,40), superposeStrata=FALSE)

#Typeset table, creating s2.tex
w <- latex(s2, cdec=1)
#Typeset table but just print LaTeX code
latex(s2, file="")    # useful for Sweave

#Take control of groups used for age.  Compute 3 quartiles for
#both cholesterol and bilirubin (excluding observations that are missing
#on EITHER ONE)

age.groups <- cut2(age, c(45,60))
g <- function(y) apply(y, 2, quantile, c(.25,.5,.75))
y <- cbind(Chol=chol,Bili=bili)
label(y) <- 'Cholesterol and Bilirubin'
#You can give new column names that are not legal S names
#by enclosing them in quotes, e.g. 'Chol (mg/dl)'=chol

s3 <- summary(y ~ age.groups + ascites, fun=g)

par(mfrow=c(1,2), oma=c(3,0,3,0))   # allow outer margins for overall title
for(ivar in 1:2) {
  isub <- (1:3)+(ivar-1)*3          # *3=number of quantiles/var.
  plot(s3, which=isub, main='',
       xlab=c('Cholesterol','Bilirubin')[ivar],
       pch=c(91,16,93))             # [, closed circle, ]
}
mtext(paste('Quartiles of', label(y)), adj=.5, outer=TRUE, cex=1.75)
#Overall (outer) title

prlatex(latex(s3, trios=TRUE))
# trios -> collapse 3 quartiles

#Summarize only bilirubin, but do it with two statistics:
#the mean and the median.  Make separate tables for the two randomized
#groups and make plots for the active arm.

g <- function(y) c(Mean=mean(y), Median=median(y))

for(sub in c("D-penicillamine", "placebo")) {
  ss <- summary(bili ~ age.groups + ascites + chol, fun=g,
                subset=drug==sub)
  cat('\n',sub,'\n\n')
  print(ss)

  if(sub=='D-penicillamine') {
    par(mfrow=c(1,1))
    plot(ss, which=1:2, dotfont=c(1,-1), subtitles=FALSE, main='')
    #1=mean, 2=median     -1 font = open circle
    title(sub='Closed circle: mean;  Open circle: median', adj=0)
    title(sub=sub, adj=1)
  }

  w <- latex(ss, append=TRUE, fi='my.tex',
             label=if(sub=='placebo') 's4b' else 's4a',
             caption=paste(label(bili),' {\\em (',sub,')}', sep=''))
  #Note symbolic labels for tables for two subsets: s4a, s4b
  prlatex(w)
}

#Now consider examples in 'reverse' format, where the lone dependent
#variable tells the summary function how to stratify all the
#'independent' variables.  This is typically used to make tables
#comparing baseline variables by treatment group, for example.

s5 <- summary(drug ~ bili + albumin + stage + protime + sex +
                     age + spiders,
              method='reverse')
#To summarize all variables, use summary(drug ~., data=pbc)
#To summarize all variables with no stratification, use
#summary(~a+b+c) or summary(~.,data=...)

options(digits=1)
print(s5, npct='both')
#npct='both' : print both numerators and denominators
plot(s5, which='categorical')
Key(locator(1))  # draw legend at mouse click
par(oma=c(3,0,0,0))  # leave outer margin at bottom
plot(s5, which='continuous')
Key2()           # draw legend at lower left corner of plot
                 # oma= above makes this default key fit the page better

options(digits=3)
w <- latex(s5, npct='both', here=TRUE)
# creates s5.tex

#Turn to a different dataset and do cross-classifications on possibly
#more than one independent variable.  The summary function with
#method='cross' produces a data frame containing the cross-
#classifications.  This data frame is suitable for multi-panel
#trellis displays, although `summarize' works better for that.

attach(prostate)
size.quartile <- cut2(sz, g=4)
bone <- factor(bm,labels=c("no mets","bone mets"))

s7 <- summary(ap>1 ~ size.quartile + bone, method='cross')
#In this case, quartiles are the default so could have said sz + bone

options(digits=3)
print(s7, twoway=FALSE)
s7   # same as print(s7)
w <- latex(s7, here=TRUE)   # Make s7.tex

library(trellis,TRUE)
invisible(ps.options(reset=TRUE))
trellis.device(postscript, file='demo2.ps')

dotplot(S ~ size.quartile|bone, data=s7, #s7 is name of summary stats
        xlab="Fraction ap>1", ylab="Quartile of Tumor Size")
#Can do this more quickly with summarize:
# s7 <- summarize(ap>1, llist(size=cut2(sz, g=4), bone), mean,
#                 stat.name='Proportion')
# dotplot(Proportion ~ size | bone, data=s7)

summary(age ~ stage, method='cross')
summary(age ~ stage, fun=quantile, method='cross')
summary(age ~ stage, fun=smean.sd, method='cross')
summary(age ~ stage, fun=smedian.hilow, method='cross')
summary(age ~ stage, fun=function(x) c(Mean=mean(x), Median=median(x)),
        method='cross')
#The next statements print real two-way tables
summary(cbind(age,ap) ~ stage + bone,
        fun=function(y) apply(y, 2, quantile, c(.25,.75)),
        method='cross')
options(digits=2)
summary(log(ap) ~ sz + bone,
        fun=function(y) c(Mean=mean(y), quantile(y)),
        method='cross')

#Summarize an ordered categorical response by all of the needed
#cumulative proportions
summary(cumcategory(disease.severity) ~ age + sex)

## End(Not run)

Summarize Mixed Data Types vs. Groups

Description

summaryM summarizes the variables listed in an S formula, computing descriptive statistics and optionally statistical tests for group differences. This function is typically used when there are multiple left-hand-side variables that are analyzed independently against groups marked by a single right-hand-side variable. The summary statistics may be passed to print methods, plot methods for making annotated dot charts and extended box plots, and latex methods for typesetting tables using LaTeX. The html method uses htmlTable::htmlTable to typeset the table in html, by passing information to the latex method with html=TRUE. This is for use with Quarto/RMarkdown. The print methods use the print.char.matrix function to print boxed tables when options(prType=) has not been given or when prType='plain'. For plain tables, print calls the internal function printsummaryM. When prType='latex' the latex method is invoked, and when prType='html' html is rendered. In Quarto/RMarkdown, proper rendering will result even if results='asis' does not appear in the chunk header. When rendering in html at the console due to having options(prType='html') the table will be rendered in a viewer.

The plot method creates plotly graphics if options(grType='plotly'), otherwise base graphics are used. plotly graphics provide extra information such as which quantile is being displayed when hovering the mouse. Test statistics are displayed by hovering over the mean.

Continuous variables are described by three quantiles (quartiles by default) when printing, or by the following quantiles when plotting extended box plots using the bpplt function: 0.05, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 0.95. The box plots are scaled to the 0.025 and 0.975 quantiles of each continuous left-hand-side variable. Categorical variables are described by counts and percentages.

The left hand side of formula may contain mChoice ("multiple choice") variables. When test=TRUE each choice is tested separately as a binary categorical response.

The plot method for method="reverse" creates a temporary function Key as is done by the xYplot and Ecdf.formula functions. After plot runs, you can type Key() to put a legend in a default location, or e.g. Key(locator(1)) to draw a legend where you click the left mouse button. This key is for categorical variables, so to have the opportunity to put the key on the graph you will probably want to use the command plot(object, which="categorical"). A second function Key2 is created if continuous variables are being plotted. It is used the same as Key. If the which argument is not specified to plot, two pages of plots will be produced. If you don't define par(mfrow=) yourself, plot.summaryM will try to lay out a multi-panel graph to best fit all the individual charts for continuous variables.

Usage

summaryM(formula, groups=NULL, data=NULL, subset, na.action=na.retain,
         overall=FALSE, continuous=10, na.include=FALSE,
         quant=c(0.025, 0.05, 0.125, 0.25, 0.375, 0.5, 0.625,
                 0.75, 0.875, 0.95, 0.975),
         nmin=100, test=FALSE,
         conTest=conTestkw, catTest=catTestchisq,
         ordTest=ordTestpo)

## S3 method for class 'summaryM'
print(...)

printsummaryM(x, digits, prn = any(n != N),
      what=c('proportion', '%'), pctdig = if(what == '%') 0 else 2,
      npct = c('numerator', 'both', 'denominator', 'none'),
      exclude1 = TRUE, vnames = c('labels', 'names'), prUnits = TRUE,
      sep = '/', abbreviate.dimnames = FALSE,
      prefix.width = max(nchar(lab)), min.colwidth, formatArgs=NULL, round=NULL,
      prtest = c('P','stat','df','name'), prmsd = FALSE, long = FALSE,
      pdig = 3, eps = 0.001, prob = c(0.25, 0.5, 0.75), prN = FALSE, ...)

## S3 method for class 'summaryM'
plot(x, vnames = c('labels', 'names'),
     which = c('both', 'categorical', 'continuous'), vars=NULL,
     xlim = c(0,1),
     xlab = 'Proportion',
     pch = c(16, 1, 2, 17, 15, 3, 4, 5, 0), exclude1 = TRUE,
     main, ncols=2,
     prtest = c('P', 'stat', 'df', 'name'), pdig = 3, eps = 0.001,
     conType = c('bp', 'dot', 'raw'), cex.means = 0.5, cex=par('cex'),
     height='auto', width=700, ...)

## S3 method for class 'summaryM'
latex(object, title =
      first.word(deparse(substitute(object))),
      file=paste(title, 'tex', sep='.'), append=FALSE, digits,
      prn = any(n != N), what=c('proportion', '%'),
      pctdig = if(what == '%') 0 else 2,
      npct = c('numerator', 'both', 'denominator', 'slash', 'none'),
      npct.size = if(html) mspecs$html$smaller else 'scriptsize',
      Nsize = if(html) mspecs$html$smaller else 'scriptsize',
      exclude1 = TRUE,
      vnames=c("labels", "names"), prUnits = TRUE, middle.bold = FALSE,
      outer.size = if(html) mspecs$html$smaller else "scriptsize",
      caption, rowlabel = "", rowsep=html,
      insert.bottom = TRUE, dcolumn = FALSE, formatArgs=NULL, round=NULL,
      prtest = c('P', 'stat', 'df', 'name'), prmsd = FALSE,
      msdsize = if(html) function(x) x else NULL, brmsd=FALSE,
      long = FALSE, pdig = 3, eps = 0.001,
      auxCol = NULL, table.env=TRUE, tabenv1=FALSE, prob=c(0.25, 0.5, 0.75),
      prN=FALSE, legend.bottom=FALSE, html=FALSE,
      mspecs=markupSpecs, ...)

## S3 method for class 'summaryM'
html(object, ...)

Arguments

formula

An S formula with additive effects. There may be several variables on the right hand side separated by "+", or the numeral 1, indicating that there is no grouping variable so that only margin summaries are produced. The right hand side variable, if present, must be a discrete variable producing a limited number of groups. On the left hand side there may be any number of variables, separated by "+", and these may be of mixed types. These variables are analyzed separately by the grouping variable.

groups

if there is more than one right-hand variable, specify groups as a character string containing the name of the variable used to produce columns of the table. The remaining right hand variables are combined to produce levels that cause separate tables or plots to be produced.

x

an object created by summaryM. For conTestkw a numeric vector, and for ordTestpo, a numeric or factor variable that can be considered ordered

data

name or number of a data frame. Default is the current frame.

subset

a logical vector or integer vector of subscripts used to specify the subset of data to use in the analysis. The default is to use all observations in the data frame.

na.action

function for handling missing data in the input data. The default is a function defined here called na.retain, which keeps all observations for processing, with missing variables or not.

overall

Setting overall=TRUE makes a new column with overall statistics for the whole sample. If test=TRUE these marginal statistics are ignored in doing statistical tests.

continuous

specifies the threshold for when a variable is considered to be continuous (when there are at least continuous unique values). factor variables are always considered to be categorical no matter how many levels they have.
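
The classification rule described for continuous can be sketched in base R as follows. The helper is_continuous is hypothetical and is not the code summaryM runs internally; in particular, dropping missing values before counting distinct values is an assumption made for this illustration.

```r
# Hypothetical helper mirroring the documented rule; not Hmisc code
is_continuous <- function(x, continuous = 10) {
  if (is.factor(x)) return(FALSE)               # factors are always categorical
  length(unique(x[!is.na(x)])) >= continuous    # assumption: NAs are not counted
}

set.seed(1)
age   <- rnorm(50, 50, 5)                          # many distinct values
stage <- sample(1:4, 50, replace = TRUE)           # at most 4 distinct values
site  <- factor(sample(1:20, 50, replace = TRUE))  # a factor, however many levels

res <- c(age   = is_continuous(age),
         stage = is_continuous(stage),
         site  = is_continuous(site))
print(res)
```

With the default threshold of 10, only age is treated as continuous here; stage has too few distinct values and site is a factor.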

na.include

Set na.include=TRUE to keep missing values of categorical variables from being excluded from the table.

nmin

For categories of the response variable in which there are less than or equal to nmin non-missing observations, the raw data are retained for later plotting in place of box plots.

test

Set to TRUE to compute test statistics using tests specified in conTest and catTest.

conTest

a function of two arguments (grouping variable and a continuous variable) that returns a list with components P (the computed P-value), stat (the test statistic, either chi-square or F), df (degrees of freedom), testname (test name), namefun ("chisq", "fstat"), statname (statistic name), an optional component latexstat (LaTeX representation of statname), an optional component plotmathstat (for R - the plotmath representation of statname, as a character string), and an optional component note that contains a character string note about the test (e.g., "test not done because n < 5"). conTest is applied to continuous variables on the right-hand-side of the formula when method="reverse". The default uses the spearman2 function to run the Wilcoxon or Kruskal-Wallis test using the F distribution.
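
A user-supplied test with the shape described for conTest might look like the sketch below. The name my_con_test and the simulated data are hypothetical, and the sketch uses base R's chi-square form of the Kruskal-Wallis test (kruskal.test) rather than the F-distribution approach of the default; only the required list components named above are returned.

```r
# Hypothetical conTest-style function: (grouping variable, continuous variable)
my_con_test <- function(group, x) {
  kt <- kruskal.test(x, as.factor(group))   # base R Kruskal-Wallis test
  list(P        = unname(kt$p.value),
       stat     = unname(kt$statistic),
       df       = unname(kt$parameter),
       testname = "Kruskal-Wallis",
       statname = "Chi-square",
       namefun  = "chisq")
}

set.seed(2)
grp <- rep(c("A", "B", "C"), each = 20)
y   <- rnorm(60) + (grp == "C")   # simulated data; group C shifted upward
res <- my_con_test(grp, y)
str(res)
```

Under the interface described above, such a function would be supplied as conTest=my_con_test in the call to summaryM.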

catTest

a function of a frequency table (an integer matrix) that returns a list with the same components as created by conTest. By default, the Pearson chi-square test is done, without continuity correction (the continuity correction would make the test conservative like the Fisher exact test).
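
The effect of omitting the continuity correction can be checked with base R's chisq.test. This is a minimal sketch: my_cat_test is a hypothetical catTest-style function, not the package default, and the 2 x 2 counts are invented for illustration.

```r
# Hypothetical catTest-style function: takes a frequency table
my_cat_test <- function(tab) {
  ct <- chisq.test(tab, correct = FALSE)    # Pearson test, no continuity correction
  list(P        = unname(ct$p.value),
       stat     = unname(ct$statistic),
       df       = unname(ct$parameter),
       testname = "Pearson",
       statname = "Chi-square",
       namefun  = "chisq")
}

# Invented counts: rows are sex, columns are treatment
tab <- matrix(c(30, 20, 15, 35), nrow = 2,
              dimnames = list(sex = c("f", "m"),
                              treatment = c("Drug", "Placebo")))
res <- my_cat_test(tab)

# For comparison, the continuity-corrected P-value for the same table
p_corrected <- chisq.test(tab, correct = TRUE)$p.value
print(c(uncorrected = res$P, corrected = p_corrected))
```

The corrected P-value is the larger of the two, which is the conservatism the text refers to.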

ordTest

a function of a frequency table (an integer matrix) that returns a list with the same components as created by conTest. By default, the proportional odds likelihood ratio test is done.

...

For Key and Key2 these arguments are passed to key, text, or mtitle. For print methods these are optional arguments to print.char.matrix. For latex methods these are passed to latex.default. For html the arguments are passed to latex.summaryM, and the arguments may not include file. For print the arguments are passed to printsummaryM or latex.summaryM depending on options(prType=).

object

an object created by summaryM

quant

vector of quantiles to use for summarizing continuous variables. These must be numbers between 0 and 1 inclusive and must include the numbers 0.5, 0.25, and 0.75 which are used for printing and for plotting quantile intervals. The outer quantiles are used for scaling the x-axes for such plots. Specify outer quantiles as 0 and 1 to scale the x-axes using the whole observed data ranges instead of the default (a 0.95 quantile interval). Box-percentile plots are drawn using all but the outer quantiles.

prob

vector of quantiles to use for summarizing continuous variables. These must be numbers between 0 and 1 inclusive and have previously been included in the quant argument of summaryM. The vector must be of length three. By default it contains 0.25, 0.5, and 0.75.

Warning: specifying 0 and 1 as two of the quantiles will result in computing the minimum and maximum of the variable. For many random variables the minimum will continue to become smaller as the sample size grows, and the maximum will continue to get larger, so the min and max are not recommended as summary statistics.
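
The relationship between quant and prob, and the reason for the warning, can be illustrated with base R's quantile function on simulated data. This sketch does not call summaryM; the exponential sample and the variable names are assumptions chosen for illustration.

```r
# Default quantiles from the Usage section, and the three printed ones
quant <- c(0.025, 0.05, 0.125, 0.25, 0.375, 0.5, 0.625,
           0.75, 0.875, 0.95, 0.975)
prob  <- c(0.25, 0.5, 0.75)
stopifnot(length(prob) == 3, all(prob %in% quant))  # prob must come from quant

set.seed(3)
big   <- rexp(100000)                  # simulated skewed variable
sizes <- c(100, 1000, 10000, 100000)   # nested samples of increasing size

# The sample maximum can only grow as more observations are added ...
maxs <- sapply(sizes, function(n) max(big[1:n]))
# ... whereas the 0.975 quantile estimates a fixed population quantity
q975 <- sapply(sizes, function(n) quantile(big[1:n], 0.975, names = FALSE))

print(data.frame(n = sizes, max = maxs, q975 = q975))
```

Because the maximum depends on sample size, it describes the sample more than the variable, while an outer quantile such as 0.975 targets the same quantity at every n.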

vnames

By default, tables and plots are usually labeled with variable labels (see the label and sas.get functions). To use the shorter variable names, specify vnames="name".

pch

vector of plotting characters to represent different groups, in order of group levels.

abbreviate.dimnames

see print.char.matrix

prefix.width

see print.char.matrix

min.colwidth

minimum column width to use for boxes printed with print.char.matrix. The default is the maximum of the minimum column label length and the minimum length of entries in the data cells.

formatArgs

a list containing other arguments to pass to format.default such as scientific, e.g., formatArgs=list(scientific=c(-5,5)). For print.summary.formula.reverse and format.summary.formula.reverse, formatArgs applies only to statistics computed on continuous variables, not to percents, numerators, and denominators. The round argument may be preferred.

digits

number of significant digits to print. Default is to use the current value of the digits system option.

what

specifies whether proportions or percentages are to be printed or LaTeX'd

pctdig

number of digits to the right of the decimal place for printing percentages or proportions. The default is zero if what='%', so percents will be rounded to the nearest percent. The default is 2 for proportions.

prn

set to TRUE to print the number of non-missing observations on the current (row) variable. The default is to print these only if any of the counts of non-missing values differs from the total number of non-missing values of the left-hand-side variable.

prN

set to TRUE to print the number of non-missing observations on rows that contain continuous variables.

npct

specifies which counts are to be printed to the right of percentages. The default is to print the frequency (numerator of the percent) in parentheses. You can specify "both" to print both numerator and denominator as a fraction, "denominator", "slash" to typeset horizontally using a forward slash, or "none".

npct.size

the size for typesetting npct information which appears after percents. The default is "scriptsize".

Nsize

When a second row of column headings is added showing sample sizes, Nsize specifies the LaTeX size for these subheadings. Default is "scriptsize".

exclude1

By default, summaryM objects will be printed, plotted, or typeset by removing redundant entries from percentage tables for categorical variables. For example, if you print the percent of females, you don't need to print the percent of males. To override this, set exclude1=FALSE.

prUnits

set to FALSE to suppress printing or latexing units attributes of variables, when method='reverse' or 'response'

sep

character to use to separate quantiles when printing tables

prtest

a vector of test statistic components to print if test=TRUE was in effect when summaryM was called. Defaults to printing all components. Specify prtest=FALSE or prtest="none" to not print any tests. This applies to print, latex, and plot methods.

round

Specify round to round the quantiles and optional mean and standard deviation to round digits after the decimal point. Set round='auto' to try an automatic choice.

prmsd

set to TRUE to print mean and SD after the three quantiles, for continuous variables

msdsize

defaults to NULL to use the current font size for the mean and standard deviation if prmsd is TRUE. Set to a character string or function to specify an alternate LaTeX font size.

brmsd

set to TRUE to put the mean and standard deviation on a separate line, for html

long

set to TRUE to print the results for the first category on its own line, not on the same line with the variable label

pdig

number of digits to the right of the decimal place for printing P-values. Default is 3. This is passed to format.pval.

eps

P-values less than eps will be printed as < eps. See format.pval.

auxCol

an optional auxiliary column of information, right justified, to add in front of statistics typeset by latex.summaryM. This argument is a list with a single element that has a name specifying the column heading. If this name includes a newline character, the portions of the string before and after the newline form the main heading and the subheading (typically set in smaller font), respectively. See the extracolheads argument to latex.default. auxCol is filled with blanks when a variable being summarized takes up more than one row in the output. This happens with categorical variables.

table.env

set toFALSE to usetabular environmentwith no caption

tabenv1

set toTRUE in the case of stratification whenyou want only the first stratum's table to be in a tableenvironment. This is useful when usinghyperref.

which

Specifies whether to plot results for categorical variables,continuous variables, or both (the default).

vars

Subscripts (indexes) of variables to plot forplotly graphics. Default is to plot all variables of eachtype (categorical or continuous).

conType

For drawing plots for continuous variables,extended box plots (box-percentile-type plots) are drawn by default,using all quantiles inquant except for the outermost oneswhich are using for scaling the overall plot based on thenon-stratified marginal distribution of the current response variable.SpecifyconType='dot' to draw dot plots showing the threequartiles instead. For extended box plots, means are drawnwith a solid dot and vertical reference lines are placed at the threequartiles. SpecifyconType='raw' to make a strip chart showingthe raw data. This can only be used if the sample size for eachright-hand-side group is less than or equal tonmin.

cex.means

character size for means in box-percentile plots; default is .5

cex

character size for other plotted items

height,width

dimensions in pixels for theplotlysubplot object containing all the extended box plots. Ifheight="auto",plot.summaryM will setheightbased on the number of continuous variables andncols or for dot charts it will useHmisc::plotlyHeightDotchart. At presentheight isignored for extended box plots due to vertical spacing problem withplotly graphics.

xlim

vector of length two specifying x-axis limits. This is only usedfor plotting categorical variables. Limits for continuousvariables are determined by the outer quantiles specified inquant.

xlab

x-axis label

main

a main title. This applies only to the plot forcategorical variables.

ncols

number of columns forplotly graphics for extendedbox plots. Defaults to 2. Recommendation is for 1-2.

caption

character string containing LaTeX table captions.

title

name of resulting LaTeX file omitting the .tex suffix. Default is the name of the summary object. If caption is specified, title is also used for the table's symbolic reference label.

file

name of file to write LaTeX code to. Specifying file="" will cause LaTeX code to just be printed to standard output rather than be stored in a permanent file.

append

specify TRUE to add code to an existing file

rowlabel

see latex.default (under the help file latex)

rowsep

if html is TRUE, instructs the function to use a horizontal line to separate variables from one another. Recommended if brmsd is TRUE. Ignored for LaTeX.

middle.bold

set to TRUE to have LaTeX use bold face for the middle quantile

outer.size

the font size for outer quantiles

insert.bottom

set to FALSE to suppress inclusion of definitions placed at the bottom of LaTeX tables. You can also specify a character string containing other text that overrides the automatic text. At present such text always appears in the main caption for LaTeX.

legend.bottom

set to TRUE to separate the table caption and legend. This will place table legends at the bottom of LaTeX tables.

html

set to TRUE to typeset with html

mspecs

list defining markup syntax for various languages, defaults to Hmisc markupSpecs which the user can use as a starting point for editing

dcolumn

see latex

Value

a list. plot.summaryM returns the number of pages of plots that were made if using base graphics, or plotly objects created by plotly::subplot otherwise. If both categorical and continuous variables were plotted, the returned object is a list with two named elements Categorical and Continuous, each containing plotly objects. Otherwise a plotly object is returned. The latex method returns attributes legend and nstrata.

Side Effects

plot.summaryM creates functions Key and Key2 in frame 0 that will draw legends, if base graphics are being used.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

References

Harrell FE (2004): Statistical tables and plots using S and LaTeX. Document available from https://hbiostat.org/R/Hmisc/summary.pdf.

See Also

mChoice, label, dotchart3, print.char.matrix, update, formula, format.default, latex, latexTranslate, bpplt, tabulr, bpplotM, summaryP

Examples

options(digits=3)
set.seed(173)
sex <- factor(sample(c("m","f"), 500, rep=TRUE))
country <- factor(sample(c('US', 'Canada'), 500, rep=TRUE))
age <- rnorm(500, 50, 5)
sbp <- rnorm(500, 120, 12)
label(sbp) <- 'Systolic BP'
units(sbp) <- 'mmHg'
treatment <- factor(sample(c("Drug","Placebo"), 500, rep=TRUE))
treatment[1]
sbp[1] <- NA

# Generate a 3-choice variable; each of 3 variables has 5 possible levels
symp <- c('Headache','Stomach Ache','Hangnail',
          'Muscle Ache','Depressed')
symptom1 <- sample(symp, 500,TRUE)
symptom2 <- sample(symp, 500,TRUE)
symptom3 <- sample(symp, 500,TRUE)
Symptoms <- mChoice(symptom1, symptom2, symptom3, label='Primary Symptoms')
table(as.character(Symptoms))

# Note: In this example, some subjects have the same symptom checked
# multiple times; in practice these redundant selections would be NAs
# mChoice will ignore these redundant selections

f <- summaryM(age + sex + sbp + Symptoms ~ treatment, test=TRUE)
f
# trio of numbers represent 25th, 50th, 75th percentile
print(f, long=TRUE)
plot(f)    # first specify options(grType='plotly') to use plotly
plot(f, conType='dot', prtest='P')
bpplt()    # annotated example showing layout of bp plot

# Produce separate tables by country
f <- summaryM(age + sex + sbp + Symptoms ~ treatment + country,
              groups='treatment', test=TRUE)
f

## Not run: 
getHdata(pbc)
s5 <- summaryM(bili + albumin + stage + protime + sex +
               age + spiders ~ drug, data=pbc)

print(s5, npct='both')
# npct='both' : print both numerators and denominators
plot(s5, which='categorical')
Key(locator(1))  # draw legend at mouse click
par(oma=c(3,0,0,0))  # leave outer margin at bottom
plot(s5, which='continuous')  # see also bpplotM
Key2()           # draw legend at lower left corner of plot
                 # oma= above makes this default key fit the page better

options(digits=3)
w <- latex(s5, npct='both', here=TRUE, file='')

options(grType='plotly')
pbc <- upData(pbc, moveUnits = TRUE)
s <- summaryM(bili + albumin + alk.phos + copper + spiders + sex ~
              drug, data=pbc, test=TRUE)
# Render html
options(prType='html')
s   # invokes print.summaryM
a <- plot(s)
a$Categorical
a$Continuous
plot(s, which='con')

## End(Not run)

Multi-way Summary of Proportions

Description

summaryP produces a tall and thin data frame containing numerators (freq) and denominators (denom) after stratifying the data by a series of variables. A special capability to group a series of related yes/no variables is included through the use of the ynbind function, for which the user specifies a final argument label used to label the panel created for that group of related variables.

If options(grType='plotly') is not in effect, the plot method for summaryP displays proportions as a multi-panel dot chart using the lattice package's dotplot function with a special panel function. Numerators and denominators of proportions are also included as text, in the same colors as used by an optional groups variable. The formula argument used in the dotplot call is constructed, but the user can easily reorder the variables by specifying formula, with elements named val (category levels), var (classification variable name), freq (calculated result) plus the overall cross-classification variables excluding groups. If options(grType='plotly') is in effect, the plot method makes an entirely different display using Hmisc::dotchartpl with plotly if marginVal is specified, whereby a stratification variable causes more finely stratified estimates to be shown slightly below the lines, with smaller and translucent symbols if data has been run through addMarginal. The marginal summaries are shown as the main estimates and the user can turn off display of the stratified estimates, or view their details with hover text.

The ggplot method for summaryP does not draw numerators and denominators but the chart is more compact than using the plot method with base graphics because ggplot2 does not repeat category names the same way as lattice does. Variable names that are too long to fit in panel strips are renamed (1), (2), etc. and an attribute "fnvar" is added to the result; this attribute is a character string defining the abbreviations, useful in a figure caption. The ggplot2 object has labels for points plotted, used by plotly::ggplotly as hover text (see example).

The latex method produces one or more LaTeX tabulars containing a table representation of the result, with optional side-by-side display if groups is specified. Multiple tabulars result from the presence of non-group stratification factors.
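The numerator/denominator layout described above can be illustrated with base R alone. This is only a sketch of the idea, not the code summaryP uses; the data frame d, the column prop, and the way the table is assembled are assumptions made for the illustration.

```r
# Illustrative sketch only (not summaryP's implementation).
# d is a small made-up data set with one category variable and one stratifier.
set.seed(1)
d <- data.frame(sex   = sample(c("Female", "Male"), 40, TRUE),
                treat = sample(c("A", "B"), 40, TRUE))

# Denominators: number of observations in each treatment stratum
denom <- table(d$treat)

# Numerators: count of each category level within each stratum,
# arranged one row per (level, stratum) combination ("tall and thin")
tab <- as.data.frame(table(val = d$sex, treat = d$treat),
                     responseName = "freq", stringsAsFactors = FALSE)
tab$denom <- as.vector(denom[tab$treat])
tab$prop  <- tab$freq / tab$denom   # proportion that a dot chart would display
tab
```

Within each stratum the proportions for a two-level variable sum to 1, which is why one of the two levels is redundant and can be dropped from a display.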

Usage

summaryP(formula, data = NULL, subset = NULL,
         na.action = na.retain, sort=TRUE,
         asna = c("unknown", "unspecified"), ...)

## S3 method for class 'summaryP'
plot(x, formula=NULL, groups=NULL,
         marginVal=NULL, marginLabel=marginVal,
         refgroup=NULL, exclude1=TRUE,  xlim = c(-.05, 1.05),
         text.at=NULL, cex.values = 0.5,
         key = list(columns = length(groupslevels), x = 0.75,
                    y = -0.04, cex = 0.9,
                    col = lattice::trellis.par.get('superpose.symbol')$col,
                    corner=c(0,1)),
         outerlabels=TRUE, autoarrange=TRUE,
         col=colorspace::rainbow_hcl, ...)

## S3 method for class 'summaryP'
ggplot(data, mapping, groups=NULL, exclude1=TRUE,
           xlim=c(0, 1), col=NULL, shape=NULL, size=function(n) n ^ (1/4),
           sizerange=NULL, abblen=5, autoarrange=TRUE, addlayer=NULL,
           ..., environment)

## S3 method for class 'summaryP'
latex(object, groups=NULL, exclude1=TRUE, file='', round=3,
                           size=NULL, append=TRUE, ...)

Arguments

formula

a formula with the variables for whose levels proportions are computed on the left hand side, and major classification variables on the right. The formula needs to include any variable later used as groups, as the data summarization does not distinguish between superpositioning and paneling. For the plot method, formula can provide an override to the default formula for dotplot().

data

an optional data frame. For ggplot.summaryP, data is the result of summaryP.

subset

an optional subsetting expression or vector

na.action

function specifying how to handle NAs. The default is to keep all NAs in the analysis frame.

sort

set to FALSE to not sort category levels in descending order of global proportions

asna

character vector specifying level names to consider the same as NA. Set asna=NULL to not consider any.

x

an object produced by summaryP

groups

a character string containing the name of a superpositioning variable for obtaining further stratification within a horizontal line in the dot chart.

marginVal

if options(grType='plotly') is in effect and the data given to summaryP were run through addMarginal, specifies the category name that represents marginal summaries (usually "All").

marginLabel

specifies a different character string to use than the value of marginVal. For example, if marginal proportions were computed over all regions, one may specify marginVal="All", marginLabel="All Regions". marginLabel is only used for formatting graphical output.

refgroup

used when doing a plotly chart and a two-level group variable was used, resulting in the half-width confidence interval for the difference in two proportions to be shown, and the actual confidence limits and the difference added to hover text. See dotchartpl for more details.

exclude1

By default, ggplot, plot, and latex methods for summaryP remove redundant entries from tables for variables with only two levels. For example, if you print the proportion of females, you don't need to print the proportion of males. To override this, set exclude1=FALSE.

xlim

x-axis limits. Default is c(0,1) for the ggplot method; the Usage section shows c(-.05, 1.05) for the plot method.

text.at

specify to leave unused space to the right of each panel to prevent numerators and denominators from touching data points. text.at is the upper limit for scaling panels' x-axes but tick marks are only labeled up to max(xlim).

cex.values

character size to use for plotting numerators and denominators

key

a list to pass to the auto.key argument of dotplot. To place a key above the entire chart use auto.key=list(columns=2) for example.

outerlabels

by default if there are two conditioning variables besides groups, the latticeExtra package's useOuterStrips function is used to put strip labels in the margins, usually resulting in a much prettier chart. Set to FALSE to prevent usage of useOuterStrips.

autoarrange

If TRUE, the formula is re-arranged so that if there are two conditioning (paneling) variables, the variable with the most levels is taken as the vertical condition.

col

a vector of colors to use to override defaults in ggplot. When options(grType='plotly'), see dotchartpl.

shape

a vector of plotting symbols to override ggplot defaults

mapping,environment

not used; needed because of rules for generics

size

for ggplot, a function that transforms denominators into metrics used for the size aesthetic. Default is the fourth root function so that the area of symbols is proportional to the square root of sample size. Specify NULL to not vary point sizes. size=sqrt is a reasonable alternative. Set size to an integer to categorize the denominators into size quantile groups using cut2. Unless size is an integer, the legend for sizes uses the minimum and maximum denominators and 6-tiles using quantile(..., type=1) so that actually occurring sample sizes are used as labels. size is overridden to NULL if the range in denominators is less than 10 or the ratio of the maximum to the minimum is less than 1.2. For latex, size is an optional font size such as "small"

sizerange

a 2-vector specifying the range argument to the ggplot2 scale_size_... function, which is the range of sizes allowed for the points according to the denominator. The default is sizerange=c(.7, 3.25) but the lower limit is increased according to the ratio of maximum to minimum sample sizes.

abblen

labels of variables having only one level and having their name longer than abblen characters are abbreviated and documented in fnvar (described elsewhere here). The default abblen=5 is good for labels plotted vertically. If labels are rotated using theme a better value would be 12.

...

used only for plotly graphics and these arguments are passed to dotchartpl

object

an object produced by summaryP

file

file name, defaults to writing to console

round

number of digits to the right of the decimal place forproportions

append

set to FALSE to start output over

addlayer

a ggplot layer to add to the plot object
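The arithmetic behind the size argument can be checked with a few lines of base R. This is a sketch under stated assumptions: the denominators are made up, the transformed value is treated as a linear dimension of the symbol, and vary_size is a hypothetical helper that merely restates the documented override rule; none of this is taken from the package source.

```r
# Hypothetical denominators (sample sizes) for four plotted points
n <- c(16, 81, 256, 625)

# Documented default transformation: the fourth root
size <- n ^ (1/4)

# Assumption for this sketch: size acts as a linear dimension, so area ~ size^2
area <- size ^ 2
cbind(n, size, area, sqrt_n = sqrt(n))   # area column equals sqrt(n)

# Hypothetical helper restating the documented override rule: sizes are not
# varied when the denominators span less than 10 or when max/min is below 1.2
vary_size <- function(denom) {
  r <- range(denom)
  !(diff(r) < 10 || r[2] / r[1] < 1.2)
}
vary_size(n)             # TRUE
vary_size(c(100, 105))   # FALSE
```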

Value

summaryP produces a data frame of class "summaryP". The plot method produces a lattice object of class "trellis". The latex method produces an object of class "latex" with an additional attribute ngrouplevels specifying the number of levels of any groups variable and an attribute nstrata specifying the number of strata.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

See Also

bpplotM, summaryM, ynbind, pBlock, ggplot, colorFacet

Examples

n <- 100
f <- function(na=FALSE) {
  x <- sample(c('N', 'Y'), n, TRUE)
  if(na) x[runif(100) < .1] <- NA
  x
}
set.seed(1)
d <- data.frame(x1=f(), x2=f(), x3=f(), x4=f(), x5=f(), x6=f(), x7=f(TRUE),
                age=rnorm(n, 50, 10),
                race=sample(c('Asian', 'Black/AA', 'White'), n, TRUE),
                sex=sample(c('Female', 'Male'), n, TRUE),
                treat=sample(c('A', 'B'), n, TRUE),
                region=sample(c('North America','Europe'), n, TRUE))
d <- upData(d, labels=c(x1='MI', x2='Stroke', x3='AKI', x4='Migraines',
                 x5='Pregnant', x6='Other event', x7='MD withdrawal',
                 race='Race', sex='Sex'))
dasna <- subset(d, region=='North America')
with(dasna, table(race, treat))
s <- summaryP(race + sex + ynbind(x1, x2, x3, x4, x5, x6, x7, label='Exclusions') ~
              region + treat, data=d)
# add exclude1=FALSE below to include female category
plot(s, groups='treat')
require(ggplot2)
ggplot(s, groups='treat')

plot(s, val ~ freq | region * var, groups='treat', outerlabels=FALSE)
# Much better looking if omit outerlabels=FALSE; see output at
# https://hbiostat.org/R/Hmisc/summaryFuns.pdf
# See more examples under bpplotM

## For plotly interactive graphic that does not handle variable size
## panels well:
## require(plotly)
## g <- ggplot(s, groups='treat')
## ggplotly(g, tooltip='text')

## For nice plotly interactive graphic:
## options(grType='plotly')
## s <- summaryP(race + sex + ynbind(x1, x2, x3, x4, x5, x6, x7,
##                                   label='Exclusions') ~
##               treat, data=subset(d, region=='Europe'))
##
## plot(s, groups='treat', refgroup='A')  # refgroup='A' does B-A differences

# Make a chart where there is a block of variables that
# are only analyzed for males.  Keep redundant sex in block for demo.
# Leave extra space for numerators, denominators
sb <- summaryP(race + sex +
               pBlock(race, sex, label='Race: Males', subset=sex=='Male') ~
               region, data=d)
plot(sb, text.at=1.3)
plot(sb, groups='region', layout=c(1,3), key=list(space='top'),
     text.at=1.15)
ggplot(sb, groups='region')

## Not run: 
plot(s, groups='treat')
# plot(s, groups='treat', outerlabels=FALSE) for standard lattice output
plot(s, groups='region', key=list(columns=2, space='bottom'))
require(ggplot2)
colorFacet(ggplot(s))

plot(summaryP(race + sex ~ region, data=d), exclude1=FALSE, col='green')

require(lattice)
# Make your own plot using data frame created by summaryP
useOuterStrips(dotplot(val ~ freq | region * var, groups=treat, data=s,
        xlim=c(0,1), scales=list(y='free', rot=0), xlab='Fraction',
        panel=function(x, y, subscripts, ...) {
          denom <- s$denom[subscripts]
          x <- x / denom
          panel.dotplot(x=x, y=y, subscripts=subscripts, ...) }))

# Show marginal summary for all regions combined
s <- summaryP(race + sex ~ region, data=addMarginal(d, region))
plot(s, groups='region', key=list(space='top'), layout=c(1,2))

# Show marginal summaries for both race and sex
s <- summaryP(ynbind(x1, x2, x3, x4, label='Exclusions', sort=FALSE) ~
              race + sex, data=addMarginal(d, race, sex))
plot(s, val ~ freq | sex*race)

## End(Not run)

Graphical Summarization of Continuous Variables Against a Response

Description

summaryRc is a continuous version of summary.formula with method='response'. It uses the plsmo function to compute the possibly stratified lowess nonparametric regression estimates, and plots them along with the data density, with selected quantiles of the overall distribution (over strata) of each x shown as arrows on top of the graph. All the x variables must be numeric and continuous or nearly continuous.

Usage

summaryRc(formula, data=NULL, subset=NULL,
          na.action=NULL, fun = function(x) x,
          na.rm = TRUE, ylab=NULL, ylim=NULL, xlim=NULL,
          nloc=NULL, datadensity=NULL,
          quant = c(0.05, 0.1, 0.25, 0.5, 0.75,
                    0.90, 0.95), quantloc=c('top','bottom'),
          cex.quant=.6, srt.quant=0,
          bpplot = c('none', 'top', 'top outside', 'top inside', 'bottom'),
          height.bpplot=0.08,
          trim=NULL, test = FALSE, vnames = c('labels', 'names'), ...)

Arguments

formula

An R formula with additive effects. The formula may contain one or more invocations of the stratify function whose arguments are defined below. This causes the entire analysis to be stratified by cross-classifications of the combined list of stratification factors. This stratification will be reflected as separate lowess curves.

data

name or number of a data frame. Default is the current frame.

subset

a logical vector or integer vector of subscripts used to specify the subset of data to use in the analysis. The default is to use all observations in the data frame.

na.action

function for handling missing data in the input data. The default is a function defined here called na.retain, which keeps all observations for processing, with missing variables or not.

fun

function for transforming lowess estimates. Default is the identity function.

na.rm

TRUE (the default) to exclude NAs before passing data to fun to compute statistics, FALSE otherwise.

ylab

y-axis label. Default is label attribute of y variable, or its name.

ylim

y-axis limits. By default each graph is scaled on its own.

xlim

a list with elements named as the variable names appearing on the x-axis, with each element being a 2-vector specifying lower and upper limits. Any variable not appearing in the list will have its limits computed and possibly trimmed.

nloc

location for sample size. Specify nloc=FALSE to suppress, or nloc=list(x=,y=) where x,y are relative coordinates in the data window. Default position is in the largest empty space.

datadensity

see plsmo. Defaults to TRUE if there is a stratify variable, FALSE otherwise.

quant

vector of quantiles to use for summarizing the marginal distribution of each x. This must be numbers between 0 and 1 inclusive. Use NULL to omit quantiles.

quantloc

specify quantloc='bottom' to place the quantiles at the bottom of each plot rather than the default ('top')

cex.quant

character size for writing which quantiles are represented. Set to 0 to suppress quantile labels.

srt.quant

angle for text for quantile labels

bpplot

if not 'none' will draw extended box plot at location given by bpplot, and quantiles discussed above will be suppressed. Specifying bpplot='top' is the same as specifying bpplot='top inside'.

height.bpplot

height in inches of the horizontal extended box plot

trim

The default is to plot from the 10th smallest to the 10th largest x if the number of non-NAs exceeds 200, otherwise to use the entire range of x. Specify another quantile to use other limits, e.g., trim=0.01 will use the first and last percentiles

test

Set to TRUE to plot test statistics (not yet implemented).

vnames

By default, plots are usually labeled with variable labels (see the label and sas.get functions). To use the shorter variable names, specify vnames="names".

...

arguments passed to plsmo
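The rule stated for trim can be written out in base R as follows. This is a minimal sketch of the documented behavior only; x_limits is a hypothetical helper, and the use of quantile() with its default type for a user-supplied trim is an assumption rather than a statement about how summaryRc or plsmo compute the limits.

```r
# Hypothetical helper restating the documented trimming rule for x-axis limits
x_limits <- function(x, trim = NULL) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  if (is.null(trim)) {
    # Default: 10th smallest to 10th largest when more than 200 non-NA values,
    # otherwise the entire range
    if (n > 200) c(x[10], x[n - 9]) else range(x)
  } else {
    # User-specified quantile, e.g. trim = 0.01 for the 1st and 99th percentiles
    as.numeric(quantile(x, c(trim, 1 - trim)))
  }
}

x_limits(1:500)                # 10 491
x_limits(1:100)                # 1 100
x_limits(1:101, trim = 0.01)   # 2 100
```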

Value

no value is returned

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

See Also

plsmo, stratify, label, formula, panel.bpplot

Examples

options(digits=3)
set.seed(177)
sex <- factor(sample(c("m","f"), 500, rep=TRUE))
age <- rnorm(500, 50, 5)
bp  <- rnorm(500, 120, 7)
units(age) <- 'Years'; units(bp) <- 'mmHg'
label(bp) <- 'Systolic Blood Pressure'
L <- .5*(sex == 'm') + 0.1 * (age - 50)
y <- rbinom(500, 1, plogis(L))
par(mfrow=c(1,2))
summaryRc(y ~ age + bp)
# For x limits use 1st and 99th percentiles to frame extended box plots
summaryRc(y ~ age + bp, bpplot='top', datadensity=FALSE, trim=.01)
summaryRc(y ~ age + bp + stratify(sex),
          label.curves=list(keys='lines'), nloc=list(x=.1, y=.05))
y2 <- rbinom(500, 1, plogis(L + .5))
Y <- cbind(y, y2)
summaryRc(Y ~ age + bp + stratify(sex),
          label.curves=list(keys='lines'), nloc=list(x=.1, y=.05))

Summarize Multiple Response Variables and Make Multipanel Scatter or Dot Plot

Description

Multiple left-hand formula variables along with right-hand side conditioning variables are reshaped into a "tall and thin" data frame if fun is not specified. The resulting raw data can be plotted with the plot method using user-specified panel functions for lattice graphics, typically to make a scatterplot or loess smooths, or both. The Hmisc panel.plsmo function is handy in this context. Instead, if fun is specified, this function takes individual response variables (which may be matrices, as in Surv objects) and creates one or more summary statistics that will be computed while the resulting data frame is being collapsed to one row per condition. The plot method in this case plots a multi-panel dot chart using the lattice dotplot function if panel is not specified to plot. There is an option to print selected statistics as text on the panels. summaryS pays special attention to Hmisc variable annotations: label, units. When panel is specified in addition to fun, a special x-y plot is made that assumes that the x-axis variable (typically time) is discrete. This is used for example to plot multiple quantile intervals as vertical lines next to the main point. A special panel function mbarclPanel is provided for this purpose.
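The "tall and thin" reshaping can be pictured with a small base R sketch. It is illustrative only and not the code summaryS runs; the data frame wide is made up, and the column names yvar and y are borrowed from the xyplot call shown in the Examples.

```r
# Made-up wide data: two response variables and two right-hand-side variables
wide <- data.frame(days  = c(0, 30, 60),
                   treat = c("A", "B", "A"),
                   sbp   = c(120, 125, 118),
                   dbp   = c(80, 82, 79))
responses <- c("sbp", "dbp")

# Stack the responses: one row per observation per response variable,
# with yvar naming the response and y holding its value
tall <- do.call(rbind, lapply(responses, function(v)
  data.frame(wide[c("days", "treat")], yvar = v, y = wide[[v]])))
tall
```

Each response then becomes a level of yvar, which is what allows one panel per response variable in a lattice formula such as y ~ days | yvar.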

The plotp method produces corresponding plotly graphics.

When fun is given and panel is omitted, and the result of fun is a vector of more than one statistic, the first statistic is taken as the main one. Any columns with names not in textonly will figure into the calculation of axis limits. Those in textonly will be printed right under the dot lines in the dot chart. Statistics with names in textplot will figure into limits, be plotted, and printed. pch.stats can be used to specify symbols for statistics after the first column. When fun computes three columns that are plotted, columns two and three are taken as confidence limits for which horizontal "error bars" are drawn. Two levels with different thicknesses are drawn if there are four plotted summary statistics beyond the first.

mbarclPanel is used to draw multiple vertical lines around the main points, such as a series of quantile intervals stratified by x and paneling variables. If mbarclPanel finds a column of an argument yother that is named "se", and if there are exactly two levels to a superpositioning variable, the half-height of the approximate 0.95 confidence interval for the difference between two point estimates is shown, positioned at the midpoint of the two point estimates at an x value. This assumes normality of point estimates, and the standard error of the difference is the square root of the sum of squares of the two standard errors. By positioning the intervals in this fashion, a failure of the two point estimates to touch the half-confidence interval is consistent with rejecting the null hypothesis of no difference at the 0.05 level.
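The half-interval construction for two groups can be worked through numerically. The following base R sketch uses made-up point estimates and standard errors and is not the panel function's code; it only restates the arithmetic described above.

```r
# Hypothetical point estimates and standard errors for two groups at one x value
est <- c(A = 0.42, B = 0.55)
se  <- c(A = 0.05, B = 0.06)

# Standard error of the difference: square root of the sum of squared s.e.
se_diff <- sqrt(sum(se ^ 2))

# Half-width of the approximate 0.95 confidence interval for the difference
half_ci <- qnorm(0.975) * se_diff

# The drawn segment has total length half_ci and is centered at the midpoint
mid <- mean(est)
bar <- mid + c(-1, 1) * half_ci / 2

# Both estimates fall outside the segment exactly when |difference| > half_ci,
# i.e. when the 0.95 interval for the difference excludes zero
reject <- abs(diff(est)) > half_ci
list(bar = bar, reject = unname(reject))
```

With these numbers the difference of 0.13 is smaller than the half-width of about 0.153, so both estimates lie within the segment and the null hypothesis is not rejected.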

mbarclpl is the sfun function corresponding to mbarclPanel for plotp, and medvpl is the sfun replacement for medvPanel.

medvPanel takes raw data and plots median y vs. x, along with confidence intervals and half-interval for the difference in medians as with mbarclPanel. Quantile intervals are optional. Very transparent vertical violin plots are added by default. Unlike panel.violin, only half of the violin is plotted, and when there are two superpose groups they are side-by-side in different colors.

For plotp, the function corresponding to medvPanel is medvpl, which draws back-to-back spike histograms, optional Gini mean difference, optional SD, quantiles (thin line version of box plot with 0.05 0.25 0.5 0.75 0.95 quantiles), and half-width confidence interval for differences in medians. For quantiles, the Harrell-Davis estimator is used.
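For readers unfamiliar with the Gini mean difference mentioned above, it is the mean absolute difference between all pairs of distinct observations. A direct base R sketch follows; gini_md is a hypothetical name used for illustration, and the quadratic-time computation is chosen for clarity rather than efficiency.

```r
# Gini mean difference: average |x_i - x_j| over all ordered pairs with i != j
gini_md <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  # outer() forms all pairwise differences; the diagonal contributes zero
  sum(abs(outer(x, x, "-"))) / (n * (n - 1))
}

gini_md(c(1, 2, 4))   # 2
```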

Usage

summaryS(formula, fun = NULL, data = NULL, subset = NULL,
         na.action = na.retain, continuous=10, ...)

## S3 method for class 'summaryS'
plot(x, formula=NULL, groups=NULL, panel=NULL,
           paneldoesgroups=FALSE, datadensity=NULL, ylab='',
           funlabel=NULL, textonly='n', textplot=NULL,
           digits=3, custom=NULL,
           xlim=NULL, ylim=NULL, cex.strip=1, cex.values=0.5, pch.stats=NULL,
           key=list(columns=length(groupslevels),
             x=.75, y=-.04, cex=.9,
             col=lattice::trellis.par.get('superpose.symbol')$col,
             corner=c(0,1)),
           outerlabels=TRUE, autoarrange=TRUE, scat1d.opts=NULL, ...)

## S3 method for class 'summaryS'
plotp(data, formula=NULL, groups=NULL, sfun=NULL,
           fitter=NULL, showpts=! length(fitter), funlabel=NULL,
           digits=5, xlim=NULL, ylim=NULL,
           shareX=TRUE, shareY=FALSE, autoarrange=TRUE, ...)

mbarclPanel(x, y, subscripts, groups=NULL, yother, ...)

medvPanel(x, y, subscripts, groups=NULL, violin=TRUE, quantiles=FALSE, ...)

mbarclpl(x, y, groups=NULL, yother, yvar=NULL, maintracename='y',
         xlim=NULL, ylim=NULL, xname='x', alphaSegments=0.45, ...)

medvpl(x, y, groups=NULL, yvar=NULL, maintracename='y',
       xlim=NULL, ylim=NULL, xlab=xname, ylab=NULL, xname='x',
       zeroline=FALSE, yother=NULL, alphaSegments=0.45,
       dhistboxp.opts=NULL, ...)

Arguments

formula

a formula with possibly multiple left and right-side variables separated by +. Analysis (response) variables are on the left and are typically numeric. For plot, formula is optional and overrides the default formula inferred for the reshaped data frame.

fun

an optional summarization function, e.g., smean.sd

data

optional input data frame. For plotp, data is the object produced by summaryS.

subset

optional subsetting criteria

na.action

function for dealing with NAs when constructing the model data frame

continuous

minimum number of unique values a numeric variable must have to be considered continuous

...

ignored for summaryS and mbarclPanel, passed to strip and panel for plot. Passed to the density function by medvPanel. For plotp, these are passed to plotlyM and sfun. For mbarclpl, passed to plotlyM.

x

an object created by summaryS. For mbarclPanel, x is an x-axis argument provided by lattice

groups

a character string or factor specifying that one of the conditioning variables is used for superpositioning and not paneling

panel

optional lattice panel function

paneldoesgroups

set to TRUE if, like panel.plsmo, the paneling function internally handles superpositioning for groups

datadensity

set to TRUE to add rug plots etc. using scat1d

ylab

optional y-axis label

funlabel

optional axis label for when fun is given

textonly

names of statistics to print and not plot. By default, any statistic named "n" is only printed.

textplot

names of statistics to print and plot

digits

used if any statistics are printed as text (including plotly hovertext), to specify the number of significant digits to render

custom

a function that customizes formatting of statistics that are printed as text. This is useful for generating plotmath notation. See the example in the tests directory.

xlim

optional x-axis limits

ylim

optional y-axis limits

cex.strip

size of strip labels

cex.values

size of statistics printed as text

pch.stats

symbols to use for statistics (not including the one in column one) that are plotted. This is a named vector, with names exactly matching those created by fun. When a column does not have an entry in pch.stats, no point is drawn for that column.

key

lattice key specification

outerlabels

set to FALSE to not pass two-way charts through useOuterStrips

autoarrange

set to FALSE to prevent plot from trying to optimize which conditioning variable is vertical

scat1d.opts

a list of options to specify to scat1d

y,subscripts

provided by lattice

yother

passed to the panel function from the plot method based on multiple statistics computed

violin

controls whether violin plots are included

quantiles

controls whether quantile intervals are included

sfun

a function called by plotp.summaryS to compute and plot user-specified summary measures. Two functions for doing this are provided here: mbarclpl, medvpl.

fitter

a fitting function such as loess to smooth points. The smoothed values over a systematic grid will be evaluated and plotted as curves.

showpts

set to TRUE to show raw data points in addition to smoothed curves

shareX

TRUE to cause plotly to share a single x-axis when graphs are aligned vertically

shareY

TRUE to cause plotly to share a single y-axis when graphs are aligned horizontally

yvar

a character or factor variable used to stratify the analysis into multiple y-variables

maintracename

a default trace name when it can't be inferred

xname

x-axis variable name for hover text when it can't beinferred

xlab

x-axis label when it can't be inferred

alphaSegments

alpha saturation to draw line segments for plotly

dhistboxp.opts

list of options to pass to dhistboxp

zeroline

set to FALSE to suppress plotly zeroline at x=0
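The meaning of the continuous threshold described in the argument list can be restated as a one-line check. This is a hedged sketch: is_continuous is a hypothetical helper, and whether summaryS counts missing values or compares with >= exactly as done here is an assumption based only on the wording "minimum number of unique values".

```r
# Hypothetical helper: a numeric variable is treated as continuous when it has
# at least `continuous` distinct non-missing values
is_continuous <- function(x, continuous = 10) {
  is.numeric(x) && length(unique(x[!is.na(x)])) >= continuous
}

is_continuous(c(0, 1, 0, 1))        # FALSE: only two distinct values
is_continuous(seq(1, 10, by = 1))   # TRUE: ten distinct values
```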

Value

a data frame with added attributes for summaryS or a lattice object ready to render for plot

Author(s)

Frank Harrell

See Also

summary, summarize

Examples

# See tests directory file summaryS.r for more examples, and summarySp.r
# for plotp examples
require(survival)
n <- 100
set.seed(1)
d <- data.frame(sbp=rnorm(n, 120, 10),
                dbp=rnorm(n, 80, 10),
                age=rnorm(n, 50, 10),
                days=sample(1:n, n, TRUE),
                S1=Surv(2*runif(n)), S2=Surv(runif(n)),
                race=sample(c('Asian', 'Black/AA', 'White'), n, TRUE),
                sex=sample(c('Female', 'Male'), n, TRUE),
                treat=sample(c('A', 'B'), n, TRUE),
                region=sample(c('North America','Europe'), n, TRUE),
                meda=sample(0:1, n, TRUE), medb=sample(0:1, n, TRUE))
d <- upData(d, labels=c(sbp='Systolic BP', dbp='Diastolic BP',
            race='Race', sex='Sex', treat='Treatment',
            days='Time Since Randomization',
            S1='Hospitalization', S2='Re-Operation',
            meda='Medication A', medb='Medication B'),
            units=c(sbp='mmHg', dbp='mmHg', age='Year', days='Days'))
s <- summaryS(age + sbp + dbp ~ days + region + treat,  data=d)
# plot(s)   # 3 pages
plot(s, groups='treat', datadensity=TRUE,
     scat1d.opts=list(lwd=.5, nhistSpike=0))
plot(s, groups='treat', panel=lattice::panel.loess,
     key=list(space='bottom', columns=2),
     datadensity=TRUE, scat1d.opts=list(lwd=.5))
# To make a plotly graph when the stratification variable region is not
# present, run the following (showpts adds raw data points):
# plotp(s, groups='treat', fitter=loess, showpts=TRUE)
# Make your own plot using data frame created by summaryP
# xyplot(y ~ days | yvar * region, groups=treat, data=s,
#        scales=list(y='free', rot=0))
# Use loess to estimate the probability of two different types of events as
# a function of time
s <- summaryS(meda + medb ~ days + treat + region, data=d)
pan <- function(...)
   panel.plsmo(..., type='l', label.curves=max(which.packet()) == 1,
               datadensity=TRUE)
plot(s, groups='treat', panel=pan, paneldoesgroups=TRUE,
     scat1d.opts=list(lwd=.7), cex.strip=.8)
# Repeat using intervals instead of nonparametric smoother
pan <- function(...)  # really need mobs > 96 to est. proportion
  panel.plsmo(..., type='l', label.curves=max(which.packet()) == 1,
              method='intervals', mobs=5)
plot(s, groups='treat', panel=pan, paneldoesgroups=TRUE, xlim=c(0, 150))
# Demonstrate dot charts of summary statistics
s <- summaryS(age + sbp + dbp ~ region + treat, data=d, fun=mean)
plot(s)
plot(s, groups='treat', funlabel=expression(bar(X)))
# Compute parametric confidence limits for mean, and include sample
# sizes by naming a column "n"
f <- function(x) {
  x <- x[! is.na(x)]
  c(smean.cl.normal(x, na.rm=FALSE), n=length(x))
}
s <- summaryS(age + sbp + dbp ~ region + treat, data=d, fun=f)
plot(s, funlabel=expression(bar(X) %+-% t[0.975] %*% s))
plot(s, groups='treat', cex.values=.65,
     key=list(space='bottom', columns=2,
       text=c('Treatment A:','Treatment B:')))
# For discrete time, plot Harrell-Davis quantiles of y variables across
# time using different line characteristics to distinguish quantiles
d <- upData(d, days=round(days / 30) * 30)
g <- function(y) {
  probs <- c(0.05, 0.125, 0.25, 0.375)
  probs <- sort(c(probs, 1 - probs))
  y <- y[! is.na(y)]
  w <- hdquantile(y, probs)
  m <- hdquantile(y, 0.5, se=TRUE)
  se <- as.numeric(attr(m, 'se'))
  c(Median=as.numeric(m), w, se=se, n=length(y))
}
s <- summaryS(sbp + dbp ~ days + region, fun=g, data=d)
plot(s, panel=mbarclPanel)
plot(s, groups='region', panel=mbarclPanel, paneldoesgroups=TRUE)
# For discrete time, plot median y vs x along with CL for difference,
# using Harrell-Davis median estimator and its s.e., and use violin
# plots
s <- summaryS(sbp + dbp ~ days + region, data=d)
plot(s, groups='region', panel=medvPanel, paneldoesgroups=TRUE)
# Proportions and Wilson confidence limits, plus approx. Gaussian
# based half/width confidence limits for difference in probabilities
g <- function(y) {
  y <- y[!is.na(y)]
  n <- length(y)
  p <- mean(y)
  se <- sqrt(p * (1. - p) / n)
  structure(c(binconf(sum(y), n), se=se, n=n),
            names=c('Proportion', 'Lower', 'Upper', 'se', 'n'))
}
s <- summaryS(meda + medb ~ days + region, fun=g, data=d)
plot(s, groups='region', panel=mbarclPanel, paneldoesgroups=TRUE)
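The last example above obtains Wilson confidence limits from binconf. As a rough illustration of what such limits look like, the following base-R sketch computes a Wilson score interval directly. The helper wilson_ci is hypothetical, written only for this note; it is not an Hmisc function and is not claimed to reproduce binconf's internals.

```r
# Hypothetical helper (not part of Hmisc): Wilson score interval
# for a binomial proportion, using base R only.
wilson_ci <- function(x, n, conf.int = 0.95) {
  z <- qnorm(1 - (1 - conf.int) / 2)      # normal critical value
  p <- x / n                              # observed proportion
  denom  <- 1 + z^2 / n
  center <- (p + z^2 / (2 * n)) / denom   # interval midpoint, shrunk toward 1/2
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
  c(Proportion = p, Lower = center - half, Upper = center + half)
}

ci <- wilson_ci(50, 100)
print(ci)
```

Unlike the simple Gaussian interval p +/- z * se, the Wilson limits always stay inside [0, 1], which is one reason to prefer them for proportions near 0 or 1.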

Graphic Representation of a Frequency Table

Description

This function can be used to represent contingency tables graphically. Frequency counts are represented as the heights of "thermometers" by default; you can also specify symbol='circle' to the function. There is an option to include marginal frequencies, which are plotted on a halved scale so as to not overwhelm the plot. If you do not ask for marginal frequencies to be plotted using marginals=TRUE, symbol.freq will ask you to click the mouse where a reference symbol is to be drawn to assist in reading the scale of the frequencies.

label attributes, if present, are used for x- and y-axis labels. Otherwise, names of calling arguments are used.

Usage

symbol.freq(x, y, symbol = c("thermometer", "circle"),
            marginals = FALSE, orig.scale = FALSE,
            inches = 0.25, width = 0.15, subset, srtx = 0, ...)

Arguments

x

first variable to cross-classify

y

second variable

symbol

specify "thermometer" (the default) or "circle"

marginals

set to TRUE to add marginal frequencies (scaled by half) to the plot

orig.scale

set to TRUE when the first two arguments are numeric variables; this uses their original values for x and y coordinates

inches

see symbols

width

see thermometers option in symbols

subset

the usual subsetting vector

srtx

rotation angle for x-axis labels

...

other arguments to pass to symbols

Author(s)

Frank Harrell

See Also

symbols

Examples

## Not run: 
getHdata(titanic)
attach(titanic)
age.tertile <- cut2(titanic$age, g=3)
symbol.freq(age.tertile, pclass, marginals=TRUE, srtx=45)
detach(2)
## End(Not run)

Run Unix or Dos Depending on System

Description

Runs unix or dos depending on the current operating system. For R, just runs system with optional concatenation of the first two arguments, which are assumed to be named command and text.
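The behaviour described for R can be pictured with a few lines of base R. This is only a sketch of the documented behaviour, not the source of sys; the name sys_sketch is made up for illustration, and the example call assumes a Unix-like system where echo is available as a command.

```r
# Sketch only: mimics the documented R behaviour of sys(), not its source.
sys_sketch <- function(command, text = NULL, output = TRUE) {
  # Concatenate the command and the optional text (options, file names)
  cmd <- if (length(text)) paste(command, text) else command
  # intern = TRUE returns the command's output as a character vector
  system(cmd, intern = output)
}

out <- sys_sketch("echo", "hello")   # assumes a Unix-like shell
print(out)
```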

Usage

sys(command, text=NULL, output=TRUE)
# S-Plus: sys(..., minimized=FALSE)

Arguments

command

system command to execute

text

text to concatenate to system command, if any (typically options or filenames or both)

output

set to FALSE to not return output of command as a character vector

Value

see unix or dos

Side Effects

executes system commands

See Also

unix, system


t-test for Clustered Data

Description

Does a 2-sample t-test for clustered data.

Usage

t.test.cluster(y, cluster, group, conf.int = 0.95)
## S3 method for class 't.test.cluster'
print(x, digits, ...)

Arguments

y

normally distributed response variable to test

cluster

cluster identifiers, e.g. subject ID

group

grouping variable with two values

conf.int

confidence coefficient to use for confidence limits

x

an object created by t.test.cluster

digits

number of significant digits to print

...

unused

Value

a matrix of statistics of class t.test.cluster

Author(s)

Frank Harrell

References

Donner A, Birkett N, Buck C, Am J Epi 114:906-914, 1981.

Donner A, Klar N, J Clin Epi 49:435-439, 1996.

Hsieh FY, Stat in Med 8:1195-1201, 1988.

See Also

t.test

Examples

set.seed(1)
y <- rnorm(800)
group <- sample(1:2, 800, TRUE)
cluster <- sample(1:40, 800, TRUE)
table(cluster,group)
t.test(y ~ group)   # R only
t.test.cluster(y, cluster, group)
# Note: negate estimates of differences from t.test to
# compare with t.test.cluster

Interface to Tabular Function

Description

tabulr is a front-end to the tables package's tabular function so that the user can take advantage of variable annotations used by the Hmisc package, particularly those created by the label, units, and upData functions. When a variable appears in a tabular function, the variable x is found in the data argument or in the parent environment, and the labelLatex function is used to create a LaTeX label. By default any units of measurement are right-justified in the current LaTeX tabular field using hfill; use nofill to list variables for which units are not right-justified with hfill. Once the label is constructed, the variable name is preceded by Heading("LaTeX label")*x in the formula before it is passed to tabular. nolabel can be used to specify variables for which labels are ignored.

tabulr also replaces trio with table_trio, N with table_N, and freq with table_freq in the formula.

table_trio is a function that takes a numeric vector and computes the three quartiles and optionally the mean and standard deviation, and outputs a LaTeX-formatted character string representing the results. By default, calculated statistics are formatted with 3 digits to the left and 1 digit to the right of the decimal point. Running table_options(left=l, right=r) will use l and r digits instead. Other options that can be given to table_options are prmsd=TRUE to add mean +/- standard deviation to the result, pn=TRUE to add the sample size, bold=TRUE to set the median in bold face, showfreq='all', 'low', 'high' used by the table_freq function, pctdec, specifying the number of places to the right of the decimal point for percentages (default is zero), and npct='both', 'numerator', 'denominator', 'none' used by table_formatpct to control what appears after the percent. Option pnformat may be specified to control the formatting for pn. The default is "(n=..)". Specify pnformat="non" to suppress "n=". pnwhen specifies when to print the number of observations. The default is "always". Specify pnwhen="ifna" to include n only if there are missing values in the vector being processed.

tabulr substitutes table_N for N in the formula. This is used to create column headings for the number of observations, without a row label.

table_freq analyzes a character variable to compute, for a single output cell, the percents, numerator, and denominator for each category, or optionally just the maximum or minimum, as specified by table_options(showfreq).

table_formatpct is a function that formats percents depending on settings of options in table_options.

nFm is a function that calls sprintf to format numeric values to have a specific number of digits to the left and to the right of the point.
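The idea behind nFm, fixed-width formatting through sprintf, can be sketched in base R. The helper fixed_fmt below is hypothetical and covers only the left and right arguments; it is not the nFm source and ignores neg, pad, and html.

```r
# Hypothetical stand-in for the core idea of nFm (not its actual source):
# build a sprintf format with `left` places before and `right` places
# after the decimal point.
fixed_fmt <- function(x, left, right) {
  width <- left + right + 1                      # +1 for the decimal point
  sprintf(paste0("%", width, ".", right, "f"), x)
}

out <- fixed_fmt(c(3.14159, 120.06), left = 3, right = 1)
print(out)
```

Because every result has the same width, values line up on the decimal point when placed in a monospaced or right-justified table column.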

table_latexdefs writes (by default) to the console a set of LaTeX definitions that can be invoked at any point thereafter in a knitr or sweave document by naming the macro, preceded by a single backslash. The blfootnote macro is called with a single LaTeX argument which will appear as a footnote without a number. keytrio invokes blfootnote to define the output of table_trio if mean and SD are not included. If mean and SD are included, use keytriomsd.

Usage

tabulr(formula, data = NULL, nolabel=NULL, nofill=NULL, ...)
table_trio(x)
table_freq(x)
table_formatpct(num, den)
nFm(x, left, right, neg=FALSE, pad=FALSE, html=FALSE)
table_latexdefs(file='')

Arguments

formula

a formula suitable for tabular except for the addition of .(variable name), .n(), trio.

data

a data frame or list. If omitted, the parent environment is assumed to contain the variables.

nolabel

a formula such as ~ x1 + x2 containing the list of variables for which labels are to be ignored, forcing use of the variable name

nofill

a formula such as ~ x1 + x2 containing the list of variables for which units of measurement are not to be right-justified in the field using the LaTeX hfill directive

...

other arguments to tabular

x

a numeric vector

num

a single numerator or vector of numerators

den

a single denominator

left,right

number of places to the left and right of the decimal point, respectively

neg

set to TRUE if negative x values are allowed, to add one more space to the left of the decimal place

pad

set to TRUE to replace blanks with the LaTeX tilde placeholder

html

set to TRUE to make pad use an HTML space character instead of a LaTeX tilde space

file

location of output of table_latexdefs

Value

tabulr returns an object of class "tabular"

Author(s)

Frank Harrell

See Also

tabular, label, latex, summaryM

Examples

## Not run: 
n <- 400
set.seed(1)
d <- data.frame(country=factor(sample(c('US','Canada','Mexico'), n, TRUE)),
                sex=factor(sample(c('Female','Male'), n, TRUE)),
                age=rnorm(n, 50, 10),
                sbp=rnorm(n, 120, 8))
d <- upData(d,
            preghx=ifelse(sex=='Female', sample(c('No','Yes'), n, TRUE), NA),
            labels=c(sbp='Systolic BP', age='Age', preghx='Pregnancy History'),
            units=c(sbp='mmHg', age='years'))
contents(d)
require(tables)
invisible(booktabs())  # use booktabs LaTeX style for tabular
g <- function(x) {
  x <- x[!is.na(x)]
  if(length(x) == 0) return('')
  # backslashes must be doubled inside R character strings
  paste(latexNumeric(nFm(mean(x), 3, 1)),
        ' \\hfill{\\smaller[2](', length(x), ')}', sep='')
}
tab <- tabulr((age + Heading('Females')*(sex == 'Female')*sbp)*
              Heading()*g + (age + sbp)*Heading()*trio ~
              Heading()*country*Heading()*sex, data=d)
# Formula after interpretation by tabulr:
# (Heading('Age\hfill {\smaller[2] years}') * age + Heading("Females")
# * (sex == "Female") * Heading('Systolic BP {\smaller[2] mmHg}') * sbp)
# * Heading() * g + (age + sbp) * Heading() * table_trio ~ Heading()
# * country * Heading() * sex
cat('\\begin{landscape}\n')
cat('\\begin{minipage}{\\textwidth}\n')
cat('\\keytrio\n')
latex(tab)
cat('\\end{minipage}\\end{landscape}\n')
getHdata(pbc)
pbc <- upData(pbc, moveUnits=TRUE)
# Convert to character to prevent tabular from stratifying
for(x in c('sex', 'stage', 'spiders')) {
  pbc[[x]] <- as.character(pbc[[x]])
  label(pbc[[x]]) <- paste(toupper(substring(x, 1, 1)), substring(x, 2), sep='')
}
table_options(pn=TRUE, showfreq='all')
tab <- tabulr((bili + albumin + protime + age) *
              Heading()*trio +
              (sex + stage + spiders)*Heading()*freq ~ drug, data=pbc)
latex(tab)
## End(Not run)

testCharDateTime

Description

Test Character Variables for Dates and Times

Usage

testCharDateTime(x, p = 0.5, m = 0, convert = FALSE, existing = FALSE)

Arguments

x

input vector of any type, but interesting cases are for character x

p

minimum proportion of non-missing non-blank values of x for which the format is one of the formats described before considering x to be of that type

m

if greater than 0, a test is applied: the number of distinct illegal values of x (values containing a letter or underscore) must not exceed m, or type character will be returned. p is set to 1.0 when m > 0.

convert

set to TRUE to convert the variable under the dominant format. If all values are NA, type will be set to 'character'.

existing

set to TRUE to return a character string with the current type of variable without examining pattern matches

Details

For a vector x, if it is already a date-time, date, or time variable, the type is returned if convert=FALSE, or a list with that type, the original vector, and numna=0 is returned. Otherwise if x is not a character vector, a type of notcharacter is returned, or a list that includes the original x and type='notcharacter'. When x is character, the main logic is applied. The default logic (when m=0) is to consider x a date-time variable when its format is YYYY-MM-DD HH:MM:SS (:SS is optional) in more than 1/2 of the non-missing observations. It is considered to be a date if its format is YYYY-MM-DD or MM/DD/YYYY or DD-MMM-YYYY in more than 1/2 of the non-missing observations (MMM=3-letter month). A time variable has the format HH:MM:SS or HH:MM. Blank values of x (after trimming) are set to NA before proceeding.
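The default (m=0) rule described above can be approximated with regular expressions in base R. The function guess_type and its patterns are illustrative assumptions only; the actual matching rules inside testCharDateTime may differ, and no conversion is attempted here.

```r
# Illustrative sketch of the "more than p of non-missing values" rule.
# guess_type and its regular expressions are assumptions, not Hmisc code.
guess_type <- function(x, p = 0.5) {
  x <- trimws(x)
  x[!is.na(x) & x == ""] <- NA          # blanks count as missing
  x <- x[!is.na(x)]
  if (length(x) == 0) return("character")
  pats <- c(
    datetime = "^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}(:[0-9]{2})?$",
    date     = paste0("^([0-9]{4}-[0-9]{2}-[0-9]{2}|",
                      "[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}|",
                      "[0-9]{1,2}-[A-Za-z]{3}-[0-9]{4})$"),
    time     = "^[0-9]{2}:[0-9]{2}(:[0-9]{2})?$")
  # Proportion of non-missing values matching each candidate format
  prop <- sapply(pats, function(pat) mean(grepl(pat, x)))
  best <- which.max(prop)
  if (prop[best] > p) names(prop)[best] else "character"
}

r1 <- guess_type(c("2023-03-11", "2023-04-11", "a"))            # 2/3 are dates
r2 <- guess_type(c("2023-03-11 11:12", "2023-04-11 11:13:14", "a"))
r3 <- guess_type(c("2023-03-11", "2023-04-11", "a", "b", "c"))  # only 2/5
print(c(r1, r2, r3))
```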

Value

if convert=FALSE, a single character string with the type of x: "character", "datetime", "date", "time". If convert=TRUE, a list with components named type, x (converted to POSIXct, Date, or chron times format), and numna, the number of originally non-NA values of x that could not be converted to the predominant format. If there were any non-convertible dates/times, the returned vector is given an additional class special.miss and an attribute special.miss which is a list with original character values (codes) and observation numbers (obs). These are summarized by describe().

Author(s)

Frank Harrell

Examples

for(conv in c(FALSE, TRUE)) {
  print(testCharDateTime(c('2023-03-11', '2023-04-11', 'a', 'b', 'c'), convert=conv))
  print(testCharDateTime(c('2023-03-11', '2023-04-11', 'a', 'b'), convert=conv))
  print(testCharDateTime(c('2023-03-11 11:12:13', '2023-04-11 11:13:14', 'a', 'b'), convert=conv))
  print(testCharDateTime(c('2023-03-11 11:12', '2023-04-11 11:13', 'a', 'b'), convert=conv))
  print(testCharDateTime(c('3/11/2023', '4/11/2023', 'a', 'b'), convert=conv))
}
x <- c(paste0('2023-03-0', 1:9), 'a', 'a', 'a', 'b')
y <- testCharDateTime(x, convert=TRUE)$x
describe(y)  # note counts of special missing values a, b

function for use in graphs that are used with the psfrag package in LaTeX

Description

tex is a little function to save typing when including TeX commands in graphs that are used with the psfrag package in LaTeX to typeset any LaTeX text inside a postscript graphic. tex surrounds the input character string with '\tex[options]{}'. This is especially useful for getting Greek letters and math symbols in postscript graphs. By default tex returns a string with psfrag commands specifying that the string be centered, not rotated, and not specially enlarged or shrunk.
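To make the wrapping concrete, here is a base-R sketch of the kind of string such a function could build. The name tex_sketch and the exact order of the bracketed options (LaTeX reference point, PostScript reference point, scale, rotation) are assumptions based on the argument list; consult the psfrag documentation and the tex source for the authoritative form.

```r
# Sketch under stated assumptions: wraps a string in a psfrag-style
# \tex[...]{...} command. Not the Hmisc implementation.
tex_sketch <- function(string, lref = 'c', psref = 'c', scale = 1, srt = 0) {
  paste0('\\tex[', lref, '][', psref, '][',
         format(scale), '][', format(srt), ']{', string, '}')
}

lab <- tex_sketch('$x$')
cat(lab, '\n')
```

Note the doubled backslash in the R source: a single backslash inside an R string starts an escape sequence, so a literal backslash must be written as two.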

Usage

tex(string, lref='c', psref='c', scale=1, srt=0)

Arguments

string

a character string to be processed by psfrag in LaTeX.

lref

LaTeX reference point for string. See the psfrag documentation referenced below. Default is "c" for centered (this is also the default for psref).

psref

PostScript reference point.

scale

scale factor, default is 1

srt

rotation for string in degrees (default is zero)

Value

tex returns a modified character string.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

References

Grant MC, Carlisle (1998): The PSfrag System, Version 3. Full documentation is obtained by searching www.ctan.org for 'pfgguide.ps'.

See Also

postscript, par, ps.options, mgp.axis.labels, pdf, trellis.device, setTrellis

Examples

## Not run: 
pdf('test.pdf')
x <- seq(0,15,length=100)
plot(x, dchisq(x, 5), xlab=tex('$x$'),
        ylab=tex('$f(x)$'), type='l')
# the backslash is doubled so that R passes a literal \chi to LaTeX
title(tex('Density Function of the $\\chi_{5}^{2}$ Distribution'))
dev.off()
# To process this file in LaTeX do something like
#\documentclass{article}
#\usepackage[scanall]{psfrag}
#\begin{document}
#\begin{figure}
#\includegraphics{test.ps}
#\caption{This is an example}
#\end{figure}
#\end{document}
## End(Not run)

Additive Regression and Transformations using ace or avas

Description

transace is ace packaged for easily automatically transforming all variables in a formula without a left-hand side. transace is a fast one-iteration version of transcan without imputation of NAs. The ggplot method makes nice transformation plots using ggplot2. Binary variables are automatically kept linear, and character or factor variables are automatically treated as categorical.

areg.boot uses areg or avas to fit additive regression models allowing all variables in the model (including the left-hand-side) to be transformed, with transformations chosen so as to optimize certain criteria. The default method uses areg, whose goal is to maximize R^2. method="avas" explicitly tries to transform the response variable so as to stabilize the variance of the residuals. All-variables-transformed models tend to inflate R^2 and it can be difficult to get confidence limits for each transformation. areg.boot solves both of these problems using the bootstrap. As with the validate function in the rms library, the Efron bootstrap is used to estimate the optimism in the apparent R^2, and this optimism is subtracted from the apparent R^2 to obtain a bias-corrected R^2. This is done however on the transformed response variable scale.
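The optimism correction can be illustrated with an ordinary linear model in base R. This is a deliberately simplified sketch: it uses lm in place of areg, works on simulated data, and the helper rsq is written for this note. It shows the bookkeeping of the Efron optimism bootstrap, not what areg.boot does internally.

```r
# Sketch of the Efron optimism bootstrap for R^2, using lm as a
# stand-in for areg (an assumption made to keep the example in base R).
set.seed(2)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- d$x1 + rnorm(n)

# R^2 of a fitted model evaluated on an arbitrary data set
rsq <- function(fit, data) {
  pred <- predict(fit, data)
  1 - sum((data$y - pred)^2) / sum((data$y - mean(data$y))^2)
}

full   <- lm(y ~ x1 + x2, data = d)
r2_app <- rsq(full, d)                      # apparent R^2

B <- 200
optimism <- replicate(B, {
  b  <- d[sample(n, replace = TRUE), ]      # bootstrap sample
  fb <- lm(y ~ x1 + x2, data = b)           # refit on the bootstrap sample
  rsq(fb, b) - rsq(fb, d)                   # training minus test performance
})

r2_corrected <- r2_app - mean(optimism)     # bias-corrected R^2
print(c(apparent = r2_app, optimism = mean(optimism), corrected = r2_corrected))
```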

Tests with 3 predictors show that the avas and ace estimates are unstable unless the sample size exceeds 350. Apparent R^2 with low sample sizes can be very inflated, and bootstrap estimates of R^2 can be even more unstable in such cases, resulting in optimism-corrected R^2 that are much lower even than the actual R^2. The situation can be improved a little by restricting predictor transformations to be monotonic. On the other hand, the areg approach allows one to control overfitting by specifying the number of knots to use for each continuous variable in a restricted cubic spline function.

For method="avas" the response transformation is restricted to be monotonic. You can specify restrictions for transformations of predictors (and linearity for the response). When the first argument is a formula, the function automatically determines which variables are categorical (i.e., factor, category, or character vectors). Specify linear transformations by enclosing variables by the identity function (I()), and specify monotonicity by using monotone(variable). Monotonicity restrictions are not allowed with method="areg".

The summary method for areg.boot computes bootstrap estimates of standard errors of differences in predicted responses (usually on the original scale) for selected levels of each predictor against the lowest level of the predictor. The smearing estimator (see below) can be used here to estimate differences in predicted means, medians, or many other statistics. By default, quartiles are used for continuous predictors and all levels are used for categorical ones. See Details below. There is also a plot method for plotting transformation estimates, transformations for individual bootstrap re-samples, and pointwise confidence limits for transformations. Unless you already have a par(mfrow=) in effect with more than one row or column, plot will try to fit the plots on one page. A predict method computes predicted values on the original or transformed response scale, or a matrix of transformed predictors. There is a Function method for producing a list of R functions that perform the final fitted transformations. There is also a print method for areg.boot objects.

When estimated means (or medians or other statistical parameters) are requested for models fitted with areg.boot (by summary.areg.boot or predict.areg.boot), the "smearing" estimator of Duan (1983) is used. Here we estimate the mean of the untransformed response by computing the arithmetic mean of ginverse(lp + residuals), where ginverse is the inverse of the nonparametric transformation of the response (obtained by reverse linear interpolation), lp is the linear predictor for an individual observation on the transformed scale, and residuals is the entire vector of residuals estimated from the fitted model, on the transformed scales (n residuals for n original observations). The smearingEst function computes the general smearing estimate. For efficiency smearingEst recognizes that quantiles are transformation-preserving, i.e., when one wishes to estimate a quantile of the untransformed distribution one just needs to compute the inverse transformation of the transformed estimate after the chosen quantile of the vector of residuals is added to it. When the median is desired, the estimate is ginverse(lp + median(residuals)). See the last example for how smearingEst can be used outside of areg.boot.
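A small base-R sketch may help fix ideas. It assumes a known log transformation (so ginverse is exp) fitted with lm, rather than a nonparametric transformation from areg.boot, and the helper smear is hypothetical; smearingEst itself is not called.

```r
# Smearing estimator sketch for a log-normal regression, base R only.
# The log/exp transformation pair is an assumption for illustration.
set.seed(1)
n <- 500
x <- rnorm(n)
y <- exp(x + rnorm(n))

fit <- lm(log(y) ~ x)
res <- residuals(fit)          # residuals on the transformed scale
ginverse <- exp                # inverse of the response transformation

smear <- function(lp, res, ginverse, statistic = c("mean", "median")) {
  statistic <- match.arg(statistic)
  if (statistic == "mean") {
    # average the back-transformed values over all n residuals
    sapply(lp, function(l) mean(ginverse(l + res)))
  } else {
    # quantiles are transformation-preserving: add the median residual,
    # then back-transform once
    ginverse(lp + median(res))
  }
}

lp <- predict(fit, data.frame(x = c(-1, 0, 1)))   # linear predictors
est_mean   <- smear(lp, res, ginverse, "mean")
est_median <- smear(lp, res, ginverse, "median")
print(rbind(mean = est_mean, median = est_median))
```

For a right-skewed response such as this one, the estimated means exceed the estimated medians, which is why simply back-transforming the linear predictor understates the mean.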

Mean is a generic function that returns an R function to compute the estimate of the mean of a variable. Its input is typically some kind of model fit object. Likewise, Quantile is a generic quantile function-producing function. Mean.areg.boot and Quantile.areg.boot create functions of a vector of linear predictors that transform them into the smearing estimates of the mean or quantile of the response variable, respectively. Quantile.areg.boot produces exactly the same value as predict.areg.boot or smearingEst. Mean approximates the mapping of linear predictors to means over an evenly spaced grid of by default 200 points. Linear interpolation is used between these points. This approximate method is much faster than the full smearing estimator once Mean creates the function. These functions are especially useful in nomogram (see the example on hypothetical data).
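The grid-plus-interpolation idea can be sketched with approxfun from base R. Everything here is a stand-in: the residuals are simulated, ginverse is assumed to be exp, and the grid range is chosen arbitrarily; Mean.areg.boot is not called.

```r
# Sketch: approximate the smearing mean by linear interpolation over an
# evenly spaced grid of linear predictor values (assumed setup, base R).
set.seed(3)
res <- rnorm(300)              # stand-in for transformed-scale residuals
ginverse <- exp                # assumed inverse response transformation

# Exact smearing mean: one pass over all residuals per lp value
exact_mean <- function(lp) sapply(lp, function(l) mean(ginverse(l + res)))

# Evaluate exactly on 200 grid points once, then interpolate
grid <- seq(-3, 3, length.out = 200)
approx_mean <- approxfun(grid, exact_mean(grid))

lp_new <- c(-1.234, 0.5, 2.2)
cmp <- cbind(exact = exact_mean(lp_new), approx = approx_mean(lp_new))
print(cmp)
```

Once the interpolating function exists, each new prediction costs a table look-up instead of a pass over every residual, at the price of a small interpolation error between grid points.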

Usage

transace(formula, trim=0.01, data=environment(formula))
## S3 method for class 'transace'
print(x, ...)
## S3 method for class 'transace'
ggplot(data, mapping, ..., environment, nrow=NULL)
areg.boot(x, data, weights, subset, na.action=na.delete,
          B=100, method=c("areg","avas"), nk=4, evaluation=100, valrsq=TRUE,
          probs=c(.25,.5,.75), tolerance=NULL)
## S3 method for class 'areg.boot'
print(x, ...)
## S3 method for class 'areg.boot'
plot(x, ylim, boot=TRUE, col.boot=2, lwd.boot=.15,
     conf.int=.95, ...)
smearingEst(transEst, inverseTrans, res,
            statistic=c('median','quantile','mean','fitted','lp'),
            q)
## S3 method for class 'areg.boot'
summary(object, conf.int=.95, values, adj.to,
        statistic='median', q, ...)
## S3 method for class 'summary.areg.boot'
print(x, ...)
## S3 method for class 'areg.boot'
predict(object, newdata,
        statistic=c("lp", "median",
                    "quantile", "mean", "fitted", "terms"),
        q=NULL, ...)
## S3 method for class 'areg.boot'
Function(object, type=c('list','individual'),
         ytype=c('transformed','inverse'),
         prefix='.', suffix='', pos=-1, ...)
Mean(object, ...)
Quantile(object, ...)
## S3 method for class 'areg.boot'
Mean(object, evaluation=200, ...)
## S3 method for class 'areg.boot'
Quantile(object, q=.5, ...)

Arguments

formula

a formula without a left-hand-side variable. Variables may be enclosed in monotone(), linear(), categorical() to make certain assumptions about transformations. categorical and linear need not be specified if they can be surmised from the variable values.

x

for areg.boot x is a formula. For print or plot, an object created by areg.boot or transace. For print.summary.areg.boot, an object created by summary.areg.boot. For ggplot, the result of transace.

object

an object created by areg.boot, or a model fit object suitable for Mean or Quantile.

transEst

a vector of transformed values. In log-normal regression these could be predicted log(Y) for example.

inverseTrans

a function specifying the inverse transformation needed to change transEst to the original untransformed scale. inverseTrans may also be a 2-element list defining a mapping from the transformed values to untransformed values. Linear interpolation is used in this case to obtain untransformed values.

trim

quantile to which to trim original and transformed values for continuous variables for purposes of plotting the transformations with ggplot.transace

nrow

the number of rows to graph for transace transformations, with the default chosen by ggplot2

data

data frame to use if x is a formula and variables are not already in the search list. For ggplot, a transace object.

environment,mapping

ignored

weights

a numeric vector of observation weights. By default, all observations are weighted equally.

subset

an expression to subset data if x is a formula

na.action

a function specifying how to handle NAs. Default is na.delete.

B

number of bootstrap samples (default=100)

method

"areg" (the default) or "avas"

nk

number of knots for continuous variables not restricted to be linear. Default is 4. One or two is not allowed. nk=0 forces linearity for all continuous variables.

evaluation

number of equally-spaced points at which to evaluate (and save) the nonparametric transformations derived by avas or ace. Default is 100. For Mean.areg.boot, evaluation is the number of points at which to evaluate exact smearing estimates, to approximate them using linear interpolation (default is 200).

valrsq

set to FALSE to more quickly do bootstrapping without validating R^2

probs

vector of probabilities denoting the quantiles of continuous predictors to use in estimating effects of those predictors

tolerance

singularity criterion; list source code for the lm.fit.qr.bare function.

res

a vector of residuals from the transformed model. Not required when statistic="lp" or statistic="fitted".

statistic

statistic to estimate with the smearing estimator. For smearingEst, the default results in computation of the sample median of the model residuals, then smearingEst adds the median residual and back-transforms to get estimated median responses on the original scale. statistic="lp" causes predicted transformed responses to be computed. For smearingEst, the result (for statistic="lp") is the input argument transEst. statistic="fitted" gives predicted untransformed responses, i.e., ginverse(lp), where ginverse is the inverse of the estimated response transformation, estimated by reverse linear interpolation on the tabulated nonparametric response transformation or by using an explicit analytic function. statistic="quantile" generalizes "median" to any single quantile q which must be specified. "mean" causes the population mean response to be estimated. For predict.areg.boot, statistic="terms" returns a matrix of transformed predictors. statistic can also be any R function that computes a single value on a vector of values, such as statistic=var. Note that in this case the function name is not quoted.

q

a single quantile of the original response scale to estimate, when statistic="quantile", or for Quantile.areg.boot.

ylim

2-vector of y-axis limits

boot

set to FALSE to not plot any bootstrapped transformations. Set it to an integer k to plot the first k bootstrap estimates.

col.boot

color for bootstrapped transformations

lwd.boot

line width for bootstrapped transformations

conf.int

confidence level (0-1) for pointwise bootstrap confidence limits and for estimated effects of predictors in summary.areg.boot. The latter assumes normality of the estimated effects.

values

a list of vectors of settings of the predictors, for predictors for which you want to override settings determined from probs. The list must have named components, with names corresponding to the predictors. Example: values=list(x1=c(2,4,6,8), x2=c(-1,0,1)) specifies that summary is to estimate the effect on y of changing x1 from 2 to 4, 2 to 6, 2 to 8, and separately, of changing x2 from -1 to 0 and -1 to 1.

adj.to

a named vector of adjustment constants, for setting all other predictors when examining the effect of a single predictor in summary. The more nonlinear the transformation of y is, the more the adjustment settings will matter. Default values are the medians of the values defined by values or probs. You only need to name the predictors for which you are overriding the default settings. Example: adj.to=c(x2=0,x5=10) will set x2 to 0 and x5 to 10 when assessing the impact of variation in the other predictors.

newdata

a data frame or list containing the same number of values of all of the predictors used in the fit. For factor predictors the 'levels' attribute does not need to be in the same order as those used in the original fit, and not all levels need to be represented. If newdata is omitted, you can still obtain linear predictors (on the transformed response scale) and fitted values (on the original response scale), but not "terms".

type

specifies how Function is to return the series of functions that define the transformations of all variables. By default a list is created, with the names of the list elements being the names of the variables. Specify type="individual" to have separate functions created in the current environment (pos=-1, the default) or in the location defined by pos if it is specified. For the latter method, the names of the objects created are the names of the corresponding variables, prefixed by prefix and with suffix appended to the end. If any of pos, prefix, or suffix is specified, type is automatically set to "individual".

ytype

By default the first function created by Function is the y-transformation. Specify ytype="inverse" to instead create the inverse of the transformation, to be able to obtain originally scaled y-values.

prefix

character string defining the prefix for function names created when type="individual". By default, the function specifying the transformation for variable x will be named .x.

suffix

character string defining the suffix for the function names

pos

See assign.

...

arguments passed to other functions. Ignored for print.transace and ggplot.transace.

Details

As transace only does one iteration over the predictors, it may not find optimal transformations and it will be dependent on the order of the predictors in x.

ace and avas standardize transformed variables to have mean zero and variance one for each bootstrap sample, so if a predictor is not important it will still consistently have a positive regression coefficient. Therefore using the bootstrap to estimate standard errors of the additive least squares regression coefficients would not help in drawing inferences about the importance of the predictors. To do this, summary.areg.boot computes estimates of, e.g., the inter-quartile range effects of predictors in predicting the response variable (after untransforming it). As an example, at each bootstrap repetition the estimated transformed value of one of the predictors is computed at the lower quartile, median, and upper quartile of the raw value of the predictor. These transformed x values are then multiplied by the least squares estimate of the partial regression coefficient for that transformed predictor in predicting transformed y. Then these weighted transformed x values have the weighted transformed x value corresponding to the lower quartile subtracted from them, to estimate an x effect accounting for nonlinearity. The last difference computed is then the standardized effect of raising x from its lowest to its highest quartile. Before computing differences, predicted values are back-transformed to be on the original y scale in a way depending on statistic and q. The sample standard deviation of these effects (differences) is taken over the bootstrap samples, and this is used to compute approximate confidence intervals for effects and approximate P-values, both assuming normality.

predict does not re-insert NAs corresponding to observations that were dropped before the fit, when newdata is omitted.

statistic="fitted" estimates the same quantity as statistic="median" if the residuals on the transformed response have a symmetric distribution. The two provide identical estimates when the sample median of the residuals is exactly zero. The sample mean of the residuals is constrained to be exactly zero although this does not simplify anything.

Value

transace returns a list of class transace containing these elements: n (number of non-missing observations used), transformed (a matrix containing transformed values), rsq (vector of R^2 with which each variable can be predicted from the others), omitted (row numbers of data that were deleted due to NAs), trantab (compact transformation lookups), levels (original levels of character and factor variables if the input was a data frame), trim (value of trim passed to transace), limits (the limits for plotting raw and transformed variables, computed from trim), and type (a vector of transformation types used for the variables).

areg.boot returns a list of class 'areg.boot' containing many elements, including (if valrsq is TRUE) rsquare.app and rsquare.val. summary.areg.boot returns a list of class 'summary.areg.boot' containing a matrix of results for each predictor and a vector of adjust-to settings. It also contains the call and a 'label' for the statistic that was computed. A print method for these objects handles the printing. predict.areg.boot returns a vector unless statistic="terms", in which case it returns a matrix. Function.areg.boot returns by default a list of functions whose argument is one of the variables (on the original scale) and whose returned values are the corresponding transformed values. The names of the list of functions correspond to the names of the original variables. When type="individual", Function.areg.boot invisibly returns the vector of names of the created function objects. Mean.areg.boot and Quantile.areg.boot also return functions.

smearingEst returns a vector of estimates of distribution parameters of class 'labelled' so that print.labelled will print a label documenting the estimate that was used (see label). This label can be retrieved for other purposes by using e.g. label(obj), where obj was the vector returned by smearingEst.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com

References

Harrell FE, Lee KL, Mark DB (1996): Stat in Med 15:361–387.

Duan N (1983): Smearing estimate: A nonparametric retransformation method. JASA 78:605–610.

Wang N, Ruppert D (1995): Nonparametric estimation of the transformation in the transform-both-sides regression model. JASA 90:522–534.

See avas, ace for primary references.

See Also

avas, ace, ols, validate, predab.resample, label, nomogram

Examples

# xtrans <- transace(~ monotone(age) + sex + blood.pressure + categorical(race.code))
# print(xtrans)  # show R^2s and a few other things
# ggplot(xtrans) # show transformations

# Generate random data from the model y = exp(x1 + epsilon/3) where
# x1 and epsilon are Gaussian(0,1)
set.seed(171)  # to be able to reproduce example
x1 <- rnorm(200)
x2 <- runif(200)  # a variable that is really unrelated to y
x3 <- factor(sample(c('cat','dog','cow'), 200,TRUE))  # also unrelated to y
y  <- exp(x1 + rnorm(200)/3)
f  <- areg.boot(y ~ x1 + x2 + x3, B=40)
f
plot(f)
# Note that the fitted transformation of y is very nearly log(y)
# (the appropriate one), the transformation of x1 is nearly linear,
# and the transformations of x2 and x3 are essentially flat
# (specifying monotone(x2) if method='avas' would have resulted
# in a smaller confidence band for x2)
summary(f)
# use summary(f, values=list(x2=c(.2,.5,.8))) for example if you
# want to use nice round values for judging effects

# Plot Y hat vs. Y (this doesn't work if there were NAs)
plot(fitted(f), y)  # or: plot(predict(f,statistic='fitted'), y)

# Show fit of model by varying x1 on the x-axis and creating separate
# panels for x2 and x3.  For x2 using only a few discrete values
newdat <- expand.grid(x1=seq(-2,2,length=100),x2=c(.25,.75),
                      x3=c('cat','dog','cow'))
yhat <- predict(f, newdat, statistic='fitted')
# statistic='mean' to get estimated mean rather than simple inverse trans.
xYplot(yhat ~ x1 | x2, groups=x3, type='l', data=newdat)

## Not run: 
# Another example, on hypothetical data
f <- areg.boot(response ~ I(age) + monotone(blood.pressure) + race)
# use I(response) to not transform the response variable
plot(f, conf.int=.9)
# Check distribution of residuals
plot(fitted(f), resid(f))
qqnorm(resid(f))
# Refit this model using ols so that we can draw a nomogram of it.
# The nomogram will show the linear predictor, median, mean.
# The last two are smearing estimators.
Function(f, type='individual')  # create transformation functions
f.ols <- ols(.response(response) ~ age +
             .blood.pressure(blood.pressure) + .race(race))
# Note: This model is almost exactly the same as f but there
# will be very small differences due to interpolation of
# transformations
meanr <- Mean(f)      # create function of lp computing mean response
medr  <- Quantile(f)  # default quantile is .5
nomogram(f.ols, fun=list(Mean=meanr,Median=medr))

# Create S functions that will do the transformations
# This is a table look-up with linear interpolation
g <- Function(f)
plot(blood.pressure, g$blood.pressure(blood.pressure))
# produces the central curve in the last plot done by plot(f)

## End(Not run)

# Another simulated example, where y has a log-normal distribution
# with mean x and variance 1.  Untransformed y thus has median
# exp(x) and mean exp(x + .5sigma^2) = exp(x + .5)
# First generate data from the model y = exp(x + epsilon),
# epsilon ~ Gaussian(0, 1)
set.seed(139)
n <- 1000
x <- rnorm(n)
y <- exp(x + rnorm(n))
f <- areg.boot(y ~ x, B=20)
plot(f)       # note log shape for y, linear for x.  Good!
xs <- c(-2, 0, 2)
d <- data.frame(x=xs)
predict(f, d, 'fitted')
predict(f, d, 'median')   # almost same; median residual=-.001
exp(xs)                   # population medians
predict(f, d, 'mean')
exp(xs + .5)              # population means

# Show how smearingEst works
res <- c(-1,0,1)          # define residuals
y <- 1:5
ytrans <- log(y)
ys <- seq(.1,15,length=50)
trans.approx <- list(x=log(ys), y=ys)
options(digits=4)
smearingEst(ytrans, exp, res, 'fitted')          # ignores res
smearingEst(ytrans, trans.approx, res, 'fitted') # ignores res
smearingEst(ytrans, exp, res, 'median')          # median res=0
smearingEst(ytrans, exp, res+.1, 'median')       # median res=.1
smearingEst(ytrans, trans.approx, res, 'median')
smearingEst(ytrans, exp, res, 'mean')
mean(exp(ytrans[2] + res))                       # should equal 2nd # above
smearingEst(ytrans, trans.approx, res, 'mean')
smearingEst(ytrans, trans.approx, res, mean)
# Last argument can be any statistical function operating
# on a vector that returns a single value

Transformations/Imputations using Canonical Variates

Description

transcan is a nonlinear additive transformation and imputation function, and there are several functions for using and operating on its results. transcan automatically transforms continuous and categorical variables to have maximum correlation with the best linear combination of the other variables. There is also an option to use a substitute criterion - maximum correlation with the first principal component of the other variables. Continuous variables are expanded as restricted cubic splines and categorical variables are expanded as contrasts (e.g., dummy variables). By default, the first canonical variate is used to find optimum linear combinations of component columns. This function is similar to ace except that transformations for continuous variables are fitted using restricted cubic splines, monotonicity restrictions are not allowed, and NAs are allowed. When a variable has any NAs, transformed scores for that variable are imputed using least squares multiple regression incorporating optimum transformations, or NAs are optionally set to constants. Shrinkage can be used to safeguard against overfitting when imputing. Optionally, imputed values on the original scale are also computed and returned. For this purpose, recursive partitioning or multinomial logistic models can optionally be used to impute categorical variables, using what is predicted to be the most probable category.

By default, transcan imputes NAs with “best guess” expected values of transformed variables, back transformed to the original scale. Values thus imputed are most like conditional medians assuming the transformations make variables' distributions symmetric (imputed values are similar to conditional modes for categorical variables). By instead specifying n.impute, transcan does approximate multiple imputation from the distribution of each variable conditional on all other variables. This is done by sampling n.impute residuals from the transformed variable, with replacement (a la bootstrapping), or by default, using Rubin's approximate Bayesian bootstrap, where a sample of size n with replacement is selected from the residuals on n non-missing values of the target variable, and then a sample of size m with replacement is chosen from this sample, where m is the number of missing values needing imputation for the current multiple imputation repetition. Neither of these bootstrap procedures assumes normality or even symmetry of residuals. For sometimes-missing categorical variables, optimal scores are computed by adding the “best guess” predicted mean score to random residuals off this score. Then categories having scores closest to these predicted scores are taken as the random multiple imputations (impcat = "rpart" is not currently allowed with n.impute). The literature recommends using n.impute = 5 or greater. transcan provides only an approximation to multiple imputation, especially since it “freezes” the imputation model before drawing the multiple imputations rather than using different estimates of regression coefficients for each imputation. For multiple imputation, the aregImpute function provides a much better approximation to the full Bayesian approach while still not requiring linearity assumptions.
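The two residual-sampling schemes described above can be sketched in base R. This is a minimal illustration, not the code transcan runs; the function names abb_draw and simple_draw and the simulated residuals are assumptions made for the example.

```r
# Illustrative sketch of the two bootstrap schemes (hypothetical helper names)
set.seed(1)
res <- rnorm(50)  # stand-in for residuals on the n non-missing values
m   <- 7          # number of missing values to impute in this repetition

# Approximate Bayesian bootstrap: two-stage sampling with replacement
abb_draw <- function(res, m) {
  n      <- length(res)
  stage1 <- res[sample.int(n, n, replace = TRUE)]  # size-n resample of the residuals
  stage1[sample.int(n, m, replace = TRUE)]         # size-m draw from that resample
}

# Simple bootstrap: one-stage sampling with replacement
simple_draw <- function(res, m) res[sample.int(length(res), m, replace = TRUE)]

draws_abb    <- abb_draw(res, m)
draws_simple <- simple_draw(res, m)
```

Each draw would then be added to the predicted transformed value for a missing observation. The first stage of the two-stage scheme is what adds extra between-imputation variability relative to the simple bootstrap.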

When you specify n.impute to transcan you can use fit.mult.impute to re-fit any model n.impute times based on n.impute completed datasets (if there are any sometimes missing variables not specified to transcan, some observations will still be dropped from these fits). After fitting n.impute models, fit.mult.impute will return the fit object from the last imputation, with coefficients replaced by the average of the n.impute coefficient vectors and with a component var equal to the imputation-corrected variance-covariance matrix using Rubin's rule. fit.mult.impute can also use the object created by the mice function in the mice library to draw the multiple imputations, as well as objects created by aregImpute. The following components of fit objects are also replaced with averages over the n.impute model fits: linear.predictors, fitted.values, stats, means, icoef, scale, center, y.imputed.
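The averaging of coefficients and the imputation-corrected covariance matrix can be sketched in base R using the standard form of Rubin's rule. The helper rubin_combine and the crude imputation used to generate the completed datasets are hypothetical and serve only to produce several fits to combine; this is not the implementation inside fit.mult.impute.

```r
# Illustrative sketch of Rubin's rule (hypothetical helper, not Hmisc code)
rubin_combine <- function(coefs, vcovs) {
  M    <- nrow(coefs)              # number of imputations
  qbar <- colMeans(coefs)          # averaged coefficient vector
  W    <- Reduce(`+`, vcovs) / M   # average within-imputation covariance
  B    <- cov(coefs)               # between-imputation covariance
  list(coefficients = qbar, var = W + (1 + 1 / M) * B)
}

# Toy data: x has 20 missing values filled in by a crude random draw,
# used here only so that there are M completed-data fits to pool
set.seed(2)
n    <- 100
x    <- rnorm(n)
y    <- 1 + 2 * x + rnorm(n)
miss <- sample.int(n, 20)
M    <- 5
fits <- lapply(seq_len(M), function(i) {
  xi <- x
  xi[miss] <- sample(x[-miss], length(miss), replace = TRUE)
  lm(y ~ xi)
})

coefs  <- t(sapply(fits, coef))    # M x p matrix of coefficients
vcovs  <- lapply(fits, vcov)       # list of M covariance matrices
pooled <- rubin_combine(coefs, vcovs)
pooled$coefficients
sqrt(diag(pooled$var))             # imputation-corrected standard errors
```

The pooled variances are never smaller than the average within-imputation variances, which is how uncertainty due to imputation enters the final standard errors.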

By specifying fun to fit.mult.impute you can run any function on the fit objects from completed datasets, with the results saved in an element named funresults. This facilitates running bootstrap or cross-validation separately on each completed dataset and storing all these results in a list for later processing, e.g., with the rms package processMI function. Note that for rms-type validation you will need to specify fitargs=list(x=TRUE,y=TRUE) to fit.mult.impute and to use special names for fun result components, such as validate and calibrate, so that the result can be processed with processMI. When simultaneously running multiple imputation and resampling model validation you may not need values for n.impute or B (number of bootstraps) as high as usual, as the total number of repetitions will be n.impute * B.

fit.mult.impute can incorporate robust sandwich variance estimates into Rubin's rule if robust=TRUE.

For ols models fitted by fit.mult.impute with stacking, the R^2 measure in the stacked model fit is OK, and print.ols computes adjusted R^2 using the real sample size so it is also OK because fit.mult.impute corrects the stacked error degrees of freedom in the stacked fit object to reflect the real sample size.

The summary method for transcan prints the function call, R^2 achieved in transforming each variable, and for each variable the coefficients of all other transformed variables that are used to estimate the transformation of the initial variable. If imputed=TRUE was used in the call to transcan, it also uses the describe function to print a summary of imputed values. If long = TRUE, it also prints all imputed values with observation identifiers. There is also a simple function print.transcan which merely prints the transformation matrix and the function call. It has an optional argument long, which if set to TRUE causes detailed parameters to be printed. Instead of plotting while transcan is running, you can plot the final transformations after the fact using plot.transcan or ggplot.transcan, if the option trantab = TRUE was specified to transcan. If in addition the option imputed = TRUE was specified to transcan, plot and ggplot will show the location of imputed values (including multiples) along the axes. For ggplot, imputed values are shown as red plus signs.

The impute method for transcan does imputations for a selected original data variable, on the original scale (if imputed=TRUE was given to transcan). If you do not specify a variable to impute, it will do imputations for all variables given to transcan which had at least one missing value. This assumes that the original variables are accessible (i.e., they have been attached) and that you want the imputed variables to have the same names as the original variables. If n.impute was specified to transcan you must tell impute which imputation to use. Results are stored in .GlobalEnv when list.out is not specified (it is recommended to use list.out=TRUE).

The predict method for transcan computes predicted variables and imputed values from a matrix of new data. This matrix should have the same column variables as the original matrix used with transcan, and in the same order (unless a formula was used with transcan).

The Function function is a generic function generator. Function.transcan creates R functions to transform variables using transformations created by transcan. These functions are useful for getting predicted values with predictors set to values on the original scale.

The vcov methods are defined here so that imputation-corrected variance-covariance matrices are readily extracted from fit.mult.impute objects, and so that fit.mult.impute can easily compute traditional covariance matrices for individual completed datasets.

The subscript method for transcan preserves attributes.

The invertTabulated function does either inverse linear interpolation or uses sampling to sample qualifying x-values having y-values near the desired values. The latter is used to get inverse values having a reasonable distribution (e.g., no floor or ceiling effects) when the transformation has a flat or nearly flat segment, resulting in a many-to-one transformation in that region. Sampling weights are a combination of the frequency of occurrence of x-values that are within tolInverse times the range of y and the squared distance between the associated y-values and the target y-value (aty).
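Inverse linear interpolation on a tabulated transformation can be sketched with the base R approx function by swapping the roles of the original and transformed values. The toy vectors below are assumptions for illustration, and the sketch covers only the strictly monotone case; it does not attempt the sampling method used for flat segments.

```r
# Illustrative sketch of inverse linear interpolation (toy data)
x   <- c(1, 2, 3, 4, 5)            # original values
y   <- c(0.0, 0.5, 1.5, 1.8, 3.0)  # tabulated transformed values (strictly increasing)
aty <- c(1.0, 2.4)                 # transformed values at which inverses are wanted

# Interpolate x as a function of y; rule = 2 returns the nearest
# extreme x for targets outside the tabulated range of y
inv <- approx(x = y, y = x, xout = aty, rule = 2)$y
inv
```

Here 1.0 lies halfway between the transformed values 0.5 and 1.5, so its inverse is 2.5, and 2.4 lies halfway between 1.8 and 3.0, giving 4.5. When y has a flat segment, many x values map to nearly the same y, which is the situation the sampling option is intended to handle.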

Usage

transcan(x, method=c("canonical","pc"),
         categorical=NULL, asis=NULL, nk, imputed=FALSE, n.impute,
         boot.method=c('approximate bayesian', 'simple'),
         trantab=FALSE, transformed=FALSE,
         impcat=c("score", "multinom", "rpart"),
         mincut=40,
         inverse=c('linearInterp','sample'), tolInverse=.05,
         pr=TRUE, pl=TRUE, allpl=FALSE, show.na=TRUE,
         imputed.actual=c('none','datadensity','hist','qq','ecdf'),
         iter.max=50, eps=.1, curtail=TRUE,
         imp.con=FALSE, shrink=FALSE, init.cat="mode",
         nres=if(boot.method=='simple')200 else 400,
         data, subset, na.action, treeinfo=FALSE,
         rhsImp=c('mean','random'), details.impcat='', ...)

## S3 method for class 'transcan'
summary(object, long=FALSE, digits=6, ...)

## S3 method for class 'transcan'
print(x, long=FALSE, ...)

## S3 method for class 'transcan'
plot(x, ...)

## S3 method for class 'transcan'
ggplot(data, mapping, scale=FALSE, ..., environment)

## S3 method for class 'transcan'
impute(x, var, imputation, name, pos.in, data,
       list.out=FALSE, pr=TRUE, check=TRUE, ...)

fit.mult.impute(formula, fitter, xtrans, data, n.impute, fit.reps=FALSE,
                dtrans, derived, fun, vcovOpts=NULL,
                robust=FALSE, cluster, robmethod=c('huber', 'efron'),
                method=c('ordinary', 'stack', 'only stack'),
                funstack=TRUE, lrt=FALSE,
                pr=TRUE, subset, fitargs)

## S3 method for class 'transcan'
predict(object, newdata, iter.max=50, eps=0.01, curtail=TRUE,
        type=c("transformed","original"),
        inverse, tolInverse, check=FALSE, ...)

Function(object, ...)

## S3 method for class 'transcan'
Function(object, prefix=".", suffix="", pos=-1, ...)

invertTabulated(x, y, freq=rep(1,length(x)),
                aty, name='value',
                inverse=c('linearInterp','sample'),
                tolInverse=0.05, rule=2)

## Default S3 method:
vcov(object, regcoef.only=FALSE, ...)

## S3 method for class 'fit.mult.impute'
vcov(object, regcoef.only=TRUE,
     intercepts='mid', ...)

Arguments

x

a matrix containing continuous variable values and codes for categorical variables. The matrix must have column names (dimnames). If row names are present, they are used in forming the names attribute of imputed values if imputed = TRUE. x may also be a formula, in which case the model matrix is created automatically, using data in the calling frame. Advantages of using a formula are that categorical variables can be determined automatically by a variable being a factor variable, and variables with two unique levels are modeled asis. Variables with 3 unique values are considered to be categorical if a formula is specified. For a formula you may also specify that a variable is to remain untransformed by enclosing its name with the identity function, e.g. I(x3). The user may add other variable names to the asis and categorical vectors. For invertTabulated, x is a vector or a list with three components: the x vector, the corresponding vector of transformed values, and the corresponding vector of frequencies of the pair of original and transformed variables. For print, plot, ggplot, impute, and predict, x is an object created by transcan.

formula

any R model formula

fitter

any R, rms, modeling function (not in quotes) that computes a vector of coefficients and for which vcov will return a variance-covariance matrix. E.g., fitter = lm, glm, ols. At present models involving non-regression parameters (e.g., scale parameters in parametric survival models) are not handled fully.

xtrans

an object created by transcan, aregImpute, or mice

method

use method="canonical" or any abbreviation thereof, to use canonical variates (the default). method="pc" transforms a variable instead so as to maximize the correlation with the first principal component of the other variables. For fit.mult.impute, method specifies whether to use standard multiple imputation (the default method='ordinary') or whether to get final coefficients from stacking all completed datasets and fitting one model. Stacking is required if likelihood ratio tests accounting for imputation are to be done. method='stack' means to do regular MI and stacking, which results in more valid standard errors of coefficient estimates. method='only stack' means that model fits are not done on individual completed datasets, and standard errors will not be very accurate.

categorical

a character vector of names of variables in x which are categorical, for which the ordering of re-scored values is not necessarily preserved. If categorical is omitted, it is assumed that all variables are continuous (or binary). Set categorical="*" to treat all variables as categorical.

asis

a character vector of names of variables that are not to be transformed. For these variables, the guts of lm.fit method="qr" is used to impute missing values. You may want to treat binary variables asis (this is automatic if using a formula). If imputed = TRUE, you may want to use ‘"categorical"’ for binary variables if you want to force imputed values to be one of the original data values. Set asis="*" to treat all variables asis.

nk

number of knots to use in expanding each continuous variable (not listed in asis) in a restricted cubic spline function. Default is 3 (yielding 2 parameters for a variable) if n < 30, 4 if 30 <= n < 100, and 5 if n >= 100 (4 parameters).
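The default rule for nk stated above can be written as a small base R function. The name default_nk is hypothetical and is not part of Hmisc; the sketch only restates the documented thresholds.

```r
# Hypothetical helper restating the documented default for nk
default_nk <- function(n) {
  if (n < 30) 3 else if (n < 100) 4 else 5
}

n_values <- c(20, 30, 99, 100)
knots    <- sapply(n_values, default_nk)  # 3 4 4 5
params   <- knots - 1                     # spline parameters per variable: 2 3 3 4
```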

imputed

Set to TRUE to return a list containing imputed values on the original scale. If the transformation for a variable is non-monotonic, imputed values are not unique. transcan uses the approx function, which returns the highest value of the variable with the transformed score equalling the imputed score. imputed=TRUE also causes original-scale imputed values to be shown as tick marks on the top margin of each graph when show.na=TRUE (for the final iteration only). For categorical predictors, these imputed values are passed through the jitter function so that their frequencies can be visualized. When n.impute is used, each NA will have n.impute tick marks.

n.impute

number of multiple imputations. If omitted, single predicted expected value imputation is used. n.impute=5 is frequently recommended.

boot.method

default is to use the approximate Bayesian bootstrap (sample with replacement from sample with replacement of the vector of residuals). You can also specify boot.method="simple" to use the usual bootstrap one-stage sampling with replacement.

trantab

Set to TRUE to add an attribute trantab to the returned matrix. This contains a vector of lists each with components x and y containing the unique values and corresponding transformed values for the columns of x. This is set up to be used easily with the approx function. You must specify trantab=TRUE if you want to later use the predict.transcan function with type = "original".

transformed

set to TRUE to cause transcan to return an object transformed containing the matrix of transformed variables

impcat

This argument tells how to impute categorical variables on the original scale. The default is impcat="score" to impute the category whose canonical variate score is closest to the predicted score. Use impcat="rpart" to impute categorical variables using the values of all other transformed predictors in conjunction with the rpart function. A better but somewhat slower approach is to use impcat="multinom" to fit a multinomial logistic model to the categorical variable, at the last iteration of the transcan algorithm. This uses the multinom function in the nnet library of the MASS package (which is assumed to have been installed by the user) to fit a polytomous logistic model to the current working transformations of all the other variables (using conditional mean imputation for missing predictors). Multiple imputations are made by drawing multinomial values from the vector of predicted probabilities of category membership for the missing categorical values.
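The default impcat="score" idea, choosing the category whose score is closest to the predicted score, can be sketched in base R. The category scores and predicted scores below are made-up values for illustration and are not output from transcan.

```r
# Illustrative sketch of nearest-score category imputation (made-up values)
scores <- c(a = -1.2, b = 0.1, c = 1.4)  # assumed transformed score for each category
pred   <- c(0.3, -0.9, 2.0)              # predicted scores for three missing values

# For each predicted score, find the category with the closest score
nearest <- vapply(pred, function(p) which.min(abs(scores - p)), integer(1))
imputed <- names(scores)[nearest]
imputed
```

With these values the imputed categories are b, a, and c, since each predicted score is matched to the category whose score lies nearest to it.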

mincut

If imputed=TRUE, there are categorical variables, and impcat = "rpart", mincut specifies the lowest node size that will be allowed to be split. The default is 40.

inverse

By default, imputed values are back-solved on the original scale using inverse linear interpolation on the fitted tabulated transformed values. This will cause distorted distributions of imputed values (e.g., floor and ceiling effects) when the estimated transformation has a flat or nearly flat section. To instead use the invertTabulated function (see above) with the "sample" option, specify inverse="sample".

tolInverse

the multiplier of the range of transformed values, weighted by freq and by the distance measure, for determining the set of x values having y values within a tolerance of the value of aty in invertTabulated. For predict.transcan, inverse and tolInverse are obtained from options that were specified to transcan by default. Otherwise, if not specified by the user, these default to the defaults used by invertTabulated.

pr

For transcan, set to FALSE to suppress printing R^2 and shrinkage factors. For impute.transcan, set to FALSE to suppress messages concerning the number of NA values imputed. For fit.mult.impute, set to FALSE to suppress printing variance inflation factors accounting for imputation, rate of missing information, and degrees of freedom.

pl

Set to FALSE to suppress plotting the final transformations with distribution of scores for imputed values (if show.na=TRUE).

allpl

Set to TRUE to plot transformations for intermediate iterations.

show.na

Set to FALSE to suppress the distribution of scores assigned to missing values (as tick marks on the right margin of each graph). See also imputed.

imputed.actual

The default is ‘"none"’ to suppress plotting of actual vs. imputed values for all variables having any NA values. Other choices are ‘"datadensity"’ to use datadensity to make a single plot, ‘"hist"’ to make a series of back-to-back histograms, ‘"qq"’ to make a series of q-q plots, or ‘"ecdf"’ to make a series of empirical cdfs. For imputed.actual="datadensity" for example you get a rug plot of the non-missing values for the variable with beneath it a rug plot of the imputed values. When imputed.actual is not ‘"none"’, imputed is automatically set to TRUE.

iter.max

maximum number of iterations to perform for transcan or predict. For predict, only one iteration is used if there are no NA values in the data or if imp.con was used.

eps

convergence criterion for transcan and predict. eps is the maximum change in transformed values from one iteration to the next. If for a given iteration all new transformations of variables differ by less than eps (with or without negating the transformation to allow for “flipping”) from the transformations in the previous iteration, one more iteration is done for transcan. During this last iteration, individual transformations are not updated but coefficients of transformations are. This improves stability of coefficients of canonical variates on the right-hand-side. eps is ignored when rhsImp="random".

curtail

for transcan, causes imputed values on the transformed scale to be truncated so that their ranges are within the ranges of non-imputed transformed values. For predict, curtail defaults to TRUE to truncate predicted transformed values to their ranges in the original fit (xt).
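The truncation that curtail describes amounts to clamping values to an observed range. A minimal base R sketch follows; the function name curtail_to_range and the toy values are assumptions for illustration, not Hmisc code.

```r
# Hypothetical helper: clamp predicted values to the range of observed ones
curtail_to_range <- function(pred, observed) {
  r <- range(observed, na.rm = TRUE)
  pmin(pmax(pred, r[1]), r[2])
}

observed  <- c(-1, 0, 1.5)   # non-imputed transformed values
pred      <- c(-3, 0.2, 5)   # imputed values on the transformed scale
curtailed <- curtail_to_range(pred, observed)
curtailed
```

The values below -1 and above 1.5 are pulled back to those limits, giving -1, 0.2, and 1.5.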

imp.con

for transcan, set to TRUE to impute NA values on the original scales with constants (medians or most frequent category codes). Set to a vector of constants to instead always use these constants for imputation. These imputed values are ignored when fitting the current working transformation for a single variable.

shrink

default is FALSE to use ordinary least squares or canonical variate estimates. For the purposes of imputing NAs, you may want to set shrink=TRUE to avoid overfitting when developing a prediction equation to predict each variable from all the others (see details below).

init.cat

method for initializing scorings of categorical variables. Default is ‘"mode"’ to use a dummy variable set to 1 if the value is the most frequent value. Use ‘"random"’ to use a random 0-1 variable. Set to ‘"asis"’ to use the original integer codes as starting scores.

nres

number of residuals to store if n.impute is specified. If the dataset has fewer than nres observations, all residuals are saved. Otherwise a random sample of the residuals of length nres without replacement is saved. The default for nres is higher if boot.method="approximate bayesian".

data

Data frame used to fill the formula. For ggplot, data is the result of transcan with trantab=TRUE.

subset

an integer or logical vector specifying the subset of observations to fit

na.action

These may be used if x is a formula. The default na.action is na.retain (defined by transcan) which keeps all observations with any NA values. For impute.transcan, data is a data frame to use as the source of variables to be imputed, rather than using pos.in. For fit.mult.impute, data is mandatory and is a data frame containing the data to be used in fitting the model but before imputations are applied. Variables omitted from data are assumed to be available from frame 1 and do not need to be imputed.

treeinfo

Set to TRUE to get additional information printed when impcat="rpart", such as the predicted probabilities of category membership.

rhsImp

Set to ‘"random"’ to use random draw imputation when a sometimes missing variable is moved to be a predictor of other sometimes missing variables. Default is rhsImp="mean", which uses conditional mean imputation on the transformed scale. Residuals used are residuals from the transformed scale. When ‘"random"’ is used, transcan runs 5 iterations and ignores eps.

details.impcat

set to a character scalar that is the name of a categorical variable, to include in the resulting transcan object an element details.impcat containing details of how the categorical variable was multiply imputed.

...

arguments passed to scat1d. For ggplot.transcan, these arguments are passed to facet_wrap, e.g. ncol=2.

long

for summary, set to TRUE to print all imputed values. For print, set to TRUE to print details of transformations/imputations.

digits

number of significant digits for printing values by summary

scale

for ggplot.transcan set scale=TRUE to scale transformed values to [0,1] before plotting.

mapping,environment

not used; needed because of rules about generics

var

For impute, is a variable that was originally a column in x, for which imputed values are to be filled in. imputed=TRUE must have been used in transcan. Omit var to impute all variables, creating new variables in position pos (see assign).

imputation

specifies which of the multiple imputations to use for filling in NA values

name

name of variable to impute, for impute function. Default is character string version of the second argument (var) in the call to impute. For invertTabulated, is the name of variable being transformed (used only for warning messages).

pos.in

location as defined by assign to find variables that need to be imputed, when all variables are to be imputed automatically by impute.transcan (i.e., when no input variable name is specified). Default is position that contains the first variable to be imputed.

list.out

If var is not specified, you can set list.out=TRUE to have impute.transcan return a list containing variables with needed values imputed. This list will contain a single imputation. Variables not needing imputation are copied to the list as-is. You can use this list for analysis just like a data frame.

check

set to FALSE to suppress certain warning messages

newdata

a new data matrix for which to compute transformed variables. Categorical variables must use the same integer codes as were used in the call to transcan. If a formula was originally specified to transcan (instead of a data matrix), newdata is optional and if given must be a data frame; a model frame is generated automatically from the previous formula. The na.action is handled automatically, and the levels for factor variables must be the same and in the same order as were used in the original variables specified in the formula given to transcan.

fit.reps

set to TRUE to save all fit objects from the fit for each imputation in fit.mult.impute. Then the object returned will have a component fits which is a list whose i'th element is the i'th fit object.

dtrans

provides an approach to creating derived variables from a single filled-in dataset. The function specified as dtrans can even reshape the imputed dataset. An example of such usage is fitting time-dependent covariates in a Cox model that are created by “start,stop” intervals. Imputations may be done on a one record per subject data frame that is converted by dtrans to multiple records per subject. The imputation can enforce consistency of certain variables across records so that for example a missing value of sex will not be imputed as ‘male’ for one of the subject's records and ‘female’ as another. An example of how dtrans might be specified is dtrans=function(w) {w$age <- w$years + w$months/12; w} where months might have been imputed but years was never missing. An outline for using dtrans to impute missing baseline variables in a longitudinal analysis appears in Details below.
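The dtrans function quoted above can be tried on its own in base R to see what it does to one completed dataset. The data frame completed is a made-up stand-in for a filled-in dataset, and this sketch does not call fit.mult.impute.

```r
# The dtrans function from the text, applied to a toy completed dataset
dtrans <- function(w) {
  w$age <- w$years + w$months / 12  # derive age from (possibly imputed) components
  w
}

completed <- data.frame(years = c(40, 52), months = c(6, 3))  # toy filled-in data
derived   <- dtrans(completed)
derived$age
```

The derived ages are 40.5 and 52.25. The function takes a data frame and returns a data frame, which is the shape of function the description calls for.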

derived

an expression containing R expressions for computing derived variables that are used in the model formula. This is useful when multiple imputations are done for component variables but the actual model uses combinations of these (e.g., ratios or other derivations). For a single derived variable you can specify for example derived=expression(ratio <- weight/height). For multiple derived variables use the form derived=expression({ratio <- weight/height; product <- weight*height}) or put the expression on separate input lines. To monitor the multiply-imputed derived variables you can add to the expression a command such as print(describe(ratio)). See the example below. Note that derived is not yet implemented.

fun

a function of a fit made on one of the completed datasets. Typical uses are bootstrap model validations. The result of fun for imputation i is placed in the ith element of a list that is returned in the fit.mult.impute object element named funresults. See the rms processMI function for help in processing these results for the cases of validate and calibrate.

vcovOpts

a list of named additional arguments to pass to the vcov method for fitter. Useful for orm models for retaining all intercepts (vcovOpts=list(intercepts='all')) instead of just the middle one.

robust

set to TRUE to have fit.mult.impute call the rms package robcov function on each fit on a completed dataset. When cluster is given, robust is forced to TRUE.

cluster

a vector of cluster IDs that is the same length as the number of rows in the dataset being analyzed. When specified, robust is assumed to be TRUE, and the rms robcov function is called with the cluster vector given as its second argument.

robmethod

see the robcov function's method argument

funstack

set to FALSE to not run fun on the stacked dataset (running it there creates an n.impute+1 element of funresults)

lrt

set to TRUE to have method, fun, fitargs set appropriately automatically so that processMI can be used to get likelihood ratio tests. When doing this, fun may not be specified by the user.

fitargs

a list of extra arguments to pass tofitter,used especially withfun. Whenrobust=TRUE the argumentsx=TRUE, y=TRUE are automatically added tofitargs.

type

By default, the matrix of transformed variables is returned, withimputed values on the transformed scale. If you had specifiedtrantab=TRUE totranscan, specifyingtype="original" does the table look-ups with linearinterpolation to return the input matrixx but with imputedvalues on the original scale inserted forNA values. Forcategorical variables, the method used here is to select thecategory code having a corresponding scaled value closest to thepredicted transformed value. This corresponds to the defaultimpcat. Note: imputed valuesthus returned whentype="original" are single expected valueimputations even inn.impute is given.

object

an object created bytranscan, or an object to be converted toR function code, typically a model fit object of some sort

prefix,suffix

When creating separateR functions for each variable inx,the name of the new function will beprefix placed in front ofthe variable name, andsuffix placed in back of the name. Thedefault is to use names of the form ‘⁠.varname⁠’, wherevarname is the variable name.

pos

position as inassign at which to store new functions(forFunction). Default ispos=-1.

y

a vector corresponding tox forinvertTabulated, if itsfirst argumentx is not a list

freq

a vector of frequencies corresponding to cross-classifiedxandy ifx is not a list. Default is a vector of ones.

aty

vector of transformed values at which inverses are desired

rule

seeapprox.transcan assumesrule isalways 2.

regcoef.only

set toTRUE to makevcov.default delete positions inthe covariance matrix for any non-regression coefficients (e.g., logscale parameter frompsm orsurvreg)

intercepts

this is primarily forormobjects. Set to"none" to discard all intercepts from thecovariance matrix, or to"all" or"mid" to keep allelements generated byorm (orm only outputs thecovariance matrix for the intercept corresponding to the median).You can also setintercepts to a vector of subscripts forselecting particular intercepts in a multi-intercept model.

Details

The starting approximation to the transformation for each variable is taken to be the original coding of the variable. The initial approximation for each missing value is taken to be the median of the non-missing values for the variable (for continuous ones) or the most frequent category (for categorical ones). Instead, if imp.con is a vector, its values are used for imputing NA values. When using each variable as a dependent variable, NA values on that variable cause all observations to be temporarily deleted. Once a new working transformation is found for the variable, along with a model to predict that transformation from all the other variables, that latter model is used to impute NA values in the selected dependent variable if imp.con is not specified.

When that variable is used to predict a new dependent variable, the current working imputed values are inserted. Transformations are updated after each variable becomes a dependent variable, so the order of variables in x could conceivably make a difference in the final estimates. For obtaining out-of-sample predictions/transformations, predict uses the same iterative procedure as transcan for imputation, with the same starting values for fill-ins as were used by transcan. It also (by default) uses a conservative approach of curtailing transformed variables to be within the range of the original ones. Even when method = "pc" is specified, canonical variables are used for imputing missing values.

Note that fitted transformations, when evaluated at imputed variable values (on the original scale), will not precisely match the transformed imputed values returned in xt. This is because transcan uses an approximate method based on linear interpolation to back-solve for imputed values on the original scale.
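The mismatch described above can be illustrated with a small base R sketch. It is not transcan's own code: the table of original and transformed values is made up, log2() merely stands in for a fitted transformation, and back_solve is a hypothetical helper. It shows how back-solving by linear interpolation (via approx with rule=2) differs slightly from the exact inverse.

```r
# Toy tabulated transformation: original values and their transformed values.
# log2() is only a stand-in for a fitted transformation.
x_orig  <- c(1, 2, 4, 8, 16)
x_trans <- log2(x_orig)            # 0, 1, 2, 3, 4

# Back-solve by linear interpolation of the table, swapping the roles of
# the two columns; rule = 2 clamps targets outside the tabulated range.
back_solve <- function(target)
  approx(x_trans, x_orig, xout = target, rule = 2)$y

imputed_orig <- back_solve(2.5)    # linear interpolation between 4 and 8
exact_orig   <- 2^2.5              # exact inverse of the stand-in transformation
c(interpolated = imputed_orig, exact = exact_orig)
back_solve(10)                     # outside the table: clamped to the largest x
```

Evaluating the stand-in transformation at the interpolated value does not return 2.5 exactly, which is the kind of discrepancy the paragraph above refers to.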

Shrinkage uses the method of Van Houwelingen and Le Cessie (1990) (similar to Copas, 1983). The shrinkage factor is

\frac{1-\frac{(1-R^2)(n-1)}{n-k-1}}{R^2}

where R^2 is the apparent R^2 for predicting the variable, n is the number of non-missing values, and k is the effective number of degrees of freedom (aside from intercepts). A heuristic estimate is used for k:

k = A - 1 + \frac{\sum_i \max(0, B_i - 1)}{m} + m

where A is the number of d.f. required to represent the variable being predicted, the B_i are the number of columns required to represent all the other variables, and m is the number of all other variables. Division by m is done because the transformations for the other variables are fixed at their current transformations the last time they were being predicted. The + m term comes from the number of coefficients estimated on the right hand side, whether by least squares or canonical variates. If a shrinkage factor is negative, it is set to 0. The shrinkage factor is the ratio of the adjusted R^2 to the ordinary R^2. The adjusted R^2 is

1-\frac{(1-R^2)(n-1)}{n-k-1}

which is also set to zero if it is negative. If shrink=FALSE and the adjusted R^2s are much smaller than the ordinary R^2s, you may want to run transcan with shrink=TRUE.
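As a worked illustration of the formulas above, the following base R sketch computes k, the adjusted R^2, and the shrinkage factor. All input values (R2, n, A, B) are invented for the example, and the variable names are not taken from the transcan source.

```r
# Invented inputs for illustration only
R2 <- 0.40          # apparent R^2 for predicting the variable
n  <- 100           # number of non-missing values
A  <- 1             # d.f. needed to represent the variable being predicted
B  <- c(1, 4, 2)    # columns needed to represent each of the other variables
m  <- length(B)     # number of other variables

# Heuristic effective d.f.: A - 1 + sum(max(0, Bi - 1))/m + m
k <- A - 1 + sum(pmax(0, B - 1)) / m + m

# Adjusted R^2 and shrinkage factor, each floored at zero
R2_adj <- max(0, 1 - (1 - R2) * (n - 1) / (n - k - 1))
shrink <- max(0, R2_adj / R2)

c(k = k, R2.adj = R2_adj, shrinkage = shrink)
```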

Canonical variates are scaled to have variance of 1.0, by multiplying canonical coefficients from cancor by \sqrt{n-1}.
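The effect of this rescaling can be checked with base R's cancor on simulated data. This is only a sketch of the scaling step, with made-up variables, and is not the code transcan runs internally.

```r
set.seed(2)
n <- 200
x <- cbind(a = rnorm(n), b = rnorm(n))
y <- cbind(c = x[, "a"] + rnorm(n), d = rnorm(n))

cc <- cancor(x, y)                       # centers x and y by default

# First canonical variate for x, before and after rescaling
v_raw    <- drop(scale(x, center = cc$xcenter, scale = FALSE) %*% cc$xcoef[, 1])
v_scaled <- v_raw * sqrt(n - 1)

c(raw = var(v_raw), scaled = var(v_scaled))
```

The unscaled variate has sum of squares 1, so its variance is 1/(n-1); multiplying by \sqrt{n-1} brings the variance to 1.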

When specifying a non-rms library fitting function to fit.mult.impute (e.g., lm, glm), running the result of fit.mult.impute through that fit's summary method will not use the imputation-adjusted variances. You may obtain the new variances using fit$var or vcov(fit).

When you specify an rms function to fit.mult.impute (e.g. lrm, ols, cph, psm, bj, Rq, Gls, Glm), automatically computed transformation parameters (e.g., knot locations for rcs) that are estimated for the first imputation are used for all other imputations. This ensures that knot locations will not vary, which would change the meaning of the regression coefficients.

Warning: even though fit.mult.impute takes imputation into account when estimating variances of regression coefficients, it does not take into account the variation that results from estimation of the shapes and regression coefficients of the customized imputation equations. Specifying shrink=TRUE solves a small part of this problem. To fully account for all sources of variation you should consider putting the transcan invocation inside a bootstrap or loop, if execution time allows. Better still, use aregImpute or a package such as mice that uses real Bayesian posterior realizations to multiply impute missing values correctly.

It is strongly recommended that you use the Hmisc naclus function to determine if there is a good basis for imputation. naclus will tell you, for example, if systolic blood pressure is missing whenever diastolic blood pressure is missing. If the only variable that is well correlated with diastolic bp is systolic bp, there is no basis for imputing diastolic bp in this case.

At present, predict does not work with multiple imputation.

When calling fit.mult.impute with glm as the fitter argument, if you need to pass a family argument to glm do it by quoting the family, e.g., family="binomial".

fit.mult.impute will not work with proportional odds models when regression imputation was used (as opposed to predictive mean matching). That's because regression imputation will create values of the response variable that did not exist in the dataset, altering the intercept terms in the model.

You should be able to use a variable in the formula given to fit.mult.impute as a numeric variable in the regression model even though it was a factor variable in the invocation of transcan. Use for example fit.mult.impute(y ~ codes(x), lrm, trans) (thanks to Trevor Thompson, trevor@hp5.eushc.org).

Here is an outline of the steps necessary to impute baseline variables using the dtrans argument, when the analysis to be repeated by fit.mult.impute is a longitudinal analysis (using e.g. Gls).

  1. Create a one row per subject data frame containing baseline variables plus follow-up variables that are assigned to windows. For example, you may have dozens of repeated measurements over years but you capture the measurements at the times measured closest to 1, 2, and 3 years after study entry

  2. Make sure the dataset contains the subject ID

  3. This dataset becomes the one passed to aregImpute as data=. You will be imputing missing baseline variables from follow-up measurements defined at fixed times.

  4. Have another dataset with all the non-missing follow-up values on it, one record per measurement time per subject. This dataset should not have the baseline variables on it, and the follow-up measurements should not be named the same as the baseline variable(s); the subject ID must also appear

  5. Add the dtrans argument to fit.mult.impute to define a function with one argument representing the one record per subject dataset with missing values filled in from the current imputation. This function merges the above 2 datasets; the returned value of this function is the merged data frame.

  6. This merged-on-the-fly dataset is the one handed by fit.mult.impute to your fitting function, so variable names in the formula given to fit.mult.impute must match the names created by the merge
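The merging function of step 5 might be sketched as follows. The data frames, variable names (id, time, sbp_f, age) and the function name merge_followup are all hypothetical; only base R's merge is used, and the completed baseline data frame is faked rather than produced by an actual imputation.

```r
# Hypothetical long-format follow-up data: one row per subject per visit
followup <- data.frame(id    = c(1, 1, 2, 2, 3),
                       time  = c(1, 2, 1, 2, 1),
                       sbp_f = c(120, 125, 140, 138, 131))

# Function with one argument: the completed one-row-per-subject data frame
# for the current imputation. It returns the merged analysis data frame.
merge_followup <- function(completed) merge(completed, followup, by = "id")

# Stand-in for one completed baseline dataset (no NAs left after imputation)
baseline_completed <- data.frame(id = 1:3, age = c(50, 61, 47))
analysis <- merge_followup(baseline_completed)
analysis
```

Under these assumptions the function would be supplied as dtrans=merge_followup in the fit.mult.impute call, and the model formula would refer to the merged names (here age, time, and sbp_f).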

Value

For transcan, a list of class ‘transcan’ with elements

call

(with the function call)

iter

(number of iterations done)

rsq,rsq.adj

containing the R^2s and adjusted R^2s achieved in predicting each variable from all the others

categorical

the values supplied for categorical

asis

the values supplied for asis

coef

the within-variable coefficients used to compute the first canonical variate

xcoef

the (possibly shrunk) across-variables coefficients of the first canonical variate that predicts each variable in turn.

parms

the parameters of the transformation (knots for splines, contrast matrix for categorical variables)

fillin

the initial estimates for missing values (NA if variable never missing)

ranges

the matrix of ranges of the transformed variables (min and max in first and second row)

scale

a vector of scales used to determine convergence for a transformation.

formula

the formula (if x was a formula)

, and optionally a vector of shrinkage factors used for predicting each variable from the others. For asis variables, the scale is the average absolute difference about the median. For other variables it is unity, since canonical variables are standardized. For xcoef, row i has the coefficients to predict transformed variable i, with the column for the coefficient of variable i set to NA. If imputed=TRUE was given, an optional element imputed also appears. This is a list with the vector of imputed values (on the original scale) for each variable containing NAs. Matrices rather than vectors are returned if n.impute is given. If trantab=TRUE, the trantab element also appears, as described above. If n.impute > 0, transcan also returns a list residuals that can be used for future multiple imputation.

impute returns a vector (the same length as var) of class ‘impute’ with NA values imputed.

predict returns a matrix with the same number of columns or variables as were in x.

fit.mult.impute returns a fit object that is a modification of the fit object created by fitting the completed dataset for the final imputation. The var matrix in the fit object has the imputation-corrected variance-covariance matrix. coefficients is the average (over imputations) of the coefficient vectors, variance.inflation.impute is a vector containing the ratios of the diagonals of the between-imputation variance matrix to the diagonals of the average apparent (within-imputation) variance matrix. missingInfo is Rubin's rate of missing information and dfmi is Rubin's degrees of freedom for a t-statistic for testing a single parameter. The last two objects are vectors corresponding to the diagonal of the variance matrix. The class "fit.mult.impute" is prepended to the other classes produced by the fitting function.

When method is not 'ordinary', i.e., stacking is used, fit.mult.impute returns a modified fit object that is computed on all completed datasets combined, with almost all statistics that are functions of the sample size corrected to the real sample size. Elements in the fit such as residuals will have length equal to the real sample size times the number of imputations.

fit.mult.impute stores intercepts attributes in the coefficient matrix and in var for orm fits.
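The quantities described above can be illustrated with a base R sketch of the standard Rubin's rules, applied to lm fits on toy completed datasets. This is an assumption-laden illustration: the "imputation" step is a crude random draw, the total variance uses the textbook form W + (1 + 1/m)B, and fit.mult.impute's exact computations may differ in detail.

```r
set.seed(1)
m <- 5                                   # number of imputations (toy value)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
miss <- 1:10                             # rows of x treated as missing

fits <- lapply(seq_len(m), function(i) {
  xi <- x
  # crude stand-in for an imputation model: draw from the observed values
  xi[miss] <- sample(x[-miss], length(miss), replace = TRUE)
  lm(y ~ xi)
})

betas     <- sapply(fits, coef)                      # p x m matrix of coefficients
coef_mi   <- rowMeans(betas)                         # average over imputations
W         <- Reduce(`+`, lapply(fits, vcov)) / m     # average within-imputation variance
B         <- cov(t(betas))                           # between-imputation variance
V         <- W + (1 + 1 / m) * B                     # Rubin's total variance
inflation <- diag(B) / diag(W)                       # between/within ratio per coefficient

coef_mi
sqrt(diag(V))
inflation
```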

Side Effects

prints, plots, and impute.transcan creates new variables.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

References

Kuhfeld, Warren F: The PRINQUAL Procedure. SAS/STAT User's Guide, Fourth Edition, Volume 2, pp. 1265–1323, 1990.

Van Houwelingen JC, Le Cessie S: Predictive value of statistical models. Statistics in Medicine 9:1303–1325, 1990.

Copas JB: Regression, prediction and shrinkage. JRSS B 45:311–354, 1983.

He X, Shen L: Linear regression after spline transformation. Biometrika 84:474–481, 1997.

Little RJA, Rubin DB: Statistical Analysis with Missing Data. New York: Wiley, 1987.

Rubin DB, Schenker N: Multiple imputation in health-care databases: An overview and some applications. Stat in Med 10:585–598, 1991.

Faris PD, Ghali WA, et al: Multiple imputation versus data enhancement for dealing with missing data in observational health care outcome analyses. J Clin Epidem 55:184–191, 2002.

See Also

aregImpute, impute, naclus, naplot, ace, avas, cancor, prcomp, rcspline.eval, lsfit, approx, datadensity, mice, ggplot, processMI

Examples

## Not run:
x <- cbind(age, disease, blood.pressure, pH)
#cbind will convert factor object `disease' to integer
par(mfrow=c(2,2))
x.trans <- transcan(x, categorical="disease", asis="pH",
                    transformed=TRUE, imputed=TRUE)
summary(x.trans)  #Summary distribution of imputed values, and R-squares
f <- lm(y ~ x.trans$transformed)   #use transformed values in a regression
#Now replace NAs in original variables with imputed values, if not
#using transformations
age            <- impute(x.trans, age)
disease        <- impute(x.trans, disease)
blood.pressure <- impute(x.trans, blood.pressure)
pH             <- impute(x.trans, pH)
#Do impute(x.trans) to impute all variables, storing new variables under
#the old names
summary(pH)       #uses summary.impute to tell about imputations
                  #and summary.default to tell about pH overall
# Get transformed and imputed values on some new data frame xnew
newx.trans     <- predict(x.trans, xnew)
w              <- predict(x.trans, xnew, type="original")
age            <- w[,"age"]            #inserts imputed values
blood.pressure <- w[,"blood.pressure"]
Function(x.trans)  #creates .age, .disease, .blood.pressure, .pH()
#Repeat first fit using a formula
x.trans <- transcan(~ age + disease + blood.pressure + I(pH),
                    imputed=TRUE)
age <- impute(x.trans, age)
predict(x.trans, expand.grid(age=50, disease="pneumonia",
        blood.pressure=60:260, pH=7.4))
z <- transcan(~ age + factor(disease.code),  # disease.code categorical
              transformed=TRUE, trantab=TRUE, imputed=TRUE, pl=FALSE)
ggplot(z, scale=TRUE)
plot(z$transformed)
## End(Not run)

# Multiple imputation and estimation of variances and covariances of
# regression coefficient estimates accounting for imputation
set.seed(1)
x1 <- factor(sample(c('a','b','c'),100,TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(100)
y  <- x2 + 1*(x1=='c') + rnorm(100)
x1[1:20] <- NA
x2[18:23] <- NA
d <- data.frame(x1,x2,y)
n <- naclus(d)
plot(n); naplot(n)  # Show patterns of NAs
f  <- transcan(~y + x1 + x2, n.impute=10, shrink=FALSE, data=d)
options(digits=3)
summary(f)
f  <- transcan(~y + x1 + x2, n.impute=10, shrink=TRUE, data=d)
summary(f)
h <- fit.mult.impute(y ~ x1 + x2, lm, f, data=d)
# Add ,fit.reps=TRUE to save all fit objects in h, then do something like:
# for(i in 1:length(h$fits)) print(summary(h$fits[[i]]))
diag(vcov(h))
h.complete <- lm(y ~ x1 + x2, na.action=na.omit)
h.complete
diag(vcov(h.complete))
# Note: had the rms ols function been used in place of lm, any
# function run on h (anova, summary, etc.) would have automatically
# used imputation-corrected variances and covariances

# Example demonstrating how using the multinomial logistic model
# to impute a categorical variable results in a frequency
# distribution of imputed values that matches the distribution
# of non-missing values of the categorical variable
## Not run:
set.seed(11)
x1 <- factor(sample(letters[1:4], 1000,TRUE))
x1[1:200] <- NA
table(x1)/sum(table(x1))
x2 <- runif(1000)
z  <- transcan(~ x1 + I(x2), n.impute=20, impcat='multinom')
table(z$imputed$x1)/sum(table(z$imputed$x1))
# Here is how to create a completed dataset
d <- data.frame(x1, x2)
z <- transcan(~x1 + I(x2), n.impute=5, data=d)
imputed <- impute(z, imputation=1, data=d,
                  list.out=TRUE, pr=FALSE, check=FALSE)
sapply(imputed, function(x)sum(is.imputed(x)))
sapply(imputed, function(x)sum(is.na(x)))
## End(Not run)

# Do single imputation and create a filled-in data frame
z <- transcan(~x1 + I(x2), data=d, imputed=TRUE)
imputed <- as.data.frame(impute(z, data=d, list.out=TRUE))

# Example where multiple imputations are for basic variables and
# modeling is done on variables derived from these
set.seed(137)
n <- 400
x1 <- runif(n)
x2 <- runif(n)
y  <- x1*x2 + x1/(1+x2) + rnorm(n)/3
x1[1:5] <- NA
d <- data.frame(x1,x2,y)
w <- transcan(~ x1 + x2 + y, n.impute=5, data=d)
# Add ,show.imputed.actual for graphical diagnostics
## Not run:
g <- fit.mult.impute(y ~ product + ratio, ols, w,
                     data=data.frame(x1,x2,y),
                     derived=expression({
                       product <- x1*x2
                       ratio   <- x1/(1+x2)
                       print(cbind(x1,x2,x1*x2,product)[1:6,])}))
## End(Not run)

# Here's a method for creating a permanent data frame containing
# one set of imputed values for each variable specified to transcan
# that had at least one NA, and also containing all the variables
# in an original data frame.  The following is based on the fact
# that the default output location for impute.transcan is
# given by the global environment
## Not run:
xt <- transcan(~. , data=mine,
               imputed=TRUE, shrink=TRUE, n.impute=10, trantab=TRUE)
attach(mine, use.names=FALSE)
impute(xt, imputation=1) # use first imputation
# omit imputation= if using single imputation
detach(1, 'mine2')
## End(Not run)

# Example of using invertTabulated outside transcan
x    <- c(1,2,3,4,5,6,7,8,9,10)
y    <- c(1,2,3,4,5,5,5,5,9,10)
freq <- c(1,1,1,1,1,2,3,4,1,1)
# x=5,6,7,8 with prob. .1 .2 .3 .4 when y=5
# Within a tolerance of .05*(10-1) all y's match exactly
# so the distance measure does not play a role
set.seed(1)      # so can reproduce
for(inverse in c('linearInterp','sample'))
 print(table(invertTabulated(x, y, freq, rep(5,1000), inverse=inverse)))

# Test inverse='sample' when the estimated transformation is
# flat on the right.  First show default imputations
set.seed(3)
x <- rnorm(1000)
y <- pmin(x, 0)
x[1:500] <- NA
for(inverse in c('linearInterp','sample')) {
par(mfrow=c(2,2))
  w <- transcan(~ x + y, imputed.actual='hist',
                inverse=inverse, curtail=FALSE,
                data=data.frame(x,y))
  if(inverse=='sample') next
# cat('Click mouse on graph to proceed\n')
# locator(1)
}

## Not run:
# While running multiple imputation for a logistic regression model
# Run the rms package validate and calibrate functions and save the
# results in w$funresults
a <- aregImpute(~ x1 + x2 + y, data=d, n.impute=10)
require(rms)
g <- function(fit)
  list(validate=validate(fit, B=50), calibrate=calibrate(fit, B=75))
w <- fit.mult.impute(y ~ x1 + x2, lrm, a, data=d, fun=g,
                     fitargs=list(x=TRUE, y=TRUE))
# Get all validate results in its own list of length 10
r <- w$funresults
val <- lapply(r, function(x) x$validate)
cal <- lapply(r, function(x) x$calibrate)
# See rms processMI and https://hbiostat.org/rmsc/validate.html#sec-val-mival
## End(Not run)

## Not run:
# Account for within-subject correlation using the robust cluster sandwich
# covariance estimate in conjunction with Rubin's rule for multiple imputation
# rms package must be installed
a <- aregImpute(..., data=d)
f <- fit.mult.impute(y ~ x1 + x2, lrm, a, n.impute=30, data=d, cluster=d$id)
# Get likelihood ratio chi-square tests accounting for missingness
a <- aregImpute(..., data=d)
h <- fit.mult.impute(y ~ x1 + x2, lrm, a, n.impute=40, data=d, lrt=TRUE)
processMI(h, which='anova')   # processMI is in rms
## End(Not run)

Translate Vector or Matrix of Text Strings

Description

Uses the UNIX tr command to translate any character in old in text to the corresponding character in new. If multichar=TRUE or old and new have more than one element, or each have one element but they have different numbers of characters, uses the UNIX sed command to translate the series of characters in old to the series in new when these characters occur in text. If old or new contain a backslash, you sometimes have to quadruple it to make the UNIX command work. If they contain a forward slash, precede it by two backslashes. Invokes the builtin chartr function if multichar=FALSE.
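The two modes can be contrasted with a base R sketch. The first call uses chartr directly. The second uses a hypothetical helper, replace_strings, built on gsub with fixed=TRUE as a stand-in for the sed-based path; it is not how translate itself is implemented and may differ from it in edge cases.

```r
text <- c("ABC", "DEF")

# Character-by-character translation, as chartr does
single <- chartr("ABCDEFG", "abcdefg", text)

# Multi-character translation sketched with fixed-string substitution:
# each element of old is replaced by the matching element of new
replace_strings <- function(text, old, new) {
  new <- rep(new, length.out = length(old))   # recycle a single replacement
  for (i in seq_along(old))
    text <- gsub(old[i], new[i], text, fixed = TRUE)
  text
}
multi <- replace_strings(c("dog", "cat2", "snake"), c("dog", "cat"), "animal")

single
multi
```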

Usage

translate(text, old, new, multichar=FALSE)

Arguments

text

scalar, vector, or matrix of character strings to translate.

old

vector of old characters

new

corresponding vector of new characters

multichar

See above.

Value

an object like text but with characters translated

See Also

grep

Examples

translate(c("ABC","DEF"),"ABCDEFG", "abcdefg")
translate("23.12","[.]","\\cdot ") # change . to \cdot
translate(c("dog","cat","tiger"),c("dog","cat"),c("DOG","CAT"))
# S-Plus gives  [1] "DOG"   "CAT"   "tiger" - check discrepancy
translate(c("dog","cat2","snake"),c("dog","cat"),"animal")
# S-Plus gives  [1] "animal"  "animal2" "snake"

Return the floor, ceiling, or rounded value of date or time to specified unit.

Description

truncPOSIXt returns the date truncated to the specified unit. ceil.POSIXt returns next ceiling of the date at the unit selected in units. roundPOSIXt returns the date or time value rounded to nearest specified unit selected in digits.

truncPOSIXt and roundPOSIXt have been extended from the base package functions trunc.POSIXt and round.POSIXt which in the future will add the other time units we need.
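The floor and ceiling ideas can be sketched with base R alone, assuming an R version whose trunc method for date-times accepts units="months". This is an illustration of the concept, not the Hmisc implementation, and it does not address how ceil treats a date that already falls exactly on a unit boundary.

```r
d <- as.POSIXct("1832-07-12 12:00:00", tz = "GMT")

# Floor: truncate to the start of the month
floor_month <- as.POSIXct(trunc(d, units = "months"))

# Ceiling: start of the following month, obtained by stepping one month
# forward from the floor
ceil_month <- seq(floor_month, by = "1 month", length.out = 2)[2]

format(floor_month, "%Y-%m-%d", tz = "GMT")
format(ceil_month,  "%Y-%m-%d", tz = "GMT")
```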

Usage

ceil(x, units,...)
## Default S3 method:
ceil(x, units, ...)
truncPOSIXt(x, units = c("secs", "mins", "hours", "days",
"months", "years"), ...)
## S3 method for class 'POSIXt'
ceil(x, units = c("secs", "mins", "hours", "days",
"months", "years"), ...)
roundPOSIXt(x, digits = c("secs", "mins", "hours", "days", "months", "years"))

Arguments

x

date to be ceilinged, truncated, or rounded

units

unit to which the date is rounded up or down.

digits

same as units but different name to be compatible with round generic.

...

further arguments to be passed to or from other methods.

Value

An object of classPOSIXlt.

Author(s)

Charles Dupont

See Also

Date, POSIXt, POSIXlt, DateTimeClasses

Examples

date <- ISOdate(1832, 7, 12)
ceil(date, units='months')  # '1832-8-1'
truncPOSIXt(date, units='years')     # '1832-1-1'
roundPOSIXt(date, digits='months')    # '1832-7-1'

Units Attribute of a Vector

Description

Sets or retrieves the "units" attribute of an object. units.default replaces the builtin version, which only works for time series objects. If the variable is also given a label, subsetting (using [.labelled) will retain the "units" attribute. For a Surv object, units first looks for an overall "units" attribute, then it looks for units for the time2 variable then for time1. When setting "units", value is changed to lower case and any "s" at the end is removed.
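The normalization applied when setting "units" can be mimicked in base R. The helper normalize_units below is hypothetical and only mirrors the rule as described (lower case, trailing "s" removed); the package's own replacement method may handle additional cases.

```r
# Hypothetical helper reproducing the described normalization
normalize_units <- function(value) sub("s$", "", tolower(value))

fail.time <- c(10, 20)
attr(fail.time, "units") <- normalize_units("Days")
attr(fail.time, "units")
```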

Usage

units(x, ...)
## Default S3 method:
units(x, none='', ...)
## S3 method for class 'Surv'
units(x, none='', ...)
## Default S3 replacement method:
units(x) <- value

Arguments

x

any object

...

ignored

value

the units of the object, or ""

none

value to which to set result if no appropriate attribute is found

Value

the units attribute of x, if any; otherwise, the units attribute of the tspar attribute of x if any; otherwise the value none. Handling for Surv objects is different (see above).

See Also

label

Examples

require(survival)
fail.time <- c(10,20)
units(fail.time) <- "Day"
describe(fail.time)
S <- Surv(fail.time)
units(S)
label(fail.time) <- 'Failure Time'
units(fail.time) <- 'Days'
fail.time

Update a Data Frame or Cleanup a Data Frame after Importing

Description

cleanup.import will correct errors and shrink the size of data frames. By default, double precision numeric variables are changed to integer when they contain no fractional components. Infinite values or values greater than 1e20 in absolute value are set to NA. This solves problems of importing Excel spreadsheets that contain occasional character values for numeric columns, as S converts these to Inf without warning. There is also an option to convert variable names to lower case and to add labels to variables. The latter can be made easier by importing a CNTLOUT dataset created by SAS PROC FORMAT and using the sasdict option as shown in the example below. cleanup.import can also transform character or factor variables to dates.

upData is a function facilitating the updating of a data frame without attaching it in search position one. New variables can be added, old variables can be modified, variables can be removed or renamed, and "labels" and "units" attributes can be provided. Observations can be subsetted. Various checks are made for errors and inconsistencies, with warnings issued to help the user. Levels of factor variables can be replaced, especially using the list notation of the standard merge.levels function. Unless force.single is set to FALSE, upData also converts double precision vectors to integer if no fractional values are present in a vector. upData is also used to process R workspace objects created by StatTransfer, which puts variable and value labels as attributes on the data frame rather than on each variable. If such attributes are present, they are used to define all the labels and value labels (through conversion to factor variables) before any label changes take place, and force.single is set to a default of FALSE, as StatTransfer already does conversion to integer.

Variables having labels but not classed "labelled" (e.g., data imported using the haven package) have that class added to them by upData.

The dataframeReduce function removes variables from a data frame that are problematic for certain analyses. Variables can be removed because the fraction of missing values exceeds a threshold, because they are character or categorical variables having too many levels, or because they are binary and have too small a prevalence in one of the two values. Categorical variables can also have their levels combined when a level is of low prevalence. A data frame listing actions taken is returned as attribute "info" to the main returned data frame.
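The two default numeric rules described in this paragraph can be sketched in base R. The helper clean_numeric is hypothetical and covers only the conversion of whole-number doubles to integer and the replacement of infinite or very large values by NA; it is not the cleanup.import source.

```r
big <- 1e20   # same default threshold as the big argument of cleanup.import

clean_numeric <- function(x) {
  # Inf or values beyond big in absolute value become NA
  x[which(is.infinite(x) | abs(x) > big)] <- NA
  # Convert to integer only if no fractional parts and values fit in an integer
  whole    <- all(x == floor(x), na.rm = TRUE)
  in_range <- all(abs(x) <= .Machine$integer.max, na.rm = TRUE)
  if (whole && in_range) storage.mode(x) <- "integer"
  x
}

a <- clean_numeric(c(1, 2, Inf, 4))       # whole numbers: integer with one NA
b <- clean_numeric(c(1.5, 2, 1e25))       # has a fraction: stays double
str(a)
str(b)
```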
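Two of the removal criteria can be sketched in base R. The helper reduce_sketch is hypothetical: it borrows the argument names fracmiss and maxlevels for readability but does not handle minprev, level combining, or the "info" attribute that dataframeReduce returns.

```r
# Hypothetical helper; the thresholds mirror the fracmiss and maxlevels arguments
reduce_sketch <- function(data, fracmiss = 1, maxlevels = NULL) {
  keep <- vapply(data, function(v) {
    too_missing <- mean(is.na(v)) > fracmiss
    too_many    <- !is.null(maxlevels) &&
                   (is.character(v) || is.factor(v)) &&
                   length(unique(v[!is.na(v)])) > maxlevels
    !(too_missing || too_many)
  }, logical(1))
  data[, keep, drop = FALSE]
}

d <- data.frame(a = c(1, NA, NA, NA),          # 75% missing
                b = c("u", "v", "w", "x"),     # four distinct values
                c = 1:4)
reduced <- reduce_sketch(d, fracmiss = 0.5, maxlevels = 3)
names(reduced)
```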

Usage

cleanup.import(obj, labels, lowernames=FALSE,
               force.single=TRUE, force.numeric=TRUE, rmnames=TRUE,
               big=1e20, sasdict, print, datevars=NULL, datetimevars=NULL,
               dateformat='%F',
               fixdates=c('none','year'),
               autodate=FALSE, autonum=FALSE, fracnn=0.3,
               considerNA=NULL, charfactor=FALSE)

upData(object, ...,
       subset, rename, drop, keep, labels, units, levels, force.single=TRUE,
       lowernames=FALSE, caplabels=FALSE, classlab=FALSE, moveUnits=FALSE,
       charfactor=FALSE, print=TRUE, html=FALSE)

dataframeReduce(data, fracmiss=1, maxlevels=NULL,  minprev=0, print=TRUE)

Arguments

obj

a data frame or list

object

a data frame or list

data

a data frame

force.single

By default, double precision variables are converted to single precision (in S-Plus only) unless force.single=FALSE. force.single=TRUE will also convert vectors having only integer values to have a storage mode of integer, in R or S-Plus.

force.numeric

Sometimes importing will cause a numeric variable to be changed to a factor vector. By default, cleanup.import will check each factor variable to see if the levels contain only numeric values and "". In that case, the variable will be converted to numeric, with "" converted to NA. Set force.numeric=FALSE to prevent this behavior.

rmnames

set to FALSE to not have cleanup.import remove names or .Names attributes from variables

labels

a character vector the same length as the number of variables in obj. These character values are taken to be variable labels in the same order of variables in obj. For upData, labels is a named list or named vector with variables in no specific order.

lowernames

set this to TRUE to change variable names to lower case. upData does this before applying any other changes, so variable names given inside arguments to upData need to be lower case if lowernames==TRUE.

big

a value such that values larger than this in absolute value are set to missing by cleanup.import

sasdict

the name of a data frame containing a raw imported SAS PROC CONTENTS CNTLOUT= dataset. This is used to define variable names and to add attributes to the new data frame specifying the original SAS dataset name and label.

print

set to TRUE or FALSE to force or prevent printing of the current variable number being processed. By default, such messages are printed if the product of the number of variables and number of observations in obj exceeds 500,000. For dataframeReduce set print to FALSE to suppress printing information about dropped or modified variables. Similar for upData.

datevars

character vector of names (after lowernames is applied) of variables to consider as a factor or character vector containing dates in a format matching dateformat. The default is "%F" which uses the yyyy-mm-dd format.

datetimevars

character vector of names (after lowernames is applied) of variables to consider to be date-time variables, with date formats as described under datevars followed by a space followed by time in hh:mm:ss format. chron is used to store date-time variables. If all times in the variable are 00:00:00 the variable will be converted to an ordinary date variable.

dateformat

for cleanup.import is the input format (see strptime)

fixdates

for any of the variables listed in datevars that have a dateformat that cleanup.import understands, specifying fixdates allows corrections of certain formatting inconsistencies before the fields are attempted to be converted to dates (the default is to assume that the dateformat is followed for all observations for datevars). Currently fixdates='year' is implemented, which will cause 2-digit or 4-digit years to be shifted to the alternate number of digits when dateformat is the default "%F" or is "%y-%m-%d", "%m/%d/%y", or "%m/%d/%Y". Two-digit years are padded with 20 on the left. Set dateformat to the desired format, not the exceptional format.

autodate

set toTRUE to havecleanup.importdetermine and automatically handle factor or character vectors that mainly contain dates of the form YYYY-mm-dd,mm/dd/YYYY, YYYY, or mm/YYYY, where the later two are imputed to,respectively, July 3 and the 15th of the month. Takes effect whenthe fraction of non-dates (of non-missing values) is less thanfracnn to allow for some free text such as"unknown".Attributesspecial.miss andimputed are created for the vector sothatdescribe() will inform the user. Illegal values areconverted toNAs and stored in thespecial.miss attribute.

autonum

set toTRUE to havecleanup.importexamine (afterautodate) character and factor variables tosee if they are legal numerics exact for at most a fraction offracnn of non-missing non-numeric values. Qualifying variables areconverted to numeric, and illegal values set toNA and stored inthespecial.miss attribute to enhancedescribe output.

fracnn

seeautodate andautonum

considerNA

forautodate andautonum, considerscharacter values in the vectorconsiderNA to be the same asNA. Leading and trailing white space and upper/lower caseare ignored.

charfactor

set toTRUE to change character variables tofactors if they have fewer than n/2 unique values. Null strings andblanks are converted toNAs.

...

forupData, one or more expressions of the formvariable=expression, to derive new variables or change old ones.

subset

an expression that evaluates to a logical vectorspecifying which rows ofobject should be retained. Theexpressions should use the original variable names, i.e., before anyvariables are renamed but afterlowernames takes effect.

rename

list or named vector specifying old and new names for variables. Variables are renamed before any other operations are done. For example, to rename variables age and sex to respectively Age and gender, specify rename=list(age="Age", sex="gender") or rename=c(age=...).

drop

a vector of variable names to remove from the data frame

keep

a vector of variable names to keep, with all other variables dropped

units

a named vector or list defining "units" attributes of variables, in no specific order

levels

a named list defining "levels" attributes for factor variables, in no specific order. The values in this list may be character vectors redefining levels (in order) or another list (see merge.levels if using S-Plus).

caplabels

set to TRUE to capitalize the first letter of each word in each variable label

classlab

set to TRUE (the old default behavior) to have upData automatically give class "labelled" to variables having a "label" attribute. Note that labels supplied through the labels argument to upData always create labelled-class variables.

moveUnits

set to TRUE to look for units of measurements in variable labels and move them to a "units" attribute. If an expression in a label is enclosed in parentheses or brackets it is assumed to be units if moveUnits=TRUE.

html

set to TRUE to print conversion information as html verbatim at 0.6 size. The user will need to put results='asis' in a knitr chunk header to properly render this output.

fracmiss

the maximum permissible proportion of NAs for a variable to be kept. Default is to keep all variables no matter how many NAs are present.

maxlevels

the maximum number of levels of a character or categorical or factor variable before the variable is dropped

minprev

the minimum proportion of non-missing observations in a category for a binary variable to be retained, and the minimum relative frequency of a category before it will be combined with other small categories

Value

a new data frame

Author(s)

Frank Harrell, Vanderbilt University

See Also

sas.get, data.frame, describe, label, read.csv, strptime, POSIXct, Date

Examples

## Not run: 
dat <- read.table('myfile.asc')
dat <- cleanup.import(dat)
## End(Not run)
dat <- data.frame(a=1:3, d=c('01/02/2004',' 1/3/04',''))
cleanup.import(dat, datevars='d', dateformat='%m/%d/%y', fixdates='year')

dat <- data.frame(a=(1:3)/7, y=c('a','b1','b2'), z=1:3)
dat2 <- upData(dat, x=x^2, x=x-5, m=x/10,
               rename=c(a='x'), drop='z',
               labels=c(x='X', y='test'),
               levels=list(y=list(a='a',b=c('b1','b2'))))
dat2
describe(dat2)
dat <- dat2    # copy to original name and delete dat2 if OK
rm(dat2)
dat3 <- upData(dat, X=X^2, subset = x < (3/7)^2 - 5, rename=c(x='X'))

# Remove hard to analyze variables from a redundancy analysis of all
# variables in the data frame
d <- dataframeReduce(dat, fracmiss=.1, minprev=.05, maxlevels=5)
# Could run redun(~., data=d) at this point or include dataframeReduce
# arguments in the call to redun

# If you import a SAS dataset created by PROC CONTENTS CNTLOUT=x.datadict,
# the LABELs from this dataset can be added to the data.  Let's also
# convert names to lower case for the main data file
## Not run: 
mydata2 <- cleanup.import(mydata2, lowernames=TRUE, sasdict=datadict)
## End(Not run)

Change First Letters to Upper Case

Description

Changes the first letter of each word in a string to upper case, keeping selected words in lower case. Words containing at least 2 capital letters are kept as-is.

Usage

upFirst(txt, lower = FALSE, alllower = FALSE)

Arguments

txt

a character vector

lower

set to TRUE to make only the very first letter of the string upper case, and to keep words with at least 2 capital letters in their original form

alllower

set to TRUE to make every word start with lower case unless it has at least 2 caps

References

https://en.wikipedia.org/wiki/Letter_case#Headings_and_publication_titles

Examples

upFirst(c('this and that','that is Beyond question'))

Store Descriptive Information About an Object

Description

Functions get or set useful information about the contents of the object for later use.

Usage

valueTags(x)
valueTags(x) <- value
valueLabel(x)
valueLabel(x) <- value
valueName(x)
valueName(x) <- value
valueUnit(x)
valueUnit(x) <- value

Arguments

x

an object

value

for valueTags<-, a named list of value tags; for the other replacement functions, a character vector of length 1, or NULL.

Details

These functions store a short name for the contents, a longer label that is useful for display, and the units of the contents, which are also useful for display.

valueTags is an accessor, and valueTags<- is a replacement function for all of the value's information.

valueName is an accessor, and valueName<- is a replacement function for the value's name. This name is used when a plot or a latex table needs a short name and the variable name is not useful.

valueLabel is an accessor, and valueLabel<- is a replacement function for the value's label. The label is used in plots or latex tables when they need a descriptive name.

valueUnit is an accessor, and valueUnit<- is a replacement function for the value's unit. The unit is used to add unit information to the R output.

Value

valueTags returns NULL or a named list with each of the named values name, label, unit set if they exist in the object.

valueTags<- returns a list.

valueName, valueLabel, and valueUnit return NULL or a character vector of length 1.

valueName<-, valueLabel<-, and valueUnit<- return value.

Author(s)

Charles Dupont

See Also

names, attributes

Examples

age <- c(21,65,43)
y   <- 1:3
valueLabel(age) <- "Age in Years"
plot(age, y, xlab=valueLabel(age))

x1 <- 1:10
x2 <- 10:1
valueLabel(x2) <- 'Label for x2'
valueUnit(x2) <- 'mmHg'
x2
x2[1:5]
dframe <- data.frame(x1, x2)
Label(dframe)

##In these examples of llist, note that labels are printed after
##variable names, because of print.labelled
a <- 1:3
b <- 4:6
valueLabel(b) <- 'B Label'

Variable Clustering

Description

Does a hierarchical cluster analysis on variables, using the Hoeffding D statistic, squared Pearson or Spearman correlations, or proportion of observations for which two variables are both positive as similarity measures. Variable clustering is used for assessing collinearity, redundancy, and for separating variables into clusters that can be scored as a single variable, thus resulting in data reduction. For computing any of the three similarity measures, pairwise deletion of NAs is done. The clustering is done by hclust(). A small function naclus is also provided which depicts similarities in which observations are missing for variables in a data frame. The similarity measure is the fraction of NAs in common between any two variables. The diagonals of this sim matrix are the fraction of NAs in each variable by itself. naclus also computes na.per.obs, the number of missing variables in each observation, and mean.na, a vector whose ith element is the mean number of missing variables other than variable i, for observations in which variable i is missing. The naplot function makes several plots (see the which argument).

So as to not generate too many dummy variables for multi-valued character or categorical predictors, varclus will automatically combine infrequent cells of such variables using combine.levels.

plotMultSim plots multiple similarity matrices, with the similarity measure being on the x-axis of each subplot.

na.pattern prints a frequency table of all combinations of missingness for multiple variables. If there are 3 variables, a frequency table entry labeled 110 corresponds to the number of observations for which the first and second variables were missing but the third variable was not missing.
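
The missingness summaries described above can be illustrated in base R. The following is only a sketch under stated assumptions, not the code used by naclus or na.pattern: the data frame df is made up, and the objects sim, na.per.obs, and pattern merely borrow the names used in the description (the objects returned by the package may be structured differently).

```r
# Hypothetical data frame with some missing values
df <- data.frame(a = c(1, NA, 3, NA),
                 b = c(NA, NA, 3, 4),
                 c = c(1, 2, 3, NA))

miss <- 1 * is.na(df)        # 0/1 indicator matrix, 1 = missing

# Fraction of observations in which both variables are NA;
# the diagonal is the fraction of NAs in each variable by itself
sim <- crossprod(miss) / nrow(miss)

# Number of missing variables in each observation
na.per.obs <- rowSums(miss)

# Frequency table of missingness patterns; "110" means the first and
# second variables are missing and the third is not
pattern <- table(apply(miss, 1, paste, collapse = ""))

sim
na.per.obs
pattern
```

In this toy data set only the second observation has both a and b missing, so the (a, b) entry of sim is 1/4 and the pattern "110" occurs once.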

Usage

varclus(x, similarity=c("spearman","pearson","hoeffding","bothpos","ccbothpos"),
        type=c("data.matrix","similarity.matrix"),
        method="complete",
        data=NULL, subset=NULL, na.action=na.retain,
        trans=c("square", "abs", "none"), ...)

## S3 method for class 'varclus'
print(x, abbrev=FALSE, ...)

## S3 method for class 'varclus'
plot(x, ylab, abbrev=FALSE, legend.=FALSE, loc, maxlen, labels, ...)

naclus(df, method)

naplot(obj, which=c('all','na per var','na per obs','mean na',
                    'na per var vs mean na'), ...)

plotMultSim(s, x=1:dim(s)[3],
            slim=range(pretty(c(0,max(s,na.rm=TRUE)))),
            slimds=FALSE,
            add=FALSE, lty=par('lty'), col=par('col'),
            lwd=par('lwd'), vname=NULL, h=.5, w=.75, u=.05,
            labelx=TRUE, xspace=.35)

na.pattern(x)

Arguments

x

a formula, a numeric matrix of predictors, or a similarity matrix. If x is a formula, model.matrix is used to convert it to a design matrix. If the formula excludes an intercept (e.g., ~ a + b -1), the first categorical (factor) variable in the formula will have dummy variables generated for all levels instead of omitting one for the first level. For plot and print, x is an object created by varclus. For na.pattern, x is a data table, data frame, or matrix.

For plotMultSim, x is a numeric vector specifying the ordered unique values on the x-axis, corresponding to the third dimension of s.

df

a data frame

s

an array of similarity matrices. The third dimension of this array corresponds to different computations of similarities. The first two dimensions come from a single similarity matrix. This is useful for displaying similarity matrices computed by varclus, for example. A use for this might be to show pairwise similarities of variables across time in a longitudinal study (see the example below). If vname is not given, s must have dimnames.

similarity

the default is to use squared Spearman correlation coefficients, which will detect monotonic but nonlinear relationships. You can also specify linear correlation or Hoeffding's (1948) D statistic, which has the advantage of being sensitive to many types of dependence, including highly non-monotonic relationships. For binary data, or data to be made binary, similarity="bothpos" uses as a similarity measure the proportion of observations for which two variables are both positive. similarity="ccbothpos" uses a chance-corrected measure which is the proportion of observations for which both variables are positive minus the product of the two marginal proportions. This difference is expected to be zero under independence. For diagonals, "ccbothpos" still uses the proportion of positives for the single variable. So "ccbothpos" is not really a similarity measure, and clustering is not done. This measure is useful for plotting with plotMultSim (see the last example).
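
The "bothpos" and "ccbothpos" measures described above can be written out in a few lines of base R. This is a minimal sketch assuming complete binary data (varclus itself does pairwise deletion of NAs, which is not reproduced here); the matrix x and all object names are made up for illustration.

```r
set.seed(2)
# Hypothetical binary data with no missing values
x <- cbind(a = rbinom(200, 1, 0.4),
           b = rbinom(200, 1, 0.5),
           c = rbinom(200, 1, 0.3))

pos <- 1 * (x > 0)                       # 1 = positive
bothpos <- crossprod(pos) / nrow(pos)    # proportion with both variables positive
p <- diag(bothpos)                       # marginal proportions of positives

# Chance correction: subtract the product of the two marginal proportions
ccbothpos <- bothpos - outer(p, p)
diag(ccbothpos) <- p                     # diagonals keep the marginal proportion

round(ccbothpos, 3)
```

Because the three columns are generated independently, the off-diagonal entries should be close to zero, which matches the statement that the difference is expected to be zero under independence.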

type

if x is not a formula, it may be a data matrix or a similarity matrix. By default, it is assumed to be a data matrix.

method

see hclust. The default, for both varclus and naclus, is "compact" (for R it is "complete").

data

a data frame, data table, or list

subset

a standard subsetting expression

na.action

These may be specified if x is a formula. The default na.action is na.retain, defined by varclus. This causes all observations to be kept in the model frame, with later pairwise deletion of NAs.

trans

By default, when the similarity measure is based on Pearson's or Spearman's correlation coefficients, the coefficients are squared. Specify trans="abs" to take absolute values or trans="none" to use the coefficients as they stand.

...

for varclus these are optional arguments to pass to the dataframeReduce function. Otherwise, passed to plclust (or to dotchart or dotchart2 for naplot).

ylab

y-axis label. Default is constructed on the basis of similarity.

legend.

set to TRUE to plot a legend defining the abbreviations

loc

a list with elements x and y defining coordinates of the upper left corner of the legend. Default is locator(1).

maxlen

if a legend is plotted describing abbreviations, original labels longer than maxlen characters are truncated at maxlen.

labels

a vector of character strings containing labels corresponding to columns in the similarity matrix, if the column names of that matrix are not to be used

obj

an object created by naclus

which

defaults to "all" meaning to have naplot make 4 separate plots. To make only one of the plots, use which="na per var" (dot chart of fraction of NAs for each variable), "na per obs" (dot chart showing frequency distribution of number of variables having NAs in an observation), "mean na" (dot chart showing mean number of other variables missing when the indicated variable is missing), or "na per var vs mean na", a scatterplot showing on the x-axis the fraction of NAs in the variable and on the y-axis the mean number of other variables that are NA when the indicated variable is NA.

abbrev

set to TRUE to abbreviate variable names for plotting or printing. Is set to TRUE automatically if legend=TRUE.

slim

2-vector specifying the range of similarity values for scaling the y-axes. By default this is the observed range over all of s.

slimds

set to TRUE to scale diagonals and off-diagonals separately

add

set to TRUE to add similarities to an existing plot (usually specifying lty or col)

lty, col, lwd

line type, color, or line thickness for plotMultSim

vname

optional vector of variable names, in order, used in s

h

relative height for subplot

w

relative width for subplot

u

relative extra height and width to leave unused inside the subplot.Also used as the space between y-axis tick mark labels and graph border.

labelx

set to FALSE to suppress drawing of labels in the x direction

xspace

amount of space, on a scale of 1:n where n is the number of variables, to set aside for y-axis labels

Details

options(contrasts= c("contr.treatment", "contr.poly")) is issued temporarily by varclus to make sure that ordinary dummy variables are generated for factor variables. Pass arguments to the dataframeReduce function to remove problematic variables (especially if analyzing all variables in a data frame).

Value

for varclus or naclus, a list of class varclus with elements call (containing the calling statement), sim (similarity matrix), n (sample size used if x was not a correlation matrix already - n is a matrix), hclust, the object created by hclust, similarity, and method. naclus also returns the two vectors listed under description, and naplot returns an invisible vector that is the frequency table of the number of missing variables per observation. plotMultSim invisibly returns the limits of similarities used in constructing the y-axes of each subplot. For similarity="ccbothpos" the hclust object is NULL.

na.pattern creates an integer vector of frequencies.

Side Effects

plots

Author(s)

Frank Harrell
Department of Biostatistics, Vanderbilt University
fh@fharrell.com

References

Sarle, WS: The VARCLUS Procedure. SAS/STAT User's Guide, 4th Edition,1990. Cary NC: SAS Institute, Inc.

Hoeffding W. (1948): A non-parametric test of independence. Ann Math Stat 19:546–57.

See Also

hclust, plclust, hoeffd, rcorr, cor, model.matrix, locator, na.pattern, cut2, combine.levels

Examples

set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + x2 + rnorm(200)
x4 <- x2 + rnorm(200)
x <- cbind(x1,x2,x3,x4)
v <- varclus(x, similarity="spear")  # spearman is the default anyway
v    # invokes print.varclus
print(round(v$sim,2))
plot(v)

# Convert the dendrogram to be horizontal
v <- as.dendrogram(v$hclust)
plot(v, horiz=TRUE, axes=FALSE,
     xlab=expression(paste('Spearman ', rho^2)))
rh <- seq(0, 1, by=0.1)  # re-label x-axis re:similarity not distance
axis(1, at=1 - rh, labels=format(rh))

# plot(varclus(~ age + sys.bp + dias.bp + country - 1), abbrev=TRUE)
# the -1 causes k dummies to be generated for k countries
# plot(varclus(~ age + factor(disease.code) - 1))
#
#
# use varclus(~., data= fracmiss= maxlevels= minprev=) to analyze all
# "useful" variables - see dataframeReduce for details about arguments

df <- data.frame(a=c(1,2,3),b=c(1,2,3),c=c(1,2,NA),d=c(1,NA,3),
                 e=c(1,NA,3),f=c(NA,NA,NA),g=c(NA,2,3),h=c(NA,NA,3))
par(mfrow=c(2,2))
for(m in c("ward","complete","median")) {
  plot(naclus(df, method=m))
  title(m)
}
naplot(naclus(df))
n <- naclus(df)
plot(n); naplot(n)
na.pattern(df)

# plotMultSim example: Plot proportion of observations
# for which two variables are both positive (diagonals
# show the proportion of observations for which the
# one variable is positive).  Chance-correct the
# off-diagonals by subtracting the product of the
# marginal proportions.  On each subplot the x-axis
# shows month (0, 4, 8, 12) and there is a separate
# curve for females and males
d <- data.frame(sex=sample(c('female','male'),1000,TRUE),
                month=sample(c(0,4,8,12),1000,TRUE),
                x1=sample(0:1,1000,TRUE),
                x2=sample(0:1,1000,TRUE),
                x3=sample(0:1,1000,TRUE))
s <- array(NA, c(3,3,4))
opar <- par(mar=c(0,0,4.1,0))  # waste less space
for(sx in c('female','male')) {
  for(i in 1:4) {
    mon <- (i-1)*4
    s[,,i] <- varclus(~x1 + x2 + x3, sim='ccbothpos', data=d,
                      subset=d$month==mon & d$sex==sx)$sim
    }
  plotMultSim(s, c(0,4,8,12), vname=c('x1','x2','x3'),
              add=sx=='male', slimds=TRUE,
              lty=1+(sx=='male'))
  # slimds=TRUE causes separate scaling for diagonals and
  # off-diagonals
}
par(opar)

vlab

Description

Easily Retrieve Text Form of Labels/Units

Usage

vlab(x, name = NULL)

Arguments

x

a single variable name, unquoted

name

optional character string to use as variable name

Details

Uses the same search method as hlab and returns the label and units in a character string, with units, if present, in brackets.

Value

character string

Author(s)

Frank Harrell

See Also

hlab()


Weighted Statistical Estimates

Description

These functions compute various weighted versions of standard estimators. In most cases the weights vector is a vector the same length as x, containing frequency counts that in effect expand x by these counts. weights can also be sampling weights, in which case setting normwt to TRUE will often be appropriate. This results in making weights sum to the length of the non-missing elements in x. normwt=TRUE thus reflects the fact that the true sample size is the length of the x vector and not the sum of the original values of weights (which would be appropriate had normwt=FALSE). When weights is all ones, the estimates are all identical to unweighted estimates (unless one of the non-default quantile estimation options is specified to wtd.quantile). When missing data have already been deleted for x, weights, and (in the case of wtd.loess.noiter) y, specifying na.rm=FALSE will save computation time. Omitting the weights argument or specifying NULL or a zero-length vector will result in the usual unweighted estimates.
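
The idea that frequency weights "expand" x can be checked directly in base R. The sketch below uses made-up values and does not call the package; it only shows that a weighted mean with integer frequency counts equals the ordinary mean of the expanded vector, and that rescaling the weights to sum to length(x), as described for normwt=TRUE, leaves the weighted mean unchanged.

```r
x <- c(2, 5, 9)
w <- c(3, 1, 2)                      # hypothetical frequency counts

wmean    <- sum(w * x) / sum(w)      # weighted mean
expanded <- rep(x, times = w)        # x repeated according to its counts

# Rescale weights to sum to length(x) (the effect described for normwt=TRUE)
w_norm     <- w * length(x) / sum(w)
wmean_norm <- sum(w_norm * x) / sum(w_norm)

c(weighted = wmean, expanded = mean(expanded), rescaled = wmean_norm)
```

All three values agree for the mean; rescaling matters for estimators whose formulas involve the total weight, such as the variance.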

wtd.mean, wtd.var, and wtd.quantile compute weighted means, variances, and quantiles, respectively. wtd.Ecdf computes a weighted empirical distribution function. wtd.table computes a weighted frequency table (although only one stratification variable is supported at present). wtd.rank computes weighted ranks, using mid-ranks for ties. This can be used to obtain Wilcoxon tests and rank correlation coefficients. wtd.loess.noiter is a weighted version of loess.smooth when no iterations for outlier rejection are desired. This results in especially good smoothing when y is binary. wtd.quantile removes any observations with zero weight at the beginning. Previously, these were changing the quantile estimates.

num.denom.setup is a utility function that allows one to deal with observations containing numbers of events and numbers of trials, by outputting two observations when the number of events and non-events (trials - events) exceed zero. A vector of subscripts is generated that will do the proper duplications of observations, and a new binary variable y is created along with usual cell frequencies (weights) for each of the y=0, y=1 cells per observation.
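
The restructuring that num.denom.setup is described as performing can be sketched in base R. The function expand_events below is hypothetical and is not the package's implementation: its name, the order of its output rows, and its handling of NAs (records with a missing count are skipped) are all assumptions made for illustration.

```r
# Hypothetical helper: one row per non-empty (observation, outcome) cell
expand_events <- function(num, denom) {
  stopifnot(length(num) == length(denom))
  keep      <- which(!is.na(num) & !is.na(denom))  # assumption: skip incomplete records
  events    <- num[keep]
  nonevents <- denom[keep] - num[keep]
  has_ev    <- events > 0
  has_non   <- nonevents > 0
  out <- rbind(
    data.frame(subs = keep[has_ev],  y = rep(1, sum(has_ev)),
               weights = events[has_ev]),
    data.frame(subs = keep[has_non], y = rep(0, sum(has_non)),
               weights = nonevents[has_non])
  )
  out[order(out$subs, -out$y), ]   # keep the rows of each observation together
}

# Data are 10/12, NA/999, 20/20, 0/25, 15/35
w <- expand_events(c(10, NA, 20, 0, 15), c(12, 999, 20, 25, 35))
w
```

Observation 3 (20/20) contributes only a y=1 row and observation 4 (0/25) only a y=0 row, so the four complete records yield six rows whose weights sum to the total number of trials.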

Usage

wtd.mean(x, weights=NULL, normwt="ignored", na.rm=TRUE)
wtd.var(x, weights=NULL, normwt=FALSE, na.rm=TRUE,
        method=c('unbiased', 'ML'))
wtd.quantile(x, weights=NULL, probs=c(0, .25, .5, .75, 1),
             type=c('quantile','(i-1)/(n-1)','i/(n+1)','i/n'),
             normwt=FALSE, na.rm=TRUE)
wtd.Ecdf(x, weights=NULL,
         type=c('i/n','(i-1)/(n-1)','i/(n+1)'),
         normwt=FALSE, na.rm=TRUE)
wtd.table(x, weights=NULL, type=c('list','table'),
          normwt=FALSE, na.rm=TRUE)
wtd.rank(x, weights=NULL, normwt=FALSE, na.rm=TRUE)
wtd.loess.noiter(x, y, weights=rep(1,n),
                 span=2/3, degree=1, cell=.13333,
                 type=c('all','ordered all','evaluate'),
                 evaluation=100, na.rm=TRUE)
num.denom.setup(num, denom)

Arguments

x

a numeric vector (may be a character or category or factor vector for wtd.table)

num

vector of numerator frequencies

denom

vector of denominators (numbers of trials)

weights

a numeric vector of weights

normwt

specify normwt=TRUE to make weights sum to length(x) after deletion of NAs. If weights are frequency weights, then normwt should be FALSE, and if weights are normalization (aka reliability) weights, then normwt should be TRUE. In the case of the former, no check is made that weights are valid frequencies.

na.rm

set to FALSE to suppress checking for NAs

method

determines the estimator type; if 'unbiased' (the default) then the usual unbiased estimate (using Bessel's correction) is returned, if 'ML' then it is the maximum likelihood estimate for a Gaussian distribution. In the case of the latter, the normwt argument has no effect. Uses stats::cov.wt for both methods.

probs

a vector of quantiles to compute. Default is 0 (min), .25, .5, .75, 1(max).

type

For wtd.quantile, type defaults to quantile to use the same interpolated order statistic method as quantile. Set type to "(i-1)/(n-1)", "i/(n+1)", or "i/n" to use the inverse of the empirical distribution function, using, respectively, (wt - 1)/T, wt/(T+1), or wt/T, where wt is the cumulative weight and T is the total weight (usually total sample size). These three values of type are the possibilities for wtd.Ecdf. For wtd.table the default type is "list", meaning that the function is to return a list containing two vectors: x is the sorted unique values of x and sum.of.weights is the sum of weights for that x. This is the default so that you don't have to convert the names attribute of the result that can be obtained with type="table" to a numeric variable when x was originally numeric. type="table" for wtd.table results in an object that is the same structure as those returned from table. For wtd.loess.noiter the default type is "all", indicating that the function is to return a list containing all the original values of x (including duplicates and without sorting) and the smoothed y values corresponding to them. Set type="ordered all" to sort by x, and type="evaluate" to evaluate the smooth only at evaluation equally spaced points between the observed limits of x.
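
The "i/n" and "i/(n+1)" definitions above (wt/T and wt/(T+1), with wt the cumulative weight and T the total weight) can be worked through by hand in base R. This is an illustrative sketch with made-up data, not the package code; it leaves out the other options and does not add the leading (min(x), 0) point mentioned under Value.

```r
x <- c(3, 1, 3, 2)
w <- c(1, 2, 1, 4)               # hypothetical weights

sw   <- tapply(w, x, sum)        # combine weights of duplicate x values
ux   <- as.numeric(names(sw))    # sorted unique values of x
cumw <- cumsum(as.numeric(sw))   # cumulative weight wt
Tot  <- sum(w)                   # total weight T

ecdf_i_n  <- cumw / Tot          # type "i/n":     wt / T
ecdf_i_n1 <- cumw / (Tot + 1)    # type "i/(n+1)": wt / (T + 1)

data.frame(x = ux, sum.of.weights = as.numeric(sw), ecdf_i_n, ecdf_i_n1)
```

The intermediate columns x and sum.of.weights have the same meaning as the two vectors described for wtd.table with type="list".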

y

a numeric vector the same length as x

span, degree, cell, evaluation

see loess.smooth. The default is linear (degree=1) and 100 points of evaluation (if type="evaluate").

Details

The functions correctly combine weights of observations having duplicate values of x before computing estimates.

When normwt=FALSE the weighted variance will not equal the unweighted variance even if the weights are identical. That is because of the subtraction of 1 from the sum of the weights in the denominator of the variance formula. If you want the weighted variance to equal the unweighted variance when weights do not vary, use normwt=TRUE. The articles by Gatz and Smith discuss alternative approaches, to arrive at estimators of the standard error of a weighted mean.
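
The effect of the "sum of the weights minus 1" denominator can be seen with a small base R calculation. The function wvar below is a hypothetical stand-in for the frequency-weight variance formula described in this paragraph, not the package function, and it is only meant for the constant-weight case discussed here (for weights that vary, the documentation states that stats::cov.wt is used, which this sketch does not reproduce).

```r
set.seed(3)
x <- rnorm(20)
w <- rep(2, length(x))           # identical weights

# Weighted variance with sum(w) - 1 in the denominator
wvar <- function(x, w) {
  xbar <- sum(w * x) / sum(w)
  sum(w * (x - xbar)^2) / (sum(w) - 1)
}

v_freq <- wvar(x, w)                          # weights used as given
v_norm <- wvar(x, w * length(x) / sum(w))     # weights rescaled to sum to length(x)

c(unweighted = var(x), as_given = v_freq, rescaled = v_norm)
```

With the weights as given, the result equals the variance of x with every value repeated twice, which differs from var(x); after rescaling the weights to sum to length(x), the result equals var(x).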

wtd.rank does not handle NAs as elegantly as rank if weights is specified.

Value

wtd.mean and wtd.var return scalars. wtd.quantile returns a vector the same length as probs. wtd.Ecdf returns a list whose elements x and Ecdf correspond to unique sorted values of x. If the first CDF estimate is greater than zero, a point (min(x),0) is placed at the beginning of the estimates. See above for wtd.table. wtd.rank returns a vector the same length as x (after removal of NAs, depending on na.rm). See above for wtd.loess.noiter.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com
Benjamin Tyner
btyner@gmail.com

References

Research Triangle Institute (1995): SUDAAN User's Manual, Release 6.40, pp. 8-16 to 8-17.

Gatz DF, Smith L (1995): The standard error of a weighted mean concentration-I. Bootstrapping vs other methods. Atmospheric Env 11:1185-1193.

Gatz DF, Smith L (1995): The standard error of a weighted mean concentration-II. Estimating confidence intervals. Atmospheric Env 29:1195-1200.

https://en.wikipedia.org/wiki/Weighted_arithmetic_mean

See Also

mean, var, quantile, table, rank, loess.smooth, lowess, plsmo, Ecdf, somers2, describe

Examples

set.seed(1)
x <- runif(500)
wts <- sample(1:6, 500, TRUE)
std.dev <- sqrt(wtd.var(x, wts))
wtd.quantile(x, wts)
death <- sample(0:1, 500, TRUE)
plot(wtd.loess.noiter(x, death, wts, type='evaluate'))
describe(~x, weights=wts)
# describe uses wtd.mean, wtd.quantile, wtd.table
xg <- cut2(x,g=4)
table(xg)
wtd.table(xg, wts, type='table')

# Here is a method for getting stratified weighted means
y <- runif(500)
g <- function(y) wtd.mean(y[,1],y[,2])
summarize(cbind(y, wts), llist(xg), g, stat.name='y')

# Empirically determine how methods used by wtd.quantile match with
# methods used by quantile, when all weights are unity
set.seed(1)
u <-  eval(formals(wtd.quantile)$type)
v <- as.character(1:9)
r <- matrix(0, nrow=length(u), ncol=9, dimnames=list(u,v))

for(n in c(8, 13, 22, 29))
  {
    x <- rnorm(n)
    for(i in 1:5) {
      probs <- sort( runif(9))
      for(wtype in u) {
        wq <- wtd.quantile(x, type=wtype, weights=rep(1,length(x)), probs=probs)
        for(qtype in 1:9) {
          rq <- quantile(x, type=qtype, probs=probs)
          r[wtype, qtype] <- max(r[wtype,qtype], max(abs(wq-rq)))
        }
      }
    }
  }

r

# Restructure data to generate a dichotomous response variable
# from records containing numbers of events and numbers of trials
num   <- c(10,NA,20,0,15)   # data are 10/12 NA/999 20/20 0/25 15/35
denom <- c(12,999,20,25,35)
w     <- num.denom.setup(num, denom)
w
# attach(my.data.frame[w$subs,])

xyplot and dotplot with Matrix Variables to Plot Error Bars and Bands

Description

A utility function Cbind returns the first argument as a vector and combines all other arguments into a matrix stored as an attribute called "other". The arguments can be named (e.g., Cbind(pressure=y,ylow,yhigh)) or a label attribute may be pre-attached to the first argument. In either case, the name or label of the first argument is stored as an attribute "label" of the object returned by Cbind. Storing other vectors as a matrix attribute facilitates plotting error bars, etc., as trellis really wants the x- and y-variables to be vectors, not matrices. If a single argument is given to Cbind and that argument is a matrix with column dimnames, the first column is taken as the main vector and remaining columns are taken as "other". A subscript method for Cbind objects subscripts the other matrix along with the main y vector.

The xYplot function is a substitute for xyplot that allows for simulated multi-column y. It uses by default the panel.xYplot and prepanel.xYplot functions to do the actual work. The method argument passed to panel.xYplot from xYplot allows you to make error bars, the upper-only or lower-only portions of error bars, alternating lower-only and upper-only bars, bands, or filled bands. panel.xYplot decides how to alternate upper and lower bars according to whether the median y value of the current main data line is above the median y for all groups of lines or not. If the median is above the overall median, only the upper bar is drawn. For bands (but not 'filled bands'), any number of other columns of y will be drawn as lines having the same thickness, color, and type as the main data line. If plotting bars, bands, or filled bands and only one additional column is specified for the response variable, that column is taken as the half width of a precision interval for y, and the lower and upper values are computed automatically as y plus or minus the value of the additional column variable.

When a groups variable is present, panel.xYplot will create a function in frame 0 (.GlobalEnv in R) called Key that when invoked will draw a key describing the groups labels, point symbols, and colors. By default, the key is outside the graph. For S-Plus, if Key(locator(1)) is specified, the key will appear so that its upper left corner is at the coordinates of the mouse click. For R/Lattice the first two arguments of Key (x and y) are fractions of the page, measured from the lower left corner, and the default placement is at x=0.05, y=0.95. For R, an optional argument to sKey, other, may contain a list of arguments to pass to draw.key (see xyplot for a list of possible arguments, under the key option).

When method="quantile" is specified, xYplot automatically groups the x variable into intervals containing a target of nx observations each, and within each x group computes three quantiles of y and plots these as three lines. The mean x within each x group is taken as the x-coordinate. This will make a useful empirical display for large datasets in which scatterdiagrams are too busy to see patterns of central tendency and variability. You can also specify a general function of a data vector that returns a matrix of statistics for the method argument. Arguments can be passed to that function via a list methodArgs. The statistic in the first column should be the measure of central tendency. Examples of useful method functions are those listed under the help file for summary.formula such as smean.cl.normal.
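
The quantile display described above can be approximated with base R graphics. The sketch below is not how xYplot is implemented: the data are simulated, the value of nx is arbitrary, and the rule used to form x groups (consecutive runs of nx sorted observations) is an assumption, since the package may choose its intervals differently.

```r
set.seed(4)
x  <- runif(300)
y  <- x + rnorm(300, sd = 0.3)
nx <- 50                          # hypothetical target number of observations per group

# Assign sorted x values to consecutive groups of about nx observations
grp <- ceiling(rank(x, ties.method = "first") / nx)

xmean <- tapply(x, grp, mean)     # mean x within each group is the x-coordinate
q <- t(sapply(split(y, grp), quantile, probs = c(0.5, 0.25, 0.75)))

# Three lines: median, lower quartile, upper quartile of y within each x group
matplot(xmean, q, type = "b", lty = c(1, 2, 2), pch = 1, col = 1,
        xlab = "x", ylab = "y")
```

The probabilities used here match the default probs=c(.5,.25,.75) shown in the usage of panel.xYplot, with the measure of central tendency in the first column.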

xYplot can also produce bubble plots. This is done when size is specified to xYplot. When size is used, a function sKey is generated for drawing a key to the character sizes. See the bubble plot example. size can also specify a vector where the first character of each observation is used as the plotting symbol, if rangeCex is set to a single cex value. An optional argument to sKey, other, may contain a list of arguments to pass to draw.key (see xyplot for a list of possible arguments, under the key option). See the bubble plot example.

Dotplot is a substitute for dotplot allowing for a matrix x-variable, automatic superpositioning when groups is present, and creation of a Key function. When the x-variable (created by Cbind to simulate a matrix) contains a total of 3 columns, the first column specifies where the dot is positioned, and the last 2 columns specify starting and ending points for intervals. The intervals are shown using line type, width, and color from the trellis plot.line list. By default, you will usually see a darker line segment for the low and high values, with the dotted reference line elsewhere. A good choice of the pch argument for such plots is 3 (plus sign) if you want to emphasize the interval more than the point estimate. When the x-variable contains a total of 5 columns, the 2nd and 5th columns are treated as the 2nd and 3rd are treated above, and the 3rd and 4th columns define an inner line segment that will have twice the thickness of the outer segments. In addition, tick marks separate the outer and inner segments. This type of display (an example of which appeared in The Elements of Graphing Data by Cleveland) is very suitable for displaying two confidence levels (e.g., 0.9 and 0.99) or the 0.05, 0.25, 0.75, 0.95 sample quantiles, for example. For this display, the central point displays well with a default circle symbol.

setTrellis sets nice defaults for Trellis graphics, assuming that the graphics device has already been opened if using postscript, etc. By default, it sets panel strips to blank and reference dot lines to thickness 1 instead of the Trellis default of 2.

numericScale is a utility function that facilitates using xYplot to plot variables that are not considered to be numeric but which can readily be converted to numeric using as.numeric(). numericScale by default will keep the name of the input variable as a label attribute for the new numeric variable.

Usage

Cbind(...)

xYplot(formula, data = sys.frame(sys.parent()), groups,
       subset, xlab=NULL, ylab=NULL, ylim=NULL,
       panel=panel.xYplot, prepanel=prepanel.xYplot, scales=NULL,
       minor.ticks=NULL, sub=NULL, ...)

panel.xYplot(x, y, subscripts, groups=NULL,
             type=if(is.function(method) || method=='quantiles')
               'b' else 'p',
             method=c("bars", "bands", "upper bars", "lower bars",
                      "alt bars", "quantiles", "filled bands"),
             methodArgs=NULL, label.curves=TRUE, abline,
             probs=c(.5,.25,.75), nx=NULL,
             cap=0.015, lty.bar=1,
             lwd=plot.line$lwd, lty=plot.line$lty, pch=plot.symbol$pch,
             cex=plot.symbol$cex, font=plot.symbol$font, col=NULL,
             lwd.bands=NULL, lty.bands=NULL, col.bands=NULL,
             minor.ticks=NULL, col.fill=NULL,
             size=NULL, rangeCex=c(.5,3), ...)

prepanel.xYplot(x, y, ...)

Dotplot(formula, data = sys.frame(sys.parent()), groups, subset,
        xlab = NULL, ylab = NULL, ylim = NULL,
        panel=panel.Dotplot, prepanel=prepanel.Dotplot,
        scales=NULL, xscale=NULL, ...)

prepanel.Dotplot(x, y, ...)

panel.Dotplot(x, y, groups = NULL,
              pch  = dot.symbol$pch,
              col  = dot.symbol$col, cex = dot.symbol$cex,
              font = dot.symbol$font, abline, ...)

setTrellis(strip.blank=TRUE, lty.dot.line=2, lwd.dot.line=1)

numericScale(x, label=NULL, ...)

Arguments

...

for Cbind, ... is any number of additional numeric vectors. Unless you are using Dotplot (which allows for either 2 or 4 "other" variables) or xYplot with method="bands", vectors after the first two are ignored. If drawing bars and only one extra variable is given in ..., upper and lower values are computed as described above. If the second argument to Cbind is a matrix, that matrix is stored in the "other" attribute and arguments after the second are ignored. For bubble plots, name an argument cex.

... can also be other arguments to pass to labcurve.

formula

a trellis formula consistent with xyplot or dotplot

x

x-axis variable. For numericScale, x is any vector such that as.numeric(x) returns a numeric vector suitable for x- or y-coordinates.

y

a vector, or an object created by Cbind for xYplot. y represents the main variable to plot, i.e., the variable used to draw the main lines. For Dotplot the first argument to Cbind will be the main x-axis variable.

data, subset, ylim, subscripts, groups, type, scales, panel, prepanel, xlab, ylab

see trellis.args. xlab and ylab get default values from "label" attributes.

xscale

allows one to use the default scales but specify only the x component of it for Dotplot

method

defaults to "bars" to draw error-bar type plots. See meaning of other values above. method can be a function. Specifying method=quantile, methodArgs=list(probs=c(.5,.25,.75)) is the same as specifying method="quantile" without specifying probs.

methodArgs

a list containing optional arguments to be passed to the function specified in method

label.curves

set to FALSE to suppress invocation of labcurve to label primary curves where they are most separated or to draw a legend in an empty spot on the panel. You can also set label.curves to a list of options to pass to labcurve. These options can also be passed as ... to xYplot. See the examples below.

abline

a list of arguments to pass to panel.abline for each panel, e.g. list(a=0, b=1, col=3) to draw the line of identity using color 3. To make multiple calls to panel.abline, pass a list of unnamed lists as abline, e.g., abline=list(list(h=0),list(v=1)).

probs

a vector of three quantiles with the quantile corresponding to the central line listed first. By default probs=c(.5, .25, .75). You can also specify probs through methodArgs=list(probs=...).

nx

number of target observations for each x group (see the m argument of cut2). nx defaults to the minimum of 40 and the number of points in the current stratum divided by 4. Set nx=FALSE or nx=0 if x is already discrete and requires no grouping.

cap

the half-width of horizontal end pieces for error bars, as a fraction of the length of the x-axis

lty.bar

line type for bars

lwd, lty, pch, cex, font, col

see trellis.args. These are vectors when groups is present, and the order of their elements corresponds to the different groups, regardless of how many bands or bars are drawn. If you don't specify lty.bands, for example, all band lines within each group will have the same lty.

lty.bands, lwd.bands, col.bands

used to allow lty, lwd, col to vary across the different band lines for different groups. These parameters are vectors or lists whose elements correspond to the added band lines (i.e., they ignore the central line, whose line characteristics are defined by lty, lwd, col). For example, suppose that 4 lines are drawn in addition to the central line. Specifying lwd.bands=1:4 will cause line widths of 1:4 to be used for every group, regardless of the value of lwd. To vary characteristics over the groups use e.g. lwd.bands=list(rep(1,4), rep(2,4)) or list(c(1,2,1,2), c(3,4,3,4)).

minor.ticks

a list with elements at and labels specifying positions and labels for minor tick marks to be used on the x-axis of each panel, if any.

sub

an optional subtitle

col.fill

used to override default colors used for the bands in method='filled bands'. This is a vector when groups is present, and the order of the elements corresponds to the different groups, regardless of how many bands are drawn. The default colors for 'filled bands' are pastel colors matching the default colors superpose.line$col (plot.line$col)

size

a vector the same length as x giving a variable whose values are a linear function of the size of the symbol drawn. This is used for example for bubble plots.

rangeCex

a vector of two values specifying the range in character sizes to use for the size variable (lowest first, highest second). size values are linearly translated to this range, based on the observed range of size when x and y coordinates are not missing. Specify a single numeric cex value for rangeCex to use the first character of each observation's size as the plotting symbol.

strip.blank

set to FALSE to not make the panel strip backgrounds blank

lty.dot.line

line type for dot plot reference lines (default = 2, as shown in the usage; in R's line-type numbering 1 is solid, 2 is dashed, and 3 is dotted)

lwd.dot.line

line thickness for reference lines for dot plots (default = 1)

label

a scalar character string to be used as a variable label after numericScale converts the variable to numeric form
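
The linear translation described for size and rangeCex can be illustrated with a small base-R sketch. The helper scale_size is hypothetical and is not part of Hmisc; it only mirrors the rule stated above (map the observed range of size, among points with non-missing x and y, onto rangeCex) and may differ from what panel.xYplot does internally.

```r
# Hypothetical helper, not an Hmisc function: map `size` linearly onto rangeCex
scale_size <- function(size, x, y, rangeCex = c(0.5, 3)) {
  ok <- !is.na(x) & !is.na(y)          # only points that can actually be drawn
  r  <- range(size[ok], na.rm = TRUE)  # observed range of size among those points
  rangeCex[1] + (size - r[1]) / diff(r) * diff(rangeCex)
}

# Same toy data as the bubble plot example in this help page
x <- y <- 1:8
x[2] <- NA
z <- 101:108
cex <- scale_size(z, x, y)
cex
```

The smallest observed size maps to the first element of rangeCex and the largest to the second; everything else falls proportionally in between.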

Details

Unlike xyplot, xYplot senses the presence of a groups variable and automatically invokes panel.superpose instead of panel.xyplot. The same is true for Dotplot vs. dotplot.

Value

Cbind returns a matrix with attributes. Other functions return standard trellis results.

Side Effects

plots, and panel.xYplot may create temporary Key and sKey functions in the session frame.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
Madeline Bauer
Department of Infectious Diseases
University of Southern California School of Medicine
mbauer@usc.edu

See Also

xyplot, panel.xyplot, summarize, label, labcurve, errbar, dotplot, reShape, cut2, panel.abline

Examples

# Plot 6 smooth functions.  Superpose 3, panel 2.
# Label curves with p=1,2,3 where most separated
d <- expand.grid(x=seq(0,2*pi,length=150), p=1:3, shift=c(0,pi))
xYplot(sin(x+shift)^p ~ x | shift, groups=p, data=d, type='l')
# Use a key instead, use 3 line widths instead of 3 colors
# Put key in most empty portion of each panel
xYplot(sin(x+shift)^p ~ x | shift, groups=p, data=d,
       type='l', keys='lines', lwd=1:3, col=1)
# Instead of implicitly using labcurve(), put a
# single key outside of panels at lower left corner
xYplot(sin(x+shift)^p ~ x | shift, groups=p, data=d,
       type='l', label.curves=FALSE, lwd=1:3, col=1, lty=1:3)
Key()

# Bubble plots
x <- y <- 1:8
x[2] <- NA
units(x) <- 'cm^2'
z <- 101:108
p <- factor(rep(c('a','b'),4))
g <- c(rep(1,7),2)
data.frame(p, x, y, z, g)
xYplot(y ~ x | p, groups=g, size=z)
Key(other=list(title='g', cex.title=1.2))   # draw key for colors
sKey(.2,.85,other=list(title='Z Values', cex.title=1.2))
# draw key for character sizes

# Show the median and quartiles of height given age, stratified
# by sex and race.  Draws 2 sets (male, female) of 3 lines per panel.
# xYplot(height ~ age | race, groups=sex, method='quantiles')

# Examples of plotting raw data
dfr <- expand.grid(month=1:12, continent=c('Europe','USA'),
                   sex=c('female','male'))
set.seed(1)
dfr <- upData(dfr,
              y=month/10 + 1*(sex=='female') + 2*(continent=='Europe') +
                runif(48,-.15,.15),
              lower=y - runif(48,.05,.15),
              upper=y + runif(48,.05,.15))

xYplot(Cbind(y,lower,upper) ~ month,subset=sex=='male' & continent=='USA',
       data=dfr)
xYplot(Cbind(y,lower,upper) ~ month|continent, subset=sex=='male',data=dfr)
xYplot(Cbind(y,lower,upper) ~ month|continent, groups=sex, data=dfr); Key()
# add ,label.curves=FALSE to suppress use of labcurve to label curves where
# farthest apart

xYplot(Cbind(y,lower,upper) ~ month,groups=sex,
       subset=continent=='Europe', data=dfr)
xYplot(Cbind(y,lower,upper) ~ month,groups=sex, type='b',
       subset=continent=='Europe', keys='lines',
       data=dfr)
# keys='lines' causes labcurve to draw a legend where the panel is most empty

xYplot(Cbind(y,lower,upper) ~ month,groups=sex, type='b', data=dfr,
       subset=continent=='Europe',method='bands')
xYplot(Cbind(y,lower,upper) ~ month,groups=sex, type='b', data=dfr,
       subset=continent=='Europe',method='upper')

label(dfr$y) <- 'Quality of Life Score'
# label is in Hmisc library = attr(y,'label') <- 'Quality...'; will be
# y-axis label
# can also specify Cbind('Quality of Life Score'=y,lower,upper)
xYplot(Cbind(y,lower,upper) ~ month, groups=sex,
       subset=continent=='Europe', method='alt bars',
       offset=grid::unit(.1,'inches'), type='b', data=dfr)
# offset passed to labcurve to label .4 y units away from curve
# for R (using grid/lattice), offset is specified using the grid
# unit function, e.g., offset=grid::unit(.4,'native') or
# offset=grid::unit(.1,'inches') or grid::unit(.05,'npc')

# The following example uses the summarize function in Hmisc to
# compute the median and outer quartiles.  The outer quartiles are
# displayed using "error bars"
set.seed(111)
dfr <- expand.grid(month=1:12, year=c(1997,1998), reps=1:100)
month <- dfr$month; year <- dfr$year
y <- abs(month-6.5) + 2*runif(length(month)) + year-1997
s <- summarize(y, llist(month,year), smedian.hilow, conf.int=.5)
xYplot(Cbind(y,Lower,Upper) ~ month, groups=year, data=s,
       keys='lines', method='alt', type='b')
# Can also do:
s <- summarize(y, llist(month,year), quantile, probs=c(.5,.25,.75),
               stat.name=c('y','Q1','Q3'))
xYplot(Cbind(y, Q1, Q3) ~ month, groups=year, data=s,
       type='b', keys='lines')
# Or:
xYplot(y ~ month, groups=year, keys='lines', nx=FALSE, method='quantile',
       type='b')
# nx=FALSE means to treat month as a discrete variable

# To display means and bootstrapped nonparametric confidence intervals
# use:
s <- summarize(y, llist(month,year), smean.cl.boot)
s
xYplot(Cbind(y, Lower, Upper) ~ month | year, data=s, type='b')
# Can also use Y <- cbind(y, Lower, Upper); xYplot(Cbind(Y) ~ ...)
# Or:
xYplot(y ~ month | year, nx=FALSE, method=smean.cl.boot, type='b')

# This example uses the summarize function in Hmisc to
# compute the median and outer quartiles.  The outer quartiles are
# displayed using "filled bands"
s <- summarize(y, llist(month,year), smedian.hilow, conf.int=.5)

# filled bands: default fill = pastel colors matching solid colors
# in superpose.line (this works differently in R)
xYplot ( Cbind ( y, Lower, Upper ) ~ month, groups=year,
     method="filled bands" , data=s, type="l")

# note colors based on levels of selected subgroups, not first two colors
xYplot ( Cbind ( y, Lower, Upper ) ~ month, groups=year,
     method="filled bands" , data=s, type="l",
     subset=(year == 1998 | year == 2000), label.curves=FALSE )

# filled bands using black lines with selected solid colors for fill
xYplot ( Cbind ( y, Lower, Upper ) ~ month, groups=year,
     method="filled bands" , data=s, label.curves=FALSE,
     type="l", col=1, col.fill = 2:3)
Key(.5,.8,col = 2:3) #use fill colors in key

# A good way to check for stable variance of residuals from ols
# xYplot(resid(fit) ~ fitted(fit), method=smean.sdl)
# smean.sdl is defined with summary.formula in Hmisc

# Plot y vs. a special variable x
# xYplot(y ~ numericScale(x, label='Label for X') | country)
# For this example could omit label= and specify
#    y ~ numericScale(x) | country, xlab='Label for X'

# Here is an example of using xYplot with several options
# to change various Trellis parameters,
# xYplot(y ~ x | z, groups=v, pch=c('1','2','3'),
#        layout=c(3,1),     # 3 panels side by side
#        ylab='Y Label', xlab='X Label',
#        main=list('Main Title', cex=1.5),
#        par.strip.text=list(cex=1.2),
#        strip=function(...) strip.default(..., style=1),
#        scales=list(alternating=FALSE))

#
# Dotplot examples
#
s <- summarize(y, llist(month,year), smedian.hilow, conf.int=.5)

setTrellis()            # blank conditioning panel backgrounds
Dotplot(month ~ Cbind(y, Lower, Upper) | year, data=s)
# or Cbind(...), groups=year, data=s

# Display a 5-number (5-quantile) summary (2 intervals, dot=median)
# Note that summarize produces a matrix for y, and Cbind(y) trusts the
# first column to be the point estimate (here the median)
s <- summarize(y, llist(month,year), quantile,
               probs=c(.5,.05,.25,.75,.95), type='matrix')
Dotplot(month ~ Cbind(y) | year, data=s)
# Use factor(year) to make actual years appear in conditioning title strips

# Plot proportions and their Wilson confidence limits
set.seed(3)
d <- expand.grid(continent=c('USA','Europe'), year=1999:2001,
                 reps=1:100)
# Generate binary events from a population probability of 0.2
# of the event, same for all years and continents
d$y <- ifelse(runif(6*100) <= .2, 1, 0)
s <- with(d,
          summarize(y, llist(continent,year),
                    function(y) {
                     n <- sum(!is.na(y))
                     s <- sum(y, na.rm=TRUE)
                     binconf(s, n)
                    }, type='matrix'))

Dotplot(year ~ Cbind(y) | continent,  data=s, ylab='Year',
        xlab='Probability')

# Dotplot(z ~ x | g1*g2)                 # 2-way conditioning
# Dotplot(z ~ x | g1, groups=g2); Key()  # Key defines symbols for g2

# If the data are organized so that the mean, lower, and upper
# confidence limits are in separate records, the Hmisc reShape
# function is useful for assembling these 3 values as 3 variables
# in a single observation, e.g., assuming type has values such as
# c('Mean','Lower','Upper'):
# a <- reShape(y, id=month, colvar=type)
# This will make a matrix with 3 columns named Mean Lower Upper
# and with 1/3 as many rows as the original data

Auxiliary Function Method for Sorting and Ranking

Description

An auxiliary method that works around a bug in how the implementation of xtfrm handles inheritance.

Usage

## S3 method for class 'labelled'
xtfrm(x)

Arguments

x

any object of class labelled.

See Also

xtfrm


Mean x vs. function of y in groups of x

Description

Compute mean x vs. a function of y (e.g. median) by quantile groups of x or by x grouped to have a given minimum number of observations. Deletes NAs in x and y before doing computations.

Usage

xy.group(x, y, m=150, g, fun=mean, result="list")

Arguments

x

a vector, may contain NAs

y

a vector of same length as x, may contain NAs

m

number of observations per group

g

number of quantile groups

fun

function of y such as median or mean (the default)

result

"list" (the default), or "matrix"

Value

If result="list", a list with components x and y suitable for plotting. If result="matrix", a matrix with rows corresponding to x-groups and columns named n, x, and y.

See Also

cut2, cutGn, tapply

Examples

## Not run: 
plot(xy.group(x, y, g=10))              # Plot mean y by deciles of x
xy.group(x, y, m=100, result="matrix")  # Print table, 100 obs/group

## End(Not run)
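
Because the example above is not run and assumes x and y already exist, the following base-R sketch illustrates the kind of summary the Description refers to: mean x against a function of y within quantile groups of x, after dropping NAs. It uses cut and tapply rather than xy.group itself, so the group boundaries are an assumption and need not match those chosen by cut2 inside Hmisc; the simulated data are purely illustrative.

```r
# Illustrative data (assumed, not from the package)
set.seed(1)
x <- runif(200)
y <- 2 * x + rnorm(200, sd = 0.1)
x[5] <- NA                                   # show that NAs are dropped first

ok <- !is.na(x) & !is.na(y)
x <- x[ok]; y <- y[ok]

# Ten quantile groups of x (deciles), built with base R instead of cut2
breaks <- unique(quantile(x, probs = seq(0, 1, length.out = 11)))
grp <- cut(x, breaks, include.lowest = TRUE)

# Mean x and median y per group, in the list form suitable for plot()
res <- list(x = as.vector(tapply(x, grp, mean)),
            y = as.vector(tapply(y, grp, median)))
plot(res, xlab = "mean x in decile", ylab = "median y")
```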

Get Number of Days in Year or Month

Description

Returns the number of days in a specific year or month.

Usage

yearDays(time)

monthDays(time)

Arguments

time

A POSIXt or Date object describing the month or year in question.

Author(s)

Charles Dupont

See Also

POSIXt, Date
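
This help page has no examples, so here is a base-R sketch of the quantities involved. The helpers days_in_year and days_in_month are hypothetical names written for illustration; they are not the Hmisc implementation. With Hmisc loaded, the documented calls would be yearDays(d) and monthDays(d).

```r
# Hypothetical base-R equivalents, for illustration only
days_in_year <- function(time) {
  yr <- as.POSIXlt(time)$year + 1900
  leap <- (yr %% 4 == 0 & yr %% 100 != 0) | yr %% 400 == 0   # Gregorian rule
  ifelse(leap, 366L, 365L)
}

days_in_month <- function(time) {
  lt <- as.POSIXlt(time)
  y <- lt$year + 1900
  m <- lt$mon + 1
  # First day of this month and of the following month
  first <- as.Date(sprintf("%d-%02d-01", as.integer(y), as.integer(m)))
  nxt_y <- ifelse(m == 12, y + 1, y)
  nxt_m <- ifelse(m == 12, 1, m + 1)
  nxt <- as.Date(sprintf("%d-%02d-01", as.integer(nxt_y), as.integer(nxt_m)))
  as.integer(nxt - first)
}

d <- as.Date(c("2024-02-10", "2023-12-31"))
days_in_year(d)
days_in_month(d)
```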


Combine Variables in a Matrix

Description

ynbind column binds a series of related yes/no variables, allowing for a final argument label used to label the panel created for the group. labels for individual variables are collected into a vector attribute "labels" for the result; original variable names are used in place of labels for those variables without labels. A positive response is taken to be y, yes, present (ignoring case) or a logical TRUE value. By default, the columns are sorted in ascending order of the overall proportion of positives. A subsetting method is provided for objects of class "ynbind".

pBlock creates a matrix similarly labeled, from a general set of variables (without special handling of binaries), and sets to NA any observation not in subset so that when that block of variables is analyzed it will be only for that subset.

Usage

ynbind(..., label = deparse(substitute(...)),
       asna = c("unknown", "unspecified"), sort = TRUE)

pBlock(..., subset=NULL, label = deparse(substitute(...)))

Arguments

...

a series of vectors

label

a label for the group, to be attached to the resulting matrix as a "label" attribute, used by summaryP.

asna

a vector of character strings specifying levels that are to be treated the same as NA if present

sort

set to FALSE to not sort the columns by their proportions

subset

subset criteria - either a vector of logicals or subscripts

Value

a matrix of class "ynbind" or "pBlock" with "label" and "labels" attributes. For "pBlock", factor input vectors will have values converted to character.

Author(s)

Frank Harrell

See Also

summaryP

Examples

x1 <- c('yEs', 'no', 'UNKNOWN', NA)
x2 <- c('y', 'n', 'no', 'present')
label(x2) <- 'X2'
X <- ynbind(x1, x2, label='x1-2')
X[1:3,]

pBlock(x1, x2, subset=2:3, label='x1-2')
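
To make the coding rule from the Description concrete, the base-R sketch below applies it to the same x1 and x2 without calling ynbind. The helper is_positive is hypothetical and not part of Hmisc; matching the asna levels case-insensitively is an assumption made here so that 'UNKNOWN' is treated as missing, and ynbind itself may handle details differently.

```r
# Hypothetical helper, not an Hmisc function: code a yes/no variable as logical
is_positive <- function(v, asna = c("unknown", "unspecified")) {
  if (is.logical(v)) return(v)
  s <- tolower(as.character(v))        # ignore case
  s[s %in% asna] <- NA                 # assumed: asna compared after lowercasing
  ifelse(is.na(s), NA, s %in% c("y", "yes", "present"))
}

x1 <- c('yEs', 'no', 'UNKNOWN', NA)
x2 <- c('y', 'n', 'no', 'present')

m <- cbind(x1 = is_positive(x1), x2 = is_positive(x2))
# Sort columns by ascending proportion of positives, as described for sort=TRUE
m <- m[, order(colMeans(m, na.rm = TRUE)), drop = FALSE]
m
```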
