| Version: | 5.2-4 |
| Date: | 2025-10-02 |
| Title: | Harrell Miscellaneous |
| Depends: | R (≥ 4.2.0) |
| Imports: | methods, ggplot2, cluster, rpart, nnet, foreign, gtable, grid, gridExtra, data.table, htmlTable (≥ 1.11.0), viridisLite, htmltools, base64enc, colorspace, rmarkdown, knitr, Formula |
| Suggests: | survival, qreport, acepack, chron, rms, mice, rstudioapi, tables, plotly (≥ 4.5.6), rlang, VGAM, leaps, pcaPP, digest, parallel, polspline, abind, kableExtra, rio, lattice, latticeExtra, gt, sparkline, jsonlite, htmlwidgets, qs, getPass, keyring, safer, htm2txt, boot |
| Description: | Contains many functions useful for data analysis, high-level graphics, utility operations, functions for computing sample size and power, simulation, importing and annotating datasets, imputing missing values, advanced table making, variable clustering, character string manipulation, conversion of R objects to LaTeX and html code, recoding variables, caching, simplified parallel computing, encrypting and decrypting data using a safe workflow, general moving window statistical estimation, and assistance in interpreting principal component analysis. |
| License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
| LazyLoad: | Yes |
| URL: | https://hbiostat.org/R/Hmisc/ |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| NeedsCompilation: | yes |
| Packaged: | 2025-10-03 20:24:42 UTC; harrelfe |
| Author: | Frank E Harrell Jr |
| Maintainer: | Frank E Harrell Jr <fh@fharrell.com> |
| Repository: | CRAN |
| Date/Publication: | 2025-10-05 06:50:02 UTC |
Find Matching (or Non-Matching) Elements
Description
%nin% is a binary operator that returns a logical vector indicating whether each element of its left operand lacks a match in its right operand. A TRUE element indicates no match for that element of the left operand; FALSE indicates a match.
Usage
    x %nin% table
Arguments
x | a vector (numeric, character, factor) |
table | a vector (numeric, character, factor), matching the mode of |
Value
vector of logical values with length equal to length of x.
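As a rough illustration of the semantics described above, the operator behaves like the negation of base R's `%in%`. The sketch below defines a hypothetical stand-in named `%notin%` (not part of Hmisc) so that it runs without the package; it is an assumption about equivalent behavior, not the package's source.

```r
# Hypothetical stand-in for %nin%, built only on base R's match()
`%notin%` <- function(x, table) match(x, table, nomatch = 0L) == 0L

res <- c("a", "b", "c") %notin% c("a", "b")
print(res)   # one logical value per element of the left operand
```

The result has the same length as the left operand, which is what makes the operator convenient for subsetting, e.g. `x[x %notin% exclude]`.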
See Also
Examples
    c('a','b','c') %nin% c('a','b')
Character strings from unquoted names
Description
Cs makes a vector of character strings from a list of valid R names. .q is similar but also makes use of names of arguments.
Usage
    Cs(...)
    .q(...)
Arguments
... | any number of names separated by commas. For |
Value
character string vector. For .q there will be a names attribute to the vector if any names appeared in ....
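The idea of turning unquoted names into strings can be sketched in base R by capturing the unevaluated arguments. The function `cs_sketch` below is a hypothetical illustration written for this page, not the package's implementation, and it may differ from Cs and .q in edge cases.

```r
# Hypothetical re-implementation of the idea behind Cs(): capture the
# unevaluated arguments and deparse each one to a character string
cs_sketch <- function(...) {
  args <- as.list(substitute(list(...)))[-1]
  vapply(args, function(a) paste(deparse(a), collapse = ""), character(1))
}

vars <- cs_sketch(age, sex, race)
print(vars)

# Named arguments keep their names, in the spirit of .q()
named <- cs_sketch(dog = a, cat = b)
print(named)
```

None of `age`, `sex`, `a`, or `b` needs to exist, because the arguments are never evaluated.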
See Also
sys.frame, deparse
Examples
    Cs(a,cat,dog)
    # subset.data.frame <- dataframe[,Cs(age,sex,race,bloodpressure,height)]
    .q(a, b, c, 'this and that')
    .q(dog=a, giraffe=b, cat=c)
Empirical Cumulative Distribution Plot
Description
Computes coordinates of the cumulative distribution function of x, and by default plots it as a step function. A grouping variable may be specified so that stratified estimates are computed and (by default) plotted. If there is more than one group, the labcurve function is used (by default) to label the multiple step functions or to draw a legend defining line types, colors, or symbols by linking them with group labels. A weights vector may be specified to get weighted estimates. Specify normwt to make weights sum to the length of x (after removing NAs). Otherwise the total sample size is taken to be the sum of the weights.
Ecdf is actually a method, and Ecdf.default is what's called for a vector argument. Ecdf.data.frame is called when the first argument is a data frame. This function can automatically set up a matrix of ECDFs and wait for a mouse click if the matrix requires more than one page. Categorical variables, character variables, and variables having fewer than a set number of unique values are ignored. If par(mfrow=..) is not set up before Ecdf.data.frame is called, the function will try to figure the best layout depending on the number of variables in the data frame. Upon return the original mfrow is left intact.
When the first argument to Ecdf is a formula, a Trellis/Lattice function Ecdf.formula is called. This allows for multi-panel conditioning, superposition using a groups variable, and other Trellis features, along with the ability to easily plot transformed ECDFs using the fun argument. For example, if fun=qnorm, the inverse normal transformation will be used for the y-axis. If the transformed curves are linear this indicates normality. Like the xYplot function, Ecdf will create a function Key if the groups variable is used. This function can be invoked by the user to define the keys for the groups.
Usage
    Ecdf(x, ...)

    ## Default S3 method:
    Ecdf(x, what=c('F','1-F','f','1-f'),
         weights=rep(1, length(x)), normwt=FALSE,
         xlab, ylab, q, pl=TRUE, add=FALSE, lty=1,
         col=1, group=rep(1,length(x)), label.curves=TRUE, xlim,
         subtitles=TRUE, datadensity=c('none','rug','hist','density'),
         side=1,
         frac=switch(datadensity,none=NA,rug=.03,hist=.1,density=.1),
         dens.opts=NULL, lwd=1, log='', ...)

    ## S3 method for class 'data.frame'
    Ecdf(x, group=rep(1,nrows),
         weights=rep(1, nrows), normwt=FALSE,
         label.curves=TRUE, n.unique=10, na.big=FALSE, subtitles=TRUE,
         vnames=c('labels','names'),...)

    ## S3 method for class 'formula'
    Ecdf(x, data=sys.frame(sys.parent()), groups=NULL,
         prepanel=prepanel.Ecdf, panel=panel.Ecdf, ..., xlab,
         ylab, fun=function(x)x, what=c('F','1-F','f','1-f'), subset=TRUE)
Arguments
x | a numeric vector, data frame, or Trellis/Lattice formula |
what | The default is |
weights | numeric vector of weights. Omit or specify a zero-length vector or NULL to get unweighted estimates. |
normwt | see above |
xlab | x-axis label. Default is label(x) or name of calling argument. For |
ylab | y-axis label. Default is |
q | a vector for quantiles for which to draw reference lines on the plot. Default is not to draw any. |
pl | set to F to omit the plot, to just return estimates |
add | set to TRUE to add the cdf to an existing plot. Does not apply if using lattice graphics (i.e., if a formula is given as the first argument). |
lty | integer line type for plot. If |
lwd | line width for plot. Can be a vector corresponding to |
log | see |
col | color for step function. Can be a vector. |
group | a numeric, character, or |
label.curves | applies if more than one |
xlim | x-axis limits. Default is entire range of |
subtitles | set to |
datadensity | If |
side | If |
frac | passed to |
dens.opts | a list of optional arguments for |
... | other parameters passed to plot if add=F. For data frames, other parameters to pass to |
n.unique | minimum number of unique values before an ECDF is drawn for a variable in a data frame. Default is 10. |
na.big | set to |
vnames | By default, variable labels are used to label x-axes. Set |
method | method for computing the empirical cumulative distribution. See |
fun | a function to transform the cumulative proportions, for the Trellis-type usage of |
data,groups,subset,prepanel,panel | the usual Trellis/Lattice parameters, with |
Value
for Ecdf.default an invisible list with elements x and y giving the coordinates of the cdf. If there is more than one group, a list of such lists is returned. An attribute, N, is in the returned object. It contains the elements n and m, the number of non-missing and missing observations, respectively.
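To make the returned x and y coordinates more concrete, here is a minimal base R sketch of how weighted ECDF coordinates can be computed. The helper `wtd_ecdf_sketch` is hypothetical and written only for illustration; the package's own output may differ in details such as an added starting point or the handling of normwt.

```r
# Hypothetical helper (not Hmisc code): weighted ECDF coordinates
wtd_ecdf_sketch <- function(x, weights = rep(1, length(x))) {
  keep <- !is.na(x) & !is.na(weights)       # drop missing values
  x <- x[keep]
  w <- weights[keep]
  o <- order(x)
  x <- x[o]
  w <- w[o]
  cw <- cumsum(w)                           # cumulative weight at each point
  last <- !duplicated(x, fromLast = TRUE)   # last position of each distinct x
  list(x = x[last], y = cw[last] / sum(w))
}

unweighted <- wtd_ecdf_sketch(c(2, 1, 3, 2, NA))
weighted   <- wtd_ecdf_sketch(c(1, 2, 2, 3), weights = c(1, 1, 1, 3))
str(unweighted)
str(weighted)
```

With unit weights the y values are simply cumulative proportions; with frequency weights each observation contributes its weight to the running total.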
Side Effects
plots
Author(s)
Frank Harrell
Department of Biostatistics, Vanderbilt University
fh@fharrell.com
See Also
wtd.Ecdf, label, table, cumsum, labcurve, xYplot, histSpike
Examples
    set.seed(1)
    ch <- rnorm(1000, 200, 40)
    Ecdf(ch, xlab="Serum Cholesterol")
    scat1d(ch)                          # add rug plot
    histSpike(ch, add=TRUE, frac=.15)   # add spike histogram
    # Better: add a data density display automatically:
    Ecdf(ch, datadensity='density')

    label(ch) <- "Serum Cholesterol"
    Ecdf(ch)
    other.ch <- rnorm(500, 220, 20)
    Ecdf(other.ch,add=TRUE,lty=2)

    sex <- factor(sample(c('female','male'), 1000, TRUE))
    Ecdf(ch, q=c(.25,.5,.75))  # show quartiles
    Ecdf(ch, group=sex, label.curves=list(method='arrow'))

    # Example showing how to draw multiple ECDFs from paired data
    pre.test <- rnorm(100,50,10)
    post.test <- rnorm(100,55,10)
    x <- c(pre.test, post.test)
    g <- c(rep('Pre',length(pre.test)),rep('Post',length(post.test)))
    Ecdf(x, group=g, xlab='Test Results', label.curves=list(keys=1:2))
    # keys=1:2 causes symbols to be drawn periodically on top of curves

    # Draw a matrix of ECDFs for a data frame
    m <- data.frame(pre.test, post.test, sex=sample(c('male','female'),100,TRUE))
    Ecdf(m, group=m$sex, datadensity='rug')

    freqs <- sample(1:10, 1000, TRUE)
    Ecdf(ch, weights=freqs)  # weighted estimates

    # Trellis/Lattice examples:
    region <- factor(sample(c('Europe','USA','Australia'),100,TRUE))
    year <- factor(sample(2001:2002,1000,TRUE))
    Ecdf(~ch | region*year, groups=sex)
    Key()             # draw a key for sex at the default location
    # Key(locator(1)) # user-specified positioning of key
    age <- rnorm(1000, 50, 10)
    Ecdf(~ch | lattice::equal.count(age), groups=sex)  # use overlapping shingles
    Ecdf(~ch | sex, datadensity='hist', side=3)  # add spike histogram at top
Debug Printing Function Generator
Description
Takes the name of a system options(opt=) and checks to see if option opt is set to TRUE, taking its default value to be FALSE. If TRUE, a function is created that calls prn() to print an object with the object's name in the description along with the option name and the name of the function within which the generated function was called, if any. If option opt is not set, a dummy function is generated instead. If options(debug_file=) is set when the generated function is called, prn() output will be appended to that file name instead of the console. At any time, set options(debug_file='') to resume printing to the console.
Usage
    Fdebug(opt)
Arguments
opt | character string containing an option name |
Value
a function
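The generator pattern described above can be sketched in base R as follows. `fdebug_sketch` and the option name `sketch_debug` are hypothetical names used only for illustration; the real Fdebug calls prn() and honors options(debug_file=), neither of which this simplified version attempts.

```r
# Hypothetical sketch of an option-controlled debug printer generator
fdebug_sketch <- function(opt) {
  if (!isTRUE(getOption(opt, default = FALSE))) {
    return(function(...) invisible(NULL))   # dummy function: does nothing
  }
  function(x) {
    # Title the output with the expression text and the option name
    cat(deparse(substitute(x)), " [option: ", opt, "]\n", sep = "")
    print(x)
    invisible(NULL)
  }
}

options(sketch_debug = NULL)                 # option not set
quiet <- fdebug_sketch("sketch_debug")
out_off <- capture.output(quiet(sqrt(2)))    # nothing is printed

options(sketch_debug = TRUE)
loud <- fdebug_sketch("sketch_debug")
out_on <- capture.output(loud(sqrt(2)))      # title line plus printed value
options(sketch_debug = NULL)
```

Because the option is checked when the function is generated, debugging calls left in production code cost almost nothing when the option is off.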
Author(s)
Frank Harrell
Examples
    dfun <- Fdebug('my_option_name')   # my_option_name not currently set
    dfun
    dfun(sqrt(2))
    options(my_option_name=TRUE)
    dfun <- Fdebug('my_option_name')
    dfun
    dfun(sqrt(2))
    # options(debug_file='/tmp/z') to append output to /tmp/z
    options(my_option_name=NULL)
Gini's Mean Difference
Description
GiniMd computes Gini's mean difference on a numeric vector. This index is defined as the mean absolute difference between any two distinct elements of a vector. For a Bernoulli (binary) variable with proportion of ones equal to $p$ and sample size $n$, Gini's mean difference is $2\frac{n}{n-1}p(1-p)$. For a trinomial variable (e.g., predicted values for a 3-level categorical predictor using two dummy variables) having (predicted) values $A, B, C$ with corresponding proportions $a, b, c$, Gini's mean difference is $2\frac{n}{n-1}[ab|A-B|+ac|A-C|+bc|B-C|]$.
Usage
    GiniMd(x, na.rm=FALSE)
Arguments
x | a numeric vector (for |
na.rm | set to |
Value
a scalar numeric
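The definition in the Description (mean absolute difference over all pairs of distinct elements) can also be computed from the sorted values without forming all pairs, using the identity that the sum of pairwise differences equals the sum over i of (2i - n - 1) times the i-th order statistic. The sketch below is a base R illustration with hypothetical function names; it is not claimed to be how GiniMd is implemented.

```r
# Hypothetical helpers: Gini's mean difference two ways
gmd_sorted <- function(x) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  # sum over pairs i < j of (x_(j) - x_(i)) equals sum_i (2i - n - 1) x_(i)
  2 * sum((2 * seq_len(n) - n - 1) * x) / (n * (n - 1))
}

gmd_brute <- function(x) {
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (n * (n - 1))
}

set.seed(1)
x <- rnorm(40)
z <- c(rep(0, 17), rep(1, 6))   # binary variable, n = 23
c(sorted = gmd_sorted(x), brute = gmd_brute(x))
c(sorted = gmd_sorted(z), bernoulli = 2 * mean(z) * (1 - mean(z)) * 23 / 22)
```

The sorted form needs only a sort and a single pass, whereas the brute-force form builds an n by n matrix.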
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
References
David HA (1968): Gini's mean difference rediscovered. Biometrika 55:573–575.
Examples
    set.seed(1)
    x <- rnorm(40)
    # Test GiniMd against a brute-force solution
    gmd <- function(x) {
      n <- length(x)
      sum(outer(x, x, function(a, b) abs(a - b))) / n / (n - 1)
    }
    GiniMd(x)
    gmd(x)

    z <- c(rep(0,17), rep(1,6))
    n <- length(z)
    GiniMd(z)
    2*mean(z)*(1-mean(z))*n/(n-1)

    a <- 12; b <- 13; c <- 7; n <- a + b + c
    A <- -.123; B <- -.707; C <- 0.523
    xx <- c(rep(A, a), rep(B, b), rep(C, c))
    GiniMd(xx)
    2*(a*b*abs(A-B) + a*c*abs(A-C) + b*c*abs(B-C))/n/(n-1)
Internal Hmisc functions
Description
Internal Hmisc functions.
Details
These are not to be called by the user or are undocumented.
Overview of Hmisc Library
Description
The Hmisc library contains many functions useful for data analysis, high-level graphics, utility operations, functions for computing sample size and power, translating SAS datasets into R, imputing missing values, advanced table making, variable clustering, character string manipulation, conversion of R objects to LaTeX code, recoding variables, and bootstrap repeated measures analysis. Most of these functions were written by F Harrell, but a few were collected from statlib and from s-news; other authors are indicated below. This collection of functions includes all of Harrell's submissions to statlib other than the functions in the rms and display libraries. A few of the functions do not have “Help” documentation.
To make Hmisc load silently, issue options(Hverbose=FALSE) before library(Hmisc).
Functions
| Function Name | Purpose |
| abs.error.pred | Computes various indexes of predictive accuracy based on absolute errors, for linear models |
| addMarginal | Add marginal observations over selected variables |
| all.is.numeric | Check if character strings are legal numerics |
| approxExtrap | Linear extrapolation |
| aregImpute | Multiple imputation based on additive regression, bootstrapping, and predictive mean matching |
| areg.boot | Nonparametrically estimate transformations for both sides of a multiple additive regression, and bootstrap these estimates and $R^2$ |
| ballocation | Optimum sample allocations in 2-sample proportion test |
| binconf | Exact confidence limits for a proportion and more accurate (narrower!) score stat.-based Wilson interval (Rollin Brant, mod. FEH) |
| bootkm | Bootstrap Kaplan-Meier survival or quantile estimates |
| bpower | Approximate power of 2-sided test for 2 proportions. Includes bpower.sim for exact power by simulation |
| bpplot | Box-Percentile plot (Jeffrey Banfield, umsfjban@bill.oscs.montana.edu) |
| bpplotM | Chart extended box plots for multiple variables |
| bsamsize | Sample size requirements for test of 2 proportions |
| bystats | Statistics on a single variable by levels of >=1 factors |
| bystats2 | 2-way statistics |
| character.table | Shows numeric equivalents of all latin characters. Useful for putting many special chars. in graph titles (Pierre Joyet, pierre.joyet@bluewin.ch) |
| ciapower | Power of Cox interaction test |
| cleanup.import | More compactly store variables in a data frame, and clean up problem data when e.g. Excel spreadsheet had a non-numeric value in a numeric column |
| combine.levels | Combine infrequent levels of a categorical variable |
| confbar | Draws confidence bars on an existing plot using multiple confidence levels distinguished using color or gray scale |
| contents | Print the contents (variables, labels, etc.) of a data frame |
| cpower | Power of Cox 2-sample test allowing for noncompliance |
| Cs | Vector of character strings from list of unquoted names |
| csv.get | Enhanced importing of comma separated files labels |
| cut2 | Like cut with better endpoint label construction and allows construction of quantile groups or groups with given n |
| datadensity | Snapshot graph of distributions of all variables in a data frame. For continuous variables uses scat1d. |
| dataRep | Quantify representation of new observations in a database |
| ddmmmyy | SAS “date7” output format for a chron object |
| deff | Kish design effect and intra-cluster correlation |
| describe | Function to describe different classes of objects. Invoke by saying describe(object). It calls one of the following: |
| describe.data.frame | Describe all variables in a data frame (generalization of SAS UNIVARIATE) |
| describe.default | Describe a variable (generalization of SAS UNIVARIATE) |
| dotplot3 | A more flexible version of dotplot |
| Dotplot | Enhancement of Trellis dotplot allowing for matrix x-var., auto generation of Key function, superposition |
| drawPlot | Simple mouse-driven drawing program, including a function for fitting Bezier curves |
| Ecdf | Empirical cumulative distribution function plot |
| errbar | Plot with error bars (Charles Geyer, U. Chi., mod FEH) |
| event.chart | Plot general event charts (Jack Lee, jjlee@mdanderson.org, Ken Hess, Joel Dubin; Am Statistician 54:63-70,2000) |
| event.history | Event history chart with time-dependent cov. status (Joel Dubin, jdubin@uwaterloo.ca) |
| find.matches | Find matches (with tolerances) between columns of 2 matrices |
| first.word | Find the first word in an R expression (R Heiberger) |
| fit.mult.impute | Fit most regression models over multiple transcan imputations, compute imputation-adjusted variances and avg. betas |
| format.df | Format a matrix or data frame with much user control (R Heiberger and FE Harrell) |
| ftupwr | Power of 2-sample binomial test using Fleiss, Tytun, Ury |
| ftuss | Sample size for 2-sample binomial test using " " " " (Both by Dan Heitjan, dheitjan@biostats.hmc.psu.edu) |
| gbayes | Bayesian posterior and predictive distributions when both the prior and the likelihood are Gaussian |
| getHdata | Fetch and list datasets on our web site |
| hdquantile | Harrell-Davis nonparametric quantile estimator with s.e. |
| histbackback | Back-to-back histograms (Pat Burns, Salomon Smith Barney, London, pburns@dorado.sbi.com) |
| hist.data.frame | Matrix of histograms for all numeric vars. in data frame. Use hist.data.frame(data.frame.name) |
| histSpike | Add high-resolution spike histograms or density estimates to an existing plot |
| hoeffd | Hoeffding's D test (omnibus test of independence of X and Y) |
| impute | Impute missing data (generic method) |
| interaction | More flexible version of builtin function |
| is.present | Tests for non-blank character values or non-NA numeric values |
| james.stein | James-Stein shrinkage estimates of cell means from raw data |
| labcurve | Optimally label a set of curves that have been drawn on an existing plot, on the basis of gaps between curves. Also position legends automatically at emptiest rectangle. |
| label | Set or fetch a label for an R-object |
| Lag | Lag a vector, padding on the left with NA or '' |
| latex | Convert an R object to LaTeX (R Heiberger & FE Harrell) |
| list.tree | Pretty-print the structure of any data object (Alan Zaslavsky, zaslavsk@hcp.med.harvard.edu) |
| Load | Enhancement of load |
| mask | 8-bit logical representation of a short integer value (Rick Becker) |
| matchCases | Match each case on one continuous variable |
| matxv | Fast matrix * vector, handling intercept(s) and NAs |
| mgp.axis | Version of axis() that uses appropriate mgp from mgp.axis.labels and gets around bug in axis(2, ...) that causes it to assume las=1 |
| mgp.axis.labels | Used by survplot and plot in rms library (and other functions in the future) so that different spacing between tick marks and axis tick mark labels may be specified for x- and y-axes. Use mgp.axis.labels('default') to set defaults. Users can set values manually using mgp.axis.labels(x,y) where x and y are 2nd value of par('mgp') to use. Use mgp.axis.labels(type=w) to retrieve values, where w='x', 'y', 'x and y', 'xy', to get 3 mgp values (first 3 types) or 2 mgp.axis.labels. |
| minor.tick | Add minor tick marks to an existing plot |
| mtitle | Add outer titles and subtitles to a multiple plot layout |
| multLines | Draw multiple vertical lines at each x in a line plot |
| %nin% | Opposite of %in% |
| nobsY | Compute no. non-NA observations for left hand formula side |
| nomiss | Return a matrix after excluding any row with an NA |
| panel.bpplot | Panel function for trellis bwplot - box-percentile plots |
| panel.plsmo | Panel function for trellis xyplot - uses plsmo |
| pBlock | Block variables for certain lattice charts |
| pc1 | Compute first prin. component and get coefficients on original scale of variables |
| plotCorrPrecision | Plot precision of estimate of correlation coefficient |
| plsmo | Plot smoothed x vs. y with labeling and exclusion of NAs. Also allows a grouping variable and plots unsmoothed data |
| popower | Power and sample size calculations for ordinal responses (two treatments, proportional odds model) |
| prn | prn(expression) does print(expression) but titles the output with 'expression'. Do prn(expression,txt) to add a heading (‘txt’) before the ‘expression’ title |
| pstamp | Stamp a plot with date in lower right corner (pstamp()). Add ,pwd=T and/or ,time=T to add current directory name or time. Put additional text for label as first argument, e.g. pstamp('Figure 1') will draw 'Figure 1 date' |
| putKey | Different way to use key() |
| putKeyEmpty | Put key at most empty part of existing plot |
| rcorr | Pearson or Spearman correlation matrix with pairwise deletion of missing data |
| rcorr.cens | Somers' Dxy rank correlation with censored data |
| rcorrp.cens | Assess difference in concordance for paired predictors |
| rcspline.eval | Evaluate restricted cubic spline design matrix |
| rcspline.plot | Plot spline fit with nonparametric smooth and grouped estimates |
| rcspline.restate | Restate restricted cubic spline in unrestricted form, and create TeX expression to print the fitted function |
| reShape | Reshape a matrix into 3 vectors, reshape serial data |
| rm.boot | Bootstrap spline fit to repeated measurements model, with simultaneous confidence region - least squares using spline function in time |
| rMultinom | Generate multinomial random variables with varying prob. |
| samplesize.bin | Sample size for 2-sample binomial problem (Rick Chappell, chappell@stat.wisc.edu) |
| sas.get | Convert SAS dataset to S data frame |
| sasxport.get | Enhanced importing of SAS transport dataset in R |
| Save | Enhancement of save |
| scat1d | Add 1-dimensional scatterplot to an axis of an existing plot (like bar-codes, FEH/Martin Maechler, maechler@stat.math.ethz.ch/Jens Oehlschlaegel-Akiyoshi, oehl@psyres-stuttgart.de) |
| score.binary | Construct a score from a series of binary variables or expressions |
| sedit | A set of character handling functions written entirely in R. sedit() does much of what the UNIX sed program does. Other functions included are substring.location, substring<-, replace.string.wild, and functions to check if a string is numeric or contains only the digits 0-9 |
| setTrellis | Set Trellis graphics to use blank conditioning panel strips, line thickness 1 for dot plot reference lines: setTrellis(); 3 optional arguments |
| show.col | Show colors corresponding to col=0,1,...,99 |
| show.pch | Show all plotting characters specified by pch=. Just type show.pch() to draw the table on the current device. |
| showPsfrag | Use LaTeX to compile, and dvips and ghostview to display a postscript graphic containing psfrag strings |
| solvet | Version of solve with argument tol passed to qr |
| somers2 | Somers' rank correlation and c-index for binary y |
| spearman | Spearman rank correlation coefficient spearman(x,y) |
| spearman.test | Spearman 1 d.f. and 2 d.f. rank correlation test |
| spearman2 | Spearman multiple d.f. $\rho^2$, adjusted $\rho^2$, Wilcoxon-Kruskal-Wallis test, for multiple predictors |
| spower | Simulate power of 2-sample test for survival under complex conditions. Also contains the Gompertz2, Weibull2, Lognorm2 functions. |
| spss.get | Enhanced importing of SPSS files using read.spss function |
| src | src(name) = source("name.s") with memory |
| store | store an object permanently (easy interface to assign function) |
| strmatch | Shortest unique identifier match (Terry Therneau, therneau@mayo.edu) |
| subset | More easily subset a data frame |
| substi | Substitute one var for another when observations NA |
| summarize | Generate a data frame containing stratified summary statistics. Useful for passing to trellis. |
| summary.formula | General table making and plotting functions for summarizing data |
| summaryD | Summarizing using user-provided formula and dotchart3 |
| summaryM | Replacement for summary.formula(..., method='reverse') |
| summaryP | Multi-panel dot chart for summarizing proportions |
| summaryS | Summarize multiple response variables for multi-panel dot chart or scatterplot |
| summaryRc | Summary for continuous variables using lowess |
| symbol.freq | X-Y Frequency plot with circles' area prop. to frequency |
| sys | Execute unix() or dos() depending on what's running |
| tabulr | Front-end to tabular function in the tables package |
| tex | Enclose a string with the correct syntax for using with the LaTeX psfrag package, for postscript graphics |
| transace | ace() packaged for easily automatically transforming all variables in a matrix |
| transcan | automatic transformation and imputation of NAs for a series of predictor variables |
| trap.rule | Area under curve defined by arbitrary x and y vectors, using trapezoidal rule |
| trellis.strip.blank | To make the strip titles in trellis more visible, you can make the backgrounds blank by saying trellis.strip.blank(). Use before opening the graphics device. |
| t.test.cluster | 2-sample t-test for cluster-randomized observations |
| uncbind | Form individual variables from a matrix |
| upData | Update a data frame (change names, labels, remove vars, etc.) |
| units | Set or fetch "units" attribute - units of measurement for var. |
| varclus | Graph hierarchical clustering of variables using squared Pearson or Spearman correlations or Hoeffding D as similarities. Also includes the naclus function for examining similarities in patterns of missing values across variables. |
| wtd.mean, wtd.var, wtd.quantile, wtd.Ecdf, wtd.table, wtd.rank, wtd.loess.noiter, num.denom.setup | Set of functions for obtaining weighted estimates |
| xy.group | Compute mean x vs. function of y by groups of x |
| xYplot | Like trellis xyplot but supports error bars and multiple response variables that are connected as separate lines |
| ynbind | Combine a series of yes/no true/false present/absent variables into a matrix |
| zoom | Zoom in on any graphical display (Bill Dunlap, bill@statsci.com) |
Copyright Notice
GENERAL DISCLAIMER
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
In short: You may use it any way you like, as long as you don't charge money for it, remove this notice, or hold anyone liable for its results. Also, please acknowledge the source and communicate changes to the author.
If this software is used in work presented for publication, kindly reference it using for example:
Harrell FE (2014): Hmisc: A package of miscellaneous R functions. Programs available from https://hbiostat.org/R/Hmisc/.
Be sure to reference R itself and other libraries used.
Author(s)
Frank E Harrell Jr
Professor of Biostatistics
Vanderbilt University School of Medicine
Nashville, Tennessee
fh@fharrell.com
References
See Alzola CF, Harrell FE (2004): An Introduction to S and the Hmisc and Design Libraries at https://hbiostat.org/R/doc/sintro.pdf for extensive documentation and examples for the Hmisc package.
Lag a Numeric, Character, or Factor Vector
Description
Shifts a vector shift elements later. Character or factor variables are padded with "", numerics with NA. The shift may be negative.
Usage
    Lag(x, shift = 1)
Arguments
x | a vector |
shift | integer specifying the number of observations to be shifted to the right. Negative values imply shifts to the left. |
Details
Attributes of the original object are carried along to the new lagged one.
Value
a vector like x
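The shifting rule can be sketched in base R for plain numeric and character vectors. `lag_sketch` is a hypothetical helper written for illustration; it does not handle factors or carry attributes along as the documented function does.

```r
# Hypothetical sketch of lagging a plain numeric or character vector
lag_sketch <- function(x, shift = 1) {
  n <- length(x)
  if (shift == 0 || n == 0) return(x)
  idx <- seq_len(n) - shift             # source position for each element
  idx[idx < 1 | idx > n] <- NA          # positions shifted in from outside
  out <- x[idx]                         # NA index gives NA padding
  if (is.character(x)) out[is.na(idx)] <- ""   # characters are padded with ""
  out
}

num_lag  <- lag_sketch(1:5, 2)
chr_lag  <- lag_sketch(letters[1:4], 2)
chr_lead <- lag_sketch(letters[1:4], -2)

# First observation for each subject, using a lag of 1
id <- c("a", "a", "b", "b", "b", "c")
first <- id != lag_sketch(id)
```

Comparing a vector with its own lag, as in the last two lines, flags the rows where a new subject starts.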
Author(s)
Frank Harrell
See Also
Examples
    Lag(1:5,2)
    Lag(letters[1:4],2)
    Lag(factor(letters[1:4]),-2)
    # Find which observations are the first for a given subject
    id <- c('a','a','b','b','b','c')
    id != Lag(id)
    !duplicated(id)
Merge Multiple Data Frames or Data Tables
Description
Merges an arbitrarily large series of data frames or data tables containing common id variables. Information about number of observations and number of unique ids in individual and final merged datasets is printed. The first data frame/table has special meaning in that all of its observations are kept whether they match ids in other data frames or not. For all other data frames, by default non-matching observations are dropped. The first data frame is also the one against which counts of unique ids are compared. Sometimes merge drops variable attributes such as labels and units. These are restored by Merge.
Usage
    Merge(..., id = NULL, all = TRUE, verbose = TRUE)
Arguments
... | two or more data frames or data tables |
id | a formula containing all the identification variables such that the combination of these variables uniquely identifies subjects or records of interest. May be omitted for data tables; in that case the |
all | set to |
verbose | set to |
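The "first data frame is the master" behavior in the Description can be approximated in base R by chaining left joins. `merge_sketch` below is a hypothetical helper that mirrors only that part of the description; it prints no counts and does not restore labels or units, and it is not the package's implementation.

```r
# Hypothetical sketch: successive left joins onto the first data frame
merge_sketch <- function(..., id) {
  Reduce(function(acc, d) merge(acc, d, by = id, all.x = TRUE), list(...))
}

a <- data.frame(sid = 1:3, age = c(20, 30, 40))
b <- data.frame(sid = c(1, 2, 2), bp = c(120, 130, 140))
d <- data.frame(sid = c(1, 3, 4), wt = c(170, 180, 190))

merged <- merge_sketch(a, b, d, id = "sid")
print(merged)
```

In this sketch every id in `a` survives, the duplicated id in `b` produces two rows, and the id that occurs only in `d` is not carried into the result.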
Examples
    ## Not run:
    a <- data.frame(sid=1:3, age=c(20,30,40))
    b <- data.frame(sid=c(1,2,2), bp=c(120,130,140))
    d <- data.frame(sid=c(1,3,4), wt=c(170,180,190))
    all <- Merge(a, b, d, id = ~ sid)
    # First file should be the master file and must
    # contain all ids that ever occur. ids not in the master will
    # not be merged from other datasets.
    a <- data.table(a); setkey(a, sid)
    # data.table also does not allow duplicates without allow.cartesian=TRUE
    b <- data.table(sid=1:2, bp=c(120,130)); setkey(b, sid)
    d <- data.table(d); setkey(d, sid)
    all <- Merge(a, b, d)
    ## End(Not run)
Miscellaneous Functions
Description
This documents miscellaneous small functions in Hmisc that may be of interest to users.
clowess runs lowess but if the iter argument exceeds zero, sometimes wild values can result, in which case lowess is re-run with iter=0.
confbar draws multi-level confidence bars using small rectangles that may be of different colors.
getLatestSource fetches and sources the most recent source code for functions in GitHub.
grType retrieves the system option grType, which is forced to be "base" if the plotly package is not installed.
prType retrieves the system option prType, which is set to "plain" if the option is not set. print methods that allow for markdown/html/latex can be automatically invoked by setting options(prType="html") or options(prType='latex').
htmlSpecialType retrieves the system option htmlSpecialType, which is set to "unicode" if the option is not set. htmlSpecialType='unicode' causes html-generating functions in Hmisc and rms to use unicode for special characters, and htmlSpecialType='&' uses the older ampersand 3-digit format.
inverseFunction generates a function to find all inverses of a monotonic or nonmonotonic function that is tabulated at vectors (x,y), typically 1000 points. If the original function is monotonic, simple linear interpolation is used and the result is a vector, otherwise linear interpolation is used within each interval in which the function is monotonic and the result is a matrix with number of columns equal to the number of monotonic intervals. If a requested y is not within any interval, the extreme x that pertains to the nearest extreme y is returned. Specifying what='sample' to the returned function will cause a vector to be returned instead of a matrix, with elements taken as a random choice of the possible inverses.
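For the monotonic case, the interpolation idea can be sketched with base R's approx() by swapping the roles of x and y. `inverse_monotone_sketch` is a hypothetical helper covering only that simple case; the nonmonotonic, matrix-returning behavior described above is not attempted here.

```r
# Hypothetical sketch: inverse of a monotonic tabulated function by
# linear interpolation of x as a function of y
inverse_monotone_sketch <- function(x, y) {
  # rule = 2 returns the extreme x when a requested y is out of range
  function(ynew) stats::approx(y, x, xout = ynew, rule = 2)$y
}

xs <- seq(0, 2, length.out = 1001)     # tabulate f(x) = x^2 on [0, 2]
inv_sq <- inverse_monotone_sketch(xs, xs^2)
roots <- inv_sq(c(0.25, 1, 10))        # 10 lies beyond the tabulated range
print(roots)
```

Accuracy depends on how finely the function is tabulated, which is presumably why the text mentions about 1000 points as typical.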
james.stein computes James-Stein shrunken estimates of cell means given a response variable (which may be binary) and a grouping indicator.
keepHattrib for an input variable or a data frame, creates a list object saving special Hmisc attributes such as label and units that might be lost during certain operations such as running data.table. restoreHattrib restores these attributes.
km.quick provides a fast way to invoke survfitKM in the survival package to efficiently get Kaplan-Meier or Fleming-Harrington estimates for a single stratum for a vector of time points (if times is given) or to get a vector of survival time quantiles (if q is given). If neither is given, the whole curve is returned in a list with objects time and surv, and there is an option to consider an interval as pertaining to greater than or equal to a specific time instead of the traditional greater than. If the censoring is not right censoring, the more general survfit is called by km.quick.
latexBuild takes pairs of character strings and produces a single character string containing concatenation of all of them, plus an attribute "close" which is a character string containing the LaTeX closure that will balance LaTeX code with respect to parentheses, braces, brackets, or begin vs. end. When an even-numbered element of the vector is not a left parenthesis, brace, or bracket, the element is taken as a word that was surrounded by begin and braces, for which the corresponding end is constructed in the returned attribute.
lm.fit.qr.bare is a fast stripped-down function for computingregression coefficients, residuals,R^2, and fitted values. Ituseslm.fit.
matxv multiplies a matrix by a vector, handling automaticaddition of intercepts if the matrix does not have a column of ones.If the first argument is not a matrix, it will be converted to one.An optional argument allows the second argument to be treated as amatrix, useful when its rows represent bootstrap reps ofcoefficients. Then ab' is computed.matxv respects the"intercepts" attribute if it is stored onb by therms package. This is used byormfits that are bootstrap-repeated bybootcov whereonly the intercept corresponding to the median is retained. Ifkint has nonzero length, it is checked for consistency with theattribute.
makeSteps is a copy of the dostep function inside thesurvival package'splot.survfit function. It expands aseries of points to include all the segments needed to plot stepfunctions. This is useful for drawing polygons to shade confidencebands for step functions.
nomiss returns a data frame (if its argument is one) with rowscorresponding toNAs removed, or it returns a matrix with rowswith any element missing removed.
outerText usesaxis() to put right-justified textstrings in the right margin. Placement depends onpar('mar')[4]
plotlyParm is a list of functions useful for specifyingparameters toplotly graphics.
plotp is a generic to handleplotp methods to makeplotly graphics.
rendHTML renders HTML in a character vector, first convertingto one character string with newline delimeters. Ifknitr iscurrently running, runs this string throughknitr::asis_outputso that the user need not includeresults='asis' in the chunkheader for R Markdown or Quarto. Ifknitr is not running, useshtmltools::browsable andhtmltools::HTML and prints theresult so that an RStudio viewer (if running inside RStudio) orseparate browser window displays the rendered HTML. The HTML code issurrounded by yaml markup to make Pandoc not fiddle with the HTML.Set the argumenthtml=FALSE to not add this, in case you arereally rendering markdown.html=FALSE also invokesrmarkdown::render to convert the character vector to HTMLbefore usinghtmltools to view, assuming the charactersrepresent RMarkdown/Quarto text other than the YAML header. Ifoptions(rawmarkup=TRUE) is in effect,rendHTML will justcat() its first argument. This is useful when rendering ishappening inside a Quarto margin, for example.
sepUnitsTrans converts character vectors containing values suchasc("3 days","3day","4month","2 years","2weeks","7") tonumeric vectors (herec(3,3,122,730,14,7)) in a flexible fashion. The user canspecify a vector of units of measurements and conversion factors. The unitswith a conversion factor of1 are taken as the target units,and if those units are present in the character strings they areignored. The target units are added to the resulting vector as the"units" attribute.
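The number-plus-unit parsing described for sepUnitsTrans can be illustrated with a small base R sketch. The function name sep_units is hypothetical and this is not the Hmisc code; it assumes each string is a number optionally followed by one unit name, handles only a trailing plural "s", and omits rounding and the "units" attribute.

```r
# Hypothetical sketch of unit-aware parsing (not the Hmisc implementation)
sep_units <- function(x, conversion = c(day = 1, month = 365.25 / 12,
                                        year = 365.25, week = 7)) {
  x    <- trimws(x)
  num  <- as.numeric(sub("^([0-9.]+).*$", "\\1", x))      # leading number
  unit <- sub("s$", "", trimws(sub("^[0-9.]+", "", x)))    # unit with plural "s" dropped
  fac  <- ifelse(unit == "", 1, conversion[unit])          # no unit: already in target units
  unname(num * fac)
}

res <- sep_units(c("3 days", "3day", "4month", "2 years", "2weeks", "7"))
res
```

Values come back in days because day has a conversion factor of 1 in the assumed conversion vector.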
strgraphwrap is like strwrap but is for the current graphics environment.
tobase64image is a function written by Dirk Eddelbuettel that uses the base64enc package to convert a png graphic file to base64 encoding to include as an inline image in an html file.
trap.rule computes the area under a curve using the trapezoidal rule, assuming x is sorted.
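As a point of reference, the trapezoidal rule itself is a one-liner in base R. The name trap_area below is hypothetical and is used only to avoid implying this is the package source; like trap.rule, it assumes x is sorted.

```r
# Trapezoidal rule: sum of interval widths times the average of adjacent heights
trap_area <- function(x, y) sum(diff(x) * (y[-1] + y[-length(y)])) / 2

area <- trap_area(1:100, 1:100)   # integral of y = x from 1 to 100
area
```

For a straight line the rule is exact, so the result equals (100^2 - 1^2) / 2 = 4999.5.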
trellis.strip.blank sets up Trellis or Lattice graphs to have a clear background on the strips for panel labels.
unPaste provides a version of the S-Plus unpaste that works for R and S-Plus.
whichClosePW is a very fast function using weighted multinomial sampling to determine which element of a vector is "closest" to each element of another vector. whichClosest quickly finds the closest element without any randomness.
whichClosek is a slow function that finds, after jittering the lookup table, the k closest matches to each element of the other vector, and chooses from among these one at random.
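The two non-weighted matching ideas can be sketched in base R as follows. Both function names are hypothetical, the code is a slow and simple illustration rather than the Hmisc algorithms, and it assumes x is the lookup table and w holds the values to be matched. The jittering step mentioned for whichClosek is left out.

```r
# Index in lookup table x of the element nearest to each element of w
closest_index <- function(x, w) {
  vapply(w, function(wi) which.min(abs(x - wi)), integer(1))
}

# Pick one of the k nearest elements of x at random for each element of w
closest_k_sample <- function(x, w, k) {
  vapply(w, function(wi) {
    nearest <- order(abs(x - wi))[seq_len(k)]   # indices of the k nearest
    nearest[sample.int(k, 1)]                   # one of them, uniformly at random
  }, integer(1))
}

idx <- closest_index(c(10, 20, 30), c(12, 29, 16))
set.seed(1)
idk <- closest_k_sample(c(10, 20, 30, 40), c(12, 29), k = 2)
```

Here idx is c(1, 3, 2): 12 is nearest to 10, 29 to 30, and 16 to 20.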
xless is a function for Linux/Unix users to invoke the system xless command to pop up a window to display the result of printing an object. For MacOS xless uses the system open command to pop up a TextEdit window.
Usage
confbar(at, est, se, width, q = c(0.7, 0.8, 0.9, 0.95, 0.99),
        col = gray(c(0, 0.25, 0.5, 0.75, 1)), type = c("v", "h"),
        labels = TRUE, ticks = FALSE, cex = 0.5, side = "l", lwd = 5,
        clip = c(-1e+30, 1e+30), fun = function(x) x,
        qfun = function(x) ifelse(x == 0.5, qnorm(x),
                           ifelse(x < 0.5, qnorm(x/2), qnorm((1 + x)/2))))
getLatestSource(x=NULL, package='Hmisc', recent=NULL, avail=FALSE)
grType()
prType()
htmlSpecialType()
inverseFunction(x, y)
james.stein(y, group)
keepHattrib(obj)
km.quick(S, times, q, type = c("kaplan-meier", "fleming-harrington", "fh2"),
         interval = c(">", ">="), method=c('constant', 'linear'),
         fapprox=0, n.risk=FALSE)
latexBuild(..., insert, sep='')
lm.fit.qr.bare(x, y, tolerance, intercept=TRUE, xpxi=FALSE, singzero=FALSE)
matxv(a, b, kint=1, bmat=FALSE)
nomiss(x)
outerText(string, y, cex=par('cex'), ...)
plotlyParm
plotp(data, ...)
rendHTML(x, html=TRUE)
restoreHattrib(obj, attribs)
sepUnitsTrans(x, conversion=c(day=1, month=365.25/12, year=365.25, week=7),
              round=FALSE, digits=0)
strgraphwrap(x, width = 0.9 * getOption("width"), indent = 0, exdent = 0,
             prefix = "", simplify = TRUE, units='user', cex=NULL)
tobase64image(file, Rd = FALSE, alt = "image")
trap.rule(x, y)
trellis.strip.blank()
unPaste(str, sep="/")
whichClosest(x, w)
whichClosePW(x, w, f=0.2)
whichClosek(x, w, k)
xless(x, ..., title)
Arguments
a | a numeric matrix or vector |
alt,Rd | see |
at | x-coordinate for vertical confidence intervals, y-coordinate for horizontal |
attribs | an object returned by |
avail | set to |
b | a numeric vector |
cex | character expansion factor |
clip | interval to truncate limits |
col | vector of colors |
conversion | a named numeric vector |
data | an object having a |
digits | number of digits used for |
est | vector of point estimates for confidence limits |
f | a scaling constant |
file | a file name |
fun | function to transform scale |
group | a categorical grouping variable |
html | set to |
insert | a list of 3-element lists for |
intercept | set to |
k | get the |
kint | which element of |
bmat | set to |
labels | set to |
lwd | line widths |
package | name of package for |
obj | a variable, data frame, or data table |
q | vector of confidence coefficients or quantiles |
qfun | quantiles on transformed scale |
recent | an integer telling |
round | set to |
S | a |
se | vector of standard errors |
sep | a single character string specifying the delimiter. For |
side | for |
str | a character string vector |
string | a character string vector |
ticks | set to |
times | a numeric vector of times |
title | a character string to title a window or plot. Ignored for |
tolerance | tolerance for judging singularity in matrix |
type |
|
w | a numeric vector |
width | width of confidence rectangles in user units, or see |
x | a numeric vector (matrix for |
xpxi | set to |
singzero | set to |
y | a numeric vector. For |
indent,exdent,prefix | see |
simplify | see |
units | see |
interval | specifies whether to deal with probabilities of exceeding a value (the default) or of exceeding or equalling the value |
method,fapprox | see |
n.risk | set to |
... | arguments passed through to another function. For |
Author(s)
Frank Harrell and Charles Dupont
Examples
trap.rule(1:100,1:100)
unPaste(c('a;b or c','ab;d','qr;s'), ';')
sepUnitsTrans(c('3 days','4 months','2 years','7'))
set.seed(1)
whichClosest(1:100, 3:5)
whichClosest(1:100, rep(3,20))
whichClosePW(1:100, rep(3,20))
whichClosePW(1:100, rep(3,20), f=.05)
whichClosePW(1:100, rep(3,20), f=1e-10)
x <- seq(-1, 1, by=.01)
y <- x^2
h <- inverseFunction(x,y)
formals(h)$turns   # vertex
a <- seq(0, 1, by=.01)
plot(0, 0, type='n', xlim=c(-.5,1.5))
lines(a, h(a)[,1])            ## first inverse
lines(a, h(a)[,2], col='red') ## second inverse
a <- c(-.1, 1.01, 1.1, 1.2)
points(a, h(a)[,1])
d <- data.frame(x=1:2, y=3:4, z=5:6)
d <- upData(d, labels=c(x='X', z='Z lab'), units=c(z='mm'))
a <- keepHattrib(d)
d <- data.frame(x=1:2, y=3:4, z=5:6)
d2 <- restoreHattrib(d, a)
sapply(d2, attributes)
## Not run: 
getLatestSource(recent=5)   # source() most recent 5 revised files in Hmisc
getLatestSource('cut2')     # fetch and source latest cut2.s
getLatestSource('all')      # get everything
getLatestSource(avail=TRUE) # list available files and latest versions
## End(Not run)
R2Measures
Description
Generalized R^2 Measures
Usage
R2Measures(lr, p, n, ess = NULL, padj = 1)
Arguments
lr | likelihood ratio chi-square statistic |
p | number of non-intercepts in the model that achieved |
n | raw number of observations |
ess | if a single number, is the effective sample size. If a vector of numbers, it is assumed to be the frequency tabulation of all distinct values of the outcome variable, from which the effective sample size is computed. |
padj | set to 2 to use the classical adjusted R^2 penalty, 1 (the default) to subtract |
Details
Computes various generalized R^2 measures related to the Maddala-Cox-Snell (MCS) R^2 for regression models fitted with maximum likelihood. The original MCS R^2 is labeled R2 in the result. This measure uses the raw sample size n and does not penalize for the number of free parameters, so it can be rewarded for overfitting. A measure adjusted for the number of fitted regression coefficients p uses the analogy to R^2 in linear models by computing 1 - exp(- lr / n) * (n-1)/(n-p-1) if padj=2, which is approximately 1 - exp(- (lr - p) / n), the version used if padj=1 (the default). The latter measure is appealing because the expected value of the likelihood ratio chi-square statistic lr is p under the global null hypothesis of no predictors being associated with the response variable. See https://hbiostat.org/bib/r2.html for more details.
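The formulas in the preceding paragraph can be written out directly. The function mcs_r2 below is a hypothetical helper for illustration, not R2Measures itself, and it returns only the unadjusted and adjusted versions based on the raw sample size.

```r
# Hypothetical helper implementing the formulas quoted in the text
mcs_r2 <- function(lr, p, n, padj = 1) {
  unadjusted <- 1 - exp(-lr / n)                    # Maddala-Cox-Snell R^2
  adjusted <- if (padj == 2) {
    1 - exp(-lr / n) * (n - 1) / (n - p - 1)        # classical adjusted-R^2 penalty
  } else {
    1 - exp(-(lr - p) / n)                          # subtract p from lr
  }
  c(R2 = unadjusted, R2.adj = adjusted)
}

r1 <- mcs_r2(138.6267, p = 1, n = 100)             # padj = 1
r2 <- mcs_r2(138.6267, p = 1, n = 100, padj = 2)
```

With lr = 138.6267 and n = 100 the unadjusted value is about 0.75, and the two adjusted versions agree closely, as the approximation in the text suggests.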
It is well known that in logistic regression the MCS R^2 cannot achieve a value of 1.0 even with a perfect model, which prompted Nagelkerke to divide the R^2 measure by its maximum attainable value. This is not necessarily the best recalibration of R^2 throughout its range. An alternative is to use the formulas above but to replace the raw sample size n with the effective sample size, which for data with many ties can be significantly lower than the number of observations. As used in the popower() and describe() functions, in the context of a Wilcoxon test or the proportional odds model, the effective sample size is n * (1 - f) where f is the sum of cubes of the proportion of observations at each distinct value of the response variable. Whitehead derived this from an approximation to the variance of a log odds ratio in a proportional odds model. To obtain R^2 measures using the effective sample size, either provide ess as a single number specifying the effective sample size, or specify a vector of frequencies of distinct Y values from which the effective sample size will be computed. In the context of survival analysis, the single number effective sample size you may wish to specify is the number of uncensored observations. This is exactly correct when estimating the hazard rate from a simple exponential distribution or when using the Cox PH/log-rank test. For failure time distributions with a very high early hazard, censored observations contain enough information that the effective sample size is greater than the number of events. See Benedetti et al, 1982.
If the effective sample size equals the raw sample size, measures involving the effective sample size are set to NA.
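The effective sample size n * (1 - f) is easy to compute from a frequency table. The helper ess_from_freq below is a hypothetical name used for illustration; the R^2 line simply reuses the unadjusted MCS formula with n replaced by the effective sample size.

```r
# Hypothetical helper: effective sample size from frequencies of distinct Y values
ess_from_freq <- function(freq) {
  n <- sum(freq)
  f <- sum((freq / n)^3)      # sum of cubed proportions
  n * (1 - f)
}

ess    <- ess_from_freq(c(50, 50))     # 100 * (1 - 2 * 0.5^3) = 75
r2_ess <- 1 - exp(-138.6267 / ess)     # unadjusted MCS R^2 using the effective n
```

For a balanced binary outcome with 100 observations the effective sample size is 75, and the resulting R^2 rounds to 0.84.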
Value
named vector of R2 measures. The notation for results is R^2(p, n) where the p component is empty for unadjusted estimates and n is the sample size used (actual sample size for first measures, effective sample size for remaining ones). For indexes that are not adjusted, only n appears.
Author(s)
Frank Harrell
References
Smith TJ and McKenna CM (2013): A comparison of logistic regression pseudo R^2 indices. Multiple Linear Regression Viewpoints 39:17-26. https://www.glmj.org/archives/articles/Smith_v39n2.pdf
Benedetti JK, et al (1982): Effective sample size for tests of censored survival data. Biometrika 69:343–349.
Mittlbock M, Schemper M (1996): Explained variation for logistic regression. Stat in Med 15:1987-1997.
Date, S: R-squared, adjusted R-squared and pseudo R-squared. https://timeseriesreasoning.com/contents/r-squared-adjusted-r-squared-pseudo-r-squared/
UCLA: What are pseudo R-squareds? https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/
Allison P (2013): What's the best R-squared for logistic regression? https://statisticalhorizons.com/r2logistic/
Menard S (2000): Coefficients of determination for multiple logistic regression analysis. The Am Statistician 54:17-24.
Whitehead J (1993): Sample size calculations for ordered categorical data. Stat in Med 12:2257-2271. See errata (1994) 13:871 and letter to the editor by Julious SA, Campbell MJ (1996) 15:1065-1066 showing that for 2-category Y the Whitehead sample size formula agrees closely with the usual formula for comparing two proportions.
Examples
x <- c(rep(0, 50), rep(1, 50))
y <- x
# f <- lrm(y ~ x)
# f   # Nagelkerke R^2=1.0
# lr <- f$stats['Model L.R.']
# 1 - exp(- lr / 100)  # Maddala-Cox-Snell (MCS) 0.75
lr <- 138.6267  # manually so don't need rms package
R2Measures(lr, 1, 100, c(50, 50))  # 0.84  Effective n=75
R2Measures(lr, 1, 100, 50)         # 0.94
# MCS requires unreasonable effective sample size = minimum outcome
# frequency to get close to the 1.0 that Nagelkerke R^2 achieves
Facilitate Use of save and load to Remote Directories
Description
These functions are slightly enhanced versions of save and load that allow a target directory to be specified using options(LoadPath="pathname"). If the LoadPath option is not set, the current working directory is used.
Usage
# options(LoadPath='mypath')
Save(object, name=deparse(substitute(object)), compress=TRUE)
Load(object)
Arguments
object | the name of an object, usually a data frame. It must not be quoted. |
name | an optional name to assign to the object and file name prefix, if the argument name is not used |
compress | see |
Details
Save creates a temporary version of the object under the name given by the user, so that save will internalize this name. Then subsequent Load or load will cause an object of the original name to be created in the global environment. The name of the R data file is assumed to be the name of the object (or the value of name) appended with ".rda".
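The mechanism can be sketched with base R save and load. The helpers save_as and load_as are hypothetical names, not the Hmisc source; unlike Load, the sketch takes a quoted name when loading, and it loads into a chosen environment so the example does not touch the global environment.

```r
# Hypothetical helpers sketching the LoadPath mechanism
save_as <- function(object, name = deparse(substitute(object))) {
  path <- getOption("LoadPath", ".")          # fall back to the working directory
  env  <- new.env()
  assign(name, object, envir = env)           # so save() stores the object under `name`
  save(list = name, envir = env,
       file = file.path(path, paste0(name, ".rda")))
}
load_as <- function(name, envir = globalenv()) {
  path <- getOption("LoadPath", ".")
  load(file.path(path, paste0(name, ".rda")), envir = envir)
}

old <- options(LoadPath = tempdir())
d <- data.frame(x = 1:3, y = 11:13)
save_as(d)                                    # writes d.rda under tempdir()
e <- new.env()
load_as("d", envir = e)                       # restores an object named d into e
options(old)
```

Assigning the object under the requested name before calling save is what lets a later load recreate it under that name.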
Author(s)
Frank Harrell
See Also
Examples
## Not run: 
d <- data.frame(x=1:3, y=11:13)
options(LoadPath='../data/rda')
Save(d)        # creates ../data/rda/d.rda
Load(d)        # reads   ../data/rda/d.rda
Save(d, 'D')   # creates object D and saves it in .../D.rda
## End(Not run)
Indexes of Absolute Prediction Error for Linear Models
Description
Computes the mean and median of various absolute errors related to ordinary multiple regression models. The mean and median absolute errors correspond to the mean square due to regression, error, and total. The absolute errors computed are derived from \hat{Y} - \mbox{median($\hat{Y}$)}, \hat{Y} - Y, and Y - \mbox{median($Y$)}. The function also computes ratios that correspond to R^2 and 1 - R^2 (but these ratios do not add to 1.0); the R^2 measure is the ratio of mean or median absolute \hat{Y} - \mbox{median($\hat{Y}$)} to the mean or median absolute Y - \mbox{median($Y$)}. The 1 - R^2 or SSE/SST measure is the mean or median absolute \hat{Y} - Y divided by the mean or median absolute Y - \mbox{median($Y$)}.
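The quantities described above can be computed by hand in base R. The sketch below is illustrative and not the abs.error.pred source; in particular, it assumes that both ratios use the total absolute error |Y - median(Y)| as the denominator, by analogy with SSR/SST and SSE/SST.

```r
set.seed(1)
x    <- rnorm(100)
y    <- x + rnorm(100)
yhat <- fitted(lm(y ~ x))

# The three absolute errors named in the text
abs_err <- list(
  regression = abs(yhat - median(yhat)),
  error      = abs(yhat - y),
  total      = abs(y - median(y))
)

# Mean and median of each absolute error (rows: mean, median)
differences <- sapply(abs_err, function(e) c(mean = mean(e), median = median(e)))

# Ratios analogous to R^2 and 1 - R^2; they need not sum to 1
ratios <- rbind(
  "R2-like"     = differences[, "regression"] / differences[, "total"],
  "1 - R2-like" = differences[, "error"]      / differences[, "total"]
)
```

Because absolute errors are not additive the way sums of squares are, the two rows of ratios do not add to 1.0.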
Usage
abs.error.pred(fit, lp=NULL, y=NULL)
## S3 method for class 'abs.error.pred'
print(x, ...)
Arguments
fit | a fit object typically from |
lp | a vector of predicted values (Y hat above) if |
y | a vector of response variable values if |
x | an object created by |
... | unused |
Value
a list of class abs.error.pred (used by print.abs.error.pred) containing two matrices: differences and ratios.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com
References
Schemper M (2003): Stat in Med 22:2299-2308.
Tian L, Cai T, Goetghebeur E, Wei LJ (2007): Biometrika 94:297-311.
See Also
Examples
set.seed(1)         # so can regenerate results
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- exp(x1+x2+rnorm(100))
f <- lm(log(y) ~ x1 + poly(x2,3), y=TRUE)
abs.error.pred(lp=exp(fitted(f)), y=y)
rm(x1,x2,y,f)
Add Marginal Observations
Description
Given a data frame and the names of variables, doubles the data frame for each variable with a new category "All" by default, or by the value of label. A new variable .marginal. is added to the resulting data frame, with value "" if the observation is an original one, and with value equal to the names of the variable being marginalized (separated by commas) otherwise. If there is another stratification variable besides the one in ..., and that variable is nested inside the variable in ..., specify nested=variable name to have the value of that variable set to label whenever marginal observations are created for .... See the state-city example below.
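For a single variable, the doubling can be sketched in base R as below. The function add_marginal_one is a hypothetical name, not the Hmisc implementation; it takes the variable name as a quoted string, always places the marginal category last, and does not handle nested variables or several variables at once.

```r
# Hypothetical sketch: add marginal ("All") observations for one variable
add_marginal_one <- function(data, var, label = "All") {
  marg <- data
  marg[[var]] <- label                 # collapse the variable into one category
  data$.marginal. <- ""                # original observations
  marg$.marginal. <- var               # marginal copies record the variable name
  out <- rbind(data, marg)
  out[[var]] <- factor(out[[var]],
                       levels = c(unique(as.character(data[[var]])), label))
  out
}

d   <- data.frame(sex = c("female", "male", "female"), x = 1:3)
out <- add_marginal_one(d, "sex")
out
```

Summaries stratified by sex on the doubled data frame then include an "All" row computed from every observation.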
Usage
addMarginal(data, ..., label = "All", margloc=c('last', 'first'), nested)
Arguments
data | a data frame |
... | a list of names of variables to marginalize |
label | category name for added marginal observations |
margloc | location for marginal category within factor variable specifying categories. Set to |
nested | a single unquoted variable name if used |
Examples
d <- expand.grid(sex=c('female', 'male'), country=c('US', 'Romania'), reps=1:2)
addMarginal(d, sex, country)
# Example of nested variables
d <- data.frame(state=c('AL', 'AL', 'GA', 'GA', 'GA'),
                city=c('Mobile', 'Montgomery', 'Valdosto', 'Augusta', 'Atlanta'),
                x=1:5, stringsAsFactors=TRUE)
addMarginal(d, state, nested=city) # city set to 'All' when state is
addggLayers
Description
Add Spike Histograms and Extended Box Plots to ggplot
Usage
addggLayers(
  g,
  data,
  type = c("ebp", "spike"),
  ylim = layer_scales(g)$y$get_limits(),
  by = "variable",
  value = "value",
  frac = 0.065,
  mult = 1,
  facet = NULL,
  pos = c("bottom", "top"),
  showN = TRUE
)
Arguments
g | a |
data | data frame/table containing raw data |
type | specifies either extended box plot or spike histogram. Both are horizontal so are showing the distribution of the x-axis variable. |
ylim | y-axis limits to use for scaling the height of the added plots, if you don't want to use the limits that |
by | the name of a variable in |
value | name of x-variable |
frac | fraction of y-axis range to devote to vertical aspect of the added plot |
mult | fudge factor for scaling y aspect |
facet | optional faceting variable |
pos | position for added plot |
showN | set to |
Details
For an example see this. Note that it was not possible to just create the layers needed to be added, as creating these particular layers in isolation resulted in a ggplot error.
Value
the original ggplot object with more layers added
Author(s)
Frank Harrell
See Also
spikecomp()
Check if All Elements in Character Vector are Numeric
Description
Tests, without issuing warnings, whether all elements of a character vector are legal numeric values, or optionally converts the vector to a numeric vector. Leading and trailing blanks in x are ignored.
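The test itself can be approximated in a few lines of base R. The function is_all_numeric is a hypothetical name and the sketch covers only the what = "test" behavior; it assumes blanks, missing values, and the extras strings are simply skipped before conversion is attempted.

```r
# Hypothetical sketch of the "test" behavior (not the Hmisc implementation)
is_all_numeric <- function(x, extras = c(".", "NA")) {
  x <- trimws(x)                                        # ignore leading/trailing blanks
  x <- x[!is.na(x) & !(x %in% c("", extras))]           # skip blanks, NAs, and extras
  !anyNA(suppressWarnings(as.numeric(x)))               # TRUE if everything converts
}

t1 <- is_all_numeric(c("1", "1.2", "3"))     # TRUE
t2 <- is_all_numeric(c("1", "1.2", "3a"))    # FALSE
t3 <- is_all_numeric(c(" 1 ", "", " ."))     # TRUE: blank and "." are skipped
```

Wrapping as.numeric in suppressWarnings is what keeps the check silent when a non-numeric string is encountered.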
Usage
all.is.numeric(x, what = c("test", "vector", "nonnum"), extras=c('.','NA'))
Arguments
x | a character vector |
what | specify |
extras | a vector of character strings to count as numeric values, other than |
Value
a logical value if what="test" or a vector otherwise
Author(s)
Frank Harrell
See Also
Examples
all.is.numeric(c('1','1.2','3'))
all.is.numeric(c('1','1.2','3a'))
all.is.numeric(c('1','1.2','3'),'vector')
all.is.numeric(c('1','1.2','3a'),'vector')
all.is.numeric(c('1','',' .'),'vector')
all.is.numeric(c('1', '1.2', '3a'), 'nonnum')
Linear Extrapolation
Description
Works in conjunction with the approx function to do linear extrapolation. approx in R does not support extrapolation at all, and it is buggy in S-Plus 6.
Usage
approxExtrap(x, y, xout, method = "linear", n = 50, rule = 2, f = 0,
             ties = "ordered", na.rm = FALSE)
Arguments
x,y,xout,method,n,rule,f | see |
ties | applies only to R. See |
na.rm | set to |
Details
Duplicates in x (and corresponding y elements) are removed before using approx.
Value
a vector the same length as xout
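One way to picture linear extrapolation on top of approx is to extend the first and last line segments beyond the data range. The function approx_extrap below is a hypothetical sketch under that assumption, not the approxExtrap source, and it assumes the x values are distinct and that there are at least two points.

```r
# Hypothetical sketch: interpolate with approx(), extrapolate along the end segments
approx_extrap <- function(x, y, xout) {
  o <- order(x); x <- x[o]; y <- y[o]
  n <- length(x)
  yout <- approx(x, y, xout = xout, rule = 2)$y   # inside the range: interpolation
  lo <- xout < x[1]
  hi <- xout > x[n]
  # outside the range: follow the slope of the first or last segment
  yout[lo] <- y[1] + (xout[lo] - x[1]) * (y[2] - y[1]) / (x[2] - x[1])
  yout[hi] <- y[n] + (xout[hi] - x[n]) * (y[n] - y[n - 1]) / (x[n] - x[n - 1])
  yout
}

ex <- approx_extrap(1:3, c(2, 4, 8), xout = c(0, 2.5, 4))
ex
```

Here the left segment has slope 2 and the right segment slope 4, so the three results are 0, 6, and 12.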
Author(s)
Frank Harrell
See Also
Examples
approxExtrap(1:3,1:3,xout=c(0,4))
Additive Regression with Optimal Transformations on Both Sides using Canonical Variates
Description
Expands continuous variables into restricted cubic spline bases and categorical variables into dummy variables and fits a multivariate equation using canonical variates. This finds optimum transformations that maximize R^2. Optionally, the bootstrap is used to estimate the covariance matrix of both left- and right-hand-side transformation parameters, and to estimate the bias in the R^2 due to overfitting and compute the bootstrap optimism-corrected R^2. Cross-validation can also be used to get an unbiased estimate of R^2 but this is not as precise as the bootstrap estimate. The bootstrap and cross-validation may also be used to get estimates of mean and median absolute error in predicted values on the original y scale. These two estimates are perhaps the best ones for gauging the accuracy of a flexible model, because it is difficult to compare R^2 under different y-transformations, and because R^2 allows for an out-of-sample recalibration (i.e., it only measures relative errors).
Note that uncertainty about the proper transformation of y causes an enormous amount of model uncertainty. When the transformation for y is estimated from the data a high variance in predicted values on the original y scale may result, especially if the true transformation is linear. Comparing bootstrap or cross-validated mean absolute errors with and without restricting the y transform to be linear (ytype='l') may help the analyst choose the proper model complexity.
Usage
areg(x, y, xtype = NULL, ytype = NULL, nk = 4,
     B = 0, na.rm = TRUE, tolerance = NULL, crossval = NULL)
## S3 method for class 'areg'
print(x, digits=4, ...)
## S3 method for class 'areg'
plot(x, whichx = 1:ncol(x$x), ...)
## S3 method for class 'areg'
predict(object, x, type=c('lp','fitted','x'), what=c('all','sample'), ...)
Arguments
x | A single predictor or a matrix of predictors. Categorical predictors are required to be coded as integers (as |
y | a |
xtype | a vector of one-letter character codes specifying how each predictor is to be modeled, in order of columns of |
ytype | same coding as for |
nk | number of knots, 0 for linear, or 3 or more. Default is 4 which will fit 3 parameters to continuous variables (one linear term and two nonlinear terms) |
B | number of bootstrap resamples used to estimate covariance matrices of transformation parameters. Default is no bootstrapping. |
na.rm | set to |
tolerance | singularity tolerance. List source code for |
crossval | set to a positive integer k to compute k-fold cross-validated R-squared (square of first canonical correlation) and mean and median absolute error of predictions on the original scale |
digits | number of digits to use in formatting for printing |
object | an object created by |
whichx | integer or character vector specifying which predictors are to have their transformations plotted (default is all). The |
type | tells |
what | When the |
... | arguments passed to the plot function. |
Details
areg is a competitor of ace in the acepack package. Transformations from ace are seldom smooth enough and are often overfitted. With areg the complexity can be controlled with the nk parameter, and predicted values are easy to obtain because parametric functions are fitted.
If one side of the equation has a categorical variable with more than two categories and the other side has a continuous variable not assumed to act linearly, larger sample sizes are needed to reliably estimate transformations, as it is difficult to optimally score categorical variables to maximize R^2 against a simultaneously optimally transformed continuous variable.
Value
a list of class"areg" containing many objects
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
References
Breiman and Friedman, Journal of the American Statistical Association (September, 1985).
See Also
Examples
set.seed(1)
ns <- c(30,300,3000)
for(n in ns) {
  y <- sample(1:5, n, TRUE)
  x <- abs(y-3) + runif(n)
  par(mfrow=c(3,4))
  for(k in c(0,3:5)) {
    z <- areg(x, y, ytype='c', nk=k)
    plot(x, z$tx)
    title(paste('R2=',format(z$rsquared)))
    tapply(z$ty, y, range)
    a <- tapply(x,y,mean)
    b <- tapply(z$ty,y,mean)
    plot(a,b)
    abline(lsfit(a,b))
    # Should get same result to within linear transformation if reverse x and y
    w <- areg(y, x, xtype='c', nk=k)
    plot(z$ty, w$tx)
    title(paste('R2=',format(w$rsquared)))
    abline(lsfit(z$ty, w$tx))
  }
}
par(mfrow=c(2,2))
# Example where one category in y differs from others but only in variance of x
n <- 50
y <- sample(1:5,n,TRUE)
x <- rnorm(n)
x[y==1] <- rnorm(sum(y==1), 0, 5)
z <- areg(x,y,xtype='l',ytype='c')
z
plot(z)
z <- areg(x,y,ytype='c')
z
plot(z)
## Not run: 
# Examine overfitting when true transformations are linear
par(mfrow=c(4,3))
for(n in c(200,2000)) {
  x <- rnorm(n); y <- rnorm(n) + x
  for(nk in c(0,3,5)) {
    z <- areg(x, y, nk=nk, crossval=10, B=100)
    print(z)
    plot(z)
    title(paste('n=',n))
  }
}
par(mfrow=c(1,1))
# Underfitting when true transformation is quadratic but overfitting
# when y is allowed to be transformed
set.seed(49)
n <- 200
x <- rnorm(n); y <- rnorm(n) + .5*x^2
#areg(x, y, nk=0, crossval=10, B=100)
#areg(x, y, nk=4, ytype='l', crossval=10, B=100)
z <- areg(x, y, nk=4)  #, crossval=10, B=100)
z
# Plot x vs. predicted value on original scale.  Since y-transform is
# not monotonic, there are multiple y-inverses
xx <- seq(-3.5,3.5,length=1000)
yhat <- predict(z, xx, type='fitted')
plot(x, y, xlim=c(-3.5,3.5))
for(j in 1:ncol(yhat)) lines(xx, yhat[,j], col=j)
# Plot a random sample of possible y inverses
yhats <- predict(z, xx, type='fitted', what='sample')
points(xx, yhats, pch=2)
## End(Not run)
# True transformation of x1 is quadratic, y is linear
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n); y <- rnorm(n) + x1^2
z <- areg(cbind(x1,x2),y,xtype=c('s','l'),nk=3)
par(mfrow=c(2,2))
plot(z)
# y transformation is inverse quadratic but areg gets the same answer by
# making x1 quadratic
n <- 5000
x1 <- rnorm(n); x2 <- rnorm(n); y <- (x1 + rnorm(n))^2
z <- areg(cbind(x1,x2),y,nk=5)
par(mfrow=c(2,2))
plot(z)
# Overfit 20 predictors when no true relationships exist
n <- 1000
x <- matrix(runif(n*20),n,20)
y <- rnorm(n)
z <- areg(x, y, nk=5)  # add crossval=4 to expose the problem
# Test predict function
n <- 50
x <- rnorm(n)
y <- rnorm(n) + x
g <- sample(1:3, n, TRUE)
z <- areg(cbind(x,g),y,xtype=c('s','c'))
range(predict(z, cbind(x,g)) - z$linear.predictors)
Multiple Imputation using Additive Regression, Bootstrapping, and Predictive Mean Matching
Description
The transcan function creates flexible additive imputation models but provides only an approximation to true multiple imputation as the imputation models are fixed before all multiple imputations are drawn. This ignores variability caused by having to fit the imputation models. aregImpute takes all aspects of uncertainty in the imputations into account by using the bootstrap to approximate the process of drawing predicted values from a full Bayesian predictive distribution. Different bootstrap resamples are used for each of the multiple imputations, i.e., for the ith imputation of a sometimes missing variable, i=1,2,... n.impute, a flexible additive model is fitted on a sample with replacement from the original data and this model is used to predict all of the original missing and non-missing values for the target variable.
areg is used to fit the imputation models. By default, linearity is assumed for target variables (variables being imputed) and nk=3 knots are assumed for continuous predictors transformed using restricted cubic splines. If nk is three or greater and tlinear is set to FALSE, areg simultaneously finds transformations of the target variable and of all of the predictors, to get a good fit assuming additivity, maximizing R^2, using the same canonical correlation method as transcan. Flexible transformations may be overridden for specific variables by specifying the identity transformation for them. When a categorical variable is being predicted, the flexible transformation is Fisher's optimum scoring method. Nonlinear transformations for continuous variables may be nonmonotonic. If nk is a vector, areg's bootstrap and crossval=10 options will be used to help find the optimum validating value of nk over values of that vector, at the last imputation iteration. For the imputations, the minimum value of nk is used.
Instead of defaulting to taking random draws from fitted imputation models using random residuals as is done by transcan, aregImpute by default uses predictive mean matching with optional weighted probability sampling of donors rather than using only the closest match. Predictive mean matching works for binary, categorical, and continuous variables without the need for iterative maximum likelihood fitting for binary and categorical variables, and without the need for computing residuals or for curtailing imputed values to be in the range of actual data. Predictive mean matching is especially attractive when the variable being imputed is also being transformed automatically. Constraints may be placed on variables being imputed with predictive mean matching, e.g., a missing hospital discharge date may be required to be imputed from a donor observation whose discharge date is before the recipient subject's first post-discharge visit date. See Details below for more information about the algorithm. A "regression" method is also available that is similar to that used in transcan. This option should be used when mechanistic missingness requires the use of extrapolation during imputation.
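A single imputation by bootstrap plus closest-match predictive mean matching can be sketched in base R. This is a simplified illustration and not the aregImpute algorithm: it assumes one predictor, uses lm in place of areg, takes only the closest donor rather than weighted sampling, and performs no burn-in or repeated imputations.

```r
set.seed(2)
n <- 200
x <- rnorm(n)
y <- 2 * x + rnorm(n)
y[sample(n, 40)] <- NA                        # make 40 target values missing
obs <- !is.na(y)

# Fit the imputation model on a bootstrap sample of observations with y observed
boot <- sample(which(obs), replace = TRUE)
fit  <- lm(y ~ x, data = data.frame(x = x[boot], y = y[boot]))

# Predict the target for all original observations, missing or not
pred <- predict(fit, newdata = data.frame(x = x))

# For each missing y, find the donor whose predicted value is closest
cand  <- which(obs)
donor <- vapply(pred[!obs],
                function(p) cand[which.min(abs(pred[cand] - p))],
                integer(1))

y_imp <- y
y_imp[!obs] <- y[donor]                       # impute with the donor's observed value
```

Because imputed values are always observed values borrowed from donors, they stay within the range of the actual data without any curtailing step.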
A print method summarizes the results, and a plot method plots distributions of imputed values. Typically, fit.mult.impute will be called after aregImpute.
If a target variable is transformed nonlinearly (i.e., if nk is greater than zero and tlinear is set to FALSE) and the estimated target variable transformation is non-monotonic, imputed values are not unique. When type='regression', a random choice of possible inverse values is made.
The reformM function provides two ways of recreating a formula to give to aregImpute by reordering the variables in the formula. This is a modified version of a function written by Yong Hao Pua. One can specify nperm to obtain a list of nperm randomly permuted variables. The list is converted to a single ordinary formula if nperm=1. If nperm is omitted, variables are sorted in descending order of the number of NAs. reformM also prints a recommended number of multiple imputations to use, which is the percent of incomplete observations, with a minimum of 5.
Usage
aregImpute(formula, data, subset, n.impute=5, group=NULL,
           nk=3, tlinear=TRUE, type=c('pmm','regression','normpmm'),
           pmmtype=1, match=c('weighted','closest','kclosest'),
           kclosest=3, fweighted=0.2,
           curtail=TRUE, constraint=NULL,
           boot.method=c('simple', 'approximate bayesian'),
           burnin=3, x=FALSE, pr=TRUE, plotTrans=FALSE, tolerance=NULL, B=75)
## S3 method for class 'aregImpute'
print(x, digits=3, ...)
## S3 method for class 'aregImpute'
plot(x, nclass=NULL, type=c('ecdf','hist'),
     datadensity=c("hist", "none", "rug", "density"),
     diagnostics=FALSE, maxn=10, ...)
reformM(formula, data, nperm)
Arguments
formula | an S model formula. You can specify restrictions for transformations of variables. The function automatically determines which variables are categorical (i.e., |
x | an object created by |
data | input raw data |
subset | These may also be specified. You may not specify |
n.impute | number of multiple imputations. |
group | a character or factor variable the same length as the number of observations in |
nk | number of knots to use for continuous variables. When both the target variable and the predictors are having optimum transformations estimated, there is more instability than with normal regression so the complexity of the model should decrease more sharply as the sample size decreases. Hence set |
tlinear | set to |
type | The default is |
pmmtype | type of matching to be used for predictive mean matching when |
match | Defaults to |
kclosest | see |
fweighted | Smoothing parameter (multiple of mean absolute difference) used when |
curtail | applies if |
constraint | for predictive mean matching |
boot.method | By default, simple bootstrapping is used in which the target variable is predicted using a sample with replacement from the observations with non-missing target variable. Specify |
burnin |
|
pr | set to |
plotTrans | set to |
tolerance | singularity criterion; list the source code in the |
B | number of bootstrap resamples to use if |
digits | number of digits for printing |
nclass | number of bins to use in drawing histogram |
datadensity | see |
diagnostics | Specify |
maxn | Maximum number of observations shown for diagnostics. Default is |
nperm | number of random formula permutations for |
... | other arguments that are ignored |
Details
The sequence of steps used by the aregImpute algorithm is the following.
(1) For each variable containing m NAs where m > 0, initialize the NAs to values from a random sample (without replacement if a sufficient number of non-missing values exist) of size m from the non-missing values.
(2) For burnin+n.impute iterations do the following steps. The first burnin iterations provide a burn-in, and imputations are saved only from the last n.impute iterations.
(3) For each variable containing any NAs, draw a sample with replacement from the observations in the entire dataset in which the current variable being imputed is non-missing. Fit a flexible additive model to predict this target variable while finding the optimum transformation of it (unless the identity transformation is forced). Use this fitted flexible model to predict the target variable in all of the original observations. Impute each missing value of the target variable with the observed value whose predicted transformed value is closest to the predicted transformed value of the missing value (if match="closest" and type="pmm"), or use a draw from a multinomial distribution with probabilities derived from distance weights, if match="weighted" (the default).
(4) After these imputations are computed, use these random draw imputations the next time the current target variable is used as a predictor of other sometimes-missing variables.
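The core of step (3) can be sketched in base R for a single target variable and a single pass. This is only an illustration under stated assumptions, not the aregImpute implementation: an ordinary lm fit stands in for the flexible additive model, there is no burn-in or chaining across variables, and the tricube form of the distance weights and their scaling by fweighted times the mean absolute distance are assumptions. The helper name pmm_draw is hypothetical.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 * x + rnorm(n, sd = 0.5)
miss <- 1:30                                  # rows treated as missing
d <- data.frame(x = x, y = y)
d$y[miss] <- NA
obs <- which(!is.na(d$y))

# Bootstrap the rows with observed target, fit, and predict for all rows
boot <- sample(obs, replace = TRUE)
fit  <- lm(y ~ x, data = d[boot, ])           # stand-in for the additive model
pred <- predict(fit, newdata = d)

# Hypothetical helper: pick one donor value for one recipient
pmm_draw <- function(target_pred, donor_pred, donor_y,
                     match = c("weighted", "closest"), fweighted = 0.2) {
  match <- match.arg(match)
  dist <- abs(donor_pred - target_pred)
  if (match == "closest") return(donor_y[which.min(dist)])
  s <- fweighted * mean(dist)                 # assumed scaling of distances
  w <- (1 - pmin(dist / s, 1)^3)^3            # assumed tricube weights
  if (all(w == 0)) return(donor_y[which.min(dist)])
  donor_y[sample.int(length(donor_y), 1, prob = w)]
}

imp_weighted <- vapply(miss, function(i)
  pmm_draw(pred[i], pred[obs], d$y[obs], "weighted"), numeric(1))
imp_closest <- vapply(miss, function(i)
  pmm_draw(pred[i], pred[obs], d$y[obs], "closest"), numeric(1))
```

Every imputed value is an observed value borrowed from a donor, which is why no residuals need to be computed and no curtailing to the range of the data is needed.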
When match="closest", predictive mean matching does not work well when fewer than 3 variables are used to predict the target variable, because many of the multiple imputations for an observation will be identical. In the extreme case of one right-hand-side variable and assuming that only monotonic transformations of left and right-side variables are allowed, every bootstrap resample will give predicted values of the target variable that are monotonically related to predicted values from every other bootstrap resample. The same is true for Bayesian predicted values. This causes predictive mean matching to always match on the same donor observation.
When the missingness mechanism for a variable is so systematic that the distribution of observed values is truncated, predictive mean matching does not work. It will only yield imputed values that are near observed values, so intervals in which no values are observed will not be populated by imputed values. For this case, the only hope is to make regression assumptions and use extrapolation. With type="regression", aregImpute will use linear extrapolation to obtain a (hopefully) reasonable distribution of imputed values. The "regression" option causes aregImpute to impute missing values by adding a random sample of residuals (with replacement if there are more NAs than measured values) on the transformed scale of the target variable. After random residuals are added, predicted random draws are obtained on the original untransformed scale using reverse linear interpolation on the table of original and transformed target values (linear extrapolation when a random residual is large enough to put the random draw prediction outside the range of observed values). The bootstrap is used as with type="pmm" to factor in the uncertainty of the imputation model.
As model uncertainty is high when the transformation of a target variable is unknown, tlinear defaults to TRUE to limit the variance in predicted values when nk is positive.
Value
a list of class "aregImpute" containing the following elements:
call | the function call expression |
formula | the formula specified to |
match | the |
fweighted | the |
n | total number of observations in input dataset |
p | number of variables |
na | list of subscripts of observations for which values were originally missing |
nna | named vector containing the numbers of missing values in the data |
type | vector of types of transformations used for each variable( |
tlinear | value of |
nk | number of knots used for smooth transformations |
cat.levels | list containing character vectors specifying the |
df | degrees of freedom (number of parameters estimated) for each variable |
n.impute | number of multiple imputations per missing value |
imputed | a list containing matrices of imputed values in the same format as those created by |
x | if |
rsq | for the last round of imputations, a vector containing the R-squares with which each sometimes-missing variable could be predicted from the others by |
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
References
van Buuren, Stef. Flexible Imputation of Missing Data. Chapman & Hall/CRC, Boca Raton FL, 2012.
Little R, An H. Robust likelihood-based analysis of multivariate data with missing values. Statistica Sinica 14:949-968, 2004.
van Buuren S, Brand JPL, Groothuis-Oudshoorn CGM, Rubin DB. Fully conditional specifications in multivariate imputation. J Stat Comp Sim 72:1049-1064, 2006.
de Groot JAH, Janssen KJM, Zwinderman AH, Moons KGM, Reitsma JB. Multiple imputation to correct for partial verification bias revisited. Stat Med 27:5880-5889, 2008.
Siddique J. Multiple imputation using an iterative hot-deck with distance-based donor selection. Stat Med 27:83-102, 2008.
White IR, Royston P, Wood AM. Multiple imputation using chained equations: Issues and guidance for practice. Stat Med 30:377-399, 2011.
Curnow E, Carpenter JR, Heron JE, et al: Multiple imputation of missing data under missing at random: compatible imputation models are not sufficient to avoid bias if they are mis-specified. J Clin Epi June 9, 2023. DOI:10.1016/j.jclinepi.2023.06.011.
See Also
fit.mult.impute, transcan, areg, naclus, naplot, mice, dotchart3, Ecdf, completer
Examples
# Check that aregImpute can almost exactly estimate missing values when
# there is a perfect nonlinear relationship between two variables
# Fit restricted cubic splines with 4 knots for x1 and x2, linear for x3
set.seed(3)
x1 <- rnorm(200)
x2 <- x1^2
x3 <- runif(200)
m <- 30
x2[1:m] <- NA
a <- aregImpute(~x1+x2+I(x3), n.impute=5, nk=4, match='closest')
a
matplot(x1[1:m]^2, a$imputed$x2)
abline(a=0, b=1, lty=2)

x1[1:m]^2
a$imputed$x2

# Multiple imputation and estimation of variances and covariances of
# regression coefficient estimates accounting for imputation
# Example 1: large sample size, much missing data, no overlap in
# NAs across variables
x1 <- factor(sample(c('a','b','c'),1000,TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(1000,0,2)
x3 <- rnorm(1000)
y <- x2 + 1*(x1=='c') + .2*x3 + rnorm(1000,0,2)
orig.x1 <- x1[1:250]
orig.x2 <- x2[251:350]
x1[1:250] <- NA
x2[251:350] <- NA
d <- data.frame(x1,x2,x3,y, stringsAsFactors=TRUE)
# Find value of nk that yields best validating imputation models
# tlinear=FALSE means to not force the target variable to be linear
f <- aregImpute(~y + x1 + x2 + x3, nk=c(0,3:5), tlinear=FALSE,
                data=d, B=10) # normally B=75
f
# Try forcing target variable (x1, then x2) to be linear while allowing
# predictors to be nonlinear (could also say tlinear=TRUE)
f <- aregImpute(~y + x1 + x2 + x3, nk=c(0,3:5), data=d, B=10)
f

## Not run: 
# Use 100 imputations to better check against individual true values
f <- aregImpute(~y + x1 + x2 + x3, n.impute=100, data=d)
f
par(mfrow=c(2,1))
plot(f)
modecat <- function(u) {
 tab <- table(u)
 as.numeric(names(tab)[tab==max(tab)][1])
}
table(orig.x1,apply(f$imputed$x1, 1, modecat))
par(mfrow=c(1,1))
plot(orig.x2, apply(f$imputed$x2, 1, mean))
fmi <- fit.mult.impute(y ~ x1 + x2 + x3, lm, f, data=d)
sqrt(diag(vcov(fmi)))
fcc <- lm(y ~ x1 + x2 + x3)
summary(fcc) # SEs are larger than from mult. imputation

## End(Not run)

## Not run: 
# Example 2: Very discriminating imputation models,
# x1 and x2 have some NAs on the same rows, smaller n
set.seed(5)
x1 <- factor(sample(c('a','b','c'),100,TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(100,0,.4)
x3 <- rnorm(100)
y <- x2 + 1*(x1=='c') + .2*x3 + rnorm(100,0,.4)
orig.x1 <- x1[1:20]
orig.x2 <- x2[18:23]
x1[1:20] <- NA
x2[18:23] <- NA
#x2[21:25] <- NA
d <- data.frame(x1,x2,x3,y, stringsAsFactors=TRUE)
n <- naclus(d)
plot(n); naplot(n) # Show patterns of NAs
# 100 imputations to study them; normally use 5 or 10
f <- aregImpute(~y + x1 + x2 + x3, n.impute=100, nk=0, data=d)
par(mfrow=c(2,3))
plot(f, diagnostics=TRUE, maxn=2)
# Note: diagnostics=TRUE makes graphs similar to those made by:
# r <- range(f$imputed$x2, orig.x2)
# for(i in 1:6) { # use 1:2 to mimic maxn=2
#   plot(1:100, f$imputed$x2[i,], ylim=r,
#        ylab=paste("Imputations for Obs.",i))
#   abline(h=orig.x2[i],lty=2)
# }
table(orig.x1,apply(f$imputed$x1, 1, modecat))
par(mfrow=c(1,1))
plot(orig.x2, apply(f$imputed$x2, 1, mean))
fmi <- fit.mult.impute(y ~ x1 + x2, lm, f, data=d)
sqrt(diag(vcov(fmi)))
fcc <- lm(y ~ x1 + x2)
summary(fcc) # SEs are larger than from mult. imputation

## End(Not run)

## Not run: 
# Study relationship between smoothing parameter for weighting function
# (multiplier of mean absolute distance of transformed predicted
# values, used in tricube weighting function) and standard deviation
# of multiple imputations. SDs are computed from average variances
# across subjects. match="closest" same as match="weighted" with
# small value of fweighted.
# This example also shows problems with predicted mean
# matching almost always giving the same imputed values when there is
# only one predictor (regression coefficients change over multiple
# imputations but predicted values are virtually 1-1 functions of each
# other)
set.seed(23)
x <- runif(200)
y <- x + runif(200, -.05, .05)
r <- resid(lsfit(x,y))
rmse <- sqrt(sum(r^2)/(200-2)) # sqrt of residual MSE
y[1:20] <- NA
d <- data.frame(x,y)
f <- aregImpute(~ x + y, n.impute=10, match='closest', data=d)
# As an aside here is how to create a completed dataset for imputation
# number 3 as fit.mult.impute would do automatically. In this degenerate
# case changing 3 to 1-2,4-10 will not alter the results.
imputed <- impute.transcan(f, imputation=3, data=d, list.out=TRUE,
                           pr=FALSE, check=FALSE)
sd <- sqrt(mean(apply(f$imputed$y, 1, var)))
ss <- c(0, .01, .02, seq(.05, 1, length=20))
sds <- ss; sds[1] <- sd
for(i in 2:length(ss)) {
  f <- aregImpute(~ x + y, n.impute=10, fweighted=ss[i])
  sds[i] <- sqrt(mean(apply(f$imputed$y, 1, var)))
}
plot(ss, sds, xlab='Smoothing Parameter', ylab='SD of Imputed Values',
     type='b')
abline(v=.2, lty=2) # default value of fweighted
abline(h=rmse, lty=2) # root MSE of residuals from linear regression

## End(Not run)

## Not run: 
# Do a similar experiment for the Titanic dataset
getHdata(titanic3)
h <- lm(age ~ sex + pclass + survived, data=titanic3)
rmse <- summary(h)$sigma
set.seed(21)
f <- aregImpute(~ age + sex + pclass + survived, n.impute=10,
                data=titanic3, match='closest')
sd <- sqrt(mean(apply(f$imputed$age, 1, var)))
ss <- c(0, .01, .02, seq(.05, 1, length=20))
sds <- ss; sds[1] <- sd
for(i in 2:length(ss)) {
  f <- aregImpute(~ age + sex + pclass + survived, data=titanic3,
                  n.impute=10, fweighted=ss[i])
  sds[i] <- sqrt(mean(apply(f$imputed$age, 1, var)))
}
plot(ss, sds, xlab='Smoothing Parameter', ylab='SD of Imputed Values',
     type='b')
abline(v=.2, lty=2) # default value of fweighted
abline(h=rmse, lty=2) # root MSE of residuals from linear regression

## End(Not run)

set.seed(2)
d <- data.frame(x1=runif(50), x2=c(rep(NA, 10), runif(40)),
                x3=c(runif(4), rep(NA, 11), runif(35)))
reformM(~ x1 + x2 + x3, data=d)
reformM(~ x1 + x2 + x3, data=d, nperm=2)
# Give result or one of the results as the first argument to aregImpute

# Constrain imputed values for two variables
# Require imputed values for x2 to be above 0.2
# Assume x1 is never missing and require imputed values for
# x3 to be less than the recipient's value of x1
a <- aregImpute(~ x1 + x2 + x3, data=d,
                constraint=list(x2 = expression(d$x2 > 0.2),
                                x3 = expression(d$x3 < r$x1)))
a
Bivariate Summaries Computed Separately by a Series of Predictors
Description
biVar is a generic function that accepts a formula and usual data, subset, and na.action parameters plus a list statinfo that specifies a function of two variables to compute along with information about labeling results for printing and plotting. The function is called separately with each right hand side variable and the same left hand variable. The result is a matrix of bivariate statistics and the statinfo list that drives printing and plotting. The plot method draws a dot plot with x-axis values by default sorted in order of one of the statistics computed by the function.
spearman2 computes the square of Spearman's rho rank correlation and a generalization of it in which x can relate non-monotonically to y. This is done by computing the Spearman multiple rho-squared between (rank(x), rank(x)^2) and y. When x is categorical, a different kind of Spearman correlation used in the Kruskal-Wallis test is computed (and spearman2 can do the Kruskal-Wallis test). This is done by computing the ordinary multiple R^2 between k-1 dummy variables and rank(y), where x has k categories. x can also be a formula, in which case each predictor is correlated separately with y, using non-missing observations for that predictor. biVar is used to do the looping and bookkeeping. By default the plot shows the adjusted rho^2, using the same formula used for the ordinary adjusted R^2. The F test uses the unadjusted R^2.
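The generalized rho-squared described above can be reproduced in base R as an ordinary multiple R^2 on ranks. The sketch below uses the x and y vectors from the Examples and is meant only to show the idea; it is an assumption that this matches what spearman2 reports with p=2, and the variable names are illustrative.

```r
x <- c(-2, -1, 0, 1, 2)
y <- c(4, 1, 0, 1, 4)
rx <- rank(x)                       # rank() uses midranks for ties
ry <- rank(y)

# Ordinary squared Spearman rho: blind to the U-shaped relationship
rho2_linear <- cor(rx, ry)^2

# Generalization: multiple R^2 of rank(y) on rank(x) and rank(x)^2
fit <- lm(ry ~ rx + I(rx^2))
rho2_quad <- summary(fit)$r.squared

# Adjusted version, using the usual adjusted R^2 formula with p = 2 terms
n <- length(y); p <- 2
rho2_adj <- 1 - (1 - rho2_quad) * (n - 1) / (n - p - 1)

c(linear = rho2_linear, quadratic = rho2_quad, adjusted = rho2_adj)
```

For a categorical x the same idea applies with lm(rank(y) ~ factor(x)), which uses k-1 dummy variables.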
spearman computes Spearman's rho on non-missing values of two variables. spearman.test is a simple version of spearman2.default.
chiSquare is set up like spearman2 except it is intended for a categorical response variable. Separate Pearson chi-square tests are done for each predictor, with optional collapsing of infrequent categories. Numeric predictors having more than g levels are categorized into g quantile groups. chiSquare uses biVar.
Usage
biVar(formula, statinfo, data=NULL, subset=NULL,
      na.action=na.retain, exclude.imputed=TRUE, ...)

## S3 method for class 'biVar'
print(x, ...)

## S3 method for class 'biVar'
plot(x, what=info$defaultwhat, sort.=TRUE, main, xlab,
     vnames=c('names','labels'), ...)

spearman2(x, ...)

## Default S3 method:
spearman2(x, y, p=1, minlev=0, na.rm=TRUE, exclude.imputed=na.rm, ...)

## S3 method for class 'formula'
spearman2(formula, data=NULL,
          subset, na.action=na.retain, exclude.imputed=TRUE, ...)

spearman(x, y)

spearman.test(x, y, p=1)

chiSquare(formula, data=NULL, subset=NULL, na.action=na.retain,
          exclude.imputed=TRUE, ...)
Arguments
formula | a formula with a single left side variable |
statinfo | see |
data,subset,na.action | the usual options for models. Default for |
exclude.imputed | set to |
... | other arguments that are passed to the function used tocompute the bivariate statistics or to |
na.rm | logical; delete NA values? |
x | a numeric matrix with at least 5 rows and at least 2 columns (if |
y | a numeric vector |
p | for numeric variables, specifies the order of the Spearman |
minlev | minimum relative frequency that a level of a categorical predictor should have before it is pooled with other categories (see |
what | specifies which statistic to plot. Possibilities include the column names that appear when the print method is used. |
sort. | set |
main | main title for plot. Default title shows the name of the response variable. |
xlab | x-axis label. Default constructed from |
vnames | set to |
Details
Uses midranks in case of ties, as described by Hollander and Wolfe. P-values for Spearman, Wilcoxon, or Kruskal-Wallis tests are approximated by using the t or F distributions.
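For the two-variable Spearman case, the standard t approximation can be written out in base R as follows. This is a textbook sketch; that spearman2 uses exactly these degrees of freedom is an assumption, and the simulated data are illustrative only.

```r
set.seed(4)
x <- rnorm(30)
y <- x + rnorm(30)
n <- length(x)

rho <- cor(rank(x), rank(y))                  # Spearman rho via ranks
tstat <- rho * sqrt((n - 2) / (1 - rho^2))    # t statistic on n - 2 d.f.
p_t <- 2 * pt(-abs(tstat), df = n - 2)        # two-sided P-value

# Equivalent F form: F = t^2 on 1 and n - 2 degrees of freedom
p_f <- pf(tstat^2, 1, n - 2, lower.tail = FALSE)
c(rho = rho, p_t = p_t, p_f = p_f)
```

The t and F forms give the same P-value; the F form is the one that extends to more than one numerator degree of freedom, as in the quadratic and categorical cases.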
Value
spearman2.default (the function that is called for a single x, i.e., when there is no formula) returns a vector of statistics for the variable. biVar, spearman2.formula, and chiSquare return a matrix with rows corresponding to predictors.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
References
Hollander M. and Wolfe D.A. (1973). Nonparametric Statistical Methods. New York: Wiley.
Press WH, Flannery BP, Teukolsky SA, Vetterling WT (1988): Numerical Recipes in C. Cambridge: Cambridge University Press.
See Also
combine.levels, varclus, dotchart3, impute, chisq.test, cut2.
Examples
x <- c(-2, -1, 0, 1, 2)
y <- c(4, 1, 0, 1, 4)
z <- c(1, 2, 3, 4, NA)
v <- c(1, 2, 3, 4, 5)
spearman2(x, y)
plot(spearman2(z ~ x + y + v, p=2))
f <- chiSquare(z ~ x + y + v)
f
Confidence Intervals for Binomial Probabilities
Description
Produces 1-alpha confidence intervals for binomial probabilities.
Usage
binconf(x, n, alpha=0.05,
        method=c("wilson","exact","asymptotic","all"),
        include.x=FALSE, include.n=FALSE, return.df=FALSE)
Arguments
x | vector containing the number of "successes" for binomial variates |
n | vector containing the numbers of corresponding observations |
alpha | probability of a type I error, so confidence coefficient = 1-alpha |
method | character string specifying which method to use. The "all" method only works when x and n are length 1. The "exact" method uses the F distribution to compute exact (based on the binomial cdf) intervals; the "wilson" interval is score-test-based; and the "asymptotic" is the text-book, asymptotic normal interval. Following Agresti and Coull, the Wilson interval is to be preferred and so is the default. |
include.x | logical flag to indicate whether |
include.n | logical flag to indicate whether |
return.df | logical flag to indicate that a data frame rather than a matrix be returned |
Value
a matrix or data.frame containing the computed intervals and, optionally, x and n.
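To make the difference between the "wilson" and "asymptotic" methods concrete, here is a base R sketch of the two textbook formulas. It is not the binconf source; the helper names wilson_ci and asymptotic_ci and the column labels are illustrative assumptions.

```r
# Wilson (score-test-based) interval
wilson_ci <- function(x, n, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  p <- x / n
  center <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  cbind(PointEst = p, Lower = center - half, Upper = center + half)
}

# Textbook asymptotic normal (Wald) interval
asymptotic_ci <- function(x, n, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  p <- x / n
  se <- sqrt(p * (1 - p) / n)
  cbind(PointEst = p, Lower = p - z * se, Upper = p + z * se)
}

w <- wilson_ci(46, 50)
a <- asymptotic_ci(46, 50)
rbind(wilson = w[1, ], asymptotic = a[1, ])
```

The Wilson limits always stay inside [0, 1] and are not symmetric about the point estimate, whereas the asymptotic limits can fall outside the unit interval when the proportion is near 0 or 1.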
Author(s)
Rollin Brant, Modified by Frank Harrell and
Brad Biggerstaff
Centers for Disease Control and Prevention
National Center for Infectious Diseases
Division of Vector-Borne Infectious Diseases
P.O. Box 2087, Fort Collins, CO, 80522-2087, USA
bkb5@cdc.gov
References
A. Agresti and B.A. Coull, Approximate is better than "exact" for interval estimation of binomial proportions, American Statistician, 52:119–126, 1998.
R.G. Newcombe, Logit confidence intervals and the inverse sinh transformation, American Statistician, 55:200–202, 2001.
L.D. Brown, T.T. Cai and A. DasGupta, Interval estimation for a binomial proportion (with discussion), Statistical Science, 16:101–133, 2001.
Examples
binconf(0:10,10,include.x=TRUE,include.n=TRUE)
binconf(46,50,method="all")
Bootstrap Kaplan-Meier Estimates
Description
Bootstraps Kaplan-Meier estimate of the probability of survival to at least a fixed time (times variable) or the estimate of the q quantile of the survival distribution (e.g., median survival time, the default).
Usage
bootkm(S, q=0.5, B=500, times, pr=TRUE)
Arguments
S | a |
q | quantile of survival time, default is 0.5 for median |
B | number of bootstrap repetitions (default=500) |
times | time vector (currently only a scalar is allowed) at which to compute survival estimates. You may specify only one of |
pr | set to |
Details
bootkm uses Therneau's survfitKM function to efficiently compute Kaplan-Meier estimates.
Value
a vector containing B bootstrap estimates
Side Effects
updates .Random.seed, and, if pr=TRUE, prints progress of simulations
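The resampling idea behind bootkm can be sketched with the public survfit interface of the survival package. This is a slow, illustrative loop rather than the bootkm implementation (which, as noted above, calls survfitKM directly); the simulated data and variable names are assumptions.

```r
library(survival)

set.seed(1)
n <- 100
time   <- rexp(n)                 # simulated follow-up times
status <- rbinom(n, 1, 0.8)       # 1 = event, 0 = censored

B <- 50                           # small B to keep the sketch fast
meds <- numeric(B)
for (b in seq_len(B)) {
  i <- sample.int(n, replace = TRUE)            # resample subjects
  fit <- survfit(Surv(time[i], status[i]) ~ 1)  # Kaplan-Meier on resample
  meds[b] <- summary(fit)$table[["median"]]     # NA if curve stays above 0.5
}
quantile(meds, c(0.025, 0.975), na.rm = TRUE)
```

As in the Examples, na.rm=TRUE guards against resamples in which the Kaplan-Meier curve never reaches 0.5, so the median is undefined.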
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com
References
Akritas MG (1986): Bootstrapping the Kaplan-Meier estimator. JASA 81:1032–1038.
See Also
survfit, Surv, Survival.cph, Quantile.cph
Examples
# Compute 0.95 nonparametric confidence interval for the difference in
# median survival time between females and males (two-sample problem)
set.seed(1)
library(survival)
S <- Surv(runif(200)) # no censoring
sex <- c(rep('female',100),rep('male',100))
med.female <- bootkm(S[sex=='female',], B=100) # normally B=500
med.male <- bootkm(S[sex=='male',], B=100)
describe(med.female-med.male)
quantile(med.female-med.male, c(.025,.975), na.rm=TRUE)
# na.rm needed because some bootstrap estimates of median survival
# time may be missing when a bootstrap sample did not include the
# longer survival times
Power and Sample Size for Two-Sample Binomial Test
Description
Uses method of Fleiss, Tytun, and Ury (but without the continuity correction) to estimate the power (or the sample size to achieve a given power) of a two-sided test for the difference in two proportions. The two sample sizes are allowed to be unequal, but for bsamsize you must specify the fraction of observations in group 1. For power calculations, one probability (p1) must be given, and either the other probability (p2), an odds.ratio, or a percent.reduction must be given. For bpower or bsamsize, any or all of the arguments may be vectors, in which case they return a vector of powers or sample sizes. All vector arguments must have the same length.
Given p1, p2, ballocation uses the method of Brittain and Schlesselman to compute the optimal fraction of observations to be placed in group 1 that either (1) minimize the variance of the difference in two proportions, (2) minimize the variance of the ratio of the two proportions, (3) minimize the variance of the log odds ratio, or (4) maximize the power of the 2-tailed test for differences. For (4) the total sample size must be given, or the fraction optimizing the power is not returned. The fraction for (3) is one minus the fraction for (1).
bpower.sim estimates power by simulations, in minimal time. By using bpower.sim you can see that the formulas without any continuity correction are quite accurate, and that the power of a continuity-corrected test is significantly lower. That's why no continuity corrections are implemented here.
Usage
bpower(p1, p2, odds.ratio, percent.reduction, n, n1, n2, alpha=0.05)

bsamsize(p1, p2, fraction=.5, alpha=.05, power=.8)

ballocation(p1, p2, n, alpha=.05)

bpower.sim(p1, p2, odds.ratio, percent.reduction, n, n1, n2,
           alpha=0.05, nsim=10000)
Arguments
p1 | population probability in the group 1 |
p2 | probability for group 2 |
odds.ratio | odds ratio to detect |
percent.reduction | percent reduction in risk to detect |
n | total sample size over the two groups. If you omit this for |
n1 | sample size in group 1 |
n2 | sample size in group 2. |
alpha | type I assertion probability |
fraction | fraction of observations in group 1 |
power | the desired probability of detecting a difference |
nsim | number of simulations of binomial responses |
Details
For bpower.sim, all arguments must be of length one.
Value
for bpower, the power estimate; for bsamsize, a vector containing the sample sizes in the two groups; for ballocation, a vector with 4 fractions of observations allocated to group 1, optimizing the four criteria mentioned above. For bpower.sim, a vector with three elements is returned, corresponding to the simulated power and its lower and upper 0.95 confidence limits.
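A normal-approximation power calculation without continuity correction can be sketched in base R as below. It is a textbook formula offered for orientation; whether bpower evaluates exactly this expression is an assumption, and the helper names p2_from_or and power_two_prop are hypothetical.

```r
# Convert an odds ratio (group 2 vs. group 1) into the group 2 probability
p2_from_or <- function(p1, or) p1 * or / (1 - p1 + p1 * or)

# Two-sided power for comparing two proportions, no continuity correction
power_two_prop <- function(p1, p2, n1, n2, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  pbar <- (n1 * p1 + n2 * p2) / (n1 + n2)               # pooled probability
  se0 <- sqrt(pbar * (1 - pbar) * (1 / n1 + 1 / n2))    # SE under H0
  se1 <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # SE under H1
  delta <- abs(p1 - p2)
  pnorm((delta - z * se0) / se1) + pnorm((-delta - z * se0) / se1)
}

p2 <- p2_from_or(0.1, 0.9)
pw <- power_two_prop(0.1, p2, n1 = c(500, 5000), n2 = c(500, 5000))
pw
```

Because the function is vectorized, passing vectors of sample sizes returns a vector of powers, mirroring the vector-argument behavior described in the Description.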
AUTHOR
Frank Harrell
Department of Biostatistics
Vanderbilt University
References
Fleiss JL, Tytun A, Ury HK (1980): A simple approximation for calculating sample sizes for comparing independent proportions. Biometrics 36:343–6.
Brittain E, Schlesselman JJ (1982): Optimal allocation for the comparison of proportions. Biometrics 38:1003–9.
Gordon I, Watson R (1996): The myth of continuity-corrected sample size formulae. Biometrics 52:71–6.
See Also
samplesize.bin, chisq.test, binconf
Examples
bpower(.1, odds.ratio=.9, n=1000, alpha=c(.01,.05))
bpower.sim(.1, odds.ratio=.9, n=1000)
bsamsize(.1, .05, power=.95)
ballocation(.1, .5, n=100)

# Plot power vs. n for various odds ratios (base prob.=.1)
n <- seq(10, 1000, by=10)
OR <- seq(.2,.9,by=.1)
plot(0, 0, xlim=range(n), ylim=c(0,1), xlab="n", ylab="Power", type="n")
for(or in OR) {
  lines(n, bpower(.1, odds.ratio=or, n=n))
  text(350, bpower(.1, odds.ratio=or, n=350)-.02, format(or))
}

# Another way to plot the same curves, but letting labcurve do the
# work, including labeling each curve at points of maximum separation
pow <- lapply(OR, function(or,n)list(x=n,y=bpower(p1=.1,odds.ratio=or,n=n)),
              n=n)
names(pow) <- format(OR)
labcurve(pow, pl=TRUE, xlab='n', ylab='Power')

# Contour graph for various probabilities of outcome in the control
# group, fixing the odds ratio at .8 ([p2/(1-p2) / p1/(1-p1)] = .8)
# n is varied also
p1 <- seq(.01,.99,by=.01)
n <- seq(100,5000,by=250)
pow <- outer(p1, n, function(p1,n) bpower(p1, n=n, odds.ratio=.8))
# This forms a length(p1)*length(n) matrix of power estimates
contour(p1, n, pow)
Box-percentile plots
Description
Produces side-by-side box-percentile plots from several vectors or a list of vectors.
Usage
bpplot(..., name=TRUE, main="Box-Percentile Plot",
       xlab="", ylab="", srtx=0, plotopts=NULL)
Arguments
... | vectors or lists containing numeric components (e.g., the output of |
name | character vector of names for the groups. Default is |
main | main title for the plot. |
xlab | x axis label. |
ylab | y axis label. |
srtx | rotation angle for x-axis labels. Default is zero. |
plotopts | a list of other parameters to send to |
Value
There are no returned values
Side Effects
A plot is created on the current graphics device.
BACKGROUND
Box-percentile plots are similar to boxplots, except box-percentile plots supply more information about the univariate distributions. At any height the width of the irregular "box" is proportional to the percentile of that height, up to the 50th percentile, and above the 50th percentile the width is proportional to 100 minus the percentile. Thus, the width at any given height is proportional to the percent of observations that are more extreme in that direction. As in boxplots, the median, 25th and 75th percentiles are marked with line segments across the box.
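The width rule described above is easy to compute directly. The following base R sketch evaluates the relative width at a few heights using the empirical distribution function; it is not taken from bpplot, and the helper name bp_width is hypothetical.

```r
# Relative box width at given heights: the percentile up to the median,
# and 100 minus the percentile above it
bp_width <- function(y, at) {
  pct <- 100 * ecdf(y)(at)
  ifelse(pct <= 50, pct, 100 - pct)
}

set.seed(1)
y <- rnorm(500)
h <- quantile(y, c(0.10, 0.25, 0.50, 0.75, 0.90))
w <- bp_width(y, h)
data.frame(height = unname(h), width = as.numeric(w))
```

The width peaks at the median and shrinks symmetrically in percentile terms toward both tails, which is what gives the plot its characteristic diamond-like outline.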
Author(s)
Jeffrey Banfield
umsfjban@bill.oscs.montana.edu
Modified by F. Harrell 30Jun97
References
Esty WW, Banfield J: The box-percentile plot. J Statistical Software 8 No. 17, 2003.
See Also
panel.bpplot, boxplot, Ecdf, bwplot
Examples
set.seed(1)
x1 <- rnorm(500)
x2 <- runif(500, -2, 2)
x3 <- abs(rnorm(500))-2
bpplot(x1, x2, x3)
g <- sample(1:2, 500, replace=TRUE)
bpplot(split(x2, g), name=c('Group 1','Group 2'))
rm(x1,x2,x3,g)
Statistics by Categories
Description
For any number of cross-classification variables, bystats returns a matrix with the sample size, number missing y, and fun(non-missing y), with the cross-classifications designated by rows. Uses Harrell's modification of the interaction function to produce cross-classifications. The default fun is mean, and if y is binary, the mean is labeled as Fraction. There is a print method as well as a latex method for objects created by bystats. bystats2 handles the special case in which there are 2 classification variables, and places the first one in rows and the second in columns. The print method for bystats2 uses the print.char.matrix function to organize statistics for cells into boxes.
Usage
bystats(y, ..., fun, nmiss, subset)

## S3 method for class 'bystats'
print(x, ...)

## S3 method for class 'bystats'
latex(object, title, caption, rowlabel, ...)

bystats2(y, v, h, fun, nmiss, subset)

## S3 method for class 'bystats2'
print(x, abbreviate.dimnames=FALSE,
      prefix.width=max(nchar(dimnames(x)[[1]])), ...)

## S3 method for class 'bystats2'
latex(object, title, caption, rowlabel, ...)
Arguments
y | a binary, logical, or continuous variable or a matrix or data frame ofsuch variables. If |
... | For |
v | vertical variable for |
h | horizontal variable for |
fun | a function to compute on the non-missing |
nmiss | A column containing a count of missing values is included if |
subset | a vector of subscripts or logical values indicating the subset of data to analyze |
abbreviate.dimnames | set to |
prefix.width | |
title |
|
caption | caption to pass to |
rowlabel |
|
x | an object created by |
object | an object created by |
Value
for bystats, a matrix with row names equal to the classification labels and column names N, Missing, funlab, where funlab is determined from fun. A row is added to the end with the summary statistics computed on all observations combined. The class of this matrix is bystats. For bystats2, returns a 3-dimensional array with the last dimension corresponding to statistics being computed. The class of the array is bystats2.
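The shape of the bystats result (one row per category, columns for N, Missing, and the statistic, plus a final row for all observations) can be imitated in base R as follows. This is a minimal sketch for a single classification variable, not the bystats code; the helper stats_by, the column label Stat, and the simulated data are assumptions.

```r
set.seed(3)
race  <- sample(c("a", "b"), 40, replace = TRUE)
death <- rbinom(40, 1, 0.3)
death[c(2, 17)] <- NA                      # introduce two missing outcomes

# Hypothetical helper: N, Missing, and fun(non-missing y) by category
stats_by <- function(y, g, fun = mean) {
  one <- function(v) c(N = sum(!is.na(v)),
                       Missing = sum(is.na(v)),
                       Stat = fun(v[!is.na(v)]))
  out <- do.call(rbind, lapply(split(y, g), one))
  rbind(out, ALL = one(y))                 # summary row on all observations
}

tab <- stats_by(death, race)
tab
```

Since death is binary, the Stat column here is the fraction of deaths among non-missing observations in each category.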
Side Effects
latex produces a .tex file.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
See Also
interaction, cut, cut2, latex, print.char.matrix, translate
Examples
## Not run: 
bystats(sex==2, county, city)
bystats(death, race)
bystats(death, cut2(age,g=5), race)
bystats(cholesterol, cut2(age,g=4), sex, fun=median)
bystats(cholesterol, sex, fun=quantile)
bystats(cholesterol, sex, fun=function(x)c(Mean=mean(x),Median=median(x)))
latex(bystats(death,race,nmiss=FALSE,subset=sex=="female"), digits=2)
f <- function(y) c(Hazard=sum(y[,2])/sum(y[,1]))
# f() gets the hazard estimate for right-censored data from exponential dist.
bystats(cbind(d.time, death), race, sex, fun=f)
bystats(cbind(pressure, cholesterol), age.decile,
        fun=function(y) c(Median.pressure =median(y[,1]),
                          Median.cholesterol=median(y[,2])))
y <- cbind(pressure, cholesterol)
bystats(y, age.decile,
        fun=function(y) apply(y, 2, median)) # same result as last one
bystats(y, age.decile, fun=function(y) apply(y, 2, quantile, c(.25,.75)))
# The last one computes separately the 0.25 and 0.75 quantiles of 2 vars.
latex(bystats2(death, race, sex, fun=table))

## End(Not run)
capitalize the first letter of a string
Description
Capitalizes the first letter of each element of the string vector.
Usage
capitalize(string)
Arguments
string | String to be capitalized |
Value
Returns a vector of characters with the first letter capitalized
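For readers who want the same effect without loading the package, a base R equivalent is sketched below. It is an illustration of the idea rather than the capitalize source, and the name cap_first is hypothetical.

```r
# Upper-case the first character of each element, leave the rest untouched
cap_first <- function(s) {
  paste0(toupper(substring(s, 1, 1)), substring(s, 2))
}

res <- cap_first(c("Hello", "bob", "daN"))
res
```

Only the first character changes; existing capitals later in a string, such as the N in "daN", are preserved.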
Author(s)
Charles Dupont
Examples
capitalize(c("Hello", "bob", "daN"))
Power of Interaction Test for Exponential Survival
Description
Uses the method of Peterson and George to compute the power of an interaction test in a 2 x 2 setup in which all 4 distributions are exponential. This will be the same as the power of the Cox model test if assumptions hold. The test is 2-tailed. The duration of accrual is specified (constant accrual is assumed), as is the minimum follow-up time. The maximum follow-up time is then accrual + tmin. Treatment allocation is assumed to be 1:1.
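The inputs to ciapower are mortalities at tref and percent reductions, whereas the exponential model works with hazards. The base R sketch below shows that conversion for the scenario used in the Examples. It only illustrates the parameterization and does not reproduce the Peterson and George power calculation; expressing the interaction as a difference of log hazard ratios is an assumption, as are all variable names.

```r
tref <- 5
m_control <- c(white = 0.18, nonwhite = 0.23)   # tref-year control mortality
reduction <- c(white = 15,   nonwhite = 0)      # % reduction with treatment
m_treated <- m_control * (1 - reduction / 100)

# Exponential survival: S(t) = exp(-lambda * t) = 1 - mortality at t
hazard <- function(m, t) -log(1 - m) / t
lam_c <- hazard(m_control, tref)
lam_t <- hazard(m_treated, tref)

log_hr <- log(lam_t / lam_c)                    # treatment effect per stratum
interaction <- log_hr[["white"]] - log_hr[["nonwhite"]]

accrual <- 1.5; tmin <- 5
max_followup <- accrual + tmin                  # as stated in the Description
c(interaction = interaction, max_followup = max_followup)
```

With no reduction in the second stratum its hazard ratio is 1, so the whole interaction comes from the 15% mortality reduction in the first stratum.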
Usage
ciapower(tref, n1, n2, m1c, m2c, r1, r2, accrual, tmin,
         alpha=0.05, pr=TRUE)
Arguments
tref | time at which mortalities are estimated |
n1 | total sample size, stratum 1 |
n2 | total sample size, stratum 2 |
m1c | tref-year mortality, stratum 1 control |
m2c | tref-year mortality, stratum 2 control |
r1 | % reduction in |
r2 | % reduction in |
accrual | duration of accrual period |
tmin | minimum follow-up time |
alpha | type I error probability |
pr | set to |
Value
power
Side Effects
prints
AUTHOR
Frank Harrell
Department of Biostatistics
Vanderbilt University
References
Peterson B, George SL: Controlled Clinical Trials 14:511–522; 1993.
See Also
Examples
# Find the power of a race x treatment test. 25% of patients will
# be non-white and the total sample size is 14000.
# Accrual is for 1.5 years and minimum follow-up is 5y.
# Reduction in 5-year mortality is 15% for whites, 0% or -5% for
# non-whites. 5-year mortality for control subjects is assumed to
# be 0.18 for whites, 0.23 for non-whites.
n <- 14000
for(nonwhite.reduction in c(0,-5)) {
  cat("\n\n\n% Reduction in 5-year mortality for non-whites:",
      nonwhite.reduction, "\n\n")
  pow <- ciapower(5, .75*n, .25*n, .18, .23, 15, nonwhite.reduction,
                  1.5, 5)
  cat("\n\nPower:",format(pow),"\n")
}
Convert between the 5 different coordinate systems on a graphical device
Description
Takes a set of coordinates in any of the 5 coordinate systems (usr, plt, fig, dev, or tdev) and returns the same points in all 5 coordinate systems.
Usage
cnvrt.coords(x, y = NULL, input = c("usr", "plt", "fig", "dev","tdev"))
Arguments
x | Vector, Matrix, or list of x coordinates (or x and y coordinates), NA's allowed. |
y | y coordinates (if |
input | Character scalar indicating the coordinate system of the input points. |
Details
Every plot has 5 coordinate systems:
usr (User): the coordinate system of the data, this is shown by the tick marks and axis labels.
plt (Plot): Plot area, coordinates range from 0 to 1 with 0 corresponding to the x and y axes and 1 corresponding to the top and right of the plot area. Margins of the plot correspond to plot coordinates less than 0 or greater than 1.
fig (Figure): Figure area, coordinates range from 0 to 1 with 0 corresponding to the bottom and left edges of the figure (including margins, label areas) and 1 corresponds to the top and right edges. fig and dev coordinates will be identical if there is only 1 figure area on the device (layout, mfrow, or mfcol has not been used).
dev (Device): Device area, coordinates range from 0 to 1 with 0 corresponding to the bottom and left of the device region within the outer margins and 1 is the top and right of the region within the outer margins. If the outer margins are all set to 0 then tdev and dev should be identical.
tdev (Total Device): Total Device area, coordinates range from 0 to 1 with 0 corresponding to the bottom and left edges of the device (piece of paper, window on screen) and 1 corresponds to the top and right edges.
Value
A list with 5 components, each component is a list with vectors named x and y. The 5 sublists are:
usr | The coordinates of the input points in usr (User) coordinates. |
plt | The coordinates of the input points in plt (Plot) coordinates. |
fig | The coordinates of the input points in fig (Figure) coordinates. |
dev | The coordinates of the input points in dev (Device) coordinates. |
tdev | The coordinates of the input points in tdev (Total Device) coordinates. |
Note
You must provide both x and y, but one of them may be NA.
This function is becoming deprecated in favor of the new functions grconvertX and grconvertY in R version 2.7.0 and beyond. These new functions use the correct coordinate system names and have more coordinate systems available; you should start using them instead.
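For readers moving to the base graphics replacements, here is a minimal sketch using grconvertX and grconvertY. The unit names "user", "npc" and "ndc" are those of base R; treating them as the counterparts of usr, plt and tdev is an interpretation of the definitions above, not something this page states. A null PDF device is opened only so the sketch runs without writing a file.

```r
pdf(NULL)                       # null device: nothing is written to disk
plot(1:10, 1:10)

# Corners of the plot region (0 and 1 in "npc") expressed in user coordinates
x_usr <- grconvertX(c(0, 1), from = "npc", to = "user")
y_usr <- grconvertY(c(0, 1), from = "npc", to = "user")

# A data point expressed as a fraction of the whole device ("ndc")
x_dev <- grconvertX(5, from = "user", to = "ndc")
y_dev <- grconvertY(5, from = "user", to = "ndc")

usr <- par("usr")               # c(x1, x2, y1, y2) limits of the plot region
invisible(dev.off())
```

The converted plot-region corners should agree with par("usr"), which is a quick way to check that a conversion is going in the intended direction.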
Author(s)
Greg Snow greg.snow@imail.org
See Also
par specifically 'usr', 'plt', and 'fig'. Also 'xpd' for plotting outside of the plotting region and 'mfrow' and 'mfcol' for multi figure plotting. subplot, grconvertX and grconvertY in R 2.7.0 and later
Examples
old.par <- par(no.readonly=TRUE)
par(mfrow=c(2,2),xpd=NA)
# generate some sample data
tmp.x <- rnorm(25, 10, 2)
tmp.y <- rnorm(25, 50, 10)
tmp.z <- rnorm(25, 0, 1)
plot( tmp.x, tmp.y)
# draw a diagonal line across the plot area
tmp1 <- cnvrt.coords( c(0,1), c(0,1), input='plt' )
lines(tmp1$usr, col='blue')
# draw a diagonal line across figure region
tmp2 <- cnvrt.coords( c(0,1), c(1,0), input='fig')
lines(tmp2$usr, col='red')
# save coordinate of point 1 and y value near top of plot for future plots
tmp.point1 <- cnvrt.coords(tmp.x[1], tmp.y[1])
tmp.range1 <- cnvrt.coords(NA, 0.98, input='plt')
# make a second plot and draw a line linking point 1 in each plot
plot(tmp.y, tmp.z)
tmp.point2 <- cnvrt.coords( tmp.point1$dev, input='dev' )
arrows( tmp.y[1], tmp.z[1], tmp.point2$usr$x, tmp.point2$usr$y, col='green')
# draw another plot and add rectangle showing same range in 2 plots
plot(tmp.x, tmp.z)
tmp.range2 <- cnvrt.coords(NA, 0.02, input='plt')
tmp.range3 <- cnvrt.coords(NA, tmp.range1$dev$y, input='dev')
rect( 9, tmp.range2$usr$y, 11, tmp.range3$usr$y, border='yellow')
# put a label just to the right of the plot and
# near the top of the figure region.
text( cnvrt.coords(1.05, NA, input='plt')$usr$x,
      cnvrt.coords(NA, 0.75, input='fig')$usr$y,
      "Label", adj=0)
par(mfrow=c(1,1))
## create a subplot within another plot (see also subplot)
plot(1:10, 1:10)
tmp <- cnvrt.coords( c( 1, 4, 6, 9), c(6, 9, 1, 4) )
par(plt = c(tmp$dev$x[1:2], tmp$dev$y[1:2]), new=TRUE)
hist(rnorm(100))
par(fig = c(tmp$dev$x[3:4], tmp$dev$y[3:4]), new=TRUE)
hist(rnorm(100))
par(old.par)
Miscellaneous ggplot2 and grid Helper Functions
Description
These functions are used on ggplot2 objects or as layers when building a ggplot2 object, and to facilitate use of gridExtra. colorFacet colors the thin rectangles used to separate panels created by facet_grid (and probably by facet_wrap). A better approach may be found at https://stackoverflow.com/questions/28652284/. arrGrob is a front-end to gridExtra::arrangeGrob that allows for proper printing. See https://stackoverflow.com/questions/29062766/store-output-from-gridextragrid-arrange-into-an-object/. The arrGrob print method calls grid::grid.draw.
Usage
colorFacet(g, col = adjustcolor("blue", alpha.f = 0.3))
arrGrob(...)
## S3 method for class 'arrGrob'
print(x, ...)
Arguments
g | a |
col | color for facet separator rectangles |
... | passed to |
x | an object created by |
Author(s)
Sandy Muspratt
Examples
## Not run: 
require(ggplot2)
s <- summaryP(age + sex ~ region + treatment)
colorFacet(ggplot(s))   # prints directly
# arrGrob is called by rms::ggplot.Predict and others
## End(Not run)
combine.levels
Description
Combine Infrequent Levels of a Categorical Variable
Usage
combine.levels(x, minlev = 0.05, m, ord = is.ordered(x), plevels = FALSE, sep = ",")
Arguments
x | a factor, 'ordered' factor, or numeric or character variable that will be turned into a 'factor' |
minlev | the minimum proportion of observations in a cell before that cell is combined with one or more cells. If more than one cell has fewer than minlev*n observations, all such cells are combined into a new cell labeled '"OTHER"'. Otherwise, the lowest frequency cell is combined with the next lowest frequency cell, and the level name is the combination of the two old level names. When 'ord=TRUE' combinations happen only for consecutive levels. |
m | alternative to 'minlev', is the minimum number of observations in a cell before it will be combined with others |
ord | set to 'TRUE' to treat 'x' as if it were an ordered factor, which allows only consecutive levels to be combined |
plevels | by default 'combine.levels' pools low-frequency levels into a category named 'OTHER' when 'x' is not ordered and 'ord=FALSE'. To instead name this category the concatenation of all the pooled level names, separated by a comma, set 'plevels=TRUE'. |
sep | the separator for concatenating levels when 'plevels=TRUE' |
Details
After turning 'x' into a 'factor' if it is not one already, combines levels of 'x' whose frequency falls below a specified relative frequency 'minlev' or absolute count 'm'. When 'x' is not treated as ordered, all of the small frequency levels are combined into '"OTHER"', unless 'plevels=TRUE'. When 'ord=TRUE' or 'x' is an ordered factor, only consecutive levels are combined. New levels are constructed by concatenating the levels with 'sep' as a separator. This is useful when comparing ordinal regression with polytomous (multinomial) regression and there are too many categories for polytomous regression. 'combine.levels' is also useful when assumptions of ordinal models are being checked empirically by computing exceedance probabilities for various cutoffs of the dependent variable.
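The unordered pooling rule can be sketched in base R as follows. This is an illustration of the idea only, not the 'combine.levels' implementation: the helper name pool_rare is hypothetical, and it covers just the case where two or more levels fall below 'minlev' and are pooled into "OTHER".

```r
# Hypothetical helper: pool all levels whose relative frequency is below minlev
pool_rare <- function(x, minlev = 0.05, other = "OTHER") {
  x <- as.factor(x)
  freq <- table(x) / length(x)                 # relative frequency per level
  rare <- names(freq)[freq < minlev]
  # Assigning one name to several levels merges them into a single level
  if (length(rare) > 1) levels(x)[levels(x) %in% rare] <- other
  x
}

x <- c(rep("A", 1), rep("B", 30), rep("C", 40), rep("D", 1), rep("E", 28))
pooled <- pool_rare(x)
table(pooled)
```

Here levels A and D each account for 1% of the observations, so both are absorbed into a single "OTHER" level while B, C and E are left alone.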
Value
a factor variable, or if 'ord=TRUE' an ordered factor variable
Author(s)
Frank Harrell
Examples
x <- c(rep('A', 1), rep('B', 3), rep('C', 4), rep('D',1), rep('E',1))
combine.levels(x, m=3)
combine.levels(x, m=3, plevels=TRUE)
combine.levels(x, ord=TRUE, m=3)
x <- c(rep('A', 1), rep('B', 3), rep('C', 4), rep('D',1), rep('E',1), rep('F',1))
combine.levels(x, ord=TRUE, m=3)
Combination Plot
Description
Generates a plotly attribute plot given a series of possibly overlapping binary variables
Usage
combplotp(
  formula,
  data = NULL,
  subset,
  na.action = na.retain,
  vnames = c("labels", "names"),
  includenone = FALSE,
  showno = FALSE,
  maxcomb = NULL,
  minfreq = NULL,
  N = NULL,
  pos = function(x) 1 * (tolower(x) %in% c("true", "yes", "y", "positive", "+", "present", "1")),
  obsname = "subjects",
  ptsize = 35,
  width = NULL,
  height = NULL,
  ...
)
Arguments
formula | a formula containing all the variables to be cross-tabulated, on the formula's right hand side. There is no left hand side variable. If |
data | input data frame. If none is specified the data are assumed to come from the parent frame. |
subset | an optional subsetting expression applied to |
na.action | see |
vnames | set to |
includenone | set to |
showno | set to |
maxcomb | maximum number of combinations to display |
minfreq | if specified, any combination having a frequency less than this will be omitted from the display |
N | set to an integer to override the global denominator, instead of using the number of rows in the data |
pos | a function of vector returning a logical vector with |
obsname | character string noun describing observations, default is |
ptsize | point size, defaults to 35 |
width | width of |
height | height of |
... | optional arguments to pass to |
Details
Similar to the UpSetR package, draws a special dot chart sometimes called an attribute plot that depicts all possible combinations of the binary variables. By default a positive value, indicating that a certain condition pertains for a subject, is any of logical TRUE, numeric 1, "yes", "y", "positive", "+" or "present" value, and all others are considered negative. The user can override this determination by specifying her own pos function. Case is ignored in the variable values.
The plot uses solid dots arranged in a vertical line to indicate which combination of conditions is being considered. Frequencies of all possible combinations are shown above the dot chart. Marginal frequencies of positive values for the input variables are shown to the left of the dot chart. More information for all three of these component symbols is provided in hover text.
Variables are sorted in descending order of marginal frequencies and likewise for combinations of variables.
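To make the positive-value rule concrete, the sketch below applies the default pos function from the Usage section to a small made-up data frame and tabulates marginal and combination frequencies in base R. The data and the object names d, ind, marginal and combo are illustrative assumptions; the actual plotting and sorting done by combplotp are not reproduced.

```r
# Default rule from the Usage section: 1 for recognised positive values, else 0
pos <- function(x) 1 * (tolower(x) %in%
         c("true", "yes", "y", "positive", "+", "present", "1"))

# Made-up data mixing character and logical codings
d <- data.frame(pain    = c("Yes", "no", "Y", "no", "yes"),
                anxiety = c(TRUE, TRUE, FALSE, FALSE, TRUE))

ind      <- as.data.frame(lapply(d, pos))   # 0/1 indicators per variable
marginal <- colSums(ind)                    # positives per variable
combo    <- table(do.call(paste0, ind))     # frequency of each 0/1 pattern
```

Each pattern string such as "11" or "10" identifies one combination of conditions, which is what the vertical lines of dots encode in the chart.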
Value
plotly object
Author(s)
Frank Harrell
Examples
if (requireNamespace("plotly")) {
  g <- function() sample(0:1, n, prob=c(1 - p, p), replace=TRUE)
  set.seed(2); n <- 100; p <- 0.5
  x1 <- g(); label(x1) <- 'A long label for x1 that describes it'
  x2 <- g()
  x3 <- g(); label(x3) <- 'This is<br>a label for x3'
  x4 <- g()
  combplotp(~ x1 + x2 + x3 + x4, showno=TRUE, includenone=TRUE)

  n <- 1500; p <- 0.05
  pain       <- g()
  anxiety    <- g()
  depression <- g()
  soreness   <- g()
  numbness   <- g()
  tiredness  <- g()
  sleepiness <- g()
  combplotp(~ pain + anxiety + depression + soreness + numbness +
            tiredness + sleepiness, showno=TRUE)
}
completer
Description
Create imputed dataset(s) using transcan and aregImpute objects
Usage
completer(a, nimpute, oneimpute = FALSE, mydata)
Arguments
a | An object of class |
nimpute | A numeric vector between 1 and |
oneimpute | A logical vector. When set to |
mydata | A data frame in which its missing values will be imputed. |
Details
Similar in function to mice::complete, this function uses transcan and aregImpute objects to impute missing data and returns the completed dataset(s) as a data frame or a list. It assumes that transcan is used for single regression imputation.
Value
A single or a list of completed dataset(s).
Author(s)
Yong-Hao Pua, Singapore General Hospital
Examples
## Not run: 
mtcars$hp[1:5]    <- NA
mtcars$wt[1:10]   <- NA
myrform <- ~ wt + hp + I(carb)
mytranscan  <- transcan( myrform,  data = mtcars, imputed = TRUE,
  pl = FALSE, pr = FALSE, trantab = TRUE, long = TRUE)
myareg      <- aregImpute(myrform, data = mtcars, x=TRUE, n.impute = 5)
completer(mytranscan)                    # single completed dataset
completer(myareg, 3, oneimpute = TRUE)
# single completed dataset based on the `n.impute`th set of multiple imputation
completer(myareg, 3)
# list of completed datasets based on first `nimpute` sets of multiple imputation
completer(myareg)
# list of completed datasets based on all available sets of multiple imputation
# To get a stacked data frame of all completed datasets use
# do.call(rbind, completer(myareg, data=mydata))
# or use rbindlist in data.table
## End(Not run)
Element Merging
Description
Merges an object by the names of its elements, inserting elements in value into x that do not exist in x and replacing elements in x that exist in value with value elements if protect is false.
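The merging rule can be written out in a few lines of base R. This is a sketch of the behaviour described above, not the consolidate source; the function name merge_named is hypothetical.

```r
# Hypothetical base-R sketch of merging two named vectors by element name
merge_named <- function(x, value, protect = FALSE) {
  new <- setdiff(names(value), names(x))       # names only present in value
  if (!protect) {
    shared <- intersect(names(x), names(value))
    x[shared] <- value[shared]                 # overwrite shared elements
  }
  c(x, value[new])                             # append the new elements
}

x <- 1:5;  names(x) <- LETTERS[x]        # A..E
y <- 6:10; names(y) <- LETTERS[y - 2]    # D..H

merged    <- merge_named(x, y)
protected <- merge_named(x, y, protect = TRUE)
```

With protect = TRUE the shared elements D and E keep their values from x, and only F, G and H are taken from y.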
Usage
consolidate(x, value, protect, ...)
## Default S3 method:
consolidate(x, value, protect=FALSE, ...)
consolidate(x, protect, ...) <- value
Arguments
x | named list or vector |
value | named list or vector |
protect | logical; should elements in |
... | currently does nothing; included if ever want to make generic. |
Author(s)
Charles Dupont
See Also
Examples
x <- 1:5
names(x) <- LETTERS[x]
y <- 6:10
names(y) <- LETTERS[y-2]
x                  # c(A=1,B=2,C=3,D=4,E=5)
y                  # c(D=6,E=7,F=8,G=9,H=10)
consolidate(x, y)      # c(A=1,B=2,C=3,D=6,E=7,F=8,G=9,H=10)
consolidate(x, y, protect=TRUE)      # c(A=1,B=2,C=3,D=4,E=5,F=8,G=9,H=10)
Metadata for a Data Frame
Description
contents is a generic method for which contents.data.frame is currently the only method. contents.data.frame creates an object containing the following attributes of the variables from a data frame: names, labels (if any), units (if any), number of factor levels (if any), factor levels, class, storage mode, and number of NAs. print.contents.data.frame will print the results, with options for sorting the variables. html.contents.data.frame creates HTML code for displaying the results. This code has hyperlinks so that if the user clicks on the number of levels the browser jumps to the correct part of a table of factor levels for all the factor variables. If long labels are present ("longlabel" attributes on variables), these are printed at the bottom and the html method links to them through the regular labels. Variables having the same levels in the same order have the levels factored out for brevity.
contents.list prints a directory of datasets when sasxport.get imported more than one SAS dataset.
If options(prType='html') is in effect, calling print on an object that is the contents of a data frame will result in rendering the HTML version. If run from the console a browser window will open.
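For orientation, part of the per-variable metadata listed above can be gathered with base R alone. The sketch below is not how contents.data.frame is implemented and omits labels, units and the level listings; the object names dfr and meta are illustrative.

```r
# Small made-up data frame with one missing value and one factor
dfr <- data.frame(x = c(1.5, NA, 3),
                  y = factor(c("male", "female", "male")))

# One row of metadata per variable
meta <- data.frame(
  name    = names(dfr),
  class   = vapply(dfr, function(v) class(v)[1], character(1)),
  storage = vapply(dfr, storage.mode, character(1)),
  levels  = vapply(dfr, nlevels, integer(1)),          # 0 for non-factors
  NAs     = vapply(dfr, function(v) sum(is.na(v)), integer(1)),
  row.names = NULL
)
meta
```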
Usage
contents(object, ...)
## S3 method for class 'data.frame'
contents(object, sortlevels=FALSE, id=NULL, range=NULL, values=NULL, ...)
## S3 method for class 'contents.data.frame'
print(x, sort=c('none','names','labels','NAs'), prlevels=TRUE, maxlevels=Inf,
      number=FALSE, ...)
## S3 method for class 'contents.data.frame'
html(object, sort=c('none','names','labels','NAs'), prlevels=TRUE, maxlevels=Inf,
     levelType=c('list','table'), number=FALSE, nshow=TRUE, ...)
## S3 method for class 'list'
contents(object, dslabels, ...)
## S3 method for class 'contents.list'
print(x, sort=c('none','names','labels','NAs','vars'), ...)
Arguments
object | a data frame. For |
sortlevels | set to |
id | an optional subject ID variable name that if present in |
range | an optional variable name that if present in |
values | an optional variable name that if present in |
x | an object created by |
sort | Default is to print the variables in their original order in the data frame. Specify one of |
prlevels | set to |
maxlevels | maximum number of levels to print for a |
number | set to |
nshow | set to |
levelType | By default, bullet lists of category levels are constructed in html. Set |
... | arguments passed from |
dslabels | named vector of SAS dataset labels, created for example by |
Value
an object of class "contents.data.frame" or "contents.list". For the html method, the result is an html character vector object.
Author(s)
Frank Harrell
Vanderbilt University
fh@fharrell.com
See Also
describe, html, upData, extractlabs, hlab
Examples
set.seed(1)
dfr <- data.frame(x=rnorm(400),y=sample(c('male','female'),400,TRUE),
                  stringsAsFactors=TRUE)
contents(dfr)
dfr <- upData(dfr, labels=c(x='Label for x', y='Label for y'))
attr(dfr$x, 'longlabel') <-
 'A very long label for x that can continue onto multiple long lines of text'
k <- contents(dfr)
print(k, sort='names', prlevels=FALSE)
## Not run: 
html(k)
html(contents(dfr))            # same result
latex(k$contents)              # latex.default just the main information
## End(Not run)
Power of Cox/log-rank Two-Sample Test
Description
Assumes exponential distributions for both treatment groups. Uses the George-Desu method along with formulas of Schoenfeld that allow estimation of the expected number of events in the two groups. To allow for drop-ins (noncompliance to control therapy, crossover to intervention) and noncompliance of the intervention, the method of Lachin and Foulkes is used.
Usage
cpower(tref, n, mc, r, accrual, tmin, noncomp.c=0, noncomp.i=0,
       alpha=0.05, nc, ni, pr=TRUE)
Arguments
tref | time at which mortalities estimated |
n | total sample size (both groups combined). If allocation is unequal so that there are not |
mc | tref-year mortality, control |
r | % reduction in |
accrual | duration of accrual period |
tmin | minimum follow-up time |
noncomp.c | % non-compliant in control group (drop-ins) |
noncomp.i | % non-compliant in intervention group (non-adherers) |
alpha | type I error probability. A 2-tailed test is assumed. |
nc | number of subjects in control group |
ni | number of subjects in intervention group. |
pr | set to |
Details
For handling noncompliance, uses a modification of formula (5.4) of Lachin and Foulkes. Their method is based on a test for the difference in two hazard rates, whereas cpower is based on testing the difference in two log hazards. It is assumed here that the same correction factor can be approximately applied to the log hazard ratio as Lachin and Foulkes applied to the hazard difference.
Note that Schoenfeld approximates the variance of the log hazard ratio by 4/m, where m is the total number of events, whereas the George-Desu method uses the slightly better 1/m1 + 1/m2. Power from this function will thus differ slightly from that obtained with the SAS samsizc program.
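The two variance approximations can be compared numerically. In the sketch below the function names and the event counts are illustrative assumptions, and approx_power is a generic two-sided normal approximation, not the full cpower calculation (it ignores accrual, follow-up and noncompliance).

```r
var_schoenfeld  <- function(m) 4 / m                   # m = total events
var_george_desu <- function(m1, m2) 1 / m1 + 1 / m2    # events per group

# Hypothetical helper: two-sided normal-approximation power for a log hazard
# ratio with variance v
approx_power <- function(log_hr, v, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  d <- abs(log_hr) / sqrt(v)
  pnorm(d - z) + pnorm(-d - z)
}

m1 <- 120; m2 <- 80                      # unequal numbers of events
v_s  <- var_schoenfeld(m1 + m2)
v_gd <- var_george_desu(m1, m2)
p_s  <- approx_power(log(0.7), v_s)
p_gd <- approx_power(log(0.7), v_gd)
```

The two variances coincide when m1 equals m2; with unequal event counts 1/m1 + 1/m2 is the larger of the two, so the approximate power it yields is slightly lower.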
Value
power
Side Effects
prints
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
References
Peterson B, George SL: Controlled Clinical Trials 14:511–522; 1993.
Lachin JM, Foulkes MA: Biometrics 42:507–519; 1986.
Schoenfeld D: Biometrics 39:499–503; 1983.
See Also
Examples
#In this example, 4 plots are drawn on one page, one plot for each
#combination of noncompliance percentage.  Within a plot, the
#5-year mortality % in the control group is on the x-axis, and
#separate curves are drawn for several % reductions in mortality
#with the intervention.  The accrual period is 1.5y, with all
#patients followed at least 5y and some 6.5y.
par(mfrow=c(2,2),oma=c(3,0,3,0))
morts <- seq(10,25,length=50)
red <- c(10,15,20,25)
for(noncomp in c(0,10,15,-1)) {
  if(noncomp>=0) nc.i <- nc.c <- noncomp else {nc.i <- 25; nc.c <- 15}
  z <- paste("Drop-in ",nc.c,"%, Non-adherence ",nc.i,"%",sep="")
  plot(0,0,xlim=range(morts),ylim=c(0,1),
       xlab="5-year Mortality in Control Patients (%)",
       ylab="Power",type="n")
  title(z)
  cat(z,"\n")
  lty <- 0
  for(r in red) {
    lty <- lty+1
    power <- morts
    i <- 0
    for(m in morts) {
      i <- i+1
      power[i] <- cpower(5, 14000, m/100, r, 1.5, 5, nc.c, nc.i, pr=FALSE)
    }
    lines(morts, power, lty=lty)
  }
  if(noncomp==0)legend(18,.55,rev(paste(red,"% reduction",sep="")),
                       lty=4:1,bty="n")
}
mtitle("Power vs Non-Adherence for Main Comparison",
       ll="alpha=.05, 2-tailed, Total N=14000",cex.l=.8)
#
# Point sample size requirement vs. mortality reduction
# Root finder (uniroot()) assumes needed sample size is between
# 1000 and 40000
#
nc.i <- 25; nc.c <- 15; mort <- .18
red <- seq(10,25,by=.25)
samsiz <- red
i <- 0
for(r in red) {
  i <- i+1
  samsiz[i] <- uniroot(function(x) cpower(5, x, mort, r, 1.5, 5,
                                          nc.c, nc.i, pr=FALSE) - .8,
                       c(1000,40000))$root
}
samsiz <- samsiz/1000
par(mfrow=c(1,1))
plot(red, samsiz, xlab='% Reduction in 5-Year Mortality',
     ylab='Total Sample Size (Thousands)', type='n')
lines(red, samsiz, lwd=2)
title('Sample Size for Power=0.80\nDrop-in 15%, Non-adherence 25%')
title(sub='alpha=0.05, 2-tailed', adj=0)
Read Comma-Separated Text Data Files
Description
Read comma-separated text data files, allowing optional translation to lower case for variable names after making them valid S names. There is a facility for reading long variable labels as one of the rows. If labels are not specified and a final variable name is not the same as that in the header, the original variable name is saved as a variable label. Uses read.csv if the data.table package is not in effect, otherwise calls fread.
Usage
csv.get(file, lowernames=FALSE, datevars=NULL, datetimevars=NULL,
        dateformat='%F', fixdates=c('none','year'),
        comment.char="", autodate=TRUE, allow=NULL,
        charfactor=FALSE, sep=',', skip=0, vnames=NULL, labels=NULL, text=NULL, ...)
Arguments
file | the file name for import. |
lowernames | set this to |
datevars | character vector of names (after |
datetimevars | character vector of names (after |
dateformat | for |
fixdates | for any of the variables listed in |
comment.char | a character vector of length one containing a single character or an empty string. Use '""' to turn off the interpretation of comments altogether. |
autodate | Set to true to allow function to guess at which variables are dates |
allow | a vector of characters allowed by R that should not be converted to periods in variable names. By default, underscores in variable names are converted to periods as with R before version 1.9. |
charfactor | set to |
sep | field separator, defaults to comma |
skip | number of records to skip before data start. Required if |
vnames | number of row containing variable names, default is one |
labels | number of row containing variable labels, default is no labels |
text | a character string containing the |
... | arguments to pass to |
Details
csv.get reads comma-separated text data files, allowing optional translation to lower case for variable names after making them valid S names. Original possibly non-legal names are taken to be variable labels if labels is not specified. Character or factor variables containing dates can be converted to date variables. cleanup.import is invoked to finish the job.
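The name and label handling can be mimicked with base R for a file that has junk in row 1, variable names in row 2, labels in row 3 and junk in row 4. This is only a sketch under those assumptions: the sample text is made up, the labels are stored in a plain "label" attribute, and none of the date conversion or cleanup.import steps are reproduced.

```r
# Made-up file contents: junk, names, labels, junk, then data
txt <- c("junk line",
         "Age,Body Weight",
         "Age in years,Weight in kg",
         "more junk",
         "34,70.5",
         "51,82.1")

vnames <- make.names(strsplit(txt[2], ",")[[1]])   # make the names valid
labels <- strsplit(txt[3], ",")[[1]]               # long labels from row 3

# Read only the data rows, supplying the cleaned names
dat <- read.csv(text = paste(txt[-(1:4)], collapse = "\n"),
                header = FALSE, col.names = vnames)

# Attach each label to its variable
for (i in seq_along(dat)) attr(dat[[i]], "label") <- labels[i]
```

The header "Body Weight" becomes the valid name Body.Weight, while the more readable text survives as the variable's label.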
Value
a new data frame.
Author(s)
Frank Harrell, Vanderbilt University
See Also
sas.get, data.frame, cleanup.import, read.csv, strptime, POSIXct, Date, fread
Examples
## Not run: 
dat <- csv.get('myfile.csv')
# Read a csv file with junk in the first row, variable names in the
# second, long variable labels in the third, and junk in the 4th row
dat <- csv.get('myfile.csv', vnames=2, labels=3, skip=4)
## End(Not run)
Representative Curves
Description
curveRep finds representative curves from a relatively large collection of curves. The curves usually represent time-response profiles as in serial (longitudinal or repeated) data with possibly unequal time points and greatly varying sample sizes per subject. After excluding records containing missing x or y, records are first stratified into kn groups having similar sample sizes per curve (subject). Within these strata, curves are next stratified according to the distribution of x points per curve (typically measurement times per subject). The clara clustering/partitioning function is used to do this, clustering on one, two, or three x characteristics depending on the minimum sample size in the current interval of sample size. If the interval has a minimum number of unique values of one, clustering is done on the single x values. If the minimum number of unique x values is two, clustering is done to create groups that are similar on both min(x) and max(x). For groups containing no fewer than three unique x values, clustering is done on the trio of values min(x), max(x), and the longest gap between any successive x. Then within sample size and x distribution strata, clustering of time-response profiles is based on p values of y all evaluated at the same p equally-spaced x's within the stratum. An option allows per-curve data to be smoothed with lowess before proceeding. Outer x values are taken as extremes of x across all curves within the stratum. Linear interpolation within curves is used to estimate y at the grid of x's. For curves within the stratum that do not extend to the most extreme x values in that stratum, extrapolation uses flat lines from the observed extremes in the curve unless extrap=TRUE. The p y values are clustered using clara.
print and plot methods show results. By specifying an auxiliary idcol variable to plot, other variables such as treatment may be depicted to allow the analyst to determine for example whether subjects on different treatments are assigned to different time-response profiles. To write the frequencies of a variable such as treatment in the upper left corner of each panel (instead of the grand total number of clusters in that panel), specify freq.
curveSmooth takes a set of curves and smooths them using lowess. If the number of unique x points in a curve is less than p, the smooth is evaluated at the unique x values. Otherwise it is evaluated at an equally spaced set of x points over the observed range. If fewer than 3 unique x values are in a curve, those points are used and smoothing is not done.
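Two of the per-curve steps described above can be illustrated in base R: the trio of x characteristics used for x-distribution clustering, and the evaluation of a curve on p equally spaced points with flat extrapolation. The sketch uses approx with rule = 2 as a stand-in; the stratum range of [0, 1] and the single made-up curve are assumptions, and the clara clustering itself is not shown.

```r
# One curve observed at three unequally spaced times
x <- c(0.2, 0.5, 0.8)
y <- c(1, 4, 2)

# x characteristics for curves with at least three unique x values
x_features <- c(min = min(x), max = max(x), gap = max(diff(sort(x))))

# Evaluate the curve at p equally spaced points spanning the stratum's range
# (assumed here to be [0, 1]); rule = 2 extends flat lines beyond the data
p    <- 5
grid <- seq(0, 1, length.out = p)
yhat <- approx(x, y, xout = grid, rule = 2)$y
```

Every curve in a stratum ends up as a vector of p values on a common grid, which is what makes the profiles directly comparable for clustering.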
Usage
curveRep(x, y, id, kn = 5, kxdist = 5, k = 5, p = 5,
         force1 = TRUE, metric = c("euclidean", "manhattan"),
         smooth=FALSE, extrap=FALSE, pr=FALSE)
## S3 method for class 'curveRep'
print(x, ...)
## S3 method for class 'curveRep'
plot(x, which=1:length(res), method=c('all','lattice','data'),
     m=NULL, probs=c(.5, .25, .75), nx=NULL, fill=TRUE,
     idcol=NULL, freq=NULL, plotfreq=FALSE,
     xlim=range(x), ylim=range(y),
     xlab='x', ylab='y', colorfreq=FALSE, ...)
curveSmooth(x, y, id, p=NULL, pr=TRUE)
Arguments
x | a numeric vector, typically measurement times. For |
y | a numeric vector of response values |
id | a vector of curve (subject) identifiers, the same length as |
kn | number of curve sample size groups to construct. |
kxdist | maximum number of x-distribution clusters to derive using |
k | maximum number of x-y profile clusters to derive using |
p | number of |
force1 | By default if any curves have only one point, all curves consisting of one point will be placed in a separate stratum. To prevent this separation, set |
metric | see |
smooth | By default, linear interpolation is used on raw data to obtain |
extrap | set to |
pr | set to |
which | an integer vector specifying which sample size intervals to plot. Must be specified if |
method | The default makes individual plots of possibly all x-distribution by sample size by cluster combinations. Fewer may be plotted by specifying |
m | the number of curves in a cluster to randomly sample if there are more than |
nx | applies if |
probs | 3-vector of probabilities with the central quantile first. Default uses quartiles. |
fill | for |
idcol | a named vector to be used as a table lookup for color assignments (does not apply when |
freq | a named vector to be used as a table lookup for a grouping variable such as treatment. The names are curve |
plotfreq | set to |
colorfreq | set to |
xlim,ylim,xlab,ylab | plotting parameters. Default ranges are the ranges in the entire set of raw data given to |
... | arguments passed to other functions. |
Details
In the graph titles for the default graphic output, n refers to the minimum sample size, x refers to the sequential x-distribution cluster, and c refers to the sequential x-y profile cluster. Graphs from method = "lattice" are produced by xyplot and in the panel titles distribution refers to the x-distribution stratum and cluster refers to the x-y profile cluster.
Value
a list of class"curveRep" with the following elements
res | a hierarchical list first split by sample size intervals, then by x distribution clusters, then containing a vector of cluster numbers with |
ns | a table of frequencies of sample sizes per curve after removing |
nomit | total number of records excluded due to |
missfreq | a table of frequencies of number of |
ncuts | cut points for sample size intervals |
kn | number of sample size intervals |
kxdist | number of clusters on x distribution |
k | number of clusters of curves within sample size and distribution groups |
p | number of points at which to evaluate each curve for clustering |
x | |
y | |
id | input data after removing |
curveSmooth returns a list with elements x, y, id.
Note
The references describe other methods for deriving representative curves, but those methods were not used here. The last reference which used a cluster analysis on principal components motivated curveRep however. The kml package does k-means clustering of longitudinal data with imputation.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
References
Segal M. (1994): Representative curves for longitudinal data via regression trees. J Comp Graph Stat 3:214-233.
Jones MC, Rice JA (1992): Displaying the important features of large collections of similar curves. Am Statistician 46:140-145.
Zheng X, Simpson JA, et al (2005): Data from a study of effectiveness suggested potential prognostic factors related to the patterns of shoulder pain. J Clin Epi 58:823-830.
See Also
Examples
## Not run: 
# Simulate 200 curves with per-curve sample sizes ranging from 1 to 10
# Make curves with odd-numbered IDs have an x-distribution that is random
# uniform [0,1] and those with even-numbered IDs have an x-dist. that is
# half as wide but still centered at 0.5.  Shift y values higher with
# increasing IDs
set.seed(1)
N <- 200
nc <- sample(1:10, N, TRUE)
id <- rep(1:N, nc)
x <- y <- id
for(i in 1:N) {
  x[id==i] <- if(i %% 2) runif(nc[i]) else runif(nc[i], c(.25, .75))
  y[id==i] <- i + 10*(x[id==i] - .5) + runif(nc[i], -10, 10)
}
w <- curveRep(x, y, id, kxdist=2, p=10)
w
par(ask=TRUE, mfrow=c(4,5))
plot(w)                # show everything, profiles going across
par(mfrow=c(2,5))
plot(w,1)              # show n=1 results
# Use a color assignment table, assigning low curves to green and
# high to red.  Unique curve (subject) IDs are the names of the vector.
cols <- c(rep('green', N/2), rep('red', N/2))
names(cols) <- as.character(1:N)
plot(w, 3, idcol=cols)
par(ask=FALSE, mfrow=c(1,1))
plot(w, 1, 'lattice')  # show n=1 results
plot(w, 3, 'lattice')  # show n=4-5 results
plot(w, 3, 'lattice', idcol=cols)  # same but different color mapping
plot(w, 3, 'lattice', m=1)  # show a single "representative" curve
# Show median, 10th, and 90th percentiles of supposedly representative curves
plot(w, 3, 'lattice', m='quantiles', probs=c(.5,.1,.9))
# Same plot but with much less grouping of x variable
plot(w, 3, 'lattice', m='quantiles', probs=c(.5,.1,.9), nx=2)
# Use ggplot2 for one sample size interval
z <- plot(w, 2, 'data')
require(ggplot2)
ggplot(z, aes(x, y, color=curve)) + geom_line() +
  facet_grid(distribution ~ cluster) +
  theme(legend.position='none') +
  labs(caption=z$ninterval[1])
# Smooth data before profiling.  This allows later plotting to plot
# smoothed representative curves rather than raw curves (which
# specifying smooth=TRUE to curveRep would do, if curveSmooth was not used)
d <- curveSmooth(x, y, id)
w <- with(d, curveRep(x, y, id))
# Example to show that curveRep can cluster profiles correctly when
# there is no noise.  In the data there are four profiles - flat, flat
# at a higher mean y, linearly increasing then flat, and flat at the
# first height except for a sharp triangular peak
set.seed(1)
x <- 0:100
m <- length(x)
profile <- matrix(NA, nrow=m, ncol=4)
profile[,1] <- rep(0, m)
profile[,2] <- rep(3, m)
profile[,3] <- c(0:3, rep(3, m-4))
profile[,4] <- c(0,1,3,1,rep(0,m-4))
col <- c('black','blue','green','red')
matplot(x, profile, type='l', col=col)
xeval <- seq(0, 100, length.out=5)
s <- x %in% xeval
matplot(x[s], profile[s,], type='l', col=col)
id <- rep(1:100, each=m)
X <- Y <- id
cols <- character(100)
names(cols) <- as.character(1:100)
for(i in 1:100) {
  s <- id==i
  X[s] <- x
  j <- sample(1:4,1)
  Y[s] <- profile[,j]
  cols[i] <- col[j]
}
table(cols)
yl <- c(-1,4)
w <- curveRep(X, Y, id, kn=1, kxdist=1, k=4)
plot(w, 1, 'lattice', idcol=cols, ylim=yl)
# Found 4 clusters but two have same profile
w <- curveRep(X, Y, id, kn=1, kxdist=1, k=3)
plot(w, 1, 'lattice', idcol=cols, freq=cols, plotfreq=TRUE, ylim=yl)
# Incorrectly combined black and red because default value p=5 did
# not result in different profiles at x=xeval
w <- curveRep(X, Y, id, kn=1, kxdist=1, k=4, p=40)
plot(w, 1, 'lattice', idcol=cols, ylim=yl)
# Found correct clusters because evaluated curves at 40 equally
# spaced points and could find the sharp triangular peak in profile 4
## End(Not run)
Cut a Numeric Variable into Intervals
Description
cut2 is a function like cut but left endpoints are inclusive and labels are of the form [lower, upper), except that the last interval is [lower,upper]. If cuts are given, it will by default make sure that cuts include the entire range of x. Also, if cuts are not given, it will cut x into quantile groups (g given) or groups with a given minimum number of observations (m). Whereas cut creates a category object, cut2 creates a factor object. m is not guaranteed but is a target.
cutGn guarantees that the grouped variable will have a minimum ofm observations in any group. This is done by an exhaustive algorithm that runs fast due to being coded in Fortran.
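The interval convention described above can be imitated in base R, which may help when checking labels by hand. The following is only a minimal sketch and is not how cut2 is implemented: it relies on base cut with right=FALSE and include.lowest=TRUE, and the data and break points are made up for illustration.

```r
# Illustrative data and breaks (assumptions, not taken from the package)
x <- c(0, 5, 10, 15, 20, 30)
breaks <- c(0, 10, 20, 30)

# right=FALSE makes left endpoints inclusive: [lower, upper)
# include.lowest=TRUE closes the last interval: [lower, upper]
grp <- cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE)

table(grp)
```

Unlike cut2, this sketch does not extend the breaks to cover the range of x and has no counterpart to the g or m arguments.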
Usage
cut2(x, cuts, m=150, g, levels.mean=FALSE, digits, minmax=TRUE,
     oneval=TRUE, onlycuts=FALSE, formatfun=format, ...)

cutGn(x, m, what=c('mean', 'factor', 'summary', 'cuts', 'function'), rcode=FALSE)

Arguments
x | numeric vector to classify into intervals |
cuts | cut points |
m | desired minimum number of observations in a group. The algorithm does not guarantee that all groups will have at least |
g | number of quantile groups |
levels.mean | set to |
digits | number of significant digits to use in constructing levels. Default is 3 (5 if |
minmax | if cuts is specified but |
oneval | if an interval contains only one unique value, the interval will be labeled with the formatted version of that value instead of the interval endpoints, unless |
onlycuts | set to |
formatfun | formatting function, supports formula notation (if |
... | additional arguments passed to |
what | specifies the kind of vector values to return from |
rcode | set to |
Value
a factor variable with levels of the form [a,b) or formatted means (character strings) unless onlycuts is TRUE, in which case a numeric vector is returned
See Also
Examples
set.seed(1)
x <- runif(1000, 0, 100)
z <- cut2(x, c(10,20,30))
table(z)
table(cut2(x, g=10))      # quantile groups
table(cut2(x, m=50))      # group x into intervals with at least 50 obs.
table(cutGn(x, m=50, what='factor'))
f <- cutGn(x, m=50, what='function')
f
f(c(-1, 2, 10), what='mean')
f(c(-1, 2, 10), what='factor')
## Not run: 
  x <- round(runif(200000), 3)
  system.time(a <- cutGn(x, m=20))              # 0.02s
  system.time(b <- cutGn(x, m=20, rcode=TRUE))  # 1.51s
  identical(a, b)

## End(Not run)

Tips for Creating, Modifying, and Checking Data Frames
Description
This help file contains a template for importing data to create an R data frame, correcting some problems resulting from the import and making the data frame be stored more efficiently, modifying the data frame (including better annotating it and changing the names of some of its variables), and checking and inspecting the data frame for reasonableness of the values of its variables and to describe patterns of missing data. Various built-in functions and functions in the Hmisc library are used. At the end some methods for creating data frames “from scratch” within R are presented.
The examples below attempt to clarify the separation of operations that are done on a data frame as a whole, operations that are done on a small subset of its variables without attaching the whole data frame, and operations that are done on many variables after attaching the data frame in search position one. They also try to clarify that for analyzing several separate variables using R commands that do not support a data argument, it is helpful to attach the data frame in a search position later than position one.
It is often useful to create, modify, and process datasets in the following order.
Import external data into a data frame (if the raw data do not contain column names, provide these during the import if possible)
Make global changes to a data frame (e.g., changing variable names)
Change attributes or values of variables within a data frame
Do analyses involving the whole data frame (without attaching it) (Data frame still in .Data)
Do analyses of individual variables (after attaching the data frame in search position two or later)
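The ordering above can be sketched with base R alone. This is a hedged illustration rather than the template used in the Examples: the file contents and variable names are made up, and with() stands in for attaching the data frame in a later search position.

```r
# Hypothetical raw data written to a temporary CSV file (illustration only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("AGE,FEV,SEX", "9,1.7,0", "12,2.9,1", "15,3.8,1", "11,2.4,0"), tmp)

# 1. Import external data into a data frame
d <- read.csv(tmp)

# 2. Make global changes (here, lower-case all variable names)
names(d) <- tolower(names(d))

# 3. Change attributes or values of individual variables
d$sex <- factor(d$sex, levels = 0:1, labels = c("female", "male"))

# 4. Analyze the whole data frame without attaching it
summary(d)

# 5. Analyze individual variables; with() avoids attach()/detach()
m <- with(d, tapply(fev, sex, mean))
m
```

Using with() or a data= argument keeps the search path unchanged, which sidesteps the name conflicts that attaching can cause.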
Details
The examples below use the FEV dataset from Rosner 1995. Almost any dataset would do. The jcetable data are taken from Galobardes, et al.
Presently, giving a variable the "units" attribute (using the Hmisc units function) only benefits the Hmisc describe function and the rms library's version of the Surv function. Variable labels defined with the Hmisc label function are used by describe, summary.formula, and many of the plotting functions in Hmisc and rms.
References
Alzola CF, Harrell FE (2006): An Introduction to S and the Hmisc and Design Libraries. Chapters 3 and 4, https://hbiostat.org/R/doc/sintro.pdf.
Galobardes, et al. (1998), J Clin Epi 51:875-881.
Rosner B (1995): Fundamentals of Biostatistics, 4th Edition. New York: Duxbury Press.
See Also
scan, read.table, cleanup.import, sas.get, data.frame, attach, detach, describe, datadensity, plot.data.frame, hist.data.frame, naclus, factor, label, units, names, expand.grid, summary.formula, summary.data.frame, casefold, edit, page, Cs, combine.levels, upData
Examples
## Not run: 
# First, we do steps that create or manipulate the data
# frame in its entirety.  For S-Plus, these are done with
# .Data in search position one (the default at the
# start of the session).
#
# -----------------------------------------------------------------------
# Step 1: Create initial draft of data frame
# 
# We usually begin by importing a dataset from
# another application.  ASCII files may be imported
# using the scan and read.table functions.  SAS
# datasets may be imported using the Hmisc sas.get
# function (which will carry more attributes from
# SAS than using File ... Import) from the GUI
# menus.  But for most applications (especially
# Excel), File ... Import will suffice.  If using
# the GUI, it is often best to provide variable
# names during the import process, using the Options
# tab, rather than renaming all fields later.  Of
# course, if the data to be imported already have
# field names (e.g., in Excel), let S use those
# automatically.  If using S-Plus, you can use a
# command to execute File ... Import, e.g.:

import.data(FileName = "/windows/temp/fev.asc",
            FileType = "ASCII", DataFrame = "FEV")

# Here we name the new data frame FEV rather than
# fev, because we wanted to distinguish a variable
# in the data frame named fev from the data frame
# name.  For S-Plus the command will look
# instead like the following:

FEV <- importData("/tmp/fev.asc")

# -----------------------------------------------------------------------
# Step 2: Clean up data frame / make it be more
# efficiently stored
# 
# Unless using sas.get to import your dataset
# (sas.get already stores data efficiently), it is
# usually a good idea to run the data frame through
# the Hmisc cleanup.import function to change
# numeric variables that are always whole numbers to
# be stored as integers, the remaining numerics to
# single precision, strange values from Excel to
# NAs, and character variables that always contain
# legal numeric values to numeric variables.
# cleanup.import typically halves the size of the
# data frame.  If you do not specify any parameters
# to cleanup.import, the function assumes that no
# numeric variable needs more than 7 significant
# digits of precision, so all non-integer-valued
# variables will be converted to single precision.

FEV <- cleanup.import(FEV)

# -----------------------------------------------------------------------
# Step 3: Make global changes to the data frame
# 
# A data frame has attributes that are "external" to
# its variables.  There are the vector of its
# variable names ("names" attribute), the
# observation identifiers ("row.names"), and the
# "class" (whose value is "data.frame").  The
# "names" attribute is the one most commonly in need
# of modification.  If we had wanted to change all
# the variable names to lower case, we could have
# specified lowernames=TRUE to the cleanup.import
# invocation above, or type

names(FEV) <- casefold(names(FEV))

# The upData function can also be used to change
# variable names in two ways (see below).
# To change names in a non-systematic way we use
# other options.  Under Windows/NT the most
# straightforward approach is to change the names
# interactively.  Click on the data frame in the
# left panel of the Object Browser, then in the
# right pane click twice (slowly) on a variable.
# Use the left arrow and other keys to edit the
# name.  Click outside that name field to commit the
# change.  You can also rename columns while in a
# Data Sheet.  To instead use programming commands
# to change names, use something like:

names(FEV)[6] <- 'smoke'   # assumes you know the positions!
names(FEV)[names(FEV)=='smoking'] <- 'smoke'
names(FEV) <- edit(names(FEV))

# The last example is useful if you are changing
# many names.  But none of the interactive
# approaches such as edit() are handy if you will be
# re-importing the dataset after it is updated in
# its original application.  This problem can be
# addressed by saving the new names in a permanent
# vector in .Data:

new.names <- names(FEV)

# Then if the data are re-imported, you can type

names(FEV) <- new.names

# to rename the variables.

# -----------------------------------------------------------------------
# Step 4: Delete unneeded variables
# 
# To delete some of the variables, you can
# right-click on variable names in the Object
# Browser's right pane, then select Delete.  You can
# also set variables to have NULL values, which
# causes the system to delete them.  We don't need
# to delete any variables from FEV but suppose we
# did need to delete some from mydframe.

mydframe$x1 <- NULL
mydframe$x2 <- NULL
mydframe[c('age','sex')] <- NULL   # delete 2 variables
mydframe[Cs(age,sex)]    <- NULL   # same thing

# The last example uses the Hmisc short-cut quoting
# function Cs.  See also the drop parameter to upData.

# -----------------------------------------------------------------------
# Step 5: Make changes to individual variables
#         within the data frame
# 
# After importing data, the resulting variables are
# seldom self-documenting, so we commonly need to
# change or enhance attributes of individual
# variables within the data frame.
# 
# If you are only changing a few variables, it is
# efficient to change them directly without
# attaching the entire data frame.

FEV$sex   <- factor(FEV$sex,   0:1, c('female','male'))
FEV$smoke <- factor(FEV$smoke, 0:1,
                    c('non-current smoker','current smoker'))
units(FEV$age)    <- 'years'
units(FEV$fev)    <- 'L'
label(FEV$fev)    <- 'Forced Expiratory Volume'
units(FEV$height) <- 'inches'

# When changing more than one or two variables it is
# more convenient to change the data frame using the
# Hmisc upData function.

FEV2 <- upData(FEV,
  rename=c(smoking='smoke'),   # omit if renamed above
  drop=c('var1','var2'),
  levels=list(sex  =list(female=0,male=1),
              smoke=list('non-current smoker'=0,
                         'current smoker'=1)),
  units=list(age='years', fev='L', height='inches'),
  labels=list(fev='Forced Expiratory Volume'))

# An alternative to levels=list(...) is for example
# upData(FEV, sex=factor(sex,0:1,c('female','male'))).
# 
# Note that we saved the changed data frame into a
# new data frame FEV2.  If we were confident of the
# correctness of our changes we could have stored
# the new data frame on top of the old one, under
# the original name FEV.

# -----------------------------------------------------------------------
# Step 6:  Check the data frame
# 
# The Hmisc describe function is perhaps the first
# function that should be used on the new data
# frame.  It provides documentation of all the
# variables, and the frequency tabulation, counts of
# NAs, and 5 largest and smallest values are
# helpful in detecting data errors.  Typing
# describe(FEV) will write the results to the
# current output window.  To put the results in a
# new window that can persist, even upon exiting
# S, we use the page function.  The describe
# output can be minimized to an icon but kept ready
# for guiding later steps of the analysis.

page(describe(FEV2), multi=TRUE)
# multi=TRUE allows that window to persist while
# control is returned to other windows

# The new data frame is OK.  Store it on top of the
# old FEV and then use the graphical user interface
# to delete FEV2 (click on it and hit the Delete
# key) or type rm(FEV2) after the next statement.

FEV <- FEV2

# Next, we can use a variety of other functions to
# check and describe all of the variables.  As we
# are analyzing all or almost all of the variables,
# this is best done without attaching the data
# frame.  Note that plot.data.frame plots inverted
# CDFs for continuous variables and dot plots
# showing frequency distributions of categorical
# ones.

summary(FEV)
# basic summary function (summary.data.frame)

plot(FEV)                # plot.data.frame
datadensity(FEV)
# rug plots and freq. bar charts for all var.

hist.data.frame(FEV)
# for variables having > 2 values

by(FEV, FEV$smoke, summary)
# use basic summary function with stratification

# -----------------------------------------------------------------------
# Step 7:  Do detailed analyses involving individual
#          variables
# 
# Analyses based on the formula language can use
# data= so attaching the data frame may not be
# required.  This saves memory.  Here we use the
# Hmisc summary.formula function to compute 5
# statistics on height, stratified separately by age
# quartile and by sex.

options(width=80)
summary(height ~ age + sex, data=FEV,
        fun=function(y)c(smean.sd(y),
                         smedian.hilow(y,conf.int=.5)))
# This computes mean height, S.D., median, outer quartiles

fit <- lm(height ~ age*sex, data=FEV)
summary(fit)

# For this analysis we could also have attached the
# data frame in search position 2.  For other
# analyses, it is mandatory to attach the data frame
# unless FEV$ prefixes each variable name.
# Important: DO NOT USE attach(FEV, 1) or
# attach(FEV, pos=1, ...) if you are only analyzing
# and not changing the variables, unless you really
# need to avoid conflicts with variables in search
# position 1 that have the same names as the
# variables in FEV.  Attaching into search position
# 1 will cause S-Plus to be more of a memory hog.

attach(FEV)
# Use e.g. attach(FEV[,Cs(age,sex)]) if you only
# want to analyze a small subset of the variables
# Use e.g. attach(FEV[FEV$sex=='male',]) to
# analyze a subset of the observations

summary(height ~ age + sex,
        fun=function(y)c(smean.sd(y),
                         smedian.hilow(y,conf.int=.5)))
fit <- lm(height ~ age*sex)

# Run generic summary function on height and fev,
# stratified by sex
by(data.frame(height,fev), sex, summary)

# Cross-classify into 4 sex x smoke groups
by(FEV, list(sex,smoke), summary)

# Plot 5 quantiles
s <- summary(fev ~ age + sex + height,
             fun=function(y)quantile(y,c(.1,.25,.5,.75,.9)))

plot(s, which=1:5, pch=c(1,2,15,2,1), #pch=c('=','[','o',']','='),
     main='A Discovery', xlab='FEV')

# Use the nonparametric bootstrap to compute a
# 0.95 confidence interval for the population mean fev
smean.cl.boot(fev)    # in Hmisc

# Use the Statistics ... Compare Samples ... One Sample
# keys to get a normal-theory-based C.I.  Then do it
# more manually.  The following method assumes that
# there are no NAs in fev

sd <- sqrt(var(fev))
xbar <- mean(fev)
xbar
sd
n <- length(fev)
qt(.975,n-1)
# prints 0.975 critical value of t dist. with n-1 d.f.

xbar + c(-1,1)*sd/sqrt(n)*qt(.975,n-1)
# prints confidence limits

# Fit a linear model
# fit <- lm(fev ~ other variables ...)

detach()

# The last command is only needed if you want to
# start operating on another data frame and you want
# to get FEV out of the way.

# -----------------------------------------------------------------------
# Creating data frames from scratch
# 
# Data frames can be created from within S.  To
# create a small data frame containing ordinary
# data, you can use something like

dframe <- data.frame(age=c(10,20,30),
                     sex=c('male','female','male'),
                     stringsAsFactors=TRUE)

# You can also create a data frame using the Data
# Sheet.  Create an empty data frame with the
# correct variable names and types, then edit in the
# data.

dd <- data.frame(age=numeric(0),sex=character(0),
                 stringsAsFactors=TRUE)

# The sex variable will be stored as a factor, and
# levels will be automatically added to it as you
# define new values for sex in the Data Sheet's sex
# column.
# 
# When the data frame you need to create is defined
# by systematically varying variables (e.g., all
# possible combinations of values of each variable),
# the expand.grid function is useful for quickly
# creating the data.  Then you can add
# non-systematically-varying variables to the object
# created by expand.grid, using programming
# statements or editing the Data Sheet.  This
# process is useful for creating a data frame
# representing all the values in a printed table.
# In what follows we create a data frame
# representing the combinations of values from an 8
# x 2 x 2 x 2 (event x method x sex x what) table,
# and add a non-systematic variable percent to the
# data.

jcetable <- expand.grid(
  event=c('Wheezing at any time',
          'Wheezing and breathless',
          'Wheezing without a cold',
          'Waking with tightness in the chest',
          'Waking with shortness of breath',
          'Waking with an attack of cough',
          'Attack of asthma',
          'Use of medication'),
  method=c('Mail','Telephone'),
  sex=c('Male','Female'),
  what=c('Sensitivity','Specificity'))

jcetable$percent <-
  c(756,618,706,422,356,578,289,333,
    576,421,789,273,273,212,212,212,
    613,763,713,403,377,541,290,226,
    613,684,632,290,387,613,258,129,
    656,597,438,780,732,679,938,919,
    714,600,494,877,850,703,963,987,
    755,420,480,794,779,647,956,941,
    766,423,500,833,833,604,955,986) / 10

# In jcetable, event varies most rapidly, then
# method, then sex, and what.

## End(Not run)

Representativeness of Observations in a Data Set
Description
These functions are intended to be used to describe how well a given set of new observations (e.g., new subjects) were represented in a dataset used to develop a predictive model. The dataRep function forms a data frame that contains all the unique combinations of variable values that existed in a given set of variable values. Cross-classifications of values are created using exact values of variables, so for continuous numeric variables it is often necessary to round them to the nearest v and to possibly curtail the values to some lower and upper limit before rounding. Here v denotes a numeric constant specifying the matching tolerance that will be used. dataRep also stores marginal distribution summaries for all the variables. For numeric variables, all 101 percentiles are stored, and for all variables, the frequency distributions are also stored (frequencies are computed after any rounding and curtailment of numeric variables). For the purposes of rounding and curtailing, the roundN function is provided. A print method will summarize the calculations made by dataRep, and if long=TRUE all unique combinations of values and their frequencies in the original dataset are printed.
The predict method for dataRep takes a new data frame having variables named the same as the original ones (but whose factor levels are not necessarily in the same order) and examines the collapsed cross-classifications created by dataRep to find how many observations were similar to each of the new observations after any rounding or curtailment of limits is done. predict also does some calculations to describe how the variable values of the new observations "stack up" against the marginal distributions of the original data. For categorical variables, the percent of observations having a given variable with the value of the new observation (after rounding for variables that were passed through roundN in the formula given to dataRep) is computed. For numeric variables, the percentile of the original distribution in which the current value falls will be computed. For this purpose, the data are not rounded because the 101 original percentiles were retained; linear interpolation is used to estimate percentiles for values between two tabulated percentiles. The lowest marginal frequency of matching values across all variables is also computed. For example, if an age, sex combination matches 10 subjects in the original dataset but the age value matches 100 ages (after rounding) and the sex value matches the sex code of 300 observations, the lowest marginal frequency is 100, which is a "best case" upper limit for multivariable matching. I.e., matching on all variables cannot result in a frequency higher than this amount. A print method for the output of predict.dataRep prints all calculations done by predict by default. Calculations can be selectively suppressed.
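The matching logic described above can be illustrated in base R. This is a rough sketch under stated assumptions, not the dataRep or predict.dataRep code: round_to is a hypothetical helper that curtails and then rounds to the nearest tol, and the small data frames are invented.

```r
# Hypothetical helper: curtail to clip limits, then round to the nearest tol.
# It only mimics the idea behind roundN; it is not the package function.
round_to <- function(x, tol = 1, clip = NULL) {
  if (!is.null(clip)) x <- pmin(pmax(x, clip[1]), clip[2])
  tol * round(x / tol)
}

# Made-up development data
orig <- data.frame(age = c(21, 24, 38, 41, 42, 67),
                   sex = c("f", "m", "f", "f", "m", "m"))
orig$age10 <- round_to(orig$age, tol = 10)

# One new observation to compare against the development data
new <- data.frame(age = 39, sex = "f")
new$age10 <- round_to(new$age, tol = 10)

# Marginal match counts and the joint (all-variable) match count
n_age  <- sum(orig$age10 == new$age10)
n_sex  <- sum(orig$sex == new$sex)
n_both <- sum(orig$age10 == new$age10 & orig$sex == new$sex)

c(age = n_age, sex = n_sex,
  lowest_marginal = min(n_age, n_sex), joint = n_both)
```

The joint count can never exceed the lowest marginal count, which is why the latter serves as a best-case upper limit.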
Usage
dataRep(formula, data, subset, na.action)

roundN(x, tol=1, clip=NULL)

## S3 method for class 'dataRep'
print(x, long=FALSE, ...)

## S3 method for class 'dataRep'
predict(object, newdata, ...)

## S3 method for class 'predict.dataRep'
print(x, prdata=TRUE, prpct=TRUE, ...)

Arguments
formula | a formula with no left-hand-side. Continuous numeric variables in need of rounding should appear in the formula as e.g. |
x | a numeric vector or an object created by |
object | the object created by |
data,subset,na.action | standard modeling arguments. Default |
tol | rounding constant (tolerance is actually |
clip | a 2-vector specifying a lower and upper limit to curtail values of |
long | set to |
newdata | a data frame containing all the variables given to |
prdata | set to |
prpct | set to |
... | unused |
Value
dataRep returns a list of class "dataRep" containing the collapsed data frame and frequency counts along with marginal distribution information. predict returns an object of class "predict.dataRep" containing information determined by matching observations in newdata with the original (collapsed) data.
Side Effects
print.dataRep prints.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com
See Also
Examples
set.seed(13)
num.symptoms <- sample(1:4, 1000,TRUE)
sex <- factor(sample(c('female','male'), 1000,TRUE))
x    <- runif(1000)
x[1] <- NA
table(num.symptoms, sex, .25*round(x/.25))

d <- dataRep(~ num.symptoms + sex + roundN(x,.25))
print(d, long=TRUE)

predict(d, data.frame(num.symptoms=1:3, sex=c('male','male','female'),
                      x=c(.03,.5,1.5)))

Design Effect and Intra-cluster Correlation
Description
Computes the Kish design effect and corresponding intra-cluster correlation for a single cluster-sampled variable.
Usage
deff(y, cluster)

Arguments
y | variable to analyze |
cluster | a variable whose unique values indicate cluster membership. Any type of variable is allowed. |
Value
a vector with named elements n (total number of non-missing observations), clusters (number of clusters after deleting missing data), rho (intra-cluster correlation), and deff (design effect).
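For orientation, the kind of quantities returned can be sketched with a textbook one-way ANOVA estimate of the intra-cluster correlation and the relation deff = 1 + (m - 1) * rho, where m is the average cluster size. These formulas are assumptions chosen for illustration; the estimator used inside deff may differ, and icc_deff is a hypothetical name.

```r
# Sketch only: ANOVA-based intra-cluster correlation and a Kish-type design
# effect. Assumed formulas, not necessarily those used by deff().
icc_deff <- function(y, cluster) {
  keep <- !is.na(y) & !is.na(cluster)
  y <- y[keep]
  cluster <- factor(cluster[keep])
  n  <- length(y)
  ng <- as.numeric(table(cluster))            # cluster sizes
  k  <- length(ng)                            # number of clusters
  cm <- ave(y, cluster)                       # cluster mean for each observation
  ssb <- sum((cm - mean(y))^2)                # between-cluster sum of squares
  ssw <- sum((y - cm)^2)                      # within-cluster sum of squares
  msb <- ssb / (k - 1)
  msw <- ssw / (n - k)
  n0  <- (n - sum(ng^2) / n) / (k - 1)        # adjusts for unequal cluster sizes
  rho <- (msb - msw) / (msb + (n0 - 1) * msw)
  c(n = n, clusters = k, rho = rho, deff = 1 + (mean(ng) - 1) * rho)
}

# Five clusters of four identical values: perfect within-cluster agreement
r <- icc_deff(rep(1:5, each = 4), rep(letters[1:5], each = 4))
r
```

With no within-cluster variation the correlation is 1 and the design effect equals the cluster size, the least efficient case for a cluster sample.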
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
See Also
Examples
set.seed(1)
blood.pressure <- rnorm(1000, 120, 15)
clinic <- sample(letters, 1000, replace=TRUE)
deff(blood.pressure, clinic)

Concise Statistical Description of a Vector, Matrix, Data Frame, or Formula
Description
describe is a generic method that invokes describe.data.frame, describe.matrix, describe.vector, or describe.formula. describe.vector is the basic function for handling a single variable. This function determines whether the variable is character, factor, category, binary, discrete numeric, or continuous numeric, and prints a concise statistical summary according to each. A numeric variable is deemed discrete if it has <= 10 distinct values. In this case, quantiles are not printed. A frequency table is printed for any non-binary variable if it has no more than 20 distinct values. For any variable for which the frequency table is not printed, the 5 lowest and highest values are printed. This behavior can be overridden for long character variables with many levels using the listunique parameter, to get a complete tabulation.
describe is especially useful for describing data frames created by *.get, as labels, formats, value labels, and (in the case of sas.get) frequencies of special missing values are printed.
For a binary variable, the sum (number of 1's) and mean (proportion of 1's) are printed. If the first argument is a formula, a model frame is created and passed to describe.data.frame. If a variable is of class "impute", a count of the number of imputed values is printed. If a date variable has an attribute partial.date (this is set up by sas.get), counts of how many partial dates are actually present (missing month, missing day, missing both) are also presented. If a variable was created by the special-purpose function substi (which substitutes values of a second variable if the first variable is NA), the frequency table of substitutions is also printed.
For numeric variables, describe adds an item called Info which is a relative information measure using the relative efficiency of a proportional odds/Wilcoxon test on the variable relative to the same test on a variable that has no ties. Info is related to how continuous the variable is, and ties are less harmful the more untied values there are. The formula for Info is one minus the sum of the cubes of relative frequencies of values divided by one minus the square of the reciprocal of the sample size. The lowest information comes from a variable having only one distinct value followed by a highly skewed binary variable. Info is reported to two decimal places.
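The Info formula stated above translates directly into a few lines of base R. The function name info_measure is hypothetical and the code is a sketch of the formula as worded in the text, not an extract from describe.

```r
# (1 - sum of cubed relative frequencies) / (1 - 1/n^2), per the text
info_measure <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  p <- as.numeric(table(x)) / n   # relative frequency of each distinct value
  (1 - sum(p^3)) / (1 - 1 / n^2)
}

info_all_distinct <- info_measure(1:50)                 # no ties
info_binary       <- info_measure(rep(0:1, each = 50))  # balanced binary
info_constant     <- info_measure(rep(7, 50))           # one distinct value

round(c(info_all_distinct, info_binary, info_constant), 2)
```

A variable with no ties reaches the maximum of 1, a balanced binary variable gives about 0.75, and a constant gives 0.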
A latex method exists for converting the describe object to a LaTeX file. For numeric variables having more than 20 distinct values, describe saves in its returned object the frequencies of 100 evenly spaced bins running from minimum observed value to the maximum. When there are 20 or fewer distinct values, the original values are maintained. latex and html insert a spike histogram displaying these frequency counts in the tabular material using the LaTeX picture environment. For example output see https://hbiostat.org/doc/rms/book/chapter7edition1.pdf. Note that the latex method assumes you have the following styles installed in your latex installation: setspace and relsize.
The html method mimics the LaTeX output. This is useful in the context of Quarto/Rmarkdown html and html notebook output. If options(prType='html') is in effect, calling print on an object that is the result of running describe on a data frame will result in rendering the HTML version. If run from the console a browser window will open. When which is specified to print, whether or not prType='html' is in effect, a gt package html table will be produced containing only the types of variables requested. When which='both' a list with element names Continuous and Categorical is produced, making it convenient for the user to print as desired, or to pass the list directly to the qreport maketabs function when using Quarto.
The plot method is for describe objects run on data frames. It produces spike histograms for a graphic of continuous variables and a dot chart for categorical variables, showing category proportions. The graphic format is ggplot2 if the user has not set options(grType='plotly') or has set the grType option to something other than 'plotly'. Otherwise plotly graphics that are interactive are produced, and these can be placed into an Rmarkdown html notebook. The user must install the plotly package for this to work. When the user hovers the mouse over a bin for a raw data value, the actual value will pop up (formatted using digits). When the user hovers over the minimum data value, most of the information calculated by describe will pop up. For each variable, the number of missing values is used to assign the color to the histogram or dot chart, and a legend is drawn. Color is not used if there are no missing values in any variable. For categorical variables, hovering over the leftmost point for a variable displays details, and for all points proportions, numerators, and denominators are displayed in the popup. If both continuous and categorical variables are present and which='both' is specified, the plot method returns an unclassed list containing two objects, named 'Categorical' and 'Continuous', in that order.
Sample weights may be specified to any of the functions, resultingin weighted means, quantiles, and frequency tables.
Note: As discussed in Cox and Longton (2008), Stata Journal 8(4) pp. 557, the term "unique" has been replaced with "distinct" in the output (but not in parameter names).
When weights are not used, the pseudomedian and Gini's mean difference are computed for numeric variables. The pseudomedian is labeled pMedian and is the median of all possible pairwise averages. It is a robust and efficient measure of location that equals the mean and median for symmetric distributions. It is also called the Hodges-Lehmann one-sample estimator. Gini's mean difference is a robust measure of dispersion that is the mean absolute difference between any pairs of observations. In simple output Gini's difference is labeled Gmd.
formatdescribeSingle is a service function for latex, html, and print methods for single variables that is not intended to be called by the user.
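Both statistics are easy to compute by brute force for small samples, which can help when checking describe output. The sketch below is an assumption-laden illustration, not the package's algorithm: the helper names are hypothetical, the pseudomedian pairs each observation with itself as well as with the others (the usual Walsh-average convention), and the O(n^2) outer products are only practical for modest n.

```r
# Pseudomedian: median of all pairwise averages (Walsh averages, i <= j)
pseudomedian <- function(x) {
  x <- x[!is.na(x)]
  avg <- outer(x, x, "+") / 2
  median(avg[upper.tri(avg, diag = TRUE)])
}

# Gini's mean difference: mean absolute difference over all pairs i < j
gini_md <- function(x) {
  x <- x[!is.na(x)]
  d <- abs(outer(x, x, "-"))
  mean(d[upper.tri(d)])
}

x <- c(1, 2, 3, 4, 10)          # small made-up sample with one outlier
c(pMedian = pseudomedian(x), Gmd = gini_md(x), mean = mean(x))
```

For this sample the pseudomedian is 3 and Gini's mean difference is 4, while the mean is pulled up to 4 by the outlier.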
Usage
## S3 method for class 'vector'
describe(x, descript, exclude.missing=TRUE, digits=4,
         listunique=0, listnchar=12,
         weights=NULL, normwt=FALSE, minlength=NULL, shortmChoice=TRUE,
         rmhtml=FALSE, trans=NULL, lumptails=0.01, ...)

## S3 method for class 'matrix'
describe(x, descript, exclude.missing=TRUE, digits=4, ...)

## S3 method for class 'data.frame'
describe(x, descript, exclude.missing=TRUE, digits=4, trans=NULL, ...)

## S3 method for class 'formula'
describe(x, descript, data, subset, na.action, digits=4, weights, ...)

## S3 method for class 'describe'
print(x, which = c('both', 'categorical', 'continuous'), ...)

## S3 method for class 'describe'
latex(object, title=NULL,
      file=paste('describe',first.word(expr=attr(object,'descript')),'tex',sep='.'),
      append=FALSE, size='small', tabular=TRUE, greek=TRUE,
      spacing=0.7, lspace=c(0,0), ...)

## S3 method for class 'describe.single'
latex(object, title=NULL, vname,
      file, append=FALSE, size='small', tabular=TRUE, greek=TRUE,
      lspace=c(0,0), ...)

## S3 method for class 'describe'
html(object, size=85, tabular=TRUE,
     greek=TRUE, scroll=FALSE, rows=25, cols=100, ...)

## S3 method for class 'describe.single'
html(object, size=85,
     tabular=TRUE, greek=TRUE, ...)

formatdescribeSingle(x, condense=c('extremes', 'frequencies', 'both', 'none'),
                     lang=c('plain', 'latex', 'html'), verb=0, lspace=c(0, 0),
                     size=85, ...)

## S3 method for class 'describe'
plot(x, which=c('both', 'continuous', 'categorical'),
     what=NULL,
     sort=c('ascending', 'descending', 'none'),
     n.unique=10, digits=5, bvspace=2, ...)

Arguments
x | a data frame, matrix, vector, or formula. For a data frame, the |
descript | optional title to print for x. The default is the name of the argument or the "label" attributes of individual variables. When the first argument is a formula, |
exclude.missing | set to TRUE to print the names of variables that contain only missing values. This list appears at the bottom of the printout, and no space is taken up for such variables in the main listing. |
digits | number of significant digits to print. For |
listunique | For a character variable that is not an |
listnchar | see |
weights | a numeric vector of frequencies or sample weights. Each observation will be treated as if it were sampled |
minlength | value passed to summary.mChoice |
shortmChoice | set to |
rmhtml | set to |
trans | for |
lumptails | specifies the quantile to use (its complement is also used) for grouping observations in the tails so that outliers have less chance of distorting the variable's range for sparkline spike histograms. The default is 0.01, i.e., observations below the 0.01 quantile are grouped together in the leftmost bin, and observations above the 0.99 quantile are grouped to form the last bin. |
normwt | The default, |
object | a result of |
title | unused |
data | a data frame, data table, or list |
subset | a subsetting expression |
na.action | These are used if a formula is specified. |
... | arguments passed to |
file | name of output file (should have a suffix of .tex). Default name is formed from the first word of the |
append | set to |
size | LaTeX text size ( |
tabular | set to |
greek | By default, the |
spacing | By default, the |
lspace | extra vertical scape, in character size units (i.e., "ex"as appended to the space). When using certain font sizes, there istoo much space left around LaTeX verbatim environments. Thistwo-vector specifies space to remove (i.e., the values are negated informing the |
scroll | set to |
rows,cols | the number of rows or columns to allocate for the scrollable box |
vname | unused argument in |
which | specifies whether to plot numeric continuous or binary/categorical variables, or both. When |
what | character or numeric vector specifying which variables to plot; default is to plot all |
sort | specifies how and whether variables are sorted in order of the proportion of positives when |
n.unique | the minimum number of distinct values a numeric variable must have before |
bvspace | the between-variable spacing for categorical variables. Defaults to 2, meaning twice the amount of vertical space as what is used for between-category spacing within a variable |
condense | specifies whether to condense the output with regard to the 5 lowest and highest values ( |
lang | specifies the markup language |
verb | set to 1 if a verbatim environment is already in effect for LaTeX |
Details
If options(na.detail.response=TRUE) has been set and na.action is "na.delete" or "na.keep", summary statistics on the response variable are printed separately for missing and non-missing values of each predictor. The default summary function returns the number of non-missing response values and the mean of the last column of the response values, with a names attribute of c("N","Mean"). When the response is a Surv object and the mean is used, this will result in the crude proportion of events being used to summarize the response. The actual summary function can be designated through options(na.fun.response = "function name").
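The default response summary described above can be mimicked in a few lines of base R. This is only an illustrative sketch: the name n_mean_sketch is hypothetical and is not the function Hmisc uses internally. It shows why, for a two-column survival-type response (time, event indicator), the mean of the last column is the crude proportion of events.

```r
# Hypothetical stand-in for the default summary: N and mean of the last column
n_mean_sketch <- function(y) {
  y    <- as.matrix(y)
  last <- y[, ncol(y)]
  c(N = sum(!is.na(last)), Mean = mean(last, na.rm = TRUE))
}

# Survival-like response: column 1 = follow-up time,
# column 2 = event indicator (1 = event, 0 = censored); values are made up
resp <- cbind(time = c(5, 8, 12, 3, 9), event = c(1, 0, 1, 1, 0))
s <- n_mean_sketch(resp)
s   # Mean is 3/5, the crude proportion of events
```

A function returning a named vector of this kind is the sort of thing one would name in options(na.fun.response=...), though the exact requirements should be checked against the package itself.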
If you are modifying LaTeX parskip or certain other parameters, you may need to shrink the area around tabular and verbatim environments produced by latex.describe. You can do this using for example \usepackage{etoolbox} \makeatletter \preto{\@verbatim}{\topsep=-1.4pt \partopsep=0pt} \preto{\@tabular}{\parskip=2pt \parsep=0pt} \makeatother in the LaTeX preamble.
Multiple choice (mChoice) variables' describe output renders well in html but not when included in a Quarto document.
Value
a list containing elements descript, counts, values. The list is of class describe. If the input object was a matrix or a data frame, the list is a list of lists, one list for each variable analyzed. latex returns a standard latex object. For numeric variables having at least 20 distinct values, there is an additional component intervalFreq. This component is a list with two elements, range (containing two values) and count, a vector of 100 integer frequency counts. print with which= returns a 'gt' table object. The user can modify the table by piping formatting changes, column removals, and other operations, before final rendering.
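The structure of the returned object can be inspected directly. The following is a minimal sketch, assuming Hmisc is installed and that the per-variable elements can be reached with the usual list accessors; the data are simulated and the printed output is not reproduced here.

```r
set.seed(1)
dfr <- data.frame(x = rnorm(400),
                  y = sample(c("male", "female"), 400, TRUE))

# Guarded so the sketch is skipped when Hmisc is not installed
if (requireNamespace("Hmisc", quietly = TRUE)) {
  d <- Hmisc::describe(dfr)

  class(d)     # expected to include "describe"
  names(d)     # one element per variable analyzed
  names(d$x)   # components for x, e.g. descript, counts, values
  # x has far more than 20 distinct values, so intervalFreq should be present
  str(d$x$intervalFreq)
}
```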
Author(s)
Frank Harrell
Vanderbilt University
fh@fharrell.com
See Also
spikecomp, sas.get, quantile, GiniMd, pMedian, table, summary, model.frame.default, naprint, lapply, tapply, Surv, na.delete, na.keep, na.detail.response, latex
Examples
set.seed(1)
describe(runif(200),dig=2)    #single variable, continuous
                              #get quantiles .05,.10,...

dfr <- data.frame(x=rnorm(400),y=sample(c('male','female'),400,TRUE))
describe(dfr)

## Not run: 
options(grType='plotly')
d <- describe(mydata)
p <- plot(d)   # create plots for both types of variables
p[[1]]; p[[2]] # or p$Categorical; p$Continuous
plotly::subplot(p[[1]], p[[2]], nrows=2)  # plot both in one
plot(d, which='categorical')    # categorical ones

d <- sas.get(".","mydata",special.miss=TRUE,recode=TRUE)
describe(d)      #describe entire data frame
attach(d, 1)
describe(relig)  #Has special missing values .D .F .M .R .T
                 #attr(relig,"label") is "Religious preference"

#relig : Religious preference  Format:relig
#    n  missing  D  F M R T distinct 
# 4038      263 45 33 7 2 1        8
#
#0:none (251, 6%), 1:Jewish (372, 9%), 2:Catholic (1230, 30%) 
#3:Jehovah's Witnes (25, 1%), 4:Christ Scientist (7, 0%) 
#5:Seventh Day Adv (17, 0%), 6:Protestant (2025, 50%), 7:other (111, 3%) 

# Method for describing part of a data frame:
 describe(death.time ~ age*sex + rcs(blood.pressure))
 describe(~ age+sex)
 describe(~ age+sex, weights=freqs)  # weighted analysis

 fit <- lrm(y ~ age*sex + log(height))
 describe(formula(fit))
 describe(y ~ age*sex, na.action=na.delete)
# report on number deleted for each variable
 options(na.detail.response=TRUE)
# keep missings separately for each x, report on dist of y by x=NA
 describe(y ~ age*sex)
 options(na.fun.response="quantile")
 describe(y ~ age*sex)   # same but use quantiles of y by x=NA

 d <- describe(my.data.frame)
 d$age                   # print description for just age
 d[c('age','sex')]       # print description for two variables
 d[sort(names(d))]       # print in alphabetic order by var. names
 d2 <- d[20:30]          # keep variables 20-30
 page(d2)                # pop-up window for these variables

# Test date/time formats and suppression of times when they don't vary
 library(chron)
 d <- data.frame(a=chron((1:20)+.1),
                 b=chron((1:20)+(1:20)/100),
                 d=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
                               hour=rep(11,20),min=rep(17,20),sec=rep(11,20)),
                 f=ISOdatetime(year=rep(2003,20),month=rep(4,20),day=1:20,
                               hour=1:20,min=1:20,sec=1:20),
                 g=ISOdate(year=2001:2020,month=rep(3,20),day=1:20))
 describe(d)

# Make a function to run describe, latex.describe, and use the kdvi
# previewer in Linux to view the result and easily make a pdf file

 ldesc <- function(data) {
  options(xdvicmd='kdvi')
  d <- describe(data, desc=deparse(substitute(data)))
  dvi(latex(d, file='/tmp/z.tex'), nomargins=FALSE, width=8.5, height=11)
 }

 ldesc(d)

## End(Not run)
Discrete Vector tools
Description
discrete creates a discrete vector which is distinct from a continuous vector, or a factor/ordered vector. The other functions are tools for manipulating discrete vectors.
Usage
as.discrete(x, ...)

## Default S3 method:
as.discrete(x, ...)

discrete(x, levels = sort(unique.default(x), na.last = TRUE), exclude = NA)

## S3 replacement method for class 'discrete'
x[...] <- value

## S3 method for class 'discrete'
x[..., drop = FALSE]

## S3 method for class 'discrete'
x[[i]]

is.discrete(x)

## S3 replacement method for class 'discrete'
is.na(x) <- value

## S3 replacement method for class 'discrete'
length(x) <- value

Arguments
x | a vector |
drop | Should unused levels be dropped. |
exclude | logical: should |
i | indexing vector |
levels | character: list of individual level values |
value | index of elements to set to |
... | arguments to be passed to other functions |
Details
as.discrete converts a vector into a discrete vector.
discrete creates a discrete vector from provided values.
is.discrete tests to see if the vector is a discrete vector.
Value
as.discrete and discrete return a vector of discrete type.
is.discrete returns logical TRUE if the vector is of class discrete; otherwise it returns FALSE.
Author(s)
Charles Dupont
See Also
Examples
a <- discrete(1:25)
a
is.discrete(a)
b <- as.discrete(2:4)
b
Enhanced Dot Chart
Description
dotchart2 is an enhanced version of the dotchart function with several new options.
Usage
dotchart2(data, labels, groups=NULL, gdata=NA, horizontal=TRUE, pch=16,
          xlab='', ylab='', xlim=NULL, auxdata, auxgdata=NULL, auxtitle,
          lty=1, lines=TRUE, dotsize = .8,
          cex = par("cex"), cex.labels = cex,
          cex.group.labels = cex.labels*1.25, sort.=TRUE,
          add=FALSE, dotfont=par('font'), groupfont=2,
          reset.par=add, xaxis=TRUE, width.factor=1.1,
          lcolor='gray', leavepar=FALSE,
          axisat=NULL, axislabels=NULL, ...)
Arguments
data | a numeric vector whose values are shown on the x-axis |
labels | a vector of labels for each point, corresponding to |
groups | an optional categorical variable indicating how |
gdata | data values for groups, typically summaries such as group medians |
horizontal | set to |
pch | default character number or value for plotting dots in dot charts.The default is 16. |
xlab | x-axis title |
ylab | y-axis title |
xlim | x-axis limits. Applies only to |
auxdata | a vector of auxiliary data given to |
auxgdata | similar to |
auxtitle | if |
lty | line type for horizontal lines. Default is 1 for R, 2 for S-Plus |
lines | set to |
dotsize |
|
cex | see |
cex.labels |
|
cex.group.labels | value of |
sort. | set to |
add | set to |
dotfont | font number of plotting dots. Default is one. Use |
groupfont | font number to use in drawing |
reset.par | set to |
xaxis | set to |
width.factor | When the calculated left margin turns out to be faulty, specify a factor by which to multiply the left margin as |
lcolor | color for horizontal reference lines. Default is |
leavepar | set to |
axisat | a vector of tick mark locations to pass to |
axislabels | a vector of strings specifying axis tick mark labels. Useful if transforming the data axis |
... | arguments passed to |
Side Effects
dotchart2 will leave par altered if reset.par=FALSE.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
See Also
Examples
set.seed(135)
maj <- factor(c(rep('North',13),rep('South',13)))
g <- paste('Category',rep(letters[1:13],2))
n <- sample(1:15000, 26, replace=TRUE)
y1 <- runif(26)
y2 <- pmax(0, y1 - runif(26, 0, .1))
dotchart2(y1, g, groups=maj, auxdata=n, auxtitle='n', xlab='Y')
dotchart2(y2, g, groups=maj, pch=17, add=TRUE)
## Compare with dotchart function (no superpositioning or auxdata allowed):
## dotchart(y1, g, groups=maj, xlab='Y')

## To plot using a transformed scale add for example
## axisat=sqrt(pretty(y)), axislabels=pretty(y)
Enhanced Version of dotchart Function
Description
These are adaptations of the R dotchart function that sort categories top to bottom, add auxdata and auxtitle arguments to put extra information in the right margin, and for dotchart3 add arguments cex.labels, cex.group.labels, and groupfont. By default, group headings are in a larger, bold font. dotchart3 also cuts a bit of white space from the top and bottom of the chart. The most significant change, however, is in how x is interpreted. Columns of x no longer provide an alternate way to define groups. Instead, they define superpositioned values. This is useful for showing three quartiles, for example. Going along with this change, for dotchart3 pch can now be a vector specifying symbols to use going across columns of x. x was changed in this way because putting multiple points on a line (e.g., quartiles) and keeping track of par() parameters when dotchart2 was called with add=TRUE was cumbersome. dotchart3 changes the margins to account for horizontal labels.
dotchartp is a version of dotchart3 for making the chart with the plotly package.
summaryD creates aggregate data using summarize and calls dotchart3 with suitable arguments to summarize data by major and minor categories. If options(grType='plotly') is in effect and the plotly package is installed, summaryD uses dotchartp instead of dotchart3.
summaryDp is a streamlined summaryD-like function that uses the dotchartpl function to render a plotly graphic. It is used to compute summary statistics stratified separately by a series of variables.
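The "three quartiles" use case mentioned above might look like the following sketch. The data, the grouping variable g, and the choice of plotting symbols are made up for illustration; only the matrix-valued x and vector-valued pch come from the description. The call is guarded so that it is skipped when Hmisc is not installed.

```r
set.seed(2)
g <- factor(sample(c("A", "B", "C", "D"), 200, TRUE))
y <- rnorm(200, mean = as.integer(g))

# Rows = categories, columns = 25th, 50th, 75th percentiles
q <- t(sapply(split(y, g), quantile, probs = c(0.25, 0.5, 0.75)))

if (requireNamespace("Hmisc", quietly = TRUE)) {
  # One horizontal line per category, three superposed symbols per line
  Hmisc::dotchart3(q, labels = rownames(q), pch = c(1, 16, 1),
                   xlab = "Quartiles of y")
}
```

Each column of q is drawn with its own symbol on the same line, which is the behavior that replaced the older columns-as-groups interpretation.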
Usage
dotchart3(x, labels = NULL, groups = NULL, gdata = NULL,
          cex = par("cex"), pch = 21, gpch = pch, bg = par("bg"),
          color = par("fg"), gcolor = par("fg"), lcolor = "gray",
          xlim = range(c(x, gdata), na.rm=TRUE), main = NULL, xlab = NULL,
          ylab = NULL, auxdata = NULL, auxtitle = NULL, auxgdata=NULL,
          axisat=NULL, axislabels=NULL, cex.labels = cex,
          cex.group.labels = cex.labels * 1.25, cex.auxdata=cex,
          groupfont = 2, auxwhere=NULL, height=NULL, width=NULL, ...)

dotchartp(x, labels = NULL, groups = NULL, gdata = NULL,
          xlim = range(c(x, gdata), na.rm=TRUE), main=NULL,
          xlab = NULL, ylab = '', auxdata=NULL, auxtitle=NULL,
          auxgdata=NULL, auxwhere=c('right', 'hover'),
          symbol='circle', col=colorspace::rainbow_hcl,
          legendgroup=NULL,
          axisat=NULL, axislabels=NULL, sort=TRUE, digits=4, dec=NULL,
          height=NULL, width=700, layoutattr=FALSE, showlegend=TRUE, ...)

summaryD(formula, data=NULL, fun=mean, funm=fun,
         groupsummary=TRUE, auxvar=NULL, auxtitle='',
         auxwhere=c('hover', 'right'),
         vals=length(auxvar) > 0, fmtvals=format,
         symbol=if(use.plotly) 'circle' else 21,
         col=if(use.plotly) colorspace::rainbow_hcl else 1:10,
         legendgroup=NULL,
         cex.auxdata=.7, xlab=v[1], ylab=NULL,
         gridevery=NULL, gridcol=gray(.95), sort=TRUE, ...)

summaryDp(formula,
          fun=function(x) c(Mean=mean(x, na.rm=TRUE),
                            N=sum(! is.na(x))),
          overall=TRUE, xlim=NULL, xlab=NULL,
          data=NULL, subset=NULL, na.action=na.retain,
          ncharsmax=c(50, 30),
          digits=4, ...)
Arguments
x | a numeric vector or matrix |
labels | labels for categories corresponding to rows of |
groups,gdata,cex,pch,gpch,bg,color,gcolor,lcolor,xlim,main,xlab,ylab | see |
auxdata | a vector of information to be put in the right margin, in the same order as |
auxtitle | a column heading for |
auxgdata | similar to |
axisat | a vector of tick mark locations to pass to |
axislabels | a vector of strings specifying axis tick mark labels. Useful if transforming the data axis |
digits | number of significant digits for formatting numeric data in hover text for |
dec | for |
cex.labels |
|
cex.group.labels |
|
cex.auxdata |
|
groupfont | font number for group headings |
auxwhere | for |
... | other arguments passed to some of the graphics functions, or to |
layoutattr | set to |
showlegend | set to |
formula | a formula with one variable on the left hand side (the variable to compute summary statistics on), and one or two variables on the right hand side. If there are two variables, the first is taken as the major grouping variable. If the left hand side variable is a matrix it has to be a legal R variable name, not an expression, and |
data | a data frame or list used to find the variables in |
fun | a summarization function creating a single number from a vector. Default is the mean. For |
funm | applies if there are two right hand variables and |
groupsummary | By default, when there are two right-hand variables, |
auxvar | when |
vals | set to |
fmtvals | an optional function to format values before putting them in the right margin. Default is the |
symbol | a scalar or vector of |
col | a function or vector of colors to assign to multiple points plotted in one line. If a function it will be evaluated with an argument equal to the number of groups/columns. |
legendgroup | see |
gridevery | specify a positive number to draw very faint vertical grid lines every |
gridcol | color for grid lines; default is very faint gray scale |
sort | specify |
height,width | height and width in pixels for |
overall | set to |
subset | an observation subsetting expression |
na.action | an |
ncharsmax | a 2-vector specifying the number of characters after which an html new line character should be placed, respectively for the x-axis label and the stratification variable levels |
Value
the function returns invisibly
Author(s)
Frank Harrell
See Also
dotchart, dotchart2, summarize, rlegend
Examples
set.seed(135)
maj <- factor(c(rep('North',13),rep('South',13)))
g <- paste('Category',rep(letters[1:13],2))
n <- sample(1:15000, 26, replace=TRUE)
y1 <- runif(26)
y2 <- pmax(0, y1 - runif(26, 0, .1))
dotchart3(cbind(y1,y2), g, groups=maj, auxdata=n, auxtitle='n',
          xlab='Y', pch=c(1,17))
## Compare with dotchart function (no superpositioning or auxdata allowed):
## dotchart(y1, g, groups=maj, xlab='Y')

## Not run: 
dotchartp(cbind(y1, y2), g, groups=maj, auxdata=n, auxtitle='n',
          xlab='Y', gdata=cbind(c(0,.1), c(.23,.44)), auxgdata=c(-1,-2),
          symbol=c('circle', 'line-ns-open'))

summaryDp(sbp ~ region + sex + race + cut2(age, g=5), data=mydata)

## End(Not run)

## Put options(grType='plotly') to have the following use dotchartp
## (rlegend will not apply)
## Add argument auxwhere='hover' to summaryD or dotchartp to put
## aux info in hover text instead of right margin
summaryD(y1 ~ maj + g, xlab='Mean')
summaryD(y1 ~ maj + g, groupsummary=FALSE)
summaryD(y1 ~ g, fmtvals=function(x) sprintf('%4.2f', x))
Y <- cbind(y1, y2)   # summaryD cannot handle cbind(...) ~ ...
summaryD(Y ~ maj + g, fun=function(y) y[1,], symbol=c(1,17))
rlegend(.1, 26, c('y1','y2'), pch=c(1,17))

summaryD(y1 ~ maj, fun=function(y) c(Mean=mean(y), n=length(y)),
         auxvar='n', auxtitle='N')
Enhanced Version of dotchart Function for plotly
Description
This function produces a plotly interactive graphic and accepts a different format of data input than the other dotchart functions. It was written to handle a hierarchical data structure including strata that further subdivide the main classes. Strata, indicated by the mult variable, are shown on the same horizontal line, and if the variable big is FALSE will appear slightly below the main line, using smaller symbols, and having some transparency. This is intended to handle output such as that from the summaryP function when there is a superpositioning variable group and a stratification variable mult, especially when the data have been run through the addMarginal function to create mult categories labelled "All" for which the user will specify big=TRUE to indicate non-stratified estimates (stratified only on group) to emphasize.
When viewing graphics that used mult and big, the user can click on the legends for the small points for groups to hide the finely stratified estimates.
When group is used but mult and big are not, and when the group variable has exactly two distinct values, you can specify refgroup to get the difference between two proportions in addition to the individual proportions. The individual proportions are plotted, but confidence intervals for the difference are shown in hover text and half-width confidence intervals for the difference, centered at the midpoint of the proportions, are shown. These have the property of intersecting the two proportions if and only if there is no significant difference at the 1 - conf.int level.
Specify fun=exp and ifun=log if estimates and confidence limits are on the log scale. Make sure that zeros were prevented in the original calculations. For exponential hazard rates this can be accomplished by replacing event counts of 0 with 0.5.
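The half-width interval property can be checked numerically in base R. The sketch below is illustrative only: the counts are made up, and a Wald standard error is assumed for the difference, which may not be the interval method the function itself uses. The interval centered at the midpoint has half the length of the full confidence interval for the difference, so it reaches both proportions exactly when the full interval includes zero.

```r
# Two observed proportions (illustrative numbers, not from the package)
n1 <- 200; x1 <- 90     # group 1: 90 / 200
n2 <- 220; x2 <- 80     # group 2: 80 / 220
p1 <- x1 / n1; p2 <- x2 / n2

conf.int <- 0.95
z  <- qnorm(1 - (1 - conf.int) / 2)
# Wald standard error of the difference (an assumption made for this sketch)
se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

d    <- p1 - p2
full <- d + c(-1, 1) * z * se          # confidence interval for the difference
mid  <- (p1 + p2) / 2
half <- mid + c(-1, 1) * z * se / 2    # half-width interval at the midpoint

excludes_zero <- full[1] > 0 || full[2] < 0
covers_both   <- half[1] <= min(p1, p2) && half[2] >= max(p1, p2)
c(excludes_zero = excludes_zero, covers_both = covers_both)
```

Algebraically, the half-width interval covers both proportions when |p1 - p2| / 2 <= z * se / 2, which is the same condition as zero lying inside the full interval.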
Usage
dotchartpl(x, major=NULL, minor=NULL, group=NULL, mult=NULL,
           big=NULL, htext=NULL, num=NULL, denom=NULL,
           numlabel='', denomlabel='',
           fun=function(x) x, ifun=function(x) x, op='-',
           lower=NULL, upper=NULL,
           refgroup=NULL, sortdiff=TRUE, conf.int=0.95,
           minkeep=NULL, xlim=NULL, xlab='Proportion',
           tracename=NULL, limitstracename='Limits',
           nonbigtracename='Stratified Estimates',
           dec=3, width=800, height=NULL,
           col=colorspace::rainbow_hcl)
Arguments
x | a numeric vector used for values on the |
major | major vertical category, e.g., variable labels |
minor | minor vertical category, e.g. category levels within variables |
group | superpositioning variable such as treatment |
mult | strata names for further subdivisions without |
big | omit if all levels of |
htext | additional hover text per point |
num | if |
denom | like |
numlabel | character string to put to the right of the numerator in hover text |
denomlabel | character string to put to the right of the denominator in hover text |
fun | a transformation to make when printing estimates. Forexample, one may specify |
ifun | inverse transformation of |
op | set to for example |
lower | lower limits for optional error bars |
upper | upper limits for optional error bars |
refgroup | if |
sortdiff |
|
conf.int | confidence level for computing confidence intervals for the difference in two proportions. Specify |
minkeep | if |
xlim |
|
xlab |
|
tracename |
|
limitstracename |
|
nonbigtracename |
|
col | a function or vector of colors to assign to |
dec | number of places to the right of the decimal place for formatting numeric quantities in hover text |
width | width of plot in pixels |
height | height of plot in pixels; computed from number of strata by default |
Value
a plotly object. An attribute levelsRemoved is added if minkeep is used and any categories were omitted from the plot as a result. This is a character vector with categories removed. If major is present, the strings are of the form major:minor
Author(s)
Frank Harrell
See Also
Examples
## Not run: 
set.seed(1)
d <- expand.grid(major=c('Alabama', 'Alaska', 'Arkansas'),
                 minor=c('East', 'West'),
                 group=c('Female', 'Male'),
                 city=0:2)
n <- nrow(d)
d$num   <- round(100*runif(n))
d$denom <- d$num + round(100*runif(n))
d$x     <- d$num / d$denom
d$lower <- d$x - runif(n)
d$upper <- d$x + runif(n)

with(d,
 dotchartpl(x, major, minor, group, city, lower=lower, upper=upper,
            big=city==0, num=num, denom=denom, xlab='x'))

# Show half-width confidence intervals for Female - Male differences
# after subsetting the data to have only one record per
# state/region/group
d <- subset(d, city == 0)
with(d,
 dotchartpl(x, major, minor, group, num=num, denom=denom,
            lower=lower, upper=upper, refgroup='Male'))

n <- 500
set.seed(1)
d <- data.frame(
  race         = sample(c('Asian', 'Black/AA', 'White'), n, TRUE),
  sex          = sample(c('Female', 'Male'), n, TRUE),
  treat        = sample(c('A', 'B'), n, TRUE),
  smoking      = sample(c('Smoker', 'Non-smoker'), n, TRUE),
  hypertension = sample(c('Hypertensive', 'Non-Hypertensive'), n, TRUE),
  region       = sample(c('North America','Europe','South America',
                          'Europe', 'Asia', 'Central America'), n, TRUE))

d <- upData(d, labels=c(race='Race', sex='Sex'))

dm <- addMarginal(d, region)
s <- summaryP(race + sex + smoking + hypertension ~ region + treat,
              data=dm)

s$region <- ifelse(s$region == 'All', 'All Regions', as.character(s$region))

with(s,
 dotchartpl(freq / denom, major=var, minor=val, group=treat, mult=region,
            big=region == 'All Regions', num=freq, denom=denom))

s2 <- s[- attr(s, 'rows.to.exclude1'), ]
with(s2,
 dotchartpl(freq / denom, major=var, minor=val, group=treat, mult=region,
            big=region == 'All Regions', num=freq, denom=denom))
# Note these plots can be created by plot.summaryP when options(grType='plotly')

# Plot hazard rates and ratios with confidence limits, on log scale
d <- data.frame(tx=c('a', 'a', 'b', 'b'),
                event=c('MI', 'stroke', 'MI', 'stroke'),
                count=c(10, 5, 5, 2),
                exposure=c(1000, 1000, 900, 900))
# There were no zero event counts in this dataset.  In general we
# want to handle that, hence the 0.5 below
d <- upData(d, hazard = pmax(0.5, count) / exposure,
               selog  = sqrt(1. / pmax(0.5, count)),
               lower  = log(hazard) - 1.96 * selog,
               upper  = log(hazard) + 1.96 * selog)
with(d,
     dotchartpl(log(hazard), minor=event, group=tx, num=count, denom=exposure,
                lower=lower, upper=upper,
                fun=exp, ifun=log, op='/',
                numlabel='events', denomlabel='years',
                refgroup='a', xlab='Events Per Person-Year'))

## End(Not run)
Dual Standard Deviations
Description
Computes one standard deviation for the lower half of the distribution of a numeric vector and another SD for the upper half. By default the center of the distribution for purposes of splitting into "halves" is the mean. The user may override this with center. When splitting into halves, observations equal to the center value are included in both subsets.
Usage
dualSD(x, na.rm = FALSE, nmin = 10, center = xbar)
Arguments
x | a numeric vector |
na.rm | set to |
nmin | the minimum number of non- |
center | center point for making the two subsets. The sample mean is used to compute the two SDs no matter what is specified for |
Details
The purpose of dual SDs is to describe variability for asymmetric distributions. Symmetric distributions are also handled, though slightly less efficiently than a single SD does.
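One plausible reading of the description can be sketched in base R as follows. The helper name dual_sd_sketch is hypothetical, and the divisor (subset size minus one) and the absence of any nmin handling are assumptions, so this should not be taken as the package's exact formula. It does follow the two documented rules: deviations are taken from the sample mean whatever center is used, and observations equal to center go into both halves.

```r
# Hypothetical helper; divisor and lack of an nmin check are assumptions
dual_sd_sketch <- function(x, center = mean(x)) {
  xbar <- mean(x)                       # deviations are always taken from the mean
  part_sd <- function(part) sqrt(sum((part - xbar)^2) / (length(part) - 1))
  c(bottom = part_sd(x[x <= center]),   # ties with center go into both halves
    top    = part_sd(x[x >= center]))
}

set.seed(1)
x <- rnorm(20000)
sym  <- dual_sd_sketch(x)        # symmetric: both values should be close to sd(x)
skew <- dual_sd_sketch(exp(x))   # right-skewed: top should exceed bottom
sym
skew
```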
Value
a 2-vector of SDs with names bottom and top
Author(s)
Frank Harrell
See Also
Examples
set.seed(1)
x <- rnorm(20000)
sd(x)
dualSD(x)
y <- exp(x)
s1 <- sd(y)
s2 <- dualSD(y)
s1
s2
quantile(y, c(0.025, 0.975))
mean(y) + 1.96 * c(-1, 1) * s1
mean(y) + 1.96 * c(- s2['bottom'], s2['top'])
c(mean=mean(y), pseudomedian=pMedian(y), median=median(y))
ebpcomp
Description
Computation of Coordinates of Extended Box Plots Elements
Usage
ebpcomp(x, qref = c(0.5, 0.25, 0.75), probs = c(0.05, 0.125, 0.25, 0.375))
Arguments
x | a numeric variable |
qref | quantiles for major corners |
probs | quantiles for minor corners |
Details
Computes all the elements needed for plotting an extended box plot. This is typically used when adding to a ggplot2 plot.
Value
list with elements segments, lines, points, points2
Author(s)
Frank Harrell
Examples
ebpcomp(1:1000)
ecdfSteps
Description
Compute Coordinates of an Empirical Distribution Function
Usage
ecdfSteps(x, extend)
Arguments
x | numeric vector, possibly with |
extend | a 2-vector to extend the range of x (low, high). Set |
Details
For a numeric vector uses the R built-in ecdf function to compute coordinates of the ECDF, with extension slightly below and above the range of x by default. This is useful for ggplot2 where the ECDF may need to be transformed. The returned object is suitable for creating stratified statistics using data.table and other methods.
Value
a list with components x and y
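The idea of turning ecdf output into plain coordinates can be sketched in base R. The helper ecdf_steps_sketch below is hypothetical and is not the Hmisc source; in particular the 2% default margin and the dropping of NAs are assumptions chosen for illustration.

```r
# Hypothetical helper, not the Hmisc implementation
ecdf_steps_sketch <- function(x, extend = NULL) {
  x  <- x[!is.na(x)]
  f  <- ecdf(x)
  ux <- knots(f)                        # sorted distinct values (ECDF jump points)
  if (is.null(extend)) {
    pad    <- 0.02 * diff(range(ux))    # small margin chosen for illustration
    extend <- range(ux) + c(-pad, pad)
  }
  list(x = c(extend[1], ux, extend[2]),
       y = c(0, f(ux), 1))
}

s <- ecdf_steps_sketch(c(3, 1, 2, 2, NA))
s$x   # extended low end, distinct values, extended high end
s$y   # ECDF heights, starting at 0 and ending at 1
```

Because the result is a list of two equal-length vectors, it can be returned from a grouped data.table expression and passed to a step geometry after any transformation of y.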
Author(s)
Frank Harrell
See Also
Examples
ecdfSteps(0:10)
## Not run: 
# Use data.table for obtaining ECDFs by country and region
w <- d[, ecdfSteps(z, extend=c(1,11)), by=.(country, region)]  # d is a DT

# Use ggplot2 to make one graph with multiple regions' ECDFs
# and use faceting for countries
ggplot(w, aes(x, y, color=region)) + geom_step() +
  facet_wrap(~ country)

## End(Not run)
Multicolumn Formatting
Description
Expands the widths of either the supercolumns or the subcolumns so that the sum of the supercolumn widths is the same as the sum of the subcolumn widths.
Usage
equalBins(widths, subwidths)
Arguments
widths | widths of the supercolumns. |
subwidths | list of widths of the subcolumns for each supercolumn. |
Details
This determines the correct subwidths of each of various columns in a table for printing. The correct width of the multicolumns is determined by summing the widths of its subcolumns.
Value
widths of the columns for a table.
Author(s)
Charles Dupont
See Also
Examples
mcols <- c("Group 1", "Group 2")
mwidth <- nchar(mcols, type="width")
spancols <- c(3,3)
ccols <- c("a", "deer", "ad", "cat", "help", "bob")
cwidth <- nchar(ccols, type="width")

subwidths <- partition.vector(cwidth, spancols)

equalBins(mwidth, subwidths)
Plot Error Bars
Description
Adds vertical error bars to an existing plot or makes a new plot with error bars.
Usage
errbar(x, y, yplus, yminus, cap=0.015, main = NULL,
       sub=NULL, xlab=as.character(substitute(x)),
       ylab=if(is.factor(x) || is.character(x)) ""
       else as.character(substitute(y)),
       add=FALSE, lty=1, type='p', ylim=NULL,
       lwd=1, pch=16, errbar.col, Type=rep(1, length(y)),
       ...)
Arguments
x | vector of numeric x-axis values (for vertical error bars) or a factor or character variable (for horizontal error bars, |
y | vector of y-axis values. |
yplus | vector of y-axis values: the tops of the error bars. |
yminus | vector of y-axis values: the bottoms of the error bars. |
cap | the width of the little lines at the tops and bottoms of the error bars in units of the width of the plot. Defaults to |
main | a main title for the plot, passed to |
sub | a sub title for the plot, passed to |
xlab | optional x-axis labels if |
ylab | optional y-axis labels if |
add | set to |
lty | type of line for error bars |
type | type of point. Use |
ylim | y-axis limits. Default is to use range of |
lwd | line width for line segments (not main line) |
pch | character to use as the point. |
errbar.col | color to use for drawing error bars. |
Type | used for horizontal bars only. Is an integer vector with values |
... | other parameters passed to all graphics functions. |
Details
errbar adds vertical error bars to an existing plot or makes a new plot with error bars. It can also make a horizontal error bar plot that shows error bars for group differences as well as bars for groups. For the latter type of plot, the lower x-axis scale corresponds to group estimates and the upper scale corresponds to differences. The spacings of the two scales are identical but the scale for differences has its origin shifted so that zero may be included. If at least one of the confidence intervals includes zero, a vertical dotted reference line at zero is drawn.
Author(s)
Charles Geyer, University of Chicago. Modified by Frank Harrell, Vanderbilt University, to handle missing data, to add the parameters add and lty, and to implement horizontal charts with differences.
Examples
set.seed(1)
x <- 1:10
y <- x + rnorm(10)
delta <- runif(10)
errbar( x, y, y + delta, y - delta )

# Show bootstrap nonparametric CLs for 3 group means and for
# pairwise differences on same graph
group <- sample(c('a','b','d'), 200, TRUE)
y     <- runif(200) + .25*(group=='b') + .5*(group=='d')
cla <- smean.cl.boot(y[group=='a'],B=100,reps=TRUE)  # usually B=1000
a   <- attr(cla,'reps')
clb <- smean.cl.boot(y[group=='b'],B=100,reps=TRUE)
b   <- attr(clb,'reps')
cld <- smean.cl.boot(y[group=='d'],B=100,reps=TRUE)
d   <- attr(cld,'reps')
a.b <- quantile(a-b,c(.025,.975))
a.d <- quantile(a-d,c(.025,.975))
b.d <- quantile(b-d,c(.025,.975))
errbar(c('a','b','d','a - b','a - d','b - d'),
       c(cla[1],clb[1],cld[1],cla[1]-clb[1],cla[1]-cld[1],clb[1]-cld[1]),
       c(cla[3],clb[3],cld[3],a.b[2],a.d[2],b.d[2]),
       c(cla[2],clb[2],cld[2],a.b[1],a.d[1],b.d[1]),
       Type=c(1,1,1,2,2,2), xlab='', ylab='')
Escapes any characters that would have special meaning in a regular expression.
Description
Escapes any characters that would have special meaning in a regular expression.
Usage
escapeRegex(string)
escapeBS(string)
Arguments
string | string being operated on. |
Details
escapeRegex will escape any characters that would have special meaning in a regular expression. For any string, grep(escapeRegex(string), string) will always be true.
escapeBS will escape any backslash ‘\’ in a string.
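The escaping idea can be illustrated with a base R sketch. The helper escape_regex_sketch is a hypothetical re-implementation written for this example, using Perl-compatible regular expressions; Hmisc's own code may differ. It prefixes each metacharacter, including the backslash itself, with a backslash, so the escaped pattern matches the original string literally.

```r
# Hypothetical re-implementation for illustration; not the Hmisc source
escape_regex_sketch <- function(string) {
  # Prefix each regex metacharacter (including the backslash) with a backslash
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string, perl = TRUE)
}

string  <- "this\\(system) {is} [full]."
pattern <- escape_regex_sketch(string)
cat(pattern, "\n")

# The escaped pattern should match the original string literally
grepl(pattern, string, perl = TRUE)

# An unescaped "." would match any character; the escaped one does not
grepl(escape_regex_sketch("a.b"), "axb", perl = TRUE)
```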
Value
The value of the string with any characters that would have special meaning in a regular expression escaped.
Author(s)
Charles Dupont
Department of Biostatistics
Vanderbilt University
See Also
Examples
string <- "this\\(system) {is} [full]."
escapeRegex(string)

escapeBS(string)
estSeqMarkovOrd
Description
Simulate Comparisons For Use in Sequential Markov Longitudinal Clinical Trial Simulations
Usage
estSeqMarkovOrd(
  y,
  times,
  initial,
  absorb = NULL,
  intercepts,
  parameter,
  looks,
  g,
  formula,
  ppo = NULL,
  yprevfactor = TRUE,
  groupContrast = NULL,
  cscov = FALSE,
  timecriterion = NULL,
  coxzph = FALSE,
  sstat = NULL,
  rdsample = NULL,
  maxest = NULL,
  maxvest = NULL,
  nsim = 1,
  progress = FALSE,
  pfile = ""
)
Arguments
y | vector of possible y values in order (numeric, character, factor) |
times | vector of measurement times |
initial | a vector of probabilities summing to 1.0 that specifies the frequency distribution of initial values to be sampled from. The vector must have names that correspond to values of |
absorb | vector of absorbing states, a subset of |
intercepts | vector of intercepts in the proportional odds model. There must be one fewer of these than the length of |
parameter | vector of true parameter (effects; group differences) values. These are group 2:1 log odds ratios in the transition model, conditioning on the previous |
looks | integer vector of ID numbers at which maximum likelihood estimates and their estimated variances are computed. For a single look specify a scalar value for |
g | a user-specified function of three or more arguments which in order are |
formula | a formula object given to the |
ppo | a formula specifying the part of |
yprevfactor | see |
groupContrast | omit this argument if |
cscov | applies if |
timecriterion | a function of a time-ordered vector of simulated ordinal responses |
coxzph | set to |
sstat | set to a function of the time vector and the corresponding vector of ordinal responses for a single group if you want to compute a Wilcoxon test on a derived quantity such as the number of days in a given state. |
rdsample | an optional function to do response-dependent sampling. It is a function of these arguments, which are vectors that stop at any absorbing state: |
maxest | maximum acceptable absolute value of the contrast estimate, ignored if |
maxvest | like |
nsim | number of simulations (default is 1) |
progress | set to |
pfile | file to which to write progress information. Defaults to |
Details
Simulates sequential clinical trials of longitudinal ordinal outcomes using a first-order Markov model. Looks are done sequentially after subject ID numbers given in the vector looks with the earliest possible look being after subject 2. At each look, a subject's repeated records are either all used or all ignored depending on the subject's ID number. For each true effect parameter value, simulation, and at each look, runs a function to compute the estimate of the parameter of interest along with its variance. For each simulation, data are first simulated for the last look, and these data are sequentially revealed for earlier looks. The user provides a function g that has extra arguments specifying the true effect parameter and the treatment group, expecting treatments to be coded 1 and 2. parameter is usually on the scale of a regression coefficient, e.g., a log odds ratio. Fitting is done using the rms::lrm() function, unless non-proportional odds is allowed in which case VGAM::vglm() is used. If timecriterion is specified, the function also, for the last data look only, computes the first time at which the criterion is satisfied for the subject or uses the event time and event/censoring indicator computed by timecriterion. The Cox/logrank chi-square statistic for comparing groups on the derived time variable is saved. If coxzph=TRUE, the survival package correlation coefficient rho from the scaled partial residuals is also saved so that the user can later determine to what extent the Markov model resulted in the proportional hazards assumption being violated when analyzing on the time scale. vglm is accelerated by saving the first successful fit for the largest sample size and using its coefficients as starting values for further vglm fits for any sample size for the same setting of parameter.
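The "simulate once for the last look, then reveal sequentially" idea can be sketched with toy data in base R. This is a simplified reading, assuming that a subject's records are available at a look whenever the subject's ID does not exceed the look value; the data frame full and its columns are made up and no Markov model is fitted.

```r
# Toy long-format data: 6 subjects, 3 visits each (values are illustrative only)
set.seed(3)
full <- data.frame(id   = rep(1:6, each = 3),
                   time = rep(1:3, times = 6),
                   y    = sample(1:4, 18, replace = TRUE))

looks <- c(2, 4, 6)
# At each look keep every record of subjects enrolled so far, drop all others
revealed <- lapply(looks, function(look) full[full$id <= look, ])
sapply(revealed, nrow)   # 3 records per subject: 6, 12, 18
```

Each element of revealed is a superset of the previous one, so estimates at successive looks are computed on nested data sets from a single simulated trial.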
Value
a data frame with number of rows equal to the product of nsim, the length of looks, and the length of parameter, with variables sim, parameter, look, est (log odds ratio for group), and vest (the variance of the latter). If timecriterion is specified the data frame also contains loghr (Cox log hazard ratio for group), lrchisq (chi-square from Cox test for group), and if coxzph=TRUE, phchisq, the chi-square for testing proportional hazards. The attribute etimefreq is also present if timecriterion is present, and it provides the frequency distribution of derived event times by group and censoring/event indicator. If sstat is given, the attribute sstat is also present, and it contains an array with dimensions corresponding to simulations, parameter values within simulations, id, and a two-column subarray with columns group and y, the latter being the summary measure computed by the sstat function. The returned data frame also has attribute lrmcoef, which holds the last-look logistic regression coefficient estimates over the nsim simulations and the parameter settings, and an attribute failures, which is a data frame containing the variables reason and frequency cataloging the reasons for unsuccessful model fits.
Author(s)
Frank Harrell
See Also
gbayesSeqSim(), simMarkovOrd(), https://hbiostat.org/R/Hmisc/markov/
estSeqSim
Description
Simulate Comparisons For Use in Sequential Clinical Trial Simulations
Usage
estSeqSim(parameter, looks, gendat, fitter, nsim = 1, progress = FALSE)
Arguments
parameter | vector of true parameter (effects; group differences) values |
looks | integer vector of observation numbers at which posterior probabilities are computed |
gendat | a function of three arguments: true parameter value (scalar), sample size for first group, sample size for second group |
fitter | a function of two arguments: 0/1 group indicator vector and the dependent variable vector |
nsim | number of simulations (default is 1) |
progress | set to |
Details
Simulates sequential clinical trials. Looks are done sequentially at observation numbers given in the vector looks, with the earliest possible look being at observation 2. For each true effect parameter value, simulation, and at each look, runs a function to compute the estimate of the parameter of interest along with its variance. For each simulation, data are first simulated for the last look, and these data are sequentially revealed for earlier looks. The user provides a function gendat that, given a true effect of parameter and the two sample sizes (for treatment groups 1 and 2), returns a list with vectors y1 and y2 containing simulated data. The user also provides a function fitter with arguments x (group indicator 0/1) and y (response variable) that returns a 2-vector containing the effect estimate and its variance. parameter is usually on the scale of a regression coefficient, e.g., a log odds ratio.
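As a complement to the ordinal example in the Examples section, the gendat/fitter contract described above can be sketched for a continuous outcome using only base R. The normal model, the unit standard deviation, and the look schedule below are illustrative assumptions, not package defaults; the call to estSeqSim is guarded so the sketch still runs where Hmisc is not installed.

```r
# gendat: true effect (difference in means) and the two group sizes ->
# list with simulated responses y1 (group 1) and y2 (group 2)
gendat <- function(delta, n1, n2)
  list(y1 = rnorm(n1, mean = 0,     sd = 1),
       y2 = rnorm(n2, mean = delta, sd = 1))

# fitter: 0/1 group indicator and response ->
# 2-vector holding the effect estimate and its variance
fitter <- function(x, y) {
  y0 <- y[x == 0]
  y1 <- y[x == 1]
  c(mean(y1) - mean(y0),
    var(y0) / length(y0) + var(y1) / length(y1))
}

if (requireNamespace("Hmisc", quietly = TRUE)) {
  set.seed(1)
  est <- Hmisc::estSeqSim(c(0, 0.5), looks = c(20, 50, 100),
                          gendat = gendat, fitter = fitter, nsim = 10)
  print(head(est))
}
```

Here the estimate is an unpooled difference in means with its usual variance; any estimator can be substituted as long as fitter returns the estimate first and the variance second.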
Value
a data frame with number of rows equal to the product of nsim, the length of looks, and the length of parameter.
Author(s)
Frank Harrell
See Also
gbayesSeqSim(), simMarkovOrd(), estSeqMarkovOrd()
Examples
if (requireNamespace("rms", quietly = TRUE)) {
  # Run 100 simulations, 5 looks, 2 true parameter values
  # Total simulation time: 2s
  lfit <- function(x, y) {
    f <- rms::lrm.fit(x, y)
    k <- length(coef(f))
    c(coef(f)[k], vcov(f)[k, k])
  }
  gdat <- function(beta, n1, n2) {
    # Cell probabilities for a 7-category ordinal outcome for the control group
    p <- c(2, 1, 2, 7, 8, 38, 42) / 100
    # Compute cell probabilities for the treated group
    p2 <- pomodm(p=p, odds.ratio=exp(beta))
    y1 <- sample(1 : 7, n1, p, replace=TRUE)
    y2 <- sample(1 : 7, n2, p2, replace=TRUE)
    list(y1=y1, y2=y2)
  }
  set.seed(1)
  est <- estSeqSim(c(0, log(0.7)), looks=c(50, 75, 95, 100, 200),
                   gendat=gdat, fitter=lfit, nsim=100)
  head(est)
}
Flexible Event Chart for Time-to-Event Data
Description
Creates an event chart on the current graphics device. Also allows the user to plot the legend on the plot area or on a separate page. Contains features useful for plotting data with time-to-event outcomes, which arise in a variety of studies including randomized clinical trials and non-randomized cohort studies. This function can use as input a matrix or a data frame, although greater utility and ease of use will be seen with a data frame.
Usage
event.chart(data, subset.r = 1:dim(data)[1], subset.c = 1:dim(data)[2],
  sort.by = NA, sort.ascending = TRUE, sort.na.last = TRUE,
  sort.after.subset = TRUE, y.var = NA, y.var.type = "n",
  y.jitter = FALSE, y.jitter.factor = 1, y.renum = FALSE,
  NA.rm = FALSE, x.reference = NA,
  now = max(data[, subset.c], na.rm = TRUE), now.line = FALSE,
  now.line.lty = 2, now.line.lwd = 1, now.line.col = 1, pty = "m",
  date.orig = c(1, 1, 1960), titl = "Event Chart",
  y.idlabels = NA, y.axis = "auto", y.axis.custom.at = NA,
  y.axis.custom.labels = NA, y.julian = FALSE, y.lim.extend = c(0, 0),
  y.lab = ifelse(is.na(y.idlabels), "", as.character(y.idlabels)),
  x.axis.all = TRUE, x.axis = "auto", x.axis.custom.at = NA,
  x.axis.custom.labels = NA, x.julian = FALSE, x.lim.extend = c(0, 0),
  x.scale = 1, x.lab = ifelse(x.julian, "Follow-up Time", "Study Date"),
  line.by = NA, line.lty = 1, line.lwd = 1, line.col = 1,
  line.add = NA, line.add.lty = NA, line.add.lwd = NA, line.add.col = NA,
  point.pch = 1:length(subset.c),
  point.cex = rep(0.6, length(subset.c)),
  point.col = rep(1, length(subset.c)),
  point.cex.mult = 1., point.cex.mult.var = NA,
  extra.points.no.mult = rep(NA, length(subset.c)),
  legend.plot = FALSE, legend.location = "o", legend.titl = titl,
  legend.titl.cex = 3, legend.titl.line = 1,
  legend.point.at = list(x = c(5, 95), y = c(95, 30)),
  legend.point.pch = point.pch,
  legend.point.text = ifelse(rep(is.data.frame(data), length(subset.c)),
                             names(data[, subset.c]), subset.c),
  legend.cex = 2.5, legend.bty = "n",
  legend.line.at = list(x = c(5, 95), y = c(20, 5)),
  legend.line.text = names(table(as.character(data[, line.by]),
                                 exclude = c("", "NA"))),
  legend.line.lwd = line.lwd, legend.loc.num = 1, ...)
Arguments
data | a matrix or data frame with rows corresponding to subjects and columns corresponding to variables. Note that for a data frame or matrix containing multiple time-to-event data (e.g., time to recurrence, time to death, and time to last follow-up), one column is required for each specific event. |
subset.r | subset of rows of original matrix or data frame to place in event chart. Logical arguments may be used here (e.g., |
subset.c | subset of columns of original matrix or data frame to place in event chart; if working with a data frame, a vector of data frame variable names may be used for subsetting purposes (e.g., |
sort.by | column(s) or data frame variable name(s) with which to sort the chart's output. The default is |
sort.ascending | logical flag (which takes effect only if the argument |
sort.na.last | logical flag (which takes effect only if the argument |
sort.after.subset | logical flag (which takes effect only if the argument sort.by is utilized). If |
y.var | variable name or column number of original matrix or data frame with which to scale y-axis. Default is |
y.var.type | type of variable specified in |
y.jitter | logical flag (which takes effect only if the argument
The default of |
y.jitter.factor | an argument used with the |
y.renum | logical flag. If |
NA.rm | logical flag. If |
x.reference | column of original matrix or data frame with which to reference the x-axis. That is, if specified, all columns specified in |
now | the “now” date which will be used for top of y-axis when creating the Goldman eventchart (see reference below). Default is |
now.line | logical flag. A feature utilized by the Goldman Eventchart. When |
now.line.lty | line type of |
now.line.lwd | line width of |
now.line.col | color of |
pty | graph option, |
date.orig | date of origin to consider if dates are in julian, SAS, or S-Plus dates object format; default is January 1, 1960 (which is the default origin used by both S-Plus and SAS). Utilized when either |
titl | title for event chart. Default is 'Event Chart'. |
y.idlabels | column or data frame variable name used for y-axis labels. For example, if |
y.axis | character string specifying whether program will control labelling of y-axis (with argument |
y.axis.custom.at | user-specified vector of y-axis label locations. Must be used when |
y.axis.custom.labels | user-specified vector of y-axis labels. Must be used when |
y.julian | logical flag (which will only be considered if |
y.lim.extend | two-element vector representing the number of units that the user wants to increase |
y.lab | single label to be used for entire y-axis. Default will be the variable name or column number of |
x.axis.all | logical flag. If |
x.axis | character string specifying whether program will control labelling of x-axis (with argument |
x.axis.custom.at | user-specified vector of x-axis label locations. Must be used when |
x.axis.custom.labels | user-specified vector of x-axis labels. Must be used when |
x.julian | logical flag (which will only be considered if |
x.lim.extend | two-element vector representing the number of time units (usually in days) that the user wants to increase |
x.scale | a factor whose reciprocal is multiplied to original units of the x-axis. For example, if the original data frame is in units of days, |
x.lab | single label to be used for entire x-axis. Default will be “On Study Date” if |
line.by | column or data frame variable name for plotting unique lines by unique values of vector (e.g., specify |
line.lty | vector of line types corresponding to ascending order of |
line.lwd | vector of line widths corresponding to ascending order of |
line.col | vector of line colors corresponding to ascending order of |
line.add | a 2xk matrix with k=number of pairs of additional line segments to add. For example, if it is of interest to draw additional line segments connecting events one and two, two and three, and four and five (possibly with different colors), an appropriate The convention use of If NOTE: The drawing of the original default line may be suppressed (with |
line.add.lty | a kx1 vector corresponding to the columns of |
line.add.lwd | a kx1 vector corresponding to the columns of |
line.add.col | a kx1 vector corresponding to the columns of |
point.pch | vector of |
point.cex | vector of size of points representing each event. If |
point.col | vector of colors of points representing each event. If |
point.cex.mult | a single number (may be non-integer), which is the base multiplier for the value of the |
point.cex.mult.var | vector of variables to be used in determining what point.cex.mult is multiplied by for determining size of plotted points from (possibly a subset of) |
extra.points.no.mult | vector of variables in the dataset to ignore for purposes of using |
legend.plot | logical flag; if |
legend.location | will be used only if |
legend.titl | title for the legend; default is title to be used for main plot. Only used when |
legend.titl.cex | size of text for legend title. Only used when |
legend.titl.line | line location of legend title dictated by |
legend.point.at | location of upper left and lower right corners of legend area to be utilized for describing events via points and text. |
legend.point.pch | vector of |
legend.point.text | text to be used for describing events; the default is setup for a data frame, as it will print the names of the columns specified by |
legend.cex | size of text for points and event descriptions. Default is 2.5 which is setup for |
legend.bty | option to put a box around the legend(s); default is to have no box ( |
legend.line.at | if |
legend.line.text | text to be used for describing |
legend.line.lwd | vector of line widths corresponding to |
legend.loc.num | number used for locator argument when |
... | additional par arguments for use in main plot. |
Details
If you want to put, say, two eventcharts side-by-side in a plot region, you should not set up par(mfrow=c(1,2)) before running the first plot. Instead, you should add the argument mfg=c(1,1,1,2) to the first plot call followed by the argument mfg=c(1,2,1,2) to the second plot call.
If dates in the original data frame are in a specialized form (e.g., mm/dd/yy) of mode CHARACTER, the user must convert those columns to class dates or julian numeric mode (see Date for more information). For example, in a data frame called testdata, with specialized dates in columns 4 through 10, the following code could be used: as.numeric(dates(testdata[,4:10])). This will convert the columns to numeric julian dates based on the function's default origin of January 1, 1960. If original dates are in class dates or julian form, no extra work is necessary.
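Where the dates() function is not available, a similar conversion can be sketched with base R's as.Date. This is an illustrative alternative rather than the approach the authors describe; the vector chardates is a hypothetical stand-in for one character date column.

```r
# Hypothetical character dates in mm/dd/yy form
chardates <- c("01/15/85", "05/01/98")

# Origin matching the date.orig default of January 1, 1960
origin <- as.Date("1960-01-01")

# Caution: with %y, two-digit years 69-99 are read as 19xx and 00-68 as 20xx,
# so data spanning both ranges should be given four-digit years (%Y) instead
julian1960 <- as.numeric(as.Date(chardates, format = "%m/%d/%y") - origin)
julian1960
```

The second value reproduces the correspondence quoted in the Examples (julian 14000 is May 1, 1998).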
In survival analysis, the data typically come in two columns: one column containing survival time and the other containing censoring indicator or event code. The event.convert function converts this type of data into multiple columns of event times, one column for each event type, suitable for the event.chart function.
Side Effects
An event chart is created on the current graphics device. If legend.plot = TRUE and legend.location = 'o', a one-page legend will precede the event chart. Please note that par parameters on completion of function will be reset to par parameters existing prior to start of function.
Author(s)
J. Jack Lee and Kenneth R. Hess
Department of Biostatistics
University of Texas
M.D. Anderson Cancer Center
Houston, TX 77030
jjlee@mdanderson.org, khess@mdanderson.org
Joel A. Dubin
Department of Statistics
University of Waterloo
jdubin@uwaterloo.ca
References
Lee, J.J., Hess, K.R., Dubin, J.A. (2000). Extensions and applications of event charts. The American Statistician, 54:1, 63–70.
Dubin, J.A., Lee, J.J., Hess, K.R. (1997). The Utility of Event Charts. Proceedings of the Biometrics Section, American Statistical Association.
Dubin, J.A., Muller, H.-G., Wang, J.-L. (2001). Event history graphs for censored survival data. Statistics in Medicine, 20: 2951–2964.
Goldman, A.I. (1992). EVENTCHARTS: Visualizing Survival and Other Timed-Events Data. The American Statistician, 46:1, 13–18.
See Also
Examples
# The sample data set is an augmented CDC AIDS dataset (ASCII)
# which is used in the examples in the help file. This dataset is
# described in Kalbfleisch and Lawless (JASA, 1989).
# Here, we have included only children 4 years old and younger.
# We have also added a new field, dethdate, which
# represents a fictitious death date for each patient. There was
# no recording of death date on the original dataset. In addition, we have
# added a fictitious viral load reading (copies/ml) for each patient at time of AIDS diagnosis,
# noting viral load was also not part of the original dataset.
#
# All dates are julian with julian=0 being
# January 1, 1960, and julian=14000 being 14000 days beyond
# January 1, 1960 (i.e., May 1, 1998).
cdcaids <- data.frame(
age=c(4,2,1,1,2,2,2,4,2,1,1,3,2,1,3,2,1,2,4,2,2,1,4,2,4,1,4,2,1,1,3,3,1,3),
infedate=c(7274,7727,7949,8037,7765,8096,8186,7520,8522,8609,8524,8213,8455,8739,8034,8646,8886,8549,8068,8682,8612,9007,8461,8888,8096,9192,9107,9001,9344,9155,8800,8519,9282,8673),
diagdate=c(8100,8158,8251,8343,8463,8489,8554,8644,8713,8733,8854,8855,8863,8983,9035,9037,9132,9164,9186,9221,9224,9252,9274,9404,9405,9433,9434,9470,9470,9472,9489,9500,9585,9649),
diffdate=c(826,431,302,306,698,393,368,1124,191,124,330,642,408,244,1001,391,246,615,1118,539,612,245,813,516,1309,241,327,469,126,317,689,981,303,976),
dethdate=c(8434,8304,NA,8414,8715,NA,8667,9142,8731,8750,8963,9120,9005,9028,9445,9180,9189,9406,9711,9453,9465,9289,9640,9608,10010,9488,9523,9633,9667,9547,9755,NA,9686,10084),
censdate=c(NA,NA,8321,NA,NA,8519,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,10095,NA,NA),
viralload=c(13000,36000,70000,90000,21000,110000,75000,12000,125000,110000,13000,39000,79000,135000,14000,42000,123000,20000,12000,18000,16000,140000,16000,58000,11000,120000,85000,31000,24000,115000,17000,13100,72000,13500))
cdcaids <- upData(cdcaids,
  labels=c(age ='Age, y', infedate='Date of blood transfusion',
           diagdate='Date of AIDS diagnosis',
           diffdate='Incubation period (days from HIV to AIDS)',
           dethdate='Fictitious date of death',
           censdate='Fictitious censoring date',
           viralload='Fictitious viral load'))
# Note that the style options listed with these
# examples are best suited for output to a postscript file (i.e., using
# the postscript function with horizontal=TRUE) as opposed to a graphical
# window (e.g., motif).
# To produce simple calendar event chart (with internal legend):
# postscript('example1.ps', horizontal=TRUE)
 event.chart(cdcaids,
  subset.c=c('infedate','diagdate','dethdate','censdate'),
  x.lab = 'observation dates',
  y.lab='patients (sorted by AIDS diagnosis date)',
  titl='AIDS data calendar event chart 1',
  point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8),
  legend.plot=TRUE, legend.location='i', legend.cex=1.0,
  legend.point.text=c('transfusion','AIDS diagnosis','death','censored'),
  legend.point.at = list(c(7210, 8100), c(35, 27)), legend.bty='o')
# To produce simple interval event chart (with internal legend):
# postscript('example2.ps', horizontal=TRUE)
 event.chart(cdcaids,
  subset.c=c('infedate','diagdate','dethdate','censdate'),
  x.lab = 'time since transfusion (in days)',
  y.lab='patients (sorted by AIDS diagnosis date)',
  titl='AIDS data interval event chart 1',
  point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8),
  legend.plot=TRUE, legend.location='i', legend.cex=1.0,
  legend.point.text=c('transfusion','AIDS diagnosis','death','censored'),
  x.reference='infedate', x.julian=TRUE,
  legend.bty='o', legend.point.at = list(c(1400, 1950), c(7, -1)))
# To produce simple interval event chart (with internal legend),
# but now with flexible diagdate symbol size based on viral load variable:
# postscript('example2a.ps', horizontal=TRUE)
 event.chart(cdcaids,
  subset.c=c('infedate','diagdate','dethdate','censdate'),
  x.lab = 'time since transfusion (in days)',
  y.lab='patients (sorted by AIDS diagnosis date)',
  titl='AIDS data interval event chart 1a, with viral load at diagdate represented',
  point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8),
  point.cex.mult = 0.00002, point.cex.mult.var = 'viralload',
  extra.points.no.mult = c(1,NA,1,1),
  legend.plot=TRUE, legend.location='i', legend.cex=1.0,
  legend.point.text=c('transfusion','AIDS diagnosis','death','censored'),
  x.reference='infedate', x.julian=TRUE,
  legend.bty='o', legend.point.at = list(c(1400, 1950), c(7, -1)))
# To produce more complicated interval chart which is
# referenced by infection date, and sorted by age and incubation period:
# postscript('example3.ps', horizontal=TRUE)
 event.chart(cdcaids,
  subset.c=c('infedate','diagdate','dethdate','censdate'),
  x.lab = 'time since diagnosis of AIDS (in days)',
  y.lab='patients (sorted by age and incubation length)',
  titl='AIDS data interval event chart 2 (sorted by age, incubation)',
  point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8),
  legend.plot=TRUE, legend.location='i',legend.cex=1.0,
  legend.point.text=c('transfusion','AIDS diagnosis','death','censored'),
  x.reference='diagdate', x.julian=TRUE, sort.by=c('age','diffdate'),
  line.by='age', line.lty=c(1,3,2,4), line.lwd=rep(1,4), line.col=rep(1,4),
  legend.bty='o', legend.point.at = list(c(-1350, -800), c(7, -1)),
  legend.line.at = list(c(-1350, -800), c(16, 8)),
  legend.line.text=c('age = 1', ' = 2', ' = 3', ' = 4'))
# To produce the Goldman chart:
# postscript('example4.ps', horizontal=TRUE)
 event.chart(cdcaids,
  subset.c=c('infedate','diagdate','dethdate','censdate'),
  x.lab = 'time since transfusion (in days)', y.lab='dates of observation',
  titl='AIDS data Goldman event chart 1',
  y.var = c('infedate'), y.var.type='d', now.line=TRUE, y.jitter=FALSE,
  point.pch=c(1,2,15,0), point.cex=c(1,1,0.8,0.8), mgp = c(3.1,1.6,0),
  legend.plot=TRUE, legend.location='i',legend.cex=1.0,
  legend.point.text=c('transfusion','AIDS diagnosis','death','censored'),
  x.reference='infedate', x.julian=TRUE,
  legend.bty='o', legend.point.at = list(c(1500, 2800), c(9300, 10000)))
# To convert coded time-to-event data, then, draw an event chart:
surv.time <- c(5,6,3,1,2)
cens.ind <- c(1,0,1,1,0)
surv.data <- cbind(surv.time,cens.ind)
event.data <- event.convert(surv.data)
event.chart(cbind(rep(0,5),event.data),x.julian=TRUE,x.reference=1)
Event Conversion for Time-to-Event Data
Description
Convert a two-column data matrix with event time and event code into a multiple-column matrix of event times with one event in each column
Usage
event.convert(data2, event.time = 1, event.code = 2)
Arguments
data2 | a matrix or data frame with at least 2 columns; by default, the first column contains the event time and the second column contains the k event codes (e.g. 1=dead, 0=censored) |
event.time | the column number in data that contains the event time |
event.code | the column number in data that contains the event code |
Details
In survival analysis, the data typically come in two columns: one column containing survival time and the other containing censoring indicator or event code. The event.convert function converts this type of data into multiple columns of event times, one column for each event type, suitable for the event.chart function.
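The reshaping described above can be pictured with a few lines of base R. This is only a sketch of the idea, not the event.convert implementation; the column names time.code0 and time.code1 are hypothetical, and the names and column order that event.convert actually returns may differ.

```r
surv.time <- c(5, 6, 3, 1, 2)
cens.ind  <- c(1, 0, 1, 1, 0)      # 1 = event, 0 = censored

# One column per distinct event code; a subject's time goes in the column
# for its own code and is NA in every other column
codes <- sort(unique(cens.ind))
wide  <- sapply(codes, function(k) ifelse(cens.ind == k, surv.time, NA))
colnames(wide) <- paste0("time.code", codes)
wide
```

Each row then has exactly one non-missing time, which is the layout event.chart expects when every event type is plotted with its own symbol.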
Author(s)
J. Jack Lee and Kenneth R. Hess
Department of Biostatistics
University of Texas
M.D. Anderson Cancer Center
Houston, TX 77030
jjlee@mdanderson.org, khess@mdanderson.org
Joel A. Dubin
Department of Statistics
University of Waterloo
jdubin@uwaterloo.ca
See Also
event.history, Date, event.chart
Examples
# To convert coded time-to-event data, then, draw an event chart:
surv.time <- c(5,6,3,1,2)
cens.ind <- c(1,0,1,1,0)
surv.data <- cbind(surv.time,cens.ind)
event.data <- event.convert(surv.data)
event.chart(cbind(rep(0,5),event.data),x.julian=TRUE,x.reference=1)
Produces event.history graph for survival data
Description
Produces an event history graph for right-censored survival data, including time-dependent covariate status, as described in Dubin, Muller, and Wang (2001). Effectively, a Kaplan-Meier curve is produced with supplementary information regarding individual survival information, censoring information, and status over time of an individual time-dependent covariate or time-dependent covariate function for both uncensored and censored individuals.
Usage
event.history(data, survtime.col, surv.col, surv.ind = c(1, 0),
  subset.rows = NULL, covtime.cols = NULL, cov.cols = NULL,
  num.colors = 1, cut.cov = NULL, colors = 1, cens.density = 10,
  mult.end.cens = 1.05, cens.mark.right = FALSE, cens.mark = "-",
  cens.mark.ahead = 0.5, cens.mark.cutoff = -1e-08, cens.mark.cex = 1,
  x.lab = "time under observation",
  y.lab = "estimated survival probability",
  title = "event history graph", ...)
Arguments
data | A matrix or data frame with rows corresponding to units (often individuals) and columns corresponding to survival time, event/censoring indicator. Also, multiple columns may be devoted to time-dependent covariate level and time change. |
survtime.col | Column (in data) representing minimum of time-to-event or right-censoring time for individual. |
surv.col | Column (in data) representing event indicator for an individual. Though, traditionally, such an indicator will be 1 for an event and 0 for a censored observation, this indicator can be represented by any two numbers, made explicit by the surv.ind argument. |
surv.ind | Two-element vector representing, respectively, the number for an event, as listed in |
subset.rows | Subset of rows of original matrix or data frame (data) to place in event history graph. Logical arguments may be used here (e.g., |
covtime.cols | Column(s) (in data) representing the time when change of time-dependent covariate (or time-dependent covariate function) occurs. There should be a unique non- |
cov.cols | Column(s) (in data) representing the level of the time-dependent covariate (or time-dependent covariate function). There should be a unique non- |
num.colors | Colors are utilized for the time-dependent covariate level for an individual. This argument provides the number of unique covariate levels which will be displayed by mapping the number of colors (via |
cut.cov | This argument allows the user to explicitly state how to define the intervals for the time-dependent covariate, such that different colors will be allocated to the user-defined covariate levels. For example, for plotting five colors, six ordered points within the span of the data's covariate levels should be provided. Default is |
colors | This is a vector argument defining the actual colors used for the time-dependent covariate levels in the plot, with the index of this vector corresponding to the ordered levels of the covariate. The number of colors (i.e., the length of the colors vector) should correspond to the value provided to the |
cens.density | This will provide the shading density at the end of the individual bars for those who are censored. For more information on shading density, see the density argument in the S-Plus polygon function. Default is |
mult.end.cens | This is a multiplier that extends the length of the longest surviving individual bar (or bars, if a tie exists) if right-censored, presuming that no event times eventually follow this final censored time. Default extends the length 5 percent beyond the length of the observed right-censored survival time. |
cens.mark.right | A logical argument that states whether an explicit mark should be placed to the right of the individual right-censored survival bars. This argument is most useful for large sample sizes, where it may be hard to detect the special shading via cens.density, particularly for the short-term survivors. |
cens.mark | Character argument which describes the censored mark that should be used if |
cens.mark.ahead | A numeric argument, which specifies the absolute distance to be placed between the individual right-censored survival bars and the mark as defined in the above cens.mark argument. Default is 0.5 (that is, half a day, if survival time is measured in days), but may very well need adjusting depending on the maximum survival time observed in the dataset. |
cens.mark.cutoff | A negative number very close to 0 (by default |
cens.mark.cex | Numeric argument defining the size of the mark defined in the |
x.lab | Single label to be used for entire x-axis. Default is |
y.lab | Single label to be used for entire y-axis. Default is |
title | Title for the event history graph. Default is |
... | This allows arguments to the plot function call within the |
Details
In order to focus on a particular area of the event history graph, zooming can be performed. This is best done by specifying appropriate xlim and ylim arguments at the end of the event.history function call, taking advantage of the ... argument link to the plot function. An example of zooming can be seen in Plate 4 of the paper referenced below.
Please read the reference below to understand how the individual covariate and survival information is provided in the plot, how ties are handled, how right-censoring is handled, etc.
WARNING
This function has been tested thoroughly, but only within a restricted version and environment, i.e., only within S-Plus 2000, Version 3, and within S-Plus 6.0, version 2, both on a Windows 2000 machine. Hence, we cannot currently vouch for the function's effectiveness in other versions of S-Plus (e.g., S-Plus 3.4) nor in other operating environments (e.g., Windows 95, Linux or Unix). The function has also been verified to work on R under Linux.
Note
The authors have found better control of the use of color by producing the graphs via the postscript plotting device in S-Plus. In fact, the provided examples utilize the postscript function. However, your past experiences may be different, and you may prefer to control color directly (to the graphsheet in Windows environment, for example). The event.history function will work with either approach.
Author(s)
Joel Dubin
jdubin@uwaterloo.ca
References
Dubin, J.A., Muller, H.-G., and Wang, J.-L. (2001). Event history graphs for censored survival data. Statistics in Medicine, 20, 2951-2964.
See Also
Examples
# Code to produce event history graphs for SIM paper
#
# before generating plots, some pre-processing needs to be performed,
# in order to get dataset in proper form for event.history function;
# need to create one line per subject and sort by time under observation,
# with those experiencing event coming before those tied with censoring time;
require('survival')
data(heart)
# creation of event.history version of heart dataset (call heart.one):
heart.one <- matrix(nrow=length(unique(heart$id)), ncol=8)
for(i in 1:length(unique(heart$id))) {
  if(length(heart$id[heart$id==i]) == 1)
    heart.one[i,] <- as.numeric(unlist(heart[heart$id==i, ]))
  else if(length(heart$id[heart$id==i]) == 2)
    heart.one[i,] <- as.numeric(unlist(heart[heart$id==i,][2,]))
}
heart.one[,3][heart.one[,3] == 0] <- 2 ## converting censored events to 2, from 0
if(is.factor(heart$transplant))
  heart.one[,7] <- heart.one[,7] - 1
## getting back to correct transplantation coding
heart.one <- as.data.frame(heart.one[order(unlist(heart.one[,2]), unlist(heart.one[,3])),])
names(heart.one) <- names(heart)
# back to usual censoring indicator:
heart.one[,3][heart.one[,3] == 2] <- 0
# note: transplant says 0 (for no transplants) or 1 (for one transplant)
# and event = 1 is death, while event = 0 is censored
# plot single Kaplan-Meier curve from heart data, first creating survival object
heart.surv <- survfit(Surv(stop, event) ~ 1, data=heart.one, conf.int = FALSE)
# figure 3: traditional Kaplan-Meier curve
# postscript('ehgfig3.ps', horiz=TRUE)
# omi <- par(omi=c(0,1.25,0.5,1.25))
 plot(heart.surv, ylab='estimated survival probability',
      xlab='observation time (in days)')
 title('Figure 3: Kaplan-Meier curve for Stanford data', cex=0.8)
# dev.off()
## now, draw event history graph for Stanford heart data; use as Figure 4
# postscript('ehgfig4.ps', horiz=TRUE, colors = seq(0, 1, len=20))
# par(omi=c(0,1.25,0.5,1.25))
 event.history(heart.one,
  survtime.col=heart.one[,2], surv.col=heart.one[,3],
  covtime.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,1]),
  cov.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,7]),
  num.colors=2, colors=c(6,10),
  x.lab = 'time under observation (in days)',
  title='Figure 4: Event history graph for\nStanford data',
  cens.mark.right =TRUE, cens.mark = '-',
  cens.mark.ahead = 30.0, cens.mark.cex = 0.85)
# dev.off()
# now, draw age-stratified event history graph for Stanford heart data;
# use as Figure 5
# two plots, stratified by age status
# postscript('c:\temp\ehgfig5.ps', horiz=TRUE, colors = seq(0, 1, len=20))
# par(omi=c(0,1.25,0.5,1.25))
 par(mfrow=c(1,2))
 event.history(data=heart.one, subset.rows = (heart.one[,4] < 0),
  survtime.col=heart.one[,2], surv.col=heart.one[,3],
  covtime.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,1]),
  cov.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,7]),
  num.colors=2, colors=c(6,10),
  x.lab = 'time under observation\n(in days)',
  title = 'Figure 5a:\nStanford data\n(age < 48)',
  cens.mark.right =TRUE, cens.mark = '-',
  cens.mark.ahead = 40.0, cens.mark.cex = 0.85,
  xlim=c(0,1900))
 event.history(data=heart.one, subset.rows = (heart.one[,4] >= 0),
  survtime.col=heart.one[,2], surv.col=heart.one[,3],
  covtime.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,1]),
  cov.cols = cbind(rep(0, dim(heart.one)[1]), heart.one[,7]),
  num.colors=2, colors=c(6,10),
  x.lab = 'time under observation\n(in days)',
  title = 'Figure 5b:\nStanford data\n(age >= 48)',
  cens.mark.right =TRUE, cens.mark = '-',
  cens.mark.ahead = 40.0, cens.mark.cex = 0.85,
  xlim=c(0,1900))
# dev.off()
# par(omi=omi)
# we will not show liver cirrhosis data manipulation, as it was
# a bit detailed; however, here is the
# event.history code to produce Figure 7 / Plate 1
# Figure 7 / Plate 1 : prothrombin ehg with color
## Not run:
second.arg <- 1   ### second.arg is for shading
third.arg <- c(rep(1,18),0,1)   ### third.arg is for intensity
# postscript('c:\temp\ehgfig7.ps', horiz=TRUE,
# colors = cbind(seq(0, 1, len = 20), second.arg, third.arg))
# par(omi=c(0,1.25,0.5,1.25), col=19)
 event.history(cirrhos2.eh, subset.rows = NULL,
  survtime.col=cirrhos2.eh$time, surv.col=cirrhos2.eh$event,
  covtime.cols = as.matrix(cirrhos2.eh[, ((2:18)*2)]),
  cov.cols = as.matrix(cirrhos2.eh[, ((2:18)*2) + 1]),
  cut.cov = as.numeric(quantile(as.matrix(cirrhos2.eh[, ((2:18)*2) + 1]),
            c(0,.2,.4,.6,.8,1), na.rm=TRUE) + c(-1,0,0,0,0,1)),
  colors=c(20,4,8,11,14),
  x.lab = 'time under observation (in days)',
  title='Figure 7: Event history graph for liver cirrhosis data (color)',
  cens.mark.right =TRUE, cens.mark = '-',
  cens.mark.ahead = 100.0, cens.mark.cex = 0.85)
# dev.off()
## End(Not run)
extractlabs
Description
Extract Labels and Units From Multiple Datasets
Usage
extractlabs(..., print = TRUE)
Arguments
... | one or more data frames or data tables |
print | set to |
Details
For one or more data frames/tables extracts all labels and units and combines them over dataset, dropping any variables not having either labels or units defined. The resulting data table is returned and is used by the hlab function if the user stores the result in an object named LabelsUnits. The result is NULL if no variable in any dataset has a non-blank label or units. Variables found in more than one dataset with duplicate label and units are consolidated. A warning message is issued when duplicate variables have conflicting labels or units, and by default, details are printed. No attempt is made to resolve these conflicts.
Value
a data table
Author(s)
Frank Harrell
See Also
label(),contents(),units(),hlab()
Examples
d <- data.frame(x=1:10, y=(1:10)/10)
d <- upData(d, labels=c(x='X', y='Y'), units=c(x='mmHg'), print=FALSE)
d2 <- d
units(d2$x) <- 'cm'
LabelsUnits <- extractlabs(d, d2)
LabelsUnits
fImport
Description
General File Import Using rio
Usage
fImport(file, format, lowernames = c("not mixed", "no", "yes"), und. = FALSE, ...)
Arguments
file | name of file to import, or full URL. |
format | format of file to import, usually not needed. See |
lowernames | defaults to changing variable names to all lower case unless the name has mixed upper and lower case, in which case the original characters in the name are kept. Set |
und. | set to |
... | more arguments to pass to |
Details
This is a front-end for the rio package's import function. fImport includes options for setting variable names to lower case and for changing underscores in names to periods. Variables on the imported data frame that have labels are converted to the Hmisc package labelled class so that subsetting the data frame will preserve the labels.
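The naming rules described above can be sketched in base R. This is only an illustration: `fix_names` is a hypothetical helper, not the code inside `fImport`, and it assumes that "mixed" means a name containing both upper- and lower-case letters.

```r
# Hypothetical helper illustrating the naming options; not fImport's own code
fix_names <- function(nm, lowernames = c("not mixed", "no", "yes"),
                      und. = FALSE) {
  lowernames <- match.arg(lowernames)
  if (lowernames == "yes") {
    nm <- tolower(nm)
  } else if (lowernames == "not mixed") {
    # Assumption: a name is "mixed" if it has both upper- and lower-case letters
    mixed <- grepl("[A-Z]", nm) & grepl("[a-z]", nm)
    nm[!mixed] <- tolower(nm[!mixed])
  }
  # Optionally change underscores to periods
  if (und.) nm <- gsub("_", ".", nm, fixed = TRUE)
  nm
}

new_names <- fix_names(c("AGE", "bmi", "SysBP", "MAX_HR"), und. = TRUE)
new_names
```

Under these assumptions only the all-upper-case names are lower-cased, while a mixed-case name such as SysBP is left alone.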
Value
a data frame created by rio, unless a rio option is given to use another format
Author(s)
Frank Harrell
See Also
upData, especially the moveUnits option
Examples
## Not run: 
# Get a Stata dataset
d <- fImport('http://www.principlesofeconometrics.com/stata/alcohol.dta')
contents(d)
## End(Not run)
Find Close Matches
Description
Compares each row in x against all the rows in y, finding rows in y with all columns within a tolerance of the values of a given row of x. The default tolerance tol is zero, i.e., an exact match is required on all columns. For qualifying matches, a distance measure is computed. This is the sum of squares of differences between x and y after scaling the columns. The default scaling values are tol, and for columns with tol=1 the scale values are set to 1.0 (since they are ignored anyway). Matches (up to maxmatch of them) are stored and listed in order of increasing distance.
The summary method prints a frequency distribution of the number of matches per observation in x, the median of the minimum distances for all matches per x, as a function of the number of matches, and the frequency of selection of duplicate observations as those having the smallest distance. The print method prints the entire matches and distance components of the result from find.matches.
matchCases finds all controls that match cases on a single variable x within a tolerance of tol. This is intended for prospective cohort studies that use matching for confounder adjustment (even though regression models usually work better).
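The matching rule for a single row of x can be sketched in base R as follows. `match_one` is a hypothetical helper written for illustration, not the implementation of find.matches; in particular, replacing zero scale values by 1 is an assumption made here to avoid division by zero.

```r
# Hypothetical helper: match one row of x against all rows of y
match_one <- function(xrow, y, tol = rep(0, ncol(y)), scale = tol,
                      maxmatch = 10) {
  scale[scale == 0] <- 1                      # assumption: avoid dividing by zero
  diffs  <- sweep(y, 2, xrow)                 # y[i, ] - xrow for every row i
  within <- sweep(abs(diffs), 2, tol, "<=")   # column-wise tolerance check
  ok     <- which(apply(within, 1, all))      # rows of y matching on all columns
  # Distance: sum of squared scaled differences over the columns
  d <- rowSums(sweep(diffs[ok, , drop = FALSE], 2, scale, "/")^2)
  o <- order(d)[seq_len(min(maxmatch, length(ok)))]
  list(matches = ok[o], distance = d[o])
}

y <- rbind(c(.1, .2), c(.11, .22), c(.3, .4), c(.31, .41), c(.32, 5))
res <- match_one(c(.09, .21), y, tol = c(.05, .05), maxmatch = 5)
res$matches    # row numbers of y, closest first
res$distance
```

Here only the first two rows of y fall within 0.05 of the target on both columns, and they are returned in order of increasing distance.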
Usage
find.matches(x, y, tol=rep(0, ncol(y)), scale=tol, maxmatch=10)
## S3 method for class 'find.matches'
summary(object, ...)
## S3 method for class 'find.matches'
print(x, digits, ...)
matchCases(xcase, ycase, idcase=names(ycase),
           xcontrol, ycontrol, idcontrol=names(ycontrol),
           tol=NULL,
           maxobs=max(length(ycase),length(ycontrol))*10,
           maxmatch=20, which=c('closest','random'))
Arguments
x | a numeric matrix or the result of |
y | a numeric matrix with same number of columns as |
xcase | numeric vector to match on for cases |
xcontrol | numeric vector to match on for controls, not necessarily the same length as |
ycase | a vector or matrix |
ycontrol |
|
tol | a vector of tolerances with number of elements the same as the number of columns of |
scale | a vector of scaling constants with number of elements the same as the number of columns of |
maxmatch | maximum number of matches to allow. For |
object | an object created by |
digits | number of digits to use in printing distances |
idcase | vector the same length as |
idcontrol |
|
maxobs | maximum number of cases and all matching controls combined (maximum dimension of data frame resulting from |
which | set to |
... | unused |
Value
find.matches returns a list of class find.matches with elements matches and distance. Both elements are matrices with the number of rows equal to the number of rows in x, and with k columns, where k is the maximum number of matches (<= maxmatch) that occurred. The elements of matches are row identifiers of y that match, with zeros if fewer than maxmatch matches are found (blanks if y had row names).
matchCases returns a data frame with variables idcase (id of case currently being matched), type (factor variable with levels "case" and "control"), id (id of case if case row, or id of matching case), and y.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
References
Ming K, Rosenbaum PR (2001): A note on optimal matching with variable controls using the assignment algorithm. J Comp Graph Stat 10:455–463.
Cepeda MS, Boston R, Farrar JT, Strom BL (2003): Optimal matching with a variable number of controls vs. a fixed number of controls for a cohort study: trade-offs. J Clin Epidemiology 56:230–237.
Note: These papers were not used for the functions here but probably should have been.
See Also
Examples
y <- rbind(c(.1, .2),c(.11, .22), c(.3, .4), c(.31, .41), c(.32, 5))
x <- rbind(c(.09,.21), c(.29,.39))
y
x
w <- find.matches(x, y, maxmatch=5, tol=c(.05,.05))

set.seed(111)       # so can replicate results
x <- matrix(runif(500), ncol=2)
y <- matrix(runif(2000), ncol=2)
w <- find.matches(x, y, maxmatch=5, tol=c(.02,.03))
w$matches[1:5,]
w$distance[1:5,]
# Find first x with 3 or more y-matches
num.match <- apply(w$matches, 1, function(x)sum(x > 0))
j <- ((1:length(num.match))[num.match > 2])[1]
x[j,]
y[w$matches[j,],]

summary(w)

# For many applications would do something like this:
# attach(df1)
# x <- cbind(age, sex)  # Just do as.matrix(df1) if df1 has no factor objects
# attach(df2)
# y <- cbind(age, sex)
# mat <- find.matches(x, y, tol=c(5,0)) # exact match on sex, 5y on age

# Demonstrate matchCases
xcase     <- c(1,3,5,12)
xcontrol  <- 1:6
idcase    <- c('A','B','C','D')
idcontrol <- c('a','b','c','d','e','f')
ycase     <- c(11,33,55,122)
ycontrol  <- c(11,22,33,44,55,66)
matchCases(xcase, ycase, idcase,
           xcontrol, ycontrol, idcontrol, tol=1)

# If y is a binary response variable, the following code
# will produce a Mantel-Haenszel summary odds ratio that
# utilizes the matching.
# Standard variance formula will not work here because
# a control will match more than one case
# WARNING: The M-H procedure exemplified here is suspect
# because of the small strata and widely varying number
# of controls per case.
x    <- c(1, 2, 3, 3, 3, 6, 7, 12, 1, 1:7)
y    <- c(0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1)
case <- c(rep(TRUE, 8), rep(FALSE, 8))
id   <- 1:length(x)
m <- matchCases(x[case], y[case], id[case],
                x[!case], y[!case], id[!case], tol=1)
iscase <- m$type=='case'
# Note: the first tapply on insures that event indicators are
# sorted by case id.  The second actually does something.
event.case    <- tapply(m$y[iscase],  m$idcase[iscase],  sum)
event.control <- tapply(m$y[!iscase], m$idcase[!iscase], sum)
n.control     <- tapply(!iscase, m$idcase, sum)
n             <- tapply(m$y, m$idcase, length)
or <- sum(event.case * (n.control - event.control) / n) /
      sum(event.control * (1 - event.case) / n)
or

# Bootstrap this estimator by sampling with replacement from
# subjects.  Assumes id is unique when combine cases+controls
# (id was constructed this way above).  The following algorithms
# puts all sampled controls back with the cases to whom they were
# originally matched.
ids <- unique(m$id)
idgroups <- split(1:nrow(m), m$id)
B   <- 50   # in practice use many more
ors <- numeric(B)
# Function to order w by ids, leaving unassigned elements zero
align <- function(ids, w) {
  z <- structure(rep(0, length(ids)), names=ids)
  z[names(w)] <- w
  z
}
for(i in 1:B) {
  j   <- sample(ids, replace=TRUE)
  obs <- unlist(idgroups[j])
  u   <- m[obs,]
  iscase <- u$type=='case'
  n.case <- align(ids, tapply(u$type, u$idcase,
                              function(v)sum(v=='case')))
  n.control <- align(ids, tapply(u$type, u$idcase,
                                 function(v)sum(v=='control')))
  event.case <- align(ids, tapply(u$y[iscase], u$idcase[iscase], sum))
  event.control <- align(ids, tapply(u$y[!iscase], u$idcase[!iscase], sum))
  n <- n.case + n.control
  # Remove sets having 0 cases or 0 controls in resample
  s <- n.case > 0 & n.control > 0
  denom <- sum(event.control[s] * (n.case[s] - event.case[s]) / n[s])
  or <- if(denom==0) NA else
    sum(event.case[s] * (n.control[s] - event.control[s]) / n[s]) / denom
  ors[i] <- or
}
describe(ors)
First Word in a String or Expression
Description
first.word finds the first word in an expression. A word is defined by unlisting the elements of the expression found by the S parser and then accepting any elements whose first character is either a letter or period. The principal intended use is for the automatic generation of temporary file names where it is important to exclude special characters from the file name. For Microsoft Windows, periods in names are deleted and only up to the first 8 characters of the word are returned.
Usage
first.word(x, i=1, expr=substitute(x))
Arguments
x | any scalar character string |
i | word number, default value = 1. Used when the second or |
expr | any S object of mode |
Value
a character string
Author(s)
Frank E. Harrell, Jr.,
Department of Biostatistics,
Vanderbilt University,
fh@fharrell.com
Richard M. Heiberger,
Department of Statistics,
Temple University, Philadelphia, PA.
rmh@temple.edu
Examples
first.word(expr=expression(y ~ x + log(w)))
Format a Data Frame or Matrix for LaTeX or HTML
Description
format.df does appropriate rounding and decimal alignment, and outputs a character matrix containing the formatted data. If x is a data.frame, then do each component separately. If x is a matrix, but not a data.frame, make it a data.frame with individual components for the columns. If a component x$x is a matrix, then do all columns the same.
Usage
format.df(x, digits, dec=NULL, rdec=NULL, cdec=NULL,
          numeric.dollar=!dcolumn, na.blank=FALSE, na.dot=FALSE,
          blank.dot=FALSE, col.just=NULL, cdot=FALSE,
          dcolumn=FALSE, matrix.sep=' ', scientific=c(-4,4),
          math.row.names=FALSE, already.math.row.names=FALSE,
          math.col.names=FALSE, already.math.col.names=FALSE,
          double.slash=FALSE, format.Date="%m/%d/%Y",
          format.POSIXt="%m/%d/%Y %H:%M:%OS", ...)
Arguments
x | a matrix (usually numeric) or data frame |
digits | causes all values in the table to be formatted to |
dec | If |
rdec | a vector specifying the number of decimal places to the right for each row ( |
cdec | a vector specifying the number of decimal places for each column. The vector must have number of items equal to number of columns or components of input x. |
cdot | Set to |
na.blank | Set to |
dcolumn | Set to |
numeric.dollar | logical, default |
math.row.names | logical, set true to place dollar signs around the row names. |
already.math.row.names | set to |
math.col.names | logical, set true to place dollar signs around the column names. |
already.math.col.names | set to |
na.dot | Set to |
blank.dot | Set to |
col.just | Input vector |
matrix.sep | When |
scientific | specifies ranges of exponents (or a logical vector) specifying values not to convert to scientific notation. See |
double.slash | should escaping backslashes be themselves escaped. |
format.Date | String used to format objects of the Date class. |
format.POSIXt | String used to format objects of the POSIXt class. |
... | other arguments are accepted and passed to |
Value
a character matrix with character images of properly rounded x. Matrix components of input x are now just sets of columns of character matrix. Object attribute "col.just" repeats the value of the argument col.just when provided, otherwise, it includes the recommended justification for columns of output. See the discussion of the argument col.just. The default justification is ‘l’ for characters and factors, ‘r’ for numeric. When dcolumn==TRUE, numerics will have ‘.’ as the justification character.
Author(s)
Frank E. Harrell, Jr.,
Department of Biostatistics,
Vanderbilt University,
fh@fharrell.com
Richard M. Heiberger,
Department of Statistics,
Temple University, Philadelphia, PA.
rmh@temple.edu
See Also
Examples
## Not run: 
x <- data.frame(a=1:2, b=3:4)
x$m <- 10000*matrix(5:8,nrow=2)
names(x)
dim(x)
x
format.df(x, big.mark=",")
dim(format.df(x))
## End(Not run)
Format P Values
Description
format.pval is intended for formatting p-values.
Usage
format.pval(x, pv=x, digits = max(1, .Options$digits - 2),
            eps = .Machine$double.eps, na.form = "NA", ...)
Arguments
pv | a numeric vector. |
x | argument for method compliance. |
digits | how many significant digits are to be used. |
eps | a numerical tolerance: see Details. |
na.form | character representation of |
... | arguments passed to |
Details
format.pval is mainly an auxiliary function for print.summary.lm etc., and does separate formatting for fixed, floating point and very small values; those less than eps are formatted as "< [eps]" (where "[eps]" stands for format(eps, digits)).
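The effect of eps can be seen with a small sketch. It calls the base R function explicitly (base::format.pval), on the assumption that the Hmisc version treats eps the same way; the p-values are made up for illustration.

```r
# Illustrative p-values; the last one is below the chosen eps
p <- c(0.04, 0.0004, 1e-10)
out <- base::format.pval(p, digits = 2, eps = 1e-4)
out   # the last element is reported as "less than eps"
```

Choosing a larger eps, as here, is a common way to avoid printing implausibly precise tiny p-values.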
Value
A character vector.
Note
This is the base format.pval function with the ability to pass the nsmall argument to format.
Examples
format.pval(c(runif(5), pi^-100, NA))
format.pval(c(0.1, 0.0001, 1e-27))
format.pval(c(0.1, 1e-27), nsmall=3)
Gaussian Bayesian Posterior and Predictive Distributions
Description
gbayes derives the (Gaussian) posterior and optionally the predictive distribution when both the prior and the likelihood are Gaussian, and when the statistic of interest comes from a 2-sample problem. This function is especially useful in obtaining the expected power of a statistical test, averaging over the distribution of the population effect parameter (e.g., log hazard ratio) that is obtained using pilot data. gbayes is also useful for summarizing studies for which the statistic of interest is approximately Gaussian with known variance. An example is given for comparing two proportions using the angular transformation, for which the variance is independent of unknown parameters except for very extreme probabilities. A plot method is also given. This plots the prior, posterior, and predictive distributions on a single graph using a nice default for the x-axis limits and using the labcurve function for automatic labeling of the curves.
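The Gaussian prior with Gaussian likelihood case has a simple closed form, sketched below using the standard normal-normal conjugate update. `post_gauss` is a hypothetical helper, not the gbayes source; it takes the variance of the statistic as a number, and the input values are made up.

```r
# Hypothetical helper: standard normal-normal conjugate update
post_gauss <- function(mean.prior, var.prior, stat, var.stat) {
  # Precisions (inverse variances) add
  var.post  <- 1 / (1 / var.prior + 1 / var.stat)
  # Posterior mean is the precision-weighted average of prior mean and statistic
  mean.post <- var.post * (mean.prior / var.prior + stat / var.stat)
  c(mean.post = mean.post, var.post = var.post)
}

# Illustrative values: almost flat prior, so the posterior follows the data
p <- post_gauss(mean.prior = 0, var.prior = 1000, stat = -0.2, var.stat = 0.004)
p
```

With a nearly flat prior the posterior mean is essentially the observed statistic and the posterior variance is essentially the variance of the statistic.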
gbayes2 uses the method of Spiegelhalter and Freedman (1986) to compute the probability of correctly concluding that a new treatment is superior to a control. By this we mean that a 1-alpha normal theory-based confidence interval for the new minus old treatment effect lies wholly to the right of delta.w, where delta.w is the minimally worthwhile treatment effect (which can be zero to be consistent with ordinary null hypothesis testing, a method not always making sense). This kind of power function is averaged over a prior distribution for the unknown treatment effect. This procedure is applicable to the situation where a prior distribution is not to be used in constructing the test statistic or confidence interval, but is only used for specifying the distribution of delta, the parameter of interest.
Even though gbayes2 assumes that the test statistic has a normal distribution with known variance (which is strongly a function of the sample size in the two treatment groups), the prior distribution function can be completely general. Instead of using a step-function for the prior distribution as Spiegelhalter and Freedman used in their appendix, gbayes2 uses the built-in integrate function for numerical integration. gbayes2 also allows the variance of the test statistic to be general as long as it is evaluated by the user. The conditional power given the parameter of interest delta is 1 - pnorm((delta.w - delta)/sd + z), where z is the normal critical value corresponding to 1 - alpha/2.
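The conditional power formula and its average over a prior can be sketched directly. `cond_power` and `avg_power` are hypothetical names and this is not the gbayes2 source; the prior and standard deviation are arbitrary illustrative choices. For a Gaussian prior the average has a closed form, which gives a check on the numerical integration.

```r
# Conditional power given delta, as stated in the text
cond_power <- function(delta, sd, delta.w = 0, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  1 - pnorm((delta.w - delta) / sd + z)
}

# Average the conditional power over a prior density (hypothetical helper)
avg_power <- function(sd, prior, delta.w = 0, alpha = 0.05) {
  integrate(function(d) prior(d) * cond_power(d, sd, delta.w, alpha),
            lower = -Inf, upper = Inf)$value
}

# Illustrative Gaussian prior with mean 0.5 and sd 1
prior <- function(d) dnorm(d, mean = 0.5, sd = 1)
ap <- avg_power(sd = 0.2, prior = prior)

# Closed form for a N(m, s^2) prior: delta + sd*Z is N(m, s^2 + sd^2)
closed <- 1 - pnorm((qnorm(0.975) * 0.2 - 0.5) / sqrt(1 + 0.2^2))
c(numerical = ap, closed.form = closed)
```

When delta equals delta.w the conditional power reduces to alpha/2, which is the one-sided type I error of the two-sided interval.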
gbayesMixPredNoData derives the predictive distribution of a statistic that is Gaussian given delta when no data have yet been observed and when the prior is a mixture of two Gaussians.
gbayesMixPost derives the posterior density, cdf, or posterior mean of delta given the statistic x, when the prior for delta is a mixture of two Gaussians and when x is Gaussian given delta.
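For a two-component Gaussian mixture prior the posterior is again a two-component Gaussian mixture, with each component updated conjugately and the weights re-estimated from how well each component predicts x. The sketch below shows that standard result; `mix_post_density` is a hypothetical helper rather than the gbayesMixPost source, and it takes mix as the prior weight of the component with mean d0.

```r
# Hypothetical helper: posterior density under a two-Gaussian mixture prior
mix_post_density <- function(x, v, mix, d0, v0, d1, v1) {
  # Marginal (predictive) density of x under each prior component
  m0 <- dnorm(x, d0, sqrt(v0 + v))
  m1 <- dnorm(x, d1, sqrt(v1 + v))
  # Posterior weight of the first component
  w0 <- mix * m0 / (mix * m0 + (1 - mix) * m1)
  # Conjugate normal update within each component
  pv0 <- 1 / (1 / v0 + 1 / v); pm0 <- pv0 * (d0 / v0 + x / v)
  pv1 <- 1 / (1 / v1 + 1 / v); pm1 <- pv1 * (d1 / v1 + x / v)
  function(delta)
    w0 * dnorm(delta, pm0, sqrt(pv0)) + (1 - w0) * dnorm(delta, pm1, sqrt(pv1))
}

# Illustrative inputs: vague prior mixed 50:50 with an informative one
f <- mix_post_density(x = 3, v = 0.4, mix = 0.5,
                      d0 = 0, v0 = 10000, d1 = 2, v1 = 0.3)
delta <- seq(-2, 6, length = 150)
dens  <- f(delta)
```

The vague component receives little posterior weight here because it assigns very low predictive density to any particular value of x.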
gbayesMixPowerNP computes the power for a test for delta > delta.w for the case where (1) a Gaussian prior or mixture of two Gaussian priors is used as the prior distribution, (2) this prior is used in forming the statistical test or credible interval, (3) no prior is used for the distribution of delta for computing power but instead a fixed single delta is given (as in traditional frequentist hypothesis tests), and (4) the test statistic has a Gaussian likelihood with known variance (and mean equal to the specified delta). gbayesMixPowerNP is handy where you want to use an earlier study in testing for treatment effects in a new study, but you want to mix with this prior a non-informative prior. The mixing probability mix can be thought of as the "applicability" of the previous study. As with gbayes2, power here means the probability that the new study will yield a left credible interval that is to the right of delta.w. gbayes1PowerNP is a special case of gbayesMixPowerNP when the prior is a single Gaussian.
Usage
gbayes(mean.prior, var.prior, m1, m2, stat, var.stat,
       n1, n2, cut.prior, cut.prob.prior=0.025)
## S3 method for class 'gbayes'
plot(x, xlim, ylim, name.stat='z', ...)
gbayes2(sd, prior, delta.w=0, alpha=0.05, upper=Inf, prior.aux)
gbayesMixPredNoData(mix=NA, d0=NA, v0=NA, d1=NA, v1=NA,
                    what=c('density','cdf'))
gbayesMixPost(x=NA, v=NA, mix=1, d0=NA, v0=NA, d1=NA, v1=NA,
              what=c('density','cdf','postmean'))
gbayesMixPowerNP(pcdf, delta, v, delta.w=0, mix, interval,
                 nsim=0, alpha=0.05)
gbayes1PowerNP(d0, v0, delta, v, delta.w=0, alpha=0.05)
Arguments
mean.prior | mean of the prior distribution |
cut.prior,cut.prob.prior,var.prior | variance of the prior. Use a large number such as 10000 to effectively use a flat (noninformative) prior. Sometimes it is useful to compute the variance so that the prior probability that |
m1 | sample size in group 1 |
m2 | sample size in group 2 |
stat | statistic comparing groups 1 and 2, e.g., log hazard ratio, difference in means, difference in angular transformations of proportions |
var.stat | variance of |
x | an object returned by |
sd | the standard deviation of the treatment effect |
prior | a function of possibly a vector of unknown treatment effects, returning the prior density at those values |
pcdf | a function computing the posterior CDF of the treatment effect |
delta | a true unknown single treatment effect to detect |
v | the variance of the statistic |
n1 | number of future observations in group 1, for obtaining a predictive distribution |
n2 | number of future observations in group 2 |
xlim | vector of 2 x-axis limits. Default is the mean of the posterior plus or minus 6 standard deviations of the posterior. |
ylim | vector of 2 y-axis limits. Default is the range over combined prior and posterior densities. |
name.stat | label for x-axis. Default is |
... | optional arguments passed to |
delta.w | the minimum worthwhile treatment difference to detect. The default is zero for a plain uninteresting null hypothesis. |
alpha | type I error, or more accurately one minus the confidence level for a two-sided confidence limit for the treatment effect |
upper | upper limit of integration over the prior distribution multiplied by the normal likelihood for the treatment effect statistic. Default is infinity. |
prior.aux | argument to pass to |
mix | mixing probability or weight for the Gaussian prior having mean |
d0 | mean of the first Gaussian distribution (only Gaussian for |
v0 | variance of the first Gaussian (only Gaussian for |
d1 | mean of the second Gaussian (if |
v1 | variance of the second Gaussian (if |
what | specifies whether the predictive density or the CDF is to be computed. Default is |
interval | a 2-vector containing the lower and upper limit for possible values of the test statistic |
nsim | defaults to zero, causing |
Value
gbayes returns a list of class "gbayes" containing the following named elements: mean.prior, var.prior, mean.post, var.post, and if n1 is specified, mean.pred and var.pred. Note that mean.pred is identical to mean.post. gbayes2 returns a single number which is the probability of correctly rejecting the null hypothesis in favor of the new treatment. gbayesMixPredNoData returns a function that can be used to evaluate the predictive density or cumulative distribution. gbayesMixPost returns a function that can be used to evaluate the posterior density or cdf. gbayesMixPowerNP returns a vector containing two values if nsim = 0. The first value is the critical value for the test statistic that will make the left credible interval > delta.w, and the second value is the power. If nsim > 0, it returns the power estimate and confidence limits for it. The examples show how to use these functions.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com
References
Spiegelhalter DJ, Freedman LS, Parmar MKB (1994): Bayesian approaches to randomized trials. JRSS A 157:357–416. Results for gbayes are derived from Equations 1, 2, 3, and 6.
Spiegelhalter DJ, Freedman LS (1986): A predictive approach to selecting the size of a clinical trial, based on subjective clinical opinion. Stat in Med 5:1–13.
Joseph, Lawrence and Belisle, Patrick (1997): Bayesian sample size determination for normal means and differences between normal means. The Statistician 46:209–226.
Grouin, JM, Coste M, Bunouf P, Lecoutre B (2007): Bayesian sample size determination in non-sequential clinical trials: Statistical aspects and some regulatory considerations. Stat in Med 26:4914–4924.
See Also
Examples
# Compare 2 proportions using the var stabilizing transformation
# arcsin(sqrt((x+3/8)/(n+3/4))) (Anscombe), which has variance
# 1/[4(n+.5)]
m1 <- 100; m2 <- 150
deaths1 <- 10; deaths2 <- 30
f <- function(events,n) asin(sqrt((events+3/8)/(n+3/4)))
stat <- f(deaths1,m1) - f(deaths2,m2)
var.stat <- function(m1, m2) 1/4/(m1+.5) + 1/4/(m2+.5)
cat("Test statistic:",format(stat)," s.d.:",
    format(sqrt(var.stat(m1,m2))), "\n")
#Use unbiased prior with variance 1000 (almost flat)
b <- gbayes(0, 1000, m1, m2, stat, var.stat, 2*m1, 2*m2)
print(b)
plot(b)
#To get posterior Prob[parameter > w] use
# 1-pnorm(w, b$mean.post, sqrt(b$var.post))

#If g(effect, n1, n2) is the power function to
#detect an effect of 'effect' with samples size for groups 1 and 2
#of n1,n2, estimate the expected power by getting 1000 random
#draws from the posterior distribution, computing power for
#each value of the population effect, and averaging the 1000 powers
#This code assumes that g will accept vector-valued 'effect'
#For the 2-sample proportion problem just addressed, 'effect'
#could be taken approximately as the change in the arcsin of
#the square root of the probability of the event
g <- function(effect, n1, n2, alpha=.05) {
  sd <- sqrt(var.stat(n1,n2))
  z <- qnorm(1 - alpha/2)
  effect <- abs(effect)
  1 - pnorm(z - effect/sd) + pnorm(-z - effect/sd)
}
effects <- rnorm(1000, b$mean.post, sqrt(b$var.post))
powers <- g(effects, 500, 500)
hist(powers, nclass=35, xlab='Power')
describe(powers)

# gbayes2 examples
# First consider a study with a binary response where the
# sample size is n1=500 in the new treatment arm and n2=300
# in the control arm.  The parameter of interest is the
# treated:control log odds ratio, which has variance
# 1/[n1 p1 (1-p1)] + 1/[n2 p2 (1-p2)].  This is not
# really constant so we average the variance over plausible
# values of the probabilities of response p1 and p2.  We
# think that these are between .4 and .6 and we take a
# further short cut
v <- function(n1, n2, p1, p2) 1/(n1*p1*(1-p1)) + 1/(n2*p2*(1-p2))
n1 <- 500; n2 <- 300
ps <- seq(.4, .6, length=100)
vguess <- quantile(v(n1, n2, ps, ps), .75)
vguess
#        75%
# 0.02183459

# The minimally interesting treatment effect is an odds ratio
# of 1.1.  The prior distribution on the log odds ratio is
# a 50:50 mixture of a vague Gaussian (mean 0, sd 100) and
# an informative prior from a previous study (mean 1, sd 1)
prior <- function(delta)
  0.5*dnorm(delta, 0, 100)+0.5*dnorm(delta, 1, 1)
deltas <- seq(-5, 5, length=150)
plot(deltas, prior(deltas), type='l')

# Now compute the power, averaged over this prior
gbayes2(sqrt(vguess), prior, log(1.1))
# [1] 0.6133338

# See how much power is lost by ignoring the previous
# study completely
gbayes2(sqrt(vguess), function(delta)dnorm(delta, 0, 100), log(1.1))
# [1] 0.4984588

# What happens to the power if we really don't believe the treatment
# is very effective?  Let's use a prior distribution for the log
# odds ratio that is uniform between log(1.2) and log(1.3).
# Also check the power against a true null hypothesis
prior2 <- function(delta) dunif(delta, log(1.2), log(1.3))
gbayes2(sqrt(vguess), prior2, log(1.1))
# [1] 0.1385113
gbayes2(sqrt(vguess), prior2, 0)
# [1] 0.3264065

# Compare this with the power of a two-sample binomial test to
# detect an odds ratio of 1.25
bpower(.5, odds.ratio=1.25, n1=500, n2=300)
#     Power
# 0.3307486

# For the original prior, consider a new study with equal
# sample sizes n in the two arms.  Solve for n to get a
# power of 0.9.  For the variance of the log odds ratio
# assume a common p in the center of a range of suspected
# probabilities of response, 0.3.  For this example we
# use a zero null value and the uniform prior above
v   <- function(n) 2/(n*.3*.7)
pow <- function(n) gbayes2(sqrt(v(n)), prior2)
uniroot(function(n) pow(n)-0.9, c(50,10000))$root
# [1] 2119.675
# Check this value
pow(2119.675)
# [1] 0.9

# Get the posterior density when there is a mixture of two priors,
# with mixing probability 0.5.  The first prior is almost
# non-informative (normal with mean 0 and variance 10000) and the
# second has mean 2 and variance 0.3.  The test statistic has a value
# of 3 with variance 0.4.
f <- gbayesMixPost(3, 4, mix=0.5, d0=0, v0=10000, d1=2, v1=0.3)
args(f)

# Plot this density
delta <- seq(-2, 6, length=150)
plot(delta, f(delta), type='l')

# Add to the plot the posterior density that used only
# the almost non-informative prior
lines(delta, f(delta, mix=1), lty=2)

# The same but for an observed statistic of zero
lines(delta, f(delta, mix=1, x=0), lty=3)

# Derive the CDF instead of the density
g <- gbayesMixPost(3, 4, mix=0.5, d0=0, v0=10000, d1=2, v1=0.3,
                   what='cdf')
# Had mix=0 or 1, gbayes1PowerNP could have been used instead
# of gbayesMixPowerNP below

# Compute the power to detect an effect of delta=1 if the variance
# of the test statistic is 0.2
gbayesMixPowerNP(g, 1, 0.2, interval=c(-10,12))

# Do the same thing by simulation
gbayesMixPowerNP(g, 1, 0.2, interval=c(-10,12), nsim=20000)

# Compute by what factor the sample size needs to be larger
# (the variance needs to be smaller) so that the power is 0.9
ratios <- seq(1, 4, length=50)
pow <- single(50)
for(i in 1:50)
  pow[i] <- gbayesMixPowerNP(g, 1, 0.2/ratios[i], interval=c(-10,12))[2]

# Solve for ratio using reverse linear interpolation
approx(pow, ratios, xout=0.9)$y

# Check this by computing power
gbayesMixPowerNP(g, 1, 0.2/2.1, interval=c(-10,12))
# So the study will have to be 2.1 times as large as earlier thought
gbayesSeqSim
Description
Simulate Bayesian Sequential Treatment Comparisons Using a Gaussian Model
Usage
gbayesSeqSim(est, asserts)
Arguments
est | data frame created by |
asserts | list of lists. The first element of each list is the user-specified name for each assertion/prior combination, e.g., |
Details
Simulate a sequential trial under a Gaussian model for parameter estimates, and Gaussian priors using simulated estimates and variances returned by estSeqSim. For each row of the data frame est and for each prior/assertion combination, computes the posterior probability of the assertion.
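For one estimate and one assertion, the computation described above can be sketched in base R. Both helpers are hypothetical and only illustrate the standard Gaussian calculations; they are not the gbayesSeqSim source. `prior_sigma` shows one way a prior standard deviation can be derived from a cutoff and a tail probability, and `post_prob_less` gives the posterior probability that the parameter is below a cutoff.

```r
# Hypothetical helper: sigma such that P(theta > cutprior) = tailprob
# under a N(mu, sigma^2) prior
prior_sigma <- function(cutprior, tailprob, mu = 0)
  (cutprior - mu) / qnorm(1 - tailprob)

# Hypothetical helper: P(theta < cut | estimate) under a Gaussian prior
# and a Gaussian likelihood with known variance vest
post_prob_less <- function(est, vest, cut, mu, sigma) {
  vpost <- 1 / (1 / sigma^2 + 1 / vest)
  mpost <- vpost * (mu / sigma^2 + est / vest)
  pnorm(cut, mpost, sqrt(vpost))
}

# Prior on the log odds ratio with mean 0 and P(OR > 2) = 0.05
s <- prior_sigma(log(2), 0.05)
tail_check <- 1 - pnorm(log(2), 0, s)

# Posterior probability that log(OR) < 0 for an illustrative estimate
pp <- post_prob_less(est = log(0.7), vest = 0.04, cut = 0, mu = 0, sigma = s)
c(sigma = s, tail = tail_check, post.prob = pp)
```

An interval assertion such as "OR between 0.9 and 1/0.9" would be the difference of two such probabilities evaluated at the two interval ends.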
Value
a data frame with number of rows equal to that of est with a number of new columns equal to the number of assertions added. The new columns are named p1, p2, p3, ... (posterior probabilities), mean1, mean2, ... (posterior means), and sd1, sd2, ... (posterior standard deviations). The returned data frame also has an attribute asserts added which is the original asserts augmented with any derived mu and sigma and converted to a data frame, and another attribute alabels which is a named vector used to map p1, p2, ... to the user-provided labels in asserts.
Author(s)
Frank Harrell
See Also
gbayes(),estSeqSim(),simMarkovOrd(),estSeqMarkovOrd()
Examples
## Not run: 
# Simulate Bayesian operating characteristics for an unadjusted
# proportional odds comparison (Wilcoxon test)
# For 100 simulations, 5 looks, 2 true parameter values, and
# 2 assertion/prior combinations, compute the posterior probability
# Use a low-level logistic regression call to speed up simulations
# Use data.table to compute various summary measures
# Total simulation time: 2s
lfit <- function(x, y) {
  f <- rms::lrm.fit(x, y)
  k <- length(coef(f))
  c(coef(f)[k], vcov(f)[k, k])
}
gdat <- function(beta, n1, n2) {
  # Cell probabilities for a 7-category ordinal outcome for the control group
  p <- c(2, 1, 2, 7, 8, 38, 42) / 100

  # Compute cell probabilities for the treated group
  p2 <- pomodm(p=p, odds.ratio=exp(beta))
  y1 <- sample(1 : 7, n1, p,  replace=TRUE)
  y2 <- sample(1 : 7, n2, p2, replace=TRUE)
  list(y1=y1, y2=y2)
}

# Assertion 1: log(OR) < 0 under prior with prior mean 0.1 and sigma 1 on log OR scale
# Assertion 2: OR between 0.9 and 1/0.9 with prior mean 0 and sigma computed so that
# P(OR > 2) = 0.05
asserts <- list(list('Efficacy', '<', 0, mu=0.1, sigma=1),
                list('Similarity', 'in', log(c(0.9, 1/0.9)),
                     cutprior=log(2), tailprob=0.05))

set.seed(1)
est <- estSeqSim(c(0, log(0.7)), looks=c(50, 75, 95, 100, 200),
                 gendat=gdat,
                 fitter=lfit, nsim=100)
z <- gbayesSeqSim(est, asserts)
head(z)
attr(z, 'asserts')

# Compute the proportion of simulations that hit targets (different target posterior
# probabilities for efficacy vs. similarity)

# For the efficacy assessment compute the first look at which the target
# was hit (set to infinity if never hit)
require(data.table)
z <- data.table(z)
u <- z[, .(first=min(p1 > 0.95)), by=.(parameter, sim)]
# Compute the proportion of simulations that ever hit the target and
# that hit it by the 100th subject
u[, .(ever=mean(first < Inf)),  by=.(parameter)]
u[, .(by75=mean(first <= 100)), by=.(parameter)]

## End(Not run)
Step function confidence intervals for ggplot2
Description
Produces a step function confidence interval for survival curves. This function is taken from the utile.visuals package by Eric Finnesgard. That package is not used because of its strong dependencies.
Usage
geom_stepconfint(
  mapping = NULL,
  data = NULL,
  stat = "identity",
  position = "identity",
  na.rm = FALSE,
  ...
)
Arguments
mapping | Aesthetic mappings with aes() function. Like geom_ribbon(), you must provide columns for x, ymin (lower limit), ymax (upper limit). |
data | The data to be displayed in this layer. Can inherit from ggplot parent. |
stat | The statistical transformation to use on the data for this layer, as a string. Defaults to 'identity'. |
position | Position adjustment, either as a string, or the result of a call to a position adjustment function. |
na.rm | If FALSE, the default, missing values are removed with a warning. If TRUE, missing values are silently removed. |
... | Optional. Any other ggplot geom_ribbon() arguments. |
Note
Originally adapted from the survminer package <https://github.com/kassambara/survminer>.
Author(s)
Eric Finnesgard
Examples
require(survival)
require(ggplot2)

f <- survfit(Surv(time, status) ~ trt, data = diabetic)
d <- with(f, data.frame(time, surv, lower, upper,
                        trt=rep(names(f$strata), f$strata)))
ggplot(d, aes(x = time, y=surv)) +
  geom_step(aes(color = trt)) +
  geom_stepconfint(aes(ymin = lower, ymax = upper, fill = trt), alpha = 0.3) +
  coord_cartesian(c(0, 50)) +
  scale_x_continuous(expand = c(0.02,0)) +
  labs(x = 'Time', y = 'Freedom From Event') +
  scale_color_manual(
    values = c('#d83641', '#1A45A7'),
    name = 'Treatment',
    labels = c('None', 'Laser'),
    aesthetics = c('colour', 'fill'))
Download and Install Datasets for Hmisc, rms, and Statistical Modeling
Description
This function downloads and makes ready to use datasets from the main web site for the Hmisc and rms libraries. For R, the datasets were stored in compressed save format and getHdata makes them available by running load after download. For S-Plus, the datasets were stored in data.dump format and are made available by running data.restore after import. The dataset is run through the cleanup.import function. Calling getHdata with no file argument provides a character vector of names of available datasets that are currently on the web site. For R, R's default browser can optionally be launched to view html files that were already prepared using the Hmisc command html(contents()) or to view ‘.txt’ or ‘.html’ data description files when available.
If options(localHfiles=TRUE) the scripts are read from local directory ~/web/data/repo instead of from the web server.
Usage
getHdata(file, what = c("data", "contents", "description", "all"),
         where="https://hbiostat.org/data/repo")
Arguments
file | an unquoted name of a dataset on the web site, e.g. ‘prostate’.Omit |
what | specify |
where |
Value
getHdata() without a file argument returns a character vector of dataset base names. When a dataset is downloaded, the data frame is placed in search position one and is not returned as value of getHdata.
Author(s)
Frank Harrell
See Also
download.file, cleanup.import, data.restore, load
Examples
## Not run:
getHdata()          # download list of available datasets
getHdata(prostate)  # downloads, load( ) or data.restore( )
                    # runs cleanup.import for S-Plus 6
getHdata(valung, "contents")   # open browser (options(browser="whatever"))
                    # after downloading valung.html
                    # (result of html(contents()))
getHdata(support, "all")  # download and open one browser window
datadensity(support)
attach(support)     # make individual variables available
getHdata(plasma, "all")   # download and open two browser windows
                          # (description file is available for plasma)
## End(Not run)
Interact with github rscripts Project
Description
The github rscripts project at https://github.com/harrelfe/rscripts contains R scripts that are primarily analysis templates for teaching with RStudio. This function allows the user to print an organized list of available scripts, to download a script and source() it into the current session (the default), to download a script and load it into an RStudio script editor window, to list scripts whose major category contains a given string (ignoring case), or to list all major and minor categories. If options(localHfiles=TRUE) the scripts are read from local directory ~/R/rscripts instead of from github.
Usage
getRs(file=NULL, guser='harrelfe', grepo='rscripts', gdir='raw/master',
      dir=NULL, browse=c('local', 'browser'), cats=FALSE,
      put=c('source', 'rstudio'))
Arguments
file | a character string containing a script file name.Omit |
guser | GitHub user name, default is |
grepo | Github repository name, default is |
gdir | Github directory under which to find retrievable files |
dir | directory under |
browse | When showing the rscripts contents directory, the default is to list in tabular form in the console. Specify |
cats | Leave at the default ( |
put | Leave at the default ( |
Value
a data frame or list, depending on arguments
Author(s)
Frank Harrell and Cole Beck
See Also
Examples
## Not run:
getRs()             # list available scripts
scripts <- getRs()  # likewise, but store in an object that can easily
                    # be viewed on demand in RStudio
getRs('introda.r')  # download introda.r and put in script editor
getRs(cats=TRUE)    # list available major and minor categories
categories <- getRs(cats=TRUE)
# likewise but store results in a list for later viewing
getRs(cats='reg')   # list all scripts in a major category containing 'reg'
getRs('importREDCap.r')   # source() to define a function
# source() a new version of the Hmisc package's cut2 function:
getRs('cut2.s', grepo='Hmisc', dir='R')
## End(Not run)
Open a Zip File From a URL
Description
Allows downloading and reading of a zip file containing one file
Usage
getZip(url, password=NULL)
Arguments
url | either a path to a local file or a valid URL. |
password | required to decode password-protected zip files |
Details
Allows downloading and reading of a zip file containing one file. The file may be password protected. If a password is needed and none was given, one will be requested.
Note: to make a password-protected zip file z.zip, run zip -e z myfile
Value
Returns a file I/O pipe.
Author(s)
Frank E. Harrell
See Also
Examples
## Not run:
read.csv(getZip('http://test.com/z.zip'))
## End(Not run)
getabd
Description
Data from The Analysis of Biological Data by Whitlock and Schluter
Usage
getabd(name = "", lowernames = FALSE, allow = "_")
Arguments
name | name of dataset to fetch. Omit to get a data table listing all available datasets. |
lowernames | set to |
allow | set to |
Details
Fetches csv files for exercises in the book
Value
data frame with attributes label and url
Author(s)
Frank Harrell
Frequency Scatterplot
Description
Uses ggplot2 to plot a scatterplot or dot-like chart for the case where there is a very large number of overlapping values. This works for continuous and categorical x and y. For continuous variables it serves the same purpose as hexagonal binning. Counts for overlapping points are grouped into quantile groups, and level of transparency and rainbow colors are used to provide count information.
Alternatively, specify stick=TRUE to encode cell frequencies not with color but with the height of a black line y-centered at the middle of the bins. Relative frequencies are not transformed, and the maximum cell frequency is shown in a caption. Every point with a frequency of at least one is depicted with a full-height light gray vertical line, scaled to the above overall maximum frequency. In this way the relative frequency is the proportion of each light gray line that is black, and one can see points whose frequencies are too low for the black line to be visible.
The result can also be passed to ggplotly. Actual cell frequencies are added to the hover text in that case using the label ggplot2 aesthetic.
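The binning and quantile grouping described above can be sketched in base R. This is only an illustration of the idea, not the code used by ggfreqScatter: the function name freq_groups is hypothetical, and the use of cut() and quantile() for binning and grouping is an assumption made for the sketch.

```r
# Hypothetical sketch: count points per grid cell, then group the counts
freq_groups <- function(x, y, bins = 50, g = 10) {
  # Count points falling in each cell of a bins x bins grid
  tab <- as.data.frame(table(xbin = cut(x, bins), ybin = cut(y, bins)))
  tab <- tab[tab$Freq > 0, ]          # keep only occupied cells
  # Group the cell counts into (up to) g quantile groups
  brks <- unique(quantile(tab$Freq, (0:g) / g))
  tab$group <- cut(tab$Freq, brks, include.lowest = TRUE)
  tab
}

set.seed(1)
f <- freq_groups(rnorm(1000), rnorm(1000))
sum(f$Freq)      # every point is counted in exactly one cell
table(f$group)   # number of cells in each frequency group
```

Each occupied cell would then be drawn once, with color or transparency chosen by its frequency group rather than by overplotting the raw points.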
Usage
ggfreqScatter(x, y, by=NULL, bins=50, g=10, cuts=NULL,
              xtrans = function(x) x,
              ytrans = function(y) y,
              xbreaks = pretty(x, 10),
              ybreaks = pretty(y, 10),
              xminor = NULL, yminor = NULL,
              xlab = as.character(substitute(x)),
              ylab = as.character(substitute(y)),
              fcolors = viridisLite::viridis(10), nsize=FALSE,
              stick=FALSE, html=FALSE, prfreq=FALSE, ...)
Arguments
x | x-variable |
y | y-variable |
by | an optional vector used to make separate plots for each distinct value using |
bins | for continuous |
g | number of quantile groups to make for frequency counts. Use |
cuts | instead of using |
xtrans,ytrans | functions specifying transformations to be made before binning and plotting |
xbreaks,ybreaks | vectors of values to label on axis, on original scale |
xminor,yminor | values at which to put minor tick marks, on original scale |
xlab,ylab | axis labels. If not specified and variable has a |
fcolors |
|
nsize | set to |
stick | set to |
html | set to |
prfreq | set to |
... | arguments to pass to |
Value
a ggplot object
Author(s)
Frank Harrell
See Also
Examples
require(ggplot2)
set.seed(1)
x <- rnorm(1000)
y <- rnorm(1000)
count <- sample(1:100, 1000, TRUE)
x <- rep(x, count)
y <- rep(y, count)
# color=alpha=NULL below makes loess smooth over all points
g <- ggfreqScatter(x, y) +   # might add g=0 if using plotly
  geom_smooth(aes(color=NULL, alpha=NULL), se=FALSE) +
  ggtitle("Using Deciles of Frequency Counts, 2500 Bins")
g
# plotly::ggplotly(g, tooltip='label')  # use plotly, hover text = freq. only
# Plotly makes it somewhat interactive, with hover text tooltips
# Instead use varying-height sticks to depict frequencies
ggfreqScatter(x, y, stick=TRUE) +
  labs(subtitle='Relative height of black lines to gray lines
is proportional to cell frequency.
Note that points with even tiny frequency are visible
(gray line with no visible black line).')
# Try with x categorical
x1 <- sample(c('cat', 'dog', 'giraffe'), length(x), TRUE)
ggfreqScatter(x1, y)
# Try with y categorical
y1 <- sample(LETTERS[1:10], length(x), TRUE)
ggfreqScatter(x, y1)
# Both categorical, larger point symbols, box instead of circle
ggfreqScatter(x1, y1, shape=15, size=7)
# Vary box size instead
ggfreqScatter(x1, y1, nsize=TRUE, shape=15)
ggplotlyr
Description
Render plotly Graphic from a ggplot2 Object
Usage
ggplotlyr(ggobject, tooltip = "label", remove = "txt: ", ...)
Arguments
ggobject | an object produced by |
tooltip | attribute specified to |
remove | extraneous text to remove from hover text. Default is set to assume |
... | other arguments passed to |
Details
Uses plotly::ggplotly() to render a plotly graphic with a specified tooltip attribute, removing extraneous text that ggplotly puts in hover text when tooltip='label'
Value
a plotly object
Author(s)
Frank Harrell
hashCheck
Description
Check for Changes in List of Objects
Usage
hashCheck(..., file, .print. = TRUE, .names. = NULL)
Arguments
... | a list of objects including data frames, vectors, functions, and all other types of R objects that represent dependencies of a certain calculation |
file | name of file in which results are stored |
.print. | set to |
.names. | vector of names of original arguments if not calling |
Details
Given an RDS file name and a list of objects, does the following:
makes a vector of hashes, one for each object. Function objects are run through deparse so that the environment of the function will not be considered.
see if the file exists; if not, return a list with result=NULL, hash = new vector of hashes, changed='All'
if the file exists, read the file and its hash attribute as prevhash
if prevhash is not identical to hash: if .print.=TRUE (default), print to console a summary of what's changed; return a list with result=NULL, hash = new hash vector, changed
if prevhash = hash, return a list with result=file object, hash = new hash, changed=""
Set options(debughash=TRUE) to trace results in /tmp/debughash.txt
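The steps listed above can be sketched with base R alone. This is a hedged illustration of the logic, not the hashCheck implementation: the names hash_one and hash_check_sketch are hypothetical, and hashing by serializing each object to a temporary file and taking its MD5 checksum with tools::md5sum is an assumption chosen to avoid extra packages.

```r
# Hypothetical helper: hash one R object using only base R
hash_one <- function(obj) {
  if (is.function(obj)) obj <- deparse(obj)   # ignore the function's environment
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(serialize(obj, NULL), tmp)
  unname(tools::md5sum(tmp))
}

# Hypothetical sketch of the check itself
hash_check_sketch <- function(objs, file) {
  hash <- vapply(objs, hash_one, character(1))
  if (! file.exists(file))
    return(list(result = NULL, hash = hash, changed = 'All'))
  prev     <- readRDS(file)
  prevhash <- attr(prev, 'hash')
  if (! identical(prevhash, hash)) {
    changed <- if (length(prevhash) == length(hash))
      names(hash)[hash != prevhash] else 'All'
    return(list(result = NULL, hash = hash, changed = changed))
  }
  list(result = prev, hash = hash, changed = '')
}

# Usage: rerun a computation only when one of its dependencies changes
f    <- tempfile(fileext = '.rds')
deps <- list(x = 1:10, fun = function(v) mean(v))
chk  <- hash_check_sketch(deps, f)     # file does not exist yet: changed = 'All'
if (is.null(chk$result)) {
  res <- deps$fun(deps$x)
  attr(res, 'hash') <- chk$hash        # store the hashes with the result
  saveRDS(res, f)
}
unchanged <- hash_check_sketch(deps, f)   # result is read back from the file
deps$x <- 1:11                            # modify one dependency
hash_check_sketch(deps, f)$changed        # reports which dependency changed
```

Storing the hash vector as an attribute of the saved result keeps the result and the state of its dependencies in a single file.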
Value
a list with elements result (the computations), hash (the new hash), and changed which details what changed to make computations need to be run
Author(s)
Frank Harrell
Harrell-Davis Distribution-Free Quantile Estimator
Description
Computes the Harrell-Davis (1982) quantile estimator and jackknife standard errors of quantiles. The quantile estimator is a weighted linear combination of order statistics in which the order statistics used in traditional nonparametric quantile estimators are given the greatest weight. In small samples the H-D estimator is more efficient than traditional ones, and the two methods are asymptotically equivalent. The H-D estimator is the limit of a bootstrap average as the number of bootstrap resamples becomes infinitely large.
Usage
hdquantile(x, probs = seq(0, 1, 0.25),
           se = FALSE, na.rm = FALSE, names = TRUE, weights=FALSE)
Arguments
x | a numeric vector |
probs | vector of quantiles to compute |
se | set to |
na.rm | set to |
names | set to |
weights | set to |
Details
A Fortran routine is used to compute the jackknife leave-out-one quantile estimates. Standard errors are not computed for quantiles 0 or 1 (NAs are returned).
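The weighted combination of order statistics can be sketched in a few lines of base R. This is a minimal illustration, assuming the usual Harrell-Davis weights (increments of a Beta((n+1)p, (n+1)(1-p)) distribution function over the intervals ((i-1)/n, i/n]); hd_sketch is a hypothetical name, and the sketch omits standard errors, NA handling, and the p = 0 and p = 1 cases.

```r
# Hypothetical sketch of the Harrell-Davis estimate for one probability p
hd_sketch <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  a <- (n + 1) * p
  b <- (n + 1) * (1 - p)
  # Weight of the i-th order statistic: Beta CDF increment over ((i-1)/n, i/n]
  w <- diff(pbeta((0:n) / n, a, b))
  sum(w * x)
}

set.seed(1)
x <- runif(100)
hd_sketch(x, 0.5)   # H-D estimate of the median
median(x)           # traditional estimate, for comparison
```

Because the weights sum to one and are largest near the order statistics used by the traditional estimator, the two estimates are typically close.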
Value
A vector of quantiles. If se=TRUE this vector will have an attribute se added to it, containing the standard errors. If weights=TRUE, also has a "weights" attribute which is a matrix.
Author(s)
Frank Harrell
References
Harrell FE, Davis CE (1982): A new distribution-free quantile estimator. Biometrika 69:635-640.
Hutson AD, Ernst MD (2000): The exact bootstrap mean and variance of an L-estimator. J Roy Statist Soc B 62:89-94.
See Also
Examples
set.seed(1)
x <- runif(100)
hdquantile(x, (1:3)/4, se=TRUE)
## Not run:
# Compare jackknife standard errors with those from the bootstrap
library(boot)
boot(x, function(x,i) hdquantile(x[i], probs=(1:3)/4), R=400)
## End(Not run)
Moving and Hiding Table of Contents
Description
Moving and hiding table of contents for Rmd HTML documents
Usage
hidingTOC(
  buttonLabel = "Contents",
  levels = 3,
  tocSide = c("right", "left"),
  buttonSide = c("right", "left"),
  posCollapse = c("margin", "top", "bottom"),
  hidden = FALSE
)
Arguments
buttonLabel | the text on the button that hides and unhides the table of contents. Defaults to |
levels | the maximum depth of the table of contents whose display is to be controlled (defaults to 3) |
tocSide | which side of the page the table of contents should be placed on. Can be either |
buttonSide | which side of the page the button that hides the TOC should be placed on. Can be either |
posCollapse | if |
hidden | Logical: should the table of contents be hidden at page load? Defaults to |
Details
hidingTOC creates a table of contents in an Rmd document that can be hidden at the press of a button. It also generates buttons that allow the hiding or unhiding of the different level depths of the table of contents.
Value
an HTML-formatted text string to be inserted into a markdown document
Author(s)
Thomas Dupont
Examples
## Not run:
hidingTOC()
## End(Not run)
Histograms for Variables in a Data Frame
Description
This function tries to compute the maximum number of histograms that will fit on one page, then it draws a matrix of histograms. If there are more qualifying variables than will fit on a page, the function waits for a mouse click before drawing the next page.
Usage
## S3 method for class 'data.frame'
hist(x, n.unique = 3, nclass = "compute",
     na.big = FALSE, rugs = FALSE, freq=TRUE, mtitl = FALSE, ...)
Arguments
x | a data frame |
n.unique | minimum number of unique values a variable must have before a histogram is drawn |
nclass | number of bins. Default is max(2,trunc(min(n/10,25*log(n,10))/2)), where n is the number of non-missing values for a variable. |
na.big | set to |
rugs | set to |
freq | see |
mtitl | set to a character string to set aside extra outside top margin and to use the string for an overall title |
... | arguments passed to |
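As a worked example of the default nclass formula quoted above, the following snippet evaluates it for a few sample sizes. The wrapper name default_nclass is hypothetical and exists only for this illustration.

```r
# The default number of bins as a function of the number of non-missing values n
default_nclass <- function(n) max(2, trunc(min(n / 10, 25 * log(n, 10)) / 2))

sapply(c(10, 200, 1000), default_nclass)
# → 2 10 37
```

For small n the floor of 2 applies, for moderate n the n/10 term governs, and for large n the logarithmic term limits the growth in the number of bins.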
Value
the number of pages drawn
Author(s)
Frank E Harrell Jr
See Also
Examples
d <- data.frame(a=runif(200), b=rnorm(200),
                w=factor(sample(c('green','red','blue'), 200, TRUE)))
hist.data.frame(d)   # in R, just say hist(d)
Back to Back Histograms
Description
Takes two vectors or a list with x and y components, and produces back to back histograms of the two datasets.
Usage
histbackback(x, y, brks=NULL, xlab=NULL, axes=TRUE, probability=FALSE,
             xlim=NULL, ylab='', ...)
Arguments
x,y | either two vectors or a list given as |
brks | vector of the desired breakpoints for the histograms. |
xlab | a vector of two character strings naming the two datasets. |
axes | logical flag stating whether or not to label the axes. |
probability | logical flag: if |
xlim | x-axis limits. First value must be negative, as the left histogram is placed at negative x-values. Second value must be positive, for the right histogram. To make the limits symmetric, use e.g. |
ylab | label for y-axis. Default is no label. |
... | additional graphics parameters may be given. |
Value
a list is returned invisibly with the following components:
left | the counts for the dataset plotted on the left. |
right | the counts for the dataset plotted on the right. |
breaks | the breakpoints used. |
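The three components of the returned list can be reproduced in base R by binning both datasets on a shared set of breakpoints. This is a sketch of the idea only: back_to_back_counts is a hypothetical name, and taking the default breaks from hist() on the pooled data is an assumption rather than the rule used by histbackback.

```r
# Hypothetical sketch: counts for two datasets on common breakpoints
back_to_back_counts <- function(x, y, brks = NULL) {
  # Assumed default: let hist() choose breaks from the pooled data
  if (is.null(brks)) brks <- hist(c(x, y), plot = FALSE)$breaks
  list(left   = hist(x, breaks = brks, plot = FALSE)$counts,
       right  = hist(y, breaks = brks, plot = FALSE)$counts,
       breaks = brks)
}

set.seed(1)
h <- back_to_back_counts(rnorm(20), rnorm(30))
h$left    # would be drawn to the left of zero
h$right   # would be drawn to the right of zero
```

Using one set of breakpoints for both datasets is what makes the left and right bars directly comparable bin by bin.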
Side Effects
a plot is produced on the current graphics device.
Author(s)
Pat Burns
Salomon Smith Barney
London
pburns@dorado.sbi.com
See Also
Examples
options(digits=3)
set.seed(1)
histbackback(rnorm(20), rnorm(30))
fool <- list(x=rnorm(40), y=rnorm(40))
histbackback(fool)
age <- rnorm(1000,50,10)
sex <- sample(c('female','male'),1000,TRUE)
histbackback(split(age, sex))
agef <- age[sex=='female']; agem <- age[sex=='male']
histbackback(list(Female=agef,Male=agem), probability=TRUE, xlim=c(-.06,.06))
Use plotly to Draw Stratified Spike Histogram and Box Plot Statistics
Description
Uses plotly to draw horizontal spike histograms stratified by group, plus the mean (solid dot) and vertical bars for these quantiles: 0.05 (red, short), 0.25 (blue, medium), 0.50 (black, long), 0.75 (blue, medium), and 0.95 (red, short). The robust dispersion measure Gini's mean difference and the SD may optionally be added. These are shown as horizontal lines starting at the minimum value of x having a length equal to the mean difference or SD. Even when Gini's and SD are computed, they are not drawn unless the user clicks on their legend entry.
Spike histograms have the advantage of effectively showing the raw data for both small and huge datasets, and unlike box plots allow multi-modality to be easily seen.
histboxpM plots multiple histograms stacked vertically, for variables in a data frame having a common group variable (if any) and combined using plotly::subplot.
dhistboxp is like histboxp but no plotly graphics are actually drawn. Instead, a data frame suitable for use with plotlyM is returned. For dhistboxp an additional level of stratification strata is implemented. group causes a different result here to produce back-to-back histograms (in the case of two groups) for each level of strata.
Usage
histboxp(p = plotly::plot_ly(height=height), x, group = NULL,
         xlab=NULL, gmd=TRUE, sd=FALSE, bins = 100, wmax=190, mult=7,
         connect=TRUE, showlegend=TRUE)
dhistboxp(x, group = NULL, strata=NULL, xlab=NULL,
          gmd=FALSE, sd=FALSE, bins = 100, nmin=5, ff1=1, ff2=1)
histboxpM(p=plotly::plot_ly(height=height, width=width), x, group=NULL,
          gmd=TRUE, sd=FALSE, width=NULL, nrows=NULL, ncols=NULL, ...)
Arguments
p |
|
x | a numeric vector, or for |
group | a discrete grouping variable. If omitted, defaults to a vector of ones |
strata | a discrete numeric stratification variable. Values are also used to space out different spike histograms. Defaults to a vector of ones. |
xlab | x-axis label, defaults to labelled version include units of measurement if any |
gmd | set to |
sd | set to |
width | width in pixels |
nrows | number of rows for layout of multiple plots |
ncols | number of columns for layout of multiple plots. At most one of |
bins | number of equal-width bins to use for spike histogram. If the number of distinct values of |
nmin | minimum number of non-missing observations for a group-stratum combination before the spike histogram and quantiles are drawn |
ff1,ff2 | fudge factors for position and bar length for spike histograms |
wmax,mult | tweaks for margin to allocate |
connect | set to |
showlegend | used if producing multiple plots to be combined with |
... | other arguments for |
Value
a plotly object. For dhistboxp a data frame as expected by plotlyM
Author(s)
Frank Harrell
See Also
histSpike, plot.describe, scat1d
Examples
## Not run:
dist <- c(rep(1, 500), rep(2, 250), rep(3, 600))
Distribution <- factor(dist, 1 : 3, c('Unimodal', 'Bimodal', 'Trimodal'))
x <- c(rnorm(500, 6, 1),
       rnorm(200, 3, .7), rnorm(50, 7, .4),
       rnorm(200, 2, .7), rnorm(300, 5.5, .4), rnorm(100, 8, .4))
histboxp(x=x, group=Distribution, sd=TRUE)
X <- data.frame(x, x2=runif(length(x)))
histboxpM(x=X, group=Distribution, ncols=2)  # separate plots
## End(Not run)
hlab
Description
Easy Extraction of Labels/Units Expressions for Plotting
Usage
hlab(x, name = NULL, html = FALSE, plotmath = TRUE)
Arguments
x | a single variable name, unquoted |
name | a single character string providing an alternate way to name |
html | set to |
plotmath | set to |
Details
Given a single unquoted variable, first looks to see if a non-NULL LabelsUnits object exists (produced by extractlabs()). When LabelsUnits does not exist or is NULL, looks up the attributes in the current dataset, which defaults to d or may be specified by options(current_ds='name of the data frame/table'). Finally the existence of a variable of the given name in the global environment is checked. When a variable is not found in any of these three sources or has a blank label and units, an expression() with the variable name alone is returned. If html=TRUE, HTML strings are constructed instead, suitable for plotly graphics.
The result is useful for xlab and ylab in base plotting functions or in ggplot2, along with being useful for labs in ggplot2. See example.
Value
an expression created by labelPlotmath with plotmath=TRUE
Author(s)
Frank Harrell
See Also
label(), units(), contents(), hlabs(), extractlabs(), plotmath
Examples
d <- data.frame(x=1:10, y=(1:10)/10)
d <- upData(d, labels=c(x='X', y='Y'), units=c(x='mmHg'), print=FALSE)
hlab(x)
hlab(x, html=TRUE)
hlab(z)
require(ggplot2)
ggplot(d, aes(x, y)) + geom_point() + labs(x=hlab(x), y=hlab(y))
# Can use xlab(hlab(x)) + ylab(hlab(y)) also
# Store names, labels, units for all variables in d in object
LabelsUnits <- extractlabs(d)
# Remove d; labels/units still found
rm(d)
hlab(x)
# Remove LabelsUnits and use a current dataset named
# d2 instead of the default d
rm(LabelsUnits)
options(current_ds='d2')
hlabs
Description
Front-end to ggplot2 labs Function
Usage
hlabs(x, y, html = FALSE)
Arguments
x | a single variable name, unquoted |
y | a single variable name, unquoted |
html | set to |
Details
Runs x, y, or both through hlab() and passes the constructed labels to the ggplot2::labs function to specify x- and y-axis labels specially formatted for units of measurement
Value
result of ggplot2::labs()
Author(s)
Frank Harrell
Examples
# Name the current dataset d, or specify a name with
# options(current_ds='...') or run `extractlabs`, then
# ggplot(d, aes(x,y)) + geom_point() + hlabs(x,y)
# to specify only the x-axis label use hlabs(x), or to
# specify only the y-axis label use hlabs(y=...)
Matrix of Hoeffding's D Statistics
Description
Computes a matrix of Hoeffding's (1948) D statistics for all possible pairs of columns of a matrix. D is a measure of the distance between F(x,y) and G(x)H(y), where F(x,y) is the joint CDF of X and Y, and G and H are marginal CDFs. Missing values are deleted in pairs rather than deleting all rows of x having any missing variables. The D statistic is robust against a wide variety of alternatives to independence, such as non-monotonic relationships. The larger the value of D, the more dependent are X and Y (for many types of dependencies). D used here is 30 times Hoeffding's original D, and ranges from -0.5 to 1.0 if there are no ties in the data. print.hoeffd prints the information derived by hoeffd. hoeffd also computes the mean and maximum absolute values of the difference between the joint empirical CDF and the product of the marginal empirical CDFs.
Usage
hoeffd(x, y)
## S3 method for class 'hoeffd'
print(x, ...)
Arguments
x | a numeric matrix with at least 5 rows and at least 2 columns (if |
y | a numeric vector or matrix which will be concatenated to |
... | ignored |
Details
Uses midranks in case of ties, as described by Hollander and Wolfe. P-values are approximated by linear interpolation on the table in Hollander and Wolfe, which uses the asymptotically equivalent Blum-Kiefer-Rosenblatt statistic. For P<.0001 or >0.5, P values are computed using a well-fitting linear regression function in log P vs. the test statistic. Ranks (but not bivariate ranks) are computed using efficient algorithms (see reference 3).
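The mean and maximum absolute difference between the joint empirical CDF and the product of the marginal empirical CDFs, mentioned in the description, can be illustrated with a direct base R computation. This is a sketch under stated assumptions: cdf_discrepancy is a hypothetical name, the CDFs are evaluated at the observed points with no special treatment of ties, and hoeffd itself uses rank-based compiled code whose definitions may differ in detail.

```r
# Hypothetical sketch: discrepancy between joint and product-of-marginal ECDFs
cdf_discrepancy <- function(x, y) {
  ok <- ! (is.na(x) | is.na(y))     # delete missing values pairwise
  x <- x[ok]; y <- y[ok]
  joint <- sapply(seq_along(x), function(i) mean(x <= x[i] & y <= y[i]))
  gx    <- sapply(x, function(v) mean(x <= v))   # marginal ECDF of x
  hy    <- sapply(y, function(v) mean(y <= v))   # marginal ECDF of y
  d     <- abs(joint - gx * hy)
  c(mean = mean(d), max = max(d))
}

r <- cdf_discrepancy(1:4, 1:4)   # perfectly dependent pair
r
# → mean 0.15625, max 0.25
```

Under independence the joint CDF factors into the product of the marginals, so both summaries would be near zero in large samples.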
Value
a list with elements D, the matrix of D statistics, n, the matrix of number of observations used in analyzing each pair of variables, and P, the asymptotic P-values. Pairs with fewer than 5 non-missing values have the D statistic set to NA. The diagonals of n are the number of non-NAs for the single variable corresponding to that row and column.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
References
Hoeffding W. (1948): A non-parametric test of independence. Ann Math Stat 19:546–57.
Hollander M. and Wolfe D.A. (1973). Nonparametric Statistical Methods, pp. 228–235, 423. New York: Wiley.
Press WH, Flannery BP, Teukolsky SA, Vetterling, WT (1988): Numerical Recipes in C. Cambridge: Cambridge University Press.
See Also
Examples
x <- c(-2, -1, 0, 1, 2)
y <- c(4, 1, 0, 1, 4)
z <- c(1, 2, 3, 4, NA)
q <- c(1, 2, 3, 4, 5)
hoeffd(cbind(x,y,z,q))
# Hoeffding's test can detect even one-to-many dependency
set.seed(1)
x <- seq(-10,10,length=200)
y <- x*sign(runif(200,-1,1))
plot(x,y)
hoeffd(x,y)
Convert an S object to HTML
Description
html is a generic function, for which only two methods are currently implemented, html.latex and a rudimentary html.data.frame. The former uses the HeVeA LaTeX to HTML translator by Maranget to create an HTML file from a LaTeX file like the one produced by latex. html.default just runs html.data.frame. htmlVerbatim prints all of its arguments to the console in an html verbatim environment, using a specified percent of the prevailing character size. This is useful for R Markdown with knitr.
Most of the html-producing functions in the Hmisc and rms packages return a character vector passed through htmltools::HTML so that knitr will correctly format the result without the need for the user putting results='asis' in the chunk header.
Usage
html(object, ...)
## S3 method for class 'latex'
html(object, file, where=c('cwd', 'tmp'),
     method=c('hevea', 'htlatex'),
     rmarkdown=FALSE, cleanup=TRUE, ...)
## S3 method for class 'data.frame'
html(object,
     file=paste(first.word(deparse(substitute(object))),'html',sep='.'), header,
     caption=NULL, rownames=FALSE, align='r', align.header='c',
     bold.header=TRUE, col.header='Black',
     border=2, width=NULL, size=100, translate=FALSE,
     append=FALSE, link=NULL, linkCol=1,
     linkType=c('href','name'), disableq=FALSE, ...)
## Default S3 method:
html(object,
     file=paste(first.word(deparse(substitute(object))),'html',sep='.'),
     append=FALSE, link=NULL, linkCol=1, linkType=c('href','name'), ...)
htmlVerbatim(..., size=75, width=85, scroll=FALSE, rows=10, cols=100,
             propts=NULL, omit1b=FALSE)
Arguments
object | a data frame or an object created by |
file | name of the file to create. The default filename is |
where | for |
method | default is to use system command |
rmarkdown | set to |
cleanup | if using |
header | vector of column names. Defaults to names in |
caption | a character string to be used as a caption before the table |
rownames | set to |
align | alignment for table columns (all are assumed to have the same alignment if this is a scalar). Specify |
align.header | same coding as for |
bold.header | set to |
col.header | color for column headers |
border | set to 0 to not include table cell borders, 1 to include only outer borders, or 2 (the default) to put borders around cells too |
translate | set to |
width | optional table width for |
size | a number between 0 and 100 representing the percent of the prevailing character size to be used by |
append | set to |
link | character vector specifying hyperlink names to attach to selected elements of the matrix or data frame. No hyperlinks are used if |
linkCol | column number of |
linkType | defaults to |
disableq | set to |
... | ignored except for |
scroll | set to |
rows,cols | the number of rows and columns to devote to the visible part of the scrollable box |
propts | options, besides |
omit1b | for |
Author(s)
Frank E. Harrell, Jr.
Department of Biostatistics,
Vanderbilt University,
fh@fharrell.com
References
Maranget, Luc. HeVeA: a LaTeX to HTML translator. URL: http://para.inria.fr/~maranget/hevea/
See Also
Examples
## Not run:
x <- matrix(1:6, nrow=2, dimnames=list(c('a','b'),c('c','d','e')))
w <- latex(x)
h <- html(w) # run HeVeA to convert .tex to .html
h <- html(x) # convert x directly to html
w <- html(x, link=c('','B'))   # hyperlink first row first col to B
# Assuming system package tex4ht is installed, easily convert advanced
# LaTeX tables to html
getHdata(pbc)
s <- summaryM(bili + albumin + stage + protime + sex + age + spiders ~ drug,
              data=pbc, test=TRUE)
w <- latex(s, npct='slash', file='s.tex')
z <- html(w)
browseURL(z$file)
d <- describe(pbc)
w <- latex(d, file='d.tex')
z <- html(w)
browseURL(z$file)
## End(Not run)
htmltabv
Description
Simple HTML Table of Verbatim Output
Usage
htmltabv(..., cols = 2, propts = list(quote = FALSE))
Arguments
... | objects to |
cols | number of columns in the html table |
propts | an option list of arguments to pass to the |
Details
Uses capture.output to capture as character strings the results of running print() on each element of .... If an element of ... has length of 1 and is a blank string, nothing is printed for that cell other than its name (not in verbatim).
Value
character string of html
Author(s)
Frank Harrell
Generic Functions and Methods for Imputation
Description
These functions do simple and transcan imputation and print, summarize, and subscript variables that have NAs filled-in with imputed values. The simple imputation method involves filling in NAs with constants, with a specified single-valued function of the non-NAs, or from a sample (with replacement) from the non-NA values (this is useful in multiple imputation). More complex imputations can be done with the transcan function, which also works with the generic methods shown here, i.e., impute can take a transcan object and use the imputed values created by transcan (with imputed=TRUE) to fill-in NAs. The print method places * after variable values that were imputed. The summary method summarizes all imputed values and then uses the next summary method available for the variable. The subscript method preserves attributes of the variable and subsets the list of imputed values corresponding with how the variable was subsetted. The is.imputed function is for checking if observations are imputed.
Usage
impute(x, ...)
## Default S3 method:
impute(x, fun=median, ...)
## S3 method for class 'impute'
print(x, ...)
## S3 method for class 'impute'
summary(object, ...)
is.imputed(x)
Arguments
x | a vector or an object created by |
fun | the name of a function to use in computing the (single) imputed value from the non-NAs. The default is |
object | an object of class |
... | ignored |
Value
a vector with class "impute" placed in front of existing classes. For is.imputed, a vector of logical values is returned (all FALSE if object is not of class impute).
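The simple imputation method, filling NAs with a single-valued function of the non-NAs and remembering which positions were filled, can be sketched in base R. This is only an illustration: impute_sketch is a hypothetical name, and recording the filled positions in an "imputed" attribute is an assumption of the sketch; it does not create an object of class "impute" or support constants and random sampling as impute does.

```r
# Hypothetical sketch of single-value imputation
impute_sketch <- function(x, fun = median) {
  miss <- is.na(x)
  x[miss] <- fun(x[! miss])            # one value computed from the non-NAs
  structure(x, imputed = which(miss))  # remember which positions were filled
}

age   <- c(1, 2, NA, 4)
age.i <- impute_sketch(age)
as.vector(age.i)          # NA replaced by median(c(1, 2, 4)), which is 2
attr(age.i, 'imputed')    # position 3 was filled in
```

Keeping the indexes of filled values with the vector is what allows later printing or summarizing to distinguish observed from imputed observations.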
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
See Also
transcan, impute.transcan, describe, na.include, sample
Examples
age <- c(1,2,NA,4)
age.i <- impute(age)
# Could have used impute(age,2.5), impute(age,mean), impute(age,"random")
age.i
summary(age.i)
is.imputed(age.i)
intMarkovOrd
Description
Compute Parameters for Proportional Odds Markov Model
Usage
intMarkovOrd(
  y,
  times,
  initial,
  absorb = NULL,
  intercepts,
  extra = NULL,
  g,
  target,
  t,
  ftarget = NULL,
  onlycrit = FALSE,
  constraints = NULL,
  printsop = FALSE,
  ...
)
Arguments
y | vector of possible y values in order (numeric, character, factor) |
times | vector of measurement times |
initial | initial value of |
absorb | vector of absorbing states, a subset of |
intercepts | vector of initial guesses for the intercepts |
extra | an optional vector of initial guesses for other parameters passed to |
g | a user-specified function of three or more arguments which in order are |
target | vector of target state occupancy probabilities at time |
t | target times. Can have more than one element only if |
ftarget | an optional function defining constraints that relate to transition probabilities. The function returns a penalty which is a sum of absolute differences in probabilities from target probabilities over possibly multiple targets. The |
onlycrit | set to |
constraints | a function of two arguments: the vector of current intercept values and the vector of |
printsop | set to |
... | optional arguments to pass to |
Details
Given a vector intercepts of initial guesses at the intercepts in a Markov proportional odds model, and a vector extra if there are other parameters, solves for the intercepts and extra vectors that yield a set of occupancy probabilities at time t that equal, as closely as possible, a vector of target values.
Value
list containing two vectors named intercepts and extra unless onlycrit=TRUE in which case the best achieved sum of absolute errors is returned
Author(s)
Frank Harrell
See Also
https://hbiostat.org/R/Hmisc/markov/
knitr Setup and plotly Service Function
Description
knitrSet sets up knitr to use better default parameters for base graphics, better code formatting, and to allow several arguments to be passed from code chunk headers, such as bty, mfrow, ps, bot (extra bottom margin for base graphics), top (extra top margin), left (extra left margin), rt (extra right margin), lwd, mgp, las, tcl, axes, xpd, h (usually fig.height in knitr), w (usually fig.width in knitr), wo (out.width in knitr), ho (out.height in knitr), cap (character string containing figure caption), scap (character string containing short figure caption for table of figures). The capfile argument facilitates auto-generating a table of figures for certain Rmarkdown report themes. This is done by the addition of a hook function that appends data to the capfile file each time a chunk runs that has a long or short caption in the chunk header.
plotlySave saves a plotly graphic with name foo.png where foo is the name of the current chunk. You must have a free plotly account from plot.ly to use this function, and you must have run Sys.setenv(plotly_username="your_plotly_username") and Sys.setenv(plotly_api_key="your_api_key"). The API key can be found in one's profile settings.
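As a minimal sketch of the setup step described above, the two environment variables can be set in one call and checked before plotlySave is used. The values shown are placeholders, not real credentials, and the check itself is an added convenience rather than something plotlySave is documented to require.

```r
# Placeholder values; substitute your own plotly account details
Sys.setenv(plotly_username = "your_plotly_username",
           plotly_api_key  = "your_api_key")

# Confirm both variables are non-empty before saving plotly graphics
creds <- Sys.getenv(c("plotly_username", "plotly_api_key"))
if (any(!nzchar(creds))) stop("plotly credentials are not set")
```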
Usage
knitrSet(basename=NULL, w=if(! bd) 4, h=if(! bd) 3, wo=NULL, ho=NULL,
         fig.path=if(length(basename)) basename else '',
         fig.align=if(! bd) 'center', fig.show='hold',
         fig.pos=if(! bd) 'htbp',
         fig.lp = if(! bd) paste('fig', basename, sep=':'),
         dev=switch(lang, latex='pdf', markdown='png',
                    blogdown=NULL, quarto=NULL),
         tidy=FALSE, error=FALSE,
         messages=c('messages.txt', 'console'),
         width=61, decinline=5, size=NULL, cache=FALSE,
         echo=TRUE, results='markup', capfile=NULL,
         lang=c('latex','markdown','blogdown','quarto'))
plotlySave(x, ...)
Arguments
basename | base name to be added in front of graphics filenames. |
w,h | default figure width and height in inches |
wo,ho | default figure rendering width and height, in integer pixels or percent as a character string, e.g. |
fig.path | path for figures. To put figures in a subdirectory specify e.g. |
fig.align,fig.show,fig.pos,fig.lp,tidy,cache,echo,results,error,size | see knitr documentation |
dev | graphics device, with default figured from |
messages | By default warning and other messages such as those from loading packages are sent to file |
width | text output width for R code and output |
decinline | number of digits to the right of the decimal point to round numeric values appearing inside Sexpr |
capfile | the name of a file in the current working directory that is used to accumulate chunk labels, figure cross-reference tags, and figure short captions (long captions if no short caption is defined) for the purpose of using |
lang | Default is |
x | a |
... | additional arguments passed to |
Author(s)
Frank Harrell
See Also
Examples
## Not run: 
# Typical call (without # comment symbols):
# <<echo=FALSE>>=
# require(Hmisc)
# knitrSet()
# @
knitrSet()    # use all defaults and don't use a graphics file prefix
knitrSet('modeling')   # use modeling- prefix for a major section or chapter
knitrSet(cache=TRUE, echo=FALSE)  # global default to cache and not print code
knitrSet(w=5,h=3.75)   # override default figure width, height
# ```{r chunkname}
# p <- plotly::plot_ly(...)
# plotlySave(p)   # creates fig.path/chunkname.png
## End(Not run)
Label Curves, Make Keys, and Interactively Draw Points and Curves
Description
labcurve optionally draws a set of curves then labels the curves. A variety of methods for drawing labels are implemented, ranging from positioning using the mouse to automatic labeling to automatic placement of key symbols with manual placement of key legends to automatic placement of legends. For automatic positioning of labels or keys, a curve is labeled at a point that is maximally separated from all of the other curves. Gaps occurring when curves do not start or end at the same x-coordinates are given preference for positioning labels. If labels are offset from the curves (the default behaviour), if the closest curve to curve i is above curve i, curve i is labeled below its line. If the closest curve is below curve i, curve i is labeled above its line. These directions are reversed if the resulting labels would appear outside the plot region.
Both ordinary lines and step functions are handled, and there is an option to draw the labels at the same angle as the curve within a local window.
Unless the mouse is used to position labels or plotting symbols are placed along the curves to distinguish them, curves are examined at 100 (by default) equally spaced points over the range of x-coordinates in the current plot area. Linear interpolation is used to get y-coordinates to line up (step function or constant interpolation is used for step functions). There is an option to instead examine all curves at the set of unique x-coordinates found by unioning the x-coordinates of all the curves. This option is especially useful when plotting step functions. By setting adj="auto" you can have labcurve try to optimally left- or right-justify labels depending on the slope of the curves at the points at which labels would be centered (plus a vertical offset). This is especially useful when labels must be placed on steep curve sections.
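The grid-and-interpolate idea described above can be sketched in base R. This is a simplified illustration, not labcurve's own code: it assumes three straight-line curves on a common x-range, uses approx for linear interpolation, and picks for each curve the grid point where the vertical gap to the nearest other curve is largest. The object names are hypothetical.

```r
# Three illustrative curves, each a list with x and y components
curves <- list(
  a = list(x = c(0, 1), y = c(0, 1)),
  b = list(x = c(0, 1), y = c(0.5, 0.5)),
  c = list(x = c(0, 1), y = c(1, 0))
)

# Examine all curves at 100 equally spaced x-values
xs <- seq(0, 1, length.out = 100)
ymat <- sapply(curves, function(cv) approx(cv$x, cv$y, xout = xs)$y)

# For each curve, find the x where the closest other curve is farthest away
label_x <- sapply(seq_along(curves), function(i) {
  gap <- apply(abs(ymat[, -i, drop = FALSE] - ymat[, i]), 1, min)
  xs[which.max(gap)]
})
names(label_x) <- names(curves)
label_x
```

For step functions, approx(..., method = "constant") would play the role of the constant interpolation mentioned in the text.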
You can use the on top method to write (short) curve names directly on the curves (centered on the y-coordinate). This is especially useful when there are many curves whose full labels would run into each other. You can plot letters or numbers on the curves, for example (using the keys option), and have labcurve use the key function to provide long labels for these short ones (see the end of the example). There is another option for connecting labels to curves using arrows. When keys is a vector of integers, it is taken to represent plotting symbols (pchs), and these symbols are plotted at equally-spaced x-coordinates on each curve (by default, using 5 points per curve). The points are offset in the x-direction between curves so as to minimize the chance of collisions.
To add a legend defining line types, colors, or line widths with no symbols, specify keys="lines", e.g., labcurve(curves, keys="lines", lty=1:2).
putKey provides a different way to use key() by allowing the user to specify vectors for labels, line types, plotting characters, etc. Elements that do not apply (e.g., pch for lines (type="l")) may be NA. When a series of points is represented by both a symbol and a line, the corresponding elements of both pch and lty, col., or lwd will be non-missing.
putKeyEmpty, given vectors of all the x-y coordinates that have been plotted, uses largest.empty to find the largest empty rectangle large enough to hold the key, and draws the key using putKey.
drawPlot is a simple mouse-driven function for drawing series of lines, step functions, polynomials, Bezier curves, and points, and automatically labeling the point groups using labcurve or putKeyEmpty. When drawPlot is invoked it creates temporary functions Points, Curve, and Abline. The user calls these functions inside the call to drawPlot to define groups of points in the order they are defined with the mouse. Abline is used to call abline and not actually create a group of points. For some curve types, the curve generated to represent the corresponding series of points is drawn after all points are entered for that series, and this curve may be different than the simple curve obtained by connecting points at the mouse clicks. For example, to draw a general smooth Bezier curve the user need only click on a few points, and she must overshoot the final curve coordinates to define the curve. The originally entered points are not erased once the curve is drawn. The same goes for step functions and polynomials. If you plot() the object returned by drawPlot, however, only final curves will be shown. The last examples show how to use drawPlot.
The largest.empty function finds the largest rectangle that is large enough to hold a rectangle of a given height and width, such that the rectangle does not contain any of a given set of points. This is used by labcurve and putKeyEmpty to position keys at the most empty part of an existing plot. The default method was created by Hans Borchers.
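To make the empty-rectangle idea concrete, here is a deliberately simplified base R sketch. It is not the algorithm used by largest.empty: it only searches a grid of candidate centers for a fixed-size rectangle that contains none of the points, and among those picks the center farthest (in the maximum-coordinate sense) from its nearest point. All names are hypothetical.

```r
set.seed(1)
px <- runif(50); py <- runif(50)   # points already on the plot
w <- 0.1; h <- 0.1                 # required rectangle width and height

# Candidate rectangle centers on a 25 x 25 grid inside the unit square
cand <- expand.grid(cx = seq(w / 2, 1 - w / 2, length.out = 25),
                    cy = seq(h / 2, 1 - h / 2, length.out = 25))

# Number of points falling inside the rectangle at each candidate center
n_inside <- mapply(function(cx, cy)
  sum(abs(px - cx) <= w / 2 & abs(py - cy) <= h / 2), cand$cx, cand$cy)

# Distance from each center to its nearest point (max of |dx|, |dy|)
nearest <- mapply(function(cx, cy)
  min(pmax(abs(px - cx), abs(py - cy))), cand$cx, cand$cy)

# Among empty candidates, keep the one with the most room around it
empty <- n_inside == 0
best <- cand[empty, ][which.max(nearest[empty]), ]
best
```

A key could then be centered at (best$cx, best$cy); the real function additionally tries to make the rectangle as large as possible.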
Usage
labcurve(curves, labels=names(curves),
         method=NULL, keys=NULL, keyloc=c("auto","none"),
         type="l", step.type=c("left", "right"),
         xmethod=if(any(type=="s")) "unique" else "grid",
         offset=NULL, xlim=NULL,
         tilt=FALSE, window=NULL, npts=100, cex=NULL,
         adj="auto", angle.adj.auto=30,
         lty=pr$lty, lwd=pr$lwd, col.=pr$col, transparent=TRUE,
         arrow.factor=1, point.inc=NULL, opts=NULL, key.opts=NULL,
         empty.method=c('area','maxdim'), numbins=25,
         pl=!missing(add), add=FALSE,
         ylim=NULL, xlab="", ylab="",
         whichLabel=1:length(curves),
         grid=FALSE, xrestrict=NULL, ...)
putKey(z, labels, type, pch, lty, lwd,
       cex=par('cex'), col=rep(par('col'),nc),
       transparent=TRUE, plot=TRUE, key.opts=NULL, grid=FALSE)
putKeyEmpty(x, y, labels, type=NULL,
            pch=NULL, lty=NULL, lwd=NULL,
            cex=par('cex'), col=rep(par('col'),nc),
            transparent=TRUE, plot=TRUE, key.opts=NULL,
            empty.method=c('area','maxdim'),
            numbins=25,
            xlim=pr$usr[1:2], ylim=pr$usr[3:4], grid=FALSE)
drawPlot(..., xlim=c(0,1), ylim=c(0,1), xlab='', ylab='',
         ticks=c('none','x','y','xy'),
         key=FALSE, opts=NULL)
# Points(label=' ', type=c('p','r'),
#        n, pch=pch.to.use[1], cex=par('cex'), col=par('col'),
#        rug = c('none','x','y','xy'), ymean)
# Curve(label=' ',
#       type=c('bezier','polygon','linear','pol','loess','step','gauss'),
#       n=NULL, lty=1, lwd=par('lwd'), col=par('col'), degree=2,
#       evaluation=100, ask=FALSE)
# Abline(...)
## S3 method for class 'drawPlot'
plot(x, xlab, ylab, ticks, key=x$key, keyloc=x$keyloc, ...)
largest.empty(x, y, width=0, height=0,
              numbins=25, method=c('exhaustive','rexhaustive','area','maxdim'),
              xlim=pr$usr[1:2], ylim=pr$usr[3:4],
              pl=FALSE, grid=FALSE)
Arguments
curves | a list of lists, each of which have at least two components: a vector of |
z | a two-element list specifying the coordinate of the center of the key, e.g. |
labels | For |
x | see below |
y | for |
... | For |
width | see below |
height | for |
method | For |
keys | This causes keys (symbols or short text) to be drawn on or beside curves, and if |
keyloc | When |
type | for |
step.type | type of step functions used (default is |
xmethod | method for generating the unique set of x-coordinates to examine (see above). Default is |
offset | distance in y-units between the center of the label and the line being labeled. Default is 0.75 times the height of an "m" that would be drawn in a label. For R grid/lattice you must specify offset using the |
xlim | limits for searching for label positions, and is also used to set up plots when |
tilt | set to |
window | width of a window, in x-units, to use in determining the local slope for tilting labels. Default is 0.5 times number of characters in the label times the x-width of an "m" in the current character size and font. |
npts | number of points to use if |
cex | character size to pass to |
adj | Default is |
angle.adj.auto | see |
lty | vector of line types which were used to draw the curves. This is only used when keys are drawn. If all of the line types, line widths, and line colors are the same, lines are not drawn in the key. |
lwd | vector of line widths which were used to draw the curves. This is only used when keys are drawn. See |
col. | vector of integer color numbers |
col | vector of integer color numbers for use in curve labels, symbols, lines, and legends. Default is |
transparent | Default is |
arrow.factor | factor by which to multiply default arrow lengths |
point.inc | When |
opts | an optional list which can be used to specify any of the options to |
key.opts | a list of extra arguments you wish to pass to |
empty.method | see below |
numbins | These two arguments are passed to the |
pl | set to |
add | By default, when curves are actually drawn by |
ylim | When a plot has already been started, |
xlab | see below |
ylab | x-axis and y-axis labels when |
whichLabel | integer vector corresponding to |
grid | set to |
xrestrict | When having |
pch | vector of plotting characters for |
plot | set to |
ticks | tells |
key | for |
Details
The internal functions Points, Curve, Abline have unique arguments as follows.
label: for Points and Curve is a single character string to label that group of points
n: number of points to accept from the mouse. Default is to input points until a right mouse click.
rug: for Points. Default is "none" to not show the marginal x or y distributions as rug plots, for the points entered. Other possibilities are used to execute scat1d to show the marginal distribution of x, y, or both as rug plots.
ymean: for Points, subtracts a constant from each y-coordinate entered to make the overall mean ymean
degree: degree of polynomial to fit to points by Curve
evaluation: number of points at which to evaluate Bezier curves, polynomials, and other functions in Curve
ask: set ask=TRUE to give the user the opportunity to try again at specifying points for Bezier curves, step functions, and polynomials
The labcurve function used some code from the function plot.multicurve written by Rod Tjoelker of The Boeing Company (tjoelker@espresso.rt.cs.boeing.com).
If there is only one curve, a label is placed at the middle x-value, and no fancy features such as angle or positive/negative offsets are used.
key is called once (with the argument plot=FALSE) to find the key dimensions. Then an empty rectangle with at least these dimensions is searched for using largest.empty. Then key is called again to draw the key there, using the argument corner=c(.5,.5) so that the center of the rectangle can be specified to key.
If you want to plot the data, an easier way to use labcurve is through xYplot as shown in some of its examples.
Value
labcurve returns an invisible list with components x, y, offset, adj, cex, col, and, if tilt=TRUE, angle. offset is the amount to add to y to draw a label. offset is negative if the label is drawn below the line. adj is a vector containing the values 0, .5, 1.
largest.empty returns a list with elements x and y specifying the coordinates of the center of the rectangle which was found, and element rect containing the 4 x and y coordinates of the corners of the found empty rectangle. The area of the rectangle is also returned.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
See Also
approx, text, legend, scat1d, xYplot, abline
Examples
n <- 2:8
m <- length(n)
type <- c('l','l','l','l','s','l','l')
# s=step function l=ordinary line (polygon)
curves <- vector('list', m)
plot(0,1,xlim=c(0,1),ylim=c(-2.5,4),type='n')
set.seed(39)
for(i in 1:m) {
  x <- sort(runif(n[i]))
  y <- rnorm(n[i])
  lines(x, y, lty=i, type=type[i], col=i)
  curves[[i]] <- list(x=x,y=y)
}
labels <- paste('Label for',letters[1:m])
labcurve(curves, labels, tilt=TRUE, type=type, col=1:m)

# Put only single letters on curves at points of
# maximum space, and use key() to define the letters,
# with automatic positioning of the key in the most empty
# part of the plot
# Have labcurve do the plotting, leaving extra space for key
names(curves) <- labels
labcurve(curves, keys=letters[1:m], type=type, col=1:m,
         pl=TRUE, ylim=c(-2.5,4))

# Put plotting symbols at equally-spaced points,
# with a key for the symbols, ignoring line types
labcurve(curves, keys=1:m, lty=1, type=type, col=1:m,
         pl=TRUE, ylim=c(-2.5,4))

# Plot and label two curves, with line parameters specified with data
set.seed(191)
ages.f <- sort(rnorm(50,20,7))
ages.m <- sort(rnorm(40,19,7))
height.f <- pmin(ages.f,21)*.2+60
height.m <- pmin(ages.m,21)*.16+63
labcurve(list(Female=list(ages.f,height.f,col=2),
              Male  =list(ages.m,height.m,col=3,lty='dashed')),
         xlab='Age', ylab='Height', pl=TRUE)
# add ,keys=c('f','m') to label curves with single letters
# For S-Plus use lty=2

# Plot power for testing two proportions vs. n for various odds ratios,
# using 0.1 as the probability of the event in the control group.
# A separate curve is plotted for each odds ratio, and the curves are
# labeled at points of maximum separation
n  <- seq(10, 1000, by=10)
OR <- seq(.2,.9,by=.1)
pow <- lapply(OR, function(or,n)list(x=n,y=bpower(p1=.1,odds.ratio=or,n=n)),
              n=n)
names(pow) <- format(OR)
labcurve(pow, pl=TRUE, xlab='n', ylab='Power')

# Plot some random data and find the largest empty rectangle
# that is at least .1 wide and .1 tall
x <- runif(50)
y <- runif(50)
plot(x, y)
z <- largest.empty(x, y, .1, .1)
z
points(z,pch=3)  # mark center of rectangle, or
polygon(z$rect, col='blue')  # to draw the rectangle, or
#key(z$x, z$y, ... stuff for legend)

# Use the mouse to draw a series of points using one symbol, and
# two smooth curves or straight lines (if two points are clicked),
# none of these being labeled
# d <- drawPlot(Points(), Curve(), Curve())
# plot(d)

## Not run: 
# Use the mouse to draw a Gaussian density, two series of points
# using 2 symbols, one Bezier curve, a step function, and raw data
# along the x-axis as a 1-d scatter plot (rug plot). Draw a key.
# The density function is fit to 3 mouse clicks
# Abline draws a dotted horizontal reference line
d <- drawPlot(Curve('Normal',type='gauss'),
              Points('female'), Points('male'),
              Curve('smooth',ask=TRUE,lty=2), Curve('step',type='s',lty=3),
              Points(type='r'), Abline(h=.5, lty=2),
              xlab='X', ylab='y', xlim=c(0,100), key=TRUE)
plot(d, ylab='Y')
plot(d, key=FALSE)  # label groups using labcurve
## End(Not run)
Label Attribute of an Object
Description
label(x) retrieves the label attribute of x. label(x) <- "a label" stores the label attribute, and also puts the class labelled as the first class of x (for S-Plus this class is not used and methods for handling this class are not defined so the "label" and "units" attributes are lost upon subsetting). The reason for having this class is so that the subscripting method for labelled, [.labelled, can preserve the label attribute in S. Also, the print method for labelled objects prefaces the print with the object's label (and units if there). If the variable is also given a "units" attribute using the units function, subsetting the variable (using [.labelled) will also retain the "units" attribute.
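Why a class is needed to keep an attribute through subsetting can be shown with a small base R stand-in. The class name mylab and its method below are hypothetical and only mimic the idea behind [.labelled; they are not Hmisc's code.

```r
# A vector carrying a "label" attribute and a hypothetical class "mylab"
x <- structure(1:10, label = "Systolic BP", class = c("mylab", "integer"))

# Subsetting method: subset as usual, then restore the label and class
"[.mylab" <- function(x, ...) {
  lab <- attr(x, "label")
  out <- NextMethod("[")
  attr(out, "label") <- lab
  class(out) <- oldClass(x)
  out
}

attr(x[1:3], "label")            # label survives subsetting via the method
attr(unclass(x)[1:3], "label")   # without the class, the label is dropped
```

The second call returns NULL because default subsetting keeps only names, dim, and dimnames.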
label can optionally append a "units" attribute to the string, and it can optionally return a string or expression (for R's plotmath facility) suitable for plotting. labelPlotmath is a function that also provides this capability, when the input arguments are the 'label' and 'units' rather than a vector having those attributes. When plotmath mode is used to construct labels, the 'label' or 'units' may contain math expressions but they are typed verbatim if they contain percent signs, blanks, or underscores. labelPlotmath can optionally create the expression as a character string, which is useful in building ggplot commands.
For Surv objects, label first looks to see if there is an overall "label" attribute for the object, then it looks for saved attributes that Surv put in the "inputAttributes" object, looking first at the event variable, then time2, and finally time. You can restrict the looking by specifying type.
labelLatex constructs suitable LaTeX labels from a variable or from the label and units arguments, optionally right-justifying units if hfill=TRUE. This is useful when making tables when the variable in question is not a column heading. If x is specified, label and units values are extracted from its attributes instead of from the other arguments.
Label (actually Label.data.frame) is a function which generates S source code that makes the labels in all the variables in a data frame easy to edit.
llist is like list except that it preserves the names or labels of the component variables in the variables' label attribute. This can be useful when looping over variables or using sapply or lapply. By using llist instead of list one can annotate the output with the current variable's name or label. llist also defines a names attribute for the list and pulls the names from the arguments' expressions for non-named arguments.
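The naming behaviour described for non-named arguments can be sketched in base R with substitute and deparse. The function named_list below is a hypothetical stand-in written for illustration; it does not set label attributes and is not how llist is implemented.

```r
# Hypothetical helper: like list(), but unnamed arguments are named
# after the expressions that were typed for them
named_list <- function(...) {
  out <- list(...)
  exprs <- vapply(as.list(substitute(list(...)))[-1],
                  function(e) paste(deparse(e), collapse = ""), "")
  nm <- names(out)
  if (is.null(nm)) nm <- rep("", length(out))
  names(out) <- ifelse(nm == "", exprs, nm)
  out
}

a <- 1:3
b <- 4:6
w <- named_list(a, b > 5, d = 101:103)
names(w)
```

Looping over w with lapply or a for loop then gives access to a meaningful name for each element.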
prList prints a list with element names (without the dollar sign as in default list printing) and if an element of the list is an unclassed list with a name, all of those elements are printed, with titles of the form "primary list name : inner list name". This is especially useful for Rmarkdown html notebooks when a user-written function creates multiple html and graphical outputs to all be printed in a code chunk. Optionally the names can be printed after the object, and the htmlfig option provides more capabilities when making html reports. prList does not work for regular html documents.
putHfig is similar to prList but for a single graphical object that is rendered with a print method, making it easy to specify long captions, and short captions for the table of contents in HTML documents. Table of contents entries are generated with the short caption, which is taken as the long caption if there is none. One can optionally not make a table of contents entry. If argument table=TRUE table captions will be produced instead. Using expcoll, the markupSpecs html function expcoll will be used to make tables expand upon clicking an arrow rather than always appear.
putHcap is like putHfig except that it assumes that users render the graphics or table outside of the putHcap call. This allows things to work in ordinary html documents. putHcap does not handle collapsed text.
plotmathTranslate is a simple function that translates certain character strings to character strings that can be used as part of R plotmath expressions. If the input string has a space or percent inside, the string is surrounded by a call to plotmath's paste function.
as.data.frame.labelled is a utility function that is called by [.data.frame. It is just a copy of as.data.frame.vector. data.frame.labelled is another utility function, that adds a class "labelled" to every variable in a data frame that has a "label" attribute but not a "labelled" class.
relevel.labelled is a method for preserving labels with the relevel function.
reLabelled is used to add a 'labelled' class back to variables in data frame that have a 'label' attribute but no 'labelled' class. Useful for changing cleanup.import()'d S-Plus data frames back to general form for R and old versions of S-Plus.
Usage
label(x, default=NULL, ...)
## Default S3 method:
label(x, default=NULL, units=plot, plot=FALSE,
      grid=FALSE, html=FALSE, ...)
## S3 method for class 'Surv'
label(x, default=NULL, units=plot, plot=FALSE,
      grid=FALSE, html=FALSE, type=c('any', 'time', 'event'), ...)
## S3 method for class 'data.frame'
label(x, default=NULL, self=FALSE, ...)
label(x, ...) <- value
## Default S3 replacement method:
label(x, ...) <- value
## S3 replacement method for class 'data.frame'
label(x, self=TRUE, ...) <- value
labelPlotmath(label, units=NULL, plotmath=TRUE, html=FALSE, grid=FALSE,
              chexpr=FALSE)
labelLatex(x=NULL, label='', units='', size='smaller[2]',
           hfill=FALSE, bold=FALSE, default='', double=FALSE)
## S3 method for class 'labelled'
print(x, ...)   ## or x - calls print.labelled
Label(object, ...)
## S3 method for class 'data.frame'
Label(object, file='', append=FALSE, ...)
llist(..., labels=TRUE)
prList(x, lcap=NULL, htmlfig=0, after=FALSE)
putHfig(x, ..., scap=NULL, extra=NULL, subsub=TRUE, hr=TRUE,
        table=FALSE, file='', append=FALSE, expcoll=NULL)
putHcap(..., scap=NULL, extra=NULL, subsub=TRUE, hr=TRUE,
        table=FALSE, file='', append=FALSE)
plotmathTranslate(x)
data.frame.labelled(object)
## S3 method for class 'labelled'
relevel(x, ...)
reLabelled(object)
combineLabels(...)
Arguments
x | any object (for |
self | logical, whether to interact with the object or its components |
units | set to |
plot | set to |
default | if |
grid | Currently R's |
html | set to |
type | for |
label | a character string containing a variable's label |
plotmath | set to |
chexpr | set to |
size | LaTeX size for |
hfill | set to |
bold | set to |
double | set to |
value | the label of the object, or "". |
object | a data frame |
... | a list of variables or expressions to be formed into a |
file | the name of a file to which to write S source code. Default is |
append | set to |
labels | set to |
lcap | an optional vector of character strings corresponding to elements in |
htmlfig | for |
after | set to |
scap | a character string specifying the short (or possibly only) caption. |
extra | an optional vector of character strings. When present the long caption will be put in the first column of an HTML table and the elements of |
subsub | set to |
hr | applies if a caption is present. Specify |
table | set to |
expcoll | character string to be visible, with a clickable arrow following to allow initial hiding of a table and its captions. Cannot be used with |
Value
label returns the label attribute of x, if any; otherwise, "". label is used most often for the individual variables in data frames. The function sas.get copies labels over from SAS if they exist.
See Also
sas.get, describe, extractlabs, hlab
Examples
age <- c(21,65,43)
y   <- 1:3
label(age) <- "Age in Years"
plot(age, y, xlab=label(age))
data <- data.frame(age=age, y=y)
label(data)
label(data, self=TRUE) <- "A data frame"
label(data, self=TRUE)
x1 <- 1:10
x2 <- 10:1
label(x2) <- 'Label for x2'
units(x2) <- 'mmHg'
x2
x2[1:5]
dframe <- data.frame(x1, x2)
Label(dframe)
labelLatex(x2, hfill=TRUE, bold=TRUE)
labelLatex(label='Velocity', units='m/s')
##In these examples of llist, note that labels are printed after
##variable names, because of print.labelled
a <- 1:3
b <- 4:6
label(b) <- 'B Label'
llist(a,b)
llist(a,b,d=0)
llist(a,b,0)
w <- llist(a, b>5, d=101:103)
sapply(w, function(x){
  hist(as.numeric(x), xlab=label(x))
  # locator(1)   ## wait for mouse click
})
# Or: for(u in w) {hist(u); title(label(u))}
latestFile
Description
Find File With Latest Modification Time
Usage
latestFile(pattern, path = ".", verbose = TRUE)
Arguments
pattern | a regular expression; see |
path | full path, defaulting to current working directory |
verbose | set to |
Details
Subject to matching on pattern finds the last modified file, and if verbose is TRUE reports on how many total files matched pattern.
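The same idea can be expressed with base R file utilities. The function latest below is a hypothetical sketch, not the source of latestFile: it lists files matching a regular expression and returns the one with the most recent modification time, and its return format (a full path) is an assumption.

```r
# Hypothetical sketch: most recently modified file matching a pattern
latest <- function(pattern, path = ".", verbose = TRUE) {
  f <- list.files(path, pattern = pattern, full.names = TRUE)
  if (!length(f)) return(character(0))
  if (verbose) message(length(f), " file(s) matched ", pattern)
  f[which.max(file.mtime(f))]
}

# Small demonstration in a temporary directory
d <- tempfile("latest-demo")
dir.create(d)
old <- file.path(d, "a.txt"); new <- file.path(d, "b.txt")
writeLines("old", old); writeLines("new", new)
Sys.setFileTime(old, Sys.time() - 3600)   # make a.txt one hour older
res <- latest("\\.txt$", path = d, verbose = FALSE)
basename(res)
```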
Value
the name of the last modified file
Author(s)
Frank Harrell
See Also
Convert an S object to LaTeX, and Related Utilities
Description
latex converts its argument to a ‘.tex’ file appropriate for inclusion in a LaTeX2e document. latex is a generic function that calls one of latex.default, latex.function, latex.list.
latex.default does appropriate rounding and decimal alignment and produces a file containing a LaTeX tabular environment to print the matrix or data.frame x as a table.
latex.function prepares an S function for printing by issuing sed commands that are similar to those in the S.to.latex procedure in the s.to.latex package (Chambers and Hastie, 1993). latex.function can also produce verbatim output or output that works with the Sweave LaTeX style.
latex.list calls latex recursively for each element in the argument.
latexTranslate translates particular items in character strings to LaTeX format, e.g., makes ‘a^2 = a\$^2\$’ for superscript within variable labels. LaTeX names of greek letters (e.g., "alpha") will have backslashes added if greek==TRUE. Math mode is inserted as needed. latexTranslate assumes that input text always has matches, e.g. [) [] (] (), and that surrounding by ‘\$\$’ is OK.
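The superscript translation mentioned above can be approximated with a single regular expression in base R. This is only a sketch of the idea under simplifying assumptions (alphanumeric exponents, no nesting); the function sup_to_latex is hypothetical and the exact strings latexTranslate produces may differ.

```r
# Hypothetical sketch: wrap ^exponent in LaTeX math mode
sup_to_latex <- function(s) {
  gsub("\\^([[:alnum:]]+)", "$^{\\1}$", s)
}

sup_to_latex("a^2")
sup_to_latex("Area (m^2) by dose^10")
```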
htmlTranslate is similar to latexTranslate but for html translation. It doesn't need math mode and assumes dollar signs are just that.
latexSN converts a vector of floating point numbers to character strings using LaTeX exponents. Dollar signs to enter math mode are not added. Similarly, htmlSN converts to scientific notation in html.
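A rough base R sketch of converting numbers to LaTeX-style scientific notation follows. The function sn_latex and its output format are assumptions made for illustration; latexSN's actual formatting rules are not reproduced here, and the sketch does not handle zero or non-finite values. As in the description above, no dollar signs are added.

```r
# Hypothetical sketch: mantissa times power of ten, in LaTeX syntax
sn_latex <- function(x, digits = 3) {
  e <- floor(log10(abs(x)))         # exponent of each value
  m <- signif(x / 10^e, digits)     # mantissa rounded to 'digits'
  paste0(m, "\\times 10^{", e, "}")
}

sn_latex(c(1.5e-7, 2500))
```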
latexVerbatim on an object executes the object's print method, capturing the output for a file inside a LaTeX verbatim environment.
dvi uses the system latex command to compile LaTeX code produced by latex, including any needed styles. dvi will put a ‘\documentclass{report}’ and ‘\end{document}’ wrapper around a file produced by latex. By default, the ‘geometry’ LaTeX package is used to omit all margins and to set the paper size to a default of 5.5in wide by 7in tall. The result of dvi is a .dvi file. To both format and screen display a non-default size, use for example print(dvi(latex(x), width=3, height=4),width=3,height=4). Note that you can use something like ‘xdvi -geometry 460x650 -margins 2.25in file’ without changing LaTeX defaults to emulate this.
dvips will use the system dvips command to print the .dvi file to the default system printer, or create a postscript file if file is specified.
dvigv uses dvi to convert the input object to a .dvi file, and uses the system dvips command to convert it to postscript. Then the postscript file is displayed using Ghostview (assumed to be the system command gv).
There are show methods for displaying typeset LaTeX on the screen using the system xdvi command. If you show a LaTeX file created by latex without running it through dvi using show.dvi(object), the show method will run it through dvi automatically. These show methods are not S Version 4 methods so you have to use full names such as show.dvi and show.latex. Use the print methods for more automatic display of typesetting, e.g. typing latex(x) will invoke xdvi to view the typeset document.
Usage
latex(object, ...)
## Default S3 method:
latex(object,
      title=first.word(deparse(substitute(object))),
      file=paste(title, ".tex", sep=""),
      append=FALSE, label=title,
      rowlabel=title, rowlabel.just="l",
      cgroup=NULL, n.cgroup=NULL,
      rgroup=NULL, n.rgroup=NULL,
      cgroupTexCmd="bfseries",
      rgroupTexCmd="bfseries",
      rownamesTexCmd=NULL,
      colnamesTexCmd=NULL,
      cellTexCmds=NULL,
      rowname, cgroup.just=rep("c",length(n.cgroup)),
      colheads=NULL,
      extracolheads=NULL, extracolsize='scriptsize',
      dcolumn=FALSE, numeric.dollar=!dcolumn, cdot=FALSE,
      longtable=FALSE, draft.longtable=TRUE, ctable=FALSE, booktabs=FALSE,
      table.env=TRUE, here=FALSE, lines.page=40,
      caption=NULL, caption.lot=NULL, caption.loc=c('top','bottom'),
      star=FALSE,
      double.slash=FALSE,
      vbar=FALSE, collabel.just=rep("c",nc), na.blank=TRUE,
      insert.bottom=NULL, insert.bottom.width=NULL,
      insert.top=NULL,
      first.hline.double=!(booktabs | ctable),
      where='!tbp', size=NULL,
      center=c('center','centering','centerline','none'),
      landscape=FALSE,
      multicol=TRUE,
      math.row.names=FALSE, already.math.row.names=FALSE,
      math.col.names=FALSE, already.math.col.names=FALSE,
      hyperref=NULL, continued='continued',
      ...) # x is a matrix or data.frame
## S3 method for class 'function'
latex(object,
      title=first.word(deparse(substitute(object))),
      file=paste(title, ".tex", sep=""),
      append=FALSE,
      assignment=TRUE, type=c('example','verbatim','Sinput'),
      width.cutoff=70, size='', ...)
## S3 method for class 'list'
latex(object,
      title=first.word(deparse(substitute(object))),
      file=paste(title, ".tex", sep=""),
      append=FALSE,
      label,
      caption,
      caption.lot,
      caption.loc=c('top','bottom'),
      ...)
## S3 method for class 'latex'
print(x, ...)
latexTranslate(object, inn=NULL, out=NULL, pb=FALSE, greek=FALSE, na='',
               ...)
htmlTranslate(object, inn=NULL, out=NULL, greek=FALSE, na='',
              code=htmlSpecialType(), ...)
latexSN(x)
htmlSN(x, pretty=TRUE, ...)
latexVerbatim(x, title=first.word(deparse(substitute(x))),
              file=paste(title, ".tex", sep=""),
              append=FALSE, size=NULL, hspace=NULL,
              width=.Options$width, length=.Options$length, ...)
dvi(object, ...)
## S3 method for class 'latex'
dvi(object, prlog=FALSE, nomargins=TRUE, width=5.5, height=7, ...)
## S3 method for class 'dvi'
print(x, ...)
dvips(object, ...)
## S3 method for class 'latex'
dvips(object, ...)
## S3 method for class 'dvi'
dvips(object, file, ...)
## S3 method for class 'latex'
show(object)  # or show.dvi(object) or just object
dvigv(object, ...)
## S3 method for class 'latex'
dvigv(object, ...)       # or gvdvi(dvi(object))
## S3 method for class 'dvi'
dvigv(object, ...)
Arguments
object | For |
x | any object to be |
title | name of file to create without the ‘.tex’ extension. If this option is not set, value/string of |
file | name of the file to create. The default file name is ‘x.tex’ where |
append | defaults to |
label | a text string representing a symbolic label for the table for referencingin the LaTeX ‘\label’ and ‘\ref’ commands. |
rowlabel | If |
rowlabel.just | If |
cgroup | a vector of character strings defining major column headings. The default is to have none. |
n.cgroup | a vector containing the number of columns for which each element in cgroup is a heading. For example, specify |
rgroup | a vector of character strings containing headings for row groups. |
n.rgroup | integer vector giving the number of rows in each grouping. If |
cgroupTexCmd | A character string specifying a LaTeX command to be used to format column group labels. The default, |
rgroupTexCmd | A character string specifying a LaTeX command to be used to format row group labels. The default, |
rownamesTexCmd | A character string specifying a LaTeX command to be used to format rownames. The default, |
colnamesTexCmd | A character string specifying a LaTeX command to be used to format column labels. The default, |
cellTexCmds | A matrix of character strings which are LaTeX commands to be used to format each element, or cell, of the object. The matrix must have the same |
na.blank | Set to |
insert.bottom | an optional character string to typeset at the bottom of the table. For |
insert.bottom.width | character string; a tex width controlling the width of the insert.bottom text. Currently only does something with using |
insert.top | a character string to insert as a heading right before beginning |
first.hline.double | set to |
rowname | rownames for |
cgroup.just | justification for labels for column groups. Defaults to |
colheads | a character vector of column headings if you don't want to use |
extracolheads | an optional vector of extra column headings that will appear under the main headings (e.g., sample sizes). This character vector does not need to include an empty space for any |
extracolsize | size for |
dcolumn | see |
numeric.dollar | logical, default |
math.row.names | logical, set true to place dollar signs around the row names. |
already.math.row.names | set to |
math.col.names | logical, set true to place dollar signs around the column names. |
already.math.col.names | set to |
hyperref | if |
continued | a character string used to indicate pages after the first when making a long table |
cdot | see |
longtable | Set to |
draft.longtable | I forgot what this does. |
ctable | set to |
booktabs | set |
table.env | Set |
here | Set to |
lines.page | Applies if |
caption | a text string to use as a caption to print at the top of the first page of the table. Default is no caption. |
caption.lot | a text string representing a short caption to be used in the “List of Tables”. By default, LaTeX will use |
caption.loc | set to |
star | apply the star option for ctables to allow a table to spread over two columns when in twocolumn mode. |
double.slash | set to |
vbar | logical. When |
collabel.just | justification for column labels. |
assignment | logical. When |
where | specifies placement of floats if a table environment is used. Default is |
size | size of table text if a size change is needed (default is no change). For example you might specify |
center | default is |
landscape | set to |
type | The default uses the S |
width.cutoff | width of function text output in columns; see |
... | other arguments are accepted and ignored except that |
inn,out | specify additional input and translated strings over the usual defaults |
pb | If |
greek | set to |
na | single character string to translate |
code | set to |
pretty | set to |
hspace | horizontal space, e.g., extra left margin for verbatim text. Default is none. Use e.g. |
length | for S-Plus only; is the length of the output page for printing and capturing verbatim text |
width,height | are the |
prlog | set to |
multicol | set to |
nomargins | set to |
Details
latex.default optionally outputs a LaTeX comment containing the calling statement. To output this comment, run options(omitlatexcom=FALSE) before running latex. The default behavior of suppressing the comment is helpful when running RMarkdown to produce pdf output using LaTeX, as this uses pandoc, which is fooled into trying to escape the percent comment symbol.
If running under Windows and using MikTeX, latex and yap must be in your system path, and yap is used to browse ‘.dvi’ files created by latex. You should install the ‘geometry.sty’ and ‘ctable.sty’ styles in MikTeX to make optimum use of latex().
On Mac OS X, you may have to append the ‘/usr/texbin’ directory to the system path. Thanks to Kevin Thorpe (kevin.thorpe@utoronto.ca), one way to set up Mac OS X is to install ‘X11’ and ‘X11SDK’ if not already installed, start ‘X11’ within the R GUI, and issue the command Sys.setenv( PATH=paste(Sys.getenv("PATH"),"/usr/texbin",sep=":") ). To avoid any complications of using ‘X11’ under MacOS, users can install the ‘TeXShop’ package, which will associate ‘.dvi’ files with a viewer that displays a ‘pdf’ version of the file after a hidden conversion from ‘dvi’ to ‘pdf’.
System options can be used to specify external commands to be used. Defaults are given by options(xdvicmd='xdvi') or options(xdvicmd='yap'), options(dvipscmd='dvips'), options(latexcmd='latex'). For MacOS specify options(xdvicmd='MacdviX') or, if TeXShop is installed, options(xdvicmd='open').
To use ‘pdflatex’ rather than ‘latex’, set options(latexcmd='pdflatex'), options(dviExtension='pdf'), and set options('xdvicmd') to your chosen PDF previewer.
If running S-Plus and your directory for temporary files is not ‘/tmp’ (Unix/Linux) or ‘\windows\temp’ (Windows), add your own tempdir function such as tempdir <- function() "/yourmaindirectory/yoursubdirectory"
To prevent the latex file from being displayed, store the result of latex in an object, e.g. w <- latex(object, file='foo.tex').
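As a minimal sketch of the option settings described above, the following selects pdflatex and stores the result of latex() so that nothing is compiled or displayed. The matrix x, the temporary output path, and the choice of 'open' as the previewer command are illustrative assumptions; substitute whatever previewer is available on your system.

```r
library(Hmisc)

# Assumed settings for a pdflatex workflow; 'open' is only an example
# value for xdvicmd (a macOS-style previewer command).
old <- options(latexcmd='pdflatex', dviExtension='pdf', xdvicmd='open')

x <- matrix(1:6, nrow=2, dimnames=list(c('a','b'), c('c','d','e')))

# Assigning the result keeps the print method from being invoked,
# so the LaTeX program is not called
tex <- file.path(tempdir(), 'x.tex')
w <- latex(x, file=tex)

class(w)        # class of the returned object
w$file          # name of the generated .tex file

options(old)    # restore the previous option values
```

Printing w (or calling dvi(w)) later would then use the commands named in these options.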
Value
latex and dvi return a list of class latex or dvi containing character string elements file and style. file contains the name of the generated file, and style is a vector (possibly empty) of styles to be included using the LaTeX2e ‘\usepackage’ command.
latexTranslate returns a vector of character strings.
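A short, hedged illustration of latexTranslate with its default translations follows. The input strings are made up for the example, and the exact markup produced for each special character should be checked by inspecting the printed result rather than taken from this sketch.

```r
library(Hmisc)

# Hypothetical labels containing characters that are special in LaTeX
s <- c('50% of total', 'p < 0.05')

z <- latexTranslate(s)   # one translated string per input element
print(z)
```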
Side Effects
Creates various system files and runs various Linux/UNIX system commands which are assumed to be in the system path.
Author(s)
Frank E. Harrell, Jr.,
Department of Biostatistics,
Vanderbilt University,
fh@fharrell.com
Richard M. Heiberger,
Department of Statistics,
Temple University, Philadelphia, PA.
rmh@temple.edu
David R. Whiting,
School of Clinical Medical Sciences (Diabetes),
University of Newcastle upon Tyne, UK.
david.whiting@ncl.ac.uk
See Also
Examples
x <- matrix(1:6, nrow=2, dimnames=list(c('a','b'),c('c','d','this that')))
## Not run: 
latex(x)   # creates x.tex in working directory
# The result of the above command is an object of class "latex"
# which here is automatically printed by the latex print method.
# The latex print method prepends and appends latex headers and
# calls the latex program in the PATH.  If the latex program is
# not in the PATH, you will get error messages from the operating
# system.

w <- latex(x, file='/tmp/my.tex')
# Does not call the latex program as the print method was not invoked
print.default(w)
# Shows the contents of the w variable without attempting to latex it.

d <- dvi(w)  # compile LaTeX document, make .dvi
             # latex assumed to be in path
d            # or show(d) : run xdvi (assumed in path) to display
w            # or show(w) : run dvi then xdvi
dvips(d)     # run dvips to print document
dvips(w)     # run dvi then dvips
library(tools)
texi2dvi('/tmp/my.tex')   # compile and produce pdf file in working dir.

## End(Not run)
latex(x, file="")   # just write out LaTeX code to screen

## Not run: 
# Use paragraph formatting to wrap text to 3 in. wide in a column
d <- data.frame(x=1:2,
                y=c(paste("a", paste(rep("very",30),collapse=" "),"long string"),
                    "a short string"))
latex(d, file="", col.just=c("l", "p{3in}"), table.env=FALSE)

## End(Not run)

## Not run: 
# After running latex( ) multiple times with different special styles in
# effect, make a file that will call for the needed LaTeX packages when
# latex is run (especially when using Sweave with R)
if(exists("latexStyles"))
  cat(paste('\\usepackage{',latexStyles,'}',sep=''),
      file='stylesused.tex', sep='\n')
# Then in the latex job have something like:
# \documentclass{article}
# \input{stylesused}
# \begin{document}
# ...

## End(Not run)

Check whether the options for latex functions have been specified.
Description
Check whether the options for latex functions have been specified. If any of options()[c("latexcmd","dviExtension","xdvicmd")] are NULL, an error message is displayed.
Usage
latexCheckOptions(...)

Arguments
... | Any arguments are ignored. |
Value
If any NULL options are detected, the invisible text of the error message. If all three options have non-NULL values, NULL.
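The check can be exercised as follows. This is only a sketch: the three option values are example settings, not recommendations, and the previous values are restored at the end.

```r
library(Hmisc)

# Example values only; use the commands installed on your system
old <- options(latexcmd='pdflatex', dviExtension='pdf', xdvicmd='open')

# With all three options non-NULL there is nothing to report
res <- latexCheckOptions()
is.null(res)

options(old)   # restore the previous option values
```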
Author(s)
Richard M. Heiberger <rmh@temple.edu>
See Also
Enhanced Dot Chart for LaTeX Picture Environment with epic
Description
latexDotchart is a translation of the dotchart3 function for producing a vector of character strings containing LaTeX picture environment markup that mimics dotchart3 output. The LaTeX epic and color packages are required. The add and horizontal=FALSE options are not available for latexDotchart, however.
Usage
latexDotchart(data, labels, groups=NULL, gdata=NA, xlab='',
              auxdata, auxgdata=NULL, auxtitle, w=4, h=4, margin,
              lines=TRUE, dotsize = .075, size='small',
              size.labels='small', size.group.labels='normalsize',
              ttlabels=FALSE, sort.=TRUE, xaxis=TRUE, lcolor='gray', ...)

Arguments
data | a numeric vector whose values are shown on the x-axis |
labels | a vector of labels for each point, corresponding to |
groups | an optional categorical variable indicating how |
gdata | data values for groups, typically summaries such as group medians |
xlab | x-axis title |
auxdata | a vector of auxiliary data, of the same length as the first ( |
auxgdata | similar to |
auxtitle | if |
w | width of picture in inches |
h | height of picture in inches |
margin | a 4-vector representing, in inches, the margin to the left of the x-axis, below the y-axis, to the right of the x-axis, and above the y-axis. By default these are computed making educated guesses about how to accommodate |
lines | set to |
dotsize | diameter of filled circles, in inches, for drawing dots |
size | size of text in picture. This and the next two arguments are LaTeX font commands without the opening backslash, e.g., |
size.labels | size of labels |
size.group.labels | size of labels corresponding to |
ttlabels | set to |
sort. | set to |
xaxis | set to |
lcolor | color for horizontal reference lines. Default is |
... | ignored |
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
See Also
Examples
## Not run: 
z <- latexDotchart(c(.1,.2), c('a','bbAAb'), xlab='This Label',
                   auxdata=c(.1,.2), auxtitle='Zcriteria')
f <- '/tmp/t.tex'
cat('\\documentclass{article}\n\\usepackage{epic,color}\n\\begin{document}\n', file=f)
cat(z, sep='\n', file=f, append=TRUE)
cat('\\end{document}\n', file=f, append=TRUE)

set.seed(135)
maj <- factor(c(rep('North',13),rep('South',13)))
g <- paste('Category',rep(letters[1:13],2))
n <- sample(1:15000, 26, replace=TRUE)
y1 <- runif(26)
y2 <- pmax(0, y1 - runif(26, 0, .1))
z <- latexDotchart(y1, g, groups=maj, auxdata=n, auxtitle='n', xlab='Y',
                   size.group.labels='large', ttlabels=TRUE)
f <- '/tmp/t2.tex'
cat('\\documentclass{article}\n\\usepackage{epic,color}\n\\begin{document}\n\\framebox{', file=f)
cat(z, sep='\n', file=f, append=TRUE)
cat('}\\end{document}\n', file=f, append=TRUE)

## End(Not run)

Convert a Data Frame or Matrix to a LaTeX Tabular
Description
latexTabular creates a character vector representing a matrix or data frame in a simple ‘tabular’ environment.
Usage
latexTabular(x, headings=colnames(x),
             align=paste(rep('c',ncol(x)),collapse=''),
             halign=paste(rep('c',ncol(x)),collapse=''),
             helvetica=TRUE, translate=TRUE, hline=0, center=FALSE, ...)

Arguments
x | a matrix or data frame, or a vector that is automatically converted to a matrix |
headings | a vector of character strings specifying column headings for ‘latexTabular’, defaulting to |
align | a character string specifying column alignments for ‘latexTabular’, defaulting to |
halign | a character string specifying alignment for column headings, defaulting to centered. |
helvetica | set to |
translate | set to |
hline | set to 1 to put |
center | set to |
... | if present, |
Value
a character string containing LaTeX markup
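The following sketch shows how the alignment arguments might be combined. The matrix and the particular choices (right-justified columns, no Helvetica switch) are assumptions made for illustration; the markup is written to the console so it can be inspected.

```r
library(Hmisc)

x <- matrix(1:6, nrow=2, dimnames=list(c('a','b'), c('c','d','e')))

# Right-justify the three data columns, keep the default centered
# headings, and skip the Helvetica font switch
z <- latexTabular(x, align='rrr', helvetica=FALSE)

cat(z, sep='\n')   # LaTeX markup for a tabular environment
```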
Author(s)
Frank E. Harrell, Jr.,
Department of Biostatistics,
Vanderbilt University,
fh@fharrell.com
See Also
Examples
x <- matrix(1:6, nrow=2, dimnames=list(c('a','b'),c('c','d','this that')))
latexTabular(x)   # a character string with LaTeX markup

Create LaTeX Thermometers and Colored Needles
Description
latexTherm creates a LaTeX picture environment for drawing a series of thermometers whose heights depict the values of a variable y assumed to be scaled from 0 to 1. This is useful for showing fractions of sample analyzed in any table or plot, intended for a legend. For example, four thermometers might be used to depict the fraction of enrolled patients included in the current analysis, the fraction randomized, the fraction of patients randomized to treatment A being analyzed, and the fraction randomized to B being analyzed. The picture is placed inside a LaTeX macro definition for macro variable named name, to be invoked by the user later in the LaTeX file using name preceded by a backslash.
If y has an attribute "table", it is assumed to contain a character string with LaTeX code. This code is used as a tooltip popup for PDF using the LaTeX ocgtools package or using style tooltips. Typically the code will contain a tabular environment. The user must define a LaTeX macro tooltipn that takes two arguments (original object and pop-up object) that does the pop-up.
latexNeedle is similar to latexTherm except that vertical needles are produced and each may have its own color. A grayscale box is placed around the needles and provides the 0-1 y-axis reference. Horizontal grayscale grid lines may be drawn.
pngNeedle is similar to latexNeedle but is for generating small png graphics. The full graphics file name is returned invisibly.
Usage
latexTherm(y, name, w = 0.075, h = 0.15, spacefactor = 1/2, extra = 0.07,
           file = "", append = TRUE)

latexNeedle(y, x=NULL, col='black', href=0.5, name, w=.05, h=.15,
            extra=0, file = "", append=TRUE)

pngNeedle(y, x=NULL, col='black', href=0.5, lwd=3.5, w=6, h=18,
          file=tempfile(fileext='.png'))

Arguments
y | a vector of 0-1 scaled values. Boxes and their frames are omitted for |
x | a vector corresponding to |
name | name of LaTeX macro variable to be defined |
w | width of a single box (thermometer) in inches. For |
h | height of a single box in inches. For |
spacefactor | fraction of |
extra | extra space in inches to set aside to the right of and above the series of boxes or frame |
file | name of file to which to write LaTeX code. Default is the console. Also used as base file name for png graphic. Default for that is from |
append | set to |
col | a vector of colors corresponding to positions in |
href | values of |
lwd | line width of needles for |
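As a small sketch of how the macro definition is produced, the code below captures what latexTherm writes to the console. The fractions and the macro name ltfrac are hypothetical values chosen for the example.

```r
library(Hmisc)

# Four fractions on a 0-1 scale; 'ltfrac' is a hypothetical macro name
frac <- c(.9, .8, .45, .35)

# With file="" the LaTeX code goes to the console, so capture it
tex <- capture.output(latexTherm(frac, name='ltfrac'))

cat(tex, sep='\n')   # macro definition wrapping a picture environment
```

In a LaTeX document that includes this output, the thermometers would then be drawn wherever the macro name preceded by a backslash appears.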
Author(s)
Frank Harrell
Examples
## Not run: 
# The following is in the Hmisc tests directory
# For a knitr example see latexTherm.Rnw in that directory
ct <- function(...) cat(..., sep='')
ct('\\documentclass{report}\\begin{document}\n')
latexTherm(c(1, 1, 1, 1), name='lta')
latexTherm(c(.5, .7, .4, .2), name='ltb')
latexTherm(c(.5, NA, .75, 0), w=.3, h=1, name='ltc', extra=0)
latexTherm(c(.5, NA, .75, 0), w=.3, h=1, name='ltcc')
latexTherm(c(0, 0, 0, 0), name='ltd')
ct('This is the first:\\lta and the second:\\ltb\\\\ and the third without extra:\\ltc END\\\\\nThird with extra:\\ltcc END\\\\ \\vspace{2in}\\\\ All data = zero, frame only:\\ltd\\\\\\end{document}\n')
w <- pngNeedle(c(.2, .5, .7))
cat(tobase64image(w))   # can insert this directly into an html file

## End(Not run)

Legend Creation Functions
Description
Wrappers to plot-defined legend plotting functions
Usage
Key(...)

Key2(...)

sKey(...)

Arguments
... | arguments to pass to wrapped functions |
Pretty-print the Structure of a Data Object
Description
This is a function to pretty-print the structure of any data object (usually a list). It is similar to the R function str.
Usage
list.tree(struct, depth=-1, numbers=FALSE, maxlen=22, maxcomp=12,
          attr.print=TRUE, front="", fill=". ", name.of, size=TRUE)

Arguments
struct | The object to be displayed |
depth | Maximum depth of recursion (of lists within lists ...) to be printed; negative value means no limit on depth. |
numbers | If TRUE, use numbers in leader instead of dots to represent position in structure. |
maxlen | Approximate maximum length (in characters) allowed on each line to give the first few values of a vector. maxlen=0 suppresses printing any values. |
maxcomp | Maximum number of components of any list that will be described. |
attr.print | Logical flag, determining whether a description of attributes will be printed. |
front | Front material of a line, for internal use. |
fill | Fill character used for each level of indentation. |
name.of | Name of object, for internal use (deparsed version of struct by default). |
size | Logical flag, should the size of the object in bytes be printed? |
A description of the structure of struct will be printed in outline form, with indentation for each level of recursion, showing the internal storage mode, length, class(es) if any, attributes, and first few elements of each data vector. By default each level of list recursion is indicated by a "." and attributes by "A".
Author(s)
Alan Zaslavsky, zaslavsk@hcp.med.harvard.edu
See Also
Examples
X <- list(a=ordered(c(1:30,30:1)), b=c("Rick","John","Allan"),
          c=diag(300), e=cbind(p=1008:1019,q=4))
list.tree(X)
# In R you can say str(X)

Apply a Function to Rows of a Matrix or Vector
Description
mApply is like tapply except that the first argument can be a matrix or a vector, and the output is cleaned up if simplify=TRUE. It uses code adapted from Tony Plate (tplate@blackmesacapital.com) to operate on grouped submatrices.
As mApply can be much faster than using by, it is often worth the trouble of converting a data frame to a numeric matrix for processing by mApply. asNumericMatrix will do this, and matrix2dataFrame will convert a numeric matrix back into a data frame.
Usage
mApply(X, INDEX, FUN, ..., simplify=TRUE, keepmatrix=FALSE)

Arguments
X | a vector or matrix capable of being operated on by the function specified as the |
INDEX | list of factors, each of same number of rows as 'X' has. |
FUN | the function to be applied. In the case of functions like '+', ' |
... | optional arguments to 'FUN'. |
simplify | set to 'FALSE' to suppress simplification of the result into an array, matrix, etc. |
keepmatrix | set to |
Value
For mApply, the returned value is a vector, matrix, or list. If FUN returns more than one number, the result is an array if simplify=TRUE and is a list otherwise. If a matrix is returned, its rows correspond to unique combinations of INDEX. If INDEX is a list with more than one vector, FUN returns more than one number, and simplify=FALSE, the returned value is a list that is an array with the first dimension corresponding to the last vector in INDEX, the second dimension corresponding to the next to last vector in INDEX, etc., and the elements of the list-array correspond to the values computed by FUN. In this situation the returned value is a regular array if simplify=TRUE. The order of dimensions is as previously but the additional (last) dimension corresponds to values computed by FUN.
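The two simplest return shapes described above can be sketched with the built-in iris data. The choice of mean and colMeans as FUN is an assumption made for illustration only.

```r
library(Hmisc)

X <- as.matrix(iris[, 1:4])   # numeric matrix, one row per flower

# FUN returns a single number per group: one element per level of INDEX
m1 <- mApply(iris$Sepal.Length, iris$Species, mean)

# FUN returns several numbers per group: rows correspond to the groups
m2 <- mApply(X, iris$Species, colMeans)

m1
dim(m2)
```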
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
See Also
asNumericMatrix, matrix2dataFrame, tapply, sapply, lapply, mapply, by.
Examples
require(datasets, TRUE)
a <- mApply(iris[,-5], iris$Species, mean)

Methods for Storing and Analyzing Multiple Choice Variables
Description
mChoice is a function that is useful for grouping variables that represent individual choices on a multiple choice question. These choices are typically factor or character values but may be of any type. Levels of component factor variables need not be the same; all unique levels (or unique character values) are collected over all of the multiple variables. Then a new character vector is formed with integer choice numbers separated by semicolons. Optimally, a database system would have exported the semicolon-separated character strings with a levels attribute containing strings defining value labels corresponding to the integer choice numbers. mChoice is a function for creating a multiple-choice variable after the fact. mChoice variables are explicitly handled by the describe and summary.formula functions. NAs or blanks in input variables are ignored.
format.mChoice will convert the multiple choice representation to text form by substituting levels for integer codes. as.double.mChoice converts the mChoice object to a binary numeric matrix, one column per used level (or all levels if drop=FALSE). This is called by the user by invoking as.numeric. There is a print method and a summary method, and a print method for the summary.mChoice object. The summary method computes frequencies of all two-way choice combinations, the frequencies of the top 5 combinations, information about which other choices are present when each given choice is present, and the frequency distribution of the number of choices per observation. This summary output is used in the describe function. The print method returns an html character string if options(prType='html') is in effect and render=FALSE, or renders the html otherwise. This is used by print.describe and is most effective when short=TRUE is specified to summary.
inmChoice creates a logical vector the same length as x whose elements are TRUE when the observation in x contains at least one of the codes or value labels in the second argument.
match.mChoice creates an integer vector of the indexes of all elements in table which contain any of the specified levels.
nmChoice returns an integer vector of the number of choices that were made.
is.mChoice returns TRUE if the argument is a multiple choice variable.
Usage
mChoice(..., label='', sort.levels=c('original','alphabetic'),
        add.none=FALSE, drop=TRUE, ignoreNA=TRUE)

## S3 method for class 'mChoice'
format(x, minlength=NULL, sep=";", ...)

## S3 method for class 'mChoice'
as.double(x, drop=FALSE, ...)

## S3 method for class 'mChoice'
print(x, quote=FALSE, max.levels=NULL, width=getOption("width"), ...)

## S3 method for class 'mChoice'
as.character(x, ...)

## S3 method for class 'mChoice'
summary(object, ncombos=5, minlength=NULL, drop=TRUE, short=FALSE, ...)

## S3 method for class 'summary.mChoice'
print(x, prlabel=TRUE, render=TRUE, ...)

## S3 method for class 'mChoice'
x[..., drop=FALSE]

match.mChoice(x, table, nomatch=NA, incomparables=FALSE)

inmChoice(x, values, condition=c('any', 'all'))

inmChoicelike(x, values, condition=c('any', 'all'),
              ignore.case=FALSE, fixed=FALSE)

nmChoice(object)

is.mChoice(x)

## S3 method for class 'mChoice'
Summary(..., na.rm)

Arguments
na.rm | Logical: remove |
table | a vector (mChoice) of values to be matched against. |
nomatch | value to return if a value for |
incomparables | logical whether incomparable values should be compared. |
... | a series of vectors |
label | a character string |
sort.levels | set |
add.none | Set |
drop | set |
ignoreNA | set to |
x | an object of class |
object | an object of class |
ncombos | maximum number of combos. |
width | Width of a line of text to be formatted |
quote | quote the output |
max.levels | max levels to be displayed |
minlength | By default no abbreviation of levels is done in |
short | set to |
sep | character to use to separate levels when formatting |
prlabel | set to |
render | applies of |
values | a scalar or vector. If |
condition | set to |
ignore.case | set to |
fixed | see |
Value
mChoice returns a character vector of class "mChoice" plus attributes "levels" and "label". summary.mChoice returns an object of class "summary.mChoice". inmChoice and inmChoicelike return a logical vector. format.mChoice returns a character vector, and as.double.mChoice returns a binary numeric matrix. nmChoice returns an integer vector. print.summary.mChoice returns an html character string if options(prType='html') is in effect.
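A compact sketch of the objects listed above, using a made-up two-column "check all that apply" question. The data values and variable names are hypothetical.

```r
library(Hmisc)

# Two columns of a hypothetical multiple-response question
s1 <- c('Headache', 'Hangnail', 'Headache')
s2 <- c('Hangnail', NA,         'Depressed')

S <- mChoice(s1, s2, label='Symptoms')

levels(S)                         # all distinct choices found
has <- inmChoice(S, 'Headache')   # logical: does each row include it?
k   <- nmChoice(S)                # number of choices per observation
format(S)                         # choices spelled out as text
as.numeric(S)                     # binary indicator matrix
```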
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
See Also
Examples
options(digits=3)
set.seed(3)
n <- 20
sex <- factor(sample(c("m","f"), n, rep=TRUE))
age <- rnorm(n, 50, 5)
treatment <- factor(sample(c("Drug","Placebo"), n, rep=TRUE))

# Generate a 3-choice variable; each of 3 variables has 5 possible levels
symp <- c('Headache','Stomach Ache','Hangnail',
          'Muscle Ache','Depressed')
symptom1 <- sample(symp, n, TRUE)
symptom2 <- sample(symp, n, TRUE)
symptom3 <- sample(symp, n, TRUE)
cbind(symptom1, symptom2, symptom3)[1:5,]
Symptoms <- mChoice(symptom1, symptom2, symptom3, label='Primary Symptoms')
Symptoms
print(Symptoms, long=TRUE)
format(Symptoms[1:5])
inmChoice(Symptoms,'Headache')
inmChoicelike(Symptoms, 'head', ignore.case=TRUE)
levels(Symptoms)
inmChoice(Symptoms, 3)
# Find all subjects with either of two symptoms
inmChoice(Symptoms, c('Headache','Hangnail'))
# Note: In this example, some subjects have the same symptom checked
# multiple times; in practice these redundant selections would be NAs
# mChoice will ignore these redundant selections
# Find all subjects with both symptoms
inmChoice(Symptoms, c('Headache', 'Hangnail'), condition='all')

meanage <- N <- numeric(5)
for(j in 1:5) {
  meanage[j] <- mean(age[inmChoice(Symptoms,j)])
  N[j] <- sum(inmChoice(Symptoms,j))
}
names(meanage) <- names(N) <- levels(Symptoms)
meanage
N

# Manually compute mean age for 2 symptoms
mean(age[symptom1=='Headache' | symptom2=='Headache' | symptom3=='Headache'])
mean(age[symptom1=='Hangnail' | symptom2=='Hangnail' | symptom3=='Hangnail'])

summary(Symptoms)

# Frequency table sex*treatment, sex*Symptoms
summary(sex ~ treatment + Symptoms, fun=table)
# Check:
ma <- inmChoice(Symptoms, 'Muscle Ache')
table(sex[ma])

# could also do:
# summary(sex ~ treatment + mChoice(symptom1,symptom2,symptom3), fun=table)

# Compute mean age, separately by 3 variables
summary(age ~ sex + treatment + Symptoms)

summary(age ~ sex + treatment + Symptoms, method="cross")

f <- summary(treatment ~ age + sex + Symptoms, method="reverse", test=TRUE)
f
# trio of numbers represent 25th, 50th, 75th percentile
print(f, long=TRUE)

creates a string that is a repeat of a substring
Description
Takes a character and creates a string that is the character repeated len times.
Usage
makeNstr(char, len)

Arguments
char | character to be repeated |
len | number of times to repeat |
Value
A string that is char repeated len times.
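A typical use might be building rule lines or indentation for console output. The sketch below assumes nothing beyond the two arguments documented above; the variable names are illustrative.

```r
library(Hmisc)

rule <- makeNstr('-', 20)   # a 20-character rule line
pad  <- makeNstr(' ', 4)    # four spaces of indentation

cat(rule, '\n', pad, 'indented text\n', rule, '\n', sep='')
```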
Author(s)
Charles Dupont
See Also
Examples
makeNstr(" ", 5)

Read Tables in a Microsoft Access Database
Description
Assuming the mdbtools package has been installed on your system and is in the system path, mdb.get imports one or more tables in a Microsoft Access database. Date-time variables are converted to dates or chron package date-time variables. The csv.get function is used to import automatically exported csv files. If tables is unspecified all tables in the database are retrieved. If more than one table is imported, the result is a list of data frames.
Usage
mdb.get(file, tables=NULL, lowernames=FALSE, allow=NULL,
        dateformat='%m/%d/%y', mdbexportArgs='-b strip', ...)

Arguments
file | the file name containing the Access database |
tables | character vector specifying the names of tables to import. Default is to import all tables. Specify |
lowernames | set this to |
allow | a vector of characters allowed by R that should not be converted to periods in variable names. By default, underscores in variable names are converted to periods as with R before version 1.9. |
dateformat | see |
mdbexportArgs | command line arguments to issue to mdb-export. Set to |
... | arguments to pass to |
Details
Uses the mdbtools package executables mdb-tables, mdb-schema, and mdb-export (with by default option -b strip to drop any binary output). In Debian/Ubuntu Linux run apt-get install mdbtools. cleanup.import is invoked by csv.get to transform variables and store them as efficiently as possible.
Value
a new data frame or a list of data frames
Author(s)
Frank Harrell, Vanderbilt University
See Also
data.frame, cleanup.import, csv.get, Date, chron
Examples
## Not run: 
# Read all tables in the Microsoft Access database Nwind.mdb
d <- mdb.get('Nwind.mdb')
contents(d)
for(z in d) print(contents(z))
# Just print the names of tables in the database
mdb.get('Nwind.mdb', tables=TRUE)
# Import one table
Orders <- mdb.get('Nwind.mdb', tables='Orders')

## End(Not run)

meltData
Description
Melt a Dataset To Examine All Xs vs Y
Usage
meltData(formula, data, tall = c("right", "left"),
         vnames = c("labels", "names"), sepunits = FALSE, ...)

Arguments
formula | a formula |
data | data frame or table |
tall | see above |
vnames | set to |
sepunits | set to |
... | passed to |
Details
Uses a formula with one or more left hand side variables (Y) and one or more right hand side variables (X). Uses data.table::melt() to melt data so that each X is played against the same Y if tall='right' (the default) or each Y is played against the same X combination if tall='left'. The resulting data table has variables Y with their original names (if tall='right') or variables X with their original names (if tall='left'), variable, and value. By default variable is taken as label()s of the tall variables.
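To make the default tall='right' layout concrete, here is a minimal sketch with one Y and two X variables. The data frame is invented for the example; with 5 input rows and 2 X variables the melted table is expected to have 10 rows.

```r
library(Hmisc)

# Hypothetical data: one response and two predictors
d <- data.frame(y=(1:5)/10, x1=1:5, x2=101:105)

# Default tall='right': x1 and x2 are stacked, y keeps its own column
m <- meltData(y ~ x1 + x2, data=d)

names(m)   # y plus the melted columns variable and value
nrow(m)
```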
Value
data table
Author(s)
Frank Harrell
See Also
Examples
d <- data.frame(y1=(1:10)/10, y2=(1:10)/100, x1=1:10, x2=101:110)
label(d$x1) <- 'X1'
units(d$x1) <- 'mmHg'
m=meltData(y1 + y2 ~ x1 + x2, data=d, units=TRUE)   # consider also html=TRUE
print(m)
m=meltData(y1 + y2 ~ x1 + x2, data=d, tall='left')
print(m)

Draw Axes With Side-Specific mgp Parameters
Description
mgp.axis is a version of axis that uses the appropriate side-specific mgp parameter (see par) to account for different space requirements for axis labels vertical vs. horizontal tick marks. mgp.axis also fixes a bug in axis(2,...) that causes it to assume las=1.
mgp.axis.labels is used so that different spacing between tick marks and axis tick mark labels may be specified for x- and y-axes. Use mgp.axis.labels('default') to set defaults. Users can set values manually using mgp.axis.labels(x,y) where x and y are 2nd value of par('mgp') to use. Use mgp.axis.labels(type=w) to retrieve values, where w='x', 'y', 'x and y', 'xy', to get 3 mgp values (first 3 types) or 2 mgp.axis.labels.
Usage
mgp.axis(side, at = NULL, ...,
         mgp = mgp.axis.labels(type = if (side == 1 | side == 3) "x" else "y"),
         axistitle = NULL, cex.axis=par('cex.axis'), cex.lab=par('cex.lab'))

mgp.axis.labels(value, type=c('xy','x','y','x and y'))

Arguments
side,at | see |
... | arguments passed through to |
mgp,cex.axis,cex.lab | see |
axistitle | if specified will cause |
value | vector of values to which to set system option |
type | see above |
Value
mgp.axis.labels returns the value of mgp (only the second element of mgp if type="xy" or a list with elements x and y if type="x and y", each list element being a 3-vector) for the appropriate axis if value is not specified, otherwise it returns nothing but the system option mgp.axis.labels is set.
mgp.axis returns nothing.
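A runnable sketch of drawing both axes with mgp.axis follows. It assumes an off-screen pdf(NULL) device so that no file or window is created, and the plotted data are arbitrary.

```r
library(Hmisc)

pdf(NULL)                          # off-screen device for the sketch
plot(1:10, (1:10)^2, axes=FALSE, xlab='', ylab='')
mgp.axis(1, axistitle='X Label')   # x-axis with side-specific mgp
mgp.axis(2, axistitle='Y Label')   # y-axis with side-specific mgp
box()

mx <- mgp.axis.labels(type='x')    # mgp values in effect for the x-axis
invisible(dev.off())

mx
```

Passing the title through axistitle rather than positionally matters here, because the second positional argument of mgp.axis is at.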
Side Effects
mgp.axis.labels stores the value in the system option mgp.axis.labels
Author(s)
Frank Harrell
See Also
Examples
## Not run: 
mgp.axis.labels(type='x')  # get default value for x-axis
mgp.axis.labels(type='y')  # get value for y-axis
mgp.axis.labels(type='xy') # get 2nd element of both mgps
mgp.axis.labels(type='x and y')  # get a list with 2 elements
mgp.axis.labels(c(3,.5,0), type='x')  # set
options('mgp.axis.labels')            # retrieve

plot(..., axes=FALSE)
mgp.axis(1, axistitle="X Label")
mgp.axis(2, axistitle="Y Label")

## End(Not run)

Miscellaneous Functions for Epidemiology
Description
The mhgr function computes the Cochran-Mantel-Haenszel stratified risk ratio and its confidence limits using the Greenland-Robins variance estimator.
The lrcum function takes the results of a series of 2x2 tables representing the relationship between test positivity and diagnosis and computes positive and negative likelihood ratios (with all their deficiencies) and the variance of their logarithms. Cumulative likelihood ratios and their confidence intervals (assuming independence of tests) are computed, assuming a string of all positive tests or a string of all negative tests. The method of Simel et al as described in Altman et al is used.
Usage
mhgr(y, group, strata, conf.int = 0.95)

## S3 method for class 'mhgr'
print(x, ...)

lrcum(a, b, c, d, conf.int = 0.95)

## S3 method for class 'lrcum'
print(x, dec=3, ...)

Arguments
y | a binary response variable |
group | a variable with two unique values specifying comparison groups |
strata | the stratification variable |
conf.int | confidence level |
x | an object created by |
a | frequency of true positive tests |
b | frequency of false positive tests |
c | frequency of false negative tests |
d | frequency of true negative tests |
dec | number of places to the right of the decimal to print for |
... | additional arguments to be passed to other print functions |
Details
Uses equations 4 and 13 from Greenland and Robins.
Value
a list of class"mhgr" or of class"lrcum".
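For orientation, the point estimates behind lrcum can be worked by hand using the textbook definitions of the likelihood ratios. This is a sketch of the standard formulas, not a transcription of the function's internals, and the helper variable names (cc, sens, spec, and so on) are introduced here only for illustration.

```r
library(Hmisc)

# Counts from two 2x2 tables:
# a = true positives, b = false positives,
# cc = false negatives, d = true negatives
a <- c(39, 20); b <- c(3, 5); cc <- c(21, 22); d <- c(17, 15)

sens <- a / (a + cc)          # sensitivity of each test
spec <- d / (b + d)           # specificity of each test

lr.pos <- sens / (1 - spec)   # positive likelihood ratio per test
lr.neg <- (1 - sens) / spec   # negative likelihood ratio per test

# Assuming independent tests, cumulative ratios are running products
cum.pos <- cumprod(lr.pos)
cum.neg <- cumprod(lr.neg)

round(rbind(lr.pos, cum.pos, lr.neg, cum.neg), 3)

# The package's own computation, which also supplies confidence limits
lrcum(a, b, cc, d)
```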
Author(s)
Frank E Harrell Jr fh@fharrell.com
References
Greenland S, Robins JM (1985): Estimation of a common effect parameter from sparse follow-up data. Biometrics 41:55-68.
Altman DG, Machin D, Bryant TN, Gardner MJ, Eds. (2000): Statistics with Confidence, 2nd Ed. Bristol: BMJ Books, 105-110.
Simel DL, Samsa GP, Matchar DB (1991): Likelihood ratios with confidence: sample size estimation for diagnostic test studies. J Clin Epi 44:763-770.
See Also
Examples
# Create Migraine dataset used in Example 28.6 in the SAS PROC FREQ guide
d <- expand.grid(response=c('Better','Same'),
                 treatment=c('Active','Placebo'),
                 sex=c('female','male'))
d$count <- c(16, 11, 5, 20, 12, 16, 7, 19)
d
# Expand data frame to represent raw data
r <- rep(1:8, d$count)
d <- d[r,]
with(d, mhgr(response=='Better', treatment, sex))
# Discrete survival time example, to get Cox-Mantel relative risk and CL
# From Stokes ME, Davis CS, Koch GG, Categorical Data Analysis Using the
# SAS System, 2nd Edition, Section 17.3, p. 596-599
#
# Input data in Table 17.5
d <- expand.grid(treatment=c('A','P'), center=1:3)
d$healed2w    <- c(15,15,17,12, 7, 3)
d$healed4w    <- c(17,17,17,13,17,17)
d$notHealed4w <- c( 2, 7,10,15,16,18)
d
# Reformat to the way most people would collect raw data
d1 <- d[rep(1:6, d$healed2w),]
d1$time <- '2'
d1$y <- 1
d2 <- d[rep(1:6, d$healed4w),]
d2$time <- '4'
d2$y <- 1
d3 <- d[rep(1:6, d$notHealed4w),]
d3$time <- '4'
d3$y <- 0
d <- rbind(d1, d2, d3)
d$healed2w <- d$healed4w <- d$notHealed4w <- NULL
d
# Finally, duplicate appropriate observations to create 2 and 4-week
# risk sets. Healed and not healed at 4w need to be in the 2-week
# risk set as not healed
d2w      <- subset(d, time=='4')
d2w$time <- '2'
d2w$y    <- 0
d24      <- rbind(d, d2w)
with(d24, table(y, treatment, time, center))
# Matches Table 17.6
with(d24, mhgr(y, treatment, interaction(center, time, sep=';')))
# Get cumulative likelihood ratios and their 0.95 confidence intervals
# based on the following two tables
#
#          Disease       Disease
#          +     -       +     -
# Test +  39     3      20     5
# Test -  21    17      22    15
lrcum(c(39,20), c(3,5), c(21,22), c(17,15))
Minor Tick Marks
Description
Adds minor tick marks to an existing plot. All minor tick marks that will fit on the axes will be drawn.
Usage
minor.tick(nx=2, ny=2, tick.ratio=0.5, x.args = list(), y.args = list())
Arguments
nx | number of intervals in which to divide the area between major tick marks on the X-axis. Set to 1 to suppress minor tick marks. |
ny | same as |
tick.ratio | ratio of lengths of minor tick marks to major tick marks. The length of major tick marks is retrieved from |
x.args | additional arguments (e.g. |
y.args | same as |
Side Effects
plots
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
Earl Bellinger
Max Planck Institute
earlbellinger@gmail.com
Viktor Horvath
Brandeis University
vhorvath@brandeis.edu
See Also
Examples
# Plot with default settings
plot(runif(20), runif(20))
minor.tick()
# Plot with arguments passed to axis()
plot(c(0,1), c(0,1), type = 'n', axes = FALSE, ann = FALSE)
# setting up a plot without axes and annotation
points(runif(20), runif(20))   # plotting data
axis(1, pos = 0.5, lwd = 2)    # showing X-axis at Y = 0.5 with formatting
axis(2, col = 2)               # formatted Y-axis
minor.tick( nx = 4, ny = 4, tick.ratio = 0.3,
            x.args = list(pos = 0.5, lwd = 2), # X-minor tick format arguments
            y.args = list(col = 2))            # Y-minor tick format arguments
movStats
Description
Moving Estimates Using Overlapping Windows
Usage
movStats(
  formula,
  stat = NULL,
  discrete = FALSE,
  space = c("n", "x"),
  eps = if (space == "n") 15,
  varyeps = FALSE,
  nignore = 10,
  xinc = NULL,
  xlim = NULL,
  times = NULL,
  tunits = "year",
  msmooth = c("smoothed", "raw", "both"),
  tsmooth = c("supsmu", "lowess"),
  bass = 8,
  span = 1/4,
  maxdim = 6,
  penalty = NULL,
  trans = function(x) x,
  itrans = function(x) x,
  loess = FALSE,
  ols = FALSE,
  qreg = FALSE,
  lrm = FALSE,
  orm = FALSE,
  hare = FALSE,
  ordsurv = FALSE,
  lrm_args = NULL,
  family = "logistic",
  k = 5,
  tau = (1:3)/4,
  melt = FALSE,
  data = environment(formula),
  pr = c("none", "kable", "plain", "margin")
)
Arguments
formula | a formula with the analysis variable on the left and the x-variable on the right, followed by optional stratification variables |
stat | function of one argument that returns a named list of computed values. Defaults to computing mean and quartiles + N except when y is binary in which case it computes moving proportions. If y has two columns the default statistics are Kaplan-Meier estimates of cumulative incidence at a vector of |
discrete | set to |
space | defines whether intervals use fixed width or fixed sample size |
eps | tolerance for window (half width of window). For |
varyeps | applies to |
nignore | see description, default is to exclude |
xinc | increment in x to evaluate stats, default is xlim range/100 for |
xlim | 2-vector of limits to evaluate if |
times | vector of times for evaluating one minus Kaplan-Meier estimates |
tunits | time units when |
msmooth | set to |
tsmooth | defaults to the super-smoother |
bass | the |
span | the |
maxdim | passed to |
penalty | passed to |
trans | transformation to apply to x |
itrans | inverse transformation |
loess | set to TRUE to also compute loess estimates |
ols | set to TRUE to include rcspline estimate of mean using ols |
qreg | set to TRUE to include quantile regression estimates w rcspline |
lrm | set to TRUE to include logistic regression estimates w rcspline |
orm | set to TRUE to include ordinal logistic regression estimates w rcspline (mean + quantiles in |
hare | set to TRUE to include hazard regression estimates of incidence at |
ordsurv | set to TRUE to include ordinal regression estimates of incidence at |
lrm_args | a |
family | link function for ordinal regression (see |
k | number of knots to use for ols, lrm, qreg restricted cubic splines. Linearity is forced for binary |
tau | quantile numbers to estimate with quantile regression |
melt | set to TRUE to melt data table and derive Type and Statistic |
data | data.table or data.frame, default is calling frame |
pr | defaults to no printing of window information. Use |
Details
Function to compute moving averages and other statistics as a function of a continuous variable, possibly stratified by other variables. Estimates are made by creating overlapping moving windows and computing the statistics defined in the stat function for each window. The default method, space='n', creates varying-width intervals each having a sample size of 2*eps + 1, and the smooth estimates are made every xinc observations. Outer intervals are not symmetric in sample size (but the mean x in those intervals will reflect that) unless eps=nignore, as outer intervals are centered at observations nignore and n - nignore + 1, where the default for nignore is 10. The mean x-variable within each window is taken to represent that window. If trans and itrans are given, x means are computed on the trans(x) scale and then itrans'd. For space='x', by default estimates are made from the nignore smallest to the nignore largest observed values of the x variable, to avoid extrapolation and to help get the moving statistics off to an adequate start for the left tail. Also by default the moving estimates are smoothed using supsmu. When melt=TRUE you can feed the result into ggplot like this: ggplot(w, aes(x=age, y=crea, col=Type)) + geom_line() + facet_wrap(~ Statistic)
See here for several examples.
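The fixed-sample-size windowing described for space='n' can be sketched in base R as follows. This is a simplified illustration and not the movStats code: the helper moving_mean_sketch is hypothetical, it handles no stratification, it computes only the mean of y, and it applies no smoothing afterwards.

```r
# Base-R sketch (hypothetical helper, not Hmisc code) of overlapping
# windows holding up to 2 * eps + 1 observations each.
moving_mean_sketch <- function(x, y, eps = 15, nignore = 10, xinc = 1) {
  o <- order(x)
  x <- x[o]
  y <- y[o]
  n <- length(x)
  # Window centers run from observation nignore to n - nignore + 1
  centers <- seq(nignore, n - nignore + 1, by = xinc)
  t(sapply(centers, function(i) {
    w <- max(1, i - eps):min(n, i + eps)   # truncated near the edges
    c(x = mean(x[w]), y = mean(y[w]), N = length(w))
  }))
}

set.seed(1)
x <- runif(200)
y <- x + rnorm(200, sd = 0.1)
m <- moving_mean_sketch(x, y)
head(m)
```

Each row of m represents one window by its mean x, which is why the outer windows (which are truncated and therefore smaller) still get a sensible x position.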
Value
a data table, with attribute infon which is a data frame with rows corresponding to strata and columns N, Wmean, Wmin, Wmax if stat computed N. These summarize the number of observations used in the windows. If varyeps=TRUE there is an additional column eps with the computed per-stratum eps. When space='n' and xinc is not given, the computed xinc also appears as a column. An additional attribute info is a kable object ready for printing to describe the window characteristics.
Author(s)
Frank Harrell
Margin Titles
Description
Writes overall titles and subtitles after a multiple image plot is drawn. If par()$oma==c(0,0,0,0), title is used instead of mtext, to draw titles or subtitles that are inside the plotting region for a single plot.
Usage
mtitle(main, ll, lc,
       lr=format(Sys.time(),'%d%b%y'),
       cex.m=1.75, cex.l=.5, ...)
Arguments
main | main title to be centered over entire figure, default is none |
ll | subtitle for lower left of figure, default is none |
lc | subtitle for lower center of figure, default is none |
lr | subtitle for lower right of figure, default is today's date in format 23Jan91 for UNIX or R (Thu May 30 09:08:13 1996 format for Windows). Set to |
cex.m | character size for main, default is 1.75 |
cex.l | character size for subtitles |
... | other arguments passed to |
Value
nothing
Side Effects
plots
Author(s)
Frank Harrell
Department of Biostatistics, Vanderbilt University
fh@fharrell.com
See Also
Examples
#Set up for 1 plot on figure, give a main title,
#use date for lr
plot(runif(20),runif(20))
mtitle("Main Title")
#Set up for 2 x 2 matrix of plots with a lower left subtitle and overall title
par(mfrow=c(2,2), oma=c(3,0,3,0))
plot(runif(20),runif(20))
plot(rnorm(20),rnorm(20))
plot(exp(rnorm(20)),exp(rnorm(20)))
mtitle("Main Title",ll="n=20")
Plot Multiple Lines
Description
Plots multiple lines based on a vector x and a matrix y, draws thin vertical lines connecting limits represented by columns of y beyond the first. It is assumed that either (1) the second and third columns of y represent lower and upper confidence limits, or that (2) there is an even number of columns beyond the first and these represent ascending quantiles that are symmetrically arranged around 0.5. If options(grType='plotly') is in effect, uses plotly graphics instead of grid or base graphics. For plotly you may want to set the list of possible colors, etc. using pobj=plot_ly(colors=...). lwd, lty, lwd.vert are ignored under plotly.
Usage
multLines(x, y, pos = c('left', 'right'), col='gray',
          lwd=1, lty=1, lwd.vert = .85, lty.vert = 1,
          alpha = 0.4, grid = FALSE,
          pobj=plotly::plot_ly(), xlim, name=colnames(y)[1], legendgroup=name,
          showlegend=TRUE, ...)
Arguments
x | a numeric vector |
y | a numeric matrix with number of rows equal to the number of |
pos | when |
col | a color used to connect |
lwd | line width for main lines |
lty | line types for main lines |
lwd.vert | line width for vertical lines |
lty.vert | line type for vertical lines |
alpha | transparency |
grid | set to |
pobj | an already started |
xlim | global x-axis limits (required if using |
name | trace name if using |
legendgroup | legend group name if using |
showlegend | whether or not to show traces in legend, if using |
... | passed to |
Author(s)
Frank Harrell
Examples
if (requireNamespace("plotly")) {
  x <- 1:4
  y <- cbind(x, x-3, x-2, x-1, x+1, x+2, x+3)
  plot(NA, NA, xlim=c(1,4), ylim=c(-2, 7))
  multLines(x, y, col='blue')
  multLines(x, y, col='red', pos='right')
}
nCoincident
Description
Number of Coincident Points
Usage
nCoincident(x, y, bins = 400)
Arguments
x | numeric vector |
y | numeric vector |
bins | number of bins in both directions |
Details
Computes the number of x,y pairs that are likely to be obscured in a regular scatterplot, in the sense of overlapping pairs after binning into bins x bins squares where bins defaults to 400. NAs are removed first.
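One plausible reading of this description can be sketched in base R: bin both coordinates, then count the points that fall into a cell already occupied by an earlier point. The helper ncoincident_sketch is hypothetical, and both the binning rule (via cut) and the counting rule (via duplicated) are assumptions rather than the Hmisc implementation.

```r
# Base-R sketch (hypothetical helper, not Hmisc code).
ncoincident_sketch <- function(x, y, bins = 400) {
  ok <- !is.na(x) & !is.na(y)          # remove NAs first
  x  <- x[ok]
  y  <- y[ok]
  # Integer bin index in 1..bins for each coordinate
  xbin <- cut(x, breaks = bins, labels = FALSE)
  ybin <- cut(y, breaks = bins, labels = FALSE)
  # Points landing in an already-occupied cell count as coincident
  sum(duplicated(cbind(xbin, ybin)))
}

ncoincident_sketch(c(1:5, 4:5), c(1:5, 4:5) / 10)
```

In this call the last two points repeat earlier points exactly, so they share cells with them regardless of the number of bins.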
Value
integer count
Author(s)
Frank Harrell
Examples
nCoincident(c(1:5, 4:5), c(1:5, 4:5)/10)
Row-wise Deletion na.action
Description
Does row-wise deletion as na.omit, but adds frequency of missing values for each predictor to the "na.action" attribute of the returned model frame. Optionally stores further details if options(na.detail.response=TRUE).
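The two ingredients mentioned here, per-variable missing-value counts and row-wise deletion, can be illustrated with base R alone. This sketch uses made-up data and does not reproduce the structure of the "na.action" attribute that na.delete builds; it only shows the quantities involved.

```r
# Base-R sketch with made-up data (not Hmisc code).
df <- data.frame(y  = c(1, 2, NA, 4, 5),
                 x1 = c(NA, 1, 2, 3, NA),
                 x2 = 1:5)
# Build a model frame that still contains the NAs
mf <- model.frame(y ~ x1 + x2, data = df, na.action = na.pass)
# Frequency of missing values for each variable
nmiss <- colSums(is.na(mf))
# Row-wise deletion: keep only complete rows
kept <- mf[complete.cases(mf), , drop = FALSE]
nmiss
nrow(kept)
```

Keeping the per-variable counts alongside the reduced frame is what lets a model summary later report how many observations each variable cost.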
Usage
na.delete(frame)
Arguments
frame | a model frame |
Value
a model frame with rows deleted and the "na.action" attribute added.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
See Also
na.omit, na.keep, na.detail.response, model.frame.default, naresid, naprint
Examples
# options(na.action="na.delete")
# ols(y ~ x)
Detailed Response Variable Information
Description
This function is called by certain na.action functions if options(na.detail.response=TRUE) is set. By default, this function returns a matrix of counts of non-NAs and the mean of the response variable computed separately by whether or not each predictor is NA. The default action uses the last column of a Surv object, in effect computing the proportion of events. Other summary functions may be specified by using options(na.fun.response="name of function").
Usage
na.detail.response(mf)
Arguments
mf | a model frame |
Value
a matrix, with rows representing the different statistics that are computed for the response, and columns representing the different subsets for each predictor (NA and non-NA value subsets).
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
See Also
na.omit, na.delete, model.frame.default, naresid, naprint, describe
Examples
# sex
# [1] m f f m f f m m m m m m m m f f f m f m
# age
# [1] NA 41 23 30 44 22 NA 32 37 34 38 36 36 50 40 43 34 22 42 30
# y
# [1] 0 1 0 0 1 0 1 0 0 1 1 1 0 0 1 1 0 1 0 0
# options(na.detail.response=TRUE, na.action="na.delete", digits=3)
# lrm(y ~ age*sex)
#
# Logistic Regression Model
#
# lrm(formula = y ~ age * sex)
#
#
# Frequencies of Responses
#   0 1
#  10 8
#
# Frequencies of Missing Values Due to Each Variable
#  y age sex
#  0   2   0
#
#
# Statistics on Response by Missing/Non-Missing Status of Predictors
#
#      age=NA age!=NA sex!=NA Any NA  No NA
# N       2.0  18.000   20.00    2.0 18.000
# Mean    0.5   0.444    0.45    0.5  0.444
#
# ...
# options(na.action="na.keep")
# describe(y ~ age*sex)
# Statistics on Response by Missing/Non-Missing Status of Predictors
#
#      age=NA age!=NA sex!=NA Any NA  No NA
# N       2.0  18.000   20.00    2.0 18.000
# Mean    0.5   0.444    0.45    0.5  0.444
#
# ...
# options(na.fun.response="table")  #built-in function table()
# describe(y ~ age*sex)
#
# Statistics on Response by Missing/Non-Missing Status of Predictors
#
#   age=NA age!=NA sex!=NA Any NA No NA
# 0      1      10      11      1    10
# 1      1       8       9      1     8
#
# ...
Do-nothing na.action
Description
Does not delete rows containing NAs, but does add details concerning the distribution of the response variable if options(na.detail.response=TRUE). This na.action is primarily for use with describe.formula.
Usage
na.keep(mf)
Arguments
mf | a model frame |
Value
the same model frame with the "na.action" attribute
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
See Also
na.omit, na.delete, model.frame.default, na.detail.response, naresid, naprint, describe
Examples
options(na.action="na.keep", na.detail.response=TRUE)
x1 <- runif(20)
x2 <- runif(20)
x2[1:4] <- NA
y <- rnorm(20)
describe(y ~ x1*x2)
Compute Number of Observations for Left Hand Side of Formula
Description
After removing any artificial observations added by addMarginal, computes the number of non-missing observations for all left-hand-side variables in formula. If formula contains a term id(variable), variable is assumed to be a subject ID variable, and only unique subject IDs are counted. If group is given and its value is the name of a variable in the right-hand-side of the model, an additional object nobsg is returned that is a matrix with as many columns as there are left-hand variables, and as many rows as there are levels to the group variable. This matrix has the further breakdown of unique non-missing observations by group. The concatenation of all ID variables is returned in a list element id.
Usage
nobsY(formula, group=NULL, data = NULL, subset = NULL,
      na.action = na.retain, matrixna=c('all', 'any'))
Arguments
formula | a formula object |
group | character string containing optional name of a stratification variable for computing sample sizes |
data | a data frame |
subset | an optional subsetting criterion |
na.action | an optional |
matrixna | set to |
Value
an integer, with an attribute "formula" containing the original formula but with an id variable (if present) removed
Examples
d <- expand.grid(sex=c('female', 'male', NA),
                 country=c('US', 'Romania'),
                 reps=1:2)
d$subject.id <- c(0, 0, 3:12)
dm <- addMarginal(d, sex, country)
dim(dm)
nobsY(sex + country ~ 1, data=d)
nobsY(sex + country ~ id(subject.id), data=d)
nobsY(sex + country ~ id(subject.id) + reps, group='reps', data=d)
nobsY(sex ~ 1, data=d)
nobsY(sex ~ 1, data=dm)
nobsY(sex ~ id(subject.id), data=dm)
Creates a string of arbitrary length
Description
Creates a vector of strings in which each element consists of the string segment given in the corresponding element of the string vector, repeated the number of times given by times.
Usage
nstr(string, times)
Arguments
string | character: vector of string segments to be repeated. Will be recycled if argument |
times | integer: vector of number of times to repeat the corresponding segment. Will be recycled if argument |
Value
returns a character vector the same length as the longest of the two arguments.
Note
Will throw a warning if the length of the longer argument is not an even multiple of the shorter argument.
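For comparison, base R's strrep performs the same kind of element-wise repetition with recycling. The sketch below only illustrates the behavior described above using strrep; it is not a claim about how nstr is implemented, and the variable names are illustrative.

```r
# Base-R illustration using strrep (not Hmisc code).
one_segment   <- strrep("a", c(0, 3, 4))              # segment recycled
paired        <- strrep(c("a", "b", "c"), c(1, 2, 3)) # element-wise
one_count     <- strrep(c("a", "b", "c"), 4)          # count recycled
one_segment
paired
one_count
```

A repeat count of 0 yields the empty string, so the result always has the length of the longer argument.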
Author(s)
Charles Dupont
See Also
Examples
nstr(c("a"), c(0,3,4))
nstr(c("a", "b", "c"), c(1,2,3))
nstr(c("a", "b", "c"), 4)
Extract number of intercepts
Description
Extract the number of intercepts from a model
Usage
num.intercepts(fit, type=c('fit', 'var', 'coef'))
Arguments
fit | a model fit object |
type | the default is to return the formal number of intercepts used when fitting the model. Set |
Value
num.intercepts returns an integer with the number of intercepts in the model.
See Also
Minimally Group an Ordinal Variable So Bootstrap Samples Will Contain All Distinct Values
Description
When bootstrapping models for ordinal Y when Y is fairly continuous, it is frequently the case that one or more bootstrap samples will not include one or more of the distinct original Y values. When fitting an ordinal model (including a Cox PH model), this means that an intercept cannot be estimated, and the parameter vectors will not align over bootstrap samples. To prevent this from happening, some grouping of Y may be necessary. The ordGroupBoot function uses cutGn() to group Y so that the number of observations in every group is guaranteed to be at least a certain integer m. ordGroupBoot tries a range of m and stops at the lowest m such that either all B tested bootstrap samples contain all the original distinct values of Y (if B>0), or that the probability that a given sample of size n with replacement will contain all the distinct original values exceeds aprob (B=0). This probability is computed using an approximation to the probability of complete sample coverage from the coupon collector's problem and is quite accurate for our purposes.
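The B>0 criterion described above (every tested bootstrap sample contains every distinct value) is easy to sketch in base R. The helper boot_covers_all below is hypothetical and checks the criterion for a single, already grouped vector; it does not search over m, call cutGn(), or use the coupon-collector approximation.

```r
# Base-R sketch (hypothetical helper, not Hmisc code).
# Returns TRUE if all B bootstrap samples contain every distinct value of y.
boot_covers_all <- function(y, B = 1000) {
  y <- y[!is.na(y)]
  u <- unique(y)
  for (i in seq_len(B)) {
    s <- sample(y, replace = TRUE)      # one bootstrap sample of size n
    if (!all(u %in% s)) return(FALSE)   # a distinct value went missing
  }
  TRUE
}

set.seed(1)
well_populated <- boot_covers_all(rep(1:2, each = 50), B = 200)
all_singletons <- boot_covers_all(1:20, B = 200)
c(well_populated, all_singletons)
```

With 50 observations per distinct value a bootstrap sample essentially never drops a value, whereas with 20 singleton values almost every bootstrap sample omits at least one of them, which is the situation that grouping is meant to prevent.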
Usage
ordGroupBoot(
  y,
  B = 0,
  m = 7:min(15, floor(n/3)),
  what = c("mean", "factor", "m"),
  aprob = 0.9999,
  pr = TRUE
)
Arguments
y | a numeric vector |
B | number of bootstrap samples to test, or zero to use a coverage probability approximation |
m | range of minimum group sizes to test; the default range is usually adequate |
what | specifies that either the mean |
aprob | minimum coverage probability sought |
pr | set to |
Value
a numeric vector corresponding to y but grouped, containing either the mean of y in each group or a factor variable representing grouped y, either with the minimum m that satisfied the required sample coverage
Author(s)
Frank Harrell
See Also
Examples
set.seed(1)
x <- c(1:6, NA, 7:22)
ordGroupBoot(x, m=5:10)
ordGroupBoot(x, m=5:10, B=5000, what='factor')
pMedian
Description
Pseudomedian
Usage
pMedian(
  x,
  na.rm = FALSE,
  conf.int = 0,
  B = 1000,
  type = c("percentile", "bca")
)
Arguments
x | a numeric vector |
na.rm | set to |
conf.int | confidence level, defaulting to 0 so that no confidence limits are computed. Set to a number between 0 and 1 to compute bootstrap confidence limits |
B | number of bootstrap samples if |
type | type of bootstrap interval, defaulting to |
Details
Uses fast Fortran code to compute the pseudomedian of a numeric vector. The pseudomedian is the median of all possible midpoints of two observations. The pseudomedian is also called the Hodges-Lehmann one-sample estimator. The Fortran code was originally from JF Monahan, and was converted to C++ in the DescTools package. It has been converted to Fortran 2018 here. Bootstrap confidence intervals are optionally computed.
If n > 250,000 a random sample of 250,000 values ofx is used to limit execution time. For n > 1,000 only the percentile bootstrap confidence interval is computed.
Bootstrapping uses the Fortran subroutine directly, for efficiency.
Value
a scalar numeric value if conf.int = 0, or a 3-vector otherwise, with named elements estimate, lower, upper and attribute type. If the number of non-missing values is less than 5, NA is returned for both lower and upper limits.
See Also
https://dl.acm.org/toc/toms/1984/10/3/, https://www4.stat.ncsu.edu/~monahan/jul10/, https://www.fharrell.com/post/aci/
Examples
x <- c(1:4, 10000)
pMedian(x)
pMedian(x, conf.int=0.95)
# Compare with brute force calculation and with wilcox.test
w <- outer(x, x, '+')
median(w[lower.tri(w, diag=TRUE)]) / 2
wilcox.test(x, conf.int=TRUE)
pairUpDiff
Description
Pair-up and Compute Differences
Usage
pairUpDiff(
  x,
  major = NULL,
  minor = NULL,
  group,
  refgroup,
  lower = NULL,
  upper = NULL,
  minkeep = NULL,
  sortdiff = TRUE,
  conf.int = 0.95
)
Arguments
x | a numeric vector |
major | an optional factor or character vector |
minor | an optional factor or character vector |
group | a required factor or character vector with two levels |
refgroup | a character string specifying which level of |
lower | an optional numeric vector giving the lower |
upper | similar to |
minkeep | the minimum value of |
sortdiff | set to |
conf.int | confidence level; must have been the value used to compute |
Details
This function sets up for plotting half-width confidence intervals for differences, sorting by descending order of differences within major categories, especially for dot charts as produced by dotchartpl(). Given a numeric vector x and a grouping (superpositioning) vector group with exactly two levels, computes differences in possibly transformed x between levels of group for the two observations that are equal on major and minor. If lower and upper are specified, uses conf.int and approximate normality on the transformed scale to backsolve for the standard errors of estimates, and uses approximate normality to get confidence intervals on differences by taking the square root of the sum of squares of the two standard errors. Coordinates for plotting half-width confidence intervals are also computed. These intervals may be plotted on the same scale as x, having the property that they overlap the two x values if and only if there is no "significant" difference at the conf.int level.
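The arithmetic in this paragraph can be worked through for a single pair in base R. This is a sketch under stated assumptions, not the pairUpDiff code: all variable names are illustrative, the difference is taken as the second estimate minus the first, and the half-width interval is scaled so that it has the overlap property stated above.

```r
# Base-R worked example for one pair (illustrative names, not Hmisc code).
conf.int <- 0.95
z <- qnorm((1 + conf.int) / 2)

xa <- 1; la <- 0.9; ua <- 1.1   # estimate and limits, first group
xb <- 4; lb <- 3.9; ub <- 4.1   # estimate and limits, second group

# Backsolve standard errors from the confidence limits
se_a <- (ua - la) / (2 * z)
se_b <- (ub - lb) / (2 * z)

# Difference, its standard deviation, and its confidence interval
d       <- xb - xa
sd_diff <- sqrt(se_a^2 + se_b^2)
ci      <- d + c(-1, 1) * z * sd_diff

# Interval centered at the midpoint of the two estimates. With this
# scaling it covers both estimates exactly when |d| <= z * sd_diff.
mid      <- (xa + xb) / 2
half     <- z * sd_diff / 2
lowermid <- mid - half
uppermid <- mid + half

covers      <- lowermid <= min(xa, xb) && uppermid >= max(xa, xb)
significant <- ci[1] > 0 || ci[2] < 0
c(covers = covers, significant = significant)
```

Here the two estimates are far apart relative to their precision, so the difference is "significant" and the midpoint interval falls strictly between the two x values.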
Value
a list of two objects both sorted by descending values of differences in x. The X object is a data frame that contains the original variables sorted by descending differences across group and in addition a variable subscripts denoting the subscripts of original observations with possible re-sorting and dropping depending on sortdiff and minkeep. The D data frame contains sorted differences (diff), major, minor, sd of difference, lower and upper confidence limits for the difference, mid, the midpoint of the two x values involved in the difference, lowermid, the midpoint minus 1/2 the width of the confidence interval, and uppermid, the midpoint plus 1/2 the width of the confidence interval. Another element returned is dropped which is a vector of major / minor combinations dropped due to minkeep.
Author(s)
Frank Harrell
Examples
x <- c(1, 4, 7, 2, 5, 3, 6)
pairUpDiff(x, c(rep('A', 4), rep('B', 3)),
  c('u','u','v','v','z','z','q'),
  c('a','b','a','b','a','b','a'), 'a', x-.1, x+.1)
Box-Percentile Panel Function for Trellis
Description
For all their good points, box plots have a high ink/information ratio in that they mainly display 3 quartiles. Many practitioners have found that the "outer values" are difficult to explain to non-statisticians and many feel that the notion of "outliers" is too dependent on (false) expectations that data distributions should be Gaussian.
panel.bpplot is a panel function for use with trellis, especially for bwplot. It draws box plots (without the whiskers) with any number of user-specified "corners" (corresponding to different quantiles), but it also draws box-percentile plots similar to those drawn by Jeffrey Banfield's (umsfjban@bill.oscs.montana.edu) bpplot function. To quote from Banfield, "box-percentile plots supply more information about the univariate distributions. At any height the width of the irregular 'box' is proportional to the percentile of that height, up to the 50th percentile, and above the 50th percentile the width is proportional to 100 minus the percentile. Thus, the width at any given height is proportional to the percent of observations that are more extreme in that direction. As in boxplots, the median, 25th and 75th percentiles are marked with line segments across the box."
panel.bpplot can also be used with base graphics to add extended box plots to an existing plot, by specifying nogrid=TRUE, height=....
panel.bpplot is a generalization of bpplot and panel.bwplot in that it works with trellis (making the plots horizontal so that category labels are more visible), it allows the user to specify the quantiles to connect and those for which to draw reference lines, and it displays means (by default using dots).
bpplt draws horizontal box-percentile plots much like those drawn by panel.bpplot but taking as the starting point a matrix containing quantiles summarizing the data. bpplt is primarily intended to be used internally by plot.summary.formula.reverse or plot.summaryM but when used with no arguments has a general purpose: to draw an annotated example box-percentile plot with the default quantiles used and with the mean drawn with a solid dot. This schematic plot is rendered nicely in postscript with an image height of 3.5 inches.
bppltp is like bpplt but for plotly graphics, and it does not draw an annotated extended box plot example.
bpplotM uses the lattice bwplot function to depict multiple numeric continuous variables with varying scales in a single lattice graph, after reshaping the dataset into a tall and thin format.
Usage
panel.bpplot(x, y, box.ratio=1, means=TRUE, qref=c(.5,.25,.75),
             probs=c(.05,.125,.25,.375), nout=0,
             nloc=c('right lower', 'right', 'left', 'none'), cex.n=.7,
             datadensity=FALSE, scat1d.opts=NULL,
             violin=FALSE, violin.opts=NULL,
             font=box.dot$font, pch=box.dot$pch,
             cex.means =box.dot$cex, col=box.dot$col,
             nogrid=NULL, height=NULL, ...)
# E.g. bwplot(formula, panel=panel.bpplot, panel.bpplot.parameters)
bpplt(stats, xlim, xlab='', box.ratio = 1, means=TRUE,
      qref=c(.5,.25,.75), qomit=c(.025,.975),
      pch=16, cex.labels=par('cex'), cex.points=if(prototype)1 else 0.5,
      grid=FALSE)
bppltp(p=plotly::plot_ly(),
       stats, xlim, xlab='', box.ratio = 1, means=TRUE,
       qref=c(.5,.25,.75), qomit=c(.025,.975),
       teststat=NULL, showlegend=TRUE)
bpplotM(formula=NULL, groups=NULL, data=NULL, subset=NULL, na.action=NULL,
        qlim=0.01, xlim=NULL,
        nloc=c('right lower','right','left','none'),
        vnames=c('labels', 'names'), cex.n=.7, cex.strip=1,
        outerlabels=TRUE, ...)
Arguments
x | continuous variable whose distribution is to be examined |
y | grouping variable |
box.ratio | see |
means | set to |
qref | vector of quantiles for which to draw reference lines. These do not need to be included in |
probs | vector of quantiles to display in the box plot. These should all be less than 0.5; the mirror-image quantiles are added automatically. By default, |
nout | tells the function to use |
nloc | location to plot number of non- |
cex.n | character size for |
datadensity | set to |
scat1d.opts | a list containing named arguments (without abbreviations) to pass to |
violin | set to |
violin.opts | a list of options to pass to |
cex.means | character size for dots representing means |
font,pch,col | see |
nogrid | set to |
height | if |
... | arguments passed to |
stats,xlim,xlab,qomit,cex.labels,cex.points,grid | undocumented arguments to |
p | an already-started |
teststat | an html expression containing a test statistic |
showlegend | set to |
formula | a formula with continuous numeric analysis variables on the left hand side and stratification variables on the right. The first variable on the right is the one that will vary the fastest, forming the |
groups | see above |
data | an optional data frame |
subset | an optional subsetting expression or logical vector |
na.action | specifies a function to possibly subset the data according to |
qlim | the outer quantiles to use for scaling each panel in |
vnames | default is to use variable |
cex.strip | character size for panel strip labels |
outerlabels | if |
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com
References
Esty WW, Banfield J: The box-percentile plot. J Statistical Software 8 No. 17, 2003.
See Also
bpplot, panel.bwplot, scat1d, quantile, Ecdf, summaryP, useOuterStrips
Examples
set.seed(13)
x <- rnorm(1000)
g <- sample(1:6, 1000, replace=TRUE)
x[g==1][1:20] <- rnorm(20)+3   # contaminate 20 x's for group 1
# default trellis box plot
require(lattice)
bwplot(g ~ x)
# box-percentile plot with data density (rug plot)
bwplot(g ~ x, panel=panel.bpplot, probs=seq(.01,.49,by=.01), datadensity=TRUE)
# add ,scat1d.opts=list(tfrac=1) to make all tick marks the same size
# when a group has > 125 observations
# small dot for means, show only .05,.125,.25,.375,.625,.75,.875,.95 quantiles
bwplot(g ~ x, panel=panel.bpplot, cex.means=.3)
# suppress means and reference lines for lower and upper quartiles
bwplot(g ~ x, panel=panel.bpplot, probs=c(.025,.1,.25), means=FALSE, qref=FALSE)
# continuous plot up until quartiles ("Tootsie Roll plot")
bwplot(g ~ x, panel=panel.bpplot, probs=seq(.01,.25,by=.01))
# start at quartiles then make it continuous ("coffin plot")
bwplot(g ~ x, panel=panel.bpplot, probs=seq(.25,.49,by=.01))
# same as previous but add a spike to give 0.95 interval
bwplot(g ~ x, panel=panel.bpplot, probs=c(.025,seq(.25,.49,by=.01)))
# decile plot with reference lines at outer quintiles and median
bwplot(g ~ x, panel=panel.bpplot, probs=c(.1,.2,.3,.4), qref=c(.5,.2,.8))
# default plot with tick marks showing all observations outside the outer
# box (.05 and .95 quantiles), with very small ticks
bwplot(g ~ x, panel=panel.bpplot, nout=.05, scat1d.opts=list(frac=.01))
# show 5 smallest and 5 largest observations
bwplot(g ~ x, panel=panel.bpplot, nout=5)
# Use a scat1d option (preserve=TRUE) to ensure that the right peak extends
# to the same position as the extreme scat1d
bwplot(~x , panel=panel.bpplot, probs=seq(.00,.5,by=.001),
       datadensity=TRUE, scat1d.opt=list(preserve=TRUE))
# Add an extended box plot to an existing base graphics plot
plot(x, 1:length(x))
panel.bpplot(x, 1070, nogrid=TRUE, pch=19, height=15, cex.means=.5)
# Draw a prototype showing how to interpret the plots
bpplt()
# Example for bpplotM
set.seed(1)
n <- 800
d <- data.frame(treatment=sample(c('a','b'), n, TRUE),
                sex=sample(c('female','male'), n, TRUE),
                age=rnorm(n, 40, 10),
                bp =rnorm(n, 120, 12),
                wt =rnorm(n, 190, 30))
label(d$bp) <- 'Systolic Blood Pressure'
units(d$bp) <- 'mmHg'
bpplotM(age + bp + wt ~ treatment, data=d)
bpplotM(age + bp + wt ~ treatment * sex, data=d, cex.strip=.8)
bpplotM(age + bp + wt ~ treatment*sex, data=d,
        violin=TRUE,
        violin.opts=list(col=adjustcolor('blue', alpha.f=.15),
                         border=FALSE))
bpplotM(c('age', 'bp', 'wt'), groups='treatment', data=d)
# Can use Hmisc Cs function, e.g. Cs(age, bp, wt)
bpplotM(age + bp + wt ~ treatment, data=d, nloc='left')
# Without treatment: bpplotM(age + bp + wt ~ 1, data=d)
## Not run:
# Automatically find all variables that appear to be continuous
getHdata(support)
bpplotM(data=support, group='dzgroup',
        cex.strip=.4, cex.means=.3, cex.n=.45)
# Separate displays for categorical vs. continuous baseline variables
getHdata(pbc)
pbc <- upData(pbc, moveUnits=TRUE)
s <- summaryM(stage + sex + spiders ~ drug, data=pbc)
plot(s)
Key(0, .5)
s <- summaryP(stage + sex + spiders ~ drug, data=pbc)
plot(s, val ~ freq | var, groups='drug', pch=1:3, col=1:3,
     key=list(x=.6, y=.8))
bpplotM(bili + albumin + protime + age ~ drug, data=pbc)
## End(Not run)
Partitions an object into different sets
Description
Partitions an object into subsets of length defined in the sep argument.
Usage
partition.vector(x, sep, ...)
partition.matrix(x, rowsep, colsep, ...)
Arguments
x | object to be partitioned. |
sep | determines how many elements should go into each set. The sum of |
rowsep | determines how many rows should go into each set. The sum of |
colsep | determines how many columns should go into each set. The sum of |
... | arguments used in other methods of |
Value
A list of equal length as sep containing the partitioned objects.
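The partitioning described above can be illustrated in base R. This is a minimal sketch that does not call partition.vector; it assumes the segment lengths in sep add up to the length of the vector (the source text on that point is truncated), and the names parts and a are illustrative only.

```r
# Split a vector into consecutive pieces whose lengths are given by sep.
# Assumption: sum(sep) equals length(a).
a   <- 1:7
sep <- c(1, 3, 2, 1)
stopifnot(sum(sep) == length(a))

# rep() labels each element with the index of the piece it belongs to
parts <- split(a, rep(seq_along(sep), times = sep))
parts
length(parts) == length(sep)   # one list element per entry of sep
```

The result is a list with one element per entry of sep, which matches the shape of the return value documented above.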
Author(s)
Charles Dupont
See Also
Examples
a <- 1:7
partition.vector(a, sep=c(1,3,2,1))
First Principal Component
Description
Given a numeric matrix which may or may not contain NAs, pc1 standardizes the columns to have mean 0 and variance 1 and computes the first principal component using prcomp. The proportion of variance explained by this component is printed, and so are the coefficients of the original (not scaled) variables. These coefficients may be applied to the raw data to obtain the first PC.
Usage
pc1(x, hi)
Arguments
x | numeric matrix |
hi | if specified, the first PC is scaled so that its maximum value is |
Value
The vector of observations with the first PC. An attribute "coef" is attached to this vector. "coef" contains the raw-variable coefficients.
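The relationship between the standardized first PC and raw-variable coefficients can be sketched with stats::prcomp alone. This is an illustration under assumptions, not the pc1 source: the intercept/slope layout and the names slopes, intercept and pc_raw are hypothetical, and NA handling is omitted.

```r
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100)
X  <- cbind(x1, x2)

# First PC of the standardized columns
p  <- prcomp(X, center = TRUE, scale. = TRUE)
pc <- p$x[, 1]

# Proportion of variance explained by the first component
prop <- p$sdev[1]^2 / sum(p$sdev^2)

# Coefficients on the raw (unscaled) variables:
# PC1 = sum_j loading_j * (x_j - mean_j) / sd_j
slopes    <- p$rotation[, 1] / p$scale
intercept <- -sum(slopes * p$center)
pc_raw    <- as.vector(intercept + X %*% slopes)

all.equal(unname(pc), pc_raw)   # raw coefficients reproduce the first PC
```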
Author(s)
Frank Harrell
See Also
Examples
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100)
w <- pc1(cbind(x1,x2))
attr(w,'coef')
plot.princmp
Description
Plot Method for princmp
Usage
## S3 method for class 'princmp'
plot(
  x,
  which = c("scree", "loadings"),
  k = x$k,
  offset = 0.8,
  col = 1,
  adj = 0,
  ylim = NULL,
  add = FALSE,
  abbrev = 25,
  nrow = NULL,
  ...
)
Arguments
x | results of 'princmp' |
which | 'scree' or 'loadings' |
k | number of components to show, default is 'k' specified to 'princmp' |
offset | controls positioning of text labels for cumulative fraction of variance explained |
col | color of plotted text in scree plot |
adj | angle for plotting text in scree plot |
ylim | y-axis scree plotting limits, a 2-vector |
add | set to 'TRUE' to add a line to an existing scree plot without drawing axes |
abbrev | an integer specifying the variable name length above which names are passed through [abbreviate(..., minlength=abbrev)] |
nrow | number of rows to use in plotting loadings. Defaults to the 'ggplot2' 'facet_wrap' default. |
... | unused |
Details
By default, uses base graphics to plot the scree plot from a [princmp()] result, showing cumulative proportion of variance explained. Alternatively the standardized PC loadings are shown in a 'ggplot2' bar chart.
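A scree plot annotated with cumulative fractions of variance can be sketched in base R as follows. This uses stats::prcomp on the built-in USArrests data rather than a princmp object, so the object layout and the plotting details are assumptions made for illustration.

```r
# Variances of the components and the cumulative fraction explained
p       <- prcomp(USArrests, scale. = TRUE)
vars    <- p$sdev^2
cumfrac <- cumsum(vars) / sum(vars)

# Scree plot with the cumulative fraction written above each point
plot(seq_along(vars), vars, type = 'b',
     xlab = 'Component', ylab = 'Variance')
text(seq_along(vars), vars, labels = round(cumfrac, 2), pos = 3)
```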
Value
'ggplot2' object if which='loadings'
Author(s)
Frank Harrell
plotCorrM
Description
Plot Correlation Matrix and Correlation vs. Time Gap
Usage
plotCorrM(
  r,
  what = c("plots", "data"),
  type = c("rectangle", "circle"),
  xlab = "",
  ylab = "",
  maxsize = 12,
  xangle = 0
)
Arguments
r | correlation matrix |
what | specifies whether to return plots or the data frame used in making the plots |
type | specifies whether to use bottom-aligned rectangles (the default) or centered circles |
xlab | x-axis label for correlation matrix |
ylab | y-axis label for correlation matrix |
maxsize | maximum circle size if |
xangle | angle for placing x-axis labels, defaulting to 0. Consider using |
Details
Constructs two ggplot2 graphics. The first is a half matrix of rectangles where the height of the rectangle is proportional to the absolute value of the correlation coefficient, with positive and negative coefficients shown in different colors. The second graphic is a variogram-like graph of correlation coefficients on the y-axis and absolute time gap on the x-axis, with a loess smoother added. The times are obtained from the correlation matrix's row and column names if these are numeric. If any names are not numeric, the times are taken as the integers 1, 2, 3, ... The two graphics are ggplotly-ready if you use plotly::ggplotly(..., tooltip='label').
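The data behind the correlation-versus-time-gap graphic can be sketched in base R. This is not the plotCorrM implementation: the column names r and delta, the use of lowess in place of loess, and the fallback to integer times are assumptions that follow the description above.

```r
set.seed(1)
r <- cor(matrix(rnorm(100), ncol = 10))

# Times come from the dimnames when they are numeric, else 1, 2, 3, ...
times <- suppressWarnings(as.numeric(rownames(r)))
if (is.null(rownames(r)) || anyNA(times)) times <- seq_len(nrow(r))

# One row per pair of distinct variables (lower triangle only)
idx <- which(lower.tri(r), arr.ind = TRUE)
gap <- data.frame(r     = r[idx],
                  delta = abs(times[idx[, 'row']] - times[idx[, 'col']]))

# Variogram-like display with a smoother
plot(gap$delta, gap$r, xlab = 'Absolute time gap', ylab = 'Correlation')
lines(lowess(gap$delta, gap$r))
```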
Value
a list containing two ggplot2 objects if what='plots', or a data frame if what='data'
Author(s)
Frank Harrell
Examples
set.seed(1)
r <- cor(matrix(rnorm(100), ncol=10))
g <- plotCorrM(r)
g[[1]]  # plot matrix
g[[2]]  # plot correlation vs gap time
# ggplotlyr(g[[2]])
# ggplotlyr uses ggplotly with tooltip='label' then removes
# txt: from hover text
Plot Precision of Estimate of Pearson Correlation Coefficient
Description
This function plots the precision (margin of error) of the product-moment linear correlation coefficient r vs. sample size, for a given vector of correlation coefficients rho. Precision is defined as the larger of the upper confidence limit minus rho and rho minus the lower confidence limit. labcurve is used to automatically label the curves.
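The precision definition above can be written out as a short computation. The sketch below assumes confidence limits based on Fisher's z transformation with standard error 1/sqrt(n - 3); the text does not state which interval plotCorrPrecision uses, so treat that choice and the helper name corr_precision as assumptions.

```r
# Margin of error for a correlation: the larger of (upper - rho) and (rho - lower)
# Assumption: limits come from Fisher's z transformation.
corr_precision <- function(rho, n, conf.int = 0.95) {
  zcrit <- qnorm(1 - (1 - conf.int) / 2)
  z  <- atanh(rho)                        # Fisher z
  lo <- tanh(z - zcrit / sqrt(n - 3))     # lower limit on the r scale
  hi <- tanh(z + zcrit / sqrt(n - 3))     # upper limit on the r scale
  pmax(hi - rho, rho - lo)
}

n <- seq(10, 400, length.out = 100)
prec0  <- corr_precision(0,   n)
prec05 <- corr_precision(0.5, n)
plot(n, prec0, type = 'l', ylab = 'Precision')
lines(n, prec05, lty = 2)
```

Under this assumption the rho=0 curve lies above the rho=0.5 curve at every n, consistent with the remark that rho=0 gives the worst-case precision.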
Usage
plotCorrPrecision(rho = c(0, 0.5),
                  n = seq(10, 400, length.out = 100),
                  conf.int = 0.95, offset=0.025, ...)
Arguments
rho | single or vector of true correlations. A worst-case precision graph results from rho=0 |
n | vector of sample sizes to use on the x-axis |
conf.int | confidence coefficient; default uses 0.95 confidence limits |
offset | see |
... | other arguments to |
Author(s)
Xing Wang and Frank Harrell
See Also
Examples
plotCorrPrecision()
plotCorrPrecision(rho=0)
plotly Multiple
Description
Generates multiple plotly graphics, driven by specs in a data frame
Usage
plotlyM(
  data,
  x = ~x,
  y = ~y,
  xhi = ~xhi,
  yhi = ~yhi,
  htext = NULL,
  multplot = NULL,
  strata = NULL,
  fitter = NULL,
  color = NULL,
  size = NULL,
  showpts = !length(fitter),
  rotate = FALSE,
  xlab = NULL,
  ylab = NULL,
  ylabpos = c("top", "y"),
  xlim = NULL,
  ylim = NULL,
  shareX = TRUE,
  shareY = FALSE,
  height = NULL,
  width = NULL,
  nrows = NULL,
  ncols = NULL,
  colors = NULL,
  alphaSegments = 1,
  alphaCline = 0.3,
  digits = 4,
  zeroline = TRUE
)
Arguments
data | input data frame |
x | formula specifying the x-axis variable |
y | formula for y-axis variable |
xhi | formula for upper x variable limits ( |
yhi | formula for upper y variable limit ( |
htext | formula for hovertext variable |
multplot | formula specifying a variable in |
strata | formula specifying an optional stratification variable |
fitter | a fitting such as |
color |
|
size |
|
showpts | if |
rotate | set to |
xlab | x-axis label. May contain html. |
ylab | a named vector of y-axis labels, possibly containing html (see example below). The names of the vector must correspond to levels of the |
ylabpos | position of y-axis labels. Default is on top left of plot. Specify |
xlim | 2-vector of x-axis limits, optional |
ylim | 2-vector of y-axis limits, optional |
shareX | specifies whether x-axes should be shared when they align vertically over multiple plots |
shareY | specifies whether y-axes should be shared when they align horizontally over multiple plots |
height | height of the combined image in pixels |
width | width of the combined image in pixels |
nrows | the number of rows to produce using |
ncols | the number of columns to produce using |
colors | the color palette. Leave unspecified to use the default |
alphaSegments | alpha transparency for line segments (when |
alphaCline | alpha transparency for lines used to connect points |
digits | number of significant digits to use in constructing hovertext |
zeroline | set to |
Details
Generates multiple plotly traces and combines them with plotly::subplot. The traces are controlled by specifications in data frame data plus various arguments. data must contain these variables: x, y, and tracename (if color is not an "AsIs" color such as ~ I('black')), and can contain these optional variables: xhi, yhi (rows containing NA for both xhi and yhi represent points, and those with non-NA xhi or yhi represent segments), connect (set to TRUE for rows for points, to connect the symbols), legendgroup (see plotly documentation), and htext (hovertext). If the color argument is given and it is not an "AsIs" color, the variable named in the color formula must also be in data. Likewise for size. If multplot is given, the variable given in the formula must be in data. If strata is present, another level of separate plots is generated by levels of strata, within levels of multplot.
If fitter is specified, x,y coordinates for an individual plot are run through fitter, and a line plot is made instead of showing data points. Alternatively you can specify fitter='ecdf' to compute and plot empirical cumulative distribution functions.
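The point-versus-segment convention for rows of data can be checked before plotting. The sketch below builds a small specification data frame in base R and classifies its rows from the NA pattern of xhi and yhi; the object names spec and kind are illustrative, and no plotly call is made.

```r
# Rows with NA in both xhi and yhi are points; any non-NA xhi or yhi is a segment
spec <- data.frame(
  x         = c(1, 2, 1, 2),
  y         = c(0.2, 0.4, 0.1, 0.3),
  xhi       = NA_real_,
  yhi       = c(NA, NA, 0.3, 0.5),
  tracename = c('mean', 'mean', 'limits', 'limits'),
  connect   = c(TRUE, TRUE, NA, NA)
)

kind <- ifelse(is.na(spec$xhi) & is.na(spec$yhi), 'point', 'segment')
table(spec$tracename, kind)
```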
Value
plotly object produced by subplot
Author(s)
Frank Harrell
Examples
## Not run: 
set.seed(1)
pts <- expand.grid(v=c('y1', 'y2', 'y3'), x=1:4, g=c('a', 'b'), yhi=NA,
                   tracename='mean', legendgroup='mean',
                   connect=TRUE, size=4)
pts$y <- round(runif(nrow(pts)), 2)
segs <- expand.grid(v=c('y1', 'y2', 'y3'), x=1:4, g=c('a', 'b'),
                    tracename='limits', legendgroup='limits',
                    connect=NA, size=6)
segs$y <- runif(nrow(pts))
segs$yhi <- segs$y + runif(nrow(pts), .05, .15)
z <- rbind(pts, segs)
xlab <- labelPlotmath('X<sub>12</sub>', 'm/sec<sup>2</sup>', html=TRUE)
ylab <- c(y1=labelPlotmath('Y1', 'cm', html=TRUE),
          y2='Y2',
          y3=labelPlotmath('Y3', 'mm', html=TRUE))
W=plotlyM(z, multplot=~v, color=~g, xlab=xlab, ylab=ylab, ncols=2,
          colors=c('black', 'blue'))
W2=plotlyM(z, multplot=~v, color=~I('black'), xlab=xlab, ylab=ylab,
           colors=c('black', 'blue'))
## End(Not run)
Plot smoothed estimates
Description
Plot smoothed estimates of x vs. y, handling missing data for lowess or supsmu, and adding axis labels. Optionally suppresses plotting extrapolated estimates. An optional group variable can be specified to compute and plot the smooth curves by levels of group. When group is present, the datadensity option will draw tick marks showing the location of the raw x-values, separately for each curve. plsmo has an option to plot connected points for raw data, with no smoothing. The non-panel version of plsmo allows y to be a matrix, for which smoothing is done separately over its columns. If both group and multi-column y are used, the number of curves plotted is the product of the number of groups and the number of y columns.
method='intervals' is often used when y is binary, as it may be tricky to specify a reasonable smoothing parameter to lowess or supsmu in this case. The 'intervals' method uses the cutGn function to form intervals of x containing a minimum of mobs observations. For each interval the ifun function summarizes y, with the default being the mean (proportions for binary y). The results are plotted as step functions, with vertical discontinuities drawn with a saturation of 0.15 of the original color. A plus sign is drawn at the mean x within each interval. For this approach, the default x-range is the entire raw data range, and trim and evaluate are ignored. For panel.plsmo it is best to specify type='l' when using 'intervals'.
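The idea behind method='intervals' can be sketched in base R. The grouping rule below (consecutive blocks of sorted x, with the remainder folded into the last block) is a stand-in chosen for illustration and is not claimed to match cutGn; the names block and s are hypothetical.

```r
set.seed(1)
n <- 200
x <- runif(n, 20, 80)
y <- rbinom(n, 1, plogis((x - 50) / 10))   # binary response
mobs <- 30                                 # minimum observations per interval

# Consecutive blocks of sorted x, each holding at least mobs observations
o       <- order(x)
nblocks <- floor(n / mobs)
block   <- pmin(ceiling(seq_len(n) / mobs), nblocks)

# Summarize each interval: mean x, proportion with y = 1, and interval size
s <- data.frame(
  xmean = as.vector(tapply(x[o], block, mean)),
  ysumm = as.vector(tapply(y[o], block, mean)),
  n     = as.vector(table(block))
)

# Plus sign at the mean x within each interval
plot(s$xmean, s$ysumm, pch = 3, xlab = 'x', ylab = 'Proportion with y = 1')
```

Replacing mean with median in the second tapply call corresponds to supplying a different ifun.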
panel.plsmo is a panel function for trellis for the xyplot function that uses plsmo and its options to draw one or more nonparametric function estimates on each panel. This has advantages over using xyplot with panel.xyplot and panel.loess: (1) by default it will invoke labcurve to label the curves where they are most separated, (2) the datadensity option will put rug plots on each curve (instead of a single rug plot at the bottom of the graph), and (3) when panel.plsmo invokes plsmo it can use the "super smoother" (supsmu function) instead of lowess, or pass method='intervals'. panel.plsmo senses when a group variable is specified to xyplot so that it can invoke panel.superpose instead of panel.xyplot. Using panel.plsmo through trellis has some advantages over calling plsmo directly in that conditioning variables are allowed and trellis uses nicer fonts etc.
When a group variable was used, panel.plsmo creates a function Key in the session frame that the user can invoke to draw a key for individual data point symbols used for the groups. By default, the key is positioned at the upper right corner of the graph. If Key(locator(1)) is specified, the key will appear so that its upper left corner is at the coordinates of the mouse click.
For ggplot2 graphics the counterparts are stat_plsmo and histSpikeg.
Usage
plsmo(x, y, method=c("lowess","supsmu","raw","intervals"), xlab, ylab,
      add=FALSE, lty=1 : lc, col=par("col"), lwd=par("lwd"),
      iter=if(length(unique(y))>2) 3 else 0, bass=0, f=2/3, mobs=30, trim,
      fun, ifun=mean, group, prefix, xlim, ylim,
      label.curves=TRUE, datadensity=FALSE, scat1d.opts=NULL,
      lines.=TRUE, subset=TRUE,
      grid=FALSE, evaluate=NULL, ...)
#To use panel function:
#xyplot(formula=y ~ x | conditioningvars, groups,
#       panel=panel.plsmo, type='b',
#       label.curves=TRUE,
#       lwd = superpose.line$lwd,
#       lty = superpose.line$lty,
#       pch = superpose.symbol$pch,
#       cex = superpose.symbol$cex,
#       font = superpose.symbol$font,
#       col = NULL, scat1d.opts=NULL, ...)
Arguments
x | vector of x-values, NAs allowed |
y | vector or matrix of y-values, NAs allowed |
method |
|
xlab | x-axis label iff add=F. Defaults to label(x) or argument name. |
ylab | y-axis label, like xlab. |
add | Set to T to call lines instead of plot. Assumes axes already labeled. |
lty | line type, default=1,2,3,..., corresponding to columns of |
col | color for each curve, corresponding to |
lwd | vector of line widths for the curves, corresponding to |
iter | iter parameter if |
bass | bass parameter if |
f | passed to the |
mobs | for |
trim | only plots smoothed estimates between trim and 1-trim quantiles of x. Default is to use 10th smallest to 10th largest x in the group if the number of observations in the group exceeds 200 (0 otherwise). Specify trim=0 to plot over entire range. |
fun | after computing the smoothed estimates, if |
ifun | a summary statistic function to apply to the |
group | a variable, either a |
prefix | a character string to appear in group of group labels. The presence of |
xlim | a vector of 2 x-axis limits. Default is observed range. |
ylim | a vector of 2 y-axis limits. Default is observed range. |
label.curves | set to |
datadensity | set to |
scat1d.opts | a list of options to hand to |
lines. | set to |
subset | a logical or integer vector specifying a subset to use for processing, with respect to all variables being analyzed |
grid | set to |
evaluate | number of points to keep from smoother. If specified, an equally-spaced grid of |
... | optional arguments that are passed to |
type | set to |
pch,cex,font | vectors of graphical parameters corresponding to the |
Value
plsmo returns a list of curves (x and y coordinates) that was passed to labcurve
Side Effects
plots, and panel.plsmo creates the Key function in the session frame.
See Also
lowess, supsmu, label, quantile, labcurve, scat1d, xyplot, panel.superpose, panel.xyplot, stat_plsmo, histSpikeg, cutGn
Examples
set.seed(1)
x <- 1:100
y <- x + runif(100, -10, 10)
plsmo(x, y, "supsmu", xlab="Time of Entry")
#Use label(y) or "y" for ylab
plsmo(x, y, add=TRUE, lty=2)
#Add lowess smooth to existing plot, with different line type
age <- rnorm(500, 50, 15)
survival.time <- rexp(500)
sex <- sample(c('female','male'), 500, TRUE)
race <- sample(c('black','non-black'), 500, TRUE)
plsmo(age, survival.time < 1, fun=qlogis, group=sex) # plot logit by sex
#Bivariate Y
sbp <- 120 + (age - 50)/10 + rnorm(500, 0, 8) + 5 * (sex == 'male')
dbp <- 80 + (age - 50)/10 + rnorm(500, 0, 8) - 5 * (sex == 'male')
Y <- cbind(sbp, dbp)
plsmo(age, Y)
plsmo(age, Y, group=sex)
#Plot points and smooth trend line using trellis
# (add type='l' to suppress points or type='p' to suppress trend lines)
require(lattice)
xyplot(survival.time ~ age, panel=panel.plsmo)
#Do this for multiple panels
xyplot(survival.time ~ age | sex, panel=panel.plsmo)
#Repeat this using equal sample size intervals (n=25 each) summarized by
#the median, then a proportion (mean of binary y)
xyplot(survival.time ~ age | sex, panel=panel.plsmo, type='l',
       method='intervals', mobs=25, ifun=median)
ybinary <- ifelse(runif(length(sex)) < 0.5, 1, 0)
xyplot(ybinary ~ age, groups=sex, panel=panel.plsmo, type='l',
       method='intervals', mobs=75, ifun=mean, xlim=c(0, 120))
#Do this for subgroups of points on each panel, show the data
#density on each curve, and draw a key at the default location
xyplot(survival.time ~ age | sex, groups=race, panel=panel.plsmo,
       datadensity=TRUE)
Key()
#Use wtd.loess.noiter to do a fast weighted smooth
plot(x, y)
lines(wtd.loess.noiter(x, y))
lines(wtd.loess.noiter(x, y, weights=c(rep(1,50), 100, rep(1,49))), col=2)
points(51, y[51], pch=18)   # show overly weighted point
#Try to duplicate this smooth by replicating 51st observation 100 times
lines(wtd.loess.noiter(c(x,rep(x[51],99)),c(y,rep(y[51],99)),
      type='ordered all'), col=3)
#Note: These two don't agree exactly
Power and Sample Size for Ordinal Response
Description
popower computes the power for a two-tailed two sample comparison of ordinal outcomes under the proportional odds ordinal logistic model. The power is the same as that of the Wilcoxon test but with ties handled properly. posamsize computes the total sample size needed to achieve a given power. Both functions compute the efficiency of the design compared with a design in which the response variable is continuous. print methods exist for both functions. Any of the input arguments may be vectors, in which case a vector of powers or sample sizes is returned. These functions use the methods of Whitehead (1993).
pomodm is a function that assists in translating odds ratios to differences in mean or median on the original scale.
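How an odds ratio shifts a set of cell probabilities under proportional odds can be sketched in base R. The helper shift_po is hypothetical, and applying the odds ratio to the odds of Y >= j is an assumed convention (the text does not state the direction); the sketch is not the pomodm source.

```r
# Shift cell probabilities p by an odds ratio under the proportional odds model.
# Assumption: the odds ratio multiplies the odds of Y >= j for every cutoff j.
shift_po <- function(p, odds.ratio) {
  stopifnot(abs(sum(p) - 1) < 1e-8)
  k   <- length(p)
  ge  <- rev(cumsum(rev(p)))[-1]                   # P(Y >= j), j = 2..k
  ge2 <- plogis(qlogis(ge) + log(odds.ratio))      # shifted exceedance probs
  cum <- c(1, ge2)                                 # P(Y >= 1) is always 1
  c(-diff(cum), cum[k])                            # back to cell probabilities
}

x  <- 1:5
p  <- c(.05, .2, .2, .3, .25)
p2 <- shift_po(p, 2)
c(mean.before = sum(x * p), mean.after = sum(x * p2))
```

With an odds ratio of 1 the probabilities are returned unchanged, and under the assumed direction an odds ratio above 1 moves probability toward higher categories, which raises the mean.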
simPOcuts simulates simple unadjusted two-group comparisons under a PO model to demonstrate the natural sampling variability that causes estimated odds ratios to vary over cutoffs of Y.
propsPO uses ggplot2 to plot a stacked bar chart of proportions stratified by a grouping variable (and optionally a stratification variable), with an optional additional graph showing what the proportions would be had proportional odds held and an odds ratio was applied to the proportions in a reference group. If the result is passed to ggplotly, customized tooltip hover text will appear.
propsTrans uses ggplot2 to plot all successive transition proportions. formula has the state variable on the left hand side, the first right-hand variable is time, and the second right-hand variable is a subject ID variable.
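The successive transition proportions that propsTrans displays can be tabulated in base R. This sketch only computes the proportions; the pairing of each record with the subject's next record and the names tr, counts and props are assumptions made for illustration.

```r
set.seed(1)
d <- expand.grid(id = 1:30, time = 1:10)
d$state <- sample(LETTERS[1:4], nrow(d), replace = TRUE)

# Order by subject then time so that row i + 1 follows row i in time
d <- d[order(d$id, d$time), ]
n <- nrow(d)
same_subject <- d$id[-n] == d$id[-1]

# One row per transition: state at a time and state at the next time
tr <- data.frame(time = d$time[-n],
                 from = d$state[-n],
                 to   = d$state[-1])[same_subject, ]

# Transition counts by starting time, then row proportions within each time
counts <- table(from = tr$from, to = tr$to, time = tr$time)
props  <- prop.table(counts, margin = c(1, 3))
round(props[, , '1'], 2)    # transitions from time 1 to time 2
```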
multEventChart uses ggplot2 to plot event charts showing state transitions, accounting for absorbing states/events. It is based on code written by Lucy D'Agostino McGowan posted at https://livefreeordichotomize.com/posts/2020-05-21-survival-model-detective-1/.
Usage
popower(p, odds.ratio, n, n1, n2, alpha=0.05)
## S3 method for class 'popower'
print(x, ...)
posamsize(p, odds.ratio, fraction=.5, alpha=0.05, power=0.8)
## S3 method for class 'posamsize'
print(x, ...)
pomodm(x=NULL, p, odds.ratio=1)
simPOcuts(n, nsim=10, odds.ratio=1, p)
propsPO(formula, odds.ratio=NULL, ref=NULL, data=NULL, ncol=NULL, nrow=NULL )
propsTrans(formula, data=NULL, labels=NULL, arrow='\u2794',
           maxsize=12, ncol=NULL, nrow=NULL)
multEventChart(formula, data=NULL, absorb=NULL, sortbylast=FALSE,
               colorTitle=label(y), eventTitle='Event',
               palette='OrRd',
               eventSymbols=c(15, 5, 1:4, 6:10),
               timeInc=min(diff(unique(x))/2))
Arguments
p | a vector of marginal cell probabilities which must add up to one. For |
odds.ratio | the odds ratio to be able to detect. It doesn't matter which group is in the numerator. For |
n | total sample size for |
n1 | for |
n2 | for |
nsim | number of simulated studies to create by |
alpha | type I error |
x | an object created by |
fraction | for |
power | for |
formula | an R formula expression for |
ref | for |
data | a data frame or |
labels | for |
arrow | character to use as the arrow symbol for transitions in |
nrow,ncol | see |
maxsize | maximum symbol size |
... | unused |
absorb | character vector specifying the subset of levels of the left hand side variable that are absorbing states such as death or hospital discharge |
sortbylast | set to |
colorTitle | label for legend for status |
eventTitle | label for legend for |
palette | a single character string specifying the |
eventSymbols | vector of symbol codes. Default for first two symbols is a solid square and an open diamond. |
timeInc | time increment for the x-axis. Default is 1/2 the shortest gap between any two distinct times in the data. |
Value
a list containing power, eff (relative efficiency), and approx.se (approximate standard error of log odds ratio) for popower, or containing n and eff for posamsize.
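The total sample size calculation can be sketched from the formula of Whitehead (1993) as it is commonly stated. The helper whitehead_n is hypothetical and is not the posamsize source; whether posamsize applies exactly this expression, including how it treats p, is an assumption.

```r
# Total sample size for a two-group proportional odds comparison.
# Assumption: commonly stated Whitehead (1993) formula, with p the marginal
# cell probabilities and fraction the proportion allocated to group 1.
whitehead_n <- function(p, odds.ratio, fraction = 0.5,
                        alpha = 0.05, power = 0.8) {
  stopifnot(abs(sum(p) - 1) < 1e-8)
  A  <- fraction / (1 - fraction)          # allocation ratio
  za <- qnorm(1 - alpha / 2)
  zb <- qnorm(power)
  3 * (A + 1)^2 * (za + zb)^2 /
    (A * log(odds.ratio)^2 * (1 - sum(p^3)))
}

p <- c(.1, .2, .4, .3)
n <- whitehead_n(p, 1.2)
round(n)   # 3148
```

The rounded value agrees with the total sample size of 3148 used in the Examples for the same p and odds ratio.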
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com
References
Whitehead J (1993): Sample size calculations for ordered categorical data. Stat in Med 12:2257–2271.
Julious SA, Campbell MJ (1996): Letter to the Editor. Stat in Med 15:1065–1066. Shows accuracy of formula for binary response case.
See Also
simRegOrd, bpower, cpower, impactPO
Examples
# For a study of back pain (none, mild, moderate, severe) here are the
# expected proportions (averaged over 2 treatments) that will be in
# each of the 4 categories:
p <- c(.1,.2,.4,.3)
popower(p, 1.2, 1000)   # OR=1.2, total n=1000
posamsize(p, 1.2)
popower(p, 1.2, 3148)
# If p was the vector of probabilities for group 1, here's how to
# compute the average over the two groups:
# p2 <- pomodm(p=p, odds.ratio=1.2)
# pavg <- (p + p2) / 2
# Compare power to test for proportions for binary case,
# proportion of events in control group of 0.1
p <- 0.1; or <- 0.85; n <- 4000
popower(c(1 - p, p), or, n)    # 0.338
bpower(p, odds.ratio=or, n=n)  # 0.320
# Add more categories, starting with 0.1 in middle
p <- c(.8, .1, .1)
popower(p, or, n)   # 0.543
p <- c(.7, .1, .1, .1)
popower(p, or, n)   # 0.67
# Continuous scale with final level have prob. 0.1
p <- c(rep(1 / n, 0.9 * n), 0.1)
popower(p, or, n)   # 0.843
# Compute the mean and median x after shifting the probability
# distribution by an odds ratio under the proportional odds model
x <- 1 : 5
p <- c(.05, .2, .2, .3, .25)
# For comparison make up a sample that looks like this
X <- rep(1 : 5, 20 * p)
c(mean=mean(X), median=median(X))
pomodm(x, p, odds.ratio=1)  # still have to figure out the right median
pomodm(x, p, odds.ratio=0.5)
# Show variation of odds ratios over possible cutoffs of Y even when PO
# truly holds. Run 5 simulations for a total sample size of 300.
# The two groups have 150 subjects each.
s <- simPOcuts(300, nsim=5, odds.ratio=2, p=p)
round(s, 2)
# An ordinal outcome with levels a, b, c, d, e is measured at 3 times
# Show the proportion of values in each outcome category stratified by
# time. Then compute what the proportions would be had the proportions
# at times 2 and 3 been the proportions at time 1 modified by two odds ratios
set.seed(1)
d <- expand.grid(time=1:3, reps=1:30)
d$y <- sample(letters[1:5], nrow(d), replace=TRUE)
propsPO(y ~ time, data=d, odds.ratio=function(time) c(1, 2, 4)[time])
# To show with plotly, save previous result as object p and then:
# plotly::ggplotly(p, tooltip='label')
# Add a stratification variable and don't consider an odds ratio
d <- expand.grid(time=1:5, sex=c('female', 'male'), reps=1:30)
d$y <- sample(letters[1:5], nrow(d), replace=TRUE)
propsPO(y ~ time + sex, data=d)  # may add nrow= or ncol=
# Show all successive transition proportion matrices
d <- expand.grid(id=1:30, time=1:10)
d$state <- sample(LETTERS[1:4], nrow(d), replace=TRUE)
propsTrans(state ~ time + id, data=d)
pt1 <- data.frame(pt=1, day=0:3,
                  status=c('well', 'well', 'sick', 'very sick'))
pt2 <- data.frame(pt=2, day=c(1,2,4,6),
                  status=c('sick', 'very sick', 'coma', 'death'))
pt3 <- data.frame(pt=3, day=1:5,
                  status=c('sick', 'very sick', 'sick', 'very sick', 'discharged'))
pt4 <- data.frame(pt=4, day=c(1:4, 10),
                  status=c('well', 'sick', 'very sick', 'well', 'discharged'))
d <- rbind(pt1, pt2, pt3, pt4)
d$status <- factor(d$status, c('discharged', 'well', 'sick',
                               'very sick', 'coma', 'death'))
label(d$day) <- 'Day'
require(ggplot2)
multEventChart(status ~ day + pt, data=d,
               absorb=c('death', 'discharged'),
               colorTitle='Status', sortbylast=TRUE) +
  theme_classic() +
  theme(legend.position='bottom')
princmp
Description
Enhanced Output for Principal and Sparse Principal Components
Usage
princmp(
  formula,
  data = environment(formula),
  method = c("regular", "sparse"),
  k = min(5, p - 1),
  kapprox = min(5, k),
  cor = TRUE,
  sw = FALSE,
  nvmax = 5
)
Arguments
formula | a formula with no left hand side, or a numeric matrix |
data | a data frame or table. By default variables come from the calling environment. |
method | specifies whether regular or sparse principal components are computed |
k | the number of components to plot, display, and return |
kapprox | the number of components to approximate with stepwise regression when |
cor | set to |
sw | set to |
nvmax | maximum number of predictors to allow in stepwise regression PC approximations |
Details
Expands any categorical predictors into indicator variables, and calls princomp (if method='regular' (the default)) or sPCAgrid in the pcaPP package (method='sparse') to compute lasso-penalized sparse principal components. By default all variables are first scaled by their standard deviation after observations with any NAs on any variables in formula are removed. Loadings of standardized variables, and if orig=TRUE loadings on the original data scale are printed. If pl=TRUE a scree plot is drawn with text added to indicate cumulative proportions of variance explained. If sw=TRUE, the leaps package regsubsets function is used to approximate the PCs using forward stepwise regression with the original variables as individual predictors.
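The regular (non-sparse) path described above can be sketched with base R and stats::princomp. This is an illustration, not the princmp source: the use of model.matrix for indicator expansion, the toy data, and the way NA rows are restored in scores are assumptions.

```r
set.seed(1)
d <- data.frame(age = rnorm(50, 50, 10),
                sex = factor(sample(c('female', 'male'), 50, TRUE)),
                bp  = rnorm(50, 120, 12))
d$bp[3] <- NA                                   # one incomplete observation

# Drop observations with any NA, then expand factors into indicators
keep <- complete.cases(d)
X <- model.matrix(~ age + sex + bp, data = d[keep, ])[, -1]   # drop intercept

# cor = TRUE scales each variable by its standard deviation
pc   <- princomp(X, cor = TRUE)
vars <- pc$sdev^2                               # variances explained

# Put scores back on the original rows, with NA where the input had an NA
k <- 2
scores <- matrix(NA_real_, nrow(d), k)
scores[keep, ] <- pc$scores[, 1:k]
```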
A print method prints the results and a plot method plots the scree plot of variance explained.
Value
a list of class princmp with elements scores, a k-column matrix with principal component scores, with NAs when the input data had an NA, and other components useful for printing and plotting. If k=1 scores is a vector. Other components include vars (vector of variances explained), method, k.
Author(s)
Frank Harrell
prints a list of lists in a visually readable format.
Description
Takes a list that is composed of other lists and matrices and prints it in a visually readable format.
Usage
## S3 method for class 'char.list'
print(x, ..., hsep = c("|"), vsep = c("-"), csep = c("+"), print.it = TRUE,
      rowname.halign = c("left", "centre", "right"),
      rowname.valign = c("top", "centre", "bottom"),
      colname.halign = c("centre", "left", "right"),
      colname.valign = c("centre", "top", "bottom"),
      text.halign = c("right", "centre", "left"),
      text.valign = c("top", "centre", "bottom"),
      rowname.width, rowname.height,
      min.colwidth = .Options$digits, max.rowheight = NULL,
      abbreviate.dimnames = TRUE, page.width = .Options$width,
      colname.width, colname.height, prefix.width,
      superprefix.width = prefix.width)
Arguments
x | list object to be printed |
... | place for extra arguments to reside. |
hsep | character used to separate horizontal fields |
vsep | character used to separate vertical fields |
csep | character used where horizontal and vertical separators meet. |
print.it | should the value be printed to the console or returned as a string. |
rowname.halign | horizontal justification of row names. |
rowname.valign | vertical justification of row names. |
colname.halign | horizontal justification of column names. |
colname.valign | vertical justification of column names. |
text.halign | horizontal justification of cell text. |
text.valign | vertical justification of cell text. |
rowname.width | minimum width of row name strings. |
rowname.height | minimum height of row name strings. |
min.colwidth | minimum column width. |
max.rowheight | maximum row height. |
abbreviate.dimnames | should the row and column names be abbreviated. |
page.width | width of the page being printed on. |
colname.width | minimum width of the column names. |
colname.height | minimum height of the column names |
prefix.width | maximum width of the rowname columns |
superprefix.width | maximum width of the super rowname columns |
Value
String containing the formatted table of the list object.
Author(s)
Charles Dupont
Function to print a matrix with stacked cells
Description
Prints a dataframe or matrix in stacked cells. Line break characters in a matrix element will result in a line break in that cell, but tab characters are not supported.
Usage
## S3 method for class 'char.matrix'
print(x, file = "", col.name.align = "cen", col.txt.align = "right",
      cell.align = "cen", hsep = "|", vsep = "-", csep = "+", row.names = TRUE,
      col.names = FALSE, append = FALSE,
      top.border = TRUE, left.border = TRUE, ...)
Arguments
x | a matrix or dataframe |
file | name of file if file output is desired. If left empty, output will be to the screen |
col.name.align | if column names are used, they can be aligned right, left or centre. Default |
col.txt.align | how character columns are aligned. Options are the same as for |
cell.align | how numbers are displayed in columns |
hsep | character string to use as horizontal separator, i.e. what separates columns |
vsep | character string to use as vertical separator, i.e. what separates rows. Length cannot be more than one. |
csep | character string to use where vertical and horizontal separators cross. If |
row.names | logical: are we printing the names of the rows? |
col.names | logical: are we printing the names of the columns? |
append | logical: if |
top.border | logical: do we want a border along the top above the columns? |
left.border | logical: do we want a border along the left of the first column? |
... | unused |
Details
If any column of x is a mixture of character and numeric, the distinction between character and numeric columns will be lost. This is especially so if the matrix is of a form where you would not want to print the column names, the column information being in the rows at the beginning of the matrix.
Row names, if not specified in the making of the matrix will simply be numbers. To prevent printing them, set row.names = FALSE.
Value
No value is returned. The matrix or dataframe will be printed to file or to the screen.
Author(s)
Patrick Connolly p.connolly@hortresearch.co.nz
See Also
write, write.table
Examples
data(HairEyeColor)
print.char.matrix(HairEyeColor[ , , "Male"], col.names = TRUE)
print.char.matrix(HairEyeColor[ , , "Female"], col.txt.align = "left", col.names = TRUE)
z <- rbind(c("", "N", "y"),
           c("[ 1.34,40.3)\n[40.30,48.5)\n[48.49,58.4)\n[58.44,87.8]",
             " 50\n 50\n 50\n 50",
             "0.530\n0.489\n0.514\n0.507"),
           c("female\nmale", " 94\n106", "0.552\n0.473" ),
           c("", "200", "0.510"))
dimnames(z) <- list(c("", "age", "sex", "Overall"),NULL)
print.char.matrix(z)
print.princmp
Description
Print Results of princmp
Usage
## S3 method for class 'princmp'
print(x, which = c("none", "standardized", "original", "both"), k = x$k, ...)
Arguments
x | results of |
which | specifies which loadings to print, the default being |
k | number of components to show, defaults to |
... | unused |
Details
Simple print method for princmp()
Value
nothing
Author(s)
Frank Harrell
printL
Description
Print an object or a named list of objects. When multiple objects are given, their names are printed before their contents. When an object is a vector that is not longer than maxoneline and its elements are not named, all the elements will be printed on one line separated by commas. When dec is given, numeric vectors or numeric columns of data frames or data tables are rounded to the nearest dec before printing. This function is especially helpful when printing objects in a Quarto or RMarkdown document and the code is not currently being shown to place the output in context.
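The one-line rule for short unnamed vectors can be sketched in base R. The helper one_line is hypothetical and covers only that rule plus rounding; it is not the printL source and ignores data frames and lists.

```r
# Return a single comma-separated line for short unnamed atomic vectors,
# or NULL when the object should be printed the usual way.
one_line <- function(x, dec = NULL, maxoneline = 5) {
  if (is.numeric(x) && length(dec)) x <- round(x, dec)
  if (is.atomic(x) && is.null(names(x)) && length(x) <= maxoneline)
    paste(x, collapse = ', ')
  else NULL
}

one_line(c(pi, 1), dec = 2)   # short unnamed vector: printed on one line
one_line(1:10)                # longer than maxoneline: NULL
one_line(c(a = 1, b = 2))     # named elements: NULL
```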
Usage
printL(..., dec = NULL, maxoneline = 5)
Arguments
... | any number of objects to |
dec | optional decimal places to the right of the decimal point for rounding |
maxoneline | controls how many elements may be printed on a single line for |
Value
nothing
Author(s)
Frank Harrell
See Also
Examples
w <- pi + 1 : 2
printL(w=w)
printL(w, dec=3)
printL('this is it'=c(pi, pi, 1, 2),
       yyy=pi,
       z=data.frame(x=pi+1:2, y=3:4, z=c('a', 'b')),
       qq=1:10,
       dec=4)
Print an Object with its Name
Description
Prints an object with its name and with an optional descriptive text string. This is useful for annotating analysis output files and for debugging.
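The mechanism for recovering an object's name from the caller's expression can be sketched with deparse(substitute()). The helper show_named is hypothetical: its heading layout and invisible return value are assumptions for illustration, not the behavior of prn.

```r
# Print an object under a heading built from the expression the caller typed
show_named <- function(x, txt) {
  head <- deparse(substitute(x), width.cutoff = 500)[1]
  if (!missing(txt)) head <- paste(head, txt, sep = '   ')
  cat('\n', head, '\n\n', sep = '')
  print(x)
  invisible(head)     # returned only so the heading can be inspected
}

x <- 1:5
show_named(x)
show_named(x, 'first five integers')
```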
Usage
prn(x, txt, file, head=deparse(substitute(x), width.cutoff=500)[1])
Arguments
x | any object |
txt | optional text string |
file | optional file name. By default, writes to console. |
head | optional heading. Default is derived from the user's expression for |
Side Effects
prints
See Also
Examples
x <- 1:5
prn(x)
# prn(fit, 'Full Model Fit')
Selectively Print Lines of a Text Vector
Description
Given one or two regular expressions or exact text matches, removes elements of the input vector that match these specifications. Omitted lines are replaced by .... This is useful for selectively suppressing some of the printed output of R functions such as regression fitting functions, especially in the context of making statistical reports using Sweave or Odfweave.
Usage
prselect(x, start = NULL, stop = NULL, i = 0, j = 0, pr = TRUE)
Arguments
x | input character vector |
start | text or regular expression to look for starting line to omit. If omitted, deletions start at the first line. |
stop | text or regular expression to look for ending line to omit. If omitted, deletions proceed until the last line. |
i | increment in number of first line to delete after match is found |
j | increment in number of last line to delete after match is found |
pr | set to |
Value
an invisible vector of retained lines of text
Author(s)
Frank Harrell
See Also
Examples
x <- c('the','cat','ran','past','the','dog')
prselect(x, 'big','bad')      # omit nothing- no match
prselect(x, 'the','past')     # omit first 4 lines
prselect(x,'the','junk')      # omit nothing- no match for stop
prselect(x,'ran','dog')       # omit last 4 lines
prselect(x,'cat')             # omit lines 2-
prselect(x,'cat',i=1)         # omit lines 3-
prselect(x,'cat','past')      # omit lines 2-4
prselect(x,'cat','past',j=1)  # omit lines 2-5
prselect(x,'cat','past',j=-1) # omit lines 2-3
prselect(x,'t$','dog')        # omit lines 2-6; t must be at end
# Example for Sweave: run a regression analysis with the rms package
# then selectively output only a portion of what print.ols prints.
# (Thanks to romain.francois@dbmail.com)
# <<z,eval=FALSE,echo=T>>=
# library(rms)
# y <- rnorm(20); x1 <- rnorm(20); x2 <- rnorm(20)
# ols(y ~ x1 + x2)
# <<echo=F>>=
# z <- capture.output( {
# <<z>>
# } )
# prselect(z, 'Residuals:') # keep only summary stats; or:
# prselect(z, stop='Coefficients', j=-1) # keep coefficients, rmse, R^2; or:
# prselect(z, 'Coefficients', 'Residual standard error', j=-1) # omit coef
# @
Date/Time/Directory Stamp the Current Plot
Description
Date-time stamp the current plot in the extreme lower right corner. Optionally add the current working directory and arbitrary other text to the stamp.
Usage
pstamp(txt, pwd = FALSE, time. = TRUE)
Arguments
txt | an optional single text string |
pwd | set to |
time. | set to |
Details
Certain functions are not supported for S-Plus under Windows. For R, results may not be satisfactory if par(mfrow=) is in effect.
Author(s)
Frank Harrell
Examples
plot(1:20)
pstamp(pwd=TRUE, time=FALSE)
qcrypt
Description
Store and Encrypt R Objects or Files or Read and Decrypt Them
Usage
qcrypt(obj, base, service = "R-keyring-service", file, pw)
Arguments
obj | an R object to write to disk and encrypt (if |
base | base file name when creating a file. Not used when |
service | a fairly arbitrary |
file | full name of file to encrypt or decrypt |
pw | a single character string containing an actual password |
Details
qcrypt is used to protect sensitive information on a user's computer or when transmitting a copy of the file to another R user. Unencrypted information only exists for a moment, and the encryption password does not appear in the user's script but instead is managed by the keyring package to remember the password across R sessions, and the getPass package, which pops up a password entry window and does not allow the password to be visible. The password is requested only once, except perhaps when the user logs out of their operating system session or reboots.
The keyring can be bypassed and the password entered in a popup window by specifying service=NA. This is the preferred approach when sending an encrypted file to a user on a different computer.
qcrypt writes R objects to disk in a temporary file using the qs package qsave function. The file is quickly encrypted using the safer package, and the temporary unencrypted qs file is deleted. When reading an encrypted file the process is reversed.
To save an object in an encrypted file, specify the object as the first argument obj and specify a base file name as a character string in the second argument base. The full qs file name will be of the form base.qs.encrypted in the user's current working directory. To unencrypt the file into a short-lived temporary file and use qs::qread to read it, specify the base file name as a character string with the first argument, and do not specify the base argument.
Alternatively, qcrypt can be used to encrypt or decrypt existing files of any type using the same password and keyring mechanism. The former is done by specifying file that does not end in '.encrypted' and the latter is done by ending file with '.encrypted'. When file does not contain a path it is assumed to be in the current working directory. When a file is encrypted the original file is removed. Files are decrypted into a temporary directory created by tempdir(), with the name of the file being the value of file with '.encrypted' removed.
Interactive password provision works when running R, Rscript, RStudio, or Quarto but does not work when running R CMD BATCH. getPass fails under RStudio on Macs.
It is also possible to pass the password as the pw argument. This is only safe if running interactively and the password is defined by typing e.g. pw <- 'whateverpassword' in the console, then running the script interactively with pw=pw added to the qcrypt call.
See R Workflow for more information.
Value
(invisibly) the full encrypted file name if writing the file, or the restored R object if reading the file. When decrypting a general file with file=, the returned value is the full path to a temporary file containing the decrypted data.
Author(s)
Frank Harrell
Examples
## Not run: 
# Suppose x is a data.table or data.frame
# The first time qcrypt is run with a service a password will
# be requested. It will be remembered across sessions thanks to
# the keyring package
qcrypt(x, 'x')     # creates x.qs.encrypted in current working directory
x <- qcrypt('x')   # unencrypts x.qs.encrypted into a temporary
                   # directory, uses qs::qread to read it, and
                   # stores the result in x
# Encrypt a general file using a different password
qcrypt(file='report.pdf', service='pdfkey')
# Decrypt that file
fi <- qcrypt(file='report.pdf.encrypted', service='pdfkey')
# fi contains the full unencrypted file name which is in a temporary directory
# Encrypt without using a keyring
qcrypt(x, 'x', service=NA)
x <- qcrypt('x', service=NA)
pw <- 'somepassword'      # run this in the console
x <- qcrypt('x', pw=pw)   # interactively run this in a script
## End(Not run)
qrxcenter
Description
Mean-center a data matrix and QR transform it
Usage
qrxcenter(x, ...)
Arguments
x | a numeric matrix or vector with at least 2 rows |
... | passed to |
Details
For a numeric matrix x (or a numeric vector that is automatically changed to a one-column matrix), computes column means and subtracts them from x columns, and passes this matrix to base::qr() to orthogonalize columns. Columns of the transformed x are negated as needed so that original directions are preserved (which are arbitrary with QR decomposition). Instead of the default qr operation for which sums of squares of column values are 1.0, qrxcenter makes all the transformed columns have standard deviation of 1.0.
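The steps described above can be sketched in base R. This is only an illustration, assuming x has full column rank so that base::qr() does no column pivoting. The function name qr_center_sketch and the component names to_orth and to_orig are hypothetical and are not the Hmisc implementation or its returned names. The sign convention used here (positive diagonal of the triangular factor) is one common way to fix the arbitrary QR signs.

```r
# Hypothetical sketch: mean-center, QR-orthogonalize, rescale columns to SD 1.
# Assumes x has full column rank (no pivoting by base::qr()).
qr_center_sketch <- function(x) {
  x    <- as.matrix(x)
  n    <- nrow(x)
  xbar <- colMeans(x)
  xc   <- sweep(x, 2, xbar)          # subtract column means
  dec  <- qr(xc)
  Q    <- qr.Q(dec)                  # orthonormal columns (sums of squares 1)
  R    <- qr.R(dec)                  # upper triangular; xc = Q %*% R
  s    <- sign(diag(R))              # QR signs are arbitrary; force diag(R) > 0
  s[s == 0] <- 1
  Q    <- sweep(Q, 2, s, '*')        # flip columns of Q
  R    <- sweep(R, 1, s, '*')        # flip matching rows of R; product unchanged
  z       <- Q * sqrt(n - 1)         # columns now have SD 1 rather than SS 1
  to_orig <- R / sqrt(n - 1)         # xc = z %*% to_orig
  to_orth <- backsolve(to_orig, diag(ncol(xc)))   # z = xc %*% to_orth
  list(x = z, to_orth = to_orth, to_orig = to_orig, xbar = xbar)
}

x <- cbind(a = c(1, 2, 3, 4, 5), b = c(2, 1, 4, 3, 7))
w <- qr_center_sketch(x)
apply(w$x, 2, sd)                                # standard deviations of columns
sweep(w$x %*% w$to_orig, 2, w$xbar, FUN = '+')   # back-transform to original x
```

Rescaling by sqrt(n - 1) is what turns unit sums of squares into unit standard deviations, because the centered columns have mean zero.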
Value
a list with components x (transformed data matrix), R (the matrix that can be used to transform raw x and to transform regression coefficients computed on transformed x back to the original space), Ri (transforms transformed x back to original scale except for xbar), and xbar (vector of means of original x columns)
Examples
set.seed(1)
age <- 1:10
country <- sample(c('Slovenia', 'Italy', 'France'), 10, TRUE)
x <- model.matrix(~ age + country)[, -1]
x
w <- qrxcenter(x)
w
# Reproduce w$x
sweep(x, 2, w$xbar) %*% w$R
# Reproduce x from w$x
sweep(w$x %*% w$Ri, 2, w$xbar, FUN='+')
# See also https://hbiostat.org/r/examples/gtrans/gtrans#sec-splinebasis
r2describe
Description
Summarize Strength of Relationships Using R-Squared From Linear Regression
Usage
r2describe(x, nvmax = 10)
Arguments
x | numeric matrix with 2 or more columns |
nvmax | maximum number of columns of x to use in predicting a given column |
Details
Uses leaps::regsubsets() to briefly describe which variables more strongly predict another variable. Variables are in a numeric matrix and are assumed to be transformed so that relationships are linear (e.g., using redun() or transcan()).
Value
nothing
Author(s)
Frank Harrell
Examples
## Not run: 
r <- redun(...)
r2describe(r$scores)
## End(Not run)
Generate Multinomial Random Variables with Varying Probabilities
Description
Given a matrix of multinomial probabilities where rows correspond to observations and columns to categories (and each row sums to 1), generates a matrix with the same number of rows as has probs and with m columns. The columns represent multinomial cell numbers, and within a row the columns are all samples from the same multinomial distribution. The code is a modification of that in the impute.polyreg function in the MICE package.
Usage
rMultinom(probs, m)
Arguments
probs | matrix of probabilities |
m | number of samples for each row of |
Value
an integer matrix having m columns
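As an illustration of the idea (not the package's code), row-wise multinomial cell numbers can be drawn by comparing uniform random numbers with each row's cumulative probabilities. The function name r_multinom_sketch is hypothetical, and the inverse-CDF approach shown here is an assumption about one reasonable way to do it.

```r
# Hypothetical sketch: draw m category numbers per row of a probability matrix
# by inverting each row's cumulative distribution (not the Hmisc source code).
r_multinom_sketch <- function(probs, m) {
  probs <- as.matrix(probs)
  n <- nrow(probs)
  k <- ncol(probs)
  out <- matrix(NA_integer_, nrow = n, ncol = m)
  for (i in seq_len(n)) {
    cum <- cumsum(probs[i, ])
    u   <- runif(m)
    # Number of cumulative cut points at or below u, plus 1, is the cell number
    out[i, ] <- findInterval(u, cum[-k]) + 1L
  }
  out
}

set.seed(1)
p <- rbind(c(.1, .2, .3, .4),
           c(1,  0,  0,  0))
w <- r_multinom_sketch(p, 50)
dim(w)                                  # one row per row of p, m columns
table(factor(w[1, ], levels = 1:4)) / 50   # observed proportions for row 1
```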
See Also
Examples
set.seed(1)
w <- rMultinom(rbind(c(.1,.2,.3,.4),c(.4,.3,.2,.1)),200)
t(apply(w, 1, table)/200)
Matrix of Correlations and P-values
Description
rcorr computes a matrix of Pearson's r or Spearman's rho rank correlation coefficients for all possible pairs of columns of a matrix. Missing values are deleted in pairs rather than deleting all rows of x having any missing variables. Ranks are computed using efficient algorithms (see reference 2), using midranks for ties.
Usage
rcorr(x, y, type=c("pearson","spearman"))
## S3 method for class 'rcorr'
print(x, ...)
Arguments
x | a numeric matrix with at least 5 rows and at least 2 columns (if |
y | a numeric vector or matrix which will be concatenated to |
type | specifies the type of correlations to compute. Spearman correlations are the Pearson linear correlations computed on the ranks of non-missing elements, using midranks for ties. |
... | argument for method compatibility. |
Details
Uses midranks in case of ties, as described by Hollander and Wolfe. P-values are approximated by using the t or F distributions.
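For a single pair of variables, the computation described here can be sketched in base R. The data are made up for illustration, and the t statistic shown is the usual textbook approximation with n - 2 degrees of freedom; the package's exact computation may differ in detail.

```r
# Sketch: Spearman's rho for one pair, with pairwise deletion of missing
# values, midranks for ties, and an approximate P-value from the t distribution.
x <- c(1, 2, 2, 4, NA, 6)
y <- c(2, 1, 4, 4, 5, 9)

ok <- !is.na(x) & !is.na(y)                  # pairwise deletion
n  <- sum(ok)
rx <- rank(x[ok], ties.method = "average")   # midranks
ry <- rank(y[ok], ties.method = "average")
rho <- cor(rx, ry)                           # Pearson correlation of the ranks

tstat <- rho * sqrt((n - 2) / (1 - rho^2))
p     <- 2 * pt(-abs(tstat), df = n - 2)     # two-sided approximate P-value
c(rho = rho, n = n, P = p)
```

Because Spearman's rho is the Pearson correlation of the midranks, the same result is obtained from cor(x, y, method = "spearman", use = "complete.obs").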
Value
rcorr returns a list with elements r, the matrix of correlations, n, the matrix of number of observations used in analyzing each pair of variables, P, the asymptotic P-values, and type. Pairs with fewer than 2 non-missing values have the r values set to NA. The diagonals of n are the number of non-NAs for the single variable corresponding to that row and column.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
References
Hollander M. and Wolfe D.A. (1973). Nonparametric Statistical Methods. New York: Wiley.
Press WH, Flannery BP, Teukolsky SA, Vetterling WT (1988): Numerical Recipes in C. Cambridge: Cambridge University Press.
See Also
hoeffd, cor, combine.levels, varclus, dotchart3, impute, chisq.test, cut2.
Examples
x <- c(-2, -1, 0, 1, 2)
y <- c(4, 1, 0, 1, 4)
z <- c(1, 2, 3, 4, NA)
v <- c(1, 2, 3, 4, 5)
rcorr(cbind(x,y,z,v))
Rank Correlation for Censored Data
Description
Computes the c index and the corresponding generalization of Somers' Dxy rank correlation for a censored response variable. Also works for uncensored and binary responses, although its use of all possible pairings makes it slow for this purpose. Dxy and c are related by Dxy=2(c-0.5).
rcorr.cens handles one predictor variable. rcorrcens computes rank correlation measures separately by a series of predictors. In addition, rcorrcens has a rough way of handling categorical predictors. If a categorical (factor) predictor has two levels, it is converted to a numeric having values 1 and 2. If it has more than 2 levels, an indicator variable is formed for the most frequent level vs. all others, and another indicator for the second most frequent level vs. all others. The correlation is taken as the maximum of the two (in absolute value).
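The relation Dxy=2(c-0.5) can be illustrated with a brute-force base R sketch for an uncensored response. The function name c_index_sketch is hypothetical, and the handling of ties is an assumption made for illustration: pairs tied on the response are not counted, and pairs tied on the predictor count as half concordant. The loop over all pairs also shows why this kind of calculation is slow for large samples.

```r
# Hypothetical sketch of a concordance index for an UNCENSORED response.
# Assumptions: pairs tied on y are irrelevant; ties in x count 1/2.
c_index_sketch <- function(x, y) {
  n <- length(x)
  relevant   <- 0
  concordant <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      if (y[i] == y[j]) next                  # cannot order the responses
      relevant <- relevant + 1
      dx <- sign(x[i] - x[j])
      dy <- sign(y[i] - y[j])
      concordant <- concordant + if (dx == 0) 0.5 else as.numeric(dx == dy)
    }
  }
  cidx <- concordant / relevant
  c(C = cidx, Dxy = 2 * (cidx - 0.5), relevant = relevant)
}

c_index_sketch(1:5, c(2, 4, 6, 8, 10))        # perfectly concordant
c_index_sketch(c(1, 1, 2, 2), c(0, 1, 0, 1))  # binary response, no association
```

With a censored response, some pairs cannot be ordered because of censoring; that case is not covered by this sketch.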
Usage
rcorr.cens(x, S, outx=FALSE)
## S3 method for class 'formula'
rcorrcens(formula, data=NULL, subset=NULL, na.action=na.retain,
          exclude.imputed=TRUE, outx=FALSE, ...)
Arguments
x | a numeric predictor variable |
S | an |
outx | set to |
formula | a formula with a |
data,subset,na.action | the usual options for models. Default for |
exclude.imputed | set to |
... | extra arguments passed to |
Value
rcorr.cens returns a vector with the following named elements: C Index, Dxy, S.D., n, missing, uncensored, Relevant Pairs, Concordant, and Uncertain
n | number of observations not missing on any input variables |
missing | number of observations missing on |
relevant | number of pairs of non-missing observations for which |
concordant | number of relevant pairs for which |
uncertain | number of pairs of non-missing observations for which censoring prevents classification of concordance of |
rcorrcens.formula returns an object of class biVar which is documented with the biVar function.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
References
Newson R: Confidence intervals for rank statistics: Somers' D and extensions. Stata Journal 6:309-334; 2006.
See Also
concordance, somers2, biVar, rcorrp.cens
Examples
set.seed(1)
x <- round(rnorm(200))
y <- rnorm(200)
rcorr.cens(x, y, outx=TRUE)   # can correlate non-censored variables
library(survival)
age <- rnorm(400, 50, 10)
bp  <- rnorm(400,120, 15)
bp[1] <- NA
d.time <- rexp(400)
cens   <- runif(400,.5,2)
death  <- d.time <= cens
d.time <- pmin(d.time, cens)
rcorr.cens(age, Surv(d.time, death))
r <- rcorrcens(Surv(d.time, death) ~ age + bp)
r
plot(r)
# Show typical 0.95 confidence limits for ROC areas for a sample size
# with 24 events and 62 non-events, for varying population ROC areas
# Repeat for 138 events and 102 non-events
set.seed(8)
par(mfrow=c(2,1))
for(i in 1:2) {
  n1 <- c(24,138)[i]
  n0 <- c(62,102)[i]
  y <- c(rep(0,n0), rep(1,n1))
  deltas <- seq(-3, 3, by=.25)
  C <- se <- deltas
  j <- 0
  for(d in deltas) {
    j <- j + 1
    x <- c(rnorm(n0, 0), rnorm(n1, d))
    w <- rcorr.cens(x, y)
    C[j] <- w['C Index']
    se[j] <- w['S.D.']/2
  }
  low <- C-1.96*se; hi <- C+1.96*se
  print(cbind(C, low, hi))
  errbar(deltas, C, C+1.96*se, C-1.96*se,
         xlab='True Difference in Mean X',
         ylab='ROC Area and Approx. 0.95 CI')
  title(paste('n1=',n1,' n0=',n0,sep=''))
  abline(h=.5, v=0, col='gray')
  true <- 1 - pnorm(0, deltas, sqrt(2))
  lines(deltas, true, col='blue')
}
par(mfrow=c(1,1))
Rank Correlation for Paired Predictors with a Possibly Censored Response, and Integrated Discrimination Index
Description
Computes U-statistics to test for whether predictor X1 is more concordant than predictor X2, extending rcorr.cens. For method=1, estimates the fraction of pairs for which the x1 difference is more impressive than the x2 difference. For method=2, estimates the fraction of pairs for which x1 is concordant with S but x2 is not.
For binary responses the function improveProb provides several assessments of whether one set of predicted probabilities is better than another, using the methods described in Pencina et al (2007). This involves NRI and IDI to test for whether predictions from model x1 are significantly different from those obtained from predictions from model x2. This is a distinct improvement over comparing ROC areas, sensitivity, or specificity.
Usage
rcorrp.cens(x1, x2, S, outx=FALSE, method=1)
improveProb(x1, x2, y)
## S3 method for class 'improveProb'
print(x, digits=3, conf.int=.95, ...)
Arguments
x1 | first predictor (a probability, for |
x2 | second predictor (a probability, for |
S | a possibly right-censored |
outx | set to |
method | see above |
y | a binary 0/1 outcome variable |
x | the result from |
digits | number of significant digits for use in printing the result of |
conf.int | level for confidence limits |
... | unused |
Details
If x1, x2 represent predictions from models, these functions assume either that you are using a separate sample from the one used to build the model, or that the amount of overfitting in x1 equals the amount of overfitting in x2. An example of the latter is giving both models equal opportunity to be complex so that both models have the same number of effective degrees of freedom, whether a predictor was included in the model or was screened out by a variable selection scheme.
Note that in the first part of their paper, Pencina et al. presented measures that required binning the predicted probabilities. Those measures were then replaced with better continuous measures that are implemented here.
Value
a vector of statistics for rcorrp.cens, or a list with class improveProb of statistics for improveProb:
n | number of cases |
na | number of events |
nb | number of non-events |
pup.ev | mean of pairwise differences in probabilities for those with events and a pairwise difference of |
pup.ne | mean of pairwise differences in probabilities for those without events and a pairwise difference of |
pdown.ev | mean of pairwise differences in probabilities for those with events and a pairwise difference of |
pdown.ne | mean of pairwise differences in probabilities for those without events and a pairwise difference of |
nri | Net Reclassification Index = |
se.nri | standard error of NRI |
z.nri | Z score for NRI |
nri.ev | Net Reclassification Index = |
se.nri.ev | SE of NRI of events |
z.nri.ev | Z score for NRI of events |
nri.ne | Net Reclassification Index = |
se.nri.ne | SE of NRI of non-events |
z.nri.ne | Z score for NRI of non-events |
improveSens | improvement in sensitivity |
improveSpec | improvement in specificity |
idi | Integrated Discrimination Index |
se.idi | SE of IDI |
z.idi | Z score of IDI |
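As a rough illustration of the continuous (category-free) NRI and the IDI of Pencina et al., the point estimates can be computed in a few lines of base R. The names nri_idi_sketch, p_old and p_new are hypothetical, which of x1 and x2 plays the role of the newer model in improveProb is not asserted here, and standard errors and Z scores are omitted.

```r
# Hypothetical sketch of continuous NRI and IDI point estimates.
# p_old, p_new: predicted probabilities from two models; y: binary 0/1 outcome.
nri_idi_sketch <- function(p_old, p_new, y) {
  d  <- p_new - p_old                            # change in predicted probability
  ev <- y == 1
  ne <- y == 0
  nri_ev <- mean(d[ev] > 0) - mean(d[ev] < 0)    # net upward movement, events
  nri_ne <- mean(d[ne] < 0) - mean(d[ne] > 0)    # net downward movement, non-events
  idi    <- mean(d[ev]) - mean(d[ne])            # gain in mean probability separation
  c(nri = nri_ev + nri_ne, nri.ev = nri_ev, nri.ne = nri_ne, idi = idi)
}

y     <- c(1, 1, 1, 0, 0, 0)
p_old <- c(.5, .6, .4, .5, .3, .4)
p_new <- c(.7, .8, .3, .4, .2, .5)
nri_idi_sketch(p_old, p_new, y)
```

In this toy example two of three events move up and two of three non-events move down, so each component NRI is 1/3 and the overall NRI is 2/3.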
Author(s)
Frank Harrell
Department of Biostatistics, Vanderbilt University
fh@fharrell.com
Scott Williams
Division of Radiation Oncology
Peter MacCallum Cancer Centre, Melbourne, Australia
scott.williams@petermac.org
References
Pencina MJ, D'Agostino Sr RB, D'Agostino Jr RB, Vasan RS (2008): Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Stat in Med 27:157-172. DOI: 10.1002/sim.2929
Pencina MJ, D'Agostino Sr RB, D'Agostino Jr RB, Vasan RS: Rejoinder: Comments on Integrated discrimination and net reclassification improvements-Practical advice. Stat in Med 2007; DOI: 10.1002/sim.3106
Pencina MJ, D'Agostino RB, Steyerberg EW (2011): Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat in Med 30:11-21; DOI: 10.1002/sim.4085
See Also
rcorr.cens, somers2, Surv, val.prob, concordance
Examples
set.seed(1)
library(survival)
x1 <- rnorm(400)
x2 <- x1 + rnorm(400)
d.time <- rexp(400) + (x1 - min(x1))
cens   <- runif(400,.5,2)
death  <- d.time <= cens
d.time <- pmin(d.time, cens)
rcorrp.cens(x1, x2, Surv(d.time, death))
#rcorrp.cens(x1, x2, y) ## no censoring
set.seed(1)
x1 <- runif(1000)
x2 <- runif(1000)
y  <- sample(0:1, 1000, TRUE)
rcorrp.cens(x1, x2, y)
improveProb(x1, x2, y)
Restricted Cubic Spline Design Matrix
Description
Computes matrix that expands a single variable into the terms needed to fit a restricted cubic spline (natural spline) function using the truncated power basis. Two normalization options are given for somewhat reducing problems of ill-conditioning. The antiderivative function can be optionally created. If knot locations are not given, they will be estimated from the marginal distribution of x.
Usage
rcspline.eval(x, knots, nk=5, inclx=FALSE, knots.only=FALSE,
              type="ordinary", norm=2, rpm=NULL, pc=FALSE,
              fractied=0.05)
Arguments
x | a vector representing a predictor variable |
knots | knot locations. If not given, knots will be estimated using default quantiles of |
nk | number of knots. Default is 5. The minimum value is 3. |
inclx | set to |
knots.only | return the estimated knot locations but not the expanded matrix |
type | ‘"ordinary"’ to fit the function, ‘"integral"’ to fit its anti-derivative. |
norm | ‘0’ to use the terms as originally given by Devlin and Weeks (1986), ‘1’ to normalize non-linear terms by the cube of the spacing between the last two knots, ‘2’ to normalize by the square of the spacing between the first and last knots (the default). |
rpm | If given, any |
pc | Set to |
fractied | If the fraction of observations tied at the lowest and/or highest values of |
Value
If knots.only=TRUE, returns a vector of knot locations. Otherwise returns a matrix with x (if inclx=TRUE) followed by nk-2 nonlinear terms. The matrix has an attribute knots which is the vector of knots used. When pc is TRUE, an additional attribute is stored: pcparms, which contains the center and scale vectors and the rotation matrix.
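The nk-2 nonlinear terms of a restricted cubic spline in the truncated power basis can be sketched in base R as follows. The function name rcs_basis_sketch is hypothetical; the formula is the standard linear-tail-restricted truncated power basis, and the division by the squared distance between the first and last knots corresponds to the normalization described for norm=2. Knots must be supplied, and no attempt is made to reproduce the package's default knot placement or other options.

```r
# Hypothetical sketch of the nonlinear restricted cubic spline terms
# (truncated power basis, normalized by (last knot - first knot)^2).
rcs_basis_sketch <- function(x, knots) {
  k <- length(knots)
  stopifnot(k >= 3)
  pos3 <- function(u) pmax(u, 0)^3       # truncated cubic (u)+^3
  tk   <- knots[k]                       # last knot
  tk1  <- knots[k - 1]                   # next-to-last knot
  tau  <- (tk - knots[1])^2              # normalizing constant
  sapply(seq_len(k - 2), function(j) {
    tj <- knots[j]
    (pos3(x - tj) -
       pos3(x - tk1) * (tk - tj)  / (tk - tk1) +
       pos3(x - tk)  * (tk1 - tj) / (tk - tk1)) / tau
  })
}

knots <- c(1, 3, 5, 7)
b <- rcs_basis_sketch(seq(0, 12, by = 1), knots)
dim(b)     # 13 rows, length(knots) - 2 columns
b[1:2, ]   # zero at or below the first knot
```

The two correction terms cancel the cubic and quadratic parts beyond the last knot, which is what makes each basis function linear in the upper tail.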
References
Devlin TF and Weeks BJ (1986): Spline functions for logistic regression modeling. Proc 11th Annual SAS Users Group Intnl Conf, p. 646–651. Cary NC: SAS Institute, Inc.
See Also
Examples
x <- 1:100
rcspline.eval(x, nk=4, inclx=TRUE)
#lrm.fit(rcspline.eval(age,nk=4,inclx=TRUE), death)
x <- 1:1000
attributes(rcspline.eval(x))
x <- c(rep(0, 744),rep(1,6), rep(2,4), rep(3,10),rep(4,2),rep(6,6),
       rep(7,3),rep(8,2),rep(9,4),rep(10,2),rep(11,9),rep(12,10),rep(13,13),
       rep(14,5),rep(15,5),rep(16,10),rep(17,6),rep(18,3),rep(19,11),rep(20,16),
       rep(21,6),rep(22,16),rep(23,17), 24, rep(25,8), rep(26,6),rep(27,3),
       rep(28,7),rep(29,9),rep(30,10),rep(31,4),rep(32,4),rep(33,6),rep(34,6),
       rep(35,4), rep(36,5), rep(38,6), 39, 39, 40, 40, 40, 41, 43, 44, 45)
attributes(rcspline.eval(x, nk=3))
attributes(rcspline.eval(x, nk=5))
u <- c(rep(0,30), 1:4, rep(5,30))
attributes(rcspline.eval(u))
Plot Restricted Cubic Spline Function
Description
Provides plots of the estimated restricted cubic spline function relating a single predictor to the response for a logistic or Cox model. The rcspline.plot function does not allow for interactions as do lrm and cph, but it can provide detailed output for checking spline fits. This function uses the rcspline.eval, lrm.fit, and Therneau's coxph.fit functions and plots the estimated spline regression and confidence limits, placing summary statistics on the graph. If there are no adjustment variables, rcspline.plot can also plot two alternative estimates of the regression function when model="logistic": proportions or logit proportions on grouped data, and a nonparametric estimate. The nonparametric regression estimate is based on smoothing the binary responses and taking the logit transformation of the smoothed estimates, if desired. The smoothing uses supsmu.
Usage
rcspline.plot(x,y,model=c("logistic", "cox", "ols"), xrange, event, nk=5,
              knots=NULL, show=c("xbeta","prob"), adj=NULL, xlab, ylab,
              ylim, plim=c(0,1), plotcl=TRUE, showknots=TRUE, add=FALSE,
              subset, lty=1, noprint=FALSE, m, smooth=FALSE, bass=1,
              main="auto", statloc)
Arguments
x | a numeric predictor |
y | a numeric response. For binary logistic regression, |
model |
|
xrange | range for evaluating |
event | event/censoring indicator if |
nk | number of knots |
knots | knot locations, default based on quantiles of |
show |
|
adj | optional matrix of adjustment variables |
xlab |
|
ylab |
|
ylim |
|
plim |
|
plotcl | plot confidence limits |
showknots | show knot locations with arrows |
add | add this plot to an already existing plot |
subset | subset of observations to process, e.g. |
lty | line type for plotting estimated spline function |
noprint | suppress printing regression coefficients and standard errors |
m | for |
smooth | plot nonparametric estimate if |
bass | smoothing parameter (see |
main | main title, default is |
statloc | location of summary statistics. Default positioning by clicking left mouse button where upper left corner of statistics should appear. Alternative is |
Value
list with components (‘knots’, ‘x’, ‘xbeta’, ‘lower’, ‘upper’) which are respectively the knot locations, design matrix, linear predictor, and lower and upper confidence limits
Author(s)
Frank Harrell
Department of Biostatistics, Vanderbilt University
fh@fharrell.com
See Also
lrm, cph, rcspline.eval, plot, supsmu, coxph.fit, lrm.fit
Examples
#rcspline.plot(cad.dur, tvdlm, m=150)
#rcspline.plot(log10(cad.dur+1), tvdlm, m=150)
Re-state Restricted Cubic Spline Function
Description
This function re-states a restricted cubic spline function in the un-linearly-restricted form. Coefficients for that form are returned, along with an R functional representation of this function and a LaTeX character representation of the function. rcsplineFunction is a fast function that creates a function to compute a restricted cubic spline function with given coefficients and knots, without reformatting the function to be pretty (i.e., into unrestricted form).
Usage
rcspline.restate(knots, coef,
                 type=c("ordinary","integral"),
                 x="X", lx=nchar(x),
                 norm=2, columns=65, before="& &", after="\\",
                 begin="", nbegin=0, digits=max(8, .Options$digits))
rcsplineFunction(knots, coef, norm=2, type=c('ordinary', 'integral'))
Arguments
knots | vector of knots used in the regression fit |
coef | vector of coefficients from the fit. If the length of |
type | The default is to represent the cubic spline function corresponding to the coefficients and knots. Set |
x | a character string to use as the variable name in the LaTeX expression for the formula. |
lx | length of |
norm | normalization that was used in deriving the original nonlinear terms used in the fit. See |
columns | maximum number of symbols in the LaTeX expression to allow before inserting a newline (‘\\’) command. Set to a very large number to keep text all on one line. |
before | text to place before each line of LaTeX output. Use ‘"& &"’ for an equation array environment in LaTeX where you want to have a left-hand prefix e.g. ‘"f(X) & = &"’ or using ‘"\lefteqn"’. |
after | text to place at the end of each line of output. |
begin | text with which to start the first line of output. Useful when adding LaTeX output to part of an existing formula |
nbegin | number of columns of printable text in |
digits | number of significant digits to write for coefficients and knots |
Value
rcspline.restate returns a vector of coefficients. The coefficients are un-normalized and two coefficients are added that are linearly dependent on the other coefficients and knots. The vector of coefficients has four attributes. knots is a vector of knots, latex is a vector of text strings with the LaTeX representation of the formula. columns.used is the number of columns used in the output string since the last newline command. function is an R function, which is also returned in character string format as the text attribute. rcsplineFunction returns an R function with arguments x (a user-supplied numeric vector at which to evaluate the function), and some automatically-supplied other arguments.
Author(s)
Frank Harrell
Department of Biostatistics, Vanderbilt University
fh@fharrell.com
See Also
rcspline.eval, ns, rcs, latex, Function.transcan
Examples
set.seed(1)
x <- 1:100
y <- (x - 50)^2 + rnorm(100, 0, 50)
plot(x, y)
xx <- rcspline.eval(x, inclx=TRUE, nk=4)
knots <- attr(xx, "knots")
coef <- lsfit(xx, y)$coef
options(digits=4)
# rcspline.restate must ignore intercept
w <- rcspline.restate(knots, coef[-1], x="{\\rm BP}")
# could also have used coef instead of coef[-1], to include intercept
cat(attr(w,"latex"), sep="\n")
xtrans <- eval(attr(w, "function"))
# This is an S function of a single argument
lines(x, coef[1] + xtrans(x), type="l")
# Plots fitted transformation
xtrans <- rcsplineFunction(knots, coef)
xtrans
lines(x, xtrans(x), col='blue')
#x <- blood.pressure
xx.simple <- cbind(x, pmax(x-knots[1],0)^3, pmax(x-knots[2],0)^3,
                   pmax(x-knots[3],0)^3, pmax(x-knots[4],0)^3)
pred.value <- coef[1] + xx.simple %*% w
plot(x, pred.value, type='l')   # same as above
Reshape Matrices and Serial Data
Description
If the first argument is a matrix, reShape strings out its values and creates row and column vectors specifying the row and column each element came from. This is useful for sending matrices to Trellis functions, for analyzing or plotting results of table or crosstabs, or for reformatting serial data stored in a matrix (with rows representing multiple time points) into vectors. The number of observations in the new variables will be the product of the number of rows and number of columns in the input matrix. If the first argument is a vector, the id and colvar variables are used to restructure it into a matrix, with NAs for elements that corresponded to combinations of id and colvar values that did not exist in the data. When more than one vector is given, multiple matrices are created. This is useful for restructuring irregular serial data into regular matrices. It is also useful for converting data produced by expand.grid into a matrix (see the last example). The number of rows of the new matrices equals the number of unique values of id, and the number of columns equals the number of unique values of colvar.
When the first argument is a vector and the id is a data frame (even with only one variable), reShape will produce a data frame, and the unique groups are identified by combinations of the values of all variables in id. If a data frame constant is specified, the variables in this data frame are assumed to be constant within combinations of id variables (if not, an arbitrary observation in constant will be selected for each group). A row of constant corresponding to the target id combination is then carried along when creating the data frame result.
A different behavior of reShape is achieved when base and reps are specified. In that case x must be a list or data frame, and those data are assumed to contain one or more non-repeating measurements (e.g., baseline measurements) and one or more repeated measurements represented by variables named by pasting together the character strings in the vector base with the integers 1, 2, ..., reps. The input data are rearranged by repeating each value of the baseline variables reps times and by transposing each observation's values of one of the set of repeated measurements as reps observations under the variable whose name does not have an integer pasted to the end. If x has a row.names attribute, those observation identifiers are each repeated reps times in the output object. See the last example.
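The base and reps rearrangement just described can be mimicked in base R to see what it does. This is a sketch with made-up data, not the reShape code. The output column names id and seqno are chosen here for illustration only.

```r
# Sketch of the wide-to-long rearrangement described for base= and reps=.
# Data and output column names are illustrative assumptions.
w <- data.frame(age  = c(40, 50),
                sbp1 = c(120, 130), sbp2 = c(125, 128),
                dbp1 = c(80, 85),   dbp2 = c(82, 79),
                row.names = c('a', 'b'))
base <- c('sbp', 'dbp')
reps <- 2

long <- data.frame(id    = rep(row.names(w), each = reps),     # ids repeated reps times
                   seqno = rep(seq_len(reps), times = nrow(w)),
                   age   = rep(w$age, each = reps))            # baseline value repeated
for (b in base) {
  cols <- paste0(b, seq_len(reps))            # e.g. sbp1, sbp2
  # Transpose so each subject's repeats become consecutive rows
  long[[b]] <- as.vector(t(as.matrix(w[cols])))
}
long
```

Each subject contributes reps rows, with baseline variables repeated and each set of repeated measurements stacked under its base name.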
Usage
reShape(x, ..., id, colvar, base, reps, times=1:reps,
        timevar='seqno', constant=NULL)
Arguments
x | a matrix or vector, or, when |
... | other optional vectors, if |
id | A numeric, character, category, or factor variable containing subject identifiers, or a data frame of such variables that in combination form groups of interest. Required if |
colvar | A numeric, character, category, or factor variable containing column identifiers. |
base | vector of character strings containing base names of repeated measurements |
reps | number of times variables named in |
times | when |
timevar | specifies the name of the time variable to create if |
constant | a data frame with the same number of rows in |
Details
In converting dimnames to vectors, the resulting variables are numeric if all elements of the matrix dimnames can be converted to numeric, otherwise the corresponding row or column variable remains character. When the dimnames of x have a names attribute, those two names become the new variable names. If x is a vector and another vector is also given (in ...), the matrices in the resulting list are named the same as the input vector calling arguments. You can specify customized names for these on-the-fly by using e.g. reShape(X=x, Y=y, id= , colvar= ). The new names will then be X and Y instead of x and y. A new variable named seqno is also added to the resulting object. seqno indicates the sequential repeated measurement number. When base and times are specified, this new variable is named the character value of timevar and the values are given by a table lookup into the vector times.
Value
If x is a matrix, returns a list containing the row variable, the column variable, and the as.vector(x) vector, named the same as the calling argument was called for x. If x is a vector and no other vectors were specified as ..., the result is a matrix. If at least one vector was given to ..., the result is a list containing k matrices, where k is one plus the number of vectors in .... If x is a list or data frame, the same type of object is returned. If x is a vector and id is a data frame, a data frame will be the result.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com
See Also
reshape, as.vector, matrix, dimnames, outer, table
Examples
set.seed(1)
Solder  <- factor(sample(c('Thin','Thick'),200,TRUE),c('Thin','Thick'))
Opening <- factor(sample(c('S','M','L'), 200,TRUE),c('S','M','L'))
tab <- table(Opening, Solder)
tab
reShape(tab)
# attach(tab) # do further processing
# An example where a matrix is created from irregular vectors
follow <- data.frame(id=c('a','a','b','b','b','d'),
                     month=c(1, 2, 1, 2, 3, 2),
                     cholesterol=c(225,226, 320,319,318, 270))
follow
attach(follow)
reShape(cholesterol, id=id, colvar=month)
detach('follow')
# Could have done :
# reShape(cholesterol, triglyceride=trig, id=id, colvar=month)
# Create a data frame, reshaping a long dataset in which groups are
# formed not just by subject id but by combinations of subject id and
# visit number. Also carry forward a variable that is supposed to be
# constant within subject-visit number combinations. In this example,
# it is not constant, so an arbitrary visit number will be selected.
w <- data.frame(id=c('a','a','a','a','b','b','b','d','d','d'),
                visit=c( 1, 1, 2, 2, 1, 1, 2, 2, 2, 2),
                k=c('A','A','B','B','C','C','D','E','F','G'),
                var=c('x','y','x','y','x','y','y','x','y','z'),
                val=1:10)
with(w,
     reShape(val, id=data.frame(id,visit),
             constant=data.frame(k), colvar=var))
# Get predictions from a regression model for 2 systematically
# varying predictors. Convert the predictions into a matrix, with
# rows corresponding to the predictor having the most values, and
# columns corresponding to the other predictor
# d <- expand.grid(x2=0:1, x1=1:100)
# pred <- predict(fit, d)
# reShape(pred, id=d$x1, colvar=d$x2) # makes 100 x 2 matrix
# Reshape a wide data frame containing multiple variables representing
# repeated measurements (3 repeats on 2 variables; 4 subjects)
set.seed(33)
n <- 4
w <- data.frame(age=rnorm(n, 40, 10),
                sex=sample(c('female','male'), n,TRUE),
                sbp1=rnorm(n, 120, 15),
                sbp2=rnorm(n, 120, 15),
                sbp3=rnorm(n, 120, 15),
                dbp1=rnorm(n, 80, 15),
                dbp2=rnorm(n, 80, 15),
                dbp3=rnorm(n, 80, 15),
                row.names=letters[1:n])
options(digits=3)
w
u <- reShape(w, base=c('sbp','dbp'), reps=3)
u
reShape(w, base=c('sbp','dbp'), reps=3, timevar='week', times=c(0,3,12))
Redundancy Analysis
Description
Uses flexible parametric additive models (see areg and its use of regression splines), or alternatively runs a regular regression after replacing continuous variables with ranks, to determine how well each variable can be predicted from the remaining variables. Variables are dropped in a stepwise fashion, removing the most predictable variable at each step. The remaining variables are used to predict. The process continues until no variable still in the list of predictors can be predicted with an R^2 or adjusted R^2 of at least r2 or until dropping the variable with the highest R^2 (adjusted or ordinary) would cause a variable that was dropped earlier to no longer be predicted at least at the r2 level from the now smaller list of predictors.
There is also an option qrank to expand each variable into two columns containing the rank and square of the rank. Whenever ranks are used, they are computed as fractional ranks for numerical reasons.
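The idea of measuring how well each variable is predicted from the others can be sketched in base R. This is an illustration under simplifying assumptions, not the redun algorithm: ordinary linear models stand in for the flexible additive models, no stepwise dropping is done, the helpers r2_each and frac_rank are hypothetical, and rank/n is only one common definition of a fractional rank.

```r
# Sketch only: ordinary linear models stand in for the flexible additive
# models (areg) that redun is documented to use.
set.seed(1)
n  <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- x1 + x2 + runif(n) / 10
d  <- data.frame(x1, x2, x3)

# Hypothetical helper: R^2 from predicting each column from all the others
r2_each <- function(d) {
  sapply(names(d), function(v) {
    fit <- lm(reformulate(setdiff(names(d), v), response = v), data = d)
    summary(fit)$r.squared
  })
}
r2 <- r2_each(d)
round(r2, 3)

# One common definition of fractional ranks: ranks rescaled to (0, 1]
frac_rank <- function(x) rank(x) / length(x)
q <- cbind(rank = frac_rank(x1), rank2 = frac_rank(x1)^2)
head(q)
```

Because x3 is nearly an exact linear function of x1 and x2, all three R^2 values are high here; a stepwise procedure would drop one variable and then re-check the rest.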
Usage
redun(formula, data=NULL, subset=NULL, r2 = 0.9,
      type = c("ordinary", "adjusted"), nk = 3, tlinear = TRUE,
      rank=qrank, qrank=FALSE,
      allcat=FALSE, minfreq=0, iterms=FALSE, pc=FALSE, pr = FALSE, ...)

## S3 method for class 'redun'
print(x, digits=3, long=TRUE, ...)
Arguments
formula | a formula. Enclose a variable in |
data | a data frame, which must be omitted if |
subset | usual subsetting expression |
r2 | ordinary or adjusted |
type | specify |
nk | number of knots to use for continuous variables. Use |
tlinear | set to |
rank | set to |
qrank | set to |
allcat | set to |
minfreq | For a binary or categorical variable, there must be at least two categories with at least |
iterms | set to |
pc | if |
pr | set to |
... | arguments to pass to |
x | an object created by |
digits | number of digits to which to round |
long | set to |
Details
A categorical variable is deemed redundant if a linear combination of dummy variables representing it can be predicted from a linear combination of other variables. For example, if there were 4 cities in the data and each city's rainfall was also present as a variable, with virtually the same rainfall reported for all observations for a city, city would be redundant given rainfall (or vice-versa; the one declared redundant would be the first one in the formula). If two cities had the same rainfall, city might be declared redundant even though tied cities might be deemed non-redundant in another setting. To ensure that all categories may be predicted well from other variables, use the allcat option. To ignore categories that are too infrequent or too frequent, set minfreq to a nonzero integer. When the number of observations in the category is below this number or the number of observations not in the category is below this number, no attempt is made to predict observations being in that category individually for the purpose of redundancy detection.
Value
an object of class "redun" including an element "scores", a numeric matrix with all transformed values when each variable was the dependent variable and the first canonical variate was computed
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
See Also
areg, dataframeReduce, transcan, varclus, r2describe, subselect::genetic
Examples
set.seed(1)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- x1 + x2 + runif(n)/10
x4 <- x1 + x2 + x3 + runif(n)/10
x5 <- factor(sample(c('a','b','c'),n,replace=TRUE))
x6 <- 1*(x5=='a' | x5=='c')
redun(~x1+x2+x3+x4+x5+x6, r2=.8)
redun(~x1+x2+x3+x4+x5+x6, r2=.8, minfreq=40)
redun(~x1+x2+x3+x4+x5+x6, r2=.8, allcat=TRUE)
# x5 is no longer redundant but x6 is
redun(~x1+x2+x3+x4+x5+x6, r2=.8, rank=TRUE)
redun(~x1+x2+x3+x4+x5+x6, r2=.8, qrank=TRUE)

# To help decode which variables made a particular variable redundant:
# r <- redun(...)
# r2describe(r$scores)

Special Version of legend for R
Description
rlegend is a version of legend for R that implements plot=FALSE, adds grid=TRUE, and defaults lty, lwd, pch to NULL and checks for length>0 rather than missing(), so it's easier to deal with non-applicable parameters. But when grid is in effect, the preferred function to use is rlegendg, which calls the lattice draw.key function.
Usage
rlegend(x, y, legend, fill, col = "black", lty = NULL, lwd = NULL,
        pch = NULL, angle = NULL, density = NULL, bty = "o",
        bg = par("bg"), pt.bg = NA, cex = 1, xjust = 0, yjust = 1,
        x.intersp = 1, y.intersp = 1, adj = 0, text.width = NULL,
        merge = do.lines && has.pch, trace = FALSE, ncol = 1,
        horiz = FALSE, plot = TRUE, grid = FALSE, ...)

rlegendg(x, y, legend, col=pr$col[1], lty=NULL,
         lwd=NULL, pch=NULL, cex=pr$cex[1], other=NULL)
Arguments
x,y,legend,fill,col,lty,lwd,pch,angle,density,bty,bg,pt.bg,cex,xjust,yjust,x.intersp,y.intersp,adj,text.width,merge,trace,ncol,horiz | see |
plot | set to |
grid | set to |
... | see |
other | a list containing other arguments to pass to |
Value
a list with elements rect and text. rect has elements w, h, left, top with size/position information.
Author(s)
Frank Harrell and R-Core
See Also
Bootstrap Repeated Measurements Model
Description
For a dataset containing a time variable, a scalar response variable, and an optional subject identification variable, obtains least squares estimates of the coefficients of a restricted cubic spline function or a linear regression in time after adjusting for subject effects through the use of subject dummy variables. Then the fit is bootstrapped B times, either by treating time and subject ID as fixed (i.e., conditioning the analysis on them) or as random variables. For the former, the residuals from the original model fit are used as the basis of the bootstrap distribution. For the latter, samples are taken jointly from the time, subject ID, and response vectors to obtain unconditional distributions.
If a subject id variable is given, the bootstrap sampling will be based on samples with replacement from subjects rather than from individual data points. In other words, either none or all of a given subject's data will appear in a bootstrap sample. This cluster sampling takes into account any correlation structure that might exist within subjects, so that confidence limits are corrected for within-subject correlation. Assuming that ordinary least squares estimates, which ignore the correlation structure, are consistent (which is almost always true) and efficient (which would not be true for certain correlation structures or for datasets in which the number of observation times vary greatly from subject to subject), the resulting analysis will be a robust, efficient repeated measures analysis for the one-sample problem.
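The subject-level resampling described above can be sketched in a few lines of base R. This is a hedged illustration rather than the rm.boot source; the data frame d and the helper cluster_sample are made up for the example.

```r
# Sketch of cluster (subject-level) resampling; not the rm.boot source.
set.seed(2)
d <- data.frame(id   = rep(1:5, times = c(3, 2, 4, 3, 2)),
                time = c(1:3, 1:2, 1:4, 1:3, 1:2))
d$y <- d$time + rnorm(nrow(d))

# Hypothetical helper: draw subjects with replacement, keep all their rows
cluster_sample <- function(d, id = "id") {
  ids   <- unique(d[[id]])
  drawn <- sample(ids, length(ids), replace = TRUE)
  rows  <- unlist(lapply(drawn, function(s) which(d[[id]] == s)))
  d[rows, , drop = FALSE]
}

b <- cluster_sample(d)
table(b$id)   # each count is a multiple of that subject's original row count
```

A subject drawn twice contributes all of its rows twice, and a subject not drawn contributes none, which is what keeps within-subject correlation intact in each bootstrap sample.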
Predicted values of the fitted models are evaluated by default at a grid of 100 equally spaced time points ranging from the minimum to maximum observed time points. Predictions are for the average subject effect. Pointwise confidence intervals are optionally computed separately for each of the points on the time grid. However, simultaneous confidence regions that control the level of confidence for the entire regression curve lying within a band are often more appropriate, as they allow the analyst to draw conclusions about nuances in the mean time response profile that were not stated a priori. The method of Tibshirani (1997) is used to easily obtain simultaneous confidence sets for the set of coefficients of the spline or linear regression function as well as the average intercept parameter (over subjects). Here one computes the objective criterion (here both the -2 log likelihood evaluated at the bootstrap estimate of beta but with respect to the original design matrix and response vector, and the sum of squared errors in predicting the original response vector) for the original fit as well as for all of the bootstrap fits. The confidence set of the regression coefficients is the set of all coefficients that are associated with objective function values that are less than or equal to say the 0.95 quantile of the vector of B + 1 objective function values. For the coefficients satisfying this condition, predicted curves are computed at the time grid, and minima and maxima of these curves are computed separately at each time point to derive the final simultaneous confidence band.
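The band construction just described can be sketched with a deliberately small model. The code below is an assumption-laden illustration, not the rm.boot implementation: it fits a straight line instead of a spline, has no subject effects, resamples (time, y) pairs, and uses only the sum of squared errors as the objective.

```r
# Sketch of the band construction described above, using a straight-line
# model and a pairs bootstrap for brevity; not the rm.boot source.
set.seed(3)
n    <- 60
time <- runif(n)
y    <- 1 + 2 * time + rnorm(n, sd = 0.5)
X    <- cbind(1, time)
B    <- 200

# Objective: SSE of a coefficient vector evaluated on the ORIGINAL data
sse <- function(beta) sum((y - X %*% beta)^2)

coefs <- matrix(NA_real_, B + 1, 2)
coefs[1, ] <- qr.solve(X, y)                 # original least squares fit
for (b in 1:B) {
  i <- sample(n, replace = TRUE)
  coefs[b + 1, ] <- qr.solve(X[i, , drop = FALSE], y[i])
}

obj  <- apply(coefs, 1, sse)                 # B + 1 objective values
keep <- obj <= quantile(obj, 0.95)           # coefficient confidence set

grid   <- seq(min(time), max(time), length.out = 100)
curves <- cbind(1, grid) %*% t(coefs[keep, , drop = FALSE])
band   <- data.frame(time  = grid,
                     lower = apply(curves, 1, min),
                     upper = apply(curves, 1, max))
head(band)
```

The original fit always belongs to the retained set, since it minimizes the sum of squared errors on the original data, so the fitted line lies inside the band by construction.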
By default, the log likelihoods that are computed for obtaining the simultaneous confidence band assume independence within subject. This will cause problems unless such log likelihoods have very high rank correlation with the log likelihood allowing for dependence. To allow for correlation or to estimate the correlation function, see the cor.pattern argument below.
Usage
rm.boot(time, y, id=seq(along=time), subset,
        plot.individual=FALSE,
        bootstrap.type=c('x fixed','x random'),
        nk=6, knots, B=500, smoother=supsmu,
        xlab, xlim, ylim=range(y),
        times=seq(min(time), max(time), length=100),
        absorb.subject.effects=FALSE,
        rho=0, cor.pattern=c('independent','estimate'), ncor=10000,
        ...)

## S3 method for class 'rm.boot'
plot(x, obj2, conf.int=.95,
     xlab=x$xlab, ylab=x$ylab,
     xlim, ylim=x$ylim,
     individual.boot=FALSE,
     pointwise.band=FALSE,
     curves.in.simultaneous.band=FALSE,
     col.pointwise.band=2,
     objective=c('-2 log L','sse','dep -2 log L'), add=FALSE, ncurves,
     multi=FALSE, multi.method=c('color','density'),
     multi.conf   =c(.05,.1,.2,.3,.4,.5,.6,.7,.8,.9,.95,.99),
     multi.density=c( -1,90,80,70,60,50,40,30,20,10,  7,  4),
     multi.col    =c(  1, 8,20, 5, 2, 7,15,13,10,11,  9, 14),
     subtitles=TRUE, ...)
Arguments
time | numeric time vector |
y | continuous numeric response vector of length the same as |
x | an object returned from |
id | subject ID variable. If omitted, it is assumed that each time-response pair is measured on a different subject. |
subset | subset of observations to process if not all the data |
plot.individual | set to |
bootstrap.type | specifies whether to treat the time and subject ID variables as fixed or random |
nk | number of knots in the restricted cubic spline function fit. The number of knots may be 0 (denoting linear regression) or an integer greater than 2 in which k knots results in |
knots | vector of knot locations. May be specified if |
B | number of bootstrap repetitions. Default is 500. |
smoother | a smoothing function that is used if |
xlab | label for x-axis. Default is |
xlim | specifies x-axis plotting limits. Default is to use range of times specified to |
ylim | for |
times | a sequence of times at which to evaluate fitted values and confidence limits. Default is 100 equally spaced points in the observed range of |
absorb.subject.effects | If |
rho | The log-likelihood function that is used as the basis of simultaneous confidence bands assumes normality with independence within subject. To check the robustness of this assumption, if |
cor.pattern | More generally than using an equal-correlation structure, you can specify a function of two time vectors that generates as many correlations as the length of these vectors. For example, |
ncor | the maximum number of pairs of time values used in estimating the correlation function if |
... | other arguments to pass to |
obj2 | a second object created by |
conf.int | the confidence level to use in constructing simultaneous, and optionally pointwise, bands. Default is 0.95. |
ylab | label for y-axis. Default is the |
individual.boot | set to |
pointwise.band | set to |
curves.in.simultaneous.band | set to |
col.pointwise.band | color for the pointwise confidence band. Default is ‘2’, which defaults to red for default Windows S-PLUS setups. |
objective | the default is to use the -2 times log of the Gaussian likelihood for computing the simultaneous confidence region. If neither |
add | set to |
ncurves | when using |
multi | set to |
multi.method | specifies the method of shading when |
multi.conf | vector of confidence levels, in ascending order. Default is to use 12 confidence levels ranging from 0.05 to 0.99. |
multi.density | vector of densities in lines per inch corresponding to |
multi.col | vector of colors corresponding to |
subtitles | set to |
Details
Observations having missing time or y are excluded from the analysis.
As most repeated measurement studies consider the times as design points, the fixed covariable case is the default. Bootstrapping the residuals from the initial fit assumes that the model is correctly specified. Even if the covariables are fixed, doing an unconditional bootstrap is still appropriate, and for large sample sizes unconditional confidence intervals are only slightly wider than conditional ones. For moderate to small sample sizes, the bootstrap.type="x random" method can be fairly conservative.
If not all subjects have the same number of observations (after deleting observations containing missing values) and if bootstrap.type="x fixed", bootstrapped residual vectors may have a length m that is different from the number of original observations n. If m > n for a bootstrap repetition, the first n elements of the randomly drawn residuals are used. If m < n, the residual vector is appended with a random sample with replacement of length n - m from itself. A warning message is issued if this happens. If the number of time points per subject varies, the bootstrap results for bootstrap.type="x fixed" can still be invalid, as this method assumes that a vector (over subjects) of all residuals can be added to the original yhats, and varying number of points will cause mis-alignment.
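The length adjustment for resampled residuals can be written out as a small base-R function. This is a sketch of the rule as stated above, not the rm.boot source; the helper name fix_length is hypothetical, and issuing the warning whenever the lengths differ is an assumption about when the documented warning appears.

```r
# Hypothetical helper (not the rm.boot source): force a resampled residual
# vector to have exactly n elements, following the rule described above.
fix_length <- function(res, n) {
  m <- length(res)
  if (m == n) return(res)
  warning("bootstrap residual vector has length ", m, " but n = ", n)
  if (m > n) {
    res[seq_len(n)]                              # keep the first n elements
  } else {
    c(res, sample(res, n - m, replace = TRUE))   # pad by resampling itself
  }
}

set.seed(4)
r_long  <- rnorm(12)
r_short <- rnorm(7)
length(suppressWarnings(fix_length(r_long, 10)))
length(suppressWarnings(fix_length(r_short, 10)))
```

Both calls return vectors of length 10. The padding step assumes the residual vector has more than one element, since sample() treats a single number as a range to draw from.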
For bootstrap.type="x random" in the presence of significant subject effects, the analysis is approximate as the subjects used in any one bootstrap fit will not be the entire list of subjects. The average (over subjects used in the bootstrap sample) intercept is used from that bootstrap sample as a predictor of average subject effects in the overall sample.
Once the bootstrap coefficient matrix is stored by rm.boot, plot.rm.boot can be run multiple times with different options (e.g., different confidence levels).
See bootcov in the rms library for a general approach to handling repeated measurement data for ordinary linear models, binary and ordinal models, and survival models, using the unconditional bootstrap. bootcov does not handle bootstrapping residuals.
Value
an object of class rm.boot is returned by rm.boot. The principal object stored in the returned object is a matrix of regression coefficients for the original fit and all of the bootstrap repetitions (object Coef), along with vectors of the corresponding -2 log likelihoods and sums of squared errors. The original fit object from lm.fit.qr is stored in fit. For this fit, a cell means model is used for the id effects.
plot.rm.boot returns a list containing the vector of times used for plotting along with the overall fitted values, lower and upper simultaneous confidence limits, and optionally the pointwise confidence limits.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com
References
Feng Z, McLerran D, Grizzle J (1996): A comparison of statistical methods for clustered data analysis with Gaussian error. Stat in Med 15:1793–1806.
Tibshirani R, Knight K (1997): Model search and inference by bootstrap "bumping". Technical Report, Department of Statistics, University of Toronto. https://www.jstor.org/stable/1390820. Presented at the Joint Statistical Meetings, Chicago, August 1996.
Efron B, Tibshirani R (1993): An Introduction to the Bootstrap. New York: Chapman and Hall.
Diggle PJ, Verbyla AP (1998): Nonparametric estimation of covariance structure in longitudinal data. Biometrics 54:401–415.
Chapman IM, Hartman ML, et al (1997): Effect of aging on the sensitivity of growth hormone secretion to insulin-like growth factor-I negative feedback. J Clin Endocrinol Metab 82:2996–3004.
Li Y, Wang YG (2008): Smooth bootstrap methods for analysis of longitudinal data. Stat in Med 27:937–953. (potential improvements to cluster bootstrap; not implemented here)
See Also
rcspline.eval, lm, lowess, supsmu, bootcov, units, label, polygon, reShape
Examples
# Generate multivariate normal responses with equal correlations (.7)
# within subjects and no correlation between subjects
# Simulate realizations from a piecewise linear population time-response
# profile with large subject effects, and fit using a 6-knot spline
# Estimate the correlation structure from the residuals, as a function
# of the absolute time difference

# Function to generate n p-variate normal variates with mean vector u and
# covariance matrix S
# Slight modification of function written by Bill Venables
# See also the built-in function rmvnorm
mvrnorm <- function(n, p = 1, u = rep(0, p), S = diag(p)) {
  Z <- matrix(rnorm(n * p), p, n)
  t(u + t(chol(S)) %*% Z)
}

n     <- 20         # Number of subjects
sub   <- .5*(1:n)   # Subject effects

# Specify functional form for time trend and compute non-stochastic component
times <- seq(0, 1, by=.1)
g     <- function(times) 5*pmax(abs(times-.5),.3)
ey    <- g(times)

# Generate multivariate normal errors for 20 subjects at 11 times
# Assume equal correlations of rho=.7, independent subjects
nt    <- length(times)
rho   <- .7

set.seed(19)
errors <- mvrnorm(n, p=nt, S=diag(rep(1-rho,nt))+rho)
# Note:  first random number seed used gave rise to mean(errors)=0.24!

# Add E[Y], error components, and subject effects
y      <- matrix(rep(ey,n), ncol=nt, byrow=TRUE) + errors +
          matrix(rep(sub,nt), ncol=nt)

# String out data into long vectors for times, responses, and subject ID
y      <- as.vector(t(y))
times  <- rep(times, n)
id     <- sort(rep(1:n, nt))

# Show lowess estimates of time profiles for individual subjects
f <- rm.boot(times, y, id, plot.individual=TRUE, B=25, cor.pattern='estimate',
             smoother=lowess, bootstrap.type='x fixed', nk=6)
# In practice use B=400 or 500
# This will compute a dependent-structure log-likelihood in addition
# to one assuming independence.  By default, the dep. structure
# objective will be used by the plot method (could have specified rho=.7)
# NOTE: Estimating the correlation pattern from the residual does not
# work in cases such as this one where there are large subject effects

# Plot fits for a random sample of 10 of the 25 bootstrap fits
plot(f, individual.boot=TRUE, ncurves=10, ylim=c(6,8.5))

# Plot pointwise and simultaneous confidence regions
plot(f, pointwise.band=TRUE, col.pointwise=1, ylim=c(6,8.5))

# Plot population response curve at average subject effect
ts <- seq(0, 1, length=100)
lines(ts, g(ts)+mean(sub), lwd=3)

## Not run: 
#
# Handle a 2-sample problem in which curves are fitted
# separately for males and females and we wish to estimate the
# difference in the time-response curves for the two sexes.
# The objective criterion will be taken by plot.rm.boot as the
# total of the two sums of squared errors for the two models
#
knots <- rcspline.eval(c(time.f,time.m), nk=6, knots.only=TRUE)
# Use same knots for both sexes, and use a times vector that
# uses a range of times that is included in the measurement
# times for both sexes
#
tm <- seq(max(min(time.f),min(time.m)),
          min(max(time.f),max(time.m)),length=100)

f.female <- rm.boot(time.f, bp.f, id.f, knots=knots, times=tm)
f.male   <- rm.boot(time.m, bp.m, id.m, knots=knots, times=tm)
plot(f.female)
plot(f.male)
# The following plots female minus male response, with
# a sequence of shaded confidence band for the difference
plot(f.female,f.male,multi=TRUE)

# Do 1000 simulated analyses to check simultaneous coverage
# probability.  Use a null regression model with Gaussian errors
n.per.pt <- 30
n.pt     <- 10

null.in.region <- 0

for(i in 1:1000) {
  y    <- rnorm(n.pt*n.per.pt)
  time <- rep(1:n.per.pt, n.pt)
#  Add the following line and add ,id=id to rm.boot to use clustering
#  id   <- sort(rep(1:n.pt, n.per.pt))
#  Because we are ignoring patient id, this simulation is effectively
#  using 1 point from each of 300 patients, with times 1,2,3,,,30

  f <- rm.boot(time, y, B=500, nk=5, bootstrap.type='x fixed')
  g <- plot(f, ylim=c(-1,1), pointwise=FALSE)
  null.in.region <- null.in.region + all(g$lower<=0 & g$upper>=0)
  prn(c(i=i,null.in.region=null.in.region))
}

# Simulation Results: 905/1000 simultaneous confidence bands
# fully contained the horizontal line at zero

## End(Not run)

rmClose
Description
Remove close values from a numeric vector that are not at the outer limits. This is useful for removing axis breaks that overlap when plotting.
Usage
rmClose(x, minfrac = 0.05)
Arguments
x | a numeric vector with no |
minfrac | minimum allowed spacing between consecutive ordered |
Value
a sorted numeric vector of non-close values of x
Author(s)
Frank Harrell
Examples
rmClose(c(1, 2, 4, 47, 48, 49, 50), minfrac=0.07)

runParallel
Description
parallel Package Easy Front-End
Usage
runParallel(
  onecore,
  reps,
  seed = round(runif(1, 0, 10000)),
  cores = max(1, parallel::detectCores() - 1),
  simplify = TRUE,
  along
)
Arguments
onecore | function to run the analysis on one core |
reps | total number of repetitions |
seed | specifies the base random number seed. The seed used for core i will be |
cores | number of cores to use, defaulting to one less than the number available |
simplify | set to FALSE to not create an outer list if a |
along | see Details |
Details
Given a function onecore that runs the needed set of simulations on one CPU core, and given a total number of repetitions reps, determines the number of available cores and by default uses one less than that. reps is divided as evenly as possible over these cores, and batches are run on the cores using the parallel package mclapply function. The current per-core repetition number is continually updated in your system's temporary directory (/tmp for Linux and Mac, TEMP for Windows) in a file named progressX.log where X is the core number. The random number seed is set for each core and is equal to the scalar seed - core number + 1. The default seed is a random number between 0 and 10000 but it's best if the user provides the seed so the simulation is reproducible. The total run time is computed and printed. onecore must create a named list of all the results created during that one simulation batch. Elements of this list must be data frames, vectors, matrices, or arrays. Upon completion of all batches, all the results are rbind'd and saved in a single list.
onecore must have an argument reps that will tell the function how many simulations to run for one batch, another argument showprogress which is a function to be called inside onecore to write to the progress file for the current core and repetition, and an argument core which informs onecore which sequential core number (batch number) it is processing. When calling showprogress inside onecore, the arguments, in order, must be the integer value of the repetition to be noted, the number of reps, core, an optional 4th argument other that can contain a single character string to add to the output, and an optional 5th argument pr. You can set pr=FALSE to suppress printing and have showprogress return the file name for holding progress information if you want to customize printing.
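A function with the argument names described above might look like the following sketch. The simulation body (means of normal samples), the returned element names, and the stand-in progress function quiet_progress are illustrative assumptions, and the function is called directly here rather than through runParallel.

```r
# Sketch of a per-core function with the documented argument names.
onecore <- function(reps, showprogress, core) {
  means <- numeric(reps)
  for (i in seq_len(reps)) {
    means[i] <- mean(rnorm(20))
    showprogress(i, reps, core)        # repetition, number of reps, core
  }
  # Named list whose elements are vectors, one value per repetition
  list(means = means, core = rep(core, reps))
}

# Stand-in for the progress writer that would normally be supplied
quiet_progress <- function(i, reps, core, other = "", pr = TRUE) invisible(NULL)

set.seed(5)
res <- onecore(reps = 4, showprogress = quiet_progress, core = 1)
str(res)
```

Testing onecore this way on a small number of repetitions is a cheap check that it returns a named list of the right shape before any parallel run is attempted.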
If any of the objects appearing as list elements produced by onecore are multi-dimensional arrays, you must specify an integer value for along. This specifies to the abind package abind function the dimension along which to bind the arrays. For example, if the first dimension of the array corresponds to repetitions, you would specify along=1. All arrays present must use the same along unless along is a named vector and the names match elements of the simulation result object. Set simplify=FALSE if you don't want the result simplified if onecore produces only one list element. The default returns the first (and only) list element rather than the list if there is only one element.
When onecore returns a data.table, runParallel simplifies all this and merely rbinds all the per-core data tables into one large data table. In that case when you have onecore include a column containing a simulation number, it is wise to prepend that number with the core number so that you will have unique simulation IDs when all the cores' results are combined.
See here for examples.
Value
result from combining all the parallel runs, formatting as similar to the result produced from one run as possible
Author(s)
Frank Harrell
runifChanged
Description
Re-run Code if an Input Changed
Usage
runifChanged(fun, ..., file = NULL, .print. = TRUE, .inclfun. = TRUE)
Arguments
fun | the (usually slow) function to run |
... | input objects the result of running the function is dependent on |
file | file in which to store the result of |
.print. | set to |
.inclfun. | set to |
Details
Uses hashCheck to run a function and save the results if specified inputs have changed, otherwise to retrieve results from a file. This makes it easy to see if any objects changed that require re-running a long simulation, and reports on any changes. The file name is taken as the chunk name appended with .rds unless it is given as file=. fun has no arguments. Set .inclfun.=FALSE to not include fun in the hash check (for legacy uses). The typical workflow is as follows.
f <- function(       ) {
  # . . . do the real work with multiple function calls ...
}
seed <- 3
set.seed(seed)
w <- runifChanged(f, seed, obj1, obj2, ....)

seed, obj1, obj2, ... are all the objects that f() uses that if changed would give a different result of f(). This can include functions such as those in a package, and f will be re-run if any of the function's code changes. f is also re-run if the code inside f changes. The result of f is stored with saveRDS by default in a file named xxx.rds where xxx is the label for the current chunk. To control this, add the file argument to runifChanged(...), e.g. file=xxx.rds. If nothing has changed and the file already exists, the file is read to create the result object (e.g., w above). If f() needs to be run, the hashed input objects are stored as attributes for the result, then the enhanced result is written to the file.
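The store-and-compare pattern behind this workflow can be sketched in base R. The helper run_if_changed below is hypothetical and is not runifChanged: it compares saved copies of the inputs directly instead of hashing them, it does not track changes to the code of fun, and it assumes fun returns a non-NULL object that can carry an attribute.

```r
# Hypothetical helper (not runifChanged itself): rerun `fun` only when the
# supplied inputs differ from those saved with the previous result.
run_if_changed <- function(fun, ..., file) {
  inputs <- list(...)
  if (file.exists(file)) {
    old <- readRDS(file)
    if (identical(attr(old, "inputs"), inputs)) {
      attr(old, "inputs") <- NULL
      return(old)                      # nothing changed: reuse stored result
    }
  }
  result <- fun()
  stored <- result
  attr(stored, "inputs") <- inputs     # keep inputs alongside the result
  saveRDS(stored, file)
  result
}

calls <- 0
f <- function() { calls <<- calls + 1; sum(1:10) }
cache <- tempfile(fileext = ".rds")
a <- run_if_changed(f, 3, "x", file = cache)   # runs f
b <- run_if_changed(f, 3, "x", file = cache)   # read from file
d <- run_if_changed(f, 4, "x", file = cache)   # input changed: runs f again
calls
```

The counter shows that f ran only twice across the three calls, since the second call found identical inputs in the cache file.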
See here for examples.
Value
the result of running fun
Author(s)
Frank Harrell
Sample Size for 2-sample Binomial
Description
Computes sample size(s) for 2-sample binomial problem given vector or scalar probabilities in the two groups.
Usage
samplesize.bin(alpha, beta, pit, pic, rho=0.5)
Arguments
alpha | scalar ONE-SIDED test size, or two-sided size/2 |
beta | scalar or vector of powers |
pit | hypothesized treatment probability of success |
pic | hypothesized control probability of success |
rho | proportion of the sample devoted to treated group ( |
Value
TOTAL sample size(s)
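For orientation, a textbook arcsine-transformation approximation to the total sample size takes the same kind of inputs. The function n_total_arcsine below is a hypothetical illustration written for this sketch; whether samplesize.bin uses this formula or the same rounding is not stated here, so its results may differ.

```r
# Textbook arcsine-transformation approximation for total sample size.
# An illustration only; samplesize.bin may use a different formula or rounding.
n_total_arcsine <- function(alpha, power, pit, pic, rho = 0.5) {
  # alpha: one-sided test size; power: scalar or vector of powers
  # rho: proportion of the total sample allocated to the treated group
  delta <- asin(sqrt(pit)) - asin(sqrt(pic))
  (qnorm(1 - alpha) + qnorm(power))^2 / (4 * rho * (1 - rho) * delta^2)
}

n <- n_total_arcsine(alpha = 0.05, power = c(0.70, 0.80, 0.90, 0.95),
                     pit = 0.55, pic = 0.50)
ceiling(n)
```

Under this approximation the required total grows with the requested power and is smallest for equal allocation (rho = 0.5).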
AUTHOR
Rick Chappell
Dept. of Statistics and Human Oncology
University of Wisconsin at Madison
chappell@stat.wisc.edu
Examples
alpha <- .05
beta <- c(.70,.80,.90,.95)

# N1 is a matrix of total sample sizes whose
# rows vary by hypothesized treatment success probability and
# columns vary by power
# See Meinert's book for formulae.

N1 <- samplesize.bin(alpha, beta, pit=.55, pic=.5)
N1 <- rbind(N1, samplesize.bin(alpha, beta, pit=.60, pic=.5))
N1 <- rbind(N1, samplesize.bin(alpha, beta, pit=.65, pic=.5))
N1 <- rbind(N1, samplesize.bin(alpha, beta, pit=.70, pic=.5))
attr(N1,"dimnames") <- NULL

#Accounting for 5% noncompliance in the treated group
inflation <- (1/.95)**2
print(round(N1*inflation+.5,0))

Convert a SAS Dataset to an S Data Frame
Description
Converts a SAS dataset into an S data frame. You may choose to extract only a subset of variables or a subset of observations in the SAS dataset. You may have the function automatically convert PROC FORMAT-coded variables to factor objects. The original SAS codes are stored in an attribute called sas.codes and these may be added back to the levels of a factor variable using the code.levels function. Information about special missing values may be captured in an attribute of each variable having special missing values. This attribute is called special.miss, and such variables are given class special.miss. There are print, [], format, and is.special.miss methods for such variables. The chron function is used to set up date, time, and date-time variables. If using S-Plus 5 or 6 or later, the timeDate function is used instead. Under R, Dates is used for dates and chron for date-times. For times without dates, these still need to be stored in date-time format in POSIX. Such SAS time variables are given a major class of POSIXt and a format.POSIXt function so that the date portion (which will always be 1/1/1970) will not print by default. If a date variable represents a partial date (0.5 added if month missing, 0.25 added if day missing, 0.75 if both), an attribute partial.date is added to the variable, and the variable also becomes a class imputed variable. The describe function uses information about partial dates and special missing values. There is an option to automatically uncompress (or gunzip) compressed SAS datasets.
Usage
sas.get(libraryName, member, variables=character(0), ifs=character(0),
     format.library=libraryName, id,
     dates.=c("sas","yymmdd","yearfrac","yearfrac2"),
     keep.log=TRUE, log.file="_temp_.log", macro=sas.get.macro,
     data.frame.out=existsFunction("data.frame"), clean.up=FALSE, quiet=FALSE,
     temp=tempfile("SaS"), formats=TRUE,
     recode=formats, special.miss=FALSE, sasprog="sas",
     as.is=.5, check.unique.id=TRUE, force.single=FALSE,
     pos, uncompress=FALSE, defaultencoding="latin1", var.case="lower")

is.special.miss(x, code)

## S3 method for class 'special.miss'
x[..., drop=FALSE]

## S3 method for class 'special.miss'
print(x, ...)

## S3 method for class 'special.miss'
format(x, ...)

sas.codes(object)

code.levels(object)
Arguments
libraryName | character string naming the directory in which the dataset is kept. |
drop | logical. If |
member | character string giving the second part of the two part SAS dataset name. (The first part is irrelevant here - it is mapped to the UNIX directory name.) |
x | a variable that may have been created by |
variables | vector of character strings naming the variables in the SAS dataset. The S dataset will contain only those variables from the SAS dataset. To get all of the variables (the default), an empty string may be given. It is a fatal error if any one of the variables is not in the SAS dataset. You can use |
ifs | a vector of character strings, each containing one SAS “subsetting if” statement. These will be used to extract a subset of the observations in the SAS dataset. |
format.library | The UNIX directory containing the file ‘formats.sct’, which contains the definitions of the user defined formats used in this dataset. By default, we look for the formats in the same directory as the data. The user defined formats must be available (so SAS can read the data). |
formats | Set |
recode | This parameter defaults to |
special.miss | For numeric variables, any missing values are stored as NA in S. You can recover special missing values by setting |
id | The name of the variable to be used as the row names of the S dataset. The id variable becomes the |
dates. | specifies the format for storing SAS dates in the resulting data frame |
as.is | |
check.unique.id | If B23 . |
force.single | By default, SAS numeric variables having LENGTH 8 variable. Set LENGTH statement. R does not have single precision, so no attempt is made to convert to single if running R. |
dates | One of the character strings YYMMDD (year%%100, month, day). Note that R will store these as numbers, not as character strings. If |
keep.log | logical flag: if |
log.file | the name of theSAS log file. |
macro | the name of an S object in the current search path that contains the text of the SAS macro called by R. The R object is a character vector that can be edited using for example |
data.frame.out | logical flag: if |
clean.up | logical flag: if |
quiet | logical flag: if |
temp | the prefix to use for the temporary files. Two characters will be added to this; the resulting name must fit on your file system. |
sasprog | the name of the system command to invoke SAS |
uncompress | set to |
pos | by default, a list or data frame which contains all the variables is returned.If you specify |
code | a special missing value code (‘A’ through ‘Z’ or ‘_’) to check against. If |
defaultencoding | encoding to assume if the SAS dataset does not specify one. Defaults to "latin1". |
var.case | default is to change case of SAS variable names to lower case. Specify alternatively |
object | a variable in a data frame created by |
... | ignored |
Details
If you specify special.miss = TRUE and there are no special missing values in the SAS dataset, the SAS step will bomb.
For variables having a PROC FORMAT VALUE format with some of the levels undefined, sas.get will interpret those values as NA if you are using recode.
LRECL
s to quadruple them, for example.
Value
If data.frame.out is TRUE, the output will be a data frame resembling the SAS dataset. If id was specified, that column of the data frame will be used as the row names of the data frame. Each variable in the data frame or vector in the list will have the attributes label and format containing SAS labels and formats. Underscores in formats are converted to periods. Formats for character variables have $ placed in front of their names. If formats is TRUE and there are any appropriate format definitions in format.library, the returned object will have attribute formats containing lists named the same as the format names (with periods substituted for underscores and character formats prefixed by $). Each of these lists has a vector called values and one called labels with the PROC FORMAT; VALUE ... definitions.
If data.frame.out is FALSE, the output will be a list of vectors, each containing a variable from the SAS dataset. If id was specified, that element of the list will be used as the id attribute of the entire list.
Side Effects
If a SAS error occurs and quiet is FALSE, then the SAS log file will be printed under the control of the less pager.
BACKGROUND
The references cited below explain the structure of SAS datasets and how they are stored under UNIX. See SAS Language for a discussion of the “subsetting if” statement.
Note
You must be able to run SAS (by typing sas) on your system. If the S command !sas does not start SAS, then this function cannot work.
If you are reading time or date-time variables, you will need to execute the command library(chron) to print those variables or the data frame if the timeDate function is not available.
Author(s)
Terry Therneau, Mayo Clinic
Frank Harrell, Vanderbilt University
Bill Dunlap, University of Washington and Insightful Corporation
Michael W. Kattan, Cleveland Clinic Foundation
Reinhold Koch (encoding)
References
SAS Institute Inc. (1990). SAS Language: Reference, Version 6. First Edition. SAS Institute Inc., Cary, North Carolina.
SAS Institute Inc. (1988). SAS Technical Report P-176, Using the SAS System, Release 6.03, under UNIX Operating Systems and Derivatives. SAS Institute Inc., Cary, North Carolina.
SAS Institute Inc. (1985). SAS Introductory Guide. Third Edition. SAS Institute Inc., Cary, North Carolina.
See Also
data.frame, describe, label, upData, cleanup.import
Examples
## Not run: 
sas.contents("saslib", "mice")
# [1] "dose" "ld50" "strain" "lab_no"
# attr(, "n"):
# [1] 117
mice <- sas.get("saslib", mem="mice", var=c("dose", "strain", "ld50"))
plot(mice$dose, mice$ld50)
nude.mice <- sas.get(lib=unix("echo $HOME/saslib"), mem="mice",
                     ifs="if strain='nude'")
nude.mice.dl <- sas.get(lib=unix("echo $HOME/saslib"), mem="mice",
                        var=c("dose", "ld50"), ifs="if strain='nude'")
# Get a dataset from current directory, recode PROC FORMAT; VALUE ...
# variables into factors with labels of the form "good(1)" "better(2)",
# get special missing values, recode missing codes .D and .R into new
# factor levels "Don't know" and "Refused to answer" for variable q1
d <- sas.get(".", "mydata", recode=2, special.miss=TRUE)
attach(d)
nl <- length(levels(q1))
lev <- c(levels(q1), "Don't know", "Refused")
q1.new <- as.integer(q1)
q1.new[is.special.miss(q1,"D")] <- nl+1
q1.new[is.special.miss(q1,"R")] <- nl+2
q1.new <- factor(q1.new, 1:(nl+2), lev)
# Note: would like to use factor() in place of as.integer ... but
# factor in this case adds "NA" as a category level
d <- sas.get(".", "mydata")
sas.codes(d$x)          # for PROC FORMATted variables returns original data codes
d$x <- code.levels(d$x) # or attach(d); x <- code.levels(x)
# This makes levels such as "good" "better" "best" into e.g.
# "1:good" "2:better" "3:best", if the original SAS values were 1,2,3
# Retrieve the same variables from another dataset (or an update of
# the original dataset)
mydata2 <- sas.get('mydata2', var=names(d))
# This only works if none of the original SAS variable names contained _
mydata2 <- cleanup.import(mydata2) # will make true integer variables
# Code from Don MacQueen to generate SAS dataset to test import of
# date, time, date-time variables
# data ssd.test;
# d1='3mar2002'd ;
# dt1='3mar2002 9:31:02'dt;
# t1='11:13:45't;
# output;
#
# d1='3jun2002'd ;
# dt1='3jun2002 9:42:07'dt;
# t1='11:14:13't;
# output;
# format d1 mmddyy10. dt1 datetime. t1 time.;
# run;
## End(Not run)
Enhanced Importing of SAS Transport Files using read.xport
Description
Uses the read.xport and lookup.xport functions in the foreign package to import SAS datasets. SAS date, time, and date/time variables are converted respectively to Date, POSIX, or POSIXct objects in R, variable names are converted to lower case, SAS labels are associated with variables, and (by default) integer-valued variables are converted from storage mode double to integer. If the user ran PROC FORMAT CNTLOUT= in SAS and included the resulting dataset in the SAS version 5 transport file, variables having customized formats that do not include any ranges (i.e., variables having standard PROC FORMAT; VALUE label formats) will have their format labels looked up, and these variables are converted to S factors.
For those users having access to SAS, method='csv' is preferred when importing several SAS datasets. Run SAS macro exportlib.sas available from https://github.com/harrelfe/Hmisc/blob/master/src/sas/exportlib.sas to convert all SAS datasets in a SAS data library (from any engine supported by your system) into CSV files. If any customized formats are used, it is assumed that the PROC FORMAT CNTLOUT= dataset is in the data library as a regular SAS dataset, as above.
sasdsLabels reads a file containing PROC CONTENTS printed output to parse dataset labels, assuming that PROC CONTENTS was run on an entire library.
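The two conversions mentioned in this description can be illustrated in base R. This is only a sketch, not the code used by sasxport.get: the helper name to_integer_if_whole and the toy data frame are hypothetical, and the date conversion relies on the SAS convention that dates are stored as days since 1960-01-01.

```r
# Hypothetical helper: change storage mode from double to integer when
# every non-missing value is a whole number that fits in an R integer
to_integer_if_whole <- function(v) {
  ok <- is.double(v) &&
    all(is.na(v) | (v == round(v) & abs(v) < .Machine$integer.max))
  if (ok) storage.mode(v) <- "integer"
  v
}

# Toy data standing in for an imported dataset
d <- data.frame(age = c(30, 31, NA), weight = c(70.5, 82, 64.25))
d[] <- lapply(d, to_integer_if_whole)
sapply(d, storage.mode)   # age becomes integer, weight stays double

# A SAS date value is a day count from 1 January 1960
sas_date <- 15402
r_date <- as.Date(sas_date, origin = "1960-01-01")
r_date                    # 3 March 2002, the date used in the test dataset
```

Keeping whole-number variables as integer halves their memory footprint and lets functions such as describe treat them as discrete.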
Usage
sasxport.get(file, lowernames=TRUE, force.single = TRUE,
             method=c('read.xport','dataload','csv'), formats=NULL, allow=NULL,
             out=NULL, keep=NULL, drop=NULL, as.is=0.5, FUN=NULL)
sasdsLabels(file)
Arguments
file | name of a file containing the SAS transport file. |
lowernames | set to |
force.single | set to |
method | set to |
formats | a data frame or list (like that created by |
allow | a vector of characters allowed by R that should not be converted to periods in variable names. By default, underscores in variable names are converted to periods as with R before version 1.9. |
out | a character string specifying a directory in which to write separate R |
keep | a vector of names of SAS datasets to process (original SAS upper case names). Must include |
drop | a vector of names of SAS datasets to ignore (original SAS upper case names) |
as.is | SAS character variables are converted to S factor objects if |
FUN | an optional function that will be run on each data frame created, when |
Details
See contents.list for a way to print the directory of SAS datasets when more than one was imported.
Value
If there is more than one dataset in the transport file other than the PROC FORMAT file, the result is a list of data frames containing all the non-PROC FORMAT datasets. Otherwise the result is the single data frame. There is an exception if out is specified; that causes separate R save files to be written and the returned value to be a list corresponding to the SAS datasets, with key PROC CONTENTS information in a data frame making up each part of the list. sasdsLabels returns a named vector of dataset labels, with names equal to the dataset names.
Author(s)
Frank E Harrell Jr
See Also
read.xport, label, sas.get, Dates, DateTimeClasses, lookup.xport, contents, describe
Examples
## Not run: 
# SAS code to generate test dataset:
# libname y SASV5XPT "test2.xpt";
#
# PROC FORMAT; VALUE race 1=green 2=blue 3=purple; RUN;
# PROC FORMAT CNTLOUT=format;RUN; * Name, e.g. 'format', unimportant;
# data test;
# LENGTH race 3 age 4;
# age=30; label age="Age at Beginning of Study";
# race=2;
# d1='3mar2002'd ;
# dt1='3mar2002 9:31:02'dt;
# t1='11:13:45't;
# output;
#
# age=31;
# race=4;
# d1='3jun2002'd ;
# dt1='3jun2002 9:42:07'dt;
# t1='11:14:13't;
# output;
# format d1 mmddyy10. dt1 datetime. t1 time. race race.;
# run;
# data z; LENGTH x3 3 x4 4 x5 5 x6 6 x7 7 x8 8;
# DO i=1 TO 100;
# x3=ranuni(3);
# x4=ranuni(5);
# x5=ranuni(7);
# x6=ranuni(9);
# x7=ranuni(11);
# x8=ranuni(13);
# output;
# END;
# DROP i;
# RUN;
# PROC MEANS; RUN;
# PROC COPY IN=work OUT=y;SELECT test format z;RUN; *Creates test2.xpt;
w <- sasxport.get('test2.xpt')
# To use an existing copy of test2.xpt available on the web:
w <- sasxport.get('https://github.com/harrelfe/Hmisc/raw/master/inst/tests/test2.xpt')
describe(w$test)    # see labels, format names for dataset test
# Note: if only one dataset (other than format) had been exported,
# just do describe(w) as sasxport.get would not create a list for that
lapply(w, describe) # see descriptive stats for both datasets
contents(w$test)    # another way to see variable attributes
lapply(w, contents) # show contents of both datasets
options(digits=7)   # compare the following matrix with PROC MEANS output
t(sapply(w$z, function(x)
  c(Mean=mean(x),SD=sqrt(var(x)),Min=min(x),Max=max(x))))
## End(Not run)
One-Dimensional Scatter Diagram, Spike Histogram, or Density
Description
scat1d adds tick marks (bar codes, rug plot) on any of the four sides of an existing plot, corresponding with non-missing values of a vector x. This is used to show the data density. Can also place the tick marks along a curve by specifying y-coordinates to go along with the x values.
If any two values of x are within eps*w of each other, where eps defaults to .001 and w is the span of the intended axis, values of x are jittered by adding a value uniformly distributed in [-jitfrac*w, jitfrac*w], where jitfrac defaults to .008. Specifying preserve=TRUE invokes jitter2 with a different logic of jittering. Allows plotting random sub-segments to handle very large x vectors (see tfrac).
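The jittering rule in the preceding paragraph can be written out in a few lines of base R. This is a sketch of the stated rule only, not the scat1d source; the function name jitter_if_close and its axis_range argument are hypothetical, and missing values are assumed to have been removed.

```r
# Sketch: jitter all values when any two of them lie within eps*w of each
# other, where w is the span of the intended axis
jitter_if_close <- function(x, axis_range = range(x),
                            eps = 0.001, jitfrac = 0.008) {
  w <- diff(axis_range)                       # span of the axis
  too_close <- any(diff(sort(x)) < eps * w)   # near-ties present?
  if (too_close) {
    x + runif(length(x), -jitfrac * w, jitfrac * w)
  } else {
    x
  }
}

set.seed(1)
x_tied  <- c(1, 1, 2, 5, 11)   # contains a tie; span w = 10
x_apart <- c(1, 3, 5, 7, 11)   # no two values within eps*w

xj <- jitter_if_close(x_tied)          # every value moved by at most 0.08
x_same <- jitter_if_close(x_apart)     # returned unchanged
```

With the default jitfrac the shift is at most 0.8 percent of the axis span, which is enough to separate overlapping tick marks without visibly moving them.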
jitter2 is a generic method for jittering, which does not add random noise. It retains unique values and ranks, and randomly spreads duplicate values at equidistant positions within limits of enclosing values. jitter2 is especially useful for numeric variables with discrete values, like rating scales. Missing values are allowed and are returned. Currently implemented methods are jitter2.default for vectors and jitter2.data.frame which returns a data.frame with each numeric column jittered.
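To make the idea of spreading ties at equidistant positions concrete, here is a deliberately simplified base R sketch. It is not the jitter2 algorithm: as a stated assumption, the hypothetical spread_ties helper sizes the spread from the smallest gap between distinct values across the whole vector, whereas jitter2 works within the limits of the values enclosing each group of ties.

```r
# Simplified sketch of tie-spreading (NOT the jitter2 source code)
spread_ties <- function(x, fill = 1/3) {
  ux   <- sort(unique(x))                    # distinct non-missing values
  gap  <- if (length(ux) > 1) min(diff(ux)) else 1
  half <- fill * gap / 2                     # half-width of each spread
  out  <- x
  for (v in ux) {
    i <- which(x == v)                       # positions holding this value
    k <- length(i)
    if (k > 1) {
      pos <- seq(v - half, v + half, length.out = k)  # equidistant slots
      out[i] <- pos[sample.int(k)]           # assign slots in random order
    }
  }
  out
}

set.seed(5)
x  <- c(0:3, rep(4, 3), 5, rep(7, 10), 9)
xj <- spread_ties(x)
```

Values that occur once are left exactly where they are, and tied values never cross a neighboring distinct value, so the ordering of distinct values is preserved.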
datadensity is a generic method used to show data densities in more complex situations. Here, another datadensity method is defined for data frames. Depending on the which argument, some or all of the variables in a data frame will be displayed, with scat1d used to display continuous variables and, by default, bars used to display frequencies of categorical, character, or discrete numeric variables. For such variables, when the total length of value labels exceeds 200, only the first few characters from each level are used. By default, datadensity.data.frame will construct one axis (i.e., one strip) per variable in the data frame. Variable names appear to the left of the axes, and the number of missing values (if greater than zero) appear to the right of the axes. An optional group variable can be used for stratification, where the different strata are depicted using different colors. If the q vector is specified, the desired quantiles (over all groups) are displayed with solid triangles below each axis.
When the sample size exceeds 2000 (this value may be modified using the nhistSpike argument), datadensity calls histSpike instead of scat1d to show the data density for numeric variables. This results in a histogram-like display that makes the resulting graphics file much smaller. In this case, datadensity uses the minf argument (see below) so that very infrequent data values will not be lost on the variable's axis, although this will slightly distort the histogram.
histSpike is another method for showing a high-resolution data distribution that is particularly good for very large datasets (say n > 1000). By default, histSpike bins the continuous x variable into 100 equal-width bins and then computes the frequency counts within bins (if n does not exceed 10, no binning is done). If add=FALSE (the default), the function displays either proportions or frequencies as in a vertical histogram. Instead of bars, spikes are used to depict the frequencies. If add=TRUE, the function assumes you are adding small density displays that are intended to take up a small amount of space in the margins of the overall plot. The frac argument is used as with scat1d to determine the relative length of the whole plot that is used to represent the maximum frequency. No jittering is done by histSpike.
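The binning step behind a spike histogram can be sketched in base R as follows. This illustrates the technique described above (equal-width bins, then proportions drawn as spikes) under the assumption of 100 bins spanning the observed range; it is not the histSpike implementation.

```r
set.seed(2)
x    <- rnorm(5000)
nint <- 100                                   # number of equal-width bins

breaks <- seq(min(x), max(x), length.out = nint + 1)
bin    <- cut(x, breaks, include.lowest = TRUE)
counts <- as.vector(table(bin))               # frequency per bin
mids   <- (breaks[-1] + breaks[-(nint + 1)]) / 2
prop   <- counts / length(x)                  # proportions, as type='proportion'

# Spikes instead of bars: one vertical line per non-empty bin
keep <- counts > 0
plot(mids[keep], prop[keep], type = "h", xlab = "x", ylab = "Proportion")
```

Because only one line segment is drawn per non-empty bin, the graphics file stays small no matter how many observations there are.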
histSpike can also graph a kernel density estimate for x, or add a small density curve to any of 4 sides of an existing plot. When y or curve is specified, the density or spikes are drawn with respect to the curve rather than the x-axis.
histSpikeg is similar to histSpike but is for adding layers to a ggplot2 graphics object or traces to a plotly object. histSpikeg can also add lowess curves to the plot.
ecdfpM makes a plotly graph or series of graphs showing possibly superposed empirical cumulative distribution functions.
Usage
scat1d(x, side=3, frac=0.02, jitfrac=0.008, tfrac,
       eps=ifelse(preserve,0,.001),
       lwd=0.1, col=par("col"),
       y=NULL, curve=NULL,
       bottom.align=FALSE,
       preserve=FALSE, fill=1/3, limit=TRUE, nhistSpike=2000, nint=100,
       type=c('proportion','count','density'), grid=FALSE, ...)
jitter2(x, ...)
## Default S3 method:
jitter2(x, fill=1/3, limit=TRUE, eps=0, presorted=FALSE, ...)
## S3 method for class 'data.frame'
jitter2(x, ...)
datadensity(object, ...)
## S3 method for class 'data.frame'
datadensity(object, group, which=c("all","continuous","categorical"),
            method.cat=c("bar","freq"), col.group=1:10,
            n.unique=10, show.na=TRUE, nint=1, naxes, q,
            bottom.align=nint>1, cex.axis=sc(.5,.3), cex.var=sc(.8,.3),
            lmgp=NULL, tck=sc(-.009,-.002), ranges=NULL, labels=NULL, ...)
# sc(a,b) means default to a if number of axes <= 3, b if >=50, use
# linear interpolation within 3-50
histSpike(x, side=1, nint=100, bins=NULL, frac=.05, minf=NULL, mult.width=1,
          type=c('proportion','count','density'),
          xlim=range(x), ylim=c(0,max(f)), xlab=deparse(substitute(x)),
          ylab=switch(type,proportion='Proportion',
                      count ='Frequency',
                      density ='Density'),
          y=NULL, curve=NULL, add=FALSE, minimal=FALSE,
          bottom.align=type=='density', col=par('col'), lwd=par('lwd'),
          grid=FALSE, ...)
histSpikeg(formula=NULL, predictions=NULL, data, plotly=NULL,
           lowess=FALSE, xlim=NULL, ylim=NULL,
           side=1, nint=100,
           frac=function(f) 0.01 + 0.02*sqrt(f-1)/sqrt(max(f,2)-1),
           span=3/4, histcol='black', showlegend=TRUE)
ecdfpM(x, group=NULL, what=c('F','1-F','f','1-f'), q=NULL,
       extra=c(0.025, 0.025), xlab=NULL, ylab=NULL, height=NULL, width=NULL,
       colors=NULL, nrows=NULL, ncols=NULL, ...)
Arguments
x | a vector of numeric data, or a data frame (for |
object | a data frame or list (even with unequal number of observations per variable, as long as |
side | axis side to use (1=bottom (default for |
frac | fraction of smaller of vertical and horizontal axes for tick mark lengths. Can be negative to move tick marks outside of plot. For |
jitfrac | fraction of axis for jittering. If |
tfrac | Fraction of tick mark to actually draw. If |
eps | fraction of axis for determining overlapping points in |
lwd | line width for tick marks, passed to |
col | color for tick marks, passed to |
y | specify a vector the same length as |
curve | a list containing elements |
minimal | for |
bottom.align | set to |
preserve | set to |
fill | maximum fraction of the axis filled by jittered values. If |
limit | specifies a limit for maximum shift in jittered values. Duplicate values will be spread within |
nhistSpike | If the number of observations exceeds or equals |
type | used by or passed to |
grid | set to |
nint | number of intervals to divide each continuous variable's axis for |
bins | for |
... | optional arguments passed to |
presorted | set to |
group | an optional stratification variable, which is converted to a |
which | set |
method.cat | set |
col.group | colors representing the |
n.unique | number of unique values a numeric variable must have before it is considered to be a continuous variable |
show.na | set to |
naxes | number of axes to draw on each page before starting a new plot. You can set |
q | a vector of quantiles to display. By default, quantiles are not shown. |
extra | a two-vector specifying the fraction of the x range to add on the left and the fraction to add on the right |
cex.axis | character size for drawing labels for axis tick marks |
cex.var | character size for variable names and frequency of |
lmgp | spacing between numeric axis labels and axis (see |
tck | see |
ranges | a list containing ranges for some or all of the numeric variables. If |
labels | a vector of labels to use in labeling the axes for |
minf | For |
mult.width | multiplier for the smoothing window width computed by |
xlim | a 2-vector specifying the outer limits of |
ylim | y-axis range for plotting (if |
xlab | x-axis label ( |
ylab | y-axis label ( |
add | set to |
formula | a formula of the form |
predictions | the data frame being plotted by |
data | for |
plotly | an existing |
lowess | set to |
span | passed to |
histcol | color of line segments (tick marks) for |
showlegend | set to |
what | set to |
height,width | passed to |
colors | a vector of colors to pass to |
nrows,ncols | passed to |
Details
For scat1d the length of line segments used is frac*min(par()$pin)/par()$uin[opp] data units, where opp is the index of the opposite axis and frac defaults to .02. Assumes that plot has already been called. Current par("usr") is used to determine the range of data for the axis of the current plot. This range is used in jittering and in constructing line segments.
Value
histSpike returns the actual range of x used in its binning. histSpikeg returns a list of ggplot2 layers that ggplot2 will easily add with +.
Side Effects
scat1d adds line segments to plot. datadensity.data.frame draws a complete plot. histSpike draws a complete plot or adds to an existing plot.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
Nashville TN, USA
fh@fharrell.com
Martin Maechler (improved scat1d)
Seminar fuer Statistik
ETH Zurich SWITZERLAND
maechler@stat.math.ethz.ch
Jens Oehlschlaegel-Akiyoshi (wrote jitter2)
Center for Psychotherapy Research
Christian-Belser-Strasse 79a
D-70597 Stuttgart Germany
oehl@psyres-stuttgart.de
See Also
segments, jitter, rug, plsmo, lowess, stripplot, hist.data.frame, Ecdf, hist, histogram, table, density, stat_plsmo, histboxp
Examples
plot(x <- rnorm(50), y <- 3*x + rnorm(50)/2 )
scat1d(x)                 # density bars on top of graph
scat1d(y, 4)              # density bars at right
histSpike(x, add=TRUE)    # histogram instead, 100 bins
histSpike(y, 4, add=TRUE)
histSpike(x, type='density', add=TRUE)  # smooth density at bottom
histSpike(y, 4, type='density', add=TRUE)
smooth <- lowess(x, y)    # add nonparametric regression curve
lines(smooth)             # Note: plsmo() does this
scat1d(x, y=approx(smooth, xout=x)$y) # data density on curve
scat1d(x, curve=smooth)   # same effect as previous command
histSpike(x, curve=smooth, add=TRUE) # same as previous but with histogram
histSpike(x, curve=smooth, type='density', add=TRUE)
# same but smooth density over curve
plot(x <- rnorm(250), y <- 3*x + rnorm(250)/2)
scat1d(x, tfrac=0)        # dots randomly spaced from axis
scat1d(y, 4, frac=-.03)   # bars outside axis
scat1d(y, 2, tfrac=.2)    # same bars with smaller random fraction
x <- c(0:3,rep(4,3),5,rep(7,10),9)
plot(x, jitter2(x))       # original versus jittered values
abline(0,1)               # unique values unjittered on abline
points(x+0.1, jitter2(x, limit=FALSE), col=2)
# allow locally maximum jittering
points(x+0.2, jitter2(x, fill=1), col=3); abline(h=seq(0.5,9,1), lty=2)
# fill 3/3 instead of 1/3
x <- rnorm(200,0,2)+1; y <- x^2
x2 <- round((x+rnorm(200))/2)*2
x3 <- round((x+rnorm(200))/4)*4
dfram <- data.frame(y,x,x2,x3)
plot(dfram$x2, dfram$y)   # jitter2 via scat1d
scat1d(dfram$x2, y=dfram$y, preserve=TRUE, col=2)
scat1d(dfram$x2, preserve=TRUE, frac=-0.02, col=2)
scat1d(dfram$y, 4, preserve=TRUE, frac=-0.02, col=2)
pairs(jitter2(dfram))     # pairs for jittered data.frame
# This gets reasonable pairwise scatter plots for all combinations of
# variables where
#
# - continuous variables (with unique values) are not jittered at all, thus
#   all relations between continuous variables are shown as they are,
#   extreme values have exact positions.
#
# - discrete variables get a reasonable amount of jittering, whether they
#   have 2, 3, 5, 10, 20 ... levels
#
# - different from adding noise, jitter2() will use the available space
#   optimally and no value will randomly mask another
#
# If you want a scatterplot with lowess smooths on the *exact* values and
# the point clouds shown jittered, you just need
#
pairs( dfram ,panel=function(x,y) { points(jitter2(x),jitter2(y))
                                    lines(lowess(x,y)) } )
datadensity(dfram)        # graphical snapshot of entire data frame
datadensity(dfram, group=cut2(dfram$x2,g=3))
# stratify points and frequencies by
# x2 tertiles and use 3 colors
# datadensity.data.frame(split(x, grouping.variable))
# need to explicitly invoke datadensity.data.frame when the
# first argument is a list
## Not run: 
require(rms)
require(ggplot2)
f <- lrm(y ~ blood.pressure + sex * (age + rcs(cholesterol,4)),
         data=d)
p <- Predict(f, cholesterol, sex)
g <- ggplot(p, aes(x=cholesterol, y=yhat, color=sex)) + geom_line() +
  xlab(xl2) + ylim(-1, 1)
g <- g + geom_ribbon(data=p, aes(ymin=lower, ymax=upper), alpha=0.2,
                     linetype=0, show_guide=FALSE)
g + histSpikeg(yhat ~ cholesterol + sex, p, d)
# colors <- c('red', 'blue')
# p <- plot_ly(x=x, y=y, color=g, colors=colors, mode='markers')
# histSpikep(p, x, y, z, color=g, colors=colors)
w <- data.frame(x1=rnorm(100), x2=exp(rnorm(100)))
g <- c(rep('a', 50), rep('b', 50))
ecdfpM(w, group=g, ncols=2)
## End(Not run)
Score a Series of Binary Variables
Description
Creates a new variable from a series of logical conditions. The new variable can be a hierarchical category or score derived from considering the rightmost TRUE value among the input variables, an additive point score, a union, or any of several others by specifying a function using the fun argument.
Usage
score.binary(..., fun=max, points=1:p, na.rm=funtext == "max", retfactor=TRUE)
Arguments
... | a list of variables or expressions which are considered to be binary or logical |
fun | a function to compute on each row of the matrix represented by a specific observation of all the variables in |
points | points to assign to successive elements of |
na.rm | set to |
retfactor | applies if |
Value
a factor object if retfactor=TRUE and fun=max or a numeric vector otherwise. Will not contain NAs if na.rm=TRUE unless every variable in a row is NA. If a factor object is returned, it has levels "none" followed by character string versions of the arguments given in ... .
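The hierarchical and additive scores described here can be reproduced on a tiny example in base R. This is a sketch of the scoring logic under the assumption that points default to 1, 2, ... across the conditions; the variable names are illustrative and the code is not taken from score.binary.

```r
# Two conditions for four hypothetical subjects
age_gt70 <- c(TRUE, FALSE, TRUE, FALSE)
prev_dis <- c(0, 1, 1, 0)

m      <- cbind(age_gt70, prev_dis) != 0   # logical matrix, one column per condition
points <- 1:ncol(m)                        # 1 for the first condition, 2 for the second

# Hierarchical score: points of the rightmost TRUE condition, 0 if none
hier <- apply(m, 1, function(r) max(points * r))

# Additive score: sum of the points of all TRUE conditions
additive <- as.vector((m + 0) %*% points)

# Factor version with "none" as the first level
hier_f <- factor(hier, 0:2, c("none", "age>70", "previous.disease"))
hier_f
```

The third subject shows the difference between the two scales: both conditions hold, so the hierarchical score is 2 (the rightmost condition) while the additive score is 3.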
See Also
Examples
set.seed(1)
age <- rnorm(25, 70, 15)
previous.disease <- sample(0:1, 25, TRUE)
#Hierarchical scale, highest of 1:age>70 2:previous.disease
score.binary(age>70, previous.disease, retfactor=FALSE)
#Same as above but return factor variable with levels "none" "age>70"
# "previous.disease"
score.binary(age>70, previous.disease)
#Additive scale with weights 1:age>70 2:previous.disease
score.binary(age>70, previous.disease, fun=sum)
#Additive scale, equal weights
score.binary(age>70, previous.disease, fun=sum, points=c(1,1))
#Same as saying points=1
#Union of variables, to create a new binary variable
score.binary(age>70, previous.disease, fun=any)
Character String Editing and Miscellaneous Character Handling Functions
Description
This suite of functions was written to implement many of the features of the UNIX sed program entirely within S (function sedit). The substring.location function returns the first and last position numbers that a sub-string occupies in a larger string. The substring2<- function does the opposite of the builtin function substring. It is named substring2 because for S-Plus there is a built-in function substring, but it does not handle multiple replacements in a single string. replace.substring.wild edits character strings in the fashion of "change xxxxANYTHINGyyyy to aaaaANYTHINGbbbb", if the "ANYTHING" passes an optional user-specified test function. Here, the "yyyy" string is searched for from right to left to handle balancing parentheses, etc. numeric.string and all.digits are two examples of test functions, to check, respectively, if each of a vector of strings is a legal numeric or if it contains only the digits 0-9. For the case where old="*$" or "^*", or for replace.substring.wild with the same values of old or with front=TRUE or back=TRUE, sedit (if wild.literal=FALSE) and replace.substring.wild will edit the largest substring satisfying test.
substring2 is just a copy of substring so that substring2<- will work.
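The "change xxxxANYTHINGyyyy to aaaaANYTHINGbbbb" idea, including the optional test on the wildcard part, can be sketched with base R regular expressions. The helper replace_wild and the test function is_number below are hypothetical stand-ins, not the package functions, and the sketch assumes the literal front and back strings contain no regular expression metacharacters.

```r
# Hypothetical helper: replace front*back by new_front*new_back when the
# text matched by * passes `test`
replace_wild <- function(text, old_front, old_back, new_front, new_back,
                         test = function(s) TRUE) {
  pattern <- paste0(old_front, "(.*)", old_back)  # greedy: back is taken as
  m <- regmatches(text, regexec(pattern, text))[[1]]  # far right as possible
  if (length(m) == 0 || !test(m[2])) return(text)     # no match or test fails
  sub(m[1], paste0(new_front, m[2], new_back), text, fixed = TRUE)
}

# Stand-in for a numeric test: TRUE when the string converts to a number
is_number <- function(s) !is.na(suppressWarnings(as.numeric(s)))

r1 <- replace_wild("this is a cat", "this", "cat", "that", "dog")
r2 <- replace_wild("He won 1.2 million $", "won", "million",
                   "lost", "million", test = is_number)
r3 <- replace_wild("He won big million $", "won", "million",
                   "lost", "million", test = is_number)
c(r1, r2, r3)
```

Only r3 is returned unchanged, because the text between "won" and "million" is not numeric and so fails the test.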
Usage
sedit(text, from, to, test, wild.literal=FALSE)
substring.location(text, string, restrict)
# substring(text, first, last) <- setto   # S-Plus only
replace.substring.wild(text, old, new, test, front=FALSE, back=FALSE)
numeric.string(string)
all.digits(string)
substring2(text, first, last)
substring2(text, first, last) <- value
Arguments
text | a vector of character strings for |
from | a vector of character strings to translate from, for |
to | a vector of character strings to translate to, for |
string | a single character string, for |
first | a vector of integers specifying the first position to replace for |
last | a vector of integers specifying the ending positions of the character substrings to be replaced. The default is to go to the end of the string. When |
setto | a character string or vector of character strings used as replacements, in |
old | a character string to translate from for |
new | a character string to translate to for |
test | a function of a vector of character strings returning a logical vector whose elements are |
wild.literal | set to |
restrict | a vector of two integers for |
front | specifying |
back | specifying |
value | a character vector |
Value
sedit returns a vector of character strings the same length as text. substring.location returns a list with components named first and last, each specifying a vector of character positions corresponding to matches. replace.substring.wild returns a single character string. numeric.string and all.digits return a single logical value.
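For readers who want to see what first and last positions look like, the same information can be obtained in base R with gregexpr. This is an equivalent sketch rather than the substring.location code, and it does not handle the no-match case.

```r
# Locate every occurrence of a literal substring
hits  <- gregexpr("ab", "abcdefgabc", fixed = TRUE)[[1]]
first <- as.integer(hits)                          # starting positions
last  <- first + attr(hits, "match.length") - 1L   # ending positions
loc   <- list(first = first, last = last)
loc
```

Here "ab" occurs at characters 1 to 2 and again at characters 8 to 9, so first is c(1, 8) and last is c(2, 9).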
Side Effects
substring2<- modifies its first argument
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com
See Also
Examples
x <- 'this string'
substring2(x, 3, 4) <- 'IS'
x
substring2(x, 7) <- ''
x
substring.location('abcdefgabc', 'ab')
substring.location('abcdefgabc', 'ab', restrict=c(3,999))
replace.substring.wild('this is a cat','this*cat','that*dog')
replace.substring.wild('there is a cat','is a*', 'is not a*')
replace.substring.wild('this is a cat','is a*', 'Z')
qualify <- function(x) x==' 1.5 ' | x==' 2.5 '
replace.substring.wild('He won 1.5 million $','won*million',
                       'lost*million', test=qualify)
replace.substring.wild('He won 1 million $','won*million',
                       'lost*million', test=qualify)
replace.substring.wild('He won 1.2 million $','won*million',
                       'lost*million', test=numeric.string)
x <- c('a = b','c < d','hello')
sedit(x, c('=','he*o'),c('==','he*'))
sedit('x23', '*$', '[*]', test=numeric.string)
sedit('23xx', '^*', 'Y_{*} ', test=all.digits)
replace.substring.wild("abcdefabcdef", "d*f", "xy")
x <- "abcd"
substring2(x, "bc") <- "BCX"
x
substring2(x, "B*d") <- "B*D"
x
seqFreq
Description
Find Sequential Exclusions Due to NAs
Usage
seqFreq(..., labels = NULL, noneNA = FALSE)
Arguments
... | any number of variables |
labels | if specified variable labels will be used in place of variable names |
noneNA | set to |
Details
Finds the variable with the highest number of NAs. From the non-NAs on that variable find the next variable from those remaining with the highest number of NAs. Proceed in like fashion. The resulting variable summarizes sequential exclusions in a hierarchical fashion. See this for more information.
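The sequential procedure above can be followed step by step with a small base R sketch. The function seq_exclusions and the toy data are hypothetical; the level names and ordering of the result are assumptions and may differ from what seqFreq returns.

```r
# Sketch: at each step pick the variable with the most NAs among the rows
# still remaining, attribute those rows to it, and continue with the rest
seq_exclusions <- function(d) {
  remaining <- rep(TRUE, nrow(d))          # rows not yet excluded
  vars      <- names(d)
  reason    <- rep("none", nrow(d))
  while (length(vars) > 0 && any(remaining)) {
    nmiss <- sapply(d[remaining, vars, drop = FALSE],
                    function(v) sum(is.na(v)))
    if (max(nmiss) == 0) break             # no more missing values
    worst <- names(nmiss)[which.max(nmiss)]
    hit   <- remaining & is.na(d[[worst]])
    reason[hit] <- worst
    remaining   <- remaining & !hit
    vars        <- setdiff(vars, worst)
  }
  factor(reason, levels = c(setdiff(unique(reason), "none"), "none"))
}

d <- data.frame(a = c(NA, NA, NA, 1, 1, 1),
                b = c(1, 1, NA, NA, 1, 1),
                c = c(1, 1, 1, 1, NA, 1))
excl <- seq_exclusions(d)
table(excl)
```

Row 3 is missing on both a and b but is counted only under a, because a is processed first; that is what makes the summary hierarchical.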
Value
factor variable with obs.per.numcond attribute
Author(s)
Frank Harrell
Display Colors, Plotting Symbols, and Symbol Numeric Equivalents
Description
show.pch plots the definitions of the pch parameters. show.col plots definitions of integer-valued colors. character.table draws numeric equivalents of all Latin characters; the character on line xy and column z of the table has numeric code "xyz", which you would surround in quotes and precede by a backslash.
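As a small illustration of the backslash notation, and assuming the three-digit codes in the table are octal (which is how R interprets a backslash followed by digits in a string), the code 101 yields the letter A:

```r
ch <- "\101"                          # octal escape inside a quoted string
ch
code_decimal <- strtoi("101", base = 8L)   # octal 101 in decimal
code_decimal
utf8ToInt(ch)                         # code point of the resulting character
```

Octal 101 is decimal 65, the code point of "A", so reading line 10 and column 1 of such a table would give the escape "\101".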
Usage
show.pch(object = par("font"))
show.col(object=NULL)
character.table(font=1)
Arguments
object | font for |
font | font |
Author(s)
Pierre Joyet pierre.joyet@bluewin.ch, Frank Harrell
See Also
Examples
## Not run: 
show.pch()
show.col()
character.table()
## End(Not run)
Display image from psfrag LaTeX strings
Description
showPsfrag is used to display (using ghostview) a postscript image that contained psfrag LaTeX strings, by building a small LaTeX script and running latex and dvips.
Usage
showPsfrag(filename)
Arguments
filename | name or character string or character vector specifying file prefix. |
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
References
Grant MC, Carlisle (1998): The PSfrag System, Version 3. Full documentation is obtained by searching www.ctan.org for ‘pfgguide.ps’.
See Also
postscript, par, ps.options, mgp.axis.labels, pdf, trellis.device, setTrellis
simMarkovOrd
Description
Simulate Ordinal Markov Process
Usage
simMarkovOrd(n = 1, y, times, initial, X = NULL, absorb = NULL,
             intercepts, g, carry = FALSE, rdsample = NULL, ...)
Arguments
n | number of subjects to simulate |
y | vector of possible y values in order (numeric, character, factor) |
times | vector of measurement times |
initial | initial value of |
X | an optional vector or matrix of baseline covariate values passed to |
absorb | vector of absorbing states, a subset of |
intercepts | vector of intercepts in the proportional odds model. There must be one fewer of these than the length of |
g | a user-specified function of three or more arguments which in order are |
carry | set to |
rdsample | an optional function to do response-dependent sampling. It is a function of these arguments, which are vectors that stop at any absorbing state: |
... | additional arguments to pass to |
Details
Simulates longitudinal data for subjects following a first-order Markov process under a proportional odds model. Optionally, response-dependent sampling can be done, e.g., if a subject hits a specified state at time t, measurements are removed for times t+1, t+3, t+5, ... This is applicable when for example a study of hospitalized patients samples every day, Y=1 denotes patient discharge to home, and sampling is less frequent outside the hospital. This example assumes that arriving home is not an absorbing state, i.e., a patient could return to the hospital.
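A first-order Markov process under a proportional odds model can be simulated directly in base R, which may help in understanding what is being generated. Everything in this sketch is an assumption made for illustration: the three states, the intercept values, the hypothetical function lp giving the effect of the previous state, and the convention that intercepts refer to P(Y >= y). It is not the simMarkovOrd implementation and omits covariates and response-dependent sampling.

```r
set.seed(4)
y_levels   <- 1:3            # ordered states; state 3 is absorbing here
intercepts <- c(1, -1)       # logit P(Y >= 2) and logit P(Y >= 3) at baseline

# Hypothetical linear predictor: shift in log odds given the previous state
lp <- function(yprev, t) 0.8 * (yprev - 2)

sim_subject <- function(id, times, initial, absorb = 3) {
  yprev <- initial
  rows  <- list()
  for (t in times) {
    cum <- plogis(intercepts + lp(yprev, t))  # P(Y >= 2), P(Y >= 3)
    p   <- -diff(c(1, cum, 0))                # P(Y = 1), P(Y = 2), P(Y = 3)
    y   <- sample(y_levels, 1, prob = p)
    rows[[length(rows) + 1]] <-
      data.frame(id = id, time = t, yprev = yprev, y = y)
    if (y %in% absorb) break                  # stop at an absorbing state
    yprev <- y                                # first-order dependence
  }
  do.call(rbind, rows)
}

sim <- do.call(rbind, lapply(1:5, sim_subject, times = 1:6, initial = 2))
head(sim)
```

Each row depends on the past only through yprev, which is the first-order Markov property, and a subject contributes no further rows once the absorbing state is reached.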
Value
data frame with one row per subject per time, and columns id, time, gap, yprev, y
Author(s)
Frank Harrell
See Also
https://hbiostat.org/R/Hmisc/markov/
Simulate Power for Adjusted Ordinal Regression Two-Sample Test
Description
This function simulates the power of a two-sample test from a proportional odds ordinal logistic model for a continuous response variable, a generalization of the Wilcoxon test. The continuous data model is normal with equal variance. Nonlinear covariate adjustment is allowed, and the user can optionally specify discrete ordinal level overrides to the continuous response. For example, if the main response is systolic blood pressure, one can add two ordinal categories higher than the highest observed blood pressure to capture heart attack or death.
Usage
simRegOrd(n, nsim=1000, delta=0, odds.ratio=1, sigma, p=NULL, x=NULL, X=x, Eyx, alpha=0.05, pr=FALSE)
Arguments
n | combined sample size (both groups combined) |
nsim | number of simulations to run |
delta | difference in means to detect, for continuous portion of response variable |
odds.ratio | odds ratio to detect for ordinal overrides of continuous portion |
sigma | standard deviation for continuous portion of response |
p | a vector of marginal cell probabilities which must add up to one. The |
x | optional covariate to adjust for - a vector of length |
X | a design matrix for the adjustment covariate |
Eyx | a function of |
alpha | type I error |
pr | set to |
Value
a list containing n, delta, sigma, power, betas, se, pvals where power is the estimated power (scalar), and betas, se, pvals are nsim-vectors containing, respectively, the ordinal model treatment effect estimate, standard errors, and 2-tailed p-values. When a model fit failed, the corresponding entries in betas, se, pvals are NA and power is the proportion of non-failed iterations for which the treatment p-value is significant at the alpha level.
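As a small illustration of how the power element relates to pvals when some fits fail, the sketch below uses made-up p-values (not output from simRegOrd) and computes the proportion of non-failed iterations that are significant at alpha.

```r
# Made-up p-values for illustration; NA marks a failed model fit
pvals <- c(0.01, 0.20, NA, 0.03, 0.049)
alpha <- 0.05

# Failed fits are dropped from both numerator and denominator
power <- mean(pvals < alpha, na.rm = TRUE)
power   # 3 of the 4 non-failed iterations are significant: 0.75
```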
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com
See Also
Examples
## Not run:
## First use no ordinal high-end category overrides, and compare power
## to t-test when there is no covariate
n <- 100
delta <- .5
sd <- 1
require(pwr)
power.t.test(n = n / 2, delta=delta, sd=sd, type='two.sample') # 0.70
set.seed(1)
w <- simRegOrd(n, delta=delta, sigma=sd, pr=TRUE) # 0.686

## Now do ANCOVA with a quadratic effect of a covariate
n <- 100
x <- rnorm(n)
w <- simRegOrd(n, nsim=400, delta=delta, sigma=sd, x=x,
               X=cbind(x, x^2),
               Eyx=function(x) x + x^2, pr=TRUE)
w$power # 0.68

## Fit a cubic spline to some simulated pilot data and use the fitted
## function as the true equation in the power simulation
require(rms)
N <- 1000
set.seed(2)
x <- rnorm(N)
y <- x + x^2 + rnorm(N, 0, sd=sd)
f <- ols(y ~ rcs(x, 4), x=TRUE)
n <- 100
j <- sample(1 : N, n, replace=n > N)
x <- x[j]
X <- f$x[j,]
w <- simRegOrd(n, nsim=400, delta=delta, sigma=sd, x=x,
               X=X,
               Eyx=Function(f), pr=TRUE)
w$power ## 0.70

## Finally, add discrete ordinal category overrides and high end of y
## Start with no effect of treatment on these ordinal event levels (OR=1.0)
w <- simRegOrd(n, nsim=400, delta=delta, odds.ratio=1, sigma=sd,
               x=x, X=X, Eyx=Function(f),
               p=c(.98, .01, .01),
               pr=TRUE)
w$power ## 0.61 (0.3 if p=.8 .1 .1, 0.37 for .9 .05 .05, 0.50 for .95 .025 .025)

## Now assume that odds ratio for treatment is 2.5
## First compute power for clinical endpoint portion of Y alone
or <- 2.5
p <- c(.9, .05, .05)
popower(p, odds.ratio=or, n=100) # 0.275
## Compute power of t-test on continuous part of Y alone
power.t.test(n = 100 / 2, delta=delta, sd=sd, type='two.sample') # 0.70
## Note this is the same as the p.o. model power from simulation above
## Solve for OR that gives the same power estimate from popower
popower(rep(.01, 100), odds.ratio=2.4, n=100) # 0.706
## Compute power for continuous Y with ordinal override
w <- simRegOrd(n, nsim=400, delta=delta, odds.ratio=or, sigma=sd,
               x=x, X=X, Eyx=Function(f),
               p=c(.9, .05, .05),
               pr=TRUE)
w$power ## 0.72
## End(Not run)
List Simplification
Description
Takes a list where each element is a group of rows that have been spanned by a multirow row and combines it into one large matrix.
Usage
simplifyDims(x)
Arguments
x | list of spanned rows |
Details
All rows must have the same number of columns. This is used to format the list for printing.
Value
a matrix that contains all of the spanned rows.
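The combination step can be pictured with a short base R sketch. This is not the simplifyDims source; stack_rows is a hypothetical helper, and treating a plain vector as a single row is an assumption made here for illustration.

```r
# Hypothetical stand-in: coerce each element to a matrix, then stack the rows
stack_rows <- function(x) {
  mats <- lapply(x, function(el) if (is.matrix(el)) el else matrix(el, nrow = 1))
  # All pieces must agree on the number of columns
  stopifnot(length(unique(vapply(mats, ncol, integer(1)))) == 1)
  do.call(rbind, mats)
}

a <- list(a = matrix(1:25, ncol = 5), b = matrix(1:10, ncol = 5), c = 1:5)
m <- stack_rows(a)
dim(m)   # 5 + 2 + 1 = 8 rows, 5 columns
```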
Author(s)
Charles Dupont
See Also
Examples
a <- list(a = matrix(1:25, ncol=5), b = matrix(1:10, ncol=5), c = 1:5)
simplifyDims(a)
Compute Summary Statistics on a Vector
Description
A number of statistical summary functions is provided for use with summary.formula and summarize (as well as tapply and by themselves). smean.cl.normal computes 3 summary variables: the sample mean and lower and upper Gaussian confidence limits based on the t-distribution. smean.sd computes the mean and standard deviation. smean.sdl computes the mean plus or minus a constant times the standard deviation. smean.cl.boot is a very fast implementation of the basic nonparametric bootstrap for obtaining confidence limits for the population mean without assuming normality. These functions all delete NAs automatically. smedian.hilow computes the sample median and a selected pair of outer quantiles having equal tail areas.
Usage
smean.cl.normal(x, mult=qt((1+conf.int)/2,n-1), conf.int=.95, na.rm=TRUE)
smean.sd(x, na.rm=TRUE)
smean.sdl(x, mult=2, na.rm=TRUE)
smean.cl.boot(x, conf.int=.95, B=1000, na.rm=TRUE, reps=FALSE)
smedian.hilow(x, conf.int=.95, na.rm=TRUE)
Arguments
x | for summary functions |
na.rm | defaults to |
mult | for |
conf.int | for |
B | number of bootstrap resamples for |
reps | set to |
Value
a vector of summary statistics
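To make the two kinds of confidence limits concrete, here is a base R sketch of the calculations the Description and Usage imply: t-based limits with mult = qt((1+conf.int)/2, n-1), and a percentile bootstrap for the mean. It is a sketch under those assumptions rather than the package code, and the bootstrap details in smean.cl.boot may differ.

```r
set.seed(1)
x <- c(rnorm(100), NA)
x <- x[!is.na(x)]                      # mimic na.rm=TRUE
n <- length(x)
conf.int <- 0.95

# t-based limits: mean +/- mult * standard error
mult <- qt((1 + conf.int) / 2, n - 1)
se   <- sd(x) / sqrt(n)
cl_normal <- c(Mean = mean(x), Lower = mean(x) - mult * se,
               Upper = mean(x) + mult * se)

# Percentile bootstrap limits for the mean from B resamples
B <- 1000
boot_means <- replicate(B, mean(sample(x, n, replace = TRUE)))
cl_boot <- c(Mean = mean(x),
             quantile(boot_means, c((1 - conf.int) / 2, (1 + conf.int) / 2)))

cl_normal
cl_boot
```

The t-based limits coincide with the interval reported by t.test(x), which gives a quick way to check the arithmetic.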
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
See Also
Examples
set.seed(1)
x <- rnorm(100)
smean.sd(x)
smean.sdl(x)
smean.cl.normal(x)
smean.cl.boot(x)
smedian.hilow(x, conf.int=.5) # 25th and 75th percentiles

# Function to compute 0.95 confidence interval for the difference in two means
# g is grouping variable
bootdif <- function(y, g) {
  g <- as.factor(g)
  a <- attr(smean.cl.boot(y[g==levels(g)[1]], B=2000, reps=TRUE),'reps')
  b <- attr(smean.cl.boot(y[g==levels(g)[2]], B=2000, reps=TRUE),'reps')
  meandif <- diff(tapply(y, g, mean, na.rm=TRUE))
  a.b <- quantile(b-a, c(.025,.975))
  res <- c(meandif, a.b)
  names(res) <- c('Mean Difference','.025','.975')
  res
}
solve Function with tol argument
Description
A slightly modified version of solve that allows a tolerance argument for singularity (tol) which is passed to qr.
Usage
solvet(a, b, tol=1e-09)
Arguments
a | a square numeric matrix |
b | a numeric vector or matrix |
tol | tolerance for detecting linear dependencies in columns of |
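A rough base R sketch of the idea follows: factor the matrix with qr() using the supplied tolerance, refuse to proceed if the rank is deficient, and solve from the factorization. The name solve_tol is hypothetical and this is not claimed to match the solvet source line for line.

```r
# Hypothetical sketch of a solve() variant with an explicit qr tolerance
solve_tol <- function(a, b, tol = 1e-09) {
  qa <- qr(a, tol = tol)               # tol controls detection of dependencies
  if (qa$rank < ncol(a)) stop("apparently singular matrix at this tolerance")
  qr.coef(qa, b)                       # solves a %*% coef = b
}

a <- matrix(c(2, 1, 1, 3), nrow = 2)
b <- c(1, 2)
sol <- solve_tol(a, b)
sol   # solves 2x + y = 1, x + 3y = 2
```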
See Also
Somers' Dxy Rank Correlation
Description
Computes Somers' Dxy rank correlation between a variable x and a binary (0-1) variable y, and the corresponding receiver operating characteristic curve area c. Note that Dxy = 2(c-0.5). somers2 allows for a weights variable, which specifies frequencies to associate with each observation.
Usage
somers2(x, y, weights=NULL, normwt=FALSE, na.rm=TRUE)
Arguments
x | typically a predictor variable. |
y | a numeric outcome variable coded |
weights | a numeric vector of observation weights (usually frequencies). Omit or specify a zero-length vector to do an unweighted analysis. |
normwt | set to |
na.rm | set to |
Details
The rcorr.cens function, although slower than somers2 for large sample sizes, can also be used to obtain Dxy for non-censored binary y, and it has the advantage of computing the standard deviation of the correlation index.
Value
a vector with the named elements C, Dxy, n (number of non-missing pairs), and Missing. Uses the formula C = (mean(rank(x)[y == 1]) - (n1 + 1)/2)/(n - n1), where n1 is the frequency of y=1.
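The unweighted formula above can be checked directly in base R. The helper somers_c below is a hypothetical name used only for this sketch; it applies the stated formula for C and the relation Dxy = 2(c-0.5), and ignores weights.

```r
# Hypothetical helper applying the documented unweighted formula
somers_c <- function(x, y) {
  keep <- !is.na(x) & !is.na(y)        # drop incomplete pairs
  x <- x[keep]; y <- y[keep]
  n  <- length(x)
  n1 <- sum(y == 1)                    # frequency of y = 1
  C  <- (mean(rank(x)[y == 1]) - (n1 + 1) / 2) / (n - n1)
  c(C = C, Dxy = 2 * (C - 0.5), n = n)
}

res <- somers_c(1:6, c(0, 0, 1, 0, 1, 1))
res   # C = 8/9, Dxy = 7/9: 8 of the 9 (y=1, y=0) pairs are concordant
```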
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com
See Also
concordance, rcorr.cens, rank, wtd.rank
Examples
set.seed(1)
predicted <- runif(200)
dead <- sample(0:1, 200, TRUE)
roc.area <- somers2(predicted, dead)["C"]

# Check weights
x <- 1:6
y <- c(0,0,1,0,1,1)
f <- c(3,2,2,3,2,1)
somers2(x, y)
somers2(rep(x, f), rep(y, f))
somers2(x, y, f)
soprobMarkovOrd
Description
State Occupancy Probabilities for First-Order Markov Ordinal Model
Usage
soprobMarkovOrd(y, times, initial, absorb = NULL, intercepts, g, ...)
Arguments
y | a vector of possible y values in order (numeric, character, factor) |
times | vector of measurement times |
initial | initial value of |
absorb | vector of absorbing states, a subset of |
intercepts | vector of intercepts in the proportional odds model, with length one less than the length of |
g | a user-specified function of three or more arguments which in order are |
... | additional arguments to pass to |
Value
matrix with rows corresponding to times and columns corresponding to states, with values equal to exact state occupancy probabilities
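One way to see where exact state occupancy probabilities come from is the recursion p(t) = p(t-1) P(t), where P(t) is the transition matrix implied by the proportional odds model at time t. The base R sketch below illustrates that recursion only; so_probs and the two-argument lp_fun(yprev, t) are hypothetical names, and it is not the soprobMarkovOrd implementation.

```r
# Hypothetical sketch of the occupancy-probability recursion
so_probs <- function(y, times, initial, intercepts, lp_fun, absorb = NULL) {
  k <- length(y)
  stopifnot(length(intercepts) == k - 1, all(diff(intercepts) < 0))
  # Transition probabilities out of state yprev at time t
  trans_row <- function(yprev, t) {
    if (yprev %in% absorb) return(as.numeric(y == yprev))  # stay put
    exceed <- plogis(intercepts + lp_fun(yprev, t))
    -diff(c(1, exceed, 0))
  }
  p   <- as.numeric(y == initial)      # all mass on the initial state
  out <- matrix(NA_real_, length(times), k,
                dimnames = list(as.character(times), as.character(y)))
  for (i in seq_along(times)) {
    # Rows of P index the previous state, columns the next state
    P <- t(vapply(y, trans_row, numeric(k), t = times[i]))
    p <- as.numeric(p %*% P)
    out[i, ] <- p
  }
  out
}

pr <- so_probs(y = 1:3, times = 1:5, initial = 2, intercepts = c(1, -1),
               lp_fun = function(yprev, t) 0.5 * (yprev - 2), absorb = 3)
round(pr, 3)
```

Each row of the result sums to one, and the column for the absorbing state can only grow over time.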
Author(s)
Frank Harrell
See Also
https://hbiostat.org/R/Hmisc/markov/
soprobMarkovOrdm
Description
State Occupancy Probabilities for First-Order Markov Ordinal Model from a Model Fit
Usage
soprobMarkovOrdm(object, data, times, ylevels, absorb = NULL, tvarname = "time", pvarname = "yprev", gap = NULL)
Arguments
object | a fit object created by |
data | a single observation list or data frame with covariate settings, including the initial state for Y |
times | vector of measurement times |
ylevels | a vector of ordered levels of the outcome variable (numeric or character) |
absorb | vector of absorbing states, a subset of |
tvarname | name of time variable, defaulting to |
pvarname | name of previous state variable, defaulting to |
gap | name of time gap variable, defaults assuming that gap time is not in the model |
Details
Computes state occupancy probabilities for a single setting of baseline covariates. If the model fit was from rms::blrm(), these probabilities are from all the posterior draws of the basic model parameters. Otherwise they are maximum likelihood point estimates.
Value
if object was not a Bayesian model, a matrix with rows corresponding to times and columns corresponding to states, with values equal to exact state occupancy probabilities. If object was created by blrm, the result is a 3-dimensional array with the posterior draws as the first dimension.
Author(s)
Frank Harrell
See Also
https://hbiostat.org/R/Hmisc/markov/
spikecomp
Description
Compute Elements of a Spike Histogram
Usage
spikecomp(x, method = c("tryactual", "simple", "grid"), lumptails = 0.01, normalize = TRUE, y, trans = NULL, tresult = c("list", "segments", "roundeddata"))
Arguments
x | a numeric variable |
method | specifies the binning and output method. The default is |
lumptails | the quantile to use for lumping values into a single left and a single right bin for two of the methods. When outer quantiles using |
normalize | set to |
y | a vector of frequencies corresponding to |
trans | a list with three elements: the name of a transformation to make on |
tresult | applies only to |
Details
Derives the line segment coordinates needed to draw a spike histogram. This is useful for adding elements to ggplot2 plots and for the describe function to construct spike histograms. Date/time variables are handled by doing calculations on the underlying numeric scale then converting back to the original class. For them the left endpoint of the first bin is taken as the minimal data value instead of rounded using pretty().
Value
when y is specified, a list with elements x and y. When method='tryactual' the returned value depends on tresult. For method='grid', a list with elements x and y and scalar element roundedTo containing the typical bin width. Here x is a character string.
Author(s)
Frank Harrell
Examples
spikecomp(1:1000)
spikecomp(1:1000, method='grid')
## Not run:
# On a data.table d use ggplot2 to make spike histograms by country and sex groups
s <- d[, spikecomp(x, tresult='segments'), by=.(country, sex)]
ggplot(s) + geom_segment(aes(x=x, y=y1, xend=x, yend=y2, alpha=I(0.3))) +
  scale_y_continuous(breaks=NULL, labels=NULL) + ylab('') +
  facet_grid(country ~ sex)
## End(Not run)
Simulate Power of 2-Sample Test for Survival under Complex Conditions
Description
Given functions to generate random variables for survival times and censoring times, spower simulates the power of a user-given 2-sample test for censored data. By default, the logrank (Cox 2-sample) test is used, and a logrank function for comparing 2 groups is provided. Optionally a Cox model is fitted for each simulated dataset and the log hazard ratios are saved (this requires the survival package). A print method prints various measures from these. For composing R functions to generate random survival times under complex conditions, the Quantile2 function allows the user to specify the intervention:control hazard ratio as a function of time, the probability of a control subject actually receiving the intervention (dropin) as a function of time, and the probability that an intervention subject receives only the control agent as a function of time (non-compliance, dropout). Quantile2 returns a function that generates either control or intervention uncensored survival times subject to non-constant treatment effect, dropin, and dropout. There is a plot method for plotting the results of Quantile2, which will aid in understanding the effects of the two types of non-compliance and non-constant treatment effects. Quantile2 assumes that the hazard function for either treatment group is a mixture of the control and intervention hazard functions, with mixing proportions defined by the dropin and dropout probabilities. It computes hazards and survival distributions by numerical differentiation and integration using a grid of (by default) 7500 equally-spaced time points.
The logrank function is intended to be used with spower but it can be used by itself. It returns the 1 degree of freedom chi-square statistic, with the associated Pike hazard ratio estimate as an attribute.
The Weibull2 function accepts as input two vectors, one containing two times and one containing two survival probabilities, and it solves for the scale and shape parameters of the Weibull distribution (S(t) = e^{-\alpha t^{\gamma}}) which will yield those estimates. It creates an R function to evaluate survival probabilities from this Weibull distribution. Weibull2 is useful in creating functions to pass as the first argument to Quantile2.
The Lognorm2 and Gompertz2 functions are similar to Weibull2 except that they produce survival functions for the log-normal and Gompertz distributions.
When cox=TRUE is specified to spower, the analyst may wish to extract the two margins of error by using the print method for spower objects (see example below) and take the maximum of the two.
Usage
spower(rcontrol, rinterv, rcens, nc, ni,
       test=logrank, cox=FALSE, nsim=500, alpha=0.05, pr=TRUE)
## S3 method for class 'spower'
print(x, conf.int=.95, ...)
Quantile2(scontrol, hratio,
          dropin=function(times)0, dropout=function(times)0,
          m=7500, tmax, qtmax=.001, mplot=200, pr=TRUE, ...)
## S3 method for class 'Quantile2'
print(x, ...)
## S3 method for class 'Quantile2'
plot(x, what=c("survival", "hazard", "both", "drop", "hratio", "all"),
     dropsep=FALSE, lty=1:4, col=1, xlim, ylim=NULL,
     label.curves=NULL, ...)
logrank(S, group)
Gompertz2(times, surv)
Lognorm2(times, surv)
Weibull2(times, surv)
Arguments
rcontrol | a function of n which returns n random uncensored failure times for the control group. |
rinterv | similar to |
rcens | a function of n which returns n random censoring times. It is assumed that both treatment groups have the same censoring distribution. |
nc | number of subjects in the control group |
ni | number in the intervention group |
scontrol | a function of a time vector which returns the survival probabilities for the control group at those times assuming that all patients are compliant. |
hratio | a function of time which specifies the intervention:control hazard ratio (treatment effect) |
x | an object of class “Quantile2” created by |
conf.int | confidence level for determining fold-change margins of error in estimating the hazard ratio |
S | a |
group | group indicators have length equal to the number of rows in |
times | a vector of two times |
surv | a vector of two survival probabilities |
test | any function of a |
cox | If true |
nsim | number of simulations to perform (default=500) |
alpha | type I error (default=.05) |
pr | If |
dropin | a function of time specifying the probability that a control subject actually is treated with the new intervention at the corresponding time |
dropout | a function of time specifying the probability of an intervention subject dropping out to control conditions. As a function of time, |
m | number of time points used for approximating functions (default is 7500) |
tmax | maximum time point to use in the grid of |
qtmax | survival probability corresponding to the last time point used for approximating survival and hazard functions. Default is 0.001. For |
mplot | number of points used for approximating functions for use in plotting (default is 200 equally spaced points) |
... | optional arguments passed to the |
what | a single character constant (may be abbreviated) specifying which functions to plot. The default is ‘"both"’ meaning both survival and hazard functions. Specify |
dropsep | If |
lty | vector of line types |
col | vector of colors |
xlim | optional x-axis limits |
ylim | optional y-axis limits |
label.curves | optional list which is passed as the |
Value
spower returns the power estimate (fraction of simulated chi-squares greater than the alpha-critical value). If cox=TRUE, spower returns an object of class “spower” containing the power and various other quantities.
Quantile2 returns an R function of class “Quantile2” with attributes that drive the plot method. The major attribute is a list containing several lists. Each of these sub-lists contains a Time vector along with one of the following: survival probabilities for either treatment group and with or without contamination caused by non-compliance, hazard rates in a similar way, intervention:control hazard ratio function with and without contamination, and dropin and dropout functions.
logrank returns a single chi-square statistic and an attribute hr which is the Pike hazard ratio estimate.
Weibull2, Lognorm2 and Gompertz2 return an R function with three arguments, only the first of which (the vector of times) is intended to be specified by the user.
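For the Weibull parameterization S(t) = exp(-alpha * t^gamma) given in the Description, two (time, survival) points determine alpha and gamma in closed form, because log(-log S(t)) = log(alpha) + gamma * log(t) is linear in log(t). The sketch below works through that algebra in base R; weibull_from_two_points is a hypothetical name and the code is an illustration, not the Weibull2 source.

```r
# Hypothetical sketch: solve for alpha and gamma from two survival points
weibull_from_two_points <- function(times, surv) {
  stopifnot(length(times) == 2, length(surv) == 2)
  z     <- log(-log(surv))                     # log cumulative hazard
  gamma <- diff(z) / diff(log(times))          # slope on the log-time scale
  alpha <- -log(surv[1]) / times[1]^gamma      # match the first point
  list(alpha = alpha, gamma = gamma,
       S = function(t) exp(-alpha * t^gamma))  # survival function
}

w <- weibull_from_two_points(c(1, 3), c(.95, .7))
w$S(c(1, 3))   # reproduces 0.95 and 0.70
```

A function like w$S has the form Quantile2 expects for its first argument, that is, a function of a time vector returning control-group survival probabilities.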
Side Effects
spower prints the iteration number every 10 iterations if pr=TRUE.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com
References
Lakatos E (1988): Sample sizes based on the log-rank statistic in complex clinical trials. Biometrics 44:229–241 (Correction 44:923).
Cuzick J, Edwards R, Segnan N (1997): Adjusting for non-compliance and contamination in randomized clinical trials. Stat in Med 16:1017–1029.
Cook, T (2003): Methods for mid-course corrections in clinical trials with survival outcomes. Stat in Med 22:3431–3447.
Barthel FMS, Babiker A et al (2006): Evaluation of sample size and power for multi-arm survival trials allowing for non-uniform accrual, non-proportional hazards, loss to follow-up and cross-over. Stat in Med 25:2521–2542.
See Also
cpower, ciapower, bpower, cph, coxph, labcurve
Examples
# Simulate a simple 2-arm clinical trial with exponential survival so
# we can compare power simulations of logrank-Cox test with cpower()
# Hazard ratio is constant and patients enter the study uniformly
# with follow-up ranging from 1 to 3 years
# Drop-in probability is constant at .1 and drop-out probability is
# constant at .175. Two-year survival of control patients in absence
# of drop-in is .8 (mortality=.2). Note that hazard rate is -log(.8)/2
# Total sample size (both groups combined) is 1000
# % mortality reduction by intervention (if no dropin or dropout) is 25
# This corresponds to a hazard ratio of 0.7283 (computed by cpower)

cpower(2, 1000, .2, 25, accrual=2, tmin=1,
       noncomp.c=10, noncomp.i=17.5)

ranfun <- Quantile2(function(x)exp(log(.8)/2*x),
                    hratio=function(x)0.7283156,
                    dropin=function(x).1,
                    dropout=function(x).175)

rcontrol <- function(n) ranfun(n, what='control')
rinterv <- function(n) ranfun(n, what='int')
rcens <- function(n) runif(n, 1, 3)

set.seed(11) # So can reproduce results
spower(rcontrol, rinterv, rcens, nc=500, ni=500,
       test=logrank, nsim=50) # normally use nsim=500 or 1000

## Not run:
# Run the same simulation but fit the Cox model for each one to
# get log hazard ratios for the purpose of assessing the tightness
# of confidence intervals that are likely to result

set.seed(11)
u <- spower(rcontrol, rinterv, rcens, nc=500, ni=500,
            test=logrank, nsim=50, cox=TRUE)
u
v <- print(u)
v[c('MOElower','MOEupper','SE')]

## End(Not run)

# Simulate a 2-arm 5-year follow-up study for which the control group's
# survival distribution is Weibull with 1-year survival of .95 and
# 3-year survival of .7. All subjects are followed at least one year,
# and patients enter the study with linearly increasing probability after that
# Assume there is no chance of dropin for the first 6 months, then the
# probability increases linearly up to .15 at 5 years
# Assume there is a linearly increasing chance of dropout up to .3 at 5 years
# Assume that the treatment has no effect for the first 9 months, then
# it has a constant effect (hazard ratio of .75)

# First find the right Weibull distribution for compliant control patients
sc <- Weibull2(c(1,3), c(.95,.7))
sc

# Inverse cumulative distribution for case where all subjects are followed
# at least a years and then between a and b years the density rises
# as (time - a) ^ d is a + (b-a) * u ^ (1/(d+1))

rcens <- function(n) 1 + (5-1) * (runif(n) ^ .5)
# To check this, type hist(rcens(10000), nclass=50)

# Put it all together

f <- Quantile2(sc,
               hratio=function(x)ifelse(x<=.75, 1, .75),
               dropin=function(x)ifelse(x<=.5, 0, .15*(x-.5)/(5-.5)),
               dropout=function(x).3*x/5)

par(mfrow=c(2,2))
# par(mfrow=c(1,1)) to make legends fit
plot(f, 'all', label.curves=list(keys='lines'))

rcontrol <- function(n) f(n, 'control')
rinterv <- function(n) f(n, 'intervention')

set.seed(211)
spower(rcontrol, rinterv, rcens, nc=350, ni=350,
       test=logrank, nsim=50) # normally nsim=500 or more
par(mfrow=c(1,1))

# Compose a censoring time generator function such that at 1 year
# 5% of subjects are accrued, at 3 years 70% are accrued, and at 10
# years 100% are accrued. The trial proceeds two years past the last
# accrual for a total of 12 years of follow-up for the first subject.
# Use linear interpolation between these 3 points

rcens <- function(n)
{
  times <- c(0,1,3,10)
  accrued <- c(0,.05,.7,1)
  # Compute inverse of accrued function at U(0,1) random variables
  accrual.times <- approx(accrued, times, xout=runif(n))$y
  censor.times <- 12 - accrual.times
  censor.times
}

censor.times <- rcens(500)
# hist(censor.times, nclass=20)
accrual.times <- 12 - censor.times
# Ecdf(accrual.times)
# lines(c(0,1,3,10), c(0,.05,.7,1), col='red')
# spower(..., rcens=rcens, ...)

## Not run:
# To define a control survival curve from a fitted survival curve
# with coordinates (tt, surv) with tt[1]=0, surv[1]=1:

Scontrol <- function(times, tt, surv) approx(tt, surv, xout=times)$y
tt <- 0:6
surv <- c(1, .9, .8, .75, .7, .65, .64)
formals(Scontrol) <- list(times=NULL, tt=tt, surv=surv)

# To use a mixture of two survival curves, with e.g. mixing proportions
# of .2 and .8, use the following as a guide:
#
# Scontrol <- function(times, t1, s1, t2, s2)
#  .2*approx(t1, s1, xout=times)$y + .8*approx(t2, s2, xout=times)$y
# t1 <- ...; s1 <- ...; t2 <- ...; s2 <- ...;
# formals(Scontrol) <- list(times=NULL, t1=t1, s1=s1, t2=t2, s2=s2)

# Check that spower can detect a situation where generated censoring times
# are later than all failure times

rcens <- function(n) runif(n, 0, 7)
f <- Quantile2(scontrol=Scontrol, hratio=function(x).8, tmax=6)
cont <- function(n) f(n, what='control')
int <- function(n) f(n, what='intervention')
spower(rcontrol=cont, rinterv=int, rcens=rcens, nc=300, ni=300, nsim=20)

# Do an unstratified logrank test
library(survival)
# From SAS/STAT PROC LIFETEST manual, p. 1801
days <- c(179,256,262,256,255,224,225,287,319,264,237,156,270,257,242,
          157,249,180,226,268,378,355,319,256,171,325,325,217,255,256,
          291,323,253,206,206,237,211,229,234,209)
status <- c(1,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,1,1,1,0,
            0,rep(1,19))
treatment <- c(rep(1,10), rep(2,10), rep(1,10), rep(2,10))
sex <- Cs(F,F,M,F,M,F,F,M,M,M,F,F,M,M,M,F,M,F,F,M,
          M,M,M,M,F,M,M,F,F,F,M,M,M,F,F,M,F,F,F,F)
data.frame(days, status, treatment, sex)
table(treatment, status)
logrank(Surv(days, status), treatment) # agrees with p. 1807
# For stratified tests the picture is puzzling.
# survdiff(Surv(days,status) ~ treatment + strata(sex))$chisq
# is 7.246562, which does not agree with SAS (7.1609)
# But summary(coxph(Surv(days,status) ~ treatment + strata(sex)))
# yields 7.16 whereas summary(coxph(Surv(days,status) ~ treatment))
# yields 5.21 as the score test, not agreeing with SAS or logrank() (5.6485)

## End(Not run)
Enhanced Importing of SPSS Files
Description
spss.get invokes the read.spss function in the foreign package to read an SPSS file, with a default output format of "data.frame". The label function is used to attach labels to individual variables instead of to the data frame as done by read.spss. By default, integer-valued variables are converted to a storage mode of integer unless force.single=FALSE. Date variables are converted to R Date variables. By default, underscores in names are converted to periods.
Usage
spss.get(file, lowernames=FALSE, datevars = NULL, use.value.labels = TRUE, to.data.frame = TRUE, max.value.labels = Inf, force.single=TRUE, allow=NULL, charfactor=FALSE, reencode = NA)
Arguments
file | input SPSS save file. May be a file on the WWW, indicated by |
lowernames | set to |
datevars | vector of variable names containing dates to be converted to R internal format |
use.value.labels | see |
to.data.frame | see |
max.value.labels | see |
force.single | set to |
allow | a vector of characters allowed by R that should not be converted to periods in variable names. By default, underscores in variable names are converted to periods as with R before version 1.9. |
charfactor | set to |
reencode | see |
Value
a data frame or list
Author(s)
Frank Harrell
See Also
read.spss, cleanup.import, sas.get
Examples
## Not run:
w <- spss.get('/tmp/my.sav', datevars=c('birthdate','deathdate'))
## End(Not run)
Source a File from the Current Working Directory
Description
src concatenates ".s" to its argument, quotes the result, and sources in the file. It sets options(last.source) to this file name so that src() can be issued to re-source the file when it is edited.
Usage
src(x)
Arguments
x | an unquoted file name aside from |
Side Effects
Sets system option last.source
Author(s)
Frank Harrell
See Also
Examples
## Not run:
src(myfile) # source("myfile.s")
src() # re-source myfile.s
## End(Not run)
Add a lowess smoother without confidence bands.
Description
Automatically selects iter=0 for lowess if y is binary, otherwise uses iter=3.
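The iter rule can be illustrated outside ggplot2 with stats::lowess directly. The sketch below is an assumption-laden illustration: smooth_xy is a hypothetical helper, and treating y as binary when it has at most two distinct non-missing values is a guess at the detection rule, not a statement about the stat_plsmo source.

```r
# Hypothetical helper; the binary check is an assumption for illustration
smooth_xy <- function(x, y, span = 2/3) {
  binary <- length(unique(y[!is.na(y)])) <= 2
  iter   <- if (binary) 0 else 3       # no robustness iterations for 0/1 data
  fit    <- lowess(x, y, f = span, iter = iter)
  list(x = fit$x, y = fit$y, iter = iter)
}

s_bin  <- smooth_xy(mtcars$wt, mtcars$am)    # binary outcome
s_cont <- smooth_xy(mtcars$wt, mtcars$mpg)   # continuous outcome
c(binary = s_bin$iter, continuous = s_cont$iter)
```

Skipping the robustness iterations matters for binary y because they downweight points with large residuals, which for 0/1 data can discard the less frequent outcome and distort the estimated proportion.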
Usage
stat_plsmo(mapping = NULL, data = NULL, geom = "smooth", position = "identity", n = 80, fullrange = FALSE, span = 2/3, fun = function(x) x, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE, ...)
Arguments
mapping,data,geom,position,show.legend,inherit.aes | see ggplot2 documentation |
n | number of points to evaluate smoother at |
fullrange | should the fit span the full range of the plot, or just the data |
span | see |
fun | a function to transform smoothed |
na.rm | If |
... | other arguments are passed to smoothing function |
Value
a data.frame with additional columns
y | predicted value |
See Also
lowess for loess smoother.
Examples
require(ggplot2)
c <- ggplot(mtcars, aes(qsec, wt))
c + stat_plsmo()
c + stat_plsmo() + geom_point()

c + stat_plsmo(span = 0.1) + geom_point()

# Smoothers for subsets
c <- ggplot(mtcars, aes(y=wt, x=mpg)) + facet_grid(. ~ cyl)
c + stat_plsmo() + geom_point()
c + stat_plsmo(fullrange = TRUE) + geom_point()

# Geoms and stats are automatically split by aesthetics that are factors
c <- ggplot(mtcars, aes(y=wt, x=mpg, colour=factor(cyl)))
c + stat_plsmo() + geom_point()
c + stat_plsmo(aes(fill = factor(cyl))) + geom_point()
c + stat_plsmo(fullrange=TRUE) + geom_point()

# Example with logistic regression
data("kyphosis", package="rpart")
qplot(Age, as.numeric(Kyphosis) - 1, data = kyphosis) + stat_plsmo()
Enhanced Importing of STATA Files
Description
Reads a file in Stata version 5-11 binary format into a data frame.
Usage
stata.get(file, lowernames = FALSE, convert.dates = TRUE, convert.factors = TRUE, missing.type = FALSE, convert.underscore = TRUE, warn.missing.labels = TRUE, force.single = TRUE, allow=NULL, charfactor=FALSE, ...)
Arguments
file | input Stata file. May be a file on the WWW, indicated by |
lowernames | set to |
convert.dates | see |
convert.factors | see |
missing.type | see |
convert.underscore | see |
warn.missing.labels | see |
force.single | set to |
allow | a vector of characters allowed by R that should not be converted to periods in variable names. By default, underscores in variable names are converted to periods as with R before version 1.9. |
charfactor | set to |
... | arguments passed to |
Details
stata.get invokes the read.dta function in the foreign package to read a Stata file, with a default output format of data.frame. The label function is used to attach labels to individual variables instead of to the data frame as done by read.dta. By default, integer-valued variables are converted to a storage mode of integer unless force.single=FALSE. Date variables are converted to R Date variables. By default, underscores in names are converted to periods.
Value
A data frame
Author(s)
Charles Dupont
See Also
read.dta, cleanup.import, label, data.frame, Date
Examples
## Not run:
w <- stata.get('/tmp/my.dta')
## End(Not run)
Determine Dimensions of Strings
Description
This determines the number of rows and maximum number of columns of each string in a vector.
Usage
string.bounding.box(string, type = c("chars", "width"))
Arguments
string | vector of strings |
type | character: whether to count characters or screen columns |
Value
rows | vector containing the number of character rows in each string |
columns | vector containing the maximum number of character columns in each string |
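The two returned components can be reproduced approximately in base R by splitting on newlines and measuring each piece. The helper bounding_box below is a hypothetical stand-in, and its handling of empty strings (one row, zero columns) is an assumption rather than documented behavior.

```r
# Hypothetical stand-in; empty strings are assumed to give 1 row, 0 columns
bounding_box <- function(string, type = c("chars", "width")) {
  type  <- match.arg(type)
  lines <- strsplit(string, "\n", fixed = TRUE)
  list(
    rows    = vapply(lines, function(l) max(length(l), 1L), integer(1)),
    columns = vapply(lines, function(l)
                if (length(l)) max(nchar(l, type = type)) else 0L,
                integer(1))
  )
}

a  <- c("this is a single line string", "This is a\nmulti-line string")
bb <- bounding_box(a)
bb$rows      # 1 and 2 lines
bb$columns   # 28 and 17 characters in the longest line
```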
Author(s)
Charles Dupont
See Also
Examples
a <- c("this is a single line string", "This is a\nmulti-line string")
stringDims(a)
Break a String into Many Lines at Newlines
Description
Takes a string and breaks it into separate substrings where there are newline characters.
Usage
string.break.line(string)
Arguments
string | character vector to be separated into many lines. |
Value
Returns a list that is the same length as the string argument.
Each list element is a character vector.
The elements of each character vector are the split lines of the corresponding element in the string argument vector.
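The return structure described above is close to what base strsplit gives. The sketch below uses strsplit as an approximation rather than the function's own code; the one patch applied, mapping an empty input string to a single empty line instead of character(0), is an assumption about the intended behavior.

```r
a <- c('', 'this is a single line string', 'This is a\nmulti-line string.')

# One list element per input string, one vector entry per line
b <- strsplit(a, "\n", fixed = TRUE)

# strsplit returns character(0) for ''; assume one empty line is wanted instead
b[a == ""] <- list("")

lengths(b)   # 1, 1, 2
b[[3]]
```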
Author(s)
Charles Dupont
See Also
Examples
a <- c('', 'this is a single line string', 'This is a\nmulti-line string.')
b <- string.break.line(a)
String Dimensions
Description
Finds the height and width of all the strings in a character vector.
Usage
stringDims(string)
Arguments
string | vector of strings |
Details
stringDims finds the number of characters in width and number of lines in height for each string in the string argument.
Value
height | a vector of the number of lines in each string. |
width | a vector with the number of character columns in the longest line. |
Author(s)
Charles Dupont
See Also
Examples
a <- c("this is a single line string", "This is a\nmulty line string")
stringDims(a)
Embed a new plot within an existing plot
Description
Subplot will embed a new plot within an existing plot at the coordinates specified (in user units of the existing plot).
Usage
subplot(fun, x, y, size=c(1,1), vadj=0.5, hadj=0.5, pars=NULL)
Arguments
fun | an expression or function defining the new plot to be embedded. |
x |
|
y |
|
size | The size of the embedded plot in inches if |
vadj | vertical adjustment of the plot when |
hadj | horizontal adjustment of the plot when |
pars | a list of parameters to be passed to |
Details
The coordinatesx andy can be scalars or vectors oflength 2. If vectors of length 2 then they determine the oppositecorners of the rectangle for the embedded plot (and the parameterssize,vadj, andhadj are all ignored.
Ifx andy are given as scalars then the plot positionrelative to the point and the size of the plot will be determined bythe argumentssize,vadj, andhadj. The defaultis to center a 1 inch by 1 inch plot atx,y. Settingvadj andhadj to(0,0) will position the lowerleft corner of the plot at(x,y).
The rectangle defined byx,y,size,vadj,andhadj will be used as the plotting area of the new plot.Any tick marks, axis labels, main and sub titles will be outside ofthis rectangle.
Any graphical parameter settings that you would like to be in place before fun is evaluated can be specified in the pars argument (warning: specifying layout parameters here (plt, mfrow, etc.) may cause unexpected results).
After the function completes, the graphical parameters will have been reset to what they were before calling the function (so you can continue to augment the original plot).
Value
An invisible list with the graphical parameters that were in effect when the subplot was created. Passing this list to par will enable you to augment the embedded plot.
Author(s)
Greg Snow greg.snow@imail.org
See Also
Examples
# make an original plot
plot( 11:20, sample(51:60) )
# add some histograms
subplot( hist(rnorm(100)), 15, 55)
subplot( hist(runif(100),main='',xlab='',ylab=''), 11, 51, hadj=0, vadj=0)
subplot( hist(rexp(100, 1/3)), 20, 60, hadj=1, vadj=1, size=c(0.5,2) )
subplot( hist(rt(100,3)), c(12,16), c(57,59), pars=list(lwd=3,ask=FALSE) )
tmp <- rnorm(25)
qqnorm(tmp)
qqline(tmp)
tmp2 <- subplot( hist(tmp,xlab='',ylab='',main=''), cnvrt.coords(0.1,0.9,'plt')$usr, vadj=1, hadj=0 )
abline(v=0, col='red') # wrong way to add a reference line to histogram
# right way to add a reference line to histogram
op <- par(no.readonly=TRUE)
par(tmp2)
abline(v=0, col='green')
par(op)
Summarize Scalars or Matrices by Cross-Classification
Description
summarize is a fast version of summary.formula(formula, method="cross", overall=FALSE) for producing stratified summary statistics and storing them in a data frame for plotting (especially with trellis xyplot and dotplot and Hmisc xYplot). Unlike aggregate, summarize accepts a matrix as its first argument and a multi-valued FUN argument, and summarize also labels the variables in the new data frame using their original names. Unlike methods based on tapply, summarize stores the values of the stratification variables using their original types, e.g., a numeric by variable will remain a numeric variable in the collapsed data frame. summarize also retains "label" attributes for variables. summarize works especially well with the Hmisc xYplot function for displaying multiple summaries of a single variable on each panel, such as means and upper and lower confidence limits.
asNumericMatrix converts a data frame into a numeric matrix, saving attributes so that the process can be reversed by matrix2dataFrame. It saves attributes that are commonly preserved across row subsetting (i.e., it does not save dim, dimnames, or names attributes).
matrix2dataFrame converts a numeric matrix back into a data frame if it was created by asNumericMatrix.
Usage
summarize(X, by, FUN, ..., stat.name=deparse(substitute(X)), type=c('variables','matrix'), subset=TRUE, keepcolnames=FALSE)
asNumericMatrix(x)
matrix2dataFrame(x, at=attr(x, 'origAttributes'), restoreAll=TRUE)
Arguments
X | a vector or matrix capable of being operated on by the function specified as the |
by | one or more stratification variables. If a single variable, |
FUN | a function of a single vector argument, used to create the statistical summaries for |
... | extra arguments are passed to |
stat.name | the name to use when creating the main summary variable. By default, the name of the |
type | Specify |
subset | a logical vector or integer vector of subscripts used to specify the subset of data to use in the analysis. The default is to use all observations in the data frame. |
keepcolnames | by default when |
x | a data frame (for |
at | List containing attributes of original data frame that survive subsetting. Defaults to attribute |
restoreAll | set to |
Value
For summarize, a data frame containing the by variables and the statistical summaries (the first of which is named the same as the X variable unless stat.name is given). If type="matrix", the summaries are stored in a single variable in the data frame, and this variable is a matrix.
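To make the shape of the returned data frame concrete, here is a base-R sketch of the same idea: one row per stratum, a multi-valued summary function, and a stratification variable that stays numeric. It only imitates the result conceptually and does not use or reproduce summarize itself; the object names are chosen for this illustration.

```r
set.seed(1)
temperature <- rnorm(300, 70, 10)
month <- sample(1:12, 300, TRUE)
g <- function(x) c(Mean = mean(x, na.rm = TRUE), Median = median(x, na.rm = TRUE))

# One row of statistics per stratum; g returns two values per stratum
groups <- split(temperature, month)
stats <- do.call(rbind, lapply(groups, g))

# Keep the stratification variable numeric, as the text describes,
# rather than converting it to a factor or to row names
collapsed <- data.frame(month = sort(unique(month)), stats, row.names = NULL)
str(collapsed)
```

Keeping month as a numeric column is what lets the collapsed data be plotted against month on a continuous axis without further conversion.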
asNumericMatrix returns a numeric matrix and stores an object origAttributes as an attribute of the returned object, with original attributes of component variables and the storage.mode.
matrix2dataFrame returns a data frame.
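The round trip between a data frame and a numeric matrix can be sketched in base R as follows. This is an illustration of the general idea (factor codes in the matrix, levels saved on the side), not the code used by asNumericMatrix or matrix2dataFrame; the names saved and restored are introduced only for this example.

```r
d <- data.frame(a = factor(c("x", "y", "x")), n = c(1.5, 2, 3))

# Forward: factors become their integer codes; keep per-column attributes
m <- data.matrix(d)
saved <- lapply(d, attributes)

# Reverse: rebuild each column, restoring factor levels from the saved attributes
restored <- as.data.frame(m)
for (v in names(saved)) {
  lev <- saved[[v]]$levels
  if (!is.null(lev)) restored[[v]] <- factor(lev[restored[[v]]], levels = lev)
}
all.equal(restored, d)
```

Storing the attributes separately is what makes the numeric matrix safe to subset by rows and still convertible back.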
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
See Also
Examples
## Not run: 
s7 <- summarize(ap>1, llist(size=cut2(sz, g=4), bone), mean, stat.name='Proportion')
dotplot(Proportion ~ size | bone, data=s7)
## End(Not run)
set.seed(1)
temperature <- rnorm(300, 70, 10)
month <- sample(1:12, 300, TRUE)
year <- sample(2000:2001, 300, TRUE)
g <- function(x)c(Mean=mean(x,na.rm=TRUE),Median=median(x,na.rm=TRUE))
summarize(temperature, month, g)
mApply(temperature, month, g)
mApply(temperature, month, mean, na.rm=TRUE)
w <- summarize(temperature, month, mean, na.rm=TRUE)
library(lattice)
xyplot(temperature ~ month, data=w) # plot mean temperature by month
w <- summarize(temperature, llist(year,month), quantile, probs=c(.5,.25,.75), na.rm=TRUE, type='matrix')
xYplot(Cbind(temperature[,1],temperature[,-1]) ~ month | year, data=w)
mApply(temperature, llist(year,month), quantile, probs=c(.5,.25,.75), na.rm=TRUE)
# Compute the median and outer quartiles. The outer quartiles are
# displayed using "error bars"
set.seed(111)
dfr <- expand.grid(month=1:12, year=c(1997,1998), reps=1:100)
attach(dfr)
y <- abs(month-6.5) + 2*runif(length(month)) + year-1997
s <- summarize(y, llist(month,year), smedian.hilow, conf.int=.5)
s
mApply(y, llist(month,year), smedian.hilow, conf.int=.5)
xYplot(Cbind(y,Lower,Upper) ~ month, groups=year, data=s, keys='lines', method='alt')
# Can also do:
s <- summarize(y, llist(month,year), quantile, probs=c(.5,.25,.75), stat.name=c('y','Q1','Q3'))
xYplot(Cbind(y, Q1, Q3) ~ month, groups=year, data=s, keys='lines')
# To display means and bootstrapped nonparametric confidence intervals
# use for example:
s <- summarize(y, llist(month,year), smean.cl.boot)
xYplot(Cbind(y, Lower, Upper) ~ month | year, data=s)
# For each subject use the trapezoidal rule to compute the area under
# the (time,response) curve using the Hmisc trap.rule function
x <- cbind(time=c(1,2,4,7, 1,3,5,10),response=c(1,3,2,4, 1,3,2,4))
subject <- c(rep(1,4),rep(2,4))
trap.rule(x[1:4,1],x[1:4,2])
summarize(x, subject, function(y) trap.rule(y[,1],y[,2]))
## Not run: 
# Another approach would be to properly re-shape the mm array below
# This assumes no missing cells. There are many other approaches.
# mApply will do this well while allowing for missing cells.
m <- tapply(y, list(year,month), quantile, probs=c(.25,.5,.75))
mm <- array(unlist(m), dim=c(3,2,12), dimnames=list(c('lower','median','upper'), c('1997','1998'), as.character(1:12)))
# aggregate will help but it only allows you to compute one quantile
# at a time; see also the Hmisc mApply function
dframe <- aggregate(y, list(Year=year,Month=month), quantile, probs=.5)
# Compute expected life length by race assuming an exponential
# distribution - can also use summarize
g <- function(y) { # computations for one race group
  futime <- y[,1]; event <- y[,2]
  sum(futime)/sum(event) # assume event=1 for death, 0=alive
}
mApply(cbind(followup.time, death), race, g)
# To run mApply on a data frame:
xn <- asNumericMatrix(x)
m <- mApply(xn, race, h)
# Here assume h is a function that returns a matrix similar to x
matrix2dataFrame(m)
# Get stratified weighted means
g <- function(y) wtd.mean(y[,1],y[,2])
summarize(cbind(y, wts), llist(sex,race), g, stat.name='y')
mApply(cbind(y,wts), llist(sex,race), g)
# Compare speed of mApply vs. by for computing
d <- data.frame(sex=sample(c('female','male'),100000,TRUE), country=sample(letters,100000,TRUE), y1=runif(100000), y2=runif(100000))
g <- function(x) {
  y <- c(median(x[,'y1']-x[,'y2']), med.sum =median(x[,'y1']+x[,'y2']))
  names(y) <- c('med.diff','med.sum')
  y
}
system.time(by(d, llist(sex=d$sex,country=d$country), g))
system.time({
  x <- asNumericMatrix(d)
  a <- subsAttr(d)
  m <- mApply(x, llist(sex=d$sex,country=d$country), g)
})
system.time({
  x <- asNumericMatrix(d)
  summarize(x, llist(sex=d$sex, country=d$country), g)
})
# An example where each subject has one record per diagnosis but sex of
# subject is duplicated for all the rows a subject has. Get the cross-
# classified frequencies of diagnosis (dx) by sex and plot the results
# with a dot plot
count <- rep(1,length(dx))
d <- summarize(count, llist(dx,sex), sum)
Dotplot(dx ~ count | sex, data=d)
## End(Not run)
d <- list(x=1:10, a=factor(rep(c('a','b'), 5)), b=structure(letters[1:10], label='label for b'), d=c(rep(TRUE,9), FALSE), f=pi*(1 : 10))
x <- asNumericMatrix(d)
attr(x, 'origAttributes')
matrix2dataFrame(x)
detach('dfr')
# Run summarize on a matrix to get column means
x <- c(1:19,NA)
y <- 101:120
z <- cbind(x, y)
g <- c(rep(1, 10), rep(2, 10))
summarize(z, g, colMeans, na.rm=TRUE, stat.name='x')
# Also works on an all numeric data frame
summarize(as.data.frame(z), g, colMeans, na.rm=TRUE, stat.name='x')
Summarize Data for Making Tables and Plots
Description
summary.formula summarizes the variables listed in an S formula, computing descriptive statistics (including ones in a user-specified function). The summary statistics may be passed to print methods, plot methods for making annotated dot charts, and latex methods for typesetting tables using LaTeX. summary.formula has three methods for computing descriptive statistics on univariate or multivariate responses, subsetted by categories of other variables. The method of summarization is specified in the parameter method (see details below). For the response and cross methods, the statistics used to summarize the data may be specified in a very flexible way (e.g., the geometric mean, 33rd percentile, Kaplan-Meier 2-year survival estimate, mixtures of several statistics). The default summary statistic for these methods is the mean (the proportion of positive responses for a binary response variable). The cross method is useful for creating data frames which contain summary statistics that are passed to trellis as raw data (to make multi-panel dot charts, for example). The print methods use the print.char.matrix function to print boxed tables.
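The default summary described above (the mean of the response within categories of each right-hand-side variable, which becomes a proportion for a binary response) can be sketched in base R. This imitates the idea only; it does not call summary.formula, and the object names are chosen for this illustration.

```r
set.seed(173)
sex <- factor(sample(c("m", "f"), 500, replace = TRUE))
treatment <- factor(sample(c("Drug", "Placebo"), 500, replace = TRUE))
age <- rnorm(500, 50, 5)

# Mean of the response within categories of each right-hand-side variable
predictors <- list(sex = sex, treatment = treatment)
mean_age <- lapply(predictors, function(g) tapply(age, g, mean))

# For a binary response the mean is the proportion of positive responses
older <- as.integer(age > 50)
prop_older <- lapply(predictors, function(g) tapply(older, g, mean))

mean_age$sex
prop_older$treatment
```

Each predictor is summarized separately here, which is the sense in which the response is "subsetted by categories of other variables".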
The right hand side of formula may contain mChoice (“multiple choice”) variables. When test=TRUE each choice is tested separately as a binary categorical response.
The plot method for method="reverse" creates a temporary function Key in frame 0 as is done by the xYplot and Ecdf.formula functions. After plot runs, you can type Key() to put a legend in a default location, or e.g. Key(locator(1)) to draw a legend where you click the left mouse button. This key is for categorical variables, so to have the opportunity to put the key on the graph you will probably want to use the command plot(object, which="categorical"). A second function Key2 is created if continuous variables are being plotted. It is used the same as Key. If the which argument is not specified to plot, two pages of plots will be produced. If you don't define par(mfrow=) yourself, plot.summary.formula.reverse will try to lay out a multi-panel graph to best fit all the individual dot charts for continuous variables.
There is a subscripting method for objects created with method="response". This can be used to print or plot selected variables or summary statistics where there would otherwise be too many on one page.
cumcategory is a utility function useful when summarizing an ordinal response variable. It converts such a variable having k levels to a matrix with k-1 columns, where column i is a vector of zeros and ones indicating that the categorical response is in level i+1 or greater. When the left hand side of formula is cumcategory(y), the default fun will summarize it by computing all of the relevant cumulative proportions.
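The conversion just described can be written out in a few lines of base R. This is a sketch of the construction, not the cumcategory source; the column labels are chosen for this illustration and may differ from what the package produces.

```r
y <- factor(c("mild", "moderate", "severe", "mild"),
            levels = c("mild", "moderate", "severe"))
k <- nlevels(y)
codes <- as.integer(y)

# Column i indicates that the response is in level i+1 or greater
cum <- sapply(2:k, function(j) as.integer(codes >= j))
colnames(cum) <- paste0(">=", levels(y)[-1])  # labels chosen for this sketch

colMeans(cum)  # cumulative proportions: 0.50 and 0.25
```

Taking column means of the indicator matrix is what yields the cumulative proportions mentioned in the text.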
Functions conTestkw, catTestchisq, ordTestpo are the default statistical test functions for summary.formula. These defaults are: Wilcoxon-Kruskal-Wallis test for continuous variables, Pearson chi-square test for categorical variables, and the likelihood ratio chi-square test from the proportional odds model for ordinal variables. These three functions serve also as templates for the user to create her own testing functions that are self-defining in terms of how the results are printed or rendered in LaTeX, or plotted.
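For orientation, the first two kinds of test named above have direct counterparts in base R's stats package. The sketch below runs those counterparts on simulated data; it is an assumption that they are close analogues, and the package's own functions may compute a different statistic or return a different structure. The proportional odds test is omitted because it needs a model-fitting package.

```r
set.seed(2)
group <- factor(sample(c("A", "B", "C"), 200, replace = TRUE))
x <- rnorm(200) + 0.3 * (group == "C")
cat_var <- factor(sample(c("yes", "no"), 200, replace = TRUE))

# Rank-based comparison of a continuous variable across groups
kw <- kruskal.test(x, group)

# Pearson chi-square test on a frequency table (no continuity correction)
tab <- table(group, cat_var)
chi <- chisq.test(tab, correct = FALSE)

c(kruskal_P = kw$p.value, chisq_P = chi$p.value)
```

Note that the continuous-variable test takes the raw data and grouping variable, while the categorical test takes a frequency table, mirroring the argument lists shown in Usage.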
Usage
## S3 method for class 'formula'
summary(formula, data=NULL, subset=NULL, na.action=NULL, fun = NULL,
        method = c("response", "reverse", "cross"),
        overall = method == "response" | method == "cross",
        continuous = 10, na.rm = TRUE, na.include = method != "reverse", g = 4,
        quant = c(0.025, 0.05, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 0.95, 0.975),
        nmin = if (method == "reverse") 100 else 0,
        test = FALSE, conTest = conTestkw, catTest = catTestchisq, ordTest = ordTestpo, ...)
## S3 method for class 'summary.formula.response'
x[i, j, drop=FALSE]
## S3 method for class 'summary.formula.response'
print(x, vnames=c('labels','names'), prUnits=TRUE, abbreviate.dimnames=FALSE,
      prefix.width, min.colwidth, formatArgs=NULL, markdown=FALSE, ...)
## S3 method for class 'summary.formula.response'
plot(x, which = 1, vnames = c('labels','names'), xlim, xlab,
     pch = c(16, 1, 2, 17, 15, 3, 4, 5, 0), superposeStrata = TRUE,
     dotfont = 1, add = FALSE, reset.par = TRUE, main, subtitles = TRUE, ...)
## S3 method for class 'summary.formula.response'
latex(object, title = first.word(deparse(substitute(object))), caption, trios,
      vnames = c('labels', 'names'), prn = TRUE, prUnits = TRUE,
      rowlabel = '', cdec = 2, ncaption = TRUE, ...)
## S3 method for class 'summary.formula.reverse'
print(x, digits, prn = any(n != N), pctdig = 0, what=c('%', 'proportion'),
      npct = c('numerator', 'both', 'denominator', 'none'),
      exclude1 = TRUE, vnames = c('labels', 'names'), prUnits = TRUE,
      sep = '/', abbreviate.dimnames = FALSE,
      prefix.width = max(nchar(lab)), min.colwidth, formatArgs=NULL, round=NULL,
      prtest = c('P','stat','df','name'), prmsd = FALSE, long = FALSE,
      pdig = 3, eps = 0.001, ...)
## S3 method for class 'summary.formula.reverse'
plot(x, vnames = c('labels', 'names'), what = c('proportion', '%'),
     which = c('both', 'categorical', 'continuous'),
     xlim = if(what == 'proportion') c(0,1) else c(0,100),
     xlab = if(what=='proportion') 'Proportion' else 'Percentage',
     pch = c(16, 1, 2, 17, 15, 3, 4, 5, 0), exclude1 = TRUE,
     dotfont = 1, main, prtest = c('P', 'stat', 'df', 'name'), pdig = 3, eps = 0.001,
     conType = c('dot', 'bp', 'raw'), cex.means = 0.5, ...)
## S3 method for class 'summary.formula.reverse'
latex(object, title = first.word(deparse(substitute(object))), digits,
      prn = any(n != N), pctdig = 0, what=c('%', 'proportion'),
      npct = c("numerator", "both", "denominator", "slash", "none"),
      npct.size = 'scriptsize', Nsize = "scriptsize", exclude1 = TRUE,
      vnames=c("labels", "names"), prUnits = TRUE, middle.bold = FALSE,
      outer.size = "scriptsize", caption, rowlabel = "",
      insert.bottom = TRUE, dcolumn = FALSE, formatArgs=NULL, round = NULL,
      prtest = c('P', 'stat', 'df', 'name'), prmsd = FALSE, msdsize = NULL,
      long = dotchart, pdig = 3, eps = 0.001, auxCol = NULL, dotchart=FALSE, ...)
## S3 method for class 'summary.formula.cross'
print(x, twoway = nvar == 2, prnmiss = any(stats$Missing > 0), prn = TRUE,
      abbreviate.dimnames = FALSE, prefix.width = max(nchar(v)),
      min.colwidth, formatArgs = NULL, ...)
## S3 method for class 'summary.formula.cross'
latex(object, title = first.word(deparse(substitute(object))),
      twoway = nvar == 2, prnmiss = TRUE, prn = TRUE,
      caption=attr(object, "heading"), vnames=c("labels", "names"), rowlabel="", ...)
stratify(..., na.group = FALSE, shortlabel = TRUE)
## S3 method for class 'summary.formula.cross'
formula(x, ...)
cumcategory(y)
conTestkw(group, x)
catTestchisq(tab)
ordTestpo(group, x)
Arguments
formula | An R formula with additive effects. For |
x | an object created by |
y | a numeric, character, category, or factor vector for |
drop | logical. If |
data | name or number of a data frame. Default is the current frame. |
subset | a logical vector or integer vector of subscripts used to specify the subset of data to use in the analysis. The default is to use all observations in the data frame. |
na.action | function for handling missing data in the input data. The default is a function defined here called |
fun | function for summarizing data in each cell. Default is to take the mean of each column of the possibly multivariate response variable. You can specify |
method | The default is The The |
overall | For |
continuous | specifies the threshold for when a variable is considered to be continuous (when there are at least |
na.rm |
|
na.include | for |
g | number of quantile groups to use when variables are automatically categorized with |
nmin | if fewer than |
test | applies if |
conTest | a function of two arguments (grouping variable and a continuous variable) that returns a list with components |
catTest | a function of a frequency table (an integer matrix) that returns a list with the same components as created by |
ordTest | a function of two arguments (grouping variable and an ordinal variable) that returns a list with the same components as created by |
... | for |
object | an object created by |
quant | vector of quantiles to use for summarizing data with |
vnames | By default, tables and plots are usually labeled with variable labels (see the |
pch | vector of plotting characters to represent different groups, in order of group levels. For |
superposeStrata | If |
dotfont | font for plotting points |
reset.par | set to |
abbreviate.dimnames | see |
prefix.width | see |
min.colwidth | minimum column width to use for boxes printed with |
formatArgs | a list containing other arguments to pass to |
markdown | for |
digits | number of significant digits to print. Default is to use the current value of the |
prn | set to |
prnmiss | set to |
what | for |
pctdig | number of digits to the right of the decimal place for printing percentages. The default is zero, so percents will be rounded to the nearest percent. |
npct | specifies which counts are to be printed to the right of percentages. The default is to print the frequency (numerator of the percent) in parentheses. You can specify |
npct.size | the size for typesetting |
Nsize | When a second row of column headings is added showing sample sizes, |
exclude1 | by default, |
prUnits | set to |
sep | character to use to separate quantiles when printing |
prtest | a vector of test statistic components to print if |
round | for |
prmsd | set to |
msdsize | defaults to |
long | set to |
pdig | number of digits to the right of the decimal place for printing P-values. Default is |
eps | P-values less than |
auxCol | an optional auxiliary column of information, right justified, to add in front of statistics typeset by |
twoway | for |
which | For |
conType | For plotting |
cex.means | character size for means in box-percentile plots; default is .5 |
xlim | vector of length two specifying x-axis limits. For |
xlab | x-axis label |
add | set to |
main | a main title. For |
subtitles | set to |
caption | character string containing LaTeX table captions. |
title | name of resulting LaTeX file omitting the |
trios | If for |
rowlabel | see |
cdec | number of decimal places to the right of the decimal point for |
ncaption | set to |
i | a vector of integers, or character strings containing variable names to subset on. Note that each row subsetted on in an |
j | a vector of integers representing column numbers |
middle.bold | set to |
outer.size | the font size for outer quantiles for |
insert.bottom | set to |
dcolumn | see |
na.group | set to |
shortlabel | set to |
dotchart | set to |
group | for |
tab | for |
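As a small illustration of the pctdig and npct defaults described in the table above (percentages rounded to pctdig decimal places, with the numerator in parentheses), here is a base-R formatting sketch. The helper fmt_pct is hypothetical and the exact layout printed by the package may differ.

```r
# Hypothetical formatter (not part of Hmisc); the package's exact
# output layout may differ
fmt_pct <- function(numerator, denominator, pctdig = 0) {
  pct <- 100 * numerator / denominator
  paste0(formatC(pct, format = "f", digits = pctdig), "% (", numerator, ")")
}

fmt_pct(37, 120)              # "31% (37)"
fmt_pct(37, 120, pctdig = 1)  # "30.8% (37)"
```

With pctdig=0 the percentage 30.83 is rounded to the nearest percent, which is the default behavior the pctdig entry describes.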
Value
summary.formula returns a data frame or list depending on method. plot.summary.formula.reverse returns the number of pages of plots that were made.
Side Effects
plot.summary.formula.reverse creates functions Key and Key2 in frame 0 that will draw legends.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
References
Harrell FE (2007): Statistical tables and plots using S and LaTeX. Document available from https://hbiostat.org/R/Hmisc/summary.pdf.
See Also
mChoice, smean.sd, summarize, label, strata, dotchart2, print.char.matrix, update, formula, cut2, llist, format.default, latex, latexTranslate, bpplt, summaryM, summary
Examples
options(digits=3)
set.seed(173)
sex <- factor(sample(c("m","f"), 500, rep=TRUE))
age <- rnorm(500, 50, 5)
treatment <- factor(sample(c("Drug","Placebo"), 500, rep=TRUE))
# Generate a 3-choice variable; each of 3 variables has 5 possible levels
symp <- c('Headache','Stomach Ache','Hangnail', 'Muscle Ache','Depressed')
symptom1 <- sample(symp, 500,TRUE)
symptom2 <- sample(symp, 500,TRUE)
symptom3 <- sample(symp, 500,TRUE)
Symptoms <- mChoice(symptom1, symptom2, symptom3, label='Primary Symptoms')
table(Symptoms)
# Note: In this example, some subjects have the same symptom checked
# multiple times; in practice these redundant selections would be NAs
# mChoice will ignore these redundant selections
#Frequency table sex*treatment, sex*Symptoms
summary(sex ~ treatment + Symptoms, fun=table)
# could also do summary(sex ~ treatment +
# mChoice(symptom1,symptom2,symptom3), fun=table)
#Compute mean age, separately by 3 variables
summary(age ~ sex + treatment + Symptoms)
f <- summary(treatment ~ age + sex + Symptoms, method="reverse", test=TRUE)
f
# trio of numbers represent 25th, 50th, 75th percentile
print(f, long=TRUE)
plot(f)
plot(f, conType='bp', prtest='P')
bpplt() # annotated example showing layout of bp plot
#Compute predicted probability from a logistic regression model
#For different stratifications compute receiver operating
#characteristic curve areas (C-indexes)
predicted <- plogis(.4*(sex=="m")+.15*(age-50))
positive.diagnosis <- ifelse(runif(500)<=predicted, 1, 0)
roc <- function(z) {
  x <- z[,1]; y <- z[,2]
  n <- length(x)
  if(n<2) return(c(ROC=NA))
  n1 <- sum(y==1)
  c(ROC= (mean(rank(x)[y==1])-(n1+1)/2)/(n-n1) )
}
y <- cbind(predicted, positive.diagnosis)
options(digits=2)
summary(y ~ age + sex, fun=roc)
options(digits=3)
summary(y ~ age + sex, fun=roc, method="cross")
#Use stratify() to produce a table in which time intervals go down the
#page and going across 3 continuous variables are summarized using
#quartiles, and are stratified by two treatments
set.seed(1)
d <- expand.grid(visit=1:5, treat=c('A','B'), reps=1:100)
d$sysbp <- rnorm(100*5*2, 120, 10)
label(d$sysbp) <- 'Systolic BP'
d$diasbp <- rnorm(100*5*2, 80, 7)
d$diasbp[1] <- NA
d$age <- rnorm(100*5*2, 50, 12)
g <- function(y) {
  N <- apply(y, 2, function(w) sum(!is.na(w)))
  h <- function(x) {
    qu <- quantile(x, c(.25,.5,.75), na.rm=TRUE)
    names(qu) <- c('Q1','Q2','Q3')
    c(N=sum(!is.na(x)), qu)
  }
  w <- as.vector(apply(y, 2, h))
  names(w) <- as.vector( outer(c('N','Q1','Q2','Q3'), dimnames(y)[[2]], function(x,y) paste(y,x)))
  w
}
#Use na.rm=FALSE to count NAs separately by column
s <- summary(cbind(age,sysbp,diasbp) ~ visit + stratify(treat), na.rm=FALSE, fun=g, data=d)
#The result is very wide. Re-do, putting treatment vertically
x <- with(d, factor(paste('Visit', visit, treat)))
summary(cbind(age,sysbp,diasbp) ~ x, na.rm=FALSE, fun=g, data=d)
#Compose LaTeX code directly
g <- function(y) {
  h <- function(x) {
    qu <- format(round(quantile(x, c(.25,.5,.75), na.rm=TRUE),1),nsmall=1)
    paste('{\\scriptsize(',sum(!is.na(x)), ')} \\hfill{\\scriptsize ', qu[1], '} \\textbf{', qu[2], '} {\\scriptsize ', qu[3],'}', sep='')
  }
  apply(y, 2, h)
}
s <- summary(cbind(age,sysbp,diasbp) ~ visit + stratify(treat), na.rm=FALSE, fun=g, data=d)
# latex(s, prn=FALSE)
## need option in latex to not print n
#Put treatment vertically
s <- summary(cbind(age,sysbp,diasbp) ~ x, fun=g, data=d, na.rm=FALSE)
# latex(s, prn=FALSE)
#Plot estimated mean life length (assuming an exponential distribution)
#separately by levels of 4 other variables. Repeat the analysis
#by levels of a stratification variable, drug. Automatically break
#continuous variables into tertiles.
#We are using the default, method='response'
## Not run: 
life.expect <- function(y) c(Years=sum(y[,1])/sum(y[,2]))
attach(pbc)
require(survival)
S <- Surv(follow.up.time, death)
s2 <- summary(S ~ age + albumin + ascites + edema + stratify(drug), fun=life.expect, g=3)
#Note: You can summarize other response variables using the same
#independent variables using e.g. update(s2, response~.), or you
#can change the list of independent variables using e.g.
#update(s2, response ~.- ascites) or update(s2, .~.-ascites)
#You can also print, typeset, or plot subsets of s2, e.g.
#plot(s2[c('age','albumin'),]) or plot(s2[1:2,])
s2 # invokes print.summary.formula.response
#Plot results as a separate dot chart for each of the 3 strata levels
par(mfrow=c(2,2))
plot(s2, cex.labels=.6, xlim=c(0,40), superposeStrata=FALSE)
#Typeset table, creating s2.tex
w <- latex(s2, cdec=1)
#Typeset table but just print LaTeX code
latex(s2, file="") # useful for Sweave
#Take control of groups used for age. Compute 3 quartiles for
#both cholesterol and bilirubin (excluding observations that are missing
#on EITHER ONE)
age.groups <- cut2(age, c(45,60))
g <- function(y) apply(y, 2, quantile, c(.25,.5,.75))
y <- cbind(Chol=chol,Bili=bili)
label(y) <- 'Cholesterol and Bilirubin'
#You can give new column names that are not legal S names
#by enclosing them in quotes, e.g. 'Chol (mg/dl)'=chol
s3 <- summary(y ~ age.groups + ascites, fun=g)
par(mfrow=c(1,2), oma=c(3,0,3,0)) # allow outer margins for overall
for(ivar in 1:2) {                # title
  isub <- (1:3)+(ivar-1)*3 # *3=number of quantiles/var.
  plot(s3, which=isub, main='', xlab=c('Cholesterol','Bilirubin')[ivar], pch=c(91,16,93)) # [, closed circle, ]
}
mtext(paste('Quartiles of', label(y)), adj=.5, outer=TRUE, cex=1.75) #Overall (outer) title
prlatex(latex(s3, trios=TRUE)) # trios -> collapse 3 quartiles
#Summarize only bilirubin, but do it with two statistics:
#the mean and the median. Make separate tables for the two randomized
#groups and make plots for the active arm.
g <- function(y) c(Mean=mean(y), Median=median(y))
for(sub in c("D-penicillamine", "placebo")) {
  ss <- summary(bili ~ age.groups + ascites + chol, fun=g, subset=drug==sub)
  cat('\n',sub,'\n\n')
  print(ss)
  if(sub=='D-penicillamine') {
    par(mfrow=c(1,1))
    plot(ss, which=1:2, dotfont=c(1,-1), subtitles=FALSE, main='')
    #1=mean, 2=median -1 font = open circle
    title(sub='Closed circle: mean; Open circle: median', adj=0)
    title(sub=sub, adj=1)
  }
  w <- latex(ss, append=TRUE, fi='my.tex', label=if(sub=='placebo') 's4b' else 's4a', caption=paste(label(bili),' {\\em (',sub,')}', sep=''))
  #Note symbolic labels for tables for two subsets: s4a, s4b
  prlatex(w)
}
#Now consider examples in 'reverse' format, where the lone dependent
#variable tells the summary function how to stratify all the
#'independent' variables. This is typically used to make tables
#comparing baseline variables by treatment group, for example.
s5 <- summary(drug ~ bili + albumin + stage + protime + sex + age + spiders, method='reverse')
#To summarize all variables, use summary(drug ~., data=pbc)
#To summarize all variables with no stratification, use
#summary(~a+b+c) or summary(~.,data=...)
options(digits=1)
print(s5, npct='both')
#npct='both' : print both numerators and denominators
plot(s5, which='categorical')
Key(locator(1)) # draw legend at mouse click
par(oma=c(3,0,0,0)) # leave outer margin at bottom
plot(s5, which='continuous')
Key2() # draw legend at lower left corner of plot
       # oma= above makes this default key fit the page better
options(digits=3)
w <- latex(s5, npct='both', here=TRUE) # creates s5.tex
#Turn to a different dataset and do cross-classifications on possibly
#more than one independent variable. The summary function with
#method='cross' produces a data frame containing the cross-
#classifications. This data frame is suitable for multi-panel
#trellis displays, although `summarize' works better for that.
attach(prostate)
size.quartile <- cut2(sz, g=4)
bone <- factor(bm,labels=c("no mets","bone mets"))
s7 <- summary(ap>1 ~ size.quartile + bone, method='cross')
#In this case, quartiles are the default so could have said sz + bone
options(digits=3)
print(s7, twoway=FALSE)
s7 # same as print(s7)
w <- latex(s7, here=TRUE) # Make s7.tex
library(trellis,TRUE)
invisible(ps.options(reset=TRUE))
trellis.device(postscript, file='demo2.ps')
dotplot(S ~ size.quartile|bone, data=s7, #s7 is name of summary stats
        xlab="Fraction ap>1", ylab="Quartile of Tumor Size")
#Can do this more quickly with summarize:
# s7 <- summarize(ap>1, llist(size=cut2(sz, g=4), bone), mean,
#                 stat.name='Proportion')
# dotplot(Proportion ~ size | bone, data=s7)
summary(age ~ stage, method='cross')
summary(age ~ stage, fun=quantile, method='cross')
summary(age ~ stage, fun=smean.sd, method='cross')
summary(age ~ stage, fun=smedian.hilow, method='cross')
summary(age ~ stage, fun=function(x) c(Mean=mean(x), Median=median(x)), method='cross')
#The next statements print real two-way tables
summary(cbind(age,ap) ~ stage + bone, fun=function(y) apply(y, 2, quantile, c(.25,.75)), method='cross')
options(digits=2)
summary(log(ap) ~ sz + bone, fun=function(y) c(Mean=mean(y), quantile(y)), method='cross')
#Summarize an ordered categorical response by all of the needed
#cumulative proportions
summary(cumcategory(disease.severity) ~ age + sex)
## End(Not run)
Summarize Mixed Data Types vs. Groups
Description
summaryM summarizes the variables listed in an S formula, computing descriptive statistics and optionally statistical tests for group differences. This function is typically used when there are multiple left-hand-side variables that are each summarized independently against groups defined by a single right-hand-side variable. The summary statistics may be passed to print methods, plot methods for making annotated dot charts and extended box plots, and latex methods for typesetting tables using LaTeX. The html method uses htmlTable::htmlTable to typeset the table in html, by passing information to the latex method with html=TRUE. This is for use with Quarto/RMarkdown. The print methods use the print.char.matrix function to print boxed tables when options(prType=) has not been given or when prType='plain'. For plain tables, print calls the internal function printsummaryM. When prType='latex' the latex method is invoked, and when prType='html' html is rendered. In Quarto/RMarkdown, proper rendering will result even if results='asis' does not appear in the chunk header. When rendering in html at the console due to having options(prType='html') the table will be rendered in a viewer.
The plot method creates plotly graphics if options(grType='plotly'), otherwise base graphics are used. plotly graphics provide extra information such as which quantile is being displayed when hovering the mouse. Test statistics are displayed by hovering over the mean.
Continuous variables are described by three quantiles (quartiles by default) when printing, or by the following quantiles when plotting extended box plots using the bpplt function: 0.05, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 0.95. The box plots are scaled to the 0.025 and 0.975 quantiles of each continuous left-hand-side variable. Categorical variables are described by counts and percentages.
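The three sets of quantiles mentioned above can be computed directly with base R's quantile function. The sketch below uses simulated data and R's default quantile type, which is an assumption; it does not call summaryM or bpplt, and the object names are chosen for this illustration.

```r
set.seed(3)
x <- rexp(500, rate = 1 / 3)

# Three quantiles used when printing (quartiles by default)
printed <- quantile(x, c(0.25, 0.5, 0.75))

# Quantiles listed in the text for extended box plots
box_q <- quantile(x, c(0.05, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 0.95))

# Range to which the box plot is scaled
scale_range <- quantile(x, c(0.025, 0.975))

printed
scale_range
```

Because the scaling range uses the 0.025 and 0.975 quantiles rather than the minimum and maximum, a few extreme observations do not stretch the axis.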
The left hand side of formula may contain mChoice ("multiple choice") variables. When test=TRUE each choice is tested separately as a binary categorical response.
The plot method for method="reverse" creates a temporary function Key as is done by the xYplot and Ecdf.formula functions. After plot runs, you can type Key() to put a legend in a default location, or e.g. Key(locator(1)) to draw a legend where you click the left mouse button. This key is for categorical variables, so to have the opportunity to put the key on the graph you will probably want to use the command plot(object, which="categorical"). A second function Key2 is created if continuous variables are being plotted. It is used the same as Key. If the which argument is not specified to plot, two pages of plots will be produced. If you don't define par(mfrow=) yourself, plot.summaryM will try to lay out a multi-panel graph to best fit all the individual charts for continuous variables.
Usage
summaryM(formula, groups=NULL, data=NULL, subset, na.action=na.retain,
         overall=FALSE, continuous=10, na.include=FALSE,
         quant=c(0.025, 0.05, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 0.95, 0.975),
         nmin=100, test=FALSE, conTest=conTestkw, catTest=catTestchisq, ordTest=ordTestpo)
## S3 method for class 'summaryM'
print(...)
printsummaryM(x, digits, prn = any(n != N), what=c('proportion', '%'),
              pctdig = if(what == '%') 0 else 2,
              npct = c('numerator', 'both', 'denominator', 'none'),
              exclude1 = TRUE, vnames = c('labels', 'names'), prUnits = TRUE,
              sep = '/', abbreviate.dimnames = FALSE,
              prefix.width = max(nchar(lab)), min.colwidth, formatArgs=NULL, round=NULL,
              prtest = c('P','stat','df','name'), prmsd = FALSE, long = FALSE,
              pdig = 3, eps = 0.001, prob = c(0.25, 0.5, 0.75), prN = FALSE, ...)
## S3 method for class 'summaryM'
plot(x, vnames = c('labels', 'names'),
     which = c('both', 'categorical', 'continuous'), vars=NULL,
     xlim = c(0,1), xlab = 'Proportion',
     pch = c(16, 1, 2, 17, 15, 3, 4, 5, 0), exclude1 = TRUE,
     main, ncols=2, prtest = c('P', 'stat', 'df', 'name'), pdig = 3, eps = 0.001,
     conType = c('bp', 'dot', 'raw'), cex.means = 0.5, cex=par('cex'),
     height='auto', width=700, ...)
## S3 method for class 'summaryM'
latex(object, title = first.word(deparse(substitute(object))),
      file=paste(title, 'tex', sep='.'), append=FALSE, digits,
      prn = any(n != N), what=c('proportion', '%'),
      pctdig = if(what == '%') 0 else 2,
      npct = c('numerator', 'both', 'denominator', 'slash', 'none'),
      npct.size = if(html) mspecs$html$smaller else 'scriptsize',
      Nsize = if(html) mspecs$html$smaller else 'scriptsize',
      exclude1 = TRUE, vnames=c("labels", "names"), prUnits = TRUE,
      middle.bold = FALSE,
      outer.size = if(html) mspecs$html$smaller else "scriptsize",
      caption, rowlabel = "", rowsep=html,
      insert.bottom = TRUE, dcolumn = FALSE, formatArgs=NULL, round=NULL,
      prtest = c('P', 'stat', 'df', 'name'), prmsd = FALSE,
      msdsize = if(html) function(x) x else NULL, brmsd=FALSE,
      long = FALSE, pdig = 3, eps = 0.001,
      auxCol = NULL, table.env=TRUE, tabenv1=FALSE, prob=c(0.25, 0.5, 0.75),
      prN=FALSE, legend.bottom=FALSE, html=FALSE, mspecs=markupSpecs, ...)
## S3 method for class 'summaryM'
html(object, ...)
Arguments
formula | An S formula with additive effects. There may be several variables on the right hand side separated by "+", or the numeral |
groups | if there is more than one right-hand variable, specify |
x | an object created by |
data | name or number of a data frame. Default is the current frame. |
subset | a logical vector or integer vector of subscripts used to specify the subset of data to use in the analysis. The default is to use all observations in the data frame. |
na.action | function for handling missing data in the input data. The default is a function defined here called |
overall | Setting |
continuous | specifies the threshold for when a variable is considered to be continuous (when there are at least |
na.include | Set |
nmin | For categories of the response variable in which there are less than or equal to |
test | Set to |
conTest | a function of two arguments (grouping variable and a continuous variable) that returns a list with components |
catTest | a function of a frequency table (an integer matrix) that returns a list with the same components as created by |
ordTest | a function of a frequency table (an integer matrix) that returns a list with the same components as created by |
... | For |
object | an object created by |
quant | vector of quantiles to use for summarizing continuous variables. These must be numbers between 0 and 1 inclusive and must include the numbers 0.5, 0.25, and 0.75 which are used for printing and for plotting quantile intervals. The outer quantiles are used for scaling the x-axes for such plots. Specify outer quantiles as |
prob | vector of quantiles to use for summarizing continuous variables. These must be numbers between 0 and 1 inclusive and have previously been included in the Warning: specifying 0 and 1 as two of the quantiles will result in computing the minimum and maximum of the variable. As for many random variables the minimum will continue to become smaller as the sample size grows, and the maximum will continue to get larger. Thus the min and max are not recommended as summary statistics. |
vnames | By default, tables and plots are usually labeled with variable labels (see the |
pch | vector of plotting characters to represent different groups, in order of group levels. |
abbreviate.dimnames | see |
prefix.width | see |
min.colwidth | minimum column width to use for boxes printed with |
formatArgs | a list containing other arguments to pass to |
digits | number of significant digits to print. Default is to use the current value of the |
what | specifies whether proportions or percentages are to be printed or LaTeX'd |
pctdig | number of digits to the right of the decimal place for printing percentages or proportions. The default is zero if |
prn | set to |
prN | set to |
npct | specifies which counts are to be printed to the right of percentages. The default is to print the frequency (numerator of the percent) in parentheses. You can specify |
npct.size | the size for typesetting |
Nsize | When a second row of column headings is added showing sample sizes, |
exclude1 | By default, |
prUnits | set to |
sep | character to use to separate quantiles when printing tables |
prtest | a vector of test statistic components to print if |
round | Specify |
prmsd | set to |
msdsize | defaults to |
brmsd | set to |
long | set to |
pdig | number of digits to the right of the decimal place for printing P-values. Default is |
eps | P-values less than |
auxCol | an optional auxiliary column of information, right justified, to add in front of statistics typeset by |
table.env | set to |
tabenv1 | set to |
which | Specifies whether to plot results for categorical variables, continuous variables, or both (the default). |
vars | Subscripts (indexes) of variables to plot for |
conType | For drawing plots for continuous variables, extended box plots (box-percentile-type plots) are drawn by default, using all quantiles in |
cex.means | character size for means in box-percentile plots; default is .5 |
cex | character size for other plotted items |
height,width | dimensions in pixels for the |
xlim | vector of length two specifying x-axis limits. This is only used for plotting categorical variables. Limits for continuous variables are determined by the outer quantiles specified in |
xlab | x-axis label |
main | a main title. This applies only to the plot for categorical variables. |
ncols | number of columns for |
caption | character string containing LaTeX table captions. |
title | name of resulting LaTeX file omitting the |
file | name of file to write LaTeX code to. Specifying |
append | specify |
rowlabel | see |
rowsep | if |
middle.bold | set to |
outer.size | the font size for outer quantiles |
insert.bottom | set to |
legend.bottom | set to |
html | set to |
mspecs | list defining markup syntax for various languages, defaults to Hmisc |
dcolumn | see |
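The warning under prob about using 0 and 1 as quantiles can be illustrated with a small base R simulation. This is only a sketch: the normal distribution, the seed, and the sample sizes are arbitrary choices and are not part of summaryM.

```r
# Illustrative only: nested samples from a standard normal distribution
set.seed(42)
x  <- rnorm(10000)
ns <- c(10, 100, 1000, 10000)

# Summaries of the first n observations for increasing n
summ <- t(sapply(ns, function(n) {
  xs <- x[seq_len(n)]
  c(n   = n,
    min = min(xs),
    q25 = unname(quantile(xs, 0.25)),
    q75 = unname(quantile(xs, 0.75)),
    max = max(xs))
}))
print(round(summ, 2))
```

Because the samples are nested, the min column can only decrease and the max column can only increase as n grows, while the quartiles are estimates of fixed population quantities.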
Value
a list. plot.summaryM returns the number of pages of plots that were made if using base graphics, or plotly objects created by plotly::subplot otherwise. If both categorical and continuous variables were plotted, the returned object is a list with two named elements Categorical and Continuous, each containing plotly objects. Otherwise a plotly object is returned. The latex method returns attributes legend and nstrata.
Side Effects
plot.summaryM creates functions Key and Key2 in frame 0 that will draw legends, if base graphics are being used.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
References
Harrell FE (2004): Statistical tables and plots using S and LaTeX. Document available from https://hbiostat.org/R/Hmisc/summary.pdf.
See Also
mChoice, label, dotchart3, print.char.matrix, update, formula, format.default, latex, latexTranslate, bpplt, tabulr, bpplotM, summaryP
Examples
options(digits=3)
set.seed(173)
sex <- factor(sample(c("m","f"), 500, rep=TRUE))
country <- factor(sample(c('US', 'Canada'), 500, rep=TRUE))
age <- rnorm(500, 50, 5)
sbp <- rnorm(500, 120, 12)
label(sbp) <- 'Systolic BP'
units(sbp) <- 'mmHg'
treatment <- factor(sample(c("Drug","Placebo"), 500, rep=TRUE))
treatment[1]
sbp[1] <- NA

# Generate a 3-choice variable; each of 3 variables has 5 possible levels
symp <- c('Headache','Stomach Ache','Hangnail',
          'Muscle Ache','Depressed')
symptom1 <- sample(symp, 500,TRUE)
symptom2 <- sample(symp, 500,TRUE)
symptom3 <- sample(symp, 500,TRUE)
Symptoms <- mChoice(symptom1, symptom2, symptom3, label='Primary Symptoms')
table(as.character(Symptoms))
# Note: In this example, some subjects have the same symptom checked
# multiple times; in practice these redundant selections would be NAs
# mChoice will ignore these redundant selections

f <- summaryM(age + sex + sbp + Symptoms ~ treatment, test=TRUE)
f
# trio of numbers represent 25th, 50th, 75th percentile
print(f, long=TRUE)
plot(f)    # first specify options(grType='plotly') to use plotly
plot(f, conType='dot', prtest='P')
bpplt()    # annotated example showing layout of bp plot

# Produce separate tables by country
f <- summaryM(age + sex + sbp + Symptoms ~ treatment + country,
              groups='treatment', test=TRUE)
f

## Not run: 
getHdata(pbc)
s5 <- summaryM(bili + albumin + stage + protime + sex +
               age + spiders ~ drug, data=pbc)

print(s5, npct='both')
# npct='both' : print both numerators and denominators
plot(s5, which='categorical')
Key(locator(1))       # draw legend at mouse click
par(oma=c(3,0,0,0))   # leave outer margin at bottom
plot(s5, which='continuous')   # see also bpplotM
Key2()                # draw legend at lower left corner of plot
                      # oma= above makes this default key fit the page better

options(digits=3)
w <- latex(s5, npct='both', here=TRUE, file='')

options(grType='plotly')
pbc <- upData(pbc, moveUnits = TRUE)
s <- summaryM(bili + albumin + alk.phos + copper + spiders + sex ~
              drug, data=pbc, test=TRUE)
# Render html
options(prType='html')
s   # invokes print.summaryM
a <- plot(s)
a$Categorical
a$Continuous
plot(s, which='con')

## End(Not run)

Multi-way Summary of Proportions
Description
summaryP produces a tall and thin data frame containing numerators (freq) and denominators (denom) after stratifying the data by a series of variables. A special capability to group a series of related yes/no variables is included through the use of the ynbind function, for which the user specifies a final argument label used to label the panel created for that group of related variables.
If options(grType='plotly') is not in effect, the plot method for summaryP displays proportions as a multi-panel dot chart using the lattice package's dotplot function with a special panel function. Numerators and denominators of proportions are also included as text, in the same colors as used by an optional groups variable. The formula argument used in the dotplot call is constructed, but the user can easily reorder the variables by specifying formula, with elements named val (category levels), var (classification variable name), freq (calculated result) plus the overall cross-classification variables excluding groups. If options(grType='plotly') is in effect, the plot method makes an entirely different display using Hmisc::dotchartpl with plotly if marginVal is specified, whereby a stratification variable causes more finely stratified estimates to be shown slightly below the lines, with smaller and translucent symbols if data has been run through addMarginal. The marginal summaries are shown as the main estimates and the user can turn off display of the stratified estimates, or view their details with hover text.
The ggplot method for summaryP does not draw numerators and denominators but the chart is more compact than using the plot method with base graphics because ggplot2 does not repeat category names the same way as lattice does. Variable names that are too long to fit in panel strips are renamed (1), (2), etc. and an attribute "fnvar" is added to the result; this attribute is a character string defining the abbreviations, useful in a figure caption. The ggplot2 object has labels for points plotted, used by plotly::ggplotly as hover text (see example).
The latex method produces one or more LaTeX tabulars containing a table representation of the result, with optional side-by-side display if groups is specified. Multiple tabulars result from the presence of non-group stratification factors.
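To make the "tall and thin" layout concrete, here is a base R sketch that builds one row per stratum and category level, with a numerator and a denominator. It does not call summaryP and does not reproduce its exact columns or attributes; the data and the object names d, tall and prop are invented for illustration.

```r
set.seed(1)
d <- data.frame(
  region = sample(c("Europe", "North America"), 100, TRUE),
  sex    = sample(c("Female", "Male"), 100, TRUE)
)

# Numerators: count of each sex level within each region
tall <- as.data.frame(table(region = d$region, val = d$sex),
                      responseName = "freq", stringsAsFactors = FALSE)

# Denominators: number of observations in each region
denom      <- table(d$region)
tall$denom <- as.vector(denom[tall$region])
tall$var   <- "sex"                      # name of the summarized variable
tall$prop  <- tall$freq / tall$denom     # proportion plotted on the x-axis
print(tall)
```

Keeping freq and denom as separate columns, rather than storing only the proportion, is what lets a display print the counts next to each dot.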
Usage
summaryP(formula, data = NULL, subset = NULL,
         na.action = na.retain, sort=TRUE,
         asna = c("unknown", "unspecified"), ...)

## S3 method for class 'summaryP'
plot(x, formula=NULL, groups=NULL,
     marginVal=NULL, marginLabel=marginVal,
     refgroup=NULL, exclude1=TRUE, xlim = c(-.05, 1.05),
     text.at=NULL, cex.values = 0.5,
     key = list(columns = length(groupslevels), x = 0.75,
                y = -0.04, cex = 0.9,
                col = lattice::trellis.par.get('superpose.symbol')$col,
                corner=c(0,1)),
     outerlabels=TRUE, autoarrange=TRUE,
     col=colorspace::rainbow_hcl, ...)

## S3 method for class 'summaryP'
ggplot(data, mapping, groups=NULL, exclude1=TRUE,
       xlim=c(0, 1), col=NULL, shape=NULL,
       size=function(n) n ^ (1/4),
       sizerange=NULL, abblen=5, autoarrange=TRUE, addlayer=NULL,
       ..., environment)

## S3 method for class 'summaryP'
latex(object, groups=NULL, exclude1=TRUE, file='', round=3,
      size=NULL, append=TRUE, ...)

Arguments
formula | a formula with the variables for whose levels proportions are computed on the left hand side, and major classification variables on the right. The formula needs to include any variable later used as |
data | an optional data frame. For |
subset | an optional subsetting expression or vector |
na.action | function specifying how to handle |
sort | set to |
asna | character vector specifying level names to consider the same as |
x | an object produced by |
groups | a character string containing the name of a superpositioning variable for obtaining further stratification within a horizontal line in the dot chart. |
marginVal | if |
marginLabel | specifies a different character string to use thanthe value of |
refgroup | used when doing a |
exclude1 | By default, |
xlim |
|
text.at | specify to leave unused space to the right of each panel to prevent numerators and denominators from touching data points. |
cex.values | character size to use for plotting numerators and denominators |
key | a list to pass to the |
outerlabels | by default if there are two conditioning variables besides |
autoarrange | If |
col | a vector of colors to use to override defaults in |
shape | a vector of plotting symbols to override |
mapping,environment | not used; needed because of rules for generics |
size | for |
sizerange | a 2-vector specifying the |
abblen | labels of variables having only one level and having their name longer than |
... | used only for |
object | an object produced by |
file | file name, defaults to writing to console |
round | number of digits to the right of the decimal place for proportions |
append | set to |
addlayer | a |
Value
summaryP produces a data frame of class "summaryP". The plot method produces a lattice object of class "trellis". The latex method produces an object of class "latex" with an additional attribute ngrouplevels specifying the number of levels of any groups variable and an attribute nstrata specifying the number of strata.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
See Also
bpplotM, summaryM, ynbind, pBlock, ggplot, colorFacet
Examples
n <- 100
f <- function(na=FALSE) {
  x <- sample(c('N', 'Y'), n, TRUE)
  if(na) x[runif(100) < .1] <- NA
  x
}
set.seed(1)
d <- data.frame(x1=f(), x2=f(), x3=f(), x4=f(), x5=f(), x6=f(), x7=f(TRUE),
                age=rnorm(n, 50, 10),
                race=sample(c('Asian', 'Black/AA', 'White'), n, TRUE),
                sex=sample(c('Female', 'Male'), n, TRUE),
                treat=sample(c('A', 'B'), n, TRUE),
                region=sample(c('North America','Europe'), n, TRUE))
d <- upData(d, labels=c(x1='MI', x2='Stroke', x3='AKI', x4='Migraines',
                        x5='Pregnant', x6='Other event', x7='MD withdrawal',
                        race='Race', sex='Sex'))
dasna <- subset(d, region=='North America')
with(dasna, table(race, treat))
s <- summaryP(race + sex + ynbind(x1, x2, x3, x4, x5, x6, x7, label='Exclusions') ~
              region + treat, data=d)
# add exclude1=FALSE below to include female category
plot(s, groups='treat')
require(ggplot2)
ggplot(s, groups='treat')

plot(s, val ~ freq | region * var, groups='treat', outerlabels=FALSE)
# Much better looking if omit outerlabels=FALSE; see output at
# https://hbiostat.org/R/Hmisc/summaryFuns.pdf
# See more examples under bpplotM

## For plotly interactive graphic that does not handle variable size
## panels well:
## require(plotly)
## g <- ggplot(s, groups='treat')
## ggplotly(g, tooltip='text')

## For nice plotly interactive graphic:
## options(grType='plotly')
## s <- summaryP(race + sex + ynbind(x1, x2, x3, x4, x5, x6, x7,
##                                   label='Exclusions') ~
##               treat, data=subset(d, region=='Europe'))
##
## plot(s, groups='treat', refgroup='A')  # refgroup='A' does B-A differences

# Make a chart where there is a block of variables that
# are only analyzed for males.  Keep redundant sex in block for demo.
# Leave extra space for numerators, denominators
sb <- summaryP(race + sex +
               pBlock(race, sex, label='Race: Males', subset=sex=='Male') ~
               region, data=d)
plot(sb, text.at=1.3)
plot(sb, groups='region', layout=c(1,3), key=list(space='top'),
     text.at=1.15)
ggplot(sb, groups='region')
## Not run: 
plot(s, groups='treat')
# plot(s, groups='treat', outerlabels=FALSE) for standard lattice output
plot(s, groups='region', key=list(columns=2, space='bottom'))
require(ggplot2)
colorFacet(ggplot(s))

plot(summaryP(race + sex ~ region, data=d), exclude1=FALSE, col='green')

require(lattice)
# Make your own plot using data frame created by summaryP
useOuterStrips(dotplot(val ~ freq | region * var, groups=treat, data=s,
        xlim=c(0,1), scales=list(y='free', rot=0), xlab='Fraction',
        panel=function(x, y, subscripts, ...) {
          denom <- s$denom[subscripts]
          x <- x / denom
          panel.dotplot(x=x, y=y, subscripts=subscripts, ...) }))

# Show marginal summary for all regions combined
s <- summaryP(race + sex ~ region, data=addMarginal(d, region))
plot(s, groups='region', key=list(space='top'), layout=c(1,2))

# Show marginal summaries for both race and sex
s <- summaryP(ynbind(x1, x2, x3, x4, label='Exclusions', sort=FALSE) ~
              race + sex, data=addMarginal(d, race, sex))
plot(s, val ~ freq | sex*race)

## End(Not run)

Graphical Summarization of Continuous Variables Against a Response
Description
summaryRc is a continuous version of summary.formula with method='response'. It uses the plsmo function to compute the possibly stratified lowess nonparametric regression estimates, and plots them along with the data density, with selected quantiles of the overall distribution (over strata) of each x shown as arrows on top of the graph. All the x variables must be numeric and continuous or nearly continuous.
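The idea of a nonparametric regression estimate shown together with marginal quantiles can be sketched with base R's lowess. This is not the summaryRc or plsmo code path; the simulated data loosely follow the example on this page, and the smoother settings and plotting choices are assumptions.

```r
set.seed(177)
age <- rnorm(500, 50, 5)
y   <- rbinom(500, 1, plogis(0.1 * (age - 50)))

# Smooth the binary response against age; iter = 0 turns off the
# robustness iterations, which would downweight the 0/1 outcomes
fit <- lowess(age, y, iter = 0)

# Quantiles of the marginal distribution of age
qs <- quantile(age, c(0.05, 0.1, 0.25, 0.5, 0.75, 0.90, 0.95))

plot(fit, type = "l", xlab = "age", ylab = "Proportion with y = 1")
rug(age)                                  # crude stand-in for a data density display
axis(3, at = qs, labels = names(qs), cex.axis = 0.6)  # mark quantiles on top
```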
Usage
summaryRc(formula, data=NULL, subset=NULL,
          na.action=NULL, fun = function(x) x,
          na.rm = TRUE, ylab=NULL, ylim=NULL, xlim=NULL,
          nloc=NULL, datadensity=NULL,
          quant = c(0.05, 0.1, 0.25, 0.5, 0.75, 0.90, 0.95),
          quantloc=c('top','bottom'), cex.quant=.6, srt.quant=0,
          bpplot = c('none', 'top', 'top outside', 'top inside', 'bottom'),
          height.bpplot=0.08,
          trim=NULL, test = FALSE, vnames = c('labels', 'names'), ...)

Arguments
formula | An R formula with additive effects. The |
data | name or number of a data frame. Default is the current frame. |
subset | a logical vector or integer vector of subscripts used to specify the subset of data to use in the analysis. The default is to use all observations in the data frame. |
na.action | function for handling missing data in the input data. The default is a function defined here called |
fun | function for transforming |
na.rm |
|
ylab |
|
ylim |
|
xlim | a list with elements named as the variable names appearing on the |
nloc | location for sample size. Specify |
datadensity | see |
quant | vector of quantiles to use for summarizing the marginal distribution of each |
quantloc | specify |
cex.quant | character size for writing which quantiles are represented. Set to |
srt.quant | angle for text for quantile labels |
bpplot | if not |
height.bpplot | height in inches of the horizontal extended box plot |
trim | The default is to plot from the 10th smallest to the 10th largest |
test | Set to |
vnames | By default, plots are usually labeled with variable labels (see the |
... | arguments passed to |
Value
no value is returned
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
See Also
plsmo, stratify, label, formula, panel.bpplot
Examples
options(digits=3)
set.seed(177)
sex <- factor(sample(c("m","f"), 500, rep=TRUE))
age <- rnorm(500, 50, 5)
bp  <- rnorm(500, 120, 7)
units(age) <- 'Years'; units(bp) <- 'mmHg'
label(bp) <- 'Systolic Blood Pressure'
L <- .5*(sex == 'm') + 0.1 * (age - 50)
y <- rbinom(500, 1, plogis(L))
par(mfrow=c(1,2))
summaryRc(y ~ age + bp)
# For x limits use 1st and 99th percentiles to frame extended box plots
summaryRc(y ~ age + bp, bpplot='top', datadensity=FALSE, trim=.01)
summaryRc(y ~ age + bp + stratify(sex),
          label.curves=list(keys='lines'), nloc=list(x=.1, y=.05))
y2 <- rbinom(500, 1, plogis(L + .5))
Y <- cbind(y, y2)
summaryRc(Y ~ age + bp + stratify(sex),
          label.curves=list(keys='lines'), nloc=list(x=.1, y=.05))

Summarize Multiple Response Variables and Make Multipanel Scatter or Dot Plot
Description
Multiple left-hand formula variables along with right-hand side conditioning variables are reshaped into a "tall and thin" data frame if fun is not specified. The resulting raw data can be plotted with the plot method using user-specified panel functions for lattice graphics, typically to make a scatterplot or loess smooths, or both. The Hmisc panel.plsmo function is handy in this context. Instead, if fun is specified, this function takes individual response variables (which may be matrices, as in Surv objects) and creates one or more summary statistics that will be computed while the resulting data frame is being collapsed to one row per condition. The plot method in this case plots a multi-panel dot chart using the lattice dotplot function if panel is not specified to plot. There is an option to print selected statistics as text on the panels. summaryS pays special attention to Hmisc variable annotations: label, units. When panel is specified in addition to fun, a special x-y plot is made that assumes that the x-axis variable (typically time) is discrete. This is used for example to plot multiple quantile intervals as vertical lines next to the main point. A special panel function mbarclPanel is provided for this purpose.
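A base R sketch of the reshaping described above, for the case where fun is not specified. summaryS itself also carries labels, units and other attributes; the column names yvar and y follow the xyplot call in the Examples, but the construction below is only illustrative and the data are invented.

```r
set.seed(1)
n <- 6
d <- data.frame(sbp   = rnorm(n, 120, 10),
                dbp   = rnorm(n, 80, 10),
                days  = sample(1:30, n, TRUE),
                treat = sample(c("A", "B"), n, TRUE))

yvars <- c("sbp", "dbp")        # left-hand side variables

# Stack each response under the same conditioning variables,
# recording which response each row came from
tall <- do.call(rbind, lapply(yvars, function(v)
  data.frame(d[c("days", "treat")], yvar = v, y = d[[v]])))
print(tall)
```

With the data in this shape, a single paneled plot of y against days, conditioned on yvar, shows every response variable at once.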
The plotp method produces corresponding plotly graphics.
When fun is given and panel is omitted, and the result of fun is a vector of more than one statistic, the first statistic is taken as the main one. Any columns with names not in textonly will figure into the calculation of axis limits. Those in textonly will be printed right under the dot lines in the dot chart. Statistics with names in textplot will figure into limits, be plotted, and printed. pch.stats can be used to specify symbols for statistics after the first column. When fun computes three columns that are plotted, columns two and three are taken as confidence limits for which horizontal "error bars" are drawn. Two levels with different thicknesses are drawn if there are four plotted summary statistics beyond the first.
mbarclPanel is used to draw multiple vertical lines around the main points, such as a series of quantile intervals stratified by x and paneling variables. If mbarclPanel finds a column of an argument yother that is named "se", and if there are exactly two levels to a superpositioning variable, the half-height of the approximate 0.95 confidence interval for the difference between two point estimates is shown, positioned at the midpoint of the two point estimates at an x value. This assumes normality of point estimates, and the standard error of the difference is the square root of the sum of squares of the two standard errors. By positioning the intervals in this fashion, a failure of the two point estimates to touch the half-confidence interval is consistent with rejecting the null hypothesis of no difference at the 0.05 level.
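The arithmetic behind the half-interval can be written out directly. The estimates and standard errors below are made-up numbers, and the code is a worked illustration of one reading of the description above (a segment whose total length equals the half-width of the interval, centered at the midpoint), not the mbarclPanel source.

```r
est <- c(A = 120, B = 126)      # hypothetical point estimates at one x value
se  <- c(A = 2.0, B = 1.5)      # hypothetical standard errors

se_diff <- sqrt(sum(se^2))              # SE of the difference
half    <- qnorm(0.975) * se_diff       # half-width of the 0.95 CI for B - A
mid     <- mean(est)                    # midpoint of the two estimates

# Segment of total length `half`, centered at the midpoint
seg <- c(lower = mid - half / 2, upper = mid + half / 2)

# Both estimates fall outside the segment exactly when |difference| > half,
# i.e. when the 0.95 interval for the difference excludes zero
outside <- all(est < seg["lower"] | est > seg["upper"])
reject  <- abs(est[["B"]] - est[["A"]]) > half
c(seg, outside = outside, reject = reject)
```

Each estimate sits |difference|/2 from the midpoint and the segment reaches half/2 on either side, which is why "not touching" and "rejecting at the 0.05 level" coincide under the normal approximation.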
mbarclpl is the sfun function corresponding to mbarclPanel for plotp, and medvpl is the sfun replacement for medvPanel.
medvPanel takes raw data and plots median y vs. x, along with confidence intervals and half-interval for the difference in medians as with mbarclPanel. Quantile intervals are optional. Very transparent vertical violin plots are added by default. Unlike panel.violin, only half of the violin is plotted, and when there are two superpose groups they are side-by-side in different colors.
For plotp, the function corresponding to medvPanel is medvpl, which draws back-to-back spike histograms, optional Gini mean difference, optional SD, quantiles (thin line version of box plot with 0.05 0.25 0.5 0.75 0.95 quantiles), and half-width confidence interval for differences in medians. For quantiles, the Harrell-Davis estimator is used.
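For readers unfamiliar with the two statistics named above, here is a minimal base R sketch of their textbook definitions. The helper names hd and gmd are hypothetical and are not the functions medvpl calls; within Hmisc the Examples use hdquantile for the Harrell-Davis estimator.

```r
# Harrell-Davis estimate of the p-th quantile: a weighted average of the
# order statistics with weights from a Beta((n+1)p, (n+1)(1-p)) distribution
hd <- function(x, p) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  a <- (n + 1) * p
  b <- (n + 1) * (1 - p)
  w <- pbeta((1:n) / n, a, b) - pbeta((0:(n - 1)) / n, a, b)
  sum(w * x)
}

# Gini's mean difference: mean absolute difference over all distinct pairs
gmd <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (n * (n - 1))
}

hd(1:5, 0.5)      # 3, by symmetry of the weights
gmd(c(1, 2, 3))   # 4/3: pairwise differences are 1, 2, 1
```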
Usage
summaryS(formula, fun = NULL, data = NULL, subset = NULL,
         na.action = na.retain, continuous=10, ...)

## S3 method for class 'summaryS'
plot(x, formula=NULL, groups=NULL, panel=NULL,
     paneldoesgroups=FALSE, datadensity=NULL, ylab='',
     funlabel=NULL, textonly='n', textplot=NULL,
     digits=3, custom=NULL,
     xlim=NULL, ylim=NULL, cex.strip=1, cex.values=0.5, pch.stats=NULL,
     key=list(columns=length(groupslevels),
              x=.75, y=-.04, cex=.9,
              col=lattice::trellis.par.get('superpose.symbol')$col,
              corner=c(0,1)),
     outerlabels=TRUE, autoarrange=TRUE, scat1d.opts=NULL, ...)

## S3 method for class 'summaryS'
plotp(data, formula=NULL, groups=NULL, sfun=NULL,
      fitter=NULL, showpts=! length(fitter), funlabel=NULL,
      digits=5, xlim=NULL, ylim=NULL,
      shareX=TRUE, shareY=FALSE, autoarrange=TRUE, ...)

mbarclPanel(x, y, subscripts, groups=NULL, yother, ...)

medvPanel(x, y, subscripts, groups=NULL, violin=TRUE, quantiles=FALSE, ...)

mbarclpl(x, y, groups=NULL, yother, yvar=NULL, maintracename='y',
         xlim=NULL, ylim=NULL, xname='x', alphaSegments=0.45, ...)

medvpl(x, y, groups=NULL, yvar=NULL, maintracename='y',
       xlim=NULL, ylim=NULL, xlab=xname, ylab=NULL, xname='x',
       zeroline=FALSE, yother=NULL, alphaSegments=0.45,
       dhistboxp.opts=NULL, ...)

Arguments
formula | a formula with possibly multiple left and right-side variables separated by |
fun | an optional summarization function, e.g., |
data | optional input data frame. For |
subset | optional subsetting criteria |
na.action | function for dealing with |
continuous | minimum number of unique values for a numeric variable to have to be considered continuous |
... | ignored for |
x | an object created by |
groups | a character string or factor specifying that one of the conditioning variables is used for superpositioning and not paneling |
panel | optional |
paneldoesgroups | set to |
datadensity | set to |
ylab | optional |
funlabel | optional axis label for when |
textonly | names of statistics to print and not plot. By default, any statistic named |
textplot | names of statistics to print and plot |
digits | used if any statistics are printed as text (including |
custom | a function that customizes formatting of statistics that are printed as text. This is useful for generating plotmath notation. See the example in the tests directory. |
xlim | optional |
ylim | optional |
cex.strip | size of strip labels |
cex.values | size of statistics printed as text |
pch.stats | symbols to use for statistics (not including the one in column one) that are plotted. This is a named vector, with names exactly matching those created by |
key |
|
outerlabels | set to |
autoarrange | set to |
scat1d.opts | a list of options to specify to |
y,subscripts | provided by |
yother | passed to the panel function from the |
violin | controls whether violin plots are included |
quantiles | controls whether quantile intervals are included |
sfun | a function called by |
fitter | a fitting function such as |
showpts | set to |
shareX |
|
shareY |
|
yvar | a character or factor variable used to stratify the analysis into multiple y-variables |
maintracename | a default trace name when it can't be inferred |
xname | x-axis variable name for hover text when it can't be inferred |
xlab | x-axis label when it can't be inferred |
alphaSegments | alpha saturation to draw line segments for |
dhistboxp.opts |
|
zeroline | set to |
Value
a data frame with added attributes for summaryS or a lattice object ready to render for plot
Author(s)
Frank Harrell
See Also
Examples
# See tests directory file summaryS.r for more examples, and summarySp.r
# for plotp examples
require(survival)
n <- 100
set.seed(1)
d <- data.frame(sbp=rnorm(n, 120, 10),
                dbp=rnorm(n, 80, 10),
                age=rnorm(n, 50, 10),
                days=sample(1:n, n, TRUE),
                S1=Surv(2*runif(n)), S2=Surv(runif(n)),
                race=sample(c('Asian', 'Black/AA', 'White'), n, TRUE),
                sex=sample(c('Female', 'Male'), n, TRUE),
                treat=sample(c('A', 'B'), n, TRUE),
                region=sample(c('North America','Europe'), n, TRUE),
                meda=sample(0:1, n, TRUE), medb=sample(0:1, n, TRUE))

d <- upData(d, labels=c(sbp='Systolic BP', dbp='Diastolic BP',
                        race='Race', sex='Sex', treat='Treatment',
                        days='Time Since Randomization',
                        S1='Hospitalization', S2='Re-Operation',
                        meda='Medication A', medb='Medication B'),
            units=c(sbp='mmHg', dbp='mmHg', age='Year', days='Days'))

s <- summaryS(age + sbp + dbp ~ days + region + treat, data=d)
# plot(s)   # 3 pages
plot(s, groups='treat', datadensity=TRUE,
     scat1d.opts=list(lwd=.5, nhistSpike=0))
plot(s, groups='treat', panel=lattice::panel.loess,
     key=list(space='bottom', columns=2),
     datadensity=TRUE, scat1d.opts=list(lwd=.5))

# To make a plotly graph when the stratification variable region is not
# present, run the following (showpts adds raw data points):
# plotp(s, groups='treat', fitter=loess, showpts=TRUE)

# Make your own plot using data frame created by summaryS
# xyplot(y ~ days | yvar * region, groups=treat, data=s,
#        scales=list(y='free', rot=0))

# Use loess to estimate the probability of two different types of events as
# a function of time
s <- summaryS(meda + medb ~ days + treat + region, data=d)
pan <- function(...)
  panel.plsmo(..., type='l', label.curves=max(which.packet()) == 1,
              datadensity=TRUE)
plot(s, groups='treat', panel=pan, paneldoesgroups=TRUE,
     scat1d.opts=list(lwd=.7), cex.strip=.8)

# Repeat using intervals instead of nonparametric smoother
pan <- function(...)  # really need mobs > 96 to est. proportion
  panel.plsmo(..., type='l', label.curves=max(which.packet()) == 1,
              method='intervals', mobs=5)

plot(s, groups='treat', panel=pan, paneldoesgroups=TRUE, xlim=c(0, 150))

# Demonstrate dot charts of summary statistics
s <- summaryS(age + sbp + dbp ~ region + treat, data=d, fun=mean)
plot(s)
plot(s, groups='treat', funlabel=expression(bar(X)))

# Compute parametric confidence limits for mean, and include sample
# sizes by naming a column "n"
f <- function(x) {
  x <- x[! is.na(x)]
  c(smean.cl.normal(x, na.rm=FALSE), n=length(x))
}
s <- summaryS(age + sbp + dbp ~ region + treat, data=d, fun=f)
plot(s, funlabel=expression(bar(X) %+-% t[0.975] %*% s))
plot(s, groups='treat', cex.values=.65,
     key=list(space='bottom', columns=2,
              text=c('Treatment A:','Treatment B:')))

# For discrete time, plot Harrell-Davis quantiles of y variables across
# time using different line characteristics to distinguish quantiles
d <- upData(d, days=round(days / 30) * 30)
g <- function(y) {
  probs <- c(0.05, 0.125, 0.25, 0.375)
  probs <- sort(c(probs, 1 - probs))
  y <- y[! is.na(y)]
  w <- hdquantile(y, probs)
  m <- hdquantile(y, 0.5, se=TRUE)
  se <- as.numeric(attr(m, 'se'))
  c(Median=as.numeric(m), w, se=se, n=length(y))
}
s <- summaryS(sbp + dbp ~ days + region, fun=g, data=d)
plot(s, panel=mbarclPanel)
plot(s, groups='region', panel=mbarclPanel, paneldoesgroups=TRUE)

# For discrete time, plot median y vs x along with CL for difference,
# using Harrell-Davis median estimator and its s.e., and use violin
# plots
s <- summaryS(sbp + dbp ~ days + region, data=d)
plot(s, groups='region', panel=medvPanel, paneldoesgroups=TRUE)

# Proportions and Wilson confidence limits, plus approx. Gaussian
# based half/width confidence limits for difference in probabilities
g <- function(y) {
  y <- y[!is.na(y)]
  n <- length(y)
  p <- mean(y)
  se <- sqrt(p * (1. - p) / n)
  structure(c(binconf(sum(y), n), se=se, n=n),
            names=c('Proportion', 'Lower', 'Upper', 'se', 'n'))
}
s <- summaryS(meda + medb ~ days + region, fun=g, data=d)
plot(s, groups='region', panel=mbarclPanel, paneldoesgroups=TRUE)

Graphic Representation of a Frequency Table
Description
This function can be used to represent contingency tables graphically. Frequency counts are represented as the heights of "thermometers" by default; you can also specify symbol='circle' to the function. There is an option to include marginal frequencies, which are plotted on a halved scale so as to not overwhelm the plot. If you do not ask for marginal frequencies to be plotted using marginals=T, symbol.freq will ask you to click the mouse where a reference symbol is to be drawn to assist in reading the scale of the frequencies.
label attributes, if present, are used for x- and y-axis labels. Otherwise, names of calling arguments are used.
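The thermometer idea can be sketched with base R's symbols function. This is not the symbol.freq implementation: the data are simulated, the scaling of the fill to the largest cell is an assumption, and no marginal frequencies or reference symbol are drawn.

```r
set.seed(2)
a <- factor(sample(c("young", "middle", "old"), 200, TRUE),
            levels = c("young", "middle", "old"))
b <- factor(sample(c("1st", "2nd", "3rd"), 200, TRUE))

cells <- as.data.frame(table(a, b))     # one row per cell: a, b, Freq
fill  <- cells$Freq / max(cells$Freq)   # thermometer fill, largest cell = 1

# Columns of `thermometers`: width, height, filled proportion
symbols(as.integer(cells$a), as.integer(cells$b),
        thermometers = cbind(0.15, 1, fill),
        inches = 0.25, xlab = "a", ylab = "b")
```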
Usage
symbol.freq(x, y, symbol = c("thermometer", "circle"),
            marginals = FALSE, orig.scale = FALSE,
            inches = 0.25, width = 0.15, subset, srtx = 0, ...)

Arguments
x | first variable to cross-classify |
y | second variable |
symbol | specify |
marginals | set to |
orig.scale | set to |
inches | see |
width | see |
subset | the usual subsetting vector |
srtx | rotation angle for x-axis labels |
... | other arguments to pass to |
Author(s)
Frank Harrell
See Also
Examples
## Not run: 
getHdata(titanic)
attach(titanic)
age.tertile <- cut2(titanic$age, g=3)
symbol.freq(age.tertile, pclass, marginals=T, srtx=45)
detach(2)

## End(Not run)

Run Unix or Dos Depending on System
Description
Runs unix or dos depending on the current operating system. For R, just runs system with optional concatenation of first two arguments which are assumed named command and text.
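The concatenation step can be pictured with a small helper. build_cmd is a hypothetical name used only for this sketch and is not the Hmisc source; how the output argument is handled is not shown here.

```r
# Hypothetical helper, not the Hmisc source: build the command line by
# pasting `text` after `command` when `text` is supplied
build_cmd <- function(command, text = NULL)
  if (length(text)) paste(command, text) else command

build_cmd("ls")              # "ls"
build_cmd("ls", "-l /tmp")   # "ls -l /tmp"
# system(build_cmd("ls", "-l /tmp"))   # would run the command
```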
Usage
sys(command, text=NULL, output=TRUE)
# S-Plus: sys(..., minimized=FALSE)

Arguments
command | system command to execute |
text | text to concatenate to system command, if any (typically options or filenames or both) |
output | set to |
Value
see unix or dos
Side Effects
executes system commands
See Also
t-test for Clustered Data
Description
Does a 2-sample t-test for clustered data.
Usage
t.test.cluster(y, cluster, group, conf.int = 0.95)
## S3 method for class 't.test.cluster'
print(x, digits, ...)
Arguments
y | normally distributed response variable to test |
cluster | cluster identifiers, e.g. subject ID |
group | grouping variable with two values |
conf.int | confidence coefficient to use for confidence limits |
x | an object created by |
digits | number of significant digits to print |
... | unused |
Value
a matrix of statistics of class t.test.cluster
Author(s)
Frank Harrell
References
Donner A, Birkett N, Buck C, Am J Epi 114:906-914, 1981.
Donner A, Klar N, J Clin Epi 49:435-439, 1996.
Hsieh FY, Stat in Med 8:1195-1201, 1988.
See Also
Examples
set.seed(1)
y <- rnorm(800)
group <- sample(1:2, 800, TRUE)
cluster <- sample(1:40, 800, TRUE)
table(cluster,group)
t.test(y ~ group)   # R only
t.test.cluster(y, cluster, group)
# Note: negate estimates of differences from t.test to
# compare with t.test.cluster
Interface to Tabular Function
Description
tabulr is a front-end to the tables package's tabular function so that the user can take advantage of variable annotations used by the Hmisc package, particularly those created by the label, units, and upData functions. When a variable appears in a tabular function, the variable x is found in the data argument or in the parent environment, and the labelLatex function is used to create a LaTeX label. By default any units of measurement are right justified in the current LaTeX tabular field using hfill; use nofill to list variables for which units are not right-justified with hfill. Once the label is constructed, the variable name is preceded by Heading("LaTeX label")*x in the formula before it is passed to tabular. nolabel can be used to specify variables for which labels are ignored.
tabulr also replaces trio with table_trio, N with table_N, and freq with table_freq in the formula.
table_trio is a function that takes a numeric vector and computes the three quartiles and optionally the mean and standard deviation, and outputs a LaTeX-formatted character string representing the results. By default, calculated statistics are formatted with 3 digits to the left and 1 digit to the right of the decimal point. Running table_options(left=l, right=r) will use l and r digits instead. Other options that can be given to table_options are prmsd=TRUE to add mean +/- standard deviation to the result, pn=TRUE to add the sample size, bold=TRUE to set the median in bold face, showfreq='all', 'low', 'high' used by the table_freq function, pctdec, specifying the number of places to the right of the decimal point for percentages (default is zero), and npct='both', 'numerator', 'denominator', 'none' used by table_formatpct to control what appears after the percent. Option pnformat may be specified to control the formatting for pn. The default is "(n=..)". Specify pnformat="non" to suppress "n=". pnwhen specifies when to print the number of observations. The default is "always". Specify pnwhen="ifna" to include n only if there are missing values in the vector being processed.
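As a rough illustration of the three-quartile summary described above, the following base R sketch formats the quartiles with 3 digits to the left and 1 to the right of the decimal point. The helper trio_sketch and its plain-text output are hypothetical; the real table_trio produces LaTeX-formatted output whose exact form is not reproduced here.

```r
# Hypothetical helper; the real table_trio output format may differ
trio_sketch <- function(x, left = 3, right = 1, pn = FALSE) {
  x <- x[!is.na(x)]                                    # drop missing values
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)  # the three quartiles
  f <- sprintf(paste0('%', left + right + 1, '.', right, 'f'), q)
  out <- paste(f, collapse = ' ')
  if (pn) out <- paste0(out, ' (n=', length(x), ')')   # optional sample size
  out
}

trio_sketch(c(1:9, NA), pn = TRUE)
```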
tabulr substitutes table_N for N in the formula. This is used to create column headings for the number of observations, without a row label.
table_freq analyzes a character variable to compute, for a single output cell, the percents, numerator, and denominator for each category, or optionally just the maximum or minimum, as specified by table_options(showfreq).
table_formatpct is a function that formats percents depending on settings of options in table_options.
nFm is a function that calls sprintf to format numeric values to have a specific number of digits to the left and to the right of the point.
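The sprintf-based formatting can be sketched as follows. The helper fmt_lr is hypothetical and covers only the left and right arguments; it makes no attempt to mimic the neg, pad, or html options of nFm.

```r
# Hypothetical helper, not the Hmisc source
fmt_lr <- function(x, left, right) {
  # total field width: digits on both sides plus one for the decimal point
  width <- if (right == 0) left else left + right + 1
  sprintf(paste0('%', width, '.', right, 'f'), x)
}

fmt_lr(c(3.14159, 120.26), left = 3, right = 1)
```

Padding to a fixed width is what lets numbers line up on the decimal point within a table column.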
table_latexdefs writes (by default) to the console a set of LaTeX definitions that can be invoked at any point thereafter in a knitr or sweave document by naming the macro, preceded by a single backslash. The blfootnote macro is called with a single LaTeX argument which will appear as a footnote without a number. keytrio invokes blfootnote to define the output of table_trio if mean and SD are not included. If mean and SD are included, use keytriomsd.
Usage
tabulr(formula, data = NULL, nolabel=NULL, nofill=NULL, ...)
table_trio(x)
table_freq(x)
table_formatpct(num, den)
nFm(x, left, right, neg=FALSE, pad=FALSE, html=FALSE)
table_latexdefs(file='')
Arguments
formula | a formula suitable for |
data | a data frame or list. If omitted, the parent environment is assumed to contain the variables. |
nolabel | a formula such as |
nofill | a formula such as |
... | other arguments to |
x | a numeric vector |
num | a single numerator or vector of numerators |
den | a single denominator |
left,right | number of places to the left and right of the decimal point, respectively |
neg | set to |
pad | set to |
html | set to |
file | location of output of |
Value
tabulr returns an object of class "tabular"
Author(s)
Frank Harrell
See Also
Examples
## Not run:
n <- 400
set.seed(1)
d <- data.frame(country=factor(sample(c('US','Canada','Mexico'), n, TRUE)),
                sex=factor(sample(c('Female','Male'), n, TRUE)),
                age=rnorm(n, 50, 10),
                sbp=rnorm(n, 120, 8))
d <- upData(d,
            preghx=ifelse(sex=='Female', sample(c('No','Yes'), n, TRUE), NA),
            labels=c(sbp='Systolic BP', age='Age', preghx='Pregnancy History'),
            units=c(sbp='mmHg', age='years'))
contents(d)
require(tables)
invisible(booktabs())  # use booktabs LaTeX style for tabular
g <- function(x) {
  x <- x[!is.na(x)]
  if(length(x) == 0) return('')
  # backslashes are doubled so that R emits literal LaTeX commands
  paste(latexNumeric(nFm(mean(x), 3, 1)),
        ' \\hfill{\\smaller[2](', length(x), ')}', sep='')
}
tab <- tabulr((age + Heading('Females')*(sex == 'Female')*sbp)*
              Heading()*g + (age + sbp)*Heading()*trio ~
              Heading()*country*Heading()*sex, data=d)
# Formula after interpretation by tabulr:
# (Heading('Age\hfill {\smaller[2] years}') * age + Heading("Females")
# * (sex == "Female") * Heading('Systolic BP {\smaller[2] mmHg}') * sbp)
# * Heading() * g + (age + sbp) * Heading() * table_trio ~ Heading()
# * country * Heading() * sex
cat('\\begin{landscape}\n')
cat('\\begin{minipage}{\\textwidth}\n')
cat('\\keytrio\n')
latex(tab)
cat('\\end{minipage}\\end{landscape}\n')

getHdata(pbc)
pbc <- upData(pbc, moveUnits=TRUE)
# Convert to character to prevent tabular from stratifying
for(x in c('sex', 'stage', 'spiders')) {
  pbc[[x]] <- as.character(pbc[[x]])
  label(pbc[[x]]) <- paste(toupper(substring(x, 1, 1)),
                           substring(x, 2), sep='')
}
table_options(pn=TRUE, showfreq='all')
tab <- tabulr((bili + albumin + protime + age) *
              Heading()*trio +
              (sex + stage + spiders)*Heading()*freq ~ drug, data=pbc)
latex(tab)
## End(Not run)
testCharDateTime
Description
Test Character Variables for Dates and Times
Usage
testCharDateTime(x, p = 0.5, m = 0, convert = FALSE, existing = FALSE)
Arguments
x | input vector of any type, but interesting cases are for character |
p | minimum proportion of non-missing non-blank values of |
m | if greater than 0, a test is applied: the number of distinct illegal values of |
convert | set to |
existing | set to |
Details
For a vector x, if it is already a date-time, date, or time variable, the type is returned if convert=FALSE, or a list with that type, the original vector, and numna=0 is returned. Otherwise if x is not a character vector, a type of notcharacter is returned, or a list that includes the original x and type='notcharacter'. When x is character, the main logic is applied. The default logic (when m=0) is to consider x a date-time variable when its format is YYYY-MM-DD HH:MM:SS (:SS is optional) in more than 1/2 of the non-missing observations. It is considered to be a date if its format is YYYY-MM-DD or MM/DD/YYYY or DD-MMM-YYYY in more than 1/2 of the non-missing observations (MMM=3-letter month). A time variable has the format HH:MM:SS or HH:MM. Blank values of x (after trimming) are set to NA before proceeding.
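The default majority rule described above can be sketched in base R with regular expressions. The helper guess_type and its patterns are assumptions made for illustration; they are not the patterns used by testCharDateTime, and the sketch ignores m, convert, and existing.

```r
# Hypothetical helper, not Hmisc code: classify a character vector by the
# proportion of non-missing, non-blank values matching each pattern
guess_type <- function(x, p = 0.5) {
  x <- trimws(x)
  x[!is.na(x) & x == ''] <- NA          # blanks become NA
  x <- x[!is.na(x)]
  if (length(x) == 0) return('character')
  pats <- c(
    datetime = '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}(:[0-9]{2})?$',
    date     = paste0('^([0-9]{4}-[0-9]{2}-[0-9]{2}',
                      '|[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}',
                      '|[0-9]{1,2}-[A-Za-z]{3}-[0-9]{4})$'),
    time     = '^[0-9]{2}:[0-9]{2}(:[0-9]{2})?$')
  prop <- sapply(pats, function(pat) mean(grepl(pat, x)))
  hit  <- names(prop)[prop > p]         # formats seen in more than p of values
  if (length(hit)) hit[1] else 'character'
}

guess_type(c('2023-03-11', '2023-04-11', '2023-05-11', 'a'))
guess_type(c('a', 'b', '2023-03-11'))
```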
Value
if convert=FALSE, a single character string with the type of x: "character", "datetime", "date", "time". If convert=TRUE, a list with components named type, x (converted to POSIXct, Date, or chron times format), and numna, the number of originally non-NA values of x that could not be converted to the predominant format. If there were any non-convertible dates/times, the returned vector is given an additional class special.miss and an attribute special.miss which is a list with original character values (codes) and observation numbers (obs). These are summarized by describe().
Author(s)
Frank Harrell
Examples
for(conv in c(FALSE, TRUE)) {
  print(testCharDateTime(c('2023-03-11', '2023-04-11', 'a', 'b', 'c'), convert=conv))
  print(testCharDateTime(c('2023-03-11', '2023-04-11', 'a', 'b'), convert=conv))
  print(testCharDateTime(c('2023-03-11 11:12:13', '2023-04-11 11:13:14', 'a', 'b'), convert=conv))
  print(testCharDateTime(c('2023-03-11 11:12', '2023-04-11 11:13', 'a', 'b'), convert=conv))
  print(testCharDateTime(c('3/11/2023', '4/11/2023', 'a', 'b'), convert=conv))
}
x <- c(paste0('2023-03-0', 1:9), 'a', 'a', 'a', 'b')
y <- testCharDateTime(x, convert=TRUE)$x
describe(y)  # note counts of special missing values a, b
function for use in graphs that are used with the psfrag package in LaTeX
Description
tex is a little function to save typing when including TeX commands in graphs that are used with the psfrag package in LaTeX to typeset any LaTeX text inside a postscript graphic. tex surrounds the input character string with ‘\tex[options]{}’. This is especially useful for getting Greek letters and math symbols in postscript graphs. By default tex returns a string with psfrag commands specifying that the string be centered, not rotated, and not specially enlarged or shrunk.
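A minimal sketch of the kind of string wrapping described above follows. The helper tex_sketch is a hypothetical stand-in; the order of the bracketed options (LaTeX reference point, PostScript reference point, scale, rotation) is an assumption based on the psfrag \tex syntax and may not match tex exactly.

```r
# Hypothetical stand-in for tex(); the bracket order is an assumption
tex_sketch <- function(string, lref = 'c', psref = 'c', scale = 1, srt = 0)
  paste0('\\tex[', lref, '][', psref, '][', format(scale), '][',
         format(srt), ']{', string, '}')

cat(tex_sketch('$x$'), '\n')
```

Note the doubled backslash in the R source, which yields a single literal backslash in the resulting string.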
Usage
tex(string, lref='c', psref='c', scale=1, srt=0)
Arguments
string | a character string to be processed by |
lref | LaTeX reference point for |
psref | PostScript reference point. |
scale | scale factor, default is 1 |
srt | rotation for |
Value
tex returns a modified character string.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
References
Grant MC, Carlisle (1998): The PSfrag System, Version 3. Full documentation is obtained by searching www.ctan.org for ‘pfgguide.ps’.
See Also
postscript, par, ps.options, mgp.axis.labels, pdf, trellis.device, setTrellis
Examples
## Not run:
pdf('test.pdf')
x <- seq(0,15,length=100)
plot(x, dchisq(x, 5), xlab=tex('$x$'), ylab=tex('$f(x)$'), type='l')
# the backslash is doubled so that R passes a literal \chi to LaTeX
title(tex('Density Function of the $\\chi_{5}^{2}$ Distribution'))
dev.off()
# To process this file in LaTeX do something like
#\documentclass{article}
#\usepackage[scanall]{psfrag}
#\begin{document}
#\begin{figure}
#\includegraphics{test.ps}
#\caption{This is an example}
#\end{figure}
#\end{document}
## End(Not run)
Additive Regression and Transformations using ace or avas
Description
transace is ace packaged for easily automatically transforming all variables in a formula without a left-hand side. transace is a fast one-iteration version of transcan without imputation of NAs. The ggplot method makes nice transformation plots using ggplot2. Binary variables are automatically kept linear, and character or factor variables are automatically treated as categorical.
areg.boot uses areg or avas to fit additive regression models allowing all variables in the model (including the left-hand-side) to be transformed, with transformations chosen so as to optimize certain criteria. The default method uses areg whose goal it is to maximize R^2. method="avas" explicitly tries to transform the response variable so as to stabilize the variance of the residuals. All-variables-transformed models tend to inflate R^2 and it can be difficult to get confidence limits for each transformation. areg.boot solves both of these problems using the bootstrap. As with the validate function in the rms library, the Efron bootstrap is used to estimate the optimism in the apparent R^2, and this optimism is subtracted from the apparent R^2 to obtain a bias-corrected R^2. This is done however on the transformed response variable scale.
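The optimism correction can be illustrated with an ordinary linear model in base R. This is a sketch of the general Efron optimism bootstrap only: it substitutes lm for areg, transforms nothing, and uses simulated data, so it shows the bookkeeping rather than what areg.boot computes.

```r
set.seed(1)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))   # hypothetical data
d$y <- d$x1 + rnorm(n)

# R^2 of predictions yhat for observed y
rsq <- function(y, yhat) 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)

fit      <- lm(y ~ x1 + x2, data = d)
apparent <- rsq(d$y, fitted(fit))

B <- 50
optimism <- replicate(B, {
  b    <- d[sample(n, replace = TRUE), ]   # bootstrap sample of rows
  fitb <- lm(y ~ x1 + x2, data = b)
  rsq(b$y, fitted(fitb)) -                 # R^2 in the bootstrap sample
    rsq(d$y, predict(fitb, newdata = d))   # same fit applied to original data
})
corrected <- apparent - mean(optimism)     # bias-corrected R^2
c(apparent = apparent, corrected = corrected)
```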
Tests with 3 predictors show that the avas and ace estimates are unstable unless the sample size exceeds 350. Apparent R^2 with low sample sizes can be very inflated, and bootstrap estimates of R^2 can be even more unstable in such cases, resulting in optimism-corrected R^2 that are much lower even than the actual R^2. The situation can be improved a little by restricting predictor transformations to be monotonic. On the other hand, the areg approach allows one to control overfitting by specifying the number of knots to use for each continuous variable in a restricted cubic spline function.
For method="avas" the response transformation is restricted to be monotonic. You can specify restrictions for transformations of predictors (and linearity for the response). When the first argument is a formula, the function automatically determines which variables are categorical (i.e., factor, category, or character vectors). Specify linear transformations by enclosing variables by the identity function (I()), and specify monotonicity by using monotone(variable). Monotonicity restrictions are not allowed with method="areg".
The summary method for areg.boot computes bootstrap estimates of standard errors of differences in predicted responses (usually on the original scale) for selected levels of each predictor against the lowest level of the predictor. The smearing estimator (see below) can be used here to estimate differences in predicted means, medians, or many other statistics. By default, quartiles are used for continuous predictors and all levels are used for categorical ones. See Details below. There is also a plot method for plotting transformation estimates, transformations for individual bootstrap re-samples, and pointwise confidence limits for transformations. Unless you already have a par(mfrow=) in effect with more than one row or column, plot will try to fit the plots on one page. A predict method computes predicted values on the original or transformed response scale, or a matrix of transformed predictors. There is a Function method for producing a list of R functions that perform the final fitted transformations. There is also a print method for areg.boot objects.
When estimated means (or medians or other statistical parameters) are requested for models fitted with areg.boot (by summary.areg.boot or predict.areg.boot), the “smearing” estimator of Duan (1983) is used. Here we estimate the mean of the untransformed response by computing the arithmetic mean of ginverse(lp + residuals), where ginverse is the inverse of the nonparametric transformation of the response (obtained by reverse linear interpolation), lp is the linear predictor for an individual observation on the transformed scale, and residuals is the entire vector of residuals estimated from the fitted model, on the transformed scales (n residuals for n original observations). The smearingEst function computes the general smearing estimate. For efficiency smearingEst recognizes that quantiles are transformation-preserving, i.e., when one wishes to estimate a quantile of the untransformed distribution one just needs to compute the inverse transformation of the transformed estimate after the chosen quantile of the vector of residuals is added to it. When the median is desired, the estimate is ginverse(lp + median(residuals)). See the last example for how smearingEst can be used outside of areg.boot.
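The two formulas above, mean of ginverse(lp + residuals) and ginverse(lp + median(residuals)), can be sketched directly in base R. The helper smear is hypothetical and assumes a known, monotone inverse transformation (exp here) in place of the interpolated nonparametric inverse that areg.boot uses.

```r
# Hypothetical helper, not the Hmisc implementation
smear <- function(lp, ginverse, res, statistic = c('mean', 'median')) {
  statistic <- match.arg(statistic)
  sapply(lp, function(l)
    switch(statistic,
           mean   = mean(ginverse(l + res)),      # average over all residuals
           median = ginverse(l + median(res))))   # valid for monotone ginverse
}

res <- c(-1, 0, 1)   # residuals on the transformed (log) scale
lp  <- log(1:5)      # linear predictors on the transformed scale
smear(lp, exp, res, 'mean')
smear(lp, exp, res, 'median')
```

Because the median residual is zero here, the median estimates are simply exp(lp), while the mean estimates are larger, reflecting the skewness introduced by back-transformation.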
Mean is a generic function that returns an R function to compute the estimate of the mean of a variable. Its input is typically some kind of model fit object. Likewise, Quantile is a generic quantile function-producing function. Mean.areg.boot and Quantile.areg.boot create functions of a vector of linear predictors that transform them into the smearing estimates of the mean or quantile of the response variable, respectively. Quantile.areg.boot produces exactly the same value as predict.areg.boot or smearingEst. Mean approximates the mapping of linear predictors to means over an evenly spaced grid of by default 200 points. Linear interpolation is used between these points. This approximate method is much faster than the full smearing estimator once Mean creates the function. These functions are especially useful in nomogram (see the example on hypothetical data).
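The grid-plus-interpolation idea can be sketched with approxfun. The residuals, the grid range of -3 to 3, and the use of exp as the inverse transformation are all assumptions chosen for illustration; Mean.areg.boot derives these from the fitted object.

```r
res  <- c(-1, 0, 1)                      # hypothetical residuals
grid <- seq(-3, 3, length.out = 200)     # evenly spaced linear predictors

# Full smearing estimate of the mean at each grid point
smear_mean <- sapply(grid, function(l) mean(exp(l + res)))

# Function of lp that interpolates linearly between grid points
mean_fun <- approxfun(grid, smear_mean)

mean_fun(0.5)           # fast approximation
mean(exp(0.5 + res))    # full smearing estimate, for comparison
```

Once mean_fun exists, each evaluation is a table look-up rather than a pass over all residuals, which is the source of the speed gain.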
Usage
transace(formula, trim=0.01, data=environment(formula))
## S3 method for class 'transace'
print(x, ...)
## S3 method for class 'transace'
ggplot(data, mapping, ..., environment, nrow=NULL)
areg.boot(x, data, weights, subset, na.action=na.delete,
          B=100, method=c("areg","avas"), nk=4, evaluation=100, valrsq=TRUE,
          probs=c(.25,.5,.75), tolerance=NULL)
## S3 method for class 'areg.boot'
print(x, ...)
## S3 method for class 'areg.boot'
plot(x, ylim, boot=TRUE, col.boot=2, lwd.boot=.15, conf.int=.95, ...)
smearingEst(transEst, inverseTrans, res,
            statistic=c('median','quantile','mean','fitted','lp'), q)
## S3 method for class 'areg.boot'
summary(object, conf.int=.95, values, adj.to,
        statistic='median', q, ...)
## S3 method for class 'summary.areg.boot'
print(x, ...)
## S3 method for class 'areg.boot'
predict(object, newdata,
        statistic=c("lp", "median", "quantile", "mean", "fitted", "terms"),
        q=NULL, ...)
## S3 method for class 'areg.boot'
Function(object, type=c('list','individual'),
         ytype=c('transformed','inverse'),
         prefix='.', suffix='', pos=-1, ...)
Mean(object, ...)
Quantile(object, ...)
## S3 method for class 'areg.boot'
Mean(object, evaluation=200, ...)
## S3 method for class 'areg.boot'
Quantile(object, q=.5, ...)
Arguments
formula | a formula without a left-hand-side variable. Variables may be enclosed in |
x | for |
object | an object created by |
transEst | a vector of transformed values. In log-normal regression these could be predicted log(Y) for example. |
inverseTrans | a function specifying the inverse transformation needed to change |
trim | quantile to which to trim original and transformed values for continuous variables for purposes of plotting the transformations with |
nrow | the number of rows to graph for |
data | data frame to use if |
environment,mapping | ignored |
weights | a numeric vector of observation weights. By default, all observations are weighted equally. |
subset | an expression to subset data if |
na.action | a function specifying how to handle |
B | number of bootstrap samples (default=100) |
method |
|
nk | number of knots for continuous variables not restricted to be linear. Default is 4. One or two is not allowed. |
evaluation | number of equally-spaced points at which to evaluate (and save) the nonparametric transformations derived by |
valrsq | set to |
probs | vector of probabilities denoting the quantiles of continuous predictors to use in estimating effects of those predictors |
tolerance | singularity criterion; list source code for the |
res | a vector of residuals from the transformed model. Not required when |
statistic | statistic to estimate with the smearing estimator. For |
q | a single quantile of the original response scale to estimate, when |
ylim | 2-vector of y-axis limits |
boot | set to |
col.boot | color for bootstrapped transformations |
lwd.boot | line width for bootstrapped transformations |
conf.int | confidence level (0-1) for pointwise bootstrap confidence limits and for estimated effects of predictors in |
values | a list of vectors of settings of the predictors, for predictors for which you want to override settings determined from |
adj.to | a named vector of adjustment constants, for setting all other predictors when examining the effect of a single predictor in |
newdata | a data frame or list containing the same number of values of all of the predictors used in the fit. For |
type | specifies how |
ytype | By default the first function created by |
prefix | character string defining the prefix for function names created when |
suffix | character string defining the suffix for the function names |
pos | See |
... | arguments passed to other functions. Ignored for |
Details
As transace only does one iteration over the predictors, it may not find optimal transformations and it will be dependent on the order of the predictors in x.
ace and avas standardize transformed variables to have mean zero and variance one for each bootstrap sample, so if a predictor is not important it will still consistently have a positive regression coefficient. Therefore using the bootstrap to estimate standard errors of the additive least squares regression coefficients would not help in drawing inferences about the importance of the predictors. To do this, summary.areg.boot computes estimates of, e.g., the inter-quartile range effects of predictors in predicting the response variable (after untransforming it). As an example, at each bootstrap repetition the estimated transformed value of one of the predictors is computed at the lower quartile, median, and upper quartile of the raw value of the predictor. These transformed x values are then multiplied by the least squares estimate of the partial regression coefficient for that transformed predictor in predicting transformed y. Then these weighted transformed x values have the weighted transformed x value corresponding to the lower quartile subtracted from them, to estimate an x effect accounting for nonlinearity. The last difference computed is then the standardized effect of raising x from its lowest to its highest quartile. Before computing differences, predicted values are back-transformed to be on the original y scale in a way depending on statistic and q. The sample standard deviation of these effects (differences) is taken over the bootstrap samples, and this is used to compute approximate confidence intervals for effects and approximate P-values, both assuming normality.
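The final step, turning the bootstrap standard deviation of an effect into a normal-theory confidence interval and P-value, can be sketched as follows. The point estimate and the bootstrap replicates are simulated placeholders, not output from summary.areg.boot.

```r
set.seed(2)
effect      <- 1.8                                # hypothetical point estimate
boot_effect <- rnorm(100, mean = 1.8, sd = 0.6)   # hypothetical bootstrap replicates

conf.int <- 0.95
se   <- sd(boot_effect)                  # bootstrap standard error of the effect
z    <- qnorm(1 - (1 - conf.int) / 2)    # normal critical value
ci   <- effect + c(-1, 1) * z * se       # approximate confidence interval
pval <- 2 * pnorm(-abs(effect / se))     # two-sided P-value, assuming normality

c(lower = ci[1], upper = ci[2], P = pval)
```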
predict does not re-insert NAs corresponding to observations that were dropped before the fit, when newdata is omitted.
statistic="fitted" estimates the same quantity as statistic="median" if the residuals on the transformed response have a symmetric distribution. The two provide identical estimates when the sample median of the residuals is exactly zero. The sample mean of the residuals is constrained to be exactly zero although this does not simplify anything.
Value
transace returns a list of class transace containing these elements: n (number of non-missing observations used), transformed (a matrix containing transformed values), rsq (vector of R^2 with which each variable can be predicted from the others), omitted (row numbers of data that were deleted due to NAs), trantab (compact transformation lookups), levels (original levels of character and factor variables if the input was a data frame), trim (value of trim passed to transace), limits (the limits for plotting raw and transformed variables, computed from trim), and type (a vector of transformation types used for the variables).
areg.boot returns a list of class ‘areg.boot’ containing many elements, including (if valrsq is TRUE) rsquare.app and rsquare.val. summary.areg.boot returns a list of class ‘summary.areg.boot’ containing a matrix of results for each predictor and a vector of adjust-to settings. It also contains the call and a ‘label’ for the statistic that was computed. A print method for these objects handles the printing. predict.areg.boot returns a vector unless statistic="terms", in which case it returns a matrix. Function.areg.boot returns by default a list of functions whose argument is one of the variables (on the original scale) and whose returned values are the corresponding transformed values. The names of the list of functions correspond to the names of the original variables. When type="individual", Function.areg.boot invisibly returns the vector of names of the created function objects. Mean.areg.boot and Quantile.areg.boot also return functions.
smearingEst returns a vector of estimates of distribution parameters of class ‘labelled’ so that print.labelled will print a label documenting the estimate that was used (see label). This label can be retrieved for other purposes by using e.g. label(obj), where obj was the vector returned by smearingEst.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com
References
Harrell FE, Lee KL, Mark DB (1996): Stat in Med 15:361–387.
Duan N (1983): Smearing estimate: A nonparametric retransformation method. JASA 78:605–610.
Wang N, Ruppert D (1995): Nonparametric estimation of the transformation in the transform-both-sides regression model. JASA 90:522–534.
See avas, ace for primary references.
See Also
avas, ace, ols, validate, predab.resample, label, nomogram
Examples
# xtrans <- transace(~ monotone(age) + sex + blood.pressure + categorical(race.code))
# print(xtrans)   # show R^2s and a few other things
# ggplot(xtrans)  # show transformations

# Generate random data from the model y = exp(x1 + epsilon/3) where
# x1 and epsilon are Gaussian(0,1)
set.seed(171)  # to be able to reproduce example
x1 <- rnorm(200)
x2 <- runif(200)  # a variable that is really unrelated to y
x3 <- factor(sample(c('cat','dog','cow'), 200,TRUE))  # also unrelated to y
y <- exp(x1 + rnorm(200)/3)
f <- areg.boot(y ~ x1 + x2 + x3, B=40)
f
plot(f)
# Note that the fitted transformation of y is very nearly log(y)
# (the appropriate one), the transformation of x1 is nearly linear,
# and the transformations of x2 and x3 are essentially flat
# (specifying monotone(x2) if method='avas' would have resulted
# in a smaller confidence band for x2)
summary(f)
# use summary(f, values=list(x2=c(.2,.5,.8))) for example if you
# want to use nice round values for judging effects

# Plot Y hat vs. Y (this doesn't work if there were NAs)
plot(fitted(f), y)  # or: plot(predict(f,statistic='fitted'), y)

# Show fit of model by varying x1 on the x-axis and creating separate
# panels for x2 and x3. For x2 using only a few discrete values
newdat <- expand.grid(x1=seq(-2,2,length=100),x2=c(.25,.75),
                      x3=c('cat','dog','cow'))
yhat <- predict(f, newdat, statistic='fitted')
# statistic='mean' to get estimated mean rather than simple inverse trans.
xYplot(yhat ~ x1 | x2, groups=x3, type='l', data=newdat)

## Not run:
# Another example, on hypothetical data
f <- areg.boot(response ~ I(age) + monotone(blood.pressure) + race)
# use I(response) to not transform the response variable
plot(f, conf.int=.9)
# Check distribution of residuals
plot(fitted(f), resid(f))
qqnorm(resid(f))
# Refit this model using ols so that we can draw a nomogram of it.
# The nomogram will show the linear predictor, median, mean.
# The last two are smearing estimators.
Function(f, type='individual')  # create transformation functions
f.ols <- ols(.response(response) ~ age +
             .blood.pressure(blood.pressure) + .race(race))
# Note: This model is almost exactly the same as f but there
# will be very small differences due to interpolation of
# transformations
meanr <- Mean(f)      # create function of lp computing mean response
medr  <- Quantile(f)  # default quantile is .5
nomogram(f.ols, fun=list(Mean=meanr,Median=medr))

# Create S functions that will do the transformations
# This is a table look-up with linear interpolation
g <- Function(f)
plot(blood.pressure, g$blood.pressure(blood.pressure))
# produces the central curve in the last plot done by plot(f)
## End(Not run)

# Another simulated example, where y has a log-normal distribution
# with mean x and variance 1. Untransformed y thus has median
# exp(x) and mean exp(x + .5sigma^2) = exp(x + .5)
# First generate data from the model y = exp(x + epsilon),
# epsilon ~ Gaussian(0, 1)
set.seed(139)
n <- 1000
x <- rnorm(n)
y <- exp(x + rnorm(n))
f <- areg.boot(y ~ x, B=20)
plot(f)  # note log shape for y, linear for x. Good!
xs <- c(-2, 0, 2)
d <- data.frame(x=xs)
predict(f, d, 'fitted')
predict(f, d, 'median')  # almost same; median residual=-.001
exp(xs)                  # population medians
predict(f, d, 'mean')
exp(xs + .5)             # population means

# Show how smearingEst works
res <- c(-1,0,1)  # define residuals
y <- 1:5
ytrans <- log(y)
ys <- seq(.1,15,length=50)
trans.approx <- list(x=log(ys), y=ys)
options(digits=4)
smearingEst(ytrans, exp, res, 'fitted')           # ignores res
smearingEst(ytrans, trans.approx, res, 'fitted')  # ignores res
smearingEst(ytrans, exp, res, 'median')           # median res=0
smearingEst(ytrans, exp, res+.1, 'median')        # median res=.1
smearingEst(ytrans, trans.approx, res, 'median')
smearingEst(ytrans, exp, res, 'mean')
mean(exp(ytrans[2] + res))  # should equal 2nd # above
smearingEst(ytrans, trans.approx, res, 'mean')
smearingEst(ytrans, trans.approx, res, mean)
# Last argument can be any statistical function operating
# on a vector that returns a single value
Transformations/Imputations using Canonical Variates
Description
transcan is a nonlinear additive transformation and imputation function, and there are several functions for using and operating on its results. transcan automatically transforms continuous and categorical variables to have maximum correlation with the best linear combination of the other variables. There is also an option to use a substitute criterion - maximum correlation with the first principal component of the other variables. Continuous variables are expanded as restricted cubic splines and categorical variables are expanded as contrasts (e.g., dummy variables). By default, the first canonical variate is used to find optimum linear combinations of component columns. This function is similar to ace except that transformations for continuous variables are fitted using restricted cubic splines, monotonicity restrictions are not allowed, and NAs are allowed. When a variable has any NAs, transformed scores for that variable are imputed using least squares multiple regression incorporating optimum transformations, or NAs are optionally set to constants. Shrinkage can be used to safeguard against overfitting when imputing. Optionally, imputed values on the original scale are also computed and returned. For this purpose, recursive partitioning or multinomial logistic models can optionally be used to impute categorical variables, using what is predicted to be the most probable category.
By default, transcan imputes NAs with “best guess” expected values of transformed variables, back transformed to the original scale. Values thus imputed are most like conditional medians assuming the transformations make variables' distributions symmetric (imputed values are similar to conditional modes for categorical variables). By instead specifying n.impute, transcan does approximate multiple imputation from the distribution of each variable conditional on all other variables. This is done by sampling n.impute residuals from the transformed variable, with replacement (a la bootstrapping), or by default, using Rubin's approximate Bayesian bootstrap, where a sample of size n with replacement is selected from the residuals on n non-missing values of the target variable, and then a sample of size m with replacement is chosen from this sample, where m is the number of missing values needing imputation for the current multiple imputation repetition. Neither of these bootstrap procedures assumes normality or even symmetry of residuals. For sometimes-missing categorical variables, optimal scores are computed by adding the “best guess” predicted mean score to random residuals off this score. Then categories having scores closest to these predicted scores are taken as the random multiple imputations (impcat = "rpart" is not currently allowed with n.impute). The literature recommends using n.impute = 5 or greater. transcan provides only an approximation to multiple imputation, especially since it “freezes” the imputation model before drawing the multiple imputations rather than using different estimates of regression coefficients for each imputation. For multiple imputation, the aregImpute function provides a much better approximation to the full Bayesian approach while still not requiring linearity assumptions.
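The two-stage draw of the approximate Bayesian bootstrap can be sketched in a few lines of base R. The residuals and the values of n and m are simulated placeholders; the sketch covers only the sampling of residuals, not how transcan adds them to predicted scores.

```r
set.seed(3)
res <- rnorm(20)   # hypothetical residuals from n = 20 non-missing values
m   <- 4           # hypothetical number of missing values to impute

n     <- length(res)
donor <- sample(res, n, replace = TRUE)     # stage 1: n residuals drawn from the n observed
draw  <- sample(donor, m, replace = TRUE)   # stage 2: m residuals drawn from that sample
draw
```

The extra resampling stage adds variability across imputations compared with drawing m residuals directly from the observed ones.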
When you specify n.impute to transcan you can use fit.mult.impute to re-fit any model n.impute times based on n.impute completed datasets (if there are any sometimes missing variables not specified to transcan, some observations will still be dropped from these fits). After fitting n.impute models, fit.mult.impute will return the fit object from the last imputation, with coefficients replaced by the average of the n.impute coefficient vectors and with a component var equal to the imputation-corrected variance-covariance matrix using Rubin's rule. fit.mult.impute can also use the object created by the mice function in the mice library to draw the multiple imputations, as well as objects created by aregImpute. The following components of fit objects are also replaced with averages over the n.impute model fits: linear.predictors, fitted.values, stats, means, icoef, scale, center, y.imputed.
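Rubin's rule for pooling coefficients and variances can be sketched with made-up numbers. The coefficient vectors and variance matrices below are hypothetical, and the sketch shows the textbook form of the rule (average within-imputation variance plus inflated between-imputation variance); details of how fit.mult.impute applies it, for example with robust=TRUE, are not covered.

```r
# Hypothetical results from M = 3 completed-data fits with 2 coefficients each
coefs <- rbind(c(1.0, 2.0),
               c(1.2, 1.8),
               c(0.8, 2.2))
vars  <- list(diag(c(0.10, 0.20)),
              diag(c(0.12, 0.18)),
              diag(c(0.08, 0.22)))

M    <- nrow(coefs)
beta <- colMeans(coefs)            # pooled coefficients
W    <- Reduce(`+`, vars) / M      # average within-imputation variance
B    <- cov(coefs)                 # between-imputation variance
V    <- W + (1 + 1 / M) * B        # Rubin's rule total variance

beta
V
```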
By specifying fun to fit.mult.impute you can run any function on the fit objects from completed datasets, with the results saved in an element named funresults. This facilitates running bootstrap or cross-validation separately on each completed dataset and storing all these results in a list for later processing, e.g., with the rms package processMI function. Note that for rms-type validation you will need to specify fitargs=list(x=TRUE,y=TRUE) to fit.mult.impute and to use special names for fun result components, such as validate and calibrate, so that the result can be processed with processMI. When simultaneously running multiple imputation and resampling model validation you may not need values for n.impute or B (number of bootstraps) as high as usual, as the total number of repetitions will be n.impute * B.
fit.mult.impute can incorporate robust sandwich variance estimates into Rubin's rule if robust=TRUE.
For ols models fitted by fit.mult.impute with stacking, the R^2 measure in the stacked model fit is OK, and print.ols computes adjusted R^2 using the real sample size so it is also OK because fit.mult.impute corrects the stacked error degrees of freedom in the stacked fit object to reflect the real sample size.
The summary method for transcan prints the function call, R^2 achieved in transforming each variable, and for each variable the coefficients of all other transformed variables that are used to estimate the transformation of the initial variable. If imputed=TRUE was used in the call to transcan, it also uses the describe function to print a summary of imputed values. If long = TRUE, it also prints all imputed values with observation identifiers. There is also a simple function print.transcan which merely prints the transformation matrix and the function call. It has an optional argument long, which if set to TRUE causes detailed parameters to be printed. Instead of plotting while transcan is running, you can plot the final transformations after the fact using plot.transcan or ggplot.transcan, if the option trantab = TRUE was specified to transcan. If in addition the option imputed = TRUE was specified to transcan, plot and ggplot will show the location of imputed values (including multiples) along the axes. For ggplot, imputed values are shown as red plus signs.
The impute method for transcan does imputations for a selected original data variable, on the original scale (if imputed=TRUE was given to transcan). If you do not specify a variable to impute, it will do imputations for all variables given to transcan which had at least one missing value. This assumes that the original variables are accessible (i.e., they have been attached) and that you want the imputed variables to have the same names as the original variables. If n.impute was specified to transcan you must tell impute which imputation to use. Results are stored in .GlobalEnv when list.out is not specified (it is recommended to use list.out=TRUE).
The predict method for transcan computes predicted variables and imputed values from a matrix of new data. This matrix should have the same column variables as the original matrix used with transcan, and in the same order (unless a formula was used with transcan).
The Function function is a generic function generator. Function.transcan creates R functions to transform variables using transformations created by transcan. These functions are useful for getting predicted values with predictors set to values on the original scale.
The vcov methods are defined here so that imputation-corrected variance-covariance matrices are readily extracted from fit.mult.impute objects, and so that fit.mult.impute can easily compute traditional covariance matrices for individual completed datasets.
The subscript method for transcan preserves attributes.
The invertTabulated function does either inverse linear interpolation or uses sampling to sample qualifying x-values having y-values near the desired values. The latter is used to get inverse values having a reasonable distribution (e.g., no floor or ceiling effects) when the transformation has a flat or nearly flat segment, resulting in a many-to-one transformation in that region. Sampling weights are a combination of the frequency of occurrence of x-values that are within tolInverse times the range of y and the squared distance between the associated y-values and the target y-value (aty).
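To see why inverse linear interpolation behaves poorly on a flat segment, here is a base R sketch using approx with the roles of x and y swapped. It is an assumption that this mirrors the 'linearInterp' behaviour closely; the variable names x_tab, y_tab and aty are chosen for the example and the tabulated values are the ones used in the Examples section.

```r
# Tabulated transformation with a flat segment: y = 5 for x = 5, 6, 7, 8
x_tab <- 1:10
y_tab <- c(1, 2, 3, 4, 5, 5, 5, 5, 9, 10)

# Inverse linear interpolation: interpolate x as a function of y.
# Tied y-values are collapsed to the mean of their x-values (ties = mean),
# and rule = 2 returns the nearest extreme outside the tabulated range.
aty <- c(5, 7, 0)
inv <- approx(x = y_tab, y = x_tab, xout = aty, rule = 2, ties = mean)$y
inv   # every target of 5 maps to the single value 6.5
```

Because all targets equal to 5 map to one x-value, many imputations would pile up at that point; sampling among the qualifying x-values (inverse='sample') avoids this.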
Usage
transcan(x, method=c("canonical","pc"),
         categorical=NULL, asis=NULL, nk, imputed=FALSE, n.impute,
         boot.method=c('approximate bayesian', 'simple'),
         trantab=FALSE, transformed=FALSE,
         impcat=c("score", "multinom", "rpart"),
         mincut=40,
         inverse=c('linearInterp','sample'), tolInverse=.05,
         pr=TRUE, pl=TRUE, allpl=FALSE, show.na=TRUE,
         imputed.actual=c('none','datadensity','hist','qq','ecdf'),
         iter.max=50, eps=.1, curtail=TRUE,
         imp.con=FALSE, shrink=FALSE, init.cat="mode",
         nres=if(boot.method=='simple')200 else 400,
         data, subset, na.action, treeinfo=FALSE,
         rhsImp=c('mean','random'), details.impcat='', ...)
## S3 method for class 'transcan'
summary(object, long=FALSE, digits=6, ...)
## S3 method for class 'transcan'
print(x, long=FALSE, ...)
## S3 method for class 'transcan'
plot(x, ...)
## S3 method for class 'transcan'
ggplot(data, mapping, scale=FALSE, ..., environment)
## S3 method for class 'transcan'
impute(x, var, imputation, name, pos.in, data,
       list.out=FALSE, pr=TRUE, check=TRUE, ...)
fit.mult.impute(formula, fitter, xtrans, data, n.impute, fit.reps=FALSE,
                dtrans, derived, fun, vcovOpts=NULL,
                robust=FALSE, cluster, robmethod=c('huber', 'efron'),
                method=c('ordinary', 'stack', 'only stack'),
                funstack=TRUE, lrt=FALSE,
                pr=TRUE, subset, fitargs)
## S3 method for class 'transcan'
predict(object, newdata, iter.max=50, eps=0.01, curtail=TRUE,
        type=c("transformed","original"),
        inverse, tolInverse, check=FALSE, ...)
Function(object, ...)
## S3 method for class 'transcan'
Function(object, prefix=".", suffix="", pos=-1, ...)
invertTabulated(x, y, freq=rep(1,length(x)),
                aty, name='value',
                inverse=c('linearInterp','sample'),
                tolInverse=0.05, rule=2)
## Default S3 method:
vcov(object, regcoef.only=FALSE, ...)
## S3 method for class 'fit.mult.impute'
vcov(object, regcoef.only=TRUE, intercepts='mid', ...)
Arguments
x | a matrix containing continuous variable values and codes for categorical variables. The matrix must have column names ( |
formula | any R model formula |
fitter | any R, |
xtrans | an object created by |
method | use |
categorical | a character vector of names of variables in |
asis | a character vector of names of variables that are not to be transformed. For these variables, the guts of |
nk | number of knots to use in expanding each continuous variable (not listed in |
imputed | Set to |
n.impute | number of multiple imputations. If omitted, single predicted expected value imputation is used. |
boot.method | default is to use the approximate Bayesian bootstrap (sample with replacement from sample with replacement of the vector of residuals). You can also specify |
trantab | Set to |
transformed | set to |
impcat | This argument tells how to impute categorical variables on the original scale. The default is |
mincut | If |
inverse | By default, imputed values are back-solved on the original scale using inverse linear interpolation on the fitted tabulated transformed values. This will cause distorted distributions of imputed values (e.g., floor and ceiling effects) when the estimated transformation has a flat or nearly flat section. To instead use the |
tolInverse | the multiplier of the range of transformed values, weighted by |
pr | For |
pl | Set to |
allpl | Set to |
show.na | Set to |
imputed.actual | The default is ‘"none"’ to suppress plotting of actual vs. imputed values for all variables having any |
iter.max | maximum number of iterations to perform for |
eps | convergence criterion for |
curtail | for |
imp.con | for |
shrink | default is |
init.cat | method for initializing scorings of categorical variables. Default is ‘"mode"’ to use a dummy variable set to 1 if the value is the most frequent value. Use ‘"random"’ to use a random 0-1 variable. Set to ‘"asis"’ to use the original integer codes as starting scores. |
nres | number of residuals to store if |
data | Data frame used to fill the formula. For |
subset | an integer or logical vector specifying the subset of observations to fit |
na.action | These may be used if |
treeinfo | Set to |
rhsImp | Set to ‘"random"’ to use random draw imputation when a sometimes missing variable is moved to be a predictor of other sometimes missing variables. Default is |
details.impcat | set to a character scalar that is the name of a category variable to include in the resulting |
... | arguments passed to |
long | for |
digits | number of significant digits for printing values by |
scale | for |
mapping,environment | not used; needed because of rules about generics |
var | For |
imputation | specifies which of the multiple imputations to use for filling in |
name | name of variable to impute, for |
pos.in | location as defined by |
list.out | If |
check | set to |
newdata | a new data matrix for which to compute transformed variables. Categorical variables must use the same integer codes as were used in the call to |
fit.reps | set to |
dtrans | provides an approach to creating derived variables from a single filled-in dataset. The function specified as |
derived | an expression containing R expressions for computing derived variables that are used in the model formula. This is useful when multiple imputations are done for component variables but the actual model uses combinations of these (e.g., ratios or other derivations). For a single derived variable you can specify for example |
fun | a function of a fit made on one of the completed datasets. Typical uses are bootstrap model validations. The result of |
vcovOpts | a list of named additional arguments to pass to the |
robust | set to |
cluster | a vector of cluster IDs that is the same length as the number of rows in the dataset being analyzed. When specified, |
robmethod | see the |
funstack | set to |
lrt | set to |
fitargs | a list of extra arguments to pass to |
type | By default, the matrix of transformed variables is returned, with imputed values on the transformed scale. If you had specified |
object | an object created by |
prefix,suffix | When creating separate R functions for each variable in |
pos | position as in |
y | a vector corresponding to |
freq | a vector of frequencies corresponding to cross-classified |
aty | vector of transformed values at which inverses are desired |
rule | see |
regcoef.only | set to |
intercepts | this is primarily for |
Details
The starting approximation to the transformation for each variable is taken to be the original coding of the variable. The initial approximation for each missing value is taken to be the median of the non-missing values for the variable (for continuous ones) or the most frequent category (for categorical ones). Instead, if imp.con is a vector, its values are used for imputing NA values. When using each variable as a dependent variable, NA values on that variable cause all observations to be temporarily deleted. Once a new working transformation is found for the variable, along with a model to predict that transformation from all the other variables, that latter model is used to impute NA values in the selected dependent variable if imp.con is not specified.
When that variable is used to predict a new dependent variable, the current working imputed values are inserted. Transformations are updated after each variable becomes a dependent variable, so the order of variables on x could conceivably make a difference in the final estimates. For obtaining out-of-sample predictions/transformations, predict uses the same iterative procedure as transcan for imputation, with the same starting values for fill-ins as were used by transcan. It also (by default) uses a conservative approach of curtailing transformed variables to be within the range of the original ones. Even when method = "pc" is specified, canonical variables are used for imputing missing values.
Note that fitted transformations, when evaluated at imputed variable values (on the original scale), will not precisely match the transformed imputed values returned in xt. This is because transcan uses an approximate method based on linear interpolation to back-solve for imputed values on the original scale.
Shrinkage uses the method of Van Houwelingen and Le Cessie (1990) (similar to Copas, 1983). The shrinkage factor is
\frac{1-\frac{(1-R^2)(n-1)}{n-k-1}}{R^2}
where R^2 is the apparent R^2 for predicting the variable, n is the number of non-missing values, and k is the effective number of degrees of freedom (aside from intercepts). A heuristic estimate is used for k: A - 1 + sum(max(0, Bi - 1))/m + m, where A is the number of d.f. required to represent the variable being predicted, the Bi are the number of columns required to represent all the other variables, and m is the number of all other variables. Division by m is done because the transformations for the other variables are fixed at their current transformations the last time they were being predicted. The + m term comes from the number of coefficients estimated on the right hand side, whether by least squares or canonical variates. If a shrinkage factor is negative, it is set to 0. The shrinkage factor is the ratio of the adjusted R^2 to the ordinary R^2. The adjusted R^2 is
1-\frac{(1-R^2)(n-1)}{n-k-1}
which is also set to zero if it is negative. If shrink=FALSE and the adjusted R^2s are much smaller than the ordinary R^2s, you may want to run transcan with shrink=TRUE.
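The arithmetic above is easy to check numerically. The following base R sketch computes the heuristic k and the shrinkage factor directly from the formulas as stated; heuristic_k and shrink_factor are hypothetical helper names and the input values are made up for illustration.

```r
# Heuristic effective d.f.: A - 1 + sum(max(0, Bi - 1))/m + m
# A: d.f. for the variable being predicted; B: columns for each other variable
heuristic_k <- function(A, B) {
  m <- length(B)
  A - 1 + sum(pmax(0, B - 1)) / m + m
}

# Shrinkage factor: adjusted R^2 (floored at 0) divided by apparent R^2
shrink_factor <- function(R2, n, k) {
  adj <- max(0, 1 - (1 - R2) * (n - 1) / (n - k - 1))
  adj / R2
}

k <- heuristic_k(A = 4, B = c(4, 1, 2))   # 3 + 4/3 + 3
shrink_factor(R2 = 0.5, n = 100, k = k)   # approximately 0.92
shrink_factor(R2 = 0.01, n = 20, k = 5)   # adjusted R^2 is negative, so 0
```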
Canonical variates are scaled to have variance of 1.0, by multiplying canonical coefficients from cancor by \sqrt{n-1}.
When specifying a non-rms library fitting function to fit.mult.impute (e.g., lm, glm), running the result of fit.mult.impute through that fit's summary method will not use the imputation-adjusted variances. You may obtain the new variances using fit$var or vcov(fit).
When you specify a rms function to fit.mult.impute (e.g. lrm, ols, cph, psm, bj, Rq, Gls, Glm), automatically computed transformation parameters (e.g., knot locations for rcs) that are estimated for the first imputation are used for all other imputations. This ensures that knot locations will not vary, which would change the meaning of the regression coefficients.
Warning: even though fit.mult.impute takes imputation into account when estimating variances of regression coefficients, it does not take into account the variation that results from estimation of the shapes and regression coefficients of the customized imputation equations. Specifying shrink=TRUE solves a small part of this problem. To fully account for all sources of variation you should consider putting the transcan invocation inside a bootstrap or loop, if execution time allows. Better still, use aregImpute or a package such as mice that uses real Bayesian posterior realizations to multiply impute missing values correctly.
It is strongly recommended that you use the Hmisc naclus function to determine if there is a good basis for imputation. naclus will tell you, for example, if systolic blood pressure is missing whenever diastolic blood pressure is missing. If the only variable that is well correlated with diastolic bp is systolic bp, there is no basis for imputing diastolic bp in this case.
At present, predict does not work with multiple imputation.
When calling fit.mult.impute with glm as the fitter argument, if you need to pass a family argument to glm do it by quoting the family, e.g., family="binomial".
fit.mult.impute will not work with proportional odds models when regression imputation was used (as opposed to predictive mean matching). That's because regression imputation will create values of the response variable that did not exist in the dataset, altering the intercept terms in the model.
You should be able to use a variable in the formula given to fit.mult.impute as a numeric variable in the regression model even though it was a factor variable in the invocation of transcan. Use for example fit.mult.impute(y ~ codes(x), lrm, trans) (thanks to Trevor Thompson, trevor@hp5.eushc.org).
Here is an outline of the steps necessary to impute baseline variables using the dtrans argument, when the analysis to be repeated by fit.mult.impute is a longitudinal analysis (using e.g. Gls).
1. Create a one row per subject data frame containing baseline variables plus follow-up variables that are assigned to windows. For example, you may have dozens of repeated measurements over years but you capture the measurements at the times measured closest to 1, 2, and 3 years after study entry.
2. Make sure the dataset contains the subject ID.
3. This dataset becomes the one passed to aregImpute as data=. You will be imputing missing baseline variables from follow-up measurements defined at fixed times.
4. Have another dataset with all the non-missing follow-up values on it, one record per measurement time per subject. This dataset should not have the baseline variables on it, and the follow-up measurements should not be named the same as the baseline variable(s); the subject ID must also appear.
5. Add the dtrans argument to fit.mult.impute to define a function with one argument representing the one record per subject dataset with missing values filled in from the current imputation. This function merges the above 2 datasets; the returned value of this function is the merged data frame.
6. This merged-on-the-fly dataset is the one handed by fit.mult.impute to your fitting function, so variable names in the formula given to fit.mult.impute must match the names created by the merge.
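The merging function described in the outline above could look like the following base R sketch. The data frames followup and baseline_completed, their column names, and the function name merge_long are all hypothetical; only the use of a one-argument function returning the merged data frame comes from the outline.

```r
# Long-format follow-up data: one record per measurement time per subject.
# It carries the subject ID but no baseline variables.
followup <- data.frame(id   = c(1, 1, 2, 2, 3),
                       time = c(1, 2, 1, 2, 1),
                       yfu  = c(5.1, 5.4, 6.0, 6.2, 4.8))

# Function of one argument: the completed one-row-per-subject data frame
# for the current imputation. It returns the merged long-format data frame.
merge_long <- function(w) merge(w, followup, by = "id")

# Stand-in for a completed baseline dataset from one imputation
baseline_completed <- data.frame(id = c(1, 2, 3), age = c(50, 61, 47))
long <- merge_long(baseline_completed)
long
```

Under these assumptions the function would be supplied as dtrans=merge_long, and the model formula would refer to names such as yfu, time and age that exist in the merged result.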
Value
For transcan, a list of class ‘transcan’ with elements
call | (with the function call) |
iter | (number of iterations done) |
rsq,rsq.adj | containing the |
categorical | the values supplied for |
asis | the values supplied for |
coef | the within-variable coefficients used to compute the first canonical variate |
xcoef | the (possibly shrunk) across-variables coefficients of the first canonical variate that predicts each variable in turn. |
parms | the parameters of the transformation (knots for splines, contrast matrix for categorical variables) |
fillin | the initial estimates for missing values ( |
ranges | the matrix of ranges of the transformed variables (min and max in first and second row) |
scale | a vector of scales used to determine convergence for a transformation. |
formula | the formula (if |
Optionally, a vector of shrinkage factors used for predicting each variable from the others is also included. For asis variables, the scale is the average absolute difference about the median. For other variables it is unity, since canonical variables are standardized. For xcoef, row i has the coefficients to predict transformed variable i, with the column for the coefficient of variable i set to NA. If imputed=TRUE was given, an optional element imputed also appears. This is a list with the vector of imputed values (on the original scale) for each variable containing NAs. Matrices rather than vectors are returned if n.impute is given. If trantab=TRUE, the trantab element also appears, as described above. If n.impute > 0, transcan also returns a list residuals that can be used for future multiple imputation.
impute returns a vector (the same length as var) of class ‘impute’ with NA values imputed.
predict returns a matrix with the same number of columns or variables as were in x.
fit.mult.impute returns a fit object that is a modification of the fit object created by fitting the completed dataset for the final imputation. The var matrix in the fit object has the imputation-corrected variance-covariance matrix. coefficients is the average (over imputations) of the coefficient vectors, variance.inflation.impute is a vector containing the ratios of the diagonals of the between-imputation variance matrix to the diagonals of the average apparent (within-imputation) variance matrix. missingInfo is Rubin's rate of missing information and dfmi is Rubin's degrees of freedom for a t-statistic for testing a single parameter. The last two objects are vectors corresponding to the diagonal of the variance matrix. The class "fit.mult.impute" is prepended to the other classes produced by the fitting function.
When method is not 'ordinary', i.e., stacking is used, fit.mult.impute returns a modified fit object that is computed on all completed datasets combined, with almost all statistics that are functions of the sample size corrected to the real sample size. Elements in the fit such as residuals will have length equal to the real sample size times the number of imputations.
fit.mult.impute stores intercepts attributes in the coefficient matrix and in var for orm fits.
Side Effects
prints, plots, and impute.transcan creates new variables.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
References
Kuhfeld, Warren F: The PRINQUAL Procedure. SAS/STAT User's Guide, Fourth Edition, Volume 2, pp. 1265–1323, 1990.
Van Houwelingen JC, Le Cessie S: Predictive value of statistical models. Statistics in Medicine 9:1303–1325, 1990.
Copas JB: Regression, prediction and shrinkage. JRSS B 45:311–354, 1983.
He X, Shen L: Linear regression after spline transformation. Biometrika 84:474–481, 1997.
Little RJA, Rubin DB: Statistical Analysis with Missing Data. New York: Wiley, 1987.
Rubin DB, Schenker N: Multiple imputation in health-care databases: An overview and some applications. Stat in Med 10:585–598, 1991.
Faris PD, Ghali WA, et al: Multiple imputation versus data enhancement for dealing with missing data in observational health care outcome analyses. J Clin Epidem 55:184–191, 2002.
See Also
aregImpute, impute, naclus, naplot, ace, avas, cancor, prcomp, rcspline.eval, lsfit, approx, datadensity, mice, ggplot, processMI
Examples
## Not run:
x <- cbind(age, disease, blood.pressure, pH)
#cbind will convert factor object `disease' to integer
par(mfrow=c(2,2))
x.trans <- transcan(x, categorical="disease", asis="pH",
                    transformed=TRUE, imputed=TRUE)
summary(x.trans)  #Summary distribution of imputed values, and R-squares
f <- lm(y ~ x.trans$transformed)   #use transformed values in a regression
#Now replace NAs in original variables with imputed values, if not
#using transformations
age            <- impute(x.trans, age)
disease        <- impute(x.trans, disease)
blood.pressure <- impute(x.trans, blood.pressure)
pH             <- impute(x.trans, pH)
#Do impute(x.trans) to impute all variables, storing new variables under
#the old names
summary(pH)       #uses summary.impute to tell about imputations
                  #and summary.default to tell about pH overall
# Get transformed and imputed values on some new data frame xnew
newx.trans     <- predict(x.trans, xnew)
w              <- predict(x.trans, xnew, type="original")
age            <- w[,"age"]            #inserts imputed values
blood.pressure <- w[,"blood.pressure"]
Function(x.trans)  #creates .age, .disease, .blood.pressure, .pH()
#Repeat first fit using a formula
x.trans <- transcan(~ age + disease + blood.pressure + I(pH),
                    imputed=TRUE)
age <- impute(x.trans, age)
predict(x.trans, expand.grid(age=50, disease="pneumonia",
        blood.pressure=60:260, pH=7.4))
z <- transcan(~ age + factor(disease.code),  # disease.code categorical
              transformed=TRUE, trantab=TRUE, imputed=TRUE, pl=FALSE)
ggplot(z, scale=TRUE)
plot(z$transformed)
## End(Not run)
# Multiple imputation and estimation of variances and covariances of
# regression coefficient estimates accounting for imputation
set.seed(1)
x1 <- factor(sample(c('a','b','c'),100,TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(100)
y  <- x2 + 1*(x1=='c') + rnorm(100)
x1[1:20] <- NA
x2[18:23] <- NA
d <- data.frame(x1,x2,y)
n <- naclus(d)
plot(n); naplot(n)  # Show patterns of NAs
f  <- transcan(~y + x1 + x2, n.impute=10, shrink=FALSE, data=d)
options(digits=3)
summary(f)
f  <- transcan(~y + x1 + x2, n.impute=10, shrink=TRUE, data=d)
summary(f)
h <- fit.mult.impute(y ~ x1 + x2, lm, f, data=d)
# Add ,fit.reps=TRUE to save all fit objects in h, then do something like:
# for(i in 1:length(h$fits)) print(summary(h$fits[[i]]))
diag(vcov(h))
h.complete <- lm(y ~ x1 + x2, na.action=na.omit)
h.complete
diag(vcov(h.complete))
# Note: had the rms ols function been used in place of lm, any
# function run on h (anova, summary, etc.) would have automatically
# used imputation-corrected variances and covariances
# Example demonstrating how using the multinomial logistic model
# to impute a categorical variable results in a frequency
# distribution of imputed values that matches the distribution
# of non-missing values of the categorical variable
## Not run:
set.seed(11)
x1 <- factor(sample(letters[1:4], 1000,TRUE))
x1[1:200] <- NA
table(x1)/sum(table(x1))
x2 <- runif(1000)
z  <- transcan(~ x1 + I(x2), n.impute=20, impcat='multinom')
table(z$imputed$x1)/sum(table(z$imputed$x1))
# Here is how to create a completed dataset
d <- data.frame(x1, x2)
z <- transcan(~x1 + I(x2), n.impute=5, data=d)
imputed <- impute(z, imputation=1, data=d,
                  list.out=TRUE, pr=FALSE, check=FALSE)
sapply(imputed, function(x)sum(is.imputed(x)))
sapply(imputed, function(x)sum(is.na(x)))
## End(Not run)
# Do single imputation and create a filled-in data frame
z <- transcan(~x1 + I(x2), data=d, imputed=TRUE)
imputed <- as.data.frame(impute(z, data=d, list.out=TRUE))
# Example where multiple imputations are for basic variables and
# modeling is done on variables derived from these
set.seed(137)
n <- 400
x1 <- runif(n)
x2 <- runif(n)
y  <- x1*x2 + x1/(1+x2) + rnorm(n)/3
x1[1:5] <- NA
d <- data.frame(x1,x2,y)
w <- transcan(~ x1 + x2 + y, n.impute=5, data=d)
# Add ,show.imputed.actual for graphical diagnostics
## Not run:
g <- fit.mult.impute(y ~ product + ratio, ols, w,
                     data=data.frame(x1,x2,y),
                     derived=expression({
                       product <- x1*x2
                       ratio   <- x1/(1+x2)
                       print(cbind(x1,x2,x1*x2,product)[1:6,])}))
## End(Not run)
# Here's a method for creating a permanent data frame containing
# one set of imputed values for each variable specified to transcan
# that had at least one NA, and also containing all the variables
# in an original data frame.  The following is based on the fact
# that the default output location for impute.transcan is
# given by the global environment
## Not run:
xt <- transcan(~. , data=mine,
               imputed=TRUE, shrink=TRUE, n.impute=10, trantab=TRUE)
attach(mine, use.names=FALSE)
impute(xt, imputation=1) # use first imputation
# omit imputation= if using single imputation
detach(1, 'mine2')
## End(Not run)
# Example of using invertTabulated outside transcan
x    <- c(1,2,3,4,5,6,7,8,9,10)
y    <- c(1,2,3,4,5,5,5,5,9,10)
freq <- c(1,1,1,1,1,2,3,4,1,1)
# x=5,6,7,8 with prob. .1 .2 .3 .4 when y=5
# Within a tolerance of .05*(10-1) all y's match exactly
# so the distance measure does not play a role
set.seed(1)      # so can reproduce
for(inverse in c('linearInterp','sample'))
  print(table(invertTabulated(x, y, freq, rep(5,1000), inverse=inverse)))
# Test inverse='sample' when the estimated transformation is
# flat on the right.  First show default imputations
set.seed(3)
x <- rnorm(1000)
y <- pmin(x, 0)
x[1:500] <- NA
for(inverse in c('linearInterp','sample')) {
  par(mfrow=c(2,2))
  w <- transcan(~ x + y, imputed.actual='hist',
                inverse=inverse, curtail=FALSE,
                data=data.frame(x,y))
  if(inverse=='sample') next
  # cat('Click mouse on graph to proceed\n')
  # locator(1)
}
## Not run:
# While running multiple imputation for a logistic regression model
# Run the rms package validate and calibrate functions and save the
# results in w$funresults
a <- aregImpute(~ x1 + x2 + y, data=d, n.impute=10)
require(rms)
g <- function(fit)
  list(validate=validate(fit, B=50), calibrate=calibrate(fit, B=75))
w <- fit.mult.impute(y ~ x1 + x2, lrm, a, data=d, fun=g,
                     fitargs=list(x=TRUE, y=TRUE))
# Get all validate results in its own list of length 10
r <- w$funresults
val <- lapply(r, function(x) x$validate)
cal <- lapply(r, function(x) x$calibrate)
# See rms processMI and https://hbiostat.org/rmsc/validate.html#sec-val-mival
## End(Not run)
## Not run:
# Account for within-subject correlation using the robust cluster sandwich
# covariance estimate in conjunction with Rubin's rule for multiple imputation
# rms package must be installed
a <- aregImpute(..., data=d)
f <- fit.mult.impute(y ~ x1 + x2, lrm, a, n.impute=30, data=d, cluster=d$id)
# Get likelihood ratio chi-square tests accounting for missingness
a <- aregImpute(..., data=d)
h <- fit.mult.impute(y ~ x1 + x2, lrm, a, n.impute=40, data=d, lrt=TRUE)
processMI(h, which='anova')   # processMI is in rms
## End(Not run)
Translate Vector or Matrix of Text Strings
Description
Uses the UNIX tr command to translate any character in old in text to the corresponding character in new. If multichar=TRUE or old and new have more than one element, or each have one element but they have different numbers of characters, uses the UNIX sed command to translate the series of characters in old to the series in new when these characters occur in text. If old or new contain a backslash, you sometimes have to quadruple it to make the UNIX command work. If they contain a forward slash, precede it by two backslashes. Invokes the builtin chartr function if multichar=FALSE.
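For reference, the single-character case corresponds to base R's chartr, which maps each character of old to the character at the same position in new. The gsub line is only a loose base R analogue of replacing a whole series of characters and is an assumption for illustration, not how translate is implemented.

```r
# Character-by-character translation with the builtin chartr
single <- chartr("abc", "xyz", c("aabbcc", "cab"))
single   # "xxyyzz" "zxy"

# Loose analogue of multi-character replacement using fixed-string gsub
multi <- gsub("dog", "DOG", c("dog", "cat"), fixed = TRUE)
multi    # "DOG" "cat"
```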
Usage
translate(text, old, new, multichar=FALSE)
Arguments
text | scalar, vector, or matrix of character strings to translate. |
old | vector of old characters |
new | corresponding vector of new characters |
multichar | See above. |
Value
an object like text but with characters translated
See Also
grep
Examples
translate(c("ABC","DEF"),"ABCDEFG", "abcdefg")
translate("23.12","[.]","\\cdot ") # change . to \cdot
translate(c("dog","cat","tiger"),c("dog","cat"),c("DOG","CAT"))
# S-Plus gives  [1] "DOG"   "CAT"   "tiger" - check discrepancy
translate(c("dog","cat2","snake"),c("dog","cat"),"animal")
# S-Plus gives  [1] "animal"  "animal2" "snake"
Return the floor, ceiling, or rounded value of date or time to specified unit.
Description
truncPOSIXt returns the date truncated to the specified unit. ceil.POSIXt returns next ceiling of the date at the unit selected in units. roundPOSIXt returns the date or time value rounded to nearest specified unit selected in digits.
truncPOSIXt and roundPOSIXt have been extended from the base package functions trunc.POSIXt and round.POSIXt which in the future will add the other time units we need.
Usage
ceil(x, units,...)
## Default S3 method:
ceil(x, units, ...)
truncPOSIXt(x, units = c("secs", "mins", "hours", "days",
                         "months", "years"), ...)
## S3 method for class 'POSIXt'
ceil(x, units = c("secs", "mins", "hours", "days",
                  "months", "years"), ...)
roundPOSIXt(x, digits = c("secs", "mins", "hours", "days", "months", "years"))
Arguments
x | date to be ceilinged, truncated, or rounded |
units | unit to which the value is rounded up or down. |
digits | same as |
... | further arguments to be passed to or from other methods. |
Value
An object of class POSIXlt.
Author(s)
Charles Dupont
See Also
Date, POSIXt, POSIXlt, DateTimeClasses
Examples
date <- ISOdate(1832, 7, 12)
ceil(date, units='months')  # '1832-8-1'
truncPOSIXt(date, units='years')     # '1832-1-1'
roundPOSIXt(date, digits='months')    # '1832-7-1'
Units Attribute of a Vector
Description
Sets or retrieves the "units" attribute of an object. units.default replaces the builtin version, which only works for time series objects. If the variable is also given a label, subsetting (using [.labelled) will retain the "units" attribute. For a Surv object, units first looks for an overall "units" attribute, then it looks for units for the time2 variable then for time1. When setting "units", value is changed to lower case and any "s" at the end is removed.
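The normalization applied when setting units (lower case, trailing "s" removed) can be mimicked in base R as below. normalize_units is a hypothetical helper written for illustration; the actual replacement method may differ in its details.

```r
# Hypothetical helper: lower-case the value and drop one trailing "s"
normalize_units <- function(value) sub("s$", "", tolower(value))

fail.time <- c(10, 20)
attr(fail.time, "units") <- normalize_units("Days")
attr(fail.time, "units")   # "day"
```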
Usage
units(x, ...)
## Default S3 method:
units(x, none='', ...)
## S3 method for class 'Surv'
units(x, none='', ...)
## Default S3 replacement method:
units(x) <- value
Arguments
x | any object |
... | ignored |
value | the units of the object, or "" |
none | value to which to set result if no appropriate attribute isfound |
Value
the units attribute of x, if any; otherwise, the units attribute of the tspar attribute of x if any; otherwise the value none. Handling for Surv objects is different (see above).
See Also
Examples
require(survival)
fail.time <- c(10,20)
units(fail.time) <- "Day"
describe(fail.time)
S <- Surv(fail.time)
units(S)
label(fail.time) <- 'Failure Time'
units(fail.time) <- 'Days'
fail.time
Update a Data Frame or Cleanup a Data Frame after Importing
Description
cleanup.import will correct errors and shrink the size of data frames. By default, double precision numeric variables are changed to integer when they contain no fractional components. Infinite values or values greater than 1e20 in absolute value are set to NA. This solves problems of importing Excel spreadsheets that contain occasional character values for numeric columns, as S converts these to Inf without warning. There is also an option to convert variable names to lower case and to add labels to variables. The latter can be made easier by importing a CNTLOUT dataset created by SAS PROC FORMAT and using the sasdict option as shown in the example below. cleanup.import can also transform character or factor variables to dates.
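The default numeric clean-up described above can be approximated for a single vector in base R. shrink_numeric is a hypothetical helper that follows the stated rules (infinite or very large values become NA, whole-number doubles become integer); it is a sketch, not the cleanup.import implementation.

```r
# Hypothetical helper mimicking the default numeric clean-up for one vector
shrink_numeric <- function(x, big = 1e20) {
  # Infinite values and values beyond `big` in absolute value become NA
  x[is.infinite(x) | (!is.na(x) & abs(x) > big)] <- NA
  ok <- x[!is.na(x)]
  # Convert to integer only if every remaining value is a representable
  # whole number
  if (length(ok) && all(ok == floor(ok)) &&
      all(abs(ok) <= .Machine$integer.max))
    storage.mode(x) <- "integer"
  x
}

z <- shrink_numeric(c(1, 2, Inf, 3e30))
z                      # 1 2 NA NA, stored as integer
shrink_numeric(c(1.5, 2))   # has a fractional value, stays double
```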
upData is a function facilitating the updating of a data framewithout attaching it in search position one. New variables can beadded, old variables can be modified, variables can be removed or renamed, and"labels" and"units" attributes can be provided.Observations can be subsetted. Various checksare made for errors and inconsistencies, with warnings issued to helpthe user. Levels of factor variables can be replaced, especiallyusing thelist notation of the standardmerge.levelsfunction. Unlessforce.single is set toFALSE,upData also converts double precision vectors to integer if nofractional values are present in a vector.upData is also used to process R workspace objectscreated by StatTransfer, which puts variable and value labels as attributes onthe data frame rather than on each variable. If such attributes arepresent, they are used to define all the labels and value labels(through conversion to factor variables) before any label changestake place, andforce.single is set to a default ofFALSE, as StatTransfer already does conversion to integer.
Variables having labels but not classed"labelled" (e.g., dataimported using thehaven package) have that class added to thembyupData.
ThedataframeReduce function removes variables from a data framethat are problematic for certain analyses. Variables can be removedbecause the fraction of missing values exceeds a threshold, because theyare character or categorical variables having too many levels, orbecause they are binary and have too small a prevalence in one of thetwo values. Categorical variables can also have their levels combinedwhen a level is of low prevalence. A data frame listing actions takeis return as attribute"info" to the main returned data frame.
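The following is a small sketch of the workflow described above, assuming Hmisc is installed. The data frame and its column names (age, ht, labval) are made up for illustration. It renames a variable, derives a new one, attaches a label and units, and then drops a variable that is mostly missing.

```r
library(Hmisc)

d <- data.frame(age    = c(21, 65, 43, 50),
                ht     = c(170, 180, 165, 175),
                labval = c(NA, NA, NA, 1.2))

# Renaming is done before other operations, so the new-variable
# expression refers to the new name 'height'
d2 <- upData(d,
             heightm = height / 100,
             rename  = c(ht = 'height'),
             labels  = c(age = 'Age'),
             units   = c(age = 'years'),
             print   = FALSE)
names(d2)
label(d2$age)

# Remove variables having more than half of their values missing
d3 <- dataframeReduce(d2, fracmiss = 0.5, print = FALSE)
names(d3)
attr(d3, 'info')   # data frame listing the actions taken
```

Setting print=FALSE only suppresses the progress report; the returned data frames are the same.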
Usage
cleanup.import(obj, labels, lowernames=FALSE,
               force.single=TRUE, force.numeric=TRUE, rmnames=TRUE,
               big=1e20, sasdict, print, datevars=NULL, datetimevars=NULL,
               dateformat='%F',
               fixdates=c('none','year'),
               autodate=FALSE, autonum=FALSE, fracnn=0.3,
               considerNA=NULL, charfactor=FALSE)

upData(object, ...,
       subset, rename, drop, keep, labels, units, levels, force.single=TRUE,
       lowernames=FALSE, caplabels=FALSE, classlab=FALSE, moveUnits=FALSE,
       charfactor=FALSE, print=TRUE, html=FALSE)

dataframeReduce(data, fracmiss=1, maxlevels=NULL, minprev=0, print=TRUE)

Arguments
obj | a data frame or list |
object | a data frame or list |
data | a data frame |
force.single | By default, double precision variables are converted to single precision (in S-Plus only) unless |
force.numeric | Sometimes importing will cause a numeric variable to be changed to a factor vector. By default, |
rmnames | set to ‘F’ to not have ‘cleanup.import’ remove ‘names’ or ‘.Names’attributes from variables |
labels | a character vector the same length as the number of variables in |
lowernames | set this to |
big | a value such that values larger than this in absolute value are set to missing by |
sasdict | the name of a data frame containing a raw imported SAS PROC CONTENTS CNTLOUT= dataset. This is used to define variable names and to add attributes to the new data frame specifying the original SAS dataset name and label. |
print | set to |
datevars | character vector of names (after |
datetimevars | character vector of names (after |
dateformat | for |
fixdates | for any of the variables listed in |
autodate | set to |
autonum | set to |
fracnn | see |
considerNA | for |
charfactor | set to |
... | for |
subset | an expression that evaluates to a logical vector specifying which rows of |
rename | list or named vector specifying old and new names for variables. Variables are renamed before any other operations are done. For example, to rename variables |
drop | a vector of variable names to remove from the data frame |
keep | a vector of variable names to keep, with all other variables dropped |
units | a named vector or list defining |
levels | a named list defining |
caplabels | set to |
classlab | set to |
moveUnits | set to |
html | set to |
fracmiss | the maximum permissible proportion of |
maxlevels | the maximum number of levels of a character or categorical or factor variable before the variable is dropped |
minprev | the minimum proportion of non-missing observations in a category for a binary variable to be retained, and the minimum relative frequency of a category before it will be combined with other small categories |
Value
a new data frame
Author(s)
Frank Harrell, Vanderbilt University
See Also
sas.get, data.frame, describe, label, read.csv, strptime, POSIXct, Date
Examples
## Not run: 
dat <- read.table('myfile.asc')
dat <- cleanup.import(dat)

## End(Not run)
dat <- data.frame(a=1:3, d=c('01/02/2004',' 1/3/04',''))
cleanup.import(dat, datevars='d', dateformat='%m/%d/%y', fixdates='year')

dat <- data.frame(a=(1:3)/7, y=c('a','b1','b2'), z=1:3)
dat2 <- upData(dat, x=x^2, x=x-5, m=x/10,
               rename=c(a='x'), drop='z',
               labels=c(x='X', y='test'),
               levels=list(y=list(a='a',b=c('b1','b2'))))
dat2
describe(dat2)
dat <- dat2    # copy to original name and delete dat2 if OK
rm(dat2)
dat3 <- upData(dat, X=X^2, subset = x < (3/7)^2 - 5, rename=c(x='X'))

# Remove hard to analyze variables from a redundancy analysis of all
# variables in the data frame
d <- dataframeReduce(dat, fracmiss=.1, minprev=.05, maxlevels=5)
# Could run redun(~., data=d) at this point or include dataframeReduce
# arguments in the call to redun

# If you import a SAS dataset created by PROC CONTENTS CNTLOUT=x.datadict,
# the LABELs from this dataset can be added to the data.  Let's also
# convert names to lower case for the main data file
## Not run: 
mydata2 <- cleanup.import(mydata2, lowernames=TRUE, sasdict=datadict)

## End(Not run)

Change First Letters to Upper Case
Description
Changes the first letter of each word in a string to upper case, keeping selected words in lower case. Words containing at least 2 capital letters are kept as-is.
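A brief sketch of the rule described above, assuming Hmisc is installed. The input strings are invented, and no particular output is claimed beyond what the Description states: a word with at least two capital letters (here, FDA) is kept as-is.

```r
library(Hmisc)

titles <- c('this and that', 'approval by the FDA')
res <- upFirst(titles)   # one converted string per input string
res
```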
Usage
upFirst(txt, lower = FALSE, alllower = FALSE)

Arguments
txt | a character vector |
lower | set to |
alllower | set to |
References
https://en.wikipedia.org/wiki/Letter_case#Headings_and_publication_titles
Examples
upFirst(c('this and that','that is Beyond question'))

Store Descriptive Information About an Object
Description
Functions get or set useful information about the contents of the object for later use.
Usage
valueTags(x)
valueTags(x) <- value
valueLabel(x)
valueLabel(x) <- value
valueName(x)
valueName(x) <- value
valueUnit(x)
valueUnit(x) <- value

Arguments
x | an object |
value | for |
Details
These functions store a short name for the contents, a longer label that is useful for display, and the units of the contents, which are also useful for display.
valueTags is an accessor, and valueTags<- is a replacement function for all of the value's information.
valueName is an accessor, and valueName<- is a replacement function for the value's name. This name is used when a plot or a latex table needs a short name and the variable name is not useful.
valueLabel is an accessor, and valueLabel<- is a replacement function for the value's label. The label is used in plots or latex tables when they need a descriptive name.
valueUnit is an accessor, and valueUnit<- is a replacement function for the value's unit. The unit is used to add unit information to the R output.
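The accessor and replacement pairs described above can be exercised as in the following sketch, which assumes Hmisc is installed. The vector sbp and the strings assigned to it are illustrative only.

```r
library(Hmisc)

sbp <- c(120, 135, 110)
valueName(sbp)  <- 'SBP'                      # short name
valueLabel(sbp) <- 'Systolic Blood Pressure'  # descriptive label
valueUnit(sbp)  <- 'mmHg'                     # unit

valueLabel(sbp)   # reads the label back
valueTags(sbp)    # all stored value information, if any
```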
Value
valueTags returns NULL or a named list with each of the named values name, label, unit set if they exist in the object.
valueTags<- returns a list.
valueName, valueLabel, and valueUnit return NULL or a character vector of length 1.
valueName<-, valueLabel<-, and valueUnit<- return value.
Author(s)
Charles Dupont
See Also
Examples
age <- c(21,65,43)
y   <- 1:3
valueLabel(age) <- "Age in Years"
plot(age, y, xlab=valueLabel(age))

x1 <- 1:10
x2 <- 10:1
valueLabel(x2) <- 'Label for x2'
valueUnit(x2) <- 'mmHg'
x2
x2[1:5]
dframe <- data.frame(x1, x2)
Label(dframe)

##In these examples of llist, note that labels are printed after
##variable names, because of print.labelled
a <- 1:3
b <- 4:6
valueLabel(b) <- 'B Label'

Variable Clustering
Description
Does a hierarchical cluster analysis on variables, using the Hoeffding D statistic, squared Pearson or Spearman correlations, or proportion of observations for which two variables are both positive as similarity measures. Variable clustering is used for assessing collinearity, redundancy, and for separating variables into clusters that can be scored as a single variable, thus resulting in data reduction. For computing any of the three similarity measures, pairwise deletion of NAs is done. The clustering is done by hclust(). A small function naclus is also provided which depicts similarities in which observations are missing for variables in a data frame. The similarity measure is the fraction of NAs in common between any two variables. The diagonals of this sim matrix are the fraction of NAs in each variable by itself. naclus also computes na.per.obs, the number of missing variables in each observation, and mean.na, a vector whose ith element is the mean number of missing variables other than variable i, for observations in which variable i is missing. The naplot function makes several plots (see the which argument).
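The missingness similarity described above can be reproduced with a few lines of base R. This is only a sketch of the definition (fraction of observations in which two variables are both NA), not the code used inside naclus, and the data frame is invented.

```r
df <- data.frame(a = c(1, 2, 3, 4),
                 b = c(1, NA, 3, NA),
                 c = c(NA, NA, 3, 4))

miss <- is.na(as.matrix(df)) * 1       # 1 where a value is missing
sim  <- crossprod(miss) / nrow(miss)   # off-diagonal: fraction missing on both
                                       # diagonal: fraction missing per variable
sim

na.per.obs <- rowSums(miss)            # number of missing variables per row
na.per.obs
```

Here b and c are both missing only in the second row, so their similarity is 1/4, while each is missing in half of the rows on its own.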
So as to not generate too many dummy variables for multi-valued character or categorical predictors, varclus will automatically combine infrequent cells of such variables using combine.levels.
plotMultSim plots multiple similarity matrices, with the similarity measure being on the x-axis of each subplot.
na.pattern prints a frequency table of all combinations of missingness for multiple variables. If there are 3 variables, a frequency table entry labeled 110 corresponds to the number of observations for which the first and second variables were missing but the third variable was not missing.
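A short sketch of the pattern coding just described, assuming Hmisc is installed. The three-column data frame is made up so that the second row is missing on the first two variables only, which corresponds to the label 110.

```r
library(Hmisc)

df <- data.frame(a = c(1, NA, 3),
                 b = c(NA, NA, 3),
                 c = c(1, 2, 3))

p <- na.pattern(df)   # one digit per variable, 1 = missing
p
```

The frequencies sum to the number of rows, since every observation falls into exactly one pattern.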
Usage
varclus(x, similarity=c("spearman","pearson","hoeffding","bothpos","ccbothpos"),
        type=c("data.matrix","similarity.matrix"),
        method="complete",
        data=NULL, subset=NULL, na.action=na.retain,
        trans=c("square", "abs", "none"), ...)

## S3 method for class 'varclus'
print(x, abbrev=FALSE, ...)

## S3 method for class 'varclus'
plot(x, ylab, abbrev=FALSE, legend.=FALSE, loc, maxlen, labels, ...)

naclus(df, method)

naplot(obj, which=c('all','na per var','na per obs','mean na',
                    'na per var vs mean na'), ...)

plotMultSim(s, x=1:dim(s)[3],
            slim=range(pretty(c(0,max(s,na.rm=TRUE)))),
            slimds=FALSE,
            add=FALSE, lty=par('lty'), col=par('col'),
            lwd=par('lwd'), vname=NULL, h=.5, w=.75, u=.05,
            labelx=TRUE, xspace=.35)

na.pattern(x)

Arguments
x | a formula, a numeric matrix of predictors, or a similarity matrix. If For |
df | a data frame |
s | an array of similarity matrices. The third dimension of this array corresponds to different computations of similarities. The first two dimensions come from a single similarity matrix. This is useful for displaying similarity matrices computed by |
similarity | the default is to use squared Spearman correlation coefficients, which will detect monotonic but nonlinear relationships. You can also specify linear correlation or Hoeffding's (1948) D statistic, which has the advantage of being sensitive to many types of dependence, including highly non-monotonic relationships. For binary data, or data to be made binary, |
type | if |
method | see |
data | a data frame, data table, or list |
subset | a standard subsetting expression |
na.action | These may be specified if |
trans | By default, when the similarity measure is based on Pearson's or Spearman's correlation coefficients, the coefficients are squared. Specify |
... | for |
ylab | y-axis label. Default is constructed on the basis of |
legend. | set to |
loc | a list with elements |
maxlen | if a legend is plotted describing abbreviations, original labels longer than |
labels | a vector of character strings containing labels corresponding to columns in the similarity matrix, if the column names of that matrix are not to be used |
obj | an object created by |
which | defaults to |
abbrev | set to |
slim | 2-vector specifying the range of similarity values for scaling the y-axes. By default this is the observed range over all of |
slimds | set to |
add | set to |
lty,col,lwd | line type, color, or line thickness for |
vname | optional vector of variable names, in order, used in |
h | relative height for subplot |
w | relative width for subplot |
u | relative extra height and width to leave unused inside the subplot. Also used as the space between y-axis tick mark labels and graph border. |
labelx | set to |
xspace | amount of space, on a scale of 1: |
Details
options(contrasts= c("contr.treatment", "contr.poly")) is issued temporarily by varclus to make sure that ordinary dummy variables are generated for factor variables. Pass arguments to the dataframeReduce function to remove problematic variables (especially if analyzing all variables in a data frame).
Value
for varclus or naclus, a list of class varclus with elements call (containing the calling statement), sim (similarity matrix), n (sample size used if x was not a correlation matrix already - n is a matrix), hclust, the object created by hclust, similarity, and method. naclus also returns the two vectors listed under description, and naplot returns an invisible vector that is the frequency table of the number of missing variables per observation. plotMultSim invisibly returns the limits of similarities used in constructing the y-axes of each subplot. For similarity="ccbothpos" the hclust object is NULL.
na.pattern creates an integer vector of frequencies.
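To make the sim and hclust elements more concrete, here is a base-R sketch of the default similarity (squared Spearman correlations) followed by hierarchical clustering. It assumes that the clustering operates on 1 - similarity as the distance, which is an assumption of this sketch rather than something stated on this page, and it omits everything else varclus does (formula handling, dummy variables, pairwise deletion of NAs).

```r
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + x2 + rnorm(200)
x  <- cbind(x1, x2, x3)

sim <- cor(x, method = "spearman")^2               # squared Spearman similarity
hc  <- hclust(as.dist(1 - sim), method = "complete")

round(sim, 2)
hc$merge    # order in which variables or clusters are joined
```

Calling plot(hc) draws a dendrogram whose height axis is 1 - similarity.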
Side Effects
plots
Author(s)
Frank Harrell
Department of Biostatistics, Vanderbilt University
fh@fharrell.com
References
Sarle, WS: The VARCLUS Procedure. SAS/STAT User's Guide, 4th Edition, 1990. Cary NC: SAS Institute, Inc.
Hoeffding W. (1948): A non-parametric test of independence. Ann Math Stat 19:546–57.
See Also
hclust, plclust, hoeffd, rcorr, cor, model.matrix, locator, na.pattern, cut2, combine.levels
Examples
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + x2 + rnorm(200)
x4 <- x2 + rnorm(200)
x <- cbind(x1,x2,x3,x4)
v <- varclus(x, similarity="spear")  # spearman is the default anyway
v    # invokes print.varclus
print(round(v$sim,2))
plot(v)

# Convert the dendrogram to be horizontal
v <- as.dendrogram(v$hclust)
plot(v, horiz=TRUE, axes=FALSE,
     xlab=expression(paste('Spearman ', rho^2)))
rh <- seq(0, 1, by=0.1)   # re-label x-axis re:similarity not distance
axis(1, at=1 - rh, labels=format(rh))

# plot(varclus(~ age + sys.bp + dias.bp + country - 1), abbrev=TRUE)
# the -1 causes k dummies to be generated for k countries
# plot(varclus(~ age + factor(disease.code) - 1))
#
#
# use varclus(~., data= fracmiss= maxlevels= minprev=) to analyze all
# "useful" variables - see dataframeReduce for details about arguments

df <- data.frame(a=c(1,2,3),b=c(1,2,3),c=c(1,2,NA),d=c(1,NA,3),
                 e=c(1,NA,3),f=c(NA,NA,NA),g=c(NA,2,3),h=c(NA,NA,3))
par(mfrow=c(2,2))
for(m in c("ward","complete","median")) {
  plot(naclus(df, method=m))
  title(m)
}
naplot(naclus(df))
n <- naclus(df)
plot(n); naplot(n)
na.pattern(df)

# plotMultSim example: Plot proportion of observations
# for which two variables are both positive (diagonals
# show the proportion of observations for which the
# one variable is positive).  Chance-correct the
# off-diagonals by subtracting the product of the
# marginal proportions.  On each subplot the x-axis
# shows month (0, 4, 8, 12) and there is a separate
# curve for females and males
d <- data.frame(sex=sample(c('female','male'),1000,TRUE),
                month=sample(c(0,4,8,12),1000,TRUE),
                x1=sample(0:1,1000,TRUE),
                x2=sample(0:1,1000,TRUE),
                x3=sample(0:1,1000,TRUE))
s <- array(NA, c(3,3,4))
opar <- par(mar=c(0,0,4.1,0))  # waste less space
for(sx in c('female','male')) {
  for(i in 1:4) {
    mon <- (i-1)*4
    s[,,i] <- varclus(~x1 + x2 + x3, sim='ccbothpos', data=d,
                      subset=d$month==mon & d$sex==sx)$sim
  }
  plotMultSim(s, c(0,4,8,12), vname=c('x1','x2','x3'),
              add=sx=='male', slimds=TRUE,
              lty=1+(sx=='male'))
  # slimds=TRUE causes separate scaling for diagonals and
  # off-diagonals
}
par(opar)

vlab
Description
Easily Retrieve Text Form of Labels/Units
Usage
vlab(x, name = NULL)

Arguments
x | a single variable name, unquoted |
name | optional character string to use as variable name |
Details
Uses the same search method as hlab and returns the label and units in a single character string, with units, if present, in brackets.
Value
character string
Author(s)
Frank Harrell
See Also
Weighted Statistical Estimates
Description
These functions compute various weighted versions of standard estimators. In most cases the weights vector is a vector the same length as x, containing frequency counts that in effect expand x by these counts. weights can also be sampling weights, in which case setting normwt to TRUE will often be appropriate. This results in making weights sum to the length of the non-missing elements in x. normwt=TRUE thus reflects the fact that the true sample size is the length of the x vector and not the sum of the original values of weights (which would be appropriate had normwt=FALSE). When weights is all ones, the estimates are all identical to unweighted estimates (unless one of the non-default quantile estimation options is specified to wtd.quantile). When missing data have already been deleted for x, weights, and (in the case of wtd.loess.noiter) y, specifying na.rm=FALSE will save computation time. Omitting the weights argument or specifying NULL or a zero-length vector will result in the usual unweighted estimates.
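The statement that frequency weights "in effect expand x" can be checked directly. The sketch below assumes Hmisc is installed and uses invented values; it compares the weighted estimates with ordinary estimates computed on the expanded vector rep(x, w).

```r
library(Hmisc)

x <- c(2, 4, 7, 10)
w <- c(3, 1, 2, 4)          # integer frequency counts
expanded <- rep(x, w)       # each x[i] repeated w[i] times

wtd.mean(x, w)              # weighted mean
mean(expanded)              # same value from the expanded data

wtd.var(x, w)               # default normwt=FALSE, method='unbiased'
var(expanded)               # same value from the expanded data
```

The agreement of the variances relies on the default normwt=FALSE, for which the denominator is the sum of the weights minus 1.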
wtd.mean, wtd.var, and wtd.quantile compute weighted means, variances, and quantiles, respectively. wtd.Ecdf computes a weighted empirical distribution function. wtd.table computes a weighted frequency table (although only one stratification variable is supported at present). wtd.rank computes weighted ranks, using mid–ranks for ties. This can be used to obtain Wilcoxon tests and rank correlation coefficients. wtd.loess.noiter is a weighted version of loess.smooth when no iterations for outlier rejection are desired. This results in especially good smoothing when y is binary. wtd.quantile removes any observations with zero weight at the beginning. Previously, these were changing the quantile estimates.
num.denom.setup is a utility function that allows one to deal with observations containing numbers of events and numbers of trials, by outputting two observations when the number of events and non-events (trials - events) exceed zero. A vector of subscripts is generated that will do the proper duplications of observations, and a new binary variable y is created along with usual cell frequencies (weights) for each of the y=0, y=1 cells per observation.
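As a sketch of the restructuring described above, assuming Hmisc is installed and that the returned list has components named subs, weights, and y (the Examples use w$subs; the other two names are taken from the description and are an assumption here):

```r
library(Hmisc)

num   <- c(10, 0, NA)     # events:  10/12, 0/25, NA/30
denom <- c(12, 25, 30)    # trials
w <- num.denom.setup(num, denom)
w

# Total weight over the usable records equals the total number of trials,
# and the weight on y=1 rows equals the total number of events
sum(w$weights)
sum(w$weights[w$y == 1])
```

The subscripts in w$subs can then be used to duplicate the rows of the original data before fitting a weighted binary model.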
Usage
wtd.mean(x, weights=NULL, normwt="ignored", na.rm=TRUE)

wtd.var(x, weights=NULL, normwt=FALSE, na.rm=TRUE,
        method=c('unbiased', 'ML'))

wtd.quantile(x, weights=NULL, probs=c(0, .25, .5, .75, 1),
             type=c('quantile','(i-1)/(n-1)','i/(n+1)','i/n'),
             normwt=FALSE, na.rm=TRUE)

wtd.Ecdf(x, weights=NULL,
         type=c('i/n','(i-1)/(n-1)','i/(n+1)'),
         normwt=FALSE, na.rm=TRUE)

wtd.table(x, weights=NULL, type=c('list','table'),
          normwt=FALSE, na.rm=TRUE)

wtd.rank(x, weights=NULL, normwt=FALSE, na.rm=TRUE)

wtd.loess.noiter(x, y, weights=rep(1,n),
                 span=2/3, degree=1, cell=.13333,
                 type=c('all','ordered all','evaluate'),
                 evaluation=100, na.rm=TRUE)

num.denom.setup(num, denom)

Arguments
x | a numeric vector (may be a character or |
num | vector of numerator frequencies |
denom | vector of denominators (numbers of trials) |
weights | a numeric vector of weights |
normwt | specify |
na.rm | set to |
method | determines the estimator type; if |
probs | a vector of quantiles to compute. Default is 0 (min), .25, .5, .75, 1(max). |
type | For |
y | a numeric vector the same length as |
span,degree,cell,evaluation | see |
Details
The functions correctly combine weights of observations having duplicate values of x before computing estimates.
When normwt=FALSE the weighted variance will not equal the unweighted variance even if the weights are identical. That is because of the subtraction of 1 from the sum of the weights in the denominator of the variance formula. If you want the weighted variance to equal the unweighted variance when weights do not vary, use normwt=TRUE. The articles by Gatz and Smith discuss alternative approaches, to arrive at estimators of the standard error of a weighted mean.
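A small numeric illustration of this point, assuming Hmisc is installed. With five observations each given weight 2, the default estimator divides by sum(weights) - 1 = 9 rather than by n - 1 = 4 (after scaling), so it differs from var(x); normwt=TRUE rescales the weights to sum to 5 first.

```r
library(Hmisc)

x <- 1:5
w <- rep(2, 5)                   # identical weights

var(x)                           # unweighted variance: 10 / 4 = 2.5
wtd.var(x, w)                    # sum(w * (x - 3)^2) / (sum(w) - 1) = 20 / 9
wtd.var(x, w, normwt = TRUE)     # matches var(x) when weights do not vary
```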
wtd.rank does not handle NAs as elegantly as rank if weights is specified.
Value
wtd.mean and wtd.var return scalars. wtd.quantile returns a vector the same length as probs. wtd.Ecdf returns a list whose elements x and Ecdf correspond to unique sorted values of x. If the first CDF estimate is greater than zero, a point (min(x),0) is placed at the beginning of the estimates. See above for wtd.table. wtd.rank returns a vector the same length as x (after removal of NAs, depending on na.rm). See above for wtd.loess.noiter.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University School of Medicine
fh@fharrell.com
Benjamin Tyner
btyner@gmail.com
References
Research Triangle Institute (1995): SUDAAN User's Manual, Release 6.40, pp. 8-16 to 8-17.
Gatz DF, Smith L (1995): The standard error of a weighted mean concentration–I. Bootstrapping vs other methods. Atmospheric Env 11:1185-1193.
Gatz DF, Smith L (1995): The standard error of a weighted mean concentration–II. Estimating confidence intervals. Atmospheric Env 29:1195-1200.
https://en.wikipedia.org/wiki/Weighted_arithmetic_mean
See Also
mean, var, quantile, table, rank, loess.smooth, lowess, plsmo, Ecdf, somers2, describe
Examples
set.seed(1)
x <- runif(500)
wts <- sample(1:6, 500, TRUE)
std.dev <- sqrt(wtd.var(x, wts))
wtd.quantile(x, wts)
death <- sample(0:1, 500, TRUE)
plot(wtd.loess.noiter(x, death, wts, type='evaluate'))
describe(~x, weights=wts)
# describe uses wtd.mean, wtd.quantile, wtd.table
xg <- cut2(x,g=4)
table(xg)
wtd.table(xg, wts, type='table')

# Here is a method for getting stratified weighted means
y <- runif(500)
g <- function(y) wtd.mean(y[,1],y[,2])
summarize(cbind(y, wts), llist(xg), g, stat.name='y')

# Empirically determine how methods used by wtd.quantile match with
# methods used by quantile, when all weights are unity
set.seed(1)
u <- eval(formals(wtd.quantile)$type)
v <- as.character(1:9)
r <- matrix(0, nrow=length(u), ncol=9, dimnames=list(u,v))

for(n in c(8, 13, 22, 29)) {
  x <- rnorm(n)
  for(i in 1:5) {
    probs <- sort( runif(9))
    for(wtype in u) {
      wq <- wtd.quantile(x, type=wtype, weights=rep(1,length(x)), probs=probs)
      for(qtype in 1:9) {
        rq <- quantile(x, type=qtype, probs=probs)
        r[wtype, qtype] <- max(r[wtype,qtype], max(abs(wq-rq)))
      }
    }
  }
}

r

# Restructure data to generate a dichotomous response variable
# from records containing numbers of events and numbers of trials
num   <- c(10,NA,20,0,15)   # data are 10/12 NA/999 20/20 0/25 15/35
denom <- c(12,999,20,25,35)
w     <- num.denom.setup(num, denom)
w
# attach(my.data.frame[w$subs,])

xyplot and dotplot with Matrix Variables to Plot Error Bars and Bands
Description
A utility function Cbind returns the first argument as a vector and combines all other arguments into a matrix stored as an attribute called "other". The arguments can be named (e.g., Cbind(pressure=y,ylow,yhigh)) or a label attribute may be pre-attached to the first argument. In either case, the name or label of the first argument is stored as an attribute "label" of the object returned by Cbind. Storing other vectors as a matrix attribute facilitates plotting error bars, etc., as trellis really wants the x- and y-variables to be vectors, not matrices. If a single argument is given to Cbind and that argument is a matrix with column dimnames, the first column is taken as the main vector and remaining columns are taken as "other". A subscript method for Cbind objects subscripts the other matrix along with the main y vector.
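The structure of a Cbind object can be inspected as in this sketch, which assumes Hmisc is installed. The vectors y, lo, and hi are invented for illustration.

```r
library(Hmisc)

y  <- c(1, 2, 3, 4)
lo <- y - 0.5
hi <- y + 0.5

z <- Cbind(y, lo, hi)
attr(z, 'other')        # matrix holding the remaining columns (lo, hi)
attr(z, 'label')        # name or label of the first argument

z2 <- z[2:3]            # subscripting also subsets the "other" matrix
attr(z2, 'other')
```

Keeping the main response as a plain vector with the limits riding along as an attribute is what lets the object appear on the left-hand side of an xYplot or Dotplot formula.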
The xYplot function is a substitute for xyplot that allows for simulated multi-column y. It uses by default the panel.xYplot and prepanel.xYplot functions to do the actual work. The method argument passed to panel.xYplot from xYplot allows you to make error bars, the upper-only or lower-only portions of error bars, alternating lower-only and upper-only bars, bands, or filled bands. panel.xYplot decides how to alternate upper and lower bars according to whether the median y value of the current main data line is above the median y for all groups of lines or not. If the median is above the overall median, only the upper bar is drawn. For bands (but not 'filled bands'), any number of other columns of y will be drawn as lines having the same thickness, color, and type as the main data line. If plotting bars, bands, or filled bands and only one additional column is specified for the response variable, that column is taken as the half width of a precision interval for y, and the lower and upper values are computed automatically as y plus or minus the value of the additional column variable.
When a groups variable is present, panel.xYplot will create a function in frame 0 (.GlobalEnv in R) called Key that when invoked will draw a key describing the groups labels, point symbols, and colors. By default, the key is outside the graph. For S-Plus, if Key(locator(1)) is specified, the key will appear so that its upper left corner is at the coordinates of the mouse click. For R/Lattice the first two arguments of Key (x and y) are fractions of the page, measured from the lower left corner, and the default placement is at x=0.05, y=0.95. For R, an optional argument to sKey, other, may contain a list of arguments to pass to draw.key (see xyplot for a list of possible arguments, under the key option).
When method="quantile" is specified, xYplot automatically groups the x variable into intervals containing a target of nx observations each, and within each x group computes three quantiles of y and plots these as three lines. The mean x within each x group is taken as the x-coordinate. This will make a useful empirical display for large datasets in which scatter diagrams are too busy to see patterns of central tendency and variability. You can also specify a general function of a data vector that returns a matrix of statistics for the method argument. Arguments can be passed to that function via a list methodArgs. The statistic in the first column should be the measure of central tendency. Examples of useful method functions are those listed under the help file for summary.formula such as smean.cl.normal.
xYplot can also produce bubble plots. This is done when size is specified to xYplot. When size is used, a function sKey is generated for drawing a key to the character sizes. size can also specify a vector where the first character of each observation is used as the plotting symbol, if rangeCex is set to a single cex value. An optional argument to sKey, other, may contain a list of arguments to pass to draw.key (see xyplot for a list of possible arguments, under the key option). See the bubble plot example.
Dotplot is a substitute for dotplot allowing for a matrix x-variable, automatic superpositioning when groups is present, and creation of a Key function. When the x-variable (created by Cbind to simulate a matrix) contains a total of 3 columns, the first column specifies where the dot is positioned, and the last 2 columns specify starting and ending points for intervals. The intervals are shown using line type, width, and color from the trellis plot.line list. By default, you will usually see a darker line segment for the low and high values, with the dotted reference line elsewhere. A good choice of the pch argument for such plots is 3 (plus sign) if you want to emphasize the interval more than the point estimate. When the x-variable contains a total of 5 columns, the 2nd and 5th columns are treated as the 2nd and 3rd are treated above, and the 3rd and 4th columns define an inner line segment that will have twice the thickness of the outer segments. In addition, tick marks separate the outer and inner segments. This type of display (an example of which appeared in The Elements of Graphing Data by Cleveland) is very suitable for displaying two confidence levels (e.g., 0.9 and 0.99) or the 0.05, 0.25, 0.75, 0.95 sample quantiles, for example. For this display, the central point displays well with a default circle symbol.
setTrellis sets nice defaults for Trellis graphics, assuming that the graphics device has already been opened if using postscript, etc. By default, it sets panel strips to blank and reference dot lines to thickness 1 instead of the Trellis default of 2.
numericScale is a utility function that facilitates using xYplot to plot variables that are not considered to be numeric but which can readily be converted to numeric using as.numeric(). numericScale by default will keep the name of the input variable as a label attribute for the new numeric variable.
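A minimal sketch of numericScale, assuming Hmisc is installed. A Date vector is used here as an example of a variable that is not numeric but converts cleanly with as.numeric(); the variable name visit is made up.

```r
library(Hmisc)

visit <- as.Date(c('2020-01-01', '2020-02-01', '2020-03-01'))
v <- numericScale(visit)

v                   # numeric values from as.numeric(visit)
attr(v, 'label')    # label kept for use in axis labeling
```

A label can also be supplied explicitly through the label argument shown in Usage.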
Usage
Cbind(...)

xYplot(formula, data = sys.frame(sys.parent()), groups,
       subset, xlab=NULL, ylab=NULL, ylim=NULL,
       panel=panel.xYplot, prepanel=prepanel.xYplot, scales=NULL,
       minor.ticks=NULL, sub=NULL, ...)

panel.xYplot(x, y, subscripts, groups=NULL,
             type=if(is.function(method) || method=='quantiles')
               'b' else 'p',
             method=c("bars", "bands", "upper bars", "lower bars",
                      "alt bars", "quantiles", "filled bands"),
             methodArgs=NULL, label.curves=TRUE, abline,
             probs=c(.5,.25,.75), nx=NULL,
             cap=0.015, lty.bar=1,
             lwd=plot.line$lwd, lty=plot.line$lty, pch=plot.symbol$pch,
             cex=plot.symbol$cex, font=plot.symbol$font, col=NULL,
             lwd.bands=NULL, lty.bands=NULL, col.bands=NULL,
             minor.ticks=NULL, col.fill=NULL,
             size=NULL, rangeCex=c(.5,3), ...)

prepanel.xYplot(x, y, ...)

Dotplot(formula, data = sys.frame(sys.parent()), groups, subset,
        xlab = NULL, ylab = NULL, ylim = NULL,
        panel=panel.Dotplot, prepanel=prepanel.Dotplot,
        scales=NULL, xscale=NULL, ...)

prepanel.Dotplot(x, y, ...)

panel.Dotplot(x, y, groups = NULL,
              pch  = dot.symbol$pch,
              col  = dot.symbol$col, cex = dot.symbol$cex,
              font = dot.symbol$font, abline, ...)

setTrellis(strip.blank=TRUE, lty.dot.line=2, lwd.dot.line=1)

numericScale(x, label=NULL, ...)

Arguments
... | for Also can be other arguments to pass to |
formula | a |
x |
|
y | a vector, or an object created by |
data,subset,ylim,subscripts,groups,type,scales,panel,prepanel,xlab,ylab | see |
xscale | allows one to use the default |
method | defaults to |
methodArgs | a list containing optional arguments to be passed to the function specifiedin |
label.curves | set to |
abline | a list of arguments to pass to |
probs | a vector of three quantiles with the quantile corresponding to the centralline listed first. By default |
nx | number of target observations for each |
cap | the half-width of horizontal end pieces for error bars, as a fraction of the length of the |
lty.bar | line type for bars |
lwd,lty,pch,cex,font,col | see |
lty.bands,lwd.bands,col.bands | used to allow |
minor.ticks | a list with elements |
sub | an optional subtitle |
col.fill | used to override default colors used for the bands in method='filled bands'. This is a vector when |
size | a vector the same length as |
rangeCex | a vector of two values specifying the range in character sizes to use for the |
strip.blank | set to |
lty.dot.line | line type for dot plot reference lines (default = 2 for dotted) |
lwd.dot.line | line thickness for reference lines for dot plots (default = 1) |
label | a scalar character string to be used as a variable label after |
Details
Unlike xyplot, xYplot senses the presence of a groups variable and automatically invokes panel.superpose instead of panel.xyplot. The same is true for Dotplot vs. dotplot.
Value
Cbind returns a matrix with attributes. Other functions return standard trellis results.
Side Effects
plots, and panel.xYplot may create temporary Key and sKey functions in the session frame.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
Madeline Bauer
Department of Infectious Diseases
University of Southern California School of Medicine
mbauer@usc.edu
See Also
xyplot, panel.xyplot, summarize, label, labcurve, errbar, dotplot, reShape, cut2, panel.abline
Examples
# Plot 6 smooth functions. Superpose 3, panel 2.
# Label curves with p=1,2,3 where most separated
d <- expand.grid(x=seq(0,2*pi,length=150), p=1:3, shift=c(0,pi))
xYplot(sin(x+shift)^p ~ x | shift, groups=p, data=d, type='l')
# Use a key instead, use 3 line widths instead of 3 colors
# Put key in most empty portion of each panel
xYplot(sin(x+shift)^p ~ x | shift, groups=p, data=d,
       type='l', keys='lines', lwd=1:3, col=1)
# Instead of implicitly using labcurve(), put a
# single key outside of panels at lower left corner
xYplot(sin(x+shift)^p ~ x | shift, groups=p, data=d,
       type='l', label.curves=FALSE, lwd=1:3, col=1, lty=1:3)
Key()

# Bubble plots
x <- y <- 1:8
x[2] <- NA
units(x) <- 'cm^2'
z <- 101:108
p <- factor(rep(c('a','b'),4))
g <- c(rep(1,7),2)
data.frame(p, x, y, z, g)
xYplot(y ~ x | p, groups=g, size=z)
Key(other=list(title='g', cex.title=1.2))   # draw key for colors
sKey(.2,.85,other=list(title='Z Values', cex.title=1.2))
# draw key for character sizes

# Show the median and quartiles of height given age, stratified
# by sex and race. Draws 2 sets (male, female) of 3 lines per panel.
# xYplot(height ~ age | race, groups=sex, method='quantiles')

# Examples of plotting raw data
dfr <- expand.grid(month=1:12, continent=c('Europe','USA'),
                   sex=c('female','male'))
set.seed(1)
dfr <- upData(dfr,
              y=month/10 + 1*(sex=='female') + 2*(continent=='Europe') +
                runif(48,-.15,.15),
              lower=y - runif(48,.05,.15),
              upper=y + runif(48,.05,.15))

xYplot(Cbind(y,lower,upper) ~ month,subset=sex=='male' & continent=='USA',
       data=dfr)
xYplot(Cbind(y,lower,upper) ~ month|continent, subset=sex=='male',data=dfr)
xYplot(Cbind(y,lower,upper) ~ month|continent, groups=sex, data=dfr); Key()
# add ,label.curves=FALSE to suppress use of labcurve to label curves where
# farthest apart

xYplot(Cbind(y,lower,upper) ~ month,groups=sex,
       subset=continent=='Europe', data=dfr)
xYplot(Cbind(y,lower,upper) ~ month,groups=sex, type='b',
       subset=continent=='Europe', keys='lines',
       data=dfr)
# keys='lines' causes labcurve to draw a legend where the panel is most empty

xYplot(Cbind(y,lower,upper) ~ month,groups=sex, type='b', data=dfr,
       subset=continent=='Europe',method='bands')
xYplot(Cbind(y,lower,upper) ~ month,groups=sex, type='b', data=dfr,
       subset=continent=='Europe',method='upper')

label(dfr$y) <- 'Quality of Life Score'
# label is in Hmisc library = attr(y,'label') <- 'Quality...'; will be
# y-axis label
# can also specify Cbind('Quality of Life Score'=y,lower,upper)
xYplot(Cbind(y,lower,upper) ~ month, groups=sex,
       subset=continent=='Europe', method='alt bars',
       offset=grid::unit(.1,'inches'), type='b', data=dfr)
# offset passed to labcurve to label .4 y units away from curve
# for R (using grid/lattice), offset is specified using the grid
# unit function, e.g., offset=grid::unit(.4,'native') or
# offset=grid::unit(.1,'inches') or grid::unit(.05,'npc')

# The following example uses the summarize function in Hmisc to
# compute the median and outer quartiles. The outer quartiles are
# displayed using "error bars"
set.seed(111)
dfr <- expand.grid(month=1:12, year=c(1997,1998), reps=1:100)
month <- dfr$month; year <- dfr$year
y <- abs(month-6.5) + 2*runif(length(month)) + year-1997
s <- summarize(y, llist(month,year), smedian.hilow, conf.int=.5)
xYplot(Cbind(y,Lower,Upper) ~ month, groups=year, data=s,
       keys='lines', method='alt', type='b')
# Can also do:
s <- summarize(y, llist(month,year), quantile, probs=c(.5,.25,.75),
               stat.name=c('y','Q1','Q3'))
xYplot(Cbind(y, Q1, Q3) ~ month, groups=year, data=s,
       type='b', keys='lines')
# Or:
xYplot(y ~ month, groups=year, keys='lines', nx=FALSE, method='quantile',
       type='b')
# nx=FALSE means to treat month as a discrete variable

# To display means and bootstrapped nonparametric confidence intervals
# use:
s <- summarize(y, llist(month,year), smean.cl.boot)
s
xYplot(Cbind(y, Lower, Upper) ~ month | year, data=s, type='b')
# Can also use Y <- cbind(y, Lower, Upper); xYplot(Cbind(Y) ~ ...)
# Or:
xYplot(y ~ month | year, nx=FALSE, method=smean.cl.boot, type='b')

# This example uses the summarize function in Hmisc to
# compute the median and outer quartiles. The outer quartiles are
# displayed using "filled bands"
s <- summarize(y, llist(month,year), smedian.hilow, conf.int=.5)

# filled bands: default fill = pastel colors matching solid colors
# in superpose.line (this works differently in R)
xYplot ( Cbind ( y, Lower, Upper ) ~ month, groups=year,
         method="filled bands" , data=s, type="l")

# note colors based on levels of selected subgroups, not first two colors
xYplot ( Cbind ( y, Lower, Upper ) ~ month, groups=year,
         method="filled bands" , data=s, type="l",
         subset=(year == 1998 | year == 2000), label.curves=FALSE )

# filled bands using black lines with selected solid colors for fill
xYplot ( Cbind ( y, Lower, Upper ) ~ month, groups=year,
         method="filled bands" , data=s, label.curves=FALSE,
         type="l", col=1, col.fill = 2:3)
Key(.5,.8,col = 2:3) #use fill colors in key

# A good way to check for stable variance of residuals from ols
# xYplot(resid(fit) ~ fitted(fit), method=smean.sdl)
# smean.sdl is defined with summary.formula in Hmisc

# Plot y vs. a special variable x
# xYplot(y ~ numericScale(x, label='Label for X') | country)
# For this example could omit label= and specify
#    y ~ numericScale(x) | country, xlab='Label for X'

# Here is an example of using xYplot with several options
# to change various Trellis parameters,
# xYplot(y ~ x | z, groups=v, pch=c('1','2','3'),
#        layout=c(3,1),     # 3 panels side by side
#        ylab='Y Label', xlab='X Label',
#        main=list('Main Title', cex=1.5),
#        par.strip.text=list(cex=1.2),
#        strip=function(...) strip.default(..., style=1),
#        scales=list(alternating=FALSE))

#
# Dotplot examples
#
s <- summarize(y, llist(month,year), smedian.hilow, conf.int=.5)

setTrellis()            # blank conditioning panel backgrounds
Dotplot(month ~ Cbind(y, Lower, Upper) | year, data=s)
# or Cbind(...), groups=year, data=s

# Display a 5-number (5-quantile) summary (2 intervals, dot=median)
# Note that summarize produces a matrix for y, and Cbind(y) trusts the
# first column to be the point estimate (here the median)
s <- summarize(y, llist(month,year), quantile,
               probs=c(.5,.05,.25,.75,.95), type='matrix')
Dotplot(month ~ Cbind(y) | year, data=s)
# Use factor(year) to make actual years appear in conditioning title strips

# Plot proportions and their Wilson confidence limits
set.seed(3)
d <- expand.grid(continent=c('USA','Europe'), year=1999:2001,
                 reps=1:100)
# Generate binary events from a population probability of 0.2
# of the event, same for all years and continents
d$y <- ifelse(runif(6*100) <= .2, 1, 0)
s <- with(d,
          summarize(y, llist(continent,year),
                    function(y) {
                      n <- sum(!is.na(y))
                      s <- sum(y, na.rm=TRUE)
                      binconf(s, n)
                    }, type='matrix'))

Dotplot(year ~ Cbind(y) | continent, data=s, ylab='Year',
        xlab='Probability')

# Dotplot(z ~ x | g1*g2)
# 2-way conditioning
# Dotplot(z ~ x | g1, groups=g2); Key()
# Key defines symbols for g2

# If the data are organized so that the mean, lower, and upper
# confidence limits are in separate records, the Hmisc reShape
# function is useful for assembling these 3 values as 3 variables
# a single observation, e.g., assuming type has values such as
# c('Mean','Lower','Upper'):
# a <- reShape(y, id=month, colvar=type)
# This will make a matrix with 3 columns named Mean Lower Upper
# and with 1/3 as many rows as the original data
Auxiliary Function Method for Sorting and Ranking
Description
An auxiliary function method that works around a bug in the way the implementation of xtfrm handles inheritance.
Usage
## S3 method for class 'labelled'
xtfrm(x)
Arguments
x | any object of class labelled. |
See Also
Mean x vs. function of y in groups of x
Description
Compute mean x vs. a function of y (e.g. median) by quantile groups of x or by x grouped to have a given minimum number of observations. Deletes NAs in x and y before doing computations.
Usage
xy.group(x, y, m=150, g, fun=mean, result="list")
Arguments
x | a vector, may contain NAs |
y | a vector of same length as x, may contain NAs |
m | number of observations per group |
g | number of quantile groups |
fun | function of y such as median or mean (the default) |
result | "list" (the default), or "matrix" |
Value
If result="list", a list with components x and y suitable for plotting. If result="matrix", a matrix with rows corresponding to x-groups and columns named n, x, and y.
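The two result forms can be tried on simulated data. This is a minimal sketch, assuming the Hmisc package is installed; the vectors x and y and their sizes are made up for illustration and are not part of the package.

```r
library(Hmisc)

set.seed(42)
x <- runif(500)                      # illustrative predictor
y <- x^2 + rnorm(500, sd = 0.1)      # illustrative response
x[c(3, 17)] <- NA                    # NAs are dropped before grouping

# Median of y within decile groups of x; returns list(x = , y = )
grp <- xy.group(x, y, g = 10, fun = median)
plot(grp, type = "b", xlab = "mean x in group", ylab = "median y")

# Same idea with about 100 observations per group, returned as a matrix
tab <- xy.group(x, y, m = 100, result = "matrix")
tab
```

Supplying g and leaving m unspecified requests quantile groups; supplying m requests groups of roughly that size.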
See Also
Examples
## Not run: 
plot(xy.group(x, y, g=10))                # Plot mean y by deciles of x
xy.group(x, y, m=100, result="matrix")    # Print table, 100 obs/group
## End(Not run)
Get Number of Days in Year or Month
Description
Returns the number of days in a specific year or month.
Usage
yearDays(time)
monthDays(time)
Arguments
time | A POSIXt or Date object describing the month or year in question. |
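A short usage sketch, assuming the Hmisc package is installed. The date below is an arbitrary illustration chosen to fall in a leap year; the last line is a base-R cross-check and not part of Hmisc.

```r
library(Hmisc)

d <- as.Date("2024-02-10")   # illustrative date in a leap year

yearDays(d)                  # number of days in the year containing d
monthDays(d)                 # number of days in the month containing d

# Base-R cross-check: the day-of-year of 31 December is the year length
as.integer(format(as.Date("2024-12-31"), "%j"))
```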
Author(s)
Charles Dupont
See Also
Combine Variables in a Matrix
Description
ynbind column binds a series of related yes/no variables, allowing for a final argument label used to label the panel created for the group. labels for individual variables are collected into a vector attribute "labels" for the result; original variable names are used in place of labels for those variables without labels. A positive response is taken to be y, yes, present (ignoring case) or a logical TRUE value. By default, the columns are sorted in ascending order of the overall proportion of positives. A subsetting method is provided for objects of class "ynbind".
pBlock creates a matrix similarly labeled, from a general set of variables (without special handling of binaries), and sets to NA any observation not in subset so that when that block of variables is analyzed it will be only for that subset.
Usage
ynbind(..., label = deparse(substitute(...)), asna = c("unknown", "unspecified"), sort = TRUE)
pBlock(..., subset=NULL, label = deparse(substitute(...)))
Arguments
... | a series of vectors |
label | a label for the group, to be attached to the resulting matrix as a |
asna | a vector of character strings specifying levels that are to be treated the same as |
sort | set to |
subset | subset criteria - either a vector of logicals or subscripts |
Value
a matrix of class "ynbind" or "pBlock" with "label" and "labels" attributes. For "pBlock", factor input vectors will have values converted to character.
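The coding rule stated in the Description can be sketched in base R. This is an illustration only, not the ynbind source: code_yes_no is a hypothetical helper, the variables smoker and diabetes are made up, and the sketch assumes that asna levels are matched ignoring case and that NAs are excluded when computing the proportion of positives.

```r
# Hypothetical helper (not part of Hmisc): code one yes/no variable
# as TRUE / FALSE / NA following the rule stated in the Description
code_yes_no <- function(x, asna = c("unknown", "unspecified")) {
  if (is.logical(x)) return(x)
  v <- tolower(as.character(x))             # ignore case
  out <- v %in% c("y", "yes", "present")    # positive responses
  out[is.na(x) | v %in% asna] <- NA         # assumed NA handling
  out
}

smoker   <- c("Yes", "no", "UNKNOWN", NA, "y")        # made-up data
diabetes <- c("present", "n", "no", "no", "N")        # made-up data

m <- cbind(smoker = code_yes_no(smoker), diabetes = code_yes_no(diabetes))

# Sort columns in ascending order of the proportion of positives
m <- m[, order(colMeans(m, na.rm = TRUE)), drop = FALSE]
m
```

Here diabetes has the smaller proportion of positives, so it becomes the first column after sorting.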
Author(s)
Frank Harrell
See Also
Examples
x1 <- c('yEs', 'no', 'UNKNOWN', NA)
x2 <- c('y', 'n', 'no', 'present')
label(x2) <- 'X2'
X <- ynbind(x1, x2, label='x1-2')
X[1:3,]

pBlock(x1, x2, subset=2:3, label='x1-2')