Movatterモバイル変換

Priority:

recommended

Version:

7.3-65

Date:

2025-02-19

Revision:

$Rev: 3681 $

Depends:

R (≥ 4.4.0), grDevices, graphics, stats, utils

Imports:

methods

Suggests:

lattice, nlme, nnet, survival

Description:

Functions and datasets to support Venables and Ripley, "Modern Applied Statistics with S" (4th edition, 2002).

Title:

Support Functions and Datasets for Venables and Ripley's MASS

LazyData:

yes

ByteCompile:

yes

License:

GPL-2 |GPL-3

URL:

http://www.stats.ox.ac.uk/pub/MASS4/

Contact:

<MASS@stats.ox.ac.uk>

NeedsCompilation:

yes

Packaged:

2025-02-19 08:49:43 UTC; ripley

Author:

Brian Ripley [aut, cre, cph], Bill Venables [aut, cph], Douglas M. Bates [ctb], Kurt Hornik [trl] (partial port ca 1998), Albrecht Gebhardt [trl] (partial port ca 1998), David Firth [ctb] (support functions for polr)

Maintainer:

Brian Ripley <Brian.Ripley@R-project.org>

Repository:

CRAN

Date/Publication:

2025-02-28 17:44:52 UTC

Australian AIDS Survival Data

Description

Data on patients diagnosed with AIDS in Australia before 1 July 1991.

Usage

Aids2

Format

This data frame contains 2843 rows and the following columns:

state: Grouped state of origin:"NSW "includes ACT and"other" is WA, SA, NT and TAS.
sex: Sex of patient.
diag: (Julian) date of diagnosis.
death: (Julian) date of death or end of observation.
status: "A" (alive) or"D" (dead) at end of observation.
T.categ: Reported transmission category.
age: Age (years) at diagnosis.

Note

This data set has been slightly jittered as acondition of its release, to ensure patient confidentiality.

Source

Dr P. J. Solomon and the Australian National Centre in HIV Epidemiologyand Clinical Research.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Brain and Body Weights for 28 Species

Description

Average brain and body weights for 28 species of land animals.

Usage

Animals

Format

body: body weight in kg.
brain: brain weight in g.

Note

The nameAnimals avoided conflicts with a system datasetanimals in S-PLUS 4.5 and later.

Source

P. J. Rousseeuw and A. M. Leroy (1987)Robust Regression and Outlier Detection.Wiley, p. 57.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

Belgium Phone Calls 1950-1973

Description

A list object with the annual numbers of telephone calls, inBelgium. The components are:

year: last two digits of the year.
calls: number of telephone calls made (in millions of calls).

Usage

phones

Source

P. J. Rousseeuw and A. M. Leroy (1987)Robust Regression & Outlier Detection. Wiley.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Housing Values in Suburbs of Boston

Description

TheBoston data frame has 506 rows and 14 columns.

Usage

Boston

Format

This data frame contains the following columns:

crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
black: 1000(Bk - 0.63)^2 whereBk is the proportion of blacksby town.
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in $1000s.

Source

Harrison, D. and Rubinfeld, D.L. (1978)Hedonic prices and the demand for clean air.J. Environ. Economics and Management5, 81–102.

Belsley D.A., Kuh, E. and Welsch, R.E. (1980)Regression Diagnostics. Identifying Influential Data and Sourcesof Collinearity.New York: Wiley.

Data from 93 Cars on Sale in the USA in 1993

Description

TheCars93 data frame has 93 rows and 27 columns.

Usage

Cars93

Format

This data frame contains the following columns:

Manufacturer: Manufacturer.
Model: Model.
Type: Type: a factor with levels"Small","Sporty","Compact","Midsize","Large" and"Van".
Min.Price: Minimum Price (in $1,000): price for a basic version.
Price: Midrange Price (in $1,000): average ofMin.Price andMax.Price.
Max.Price: Maximum Price (in $1,000): price for “a premium version”.
MPG.city: City MPG (miles per US gallon by EPA rating).
MPG.highway: Highway MPG.
AirBags: Air Bags standard. Factor: none, driver only, or driver & passenger.
DriveTrain: Drive train type: rear wheel, front wheel or 4WD; (factor).
Cylinders: Number of cylinders (missing for Mazda RX-7, which has a rotary engine).
EngineSize: Engine size (litres).
Horsepower: Horsepower (maximum).
RPM: RPM (revs per minute at maximum horsepower).
Rev.per.mile: Engine revolutions per mile (in highest gear).
Man.trans.avail: Is a manual transmission version available? (yes or no, Factor).
Fuel.tank.capacity: Fuel tank capacity (US gallons).
Passengers: Passenger capacity (persons)
Length: Length (inches).
Wheelbase: Wheelbase (inches).
Width: Width (inches).
Turn.circle: U-turn space (feet).
Rear.seat.room: Rear seat room (inches) (missing for 2-seater vehicles).
Luggage.room: Luggage capacity (cubic feet) (missing for vans).
Weight: Weight (pounds).
Origin: Of non-USA or USA company origins? (factor).
Make: Combination of Manufacturer and Model (character).

Details

Cars were selected at random from among 1993 passenger car models thatwere listed in both theConsumer Reports issue and thePACE Buying Guide. Pickup trucks and Sport/Utility vehicles wereeliminated due to incomplete information in theConsumer Reportssource. Duplicate models (e.g., Dodge Shadow and Plymouth Sundance)were listed at most once.

Further description can be found in Lock (1993).

Source

Lock, R. H. (1993)1993 New Car Data.Journal of Statistics Education1(1).doi:10.1080/10691898.1993.11910459

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

Diagnostic Tests on Patients with Cushing's Syndrome

Description

Cushing's syndrome is a hypertensive disorder associated withover-secretion of cortisol by the adrenal gland. The observationsare urinary excretion rates of two steroid metabolites.

Usage

Cushings

Format

TheCushings data frame has 27 rows and 3 columns:

Tetrahydrocortisone: urinary excretion rate (mg/24hr) of Tetrahydrocortisone.
Pregnanetriol: urinary excretion rate (mg/24hr) of Pregnanetriol.
Type: underlying type of syndrome, codeda (adenoma) ,b(bilateral hyperplasia),c (carcinoma) oru for unknown.

Source

J. Aitchison and I. R. Dunsmore (1975)Statistical Prediction Analysis.Cambridge University Press, Tables 11.1–3.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

DDT in Kale

Description

A numeric vector of 15 measurements by different laboratories of thepesticide DDT in kale, in ppm (parts per million) using the multiplepesticide residue measurement.

Usage

DDT

Source

C. E. Finsterwalder (1976)Collaborative study of an extension of the Millset almethod for the determination of pesticide residues in food.J. Off. Anal. Chem.59, 169–171

R. G. Staudte and S. J. Sheather (1990)Robust Estimation and Testing.Wiley

Level of GAG in Urine of Children

Description

Data were collected on the concentration of a chemical GAG in theurine of 314 children aged from zero to seventeen years. The aim ofthe study was to produce a chart to help a paediatrican to assess if achild's GAG concentration is ‘normal’.

Usage

GAGurine

Format

This data frame contains the following columns:

Age: age of child in years.
GAG: concentration of GAG (the units have been lost).

Source

Mrs Susan Prosser, Paediatrics Department, University of Oxford,via Department of Statistics Consulting Service.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Numbers of Car Insurance claims

Description

The data given in data frameInsurance consist of thenumbers of policyholders of an insurance company who wereexposed to risk, and the numbers of car insurance claims made bythose policyholders in the third quarter of 1973.

Usage

Insurance

Format

This data frame contains the following columns:

District: factor: district of residence of policyholder (1 to 4): 4 is major cities.
Group: an ordered factor: group of car with levels <1 litre, 1–1.5 litre,1.5–2 litre, >2 litre.
Age: an ordered factor: the age of the insured in 4 groups labelled<25, 25–29, 30–35, >35.
Holders: numbers of policyholders.
Claims: numbers of claims

Source

L. A. Baxter, S. M. Coutts and G. A. F. Ross (1980) Applications oflinear models in motor insurance.Proceedings of the 21st International Congress of Actuaries, Zurichpp. 11–29.

M. Aitkin, D. Anderson, B. Francis and J. Hinde (1989)Statistical Modelling in GLIM.Oxford University Press.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

Examples

## main-effects fit as Poisson GLM with offsetglm(Claims ~ District + Group + Age + offset(log(Holders)),    data = Insurance, family = poisson)# same via loglmloglm(Claims ~ District + Group + Age + offset(log(Holders)),      data = Insurance)

Internal MASS functions

Description

Internal MASS functions.

Usage

enlist(vec)fbeta(x, alpha, beta)frequency.polygon(x, nclass = nclass.freq(x), xlab="", ylab="", ...)nclass.freq(x)neg.bin(theta = stop("'theta' must be given"))negexp.SSival(mCall, data, LHS)

Details

These are not intended to be called by the user.Some are for compatibilitywith earlier versions of MASS (the book).

Survival from Malignant Melanoma

Description

TheMelanoma data frame has data on 205 patients in Denmarkwith malignant melanoma.

Usage

Melanoma

Format

This data frame contains the following columns:

time: survival time in days, possibly censored.
status: 1 died from melanoma,2 alive,3 dead fromother causes.
sex: 1 = male,0 = female.
age: age in years.
year: of operation.
thickness: tumour thickness in mm.
ulcer: 1 = presence,0 = absence.

Source

P. K. Andersen, O. Borgan, R. D. Gill and N. Keiding (1993)Statistical Models based on Counting Processes.Springer.

Null Spaces of Matrices

Description

Given a matrix,M, find a matrixN giving a basis for the(left) null space. That iscrossprod(N, M) = t(N) %*% Mis an all-zero matrix andN has the maximum number of linearlyindependent columns.

Usage

Null(M)

Arguments

M

Input matrix. A vector is coerced to a 1-column matrix.

Details

For a basis for the (right) null space\{x : Mx = 0\},useNull(t(M)).

Value

The matrixN with the basis for the (left) null space, or amatrix with zero columns if the matrixM is square and ofmaximal rank.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

# The function is currently defined asfunction(M){    tmp <- qr(M)    set <- if(tmp$rank == 0L) seq_len(ncol(M)) else  -seq_len(tmp$rank)    qr.Q(tmp, complete = TRUE)[, set, drop = FALSE]}

Tests of Auditory Perception in Children with OME

Description

Experiments were performed on children on their ability todifferentiate a signal in broad-band noise. The noise was played froma pair of speakers and a signal was added to just one channel; thesubject had to turn his/her head to the channel with the added signal.The signal was either coherent (the amplitude of the noise wasincreased for a period) or incoherent (independent noise was added forthe same period to form the same increase in power).

The threshold used in the original analysis was the stimulus loudnessneeds to get 75% correct responses. Some of the children hadsuffered from otitis media with effusion (OME).

Usage

OME

Format

TheOME data frame has 1129 rows and 7 columns:

ID: Subject ID (1 to 99, with some IDs missing). A few subjects weremeasured at different ages.
OME: "low" or"high" or"N/A" (at ages other than30 and 60 months).
Age: Age of the subject (months).
Loud: Loudness of stimulus, in decibels.
Noise: Whether the signal in the stimulus was"coherent" or"incoherent".
Correct: Number of correct responses fromTrials trials.
Trials: Number of trials performed.

Background

The experiment was to study otitis media with effusion (OME), a verycommon childhood condition where the middle ear space, which isnormally air-filled, becomes congested by a fluid. There is aconcomitant fluctuating, conductive hearing loss which can result invarious language, cognitive and social deficits. The term ‘binauralhearing’ is used to describe the listening conditions in which thebrain is processing information from both ears at the same time. Thebrain computes differences in the intensity and/or timing of signalsarriving at each ear which contributes to sound localisation and alsoto our ability to hear in background noise.

Some years ago, it was found that children of 7–8 years with a historyof significant OME had significantly worse binaural hearing thanchildren without such a history, despite having equivalentsensitivity. The question remained as to whether it was the timing,the duration, or the degree of severity of the otitis media episodesduring critical periods, which affected later binaural hearing. In anattempt to begin to answer this question, 95 children were monitored forthe presence of effusion every month since birth. On the basis of OMEexperience in their first two years, the test population was splitinto one group of high OME prevalence and one of low prevalence.

Source

Sarah Hogan, Dept of Physiology, University of Oxford, viaDept of Statistics Consulting Service

Examples

# Fit logistic curve from p = 0.5 to p = 1.0fp1 <- deriv(~ 0.5 + 0.5/(1 + exp(-(x-L75)/scal)),             c("L75", "scal"),             function(x,L75,scal)NULL)nls(Correct/Trials ~ fp1(Loud, L75, scal), data = OME,    start = c(L75=45, scal=3))nls(Correct/Trials ~ fp1(Loud, L75, scal),    data = OME[OME$Noise == "coherent",],    start=c(L75=45, scal=3))nls(Correct/Trials ~ fp1(Loud, L75, scal),    data = OME[OME$Noise == "incoherent",],    start = c(L75=45, scal=3))# individual fits for each experimentaa <- factor(OME$Age)ab <- 10*OME$ID + unclass(aa)ac <- unclass(factor(ab))OME$UID <- as.vector(ac)OME$UIDn <- OME$UID + 0.1*(OME$Noise == "incoherent")rm(aa, ab, ac)OMEi <- OMElibrary(nlme)fp2 <- deriv(~ 0.5 + 0.5/(1 + exp(-(x-L75)/2)),            "L75", function(x,L75) NULL)dec <- getOption("OutDec")options(show.error.messages = FALSE, OutDec=".")OMEi.nls <- nlsList(Correct/Trials ~ fp2(Loud, L75) | UIDn,   data = OMEi, start = list(L75=45), control = list(maxiter=100))options(show.error.messages = TRUE, OutDec=dec)tmp <- sapply(OMEi.nls, function(X)              {if(is.null(X)) NA else as.vector(coef(X))})OMEif <- data.frame(UID = round(as.numeric((names(tmp)))),         Noise = rep(c("coherent", "incoherent"), 110),         L75 = as.vector(tmp), stringsAsFactors = TRUE)OMEif$Age <- OME$Age[match(OMEif$UID, OME$UID)]OMEif$OME <- OME$OME[match(OMEif$UID, OME$UID)]OMEif <- OMEif[OMEif$L75 > 30,]summary(lm(L75 ~ Noise/Age, data = OMEif, na.action = na.omit))summary(lm(L75 ~ Noise/(Age + OME), data = OMEif,           subset = (Age >= 30 & Age <= 60),           na.action = na.omit), correlation = FALSE)# Or fit by weighted least squaresfpl75 <- deriv(~ sqrt(n)*(r/n - 0.5 - 0.5/(1 + exp(-(x-L75)/scal))),               c("L75", "scal"),               function(r,n,x,L75,scal) NULL)nls(0 ~ fpl75(Correct, Trials, Loud, L75, scal),    data = OME[OME$Noise == "coherent",],    start = c(L75=45, scal=3))nls(0 ~ fpl75(Correct, Trials, Loud, L75, scal),    data = OME[OME$Noise == "incoherent",],    start = c(L75=45, scal=3))# Test to see if the curves shift with agefpl75age <- deriv(~sqrt(n)*(r/n -  0.5 - 0.5/(1 +                  exp(-(x-L75-slope*age)/scal))),                  c("L75", "slope", "scal"),                  function(r,n,x,age,L75,slope,scal) NULL)OME.nls1 <-nls(0 ~ fpl75age(Correct, Trials, Loud, Age, L75, slope, scal),    data = OME[OME$Noise == "coherent",],    start = c(L75=45, slope=0, scal=2))sqrt(diag(vcov(OME.nls1)))OME.nls2 <-nls(0 ~ fpl75age(Correct, Trials, Loud, Age, L75, slope, scal),    data = OME[OME$Noise == "incoherent",],    start = c(L75=45, slope=0, scal=2))sqrt(diag(vcov(OME.nls2)))# Now allow random effects by using NLMEOMEf <- OME[rep(1:nrow(OME), OME$Trials),]OMEf$Resp <- with(OME, rep(rep(c(1,0), length(Trials)),                          t(cbind(Correct, Trials-Correct))))OMEf <- OMEf[, -match(c("Correct", "Trials"), names(OMEf))]## Not run: ## these fail in R on most platformsfp2 <- deriv(~ 0.5 + 0.5/(1 + exp(-(x-L75)/exp(lsc))),             c("L75", "lsc"),             function(x, L75, lsc) NULL)try(summary(nlme(Resp ~ fp2(Loud, L75, lsc),     fixed = list(L75 ~ Age, lsc ~ 1),     random = L75 + lsc ~ 1 | UID,     data = OMEf[OMEf$Noise == "coherent",], method = "ML",     start = list(fixed=c(L75=c(48.7, -0.03), lsc=0.24)), verbose = TRUE)))try(summary(nlme(Resp ~ fp2(Loud, L75, lsc),     fixed = list(L75 ~ Age, lsc ~ 1),     random = L75 + lsc ~ 1 | UID,     data = OMEf[OMEf$Noise == "incoherent",], method = "ML",     start = list(fixed=c(L75=c(41.5, -0.1), lsc=0)), verbose = TRUE)))## End(Not run)

Diabetes in Pima Indian Women

Description

A population of women who were at least 21 years old, of Pima Indian heritageand living near Phoenix, Arizona, was tested for diabetesaccording to World Health Organization criteria. The datawere collected by the US National Institute of Diabetes and Digestive andKidney Diseases. We used the 532 complete records after dropping the(mainly missing) data on serum insulin.

Usage

Pima.trPima.tr2Pima.te

Format

These data frames contains the following columns:

npreg: number of pregnancies.
glu: plasma glucose concentration in an oral glucose tolerance test.
bp: diastolic blood pressure (mm Hg).
skin: triceps skin fold thickness (mm).
bmi: body mass index (weight in kg/(height in m)^2).
ped: diabetes pedigree function.
age: age in years.
type: Yes orNo, for diabetic according to WHO criteria.

Details

The training setPima.tr contains a randomly selected set of 200subjects, andPima.te contains the remaining 332 subjects.Pima.tr2 containsPima.tr plus 100 subjects withmissing values in the explanatory variables.

Source

Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C.and Johannes, R. S. (1988)Using the ADAP learning algorithm to forecast the onset ofdiabetes mellitus.InProceedings of the Symposium on Computer Applications inMedical Care (Washington, 1988), ed. R. A. Greenes,pp. 261–265. Los Alamitos, CA: IEEE Computer Society Press.

Ripley, B.D. (1996)Pattern Recognition and Neural Networks.Cambridge: Cambridge University Press.

Blood Pressure in Rabbits

Description

Five rabbits were studied on two occasions, after treatment withsaline (control) and after treatment with the5-HT_3 antagonist MDL72222. After each treatment ascending doses of phenylbiguanide wereinjected intravenously at 10 minute intervals and the responses ofmean blood pressure measured. The goal was to test whether thecardiogenic chemoreflex elicited by phenylbiguanide depends on theactivation of5-HT_3 receptors.

Usage

Rabbit

Format

This data frame contains 60 rows and the following variables:

BPchange: change in blood pressure relative to the start of the experiment.
Dose: dose of Phenylbiguanide in micrograms.
Run: label of run ("C1" to"C5", then"M1" to"M5").
Treatment: placebo or the5-HT_3 antagonist MDL 72222.
Animal: label of animal used ("R1" to"R5").

Source

J. Ludbrook (1994)Repeated measurements and multiple comparisons in cardiovascular research.Cardiovascular Research28, 303–311.
[The numerical data are not in the paper but were supplied byProfessor Ludbrook]

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Accelerated Testing of Tyre Rubber

Description

Data frame from accelerated testing of tyre rubber.

Usage

Rubber

Format

loss: the abrasion loss in gm/hr.
hard: the hardness in Shore units.
tens: tensile strength in kg/sq m.

Source

O.L. Davies (1947)Statistical Methods in Research and Production.Oliver and Boyd, Table 6.1 p. 119.

O.L. Davies and P.L. Goldsmith (1972)Statistical Methods in Research and Production.4th edition, Longmans, Table 8.1 p. 239.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

Returns of the Standard and Poors 500

Description

Returns of the Standard and Poors 500 Index in the 1990's

Usage

SP500

Format

A vector of returns of the Standard and Poors 500 index for allthe trading days in 1990, 1991, ..., 1999.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Growth Curves for Sitka Spruce Trees in 1988

Description

TheSitka data frame has 395 rows and 4 columns. It gives repeatedmeasurements on the log-size of 79 Sitka spruce trees, 54 of whichwere grown in ozone-enriched chambers and 25 were controls. The sizewas measured five times in 1988, at roughly monthly intervals.

Usage

Sitka

Format

This data frame contains the following columns:

size: measured size (height times diameter squared) oftree, on log scale.
Time: time of measurement in days since 1 January 1988.
tree: number of tree.
treat: either"ozone" for an ozone-enrichedchamber or"control".

Source

P. J. Diggle, K.-Y. Liang and S. L. Zeger (1994)Analysis of Longitudinal Data.Clarendon Press, Oxford

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Growth Curves for Sitka Spruce Trees in 1989

Description

TheSitka89 data frame has 632 rows and 4 columns. It gives repeatedmeasurements on the log-size of 79 Sitka spruce trees, 54 of whichwere grown in ozone-enriched chambers and 25 were controls. The sizewas measured eight times in 1989, at roughly monthly intervals.

Usage

Sitka89

Format

This data frame contains the following columns:

size: measured size (height times diameter squared) oftree, on log scale.
Time: time of measurement in days since 1 January 1988.
tree: number of tree.
treat: either"ozone" for an ozone-enrichedchamber or"control".

Source

P. J. Diggle, K.-Y. Liang and S. L. Zeger (1994)Analysis of Longitudinal Data.Clarendon Press, Oxford

AFM Compositions of Aphyric Skye Lavas

Description

TheSkye data frame has 23 rows and 3 columns.

Usage

Skye

Format

This data frame contains the following columns:

A: Percentage of sodium and potassium oxides.
F: Percentage of iron oxide.
M: Percentage of magnesium oxide.

Source

R. N. Thompson, J. Esson and A. C. Duncan (1972)Major element chemical variation in the Eocene lavas of the Isle ofSkye.J. Petrology,13, 219–253.

References

J. Aitchison (1986)The Statistical Analysis of Compositional Data.Chapman and Hall, p.360.

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

# ternary() is from the on-line answers.ternary <- function(X, pch = par("pch"), lcex = 1,                    add = FALSE, ord = 1:3, ...){  X <- as.matrix(X)  if(any(X < 0)) stop("X must be non-negative")  s <- drop(X %*% rep(1, ncol(X)))  if(any(s<=0)) stop("each row of X must have a positive sum")  if(max(abs(s-1)) > 1e-6) {    warning("row(s) of X will be rescaled")    X <- X / s  }  X <- X[, ord]  s3 <- sqrt(1/3)  if(!add)  {    oldpty <- par("pty")    on.exit(par(pty=oldpty))    par(pty="s")    plot(c(-s3, s3), c(0.5-s3, 0.5+s3), type="n", axes=FALSE,         xlab="", ylab="")    polygon(c(0, -s3, s3), c(1, 0, 0), density=0)    lab <- NULL    if(!is.null(dn <- dimnames(X))) lab <- dn[[2]]    if(length(lab) < 3) lab <- as.character(1:3)    eps <- 0.05 * lcex    text(c(0, s3+eps*0.7, -s3-eps*0.7),         c(1+eps, -0.1*eps, -0.1*eps), lab, cex=lcex)  }  points((X[,2] - X[,3])*s3, X[,1], ...)}ternary(Skye/100, ord=c(1,3,2))

Effect of Swedish Speed Limits on Accidents

Description

An experiment was performed in Sweden in 1961–2 to assess theeffect of a speed limit on the motorway accident rate. Theexperiment was conducted on 92 days in each year, matched so thatdayj in 1962 was comparable to dayj in 1961. On some daysthe speed limit was in effect and enforced, while on other daysthere was no speed limit and cars tended to be driven faster.The speed limit days tended to be in contiguous blocks.

Usage

Traffic

Format

This data frame contains the following columns:

year: 1961 or 1962.
day: of year.
limit: was there a speed limit?
y: traffic accident count for that day.

Source

Svensson, A. (1981)On the goodness-of-fit test for the multiplicative Poisson model.Annals of Statistics,9, 697–704.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

Nutritional and Marketing Information on US Cereals

Description

TheUScereal data frame has 65 rows and 11 columns.The data come from the 1993 ASA Statistical Graphics Exposition,and are taken from the mandatory F&DA food label. The data have beennormalized here to a portion of one American cup.

Usage

UScereal

Format

This data frame contains the following columns:

mfr: Manufacturer, represented by its first initial: G=General Mills,K=Kelloggs, N=Nabisco, P=Post, Q=Quaker Oats, R=Ralston Purina.
calories: number of calories in one portion.
protein: grams of protein in one portion.
fat: grams of fat in one portion.
sodium: milligrams of sodium in one portion.
fibre: grams of dietary fibre in one portion.
carbo: grams of complex carbohydrates in one portion.
sugars: grams of sugars in one portion.
shelf: display shelf (1, 2, or 3, counting from the floor).
potassium: grams of potassium.
vitamins: vitamins and minerals (none, enriched, or 100%).

Source

The original data are available athttps://lib.stat.cmu.edu/datasets/1993.expo/.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

The Effect of Punishment Regimes on Crime Rates

Description

Criminologists are interested in the effect of punishment regimes oncrime rates. This has been studied using aggregate data on 47 statesof the USA for 1960 given in this data frame. The variables seem tohave been re-scaled to convenient numbers.

Usage

UScrime

Format

This data frame contains the following columns:

M: percentage of males aged 14–24.
So: indicator variable for a Southern state.
Ed: mean years of schooling.
Po1: police expenditure in 1960.
Po2: police expenditure in 1959.
LF: labour force participation rate.
M.F: number of males per 1000 females.
Pop: state population.
NW: number of non-whites per 1000 people.
U1: unemployment rate of urban males 14–24.
U2: unemployment rate of urban males 35–39.
GDP: gross domestic product per head.
Ineq: income inequality.
Prob: probability of imprisonment.
Time: average time served in state prisons.
y: rate of crimes in a particular category per head of population.

Source

Ehrlich, I. (1973) Participation in illegitimate activities: atheoretical and empirical investigation.Journal of Political Economy,81, 521–565.

Vandaele, W. (1978) Participation in illegitimate activities: Ehrlichrevisited. InDeterrence and Incapacitation,eds A. Blumstein, J. Cohen and D. Nagin, pp. 270–335.US National Academy of Sciences.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

Veteran's Administration Lung Cancer Trial

Description

Veteran's Administration lung cancer trial from Kalbfleisch & Prentice.

Usage

VA

Format

A data frame with columns:

stime: survival or follow-up time in days.
status: dead or censored.
treat: treatment: standard or test.
age: patient's age in years.
Karn: Karnofsky score of patient's performance on a scale of 0 to 100.
diag.time: times since diagnosis in months at entry to trial.
cell: one of four cell types.
prior: prior therapy?

Source

Kalbfleisch, J.D. and Prentice R.L. (1980)The Statistical Analysis of Failure Time Data.Wiley.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Determinations of Nickel Content

Description

A numeric vector of 31 determinations of nickel content (ppm) ina Canadian syenite rock.

Usage

abbey

Source

S. Abbey (1988)Geostandards Newsletter12, 241.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Accidental Deaths in the US 1973-1978

Description

A regular time series giving the monthly totals of accidentaldeaths in the USA.

Usage

accdeaths

Details

The values for first six months of 1979 (p. 326) were7798 7406 8363 8460 9217 9316.

Source

P. J. Brockwell and R. A. Davis (1991)Time Series: Theory and Methods.Springer, New York.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

Try All One-Term Additions to a Model

Description

Try fitting all models that differ from the current model by adding asingle term from those supplied, maintaining marginality.

This function is generic; there exist methods for classeslm andglm and the default method will work for many other classes.

Usage

addterm(object, ...)## Default S3 method:addterm(object, scope, scale = 0, test = c("none", "Chisq"),        k = 2, sorted = FALSE, trace = FALSE, ...)## S3 method for class 'lm'addterm(object, scope, scale = 0, test = c("none", "Chisq", "F"),        k = 2, sorted = FALSE, ...)## S3 method for class 'glm'addterm(object, scope, scale = 0, test = c("none", "Chisq", "F"),        k = 2, sorted = FALSE, trace = FALSE, ...)

Arguments

object

An object fitted by some model-fitting function.

scope

a formula specifying a maximal model which should include the currentone. All additional terms in the maximal model with all marginal termsin the original model are tried.

scale

used in the definition of the AIC statistic for selecting the models,currently only forlm,aov andglm models. Specifyingscaleasserts that the residual standard error or dispersion is known.

test

should the results include a test statistic relative to the originalmodel? The F test is only appropriate forlm andaov models,and perhaps for some over-dispersedglm models. TheChisq test can be an exact test (lm models with known scale) or alikelihood-ratio test depending on the method.

k

the multiple of the number of degrees of freedom used for the penalty.Onlyk=2 gives the genuine AIC:k = log(n) is sometimes referredto as BIC or SBC.

sorted

should the results be sorted on the value of AIC?

trace

ifTRUE additional information may be given on the fits as they are tried.

...

arguments passed to or from other methods.

Details

The definition of AIC is only up to an additive constant: whenappropriate (lm models with specified scale) the constant is takento be that used in Mallows' Cp statistic and the results are labelledaccordingly.

Value

A table of class"anova" containing at least columns for the changein degrees of freedom and AIC (or Cp) for the models. Some methodswill give further information, for example sums of squares, deviances,log-likelihoods and test statistics.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

quine.hi <- aov(log(Days + 2.5) ~ .^4, quine)quine.lo <- aov(log(Days+2.5) ~ 1, quine)addterm(quine.lo, quine.hi, test="F")house.glm0 <- glm(Freq ~ Infl*Type*Cont + Sat, family=poisson,                   data=housing)addterm(house.glm0, ~. + Sat:(Infl+Type+Cont), test="Chisq")house.glm1 <- update(house.glm0, . ~ . + Sat*(Infl+Type+Cont))addterm(house.glm1, ~. + Sat:(Infl+Type+Cont)^2, test = "Chisq")

Anorexia Data on Weight Change

Description

Theanorexia data frame has 72 rows and 3 columns.Weight change data for young female anorexia patients.

Usage

anorexia

Format

This data frame contains the following columns:

Treat: Factor of three levels:"Cont" (control),"CBT"(Cognitive Behavioural treatment) and"FT" (familytreatment).
Prewt: Weight of patient before study period, in lbs.
Postwt: Weight of patient after study period, in lbs.

Source

Hand, D. J., Daly, F., McConway, K., Lunn, D. and Ostrowski, E. eds (1993)A Handbook of Small Data Sets.Chapman & Hall, Data set 285 (p. 229)

(Note that the original source mistakenly says that weights are in kg.)

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Likelihood Ratio Tests for Negative Binomial GLMs

Description

Method function to perform sequential likelihood ratio tests for NegativeBinomial generalized linear models.

Usage

## S3 method for class 'negbin'anova(object, ..., test = "Chisq")

Arguments

object

Fitted model object of class"negbin", inheriting fromclasses"glm" and"lm", specifying a Negative Binomialfitted GLM. Typically the output ofglm.nb().

...

Zero or more additional fitted model objects of class"negbin". Theyshould form a nested sequence of models, but need not be specified in anyparticular order.

test

Argument to match thetest argument ofanova.glm.Ignored (with a warning if changed) if a sequence of two or moreNegative Binomial fitted model objects is specified, but possiblyused if only one object is specified.

Details

This function is a method for the generic functionanova() for class"negbin".It can be invoked by callinganova(x) for anobjectx of the appropriate class, or directly bycallinganova.negbin(x) regardless of theclass of the object.

Note

If only one fitted model object is specified, a sequential analysis ofdeviance table is given for the fitted model. Thetheta parameter is keptfixed. If more than one fitted model object is specified they must all beof class"negbin" and likelihood ratio tests are done of each model withinthe next. In this casetheta is assumed to have been re-estimated for eachmodel.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

m1 <- glm.nb(Days ~ Eth*Age*Lrn*Sex, quine, link = log)m2 <- update(m1, . ~ . - Eth:Age:Lrn:Sex)anova(m2, m1)anova(m2)

Adaptive Numerical Integration

Description

Integrate a function of one variable over a finite range using arecursive adaptive method. This function is mainly fordemonstration purposes.

Usage

area(f, a, b, ..., fa = f(a, ...), fb = f(b, ...),     limit = 10, eps = 1e-05)

Arguments

f

The integrand as anS function object. The variable of integration must bethe first argument.

a

Lower limit of integration.

b

Upper limit of integration.

...

Additional arguments needed by the integrand.

fa

Function value at the lower limit.

fb

Function value at the upper limit.

limit

Limit on the depth to which recursion is allowed to go.

eps

Error tolerance to control the process.

Details

The method divides the interval in two and compares the values given bySimpson's rule and the trapezium rule. If these are within eps of eachother the Simpson's rule result is given, otherwise the process is appliedseparately to each half of the interval and the results added together.

Value

The integral froma tob off(x).

References

Venables, W. N. and Ripley, B. D. (1994)Modern Applied Statistics with S-Plus. Springer.pp. 105–110.

Examples

area(sin, 0, pi)  # integrate the sin function from 0 to pi.

Presence of Bacteria after Drug Treatments

Description

Tests of the presence of the bacteriaH. influenzaein children with otitis media in the Northern Territory of Australia.

Usage

bacteria

Format

This data frame has 220 rows and the following columns:

y: presence or absence: a factor with levelsn andy.
ap: active/placebo: a factor with levelsa andp.
hilo: hi/low compliance: a factor with levelshi amdlo.
week: numeric: week of test.
ID: subject ID: a factor.
trt: a factor with levelsplacebo,drug anddrug+, a re-coding ofap andhilo.

Details

Dr A. Leach tested the effects of a drug on 50 children with a history ofotitis media in the Northern Territory of Australia. The childrenwere randomized to the drug or the a placebo, and also to receiveactive encouragement to comply with taking the drug.

The presence ofH. influenzae was checked at weeks 0, 2, 4, 6and 11: 30 of the checks were missing and are not included in thisdata frame.

Source

Dr Amanda Leachvia Mr James McBroom.

References

Menzies School of Health Research 1999–2000 Annual Report. p.20.https://www.menzies.edu.au/icms_docs/172302_2000_Annual_report.pdf.

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

contrasts(bacteria$trt) <- structure(contr.sdif(3),     dimnames = list(NULL, c("drug", "encourage")))## fixed effects analysessummary(glm(y ~ trt * week, binomial, data = bacteria))summary(glm(y ~ trt + week, binomial, data = bacteria))summary(glm(y ~ trt + I(week > 2), binomial, data = bacteria))# conditional random-effects analysislibrary(survival)bacteria$Time <- rep(1, nrow(bacteria))coxph(Surv(Time, unclass(y)) ~ week + strata(ID),      data = bacteria, method = "exact")coxph(Surv(Time, unclass(y)) ~ factor(week) + strata(ID),      data = bacteria, method = "exact")coxph(Surv(Time, unclass(y)) ~ I(week > 2) + strata(ID),      data = bacteria, method = "exact")# PQL glmm analysislibrary(nlme)## IGNORE_RDIFF_BEGINsummary(glmmPQL(y ~ trt + I(week > 2), random = ~ 1 | ID,                family = binomial, data = bacteria))## IGNORE_RDIFF_END

Bandwidth for density() via Normal Reference Distribution

Description

A well-supported rule-of-thumb for choosing the bandwidth of a Gaussiankernel density estimator.

Usage

bandwidth.nrd(x)

Arguments

x

A data vector.

Value

A bandwidth on a scale suitable for thewidth argument ofdensity.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S.Springer, equation (5.5) on page 130.

Examples

# The function is currently defined asfunction(x){    r <- quantile(x, c(0.25, 0.75))    h <- (r[2] - r[1])/1.34    4 * 1.06 * min(sqrt(var(x)), h) * length(x)^(-1/5)}

Biased Cross-Validation for Bandwidth Selection

Description

Uses biased cross-validation to select the bandwidth of a Gaussiankernel density estimator.

Usage

bcv(x, nb = 1000, lower, upper)

Arguments

x

a numeric vector

nb

number of bins to use.

lower,upper

Range over which to minimize. The default is almost always satisfactory.

Value

a bandwidth

References

Scott, D. W. (1992)Multivariate Density Estimation: Theory, Practice, and Visualization.Wiley.

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

bcv(geyser$duration)

Body Temperature Series of Beaver 1

Description

Reynolds (1994) describes a small part of a study of the long-termtemperature dynamics of beaverCastor canadensis innorth-central Wisconsin. Body temperature was measured by telemetryevery 10 minutes for four females, but data from a one period of lessthan a day for each of two animals is used there.

Usage

beav1

Format

Thebeav1 data frame has 114 rows and 4 columns.This data frame contains the following columns:

day: Day of observation (in days since the beginning of 1990),December 12–13.
time: Time of observation, in the form0330 for 3.30am.
temp: Measured body temperature in degrees Celsius.
activ: Indicator of activity outside the retreat.

Note

The observation at 22:20 is missing.

Source

P. S. Reynolds (1994) Time-series analyses of beaver body temperatures.Chapter 11 ofLange, N., Ryan, L., Billard, L., Brillinger, D., Conquest, L.and Greenhouse, J. eds (1994)Case Studies in Biometry. NewYork: John Wiley and Sons.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

beav1 <- within(beav1,               hours <- 24*(day-346) + trunc(time/100) + (time%%100)/60)plot(beav1$hours, beav1$temp, type="l", xlab="time",   ylab="temperature", main="Beaver 1")usr <- par("usr"); usr[3:4] <- c(-0.2, 8); par(usr=usr)lines(beav1$hours, beav1$activ, type="s", lty=2)temp <- ts(c(beav1$temp[1:82], NA, beav1$temp[83:114]),           start = 9.5, frequency = 6)activ <- ts(c(beav1$activ[1:82], NA, beav1$activ[83:114]),            start = 9.5, frequency = 6)acf(temp[1:53])acf(temp[1:53], type = "partial")ar(temp[1:53])act <- c(rep(0, 10), activ)X <- cbind(1, act = act[11:125], act1 = act[10:124],          act2 = act[9:123], act3 = act[8:122])alpha <- 0.80stemp <- as.vector(temp - alpha*lag(temp, -1))sX <- X[-1, ] - alpha * X[-115,]beav1.ls <- lm(stemp ~ -1 + sX, na.action = na.omit)summary(beav1.ls, correlation = FALSE)rm(temp, activ)

Body Temperature Series of Beaver 2

Description

Usage

beav2

Format

Thebeav2 data frame has 100 rows and 4 columns.This data frame contains the following columns:

day: Day of observation (in days since the beginning of 1990),November 3–4.
time: Time of observation, in the form0330 for 3.30am.
temp: Measured body temperature in degrees Celsius.
activ: Indicator of activity outside the retreat.

Source

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

attach(beav2)beav2$hours <- 24*(day-307) + trunc(time/100) + (time%%100)/60plot(beav2$hours, beav2$temp, type = "l", xlab = "time",   ylab = "temperature", main = "Beaver 2")usr <- par("usr"); usr[3:4] <- c(-0.2, 8); par(usr = usr)lines(beav2$hours, beav2$activ, type = "s", lty = 2)temp <- ts(temp, start = 8+2/3, frequency = 6)activ <- ts(activ, start = 8+2/3, frequency = 6)acf(temp[activ == 0]); acf(temp[activ == 1]) # also look at PACFsar(temp[activ == 0]); ar(temp[activ == 1])arima(temp, order = c(1,0,0), xreg = activ)dreg <- cbind(sin = sin(2*pi*beav2$hours/24), cos = cos(2*pi*beav2$hours/24))arima(temp, order = c(1,0,0), xreg = cbind(active=activ, dreg))## IGNORE_RDIFF_BEGINlibrary(nlme) # for gls and corAR1beav2.gls <- gls(temp ~ activ, data = beav2, correlation = corAR1(0.8),                 method = "ML")summary(beav2.gls)summary(update(beav2.gls, subset = 6:100))detach("beav2"); rm(temp, activ)## IGNORE_RDIFF_END

Biopsy Data on Breast Cancer Patients

Description

This breast cancer database was obtained from the University of WisconsinHospitals, Madison from Dr. William H. Wolberg. He assessed biopsiesof breast tumours for 699 patients up to 15 July 1992; each of nineattributes has been scored on a scale of 1 to 10, and the outcome isalso known. There are 699 rows and 11 columns.

Usage

biopsy

Format

This data frame contains the following columns:

ID: sample code number (not unique).
V1: clump thickness.
V2: uniformity of cell size.
V3: uniformity of cell shape.
V4: marginal adhesion.
V5: single epithelial cell size.
V6: bare nuclei (16 values are missing).
V7: bland chromatin.
V8: normal nucleoli.
V9: mitoses.
class: "benign" or"malignant".

Source

P. M. Murphy and D. W. Aha (1992). UCI Repository of machinelearning databases. [Machine-readable data repository]. Irvine, CA:University of California, Department of Information and Computer Science.

O. L. Mangasarian and W. H. Wolberg (1990)Cancer diagnosis via linear programming.SIAM News23, pp 1 & 18.

William H. Wolberg and O.L. Mangasarian (1990)Multisurface method of pattern separation for medical diagnosisapplied to breast cytology.Proceedings of the National Academy of Sciences, U.S.A.87, pp. 9193–9196.

O. L. Mangasarian, R. Setiono and W.H. Wolberg (1990)Pattern recognition via linear programming: Theory and applicationto medical diagnosis. InLarge-scale Numerical Optimizationeds Thomas F. Coleman and Yuying Li, SIAM Publications, Philadelphia,pp 22–30.

K. P. Bennett and O. L. Mangasarian (1992)Robust linear programming discrimination of two linearly inseparable sets.Optimization Methods and Software1, pp. 23–34 (Gordon & Breach Science Publishers).

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

Risk Factors Associated with Low Infant Birth Weight

Description

Thebirthwt data frame has 189 rows and 10 columns.The data were collected at Baystate Medical Center, Springfield, Massduring 1986.

Usage

birthwt

Format

This data frame contains the following columns:

low: indicator of birth weight less than 2.5 kg.
age: mother's age in years.
lwt: mother's weight in pounds at last menstrual period.
race: mother's race (1 = white,2 = black,3 = other).
smoke: smoking status during pregnancy.
ptl: number of previous premature labours.
ht: history of hypertension.
ui: presence of uterine irritability.
ftv: number of physician visits during the first trimester.
bwt: birth weight in grams.

Source

Hosmer, D.W. and Lemeshow, S. (1989)Applied Logistic Regression. New York: Wiley

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

bwt <- with(birthwt, {race <- factor(race, labels = c("white", "black", "other"))ptd <- factor(ptl > 0)ftv <- factor(ftv)levels(ftv)[-(1:2)] <- "2+"data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),           ptd, ht = (ht > 0), ui = (ui > 0), ftv)})options(contrasts = c("contr.treatment", "contr.poly"))glm(low ~ ., binomial, bwt)

Box-Cox Transformations for Linear Models

Description

Computes and optionally plots profile log-likelihoods for theparameter of the Box-Cox power transformation.

Usage

boxcox(object, ...)## Default S3 method:boxcox(object, lambda = seq(-2, 2, 1/10), plotit = TRUE,       interp, eps = 1/50, xlab = expression(lambda),       ylab = "log-Likelihood", ...)## S3 method for class 'formula'boxcox(object, lambda = seq(-2, 2, 1/10), plotit = TRUE,       interp, eps = 1/50, xlab = expression(lambda),       ylab = "log-Likelihood", ...)## S3 method for class 'lm'boxcox(object, lambda = seq(-2, 2, 1/10), plotit = TRUE,       interp, eps = 1/50, xlab = expression(lambda),       ylab = "log-Likelihood", ...)

Arguments

object

a formula or fitted model object. Currently onlylm andaov objects are handled.

lambda

vector of values oflambda– default(-2, 2) in steps of 0.1.

plotit

logical which controls whether the result should be plotted.

interp

logical which controls whether spline interpolation isused. Default toTRUE if plotting withlambda oflength less than 100.

eps

Tolerance forlambda = 0; defaults to 0.02.

xlab

defaults to"lambda".

ylab

defaults to"log-Likelihood".

...

additional parameters to be used in the model fitting.

Value

A list of thelambda vector and the computed profilelog-likelihood vector, invisibly if the result is plotted.

Side Effects

Ifplotit = TRUE plots log-likelihoodvslambda andindicates a 95% confidence interval about the maximum observed valueoflambda. Ifinterp = TRUE, spline interpolation isused to give a smoother plot.

References

Box, G. E. P. and Cox, D. R. (1964)An analysis of transformations (with discussion).Journal of the Royal Statistical Society B,26, 211–252.

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

boxcox(Volume ~ log(Height) + log(Girth), data = trees,       lambda = seq(-0.25, 0.25, length.out = 10))boxcox(Days+1 ~ Eth*Sex*Age*Lrn, data = quine,       lambda = seq(-0.05, 0.45, length.out = 20))

Data from a cabbage field trial

Description

Thecabbages data set has 60 observations and 4 variables

Usage

cabbages

Format

This data frame contains the following columns:

Cult: Factor giving the cultivar of the cabbage, two levels:c39andc52.
Date: Factor specifying one of three planting dates:d16,d20 ord21.
HeadWt: Weight of the cabbage head, presumably in kg.
VitC: Ascorbic acid content, in undefined units.

Source

Rawlings, J. O. (1988)Applied Regression Analysis: A Research Tool.Wadsworth and Brooks/Cole. Example 8.4, page 219.(Rawlings cites the original source as the files of the lateDr Gertrude M Cox.)

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

Colours of Eyes and Hair of People in Caithness

Description

Data on the cross-classification of people in Caithness, Scotland, byeye and hair colour. The region of the UK is particularly interestingas there is a mixture of people of Nordic, Celtic and Anglo-Saxon origin.

Usage

caith

Format

A 4 by 5 table with rows the eye colours (blue, light, medium, dark) andcolumns the hair colours (fair, red, medium, dark, black).

Source

Fisher, R.A. (1940) The precision of discriminant functions.Annals of Eugenics (London)10, 422–429.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

## IGNORE_RDIFF_BEGIN## The signs can vary by platformcorresp(caith)## IGNORE_RDIFF_ENDdimnames(caith)[[2]] <- c("F", "R", "M", "D", "B")par(mfcol=c(1,3))plot(corresp(caith, nf=2)); title("symmetric")plot(corresp(caith, nf=2), type="rows"); title("rows")plot(corresp(caith, nf=2), type="col"); title("columns")par(mfrow=c(1,1))

Anatomical Data from Domestic Cats

Description

The heart and body weights of samples of male and female cats used fordigitalis experiments. The cats were all adult, over 2 kg body weight.

Usage

cats

Format

This data frame contains the following columns:

Sex: sex: Factor with levels"F" and"M".
Bwt: body weight in kg.
Hwt: heart weight in g.

Source

R. A. Fisher (1947) The analysis of covariance method for the relationbetween a part and the whole,Biometrics3, 65–68.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Heat Evolved by Setting Cements

Description

Experiment on the heat evolved in the setting of each of 13 cements.

Usage

cement

Format

x1, x2, x3, x4: Proportions (%) of active ingredients.
y: heat evolved in cals/gm.

Details

Thirteen samples of Portland cement were set. For each sample, thepercentages of the four main chemical ingredients was accuratelymeasured. While the cement was setting the amount of heat evolved wasalso measured.

Source

Woods, H., Steinour, H.H. and Starke, H.R. (1932) Effect of composition of Portland cement on heat evolved during hardening.Industrial Engineering and Chemistry,24, 1207–1214.

References

Hald, A. (1957)Statistical Theory with EngineeringApplications. Wiley, New York.

Examples

lm(y ~ x1 + x2 + x3 + x4, cement)

Copper in Wholemeal Flour

Description

A numeric vector of 24 determinations of copper in wholemealflour, in parts per million.

Usage

chem

Source

Analytical Methods Committee (1989) Robust statistics – how not toreject outliers.The Analyst114, 1693–1702.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Convert Lists to Data Frames for use by lattice

Description

Convert lists to data frames for use by lattice.

Usage

con2tr(obj)

Arguments

obj

A list of componentsx,y andz as passed tocontour.

Details

con2tr repeats thex andy components suitably tomatch the vectorz.

Value

A data frame suitable for passing to lattice (formerly trellis) functions.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Confidence Intervals for Model Parameters

Description

Computes confidence intervals for one or more parameters in a fittedmodel. PackageMASS added methods forglm andnlsfits. As fronR 4.4.0 these have been migrated to packagestats.

It also adds a method forpolr fits.

Successive Differences Contrast Coding

Description

A coding for factors based on successive differences.

Usage

contr.sdif(n, contrasts = TRUE, sparse = FALSE)

Arguments

n

The number of levels required.

contrasts

logical: Should there ben - 1 columns orthogonal to the mean(the default) orn columns spanning the space?

sparse

logical. If true and the result would be sparse (onlytrue forcontrasts = FALSE), return a sparse matrix.

Details

The contrast coefficients are chosen so that the coded coefficientsin a one-way layout are the differences between the means of thesecond and first levels, the third and second levels, and so on. Thismakes most sense for ordered factors, but does not assume that thelevels are equally spaced.

Value

Ifcontrasts isTRUE, a matrix withn rows andn - 1 columns, and then byn identity matrix ifcontrasts isFALSE.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S.Fourth Edition, Springer.

Examples

(A <- contr.sdif(6))zapsmall(ginv(A))

Co-operative Trial in Analytical Chemistry

Description

Seven specimens were sent to 6 laboratories in 3 separate batches andeach analysed for Analyte. Each analysis was duplicated.

Usage

coop

Format

This data frame contains the following columns:

Lab: Laboratory,L1,L2, ...,L6.
Spc: Specimen,S1,S2, ...,S7.
Bat: Batch,B1,B2,B3 (nested withinSpc/Lab),
Conc: Concentration of Analyte ing/kg.

Source

Analytical Methods Committee (1987)Recommendations for the conduct andinterpretation of co-operative trials,The Analyst112, 679–686.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Simple Correspondence Analysis

Description

Find the principal canonical correlation and corresponding row- andcolumn-scores from a correspondence analysis of a two-way contingencytable.

Usage

corresp(x, ...)## S3 method for class 'matrix'corresp(x, nf = 1, ...)## S3 method for class 'factor'corresp(x, y, ...)## S3 method for class 'data.frame'corresp(x, ...)## S3 method for class 'xtabs'corresp(x, ...)## S3 method for class 'formula'corresp(formula, data, ...)

Arguments

x,formula

The function is generic, accepting various forms of the principalargument for specifying a two-way frequency table. Currently acceptedforms are matrices, data frames (coerced to frequency tables), objectsof class"xtabs" and formulae of the form~ F1 + F2,whereF1 andF2 are factors.

nf

The number of factors to be computed. Note that although 1 is the mostusual, one school of thought takes the first two singular vectors fora sort of biplot.

y

a second factor for a cross-classification.

data

an optional data frame, list or environment against whichto preferentially resolve variables in the formula.

...

If the principal argument is a formula, a data frame may be specifiedas well from which variables in the formula are preferentiallysatisfied.

Details

See Venables & Ripley (2002). Theplot method produces a graphicalrepresentation of the table ifnf=1, with theareas of circlesrepresenting the numbers of points. Ifnf is two or more thebiplot method is called, which plots the second and third columns ofthe matricesA = Dr^(-1/2) U L andB = Dc^(-1/2) V L where thesingular value decomposition isU L V. Thus the x-axis is thecanonical correlation times the row and column scores. Although thisis called a biplot, it doesnot have any useful inner productrelationship between the row and column scores. Think of this as anequally-scaled plot with two unrelated sets of labels. The origin ismarked on the plot with a cross. (For other versions of this plot seethe book.)

Value

An list object of class"correspondence" for whichprint,plot andbiplot methods are supplied.The main components are the canonical correlation(s) and the rowand column scores.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Gower, J. C. and Hand, D. J. (1996)Biplots. Chapman & Hall.

Examples

## IGNORE_RDIFF_BEGIN## The signs can vary by platform(ct <- corresp(~ Age + Eth, data = quine))plot(ct)corresp(caith)biplot(corresp(caith, nf = 2))## IGNORE_RDIFF_END

Resistant Estimation of Multivariate Location and Scatter

Description

Compute a multivariate location and scale estimate with a highbreakdown point – this can be thought of as estimating the mean andcovariance of thegood part of the data.cov.mve andcov.mcd are compatibility wrappers.

Usage

cov.rob(x, cor = FALSE, quantile.used = floor((n + p + 1)/2),        method = c("mve", "mcd", "classical"),        nsamp = "best", seed)cov.mve(...)cov.mcd(...)

Arguments

x

a matrix or data frame.

cor

should the returned result include a correlation matrix?

quantile.used

the minimum number of the data points regarded asgood points.

method

the method to be used – minimum volume ellipsoid, minimumcovariance determinant or classical product-moment. Usingcov.mve orcov.mcd forcesmve ormcdrespectively.

nsamp

the number of samples or"best" or"exact" or"sample". The limitIf"sample" the number chosen ismin(5*p, 3000), takenfrom Rousseeuw and Hubert (1997). If"best" exhaustiveenumeration is done up to 5000 samples: if"exact"exhaustive enumeration will be attempted.

seed

the seed to be used for random sampling: seeRNGkind. Thecurrent value of.Random.seed will be preserved if it is set.

...

arguments tocov.rob other thanmethod.

Details

For method"mve", an approximate search is made of a subset ofsizequantile.used with an enclosing ellipsoid of smallest volume; inmethod"mcd" it is the volume of the Gaussian confidenceellipsoid, equivalently the determinant of the classical covariancematrix, that is minimized. The mean of the subset provides a firstestimate of the location, and the rescaled covariance matrix a firstestimate of scatter. The Mahalanobis distances of all the points fromthe location estimate for this covariance matrix are calculated, andthose points within the 97.5% point under Gaussian assumptions aredeclared to begood. The final estimates are the mean and rescaledcovariance of thegood points.

The rescaling is by the appropriate percentile under Gaussian data; inaddition the first covariance matrix has anad hoc finite-samplecorrection given by Marazzi.

For method"mve" the search is made over ellipsoids determinedby the covariance matrix ofp of the data points. For method"mcd" an additional improvement step suggested by Rousseeuw andvan Driessen (1999) is used, in which once a subset of sizequantile.used is selected, an ellipsoid based on its covarianceis tested (as this will have no larger a determinant, and may be smaller).

There is a hard limit on the allowed number of samples,2^{31} - 1. However, practical limits are likely to be much lowerand one might check the number of samples used for exhaustiveenumeration,combn(NROW(x), NCOL(x) + 1), before attempting it.

Value

A list with components

center

the final estimate of location.

cov

the final estimate of scatter.

cor

(only iscor = TRUE) the estimate of the correlationmatrix.

sing

message giving number of singular samples out of total

crit

the value of the criterion on log scale. For MCD this isthe determinant, and for MVE it is proportional to the volume.

best

the subset used. For MVE the best sample, for MCD the bestset of sizequantile.used.

n.obs

total number of observations.

References

P. J. Rousseeuw and A. M. Leroy (1987)Robust Regression and Outlier Detection.Wiley.

A. Marazzi (1993)Algorithms, Routines and S Functions for Robust Statistics.Wadsworth and Brooks/Cole.

P. J. Rousseeuw and B. C. van Zomeren (1990) Unmaskingmultivariate outliers and leverage points,Journal of the American Statistical Association,85, 633–639.

P. J. Rousseeuw and K. van Driessen (1999) A fast algorithm for theminimum covariance determinant estimator.Technometrics41, 212–223.

P. Rousseeuw and M. Hubert (1997) Recent developments in PROGRESS. InL1-Statistical Procedures and Related Topicsed Y. Dodge, IMS Lecture Notes volume31, pp. 201–214.

Examples

set.seed(123)cov.rob(stackloss)cov.rob(stack.x, method = "mcd", nsamp = "exact")

Covariance Estimation for Multivariate t Distribution

Description

Estimates a covariance or correlation matrix assuming the data camefrom a multivariate t distribution: this provides some degree ofrobustness to outlier without giving a high breakdown point.

Usage

cov.trob(x, wt = rep(1, n), cor = FALSE, center = TRUE, nu = 5,         maxit = 25, tol = 0.01)

Arguments

x

data matrix. Missing values (NAs) are not allowed.

wt

A vector of weights for each case: these are treated as if the caseiactually occurredwt[i] times.

cor

Flag to choose between returning the correlation (cor = TRUE) orcovariance (cor = FALSE) matrix.

center

a logical value or a numeric vector providing the location about whichthe covariance is to be taken. Ifcenter = FALSE, no centeringis done; ifcenter = TRUE the MLE of the location vector is used.

nu

‘degrees of freedom’ for the multivariate t distribution. Must exceed2 (so that the covariance matrix is finite).

maxit

Maximum number of iterations in fitting.

tol

Convergence tolerance for fitting.

Value

A list with the following components

cov

the fitted covariance matrix.

center

the estimated or specified location vector.

wt

the specified weights: only returned if thewt argument was given.

n.obs

the number of cases used in the fitting.

cor

the fitted correlation matrix: only returned ifcor = TRUE.

call

The matched call.

iter

The number of iterations used.

References

J. T. Kent, D. E. Tyler and Y. Vardi (1994)A curious likelihood identity for the multivariate t-distribution.Communications in Statistics—Simulation and Computation23, 441–453.

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

Examples

cov.trob(stackloss)

Performance of Computer CPUs

Description

A relative performance measure and characteristics of 209 CPUs.

Usage

cpus

Format

The components are:

name: manufacturer and model.
syct: cycle time in nanoseconds.
mmin: minimum main memory in kilobytes.
mmax: maximum main memory in kilobytes.
cach: cache size in kilobytes.
chmin: minimum number of channels.
chmax: maximum number of channels.
perf: published performance on a benchmark mix relative to an IBM 370/158-3.
estperf: estimated performance (by Ein-Dor & Feldmesser).

Source

P. Ein-Dor and J. Feldmesser (1987)Attributes of the performance of central processing units: a relativeperformance prediction model.Comm. ACM.30, 308–317.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Morphological Measurements on Leptograpsus Crabs

Description

Thecrabs data frame has 200 rows and 8 columns, describing 5morphological measurements on 50 crabs each of two colour forms andboth sexes, of the speciesLeptograpsus variegatus collected atFremantle, W. Australia.

Usage

crabs

Format

This data frame contains the following columns:

sp: species -"B" or"O" for blue or orange.
sex: as it says.
index: index1:50 within each of the four groups.
FL: frontal lobe size (mm).
RW: rear width (mm).
CL: carapace length (mm).
CW: carapace width (mm).
BD: body depth (mm).

Source

Campbell, N.A. and Mahon, R.J. (1974) A multivariatestudy of variation in two species of rock crab of genusLeptograpsus.Australian Journal of Zoology22, 417–425.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Monthly Deaths from Lung Diseases in the UK

Description

A time series giving the monthly deaths from bronchitis,emphysema and asthma in the UK, 1974-1979, both sexes (deaths),

Usage

deaths

Source

P. J. Diggle (1990)Time Series: A Biostatistical Introduction.Oxford, table A.3

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Transform an Allowable Formula for 'loglm' into one for 'terms'

Description

loglm allows dimension numbers to be used in place of names inthe formula.denumerate modifies such a formula into one thatterms can process.

Usage

denumerate(x)

Arguments

x

A formula conforming to the conventions ofloglm, that is, itmay allow dimension numbers to stand in for names when specifyinga log-linear model.

Details

The model fitting functionloglm fits log-linear models tofrequency data using iterative proportional scaling. To specifythe model the user must nominate the margins in the data thatremain fixed under the log-linear model. It is convenient toallow the user to use dimension numbers, 1, 2, 3, ... for thefirst, second, third, ..., margins in a similar way to variablenames. As the model formula has to be parsed byterms, whichtreats1 in a special way and requires parseable variable names,these formulae have to be modified by giving genuine names forthese margin, or dimension numbers.denumerate replaces thesenumbers with names of a special form, namelyn is replaced by.vn. This allowsterms to parse the formula in the usual way.

Value

A linear model formula like that presented, except that wheredimension numbers, sayn, have been used to specify fixedmargins these are replaced by names of the form.vn which maybe processed byterms.

Examples

denumerate(~(1+2+3)^3 + a/b)## which gives ~ (.v1 + .v2 + .v3)^3 + a/b

Predict Doses for Binomial Assay model

Description

Calibrate binomial assays, generalizing the calculation of LD50.

Usage

dose.p(obj, cf = 1:2, p = 0.5)

Arguments

obj

A fitted model object of class inheriting from"glm".

cf

The terms in the coefficient vector giving the intercept andcoefficient of (log-)dose

p

Probabilities at which to predict the dose needed.

Value

An object of class"glm.dose" giving the prediction (attribute"p" and standard error (attribute"SE") at each responseprobability.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S.Springer.

Examples

ldose <- rep(0:5, 2)numdead <- c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16)sex <- factor(rep(c("M", "F"), c(6, 6)))SF <- cbind(numdead, numalive = 20 - numdead)budworm.lg0 <- glm(SF ~ sex + ldose - 1, family = binomial)dose.p(budworm.lg0, cf = c(1,3), p = 1:3/4)dose.p(update(budworm.lg0, family = binomial(link=probit)),       cf = c(1,3), p = 1:3/4)

Deaths of Car Drivers in Great Britain 1969-84

Description

A regular time series giving the monthly totals of car drivers inGreat Britain killed or seriously injured Jan 1969 to Dec1984. Compulsory wearing of seat belts was introduced on 31 Jan 1983

Usage

drivers

Source

Harvey, A.C. (1989)Forecasting, Structural Time Series Models and the Kalman Filter.Cambridge University Press, pp. 519–523.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

Try All One-Term Deletions from a Model

Description

Try fitting all models that differ from the current model by dropping asingle term, maintaining marginality.

This function is generic; there exist methods for classeslm andglm and the default method will work for many other classes.

Usage

dropterm (object, ...)## Default S3 method:dropterm(object, scope, scale = 0, test = c("none", "Chisq"),         k = 2, sorted = FALSE, trace = FALSE, ...)## S3 method for class 'lm'dropterm(object, scope, scale = 0, test = c("none", "Chisq", "F"),         k = 2, sorted = FALSE, ...)## S3 method for class 'glm'dropterm(object, scope, scale = 0, test = c("none", "Chisq", "F"),         k = 2, sorted = FALSE, trace = FALSE, ...)

Arguments

object

A object fitted by some model-fitting function.

scope

a formula giving terms which might be dropped. By default, themodel formula. Only terms that can be dropped and maintain marginalityare actually tried.

scale

used in the definition of the AIC statistic for selecting the models,currently only forlm,aov andglm models. Specifyingscaleasserts that the residual standard error or dispersion is known.

test

k

the multiple of the number of degrees of freedom used for the penalty.Onlyk = 2 gives the genuine AIC:k = log(n) is sometimesreferred to as BIC or SBC.

sorted

should the results be sorted on the value of AIC?

trace

ifTRUE additional information may be given on the fits as they are tried.

...

arguments passed to or from other methods.

Details

Value

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

quine.hi <- aov(log(Days + 2.5) ~ .^4, quine)quine.nxt <- update(quine.hi, . ~ . - Eth:Sex:Age:Lrn)dropterm(quine.nxt, test=  "F")quine.stp <- stepAIC(quine.nxt,    scope = list(upper = ~Eth*Sex*Age*Lrn, lower = ~1),    trace = FALSE)dropterm(quine.stp, test = "F")quine.3 <- update(quine.stp, . ~ . - Eth:Age:Lrn)dropterm(quine.3, test = "F")quine.4 <- update(quine.3, . ~ . - Eth:Age)dropterm(quine.4, test = "F")quine.5 <- update(quine.4, . ~ . - Age:Lrn)dropterm(quine.5, test = "F")house.glm0 <- glm(Freq ~ Infl*Type*Cont + Sat, family=poisson,                   data = housing)house.glm1 <- update(house.glm0, . ~ . + Sat*(Infl+Type+Cont))dropterm(house.glm1, test = "Chisq")

Foraging Ecology of Bald Eagles

Description

Knight and Skagen collected during a field study on the foragingbehaviour of wintering Bald Eagles in Washington State, USA dataconcerning 160 attempts by one (pirating) Bald Eagle to steal a chumsalmon from another (feeding) Bald Eagle.

Usage

eagles

Format

Theeagles data frame has 8 rows and 5 columns.

y: Number of successful attempts.
n: Total number of attempts.
P: Size of pirating eagle (L = large,S = small).
A: Age of pirating eagle (I = immature,A = adult).
V: Size of victim eagle (L = large,S = small).

Source

Knight, R. L. and Skagen, S. K. (1988)Agonistic asymmetries and the foraging ecology of Bald Eagles.Ecology69, 1188–1194.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

Examples

eagles.glm <- glm(cbind(y, n - y) ~ P*A + V, data = eagles,                  family = binomial)dropterm(eagles.glm)prof <- profile(eagles.glm)plot(prof)pairs(prof)

Seizure Counts for Epileptics

Description

Thall and Vail (1990) give a data set on two-week seizure counts for59 epileptics. The number of seizures was recorded for a baselineperiod of 8 weeks, and then patients were randomly assigned to atreatment group or a control group. Counts were then recorded forfour successive two-week periods. The subject's age is the onlycovariate.

Usage

epil

Format

This data frame has 236 rows and the following 9 columns:

y: the count for the 2-week period.
trt: treatment,"placebo" or"progabide".
base: the counts in the baseline 8-week period.
age: subject's age, in years.
V4: 0/1 indicator variable of period 4.
subject: subject number, 1 to 59.
period: period, 1 to 4.
lbase: log-counts for the baseline period, centred to have zero mean.
lage: log-ages, centred to have zero mean.

Note

The value ofy in row 31 was corrected from21 to23 in version 7.3-65.

Source

Thall, P. F. and Vail, S. C. (1990)Some covariance models for longitudinal count data with over-dispersion.Biometrics46, 657–671.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth Edition. Springer.

Examples

summary(glm(y ~ lbase*trt + lage + V4, family = poisson,            data = epil), correlation = FALSE)epil2 <- epil[epil$period == 1, ]epil2["period"] <- rep(0, 59); epil2["y"] <- epil2["base"]epil["time"] <- 1; epil2["time"] <- 4epil2 <- rbind(epil, epil2)epil2$pred <- unclass(epil2$trt) * (epil2$period > 0)epil2$subject <- factor(epil2$subject)epil3 <- aggregate(epil2, list(epil2$subject, epil2$period > 0),   function(x) if(is.numeric(x)) sum(x) else x[1])epil3$pred <- factor(epil3$pred,   labels = c("base", "placebo", "drug"))contrasts(epil3$pred) <- structure(contr.sdif(3),    dimnames = list(NULL, c("placebo-base", "drug-placebo")))## IGNORE_RDIFF_BEGINsummary(glm(y ~ pred + factor(subject) + offset(log(time)),            family = poisson, data = epil3), correlation = FALSE)## IGNORE_RDIFF_ENDsummary(glmmPQL(y ~ lbase*trt + lage + V4,                random = ~ 1 | subject,                family = poisson, data = epil))summary(glmmPQL(y ~ pred, random = ~1 | subject,                family = poisson, data = epil3))

Plots with Geometrically Equal Scales

Description

Version of a scatterplot with scales chosen to be equal on both axes, thatis 1cm represents the same units on each

Usage

eqscplot(x, y, ratio = 1, tol = 0.04, uin, ...)

Arguments

x

vector of x values, or a 2-column matrix, or a list with componentsx andy

y

vector of y values

ratio

desired ratio of units on the axes. Units on the y axis are drawn atratio times the size of units on the x axis. Ignored ifuin isspecified and of length 2.

tol

proportion of white space at the margins of plot.

uin

desired values for the units-per-inch parameter. If of length 1, thedesired units per inch on the x axis.

...

further arguments forplot and graphical parameters. Note thatpar(xaxs="i", yaxs="i") is enforced, andxlim andylim will be adjusted accordingly.

Details

Limits for the x and y axes are chosen so that they include thedata. One of the sets of limits is then stretched from the midpoint tomake the units in the ratio given byratio. Finally both arestretched by1 + tol to move points away from the axes, and thepoints plotted.

Value

invisibly, the values ofuin used for the plot.

Side Effects

performs the plot.

Note

This was originally written for S:R'splot.window has anargumentasp with a similar effect (including to this function'sratio) and can be passed from the defaultplot function.

Argumentsratio anduin were suggested by Bill Dunlap.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Ecological Factors in Farm Management

Description

Thefarms data frame has 20 rows and 4 columns. The rows are farmson the Dutch island of Terschelling and the columns are factorsdescribing the management of grassland.

Usage

farms

Format

This data frame contains the following columns:

Mois: Five levels of soil moisture – level 3 does not occur at these 20 farms.
Manag: Grassland management type (SF = standard,BF = biological,HF = hobby farming,NM = nature conservation).
Use: Grassland use (U1 = hay production,U2 =intermediate,U3 = grazing).
Manure: Manure usage – classesC0 toC4.

Source

J.C. Gower and D.J. Hand (1996)Biplots. Chapman & Hall, Table 4.6.

Quoted as from:
R.H.G. Jongman, C.J.F. ter Braak and O.F.R. van Tongeren (1987)Data Analysis in Community and Landscape Ecology.PUDOC, Wageningen.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

farms.mca <- mca(farms, abbrev = TRUE)  # Use levels as nameseqscplot(farms.mca$cs, type = "n")text(farms.mca$rs, cex = 0.7)text(farms.mca$cs, labels = dimnames(farms.mca$cs)[[1]], cex = 0.7)

Measurements of Forensic Glass Fragments

Description

Thefgl data frame has 214 rows and 10 columns.It was collected by B. German on fragments of glasscollected in forensic work.

Usage

fgl

Format

This data frame contains the following columns:

RI

refractive index; more precisely the refractive index is 1.518xxxx.

The next 8 measurements are percentages by weight of oxides.

Na

sodium.

Mg

manganese.

Al

aluminium.

Si

silicon.

K

potassium.

Ca

calcium.

Ba

barium.

Fe

iron.

type

The fragments were originally classed into seven types, one of whichwas absent in this dataset. The categories which occur arewindow float glass (WinF: 70),window non-float glass (WinNF: 76),vehicle window glass (Veh: 17),containers (Con: 13),tableware (Tabl: 9) andvehicle headlamps (Head: 29).

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Maximum-likelihood Fitting of Univariate Distributions

Description

Maximum-likelihood fitting of univariate distributions, allowingparameters to be held fixed if desired.

Usage

fitdistr(x, densfun, start, ...)

Arguments

x

A numeric vector of length at least one containing onlyfinite values.

densfun

Either a character string or a function returning a density evaluatedat its first argument.

Distributions"beta","cauchy","chi-squared","exponential","gamma","geometric","log-normal","lognormal","logistic","negative binomial","normal","Poisson","t" and"weibull" are recognised, case being ignored.

start

A named list giving the parameters to be optimized with initialvalues. This can be omitted for some of the named distributions andmust be for others (see Details).

...

Additional parameters, either fordensfun or foroptim.In particular, it can be used to specify bounds vialower orupper or both. If arguments ofdensfun (or the densityfunction corresponding to a character-string specification) are includedthey will be held fixed.

Details

For the Normal, log-Normal, geometric, exponential and Poissondistributions the closed-form MLEs (and exact standard errors) areused, andstart should not be supplied.

For all other distributions, direct optimization of the log-likelihoodis performed usingoptim. The estimated standarderrors are taken from the observed information matrix, calculated by anumerical approximation. For one-dimensional problems the Nelder-Meadmethod is used and for multi-dimensional problems the BFGS method,unless arguments namedlower orupper are supplied (whenL-BFGS-B is used) ormethod is supplied explicitly.

For the"t" named distribution the density is taken to be thelocation-scale family with locationm and scales.

For the following named distributions, reasonable starting values willbe computed ifstart is omitted or only partially specified:"cauchy","gamma","logistic","negative binomial" (parametrized bymu andsize),"t" and"weibull". Note that thesestarting values may not be good enough if the fit is poor: inparticular they are not resistant to outliers unless the fitteddistribution is long-tailed.

There areprint,coef,vcovandlogLik methods for class"fitdistr".

Value

An object of class"fitdistr", a list with four components,

estimate

the parameter estimates,

sd

the estimated standard errors,

vcov

the estimated variance-covariance matrix, and

loglik

the log-likelihood.

Note

Numerical optimization cannot work miracles: please note the commentsinoptim on scaling data. If the fitted parameters arefar away from one, consider re-fitting specifying the controlparameterparscale.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

## avoid spurious accuracyop <- options(digits = 3)set.seed(123)x <- rgamma(100, shape = 5, rate = 0.1)fitdistr(x, "gamma")## now do this directly with more control.fitdistr(x, dgamma, list(shape = 1, rate = 0.1), lower = 0.001)set.seed(123)x2 <- rt(250, df = 9)fitdistr(x2, "t", df = 9)## allow df to vary: not a very good idea!fitdistr(x2, "t")## now do fixed-df fit directly with more control.mydt <- function(x, m, s, df) dt((x-m)/s, df)/sfitdistr(x2, mydt, list(m = 0, s = 1), df = 9, lower = c(-Inf, 0))set.seed(123)x3 <- rweibull(100, shape = 4, scale = 100)fitdistr(x3, "weibull")set.seed(123)x4 <- rnegbin(500, mu = 5, theta = 4)fitdistr(x4, "Negative Binomial")options(op)

Forbes' Data on Boiling Points in the Alps

Description

A data frame with 17 observations on boiling pointof water and barometric pressure in inches of mercury.

Usage

forbes

Format

bp: boiling point (degrees Farenheit).
pres: barometric pressure in inches of mercury.

Source

A. C. Atkinson (1985)Plots, Transformations and Regression. Oxford.

S. Weisberg (1980)Applied Linear Regression. Wiley.

Rational Approximation

Description

Find rational approximations to the components of a real numericobject using a standard continued fraction method.

Usage

fractions(x, cycles = 10, max.denominator = 2000, ...)as.fractions(x)is.fractions(f)

Arguments

x

Any object of mode numeric. Missing values are now allowed.

cycles

The maximum number of steps to be used in the continued fractionapproximation process.

max.denominator

An early termination criterion. If any partial denominatorexceedsmax.denominator the continued fraction stops at that point.

...

arguments passed to or from other methods.

f

anR object.

Details

Each component is first expanded in a continued fraction of theform

x = floor(x) + 1/(p1 + 1/(p2 + ...)))

wherep1,p2, ... are positive integers, terminating eitheratcycles terms or when apj > max.denominator. Thecontinued fraction is then re-arranged to retrieve the numeratorand denominator as integers.

The numerators and denominators are then combined into acharacter vector that becomes the"fracs" attribute and used inprinted representations.

Arithmetic operations on"fractions" objects have full floatingpoint accuracy, but the character representation printed out maynot.

Value

An object of class"fractions". A structure with.Data componentthe same as the input numericx, but with the rationalapproximations held as a character vector attribute,"fracs".Arithmetic operations on"fractions" objects are possible.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth Edition. Springer.

Examples

X <- matrix(runif(25), 5, 5)zapsmall(solve(X, X/5)) # print near-zeroes as zerofractions(solve(X, X/5))fractions(solve(X, X/5)) + 1

Velocities for 82 Galaxies

Description

A numeric vector of velocities in km/sec of 82 galaxies from 6well-separated conic sections of anunfilled survey of the CoronaBorealis region. Multimodality in such surveys is evidence for voidsand superclusters in the far universe.

Usage

galaxies

Note

There is an 83rd measurement of 5607 km/sec in the Postmanet al. paper which is omitted in Roeder (1990) and from thedataset here.

There is also a typo: this dataset has 78th observation 26690 whichshould be 26960.

Source

Roeder, K. (1990) Density estimation with confidence sets exemplifiedby superclusters and voids in galaxies.Journal of the American Statistical Association85, 617–624.

Postman, M., Huchra, J. P. and Geller, M. J. (1986)Probes of large-scale structures in the Corona Borealis region.Astronomical Journal92, 1238–1247.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

gal <- galaxies/1000c(width.SJ(gal, method = "dpi"), width.SJ(gal))plot(x = c(0, 40), y = c(0, 0.3), type = "n", bty = "l",     xlab = "velocity of galaxy (1000km/s)", ylab = "density")rug(gal)lines(density(gal, width = 3.25, n = 200), lty = 1)lines(density(gal, width = 2.56, n = 200), lty = 3)

Calculate the MLE of the Gamma Dispersion Parameter in a GLM Fit

Description

A front end togamma.shape for convenience. Finds thereciprocal of the estimate of the shape parameter only.

Usage

gamma.dispersion(object, ...)

Arguments

object

Fitted model object giving the gamma fit.

...

Additional arguments passed on togamma.shape.

Value

The MLE of the dispersion parameter of the gamma distribution.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Estimate the Shape Parameter of the Gamma Distribution in a GLM Fit

Description

Find the maximum likelihood estimate of the shape parameter ofthe gamma distribution after fitting aGamma generalizedlinear model.

Usage

gamma.shape(object, ...)## S3 method for class 'glm'gamma.shape(object, it.lim = 10,            eps.max = .Machine$double.eps^0.25, verbose = FALSE, ...)

Arguments

object

Fitted model object from aGamma family orquasi family withvariance = "mu^2".

it.lim

Upper limit on the number of iterations.

eps.max

Maximum discrepancy between approximations for the iterationprocess to continue.

verbose

IfTRUE, causes successive iterations to be printed out. Theinitial estimate is taken from the deviance.

...

further arguments passed to or from other methods.

Details

A glm fit for a Gamma family correctly calculates the maximumlikelihood estimate of the mean parameters but provides only acrude estimate of the dispersion parameter. This function takesthe results of the glm fit and solves the maximum likelihoodequation for the reciprocal of the dispersion parameter, which isusually called the shape (or exponent) parameter.

Value

List of two components

alpha

the maximum likelihood estimate

SE

the approximate standard error, the square-root of the reciprocal ofthe observed information.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

clotting <- data.frame(    u = c(5,10,15,20,30,40,60,80,100),    lot1 = c(118,58,42,35,27,25,21,19,18),    lot2 = c(69,35,26,21,18,16,13,12,12))clot1 <- glm(lot1 ~ log(u), data = clotting, family = Gamma)gamma.shape(clot1)gm <- glm(Days + 0.1 ~ Age*Eth*Sex*Lrn,          quasi(link=log, variance="mu^2"), quine,          start = c(3, rep(0,31)))gamma.shape(gm, verbose = TRUE)summary(gm, dispersion = gamma.dispersion(gm))  # better summary

Remission Times of Leukaemia Patients

Description

A data frame from a trial of 42 leukaemia patients. Some weretreated with the drug6-mercaptopurineand the rest are controls. The trial was designed as matched pairs,both withdrawn from the trial when either came out of remission.

Usage

gehan

Format

This data frame contains the following columns:

pair: label for pair.
time: remission time in weeks.
cens: censoring, 0/1.
treat: treatment, control or 6-MP.

Source

Cox, D. R. and Oakes, D. (1984)Analysis of Survival Data.Chapman & Hall, p. 7. Taken from

Gehan, E.A. (1965) A generalized Wilcoxon test for comparingarbitrarily single-censored samples.Biometrika52, 203–233.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

library(survival)gehan.surv <- survfit(Surv(time, cens) ~ treat, data = gehan,     conf.type = "log-log")summary(gehan.surv)survreg(Surv(time, cens) ~ factor(pair) + treat, gehan, dist = "exponential")summary(survreg(Surv(time, cens) ~ treat, gehan, dist = "exponential"))summary(survreg(Surv(time, cens) ~ treat, gehan))gehan.cox <- coxph(Surv(time, cens) ~ treat, gehan)summary(gehan.cox)

Rat Genotype Data

Description

Data from a foster feeding experiment with rat mothers and litters offour different genotypes:A,B,I andJ.Rat litters were separated from their natural mothers at birth andgiven to foster mothers to rear.

Usage

genotype

Format

The data frame has the following components:

Litter: genotype of the litter.
Mother: genotype of the foster mother.
Wt: Litter average weight gain of the litter, in grams at age 28 days.(The source states that the within-litter variability is negligible.)

Source

Scheffe, H. (1959)The Analysis of Variance Wiley p. 140.

Bailey, D. W. (1953)The Inheritance of Maternal Influences on the Growth of the Rat.Unpublished Ph.D. thesis, University of California. Table B of the Appendix.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

Old Faithful Geyser Data

Description

A version of the eruptions data from the ‘Old Faithful’ geyserin Yellowstone National Park, Wyoming. This version comes fromAzzalini and Bowman (1990) and is of continuous measurement from August1 to August 15, 1985.

Some nocturnal duration measurements were coded as 2, 3 or 4 minutes,having originally been described as ‘short’, ‘medium’or ‘long’.

Usage

geyser

Format

A data frame with 299 observations on 2 variables.

`duration`	numeric	Eruption time in mins
`waiting`	numeric	Waiting time for this eruption

Note

Thewaiting time was incorrectly described as the time to thenext eruption in the original files, and corrected forMASSversion 7.3-30.

References

Azzalini, A. and Bowman, A. W. (1990) A look at somedata on the Old Faithful geyser.Applied Statistics39, 357–365.

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Line Transect of Soil in Gilgai Territory

Description

This dataset was collected on a line transect survey in gilgaiterritory in New South Wales, Australia. Gilgais are natural gentledepressions in otherwise flat land, and sometimes seem to be regularlydistributed. The data collection was stimulated by the question: arethese patterns reflected in soil properties? At each of 365 samplinglocations on a linear grid of 4 meters spacing, samples were taken atdepths 0-10 cm, 30-40 cm and 80-90 cm below the surface. pH, electricalconductivity and chloride content were measured on a 1:5 soil:waterextract from each sample.

Usage

gilgais

Format

This data frame contains the following columns:

pH00: pH at depth 0–10 cm.
pH30: pH at depth 30–40 cm.
pH80: pH at depth 80–90 cm.
e00: electrical conductivity in mS/cm (0–10 cm).
e30: electrical conductivity in mS/cm (30–40 cm).
e80: electrical conductivity in mS/cm (80–90 cm).
c00: chloride content in ppm (0–10 cm).
c30: chloride content in ppm (30–40 cm).
c80: chloride content in ppm (80–90 cm).

Source

Webster, R. (1977) Spectral analysis of gilgai soil.Australian Journal of Soil Research15, 191–204.

Laslett, G. M. (1989)Kriging and splines: An empirical comparison of theirpredictive performance in some applications (with discussion).Journal of the American Statistical Association89, 319–409

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Generalized Inverse of a Matrix

Description

Calculates the Moore-Penrose generalized inverse of a matrixX.

Usage

ginv(X, tol = sqrt(.Machine$double.eps))

Arguments

X

Matrix for which the Moore-Penrose inverse is required.

tol

A relative tolerance to detect zero singular values.

Value

A MP generalized inverse matrix forX.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

Change a Negative Binomial fit to a GLM fit

Description

This function modifies an output object fromglm.nb() to onethat looks like the output fromglm() with a negative binomialfamily. This allows it to be updated keeping the theta parameterfixed.

Usage

glm.convert(object)

Arguments

object

An object of class"negbin", typically the output fromglm.nb().

Details

Convenience function needed to effect some low level changes to thestructure of the fitted model object.

Value

An object of class"glm" with negative binomial family. The thetaparameter is then fixed at its present estimate.

Examples

quine.nb1 <- glm.nb(Days ~ Sex/(Age + Eth*Lrn), data = quine)quine.nbA <- glm.convert(quine.nb1)quine.nbB <- update(quine.nb1, . ~ . + Sex:Age:Lrn)anova(quine.nbA, quine.nbB)

Fit a Negative Binomial Generalized Linear Model

Description

A modification of the system functionglm() to includeestimation of the additional parameter,theta, for aNegative Binomial generalized linear model.

Usage

glm.nb(formula, data, weights, subset, na.action,       start = NULL, etastart, mustart,       control = glm.control(...), method = "glm.fit",       model = TRUE, x = FALSE, y = TRUE, contrasts = NULL, ...,       init.theta, link = log)

Arguments

formula,data,weights,subset,na.action,start,etastart,mustart,control,method,model,x,y,contrasts,...

arguments for theglm() function.Note that these excludefamily andoffset(butoffset() can be used).

init.theta

Optional initial value for the theta parameter. If omitted a momentestimator after an initial fit using a Poisson GLM is used.

link

The link function. Currently must be one oflog,sqrtoridentity.

Details

An alternating iteration process is used. For giventheta the GLMis fitted using the same process as used byglm(). For fixed meansthetheta parameter is estimated using score and informationiterations. The two are alternated until convergence of both. (Thenumber of alternations and the number of iterations when estimatingtheta are controlled by themaxit parameter ofglm.control.)

Settingtrace > 0 traces the alternating iterationprocess. Settingtrace > 1 traces theglm fit, andsettingtrace > 2 traces the estimation oftheta.

Value

A fitted model object of classnegbin inheriting fromglmandlm. The object is like the output ofglm but containsthree additional components, namelytheta for the ML estimate oftheta,SE.theta for its approximate standard error (usingobserved rather than expected information), andtwologlik fortwice the log-likelihood function.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

quine.nb1 <- glm.nb(Days ~ Sex/(Age + Eth*Lrn), data = quine)quine.nb2 <- update(quine.nb1, . ~ . + Sex:Age:Lrn)quine.nb3 <- update(quine.nb2, Days ~ .^4)anova(quine.nb1, quine.nb2, quine.nb3)

Fit Generalized Linear Mixed Models via PQL

Description

Fit a GLMM model with multivariate normal random effects, usingPenalized Quasi-Likelihood.

Usage

glmmPQL(fixed, random, family, data, correlation, weights,        control, niter = 10, verbose = TRUE, ...)

Arguments

fixed

a two-sided linear formula giving fixed-effects part of the model.

random

a formula or list of formulae describing the random effects.

family

a GLM family.

data

an optional data frame, list or environment used as the first place to findvariables in the formulae,weights and if present in...,subset.

correlation

an optional correlation structure.

weights

optional case weights as inglm.

control

an optional argument to be passed tolme.

niter

maximum number of iterations.

verbose

logical: print out record of iterations?

...

Further arguments forlme.

Details

glmmPQL works by repeated calls tolme, sonamespacenlme will be loaded at first use. (Before 2015 itused to attachnlme but nowadays only loads the namespace.)

Unlikelme,offset terms are allowed infixed – this is done by pre- and post-processing the calls tolme.

Note that the returned object inherits from class"lme" andthat most generics will use the method for that class. As fromversion 3.1-158, the fitted values have any offset included, as dothe results of callingpredict.

Value

A object of classc("glmmPQL", "lme"): seelmeObject.

References

Schall, R. (1991) Estimation in generalized linear models withrandom effects.Biometrika78, 719–727.

Breslow, N. E. and Clayton, D. G. (1993) Approximate inference ingeneralized linear mixed models.Journal of the American Statistical Association88, 9–25.

Wolfinger, R. and O'Connell, M. (1993) Generalized linear mixed models: apseudo-likelihood approach.Journal of Statistical Computation and Simulation48, 233–243.

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

summary(glmmPQL(y ~ trt + I(week > 2), random = ~ 1 | ID,                family = binomial, data = bacteria))## an example of an offset: the coefficient of 'week' changes by one.summary(glmmPQL(y ~ trt + week, random = ~ 1 | ID,               family = binomial, data = bacteria))summary(glmmPQL(y ~ trt + week + offset(week), random = ~ 1 | ID,                family = binomial, data = bacteria))

Record Times in Scottish Hill Races

Description

The record times in 1984 for 35 Scottish hill races.

Usage

hills

Format

The components are:

dist: distance in miles (on the map).
climb: total height gained during the route, in feet.
time: record time in minutes.

Source

A.C. Atkinson (1986) Comment: Aspects of diagnostic regression analysis.Statistical Science1, 397–402.

[A.C. Atkinson (1988) Transformations unmasked.Technometrics30, 311–318 “corrects” the time for Knock Hill from 78.65to 18.65. It is unclear if this based on the original records.]

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Plot a Histogram with Automatic Bin Width Selection

Description

Plot a histogram with automatic bin width selection, using the Scottor Freedman–Diaconis formulae.

Usage

hist.scott(x, prob = TRUE, xlab = deparse(substitute(x)), ...)hist.FD(x, prob = TRUE, xlab = deparse(substitute(x)), ...)

Arguments

x

A data vector

prob

Should the plot have unit area, so be a density estimate?

xlab,...

Further arguments tohist.

Value

For thenclass.* functions, the suggested number of classes.

Side Effects

Plot a histogram.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S.Springer.

Frequency Table from a Copenhagen Housing Conditions Survey

Description

Thehousing data frame has 72 rows and 5 variables.

Usage

housing

Format

Sat: Satisfaction of householders with their present housingcircumstances, (High, Medium or Low, ordered factor).
Infl: Perceived degree of influence householders have on themanagement of the property (High, Medium, Low).
Type: Type of rental accommodation, (Tower, Atrium, Apartment, Terrace).
Cont: Contact residents are afforded with other residents, (Low, High).
Freq: Frequencies: the numbers of residents in each class.

Source

Madsen, M. (1976)Statistical analysis of multiple contingency tables. Two examples.Scand. J. Statist.3, 97–106.

Cox, D. R. and Snell, E. J. (1984)Applied Statistics, Principles and Examples.Chapman & Hall.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

options(contrasts = c("contr.treatment", "contr.poly"))# Surrogate Poisson modelshouse.glm0 <- glm(Freq ~ Infl*Type*Cont + Sat, family = poisson,                  data = housing)## IGNORE_RDIFF_BEGINsummary(house.glm0, correlation = FALSE)## IGNORE_RDIFF_ENDaddterm(house.glm0, ~. + Sat:(Infl+Type+Cont), test = "Chisq")house.glm1 <- update(house.glm0, . ~ . + Sat*(Infl+Type+Cont))summary(house.glm1, correlation = FALSE)1 - pchisq(deviance(house.glm1), house.glm1$df.residual)dropterm(house.glm1, test = "Chisq")addterm(house.glm1, ~. + Sat:(Infl+Type+Cont)^2, test  =  "Chisq")hnames <- lapply(housing[, -5], levels) # omit FreqnewData <- expand.grid(hnames)newData$Sat <- ordered(newData$Sat)house.pm <- predict(house.glm1, newData,                    type = "response")  # poisson meanshouse.pm <- matrix(house.pm, ncol = 3, byrow = TRUE,                   dimnames = list(NULL, hnames[[1]]))house.pr <- house.pm/drop(house.pm %*% rep(1, 3))cbind(expand.grid(hnames[-1]), round(house.pr, 2))# Iterative proportional scalingloglm(Freq ~ Infl*Type*Cont + Sat*(Infl+Type+Cont), data = housing)# multinomial modellibrary(nnet)(house.mult<- multinom(Sat ~ Infl + Type + Cont, weights = Freq,                       data = housing))house.mult2 <- multinom(Sat ~ Infl*Type*Cont, weights = Freq,                        data = housing)anova(house.mult, house.mult2)house.pm <- predict(house.mult, expand.grid(hnames[-1]), type = "probs")cbind(expand.grid(hnames[-1]), round(house.pm, 2))# proportional odds modelhouse.cpr <- apply(house.pr, 1, cumsum)logit <- function(x) log(x/(1-x))house.ld <- logit(house.cpr[2, ]) - logit(house.cpr[1, ])(ratio <- sort(drop(house.ld)))mean(ratio)(house.plr <- polr(Sat ~ Infl + Type + Cont,                   data = housing, weights = Freq))house.pr1 <- predict(house.plr, expand.grid(hnames[-1]), type = "probs")cbind(expand.grid(hnames[-1]), round(house.pr1, 2))Fr <- matrix(housing$Freq, ncol  =  3, byrow = TRUE)2*sum(Fr*log(house.pr/house.pr1))house.plr2 <- stepAIC(house.plr, ~.^2)house.plr2$anova

Huber M-estimator of Location with MAD Scale

Description

Finds the Huber M-estimator of location with MAD scale.

Usage

huber(y, k = 1.5, tol = 1e-06)

Arguments

y

vector of data values

k

Winsorizes atk standard deviations

tol

convergence tolerance

Value

list of location and scale parameters

mu

location estimate

s

MAD scale estimate

References

Huber, P. J. (1981)Robust Statistics.Wiley.

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

huber(chem)

Huber Proposal 2 Robust Estimator of Location and/or Scale

Description

Finds the Huber M-estimator for location with scale specified, scalewith location specified, or both if neither is specified.

Usage

hubers(y, k = 1.5, mu, s, initmu = median(y), tol = 1e-06)

Arguments

y

vector y of data values

k

Winsorizes atk standard deviations

mu

specified location

s

specified scale

initmu

initial value ofmu

tol

convergence tolerance

Value

list of location and scale estimates

mu

location estimate

s

scale estimate

References

Huber, P. J. (1981)Robust Statistics.Wiley.

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

hubers(chem)hubers(chem, mu=3.68)

Yields from a Barley Field Trial

Description

Theimmer data frame has 30 rows and 4 columns. Five varieties ofbarley were grown in six locations in each of 1931 and 1932.

Usage

immer

Format

This data frame contains the following columns:

Loc: The location.
Var: The variety of barley ("manchuria","svansota","velvet","trebi" and"peatland").
Y1: Yield in 1931.
Y2: Yield in 1932.

Source

Immer, F.R., Hayes, H.D. and LeRoy Powers (1934)Statistical determination of barley varietal adaptation.Journal of the American Society for Agronomy26, 403–419.

Fisher, R.A. (1947)The Design of Experiments. 4th edition. Edinburgh: Oliver and Boyd.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

Examples

immer.aov <- aov(cbind(Y1,Y2) ~ Loc + Var, data = immer)summary(immer.aov)immer.aov <- aov((Y1+Y2)/2 ~ Var + Loc, data = immer)summary(immer.aov)model.tables(immer.aov, type = "means", se = TRUE, cterms = "Var")

Kruskal's Non-metric Multidimensional Scaling

Description

One form of non-metric multidimensional scaling

Usage

isoMDS(d, y = cmdscale(d, k), k = 2, maxit = 50, trace = TRUE,       tol = 1e-3, p = 2)Shepard(d, x, p = 2)

Arguments

d

distance structure of the form returned bydist, or a full,symmetric matrix. Data are assumed to be dissimilarities or relativedistances, but must be positive except for self-distance. Bothmissing and infinite values are allowed.

y

An initial configuration. If none is supplied,cmdscale is usedto provide the classical solution, unless there are missing orinfinite dissimilarities.

k

The desired dimension for the solution, passed tocmdscale.

maxit

The maximum number of iterations.

trace

Logical for tracing optimization. DefaultTRUE.

tol

convergence tolerance.

p

Power for Minkowski distance in the configuration space.

x

A final configuration.

Details

This chooses a k-dimensional (default k = 2) configuration to minimizethe stress, the square root of the ratio of the sum of squareddifferences between the input distances and those of the configurationto the sum of configuration distances squared. However, the inputdistances are allowed a monotonic transformation.

An iterative algorithm is used, which will usually converge in around10 iterations. As this is necessarily anO(n^2) calculation,it is slow for large datasets. Further, since for the defaultp = 2the configuration is only determined up to rotations and reflections(by convention the centroid is at the origin), the result can varyconsiderably from machine to machine.

Value

Two components:

points

A k-column vector of the fitted configuration.

stress

The final stress achieved (in percent).

Side Effects

Iftrace is true, the initial stress and the current stressare printed out every 5 iterations.

References

T. F. Cox and M. A. A. Cox (1994, 2001)Multidimensional Scaling. Chapman & Hall.

Ripley, B. D. (1996)Pattern Recognition and Neural Networks. Cambridge University Press.

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

swiss.x <- as.matrix(swiss[, -1])swiss.dist <- dist(swiss.x)swiss.mds <- isoMDS(swiss.dist)plot(swiss.mds$points, type = "n")text(swiss.mds$points, labels = as.character(1:nrow(swiss.x)))swiss.sh <- Shepard(swiss.dist, swiss.mds$points)plot(swiss.sh, pch = ".")lines(swiss.sh$x, swiss.sh$yf, type = "S")

Two-Dimensional Kernel Density Estimation

Description

Two-dimensional kernel density estimation with an axis-alignedbivariate normal kernel, evaluated on a square grid.

Usage

kde2d(x, y, h, n = 25, lims = c(range(x), range(y)))

Arguments

x

x coordinate of data

y

y coordinate of data

h

vector of bandwidths for x and y directions. Defaults tonormal reference bandwidth (seebandwidth.nrd). A scalarvalue will be taken to apply to both directions.

n

Number of grid points in each direction. Can be scalar or a length-2integer vector.

lims

The limits of the rectangle covered by the grid asc(xl, xu, yl, yu).

Value

A list of three components.

x,y

The x and y coordinates of the grid points, vectors of lengthn.

z

Ann[1] byn[2] matrix of the estimated density: rowscorrespond to the value ofx, columns to the value ofy.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

attach(geyser)plot(duration, waiting, xlim = c(0.5,6), ylim = c(40,100))f1 <- kde2d(duration, waiting, n = 50, lims = c(0.5, 6, 40, 100))image(f1, zlim = c(0, 0.05))f2 <- kde2d(duration, waiting, n = 50, lims = c(0.5, 6, 40, 100),            h = c(width.SJ(duration), width.SJ(waiting)) )image(f2, zlim = c(0, 0.05))persp(f2, phi = 30, theta = 20, d = 5)plot(duration[-272], duration[-1], xlim = c(0.5, 6),     ylim = c(1, 6),xlab = "previous duration", ylab = "duration")f1 <- kde2d(duration[-272], duration[-1],            h = rep(1.5, 2), n = 50, lims = c(0.5, 6, 0.5, 6))contour(f1, xlab = "previous duration",        ylab = "duration", levels  =  c(0.05, 0.1, 0.2, 0.4) )f1 <- kde2d(duration[-272], duration[-1],            h = rep(0.6, 2), n = 50, lims = c(0.5, 6, 0.5, 6))contour(f1, xlab = "previous duration",        ylab = "duration", levels  =  c(0.05, 0.1, 0.2, 0.4) )f1 <- kde2d(duration[-272], duration[-1],            h = rep(0.4, 2), n = 50, lims = c(0.5, 6, 0.5, 6))contour(f1, xlab = "previous duration",        ylab = "duration", levels  =  c(0.05, 0.1, 0.2, 0.4) )detach("geyser")

Linear Discriminant Analysis

Description

Linear discriminant analysis.

Usage

lda(x, ...)## S3 method for class 'formula'lda(formula, data, ..., subset, na.action)## Default S3 method:lda(x, grouping, prior = proportions, tol = 1.0e-4,    method, CV = FALSE, nu, ...)## S3 method for class 'data.frame'lda(x, ...)## S3 method for class 'matrix'lda(x, grouping, ..., subset, na.action)

Arguments

formula

A formula of the formgroups ~ x1 + x2 + ... That is, theresponse is the grouping factor and the right hand side specifiesthe (non-factor) discriminators.

data

An optional data frame, list or environment from which variablesspecified informula are preferentially to be taken.

x

(required if no formula is given as the principal argument.)a matrix or data frame or Matrix containing the explanatory variables.

grouping

(required if no formula principal argument is given.)a factor specifying the class for each observation.

prior

the prior probabilities of class membership. If unspecified, theclass proportions for the training set are used. If present, theprobabilities should be specified in the order of the factorlevels.

tol

A tolerance to decide if a matrix is singular; it will reject variablesand linear combinations of unit-variance variables whose variance isless thantol^2.

subset

An index vector specifying the cases to be used in the trainingsample. (NOTE: If given, this argument must be named.)

na.action

A function to specify the action to be taken ifNAs are found.The default action is for the procedure to fail. An alternative isna.omit, which leads to rejection of cases with missing values onany required variable. (NOTE: If given, this argument must be named.)

method

"moment" for standard estimators of the mean and variance,"mle" for MLEs,"mve" to usecov.mve, or"t" for robust estimates based on at distribution.

CV

If true, returns results (classes and posterior probabilities) forleave-one-out cross-validation. Note that if the prior is estimated,the proportions in the whole dataset are used.

nu

degrees of freedom formethod = "t".

...

arguments passed to or from other methods.

Details

The functiontries hard to detect if the within-class covariance matrix issingular. If any variable has within-group variance less thantol^2 it will stop and report the variable as constant. Thiscould result from poor scaling of the problem, but is morelikely to result from constant variables.

Specifying theprior will affect the classification unlessover-ridden inpredict.lda. Unlike in most statistical packages, itwill also affect the rotation of the linear discriminants within theirspace, as a weighted between-groups covariance matrix is used. Thusthe first few linear discriminants emphasize the differences betweengroups with the weights given by the prior, which may differ fromtheir prevalence in the dataset.

If one or more groups is missing in the supplied data, they are droppedwith a warning, but the classifications produced are with respect to theoriginal set of levels.

Value

IfCV = TRUE the return value is a list with componentsclass, the MAP classification (a factor), andposterior,posterior probabilities for the classes.

Otherwise it is an object of class"lda" containing thefollowing components:

prior

the prior probabilities used.

means

the group means.

scaling

a matrix which transforms observations to discriminant functions,normalized so that within groups covariance matrix is spherical.

svd

the singular values, which give the ratio of the between- andwithin-group standard deviations on the linear discriminantvariables. Their squares are the canonical F-statistics.

N

The number of observations used.

call

The (matched) function call.

Note

This function may be called giving either a formula andoptional data frame, or a matrix and grouping factor as the firsttwo arguments. All other arguments are optional, butsubset= andna.action=, if required, must be fully named.

If a formula is given as the principal argument the object may bemodified usingupdate() in the usual way.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Ripley, B. D. (1996)Pattern Recognition and Neural Networks. Cambridge University Press.

Examples

Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),                   Sp = rep(c("s","c","v"), rep(50,3)))train <- sample(1:150, 75)table(Iris$Sp[train])## your answer may differ##  c  s  v## 22 23 30z <- lda(Sp ~ ., Iris, prior = c(1,1,1)/3, subset = train)predict(z, Iris[-train, ])$class##  [1] s s s s s s s s s s s s s s s s s s s s s s s s s s s c c c## [31] c c c c c c c v c c c c v c c c c c c c c c c c c v v v v v## [61] v v v v v v v v v v v v v v v(z1 <- update(z, . ~ . - Petal.W.))

Histograms or Density Plots of Multiple Groups

Description

Plot histograms or density plots of data on a single Fisher lineardiscriminant.

Usage

ldahist(data, g, nbins = 25, h, x0 = - h/1000, breaks,        xlim = range(breaks), ymax = 0, width,        type = c("histogram", "density", "both"),        sep = (type != "density"),        col = 5, xlab = deparse(substitute(data)), bty = "n", ...)

Arguments

data

vector of data. Missing values (NAs) are allowed and omitted.

g

factor or vector giving groups, of the same length asdata.

nbins

Suggested number of bins to cover the whole range of the data.

h

The bin width (takes precedence overnbins).

x0

Shift for the bins - the breaks are atx0 + h * (..., -1, 0, 1, ...)

breaks

The set of breakpoints to be used. (Usually omitted, takes precedenceoverh andnbins).

xlim

The limits for the x-axis.

ymax

The upper limit for the y-axis.

width

Bandwidth for density estimates. If missing, the Sheather-Jonesselector is used for each group separately.

type

Type of plot.

sep

Whether there is a separate plot for each group, or one combined plot.

col

The colour number for the bar fill.

xlab

label for the plot x-axis. By default, this will be the name ofdata.

bty

The box type for the plot - defaults to none.

...

additional arguments topolygon.

Side Effects

Histogram and/or density plots are plotted on the current device.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Survival Times and White Blood Counts for Leukaemia Patients

Description

A data frame of data from 33 leukaemia patients.

Usage

leuk

Format

A data frame with columns:

wbc: white blood count.
ag: a test result,"present" or"absent".
time: survival time in weeks.

Details

Survival times are given for 33 patients who died from acutemyelogenous leukaemia. Also measured was the patient's white blood cellcount at the time of diagnosis. The patients were also factored into 2groups according to the presence or absence of a morphologiccharacteristic of white blood cells. Patients termed AG positive wereidentified by the presence of Auer rods and/or significant granulationof the leukaemic cells in the bone marrow at the time of diagnosis.

Source

Cox, D. R. and Oakes, D. (1984)Analysis of Survival Data.Chapman & Hall, p. 9.

Taken from

Feigl, P. & Zelen, M. (1965) Estimation of exponential survivalprobabilities with concomitant information.Biometrics21, 826–838.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

library(survival)plot(survfit(Surv(time) ~ ag, data = leuk), lty = 2:3, col = 2:3)# now Cox modelsleuk.cox <- coxph(Surv(time) ~ ag + log(wbc), leuk)summary(leuk.cox)

Fit Linear Models by Generalized Least Squares

Description

Fit linear models by Generalized Least Squares

Usage

lm.gls(formula, data, W, subset, na.action, inverse = FALSE,       method = "qr", model = FALSE, x = FALSE, y = FALSE,       contrasts = NULL, ...)

Arguments

formula

a formula expression as for regression models, of the formresponse ~ predictors.See the documentation offormula for other details.

data

an optional data frame, list or environment in which to interpret thevariables occurring informula.

W

a weight matrix.

subset

expression saying which subset of the rows of the data should be usedin the fit. All observations are included by default.

na.action

a function to filter missing data.

inverse

logical: if trueW specifies the inverse of the weight matrix: thisis appropriate if a variance matrix is used.

method

method to be used bylm.fit.

model

should the model frame be returned?

x

should the design matrix be returned?

y

should the response be returned?

contrasts

a list of contrasts to be used for some or all of

...

additional arguments tolm.fit.

Details

The problem is transformed to uncorrelated form and passed tolm.fit.

Value

An object of class"lm.gls", which is similar to an"lm"object. There is no"weights" component, and only a few"lm"methods will work correctly. As from version 7.1-22 the residuals andfitted values refer to the untransformed problem.

Ridge Regression

Description

Fit a linear model by ridge regression.

Usage

lm.ridge(formula, data, subset, na.action, lambda = 0, model = FALSE,         x = FALSE, y = FALSE, contrasts = NULL, ...)select(obj)

Arguments

formula

a formula expression as for regression models, of the formresponse ~ predictors. See the documentation offormulafor other details.offset terms are allowed.

data

an optional data frame, list or environment in which to interpret thevariables occurring informula.

subset

expression saying which subset of the rows of the data should be usedin the fit. All observations are included by default.

na.action

a function to filter missing data.

lambda

A scalar or vector of ridge constants.

model

should the model frame be returned? Not implemented.

x

should the design matrix be returned? Not implemented.

y

should the response be returned? Not implemented.

contrasts

a list of contrasts to be used for some or all of factor terms in theformula. See thecontrasts.arg ofmodel.matrix.default.

...

additional arguments tolm.fit.

obj

anR object, such as an"lm.ridge" fit.

Details

If an intercept is present in the model, its coefficient is notpenalized. (If you want to penalize an intercept, put in your ownconstant term and remove the intercept.)

Value

A list with components

coef

matrix of coefficients, one row for each value oflambda.Note that these are not on the original scale and are for use by thecoef method.

scales

scalings used on the X matrix.

Inter

was intercept included?

lambda

vector of lambda values

ym

mean ofy

xm

column means ofx matrix

GCV

vector of GCV values

kHKB

HKB estimate of the ridge constant.

kLW

L-W estimate of the ridge constant.

References

Brown, P. J. (1994)Measurement, Regression and CalibrationOxford.

Examples

longley # not the same as the S-PLUS datasetnames(longley)[1] <- "y"lm.ridge(y ~ ., longley)plot(lm.ridge(y ~ ., longley,              lambda = seq(0,0.1,0.001)))select(lm.ridge(y ~ ., longley,               lambda = seq(0,0.1,0.0001)))

Fit Log-Linear Models by Iterative Proportional Scaling

Description

This function provides a front-end to the standard function,loglin, to allow log-linear models to be specified and fittedin a manner similar to that of other fitting functions, such asglm.

Usage

loglm(formula, data, subset, na.action, ...)

Arguments

formula

A linear model formula specifying the log-linear model.

If the left-hand side is empty, thedata argument is requiredand must be a (complete) array of frequencies. In this case thevariables on the right-hand side may be the names of thedimnames attribute of the frequency array, or may be thepositive integers: 1, 2, 3, ... used as alternative names for the1st, 2nd, 3rd, ... dimension (classifying factor).If the left-hand side is not empty it specifies a vector offrequencies. In this case the data argument, if present, must bea data frame from which the left-hand side vector and theclassifying factors on the right-hand side are (preferentially)obtained. The usual abbreviation of a. to stand for ‘allother variables in the data frame’ is allowed. Any non-factorson the right-hand side of the formula are coerced to factor.

data

Numeric array or data frame (or list or environment).In the first case it specifies thearray of frequencies; in the second it provides the data framefrom which the variables occurring in the formula arepreferentially obtained in the usual way.

This argument may be the result of a call toxtabs.

subset

Specifies a subset of the rows in the data frame to be used. Thedefault is to take all rows.

na.action

Specifies a method for handling missing observations. Thedefault is to fail if missing values are present.

...

May supply other arguments to the functionloglm1.

Details

If the left-hand side of the formula is empty thedata argumentsupplies the frequency array and the right-hand side of theformula is used to construct the list of fixed faces as requiredbyloglin. Structural zeros may be specified by giving astart argument with those entries set to zero, as described inthe help information forloglin.

If the left-hand side is not empty, all variables on theright-hand side are regarded as classifying factors and an arrayof frequencies is constructed. If some cells in the completearray are not specified they are treated as structural zeros.The right-hand side of the formula is again used to construct thelist of faces on which the observed and fitted totals must agree,as required byloglin. Hence terms such asa:b,a*b anda/b are all equivalent.

Value

An object of class"loglm" conveying the results of the fittedlog-linear model. Methods exist for the generic functionsprint,summary,deviance,fitted,coef,resid,anova andupdate, which perform the expectedtasks. Only log-likelihood ratio tests are allowed usinganova.

The deviance is simply an alternative name for the log-likelihoodratio statistic for testing the current model within a saturatedmodel, in accordance with standard usage in generalized linearmodels.

Warning

If structural zeros are present, the calculation of degrees offreedom may not be correct.loglin itself takes no action toallow for structural zeros.loglm deducts one degree offreedom for each structural zero, but cannot make allowance forgains in error degrees of freedom due to loss of dimension in themodel space. (This would require checking the rank of themodel matrix, but since iterative proportional scaling methodsare developed largely to avoid constructing the model matrixexplicitly, the computation is at least difficult.)

When structural zeros (or zero fitted values) are present theestimated coefficients will not be available due to infiniteestimates. The deviances will normally continue to be correct, though.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

# The data frames  Cars93, minn38 and quine are available# in the MASS package.# Case 1: frequencies specified as an array.sapply(minn38, function(x) length(levels(x)))## hs phs fol sex f##  3   4   7   2 0##minn38a <- array(0, c(3,4,7,2), lapply(minn38[, -5], levels))##minn38a[data.matrix(minn38[,-5])] <- minn38$f## or more simplyminn38a <- xtabs(f ~ ., minn38)fm <- loglm(~ 1 + 2 + 3 + 4, minn38a)  # numerals as names.deviance(fm)## [1] 3711.9fm1 <- update(fm, .~.^2)fm2 <- update(fm, .~.^3, print = TRUE)## 5 iterations: deviation 0.075anova(fm, fm1, fm2)# Case 1. An array generated with xtabs.loglm(~ Type + Origin, xtabs(~ Type + Origin, Cars93))# Case 2.  Frequencies given as a vector in a data framenames(quine)## [1] "Eth"  "Sex"  "Age"  "Lrn"  "Days"fm <- loglm(Days ~ .^2, quine)gm <- glm(Days ~ .^2, poisson, quine)  # check glm.c(deviance(fm), deviance(gm))          # deviances agree## [1] 1368.7 1368.7c(fm$df, gm$df)                        # resid df do not!c(fm$df, gm$df.residual)               # resid df do not!## [1] 127 128# The loglm residual degrees of freedom is wrong because of# a non-detectable redundancy in the model matrix.

Fit Log-Linear Models by Iterative Proportional Scaling – Internal function

Description

loglm1 is an internal function used byloglm.It is a generic function dispatching on thedata argument.

Usage

loglm1(formula, data, ...)## S3 method for class 'xtabs'loglm1(formula, data, ...)## S3 method for class 'data.frame'loglm1(formula, data, ...)## Default S3 method:loglm1(formula, data, start = rep(1, length(data)), fitted = FALSE,       keep.frequencies = fitted, param = TRUE, eps = 1/10,       iter = 40, print = FALSE, ...)

Arguments

formula

A linear model formula specifying the log-linear model.Seeloglm for its interpretation.

data

Numeric array or data frame (or list or environment).In the first case it specifies thearray of frequencies; in then second it provides the data framefrom which the variables occurring in the formula arepreferentially obtained in the usual way.

This argument may also be the result of a call toxtabs.

start,param,eps,iter,print

Arguments passed tologlin.

fitted

logical: should the fitted values be returned?

keep.frequencies

IfTRUE specifies that the (possibly constructed) array offrequencies is to be retained as part of the fitted model object. Thedefault action is to use the same value as that used forfitted.

...

arguments passed to the default method.

Value

An object of class"loglm".

Estimate log Transformation Parameter

Description

Find and optionally plot the marginal (profile) likelihood for alphafor a transformation model of the formlog(y + alpha) ~ x1 + x2 + ....

Usage

logtrans(object, ...)## Default S3 method:logtrans(object, ..., alpha = seq(0.5, 6, by = 0.25) - min(y),         plotit = TRUE, interp =, xlab = "alpha",         ylab = "log Likelihood")## S3 method for class 'formula'logtrans(object, data, ...)## S3 method for class 'lm'logtrans(object, ...)

Arguments

object

Fitted linear model object, or formula defining the untransformedmodel that isy ~ x1 + x2 + .... The function is generic.

...

Ifobject is a formula, this argument may specify a data frameas forlm.

alpha

Set of values for the transformation parameter, alpha.

plotit

Should plotting be done?

interp

Should the marginal log-likelihood be interpolated with a splineapproximation? (Default isTRUE if plotting is to be done andthe number of real points is less than 100.)

xlab

as forplot.

ylab

as forplot.

data

optionaldata argument forlm fit.

Value

List with componentsx (for alpha) andy (for the marginallog-likelihood values).

Side Effects

A plot of the marginal log-likelihood is produced, if requested,together with an approximate mle and 95% confidence interval.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

logtrans(Days ~ Age*Sex*Eth*Lrn, data = quine,         alpha = seq(0.75, 6.5, length.out = 20))

Resistant Regression

Description

Fit a regression to thegood points in the dataset, therebyachieving a regression estimator with a high breakdown point.lmsreg andltsreg are compatibility wrappers.

Usage

lqs(x, ...)## S3 method for class 'formula'lqs(formula, data, ...,    method = c("lts", "lqs", "lms", "S", "model.frame"),    subset, na.action, model = TRUE,    x.ret = FALSE, y.ret = FALSE, contrasts = NULL)## Default S3 method:lqs(x, y, intercept = TRUE, method = c("lts", "lqs", "lms", "S"),    quantile, control = lqs.control(...), k0 = 1.548, seed, ...)lmsreg(...)ltsreg(...)

Arguments

formula

a formula of the formy ~ x1 + x2 + ....

data

an optional data frame, list or environemnt from whichvariables specified informula are preferentially to be taken.

subset

an index vector specifying the cases to be used infitting. (NOTE: If given, this argument must be named exactly.)

na.action

function to specify the action to be taken ifNAs are found. The default action is for the procedure tofail. Alternatives includena.omit andna.exclude, which lead to omission ofcases with missing values on any required variable. (NOTE: Ifgiven, this argument must be named exactly.)

model,x.ret,y.ret

logical. IfTRUE the model frame,the model matrix and the response are returned, respectively.

contrasts

an optional list. See thecontrasts.argofmodel.matrix.default.

x

a matrix or data frame containing the explanatory variables.

y

the response: a vector of length the number of rows ofx.

intercept

should the model include an intercept?

method

the method to be used.model.frame returns the model frame: for theothers see theDetails section. Usinglmsreg orltsreg forces"lms" and"lts" respectively.

quantile

the quantile to be used: seeDetails. This is over-ridden ifmethod = "lms".

control

additional control items: seeDetails.

k0

the cutoff / tuning constant used for\chi()and\psi() functions whenmethod = "S", currentlycorresponding to Tukey's ‘biweight’.

seed

the seed to be used for random sampling: see.Random.seed. Thecurrent value of.Random.seed will be preserved if it is set..

...

arguments to be passed tolqs.default orlqs.control, seecontrol above andDetails.

Details

Suppose there aren data points andp regressors,including any intercept.

The first three methods minimize some function of the sorted squaredresiduals. For methods"lqs" and"lms" is thequantile squared residual, and for"lts" it is the sumof thequantile smallest squared residuals."lqs" and"lms" differ in the defaults forquantile, which arefloor((n+p+1)/2) andfloor((n+1)/2) respectively.For"lts" the default isfloor(n/2) + floor((p+1)/2).

The"S" estimation method solves for the scalessuch that the average of a function chi of the residuals dividedbys is equal to a given constant.

Thecontrol argument is a list with components

psamp:: the size of each sample. Defaults top.
nsamp:: the number of samples or"best" (thedefault) or"exact" or"sample".If"sample" the number chosen ismin(5*p, 3000),taken from Rousseeuw and Hubert (1997).If"best" exhaustive enumeration is done up to 5000 samples;if"exact" exhaustive enumeration will be attempted howevermany samples are needed.
adjust:: should the intercept be optimized for eachsample? Defaults toTRUE.

Value

An object of class"lqs". This is a list with components

crit

the value of the criterion for the best solution found, inthe case ofmethod == "S" before IWLS refinement.

sing

character. A message about the number of samples whichresulted in singular fits.

coefficients

of the fitted linear model

bestone

the indices of those points fitted by the best samplefound (prior to adjustment of the intercept, if requested).

fitted.values

the fitted values.

residuals

the residuals.

scale

estimate(s) of the scale of the error. The first is basedon the fit criterion. The second (not present formethod == "S") is based on the variance of those residuals whose absolutevalue is less than 2.5 times the initial estimate.

Note

There seems no reason other than historical to use thelms andlqs options. LMS estimation is of low efficiency (convergingat raten^{-1/3}) whereas LTS has the same asymptotic efficiencyas an M estimator with trimming at the quartiles (Marazzi, 1993, p.201).LQS and LTS have the same maximal breakdown value of(floor((n-p)/2) + 1)/n attained iffloor((n+p)/2) <= quantile <= floor((n+p+1)/2).The only drawback mentioned of LTS is greater computation, as a sortwas thought to be required (Marazzi, 1993, p.201) but this is nottrue as a partial sort can be used (and is used in this implementation).

Adjusting the intercept for each trial fit does need the residuals tobe sorted, and may be significant extra computation ifn is largeandp small.

Opinions differ over the choice ofpsamp. Rousseeuw and Hubert(1997) only consider p; Marazzi (1993) recommends p+1 and suggeststhat more samples are better than adjustment for a given computationallimit.

The computations are exact for a model with just an intercept andadjustment, and for LQS for a model with an intercept plus oneregressor and exhaustive search with adjustment. For all other casesthe minimization is only known to be approximate.

References

P. J. Rousseeuw and A. M. Leroy (1987)Robust Regression and Outlier Detection. Wiley.

A. Marazzi (1993)Algorithms, Routines and S Functions for Robust Statistics.Wadsworth and Brooks/Cole.

P. Rousseeuw and M. Hubert (1997) Recent developments in PROGRESS. InL1-Statistical Procedures and Related Topics,ed Y. Dodge, IMS Lecture Notes volume31, pp. 201–214.

Examples

## IGNORE_RDIFF_BEGINset.seed(123) # make reproduciblelqs(stack.loss ~ ., data = stackloss)lqs(stack.loss ~ ., data = stackloss, method = "S", nsamp = "exact")## IGNORE_RDIFF_END

Brain and Body Weights for 62 Species of Land Mammals

Description

A data frame with average brain and body weights for 62 speciesof land mammals.

Usage

mammals

Format

body: body weight in kg.
brain: brain weight in g.
name: Common name of species.(Rock hyrax-a =Heterohyrax brucci,Rock hyrax-b =Procavia habessinic..)

Source

Weisberg, S. (1985)Applied Linear Regression. 2nd edition. Wiley, pp. 144–5.

Selected from:Allison, T. and Cicchetti, D. V. (1976)Sleep in mammals: ecological and constitutional correlates.Science194, 732–734.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

Multiple Correspondence Analysis

Description

Computes a multiple correspondence analysis of a set of factors.

Usage

mca(df, nf = 2, abbrev = FALSE)

Arguments

df

A data frame containing only factors

nf

The number of dimensions for the MCA. Rarely 3 might be useful.

abbrev

Should the vertex names be abbreviated? By default these are of theform ‘factor.level’ but ifabbrev = TRUE they are just‘level’ which will suffice if the factors have distinct levels.

Value

An object of class"mca", with components

rs

The coordinates of the rows, innf dimensions.

cs

The coordinates of the column vertices, one for each level of each factor.

fs

Weights for each row, used to interpolate additional factors inpredict.mca.

p

The number of factors

d

The singular values for thenf dimensions.

call

The matched call.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

farms.mca <- mca(farms, abbrev=TRUE)farms.mcaplot(farms.mca)

Data from a Simulated Motorcycle Accident

Description

A data frame giving a series of measurements of head accelerationin a simulated motorcycle accident, used to test crash helmets.

Usage

mcycle

Format

times: in milliseconds after impact.
accel: in g.

Source

Silverman, B. W. (1985) Some aspects of the spline smoothing approach tonon-parametric curve fitting.Journal of the Royal Statistical Society series B47, 1–52.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

Age of Menarche in Warsaw

Description

Proportions of female children at various ages during adolescencewho have reached menarche.

Usage

menarche

Format

This data frame contains the following columns:

Age: Average age of the group. (The groups are reasonably age homogeneous.)
Total: Total number of children in the group.
Menarche: Number who have reached menarche.

Source

Milicer, H. and Szczotka, F. (1966) Age at Menarche in Warsaw girls in1965.Human Biology38, 199–203.

The data are also given in
Aranda-Ordaz, F.J. (1981)On two families of transformations to additivity for binary response data.Biometrika68, 357–363.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

mprob <- glm(cbind(Menarche, Total - Menarche) ~ Age,             binomial(link = probit), data = menarche)

Michelson's Speed of Light Data

Description

Measurements of the speed of light in air, made between 5th Juneand 2nd July, 1879. The data consists of five experiments, eachconsisting of 20 consecutive runs. The response is the speedof light in km/s, less 299000. The currently accepted value, onthis scale of measurement, is 734.5.

Usage

michelson

Format

The data frame contains the following components:

Expt: The experiment number, from 1 to 5.
Run: The run number within each experiment.
Speed: Speed-of-light measurement.

Source

A.J. Weekes (1986)A Genstat Primer. Edward Arnold.

S. M. Stigler (1977) Do robust estimators work with real data?Annals of Statistics5, 1055–1098.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Minnesota High School Graduates of 1938

Description

The Minnesota high school graduates of 1938 were classified according tofour factors, described below. Theminn38 data frame has 168rows and 5 columns.

Usage

minn38

Format

This data frame contains the following columns:

hs: high school rank:"L","M" and"U" for lower,middle and upper third.
phs: post high school status: Enrolled in college, ("C"), enrolled innon-collegiate school, ("N"), employed full-time, ("E")and other, ("O").
fol: father's occupational level, (seven levels,"F1","F2",...,"F7").
sex: sex: factor with levels"F" or"M".
f: frequency.

Source

From R. L. Plackett, (1974)The Analysis of CategoricalData. London: Griffin

who quotes the data from

Hoyt, C. J., Krishnaiah, P. R. and Torrance, E. P. (1959) Analysis ofcomplex contingency tables,J. Exp. Ed.27, 187–194.

Accelerated Life Testing of Motorettes

Description

Themotors data frame has 40 rows and 3 columns. It describes anaccelerated life test at each of four temperatures of 10 motorettes,and has rather discrete times.

Usage

motors

Format

This data frame contains the following columns:

temp: the temperature (degrees C) of the test.
time: the time in hours to failure or censoring at 8064 hours (= 336 days).
cens: an indicator variable for death.

Source

Kalbfleisch, J. D. and Prentice, R. L. (1980)The Statistical Analysis of Failure Time Data.New York: Wiley.

taken from

Nelson, W. D. and Hahn, G. J. (1972)Linear regression of a regression relationship from censored data.Part 1 – simple methods and their application.Technometrics,14, 247–276.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

library(survival)plot(survfit(Surv(time, cens) ~ factor(temp), motors), conf.int = FALSE)# fit Weibull modelmotor.wei <- survreg(Surv(time, cens) ~ temp, motors)## IGNORE_RDIFF_BEGINsummary(motor.wei)## IGNORE_RDIFF_END# and predict at 130Cunlist(predict(motor.wei, data.frame(temp=130), se.fit = TRUE))motor.cox <- coxph(Surv(time, cens) ~ temp, motors)summary(motor.cox)# predict at temperature 200plot(survfit(motor.cox, newdata = data.frame(temp=200),     conf.type = "log-log"))summary( survfit(motor.cox, newdata = data.frame(temp=130)) )

Effect of Calcium Chloride on Muscle Contraction in Rat Hearts

Description

The purpose of this experiment was to assess the influence ofcalcium in solution on the contraction of heart muscle in rats.The left auricle of 21 rat hearts was isolated and on severaloccasions a constant-length strip of tissue was electricallystimulated and dipped into various concentrations of calciumchloride solution, after which the shortening of the strip wasaccurately measured as the response.

Usage

muscle

Format

This data frame contains the following columns:

Strip: which heart muscle strip was used?
Conc: concentration of calcium chloride solution, in multiples of 2.2 mM.
Length: the change in length (shortening) of the strip, (allegedly) in mm.

Source

Linder, A., Chakravarti, I. M. and Vuagnat, P. (1964)Fitting asymptotic regression curves with different asymptotes. InContributions to Statistics. Presented to Professor P. C. Mahalanobison the occasion of his 70th birthday,ed. C. R. Rao, pp. 221–228. Oxford: Pergamon Press.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth Edition. Springer.

Examples

## IGNORE_RDIFF_BEGINA <- model.matrix(~ Strip - 1, data=muscle)rats.nls1 <- nls(log(Length) ~ cbind(A, rho^Conc),   data = muscle, start = c(rho=0.1), algorithm="plinear")(B <- coef(rats.nls1))st <- list(alpha = B[2:22], beta = B[23], rho = B[1])(rats.nls2 <- nls(log(Length) ~ alpha[Strip] + beta*rho^Conc,                  data = muscle, start = st))## IGNORE_RDIFF_ENDMuscle <- with(muscle, {Muscle <- expand.grid(Conc = sort(unique(Conc)), Strip = levels(Strip))Muscle$Yhat <- predict(rats.nls2, Muscle)Muscle <- cbind(Muscle, logLength = rep(as.numeric(NA), 126))ind <- match(paste(Strip, Conc),            paste(Muscle$Strip, Muscle$Conc))Muscle$logLength[ind] <- log(Length)Muscle})lattice::xyplot(Yhat ~ Conc | Strip, Muscle, as.table = TRUE,   ylim = range(c(Muscle$Yhat, Muscle$logLength), na.rm = TRUE),   subscripts = TRUE, xlab = "Calcium Chloride concentration (mM)",   ylab = "log(Length in mm)", panel =   function(x, y, subscripts, ...) {      panel.xyplot(x, Muscle$logLength[subscripts], ...)      llines(spline(x, y))   })

Simulate from a Multivariate Normal Distribution

Description

Produces one or more samples from the specifiedmultivariate normal distribution.

Usage

mvrnorm(n = 1, mu, Sigma, tol = 1e-6, empirical = FALSE, EISPACK = FALSE)

Arguments

n

the number of samples required.

mu

a vector giving the means of the variables.

Sigma

a positive-definite symmetric matrix specifying thecovariance matrix of the variables.

tol

tolerance (relative to largest variance) for numerical lackof positive-definiteness inSigma.

empirical

logical. If true, mu and Sigma specify the empiricalnot population mean and covariance matrix.

EISPACK

logical: values other thanFALSE are an error.

Details

The matrix decomposition is done viaeigen; although a Choleskidecomposition might be faster, the eigendecomposition isstabler.

Value

Ifn = 1 a vector of the same length asmu, otherwise ann bylength(mu) matrix with one sample in each row.

Side Effects

Causes creation of the dataset.Random.seed if it doesnot already exist, otherwise its value is updated.

References

B. D. Ripley (1987)Stochastic Simulation. Wiley. Page 98.

Examples

Sigma <- matrix(c(10,3,3,2),2,2)Sigmavar(mvrnorm(n = 1000, rep(0, 2), Sigma))var(mvrnorm(n = 1000, rep(0, 2), Sigma, empirical = TRUE))

Family function for Negative Binomial GLMs

Description

Specifies the information required to fit a Negative Binomial generalizedlinear model, with knowntheta parameter, usingglm().

Usage

negative.binomial(theta = stop("'theta' must be specified"), link = "log")

Arguments

theta

The known value of the additional parameter,theta.

link

The link function, as a character string, name or one-elementcharacter vector specifying one oflog,sqrtoridentity, or an object of class"link-glm".

Value

An object of class"family", a list of functions andexpressions needed byglm() to fit a Negative Binomialgeneralized linear model.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

Examples

# Fitting a Negative Binomial model to the quine data#   with theta = 2 assumed known.#glm(Days ~ .^4, family = negative.binomial(2), data = quine)

Newcomb's Measurements of the Passage Time of Light

Description

A numeric vector giving the ‘Third Series’ of measurements of thepassage time of light recorded by Newcomb in 1882. The given valuesdivided by 1000 plus 24.8 give the time in millionths of a second forlight to traverse a known distance. The ‘true’ value is nowconsidered to be 33.02.

The dataset is given in the order in Staudte and Sheather. Stigler(1977, Table 5) gives the dataset as

    28 26 33 24 34 -44 27 16 40 -2 29 22 24 21 25 30 23 29 31 19    24 20 36 32 36 28 25 21 28 29 37 25 28 26 30 32 36 26 30 22    36 23 27 27 28 27 31 27 26 33 26 32 32 24 39 28 24 25 32 25    29 27 28 29 16 23

However, order is not relevant to its use as an example of robustestimation. (Thanks to Anthony Unwin for bringing this difference to ourattention.)

Usage

newcomb

Source

S. M. Stigler (1973)Simon Newcomb, Percy Daniell, and the history of robust estimation1885–1920.Journal of the American Statistical Association68, 872–879.

S. M. Stigler (1977)Do robust estimators work withreal data?Annals of Statistics,5, 1055–1098.

R. G. Staudte and S. J. Sheather (1990)Robust Estimation and Testing. Wiley.

Eighth-Grade Pupils in the Netherlands

Description

Snijders and Bosker (1999) use as a running example a study of 2287eighth-grade pupils (aged about 11) in 132 classes in 131 schools inthe Netherlands. Only the variables used in our examples are supplied.

Usage

nlschools

Format

This data frame contains 2287 rows and the following columns:

lang: language test score.
IQ: verbal IQ.
class: class ID.
GS: class size: number of eighth-grade pupils recorded in the class (theremay be others: seeCOMB, and some may have been omittedwith missing values).
SES: social-economic status of pupil's family.
COMB: were the pupils taught in a multi-grade class (0/1)? Classes whichcontained pupils from grades 7 and 8 are coded1, but onlyeighth-graders were tested.

Source

Snijders, T. A. B. and Bosker, R. J. (1999)Multilevel Analysis. An Introduction to Basic and AdvancedMultilevel Modelling. London: Sage.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

nl1 <- within(nlschools, {IQave <- tapply(IQ, class, mean)[as.character(class)]IQ <- IQ - IQave})cen <- c("IQ", "IQave", "SES")nl1[cen] <- scale(nl1[cen], center = TRUE, scale = FALSE)nl.lme <- nlme::lme(lang ~ IQ*COMB + IQave + SES,                    random = ~ IQ | class, data = nl1)## IGNORE_RDIFF_BEGINsummary(nl.lme)## IGNORE_RDIFF_END

Classical N, P, K Factorial Experiment

Description

A classical N, P, K (nitrogen, phosphate, potassium) factorialexperiment on the growth of peas conducted on 6 blocks. Each half of afractional factorial design confounding the NPK interaction was usedon 3 of the plots.

Usage

npk

Format

Thenpk data frame has 24 rows and 5 columns:

block: which block (label 1 to 6).
N: indicator (0/1) for the application of nitrogen.
P: indicator (0/1) for the application of phosphate.
K: indicator (0/1) for the application of potassium.
yield: Yield of peas, in pounds/plot (the plots were (1/70) acre).

Note

This dataset is also contained inR 3.0.2 and later.

Source

Imperial College, London, M.Sc. exercise sheet.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

options(contrasts = c("contr.sum", "contr.poly"))npk.aov <- aov(yield ~ block + N*P*K, npk)## IGNORE_RDIFF_BEGINnpk.aovsummary(npk.aov)alias(npk.aov)coef(npk.aov)options(contrasts = c("contr.treatment", "contr.poly"))npk.aov1 <- aov(yield ~ block + N + K, data = npk)summary.lm(npk.aov1)se.contrast(npk.aov1, list(N=="0", N=="1"), data = npk)model.tables(npk.aov1, type = "means", se = TRUE)## IGNORE_RDIFF_END

US Naval Petroleum Reserve No. 1 data

Description

Data on the locations, porosity and permeability (a measure of oil flow)on 104 oil wells in the US Naval Petroleum Reserve No. 1 in California.

Usage

npr1

Format

This data frame contains the following columns:

x: x coordinates, in miles (origin unspecified)..
y: y coordinates, in miles.
perm: permeability in milli-Darcies.
por: porosity (%).

Source

Maher, J.C., Carter, R.D. and Lantz, R.J. (1975)Petroleum geology of Naval Petroleum Reserve No. 1, Elk Hills,Kern County, California.USGS Professional Paper912.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Data from an Oats Field Trial

Description

The yield of oats from a split-plot field trial using three varieties andfour levels of manurial treatment. The experiment was laid out in 6 blocksof 3 main plots, each split into 4 sub-plots. The varieties were appliedto the main plots and the manurial treatments to the sub-plots.

Usage

oats

Format

This data frame contains the following columns:

B: Blocks, levels I, II, III, IV, V and VI.
V: Varieties, 3 levels.
N: Nitrogen (manurial) treatment, levels 0.0cwt, 0.2cwt, 0.4cwt and 0.6cwt,showing the application in cwt/acre.
Y: Yields in 1/4lbs per sub-plot, each of area 1/80 acre.

Source

Yates, F. (1935) Complex experiments,Journal of the Royal Statistical Society Suppl.2, 181–247.

Also given inYates, F. (1970)Experimental design: Selected papers of Frank Yates, C.B.E, F.R.S.London: Griffin.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

oats$Nf <- ordered(oats$N, levels = sort(levels(oats$N)))oats.aov <- aov(Y ~ Nf*V + Error(B/V), data = oats, qr = TRUE)## IGNORE_RDIFF_BEGINsummary(oats.aov)summary(oats.aov, split = list(Nf=list(L=1, Dev=2:3)))## IGNORE_RDIFF_ENDpar(mfrow = c(1,2), pty = "s")plot(fitted(oats.aov[[4]]), studres(oats.aov[[4]]))abline(h = 0, lty = 2)oats.pr <- proj(oats.aov)qqnorm(oats.pr[[4]][,"Residuals"], ylab = "Stratum 4 residuals")qqline(oats.pr[[4]][,"Residuals"])par(mfrow = c(1,1), pty = "m")oats.aov2 <- aov(Y ~ N + V + Error(B/V), data = oats, qr = TRUE)model.tables(oats.aov2, type = "means", se = TRUE)

The Painter's Data of de Piles

Description

The subjective assessment, on a 0 to 20 integer scale, of 54classical painters. The painters were assessed on four characteristics:composition, drawing, colour and expression. The data is due to theEighteenth century art critic, de Piles.

Usage

painters

Format

The row names of the data frame are the painters. The components are:

Composition: Composition score.
Drawing: Drawing score.
Colour: Colour score.
Expression: Expression score.
School: The school to which a painter belongs, as indicated by a factor levelcode as follows:"A": Renaissance;"B": Mannerist;"C": Seicento;"D": Venetian;"E": Lombard;"F": Sixteenth Century;"G": Seventeenth Century;"H": French.

Source

A. J. Weekes (1986)A Genstat Primer. Edward Arnold.

M. Davenport and G. Studdert-Kennedy (1972) The statisticalanalysis of aesthetic judgement: an exploration.Applied Statistics21, 324–333.

I. T. Jolliffe (1986)Principal Component Analysis. Springer.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Produce Pairwise Scatterplots from an 'lda' Fit

Description

Pairwise scatterplot of the data on the linear discriminants.

Usage

## S3 method for class 'lda'pairs(x, labels = colnames(x), panel = panel.lda,     dimen, abbrev = FALSE, ..., cex=0.7, type = c("std", "trellis"))

Arguments

x

Object of class"lda".

labels

vector of character strings for labelling the variables.

panel

panel function to plot the data in each panel.

dimen

The number of linear discriminants to be used for the plot; if thisexceeds the number determined byx the smaller value is used.

abbrev

whether the group labels are abbreviated on the plots. Ifabbrev > 0this givesminlength in the call toabbreviate.

...

additional arguments forpairs.default.

cex

graphics parametercex for labels on plots.

type

type of plot. The default is in the style ofpairs.default; thestyle"trellis" uses the Trellis functionsplom.

Details

This function is a method for the generic functionpairs() for class"lda".It can be invoked by callingpairs(x) for anobjectx of the appropriate class, or directly bycallingpairs.lda(x) regardless of theclass of the object.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Parallel Coordinates Plot

Description

Parallel coordinates plot

Usage

parcoord(x, col = 1, lty = 1, var.label = FALSE, ...)

Arguments

x

a matrix or data frame who columns represent variables. Missing valuesare allowed.

col

A vector of colours, recycled as necessary for each observation.

lty

A vector of line types, recycled as necessary for each observation.

var.label

IfTRUE, each variable's axis is labelled with maximum andminimum values.

...

Further graphics parameters which are passed tomatplot.

Side Effects

a parallel coordinates plots is drawn.

Author(s)

B. D. Ripley. Enhancements based on ideas and code by Fabian Scheipl.

References

Wegman, E. J. (1990) Hyperdimensional data analysis using parallelcoordinates.Journal of the American Statistical Association85, 664–675.

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

parcoord(state.x77[, c(7, 4, 6, 2, 5, 3)])ir <- rbind(iris3[,,1], iris3[,,2], iris3[,,3])parcoord(log(ir)[, c(3, 4, 2, 1)], col = 1 + (0:149)%/%50)

N. L. Prater's Petrol Refinery Data

Description

The yield of a petroleum refining process with four covariates.The crude oil appears to come from only 10 distinct samples.

These data were originally used by Prater (1956) tobuild an estimation equation for the yield of the refiningprocess of crude oil to gasoline.

Usage

petrol

Format

The variables are as follows

No: crude oil sample identification label. (Factor.)
SG: specific gravity, degrees API. (Constant within sample.)
VP: vapour pressure in pounds per square inch. (Constant within sample.)
V10: volatility of crude; ASTM 10% point. (Constant within sample.)
EP: desired volatility of gasoline. (The end point. Varies within sample.)
Y: yield as a percentage of crude.

Source

N. H. Prater (1956) Estimate gasoline yields fromcrudes.Petroleum Refiner35, 236–238.

This dataset is also given inD. J. Hand, F. Daly, K. McConway, D. Lunn and E. Ostrowski (eds) (1994)A Handbook of Small Data Sets. Chapman & Hall.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

library(nlme)Petrol <- petrolPetrol[, 2:5] <- scale(as.matrix(Petrol[, 2:5]), scale = FALSE)pet3.lme <- lme(Y ~ SG + VP + V10 + EP,                random = ~ 1 | No, data = Petrol)pet3.lme <- update(pet3.lme, method = "ML")pet4.lme <- update(pet3.lme, fixed. = Y ~ V10 + EP)anova(pet4.lme, pet3.lme)

Plot Method for Class 'lda'

Description

Plots a set of data on one, two or more linear discriminants.

Usage

## S3 method for class 'lda'plot(x, panel = panel.lda, ..., cex = 0.7, dimen,     abbrev = FALSE, xlab = "LD1", ylab = "LD2")

Arguments

x

An object of class"lda".

panel

the panel function used to plot the data.

...

additional arguments topairs,ldahist oreqscplot.

cex

graphics parametercex for labels on plots.

dimen

The number of linear discriminants to be used for the plot; if thisexceeds the number determined byx the smaller value is used.

abbrev

whether the group labels are abbreviated on the plots. Ifabbrev > 0this givesminlength in the call toabbreviate.

xlab

label for the x axis

ylab

label for the y axis

Details

This function is a method for the generic functionplot() for class"lda".It can be invoked by callingplot(x) for anobjectx of the appropriate class, or directly bycallingplot.lda(x) regardless of theclass of the object.

The behaviour is determined by the value ofdimen. Fordimen > 2, apairs plot is used. Fordimen = 2, anequiscaled scatter plot is drawn. Fordimen = 1, a set ofhistograms or density plots are drawn. Use argumenttype tomatch"histogram" or"density" or"both".

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Plot Method for Objects of Class 'mca'

Description

Plot a multiple correspondence analysis.

Usage

## S3 method for class 'mca'plot(x, rows = TRUE, col, cex = par("cex"), ...)

Arguments

x

An object of class"mca".

rows

Should the coordinates for the rows be plotted, or just the verticesfor the levels?

col,cex

The colours andcex to be used for the row points and level verticesrespectively.

...

Additional parameters toplot.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

plot(mca(farms, abbrev = TRUE))

Ordered Logistic or Probit Regression

Description

Fits a logistic or probit regression model to an ordered factorresponse. The default logistic case isproportional oddslogistic regression, after which the function is named.

Usage

polr(formula, data, weights, start, ..., subset, na.action,     contrasts = NULL, Hess = FALSE, model = TRUE,     method = c("logistic", "probit", "loglog", "cloglog", "cauchit"))

Arguments

formula

a formula expression as for regression models, of the formresponse ~ predictors. The response should be a factor(preferably an ordered factor), which will be interpreted as anordinal response, with levels ordered as in the factor. The model must have an intercept: attempts to remove one willlead to a warning and be ignored. An offset may be used. See thedocumentation offormula for other details.

data

an optional data frame, list or environment in which to interpretthe variables occurring informula.

weights

optional case weights in fitting. Default to 1.

start

initial values for the parameters. This is in the formatc(coefficients, zeta): see the Values section.

...

additional arguments to be passed tooptim, most often acontrol argument.

subset

expression saying which subset of the rows of the data should be usedin the fit. All observations are included by default.

na.action

a function to filter missing data.

contrasts

a list of contrasts to be used for some or all ofthe factors appearing as variables in the model formula.

Hess

logical for whether the Hessian (the observed information matrix)should be returned. Use this if you intend to callsummary orvcov on the fit.

model

logical for whether the model matrix should be returned.

method

logistic or probit or (complementary) log-log or cauchit(corresponding to a Cauchy latent variable).

Details

This model is what Agresti (2002) calls acumulative linkmodel. The basic interpretation is as acoarsened version of alatent variableY_i which has a logistic or normal orextreme-value or Cauchy distribution with scale parameter one and alinear model for the mean. The ordered factor which is observed iswhich binY_i falls into with breakpoints

\zeta_0 = -\infty < \zeta_1 < \cdots < \zeta_K = \infty

This leads to the model

\mbox{logit} P(Y \le k | x) = \zeta_k - \eta

withlogit replaced byprobit for a normal latentvariable, and\eta being the linear predictor, a linearfunction of the explanatory variables (with no intercept). Notethat it is quite common for other software to use the opposite signfor\eta (and hence the coefficientsbeta).

In the logistic case, the left-hand side of the last display is thelog odds of categoryk or less, and since these are log oddswhich differ only by a constant for differentk, the odds areproportional. Hence the termproportional odds logisticregression.

The log-log and complementary log-log links are the increasing functionsF^{-1}(p) = -log(-log(p)) andF^{-1}(p) = log(-log(1-p));some call the first the ‘negative log-log’ link. Thesecorrespond to a latent variable with the extreme-value distribution forthe maximum and minimum respectively.

Aproportional hazards model for grouped survival times can beobtained by using the complementary log-log link with grouping orderedby increasing times.

predict,summary,vcov,anova,model.frame and anextractAIC method for use withstepAIC (andstep). There are alsoprofile andconfint methods.

Value

A object of class"polr". This has components

coefficients

the coefficients of the linear predictor, which has nointercept.

zeta

the intercepts for the class boundaries.

deviance

the residual deviance.

fitted.values

a matrix, with a column for each level of the response.

lev

the names of the response levels.

terms

theterms structure describing the model.

df.residual

the number of residual degrees of freedoms,calculated using the weights.

edf

the (effective) number of degrees of freedom used by the model

n,nobs

the (effective) number of observations, calculated using theweights. (nobs is for use bystepAIC.

call

the matched call.

method

the matched method used.

convergence

the convergence code returned byoptim.

niter

the number of function and gradient evaluations used byoptim.

lp

the linear predictor (including any offset).

Hessian

(ifHess is true). Note that this is anumerical approximation derived from the optimization proces.

model

(ifmodel is true).

Note

Thevcov method uses the approximate Hessian: forreliable results the model matrix should be sensibly scaled with allcolumns having range the order of one.

Prior to version 7.3-32,method = "cloglog" confusingly gavethe log-log link, implicitly assuming the first response level was the‘best’.

References

Agresti, A. (2002)Categorical Data. Second edition. Wiley.

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

options(contrasts = c("contr.treatment", "contr.poly"))house.plr <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)house.plrsummary(house.plr, digits = 3)## slightly worse fit fromsummary(update(house.plr, method = "probit", Hess = TRUE), digits = 3)## although it is not really appropriate, can fitsummary(update(house.plr, method = "loglog", Hess = TRUE), digits = 3)summary(update(house.plr, method = "cloglog", Hess = TRUE), digits = 3)predict(house.plr, housing, type = "p")addterm(house.plr, ~.^2, test = "Chisq")house.plr2 <- stepAIC(house.plr, ~.^2)house.plr2$anovaanova(house.plr, house.plr2)house.plr <- update(house.plr, Hess=TRUE)pr <- profile(house.plr)confint(pr)plot(pr)pairs(pr)

Predict Method for glmmPQL Fits

Description

Obtains predictions from a fitted generalized linear modelwith random effects.

Usage

## S3 method for class 'glmmPQL'predict(object, newdata = NULL, type = c("link", "response"),       level, na.action = na.pass, ...)

Arguments

object

a fitted object of class inheriting from"glmmPQL".

newdata

optionally, a data frame in which to look for variables withwhich to predict.

type

the type of prediction required. The default is on thescale of the linear predictors; the alternative"response"is on the scale of the response variable. Thus for a defaultbinomial model the default predictions are of log-odds (probabilitieson logit scale) andtype = "response" gives the predictedprobabilities.

level

an optional integer vector giving the level(s) of groupingto be used in obtaining the predictions. Level values increase fromoutermost to innermost grouping, with level zero corresponding to thepopulation predictions. Defaults to the highest or innermost level ofgrouping.

na.action

function determining what should be done with missingvalues innewdata. The default is to predictNA.

...

further arguments passed to or from other methods.

Value

Iflevel is a single integer, a vector otherwise a data frame.

Examples

fit <- glmmPQL(y ~ trt + I(week > 2), random = ~1 |  ID,               family = binomial, data = bacteria)predict(fit, bacteria, level = 0, type="response")predict(fit, bacteria, level = 1, type="response")

Classify Multivariate Observations by Linear Discrimination

Description

Classify multivariate observations in conjunction withlda, and alsoproject data onto the linear discriminants.

Usage

## S3 method for class 'lda'predict(object, newdata, prior = object$prior, dimen,        method = c("plug-in", "predictive", "debiased"), ...)

Arguments

object

object of class"lda"

newdata

data frame of cases to be classified or, ifobjecthas a formula, a data frame with columns of the same names as thevariables used. A vector will be interpretedas a row vector. If newdata is missing, an attempt will bemade to retrieve the data used to fit thelda object.

prior

The prior probabilities of the classes, by default the proportions in thetraining set or what was set in the call tolda.

dimen

the dimension of the space to be used. If this is less thanmin(p, ng-1),only the firstdimen discriminant components are used (except formethod="predictive"), and only those dimensions are returned inx.

method

This determines how the parameter estimation is handled. With"plug-in"(the default) the usual unbiased parameter estimates are used andassumed to be correct. With"debiased" an unbiased estimator ofthe log posterior probabilities is used, and with"predictive" theparameter estimates are integrated out using a vague prior.

...

arguments based from or to other methods

Details

This function is a method for the generic functionpredict() forclass"lda". It can be invoked by callingpredict(x) foran objectx of the appropriate class, or directly by callingpredict.lda(x) regardless of the class of the object.

Missing values innewdata are handled by returningNA if thelinear discriminants cannot be evaluated. Ifnewdata is omitted andthena.action of the fit omitted cases, these will be omitted on theprediction.

This version centres the linear discriminants so that theweighted mean (weighted byprior) of the group centroids is atthe origin.

Value

a list with components

class

The MAP classification (a factor)

posterior

posterior probabilities for the classes

x

the scores of test cases on up todimen discriminant variables

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Ripley, B. D. (1996)Pattern Recognition and Neural Networks. Cambridge University Press.

Examples

tr <- sample(1:50, 25)train <- rbind(iris3[tr,,1], iris3[tr,,2], iris3[tr,,3])test <- rbind(iris3[-tr,,1], iris3[-tr,,2], iris3[-tr,,3])cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))z <- lda(train, cl)predict(z, test)$class

Predict from an lqs Fit

Description

Predict from an resistant regression fitted bylqs.

Usage

## S3 method for class 'lqs'predict(object, newdata, na.action = na.pass, ...)

Arguments

object

object inheriting from class"lqs"

newdata

matrix or data frame of cases to be predicted or, ifobjecthas a formula, a data frame with columns of the same names as thevariables used. A vector will be interpretedas a row vector. Ifnewdata is missing, an attempt will bemade to retrieve the data used to fit thelqs object.

na.action

function determining what should be done with missingvalues innewdata. The default is to predictNA.

...

arguments to be passed from or to other methods.

Details

This function is a method for the generic functionpredict() for classlqs.It can be invoked by callingpredict(x) for anobjectx of the appropriate class, or directly bycallingpredict.lqs(x) regardless of theclass of the object.

Missing values innewdata are handled by returningNA if thelinear fit cannot be evaluated. Ifnewdata is omitted andthena.action of the fit omitted cases, these will be omitted on theprediction.

Value

A vector of predictions.

Author(s)

B.D. Ripley

Examples

set.seed(123)fm <- lqs(stack.loss ~ ., data = stackloss, method = "S", nsamp = "exact")predict(fm, stackloss)

Predict Method for Class 'mca'

Description

Used to compute coordinates for additional rows or additional factorsin a multiple correspondence analysis.

Usage

## S3 method for class 'mca'predict(object, newdata, type = c("row", "factor"), ...)

Arguments

object

An object of class"mca", usually the result of a call tomca.

newdata

A data frame containingeither additional rows of the factors used tofitobjector additional factors for the cases used in theoriginal fit.

type

Are predictions required for further rows or for new factors?

...

Additional arguments frompredict: unused.

Value

Iftype = "row", the coordinates for the additional rows.

Iftype = "factor", the coordinates of the column vertices for thelevels of the new factors.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Classify from Quadratic Discriminant Analysis

Description

Classify multivariate observations in conjunction withqda

Usage

## S3 method for class 'qda'predict(object, newdata, prior = object$prior,        method = c("plug-in", "predictive", "debiased", "looCV"), ...)

Arguments

object

object of class"qda"

newdata

prior

The prior probabilities of the classes, by default the proportions in thetraining set or what was set in the call toqda.

method

...

arguments based from or to other methods

Details

This function is a method for the generic functionpredict() for class"qda".It can be invoked by callingpredict(x) for anobjectx of the appropriate class, or directly bycallingpredict.qda(x) regardless of theclass of the object.

Missing values innewdata are handled by returningNA if thequadratic discriminants cannot be evaluated. Ifnewdata is omitted andthena.action of the fit omitted cases, these will be omitted on theprediction.

Value

a list with components

class

The MAP classification (a factor)

posterior

posterior probabilities for the classes

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Ripley, B. D. (1996)Pattern Recognition and Neural Networks. Cambridge University Press.

Examples

tr <- sample(1:50, 25)train <- rbind(iris3[tr,,1], iris3[tr,,2], iris3[tr,,3])test <- rbind(iris3[-tr,,1], iris3[-tr,,2], iris3[-tr,,3])cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))zq <- qda(train, cl)predict(zq, test)$class

Method for Profiling glm Objects

Description

Investigates the profile log-likelihood function for a fitted model ofclass"glm".

As fromR 4.4.0 was migrated to packagestats with additionalfunctionality.

Quadratic Discriminant Analysis

Description

Quadratic discriminant analysis.

Usage

qda(x, ...)## S3 method for class 'formula'qda(formula, data, ..., subset, na.action)## Default S3 method:qda(x, grouping, prior = proportions,    method, CV = FALSE, nu, ...)## S3 method for class 'data.frame'qda(x, ...)## S3 method for class 'matrix'qda(x, grouping, ..., subset, na.action)

Arguments

formula

A formula of the formgroups ~ x1 + x2 + ... That is, theresponse is the grouping factor and the right hand side specifiesthe (non-factor) discriminators.

data

An optional data frame, list or environment from which variablesspecified informula are preferentially to be taken.

x

(required if no formula is given as the principal argument.)a matrix or data frame or Matrix containing the explanatory variables.

grouping

(required if no formula principal argument is given.)a factor specifying the class for each observation.

prior

the prior probabilities of class membership. If unspecified, the classproportions for the training set are used. If specified, theprobabilities should be specified in the order of the factor levels.

subset

An index vector specifying the cases to be used in the trainingsample. (NOTE: If given, this argument must be named.)

na.action

A function to specify the action to be taken ifNAs are found.The default action is for the procedure to fail. An alternative isna.omit, which leads to rejection of cases with missing values onany required variable. (NOTE: If given, this argument must be named.)

method

"moment" for standard estimators of the mean and variance,"mle" for MLEs,"mve" to usecov.mve, or"t" for robustestimates based on a t distribution.

CV

If true, returns results (classes and posterior probabilities) forleave-out-out cross-validation. Note that if the prior is estimated,the proportions in the whole dataset are used.

nu

degrees of freedom formethod = "t".

...

arguments passed to or from other methods.

Details

Uses a QR decomposition which will give an error message if thewithin-group variance is singular for any group.

Value

an object of class"qda" containing the following components:

prior

the prior probabilities used.

means

the group means.

scaling

for each groupi,scaling[,,i] is an array which transforms observationsso that within-groups covariance matrix is spherical.

ldet

a vector of half log determinants of the dispersion matrix.

lev

the levels of the grouping factor.

terms

(if formula is a formula)an object of mode expression and class term summarizingthe formula.

call

the (matched) function call.

unlessCV=TRUE, when the return value is a list with components:

class

The MAP classification (a factor)

posterior

posterior probabilities for the classes

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Ripley, B. D. (1996)Pattern Recognition and Neural Networks. Cambridge University Press.

Examples

tr <- sample(1:50, 25)train <- rbind(iris3[tr,,1], iris3[tr,,2], iris3[tr,,3])test <- rbind(iris3[-tr,,1], iris3[-tr,,2], iris3[-tr,,3])cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))z <- qda(train, cl)predict(z,test)$class

Absenteeism from School in Rural New South Wales

Description

Thequine data frame has 146 rows and 5 columns.Children from Walgett, New South Wales, Australia, were classified byCulture, Age, Sex and Learner status and the number of days absent fromschool in a particular school year was recorded.

Usage

quine

Format

This data frame contains the following columns:

Eth: ethnic background: Aboriginal or Not, ("A" or"N").
Sex: sex: factor with levels ("F" or"M").
Age: age group: Primary ("F0"), or forms"F1,""F2" or"F3".
Lrn: learner status: factor with levels Average or Slow learner, ("AL" or"SL").
Days: days absent from school in the year.

Source

S. Quine, quoted in Aitkin, M. (1978) The analysis of unbalanced crossclassifications (with discussion).Journal of the Royal Statistical Society series A141, 195–223.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Rational Approximation

Description

Find rational approximations to the components of a real numericobject using a standard continued fraction method.

Usage

rational(x, cycles = 10, max.denominator = 2000, ...)

Arguments

x

Any object of mode numeric. Missing values are now allowed.

cycles

The maximum number of steps to be used in the continued fractionapproximation process.

max.denominator

An early termination criterion. If any partial denominatorexceedsmax.denominator the continued fraction stops at that point.

...

arguments passed to or from other methods.

Details

Each component is first expanded in a continued fraction of theform

x = floor(x) + 1/(p1 + 1/(p2 + ...)))

Value

A numeric object with the same attributes asx but with entriesrational approximations to the values. This effectively roundsrelative to the size of the object and replaces very smallentries by zero.

Examples

X <- matrix(runif(25), 5, 5)zapsmall(solve(X, X/5)) # print near-zeroes as zerorational(solve(X, X/5))

Convert a Formula Transformed by 'denumerate'

Description

denumerate converts a formula written using the conventions ofloglm into one thatterms is able to process.renumerateconverts it back again to a form like the original.

Usage

renumerate(x)

Arguments

x

A formula, normally as modified bydenumerate.

Details

This is an inverse function todenumerate. It is only neededsinceterms returns an expanded form of the original formulawhere the non-marginal terms are exposed. This expanded form ismapped back into a form corresponding to the one that the useroriginally supplied.

Value

A formula where all variables with names of the form.vn, wheren is an integer, converted to numbers,n, as allowed by theformula conventions ofloglm.

Examples

denumerate(~(1+2+3)^3 + a/b)## ~ (.v1 + .v2 + .v3)^3 + a/brenumerate(.Last.value)## ~ (1 + 2 + 3)^3 + a/b

Robust Fitting of Linear Models

Description

Fit a linear model by robust regression using an M estimator.

Usage

rlm(x, ...)## S3 method for class 'formula'rlm(formula, data, weights, ..., subset, na.action,    method = c("M", "MM", "model.frame"),    wt.method = c("inv.var", "case"),    model = TRUE, x.ret = TRUE, y.ret = FALSE, contrasts = NULL)## Default S3 method:rlm(x, y, weights, ..., w = rep(1, nrow(x)),    init = "ls", psi = psi.huber,    scale.est = c("MAD", "Huber", "proposal 2"), k2 = 1.345,    method = c("M", "MM"), wt.method = c("inv.var", "case"),    maxit = 20, acc = 1e-4, test.vec = "resid", lqs.control = NULL)psi.huber(u, k = 1.345, deriv = 0)psi.hampel(u, a = 2, b = 4, c = 8, deriv = 0)psi.bisquare(u, c = 4.685, deriv = 0)

Arguments

formula

a formula of the formy ~ x1 + x2 + ....

data

an optional data frame, list or environment from which variablesspecified informula are preferentially to be taken.

weights

a vector of prior weights for each case.

subset

An index vector specifying the cases to be used in fitting.

na.action

A function to specify the action to be taken ifNAs are found.The ‘factory-fresh’ default action inR isna.omit, and can be changed byoptions(na.action=).

x

a matrix or data frame containing the explanatory variables.

y

the response: a vector of length the number of rows ofx.

method

currently either M-estimation or MM-estimation or (for theformula method only) find the model frame. MM-estimationis M-estimation with Tukey's biweight initialized by a specificS-estimator. See the ‘Details’ section.

wt.method

are the weights case weights (giving the relative importance of case,so a weight of 2 means there are two of these) or the inverse of thevariances, so a weight of two means this error is half as variable?

model

should the model frame be returned in the object?

x.ret

should the model matrix be returned in the object?

y.ret

should the response be returned in the object?

contrasts

optional contrast specifications: seelm.

w

(optional) initial down-weighting for each case.

init

(optional) initial values for the coefficients OR a method to findinitial values OR the result of a fit with acoef component. Knownmethods are"ls" (the default) for an initial least-squares fitusing weightsw*weights, and"lts" for an unweightedleast-trimmed squares fit with 200 samples.

psi

the psi function is specified by this argument. It must give(possibly by name) a functiong(x, ..., deriv) that forderiv=0 returns psi(x)/x and forderiv=1 returnspsi'(x). Tuning constants will be passed in via....

scale.est

method of scale estimation: re-scaled MAD of the residuals (default)or Huber's proposal 2 (which can be selected by either"Huber"or"proposal 2").

k2

tuning constant used for Huber proposal 2 scale estimation.

maxit

the limit on the number of IWLS iterations.

acc

the accuracy for the stopping criterion.

test.vec

the stopping criterion is based on changes in this vector.

...

additional arguments to be passed torlm.default or to thepsifunction.

lqs.control

An optional list of control values forlqs.

u

numeric vector of evaluation points.

k,a,b,c

tuning constants.

deriv

0 or1: compute values of the psi function or of itsfirst derivative.

Details

Fitting is done by iterated re-weighted least squares (IWLS).

Psi functions are supplied for the Huber, Hampel and Tukey bisquareproposals aspsi.huber,psi.hampel andpsi.bisquare. Huber's corresponds to a convex optimizationproblem and gives a unique solution (up to collinearity). The othertwo will have multiple local minima, and a good starting point isdesirable.

Selectingmethod = "MM" selects a specific set of options whichensures that the estimator has a high breakdown point. The initial setof coefficients and the final scale are selected by an S-estimatorwithk0 = 1.548; this gives (forn \gg p)breakdown point 0.5.The final estimator is an M-estimator with Tukey's biweight and fixedscale that will inherit this breakdown point providedc > k0;this is true for the default value ofc that corresponds to95% relative efficiency at the normal. Case weights are notsupported formethod = "MM".

Value

An object of class"rlm" inheriting from"lm".Note that thedf.residual component is deliberately set toNA to avoid inappropriate estimation of the residual scale fromthe residual mean square by"lm" methods.

The additional components not in anlm object are

s

the robust scale estimate used

w

the weights used in the IWLS process

psi

the psi function with parameters substituted

conv

the convergence criteria at each iteration

converged

did the IWLS converge?

wresid

a working residual, weighted for"inv.var" weights only.

Note

Prior to version7.3-52, offset terms informulawere omitted from fitted and predicted values.

References

P. J. Huber (1981)Robust Statistics.Wiley.

F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw and W. A. Stahel (1986)Robust Statistics: The Approach based on Influence Functions.Wiley.

A. Marazzi (1993)Algorithms, Routines and S Functions for Robust Statistics.Wadsworth & Brooks/Cole.

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

summary(rlm(stack.loss ~ ., stackloss))rlm(stack.loss ~ ., stackloss, psi = psi.hampel, init = "lts")rlm(stack.loss ~ ., stackloss, psi = psi.bisquare)

Relative Curvature Measures for Non-Linear Regression

Description

Calculates the root mean square parameter effects and intrinsic relativecurvatures,c^\theta andc^\iota, for afitted nonlinear regression, as defined in Bates & Watts, section 7.3,p. 253ff

Usage

rms.curv(obj)

Arguments

obj

Fitted model object of class"nls". The model must be fitted using thedefault algorithm.

Details

The method of section 7.3.1 of Bates & Watts is implemented. Thefunctionderiv3 should be used generate a model function with firstderivative (gradient) matrix and second derivative (Hessian) arrayattributes. This function should then be used to fit the nonlinearregression model.

A print method,print.rms.curv, prints thepc andic components only, suitably annotated.

If eitherpc oric exceeds some threshold (0.3 has beensuggested) the curvature is unacceptably high for the planar assumption.

Value

A list of classrms.curv with componentspc andicfor parameter effects and intrinsic relative curvatures multiplied bysqrt(F),ct andci forc^\theta andc^\iota (unmultiplied),andC the C-array as used in section 7.3.1 of Bates & Watts.

References

Bates, D. M, and Watts, D. G. (1988)Nonlinear Regression Analysis and its Applications.Wiley, New York.

Examples

# The treated sample from the Puromycin datammcurve <- deriv3(~ Vm * conc/(K + conc), c("Vm", "K"),                  function(Vm, K, conc) NULL)Treated <- Puromycin[Puromycin$state == "treated", ](Purfit1 <- nls(rate ~ mmcurve(Vm, K, conc), data = Treated,                start = list(Vm=200, K=0.1)))rms.curv(Purfit1)##Parameter effects: c^theta x sqrt(F) = 0.2121##        Intrinsic: c^iota  x sqrt(F) = 0.092

Simulate Negative Binomial Variates

Description

Function to generate random outcomes from a Negative Binomial distribution,with meanmu and variancemu + mu^2/theta.

Usage

rnegbin(n, mu = n, theta = stop("'theta' must be specified"))

Arguments

n

If a scalar, the number of sample values required. If a vector,length(n) is the number required andn is used as the mean vector ifmu is not specified.

mu

The vector of means. Short vectors are recycled.

theta

Vector of values of thetheta parameter. Short vectors are recycled.

Details

The function uses the representation of the Negative Binomial distributionas a continuous mixture of Poisson distributions with Gamma distributed means.Unlikernbinom the index can be arbitrary.

Value

Vector of random Negative Binomial variate values.

Side Effects

Changes.Random.seed in the usual way.

Examples

# Negative Binomials with means fitted(fm) and theta = 4.5fm <- glm.nb(Days ~ ., data = quine)dummy <- rnegbin(fitted(fm), theta = 4.5)

Road Accident Deaths in US States

Description

A data frame with the annual deaths in road accidents for halfthe US states.

Usage

road

Format

Columns are:

state: name.
deaths: number of deaths.
drivers: number of drivers (in 10,000s).
popden: population density in people per square mile.
rural: length of rural roads, in 1000s of miles.
temp: average daily maximum temperature in January.
fuel: fuel consumption in 10,000,000 US gallons per year.

Source

Imperial College, London M.Sc. exercise

Numbers of Rotifers by Fluid Density

Description

The data give the numbers of rotifers falling out of suspension fordifferent fluid densities. There are two species,pmPolyartha major andkc,Keratella cochlearis andfor each species the number falling out and the total number aregiven.

Usage

rotifer

Format

density: specific density of fluid.
pm.y: number falling out forP. major.
pm.total: total number ofP. major.
kc.y: number falling out forK. cochlearis.
kc.tot: total number ofK. cochlearis.

Source

D. Collett (1991)Modelling Binary Data. Chapman & Hall. p. 217

Sammon's Non-Linear Mapping

Description

One form of non-metric multidimensional scaling.

Usage

sammon(d, y = cmdscale(d, k), k = 2, niter = 100, trace = TRUE,       magic = 0.2, tol = 1e-4)

Arguments

d

distance structure of the form returned bydist, or a full, symmetricmatrix. Data are assumed to be dissimilarities or relative distances,but must be positive except for self-distance. This can contain missingvalues.

y

An initial configuration. If none is supplied,cmdscaleis used to provide the classical solution. (If there are missingvalues ind, an initial configuration must be provided.) Thismust not have duplicates.

k

The dimension of the configuration.

niter

The maximum number of iterations.

trace

Logical for tracing optimization. DefaultTRUE.

magic

initial value of the step size constant in diagonal Newton method.

tol

Tolerance for stopping, in units of stress.

Details

This chooses a two-dimensional configuration to minimize the stress,the sum of squared differences between the input distances and thoseof the configuration, weighted by the distances, the whole sum beingdivided by the sum of input distances to make the stress scale-free.

An iterative algorithm is used, which will usually converge in around50 iterations. As this is necessarily anO(n^2) calculation, it is slowfor large datasets. Further, since the configuration is only determinedup to rotations and reflections (by convention the centroid is at theorigin), the result can vary considerably from machine to machine.In this release the algorithm has been modified by adding a step-lengthsearch (magic) to ensure that it always goes downhill.

Value

Two components:

points

A two-column vector of the fitted configuration.

stress

The final stress achieved.

Side Effects

If trace is true, the initial stress and the current stress are printedout every 10 iterations.

References

Sammon, J. W. (1969)A non-linear mapping for data structure analysis.IEEE Trans. Comput.,C-18 401–409.

Ripley, B. D. (1996)Pattern Recognition and Neural Networks. Cambridge University Press.

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

swiss.x <- as.matrix(swiss[, -1])swiss.sam <- sammon(dist(swiss.x))plot(swiss.sam$points, type = "n")text(swiss.sam$points, labels = as.character(1:nrow(swiss.x)))

Ships Damage Data

Description

Data frame giving the number of damage incidents and aggregatemonths of service by ship type, year of construction, and period of operation.

Usage

ships

Format

type: type:"A" to"E".
year: year of construction: 1960–64, 65–69, 70–74, 75–79(coded as"60","65","70","75").
period: period of operation : 1960–74, 75–79.
service: aggregate months of service.
incidents: number of damage incidents.

Source

P. McCullagh and J. A. Nelder, (1983),Generalized Linear Models. Chapman & Hall, section 6.3.2, page 137

Shoe wear data of Box, Hunter and Hunter

Description

A list of two vectors, giving the wear of shoes of materials A and Bfor one foot each of ten boys.

Usage

shoes

Source

G. E. P. Box, W. G. Hunter and J. S. Hunter (1978)Statistics for Experimenters. Wiley, p. 100

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Percentage of Shrimp in Shrimp Cocktail

Description

A numeric vector with 18 determinations by different laboratoriesof the amount (percentage of the declared total weight) of shrimpin shrimp cocktail.

Usage

shrimp

Source

F. J. King and J. J. Ryan (1976)Collaborative study of the determination of the amount of shrimp inshrimp cocktail.J. Off. Anal. Chem.59, 644–649.

R. G. Staudte and S. J. Sheather (1990)Robust Estimation and Testing. Wiley.

Space Shuttle Autolander Problem

Description

Theshuttle data frame has 256 rows and 7 columns.The first six columns are categorical variables giving exampleconditions; the seventh is the decision. The first 253 rows are thetraining set, the last 3 the test conditions.

Usage

shuttle

Format

This data frame contains the following factor columns:

stability: stable positioning or not (stab /xstab).
error: size of error (MM /SS /LX /XL).
sign: sign of error, positive or negative (pp /nn).
wind: wind sign (head /tail).
magn: wind strength (Light /Medium /Strong /Out of Range).
vis: visibility (yes /no).
use: use the autolander or not. (auto /noauto.)

Source

D. Michie (1989)Problems of computer-aided concept formation. InApplications of Expert Systems 2,ed. J. R. Quinlan, Turing Institute Press / Addison-Wesley, pp. 310–333.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Snail Mortality Data

Description

Groups of 20 snails were held for periods of 1, 2, 3 or 4 weeksin carefully controlled conditions of temperature and relativehumidity. There were two species of snail, A and B, and theexperiment was designed as a 4 by 3 by 4 by 2 completely randomizeddesign. At the end of the exposure time the snails were tested to see ifthey had survived; the process itself is fatal for the animals. Theobject of the exercise was to model the probability of survival in terms ofthe stimulus variables, and in particular to test for differences betweenspecies.

The data are unusual in that in most cases fatalities during the experimentwere fairly small.

Usage

snails

Format

The data frame contains the following components:

Species: snail species A (1) or B (2).
Exposure: exposure in weeks.
Rel.Hum: relative humidity (4 levels).
Temp: temperature, in degrees Celsius (3 levels).
Deaths: number of deaths.
N: number of snails exposed.

Source

Zoology Department, The University of Adelaide.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

Extract Standardized Residuals from a Linear Model

Description

The standardized residuals. These are normalized to unitvariance, fitted including the current data point.

Usage

stdres(object)

Arguments

object

any object representing a linear model.

Value

The vector of appropriately transformed residuals.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

The Saturated Steam Pressure Data

Description

Temperature and pressure in a saturated steam driven experimental device.

Usage

steam

Format

The data frame contains the following components:

Temp: temperature, in degrees Celsius.
Press: pressure, in Pascals.

Source

N.R. Draper and H. Smith (1981)Applied Regression Analysis. Wiley, pp. 518–9.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

Choose a model by AIC in a Stepwise Algorithm

Description

Performs stepwise model selection by AIC.

Usage

stepAIC(object, scope, scale = 0,        direction = c("both", "backward", "forward"),        trace = 1, keep = NULL, steps = 1000, use.start = FALSE,        k = 2, ...)

Arguments

object

an object representing a model of an appropriate class.This is used as the initial model in the stepwise search.

scope

defines the range of models examined in the stepwise search.This should be either a single formula, or a list containingcomponentsupper andlower, both formulae. See thedetails for how to specify the formulae and how they are used.

scale

used in the definition of the AIC statistic for selecting the models,currently only forlm andaov models(seeextractAIC for details).

direction

the mode of stepwise search, can be one of"both","backward", or"forward", with a default of"both".If thescope argument is missing the default fordirection is"backward".

trace

if positive, information is printed during the running ofstepAIC.Larger values may give more information on the fitting process.

keep

a filter function whose input is a fitted model object and theassociatedAIC statistic, and whose output is arbitrary.Typicallykeep will select a subset of the components ofthe object and return them. The default is not to keep anything.

steps

the maximum number of steps to be considered. The default is 1000(essentially as many as required). It is typically used to stop theprocess early.

use.start

if true the updated fits are done starting at the linear predictor forthe currently selected model. This may speed up the iterativecalculations forglm (and other fits), but it can also slow themdown.Not used inR.

k

the multiple of the number of degrees of freedom used for the penalty.Onlyk = 2 gives the genuine AIC:k = log(n) issometimes referred to as BIC or SBC.

...

any additional arguments toextractAIC. (None are currently used.)

Details

The set of models searched is determined by thescope argument.The right-hand-side of itslower component is always includedin the model, and right-hand-side of the model is included in theupper component. Ifscope is a single formula, itspecifies theupper component, and thelower model isempty. Ifscope is missing, the initial model is used as theupper model.

Models specified byscope can be templates to updateobject as used byupdate.formula.

There is a potential problem in usingglm fits with avariablescale, as in that case the deviance is not simplyrelated to the maximized log-likelihood. Theglm method forextractAIC makes theappropriate adjustment for agaussian family, but may need to beamended for other cases. (Thebinomial andpoissonfamilies have fixedscale by default and do not correspondto a particular maximum-likelihood problem for variablescale.)

Where a conventional deviance exists (e.g. forlm,aovandglm fits) this is quoted in the analysis of variance table:it is theunscaled deviance.

Value

the stepwise-selected model is returned, with up to two additionalcomponents. There is an"anova" component corresponding to thesteps taken in the search, as well as a"keep" component if thekeep= argument was supplied in the call. The"Resid. Dev" column of the analysis of deviance table refersto a constant minus twice the maximized log likelihood: it will be adeviance only in cases where a saturated model is well-defined(thus excludinglm,aov andsurvreg fits,for example).

Note

The model fitting must apply the models to the same dataset. This maybe a problem if there are missing values and anna.action other thanna.fail is used (as is the default inR).We suggest you remove the missing values first.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

quine.hi <- aov(log(Days + 2.5) ~ .^4, quine)quine.nxt <- update(quine.hi, . ~ . - Eth:Sex:Age:Lrn)quine.stp <- stepAIC(quine.nxt,    scope = list(upper = ~Eth*Sex*Age*Lrn, lower = ~1),    trace = FALSE)quine.stp$anovacpus1 <- cpusfor(v in names(cpus)[2:7])  cpus1[[v]] <- cut(cpus[[v]], unique(quantile(cpus[[v]])),                    include.lowest = TRUE)cpus0 <- cpus1[, 2:8]  # excludes names, authors' predictionscpus.samp <- sample(1:209, 100)cpus.lm <- lm(log10(perf) ~ ., data = cpus1[cpus.samp,2:8])cpus.lm2 <- stepAIC(cpus.lm, trace = FALSE)cpus.lm2$anovaexample(birthwt)birthwt.glm <- glm(low ~ ., family = binomial, data = bwt)birthwt.step <- stepAIC(birthwt.glm, trace = FALSE)birthwt.step$anovabirthwt.step2 <- stepAIC(birthwt.glm, ~ .^2 + I(scale(age)^2)    + I(scale(lwt)^2), trace = FALSE)birthwt.step2$anovaquine.nb <- glm.nb(Days ~ .^4, data = quine)quine.nb2 <- stepAIC(quine.nb)quine.nb2$anova

The Stormer Viscometer Data

Description

The stormer viscometer measures the viscosity of a fluid by measuring thetime taken for an inner cylinder in the mechanism to perform a fixed numberof revolutions in response to an actuating weight. The viscometer iscalibrated by measuring the time taken with varying weights while themechanism is suspended in fluids of accurately known viscosity. The datacomes from such a calibration, and theoretical considerations suggest anonlinear relationship between time, weight and viscosity, of the formTime = (B1*Viscosity)/(Weight - B2) + EwhereB1 andB2are unknown parameters to be estimated, andE is error.

Usage

stormer

Format

The data frame contains the following components:

Viscosity: viscosity of fluid.
Wt: actuating weight.
Time: time taken.

Source

E. J. Williams (1959)Regression Analysis. Wiley.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Extract Studentized Residuals from a Linear Model

Description

The Studentized residuals. Like standardized residuals, these arenormalized to unit variance, but the Studentized version is fittedignoring the current data point. (They are sometimes called jackknifedresiduals).

Usage

studres(object)

Arguments

object

any object representing a linear model.

Value

The vector of appropriately transformed residuals.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Summary Method Function for Objects of Class 'loglm'

Description

Returns a summary list for log-linear models fitted byiterative proportional scaling usingloglm.

Usage

## S3 method for class 'loglm'summary(object, fitted = FALSE, ...)

Arguments

object

a fitted loglm model object.

fitted

ifTRUE return observed and expected frequencies in the result.Usingfitted = TRUE may necessitate re-fitting the object.

...

arguments to be passed to or from other methods.

Details

This function is a method for the generic functionsummary() for class"loglm".It can be invoked by callingsummary(x) for anobjectx of the appropriate class, or directly bycallingsummary.loglm(x) regardless of theclass of the object.

Value

a list is returned for use byprint.summary.loglm.This has components

formula

the formula used to produceobject

tests

the table of test statistics (likelihood ratio, Pearson) for the fit.

oe

iffitted = TRUE, an array of the observed and expected frequencies,otherwiseNULL.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Summary Method Function for Objects of Class 'negbin'

Description

Identical tosummary.glm, but with three lines of additional output: theML estimate of theta, its standard error, and twice the log-likelihoodfunction.

Usage

## S3 method for class 'negbin'summary(object, dispersion = 1, correlation = FALSE, ...)

Arguments

object

fitted model object of classnegbin inheriting fromglm andlm.Typically the output ofglm.nb.

dispersion

as forsummary.glm, with a default of 1.

correlation

as forsummary.glm.

...

arguments passed to or from other methods.

Details

summary.glm is used to produce the majority of the output and supply theresult.This function is a method for the generic functionsummary() for class"negbin".It can be invoked by callingsummary(x) for anobjectx of the appropriate class, or directly bycallingsummary.negbin(x) regardless of theclass of the object.

Value

As forsummary.glm; the additional lines of output are not included inthe resultant object.

Side Effects

A summary table is produced as forsummary.glm, with the additionalinformation described above.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

summary(glm.nb(Days ~ Eth*Age*Lrn*Sex, quine, link = log))

Summary Method for Robust Linear Models

Description

summary method for objects of class"rlm"

Usage

## S3 method for class 'rlm'summary(object, method = c("XtX", "XtWX"), correlation = FALSE, ...)

Arguments

object

the fitted model.This is assumed to be the result of some fit that producesan object inheriting from the classrlm, in the sense thatthe components returned by therlm function will be available.

method

Should the weighted (by the IWLS weights) or unweighted cross-productsmatrix be used?

correlation

logical. Should correlations be computed (and printed)?

...

arguments passed to or from other methods.

Details

This function is a method for the generic functionsummary() for class"rlm".It can be invoked by callingsummary(x) for anobjectx of the appropriate class, or directly bycallingsummary.rlm(x) regardless of theclass of the object.

Value

If printing takes place, only a null value is returned.Otherwise, a list is returned with the following components.Printing always takes place if this function is invoked automaticallyas a method for thesummary function.

correlation

The computed correlation coefficient matrix for the coefficients in the model.

cov.unscaled

The unscaled covariance matrix; i.e, a matrix such that multiplying it byan estimate of the error variance produces an estimated covariance matrixfor the coefficients.

sigma

The scale estimate.

stddev

A scale estimate used for the standard errors.

df

The number of degrees of freedom for the model and for residuals.

coefficients

A matrix with three columns, containing the coefficients, their standard errorsand the corresponding t statistic.

terms

The terms object used in fitting this model.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

summary(rlm(calls ~ year, data = phones, maxit = 50))

Student Survey Data

Description

This data frame contains the responses of 237 Statistics I students atthe University of Adelaide to a number of questions.

Usage

survey

Format

The components of the data frame are:

Sex: The sex of the student. (Factor with levels"Male" and"Female".)
Wr.Hnd: span (distance from tip of thumb to tip of little finger of spreadhand) of writing hand, in centimetres.
NW.Hnd: span of non-writing hand.
W.Hnd: writing hand of student. (Factor, with levels"Left" and"Right".)
Fold: “Fold your arms! Which is on top” (Factor, with levels"R on L","L on R","Neither".)
Pulse: pulse rate of student (beats per minute).
Clap: ‘Clap your hands! Which hand is on top?’ (Factor, with levels"Right","Left","Neither".)
Exer: how often the student exercises. (Factor, with levels"Freq"(frequently),"Some","None".)
Smoke: how much the student smokes. (Factor, levels"Heavy","Regul" (regularly),"Occas" (occasionally),"Never".)
Height: height of the student in centimetres.
M.I: whether the student expressed height in imperial(feet/inches) or metric (centimetres/metres) units. (Factor, levels"Metric","Imperial".)
Age: age of the student in years.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S-PLUS. Fourth Edition. Springer.

Synthetic Classification Problem

Description

Thesynth.tr data frame has 250 rows and 3 columns.Thesynth.te data frame has 100 rows and 3 columns.It is intended thatsynth.tr be used from training andsynth.te for testing.

Usage

synth.trsynth.te

Format

These data frames contains the following columns:

xs: x-coordinate
ys: y-coordinate
yc: class, coded as 0 or 1.

Source

Ripley, B.D. (1994)Neural networks and related methods forclassification (with discussion).Journal of the Royal Statistical Society series B56, 409–456.

Ripley, B.D. (1996)Pattern Recognition and Neural Networks.Cambridge: Cambridge University Press.

Estimate theta of the Negative Binomial

Description

Given the estimated mean vector, estimatetheta of theNegative Binomial Distribution.

Usage

theta.md(y, mu, dfr, weights, limit = 20, eps = .Machine$double.eps^0.25)theta.ml(y, mu, n, weights, limit = 10, eps = .Machine$double.eps^0.25,         trace = FALSE)theta.mm(y, mu, dfr, weights, limit = 10, eps = .Machine$double.eps^0.25)

Arguments

y

Vector of observed values from the Negative Binomial.

mu

Estimated mean vector.

n

Number of data points (defaults to the sum ofweights)

dfr

Residual degrees of freedom (assumingtheta known). Fora weighted fit this is the sum of the weights minus the number offitted parameters.

weights

Case weights. If missing, taken as 1.

limit

Limit on the number of iterations.

eps

Tolerance to determine convergence.

trace

logical: should iteration progress be printed?

Details

theta.md estimates by equating the deviance to the residualdegrees of freedom, an analogue of a moment estimator.

theta.ml uses maximum likelihood.

theta.mm calculates the moment estimator oftheta byequating the Pearson chi-square\sum (y-\mu)^2/(\mu+\mu^2/\theta)to the residual degrees of freedom.

Value

The required estimate oftheta, as a scalar.Fortheta.ml, the standard error is given as attribute"SE".

Examples

quine.nb <- glm.nb(Days ~ .^2, data = quine)theta.md(quine$Days, fitted(quine.nb), dfr = df.residual(quine.nb))theta.ml(quine$Days, fitted(quine.nb))theta.mm(quine$Days, fitted(quine.nb), dfr = df.residual(quine.nb))## weighted exampleyeast <- data.frame(cbind(numbers = 0:5, fr = c(213, 128, 37, 18, 3, 1)))fit <- glm.nb(numbers ~ 1, weights = fr, data = yeast)summary(fit)mu <- fitted(fit)theta.md(yeast$numbers, mu, dfr = 399, weights = yeast$fr)theta.ml(yeast$numbers, mu, limit = 15, weights = yeast$fr)theta.mm(yeast$numbers, mu, dfr = 399, weights = yeast$fr)

Spatial Topographic Data

Description

Thetopo data frame has 52 rows and 3 columns, oftopographic heights within a 310 feet square.

Usage

topo

Format

This data frame contains the following columns:

x: x coordinates (units of 50 feet)
y: y coordinates (units of 50 feet)
z: heights (feet)

Source

Davis, J.C. (1973)Statistics and Data Analysis in Geology.Wiley.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Plot a Histogram

Description

Creates a histogram on the current graphics device.

Usage

truehist(data, nbins = "Scott", h, x0 = -h/1000,         breaks, prob = TRUE, xlim = range(breaks),         ymax = max(est), col = "cyan",         xlab = deparse(substitute(data)), bty = "n", ...)

Arguments

data

numeric vector of data for histogram. Missing values (NAs)are allowed and omitted.

nbins

The suggested number of bins. Either a positive integer, or a character stringnaming a rule:"Scott" or"Freedman-Diaconis" or"FD". (Case isignored.)

h

The bin width, a strictly positive number (takes precedence overnbins).

x0

Shift for the bins - the breaks are atx0 + h * (..., -1, 0, 1, ...)

breaks

The set of breakpoints to be used. (Usually omitted, takes precedenceoverh andnbins).

prob

If true (the default) plot a true histogram.The vertical axis has arelative frequency densityscale, so the product of the dimensions of any panel gives therelative frequency. Hence the total area under the histogramis 1 and it is directly comparable with most other estimatesof the probability density function.If false plot the counts in the bins.

xlim

The limits for the x-axis.

ymax

The upper limit for the y-axis.

col

The colour for the bar fill: the default is colour 5 in the defaultR palette.

xlab

label for the plot x-axis. By default, this will be the name ofdata.

bty

The box type for the plot - defaults to none.

...

additional arguments torect orplot.

Details

This plots a true histogram, a density estimate of total area 1. Ifbreaks is specified, those breakpoints are used. Otherwise ifh is specified, a regular grid of bins is used with widthh. If neitherbreaks norh is specified,nbins is used to select a suitableh.

Side Effects

A histogram is plotted on the current device.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Unbiased Cross-Validation for Bandwidth Selection

Description

Uses unbiased cross-validation to select the bandwidth of a Gaussiankernel density estimator.

Usage

ucv(x, nb = 1000, lower, upper)

Arguments

x

a numeric vector

nb

number of bins to use.

lower,upper

Range over which to minimize. The default is almost always satisfactory.

Value

a bandwidth.

References

Scott, D. W. (1992)Multivariate Density Estimation: Theory, Practice, and Visualization.Wiley.

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

ucv(geyser$duration)

Counts of Waders at 15 Sites in South Africa

Description

Thewaders data frame has 15 rows and 19 columns.The entries are counts of waders in summer.

Usage

waders

Format

This data frame contains the following columns (species)

S1: Oystercatcher
S2: White-fronted Plover
S3: Kitt Lutz's Plover
S4: Three-banded Plover
S5: Grey Plover
S6: Ringed Plover
S7: Bar-tailed Godwit
S8: Whimbrel
S9: Marsh Sandpiper
S10: Greenshank
S11: Common Sandpiper
S12: Turnstone
S13: Knot
S14: Sanderling
S15: Little Stint
S16: Curlew Sandpiper
S17: Ruff
S18: Avocet
S19: Black-winged Stilt

The rows are the sites:

A = Namibia North coast
B = Namibia North wetland
C = Namibia South coast
D = Namibia South wetland
E = Cape North coast
F = Cape North wetland
G = Cape West coast
H = Cape West wetland
I = Cape South coast
J= Cape South wetland
K = Cape East coast
L = Cape East wetland
M = Transkei coast
N = Natal coast
O = Natal wetland

Source

J.C. Gower and D.J. Hand (1996)BiplotsChapman & Hall Table 9.1. Quoted as from:

R.W. Summers, L.G. Underhill, D.J. Pearson and D.A. Scott (1987)Wader migration systems in south and eastern Africa and western Asia.Wader Study Group Bulletin49 Supplement, 15–34.

Examples

plot(corresp(waders, nf=2))

House Insulation: Whiteside's Data

Description

Mr Derek Whiteside of the UK Building Research Station recorded theweekly gas consumption and average external temperature at his ownhouse in south-east England for two heating seasons, one of 26 weeksbefore, and one of 30 weeks after cavity-wall insulation wasinstalled. The object of the exercise was to assess the effect of theinsulation on gas consumption.

Usage

whiteside

Format

Thewhiteside data frame has 56 rows and 3 columns.:

Insul: A factor, before or after insulation.
Temp: Purportedly the average outside temperature in degrees Celsius. (Thesevalues is far too low for any 56-week period in the 1960s inSouth-East England. It might be the weekly average of daily minima.)
Gas: The weekly gas consumption in 1000s of cubic feet.

Source

A data set collected in the 1960s by Mr Derek Whiteside of theUK Building Research Station. Reported by

Hand, D. J., Daly, F., McConway, K., Lunn, D. and Ostrowski, E. eds (1993)A Handbook of Small Data Sets.Chapman & Hall, p. 69.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

require(lattice)xyplot(Gas ~ Temp | Insul, whiteside, panel =  function(x, y, ...) {    panel.xyplot(x, y, ...)    panel.lmline(x, y, ...)  }, xlab = "Average external temperature (deg. C)",  ylab = "Gas consumption  (1000 cubic feet)", aspect = "xy",  strip = function(...) strip.default(..., style = 1))gasB <- lm(Gas ~ Temp, whiteside, subset = Insul=="Before")gasA <- update(gasB, subset = Insul=="After")summary(gasB)summary(gasA)gasBA <- lm(Gas ~ Insul/Temp - 1, whiteside)summary(gasBA)gasQ <- lm(Gas ~ Insul/(Temp + I(Temp^2)) - 1, whiteside)coef(summary(gasQ))gasPR <- lm(Gas ~ Insul + Temp, whiteside)anova(gasPR, gasBA)options(contrasts = c("contr.treatment", "contr.poly"))gasBA1 <- lm(Gas ~ Insul*Temp, whiteside)coef(summary(gasBA1))

Bandwidth Selection by Pilot Estimation of Derivatives

Description

Uses the method of Sheather & Jones (1991) to select the bandwidth ofa Gaussian kernel density estimator.

Usage

width.SJ(x, nb = 1000, lower, upper, method = c("ste", "dpi"))

Arguments

x

a numeric vector

nb

number of bins to use.

upper,lower

range over which to search for solution ifmethod = "ste".

method

Either"ste" ("solve-the-equation") or"dpi"("direct plug-in").

Value

a bandwidth.

Note

A faster version for largen (thousands) is available inR\ge 3.4.0 as part ofbw.SJ: quadruple itsvalue for comparability with this version.

References

Sheather, S. J. and Jones, M. C. (1991) A reliable data-based bandwidthselection method for kernel density estimation.Journal of the Royal Statistical Society series B53, 683–690.

Scott, D. W. (1992)Multivariate Density Estimation: Theory, Practice, and Visualization.Wiley.

Wand, M. P. and Jones, M. C. (1995)Kernel Smoothing.Chapman & Hall.

Examples

width.SJ(geyser$duration, method = "dpi")width.SJ(geyser$duration)width.SJ(galaxies, method = "dpi")width.SJ(galaxies)

Write a Matrix or Data Frame

Description

Writes a matrix or data frame to a file or the console, using columnlabels and a layout respecting columns.

Usage

write.matrix(x, file = "", sep = " ", blocksize)

Arguments

x

matrix or data frame.

file

name of output file. The default ("") is the console.

sep

The separator between columns.

blocksize

If supplied and positive, the output is written in blocks ofblocksize rows. Choose as large as possible consistent withthe amount of memory available.

Details

Ifx is a matrix, supplyingblocksize is morememory-efficient and enables larger matrices to be written, but eachblock of rows might be formatted slightly differently.

Ifx is a data frame, the conversion to a matrix may negate thememory saving.

Side Effects

A formatted file is produced, with column headings (ifx has them)and columns of data.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Weight Loss Data from an Obese Patient

Description

The data frame gives the weight, in kilograms, of an obese patient at 52time points over an 8 month period of a weight rehabilitation programme.

Usage

wtloss

Format

This data frame contains the following columns:

Days: time in days since the start of the programme.
Weight: weight in kilograms of the patient.

Source

Dr T. Davies, Adelaide.

References

Venables, W. N. and Ripley, B. D. (2002)Modern Applied Statistics with S. Fourth edition. Springer.

Examples

## IGNORE_RDIFF_BEGINwtloss.fm <- nls(Weight ~ b0 + b1*2^(-Days/th),    data = wtloss, start = list(b0=90, b1=95, th=120))wtloss.fm## IGNORE_RDIFF_ENDplot(wtloss)with(wtloss, lines(Days, fitted(wtloss.fm)))

Movatterモバイル変換

Australian AIDS Survival Data

Description

Usage

Format

Note

Source

References

Brain and Body Weights for 28 Species

Description

Usage

Format

Note

Source

References

Belgium Phone Calls 1950-1973

Description

Usage

Source

References

Housing Values in Suburbs of Boston

Description

Usage

Format

Source

Data from 93 Cars on Sale in the USA in 1993

Description

Usage

Format

Details

Source

References

Diagnostic Tests on Patients with Cushing's Syndrome

Description

Usage

Format

Source

References

DDT in Kale

Description

Usage

Source

Level of GAG in Urine of Children

Description

Usage

Format

Source

References

Numbers of Car Insurance claims

Description

Usage

Format

Source

References

Examples

Internal MASS functions

Description

Usage

Details

Survival from Malignant Melanoma

Description

Usage

Format

Source

Null Spaces of Matrices

Description

Usage

Arguments

Details

Value

References

See Also

Examples

Tests of Auditory Perception in Children with OME

Description

Usage

Format

Background

Source

Examples