Introduction to R and splot

Source: vignettes/splot.Rmd

Built with R 4.3.1

R is a programming language and software environment with a focus on statistics.
splot is an R package for visualizing data.
This guide will introduce you to both.

Setting up

Downloading R

Follow the link matching your system to download R:
Windows | Mac | Linux

On Windows, you may see two versions, starting with R i386 and R x64. These correspond to the 32-bit and 64-bit versions of R. The 64-bit version should be fine on most modern systems, but if you run into issues, you might try the 32-bit version.

Installing packages

Packages offer additional functionality beyond base R, usually to make certain processes easier.

The initial download of R includes a few base packages, but there are many more packages available through the Comprehensive R Archive Network (CRAN).

Packages can be downloaded and installed from within R using the install.packages function. For example, this will install splot:

install.packages("splot")

The first time you install packages, you’ll need to select a mirror. Mirrors are CRAN hosts: they all have the same files, but are in different physical locations. Choose a mirror that is geographically close to you for the best download speeds. If a package fails to download, try changing mirrors.

Loading packages

Each time you start R, packages that aren’t part of base R need to be loaded using the library function. For example:

library("splot")

Understanding R

The underlying system

The interpreter. When you enter commands into the console, the interpreter tries to understand them. You might think of this understanding in terms of functions (operators) and data (operands). For example, if you enter 1 + 1 into the console, R will understand that each 1 is a number, and the + is a function.

Functions. Almost everything in R is a function. Most functions are called by the ( function; that is, by the name of the function followed by parentheses (e.g., sum()). Many functions accept arguments: data entered inside the parentheses, separated by commas. For example, sum(1, 2) is a call to the sum function, with 1 as the first argument, and 2 as the second argument. The + function works on its own, but it can also be called by the ( function: 1 + 1 is the same as '+'(1, 1).

Most functions will output some form of data (in +’s case, the output is a single numeric value). This means that functions can be entered as arguments to other functions. For example, sum(sum(1, 1), 2) is another call to the sum function, with the output of sum(1, 1) as the first argument, and 2 as the second argument.
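As a small sketch of the two points above, using only base R: the backtick-quoted name used here is just another way to refer to the + function, and the variable names are ours, chosen for illustration.

```r
# Infix and prefix forms of the same call to the + function
infix <- 1 + 1
prefix <- `+`(1, 1)
identical(infix, prefix)
#> [1] TRUE

# The output of one call used as an argument to another call
nested <- sum(sum(1, 1), 2)
nested
#> [1] 4
```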

Data representations

In what follows, the outlined code boxes contain syntax-highlighted code which you can run in an R console, followed by its expected output (preceded by #>).

Matrices

Matrices store sets of data. For example, take a look at the Motor Trend dataset, which is included in base R:

mtcars
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

In the mtcars matrix, each row represents a particular car, and each column represents a feature of that car (a variable). You can use the ? function (?mtcars) to access documentation.

Note: In base R, there are purely numerical matrices (as made with the matrix function) and matrices with mixed data types (such as numerical and character or factor columns; as made by the data.frame function). These are both matrix representations, but they have some different methods (functions that interact with them). mtcars is a data.frame object (which you can see with the class function; class(mtcars)), but the methods used in these examples (such as the [ function) will also work with standard matrix objects.
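A quick way to see the distinction described in the note is to convert mtcars to a standard matrix and compare. This is only a sketch; the name m is ours, and as.matrix works cleanly here because every column of mtcars is numeric.

```r
# mtcars is stored as a data.frame
is.data.frame(mtcars)
#> [1] TRUE

# Convert it to a purely numerical matrix
m <- as.matrix(mtcars)
is.matrix(m)
#> [1] TRUE

# Both representations have the same shape: 32 rows and 11 columns
identical(dim(mtcars), dim(m))
#> [1] TRUE

# The [ function selects the same values from either one
# (the matrix version keeps row names on the result, so drop them to compare)
identical(mtcars[, "mpg"], unname(m[, "mpg"]))
#> [1] TRUE
```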

Matrices as arguments

Some functions accept entire matrices as arguments. For example, the colnames function will output a matrix’s column names:

colnames(mtcars)
#>  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
#> [11] "carb"

Vectors as arguments

Other functions only accept values or vectors (single columns or rows, or created independently with the c function) as arguments. You can use the [ function to select single columns or rows by name or index. [’s first argument selects rows, and its second argument selects columns. For example, you can select the mpg variable like this (note that variable names are case sensitive):

mtcars[,"mpg"]
#>  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
#> [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
#> [31] 15.0 21.4
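A few more selections, sketched here to show how the row and column arguments of [ combine. The variable names on the left are ours; the row and column names come from mtcars.

```r
# A single value: row by name, column by name
one_value <- mtcars["Mazda RX4", "mpg"]
one_value
#> [1] 21

# A column by index: mpg is the first column, so these select the same vector
by_index <- mtcars[, 1]
by_name <- mtcars[, "mpg"]
identical(by_index, by_name)
#> [1] TRUE

# Several rows and columns at once (this returns a smaller data.frame)
subset_small <- mtcars[1:3, c("mpg", "wt")]
dim(subset_small)
#> [1] 3 2
```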

Fun note: The [ function can also be called by the ( function: '['(mtcars,, 'mpg').

Since the [ function outputs a vector, you can enter it as an argument to another function, such as the sum function:

sum(mtcars[,"mpg"])
#> [1] 642.9

The sum function will handle multiple vectors or single values entered as individual arguments (sum(c(1, 2, 3)) is the same as sum(1, 2, 3)), but other functions expect a vector as the first argument. For instance, mean(c(1, 2, 3)) gives the average of 1, 2, and 3, whereas mean(1, 2, 3) would be the same as mean(1), giving the average of 1. Check a function’s documentation to see what it expects: sum’s first argument is ..., meaning it will collapse additional arguments (those without names matching other arguments) into the first argument, whereas mean’s first argument is x.
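The difference is easy to check in the console. This sketch uses only base R; the variable names are ours.

```r
# sum collapses all of its unnamed arguments together
sum_vector <- sum(c(1, 2, 3))
sum_separate <- sum(1, 2, 3)
c(sum_vector, sum_separate)
#> [1] 6 6

# mean only averages its first argument, x
mean_vector <- mean(c(1, 2, 3))
mean_separate <- mean(1, 2, 3)
c(mean_vector, mean_separate)
#> [1] 2 1

# The argument names show the difference
mean_args <- names(formals(mean))
mean_args
#> [1] "x"   "..."
```

In mean(1, 2, 3), the 2 and 3 are matched to other arguments of mean rather than being averaged, which is why only the 1 is used as data.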

Visualizing data with splot

The splot function generates all sorts of plots. Its first argument is for variable names, and its second argument is for the dataset containing those variables. See the documentation for more information about the splot function (enter ?splot in an R console, or view online).

Distribution of a single variable

For example, we can look at the density distribution (and histogram) of the mpg variable in mtcars like this:

splot(mpg,mtcars)

Here, the bars depict the frequency of the value range they cover (which is the histogram part), and the line is the estimated theoretical distribution of the variable (if more cars were sampled from the same source, they would theoretically resemble this distribution; e.g., most cars would get between 15 and 20 miles per gallon).
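To see the two ingredients separately, here is a rough base R sketch of a histogram plus density estimate. It is not how splot builds its figure internally (its bins and smoothing may differ); it only illustrates the idea with the hist and density functions.

```r
# Histogram part: count the cars falling in each range of mpg values
h <- hist(mtcars$mpg, plot = FALSE)
h$breaks
#> [1] 10 15 20 25 30 35
h$counts
#> [1]  6 12  8  2  4

# Density part: a smoothed estimate of the underlying distribution
d <- density(mtcars$mpg)

# Draw both together, with the bars on the density scale
hist(mtcars$mpg, freq = FALSE, main = "mpg", xlab = "Miles per gallon")
lines(d)
```

With these default bins, the tallest bar is the one covering 15 to 20 miles per gallon.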

Relationship between two variables

The first argument in the splot function can be entered as a formula, which is a way to specify relationships between variables. The core of a formula is the tilde (~), which separates a y variable (before the tilde; on the vertical axis of a plot) from an x variable (after the tilde; on the horizontal axis of a plot).

For example, we can look at the relationship between the mpg and wt variables like this:

splot(mpg~wt,mtcars)

Each dot represents a car, and its position is a combination of its miles per gallon (MPG) and weight; the higher it is vertically, the more miles it can go per gallon of gas, and the farther right it is horizontally, the more it weighs (in thousands of pounds).

The line is from a linear regression, which is attempting to predict y given x. For example, from this data, if the regression were to see a car that weighed about 3,000 pounds (a wt of 3), it would predict its MPG to be around 21.
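The same kind of prediction can be reproduced with base R’s lm and predict functions. This is a sketch of an ordinary least-squares fit; we assume, but have not confirmed, that it matches the line splot draws. The names fit and pred are ours.

```r
# Fit a linear regression predicting mpg from weight
fit <- lm(mpg ~ wt, data = mtcars)

# The slope is negative: heavier cars tend to get fewer miles per gallon
coef(fit)

# Predicted MPG for a car with a wt of 3 (3,000 pounds)
pred <- predict(fit, newdata = data.frame(wt = 3))
round(unname(pred), 2)
#> [1] 21.25
```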

Something we might do to improve this prediction (model fit) is to consider other variables. Weight seems to be closely related to MPG (going by the last plot), but maybe MPG depends on something else as well, such as the car’s style of transmission (automatic versus manual). To look at this, we can add a splitting variable to the formula with an asterisk (*).

Splitting variables break the data up into groups based on their value. For example, this will separate cars that have an automatic transmission (0) from those that have a manual transmission (1), and estimate a line for each group.

splot(mpg~wt*am,mtcars)

From this, it seems (in this sample of cars at least) that the negative relationship between weight and MPG is stronger among cars with a manual transmission; transmission appears to moderate the relationship between weight and MPG. That is, the line for cars with manual transmissions has a steeper slope than the line for cars with automatic transmissions.
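One way to check the slopes numerically is to fit a separate regression within each transmission group, or equivalently a single model with an interaction term. This is a base R sketch under the assumption that each line in the plot is an ordinary least-squares fit within its group; the names slopes and fit_int are ours.

```r
# Slope of mpg on wt within each transmission group (0 = automatic, 1 = manual)
slopes <- sapply(
  split(mtcars, mtcars$am),
  function(d) coef(lm(mpg ~ wt, data = d))[["wt"]]
)
slopes

# The manual group's slope is more negative (steeper)
slopes[["1"]] < slopes[["0"]]
#> [1] TRUE

# In a single model, the interaction term is the difference between the slopes
fit_int <- lm(mpg ~ wt * am, data = mtcars)
coef(fit_int)[["wt:am"]]
```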

Another way we can improve model fit is by allowing our prediction lines to bend. A particularly clear case where this seems to help is in modeling the relationship between weight and displacement:

splot(wt~disp+disp^2+disp^3,mtcars)

The ^ function raises the preceding vector to the power of the following value, so disp ^ 2 is the squared disp variable, and disp ^ 3 is the cubed disp variable. Each of these transformations of x increases the prediction line’s ability to bend.
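For comparison, here is a sketch of a similar cubic fit with base R’s lm. Note one difference that applies to lm rather than splot: inside an lm formula, ^ has a special meaning, so the powers need to be wrapped in I() to be treated as arithmetic. The model names are ours.

```r
# A straight-line fit and a cubic fit of weight on displacement
linear <- lm(wt ~ disp, data = mtcars)
cubic <- lm(wt ~ disp + I(disp^2) + I(disp^3), data = mtcars)

# Proportion of variance explained by each model; the cubic model contains the
# linear one, so it can only fit this sample as well or better
r2 <- c(
  linear = summary(linear)$r.squared,
  cubic = summary(cubic)$r.squared
)
r2
```

A better fit to this sample does not by itself mean the curved relationship is real, which is the concern raised in the next paragraph.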

Maybe the relationship between displacement and weight is actually curvy like this, but we might suspect there are just different types of cars represented here. For example, it kind of looks like there are clusters in the data, one under 200, and one between 200 and 400. We can visualize this by splitting displacement by itself at those points:

splot(wt~disp*disp,mtcars, split=c(200,400))

This cleans up the data nicely, but if we wanted to say these clusters actually represent different types of cars, it would be more convincing if we could find another variable that defines groups like these.
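To see which cars land in each group, displacement can be cut at the same points in base R. This is only a sketch: the group labels are ours, and cut places a value of exactly 200 or 400 in the lower group, which may not match how splot treats values on a boundary.

```r
# Cut displacement into three groups at 200 and 400
groups <- cut(
  mtcars$disp,
  breaks = c(-Inf, 200, 400, Inf),
  labels = c("200 or less", "200 to 400", "over 400")
)

# Number of cars in each group
group_counts <- table(groups)
group_counts

# Average weight within each group
tapply(mtcars$wt, groups, mean)
```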

Categorical variables

We started by looking at the mpg variable by itself, but since this dataset has named entries (unlike sets with less meaningful rows like participant IDs), it might be informative to visualize the MPG of each entry:

splot(mpg~rownames(mtcars),mtcars, type="bar", sort=TRUE)

Here, the additional arguments are changing aspects of the display from the way they would show up by default: The type argument sets the look of the data (bars rather than lines or points), and the sort argument changes the way the x variable is ordered (by y’s value rather than alphabetically).

The splot function has many more arguments which mostly affect the way each element of the figure is displayed. For example, in this figure, you might want to adjust the range of the y axis (with the myl argument), and maybe make the labels more informative (with the laby and labx arguments):

splot(mpg~rownames(mtcars),mtcars,
  type="bar", sort=TRUE,
  myl=c(10,35), laby="Miles Per Gallon", labx="Car")

To explore the data more broadly, we might look at a few variables at once. These can be entered as a matrix in the y position:

splot(mtcars[,c("cyl","carb","gear")]~mpg,mtcars, mv.as.x=TRUE)

The mv.as.x argument is saying the columns of y should be displayed as levels on the x axis (“mv” stands for “multiple variables”). Otherwise, they would be displayed as levels of a by variable, with MPG on the x axis.

This type of plot is more commonly displayed as a bar plot, because lines are sometimes taken to imply that there’s some movement between levels (as in the same participants experiencing different conditions; within-person experimental designs).

Another way to interpret lines, however, is as regression lines. This is particularly clear if we look at the raw data by representing this line plot as a scatter plot:

splot(mtcars[,c("cyl","carb","gear")]~mpg,mtcars,
  mv.as.x=TRUE, type="scatter", xlas=1, lpos="topright")

The xlas argument sets the orientation of the x axis labels (since they default to vertical for scatter plots), and the lpos argument sets the position of the legend.

This representation isn’t very informative in terms of the data (as there is a lot of overlap at each level of each variable), but these lines are actually prediction lines from regressions. The line plot depicts each part of the regression: where each line crosses an x axis label is the mean of the data represented by the line within that level; the error bars show the standard errors around those means (which relate to the p-value of the associated t test; if they overlap, the difference is unlikely to be significant); and the slope of the line between levels corresponds to the associated beta weight. In this sense, a line plot can be somewhat more informative than a bar plot.
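The means and standard errors behind such a plot can be approximated by hand. This sketch assumes the cars are grouped by a median split on mpg, which is our assumption for illustration and may not be how splot groups them; the names vars, high_mpg, se, means, and ses are ours.

```r
vars <- c("cyl", "carb", "gear")

# Assumed grouping: cars above versus at or below the median mpg
high_mpg <- mtcars$mpg > median(mtcars$mpg)

# Standard error of a mean
se <- function(x) sd(x) / sqrt(length(x))

# Mean of each variable within each group (rows: FALSE = lower mpg, TRUE = higher)
means <- sapply(mtcars[, vars], function(v) tapply(v, high_mpg, mean))
means

# Standard errors around those means
ses <- sapply(mtcars[, vars], function(v) tapply(v, high_mpg, se))
ses
```

Each column of means corresponds to one x axis level, and each row to one line; the difference between the rows within a column is what a t test for that variable would evaluate.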

For more applied examples, see the explore and refine vignettes.
For more splot-specific information, see the style guide and full documentation.

Brought to you by the Language Use and Social Interaction lab at Texas Tech University
