It can be a bit fiddly to get a phylogenetic dataset into R, particularly if you are not used to working with files in the Nexus format.

Load raw data

From an Excel spreadsheet

If your data is in an Excel spreadsheet, one way to load it into R is to use the ‘readxl’ package. First you’ll have to install it:

install.packages("readxl") # You only need to do this once

Then you should prepare your Excel spreadsheet such that each row corresponds to a taxon, and each column to a character.

Then you can read the data from the Excel file by telling R which sheet, rows and columns contain your data:

library("readxl")
raw_data <- as.matrix(read_excel(
  filename,
  sheet = 1, # Loads sheet number 1 from the Excel file
  range = "B1:AA21", # Extracts columns B to AA, rows 1 to 21
  # Note that the first row is interpreted as column (character) names
  col_types = "text" # Read all columns as character strings
))
# Read row (taxon) names from column A
# Again, the first cell will be interpreted as a column name
taxon_names <- unlist(read_excel(filename, sheet = 1, range = "A1:A21"))
rownames(raw_data) <- taxon_names

From a text or CSV (comma separated values) file

Characters can be read from a text file in a similar manner to Excel. You may need to adjust the R commands to match the particular format of your input file.

raw_data <- read.table(
  filename, # Path to your input file
  sep = ",", # What character separates columns?
  header = TRUE, # Does the data contain a header row?
  row.names = 1, # Which column contains the row names?
  na.strings = "",
  stringsAsFactors = FALSE
)
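As a minimal, self-contained sketch of the call above, the following writes a small made-up matrix to a temporary CSV file and reads it back. The taxon and character names are invented for illustration. The `colClasses = "character"` argument and the `as.matrix()` wrapper are additions that are not part of the call above; they are included on the assumption that you want every token kept as text in a matrix.

```r
# Write a small, made-up character matrix to a temporary CSV file
csv_file <- tempfile(fileext = ".csv")
writeLines(c(
  "taxon,char1,char2,char3",
  "Taxon_A,0,1,-",
  "Taxon_B,1,?,0",
  "Taxon_C,0,0,1"
), csv_file)

# Read it back, using the first column as row (taxon) names
raw_data <- as.matrix(read.table(
  csv_file,
  sep = ",",
  header = TRUE,
  row.names = 1,
  na.strings = "",
  stringsAsFactors = FALSE,
  colClasses = "character" # Assumption: keep all tokens as text
))

dim(raw_data)      # Number of taxa, number of characters
rownames(raw_data) # Taxon names taken from the first column
```

It may be worth checking the dimensions and row names in this way after loading your own file, before converting the data any further.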

From a Nexus file

TreeTools contains an inbuilt Nexus parser:

library("TreeTools") # Load the package that provides the parser
raw_data <- ReadCharacters(filename)
# Or, to go straight to PhyDat format:
as_phydat <- ReadAsPhyDat(filename)

This will extract character names and codings from a dataset. It’s been written to work with datasets downloaded from MorphoBank, but my aim is for this function to handle most valid (and many invalid) NEXUS files. If you find a file that this function can’t handle, please let me know and I’ll try to fix it.

In the meantime, alternative Nexus parsers are available: try

raw_data <- ape::read.nexus.data(filename)

Non-standard elements of a Nexus file might be beyond the capabilities of ape’s parser. In particular, you will need to replace spaces in taxon names with an underscore, and to arrange all data into a single block starting BEGIN DATA. You’ll need to strip out comments, character definitions and separate taxon blocks.
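Two of the clean-up steps described above can be sketched in base R. This is only an illustration on made-up strings, not a full Nexus pre-processor: it assumes that comments are enclosed in square brackets and are not nested, and the names `taxon_names` and `matrix_lines` are hypothetical.

```r
# Made-up taxon names, some containing spaces
taxon_names <- c("Taxon one", "Taxon two", "Taxon_three")

# Replace each space with an underscore
safe_names <- gsub(" ", "_", taxon_names, fixed = TRUE)

# Made-up lines of character codings containing bracketed comments
matrix_lines <- c("0101 [a comment] 1?", "110- 0[another]1")

# Remove everything between a "[" and the next "]"
# (assumption: comments are never nested)
no_comments <- gsub("\\[[^]]*\\]", "", matrix_lines)

safe_names
no_comments
```

Rearranging blocks and removing character definitions depends on the layout of the individual file, so those steps are not attempted here.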

The function readNexus in package phylobase uses the NCL library and promises to be more powerful, but I’ve not been able to get it to work.

From a TNT file

A TNT format dataset downloaded from MorphoBank can be parsed with ReadTntCharacters, which might also handle other TNT-compatible files. If there’s a file that’s not being read correctly, please let me know and I’ll try to fix it.

raw_data <- ReadTntCharacters(filename)
# Or, to go straight to PhyDat format:
my_data <- ReadTntAsPhyDat(filename)

Processing raw data

Next, we need the raw data in the R-friendly phyDat format. If you’ve used the ReadAsPhyDat or ReadTntAsPhyDat functions, then you can skip this step: you’re already there.

Otherwise, you can try

my_data <- PhyDat(raw_data)

or if that doesn’t work,

my_data <- MatrixToPhyDat(raw_data)

These functions are pretty robust, but might return an error when they encounter an unexpected dataset format. If they don’t work on your dataset, please let me know.
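To show what kind of input these functions expect, here is a minimal sketch that builds a small matrix of tokens by hand and passes it to MatrixToPhyDat. The matrix contents and taxon names are made up, and the `requireNamespace()` guard is an addition so that the example does nothing if TreeTools is not installed.

```r
# A small, made-up matrix of tokens: rows are taxa, columns are characters
raw_data <- matrix(
  c("0", "1", "-",
    "1", "?", "0",
    "0", "0", "1"),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Taxon_A", "Taxon_B", "Taxon_C"), NULL)
)

# Only attempt the conversion if TreeTools is installed
my_data <- if (requireNamespace("TreeTools", quietly = TRUE)) {
  TreeTools::MatrixToPhyDat(raw_data)
} else {
  NULL
}

# If the conversion ran, list the taxa in the resulting object
if (!is.null(my_data)) {
  print(names(my_data))
}
```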

Failing that, you can enlist the help of the ‘phangorn’ package:

install.packages("phangorn")
library("phangorn")
my_data <- phyDat(raw_data, type = "USER", levels = c(0:9, "-"))

type = "USER" tells the parser to expect morphological data.

The levels parameter simply lists all the states that any character might take. 0:9 includes all the integer digits from 0 to 9. If you have inapplicable data in your matrix, you should list - as a separate level, as it represents an additional state (as handled by the Morphy implementation of Brazeau, Guillerme, & Smith, 2019). If you have more complicated ambiguities, you may need to use a contrast matrix to decode your matrix.
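One way to decide which levels you need is to list the tokens that actually occur in your matrix. The base R sketch below does this for a small made-up matrix; the names `levels_used` and `extra_tokens` are hypothetical. Tokens that fall outside the chosen levels are candidates for treatment as ambiguities, possibly through a contrast matrix.

```r
# A small, made-up matrix of tokens: rows are taxa, columns are characters
raw_data <- matrix(
  c("0", "1", "-", "A",
    "1", "?", "0", "2",
    "0", "0", "1", "1"),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Taxon_A", "Taxon_B", "Taxon_C"), NULL)
)

# The levels passed to phyDat(): the digits 0 to 9, plus "-"
levels_used <- c(0:9, "-")

# Every distinct token that occurs in the matrix
tokens <- unique(as.vector(raw_data))

# Tokens that are not listed as levels
extra_tokens <- setdiff(tokens, levels_used)
extra_tokens
```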

A contrast matrix translates the tokens used in your dataset to the character states to which they correspond: for example decoding ‘A’ to {01}. For more details, see the ‘phangorn-specials’ vignette in the phangorn package, accessible by typing ‘?phangorn’ in the R prompt and navigating to index > package vignettes.

contrast.matrix <- matrix(data = c(
  # 0 1 -  # Each column corresponds to a character-state
  1, 0, 0, # Each row corresponds to a token, here 0, denoting the
           # character-state set {0}
  0, 1, 0, # 1 | {1}
  0, 0, 1, # - | {-}
  1, 1, 0, # A | {01}
  1, 1, 0, # + | {01}
  1, 1, 1  # ? | {01-}
), ncol = 3, # ncol corresponds to the number of columns in the matrix
byrow = TRUE)
dimnames(contrast.matrix) <- list(
  c(0, 1, "-", "A", "+", "?"), # A list of the tokens corresponding to each row
                               # in the contrast matrix
  c(0, 1, "-") # A list of the character-states corresponding to the columns
               # in the contrast matrix
)
contrast.matrix

##   0 1 -
## 0 1 0 0
## 1 0 1 0
## - 0 0 1
## A 1 1 0
## + 1 1 0
## ? 1 1 1

If you need to use a contrast matrix, convert the data using

my_data <- phyDat(raw_data, type = "USER", contrast = contrast.matrix)
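Before converting, it may help to confirm that every token in your data has a row in the contrast matrix. The sketch below rebuilds the contrast matrix shown above, checks a small made-up dataset against it, and then attempts the conversion. The dataset and the name `unmatched` are hypothetical, and the `requireNamespace()` guard is an addition so that the example does nothing if phangorn is not installed.

```r
# The same contrast matrix as above, written compactly
contrast.matrix <- matrix(
  c(1, 0, 0,
    0, 1, 0,
    0, 0, 1,
    1, 1, 0,
    1, 1, 0,
    1, 1, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(c("0", "1", "-", "A", "+", "?"), c("0", "1", "-"))
)

# A small, made-up matrix of tokens: rows are taxa, columns are characters
raw_data <- matrix(
  c("0", "1", "-", "A",
    "1", "?", "0", "+",
    "0", "0", "1", "1"),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Taxon_A", "Taxon_B", "Taxon_C"), NULL)
)

# Tokens present in the data but absent from the contrast matrix
unmatched <- setdiff(unique(as.vector(raw_data)), rownames(contrast.matrix))
if (length(unmatched) > 0) {
  stop("Tokens missing from contrast matrix: ",
       paste(unmatched, collapse = ", "))
}

# Only attempt the conversion if phangorn is installed
converted <- if (requireNamespace("phangorn", quietly = TRUE)) {
  phangorn::phyDat(raw_data, type = "USER", contrast = contrast.matrix)
} else {
  NULL
}
```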


Loading phylogenetic data into R

Martin R. Smith (martin.smith@durham.ac.uk)

2025-09-23
