- Notifications
You must be signed in to change notification settings - Fork2
R interface & extension to an adapted Persistence Landscapes Toolbox
License
corybrunson/plt
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Persistence landscapes are a vectorization of persistence data/diagramsthat have useful statistical properties including linearity and an innerproduct.1 This is an R package interface to a C++ library toefficienctly compute and calculate with persistence landscapes.2
Until the package is on CRAN, usepak to install the package fromthe GitHub repository as follows:
install.packages("pak")pak::pkg_install("corybrunson/plt")
Alternatively—and especially if you want to contribute—you can clone ordownload the code repository and, from within the directory, install thepackage from source:
devtools::install()
You should now be able to load the package normally from an R session:
library(plt)
Theplt package supports various operations involving persistencelandscapes:
- Compute persistence landscapes from persistence data
- Plot persistence landscapes
- Perform Hilbert space operations (scaling, addition, inner product)plus some additional queries and transformations (extremal values,absolute value, integration) on persistence landscapes
- Conduct hypothesis tests on samples of persistence landscapes
- Vectorize persistence landscapes for other purposes including machinelearning
Examples and tests inplt rely on other packages to simulate dataand to compute persistence diagrams from data:
- tdaunif provides functions to sample uniformly from variousimmersed manifolds.
- ripserr andTDA provide functions to compute persistence datafrom point clouds and distance matrices.
plt introduces the ‘Rcpp_PersistenceLandscape’ S4 class, which isexposed usingRcpp from the underlying ‘PersistenceLandscape’ C++class. Instances of this class can be created usingnew()
but therecommended way is to usepl_new()
. This function accepts either asingle matrix of persistence data or a specially formatted list with theclass'persistence_diagram"
. The$pairs
entry of the list is itselfa list, of a 2-column matrix of persistence pairs for each homologicaldegree from 0 ($pairs[[1]]
) to the maximum degree calculated. Thegeneric converteras_persistence()
includes methods for outputs fromripserr::vietoris_rips()
and fromTDA::*Diag()
; it operates underthe hood ofpl_new()
, but we invoke it explicitly here forillustration.
To begin an illustration, we noisily sample 60 points from a figureeight and compute the persistence diagram of the point cloud:
set.seed(513611L)pc<-tdaunif::sample_lemniscate_gerono(60,sd=.1)plot(pc,asp=1,pch=16L)
pd<-ripserr::vietoris_rips(pc,dim=1,threshold=2,p=2)print(pd)#> PHom object containing persistence data for 63 features.#>#> Contains:#> * 59 0-dim features#> * 4 1-dim features#>#> Radius/diameter: min = 0; max = 0.63582.
We the convert the persistence data to the preferred persistence diagramformat and inspect some of its features:
pd<- as_persistence(pd)print(pd)#> 'persistence' data computed up to degree 1:#>#> * 0-degree features: 59#> * 1-degree features: 4print(head(pd$pairs[[1]]))#> [,1] [,2]#> [1,] 0 0.01918952#> [2,] 0 0.01947548#> [3,] 0 0.02604350#> [4,] 0 0.04218479#> [5,] 0 0.04542467#> [6,] 0 0.05941691print(head(pd$pairs[[2]]))#> [,1] [,2]#> [1,] 0.4809292 0.6358225#> [2,] 0.3016234 0.6075172#> [3,] 0.2504500 0.2727915#> [4,] 0.2251884 0.2300871
This allows us to compute a persistence landscape—in this case, for the1-dimensional features. Here we compute the landscape exactly, which canbe cost-prohibitive for larger persistence data, and print its summary:
pl1<- pl_new(pd,degree=1,exact=TRUE)print(pl1)#> Persistence landscape (exact format) of 2 levels over (0,0.636)summary(pl1)#> Internal representation: exact#> Number of levels: 2#> Representation limits: ( 0.22519 , 0.6358 )#> Landscape range: ( 0 , 0.15295 )#> Magnitude: 0.0026959#> Integral: 0.029522
Some advanced concepts like the magnitude of a landscape will beexplained below.
The objectpl1
is not an array, but rather an object that encapsulatesboth the data that encode a landscape and several basic operations thatcan be performed on it. This allows us to work with persistencelandscapes without worrying about pre-processing their representations.At any point, the underlying encoding of the landscape can be extractedusing$getInternal()
, which in the case of an exactly calculatedlandscape returns a list of 2-column matrices, each matrix containingcoordinates that define one level of the landscape as a piecewise linearfunction:
print(length(pl1$getInternal()))#> [1] 2print(pl1$getInternal())#> [[1]]#> [,1] [,2]#> [1,] -Inf 0.000000000#> [2,] 0.2251884 0.000000000#> [3,] 0.2276378 0.002449358#> [4,] 0.2300871 0.000000000#> [5,] 0.2504500 0.000000000#> [6,] 0.2616207 0.011170764#> [7,] 0.2727915 0.000000000#> [8,] 0.3016234 0.000000000#> [9,] 0.4545703 0.152946885#> [10,] 0.5442232 0.063293964#> [11,] 0.5583759 0.077446647#> [12,] 0.6358225 0.000000000#> [13,] Inf 0.000000000#>#> [[2]]#> [,1] [,2]#> [1,] -Inf 0.00000000#> [2,] 0.4809292 0.00000000#> [3,] 0.5442232 0.06329396#> [4,] 0.6075172 0.00000000#> [5,] Inf 0.00000000
An alternative, approximate construction computes the value of eachlevel of the landscape at each point on a 1-dimensional grid, rangingfromxmin
toxmax
at increments ofxby
. A landscape constructedusing a discrete approximation is stored as a 3-dimensional array ofdimensions (levels, values, 2), with one level per feature (some ofwhich may be trivial) and one value per grid point, stored as
b_ran<- pl_support(pl1)pl1d<- pl_new(pd,degree=1,xmin=b_ran[[1L]],xmax=b_ran[[2L]],xby=0.05)print(dim(pl1d$getInternal()))#> [1] 4 10 2print(pl1d$getInternal())#> , , 1#>#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]#> [1,] 0.2251884 0.2751884 0.3251884 0.3751884 0.4251884 0.4751884 0.5251884#> [2,] 0.2251884 0.2751884 0.3251884 0.3751884 0.4251884 0.4751884 0.5251884#> [3,] 0.2251884 0.2751884 0.3251884 0.3751884 0.4251884 0.4751884 0.5251884#> [4,] 0.2251884 0.2751884 0.3251884 0.3751884 0.4251884 0.4751884 0.5251884#> [,8] [,9] [,10]#> [1,] 0.5751884 0.6251884 0.6751884#> [2,] 0.5751884 0.6251884 0.6751884#> [3,] 0.5751884 0.6251884 0.6751884#> [4,] 0.5751884 0.6251884 0.6751884#>#> , , 2#>#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]#> [1,] 0 0 0.02356502 0.07356502 0.123565 0.1323288 0.08232875 0.06063412#> [2,] 0 0 0.00000000 0.00000000 0.000000 0.0000000 0.04425917 0.03232875#> [3,] 0 0 0.00000000 0.00000000 0.000000 0.0000000 0.00000000 0.00000000#> [4,] 0 0 0.00000000 0.00000000 0.000000 0.0000000 0.00000000 0.00000000#> [,9] [,10]#> [1,] 0.01063412 0#> [2,] 0.00000000 0#> [3,] 0.00000000 0#> [4,] 0.00000000 0
Exactly computed landscapes can be converted to discrete landscapeobjects, but the other direction is not well-defined. Below, we view aportion of the discretized exact landscape, using the default bounds andresolution given topl1
:
# default conversion to discrete uses `xby = 0.001`pl1_<-pl1$discretize()print(dim(pl1_$getInternal()))#> [1] 2 636 2# print first 12 x-coordinatespl1_$getInternal()[, seq(230L,270L), ,drop=FALSE]#> , , 1#>#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]#> [1,] 0.229 0.23 0.231 0.232 0.233 0.234 0.235 0.236 0.237 0.238 0.239 0.24#> [2,] 0.229 0.23 0.231 0.232 0.233 0.234 0.235 0.236 0.237 0.238 0.239 0.24#> [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]#> [1,] 0.241 0.242 0.243 0.244 0.245 0.246 0.247 0.248 0.249 0.25 0.251 0.252#> [2,] 0.241 0.242 0.243 0.244 0.245 0.246 0.247 0.248 0.249 0.25 0.251 0.252#> [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36]#> [1,] 0.253 0.254 0.255 0.256 0.257 0.258 0.259 0.26 0.261 0.262 0.263 0.264#> [2,] 0.253 0.254 0.255 0.256 0.257 0.258 0.259 0.26 0.261 0.262 0.263 0.264#> [,37] [,38] [,39] [,40] [,41]#> [1,] 0.265 0.266 0.267 0.268 0.269#> [2,] 0.265 0.266 0.267 0.268 0.269#>#> , , 2#>#> [,1] [,2] [,3] [,4] [,5]#> [1,] 0.0006539916 5.241935e-05 4.985605e-05 4.729276e-05 4.472946e-05#> [2,] 0.0000000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00#> [,6] [,7] [,8] [,9] [,10]#> [1,] 4.216616e-05 3.960287e-05 3.703957e-05 3.447627e-05 3.191297e-05#> [2,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00#> [,11] [,12] [,13] [,14] [,15]#> [1,] 2.934968e-05 2.678638e-05 2.422308e-05 2.165979e-05 1.909649e-05#> [2,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00#> [,16] [,17] [,18] [,19] [,20] [,21]#> [1,] 1.653319e-05 1.396989e-05 1.14066e-05 8.8433e-06 6.280002e-06 3.716705e-06#> [2,] 0.000000e+00 0.000000e+00 0.00000e+00 0.0000e+00 0.000000e+00 0.000000e+00#> [,22] [,23] [,24] [,25] [,26] [,27]#> [1,] 1.153408e-06 0.0009623328 0.001923512 0.002884692 0.003845871 0.00480705#> [2,] 0.000000e+00 0.0000000000 0.000000000 0.000000000 0.000000000 0.00000000#> [,28] [,29] [,30] [,31] [,32] [,33]#> [1,] 0.00576823 0.006729409 0.007690589 0.008651768 0.009612947 0.01057413#> [2,] 0.00000000 0.000000000 0.000000000 0.000000000 0.000000000 0.00000000#> [,34] [,35] [,36] [,37] [,38] [,39]#> [1,] 0.009677368 0.00878061 0.007883851 0.006987093 0.006090334 0.005193576#> [2,] 0.000000000 0.00000000 0.000000000 0.000000000 0.000000000 0.000000000#> [,40] [,41]#> [1,] 0.004296817 0.003400059#> [2,] 0.000000000 0.000000000
We can also specify the bounds and the resolution of the discretization:
pl1<- pl_delimit(pl1,xmin=0,xmax=1,xby=0.1)pl1_<- pl_discretize(pl1)pl1_$getInternal()#> , , 1#>#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]#> [1,] 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1#> [2,] 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1#>#> , , 2#>#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]#> [1,] 0 0 0 0.09837659 0.1075172 0.03582254 0.000000000 0 0 0#> [2,] 0 0 0 0.00000000 0.0000000 0.04388611 0.003068344 0 0 0#> [,11]#> [1,] 0#> [2,] 0
plt provides aplot()
method for the ‘Rcpp_PersistenceLandscape’class. It usesgrDevices to build color palettes, and as such itsdefault palette is viridis; but the user may supply the name of arecognized palette or a sequence of colors between which to interpolate:
n_levs<- max(pl_num_levels(pl1), pl_num_levels(pl1d))par(mfrow= c(2L,1L),mar= c(2,2,0,2))plot(pl1,palette="terrain",n_levels=n_levs,asp=1)plot(pl1d,palette="terrain",n_levels=n_levs,asp=1)
par(mfrow= c(1L,1L),mar= c(5.1,4.1,4.1,2.1))
To illustrate these features, we first generate a companion data set:
# a new landscape and its discretizationset.seed(772888L)pc2<-tdaunif::sample_circle(60,sd=.1)/2pd2<-ripserr::vietoris_rips(pc2,dim=1,threshold=2,p=2)pl2<- pl_new(pd2,degree=1,exact=TRUE)pl2<- pl_delimit(pl2,xmin=0,xmax=2,xby=0.1)pl2_<- pl_discretize(pl2)
Several infix operators have been taught to work in the natural way withlandscapes, so that users can explore vector space operations and innerproducts more conveniently:
par(mfcol= c(3L,2L),mar= c(2,2,0,2))# vector space operations on exact landscapesplot(pl1*2)plot(pl2)plot(pl1*2+pl2)# vector space operations on discrete landscapesplot(pl1_)plot(-pl2_)plot(2*pl1_-pl2_)
par(mfrow= c(1L,1L),mar= c(5.1,4.1,4.1,2.1))# inner products of exact and discrete landscapespl1%*%pl2#> [1] 0.005336876pl1_%*%pl2_#> [1] 0.004917171pl1%*%pl2_#> [1] 0.004917171
(Note that the landscapes are automatically delimited to a compatibledomain.)
Thesummary()
method above reported the magnitude and the integral ofthe persistence landscapepl1
. The magnitude is the inner product ofpl1
with itself:pl1 %*% pl1 = 0.0026959
. Meanwhile, the integral isthe (signed) area under the curve, itself also a linear operator:
pl_integrate(pl1)#> [1] 0.02952152pl_integrate(pl2)#> [1] 0.07295263# 1-integral obeys linearitypl_integrate(pl1)*2- pl_integrate(pl2)#> [1] -0.01390959pl_integrate(pl1*2-pl2)#> [1] -0.01390959
The distance between two landscapes is defined in terms of the integralof their absolute difference for finite norms and the maximum pointwisedistance for the infinite norm. Note that, becausepl_integrate()
defaults to the 1-norm andpl_distance()
defaults to the 2-norm, wemust be careful when comparing their results:
# using the 1-normpl_integrate(pl_abs(pl1*2-pl2),p=1)#> [1] 0.03560593pl_distance(pl1*2,pl2,p=1)#> [1] 0.03560593# using the 2-normpl_integrate(pl_abs(pl1*2-pl2),p=2)^ (1/2)#> [1] 0.05041889pl_distance(pl1*2,pl2,p=2)#> [1] 0.05041889# using the infinity normpl_vmax(pl_abs(pl1*2-pl2))#> [1] 0.1178478 0.1265879pl_distance(pl1*2,pl2,p=Inf)#> [1] 0.1265879
The norm of a persistence landscape is then defined as its distance fromthe null landscape that is constant at zero. (The following code chunkis not evaluated, pending a debug ofpl_new()
when fed an emptymatrix.)
# null landscapepd0<-data.frame(start= double(0L),end= double(0L))# FIXME: Enable landscape construction from empty persistence data.pl0<- pl_new(pd0,degree=1,exact=TRUE)pl_distance(pl1,pl0)pl_norm(pl1)
Finally,plt implements the two hypothesis tests described in theoriginal paper. To illustrate, we first generate lists of landscapes forsamples from two spaces:
# samples of landscapes from lemniscatespl1s<- replicate(6, {pc<-tdaunif::sample_lemniscate_gerono(60,sd=.1)pd<-ripserr::vietoris_rips(pc,dim=1,threshold=2,p=2) pl_new(pd,degree=1,xby=.01)})# samples of landscapes from circlespl2s<- replicate(8, {pc<-tdaunif::sample_circle(60,sd=.1)/2pd<-ripserr::vietoris_rips(pc,dim=1,threshold=2,p=2) pl_new(pd,degree=1,xby=.01)})
An inspection of the mean landscapes makes clear that they are distinct:
# average landscape from each samplepar(mfcol= c(2L,1L),mar= c(2,2,0,2))plot(pl_mean(pl1s))plot(pl_mean(pl2s))
par(mfrow= c(1L,1L),mar= c(5.1,4.1,4.1,2.1))
However, the two hypothesis tests use different procedures and rely ondifferent test statistics, so one may be more effective than another.For convenience, both methods return objects of class'htest'
forconvenient printing:
# z-test of difference in integrals of first levelpl_z_test(pl1s,pl2s)#>#> z-test#>#> data:#> z = -2.3584, df = 12, p-value = 0.01835#> alternative hypothesis: true difference in means is not equal to 0#> 95 percent confidence interval:#> -0.067781063 -0.006254089#> sample estimates:#> mean integral of x mean integral of y#> 0.02664640 0.06366397# permutation test of pairwise distances between landscapespl_perm_test(pl1s,pl2s)#>#> permutation test#>#> data:#> p-value < 2.2e-16#> alternative hypothesis: true distance between mean landscapes is greater than 0#> sample estimates:#> distance between mean landscapes#> 0.06631177
The C++ library is adapted fromPaweł Dłotko’s Persistence LandscapeToolbox.It was originally adapted and ported to R inJose Bouza’stda-toolspackage.
Development of this package benefitted from the use of equipment and thesupport of colleagues at theUniversity ofFlorida, especiallyPeter Bubenik’s researchgroup and theLaboratory for SystemsMedicine.
Bug reports, unit tests, documentation, use cases, feature suggestions,and other contributions are welcome. See theCONTRIBUTINGfile for guidance, and please respect theCode ofConduct.
If you useplt in published work, please include a citationfollowingcitation("plt")
.
Footnotes
Bubenik P (2015) “Statistical Topological Data Analysis usingPersistence Landscapes”.Journal of Machine Learning Research16(3):77–102.https://jmlr.csail.mit.edu/papers/v16/bubenik15a.html↩
Bubenik P & Dłotko P (2017) “A persistence landscapes toolbox fortopological statistics”.Journal of Symbolic Computation78(1):91–114.https://www.sciencedirect.com/science/article/pii/S0747717116300104↩