An R package for space-time anomaly detection using scanstatistics.
To install the latest (CRAN) release of this package, type thefollowing:
install.packages("scanstatistics")To install the development version of this package, type thisinstead:
devtools::install_github("promerpr/scanstatistics")Scan statistics are used to detect anomalous clusters in spatial orspace-time data. The gist of the methodology, at least in this package,is this:
scan_eb_poisson: computes theexpectation-based Poisson scan statistic (Neill 2005).scan_pb_poisson: computes the(population-based) space-time scan statistic (Kulldorff 2001).scan_eb_negbin: computes theexpectation-based negative binomial scan statistic (Tango etal. 2011).scan_eb_zip: computes theexpectation-based zero-inflated Poisson scan statistic (Allévius &Höhle 2017).scan_permutation: computes thespace-time permutation scan statistic (Kulldorff et al. 2005).scan_bayes_negbin: computes theBayesian Spatial scan statistic (Neill 2006), extended to a space-timesetting.knn_zones: Creates a set of spatialzones (groups of locations) to scan for anomalies. Input is amatrix in which rows are the enumerated locations, and columns the knearest neighbors. To create such a matrix, the following two functionsare useful:coords_to_knn: usestats::dist to get the k nearest neighbors of each locationinto a format usable byknn_zones.dist_to_knn: use an already computeddistance matrix to get the k nearest neighbors of each location into aformat usable byknn_zones.flexible_zones: An alternative toknn_zones that uses the adjacency structure of locations tocreate a richer set of zones. The additional input is an adjacencymatrix, but otherwise works asknn_zones.score_locations: Score each locationby how likely it is to have an ongoing anomaly in it. This score isheuristically motivated.top_clusters: Get the top k space-timeclusters, either overlapping or non-overlapping in the spatialdimension.df_to_matrix: Convert a data framewith data for each location and time point to a matrix with locationsalong the column dimension and time along the row dimension, with theselected data as values.To demonstrate the scan statistics in this package, we will use adataset of the annual number of brain cancer cases in the counties ofNew Mexico, for the years 1973-1991. This data was studied by Kulldorff(1998), who detected a cluster of cancer cases in the counties LosAlamos and Santa Fe during the years 1986-1989, though the excess ofbrain cancer in this cluster was not deemed statistically significant.The data originally comes from the packagersatscan, whichprovides an interface to the programSaTScan, but it has been aggregatedand extended for thescanstatistics package.
To get familiar with the counties of New Mexico, we begin by plottingthem on a map using the data framesNM_map andNM_geo supplied by thescanstatistics package:
library(scanstatistics)library(ggplot2)# Load map datadata(NM_map)data(NM_geo)# Plot map with labels at centroidsggplot()+geom_polygon(data = NM_map,mapping =aes(x = long,y = lat,group = group),color ="grey",fill ="white")+geom_text(data = NM_geo,mapping =aes(x = center_long,y = center_lat,label = county))+ggtitle("Counties of New Mexico")
We can further obtain the yearly number of cases and the populationfor each country for the years 1973-1991 from the data tableNM_popcas provided by the package:
data(NM_popcas)head(NM_popcas)#> year county population count#> 1 1973 bernalillo 353813 16#> 2 1974 bernalillo 357520 16#> 3 1975 bernalillo 368166 16#> 4 1976 bernalillo 378483 16#> 5 1977 bernalillo 388471 15#> 6 1978 bernalillo 398130 18It should be noted that Cibola county was split from Valencia countyin 1981, and cases in Cibola have been counted to Valencia in thedata.
The Poisson distribution is a natural first option when dealing withcount data. Thescanstatistics package provides the twofunctionsscan_eb_poisson andscan_pb_poissonwith this distributional assumption. The first is an expectation-based1 scan statistic for univariatePoisson-distributed data proposed by Neill et al. (2005), and we focuson this one in the example below. The second scan statistic is thepopulation-based scan statistic proposed by Kulldorff (2001).
The first argument to any of the scan statistics in this packageshould be a matrix (or array) of observed counts, whether they beinteger counts or real-valued “counts”. In such a matrix, the columnsshould represent locations and the rows the time intervals, orderedchronologically from the earliest interval in the first row to the mostrecent in the last. In this example we would like to detect a potentialcluster of brain cancer in the counties of New Mexico during the years1986-1989, so we begin by retrieving the count and population data fromthat period and reshaping them to a matrix using the helper functiondf_to_matrix:
library(dplyr)#>#> Attaching package: 'dplyr'#> The following objects are masked from 'package:stats':#>#> filter, lag#> The following objects are masked from 'package:base':#>#> intersect, setdiff, setequal, unioncounts<- NM_popcas%>%filter(year>=1986& year<1990)%>%df_to_matrix(time_col ="year",location_col ="county",value_col ="count")#> Warning: `spread_()` was deprecated in tidyr 1.2.0.#> Please use `spread()` instead.#> This warning is displayed once every 8 hours.#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.The second argument toscan_eb_poisson should be a listof integer vectors, each such vector being azone, which is thename for the spatial component of a potential outbreak cluster. Such azone consists of one or more locations grouped together according totheir similarity across features, and each location is numbered as thecorresponding column index of thecounts matrix above(indexing starts at 1).
In this example, the locations are the counties of New Mexico and thefeatures are the coordinates of the county seats. These are madeavailable in the data tableNM_geo. Similarity will bemeasured using the geographical distance between the seats of thecounties, taking into account the curvature of the earth. A distancematrix is calculated using thespDists function from thesp package, which is then passed todist_to_knnand on toknn_zones:
library(sp)library(magrittr)# Remove Cibola since cases have been counted towards Valencia. Ideally, this# should be accounted for when creating the zones.zones<- NM_geo%>%filter(county!="cibola")%>%select(seat_long, seat_lat)%>% as.matrix%>%spDists(x = .,y = .,longlat =TRUE)%>%dist_to_knn(k =15)%>% knn_zonesThe advantage of expectation-based scan statistics is that parameterssuch as the expected value can be modelled and estimated from past datae.g. by some form of regression. For the expectation-based Poisson scanstatistic, we can use a (very simple) Poisson GLM to estimate theexpected value of the count in each county and year, accounting for thedifferent populations in each region. Similar to thecountsargument, the expected values should be passed as a matrix to thescan_eb_poisson function:
mod<-glm(count~offset(log(population))+1+I(year-1985),family =poisson(link ="log"),data = NM_popcas%>%filter(year<1986))ebp_baselines<- NM_popcas%>%filter(year>=1986& year<1990)%>%mutate(mu =predict(mod,newdata = .,type ="response"))%>%df_to_matrix(value_col ="mu")Note that the population numbers are (perhaps poorly) interpolatedfrom the censuses conducted in 1973, 1982, and 1991.
We can now calculate the Poisson scan statistic. To give us moreconfidence in our detection results, we will perform 999 Monte Carloreplications, by which data is generated using the parameters from thenull hypothesis and a new scan statistic calculated. This is thensummarized in a P-value, calculated as the proportion of times thereplicated scan statistics exceeded the observed one. The output ofscan_poisson is an object of class “scanstatistic”, whichcomes with the print method seen below.
set.seed(1)poisson_result<-scan_eb_poisson(counts = counts,zones = zones,baselines = ebp_baselines,n_mcsim =999)print(poisson_result)#> Data distribution: Poisson#> Type of scan statistic: expectation-based#> Setting: univariate#> Number of locations considered: 32#> Maximum duration considered: 4#> Number of spatial zones: 415#> Number of Monte Carlo replicates: 999#> Monte Carlo P-value: 0.005#> Gumbel P-value: NULL#> Most likely event duration: 4#> ID of locations in MLC: 15, 26As we can see, the most likely cluster for an anomaly stretches from1986-1989 and involves the locations numbered 15 and 26, whichcorrespond to the counties
counties<-as.character(NM_geo$county)counties[c(15,26)][1]"losalamos""santafe"These are the same counties detected by Kulldorff (1998), thoughtheir analysis was retrospective rather than prospective as ours was.Ours was also data dredging as we used the same study period with hopesof detecting the same cluster.
We can score each county according to how likely it is to be part ofa cluster in a heuristic fashion using the functionscore_locations, and visualize the results on a heatmap asfollows:
# Calculate scores and add column with county namescounty_scores<-score_locations(poisson_result, zones)county_scores%<>%mutate(county =factor(counties[-length(counties)],levels =levels(NM_geo$county)))# Create a table for plottingscore_map_df<-merge(NM_map, county_scores,by ="county",all.x =TRUE)%>%arrange(group, order)# As noted before, Cibola county counts have been attributed to Valencia countyscore_map_df[score_map_df$subregion=="cibola", ]%<>%mutate(relative_score = score_map_df%>%filter(subregion=="valencia")%>%select(relative_score)%>% .[[1]]%>% .[1])ggplot()+geom_polygon(data = score_map_df,mapping =aes(x = long,y = lat,group = group,fill = relative_score),color ="grey")+scale_fill_gradient(low ="#e5f5f9",high ="darkgreen",guide =guide_colorbar(title ="Relative\nScore"))+geom_text(data = NM_geo,mapping =aes(x = center_long,y = center_lat,label = county),alpha =0.5)+ggtitle("County scores")
A warning though: thescore_locations function can bequite slow for large data sets. This might change in future versions ofthe package.
Finally, if we want to know not just the most likely cluster, but saythe five top-scoring space-time clusters, we can use the functiontop_clusters. The clusters returned can either beoverlapping or non-overlapping in the spatial dimension, according toour liking.
top5<-top_clusters(poisson_result, zones,k =5,overlapping =FALSE)# Find the counties corresponding to the spatial zones of the 5 clusters.top5_counties<- top5$zone%>% purrr::map(get_zone,zones = zones)%>% purrr::map(function(x) counties[x])# Add the counties corresponding to the zones as a columntop5%<>%mutate(counties = top5_counties)Thetop_clusters function includes Monte Carlo andGumbel P-values for each cluster. These P-values are conservative, sincesecondary clusters from the original data are compared to the mostlikely clusters from the replicate data sets.
Other univariate scan statistics can be calculated practically in thesame way as above, though the distribution parameters need to be adaptedfor each scan statistic.
If you think this package lacks some functionality, or that somethingneeds better documentation, please open an issue here.
Allévius, B., M. Höhle (2017):An expectation-based space-timescan statistic for ZIP-distributed data, (under review).
Kleinman, K. (2015):Rsatscan: Tools, Classes, and Methods forInterfacing with SaTScan Stand-Alone Software,https://CRAN.R-project.org/package=rsatscan.
Kulldorff, M., Athas, W. F., Feuer, E. J., Miller, B. A., Key, C. R.(1998):Evaluating Cluster Alarms: A Space-Time Scan Statistic andBrain Cancer in Los Alamos, American Journal of Public Health 88(9), 1377–80.
Kulldorff, M. (2001),Prospective time periodic geographicaldisease surveillance using a scan statistic, Journal of the RoyalStatistical Society, Series A (Statistics in Society), 164, 61–72.
Kulldorff, M., Heffernan, R., Hartman, J., Assunção, R. M.,Mostashari, F. (2005):A space-time permutation scan statistic fordisease outbreak detection, PLoS Medicine, 2 (3), 0216-0224.
Neill, D. B., Moore, A. W., Sabhnani, M., Daniel, K. (2005):Detection of Emerging Space-Time Clusters, In Proceedings ofthe Eleventh ACM SIGKDD International Conference on Knowledge Discoveryin Data Mining, 218–27. ACM.
Neill, D. B., Moore, A. W., Cooper, G. F. (2006):A BayesianSpatial Scan Statistic, Advances in Neural Information ProcessingSystems 18: Proceedings of the 2005 Conference.
Tango, T., Takahashi, K. Kohriyama, K. (2011),A Space-Time ScanStatistic for Detecting Emerging Outbreaks, Biometrics 67 (1),106–15.
Expectation-based scan statistics usepast non-anomalous data to estimate distribution parameters, and thencompares observed cluster counts from the time period of interest tothese estimates. In contrast,population-based scan statisticscompare counts in a cluster to those outside, only using data from theperiod of interest, and does so conditional on the observed totalcount.↩︎