Movatterモバイル変換

install.packages("scanstatistics")

devtools::install_github("promerpr/scanstatistics")

What are scan statistics?

Scan statistics are used to detect anomalous clusters in spatial orspace-time data. The gist of the methodology, at least in this package,is this:

Main functions

Scan statistics

Zone creation

Miscellaneous

Example: Brain cancer in NewMexico

To demonstrate the scan statistics in this package, we will use adataset of the annual number of brain cancer cases in the counties ofNew Mexico, for the years 1973-1991. This data was studied by Kulldorff(1998), who detected a cluster of cancer cases in the counties LosAlamos and Santa Fe during the years 1986-1989, though the excess ofbrain cancer in this cluster was not deemed statistically significant.The data originally comes from the packagersatscan, whichprovides an interface to the programSaTScan, but it has been aggregatedand extended for thescanstatistics package.

To get familiar with the counties of New Mexico, we begin by plottingthem on a map using the data framesNM_map andNM_geo supplied by thescanstatistics package:

library(scanstatistics)library(ggplot2)# Load map datadata(NM_map)data(NM_geo)# Plot map with labels at centroidsggplot()+geom_polygon(data = NM_map,mapping =aes(x = long,y = lat,group = group),color ="grey",fill ="white")+geom_text(data = NM_geo,mapping =aes(x = center_long,y = center_lat,label = county))+ggtitle("Counties of New Mexico")

We can further obtain the yearly number of cases and the populationfor each country for the years 1973-1991 from the data tableNM_popcas provided by the package:

data(NM_popcas)head(NM_popcas)#>   year     county population count#> 1 1973 bernalillo     353813    16#> 2 1974 bernalillo     357520    16#> 3 1975 bernalillo     368166    16#> 4 1976 bernalillo     378483    16#> 5 1977 bernalillo     388471    15#> 6 1978 bernalillo     398130    18

It should be noted that Cibola county was split from Valencia countyin 1981, and cases in Cibola have been counted to Valencia in thedata.

A scan statistic for Poissondata

The Poisson distribution is a natural first option when dealing withcount data. Thescanstatistics package provides the twofunctionsscan_eb_poisson andscan_pb_poissonwith this distributional assumption. The first is an expectation-based¹ scan statistic for univariatePoisson-distributed data proposed by Neill et al. (2005), and we focuson this one in the example below. The second scan statistic is thepopulation-based scan statistic proposed by Kulldorff (2001).

Using the Poisson scanstatistic

The first argument to any of the scan statistics in this packageshould be a matrix (or array) of observed counts, whether they beinteger counts or real-valued “counts”. In such a matrix, the columnsshould represent locations and the rows the time intervals, orderedchronologically from the earliest interval in the first row to the mostrecent in the last. In this example we would like to detect a potentialcluster of brain cancer in the counties of New Mexico during the years1986-1989, so we begin by retrieving the count and population data fromthat period and reshaping them to a matrix using the helper functiondf_to_matrix:

library(dplyr)#>#> Attaching package: 'dplyr'#> The following objects are masked from 'package:stats':#>#>     filter, lag#> The following objects are masked from 'package:base':#>#>     intersect, setdiff, setequal, unioncounts<- NM_popcas%>%filter(year>=1986& year<1990)%>%df_to_matrix(time_col ="year",location_col ="county",value_col ="count")#> Warning: `spread_()` was deprecated in tidyr 1.2.0.#> Please use `spread()` instead.#> This warning is displayed once every 8 hours.#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

Spatial zones

The second argument toscan_eb_poisson should be a listof integer vectors, each such vector being azone, which is thename for the spatial component of a potential outbreak cluster. Such azone consists of one or more locations grouped together according totheir similarity across features, and each location is numbered as thecorresponding column index of thecounts matrix above(indexing starts at 1).

In this example, the locations are the counties of New Mexico and thefeatures are the coordinates of the county seats. These are madeavailable in the data tableNM_geo. Similarity will bemeasured using the geographical distance between the seats of thecounties, taking into account the curvature of the earth. A distancematrix is calculated using thespDists function from thesp package, which is then passed todist_to_knnand on toknn_zones:

library(sp)library(magrittr)# Remove Cibola since cases have been counted towards Valencia. Ideally, this# should be accounted for when creating the zones.zones<- NM_geo%>%filter(county!="cibola")%>%select(seat_long, seat_lat)%>%  as.matrix%>%spDists(x = .,y = .,longlat =TRUE)%>%dist_to_knn(k =15)%>%  knn_zones

Baselines

The advantage of expectation-based scan statistics is that parameterssuch as the expected value can be modelled and estimated from past datae.g. by some form of regression. For the expectation-based Poisson scanstatistic, we can use a (very simple) Poisson GLM to estimate theexpected value of the count in each county and year, accounting for thedifferent populations in each region. Similar to thecountsargument, the expected values should be passed as a matrix to thescan_eb_poisson function:

mod<-glm(count~offset(log(population))+1+I(year-1985),family =poisson(link ="log"),data = NM_popcas%>%filter(year<1986))ebp_baselines<- NM_popcas%>%filter(year>=1986& year<1990)%>%mutate(mu =predict(mod,newdata = .,type ="response"))%>%df_to_matrix(value_col ="mu")

Note that the population numbers are (perhaps poorly) interpolatedfrom the censuses conducted in 1973, 1982, and 1991.

Calculation

We can now calculate the Poisson scan statistic. To give us moreconfidence in our detection results, we will perform 999 Monte Carloreplications, by which data is generated using the parameters from thenull hypothesis and a new scan statistic calculated. This is thensummarized in a P-value, calculated as the proportion of times thereplicated scan statistics exceeded the observed one. The output ofscan_poisson is an object of class “scanstatistic”, whichcomes with the print method seen below.

set.seed(1)poisson_result<-scan_eb_poisson(counts = counts,zones = zones,baselines = ebp_baselines,n_mcsim =999)print(poisson_result)#> Data distribution:                Poisson#> Type of scan statistic:           expectation-based#> Setting:                          univariate#> Number of locations considered:   32#> Maximum duration considered:      4#> Number of spatial zones:          415#> Number of Monte Carlo replicates: 999#> Monte Carlo P-value:              0.005#> Gumbel P-value:                   NULL#> Most likely event duration:       4#> ID of locations in MLC:           15, 26

As we can see, the most likely cluster for an anomaly stretches from1986-1989 and involves the locations numbered 15 and 26, whichcorrespond to the counties

counties<-as.character(NM_geo$county)counties[c(15,26)][1]"losalamos""santafe"

These are the same counties detected by Kulldorff (1998), thoughtheir analysis was retrospective rather than prospective as ours was.Ours was also data dredging as we used the same study period with hopesof detecting the same cluster.

A heuristic score forlocations

We can score each county according to how likely it is to be part ofa cluster in a heuristic fashion using the functionscore_locations, and visualize the results on a heatmap asfollows:

# Calculate scores and add column with county namescounty_scores<-score_locations(poisson_result, zones)county_scores%<>%mutate(county =factor(counties[-length(counties)],levels =levels(NM_geo$county)))# Create a table for plottingscore_map_df<-merge(NM_map, county_scores,by ="county",all.x =TRUE)%>%arrange(group, order)# As noted before, Cibola county counts have been attributed to Valencia countyscore_map_df[score_map_df$subregion=="cibola", ]%<>%mutate(relative_score = score_map_df%>%filter(subregion=="valencia")%>%select(relative_score)%>%                          .[[1]]%>% .[1])ggplot()+geom_polygon(data = score_map_df,mapping =aes(x = long,y = lat,group = group,fill = relative_score),color ="grey")+scale_fill_gradient(low ="#e5f5f9",high ="darkgreen",guide =guide_colorbar(title ="Relative\nScore"))+geom_text(data = NM_geo,mapping =aes(x = center_long,y = center_lat,label = county),alpha =0.5)+ggtitle("County scores")

A warning though: thescore_locations function can bequite slow for large data sets. This might change in future versions ofthe package.

Finding the top-scoringclusters

Finally, if we want to know not just the most likely cluster, but saythe five top-scoring space-time clusters, we can use the functiontop_clusters. The clusters returned can either beoverlapping or non-overlapping in the spatial dimension, according toour liking.

top5<-top_clusters(poisson_result, zones,k =5,overlapping =FALSE)# Find the counties corresponding to the spatial zones of the 5 clusters.top5_counties<- top5$zone%>%  purrr::map(get_zone,zones = zones)%>%  purrr::map(function(x) counties[x])# Add the counties corresponding to the zones as a columntop5%<>%mutate(counties = top5_counties)

Thetop_clusters function includes Monte Carlo andGumbel P-values for each cluster. These P-values are conservative, sincesecondary clusters from the original data are compared to the mostlikely clusters from the replicate data sets.

Concluding remarks

Other univariate scan statistics can be calculated practically in thesame way as above, though the distribution parameters need to be adaptedfor each scan statistic.

Feedback

If you think this package lacks some functionality, or that somethingneeds better documentation, please open an issue here.

References

Allévius, B., M. Höhle (2017):An expectation-based space-timescan statistic for ZIP-distributed data, (under review).

Kulldorff, M., Athas, W. F., Feuer, E. J., Miller, B. A., Key, C. R.(1998):Evaluating Cluster Alarms: A Space-Time Scan Statistic andBrain Cancer in Los Alamos, American Journal of Public Health 88(9), 1377–80.

Kulldorff, M. (2001),Prospective time periodic geographicaldisease surveillance using a scan statistic, Journal of the RoyalStatistical Society, Series A (Statistics in Society), 164, 61–72.

Kulldorff, M., Heffernan, R., Hartman, J., Assunção, R. M.,Mostashari, F. (2005):A space-time permutation scan statistic fordisease outbreak detection, PLoS Medicine, 2 (3), 0216-0224.

Neill, D. B., Moore, A. W., Sabhnani, M., Daniel, K. (2005):Detection of Emerging Space-Time Clusters, In Proceedings ofthe Eleventh ACM SIGKDD International Conference on Knowledge Discoveryin Data Mining, 218–27. ACM.

Neill, D. B., Moore, A. W., Cooper, G. F. (2006):A BayesianSpatial Scan Statistic, Advances in Neural Information ProcessingSystems 18: Proceedings of the 2005 Conference.

Tango, T., Takahashi, K. Kohriyama, K. (2011),A Space-Time ScanStatistic for Detecting Emerging Outbreaks, Biometrics 67 (1),106–15.