stppSim: An R package for synthesizing spatiotemporal point patterns - A user guide

Author: Adepeju, M., Big Data Centre, Manchester Metropolitan University, Manchester, M15 6BH, UK

Date: 2024-07-24

Abstract

In light of the progressively limited access to comprehensive spatially and temporally logged point data, the stppSim package presents an alternative data solution that carries substantial promise across a spectrum of research and practical applications. This package equips users with the capability to specify the attributes of an assemblage of ‘agents’ (symbolic of entities like objects, individuals, etc.), whose activities within spatial (landscape) and temporal contexts yield fresh instances of point patterns and interactions within the surroundings. The resultant assemblage of points and patterns can subsequently be quantified, scrutinized, and processed to facilitate assessments and evaluations of spatial and/or temporal models.

Introduction

In numerous research scenarios, the availability of detailed spatiotemporal (ST) point data is often greatly limited due to privacy considerations. The R package stppSim has been created to tackle this issue. It enables users to replicate real-world data situations, thus offering an alternative reservoir of spatiotemporal point patterns. The suggested methodology employs microsimulation and agent-based techniques to generate a collection of ‘walkers’ (which can represent agents, objects, individuals, etc.). These walkers possess defined movement characteristics and engage with the surrounding environment.

The package includes two main functions: (i) `psim_artif` and (ii) `psim_real`, both of which play a central role in simulating defined spatiotemporal interactions within point data. The function `psim_artif` generates these interactions based on user-provided parameters, executing the simulation without relying on any existing point data. In contrast, the function `psim_real` generates point interactions from a provided sample of real data. This latter function proves particularly valuable in situations where genuine point data is scarce or inadequate for practical applications.

Elements of data simulation

The following section describes three essential components of the simulation: the agents, the spatial factors, and the temporal aspects.

The agents (`walkers`)

The following properties define the agents:

Movement - Agents or walkers possess the capacity to navigate in diverse directions and are equipped to identify obstacles or limitations along their trajectories. These movements are primarily governed by an inherent transition matrix (TM), which establishes two primary operational states: the exploratory state (where a walker is engaged in environmental exploration) and the performative state (where a walker is executing an action). The probabilistic characteristics of this TM introduce diversity in behavioral patterns among the walkers. To instigate a switch from one state to the other, a categorical distribution is assigned to a latent state variable $z_t$, such that each step (in time) may result in either state, independent of the previous state: \[z_t \sim \mathrm{Categorical}(\Psi_{1}, \Psi_{2})\] such that $\Psi_{i} = \Pr(z_t = i)$, where $\Psi_{i}$ is the fixed probability of being in state $i$ at time $t$, and $\sum_{i=1}^{2}\Psi_{i} = 1$.
Spatial perception [`s_threshold`] - The perception range of a walker at a specified location is determined by the parameter `s_threshold`. As the walker changes its position, this parameter undergoes an update. A common technique to set this parameter is to plot the data and then select an estimate that aligns with prior assumptions about the parameter. For many use cases, this strategy is quite effective. For `psim_artif`, users need to specify a value. For `psim_real`, however, the best-suited `s_threshold` value can be derived from the available sample dataset.
Steps [`step_length`] - The furthest distance a walker travels from one location point to another is the `step_length`, which essentially characterizes the walker’s speed across an area. It’s vital to set the `step_length` judiciously, especially when the walker’s movements are confined to tight pathways like a route network. Here, the chosen value should be less than the pathway’s breadth.
Proportional ratios [`p_ratio`] - This refers to the density of events produced by the walkers in a given space. Specifically, it represents the fraction of total events stemming from a select group of the most active starting points. Take, for instance, a 20:80 ratio: this suggests that 20% of starting points (or walkers) are responsible for generating 80% of all point events. This implies that starting points possess varying intensity values, which can be leveraged to predict the eventual spatial distribution of these events, termed the spatial model.
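The state-switching rule described under Movement can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's internal code: the probabilities in `psi` and the number of steps are hypothetical values chosen for the example.

```r
# Illustrative sketch only; not stppSim's internal code.
set.seed(42)

# Assumed (hypothetical) state probabilities; they must sum to 1.
psi <- c(exploratory = 0.7, performative = 0.3)
stopifnot(isTRUE(all.equal(sum(psi), 1)))

# Draw the latent state for 1,000 time steps. Each draw is independent
# of the previous state, i.e. a draw from a categorical distribution.
n_steps <- 1000
z <- sample(names(psi), size = n_steps, replace = TRUE, prob = psi)

# The empirical state frequencies should be close to psi.
state_freq <- prop.table(table(factor(z, levels = names(psi))))
print(round(state_freq, 2))
```

Because each draw ignores the previous state, the long-run share of steps spent in each state simply approaches the corresponding probability.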

Spatial factors (landscape)

The following are the key properties of a landscape:

Spatial bandwidth [`s_band`] - The spatial bandwidth is used to identify event re-occurrences that take place between two specific spatial thresholds. For instance, setting a spatial bandwidth of 200m to 400m means the user aims to pinpoint repeated events happening within this distance range. When paired with the temporal bandwidth (discussed further below), this defines a comprehensive spatiotemporal bandwidth. Please note: this applies solely to point pattern simulations created from scratch using the `psim_artif` function. For simulations grounded in actual sample datasets, spatial bandwidths are automatically identified.
Origins [`coords`] - Walkers originate from specific starting points, referred to as origins. These origins can be randomly scattered throughout an area or may follow particular spatial patterns. Each origin is characterized by its xy coordinates. For instance, in the context of criminology, an offender might be represented as a walker, with their home serving as the origin.

There are two primary patterns in which origins can be concentrated: nucleated and dispersed, as highlighted by Hornby and Jones (1991). In a nucleated concentration, all origins cluster around a single central point. On the other hand, a dispersed concentration features multiple focal points, with origins possibly spread randomly throughout the area (refer to Figure 1 for illustration).

Figure 1: Type of origin concentration
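The difference between the two concentration types can be illustrated with a small base R sketch. The helper `make_origins()`, the focal point coordinates, and the spread value are all hypothetical and serve only to show the idea; the package generates origins internally and may do so differently.

```r
# Illustrative sketch only; stppSim generates origins internally.
set.seed(1)

# Hypothetical helper: scatter `n` origins around one or more focal points.
make_origins <- function(n, foci, spread) {
  # Assign each origin to one of the focal points at random.
  idx <- sample(nrow(foci), size = n, replace = TRUE)
  data.frame(
    x = foci$x[idx] + rnorm(n, sd = spread),
    y = foci$y[idx] + rnorm(n, sd = spread)
  )
}

# Nucleated: all origins cluster around a single central point.
nucleated <- make_origins(50, data.frame(x = 500, y = 500), spread = 60)

# Dispersed: origins are spread around several focal points.
foci <- data.frame(x = c(200, 800, 500), y = c(200, 300, 800))
dispersed <- make_origins(50, foci, spread = 60)

# The dispersed origins cover a much wider range of x coordinates.
c(nucleated = sd(nucleated$x), dispersed = sd(dispersed$x))
```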

Boundary [`poly`] - A landscape has defined boundaries, either represented by a polygon shapefile (known as `poly`) or determined by the spatial range of the sample point data.
Restrictions [`restriction_feat`] - Features that act as barriers consist of two main components:

Regions outside of the defined boundary (`poly`), which have the maximum restriction value of 1. This means that walkers are prohibited from moving beyond this boundary.
Features inside the boundary that hinder movement. These can be specific types of land use or physical landforms, like fenced-off areas or hills.

To produce a restriction map, one typically follows a two-step process. For instance, when using a boundary shapefile of the Camden area in London (UK), a restriction map can be constructed in the following manner:

Step 1: Generate boundary restriction

#load the package (provides space_restriction)
library(stppSim)
#load shapefile data
load(file = system.file("extdata", "camden.rda", package="stppSim"))
#extract boundary shapefile
boundary = camden$boundary # get boundary
#compute the restriction map
restrct_map <- space_restriction(shp = boundary, res = 20, binary = TRUE)
#plot the restriction map
plot(restrct_map)

Step 2: Set the `restrct_map` above as the basemap, and then stack the land use features to define the restrictions within the area:

# get landuse data
landuse = camden$landuse
#compute the restriction map
full_restrct_map <- space_restriction(shp = landuse,
      baseMap = restrct_map, res = 20, field = "restrVal", background = 1)
#plot the restriction map
plot(full_restrct_map)

Figure 2: Restriction map

Figure 2 provides a graphical representation of both the boundary extent and the restrictions posed by the within-features. These within-features are categorized into three separate classes, each having a unique restriction value as enumerated below:

Leisure: 0.5
Sports: 0.7
Green: 0.9

These values indicate the relative restriction each land use type imposes on movement.

Within the simulation function, the boundary and the within-features are supplied through the `poly` and `restriction_feat` parameters, respectively. Both are provided in the .shp (shapefile) format.

Focal points [`n_foci`] - Locations, or origins, that hold greater significance often present more opportunities for event occurrences. This is specifically indicated when utilizing `psim_artif`. Users generally determine the number of focal points they wish to simulate. In terms of urban landscape structure, a focal point can equate to a city/town centre.

Additionally, if there’s a principal focal point within a city, it can be denoted using the `mfocal` parameter. By default, the value for `mfocal` is set to `NULL`.

There’s also a foci separation parameter that lets users define how close or far apart these focal points are from each other. This parameter accepts values ranging from 1 to 100. A value of 1 signifies the closest proximity, whereas 100 indicates the farthest distance between focal points.

Temporal dimension

The following parameters define the temporal dimension:

Temporal bandwidth [`t_band`] - The temporal bandwidth is used to identify event re-occurrences that take place between two specific temporal thresholds. For instance, setting a temporal bandwidth of 2 days to 4 days means the user aims to pinpoint repeated events happening within this time range. When paired with the spatial bandwidth (discussed above), this defines a comprehensive spatiotemporal bandwidth. As with the spatial bandwidth, this applies solely to point pattern simulations created from scratch using the `psim_artif` function. For simulations grounded in actual sample datasets, temporal bandwidths are automatically identified.
Long-term trend [`trend`] - This parameter establishes the overarching trend of the time series that is to be simulated. The trend can be categorized as `stable`, `rising`, or `falling`.
Stable: Indicates that the time series remains relatively constant over time, with no significant upward or downward trend.
Rising: Suggests an upward trend in the time series. When this is selected, the supplementary `slope` argument can be employed to further define the incline of the trend as either `gentle` (a moderate increase) or `steep` (a rapid increase).
Falling: Denotes a downward trend in the time series. Similar to the rising trend, when this is chosen, the `slope` argument can be used to distinguish between a `gentle` decline and a `steep` drop.

This parameter is pertinent only when simulating a time series from scratch, without any pre-existing data.

Seasonal peak [`fPeak`] - This parameter sets the initial temporal peak of a sinusoidal pattern in a time series, thereby dictating the medium-term undulations throughout the series’ duration. For instance, a first peak set at 90 days denotes a seasonal cycle spanning 180 days in the time series. This approach is primarily employed when the simulation’s objective isn’t to produce spatiotemporal interactions but to capture more general cyclic patterns within the data.
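The relationship between the first peak and the cycle length can be sketched with a simple sinusoid. The functional form below is an assumption made for illustration (a curve that starts at a trough on day 0); it is not taken from the package's internal temporal model.

```r
# Illustrative sketch only; the functional form is an assumption,
# not stppSim's internal temporal model.
first_peak <- 90                 # day of the first seasonal peak
cycle_length <- 2 * first_peak   # a 180-day seasonal cycle
days <- 0:365

# A sinusoid that starts at a trough on day 0 and first peaks on day 90.
seasonal <- -cos(2 * pi * days / cycle_length)

# Days on which the pattern reaches its maximum.
peak_days <- days[seasonal > 1 - 1e-9]
print(peak_days)
```

Under these assumptions the curve peaks on day 90 and again 180 days later, which is the 180-day cycle described above.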

Figure 3: Global trends and patterns

Figure 3 depicts anticipated seasonal patterns determined by various `fPeak` values. Beginning at 90 days, each subsequent pattern sees the `fPeak` value augmented by one month. As the `fPeak` date is pushed forward, the number of full seasonal cycles reduces.
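The reduction in full cycles can be checked with simple arithmetic. The sketch below assumes a one-year (365-day) series, a 30-day month, and a cycle length of twice the first-peak day; the four values shown are illustrative and are not read from Figure 3.

```r
# Illustrative arithmetic under the assumptions stated above.
first_peak <- 90 + 30 * 0:3          # 90, 120, 150, 180 days
cycle_length <- 2 * first_peak       # 180, 240, 300, 360 days
full_cycles <- floor(365 / cycle_length)

cycles <- data.frame(first_peak, cycle_length, full_cycles)
print(cycles)
```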

The integration of the long-term trend with the seasonal peak shapes the temporal model for the simulation. Before launching the actual simulation, it is advisable to preview this model to ensure accuracy and alignment with objectives.

Time bin - The time interval after which all walkers are reset. Typically 1 day.

Installation of `stppSim`

From the R console, type:

#To install from `CRAN`
install.packages("stppSim")
#To install the `developmental version`, type:
remotes::install_github("MAnalytics/stppSim")
#Note: `remotes` is an extra package that needs to be installed before running this code.

Now, to load the package,

library(stppSim)

Notice:

`interactive` argument

Both the `psim_artif` and `psim_real` functions include the `interactive` argument, which is set to `FALSE` by default. When the argument is set to `TRUE`, the console displays queries during the function’s execution, prompting the user to decide if they wish to view the spatial and temporal models of the simulation.

The spatial model displays the origins’ locations and their strength distribution across the simulated space. This strength distribution provides an insight into how the eventual point (event) distribution in the simulation is likely to look.

On the other hand, the temporal model offers a visual representation of the expected trend and seasonal pattern, presented in a smoothed manner.

Thus, by using the `interactive` option, users can review both spatial and temporal patterns, ensuring that they align with their expectations and objectives before moving forward with the complete simulation.

Simulating point patterns from scratch

Three essential arguments are necessary for the simulation:

`n_events` - This refers to the number of points to simulate. Instead of providing just a single value, it’s recommended to input a vector of values. For instance, `n_events = c(200, 500, 1000, 2000)`. The output is presented as a list, with each value corresponding to a separate data frame. Notably, the length of `n_events` has minimal to no impact on processing duration.
`start_date` - This designates the commencement date of the time series.
`poly` - This represents the polygon shapefile that demarcates the boundary of the study area. The simulated point patterns are restricted to occur within this designated boundary.

By providing these arguments, users can customize the scope and specifics of their simulation to meet their research objectives.

Example

To generate a spatiotemporal point pattern (stpp) using a boundary shapefile for the Camden Borough of London, which is embedded in the package, use the following code:

#load the data
load(file = system.file("extdata", "camden.rda",
                        package="stppSim"))
boundary <- camden$boundary # get boundary data
#specifying data sizes
pt_sizes = c(200, 1000, 2000)
#simulate data
artif_stpp <- psim_artif(n_events=pt_sizes, start_date = "2021-01-01",
  poly=boundary, n_origin=50, restriction_feat = NULL,
  field = NA,
  n_foci=5, foci_separation = 10, mfocal = NULL,
  conc_type = "dispersed",
  p_ratio = 20, s_threshold = 50, step_length = 20,
  trend = "stable", fpeak=NULL,
  slope = NULL,show.plot=FALSE, show.data=FALSE)

The processing time on a PC with an Intel Core i7-7500 CPU @ 2.70GHz and 16.0GB RAM is 12.5 minutes. The processing time increases to 45.2 minutes if landscape restriction is added. Specifically, this increase occurs when the argument `restriction_feat = camden$landuse` is used, accompanied by `field = "val"`.

To retrieve the result for any value of `n_events`, index the output object by the position of that value. For example, to retrieve the result based on `n_events = 1000`, type:

stpp_1000 <- artif_stpp[[2]]
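Rather than hard-coding the index, the position can be looked up from the vector of sizes. The sketch below uses a stand-in list in place of the real `psim_artif` output, and it assumes, as the example above implies, that the list elements follow the order of `n_events`.

```r
# Stand-in for the list returned by psim_artif(): one element per value
# of n_events (the element contents here are placeholders).
pt_sizes <- c(200, 1000, 2000)
artif_stpp <- lapply(pt_sizes, function(n) data.frame(id = seq_len(n)))

# Look up the position of the requested size instead of hard-coding it.
idx <- match(1000, pt_sizes)
stpp_1000 <- artif_stpp[[idx]]
nrow(stpp_1000)
```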

Spatial Patterns

The configuration and clustering of events in the spatial domain can be fine-tuned by adjusting parameters that determine spatial components (such as `restriction_feat`, `n_origin`, `mfocal`, `foci_separation`, `n_foci`, `s_band`, and so forth) as well as those that guide walker behaviors (for example, `step_length`, `s_threshold`, and `p_ratio`). To introduce a focal point in the simulation (refer to the `mfocal` argument in the package manual), employ the `make_grids` function. This function produces an interactive map that displays and permits the extraction of the xy coordinates of any location on the map. Enhanced with an integrated OpenStreetMap, the interactive platform aids users in more conveniently pinpointing specific locations.

Figure 4 showcases the spatial point patterns (spp) for `n_events = 1000` under diverse parameter settings. Note: The spatial configuration may differ with each code execution due to inherent random aspects within the function.

Figure 4a displays the outcome when relying solely on default arguments, as demonstrated in the previous code.

Figure 4b presents the pattern resulting from the integration of additional parameters: `restriction_feat = camden$landuse` and `mfocal = c(530000, 182250)`. Here, the first parameter restricts the number of events created within the land use (restriction) features, while the second emphasizes a central spatial concentration of origins, highlighted by a red dot on the map.

Figure 4c depicts the configuration when the parameters `restriction_feat` and `mfocal` are retained (as in 4b), but with an added `foci_separation = 50`. This ensures a moderate spatial distance between individual origins.

Lastly, Figure 4d illustrates the spatial pattern when, besides maintaining the `mfocal` setting (similar to the above figures), `s_threshold` and `step_length` are set at 250 and 50 respectively. This configuration aims to promote a broader distribution of points relative to their origins.

Figure 4: Simulated spatial point patterns of Camden

In the above figures, notice that points that fall on exactly the same location are aggregated and symbolized to reflect the total point count.

Temporal Patterns

Given that the parameters influencing the overall temporal trends (`trend`, `fPeak`, and `slope`) remain unchanged across each simulation, it’s logical to anticipate consistent or very similar temporal patterns across them. Accordingly, Figure 5a-d depict the temporal patterns corresponding to the spatial representations shown in Figure 4a-d.

Figure 5: Simulated global trends and patterns (gtp)

When we modify the `fPeak` parameter to 30 days (equivalent to one month following the start date of the series) and run the simulation with default parameters, the resulting global temporal pattern can be visualized in Figure 6. This adjustment will likely introduce a distinct seasonal cycle in the simulated temporal pattern, emphasizing the influence of the `fPeak` parameter on the temporal distribution of events.

Figure 6: Gtp with an earlier first seasonal peak

Simulating spatiotemporal interactions

The simulation of point patterns with distinct spatiotemporal interactions can be achieved using two parameters: the spatial bandwidth (`s_band`) and the temporal bandwidth (`t_band`). When we speak of spatiotemporal interaction, we’re referring to the likelihood that events within these specified bandwidths occur more frequently than what would be expected in a completely random scenario. In simulated datasets, it’s feasible to observe interactions across several spatiotemporal bandwidths. For example,

#load the data
load(file = system.file("extdata", "camden.rda",
                        package="stppSim"))
boundary <- camden$boundary # get boundary data
#specifying data sizes
pt_sizes = c(1500)
#simulate data
artif_stpp <- psim_artif(n_events=pt_sizes, start_date = NULL,
  poly=boundary, n_origin=50, restriction_feat = NULL,
  field = NA,
  n_foci=5, foci_separation = 10, mfocal = NULL,
  conc_type = "dispersed",
  p_ratio = 20, s_threshold = 50, step_length = 20,
  trend = "stable", fpeak=NULL,
  shortTerm = "acyclical",
  s_band = c(0, 200),
  t_band = c(1,2),
  slope = NULL,show.plot=FALSE, show.data=FALSE)

In the above code, ….. s_band = c(0, 200), t_band = c(1,2),
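As a rough illustration of what a spatiotemporal bandwidth measures, the base R sketch below counts the pairs of events whose spatial and temporal separations both fall inside the given bands. The event data are randomly generated placeholders, treating both ends of each band as inclusive is an assumption, and the code does not reproduce how `psim_artif` uses these arguments internally.

```r
# Illustrative sketch only; this is not how psim_artif() works internally.
set.seed(10)

# Hypothetical events: coordinates in metres, time in days.
n <- 200
events <- data.frame(x = runif(n, 0, 2000),
                     y = runif(n, 0, 2000),
                     t = sample(1:60, n, replace = TRUE))

s_band <- c(0, 200)  # spatial bandwidth (metres)
t_band <- c(1, 2)    # temporal bandwidth (days)

# Pairwise spatial and temporal separations between all events.
s_dist <- as.matrix(dist(events[, c("x", "y")]))
t_dist <- as.matrix(dist(events$t))

# Unique pairs (upper triangle) falling inside both bandwidths.
# Assumption: both ends of each bandwidth are treated as inclusive.
unique_pairs <- upper.tri(s_dist)
close_pairs <- unique_pairs &
  s_dist >= s_band[1] & s_dist <= s_band[2] &
  t_dist >= t_band[1] & t_dist <= t_band[2]
n_close <- sum(close_pairs)
print(n_close)
```

A data set with spatiotemporal interaction would show more such pairs than a randomly generated one of the same size.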

Simulating `stpp` from sample real dataset

The pivotal parameters in this context are `n_events`, which dictates the number of points to simulate, and `ppt`, representing the sample real data. As previously mentioned, utilizing a vector of values for `n_events` is advisable. The sample dataset should distinctly feature `x`, `y`, and `t` fields, with further specifics provided in the package’s manual.

Example

To extract a random sample from the theft crimes data in Camden and then utilize this sample to synthesize a full dataset, you can follow these general steps:

#load dplyr (provides %>% and filter)
library(dplyr)
#load Camden crimes
data(camden_crimes)
#extract 'theft' crime
theft <- camden_crimes %>%
  filter(type == "Theft")
#print the total no. of records
nrow(theft)

#specify the proportion of total records to extract
sample_size <- 0.3 #i.e., 30%
set.seed(1000)
dat_sample <- theft[sample(1:nrow(theft),
  round((sample_size * nrow(theft)), digits=0),
  replace=FALSE),1:3]
#print the number of records in the sample data
nrow(dat_sample)

Visualizing the spatial distribution of the data can provide insights that inform the choice of parameters for subsequent analyses.

Here’s how you might plot the sample data based on their x and y locations using base R’s `plot` function:

plot(dat_sample$x, dat_sample$y,
     pch = 16,
     cex = 1,
     main = "Sample data at unique locations",
     xlab = "x",
     ylab = "y")

Figure 7a displays the point patterns derived from the sample datasets. Often, crime data sets get aggregated to specific proximate reference points, like centroids of grid squares. To provide a clearer view of the spatial distribution and clustering inherent in the crime data, it’s essential to group the points based on their unique locations. Accordingly, the subsequent code consolidates points by their distinct locations, producing the point patterns depicted in Figure 7b:

agg_sample <- dat_sample %>%
  mutate(y = round(y, digits = 0))%>%
  mutate(x = round(x, digits = 0))%>%
  group_by(x, y) %>%
  summarise(n=n()) %>%
  mutate(size = as.numeric(if_else((n >= 1 & n <= 2), paste("1"),
                        if_else((n>=3 & n <=5), paste("2"), paste("2.5")))))
dev.new()
itvl <- c(1, 2, 2.5)
plot(agg_sample$x, agg_sample$y,
     pch = 16,
     cex=findInterval(agg_sample$size, itvl),
     main = "Sample data aggregated at unique location",
     xlab = "x",
     ylab = "y")
legend("topright", legend=c("1-2","3-5", ">5"), pt.cex=itvl, pch=16)
#hist(agg_sample$size)

Figure 7: Sample real data (a) unaggregated and (b) aggregated by locations

Figure 7b reveals that the southern region of Camden has the densest occurrence of theft crimes. The spatial layout of the sample data points can provide insights for users when determining the most fitting spatial parameters. For instance, to attain a more compact distribution of points, one might opt to assign smaller values to `n_origin` or to both `s_threshold` and `step_length`.

Generally, when selecting suitable spatial parameters for a new study area, it’s crucial to comprehend the relative scale of the new region in comparison to Camden (we’ll delve deeper into this comparison in the sections that follow).

Proceeding to simulate the point data:

#As the actual size of any real (full) dataset
#would not be known, therefore we will assume
#`n_events` to be `2000`. In practice, a user can
#infer `n_events` from several other sources, such
#as other available full data sets, or population data,
#etc.
#Simulate
sim_fullData <- psim_real(n_events=2000, ppt=dat_sample,
  start_date = NULL, poly = NULL, s_threshold = NULL,
  step_length = 20, n_origin=50, restriction_feat=landuse,
  field="restrVal", p_ratio=20, crsys = "EPSG:27700")

Summarising the results:

summary(sim_fullData[[1]])

Spatiotemporal interactions

Within the primary simulation function, `psim_real`, the `st_learner` function is employed to detect spatial and temporal bandwidths where the closeness (in space and time) of point events in a sample dataset exceeds what would typically arise from mere chance (i.e., spatiotemporal interaction). If interaction bandwidths are detected, the main simulation function, `psim_real`, automatically incorporates them to generate point patterns that mirror the characteristics of the actual datasets.

#load sf (provides as_Spatial)
library(sf)
#get the restriction data
landuse <- as_Spatial(landuse)
simulated_stpp_ <- psim_real(
  n_events=2000,
  ppt=dat_sample,
  start_date = NULL,
  poly = NULL,
  netw = NULL,
  s_threshold = NULL,
  step_length = 20,
  n_origin=100,
  restriction_feat = landuse,
  field="restrVal",
  p_ratio=20,
  interactive = FALSE,
  s_range = 600,
  s_interaction = "medium",
  crsys = "EPSG:27700")

In the above code snippet, the `s_range` parameter is used to set the spatial range. The default temporal bandwidth is 30 days with a daily incremental range. If the `s_range` parameter is assigned a value of NULL, the function bypasses the detection of space-time interactions and concentrates solely on modeling the spatial and temporal patterns. To assess the spatiotemporal interaction within any dataset, the NearRepeat calculator (adapted as the `NRepeat` function in this package) may be employed.

#extract the output of a simulation
stpp <- simulated_stpp_[[1]]
stpp <- stpp %>%
  dplyr::mutate(date = substr(datetime, 1, 10))%>%
  dplyr::mutate(date = as.Date(date))
#define spatial and temporal thresholds
s_range <- 600
s_thres <- seq(0, s_range, len=4)
t_thres <- 1:31
#detect space-time interactions
myoutput2 <- NRepeat(x = stpp$x, y = stpp$y, time = stpp$date,
                        sds = s_thres,
                        tds = t_thres,
                        s_include.lowest = FALSE, s_right = FALSE,
                        t_include.lowest = FALSE, t_right = FALSE)
#extract the knox ratio
knox_ratio <- round(myoutput2$knox_ratio, digits = 2)
#extract the corresponding significance values
pvalues <- myoutput2$pvalues
#append asterisks to significant results
for(i in 1:nrow(pvalues)){ #i<-1
    id <- which(pvalues[i,] <= 0.05)
    knox_ratio[i,id] <- paste0(knox_ratio[i,id], "*")
}
#output the results
knox_ratio

Comparing simulated data and (`full`) real data

Both visual and statistical methodologies offer valuable insights when comparing the spatial and temporal patterns of simulated data to those of the full real data (encompassing 100% of the dataset).

Utilizing the visual approach allows for a direct visual comparison of patterns, trends, clusters, and anomalies between the datasets. This is typically done using maps, graphs, or charts that depict the spatial and temporal distributions.

On the other hand, the statistical approach provides a more quantified measure of the similarity or differences between the datasets. Various statistical tests, measures, or models can be applied to assess the degree of similarity, correlation, or divergence between the spatial and temporal patterns of the simulated and real data.

Together, these methods offer a comprehensive assessment, combining the intuitive appeal of visual representation with the precision and rigor of statistical analysis.

Visual approach

Figure 8a and 8b visually represent the spatial point distributions of the simulated and full real datasets, respectively. From these figures, one can assess the spatial fidelity of the simulated data by visually comparing its distribution, clusters, and other spatial patterns against the full real dataset.

Meanwhile, Figure 9a and 9b present the temporal patterns of the simulated and real datasets over time. These plots can be used to evaluate how well the simulated data captures temporal trends, seasonality, peaks, and other time-related patterns when compared to the full real data.

By examining both sets of figures in tandem, one can get a holistic view of the accuracy and reliability of the simulated data in mimicking both the spatial and temporal characteristics of the real dataset.

Figure 8: Spatial point distributions of (a) simulated and (b) full real data set

In Figure 8, two key observations stand out: the total number of points and the clustering of points.

Firstly, the decision to set `n_events = 2000` was deliberate. This mirrors real-world scenarios where the exact total number of events or points isn’t always known in advance or might be subject to some variability.

Secondly, a notable difference in point clustering is observed between the two figures. In the real data (Figure 8b), there’s a pronounced concentration of points at specific, unique locations. This is indicative of common crime recording practices where incidents are assigned to the nearest predefined reference points, such as street corners, landmarks, or property centroids. Such practices are aimed at preserving anonymity or simplifying the data representation. In contrast, our simulated data in Figure 8a doesn’t operate under this premise. Instead, it allows for a more dispersed distribution without forcing the points to aggregate around predefined reference locations.

Thus, while the simulation strives to capture the broader spatial characteristics of crime patterns, it does not replicate the specific recording practices often seen in real crime data.

Figure 9: Global temporal pattern of (a) simulated and (b) full real data set

From Figure 9, it’s evident that the temporal dynamics of both the simulated and real datasets align closely. Both exhibit congruent seasonal fluctuations, as highlighted by the red lines, and a consistent upward trend over time. This resemblance underscores the capability of the simulation in accurately mirroring the time-based patterns observed in the actual data.

Statistical approach

In an area as compact as Camden, we can statistically compare the simulated and actual data sets in terms of both space and time using Pearson's correlation coefficient. For spatial analysis, the data sets were grouped into a consistent square grid system. By aligning counts based on grid IDs, we derived a correlation metric. This evaluation employed three varying grid sizes (150 sq.mts, 250 sq.mts, and 400 sq.mts) to observe how correlation fluctuates with spatial granularity. Temporally, we examined three scales: daily, weekly, and monthly. Table 1 shows the correlation values, highlighting the degree of resemblance between the two sets of data.

Table 1. Correlation between simulated and real data sets
Dimension	Scale	Corr. Coeff.
Spatial	150 sq.mts	0.50
Spatial	250 sq.mts	0.62
Spatial	400 sq.mts	0.78
Temporal	Daily	0.34
Temporal	Weekly	0.78
Temporal	Monthly	0.93

The simulated and actual data sets show significant parallels in both spatial and temporal domains. However, an exception arises at the daily temporal scale, where the similarity diminishes. Such an outcome is anticipated due to the inherent randomness at this granular level. Moreover, the daily timestamp of the real data set was generated at random, as detailed in the package user manual. As data aggregation intensifies, whether spatially or temporally, the similarity between the two sets strengthens. This is evidenced by correlation coefficients of 0.78 for the broadest spatial scale and 0.93 for the most extended temporal scale.
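The grid-based spatial comparison can be sketched in base R as follows. The points are made up for illustration and stand in for the simulated and real Camden data; the helpers `make_points()` and `grid_counts()`, the 0-1000 m extent, and the three cell sizes are hypothetical choices for this toy example, not the settings behind Table 1.

```r
# Illustrative sketch only, using made-up points in place of the
# simulated and real Camden data sets.
set.seed(123)

# Hypothetical helper: draw n points clustered around two hotspots.
make_points <- function(n) {
  centre <- sample(c(300, 700), n, replace = TRUE)
  data.frame(x = centre + rnorm(n, sd = 120),
             y = centre + rnorm(n, sd = 120))
}
sim_pts  <- make_points(2000)
real_pts <- make_points(2000)

# Count points per square grid cell over a fixed 0-1000 m extent.
grid_counts <- function(pts, cell_size, extent = c(0, 1000)) {
  breaks <- seq(extent[1], extent[2], by = cell_size)
  inside <- pts$x >= extent[1] & pts$x < extent[2] &
            pts$y >= extent[1] & pts$y < extent[2]
  gx <- cut(pts$x[inside], breaks, right = FALSE)
  gy <- cut(pts$y[inside], breaks, right = FALSE)
  as.vector(table(gx, gy))  # cells in the same order for every data set
}

# Pearson correlation of the cell counts at three grid sizes.
cell_sizes <- c(100, 250, 500)
r_values <- sapply(cell_sizes, function(cs) {
  cor(grid_counts(sim_pts, cs), grid_counts(real_pts, cs),
      method = "pearson")
})
print(data.frame(cell_size = cell_sizes, r = round(r_values, 2)))
```

The same pattern applies to the temporal comparison: aggregate both data sets to daily, weekly, or monthly counts and correlate the aligned counts.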

Setting simulation parameters for different study areas

In this vignette, while most parameters should yield comparable outcomes for any study location, three specific parameters that govern the spatial distribution of simulated points stand out: `n_origin`, `s_threshold`, and `step_length`. To ensure a balanced distribution of point patterns spatially, users are encouraged to designate fitting values for these parameters. With a change in the size of the study zone, it’s anticipated that these three parameters would “proportionally” scale, increasing with a larger area and decreasing for a smaller one. Note: For optimal spatial control, we suggest users scale either `n_origin` or both `s_threshold` and `step_length`, rather than all three.

To address the intricacies tied to these parameters, we introduce the `compare_areas()` function. This aids users in gauging the relative sizes of two distinct areas. Commonly, one of these areas - for instance, Camden in this context - would have pre-established simulation parameters. By integrating a secondary polygon shapefile into the function, it produces a factor or value that denotes the size difference between the two zones. This factor serves as a multiplier for the parameters mentioned earlier when transitioning to a new area. For instance, if Camden is 3 times smaller than the new chosen area, users should multiply either `n_origin` or both `s_threshold` and `step_length` by 3 for accurate simulation. Conversely, if Camden is larger, users should divide the parameters by the factor. From a computational standpoint, adjusting both `s_threshold` and `step_length` is more efficient.

To illustrate the efficacy of the `compare_areas()` function, let’s juxtapose the Birmingham region of the UK with the Camden area as an example:

#load 'area1' object - boundary of Camden, UK
load(file = system.file("extdata", "camden.rda",
                        package="stppSim"))
camden_boundary = camden$boundary
#load 'area2' - boundary of Birmingham, UK
load(file = system.file("extdata", "birmingham_boundary.rda",
                        package="stppSim"))
#run the comparison
output <- compare_areas(area1 = camden_boundary,
              area2 = birmingham_boundary, display_output = FALSE)

To display the comparison and the resultant factor, you can use the following method:

The above code returns the string #-----'area2' is 12.3 times bigger than 'area1'-----#.

For the Birmingham simulation, either multiply the `n_origin` value by 12.3 or apply the same multiplication factor of 12.3 to both `s_threshold` and `step_length`. After adjusting these values, input them into the simulation function and execute it.
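The two scaling options can be worked through numerically. The sketch below takes the Camden values used in the earlier examples of this guide (`n_origin = 50`, `s_threshold = 50`, `step_length = 20`) and the reported factor of 12.3; rounding `n_origin` to a whole number is an assumption made here, since a count of origins is presumably an integer.

```r
# Illustrative arithmetic; the Camden values are those used earlier in
# this guide, and the factor is the one reported by compare_areas().
area_factor <- 12.3
camden_pars <- list(n_origin = 50, s_threshold = 50, step_length = 20)

# Option 1: scale the number of origins only (rounded; an assumption).
option_1 <- modifyList(camden_pars, list(
  n_origin = round(camden_pars$n_origin * area_factor)
))

# Option 2: scale both s_threshold and step_length, keeping n_origin.
option_2 <- modifyList(camden_pars, list(
  s_threshold = camden_pars$s_threshold * area_factor,
  step_length = camden_pars$step_length * area_factor
))

str(option_1)
str(option_2)
```

Only one of the two options should be applied for a given simulation, in line with the note above about not scaling all three parameters.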

Discussion

This guide has showcased the capabilities of the primary simulation functions in the stppSim package: (i) `psim_artif` for creating stpp from the ground up, and (ii) `psim_real` for producing stpp using a sampled real data set. The document illustrated how to adjust the parameters to shape the spatial and temporal attributes of the data. Nevertheless, it’s essential to tailor these parameters to fit the specific subject matter being explored. The package offers vast potential across various domains, including analyzing human crime patterns and behaviors, investigating the foraging habits of wildlife and their achievements, and examining disease vectors and infections. We’re committed to refining the package for even broader uses.

We appreciate the feedback from our user community. Please notify us of any issues or bugs so we can address them promptly. Contributions to this package are welcomed and will be duly credited.
