Once a (CO)VLMC has been estimated from a sequence, it can be used toproduce new sequences. Any statistical estimator defined on the originalsequence can be applied to the simulated sequences leading a form ofsemi-parametricbootstrap. This provides an interesting alternative to otherbootstrapping solutions for correlated data, such as the blockbootstrap. This can also be use to consider what-if scenarios whendifferent values of the covariates are used study their effect beyondanalysis coefficients in context specific models.
Bühlmann and Wyner show intheir paper thatthis approach is consistent for statistics that depend smoothly fromarbitrary means of fixed mappings of contexts to numerical values. Insimple terms, one can first choose a mapping of arbitrary tuples ofvalues in the state space to numerical vectors and then build aestimator by averaging those vectors over the observed sequence. Theresulting vector mean can then be transformed smoothly to lead to anestimator. The VLMC bootstrap is then consistent for this estimatorunder relatively mild hypothesis on the underlying VLMC.
A typical example of such estimator is the probability of observing afixed pattern in the sequence. More generally, the class of estimatorsincludes smooth transformations of the empirical distribution ofpatterns up to a fixed length.
In practice, it is recommended to sample a longer sequence than theone needed and to keep only the last\(n\) values, in order to minimise theinfluence of the starting values. The first values are considered as aburn in orwarm up phase.
As explained invignette("likelihood"), a VLMC does notspecify distributions for the initial values of time series as nocontext can be computed from them. Insimulate.vlmc(), weuse the notion ofextended context described in the vignette toextend any VLMC into a full model, that specifies both the conditionaldistributions given its contexts and also the distributions of theinitial values.
Let us consider a simple example with an independent sample:
The optimal VLMC according to the BIC has no memory:
Simulating a sequence using the model is done viasimulate.vlmc() which overrides the standardstats::simulate(). For instance:
Even if this is useless here because of the independence, we drop thefirst 100 samples. In generalsimulate.vlmc() is used asstats::simulate() and supports the standard parameters:
nsim for the number of simulated values, here thelength of the new time seriesseed to specify the random seed used during thesimulation (the initial state of the random number generator is restoredto its previous after simulating the values)As forstats::simulate(), the state of the random numbergenerator is never modified by a call tosimulate.vlmc().
In addition, one can specify the initial values of the sequence viatheinit parameter, for instance:
is guaranteed to start with\(0,0\):
This provides an alternative to dropping the initial values of thesimulated time series by setting those values to the one observed in thedata set. Notice that this practice has no theoreticaljustification.
To implement aburn in period, just use theburnin parameter. For instance:
will simulated 30 values and drop the first 20.
Let us consider the French CAC index provided inEuStockMarkets:
We turn it into a discrete time series with three values:
CAC_rel_evol<-diff(CAC_raw)/ CAC_raw[-length(CAC_raw)]CAC_dts<-factor(ifelse(CAC_rel_evol>=0.005,"Up",ifelse(CAC_rel_evol<=-0.005,"Down","Stay") ),levels =c("Down","Stay","Up"))Then we adjust a VLMC to the time series using the AIC criterion:
We use here the AIC to favour predictive performances and as aconsequence the obtained model is quite complex with 1 contexts.
The original discrete time series does not exhibit long subsequencesof constant values as shown in the following graphicalrepresentation.
CAC_rle<-rle(as.integer(CAC_dts))CAC_rle_df<-data.frame(value = CAC_rle$values,length = CAC_rle$lengths)ggplot(CAC_rle_df,aes(x = length))+geom_bar()+labs(title ="Distribution of the lengths of constant subsequences",x ="Length",y ="Count" )To study this aspect of the time series, a natural statistics is theprobability of observing a constant subsequence of a length between 5and 10 among all subsequences of length 10 that can be generated by themodel. Notice that the patterns used in the statistics must be of fixedlengths to be covered by the consistency theorems mentioned above. Weimplement the statistics as follows:
long_sequence<-function(dts) { dts_int<-as.integer(dts)## RLE cannot be used directly as we need to account for overlapping## patterns dts_freq<-frollapply(dts_int,10, \(x)max(rle(x)$length)>=5)mean(dts_freq,na.rm =TRUE)}We generate 100 bootstrap samples, using aburn in timeproportional to the depth of the model, in order to allow for a propermixing to take place. Notice that we could use alsoburnin="auto" to use the another heuristics which uses thenumber of contexts to compute the length of theburn in period.We use theseed parameter ofstats::simulate()to ensure reproducibility.
bootstrap_samples<-vector(100,mode ="list")burn_in_time<-50*depth(CAC_model)for (binseq_along(bootstrap_samples)) { bootstrap_samples[[b]]<-simulate(CAC_model,length(CAC_dts),burin = burn_in_time,seed = b )}Then we compute the statistics on the bootstrap samples.
The bootstrap distribution of this statistics is illustrated on thefollowing figure in which the red vertical line represents the value ofthe statistics on the original sequence.
ggplot(mapping =aes(x = bootstrap_ls))+geom_density()+geom_rug()+labs(title ="Bootstrap distribution of the probability of long constant subsequences",x ="Probability" )+geom_vline(xintercept =long_sequence(CAC_dts),color ="red")The case of COVLMC is more complex mainly because of the dependencetowards an external time series that is not modelled. In addition, notheoretical result justify a form of COVLMC bootstrap at the time ofwriting this document (2023).
We can nevertheless use sampling both to get an idea of the potentialvariability of the results from a qualitative point of view, and also tostudy some what-if scenarios based on a modification of thecovariates.
Let us study the CAC discrete time series using the three otherindexes fromEuStockMarkets as covariates. As previously,we select a model with the AIC criterion:
CAC_covariates<-as.data.frame(EuStockMarkets)[c("DAX","SMI","FTSE")][-1, ]CAC_covlmc<-tune_covlmc(CAC_dts, CAC_covariates,criterion ="AIC")CAC_comodel<-as_covlmc(CAC_covlmc)The obtained model is relatively complex as shown below:
draw(CAC_comodel,model ="full",with_state =TRUE)#> *#> +-- Stay#> | +-- Stay (0.0065 [ (Down) | (I) DAX_1 SMI_1 FTSE_1 DAX_2 SMI_2 FTSE_2#> | | Stay | 0.871 0.007903 -0.009399 0.01673 -0.007771 0.009261 -0.01684#> | | Up | 2.838 -0.005994 -0.01448 0.0299 0.003666 0.0169 -0.03126 ])#> | '-- Down, Up (0.002452 [ (Down) | (I) DAX_1 SMI_1 FTSE_1#> | Stay | -1.707 -0.001755 0.0004276 0.001412#> | Up | -1.397 -0.001357 0.0006376 0.0008178 ])#> '-- Down, Up (NA [ (Down) | (I)#> Stay | 0.347#> Up | 0.05716 ])The variability of the discrete time series for fixed values of thecovariates can be investigated using simulated sequences. The onlydifference betweensimulate.vlmc() andsimulate.covlmc() is the need for the covariates in thelatter. Aburn in time is also problematic here as we do nothave additional throw away values for the covariates. A possibility usedhere consists in starting the discrete time series as the observed onestarted, using as many values as the size of the largest context. Noticethat in the current implementation, simulating from a COVLMC isrelatively slow compared to simulation from a VLMC. Here we perform asingle simulation to illustrate the feature.
In some situations, one can modify the covariates to test theconsequences of the external influence on the discrete time series understudy. Let us consider for instance a week of electricalconsumption:
pc_week<- powerconsumption[powerconsumption$week==12, ]elec<- pc_week$active_powerggplot(pc_week,aes(x = date_time,y = active_power))+geom_line()+xlab("Date")+ylab("Activer power (kW)")We build a discrete version by considering two states: the backgroundbase active power below 0.4 kW and the active use of electric applianceabove this limit.
elec_dts<-cut(elec,breaks =c(0,1.2,8),labels =c("background","active"))elec_cov<-data.frame(day = (pc_week$hour>=7& pc_week$hour<=17))Then we adjust a COVLMC with a binary covariate day/night, using theAIC.
elec_covlmc_tune<-tune_covlmc(elec_dts, elec_cov,criterion ="AIC")best_elec_covlmc<-as_covlmc(elec_covlmc_tune)draw(best_elec_covlmc,model ="full",time_sep =" | ",p_value =FALSE)#> *#> +-- background#> | +-- background ([ (I) | day_1TRUE#> | | -3.383 | 0.8979 ])#> | '-- active ([ (I)#> | -1.526 ])#> '-- active#> +-- background ([ (I) | day_1TRUE#> | 20.57 | -19.36 ])#> '-- active#> +-- background ([ (I)#> | 1.609 ])#> '-- active ([ (I) | day_1TRUE#> 2.552 | 0.8602 ])Then we can simulate longer or shorter days by manipulating thecovariates. For instance a “day only” full week could be obtained asfollows
day_only<-simulate(best_elec_covlmc,length(elec_dts),seed =0,covariate =data.frame(day =rep(TRUE,length(elec_dts))))while a “night only” full week is given by
night_only<-simulate(best_elec_covlmc,length(elec_dts),seed =0,covariate =data.frame(day =rep(FALSE,length(elec_dts))))A typical use of this approach consists in simulating first acollection of sequences with the original covariates to get a sense ofthe variability of the statistics of interest. In a second phase, acomparable collection of sequences is generated using the manipulatedcovariates.