Movatterモバイル変換

Once a (CO)VLMC has been estimated from a sequence, it can be used toproduce new sequences. Any statistical estimator defined on the originalsequence can be applied to the simulated sequences leading a form ofsemi-parametricbootstrap. This provides an interesting alternative to otherbootstrapping solutions for correlated data, such as the blockbootstrap. This can also be use to consider what-if scenarios whendifferent values of the covariates are used study their effect beyondanalysis coefficients in context specific models.

Sampling from a VLMC

As explained invignette("likelihood"), a VLMC does notspecify distributions for the initial values of time series as nocontext can be computed from them. Insimulate.vlmc(), weuse the notion ofextended context described in the vignette toextend any VLMC into a full model, that specifies both the conditionaldistributions given its contexts and also the distributions of theinitial values.

Introduction

Let us consider a simple example with an independent sample:

dts<-sample(0:1,500,replace =TRUE)

The optimal VLMC according to the BIC has no memory:

best_vlmc<-tune_vlmc(dts)draw(as_vlmc(best_vlmc))#> * (0.464, 0.536)

Simulating a sequence using the model is done viasimulate.vlmc() which overrides the standardstats::simulate(). For instance:

dts_sample<-simulate(as_vlmc(best_vlmc),600)[-(1:100)]

Even if this is useless here because of the independence, we drop thefirst 100 samples. In generalsimulate.vlmc() is used asstats::simulate() and supports the standard parameters:

nsim for the number of simulated values, here thelength of the new time series
seed to specify the random seed used during thesimulation (the initial state of the random number generator is restoredto its previous after simulating the values)

As forstats::simulate(), the state of the random numbergenerator is never modified by a call tosimulate.vlmc().

In addition, one can specify the initial values of the sequence viatheinit parameter, for instance:

dts_sample_2<-simulate(as_vlmc(best_vlmc),10,init =c(0L,0L))

is guaranteed to start with\(0,0\):

dts_sample_2#>  [1] 0 0 1 1 1 0 0 0 1 1#> attr(,"class")#> [1] "dts"     "integer"

This provides an alternative to dropping the initial values of thesimulated time series by setting those values to the one observed in thedata set. Notice that this practice has no theoreticaljustification.

To implement aburn in period, just use theburnin parameter. For instance:

dts_sample_3<-simulate(as_vlmc(best_vlmc),10,burnin =20)

will simulated 30 values and drop the first 20.

CAC time series

Let us consider the French CAC index provided inEuStockMarkets:

CAC_raw<-as.data.frame(EuStockMarkets)$CAC

We turn it into a discrete time series with three values:

Stay if the value of the index on day t+1 is between 99.5% and100.5% of the value on day t
Up if the value increased by at least 0.5%
Down if the value decreased by at least 0.5%

CAC_rel_evol<-diff(CAC_raw)/ CAC_raw[-length(CAC_raw)]CAC_dts<-factor(ifelse(CAC_rel_evol>=0.005,"Up",ifelse(CAC_rel_evol<=-0.005,"Down","Stay")  ),levels =c("Down","Stay","Up"))

Then we adjust a VLMC to the time series using the AIC criterion:

CAC_vlmc<-tune_vlmc(CAC_dts,criterion ="AIC")CAC_model<-as_vlmc(CAC_vlmc)

We use here the AIC to favour predictive performances and as aconsequence the obtained model is quite complex with 1 contexts.

The original discrete time series does not exhibit long subsequencesof constant values as shown in the following graphicalrepresentation.

CAC_rle<-rle(as.integer(CAC_dts))CAC_rle_df<-data.frame(value = CAC_rle$values,length = CAC_rle$lengths)ggplot(CAC_rle_df,aes(x = length))+geom_bar()+labs(title ="Distribution of the lengths of constant subsequences",x ="Length",y ="Count"  )

To study this aspect of the time series, a natural statistics is theprobability of observing a constant subsequence of a length between 5and 10 among all subsequences of length 10 that can be generated by themodel. Notice that the patterns used in the statistics must be of fixedlengths to be covered by the consistency theorems mentioned above. Weimplement the statistics as follows:

long_sequence<-function(dts) {  dts_int<-as.integer(dts)## RLE cannot be used directly as we need to account for overlapping## patterns  dts_freq<-frollapply(dts_int,10, \(x)max(rle(x)$length)>=5)mean(dts_freq,na.rm =TRUE)}

We generate 100 bootstrap samples, using aburn in timeproportional to the depth of the model, in order to allow for a propermixing to take place. Notice that we could use alsoburnin="auto" to use the another heuristics which uses thenumber of contexts to compute the length of theburn in period.We use theseed parameter ofstats::simulate()to ensure reproducibility.

bootstrap_samples<-vector(100,mode ="list")burn_in_time<-50*depth(CAC_model)for (binseq_along(bootstrap_samples)) {  bootstrap_samples[[b]]<-simulate(CAC_model,length(CAC_dts),burin = burn_in_time,seed = b  )}

Then we compute the statistics on the bootstrap samples.

bootstrap_ls<-sapply(bootstrap_samples, long_sequence)

The bootstrap distribution of this statistics is illustrated on thefollowing figure in which the red vertical line represents the value ofthe statistics on the original sequence.

ggplot(mapping =aes(x = bootstrap_ls))+geom_density()+geom_rug()+labs(title ="Bootstrap distribution of the probability of long constant subsequences",x ="Probability"  )+geom_vline(xintercept =long_sequence(CAC_dts),color ="red")

Sampling from a COVLMC

The case of COVLMC is more complex mainly because of the dependencetowards an external time series that is not modelled. In addition, notheoretical result justify a form of COVLMC bootstrap at the time ofwriting this document (2023).

We can nevertheless use sampling both to get an idea of the potentialvariability of the results from a qualitative point of view, and also tostudy some what-if scenarios based on a modification of thecovariates.

CAC example continued

Let us study the CAC discrete time series using the three otherindexes fromEuStockMarkets as covariates. As previously,we select a model with the AIC criterion:

CAC_covariates<-as.data.frame(EuStockMarkets)[c("DAX","SMI","FTSE")][-1, ]CAC_covlmc<-tune_covlmc(CAC_dts, CAC_covariates,criterion ="AIC")CAC_comodel<-as_covlmc(CAC_covlmc)

The obtained model is relatively complex as shown below:

draw(CAC_comodel,model ="full",with_state =TRUE)#> *#> +-- Stay#> |   +-- Stay (0.0065 [ (Down) | (I)   DAX_1     SMI_1     FTSE_1  DAX_2     SMI_2    FTSE_2#> |   |                  Stay   | 0.871 0.007903  -0.009399 0.01673 -0.007771 0.009261 -0.01684#> |   |                  Up     | 2.838 -0.005994 -0.01448  0.0299  0.003666  0.0169   -0.03126 ])#> |   '-- Down, Up (0.002452 [ (Down) | (I)    DAX_1     SMI_1     FTSE_1#> |                            Stay   | -1.707 -0.001755 0.0004276 0.001412#> |                            Up     | -1.397 -0.001357 0.0006376 0.0008178 ])#> '-- Down, Up (NA [ (Down) | (I)#>                    Stay   | 0.347#>                    Up     | 0.05716 ])

The variability of the discrete time series for fixed values of thecovariates can be investigated using simulated sequences. The onlydifference betweensimulate.vlmc() andsimulate.covlmc() is the need for the covariates in thelatter. Aburn in time is also problematic here as we do nothave additional throw away values for the covariates. A possibility usedhere consists in starting the discrete time series as the observed onestarted, using as many values as the size of the largest context. Noticethat in the current implementation, simulating from a COVLMC isrelatively slow compared to simulation from a VLMC. Here we perform asingle simulation to illustrate the feature.

co_sample<-simulate(CAC_comodel,length(CAC_dts),seed =0,init = CAC_dts[1:depth(CAC_comodel)],covariate = CAC_covariates)

What-if scenarios

In some situations, one can modify the covariates to test theconsequences of the external influence on the discrete time series understudy. Let us consider for instance a week of electricalconsumption:

pc_week<- powerconsumption[powerconsumption$week==12, ]elec<- pc_week$active_powerggplot(pc_week,aes(x = date_time,y = active_power))+geom_line()+xlab("Date")+ylab("Activer power (kW)")

We build a discrete version by considering two states: the backgroundbase active power below 0.4 kW and the active use of electric applianceabove this limit.

elec_dts<-cut(elec,breaks =c(0,1.2,8),labels =c("background","active"))elec_cov<-data.frame(day = (pc_week$hour>=7& pc_week$hour<=17))

Then we adjust a COVLMC with a binary covariate day/night, using theAIC.

elec_covlmc_tune<-tune_covlmc(elec_dts, elec_cov,criterion ="AIC")best_elec_covlmc<-as_covlmc(elec_covlmc_tune)draw(best_elec_covlmc,model ="full",time_sep =" | ",p_value =FALSE)#> *#> +-- background#> |   +-- background ([ (I)    | day_1TRUE#> |   |                 -3.383 | 0.8979    ])#> |   '-- active ([ (I)#> |                 -1.526 ])#> '-- active#>     +-- background ([ (I)   | day_1TRUE#>     |                 20.57 | -19.36    ])#>     '-- active#>         +-- background ([ (I)#>         |                 1.609 ])#>         '-- active ([ (I)   | day_1TRUE#>                       2.552 | 0.8602    ])

Then we can simulate longer or shorter days by manipulating thecovariates. For instance a “day only” full week could be obtained asfollows

day_only<-simulate(best_elec_covlmc,length(elec_dts),seed =0,covariate =data.frame(day =rep(TRUE,length(elec_dts))))

while a “night only” full week is given by

night_only<-simulate(best_elec_covlmc,length(elec_dts),seed =0,covariate =data.frame(day =rep(FALSE,length(elec_dts))))

A typical use of this approach consists in simulating first acollection of sequences with the original covariates to get a sense ofthe variability of the statistics of interest. In a second phase, acomparable collection of sequences is generated using the manipulatedcovariates.