Abstract
This introduction to thesimDAGR Package is a(slightly) modified version of a provisionally accepted article in theJournal of Statistical Software. If you use this package orwant to cite information contained in this article, please cite thearXiv version(Denz and Timmesfeld 2025).simDAGR package offers a standardized approach to generate datafrom simple and complex DGP based on the definition of structuralequations in directed acyclic graphs using arbitrary functions orregression models. The package offers a clear syntax with an enhancedformula interface and directly supports generating binary, categorical,count and time-to-event data with arbitrary dependencies, possiblynon-linear relationships and interactions. It additionally includes aframework to conduct discrete-time based simulations which allows thegeneration of longitudinal data on a semi-continuous time-scale. Thisapproach may be used to generate time-to-event data with both recurrentor competing events and possibly multiple time-varying covariates, whichmay themselves have arbitrary data types. In this article we demonstratethe vast amount of features included insimDAG byreplicating the DGP of multiple real Monte-Carlo simulation studies.Applied researchers and statisticians frequently use Monte-Carlosimulation techniques in a variety of ways. They are used to estimaterequired sample sizes(Arnold et al.2011), formally compare different statistical methods(Morris, White, and Crowther 2019; Denz,Klaaßen-Mielke, and Timmesfeld 2023), help design and planclinical trials(Kimko and Duffull 2002; Nance etal. 2024) or for teaching purposes(Sigaland Chalmers 2016; Fox et al. 2022). The main reason for theirbroad usage is that the researcher has full control over the true datageneration process (DGP). In general, the researcher will define a DGPappropriate to the situation and generate multiple datasets from it.Some statistical analysis technique is then applied to each dataset andthe results are analyzed. A crucial step in every kind of Monte-Carlosimulation study is thus the generation of these datasets.
Depending on the DGP that is required by the researcher, this stepmay become very difficult and time consuming. For example, someMonte-Carlo simulations require the generation of complex longitudinaldata with variables of different types that are causally related invarious ways(Asparouhov and Muthén 2020).Some of these possible data types are continuous variables, categoricalvariables, count variables or time-to-event variables. All of theserequire different parametrizations and simulation strategies. Ifinteractions or non-linear relationships between these variables arerequired, simulating data from the DGP becomes even morechallenging.
Generating any artificial data requires (1) a formal description ofthe DGP, (2) the knowledge of an algorithm that may be used to generatethe data from this DGP, and (3) the ability to create a softwareapplication to implement that algorithm. Although many statisticians mayhave no problem with theses steps, this might not be the case for moreapplied researchers. More importantly, the third step in particular mayrequire a high level of expertise in programming, because the resultingprogram has to be validated extensively while it also has to becomputationally efficient enough to allow potentially thousands ofdatasets to be generated in a reasonable amount of time. Additionally,it also has to be flexible enough to allow the user to easily makechanges to the DGP to be useful in most cases(Sofrygin, van der Laan, and Neugebauer 2017). Acomprehensive software application that automates most of the requiredwork would therefore be of great benefit to the scientificcommunity.
In this article we present thesimDAGRpackage, which offers an easy to use and consistent framework togenerate arbitrarily complex crossectional and longitudinal data. Theaim of the package is to make all three steps of the data generationprocess easier by giving users a standardized way to define the desiredDGP, which can then be used directly to generate the data withoutfurther user input. It does so by requiring the user to define adirected acyclic graph (DAG) with additional information about theassociations between the supplied variables(Pearl 2009). The package was created using theR programming language(R Core Team2024) and is available on the ComprehensiveRArchive Network (CRAN) athttps://cran.r-project.org/package=simDAG.
In this package, the user is required to describe the desired DGP asa causal DAG. Formally, a DAG is a mathematical graph consisting of aset of\(V\) nodes (or vertices) and aset of\(E\) edges (or links)connecting pairs of nodes. As its’ name suggests, a DAG consists only ofdirected edges and isacyclic, meaning that there areno cycles when following directed paths on the DAG(Byeon and Lee 2023). A causal DAG is a specialsort of DAG in which the nodes represent random variables and the edgesrepresent directed causal relationships between these variables(Pearl 2009). A very simple example containingonly three nodes and no time-dependencies is given in Figure 1. The DAGin this figure contains a directed arrow from\(A\) to\(C\) and from\(B\) to\(C\). This translates to the assumptionsthat there is a direct causal effect of\(A\) on\(C\) and of\(B\) on\(C\), but no direct causal relationshipbetween\(A\) and\(B\) (due to the absence of an arrow betweenthem).
An example DAG with three nodes.
Such DAGs are the cornerstone of thestructural approach tocausal inference developed byPearl (2009)andSpirtes, Glymour, and Scheines (2000).They are used extensively in social research(Wouk, Bauer, and Gottfredson 2019), economics(Imbens 2020) and epidemiology(Byeon and Lee 2023) to encode causalassumptions about the real underlying DGP of empirical data. Forempirical research such graphs are very useful because they give a clearoverview of the causal assumptions made by the researchers. By usingcausal graphical methods such as thebackdoor criterion(Pearl 2009) or thefrontdoor criterion(Pearl 1995), it is also possible to usesuch graphs to determine which variables need to be adjusted for inorder to get unbiased estimates of certain causal effects. ThedaggittyR package directly implementsmultiple tools for this kind of usage(Textor etal. 2016).
These kind of DAGs can be formally described usingstructuralequations. These equations describe how each node is distributed.For example, a general set of structural equations that may be used todescribe the DAG in Figure 1 are:
\[ \begin{aligned} A \sim & f_A(U_A), \\ B \sim & f_B(U_B), \\ C \sim & f_C(A, B, U_C). \\ \end{aligned}\]
In these equations, the unspecified functions\(f_A\),\(f_B\) and\(f_C\) describe how exactly the nodes aredistributed, possibly conditional on other nodes. The terms\(U_A\),\(U_B\) and\(U_C\) denote random errors or disturbances.If the functions in these structural equations are not specified andsome assumption on the probability distribution of the error terms ismade, this is equivalent to a non-parametric structural equation model(Pearl 2009; Sofrygin, van der Laan, andNeugebauer 2017).
To make the generation of data from a DAG possible, however, it isnot enough to only specify which variables are causally related to oneanother. The structural equations now also have to be fully specified.This means that the distribution functions of each node will have to bedefined in some way by the user. A popular way to do this is to useregression models and parametric distributions(Kline 2023), but in theory any kind of functionmay be used, allowing the definition of arbitrarily complex DGPs.Continuing the example from above, we could define the structuralequations of the DAG as follows:
\[ \begin{aligned} A \sim & N(0, 1), \\ B \sim & N(0, 1), \\ C \sim & -2 + A\cdot0.3 + B\cdot-2 + N(0, 1). \\ \end{aligned}\]
This means that both\(A\) and\(B\) are independent standard normallydistributed variables and that\(C\)follows a simple linear regression model based on\(A\) and\(B\) with an independent normallydistributed error term with mean zero. Once all structural equations anddistribution functions have been defined, data may be generated from theDAG using a fairly simple algorithm. This algorithm essentiallygenerates data for one node at a time, using only the supplieddefinitions and the data generated in previous steps. This step-wisemethod relies on the fact that every DAG can betopologicallysorted, which means that there is always an ordering of the nodessuch that for every link\((u_i, u_j)\)between nodes\(u_i\) and\(u_j\),\(u_i\) comes before\(u_j\)(Kahn1962).
The generation of the data starts by ordering the nodes of the graphin such a topologically sorted way. This means that nodes in the DAGthat have no arrows pointing into them, sometimes calledrootnodes, are processed first. Data for these kinds of nodes can begenerated by sampling from a pre-specified parametric distribution, suchas a Gaussian distribution or a beta distribution, or through any otherprocess, such as re-sampling based strategies(Carsey and Harden 2014). Once data for thesenodes has been generated, the data for the next node in the topologicalorder will be generated, based on the already generated data. Thesenodes are calledchild nodes, because they are dependent onother nodes, which are called theirparent nodes(Byeon and Lee 2023). For the example DAG shownearlier, the two possible topological sortings are:
\[ (A, B, C) \quad \text{and} \quad (B, A, C).\]
Here, both\(A\) and\(B\) are root nodes because they do not haveany parents and\(C\) is a child nodeof both of its’ parents\(A\) and\(B\). To generate data for this exampleusing the algorithm described above, one would first generate\(n\) random draws from a standard normaldistribution for both\(A\) and\(B\). Next, one would calculate the linearcombination of these values as specified by the linear regression modelin the earlier Equation and add\(n\)random draws from another standard normal distribution to it (whichrepresents the error term). InR, this simple example couldbe simulated using the following code:
Although the manual code required for this example is fairly simple,this is no longer the case in DAGs with more nodes and/or a more complexDGP (for example one including different data types). ThesimDAG package offers a standardized way to define anypossible DAG and the required distribution functions to facilitate aclear and reproducible workflow.
The previous explanations and the given example focused on the simplecase of crossectional data without any time-dependencies. It is,however, fairly straightforward to include a time-varying structure intoany DAG as well by simply adding a time-index to the time-varying nodesand repeating the node for each point in time that should be considered(Hernán and Robins 2020). The proposedpackage features computationally efficient functions to automate thisprocess for large amounts of time-points using adiscrete-timesimulation approach(Tang, Leu, and Abbass2020). Although this procedure relies on a discrete time scale,it can be used to generate data on a semi-continuous time-scale by usingvery small steps in time. This is described in more detail in a laterSection.
Note also that while causal DAGs imply a specific causal structure,the algorithms and code described here do not necessitate that thiscausal structure has to be interpreted as such in the generated data.For example, the structural equations shown earlier state that\(A\) and\(B\) are direct causes of\(C\), but the datasets that can be generatedfrom these equations could also be interpreted as\(A\) and\(B\) being only associated with\(C\) for unknown reasons. As long as thedesired DGP can bedescribed as a DAG, which is almost alwaysthe case, this strategy may be used effectively to generate data evenfor Monte-Carlo studiesnot concerned with causalinference.
Although the data-generation algorithm described above is appropriatefor most applications, it may not be the best choice for validatingcausal discovery methods, due to the marginal variance of each variableincreasing along the order of the topological sorting(Reisach, Seiler, and Weichwald 2021). Othermethods, such as the onion method proposed byAndrews and Kummerfeld (2024) may be preferablein this particular case.
There are many different software packages that may be used togenerate data in theR programming language and otherlanguages such asPython. It is infeasible to summarise allof them here, so we will only focus on a few that offer functionalitysimilar to thesimDAG package instead. The following reviewis therefore not intended to be exhaustive. We merely aim to show howthe existing software differs from the proposed package.
MultipleR packages support the generation of syntheticdata from fully specified structural equation models. Thelavaan package(Rosseel2012), thesemTools package(Jorgensen et al. 2022) and thesimsem package(Pornprasertmanit etal. 2021) are a few examples. However, these packages focussolely on structural equation models with linear relationships and assuch do not allow the generation of data with different data types. Forexample, none of these packages allow the generation of time-to-eventdata, which is a type of data that is often needed in simulationstudies. SpecializedR packages such as thesurvsim(Moriña and Navarro2017),simsurv(Brilleman etal. 2021),rsurv(Demarqui2024) andreda(Wang et al.2022) packages may be used to simulate such data instead.Although some of these packages allow generation of recurrent events,competing events and general multi-state time-to-event data, unlike thesimDAG package, none of them support arbitrary mixtures ofthese data types or time-varying covariates.
Other packages, such as thesimPop package(Templ et al. 2017) and thesimFrame package(Alfons, Templ, andFilzmoser 2010) allow generation of more complex synthetic datastructures as well, but are mostly focused on generating data thatmimicks real datasets. Similarly, thesimtrial packageoffers very flexible tools for the generation of randomized controlledtrial data, but it would be difficult to use it to generate other datatypes. Software directly based on causal DAGs as DGPs also exists.Although it is not stated in the package documentation directly, thesimstudy package(Goldfeld andWujciak-Jens 2020) also relies on the DAG based algorithmdescribed earlier. It supports the use of different data types andcustom generation functions, but only has partial support for generationof longitudinal data. Alternatively, thePython libraryDagSim(Hajj, Pensar, and Sandve2023) allows users to generate arbitrary forms of data, whilealso allowing the user to supply custom functions for the datageneration process. The price for this flexibility is, however, that notmany default options are implemented in the library.
Finally, thesimcausalR package(Sofrygin, van der Laan, and Neugebauer 2017) isvery similar to thesimDAG package and was in fact a biginspiration for it. Like thesimDAG package, it is alsobased on the causal DAG framework. The syntax for defining a DAG isnearly the same, with some differences in how formula objects can andshould be specified. Unlike the proposed package, however, thesimcausal package is focused mostly on generating data forsimulation studies dealing with causal inference. As such, it alsodirectly supports the generation of data after performing someinterventions on the DAG(Pearl 2009).Although the proposed package lacks such functionality, it is a lot moreflexible in terms of what data can be generated.simDAGsupports the use of arbitrary data generation functions, the definitionof interactions, non-linear relationships and mixed model syntax in its’formula interface and categorical input data for nodes. None of thesefeatures are present insimcausal(Sofrygin, van der Laan, and Neugebauer2017).
First, we will introduce the most important functionality of theproposed package by describing the core functions and the usual workflowwhen employing the package. This includes a detailed description of howto translate the theoretical description of the desired DGP into aDAG object, which may then be used to generate data.Afterwards we will illustrate the capabilities of the package byreproducing the DGPs of multiple real Monte-Carlo simulation studies.Two DGPs describing the generation of crossectional data andlongitudinal data with only few considered points in time will beconsidered first. Afterwards, an introduction into the generation ofmore complex longitudinal data utilizing the discrete-time simulationapproach is presented. Finally, the package and its potential usefulnessis discussed.
The following functions are used in a typical workflow using thesimDAGR package.
empty_dag() | Initializes an emptyDAG object, which should be laterfilled with information on relevant nodes.DAG objects arethe most important data structure of this package. How to define and usethem is illustrated in more detail below. |
node() andnode_td() | Can be used to define one or multiple nodes each. These functionsare typically used to fill theDAG objects with informationabout how the respective node should be generated, e.g., which othernodes it depends on (if any), whether it is time-dependent or not, whatkind of data type it should be and the exact structural equation that itshould follow.node() can only be used to define nodes atone specific point in time, while thenode_td() functionshould only be used to define time-varying nodes for discrete-timesimulations. |
add_node() orDAG + node | Allows the definition made bynode() ornode_td() to be added to theDAG object. |
plot.DAG() | Directly plots aDAG object using theggplot2 library(Wickham2016). |
summary.DAG() | May be used to print the underlying structural equations of allnodes in aDAG object. |
sim_from_dag() | Is one of the two core simulation functions. Given a fully specifiedDAG object that only includes nodes defined usingnode(), it randomly generates adata.table(Barrett et al. 2024) according to the DGPspecified by theDAG. |
sim_discrete_time() | Is the second core simulation function. Given a fully specifiedDAG object that includes one or multiple nodes added usingthenode_td() function, and possibly one or multiple nodesadded using thenode() function, it randomly generates dataaccording to the specified DGP using a discrete-time simulationapproach. This is described in detail in a later Section. |
sim_n_datasets() | Allows users to directly generate multiple datasets from a singleDAG, possibly using parallel processing. |
sim2data() | May be used to transform the output produced by thesim_discrete_time() function into either thewide,long orstart-stop format to make further usage of thegenerated data easier. |
The package additionally includes multiple functions starting withnode_. These functions are used to generate data ofdifferent types and using different specifications. Usually they do nothave to be called directly by the user. Instead they are specified astypes in thenode() ornode_td()functions and only called internally. Some additional conveniencefunctions are also included, but are not further discussed in thisarticle. Instead we choose to focus on describing the core functionalityin detail and refer the interested user to the official documentationfor more information.
Regardless of which kind of data the user want to generate, it isalways necessary to first define aDAG object whichadequately describes the DGP the user wants to simulate. In most casesthis should be done using the following steps:
DAG objectusing theempty_dag() function.node() ornode_td() functions.DAG usingthe+ syntax or theadd_node() function.Theempty_dag() function is very simple, as it does nothave any arguments. It is merely used to setup the initialDAG object. The actual definition of the nodes should bedone using either thenode() function (node at a singlepoint in time) ornode_td() function (node that varies overtime), which have the following syntax:
The arguments are:
name: A character string of the name of the node thatshould be generated, or a character vector including multiple names. Ifa character vector with more than one name is supplied, multipleindependent nodes with the same definition will be added to theDAG object.type: A character string specifying the type of thenode or any suitable function that can be used to generate data for anode. Further details on supported node types are given in the nextSection.parents: If the node is a child node this argumentshould contain the names of its’ parents, unless aformulais supplied instead.formula: An optional formula object that specifies anadditive combination of coefficients and variables as required bygeneralized linear models for example. This argument may be used formost built-in node types and for user-defined functions. It allowsinclusion of internally generated dummy variables (when usingcategorical variables as parents), interaction terms of arbitrarily highorders, cubic terms and arbitrary mixed model syntax for some nodetypes....: Additional arguments passed to the functiondefined by thetype argument.For example, the simple DAG that we described earlier may be createdusing the following code:
library("simDAG")dag<-empty_dag()+node(c("A","B"),type="rnorm",mean=0,sd=1)+node("C",type="gaussian",formula=~-2+ A*0.3+ B*-2,error=1)First, an emptyDAG object is initialized using theempty_dag() function, to which nodes are added directlyusing the+ syntax. Here, only thenode()function is required, because all nodes only have to be defined for asingle point in time, since this DAG is only supposed to describecrossectional data. Additionally, since both\(A\) and\(B\) have the same structural equation here,only one call tonode() is needed to define both of thesenodes. By settingtype="rnorm" and leaving both theparents and theformula arguments at theirdefault values, these nodes are specified as root nodes for which valueswill be generated using thernorm() function with theadditional arguments passed afterwards. Because\(C\) is supposed to follow a linearregression model,type="gaussian" is used here and thestructural equation is specified using theformulaargument.
The result is aDAG object. To re-create Figure 1, usersmay use the associated S3plot() method, which internallyuses theigraph package(Csárdi etal. 2024) to put the nodes into position and theggplot2 package(Wickham2016) and theggforce(Pedersen 2022) for the actual plotting:
library("igraph")library("ggplot2")library("ggforce")plot(dag,layout="as_tree",node_size=0.15,node_fill="grey",node_text_fontface="italic")The output is not exactly the same as Figure 1 because of spacereasons, but it is very similar. Additionally, the underlying structuralequations may be printed directly using the associated S3summary() method:
summary(dag)#> A DAG object using the following structural equations:#>#> A ~ N(0, 1)#> B ~ N(0, 1)#> C ~ N(-2 + A*0.3 + B*-2, 1)Both of these methods may be useful to check whether the DAG isdefined as intended by the user, or to formally describe the DGP in apublication.
Differenttypes of nodes are supported, depending onwhat kind of node is being specified. If the node is a root node, anyfunction with an argument calledn that specifies how manyobservations of some kind should be generated may be used. For example,the baseRrunif() function may be used as atype to generate a uniformly distributed node. Other popular choices arernorm() for normally distributed nodes orrgamma() for a gamma distributed node. For convenience, theproposed package also includes implementations for fast Bernoulli trials(rbernoulli()), fast sampling from discrete probabilitydistributions (rcategorical()) and a function to set a nodeto a constant value (rconstant()).
If the node has one or more parent nodes, any function that cangenerate data based on these nodes and has the argumentsdata (containing the data generated up to this point) andparents (a vector with the names of the parent nodes) maybe used astype. Multiple popular choices are directlyimplemented in this package. These include nodes based on generalizedlinear models and potentially more complex functions to sample fromconditional distributions. Finally, the package also includes twospecialized functions which may only be used for discrete-timesimulations. These functions are able to generate binary or categoricaltime-varying covariates, as well as multiple forms of time-to-event databy repeatedly performing Bernoulli trials or multinomial trials overtime. More details are given in the Section on discrete-time simulation.A brief overview over all implemented node types is given in thefollowing Table. Note that when using these node types, the user mayeither pass the respective function directly to thetypeargument, or use a string of the name without thenode_prefix.
| Node type | Description |
|---|---|
rbernoulli() | Samples from a Bernoulli distribution |
rcategorical() | Samples from a discrete probability distribution |
rconstant() | Sets the node to a constant value |
node_gaussian() | Generates a node based on a (mixed) linear regression model |
node_binomial() | Generates a node based on a (mixed) logistic regression model |
node_conditional_prob() | Samples from a conditional discrete probability distribution |
node_conditional_distr() | Samples from different distributions conditional on values of othervariables |
node_multinomial() | Generates a node based on a multinomial regression model |
node_poisson() | Generates a node based on a (mixed) Poisson regression model |
node_negative_binomial() | Generates a node based on a negative binomial regression model |
node_zeroinfl() | Generates a node based on a zero-inflated Poisson or negativebinomial regression model |
node_identity() | Generates a node based on anR expression that includespreviously generated nodes |
node_mixture() | Generates a node as a mixture of other node types |
node_cox() | Generates a node based on a Cox proportional hazards regressionmodel, using the method ofBender, Augustin, andBlettner (2005) |
node_aftreg(),node_ahreg(),node_ehreg(),node_poreg(),node_ypreg() | Generates a node based on various parametric survival models asimplemented inrsurv(Demarqui2024) |
node_time_to_event() | A node based on repeated Bernoulli trials over time, intended togenerate time-varying variables and time-to-event nodes |
node_competing_events() | A node based on repeated multinomial trials over time, intended togenerate time-varying variables and time-to-event nodes |
Table 1: A brief overview over all implemented nodetypes. The first section contains functions that may only be used togenerate root nodes, while the second section contains functions thatmay only be used to generate child nodes. Nodes in both of thesesections may be used in bothnode() andnode_td() calls. The nodes mentioned in the third sectionare only meant to be used for discrete-time simulations.
In the following Section we will illustrate how to use thesimDAG package to simulate more complex crossectional data.Instead of using a made up artificial example, we will do this bypartially replicating a real Monte-Carlo simulation study byDenz, Klaaßen-Mielke, and Timmesfeld (2023),published in the prestigious peer-reviewed journalStatistics inMedicine. Because this package strictly focuses on the datageneration step of Monte-Carlo studies and for reasons of brevity, wewill not reproduce the entire simulation study. Instead we will onlyreplicate the DGP used to generate the data for this study.
Denz, Klaaßen-Mielke, and Timmesfeld(2023) recently performed a neutral comparison study of multipledifferent methods to estimate counterfactual survival curves fromcrossectional observational data. In this Monte-Carlo simulation studythey wanted to investigate how different kinds of misspecifications ofnuisance models, which are used in some methods to estimate thecounterfactual survival curves, affect the estimates produced by thedifferent methods. To do this, a DGP was required that includes multipleinterrelated binary and continuous variables, as well as aright-censored time-to-event variable.
In particular, the data sets they generated for each simulation runincluded six covariates, two of which were binary (\(X_1, X_3\)) and four of which werecontinuous (\(X_2, X_4, X_5, X_6\)). Itadditionally included a binary treatment variable (\(Z\)) and a right-censored time-to-eventoutcome (\(T\)). The two binarycovariates\(X_1\) and\(X_3\) followed a simple Bernoullidistribution with a success probability of 0.5.\(X_2\) was generated by a linear regressionmodel, dependent on\(X_3\). The twocontinuous covariates\(X_4\) and\(X_6\) were standard normally distributed,while\(X_5\) was generated accordingto a linear regression model dependent on\(X_6\). The treatment variable\(Z\) followed a logistic regression model,dependent on\(X_2\),\(X_3\),\(X_5\) and\(X_6\), where\(X_2\) was included as a squared term.Finally, the outcome\(T\) wasgenerated according to a Cox model, dependent on\(X_1\),\(X_2\),\(X_4\),\(X_5\) and\(Z\), where\(X_5\) was included as a squared term. Datafor\(T\) was generated using themethod by(Bender, Augustin, and Blettner2005), based on a Weibull distribution (\(\lambda=2, \gamma=2.4\)). The time untilcensoring was generated independently from a second Weibull distribution(\(\lambda=1, \gamma=2\)).
This DGP was used because it includes two confounders (\(X_2\),\(X_5\)) for the causal effect of\(Z\) on\(T\), which are correlated with othernon-confounding variables. The inclusion of non-linear relationshipsallowedDenz, Klaaßen-Mielke, and Timmesfeld(2023) to investigate different kinds of misspecified models.More details on the DGP and the simulation study itself are given in theoriginal manuscript. To replicate the DGP of this study, the followingDAG definition may be used:
dag<-empty_dag()+node(c("X1","X3"),type="rbernoulli",p=0.5,output="numeric")+node(c("X4","X6"),type="rnorm",mean=0,sd=1)+node("X2",type="gaussian",formula=~0.3+ X3*0.1,error=1)+node("X5",type="gaussian",formula=~0.3+ X6*0.1,error=1)+node("Z",type="binomial",formula=~-1.2+I(X2^2)*log(3)+ X3*log(1.5)+ X5*log(1.5)+ X6*log(2),output="numeric")+node("T",type="cox",formula=~X1*log(1.8)+ X2*log(1.8)+ X4*log(1.8)+I(X5^2)*log(2.3)+ Z*-1,surv_dist="weibull",lambda=2,gamma=2.4,cens_dist="rweibull",cens_args=list(shape=1,scale=2))As before, we can define the nodes\(X_1\) and\(X_3\) and the nodes\(X_4\) and\(X_5\) using a single call tonode(), because they have the same definition. Since onlydirectly supported regression models are required for this DGP, we wereable to use theformula argument to directly type out allof the structural equations.
Plotting thisDAG using theplot() methodproduces the following output:
The underlying structural equations are:
summary(dag)#> A DAG object using the following structural equations:#>#> X1 ~ Bernoulli(0.5)#> X3 ~ Bernoulli(0.5)#> X4 ~ N(0, 1)#> X6 ~ N(0, 1)#> X2 ~ N(0.3 + X3*0.1, 1)#> X5 ~ N(0.3 + X6*0.1, 1)#> Z ~ Bernoulli(logit(-1.2 + I(X2^2)*log(3) + X3*log(1.5) +#> X5*log(1.5) + X6*log(2)))#> T[T] ~ (-(log(Unif(0, 1))/(2*exp(X1*log(1.8) + X2*log(1.8) +#> X4*log(1.8) + I(X5^2)*log(2.3) + Z*-1))))^(1/2.4)#> T[C] ~ rweibull(shape=1, scale=2)Finally, we can generate a singledata.table from thisDAG using thesim_from_dag() function. Thefirst few rows of the generated data look like this:
library("data.table")dat<-sim_from_dag(dag,n_sim=500)head(round(dat,3))#> X1 X3 X4 X6 X2 X5 Z T_time T_status#> <num> <num> <num> <num> <num> <num> <num> <num> <num>#> 1: 0 0 -0.646 1.186 -0.628 1.395 1 0.318 0#> 2: 1 0 0.968 -2.368 0.101 -0.040 0 0.135 1#> 3: 0 0 0.125 2.012 0.662 0.694 1 0.758 1#> 4: 0 1 -0.142 -2.610 -0.826 0.588 1 1.100 1#> 5: 0 0 -0.273 0.632 1.462 0.007 1 0.890 1#> 6: 1 1 -0.198 0.379 0.693 2.416 1 0.058 1Although only eight nodes were defined in the correspondingDAG, it includes nine columns. The reason for this is thatnodes generated usingtype="cox" by default generate twocolumns: one for the observed time and one indicating whether theobservation is right-censored or not. This is the standard data formatfor such time-to-event data and one of the reasons supplied nodefunctions are allowed to output multiple columns at the same time. Thisof course also extends to custom user-defined nodetypes.When no censoring distribution is supplied, it would also be possible toreturn only the simulated time-to-event in a single column by settingas_two_cols=FALSE in thenode() call for\(T\).
For many applications the desired DGP may contain one or morevariables that change over time. If that is the case, longitudinal datamust be generated. There are two approaches to do this using thealgorithm implemented in the proposed package: (1) defining one node perpoint in time of interest and (2) constructing one node definition forall points in time that should be considered simultaneously. In thisSection the first approach will be illustrated by replicating the DGPthat was used in the Monte-Carlo simulation study performed byGruber et al. (2015).
In their study,Gruber et al. (2015)compared the efficiency of different methods of estimating inverseprobability weights for marginal structural models. Such models areoften used to estimate marginal causal effects of treatment-regimes fromlongitudinal observational data with time-varying variables(Hernán and Robins 2020). The main aim was toquantify the benefits of using ensemble methods versus using traditionalmethods, such as logistic regression models, when estimating therequired weights. Their DGP therefore required the inclusion oftime-varying variables. Below we focus on the DGP theses authors used intheir first simulation scenario.
Their DGP consisted of four variables: a binary treatment variable\(A\), a binary unmeasured baselinecovariate\(U\), a continuousconfounder\(L\) and a binary outcome\(Y\). Since\(U\) represents a baseline variable, it doesnot vary over time. All other variables, however, were generated at twodistinct time points.\(L\) and\(A\) were generated for time\(0\) and\(1\), while the time-lagged outcome\(Y\) was generated for times\(1\) and\(2\). For simplicity,\(A\) was affected only by present and past\(L\) and not by previous values of\(A\) itself.\(L\) on the other hand was caused byprevious values of itself and by\(A\)and\(U\). Finally,\(Y\) is only caused by\(U\), meaning that neither the treatment northe confounder have any actual effect on the outcome. To generatecontinuous child nodes they relied on linear regression models. Forbinary child nodes, logistic regression models were used. A moredetailed description of the data generation process is given in theoriginal article(Gruber et al. 2015). ADAG object to define this DGP in the proposed package isgiven below.
dag<-empty_dag()+node("U",type="rbernoulli",p=0.5,output="numeric")+node("L0",type="gaussian",formula=~0.1+0.6*U,error=1)+node("A0",type="binomial",formula=~-0.4+0.6*L0,output="numeric")+node("Y1",type="binomial",formula=~-3.5+-0.9*U,output="numeric")+node("L1",type="gaussian",formula=~0.5+0.02*L0+0.5*U+1.5*A0,error=1)+node("A1",type="binomial",formula=~-0.4+0.02*L0+0.58*L1,output="numeric")+node("Y2",type="binomial",formula=~-2.5+0.9*U,output="numeric")Shown graphically, theDAG looks like this:
The structural equations can again be printed using thesummary() function:
summary(dag)#> A DAG object using the following structural equations:#>#> U ~ Bernoulli(0.5)#> L0 ~ N(0.1 + 0.6*U, 1)#> A0 ~ Bernoulli(logit(-0.4 + 0.6*L0))#> Y1 ~ Bernoulli(logit(-3.5 + -0.9*U))#> L1 ~ N(0.5 + 0.02*L0 + 0.5*U + 1.5*A0, 1)#> A1 ~ Bernoulli(logit(-0.4 + 0.02*L0 + 0.58*L1))#> Y2 ~ Bernoulli(logit(-2.5 + 0.9*U))Finally, we can call thesim_from_dag() function on thisDAG to generate some data:
dat<-sim_from_dag(dag,n_sim=1000)head(dat)#> U L0 A0 Y1 L1 A1 Y2#> <num> <num> <num> <num> <num> <num> <num>#> 1: 0 1.3086008 0 0 -0.1642422 1 0#> 2: 1 -0.4978996 1 0 3.2424318 0 0#> 3: 1 0.5125773 1 0 2.4114821 1 0#> 4: 1 1.1811591 1 0 3.2285778 0 1#> 5: 1 1.3826790 1 0 1.7878382 1 0#> 6: 1 -0.3077301 0 0 1.1737307 0 0Because the variables\(A\),\(L\) and\(Y\) should be generated at different pointsin time, we have to include one node definition per variable per pointin time to get an appropriateDAG. Apart from that thesyntax is exactly the same as it was when generating crossectional data.More points in time could be added by simply adding more calls tonode() to theDAG object, using appropriateregression models. The main advantage of this method is that it allowsflexible changes of the DGP over time. However, the obvious shortcomingis that it does not easily extend to scenarios with many points in time.Although the authors only considered three time-varying variables andtwo distinct points in time here, theDAG definition isalready quite cumbersome. For every further point in time, three newcalls tonode() would be necessary. If hundreds of pointsin time should be considered, using this method is essentiallyin-feasible. To circumvent these problems we describe a slightlydifferent way of generating longitudinal data below.
Instead of defining each node for every point in time separately, itis often possible to define the structural equations in a generictime-dependent fashion, so that the same equation can be applied at eachpoint in time, simplifying the workflow considerably. The result is thedescription of a specific stochastic process. For example, consider avery simple DAG with only one time-dependent node\(Y\) at three points in time,\(t \in \{0, 1, 2\}\):
In this DAG,\(Y_t\) is only causedby values of itself at\(t-1\). Supposethat\(Y\) is a binary event indicatorthat is zero for everyone at\(t = 0\).At every point in time\(t\),\(Y\) is set to 1 with a probability of 0.01.Once\(Y\) is set to 1, it neverchanges back to 0. The following structural equation may be used todescribe this DAG:
\[ Y_t \sim Bernoulli(P_Y(t)),\]
where
\[ P_Y(t) = \begin{cases} 0 & \text{if} \quad t = 0 \\ 1 & \text{if} \quad Y_{t-1} = 1 \\ 0.01, & \text{otherwise} \end{cases}.\]
The number of points in time could be increased by an arbitrarynumber and the same structural equation could still be used. Note thatthe time points may stand for any period of time, such as minutes, daysor years. This is of course a trivial example, but this approach mayalso be used to define very complex DGPs, as illustrated below. Forexample, arbitrary dependencies on other variables measured at the sametime or at any earlier time may be used when defining a node. Togenerate data from this type of DAG, the same algorithm as described inthe first Section may be used. Since only discrete points in time areconsidered, this type of simulation has also been calleddiscrete-time simulation(Tang, Leu, andAbbass 2020) ordynamic microsimulation(Spooner et al. 2021) in the literature and isclosely related todiscrete-event simulation(Banks et al. 2014).
To generate data for the simple example considered above in theproposed package, we first have to define an appropriateDAG object as before. This can be done using thenode_td() function withtype="time_to_event"as shown below.
By default, the value of node defined usingtype="time_to_event" is 0 for all individuals at\(t = 0\). Theprob_fun argumentdefines the function that determines the occurrence probability at eachpoint in time. It is set to 0.01 here, indicating that for allindividuals and regardless of the value of\(t\) the probability of experiencing theevent is constant. Usually this argument will be passed an appropriatefunction to generate the occurrence probability for each individual ateach point in time separately, but this is not necessary yet. By settingtheevent_duration argument toInf, we areindicating that the all events have an infinite duration, making thefirst event the final event for each person. In contrast to before, thesim_discrete_time() function now has to be used to generatedata from thisDAG object, because it contains atime-varying node:
Themax_t argument specifies how many points in timeshould be simulated. Instead of just three points in time we consider 80here. Contrary to thesim_from_dag() function, thesim_discrete_time() function does not return a singledata.table. Instead it returns asimDT object,which usually has to be processed further using thesim2data() function to be useful. In this case, however, itis enough to simply extract the last state of the simulation to get allthe information we need. This last simulation state is stored in the$data parameter of thesimDT object:
head(sim$data)#> .id Y_event Y_time#> <int> <lgcl> <int>#> 1: 1 FALSE NA#> 2: 2 TRUE 56#> 3: 3 TRUE 8#> 4: 4 TRUE 26#> 5: 5 FALSE NA#> 6: 6 TRUE 58As specified, the simulation contains only the variable\(Y\), split into two columns. The first iscalledY_event and is a binary indicator of whether theindividual is currently experiencing an event. The second column calledY_time shows the time at which that event happened, or isset toNA if there is no event currently happening. Sinceevery event is final, this is all information that was generated here.Individuals with a value ofNA inY_time canbe considered right-censored at\(t =80\). In this trivial example, it would be a lot easier togenerate equivalent data by sampling from an appropriate parametricdistributions. The following Section will illustrate the benefits of theapproach using a more involved example.
Suppose that we want to generate data for the Covid-19 pandemic,containing individual level information about Covid-19 vaccinationsdenoted by\(A\) and the development ofan acute myocarditis denoted by\(Y\).Different individuals get vaccinated at different times, possiblemultiple times. Additionally, they might experience zero or multiplecases of myocarditis, also at different times. Both variables aretherefore time-dependent binary variables, which are related to oneanother. If the target of interest is to estimate the effect of thevaccination on the time until the occurrence of a myocarditis, thevaccination may be considered a time-dependent variable and themyocarditis as a non-terminal, possibly recurrent, time-to-eventoutcome. This setup was of interest toDenz etal. (2025), who performed a Monte-Carlo simulation study toinvestigate the impact of linkage errors when estimating vaccine-safetyfrom observational data. Below we will illustrate a simplified versionof the DGP used therein.
For simplicity, we will make multiple simplifying assumptions thatwere not made in the original simulation study. First, we assume thatthe probability of being vaccinated and the base probability ofdeveloping a myocarditis are constant over time and equal for allindividuals. The only risk factor for developing a myocarditis in thisexample is the Covid-19 vaccination itself. More precisely, thestructural equation for the myocarditis node at\(t\) is given by:
\[ Y_t \sim Bernoulli(P_{Y}(t)),\]
with:
\[ P_{Y}(t) = \begin{cases} P_{Y0} \cdot RR_{A}, & \text{if } t \in \left[T_{A}(t),T_{A}(t) + d_{risk}\right] \\ P_{Y0}, & \text{otherwise} \end{cases},\]
where\(P_{Y0}\) denotes the baseprobability of developing a myocarditis,\(T_{A}(t)\) denotes the time of the lastperformed vaccination and\(d_{risk}\)defines the duration after the vaccination in which the risk ofdeveloping a myocarditis is elevated by\(RR_{A}\). In this particular case, each\(t\) will represent a single day.Similarly, the vaccination node can be described formally as:
\[ A_t \sim Bernoulli(P_{A}(t)),\]
with:
\[ P_{A}(t) = \begin{cases} 1, & \text{if } t \in \left[T_{A}(t), T_A(t) + 20 \right]\\ 0, & \text{if } t \in \left[T_{A}(t) + 21, T_A(t) +150\right] \\ P_{A0}, & \text{otherwise} \end{cases},\]
where\(P_{A0}\) denotes the baseprobability of getting vaccinated. The Figure below illustrates theresult of applying these structural equations to a fictional person whogets vaccinated at\(t = 100\).
A simple graph showing\(P_A(t)\) and\(P_Y(t)\) for a fictional individualwho got vaccinated once at\(t = 100\),with\(P_{A0} = 0.01\),\(P_{Y0} = 0.005\),\(d_{risk} = 20\) and\(RR_A = 3.24\).
The following code may be used to define this DGP:
prob_myoc<-function(data, P_0, RR_A) {fifelse(data$A_event, P_0*RR_A, P_0)}dag<-empty_dag()+node_td("A",type="time_to_event",prob_fun=0.01,event_duration=20,immunity_duration=150)+node_td("Y",type="time_to_event",prob_fun=prob_myoc,parents=c("A_event"),P_0=0.005,RR_A=3.24)First, we define a function that calculates\(P_Y(t)\) at each simulated day for allindividuals, calledprob_myoc(). This function simplychecks whether the binary event indicator of the vaccination event,"A_event", is currentlyTRUE and multipliesthe baseline probability of developing a myocarditis\(P_{Y0}\) with the relative risk if this isthe case. Otherwise it just returns the baseline probability directly.This function is then passed directly to theprob_funargument in thenode_td() call when defining themyocarditis node. For this to be a sensible strategy, we need to ensurethat"A_event" is onlyTRUE when a person iscurrently at elevated risk for a myocarditis, as defined in the earlierEquation. In other words,"A_event" should not actually bean indicator of whether someone just received a vaccination, but anindicator of whether the person is currently in the risk period of 20days following the vaccination. We can achieve this by setting theevent_duration parameter in the node definition of thevaccination node to 20, meaning that the vaccination node will equal 1for 20 days after a vaccination was performed. This argument is a directfeature of thenode_time_to_event() function, which iscalled internally whenevertype="time_to_event" is used ina time-dependent node.
The base probability for the vaccination and for the myocarditisevents are set to the arbitrary values of 0.01 and 0.005 respectively.\(RR_{A}\) is set to 3.24, which is thevalue used in the actual simulation study byDenzet al. (2025). Theimmunity_duration parameter usedfor the vaccination node additionally specifies that a person will notreceive another vaccination in the first 150 days after a vaccinationwas performed. More specifically, these settings ensure that"A_event" is set toFALSE for 130 days aftertheevent_duration of 20 days is over. This is anotherfeature of"time_to_event" nodes. To run the simulation fortwo simulated years, the following code may be used:
Since both of the included variables change over time and may havemultiple events, it is not appropriate to just look at the last state ofthe simulation in this case. Instead we will have to use thesim2data() function to obtain a useful dataset. Thefollowing code may be used to get a dataset in thestart-stopformat:
dat<-sim2data(sim,to="start_stop",target_event="Y",keep_only_first=TRUE,overlap=TRUE)head(dat)#> .id start stop A Y#> <int> <int> <num> <lgcl> <lgcl>#> 1: 1 1 57 FALSE TRUE#> 2: 2 1 21 FALSE FALSE#> 3: 2 21 38 TRUE TRUE#> 4: 3 1 50 FALSE FALSE#> 5: 3 50 70 TRUE FALSE#> 6: 3 70 272 FALSE FALSEIn this format, there are multiple rows per person, corresponding toperiods of time in which all variables stayed the same. These intervalsare defined by thestart column (time at which the periodstarts) andstop columns (time at which the period ends).Theoverlap argument specifies whether these intervalsshould be overlapping or not. By settingtarget_event="Y",the function treats theY node as the outcome instead of asanother time-dependent covariate. The resulting data is in exactly theformat needed to fit standard time-to-event models, such as a Cox modelwith time-varying covariates(Z. Zhang et al.2018). Using thesurvival package(Therneau 2024), we can do this using thefollowing code:
library("survival")mod<-coxph(Surv(start, stop, Y)~ A,data=dat)summary(mod)#> Call:#> coxph(formula = Surv(start, stop, Y) ~ A, data = dat)#>#> n= 25020, number of events= 9871#>#> coef exp(coef) se(coef) z Pr(>|z|)#> ATRUE 1.15906 3.18693 0.02377 48.76 <2e-16 ***#> ---#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1#>#> exp(coef) exp(-coef) lower .95 upper .95#> ATRUE 3.187 0.3138 3.042 3.339#>#> Concordance= 0.58 (se = 0.002 )#> Likelihood ratio test= 1928 on 1 df, p=<2e-16#> Wald test = 2378 on 1 df, p=<2e-16#> Score (logrank) test = 2648 on 1 df, p=<2e-16Because the DGP fulfills all assumptions of the Cox model and thesample size is very large, the resulting hazard ratio is close to therelative risk of 3.24 that were specified in theDAGobject. Other types of models, such as discrete-time survival models,may require different data formats(Tutz andSchmid 2016). Because of this, thesim2data()function also supports the transformation intolong- andwide- format data.
Above we assumed that the presence of a myocarditis event has noeffect on whether a person gets vaccinated or not. In reality, it isunlikely that a person who is currently experiencing a myocarditis wouldget a Covid-19 vaccine. We can include this into the DGP by modifyingtheprob_fun argument when defining the vaccination node asshown below:
prob_vacc<-function(data, P_0) {fifelse(data$Y_event,0, P_0)}dag<-empty_dag()+node_td("A",type="time_to_event",prob_fun=prob_vacc,event_duration=20,immunity_duration=150,P_0=0.01)+node_td("Y",type="time_to_event",prob_fun=prob_myoc,parents=c("A_event"),P_0=0.005,RR_A=3.24,event_duration=80)In this extension, the probability of getting vaccinated is set to 0whenever the respective person is currently experiencing a myocarditisevent. By additionally setting theevent_duration parameterof the myocarditis node to the arbitrary value of 80, we are definingthat in the 80 days after experiencing the event there will be novaccination for that person. Data could be generated and transformedusing the same code as before.
The example shown above illustrates only a small fraction of thepossible features that could be included into the DGP using thediscrete-time simulation algorithm implemented in the proposed package.Below we name more features that could theoretically be added to theDGP. Examples for how to actually implement these extensions are givenin the appendix.
Time-Dependent Probabilities and Effects: In theexample above both the base probabilities and the relative risks werealways set to a constant, regardless of the simulation time. This mayalso be changed by re-defining theprob_fun argument tohave a named argument calledsim_time. Internally, thecurrent time of the simulation will then be passed directly to theprob_fun, allowing users to define anytime-dependencies.
Non-Linear Effects: Above, we only consideredthe simple scenario in which the probability of an event after theoccurrence of another event follows a simple step function, e.g., it isset to\(P_0\) in general and increasedby a fixed factor in a fixed duration after an event. This may also bechanged by using more complex definitions ofprob_fun thatare not based only on the_event column of the exposure,but directly on the time of occurrence.
Adding more Binary Time-Dependent Variables:Only two time-dependent variables were considered in the example. It isof course possible to add as many further variables as desired by theuser by simply adding more appropriatenode_td() calls totheDAG object. Both thesim_discrete_time()function and thesim2data() function will still workexactly as described earlier.
Adding Baseline Covariates: It is of course alsopossible to define aDAG that contains bothtime-independent and time-dependent variables at the same time. This canbe achieved by adding nodes with thenode() andnode_td() function. Data for the time-independent variableswill then be generated first (at\(t =0\)) and the time-dependent simulation will be performed asbefore, albeit possibly dependent on the time-independentvariables.
Adding Categorical Time-Dependent Variables:Sometimes it may not be appropriate to describe a time-varying variableusing only two classes. For theses cases, the proposed package alsoincludes the"competing_events" node type, which is verysimilar to the"time_to_event" node type. The onlydifference is that instead of Bernoulli trials multinomial trials areperformed at each point in time for all individuals, allowing thegeneration of mutually exclusive events.
Adding Arbitrarily Distributed Time-DependentVariables: As in thesim_from_dag() function, anykind of function may be used as atype for time-dependentvariables. By appropriately defining those, it is possible to generatecount and continuous time-varying variables as well.
Adding Ordered Events: Some events should onlybe possible once something else has already happened. For example, aperson usually cannot receive a PhD before graduating high school. Thistype of data may also be generated, again by appropriately specifyingthe required probability functions.
All of the mentioned and shown features may of course be used inarbitrary combinations. Additionally, the list above is not meant to beexhaustive. It is merely supposed to highlight how versatile theimplemented approach is.
Although the discrete-time simulation procedure described above isincredibly flexible, it is also very computationally expensivetheoretically. For example, when using nodes of type"time_to_event". as shown above, the program has tore-calculate the event probabilities at each point in time, for everyindividual and for every node. Increasing either the number of nodes,the number of time points or the number of considered individualstherefore non-linearly increases the number of required computations.This fact, together with the lack of available appropriate softwarepackages, is probably the main reason this type of simulation strategyis rarely used in Monte-Carlo simulation studies. Since Monte-Carlostudies often require thousands of replications with hundreds ofdifferent scenarios, even a runtime of a few seconds to generate adataset may be too long to keep the simulation computationallyfeasible.
The presented implementation was therefore designed explicitly to beas fast as possible. To ensure that it can be run on multiple processingcores in parallel, it is additionally designed to use as little randomaccess memory (RAM) as possible. As all other code in this package, itexclusively uses thedata.table package(Barrett et al. 2024) to perform most requiredcomputations. Thedata.table package is arguably the bestchoice for doing data wrangling in theR ecosystem in termsof computational efficiency and has similar performance as correspondingsoftware libraries inR andJulia(Chiou, Xu, and Huang 2023). The proposedpackage additionally relies on a few tricks to keep the amount of memoryused small. For example, when using nodes of type"time_to_event", by default not every state of thesimulation is saved. Only the times at which an event occurred arerecorded. This information is then efficiently pieced together in thesim2data() function when creating the actual dataset.
Although none of these code optimizations can entirely overcome theinherent computational complexity of the approach, they do ensure thatthe usage of this method is feasible to generate reasonably large datawith many points in time on a regular office computer. For example,generating a dataset with 1000 individuals and 730 distinct points intime from the firstDAG, shown in the Section containingthe Covid example, takes only 1.11 seconds on average on the authorspersonal computer, using a single processing core (Intel(R) Core(TM)i7-9700 CPU @ 3.00GHz, 32GB RAM). It is, however, also important to notethat the runtime highly depends on how efficient the functions suppliedto theprob_fun arguments are written as well.
Below we give some additional limited code to illustrate theincreased runtime of thesim_discrete_time() function withincreasingn_sim and increasingmax_t values.As an example, we use the first DGP shown in the Section on the Covidexample to generate some data for multiple different values of eachargument. We include calls to thesim2data() function inthe runtime calculations, because calling this function is necessary toobtain usable data and as such should be considered part of the DGP. Theruntime is calculated using themicrobenchmark package(Mersmann 2023).
#NOTE: This part of the code is not run here, because it would take too long# on CRAN and would introduce another dependency on the "microbenchmark"# package. Results may also vary depending on the hardware this is run on.library(microbenchmark)set.seed(1234)prob_myoc<-function(data, P_0, RR_A) {fifelse(data$A_event, P_0*RR_A, P_0)}run_example<-function(n, max_t) { dag<-empty_dag()+node_td("A",type="time_to_event",prob_fun=0.01,event_duration=20,immunity_duration=150)+node_td("Y",type="time_to_event",prob_fun=prob_myoc,parents=c("A_event"),P_0=0.005,RR_A=3.24) sim<-sim_discrete_time(dag,n_sim=n,max_t=max_t) dat<-sim2data(sim,to="start_stop",target_event="Y",keep_only_first=TRUE,overlap=TRUE)}n<-c(10,100,1000,10000,100000)max_t<-c(10,100,1000,10000)params<-data.frame(max_t=rep(max_t,each=length(n)),n=rep(n),time=NA)for (iinseq_len(nrow(params))) { n_i<- params$n[i] max_t_i<- params$max_t[i] bench<-microbenchmark(run_example(n=n_i,max_t=max_t_i),times=1) params$time[i]<-mean(bench$time/1000000000)}params<-within(params, { max_t<-factor(max_t)})ggplot(params,aes(x=n,y=time,color=max_t))+geom_point()+geom_line()+theme_bw()+labs(x="n_sim",y="Runtime in Seconds",color="max_t")+scale_x_continuous(labels=scales::label_comma(),transform="log10")+scale_y_log10()The runtime increases more sharply with higher values ofmax_t than it does with higher values ofn_sim, because each additional considered point in timeinternally translates to one more iteration in anRfor loop, while each additional considered individualtranslates to one more row in the generated data, which is processedusing optimizeddata.table code directly.
In this article we presented thesimDAGRpackage and illustrated how it can be used to generate various forms ofartificial data that may be required for Monte-Carlo simulation studies.In particular, we showed how the package may be used to generatecrossectional and longitudinal data with multiple variables of differentdata types, non-linear effects and arbitrary causal dependencies. Weshowed how similar the syntax is for very different kinds of DGP and howclosely the required code resembles the actual underlying structuralequations when using built-in node types. Because the package is basedon defining a DAG to describe the DGP, it lends itself particularly wellto simulation studies dealing with causal inference, but it is by nomeans limited to applications in this field, as shown in the second tolast Section of the article.
In addition to the main data generation functions, the package alsoincludes multiple functions to facilitate the accurate description ofthe underlying DGP. Such descriptions are of great importance tofacilitate both understanding and reproducibility of simulation studies,as emphasized in the literature(Morris, White,and Crowther 2019; Cheng et al. 2016). Among these are theplot.DAG() function, that was used throughout the articleto graphically display the definedDAG objects. While thisfunction is useful on its own, some users may prefer to use one of themany other options to display DAGs inR(Pitts and Fowler 2024). To make this easier forusers the package also contains thedag2matrix() function,which returns the underlying adjacency matrix of aDAGobject and theas.igraph.DAG() function for directintegration with theigraphR package(Csárdi et al. 2024). Additionally, thesummary.DAG() method may be used to directly print the usedstructural equations, as shown throughout the article.
The most distinguishing feature of the package is its capability ofcarrying out discrete-time simulations to generate longitudinal datawith hundreds of points in time in a suitable amount of time. Whileother packages, such as thesimcausal package(Sofrygin, van der Laan, and Neugebauer 2017)offer similar features to generate crossectional data, its’implementation for generation of longitudinal data is very differentfrom the proposed package. Insimcausal a new node isdefined for each point in time internally. Although the user has directaccess to each of these nodes (and therefore to each value at any pointin time), the provided formula interface does not naturally support thedefinition of nodes with events that occur at some point in time andlast for a certain amount of time. This can be done with little effortin thesimDAG package using the provided"time_to_event" node type.
This type of node can then be used to specify outcomes or to specifybinary time-varying covariates, as illustrated in the main text where weused two such nodes to describe both the Covid-19 vaccination status andthe associated adverse event. Using just this node type it is thereforepossible to define DGPs including a time-to-event outcome (possibly withrecurrent events) with multiple time-varying covariates. While there aremultiple other methods to generate some forms of time-to-event data withtime-varying covariates(Hendry 2013; Austin2012; Huang et al. 2020; Ngwa et al. 2022), most of them requirestrict parametrizations or do not support the use of multiple,arbitrarily distributed time-varying variables. Additionally, neither ofthese methods allows the inclusion of recurrent events or competingevents. None of these restrictions apply to the discrete-time simulationapproach.
However, the method also has two main disadvantages. First, it is alot more computationally expensive than other methods, and secondly itdoes usually require the user to define appropriate functions togenerate the time- and individual specific probabilities per node.Although the inherent computational complexity cannot be removed, it isalleviated in the implementation of this package through the use ofoptimized code and thedata.table back-end(Barrett et al. 2024). As shown in the article,the presented implementation allows generation of large datasets in areasonable amount of time. A computationally more efficient alternativenot considered in this package would bediscrete-eventsimulation(Banks et al. 2014), inwhich the time until the next event is modeled directly instead ofsimulating the entire process over time. Performing such simulations is,however, usually a lot more demanding both conceptually and in terms ofrequired software development(X. Zhang2018). The burden of specifying appropriate input to theprob_fun argument in our approach is comparatively small,but it might still be a concern for some users. We hope that the manyprovided examples and explanations in both this article and theextensive documentation and multiple vignettes of the package will helpusers overcome this issue.
To keep this article at a reasonable length, it was necessary to omitsome implemented features of thesimDAG package. One ofthese features is its capability to generate data for complexmulti-level DGPs. By internally using code from thesimrpackage(Green and MacLeod 2016), itallows users to add arbitrarily complex random effects and random slopesto nodes of type"gaussian","binomial" and"poisson". This can be done by directly adding the standardmixed modelR syntax to theformula interface.Additionally, while the main focus of the package is on generating datafrom a user-defined DGP, the package offers limited support forgenerating data that mimics already existing data through thedag_from_data() function. This function only requires theuser to specify the causal dependencies assumed to be present in thedata and the type of each node and then directly extracts coefficientsand intercepts from the available data by fitting various models to it.The returned DAG is then fully specified and can be used directly in astandardsim_from_dag() call to obtain data similar to theone supplied. Note that this function does not directly support allavailable node types. If the main goal of the user is to generate suchsynthetic mimicked datasets, using packages such assimPop(Templ et al. 2017) might bepreferable.
Finally, we would like to note that the package is still under activedevelopment. We are currently working on multiple new features to makethe package even more versatile for users. For example, future versionsof the package are planned to support the definition of interventions onthe DAG, much like thesimcausal package(Sofrygin, van der Laan, and Neugebauer 2017),which would make it even easier to generate data for causal inferencebased simulations, without having to re-define the DAG multiple times.We also plan to extend the internal library of available node types, byfor example including node functions to simulate competing events datawithout the use of discrete-time simulation(Moriña and Navarro 2017; Haller and Ulm2014).
The results in this paper were obtained usingR 4.5.0with thedata.table 1.17.0 package, thesurvival 3.8.3 package, theigraph 2.1.4package, theggplot2 3.5.2 package and thesimDAG 0.4.1 package.R itself and allpackages used are available from the ComprehensiveRArchive Network (CRAN) athttps://CRAN.R-project.org/.
We would like to thank Katharina Meiszl for her valuable input andfor contributing some unit tests and input checks to the proposedpackage. We would also like to thank the anonymous reviewers, whosefeedback and suggestions greatly improved the manuscript and the packageitself.
As mentioned in the main text, there are a lot of possible featuresthat could be included in the DGP using the discrete-time simulationapproach, which are not shown in the main text. Below we will give somesimple examples on how to use the discrete-time simulation approachimplemented in thesimDAG package to generate artificialdata with some of these useful features. For most of these examples, thefirst DGP shown in the Covid example Section will be extended.
To make\(P_{Y0}\) time-dependent,thesim_time argument may be used to change the definitionof the function that generates the myocarditis probabilities. Thefollowing code gives an example for this:
prob_myoc<-function(data, P_0, RR_A, sim_time) { P_0<- P_0+0.001*sim_timefifelse(data$A_event, P_0*RR_A, P_0)}dag<-empty_dag()+node_td("A",type="time_to_event",prob_fun=0.01,event_duration=20,immunity_duration=150)+node_td("Y",type="time_to_event",prob_fun=prob_myoc,parents=c("A_event"),P_0=0.005,RR_A=3.24)sim<-sim_discrete_time(dag,n_sim=100,max_t=200)data<-sim2data(sim,to="start_stop")head(data)#> .id start stop A Y#> <int> <int> <num> <lgcl> <lgcl>#> 1: 1 1 13 FALSE FALSE#> 2: 1 14 17 TRUE FALSE#> 3: 1 18 18 TRUE TRUE#> 4: 1 19 24 TRUE FALSE#> 5: 1 25 25 TRUE TRUE#> 6: 1 26 33 TRUE FALSEIn this example, the base probability of myocarditis grows by 0.001with each passing day, but the relative risk of infection given avaccination stays constant. More complex time-dependencies may of coursebe implemented as well. The same procedure could also be used to makethe vaccination probabilities time-dependent as well.
In some scenarios it may also be required to make the effect itselfvary over time. Here, we could make the size of the relative risk ofdeveloping a myocarditis given a recent Covid-19 vaccination dependenton the calender time, again by using thesim_time argumentin the respectiveprob_fun:
prob_myoc<-function(data, P_0, RR_A, sim_time) { RR_A<- RR_A+0.01*sim_timefifelse(data$A_event, P_0*RR_A, P_0)}dag<-empty_dag()+node_td("A",type="time_to_event",prob_fun=0.01,event_duration=20,immunity_duration=150)+node_td("Y",type="time_to_event",prob_fun=prob_myoc,parents=c("A_event"),P_0=0.005,RR_A=3.24)sim<-sim_discrete_time(dag,n_sim=100,max_t=200)data<-sim2data(sim,to="start_stop")head(data)#> .id start stop A Y#> <int> <int> <num> <lgcl> <lgcl>#> 1: 1 1 16 FALSE FALSE#> 2: 1 17 36 TRUE FALSE#> 3: 1 37 200 FALSE FALSE#> 4: 2 1 21 FALSE FALSE#> 5: 2 22 22 FALSE TRUE#> 6: 2 23 58 FALSE FALSEHere, the relative risk increases by 0.01 for each passing day,meaning that the risk of a myocarditis given a recent vaccinationincreases over time.
In all previous examples, it was assumed that the effect of thevaccination on the probability of developing a myocarditis follows astep-function. The risk was instantly elevated by a constant factor(\(RR_A\)) after vaccination, whichlasts for a specified amount of time and then instantly drops back tothe baseline risk. Any other kind of relationship may also be simulated,by again changing theprob_myoc() function accordingly. Thefollowing code may be used:
prob_myoc<-function(data, P_0, RR_A) { RR_A<- RR_A-0.1*data$A_time_since_lastfifelse(data$A_event, P_0*RR_A, P_0)}dag<-empty_dag()+node_td("A",type="time_to_event",prob_fun=0.01,event_duration=20,immunity_duration=150,time_since_last=TRUE)+node_td("Y",type="time_to_event",prob_fun=prob_myoc,parents=c("A_event","A_time_since_last"),P_0=0.005,RR_A=3.24)sim<-sim_discrete_time(dag,n_sim=100,max_t=200)data<-sim2data(sim,to="start_stop")head(data)#> .id start stop A Y#> <int> <int> <num> <lgcl> <lgcl>#> 1: 1 1 6 FALSE FALSE#> 2: 1 7 26 TRUE FALSE#> 3: 1 27 29 FALSE FALSE#> 4: 1 30 30 FALSE TRUE#> 5: 1 31 200 FALSE FALSE#> 6: 2 1 43 FALSE FALSEIn this code,\(RR_A\) decreaseswith each day after a person is vaccinated. In contrast to the previousexample, this happens on a person-specific time-scale and not on a totalcalender time level. To do this properly, we set thetime_since_last argument inside thenode_td()call of the vaccination definition toTRUE. This argumentis supported when using nodes of type"time_to_event". Itadds another column to the dataset which includes the time since thelast vaccination was performed for each individual. This column is thenalso added in theparents vector, so that we can use it inprob_myoc() function. Here we simply substract 0.1 from\(RR_A\) for each day aftervaccination. This essentially means that on the day of vaccinationitself, the relative risk will be 3.24 as specified in theDAG. On the first day after the vaccination, however, therelative risk will only be 3.14 and so forth. In other words, the extentof the adverse effect of the vaccination decreases over time linearly.Again, more complex functions may also be used to model this type ofnon-linear effects.
Another possibility to extent the DGP would be to add more"time_to_event" variables. For example, we may want toadditionally consider the effect of a Covid-19 infection itself, denotedby\(C_t\) here. For simplicity we willassume that Covid-19 has a constant probability of occurrence over timewhich is the same for all individuals.In this example we will assumethat the vaccination reduces the risk of getting a Covid-19 infection to0 for\(d_{immune}\) days after thevaccination was performed. We may use the following structural equationto describe this variable:
\[ C_t \sim Bernoulli(P_{C}(t)),\]
with:
\[ P_{C}(t) = \begin{cases} 0, & \text{if } t \in \left[T_{A}(t), T_{A}(t) +d_{immune}\right] \\ P_{C0}, & \text{otherwise} \end{cases},\]
where\(P_{C0}\) is the baselineprobability of experiencing a Covid-19 infection and\(T_A(t)\) is still defined to be the time ofthe last Covid-19 vaccination as before. In addition to this, we willalso change the definition of the myocarditis node (\(Y_t\)). Instead of being only dependent on\(A_t\), the Covid-19 Infection shouldnow also raise the probability of developing a myocarditis by a constantfactor\(RR_C\) in the\(d_{C.risk}\) days after the Covid-19infection. The structural equation can then be changed to be:
\[ Y_t \sim Bernoulli(P_{Y}(t)),\]
with:
\[ P_{Y}(t) = \begin{cases} P_{Y0} \cdot RR_{A} \cdot RR_{C}, & \text{if } t\in \left[T_{A}(t), T_{A}(t) + d_{A.risk}\right] \text{ and } t\in \left[T_{A}(t), T_{A}(t) + d_{C.risk}\right] \\ P_{Y0} \cdot RR_{C}, & \text{if } t \in \left[T_{A}(t),T_{A}(t) + d_{C.risk}\right] \\ P_{Y0} \cdot RR_{A}, & \text{if } t \in \left[T_{A}(t),T_{A}(t) + d_{A.risk}\right] \\ P_{Y0}, & \text{otherwise} \end{cases},\]
where\(d_{A.risk}\) is the durationafter vaccination in which the risk of developing a myocarditis iselevated by\(RR_A\). The structuralequation for\(A\) are the same asdefined in previous Equations. The following code may be used togenerate data from this DGP:
prob_myoc<-function(data, P_0, RR_A, RR_C) { P_0* RR_A^(data$A_event)* RR_C^(data$C_event)}prob_covid<-function(data, P_0, d_immune) {fifelse(data$A_time_since_last< d_immune,0, P_0,na=P_0)}dag<-empty_dag()+node_td("A",type="time_to_event",prob_fun=0.01,event_duration=20,immunity_duration=150,time_since_last=TRUE)+node_td("C",type="time_to_event",prob_fun=prob_covid,parents=c("A_time_since_last"),event_duration=14,P_0=0.01,d_immune=120)+node_td("Y",type="time_to_event",prob_fun=prob_myoc,parents=c("A_event","A_time_since_last","C_event"),P_0=0.005,RR_A=3.24,RR_C=2.5)sim<-sim_discrete_time(dag,n_sim=100,max_t=200)data<-sim2data(sim,to="start_stop")head(data)#> .id start stop A C Y#> <int> <int> <num> <lgcl> <lgcl> <lgcl>#> 1: 1 1 46 FALSE FALSE FALSE#> 2: 1 47 47 FALSE FALSE TRUE#> 3: 1 48 50 FALSE FALSE FALSE#> 4: 1 51 51 FALSE FALSE TRUE#> 5: 1 52 58 FALSE FALSE FALSE#> 6: 1 59 72 FALSE TRUE FALSEIn this code we first re-define theprob_myoc() functionto include the effect of the Covid-19 infection. This utilizes a smallcomputational trick, relying on the fact that any number to the power of0 is 1 and any number to the power of 1 is itself. SinceTRUE is treated as a 1 andFALSE isinterpreted as a 0 in R, we can simply take the relative risks to thepower of their current event indicator to only multiply the baselinerisk with the respective relative risks if they are currently in therisk durations. Next, we define a new function calledprob_covid() to include the immunity duration, again usingthetime_since_last functionality. By finally addinganothernode_td() call to theDAG object, wewere able to specify the DGP in the desired way.
Previously, only time-dependent variable were included in the DGP. Byadding calls to the simplenode() function to theDAG object it is, however, also possible to additionallyinclude time-independent variables as well. Suppose that we want thebaseline probability\(P_{A0}\) of thevaccination to vary by biological sex in the first DGP described in theCovid example Section. We could do this using the following code:
prob_myoc<-function(data, P_0, RR_A) {fifelse(data$A_event, P_0*RR_A, P_0)}prob_vacc<-function(data, P_0) {fifelse(data$Sex, P_0*2, P_0)}dag<-empty_dag()+node("Sex",type="rbernoulli",p=0.4)+node_td("A",type="time_to_event",prob_fun=0.01,event_duration=20,immunity_duration=150)+node_td("Y",type="time_to_event",prob_fun=prob_myoc,parents=c("A_event"),P_0=0.005,RR_A=3.24)sim<-sim_discrete_time(dag,n_sim=100,max_t=200)data<-sim2data(sim,to="start_stop")head(data)#> .id start stop A Y Sex#> <int> <int> <num> <lgcl> <lgcl> <lgcl>#> 1: 1 1 48 FALSE FALSE FALSE#> 2: 1 49 49 FALSE TRUE FALSE#> 3: 1 50 58 FALSE FALSE FALSE#> 4: 1 59 59 FALSE TRUE FALSE#> 5: 1 60 200 FALSE FALSE FALSE#> 6: 2 1 4 FALSE FALSE TRUEHere, everything is kept the same as in the original example, withthe small change that we now add a call tonode() beforeadding the time-dependent variables to define theSex nodeusing a simple Bernoulli distribution. Additionally, we now had todefine a function that appropriately generates the probabilities ofvaccination perSex. This was done by simply increasing\(P_0\) by a factor of 2 whenever thevalue ofSex isTRUE (which might stand formales or females). It would also be possible to add childtime-independent child nodes as well, but this is left as an exercise tothe interested reader.
The previous examples all included only binary time-dependentvariables. For some applications it may be necessary to used categoricaltime-dependent variables. For example, instead of generating thevaccination status as binary, we may want to model it as a categoricalvariable with multiple levels, indicating which kind of vaccine theperson received (if any). This may be done using the"competing_events" nodetype. Below is asimple example:
prob_myoc<-function(data, P_0, RR_A) {fifelse(data$A_event>0, P_0*RR_A, P_0)}prob_vacc<-function(data) { n<-nrow(data) p_mat<-matrix(c(rep(0.99, n),rep(0.0005, n),rep(0.0005, n)),byrow=FALSE,ncol=3)return(p_mat)}dag<-empty_dag()+node_td("A",type="competing_events",prob_fun=prob_vacc,event_duration=c(20,20),immunity_duration=150)+node_td("Y",type="time_to_event",prob_fun=prob_myoc,parents=c("A_event"),P_0=0.005,RR_A=3.24)sim<-sim_discrete_time(dag,n_sim=100,max_t=200,save_states="all")data<-sim2data(sim,to="start_stop")head(data)#> .id start stop A Y#> <int> <int> <int> <num> <lgcl>#> 1: 1 1 23 0 FALSE#> 2: 1 24 24 0 TRUE#> 3: 1 25 53 0 FALSE#> 4: 1 54 73 1 FALSE#> 5: 1 74 159 0 FALSE#> 6: 1 160 160 0 TRUEThe"competing_events" nodetype works inmuch the same way as the"time_to_event" node type, withthe main difference being that instead of using Bernoulli trials itrelies on multinomial trials, which are equivalent to drawing a simplerandom sample with unequal probabilities for each element. Instead ofsupplying a single probability of success per person and per point intime, theprob_fun supplied to a"competing_events" node is therefore expected to provide avector of probabilities. In the example above, we simply define theprobabilities to be the same for everyone regardless of the simulatedtime. The first category represents “no vaccination” while the next twocategories specify vaccinations with vaccines of different kinds. Notethat in this code, theprob_myoc() function only checkswhether there was any vaccination, so it does not really make adifference whether one uses a"competing_events" or"time_to_event" node here.
To get more useful person specific multinomial probabilities, thenode_multinomial() function may be useful. Because theunderlying code for multinomial trials is a bit more complex than thecode for simple Bernoulli trials, using this type of node may lead to anincreased runtime. The node type is usually most useful when the goal isto generate artificial time-to-event data with mutually exclusive typesof events, also known as competing events or competing risks data. Alsonote that in the example we had to set thesave_statesargument of thesim_discrete_time() function to"all". This is required when the goal is to transform thedata into the start-stop format later, because the DGP does not consistsolely of time-dependent nodes of type"time_to_event".
It is also possible to use continuous variables as time-dependentvariables in the proposed package. The following code gives a verysimple example:
dag<-empty_dag()+node("calories",type="rnorm",mean=2500,sd=150)+node_td("calories",type="gaussian",formula=~1+ calories*1.1,error=1)sim<-sim_discrete_time(dag,n_sim=100,max_t=200,save_states="all")data<-sim2data(sim,to="long")head(data)#> Key: <.id, .time>#> .id .time calories#> <int> <int> <num>#> 1: 1 1 2837.652#> 2: 1 2 3123.649#> 3: 1 3 3436.496#> 4: 1 4 3782.254#> 5: 1 5 4160.574#> 6: 1 6 4577.265In this example, we first generate a normally distributed root nodecalledcalories using a standardnode() call.This represents the value of the variable at\(t = 0\). If we did not specify thisvariable, the code would return an error message at\(t = 1\), because there would be no value ofcalories to use in the subsequently defined regressionmodel. Next, a call tonode_td() is added to theDAG object to specify how this variable changes with eachstep in time. We use a simple linear regression model, where the onlyindependent variable is the last value of the variable itself. In thesubsequent simulation we again setsave_states="all" tospecify that the data at all points in time should be saved. This isnecessary here both because theDAG consists oftime-dependent nodes that are not of type"time_to_event",and because the variable changes at every point in time for everyindividual. Because of the nature of the data, it would also not makesense to transform the data into the start-stop format. Instead, thelong format is choosen here.
The following code gives an example on how events could be simulatedso that a specific order of events is always respected:
prob_bachelors<-function(data) {fifelse(data$highschool_event,0.01,0)}prob_masters<-function(data) {fifelse(data$bachelors_event,0.01,0)}dag<-empty_dag()+node_td("highschool",type="time_to_event",prob_fun=0.01,event_duration=Inf)+node_td("bachelors",type="time_to_event",prob_fun=prob_bachelors,event_duration=Inf,parents="highschool_event")+node_td("masters",type="time_to_event",prob_fun=prob_masters,event_duration=Inf,parents="bachelors_event")sim<-sim_discrete_time(dag,n_sim=100,max_t=200)data<-sim2data(sim,to="start_stop")head(data)#> .id start stop highschool bachelors masters#> <int> <int> <num> <lgcl> <lgcl> <lgcl>#> 1: 1 1 75 FALSE FALSE FALSE#> 2: 1 76 200 TRUE FALSE FALSE#> 3: 2 1 4 FALSE FALSE FALSE#> 4: 2 5 178 TRUE FALSE FALSE#> 5: 2 179 200 TRUE TRUE FALSE#> 6: 3 1 116 FALSE FALSE FALSEIn this specification, we try to simulate the educational status of aperson over time. We assume that individuals can only obtain a bachelorsdegree once they have finished high school and that they can onlyreceive a masters degree once they finished the bachelors degree. Tokeep this order of events in tact, we can split the variable into itsthree constituent parts. First, the indicator whether someone graduatedhighschool is simulated using simple Bernoulli trials asimplemented in the"time_to_event" node type. By settingevent_duration=Inf, we ensure that no one ever “looses”their degree. Only afterwards do we generate whether the same personalso received abachelors degree. By using a standardfifelse() call, we can easily specify that the probabilityof obtaining abachelors degree is 0 for individualswithout ahighschool degree. Subsequently, we do the samefor themasters node.
Again, this simulation could be made much more realistic. Forexample, in the example used here there is no limitations on how closethe different graduations can be to each other. An individual mightreceive all three degrees in the same time unit. This could be changedby using thesim_time argument in the definition of theprobability functions, as discussed earlier.