
Flexible model selection for mechanistic network models

Sixing Chen 1, Antonietta Mira 2, Jukka-Pekka Onnela 1
Editor: Matjaz Perc
1Department of Biostatistics, T.H. Chan School of Public Health, Harvard University 655 Huntington Avenue, Building 2, 4th Floor, Boston, MA 02115, USA
2Data Science Lab, Institute of Computational Science, Università della Svizzera italiana Via Buffi 6, 6900 Lugano, Switzerland and Dipartimento di Scienza e Alta Tecnologia, Università degli Studi dell’Insubria Via Valleggio, 11 - 22100 Como, Italy

Corresponding author. Email: onnela@hsph.harvard.edu


Received 2019 Mar 25; Accepted 2019 Jun 24; Issue date 2020 Apr.

© The authors 2019. Published by Oxford University Press. All rights reserved.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

PMCID: PMC7391990; PMID: 32765880

Abstract

Network models are applied across many domains where data can be represented as a network. Two prominent paradigms for modelling networks are statistical models (probabilistic models for the observed network) and mechanistic models (models for network growth and/or evolution). Mechanistic models are better suited for incorporating domain knowledge, studying the effects of interventions (such as changes to specific mechanisms) and forward simulating, but they typically have intractable likelihoods. As such, and in stark contrast to statistical models, there is a relative dearth of research on model selection for such models despite the otherwise large body of extant work. In this article, we propose a simulator-based procedure for mechanistic network model selection that borrows aspects from Approximate Bayesian Computation, along with a means to quantify the uncertainty in the selected model. To select the most suitable network model, we consider and assess the performance of several learning algorithms, most notably the so-called Super Learner, which makes our framework less sensitive to the choice of a particular learning algorithm. Our approach takes advantage of the ease of forward simulation from mechanistic network models to circumvent their intractable likelihoods. The overall process is flexible and widely applicable. Our simulation results demonstrate the approach's ability to accurately discriminate between competing mechanistic models. Finally, we showcase our approach with a protein–protein interaction network model from the literature for yeast (Saccharomyces cerevisiae).

Keywords: mechanistic network model, model selection, Super Learner, likelihood-free methods

1. Introduction

Many systems of scientific and societal interest can be represented as networks, and network models are used, among many other application areas, to study social networks, communication patterns, scientific citations and protein–protein interactions (PPIs) [1–5]. There are (at least) two prominent paradigms for the modelling of networks, which we refer to as the statistical approach and the mechanistic approach. In the statistical approach, one describes a model that specifies the likelihood of observing a given network, that is, these are probabilistic models of data that take the shape of a network [6–8]. In the mechanistic approach, one specifies a collection of domain-specific microscopic mechanistic rules, informed by scientific understanding of the problem, that are used to grow or evolve the network over time [9–14]. Both modelling approaches provide distinct advantages.

In mechanistic models, a particular generative mechanism may seem like a strong assumption in a context where one does not directly observe the formation of the network, and where it is difficult to study microscopic interactions in isolation. For example, it might be difficult to learn about the mechanistic rules that govern the formation or dissolution of ties in in-person interactions, whereas doing so in an online social network or a phone communication network [15–17], where every interaction can be recorded, is feasible. In some biological networks, the pairwise interactions between the elements are well understood both theoretically and experimentally, they can be studied in isolation, and these interactions are reproducible. For example, gene duplication is one of the main drivers of the evolution of genomes and it is well understood; therefore, perhaps not surprisingly, network models based on gene duplication were among the first large-scale models used in systems biology [5].

In comparison, common statistical models have limitations in the structures they are able to accommodate [18], and fitting and sampling from some of these models can be difficult. For example, the popular class of exponential random graph models (ERGMs) may not always be consistent under sampling [19]. Mechanistic models do not suffer from these limitations as much, since generation of network structures from a handful of mechanisms is usually computationally inexpensive, so it is relatively simple to sample or forward simulate (to simulate observations given parameter values) from a particular model. Another advantage of mechanistic models is the ease with which one can incorporate domain knowledge in the model. Since the modeler is in control of the mechanisms to include, one is able to encode relevant domain knowledge of known or hypothesized interactions between actors in the system as mechanistic rules. The duplication-divergence models for PPI networks are good examples of this [5]. Additionally, it is easy to study the effects of interventions on the behaviours of the actors by directly modifying the model’s mechanistic rules to reflect the intervention [8,20].

While there is an extensive literature on mechanistic models in network science, there is a comparative dearth of work on model selection for mechanistic models [21–24]. The aim of this article is to provide a general framework for model selection for mechanistic network models. For instance, given a full model that has an array of different generative mechanisms, we are interested in selecting between different submodels, each possessing only a subset of the mechanisms of the full model. Traditional likelihood-based model selection, frequentist or Bayesian, is not applicable to mechanistic models because in most cases their likelihood functions are intractable. The main reason for this is that in mechanistic models one must consider all the possible paths to generate any one particular network realization, which leads to a combinatorial explosion save for the most trivial settings. As such, one must resort to likelihood-free approaches.

A recent likelihood-free approach to both inference and model selection for problems with intractable likelihoods is Approximate Bayesian Computation (ABC) [25–28]. This Bayesian approach aims at calibrating the model parameters to obtain an approximate posterior distribution for the parameters of interest for a given observed data set. Following Bayes' theorem, the posterior is obtained by combining information from the prior distribution on the parameters and the observed data set as incorporated by the likelihood. ABC inference starts by generating samples of possible parameter values from the prior. For each sample, one forward simulates a data set from the model, where the nature of the model, statistical or mechanistic, is not relevant. Then, given a distance measure between the observed data (the target of inference and/or model selection) and the model generated data, one accepts only those data sets from the model that are within a certain distance from the observed data. The parameters sampled from the prior corresponding to these accepted data sets form an approximation to the posterior distribution. Unless the data are discrete and of low dimension, it is usually necessary to base the distance measure on summary statistics of the data [29]. Model selection with ABC is similar but includes an additional layer of hierarchy and a prior for the candidate model indices [30]. More generally, there is a scarcity of likelihood-free model selection methods outside the realm of ABC. Toni et al. [31] extend the sequential Monte Carlo ABC of Sisson et al. [32] to ABC model choice, while Lee et al. [33] do so with the approach of Fearnhead and Prangle [34]. Other recent innovations include an adaptive approach [35] and a random forest-based approach [36].

The main difficulty in practice for ABC arises when selecting the summary statistics as well as the threshold on the distance. In general, any quantity that can be calculated from the observed data is called a statistic; if the likelihood depends on the data only through some low-dimensional statistic, so that we need only this statistic to calculate the likelihood for any value of the parameter, the statistic is said to be sufficient. In the context of ABC, if the selected summary statistics are sufficient for the parameters of the model and the distance threshold is zero, that is, only parameter values that have given rise to data sets with values of sufficient summary statistics exactly matching those of the observed data set are retained, then these accepted parameter samples will be from the true posterior [27,37]. However, should one fall short on either of these, then the accepted parameter samples will only be from an approximation to the true posterior. When available, relevant domain knowledge can be applied to guide the selection of ‘essentially sufficient’ summary statistics, which capture features attributable to the inferential objects of interest. However, since likelihood-free approaches like ABC are only needed with analytically or computationally intractable likelihoods, it will typically be difficult to find the sufficient statistics in the absence of domain knowledge, though there is previous work on how to select good summary statistics for an ABC procedure in such settings [38,39]. As for the threshold on distance, the smaller the distance threshold, the lower the acceptance rate, and hence the greater the computational burden to generate a reasonable number of accepted samples. In fact, outside of using discrete summary statistics, it may be impractical to use a distance threshold of zero. As a result, the performance of ABC inference can suffer due to the inaccuracy of the resulting posterior [32].

ABC model choice suffers from these same issues with respect to the model index, which becomes an additional model parameter on which inference is required. Even if one were to select statistics that are marginally sufficient for the candidate models, they may not be jointly sufficient for the overall model (the overall hierarchical model that is indexed by the candidate model index and the corresponding parameters of each candidate model) and thus may not be able to discriminate among the various models under comparison, save for some special cases [30,40]. Lastly, even if one does manage to select all the sufficient statistics and conduct model selection with ABC, there may be a large discrepancy between the resulting ABC Bayes factor and the true Bayes factor [40].

Instead of dealing with issues stemming from the inaccuracy of the ABC posterior and Bayes factor, we propose a procedure for model selection that borrows from ABC the aspect of data generation from candidate models. Just as in ABC model choice, the data forward simulated from each candidate model will be the basis for model selection as it becomes the training data, but rather than using a full Bayesian approach, we propose to conduct model selection with a flexible learning algorithm. We assess the performance of different learning algorithms, including the Super Learner (SL) [41,42]. Originally proposed for prediction in a regression setting, SL is an ensemble learning algorithm that makes a prediction by combining the predictions from a library of candidate algorithms. Given a particular loss function, SL aims to minimize the expected loss, called the risk. The discrete SL simply picks the candidate algorithm that has the lowest cross-validated risk over the training data, whereas the full SL creates the convex combination of the candidate algorithm-specific estimates that has the lowest cross-validated risk. Given a bounded loss function, both the discrete and the full SL have the so-called oracle property, meaning that asymptotically (in the number of training data sets) they perform at least as well as the optimal candidate algorithm available in the library and as the optimal convex combination of the candidate algorithms, respectively [43,44]. Due to the intractable nature of the likelihood and the ease of forward simulation, model selection with mechanistic network models lends itself well to the flexibility of the SL.

Our proposed approach shares similarities with that of Pudlo et al. [36], which is a random forest-based ABC approach for model selection that is fairly robust to the choice of summary statistics. Their approach measures performance with the prior error rate, which is the probability of selecting the wrong model averaged over the prior. In contrast, in our approach with the SL, the choice of performance measure is flexible and can be encoded directly into the loss function. For example, the prior error rate implicitly weighs misclassification differently for each model due to its sensitivity to the choice of the prior. Should one desire a measure that does not discriminate between misclassification of different models, one can use a measure like the area under the receiver operating characteristic curve (AUC) [45,46]. Furthermore, SL can make use of a host of candidate algorithms, including random forest, to perform the classification. Random forest is a very flexible learning algorithm, but it may not perform well in all settings. One can guard against this by having more candidate algorithms in the SL library. To handle uncertainty in the selected model, we adapt the regression-based approach used by Pudlo et al. [36].

The rest of the article is organized as follows. In Section 2, we provide a brief overview of SL as well as the procedure for model selection in the context of mechanistic network models. We also introduce and motivate a simple mechanistic network model as a proof of concept in our subsequent simulations. In Section 3, we lay out the details of the simulations as well as the results, and evaluate the performance of our approach. In Section 4, we apply our approach to the selection between two candidate models for a PPI data set. Finally, in Section 5, we conclude with discussions and suggestions for future work.

2. Methods and materials

2.1 Overview of SL

The SL is a fairly recent ensemble machine learning method introduced by van der Laan et al. [42]. Given a particular loss function $L$, the SL aims to minimize the expected loss $E[L]$, known as the risk, with respect to the distribution of the training data with a prediction algorithm composed of a library of candidate algorithms $\{\psi_1, \dots, \psi_K\}$, such as random forest (RF) and support vector machine (SVM). The procedure begins by partitioning the training data, with predictors $X$ and outcome $Y$, into $V$ validation sets. In our setting, when applied to model selection for mechanistic network models, summary statistics computed on forward simulated network realizations play the role of predictors and network model indices play the role of outcome in SL. The predictors and outcome of the $v$th validation set are referred to as $X_v$ and $Y_v$, respectively, while those of the corresponding training set, that is, the union of the remaining $V - 1$ validation sets, are referred to as $X_{-v}$ and $Y_{-v}$. Note the distinction between the training data ($X$ and $Y$) and a training set ($X_{-v}$ and $Y_{-v}$). For the $v$th validation set, each candidate algorithm $\psi_k$ of the library is trained on $(X_{-v}, Y_{-v})$. The resulting trained candidate algorithm $\hat{\psi}_{k,-v}$ is then evaluated at $X_v$, giving prediction $\hat{\psi}_{k,-v}(X_v)$. After training each candidate algorithm on each training set and evaluating it on the corresponding validation set, a new data set $Z$ is formed with the cross-validated predicted outcomes $\hat{\psi}_{k,-v}(X_v)$ generated using all candidate algorithms in the library and the true outcomes $Y_v$ from all validation sets, where the former serve as the new predictors and the latter as the new outcome. $Z$ is used to estimate the cross-validated risk. This cross-validation procedure is intended to prevent overfitting, and this new data set will be the basis for the final prediction algorithm.

To arrive at the final prediction via $Z$, in the case of the discrete SL, the candidate algorithm with the smallest estimated cross-validated risk is chosen for final prediction. Assuming a regression setting and a squared error loss function, the cross-validated risk for the $k$th candidate algorithm can be estimated as

$$\hat{R}_k = \frac{1}{V} \sum_{v=1}^{V} \frac{1}{n_v} \sum_{i=1}^{n_v} \big( Y_{v,i} - \hat{\psi}_{k,-v}(X_{v,i}) \big)^2,$$

where the first summation is over the $V$ validation sets and the second is over the $n_v$ observations in the $v$th validation set. In the case of the full SL, the estimated risk is minimized over all convex combinations of the candidate algorithms. In the same regression setting with squared loss and a particular convex combination $\alpha = (\alpha_1, \dots, \alpha_K)$, where $\alpha_k \geq 0$ and $\sum_{k=1}^{K} \alpha_k = 1$, the cross-validated risk is estimated as

$$\hat{R}_{\alpha} = \frac{1}{V} \sum_{v=1}^{V} \frac{1}{n_v} \sum_{i=1}^{n_v} \Big( Y_{v,i} - \sum_{k=1}^{K} \alpha_k \hat{\psi}_{k,-v}(X_{v,i}) \Big)^2.$$

Once the final prediction algorithm is determined, that is, the candidate algorithm $\psi_{k^*}$ with the smallest estimated risk in the discrete SL or the minimizing weight vector $\hat{\alpha}$ in the full SL, each candidate algorithm is refit on the entire training data in order to predict an outcome for $X_0$, the observed predictors. Each resulting trained candidate algorithm $\hat{\psi}_k$ is then evaluated at $X_0$, giving prediction $\hat{\psi}_k(X_0)$. The final prediction is $\hat{\psi}_{k^*}(X_0)$ in the discrete SL, or $\sum_{k=1}^{K} \hat{\alpha}_k \hat{\psi}_k(X_0)$ in the full SL. In the classification setting, such as our model selection framework, the $\hat{\psi}_k(X_0)$ are the candidate algorithm-specific scores for each class and $Y$ the true class, with the loss and risk defined according to the chosen loss function. The final prediction $\hat{\psi}_{k^*}(X_0)$ or $\sum_k \hat{\alpha}_k \hat{\psi}_k(X_0)$ is then the score for each potential class. Figure 1 gives a visual representation of the SL framework.

Fig. 1.

Schematic of the SL framework. Note that $X$ and $Y$ are the model generated training data for network model selection, while $X_0$ are the data from the observed network.
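To make the two cross-validated risk estimates above concrete, the following sketch contrasts the discrete and the full SL in a toy regression setting. This is our illustration, not the authors' code (the paper uses the R SuperLearner package); the data are synthetic and the two scikit-learn regressors simply stand in for a candidate library, with the full-SL weights found by a coarse grid search rather than a proper constrained optimization.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                       # synthetic predictors
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=500)    # synthetic outcome

candidates = {
    "knn": KNeighborsRegressor(n_neighbors=10),
    "rf": RandomForestRegressor(n_estimators=200, random_state=0),
}

# Cross-validated predictions Z: one column per candidate algorithm.
V = 5
Z = np.zeros((len(y), len(candidates)))
for fit_idx, val_idx in KFold(n_splits=V, shuffle=True, random_state=0).split(X):
    for j, algo in enumerate(candidates.values()):
        algo.fit(X[fit_idx], y[fit_idx])            # train on (X_{-v}, Y_{-v})
        Z[val_idx, j] = algo.predict(X[val_idx])    # evaluate at X_v

# Discrete SL: pick the candidate with the smallest cross-validated squared-error risk.
risks = ((Z - y[:, None]) ** 2).mean(axis=0)
best = list(candidates)[int(np.argmin(risks))]
print("risks:", dict(zip(candidates, risks.round(3))), "-> discrete SL picks:", best)

# Full SL: convex combination of the candidates minimizing the cross-validated risk
# (coarse grid over the two-candidate simplex for illustration only).
grid = np.linspace(0, 1, 101)
comb_risks = [np.mean((w * Z[:, 0] + (1 - w) * Z[:, 1] - y) ** 2) for w in grid]
w_best = grid[int(np.argmin(comb_risks))]
print("full SL weights:", {"knn": round(w_best, 2), "rf": round(1 - w_best, 2)})
```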

2.2 Model selection framework

Next, we introduce the proposed procedure for mechanistic network model selection within the SL framework. In this setting, the SL will be based on training data forward simulated from each candidate mechanistic network model in order to predict the model index (outcome) from a set of chosen network statistics (predictors). Before going ahead with the SL procedure, one needs to determine the appropriate loss function. Since the prediction is for the model index, this is a classification problem, and therefore a loss function like squared loss is no longer appropriate. Instead, we propose to use the so-called rank loss, the loss function associated with the AUC, the area under the receiver operating characteristic curve. The AUC is an appropriate measure of the quality of the classification since it does not depend on the distribution of the model index in the data for performance evaluation. The corresponding loss function is bounded, so the resulting SL will retain the oracle property [43,44].

Algorithm 1 lays out the procedure we propose for model selection using SL, and Fig. 2 is the corresponding schematic. In addition to selecting the candidate algorithms, one needs to select the all-important network summary statistics on which to train the candidate algorithms, in steps 4 and 6 of Algorithm 1. As previously stated in Robert et al. [40], even if the sufficient statistics of all candidate models are selected, they may not be jointly sufficient for the overall model. Since sufficiency will be difficult or impossible to achieve, it cannot be a criterion for choosing these statistics; rather, the criterion should be their ability to characterize the differences between data generated by the candidate models and thus to discriminate among the models. Suppose one is trying to select between a full model and one of its submodels that has one of the mechanisms of the full model removed or 'turned off'. In this case, one needs to consider the characteristics of the network that the missing mechanism affects. The statistics chosen as predictors in steps 4 and 6 should reflect the characteristics affected by the differences between the models, either the different mechanisms themselves or different parameter values of the included mechanisms. For instance, should the submodel be missing the triadic closure mechanism, which refers to a mechanism that 'closes' a triad of three nodes and two edges by adding the third edge, one would expect statistics related to clustering to be affected. The ability to characterize these differences will determine the performance of the overall procedure and, thus, should guide the selection of the summary statistics. Though the candidate models we consider here are nested, they do not need to be in general to use this framework. Finally, in step 7 of Algorithm 1, evaluating the trained SL on the predictors for the observed network gives a score with which to classify the observation. In binary classification, that is, having two candidate models, one model will be nominally 'negative' and represented by 0 in the training of the SL, while the other will be nominally 'positive' and represented by 1. In this case, the score will be bounded between 0 and 1, with scores above a user-defined cut-off classified as 1, or the 'positive' model, and vice versa. For example, a cut-off of 0.5 can be used for this binary situation, as we did in our simulations. Note that due to the insensitivity of the AUC to the model index prior in the training data, no prior should be placed on the model index. In contrast, priors for parameters of candidate models should be considered.

Fig. 2.

Schematic of the model selection procedure with SL. Note that steps 1–6 of Fig. 1 are contained in step 4 of Fig. 2. The arrow between $X_0$ and the candidate models is merely to indicate the influence of the empirical network on the candidate models, that is, through model fitting.

Algorithm 1 Steps for proposed mechanistic network model selection via SL:

  1. Select relevant statistics that highlight differences between models as predictors.

  2. Generate training data from all models of interest.

  3. Split the training data into cross-validation sets.

  4. Train/evaluate each candidate algorithm on each training/validation pair based on the selected predictors.

  5. Train SL on the results from each candidate algorithm.

  6. Train each candidate algorithm on the entire training data set.

  7. Classify/select model for observed network based on the results from steps 5 and 6 of this algorithm using a preselected cut-off on the score.
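As a rough illustration of steps 2–7, the sketch below is our own and not the authors' implementation (which used the R SuperLearner package): scikit-learn classifiers stand in for the candidate library, a discrete-SL-style choice by cross-validated AUC replaces the full SL combination, and the arrays of summary statistics are placeholders for the quantities that steps 1–2 would actually produce.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
# Step 2 (placeholder): rows = simulated networks, columns = summary statistics;
# label 0 = submodel, 1 = full model.
X = np.vstack([rng.normal(0.0, 1.0, size=(1000, 5)),
               rng.normal(0.3, 1.0, size=(1000, 5))])
y = np.repeat([0, 1], 1000)

library = {
    "knn": KNeighborsClassifier(n_neighbors=10),
    "svm": SVC(kernel="rbf", probability=True),
    "rf": RandomForestClassifier(n_estimators=500, random_state=0),
}

# Steps 3-4: 5-fold cross-validated class scores for each candidate algorithm.
Z = {name: cross_val_predict(algo, X, y, cv=5, method="predict_proba")[:, 1]
     for name, algo in library.items()}

# Step 5 (discrete-SL flavour): rank candidates by cross-validated AUC,
# i.e. by the rank loss 1 - AUC.
cv_auc = {name: roc_auc_score(y, z) for name, z in Z.items()}
best = max(cv_auc, key=cv_auc.get)
print("cross-validated AUC:", {k: round(v, 3) for k, v in cv_auc.items()}, "-> selected:", best)

# Step 6: refit every candidate on the entire training data.
for algo in library.values():
    algo.fit(X, y)

# Step 7: score the observed network's summary statistics (placeholder vector)
# and classify with the 0.5 cut-off.
x_obs = rng.normal(0.3, 1.0, size=(1, 5))
score = library[best].predict_proba(x_obs)[0, 1]
print("score:", round(score, 3), "-> model", int(score > 0.5))
```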

In order to perform uncertainty quantification, that is, to quantify the confidence in the model selected through Algorithm 1, we adapt a procedure for the same purpose from Pudlo et al. [36]. In Pudlo et al. [36], for each element in the training data, one needs to produce an out-of-bag (OOB) classifier, that is, a random forest classifier based only on trees that do not involve (were not fitted using) the given element. OOB shares similarities with cross-validation, which SL does in steps 1–3 of Fig. 1, but step 4 of Fig. 1 involves the use of the whole data set. This means that the base SL is not completely cross-validated. To overcome this, we propose to first split the training data into $V$ subsets $D_1, \dots, D_V$, and the classifier for any element in $D_v$ is defined as the SL trained on $D_{-v}$, that is, the training data without $D_v$. For each element of the training data, with $\mathbb{1}(\cdot)$ denoting the indicator function, we compute $\mathbb{1}(\hat{m} = m)$, where $\hat{m}$ is the model selected by the corresponding classifier and $m$ is the true model index. Now, one can build a binary regression model (since $\mathbb{1}(\hat{m} = m)$ is a binary indicator for correct model classification) for $\mathbb{1}(\hat{m} = m)$ by regressing the indicator, computed for all elements of the training data, on the predictors $X$. This regression model can be simply logistic regression or a SL with the correctly specified loss function. Lastly, one can take the fitted value of this regression model at $X_0$, the predictors for the observed network, as the estimate of the probability that the model selected at $X_0$ is the unobserved true model index for the observed network.
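A minimal sketch of this uncertainty quantification is given below. It is our own illustration under simplifying assumptions: a single random forest stands in for the cross-validated SL classifier, logistic regression plays the role of the binary regression model, and the summary statistics and observed-network vector are placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, size=(1000, 5)),
               rng.normal(0.3, 1.0, size=(1000, 5))])   # placeholder summaries
y = np.repeat([0, 1], 1000)                              # true model indices

# Split the training data into V subsets; classify each subset with a
# classifier trained on its complement (a stand-in for the fully cross-validated SL).
V = 5
correct = np.zeros(len(y), dtype=int)
for rest_idx, held_idx in KFold(n_splits=V, shuffle=True, random_state=0).split(X):
    clf = RandomForestClassifier(n_estimators=300, random_state=0)
    clf.fit(X[rest_idx], y[rest_idx])
    m_hat = (clf.predict_proba(X[held_idx])[:, 1] > 0.5).astype(int)
    correct[held_idx] = (m_hat == y[held_idx]).astype(int)   # indicator 1(m_hat == m)

# Binary regression of the correctness indicator on the predictors.
reg = LogisticRegression(max_iter=1000).fit(X, correct)

# The fitted value at the observed network's predictors estimates the probability
# that the selected model is the true one.
x_obs = rng.normal(0.3, 1.0, size=(1, 5))
print("estimated probability of correct selection:",
      round(reg.predict_proba(x_obs)[0, 1], 3))
```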

3. Simulation study

3.1 Variant of the Erdős–Rényi model

As a proof of concept for our framework for mechanistic model selection, we introduce a simple mechanistic model. The basis for the model is the classic Erdős–Rényi (ER) model [47]. In the ER model, the number of nodes $n$ is fixed, and there are two variants on how edges are placed in the graph. In one variant, sometimes called the $G(n, p)$ model, each of the $\binom{n}{2}$ ($n$ choose 2) possible edges is independent and included in the graph with probability $p$, so the number of edges in the graph has a binomial distribution. In the other variant, sometimes called the $G(n, m)$ model, the number of edges in the graph $m$ is also fixed. In this case, the random graph has a uniform distribution over all $\binom{\binom{n}{2}}{m}$ possible graphs with $n$ nodes and $m$ edges.

Our model takes elements from both variants of the ER model. The model generates random graphs with a fixed number of nodes and edges just like the second variant of the ER model, but each edge is added one at a time with a certain probability akin to the first variant. At each step of graph generation, we select a pair of unconnected nodes uniformly from all such node pairs, and we connect them with an edge with a given probability. This process is repeated until the required number of edges $m$ has been added. If the probability for adding each edge was always fixed, then this model would be the same as the second variant of the ER model. Instead, in our model, there is a base probability $p_0$ for edge placement, but two additional mechanisms are included to allow this probability to vary. The first mechanism is triadic closure: should connecting the two selected nodes with an edge close a triad, then the probability will be increased by $p_1$ over the base probability for adding the edge. We dub the second mechanism 'triadic closure plus': should connecting the two selected nodes with an edge close more than one triad, then the probability will be further increased by $p_2$ for each potentially closed triad in excess of one. Should the sum $p_0 + p_1 + (k - 1)\,p_2$, where $k$ is the number of closeable triads, exceed 1, it will be interpreted as 1. Figure 3 illustrates these two additional mechanisms. Though this model is fairly simple, each mechanism can be motivated by consideration of the domain of application. In a friendship network, the first mechanism corresponds to the idea that two people are more likely to become friends if they have a mutual friend, while the second mechanism further increases the likelihood for each additional mutual friend. As such, these mechanisms can also be related to the so-called weak ties hypothesis [48], and it has been shown that higher proportions of shared friends are associated with greater tie strengths in large-scale communication networks [16].

Fig. 3.

The probability to add an edge between two unconnected nodes in our variant of the ER model with two additional mechanisms when there are no closeable triads (left), when there is one closeable triad (middle) and when there is more than one closeable triad (right). Here, $p_0$ is the base probability for edge placement, $p_1$ is the increase in the probability for closing at least one triad and $p_2$ is the increase for each additional closeable triad after the first.
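Since the paper states that the mechanistic model was coded in Python with NetworkX, a minimal generator along the lines of the description above might look as follows. This is our reconstruction, not the authors' code; the parameter names p0, p1 and p2 follow the notation used here, and the specific parameter values in the example call are arbitrary placeholders.

```python
import random
import networkx as nx

def er_variant(n, m, p0, p1, p2, seed=None):
    """Grow a graph on n nodes by adding m edges one at a time.

    At each step an unconnected node pair is chosen uniformly at random and
    connected with probability p0, increased by p1 if the edge would close at
    least one triad and by p2 for every additional closeable triad beyond the
    first (the probability is capped at 1).
    """
    rng = random.Random(seed)
    G = nx.empty_graph(n)
    while G.number_of_edges() < m:
        u, v = rng.sample(range(n), 2)
        if G.has_edge(u, v):
            continue                                   # only unconnected pairs
        k = len(set(G[u]) & set(G[v]))                 # number of closeable triads
        prob = p0 + (p1 if k >= 1 else 0.0) + max(k - 1, 0) * p2
        if rng.random() < min(prob, 1.0):
            G.add_edge(u, v)
    return G

G = er_variant(n=100, m=500, p0=0.05, p1=0.2, p2=0.01, seed=1)
print(G.number_of_nodes(), G.number_of_edges())
```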

3.2 Model selection

We next assess the performance of our model selection framework through simulation. We use various candidate algorithms, both on their own and as members of the SL candidate algorithm library, to select between the full model with both triadic mechanisms vs. the submodel with only the first triadic closure mechanism. Networks generated from both models have 100 nodes and a common base edge probability $p_0$. The probability of edge placement increases by $p_1$ for the first closed triad and by $p_2$ for every subsequent triad, and we vary $p_2$ over the values 0.005, 0.01, 0.03 and 0.05. We vary the number of edges for the two models over the values 500, 1000 and 2000. For a given number of edges, the variation of $p_2$ from 0.005 to 0.05 makes the differences caused by the additional mechanism easier to detect. For a given value of $p_2$, the variation of the number of edges from 500 to 2000 means that there will be more opportunities for the additional mechanism to manifest itself, also making it easier to detect. The simulation studies iterate through each combination of values of $p_2$ and edge count to see the interplay between the two factors.

In the simulations, SL used a library of three candidate algorithms: $k$-nearest neighbours (KNN), SVM and RF. These three candidate algorithms are chosen largely for their ability to handle collinear predictors, which are often present in summary statistics for networks. Given an observed sample for classification, KNN determines which $k$ samples in the training data are closest to the observed sample, based on some distance measure in the predictor space, a common choice being the Euclidean distance. The predicted class is the most frequent class among the $k$ nearest neighbours. Unlike KNN, which essentially formulates a new decision rule for each observation, SVM seeks to formulate a single decision rule by separating the space of the predictors with a set of hyperplanes that segregate the space class-wise. Once the class-wise segregation of the predictor space is complete, a new data point is classified based on the class label of the subset of the predictor space it falls in. Lastly, RF seeks to create a set of decision trees (the 'forest') from the training data in order to arrive at the final prediction. To build each tree, a bootstrap sample of the training data is taken to form the root. Then, at each node of each decision tree, a subset of the predictors is selected at random, and a 'best' split is determined for these predictors in order to form its daughter nodes. Typically, the quality of the split is measured by the amount of homogeneity in each daughter node. Given an observation, each tree in the forest is traversed and gives a class label for the observation based on the most frequent class among the samples in the leaf of the tree where the observation is assigned. An observation is then classified by the forest as the mode of all the tree-wise decisions.

In addition to the library of candidate algorithms, choosing appropriate predictors is an important task. As discussed in the previous sections, sufficient summary statistics are difficult to obtain in all but the most trivial mechanistic network models, and one should aim to use summary statistics that characterize the differences between the candidate models. In the simulations presented here, there are five summary statistics chosen as predictors. The first predictor is the triangle (closed triad) count, which is an obvious choice since the additional mechanism in the full model will favour edges that close multiple triads. The second is the average local clustering coefficient over all nodes. The local clustering coefficient of a node is a measure of how close its neighbours are to forming a complete subgraph by themselves, that is, having every possible edge between any two neighbours. If the addition of an edge between nodes $i$ and $j$ will close multiple triads, then, without loss of generality, from the point of view of node $i$, the addition of the edge would mean the addition of a single neighbour, $j$, and the addition of multiple edges amongst $i$'s updated set of neighbours, namely between $j$ and the shared neighbours of $i$ and $j$. In scenarios with lower total edge counts, where the degree of either $i$ or $j$ is likely to be low, this could lead to a potentially large change in the local clustering coefficient. Lastly, the additional mechanism is a rich-get-richer scheme in terms of the degree of a node, since the more closeable triads a pair of nodes has, the higher their degrees, which leads to a higher probability of both getting an increase to their degrees with an additional edge. Thus, this mechanism is likely to affect the degree distribution of the network. As a proxy to the full degree distribution, the three quartiles (25%, 50%, 75%) of the degree distribution are included as predictors.
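For concreteness, the five predictors described above can be computed with NetworkX roughly as follows. This is our sketch; the function name is ours, and the generated graph is only a stand-in for a realization of the mechanistic model.

```python
import numpy as np
import networkx as nx

def summary_stats(G):
    """Five predictors: triangle count, mean local clustering coefficient,
    and the three quartiles of the degree distribution."""
    triangles = sum(nx.triangles(G).values()) // 3   # each triangle is counted at all three vertices
    avg_clust = nx.average_clustering(G)             # mean local clustering coefficient
    degrees = np.array([d for _, d in G.degree()])
    q25, q50, q75 = np.percentile(degrees, [25, 50, 75])
    return np.array([triangles, avg_clust, q25, q50, q75])

G = nx.gnm_random_graph(100, 500, seed=0)            # placeholder network realization
print(summary_stats(G))
```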

The mechanistic model was coded in Python using NetworkX. The training of SL, with a 5-fold cross-validation, was done with the R package SuperLearner, which contains wrappers for the chosen candidate algorithms. The parameters of the candidate algorithms themselves were kept at their default values (the number of neighbours for KNN; the number of trees, with a minimum terminal node size of 1, for RF; classification-type SVM with a radial kernel; see the SuperLearner package documentation for more details). For a given combination of edge count and value of $p_2$, we generated 10,000 training samples from the full model (with both triadic closure mechanisms) and the submodel (with only the first triadic closure mechanism) as training data. Rather than using a separate sample to assess performance, the AUC of SL as well as that of each candidate algorithm was estimated via a 10-fold cross-validation. Note that this is a separate stage of cross-validation from that of the training of SL itself. First, the training data is partitioned into 10 validation sets, which are used for performance cross-validation. Then, for each of the ten performance validation sets, a 5-fold cross-validated SL is trained on the union of the remaining nine performance validation sets, and then used to predict the model index for the given performance validation set. The AUC measure is computed for each of the 10 performance validation sets and averaged.
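Because there are two nested levels of cross-validation, the outer, performance-assessment level is easy to conflate with the SL's internal cross-validation. The sketch below (ours, with a single random forest standing in for the internally 5-fold cross-validated SL and placeholder data in place of the simulated network summaries) shows only the outer level: ten stratified folds, an AUC per fold, and the average reported.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, size=(1000, 5)),
               rng.normal(0.3, 1.0, size=(1000, 5))])   # placeholder summaries
y = np.repeat([0, 1], 1000)                              # model indices

aucs = []
for fit_idx, perf_idx in StratifiedKFold(n_splits=10, shuffle=True,
                                         random_state=0).split(X, y):
    # In the paper a 5-fold cross-validated SL is trained here; a single
    # classifier is used in this sketch to keep the example short.
    clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X[fit_idx], y[fit_idx])
    scores = clf.predict_proba(X[perf_idx])[:, 1]
    aucs.append(roc_auc_score(y[perf_idx], scores))

print("cross-validated AUC:", round(float(np.mean(aucs)), 3))
```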

The cross-validated estimate of the AUC for the full SL, the discrete SL and each candidate algorithm in each simulation scenario is summarized in Fig. 4, with numerical results in Table 1. In general, performance deteriorates as the value of $p_2$ decreases, which is no surprise as the effect of the additional mechanism diminishes as $p_2$ gets smaller, and thus the two models become harder to distinguish. In contrast, performance improves as the edge count increases, as the mechanism has more opportunities to manifest itself with more edges. In general, the model selection problem is easier for larger networks.

Fig. 4.

Cross-validated AUC for each method: full SL (fSL, red), discrete SL (dSL, blue), SVM (grey), random forest (RF, grey) and KNN (grey), in each simulation scenario.

Table 1.

Cross-validated AUC from the simulations

        $p_2$ = 0.005   $p_2$ = 0.01   $p_2$ = 0.03   $p_2$ = 0.05
EC = 500
fSL     0.50786         0.51941        0.58896        0.65630
dSL     0.50575         0.51950        0.58911        0.65585
SVM     0.50097         0.49936        0.50504        0.55592
RF      0.50826         0.51950        0.58911        0.65585
KNN     0.50353         0.51158        0.56172        0.63093
EC = 1000
fSL     0.57071         0.65993        0.90814        0.97895
dSL     0.57055         0.65738        0.90445        0.97798
SVM     0.50897         0.55834        0.90445        0.97586
RF      0.57055         0.65738        0.90248        0.97798
KNN     0.55596         0.64410        0.89605        0.97275
EC = 2000
fSL     0.74179         0.90348        0.99839        0.99947
dSL     0.73703         0.90067        0.99769        0.99917
SVM     0.67271         0.90067        0.99709        0.99879
RF      0.73703         0.89573        0.99769        0.99917
KNN     0.72623         0.89077        0.99540        0.99751

Cross-validated AUC from the simulations for the full SL (fSL), discrete SL (dSL), SVM, RF and KNN over different edge counts (EC) and values of $p_2$, the increase in the probability of adding an edge for each potentially closed triad in excess of one. Discussion of the ordering of performance for EC = 500 appears later in this section.

The simulation results seem to support the oracle properties of the discrete and full SL. Indeed, in most scenarios, the discrete SL has the same cross-validated AUC as the best performing candidate algorithm, with the full SL performing a little better, as expected. In these scenarios, the ordering of the candidate algorithms by performance is likely the same across each fold in the cross-validation. Thus, the discrete SL always picks the same candidate algorithm in each fold and has the same performance as the best candidate algorithm averaged across all folds. The full SL, in this case, takes a convex combination of the candidate algorithms in each fold and performs at least as well as, and likely better than, the best performing candidate algorithm in each fold. When averaged across the cross-validation folds, the full SL clearly performs better, as evidenced by the simulations. Since it selects the candidate algorithm with the best cross-validated performance by design, the discrete SL’s oracle property is mediated through the optimal candidate algorithm performing the best asymptotically.

Still, there are a few scenarios where either or both of the discrete and full SL perform slightly worse than the best performing candidate algorithm. This occurs with edge count 500 and the smaller values of $p_2$ (see Table 1). In these scenarios, there are several candidate algorithms that are quite close in performance, and the ordering of their performance is likely not constant across the cross-validation folds. In this case, the discrete SL picks different candidate algorithms across the folds and the AUC averaged across the folds may be worse than that of the best candidate algorithm. The full SL, on the other hand, weights different candidate algorithms differently across the folds, and the cross-validated AUC can also end up worse than that of the best candidate algorithm.

In the scenarios where either or both of the discrete and full SL perform worse than the best candidate algorithm, they both still perform very close to the best candidate algorithm, with differences in AUC only of the order of $10^{-3}$ or less. Since the oracle properties for both SLs are asymptotic results, some small deviations from the asymptotic behaviour would be expected for finite samples. In the limit, the ordering of the performance of the candidate algorithms is likely to be constant across all folds, and both SLs are expected to perform no worse than the optimal candidate.

The best performing candidate algorithm varies between SVM and RF across the scenarios, which is not something known a priori, but both versions of SL are able to closely match or beat the best candidate algorithm. Although the differences between the best and worst performing candidate algorithms in the scenarios visited are not too great (the biggest difference is about 0.1), it is still easy to see the manifestation of the oracle property, which would be even more valuable in scenarios with greater differences. Thus, when one particular candidate algorithm outperforms the rest, one can be confident that the discrete SL will select it, and the full version will weight it heavily. In cases when one is unlikely to fully grasp the significance of the various summary statistics as predictors or what learning algorithm would work best due to the complex nature of the data, it would be ideal to consider multiple candidate algorithms paired with different combinations of predictors and find an optimal mix. This makes SL well suited for model selection for mechanistic network models.

One interesting note is that the magnitude of the difference in performance between the full SL and the best performing candidate algorithm (random forest most of the time) is not too great. This reflects results from other papers in the literature that use the SL specifically for classification [49,51]. Despite the relatively modest gain, we believe the flexibility the SL offers is important for this setting. In addition to the choice of predictors and candidate algorithms, the choice of the loss function allows one to define the performance metric, which may not be possible with the candidate algorithms. The oracle property also takes care of the analogue of the multiple comparison problem in the model selection setting by allowing the user to gain information from multiple candidate algorithms yet make only one ‘selection’.

We implemented the proposed procedure for quantifying the uncertainty in the selected model on the simulated data. We assessed two scenarios: an easy one, with edge count 2000 and a larger value of $p_2$, and a difficult one, with edge count 500 and a smaller value of $p_2$. For each scenario, we first defined the selected model $\hat{m}$ as 1, or the 'positive' model, if the score given by the cross-validated SL was above 0.5, and as 0, or the 'negative' model, otherwise. We defined $\mathbb{1}(\hat{m} = m)$ as above and built the regression model for $\mathbb{1}(\hat{m} = m)$ with a new SL with the appropriate loss function in order to maximize the corresponding binomial likelihood

$$\prod_i \pi_i^{\,\mathbb{1}(\hat{m}_i = m_i)}\,\big(1 - \pi_i\big)^{\,1 - \mathbb{1}(\hat{m}_i = m_i)},$$

where $\pi_i$ denotes the fitted probability of correct classification for the $i$th element of the training data.

The new SL used the same library of candidate algorithms as before. The average fitted value of $\pi$ was 0.71 for the easy scenario and 0.53 for the difficult one. The average value for the easy scenario is noticeably higher than 0.5, while that of the difficult scenario is very close to 0.5 and thus hardly better than a random guess. These results are compatible with our AUC simulation results.

4. Protein interaction network

We apply our model selection approach to the yeast (Saccharomyces cerevisiae) PPI network obtained from the Database of Interacting Proteins (DIP) [52]. We use models from two publications [53,54] that fit different duplication-divergence models [11,55] to the same data set.

Duplication-divergence models are used in systems biology for modelling PPI networks. The model fits in Hormozdiari et al. [53] and Schweiger et al. [54] both contain the same mechanisms, but different seed networks and parameter values. The models generate network realizations by growing a seed network until the requisite number of nodes $n$ is reached. At the beginning of each step in network generation, an existing node is chosen uniformly at random and duplicated by adding a new node and connecting the new node with each neighbour of the chosen existing node. Each edge connected to the new node is then removed with probability $q$. Finally, an edge is added between the new node and each node existing at the start of the $t$th time step with probability $r/n_t$, where $n_t$ is the number of nodes at the beginning of the $t$th step.
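A minimal generator following this verbal description might look as below. This is our reconstruction, not code from either paper: the parameter names q (edge-deletion probability) and r (attachment parameter) follow the notation above, the probability r/n_t for the extra edges is our reading of the text, and the seed graph and parameter values are placeholders rather than either paper's fitted seed network or estimates.

```python
import random
import networkx as nx

def duplication_divergence(seed_graph, n, q, r, seed=None):
    """Grow seed_graph to n nodes by duplication and divergence."""
    rng = random.Random(seed)
    G = seed_graph.copy()
    while G.number_of_nodes() < n:
        existing = list(G.nodes())
        n_t = len(existing)                    # nodes at the start of this step
        u = rng.choice(existing)               # node to duplicate
        w = max(existing) + 1                  # label for the new node
        G.add_node(w)
        for v in list(G[u]):                   # copy u's edges ...
            if rng.random() >= q:              # ... each kept with probability 1 - q
                G.add_edge(w, v)
        for v in existing:                     # extra edges to pre-existing nodes
            if rng.random() < r / n_t:
                G.add_edge(w, v)
    return G

seed_graph = nx.complete_graph(5)              # placeholder seed network
G = duplication_divergence(seed_graph, n=200, q=0.4, r=0.3, seed=1)
print(G.number_of_nodes(), G.number_of_edges())
```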

The two model fits use different parameter values and different seed networks. The fit from Hormozdiari et al. [53] uses a seed network of 50 nodes. This seed network is constructed by connecting two cliques (complete graphs) of 7 and 10 nodes with several edges. Then another 33 nodes are connected to randomly chosen nodes from the two cliques. The fit from Schweiger et al. [54] uses a seed network of 40 nodes. The seed network is generated from an inverse geometric model, in which coordinates are generated for each node, with each dimension drawn independently from the standard normal distribution. Each pair of nodes whose Euclidean distance between the corresponding coordinates exceeds a given threshold is then connected.

To apply our model selection approach, we generated network realizations from each model to fit the SL. Note that the results from Hormozdiari et al. [53] and Schweiger et al. [54] are based on different, older releases of the DIP data, while our analysis only uses the latest data release. Thus, we only use the seed networks (these are independent of the empirical network) from the respective model fits, but not their corresponding parameter estimates. Furthermore, rather than refitting both models, we take the approach from Pudlo et al. [36] and draw parameter values from a prior (uniform over the unit square), then generate a network realization from each model for each parameter value as the training data.

The SL here uses the same candidate algorithms we used in the previous section. The predictors here are the criteria of fit from both papers. The betweenness centrality of a node $v$ is the number of shortest paths, across all pairs of nodes, that pass through $v$, while the closeness centrality of $v$ is the inverse of the sum of the lengths of the shortest paths from $v$ to all other nodes. The $k$-hop reachability of a node $v$ is the number of distinct nodes that can be reached from $v$ in $k$ edges or fewer. Hormozdiari et al. [53] use these measures to compare the overall connectivity of networks generated from their model with that of the empirical PPI network. We use the means of the two centrality measures as well as that of the $k$-hop reachability up to $k = 6$ as predictors.
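A sketch of these connectivity predictors in NetworkX is given below (ours; the function name is an assumption). Note that NetworkX's betweenness and closeness functions return normalized values, which differ from the raw counts in the verbal definitions above only by constant factors, so they serve equally well as predictors.

```python
import numpy as np
import networkx as nx

def connectivity_stats(G, max_hops=6):
    """Mean betweenness, mean closeness and mean k-hop reachability for k = 1..max_hops."""
    bet = np.mean(list(nx.betweenness_centrality(G).values()))   # normalized by default
    clo = np.mean(list(nx.closeness_centrality(G).values()))     # normalized by default
    hops = []
    for k in range(1, max_hops + 1):
        # nodes reachable from v within k edges, excluding v itself
        reach = [len(nx.single_source_shortest_path_length(G, v, cutoff=k)) - 1
                 for v in G.nodes()]
        hops.append(np.mean(reach))
    return np.array([bet, clo, *hops])

G = nx.gnm_random_graph(100, 300, seed=0)   # placeholder network realization
print(connectivity_stats(G).round(3))
```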

Bicliques are subgraphs between two distinct sets of nodes in which every edge between the two sets exists. A biclique is maximal if it is not a strict subgraph of another biclique. Bu et al. [56] note that large bicliques exist in the yeast PPI network, and they are used to infer protein function as well as to find binding motifs [57]. Schweiger et al. [54] based the fit of different models on maximal bicliques whose two distinct node sets each contain at least two nodes. For consistency, maximal bicliques are represented as ordered pairs of the numbers of nodes in the two distinct sets, with the first element being less than or equal to the second. This can be cast as a bivariate distribution. To use as predictors, we summarize the maximal biclique distribution of a network with the total number of maximal bicliques, a norm of the distribution of maximal bicliques, the mean and variance of each dimension of the bivariate distribution, as well as the Pearson and Spearman correlations between the two dimensions. The norm is computed from the proportions $f_{a,b}$ of maximal bicliques whose two node sets have sizes $a$ and $b$. This choice is motivated by the distance used in Schweiger et al. [54].

For the purposes of model selection and without loss of generality, we assigned the model of Hormozdiari et al. [53] as the 0, or 'negative', model and that of Schweiger et al. [54] as the 1, or 'positive', model. We trained three SLs: the first is based only on the predictors from Hormozdiari et al. [53], the second is based only on those from Schweiger et al. [54], while the last one is based on those from both papers. Note that we use the definitions of the predictors from the two papers as given and compute their values from the latest release of the data. Predictably, the first two trained SLs evaluated at the corresponding predictors for the PPI network returned opposing results, each favouring the model its corresponding predictors were from. The first SL returned a score of approximately 0.00629 and very strongly favours model 0, while the second returned a score of approximately 0.788 and strongly favours model 1. Both scores are closer to their respective extremes than to 0.5, suggesting at least moderate evidence for their respective models. The third SL returned a score of approximately 0.658, which favours model 1, but is closer to 0.5 than either of the scores from the previous two SLs. The results from using both sets of summaries thus favour the model of Schweiger et al. [54] for the most current (S. cerevisiae) PPI data set.

These results exemplify the difficulty of model selection for mechanistic models where the predictors are insufficient summaries of the data. Each of the summaries used as predictors is a reasonable criterion of fit as presented in its respective paper, but they can lead to conflicting results depending on the subset of predictors used, as with the first two trained SLs. It can therefore be difficult to compare these results. In general, with classification methods such as random forest, or an SL based on such learning algorithms, using a larger set of predictors typically does not reduce performance, so it is reasonable to use a larger set of predictors. However, one can always come up with new summaries. Thus, a natural question is when to stop expanding the set of predictors/summaries.

One means of assessing the usefulness of a predictor is through an importance metric. As a reference, we report in Table 2 the random forest feature importance [58] for each predictor in the three selections specified in the previous paragraphs. In the first selection (SL1), based only on predictors from Hormozdiari et al. [53], the two mean centrality measures as well as 6-hop reachability had the three highest feature importances. In the second (SL2), based only on predictors from Schweiger et al. [54], the norm of the maximal biclique distribution, as well as the mean and variance of the second dimension of the distribution, had the highest feature importances. When the two sets of predictors are combined in SL3, the top three predictors from Schweiger et al. [54] retain high importance, while those from Hormozdiari et al. [53] all have lower importance, with only betweenness centrality retaining moderate importance. In this example, it seems the predictors from Schweiger et al. [54] are more discerning. The example also shows that the feature importance can be a useful tool in distilling a large set of available predictors to build a classifier for model selection.

Table 2.

Feature importance in the three selections based on different sets of predictors

Predictor [53]   SL1     SL3     Predictor [54]   SL2     SL3
Bet. cent.*      0.431   0.113   Total bic.       0.012   0.007
Clo. cent.*      0.201   0.045   Norm bic.*       0.229   0.191
1-hop            0.059   0.004   Mean 1-d         0.060   0.028
2-hop            0.059   0.005   Mean 2-d*        0.264   0.230
3-hop            0.033   0.003   Var 1-d          0.040   0.016
4-hop            0.029   0.002   Var 2-d*         0.339   0.311
5-hop            0.054   0.005   Pearson          0.040   0.014
6-hop*           0.134   0.021   Spearman         0.017   0.005

Feature importance in the three selections that are based on different sets of predictors: selection 1 (SL1) is based only on the predictors from Hormozdiari et al. [53], selection 2 (SL2) is based only on those from Schweiger et al. [54], and selection 3 (SL3) is based on both sets. The three predictors with the highest importance in SL1 and in SL2 are marked with an asterisk.
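Feature importances like those reported in Table 2 can be read directly off a fitted random forest. A minimal sketch with scikit-learn follows; the predictor names and data are placeholders, not the values computed from the DIP network.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
names = ["bet_cent", "clo_cent", "norm_bic", "var_2d"]        # placeholder predictor names
X = np.vstack([rng.normal(0.0, 1.0, size=(1000, len(names))),
               rng.normal(0.3, 1.0, size=(1000, len(names)))])
y = np.repeat([0, 1], 1000)                                    # model indices

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
for name, imp in sorted(zip(names, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")                                # impurity-based importance
```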

5. Discussion

Network models are widely used in many domains, and mechanistic models allow one to easily incorporate domain knowledge. However, due to the intractability of the likelihood of the typical mechanistic network models, likelihood-based model selection methods are not feasible. We propose a procedure that combines the SL framework with the data generation of ABC, allowing one to leverage the ease of generating data from mechanistic models via forward simulation, while quantifying uncertainty in the selected model.

While ABC provides one viable means for mechanistic network model selection, an accurate ABC posterior requires the knowledge of sufficient statistics [27,37], which are typically difficult to find in the case of intractable likelihoods. The use of insufficient statistics and non-zero threshold for the distance, as is commonly done in ABC, can make the approximation of the posterior distribution of the model index worse. As an alternative to ABC for likelihood-free model selection, we proposed to use SL for model selection while borrowing the generation of pseudo-data from ABC. While it still suffers from the lack of sufficiency, our approach aims to make the best use of available predictors.

With training data readily generated from each candidate model, SL seeks to build an optimal prediction algorithm from a library of candidate algorithms. In this case, it seeks to build an optimal classifier from candidate algorithms to best discriminate between the available models with the given predictors that are not necessarily sufficient. One is unlikely to know what pairing of classifiers and predictors will perform well, but with SL one does not need to make this choice as SL will try to build the optimal classifier with all that is given. However, this does not mean that the quality of the predictors does not matter. The better the predictors are at characterizing the differences between the candidate models, the better SL performs. Though the ability to characterize the differences likely correlates with sufficiency, we cannot use sufficiency as a criterion for choosing the predictors. For mechanistic models, one can, and should, apply domain knowledge to select predictors.

Our procedure assumes that one has well-defined candidate models. Bayesian models require a corresponding prior distribution, whereas non-Bayesian models require the calibration of model parameters. For such calibration, it is important to consider which characteristics of the networks are unlikely to be affected by the different mechanisms of the candidate models, and then to calibrate the parameters of each candidate model based on these characteristics of the observed network. This will hopefully allow the differences in the candidate models to more clearly manifest themselves and to ensure that the data generated from each candidate model are similar to the observed data. Ideally, this also means that the data generated from the true model will match the observed data closely. However, due to the nature of the likelihood, calibration of model parameters in the latter scenario is not straightforward. This is another situation where likelihood-free methods such as ABC are appropriate.

Finally, it is possible to approach the model selection problem in a very different way: one can turn the model selection problem into a density estimation problem in the summary statistic space. Briefly, one can use kernel density estimation, or similar methods, to estimate the probability density function of the network summary statistics conditional on the model used to generate the networks. By introducing a prior probability on the model index, one can then use Bayes’ theorem to arrive at the model posterior probability conditional on the observed summaries. The choice of summary statistics can then be guided by the idea that informative summaries, associated with different mechanistic models, should result in minimal common support in the estimated model-specific densities in the summary statistic space. This approach would not alleviate the problem of insufficient statistics, but rather than having to specify a threshold for the distance metric, one would now have to specify the precision (width) of the kernels used in density estimation. Density estimation is known to be difficult in high dimensional spaces, but it might be possible to use a heuristic for proposing small sets of summaries such that the dimensionality of the summary statistic space does not become prohibitive. We leave further development of this idea for future work.

Acknowledgements

We would like to thank Dr. Laura Balzer for her generous help on the Super Learner.

Funding

The National Institutes of Health (1DP2MH103909-01 to S.C. and J.P.O., 5U01HG009088-02 to S.C., U54GM088558-09 to S.C., 5R37AI051164-12 to J.P.O., 1R01AI112339-01 to J.P.O., U54GM088558-06 to J.P.O.); and the Swiss National Science Foundation (CR12I1_156229 to S.C. and A.M., 105218_163196 to A.M.).

References

  • 1. Wasserman, S. & Faust, K. (1994) Social Network Analysis: Methods and Applications. Vol. 8. Cambridge, UK: Cambridge University Press.
  • 2. Pastor-Satorras, R. & Vespignani, A. (2007) Evolution and Structure of the Internet: A Statistical Physics Approach. Cambridge, UK: Cambridge University Press.
  • 3. Newman, M. (2010) Networks: An Introduction. Oxford, UK: Oxford University Press.
  • 4. Lusher, D., Koskinen, J. & Robins, G. (2012) Exponential Random Graph Models for Social Networks: Theory, Methods, and Applications. Cambridge, UK: Cambridge University Press.
  • 5. Raval, A. & Ray, A. (2013) Introduction to Biological Networks. Boca Raton, USA: CRC Press.
  • 6. Robins, G., Pattison, P., Kalish, Y. & Lusher, D. (2007) An introduction to exponential random graph (p*) models for social networks. Soc. Netw., 29, 173–191.
  • 7. Hoff, P. D., Raftery, A. E. & Handcock, M. S. (2002) Latent space approaches to social network analysis. J. Am. Stat. Assoc., 97, 1090–1098.
  • 8. Goyal, R., Blitzstein, J. & De Gruttola, V. (2014) Sampling networks from their posterior predictive distribution. Netw. Sci., 2, 107–131.
  • 9. Barabási, A.-L. & Albert, R. (1999) Emergence of scaling in random networks. Science, 286, 509–512.
  • 10. Watts, D. J. & Strogatz, S. H. (1998) Collective dynamics of ‘small-world’ networks. Nature, 393, 440–442.
  • 11. Solé, R. V., Pastor-Satorras, R., Smith, E. & Kepler, T. B. (2002) A model of large-scale proteome evolution. Adv. Complex Syst., 5, 43–54.
  • 12. Vázquez, A., Flammini, A., Maritan, A. & Vespignani, A. (2003) Modeling of protein interaction networks. Complexus, 1, 38–44.
  • 13. Klemm, K. & Eguiluz, V. M. (2002) Highly clustered scale-free networks. Phys. Rev. E, 65, 036123.
  • 14. Kumpula, J. M., Onnela, J.-P., Saramäki, J., Kaski, K. & Kertész, J. (2007) Emergence of communities in weighted networks. Phys. Rev. Lett., 99, 228701.
  • 15. Viswanath, B., Mislove, A., Cha, M. & Gummadi, K. P. (2009) On the evolution of user interaction in Facebook. Proceedings of the 2nd ACM Workshop on Online Social Networks. ACM, 37–42.
  • 16. Onnela, J.-P., Saramäki, J., Hyvönen, J., Szabó, G., Lazer, D., Kaski, K., Kertész, J. & Barabási, A.-L. (2007) Structure and tie strengths in mobile communication networks. Proc. Natl. Acad. Sci. USA, 104, 7332–7336.
  • 17. Saramäki, J. & Moro, E. (2015) From seconds to months: an overview of multi-scale dynamics of mobile telephone calls. Eur. Phys. J. B, 88, 164.
  • 18. Goyal, R. & Onnela, J.-P. Mechanistic and probabilistic network models (in progress).
  • 19. Shalizi, C. R. & Rinaldo, A. (2013) Consistency under sampling of exponential random graph models. Ann. Stat., 41, 508.
  • 20. Wang, R., Goyal, R., Lei, Q., Essex, M. & De Gruttola, V. (2014) Sample size considerations in the design of cluster randomized trials of combination HIV prevention. Clin. Trials, 11, 309–318.
  • 21. Middendorf, M., Ziv, E. & Wiggins, C. H. (2005) Inferring network mechanisms: the Drosophila melanogaster protein interaction network. Proc. Natl. Acad. Sci. USA, 102, 3192–3197.
  • 22. Ratmann, O., Andrieu, C., Wiuf, C. & Richardson, S. (2009) Model criticism based on likelihood-free inference, with an application to protein network evolution. Proc. Natl. Acad. Sci. USA, 106, 10576–10581.
  • 23. Thorne, T. & Stumpf, M. P. (2012) Graph spectral analysis of protein interaction network evolution. J. R. Soc. Interface, 9, 2653–2666.
  • 24. Onnela, J.-P. & Mira, A. Statistical inference and model selection for mechanistic network models (in progress).
  • 25. Marin, J.-M., Pudlo, P., Robert, C. P. & Ryder, R. J. (2012) Approximate Bayesian computational methods. Stat. Comput., 22, 1167–1180.
  • 26. Sunnåker, M., Busetto, A. G., Numminen, E., Corander, J., Foll, M. & Dessimoz, C. (2013) Approximate Bayesian computation. PLoS Comput. Biol., 9, e1002803.
  • 27. Lintusaari, J., Gutmann, M. U., Dutta, R., Kaski, S. & Corander, J. (2017) Fundamentals and recent developments in approximate Bayesian computation. Syst. Biol., 66, e66–e82.
  • 28. Sisson, S. A., Fan, Y. & Beaumont, M. (2018) Handbook of Approximate Bayesian Computation. Boca Raton, USA: Chapman and Hall/CRC.
  • 29. Beaumont, M. A. (2010) Approximate Bayesian computation in evolution and ecology. Annu. Rev. Ecol. Evol. Syst., 41, 379–406.
  • 30. Grelaud, A., Robert, C. P., Marin, J.-M., Rodolphe, F. & Taly, J.-F. (2009) ABC likelihood-free methods for model choice in Gibbs random fields. Bayesian Anal., 4, 317–335.
  • 31. Toni, T., Welch, D., Strelkowa, N., Ipsen, A. & Stumpf, M. P. (2009) Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J. R. Soc. Interface, 6, 187–202.
  • 32. Sisson, S. A., Fan, Y. & Tanaka, M. M. (2007) Sequential Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. USA, 104, 1760–1765.
  • 33. Lee, X. J., Drovandi, C. C. & Pettitt, A. N. (2015) Model choice problems using approximate Bayesian computation with applications to pathogen transmission data sets. Biometrics, 71, 198–207.
  • 34. Fearnhead, P. & Prangle, D. (2012) Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. J. R. Stat. Soc. Ser. B (Stat. Methodol.), 74, 419–474.
  • 35. Stoehr, J., Pudlo, P. & Cucala, L. (2015) Adaptive ABC model choice and geometric summary statistics for hidden Gibbs random fields. Stat. Comput., 25, 129–141.
  • 36. Pudlo, P., Marin, J.-M., Estoup, A., Cornuet, J.-M., Gautier, M. & Robert, C. P. (2015) Reliable ABC model choice via random forests. Bioinformatics, 32, 859–866.
  • 37. Barber, S., Voss, J. & Webster, M. (2015) The rate of convergence for approximate Bayesian computation. Electron. J. Stat., 9, 80–105.
  • 38. Drovandi, C. C., Pettitt, A. N. & Faddy, M. J. (2011) Approximate Bayesian computation using indirect inference. J. R. Stat. Soc. Ser. C (Appl. Stat.), 60, 317–337.
  • 39. Prangle, D., Fearnhead, P., Cox, M. P., Biggs, P. J. & French, N. P. (2014) Semi-automatic selection of summary statistics for ABC model choice. Stat. Appl. Genet. Mol. Biol., 13, 67–82.
  • 40. Robert, C. P., Cornuet, J.-M., Marin, J.-M. & Pillai, N. S. (2011) Lack of confidence in approximate Bayesian computation model choice. Proc. Natl. Acad. Sci. USA, 108, 15112–15117.
  • 41. Polley, E. C., Rose, S. & Van der Laan, M. J. (2011) Super Learning. Targeted Learning. New York, NY: Springer, 43–66.
  • 42. Van der Laan, M. J., Polley, E. C. & Hubbard, A. E. (2007) Super Learner. Stat. Appl. Genet. Mol. Biol., 6.
  • 43. Van der Laan, M. J. & Dudoit, S. (2003) Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. Technical Report. Division of Biostatistics, University of California, Berkeley.
  • 44. Dudoit, S. & van der Laan, M. J. (2005) Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Stat. Methodol., 2, 131–154.
  • 45. Bradley, A. P. (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit., 30, 1145–1159.
  • 46. Ling, C. X., Huang, J. & Zhang, H. (2003) AUC: a statistically consistent and more discriminating measure than accuracy. IJCAI, 3, 519–524.
  • 47. Erdös, P. & Rényi, A. (1959) On random graphs I. Publ. Math. Debrecen, 6, 290–297.
  • 48. Granovetter, M. S. (1973) The strength of weak ties. Am. J. Sociol., 78, 1360–1380.
  • 49. Pirracchio, R., Petersen, M. L. & van der Laan, M. (2014) Improving propensity score estimators’ robustness to model misspecification using super learner. Am. J. Epidemiol., 181, 108–119.
  • 50. Pirracchio, R., Petersen, M. L., Carone, M., Rigon, M. R., Chevret, S. & van der Laan, M. J. (2015) Mortality prediction in intensive care units with the Super ICU Learner Algorithm (SICULA): a population-based study. Lancet Respiratory Med., 3, 42–52.
  • 51. Petersen, M. L., LeDell, E., Schwab, J., Sarovar, V., Gross, R., Reynolds, N., Haberer, J. E., Goggin, K., Golin, C., Arnsten, J., et al. (2015) Super learner analysis of electronic adherence data improves viral prediction and may provide strategies for selective HIV RNA monitoring. J. Acquir. Immune Defic. Syndr., 69, 109.
  • 52. Salwinski, L., Miller, C. S., Smith, A. J., Pettit, F. K., Bowie, J. U. & Eisenberg, D. (2004) The database of interacting proteins: 2004 update. Nucleic Acids Res., 32, D449–D451.
  • 53. Hormozdiari, F., Berenbrink, P., Pržulj, N. & Sahinalp, S. C. (2007) Not all scale-free networks are born equal: the role of the seed graph in PPI network evolution. PLoS Comput. Biol., 3, e118.
  • 54. Schweiger, R., Linial, M. & Linial, N. (2011) Generative probabilistic models for protein–protein interaction networks – the biclique perspective. Bioinformatics, 27, i142–i148.
  • 55. Pastor-Satorras, R., Smith, E. & Solé, R. V. (2003) Evolving protein interaction networks through gene duplication. J. Theoretical Biol., 222, 199–210.
  • 56. Bu, D., Zhao, Y., Cai, L., Xue, H., Zhu, X., Lu, H., Zhang, J., Sun, S., Ling, L., Zhang, N., et al. (2003) Topological structure analysis of the protein–protein interaction network in budding yeast. Nucleic Acids Res., 31, 2443–2450.
  • 57. Li, H., Li, J. & Wong, L. (2006) Discovering motif pairs at interaction sites from protein sequences on a proteome-wide scale. Bioinformatics, 22, 989–996.
  • 58. Breiman, L. (2001) Random forests. Mach. Learn., 45, 5–32.
