Loading metrics
Open Access
Peer-reviewed
Research Article
Estimation of heterogeneous instantaneous reproduction numbers with application to characterize SARS-CoV-2 transmission in Massachusetts counties
- Zhenwei Zhou,
Roles Data curation, Formal analysis, Investigation, Methodology, Software, Writing – original draft, Writing – review & editing
* E-mail:zwzhou@bu.edu (ZZ);lfwhite@bu.edu (LFW)
Affiliation Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, United States of America
⨯ - Eric D. Kolaczyk,
Roles Conceptualization, Methodology, Supervision, Writing – original draft, Writing – review & editing
Affiliations Department of Mathematics & Statistics, Boston University, Boston, Massachusetts, United States of America, Department of Mathematics and Statistics, McGill University, Montreal, Canada
⨯ - Robin N. Thompson,
Roles Methodology, Supervision, Writing – review & editing
Affiliation Mathematics Institute and SBIDER, University of Warwick, Coventry, England, United Kingdom
⨯ - Laura F. White
Roles Conceptualization, Data curation, Funding acquisition, Methodology, Project administration, Supervision, Writing – original draft, Writing – review & editing
* E-mail:zwzhou@bu.edu (ZZ);lfwhite@bu.edu (LFW)
Affiliation Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, United States of America
⨯
Estimation of heterogeneous instantaneous reproduction numbers with application to characterize SARS-CoV-2 transmission in Massachusetts counties
- Zhenwei Zhou,
- Eric D. Kolaczyk,
- Robin N. Thompson,
- Laura F. White
- Published: September 1, 2022
- https://doi.org/10.1371/journal.pcbi.1010434
Figures
Abstract
The reproductive number is an important metric that has been widely used to quantify the infectiousness of communicable diseases. The time-varying instantaneous reproductive number is useful for monitoring the real-time dynamics of a disease to inform policy making for disease control. Local estimation of this metric, for instance at a county or city level, allows for more targeted interventions to curb transmission. However, simultaneous estimation of local reproductive numbers must account for potential sources of heterogeneity in these time-varying quantities—a key element of which is human mobility. We develop a statistical method that incorporates human mobility between multiple regions for estimating region-specific instantaneous reproductive numbers. The model also can account for exogenous cases imported from outside of the regions of interest. We propose two approaches to estimate the reproductive numbers, with mobility data used to adjust incidence in the first approach and to inform a formal priori distribution in the second (Bayesian) approach. Through a simulation study, we show that region-specific reproductive numbers can be well estimated if human mobility is reasonably well approximated by available data. We use this approach to estimate the instantaneous reproductive numbers of COVID-19 for 14 counties in Massachusetts using CDC case report data and the human mobility data collected by SafeGraph. We found that, accounting for mobility, our method produces estimates of reproductive numbers that are distinct across counties. In contrast, independent estimation of county-level reproductive numbers tends to produce similar values, as trends in county case-counts for the state are fairly concordant. These approaches can also be used to estimate any heterogeneity in transmission, for instance, age-dependent instantaneous reproductive number estimates. As people are more mobile and interact frequently in ways that permit transmission, it is important to account for this in the estimation of the reproductive number.
Author Summary
To control the spreading of an infectious disease, it is very important to understand the real-time infectiousness of the pathogen that causes the disease. An existing metric called instantaneous reproductive number is often used to quantify the average number of secondary cases generated by individuals who are infectious at a certain time point, assuming no changes to current conditions. In practice, we might be interested in using the metric to describe the infectiousness in multiple regions. However, this is challenging when there are visitors traveling between these regions, since this could lead to a misclassification of where an individual is actually infected and create biased estimates for the instantaneous reproductive numbers. We developed a method that takes account of human mobility to estimate the instantaneous reproductive numbers for multiple regions simultaneously, which could reveal the heterogeneity of the metric. This method aims to provide helpful information on region-specific infectiousness for disease control measures that focus on the region with higher pathogen infectiousness. This approach is also applicable for estimating the reproductive number in the presence of other sources of heterogeneity, including by age.
Citation:Zhou Z, Kolaczyk ED, Thompson RN, White LF (2022) Estimation of heterogeneous instantaneous reproduction numbers with application to characterize SARS-CoV-2 transmission in Massachusetts counties. PLoS Comput Biol 18(9): e1010434. https://doi.org/10.1371/journal.pcbi.1010434
Editor:Joseph T. Wu, University of Hong Kong, HONG KONG
Received:November 23, 2021;Accepted:July 25, 2022;Published: September 1, 2022
Copyright: © 2022 Zhou et al. This is an open access article distributed under the terms of theCreative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability:The datasets generated for simulation and analyzed for real data analysis are available in the GitHub repository,https://github.com/zwzhou-biostat/hrt/tree/main/data. The R codes reproducing the results of this study are available in the GitHub repository,https://github.com/zwzhou-biostat/hrt/tree/main/code.
Funding:This work was funded by NIH R35 GM141821 to ZZ and LFW. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
This is aPLOS Computational Biology Methods paper.
Introduction
In the aftermath of the pandemic caused by the SARS-CoV-1 virus, the idea of using surveillance data to estimate reproductive numbers was introduced and popularized by the seminal paper by Wallinga and Teunis [1]. Subsequent methods have been developed that are better suited to real-time estimation, particularly the approach to estimate the instantaneous reproductive number introduced by Fraser [2] and implemented in the popularEpiEstim R package [3,4]. These methods have promised to be useful in surveillance and monitoring an epidemic, but as the pandemic caused by SARS-CoV-2 has demonstrated, there are still needed improvements to these approaches. Principal issues include accounting for reporting delays in the data, underreporting of cases, and heterogeneity in transmission by geography and by demographic factors, such as age. For these methods to be truly useful in the ongoing COVID-19 pandemic and for future events, these issues must be addressed. Work is being done on the first two issues. For instance, Li et al. [6], Gunther et al. [7] and Martinez et al. [8] propose solutions to the timeliness problem. Pitzer et al. [9] demonstrate the impact of reporting issues and White et al. [10] have shown how estimates ofR(t) can be corrected with information on the reporting fraction of diseases. In this paper, we propose a framework for addressing heterogeneity in transmission, specifically due to human mobility, though our methods can be more generally applied.
Studies have shown that there is transmission heterogeneity in COVID-19, as well as other infectious diseases, which lead to a disproportional impact of the disease on some groups. Multiple studies have found strong evidence of strong heterogeneity wherein a small number of individuals are responsible for the vast majority of cases [11–13]. Additionally, in this COVID-19 pandemic, Sy et al. [14] have shown how mobility, such as subway usage, in NYC leads to disproportionate case burden among those who are not maintaining a physical distance. This implies it would be more efficient if we could account for the heterogeneity and focus control efforts on the populations with the highest transmission probabilities [15].
Many factors could contribute to the heterogeneity of virus transmission, including important systematic factors such as lower socioeconomic status (SES) that disadvantage certain groups and could lead to a higher probability of disease transmission. These factors often cluster geographically. The impact of these factors on virus transmission could be reflected in the reproductive numbers. Ideally, we will be able to discover the heterogeneity by examining the differences in the reproductive numbers between different regions. However, due to human mobility, the heterogeneity of the reproductive numbers among different regions could be masked. This is because human mobility could distribute the infectees by certain infectors to different regions, leading to misclassification of the incidence of one region to another.
We propose a framework for accounting for heterogeneity in disease transmission when estimating the time-varying instantaneous reproductive number for each region. This could help monitor changes in transmission to guide public health measures, for example, implementing more stringent disease control measures for the region with higher virus transmission. Our framework requires data to inform the source or patterns of heterogeneity. We focus on human mobility data to estimate the reproductive number for multiple regions or population groups. However, we note that the framework is suitable for understanding the effects of other important factors on the heterogeneity of the reproductive number, such as age.
We present an analytical framework with two approaches to estimate the dynamics of transmission heterogeneity. If we believe that the heterogeneity of the reproductive numbers can be recovered by accounting for population mixing due to human mobility, and we have confidence that the human mobility data represent the mixing of incidence, we suggest using an efficient and straightforward approach that adjusts the incidence prior to estimation. If instead we want to adjust for the importation of cases and more accurately quantify the uncertainty associated with the use of human mobility data with standard errors, we propose a more flexible and computationally intensive Bayesian approach that is more appropriate.
Materials and methods
Overview
We propose two approaches to estimate instantaneous reproductive numbers that incorporate human mobility data to account for heterogeneity. Both approaches are based on the framework of a system of renewal equations that bring human mobility into consideration. The difference between these two approaches is how the estimation handles potential uncertainty in the human mobility data. These methods can be applied to other types of heterogeneity, such as differential age-mixing where one might use the information on contact patterns between age groups. Our first approach simplifies the problem by assuming that human mobility data accurately represents the mixing patterns and corresponding incidence misclassification without error. In this setting, we propose an approach that extends the framework developed by Fraser et al. [2] to estimate the heterogeneous instantaneous reproductive numbers by adjusting the observed incidences for multiple regions using the human mobility data. In reality, there is likely some randomness in human mobility and we would typically wish to quantify the uncertainty due to other factors that might drive the heterogeneity of instantaneous reproductive numbers. For this setting, we use a system of renewal equations that incorporates human mobility data and estimate instantaneous reproductive numbers under a hierarchical Bayesian framework. Both approaches are evaluated by simulations, and implemented to estimate instantaneous reproductive numbers for all counties in Massachusetts, USA, during the COVID-19 pandemic together with human mobility data from SafeGraph.
Data
The COVID-19 incidence data is provided by CDC case report [16] and we use incidence from July 2020 to March 2021 because testing and case reporting became more frequent and regular starting in July 2020. We aggregate confirmed cases in Massachusetts by date and county. Human mobility data is obtained from the multiscale dynamic human mobility flow dataset constructed and maintained by Kang et al. [17], who computed, aggregated and inferred the daily and weekly dynamic origin-to-destination (O-D) flow at three geographic scales (census tract, county and state) analyzing anonymous mobile phone users’ visits to various places provided by SafeGraph [18].
Notation
Suppose that we want to estimate an instantaneous reproductive number, denoted asR(t), forJ stratum, where the stratum can be geographical regions, age groups, communities, etc. LetNj(t),t = 1, …,T be the number of new cases reported at timet for regionj, andmj(t) =E[Nj(t)], wheret = 1 is the first observation time andT is the last time with available data. The distribution of serial intervals is denoted asω(τ|θ), whereτ is the interval between the time of disease onset in an infector-infectee pair, andθ is the parameters of the distribution. There are several assumptions for both approaches that we propose:
- Serial interval and reproductive number are statistically independent;
- Reproductive number follows a Poisson distribution;
- All infectors appear before those they infected;
- Individuals mix homogeneously;
- Closed population;
- Complete case reporting;
- Accurate mobility information;
- The serial interval is the same as the reporting interval (i.e. the time between case report dates in an infector-infectee pair).
Instantaneous reproductive number
The instantaneous reproductive number, originally developed by Fraser et al. [2], estimates the average number of secondary cases generated by individuals who are infectious at timet assuming no changes to current conditions. When using the instantaneous reproductive number, the expected incidence at timet, which is denoted asm(t), can be expressed as the following renewal equation:(1)
In practice, the estimator forR(t) can be computed with reported incidenceN(t) as:(2)
Cori et al. [3] used a Bayesian approach to estimate theR(t) with credible intervals and proposed smoothing the estimates by using a longer time window, assuming theR(t) stay the same within that window. Thompson et al. [4], and then Creswell et al. [5], developed extended versions of the method to perform estimation in the presence of imported cases. Based on the renewalEq 1 as well as the estimation method developed by Cori et al. [3] and Thompson et al. [4], we can formulate the process into a system of renewal equations that incorporates the human mobility data, assuming that the mobility data accurately describe the how the infected individuals are travelling between regions.
DenoteP as theJ-by-J human mobility matrix that reclassifies incidences to the presumed location of the transmission event. Letpj′j be the entry ofP matrix in thej′th row andjth column, and represents the proportion of population in regionj′ that travels to regionj. Then to describe incidences in multiple regions, we can extend theEq (1) to a system of equations:(3)where
.
Eq (3) describes the relationship between infectious individuals in regionj′ at timet −τ and cases they infect who are reported in regionjτ time points later. Two processes are at play, first the serial interval,ω(τ), as is commonly used in the EpiEstim estimator, which describes the probability of a secondary case takingτ time points to show up. Second, and unique to our formulation, is thepj′j(t), which describes the probability of an individual infected at regionj′ that travel to and be reported as a case in regionj at timet. This formulation assumes that individuals have consistent mixing patterns between regions, e.g., regular commuting patterns, or that there is so-called slow-mixing, meaning that the individuals are traveling out of regionj′τ time points after they are infected.
We can rewriteEq (3) in a matrix form, we have:(4)wherem(t) = (m1(t),m2(t), …,mJ(t))⊤ is a vector of incidences for theJ regions at timet, Therefore, (m(t − 1), …,m(1)) is aJ-by-(t − 1) matrix for the incidences ofJ regions from timet − 1 to 1.R(t) = (R1(t),R2(t), …,RJ(t))⊤ is a vector of instantaneous reproductive numbers for theJ regions at timet. IJ is aJ-by-J identity matrix.
Based on the above system of renewal equations, we propose two approaches for the estimation of heterogeneousR(t) incorporating mobility data as follows.
Approach I—Incidence adjustment approach
In this approach, we use the matrixP from the human mobility data deterministically. According toEq (4), assume thatP is invertible, we have(5)
Note thatm(t) = (m1(t),m2(t), …,mJ(t))⊤ = (E[N1(t)],E[N2(t)], …,E[NJ(t)])⊤ =E[N(t)]. To estimateR(t) with the reported incidenceN(t), letNlocal(t) =P−⊤(t)N(t), and assume follows a Poisson distribution:
(6)where Λj(t) = ∑τ <tNj(t −τ)ω(τ).
is a random variable for the incidence that actually should be attribute to regionj, and the distribution of
is conditional on the previous reported cases in regionj, the distribution of reporting intervalω(⋅) as well as the instantaneous reproductive numberRj(t) in regionj at timet. Λj(t) is the cumulative reported cases in regionj that contribute to the new reported cases at timet, Therefore,Rj Λj(t) is the expectation of the random variable
that is assumed to follow a Poisson distribution.
Assume thatRj(t) follows a gamma prior distributionGamma(a,b), and within ak-days window (from dayt −k + 1 tot), the incidences all depend on the sameRj(t). we can write the posterior joint distribution ofRj(t) as:(7a)
(7b)
(7c)
(7d)
Therefore, the posterior ofRj(t) also follows a gamma distribution. The estimation can be performed by implementing the existingEpiEstim R package with the incidence adjustment data.
Approach II—Bayesian approach
Based on the renewal equation with instantaneous reproductive number by previous studies [2,19], we formulate the renewal equations forJ regions as:(8)whereμj(t) is the rate of exogenous infections (infections out of any of the regionsj ∈ {1, …,J}) occurs in regionj, andω(τ) is the probability distribution of serial interval. We modelRj′j(t) =Rj′(t)pj′j(t), whereRj′(t) is the region specific reproductive number for regionj′ at timet, andpj′j(t) represents the probability of an individual infected at regionj′ that travel to and be reported as a case in regionj at timet, assuming that {pj′j(t):j′,j ∈ 1, 2, …,J} are known. Then we have:
(9)
pj′j(t) here attempts to capture the mobility information of infected cases between the regions at timet, and we denote a matrixP(t) with entriespj′j(t) as a transition matrix that models the infected subjects flowing across the regions. For example, while estimatingR(t) for multiple regions, we can inform theP(t) matrix with mobility data between the regions and/or geographical distance between the regions. Within a Bayesian hierarchical modeling framework, Dirichlet priors forP(t) can incorporate prior knowledge for the estimation ofR(t).
We modelRj′(t) in(9) as:(10)assumingϵj′ has constant variance over time.
Assume that the distribution of serial intervalω(τ) andpj′j is known;Nj(t)∼Poisson(mj(t)) and {Nj(t)} are independent conditional onmj(t), so we have the factorization:. Then we can sample the posterior distribution of parameters with Bayesian hierarchical modeling:
(11a)
(11b)
(11c)with certain prior specifications for {μj(t)}, {βj(t)}, {σj}.
We also allow a smoothing window for the estimation ofRj(t). If the length of the smoothing window isk, then we modifyEq (11c) to be:(12)
ForEq (12), we assume that the expected incidence atk consecutive time points are the same. Specifically, we assume that the incidence from timet tot +k − 1 follow the same Poisson distribution with meanmj(t).
Assumption violation for mobility information
In practice, mobility information being used might not be accurate. In those scenarios, assuming that we have complete case reporting, we could have negative values in the local incidence according to the formulaNlocal =P−⊤(t)N(t). Since negative values would not make sense for the incidence as count data, it indicates that the mobility information that we are using is not accurate assuming case counts are accurate, which is an assumption violation for our method.
This is a violation of the assumption of accurate mobility information, indicating that it would be better to use higher quality mobility data that better describes how the infected individuals are traveling between regions. If we are not able to identify better mobility data, we propose an approach to adjust the P matrix that yields non-negative values forNlocal.
To adjust theP matrix, we use a shrinkage factorsj(t) For timet and regionj.sj(t)∈[0, 1] describes the extent of shrinkage for the percentage of population flowing in or out of regionj at timet. Denote theP matrix adjusted asPadj(t). To adjustP matrix so that the is non-negative, denote regions other than regionj as {j′}, for the column ofP matrix that correspond to regionj, we letPadj(t)(j′,j) =sj(t)P(t)(j′,j) andPadj(t)(j,j) =P(t)(j,j) + (1 −sj(t))∑j′ ∈ {1, …,J}/jP(t)(j′,j), for the row ofP matrix that correspond to regionj, we letPadj(t)(j,j′) =sj(t)P(t)(j,j′) andPadj(t)(j′,j′) =P(t)(j′,j′) + (1 −sj(t))∑j′ ∈ {1, …,J}/jP(t)(j,j′). For each time pointt and each regionj, we searchsj(t) from 1 to 0 with and interval of 0.01 until
is non-negative. Since the regions with lower incidence are more readily impacted by the inaccurate mobility information, the adjustments are made such that a region with lower incidence will be adjusted first.
We further demonstrate how the proposed method to adjustP matrix could help in simulation scenario 4 where there is an inaccurateP matrix. We also emphasize that this hinges on the assumption that incidence data is accurate and that the approach we describe will not detect inaccuracies in the case where case counts are very large. This is important future work.
Simulation
Simulation settings.
Scenario 1: We consider three regions (j ∈ {a,b,c}), where there are no exogenous infections (except for the initial cases on day 0), so thatμj(t) = 0. Assumingpj′j(t) andω(τ) are known, we specify the 3-by-3 matrixP(t) =P with entriespj′j(t) to be the same across time points, wherej′ is row index andj is column index, andP is specified as the following:and we generate discrete distributionω(τ) forτ from the CDF off(τ) =Gamma(2, 0.5):
For {Rj(t)}, we specify nonlinear functions for each region (also shown inFig 1):
The top panel shows the simulated incidence of 100 replicates. The shaded areas are the 95% quantile bands of the simulated incidences, the solid lines in the shaded area are the mean of the simulated incidences. The bottom panel shows the specifiedR(t) functions, which are plotted as solid lines.
To generate incidence data, we let the initializing cases to 10 in each region and let the maximum time of observation to beT = 214. According toEq (9), with {Rj(t)}, {ω(τ)} and matrixP(t), we can computemj(t) and sample the incidences fort = 1, 2, …,T recursively. Note that we specified the sameP forP(t) at all time points. Based on the incidences at previous time pointst′ <t, andRj(t) at current time pointt, we can computemj(t) for the current time pointt. Then the incidences for the current time pointt are sampled from Poisson distribution with meanmj(t). The same steps are used to sample the incidence at the next time pointt + 1, until we generated the incidences forT time points for each region. 100 Monte Carlo replicates are generated for the simulation study.Fig 1 shows the 100 replicates of simulated data.
We performed the incidence adjustment approach (Approach I) to estimate the instantaneous reproductive numbers on the simulated data described above. For the Bayesian approach (Approach II), we evaluated the performance of the model using different distribution assumption for the incidenceNj(t), and also using different lengths of smoothing window. Then we explored whether using a prior for the transitionP matrix to allow for more flexibility could yield proper estimates forRj(t). The performance of the proposed model is compared with the model without considering the heterogeneous ofRj(t), that is using an identityP matrix.
For all models, we useN(0, 0.5) as the prior distribution forβj(t), andN(0, 1) forσj, whereβj(t) andσj are the hyperparameters forRj(t) inEq (11a). Other model parameter settings are described below:
- Model 1: constantP matrix, smoothing window is 1, assume Poisson distribution forNj(t);
- Model 2: constantP matrix, smoothing window is 9, assume Poisson distribution forNj(t);
- Model 3: constantP matrix, smoothing window is 9, assume Negative Binomial distribution withϕ ∼N(0, 5) forNj(t);
- Model 4: randomP matrix that each column follows a Dirichlet distribution centering at the trueP matrix with large concentration parameter, smoothing window is 9, assume Poisson distribution forNj(t);
- Model 5: constant identityP matrix, smoothing window is 9, assume Poisson distribution forNj(t) (this model is equivalent to Fraser’s method, which do not consider human mobility);
The estimates from Approach I (with and without mobility information) and model 4 and 5 of Approach II are shown in the main result section. Note that model 4 of Approach II is with mobility information, and model 5 of Approach II is without mobility information. Simulation results from Model 1, 2, 3 of Approach II are shown inS1 Appendix andS1 Fig.
Scenario 2: In practice, we might have a low count of cases for some of the regions, so we also evaluated the proposed approaches under the scenario where we have a lower count during a certain period of time. In the low count scenario, we specify theR(t) for the three regions to be three piece-wise functions, and it is shown inFig 1.
We initiate with 50 incidences for the three regions, and simulated incidence from 100 replicates are shown inFig 1. When performing estimation, we focus on the range from day 90 to day 130, because this range covers the period where the incidence decreases to zero or low counts that are close to zero. The result for estimatedR(t)’s is shown inS1 Appendix andS2 Fig.
Scenario 3: We also evaluate the performance of the proposed approaches in a scenario where the population are traveling out of two of the regions with higherR(t) to the third region with lowerR(t). We expect this scenario will show that if we do not consider the human mobility, we will overestimate theR(t) for the region where accepting population travel from other regions with higherR(t). In this scenario, we specify theP matrix to be:
TheP matrix specified intends to create a scenario in which the population in region a and region b travel to region c, while the population in region c stay in place.
We specify theR(t) for the three regions to be (shown inFig 1):
In this scenario, we initiate the 100 incidences for each of the three regions. We use Approach I to perform the estimation, since when the incidence counts are high, both approaches generate a similar result. The result of estimatedR(t)’s is shown inS1 Appendix andS3 Fig. We also used anotherP matrix with higher population mixing between the regions, and the result of estimatedR(t)’s is shown inS1 Appendix andS4 Fig.
Scenario 4: We show how the inaccurateP matrix might affect the estimates ofN(t) andR(t) and evaluate the performance of the proposed method that adjustP matrix in this scenario. In this scenario, we specify theP matrix to be:
We use theP matrix above to generate data, then we apply approach I to the simulated data with an inaccurateP matrix with 10 times higher incidence exporting from region a to region b and c, the inaccurateP matrix is as below:
In this scenario, we initiate the 500 incidences for region a and 0 incidence for region b and c. With simulated incidences and the inaccurateP matrix, yields negative values. We apply approach I with adjustedP matrix to demonstrate the performance of the proposed method to adjust the inaccurateP matrix for better estimates ofE[N(t)]. We show the result of estimates forE[N(t)] andR(t) using trueP matrix, inaccurateP matrix, adjusted inaccurateP matrix and identifyP matrix (equivalent with not using mobility data) inS1 Appendix andS5 Fig.
Real data application
We implement the two approaches described above to the COVID-19 incidence data from the CDC. Since case reporting was more regular starting around July 2020, we focus on the case report data from July 2020 to March 2021. We aim to estimate the heterogeneous instantaneous reproductive numbers for all counties (14 in total) in Massachusetts, USA.
Human mobility patterns across the counties are examined, followed by the estimation of instantaneous reproductive numbers as well as the expected incidence for each county. While performing the estimation, we assume the serial interval follows a gamma distribution Gamma(3.45, 0.66), which corresponds to a mean of 5.2 days and an SD of 2.8 days [20].
Results
Our approaches are based on the renewal equation framework proposed by Fraser et al. [2] for estimating the instantaneous reproductive number. We extend the framework to incorporate human mobility data in a system of renewal equations to estimate the instantaneous reproductive numbers for multiple regions. We propose two approaches to carry out the estimation. For Approach I, we adjust the incidence in multiple regions according to the human mobility data and then estimate the instantaneous reproductive number separately in each region using the EpiEstim method. We call this the incidence adjustment approach. For Approach II, we perform estimation using a system of renewal equations in a hierarchical Bayesian framework. We call this the Bayesian approach.
In this section, we show results for both the simple incidence adjustment approach and the more complex system of renewal equations using simulation and data from Massachusetts during the COVID-19 pandemic.
Simulation results
Our simulation study considers three regions with substantially different transmission profiles over time, but reasonably similar patterns in incidence. The incidence data are simulated with pre-specified reproductive numbers over time as well as a transition matrix, which can be informed by mobility data in practice, that describes how the population in each region distribute to other regions. The details of the simulations are described in the Simulation Settings Section. Approach I is straightforward as we use the human mobility data deterministically to adjust the incidence. For Approach II, we evaluated the model using different assumptions on the distribution of incidence and randomness for the mobility data. In this section, we show the results from the two proposed approaches along with the naive approach that does not use mobility data inFig 2. The naive approach estimates the reproductive numbers separately for each region, which is equivalent to Approach I and Approach II without using the mobility data. Other simulation results are inS1 Appendix.
Solid lines are posterior means, along with the 95% credible bands (shaded). As noted at the sidebars on the left, the figures in the upper panel are the estimated Incidence andR(t) by region while using mobility data, and the lower panel shows the results from models without using mobility data. Both the results from Approach I and Approach II are provided.
Fig 2 shows the main result of the simulation.Table 1 summarizes the mean squared error (MSE), sensitivity and specificity for the estimates from Approach I and II with or without using mobility information. For Approach I, the incidence adjustment approach, a fixed transition matrixP (for mobility between regions) is used for the estimation. For Approach II, the Bayesian approach, Dirichlet priors with concentration parameter 104 are placed on the row vectors in the transition matrixP. FromFig 2, we observe that both of Approach I and approach II provide estimated incidence for the 3 regions that are close to the incidence mean for 100 Monte Carlo replicates. The estimated reproductive numbers are also close to the true reproductive numbers. The credible bands of the estimated reproductive numbers are quite narrow for the incidence adjustment approach, while it is wider in the Bayesian approach. FromTable 1, we see that the estimate of expectation of incidence andR(t) from Approach II is more accurate that Approach I in terms of MSE.
Mean squared error (MSE) is computed as the squared error that averaged across days. Sensitivity and specificity are computed for the eventR(t)>1. Note that Approach I without using mobility data is equivalent to the original Fraser’s method implemented in EpiEstim.
When we do not use mobility data (i.e.P is an identity matrix), the incidence estimates obtained by Approach I deviate from the true incidence mean, especially earlier in the outbreak, compared to that obtained by Approach II. Although the estimated incidences obtained by Approach II are close to the mean of simulated data, theR(t) estimates obtained by both approaches when not accounting for mobility, are very similar for each of the three regions and quite different from the trueR(t) curves. The results show that the estimates forR(t) are close to the trueR(t) if we use the mobility information in the model. But if we just stratify the data by region and estimateR(t) ignoring mobility patterns between regions, we are not able to capture the transmission differences. FromTable 1, for estimated expectation of incidence from Approach I, the MSE is 27.129 when using mobility information and it is 137.49 when not using mobility information. For estimatedR(t) from Approach I, the MSE is 0.011 when using mobility information, and it is 0.146 when not using mobility information. Both sensitivity and specificity that test for the eventR(t)>1 are higher when using mobility information compare to not using mobility information. The pattern for the results from Approach II is similar to that from Approach I.
COVID-19 results
Overview of incidence in Massachusetts.
Fig 3 is an overview of the incidence of COVID-19 from July 1st, 2020 to February 28th, 2021 for the 14 counties in the State of Massachusetts, USA. Essex, Middlesex and Suffolk county, the most populous counties, have relatively high incidence. The overall pattern of incidence across counties is similar, exhibiting an obvious increase after November 2020 during the second wave.
Incidence of COVID-19 from July 1st, 2020 to February 28th, 2021 for the 14 counties in the State of Massachusetts, USA.
Population flow across counties in Massachusetts based on human mobility data.
Human mobility data is obtained from the multiscale dynamic human mobility flow dataset constructed and maintained by Kang et al. [17]. They computed, aggregated and inferred the daily and weekly dynamic origin-to-destination (O-D) flow at three geographic scales (census tract, county and state) analyzing anonymous mobile phone users’ visits to various places provided by SafeGraph [18]. We use county-level data in Massachusetts for the modeling in this real data analysis. The human mobility data consists of the estimated number of visitors traveling from one county to another each day. For each county, we use the directional mobility data to compute the proportions of the population that travel to other counties as well as the proportion of the population that stays in the county. Therefore, we obtain the mobility matricesP(t) for each day. There are two assumptions we made in the analysis. First, we assume that the population is mixing slowly, or the population has a regular travel pattern, as described in Methods and Materials Section. Second, we assume that the mobility of the whole population is representative of the mobility of infected individuals. Both assumptions are important for the model as well as the analysis. If there is a large change in the travel pattern in the population, the first assumption will be violated, and the model will suffer due to the difficulty of tracing the location of the infected individuals over time. The second assumption makes it reasonable to infer the mobility of the infected individuals from the mobility of the population. The infected individuals, especially in the early stage of infection, could be more mobile than the general population, thus we consider our assumption to be conservative for the mobility of the infected individuals.
Before we analyze the real data with the proposed model, we first visualize the mobility data to examine the overall travel pattern of the population across counties in Massachusetts. We useLij(t) to denote the number of visitors from countyj to countyi in dayt. To visualize how the counties are clustered according to visitors traveling between them, we compute the average daily population flow for each (i,j) pair from July 1st, 2020 to February 28th, and stratify the flow by weekdays and weekends, assuming there will be different patterns for working days and non-working days.
There are notable differences between weekday and weekend patterns of mobility that can be seen in the heatmaps and dendrograms (Fig 4) generated with complete-linkage hierarchical clustering. During weekdays, most travel is between regions that are geographically proximate, for example, Barnstable, Bristol and Plymouth. On weekends, counties further apart are in the same cluster on the heat map, for example, Norfolk is in the cluster with Barnstable, Bristol and Plymouth. We also show the population flow on the geographical map inFig 5. The clustering is more clear for regions that are geographically proximate for the daily average that is not stratified by weekdays and weekends. From the figure showing the difference between the weekdays’ daily average and weekends’ daily average, we observe that there is more of the population traveling between Essex, Worcester, Norfolk, Suffolk and Middlesex on weekdays compared to weekends, and less of the population traveling between Middlesex, Barnstable and Plymouth as well as between Norfolk, Barnstable and Plymouth. These patterns support the clustering in the heat maps.
Darker colors indicate regions with more flow between them.
The figure on the left shows the average daily population flow, and the figure on the right shows the difference of average population flow between weekdays and weekends, a positive value means the population flow is larger for weekdays than weekends, and a negative value means the population flow is smaller for weekdays than weekends. The maps are created in R and withurbnmapr andggplot2 packages using data from the US Census Bureau, available athttps://github.com/UrbanInstitute/urbnmapr.
Besides quantifying and visualizing the population flow between counties, quantifying the daily change of population for a county will help us understand the potential output of cases from that county and the infection pressure from other counties. We first compute population flow out and flow in before we compute population change. Assume that there areJ counties considered, and the set contains the indices for theJ counties. For county
, to compute the population flow out in dayt, we can aggregate the number of visitors from countyj that travel to other counties to be the size of population flow out for countyj in dayt, and denote it as
. Also, we can aggregate the number of visitors from other counties that travel to countyj to be the size of population flow in countyj in dayt, and denote this as
. And we useL(t)(j) to denote the population size of countyj on dayt. Note that the data is for human mobility in that day, instead of a permanent move.
Population change, which is denoted as, can inform how the population in regionj is mixing with other regions in dayt. Population change refers to the change in population size in dayt compared to the population size in dayt − 1 as a ratio, that is
, where
denotes the size of population flow in and
the size of population flow out in dayt for countyj. We examine the percentage of population change for all counties shown as the top panel inFig 6. The population change plot shows that there is a relatively high percentage of population change for Barnstable, Dukes and Nantucket before October, 2020, due to the population inflow.
Solid lines are the posterior means for incidence andR(t), along with the 95% credible band. The bar plots for the observed incidence are also shown. Results from Approach I, the incidence adjustment approach, are in red and those from Approach II, the Bayesian approach, are in green, and those from the original Fraser’s method (obtained by Approach II without incorporating mobility data) are in blue.
Estimated expected incidence and heterogeneous instantaneous reproductive numbers.
We applied both Approach I and II to the COVID-19 incidence data of 14 counties in the State of Massachusetts, USA, while taken the mobility data into account. The middle and bottom panel inFig 6 shows the estimated expected incidence andR(t) for all counties with both of our proposed methods. Results from Approach I, the incidence adjustment approach, are shown in red and those from Approach II, the Bayesian approach, are shown in green, and the results from original Fraser’s approach (obtained by Approach II without incorporating mobility data) are shown in blue.
When using mobility information, the estimatedR(t)’s are relatively lower for Barnstable and Norfolk compare to that when not using mobility information. Barnstable has a high percentage of population flow in from July to October, and during that time period,R(t) estimates are clearly lower to that when not using mobility information. For Norfolk, the inflow of population happens after August, and we observed lower estimates ofR(t) when using mobility information compare to that when not using mobility information. From the result, it is possible that the increase of incidence in these counties is due to inflow of incidences from other counties with higherR(t). From the simulation with scenario 3, we demonstrated that if there is a region with lowerR(t) and accepting incidences from other regions with higherR(t), we will overestimate theR(t) if we do not consider the mobility data.
The results from Approach II are similar to those from Approach I when incidences are high, while in counties with lower reproductive numbers the estimatedR(t)’s from Approach II are smaller than that from Approach I. The difference of estimates between these two approaches could be due to the low count of incidence. In simulation scenario 2, we observe that Approach I tends to overestimate R(t) compared to Approach II in presence of a low incidence count.
Discussion
It is well-established for many infectious diseases that there is substantial heterogeneity in transmission patterns. One might reasonably expect that some of this variability occurs geographically due to a potentially complex combination of social factors and some amount of stochastic effects. Estimating spatially granular reproductive numbers allows for greater targeting of interventions and the potential to uncover the factors that drive heightened transmission. We have described two approaches for estimatingR(t) that incorporate mixing patterns between distinct groups, which in our setting is informed by mobility data between geographic regions.
We demonstrate how these two approaches perform on simulated data. Simulations shows that both of the approaches are able to estimate the heterogeneous instantaneous reproductive numbers for multiple regions well when the mobility data is well-specified. We observed that the second approach has larger variability. This is expected since in Approach II we incorporate some of our uncertainty around the accuracy of the mobility data, allowing some flexibility in the case where the mobility data might not exactly represent how incident cases are flowing between the regions. This means that the first approach is likely more sensitive to inaccuracies in the mobility data, while the second approach samples over for the mobility prior together with the other parameters, allowing for some misspecification. Therefore, if we have high-quality mobility data that is representative of the population and incidence flow, and are only interested in obtaining reproductive numbers for multiple regions, we can use the more efficient Approach I. If we want to incorporate uncertainty in the mobility data and/or investigate factors that are associated withR(t), we can use Approach II.
In our simulation, we show that using mobility information allows us to obtain estimates forR(t) that are close to the trueR(t) and that this is not feasible when mobility data is not used (see scenario 1). In other words, simply stratifying data by region and estimatingR(t) ignoring mobility patterns between regions does not appropriately capture transmission differences. This is especially important when there are regions with a lowerR(t) accepting population flow from regions with a higherR(t). For example, people might live in counties with lowerR(t), but work in counties with higherR(t). If mobility information is not taken into account, we could overestimateR(t) for the counties in which these people are living, and underestimate theR(t) in the counties where they work. This is shown in our simulation results (see scenario 3).
A potential additional benefit of the more computationally intensive second approach is that local factors, such as age, socioeconomic status and disease containment policies can be incorporated into the estimation framework. This can potentially allow one to not only estimate more accurately the differences between regions, but also potentially start to more carefully understand some of the underlying factors influencing the transmission differences.
When we consider the dynamics of COVID-19 in Massachusetts, the county-level results show that the two approaches yield similar estimates, but that these are distinct from the naive approach that ignores mobility. Generally, the estimated incidence data is similar, but there are some differences in the estimatedR(t)’s with mobility incorporated.R(t) estimates from Approach I have a larger credible band for the counties with lower incidence, such as Nantucket, Franklin and Dukes. The second approach produces smoother estimatedR(t)’s when incidences are low. From simulation scenario 2, we have shown that Approach I tends to overestimateR(t) compared to Approach II when there are low counts for incidence. This could be the reason for the largerR(t) estimates from Approach I for Nantucket, Franklin and Dukes during the time with low incidence count. For Barnstable and Norfolk, we observe a positive population change that correspond to lower estimatedR(t) from Approach I and II when compare with not using mobility information, this might due to the incidence input from other counties with higherR(t).
For both of the methods, an important assumption is that the mobility data describes the flow of infectious individuals, even though it is not explicitly measuring this. This might not hold if individuals dramatically change their behavior when they are infectious. A potential approach to cope with this problem might be adding parameters informed by behavioral data among infectious individuals as weights to the mobility data to account for changed mobility due to the disease.
In a summary, the instantaneousR(t) is an important metric for infectious disease surveillance, since it provides a real-time description of the transmission dynamics among the population. While estimatingR(t) for multiple regions, we expect to see heterogeneity. However, estimating the heterogeneity can be challenging when there is extensive population flow between regions leading to a mixing of the population that can mask or misrepresent the true transmission dynamics. We have presented two methods that incorporate mobility data for the estimation of spatially heterogeneousR(t). The ultimate goal of this approach is to identify the regions with the higher transmission in which to focus interventions as well as study potential mechanisms of transmission. These methods have broad applicability to estimatingR(t) in the presence of any potential heterogeneities, such as age-mixing which can use mixing behavior described by contact surveys such as those performed by Mossong et al. [21].
Supporting information
S1 Appendix.Other simulation results.
Simulation result for Model 1, 2 and 3 in scenario 1, and results for scenario 2, 3 and 4.
https://doi.org/10.1371/journal.pcbi.1010434.s001
(PDF)
S1 Fig.EstimatedE[N(t)] andR(t) by Model 1, 2 and 3.
Solid lines are posterior means, along with the 95% credible bands (shaded). The results are summarized from Approach II with different parameter settings described in the Simulation Settings Section.
https://doi.org/10.1371/journal.pcbi.1010434.s002
(TIF)
S2 Fig.EstimatedR(t) for low incidence scenario.
Solid lines are posterior means, along with the 95% credible bands (shaded). The results are summarized from Approach I and Approach II.
https://doi.org/10.1371/journal.pcbi.1010434.s003
(TIF)
S3 Fig.EstimatedR(t) for population input from other regions.
Solid lines are posterior means, along with the 95% credible bands (shaded). The results are summarized from Approach I with and without incorporating mobility information.
https://doi.org/10.1371/journal.pcbi.1010434.s004
(TIF)
S4 Fig.EstimatedR(t) for population input from other regions with higher population mixing among the regions.
Solid lines are posterior means, along with the 95% credible bands (shaded). The results are summarized from Approach I with and without incorporating mobility information.
https://doi.org/10.1371/journal.pcbi.1010434.s005
(TIF)
S5 Fig.EstimatedE[N(t)] andR(t) in scenario with inaccurateP matrix.
Solid lines are posterior means, along with the 95% credible bands (shaded), the color of solid lines represents the mobility information used in the model. Red dashed lines are means ofN(t) andR(t) in the simulated data.
https://doi.org/10.1371/journal.pcbi.1010434.s006
(JPG)
Acknowledgments
We are grateful for the helpful discussions and suggestions from Peitao Wu and Benjamin Rader.
References
- 1.Wallinga J, Teunis P. Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures. American Journal of epidemiology. 2004;160(6):509–516. pmid:15353409
- 2.Fraser C. Estimating individual and household reproduction numbers in an emerging epidemic. PloS one. 2007;2(8):e758. pmid:17712406
- 3.Cori A, Ferguson NM, Fraser C, Cauchemez S. A new framework and software to estimate time-varying reproduction numbers during epidemics. American journal of epidemiology. 2013;178(9):1505–1512. pmid:24043437
- 4.Thompson R, Stockwin J, van Gaalen RD, Polonsky J, Kamvar Z, Demarsh P, et al. Improved inference of time-varying reproduction numbers during infectious disease outbreaks. Epidemics. 2019;29:100356. pmid:31624039
- 5.Creswell R, Augustin D, Bouros I, Farm HJ, Miao S, Ahern A, Robinson M, Lemeneul-Diot A, Gavaghan DJ, Lambert BC, Thompson RN. Heterogeneity in the onwards transmission risk between local and imported cases affects practical estimates of the time-dependent reproduction number (2022) Phil. Trans. Roy. Soc. A, 380:20210308.
- 6.Li T, White LF. Bayesian back-calculation and nowcasting for line list data during the COVID-19 pandemic. PLoS computational biology. 2021;17(7):e1009210. pmid:34252078
- 7.Günther F, Bender A, Katz K, Küchenhoff H, Höhle M. Nowcasting the COVID-19 pandemic in Bavaria. Biometrical Journal. 2021;63(3):490–502. pmid:33258177
- 8.Martinez de Salazar P, Lu F, Hay J, Gomez-Barroso D, Fernandez-Navarro P, Martinez E, et al. Near real-time surveillance of the SARS-CoV-2 epidemic with incomplete data. 2021;.
- 9.Pitzer VE, Chitwood M, Havumaki J, Menzies NA, Perniciaro S, Warren JL, et al. The impact of changes in diagnostic testing practices on estimates of COVID-19 transmission in the United States. MedRxiv. 2020;. pmid:32511624
- 10.White LF, Archer B, Pagano M. Estimating the reproductive number in the presence of spatial heterogeneity of transmission patterns. International journal of health geographics. 2013;12(1):1–10. pmid:23890514
- 11.Sun K, Wang W, Gao L, Wang Y, Luo K, Ren L, et al. Transmission heterogeneities, kinetics, and controllability of SARS-CoV-2. Science. 2021;371 (6526). pmid:33234698
- 12.Woolhouse ME, Dye C, Etard JF, Smith T, Charlwood J, Garnett G, et al. Heterogeneities in the transmission of infectious agents: implications for the design of control programs. Proceedings of the National Academy of Sciences. 1997;94(1):338–342. pmid:8990210
- 13.Zhang Y, Li Y, Wang L, Li M, Zhou X. Evaluating transmission heterogeneity and super-spreading event of COVID-19 in a metropolis of China. International journal of environmental research and public health. 2020;17(10):3705.
- 14.Sy KTL, Martinez ME, Rader B, White LF. Socioeconomic Disparities in Subway Use and COVID-19 Outcomes in New York City. American Journal of Epidemiology. 2020;. pmid:33372209
- 15.Mishra S, Kwong JC, Chan AK, Baral SD. Understanding heterogeneity to inform the public health response to COVID-19 in Canada. CMAJ. 2020;192(25):E684–E685. pmid:32493741
- 16.CDC. Centers for Disease Control and Prevention, COVID-19 Response. COVID-19 Case Surveillance Data Access, Summary, and Limitations (version date: December 31, 2020); 2021.https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data/vbim-akqf.
- 17.Kang Y, Gao S, Liang Y, Li M, Kruse J. Multiscale Dynamic Human Mobility Flow Dataset in the U.S. during the COVID-19 Epidemic. Scientific Data. 2020; p. 1–13. pmid:33184280
- 18.SafeGraph. The impact of coronavirus (COVID-19) on foot traffic; 2020. Available from:https://www.safegraph.com/dashboard/covid19-commercepatterns.
- 19.Mishra. On the derivation of the renewal equation from an age-dependent branching process: an epidemic modelling perspective. arXiv preprint arXiv:200616487. 2020;.
- 20.Zhang J, et al. Evolving epidemiology of novel coronavirus diseases 2019 and possible interruption of local transmission outside Hubei Province in China: a descriptive and modeling study. MedRxiv. 2020;. pmid:32511424
- 21.Mossong J, Hens N, Jit M, Beutels P, Auranen K, Mikolajczyk R, et al. Social contacts and mixing patterns relevant to the spread of infectious diseases. PLoS medicine. 2008;5(3):e74. pmid:18366252
Subject Areas?For more information about PLOS Subject Areas, clickhere.
We want your feedback. Do these Subject Areas make sense for this article? Click the target next to the incorrect Subject Area and let us know. Thanks for your help!
For more information about PLOS Subject Areas, clickhere.
We want your feedback. Do these Subject Areas make sense for this article? Click the target next to the incorrect Subject Area and let us know. Thanks for your help!- Human mobility
Is the Subject Area"Human mobility" applicable to this article?
Thanks for your feedback.
- Epidemiology
Is the Subject Area"Epidemiology" applicable to this article?
Thanks for your feedback.
- COVID 19
Is the Subject Area"COVID 19" applicable to this article?
Thanks for your feedback.
- Simulation and modeling
Is the Subject Area"Simulation and modeling" applicable to this article?
Thanks for your feedback.
- Infectious disease control
Is the Subject Area"Infectious disease control" applicable to this article?
Thanks for your feedback.
- Pandemics
Is the Subject Area"Pandemics" applicable to this article?
Thanks for your feedback.
- Geography
Is the Subject Area"Geography" applicable to this article?
Thanks for your feedback.
- Regional geography
Is the Subject Area"Regional geography" applicable to this article?
Thanks for your feedback.
https://orcid.org/0000-0002-9424-7826