Overview of evaluation approaches for scenario projections
When evaluating the distance between a scenario projection and an observation, there are two potential factors at play: (1) the scenario assumptions may not match reality (e.g., scenario-specified vaccine uptake may underestimate realized uptake), and (2) even if the scenario specifications and reality were aligned, model predictions may be imperfect due to miscalibration. The difference between projections and observations is a complex combination of both sources of disagreement, and importantly, projections that are close to (or far from) observations do not necessarily imply that models are well (or poorly) calibrated (e.g., for scenarios very far from reality, we might expect even well-calibrated projections to deviate from observations). To address both components, we evaluated the plausibility of COVID-19 Scenario Modeling Hub (SMH) scenarios and the performance of SMH projections (ensemble and component models). A similar approach has been proposed by Hausfather et al.45. Below, we describe in turn the component models contributing to SMH, the construction of the ensemble, the evaluation of scenario assumptions, and our approaches to estimating model calibration and SMH performance.
SMH advertised new rounds of scenario projections across various modeling channels, using an open call to elicit projections from independent modeling teams. Scenario specifications were designed in collaboration with public health partners and modeling teams, and final specifications were published on a public GitHub repository (https://github.com/midas-network/covid19-scenario-modeling-hub). Teams then submitted projections to this same repository. For additional discussion of the philosophy and history of SMH, as well as details about the SMH process, see Loo et al.43.
Over the course of the first sixteen rounds of SMH, thirteen independent models submitted projections, with most submitting to multiple rounds. Prior experience in public health modeling varied substantially across participating teams, ranging from models newly built to address the COVID-19 pandemic to teams with long-established relationships with local, state, and national public health agencies. The majority of submitting models were mechanistic compartmental models, though there was one semi-mechanistic model and two agent-based models. Some models were calibrated to, and made projections at, the county level, whereas others were calibrated to, and made projections at, the state level; many, but not all, had age structure. We have provided an overview of each model in Table S1. As models changed each round to accommodate different scenarios and adapt to the evolving pandemic context, we chose not to focus here on model-specific differences (in structure, parameters, or performance). For more information on round-specific implementations, we direct readers to other publications with details5,22.
Our analysis included state- and national-level projections of weekly incident cases, hospitalizations, and deaths from individual models and various ensembles for fourteen of the first sixteen rounds of SMH (Rounds 8 and 10 were not released publicly, and are therefore not included; see also Table S2 for a list of jurisdictions included). Each round included projections from between 4 and 9 individual models, as well as ensembles. For a given round, modeling teams submitted projections for all weeks of the projection period, all targets (i.e., incident or cumulative cases, hospitalizations, and deaths), all four scenarios, and at least one location (i.e., states, territories, and national). Here, we evaluated only individual models that provided national projections in addition to state-level projections (i.e., excluding individual models that did not submit a national projection, though projections from these models are still included in the state-level ensembles that were evaluated). Submitted projections that did not comply with SMH conditions (e.g., for quantifying uncertainty or defining targets) were also excluded (0.8% of all submitted projections). A detailed description of exclusions can be found in Table S2.
Modeling teams submitted probabilistic projections for each target via 23 quantiles (e.g., teams provided projected weekly incident cases for Q1, Q2.5, Q5, Q10, Q20, ..., Q80, Q90, Q95, Q97.5, and Q99). We evaluated 3 methods for aggregating projections: untrimmed-LOP and trimmed-LOP (variations of probability averaging, or linear opinion pool18, LOP), and median-Vincent (a variation of quantile or Vincent averaging36,37, which is also used by other hubs11).
The untrimmed-LOP is calculated by taking an equally weighted average of cumulative probabilities across individual models at a single value. Because teams submitted projections for fixed quantiles, we used linear interpolation between these value-quantile pairs to ensure that all model projections were defined for the same values. We assumed that all projected cumulative probabilities jump to 0 and 1 outside of the defined value-quantile pairs (i.e., Q1-Q99). In other words, for a projection defined by cumulative distribution function $F(x)$ with quantile function $F^{-1}(x)$, we assume that $F(x)=0$ for all $x < F^{-1}(0.01)$ and $F(x)=1$ for all $x > F^{-1}(0.99)$.
The trimmed-LOP uses exterior cumulative distribution function (CDF) trimming54 of the two outermost values to reduce the variance of the aggregate, compared to the untrimmed-LOP (i.e., the prediction intervals are narrower). To implement this method, we follow the same procedure as the untrimmed-LOP, but instead of taking an equally weighted average over all models, we exclude the highest and lowest cumulative probabilities at a given value and equally weight the remaining probabilities in the average. Under this trimming method, the exclusions at different values may come from different teams.
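As an illustration, the following minimal sketch implements both LOP variants on a common value grid, assuming each model's projection is supplied as its 23 submitted quantile values; the grid, helper names, and toy inputs are ours for illustration and do not reproduce the SMH implementation.

import numpy as np

# The 23 SMH quantile levels (Q1, Q2.5, Q5, Q10, ..., Q90, Q95, Q97.5, Q99).
LEVELS = np.concatenate(([0.01, 0.025], np.arange(0.05, 0.951, 0.05), [0.975, 0.99]))

def cdf_on_grid(values, grid):
    # CDF of one model's projection on a common grid of values, using linear
    # interpolation between the submitted value-quantile pairs; outside Q1-Q99
    # the CDF jumps to 0 and 1, as assumed above.
    return np.interp(grid, values, LEVELS, left=0.0, right=1.0)

def lop(model_values, grid, trim=False):
    # Probability averaging (LOP) of several models' CDFs at each grid value.
    # With trim=True, the highest and lowest cumulative probability at each
    # value are dropped before averaging (exterior CDF trimming).
    cdfs = np.array([cdf_on_grid(v, grid) for v in model_values])  # models x grid
    if trim and cdfs.shape[0] > 2:
        cdfs = np.sort(cdfs, axis=0)[1:-1]
    return cdfs.mean(axis=0)

# Toy example: three hypothetical projections of weekly incident cases.
rng = np.random.default_rng(0)
models = [np.quantile(rng.gamma(shape=5, scale=s, size=10_000), LEVELS)
          for s in (80, 100, 120)]
grid = np.linspace(0, 2000, 401)
untrimmed_cdf = lop(models, grid)
trimmed_cdf = lop(models, grid, trim=True)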
The median-Vincent aggregate is calculated by taking the median value across models at each specified quantile. These methods were implemented using the CombineDistributions package19 for the R statistical software55.
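Under the same representation, the median-Vincent aggregate reduces to the elementwise median of the submitted values at each quantile level; a short sketch (our helper, not the CombineDistributions implementation):

import numpy as np

def median_vincent(model_values):
    # Median of the models' projected values at each of the 23 quantile
    # levels, returned as a new quantile-value vector.
    return np.median(np.asarray(model_values), axis=0)

# e.g., vincent = median_vincent(models)   # `models` as in the LOP sketch above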
Projections in each SMH round were made for 4 distinct scenarios that detailed potential interventions, changes in behavior, or epidemiologic situations (Fig. 1). Scenario design was guided by one or more primary purposes56, which were often informed by public health partners and our hypotheses about the most important uncertainties at the time. SMH scenarios were designed approximately one month before projections were submitted, and therefore 4-13 months before the end of the projection period, depending on the round's projection horizon. Scenario assumptions, especially those about vaccine efficacy or properties of emerging viral variants, were based on the best data and estimates available at the time of scenario design (these were often highly uncertain). Here, our purpose was to evaluate SMH scenario assumptions using the best data and estimates currently available, after the projection period had passed. We assessed SMH scenarios from two perspectives:
1. Based on their prospective purpose: we identified whether scenarios bracketed reality along each uncertainty axis (i.e., one axis of the 2 × 2 table defining scenarios, based on one key source of uncertainty for the round). Scenarios in most SMH rounds were designed to bracket true values of key epidemic drivers (though the true value was not known at the time of scenario design). In other words, along each uncertainty axis in an SMH round, scenarios specified two levels (e.g., optimistic and pessimistic assumptions). Here we tested whether the realized value fell between those two assumptions (if so, we called this bracketing).
2. For retrospective evaluation of calibration: we identified the set of plausible scenario-weeks for each round. One of our primary goals in this analysis was to assess and compare the calibration of different approaches (e.g., individual models, SMH ensemble, comparator models). To assess this in the most direct way possible, we chose scenarios and projection weeks that were close to what actually happened (i.e., we isolated error due to calibration by minimizing deviation between scenarios and reality; see "Overview of evaluation approaches for scenario projections" for details).
An evaluable scenario axis was defined as an axis for which assumptions could be confronted with subsequently observed data on epidemic drivers; in some instances, we could not find relevant data and the axis was not considered evaluable (e.g., NPI, see below). To evaluate scenario assumptions, we used external data sources and literature (Table S3). Due to differences across these sources, we validated each type of scenario assumption differently (vaccination, NPI, and variant characteristics; Fig. 2), as described in detail below and in Table S3. Vaccine specifications and realized coverage are shown in Figs. S2-S5, while details of our round-by-round evaluation are provided below.
Rounds 1-4 concentrated on the early roll-out of the vaccine in the US and compliance with NPIs. To evaluate our vaccine assumptions in these rounds, we used data on reported uptake from the US Centers for Disease Control and Prevention database57. For these rounds, scenarios prescribed monthly national coverage (state-specific uptake was intentionally left to the discretion of the modeling teams), so we only used national uptake to evaluate the plausibility of each vaccination scenario (Fig. S2). In these scenarios, bracketing was defined as reality falling between cumulative coverage in the optimistic and pessimistic scenarios for 50% or more of all projection weeks. The plausible scenario was the scenario with the smallest absolute difference between cumulative coverage in the final projection week (or, in cases of variant emergence, the last week of projections before emergence; details below) and the observed cumulative coverage. We also considered choosing the plausible scenario via the cumulative difference between observed and scenario-specified coverage over the entire projection period; this always led to selecting the same scenario as plausible.
When scenarios specified a coverage threshold, we compared assumptions with the reported fraction of people vaccinated at the end of the projection period. For instance, in Round 2 scenarios C and D, we stipulated that coverage would not exceed 50% in any priority group, but reported vaccination exceeded this threshold. In Rounds 3-4, the prescribed thresholds were not exceeded during the truncated projection period.
By Round 5 (May 2021), vaccine uptake had started to saturate. Accordingly, in Rounds 5-7, vaccine assumptions were based on high and low saturation thresholds that should not be exceeded for the duration of the projection period, rather than monthly uptake curves. For these rounds, we evaluated which of the prescribed thresholds was closest to the reported cumulative coverage at the end of the projection period (Fig. S3). Later rounds took similar approaches to specifying uptake of childhood vaccination (Round 9) and bivalent boosters (Rounds 14-16). Rounds 9 (Fig. S4) and 14-15 (Fig. S5) specified weekly coverage, and Round 16 specified a coverage threshold; we followed similar approaches in evaluating these scenarios.
For vaccine efficacy assumptions, we consulted population-level studies conducted during the period when the variant most prevalent in that round was circulating (Table S3). Similarly, for scenarios about emerging viral variants (regarding transmissibility increases, immune escape, and severity) and waning immunity, we used values from the literature as ground truth for these scenario assumptions. We identified the most realistic scenario as the one with assumptions closest to the literature value (or the average of literature values if multiple were available, Table S3).
Rounds 1-4 included assumptions about NPIs. We could not identify a good source of information on the efficacy of, and compliance with, NPIs that would match the specificity prescribed in the scenarios (despite the availability of mobility and policy data, e.g., Hallas et al.58). Rounds 13-15 included assumptions about immune escape and severity of hypothetical variants that may have circulated in the post-Omicron era. Round 16 considered broad variant categories based on similar levels of immune escape, in response to the increasing genetic diversity of SARS-CoV-2 viruses circulating in fall 2022. There were no data available to evaluate immune escape assumptions after the initial Omicron BA.1 wave. As such, NPI scenarios in Rounds 1-4 and immune escape variant scenarios in Rounds 13-16 were not evaluable for bracketing analyses, and we therefore considered all scenarios realistic in these cases. Overall, across 14 publicly released rounds, we identified a single most realistic scenario in 7 rounds, and two most realistic scenarios in the other 7.
Finally, in some rounds, a new viral variant emerged during the projection period that was not specified in the scenarios for that round. We considered this emergence to be an invalidation of scenario assumptions, and removed these weeks from the set of plausible scenario-weeks. Specifically, emergence was defined as the week after prevalence exceeded 50% nationally according to outbreak.info variant reports59,60,61, accessed via outbreak.info R client62. Accordingly, the Alpha variant (not anticipated in Round 1 scenarios) emerged on 3 April 2021, the Delta variant (not anticipated in Rounds 2-5) emerged on 26 June 2021, and the Omicron variant (not anticipated in Round 9) emerged on 25 December 2021.
To assess the added value of SMH projections against plausible alternative sources of information, we also assessed comparator models and other benchmarks. Comparator models based on historical data were not available here (e.g., there was no prior observation of COVID-19 in February in the US when we projected February 2021). Among the many potential alternatives, we used three comparator models: a naive model, a 4-week forecast model, and a trend-continuation model.
The baseline naive model was generated by carrying recent observations forward, with variance based on historical patterns (Figs. S13-S15). We used the 4-week ahead baseline model forecast from the COVID-19 Forecast Hub11 for the first week of the projection period as the naive model, and assumed this projection held for the duration of the projection period (i.e., this forecast was the naive projection for all weeks during the projection period). Because the COVID-19 Forecast Hub collects daily forecasts for hospitalizations, we drew 1000 random samples from each daily distribution in a given week and summed those samples to obtain a prediction for weekly hospitalizations. The naive model is flat and has relatively large prediction intervals in some instances.
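The step of converting daily hospitalization forecasts into a weekly distribution could look roughly like the sketch below, assuming each day's forecast is available as values at the 23 quantile levels and that days are sampled independently via the interpolated quantile function; these are our assumptions for illustration, not a description of the COVID-19 Forecast Hub code.

import numpy as np

LEVELS = np.concatenate(([0.01, 0.025], np.arange(0.05, 0.951, 0.05), [0.975, 0.99]))

def weekly_from_daily(daily_quantiles, n_samples=1000, seed=1):
    # Draw n_samples from each day's distribution (inverse-CDF sampling on the
    # interpolated quantile function; draws outside Q1-Q99 are clamped to the
    # outermost quantiles), sum the draws across the 7 days, and summarize the
    # weekly total as the same 23 quantiles.
    rng = np.random.default_rng(seed)
    weekly_draws = np.zeros(n_samples)
    for day_values in daily_quantiles:        # 7 arrays of 23 values each
        u = rng.uniform(size=n_samples)
        weekly_draws += np.interp(u, LEVELS, day_values)
    return np.quantile(weekly_draws, LEVELS)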
As a forecast-based comparator, we used the COVID-19 Forecast Hub COVIDhub-4_week_ensemble model (Figs. S7-S9). This model includes forecasts (made every week) from multiple component models (e.g., on average 41 component models between January and October 202111). We obtained weekly hospitalization forecasts from the daily forecasts of the COVID-19 Forecast Hub using the same method as for the naive model. This 4-week forecast model is particularly skilled at death forecasts11; however, in practice, there is a mismatch in timing between when these forecasts were made and when SMH projections were made. For most SMH projection weeks, forecasts from this model would not yet be available (i.e., projection horizons more than 4 weeks into the future); yet, for the first 4 weeks of the SMH projection period, SMH projections may have access to more recent data. It should also be noted that the team running the COVID-19 Forecast Hub has flagged the 4-week ahead predictions of cases and hospitalizations as unreliable63. Further, SMH may be given an advantage by the post-hoc selection of plausible scenario-weeks based on the validity of scenario assumptions.
Finally, the trend-continuation model was based on a statistical generalized additive model (GAM; Figs. S10-S12). The model was fit to the square root of the 14-day moving average, with cubic spline terms for time, and was fit separately for each location. We considered including seasonal terms, but there were not enough historical data to meaningfully estimate seasonality. For each round, we used only one year of data to fit the model, and projected forward for the duration of the projection period. The SMH ensemble consistently outperformed this comparator model (see Figs. S16-S21).
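As a rough stand-in for this comparator (we have not reproduced the authors' exact GAM specification), the sketch below fits a cubic smoothing spline to the square root of the 14-day moving average of one location's daily series and extrapolates it over the projection horizon; the function name, arguments, and smoothing choices are illustrative.

import numpy as np
from scipy.interpolate import UnivariateSpline

def trend_continuation(daily_counts, horizon_days, smooth=None):
    # Fit sqrt(14-day moving average) against time with a cubic smoothing
    # spline, then extrapolate and back-transform to the original scale.
    ma = np.convolve(daily_counts, np.ones(14) / 14, mode="valid")
    y = np.sqrt(ma)
    t = np.arange(len(y))
    spline = UnivariateSpline(t, y, k=3, s=smooth)     # default ext=0 extrapolates
    t_future = np.arange(len(y), len(y) + horizon_days)
    return np.clip(spline(t_future), 0, None) ** 2     # point projections only

Aggregation of these daily point projections to weekly totals, and the uncertainty quantification used in the paper, are omitted from this sketch.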
Prediction performance is typically based on a measure of distance between projections and ground-truth observations. We used the Johns Hopkins CSSE dataset64 as a source of ground truth data on reported COVID-19 cases and deaths, and the U.S. Health and Human Services Protect Public Data Hub65 as a source of reported COVID-19 hospitalizations. These sources were also used for calibration of the component models. CSSE data were only produced through 4 March 2023, so our evaluation of Rounds 13-16 ended at this date (1 week before the end of the 52-week projection period in Round 13, 11 weeks before the end of the 52-week projection period in Round 14, 9 weeks before the end of the 40-week projection period in Round 15, and 8 weeks before the end of the 26-week projection period in Round 16).
We used two metrics to measure performance of probabilistic projections, both common in the evaluation of infectious disease predictions. To define these metrics, let $F$ be the projection of interest (approximated by a set of 23 quantile-value pairs) and $o$ be the corresponding observed value. The $\alpha\%$ prediction interval is the interval within which we expect the observed value to fall with $\alpha\%$ probability, given reality perfectly aligns with the specified scenario.
Ninety-five percent (95%) coverage measures the percent of projections for which the observation falls within the 95% projection interval. In other words, 95% coverage is calculated as
$$C_{95\%}(F,o)=\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\left(F_{i}^{-1}(0.025)\le o_{i}\le F_{i}^{-1}(0.975)\right)$$
(1)
where $\mathbf{1}(\cdot)$ is the indicator function, i.e., $\mathbf{1}\left(F_{i}^{-1}(0.025)\le o_{i}\le F_{i}^{-1}(0.975)\right)=1$ if the observation falls between the values corresponding to Q2.5 and Q97.5, and is 0 otherwise. We calculated coverage over multiple locations for a given week (i.e., $i=1,\ldots,N$ for $N$ locations), or across all weeks and locations.
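Equation (1) amounts to checking, for each location, whether the observation lies between the submitted Q2.5 and Q97.5 values; a minimal sketch assuming these bounds and observations are supplied as arrays (names are ours):

import numpy as np

def coverage_95(q025, q975, observed):
    # Fraction of projections whose observation falls inside the 95%
    # prediction interval [Q2.5, Q97.5], as in equation (1).
    q025, q975, o = map(np.asarray, (q025, q975, observed))
    return np.mean((q025 <= o) & (o <= q975))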
Weighted interval score (WIS) measures the extent to which a projection captures an observation, and penalizes wider prediction intervals35. First, given a prediction interval (with uncertainty level $\alpha$) defined by upper and lower bounds, $u=F^{-1}\left(1-\frac{\alpha}{2}\right)$ and $l=F^{-1}\left(\frac{\alpha}{2}\right)$, the interval score is calculated as
$$\mathrm{IS}_{\alpha}(F,o)=(u-l)+\frac{2}{\alpha}(l-o)\,\mathbf{1}(o<l)+\frac{2}{\alpha}(o-u)\,\mathbf{1}(u<o)$$
(2)
where, again, $\mathbf{1}(\cdot)$ is the indicator function. The first term of $\mathrm{IS}_{\alpha}$ represents the width of the prediction interval, and the last two terms are penalties for over- and under-prediction, respectively. Then, using weights that approximate the continuous ranked probability score66, the weighted interval score is calculated as
$$\mathrm{WIS}(F,o)=\frac{1}{K+1/2}\left(\frac{1}{2}\left|o-F^{-1}(0.5)\right|+\sum_{k=1}^{K}\frac{\alpha_{k}}{2}\,\mathrm{IS}_{\alpha_{k}}(F,o)\right)$$
(3)
Each projection is defined by 23 quantiles comprising 11 prediction intervals (plus the median), which we used for the calculation of WIS (i.e., we calculated $\mathrm{IS}_{\alpha}$ for $\alpha=0.02, 0.05, 0.1, 0.2, \ldots, 0.8, 0.9$ and $K=11$). It is worth noting that these metrics do not account for measurement error in the observations.
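A minimal sketch of equations (2)-(3), computed directly from the 23 submitted quantile values (ordered by level); this is our own illustrative implementation, not the evaluation code used in the paper.

import numpy as np

# The 23 SMH quantile levels; index 11 is the median (Q50).
LEVELS = np.concatenate(([0.01, 0.025], np.arange(0.05, 0.951, 0.05), [0.975, 0.99]))

def interval_score(lower, upper, o, alpha):
    # Equation (2): interval width plus over- and under-prediction penalties.
    return ((upper - lower)
            + (2 / alpha) * (lower - o) * (o < lower)
            + (2 / alpha) * (o - upper) * (o > upper))

def wis(values, o):
    # Equation (3): weighted interval score from the 23 quantile values.
    v = np.asarray(values, dtype=float)
    K = 11                              # 11 nested central prediction intervals
    total = 0.5 * abs(o - v[11])        # median term
    for k in range(K):                  # k = 0 pairs Q1 with Q99, etc.
        alpha = 2 * LEVELS[k]
        total += (alpha / 2) * interval_score(v[k], v[-(k + 1)], o, alpha)
    return total / (K + 0.5)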
WIS values are on the scale of the observations, and therefore comparison of WIS across different locations or phases of the pandemic is not straightforward (e.g., the scale of case counts is very different between New York and Vermont). For this reason, we generated multiple variations of WIS to account for variation in the magnitude of observations. First, for average normalized WIS (Fig. 3b), we calculated the standard deviation of WIS, $\sigma_{s,w,t,r}$, across all scenarios and models for a given week, location, target, and round, and divided the WIS by this standard deviation (i.e., $\mathrm{WIS}/\sigma_{s,w,t,r}$). Doing so accounts for the scale of that week, location, target, and round, a procedure implemented in analyses of climate projections67. Then, we averaged normalized WIS values across strata of interest (e.g., across all locations, or all locations and weeks). Other standardization approaches that compute WIS on a log scale have been proposed68, though they may not be as well suited for our analysis, which focuses on planning and decision making.
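With scores collected in a long table, this normalization can be expressed as a group-wise division; the data-frame layout and column names below are hypothetical.

import pandas as pd

def normalized_wis(scores: pd.DataFrame) -> pd.DataFrame:
    # Divide each WIS by the standard deviation of WIS taken across all models
    # and scenarios that share the same round, target, location, and week.
    sd = scores.groupby(["round", "target", "location", "week"])["wis"].transform("std")
    return scores.assign(norm_wis=scores["wis"] / sd)

# Averages across strata of interest then follow, e.g.:
# normalized_wis(scores).groupby(["model", "week"])["norm_wis"].mean()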
An alternative rescaling introduced by Cramer et al.11, relative WIS, compares the performance of a set of projections to an average projection. This metric is designed to compare performance across predictions from varying pandemic phases. The relative WIS for model $i$ is based on pairwise comparisons (to all other models, $j$) of average WIS. For each pair, we calculated the average WIS across all projections in common between model $i$ and model $j$, where $\mathrm{WIS}(i)$ and $\mathrm{WIS}(j)$ are the average WIS of these projections (either in one round, or across all rounds for the overall comparison) for model $i$ and model $j$, respectively. Then, relative WIS is the geometric average of the ratios, or
$$\mathrm{relative\ WIS}={\left(\prod_{j=1}^{N}\frac{\mathrm{WIS}(i)}{\mathrm{WIS}(j)}\right)}^{1/N}$$
(4)
When comparing only two models that have made projections for all the same targets, weeks, locations, rounds, etc., the relative WIS is equivalent to a simpler metric, the ratio of average WIS for each model (i.e., $\frac{\mathrm{WIS}(i)}{\mathrm{WIS}(j)}$). We used this metric to compare each scenario from the SMH ensemble to the 4-week forecast model (Fig. 4). For this scenario comparison, we provided bootstrap intervals by recalculating the ratio with an entire week of projections excluded (all locations, scenarios). We repeated this for all weeks, and randomly drew from these 1000 times. From these draws we calculated the 5th and 95th quantiles to derive the 90% bootstrap interval, and we assumed performance was significantly better for one scenario over the others if the 90% bootstrap intervals did not overlap. We also used this metric to compare the ensemble projections to each of the comparator models (Fig. S22).
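The pairwise construction behind equation (4) can be sketched as follows, assuming a long table of scores with hypothetical column names model, unit (identifying a target/week/location/scenario/round combination), and wis; this illustrates the geometric mean of pairwise ratios rather than reproducing the Cramer et al. implementation.

import numpy as np
import pandas as pd

def relative_wis(scores: pd.DataFrame, model_i: str) -> float:
    # Geometric mean over models j of WIS(i)/WIS(j), with each ratio computed
    # on the projections that models i and j have in common.
    units_i = scores.loc[scores["model"] == model_i, ["unit", "wis"]]
    ratios = []
    for model_j, scores_j in scores[scores["model"] != model_i].groupby("model"):
        common = units_i.merge(scores_j[["unit", "wis"]], on="unit",
                               suffixes=("_i", "_j"))
        if len(common):
            ratios.append(common["wis_i"].mean() / common["wis_j"].mean())
    return float(np.exp(np.mean(np.log(ratios))))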
In addition to traditional forecast evaluation metrics, we assessed the extent to which SMH projections predict the qualitative shape of incident trajectories (whether trends will increase or decrease). We modified a method from McDonald et al.40 to classify observations and projections as increasing, flat, or decreasing. First, we calculated the percent change in observed incident trajectories on a two-week lag (i.e., $\log(o_T+1)-\log(o_{T-2}+1)$ for each state and target). We took the distribution of percent change values across all locations for a given target and set the thresholds for a decrease or an increase assuming that 33% of observations would be flat (Fig. S23). Based on this approach, decreases were defined as those weeks with a percent change value below -23% for incident cases, -17% for incident hospitalizations, and -27% for incident deaths, respectively; increases have a percent change value above 14%, 11%, and 17%, respectively. See Fig. S34 for classification results with a one-week lag and different assumptions about the percent of observations that are flat.
Then, to classify trends in projections, we again calculated the percent change on a two-week lag of the projected median (we also considered the 75th and 95th quantiles because our aggregation method is known to generate a flat median when asynchrony between component models is high). For the first two projection weeks of each round, we calculated the percent change relative to the observations one and two weeks prior (as there are no projections to use as a reference in the weeks one and two weeks before the projection start date). We applied the same thresholds from the observations to classify a projection, and compared this classification to the observed classification. This method accounts for instances when SMH projections anticipate a change in trajectory but not the magnitude of that change (see Fig. S44), but it does not account for instances when SMH projections anticipate a change but miss the timing of that change (this occurred to some extent in Rounds 6 and 7, during the Delta variant wave). See Figs. S24-S33 for classifications of all observations and projections.
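A sketch of the classification step, using the two-week-lag log change described above and the thresholds reported in the text; the function and dictionary names are ours, and the same function would be applied to a projected median (or upper quantile) trajectory as to observations.

import numpy as np

# Thresholds from the text (two-week-lag change in log(x + 1), roughly a
# percent change); keys and names are illustrative.
DECREASE_BELOW = {"case": -0.23, "hosp": -0.17, "death": -0.27}
INCREASE_ABOVE = {"case": 0.14, "hosp": 0.11, "death": 0.17}

def classify_trend(series, target):
    # Label each week from T = 2 onward as decreasing, flat, or increasing
    # based on log(o_T + 1) - log(o_{T-2} + 1).
    x = np.log(np.asarray(series, dtype=float) + 1)
    change = x[2:] - x[:-2]
    labels = np.full(change.shape, "flat", dtype=object)
    labels[change < DECREASE_BELOW[target]] = "decreasing"
    labels[change > INCREASE_ABOVE[target]] = "increasing"
    return labels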
We assessed how well SMH projections captured incident trends using precision and recall, two metrics commonly used to evaluate classification tasks, here with three classes: increasing, flat, and decreasing41. To calculate these metrics, we grouped all projections by the projected and the observed trend (as in Fig. 5d). Let $N_{po}$ be the number of projections classified by SMH as trend $p$ (rows of Fig. 5d) when the corresponding observation was trend $o$ (columns of Fig. 5d). All possible combinations are provided in Table 2. Then, for class $c$ (either decreasing, flat, or increasing),
precision is the fraction of projections correctly classified as $c$, out of the total number of projections classified as $c$, or
$$\mathrm{precision}_{c}=\frac{N_{cc}}{\sum_{j=1}^{3}N_{cj}}$$
(5)
For example, the precision of increasing trends is the number of correctly classified increases ($N_{II}$) divided by the total number of projections classified as increasing ($N_{ID}+N_{IF}+N_{II}$).
recall is the fraction of projections correctly classified as $c$, out of the total number of projections observed as $c$, or
$$\mathrm{recall}_{c}=\frac{N_{cc}}{\sum_{j=1}^{3}N_{jc}}$$
(6)
For example, the recall of increasing trends is the number of correctly classified increases ($N_{II}$) divided by the total number of observations that increased ($N_{DI}+N_{FI}+N_{II}$).
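Given the 3 × 3 table of counts $N_{po}$ (rows: projected trend, columns: observed trend, as in Fig. 5d), equations (5) and (6) reduce to row- and column-normalized diagonals; the counts below are illustrative placeholders, not SMH results.

import numpy as np

# Rows: projected trend; columns: observed trend
# (order: decreasing, flat, increasing).
N = np.array([[40, 10,  5],
              [12, 30, 14],
              [ 6,  9, 45]])

precision = np.diag(N) / N.sum(axis=1)   # equation (5): N_cc / sum_j N_cj
recall    = np.diag(N) / N.sum(axis=0)   # equation (6): N_cc / sum_j N_jc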
In some instances, we provide precision and recall summarized across all three classes; to do so, we average precision or recall across the three projected classes (decreasing, flat, increasing). The code and data to reproduce all analyses can be found in the public GitHub repository69.
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.