A population level study on the determinants of COVID-19 vaccination rates at the U.S. county level | Scientific Reports – Nature.com
Response variable
COVID-19 unvaccinated percentage (UP) is our chosen response variable. UP is calculated as the partial vaccination rate (PVR) subtracted from 100%, where the PVR is defined as the percentage of people in a county who had taken at least one dose of Pfizer (Comirnaty) or Moderna (Spikevax)16,24. As our goal is to deepen our understanding of vaccine hesitancy or vaccine refusal, we chose to define our variable as the percent of the population that did not receive any COVID-19 vaccine doses, rather than the percent fully vaccinated. Defining our variable as the percent fully vaccinated would complicate the interpretation of the variable as vaccine refusal, since it would exclude those that were willing to get the first dose but did not get the second before our time cutoff. The PVR data used to compute UP is sourced from Georgetown Universitys U.S. COVID-19 Vaccination Tracking website, which primarily relies on CDC data, supplemented with vaccination data from local health departments where CDC data is incomplete16. Vaccination data is not available for 69 counties in Alaska, Nebraska, Georgia, and Virginia, so these counties were excluded from the analysis. Additionally, since the Johnson & Johnson (J&J) vaccine only requires one dose to be fully vaccinated, the PVR excludes individuals who got the J&J vaccine. However, only 3.3% of vaccinations administered were J&J as of December 15, 202125, so the exclusion of J&J vaccinations should have minimal impact on our conclusions.
While our response variable measures lack of vaccine uptake, not hesitancy directly, we believe that this work will still provide insight on factors correlated with vaccine hesitancy. Previous work found that COVID-19 vaccine uptake is strongly correlated with vaccine hesitancy, as measured by survey data6. Further, vaccine uptake rates reflect real world vaccination behavior at the population level, in contrast to vaccine hesitancy surveys which are available for a subset of locations around the U.S. and suffer from sampling biases. In addition, we selected the time cutoff of December 15, 2021 in order to minimize the impact of non-hesitancy factors on uptake, such as vaccine eligibility and accessibility. Therefore, our response variable serves as a reasonable proxy for vaccine hesitancy, and we think it is the best choice possible based on the data that is currently available.
All demographic and socioeconomic variables are sourced from publicly available datasets at the county level, from the U.S. Census Bureaus 2020 Decennial Census17 and U.S. Department of Agriculture (USDA) Economic Research Service18. The percentage of Black people and the percentage of Latinx people represent the self-identified proportion of those races in each county. Postsecondary education percentage is measured as the percentage of adults with educational attainment more advanced than completing high school. The uninsured percentage is the percentage of people who reported not having health insurance. Additional metrics include median age, average number of vehicles per household and the median household income.
The political affiliation variable, defined as the percentage of voters who chose Donald Trump as their presidential candidate during the 2020 presidential election, is sourced from the MIT Election Data and Science Lab19. Previous work has found this data to be associated with vaccine hesitancy9,10. Compared with other political indicators, such as other election results, voter registration data, or public opinion polls, the presidential election has the highest voter turnout and the most policy influence. We hereby adopt this metric as a proxy of the political affiliation. It is referred as Republican presidential vote percentage (%) in the study.
In efforts to explore whether a county that experienced more burden from COVID-19 may be more willing to adopt preventative measures such as vaccination, we incorporate a variable to capture a county's historical COVID-19 infection rate. Specifically, to measure historical COVID-19 burden, we use the cumulative number of COVID-19 cases per 100,000 people as of December 15, 2021 from the Johns Hopkins University Center for Systems Science and Engineering (CSSE) COVID-19 GitHub20. In order to remove outliers, values that were more than 4 standard deviations above the mean were excluded.
Since MMR vaccination coverage is an indicator of vaccine acceptance before COVID-19, we hypothesize that higher (pre-pandemic) MMR vaccine uptake rates may be associated with higher COVID-19 vaccine coverage. Previous work has shown a strong association between MMR coverage and vaccine hesitancy26,27. In most U.S. states, the MMR vaccine is required for children to attend public school, making MMR coverage a strong indicator of anti-vaccination behavior. While this data reflects the results of parents making vaccination decisions for their children, in contrast to measuring COVID-19 uptake among mostly adult populations, the decision to forgo mandatory childhood vaccinations is indicative of strong hesitancy that we hypothesize may transfer to other vaccines. To test this hypothesis, we incorporate a variable in this analysis that is based on the MMR vaccination coverage rates of children in kindergarten in 2019, data that we gathered in a previous study21.
A set of variables intended to capture the potential role of information consumption on vaccine choice includes four television viewership rating variables and a Twitter misinformation variable. The county level viewership ratings (RTG) % for four major channels, namely FNC (Fox News Channel), CNN (Cable News Network), MSNBC (Microsoft National Broadcasting Company), and local news, are sourced from Nielsen Media, where RTG is measured by the estimated percentage of households tuned to a specific viewing source, e.g., news channel. The four viewership variables were computed as the average of the monthly viewership ratings for each channel from February to November 2021. January 2021 data were excluded due to anomalies caused by the January 6th U.S. Capital Attack. Nielsen data is not available for several counties in Virginia and Alaska and all counties in Hawaii, so these 72 counties are excluded from the analysis. The analysis also excludes outliers, defined as those counties with cable viewership values that are more than 4 standard deviations away from the mean. Additionally, within the model each of the cable TV viewership variables was standardized to a mean of 0 and standard deviation of 1, to provide a more interpretable understanding of the relative position of each countys rating.
Another information consumption variable included in the model is the Twitter misinformation variable. This variable is intended to capture the prevalence of COVID-19 vaccine misinformation in circulation on Twitter during a time that likely influenced behavior during the study period. The variable is based on a previous study by Pierri et al., who provided a variable that is representative of the percent of COVID-19 vaccine-related tweets that contain links to low credibility sources at the county level15,23. This variable has some limitations, as it is based on only the set of Twitter accounts that can be geolocated. To ensure a large enough sample size for a reliable estimate, counties with less than 50 geolocated accounts are not included, which results in a data set that includes 904 counties.After excludingoutliers, defined as values that are more than 4 standard deviations from the mean, we have 855 counties. An analogous data set is also available with a minimum of 10 and 100 geolocated accounts, but we opted to use the cutoff of 50 to balance having a more representative sample size of accounts per county with the number of counties we can include in our analysis. Due to the limited number of counties that this data is available for, a separate sub-analysis is conducted that includes this variable (Fig.2).
Various land-use variables are sourced from the U.S. Census Bureau, namely the population size and the number of residents in rural or urban areas for each county28. These variables are used to cluster counties for the sub-analyses, which are further described in the methods section. For the cluster-based analysis we categorize counties into mutually exclusive sets based on (1) population quartiles and (2) a binary rural or urban classification. For the binary classification, a county is classified as rural if the majority of the population is designated to live in areas classified as rural and otherwise classified as urban.
We use a Generalized Additive Model (GAM) to explore the relationship between each county's unvaccinated percentage and the aforementioned variables. GAMs provide a balance between model complexity and interpretability, and critically, they can reflect the relative importance of different features29,30. Specifically, GAMs model the response variable as the sum of unknown smooth functions of covariates, and unlike Generalized Linear Models (GLMs), GAMs offer the capability to model nonlinear relationships between variables. For example, a linear regression model may show an overall positive correlation between an input variable and the response variable, but using a GAM on the same data may reveal a more nuanced relationship, like a strong positive trend in some ranges of the input variable and a weak negative trend elsewhere. Due to the complex nature of relationships between our variables and vaccination uptake, GAMs are a more appropriate choice than linear models, since they can capture both linear and nonlinear relationships.
The proposed GAM is fitted to the unvaccinated percentage as the response variable, which is assumed to have a Gaussian distribution, and a log link. REML (restricted maximum likelihood) is used to estimate smoothing parameters, which returns relatively reliable and stable results. Specifically, the model in our primary analysis has the following form:
$${Y}_{i}sim Gaussian(mu )$$
$${text{log}}left(mu right)sim {s}_{1}left(cumulative ,COVID-19, case, rateright)+ {s}_{2}left(percentage, of ,Black, peopleright)+ {s}_{3}left(percentage, of ,Hispanic, peopleright)+ {s}_{4}left(postsecondary, education, percentageright)+ {s}_{5}(median ,household ,income) + {s}_{6}(median ,age) + {s}_{7}(vehicles, per, household) + {s}_{8}(uninsured, percentage) + {s}_{9}(MMR, coverage) + {s}_{10}(std(FNC, viewership) + {s}_{11}(std(CNN ,viewership)) + {s}_{12}(std(MSNBC, viewership) + {s}_{13}(std(Local, News, viewership) + {s}_{14}(Republican ,presidential ,vote ,percentage)$$
where Yi denotes the unvaccinated percentage for each county i. The model is a sum of smooth functions ({s}_{i}), and each smooth function consists of a number of basis functions (k). Sensitivity analysis that varies the number of basis functions was conducted. A value of k=3 for each smooth function was found to provide the optimal balance between preventing both underfitting and overfitting of the model and maximizing interpretability of the results. Additionally, the GAM model is weighted to prevent the highly imbalanced county population distribution from skewing the results. The weight is computed by normalizing each countys population by the average county population, taking a log transformation to adjust for the skewness. The weight implemented in the primary analysis is defined as:
$$weigh{t}_{i}=frac{{text{log}}(po{p}_{i})}{mean({sum }_{i}{text{log}}(po{p}_{i}))},$$
where i is the county index. The primary model is run for 2885 counties (reduced from the full set of counties due to missing data and data quality issues referenced previously).
As noted previously, the Twitter misinformation variable, ({s}_{15}(Twitter ,misinformation)), is only available for 855 counties, and is therefore run as a separate model using the same general function and weights as the primary model, but with the additional determinant included.
In addition to the primary model presented above, we conduct sub-analyses to determine how the relationship between unvaccinated percentage and its associated factors varies across urban versus rural counties. In the U.S., vaccination uptake was substantially lower in rural areas31. Multiple studies have examined the reasons for this discrepancy. Some factors associated with lower vaccination in rural areas were captured in our study, including lower educational attainment, voting for Trump in the 2020 election, and lower insurance coverage32. However, other relevant factors could not be incorporated in our study, including that rural residents have a lower perceived risk of COVID-19, higher vaccine hesitancy, and are less likely to adopt covid risk mitigating behaviors32,33. In order to better understand the influence of different factors in rural versus urban counties, we complete two cluster-based sub-analyses: one clustered by rural versus urban counties and another clustered into four quartiles based on population size, as described below. Due to the difficulty of neatly separating counties into urban or rural, we provide the population size cluster analysis to confirm that our urbanrural analysis is accurately capturing differences between urban and rural counites, which broadly aligns with higher versus lower county populations.
Land-use cluster-based analysis: Counties are clustered into two groups based on their primary land use pattern, namely as urban or rural counties. Two independent weighted GAM models are run, one for each group. The rural model includes 1,835 counties, and the urban model includes 1,050 counties.
Population cluster-based analysis: Counties are grouped into quantiles based on their population size. Four independent GAM models are generated, one for each distinct quantile. The respective models contain 664, 721, 739, and 761 counties ranging from the smallest to largest population size groups. GAMs are implemented without weights for each group in this sub-analysis, because the weighting is based on population size.
We evaluate the goodness-of-fit by conducting a diagnostic analysis for each model and sub-model. These evaluations include the Q-Q plots, histograms of residuals, mapping of residual values versus predicted values, and mapping of response against fitted values. The diagnostic analysis outcomes for the primary model are presented in the Supplementary Material (Fig. S2). The concurvity in the primary model is also measured to ensure pairwise values remain below 0.8 and avoid cases in which one variable is a smooth function of another. The outcomes of the diagnostic analysis demonstrated consistency in fit and performance across all models.
Read more: