Authors: Erin Teeple (1); Caitlin Kuhlman (2); Brandon Werner (1); Randy Paffenroth (1, 2, 3) and Elke Rundensteiner (1, 2)
Affiliations: (1) Data Science Program, Worcester Polytechnic Institute, Worcester, MA, U.S.A.; (2) Department of Computer Science, Worcester Polytechnic Institute, Worcester, MA, U.S.A.; (3) Department of Mathematics, Worcester Polytechnic Institute, Worcester, MA, U.S.A.
Keyword(s): Air Quality, Canonical Correlation Analysis, CCA, Epidemiology, Environmental Health.
Abstract: Quantifying health effects resulting from environmental exposures is a complex task. Exposure-outcome associations may be underestimated because of factors such as data quality, jointly distributed spectra of possible effects, and uncertainty about exposure levels. Parametric methods are commonly used in population health research because parameter estimates, rather than predictive accuracy, are useful for informing regulatory policies. This project considers two complementary approaches for capturing population-level exposure-outcome associations: multiple linear regression and canonical correlation analysis (CCA). We apply these methods to characterize relationships between air quality and cause-specific mortality. We first create a national air pollution exposures-mortality outcomes data set by integrating United States Environmental Protection Agency (EPA) annual summary county-level air quality measurements for the period 1980-2014 with age-adjusted, gender- and cause-specific county mortality rates for the same period published by the Institute for Health Metrics and Evaluation (IHME). Code for data integration is made publicly available. We then examine our model parameter estimates together with air quality-mortality rate associations. The results reveal statistically significant correlations between variations in air quality and variations in cause-specific mortality, and these correlations are particularly apparent when CCA is applied to our population health data set.
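To illustrate how the two approaches named in the abstract can differ, the following is a minimal sketch on synthetic data, not the authors' code and not the EPA/IHME data set. It assumes a single latent factor that weakly drives three hypothetical exposure columns and three hypothetical outcome columns; the variable names, sample size, and effect sizes are all illustrative assumptions. CCA is computed here with a QR-plus-SVD formulation, which is one standard way to obtain canonical correlations and may not match the procedure used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # hypothetical number of county-year records

# Assumption: one latent factor weakly drives every exposure and outcome column.
latent = rng.normal(size=n)
X = 0.5 * latent[:, None] + rng.normal(size=(n, 3))  # stand-in for air quality measures
Y = 0.5 * latent[:, None] + rng.normal(size=(n, 3))  # stand-in for mortality rates


def canonical_correlations(X, Y):
    """Canonical correlations between the column sets X and Y (full column rank assumed)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the centered column spaces of X and Y.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # The singular values of Qx^T Qy are the canonical correlations, largest first.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)


def multiple_r(X, y):
    """Multiple correlation from an ordinary least-squares fit of y on X with intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.corrcoef(A @ coef, y)[0, 1]


rho = canonical_correlations(X, Y)
# Cross block of the 6x6 correlation matrix: each exposure against each outcome.
pairwise = np.abs(np.corrcoef(X.T, Y.T)[:3, 3:])
best_regression_r = max(multiple_r(X, Y[:, j]) for j in range(3))

print("largest pairwise |r|:       ", round(float(pairwise.max()), 3))
print("best single-outcome R:      ", round(float(best_regression_r), 3))
print("first canonical correlation:", round(float(rho[0]), 3))
```

By construction, the first canonical correlation is at least as large as any single pairwise correlation and as the multiple correlation from regressing any one outcome on all exposures, because CCA searches over linear combinations on both sides. This is consistent with the abstract's observation that associations are particularly apparent under CCA, although the sketch says nothing about the magnitude of that effect in the real data.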