Importance: Evaluating the association of social determinants of health with chronic diseases at the population level requires access to individual-level factors associated with disease, which are rarely available for large populations. Synthetic populations are a possible alternative for this purpose.
Objective: To construct and validate a synthetic population that statistically mimics the characteristics and spatial disease distribution of a real population, using real and synthetic data.
Design, setting, and participants: This population-based decision analytical model used data for Allegheny County, Pennsylvania, collected from January 2015 to December 2016, to build a semisynthetic population based on the synthetic population used by the modeling and simulation platform FRED (A Framework for Reconstructing Epidemiological Dynamics). Disease status was assigned to this population using either health insurer claims data from the 3 major insurance providers in the county or data from the National Health and Nutrition Examination Survey. Biological, social, and other variables were also obtained from the National Health Interview Survey, Allegheny County, and public databases. Data analysis was performed from November 2016 to February 2020.
Exposures: Risk of cardiovascular disease (CVD) death.
Main outcomes and measures: Difference between expected and observed CVD death risk. A validated risk equation was used to estimate CVD death risk.
Results: The synthetic population comprised 1 188 112 individuals with demographic characteristics similar to those of the 2010 census population in the same county. In the synthetic population, the mean (SD) age was 40.6 (23.3) years, and 622 997 individuals (52.4%) were female. The mean (SD) observed 4-year rate of excess CVD death risk at the census tract level was -40 (523) per 100 000 persons. The correlation of social determinant data with the difference between expected and observed CVD death risk indicated that income- and education-based social determinants were associated with risk. Estimating improved social determinants of health and biological factors associated with disease did not entirely remove the excess in CVD death rates. Specifically, a 20% improvement in the most significant determinants still resulted in 105 census tracts with excess CVD death risk, representing 24% of the county population.
Conclusions and relevance: The results of this study suggest that creating a geographically explicit synthetic population from real and synthetic data is feasible and that synthetic populations are useful for modeling disease in large populations and for estimating the outcome of interventions.