We consider the situation where a known regression model can be used to predict an outcome, Y, from a set of predictor variables, X. A new variable, B, is expected to enhance the prediction of Y. A dataset of size n containing Y, X and B is available, and the challenge is to build an improved model for Y|X,B that uses both the available individual-level data and some summary information obtained from the known model for Y|X. We propose a synthetic data approach, which consists of creating m additional synthetic observations and then analyzing the combined dataset of size n+m to estimate the parameters of the Y|X,B model. In this combined dataset, B is missing for m of the observations, so it is analyzed using methods that can handle missing data (e.g., multiple imputation). We present simulation studies and illustrate the method using data from the Prostate Cancer Prevention Trial. Though the synthetic data method is applicable to a general regression context, to provide some justification, we show in two special cases that the asymptotic variances of the parameter estimates in the Y|X,B model are identical to those from an alternative constrained maximum likelihood estimation approach. This correspondence in special cases and the method's broad applicability make it appealing for use across diverse scenarios.
Keywords: Synthetic data; constrained maximum likelihood; data integration; prediction models.
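
The synthetic data approach described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes the simplest setting one could choose: scalar X and B, a known linear model for Y|X with normal errors, synthetic X values resampled from the observed X, a normal linear imputation model for B given X and Y, and pooling by Rubin's rules. The names `fit_ols`, `synthetic_data_fit`, `known_coef` and `known_sd` are hypothetical and introduced only for illustration.

```python
import numpy as np


def fit_ols(design, response):
    """Ordinary least squares: coefficients, residual variance, (X'X)^{-1}."""
    xtx_inv = np.linalg.inv(design.T @ design)
    coef = xtx_inv @ design.T @ response
    resid = response - design @ coef
    dof = design.shape[0] - design.shape[1]
    return coef, resid @ resid / dof, xtx_inv


def synthetic_data_fit(y, x, b, known_coef, known_sd, m,
                       n_imputations=20, seed=0):
    """Sketch of the synthetic data idea for a linear Y|X,B model.

    known_coef = (intercept, slope) and known_sd describe the known Y|X
    model (assumed linear with normal errors for this illustration).
    """
    rng = np.random.default_rng(seed)
    n = len(y)

    # Step 1: create m synthetic records. X is resampled from the observed X
    # (an assumption of this sketch); Y is drawn from the known Y|X model.
    x_syn = rng.choice(x, size=m, replace=True)
    y_syn = known_coef[0] + known_coef[1] * x_syn + rng.normal(0.0, known_sd, m)

    # Step 2: combine. B is missing for the m synthetic records.
    y_all = np.concatenate([y, y_syn])
    x_all = np.concatenate([x, x_syn])

    # Step 3: imputation model for B given (X, Y), fit on the n observed rows.
    d_obs = np.column_stack([np.ones(n), x, y])
    coef_b, s2_b, xtx_inv_b = fit_ols(d_obs, b)
    d_syn = np.column_stack([np.ones(m), x_syn, y_syn])
    dof_b = n - d_obs.shape[1]

    estimates, variances = [], []
    for _ in range(n_imputations):
        # Draw imputation-model parameters, then impute the missing B values.
        s2_draw = s2_b * dof_b / rng.chisquare(dof_b)
        coef_draw = rng.multivariate_normal(coef_b, s2_draw * xtx_inv_b)
        b_syn = d_syn @ coef_draw + rng.normal(0.0, np.sqrt(s2_draw), m)

        # Step 4: fit the Y|X,B model on the completed n+m records.
        b_all = np.concatenate([b, b_syn])
        d_all = np.column_stack([np.ones(n + m), x_all, b_all])
        coef, s2, xtx_inv = fit_ols(d_all, y_all)
        estimates.append(coef)
        variances.append(s2 * np.diag(xtx_inv))

    # Step 5: pool across imputations with Rubin's rules.
    estimates = np.array(estimates)
    within = np.mean(variances, axis=0)
    between = estimates.var(axis=0, ddof=1)
    total = within + (1.0 + 1.0 / n_imputations) * between
    return estimates.mean(axis=0), np.sqrt(total)


# Simulated usage; all numbers below are made up for the example.
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
b = 0.5 * x + rng.normal(size=n)
y = 1.0 + 2.0 * x + 1.5 * b + rng.normal(size=n)

# Implied Y|X model: intercept 1, slope 2 + 1.5 * 0.5, variance 1.5**2 + 1.
pooled, se = synthetic_data_fit(y, x, b, known_coef=(1.0, 2.75),
                                known_sd=np.sqrt(3.25), m=400)
```

Here `pooled` holds the pooled estimates of the intercept, the X coefficient and the B coefficient in the Y|X,B model, and `se` holds the corresponding pooled standard errors. The choice of m and the imputation model are design decisions that this sketch fixes arbitrarily.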