TY - JOUR
T1 - Synthetic data generation of health and demographic surveillance systems data
T2 - a case study in a low- and middle-income country
AU - Mwigereri, Dorcas G.
AU - Kamotho, Nigel T.
AU - Waljee, Akbar K.
AU - Rego, Ryan T.
AU - Weinheimer-Haus, Eileen M.
AU - Alarakhiya, Farhana
AU - Ngugi, Anthony K.
AU - Price, W. Nicholson
AU - Zhu, Ji
AU - Wong, Stephen Peter
AU - Siwo, Geoffrey H.
N1 - Publisher Copyright:
© 2025 The Author(s). Published by Oxford University Press on behalf of the American Medical Informatics Association.
PY - 2025/12/1
Y1 - 2025/12/1
N2 - Objective: To evaluate the effectiveness of open-source generative models in producing high-quality tabular synthetic data using a Health and Demographic Surveillance System (HDSS) dataset from rural Kenya, as a proof of concept in a low- and middle-income country (LMIC) setting. Materials and Methods: Three open-source models (CTGAN, TableGAN, and CopulaGAN) were used to generate synthetic data from the Kaloleni/Rabai HDSS dataset. To assess the quality of the synthetic datasets generated by each model, we performed fidelity, utility, and privacy tests. Results: CTGAN outperformed the other models, producing synthetic data that closely mirrored the statistical properties of the real dataset while preserving privacy. Both CopulaGAN and TableGAN performed poorly, with TableGAN completely failing to generate realistic synthetic data. For the utility tests, Random Forest models trained on CTGAN-generated synthetic data achieved comparable performance to models trained on real data (accuracy: 72.4% vs 72.0%, P = .38; F1 score: 71.4% vs 68.3%, P = .22), indicating no statistically significant loss in predictive utility. The CTGAN model also yielded higher precision and recall than CopulaGAN, suggesting that the synthetic data generated by CTGAN better preserved the underlying structure of the real data. Discussion: CTGAN demonstrated superior performance in generating high-quality synthetic tabular HDSS data. CopulaGAN and TableGAN produced lower-quality data, though these results may not generalize to other datasets. Conclusion: Synthetic data generation of tabular data using HDSS data, particularly via CTGAN, may enhance the accessibility of datasets in LMICs by creating synthetic datasets that preserve the characteristics and statistical properties of the original data, while upholding privacy and confidentiality.
AB - Objective: To evaluate the effectiveness of open-source generative models in producing high-quality tabular synthetic data using a Health and Demographic Surveillance System (HDSS) dataset from rural Kenya, as a proof of concept in a low- and middle-income country (LMIC) setting. Materials and Methods: Three open-source models (CTGAN, TableGAN, and CopulaGAN) were used to generate synthetic data from the Kaloleni/Rabai HDSS dataset. To assess the quality of the synthetic datasets generated by each model, we performed fidelity, utility, and privacy tests. Results: CTGAN outperformed the other models, producing synthetic data that closely mirrored the statistical properties of the real dataset while preserving privacy. Both CopulaGAN and TableGAN performed poorly, with TableGAN completely failing to generate realistic synthetic data. For the utility tests, Random Forest models trained on CTGAN-generated synthetic data achieved comparable performance to models trained on real data (accuracy: 72.4% vs 72.0%, P = .38; F1 score: 71.4% vs 68.3%, P = .22), indicating no statistically significant loss in predictive utility. The CTGAN model also yielded higher precision and recall than CopulaGAN, suggesting that the synthetic data generated by CTGAN better preserved the underlying structure of the real data. Discussion: CTGAN demonstrated superior performance in generating high-quality synthetic tabular HDSS data. CopulaGAN and TableGAN produced lower-quality data, though these results may not generalize to other datasets. Conclusion: Synthetic data generation of tabular data using HDSS data, particularly via CTGAN, may enhance the accessibility of datasets in LMICs by creating synthetic datasets that preserve the characteristics and statistical properties of the original data, while upholding privacy and confidentiality.
KW - AI
KW - HDSS
KW - generative adversarial networks
KW - health and demographic surveillance systems
KW - machine learning
KW - synthetic data
UR - https://www.scopus.com/pages/publications/105022441401
U2 - 10.1093/jamiaopen/ooaf137
DO - 10.1093/jamiaopen/ooaf137
M3 - Article
AN - SCOPUS:105022441401
SN - 2574-2531
VL - 8
JO - JAMIA Open
JF - JAMIA Open
IS - 6
M1 - ooaf137
ER -