TY - GEN
T1 - Comparison of Data Imputation Performance in Deep Generative Models for Educational Tabular Missing Data
AU - Choi, Wan Chong
AU - Lam, Chan Tong
AU - Mendes, António José
N1 - Publisher Copyright: © 2025 Copyright is held by the author(s).
PY - 2025
Y1 - 2025
AB - Missing data presents a significant challenge in Educational Data Mining (EDM). Imputation techniques aim to reconstruct missing values while preserving critical information in datasets for more accurate analysis. Although imputation techniques have gained attention in various fields in recent years, their use for addressing missing data in education remains limited. This study helps fill that research gap by evaluating three state-of-the-art deep generative models, Tabular Variational Autoencoder (TVAE), Conditional Tabular Generative Adversarial Networks (CTGAN), and Tabular Denoising Diffusion Probabilistic Models (TabDDPM), for imputing missing values in the Open University Learning Analytics Dataset (OULAD) at varying levels of missing data. These models learn relationships among demographic, behavioral, and partial assessment data to impute absent numerical assessment scores. TabDDPM showed the best imputation performance and maintained closer alignment with the original data, as demonstrated by KL divergence and KDE plots. To further improve predictive modeling with imputed data, this study proposes TabDDPM-SMOTE, which combines TabDDPM with the Synthetic Minority Over-sampling Technique (SMOTE) to address the class imbalance often encountered in educational datasets. TabDDPM-SMOTE consistently achieves the highest F1-score when the imputed data are used in XGBoost classification tasks, showing its effectiveness and its potential to improve predictive modeling.
KW - Deep Learning Model
KW - Educational Data Mining
KW - Educational Tabular Missing Data
KW - Tabular Missing Data Imputation
UR - https://www.scopus.com/pages/publications/105023326511
U2 - 10.5281/zenodo.15870169
DO - 10.5281/zenodo.15870169
M3 - Conference contribution
AN - SCOPUS:105023326511
SN - 9781733673662
T3 - Proceedings of the International Conference on Educational Data Mining
SP - 133
EP - 142
BT - Proceedings of the 18th International Conference on Educational Data Mining, EDM 2025
A2 - Mills, Caitlin
A2 - Alexandron, Giora
A2 - Taibi, Davide
A2 - Lo Bosco, Giosuè
A2 - Paquette, Luc
PB - International Educational Data Mining Society
T2 - 18th International Conference on Educational Data Mining, EDM 2025
Y2 - 20 July 2025 through 23 July 2025
ER -