Diabetes Prediction on Pima Indian Dataset Using Machine Learning Techniques
DOI:
https://doi.org/10.64252/3a8wqx36
Keywords:
Diabetes mellitus, Machine learning, Data processing, Pima Indian Diabetes Dataset (PIDD), Classification algorithms, Model optimization, Logistic Regression, Random Forest, XGBoost
Abstract
Diabetes mellitus is a growing global health burden, and early prediction can support timely intervention. This study develops and evaluates a leakage-safe machine-learning pipeline for Type-2 diabetes prediction on the Pima Indians Diabetes Dataset (768 records). Clinically implausible zeros in Glucose, BloodPressure, SkinThickness, Insulin, and BMI are treated as missing and imputed with class-conditional medians. Pipelines include SelectKBest (ANOVA F-test) feature screening, standardization, optional PCA, and a diverse set of classifiers (Logistic Regression, SVM, KNN, Random Forest, Gradient Boosting, XGBoost, LightGBM, CatBoost). Model selection uses stratified 5-fold cross-validation with GridSearchCV, followed by metaheuristic hyperparameter tuning (Genetic Algorithm, Particle Swarm Optimization, Differential Evolution). Evaluation on a held-out test set reports accuracy and ROC-AUC, alongside precision, recall, and F1.
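The preprocessing and model-selection steps described above could be wired together roughly as follows. This is a minimal sketch with scikit-learn, not the study's code: the data are a synthetic stand-in for the PIDD feature matrix, the column positions, grid values, and split ratio are assumptions, and plain median imputation fitted inside each training fold (SimpleImputer) is substituted for the class-conditional medians named in the abstract. Only Logistic Regression is shown, and the optional PCA step is omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the PIDD shape (768 rows, 8 features); NOT the real data.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           n_redundant=1, n_clusters_per_class=1, random_state=0)

# Assumption: columns 1..5 play the role of Glucose, BloodPressure, SkinThickness,
# Insulin, and BMI. With the real data, zeros in those columns would be replaced
# by NaN; here about 10% of the entries are simply marked missing.
rng = np.random.default_rng(0)
missing = rng.random((X.shape[0], 5)) < 0.1
X[:, 1:6][missing] = np.nan

# Hold out a stratified test set before any fitting takes place.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Every step lives inside the pipeline, so imputation medians, F-test scores,
# and scaling statistics are learned from the training folds only.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("select", SelectKBest(score_func=f_classif)),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid; the values are assumptions, not the study's search space.
param_grid = {"select__k": [4, 6, 8], "clf__C": [0.1, 1.0, 10.0]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="roc_auc")
search.fit(X_train, y_train)

# Evaluate the refitted best pipeline once on the held-out test set.
y_pred = search.predict(X_test)
y_score = search.predict_proba(X_test)[:, 1]
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
}
print(search.best_params_)
print({name: round(value, 3) for name, value in metrics.items()})
```

Keeping the imputer, selector, and scaler inside the Pipeline is what lets GridSearchCV refit them on each training fold, which is one common way to avoid leaking validation or test information into preprocessing.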
Across models and optimizers, the best overall performance is achieved by Random Forest tuned with Differential Evolution, reaching ~0.897 test accuracy and ~0.956 ROC-AUC, with LightGBM and XGBoost close behind (~0.88–0.89 accuracy; ~0.95 AUC). Results indicate that tree-based ensembles, when carefully tuned and guarded against leakage, provide strong and stable generalization on tabular clinical data. We also provide model-agnostic explainability (SHAP) to highlight clinically intuitive drivers (e.g., glucose, pregnancies, age). The proposed, fully reproducible pipeline offers a practical foundation for clinical decision support and can be extended to larger and more diverse cohorts.
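Differential Evolution tuning of a Random Forest, as reported above, might look like the sketch below. It uses SciPy's differential_evolution on synthetic data; the two tuned hyperparameters, their bounds, the rounding scheme, and the very small search budget are assumptions chosen to keep the example fast, and do not reflect the study's actual configuration.

```python
from scipy.optimize import differential_evolution
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the training portion of the data; NOT the real PIDD.
X_train, y_train = make_classification(n_samples=300, n_features=8, n_informative=5,
                                       n_redundant=1, n_clusters_per_class=1,
                                       random_state=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)


def decode(params):
    """Map a continuous DE candidate vector to integer hyperparameters."""
    return {
        "max_depth": int(round(params[0])),
        "min_samples_leaf": int(round(params[1])),
    }


def objective(params):
    """Negative mean cross-validated ROC-AUC (DE minimizes its objective)."""
    model = RandomForestClassifier(n_estimators=30, random_state=0, **decode(params))
    auc = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc").mean()
    return -auc


# Assumed search space: max_depth in [2, 10], min_samples_leaf in [1, 10].
bounds = [(2, 10), (1, 10)]

# Deliberately tiny budget; polish=False skips the gradient-based refinement,
# which is not meaningful for rounded integer hyperparameters.
result = differential_evolution(objective, bounds, maxiter=3, popsize=4,
                                seed=0, polish=False)

best_params = decode(result.x)
best_cv_auc = -result.fun
print(best_params, round(best_cv_auc, 3))
```

The objective is scored by cross-validation on the training portion only, so a held-out test set stays untouched until the tuned model is evaluated once at the end.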