Diabetes Prediction on Pima Indians Dataset Using Machine Learning Techniques

Abdelmgeid A. Ali; Galal R. Galal; Hassan S. Hassan

doi:10.64252/3a8wqx36

Keywords:

Diabetes mellitus, Machine learning, Data processing, Pima Indians Diabetes Dataset (PIDD), Classification algorithms, Random Forest, XGBoost, LightGBM, Ensemble Learning, Feature Engineering

Abstract

Type 2 diabetes is a major public-health problem. We build a leakage-safe machine-learning workflow on the Pima Indians Diabetes Dataset (768 records) to predict diabetes from routine clinical attributes. Clinically implausible zeros in Glucose, Blood Pressure, Skin Thickness, Insulin, and BMI are treated as missing and imputed with class-conditional medians. Continuous variables are standardized only for models that require scaling (e.g., LR, SVM, KNN); tree-based models use raw scales. Besides the eight original attributes, we engineer 16 clinically interpretable composite features and assess their utility with descriptive checks and model-agnostic explainability (SHAP). The model portfolio includes Logistic Regression, SVM, KNN, Decision Tree, Random Forest, Gradient Boosting, XGBoost, and LightGBM. The final classifier is a soft-voting ensemble of XGBoost and LightGBM based on averaged predicted probabilities. Using a stratified train/validation procedure and a strictly held-out test set, the ensemble achieves Accuracy = 89.61%, ROC-AUC = 94.52%, and F1 = 85.19%, outperforming the individual models. SHAP highlights clinically coherent drivers (e.g., glucose, pregnancies, age, BMI-related composites). Compared with recent Scopus-indexed studies on the same dataset (≈74–89% accuracy), our leakage-controlled and transparent pipeline provides competitive, reproducible results and a practical basis for clinical decision support that can be extended to larger, multi-site, and more diverse cohorts.
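
As an illustration of two steps named in the abstract, the sketch below shows (a) treating zeros as missing and imputing them with class-conditional medians learned from training rows only, and (b) soft voting by averaging two models' predicted probabilities. It is a minimal standard-library sketch, not the authors' implementation. The column names, the function names (`fit_medians`, `impute`, `soft_vote`), the 0.5 decision threshold, and the toy values are all assumptions made for illustration. The fallback to an overall training median for unlabeled rows is also an assumption, since the abstract does not state how held-out rows are imputed.

```python
from statistics import median

# Columns where a zero is treated as a missing value (names are assumed).
ZERO_AS_MISSING = ("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")


def fit_medians(rows, labels):
    """Learn per-class and overall medians from training rows only."""
    medians = {}
    for col in ZERO_AS_MISSING:
        observed = [(r[col], y) for r, y in zip(rows, labels) if r[col] != 0]
        per_col = {"overall": median([v for v, _ in observed])}
        for cls in set(labels):
            values = [v for v, y in observed if y == cls]
            # Fall back to the overall median if a class has no observed value.
            per_col[cls] = median(values) if values else per_col["overall"]
        medians[col] = per_col
    return medians


def impute(rows, medians, labels=None):
    """Fill zeros with the class median if labels are given, else the overall one."""
    filled_rows = []
    for i, row in enumerate(rows):
        key = labels[i] if labels is not None else "overall"
        filled = dict(row)  # copy so the input rows are left unchanged
        for col in ZERO_AS_MISSING:
            if filled[col] == 0:
                filled[col] = medians[col][key]
        filled_rows.append(filled)
    return filled_rows


def soft_vote(prob_a, prob_b, threshold=0.5):
    """Average two models' positive-class probabilities, then threshold."""
    avg = [(a + b) / 2 for a, b in zip(prob_a, prob_b)]
    return avg, [int(p >= threshold) for p in avg]


# Toy rows with made-up values (not taken from the dataset).
train_rows = [
    dict(Glucose=90, BloodPressure=70, SkinThickness=20, Insulin=80, BMI=24.0),
    dict(Glucose=110, BloodPressure=0, SkinThickness=30, Insulin=0, BMI=26.0),
    dict(Glucose=160, BloodPressure=80, SkinThickness=0, Insulin=200, BMI=34.0),
    dict(Glucose=0, BloodPressure=90, SkinThickness=40, Insulin=0, BMI=36.0),
]
train_labels = [0, 0, 1, 1]

medians = fit_medians(train_rows, train_labels)            # statistics from training data only
train_filled = impute(train_rows, medians, train_labels)   # class-conditional fill
held_out = [dict(Glucose=0, BloodPressure=75, SkinThickness=22, Insulin=0, BMI=28.0)]
held_out_filled = impute(held_out, medians)                # no labels: overall median

# Stand-in probabilities; in practice these would come from two fitted models.
avg_prob, predictions = soft_vote([0.75, 0.25], [0.5, 0.5])
```

Fitting the medians on the training split and reusing them unchanged on held-out rows is what keeps the imputation step from leaking test information. In a full pipeline, the two probability lists passed to `soft_vote` would typically be the positive-class outputs of the fitted XGBoost and LightGBM classifiers.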
