Diabetes Prediction on Pima Indians Dataset Using Machine Learning Techniques

Authors

  • Abdelmgeid A. Ali
  • Galal R. Galal
  • Hassan S. Hassan

DOI:

https://doi.org/10.64252/3a8wqx36

Keywords:

Diabetes mellitus, Machine learning, Data processing, Pima Indians Diabetes Dataset (PIDD), Classification algorithms, Random Forest, XGBoost, LightGBM, Ensemble Learning, Feature Engineering

Abstract

Type 2 diabetes is a major public-health problem. We build a leakage-safe machine-learning workflow on the Pima Indians Diabetes Dataset (768 records) to predict diabetes from routine clinical attributes. Clinically implausible zeros in Glucose, Blood Pressure, Skin Thickness, Insulin, and BMI are treated as missing and imputed with class-conditional medians. Continuous variables are standardized only for models that require scaling (e.g., LR, SVM, KNN); tree-based models use raw scales. Besides the eight original attributes, we engineer 16 clinically interpretable composite features and assess their utility with descriptive checks and model-agnostic explainability (SHAP). The model portfolio includes Logistic Regression, SVM, KNN, Decision Tree, Random Forest, Gradient Boosting, XGBoost, and LightGBM. The final classifier is a soft-voting ensemble of XGBoost and LightGBM based on averaged predicted probabilities. Using a stratified train/validation procedure and a strictly held-out test set, the ensemble achieves Accuracy = 89.61%, ROC-AUC = 94.52%, and F1 = 85.19%, outperforming the individual models. SHAP highlights clinically coherent drivers (e.g., glucose, pregnancies, age, BMI-related composites). Compared with recent Scopus-indexed studies on the same dataset (≈74–89% accuracy), our leakage-controlled and transparent pipeline provides competitive, reproducible results and a practical basis for clinical decision support that can be extended to larger, multi-site, and more diverse cohorts.
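
Two steps named in the abstract, median imputation of implausible zeros and soft voting over averaged probabilities, can be illustrated with a minimal sketch. This is not the authors' code. The function names, the column spellings (e.g., BloodPressure), and the fallback to the overall training median for rows whose label is unknown are all assumptions made for illustration; the abstract does not say how unlabeled rows are imputed. The soft-vote helper takes two lists of positive-class probabilities as given and does not train XGBoost or LightGBM.

```python
from statistics import median

# Columns where a zero is treated as missing, per the abstract.
# The exact column spellings are an assumption of this sketch.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def fit_medians(train_rows, train_labels, columns=ZERO_AS_MISSING):
    """Learn per-class and overall medians from training rows only.

    Fitting on the training split alone keeps validation and test
    statistics out of the imputation values.
    """
    medians = {}
    for col in columns:
        observed = [(row[col], y)
                    for row, y in zip(train_rows, train_labels)
                    if row[col] != 0]
        by_class = {}
        for label in set(train_labels):
            values = [v for v, y in observed if y == label]
            if values:
                by_class[label] = median(values)
        overall = median([v for v, _ in observed])
        medians[col] = {"by_class": by_class, "overall": overall}
    return medians


def impute(rows, medians, labels=None):
    """Return copies of rows with zeros replaced by learned medians.

    With labels, the class-conditional median is used. Without labels,
    the overall training median is used (an assumption of this sketch).
    """
    filled = []
    for i, row in enumerate(rows):
        new_row = dict(row)
        for col, stats in medians.items():
            if new_row[col] == 0:
                label = labels[i] if labels is not None else None
                new_row[col] = stats["by_class"].get(label, stats["overall"])
        filled.append(new_row)
    return filled


def soft_vote(prob_a, prob_b, threshold=0.5):
    """Average two models' positive-class probabilities, then threshold."""
    averaged = [(a + b) / 2 for a, b in zip(prob_a, prob_b)]
    return averaged, [int(p >= threshold) for p in averaged]
```

A typical call order would be fit_medians on the training split, impute on each split with the same learned medians, and soft_vote on the two models' predicted probabilities for the held-out rows.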

Published

2025-09-02

Section

Articles

How to Cite

Diabetes Prediction on Pima Indians Dataset Using Machine Learning Techniques. (2025). International Journal of Environmental Sciences, 529-550. https://doi.org/10.64252/3a8wqx36