Diabetes Prediction on Pima Indians Dataset Using Machine Learning Techniques
DOI:
https://doi.org/10.64252/3a8wqx36
Keywords:
Diabetes mellitus, Machine learning, Data processing, Pima Indians Diabetes Dataset (PIDD), Classification algorithms, Random Forest, XGBoost, LightGBM, Ensemble Learning, Feature Engineering
Abstract
Type 2 diabetes is a major public-health problem. We build a leakage-safe machine-learning workflow on the Pima Indians Diabetes Dataset (768 records) to predict diabetes from routine clinical attributes. Clinically implausible zeros in Glucose, Blood Pressure, Skin Thickness, Insulin, and BMI are treated as missing and imputed with class-conditional medians. Continuous variables are standardized only for models that require scaling (e.g., LR, SVM, KNN); tree-based models use raw scales. Besides the eight original attributes, we engineer 16 clinically interpretable composite features and assess their utility with descriptive checks and model-agnostic explainability (SHAP). The model portfolio includes Logistic Regression, SVM, KNN, Decision Tree, Random Forest, Gradient Boosting, XGBoost, and LightGBM. The final classifier is a soft-voting ensemble of XGBoost and LightGBM based on averaged predicted probabilities. Using a stratified train/validation procedure and a strictly held-out test set, the ensemble achieves Accuracy = 89.61%, ROC-AUC = 94.52%, and F1 = 85.19%, outperforming the individual models. SHAP highlights clinically coherent drivers (e.g., glucose, pregnancies, age, BMI-related composites). Compared with recent Scopus-indexed studies on the same dataset (≈74–89% accuracy), our leakage-controlled and transparent pipeline provides competitive, reproducible results and a practical basis for clinical decision support that can be extended to larger, multi-site, and more diverse cohorts.
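The preprocessing and ensembling steps named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' code. The column order, the helper names (`zeros_to_nan`, `fit_class_medians`, `impute`, `soft_vote`), and the choice to compute medians from training rows only are assumptions made for illustration. Class-conditional imputation needs the label of each row, so the sketch applies to labeled training rows; the abstract does not say how unlabeled rows are handled, and that is left open here.

```python
import numpy as np

# Assumed column order, following the common CSV release of the dataset.
COLUMNS = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
# Attributes for which a zero is clinically implausible (from the abstract).
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def zeros_to_nan(X):
    """Return a float copy of X with implausible zeros replaced by NaN."""
    X = np.asarray(X, dtype=float).copy()
    for name in ZERO_AS_MISSING:
        j = COLUMNS.index(name)
        X[X[:, j] == 0, j] = np.nan
    return X


def fit_class_medians(X_train, y_train):
    """Per-class column medians, computed from training rows only."""
    X_train = zeros_to_nan(X_train)
    y_train = np.asarray(y_train)
    return {label: np.nanmedian(X_train[y_train == label], axis=0)
            for label in np.unique(y_train)}


def impute(X, y, medians):
    """Fill each missing value with the median of that row's class."""
    X = zeros_to_nan(X)
    y = np.asarray(y)
    for label, med in medians.items():
        rows = y == label
        block = X[rows]                      # copy of the rows in this class
        mask = np.isnan(block)
        block[mask] = np.broadcast_to(med, block.shape)[mask]
        X[rows] = block
    return X


def soft_vote(*probas):
    """Soft voting: average the positive-class probabilities of several models."""
    return np.mean(np.vstack(probas), axis=0)
```

In use, `fit_class_medians` would be called on the training split alone and the resulting medians passed to `impute`, so that no statistic is estimated from held-out rows. `soft_vote` accepts the positive-class probability vectors of any fitted classifiers, for example the second column of `predict_proba` from an XGBoost and a LightGBM model.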