Improving Email Spam Detection And Classification Through Data Balancing And Ensemble Machine Learning-Based Boosting Approaches

Garima Mishra; Dr. Parth Gautam

doi:10.64252/2r18r594

Authors

Garima Mishra
Dr. Parth Gautam

DOI:

https://doi.org/10.64252/2r18r594

Keywords:

Cybersecurity, Email, Spam emails, machine learning, deep learning, balancing, boosting, ensemble models.

Abstract

Email is among the most widely used and effective methods of communicating and sharing data or messages over the internet. As the importance and usage of email have grown, so has the volume of spam. Detecting and filtering spam is a large and complex challenge for email systems. Traditional identification methods, such as blocklists, real-time blackhole lists, and content-based filtering, have limited effectiveness. Because of these constraints, more advanced machine learning (ML) tools have been developed to improve the accuracy of spam detection. This work addresses the growing problem of email spam identification, a relevant issue in digital communications security because unwanted or malicious email messages can be used to compromise the privacy and integrity of user databases. The overall aim was to create an effective classification system that can reliably differentiate between spam and legitimate messages. The Spambase dataset from the UCI repository was preprocessed through feature labeling, splitting into training and testing sets, and class balancing with SMOTEENN to reduce skewness. Three ensemble boosting models, AdaBoost, Gradient Boosting (GBC), and CatBoost, were implemented and rigorously evaluated using the confusion matrix, classification report, sensitivity, specificity, ROC curves, and precision-recall curves. All models performed at a similarly high level, approximately 97.66% for AdaBoost, 97.92% for GBC, and 98% for CatBoost, with CatBoost slightly exceeding the others. The comparative analysis showed that boosting-based models are highly resilient to misclassification of spam and non-spam messages and can be used effectively in real-life applications.
The significance of this work is that it integrates hybrid resampling with recent boosting techniques, which supports high performance on imbalanced data together with high detection rates. The reported performance measures indicate that ML models have the potential to enhance cybersecurity solutions for combating email spam attacks.
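
The abstract names sensitivity and specificity among the evaluation measures without defining them. The following is a minimal sketch of how these quantities are commonly derived from a binary confusion matrix, assuming spam is treated as the positive class. The `ConfusionMatrix` class, its method names, and the toy labels are hypothetical illustrations, not the authors' code or data, and the numbers are unrelated to the results reported above.

```python
from dataclasses import dataclass


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


@dataclass(frozen=True)
class ConfusionMatrix:
    """Binary confusion matrix with spam as the positive class (an assumption)."""
    tp: int  # spam correctly flagged as spam
    fp: int  # legitimate mail wrongly flagged as spam
    tn: int  # legitimate mail correctly passed through
    fn: int  # spam wrongly passed through as legitimate

    @classmethod
    def from_labels(cls, y_true, y_pred, positive=1):
        """Count the four outcomes from paired true and predicted labels."""
        tp = fp = tn = fn = 0
        for truth, pred in zip(y_true, y_pred):
            if pred == positive:
                if truth == positive:
                    tp += 1
                else:
                    fp += 1
            else:
                if truth == positive:
                    fn += 1
                else:
                    tn += 1
        return cls(tp=tp, fp=fp, tn=tn, fn=fn)

    def sensitivity(self) -> float:
        """Share of actual spam that was caught (also called recall)."""
        return _ratio(self.tp, self.tp + self.fn)

    def specificity(self) -> float:
        """Share of legitimate mail that was correctly left alone."""
        return _ratio(self.tn, self.tn + self.fp)

    def precision(self) -> float:
        """Share of flagged messages that really were spam."""
        return _ratio(self.tp, self.tp + self.fp)

    def accuracy(self) -> float:
        """Share of all messages classified correctly."""
        total = self.tp + self.fp + self.tn + self.fn
        return _ratio(self.tp + self.tn, total)


# Toy labels for illustration only: 1 = spam, 0 = legitimate.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

cm = ConfusionMatrix.from_labels(y_true, y_pred)
print(cm)
print(f"sensitivity={cm.sensitivity():.3f} specificity={cm.specificity():.3f}")
print(f"precision={cm.precision():.3f} accuracy={cm.accuracy():.3f}")
```

Reporting sensitivity and specificity alongside an overall score matters for spam filtering because the two error types have different costs: low sensitivity lets spam through, while low specificity sends legitimate mail to the spam folder.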

