Investigating the Effectiveness of Word2Vec for Spam Detection Using Lazy Predict Library

Authors

  • Aissa Fellah, Kheireddine Mekkaoui, Ahmed Zahaf, Atilla Elçi

Keywords:

E-mail spam, Word2Vec, Machine learning technique, Lazy Predict

Abstract

The growing use of email for exchanging information over the internet has coincided with a significant rise in unsolicited email (spam), making it increasingly difficult for users to manage their inboxes and identify legitimate messages. Many detection methods have been developed and refined to address this flood of spam, including knowledge-based techniques, clustering algorithms, learning-based models, and heuristic algorithms. Despite these advances, no detection model or technique has achieved perfect predictive accuracy. Among the proposed models, machine learning (ML) and deep learning (DL) algorithms have emerged as the most effective for spam email detection. Choosing the optimal model for an ML problem can be challenging. To address this, we first convert the email text into feature vectors using Word2Vec, then apply a range of machine learning classifiers to the dataset using the Lazy Predict library with default model parameters. We then evaluate the performance of these basic models after fine-tuning the Word2Vec hyperparameters. Here, "basic model" means a model without parameter tuning. We select the best-performing models and then apply hyperparameter tuning to them. This investigation explores the efficacy of Word2Vec combined with ML for spam email classification. The proposed approach achieved an accuracy of 0.99, indicating its potential as a valuable tool for enhancing spam detection.
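The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not say how word vectors are combined into one vector per email, so mean pooling is assumed here. The helper names (tokenize, document_vector, rank_classifiers), the whitespace tokenizer, and the hyperparameter values are all hypothetical. The last function additionally assumes that gensim, scikit-learn, and lazypredict are installed.

```python
from typing import Dict, List, Sequence


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def document_vector(tokens: Sequence[str],
                    vectors: Dict[str, Sequence[float]],
                    dim: int) -> List[float]:
    """Mean-pool the vectors of known tokens (assumed aggregation scheme).

    Returns an all-zero vector when none of the tokens has an embedding.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def rank_classifiers(texts, labels, vector_size=100, window=5):
    """Hypothetical end-to-end run: Word2Vec features, then Lazy Predict.

    Third-party imports are kept local so the helpers above stay usable
    without gensim, scikit-learn, or lazypredict installed.
    """
    import numpy as np
    from gensim.models import Word2Vec
    from lazypredict.Supervised import LazyClassifier
    from sklearn.model_selection import train_test_split

    tokenized = [tokenize(t) for t in texts]

    # Train Word2Vec on the corpus; vector_size and window are the kind of
    # hyperparameters the abstract says were fine-tuned (values assumed).
    w2v = Word2Vec(sentences=tokenized, vector_size=vector_size,
                   window=window, min_count=1, workers=1, seed=0)
    lookup = {word: w2v.wv[word] for word in w2v.wv.index_to_key}

    # One fixed-length feature vector per email.
    X = np.array([document_vector(toks, lookup, vector_size)
                  for toks in tokenized])
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, random_state=0, stratify=labels)

    # Fit many scikit-learn classifiers with their default parameters.
    clf = LazyClassifier(verbose=0, ignore_warnings=True)
    scores, _ = clf.fit(X_train, X_test, y_train, y_test)
    return scores  # table of per-model metrics
```

With a toy two-dimensional lookup table such as {"free": [1.0, 0.0], "prize": [0.0, 1.0]}, document_vector(tokenize("FREE prize now"), table, 2) averages the two known words and ignores "now". In a real run the table would come from the trained Word2Vec model, and the best-scoring models in the returned table would be the candidates for hyperparameter tuning.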

Published

24.03.2024

How to Cite

Aissa Fellah. (2024). Investigating the Effectiveness of Word2Vec for Spam Detection Using Lazy Predict Library. International Journal of Intelligent Systems and Applications in Engineering, 12(3), 2968–2977. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/5887

Section

Research Article