Investigating the Effectiveness of Word2Vec for Spam Detection Using Lazy Predict Library
Keywords:
E-mail spam, Word2Vec, Machine learning technique, Lazy Predict
Abstract
The growth of email as a means of exchanging information over the internet has coincided with a significant rise in unsolicited email (spam), making it increasingly difficult for users to manage their inboxes and identify legitimate messages. Many detection methods have been developed and refined to address this flood of unsolicited messages, including knowledge-based techniques, clustering algorithms, learning-based models, and heuristic algorithms, among others. Despite numerous advances, none of these detection models or techniques has achieved perfect predictive accuracy. Among the models proposed for spam email detection, machine learning (ML) and deep learning (DL) algorithms have emerged as the most effective. Choosing the optimal model for an ML problem can be challenging. To address this, we first convert the email text into vector features using Word2Vec and then apply a range of machine learning classifiers to the dataset using the Lazy Predict library, with default parameters for the ML models. We then evaluate the performance of these basic models after fine-tuning the Word2Vec hyperparameters. Here, "basic model" means a model without tuned parameters, that is, one run with its defaults. We select the best-performing models and then apply hyperparameter tuning to them. This investigation explores the efficacy of Word2Vec combined with ML for spam email classification. The proposed approach achieved an accuracy of 0.99, indicating its potential as a valuable tool for enhancing spam detection.
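The step of turning message text into vector features can be illustrated with a small sketch. The abstract does not say how per-word Word2Vec vectors are combined into one feature vector per message; the example below assumes mean pooling (averaging the vectors of the known words), which is a common choice but not necessarily the one used in the paper. The function names, the whitespace tokenizer, and the two-dimensional `toy_embeddings` table are all hypothetical stand-ins for a trained Word2Vec model.

```python
from typing import Dict, List, Sequence


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def message_vector(tokens: Sequence[str],
                   embeddings: Dict[str, Sequence[float]],
                   dim: int) -> List[float]:
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical 2-dimensional embeddings standing in for trained Word2Vec vectors.
toy_embeddings = {
    "free": [1.0, 0.0],
    "prize": [0.8, 0.2],
    "meeting": [0.0, 1.0],
    "tomorrow": [0.2, 0.8],
}

messages = ["Free prize", "Meeting tomorrow", "hello"]

# One fixed-length feature vector per message, ready for a classifier.
features = [message_vector(tokenize(m), toy_embeddings, dim=2) for m in messages]
```

In a fuller pipeline, the toy table would presumably be replaced by the word vectors of a trained gensim `Word2Vec` model, and the resulting feature matrix, split into training and test sets, could be passed to Lazy Predict's `LazyClassifier`, whose `fit(X_train, X_test, y_train, y_test)` call returns a comparison table of classifiers trained with default parameters.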
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Under this license, users must give appropriate credit, provide a link to the license, and indicate if changes were made; if they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.