Recurrent Neural Networks for Spam E-mail Classification on an Agglutinative Language

Sahin Isik; Zuhal Kurt; Yildiray Anagun; Kemal Ozkan

doi:10.18201/ijisae.2020466316

Authors

Sahin Isik Eskisehir Osmangazi University https://orcid.org/0000-0003-1768-7104
Zuhal Kurt Atilim University https://orcid.org/0000-0003-1740-6982
Yildiray Anagun Eskisehir Osmangazi University https://orcid.org/0000-0003-2737-2720
Kemal Ozkan Eskisehir Osmangazi University https://orcid.org/0000-0003-2252-2128

DOI:

https://doi.org/10.18201/ijisae.2020466316

Keywords:

RNN, Odds Ratio, Mutual Information, Spam E-mail, LSTM

Abstract

In this study, we provide an alternative solution to the problem of classifying spam and legitimate e-mails. Different deep learning architectures are applied on top of two feature selection methods, Mutual Information (MI) and Weighted Mutual Information (WMI). First, the feature selection methods WMI and MI are applied to reduce the number of selected terms. Second, the feature vectors are constructed using the bag-of-words (BoW) model. Finally, the performance of the system is analysed using Artificial Neural Network (ANN), Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BILSTM) models. In the experiments, we observed that WMI and MI yield closely competing accuracy rates for the agglutinative language considered, namely Turkish. The experimental scores show that LSTM and BILSTM reach 100% accuracy when combined with either MI or WMI, for both spam and legitimate e-mails. However, in particular cross-validation runs, WMI features perform better than MI features in terms of e-mail grouping. Overall, given the high detection scores, WMI and MI combined with deep learning architectures appear robust for spam e-mail detection.
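The first two stages described above (term selection followed by bag-of-words vectors) can be illustrated with a minimal sketch. It assumes the standard presence/absence definition of mutual information between a term and the class label; the exact MI and WMI formulas used in the study are not given in the abstract, so WMI and the ANN, LSTM and BILSTM classifiers are left out. The function names and the toy tokens are hypothetical.

```python
import math
from collections import Counter


def mutual_information(docs, labels):
    """Score each term by the MI between its presence in a document and the class.

    Assumption: standard presence/absence MI in bits, not necessarily the
    formulation used in the paper.
    """
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    vocab = set().union(*doc_sets)
    class_counts = Counter(labels)
    scores = {}
    for term in vocab:
        # Joint counts over (term present?, class label).
        joint = Counter((term in s, y) for s, y in zip(doc_sets, labels))
        n_present = sum(1 for s in doc_sets if term in s)
        mi = 0.0
        for present in (True, False):
            n_t = n_present if present else n - n_present
            for y, n_y in class_counts.items():
                n_ty = joint[(present, y)]
                if n_ty == 0:
                    continue  # an empty cell contributes nothing to the sum
                mi += (n_ty / n) * math.log2(n_ty * n / (n_t * n_y))
        scores[term] = mi
    return scores


def select_terms(scores, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def bow_vector(doc, terms):
    """Bag-of-words vector: count of each selected term in the document."""
    counts = Counter(doc)
    return [counts[t] for t in terms]


# Hypothetical toy corpus of tokenised e-mails (not data from the study).
docs = [
    ["bedava", "kazan", "tikla"],
    ["bedava", "hediye", "tikla"],
    ["toplanti", "yarin", "rapor"],
    ["rapor", "ekte", "yarin"],
]
labels = ["spam", "spam", "legit", "legit"]

scores = mutual_information(docs, labels)
terms = select_terms(scores, 4)
vector = bow_vector(["bedava", "bedava", "tikla"], terms)
```

In a full pipeline, the vectors produced by `bow_vector` would be the inputs to the classifiers; for an agglutinative language such as Turkish, the tokens would typically be stemmed or morphologically analysed first, which this sketch does not attempt.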
