Enhancing Predictive Accuracy in Phishing Attack Detection: A Study on the Impact of Collinearity and Feature Selection in ML-based Logistic Regression Models


  • Sagar Aghera, Nikhil Yogesh Joshi


Phishing URL, Machine Learning, Logistic Regression, Collinearity.


Phishing attacks endanger individuals and businesses alike, underscoring the need for reliable detection techniques. Effective anti-phishing measures are crucial for protecting confidential data and avoiding monetary losses. This study examines how logistic regression models can detect phishing attacks, focusing on the impact of collinearity and feature selection on predictive accuracy and model performance. Alongside logistic regression, other machine learning models (Decision Tree Classifier, Gaussian Naive Bayes, K-Nearest Neighbors, and Linear Discriminant Analysis) were considered in order to analyze the relationships between predictor variables and the likelihood of a successful phishing attack, and to compare the predictive accuracy of each method. Through experiments and comparisons, we show that addressing collinearity and employing feature selection techniques significantly improve the predictive accuracy of logistic regression models compared to other common machine learning models. Through a methodical feature engineering process focused on collinearity among predictors, we reduced the false negative rate of the logistic regression model by over 35%, which matters because false negatives are more costly than false positives. These findings provide insights for improving the efficiency of phishing detection systems and strengthening cybersecurity defenses against emerging threats.
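As a rough illustration of the two ideas the abstract relies on, the sketch below prunes highly correlated predictors and computes a false negative rate. It is not the authors' procedure, which the abstract does not specify. The correlation threshold of 0.9, the greedy keep-first rule, the feature names, the toy data, and the coding of phishing as the positive class 1 are all assumptions made here for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column has no linear relationship to report
    return cov / sqrt(vx * vy)


def drop_collinear(columns, threshold=0.9):
    """Greedily keep features whose |r| with every kept feature is <= threshold.

    `columns` maps feature name -> list of values. Insertion order decides
    which member of a correlated pair survives (an assumed tie-break rule).
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def false_negative_rate(y_true, y_pred, positive=1):
    """FN / (FN + TP): share of actual positives the model missed."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    return fn / (fn + tp) if (fn + tp) else 0.0


# Hypothetical toy data: url_length and char_count are near duplicates.
features = {
    "url_length": [10, 20, 30, 40, 50],
    "char_count": [11, 21, 29, 41, 50],
    "num_dots": [3, 1, 4, 1, 5],
}
selected = drop_collinear(features)

# Assumed encoding: 1 = phishing (positive class), 0 = legitimate.
fnr = false_negative_rate([1, 1, 1, 0, 0], [1, 0, 1, 0, 1])
```

Here `selected` keeps `url_length` and `num_dots` and drops the redundant `char_count`, and `fnr` is one missed phishing sample out of three. In a study like this one, the surviving features would then be passed to the logistic regression model, and the false negative rate compared before and after pruning.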








How to Cite

Sagar Aghera. (2024). Enhancing Predictive Accuracy in Phishing Attack Detection: A Study on the Impact of Collinearity and Feature Selection in ML-based Logistic Regression Models. International Journal of Intelligent Systems and Applications in Engineering, 12(4), 723–728. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/6277



Research Article