Chi-Square Method of Feature Selection: Impact of Pre-Processing of Data

Authors

  • Alisha Sikri, N. P. Singh, Surjeet Dalal

Keywords:

essential, assumptions, Python, subjecting, Chi-Square, algorithms

Abstract

Feature selection lowers computation and data-collection costs by rejecting less significant or redundant variables, which in turn may also increase the efficiency of machine learning algorithms. One of the often-used methods of feature selection for categorical data is the Chi-Square method, which is based on certain assumptions. Violation of these assumptions affects the computed p-values, which serve as a surrogate for the importance of the features. The main purpose of this study is to identify the impact of pre-processing the data, keeping in view the assumptions of Chi-Square, on the ranking of features. A secondary objective is to see how the built-in subroutines of computer languages such as Python or R incorporate the assumptions of Chi-Square. Based on empirical evidence, it was found that it is essential to pre-process the data to fulfill the assumptions of Chi-Square before subjecting it to analysis using either R or Python, or any other application available on the web or otherwise.
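The kind of check the abstract alludes to can be sketched in Python as follows. This is an illustrative sketch only, not the authors' procedure: the abstract does not list which assumptions were examined, so the example assumes the commonly cited rule of thumb that no more than 20% of cells should have an expected count below 5 and none below 1. The helper names (contingency_table, chi_square_with_check) and the toy data are hypothetical; scipy.stats.chi2_contingency is the only library routine relied on.

```python
import numpy as np
from scipy.stats import chi2_contingency


def contingency_table(feature, target):
    """Cross-tabulate two categorical sequences into observed counts."""
    _, row_idx = np.unique(np.asarray(feature).ravel(), return_inverse=True)
    _, col_idx = np.unique(np.asarray(target).ravel(), return_inverse=True)
    table = np.zeros((row_idx.max() + 1, col_idx.max() + 1), dtype=int)
    np.add.at(table, (row_idx, col_idx), 1)  # count each (feature, target) pair
    return table


def chi_square_with_check(feature, target, min_expected=5.0, max_share=0.2):
    """Run the chi-square test and report whether the assumed rule of thumb holds."""
    table = contingency_table(feature, target)
    stat, p_value, dof, expected = chi2_contingency(table, correction=False)
    share_small = float(np.mean(expected < min_expected))
    # Assumed rule: at most 20% of expected counts below 5, none below 1.
    ok = bool(share_small <= max_share and expected.min() >= 1.0)
    return {
        "statistic": float(stat),
        "p_value": float(p_value),
        "dof": int(dof),
        "share_small_expected": share_small,
        "assumption_ok": ok,
    }


# Toy data (made up): category "c" is rare, so its expected counts are small.
feature = ["a"] * 20 + ["b"] * 20 + ["c"] * 2
target = [0] * 10 + [1] * 10 + [0] * 10 + [1] * 10 + [0, 1]

raw = chi_square_with_check(feature, target)

# One possible pre-processing step: merge the rare category into a neighbour.
feature_merged = ["b" if value == "c" else value for value in feature]
merged = chi_square_with_check(feature_merged, target)

print("raw:", raw)
print("merged:", merged)
```

Merging sparse categories is only one possible pre-processing choice; which step is appropriate depends on the data and on which assumption is violated.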

Author Biography

Alisha Sikri (1), N. P. Singh (2), Surjeet Dalal (3)

1 Research Scholar, Department of Computer Science and Engineering, SRM University, Delhi-NCR, India. Email: alisha.sikri92@gmail.com
2 Professor, School of Business Management and Commerce, MVN University, (NCR), Palwal, Haryana, India. Email: dr.npsingh@mvn.edu.in
3 Professor, Department of Computer Science and Engineering, Amity University, Gurugram, Haryana, India. Email: sdalal@ggn.amity.edu

Published

04.02.2023

How to Cite

Alisha Sikri, N. P. Singh, Surjeet Dalal. (2023). Chi-Square Method of Feature Selection: Impact of Pre-Processing of Data. International Journal of Intelligent Systems and Applications in Engineering, 11(3s), 241–248. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/2680

Section

Research Article