A Block-Based Feature Selection Method for Classification of Web Pages

Azween Abdullah; Sandeep Kumar M.; Prabhu J.; Balamurugan Balusamy

Authors

Azween Abdullah, Faculty of Applied Computing, Perdana University, Kuala Lumpur, Malaysia
Sandeep Kumar M., School of Computing Science & Engineering, Galgotias University, Uttar Pradesh 203201, India
Prabhu J., School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu 632014, India
Balamurugan Balusamy, Associate Dean-Student Engagement, Delhi-National Capital Region (NCR), Shiv Nadar University, India

Keywords:

Web classification, Feature extraction, Blocks Segmentation, Spam Filtering, Semantics Word, Classifier

Abstract

Web page classification is a method for retrieving useful information that serves many purposes, such as searching, organizing, and spam filtering. Most existing web page classification algorithms extract all of the data on a page; however, recent work focuses on selective retrieval, which could improve the efficiency of classification. In this paper, we propose a block-wise feature selection algorithm that segments a web page into blocks and then filters out all non-important blocks. We introduce three features, namely 1) keyword weighting, 2) block segmentation, and 3) similarity measures, to improve the efficiency of the classification process. We select only the blocks that are crucial to classification. Because the useless blocks are removed, the feature space is reduced and the accuracy is increased. The semantic words are also eliminated, and the subset of most relevant features is chosen for building the classification model. The results show improved classification, as the relationship between the features and the target variable is understood more easily. To demonstrate the efficiency of the proposed model, we compared it with other top machine learning classifiers. Two datasets are used in our experiment. The experimental results showed that our proposed method, combined with four machine learning models, obtained up to 95% accuracy, which is 11.7% more than existing models.
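
The block-filtering idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify how blocks are segmented, how keywords are weighted, or which similarity measure is used. The sketch assumes that blocks are already given as strings, that keyword weights are plain term frequencies, that similarity is cosine similarity, and that a block is kept when its score reaches a hypothetical threshold of 0.2. The names `tokenize`, `cosine`, and `select_blocks` are illustrative only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_blocks(blocks, keywords, threshold=0.2):
    """Keep the blocks whose term-frequency vector is similar to the keyword vector.

    The threshold is a hypothetical value chosen for illustration.
    """
    keyword_vec = Counter(tokenize(" ".join(keywords)))
    kept = []
    for block in blocks:
        score = cosine(Counter(tokenize(block)), keyword_vec)
        if score >= threshold:
            kept.append((block, score))
    return kept


# Toy page already split into three blocks: navigation, main content, footer.
blocks = [
    "Home About Contact Login",
    "Machine learning classifiers label web pages by topic using text features",
    "Copyright 2021 all rights reserved",
]
keywords = ["web", "pages", "classifiers", "features"]
selected = select_blocks(blocks, keywords)
for block, score in selected:
    print(f"{score:.3f}  {block}")
```

In this toy example only the main content block shares terms with the keyword list, so the navigation and footer blocks are discarded and the feature space passed to a downstream classifier shrinks accordingly.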
