A Block-Based Feature Selection Method for Classification of Web Pages
Keywords:
Web classification, Feature extraction, Blocks Segmentation, Spam Filtering, Semantics Word, Classifier
Abstract
Web page classification is one of the methods for retrieving useful information, and it serves many purposes such as searching, organizing, and spam filtering. Most existing web page classification algorithms extract the entire content of a page; however, recent works focus on selective retrieval, which can improve the efficiency of classification. In this paper, we propose a block-wise feature selection algorithm that segments a web page into blocks and then filters out all non-important blocks. We introduce three features, namely 1) keyword weighting, 2) block segmentation, and 3) similarity measures, to improve the efficiency of the classification process. We select only the blocks that are crucial to the classification process. Since the useless blocks are removed, the feature space is reduced and the accuracy is increased. The semantic words are also eliminated, and the subset of most relevant features is chosen for building the classification model. The results show improved classification performance, as the relationship between the features and the target variable is easier to capture. To demonstrate the efficiency of the proposed model, we compared it with other top machine learning classifiers. Two datasets are used in our experiments. The experimental results show that the proposed method, applied with four machine learning models, obtains up to 95% accuracy, which is 11.7% higher than existing models.
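The block-filtering idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how blocks are segmented, how keywords are weighted, or which similarity measure is used, so everything below is an assumption made for illustration: blocks are given as plain strings, keywords are weighted by raw term frequency, and a block is kept when its cosine similarity to the whole-page term vector reaches a hypothetical `threshold`. The names `tokenize`, `cosine`, `select_blocks`, and `STOPWORDS` are hypothetical and do not come from the paper.

```python
import math
import re
from collections import Counter

# Assumed, tiny stop-word list for illustration only.
STOPWORDS = {"a", "an", "the", "and", "of", "by", "to", "in", "is", "all"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_blocks(blocks, threshold=0.5):
    """Keep blocks whose term vector is similar enough to the page vector.

    Assumption: raw term frequency stands in for keyword weighting, and the
    page vector is the sum of all block vectors.
    """
    block_vecs = [Counter(tokenize(b)) for b in blocks]
    page_vec = Counter()
    for vec in block_vecs:
        page_vec.update(vec)
    return [b for b, vec in zip(blocks, block_vecs)
            if cosine(vec, page_vec) >= threshold]


if __name__ == "__main__":
    page_blocks = [
        "Home About Contact Login",
        "Machine learning classifiers label web pages by topic",
        "Feature selection improves web page classifiers",
        "Copyright 2021 all rights reserved",
    ]
    for block in select_blocks(page_blocks):
        print(block)
```

In this toy page, the navigation and copyright blocks share no terms with the rest of the page and fall below the threshold, so only the two content blocks would be passed on to a classifier. A real system would need an actual segmentation step (for example, over the DOM) and a tuned threshold.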
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Under this license, users must give appropriate credit, provide a link to the license, and indicate if changes were made; if they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.