Document Clustering using Roberta and Convolution Neural Network Model

Authors

  • P. Saidesh Kumar Research Scholar, University College of Engineering, Osmania University, Hyderabad, India.
  • P. Vijayapal Reddy Prof., HOD CSE, Matrusri Engineering College, Hyderabad, India.

Keywords:

BERT, Convolution Neural Networks (CNN), Document Clustering, RoBERTa

Abstract

Document clustering is helpful in many text mining and information retrieval applications. It is the application of cluster analysis to text documents. It can be used to retrieve or filter information quickly, to organize documents into categories automatically, and to extract topics from texts. In this study, a document clustering technique based on deep learning and lexical text feature extraction is presented. Text features are extracted with RoBERTa (Robustly Optimized BERT Pre-training Approach), which modifies key hyperparameters of BERT (Bidirectional Encoder Representations from Transformers): BERT's next-sentence prediction pre-training objective is dropped, and training uses much larger mini-batches and higher learning rates. The features are passed to a CNN (Convolutional Neural Network) model containing dense and dropout layers. The proposed model obtained an accuracy of 98.3% on the BBC dataset and 98.2% on the News group dataset.
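The feature-to-CNN stage described in the abstract can be illustrated with a minimal forward-pass sketch. It is not the authors' implementation: the abstract gives no layer sizes, so the sequence length, feature dimension, filter count, kernel width, dropout rate, and number of classes below are all hypothetical, and random vectors stand in for the RoBERTa token features. The sketch only shows how a 1D convolution, pooling, dropout, and a dense softmax layer could be chained.

```python
import math
import random


def conv1d_relu(seq, kernels, bias):
    """Valid 1D convolution along the token axis, followed by ReLU.

    seq: T feature vectors of dimension D (stand-ins for RoBERTa outputs).
    kernels: F filters, each a list of `width` vectors of dimension D.
    Returns a (T - width + 1) x F feature map.
    """
    width = len(kernels[0])
    dim = len(seq[0])
    fmap = []
    for t in range(len(seq) - width + 1):
        row = []
        for kernel, b in zip(kernels, bias):
            s = b
            for i in range(width):
                for d in range(dim):
                    s += kernel[i][d] * seq[t + i][d]
            row.append(max(0.0, s))
        fmap.append(row)
    return fmap


def global_max_pool(fmap):
    """Keep the strongest response of each filter over all positions."""
    return [max(column) for column in zip(*fmap)]


def dropout(vec, rate, rng, training):
    """Inverted dropout: active only in training, identity at inference."""
    if not training:
        return list(vec)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vec]


def dense_softmax(vec, weights, bias):
    """Fully connected layer followed by a numerically stable softmax."""
    logits = [sum(w * v for w, v in zip(row, vec)) + b
              for row, b in zip(weights, bias)]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(seq, params, rng, training=False):
    fmap = conv1d_relu(seq, params["kernels"], params["conv_bias"])
    pooled = global_max_pool(fmap)
    dropped = dropout(pooled, 0.5, rng, training)  # 0.5 is an assumed rate
    return dense_softmax(dropped, params["dense_w"], params["dense_b"])


# Hypothetical sizes, chosen only to keep the example small.
T, D, F, WIDTH, CLASSES = 6, 4, 3, 2, 5
rng = random.Random(0)


def rand_vec(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


seq = [rand_vec(D) for _ in range(T)]  # placeholder for RoBERTa features
params = {
    "kernels": [[rand_vec(D) for _ in range(WIDTH)] for _ in range(F)],
    "conv_bias": [0.0] * F,
    "dense_w": [rand_vec(F) for _ in range(CLASSES)],
    "dense_b": [0.0] * CLASSES,
}

probs = forward(seq, params, rng)  # inference mode, so dropout is a no-op
predicted = max(range(CLASSES), key=probs.__getitem__)
print(predicted, [round(p, 3) for p in probs])
```

In practice the random `seq` would be replaced by features from a pre-trained RoBERTa encoder and the weights would be learned; the untrained weights here produce arbitrary class probabilities.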

Published

13.12.2023

How to Cite

Kumar, P. S., & Reddy, P. V. (2023). Document Clustering using Roberta and Convolution Neural Network Model. International Journal of Intelligent Systems and Applications in Engineering, 12(8s), 221–230. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/4112

Section

Research Article