Exploring NLP Techniques for Duplicate Question Detection to Maximizing Responses on Q&A Websites

Authors

  • Nilesh B. Korade, Assistant Professor, Department of Computer Engineering, JSPM’s Rajarshi Shahu College of Engineering, Tathawade, Pune – 411033, Maharashtra, India
  • Mahendra B. Salunke, Assistant Professor, Department of Computer Engineering, PCET’s Pimpri Chinchwad College of Engineering and Research, Ravet, Pune – 412101, Maharashtra, India
  • Gayatri G. Asalkar, Research Scholar, Department of Computer Science and Engineering, Shri Jagdishprasad Jhabarmal Tibrewala University, Vidyanagari, Churela – 333001, Rajasthan, India
  • Rutuja G. Khedkar, Assistant Professor, Department of Computer Engineering, JSPM’s Rajarshi Shahu College of Engineering, Tathawade, Pune – 411033, Maharashtra, India
  • Ashwini U. Bhosale, Assistant Professor, Department of Computer Engineering, JSPM’s Rajarshi Shahu College of Engineering, Tathawade, Pune – 411033, Maharashtra, India
  • Dhanashri M. Joshi, Assistant Professor, Department of Computer Engineering, JSPM’s Rajarshi Shahu College of Engineering, Tathawade, Pune – 411033, Maharashtra, India
  • Amol C. Jadhav, Assistant Professor, Department of Computer Engineering, JSPM’s Rajarshi Shahu College of Engineering, Tathawade, Pune – 411033, Maharashtra, India

Keywords:

Duplicate Question Detection, Feature engineering, Vectorization, Word2Vec, Cascaded CNN

Abstract

Question Answering Systems (QAS) are emerging technologies that offer accurate and precise responses to common questions. Duplicate Question Detection (DQD) can enhance the user experience and drastically decrease response time by reusing past responses. Because word choice and sentence construction can differ significantly, it is difficult to determine whether two questions are asking the same thing. Finding semantically identical questions on question-and-answer sites such as Quora, Stack Overflow, and Blurtit is important for ensuring that users receive content of both high quality and high quantity that matches the question's purpose, which improves the overall user experience. The presented research uses Quora's dataset of four lakh (400,000) labelled question pairs. We construct new features and demonstrate that they improve accuracy. The study examines various vectorization methods and how they affect accuracy, with Word2Vec performing well among the methods compared. To identify duplicate questions in the question pair dataset, we explored and applied various machine learning and deep learning techniques. The cascaded CNN outperforms the other algorithms considered and performs strongly across all assessment metrics.
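As a rough illustration of what pair-level feature construction can look like, the sketch below computes a few simple overlap features for two questions. It is an assumption-laden example only: the abstract does not list the authors' features, so the feature set and the names `tokenize` and `pair_features` are hypothetical, and a bag-of-words cosine stands in for the Word2Vec-based similarity the study evaluates.

```python
import math
import re
from collections import Counter


def tokenize(question: str) -> list[str]:
    """Lowercase a question and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", question.lower())


def pair_features(q1: str, q2: str) -> dict[str, float]:
    """Compute a few illustrative features for one question pair.

    These are hypothetical features, not the ones used in the paper.
    """
    t1, t2 = tokenize(q1), tokenize(q2)
    s1, s2 = set(t1), set(t2)
    union = s1 | s2
    # Jaccard overlap of the two vocabularies (0.0 when both are empty).
    jaccard = len(s1 & s2) / len(union) if union else 0.0

    # Cosine similarity between bag-of-words count vectors.
    c1, c2 = Counter(t1), Counter(t2)
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(
        sum(v * v for v in c2.values())
    )
    cosine = dot / norm if norm else 0.0

    return {
        "common_words": float(len(s1 & s2)),
        "jaccard": jaccard,
        "bow_cosine": cosine,
        "length_diff": float(abs(len(t1) - len(t2))),
    }


if __name__ == "__main__":
    print(pair_features("How do I learn Python?", "How can I learn Python?"))
```

Features of this kind would typically be concatenated with vectorized question representations before being passed to a classifier; how the authors combine them is not stated in the abstract.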



Published

24.03.2024

How to Cite

Korade, N. B., Salunke, M. B., Asalkar, G. G., Khedkar, R. G., Bhosale, A. U., Joshi, D. M., & Jadhav, A. C. (2024). Exploring NLP Techniques for Duplicate Question Detection to Maximizing Responses on Q&A Websites. International Journal of Intelligent Systems and Applications in Engineering, 12(3), 11–20. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/5218

Section

Research Article