Development and Evaluation of Extended Text Pre-processing Techniques for Hindi Document Clustering

Authors

  • Mukta M. Deshpande, Research Scholar, Symbiosis Institute of Computer Studies and Research, Pune, India
  • Prafulla B. Bafna, Assistant Professor, Symbiosis Institute of Computer Studies and Research, Pune, India

Keywords:

Pre-processing, Feature Extraction, Tokenization, Stopwords, Lemmatization, Hindi Document Clustering

Abstract

Data pre-processing, which involves cleaning and converting raw text into a format suitable for analysis, is a vital stage in text analytics. Clustering is a widely used technique in text analytics for grouping similar data points. However, the pre-processing techniques applied to the data can greatly influence the quality and effectiveness of clustering results. The goal of this study is to examine how the proposed pre-processing methods affect the performance of clustering algorithms. Several distinct combinations of pre-processing methods were applied before clustering Hindi documents, with the aim of identifying the combination that produces the most accurate and meaningful clusters. Clustering quality is assessed using Normalized Mutual Information (NMI), the silhouette score, and the Adjusted Rand Index (ARI). Principal Component Analysis (PCA) and dendrograms are the two visualization techniques used to gain insight into the clustering results. The findings can improve understanding of the pre-processing required in the clustering process and help researchers and practitioners apply clustering algorithms with greater accuracy.
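
Two of the metrics named above, ARI and NMI, compare predicted cluster labels against reference labels and can be computed from label counts alone. The following is a minimal pure-Python sketch of that comparison, not the authors' implementation. The function names and the toy label lists are hypothetical, and the NMI variant shown normalizes by the arithmetic mean of the two entropies, which is an assumption since the abstract does not state which normalization was used. The silhouette score is omitted because it needs the document vectors themselves, not just labels.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_true, labels_pred):
    """ARI from pair counts in the contingency table (assumes >= 2 items)."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case: no room above chance
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(labels_true, labels_pred):
    """NMI normalized by the arithmetic mean of the two entropies."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    mi = 0.0
    for (t, p), c in Counter(zip(labels_true, labels_pred)).items():
        mi += (c / n) * log(c * n / (count_true[t] * count_pred[p]))
    denom = (entropy(labels_true) + entropy(labels_pred)) / 2
    return mi / denom if denom > 0 else 1.0


# Toy labels for six documents (made up for illustration, not study data).
gold = [0, 0, 0, 1, 1, 1]
pipeline_a = [1, 1, 1, 0, 0, 0]  # same grouping, cluster ids swapped
pipeline_b = [0, 0, 1, 1, 1, 1]  # one document placed in the wrong cluster

for name, pred in [("pipeline_a", pipeline_a), ("pipeline_b", pipeline_b)]:
    ari = adjusted_rand_index(gold, pred)
    nmi = normalized_mutual_info(gold, pred)
    print(f"{name}: ARI={ari:.3f} NMI={nmi:.3f}")
```

Both metrics ignore the cluster ids themselves, so a clustering that matches the reference grouping under renamed ids still receives the maximum score. Running the same comparison once per pre-processing combination gives a simple way to rank the combinations.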

Published

10.11.2023

How to Cite

Deshpande, M. M., & Bafna, P. B. (2023). Development and Evaluation of Extended Text Pre-processing Techniques for Hindi Document Clustering. International Journal of Intelligent Systems and Applications in Engineering, 12(4s), 406–419. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/3799

Section

Research Article