Development and Evaluation of Extended Text Pre-processing Techniques for Hindi Document Clustering
Keywords:
Pre-processing, Feature Extraction, Tokenization, Stopwords, Lemmatization, Hindi Document Clustering
Abstract
Data pre-processing, which cleans and converts raw text into a format suitable for analysis, is a vital stage in text analytics. Clustering is a widely used text-analytics technique for grouping similar data points. However, the pre-processing techniques applied to the data can greatly influence the quality and effectiveness of the clustering results. This study examines how the proposed pre-processing methods affect clustering algorithm performance. Several distinct combinations of pre-processing methods were applied before document clustering, with the goal of identifying the combination that produces the most accurate and meaningful clusters. Clustering quality is assessed using Normalized Mutual Information (NMI), the silhouette score, and the Adjusted Rand Index (ARI). Principal Component Analysis (PCA) and dendrograms are two visualization techniques explored to gain insight into the clustering results. The findings can improve understanding of the pre-processing required in the clustering process and help researchers and practitioners apply clustering algorithms with greater accuracy.
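As an illustration of how two of the evaluation measures named above compare a clustering against reference labels, the following is a minimal sketch using only the Python standard library. It is not the authors' implementation. The function names are hypothetical, and the NMI here is normalized by the arithmetic mean of the two entropies, which is one common variant; the abstract does not state which normalization was used. The silhouette score is omitted because it needs document vectors rather than labels alone.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(truth, pred):
    """Pair-counting agreement between two labelings, corrected for chance."""
    n = len(truth)
    if n < 2:
        return 1.0  # assumption: treat trivial input as perfect agreement
    cells = Counter(zip(truth, pred))  # contingency table cells
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def entropy(labels):
    """Shannon entropy (natural log) of a label distribution."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(truth, pred):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(truth)
    rows, cols = Counter(truth), Counter(pred)
    mi = sum(
        (c / n) * log(c * n / (rows[t] * cols[p]))
        for (t, p), c in Counter(zip(truth, pred)).items()
    )
    denom = (entropy(truth) + entropy(pred)) / 2
    return mi / denom if denom > 0 else 1.0


if __name__ == "__main__":
    # Hypothetical labels for six documents; not data from the study.
    reference = ["sports", "sports", "politics", "politics", "health", "health"]
    clusters = [0, 0, 1, 1, 1, 2]
    print("ARI:", adjusted_rand_index(reference, clusters))
    print("NMI:", normalized_mutual_info(reference, clusters))
```

Both measures are invariant to how clusters are numbered, so they can be used to compare runs produced with different pre-processing combinations against the same reference labels.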
License
Copyright (c) 2023 Mukta M. Deshpande, Prafulla B. Bafna
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.