Development and Evaluation of Extended Text Pre-processing Techniques for Hindi Document Clustering


  • Mukta M. Deshpande, Research Scholar, Symbiosis Institute of Computer Studies and Research, Pune, India
  • Prafulla B. Bafna, Assistant Professor, Symbiosis Institute of Computer Studies and Research, Pune, India


Pre-processing, feature extraction, Tokenization, Stopwords, Lemmatization, Hindi Document Clustering


Data pre-processing, which cleans and converts raw text into a format suitable for analysis, is a vital stage in text analytics. Clustering is a widely used technique in text analytics for grouping similar data points, but the pre-processing techniques applied to the data can greatly influence the quality and effectiveness of the clustering results. This study examines how the proposed pre-processing methods affect the performance of clustering algorithms. Several distinct combinations of pre-processing methods were applied to produce document clusters, with the goal of identifying the combination that yields the most accurate and meaningful clusters. Clustering quality is assessed using Normalized Mutual Information (NMI), the silhouette score, and the Adjusted Rand Index (ARI). Principal Component Analysis (PCA) and dendrograms are the two visualization techniques used to gain insight into the clustering results. The findings can improve understanding of the pre-processing required in the clustering process and help researchers and practitioners apply clustering algorithms with greater accuracy.
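
To make two of the evaluation measures named above concrete, the following is a minimal sketch of ARI and NMI in plain Python. It is an illustration only, not the implementation used in this study: the function names, the toy label lists, and the choice of arithmetic-mean normalization for NMI are all assumptions. The silhouette score is omitted because it needs document feature vectors rather than labels alone.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(true_labels, pred_labels):
    """Chance-corrected pair-counting agreement between two labelings.

    Assumes both lists have the same length and at least two items.
    """
    n = len(true_labels)
    pair_counts = Counter(zip(true_labels, pred_labels))  # contingency table
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)


def normalized_mutual_info(true_labels, pred_labels):
    """Mutual information divided by the arithmetic mean of the entropies.

    The arithmetic-mean normalization is an assumption of this sketch.
    """
    n = len(true_labels)
    pair_counts = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)

    mutual_info = sum(
        (c / n) * log(c * n / (true_counts[t] * pred_counts[p]))
        for (t, p), c in pair_counts.items()
    )
    h_true = -sum((c / n) * log(c / n) for c in true_counts.values())
    h_pred = -sum((c / n) * log(c / n) for c in pred_counts.values())
    mean_entropy = (h_true + h_pred) / 2
    if mean_entropy == 0:  # both labelings put everything in one cluster
        return 1.0
    return mutual_info / mean_entropy


if __name__ == "__main__":
    # Hypothetical example: four documents, two reference categories.
    reference = [0, 0, 1, 1]
    clustering = [1, 1, 0, 0]  # same grouping, cluster ids swapped
    print(adjusted_rand_index(reference, clustering))
    print(normalized_mutual_info(reference, clustering))
```

Both measures compare the grouping of documents, not the cluster identifiers, so a clustering that recovers the reference categories under different ids still scores as a perfect match. Running the same functions on the cluster labels produced under each pre-processing combination would give one way to compare the combinations against a labeled reference.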


