HYBRID STOPWORD DETECTION FOR CLUSTERING TAMIL TEXT DATA
Keywords:
Text clustering, TF-IDF, Co-occurrence graph, Hybrid, dynamic stopword, HDBSCAN algorithm
Abstract
Traditional stopword removal techniques rely on static lists or individual statistical filters, which
often fail to capture the contextual irrelevance of words across domains and languages. This paper proposes
a hybrid dynamic stopword detection framework that integrates three methods: frequency analysis, TF-IDF
scoring, and frequency-normalized centrality computed from word co-occurrence graphs. The frequency-normalized
method adaptively filters contextually uninformative terms while preserving semantically rich content,
significantly improving downstream clustering and topic modelling performance. The filtered text is embedded
using a feature fusion of Sentence-BERT and TF-IDF vectors, reduced via UMAP, and clustered using HDBSCAN, a
density-based algorithm capable of identifying clusters of varying shapes and densities. The proposed hybrid
method increases the number of dynamic stopwords detected. Evaluations on Tamil text data demonstrate enhanced
clustering quality, measured by Silhouette score, Davies-Bouldin index (DBI), and topic coherence, indicating
the method's effectiveness for the morphologically rich Tamil language.
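The three stopword signals named in the abstract can be illustrated with a small sketch. The abstract does not give the formulas, thresholds, or the rule for combining the signals, so everything specific here is an assumption made for illustration: the function name `hybrid_stopwords`, the default thresholds, the use of graph degree divided by corpus frequency as the "frequency-normalized centrality", the median cutoff, and the majority vote. The tokens are romanised placeholders, not data from the paper.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations
from statistics import median


def hybrid_stopwords(docs, df_ratio_min=0.8, tfidf_max=0.05, min_votes=2):
    """Return terms flagged by at least `min_votes` of three signals.

    docs is a list of token lists. All thresholds and the voting rule are
    illustrative assumptions, not values taken from the paper.
    """
    term_freq = Counter(tok for doc in docs for tok in doc)       # corpus frequency
    doc_freq = Counter(tok for doc in docs for tok in set(doc))   # document frequency
    if not term_freq:
        return set()
    n_docs = len(docs)

    # Signal 1: frequency analysis (share of documents containing the term).
    by_frequency = {t for t, df in doc_freq.items() if df / n_docs >= df_ratio_min}

    # Signal 2: TF-IDF scoring (mean TF-IDF over documents containing the term).
    tfidf_sum = defaultdict(float)
    for doc in docs:
        for t, count in Counter(doc).items():
            tfidf_sum[t] += (count / len(doc)) * math.log(n_docs / doc_freq[t])
    by_tfidf = {t for t in doc_freq if tfidf_sum[t] / doc_freq[t] <= tfidf_max}

    # Signal 3 (assumed form): degree in a document-level co-occurrence graph
    # divided by corpus frequency. Terms strictly below the median are flagged,
    # i.e. frequent terms whose occurrences add few new neighbours.
    neighbours = defaultdict(set)
    for doc in docs:
        for a, b in combinations(set(doc), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    norm_centrality = {t: len(neighbours[t]) / term_freq[t] for t in term_freq}
    cutoff = median(norm_centrality.values())
    by_centrality = {t for t, score in norm_centrality.items() if score < cutoff}

    # Combine the three signals by simple voting (assumption).
    votes = Counter()
    for flagged in (by_frequency, by_tfidf, by_centrality):
        votes.update(flagged)
    return {t for t, v in votes.items() if v >= min_votes}


# Toy corpus of romanised placeholder tokens (hypothetical data).
docs = [
    ["oru", "mazhai", "matrum", "vivasayam", "nel"],
    ["oru", "cricket", "ani", "matrum", "vetri"],
    ["oru", "nel", "vivasayam", "matrum", "mazhai"],
    ["oru", "ani", "vetri", "matrum", "cricket"],
]
stopwords = hybrid_stopwords(docs)
filtered = [[t for t in doc if t not in stopwords] for doc in docs]
print(sorted(stopwords))  # ['matrum', 'oru']
```

Setting `min_votes=1` turns the vote into a union of the three signals, which flags more terms. The filtered token lists would then feed the embedding, UMAP, and HDBSCAN stages described above, which this sketch does not cover.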
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. This license requires users to give appropriate credit, provide a link to the license, and indicate if changes were made; if they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.