HYBRID STOPWORD DETECTION FOR CLUSTERING TAMIL TEXT DATA

Authors

  • S. Sujatha, Grasha Jacob

Keywords:

Text clustering, TF-IDF, Co-occurrence graph, Hybrid, Dynamic stopword, HDBSCAN algorithm

Abstract

Traditional stopword removal techniques rely on static lists or individual statistical filters, which
often fail to capture the contextual irrelevance of words across domains and languages. This paper proposes
a hybrid dynamic stopword detection framework that integrates three methods: frequency analysis, TF-IDF
scoring, and frequency-normalized centrality from word co-occurrence graphs. The frequency-normalized method
adaptively filters contextually uninformative terms while preserving semantically rich content, significantly
improving downstream clustering and topic modelling performance. The filtered text is embedded using feature
fusion of Sentence-BERT and TF-IDF vectors, reduced via UMAP, and clustered using HDBSCAN, a density-based
algorithm capable of identifying clusters of varying shapes and densities. The proposed hybrid method increases
the number of dynamic stopwords detected. Evaluations on Tamil text data show enhanced clustering quality, as
measured by Silhouette score, Davies-Bouldin index (DBI), and topic coherence, supporting the method's
effectiveness for the morphologically rich Tamil language.
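
The three signals named in the abstract can be illustrated with a minimal sketch. The abstract does not state how the signals are normalized, thresholded, or combined, so everything specific here is an assumption made for illustration: the function name `hybrid_stopwords`, the thresholds `df_min`, `tfidf_max`, and `ratio_max`, the reading of "frequency-normalized centrality" as graph degree divided by corpus frequency, and the `min_votes` voting rule. The toy corpus is hypothetical and pre-tokenized. The embedding, UMAP, and HDBSCAN stages are not shown.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def hybrid_stopwords(docs, df_min=0.8, tfidf_max=0.05, ratio_max=1.5, min_votes=2):
    """Flag tokens that at least `min_votes` of three toy filters mark as stopwords.

    All thresholds and the voting rule are illustrative assumptions,
    not values taken from the paper.
    """
    n_docs = len(docs)
    freq = Counter(tok for doc in docs for tok in doc)           # corpus frequency
    doc_freq = Counter(tok for doc in docs for tok in set(doc))  # document frequency

    # Mean TF-IDF over the documents containing the token, with idf = ln(N / df).
    tfidf_sum = defaultdict(float)
    for doc in docs:
        for tok, count in Counter(doc).items():
            tfidf_sum[tok] += (count / len(doc)) * math.log(n_docs / doc_freq[tok])
    mean_tfidf = {tok: tfidf_sum[tok] / doc_freq[tok] for tok in freq}

    # Document-level co-occurrence graph: an edge links two distinct tokens
    # that appear in the same document.
    neighbours = defaultdict(set)
    for doc in docs:
        for a, b in combinations(set(doc), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    # Assumed normalization: graph degree divided by corpus frequency.
    ratio = {tok: len(neighbours[tok]) / freq[tok] for tok in freq}

    flagged = set()
    for tok in freq:
        votes = (
            (doc_freq[tok] / n_docs >= df_min)   # frequency filter
            + (mean_tfidf[tok] <= tfidf_max)     # TF-IDF filter
            + (ratio[tok] <= ratio_max)          # graph filter (assumed direction)
        )
        if votes >= min_votes:
            flagged.add(tok)
    return flagged


# Hypothetical toy corpus: two function words appear in every document,
# while content words stay within one of two topics (sports, weather).
toy_docs = [
    ["இந்த", "அணி", "ஒரு", "வெற்றி"],
    ["இந்த", "வெற்றி", "ஒரு", "கோப்பை"],
    ["இந்த", "அணி", "ஒரு", "கோப்பை"],
    ["இந்த", "மழை", "ஒரு", "வெள்ளம்"],
    ["இந்த", "வெள்ளம்", "ஒரு", "புயல்"],
    ["இந்த", "மழை", "ஒரு", "புயல்"],
]

stopwords = hybrid_stopwords(toy_docs)
filtered_docs = [[tok for tok in doc if tok not in stopwords] for doc in toy_docs]
print(sorted(stopwords))
```

On this toy corpus all three filters agree on the two function words, so the vote is redundant. On real Tamil text the filters would be expected to disagree, and the thresholds and combination rule would need to be tuned against the clustering metrics the abstract lists.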

Published

15.10.2024

How to Cite

S.Sujatha. (2024). HYBRID STOPWORD DETECTION FOR CLUSTERING TAMIL TEXT DATA. International Journal of Intelligent Systems and Applications in Engineering, 12(23s), 3584–3595. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/7789

Section

Research Article