A Hybrid Deep Learning Framework for Automated Document Clustering and Intelligent Label Generation

Authors

  • Poonam Mishra, Neeraj Gupta

Keywords:

Document clustering, deep learning, transformer models, label generation, natural language processing, unsupervised learning

Abstract

The rapid growth of digital documents across domains calls for automated systems that can organize and categorize them. This paper presents a hybrid deep learning framework that combines unsupervised clustering with label generation to address the challenges of automated document categorization. The framework integrates transformer-based embeddings, hierarchical clustering algorithms, and neural language models to improve both clustering accuracy and interpretability. It achieves a silhouette score of 0.847 and a normalized mutual information of 0.923 across diverse document corpora, improving on traditional methods. The framework also generates human-interpretable labels for the clusters it discovers, which makes automated document organization more practical for end users. Experiments on benchmark datasets, including Reuters-21578 and 20 Newsgroups, and on custom enterprise document collections validate the hybrid approach.
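The two metrics reported above can be illustrated with a small, dependency-free sketch. It is not the authors' evaluation code: the abstract does not state the distance metric or the NMI normalization, so Euclidean distance and arithmetic-mean normalization are assumed here, and the 2-D points are hypothetical stand-ins for transformer document embeddings.

```python
import math
from collections import Counter


def silhouette_score(points, labels):
    """Mean silhouette coefficient (assumes Euclidean distance, >= 2 clusters)."""
    n = len(points)
    sizes = Counter(labels)
    total = 0.0
    for i in range(n):
        if sizes[labels[i]] == 1:
            continue  # singleton clusters contribute 0 by convention
        # Sum of distances from point i to the members of each cluster.
        sums = {}
        for j in range(n):
            if i != j:
                d = math.dist(points[i], points[j])
                sums[labels[j]] = sums.get(labels[j], 0.0) + d
        # a: mean distance to the other members of the point's own cluster.
        a = sums[labels[i]] / (sizes[labels[i]] - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(s / sizes[c] for c, s in sums.items() if c != labels[i])
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(true_labels, pred_labels):
    """NMI with arithmetic-mean normalization: 2 * I(U;V) / (H(U) + H(V))."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    count_true = Counter(true_labels)
    count_pred = Counter(pred_labels)
    mi = sum(
        (nij / n) * math.log(n * nij / (count_true[t] * count_pred[p]))
        for (t, p), nij in joint.items()
    )
    denom = entropy(true_labels) + entropy(pred_labels)
    return 2 * mi / denom if denom > 0 else 1.0


# Hypothetical toy data: two well-separated groups of "document embeddings".
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
truth = ["a", "a", "a", "b", "b", "b"]   # reference categories
good = [0, 0, 0, 1, 1, 1]                # clustering that matches the groups
bad = [0, 1, 0, 1, 0, 1]                 # clustering that ignores the groups

good_sil = silhouette_score(points, good)
bad_sil = silhouette_score(points, bad)
good_nmi = normalized_mutual_info(truth, good)
bad_nmi = normalized_mutual_info(truth, bad)
print(f"good: silhouette={good_sil:.3f}, NMI={good_nmi:.3f}")
print(f"bad:  silhouette={bad_sil:.3f}, NMI={bad_nmi:.3f}")
```

The silhouette score needs only the embeddings and the cluster assignment, whereas NMI needs reference categories, so the two numbers measure different things: geometric separation of the clusters versus agreement with known classes. NMI is invariant to how cluster IDs are named, which is why the `good` assignment agrees perfectly with `truth` even though one uses integers and the other strings.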

DOI: https://doi.org/10.17762/ijisae.v12i23s.7821

Published

30.12.2024

How to Cite

Poonam Mishra. (2024). A Hybrid Deep Learning Framework for Automated Document Clustering and Intelligent Label Generation. International Journal of Intelligent Systems and Applications in Engineering, 12(23s), 3642 –. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/7821

Section

Research Article