A Hybrid Deep Learning Framework for Automated Document Clustering and Intelligent Label Generation
Keywords:
Document clustering, deep learning, transformer models, label generation, natural language processing, unsupervised learning
Abstract
The rapid growth of digital documents across domains calls for automated systems that can organize and categorize them. This paper presents a novel hybrid deep learning framework that combines unsupervised clustering with label generation to address the challenges of automated document classification. The framework integrates transformer-based embeddings, hierarchical clustering algorithms, and neural language models to improve both clustering accuracy and interpretability. Our approach improves significantly on traditional methods, achieving a silhouette score of 0.847 and a normalized mutual information of 0.923 across diverse document corpora. Its ability to generate meaningful, human-interpretable labels for the discovered clusters makes automated document organization systems more practical and user-friendly. Experimental results on benchmark datasets, including Reuters-21578, 20 Newsgroups, and custom enterprise document collections, validate the effectiveness of the hybrid approach.
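To make the two reported metrics concrete, the following is a minimal sketch of how a silhouette score and normalized mutual information (NMI) can be computed for a clustering result. It is not the authors' evaluation code. It assumes Euclidean distance for the silhouette and normalization by the arithmetic mean of the two label entropies for NMI, neither of which is stated in the abstract. The toy points and labels are hypothetical stand-ins for document embeddings and categories.

```python
import math
from collections import Counter


def silhouette_score(points, labels):
    """Mean silhouette coefficient (assumption: Euclidean distance)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(true_labels, pred_labels):
    """NMI (assumption: normalized by the arithmetic mean of the entropies)."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    count_t = Counter(true_labels)
    count_p = Counter(pred_labels)
    mi = sum(
        (c / n) * math.log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in joint.items()
    )
    h_t, h_p = entropy(true_labels), entropy(pred_labels)
    if h_t == 0.0 and h_p == 0.0:
        return 1.0  # both assignments are a single cluster
    return mi / ((h_t + h_p) / 2)


# Hypothetical toy data: four 2-D "embeddings" in two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
pred = [0, 0, 1, 1]
truth = ["sports", "sports", "finance", "finance"]

score = silhouette_score(points, pred)
nmi = normalized_mutual_info(truth, pred)
print(f"silhouette={score:.3f}  nmi={nmi:.3f}")
```

The silhouette score needs only the embeddings and the predicted clusters, so it measures how compact and well separated the clusters are. NMI additionally needs reference categories, which is why it is typically reported on labeled benchmarks such as those named in the abstract.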
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. This license requires users to give appropriate credit, provide a link to the license, and indicate if changes were made. If they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.