Gated Dual Adaptive Attention Mechanism with Semantic Reasoning, Character Awareness, and Visual-Semantic Ensemble Fusion Decoder for Text Recognition in Natural Scene Images

Authors

  • A. S. Venkata Praneel Department of Computer Science and Engineering, GITAM (Deemed-to-be University), Visakhapatnam-530045, AP, India
  • T. Srinivasa Rao Department of Computer Science and Engineering, GITAM (Deemed-to-be University), Visakhapatnam-530045, AP, India

Keywords:

Instance Segmentation, Text Recognition, TCN, PBTPN, MS-RCNN, GDAAM, Semantic Reasoning, Character Awareness, Visual Cue, Semantic Cue

Abstract

Text recognition in natural scene images poses a significant challenge due to variations in font styles, sizes, and orientations, complex backgrounds, and lighting conditions. This paper proposes the Gated Dual Adaptive Attention Mechanism (GDAAM), a novel framework that combines a Mask Scoring Region-based Convolutional Neural Network (MS-RCNN), a Pyramid-based Text Proposal Network (PBTPN), and a Transformation Component Network (TCN) as the encoder, together with semantic reasoning, character awareness, and a visual-semantic ensemble fusion decoder, for accurate text recognition in natural scene images. The encoder component of GDAAM leverages two robust architectures: MS-RCNN and PBTPN+TCN. MS-RCNN is utilised for its strong object detection capabilities, allowing accurate localisation of text regions within scene images, while PBTPN+TCN captures temporal dependencies and contextual information in images containing text sequences. By combining these encoders, GDAAM extracts comprehensive features across spatial and temporal dimensions, enabling effective representation of text elements. To facilitate fine-grained attention modelling, the decoder incorporates a gated dual adaptive attention mechanism that selectively focuses on relevant visual and textual cues, dynamically adapting its attention weights to the input. Through its gating mechanisms, GDAAM efficiently integrates visual and textual information, enhancing recognition accuracy in challenging natural scene images. Semantic reasoning is another crucial aspect of GDAAM: a reasoning module incorporates contextual information, enabling the model to reason over ambiguous inputs and make informed decisions. GDAAM also addresses character awareness to handle the complex text layouts, irregularities, and occlusions commonly found in natural scene images, further improving its ability to recognise text accurately in difficult visual environments. Finally, the proposed visual-semantic ensemble fusion decoder combines visual and semantic features to generate the final recognition results; by effectively fusing and integrating information from both modalities, GDAAM produces coherent and contextually consistent outputs. Extensive experiments on the benchmark datasets SVT, ICDAR 2013, ICDAR 2015, IIIT5K, SVTP, and CUTE80 demonstrate the effectiveness of GDAAM: it outperforms state-of-the-art approaches in accuracy and robustness, opening new avenues for accurate and robust text recognition in complex visual environments.
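To make the fusion step concrete, the sketch below illustrates one plausible form of the gated visual-semantic fusion the abstract describes: a sigmoid gate that blends a visual feature stream and a semantic feature stream channel-by-channel before character classification. This is a minimal illustration, not the paper's implementation; the class name GatedVisualSemanticFusion, the shared dimension d_model, the 97-way character set, and the convex blend are all assumptions introduced here for clarity.

```python
import torch
import torch.nn as nn


class GatedVisualSemanticFusion(nn.Module):
    """Minimal sketch of a gated visual-semantic fusion step.

    Assumptions (not specified in the abstract): both cue streams are
    projected to a common dimension d_model, and a sigmoid gate decides,
    per decoding step and per channel, how much to trust each stream.
    """

    def __init__(self, d_model: int, num_classes: int = 97):
        super().__init__()
        # Gate computed from the concatenated visual and semantic features.
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.Sigmoid(),
        )
        # Hypothetical 97-way charset (digits, letters, punctuation).
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, visual: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        # visual, semantic: (batch, seq_len, d_model), aligned per decoding step.
        g = self.gate(torch.cat([visual, semantic], dim=-1))  # values in [0, 1]
        fused = g * visual + (1.0 - g) * semantic             # convex, channel-wise blend
        return self.classifier(fused)                         # per-step character logits


if __name__ == "__main__":
    # Usage: fuse dummy visual and semantic features for a 25-step decode.
    fuser = GatedVisualSemanticFusion(d_model=256)
    v = torch.randn(2, 25, 256)  # visual cues, e.g. from the MS-RCNN / PBTPN+TCN encoder
    s = torch.randn(2, 25, 256)  # semantic cues, e.g. from the reasoning module
    logits = fuser(v, s)
    print(logits.shape)          # torch.Size([2, 25, 97])
```

A gate of this form lets the decoder lean on the semantic stream where the visual evidence is degraded (blur, occlusion) and on the visual stream elsewhere, which matches the intuition behind the gating mechanism described in the abstract.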

Published

25.12.2023

How to Cite

Praneel, A. S. V., & Rao, T. S. (2023). Gated Dual Adaptive Attention Mechanism with Semantic Reasoning, Character Awareness, and Visual-Semantic Ensemble Fusion Decoder for Text Recognition in Natural Scene Images. International Journal of Intelligent Systems and Applications in Engineering, 12(1), 221–234. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/3779

Section

Research Article