Generation of Image Captions using Deep Learning and Natural Language Processing: A Review

Authors

  • M Balakrishna Mallapu, Deepthi Godavarthi

Keywords

Deep learning, Natural Language Processing, Computer Vision

Abstract

Deep learning methodologies hold significant promise for applications that aim to generate image captions or image descriptions automatically. Image captioning is among the most challenging problems in image research: it seeks to automatically generate descriptive sentences from an image's visual content. It is a multidisciplinary task that combines Artificial Intelligence (AI), Natural Language Processing (NLP), and Computer Vision (CV). Captioning requires recognizing the primary objects in an image, their attributes, and their interactions, and it must also produce sentences that are syntactically and semantically correct. We then review the current literature on using language models to improve various applications, including image captioning, report generation, report classification, extraction of findings, and visual question answering. In this article, we present a comprehensive overview of existing deep learning approaches to image captioning. We also describe the datasets and evaluation metrics commonly used in deep learning for automatic image captioning.
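To make the encoder-decoder formulation surveyed in this review concrete, the sketch below shows a minimal CNN-encoder plus LSTM-decoder captioning model in PyTorch. It is an illustrative assumption on our part, not the method of any particular reviewed paper; the ResNet-50 backbone, vocabulary size, and embedding and hidden dimensions are placeholder choices.

```python
# Minimal illustrative CNN-encoder + LSTM-decoder captioning sketch (PyTorch).
# Hyperparameters (vocab_size, embed_dim, hidden_dim) are placeholder assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


class EncoderCNN(nn.Module):
    """Encode an image into a fixed-length feature vector with a pretrained CNN."""
    def __init__(self, embed_dim: int):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the classification head; keep the pooled convolutional features.
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.fc = nn.Linear(backbone.fc.in_features, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                      # keep the backbone frozen
            feats = self.cnn(images).flatten(1)    # (batch, 2048)
        return self.fc(feats)                      # (batch, embed_dim)


class DecoderRNN(nn.Module):
    """Generate a caption word by word, conditioned on the image embedding."""
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_embed: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Prepend the image embedding as the first "token" of the sequence.
        word_embeds = self.embed(captions)                          # (batch, T, embed_dim)
        inputs = torch.cat([img_embed.unsqueeze(1), word_embeds], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                                     # (batch, T+1, vocab_size)


if __name__ == "__main__":
    encoder = EncoderCNN(embed_dim=256)
    decoder = DecoderRNN(vocab_size=10000, embed_dim=256, hidden_dim=512)
    images = torch.randn(2, 3, 224, 224)          # dummy batch of images
    captions = torch.randint(0, 10000, (2, 12))   # dummy tokenized captions
    logits = decoder(encoder(images), captions)
    print(logits.shape)                           # torch.Size([2, 13, 10000])
```

In practice such a model would be trained with token-level cross-entropy on a captioning dataset such as MS COCO or Flickr30k and scored with the evaluation metrics discussed in the review (e.g., BLEU, METEOR, ROUGE, CIDEr, SPICE).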




Published

24.03.2024

How to Cite

M Balakrishna Mallapu. (2024). Generation of Image Captions using Deep Learning and Natural Language Processing: A Review. International Journal of Intelligent Systems and Applications in Engineering, 12(3), 3582–3596. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/5995

Issue

Section

Research Article