Generation of Image Captions using Deep Learning and Natural Language Processing: A Review
Keywords:
Deep learning, Natural Language Processing, Computer Vision

Abstract
Deep learning methodologies offer significant possibilities for applications that generate image captions or descriptions automatically. Image captioning is among the most challenging problems in image research: it aims to produce descriptive sentences automatically from an image's visual content. It is a multidisciplinary task that combines Artificial Intelligence (AI), Natural Language Processing (NLP), and Computer Vision (CV). Captioning requires recognizing the primary elements of an image, their attributes, and their interactions, and it must also produce sentences that are syntactically and semantically correct. We evaluate the current literature on using language models to improve a range of applications, including image captioning, report generation, report classification, extraction of findings, and visual question answering. In this article we present a comprehensive overview of deep learning approaches to image captioning, and we describe the datasets and evaluation metrics commonly used for automatic image captioning.
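Among the evaluation metrics mentioned above, BLEU (Papineni et al., 2002) is the most widely reported for captioning. As an illustrative sketch only (a simplified sentence-level variant, not the official corpus-level implementation), it can be computed as the geometric mean of clipped n-gram precisions times a brevity penalty:

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        if not cand_counts:  # candidate too short to have any n-grams
            return 0.0
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for r in refs:
            for g, c in ngrams(r, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        precisions.append(clipped / sum(cand_counts.values()))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: penalise candidates shorter than the closest reference.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect match scores 1.0, while a partial overlap with the reference caption yields a fractional score; metrics such as METEOR, ROUGE, CIDEr, and SPICE refine this idea with synonym matching, recall, consensus weighting, and semantic propositions, respectively.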
![Creative Commons License](http://i.creativecommons.org/l/by-sa/4.0/88x31.png)
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.