A Quantitative Based Research on the Production of Image Captioning


  • Samuel-Soma M. Ajibade Department of Computer Engineering, Istanbul Ticaret University, Turkiye
  • Abdelhamid Zaidi Department of Mathematics, College of Science, Qassim University, P.O. Box 6644, Buraydah 51452, Saudi Arabia
  • Siti Sarah Maidin Faculty of Data Science & Information Technology, INTI International University, Nilai, Malaysia
  • Wan Hussain Wan Ishak School of Computing, Universiti Utara Malaysia, Malaysia
  • Adedotun Adetunla Department of Mechanical and Mechatronics Engineering, Afe Babalola University, Ado Ekiti, Nigeria


Attention Model, Image Caption, Multimodal Model, Region Level Captions, Semantic Content


Image caption production combines computer vision and natural language processing so that a system can recognize the context of an image and describe it with a relevant caption. This article reviews and analyzes image captioning techniques that have evolved over the past few decades, including the Attention Model, Region-Level Caption Detection, Semantic Content-Based Models, and Multimodal Models. These techniques are evaluated with criteria such as Precision Rate, Recall Rate, F1 Score, and Accuracy Rate, and are compared across various datasets. The article offers a comprehensive structural examination of contemporary image captioning methods. Researchers can use the insights from this analysis to develop improved approaches that avoid the shortcomings of older methods while retaining their strengths.
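
As a rough illustration of the evaluation criteria named in the abstract, the sketch below computes precision, recall, and F1 for one generated caption against one reference caption using clipped word overlap. This is an assumption made for illustration only: the abstract does not specify how the surveyed works compute these scores, and the function name `token_overlap_scores` and the example captions are hypothetical.

```python
from collections import Counter


def token_overlap_scores(candidate: str, reference: str) -> dict:
    """Precision, recall and F1 from word overlap (illustrative assumption,
    not the evaluation procedure of any surveyed paper)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Clipped overlap: a reference word can be matched at most as many
    # times as it occurs in the reference.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand) if cand else 0.0
    recall = overlap / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical captions, chosen only to exercise the function.
scores = token_overlap_scores(
    "a dog runs on the grass",
    "a dog is running on the grass",
)
print(scores)
```

Here five of the six candidate words appear in the seven-word reference, so precision is 5/6, recall is 5/7, and F1 is their harmonic mean, 10/13.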




How to Cite

Ajibade, S.-S. M., Zaidi, A., Maidin, S. S., Wan Ishak, W. H., & Adetunla, A. (2023). A Quantitative Based Research on the Production of Image Captioning. International Journal of Intelligent Systems and Applications in Engineering, 11(4), 816–830. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/3615



Research Article