Story Telling of a Single Image Using Redescriptions through Image Description Vision Transformer (IDVT) Algorithm


  • Darapu Uma, M. Kamala Kumari


Vision Transformer, Image Description, Feature Extraction, Story Telling, Redescription


Image captioning is the process of transforming an input image into a textual description. It uses both Computer Vision and Natural Language Processing techniques to generate captions. Applications of image captioning include automated annotation and tagging of images, self-driving cars, virtual and augmented reality, surveillance and security systems, and object recognition and detection in images and videos. Existing techniques, such as Bidirectional Recurrent Neural Networks (BRNN) and combined Convolutional and Recurrent Neural Networks (CNN and RNN), produce captions that lack context and appropriate meaning. This paper proposes storytelling for a single image using vision transformers: it narrates a story of a single image by applying a proposed algorithm named Image Description Vision Transformer (IDVT). IDVT combines preprocessing techniques with the unsupervised algorithms k-means and mean shift to generate various descriptions of the same image, from which a story is finally produced.
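The abstract does not say how k-means and mean shift are applied inside IDVT. One plausible reading is that each algorithm groups the image's pixels into regions, giving alternative simplified views of the same image that a captioning model could then describe separately. The sketch below illustrates only that clustering step, under that assumption, on a toy list of RGB pixels. The function names (`kmeans`, `mean_shift`, `recolor`), the farthest-first initialization, the flat kernel, and the bandwidth value are all illustrative choices, not the authors' implementation, and the vision transformer captioning stage is omitted.

```python
def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans(pixels, k, iters=20):
    """Lloyd's k-means with deterministic farthest-first initialization."""
    centers = [pixels[0]]
    while len(centers) < k:
        # Next center: the pixel farthest from all centers chosen so far.
        centers.append(max(pixels, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = [0] * len(pixels)
    for _ in range(iters):
        # Assignment step: each pixel joins its nearest center.
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in pixels]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(pixels, labels) if l == j]
            if members:
                centers[j] = mean(members)
    return centers, labels

def mean_shift(pixels, bandwidth, iters=30, tol=1e-3):
    """Flat-kernel mean shift; the number of clusters is not fixed in advance."""
    b2 = bandwidth ** 2
    shifted = []
    for p in pixels:
        x = tuple(float(v) for v in p)
        for _ in range(iters):
            window = [q for q in pixels if dist2(x, q) <= b2]
            if not window:
                break
            new_x = mean(window)
            moved = dist2(new_x, x)
            x = new_x
            if moved < tol:
                break
        shifted.append(x)
    # Merge converged points that landed within one bandwidth of a known mode.
    modes, labels = [], []
    for x in shifted:
        for j, m in enumerate(modes):
            if dist2(x, m) <= b2:
                labels.append(j)
                break
        else:
            modes.append(x)
            labels.append(len(modes) - 1)
    return modes, labels

def recolor(centers, labels):
    """Replace every pixel with its cluster's representative color."""
    return [tuple(round(v) for v in centers[l]) for l in labels]

# Toy "image": four reddish pixels followed by four bluish pixels.
pixels = [(250, 10, 10), (240, 20, 15), (245, 5, 20), (235, 15, 5),
          (10, 10, 250), (20, 15, 240), (5, 20, 245), (15, 5, 235)]

km_centers, km_labels = kmeans(pixels, k=2)
ms_modes, ms_labels = mean_shift(pixels, bandwidth=60)

# Two simplified views of the same image, one per clustering algorithm.
views = {"kmeans": recolor(km_centers, km_labels),
         "mean_shift": recolor(ms_modes, ms_labels)}
```

On this toy input both algorithms separate the reddish pixels from the bluish ones. The practical difference is that k-means needs the number of clusters `k` up front, whereas mean shift infers it from the bandwidth, so on real images the two can yield different segmentations of the same scene.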








How to Cite

Darapu Uma. (2024). Story Telling of a Single Image Using Redescriptions through Image Description Vision Transformer (IDVT) Algorithm. International Journal of Intelligent Systems and Applications in Engineering, 12(22s), 390–412. Retrieved from



Research Article