Enhanced Caption Generation Model Using Hawk Swarm Optimization Based BiLSTM Model

Authors

  • Sumedh Pundlikrao Ingale, Gajendra Rambhau Bamnote

Keywords:

Hawk swarm optimization, BiLSTM, tokenization, images, transcripts, VGG-16

Abstract

Accurate representation and contextual understanding are difficult problems when integrating computer vision and language processing to generate captions for visual data. Existing models struggle to produce intricate captions while preserving fine visual details, and achieving high accuracy with few overfitting issues is difficult. To overcome these difficulties, this work develops a hawk swarm optimization based BiLSTM model (HSO-BiLSTM) to enhance the process of generating captions for visual data. Two datasets are utilized: Flickr30k (d1) and COCO (d2). Initially, the images and their corresponding transcripts are segregated and preprocessed separately. The images undergo distinct preprocessing steps, after which VGG-16 is employed for feature extraction. For the transcripts, a vocabulary is constructed, the text is tokenized, indices are assigned, and the sequences are padded. Both sets of features are then integrated to train a Bidirectional Long Short-Term Memory (BiLSTM) model, and its effectiveness is enhanced by fine-tuning with the Harris Hawks optimization (HHO) and Harmony Search optimization techniques. The optimized BiLSTM is then employed to generate captions for the transcripts. For dataset 1 the evaluation metrics attain values of 0.50, 0.32, 0.55, and 26.66; for dataset 2 the corresponding values are 0.49, 0.32, 0.57, and 26.58.
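The pipeline summarized above can be made concrete with a short sketch. The fragment below is a minimal illustration under stated assumptions, not the authors' implementation: the VGG-16 fc2 feature branch, Keras tokenization and padding, and the fused BiLSTM decoder follow standard library usage, while max_len, lstm_units, dropout_rate, the placeholder caption, the synthetic training tensors, and the random candidate loop that stands in for the Harris Hawks / Harmony Search fine-tuning are illustrative choices introduced here.

# Illustrative sketch only (not the authors' code): VGG-16 feature extraction,
# transcript tokenization/padding, a fused BiLSTM caption decoder, and a
# fitness-driven hyperparameter search standing in for the HHO/Harmony Search step.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, Embedding, LSTM, Bidirectional, Dense, Dropout, add
from tensorflow.keras.models import Model

# Image branch: 4096-d fc2 features from a pretrained VGG-16.
vgg = VGG16(weights="imagenet", include_top=True)
feature_extractor = Model(vgg.input, vgg.get_layer("fc2").output)

def image_features(path):
    img = img_to_array(load_img(path, target_size=(224, 224)))
    return feature_extractor.predict(preprocess_input(img[None, ...]), verbose=0)[0]

# Transcript branch: vocabulary construction, tokenization, indexing, and padding.
captions = ["startseq a dog runs on the beach endseq"]   # placeholder transcript
tokenizer = Tokenizer(oov_token="<unk>")
tokenizer.fit_on_texts(captions)
vocab_size = len(tokenizer.word_index) + 1
max_len = 30                                              # assumed maximum caption length
seqs = pad_sequences(tokenizer.texts_to_sequences(captions), maxlen=max_len, padding="post")

# Fused BiLSTM decoder; lstm_units and dropout_rate are the hyperparameters to tune.
def build_model(lstm_units=256, dropout_rate=0.3):
    img_in = Input(shape=(4096,))
    img_vec = Dense(2 * lstm_units, activation="relu")(Dropout(dropout_rate)(img_in))
    txt_in = Input(shape=(max_len,))
    txt_emb = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
    txt_vec = Bidirectional(LSTM(lstm_units))(Dropout(dropout_rate)(txt_emb))
    fused = Dense(lstm_units, activation="relu")(add([img_vec, txt_vec]))
    out = Dense(vocab_size, activation="softmax")(fused)   # next-word distribution
    model = Model([img_in, txt_in], out)
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
    return model

# Stand-in for the HHO/Harmony Search fine-tuning: score random candidates on
# synthetic data and keep the best; the real algorithms update candidates with
# hawk- and harmony-style rules rather than uniform resampling.
rng = np.random.default_rng(0)
X_img = rng.normal(size=(8, 4096)).astype("float32")       # dummy image features
X_txt = rng.integers(1, vocab_size, size=(8, max_len))     # dummy partial captions
y = rng.integers(1, vocab_size, size=(8,))                 # dummy next-word targets
best, best_loss = None, float("inf")
for _ in range(3):
    units, rate = int(rng.choice([128, 256])), float(rng.uniform(0.1, 0.5))
    hist = build_model(units, rate).fit([X_img, X_txt], y, epochs=1, verbose=0)
    if hist.history["loss"][-1] < best_loss:
        best, best_loss = (units, rate), hist.history["loss"][-1]
print("selected (lstm_units, dropout_rate):", best)

In the paper itself, the Harris Hawks and Harmony Search updates would drive the candidate search, and the fitness would be validation performance on Flickr30k and COCO captions rather than the dummy tensors used above.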

Published

24.03.2024

How to Cite

Sumedh Pundlikrao Ingale, & Gajendra Rambhau Bamnote. (2024). Enhanced Caption Generation Model Using Hawk Swarm Optimization Based BiLSTM Model. International Journal of Intelligent Systems and Applications in Engineering, 12(3), 3080–3092. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/5899

Issue

Vol. 12 No. 3 (2024)

Section

Research Article