Deep Neural Networks for Automated Image Captioning to Improve Accessibility for Visually Impaired Users

Authors

  • Yashwant Dongare, Assistant Professor, Department of Computer Engineering, Vishwakarma Institute of Information Technology, Pune, Maharashtra, India
  • Bhalchandra M. Hardas, Assistant Professor, Department of Electronics and Computer Science, Shri Ramdeobaba College of Engineering and Management, Nagpur, Maharashtra, India
  • Rashmita Srinivasan, Associate Professor, Department of Civil Engineering, Maharashtra Institute of Technology (Autonomous), Aurangabad, Maharashtra, India
  • Vidula Meshram, Assistant Professor, Department of Computer Engineering, Vishwakarma Institute of Information Technology, Pune, Maharashtra, India
  • Mithun G. Aush, Assistant Professor, Department of Electrical Engineering, Chh. Shahu College of Engineering, Aurangabad, Maharashtra, India
  • Atul Kulkarni, Professor, Department of Mechanical Engineering, Vishwakarma Institute of Information Technology, Pune, Maharashtra, India

Keywords:

Image captioning, convolutional neural network, deep learning, LSTM, RNN, automated caption generation

Abstract

Advances in image understanding and automatic image captioning have led many researchers to apply artificial intelligence and machine learning models to assist blind and visually impaired users. This research investigates the design and evaluation of deep neural network models for automatic image captioning, with a focus on improving accessibility for people with visual impairments. The proposed method uses deep learning techniques, specifically convolutional neural networks (CNNs) for extracting visual features and recurrent neural networks (RNNs) for generating descriptive captions. The CNN extracts the relevant features from the input images and feeds them into the RNN, which generates the textual descriptions. The models are trained on large-scale image–caption datasets and incorporate techniques such as attention mechanisms and beam search to improve the quality and coherence of the output captions. Extensive experiments are carried out on benchmark datasets such as MS COCO and Flickr30k to assess the performance of the models, and the generated captions are evaluated with objective metrics including BLEU, METEOR, and CIDEr. Additionally, a user study with visually impaired participants is conducted to determine how well the automatic image captioning system improves accessibility. The results show that the proposed deep neural network models for automatic image captioning are effective.
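To make the described pipeline concrete, below is a minimal sketch of a CNN-encoder/RNN-decoder captioning model in PyTorch. It is not the authors' released code: the ResNet-18 backbone, LSTM decoder, vocabulary size, and layer widths are illustrative assumptions, and the attention mechanism and beam search mentioned in the abstract are omitted for brevity.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """CNN that maps an image to a fixed-length feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        resnet = models.resnet18(weights=None)  # load pretrained weights in practice
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop final fc
        self.fc = nn.Linear(resnet.fc.in_features, feat_dim)

    def forward(self, images):
        feats = self.backbone(images).flatten(1)  # (B, 512)
        return self.fc(feats)                     # (B, feat_dim)

class Decoder(nn.Module):
    """LSTM that produces caption logits conditioned on the image feature."""
    def __init__(self, vocab_size, feat_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # The image feature acts as the first "token"; the caption tokens
        # follow it (teacher forcing during training).
        inputs = torch.cat([feats.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                   # (B, T+1, vocab_size)

# Toy forward pass: a batch of 2 RGB images with 10-token captions.
encoder, decoder = Encoder(), Decoder(vocab_size=5000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 5000, (2, 10))
logits = decoder(encoder(images), captions)
print(logits.shape)  # torch.Size([2, 11, 5000])
```

At inference time the decoder is unrolled one token at a time, and beam search (keeping the k highest-scoring partial captions at each step) typically replaces greedy decoding, as the abstract notes. Automatic scoring can likewise be sketched; the snippet below computes a bigram BLEU score with NLTK (METEOR and CIDEr require additional tooling and are omitted), where the two toy captions are invented for illustration.

```python
from nltk.translate.bleu_score import sentence_bleu

reference = [["a", "dog", "runs", "across", "the", "grass"]]     # ground-truth caption(s)
candidate = ["a", "dog", "is", "running", "on", "the", "grass"]  # model output
print(sentence_bleu(reference, candidate, weights=(0.5, 0.5)))   # BLEU-2
```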

Published

27.10.2023

How to Cite

Dongare, Y., Hardas, B. M., Srinivasan, R., Meshram, V., Aush, M. G., & Kulkarni, A. (2023). Deep Neural Networks for Automated Image Captioning to Improve Accessibility for Visually Impaired Users. International Journal of Intelligent Systems and Applications in Engineering, 12(2s), 267–281. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/3578

Issue

Vol. 12 No. 2s

Section

Research Article