Enhancing Sign Language Recognition: A Fusion of Bidirectional LSTMs and BiGRUs in Video Processing

Authors

  • Ajay M. Pol, Shrinivas A. Patil

Keywords

Bidirectional RNN, Deep Learning, ResNet101, Sign Language Recognition, Video Sequence

Abstract

This study addresses sign language recognition on the WLASL dataset, evaluating several models and comparing their performance metrics. The proposed reinforcement learning (RL) model performs best, achieving 99% accuracy, 99% sensitivity, 98% specificity, and a 99% F1 score. EfficientNet-B1 extracts features more effectively than the widely used ResNet-101, and the integration of bidirectional recurrent neural networks (RNNs) highlights the importance of temporal modeling for accurate sign language recognition. The RL-enhanced EfficientNet-B1 also generates contextually rich captions, reaching a BLEU score of 0.51. These findings contribute to ongoing advances in sign language recognition technology and underscore the role of reinforcement learning and model selection in achieving high accuracy and contextual understanding on the challenging WLASL dataset.
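
The pipeline described above pairs per-frame EfficientNet-B1 features with parallel bidirectional LSTM and GRU branches. The PyTorch sketch below illustrates one plausible reading of that fusion; the hidden size, class count, clip length, and concatenation of the two branches' final time steps are illustrative assumptions, not details taken from the paper.

    # Hedged sketch: EfficientNet-B1 frame features -> BiLSTM + BiGRU fusion.
    # Hidden size, class count, and fusion-by-concatenation are assumptions.
    import torch
    import torch.nn as nn
    from torchvision.models import efficientnet_b1

    class SignRecognizer(nn.Module):
        def __init__(self, num_classes=100, hidden=256):
            super().__init__()
            backbone = efficientnet_b1(weights="IMAGENET1K_V1")
            backbone.classifier = nn.Identity()  # keep the 1280-d pooled features
            self.backbone = backbone
            self.bilstm = nn.LSTM(1280, hidden, batch_first=True, bidirectional=True)
            self.bigru = nn.GRU(1280, hidden, batch_first=True, bidirectional=True)
            self.head = nn.Linear(4 * hidden, num_classes)  # both branches, both directions

        def forward(self, clip):  # clip: (B, T, 3, H, W)
            b, t = clip.shape[:2]
            feats = self.backbone(clip.flatten(0, 1)).view(b, t, -1)  # (B, T, 1280)
            lstm_out, _ = self.bilstm(feats)  # (B, T, 2*hidden)
            gru_out, _ = self.bigru(feats)    # (B, T, 2*hidden)
            fused = torch.cat([lstm_out[:, -1], gru_out[:, -1]], dim=-1)
            return self.head(fused)           # per-clip gloss logits

    logits = SignRecognizer()(torch.randn(2, 16, 3, 240, 240))  # 16-frame clips

For reference, the reported sensitivity and specificity follow the standard definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), computed from the confusion counts on the evaluation split.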

Published

16.03.2024

How to Cite

Pol, A. M., & Patil, S. A. (2024). Enhancing Sign Language Recognition: A Fusion of Bidirectional LSTMs and BiGRUs in Video Processing. International Journal of Intelligent Systems and Applications in Engineering, 12(3), 869–876. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/5366

Issue

Vol. 12 No. 3 (2024)

Section

Research Article