Hybrid Deep Learning Techniques for Large-Scale Video Classification
Keywords:
Deep Learning, Video Classification, Convolutional Neural Networks, Recurrent Neural Networks, Feature Extraction
Abstract
The rapid growth of video data on the Internet makes effective large-scale video management and classification increasingly necessary. Real-world deployment requires a careful evaluation of the trade-off between timeliness and efficacy: in industrial settings, frame extraction is frequently used to categorize video actions, while video classification is implemented with temporal segment networks. The scientific literature already contains several reviews and research articles on video classification. By analyzing spatial and temporal information concurrently and efficiently, the combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) provides an effective framework for video classification problems. This research presents a comparison of how CNNs and RNNs, integrated into different architectures, can exploit temporal information to improve deep-learning-based video classification accuracy. To optimize the performance of the proposed CNN-RNN hybrid, a novel action-template-based feature extraction technique is introduced; it extracts features by analyzing the similarity between the informative regions of each frame. Extensive experiments with RNN-based video classifiers were performed on the UCF50 and UCF101 datasets. The considerable improvement in classification accuracy observed in the experimental results, examined with a one-way analysis of variance, demonstrates the effectiveness of the proposed feature extraction technique.
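The action-template idea sketched in the abstract can be illustrated with a minimal, hypothetical example (the function and parameter names below are illustrative assumptions, not the paper's implementation): maintain a running template vector over per-frame features (e.g. pooled CNN activations), keep only frames whose cosine similarity to the template falls below a threshold, and fold each kept frame back into the template.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def select_template_frames(frame_features, threshold=0.9):
    """Keep frames that differ enough from the current action template.

    frame_features: list of per-frame feature vectors.
    Returns the indices of the selected (informative) frames.
    """
    if not frame_features:
        return []
    selected = [0]                      # the first frame seeds the template
    template = list(frame_features[0])
    for i, feat in enumerate(frame_features[1:], start=1):
        if cosine_sim(feat, template) < threshold:
            selected.append(i)          # informative change: keep this frame
            # update the template as a running average of the kept frames
            n = len(selected)
            template = [(t * (n - 1) + f) / n for t, f in zip(template, feat)]
    return selected

# Two near-duplicate frames followed by a sharp appearance change:
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.98]]
print(select_template_frames(frames, threshold=0.9))  # → [0, 2, 3]
```

The selected frames would then be fed to the CNN-RNN classifier, so redundant near-duplicate frames do not dominate the temporal sequence.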
References
Gong, X., & Li, Z. (2022). A Video Classification Method Based on Spatiotemporal Detail Attention and Feature Fusion. Mobile Information Systems, 2022.
Hu, Z. P., Zhang, R. X., Qiu, Y., Zhao, M. Y., & Sun, Z. (2021). 3D convolutional networks with multi-layer-pooling selection fusion for video classification. Multimedia Tools and Applications, 80, 33179-33192.
Savran Kızıltepe, R., Gan, J. Q., & Escobar, J. J. (2023). A novel keyframe extraction method for video classification using deep neural networks. Neural Computing and Applications, 35(34), 24513-24524.
Ballas, N., Yao, L., Pal, C., & Courville, A. (2015). Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432.
Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2625-2634).
Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6299-6308).
Duan, H., Zhao, Y., Xiong, Y., Liu, W., & Lin, D. (2020, August). Omni-sourced webly-supervised learning for video recognition. In European Conference on Computer Vision (pp. 670-688). Cham: Springer International Publishing.
Kalfaoglu, M. E., Kalkan, S., & Alatan, A. A. (2020). Late temporal modeling in 3d cnn architectures with bert for action recognition. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16 (pp. 731-747). Springer International Publishing.
Mao, F., Wu, X., Xue, H., & Zhang, R. (2018). Hierarchical video frame sequence representation with deep convolutional graph network. In Proceedings of the European conference on computer vision (ECCV) workshops.
Qiu, Z., Yao, T., Ngo, C. W., Tian, X., & Mei, T. (2019). Learning spatio-temporal representation with local and global diffusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12056-12065).
Savran Kızıltepe, R., Gan, J. Q., & Escobar, J. J. (2019). Combining very deep convolutional neural networks and recurrent neural networks for video classification. In Advances in Computational Intelligence: 15th International Work-Conference on Artificial Neural Networks, IWANN 2019, Gran Canaria, Spain, June 12-14, 2019, Proceedings, Part II 15 (pp. 811-822). Springer International Publishing.
Lin, J., Gan, C., & Han, S. (2019). Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 7083-7093).
Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6202-6211).
Kondratyuk, D., Yuan, L., Li, Y., Zhang, L., Tan, M., Brown, M., & Gong, B. (2021). Movinets: Mobile video networks for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16020-16030).
Wang, X., Xiong, X., Neumann, M., Piergiovanni, A. J., Ryoo, M. S., Angelova, A., ... & Hua, W. (2020). Attentionnas: Spatiotemporal attention cell search for video classification. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16 (pp. 449-465). Springer International Publishing.
Ahmad, H., Khan, H. U., Ali, S., Rahman, S. I. U., Wahid, F., & Khattak, H. (2022). Effective video summarization approach based on visual attention.
Apostolidis, E., Balaouras, G., Mezaris, V., & Patras, I. (2021, November). Combining global and local attention with positional encoding for video summarization. In 2021 IEEE international symposium on multimedia (ISM) (pp. 226-234). IEEE.
Wu, G., Lin, J., & Silva, C. T. (2022). Intentvizor: Towards generic query guided interactive video summarization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10503-10512).
Ghauri, J. A., Hakimov, S., & Ewerth, R. (2021, July). Supervised video summarization via multiple feature sets with parallel attention. In 2021 IEEE International Conference on Multimedia and Expo (ICME) (pp. 1-6). IEEE.
Bao, G., Li, D., & Mei, Y. (2020, September). Features extraction based on optical-flow and mutual information entropy. In Journal of Physics: Conference Series (Vol. 1646, No. 1, p. 012112). IOP Publishing.
Nguyen-Thai, B., Le, V., Morgan, C., Badawi, N., Tran, T., & Venkatesh, S. (2021). A spatio-temporal attention-based model for infant movement assessment from videos. IEEE journal of biomedical and health informatics, 25(10), 3911-3920.
Nasir, J. A., Khan, O. S., & Varlamis, I. (2021). Fake news detection: A hybrid CNN-RNN based deep learning approach. International Journal of Information Management Data Insights, 1(1), 100007.
Graves, A. (2012). Long short-term memory. Supervised sequence labelling with recurrent neural networks, 37-45.
Kollias, D., & Zafeiriou, S. (2020). Exploiting multi-cnn features in cnn-rnn based dimensional emotion recognition on the omg in-the-wild dataset. IEEE Transactions on Affective Computing, 12(3), 595-606.
Masood, S., Srivastava, A., Thuwal, H. C., & Ahmad, M. (2018). Real-time sign language gesture (word) recognition from video sequences using CNN and RNN. In Intelligent Engineering Informatics: Proceedings of the 6th International Conference on FICTA (pp. 623-632). Springer Singapore.
Zhang, X., Chen, F., & Huang, R. (2018). A combination of RNN and CNN for attention-based relation classification. Procedia computer science, 131, 911-917.
Zhou, C., Sun, C., Liu, Z., & Lau, F. (2015). A C-LSTM neural network for text classification. arXiv preprint arXiv:1511.08630.
Drumond, T. F., Viéville, T., & Alexandre, F. (2019). Bio-inspired analysis of deep learning on not-so-big data using data-prototypes. Frontiers in computational neuroscience, 12, 100.
Kar, A. K. (2016). Bio inspired computing–a review of algorithms and scope of applications. Expert Systems with Applications, 59, 20-32.
Reddy, K. K., & Shah, M. (2013). Recognizing 50 human action categories of web videos. Machine vision and applications, 24(5), 971-981.
Wu, Z., Wang, X., Jiang, Y. G., Ye, H., & Xue, X. (2015, October). Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In Proceedings of the 23rd ACM international conference on Multimedia (pp. 461-470).
Muruganandam, S., Joshi, R., Suresh, P., Balakrishna, N., Kishore, K. H., & Manikanthan, S. V. (2023). A deep learning based feed forward artificial neural network to predict the K-barriers for intrusion detection using a wireless sensor network. Measurement: Sensors, 25, 100613.
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.