Hybrid Deep Learning Techniques for Large-Scale Video Classification
Keywords:
Deep Learning, Video Classification, Convolutional Neural Networks, Recurrent Neural Networks, Feature Extraction
Abstract
The rapid growth of video data on the Internet makes effective large-scale video management and classification increasingly necessary. Real-world deployment requires a careful evaluation of the trade-off between timeliness and efficacy: in industrial settings, frame extraction is frequently used to categorize video actions, while video classification is implemented with temporal segment networks. The scientific literature already contains several reviews and research articles on video classification. By analyzing spatial and temporal information concurrently and efficiently, the combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) provides an effective framework for video classification problems. This research presents a comparison of how CNNs and RNNs, integrated into different architectures, can exploit temporal information to improve deep-learning-based video classification accuracy. To optimize the performance of the proposed CNN-RNN hybrid, a novel action-template-based feature extraction technique is introduced; it extracts features by analyzing the similarity between the informative regions of each frame. Extensive experiments with RNN-based video classifiers were performed on the UCF50 and UCF101 datasets. The considerable improvement in classification accuracy observed in the experimental results, examined with a one-way analysis of variance, demonstrates the effectiveness of the proposed feature extraction technique.
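The action-template idea sketched in the abstract can be illustrated with a minimal, hypothetical example (the function and parameter names below are illustrative assumptions, not the paper's implementation): maintain a running template vector over per-frame features (e.g. pooled CNN activations), keep only frames whose cosine similarity to the template falls below a threshold, and fold each kept frame back into the template.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def select_template_frames(frame_features, threshold=0.9):
    """Keep frames that differ enough from the current action template.

    frame_features: list of per-frame feature vectors.
    Returns the indices of the selected (informative) frames.
    """
    if not frame_features:
        return []
    selected = [0]                      # the first frame seeds the template
    template = list(frame_features[0])
    for i, feat in enumerate(frame_features[1:], start=1):
        if cosine_sim(feat, template) < threshold:
            selected.append(i)          # informative change: keep this frame
            # update the template as a running average of the kept frames
            n = len(selected)
            template = [(t * (n - 1) + f) / n for t, f in zip(template, feat)]
    return selected

# Two near-duplicate frames followed by a sharp appearance change:
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.98]]
print(select_template_frames(frames, threshold=0.9))  # → [0, 2, 3]
```

The selected frames would then be fed to the CNN-RNN classifier, so redundant near-duplicate frames do not dominate the temporal sequence.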
References
Gong, X., & Li, Z. (2022). A Video Classification Method Based on Spatiotemporal Detail Attention and Feature Fusion. Mobile Information Systems, 2022.
Hu, Z. P., Zhang, R. X., Qiu, Y., Zhao, M. Y., & Sun, Z. (2021). 3D convolutional networks with multi-layer-pooling selection fusion for video classification. Multimedia Tools and Applications, 80, 33179-33192.
Savran Kızıltepe, R., Gan, J. Q., & Escobar, J. J. (2023). A novel keyframe extraction method for video classification using deep neural networks. Neural Computing and Applications, 35(34), 24513-24524.
Ballas, N., Yao, L., Pal, C., & Courville, A. (2015). Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432.
Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2625-2634).
Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6299-6308).
Duan, H., Zhao, Y., Xiong, Y., Liu, W., & Lin, D. (2020, August). Omni-sourced webly-supervised learning for video recognition. In European Conference on Computer Vision (pp. 670-688). Cham: Springer International Publishing.
Kalfaoglu, M. E., Kalkan, S., & Alatan, A. A. (2020). Late temporal modeling in 3d cnn architectures with bert for action recognition. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16 (pp. 731-747). Springer International Publishing.
Mao, F., Wu, X., Xue, H., & Zhang, R. (2018). Hierarchical video frame sequence representation with deep convolutional graph network. In Proceedings of the European conference on computer vision (ECCV) workshops.
Qiu, Z., Yao, T., Ngo, C. W., Tian, X., & Mei, T. (2019). Learning spatio-temporal representation with local and global diffusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12056-12065).
Savran Kızıltepe, R., Gan, J. Q., & Escobar, J. J. (2019). Combining very deep convolutional neural networks and recurrent neural networks for video classification. In Advances in Computational Intelligence: 15th International Work-Conference on Artificial Neural Networks, IWANN 2019, Gran Canaria, Spain, June 12-14, 2019, Proceedings, Part II 15 (pp. 811-822). Springer International Publishing.
Lin, J., Gan, C., & Han, S. (2019). Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 7083-7093).
Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6202-6211).
Kondratyuk, D., Yuan, L., Li, Y., Zhang, L., Tan, M., Brown, M., & Gong, B. (2021). Movinets: Mobile video networks for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16020-16030).
Wang, X., Xiong, X., Neumann, M., Piergiovanni, A. J., Ryoo, M. S., Angelova, A., ... & Hua, W. (2020). Attentionnas: Spatiotemporal attention cell search for video classification. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16 (pp. 449-465). Springer International Publishing.
Ahmad, H., Khan, H. U., Ali, S., Rahman, S. I. U., Wahid, F., & Khattak, H. (2022). Effective video summarization approach based on visual attention.
Apostolidis, E., Balaouras, G., Mezaris, V., & Patras, I. (2021, November). Combining global and local attention with positional encoding for video summarization. In 2021 IEEE international symposium on multimedia (ISM) (pp. 226-234). IEEE.
Wu, G., Lin, J., & Silva, C. T. (2022). Intentvizor: Towards generic query guided interactive video summarization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10503-10512).
Ghauri, J. A., Hakimov, S., & Ewerth, R. (2021, July). Supervised video summarization via multiple feature sets with parallel attention. In 2021 IEEE International Conference on Multimedia and Expo (ICME) (pp. 1-6). IEEE.
Bao, G., Li, D., & Mei, Y. (2020, September). Features extraction based on optical-flow and mutual information entropy. In Journal of Physics: Conference Series (Vol. 1646, No. 1, p. 012112). IOP Publishing.
Nguyen-Thai, B., Le, V., Morgan, C., Badawi, N., Tran, T., & Venkatesh, S. (2021). A spatio-temporal attention-based model for infant movement assessment from videos. IEEE journal of biomedical and health informatics, 25(10), 3911-3920.
Nasir, J. A., Khan, O. S., & Varlamis, I. (2021). Fake news detection: A hybrid CNN-RNN based deep learning approach. International Journal of Information Management Data Insights, 1(1), 100007.
Graves, A. (2012). Long short-term memory. Supervised sequence labelling with recurrent neural networks, 37-45.
Kollias, D., & Zafeiriou, S. (2020). Exploiting multi-cnn features in cnn-rnn based dimensional emotion recognition on the omg in-the-wild dataset. IEEE Transactions on Affective Computing, 12(3), 595-606.
Masood, S., Srivastava, A., Thuwal, H. C., & Ahmad, M. (2018). Real-time sign language gesture (word) recognition from video sequences using CNN and RNN. In Intelligent Engineering Informatics: Proceedings of the 6th International Conference on FICTA (pp. 623-632). Springer Singapore.
Zhang, X., Chen, F., & Huang, R. (2018). A combination of RNN and CNN for attention-based relation classification. Procedia computer science, 131, 911-917.
Zhou, C., Sun, C., Liu, Z., & Lau, F. (2015). A C-LSTM neural network for text classification. arXiv preprint arXiv:1511.08630.
Drumond, T. F., Viéville, T., & Alexandre, F. (2019). Bio-inspired analysis of deep learning on not-so-big data using data-prototypes. Frontiers in computational neuroscience, 12, 100.
Kar, A. K. (2016). Bio inspired computing–a review of algorithms and scope of applications. Expert Systems with Applications, 59, 20-32.
Reddy, K. K., & Shah, M. (2013). Recognizing 50 human action categories of web videos. Machine vision and applications, 24(5), 971-981.
Wu, Z., Wang, X., Jiang, Y. G., Ye, H., & Xue, X. (2015, October). Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In Proceedings of the 23rd ACM international conference on Multimedia (pp. 461-470).
Muruganandam, S., Joshi, R., Suresh, P., Balakrishna, N., Kishore, K. H., & Manikanthan, S. V. (2023). A deep learning based feed forward artificial neural network to predict the K-barriers for intrusion detection using a wireless sensor network. Measurement: Sensors, 25, 100613.
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.