Comparative Analysis of Various Textual-Visual Models for Self-Attentive Query Focused Video Summarization

Authors

  • Sheetal Girase, Mangesh Bedekar, Devashish Bote, Vidya Dhopate

Keywords

Video summarization, keyframes, multimodal fusion, semantic embedding space

Abstract

The exponential growth of video data makes it difficult to extract pertinent information. Video summarization addresses this by extracting the essential content of a video so that it can be explored more easily. Because what counts as "relevant information" in a video is subjective and depends on user preferences, a summarization mechanism should take those preferences into account. One way to do so is to let users supply a query. Rather than generating a fixed summary for a given video, this study investigates generating a summary tailored to the user's preferences. Query Focused Video Summarization (QFVS) is treated as a supervised learning problem on the YouTube Dataset [4]. It produces a summary from two user inputs: the video and a textual query. The relevance of each video frame to the query is determined by mapping frames and query into a shared multimodal semantic embedding space. Using our attention network and encoder, we improved accuracy from 61.91% [4] to 74.60%. Extensive experiments were conducted with the deep learning models ResNet34 and DenseNet for image feature extraction and with word2vec and GloVe for word embeddings. The experiments compare different combinations of these textual and image features.
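
As a rough illustration of the shared embedding space idea described in the abstract, the following minimal sketch scores frames against a query by cosine similarity after projecting both into a common space. It is not the authors' model: the attention network and encoder are omitted, the projection matrices `W_img` and `W_txt` are random placeholders standing in for learned weights, and the feature sizes (512 for a ResNet34-style image feature, 300 for a word2vec/GloVe-style word vector) as well as the function names are assumptions made only for this example.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def query_relevance(frame_feats, query_vec, W_img, W_txt):
    """Project frames and query into a shared space and score each frame.

    W_img and W_txt are hypothetical projection matrices; in a trained
    system they would be learned, here they are supplied by the caller.
    """
    frames = l2_normalize(frame_feats @ W_img)  # shape (n_frames, shared_dim)
    query = l2_normalize(query_vec @ W_txt)     # shape (shared_dim,)
    return frames @ query                       # cosine similarity per frame

def select_keyframes(scores, k):
    """Return the indices of the k highest-scoring frames in temporal order."""
    top = np.argsort(scores)[::-1][:k]
    return np.sort(top)

# Synthetic stand-ins for extracted features (assumed dimensions).
rng = np.random.default_rng(0)
n_frames, img_dim, txt_dim, shared_dim = 8, 512, 300, 64
frame_feats = rng.standard_normal((n_frames, img_dim))
query_vec = rng.standard_normal(txt_dim)
W_img = rng.standard_normal((img_dim, shared_dim))
W_txt = rng.standard_normal((txt_dim, shared_dim))

scores = query_relevance(frame_feats, query_vec, W_img, W_txt)
keyframes = select_keyframes(scores, k=3)
print(keyframes)
```

With real data, `frame_feats` would come from an image model and `query_vec` from averaged or encoded word vectors; the top-k selection is just one simple way to turn relevance scores into a keyframe summary.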

References

Sharghi, Aidean, Boqing Gong, and Mubarak Shah. "Query-focused extractive video summarization." European Conference on Computer Vision. Springer, Cham, 2016.

Sharghi, Aidean, Jacob S. Laurel, and Boqing Gong. "Query-focused video summarization: Dataset, evaluation, and a memory network based approach." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

Plummer, Bryan A., Matthew Brown, and Svetlana Lazebnik. "Enhancing video summarization via vision-language embedding." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

Huang, Jia-Hong, and Marcel Worring. "Query-controllable video summarization." Proceedings of the 2020 International Conference on Multimedia Retrieval. 2020.

Ajmal, Muhammad, et al. "Video summarization: techniques and classification." International Conference on Computer Vision and Graphics. Springer, Berlin, Heidelberg, 2012.

Xiao, Shuwen, et al. "Query-biased self-attentive network for query-focused video summarization." IEEE Transactions on Image Processing 29 (2020): 5889-5899.

Vasudevan, Arun Balajee, et al. "Query-adaptive video summarization via quality-aware relevance estimation." Proceedings of the 25th ACM international conference on Multimedia. 2017.

Lee, Yong Jae, Joydeep Ghosh, and Kristen Grauman. "Discovering important people and objects for egocentric video summarization." 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 1346-1353.

Gygli, Michael, Helmut Grabner, and Luc Van Gool. "Video summarization by learning submodular mixtures of objectives." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

Li, Sheng, et al. "Visual to text: Survey of image and video captioning." IEEE Transactions on Emerging Topics in Computational Intelligence 3.4 (2019): 297-312.

Zhang, Yujia, et al. "Query-conditioned three-player adversarial network for video summarization." arXiv preprint arXiv:1807.06677 (2018).

Ahmed, Sekh Arif, et al. "Query-based video synopsis for intelligent traffic monitoring applications." IEEE Transactions on Intelligent Transportation Systems 21.8 (2019): 3457-3468.

Ji, Zhong, et al. "Query-aware sparse coding for multi-video summarization." arXiv preprint arXiv:1707.04021 (2017).

Oosterhuis, Harrie, Sujith Ravi, and Michael Bendersky. "Semantic video trailers." arXiv preprint arXiv:1609.01819 (2016).

Sreenu, G., and MA Saleem Durai. "Intelligent video surveillance: a review through deep learning techniques for crowd analysis." Journal of Big Data 6.1 (2019): 1-27.

Mithun, Niluthpol Chowdhury, Sujoy Paul, and Amit K. Roy-Chowdhury. "Weakly supervised video moment retrieval from text queries." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

Del Molino, Ana Garcia, et al. "Summarization of egocentric videos: A comprehensive survey." IEEE Transactions on Human-Machine Systems 47.1 (2016): 65-76.

Baskurt, Kemal Batuhan, and Refik Samet. "Video synopsis: A survey." Computer Vision and Image Understanding 181 (2019): 26-38.

Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013.

Sebastian, Tinumol, and Jiby J. Puthiyidam. "A survey on video summarization techniques." Int. J. Comput. Appl 132.13 (2015): 30-32.

He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

Frome, Andrea, Greg Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. "DeViSE: A deep visual-semantic embedding model." (2013).

Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).

Xiao, Shuwen, et al. "Convolutional hierarchical attention network for query-focused video summarization." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 07. 2020.

Sharghi, Aidean, et al. "Improving sequential determinantal point processes for supervised video summarization." Proceedings of the European Conference on Computer Vision (ECCV). 2018.

Zhang, Ke, et al. "Video summarization with long short-term memory." European conference on computer vision. Springer, Cham, 2016.

Gong, Boqing, et al. "Diverse sequential subset selection for supervised video summarization." Advances in neural information processing systems 27 (2014): 2069-2077.

Zhang, Ke, et al. "Summary transfer: Exemplar-based subset selection for video summarization." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

Fu, Tsu-Jui, Shao-Heng Tai, and Hwann-Tzong Chen. "Attentive and adversarial learning for video summarization." 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019.

Zhang, Yujia, et al. "DTR-GAN: Dilated temporal relational adversarial network for video summarization." Proceedings of the ACM Turing Celebration Conference-China. 2019.

Fajtl, Jiri, Hajar Sadeghi Sokeh, Vasileios Argyriou, Dorothy Monekosso, and Paolo Remagnino. "Summarizing videos with attention." In Asian Conference on Computer Vision, pp. 39-54. Springer, Cham, 2018.

Chu, Wen-Sheng, Yale Song, and Alejandro Jaimes. "Video co-summarization: Video summarization by visual co-occurrence." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

De Avila, Sandra Eliza Fontes, Ana Paula Brandao Lopes, Antonio da Luz Jr, and Arnaldo de Albuquerque Araújo. "VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method." Pattern Recognition Letters 32, no. 1 (2011): 56-68.

Ngo, Chong-Wah, Yu-Fei Ma, and Hong-Jiang Zhang. "Automatic video summarization by graph modeling." Proceedings Ninth IEEE International Conference on Computer Vision. IEEE, 2003.

Panda, Rameswar, and Amit K. Roy-Chowdhury. "Collaborative summarization of topic-related videos." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

Zhu, Xiatian, Chen Change Loy, and Shaogang Gong. "Video synopsis by heterogeneous multi-source correlation." Proceedings of the IEEE International Conference on Computer Vision. 2013.

Zhou, Kaiyang, Yu Qiao, and Tao Xiang. "Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. No. 1. 2018.

Rochan, Mrigank, and Yang Wang. "Video summarization by learning from unpaired data." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "GloVe: Global vectors for word representation." Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.

Huang, Gao, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. "Densely connected convolutional networks." 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 2261-2269. doi: 10.1109/CVPR.2017.243.

Visual-Semantic Alignment Across Domains Using a Semi-Supervised Approach

Published

26.03.2024

How to Cite

Sheetal Girase. (2024). Comparative Analysis of Various Textual-Visual Models for Self-Attentive Query Focused Video Summarization. International Journal of Intelligent Systems and Applications in Engineering, 12(21s), 2002–2011. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/5770

Section

Research Article