Enhanced Video Anomaly Detection Using Weakly-Supervised LSTM Framework and Comparative Analysis of I3D and ViT Feature Extraction Techniques
Keywords:
LSTM, Multiple-Instance Learning, Video Anomaly Detection, Weakly-Supervised
Abstract
Anomaly detection in video surveillance is a crucial yet challenging task, especially when anomalies differ only subtly from normal events. The problem becomes even more complex in weakly-supervised settings, where only video-level labels are available. In this study, we adopt a weakly-supervised framework that treats each video as a sequence of instances and propose a novel anomaly detection method that uses LSTM-based models to effectively capture temporal dependencies. To assess the effectiveness of the model, we further compare two feature extraction techniques: the Inflated 3D ConvNet (I3D) and the Vision Transformer (ViT). Extensive experiments conducted on an RTX 4090 GPU using the large-scale benchmark dataset UCF-Crime demonstrate that our model achieves better anomaly detection performance (AUC of 90% with I3D and 86% with ViT) than existing state-of-the-art methods. The comparative analysis of the I3D and ViT feature extractors provides insights into their applicability to different types of video anomalies.
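To make the described pipeline concrete, the sketch below shows one plausible way to implement an LSTM-based snippet scorer trained with a multiple-instance ranking objective over pre-extracted I3D or ViT features. It is a minimal illustration under stated assumptions, not the authors' exact architecture: the feature dimension (2048, typical of I3D), hidden size, dropout, hinge margin, and snippet count are assumed values.

# Minimal sketch (PyTorch) of an LSTM-based MIL anomaly scorer over pre-extracted
# snippet features. Dimensions and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class LSTMAnomalyScorer(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        # LSTM captures temporal dependencies across the T snippets of a video
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Per-snippet anomaly score in [0, 1]
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 128), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, feats):               # feats: (B, T, feat_dim)
        h, _ = self.lstm(feats)             # (B, T, hidden_dim)
        return self.head(h).squeeze(-1)     # (B, T) snippet-level scores

def mil_ranking_loss(scores_abnormal, scores_normal, margin=1.0):
    # Weak (video-level) supervision: the highest-scoring snippet of an abnormal
    # video should outrank the highest-scoring snippet of a normal video.
    max_abn = scores_abnormal.max(dim=1).values
    max_nrm = scores_normal.max(dim=1).values
    return torch.relu(margin - max_abn + max_nrm).mean()

if __name__ == "__main__":
    # Example with random I3D-like features: 32 snippets per video, 2048-D each
    model = LSTMAnomalyScorer()
    abn = model(torch.randn(4, 32, 2048))   # batch of abnormal videos
    nrm = model(torch.randn(4, 32, 2048))   # batch of normal videos
    loss = mil_ranking_loss(abn, nrm)
    loss.backward()

At inference time, the per-snippet scores produced by such a model would be compared against frame-level ground truth to compute the AUC; only the training objective is sketched here.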