Audio-Visual Speech Reconstruction Using Hybrid Deep Learning with Conditional Random Fields and Intelligent Chasing Optimization

Authors

  • Aditya N. Magdum, S. B. Patil

Keywords:

Lip-to-Speech Synchronization, Hybrid Deep Learning, Conditional Random Field (CRF), Intelligent Chasing Optimization (ICO), Structural Similarity Index (SSIM), Audio-Visual (AV) Features, Speech Reconstruction

Abstract

Lip-to-speech (LTS) synchronization is an essential tool for creating lifelike facial animations, with applications in virtual reality, education, training, and other domains. However, existing approaches still struggle to produce high-fidelity facial animations, particularly in the presence of lip jitter and unstable facial motion across continuous frame sequences. To improve the capacity of LTS models to reconstruct speech accurately from visual data, this study develops a Hybrid Deep Learning model coupled with Conditional Random Field-based Intelligent Chasing Optimization (HDL-CRF-ICO). In the preprocessing stage, the model randomly selects 100 frames and identifies keyframes using the Structural Similarity Index (SSIM): similarity scores are computed between frames, and frames whose scores satisfy the selection criteria are retained for further processing. The model then exploits audio-visual (AV) features, which improve speech recognition by combining visual information from lip movements with the audio input. By steering the search toward the global optimum, the ICO algorithm accelerates convergence and lowers the error value, enabling the model to produce precise results. On the GRID Audio-Visual Speech Corpus dataset, the proposed model achieved a Bilingual Evaluation Understudy (BLEU) score of 0.48, a Metric for Evaluation of Translation with Explicit Ordering (METEOR) score of 0.30, a Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score of 0.53, and a Semantic Propositional Image Caption Evaluation (SPICE) score of 24.9 under K-fold evaluation, as well as BLEU of 0.49, METEOR of 0.31, ROUGE of 0.54, and SPICE of 25.7 under training-percentage evaluation.
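
To make the preprocessing step concrete, the sketch below illustrates SSIM-based keyframe selection of the kind the abstract describes: 100 frames are sampled at random, and a frame is retained when its similarity to the last kept frame drops below a threshold. The grayscale input, the consecutive-comparison strategy, and the threshold of 0.9 are illustrative assumptions, not the paper's exact selection criteria.

```python
import random

import numpy as np
from skimage.metrics import structural_similarity as ssim

def select_keyframes(frames, sample_size=100, threshold=0.9, seed=0):
    """Randomly sample frames in temporal order, then keep a frame as a
    keyframe when its SSIM against the previously kept frame falls below
    the threshold (i.e., the content has changed enough to be informative)."""
    rng = random.Random(seed)
    idx = sorted(rng.sample(range(len(frames)), min(sample_size, len(frames))))
    sampled = [frames[i] for i in idx]
    keyframes = [sampled[0]]
    for frame in sampled[1:]:
        score = ssim(keyframes[-1], frame)  # similarity score in [-1, 1]
        if score < threshold:               # dissimilar enough -> keep it
            keyframes.append(frame)
    return keyframes

# Usage with synthetic grayscale frames (H x W uint8 arrays):
frames = [np.random.randint(0, 255, (64, 64), dtype=np.uint8) for _ in range(300)]
keys = select_keyframes(frames, sample_size=100, threshold=0.9)
print(f"kept {len(keys)} of 100 sampled frames")
```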
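The audio-visual feature combination can likewise be pictured with a minimal PyTorch sketch: per-frame lip-movement embeddings are concatenated with frame-aligned audio features, and the fused sequence is modeled by a BiLSTM. All dimensions, the concatenation strategy, and the mel-spectrogram output head are assumptions for illustration; the paper's full hybrid architecture and CRF layer are not reproduced here.

```python
import torch
import torch.nn as nn

class AVFusion(nn.Module):
    """Fuse per-frame visual lip features with aligned audio features
    by concatenation, then model the sequence with a BiLSTM.
    Dimensions below are illustrative, not the paper's."""
    def __init__(self, visual_dim=128, audio_dim=40, hidden=256, out_dim=80):
        super().__init__()
        self.rnn = nn.LSTM(visual_dim + audio_dim, hidden,
                           batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, out_dim)  # e.g., mel-spectrogram bins

    def forward(self, visual, audio):
        # visual: (batch, time, visual_dim); audio: (batch, time, audio_dim)
        fused = torch.cat([visual, audio], dim=-1)
        out, _ = self.rnn(fused)
        return self.head(out)

model = AVFusion()
v = torch.randn(2, 100, 128)   # lip embeddings for 100 keyframes
a = torch.randn(2, 100, 40)    # e.g., 40-dim MFCC features per frame
print(model(v, a).shape)       # torch.Size([2, 100, 80])
```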
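Finally, the reported scores use standard text-generation metrics. A small example of computing sentence-level BLEU with NLTK on a GRID-style command sentence follows; the sentence pair is hypothetical, and METEOR, ROUGE, and SPICE require additional packages not shown here.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# GRID-style command sentences (hypothetical reference/hypothesis pair)
reference = "bin blue at f two now".split()
hypothesis = "bin blue at f too now".split()

# Smoothing avoids zero scores when higher-order n-grams do not match
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], hypothesis, smoothing_function=smooth)
print(f"sentence BLEU: {score:.2f}")
```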


References

Niu, Z. and Mak, B., "On the Audio-visual Synchronization for Lip-to-Speech Synthesis", In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.7843-7852, 2023.

Yang, Q., Bai, Y., Liu, F. and Zhang, W., "Integrated visual transformer and flash attention for lip-to-speech generation GAN", Scientific Reports, vol.14, no.1, pp.4525, 2024.

Kim, M., Hong, J. and Ro, Y.M., "Lip-to-speech synthesis in the wild with multi-task learning", In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.1-5, June 2023.

Park, S.J., Kim, M., Choi, J. and Ro, Y.M., "Exploring Phonetic Context-Aware Lip-Sync for Talking Face Generation", In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.4325-4329, April 2024.

Wang, J., Qian, X., Zhang, M., Tan, R.T. and Li, H., "Seeing what you said: Talking face generation guided by a lip reading expert", In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.14653-14662, 2023.

Liu, L., Wang, J., Chen, S., and Li, Z., "VividWav2Lip: High-Fidelity Facial Animation Generation Based on Speech-Driven Lip Synchronization", Electronics, vol.13, no.18, pp.3657, 2024.

Sheng, Z.Y., Ai, Y. and Ling, Z.H., "Zero-shot personalized lip-to-speech synthesis with face image based voice control", In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.1-5, June 2023.

Hong, J., Kim, M., Choi, J. and Ro, Y.M., "Watch or listen: Robust audio-visual speech recognition with visual corruption modeling and reliability scoring", In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.18783-18794, 2023.

Wang, J., Pan, Z., Zhang, M., Tan, R.T. and Li, H., "Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition", In Proceedings of the AAAI Conference on Artificial Intelligence, vol.38, no.17, pp.19144-19152, March 2024.

Liang, B., Pan, Y., Guo, Z., Zhou, H., Hong, Z., Han, X., Han, J., Liu, J., Ding, E. and Wang, J., "Expressive talking head generation with granular audio-visual control", In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.3387-3396, 2022.

Zhou, H., Sun, Y., Wu, W., Loy, C.C., Wang, X. and Liu, Z., "Pose-controllable talking face generation by implicitly modularized audio-visual representation", In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.4176-4186, 2021.

Mukhopadhyay, S., Suri, S., Gadde, R.T., and Shrivastava, A., "Diff2lip: Audio conditioned diffusion models for lip-synchronization", In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.5292-5302, 2024.

He, Y., Seng, K.P. and Ang, L.M., "Multimodal Sensor-Input Architecture with Deep Learning for Audio-Visual Speech Recognition in Wild", Sensors, vol.23, no.4, pp.1834, 2023.

Lenglet, M., Perrotin, O. and Bailly, G., "FastLips: an End-to-End Audiovisual Text-to-Speech System with Lip Features Prediction for Virtual Avatars", In Interspeech (ISCA), pp.3450-3454, September 2024.

Lu, J., Sisman, B., Liu, R., Zhang, M. and Li, H., "Visualtts: Tts with accurate lip-speech synchronization for automatic voice over", In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.8032-8036, May 2022.

Passos, L.A., Papa, J.P., Del Ser, J., Hussain, A. and Adeel, A., "Multimodal audio-visual information fusion using canonical-correlated graph neural network for energy-efficient speech enhancement", Information Fusion, vol.90, pp.1-11, 2023.

Li, J., Li, C., Wu, Y. and Qian, Y., "Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.32, pp.1941-1953, 2024.

He, Y., Seng, K.P. and Ang, L.M., "Generative adversarial networks (GANs) for audio-visual speech recognition in artificial intelligence IoT", Information, vol.14, no.10, pp.575, 2023.

Kalshetty, R. and Parveen, A., "Abnormal event detection model using an improved ResNet101 in context aware surveillance system", Cognitive Computation and Systems, vol.5, no.2, pp.153-167, 2023.

Lashkov, I., Kashevnik, A., Shilov, N., Parfenov, V. and Shabaev, A., "Driver dangerous state detection based on OpenCV & dlib libraries using mobile video processing", In IEEE International Conference on Computational Science and Engineering and IEEE International Conference on Embedded and Ubiquitous Computing, pp.74-79, August 2019.

Chen, L., Yao, X., Tan, C., He, W., Su, J., Weng, F., Chew, Y., Ng, N.P.H. and Moon, S.K., "In-situ crack and keyhole pore detection in laser directed energy deposition through acoustic signal and deep learning", Additive Manufacturing, vol.69, pp.103547, 2023.

Spandana, S., Madhura, B., Sandhya, A., Manish, A., and Kumar, K.P., "A Hybrid CNN-BILSTM Model for Continuous Sign Language Recognition Using Iterative Training", International Journal of Engineering Science and Advanced Technology, vol.23, no.05, 2023.

Murugaiyan, S. and Uyyala, S.R., "Aspect-based sentiment analysis of customer speech data using deep convolutional neural network and bilstm", Cognitive Computation, vol.15, no.3, pp.914-931, 2023.

Peng, X., Cao, H., Prasad, R. and Natarajan, P., "Text extraction from video using conditional random fields", International Conference on Document Analysis and Recognition, IEEE, pp.1029-1033, September 2011.

Shehab, M., Mashal, I., Momani, Z., Shambour, M.K.Y., AL-Badareen, A., Al-Dabet, S., Bataina, N., Alsoud, A.R. and Abualigah, L., "Harris hawks optimization algorithm: variants and applications", Archives of Computational Methods in Engineering, vol.29, no.7, pp.5579-5603, 2022.

Abdollahzadeh, B., Gharehchopogh, F.S. and Mirjalili, S., "African vultures optimization algorithm: A new nature-inspired metaheuristic algorithm for global optimization problems", Computers & Industrial Engineering, vol.158, pp.107408, 2021.

The GRID Audio-Visual Speech Corpus dataset, https://zenodo.org/records/3625687, accessed January 2025.

Ganesan, P., Jagatheesaperumal, S.K., Gaftandzhieva, S. and Doneva, R., "Novel Cognitive Assisted Adaptive Frame Selection for Continuous Sign Language Recognition in Videos Using ConvLSTM", International Journal of Advanced Computer Science & Applications, vol.15, no.7, 2024.

Published

30.11.2023

How to Cite

Aditya N. Magdum. (2023). Audio-Visual Speech Reconstruction Using Hybrid Deep Learning with Conditional Random Fields and Intelligent Chasing Optimization. International Journal of Intelligent Systems and Applications in Engineering, 12(2), 834 –. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/7646

Section

Research Article