Audio-Visual Speech Reconstruction Using Hybrid Deep Learning with Conditional Random Fields and Intelligent Chasing Optimization
Keywords:
Lip-to-Speech Synchronization, Hybrid Deep Learning, Conditional Random Field (CRF), Intelligent Chasing Optimization (ICO), Structural Similarity Index (SSIM), Audio-Visual (AV) Features, Speech Reconstruction

Abstract
Lip-to-speech (LTS) synchronization is an essential tool for creating lifelike facial animations, with applications in virtual reality, education, training, and other domains. However, existing approaches still struggle to produce high-fidelity facial animation, particularly in the presence of lip jitter and unstable facial motion across continuous frame sequences. To improve the capacity of LTS models to reconstruct speech accurately from visual data, this study develops a Hybrid Deep Learning model coupled with Conditional Random Field-based Intelligent Chasing Optimization (HDL-CRF-ICO). In the preprocessing stage, the model selects 100 frames at random and identifies keyframes using the Structural Similarity Index (SSIM): similarity scores are computed between frames, and only frames that satisfy the selection criteria are retained for further processing. The model then fuses audio-visual (AV) features, combining visual information from lip movements with audio inputs to improve speech recognition. The ICO algorithm accelerates convergence by steering the search toward the global optimum and lowers the error value, enabling the model to produce precise results. On the Grid Audio-Visual Speech Corpus dataset, the proposed model achieved Bilingual Evaluation Understudy (BLEU) of 0.48, Metric for Evaluation of Translation with Explicit Ordering (METEOR) of 0.30, Recall-Oriented Understudy for Gisting Evaluation (ROUGE) of 0.53, and Semantic Propositional Image Caption Evaluation (SPICE) of 24.9 under K-fold evaluation, and BLEU of 0.49, METEOR of 0.31, ROUGE of 0.54, and SPICE of 25.7 under the training-percentage split.
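The abstract does not state the exact keyframe-selection criterion, so the following is only a minimal sketch of SSIM-based keyframe selection: it assumes comparison of each sampled frame against the most recent keyframe and a fixed threshold, both of which are illustrative assumptions rather than the authors' stated method.

```python
# Minimal sketch: SSIM-based keyframe selection from randomly sampled frames.
# Assumptions (not from the paper): comparison against the previous keyframe,
# a fixed 0.75 threshold, and grayscale frames resized to 128x128.
import random

import cv2
from skimage.metrics import structural_similarity as ssim


def select_keyframes(video_path, n_samples=100, threshold=0.75, seed=0):
    """Randomly sample n_samples frames, then keep a frame as a keyframe
    when its SSIM to the previous keyframe falls below the threshold,
    i.e., the visual content has changed enough to be informative."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    random.seed(seed)
    indices = sorted(random.sample(range(total), min(n_samples, total)))

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frames.append(cv2.resize(gray, (128, 128)))
    cap.release()

    keyframes = [frames[0]]
    for frame in frames[1:]:
        score = ssim(keyframes[-1], frame)  # similarity score in [-1, 1]
        if score < threshold:               # dissimilar enough -> keep it
            keyframes.append(frame)
    return keyframes
```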
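Likewise, the abstract does not specify which audio and visual features are fused or how. The sketch below illustrates one common scheme, concatenating per-frame visual lip embeddings with time-aligned MFCC audio features; the feature choices and the simple concatenation are assumptions for illustration only.

```python
# Minimal sketch of audio-visual (AV) feature fusion: per-frame visual lip
# embeddings concatenated with time-aligned audio features. MFCCs and plain
# concatenation are illustrative assumptions, not the paper's stated design.
import numpy as np
import librosa


def fuse_av_features(wav_path, visual_feats, sr=16000, n_mfcc=13):
    """visual_feats: (T, Dv) array of per-video-frame lip-region embeddings."""
    audio, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc).T  # (Ta, n_mfcc)

    # Align the audio feature sequence to the video frame rate by
    # nearest-neighbor resampling, then concatenate along the feature axis.
    t = len(visual_feats)
    idx = np.linspace(0, len(mfcc) - 1, t).round().astype(int)
    return np.concatenate([visual_feats, mfcc[idx]], axis=1)  # (T, Dv+n_mfcc)
```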
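The update equations of Intelligent Chasing Optimization are not given in the abstract. The following is therefore a generic population-based "chasing" metaheuristic in the same spirit (candidates move toward the best solution found so far, with random perturbation); it is a hypothetical stand-in, not the authors' ICO algorithm.

```python
# Hypothetical sketch of a chasing-style metaheuristic. This is NOT the
# paper's ICO algorithm, whose update rules the abstract does not describe.
import numpy as np


def chase_optimize(loss_fn, dim, pop_size=30, iters=200, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-1.0, 1.0, size=(pop_size, dim))  # candidate solutions
    fitness = np.array([loss_fn(p) for p in pop])
    best_idx = fitness.argmin()
    best, best_fit = pop[best_idx].copy(), fitness[best_idx]

    for _ in range(iters):
        # Each candidate "chases" the best solution found so far, with a
        # small Gaussian perturbation to keep exploring the search space.
        pop += step * rng.random((pop_size, 1)) * (best - pop)
        pop += 0.1 * rng.normal(size=pop.shape)
        fitness = np.array([loss_fn(p) for p in pop])
        if fitness.min() < best_fit:
            best_idx = fitness.argmin()
            best, best_fit = pop[best_idx].copy(), fitness[best_idx]
    return best


# Example: minimize a simple quadratic; the optimum is the zero vector.
print(chase_optimize(lambda x: float(np.sum(x**2)), dim=5))
```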
References
Niu, Z. and Mak, B., "On the Audio-visual Synchronization for Lip-to-Speech Synthesis", In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.7843-7852, 2023.
Yang, Q., Bai, Y., Liu, F. and Zhang, W., "Integrated visual transformer and flash attention for lip-to-speech generation GAN", Scientific Reports, vol.14, no.1, pp.4525, 2024.
Kim, M., Hong, J. and Ro, Y.M., "Lip-to-speech synthesis in the wild with multi-task learning", In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.1-5, June 2023.
Park, S.J., Kim, M., Choi, J. and Ro, Y.M., "Exploring Phonetic Context-Aware Lip-Sync for Talking Face Generation", In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.4325-4329, April 2024.
Wang, J., Qian, X., Zhang, M., Tan, R.T. and Li, H., "Seeing what you said: Talking face generation guided by a lip reading expert", In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.14653-14662, 2023.
Liu, L., Wang, J., Chen, S. and Li, Z., "VividWav2Lip: High-Fidelity Facial Animation Generation Based on Speech-Driven Lip Synchronization", Electronics, vol.13, no.18, pp.3657, 2024.
Sheng, Z.Y., Ai, Y. and Ling, Z.H., "Zero-shot personalized lip-to-speech synthesis with face image based voice control", In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.1-5, June 2023.
Hong, J., Kim, M., Choi, J. and Ro, Y.M., "Watch or listen: Robust audio-visual speech recognition with visual corruption modeling and reliability scoring", In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.18783-18794, 2023.
Wang, J., Pan, Z., Zhang, M., Tan, R.T. and Li, H., "Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition", In Proceedings of the AAAI Conference on Artificial Intelligence, vol.38, no.17, pp.19144-19152, March 2024.
Liang, B., Pan, Y., Guo, Z., Zhou, H., Hong, Z., Han, X., Han, J., Liu, J., Ding, E. and Wang, J., "Expressive talking head generation with granular audio-visual control", In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.3387-3396, 2022.
Zhou, H., Sun, Y., Wu, W., Loy, C.C., Wang, X. and Liu, Z., "Pose-controllable talking face generation by implicitly modularized audio-visual representation", In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.4176-4186, 2021.
Mukhopadhyay, S., Suri, S., Gadde, R.T., and Shrivastava, A., "Diff2lip: Audio conditioned diffusion models for lip-synchronization", In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.5292-5302, 2024.
He, Y., Seng, K.P. and Ang, L.M., "Multimodal Sensor-Input Architecture with Deep Learning for Audio-Visual Speech Recognition in Wild", Sensors, vol.23, no.4, pp.1834, 2023.
Lenglet, M., Perrotin, O. and Bailly, G., "FastLips: an End-to-End Audiovisual Text-to-Speech System with Lip Features Prediction for Virtual Avatars", In Proceedings of Interspeech, ISCA, pp.3450-3454, September 2024.
Lu, J., Sisman, B., Liu, R., Zhang, M. and Li, H., "Visualtts: Tts with accurate lip-speech synchronization for automatic voice over", In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.8032-8036, May 2022.
Passos, L.A., Papa, J.P., Del Ser, J., Hussain, A. and Adeel, A., "Multimodal audio-visual information fusion using canonical-correlated graph neural network for energy-efficient speech enhancement", Information Fusion, vol.90, pp.1-11, 2023.
Li, J., Li, C., Wu, Y. and Qian, Y., "Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.32, pp.1941-1953, 2024.
He, Y., Seng, K.P. and Ang, L.M., "Generative adversarial networks (GANs) for audio-visual speech recognition in artificial intelligence IoT", Information, vol.14, no.10, pp.575, 2023.
Kalshetty, R. and Parveen, A., "Abnormal event detection model using an improved ResNet101 in context aware surveillance system", Cognitive Computation and Systems, vol.5, no.2, pp.153-167, 2023.
Lashkov, I., Kashevnik, A., Shilov, N., Parfenov, V. and Shabaev, A., "Driver dangerous state detection based on OpenCV & dlib libraries using mobile video processing", In IEEE International Conference on Computational Science and Engineering and IEEE International Conference on Embedded and Ubiquitous Computing, pp.74-79, August 2019.
Chen, L., Yao, X., Tan, C., He, W., Su, J., Weng, F., Chew, Y., Ng, N.P.H. and Moon, S.K., "In-situ crack and keyhole pore detection in laser directed energy deposition through acoustic signal and deep learning", Additive Manufacturing, vol.69, pp.103547, 2023.
Spandana, S., Madhura, B., Sandhya, A., Manish, A., and Kumar, K.P., "A Hybrid CNN-BILSTM Model for Continuous Sign Language Recognition Using Iterative Training", International Journal of Engineering Science and Advanced Technology, vol.23, no.05, 2023.
Murugaiyan, S. and Uyyala, S.R., "Aspect-based sentiment analysis of customer speech data using deep convolutional neural network and bilstm", Cognitive Computation, vol.15, no.3, pp.914-931, 2023.
Peng, X., Cao, H., Prasad, R. and Natarajan, P., "Text extraction from video using conditional random fields", International Conference on Document Analysis and Recognition, IEEE, pp.1029-1033, September 2011.
Shehab, M., Mashal, I., Momani, Z., Shambour, M.K.Y., AL-Badareen, A., Al-Dabet, S., Bataina, N., Alsoud, A.R. and Abualigah, L., "Harris hawks optimization algorithm: variants and applications", Archives of Computational Methods in Engineering, vol.29, no.7, pp.5579-5603, 2022.
Abdollahzadeh, B., Gharehchopogh, F.S. and Mirjalili, S., "African vultures optimization algorithm: A new nature-inspired metaheuristic algorithm for global optimization problems", Computers & Industrial Engineering, vol.158, pp.107408, 2021.
The Grid Audio-Visual Speech Corpus Dataset, https://zenodo.org/records/3625687, accessed January 2025.
Ganesan, P., Jagatheesaperumal, S.K., Gaftandzhieva, S. and Doneva, R., "Novel Cognitive Assisted Adaptive Frame Selection for Continuous Sign Language Recognition in Videos Using ConvLSTM", International Journal of Advanced Computer Science & Applications, vol.15, no.7, 2024.