Analysis of Voice to Predict the Physical Attributes of an Individual by Using the Face Reconstruction Approach

Authors

  • Prashant S. Kolhe Department of Electronics and Telecommunication, College of Engineering, Dharashiv-413501, Maharashtra, India
  • Ranjeet Bidwe Symbiosis Institute of Technology, Pune Campus, Symbiosis International (Deemed University) (SIU), Lavale, Pune-412115, Maharashtra, India
  • Deepak Mane Vishwakarma Institute of Technology, Bibwewadi, Pune-411037, Maharashtra, India
  • Bapurao Bandgar School of Computer Studies, Sri Balaji University, Tathawade, Pune-411033, Maharashtra, India
  • Sunil Shinde Vishwakarma Institute of Technology, Bibwewadi, Pune-411037, Maharashtra, India
  • Sunil M. Sangve Vishwakarma Institute of Technology, Bibwewadi, Pune-411037, Maharashtra, India

Keywords:

Face Recognition, Face Reconstruction, Voice Analysis, Feature Extraction, Voice Encoder

Abstract

We hear people's voices on the radio, on telephones, and elsewhere. Many features can be extracted from a person's voice: factors such as age, gender, and ethnicity characterize how a person sounds. How much can we infer about someone from the way they speak? In this work, we propose a neural network architecture that extracts attributes such as age, gender, and ethnicity from a person's speech. Different vocal features carry different cues: the pitch of the voice gives information about a person's gender, while accent, speaking rate, and pronunciation give information about a person's ethnicity. The main goal of this work is to determine how much information can be extracted from a person's speech on Indian-accent datasets. We use a voice encoder and face decoder model: the voice encoder captures the vocal features, and the face decoder uses these features to generate the face. The overall neural architecture is inspired by Generative Adversarial Networks (GANs).
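As a concrete illustration of the encoder-decoder pipeline described in the abstract, the sketch below pairs a small convolutional voice encoder with a transposed-convolution face decoder, in the spirit of Speech2Face-style models. This is a minimal PyTorch sketch under assumed design choices, not the authors' implementation: the class names, layer counts, the 512-dimensional embedding, and the 64x64 output resolution are illustrative assumptions.

```python
# Minimal sketch (assumed architecture, not the paper's exact model) of a
# voice-encoder / face-decoder pipeline: the encoder maps a speech
# spectrogram to a fixed-size vocal-feature embedding, and the decoder
# maps that embedding to a face image.
import torch
import torch.nn as nn

class VoiceEncoder(nn.Module):
    """CNN mapping a (batch, 1, n_mels, frames) spectrogram to an embedding."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # pool over time and frequency
        )
        self.fc = nn.Linear(128, embed_dim)   # vocal-feature embedding

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(spec).flatten(1))

class FaceDecoder(nn.Module):
    """Transposed-conv generator mapping the embedding to a 64x64 RGB face."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, 256, 4, 1, 0), nn.ReLU(),  # -> 4x4
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),        # -> 8x8
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),         # -> 16x16
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),          # -> 32x32
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),           # -> 64x64
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Reshape (batch, embed_dim) to (batch, embed_dim, 1, 1) for deconvs.
        return self.net(z.unsqueeze(-1).unsqueeze(-1))

# Toy forward pass: one 80-mel, 300-frame spectrogram -> one 64x64 face.
spec = torch.randn(1, 1, 80, 300)
face = FaceDecoder()(VoiceEncoder()(spec))
print(face.shape)  # torch.Size([1, 3, 64, 64])
```

In a GAN-inspired setup such as the one the abstract mentions, the decoder would play the role of the generator, trained against a discriminator that judges whether a generated face is consistent with real face photographs.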


References

H. Maniyar, S. V. Budihal, and S. V. Siddamal, “Persons facial image synthesis from audio with Generative Adversarial Networks,” ECTI Transactions on Computer and Information Technology (ECTI-CIT), vol. 16, no. 2, pp. 135–141, May 2022, doi: 10.37936/ecti-cit.2022162.246995.

O. P. Roy and V. Kumar, “A Survey on Voice over Internet Protocol (VoIP) Reliability Research,” IOP Conf. Ser.: Mater. Sci. Eng., vol. 1020, no. 1, p. 012015, Jan. 2021, doi: 10.1088/1757-899X/1020/1/012015.

B. Zope, S. Mishra, K. Shaw, D. R. Vora, K. Kotecha, and R. V. Bidwe, “Question Answer System: A State-of-Art Representation of Quantitative and Qualitative Analysis,” Big Data and Cognitive Computing, vol. 6, no. 4, 2022, doi: 10.3390/bdcc6040109.

S. Shahriar, “GAN computers generate arts? A survey on visual arts, music, and literary text generation using generative adversarial network,” Displays, vol. 73, p. 102237, 2022, doi: 10.1016/j.displa.2022.102237.

R. V. Bidwe et al., “Deep Learning Approaches for Video Compression: A Bibliometric Analysis,” Big Data and Cognitive Computing, vol. 6, no. 2, p. 44, Apr. 2022, doi: 10.3390/bdcc6020044.

D. Mane, R. Bidwe, B. Zope, and N. Ranjan, “Traffic Density Classification for Multiclass Vehicles Using Customized Convolutional Neural Network for Smart City,” 2022, pp. 1015–1030. doi: 10.1007/978-981-19-2130-8_78.

D. Mane, K. Shah, R. Solapure, R. Bidwe, and S. Shah, “Image-Based Plant Seedling Classification Using Ensemble Learning,” 2023, pp. 433–447. doi: 10.1007/978-981-19-2225-1_39.

S. Bidwe, G. Kale, and R. Bidwe, “Traffic Monitoring System for Smart City Based on Traffic Density Estimation,” Indian Journal of Computer Science and Engineering, vol. 13, no. 5, pp. 1388–1400, Oct. 2022, doi: 10.21817/indjcse/2022/v13i5/221305006.

S. Fenghour, D. Chen, K. Guo, B. Li, and P. Xiao, “Deep Learning-Based Automated Lip-Reading: A Survey,” IEEE Access, vol. 9, pp. 121184–121205, 2021, doi: 10.1109/ACCESS.2021.3107946.

A. Fernandez-Lopez and F. M. Sukno, “Survey on automatic lip-reading in the era of deep learning,” Image Vis Comput, vol. 78, pp. 53–72, 2018, doi: 10.1016/j.imavis.2018.07.002.

Z. Zhou, G. Zhao, X. Hong, and M. Pietikäinen, “A review of recent advances in visual speech decoding,” Image Vis Comput, vol. 32, no. 9, pp. 590–605, 2014, doi: 10.1016/j.imavis.2014.06.004.

W. Mattheyses and W. Verhelst, “Audiovisual speech synthesis: An overview of the state-of-the-art,” Speech Commun, vol. 66, pp. 182–217, 2015, doi: 10.1016/j.specom.2014.11.001.

L. Chen, G. Cui, Z. Kou, H. Zheng, and C. Xu, “What comprises a good talking-head video generation?,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020.

C. Sheng et al., “Deep Learning for Visual Speech Analysis: A Survey,” arXiv preprint arXiv:2205.10839, 2022.

T.-H. Oh et al., “Speech2Face: Learning the Face Behind a Voice,” CoRR, vol. abs/1905.09773, 2019, [Online]. Available: http://arxiv.org/abs/1905.09773

A. Ephrat et al., “Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation,” CoRR, vol. abs/1804.03619, 2018, [Online]. Available: http://arxiv.org/abs/1804.03619

D. R. Reddy, “Speech recognition by machine: A review,” Proceedings of the IEEE, vol. 64, pp. 501–531, 1976.

R. A. Khalil, E. Jones, M. Babar, T. Jan, M. Zafar, and T. Alhussain, “Speech Emotion Recognition Using Deep Learning Techniques: A Review,” IEEE Access, vol. 7, pp. 117327–117345, 2019, doi: 10.1109/ACCESS.2019.2936124.

F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman, “Synthesizing Normalized Faces from Facial Identity Features,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 3386–3395. doi: 10.1109/CVPR.2017.361.

M. Wang and W. Deng, “Deep Face Recognition: A Survey,” CoRR, vol. abs/1804.06655, 2018, [Online]. Available: http://arxiv.org/abs/1804.06655

C. Sheng, M. Pietikäinen, Q. Tian, and L. Liu, “Cross-Modal Self-Supervised Learning for Lip Reading: When Contrastive Learning Meets Adversarial Training,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 2456–2464. doi: 10.1145/3474085.3475415.

R. Arandjelovic and A. Zisserman, “Objects that Sound,” in Proceedings of the European Conference on Computer Vision (ECCV), Sep. 2018.

P. Ma, B. Martinez, S. Petridis, and M. Pantic, “Towards Practical Lipreading with Distilled and Efficient Models,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 7608–7612. doi: 10.1109/ICASSP39728.2021.9415063.

T. Afouras, J. S. Chung, and A. Zisserman, “ASR is All You Need: Cross-Modal Distillation for Lip Reading,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 2143–2147. doi: 10.1109/ICASSP40776.2020.9054253.

H. Liu, Z. Chen, and B. Yang, “Lip Graph Assisted Audio-Visual Speech Recognition Using Bidirectional Synchronous Fusion,” in Proc. Interspeech 2020, 2020, pp. 3520–3524. doi: 10.21437/Interspeech.2020-3146.

C. Sheng, X. Zhu, H. Xu, M. Pietikäinen, and L. Liu, “Adaptive Semantic-Spatio-Temporal Graph Convolutional Network for Lip Reading,” IEEE Trans Multimedia, vol. 24, pp. 3545–3557, 2022, doi: 10.1109/TMM.2021.3102433.

A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” CoRR, vol. abs/1706.08612, 2017, [Online]. Available: http://arxiv.org/abs/1706.08612

S. Sarode, R. Thatte, K. Toshniwal, J. Warade, R. V. Bidwe, and B. Zope, “A System for Language Translation using Sequence-to-sequence Learning based Encoder,” in 2023 International Conference on Emerging Smart Computing and Informatics (ESCI), Mar. 2023, pp. 1–5.

D. Mane et al., “An Improved Transfer Learning Approach for Classification of Types of Cancer,” Traitement du Signal, vol. 39, no. 6, p. 2095, 2022.

V. Khetani, Y. Gandhi, S. Bhattacharya, S. N. Ajani, and S. Limkar, “Cross-Domain Analysis of ML and DL: Evaluating their Impact in Diverse Domains,” International Journal of Intelligent Systems and Applications in Engineering, vol. 11, no. 7s, pp. 253–262, 2023.

Published

29.01.2024

How to Cite

Kolhe, P. S., Bidwe, R., Mane, D., Bandgar, B., Shinde, S., & Sangve, S. M. (2024). Analysis of Voice to Predict the Physical Attributes of an Individual by Using the Face Reconstruction Approach. International Journal of Intelligent Systems and Applications in Engineering, 12(13s), 496–504. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/4616

Section

Research Article
