Analysis of Voice to Predict the Physical Attributes of an Individual by Using the Face Reconstruction Approach
Keywords: Face Recognition, Face Reconstruction, Voice Analysis, Feature Extraction, Voice Encoder

Abstract
We hear people's voices on the radio, over the telephone, and in many other settings, and many features can be extracted from a voice alone: factors such as age, gender, and ethnicity all shape how a person sounds. How much can we infer about someone from the way they speak? In this work, we propose a neural network architecture that extracts attributes such as age, gender, and ethnicity from a person's speech. Different vocal characteristics carry different information: the pitch of the voice is indicative of gender, while accent, speaking rate, and pronunciation carry information about ethnicity. The main goal of this work is to determine how much information can be extracted from a person's speech on Indian-accent datasets. We use a voice encoder and face decoder model: the voice encoder captures the vocal features, and the face decoder uses these features to generate a face. The overall neural architecture is inspired by Generative Adversarial Networks (GANs).
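As a minimal illustration of the kind of cue the abstract mentions (pitch as an indicator of gender), the following sketch estimates a signal's fundamental frequency by autocorrelation and applies a rough frequency threshold. This is a hand-crafted toy heuristic for intuition only, not the paper's learned voice encoder; the function names, the synthetic 120 Hz signal, and the 165 Hz threshold are illustrative assumptions.

```python
import numpy as np

def estimate_pitch(signal, sr, fmin=50.0, fmax=400.0):
    """Estimate fundamental frequency (Hz) via autocorrelation.

    A toy stand-in for the learned features a voice encoder would
    extract; searches for the strongest autocorrelation peak in the
    lag range corresponding to [fmin, fmax].
    """
    sig = signal - signal.mean()
    # Keep only non-negative lags of the full autocorrelation.
    corr = np.correlate(sig, sig, mode="full")[len(sig) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(corr[lo:hi]))
    return sr / lag

def guess_gender(pitch_hz, threshold=165.0):
    # Rough heuristic: typical adult male F0 is ~85-180 Hz,
    # typical adult female F0 is ~165-255 Hz.
    return "female" if pitch_hz >= threshold else "male"

sr = 16000
t = np.arange(sr // 2) / sr            # 0.5 s of samples
voiced = np.sin(2 * np.pi * 120.0 * t)  # synthetic 120 Hz "voice"
pitch = estimate_pitch(voiced, sr)
print(f"estimated pitch: {pitch:.1f} Hz -> {guess_gender(pitch)}")
```

In the actual model, such hand-designed cues are replaced by features learned end-to-end by the voice encoder, which the face decoder then consumes to synthesize a face image.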
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.