Automatic Speech Recognition System for Low Resource Punjabi Language using Deep Neural Network-Hidden Markov Model (DNN-HMM)
Keywords: Children Automatic Speech Recognition, Low Resource Language, Punjabi Speech, Data Collection, Deep Neural Networks, DNN-HMM

Abstract
In recent years, speech recognition technology has advanced significantly, enabling seamless human-machine interaction. Most of these advances, however, have focused on major languages with abundant data and resources, neglecting the rich linguistic diversity of low resource languages. Speech recognition for low resource languages poses unique challenges because comprehensive linguistic resources and data are lacking. To ensure inclusivity and promote global accessibility, researchers recognize the need to bridge this gap. This article focuses on the development of a children's ASR system for the Punjabi language, along with the potential benefits of addressing this understudied field. For this purpose, speech data were collected from Punjabi-speaking children; the recorded audio was segmented using PRAAT (open-source software), and the segmented audio files were then transcribed. Feature extraction was implemented using the MFCC algorithm. Acoustic modelling was carried out with several models, namely MONO, Tri1, Tri2 and Tri3. The acoustic model was then trained with a DNN-HMM to increase the accuracy of the children's ASR system for Punjabi. The results show an accuracy of 83.9% for children's ASR in the Punjabi language. Further, comparison with existing models shows that the proposed DNN-HMM model gives better results.
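The MFCC feature-extraction step mentioned above can be sketched as the standard pipeline of pre-emphasis, framing, windowing, power spectrum, mel filterbank, log compression, and DCT. This is a minimal illustration, not the paper's implementation; the parameter values (16 kHz sampling, 25 ms frames, 10 ms hop, 26 mel bands, 13 cepstra) are common defaults and are assumptions here:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, n_fft=512, frame_len=400, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC sketch: returns an (n_frames, n_ceps) feature matrix."""
    # Pre-emphasis boosts high frequencies (0.97 is a conventional coefficient).
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Split into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank between 0 Hz and the Nyquist frequency.
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # Log mel energies, then a type-2 DCT; keep the first n_ceps coefficients.
    feat = np.log(power @ fbank.T + 1e-10)
    return dct(feat, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

With these defaults, one second of 16 kHz audio yields 98 frames of 13 coefficients each; toolkits such as Kaldi typically append delta and delta-delta features on top of these.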
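Accuracy figures such as the reported 83.9% are conventionally computed as 100 − WER, where WER is the word-level Levenshtein (edit) distance between the reference and hypothesis transcripts, normalized by the reference length. The paper does not give its scoring code, so the following is a sketch assuming this standard definition:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate (%) via Levenshtein distance over word tokens."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)

# One substitution in five words -> WER 20%, accuracy 80%.
accuracy = 100.0 - wer("a b c d e", "a b x d e")  # 80.0
```

Substitutions, deletions, and insertions are all weighted equally here, which matches the usual NIST-style WER definition.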