Fine-tuning ASR Model Performance on Indian Regional Accents for Accurate Chemical Term Prediction in Audio

Authors

  • Sonali Kothari, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
  • Shwetambari Chiwhane, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
  • Rithwik Satya, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
  • Md. Asad Ansari, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
  • Shreeja Mehta, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
  • Pranav Naranatt, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
  • M. Karthikeyan, NCL-CSIR, Baner, Pune

Keywords:

Automatic Speech Recognition, ASR models, Chemical term identification in Indian regional accents, Deep Speech, Fine-tuning ASR, Wav2Vec, Whisper, Performance evaluation

Abstract

Automatic Speech Recognition (ASR) models have attracted considerable attention for their ability to produce highly accurate transcriptions of human speech, and they remain an active area of research and development. This study compares three state-of-the-art ASR models: Deep Speech, Wav2Vec, and Whisper. Their performance is evaluated on a dataset of audio recordings containing chemical terms spoken in various Indian regional accents. The aim is to identify the model that most accurately transcribes chemical terms spoken in Indian regional accents and to fine-tune it further for more reliable prediction.
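
The abstract does not say which metric is used to compare the models. A common choice for ASR evaluation is word error rate (WER), so the following is only a minimal sketch under that assumption, not the authors' evaluation code. The function names, the placeholder model names, and the example transcripts are all hypothetical and are not results from the study.

```python
from typing import Dict, List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def rank_models(reference: str, hypotheses: Dict[str, str]) -> List[Tuple[str, float]]:
    """Return (model name, WER) pairs sorted from lowest to highest WER."""
    scores = [(name, word_error_rate(reference, text)) for name, text in hypotheses.items()]
    return sorted(scores, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical reference and model outputs, for illustration only.
    reference = "dissolve potassium permanganate in water"
    hypotheses = {
        "model_a": "dissolve potassium permanganate in water",
        "model_b": "dissolve potassium per manganate in water",
        "model_c": "dissolve potassium in water",
    }
    for name, wer in rank_models(reference, hypotheses):
        print(f"{name}: WER = {wer:.2f}")
```

Under this assumption, the model with the lowest WER on the chemical-term recordings would be the candidate for further fine-tuning. Averaging the score over many recordings per accent, rather than using a single sentence, would give a more reliable comparison.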

Published

21.09.2023

How to Cite

Kothari, S., Chiwhane, S., Satya, R., Ansari, M. A., Mehta, S., Naranatt, P., & Karthikeyan, M. (2023). Fine-tuning ASR Model Performance on Indian Regional Accents for Accurate Chemical Term Prediction in Audio. International Journal of Intelligent Systems and Applications in Engineering, 11(4), 485–494. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/3583

Section

Research Article