Deep Learning for Next-Gen Audio Enhancement Systems

Authors

  • G Naga Jyothi, Paidimalla Naga Raju, Raghu Kalyana, D.N.V.S. Vijaya Lakshmi

Keywords

deep learning, connectionist temporal classification, automatic speech recognition, music information retrieval, source separation, audio enhancement, environmental sounds

Abstract

This article reviews recent advances in deep learning for audio signal processing. Speech, music, and environmental sound processing are examined side by side to highlight the similarities and differences among these domains, emphasizing shared methods, challenges, key references, and opportunities for cross-fertilization between fields. The review covers the principal feature representations, notably log-mel spectra and raw waveforms, together with the deep learning models applied to them, including convolutional neural networks, variants of the long short-term memory architecture, and audio-specific neural network models. Major application areas are then addressed: audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization, and tracking) and synthesis and transformation (source separation, audio enhancement, and generative models for speech, sound, and music synthesis). Finally, key issues and open questions regarding the use of deep learning in audio signal processing are outlined.
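The log-mel spectrogram named in the abstract is the most common input representation across the surveyed domains, and its computation is concrete enough to illustrate. The following minimal sketch (not taken from the article) derives such features with the librosa library; all parameter values (16 kHz sample rate, 1024-point FFT, 64 mel bands) and the input file name are illustrative assumptions, not the authors' settings.

```python
# Minimal sketch of log-mel feature extraction (illustrative, not the
# authors' pipeline). Parameters are assumptions typical for speech at
# 16 kHz; music and environmental-sound work often uses other settings.
import numpy as np
import librosa

def log_mel_spectrogram(path, sr=16000, n_fft=1024, hop_length=256, n_mels=64):
    """Return a log-scaled mel spectrogram of shape (n_mels, frames)."""
    y, _ = librosa.load(path, sr=sr, mono=True)         # decode and resample
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )                                                    # power mel spectrogram
    return librosa.power_to_db(mel, ref=np.max)          # log compression (dB)

features = log_mel_spectrogram("example.wav")            # hypothetical input file
print(features.shape)                                    # (64, number_of_frames)
```

A matrix of this shape is what convolutional front ends in the reviewed models typically consume, whereas raw-waveform models skip this step and learn their filters directly from the samples.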

Published

30.10.2024

How to Cite

G Naga Jyothi. (2024). Deep Learning for Next-Gen Audio Enhancement Systems. International Journal of Intelligent Systems and Applications in Engineering, 12(4), 5626–. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/7487

Section

Research Article