Deep Learning for Next-Gen Audio Enhancement Systems
Keywords:
deep learning, connectionist temporal memory, automatic speech recognition, music information retrieval, source separation, audio enhancement, environmental sounds
Abstract
This article reviews the latest advancements in deep learning approaches for audio signal processing. Speech, music, and ambient sound processing are examined concurrently to elucidate similarities and contrasts within these domains, emphasizing common methodologies, challenges, significant references, and opportunities for cross-fertilization between fields. The study covers the principal feature representations, including log-mel spectra and raw waveforms, together with the deep learning models applied to them, including convolutional neural networks, variants of the long short-term memory architecture, and specialized audio neural network models. It then addresses the main application domains of deep learning, including audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization, and tracking) and synthesis and transformation (source separation, audio enhancement, and generative models for speech, sound, and music synthesis). Finally, key open issues and future research directions for deep learning in audio signal processing are outlined.
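As an illustration of the log-mel feature representation named in the abstract, the following is a minimal sketch (not taken from the article) that computes a log-mel spectrogram with the librosa library; the file name, sampling rate, and frame parameters are illustrative assumptions rather than values prescribed by the review.

# Minimal sketch: computing a log-mel spectrogram as a typical input feature
# for deep learning audio models. File name and parameters are assumptions.
import numpy as np
import librosa

# Load a mono waveform, resampled to 16 kHz (a common rate for speech tasks).
y, sr = librosa.load("example.wav", sr=16000)

# Short-time Fourier transform followed by a 64-band mel filter bank.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=64
)

# Convert power to decibels; the resulting (n_mels x frames) log-mel matrix
# is what CNN or LSTM models typically consume, often after normalization.
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)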
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
All papers should be submitted electronically. All submitted manuscripts must be original work that is not under submission at another journal or under consideration for publication in another form, such as a monograph or chapter of a book. Authors of submitted papers are obligated not to submit their paper for publication elsewhere until an editorial decision is rendered on their submission. Further, authors of accepted papers are prohibited from publishing the results in other publications that appear before the paper is published in the Journal unless they receive approval for doing so from the Editor-In-Chief.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. This license lets readers share, remix, transform, and build upon the material, provided they give appropriate credit, provide a link to the license, and indicate if changes were made; any contributions based on the material must be distributed under the same license as the original.


