Deep Learning for Next-Gen Audio Enhancement Systems
Keywords:
deep learning, connectionist temporal memory, automatic speech recognition, music information retrieval, source separation, audio enhancement, environmental sounds
Abstract
This article reviews the latest advancements in deep learning approaches for audio signal processing. Speech, music, and ambient sound processing are examined concurrently to elucidate similarities and contrasts within these domains, emphasizing common methodologies, challenges, significant references, and opportunities for cross-fertilization between fields. The study covers the principal feature representations, including log-mel spectra and raw waveforms, together with the deep learning models applied to them, including convolutional neural networks, variants of the long short-term memory architecture, and specialized audio neural network models. It then addresses the main application domains of deep learning, including audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization, and tracking) and synthesis and transformation (source separation, audio enhancement, and generative models for speech, sound, and music synthesis). Finally, key open issues and future research directions for deep learning in audio signal processing are outlined.
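As an illustration of the log-mel feature representation named in the abstract, the following is a minimal sketch (not taken from the article) that computes a log-mel spectrogram with the librosa library; the file name, sampling rate, and frame parameters are illustrative assumptions rather than values prescribed by the review.

# Minimal sketch: computing a log-mel spectrogram as a typical input feature
# for deep learning audio models. File name and parameters are assumptions.
import numpy as np
import librosa

# Load a mono waveform, resampled to 16 kHz (a common rate for speech tasks).
y, sr = librosa.load("example.wav", sr=16000)

# Short-time Fourier transform followed by a 64-band mel filter bank.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=64
)

# Convert power to decibels; the resulting (n_mels x frames) log-mel matrix
# is what CNN or LSTM models typically consume, often after normalization.
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)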
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
All papers should be submitted electronically. All submitted manuscripts must be original work that is not under submission at another journal or under consideration for publication in another form, such as a monograph or chapter of a book. Authors of submitted papers are obligated not to submit their paper for publication elsewhere until an editorial decision is rendered on their submission. Further, authors of accepted papers are prohibited from publishing the results in other publications that appear before the paper is published in the Journal unless they receive approval for doing so from the Editor-In-Chief.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. This license lets readers share, remix, transform, and build upon the material, provided they give appropriate credit, provide a link to the license, and indicate if changes were made; any contributions based on the material must be distributed under the same license as the original.


