Spectrogram Enhanced SimCLRV2 Emotional Representation Strategy for Kid’s Speech using Multisource Transfer Learning in CNN
Keywords:
Tri-cut and Tri-mix augmentation, speech emotion recognition (SER), transfer learning, contrastive loss, contrastive learning, SimCLRV2, Incomplete Multi-source Transfer Learning

Abstract
We develop a model that learns voice-image (spectrogram) representations and outperforms prior techniques. Using an unlabeled dataset, we build a semi-supervised contrastive representation learning strategy that selects and compares anchor, positive, and negative (APN) features while using only 1% of real-time children's speech as labeled training data. Most children's emotion detection models are evaluated on adult emotion datasets, which may not yield accurate results for children. To overcome the scarcity of labeled children's speech, we use a Spectrogram-Enhanced SimCLRv2 model, which can be trained on the minimal children's data available and still classify and predict children's emotions effectively. Its objective is to maximize agreement between differently augmented views of the same input in latent space using a contrastive loss, and introducing a learnable nonlinear transformation between the learned emotional representations and the contrastive loss greatly improves the quality of the acquired representations. Multi-source transfer learning strengthens the network's capacity for accurate classification: it identifies the information missing in each source and transfers it to the target data to complete it. To the best of our knowledge, this is the first work to combine Incomplete Multi-source Transfer Learning (IMTL) with SimCLR, improving target labels while compensating for scarce data. We pretrain the network without labels via a self-supervised algorithm on the RAVDESS song and IEMOCAP datasets, use real-time children's speech for the teacher network, and employ labeled RAVDESS speech for supervised training. By incorporating these findings and applying SimCLR contrastive self-supervised learning to the unlabeled Zenodo children's recording dataset, which serves as the student network, we dramatically outperform previous semi-supervised methods in children's emotion recognition rate while training the network with only 1% real-time children's data.