Spectrogram Enhanced SimCLRV2 Emotional Representation Strategy for Kid’s Speech using Multisource Transfer Learning in CNN

Authors

  • Preethi V., V. Elizabeth Jesi

Keywords

Tri-cut and Tri-mix augmentation, speech emotion recognition (SER), transfer learning, contrastive loss, contrastive learning, SimCLRV2, incomplete multi-source transfer learning (IMTL)

Abstract

We develop a model that learns voice-image (spectrogram) representations and outperforms prior techniques. Using unlabeled data, we build a semi-supervised contrastive representation learning strategy that selects and compares anchor, positive, and negative (APN) features while using only 1% of real-time children's speech as labeled training data. Most children's emotion detection models are evaluated on adult emotion datasets, which may not yield accurate results for children's speech. To overcome the scarcity of labeled children's data, we use a Spectrogram Enhanced SimCLRV2 model, which can be trained on the minimal children's data available and still classify and predict children's emotions effectively. Its goal is to maximize agreement between differently augmented views of the same input in the latent space using a contrastive loss. The quality of the acquired emotional representations is greatly enhanced by introducing a learnable nonlinear transformation between the learned representations and the contrastive loss. Incomplete multi-source transfer learning (IMTL) develops the network's capacity for accurate classification by identifying the information missing from each source and transferring it to the target data to complete it. To the best of our knowledge, this is the first work to combine IMTL with SimCLR to improve target labels while overcoming data scarcity. To train the network without labels in a self-supervised manner, we use the RAVDESS song and IEMOCAP datasets; real-time children's speech serves as the teacher network, and labeled RAVDESS speech is used for supervised training. By incorporating these findings and applying SimCLR contrastive self-supervised learning to the unlabeled Zenodo children's recordings dataset, which serves as the student network, we dramatically outperform previous semi-supervised methods in children's emotion recognition rate while training the network with only 1% of real-time children's data.
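The paper's own code is not reproduced here, but the two mechanisms the abstract describes are concrete enough to sketch. The following minimal PyTorch sketch (an assumption on our part; layer sizes, head depth, and temperature are illustrative rather than the authors' settings) shows (a) a learnable nonlinear projection head inserted between the encoder's emotional representation and the contrastive loss, and (b) the NT-Xent contrastive loss that maximizes agreement between two differently augmented spectrogram views of the same utterance:

```python
# Minimal sketch of the two SimCLRv2 ingredients described above.
# This is NOT the authors' code; dimensions, head depth, and the
# temperature value are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Learnable nonlinear transformation between the encoder's emotional
    representation h and the space where the contrastive loss is applied
    (SimCLRv2 uses a deeper MLP head than the original SimCLR)."""
    def __init__(self, in_dim=2048, hidden_dim=2048, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, h):
        return self.net(h)

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss over a batch of N spectrogram pairs: for each augmented
    view (anchor), the other view of the same utterance is the positive and
    the remaining 2N-2 samples in the batch act as negatives."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D) unit vectors
    sim = z @ z.t() / temperature                        # scaled cosine similarity
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))           # exclude self-similarity
    # The positive of sample i sits at i+N (first half) or i-N (second half).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Usage with any CNN backbone that maps a spectrogram batch to (N, 2048)
# features: project two augmented views of the batch and minimize the loss.
# h1, h2 = encoder(aug1(spec)), encoder(aug2(spec))
# loss = nt_xent_loss(head(h1), head(h2))
```

In SimCLRv2-style training, the encoder and this head are first pre-trained on the unlabeled data; the head is then partly discarded and the encoder is fine-tuned on the small labeled fraction (here, the 1% of real-time children's speech).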


References

Chen, Ting, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. "A simple framework for contrastive learning of visual representations," arXiv preprint arXiv:2002.05709 (2020).

Chen, Xinlei, et al. "Improved baselines with momentum contrastive learning," arXiv preprint arXiv:2003.04297 (2020).

Poole, Ben, et al. "On variational bounds of mutual information," International Conference on Machine Learning. PMLR, 2019.

Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805 (2018).

Bachman, Philip, R. Devon Hjelm, and William Buchwalter. "Learning representations by maximizing mutual information across views," arXiv preprint arXiv:1906.00910 (2019).

Oord, Aaron van den, Yazhe Li, and Oriol Vinyals. "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748 (2018).

Stolar, Melissa N., et al. "Real time speech emotion recognition using RGB image classification and transfer learning," 2017 11th International Conference on Signal Processing and Communication Systems (ICSPCS). IEEE, 2017.

Lech, Margaret, et al. "Real-time speech emotion recognition using a pre-trained image classification network: Effects of bandwidth reduction and companding," Frontiers in Computer Science 2 (2020): 14.

Padi, Sarala, et al. "Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation," arXiv preprint arXiv:2108.02510 (2021).

Cheuk, Kin Wai, et al. "nnAudio: An on-the-fly GPU audio to spectrogram conversion toolbox using 1D convolutional neural networks," IEEE Access 8 (2020): 161981-162003.

Cheuk, Kin Wai, Kat Agres, and Dorien Herremans. "nnAudio: A PyTorch audio processing tool using 1D convolutional neural networks," ISMIR Late-Breaking Demo (2019).

Zhang, Hua, et al. "Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition," Frontiers in Physiology 12 (2021).

Falcon, William, and Kyunghyun Cho. "A framework for contrastive self-supervised learning and designing a new approach," arXiv preprint arXiv:2009.00104 (2020).

Tian, Yonglong, Dilip Krishnan, and Phillip Isola. "Contrastive multiview coding," Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI. Springer International Publishing, 2020.

Wu, Zhirong, et al. "Unsupervised feature learning via non-parametric instance discrimination," Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

Ye, Mang, et al. "Unsupervised embedding learning via invariant and spreading instance feature," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

Guan, Qing, Yunjun Wang, Bo Ping, Duanshu Li, Jiajun Du, Yu Qin, Hongtao Lu, Xiaochun Wan, and Jun Xiang. "Deep convolutional neural network VGG-16 model for differential diagnosing of papillary thyroid carcinomas in cytological images: a pilot study," Journal of Cancer 10, no. 20 (2019): 4876.

Harzallah, Hedi, Frédéric Jurie, and Cordelia Schmid. "Combining efficient object localization and image classification," 2009 IEEE 12th International Conference on Computer Vision, IEEE, 2009, pp. 237-244.

Shenoy, Anirudh. "Pseudo-labeling to deal with small datasets," Towards Data Science, 2019.

Tzirakis, Panagiotis, Jiehao Zhang, and Björn W. Schuller. "End-to-end speech emotion recognition using deep neural networks," In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5089-5093. IEEE, 2018.

Chen, Ting, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E. Hinton. "Big self-supervised models are strong semi-supervised learners," Advances in neural information processing systems 33 (2020): 22243-22255.

Schneider, Steffen, et al. "wav2vec: Unsupervised pre-training for speech recognition," arXiv preprint arXiv:1904.05862 (2019).

Ding, Zhengming, Ming Shao, and Yun Fu. "Incomplete multisource transfer learning," IEEE Transactions on Neural Networks and Learning Systems 29.2 (2016): 310-323.

Sajjad, Muhammad, and Soonil Kwon. "Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM," IEEE Access 8 (2020): 79861-79875.

Alzubaidi, Laith, et al. "Novel transfer learning approach for medical imaging with limited labeled data," Cancers 13, no. 7 (2020): 1590.

Dataset: Google Dataset Search, https://datasetsearch.research.google.com/

Busso, Carlos, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. "IEMOCAP: interactive emotional dyadic motion capture database," Language Resources and Evaluation 42 (2008).

Shin, Sungho, Jongwon Kim, Yeonguk Yu, Seongju Lee, and Kyoobin Lee. "Self-supervised transfer learning from natural images for sound classification," Applied Sciences 11, no. 7 (2021): 3043.

Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang et al. "Imagenet large scale visual recognition challenge," International journal of computer vision 115 (2015): 211-252.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition," In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778. 2016.

Palanisamy, Kamalesh, Dipika Singhania, and Angela Yao. "Rethinking CNN models for audio classification," arXiv preprint arXiv:2007.11154 (2020).

Fernando, Basura, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. "Unsupervised visual domain adaptation using subspace alignment," Proceedings of the IEEE International Conference on Computer Vision, December 2013, pp. 2960-2967.

Gat, Itai, Hagai Aronowitz, Weizhong Zhu, Edmilson Morais, and Ron Hoory. "Speaker normalization for self-supervised speech emotion recognition," In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7342-7346. IEEE, 2022.

Karita, Shigeki, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki et al. "A comparative study on transformer vs rnn in speech applications." In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 449-456. IEEE, 2019.

Jiang, Dongwei, Wubo Li, Miao Cao, Wei Zou, and Xiangang Li. "Speech SimCLR: Combining contrastive and reconstruction objective for self-supervised speech representation learning," arXiv preprint arXiv:2010.13991 (2020).

Béres, András. "Semi-supervised image classification using contrastive pretraining with SimCLR," Keras.io, https://keras.io/examples/vision/semisupervised_simclr/

Dataset: https://www.researchgate.net/post/Anyone-know-of-a-free-download-of-an-emotional-speech-database/5e62f7d1f8ea52d5cd35f0fc/citation/download

Preethi, V., & Jesi, V. E. (2024). Triangular Region Cut-Mix Augmentation Algorithm based Speech Emotion Recognition system with Transfer Learning Approach. IEEE Access.

Published

12.06.2024

How to Cite

Preethi V. (2024). Spectrogram Enhanced SimCLRV2 Emotional Representation Strategy for Kid’s Speech using Multisource Transfer Learning in CNN. International Journal of Intelligent Systems and Applications in Engineering, 12(4), 3930–3939. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/6952

Section

Research Article