Exploring Marathi-English Code-Mixing: Comprehensive Analysis of NLP Applications (QA and NER)
Keywords:
Code-Mixed, Language Model, Marathi BERT, Natural Language Processing, Named Entity Recognition, Question Answering

Abstract
Code-mixing, the linguistic practice of blending elements from multiple languages within a single utterance, is a common phenomenon that reflects the linguistic and cultural context of speakers. This research investigates Marathi-English code-mixing, focusing on natural language processing (NLP) applications such as question answering (QA) and named entity recognition (NER). A Marathi-English code-mixed QA system is proposed that can comprehend and answer questions spanning both languages. The system is evaluated on real and synthetic code-mixed QA datasets with promising results: the MuRIL model achieves exact match (EM) scores of 0.41 and 0.62 on the real and synthetic datasets, respectively. For code-mixed NER, the MahaRoBERTa model, fine-tuned on a code-mixed NER dataset, achieves an F1 score of 73.92, outperforming the other evaluated models at labeling named entities in code-mixed text. This research advances code-mixed language processing by addressing challenges that arise in multilingual communication contexts.
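The exact match (EM) scores reported for the QA system are, by convention, computed SQuAD-style: a prediction counts only if it equals a gold answer after normalization. The sketch below illustrates that metric; the normalization details (lowercasing, dropping punctuation and English articles) are an assumption, since the abstract does not state the exact evaluation protocol. For Marathi-English code-mixed text, the article-stripping step only affects the English tokens.

```python
import re
import string


def normalize_answer(s: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation,
    English articles, and extra whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized prediction equals the normalized gold answer."""
    return int(normalize_answer(prediction) == normalize_answer(gold))


def em_score(predictions, golds):
    """Mean exact match over a dataset, as a fraction in [0, 1]."""
    return sum(exact_match(p, g) for p, g in zip(predictions, golds)) / len(golds)
```

Under this metric, an EM of 0.41 means 41% of the system's answers matched a gold answer exactly after normalization, which is why EM is a stricter measure than token-overlap F1.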
References
S. Singh, M. Anand Kumar, and K. P. Soman, “CEN@Amrita: Information retrieval on CodeMixed Hindi English tweets using vector space models,” in CEUR Workshop Proceedings, 2016.
D. S. Sharma et al., “Improving Document Ranking using Query Expansion and Classification Techniques for Mixed Script Information Retrieval,” 2016.
K. R. Chandu, M. Chinnakotla, A. W. Black, and M. Shrivastava, “WebShodh: A code mixed factoid question answering system for web,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2017. doi: 10.1007/978-3-319-65813-1_9.
D. Gupta, P. Lenka, A. Ekbal, and P. Bhattacharyya, “Uncovering code-mixed challenges: A framework for linguistically driven question generation and neural based question answering,” in CoNLL 2018 - 22nd Conference on Computational Natural Language Learning, Proceedings, 2018. doi: 10.18653/v1/k18-1012.
D. Gupta, A. Ekbal, and P. Bhattacharyya, “A deep neural network framework for English Hindi question answering,” ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 19, no. 2, 2019, doi: 10.1145/3359988.
S. Thara, E. Sampath, B. Venkata Sitarami Reddy, M. Vidhya Sai Bhagavan, and M. Phanindra Reddy, “Code mixed question answering Challenge using deep learning methods,” in Proceedings of the 5th International Conference on Communication and Electronics Systems, ICCES 2020, 2020. doi: 10.1109/ICCES48766.2020.09137971.
S. Dowlagar and R. Mamidi, “CMNEROne at SemEval-2022 Task 11: Code-Mixed Named Entity Recognition by leveraging multilingual data,” in SemEval 2022 - 16th International Workshop on Semantic Evaluation, Proceedings of the Workshop, 2022. doi: 10.18653/v1/2022.semeval-1.214.
K. Singh, I. Sen, and P. Kumaraguru, “Language identification and named entity recognition in hinglish code mixed tweets,” in ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop, 2018. doi: 10.18653/v1/p18-3008.
A. El Mekki, A. El Mahdaouy, M. Akallouch, I. Berrada, and A. Khoumsi, “UM6P-CS at SemEval-2022 Task 11: Enhancing Multilingual and Code-Mixed Complex Named Entity Recognition via Pseudo Labels using Multilingual Transformer,” in SemEval 2022 - 16th International Workshop on Semantic Evaluation, Proceedings of the Workshop, 2022. doi: 10.18653/v1/2022.semeval-1.207.
V. K. Srirangam, A. A. Reddy, V. Singh, and M. Shrivastava, “Corpus Creation and Analysis for Named Entity Recognition in Telugu-English Code-Mixed Social Media Data,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, F. Alva-Manchego, E. Choi, and D. Khashabi, Eds., Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 183–189. doi: 10.18653/v1/P19-2025.
C. Sabty, A. Sherif, M. Elmahdy, and S. Abdennadher, “Techniques for Named Entity Recognition on Arabic-English Code-Mixed Data,” International Journal of Robotic Computing, 2019, doi: 10.35708/tai1868-126245.
R. Priyadharshini, B. R. Chakravarthi, M. Vegupatti, and J. P. McCrae, “Named Entity Recognition for Code-Mixed Indian Corpus using Meta Embedding,” in 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), 2020, pp. 68–72. doi: 10.1109/ICACCS48705.2020.9074379.
Y. Madhani et al., “Aksharantar: Towards building open transliteration tools for the next billion users,” ArXiv, vol. abs/2205.03018, 2022.
D. Amin et al., “Marathi-English Code-mixed Text Generation,” ArXiv, vol. abs/2309.16202, 2023.
D. Amin, S. Govilkar, and S. Kulkarni, “Question answering using deep learning in low resource Indian language Marathi,” ArXiv, vol. abs/2309.15779, 2023.
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional Transformers for language understanding,” ArXiv, vol. abs/1810.04805, 2018.
S. Khanuja et al., “MuRIL: Multilingual representations for Indian languages,” ArXiv, vol. abs/2103.10730, 2021.
T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating text generation with BERT,” ArXiv, vol. abs/1904.09675, 2019.
P. Patil, A. Ranade, M. Sabane, O. Litake, and R. Joshi, “L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT models,” ArXiv, vol. abs/2204.06029, 2022.
Y. Liu et al., “RoBERTa: A robustly optimized BERT pretraining approach,” ArXiv, vol. abs/1907.11692, 2019.
T. Pires, E. Schlinger, and D. Garrette, “How Multilingual is Multilingual BERT?,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, A. Korhonen, D. Traum, and L. Màrquez, Eds., Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 4996–5001. doi: 10.18653/v1/P19-1493.
A. Conneau et al., “Unsupervised cross-lingual representation learning at scale,” ArXiv, vol. abs/1911.02116, 2019.
R. Joshi, “L3Cube-MahaCorpus and MahaBERT: Marathi monolingual corpus, Marathi BERT language models, and resources,” arXiv preprint, Feb. 2022.
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.