Exploring Marathi-English Code-Mixing: Comprehensive Analysis of NLP Applications (QA and NER)
Keywords:
Code-Mixed, Language Model, Marathi BERT, Natural Language Processing, Named Entity Recognition, Question Answering

Abstract
Code-mixing, the linguistic practice of blending elements from multiple languages within a single utterance, is a common phenomenon that reflects the linguistic and cultural context of speakers. This research investigates Marathi-English code-mixing, focusing on natural language processing (NLP) applications such as question answering (QA) and named entity recognition (NER). A Marathi-English code-mixed QA system is proposed that can comprehend and answer questions spanning both languages. The system is evaluated on real and synthetic code-mixed QA datasets with promising results: the MuRIL model achieves exact match (EM) scores of 0.41 and 0.62 on the real and synthetic datasets, respectively. For code-mixed NER, the MahaRoBERTa model, fine-tuned on a code-mixed NER dataset, achieves an F1 score of 73.92, outperforming the other evaluated models at labeling named entities in code-mixed text. This research advances code-mixed language processing by addressing challenges that arise in multilingual communication contexts.
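The exact match (EM) scores reported for the QA system are, by convention, computed SQuAD-style: a prediction counts only if it equals a gold answer after normalization. The sketch below illustrates that metric; the normalization details (lowercasing, dropping punctuation and English articles) are an assumption, since the abstract does not state the exact evaluation protocol. For Marathi-English code-mixed text, the article-stripping step only affects the English tokens.

```python
import re
import string


def normalize_answer(s: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation,
    English articles, and extra whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized prediction equals the normalized gold answer."""
    return int(normalize_answer(prediction) == normalize_answer(gold))


def em_score(predictions, golds):
    """Mean exact match over a dataset, as a fraction in [0, 1]."""
    return sum(exact_match(p, g) for p, g in zip(predictions, golds)) / len(golds)
```

Under this metric, an EM of 0.41 means 41% of the system's answers matched a gold answer exactly after normalization, which is why EM is a stricter measure than token-overlap F1.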
References
S. Singh, M. Anand Kumar, and K. P. Soman, “CEN@Amrita: Information retrieval on CodeMixed Hindi English tweets using vector space models,” in CEUR Workshop Proceedings, 2016.
D. S. Sharma et al., “Improving Document Ranking using Query Expansion and Classification Techniques for Mixed Script Information Retrieval,” 2016.
K. R. Chandu, M. Chinnakotla, A. W. Black, and M. Shrivastava, “WebShodh: A code mixed factoid question answering system for web,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2017. doi: 10.1007/978-3-319-65813-1_9.
D. Gupta, P. Lenka, A. Ekbal, and P. Bhattacharyya, “Uncovering code-mixed challenges: A framework for linguistically driven question generation and neural based question answering,” in CoNLL 2018 - 22nd Conference on Computational Natural Language Learning, Proceedings, 2018. doi: 10.18653/v1/k18-1012.
D. Gupta, A. Ekbal, and P. Bhattacharyya, “A deep neural network framework for English Hindi question answering,” ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 19, no. 2, 2019, doi: 10.1145/3359988.
S. Thara, E. Sampath, B. Venkata Sitarami Reddy, M. Vidhya Sai Bhagavan, and M. Phanindra Reddy, “Code mixed question answering Challenge using deep learning methods,” in Proceedings of the 5th International Conference on Communication and Electronics Systems, ICCES 2020, 2020. doi: 10.1109/ICCES48766.2020.09137971.
S. Dowlagar and R. Mamidi, “CMNEROne at SemEval-2022 Task 11: Code-Mixed Named Entity Recognition by leveraging multilingual data,” in SemEval 2022 - 16th International Workshop on Semantic Evaluation, Proceedings of the Workshop, 2022. doi: 10.18653/v1/2022.semeval-1.214.
K. Singh, I. Sen, and P. Kumaraguru, “Language identification and named entity recognition in hinglish code mixed tweets,” in ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop, 2018. doi: 10.18653/v1/p18-3008.
A. El Mekki, A. El Mahdaouy, M. Akallouch, I. Berrada, and A. Khoumsi, “UM6P-CS at SemEval-2022 Task 11: Enhancing Multilingual and Code-Mixed Complex Named Entity Recognition via Pseudo Labels using Multilingual Transformer,” in SemEval 2022 - 16th International Workshop on Semantic Evaluation, Proceedings of the Workshop, 2022. doi: 10.18653/v1/2022.semeval-1.207.
V. K. Srirangam, A. A. Reddy, V. Singh, and M. Shrivastava, “Corpus Creation and Analysis for Named Entity Recognition in Telugu-English Code-Mixed Social Media Data,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, F. Alva-Manchego, E. Choi, and D. Khashabi, Eds., Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 183–189. doi: 10.18653/v1/P19-2025.
C. Sabty, A. Sherif, M. Elmahdy, and S. Abdennadher, “Techniques for Named Entity Recognition on Arabic-English Code-Mixed Data,” International Journal of Robotic Computing, 2019, doi: 10.35708/tai1868-126245.
R. Priyadharshini, B. R. Chakravarthi, M. Vegupatti, and J. P. McCrae, “Named Entity Recognition for Code-Mixed Indian Corpus using Meta Embedding,” in 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), 2020, pp. 68–72. doi: 10.1109/ICACCS48705.2020.9074379.
Y. Madhani et al., “Aksharantar: Towards building open transliteration tools for the next billion users,” ArXiv, vol. abs/2205.03018, 2022.
D. Amin et al., “Marathi-English Code-mixed Text Generation,” ArXiv, vol. abs/2309.16202, 2023.
D. Amin, S. Govilkar, and S. Kulkarni, “Question answering using deep learning in low resource Indian language Marathi,” ArXiv, vol. abs/2309.15779, 2023.
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional Transformers for language understanding,” ArXiv, vol. abs/1810.04805, 2018.
S. Khanuja et al., “MuRIL: Multilingual representations for Indian languages,” ArXiv, vol. abs/2103.10730, 2021.
T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating text generation with BERT,” ArXiv, vol. abs/1904.09675, 2019.
P. Patil, A. Ranade, M. Sabane, O. Litake, and R. Joshi, “L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT models,” ArXiv, vol. abs/2204.06029, 2022.
Y. Liu et al., “RoBERTa: A robustly optimized BERT pretraining approach,” ArXiv, vol. abs/1907.11692, 2019.
T. Pires, E. Schlinger, and D. Garrette, “How Multilingual is Multilingual BERT?,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, A. Korhonen, D. Traum, and L. Màrquez, Eds., Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 4996–5001. doi: 10.18653/v1/P19-1493.
A. Conneau et al., “Unsupervised cross-lingual representation learning at scale,” ArXiv, vol. abs/1911.02116, 2019.
R. Joshi, “L3Cube-MahaCorpus and MahaBERT: Marathi monolingual corpus, Marathi BERT language models, and resources,” arXiv preprint, Feb. 2022.
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.