Part of Speech and Morph Category Prediction for Gujarati

Authors

  • Jatayu Baxi, Department of Computer Engineering, Dharmsinh Desai University, Nadiad (Gujarat)
  • Om Soni, Department of Computer Engineering, Dharmsinh Desai University, Nadiad (Gujarat)
  • Brijesh Bhatt, Department of Computer Engineering, Dharmsinh Desai University, Nadiad (Gujarat)

Keywords:

NLP, Transformer, BERT, Gujarati, Deep Learning

Abstract

This paper presents a novel approach for predicting Part of Speech (POS) categories and morphological features for the Gujarati language. POS tagging and morphological analysis are foundational tasks in almost all Natural Language Processing (NLP) applications. For low-resource, morphologically rich languages like Gujarati, these tasks are more challenging. In this work, we explore transformer-based pre-trained models for both tasks. We propose four different models for the prediction of POS categories and morphological features. In addition to predicting POS tags and morphological features individually, this work explores the linguistic relationship between them and proposes a single joint model for predicting POS-MORPH features. The joint model achieves an F1 score of 0.98 and outperforms the individual models.
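One way to picture the joint POS-MORPH formulation is to fuse each token's POS tag and morphological feature into a single label and score predictions on that fused label. The minimal Python sketch below illustrates this idea only; it is not the authors' implementation. The helper names `joint_labels` and `macro_f1`, the `|` separator, the toy tags, and the choice of macro averaging are all assumptions, since the abstract does not specify them.

```python
from collections import Counter


def joint_labels(pos_tags, morph_tags, sep="|"):
    """Fuse per-token POS and morph labels into single joint labels (assumed scheme)."""
    if len(pos_tags) != len(morph_tags):
        raise ValueError("POS and morph sequences must have equal length")
    return [f"{p}{sep}{m}" for p, m in zip(pos_tags, morph_tags)]


def macro_f1(gold, pred):
    """Macro-averaged F1 over every label seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    labels = set(gold) | set(pred)
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy annotations, not taken from the paper's dataset.
gold_pos = ["NOUN", "VERB", "NOUN", "ADJ"]
gold_morph = ["Gender=Masc", "Tense=Past", "Gender=Fem", "Number=Sing"]
pred_pos = ["NOUN", "VERB", "NOUN", "NOUN"]
pred_morph = ["Gender=Masc", "Tense=Past", "Gender=Fem", "Number=Sing"]

gold_joint = joint_labels(gold_pos, gold_morph)
pred_joint = joint_labels(pred_pos, pred_morph)
score = macro_f1(gold_joint, pred_joint)
```

Under this assumed scheme, a token counts as correct only when both its POS tag and its morphological feature are right, so a single POS error (the last token above) lowers the joint score even though the morphological prediction matches.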

Published

24.03.2024

How to Cite

Baxi, J., Soni, O., & Bhatt, B. (2024). Part of Speech and Morph Category Prediction for Gujarati. International Journal of Intelligent Systems and Applications in Engineering, 12(3), 586–599. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/5290

Section

Research Article