Multi-Model Analysis on Author Attribution Detection on Assamese Text

Authors

  • Smriti Priya Medhi, Shikhar Kr. Sarma

Keywords:

Assamese, Author Attributes, Automatic Authorship Detection, Low Resource NLP

Abstract

Author attribution detection is a crucial task in forensic linguistics and computational stylometry; it aims to identify the author of a given text from its linguistic features. This study applies multi-model analysis to author attribution detection in Assamese text, an area that is less explored than for other languages. The proposed approach is the first attempt of its kind for the Assamese language and integrates multiple traditional machine learning models, such as Support Vector Machines (SVM) and Multinomial Naïve Bayes (MNB). These models are trained on a diverse collection of Assamese texts written by different individual authors; a structured and sizable dataset was created as part of this work. Key linguistic features, including word n-grams, character n-grams, and part-of-speech tags, are extracted from the text to represent each author's writing style. These features are then used as inputs to the multi-model framework, which combines the predictions of the individual models to make a final author attribution decision. Experimental results demonstrate the effectiveness of the proposed multi-model approach for author attribution detection on Assamese text. The study contributes to Assamese Natural Language Processing by adding a novel work on authorship detection for this low-resource and underrepresented language, and it highlights the importance of using multiple models for improved performance in computational stylometric analysis.
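The pipeline outlined in the abstract (n-gram features feeding several models whose predictions are combined) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how predictions are combined, so soft voting (averaging per-model probabilities) is an assumption here, as are the choice of two Multinomial Naïve Bayes models, all function and class names, and the toy training strings, which are placeholders rather than real Assamese data. SVM and part-of-speech features are left out to keep the example dependency-free.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n=1):
    """Return word n-grams (joined by a space) from whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class TinyMultinomialNB:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self, extract):
        self.extract = extract  # feature extractor: text -> list of features

    def fit(self, texts, authors):
        self.counts = defaultdict(Counter)  # per-author feature counts
        self.docs = Counter(authors)        # per-author document counts
        for text, author in zip(texts, authors):
            self.counts[author].update(self.extract(text))
        self.vocab = {f for c in self.counts.values() for f in c}
        return self

    def log_scores(self, text):
        """Return the unnormalised log posterior for every known author."""
        feats = [f for f in self.extract(text) if f in self.vocab]
        total_docs = sum(self.docs.values())
        scores = {}
        for author, counter in self.counts.items():
            denom = sum(counter.values()) + len(self.vocab)
            score = math.log(self.docs[author] / total_docs)
            for f in feats:
                score += math.log((counter[f] + 1) / denom)
            scores[author] = score
        return scores


def normalise(scores):
    """Turn log scores into probabilities that sum to 1 (softmax)."""
    top = max(scores.values())
    exp = {a: math.exp(s - top) for a, s in scores.items()}
    total = sum(exp.values())
    return {a: v / total for a, v in exp.items()}


def predict_author(models, text):
    """Soft voting (an assumption): average model probabilities, take the best."""
    combined = Counter()
    for model in models:
        for author, p in normalise(model.log_scores(text)).items():
            combined[author] += p / len(models)
    return max(combined, key=combined.get)


# Placeholder training data: synthetic strings, not real Assamese text.
texts = ["ka ka kha ka", "kha ka ka ka", "ga gha ga ga", "gha gha ga"]
authors = ["author_a", "author_a", "author_b", "author_b"]

# One model per feature family: character trigrams and word unigrams.
models = [
    TinyMultinomialNB(char_ngrams).fit(texts, authors),
    TinyMultinomialNB(word_ngrams).fit(texts, authors),
]

print(predict_author(models, "ka kha ka"))  # prints "author_a"
```

Averaging probabilities rather than counting hard votes lets a confident model outweigh an uncertain one, which matters when only two or three models are combined; a majority vote would be an equally plausible reading of the abstract.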

Published

16.06.2024

How to Cite

Smriti Priya Medhi. (2024). Multi-Model Analysis on Author Attribution Detection on Assamese Text. International Journal of Intelligent Systems and Applications in Engineering, 12(4), 255–266. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/6209

Section

Research Article