Evaluating Arabic Lexicon Structure with Machine Learning Techniques


  • Aya Mohammed Abdul-Samad, Department of Computer Information Systems, Computer Science and Information Technology, University of Basrah, Basrah, Iraq
  • Salma A. Mahmood, Department of Computer Information Systems, Computer Science and Information Technology, University of Basrah, Basrah, Iraq


Automatic Arabic lexicon generation, web scraping, web information extraction, ML


The rapid growth of digital text has produced a large volume of Arabic-language data and, with it, many electronic Arabic lexicons. Although abundant, these lexicons often lack the structured format required for effective use in Natural Language Processing (NLP) applications. This study addresses that gap by presenting a methodology for extracting, structuring, and storing lexicon data so that it is suitable for NLP tools and technologies. Using web scraping techniques, the study harvested lexical data from various online sources and transformed it into well-organized Excel files. The resulting corpus comprises nouns (10,000 words), verbs (10,000 words), letters (70 words), adverbs (500 words), and pronouns (20 words), laying the groundwork for a comprehensive Arabic lexicon. The study then used several machine learning models to evaluate the structure of the lexicon. The Support Vector Machine (SVM) and Random Forest models both reached an accuracy of 0.85, which indicates that the data were structured to a high standard. Logistic Regression and Multinomial Naive Bayes showed lower precision and recall but maintained moderate accuracy, which suggests room for further refinement.
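The extraction-and-structuring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the HTML fragment, the `entry` class, the `data-pos` attribute, and the helper names `EntryParser`, `extract_entries`, and `to_csv` are all hypothetical, and the output is CSV text rather than the Excel files used in the study, so that the example needs only the Python standard library.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment; real lexicon sites will use different markup.
SAMPLE_HTML = """
<ul>
  <li class="entry" data-pos="noun">كتاب</li>
  <li class="entry" data-pos="verb">كتب</li>
  <li class="entry" data-pos="pronoun">هو</li>
</ul>
"""


class EntryParser(HTMLParser):
    """Collect (word, part_of_speech) pairs from <li class="entry"> tags."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._pos = None  # part of speech of the entry currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "entry":
            self._pos = attrs.get("data-pos", "unknown")

    def handle_data(self, data):
        word = data.strip()
        # Keep text only while inside an entry; skip whitespace between tags.
        if self._pos is not None and word:
            self.rows.append((word, self._pos))

    def handle_endtag(self, tag):
        if tag == "li":
            self._pos = None


def extract_entries(html):
    """Parse an HTML string and return a list of (word, pos) tuples."""
    parser = EntryParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_csv(rows):
    """Serialize rows as CSV text with a header line (assumed column names)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["word", "pos"])
    writer.writerows(rows)
    return buffer.getvalue()


entries = extract_entries(SAMPLE_HTML)
csv_text = to_csv(entries)
print(csv_text)
```

A table of this shape, one word per row with its part-of-speech label, is the kind of structured input that classifiers such as those named in the abstract could then be trained and scored on.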








How to Cite

Abdul-Samad, A. M., & Mahmood, S. A. (2024). Evaluating Arabic Lexicon Structure with Machine Learning Techniques. International Journal of Intelligent Systems and Applications in Engineering, 12(11s), 595–604. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/4480



Research Article