Evaluating Arabic Lexicon Structure with Machine Learning Techniques
Keywords:
Automatic Arabic lexicon generation, web scraping, web information extraction, ML
Abstract
Advances in technology have driven a rapid increase in digital textual data, including text in the Arabic language. This growth has spurred the proliferation of electronic Arabic lexicons, which, although abundant, often lack the structured format required for effective use in Natural Language Processing (NLP) applications. This study addresses that gap by presenting a methodology for extracting, structuring, and storing lexicon data so that it is suitable for NLP tools and technologies. Using web scraping techniques, the study harvested lexical data from various online sources and transformed it into well-organized Excel files. The resulting corpus comprises nouns (10,000 words), verbs (10,000 words), letters (70 words), adverbs (500 words), and pronouns (20 words), laying the groundwork for a comprehensive Arabic lexicon. The study then used several machine learning models to evaluate the structuring of the lexicon. The Support Vector Machine (SVM) and Random Forest models both achieved an accuracy of 0.85, underscoring the quality of the data structuring process. Logistic Regression and Multinomial Naive Bayes showed lower precision and recall but maintained moderate accuracy, indicating room for further refinement.
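The abstract reports accuracy alongside precision and recall, and with word classes as unevenly sized as 10,000 nouns against 20 pronouns these metrics can diverge. The following is a minimal sketch of how that happens, not the study's evaluation code: the function name `evaluate`, the toy labels, and the choice of macro averaging are all assumptions made for illustration, since the abstract does not state how the metrics were averaged.

```python
from collections import Counter


def evaluate(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall) for two label lists.

    Hypothetical helper for illustration; macro averaging is an assumption.
    """
    labels = sorted(set(y_true) | set(y_pred))
    true_count = Counter(y_true)   # how often each class really occurs
    pred_count = Counter(y_pred)   # how often each class is predicted
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(correct.values()) / len(y_true)
    # A class that is never predicted (or never occurs) contributes 0.0.
    precision = [correct[c] / pred_count[c] if pred_count[c] else 0.0 for c in labels]
    recall = [correct[c] / true_count[c] if true_count[c] else 0.0 for c in labels]
    return accuracy, sum(precision) / len(labels), sum(recall) / len(labels)


# Made-up, imbalanced part-of-speech labels (not the study's data).
y_true = ["noun"] * 8 + ["verb"] * 8 + ["adverb"] * 2 + ["pronoun"] * 2
# A classifier that only ever predicts the two large classes.
y_pred = ["noun"] * 8 + ["verb"] * 8 + ["noun"] * 2 + ["verb"] * 2

accuracy, macro_precision, macro_recall = evaluate(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={macro_precision:.2f} recall={macro_recall:.2f}")
```

In this toy case the classifier never predicts the two small classes, so accuracy stays at 0.80 while macro precision and recall fall to 0.40 and 0.50. This is one way a model can keep moderate accuracy despite weaker precision and recall.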
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Under this license, readers must give appropriate credit, provide a link to the license, and indicate if changes were made; if they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.