Improving the Identification of Hate Speech in Arabic Social Media Content Using Emojis Translation
Keywords:
Hate speech; Offensive; Arabic Text pre-processing; Emojis; Deep Learning; Bi-LSTM
Abstract
Hate speech on the internet threatens the well-being and safety of people who use online platforms, so social networks need reliable methods to detect it and maintain a constructive atmosphere. Extracting information from Arabic text posted on social networking platforms, however, poses considerable challenges. This paper presents an approach that uses artificial intelligence techniques to detect hate speech in Arabic-language content shared on social media. A supervised deep learning model is built on the Bi-LSTM (Bidirectional Long Short-Term Memory) architecture and combined with Arabic text pre-processing techniques to improve overall performance. The model was trained and evaluated on a compilation of four public Arabic hate speech datasets sourced from various social media platforms. The empirical results show that the proposed model achieves an accuracy of 98.4%. The model also generalizes well, identifying hate speech in Arabic text from several sources with varying degrees of complexity. Moreover, the study provides empirical evidence that pre-processing emojis, rather than removing them, improves the effectiveness of deep learning models in detecting hate speech in Arabic text on social media.
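The contrast between translating emojis and removing them can be illustrated with a minimal Python sketch. This is not the authors' implementation: the mapping `EMOJI_TO_ARABIC`, the pattern `EMOJI_PATTERN`, and both function names are hypothetical, and the abstract does not specify the emoji lexicon that was actually used.

```python
import re

# Hypothetical emoji-to-Arabic lexicon (an assumption for illustration only;
# the paper's actual translation table is not given in the abstract).
EMOJI_TO_ARABIC = {
    "\U0001F621": "غاضب",  # pouting face -> "angry"
    "\U0001F602": "يضحك",  # face with tears of joy -> "laughing"
    "\U0001F44E": "سيء",   # thumbs down -> "bad"
}

# Rough, non-exhaustive pattern covering common emoji code point ranges.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")


def translate_emojis(text):
    """Replace each known emoji with an Arabic word so its signal is kept."""
    for emoji, word in EMOJI_TO_ARABIC.items():
        text = text.replace(emoji, f" {word} ")
    # Collapse the extra whitespace introduced by the replacements.
    return " ".join(text.split())


def remove_emojis(text):
    """Baseline alternative: strip emojis, discarding whatever they conveyed."""
    return " ".join(EMOJI_PATTERN.sub(" ", text).split())


if __name__ == "__main__":
    post = "انت \U0001F621"
    print(translate_emojis(post))
    print(remove_emojis(post))
```

With translation, the emoji becomes an ordinary token that a downstream tokenizer and embedding layer can process; with removal, the same post loses that token entirely.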
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Under this license, users must give appropriate credit, provide a link to the license, and indicate if changes were made; if they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.