Soft Voting Ensemble with N-GRAM Vectorization for Accurate News Classification on Twitter Data

Nadiah Jaffreen Shaik

Authors

Nadiah Jaffreen Shaik, Tatavarthy Santhi Sri

Keywords:

Ensemble Learning, N-gram Vectorization, Social Media Analytics, News Classification, Text Classification Techniques

Abstract

With the exponential growth of social media platforms, effective tools for accurately classifying tweets according to their news content have become crucial. This work presents a comprehensive comparison of text classification techniques across several N-gram configurations and vectorization techniques. Using the CybAttT dataset, our methodology applies text preprocessing and then passes the processed data to modelling. Six machine learning models are trained on each of two vectorization techniques, TF-IDF and count vectorization. To improve on these baselines, the six models are then combined in a soft voting ensemble that leverages the strengths of each individual classifier. The performance of each model is evaluated using accuracy, precision, recall, and F1-score. The results are compared across the N-gram configurations. In the experiments, the soft voting ensemble achieved an accuracy of 97.02% with count vectorization and 96.56% with TF-IDF under the default n-gram configuration. The comparison was repeated for bigrams and trigrams, and in every configuration the ensemble outperformed the individual machine learning models. These findings indicate that the proposed ensemble model advances text classification and offers a robust framework for researchers and practitioners working on social media analytics.
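
The two ideas the abstract combines, N-gram count vectorization and soft voting, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the six classifiers or the preprocessing steps, so the helper names (`word_ngrams`, `count_vector`, `soft_vote`), the class labels, the example tweet, and the probability values below are all hypothetical, and only the Python standard library is used.

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=1):
    """Return word n-grams of length n_min..n_max (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def count_vector(text, n_min=1, n_max=1):
    """Count vectorization: map each n-gram to its number of occurrences."""
    return Counter(word_ngrams(text, n_min, n_max))


def soft_vote(probabilities):
    """Average per-class probabilities across classifiers; return (label, averages)."""
    labels = probabilities[0].keys()
    averaged = {
        label: sum(p[label] for p in probabilities) / len(probabilities)
        for label in labels
    }
    best = max(averaged, key=averaged.get)
    return best, averaged


# Bigram counts for a made-up tweet (hypothetical example text).
bigrams = count_vector("New ransomware attack hits hospital network", 2, 2)

# Hypothetical class probabilities from three base classifiers for one tweet.
model_outputs = [
    {"news": 0.90, "not_news": 0.10},
    {"news": 0.40, "not_news": 0.60},
    {"news": 0.45, "not_news": 0.55},
]
label, averaged = soft_vote(model_outputs)
```

In this made-up case two of the three classifiers lean towards `not_news`, so a hard majority vote would pick that class, while averaging the probabilities lets the one confident classifier carry the decision to `news`. That difference is the usual motivation for soft voting over hard voting.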


