Classification of Spanish Fake News about Covid-19 using Text Augmentation and Transformers
Keywords:
Covid-19 fake news, transformers, Spanish language models, text augmentation

Abstract
This paper presents the results of five transformer-based models, DistilBERT, ALBERT, BETO, DistilBETO, and ALBETO, for the classification of fake news about COVID-19 in Spanish. Two text augmentation processes based on GPT-3 are compared: the first, TA1, follows the most common approach of augmenting all records in the training data; the second, TA2, is more selective, augmenting only the records that the models failed to learn during the training phase, which reduces training time relative to TA1. The results show that both text augmentation techniques yield improvements; however, TA2 performs better with the Spanish-language models BETO, DistilBETO, and ALBETO, improving on average by 1.15%, 11.12%, 2.44%, and 7.50% in terms of accuracy, recall, precision, and F1-score, respectively.
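The core of the TA2 process can be sketched in a few lines: instead of paraphrasing every training record, only the records the current model misclassifies are expanded with generated variants. The sketch below is a minimal illustration under stated assumptions; `predict` and `paraphrase` are placeholder callables standing in for the trained classifier and the GPT-3 generation step described in the paper, not the authors' actual implementation.

```python
from typing import Callable, List, Tuple

def selective_augment(
    records: List[Tuple[str, int]],
    predict: Callable[[str], int],
    paraphrase: Callable[[str], str],
    n_copies: int = 2,
) -> List[Tuple[str, int]]:
    """TA2-style selective augmentation.

    Keeps every original (text, label) record, but only the records
    the current model gets wrong receive extra paraphrased copies,
    so the augmented set stays smaller than full augmentation (TA1).
    """
    augmented = list(records)
    for text, label in records:
        if predict(text) != label:  # record the model failed to learn
            for _ in range(n_copies):
                # In the paper this step would query a generative
                # model (GPT-3) for a paraphrase of the text.
                augmented.append((paraphrase(text), label))
    return augmented
```

Because only the hard records are sent to the generator, the number of generation calls (and the resulting growth of the training set) scales with the model's training errors rather than with the full dataset size, which is the source of the training-time savings over TA1.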