Classification Model Based on Supervised Learning in the Constituent Context of the Ayacucho Region, Peru.


  • Yordan Sullca-Palomino, Yudi Guzmán-Monteza


Supervised learning, Data mining, Constituent process, Annotation rules


Data from platforms such as Twitter (now called X) offer a valuable opportunity to analyze public preferences through data mining, particularly in discussions of political and social issues. In this study, we developed a text classification model to categorize content related to the constituent process in Ayacucho. Using data collected from Twitter, we sought to classify each text as 'constituent' or 'non-constituent'. Three supervised learning techniques (support vector machine, SVM; random forest, RF; and naive Bayes, NB) were applied along with three vectorization methods (bag of words, BOW; TF-IDF; and Word2Vec, W2V). An annotation process was established to label the classes, supporting data reliability with a Kappa coefficient of 0.72. The data were divided into training, test, and validation sets, and data augmentation strategies were explored to address class imbalance. On the validation set, the SVM classifier obtained the highest F1 score (0.74), outperforming the other evaluated models. These findings offer practical guidance for researchers facing similar challenges in niche-specific text classification. The annotation methodology, the effectiveness of the classification techniques, and an approach focused on continuous improvement together lay a solid foundation for future projects in this field.
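The two figures reported above, the Kappa coefficient and the F1 score, can be illustrated with a minimal Python sketch. It assumes a two-annotator Cohen's kappa and a binary F1 computed for the 'constituent' class; the abstract does not state which kappa variant or F1 averaging the authors used, and the function names and toy labels below are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    Assumption: two annotators; the study may have used another variant.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal class rates.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical class
    return (observed - expected) / (1.0 - expected)


def f1_score(y_true, y_pred, positive="constituent"):
    """Binary F1 for the positive class (harmonic mean of precision and recall)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only; these are not data from the study.
annotator_a = ["constituent", "constituent", "non-constituent", "non-constituent"]
annotator_b = ["constituent", "non-constituent", "non-constituent", "non-constituent"]
kappa = cohen_kappa(annotator_a, annotator_b)
f1 = f1_score(annotator_a, annotator_b)
```

On the toy labels, the annotators agree on three of four items and chance agreement is 0.5, so kappa is 0.5. Treating the first list as ground truth and the second as predictions gives precision 1.0 and recall 0.5, hence an F1 of about 0.67.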




