Statistical Comparative Analysis of Semantic Similarities and Model Transferability Across Datasets for Short Answer Grading


  • Sridevi Bonthu, Jawaharlal Nehru Technological University, Kakinada, India and Vishnu Institute of Technology, Bhimavaram, India
  • S. Rama Sree, Aditya Engineering College, Surampalem, India
  • M. H. M. Krishna Prasad, Jawaharlal Nehru Technological University, Kakinada, India


Keywords: Semantic Similarity, Dataset Comparison, Statistical Analysis, Short Answer Grading


Developing dataset-specific models requires iterative fine-tuning and optimization, which incurs significant cost over time. This study investigates whether state-of-the-art (SOTA) models trained on established datasets transfer to an unexplored text dataset, that is, whether the knowledge these models acquire from existing datasets can be reused to achieve high performance in a new domain. Two well-established benchmarks, the STSB and Mohler datasets, are selected, and the recently introduced SPRAG dataset serves as the unexplored domain. The three datasets are compared using similarity metrics and statistical techniques. The primary goal of this work is to provide insight into the applicability and adaptability of SOTA models to new datasets. If existing models can be reused across diverse datasets, the demand for resource-intensive, dataset-specific training may decrease, allowing more efficient model deployment in natural language processing (NLP).
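As a rough illustration of what a statistical comparison of similarity scores across datasets can look like, the sketch below computes a token-level Jaccard similarity for each sentence pair in two small collections and applies a permutation test to the difference in mean similarity. The abstract does not specify which metrics or tests the study uses, so the choice of Jaccard similarity and a permutation test is an assumption made here for illustration. The sentence pairs are invented toy data, not samples from STSB, Mohler, or SPRAG, and all function names are hypothetical.

```python
import random
from statistics import mean


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two sentences (0.0 to 1.0)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # convention chosen here: two empty texts are identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def similarity_profile(pairs):
    """Similarity score for every (reference, answer) pair in a dataset."""
    return [jaccard(ref, ans) for ref, ans in pairs]


def permutation_test(x, y, n_resamples=2000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns an estimated p-value for the null hypothesis that both
    samples come from the same distribution.
    """
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(x)]) - mean(pooled[len(x):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Invented toy data standing in for an established and a new dataset.
established_pairs = [
    ("a stack is last in first out", "a stack follows last in first out order"),
    ("a queue is first in first out", "a queue removes the oldest item first"),
    ("a loop repeats a block of code", "a loop runs a block of code repeatedly"),
]
new_pairs = [
    ("a variable stores a value", "a named memory location holding data"),
    ("a function groups statements", "reusable code that performs a task"),
    ("a list holds ordered items", "a mutable sequence of elements"),
]

established_scores = similarity_profile(established_pairs)
new_scores = similarity_profile(new_pairs)
p_value = permutation_test(established_scores, new_scores)
print(f"mean similarity: {mean(established_scores):.3f} vs {mean(new_scores):.3f}")
print(f"permutation p-value: {p_value:.3f}")
```

A small p-value would suggest that the two collections differ in how closely answers overlap lexically with their references, which is one possible signal that a model tuned on one dataset may not transfer directly to the other. With samples this small the test has little power, so the output is only indicative.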








How to Cite

Bonthu, S., Sree, S. R., & Prasad, M. H. M. K. (2024). Statistical Comparative Analysis of Semantic Similarities and Model Transferability Across Datasets for Short Answer Grading. International Journal of Intelligent Systems and Applications in Engineering, 12(15s), 530–538. Retrieved from



Research Article