Transformer Models: Key Methodologies, Next Sentence Prediction, GLUE Benchmark, and Transfer Learning

Authors

  • Pratap Singh Barth, Dhanroop Mal Nagar

Keywords:

Transformer Models, GLUE Benchmark, Transfer Learning

Abstract

This paper undertakes a comprehensive examination of the nascent yet rapidly evolving landscape of Transformer-based models in Natural Language Processing. It delves into the architectural innovations that define this paradigm shift, highlighting in particular the efficacy of the attention mechanism as the core computational unit, which enables highly parallel processing and rich contextual understanding in sequence modeling (Vaswani et al., 2017). The central subject of this investigation is Bidirectional Encoder Representations from Transformers (BERT), a landmark model introduced in 2018 that leverages the Transformer architecture to learn deep bidirectional representations of language (Devlin & Chang, 2018).
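
To make the attention computation concrete, the following is a minimal sketch of the scaled dot-product attention described by Vaswani et al. (2017), written in plain NumPy; the toy dimensions and random inputs are illustrative choices, not code from the works under review.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention (Vaswani et al., 2017).

    Q, K: (seq_len, d_k) query and key matrices; V: (seq_len, d_v) values.
    Returns the attended values and the attention weight matrix.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the keys
    return weights @ V, weights

# Toy example: a sequence of 4 tokens with 8-dimensional projections.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # (4, 8) (4, 4)
```

Because every query is compared with every key in a single matrix product, the entire sequence is processed in parallel rather than token by token, which is the property highlighted above.
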
The study critically analyzes BERT's dual pre-training objectives: Masked Language Modeling, which fosters rich contextual understanding by predicting occluded tokens, and Next Sentence Prediction, a task that equips the model to discern relationships between sentence pairs and is crucial for discourse-level comprehension. It further assesses the instrumental role of the General Language Understanding Evaluation (GLUE) benchmark, established in 2018, as a standardized and challenging suite of tasks that has driven progress and enabled robust comparison across diverse language understanding systems (Wang et al., 2018a, 2018b). Viewed through this lens, the transfer learning paradigm exemplified by BERT's pre-train-then-fine-tune approach has revolutionized NLP by enabling state-of-the-art performance on numerous downstream tasks with minimal task-specific data. The paper illuminates how these interconnected methodological pillars collectively yield highly versatile and robust pre-trained language representations, fundamentally reshaping the trajectory of natural language understanding research and application.
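
As an illustration of the two pre-training objectives analyzed here, the sketch below probes a publicly released BERT checkpoint with one masked token and one candidate next sentence. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint as convenient stand-ins; it is not the training setup of the original papers.

```python
# A minimal sketch, assuming the Hugging Face `transformers` library and the
# public `bert-base-uncased` checkpoint (not the original authors' training code).
import torch
from transformers import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")
model.eval()

# Masked Language Modeling: occlude one token and let the model predict it
# from bidirectional context.
sentence_a = "The attention mechanism lets the model weigh [MASK] tokens."
# Next Sentence Prediction: pair sentence A with a candidate follow-up sentence B.
sentence_b = "This pairing lets the model judge whether B plausibly follows A."

inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# MLM head: one vocabulary distribution per position; read off the [MASK] slot.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
predicted_id = outputs.prediction_logits[0, mask_pos].argmax().item()
print("MLM fill-in for [MASK]:", tokenizer.decode([predicted_id]))

# NSP head: a two-way score in which index 0 means "B is the next sentence".
is_next = outputs.seq_relationship_logits.argmax(dim=-1).item() == 0
print("NSP judges that sentence B follows sentence A:", is_next)
```

For the fine-tuning half of the transfer learning paradigm, the same checkpoint would typically be loaded into a task-specific head such as BertForSequenceClassification and trained briefly on the labels of a downstream GLUE task.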

References

Belinkov, Y., & Glass, J. (2019). Analysis Methods in Neural Language Processing: A Survey. Transactions of the Association for Computational Linguistics, 7, 49–72. https://doi.org/10.1162/tacl_a_00254

Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. T. (2016). Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. arXiv. https://doi.org/10.48550/arxiv.1607.06520

Caliskan, A., Bryson, J. J., & Narayanan, A. (2016). Semantics derived automatically from language corpora contain human-like biases. arXiv. https://doi.org/10.48550/ARXIV.1608.07187

Chaabouni, S. (2017). Study and prediction of visual attention with deep learning networks in view of assessment of patients with neurodegenerative diseases. HAL (Le Centre Pour La Communication Scientifique Directe). https://tel.archives-ouvertes.fr/tel-02408326

Chang, K.-W., Prabhakaran, V. M., & Ordóñez, V. (2019, November 1). Bias and Fairness in Natural Language Processing. Empirical Methods in Natural Language Processing. https://aclanthology.org/D19-2003/

Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). Generating Long Sequences with Sparse Transformers. arXiv. https://doi.org/10.48550/arxiv.1904.10509

Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019a). Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. arXiv. https://doi.org/10.48550/ARXIV.1901.02860

Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019b). Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2978–2988. https://doi.org/10.18653/v1/p19-1285

Devlin, J., & Chang, M. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv. https://doi.org/10.48550/arxiv.1810.04805

Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. https://doi.org/10.18653/v1/n19-1423

Doshi-Velez, F., & Kim, B. (2017). Towards A Rigorous Science of Interpretable Machine Learning. arXiv. https://doi.org/10.48550/arxiv.1702.08608

Fisch, A., Talmor, A., Jia, R., Seo, M., Choi, E., Chen, D., Berant, J., Srikumar, V., Chen, P., Linden, A. V., Harding, B., Kembhavi, A., Schwenk, D., Choi, J., Farhadi, A., Kwiatkowski, T., Palomaki, J., Collins, M., Parikh, A. P., … Herledan, F. (2019). Proceedings of the 2nd Workshop on Machine Reading for Question Answering. https://doi.org/10.18653/v1/d19-58

Houlsby, N., Giurgiu, A., Jastrzębski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., & Gelly, S. (2019). Parameter-Efficient Transfer Learning for NLP. arXiv. https://doi.org/10.48550/arxiv.1902.00751

Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv. https://doi.org/10.48550/arxiv.1412.6980

Papanikolaou, Y., Roberts, I., & Pierleoni, A. (2019). Deep Bidirectional Transformers for Relation Extraction without Supervision. https://doi.org/10.18653/v1/d19-6108

Shi, W., & Demberg, V. (2019). Next Sentence Prediction helps Implicit Discourse Relation Classification within and across Domains. https://doi.org/10.18653/v1/d19-1586

Solaiman, I., Brundage, M., Clark, J., Askell, A., Herbert-Voss, A., Wu, J., Radford, A., Krueger, G., Kim, J. W., Kreps, S., McCain, M., Newhouse, A., Blazakis, J., McGuffie, K., & Wang, J. (2019). Release Strategies and the Social Impacts of Language Models. arXiv. https://doi.org/10.48550/arxiv.1908.09203

Sun, C., Qiu, X., Xu, Y., & Huang, X. (2019). How to Fine-Tune BERT for Text Classification? arXiv. https://doi.org/10.48550/arxiv.1905.05583

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. arXiv. https://doi.org/10.48550/ARXIV.1706.03762

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2018a). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 353–355. https://doi.org/10.18653/v1/w18-5446

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2018b). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. arXiv. https://doi.org/10.48550/arxiv.1804.07461

Published

30.06.2020

How to Cite

Pratap Singh Barth, & Dhanroop Mal Nagar. (2020). Transformer Models: Key Methodologies, Next Sentence Prediction, GLUE Benchmark, and Transfer Learning. International Journal of Intelligent Systems and Applications in Engineering, 8(2), 152–157. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/8035

Issue

Vol. 8 No. 2 (2020)

Section

Research Article