EffiLLM: A Comprehensive Framework for Benchmarking and Optimizing Large Language Models in Resource-Constrained Environments

Authors

  • Fardeen NB, Sameer NB

Keywords

large language models, inference optimization, benchmarking, resource-constrained environments, quantization, efficiency analysis

Abstract

The deployment of Large Language Models (LLMs) in resource-constrained environments remains challenging due to their substantial computational and memory requirements. While numerous benchmarking tools exist, they predominantly focus on high-end hardware configurations, leaving a significant gap in understanding LLM performance characteristics under resource limitations. This paper introduces EffiLLM, a comprehensive benchmarking framework specifically designed to evaluate and optimize LLM inference efficiency across varied hardware configurations and quantization techniques. Through extensive experimentation with models ranging from 125M to 13B parameters across diverse computational settings, we quantify the impact of batch sizes, sequence lengths, and quantization methods on throughput, latency, and memory utilization. Our findings reveal that INT8 quantization offers a near-optimal balance, reducing memory requirements by approximately 50% while maintaining 90-95% of baseline performance. Furthermore, we identify non-linear scaling patterns in throughput as batch sizes increase, with diminishing returns beyond certain thresholds dependent on model size and available resources. The framework’s visualization capabilities enable nuanced analysis of efficiency trade-offs, facilitating informed deployment decisions. EffiLLM provides researchers and practitioners with an essential tool for optimizing LLM performance in environments with limited computational resources, potentially broadening the accessibility of these powerful models.
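
To illustrate the kind of measurement the abstract describes, the sketch below times a single batched generation in FP16 and with INT8 weight quantization, recording latency, throughput, and peak GPU memory. It is a minimal illustration only, not the EffiLLM implementation: it assumes a CUDA GPU and the Hugging Face transformers, accelerate, and bitsandbytes packages, and the model name, batch size, and sequence lengths are placeholder values rather than settings taken from the paper.

```python
# Minimal sketch (not the EffiLLM implementation): compare FP16 vs. INT8
# inference for one batched generation and report latency, throughput,
# and peak GPU memory. Assumes a CUDA GPU plus the Hugging Face
# transformers, accelerate, and bitsandbytes packages; the model name and
# batch/sequence settings below are illustrative placeholders.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "facebook/opt-1.3b"  # placeholder model in the 125M-13B range
BATCH_SIZE = 4                    # illustrative batch size
PROMPT_LEN = 256                  # illustrative input sequence length
NEW_TOKENS = 128                  # tokens generated per sequence


def load_model(use_int8: bool):
    """Load the model in FP16, optionally with INT8 weight quantization."""
    quant = BitsAndBytesConfig(load_in_8bit=True) if use_int8 else None
    return AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch.float16,
        device_map="auto",
        quantization_config=quant,
    )


def benchmark(model, tokenizer):
    """Run one batched generation; return (latency_s, tokens_per_s, peak_mem_gb)."""
    prompt = "Benchmarking large language models under resource constraints. " * 40
    inputs = tokenizer(
        [prompt] * BATCH_SIZE,
        return_tensors="pt",
        truncation=True,
        max_length=PROMPT_LEN,
    ).to(model.device)

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=NEW_TOKENS, do_sample=False)
    torch.cuda.synchronize()
    latency = time.perf_counter() - start

    tokens_per_s = BATCH_SIZE * NEW_TOKENS / latency
    peak_mem_gb = torch.cuda.max_memory_allocated() / 1e9
    return latency, tokens_per_s, peak_mem_gb


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    for use_int8 in (False, True):
        model = load_model(use_int8)
        latency, tps, mem = benchmark(model, tokenizer)
        print(f"INT8={use_int8}: {latency:.2f} s, {tps:.1f} tok/s, {mem:.2f} GB peak")
        del model
        torch.cuda.empty_cache()
```

Sweeping the batch size, prompt length, and quantization mode in a loop of this form, and aggregating the three metrics per configuration, is the measurement pattern the abstract describes applying across models from 125M to 13B parameters.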

Published

23.02.2024

How to Cite

Fardeen NB. (2024). EffiLLM: A Comprehensive Framework for Benchmarking and Optimizing Large Language Models in Resource-Constrained Environments. International Journal of Intelligent Systems and Applications in Engineering, 12(17s), 982 –. Retrieved from https://ijisae.org/index.php/IJISAE/article/view/7710

Issue

Vol. 12 No. 17s (2024)

Section

Research Article