EffiLLM: A Comprehensive Framework for Benchmarking and Optimizing Large Language Models in Resource-Constrained Environments
Keywords:
large language models, inference optimization, benchmarking, resource-constrained environments, quantization, efficiency analysis

Abstract
The deployment of Large Language Models (LLMs) in resource-constrained environments remains challenging due to their substantial computational and memory requirements. While numerous benchmarking tools exist, they predominantly focus on high-end hardware configurations, leaving a significant gap in understanding LLM performance characteristics under resource limitations. This paper introduces EffiLLM, a comprehensive benchmarking framework specifically designed to evaluate and optimize LLM inference efficiency across varied hardware configurations and quantization techniques. Through extensive experimentation with models ranging from 125M to 13B parameters across diverse computational settings, we quantify the impact of batch sizes, sequence lengths, and quantization methods on throughput, latency, and memory utilization. Our findings reveal that INT8 quantization offers a near-optimal balance, reducing memory requirements by approximately 50% while maintaining 90-95% of baseline performance. Furthermore, we identify non-linear scaling patterns in throughput as batch sizes increase, with diminishing returns beyond certain thresholds dependent on model size and available resources. The framework’s visualization capabilities enable nuanced analysis of efficiency trade-offs, facilitating informed deployment decisions. EffiLLM provides researchers and practitioners with an essential tool for optimizing LLM performance in environments with limited computational resources, potentially broadening the accessibility of these powerful models.
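To make the measurement setup concrete, the sketch below illustrates the kind of benchmarking loop the abstract describes: loading a small causal language model with and without INT8 quantization, then recording generation latency, throughput, and peak GPU memory. This is a minimal illustration assuming the Hugging Face transformers and bitsandbytes stack; the model name, batch size, token counts, and the load_model/run_benchmark helpers are hypothetical and do not reflect EffiLLM's actual interface.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "facebook/opt-125m"   # illustrative model at the 125M-parameter end
BATCH_SIZE = 8
PROMPT_TOKENS = 128
NEW_TOKENS = 64

def load_model(use_int8: bool):
    # INT8 weights via bitsandbytes; FP16 baseline otherwise.
    quant_cfg = BitsAndBytesConfig(load_in_8bit=True) if use_int8 else None
    return AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        device_map="auto",
        torch_dtype=torch.float16,
        quantization_config=quant_cfg,
    )

def run_benchmark(model, tokenizer):
    # Build a fixed-length batch of prompts and measure one generation pass.
    prompt = "benchmark " * PROMPT_TOKENS
    inputs = tokenizer(
        [prompt] * BATCH_SIZE,
        return_tensors="pt",
        truncation=True,
        max_length=PROMPT_TOKENS,
    ).to(model.device)
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=NEW_TOKENS, do_sample=False)
    latency_s = time.perf_counter() - start
    tokens_per_s = BATCH_SIZE * NEW_TOKENS / latency_s   # generated tokens only
    peak_mem_gb = torch.cuda.max_memory_allocated() / 1e9
    return latency_s, tokens_per_s, peak_mem_gb

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    for use_int8 in (False, True):
        model = load_model(use_int8)
        latency, throughput, mem = run_benchmark(model, tokenizer)
        print(f"int8={use_int8}  latency={latency:.2f}s  "
              f"throughput={throughput:.1f} tok/s  peak_mem={mem:.2f} GB")
        del model
        torch.cuda.empty_cache()

Comparing the FP16 and INT8 runs of such a loop across batch sizes and sequence lengths is how one would observe the roughly 50% memory reduction and modest throughput cost reported above.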
License
Copyright (c) 2025 Fardeen NB, Sameer NB

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.