Beyond Bigger Models: Multi-Axis Scaling for Large Language Models
Keywords:
Large Language Models, Scaling Laws, Multi-Axis Scaling, Sparse Mixture-of-Experts, Retrieval-Augmented Generation, Post-Training Alignment, Inference Optimization, AI Governance
Abstract
For years, the dominant prescription for building capable large language models (LLMs) was deceptively simple: make the model bigger. That logic produced real results: GPT-3, with 175 billion parameters, demonstrated in-context learning across dozens of benchmarks. However, the assumption that parameter count is the primary lever for system quality is no longer tenable, since Chinchilla demonstrated that most landmark models were significantly undertrained relative to their size. What followed was not a replacement of the scaling paradigm but a proliferation of it. Modern LLM systems are now shaped by nine distinct scaling axes: pretraining scale, compute-optimality, sparse conditional computation, retrieval and memory augmentation, long-context modeling, post-training alignment, parameter-efficient adaptation (PEFT), reasoning-time compute, and serving infrastructure optimization. This paper evaluates each axis against four operational dimensions (capability, cost, latency, and governance burden) and presents a product decision framework for selecting the most efficient scaling interventions while minimizing enterprise failure modes. The central finding is that the AI systems that consistently perform best in production are those built on a coherent portfolio of axis choices, not those with the largest base model.
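The claim that landmark models were undertrained relative to their size can be illustrated with a short back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes the commonly quoted approximations that training compute is about 6 FLOPs per parameter per token and that a compute-optimal run uses roughly 20 training tokens per parameter, and it assumes the publicly reported GPT-3 training set size of about 300 billion tokens. All function and constant names are hypothetical.

```python
# Back-of-the-envelope compute-optimality check.
# Assumptions (rules of thumb, not values stated in this paper):
#   - training FLOPs ~ 6 * parameters * tokens
#   - compute-optimal training uses ~20 tokens per parameter
FLOPS_PER_PARAM_TOKEN = 6
TOKENS_PER_PARAM = 20


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute for a dense transformer."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def compute_optimal_split(flops_budget: float) -> tuple[float, float]:
    """Split a FLOPs budget into (parameters, tokens) under the rules of thumb.

    With C = 6 * N * D and D = 20 * N, it follows that N = sqrt(C / 120).
    """
    n_params = (flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM)) ** 0.5
    return n_params, TOKENS_PER_PARAM * n_params


# GPT-3: 175B parameters (from the abstract); ~300B tokens is an assumed figure.
gpt3_params, gpt3_tokens = 175e9, 300e9
budget = training_flops(gpt3_params, gpt3_tokens)
opt_params, opt_tokens = compute_optimal_split(budget)

print(f"GPT-3 tokens per parameter: {gpt3_tokens / gpt3_params:.2f}")
print(f"Approximate training budget: {budget:.2e} FLOPs")
print(f"Same budget, rule-of-thumb optimum: {opt_params / 1e9:.0f}B parameters, "
      f"{opt_tokens / 1e12:.2f}T tokens")
```

Under these assumptions, the same compute budget would be spent on a model several times smaller trained on several times more data, which is the sense in which parameter count alone is a poor proxy for quality.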
References
A. Vaswani et al., "Attention is all you need," in Advances in Neural Information Processing Systems, vol. 30, 2017. [Online]. Available: https://doi.org/10.48550/arXiv.1706.03762
T. Brown et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020. [Online]. Available: https://doi.org/10.48550/arXiv.2005.14165
J. Kaplan et al., "Scaling laws for neural language models," arXiv preprint arXiv:2001.08361, 2020. [Online]. Available: https://doi.org/10.48550/arXiv.2001.08361
J. Hoffmann et al., "Training compute-optimal large language models," arXiv preprint arXiv:2203.15556, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2203.15556
W. Fedus, B. Zoph, and N. Shazeer, "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity," Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022. [Online]. Available: https://jmlr.org/papers/v23/21-0243.html
A. Q. Jiang et al., "Mixtral of experts," arXiv preprint arXiv:2401.04088, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2401.04088
P. Lewis et al., "Retrieval-augmented generation for knowledge-intensive NLP tasks," in Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020. [Online]. Available: https://doi.org/10.48550/arXiv.2005.11401
D. Edge et al., "From local to global: A graph RAG approach to query-focused summarization," arXiv preprint arXiv:2404.16130, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2404.16130
Google DeepMind, "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context," arXiv preprint arXiv:2403.05530, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2403.05530
N. F. Liu et al., "Lost in the middle: How language models use long contexts," Transactions of the Association for Computational Linguistics, vol. 12, pp. 157–173, 2024. [Online]. Available: https://doi.org/10.1162/tacl_a_00638
L. Ouyang et al., "Training language models to follow instructions with human feedback," in Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2203.02155
E. J. Hu et al., "LoRA: Low-rank adaptation of large language models," in International Conference on Learning Representations, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2106.09685
T. Dettmers et al., "QLoRA: Efficient finetuning of quantized LLMs," in Advances in Neural Information Processing Systems, vol. 36, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2305.14314
J. Wei et al., "Chain-of-thought prompting elicits reasoning in large language models," in Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2201.11903
X. Wang et al., "Self-consistency improves chain of thought reasoning in language models," in International Conference on Learning Representations, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2203.11171
W. Kwon et al., "Efficient memory management for large language model serving with PagedAttention," in Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, pp. 611–626, 2023. [Online]. Available: https://doi.org/10.1145/3600006.3613165
Y. Leviathan, M. Kalman, and Y. Matias, "Fast inference from transformers via speculative decoding," in Proceedings of the 40th International Conference on Machine Learning, pp. 19274–19286, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2211.17192
A. Chowdhery et al., "PaLM: Scaling language modeling with pathways," Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2204.02311
OpenAI, "GPT-4 technical report," arXiv preprint arXiv:2303.08774, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2303.08774
AI@Meta, "The Llama 3 herd of models," arXiv preprint arXiv:2407.21783, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2407.21783
H. W. Chung et al., "Scaling instruction-finetuned language models," Journal of Machine Learning Research, vol. 25, no. 70, pp. 1–53, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2210.11416
Y. Bai et al., "Constitutional AI: Harmlessness from AI feedback," arXiv preprint arXiv:2212.08073, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2212.08073
R. Rafailov et al., "Direct preference optimization: Your language model is secretly a reward model," in Advances in Neural Information Processing Systems, vol. 36, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2305.18290
OpenAI, "Learning to reason with LLMs," OpenAI Blog, Sep. 2024. [Online]. Available: https://openai.com/index/learning-to-reason-with-llms/
D. Guo et al., "DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning," arXiv preprint arXiv:2501.12948, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2501.12948
NVIDIA Corporation, "NVIDIA Blackwell architecture technical brief," NVIDIA, 2024. [Online]. Available: https://resources.nvidia.com/en-us-blackwell-architecture
MLCommons, "MLPerf inference v5.0 results," MLCommons, 2025. [Online]. Available: https://mlcommons.org/benchmarks/inference-datacenter/
T. Dao et al., "FlashAttention: Fast and memory-efficient exact attention with IO-awareness," in Advances in Neural Information Processing Systems, vol. 35, pp. 16344–16359, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2205.14135
NIST, "Artificial intelligence risk management framework (AI RMF 1.0)," National Institute of Standards and Technology, NIST AI 100-1, 2023. [Online]. Available: https://doi.org/10.6028/NIST.AI.100-1
NIST, "Artificial intelligence risk management framework: Generative artificial intelligence profile," National Institute of Standards and Technology, NIST AI 600-1, 2024. [Online]. Available: https://doi.org/10.6028/NIST.AI.600-1
ISO/IEC, "ISO/IEC 42001:2023 — Information technology — Artificial intelligence — Management system," International Organization for Standardization, 2023. [Online]. Available: https://www.iso.org/standard/81230.html
OWASP, "OWASP top 10 for large language model applications v1.1," OWASP Foundation, 2023. [Online]. Available: https://owasp.org/www-project-top-10-for-large-language-model-applications/
A. Conneau et al., "Unsupervised cross-lingual representation learning at scale," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451, 2020. [Online]. Available: https://doi.org/10.18653/v1/2020.acl-main.747
A. Srivastava et al., "Beyond the imitation game: Quantifying and extrapolating the capabilities of language models," Transactions on Machine Learning Research, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2206.04615
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IJISAE open access articles are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. This license requires users to give appropriate credit, provide a link to the license, and indicate if changes were made; if they remix, transform, or build upon the material, they must distribute their contributions under the same license as the original.


