Beyond Bigger Models: Multi-Axis Scaling for Large Language Models

Reeshav Kumar

Keywords:

Large Language Models, Scaling Laws, Multi-Axis Scaling, Sparse Mixture-of-Experts, Retrieval-Augmented Generation, Post-Training Alignment, Inference Optimization, AI Governance

Abstract

For years, the dominant prescription for building capable large language models (LLMs) was deceptively simple: make the model bigger. That logic produced real results: GPT-3, with 175 billion parameters, demonstrated in-context learning across dozens of benchmarks. However, the assumption that parameter count is the primary lever for system quality is no longer tenable, since Chinchilla demonstrated that most landmark models were significantly undertrained relative to their size. What followed was not a replacement of the scaling paradigm but a proliferation of it. Modern LLM systems are now shaped by nine distinct scaling axes: pretraining scale, compute-optimality, sparse conditional computation, retrieval and memory augmentation, long-context modeling, post-training alignment, parameter-efficient fine-tuning (PEFT), reasoning-time compute, and serving infrastructure optimization. This paper evaluates each axis against four operational dimensions (capability, cost, latency, and governance burden) and presents a product decision framework for selecting the most efficient scaling interventions and thereby minimizing enterprise failure modes. The central finding is that the AI systems that consistently outperform in production are those built on a coherent portfolio of axis choices, not those with the largest base model.

References

A. Vaswani et al., "Attention is all you need," in Advances in Neural Information Processing Systems, vol. 30, 2017. [Online]. Available: https://doi.org/10.48550/arXiv.1706.03762

T. Brown et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020. [Online]. Available: https://doi.org/10.48550/arXiv.2005.14165

J. Kaplan et al., "Scaling laws for neural language models," arXiv preprint arXiv:2001.08361, 2020. [Online]. Available: https://doi.org/10.48550/arXiv.2001.08361

J. Hoffmann et al., "Training compute-optimal large language models," arXiv preprint arXiv:2203.15556, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2203.15556

W. Fedus, B. Zoph, and N. Shazeer, "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity," Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022. [Online]. Available: https://jmlr.org/papers/v23/21-0243.html

A. Q. Jiang et al., "Mixtral of experts," arXiv preprint arXiv:2401.04088, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2401.04088

P. Lewis et al., "Retrieval-augmented generation for knowledge-intensive NLP tasks," in Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020. [Online]. Available: https://doi.org/10.48550/arXiv.2005.11401

D. Edge et al., "From local to global: A graph RAG approach to query-focused summarization," arXiv preprint arXiv:2404.16130, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2404.16130

Google DeepMind, "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context," arXiv preprint arXiv:2403.05530, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2403.05530

N. F. Liu et al., "Lost in the middle: How language models use long contexts," Transactions of the Association for Computational Linguistics, vol. 12, pp. 157–173, 2024. [Online]. Available: https://doi.org/10.1162/tacl_a_00638

L. Ouyang et al., "Training language models to follow instructions with human feedback," in Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2203.02155

E. J. Hu et al., "LoRA: Low-rank adaptation of large language models," in International Conference on Learning Representations, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2106.09685

T. Dettmers et al., "QLoRA: Efficient finetuning of quantized LLMs," in Advances in Neural Information Processing Systems, vol. 36, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2305.14314

J. Wei et al., "Chain-of-thought prompting elicits reasoning in large language models," in Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2201.11903

X. Wang et al., "Self-consistency improves chain of thought reasoning in language models," in International Conference on Learning Representations, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2203.11171

W. Kwon et al., "Efficient memory management for large language model serving with PagedAttention," in Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, pp. 611–626, 2023. [Online]. Available: https://doi.org/10.1145/3600006.3613165

Y. Leviathan, M. Kalman, and Y. Matias, "Fast inference from transformers via speculative decoding," in Proceedings of the 40th International Conference on Machine Learning, pp. 19274–19286, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2211.17192

A. Chowdhery et al., "PaLM: Scaling language modeling with pathways," Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2204.02311

OpenAI, "GPT-4 technical report," arXiv preprint arXiv:2303.08774, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2303.08774

AI@Meta, "The Llama 3 herd of models," arXiv preprint arXiv:2407.21783, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2407.21783

H. W. Chung et al., "Scaling instruction-finetuned language models," Journal of Machine Learning Research, vol. 25, no. 70, pp. 1–53, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2210.11416

Y. Bai et al., "Constitutional AI: Harmlessness from AI feedback," arXiv preprint arXiv:2212.08073, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2212.08073

R. Rafailov et al., "Direct preference optimization: Your language model is secretly a reward model," in Advances in Neural Information Processing Systems, vol. 36, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2305.18290

OpenAI, "Learning to reason with LLMs," OpenAI Blog, Sep. 2024. [Online]. Available: https://openai.com/index/learning-to-reason-with-llms/

D. Guo et al., "DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning," arXiv preprint arXiv:2501.12948, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2501.12948

NVIDIA Corporation, "NVIDIA Blackwell architecture technical brief," NVIDIA, 2024. [Online]. Available: https://resources.nvidia.com/en-us-blackwell-architecture

MLCommons, "MLPerf inference v5.0 results," MLCommons, 2025. [Online]. Available: https://mlcommons.org/benchmarks/inference-datacenter/

T. Dao et al., "FlashAttention: Fast and memory-efficient exact attention with IO-awareness," in Advances in Neural Information Processing Systems, vol. 35, pp. 16344–16359, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2205.14135

NIST, "Artificial intelligence risk management framework (AI RMF 1.0)," National Institute of Standards and Technology, NIST AI 100-1, 2023. [Online]. Available: https://doi.org/10.6028/NIST.AI.100-1

NIST, "Artificial intelligence risk management framework: Generative artificial intelligence profile," National Institute of Standards and Technology, NIST AI 600-1, 2024. [Online]. Available: https://doi.org/10.6028/NIST.AI.600-1

ISO/IEC, "ISO/IEC 42001:2023 — Information technology — Artificial intelligence — Management system," International Organization for Standardization, 2023. [Online]. Available: https://www.iso.org/standard/81230.html

OWASP, "OWASP top 10 for large language model applications v1.1," OWASP Foundation, 2023. [Online]. Available: https://owasp.org/www-project-top-10-for-large-language-model-applications/

A. Conneau et al., "Unsupervised cross-lingual representation learning at scale," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451, 2020. [Online]. Available: https://doi.org/10.18653/v1/2020.acl-main.747

A. Srivastava et al., "Beyond the imitation game: Quantifying and extrapolating the capabilities of language models," Transactions on Machine Learning Research, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2206.04615
