Acta Informatica Pragensia X:X | DOI: 10.18267/j.aip.31319
Cloud-Based Large Language Model Deployment: A Comparative Analysis of Serverless and Bring-Your-Own-Container Architectures
- Faculty of Informatics and Statistics, Prague University of Economics and Business, Czech Republic
Background: Large Language Models (LLMs) have transformed research and industry applications; however, cloud deployment decisions remain complex and poorly documented, particularly for academic researchers operating under budget constraints. Systematic guidance on infrastructure selection for LLM-based research is limited.
Objective: This study provides a comprehensive empirical evaluation of cloud-based LLM deployment architectures, examining inference efficiency, serverless platform availability, and architectural trade-offs across major cloud providers to deliver actionable guidance for budget-constrained researchers.
Methods: The author evaluated 32 open-source LLMs ranging from 0.6 billion to 1 trillion parameters across serverless and Bring Your Own Container (BYOC) deployment configurations. Using the Belebele benchmark, the study analyzed cost–efficiency relationships, serverless platform availability, and metrics exposure across Amazon SageMaker, Amazon Bedrock, Azure Serverless, and Hugging Face–compatible providers.
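As a concrete illustration of this setup, the minimal sketch below scores a single Belebele-style multiple-choice item against a serverless endpoint exposing the widely used OpenAI-compatible chat API. The endpoint URL, API key, and prompt template are placeholders, not the study's actual evaluation harness; the model identifier is one of the evaluated open-source models.

```python
import requests

# Hypothetical endpoint and credentials; any OpenAI-compatible serverless
# provider could be substituted here. This sketch is illustrative and does
# not reproduce the study's harness or prompt template.
ENDPOINT = "https://serverless-provider.example/v1/chat/completions"
API_KEY = "YOUR_API_KEY"
MODEL = "meta-llama/Llama-3.3-70B-Instruct"  # one of the evaluated models

def score_item(passage: str, question: str, options: list[str], gold_idx: int) -> bool:
    """Send one reading-comprehension item and check the predicted letter."""
    letters = "ABCD"
    prompt = (
        f"{passage}\n\nQuestion: {question}\n"
        + "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
        + "\n\nAnswer with a single letter (A, B, C, or D)."
    )
    resp = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 1,   # a single answer letter suffices
            "temperature": 0,  # deterministic decoding for benchmarking
        },
        timeout=60,
    )
    resp.raise_for_status()
    predicted = resp.json()["choices"][0]["message"]["content"].strip().upper()
    return predicted[:1] == letters[gold_idx]
```

Benchmark accuracy is then the mean of score_item over all items, and per-request token counts from the same responses can feed the cost side of the analysis.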
Results: Model performance follows a logarithmic scaling relationship with parameter count (R²=0.727) and deployment cost (R²=0.639). Models in the 30–50B parameter range achieve 85–90% of maximum accuracy at a fraction of the cost of frontier models. However, serverless availability remains fragmented: only 34.4% of examined models are accessible via serverless endpoints, with minimal cross-platform redundancy (6.2%). Deployment architecture introduces a fundamental trade-off: serverless platforms expose 71% fewer metrics than BYOC approaches while eliminating infrastructure management overhead and idle costs.
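To make the reported logarithmic scaling relationship concrete, the following sketch fits the form acc ≈ a·ln(N) + b by ordinary least squares in log-space and computes R² from the residuals. The data points are invented placeholders for illustration, not the study's measurements.

```python
import numpy as np

# Placeholder (invented) parameter counts in billions and accuracies;
# the study's actual measurements are not reproduced here.
params_b = np.array([0.6, 3.0, 7.0, 14.0, 32.0, 70.0, 235.0, 1000.0])
accuracy = np.array([0.38, 0.55, 0.66, 0.74, 0.82, 0.86, 0.90, 0.92])

# Least-squares fit of accuracy = a * ln(N) + b.
a, b = np.polyfit(np.log(params_b), accuracy, 1)

# Coefficient of determination: R^2 = 1 - SS_res / SS_tot.
pred = a * np.log(params_b) + b
ss_res = np.sum((accuracy - pred) ** 2)
ss_tot = np.sum((accuracy - accuracy.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(f"acc ≈ {a:.3f}·ln(N) + {b:.3f},  R² = {r_squared:.3f}")
```

The same fit with deployment cost in place of parameter count yields the second reported relationship; the diminishing returns noted in the abstract follow directly from the ln(N) term, since each doubling of N adds only a constant a·ln 2 to accuracy.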
Conclusion: These findings provide practical guidance for researchers selecting cloud infrastructure under budget constraints. Models in the 7–14B range offer optimal cost efficiency, while the 30–50B range maximizes accuracy per dollar for demanding tasks. The results also challenge the prevailing emphasis on ever-larger models, as diminishing returns become substantial beyond 30B parameters. Persistent gaps in serverless availability and observability highlight the need for greater standardization in cloud platforms.
Keywords: LLMs; Cloud computing; Serverless architecture; Cost optimization; Performance evaluation; Model deployment; Infrastructure selection.
Received: January 29, 2026; Revised: January 29, 2026; Accepted: March 16, 2026; Prepublished online: April 26, 2026
References
- Agrawal, A., Kedia, N., Panwar, A., Mohan, J., Kwatra, N., Gulavani, B. S., Tumanov, A., & Ramjee, R. (2024a). Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve. In Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (pp. 117-134). USENIX Association.
- Agrawal, A., Panwar, A., Mohan, J., Kwatra, N., Gulavani, B. S., & Ramjee, R. (2024b). Sarathi-Serve: Efficient LLM inference by piggybacking decodes with chunked prefills. arXiv. https://doi.org/10.48550/arXiv.2403.02310
- Appenzeller, M. (2024). Welcome to LLMflation - LLM inference cost is going down fast. Andreessen Horowitz. https://a16z.com/llmflation-llm-inference-cost
- Bandarkar, L., Liang, D., Muller, B., Artetxe, M., Shukla, S. N., Husa, D., Goyal, N., Krishnan, A., Zettlemoyer, L., & Khabsa, M. (2024). The Belebele benchmark: A parallel reading comprehension dataset in 122 language variants. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (pp. 749-775). ACL. https://doi.org/10.18653/v1/2024.acl-long.44
- Cheng, Y., Liu, Y., Yao, J., An, Y., Chen, X., Feng, S., Huang, Y., Shen, S., Zhang, R., Du, K., & Jiang, J. (2025). LMCache: An efficient KV cache layer for enterprise-scale LLM inference. arXiv. https://doi.org/10.48550/arXiv.2510.09665
- Collabnix. (2024). Kubernetes autoscaling for LLM inference: Complete guide (2024). https://collabnix.com/kubernetes-autoscaling-for-llm-inference-complete-guide-2024/
- Dao, T. (2024). FlashAttention-2: Faster attention with better parallelism and work partitioning. In 12th International Conference on Learning Representations. ICLR.
- Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (pp. 16344-16359). NeurIPS.
- Xiang, Y., Li, X., Qian, K., Yang, Y., Zhu, D., Yu, W., Zhai, E., Liu, X., Jin, X., & Zhou, J. (2025). Aegaeon: Effective GPU pooling for concurrent LLM serving on the market. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (pp. 1030-1045). ACM. https://doi.org/10.1145/3731569.3764815
- Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2023). GPTQ: Accurate post-training quantization for generative pre-trained transformers. In 11th International Conference on Learning Representations. ICLR.
- FriendliAI. (2024). Which quantization to use to reduce the size of LLMs? https://friendli.ai/blog/quantization-reduce-llm-size
- Fu, Y., Xue, L., Huang, Y., Brabete, A.-O., Ustiugov, D., Patel, Y., & Mai, L. (2024). ServerlessLLM: Low-latency serverless inference for large language models. In 18th USENIX Symposium on Operating Systems Design and Implementation (pp. 135-153). USENIX Association.
- GMI Cloud. (2025). How much do GPU cloud platforms cost for AI startups in 2025. https://www.gmicloud.ai/blog/how-much-do-gpu-cloud-platforms-cost-for-ai-startups-in-2025
- Gun.io. (2025). Scaling AI infrastructure for LLMs: Best practices for mid-sized companies. https://gun.io/news/2025/04/scaling-ai-infrastructure-for-llms/
- Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling laws for neural language models. arXiv. https://doi.org/10.48550/arXiv.2001.08361
- Cheng, X., Zhang, Z., Zhou, Y., Ji, J., Jiang, J., Zhao, Z., Xiao, Z., Ye, Z., Huang, Y., Lai, R., Jin, H., Hou, B., Wu, M., Dong, Y., Yip, A., Wang, S., Yang, W., Miao, X., Chen, T., & Jia, Z. (2025). Mirage persistent kernel: A compiler and runtime for mega-kernelizing tensor programs. arXiv. https://doi.org/10.48550/arXiv.2512.22219
- Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., & Stoica, I. (2023). Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th ACM Symposium on Operating Systems Principles (pp. 611-626). ACM. https://doi.org/10.1145/3600006.3613165
- Lai, R., Liu, H., Lu, C., Liu, Z., Cao, S., Shao, S., Zhang, Y., Mai, L., & Ustiugov, D. (2025). TokenScale: Timely and accurate autoscaling for disaggregated LLM serving with token velocity. arXiv. https://doi.org/10.48550/arXiv.2512.03416
- Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast inference from transformers via speculative decoding. In Proceedings of the 40th International Conference on Machine Learning (pp. 19274-19286). PMLR.
- Li, L., Dinh, L., Hu, S., & Hemphill, L. (2024a). Academic collaboration on large language model studies increases overall but varies across disciplines. arXiv. https://doi.org/10.48550/arXiv.2408.04163
- Li, B., Jiang, Y., Gadepally, V., & Tiwari, D. (2024b). LLM inference serving: Survey of recent advances and opportunities. In 2024 IEEE High Performance Extreme Computing Conference. IEEE. https://doi.org/10.1109/HPEC62836.2024.10938426
- Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., & Han, S. (2025). AWQ: Activation-aware weight quantization for LLM compression and acceleration. GetMobile: Mobile Computing and Communications, 28(4), 12-17. https://doi.org/10.1145/3714983.371498
- Lin, Y., Peng, S., Lu, C., Xu, C., & Ye, K. (2026). FlexPipe: Adapting dynamic LLM serving through inflight pipeline refactoring in fragmented serverless clusters. In Proceedings of the 21st European Conference on Computer Systems. ACM. https://doi.org/10.1145/3767295.3769316
- Meta AI. (2024). Llama 3.3 70B Instruct. Hugging Face. https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
- Microsoft Azure. (2023). Cost-effective private large language model inference on Azure Kubernetes Service. https://msazure.club/cost-effective-private-large-language-model-inference-on-azure-kubernetes-service/
- NVIDIA. (2023). NVIDIA TensorRT-LLM supercharges large language model inference on NVIDIA H100 GPUs. NVIDIA Technical Blog. https://developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus
- NVIDIA. (2025). NVIDIA TensorRT-LLM now supports recurrent drafting for optimizing LLM inference. NVIDIA Technical Blog. https://developer.nvidia.com/blog/nvidia-tensorrt-llm-now-supports-recurrent-drafting-for-optimizing-llm-inference/
- Oracle. (2024). Achieve cost-efficient LLM serving with production-ready quantization solution. Oracle Cloud Infrastructure Blog. https://blogs.oracle.com/cloud-infrastructure/cost-efficient-llm-serving-with-quantization
- Paleyes, A., Urma, R.-G., & Lawrence, N. D. (2022). Challenges in deploying machine learning: A survey of case studies. ACM Computing Surveys, 55(6), Article 124. https://doi.org/10.1145/3533378
- Romero, F., Souza, M., Watcharapichat, P., Zhang, Q., Zhao, N. J., & Li, A. (2022). INFless: A native serverless system for low-latency, high-throughput inference. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (pp. 768-781). ACM.
- Saleh, Y., Abu Talib, M., Nasir, Q., & Dakalbab, F. (2025). Evaluating large language models: A systematic review of efficiency, applications, and future directions. Frontiers in Computer Science, 7, Article 1523699. https://doi.org/10.3389/fcomp.2025.1523699
- Schmid, L., Hey, T., Armbruster, M., Corallo, S., Fuchß, D., Keim, J., Liu, H., & Koziolek, A. (2025). Software architecture meets LLMs: A systematic literature review. arXiv. https://doi.org/10.48550/arXiv.2505.16697
- Semerikov, S. O., Vakaliuk, T. A., Kanevska, O. B., Ostroushko, O. A., & Kolhatin, A. O. (2025). Edge intelligence unleashed: A survey on deploying large language models in resource-constrained environments. Journal of Edge Computing, 4(2), 179-233. https://doi.org/10.55056/jec.1000
- Xiao, C., Cai, J., Zhao, W., Zeng, G., Lin, B., Zhou, J., Zheng, Z., Han, X., Liu, Z., & Sun, M. (2025). Densing law of LLMs. Nature Machine Intelligence, 7, 1823-1833. https://doi.org/10.1038/s42256-025-01137-0
- Xu, M., Liao, J., Wu, J., He, Y., Ye, K., & Xu, C. (2025). Cloud native system for LLM inference serving. arXiv. https://doi.org/10.48550/arXiv.2507.18007
- Yadav, R. (2025). Deploying scalable serverless LLM workloads on Kubernetes + Knative + GPU support + validating admission webhook. Medium. https://medium.com/@daydreamingguy941/deploying-scalable-serverless-llm-workloads-on-kubernetes-knative-gpu-support-validating-1e7cab5b0cf8
- Yu, G.-I., Jeong, J. S., Kim, G.-W., Kim, S., & Chun, B.-G. (2022). Orca: A distributed serving system for transformer-based generative models. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (pp. 521-538). USENIX Association.
- Zhang, Y., & Matta, I. (2025). SERFLOW: Serverless LLM serving with cost-efficient autoscaling. In Proceedings of the 8th International Workshop on Edge Systems, Analytics and Networking (pp. 25-30). ACM. https://doi.org/10.1145/3721323.3724838
- Zheng, L., Yin, L., Xie, Z., Huang, J., Sun, C., Yu, C. H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J. E., Barrett, C., & Sheng, Y. (2024). SGLang: Efficient execution of structured language model programs. In Proceedings of the 38th International Conference on Neural Information Processing Systems (pp. 62557-62583). Curran Associates.
- Zhou, Z., Ning, X., Hong, K., Fu, T., Xu, J., Li, S., Lou, Y., Wang, L., Yuan, Z., Li, X., Yan, S., Dai, G., Zhang, X.-P., Dong, Y., & Wang, Y. (2024). A survey on efficient inference for large language models. arXiv. https://doi.org/10.48550/arXiv.2404.14294
This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits use, distribution, and reproduction in any medium, provided the original publication is properly cited. No use, distribution or reproduction is permitted which does not comply with these terms.
