Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference. Pol G. Recasens, Ferran Agullo, et al. CLOUD 2025.
Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference. Yue Zhu, Hao Yu, et al. CLOUD 2025.
How Low Can LoRA Go: System-Level Throughput, Energy, and Model Quality Tradeoffs when Fine-Tuning Adapters. Connor Espenshade, Umesh Deshpande, et al. ISCA 2025.
Optimizing GPU Multiplexing for Efficient and Cost-Effective Access to Diverse Large Language Models in GPU Clusters. Yue Zhu, Chen Wang, et al. MASCOTS 2024.
GPU Optimizations for Efficient and Cost-Effective Access to Diverse Large Language Models in Research Cluster. Chen Wang, Yue Zhu, et al. MLSys 2024.
Towards Pareto Optimal Throughput in Small Language Model Serving. Pol G. Recasens, Yue Zhu, et al. EuroSys 2024.