Power-aware Deep Learning Model Serving with µ-Serve
Haoran Qiu, Weichao Mao, et al.
USENIX ATC 2024
Function-as-a-Service (FaaS) is an increasingly popular cloud deployment paradigm for serverless computing that frees application developers from managing the infrastructure. At the same time, it lets cloud providers exert control over workload consolidation, i.e., co-locating multiple containers on the same server, thereby achieving higher server utilization, often at the cost of higher end-to-end function request latency. A key aspect of serverless latency management, however, has not been well studied: the trade-off between application developers' latency goals and the FaaS providers' utilization goals. This paper presents a multi-faceted, measurement-driven study of latency variation in serverless platforms that elucidates this trade-off space. We obtained production measurements by executing FaaS benchmarks on IBM Cloud and a private cloud to study the impact of workload consolidation, queuing delay, and cold starts on end-to-end function request latency. We draw several conclusions from the characterization results. For example, increasing a container's allocated memory limit from 128 MB to 256 MB reduces tail latency by 2× but incurs 1.75× higher power consumption and 59% lower CPU utilization.
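The memory-limit result above suggests a simple way to reproduce the latency side of this trade-off. Below is a minimal, hypothetical sketch (not from the paper) that measures median and tail end-to-end latency of a deployed function over HTTP; `FUNCTION_URL` and the request count are placeholder assumptions, and cold-start effects are not separated out.

```python
import statistics
import time

import requests

# Placeholder for the HTTP endpoint of a deployed FaaS function;
# substitute the URL your platform reports. Not from the paper.
FUNCTION_URL = "https://example.cloud/api/v1/my-function"


def measure_latency(n_requests: int = 200) -> dict:
    """Invoke the function repeatedly and report end-to-end latency percentiles."""
    samples = []
    for _ in range(n_requests):
        start = time.perf_counter()
        requests.get(FUNCTION_URL, timeout=30)
        samples.append(time.perf_counter() - start)
    # statistics.quantiles with n=100 yields 99 cut points; index 98 ~= p99.
    return {
        "p50": statistics.median(samples),
        "p99": statistics.quantiles(samples, n=100)[98],
    }


if __name__ == "__main__":
    # Redeploy the function with a different memory limit (e.g., 128 MB vs.
    # 256 MB) between runs to observe the tail-latency trade-off described
    # in the abstract.
    print(measure_latency())
```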