
Donating llm-d to the Cloud Native Computing Foundation

IBM and partners are contributing llm-d, which offers a replicable blueprint for developers and researchers to deploy inference stacks for any model, on any accelerator, in any cloud.

Operationalizing AI inference is hard, especially with cutting-edge models and the infrastructure they require. These new workloads are highly variable, and conventional serving APIs often fall short when it comes to orchestrating inference. The cloud‑native world is racing to keep up with the demands of modern AI, and large language model (LLM) inference is one place where that pressure is felt most intensely.

As organizations push models into production, they’re discovering that serving LLMs at scale presents a new class of distributed systems challenges. That’s exactly the gap llm‑d was created to fill. llm-d addresses the limitations of traditional routing and autoscaling by offering a Kubernetes‑native distributed inference framework.

Today at KubeCon Europe, IBM Research, Red Hat, and Google Cloud announced the contribution of llm-d to the CNCF as a sandbox project. Launched as a collaborative effort with founding contributors NVIDIA and CoreWeave, and joined by industry leaders AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI, alongside university supporters at the University of California, Berkeley, and the University of Chicago, the project has rapidly evolved into state-of-the-art AI infrastructure. This move marks a major milestone in IBM’s mission to make high‑performance, vendor‑neutral, Kubernetes‑native LLM inference accessible to everyone. By aligning with the CNCF, we’re doubling down on open governance, community‑driven development, and the belief that scalable generative AI should be a core feature of the cloud‑native ecosystem.

“llm-d bridges the gap between traditional distributed systems and the emerging AI inference stack, making large-scale model serving a first-class, cloud-native workload,” said Carlos Costa, a Distinguished Engineer at IBM Research who specializes in hybrid cloud platforms for AI. “This donation helps establish the CNCF as a home for AI inference infrastructure, catalyzing a broader ecosystem of composable systems and projects.”

Any model, any accelerator, any cloud

The mission of llm-d from the outset was to build a vendor-neutral inference serving stack that can be used with any combination of hardware and software. llm‑d, which was launched in 2025, is a Kubernetes‑native, high‑performance distributed inference framework designed to make serving LLMs at scale both predictable and efficient. To support modern generative AI workloads, it provides a modular architecture that turns inference engines like vLLM into production-ready, distributed, cloud-native inference systems capable of sustaining low latency and high throughput under real-world traffic.

“In the most fundamental sense, we’re taking inference from just standing up something and playing with models, to running them in production at scale with multiple users and models,” said Priya Nagpurkar, vice president of AI platform at IBM Research. “You need the scale, distribution, and reliability of what Kubernetes provided for the previous era, while also recognizing that this is a very different workload,” she added.

At its core, llm‑d addresses the most challenging aspects of LLM inference, including KV‑cache locality management, balancing prefill and decode phases, coordinating multi‑node deployments, maintaining low latency, and efficiently utilizing heterogeneous accelerator hardware. “To deliver efficient inference, llm‑d introduces intelligent inference scheduling and prefix‑cache‑aware routing,” said Vita Bortnikov, IBM Fellow specializing in distributed AI inferencing at IBM Research. “This ensures that each request is routed to the optimal replica based on cache state, traffic patterns, and hardware topology.”
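The prefix-cache-aware routing Bortnikov describes can be sketched in miniature: score each replica by how many of the request's prompt blocks it already holds in its KV cache, offset by its current load, and send the request to the highest scorer. This is an illustrative sketch, not llm-d's actual API; the `Replica` structure, block size, and weights are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    load: int                                   # in-flight requests on this replica
    cached_prefixes: set = field(default_factory=set)  # hashes of cached prompt blocks

def prefix_blocks(prompt_tokens, block_size=16):
    """Hash fixed-size token blocks; caches typically store KV state at block granularity."""
    return [hash(tuple(prompt_tokens[i:i + block_size]))
            for i in range(0, len(prompt_tokens) - block_size + 1, block_size)]

def score(replica, blocks, cache_weight=2.0, load_weight=1.0):
    # Reward reusable KV blocks (skipped prefill work); penalize queue depth.
    hits = sum(1 for b in blocks if b in replica.cached_prefixes)
    return cache_weight * hits - load_weight * replica.load

def route(replicas, prompt_tokens):
    """Pick the replica where cache reuse most outweighs existing load."""
    blocks = prefix_blocks(prompt_tokens)
    return max(replicas, key=lambda r: score(r, blocks))
```

In a real scheduler the cache state would be tracked from the engines' own reports and the scoring would also account for hardware topology and traffic patterns, but the core trade-off (cache hits versus load) is the same.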

“Another key capability of llm‑d is hierarchical KV‑cache offloading across GPU, CPU, and storage tiers,” she added. “This significantly improves performance, particularly for long‑context workloads and high levels of concurrency.”
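The hierarchical offloading idea can be illustrated with a toy tiered cache: new KV blocks land in the fastest tier, evictions cascade to slower tiers instead of being dropped, and hits in a slow tier are promoted back up. Tier names, capacities, and the class shape are assumptions for illustration, not llm-d's implementation.

```python
class TieredKVCache:
    """Toy sketch of hierarchical KV-cache offloading across memory tiers."""

    def __init__(self, capacities=None):
        # Fastest tier first; capacities count KV blocks per tier (illustrative).
        self.capacities = dict(capacities or {"gpu": 2, "cpu": 4, "disk": 8})
        self.tiers = {name: {} for name in self.capacities}
        self.order = list(self.capacities)

    def put(self, block_id, kv):
        self._insert("gpu", block_id, kv)

    def _insert(self, tier, block_id, kv):
        t = self.tiers[tier]
        if len(t) >= self.capacities[tier]:
            # Tier full: offload the oldest block to the next (slower) tier.
            victim, vkv = next(iter(t.items()))
            del t[victim]
            nxt = self.order.index(tier) + 1
            if nxt < len(self.order):
                self._insert(self.order[nxt], victim, vkv)
        t[block_id] = kv

    def get(self, block_id):
        # Search fastest tier first; promote hits back into GPU memory.
        for tier in self.order:
            if block_id in self.tiers[tier]:
                kv = self.tiers[tier].pop(block_id)
                self._insert("gpu", block_id, kv)
                return kv
        return None
```

The payoff for long-context workloads is visible even in this sketch: a prefix evicted from GPU memory can be recovered from a slower tier far more cheaply than recomputing it through prefill.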

Through prefill/decode disaggregation, llm‑d allows these two fundamentally different phases of inference to scale independently, dramatically improving efficiency for variable workloads. Autoscaling is traffic‑ and hardware‑aware, adapting to real‑time workload characteristics rather than relying on generic CPU and GPU metrics. llm‑d integrates deeply with emerging Kubernetes standards, including the Kubernetes Gateway API Inference Extension (GAIE) and LeaderWorkerSet (LWS), making distributed inference a first‑class Kubernetes workload.
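Why independent scaling helps can be shown with a hedged sketch: the prefill pool is sized from its compute-bound signal (pending prompt tokens) and the decode pool from its memory-bound signal (concurrent sequences holding KV cache), rather than both from one generic utilization metric. The thresholds and function shape below are illustrative assumptions, not llm-d's autoscaler.

```python
import math

def scale_pools(pending_prefill_tokens, active_decode_sequences,
                prefill_tokens_per_replica=8_192,
                decode_seqs_per_replica=64,
                min_replicas=1):
    """Size prefill and decode pools independently from their own signals.

    Prefill is compute-bound, so it scales with the token backlog; decode is
    memory-bound, so it scales with how many sequences hold KV-cache state.
    """
    prefill = max(min_replicas,
                  math.ceil(pending_prefill_tokens / prefill_tokens_per_replica))
    decode = max(min_replicas,
                 math.ceil(active_decode_sequences / decode_seqs_per_replica))
    return {"prefill": prefill, "decode": decode}
```

A burst of long prompts grows only the prefill pool, while a steady crowd of chat sessions grows only the decode pool; a single CPU- or GPU-utilization metric would conflate the two and over- or under-provision one phase.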

A central promise of llm-d is that it will turn AI infrastructure from a black box into a replicable blueprint for manageable, cloud-native microservices. “This is a well-lit path,” Costa said. “We tested this for you. We benchmarked it. We went through the pain, and the result is a clear path from experimentation to production for the community.” With reproducible benchmarks, validated deployment patterns, and vendor‑neutral design, llm‑d gives organizations a proven route to production‑grade generative AI infrastructure across NVIDIA, AMD, Intel, and Google TPU accelerators.

A well-lit path

The CNCF is a natural venue for approaching this varied landscape. llm‑d was contributed to the CNCF as a sandbox project to accelerate the standardization, openness, and interoperability of distributed LLM inference across the cloud‑native ecosystem. As organizations race to operationalize generative AI, they’re discovering that LLM inference introduces challenges — stateful scheduling, KV cache locality, multi‑phase execution, heterogeneous accelerators — that expose limitations in the original workload model Kubernetes was designed around. These challenges are too fundamental and too shared to be solved inside a single company’s product roadmap. They require a neutral, community‑driven approach.

By contributing llm‑d to the CNCF, the project’s maintainers aim to establish a vendor‑agnostic, Kubernetes‑native blueprint for high‑performance inference that any organization can adopt. CNCF provides the governance model, IP clarity, and community trust needed for llm‑d to evolve from a promising framework into a widely accepted standard. While IBM, Red Hat, and Google are driving core contributions and early adoption, a growing ecosystem of collaborators is actively exploring integrations with the stack. CNCF stewardship ensures that no single vendor controls the project’s direction and that it remains aligned with upstream Kubernetes APIs such as the GAIE and LWS.

Joining the CNCF also strengthens llm‑d’s mission to create well‑lit paths for production‑grade AI infrastructure. The foundation’s ecosystem provides the ideal environment for building interoperable, standards‑driven components. Ultimately, contributing llm‑d to the CNCF is about ensuring that scalable, efficient, and portable LLM inference becomes a core capability of the cloud‑native stack, not a proprietary feature locked behind closed platforms.

What’s next?

Following the announcement of llm-d’s contribution to the CNCF, the project’s next phase will focus on deepening adoption, expanding technical capabilities, and strengthening its position as the neutral, open-governance inference stack for the AI ecosystem. The donation formalizes llm-d as a community project that will grow as more collaborators join, Costa said.

A key next step is collaborating to support next-generation AI architectures. For instance, Mistral AI is currently contributing features to the llm-d ecosystem to help advance open standards around disaggregated serving. "Creating a common foundation stack has already proven its value," said Costa. "It allows the entire ecosystem to focus on pushing the boundaries of the AI platform rather than rebuilding the basic building blocks."

At the same time, IBM Research will continue driving innovation, especially in areas where the industry lacks proven solutions. This includes work at the intersection of inference and training, such as reinforcement learning, as well as advancing self‑managing, AI‑guided optimization across caching, scaling, and configuration. The boundary between large-scale inference and model adaptation jobs is blurring, and inference platforms need to adapt to this reality.

As the project matures, the broader community is actively tackling the next generation of AI infrastructure challenges. The technical roadmap introduces standardized support for multi-modal workloads, expands integration to additional inference engines, and optimizes scheduling for multi-LoRA environments alongside advanced multi-tier KV-cache offloading. These steps keep llm-d aligned with the ecosystem's evolving baseline expectations while pushing into new frontiers, positioning it to develop rapidly under CNCF governance and accelerate its role as the operating layer for distributed inference.
