IBM is proud to sponsor KubeCon + CloudNativeCon North America 2025.
The Cloud Native Computing Foundation’s flagship conference brings together adopters and technologists from leading open source and cloud native communities.
Be a part of the conversation as CNCF Graduated, Incubating, and Sandbox Projects unite for four days of collaboration, learning, and innovation to drive the future of cloud native computing.
Steve Bek, IBM
As cloud-native architectures grow more complex, developers are under pressure to resolve incidents faster, yet root cause analysis remains a time-consuming, expertise-heavy task. In this keynote, we’ll unveil how Instana’s new Intelligent Incident Investigation, powered by agentic AI, empowers developers to ask natural language questions and instantly surface the insights they need. This breakthrough capability accelerates incident resolution by up to 80%, helping teams cut through operational noise and reduce costly downtime. Join us to explore how AI-driven observability is reshaping the developer experience and redefining what’s possible in modern incident response.
James Mellinger, IBM
Cloud-native development continues to evolve, bringing unprecedented power, but also increasing complexity. In this keynote, we’ll explore emerging patterns and best practices for building resilient, secure applications in hybrid and regulated environments. You’ll learn how AI-driven automation is transforming operational workflows, with a spotlight on IBM Concert’s latest innovations: the CISO Agent and Resilience Agent. These tools leverage architecture diagrams, runbooks, and app artifacts to generate resilience profiles and monitoring strategies, illustrating how intelligent agents can enhance reliability and compliance. Whether you're using IBM tools or other platforms, this session offers actionable insights for navigating today’s cloud-native landscape with confidence.
Alessandro Pomponio, IBM
When you let researchers loose on your Kubernetes clusters, it doesn’t take long before the whole place turns into the Wild West: interactive GPU pods left running for days, large CPU-only jobs stampeding onto GPU nodes, and resources vanishing like water in the desert. So we did what any good admin team would: we brought in the sheriffs - Kyverno, Kueue, and Argo CD - to lay down the law and bring order to the frontier.
In this talk, we’ll share how we used these tools to enforce fine-grained policies, implement fair-share GPU scheduling, and automate governance across our Accelerated Discovery bare-metal clusters. No custom code, no cowboy hacks - just smart policy design and GitOps discipline.
Whether you’re managing research workloads or just trying to keep your clusters from descending into chaos, this session will show you how policy-as-code can save you a thousand headaches - and maybe a few GPUs too.
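To give a flavor of what policy-as-code looks like in practice, here is a minimal sketch of a Kyverno ClusterPolicy that requires CPU and memory requests and limits on Pods in research namespaces. It is an illustration in the spirit of the talk, not one of the policies presented; the policy name, namespace wildcard, and message are assumptions.

```python
# Illustrative sketch only: a Kyverno ClusterPolicy requiring CPU/memory
# requests and limits on Pods in research namespaces. Policy name,
# namespace wildcard, and message are assumptions, not the talk's policies.
import yaml

policy = {
    "apiVersion": "kyverno.io/v1",
    "kind": "ClusterPolicy",
    "metadata": {"name": "require-requests-limits"},
    "spec": {
        "validationFailureAction": "Enforce",
        "rules": [{
            "name": "validate-resources",
            "match": {"any": [{"resources": {
                "kinds": ["Pod"],
                "namespaces": ["research-*"],  # assumption: research namespaces share this prefix
            }}]},
            "validate": {
                "message": "CPU and memory requests and limits are required.",
                "pattern": {"spec": {"containers": [{
                    "resources": {
                        "requests": {"cpu": "?*", "memory": "?*"},
                        "limits": {"cpu": "?*", "memory": "?*"},
                    }
                }]}},
            },
        }],
    },
}

# Pipe the output to `kubectl apply -f -` to try it on a test cluster.
print(yaml.safe_dump(policy, sort_keys=False))
```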
Sunyanan Choochotkaew & Tatsuhiro Chiba, IBM Research
llm-d is a community-driven effort to modernize large language model serving at scale—natively within Kubernetes. At its core is a modular architecture that decouples prefill and decode operations. This disaggregated design unlocks precise tuning of compute and network resources, tailored to the unique demands of each phase.
But here’s the twist: how precisely can those resources be defined? A GPU unit? A MIG slice? Maybe even something finer? With a new capability proposed for the Dynamic Resource Allocation (DRA) framework, resource capacities for compute and network devices can now be requested and adjusted on the fly. At the same time, the core DRA capability enables device selection based on fine-grained attributes—including topology awareness—eliminating the need for clunky hacks or rigid resource pools.
In this talk, we will demonstrate how this new DRA capability makes the llm-d framework more feasible and cost-effective, explore the remaining challenges, and share practical insights.
Carlos Sanchez, Adobe & Kevin Dubois, IBM
Your software rollouts to production are probably always flawless, right? For the rest of us, once in a while we do run into issues when releasing code to production. Argo Rollouts is a great tool to help mitigate these issues by progressively delivering software to production, and automatically rolling back new features if anything doesn’t go right.
Wouldn’t it be nice if we could take this functionality to the next level? We can take advantage of the advances made in agentic AI and instruct a model to analyze the logs when a rollout fails. Then, thanks to the use of agents, it can take action on our behalf, such as fixing the code or the deployment manifests on the fly, creating new PRs, and sending notifications. The sky is really the limit.
Come to this session to learn how to combine Argo Rollouts with Agentic AI to achieve the most seamless release experience yet.
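For context, the sketch below shows what a minimal Argo Rollouts canary strategy looks like. It is illustrative rather than taken from the session; the image, traffic weights, and pause durations are assumptions, and the agentic analysis described above would kick in when such a rollout degrades or aborts.

```python
# Illustrative sketch only: a minimal Argo Rollouts canary strategy.
# Image, weights, and pause durations are assumptions.
import yaml

rollout = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Rollout",
    "metadata": {"name": "demo-app"},
    "spec": {
        "replicas": 5,
        "selector": {"matchLabels": {"app": "demo-app"}},
        "template": {
            "metadata": {"labels": {"app": "demo-app"}},
            "spec": {"containers": [{
                "name": "demo-app",
                "image": "ghcr.io/example/demo-app:1.2.0",  # assumption
            }]},
        },
        "strategy": {"canary": {"steps": [
            {"setWeight": 20},              # send 20% of traffic to the canary
            {"pause": {"duration": "2m"}},  # observe metrics/logs before continuing
            {"setWeight": 50},
            {"pause": {"duration": "5m"}},
        ]}},
    },
}

print(yaml.safe_dump(rollout, sort_keys=False))
```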
Martin Hickey, IBM & Junchen Jiang, University of Chicago
LLMs are powering copilots, search engines, document understanding, and chatbots. Most real-world AI apps route their workloads through GPU clusters running high-throughput inference engines. For enterprises, however, the key concerns are still cost and return on investment (ROI). Enter LMCache, an open source LLM serving engine extension that reduces Time to First Token (TTFT) and increases throughput. In this talk, we’ll show how you can reduce GPU costs and token latency using LMCache, and demonstrate its high-performance KV cache management layer and its integration with well-known production inference engines like vLLM and KServe, deployed on a Kubernetes cluster. We'll use real-world examples like document analysis and high-speed RAG support. Get a glimpse into the growing community behind the OSS KV caching layer that is impacting ROI for companies like Red Hat, IBM, Google, NVIDIA, and CoreWeave.
Mariusz Sabath & Maia Iyer, IBM Research
Agentic workflows in cloud-native environments demand robust identity and authorization. This session explores how to move beyond hard-coded credentials by assigning trusted, granular identities to agents acting on behalf of users. We'll dive into strategies for establishing traceability, enforcing least privilege, and enabling auditable decision-making within a zero-trust architecture.
Focusing on shared agents and tool-calling patterns, we'll demonstrate how SPIRE’s workload identity integrates with user identity to support secure delegation and dynamic, context-aware authorization. You’ll learn how to safeguard agent interactions with external tools and data sources through identity propagation and policy enforcement.
Through a real-world case study using Llama Stack and the extended Model Context Protocol (MCP), attendees will gain actionable insights to build secure, identity-aware agentic platforms ready for production use.
Alex Scammon, G-Research; Abhishek Malvankar, IBM Research; Marlow Warnicke, SchedMD; Dan Desjardins, Distributive
Data transfer is slow -- so in AI and HPC, data locality matters. As workloads scale, optimizing where and how to run data-heavy workloads in Kubernetes becomes critical. Yet this area remains underexplored. The CNCF Batch Subproject shares findings from our work on data-locality-aware scheduling across clusters. Should we move compute to the data or the data to compute? What are the trade-offs in latency, cost, and efficiency?
We present methods to test potential policies: splitting jobs, exposing location-aware metadata from compute/storage, and basing scheduling on historical data and pricing. We share early discoveries from real-world tests across regions with limited bandwidth, storage, and power.
If your workloads are bottlenecked by data gravity -- or you’re chasing GPU efficiency across sites -- join us to explore emerging patterns for intelligent, cost-aware data placement in Kubernetes.
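To make “data gravity” concrete, here is a purely illustrative back-of-envelope comparison of moving the data to remote compute versus waiting for compute near the data. It is our own toy model, not the subproject's methodology, and every number is a placeholder you would replace with measured values.

```python
# Purely illustrative back-of-envelope model (not the subproject's):
# compare shipping the dataset to a remote GPU site vs. queueing for local GPUs.
# All numbers are assumptions you would replace with measured values.

def transfer_hours(dataset_gb: float, bandwidth_gbps: float) -> float:
    """Time to move the dataset over a WAN link, in hours."""
    return (dataset_gb * 8) / (bandwidth_gbps * 3600)

def move_data_cost(dataset_gb: float, bandwidth_gbps: float,
                   remote_queue_h: float, egress_per_gb: float) -> tuple[float, float]:
    """(extra latency in hours, dollar cost) of moving data to remote compute."""
    return transfer_hours(dataset_gb, bandwidth_gbps) + remote_queue_h, dataset_gb * egress_per_gb

def move_compute_cost(local_queue_h: float) -> tuple[float, float]:
    """(extra latency in hours, dollar cost) of waiting for compute near the data."""
    return local_queue_h, 0.0

# Example: 5 TB dataset, 10 Gbps link, $0.05/GB egress, queues of 1 h remote vs 6 h local.
print("move data   :", move_data_cost(5000, 10, remote_queue_h=1, egress_per_gb=0.05))
print("move compute:", move_compute_cost(local_queue_h=6))
```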
Boaz Michaely, Red Hat & Adi Sosnovich, IBM Research
Kubernetes networking by default is a malicious actor’s heaven.
Why? Because by default, any pod can send and receive traffic to and from any other pod, ignoring namespace and privilege boundaries. External traffic in both directions is allowed as well, as far as Kubernetes is concerned.
Indeed, best practices rightfully dictate that this default be modified using Kubernetes NetworkPolicies.
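For readers who want a concrete starting point, the sketch below emits the standard default-deny baseline for a single namespace; this is a common example rather than material from the session, and the namespace name is an assumption. Locking everything down is the easy part—authoring the allow rules each application actually needs is where the difficulty described next comes in.

```python
# Standard default-deny NetworkPolicy baseline (namespace name is an assumption).
import yaml

deny_all = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "default-deny-all", "namespace": "demo"},
    "spec": {
        "podSelector": {},                     # empty selector: applies to every Pod in the namespace
        "policyTypes": ["Ingress", "Egress"],  # deny all traffic in both directions by default
    },
}

print(yaml.safe_dump(deny_all, sort_keys=False))
```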
Yet most teams find this too difficult to implement. Authoring NetworkPolicy YAML is very challenging. Baseline/AdminNetworkPolicy fills a gap for cluster administrators, but authoring these policies and understanding their impact is a new, additional challenge. Furthermore, policy authors may not know what the application’s communication needs are.
What if there was a way to automatically produce tight network policy rules, in YAML, and see the impact of applied B/ANP network policies?
Join this session to see the magic yourself, and learn how you can leverage this technology today!
Moderated by Katie Norton; Alex Zenla, Edera; Jason Hall, Chainguard; Jon Ceanfaglione, IBM
After a decade of "microservices all the things," the industry is experiencing a fascinating recalibration. Organizations that rushed to decompose monoliths are now grappling with distributed system complexity, operational overhead, and the cognitive load on development teams. This panel explores how modern organizations are making more intentional architectural choices and evolving their approach to software consumption and deployment.
This panel will cover:
Maia Iyer, Alan Cha & Mariusz Sabath, IBM Research; Anjali Telang & Andrew Block, Red Hat
Agentic platforms are redefining how cloud-native applications interact—but behind every action lies a critical question: who is allowed to do what, and why? Emerging standards such as MCP allow AI agents to easily connect with tools, but organizations looking to support agents must maintain security and transparency. They can do so by combining the power of OAuth 2.0 with strongly attested workload identity from SPIFFE.
In this hands-on workshop, we’ll dive into the mechanics of secure workload identity for agents and tools—no prior experience required. Attendees will work with a live agentic stack, including MCP for agentic tool calling, and integrate it with cloud-native tools such as SPIRE for workload identity and Keycloak for user management. These existing technologies are key to enabling granular access control and rich audit trails across the full agentic flow. This workshop lays the foundations for building identity-first, zero-trust agentic platforms.
Paolo Patierno, IBM & Michael Morris, Ericsson Software Technology
Strimzi is best known for its operators, but its ecosystem includes a rich set of components that make Apache Kafka on Kubernetes truly production-ready. This talk dives into the broader Strimzi landscape: the HTTP Bridge for RESTful Kafka access, the Drain Cleaner for safe node maintenance, the OAuth library for secure authentication, the Access Operator for declarative user and ACL management, and the Metrics Reporter for enhanced observability. We’ll also touch on other complementary tools like the Kubernetes Config Provider for dynamic configuration and the MQTT Bridge for IoT integration. Whether you're running Kafka at scale or exploring cloud-native streaming for the first time, this session will offer a practical look at how the full Strimzi ecosystem works together to simplify and strengthen your deployment.
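As a small taste of one of those components, producing a record through the HTTP Bridge is a plain REST call. The sketch below is illustrative rather than from the talk; the bridge service address and topic name are assumptions.

```python
# Illustrative sketch only: producing a record through the Strimzi HTTP Bridge.
# The bridge address and topic name are assumptions.
import requests

BRIDGE = "http://my-bridge-bridge-service:8080"  # assumption: in-cluster bridge service
TOPIC = "orders"                                 # assumption

resp = requests.post(
    f"{BRIDGE}/topics/{TOPIC}",
    json={"records": [{"key": "order-42", "value": {"item": "book", "qty": 1}}]},
    headers={"Content-Type": "application/vnd.kafka.json.v2+json"},  # JSON embedded-data format
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # per-record partition/offset assignments
```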
Ryan Jarvinen, Red Hat & Daniel Oh, IBM
Running Java applications in Kubernetes brings a set of performance expectations: fast startup, low memory usage, and efficient container images. This session is a hands-on walkthrough of tools and techniques to help meet those goals. You'll learn how to use Jib to build lean container images, accelerate cold starts with GraalVM native image compilation, and improve runtime responsiveness with Class Data Sharing (CDS) and Coordinated Restore at Checkpoint (CRaC). We'll dive into real-world configuration examples, discuss trade-offs, and demonstrate how to combine these tools to boost performance in Kubernetes-native Java workloads.
Blaine Gardner, IBM
This session introduces the Rook project to attendees of all levels of experience. Rook is an open source cloud-native storage operator for Kubernetes, providing the platform, framework, and support for Ceph to natively integrate with Kubernetes. The session will discuss various scenarios to show how Rook configures Ceph to provide stable block, shared file system, and object storage for your production data. Rook was accepted as a graduated project by the Cloud Native Computing Foundation in October 2020.
Sunyanan Choochotkaew, IBM Research & John Belamaric, Google
Divvying up a network card using Kubernetes is really hard to do. If you need to spin up virtual interfaces on top of a NIC, limit their bandwidth, and hand them out to different Pods, you will have a rough time.
Come find out how the Kubernetes project will make sharing network hardware just as easy as sharing node CPU and memory! And networking is just the initial use case - this functionality can work with any device. Being able to sub-divide devices will really improve utilization of your pricey hardware.
In this talk, we detail a new way to request resources from attached devices like NICs, GPUs, and DPUs. Building on the recently released Dynamic Resource Allocation (DRA) feature, this capability performs on-demand provisioning based on resource requests, allowing a physical device to be independently shared among Pods multiple times. It extends K8s multi-tenancy to the sub-device level. We’ll dive deep and explore real-world use cases, under-the-hood details, and future extensions.
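For orientation, the sketch below shows the basic shape of a DRA request today: a ResourceClaim that asks for a device from a device class, and a Pod that consumes the claim. Field names follow the beta API (resource.k8s.io/v1beta1, as of Kubernetes 1.32) and may differ in newer releases; the device class name is an assumption, and the sub-device sharing capability discussed in the talk builds on top of this.

```python
# Illustrative sketch only: requesting a device via Dynamic Resource Allocation.
# API version/field names reflect the DRA beta (Kubernetes 1.32) and may change;
# the device class name is an assumption published by a hypothetical DRA driver.
import yaml

claim = {
    "apiVersion": "resource.k8s.io/v1beta1",
    "kind": "ResourceClaim",
    "metadata": {"name": "shared-vnic"},
    "spec": {"devices": {"requests": [{
        "name": "vnic",
        "deviceClassName": "vnic.networking.example.com",  # assumption
    }]}},
}

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "claim-consumer"},
    "spec": {
        "resourceClaims": [{"name": "vnic", "resourceClaimName": "shared-vnic"}],
        "containers": [{
            "name": "app",
            "image": "registry.k8s.io/pause:3.9",
            "resources": {"claims": [{"name": "vnic"}]},  # bind the claim to this container
        }],
    },
}

print(yaml.safe_dump_all([claim, pod], sort_keys=False))
```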
Jing Chen, IBM Research; Junchen Jiang, University of Chicago; Ganesh Kudleppanavar, NVIDIA; Samuel Monson, Red Hat; Jason Kramberger, Google
As organizations deploy LLMs as distributed stacks in production Kubernetes environments, optimizing inference performance has become critical. This collaborative tutorial brings together experts from Google, NVIDIA, Red Hat, IBM, and the University of Chicago (LMCache) to provide practical benchmarking techniques for impactful LLM optimization strategies.
Using identified use cases as examples, we'll show how to benchmark key optimization strategies: KV Cache offloading, autoscaling, prefix/session-aware routing, KVCache-aware routing, and xPyD for prefill decode disaggregation. Attendees will learn a unified benchmarking approach integrating tools including vLLM, LMBenchmark, GuideLLM, GenAIperf, inference-perf, and fmperf. Through live demonstrations, participants gain hands-on experience with production-tested methodologies reflecting real-world scenarios. Attendees will be equipped to implement these approaches for data-driven LLM serving optimizations on Kubernetes.
Maroon Ayoub, IBM & Michey Mehta, Red Hat
Kubernetes excels at stateless service routing - but modern AI workloads are not stateless. Generative workloads demand context-aware routing that maximizes performance while reducing costs.
This talk explores layered routing strategies for stateful LLM workloads on Kubernetes - from round-robin to full KV-Cache-aware load balancing. We’ll explain when each level applies, and its effects on performance.
Based on our experience developing llm-d - a framework using the K8s Gateway API Inference Extension, a collaboration between Google, IBM Research, and Red Hat - we’ll cover:
- Why traditional Kubernetes routing falls short for generative AI
- Routing patterns for long-context, sessionful traffic
- Global cache indices and local offloading for smart routing
- Benchmarks showing latency, cache hit rates, and GPU utilization
- Practical ways to adopt cache-aware routing without major infra changes
If you’re scaling multi-turn, agentic, or LLM-powered workloads, this session is for you.
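To make “cache-aware” concrete, here is a toy routing sketch of the core idea: prefer the replica that already holds the longest cached prefix of the conversation, then break ties by load. It is our own illustration, not llm-d code, and the block size and data structures are assumptions.

```python
# Toy illustration (not llm-d code) of prefix/cache-aware routing:
# send a request to the replica that already holds the longest matching
# prefix of the prompt, otherwise to the least-loaded replica.
import hashlib

BLOCK = 512  # characters per hashed prompt block (assumption)

def blocks(prompt: str) -> list[str]:
    """Hash the prompt in growing prefixes, mimicking KV-cache block keys."""
    return [hashlib.sha256(prompt[: i + BLOCK].encode()).hexdigest()
            for i in range(0, len(prompt), BLOCK)]

def pick_replica(prompt: str, cache_index: dict[str, set[str]],
                 load: dict[str, int]) -> str:
    """cache_index maps replica name -> block hashes it has cached."""
    def prefix_hits(replica: str) -> int:
        hits = 0
        for block in blocks(prompt):
            if block not in cache_index[replica]:
                break
            hits += 1
        return hits

    # Prefer cache hits, then lower load.
    return max(cache_index, key=lambda r: (prefix_hits(r), -load[r]))

# Example: replica "b" has the first block of this conversation cached.
history = "system: you are a helpful assistant..." + "x" * 600
index = {"a": set(), "b": set(blocks(history)[:1])}
print(pick_replica(history + " user: next turn", index, load={"a": 3, "b": 5}))
```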
Daniel Oh & Kevin Dubois, IBM
This session delves into the critical aspects of developing production-ready Large Language Model (LLM) applications using Java. We'll explore how to leverage Java's strengths to build scalable and efficient LLM systems, addressing key challenges such as performance optimization, resource management, and seamless integration with existing infrastructures.
Attendees will gain practical knowledge on handling massive datasets, optimizing model inference, and fine-tuning LLMs for optimal performance. We'll discuss strategies for ensuring the reliability and scalability of your LLM deployments, empowering you to create robust and high-performing AI applications. Whether you're a seasoned Java developer or new to the AI domain, this session will provide valuable insights and guidance for your LLM development journey, equipping you with the tools and knowledge to navigate the complexities of building production-grade LLM systems.
Prajakta Kashalkar-Joshi & Socheat Sou, IBM
When working on a cloud-native product that is an aggregate of separate products, each with its own cadence, clear communication is critical to smooth integration. Ideally, each product team wants to know when the other related products have published a new release and whether it's ready to be integrated with their product. It becomes an exponential, logistical nightmare to have each team subscribe to notifications from every other team. Using an "integration repository" helps solve both the business and technical needs of product integration and currency. In this talk, learn how the Fusion DevOps team turned the pull request into a mechanism for streamlining team handoffs, notifying the appropriate focal points, and clearly defining boundaries of responsibility between teams.
Chen Wang, IBM Research & Huamin Chen, Red Hat
This research-driven talk introduces a novel architecture paradigm that complements recent advances in timely intelligent inference routing for large language models. By integrating proxy-based classification and reranking techniques, we've developed a system that efficiently routes incoming prompts to domain-specialized LLMs based on rapid content analysis. Our approach creates a meta-layer of intelligence above traditional model serving infrastructures, enabling specialized models to handle queries they're optimized for while maintaining a unified API interface. We'll present performance research comparing this distributed approach against monolithic inference-time scaling, demonstrating how intelligent routing can achieve superior results for complex, multi-domain workloads while reducing computational overhead. The session includes a Kubernetes-based reference implementation and quantitative analysis of throughput, latency, and accuracy across diverse prompt categories.
Martin Bartoš & Ryan Emerson, IBM
In order to mitigate the impact of CVEs and allow continuous delivery of features, it is crucial that upgrades can be rolled out seamlessly. For stateless applications, zero-downtime upgrades are a solved problem, but for stateful applications, upgrades can present a significant challenge.
As the leading open-source identity and access management solution, Keycloak is a critical component in many organizations' infrastructure. Achieving maximum uptime is vital in order for dependent services to function.
Join us to discover how Keycloak has evolved to support zero-downtime rollouts of configuration changes and patch upgrades. In this talk we explain the technical and project management challenges we faced, the measures taken to overcome them and what best practices you can leverage in your projects to enable zero-downtime upgrades. Key focus areas will be the Keycloak Operator, how we ensure clustering compatibility, testing strategies and our plans for the future.
Jitendra Singh, IBM India Pvt. Ltd.
Kubernetes observability tools, like Fluent Bit, OpenTelemetry, and Loki, provide deep visibility, but they also handle sensitive data: user identifiers, tokens, and internal service metadata. Even with encryption at rest and in transit, telemetry data is often exposed during collection and processing.
In this lightning talk, we’ll show how to secure observability pipelines on Kubernetes using confidential computing-enabled nodes. We demonstrate how observability components (e.g., Fluent Bit, OpenTelemetry Collector, Loki) can run inside hardware-isolated Kubernetes nodes, ensuring that telemetry data is encrypted at the source and only processed by trusted, attested workloads. Attendees will learn a practical, zero-intrusion design that combines Kubernetes-native observability tools with confidential compute infrastructure to deliver end-to-end encrypted, trusted observability, ideal for regulated workloads in finance, healthcare, and government.
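As a rough idea of the deployment shape, the sketch below pins an OpenTelemetry Collector Pod to confidential-computing nodes via a runtime class and node selector. It is illustrative only; the runtime class name and node label are assumptions that depend on how the confidential node pool is provisioned.

```python
# Illustrative sketch only: pinning an OpenTelemetry Collector Pod to
# confidential-computing nodes. Runtime class name and node label are assumptions.
import yaml

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "otel-collector", "namespace": "observability"},
    "spec": {
        "runtimeClassName": "kata-cc",  # assumption: confidential-containers runtime class
        "nodeSelector": {"node.example.com/confidential": "true"},  # assumption: label on CC-capable nodes
        "containers": [{
            "name": "otel-collector",
            "image": "otel/opentelemetry-collector:latest",
        }],
    },
}

print(yaml.safe_dump(pod, sort_keys=False))
```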
Ezra Silvera, IBM & Michael Hrivnak, Red Hat
Running AI/ML workloads in Pods on bare-metal is common for maximizing GPU performance but lacks strong isolation and flexibility.
In this talk, we share how we use KubeVirt to run high-performance AI workloads inside VMs with NVIDIA GPUs and NVLink, achieving near bare-metal speeds. This enables multi-tenancy, improved security, and resource partitioning—critical for service providers and cost-efficient for customers. We’ll show how VM-based worker nodes enable virtual Kubernetes clusters on shared infrastructure, supporting both full bare-metal nodes and partitioned-node use cases. We'll also dive into challenges like integrating NVIDIA Fabric Manager with the Kubernetes/KubeVirt workflow, optimizing NUMA and PCI topology, and aligning Kubernetes scheduling with VM-based GPU layouts. Finally, we’ll share customer use cases demonstrating the need for isolated, high-performance AI environments using Kubernetes-native tooling.
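To illustrate the building block involved, the sketch below shows a KubeVirt VirtualMachine with an NVIDIA GPU passed through to the guest. It is illustrative rather than from the talk; the GPU deviceName depends on the cluster's permittedHostDevices configuration, and the sizing values are assumptions.

```python
# Illustrative sketch only: a KubeVirt VirtualMachine with a passed-through GPU.
# The deviceName string and sizing values are assumptions.
import yaml

vm = {
    "apiVersion": "kubevirt.io/v1",
    "kind": "VirtualMachine",
    "metadata": {"name": "gpu-worker"},
    "spec": {
        "running": True,
        "template": {"spec": {
            "domain": {
                "cpu": {"cores": 16},
                "memory": {"guest": "128Gi"},
                "devices": {
                    "gpus": [{"name": "gpu0",
                              "deviceName": "nvidia.com/GA100_A100_PCIE_40GB"}],  # assumption
                    "disks": [{"name": "rootdisk", "disk": {"bus": "virtio"}}],
                },
            },
            "volumes": [{"name": "rootdisk",
                         "containerDisk": {"image": "quay.io/containerdisks/fedora:latest"}}],
        }},
    },
}

print(yaml.safe_dump(vm, sort_keys=False))
```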