Technical note

Building PyTorch-native support for the IBM Spyre Accelerator

We recently published our 1H 2026 roadmap for enabling the IBM Spyre Accelerator in the PyTorch ecosystem. In a companion technical note, we described the hardware — its 32 active AI cores, mixed-precision SIMD-systolic arrays, and programmable dataflow design.

Here, we walk through the key themes of that roadmap and how we are building first-class PyTorch support for a dataflow accelerator in an ecosystem largely shaped by GPUs. Our philosophy is ecosystem-first: we leverage upstream mechanisms, minimize custom code, and contribute back the pieces that make it easier for the next accelerator to follow the same path.

Integrating with torch.inductor

We are extending inductor out-of-tree to handle the abstractions that dataflow accelerators require. Three extensions matter most. First, we are introducing tile-based tensor layouts so the compiler can reason about the block-structured data movement that Spyre's ring-connected cores expect. Second, we are adding multi-core work-division passes that partition tiles across Spyre's 32 cores during compilation rather than at runtime. Third, we are adding scratchpad optimization: Spyre cores use explicitly managed on-chip memory rather than hardware caches, and inductor needs to account for this when scheduling data. Together, these extensions will let torch.compile produce efficient Spyre code for every priority model in our 1H 2026 scope, from Llama 3.1 8B to Granite 4 Hybrid 30B.
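To make the compile-time work-division idea concrete, here is a minimal pure-Python sketch of statically partitioning a 2-D grid of tiles across a fixed set of cores. The function name and the balanced-contiguous-chunks strategy are illustrative assumptions, not Spyre's actual pass; the real pass operates on inductor's IR, not Python lists.

```python
# Hypothetical sketch of compile-time work division: assign a 2-D grid of
# tiles to a fixed number of cores in contiguous, balanced chunks, so the
# split is fully decided before any code runs on the device.
# partition_tiles and NUM_CORES are illustrative names, not Spyre APIs.

NUM_CORES = 32

def partition_tiles(rows: int, cols: int, num_cores: int = NUM_CORES):
    """Map each (row, col) tile index to a core, keeping per-core tile
    counts within one of each other (a static, compile-time split)."""
    tiles = [(r, c) for r in range(rows) for c in range(cols)]
    base, extra = divmod(len(tiles), num_cores)
    assignment, start = {}, 0
    for core in range(num_cores):
        count = base + (1 if core < extra else 0)
        assignment[core] = tiles[start:start + count]
        start += count
    return assignment

plan = partition_tiles(16, 8)  # 128 tiles spread evenly over 32 cores
```

Because the split is computed during compilation, the runtime never has to negotiate work ownership between cores; each core's tile list is baked into the generated program.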

From inductor to silicon: The compiler stack

We are taking a two-stage approach to the backend compiler intermediate representation (IR) that sits between inductor's high-level graph and Spyre machine code.

In the first stage, SuperDSC (SDSC) will serve as the backend compiler IR — the single entry point for all operation lowering and code generation. Every torch op required by our priority models will be expressible in SDSC, providing a clean separation between the PyTorch integration layer and hardware-specific optimization.
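The "single entry point for all operation lowering" pattern can be sketched as a registry that maps each torch op to a constructor for a backend-IR node. Everything below is hypothetical: the SDSC interfaces are not public, so `SDSCOp`, the opcode strings, and the registry are invented purely to illustrate the shape of such a layer.

```python
# Illustrative-only sketch of a single-entry-point lowering layer: every
# supported torch op maps to one backend-IR node constructor. "SDSCOp" and
# the opcode names are hypothetical; the real SDSC interfaces differ.
from dataclasses import dataclass, field

@dataclass
class SDSCOp:
    name: str                              # backend IR opcode
    inputs: list = field(default_factory=list)
    attrs: dict = field(default_factory=dict)

LOWERINGS = {}

def register_lowering(torch_op: str):
    """Decorator that records the lowering rule for one torch op."""
    def wrap(fn):
        LOWERINGS[torch_op] = fn
        return fn
    return wrap

@register_lowering("aten.mm")
def lower_mm(a, b):
    return SDSCOp("sdsc.matmul", inputs=[a, b])

@register_lowering("aten.relu")
def lower_relu(x):
    return SDSCOp("sdsc.eltwise", inputs=[x], attrs={"fn": "relu"})

def lower(torch_op: str, *args):
    """The single entry point: every op goes through this table."""
    if torch_op not in LOWERINGS:
        raise NotImplementedError(f"no SDSC lowering for {torch_op}")
    return LOWERINGS[torch_op](*args)
```

A registry like this is what makes the "every torch op required by our priority models will be expressible" claim checkable: the set of supported ops is exactly the set of keys in the table.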

In the second stage, we will transition to KernelTile IR (KTIR), a community-aligned specification more in line with emerging initiatives like TileIR. KTIR will generalize the tile-level representation so that other dataflow accelerators — not just Spyre — can use it for lower-level scheduling and code generation. We plan to publish the complete KTIR spec in the first half of the year and are designing the open-source scheduling algorithms that sit on top of it to be adaptable beyond our own hardware.  
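To give a feel for what "tile-level representation" means, here is a toy record of the kind of information a tile IR instruction might carry: an opcode, a tile shape, an explicit memory space (since there are no hardware caches), and a compile-time core assignment. The field names are invented; the published KTIR spec may look quite different.

```python
# A toy sketch of what a tile-level IR record might carry, to make the KTIR
# idea concrete. Field names are invented; the published spec may differ.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class TileOp:
    opcode: str                  # e.g. "tile.load", "tile.matmul", "tile.store"
    tile_shape: Tuple[int, int]  # block processed by one core per step
    mem_space: str               # "scratchpad" or "external" - explicit, no caches
    core: int                    # core assignment fixed at compile time

# One core's slice of a tiled matmul, as a linear schedule of tile ops.
program = [
    TileOp("tile.load",   (64, 64), "scratchpad", core=0),
    TileOp("tile.matmul", (64, 64), "scratchpad", core=0),
    TileOp("tile.store",  (64, 64), "external",   core=0),
]
```

The point of standardizing a record like this is that scheduling algorithms can be written against the fields alone, which is what would let other dataflow accelerators reuse them.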

Runtime and distributed inference

Spyre will register as a PyTorch device entirely through out-of-tree extensions: device lifecycle, memory management, data transfer, and dispatch. Our target is 100% of registration handled this way, with less than 5% overhead compared to direct device access. We plan to contribute the generic primitives we build back into PyTorch core's OpenReg testing infrastructure.
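The lifecycle portion of such a registration can be sketched as the device-module interface an out-of-tree backend typically exposes. In a real integration these hooks would be backed by driver calls and wired up through PyTorch's out-of-tree backend mechanisms; here they are plain-Python stubs, and the class name is hypothetical.

```python
# Hypothetical outline of the device-module hooks an out-of-tree PyTorch
# backend exposes for device lifecycle. In a real integration these would
# call into the driver and be registered with PyTorch; these are stubs.

class SpyreDeviceModule:
    def __init__(self, num_devices: int = 1):
        self._num_devices = num_devices
        self._current = 0

    def is_available(self) -> bool:
        """Lifecycle: can the runtime see any devices?"""
        return self._num_devices > 0

    def device_count(self) -> int:
        return self._num_devices

    def current_device(self) -> int:
        return self._current

    def set_device(self, idx: int) -> None:
        if not 0 <= idx < self._num_devices:
            raise ValueError(f"bad device index {idx}")
        self._current = idx
```

Memory management, data transfer, and dispatch sit behind analogous hooks; keeping all of them out-of-tree is what the 100% registration target refers to.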

For multi-card inference, we are compiling functional collective operations (all-reduce, all-gather) through torch.inductor, which will give us distributed inference across all priority models in 1H 2026. Longer-term, we plan to migrate to torch.distributed and eventually torch.comms as the community communication layer stabilizes.  
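The semantics of the functional collectives being compiled here can be shown with a pure-Python model: each rank holds a shard, the collective returns new values rather than mutating in place, and every rank ends with the same result. This mimics the functional style only; it is not the torch.distributed API, and the function names are illustrative.

```python
# Pure-Python illustration of all-reduce / all-gather semantics across
# cards: inputs in, fresh values out (no in-place mutation), and every
# rank receives the same combined result. Not the torch.distributed API.

def all_reduce(per_rank_values):
    """Elementwise sum across ranks; every rank gets the full sum."""
    total = [sum(col) for col in zip(*per_rank_values)]
    return [list(total) for _ in per_rank_values]

def all_gather(per_rank_values):
    """Concatenate every rank's shard; every rank gets all shards."""
    gathered = [v for shard in per_rank_values for v in shard]
    return [list(gathered) for _ in per_rank_values]

shards = [[1, 2], [3, 4], [5, 6]]  # 3 cards, 2 elements each
```

Expressing collectives functionally is what lets the compiler treat them as ordinary graph nodes and schedule them alongside compute, rather than as opaque side-effecting calls.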

Serving models with vLLM

Production inference will run through vLLM. We are enabling Spyre as a vLLM platform plugin, adopting upstream model implementations rather than maintaining our own forks. Our priority models will serve end-to-end through vLLM on Spyre.
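vLLM discovers platform plugins through Python entry points, so registering a platform out-of-tree is largely a packaging concern. A packaging fragment for such a plugin might look like the sketch below; the package and module names are invented for illustration, and the exact entry-point contract should be checked against the current vLLM plugin documentation.

```toml
# Hypothetical packaging fragment for a Spyre vLLM platform plugin.
# Package and module names are invented; verify the entry-point group
# against the vLLM plugin docs before relying on it.
[project]
name = "vllm-spyre"

[project.entry-points."vllm.platform_plugins"]
spyre = "vllm_spyre:register"
```

Because discovery happens at import time via entry points, no change to vLLM itself is needed to add a platform, which is what makes the no-forks approach workable.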

Two optimizations will drive practical usability. A new Spyre attention backend will remove the homogeneous sequence-length constraint, directly reducing inter-token latency. And improved torch.compile artifact caching in upstream vLLM will bring startup time down to a few seconds. We are collaborating with the vLLM community to stabilize the platform plugin interface.  

Testing at every layer

We are building a layered test pyramid that will validate the full stack: op-level correctness, inductor compilation and lowering, module-level tests (including attention, normalization, and activations), top-level model quality and performance, and end-to-end vLLM inference. All tests will be scoped to the priority models and will run nightly, with regression failures flagged within hours.
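The bottom rung of that pyramid, op-level correctness, usually amounts to comparing accelerator output against a reference implementation within mixed-precision tolerances. A minimal pure-Python sketch of such a check follows; real tests would compare Spyre tensors against a CPU eager reference, and the tolerance values here are placeholders, not our actual thresholds.

```python
# Minimal sketch of the op-level rung of the test pyramid: compare a device
# result against a reference within relative/absolute tolerance, as one
# would for mixed-precision hardware. Tolerances here are placeholders.

def allclose(result, reference, rtol=1e-2, atol=1e-3):
    """Elementwise |a - b| <= atol + rtol * |b|, reduced with all()."""
    if len(result) != len(reference):
        return False
    return all(abs(a - b) <= atol + rtol * abs(b)
               for a, b in zip(result, reference))

# e.g. a relu kernel under test vs. its reference implementation
def relu_ref(xs):
    return [max(x, 0.0) for x in xs]
```

Module- and model-level tests reuse the same comparison idea at coarser granularity, which is what makes a nightly regression signal cheap to interpret: a failure points at the lowest layer where outputs diverged.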

We are building this CI infrastructure as an out-of-tree CI contribution to the PyTorch ecosystem, establishing patterns that other accelerator teams can adopt. Our target is above 95% pass rate on nightly runs, with the full pipeline completing in under three hours.

Contributing back

Being ecosystem-first means giving back, not just building on top. Three contributions stand out this half. We plan to upstream OpenReg primitives so that out-of-tree device testing becomes a first-class PyTorch capability. We are working to generalize KTIR as a community specification so that dataflow accelerators share a common tile-level IR rather than each inventing their own. And we will document out-of-tree CI patterns so that the next accelerator team does not have to build this infrastructure from scratch.

Design documents and RFCs live in our public repository. We welcome engagement — whether reviewing the KTIR spec, trying the Spyre trace analyzer, or contributing to the conversation on what PyTorch-native accelerator support should look like.
