Deploying Mamba2 in production requires more than algorithmic innovation: it also demands kernel-level and system-level optimization. In this poster, we present our work on accelerating Mamba2 inference in vLLM using Triton. We extended the Triton kernels to support continuous batching and chunked prefill, enabling dynamic request admission and incremental computation over long contexts. Beyond these features, we introduced a single fused Mamba SSD kernel that replaces the original five-kernel pipeline, and we implemented a Triton-based Conv1D kernel that resolves the memory layout inconsistencies limiting performance. Combined with system-level optimizations such as reduced CPU-GPU synchronization, these changes deliver up to 6× higher throughput. By replacing CUDA kernels with Triton, we achieved cross-platform portability and unlocked new opportunities for fine-grained performance tuning. We share insights from these optimizations to guide developers in building high-performance, portable inference kernels for next-generation architectures like Mamba2.
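To illustrate the fusion principle behind the single SSD kernel, the Triton kernel below is a minimal sketch only, not the poster's actual implementation: it fuses two elementwise stages (a hypothetical SiLU gate followed by a rescale) that would otherwise run as two launches with an intermediate round trip through global memory. The names `fused_gate_scale_kernel` and `fused_gate_scale`, and the choice of stages, are illustrative assumptions.

```python
# Minimal Triton fusion sketch (illustrative; not the poster's SSD kernel).
# Two elementwise stages execute in a single launch, so the intermediate
# result stays in registers instead of being written to and re-read from HBM.
import torch
import triton
import triton.language as tl


@triton.jit
def fused_gate_scale_kernel(x_ptr, g_ptr, out_ptr, scale, n_elements,
                            BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    g = tl.load(g_ptr + offsets, mask=mask)
    # Stage 1 (a separate kernel if unfused): SiLU gating.
    gated = x * g * tl.sigmoid(g)
    # Stage 2 (another kernel if unfused): rescale; `gated` never leaves registers.
    tl.store(out_ptr + offsets, gated * scale, mask=mask)


def fused_gate_scale(x: torch.Tensor, g: torch.Tensor, scale: float) -> torch.Tensor:
    """Launch the fused kernel over contiguous tensors of equal shape."""
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    fused_gate_scale_kernel[grid](x, g, out, scale, n, BLOCK_SIZE=1024)
    return out
```

The same reasoning motivates collapsing the five SSD kernels into one: every intermediate kept on-chip removes a kernel launch and a global-memory round trip, overheads that are most visible at the small per-step workloads typical of decoding.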