Deploying Mamba2 in production requires more than algorithmic innovation: it also demands kernel-level and system-level optimization. In this poster, we present our work on accelerating Mamba2 inference in vLLM using Triton. We extended the Triton kernels to support continuous batching and chunked prefill, enabling dynamic request admission and incremental computation over long contexts. Beyond these features, we introduced a single fused Mamba SSD kernel that replaces the original five-kernel pipeline, and we implemented a Triton-based Conv1D kernel that resolves the memory layout inconsistencies limiting performance. Combined with system-level optimizations such as reduced CPU-GPU synchronization, these changes deliver up to 6× higher throughput. By replacing CUDA kernels with Triton, we achieved cross-platform portability and unlocked new opportunities for fine-grained performance tuning. We share insights from these optimizations to guide developers in building high-performance, portable inference kernels for next-generation architectures like Mamba2.
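To illustrate the fusion principle behind the single SSD kernel, the Triton kernel below is a minimal sketch only, not the poster's actual implementation: it fuses two elementwise stages (a hypothetical SiLU gate followed by a rescale) that would otherwise run as two launches with an intermediate round trip through global memory. The names `fused_gate_scale_kernel` and `fused_gate_scale`, and the choice of stages, are illustrative assumptions.

```python
# Minimal Triton fusion sketch (illustrative; not the poster's SSD kernel).
# Two elementwise stages execute in a single launch, so the intermediate
# result stays in registers instead of being written to and re-read from HBM.
import torch
import triton
import triton.language as tl


@triton.jit
def fused_gate_scale_kernel(x_ptr, g_ptr, out_ptr, scale, n_elements,
                            BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    g = tl.load(g_ptr + offsets, mask=mask)
    # Stage 1 (a separate kernel if unfused): SiLU gating.
    gated = x * g * tl.sigmoid(g)
    # Stage 2 (another kernel if unfused): rescale; `gated` never leaves registers.
    tl.store(out_ptr + offsets, gated * scale, mask=mask)


def fused_gate_scale(x: torch.Tensor, g: torch.Tensor, scale: float) -> torch.Tensor:
    """Launch the fused kernel over contiguous tensors of equal shape."""
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    fused_gate_scale_kernel[grid](x, g, out, scale, n, BLOCK_SIZE=1024)
    return out
```

The same reasoning motivates collapsing the five SSD kernels into one: every intermediate kept on-chip removes a kernel launch and a global-memory round trip, overheads that are most visible at the small per-step workloads typical of decoding.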