Porting Mamba2 to ROCm

Companion to the Zonos Hebrew TTS case study — the kernel-level write-up of why a Hebrew voice now runs on a consumer AMD GPU. Read the case study for the product story; this post is the ROCm internals.

Mamba2 is a state-space model that replaces attention with a selective scan — a recurrent operation with a structured state matrix that can be parallelized across the sequence dimension using a parallel prefix scan algorithm. It is fast on CUDA. On ROCm it mostly didn’t work.

This post is a record of what broke, what I patched, and what the throughput looks like now on a consumer AMD card (RX 6800 XT under Windows, ROCm 7.12 nightly).

What Mamba2 actually is

Standard SSMs compute a sequence-to-sequence mapping:

h_t = A * h_{t-1} + B * x_t
y_t = C * h_t

The matrices A, B, C are input-dependent in Mamba — that is the “selective” part. The recurrence is then parallelized using an associative scan (the same structure as a parallel prefix sum), which is what makes it fast on GPU.

Mamba2 restricts the state matrix A to a scalar times the identity, which simplifies the math and allows a chunked-scan algorithm that is more memory-efficient than the full Mamba1 scan. The reference implementation uses a custom CUDA kernel for the chunked scan — selective_scan_cuda in the original repo, and mamba_chunk_scan_combined in the Mamba2 reference.

Those kernels are the problem.

The CUDA-to-HIP translation gap

HIP is AMD’s CUDA compatibility layer. For most CUDA code, hipify (the automated translation tool) works reasonably well — you get a compilable HIP file that runs on an AMD GPU. For custom kernels that use CUDA-specific intrinsics or rely on warp-level primitives, hipify gets you maybe 80% of the way.

The Mamba2 selective-scan kernel uses:

__shfl_down_sync for warp reductions
atomicAdd on half-precision (__half) inputs
Custom PTX for the associative scan inner loop in the CUDA 12 build path

HIP supports __shfl_down_sync — it maps to __shfl_down with an implicit full-warp mask. Atomic adds on __half are supported in ROCm 5.5+ via __hip_atomic_fetch_add. The PTX path is not supported and has to be replaced entirely.

Hipifying the kernel

hipify-clang selective_scan_fwd_kernel.cuh \
  --cuda-path=/usr/local/cuda \
  -o selective_scan_fwd_kernel_hip.cuh

The output compiles. It does not produce correct results. The associative scan inner loop uses __ldg (texture cache load) in a way that hipify translates literally but that on RDNA2 hardware does not hit the L2 cache the same way. The effect is a correctness bug on sequences longer than the chunk size, where state accumulation across chunk boundaries is off by a small floating-point error that compounds.

The fix: replace the PTX path and the __ldg calls with explicit __builtin_nontemporal_load hints, which on HIP map to the GPU’s read-only cache path rather than relying on __ldg’s implicit texture-cache semantics.

// CUDA original
float val = __ldg(&ptr[idx]);

// HIP replacement
float val = __builtin_nontemporal_load(&ptr[idx]);

This is not a general solution — it was tested on RDNA2 (RX 6800 XT, gfx1030). Behavior on RDNA3 (RX 7000 series, gfx1100) and on the datacenter CDNA targets (MI200 / MI300) may differ; test before trusting.

The half-precision atomic

// Original CUDA (requires sm_70+)
atomicAdd(reinterpret_cast<__half*>(dout_ptr), val);

// HIP equivalent (ROCm 5.5+)
__hip_atomic_fetch_add(
    reinterpret_cast<_Float16*>(dout_ptr),
    (_Float16)val,
    __ATOMIC_RELAXED,
    __HIP_MEMORY_SCOPE_AGENT);

On ROCm < 5.5 you have to promote to float32, which costs a read-modify-write cycle. The work here was done against ROCm 7.12 nightly, so the _Float16 path applies directly.

MIOpen vs cuDNN gaps

The upstream Mamba2 PyTorch wrapper uses torch.nn.functional.conv1d for the 1-D depthwise convolution in the SSM block. On CUDA, PyTorch routes this through cuDNN’s implicit GEMM path, which is fast for the small kernel sizes Mamba2 uses (kernel size 4). On ROCm, the same call goes through MIOpen.

MIOpen’s depthwise conv1d for small kernels is significantly slower than cuDNN’s — on kernel size 4, the ratio is wide enough that an explicit implementation is worth writing. The workaround is to do the depthwise conv directly with an einsum or a manual unfold for kernel size 4 or smaller, avoiding the MIOpen path entirely and running through the standard PyTorch compute graph:

def depthwise_conv1d_fallback(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # x: (B, D, L), weight: (D, 1, K)
    B, D, L = x.shape
    K = weight.shape[-1]
    # pad left with K-1 zeros (causal)
    x_pad = torch.nn.functional.pad(x, (K - 1, 0))
    # unfold and contract
    x_unfold = x_pad.unfold(-1, K, 1)  # (B, D, L, K)
    return (x_unfold * weight.squeeze(1)).sum(-1)

For small D (around 128) and L=2048 this is faster than MIOpen on the RX 6800 XT. For larger D it stops paying off — there’s a crossover where MIOpen’s GEMM dispatch wins; the threshold depends on driver and hardware, so worth measuring on the target box rather than trusting a single number.

Loading the patched kernel

The cleanest approach is to keep the patched HIP kernel in a separate file and load it through torch.utils.cpp_extension:

import torch.utils.cpp_extension as ext
import os

_lib = ext.load(
    name="mamba2_hip",
    sources=[
        os.path.join(os.path.dirname(__file__), "csrc/selective_scan_hip.cpp"),
        os.path.join(os.path.dirname(__file__), "csrc/selective_scan_fwd_kernel_hip.cuh"),
    ],
    extra_cflags=["-O3"],
    extra_cuda_cflags=["--offload-arch=gfx1030"],  # RX 6800 XT (RDNA2)
    verbose=False,
)

The --offload-arch flag must match your hardware. gfx1030 is the RDNA2 target for the RX 6800 / 6900 series; gfx1100 is the RDNA3 target for the RX 7900 series. Consumer Radeon cards are RDNA — the CDNA architectures (gfx90a for MI200, gfx942 for MI300) are datacenter-only and not what this build is targeting.

Throughput

The patched build runs the Zonos Transformer TTS pipeline (which uses this kernel inside its SSM block) at roughly 8.3 tokens/sec on a single RX 6800 XT — real-time factor ~0.14, 3.3 GB VRAM of 16 — with correctness checked against the CUDA reference on the same prompts. The absolute number is less interesting than the fact that any consumer-AMD Mamba2 throughput exists; the naive hipified path produces incorrect output before you can measure it.

The patched build passes the Mamba2 reference test suite (tests/ops/test_selective_scan.py) with tolerance atol=1e-3, rtol=1e-3 on float32. Half-precision tests require atol=1e-2.

What I’d do differently

The main mistake was trying to hipify the complete kernel file first and then debugging the result. The better approach is to hipify the simpler pieces — the reduction utilities, the data-loading helpers — verify each one on a microbenchmark, then assemble them into the full kernel. The associative scan core should be rewritten from scratch in HIP rather than translated; the PTX assumptions don’t carry over and the rewrite is not that long.

The MIOpen depthwise-conv issue was the bigger surprise. The gap between MIOpen and cuDNN on small-kernel depthwise convolutions is wide enough that it is worth routing around it at the Python level rather than waiting for MIOpen to close it.