Uri Maayan

Project

Zonos Hebrew TTS

A Hebrew voice running on an AMD GPU, across three TTS backends and two tool-chains of yak-shaving.

Role
ML engineer
Stack
Python 3, PyTorch, Zonos Transformer, VoxCPM.cpp, F5-TTS, ROCm 7.12 nightly, Vulkan SDK, phonikud-onnx, espeak-ng, Descript Audio Codec
Metrics
Zonos Transformer ~8.3 tok/s · RTF ~0.14 · 3.3 GB VRAM of 16 · three backends evaluated end-to-end on RX 6800 XT + Windows

Hebrew TTS has a small public ecosystem, and most of it assumes CUDA. The goal here was a useful Hebrew voice running on an AMD RX 6800 XT under Windows, with no NVIDIA card in the loop, plus an orchestration layer above it that could swap model backends without rewriting the pipeline. “No NVIDIA in the loop” is harder than it sounds. The Hebrew-finetuned Zonos Hybrid variant needs CUDA-only Mamba kernels; VoxCPM.cpp’s upstream build assumed GCC flag conventions that MSVC doesn’t accept; MIOpen still has gaps relative to cuDNN on small depthwise convolutions. Each backend took a different detour.

What it does

Three backends were evaluated end-to-end:

  • Zonos Transformer (1.6 B). Ships. The Hebrew-finetuned Zonos Hybrid variant needs CUDA-only Mamba kernels, so the generic transformer path was used instead. Throughput on ROCm 7.12 nightly lands at ~8.3 tokens/sec with an RTF of ~0.14 (144 s of compute for 13 s of audio), using 3.3 GB of the card’s 16 GB of VRAM.
  • VoxCPM2 via VoxCPM.cpp (Vulkan). Model weights load cleanly — F16 at 4.73 GB, Q8_0 at 2.5 GB — but the graph fails at prefill with a ggml_concat dtype mismatch. Upstream bug, unresolved. Left the rig patched (MSVC build flags, 64-bit file offsets, C++20 migration) so the path isn’t lost when the upstream fix lands.
  • F5-TTS-Hebrew. Integrated, not primary. Usable as a fallback.

The Hebrew-specific preprocessing matters as much as the model choice. Text goes through phonikud-onnx for diacritic restoration (niqqud), then espeak-ng for phoneme conversion. Audio I/O uses the Descript Audio Codec at 44 kHz. The orchestration layer around these is thin but important: one entry point, swap the backend with a config flag, same Hebrew input, same audio output format.
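A minimal sketch of what such a thin orchestration layer could look like, assuming a registry keyed by a config string. Every name here (`speak`, `register`, `get_backend`, the stub classes, `add_niqqud`, `to_phonemes`) is hypothetical and not taken from the project; the preprocessing functions are pass-through stand-ins and do not call phonikud-onnx or espeak-ng, and the backends return placeholder silence instead of running a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol

SAMPLE_RATE = 44_100  # assumption: the "44 kHz" in the text is read as 44.1 kHz


@dataclass
class Audio:
    """Common output format shared by every backend."""
    sample_rate: int
    samples: List[float]


class Backend(Protocol):
    def synthesize(self, phonemes: str) -> List[float]: ...


_REGISTRY: Dict[str, Callable[[], Backend]] = {}


def register(name: str):
    """Class decorator that files a backend factory under a config name."""
    def wrap(factory):
        _REGISTRY[name] = factory
        return factory
    return wrap


@register("zonos-transformer")
class ZonosTransformerStub:
    def synthesize(self, phonemes: str) -> List[float]:
        return [0.0] * len(phonemes)  # placeholder silence, no model call


@register("f5-tts-hebrew")
class F5HebrewStub:
    def synthesize(self, phonemes: str) -> List[float]:
        return [0.0] * len(phonemes)  # placeholder silence, no model call


def add_niqqud(text: str) -> str:
    return text  # hypothetical stand-in for the phonikud-onnx step


def to_phonemes(text: str) -> str:
    return text  # hypothetical stand-in for the espeak-ng step


def get_backend(name: str) -> Backend:
    """Instantiate the backend registered under `name`."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        known = sorted(_REGISTRY)
        raise ValueError(f"unknown backend {name!r}; known: {known}") from None


def speak(text: str, backend: str = "zonos-transformer") -> Audio:
    """Single entry point: same text in, same Audio format out."""
    phonemes = to_phonemes(add_niqqud(text))
    return Audio(SAMPLE_RATE, get_backend(backend).synthesize(phonemes))
```

With this shape, switching models is a one-string change, for example `speak(text, backend="f5-tts-hebrew")`, and a backend that does not work yet simply stays out of the registry until it does.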

The yak-shaving

Two memorable detours.

First, safetensors segfaulted when loading weights on Windows across every PyTorch build tried. The workaround was a numpy-backed BF16 → FP16 converter that bypasses safetensors entirely for the one format conversion that needed it. Not elegant; shipped.
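The converter itself is not shown here, so the following is only a sketch of how a numpy-backed BF16 → FP16 pass could work without importing the safetensors library. It relies on two facts: a BF16 value is the upper 16 bits of an IEEE-754 float32, and a safetensors file is an 8-byte little-endian header length, a JSON header, then a raw byte buffer. The function names are hypothetical, and non-BF16 tensors are simply skipped in this sketch.

```python
import json
import struct

import numpy as np


def bf16_bytes_to_fp16(raw: bytes, shape) -> np.ndarray:
    """Decode little-endian BF16 bytes into a float16 array."""
    bits = np.frombuffer(raw, dtype="<u2").astype(np.uint32)
    # Shifting the 16 BF16 bits into the top half of a 32-bit word
    # yields the bit pattern of the equivalent float32.
    as_f32 = (bits << 16).view(np.float32)
    # Values outside the FP16 range become inf in this cast.
    return as_f32.astype(np.float16).reshape(shape)


def read_bf16_tensors_as_fp16(fileobj) -> dict:
    """Parse a safetensors-format stream by hand; return BF16 tensors as FP16."""
    (header_len,) = struct.unpack("<Q", fileobj.read(8))
    header = json.loads(fileobj.read(header_len))
    data_start = 8 + header_len  # data_offsets are relative to this point
    tensors = {}
    for name, meta in header.items():
        if name == "__metadata__" or meta["dtype"] != "BF16":
            continue  # this sketch only handles the BF16 entries
        begin, end = meta["data_offsets"]
        fileobj.seek(data_start + begin)
        raw = fileobj.read(end - begin)
        tensors[name] = bf16_bytes_to_fp16(raw, meta["shape"])
    return tensors
```

Usage would be along the lines of `with open(path, "rb") as f: weights = read_bf16_tensors_as_fp16(f)`. BF16 has a much wider exponent range than FP16, so a real converter would want to check for values that overflow to infinity before saving the result.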

Result

The main result is a decision: “CUDA-only” is not a final answer on hardware that costs real money.

Some of this work is useful to others — the Mamba2 ROCm port is written up separately. Most of it was personal: a Hebrew voice running on the GPU I already owned, in a pipeline thin enough to outlive its current backends.

What I’d do differently

Stop treating “CUDA-only” as a final answer, and do it earlier. I spent too long trying to coerce CUDA-only Mamba kernels onto ROCm before switching to the generic-transformer Zonos variant. The productive path was to pick the backend that actually compiles for the hardware at hand and accept the quality delta.