Zonos Hebrew TTS — Uri Maayan

Hebrew TTS has a small public ecosystem, and most of it assumes CUDA. The goal here was a useful Hebrew voice running on an AMD RX 6800 XT under Windows, with no NVIDIA card in the loop, and with an orchestration layer above it that could swap model backends without rewriting the pipeline.

“No NVIDIA in the loop” is harder than it sounds. The Hebrew-finetuned Zonos Hybrid variant needs CUDA-only Mamba kernels; VoxCPM.cpp’s upstream build assumed GCC flag conventions that MSVC doesn’t accept; MIOpen has its own gap with cuDNN on small depthwise convs. Each backend took a different detour.

What it does

Three backends were evaluated end-to-end:

Zonos Transformer (1.6 B). Ships. The Hebrew-finetuned Zonos Hybrid variant needs CUDA-only Mamba kernels, so the generic transformer path was used instead. Throughput on ROCm 7.12 nightly lands at ~8.3 tokens/sec with an RTF of ~0.14 (144 s of compute for 13 s of audio), 3.3 GB VRAM of 16 available.
VoxCPM2 via VoxCPM.cpp (Vulkan). Model weights load cleanly — F16 at 4.73 GB, Q8_0 at 2.5 GB — but the graph fails at prefill with a ggml_concat dtype mismatch. Upstream bug, unresolved. Left the rig patched (MSVC build flags, 64-bit file offsets, C++20 migration) so the path isn’t lost when the upstream fix lands.
F5-TTS-Hebrew. Integrated, not primary. Usable as a fallback.

The Hebrew-specific preprocessing matters as much as the model choice. Text goes through phonikud-onnx for diacritic restoration (niqqud), then espeak-ng for phoneme conversion. Audio I/O uses the Descript Audio Codec at 44 kHz. The orchestration layer around these is thin but important: one entry point, swap the backend with a config flag, same Hebrew input, same audio output format.
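The "one entry point, swap the backend with a config flag" shape can be sketched roughly as below. This is an illustration, not the project's actual code: `Audio`, `register`, `add_niqqud`, `to_phonemes`, `synthesize`, and the backend names are all hypothetical, the two preprocessing functions are pass-through placeholders rather than real phonikud-onnx or espeak-ng calls, and 44,100 Hz is an assumed reading of "44 kHz".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Audio:
    samples: List[float]
    sample_rate: int


# Hypothetical registry: backend name -> function from phonemes to Audio.
_BACKENDS: Dict[str, Callable[[str], Audio]] = {}

SAMPLE_RATE = 44_100  # assumption: "44 kHz" taken to mean 44.1 kHz


def register(name: str):
    """Decorator that adds a backend to the registry under `name`."""
    def decorator(fn: Callable[[str], Audio]):
        _BACKENDS[name] = fn
        return fn
    return decorator


def add_niqqud(text: str) -> str:
    # Placeholder for the diacritic-restoration step (phonikud-onnx in the text).
    return text


def to_phonemes(text: str) -> str:
    # Placeholder for the phoneme-conversion step (espeak-ng in the text).
    return text


@register("zonos")
def zonos_stub(phonemes: str) -> Audio:
    # Stand-in for the primary backend; returns silence of a fixed length.
    return Audio(samples=[0.0] * 8, sample_rate=SAMPLE_RATE)


@register("f5")
def f5_stub(phonemes: str) -> Audio:
    # Stand-in for the fallback backend; same output format.
    return Audio(samples=[0.0] * 8, sample_rate=SAMPLE_RATE)


def synthesize(text: str, config: dict) -> Audio:
    """Single entry point: shared preprocessing, backend chosen by config."""
    name = config.get("backend", "zonos")
    if name not in _BACKENDS:
        raise ValueError(f"unknown backend: {name!r}")
    phonemes = to_phonemes(add_niqqud(text))
    return _BACKENDS[name](phonemes)
```

Because every backend takes the same preprocessed input and returns the same `Audio` type, switching from `{"backend": "zonos"}` to `{"backend": "f5"}` changes nothing for the caller.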

The yak-shaving

Two memorable detours.

First, safetensors segfaulted when loading weights on Windows across every PyTorch build tried. The workaround was a numpy-backed BF16 → FP16 converter that bypasses safetensors entirely for the one format conversion that needed it. Not elegant; shipped.
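A minimal sketch of what such a converter can look like, assuming the standard safetensors file layout (an 8-byte little-endian header length, a JSON header, then raw tensor bytes) and a little-endian host. The function names are hypothetical and this is not the converter that shipped. BF16 is the top 16 bits of an FP32 value, so widening is a 16-bit left shift; the cast down to FP16 then rounds, and values beyond FP16's range become infinities.

```python
import json
import struct

import numpy as np


def bf16_to_fp16(raw: bytes, shape) -> np.ndarray:
    """Reinterpret raw BF16 bytes as FP32, then cast down to FP16."""
    bits = np.frombuffer(raw, dtype="<u2")
    # Move the 16 BF16 bits into the high half of a 32-bit word.
    as_f32 = (bits.astype(np.uint32) << 16).view(np.float32)
    return as_f32.astype(np.float16).reshape(shape)


def read_bf16_tensors(blob: bytes) -> dict:
    """Parse a safetensors byte blob by hand and convert its BF16 tensors."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + header_len])
    data = blob[8 + header_len:]
    tensors = {}
    for name, info in header.items():
        if name == "__metadata__" or info["dtype"] != "BF16":
            continue  # this sketch only handles BF16 entries
        begin, end = info["data_offsets"]
        tensors[name] = bf16_to_fp16(data[begin:end], info["shape"])
    return tensors
```

For a real checkpoint the blob would come from `open(path, "rb").read()` or a memory map; the sketch takes bytes so it can be exercised without a file.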

MSVC + ROCm + Vulkan

Second, building VoxCPM.cpp against MSVC required patching the upstream CMake for flag compatibility, migrating one file to C++20, defining M_PI (which MSVC's math headers do not provide unless _USE_MATH_DEFINES is set), and fixing a 64-bit file-offset truncation bug in the weight store. The rig now compiles cleanly under MSVC + ROCm + Vulkan. The upstream ggml_concat bug still blocks the prefill path on current main, so the patched binary is shelved until that is resolved.
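The write-up does not say where in the weight store the truncation happened. As an illustration of the general failure mode only, the sketch below assumes an offset passing through a 32-bit signed integer, which is the width of `long` under MSVC. The helper name is hypothetical, and the file sizes are the ones quoted earlier, read as decimal gigabytes.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(offset: int) -> int:
    """Return what `offset` becomes after truncation to a signed 32-bit integer."""
    wrapped = offset & 0xFFFFFFFF
    return wrapped - (1 << 32) if wrapped >= (1 << 31) else wrapped


# Approximate end-of-file offsets for the two weight files mentioned above.
f16_size = 4_730_000_000   # "F16 at 4.73 GB"
q8_size = 2_500_000_000    # "Q8_0 at 2.5 GB"

for label, size in (("F16", f16_size), ("Q8_0", q8_size)):
    print(label, "fits in int32:", size <= INT32_MAX, "truncated:", wrap_int32(size))
```

Both files are larger than a signed 32-bit offset can address, so any seek or tell that goes through a 32-bit type lands at the wrong position or goes negative. Keeping offsets in a 64-bit type end to end avoids this.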

Result

The main result is a decision: “CUDA-only” is not a final answer on hardware that costs real money.

Some of this work is useful to others — the Mamba2 ROCm port is written up separately. Most of it was personal: a Hebrew voice running on the GPU I already owned, in a pipeline thin enough to outlive its current backends.