Five independent systems — each in production, running live
during events, or shipped to a single user (me) and still useful.
Short version below; full case studies linked.
Live-transcription RAG quoting platform · Python · FastAPI · Vue 3 · ChromaDB · Whisper
Listens to live client meetings in Hebrew or English, extracts
technical requirements with an LLM, matches them to an equipment
inventory via RAG, and emits a ready-to-send .docx / .pdf quote in
seconds. 16 REST modules, 650+ pytest
tests covering 1,360+ assertions, 235-key i18n with full Hebrew
RTL, multi-tenant JWT auth, WhatsApp Business bot, QuickBooks /
Hashavshevet / Priority ERP integrations.
The interesting part: streaming partial-transcript chunks into the
RAG pipeline without re-embedding on every token, and keeping live
suggestions under perceptible latency while the speaker is still
talking.
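A minimal sketch of one way to avoid re-embedding on every token, not the production code: hold back the sentence still being spoken, and cache embeddings by content hash so only new or revised sentences reach the embedding model. The class name, the punctuation-based sentence splitter, and the `embed_fn` callable are assumptions made for illustration.

```python
import hashlib
import re
from typing import Callable

Vector = list[float]

# Split after sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


class IncrementalEmbedder:
    """Embeds only completed sentences, and each distinct sentence only once."""

    def __init__(self, embed_fn: Callable[[str], Vector]) -> None:
        self._embed_fn = embed_fn          # hypothetical embedding call
        self._cache: dict[str, Vector] = {}

    def update(self, transcript_so_far: str) -> list[Vector]:
        parts = _SENTENCE_END.split(transcript_so_far.strip())
        # The trailing piece is still being spoken unless it ends a sentence.
        if parts and not parts[-1].endswith((".", "!", "?")):
            parts = parts[:-1]
        vectors: list[Vector] = []
        for sentence in parts:
            if not sentence:
                continue
            key = hashlib.sha1(sentence.encode("utf-8")).hexdigest()
            if key not in self._cache:     # new or revised sentence only
                self._cache[key] = self._embed_fn(sentence)
            vectors.append(self._cache[key])
        return vectors
```

Calling `update` with each partial transcript returns vectors for the finished sentences only. If the transcriber revises an earlier sentence, its hash changes and just that sentence is embedded again.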
Multi-camera detection grid · Python · OpenCV · ffmpeg HW decode · MOG2 · Hungarian tracker
Nine public camera feeds in parallel with a five-stage detector
(ROI crop → MOG2 background subtraction → streak / flash
filters → Hungarian multi-frame tracking → false-positive
rejection) plus cross-camera temporal correlation. ~10 FPS per
stream on one machine with hardware-decoded ffmpeg
(AMF / D3D11VA / VideoToolbox). Built and run during active alerts.
Operational detail is under NDA; the part I can say is that the
real constraints were decoder backpressure and the false-positive
budget, not detector accuracy in isolation.
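For readers unfamiliar with cross-camera temporal correlation, here is a generic textbook-style sketch, explicitly not the system's actual logic (which stays under NDA): group detections into bursts by timestamp and keep only bursts confirmed by more than one camera. The `Detection` type, the window length, and the camera threshold are all made-up illustration values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    camera_id: str
    timestamp: float  # seconds on a clock shared by all cameras (assumed)


def correlate(
    detections: list[Detection],
    window_s: float = 2.0,
    min_cameras: int = 2,
) -> list[list[Detection]]:
    """Group detections whose gaps are <= window_s, then keep only
    groups that were seen by at least min_cameras distinct cameras."""
    groups: list[list[Detection]] = []
    current: list[Detection] = []
    for det in sorted(detections, key=lambda d: d.timestamp):
        # A gap longer than the window closes the current burst.
        if current and det.timestamp - current[-1].timestamp > window_s:
            groups.append(current)
            current = []
        current.append(det)
    if current:
        groups.append(current)
    return [
        g for g in groups
        if len({d.camera_id for d in g}) >= min_cameras
    ]
```

A burst seen by a single camera is dropped, which is one simple way to spend a false-positive budget: an isolated flash on one feed never becomes an event on its own.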
Multi-LLM delegation middleware · Python 3.12 · MCP · asyncio · 6 provider APIs
An MCP server that lets Claude Code delegate structured subtasks to
a pool of free-tier LLM providers. DAG-based plan execution with
dynamic task claiming under asyncio.Semaphore(4),
context injection across dependencies, specialist routing by task
type, self-healing fallbacks on failure, and a final senior-model
integration-review pass. Routing gets better with use: per-provider,
per-task-type outcomes feed a scoring formula that biases future
dispatches. The point isn’t cost — it’s keeping
the senior model’s tokens for design and review instead of
boilerplate.
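The DAG execution with context injection can be sketched roughly as follows. This is a simplified stand-in, not the middleware's code: each task waits on an `asyncio.Event` per dependency rather than claiming work from a queue, there is no cycle detection, and fallbacks and provider routing are left out. `run_plan`, `fake_worker`, and the dictionary shapes are hypothetical names chosen for the example.

```python
import asyncio
from typing import Awaitable, Callable

Worker = Callable[[str, str, dict[str, str]], Awaitable[str]]


async def run_plan(
    tasks: dict[str, str],        # task name -> prompt
    deps: dict[str, list[str]],   # task name -> names it depends on
    worker: Worker,
    max_parallel: int = 4,
) -> dict[str, str]:
    sem = asyncio.Semaphore(max_parallel)
    results: dict[str, str] = {}
    done = {name: asyncio.Event() for name in tasks}

    async def run_one(name: str) -> None:
        # Wait for dependencies BEFORE taking a slot, so blocked
        # tasks never hold one of the max_parallel permits.
        for dep in deps.get(name, ()):
            await done[dep].wait()
        # Context injection: hand upstream outputs to the worker.
        context = {dep: results[dep] for dep in deps.get(name, ())}
        async with sem:
            results[name] = await worker(name, tasks[name], context)
        done[name].set()

    await asyncio.gather(*(run_one(name) for name in tasks))
    return results


async def fake_worker(name: str, prompt: str, context: dict[str, str]) -> str:
    """Stand-in for a provider call; echoes which context it received."""
    await asyncio.sleep(0)
    return f"{name}({','.join(sorted(context))})"
```

Running `asyncio.run(run_plan(tasks, deps, fake_worker))` returns one result per task, and every task starts only after the outputs it depends on exist.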
Autonomous Hebrew broadcast system · FastAPI · OBS WebSocket · Gemini · Edge TTS · Redis
Pikud HaOref alert → social-clip fetch → keyframe + audio
analysis (Gemini + Whisper in parallel) → dual-host Hebrew
script → Edge TTS synthesis → OBS scene switching.
Sub-8-second siren-to-air. Two personality-distinct AI co-hosts
(Dana / Omer) with a 20-segment Redis short-term memory and per-host
ChromaDB long-term memory. Circuit breakers on every external call;
a producer dashboard where “kill this segment” is one click.
The design axiom: “the stream must not die” is the hardest
requirement, and every external call has to be written so that its
failure cannot take the stream down.
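A circuit breaker around an external call can be as small as the sketch below. It illustrates the general pattern, not the system's implementation: the thresholds, the `fallback` convention, and the class and method names are assumptions. After repeated failures the breaker stops calling the dependency for a cooldown period and returns the fallback immediately, so a dead API costs no time on air.

```python
import time
from typing import Any, Callable


class CircuitBreaker:
    """Fail fast to a fallback after repeated failures of one dependency."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_failures = max_failures
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: float | None = None

    @property
    def is_open(self) -> bool:
        return self._opened_at is not None

    def call(self, fn: Callable[..., Any], *args: Any,
             fallback: Any = None, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                return fallback  # open: skip the call entirely
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()  # open, or re-open
            return fallback
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Returning a fallback instead of raising is a deliberate choice for a live stream: the caller always gets something it can put on air, such as a cached clip or a pre-written line.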
Pure-PyTorch Mamba2 + Attention backbone · ROCm 7.12 · RX 6800 XT (gfx1030)
Ported a state-space Hebrew TTS model to AMD GPUs by rewriting the
hybrid Mamba2 + Attention backbone in pure PyTorch — removing
CUDA-only dependencies (mamba_ssm,
causal_conv1d, flash_attn). Implemented
SSD chunked scan, RMSNormGated, and single-step decode from scratch
with weight layouts matching the pretrained checkpoint exactly.
War-story catalogue: a bf16 SSM state-drift bug that collapsed
audio past ~2.8 s (fixed by forcing the recurrence to fp32
while keeping I/O in bf16); silent corruption in ROCm’s cuDNN
SDPA path, bisected and worked around with a math-SDPA fallback; matching
flash_attn’s half-split rotary convention exactly;
ROCm SDPA memory-access faults under enable_gqa=True.
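The bf16 state-drift failure mode is easy to reproduce in miniature. The toy below is a stdlib-only emulation, not the model's SSD scan: it runs a scalar recurrence h = decay * h + x and rounds the stored state to bf16, to fp32, or not at all. The decay, input value, and step count are invented for the demonstration, and bf16 is emulated by rounding a float32 bit pattern to its top 16 bits.

```python
import struct
from typing import Callable


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # rounding bias on low 16 bits
    bits &= 0xFFFF0000                    # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def scan(inputs: list[float], decay: float,
         round_state: Callable[[float], float]) -> float:
    """Run h = decay * h + x, storing h in the given precision each step."""
    h = 0.0
    for x in inputs:
        h = round_state(decay * h + x)
    return h


def demo(steps: int = 2000, decay: float = 0.999) -> tuple[float, float, float]:
    inputs = [to_bf16(0.1)] * steps                # bf16 inputs in every run
    reference = scan(inputs, decay, lambda v: v)   # fp64 state
    bf16_state = scan(inputs, decay, to_bf16)      # state stored in bf16
    # The fix in miniature: fp32 state, output cast back to bf16.
    fp32_state = to_bf16(scan(inputs, decay, to_fp32))
    return reference, bf16_state, fp32_state
```

With these toy values the bf16 state stalls well short of the reference: once the state is large enough, each per-step update is smaller than half the spacing between adjacent bf16 values and rounds away. The fp32-state run stays within bf16 output rounding of the reference.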
Full write-up →