Channel Shesh — Uri Maayan

Channel Shesh is an autonomous broadcast system: an alert event arrives, a pipeline of vision, transcription, LLM script generation, and dual-voice TTS produces a bilingual segment, and OBS automation puts two AI hosts on air, all in under eight seconds end to end. The information the public needs (what region, what to do, what’s known) exists in structured form in the alert API seconds before anyone can read it aloud. The latency is not a model problem; it is a pipeline-composition problem. Human presenters need minutes to produce a script, which is far outside that latency budget; closing that gap is the engineering problem the system was built to solve.

What it does

Channel Shesh listens to the Pikud HaOref siren API, pulls context from social-media clips, and puts two AI hosts on air with a bilingual script in under eight seconds. A FastAPI orchestrator runs the pipeline; OBS WebSocket v5 handles scene switching, overlays, and clip playback; two distinct Hebrew voices — Dana (urgent, energetic) and Omer (analytical, slower) — carry the dialogue.

The two-host pattern is not cosmetic. Scripted dialogue between disagreeing personalities reads very differently from single-voice narration, and it gives the system room to handle uncertainty (“we don’t have confirmation yet”) without breaking pace.

Hard parts

Sub-8 s, end to end. FastAPI receives the alert event and runs a fixed pipeline: fetch social clip → extract keyframes → Gemini Vision describes frames → Whisper transcribes audio → Gemini Flash generates a dual-host script (JSON-structured output) → Edge TTS renders both voices → OBS switches scene and plays. Vision and transcription run in parallel over the same clip; both finish in ~2 s, then feed the script generator together.
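The fan-out and fan-in described above can be sketched with `asyncio`. This is a minimal illustration, not the project's code: the stage functions (`describe_frames`, `transcribe_audio`, `generate_script`) are hypothetical stand-ins that sleep instead of calling Gemini or Whisper, and the 8-second default budget is taken from the text.

```python
import asyncio

# Hypothetical stand-ins for the real stages; the sleeps simulate model latency.
async def describe_frames(clip: str) -> str:
    await asyncio.sleep(0.2)
    return f"frames of {clip}"

async def transcribe_audio(clip: str) -> str:
    await asyncio.sleep(0.2)
    return f"audio of {clip}"

async def generate_script(vision: str, transcript: str) -> dict:
    await asyncio.sleep(0.1)
    # Assumed JSON shape: one entry per spoken line, tagged with the host.
    return {
        "lines": [
            {"host": "Dana", "text": vision},
            {"host": "Omer", "text": transcript},
        ]
    }

async def run_segment(clip: str, budget_s: float = 8.0) -> dict:
    async def pipeline() -> dict:
        # Vision and transcription run concurrently over the same clip,
        # then both results feed the script generator together.
        vision, transcript = await asyncio.gather(
            describe_frames(clip), transcribe_audio(clip)
        )
        return await generate_script(vision, transcript)

    # Enforce the end-to-end budget; raises asyncio.TimeoutError when exceeded.
    return await asyncio.wait_for(pipeline(), timeout=budget_s)

if __name__ == "__main__":
    print(asyncio.run(run_segment("clip.mp4")))
```

Wrapping the whole pipeline in a single `wait_for` gives one place to catch a blown budget and switch to a fallback segment, rather than timing each stage separately.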

Memory that stays coherent across a broadcast. Redis holds a 20-segment short-term conversation buffer that is injected into every script-generation prompt. Per-host ChromaDB collections store “memorable facts” extracted from prior episodes; that is what lets Dana call back to something Omer said a week ago. The result is two memory tiers, short-term and long-term, with a fingerprint-cached voice-line layer underneath.
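The short-term tier can be illustrated without a Redis server. The sketch below uses a bounded `deque` as a stand-in for the Redis list (in Redis, a push followed by `LTRIM` would give the same "keep the last 20" behavior). The class name, the prompt layout, and the `facts` argument standing in for long-term retrieval results are all assumptions made for illustration.

```python
from collections import deque

class ShortTermBuffer:
    """In-memory stand-in for the 20-segment conversation buffer."""

    def __init__(self, max_segments: int = 20):
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._segments = deque(maxlen=max_segments)

    def append(self, host: str, text: str) -> None:
        self._segments.append((host, text))

    def as_prompt_context(self) -> str:
        return "\n".join(f"{host}: {text}" for host, text in self._segments)

def build_prompt(buffer: ShortTermBuffer, facts: list, alert_summary: str) -> str:
    # `facts` stands in for whatever the long-term store returns for this host.
    fact_block = "\n".join(f"- {fact}" for fact in facts)
    return (
        f"Recent dialogue:\n{buffer.as_prompt_context()}\n\n"
        f"Memorable facts:\n{fact_block}\n\n"
        f"New alert:\n{alert_summary}"
    )
```

Keeping the buffer bounded by segment count rather than by tokens is the simpler choice; a token-based cap would be needed if segment length varied a lot.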

The gap between segments. Transition phrases (“moving on,” “back to you,” handoffs, de-escalations) are fingerprinted with MD5, and their rendered voice lines are kept in a 200-file LRU cache. Cache hit rate on transitions sits around 60%. Segment N+1 is pre-generated while segment N is playing, so the gap between segments is effectively invisible.
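A fingerprint-keyed LRU cache of this kind might look like the sketch below. It keeps audio bytes in memory rather than files on disk, takes the TTS call as an injected `render` function, and normalizes case and whitespace before hashing; all three are assumptions made to keep the example self-contained, not details from the project.

```python
import hashlib
from collections import OrderedDict

def fingerprint(voice: str, text: str) -> str:
    # Normalize case and whitespace so trivially different phrasings share a key.
    normalized = " ".join(text.lower().split())
    return hashlib.md5(f"{voice}|{normalized}".encode("utf-8")).hexdigest()

class TransitionCache:
    """LRU cache of rendered transition lines, keyed by MD5 fingerprint."""

    def __init__(self, render, capacity: int = 200):
        self._render = render          # hypothetical TTS callable: (voice, text) -> bytes
        self._capacity = capacity
        self._entries = OrderedDict()  # oldest entry first
        self.hits = 0
        self.misses = 0

    def get(self, voice: str, text: str) -> bytes:
        key = fingerprint(voice, text)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        audio = self._render(voice, text)
        self._entries[key] = audio
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return audio
```

MD5 is used here only as a cheap content fingerprint, not for security. The `hits` and `misses` counters are what a hit-rate figure like the one quoted above would be computed from.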

OBS as a remote scene graph. Four scenes — Idle, AlertMain, Breaking, FullClip — switched programmatically over WebSocket v5. Dynamic text overlays update mid-scene; no manual operator touching the mixer. Getting WebSocket v5 semantics right was harder than the TTS or the LLM work; OBS’s wire protocol has sharp edges around scene-item identity across reload.
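For a sense of what goes over the wire, the sketch below builds obs-websocket v5 request messages as JSON strings. The envelope (`op: 6` with `requestType`, `requestId`, `requestData`) and the request names `SetCurrentProgramScene`, `SetInputSettings`, and `GetSceneItemId` come from the v5 protocol; the helper names, the `text` settings key (which depends on the text source type), and the idea of re-resolving scene items by name are assumptions. Connection setup and the Hello/Identify handshake are deliberately left out.

```python
import json
import uuid

SCENES = ("Idle", "AlertMain", "Breaking", "FullClip")

def obs_request(request_type: str, request_data: dict) -> str:
    """Wrap a request in the obs-websocket v5 envelope (op 6 = Request)."""
    return json.dumps({
        "op": 6,
        "d": {
            "requestType": request_type,
            "requestId": str(uuid.uuid4()),  # echoed back in the response
            "requestData": request_data,
        },
    })

def switch_scene(scene: str) -> str:
    if scene not in SCENES:
        raise ValueError(f"unknown scene: {scene}")
    return obs_request("SetCurrentProgramScene", {"sceneName": scene})

def set_overlay_text(input_name: str, text: str) -> str:
    # overlay=True merges the new settings into the existing ones.
    return obs_request("SetInputSettings", {
        "inputName": input_name,
        "inputSettings": {"text": text},
        "overlay": True,
    })

def lookup_scene_item(scene: str, source_name: str) -> str:
    # Numeric scene-item IDs are looked up by name each time here rather than
    # cached, as one possible way to survive a scene-collection reload.
    return obs_request("GetSceneItemId", {
        "sceneName": scene,
        "sourceName": source_name,
    })
```

Each returned string would be sent as one text frame on an already identified WebSocket connection, and matched to its response by `requestId`.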

Failure modes for a broadcast system. Circuit breakers between every external service, graceful fallbacks when Gemini is slow or a clip fails to download, a producer dashboard that lets a human trigger or kill a segment from the browser. Ugly-mode matters more than happy-path when the topic is a live siren.
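A circuit breaker with a fallback can be sketched in a few lines. The thresholds, the half-open behavior, and the injectable clock below are illustrative assumptions, not the project's settings; the point is only the shape: after repeated failures the breaker stops calling the slow service and serves the fallback immediately.

```python
import time

class CircuitBreaker:
    """Skips a failing dependency for a cool-down period and serves a fallback."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock      # injectable so tests can control time
        self._failures = 0
        self._opened_at = None   # None means the circuit is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                return fallback()  # open: do not touch the service at all
            # Half-open: allow one trial call; a single failure reopens.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0
        return result
```

In a pipeline with a hard latency budget, the value of the open state is that a degraded service costs no time at all: the fallback (for example, a templated script) is returned without waiting for a timeout.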

Result

Development / demo state — not live on broadcast yet. End-to-end runs cleanly from mock alerts.

Eight seconds is the whole constraint. It rules out a human in the loop and most of the architectures you’d reach for first — which is why the interesting work is in the pre-generation and the circuit breakers, not the models. 114 pytest tests, and a producer dashboard that looks a lot like an emergency-room monitor.