Project
Channel Shesh
Two AI hosts, one OBS scene graph, sub-8-second event-to-air.
- Role: Engineer
- Stack: Python 3, FastAPI, OBS WebSocket v5, Gemini 2.5 Flash, Edge TTS (Dana / Omer), Redis, ChromaDB, Docker
- Metrics: 114 pytest tests · sub-8 s alert-to-air pipeline · 2 parallel Hebrew voices · 20-segment Redis short-term memory · circuit breakers on every external call
Channel Shesh is an autonomous broadcast system: an alert event arrives, a pipeline of vision, transcription, LLM script generation, and dual-voice TTS produces a bilingual segment, and OBS automation puts two AI hosts on air, all in under eight seconds end to end. The information the public needs (what region, what to do, what’s known) exists in structured form in the alert API seconds before anyone can read it aloud. The latency is not a model problem; it is a pipeline-composition problem. Human presenters need minutes to turn that data into a script; closing the gap between those minutes and an eight-second budget is the engineering problem the system was built to solve.
What it does
Channel Shesh listens to the Pikud HaOref siren API, pulls context from social-media clips, and puts two AI hosts on air with a bilingual script in under eight seconds. A FastAPI orchestrator runs the pipeline; OBS WebSocket v5 handles scene switching, overlays, and clip playback; two distinct Hebrew voices — Dana (urgent, energetic) and Omer (analytical, slower) — carry the dialogue.
The two-host pattern is not cosmetic. Scripted dialogue between disagreeing personalities reads very differently from single-voice narration, and it gives the system room to handle uncertainty (“we don’t have confirmation yet”) without breaking pace.
Hard parts
Sub-8 s, end to end. FastAPI receives the alert event and runs a fixed pipeline: fetch social clip → extract keyframes → Gemini Vision describes frames → Whisper transcribes audio → Gemini Flash generates a dual-host script (JSON-structured output) → Edge TTS renders both voices → OBS switches scene and plays. Vision and transcription run in parallel over the same clip; both finish in ~2 s, then feed the script generator together.
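The parallel step above can be sketched with asyncio. This is a minimal illustration, not the project's code: the stage functions (`describe_frames`, `transcribe_audio`, `generate_script`) are hypothetical stand-ins that sleep instead of calling a model, and the script shape is an assumed example of JSON-structured output.

```python
import asyncio

# Hypothetical stand-ins for the real stages; each sleeps to simulate latency.
async def describe_frames(clip: str) -> str:
    await asyncio.sleep(0.2)  # stands in for keyframe extraction + vision call
    return f"frames of {clip}"

async def transcribe_audio(clip: str) -> str:
    await asyncio.sleep(0.2)  # stands in for the transcription call
    return f"transcript of {clip}"

async def generate_script(alert: dict, visuals: str, transcript: str) -> dict:
    await asyncio.sleep(0.1)  # stands in for the LLM call
    # Assumed script shape: one list of lines, each tagged with a host.
    return {
        "region": alert["region"],
        "lines": [
            {"host": "Dana", "text": f"Alert in {alert['region']}."},
            {"host": "Omer", "text": f"Context so far: {visuals}; {transcript}."},
        ],
    }

async def build_segment(alert: dict, clip: str) -> dict:
    # Vision and transcription read the same clip but do not depend on each
    # other, so they run concurrently; the script stage waits for both.
    visuals, transcript = await asyncio.gather(
        describe_frames(clip),
        transcribe_audio(clip),
    )
    return await generate_script(alert, visuals, transcript)

if __name__ == "__main__":
    script = asyncio.run(build_segment({"region": "north"}, "clip.mp4"))
    for line in script["lines"]:
        print(line["host"], "->", line["text"])
```

With the two analysis stages gathered, that part of the budget costs roughly the slower of the two rather than their sum, which is what leaves room for script generation and TTS inside eight seconds.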
Memory that stays coherent across a broadcast. Short-term memory is a 20-segment window held in Redis.
The gap between segments. Transition phrases — “moving on,” “back to you,” handoffs, de-escalations — are fingerprinted by MD5 and cached, 200-file LRU. Cache hit rate on transitions sits around 60%. Segment N+1 is pre-generated while segment N is playing; the gap between segments is effectively invisible.
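A fingerprinted LRU cache of this kind might look like the sketch below. It rests on several assumptions: entries are kept in memory as bytes rather than as files on disk, the key includes the voice name, and text is normalized before hashing. `TransitionCache` and the `render` callback are illustrative names, not the project's API.

```python
import hashlib
from collections import OrderedDict

class TransitionCache:
    """LRU cache of rendered audio keyed by an MD5 fingerprint of voice + text."""

    def __init__(self, max_entries: int = 200):
        self.max_entries = max_entries
        self._entries = OrderedDict()  # fingerprint -> audio bytes, oldest first
        self.hits = 0
        self.misses = 0

    @staticmethod
    def fingerprint(voice: str, text: str) -> str:
        # Assumed normalization: lowercase and collapse whitespace so trivial
        # variations of the same phrase share one entry.
        normalized = " ".join(text.lower().split())
        return hashlib.md5(f"{voice}|{normalized}".encode("utf-8")).hexdigest()

    def get_or_render(self, voice: str, text: str, render) -> bytes:
        key = self.fingerprint(voice, text)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        audio = render(voice, text)  # only pay for TTS on a miss
        self._entries[key] = audio
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return audio

if __name__ == "__main__":
    cache = TransitionCache()
    fake_tts = lambda voice, text: f"{voice}:{text}".encode("utf-8")
    cache.get_or_render("Dana", "Back to you", fake_tts)
    cache.get_or_render("Dana", "back  to you", fake_tts)  # same fingerprint
    print(cache.hits, cache.misses)
```

MD5 is used here only as a fast content fingerprint, not for security, so its known collision weaknesses are not a concern for a cache key.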
OBS as a remote scene graph. Four scenes — Idle, AlertMain, Breaking, FullClip — switched programmatically over WebSocket v5. Dynamic text overlays update mid-scene; no manual operator touching the mixer. Getting WebSocket v5 semantics right was harder than the TTS or the LLM work; OBS’s wire protocol has sharp edges around scene-item identity across reload.
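For reference, the obs-websocket v5 requests involved can be built as plain JSON messages. The sketch below only constructs payloads and opens no connection; the helper names are hypothetical. The request types (`SetCurrentProgramScene`, `SetInputSettings`, `GetSceneItemId`) and the opcode 6 request envelope come from the v5 protocol. Resolving a scene item by name on each use, rather than caching its numeric ID across a reload, is an assumption about one way to avoid the identity problem, not necessarily what the project does.

```python
import json
import uuid

OP_REQUEST = 6  # obs-websocket v5 opcode for a client request

# The four scenes named in the text.
SCENES = {"Idle", "AlertMain", "Breaking", "FullClip"}

def make_request(request_type: str, request_data: dict) -> dict:
    # Every request carries a client-chosen requestId so the matching
    # response (opcode 7) can be paired with it.
    return {
        "op": OP_REQUEST,
        "d": {
            "requestType": request_type,
            "requestId": str(uuid.uuid4()),
            "requestData": request_data,
        },
    }

def switch_scene(scene_name: str) -> dict:
    if scene_name not in SCENES:
        raise ValueError(f"unknown scene: {scene_name}")
    return make_request("SetCurrentProgramScene", {"sceneName": scene_name})

def set_overlay_text(input_name: str, text: str) -> dict:
    # overlay=True merges the new setting into the input's existing settings.
    return make_request(
        "SetInputSettings",
        {"inputName": input_name, "inputSettings": {"text": text}, "overlay": True},
    )

def lookup_scene_item(scene_name: str, source_name: str) -> dict:
    # Asks OBS for the current numeric sceneItemId of a source in a scene.
    return make_request(
        "GetSceneItemId", {"sceneName": scene_name, "sourceName": source_name}
    )

if __name__ == "__main__":
    print(json.dumps(switch_scene("Breaking"), indent=2))
```

Looking the ID up by name costs an extra round trip per use, so it trades latency for robustness; whether that trade is acceptable depends on how much of the eight-second budget is left.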
Failure modes for a broadcast system. Circuit breakers between every external service, graceful fallbacks when Gemini is slow or a clip fails to download, a producer dashboard that lets a human trigger or kill a segment from the browser. Ugly-mode matters more than happy-path when the topic is a live siren.
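A circuit breaker with a fallback can be sketched in a few lines. This is a generic illustration under assumed thresholds (`max_failures`, `reset_after`), not the project's implementation: after enough consecutive failures the breaker stops calling the service and returns the fallback immediately, then allows a trial call once the cool-down has passed.

```python
import time

class CircuitBreaker:
    """Fail fast to a fallback after repeated failures of an external call."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self._clock = clock             # injectable so tests can control time
        self._failures = 0
        self._opened_at = None          # None means the circuit is closed

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        return self._clock() - self._opened_at < self.reset_after

    def call(self, fn, fallback):
        if self.is_open():
            return fallback()  # do not touch the failing service at all
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                # Trip the breaker, or re-trip it after a failed trial call.
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result

if __name__ == "__main__":
    breaker = CircuitBreaker(max_failures=2, reset_after=5.0)

    def slow_llm():
        raise TimeoutError("script generation timed out")

    for _ in range(3):
        print(breaker.call(slow_llm, lambda: "canned holding script"))
    print("open:", breaker.is_open())
```

The fallback matters as much as the breaker: for a live segment, a canned holding line delivered on time is better than a fresh script that arrives late.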
Result
Development / demo state — not live on broadcast yet. End-to-end runs cleanly from mock alerts.
Eight seconds is the whole constraint. It rules out a human in the loop and most of the architectures you’d reach for first — which is why the interesting work is in the pre-generation and the circuit breakers, not the models. 114 pytest tests, and a producer dashboard that looks a lot like an emergency-room monitor.
What I’d do differently
I underestimated how much of the latency budget the OBS WebSocket round trips would consume. The script generation and TTS were fast; the scene-switch round trips were not. I'd have profiled the OBS integration first and sized the upstream pipeline around what was left, not the other way around.