Uri Maayan

Project

Ozer-AI

An MCP server that fans coding subtasks out to a pool of free-tier LLMs, with a senior model owning design and integration review.

Role
Engineer
Stack
Python 3, asyncio, MCP server, DeepSeek / Groq / Gemini / Alibaba / OpenRouter / Puter, pydantic, pytest
Metrics
193 pytest tests · 6 free-tier providers · asyncio.Semaphore(4) concurrency · MCP-native integration with Claude Code

Claude-tier tokens are expensive. Most of what a coding agent actually does with them is cheap work — implement this function to spec, write tests for that module, refactor these three files. That’s exactly the work that smaller, free-tier LLMs are now good enough at. The expensive reasoning — architecture decisions, breaking work into tasks, catching integration bugs across modules — is where Opus and Sonnet actually earn their keep. The ratio of “cheap work” to “hard work” inside a coding session is lopsided: most tokens go to things a free-tier Llama could handle. There was no clean way to split the two: Claude Code can’t natively delegate subtasks to another provider, asking one free LLM to do everything hits quality walls, and running a fleet of them by hand is orchestration work that defeats the point.

Ozer-AI's status view in a terminal: a task DAG titled 'Session 7: Self-healing fallbacks + Swappable orchestrator', each subtask showing its type, difficulty, status (approved or failed), and the free-tier provider it ran on (groq, gemini, nvidia).
A real dispatch: subtasks fanned across free-tier providers (groq, gemini, nvidia), each tracked to approved or failed.

What it does

Ozer-AI is an MCP server that Claude Code calls as a structured tool. You hand it a DAG of tasks — “implement fn_a, then fn_b which depends on fn_a, then a test suite for both” — and it dispatches each task to a free-tier provider (DeepSeek, Groq, Gemini, Alibaba, OpenRouter, Puter), one at a time as dependencies resolve, up to four in parallel. Completed outputs are injected as context into downstream tasks so a test-writer sees the actual code it’s testing, not just the spec.

When tasks fail, the system tries to fix them rather than giving up: it reformulates the prompt with error context, rotates to an untried provider, or injects a new fix task into the DAG. At the end of the run, a senior LLM (Claude) does an integration review — conflicting signatures, missing imports, duplicate code — before anything gets applied to the filesystem.

The senior model owns design decisions and the integration review; everything in between runs on free-tier providers.

Hard parts

Dynamic task claiming. Tasks don’t wait for “wave N” to finish. An asyncio.Semaphore(4) gates concurrency; any task whose DAG dependencies are green claims the next open slot. This is simple and it beats wave scheduling on heterogeneous task durations — a 3-second provider doesn’t sit idle behind a 30-second one.

Context injection that doesn’t blow the context window. When a task completes, its output is stored and passed into dependent tasks as a ContextFile entry in the prompt. Downstream workers get the actual produced code, not just the abstract interface description. The packing is deliberate — only strictly-needed upstream outputs are inlined; everything else stays referenceable by path.

Specialist routing as a learning problem

Providers are tagged with role affinities — Groq for test-writing (fast, Llama), DeepSeek for implementation (strong coder), Gemini for frontend. Per-provider, per-task-type outcomes are persisted with a scoring formula: success·0.6 + quality·0.3 + efficiency·0.1. The router biases future routing toward high-scoring pairs. The router gets better at its job with use, without any supervised labelling.

Self-healing fallbacks. Failures run through three strategies in order: prompt reformulation with error context, provider rotation to one that hasn’t tried yet, and auto-generation of a new fix task inserted into the DAG. The pipeline keeps moving; one bad provider doesn’t halt a 40-task run.

MCP as the integration surface. Ozer runs as a real MCP server exposing dispatch_task, dispatch_plan, get_status, apply_task. Claude Code talks to it as structured tools — no shell commands, no prompt stuffing. That integration shape is what made the whole thing usable in practice rather than a curiosity.

Result

The senior model designs and reviews. Everything in between, a free-tier LLM can do.

Treating routing as a learning problem — which provider is best at which kind of task, conditional on what’s been tried — outperforms static pipelines. 193 tests, six providers, one user (me). It runs locally and it saves real tokens.

Source private — a personal tool (MIT-licensed), not yet published.

What I’d do differently

Build the agent-memory scoring before the router, not after. The routing decisions made the most sense once the scoring formula existed; designing routing first meant rewriting it once the feedback signal became available. Scoring is the signal; routing is the policy.