Project
Ozer-AI
An MCP server that fans coding subtasks out to a pool of free-tier LLMs, with a senior model owning design and integration review.
- Role
- Engineer
- Stack
- Python 3, asyncio, MCP server, DeepSeek / Groq / Gemini / Alibaba / OpenRouter / Puter, pydantic, pytest
- Metrics
- 193 pytest tests · 6 free-tier providers · asyncio.Semaphore(4) concurrency · MCP-native integration with Claude Code
Claude-tier tokens are expensive. Most of what a coding agent actually does with them is cheap work — implement this function to spec, write tests for that module, refactor these three files. That’s exactly the work that smaller, free-tier LLMs are now good enough at. The expensive reasoning — architecture decisions, breaking work into tasks, catching integration bugs across modules — is where Opus and Sonnet actually earn their keep. The ratio of “cheap work” to “hard work” inside a coding session is lopsided: most tokens go to things a free-tier Llama could handle. There was no clean way to split the two: Claude Code can’t natively delegate subtasks to another provider, asking one free LLM to do everything hits quality walls, and running a fleet of them by hand is orchestration work that defeats the point.
What it does
Ozer-AI is an MCP server that Claude Code calls as a structured tool. You hand it a DAG of tasks — “implement fn_a, then fn_b which depends on fn_a, then a test suite for both” — and it dispatches each task to a free-tier provider (DeepSeek, Groq, Gemini, Alibaba, OpenRouter, Puter), one at a time as dependencies resolve, up to four in parallel. Completed outputs are injected as context into downstream tasks so a test-writer sees the actual code it’s testing, not just the spec.
When tasks fail, the system tries to fix them rather than giving up: it reformulates the prompt with error context, rotates to an untried provider, or injects a new fix task into the DAG. At the end of the run, a senior LLM (Claude) does an integration review — conflicting signatures, missing imports, duplicate code — before anything gets applied to the filesystem.
The senior model owns design decisions and the integration review; everything in between runs on free-tier providers.
Hard parts
Dynamic task claiming. Tasks don’t wait for “wave N” to finish. An asyncio.Semaphore(4) gates concurrency; any task whose DAG dependencies are green claims the next open slot. This is simple and it beats wave scheduling on heterogeneous task durations — a 3-second provider doesn’t sit idle behind a 30-second one.
Context injection that doesn’t blow the context window. When a task completes, its output is stored and passed into dependent tasks as a ContextFile entry in the prompt. Downstream workers get the actual produced code, not just the abstract interface description. The packing is deliberate — only strictly-needed upstream outputs are inlined; everything else stays referenceable by path.
Specialist routing as a learning problem
success·0.6 + quality·0.3 + efficiency·0.1. The router biases future routing toward high-scoring pairs. The router gets better at its job with use, without any supervised labelling. Self-healing fallbacks. Failures run through three strategies in order: prompt reformulation with error context, provider rotation to one that hasn’t tried yet, and auto-generation of a new fix task inserted into the DAG. The pipeline keeps moving; one bad provider doesn’t halt a 40-task run.
MCP as the integration surface. Ozer runs as a real MCP server exposing dispatch_task, dispatch_plan, get_status, apply_task. Claude Code talks to it as structured tools — no shell commands, no prompt stuffing. That integration shape is what made the whole thing usable in practice rather than a curiosity.
Result
The senior model designs and reviews. Everything in between, a free-tier LLM can do.
Treating routing as a learning problem — which provider is best at which kind of task, conditional on what’s been tried — outperforms static pipelines. 193 tests, six providers, one user (me). It runs locally and it saves real tokens.
Source private — a personal tool (MIT-licensed), not yet published.
What I’d do differently
Build the agent-memory scoring before the router, not after. The routing decisions made the most sense once the scoring formula existed; designing routing first meant rewriting it once the feedback signal became available. Scoring is the signal; routing is the policy.