Cost-Aware AI Coding Orchestrator

This project is a local Python workflow engine for bounded AI-assisted coding tasks. The orchestrator keeps policy, command execution, validation, budgeting, and rollback under deterministic Python control while model providers are used only to produce bounded patch proposals or diagnosis output.

Current Status

Operational MVP with Phase 3 OpenAI multimodel refactor cleanup in progress.

The runtime backend is role_provider. OpenAI is opt-in through configuration, and the fake provider supports deterministic tests and no-network dry runs. Phase 1, Phase 2, and Phase 2.5 are complete; Phase 3 is focused on OpenAI multimodel cleanup, OpenCode runtime removal from the active path, backend normalization, naming cleanup, docs consistency, and final validation.

Implemented proof points

Local Python workflow engine for bounded AI-assisted coding tasks
role_provider backend with opt-in OpenAI and fake-provider test paths
Strict patch validation, rollback, checkpointing, and deterministic validators
Local logs, JSON state, and SQLite budget-ledger tracking

Architecture

Task -> Classify -> Choose Role -> Build Context -> Provider Call -> Parse Patch -> Safety Checks -> Apply Patch -> Validators -> Retry / Escalate / Rollback -> Report

The operator defines bounded task files with allowed files, budgets, timeouts, and validation commands.
The orchestrator selects a capability role from the configured policy and routes through role_provider.
OpenAI is explicitly gated by configuration, while the fake provider covers tests and dry runs without network access.
Models produce only unified diff proposals, structured JSON patch output, or diagnosis text when patching is not appropriate.
Patch safety checks enforce workspace boundaries, forbidden path rules, patch-size limits, and git apply --check before any patch lands.
Deterministic validators such as ruff, mypy, and pytest run after patch application.
Shell commands run only through deterministic wrappers with allowlists and timeouts, never from model output.
Failures roll back to a git checkpoint, retries stay bounded, and runtime, context, output, patch, stdout/stderr, and cost limits are enforced.
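
The bounded retry-and-rollback behavior described above can be sketched as a small control loop. This is a minimal illustration, not the project's actual engine: AttemptResult, run_bounded, and the attempt and rollback callables are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AttemptResult:
    ok: bool      # True when every validator passed
    detail: str   # short human-readable summary


def run_bounded(
    attempt: Callable[[int], AttemptResult],
    rollback: Callable[[], None],
    max_attempts: int,
) -> AttemptResult:
    """Try up to max_attempts times, rolling back after each failure."""
    last = AttemptResult(ok=False, detail="no attempts made")
    for number in range(1, max_attempts + 1):
        last = attempt(number)
        if last.ok:
            return last
        rollback()  # restore the checkpoint before retrying or giving up
    return last
```

The loop itself holds no model logic. Whether a retry is allowed depends only on the attempt count and the validator outcome, which keeps the decision deterministic.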

Implemented Components

Workflow Engine

Controls task loading, bounded routing, scoped context building, provider execution, validation, retry logic, logging, and final status.

Provider Runtime

Uses a typed role_provider contract with opt-in OpenAI execution and a fake provider for deterministic tests and dry runs.
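
One way to picture a typed provider contract with a fake implementation is a Protocol plus a canned-response class. The sketch below is an assumption about shape only: ProviderRequest, ProviderResponse, RoleProvider, and FakeProvider are illustrative names, not the project's real types.

```python
from dataclasses import dataclass
from typing import Dict, Protocol


@dataclass(frozen=True)
class ProviderRequest:
    role: str               # capability role, e.g. "simple_builder"
    prompt: str             # bounded context built by the orchestrator
    max_output_tokens: int  # output cap enforced per call


@dataclass(frozen=True)
class ProviderResponse:
    text: str   # patch proposal or diagnosis text, treated as data
    model: str  # model identifier recorded for accounting


class RoleProvider(Protocol):
    def complete(self, request: ProviderRequest) -> ProviderResponse: ...


class FakeProvider:
    """Deterministic, no-network provider returning canned text per role."""

    def __init__(self, canned: Dict[str, str]) -> None:
        self._canned = dict(canned)

    def complete(self, request: ProviderRequest) -> ProviderResponse:
        text = self._canned.get(request.role, "DIAGNOSIS: no canned response")
        return ProviderResponse(text=text, model="fake")
```

Because FakeProvider satisfies the same structural contract, tests could exercise the whole pipeline without any network access.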

Patch Safety

Accepts only unified diff or structured JSON patch output, validates touched files, rejects forbidden paths or deletes, and runs git apply --check.
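
A rough sketch of the pre-apply checks might look like the following. The function names, the FORBIDDEN_PARTS set, and the problem messages are assumptions for illustration. The header parsing is deliberately naive: a production parser would track hunk line counts so that a removed line beginning with "-- " is not mistaken for a file header.

```python
from pathlib import PurePosixPath
from typing import List, Set, Tuple

FORBIDDEN_PARTS = {".git", ".env"}  # hypothetical forbidden path components


def touched_paths(diff_text: str) -> Tuple[Set[str], Set[str]]:
    """Return (touched, deleted) paths read from ---/+++ diff headers."""
    touched: Set[str] = set()
    deleted: Set[str] = set()
    old = None
    for line in diff_text.splitlines():
        if line.startswith("--- "):
            old = line[4:].split("\t")[0].strip()
        elif line.startswith("+++ ") and old is not None:
            new = line[4:].split("\t")[0].strip()
            if new == "/dev/null":  # the patch deletes the old file
                path = old[2:] if old.startswith("a/") else old
                deleted.add(path)
            else:
                path = new[2:] if new.startswith("b/") else new
            touched.add(path)
            old = None
    return touched, deleted


def check_patch(diff_text: str, allowed: Set[str], max_patch_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems: List[str] = []
    if len(diff_text.encode("utf-8")) > max_patch_bytes:
        problems.append("patch exceeds size limit")
    touched, deleted = touched_paths(diff_text)
    if not touched:
        problems.append("no file headers found")
    for path in sorted(touched):
        parts = PurePosixPath(path).parts
        if path.startswith("/") or ".." in parts:
            problems.append(f"{path}: escapes workspace")
        elif FORBIDDEN_PARTS & set(parts):
            problems.append(f"{path}: forbidden path")
        elif path not in allowed:
            problems.append(f"{path}: not in allowed files")
    for path in sorted(deleted):
        problems.append(f"{path}: deletes are rejected")
    return problems
```

Only a patch with no reported problems would move on to git apply --check, which confirms that the diff applies cleanly to the working tree without modifying it.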

Validators

Runs ruff, mypy, pytest, and patch-safety checks through deterministic subprocess wrappers.
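
A deterministic subprocess wrapper with an allowlist and a timeout could be sketched as below. ALLOWED_TOOLS, CommandResult, and run_tool are hypothetical names, and the allowlist contents are an assumption based on the tools this document mentions.

```python
import subprocess
from dataclasses import dataclass
from typing import Collection, Optional, Sequence

ALLOWED_TOOLS = frozenset({"ruff", "mypy", "pytest", "git"})  # assumed allowlist


@dataclass(frozen=True)
class CommandResult:
    returncode: Optional[int]  # None when the command timed out
    stdout: str
    stderr: str
    timed_out: bool


def run_tool(
    argv: Sequence[str],
    timeout_s: float,
    max_output_chars: int = 20000,
    allowed: Collection[str] = ALLOWED_TOOLS,
) -> CommandResult:
    """Run an allowlisted command without a shell, bounded in time and output."""
    if not argv or argv[0] not in allowed:
        raise PermissionError(f"command not allowlisted: {list(argv[:1])}")
    try:
        proc = subprocess.run(
            list(argv),
            capture_output=True,
            text=True,
            timeout=timeout_s,
            shell=False,  # argv is never interpreted by a shell
            check=False,  # a nonzero exit is a result, not an exception
        )
    except subprocess.TimeoutExpired:
        return CommandResult(None, "", "", True)
    return CommandResult(
        proc.returncode,
        proc.stdout[:max_output_chars],  # cap captured stdout
        proc.stderr[:max_output_chars],  # cap captured stderr
        False,
    )
```

A call such as run_tool(["ruff", "check", "."], timeout_s=60) would then be the only route by which a validator runs. The argument list comes from configuration, never from model output.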

Git Manager

Creates checkpoints, applies safe patches, rolls back tracked changes on failure, removes unsafe new files, and keeps failure recovery bounded.

Budget Guard

Estimates tokens and cost before calls, tracks spend in SQLite, and blocks work that would exceed task, role, runtime, context, output, patch, or project budgets.

Routing Policy

Combines capability-role intent, configuration-driven pricing, reasoning effort, and service-tier policy to choose the correct builder or reviewer role.

Observability

Writes JSON state, JSONL logs, a local SQLite budget ledger, convergence reports, escalation packets, and operator-facing run history.
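
As a small illustration of the JSONL logging mentioned above, each event can be appended as one JSON object per line. The append_event helper and its field names are assumptions, not the project's actual log schema.

```python
import json
import time
from pathlib import Path


def append_event(log_path: Path, event: str, **fields: object) -> None:
    """Append one JSON object per line so logs stay greppable and streamable."""
    record = {"ts": time.time(), "event": event, **fields}
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
```

Append-only lines mean that a crash in the middle of a run leaves earlier records intact and readable.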

Model Roles

Capability roles route work by task shape, not by a cheap-versus-expensive split.

The active capability set includes mechanical_worker, classification_worker, simple_builder, general_builder, coding_optimized_builder, frontier_reviewer, and emergency_pro_reviewer. Routing is based on bounded task intent, validation needs, escalation depth, and manual approval policy where required.

Builder roles

mechanical_worker, simple_builder, general_builder, and coding_optimized_builder cover deterministic edits, small local changes, guided repair, and coding-heavy integration work.

Reviewer roles

classification_worker supports classification and summarization, while frontier_reviewer and emergency_pro_reviewer stay diagnosis-first by default rather than patching freely.
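
To make routing by task shape concrete, here is a hedged sketch. The role names come from this document, but the task-shape keys, the escalation thresholds, and the manual-approval rule are all assumptions made for illustration and do not describe the real routing policy.

```python
# Hypothetical mapping from task shape to builder role.
BUILDER_BY_SHAPE = {
    "deterministic_edit": "mechanical_worker",
    "small_local_change": "simple_builder",
    "guided_repair": "general_builder",
    "integration": "coding_optimized_builder",
}


def choose_role(task_shape: str, escalation_depth: int = 0,
                manual_approval: bool = False) -> str:
    """Pick a capability role from task shape and escalation depth."""
    if escalation_depth >= 2:
        # Assumed rule: the emergency reviewer needs explicit operator approval.
        return "emergency_pro_reviewer" if manual_approval else "frontier_reviewer"
    if escalation_depth == 1:
        return "frontier_reviewer"
    # Unknown shapes fall back to a general builder (an assumption).
    return BUILDER_BY_SHAPE.get(task_shape, "general_builder")
```

In this sketch the choice never depends on price alone, which matches the stated intent of routing by task shape.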

Cost Controls

Pre-call estimates

Input tokens are estimated from built context, and projected output cost is evaluated before a provider call is allowed.
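
A pre-call estimate can be as simple as a character-based token heuristic combined with configured per-token prices. The four-characters-per-token ratio and the function names below are assumptions. The project may well use a real tokenizer, and actual prices would come from configuration.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the 4-chars-per-token ratio is a heuristic."""
    return math.ceil(len(text) / chars_per_token)


def projected_cost_usd(
    input_tokens: int,
    max_output_tokens: int,
    input_usd_per_mtok: float,
    output_usd_per_mtok: float,
) -> float:
    """Worst-case cost, assuming the model uses its full output cap."""
    total = input_tokens * input_usd_per_mtok + max_output_tokens * output_usd_per_mtok
    return total / 1_000_000
```

Pricing the full output cap rather than an expected length makes the estimate conservative, so a call that passes the check should not exceed the projection.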

Role budgets

Budgets are tracked by project, task, and capability role with explicit runtime and output caps.

Persistent ledger

SQLite records provider, model, service tier, role, estimated tokens, projected cost, status, task, and timestamp.
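
The recorded fields listed above map naturally onto a single table. The schema, table name, and helper functions below are a sketch under that assumption, not the project's actual ledger layout.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS ledger (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ts TEXT NOT NULL,
    task TEXT NOT NULL,
    role TEXT NOT NULL,
    provider TEXT NOT NULL,
    model TEXT NOT NULL,
    service_tier TEXT NOT NULL,
    estimated_tokens INTEGER NOT NULL,
    projected_cost_usd REAL NOT NULL,
    status TEXT NOT NULL
)
"""


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_call(conn: sqlite3.Connection, *, task: str, role: str, provider: str,
                model: str, service_tier: str, estimated_tokens: int,
                projected_cost_usd: float, status: str) -> None:
    """Insert one ledger row with a UTC timestamp."""
    conn.execute(
        "INSERT INTO ledger (ts, task, role, provider, model, service_tier,"
        " estimated_tokens, projected_cost_usd, status)"
        " VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), task, role, provider, model,
         service_tier, estimated_tokens, projected_cost_usd, status),
    )
    conn.commit()


def spent_usd(conn: sqlite3.Connection, *, role: Optional[str] = None) -> float:
    """Sum projected cost of non-blocked calls, optionally for one role."""
    sql = ("SELECT COALESCE(SUM(projected_cost_usd), 0.0) FROM ledger"
           " WHERE status != 'blocked'")
    params: tuple = ()
    if role is not None:
        sql += " AND role = ?"
        params = (role,)
    return float(conn.execute(sql, params).fetchone()[0])
```

The "blocked" status value is a hypothetical convention here, used so that calls rejected before execution do not count toward spend.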

Hard blocks

Calls are blocked before execution if projected spend would exceed task, role, runtime, context, output, or project limits.

budget:
  project_total_budget_usd: 5.0
  role_budgets_usd:
    mechanical_worker: 2.0
    general_builder: 2.0
    frontier_reviewer: 1.0

limits:
  max_input_tokens_per_call: 32000
  max_output_tokens_per_call: 4000
  max_context_chars_per_call: 128000
  max_patch_bytes: 256000
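
Given limits and budgets like those above, a hard-block check before each call might be sketched as follows. The Limits dataclass, the block_reason function, and the order of checks are assumptions. Only the numeric defaults are taken from the configuration shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limits:
    max_input_tokens_per_call: int = 32000
    max_output_tokens_per_call: int = 4000
    max_context_chars_per_call: int = 128000


def block_reason(
    *,
    context_chars: int,
    input_tokens: int,
    output_tokens: int,
    projected_cost_usd: float,
    role_spent_usd: float,
    role_budget_usd: float,
    project_spent_usd: float,
    project_budget_usd: float,
    limits: Limits = Limits(),
) -> Optional[str]:
    """Return why a call must be blocked, or None if it may proceed."""
    if context_chars > limits.max_context_chars_per_call:
        return "context limit exceeded"
    if input_tokens > limits.max_input_tokens_per_call:
        return "input token limit exceeded"
    if output_tokens > limits.max_output_tokens_per_call:
        return "output token limit exceeded"
    if role_spent_usd + projected_cost_usd > role_budget_usd:
        return "role budget exceeded"
    if project_spent_usd + projected_cost_usd > project_budget_usd:
        return "project budget exceeded"
    return None
```

The check compares spend plus the projected cost against each budget, so a call is refused before it can push the total over the limit rather than after.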

Service Tier Policy

Service tiers are controlled by policy; they are not a proxy for the capability split.

Non-urgent builder work prefers flex, falls back to default, and keeps priority disabled by default. Reasoning effort is a request-level control, and pricing is configuration-driven rather than hardcoded into runtime logic.

policy:
  builder_service_tier_preference:
    preferred: flex
    fallback: default
    allow_priority: false

reasoning_effort:
  default: medium
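
The tier preference above could translate into a selection function like this one. The function name, the available parameter, and the handling of urgent work are assumptions. The document specifies only the non-urgent builder behavior.

```python
from typing import Collection


def choose_service_tier(
    available: Collection[str],
    preferred: str = "flex",
    fallback: str = "default",
    allow_priority: bool = False,
    urgent: bool = False,
) -> str:
    """Pick a service tier from policy; priority needs an explicit opt-in."""
    # Assumed rule: priority is considered only for urgent work when enabled.
    if urgent and allow_priority and "priority" in available:
        return "priority"
    if preferred in available:
        return preferred
    return fallback
```

With the defaults shown, priority can never be selected, so enabling it remains a deliberate configuration change.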

Convergence Logic

Converging

Fewer validator failures, narrower diffs, or clearer progress toward the requested outcome.

Stagnating

Repeated identical failures or minimal improvement across attempts despite valid bounded retries.

Diverging

Growing diffs, new failure categories, unrelated files, or regressions that indicate the route should stop.

Diagnosis needed

Non-converging work triggers a structured escalation packet for diagnosis-first reviewer analysis.
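
The converging, stagnating, and diverging categories can be approximated by comparing two consecutive attempts. The Attempt fields and the precedence of the rules below are assumptions chosen for illustration. The real classifier may weigh these signals differently.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Attempt:
    failures: int                 # number of validator failures
    diff_lines: int               # size of the proposed diff
    failure_kinds: FrozenSet[str] # e.g. {"pytest", "mypy"}
    unrelated_files: int = 0      # files touched outside the task scope


def classify_trend(prev: Attempt, curr: Attempt) -> str:
    """Label the change between two attempts."""
    new_kinds = curr.failure_kinds - prev.failure_kinds
    if (
        new_kinds
        or curr.unrelated_files > 0
        or curr.failures > prev.failures
        or curr.diff_lines > prev.diff_lines
    ):
        return "diverging"
    if curr.failures < prev.failures or curr.diff_lines < prev.diff_lines:
        return "converging"
    return "stagnating"
```

Checking the diverging signals first is a conservative choice: an attempt that fixes one failure but introduces a new failure category is treated as a reason to stop rather than as progress.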

Operational Safety

The model never gets shell execution authority.

The orchestrator allows only deterministic subprocess tools through explicit wrappers with allowlists and timeouts. Model output is treated as data, never as executable shell. Patches must pass contract parsing, boundary checks, validators, and rollback rules instead of relying on model confidence.

Safety invariants

No model-generated shell execution
Strict patch contract and forbidden-path checks
git apply --check before patch application
Mandatory validators after every patch
Diagnosis-first frontier reviewer path by default
Rollback and checkpoint preservation on failure

Operator Modes

The same orchestration logic supports local testing and real provider runs.

The fake provider keeps the system testable without network calls, while the OpenAI provider enables real runtime execution only when configuration opts it in and the operator supplies OPENAI_API_KEY. This preserves a deterministic local test path without weakening the actual provider-backed runtime.

Practical usage

Fake provider for no-network tests and deterministic validation
OpenAI provider for smoke, integration, and normal real runs
Same validator, rollback, and patch-safety layer in both modes

Phase 3 Cleanup

The MVP is operational, but the Phase 3 cleanup is still in progress. The current work is focused on final OpenAI multimodel refactor cleanup, OpenCode runtime removal from the active path, backend normalization, naming cleanup, docs consistency, validation, and clean commit preparation.

Quickstart Summary

Use the fake provider path for local no-network validation and safety testing.

Enable the OpenAI provider explicitly for real smoke, integration, or normal runs with OPENAI_API_KEY configured.

Normal runs stay bounded by allowed files, budgets, timeouts, patch limits, and validator commands.

Validation and rollback remain the final authority over whether a patch is accepted.

Project Scope

A local CLI orchestration layer for controlled AI-assisted coding, not a web app or unrestricted autonomous shell agent.

OpenAI support is real but explicitly gated, while the fake provider handles deterministic dry runs and tests.

Provider usage accounting is policy-based and recorded in a SQLite ledger rather than reconciled to external billing APIs.

classification_worker is a supported canonical role, but the router remains intentionally conservative about auto-selecting it.

Human review remains the final trust boundary before shipping generated patches.

OpenCode was part of the earlier prototype, but the active runtime path now uses direct provider abstractions.

Technologies

Python, Pydantic, YAML, SQLite, Git, Ruff, mypy, pytest, Docker, OpenAI provider abstraction