Evaluate coding agents on your own repositories.
You evaluate your employees; why not your coding agents? RepoGauge turns your repositories into a reproducible evaluation suite, runs real coding agents against the same task set, and shows the numbers you need to compare them: pass rate, cost, latency, regressions, and the point where a premium model actually pays for itself.
RepoGauge mines actual bugfix commits, preserving the problem, the tests, and the gold patch.
Your repositories stay on your machine unless you explicitly point a solver at a remote provider.
Solver patches go through the same validation flow, which keeps the score tied to real task behavior.
Token usage, cache hits, wall-clock time, and spend sit beside pass rate so the tradeoffs stay visible.
Choosing an agent gets expensive fast.
Public leaderboards rarely look like your codebase, provider behavior changes quietly, and teams end up paying for overlapping assistants because nobody can prove which one actually ships better code. RepoGauge gives that discussion a stable measurement loop.
How teams usually choose
A new model lands, someone tries it for an afternoon, and the decision gets made on taste: feels faster, seems sharper, maybe the patch looked cleaner.
- No stable baseline between providers or months.
- No proof the benchmark resembles your repo.
- No cost-per-solved-bug metric to keep quality honest.
- No reproducible canary for silent regressions.
What the workflow gives you
A repeatable pipeline that mines real fixes from your history, validates each task before scoring, runs the same task slice across every solver, and leaves behind analysis artifacts you can diff over time.
- Same dataset, same scoring rules, same codebase slice. The comparison stays clean.
- Pass rate tied directly to the human fix and failing tests.
- Cost, tokens, and latency show up beside quality.
- Train a router so premium models only handle premium tasks.
From raw commit history to benchmark-grade evidence.
The workflow is explicit: mine, review, export, validate, run, analyze, then optionally train a router. Every stage emits a defined artifact contract, so the system stays inspectable and concrete.
Start from the fixes your team already shipped.
RepoGauge scans the default branch for bugfix-shaped changes and emits a candidate set that can be reviewed, filtered, and later exported into a benchmark built from your repositories. That keeps the benchmark grounded in actual work your team already shipped.
- Artifact: `candidates.jsonl` becomes the raw source pool for the whole benchmark.
- Bias control: deterministic heuristics reduce the temptation to overfit selection to a preferred solver.
- Low setup cost: pure rules mode works without any model calls.
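As a loose illustration only (the names here, including `looks_like_bugfix`, are hypothetical, not RepoGauge's actual heuristics), a rules-only miner can flag bugfix-shaped commits from metadata alone, with no model calls:

```python
import re

# Hypothetical sketch of a rules-only mining pass; RepoGauge's real
# heuristics are its own. Needs only commit metadata, no model calls.
FIX_PATTERN = re.compile(r"\b(fix(es|ed)?|bug|regression|crash)\b", re.IGNORECASE)

def looks_like_bugfix(message: str, changed_files: list[str]) -> bool:
    """Flag commits whose subject reads like a fix and whose diff touches tests."""
    mentions_fix = bool(FIX_PATTERN.search(message.splitlines()[0]))
    touches_tests = any("test" in path for path in changed_files)
    return mentions_fix and touches_tests

# Accepted candidates would be emitted one JSON object per line
# into the candidates.jsonl source pool.
```

Because the rule is deterministic, rerunning it over the same history produces the same candidate set, which is what keeps selection bias out of later comparisons.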
Turn a noisy history into a benchmark-worthy task list.
The review stage applies accept and reject heuristics, generates a browsable summary, and draws a clean boundary between strong bugfixes and changes that do not belong in the benchmark.
- HTML output: an immediately reviewable report helps humans spot weak candidates quickly.
- Rules first: LLMs can advise, but they do not get to define ground truth.
- Faster curation: this is the quality gate before expensive evaluation work starts.
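The accept/reject boundary can be sketched as pure rules like the following (thresholds and the `review_candidate` helper are made-up examples, not RepoGauge's actual gate):

```python
# Hypothetical review rules; RepoGauge's real accept/reject heuristics differ.
def review_candidate(diff_lines: int, files_changed: int, has_failing_test: bool) -> str:
    """Rules-first quality gate: LLMs may advise, but rules define ground truth."""
    if not has_failing_test:
        return "reject: no test demonstrates the bug"
    if diff_lines > 500 or files_changed > 10:
        return "reject: too large to score cleanly"
    return "accept"
```

Keeping the gate this explicit means a reviewer reading the HTML report can see exactly why a candidate was excluded.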
Materialize benchmark datasets for your repositories.
Export turns accepted candidates into dataset instances, writes the gold predictions, and generates a repo-specific evaluation adapter so the scoring flow can understand repositories it has never seen before.
- Compatibility boundary: the dataset plus adapter pair is what keeps scoring reliable.
- Gold patch preservation: every task keeps the human fix for validation and comparison.
- Structured outputs: downstream stages can resume cleanly without redoing export work.
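To make "dataset instance" concrete, here is an illustrative shape only; the field names are assumptions, not RepoGauge's actual export schema:

```python
import json

# Illustrative instance shape; RepoGauge's actual export schema may differ.
instance = {
    "instance_id": "myrepo__1234",         # stable ID so stages can resume/diff
    "base_commit": "abc123",               # repo state the solver starts from
    "problem_statement": "Pager crashes on empty input",
    "fail_to_pass": ["tests/test_pager.py::test_empty"],  # tests the fix must flip
    "gold_patch": "--- a/src/pager.py\n+++ b/src/pager.py\n...",
}
line = json.dumps(instance)  # one instance per line in the exported dataset
```

The gold patch travels with the task, which is what later lets validation and solver comparison refer back to the human fix.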
Make sure the benchmark is actually solvable before you score anyone on it.
Gold validation runs the scoring flow against the human patches inside the container image, so broken tasks get caught before they contaminate solver comparisons.
- Sanity check: confirms each benchmark instance can pass under the intended validation environment.
- Resolved slice: emits the clean dataset subset worth spending solver time on.
- Infrastructure clarity: separates bad tasks from bad model behavior.
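The triage logic behind that separation can be sketched roughly like this (a simplified stand-in, not RepoGauge's implementation):

```python
# Hypothetical triage: separate bad tasks from bad solver behavior by running
# the scoring flow with the human (gold) patch first.
def classify_gold_run(tests_fail_before: bool, tests_pass_after: bool) -> str:
    if not tests_fail_before:
        return "broken task: tests never demonstrated the bug"
    if not tests_pass_after:
        return "broken task: gold patch does not pass in this environment"
    return "resolved: keep in the clean dataset slice"
```

Only instances that land in the resolved bucket are worth spending solver tokens on; the rest are environment problems, not model problems.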
Test every solver on the same task set and record the economics of each attempt.
RepoGauge executes the matrix configuration against the resolved dataset and records one row per attempt: patch, tokens, cost, duration, exit reason, and workspace artifacts.
- Head-to-head fairness: same tasks, same scoring rules, same codebase baseline.
- Workspace-backed CLIs: local agent tools can run inside the containerized benchmark flow.
- Rich telemetry: every attempt keeps cost, speed, and solver behavior attached to it.
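One row per attempt might look like the record below; the field names are assumptions for illustration, not RepoGauge's actual output schema:

```python
from dataclasses import dataclass

# Illustrative attempt record; field names are assumptions, not RepoGauge's schema.
@dataclass
class Attempt:
    instance_id: str
    solver: str
    patch: str        # the diff the solver submitted
    tokens: int
    cost_usd: float
    duration_s: float
    exit_reason: str  # e.g. "submitted", "timeout", "error"

attempt = Attempt("myrepo__1234", "cheap-model", "...", 12_000, 0.04, 95.0, "submitted")
```

Because every attempt carries its own economics, quality and cost never get analyzed in separate spreadsheets.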
Join patch quality to the business reality of cost, latency, and failure modes.
Analysis turns raw attempts into actual decisions: pass rate, cost per solved task, expensive tails, timeout rate, and the spread between uniform routing and mixed strategies.
- One report, full picture: solver outcomes and economics are reported together.
- Regression detection: rerun next month and diff like any other artifact.
- Actionable summaries: it becomes obvious where each solver breaks.
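The two headline metrics reduce to simple arithmetic over attempt rows; the rows and the `summarize` helper here are hypothetical:

```python
# Sketch of pass rate and cost-per-solved-task over hypothetical attempt rows.
attempts = [
    {"solver": "premium", "resolved": True,  "cost_usd": 0.90},
    {"solver": "premium", "resolved": True,  "cost_usd": 1.10},
    {"solver": "cheap",   "resolved": True,  "cost_usd": 0.10},
    {"solver": "cheap",   "resolved": False, "cost_usd": 0.05},
]

def summarize(rows: list[dict], solver: str) -> dict:
    mine = [r for r in rows if r["solver"] == solver]
    solved = sum(r["resolved"] for r in mine)
    spend = sum(r["cost_usd"] for r in mine)
    return {
        "pass_rate": solved / len(mine),
        "cost_per_solved": spend / solved if solved else float("inf"),
    }
```

In this toy data the premium solver wins on pass rate but costs roughly seven times more per solved task, which is exactly the kind of tradeoff the report is meant to surface.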
Use benchmark results to control model spend.
Router training fits a small decision tree on the analysis data so future task routing can trade off success probability against cost in a concrete, repo-specific way.
- Practical outcome: premium models only get the work that justifies premium pricing.
- Repo-specific: the boundary comes from your task distribution and repo history.
- Future use: the benchmark can feed policy and routing decisions.
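A toy routing rule makes the tradeoff concrete; this stands in for the trained decision tree, and every number below is invented:

```python
# Toy expected-value router standing in for the trained decision tree.
# All probabilities and costs are made up for illustration.
def route(p_cheap: float, cheap_cost: float, premium_cost: float,
          value_of_fix: float, p_premium: float = 0.9) -> str:
    """Send a task to the premium solver only when the extra spend pays off."""
    cheap_ev = p_cheap * value_of_fix - cheap_cost
    premium_ev = p_premium * value_of_fix - premium_cost
    return "premium" if premium_ev > cheap_ev else "cheap"
```

When the cheap solver is likely to succeed, the router keeps the task cheap; only hard tasks, where the cheap success probability is low, justify premium pricing.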
Questions you can answer with evidence.
These are the questions teams keep debating in Slack, planning docs, and budget reviews. RepoGauge gives you a stable artifact trail for answering them.
Which agent actually solves the most bugs here?
On your repo, against your own historical bugfixes and test suites.
- See per-solver pass rate and resolved-instance count.
- Compare attempts on the same task slice so conversations stay anchored.
- Spot quality leaders without losing sight of cost.
Is the premium model worth the price?
Pass rate alone hides the economic story. RepoGauge shows the real tradeoff at the margin.
- Track mean spend per solved task and keep raw token totals in context.
- Inspect the expensive tail where “best” often stops being rational.
- Quantify where the cheap model is already sufficient.
Did a provider quietly regress last week?
Rerun the same matrix on the same dataset and diff the outputs against a stable baseline.
- Regression checks become reproducible artifacts.
- Judge output and cost shifts can be diffed together.
- Silent model changes turn into explicit evidence.
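A regression check over two run summaries is just a diff with a tolerance; the summary shape and `diff_runs` helper here are hypothetical, not RepoGauge's artifact format:

```python
# Hypothetical diff over two run summaries keyed by solver name.
def diff_runs(baseline: dict, current: dict, tolerance: float = 0.02) -> list[str]:
    """Flag solvers whose quality fell or whose cost rose beyond tolerance."""
    alerts = []
    for solver, base in baseline.items():
        cur = current.get(solver)
        if cur is None:
            continue
        if base["pass_rate"] - cur["pass_rate"] > tolerance:
            alerts.append(f"{solver}: pass rate fell "
                          f"{base['pass_rate']:.2f} -> {cur['pass_rate']:.2f}")
        if cur["cost_per_solved"] > base["cost_per_solved"] * (1 + tolerance):
            alerts.append(f"{solver}: cost per solved task rose")
    return alerts
```

Run this on a schedule against a frozen dataset and a quiet provider change becomes an alert line instead of a hunch.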
For teams making real platform and buying decisions.
The strongest fit is any group making high-leverage decisions about coding agents, spend, or platform defaults, and that wants an answer it can defend after the meeting ends.
Pick a default agent with something sturdier than demo impressions.
Engineering leaders usually need to answer a practical question: which coding assistant should the team use, and is it actually earning its seat? RepoGauge gives a repo-specific answer with enough depth to survive scrutiny.
- Choose a default assistant using pass rate and cost per solved bug, then explain the choice in concrete terms.
- See where a premium model wins decisively and where the cheap tier is already good enough.
- Build a repeatable canary that catches provider regressions before the team feels them.
Make access and budget decisions with a real failure map.
Platform teams are typically mediating between user demand, cost controls, and infrastructure risk. RepoGauge helps them decide which agents to expose, which images to run, and what the guardrails should be.
- Compare CLIs and providers in the exact containerized workflow your org will rely on.
- Understand timeout rates, infrastructure flakes, and bad-patch patterns separately.
- Feed routing or policy decisions with repo-specific training data.
Catch quiet regressions before they become organizational folklore.
If your job is to keep an eye on provider quality, a reproducible benchmark on your own repositories is a much sharper instrument than waiting for scattered developer complaints to accumulate.
- Keep a stable dataset and rerun it across model or provider updates.
- Diff quality and cost changes together so one metric does not hide the other.
- Build a canary suite that reflects your codebase and day-to-day work.
Five commands from repositories to comparison.
RepoGauge keeps the workflow linear on purpose. You can stop after any stage, inspect the artifacts, or keep going to a full solver comparison and router-training dataset.
1. Prepare the environment
Use the project’s `uv` workflow and keep everything in the same dependency context the project expects.
uv sync --group dev
2. Build the benchmark
Mine your repositories, review candidates, and export a benchmark dataset.
uv run repogauge mine /path/to/repo
uv run repogauge review candidates.jsonl
uv run repogauge export reviewed.jsonl
3. Validate, run, analyze
Confirm gold solvability, run the matrix, and produce a report that can survive an architecture review.
uv run repogauge eval dataset.jsonl --gold
uv run repogauge run examples/matrix.yaml
uv run repogauge analyze ./out/run/<run_id>
Interested in a hosted platform?
If a managed version would save your team time across multiple repositories, leave your details and what you would need from it. That helps shape the hosted roadmap around real usage.
Tell us what a managed version would need to do.
Share how you want to run evaluations across your repositories, where spend needs tighter control, and what would make the results easier to act on across the organization.
1. Share the repositories, team shape, and agent stack you care about.
2. Tell us whether you care most about lower spend, regression alerts, or clearer reporting.
3. We can prioritize the hosted roadmap around the jobs teams actually need done.
Leave your details or book a short intro.
If you already know you would use a hosted version, open the form. If you want a quick conversation first, use the calendar link.
A benchmark built from your repositories is a cleaner way to choose coding agents.
It keeps comparisons honest across solvers and surfaces the cost and quality evidence you need to choose with confidence.