Changelog

How inspect evolved, including the mistakes.

6b2e4fd · feat · MCP server, review verdict, markdown formatter

New inspect-mcp crate: 6 MCP tools so any coding agent can use entity-level review. inspect_triage is the primary entry point, returning entities sorted by risk with a verdict. inspect_entity lets agents drill into one entity with before/after content, dependents, and dependencies. Plus inspect_group, inspect_file, inspect_stats, and inspect_risk_map. Added ReviewVerdict (4 levels: LikelyApprovable, StandardReview, RequiresReview, RequiresCarefulReview) as a quick signal for agents. New --format markdown for all commands. Extended EntityReview with before/after content and dependency names for drill-down.
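
The verdict mapping can be sketched roughly as below. The four variant names come from this release; the score thresholds are illustrative assumptions, not inspect's actual cutoffs:

```rust
// Sketch of ReviewVerdict: variant names per the release notes,
// thresholds hypothetical.
#[derive(Debug, PartialEq)]
enum ReviewVerdict {
    LikelyApprovable,
    StandardReview,
    RequiresReview,
    RequiresCarefulReview,
}

// Map the highest entity risk score (assumed 0.0..1.0) to a verdict.
fn verdict_from_risk(max_risk: f64) -> ReviewVerdict {
    match max_risk {
        r if r >= 0.8 => ReviewVerdict::RequiresCarefulReview,
        r if r >= 0.6 => ReviewVerdict::RequiresReview,
        r if r >= 0.3 => ReviewVerdict::StandardReview,
        _ => ReviewVerdict::LikelyApprovable,
    }
}
```

The point of a coarse four-level signal is that an agent can branch on it without parsing per-entity scores.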

+820 lines • 12 files • 3 crates (up from 2) • 12 tests passing

Lesson: cache the analysis, not the tools. MCP tools like inspect_entity and inspect_group are drill-downs into the same analysis that inspect_triage already computed. Re-running the full 4-phase pipeline for each tool call would multiply latency by 6x. We cache the ReviewResult keyed by (repo_path, target) so sequential drill-downs are instant. The expensive work happens once in triage; everything else is just a filtered view.
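
A minimal sketch of that cache, assuming a simplified stand-in for ReviewResult and a (repo_path, target) key; the real types live in inspect-core:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for inspect-core's ReviewResult.
struct ReviewResult {
    entities: Vec<String>,
}

struct AnalysisCache {
    results: HashMap<(String, String), ReviewResult>,
}

impl AnalysisCache {
    fn new() -> Self {
        Self { results: HashMap::new() }
    }

    // Run the expensive 4-phase pipeline at most once per (repo, target);
    // later drill-down tools get the cached result and just filter it.
    fn get_or_compute(
        &mut self,
        repo: &str,
        target: &str,
        compute: impl FnOnce() -> ReviewResult,
    ) -> &ReviewResult {
        self.results
            .entry((repo.to_string(), target.to_string()))
            .or_insert_with(compute)
    }
}
```

With this shape, inspect_entity and inspect_group become cheap lookups into the result inspect_triage already paid for.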
33c94fb · bench · Greptile benchmark: 83.5% HC recall

Ran inspect against the full Greptile benchmark: 50 PRs, 5 repos, 97 golden comments from human reviewers. 83.5% High/Critical recall. 100% recall at the Medium threshold, meaning every golden comment fell within a flagged entity. Per-repo: Keycloak (Java) 100%, Cal.com (TypeScript) 91%, Grafana (Go) 81%, Discourse (Ruby) 79%, Sentry (Python) 67%. Beats Augment (55%), Greptile (45%), CodeRabbit (43%), Cursor (41%), and Copilot (34%). No LLM, no API key, runs locally in milliseconds.

50 PRs • 5 repos • 97 golden comments • 83.5% HC recall • zero cost

Lesson: the graph is the moat. LLM-based tools look at code content. inspect looks at the dependency graph. A function that 12 other entities depend on is risky regardless of how the diff looks. This is why a zero-cost static tool beats tools backed by frontier models: the signal comes from structure, not language understanding. Content tells you what changed; the graph tells you what matters.
9731d47 · rewrite · Graph-centric risk scoring

Rewrote the risk scoring formula. Previously, classification was the primary signal: functional changes scored high, text changes scored low, regardless of context. Now dependents and blast radius are the primary discriminators. A functional change to a leaf function with zero dependents scores Medium. A syntax change to a hub function with 50 dependents scores High. Cosmetic discount increased from 0.7x to 0.2x.

Risk formula rewrite • AACR-Bench: 48.3% HC recall (was ~30% before) • 78.2% HC+M recall

Mistake: classification-first scoring doesn't work. The first risk formula weighted classification at 55% and graph signals at 30%. This meant every functional change scored High, flooding the output with noise. A one-line logic fix in a helper function with no dependents isn't risky, but classification-first scoring can't tell the difference. Flipping the formula to graph-first (dependents + blast radius as primary) cut false positives in half and jumped AACR-Bench HC recall from ~30% to 48%.
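
A graph-first score might look like the sketch below. The weights and squashing are hypothetical stand-ins for inspect's actual formula, but the shape matches the description: dependents and blast radius drive the score, and cosmetic changes get the 0.2x discount:

```rust
// Sketch only: log-scaled graph signals, squashed to 0..1,
// with classification as a modifier rather than the primary term.
fn risk_score(dependents: u32, blast_radius: u32, is_cosmetic: bool) -> f64 {
    // Illustrative weights, not inspect's real coefficients.
    let graph = (dependents as f64).ln_1p() * 0.6
        + (blast_radius as f64).ln_1p() * 0.4;
    let score = graph / (graph + 1.0); // squash into 0..1
    if is_cosmetic { score * 0.2 } else { score }
}
```

Under this shape, a leaf function with zero dependents scores 0.0 no matter how functional the change is, while a hub with 50 dependents scores high even for a small edit.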
f403560 · perf · Parallel graph building, large codebase support

inspect was built on small repos (sem, weave). Then we ran it on Sentry: 16,000 files, 100,000+ entities. It took 40 seconds. Switched to HashSet for O(1) lookups in classify and analyze, replaced full BFS collection with impact_count(), which counts reachable entities without allocating, and added per-phase timing (diff, graph build, scoring) so bottlenecks are visible. Sentry dropped from 40s to ~4s; small repos stayed under 50ms.

40s → 4s (Sentry, 16k files) • per-phase timing • parallel graph build

Mistake: building for small repos only. We tested on sem (25 files) and weave (80 files). Everything was fast, so we assumed it was fine. The first time we pointed inspect at a real open-source project (Sentry), the graph build alone took 35 seconds because we were doing O(n²) lookups with Vec::contains instead of HashSet::contains. Always test on repos 100x bigger than your own.
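
The slow and fast lookup shapes, side by side (illustrative names, not the actual classify/analyze code):

```rust
use std::collections::HashSet;

// The O(n^2) shape that made Sentry's graph build crawl:
// each membership test against a Vec is a linear scan.
fn count_known_slow(names: &[String], known: &[String]) -> usize {
    names.iter().filter(|n| known.contains(n)).count()
}

// The fix: build a HashSet once, then each lookup is O(1) on average.
fn count_known_fast(names: &[String], known: &[String]) -> usize {
    let set: HashSet<&String> = known.iter().collect();
    names.iter().filter(|n| set.contains(n)).count()
}
```

At 25 files the two are indistinguishable; at 100,000+ entities the difference is the 35 seconds described above.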
80fb20e · feat · Full-repo entity graph + bench command

Entity graph now covers all source files (via git ls-files) instead of only changed files. Before this, blast radius was always zero because dependents from unchanged files weren't in the graph. New inspect bench command iterates commits and collects aggregate metrics. 96.8% of commits in sem contained tangled logical changes.

bench command • full-repo graph • tangled commit detection

Mistake: graphing only changed files. The initial implementation built the entity graph from just the files that appeared in the diff. This made blast radius and dependent count always zero, because the callers were in unchanged files that weren't in the graph. It took an embarrassingly long time to realize why every entity scored Low. The fix was obvious: build the graph from the full repo via git ls-files, then score the changed entities against it.
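
The scoping fix can be sketched like this, with hypothetical names: dependency edges come from the whole repo, but only entities in the diff get scored:

```rust
use std::collections::HashMap;

// Count dependents for changed entities using edges from the FULL repo
// graph, so callers in unchanged files contribute. Sketch only.
fn dependent_counts(
    all_edges: &[(&str, &str)], // (caller, callee) pairs across the repo
    changed: &[&str],           // entities touched by the diff
) -> HashMap<String, usize> {
    let mut counts: HashMap<String, usize> =
        changed.iter().map(|c| (c.to_string(), 0)).collect();
    for (_caller, callee) in all_edges {
        if let Some(n) = counts.get_mut(*callee) {
            *n += 1; // a caller outside the diff still counts
        }
    }
    counts
}
```

With a diff-only graph, every entry in the result would have been zero, which is exactly the bug described above.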
bdfa473 · feat · Initial release

First working version. Cargo workspace: inspect-core (library) + inspect-cli (binary). ConGra change classification (7 variants), risk scoring, Union-Find untangling into logical groups. Three commands: inspect diff, inspect pr, inspect file. Terminal (colored) + JSON output. 10 tests.

Cargo workspace • 2 crates • 10 tests • 3 commands

Lesson: Union-Find untangling is underrated. We almost shipped without it. The initial version just sorted entities by risk. But a commit that touches auth, caching, and logging has three independent changes that shouldn't be reviewed together. Union-Find on dependency edges between changed entities separates them into groups. 96.8% of commits in sem were tangled. Without untangling, reviewers have to mentally separate the changes themselves.
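
The grouping step can be sketched with a minimal Union-Find over entity indices (a sketch, not inspect's actual implementation): union every dependency edge between changed entities, and the surviving roots are the independent logical groups.

```rust
// Minimal Union-Find with path compression.
struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        Self { parent: (0..n).collect() }
    }

    fn find(&mut self, x: usize) -> usize {
        if self.parent[x] != x {
            let root = self.find(self.parent[x]);
            self.parent[x] = root; // path compression
        }
        self.parent[x]
    }

    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            self.parent[ra] = rb;
        }
    }
}

// Number of independent groups among n changed entities,
// given dependency edges between them.
fn group_count(n: usize, edges: &[(usize, usize)]) -> usize {
    let mut uf = UnionFind::new(n);
    for &(a, b) in edges {
        uf.union(a, b);
    }
    (0..n).filter(|&i| uf.find(i) == i).count()
}
```

For the auth/caching/logging commit above: if the auth entities only call each other, the caching entities only call each other, and so on, no union ever crosses a concern, and three groups fall out for free.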