How inspect evolved, including the mistakes.
New inspect-mcp crate: 6 MCP tools so any coding agent can use entity-level review. inspect_triage is the primary entry point, returning entities sorted by risk with a verdict. inspect_entity lets agents drill into one entity with before/after content, dependents, and dependencies. Plus inspect_group, inspect_file, inspect_stats, and inspect_risk_map. Added ReviewVerdict (4 levels: LikelyApprovable, StandardReview, RequiresReview, RequiresCarefulReview) as a quick signal for agents. New --format markdown for all commands. Extended EntityReview with before/after content and dependency names for drill-down.
+820 lines • 12 files • 3 crates (up from 2) • 12 tests passing
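A minimal sketch of how a four-level verdict could be derived from per-entity risk. Only the `ReviewVerdict` variant names come from the entry above. The `RiskLevel` enum, its `Low` variant, the `verdict_for` function, and the "highest risk wins" mapping are assumptions for illustration, not the actual inspect-core logic.

```rust
/// The four verdict levels named in the changelog entry, ordered from
/// least to most scrutiny so they can be compared with `<` and `>`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ReviewVerdict {
    LikelyApprovable,
    StandardReview,
    RequiresReview,
    RequiresCarefulReview,
}

/// Hypothetical risk levels. Medium, High, and Critical appear in the
/// changelog; Low is assumed here.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum RiskLevel {
    Low,
    Medium,
    High,
    Critical,
}

/// Assumed mapping: the verdict follows the highest risk level among the
/// changed entities. An empty change set is treated as approvable.
pub fn verdict_for(levels: &[RiskLevel]) -> ReviewVerdict {
    match levels.iter().max() {
        None | Some(RiskLevel::Low) => ReviewVerdict::LikelyApprovable,
        Some(RiskLevel::Medium) => ReviewVerdict::StandardReview,
        Some(RiskLevel::High) => ReviewVerdict::RequiresReview,
        Some(RiskLevel::Critical) => ReviewVerdict::RequiresCarefulReview,
    }
}
```

Deriving `Ord` on the verdict lets an agent apply a simple threshold, for example skipping anything below `RequiresReview`.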
Ran inspect against the full Greptile benchmark: 50 PRs, 5 repos, 97 golden comments from human reviewers. 83.5% High/Critical recall. 100% recall at the Medium threshold, meaning every golden comment fell within a flagged entity. Per-repo: Keycloak (Java) 100%, Cal.com (TypeScript) 91%, Grafana (Go) 81%, Discourse (Ruby) 79%, Sentry (Python) 67%. Beats Augment (55%), Greptile (45%), CodeRabbit (43%), Cursor (41%), and Copilot (34%). No LLM, no API key, runs locally in milliseconds.
50 PRs • 5 repos • 97 golden comments • 83.5% HC recall • zero cost
Rewrote the risk scoring formula. Previously, classification was the primary signal: functional changes scored high, text changes scored low, regardless of context. Now dependents and blast radius are the primary discriminators. A functional change to a leaf function with zero dependents scores Medium. A syntax change to a hub function with 50 dependents scores High. The cosmetic discount is now steeper: the multiplier applied to cosmetic changes dropped from 0.7x to 0.2x.
Risk formula rewrite • AACR-Bench: 48.3% HC recall (was ~30% before) • 78.2% HC+M recall
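The shape of the new scoring can be sketched as below. This is not the real formula: `ChangeKind`, `risk_score`, the log scaling, and the per-kind weights are all assumed for illustration. Only the 0.2x cosmetic multiplier and the idea that dependents outweigh classification come from the entry above.

```rust
/// Three illustrative change kinds (the real classifier has 7 variants).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ChangeKind {
    Functional,
    Syntax,
    Text,
}

/// Hypothetical score: dependents are the dominant term, classification
/// only scales it, and cosmetic changes are discounted to 0.2x.
pub fn risk_score(kind: ChangeKind, dependents: usize, cosmetic: bool) -> f64 {
    // Log-scale the dependent count so hubs score higher without the
    // score growing linearly with graph size (assumed choice).
    let reach = (1.0 + dependents as f64).ln();

    // Assumed secondary weights per change kind.
    let weight = match kind {
        ChangeKind::Functional => 1.0,
        ChangeKind::Syntax => 0.6,
        ChangeKind::Text => 0.3,
    };

    let base = weight * (1.0 + reach);

    // Cosmetic multiplier from the changelog: 0.2x (previously 0.7x).
    if cosmetic { base * 0.2 } else { base }
}
```

With these assumed weights, a syntax change to a function with 50 dependents outranks a functional change to a leaf, which is the ordering the rewrite was after.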
inspect was built on small repos (sem, weave). Then we ran it on Sentry: 16,000 files, 100,000+ entities. It took 40 seconds. Switched to HashSet for O(1) lookups in classify and analyze. Replaced full BFS collection with impact_count(), which counts reachable dependents without allocating the result collection. Added per-phase timing (diff, graph build, scoring) so bottlenecks are visible. Sentry dropped from 40s to ~4s. Small repos stayed under 50ms.
40s → 4s (Sentry, 16k files) • per-phase timing • parallel graph build
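A rough sketch of the counting idea, under stated assumptions: the `Dependents` map type and the `u32` entity ids are hypothetical, and the real `impact_count()` signature is not shown in this changelog. The traversal still keeps a visited set and a queue. What it avoids is building and returning the list of reached entities when only the count is needed.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical graph shape: entity id -> ids of entities that depend on it.
pub type Dependents = HashMap<u32, Vec<u32>>;

/// Count transitive dependents of `start` with a breadth-first walk,
/// without materialising the reached entities as a result collection.
pub fn impact_count(graph: &Dependents, start: u32) -> usize {
    let mut visited: HashSet<u32> = HashSet::new();
    let mut queue: VecDeque<u32> = VecDeque::new();
    visited.insert(start);
    queue.push_back(start);

    let mut count = 0;
    while let Some(id) = queue.pop_front() {
        if let Some(deps) = graph.get(&id) {
            for &dep in deps {
                // `insert` returns true only the first time an id is seen,
                // so cycles and diamond dependencies are counted once.
                if visited.insert(dep) {
                    count += 1;
                    queue.push_back(dep);
                }
            }
        }
    }
    count
}
```

The `HashSet` membership check is the same O(1) lookup mentioned in the entry; a `Vec`-based visited list would make the walk quadratic on large graphs.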
Entity graph now covers all source files (via git ls-files) instead of only changed files. Before this, blast radius was always zero because dependents from unchanged files weren't in the graph. New inspect bench command iterates commits and collects aggregate metrics. 96.8% of commits in sem contained tangled logical changes.
bench command • full-repo graph • tangled commit detection
First working version. Cargo workspace: inspect-core (library) + inspect-cli (binary). ConGra change classification (7 variants), risk scoring, Union-Find untangling into logical groups. Three commands: inspect diff, inspect pr, inspect file. Terminal (colored) + JSON output. 10 tests.
Cargo workspace • 2 crates • 10 tests • 3 commands
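For readers unfamiliar with Union-Find untangling, a minimal sketch follows. The `UnionFind` struct, `group_entities`, and the use of index pairs as dependency edges are assumptions for illustration; the actual inspect-core types may differ. Changed entities that reference each other end up in the same group, and unrelated ones stay separate.

```rust
use std::collections::BTreeMap;

/// Minimal disjoint-set structure with path compression.
pub struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    pub fn new(n: usize) -> Self {
        Self { parent: (0..n).collect() }
    }

    /// Find the representative of `x`, flattening the path on the way.
    pub fn find(&mut self, x: usize) -> usize {
        let mut root = x;
        while self.parent[root] != root {
            root = self.parent[root];
        }
        let mut cur = x;
        while self.parent[cur] != root {
            let next = self.parent[cur];
            self.parent[cur] = root;
            cur = next;
        }
        root
    }

    /// Merge the sets containing `a` and `b`.
    pub fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            self.parent[rb] = ra;
        }
    }
}

/// Group `n` changed entities (indexed 0..n) into logical groups, given
/// hypothetical dependency edges between them. Indices must be < n.
pub fn group_entities(n: usize, edges: &[(usize, usize)]) -> Vec<Vec<usize>> {
    let mut uf = UnionFind::new(n);
    for &(a, b) in edges {
        uf.union(a, b);
    }
    // BTreeMap keeps group order deterministic (sorted by representative).
    let mut groups: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for i in 0..n {
        groups.entry(uf.find(i)).or_default().push(i);
    }
    groups.into_values().collect()
}
```

Under this sketch, a diff that yields more than one group contains more than one logical change.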