inspect + LLM evaluated on the same benchmarks used to measure frontier code review tools. Entity-level triage focuses the LLM on the code that matters.
Dataset on HuggingFace. 141 bugs planted by the Greptile team across Sentry, Cal.com, Grafana, Keycloak, and Discourse. The same heuristic keyword-matching judge was applied identically to all three tools.

| Tool | Recall | Precision | F1 | High+Critical Recall | Findings |
|---|---|---|---|---|---|
| inspect + GPT-5.2 | 95.0% | 33.3% | 49.4% | 100% | 402 |
| Greptile (API) | 91.5% | 21.9% | 35.3% | 94.1% | 590 |
| CodeRabbit (CLI) | 56.0% | 48.2% | 51.8% | 60.8% | 164 |
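The three metric columns are related by the standard formulas. A quick sketch reproduces the inspect row from raw counts, assuming 134 true positives (the count consistent with the published 95.0% recall on 141 bugs and 33.3% precision on 402 findings):

```python
def review_metrics(bugs_caught: int, findings: int, total_bugs: int):
    """Standard retrieval metrics for a planted-bug benchmark."""
    recall = bugs_caught / total_bugs       # share of planted bugs found
    precision = bugs_caught / findings      # share of findings that were real
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# inspect + GPT-5.2 row: 134 of 141 bugs caught across 402 findings
r, p, f1 = review_metrics(134, 402, 141)
print(f"recall={r:.1%} precision={p:.1%} f1={f1:.1%}")
# → recall=95.0% precision=33.3% f1=49.4%
```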

Recall by severity:

| Severity | inspect | Greptile | CodeRabbit |
|---|---|---|---|
| Critical (n=9) | 100% | 100% | 55.6% |
| High (n=42) | 100% | 92.9% | 61.9% |
| Medium (n=49) | 91.8% | 87.8% | 61.2% |
| Low (n=41) | 92.7% | 92.7% | 43.9% |

Recall by repo:

| Repo | inspect | Greptile | CodeRabbit |
|---|---|---|---|
| Cal.com (n=31) | 96.8% | 100% | 77.4% |
| Discourse (n=28) | 100% | 100% | 67.9% |
| Grafana (n=22) | 90.9% | 90.9% | 36.4% |
| Keycloak (n=26) | 100% | 96.2% | 65.4% |
| Sentry (n=34) | 88.2% | 73.5% | 32.4% |
inspect + LLM reviews the top 60 entities per PR (round-robin by file, sorted by risk score), plus a 5-file gap review for diff regions no entity covers. Top 15 findings per PR by confidence. 10 concurrent LLM calls. Entity-level dedup (20-line window + identifier overlap). All tools judged with the same keyword-matching heuristic. Greptile ran via their production API; CodeRabbit via their free-tier CLI (rate limited to ~1 review per 8 min).
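The "top 60 entities, round-robin by file, sorted by risk score" selection above can be sketched as follows. The entity shape and field names are illustrative, not inspect's actual types:

```python
from collections import defaultdict
from itertools import zip_longest

def select_entities(entities, limit=60):
    """Round-robin across files, each file's entities ordered by risk score,
    so one huge file cannot crowd out the rest of the PR."""
    by_file = defaultdict(list)
    for e in entities:
        by_file[e["file"]].append(e)
    for group in by_file.values():
        group.sort(key=lambda e: e["risk"], reverse=True)
    picked = []
    # Each "round" takes the riskiest remaining entity from every file in turn.
    for round_ in zip_longest(*by_file.values()):
        for e in round_:
            if e is not None:
                picked.append(e)
                if len(picked) == limit:
                    return picked
    return picked
```

With a 3-entity budget over two files, the riskiest entity of each file is taken before the second-riskiest of any file.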
AACR-Bench is the benchmark used to evaluate GPT-5.2, Claude 4.5 Sonnet, and other frontier LLMs for automated code review. We ran all three tools on 20 diverse PRs.

| Tool | Recall | Precision | F1 | Findings |
|---|---|---|---|---|
| inspect + GPT-5.2 | 30.1% | 22.7% | 25.9% | 220 |
| Greptile (API) | 23.5% | 21.7% | 22.5% | 180 |
| CodeRabbit (CLI) | 13.3% | 115.8%* | 23.8% | 19 |
*CodeRabbit's precision exceeds 100% because multiple golden comments matched the same finding (19 findings caught 22 issues).
20 diverse PRs from AACR-Bench (round-robin across repos). Same keyword-matching judge for all tools. Top 15 findings per PR by confidence.
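A minimal sketch of why a keyword-matching judge can report precision above 100%: each golden comment is matched independently against the tool's findings, so two golden comments can both be credited to a single finding. The matching rule here (two shared keywords) is an assumption, not the actual judge:

```python
def judge(golden_comments, findings):
    """Credit a golden comment if any finding shares >= 2 keywords with it."""
    def keywords(text):
        return {w for w in text.lower().split() if len(w) > 3}

    matched = 0
    for gold in golden_comments:
        gk = keywords(gold)
        if any(len(gk & keywords(f)) >= 2 for f in findings):
            matched += 1
    # Precision divides matched golden comments by findings emitted, so it
    # exceeds 1.0 whenever several golden comments hit the same finding.
    return {"recall": matched / len(golden_comments),
            "precision": matched / len(findings)}
```

Here one finding satisfies two golden comments, giving 100% recall and 200% "precision" under this counting scheme.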
inspect runs entity-level triage locally (free, <1s), then sends the top entities to an LLM for focused review. The LLM sees entity-scoped code with before/after context, not a raw diff.
The pipeline. Run inspect first (free, <1s) to get entity risk rankings. Send top 30 entities to an LLM with before/after code + file diff. Gap-review uncovered files. Dedup and filter by confidence. Top 15 findings per PR. The triage step means the LLM sees focused code, not an entire diff.
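The dedup step (20-line window plus identifier overlap, as configured for the benchmark) might look roughly like this; the finding shape, thresholds, and identifier regex are assumptions, not inspect's implementation:

```python
import re

def dedup(findings, window=20):
    """Drop a finding if an already-kept finding in the same file sits within
    `window` lines and shares at least one identifier-like token. A real
    system would extract identifiers from code spans, not free text."""
    ident = re.compile(r"[A-Za-z_][A-Za-z0-9_]{2,}")
    kept = []
    for f in findings:
        ids = set(ident.findall(f["text"]))
        is_dup = any(
            k["file"] == f["file"]
            and abs(k["line"] - f["line"]) <= window
            and ids & set(ident.findall(k["text"]))
            for k in kept
        )
        if not is_dup:
            kept.append(f)
    return kept
```

Two findings about `update_cache` fifteen lines apart collapse into one; the same complaint eighty lines away survives as a separate finding.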
Entity extraction, dependency graph, change classification, risk scoring, commit untangling. All local, no API calls.
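inspect's actual scoring model isn't documented here; purely as an illustration, a local risk score could combine change size, dependency fan-in, and change kind. Every weight and field name below is invented:

```python
def risk_score(entity):
    """Toy risk heuristic: larger changes to widely-depended-on entities rank
    higher, with signature changes weighted above body edits. Weights are
    illustrative, not inspect's."""
    kind_weight = {"signature": 3.0, "deleted": 2.0, "added": 1.5, "body": 1.0}
    return (
        entity["lines_changed"]
        * (1 + entity["dependents"]) ** 0.5   # dampened fan-in from the graph
        * kind_weight.get(entity["change_kind"], 1.0)
    )

# A 4-line signature change with 24 dependents outranks a 40-line body edit
# to a leaf function.
hot = {"lines_changed": 4, "dependents": 24, "change_kind": "signature"}
leaf = {"lines_changed": 40, "dependents": 0, "change_kind": "body"}
```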
Time to run `inspect diff HEAD~1` on a real commit. 30 runs via hyperfine, warm cache.

| Repo | Size | Time |
|---|---|---|
| sem | 25 files, 65 entities changed | 6ms |
| weave | 80 files, 89 entities changed | 6ms |
| inspect | 50 files, large commit | 67ms |
Time to analyze every commit in a repo's history: extract entities, build graph, classify changes, score risk, untangle.

| Repo | Commits | Entities | Wall time |
|---|---|---|---|
| sem | 38 | 5,216 | 0.57s |
| weave | 45 | 2,854 | 1.33s |
Powered by sem-core v0.3.0: xxHash64 structural hashing, parallel tree-sitter parsing via rayon, cached git tree resolution, LTO-optimized release builds.
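Structural hashing means hashing an entity with formatting-irrelevant detail normalized away, so a reformatted-but-unchanged entity keeps its hash. A toy sketch, substituting Python's blake2b for xxHash64 and crude token normalization for sem-core's tree-sitter AST walk:

```python
from hashlib import blake2b

def structural_hash(source: str) -> str:
    """Hash code after stripping comments and collapsing whitespace, so pure
    reformatting leaves the hash unchanged. The real implementation hashes
    the parsed syntax tree; this token-level pass only approximates that."""
    tokens = []
    for line in source.splitlines():
        code = line.split("#", 1)[0]   # crude Python-style comment strip
        tokens.extend(code.split())
    # digest_size=8 gives a 64-bit digest, matching xxHash64's width.
    return blake2b(" ".join(tokens).encode(), digest_size=8).hexdigest()
```

Reindenting and re-spacing a function produces the same hash; changing an operator does not.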