inspect + LLM evaluated on the same benchmarks used to measure frontier code review tools. Entity-level triage focuses the LLM on the code that matters.
Dataset on HuggingFace. 141 bugs planted across Sentry, Cal.com, Grafana, Keycloak, and Discourse by the Greptile team. Same heuristic keyword-matching judge applied identically to all three tools.
| Tool | Recall | Precision | F1 | High+Critical Recall | Findings |
|---|---|---|---|---|---|
| inspect + GPT-5.2 | 95.0% | 33.3% | 49.4% | 100% | 402 |
| Greptile (API) | 91.5% | 21.9% | 35.3% | 94.1% | 590 |
| CodeRabbit (CLI) | 56.0% | 48.2% | 51.8% | 60.8% | 164 |

Recall by severity:

| Severity | inspect | Greptile | CodeRabbit |
|---|---|---|---|
| Critical (n=9) | 100% | 100% | 55.6% |
| High (n=42) | 100% | 92.9% | 61.9% |
| Medium (n=49) | 91.8% | 87.8% | 61.2% |
| Low (n=41) | 92.7% | 92.7% | 43.9% |

Recall by repo:

| Repo | inspect | Greptile | CodeRabbit |
|---|---|---|---|
| Cal.com (n=31) | 96.8% | 100% | 77.4% |
| Discourse (n=28) | 100% | 100% | 67.9% |
| Grafana (n=22) | 90.9% | 90.9% | 36.4% |
| Keycloak (n=26) | 100% | 96.2% | 65.4% |
| Sentry (n=34) | 88.2% | 73.5% | 32.4% |
inspect + LLM reviews the top 60 entities per PR (round-robin by file, sorted by risk score), plus a 5-file gap review for diff hunks no entity covers. Top 15 findings per PR, ranked by confidence. 10 concurrent LLM calls. Entity-level dedup (20-line window plus identifier overlap; sketched below). All tools judged with the same keyword-matching heuristic. Greptile via their production API. CodeRabbit via their free-tier CLI (rate limited, ~1 review per 8 min).
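For illustration, the dedup rule can be as simple as the following sketch. The `Finding` shape and the keep-highest-confidence policy are assumptions; only the 20-line window and identifier overlap come from the description above.

```rust
// Hypothetical sketch of entity-level dedup: two findings collapse when
// they sit within a 20-line window in the same file and share an identifier.
struct Finding {
    file: String,
    line: u32,
    identifiers: Vec<String>, // entity names the finding references
    confidence: f32,
}

fn is_duplicate(a: &Finding, b: &Finding) -> bool {
    a.file == b.file
        && a.line.abs_diff(b.line) <= 20
        && a.identifiers.iter().any(|id| b.identifiers.contains(id))
}

/// Keep the highest-confidence finding from each duplicate cluster
/// (assumed policy; ranking by confidence matches the top-15 cut above).
fn dedup(mut findings: Vec<Finding>) -> Vec<Finding> {
    findings.sort_by(|a, b| b.confidence.total_cmp(&a.confidence));
    let mut kept: Vec<Finding> = Vec::new();
    for f in findings {
        if !kept.iter().any(|k| is_duplicate(k, &f)) {
            kept.push(f);
        }
    }
    kept
}
```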
AACR-Bench is the benchmark used to evaluate GPT-5.2, Claude 4.5 Sonnet, and other frontier LLMs for automated code review. We ran all three tools on 20 diverse PRs.
| Tool | Recall | Precision | F1 | Findings |
|---|---|---|---|---|
| inspect + GPT-5.2 | 30.1% | 22.7% | 25.9% | 220 |
| Greptile (API) | 23.5% | 21.7% | 22.5% | 180 |
| CodeRabbit (CLI) | 13.3% | 115.8%* | 23.8% | 19 |
*CodeRabbit's precision exceeds 100% because multiple golden comments can match a single finding: 19 findings matched 22 golden issues, and 22/19 ≈ 115.8%.
20 diverse PRs from AACR-Bench (round-robin across repos). Same keyword-matching judge for all tools. Top 15 findings per PR by confidence.
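A keyword-matching judge of the kind used here can be as small as the sketch below. The golden-bug schema, the majority threshold, and all names are assumptions for illustration, not the benchmark's actual harness.

```rust
// Illustrative keyword-matching judge: a finding "matches" a golden bug
// if it points at the right file and mentions enough of the bug's keywords.
struct GoldenBug {
    file: String,
    keywords: Vec<String>,
}

fn matches(finding_file: &str, finding_text: &str, bug: &GoldenBug) -> bool {
    let text = finding_text.to_lowercase();
    let hits = bug
        .keywords
        .iter()
        .filter(|kw| text.contains(&kw.to_lowercase()))
        .count();
    // Assumed rule: same file plus a majority of keywords present.
    finding_file == bug.file && hits * 2 >= bug.keywords.len()
}
```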
Martian Leaderboard. 50 real PRs across Keycloak, Sentry, Grafana, Discourse, and Cal.com. GPT-5.2 judges whether each candidate matches a golden bug. Same judge for all tools.
| # | Tool | F1 | Precision | Recall |
|---|---|---|---|---|
| 1 | inspect + GPT-5.2 | 47.5% | 44.3% | 51.1% |
| 2 | Augment | 45.8% | 37.3% | 59.1% |
| 3 | Cursor Bugbot | 40.5% | 38.3% | 43.1% |
| 4 | Propel | 38.1% | 38.9% | 37.2% |
| 5 | Greptile | 35.1% | 33.8% | 36.5% |
| 6 | Claude Code | 33.6% | 30.5% | 37.2% |
| 7 | GitHub Copilot | 32.6% | 23.5% | 53.3% |
| 8 | CodeRabbit | 28.1% | 21.2% | 41.6% |
| 9 | Gemini | 28.1% | 24.6% | 32.8% |
| 10 | Kilo+Grok | 25.0% | 48.9% | 16.8% |
9 parallel specialized lenses (data, concurrency, contracts, security, typos, runtime, plus 3 general), a structural file filter, and a validation pass that verifies each finding against the entity's before/after code. Best of 4 runs shown.
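The lens fan-out has roughly the following shape: nine independent LLM calls over the same triaged entities, each with one narrow instruction. The prompts, the stubbed client, and all names here are illustrative, not inspect's actual code.

```rust
use rayon::prelude::*;

// One focused instruction per lens; wording is hypothetical.
const LENSES: [&str; 9] = [
    "data: lost writes, bad migrations, serialization drift",
    "concurrency: races, deadlocks, atomicity",
    "contracts: broken API promises, schema mismatches",
    "security: injection, authz, secret handling",
    "typos: copy-paste slips, inverted conditions, off-by-one",
    "runtime: panics, null derefs, unhandled errors",
    "general pass 1",
    "general pass 2",
    "general pass 3",
];

// Stub standing in for a real chat-completion client.
fn call_llm(lens: &str, entities: &str) -> Vec<String> {
    let _ = (lens, entities);
    Vec::new()
}

// Each lens runs as an independent call; findings are merged afterwards.
fn run_lenses(entities: &str) -> Vec<String> {
    LENSES
        .par_iter()
        .flat_map(|lens| call_llm(lens, entities))
        .collect()
}
```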
9 parallel review lenses catch different categories of bugs.
Entity extraction, dependency graph, change classification, risk scoring, commit untangling. All local, no API calls.
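To make the risk-scoring step concrete, here is a minimal sketch assuming a handful of local signals. The features and weights are assumptions for illustration; sem-core's actual formula is not shown here.

```rust
// Hypothetical per-entity risk score built from purely local signals.
struct EntityChange {
    lines_changed: u32,
    fan_in: u32,          // dependents in the dependency graph
    is_public_api: bool,
    touches_concurrency: bool,
}

fn risk_score(c: &EntityChange) -> f32 {
    // Log-dampened size so one huge hunk doesn't drown everything else.
    let mut score = (c.lines_changed as f32).ln_1p();
    // Widely-depended-on code is riskier to change.
    score += (c.fan_in as f32).ln_1p() * 2.0;
    if c.is_public_api {
        score += 1.5;
    }
    if c.touches_concurrency {
        score += 1.0;
    }
    score
}
```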
Time to run `inspect diff HEAD~1` on a real commit. 30 runs via hyperfine, warm cache.
| Repo | Commit size | Time |
|---|---|---|
| sem | 25 files, 65 entities changed | 6ms |
| weave | 80 files, 89 entities changed | 6ms |
| inspect | 50 files, large commit | 67ms |
Time to analyze every commit in a repo's history: extract entities, build the dependency graph, classify changes, score risk, untangle commits. See the sketch of the walk after the table.
| Repo | Commits | Entities | Wall time |
|---|---|---|---|
| sem | 38 | 5,216 | 0.57s |
| weave | 45 | 2,854 | 1.33s |
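A minimal sketch of the history walk, assuming the `git2` crate; the per-commit analysis is stubbed, and sem-core's actual walker may differ.

```rust
use git2::Repository;

// Walk every commit reachable from HEAD and run (stubbed) analysis on each.
fn analyze_history(path: &str) -> Result<usize, git2::Error> {
    let repo = Repository::open(path)?;
    let mut walk = repo.revwalk()?;
    walk.push_head()?;

    let mut commits = 0;
    for oid in walk {
        let commit = repo.find_commit(oid?)?;
        // Here: diff against the first parent, extract changed entities,
        // classify, score risk, and untangle (all stubbed in this sketch).
        let _tree = commit.tree()?;
        commits += 1;
    }
    Ok(commits)
}
```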
Powered by sem-core v0.3.0: xxHash64 structural hashing, parallel tree-sitter parsing via rayon, cached git tree resolution, LTO-optimized release builds.
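To make the structural-hashing idea concrete, here is a minimal sketch assuming the public `tree-sitter`, `tree-sitter-rust`, `xxhash-rust`, and `rayon` crates: it hashes named node kinds in preorder, so formatting-only edits keep the same hash. This is an illustration, not sem-core's implementation.

```rust
use rayon::prelude::*;
use tree_sitter::{Node, Parser};
use xxhash_rust::xxh64::Xxh64;

/// Hash the *shape* of the syntax tree: named node kinds in preorder.
/// Whitespace and comment-only changes leave the hash untouched.
fn structural_hash(node: Node, hasher: &mut Xxh64) {
    if node.is_named() {
        hasher.update(node.kind().as_bytes());
        hasher.update(&[0]); // separator so "ab"+"c" != "a"+"bc"
    }
    let mut cursor = node.walk();
    for child in node.children(&mut cursor) {
        structural_hash(child, hasher);
    }
}

fn hash_files(files: &[(String, String)]) -> Vec<(String, u64)> {
    files
        .par_iter()
        // One parser per rayon worker: tree-sitter parsers are stateful.
        .map_init(
            || {
                let mut p = Parser::new();
                // tree-sitter-rust <= 0.20 API; newer releases differ.
                p.set_language(tree_sitter_rust::language()).unwrap();
                p
            },
            |parser, (path, source)| {
                let tree = parser.parse(source, None).expect("parse failed");
                let mut h = Xxh64::new(0);
                structural_hash(tree.root_node(), &mut h);
                (path.clone(), h.digest())
            },
        )
        .collect()
}
```

`map_init` keeps parser construction off the per-file hot path, which is one way the warm-cache numbers above stay in single-digit milliseconds.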