Benchmarks

inspect + LLM was evaluated on the same benchmarks used to measure frontier code review tools. Entity-level triage focuses the LLM on the code that matters.

Greptile Benchmark (141 planted bugs, 52 PRs, 5 repos)

Dataset on HuggingFace. 141 bugs planted across Sentry, Cal.com, Grafana, Keycloak, and Discourse by the Greptile team. Same heuristic keyword-matching judge applied identically to all three tools.

- 95.0% recall (inspect + GPT-5.2): 134 of 141 bugs found
- 100% HC recall: every high/critical bug caught
- 49.4% F1 score: +14pp over Greptile (35.3%)
| Tool | Recall | Precision | F1 | HC Recall | Findings |
|---|---|---|---|---|---|
| inspect + GPT-5.2 | 95.0% | 33.3% | 49.4% | 100% | 402 |
| Greptile (API) | 91.5% | 21.9% | 35.3% | 94.1% | 590 |
| CodeRabbit (CLI) | 56.0% | 48.2% | 51.8% | 60.8% | 164 |

Recall (all severities)

- inspect + GPT-5.2: 95.0%
- Greptile (API): 91.5%
- CodeRabbit (CLI): 56.0%
141 golden comments, 52 PRs, same judge

HC Recall (High + Critical only)

- inspect + GPT-5.2: 100%
- Greptile (API): 94.1%
- CodeRabbit (CLI): 60.8%
51 high/critical bugs. CodeRabbit misses 39% of them.

Per-severity recall

| Severity | inspect | Greptile | CodeRabbit |
|---|---|---|---|
| Critical (n=9) | 100% | 100% | 55.6% |
| High (n=42) | 100% | 92.9% | 61.9% |
| Medium (n=49) | 91.8% | 87.8% | 61.2% |
| Low (n=41) | 92.7% | 92.7% | 43.9% |

Per-repo recall

| Repo | inspect | Greptile | CodeRabbit |
|---|---|---|---|
| Cal.com (n=31) | 96.8% | 100% | 77.4% |
| Discourse (n=28) | 100% | 100% | 67.9% |
| Grafana (n=22) | 90.9% | 90.9% | 36.4% |
| Keycloak (n=26) | 100% | 96.2% | 65.4% |
| Sentry (n=34) | 88.2% | 73.5% | 32.4% |

inspect + LLM reviews the top 60 entities per PR (round-robin by file, sorted by risk score), plus a 5-file gap review for parts of the diff no entity covers. The top 15 findings per PR are kept, ranked by confidence, using 10 concurrent LLM calls and entity-level dedup (20-line window plus identifier overlap). All tools were judged with the same keyword-matching heuristic: Greptile via its production API, CodeRabbit via its free-tier CLI (rate limited, roughly one review per 8 minutes).
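The dedup rule above (merge findings within a 20-line window that share an identifier) can be sketched roughly as follows. This is an illustrative Python sketch, not inspect's actual implementation; the finding fields (`file`, `line`, `identifiers`) are assumed names.

```python
def dedup_findings(findings, window=20):
    """Keep a finding unless an already-kept finding is in the same
    file, within `window` lines, AND shares at least one identifier.
    Illustrative sketch; field names are assumptions."""
    kept = []
    for f in findings:
        duplicate = any(
            f["file"] == k["file"]
            and abs(f["line"] - k["line"]) <= window
            and set(f["identifiers"]) & set(k["identifiers"])
            for k in kept
        )
        if not duplicate:
            kept.append(f)
    return kept
```

Requiring identifier overlap in addition to line proximity avoids merging two unrelated findings that happen to land near each other in a dense diff.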

AACR-Bench (166 golden comments, 20 PRs, 9 languages)

AACR-Bench is the benchmark used to evaluate GPT-5.2, Claude 4.5 Sonnet, and other frontier LLMs for automated code review. We ran all three tools on 20 diverse PRs.

- 30.1% recall (inspect + GPT-5.2): 1.3x Greptile, 2.3x CodeRabbit
- 22.7% precision: highest of all three tools
- 25.9% F1 score: beats Greptile (22.5%) and CodeRabbit (23.8%)
| Tool | Recall | Precision | F1 | Findings |
|---|---|---|---|---|
| inspect + GPT-5.2 | 30.1% | 22.7% | 25.9% | 220 |
| Greptile (API) | 23.5% | 21.7% | 22.5% | 180 |
| CodeRabbit (CLI) | 13.3% | 115.8%* | 23.8% | 19 |

*CodeRabbit's precision exceeds 100% because multiple golden comments matched the same finding (19 findings caught 22 issues).

Recall comparison

- inspect + GPT-5.2: 30.1%
- Greptile (API): 23.5%
- CodeRabbit (CLI): 13.3%
20 PRs, 166 golden comments, same judge

20 diverse PRs from AACR-Bench (round-robin across repos). Same keyword-matching judge for all tools. Top 15 findings per PR by confidence.

How it works

inspect runs entity-level triage locally (free, <1s), then sends the top entities to an LLM for focused review. The LLM sees entity-scoped code with before/after context, not a raw diff.

The pipeline:

1. Run inspect (free, <1s) to get entity risk rankings.
2. Send the top 30 entities to an LLM with before/after code plus the file diff.
3. Gap-review uncovered files.
4. Dedup and filter by confidence; keep the top 15 findings per PR.

The triage step means the LLM sees focused code, not an entire diff.
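The overall flow can be sketched in a few lines. This is a hedged illustration of the steps described above, not inspect's real API: `review_entity`, `gap_review`, and `dedup` are stand-in callables, and the dict fields are assumed names.

```python
def review_pr(entities, review_entity, gap_review, dedup,
              top_n=30, max_findings=15):
    """Sketch of the described pipeline: local triage, focused LLM
    review, gap review, dedup, confidence filter. Illustrative only."""
    # 1. Local triage: rank changed entities by risk score (free, <1s).
    ranked = sorted(entities, key=lambda e: e["risk"], reverse=True)
    selected = ranked[:top_n]
    # 2. Focused LLM review of each selected entity
    #    (entity-scoped before/after code, not a raw diff).
    findings = [f for e in selected for f in review_entity(e)]
    # 3. Gap review for changed files no selected entity covers.
    findings += gap_review({e["file"] for e in selected})
    # 4. Dedup, then keep the top findings by confidence.
    findings = dedup(findings)
    findings.sort(key=lambda f: f["confidence"], reverse=True)
    return findings[:max_findings]
```

The key design point is ordering: triage is local and cheap, so only the highest-risk slices of the PR ever reach the (slow, paid) LLM step.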

Speed

Entity extraction, dependency graph construction, change classification, risk scoring, and commit untangling all run locally, with no API calls.

Single commit review

Time to run `inspect diff HEAD~1` on a real commit. 30 runs via hyperfine, warm cache.

| Repo | Size | Time |
|---|---|---|
| sem | 25 files, 65 entities changed | 6ms |
| weave | 80 files, 89 entities changed | 6ms |
| inspect | 50 files, large commit | 67ms |

Full repo history (inspect bench)

Time to analyze every commit in a repo's history: extract entities, build graph, classify changes, score risk, untangle.

| Repo | Commits | Entities | Wall time |
|---|---|---|---|
| sem | 38 | 5,216 | 0.57s |
| weave | 45 | 2,854 | 1.33s |

Single commit review time (visual)

- sem (25 files): 6ms
- weave (80 files): 6ms
- inspect (50 files): 67ms
30 runs via hyperfine -N, warm cache

Powered by sem-core v0.3.0: xxHash64 structural hashing, parallel tree-sitter parsing via rayon, cached git tree resolution, LTO-optimized release builds.
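Structural hashing is what lets repeated analysis skip unchanged entities: an entity's hash is taken over its code with incidental formatting stripped, so whitespace-only edits keep the same hash. A minimal sketch, with two loudly labeled assumptions: sem-core uses xxHash64, but the stdlib `blake2b` stands in here (xxhash is a third-party Python package), and the exact normalization rule is a guess for illustration.

```python
import hashlib

def structural_hash(source: str) -> str:
    """Hash an entity's source with leading/trailing whitespace and
    blank lines stripped, so formatting-only edits hash identically.
    Sketch only: sem-core uses xxHash64, and its real normalization
    may differ; blake2b (8-byte digest) is a stand-in."""
    normalized = "\n".join(
        line.strip() for line in source.splitlines() if line.strip()
    )
    return hashlib.blake2b(normalized.encode(), digest_size=8).hexdigest()
```

With this scheme, re-indenting a function yields the same 64-bit hash, while any token-level change yields a new one, which is what makes warm-cache runs cheap.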