inspect + LLM evaluated on the same benchmarks used to measure frontier code review tools. Entity-level triage focuses the LLM on the code that matters.
Dataset on HuggingFace. 141 bugs planted across Sentry, Cal.com, Grafana, Keycloak, and Discourse by the Greptile team. Same heuristic keyword-matching judge applied identically to all three tools.
| Tool | Recall | Precision | F1 | High+Critical Recall | Findings |
|---|---|---|---|---|---|
| inspect + GPT-5.2 | 95.0% | 33.3% | 49.4% | 100% | 402 |
| Greptile (API) | 91.5% | 21.9% | 35.3% | 94.1% | 590 |
| CodeRabbit (CLI) | 56.0% | 48.2% | 51.8% | 60.8% | 164 |

Recall by severity:

| Severity | inspect | Greptile | CodeRabbit |
|---|---|---|---|
| Critical (n=9) | 100% | 100% | 55.6% |
| High (n=42) | 100% | 92.9% | 61.9% |
| Medium (n=49) | 91.8% | 87.8% | 61.2% |
| Low (n=41) | 92.7% | 92.7% | 43.9% |

Recall by repo:

| Repo | inspect | Greptile | CodeRabbit |
|---|---|---|---|
| Cal.com (n=31) | 96.8% | 100% | 77.4% |
| Discourse (n=28) | 100% | 100% | 67.9% |
| Grafana (n=22) | 90.9% | 90.9% | 36.4% |
| Keycloak (n=26) | 100% | 96.2% | 65.4% |
| Sentry (n=34) | 88.2% | 73.5% | 32.4% |
inspect + LLM reviews the top 60 entities per PR (round-robin by file, sorted by risk score), plus a 5-file gap review for diff hunks no entity covers. Top 15 findings per PR, ranked by confidence. 10 concurrent LLM calls. Entity-level dedup (20-line window plus identifier overlap; sketched below). All tools judged with the same keyword-matching heuristic. Greptile via their production API. CodeRabbit via their free-tier CLI (rate limited, ~1 review per 8 min).
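For illustration, the dedup rule can be as simple as the following sketch. The `Finding` shape and the keep-highest-confidence policy are assumptions; only the 20-line window and identifier overlap come from the description above.

```rust
// Hypothetical sketch of entity-level dedup: two findings collapse when
// they sit within a 20-line window in the same file and share an identifier.
struct Finding {
    file: String,
    line: u32,
    identifiers: Vec<String>, // entity names the finding references
    confidence: f32,
}

fn is_duplicate(a: &Finding, b: &Finding) -> bool {
    a.file == b.file
        && a.line.abs_diff(b.line) <= 20
        && a.identifiers.iter().any(|id| b.identifiers.contains(id))
}

/// Keep the highest-confidence finding from each duplicate cluster
/// (assumed policy; ranking by confidence matches the top-15 cut above).
fn dedup(mut findings: Vec<Finding>) -> Vec<Finding> {
    findings.sort_by(|a, b| b.confidence.total_cmp(&a.confidence));
    let mut kept: Vec<Finding> = Vec::new();
    for f in findings {
        if !kept.iter().any(|k| is_duplicate(k, &f)) {
            kept.push(f);
        }
    }
    kept
}
```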
AACR-Bench is the benchmark used to evaluate GPT-5.2, Claude 4.5 Sonnet, and other frontier LLMs for automated code review. We ran all three tools on 20 diverse PRs.
| Tool | Recall | Precision | F1 | Findings |
|---|---|---|---|---|
| inspect + GPT-5.2 | 30.1% | 22.7% | 25.9% | 220 |
| Greptile (API) | 23.5% | 21.7% | 22.5% | 180 |
| CodeRabbit (CLI) | 13.3% | 115.8%* | 23.8% | 19 |
*CodeRabbit's precision exceeds 100% because multiple golden comments can match a single finding: 19 findings matched 22 golden issues, and 22/19 ≈ 115.8%.
20 diverse PRs from AACR-Bench (round-robin across repos). Same keyword-matching judge for all tools. Top 15 findings per PR by confidence.
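A keyword-matching judge of the kind used here can be as small as the sketch below. The golden-bug schema, the majority threshold, and all names are assumptions for illustration, not the benchmark's actual harness.

```rust
// Illustrative keyword-matching judge: a finding "matches" a golden bug
// if it points at the right file and mentions enough of the bug's keywords.
struct GoldenBug {
    file: String,
    keywords: Vec<String>,
}

fn matches(finding_file: &str, finding_text: &str, bug: &GoldenBug) -> bool {
    let text = finding_text.to_lowercase();
    let hits = bug
        .keywords
        .iter()
        .filter(|kw| text.contains(&kw.to_lowercase()))
        .count();
    // Assumed rule: same file plus a majority of keywords present.
    finding_file == bug.file && hits * 2 >= bug.keywords.len()
}
```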
Martian Leaderboard. 50 real PRs across Keycloak, Sentry, Grafana, Discourse, and Cal.com. GPT-5.2 judges whether each candidate matches a golden bug. Same judge for all tools.
| # | Tool | F1 | Precision | Recall |
|---|---|---|---|---|
| 1 | inspect + GPT-5.2 | 47.5% | 44.3% | 51.1% |
| 2 | Augment | 45.8% | 37.3% | 59.1% |
| 3 | Cursor Bugbot | 40.5% | 38.3% | 43.1% |
| 4 | Propel | 38.1% | 38.9% | 37.2% |
| 5 | Greptile | 35.1% | 33.8% | 36.5% |
| 6 | Claude Code | 33.6% | 30.5% | 37.2% |
| 7 | GitHub Copilot | 32.6% | 23.5% | 53.3% |
| 8 | CodeRabbit | 28.1% | 21.2% | 41.6% |
| 9 | Gemini | 28.1% | 24.6% | 32.8% |
| 10 | Kilo+Grok | 25.0% | 48.9% | 16.8% |
9 parallel specialized lenses (data, concurrency, contracts, security, typos, runtime, plus 3 general), a structural file filter, and a validation pass that verifies each finding against the entity's before/after code. Best of 4 runs shown.
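The lens fan-out has roughly the following shape: nine independent LLM calls over the same triaged entities, each with one narrow instruction. The prompts, the stubbed client, and all names here are illustrative, not inspect's actual code.

```rust
use rayon::prelude::*;

// One focused instruction per lens; wording is hypothetical.
const LENSES: [&str; 9] = [
    "data: lost writes, bad migrations, serialization drift",
    "concurrency: races, deadlocks, atomicity",
    "contracts: broken API promises, schema mismatches",
    "security: injection, authz, secret handling",
    "typos: copy-paste slips, inverted conditions, off-by-one",
    "runtime: panics, null derefs, unhandled errors",
    "general pass 1",
    "general pass 2",
    "general pass 3",
];

// Stub standing in for a real chat-completion client.
fn call_llm(lens: &str, entities: &str) -> Vec<String> {
    let _ = (lens, entities);
    Vec::new()
}

// Each lens runs as an independent call; findings are merged afterwards.
fn run_lenses(entities: &str) -> Vec<String> {
    LENSES
        .par_iter()
        .flat_map(|lens| call_llm(lens, entities))
        .collect()
}
```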
9 parallel review lenses catch different categories of bugs.
Entity extraction, dependency graph, change classification, risk scoring, commit untangling. All local, no API calls.
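To make the risk-scoring step concrete, here is a minimal sketch assuming a handful of local signals. The features and weights are assumptions for illustration; sem-core's actual formula is not shown here.

```rust
// Hypothetical per-entity risk score built from purely local signals.
struct EntityChange {
    lines_changed: u32,
    fan_in: u32,          // dependents in the dependency graph
    is_public_api: bool,
    touches_concurrency: bool,
}

fn risk_score(c: &EntityChange) -> f32 {
    // Log-dampened size so one huge hunk doesn't drown everything else.
    let mut score = (c.lines_changed as f32).ln_1p();
    // Widely-depended-on code is riskier to change.
    score += (c.fan_in as f32).ln_1p() * 2.0;
    if c.is_public_api {
        score += 1.5;
    }
    if c.touches_concurrency {
        score += 1.0;
    }
    score
}
```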
Time to run `inspect diff HEAD~1` on a real commit. 30 runs via hyperfine, warm cache.
| Repo | Commit size | Time |
|---|---|---|
| sem | 25 files, 65 entities changed | 6ms |
| weave | 80 files, 89 entities changed | 6ms |
| inspect | 50 files, large commit | 67ms |
Time to analyze every commit in a repo's history: extract entities, build the dependency graph, classify changes, score risk, untangle commits. See the sketch of the walk after the table.
| Repo | Commits | Entities | Wall time |
|---|---|---|---|
| sem | 38 | 5,216 | 0.57s |
| weave | 45 | 2,854 | 1.33s |
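A minimal sketch of the history walk, assuming the `git2` crate; the per-commit analysis is stubbed, and sem-core's actual walker may differ.

```rust
use git2::Repository;

// Walk every commit reachable from HEAD and run (stubbed) analysis on each.
fn analyze_history(path: &str) -> Result<usize, git2::Error> {
    let repo = Repository::open(path)?;
    let mut walk = repo.revwalk()?;
    walk.push_head()?;

    let mut commits = 0;
    for oid in walk {
        let commit = repo.find_commit(oid?)?;
        // Here: diff against the first parent, extract changed entities,
        // classify, score risk, and untangle (all stubbed in this sketch).
        let _tree = commit.tree()?;
        commits += 1;
    }
    Ok(commits)
}
```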
Powered by sem-core v0.3.0: xxHash64 structural hashing, parallel tree-sitter parsing via rayon, cached git tree resolution, LTO-optimized release builds.
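To make the structural-hashing idea concrete, here is a minimal sketch assuming the public `tree-sitter`, `tree-sitter-rust`, `xxhash-rust`, and `rayon` crates: it hashes named node kinds in preorder, so formatting-only edits keep the same hash. This is an illustration, not sem-core's implementation.

```rust
use rayon::prelude::*;
use tree_sitter::{Node, Parser};
use xxhash_rust::xxh64::Xxh64;

/// Hash the *shape* of the syntax tree: named node kinds in preorder.
/// Whitespace and comment-only changes leave the hash untouched.
fn structural_hash(node: Node, hasher: &mut Xxh64) {
    if node.is_named() {
        hasher.update(node.kind().as_bytes());
        hasher.update(&[0]); // separator so "ab"+"c" != "a"+"bc"
    }
    let mut cursor = node.walk();
    for child in node.children(&mut cursor) {
        structural_hash(child, hasher);
    }
}

fn hash_files(files: &[(String, String)]) -> Vec<(String, u64)> {
    files
        .par_iter()
        // One parser per rayon worker: tree-sitter parsers are stateful.
        .map_init(
            || {
                let mut p = Parser::new();
                // tree-sitter-rust <= 0.20 API; newer releases differ.
                p.set_language(tree_sitter_rust::language()).unwrap();
                p
            },
            |parser, (path, source)| {
                let tree = parser.parse(source, None).expect("parse failed");
                let mut h = Xxh64::new(0);
                structural_hash(tree.root_node(), &mut h);
                (path.clone(), h.digest())
            },
        )
        .collect()
}
```

`map_init` keeps parser construction off the per-file hot path, which is one way the warm-cache numbers above stay in single-digit milliseconds.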