A real-world benchmark for AI code review

(qodo.ai)

27 points | by benocodes 2 hours ago

9 comments

  • falloutx 41 minutes ago
    Company creates a benchmark. Same company is best in that benchmark.

    Story as old as time.

  • esafak 30 minutes ago
    I'm not as cynical as the others here; if there are no popular code review benchmarks, why shouldn't they design one?

    Apparently this is in support of their 2.0 release: https://www.qodo.ai/blog/introducing-qodo-2-0-agentic-code-r...

    > We believe that code review is not a narrow task; it encompasses many distinct responsibilities that happen at once. [...]

    > Qodo 2.0 addresses this with a multi-agent expert review architecture. Instead of treating code review as a single, broad task, Qodo breaks it into focused responsibilities handled by specialized agents. Each agent is optimized for a specific type of analysis and operates with its own dedicated context, rather than competing for attention in a single pass. This allows Qodo to go deeper in each area without slowing reviews down.

    > To keep feedback focused, Qodo includes a judge agent that evaluates findings across agents. The judge agent resolves conflicts, removes duplicates, and filters out low-signal results. Only issues that meet a high confidence and relevance threshold make it into the final review.

    > Qodo’s agentic PR review extends context beyond the codebase by incorporating pull request history as a first-class signal.
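
    A toy sketch of that multi-agent-plus-judge shape may help. Everything here (Finding, security_agent, style_agent, judge, review_pr) is a hypothetical name, not Qodo's actual code or API; it only illustrates the pattern the post describes:

      # Hypothetical sketch: specialist agents each produce findings,
      # then a judge consolidates them into the final review.
      from dataclasses import dataclass
      from typing import Callable

      @dataclass(frozen=True)
      class Finding:
          agent: str         # which specialist produced the finding
          line: int          # location in the diff
          message: str       # human-readable issue description
          confidence: float  # the agent's own confidence, 0..1

      def security_agent(diff: str) -> list[Finding]:
          # Each specialist runs with its own dedicated context and concern.
          return [Finding("security", 12, "possible SQL injection", 0.9)]

      def style_agent(diff: str) -> list[Finding]:
          return [Finding("style", 12, "possible SQL injection", 0.4),
                  Finding("style", 30, "function exceeds 50 lines", 0.7)]

      AGENTS: list[Callable[[str], list[Finding]]] = [security_agent, style_agent]

      def judge(findings: list[Finding], threshold: float = 0.6) -> list[Finding]:
          # Resolve duplicates by (line, message), keeping the most confident
          # copy, then filter out low-signal results below the threshold.
          best: dict[tuple[int, str], Finding] = {}
          for f in findings:
              key = (f.line, f.message)
              if key not in best or f.confidence > best[key].confidence:
                  best[key] = f
          return [f for f in best.values() if f.confidence >= threshold]

      def review_pr(diff: str) -> list[Finding]:
          # Fan out to every specialist, then let the judge consolidate.
          raw = [f for agent in AGENTS for f in agent(diff)]
          return judge(raw)

      print(review_pr("...diff text..."))

    The real system presumably replaces the stub agents with LLM calls, but the fan-out-then-judge control flow is the architectural point.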

  • mattvv 13 minutes ago
    Some feedback for the team: I looked at the pricing page and found it both more expensive ($30/dev/mo) and highly limiting (20 PRs per month per user). We have devs putting up that many PRs in a single day. With this kind of plan there's pretty much no way we would even try this product.
    • esafak 12 minutes ago
      It's true, those are some pre-AI quotas.

  • logicx24 24 minutes ago
    Where's the code for this? I'd love to run our tool, https://tachyon.so/, against it.

  • CuriouslyC 1 hour ago
    I don't think LLMs are the right tool for pattern enforcement in general; it's better to get them to create custom lint rules.

    Agents are pretty good at suggesting ways to improve a piece of code, though. If you get a bunch of agents to wear different hats and debate improvements, it can produce some very useful insights.
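
    For the lint-rule route, the point is that the model writes a deterministic check once instead of re-judging the pattern on every review. A minimal sketch using only Python's stdlib ast module (the specific rule, flagging mutable default arguments, is just an illustrative choice):

      import ast
      import textwrap

      def find_mutable_defaults(source: str) -> list[tuple[int, str]]:
          """Return (line, message) pairs for defs with list/dict/set defaults."""
          issues = []
          for node in ast.walk(ast.parse(source)):
              if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                  for default in node.args.defaults + node.args.kw_defaults:
                      if isinstance(default, (ast.List, ast.Dict, ast.Set)):
                          issues.append((node.lineno,
                                         f"{node.name}: mutable default argument"))
          return issues

      code = textwrap.dedent("""
          def add_item(item, bucket=[]):
              bucket.append(item)
              return bucket
      """)
      print(find_mutable_defaults(code))
      # [(2, 'add_item: mutable default argument')]

    The win is determinism: once the rule exists it runs in CI for free, and the agents can be saved for the debate-style improvement passes.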

  • mdeeks 52 minutes ago
    I feel like pricing needs to be included here. I kind of don't care about 10 percentage points if the cost is dramatically higher. Cursor Bugbot is about the same cost but gives 10x the monthly quota of Qodo.

    I know this is focused solely on performance, but cost is a major factor here.

  • mbesto 1 hour ago
    Cmd+F - "Overfitting"...nothing.

    Nope, no mention of how they do anything to alleviate overfitting. These benchmarks are getting tiresome.

  • kachapopopow 35 minutes ago
    CodeRabbit being the worst while (presumably) advertising the most seems to check out, at least. I wouldn't trust the recall %, though; it seems bogus.

  • aetherspawn 1 hour ago
    Your pricing page has a bug: the annual price is higher than the monthly price.
    • zamadatix 1 hour ago
      I'm seeing $30/m at annual and $38/m at monthly? (maybe already fixed, hard to tell)