A real-world benchmark for AI code review

(qodo.ai)

27 points | by benocodes 2 hours ago

9 comments

  • falloutx 41 minutes ago
    Company creates a benchmark. Same company is best in that benchmark.

    Story as old as time.

  • esafak 30 minutes ago
    I'm not as cynical as the others here; if there are no popular code review benchmarks, why shouldn't they design one?

    Apparently this is in support of their 2.0 release: https://www.qodo.ai/blog/introducing-qodo-2-0-agentic-code-r...

    > We believe that code review is not a narrow task; it encompasses many distinct responsibilities that happen at once. [...]

    > Qodo 2.0 addresses this with a multi-agent expert review architecture. Instead of treating code review as a single, broad task, Qodo breaks it into focused responsibilities handled by specialized agents. Each agent is optimized for a specific type of analysis and operates with its own dedicated context, rather than competing for attention in a single pass. This allows Qodo to go deeper in each area without slowing reviews down.

    > To keep feedback focused, Qodo includes a judge agent that evaluates findings across agents. The judge agent resolves conflicts, removes duplicates, and filters out low-signal results. Only issues that meet a high confidence and relevance threshold make it into the final review.

    > Qodo’s agentic PR review extends context beyond the codebase by incorporating pull request history as a first-class signal.
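
    A toy sketch of that multi-agent-plus-judge shape may help. Everything here (Finding, security_agent, style_agent, judge, review_pr) is a hypothetical name, not Qodo's actual code or API; it only illustrates the pattern the post describes:

      # Hypothetical sketch: specialist agents each produce findings,
      # then a judge consolidates them into the final review.
      from dataclasses import dataclass
      from typing import Callable

      @dataclass(frozen=True)
      class Finding:
          agent: str         # which specialist produced the finding
          line: int          # location in the diff
          message: str       # human-readable issue description
          confidence: float  # the agent's own confidence, 0..1

      def security_agent(diff: str) -> list[Finding]:
          # Each specialist runs with its own dedicated context and concern.
          return [Finding("security", 12, "possible SQL injection", 0.9)]

      def style_agent(diff: str) -> list[Finding]:
          return [Finding("style", 12, "possible SQL injection", 0.4),
                  Finding("style", 30, "function exceeds 50 lines", 0.7)]

      AGENTS: list[Callable[[str], list[Finding]]] = [security_agent, style_agent]

      def judge(findings: list[Finding], threshold: float = 0.6) -> list[Finding]:
          # Resolve duplicates by (line, message), keeping the most confident
          # copy, then filter out low-signal results below the threshold.
          best: dict[tuple[int, str], Finding] = {}
          for f in findings:
              key = (f.line, f.message)
              if key not in best or f.confidence > best[key].confidence:
                  best[key] = f
          return [f for f in best.values() if f.confidence >= threshold]

      def review_pr(diff: str) -> list[Finding]:
          # Fan out to every specialist, then let the judge consolidate.
          raw = [f for agent in AGENTS for f in agent(diff)]
          return judge(raw)

      print(review_pr("...diff text..."))

    The real system presumably replaces the stub agents with LLM calls, but the fan-out-then-judge control flow is the architectural point.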

  • mattvv 13 minutes ago
    Some feedback for the team: I looked at the pricing page and found it both more expensive ($30/dev/mo) and highly limiting (20 PRs per month per user). We have devs putting up that many PRs in a single day. With this kind of plan there's pretty much no way we would even try this product.
    • esafak 12 minutes ago
      It's true, those are some pre-AI quotas.

  • logicx24 24 minutes ago
    Where's the code for this? I'd love to run our tool, https://tachyon.so/, against it.

  • CuriouslyC 1 hour ago
    I don't think LLMs are the right tool for pattern enforcement in general; it's better to get them to create custom lint rules.

    Agents are pretty good at suggesting ways to improve a piece of code, though. If you get a bunch of agents to wear different hats and debate improvements, it can produce some very useful insights.
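
    For the lint-rule route, the point is that the model writes a deterministic check once instead of re-judging the pattern on every review. A minimal sketch using only Python's stdlib ast module (the specific rule, flagging mutable default arguments, is just an illustrative choice):

      import ast
      import textwrap

      def find_mutable_defaults(source: str) -> list[tuple[int, str]]:
          """Return (line, message) pairs for defs with list/dict/set defaults."""
          issues = []
          for node in ast.walk(ast.parse(source)):
              if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                  for default in node.args.defaults + node.args.kw_defaults:
                      if isinstance(default, (ast.List, ast.Dict, ast.Set)):
                          issues.append((node.lineno,
                                         f"{node.name}: mutable default argument"))
          return issues

      code = textwrap.dedent("""
          def add_item(item, bucket=[]):
              bucket.append(item)
              return bucket
      """)
      print(find_mutable_defaults(code))
      # [(2, 'add_item: mutable default argument')]

    The win is determinism: once the rule exists it runs in CI for free, and the agents can be saved for the debate-style improvement passes.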

  • mdeeks 52 minutes ago
    I feel like pricing needs to be included here. I kind of don't care about 10 percentage points if the cost is dramatically higher. Cursor Bugbot is about the same cost but gives 10x the monthly quota of Qodo.

    I know this is focused solely on performance, but cost is a major factor here.

  • mbesto 1 hour ago
    Cmd+F - "Overfitting"...nothing.

    Nope, no mention of how they do anything to alleviate overfitting. These benchmarks are getting tiresome.

  • kachapopopow 35 minutes ago
    CodeRabbit being the worst while (presumably) advertising the most seems to check out, at least. I wouldn't trust the recall %, though; it seems bogus.

  • aetherspawn 1 hour ago
    Your pricing page has a bug: the annual price is higher than the monthly price.
    • zamadatix 1 hour ago
      I'm seeing $30/m at annual and $38/m at monthly? (maybe already fixed, hard to tell)