The short answer: No single model wins. Claude Opus 4.7 leads SWE-bench Pro at 64.3% and is the right pick for shipping production code. Gemini 3.1 Pro is the cheapest at the frontier (~$900 to run the full Artificial Analysis Intelligence Index) and the only one that holds long-context together after Opus 4.7's retrieval regression. GPT-5.5 wins Terminal-Bench 2.0 at 82.7% and ARC-AGI-2 at 85.0% — but its 86% AA-Omniscience hallucination rate disqualifies it from legal, medical, and compliance work.
- Claude Opus 4.7 launched April 16, 2026 — API ID claude-opus-4-7, pricing unchanged at $5 input / $25 output (per Anthropic).
- GPT-5.5 (codename 'Spud') launched April 23, 2026 at $5 input / $30 output for the standard model and $30/$180 for Pro (per OpenAI / ALM Corp).
- Gemini 3.1 Pro (preview) launched February 19, 2026 at $2 input / $12 output under 200K tokens; above 200K, input doubles to $4 and output rises 50% to $18 (per OpenRouter).
- GPT-5.5 leads Terminal-Bench 2.0 (82.7%), ARC-AGI-2 (85.0%), and FrontierMath Tier 4 (35.4%) per Mindwired AI's benchmark roundup.
- Opus 4.7 leads SWE-bench Verified (87.6%), SWE-bench Pro (64.3%), MCP-Atlas (77.3%), and Finance Agent v1.1 (64.4%) per Vellum.
- Gemini 3.1 Pro leads BrowseComp (85.9%) and MMMLU multilingual (92.6%) per Mindwired AI's compiled scores.
- AA-Omniscience hallucination rates: GPT-5.5 86%, Gemini 3.1 Pro 50%, Opus 4.7 36% — Opus is the safest of the three.
- Cost to run AA's full Intelligence Index: ~$900 (Gemini 3.1 Pro), ~$1,200 (GPT-5.5 medium), ~$4,800 (Opus 4.7 max).
- Opus 4.7's own model card lists 59.2% on long-context retrieval vs 91.9% on Opus 4.6 — a documented regression flagged on Hacker News.
Three new flagship models shipped within ten weeks of each other. Every existing comparison post will tell you that "the right choice depends on your use case." That's not a comparison — that's a refusal to do the work. This post picks winners.
I pulled the verified numbers from Anthropic's Opus 4.7 announcement, OpenAI's GPT-5.5 page, Google's Gemini 3.1 Pro post, Vellum's deep dives on each model, Artificial Analysis's cross-vendor Intelligence Index, and the Hacker News threads where developers stress-tested the launches. Then I cross-checked pricing on OpenRouter. Where vendors disagree, I'll say so.
The Decision Tree
Skip the marketing. Pick the branch that matches your workload; a minimal routing sketch follows the list.
- Shipping production code → Claude Opus 4.7. 87.6% SWE-bench Verified, 64.3% SWE-bench Pro, 77.3% MCP-Atlas. Cursor measured a 12-point IDE-workflow jump from Opus 4.6 to 4.7 on its internal CursorBench (58% → 70%).
- Long-context research over >200K tokens → Gemini 3.1 Pro. 1,048,576-token window, 85.9% on BrowseComp, and Opus 4.7 just regressed on long-context retrieval to 59.2% from 91.9%.
- Agentic terminals, math, ARC-AGI, and cost-per-intelligence versus Opus → GPT-5.5. 82.7% Terminal-Bench 2.0 (+13.3 over Opus), 85.0% ARC-AGI-2, 60 on the AA Intelligence Index, ~$1,200 to run the full suite versus Opus 4.7 max's ~$4,800.
- Anything where confidence-without-knowing is unacceptable (legal, medical, compliance) → Opus 4.7. Its 36% AA-Omniscience hallucination rate is less than half of GPT-5.5's 86%.
- Daily writing → Honestly? None of these three. Opus 4.7 has a documented writing-quality regression (more on that below). Gemini 3.1 Pro is awkward in tool-driven UIs. GPT-5.5 hallucinates. If writing is the job, GPT-5.4 or Claude Sonnet 4.5 may still be the right call.
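If you want the tree in a form you can drop into a dispatcher, here is a minimal routing sketch. The workload labels, the model-ID strings other than claude-opus-4-7, and the default fallback are my own choices; the mapping itself just restates the list above.

```python
# The decision tree above as a trivial routing table. Keys are my own workload
# labels; the picks mirror the branches, including the "none of these" branch.

ROUTES = {
    "production_code":       "claude-opus-4-7",
    "long_context_research": "gemini-3.1-pro",   # >200K-token inputs
    "terminal_agents":       "gpt-5.5",
    "math_reasoning":        "gpt-5.5",
    "regulated_domains":     "claude-opus-4-7",  # legal / medical / compliance
    "daily_writing":         "gpt-5.4",          # or claude-sonnet-4-5
}

def pick_model(workload: str) -> str:
    # Default to the lowest-hallucination pick when the workload is unmapped.
    return ROUTES.get(workload, "claude-opus-4-7")

print(pick_model("long_context_research"))  # gemini-3.1-pro
```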
Benchmarks Side By Side
Each lab leads on different axes. None of them lead on all of them.
Published benchmarks (April 2026)
| Benchmark | GPT-5.5 | Opus 4.7 | Gemini 3.1 Pro | Leader |
|---|---|---|---|---|
| AA Intelligence Index | 60 | 57 | 57 | GPT-5.5 |
| Terminal-Bench 2.0 | 82.7% | 69.4% | 68.5% | GPT-5.5 (+13.3) |
| SWE-bench Verified | — | 87.6% | 80.6% | Opus 4.7 |
| SWE-bench Pro | 58.6% | 64.3% | 54.2% | Opus 4.7 (+5.7) |
| OSWorld-Verified | 78.7% | 78.0% | — | GPT-5.5 (narrow) |
| BrowseComp | 84.4% | 79.3% | 85.9% | Gemini 3.1 Pro |
| MCP-Atlas | 75.3% | 77.3% | 73.9% | Opus 4.7 |
| Finance Agent v1.1 | 61.5% | 64.4% | 59.7% | Opus 4.7 |
| GPQA Diamond | 93.6% | 94.2% | 94.3% | Statistical tie |
| MMMLU multilingual | 83.2% | 91.5% | 92.6% | Gemini 3.1 Pro |
| FrontierMath Tier 4 | 35.4% | 22.9% | — | GPT-5.5 (+12.5) |
| MRCR v2 (512K–1M) | 74.0% | 32.2% | — | GPT-5.5 |
| ARC-AGI-2 | 85.0% | 75.8% | 77.1% | GPT-5.5 |
| AA-Omniscience hallucinations | 86% | 36% | 50% | Opus 4.7 (low = good) |
Three takeaways jump off this table. First, GPT-5.5 owns terminals, agents-in-shells, abstract reasoning, and frontier math — the workloads where step-by-step planning matters more than calibrated knowledge. Second, Opus 4.7 owns the IDE coding stack. Third, the GPQA Diamond scores (93.6% / 94.2% / 94.3%) are statistically indistinguishable — graduate-level Q&A is no longer a useful differentiator at the frontier.
Verdict: benchmark layer
GPT-5.5 wins on points (more first-place finishes), but Opus 4.7 wins where shipping software matters and Gemini 3.1 Pro wins on long, multilingual, web-grounded reading.
Pricing and Cost-to-Run
List price is a lie. Here is what each model actually costs at the workload level.
Sticker pricing per million tokens (April 2026)
| Model | Input | Cached input | Output | Notes |
|---|---|---|---|---|
| Claude Opus 4.7 | $5.00 | — | $25.00 | Unchanged from Opus 4.6; 1M context tier available |
| GPT-5.5 standard | $5.00 | $0.50 | $30.00 | Batch/Flex 50% off; Priority is 2.5× rate |
| GPT-5.5 Pro | $30.00 | — | $180.00 | Pro/Business/Enterprise tiers only |
| Gemini 3.1 Pro (≤200K) | $2.00 | — | $12.00 | Output cap 64K tokens |
| Gemini 3.1 Pro (>200K) | $4.00 | — | $18.00 | Input price doubles, output rises 50% past 200K |
At 1M input + 1M output tokens per day under 200K-token prompts, the napkin math is brutal: Opus 4.7 = $30/day = $900/month. GPT-5.5 standard = $35/day = $1,050/month. Gemini 3.1 Pro = $14/day = $420/month. Over 30 days, going Gemini saves you $480 vs Opus. Going Opus over GPT-5.5 saves you $150 — and gets you the lower hallucination rate as a bonus.
Then there's Artificial Analysis's like-for-like Intelligence Index run: ~$1,200 on GPT-5.5 medium, ~$4,800 on Opus 4.7 max, ~$900 on Gemini 3.1 Pro Preview. Opus 4.7 max costs 5.3× as much as Gemini for a statistically equivalent Intelligence Index score (57 vs 57). That's the cost-per-intelligence story Anthropic doesn't put on its pricing page.
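The arithmetic behind both of those paragraphs fits in a few lines. A minimal sketch, assuming the sticker prices from the table above and the published Intelligence Index run costs; the dictionary keys and helper names are my own.

```python
# Napkin math for the two cost claims above: monthly bill at 1M input + 1M output
# tokens per day, and dollars per Artificial Analysis Intelligence Index point.

PRICES = {  # $ per million tokens (input, output), under 200K-token prompts
    "opus-4.7": (5.00, 25.00),
    "gpt-5.5": (5.00, 30.00),
    "gemini-3.1-pro": (2.00, 12.00),
}

# (cost to run the full AA Intelligence Index, score) as quoted in this post
INDEX_RUNS = {"opus-4.7": (4800, 57), "gpt-5.5": (1200, 60), "gemini-3.1-pro": (900, 57)}

def monthly_bill(model: str, m_in: float = 1.0, m_out: float = 1.0, days: int = 30) -> float:
    """Monthly cost for m_in / m_out million tokens per day at sticker prices."""
    p_in, p_out = PRICES[model]
    return (m_in * p_in + m_out * p_out) * days

for model, (run_cost, score) in INDEX_RUNS.items():
    print(f"{model}: ${monthly_bill(model):,.0f}/month, ${run_cost / score:.0f}/Index point")
# opus-4.7: $900/month, $84/Index point
# gpt-5.5: $1,050/month, $20/Index point
# gemini-3.1-pro: $420/month, $16/Index point
```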
Hidden cost — GPT-5.5
OpenAI argues the doubled list price ($15 → $30 output) is offset by ~40% fewer output tokens per task. ALM Corp estimates the effective cost increase caps at ~20%. That math holds for Codex-style coding tasks; verify it on your actual workload before you migrate.
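That ~20% figure is easy to sanity-check. A hedged sketch, assuming GPT-5.4's $15/M output price and taking OpenAI's claimed 40% token reduction at face value; treat the result as illustrative, not measured.

```python
# Sanity-check the "doubled output price, ~20% effective increase" claim.
OLD_OUTPUT_PRICE = 15.00  # $/M output tokens, GPT-5.4
NEW_OUTPUT_PRICE = 30.00  # $/M output tokens, GPT-5.5 standard
TOKEN_REDUCTION = 0.40    # OpenAI's claimed drop in output tokens per task

def effective_change(old_price: float, new_price: float, reduction: float) -> float:
    """Relative change in output cost per task if token counts shrink by `reduction`."""
    return new_price * (1 - reduction) / old_price - 1

print(f"{effective_change(OLD_OUTPUT_PRICE, NEW_OUTPUT_PRICE, TOKEN_REDUCTION):+.0%}")
# +20%, but only if your workload actually sees the 40% token reduction
```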
Coding: Opus 4.7 Wins, With Real Caveats
Vendor partner data is real. So is the Hacker News pushback.
The official case for Opus 4.7 on coding is strong. SWE-bench Verified at 87.6%, SWE-bench Pro at 64.3% (+5.7 over GPT-5.5), MCP-Atlas at 77.3%, and Finance Agent v1.1 at 64.4%. Cursor's internal CursorBench jumped from 58% on Opus 4.6 to 70% on Opus 4.7. Notion reported a 14% improvement on its evals and called Opus 4.7 "the first model to pass implicit-need tests." Anthropic's own Rakuten-SWE-Bench number claims 3× more production tasks resolved versus 4.6.
And then you read the Hacker News thread. User agentseal ran three days of side-by-side production coding and posted: "4.7 gets things right on the first try less often than 4.6. One-shot rate sits around 74.5% vs 83.8%. Roughly double the retries per edit (0.46 vs 0.22). Produces a lot more output per call, about 800 tokens vs 372 on 4.6. Cost per call is $0.185 vs $0.112." Multiple users confirm the verbosity regression — one notes 4.7 "reach[es] ChatGPT levels of verbosity in code and loves to overcomplicate the most simple things."
Anthropic shipped a new tokenizer with 4.7 that makes the same input map to 1.0–1.35× more tokens. The list price didn't move; the bill can. Combine that with longer outputs at the new default xhigh effort level — the recommended mode for Claude Code — and you can quietly burn 2–3× more of a Claude Pro 5-hour quota per task. User hgoel: "Each exchange takes ~5% of the 5-hour limit now, when it used to be maybe ~1-2%."
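To see what the tokenizer change alone can do to the bill, here is a back-of-envelope sketch. The only inputs are the unchanged $5/M list price and the reported 1.0–1.35× inflation; the function name and the 100K-token example are mine.

```python
# Back-of-envelope estimate of the Opus 4.7 tokenizer impact on input cost.
# Assumption: list price unchanged at $5/M input, same text maps to
# 1.0-1.35x more tokens (per the Hacker News reports quoted above).

OPUS_INPUT_PRICE = 5.00 / 1_000_000  # $ per input token

def effective_input_cost(old_token_count: int, inflation: float) -> float:
    """Cost of the same prompt after tokenizer inflation (1.0 <= inflation <= 1.35)."""
    return old_token_count * inflation * OPUS_INPUT_PRICE

baseline = effective_input_cost(100_000, 1.00)  # $0.500 for a 100K-token prompt
worst = effective_input_cost(100_000, 1.35)     # $0.675, a silent 35% increase
print(f"baseline ${baseline:.3f}, worst case ${worst:.3f}")
```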
Verdict: coding
Opus 4.7 wins shipping production code. But you must (a) retune prompts for its more literal instruction-following, (b) consider switching adaptive thinking off and pinning effort manually where the API allows, and (c) re-baseline cost per task before assuming the unchanged sticker price means an unchanged bill. If you can't tune, Sonnet 4.5 or Opus 4.6 may still be the rational choice — see our Claude Code vs Cursor breakdown and AI coding tools pricing comparison.
Long-Context Research: Gemini 3.1 Pro
Opus 4.7 just gave up its long-context crown. Gemini picked it up.
This branch flipped at the model-card level. Anthropic's own Opus 4.7 model card lists 59.2% on long-context retrieval, down from 91.9% on Opus 4.6 — a 32.7-point regression. Hacker News user bachittle surfaced the number directly from the card. User Someone1234 confirmed: "Opus 4.7 is also worse at 256K context. Across the board regression, not just 1M context." If your workflow lives in the 200K–1M token band — repository-scale code review, contract corpora, multi-PDF research synthesis — Opus 4.7 is now the wrong tool.
Gemini 3.1 Pro inherits the slot by default. The window is 1,048,576 tokens with a 65,536-token output cap. BrowseComp leads at 85.9%. MMMLU multilingual at 92.6% beats both rivals — useful if your corpus isn't English. Artificial Analysis benchmarks output speed at 115.5 tokens/sec.
The honest negatives matter. Time-to-first-token sits at a slow 20.87 seconds — bad for interactive chat, fine for batch research. Pricing jumps above 200K tokens to $4 input / $18 output per million (double the input rate, 1.5× the output rate), which makes a single 800K-token research run materially more expensive than it looks at the headline rate; a worked example follows below. And independent evaluator Dan Cleary called Gemini 3.1 Pro "the smartest dumb model I know" after watching it choke on a ChatGPT-clone build that Claude Sonnet 4.6 "handled effortlessly" — meaning the long-context win does NOT translate to UI-driven coding.
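Here is what the tier switch does to a single large request. A minimal sketch, assuming (as the tiered price list implies) that the higher rate applies to the whole request once the prompt crosses 200K input tokens; verify that billing rule against Google's own docs before budgeting.

```python
# Cost of one Gemini 3.1 Pro request under the two published tiers.
# Assumption: the >200K rate applies to all tokens in the request once the
# prompt exceeds 200,000 input tokens (as the tiered listing implies).

def gemini_request_cost(input_tokens: int, output_tokens: int) -> float:
    if input_tokens > 200_000:
        price_in, price_out = 4.00, 18.00   # $/M tokens, >200K tier
    else:
        price_in, price_out = 2.00, 12.00   # $/M tokens, <=200K tier
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

print(gemini_request_cost(190_000, 8_000))  # ~$0.48, just under the cliff
print(gemini_request_cost(800_000, 8_000))  # ~$3.34, same output but ~7x the bill
```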
Verdict: long-context research
Gemini 3.1 Pro wins anything over 200K tokens of input. Budget for the >200K pricing tier in advance, accept the slow first-token latency, and don't conflate this win with a coding-tool win — those are different jobs.
Agentic Workflows and Cost-Per-Intelligence: GPT-5.5
Best on terminals and ARC-AGI. Worst on hallucinations. Pick your poison.
GPT-5.5 has the highest score on the Artificial Analysis Intelligence Index (60), Terminal-Bench 2.0 (82.7%, +13.3 over Opus), OSWorld-Verified (78.7%), ARC-AGI-2 (85.0%), FrontierMath Tier 4 (35.4%, +12.5 over Opus), MRCR v2 across 512K–1M (74.0% vs Opus's 32.2%), CyberGym (81.8%), and GDPval knowledge-work tasks (84.9%). It's also natively omnimodal — text, image, audio, video — where the other two require separate handoffs for some media types. At ~$1,200 to run the full Intelligence Index suite, it's mid-priced between Gemini and Opus while delivering the highest measured intelligence score.
If you're building agentic pipelines that drive a shell, hit OS-level UI elements, or chain 50+ tool calls, GPT-5.5 is the right default. OpenAI's launch claims back this up: GPT-5.5 "matches GPT-5.4's per-token latency in real-world serving" while using ~40% fewer output tokens, which turns the doubled output price into roughly a 20% effective increase on Codex-style work.
Now the asterisk. GPT-5.5's AA-Omniscience hallucination rate is 86%. Opus 4.7 is 36%. Gemini 3.1 Pro is 50%. Jake Handy of Handy AI didn't mince words: "the hallucination gap is insane — at 86%, the model will confidently answer more questions it doesn't know the answer to, making it unsuitable for medical, legal, or regulatory work where accuracy is essential." Stephen Smith framed the same finding for legal practitioners: "GPT-5.5 may know more, reason better, and still be more willing to make something up when it should say 'I don't know.' For legal research, that is not a footnote. That is the caveat."
OpenAI argues hallucinations are 60% lower than GPT-5.4 (per Mindwired AI's roundup of OpenAI's own claims). Third-party verification is still in progress, and the AA-Omniscience absolute number is what it is.
Verdict: agentic + cost-per-intelligence
GPT-5.5 wins terminal-driven agents, math-heavy reasoning, and the cost-per-AA-Index-point race against Opus 4.7. It loses any workflow where confident-but-wrong outputs incur real-world cost. If your agent is producing artifacts a human reviews before action, ship GPT-5.5. If your agent's output goes straight to a customer or a regulator, don't.
Hidden Wildcards
The five things that didn't make any of the launch headlines but will hit your bill.
- Opus 4.7's tokenizer change. Same prompt, 1.0–1.35× more tokens. Sticker price unchanged means a stealth 0–35% bill increase depending on input mix. Combined with longer default outputs at the new `xhigh` effort level, real-world cost-per-task is up.
- Gemini 3.1 Pro's 200K cliff. $2/$12 per million is the marketing rate. Cross 200,000 input tokens and you're at $4/$18: input doubles and output rises 50%. A single repository-scale research prompt can quietly land in the higher tier.
- Opus 4.7's adaptive thinking can't be disabled via the standard API parameter. Per Hacker News user rkuska, the manual `thinking` parameter is rejected on `claude-opus-4-7` with a 400 error. User JamesSwift: "Disabling adaptive thinking plus increasing effort seem to be what has gotten me back to baseline performance." If you scripted reasoning controls before, they need a rewrite; see the fallback sketch after this list.
- Opus 4.7 ships with sharper instruction-following and a writing regression. Per botmonster's reception summary, "Opus 4.7 follows instructions much more literally than 4.6, which means prompts that worked before can now produce unexpected results." Hacker News user limalabs: "Man is it bad at writing. It's such a stark contrast, sloppy, unprecise, very empty sentences." One reviewer at BoringBot scored Opus 4.7 at 35/50 on PRD writing where 4.6 hit 45/50, with the model "literally stopping mid-sentence" against token caps.
- GPT-5.5's effective cost depends on your workload mix. The "40% fewer output tokens" efficiency claim is real on Codex-style coding. It's not validated for long-form writing or research. Run your own A/B before assuming the headline efficiency.
The wildcard worth most attention
Opus 4.7's long-context retrieval drop from 91.9% to 59.2% is documented in Anthropic's own model card. If you migrated automatically expecting an upgrade, audit any pipeline that processes >128K-token inputs.
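A cheap guard for that audit: flag any request likely to land in the band where the regression shows up. A rough sketch; the 4-characters-per-token heuristic and the 128K threshold are assumptions, not Anthropic guidance.

```python
import warnings

CHARS_PER_TOKEN = 4              # rough heuristic, not a real tokenizer
RETRIEVAL_RISK_TOKENS = 128_000  # band where the Opus 4.7 regression starts to matter

def flag_long_context(prompt: str) -> bool:
    """Warn when a prompt is likely to land in the degraded retrieval band."""
    est_tokens = len(prompt) / CHARS_PER_TOKEN
    if est_tokens > RETRIEVAL_RISK_TOKENS:
        warnings.warn(
            f"~{est_tokens:,.0f} tokens: route to Gemini 3.1 Pro or re-test on Opus 4.7"
        )
        return True
    return False
```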
Who Actually Wins Where
The decision table. Bookmark this; the rest of the post is the receipts.
Winner by workload (April 2026)
| Workload | Winner | Why |
|---|---|---|
| Shipping production code | Claude Opus 4.7 | 87.6% SWE-bench Verified, 64.3% SWE-bench Pro, +12 CursorBench gain |
| Tool-heavy IDE workflows | Claude Opus 4.7 | 77.3% MCP-Atlas; Notion + Cursor partner data corroborates |
| Long-context research (>200K) | Gemini 3.1 Pro | 1M window holds; Opus 4.7 dropped to 59.2% retrieval |
| Multilingual research | Gemini 3.1 Pro | 92.6% MMMLU vs 91.5% (Opus) and 83.2% (GPT-5.5) |
| Web research / browsing agents | Gemini 3.1 Pro | 85.9% BrowseComp vs 84.4% / 79.3% |
| Cost-per-intelligence at frontier | Gemini 3.1 Pro | ~$900 to run AA Intelligence Index vs $4,800 Opus max |
| Terminal-driven agents | GPT-5.5 | 82.7% Terminal-Bench 2.0, +13.3 over Opus |
| Frontier math / abstract reasoning | GPT-5.5 | 35.4% FrontierMath Tier 4, 85.0% ARC-AGI-2 |
| Computer-use / OS automation | GPT-5.5 | 78.7% OSWorld-Verified (Opus 78.0%, narrow win) |
| Legal / medical / compliance | Claude Opus 4.7 | 36% AA-Omniscience hallucination vs 86% / 50% |
| Finance agent workflows | Claude Opus 4.7 | 64.4% Finance Agent v1.1 |
| Daily writing / narrative drafts | None of the three | Opus 4.7 regressed on writing; consider Sonnet 4.5 or GPT-5.4 |
The honest summary: if you can only license one of these three for your team this quarter, license Claude Opus 4.7 — it has the lowest hallucination rate, leads the coding benchmarks that map to actual product work, and its weaknesses (long-context retrieval, writing, verbosity) are well-mapped enough that you can route around them. If you do enterprise-scale long-document analysis, add Gemini 3.1 Pro as the second seat — the <200K pricing makes it cheaper than the alternatives and the long-context numbers aren't close. Add GPT-5.5 only if you're building autonomous agents that live in shells, or doing math-heavy research where calibration matters less than reasoning ceiling.
For the broader landscape — including Chinese open-source models that beat all three on price — see our task-by-task model picker guide and the prior-generation comparison in GPT-5.4 vs Opus 4.7 vs Gemini 3.1 Pro. The frontier moves fast; the decision framework above is what's true on April 27, 2026.