Three flagships shipped in eight days. Claude Opus 4.7 (Apr 16), GPT-5.5 (Apr 23), and DeepSeek V4 (Apr 24) all landed inside the same news cycle, with Gemini 3.1 Pro the holdover from February. Opus 4.7 wins SWE-Bench Pro at 64.3%. GPT-5.5 wins Terminal-Bench 2.0 at 82.7% — and doubled its API price. Gemini 3.1 Pro is the cheapest Western flagship at $2/$12 per M tokens. DeepSeek V4-Pro undercuts everyone at $0.145/$1.74 per M, but only until May 5.
- Claude Opus 4.7 launched April 16, 2026 at $5/$25 per M tokens with a 1M-token context (Anthropic).
- GPT-5.5 launched April 23, 2026 at $5/$30 per M tokens — exactly 2x the GPT-5.4 list price (Vellum, Simon Willison).
- DeepSeek V4-Pro launched April 24, 2026 at $0.145/$1.74 per M tokens on a 75% promo through May 5 (Hugging Face, VentureBeat).
- Gemini 3.1 Pro remains in Preview since February 19, 2026, at $2/$12 per M tokens up to 200K context (DeepMind model card).
- Opus 4.7 leads SWE-Bench Pro at 64.3%; GPT-5.5 leads Terminal-Bench 2.0 at 82.7%; Gemini 3.1 Pro leads GPQA Diamond at 94.3%.
- GPT-5.5 hits 74.0% on long-context (512K–1M) accuracy versus 36.6% for GPT-5.4 (Vellum).
- Opus 4.7 ships a new tokenizer that maps the same input to ~1.46x as many tokens as Opus 4.6 (Simon Willison).
- GPT-5.5 becomes more expensive than Opus 4.7 beyond ~272K input context (HN tester Topfi).
- DeepSeek V4-Pro: 1.6T total / 49B active params; KV cache is ~10% of V3.2's; 1M context window with 0.59 MRCR at 1M tokens.
Every comparison post still ranking for these models was written before April 23. None of them mention GPT-5.5's price doubling, the DeepSeek V4 promo cliff on May 5, or that Opus 4.7's new tokenizer quietly expands the same input into ~1.46x as many tokens as 4.6. This one does.
I pulled verified numbers from Anthropic's Opus 4.7 announcement, the DeepSeek V4 Hugging Face blog, the Gemini 3.1 Pro model card, and Vellum's GPT-5.5 breakdown (OpenAI's announcement page kept returning 403). For the practitioner reality check, I leaned on Simon Willison's hands-on writeup, the Latent Space coverage, and three live Reddit threads — r/LocalLLaMA on V4-Pro, r/ClaudeAI on Opus 4.7, and r/Bard on Gemini regressions. Every benchmark below is lab-claimed unless flagged otherwise.
The April 2026 Wave: What Actually Happened
Three releases in eight days, and a price war that wasn't.
Wall Street spent Q1 predicting the AI price war would force Western labs down. The opposite happened. OpenAI doubled GPT-5.5's list price ($5/$30 vs GPT-5.4's $2.50/$15). Anthropic held Opus 4.7 flat at $5/$25 — but shipped a new tokenizer that Simon Willison clocked at ~1.46x as many tokens per input as Opus 4.6, a stealth price hike on the same sticker. Gemini 3.1 Pro added a >200K-context surcharge ($4/$18 above the threshold). Only DeepSeek undercut, and that's a 75% promo expiring May 5.
The cleanest framing came from HN user redsaber on the GPT-5.5 thread: "the era of subsidized AI is over." GPT-5.5 was simultaneously removed from GitHub Copilot Pro (Pro+/Business/Enterprise only), reinforcing the read that flagship models are no longer entry-tier offerings. If you've been waiting for a model to be cheaper than last year's, you're going to be waiting longer.
What this means for buyers
Three of the four labs raised effective prices this cycle. Plan compute budgets accordingly — and don't assume next quarter's flagship will be cheaper than this quarter's.
The Benchmark Grid (Lab-Claimed)
Side-by-side, with the asterisks.
Headline benchmarks across all four
| Benchmark | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro | DeepSeek V4-Pro |
|---|---|---|---|---|
| SWE-Bench Pro | 64.3% | 58.6% | 54.2% | 55.4% |
| SWE-Bench Verified | 87.6% | — | 80.6% | 80.6% |
| Terminal-Bench 2.0 | 69.4% | 82.7% | 68.5% | 67.9% |
| ARC-AGI-2 | 75.83% | 85.0% | 77.1% | — |
| FrontierMath T1–3 | 43.8% | 51.7% | — | — |
| GPQA Diamond | — | — | 94.3% | — |
| MMLU-Pro | — | — | — | 73.5% |
| Long context (mixed metrics) | 32.2% @ 512K–1M | 74.0% @ 512K–1M | 84.9% @ 128K | 0.59 MRCR @ 1M |
| Input $/M | $5.00 | $5.00 | $2.00 | $0.145 |
| Output $/M | $25 | $30 | $12 | $1.74 |
| Context window | 1M | 1M | 1M | 1M |
The honest reading: no model wins everything, and the spread between #1 and #4 on any single benchmark is rarely more than 10–15 points. The decisive factor isn't "who's smartest" — it's which axis (cost, context, agentic harness, prose) matters most for your workload. Vellum's full comparison table is the cleanest single citation if you want to dig further.
Best for Coding: Opus 4.7 — With a Real Asterisk
The benchmark winner isn't always the daily driver.
On paper, this is Opus 4.7's race. 64.3% on SWE-Bench Pro is +5.7 over GPT-5.5's 58.6%, +10.1 over Gemini 3.1 Pro's 54.2%, and +8.9 over DeepSeek V4-Pro's 55.4%. SWE-Bench Verified at 87.6% is +7 vs Opus 4.6. Latent Space called it "literally one step better than 4.6 in every dimension."
Then I read u/Campfire_Steve's r/ClaudeAI post titled "Reverted from Opus 4.7 to 4.6 — went from endless loops to shipping 10 features in one session." Their setup: a Docker-deployed eBay scraper, ~5,800 LOC across 18 files, sustained collaborative coding sessions. On 4.7, they spent "three painful sessions of chasing our tails" before concluding the architecture had to be scrapped. On 4.6, in a single session: a new query planner, a one-time GCD API script with smart sampling (160 calls instead of 1,072), and 27 collection grid pages.
This is the workflow Anthropic explicitly markets 4.7 for. u/CricktyDickty captured the trade-off in one line: "4.6 is better at inferring what you mean. 4.7 is much more literal and doesn't do well with ambiguous requests." MindStudio's analyst review reaches the same conclusion: 4.7 "follows instructions much more literally than 4.6, which means prompts that worked before can now produce unexpected results."
Honest negative
If your coding loop relies on the model inferring intent from vague prompts, Opus 4.7 is a regression. The 64.3% SWE-Bench Pro lead doesn't translate when the bottleneck is your prompt clarity, not the model's reasoning. Stay on Opus 4.6 (or Sonnet 4.5) until you've A/B tested your specific workflow.
Verdict: Opus 4.7 wins for shipping production code on well-specified tasks in large codebases. GPT-5.5 is the alternative for anyone whose flow is closer to the Codex / terminal workflow. If you're cost-sensitive and willing to accept verbosity, DeepSeek V4-Pro is genuinely usable as a worker subagent.
Best for Agentic / Terminal Work: GPT-5.5
Where the token-efficiency math actually pays off.
GPT-5.5's 82.7% on Terminal-Bench 2.0 is the largest single-benchmark margin in this cycle: +13.3 over Opus 4.7 (69.4%), +14.2 over Gemini 3.1 Pro (68.5%), +14.8 over DeepSeek V4-Pro (67.9%). Long-context accuracy at 512K–1M jumps from 36.6% (GPT-5.4) to 74.0% — that's the kind of step change that genuinely shifts what's buildable.
HN user Sembiance's hands-on test on a reverse-engineering coding task: "Achieved 90% perfect on first try in about a fourth of the time" versus Opus 4.6. u/refulgentis cited two specific wins: a streaming JSON decoder where sync time dropped from 75 seconds to 0.8 seconds, and a Flutter Web WASM debug session — both completed inside an hour.
The catch is what HN user wincy flagged: GPT-5.5 has a "lazy" failure mode. Asked to write a SQL transaction with rollback, it returned a template instead of completing the query, and the user had to "prod the model to do what they asked." Behavioral regressions like this are real and not picked up by Terminal-Bench scores.
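For context, here's what the complete version of that task looks like: a minimal sketch in Python/sqlite3 with an illustrative schema, since the thread doesn't specify one.

```python
# A transaction with rollback: run the statements atomically and undo
# all of them if any step fails. Table/column names are illustrative.
import sqlite3

def transfer(conn: sqlite3.Connection, src: int, dst: int, amount: float) -> None:
    cur = conn.cursor()
    try:
        cur.execute("BEGIN")
        cur.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                    (amount, src))
        cur.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                    (amount, dst))
        conn.commit()
    except sqlite3.Error:
        conn.rollback()   # undo both updates if either one fails
        raise
```

The failure wincy describes is getting a template of this shape back with the actual statements left incomplete.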
Hidden wildcard: Gemini 3.1 Pro looks competitive on agentic benchmarks (MCP Atlas 69.2%) but practitioner reports tell a different story. Dan Cleary's CodeX writeup: "Google's models need the most additional layers and helpers compared to Anthropic and OpenAI, and they consistently try to break out of the harness so often that Converge hasn't rolled out any Gemini models to its users." If you're building a tool-calling agent, that's disqualifying.
Verdict
GPT-5.5 wins agentic and terminal-heavy work. The 13-point Terminal-Bench 2.0 lead is real and corroborated by hands-on HN testing. Skip Gemini 3.1 Pro for harness-based agents until Google fixes the break-out issues.
Best for Research and Writing: It Splits
Opus 4.7 for thinking. Sonnet 4.6 for prose. Avoid Opus 4.7 for writing.
For deep research and "hidden premise" exploration, Opus 4.7 is the pick. r/ClaudeAI user chipmux switched from GPT after a side-by-side test: they shared the same stock portfolio with both models. "GPT simply adjusted the distribution. I gave the exact same prompt to Claude Opus 4.7. Claude not only optimized the portfolio but also suggested additional stocks and ETFs that were genuinely useful. GPT did not think in that direction."
Gemini 3.1 Pro's 94.3% on GPQA Diamond and 44.4% on Humanity's Last Exam are the strongest pure-reasoning numbers on paper, and the model also leads BrowseComp at 85.9%. But r/Bard is full of post-launch regression complaints since the Pro daily limit jumped to 50: "It now regularly gets confused… It's also a lot more fawning and sycophantic than before" (u/Oneirathon1). u/Powerful_Ad_8915's theory: "They are training a new model, that is why. Compute starved."
For prose specifically, do not use Opus 4.7. The HN front-page thread literally titled "Opus 4.7 is horrible at writing" (47801971) has multiple senior commenters reaching the same verdict. u/SyntaxErrorist: "It feels like they tuned it so hard for logic and coding that it lost its soul for actual writing." u/chmod775: "Opus 4.7 seems to reach ChatGPT levels of verbosity in code and loves to overcomplicate."
Honest split
For research that benefits from exploration of hidden premises: Opus 4.7. For pure reasoning benchmarks on paper: Gemini 3.1 Pro (with the late-April quality regression caveat). For prose and marketing copy: stay on Sonnet 4.6 or use Opus 4.6 — both flagships regressed for non-coding work.
The Real Cost Math (Not the Sticker Price)
Where the headlines lie.
The vendor pricing pages don't tell you what you'll actually spend. Three things to know:
1. GPT-5.5 doubled the list price but uses ~40% fewer output tokens than GPT-5.4 on equivalent Codex tasks (per Vellum). On output-heavy workloads the effective increase is ~20%, not 100% (2x the price on 0.6x the tokens); input tokens still pay the full 2x. Willison framed the repositioning bluntly: "the pricing relationship mirrors Claude Sonnet is to Claude Opus." GPT-5.4 is the new mid-tier; GPT-5.5 is the premium tier.
2. Opus 4.7's tokenizer changed. Same input → ~1.46x as many tokens as Opus 4.6. Anthropic claims this is offset by a ~50% reduction in reasoning verbosity, so the net per-task cost may still drop (Decrypt's headline: "Token Eating Machine"). The reconciliation is plausible, but benchmark on your actual prompts before assuming the per-task cost matches Opus 4.6; see the sketch after this list.
3. The 272K crossover. HN tester Topfi flagged that GPT-5.5 becomes more expensive than Opus 4.7 beyond ~272K input context. Below 272K, GPT-5.5's $5/$30 plus token efficiency wins. Above it, Anthropic's flat $5/$25 wins. If you do agentic work with big context windows, this single threshold tells you which API to default to.
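To make point 2 concrete, here's a minimal sketch of the tokenizer/verbosity math, assuming flat $5/$25 per-M pricing and the community-reported ~1.46x and ~50% ratios above. The task shape is a made-up example; the net effect flips sign depending on your own input/output mix.

```python
# Per-task Opus cost under the 4.7 tokenizer, relative to 4.6.
# Constants: $5/$25 per M tokens (unchanged sticker), ~1.46x input
# token inflation (Willison), ~50% less reasoning verbosity
# (Anthropic's claim). The input/output mix below is hypothetical.

IN_PRICE, OUT_PRICE = 5.00, 25.00  # $/M tokens

def per_task_cost(input_toks: int, output_toks: int) -> float:
    return (input_toks * IN_PRICE + output_toks * OUT_PRICE) / 1e6

inp, out = 200_000, 40_000                          # measured in 4.6 tokens
cost_46 = per_task_cost(inp, out)                   # $2.00
cost_47 = per_task_cost(int(inp * 1.46), out // 2)  # $1.96
print(f"Opus 4.6: ${cost_46:.2f}  Opus 4.7: ${cost_47:.2f}")
```

Pure-input workloads pay the full 1.46x with no verbosity offset, so the more read-heavy your tasks, the worse the 4.7 tokenizer deal gets.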
Concrete daily-spend math at 1M input + 1M output tokens per day:
| Model | Daily cost | Monthly cost (30d) | Notes |
|---|---|---|---|
| DeepSeek V4-Pro (promo) | $1.89 | $56.55 | Promo ends May 5; verbosity inflates real spend |
| Gemini 3.1 Pro (≤200K) | $14.00 | $420.00 | $4/$18 surcharge above 200K context |
| Claude Opus 4.7 | $30.00 | $900.00 | Tokenizer ~1.46x; per-task offset claimed |
| GPT-5.5 | $35.00 | $1,050.00 | ~40% fewer output tokens partially offsets |
| GPT-5.5 Pro | $210.00 | $6,300.00 | $30/$180 — premium reasoning tier |
That's a ~18.5x spread between DeepSeek's promo rate and GPT-5.5, and roughly 111x against GPT-5.5 Pro, for the same nominal token volume. See our full AI coding tool pricing breakdown for IDE-bundled subscription comparisons.
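The whole table reduces to one line of arithmetic. A quick sketch using the list prices quoted above (the Gemini row assumes you stay under the 200K-context threshold; nothing here models verbosity, caching, or the tokenizer effects):

```python
# Daily/monthly spend at 1M input + 1M output tokens per day,
# from the list prices in this article ($ per M tokens).

PRICES = {
    "DeepSeek V4-Pro (promo)": (0.145, 1.74),
    "Gemini 3.1 Pro (<=200K)": (2.00, 12.00),   # $4/$18 above 200K context
    "Claude Opus 4.7":         (5.00, 25.00),
    "GPT-5.5":                 (5.00, 30.00),
    "GPT-5.5 Pro":             (30.00, 180.00),
}

IN_M = OUT_M = 1.0  # millions of tokens per day

for model, (p_in, p_out) in PRICES.items():
    daily = IN_M * p_in + OUT_M * p_out
    print(f"{model:26s} ${daily:7.2f}/day  ${daily * 30:9.2f}/30d")
```

Multiply DeepSeek's line by its real verbosity before trusting it; per Wildcard 2 below, V4-Pro's self-talk inflates actual token volume well past the nominal 1M.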
Hidden Wildcards You Won't See on the Spec Sheets
The footnotes that change which model you should pick.
Wildcard 1 — DeepSeek V4 promo cliff. u/BriefImplement9843 on r/LocalLLaMA: "until the 5th. nobody was using it at the old price." The $0.145/$1.74 pricing reverts to roughly 4x that on May 6. If you're planning a workload around V4-Pro economics, build in a switch to DeepSeek V3.2 or Qwen3-Max as the post-promo fallback.
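If you're wiring V4-Pro into anything durable, date-gate the model choice now instead of discovering the 4x bill in June. A minimal sketch; the model identifiers and fallback order are illustrative assumptions, not verified API IDs:

```python
# Date-gated model selection around the DeepSeek promo cliff.
# Model names below are placeholders; check your provider's real IDs.
from datetime import date

PROMO_END = date(2026, 5, 5)

def pick_cheap_worker(today: date | None = None) -> str:
    today = today or date.today()
    if today <= PROMO_END:
        return "deepseek-v4-pro"   # promo pricing: $0.145/$1.74 per M
    return "deepseek-v3.2"         # post-promo fallback; re-benchmark vs qwen3-max
```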
Wildcard 2 — V4-Pro is verbose to a fault. u/AnomalyNexus: "It writes these huge walls of text about what it's doing. Like just solid blocks of just essays." u/look: "both per token price and its reasoning token use is off the charts." The headline cost saving is gross of model self-talk. Same caveat applies as the Opus 4.7 tokenizer footnote — benchmark your real prompts.
Wildcard 3 — V4-Pro hallucinations. u/look again: "it has an extremely high hallucination rate. It knows a lot of things, but when it doesn't know the answer, it makes something up." The pattern u/2Norn settled on after a week of testing: V4-Pro and V4-Flash as "worker subagents" for implementation only — no design, no planning, no research. Claude/GPT still handle the thinking layer.
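The subagent split is easy to wire up, since DeepSeek advertises OpenAI API compatibility (also noted in the decision table below). A minimal sketch of the thinking-layer/worker pattern; the model IDs, prompts, and env var are illustrative assumptions:

```python
# Planner/worker split per u/2Norn: a frontier model plans and reviews;
# the cheap model only implements. Model IDs and env var are illustrative.
import os
from openai import OpenAI

planner = OpenAI()  # thinking layer, e.g. GPT-5.5 or Opus
worker = OpenAI(base_url="https://api.deepseek.com",
                api_key=os.environ["DEEPSEEK_API_KEY"])

def ask(client: OpenAI, model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def build_feature(spec: str) -> str:
    plan = ask(planner, "gpt-5.5",
               f"Write a precise implementation plan. No code.\n{spec}")
    code = ask(worker, "deepseek-v4-pro",
               f"Implement exactly this plan. Make no design decisions.\n{plan}")
    # Route the result back through the planner to catch V4-Pro's
    # hallucinations before anything ships.
    return ask(planner, "gpt-5.5",
               f"Review this code against the plan; flag anything invented.\n"
               f"{plan}\n---\n{code}")
```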
Wildcard 4 — Gemini 3.1 Pro time-to-first-token. Artificial Analysis clocks Gemini 3.1 Pro at 21.14 seconds TTFT versus 111.9 output tok/s throughput. The high TTFT is a reasoning-overhead artifact. For chat UX it's noticeable; for batch jobs it doesn't matter.
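The chat-versus-batch split falls straight out of the arithmetic: wall time is TTFT plus tokens divided by throughput, so the 21-second startup dominates short replies and washes out on long ones. A quick check with the Artificial Analysis numbers:

```python
# Response wall time = TTFT + output_tokens / throughput.
TTFT = 21.14   # seconds to first token (Artificial Analysis)
TPS = 111.9    # output tokens per second

for n in (100, 1_000, 10_000):
    total = TTFT + n / TPS
    print(f"{n:6d} tokens: {total:6.1f}s total ({TTFT / total:4.0%} spent waiting)")
```

At 100 output tokens, ~96% of the response time is the wait for the first token; at 10,000 tokens it drops to ~19%, which is why batch jobs don't care.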
Wildcard 5 — Opus 4.7's BrowseComp regression. Opus 4.7 dropped on web-research benchmarks vs Opus 4.6. Gemini 3.1 Pro, at 85.9% on BrowseComp, is the right tool for browse-heavy research, not Opus.
Who Wins Where: The Decision Table
Pick the row that matches your workload.
Winner by workload — April 2026
| Workload | Winner | Why |
|---|---|---|
| Shipping production code (well-specified) | Claude Opus 4.7 | 64.3% SWE-Bench Pro, 87.6% Verified — +5.7 over GPT-5.5 |
| Sustained vague-prompt coding sessions | Claude Opus 4.6 or Sonnet 4.5 | 4.7 regressed on intent-inference per Reddit + MindStudio |
| Terminal / CLI / agentic harnesses | GPT-5.5 | 82.7% Terminal-Bench 2.0 — +13.3 lead, real on hands-on testing |
| Long context >272K input | Claude Opus 4.7 | Cheaper than GPT-5.5 above the crossover; flat $5/$25 |
| Long context <272K input | GPT-5.5 | 74.0% long-context accuracy + token-efficiency offset |
| Pure reasoning / GPQA / hard exams | Gemini 3.1 Pro | 94.3% GPQA, 44.4% HLE — paper leader despite UX issues |
| Web research / BrowseComp | Gemini 3.1 Pro | 85.9% BrowseComp; Opus 4.7 regressed here |
| Cost-sensitive worker subagents | DeepSeek V4-Pro/Flash | $0.145/$1.74 promo; 80.6% SWE-Verified — until May 5 |
| Tool-heavy MCP agents | Claude Opus 4.7 or GPT-5.5 | Avoid Gemini — harness break-out per CodeX/Converge |
| Marketing prose / creative writing | Sonnet 4.6 or Opus 4.6 | Both new flagships regressed on prose per HN consensus |
| Self-hostable / open weights | DeepSeek V4-Pro | 1.6T/49B-active MoE; OpenAI + Anthropic API compat |
If you're picking exactly one model for a small team, the honest answer is two models: Claude Opus 4.7 for ambitious code on large codebases, GPT-5.5 for terminal/agentic work and long-context jobs under 272K input. Use Gemini 3.1 Pro for research-style queries that benefit from BrowseComp and GPQA strength, and DeepSeek V4-Pro as a cheap implementation-layer subagent — but only until May 5, and only behind a thinking model that catches its hallucinations.
For a workflow-by-workflow walkthrough beyond just these four, see our task-by-task model picker guide. For deeper coverage of the open-weights side specifically, see the DeepSeek V4 deep-dive.
The bottom line
No single model wins in April 2026. The closest thing to a universal pick is Claude Opus 4.7 for code + research and GPT-5.5 for agentic and long-context work. The vendors raised effective prices this cycle — plan around it. The DeepSeek discount is a temporary arbitrage, not a long-term floor.