
The research notes

2026 / 06 / 11 · Appendix to Tokenomics

Tokenomics opens with a scene: I asked for the research, answered a few multiple-choice questions, and agents worked through my browser and the open web while I slept on it. This page is what they brought back. Fleets of cheaper models ran in parallel workflows: reading the Anthropic engineering corpus, sweeping X and the press, pulling OpenRouter rankings live, and mining the public record on document work. A frontier model orchestrated, deduplicated, and cut what did not hold up.

These are working notes, lightly cleaned for publication: internal annotations and anything without a public source were removed, and the findings were curated down from the raw sweep. Every bullet links to where it came from. Read them the way I did, as raw material. The post is the argument; this is the evidence pile. The sixth corpus, on the price of a finished task, was gathered the same way the day after publication and verified claim by claim.

6 research corpora
339 findings, each with a public source

The spend spectrum

What people actually pay, from $20 subscriptions to nine-figure enterprise commitments. Web corpus: press, pricing guides, engineering blogs, cost audits.

  • Claude Pro plan is $20/month. Max 5x is $100/month. Most casual users stay at Pro or below.
  • Ramp April 2026 benchmarks: median monthly AI spend across Ramp customers is $2,246; average is $140,842, a 63x gap revealing extreme power-user concentration. Middle 50% of companies: $3 to $352 per employee per month.
  • Anthropic enterprise data: average Claude Code cost is approximately $13/developer/active day, $150-250/developer/month; 90% of users stay below $30/active day.
  • Kyle Redelinghuys (ksred.com): tracked approximately 10 billion tokens over 8 months (June 2025 to February 2026) using Claude Code daily. API-equivalent cost: $15,000+. Actual cost: approximately $800 on Max 5x plan ($100/month). Peak month (July 2025): $5,623 API-equivalent, 201 sessions across 45+ projects. Over 90% of all tokens were cache reads.
  • Simon Willison: pays $200/month for Claude Max. Also separately reported approximately $1,000/month against each of Anthropic and OpenAI at API rates.
  • George Xing: 1-person product team spending $100/month on Claude Max, handling PM, full-stack engineering, UX/UI, code review, QA, and test automation.
  • HN/CodeBurn: developer was spending approximately $1,400/week on Claude Code at API-equivalent rates while on the $200/month Max plan. One commenter breakdown: 98.2% cache hit rate; 56% of spending went to conversation turns with no tool usage; actual coding accounted for only 21% of expenditure.
  • Pragmatic Engineer survey: a seed-stage AI startup went from approximately $200/month to approximately $3,000/month per developer in 6 months. A fintech noted “some developers are now spending $500 a day on Claude Code.” One healthcare engineer spent $1,400 in a single session.
  • LeanOps audit of 30 engineering teams: median monthly spend $480/developer, 90th percentile $1,650, 99th percentile $4,200+. A 20x spread between lightest and heaviest users.
  • Anthropic enterprise data: a 6-person team (Branch8) spent $2,400 in month 1 (approximately 240M tokens), dropping to $680 in month 3 after prompt caching and session focus, a 72% reduction.
  • SemiAnalysis: reached a $10.95 million annual spend rate on Anthropic Claude tokens.
  • Salesforce: allocated $300M for Anthropic tokens and froze engineering hires.
  • Ramp top 1%: $831,338/month.
  • Microsoft Experiences + Devices: per-engineer Claude Code costs reportedly reached approximately $2,000/month before licenses were canceled effective June 30, 2026. Engineers redirected to GitHub Copilot CLI at $39/seat.
  • Uber: burned through entire 2026 AI budget in 4 months after deploying Claude Code to approximately 5,000 engineers. 95% of engineers using AI tools monthly, 70% of committed code from AI. Heavy users: $500 to $2,000/month.
  • Unnamed enterprise (per Chamath Palihapitiya, not independently verified): reportedly spent $500 million on Claude in a single month after failing to set usage caps.

What tokens buy: cost-per-task and labor comparison math

  • morphllm.com, June 2026 Sonnet 4.6 pricing: bug fix $0.54 (25 API calls, 400K cumulative input at 75% cache rate, 10K output); feature implementation $2.28 (100 calls, 2M input, 40K output); 20 features/month approximately $45.60. Claude Max 5x ($100/month) breaks even at approximately 44 feature implementations/month (about 185 bug fixes at $0.54 each).
  • SWE-agent original budget: $4 per instance on hard benchmarks; modern range $0.04 (GPT-5-Mini) to $3.11 (Claude 3.5 Sonnet).
  • Retool framing: mid-level analyst runs $50-80/hour fully loaded. AI agents on simple high-volume tasks run well under $1/hour.
  • SemiAnalysis value framing: agent work costs approximately $6-7 for tasks that would cost a knowledge worker $350-500/day fully loaded, a 10-30x ROI at the task level.
  • Anthropic productivity research (100K real conversations): Claude reduces task completion time by approximately 80%; median time savings 84%. Typical task takes 1.4 hours without AI, estimated at $55 in human labor per conversation. Curriculum development: 4.5 hours to 11 minutes.
  • Customer support benchmark: $0.30 to $0.99 per resolved ticket vs. human agent $4 to $8. Mid-market SaaS (8,000 monthly tickets): cost per resolution drops from $18.40 to $6.20 with AI.
  • Code-review agents: complete routine PRs for $0.72 vs. $48 senior-engineer time (66x savings).
  • Maor Shlomo / Base44: built solo with $10K to $20K LLM costs, reached $1.5M revenue within one month of launch, $189K profit in May 2025 after LLM costs, sold to Wix for $80M cash in June 2025.
  • Pieter Levels, Photo AI: $132-138K MRR with approximately $13K/month total infrastructure costs, 87%+ profit margins, zero employees.
  • Anthropic C compiler project: 100,000-line C compiler built with parallel Claude agents for just under $20,000. Consumed 2 billion input tokens and 140 million output tokens across approximately 2,000 Claude Code sessions over two weeks.
  • Anthropic multi-agent research system: “agents typically use about 4x more tokens than chat interactions, and multi-agent systems use about 15x more tokens than chats.” Token usage explains 80% of performance variance.
  • Anthropic harness engineering: solo agent run $9 for 20 minutes; full harness (Opus 4.5) $200 for 6 hours (22x cost multiplier); updated harness (Opus 4.6) $124.70 for 3h50m.
  • Stanford DEL / arXiv 2604.22750: agentic tasks consume 1,000x more tokens than code reasoning and code chat. Costs for identical agents on the same task vary up to 30x. Models cannot reliably predict their own token costs (Pearson r < 0.15).
  • EA Forum, frontier model cost on long tasks (plateau performance, 2-hour tasks): Grok 4 approximately $13/hour, GPT-5 approximately $120/hour, and o3 climbing to $350/hour, exceeding the approximately $120/hour human software engineer baseline.
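
The morphllm.com per-task figures at the top of this list can be reproduced with a few lines of arithmetic. The sketch below is a reconstruction, not morphllm's own calculator. It assumes Sonnet 4.6 list prices ($3/$15 per MTok), cache reads billed at 0.1x the input price, and no cache-write surcharge. The function name `task_cost` is hypothetical. The 75% cache rate for the bug fix comes from the note; the 80% rate for the feature is an assumption, chosen because it reproduces the quoted $2.28.

```python
# Sketch: per-task API cost from token counts and a cache hit rate.
# Assumed prices (USD per million tokens): Sonnet 4.6 list rates.
INPUT_PER_MTOK = 3.00
OUTPUT_PER_MTOK = 15.00
CACHE_READ_MULTIPLIER = 0.1  # assumption: cache reads billed at 10% of input price


def task_cost(input_tokens: int, output_tokens: int, cache_rate: float) -> float:
    """Return the dollar cost of one task, ignoring cache-write surcharges."""
    cached = input_tokens * cache_rate
    uncached = input_tokens - cached
    input_cost = (uncached + cached * CACHE_READ_MULTIPLIER) * INPUT_PER_MTOK / 1_000_000
    output_cost = output_tokens * OUTPUT_PER_MTOK / 1_000_000
    return input_cost + output_cost


bug_fix = task_cost(400_000, 10_000, cache_rate=0.75)
feature = task_cost(2_000_000, 40_000, cache_rate=0.80)  # 80% is an assumption

print(f"bug fix: ${bug_fix:.2f}")
print(f"feature: ${feature:.2f}")
print(f"features per $100 plan: {100 / feature:.1f}")
```

Dividing the $100 plan price by the per-feature cost gives roughly 44 tasks; dividing by the per-bug-fix cost gives roughly 185.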

Waste patterns

  • LeanOps: naive agent loops accumulate context quadratically because entire conversation history is re-sent on every API call. A 10-turn workflow costs approximately 55x a single query. A 5-step loop: 45,700 input tokens at $0.158 vs. single chatbot call at 9,700 tokens for $0.049, a 3.2x multiplier that grows to 30x at 50 steps and 100x+ at 200 steps. “Re-sent context is 62% of the bill.”
  • pedantic_geek, OpenAI community forum: assistant stuck in a function-calling loop burned $100 in a couple of hours; after redeployment another $160 consumed overnight. Total initial loss: $260. Two further unexplained charges of $76 each appeared weeks later.
  • Cursor agent loop case study (dredyson.com): developer burned monthly token budget in days. After fixing context management: 15,000 tokens/day down to 1,200/day (92% reduction), success rate from 23% to 89%.
  • vexp.dev: unstructured sessions burn 3-5x more tokens than structured development. A typical 2-hour vibe session: 200,000-400,000 tokens; structured equivalent: 60,000-100,000. Solo developer cost: $300-450/month vibe vs $90-135/month structured.
  • Cursor, June-July 2025: shifted $20/month Pro plan from 500 fast responses to $20 worth of usage billed at API rates. One HN commenter: $350 in overage in one week. One team of 5 spent $4,600 in 6 weeks. A $7,000 annual subscription reportedly depleted in a single day. Cursor issued public apology and refunds.
  • SaaStr analysis: with 1,000 users at $20/seat ($20,000 revenue), infrastructure costs of $24,288 produce a -21.4% margin when 5-10% of power users consume disproportionate tokens.
  • Meta tokenmaxxing: employees used 60.2 trillion AI tokens in 30 days (estimated $100M+ cost) after internal leaderboards ranked employees by token consumption. Employees ran agents on tasks unrelated to their work to rank higher. Meta abolished the leaderboard.
  • LLM routing research: LLM API calls account for 70-85% of total AI agent operating costs. Most teams default to the same frontier model for every task, overpaying by 40-85%. A routing policy reserving frontier models for complex tasks reduces cost by 70-80% for routed volume.
  • Anthropic Advanced Tool Use: a 5-server MCP setup consumed approximately 55K tokens before conversation started; one setup consumed 134K tokens before optimization.
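
The quadratic growth LeanOps describes follows directly from re-sending the whole history on every call. A minimal sketch, assuming every turn adds the same number of tokens and nothing is cached or truncated (the function name is hypothetical):

```python
# Sketch: input tokens billed by a naive agent loop that re-sends the
# full conversation history on every API call.
def naive_loop_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed across all turns.

    Turn k re-sends k turns' worth of context, so the total is
    tokens_per_turn * (1 + 2 + ... + turns).
    """
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


single_query = naive_loop_input_tokens(1, 1_000)
ten_turns = naive_loop_input_tokens(10, 1_000)
print(ten_turns // single_query)  # 55
```

This uniform-turn model reproduces only the 55x figure. The other multipliers in the same bullet depend on workload details the note does not give.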

Pricing mechanics

  • Current Claude API pricing (June 2026), source: platform.claude.com/docs/en/about-claude/pricing: Claude Fable 5 $10/$50 per MTok input/output; Opus 4.8 $5/$25; Sonnet 4.6 $3/$15; Haiku 4.5 $1/$5.
  • Batch API: 50% discount on all input and output across all models.
  • Prompt caching (5-min TTL): 1.25x write, 0.1x cache read. Breaks even after approximately 1 read. Batch + caching combined: a cached batch read costs approximately 5% of standard input price.
  • Max plan details: Pro = $20/month; Max 5x = $100/month; Max 20x = $200/month, with rolling 5-hour window limits.
  • Madrona VC analysis: Claude Max ($200/month) created an 18x underpricing gap. One developer consumed $15,000+ in API-equivalent tokens for $800. Anthropic’s April 4, 2026 policy blocked third-party agent frameworks from subscriptions, forcing metered API billing.
  • Ramp effective rates (April 2026, reflecting cache benefits): Claude Haiku 4.5 $0.40/MTok effective, Sonnet 4.6 $0.62/MTok effective, Opus 4.6 $1.00/MTok effective.
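
The caching and batch multipliers in this list compose by simple multiplication. A minimal sketch, in units of one uncached request's input cost, assuming the same prefix is reused on every request and the cache never expires between calls (function names are hypothetical):

```python
CACHE_WRITE = 1.25     # first request writes the cache at 1.25x input price
CACHE_READ = 0.10      # later requests read it at 0.1x
BATCH_DISCOUNT = 0.50  # Batch API halves input and output prices


def cached_cost(requests: int) -> float:
    """Input cost of `requests` calls sharing one cached prefix."""
    if requests < 1:
        return 0.0
    return CACHE_WRITE + CACHE_READ * (requests - 1)


def uncached_cost(requests: int) -> float:
    """Input cost of the same calls with no caching."""
    return float(requests)


# Smallest number of requests at which caching is cheaper than not caching.
break_even = next(n for n in range(1, 100) if cached_cost(n) < uncached_cost(n))
print(break_even)  # 2: one write plus one read

# A cache read inside a batch job, as a fraction of standard input price.
print(CACHE_READ * BATCH_DISCOUNT)  # 0.05
```

The write surcharge is recovered on the first read, which is why caching pays off for almost any prefix that is reused at all within the TTL.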

Token-efficiency techniques

  • Anthropic Prompt Caching Docs: 90% cost reduction on cached input tokens. Multi-turn conversation with Opus 4.8 over 13 requests: $63.75 with caching vs $375 uncached, 83% savings. First request pays $31.25 (cache write); subsequent 12 requests cost $2.50 each (cache read).
  • YouTube analytics bot: reduced monthly costs from $720 to $72 by caching an 81,262-token payload.
  • Advanced Tool Use: Tool Search reduces token overhead from approximately 77K to approximately 8.7K tokens (85% reduction) by loading 3-5 relevant tools on-demand. Programmatic tool calling reduces token usage 37% on complex tasks. Tool efficiency improvements also improved accuracy: Opus 4 accuracy from 49% to 74%.
  • Code Execution with MCP: converting a Google Drive-to-Salesforce workflow reduced token usage from 150,000 to 2,000 tokens, a 98.7% cost reduction.
  • Effective Context Engineering: sub-agent architectures with condensed handoffs return only 1,000-2,000 token summaries; just-in-time retrieval over pre-loaded context; persistent external memory outside context window.
  • RouteLLM (ICLR 2025, Berkeley/Anyscale/Canva): trained router achieves 95% of GPT-4 performance while routing only 14% of queries to the expensive model, a 75-85% cost reduction.
  • Anthropic multi-agent research system: upgrading to Claude Sonnet 4 is a larger performance gain than doubling the token budget on Claude Sonnet 3.7. Tiered approach: Opus 4 for lead agent, Sonnet 4 for subagents.
  • Zartis architecture framing: token cost is typically 1-5% of actual agent operating cost. Architecture A: $0.01/call at 70% success = $50,100 daily total; Architecture B: $0.05/call at 95% success = $8,835 daily total. 5x higher token spend, 6x lower operating cost.
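
The routing results in this list come down to a blended price. As a rough illustration only, the sketch below applies RouteLLM's 14% expensive-model share to the Claude input prices listed under Pricing mechanics (Opus 4.8 at $5/MTok, Haiku 4.5 at $1/MTok). Pairing those two models, and treating every query as using the same number of tokens, are assumptions made here rather than anything the sources report; the function name is hypothetical.

```python
def blended_price(frontier_share: float, frontier_price: float, cheap_price: float) -> float:
    """Average price per MTok when a router sends `frontier_share` of traffic to the frontier model."""
    return frontier_share * frontier_price + (1 - frontier_share) * cheap_price


OPUS_INPUT = 5.00   # USD per MTok, from the pricing list
HAIKU_INPUT = 1.00  # USD per MTok, from the pricing list

price = blended_price(0.14, OPUS_INPUT, HAIKU_INPUT)
saving = 1 - price / OPUS_INPUT
print(f"blended: ${price:.2f}/MTok, saving vs all-frontier: {saving:.0%}")
```

Under these assumptions the saving lands near 69%. A wider price gap between tiers pushes it higher, and a larger frontier share pulls it down.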

Macro tokenomics

  • Stanford HAI / Epoch AI: GPT-3.5-level query dropped from $20.00 per million tokens (November 2022) to $0.07 by October 2024, a 280x reduction in roughly 23 months. LLM inference prices fell at median 50x per year; after January 2024 accelerated to 200x per year for some tasks.
  • The Next Web: token prices fell 98%; enterprise AI bills tripled. Average enterprise AI budget grew from $1.2M/year in 2024 to $7M in 2026.
  • Ramp benchmark: token usage grew 1,001% January 2025 to April 2026; dollar spend grew 497%, lagging due to falling per-token prices.
  • priscasolutionsai.com: inference costs fell 1,000-fold but demand rose 10,000-fold. Enterprise generative AI spend grew from approximately $1.7B in 2023 to $37B in 2025.
  • Goldman Sachs May 2026: agentic AI will drive a 24x increase in global token consumption by 2030 (to 120 quadrillion tokens/month).
  • Gartner analyst Will Sommer (via Fortune): by 2030, inference costs will drop approximately 90% from 2025 levels for commodity models, but enterprise expenses expected to rise overall because agentic demand consumes far more tokens per task. Gartner also predicts more than 40% of agentic AI projects will be cancelled by end of 2027 due to escalating costs.
  • Google token processing (Tomasz Tunguz): 480T tokens/month (May 2025), 980T (July 2025), 1,300T (October 2025). Industry estimate: 50 trillion tokens consumed per day as of November 2025.
  • Anthropic revenue trajectory (Madrona): $9B ARR end 2025, $19B March 2026, $30B April 6, 2026. 1,000+ enterprise customers spending $1M+ annually on API. Claude Code reached $2.5B annualized revenue by early 2026.

Counterpoints and failure cases

  • Uber COO Andrew Macdonald: “It’s very hard to draw a line between one of those stats and, ‘Okay, now we’re actually producing 25% more useful consumer features.’” Only approximately 10% of committed code came from autonomous agents.
  • Team-level productivity paradox: individual developers report 20% speed improvement and merge approximately 60% more PRs, but teams deliver 19% slower overall. PR review time increased approximately 91% in high-AI-adoption teams. AI-coauthored PRs have approximately 1.7x more issues.
  • Fortune May 2026: for every $1 spent on AI tokens, $0.44 goes toward fixing AI-generated bugs, $0.27 toward rewriting AI-produced code, and $0.11 toward review delays: $0.82 in total, or roughly 80% in hidden overhead.
  • Microsoft Jellyfish data: engineers using the most tokens were about 2x as productive as low-usage peers but spent 10x more tokens. Nicholas Arcolano (Jellyfish): “The best ROI comes from moving the broad middle from low to moderate usage, not pushing heavy users higher.”
  • S&P Global / 451 Research survey: companies abandoning most AI initiatives jumped from 17% to 42%. MIT research: 95% of generative AI pilots delivered no measurable P&L impact. IBM: 25% of initiatives delivering expected ROI.
  • Stanford DEL study: accuracy peaks at intermediate cost and saturates at higher costs. More spend does not mean better output.
  • Nvidia VP Bryan Catanzaro: “For my team, the cost of compute is far beyond the costs of the employees.”
  • Goldman Sachs equity research (Jim Covello, via 404 Media): replacing low-wage jobs with “tremendously costly technology is basically the polar opposite of prior technology transitions.”
  • Sequoia Capital (David Cahn): AI industry needs to generate $600 billion annually just to break even on infrastructure spending.
  • Anthropic BrowseComp: one eval problem consumed 40.5 million tokens, 38x the median. Multi-agent architectures amplified unintended solutions by 3.7x (0.24% single-agent vs 0.87% multi-agent).

Field reports

What operators say in public: the $1.3M OpenClaw bill itemized, Uber's cap, Pylon's $1.4M run rate, and the economics of subscription buffets. X, LinkedIn, and press.

Uber, on the record

  • Uber rolled Claude Code out to engineering in December 2025; usage doubled by February. An internal leaderboard ranking AI usage drove adoption from 32% to 84% of engineers. Per-engineer API costs ran $500-$2,000/month. Uber burned its entire 2026 AI tools budget in four months. Fortune, 2026-05-26
  • COO Andrew Macdonald: hard to draw the connection between rising Claude Code use and consumer-facing innovation. “That link is not there yet.”
  • Response: $1,500/month per-employee cap on AI coding tools (Claude Code, Cursor). Bloomberg, 2026-06-02
  • 95% of Uber engineers use AI tools monthly; roughly 70% of committed code originates from AI (same coverage).
  • George Pu @TheGeorgePu, May 25: “Uber stopped hiring to save money. By April they’d blown the entire AI budget anyway. 5,000 engineers on Claude Code. Up to $2,000 a month. Each. Now the COO can’t tell if any of it worked. A salary is a number you control. Tokens are a number that never stops.”

First-person spend reports

  • Eric Siu @ericosiu, May 13-14, 2026: was spending $7,500/month on AI tokens, then “$12k/mo and it’s basically $0 now because of this stack: OpenAI Oauth 1 - $200/mo, OpenAI Oauth 2 - $200/mo, Claude CLI - $200/mo. If I run out, Qwen on my NVIDIA DGX Spark. Last resort is API.” Also: “one simple change that started pushing the graph down toward zero: model hierarchy.”
  • Ian Nuttall @iannuttall, Jul 7, 2025: “$1300 is how much the tokens would have cost for last month if I were using the api and not my max plan” (via npx ccusage@latest).
  • @nikshepsvn, Feb 26, 2026: “$67*30days = $2,010 worth of AI usage a month… this is roughly how much I spend via Claude Max if I had to pay metered from the API (per ccusage).”
  • Kai @hqmank, Apr 27, 2026: “Claude Code is getting expensive fast. I had no idea how much until I ran one command. npx ccusage@latest. It reads local logs and breaks usage down by day, month, session, and model.”
  • Melvyn @melvynx, Jan 12, 2026: “if i recruit someone i just check his ccusage, less than 1000$ last month = no.” Token spend as a hiring signal.
  • Ed Zitron @edzitron, Feb 14, 2026: crowdsourcing ccusage screenshots and plan tier ($20/$100/$200) from Claude Code users, surfacing the gap between what subscribers pay and the metered value they consume.

Cost-per-hour framing

  • DANNY @dannytook, Jun 10, 2026, on the Fable 5 launch: “running it costs ~$40/h. same price as a junior SWE. except it’s not junior. not even close. problem is most people are burning that budget on noise. one npm test = 5,000 tokens of passing tests re-read every single turn. the biggest leak isn’t your prompt. it’s what your tools spit back. 4 fixes to cut the bill 90% and get senior-level output at 10% of junior cost.”
  • Same author, “tool output is feastin’ off your tokens” (X article, Jun 9): a team of 6 senior SWEs at a large-tech medical company used Claude for 100% of tasks for 6 months with zero token management (“I’ve been just using the strongest [model]”). Fixing tool output and GitHub repo noise cut per-session tokens roughly 89%, from about 210K to 23K.

Enterprise horror stories

  • Chamath Palihapitiya on All-In EP275 (around May 30, 2026, relayed by @PodcastAlphaX): claims a client “accidentally spent $500M in one month on Claude. $16.6M per day. No usage controls.” A Fortune-level company failed to set token limits; the bill arrived at end of month. A separate Fortune 20 company is reportedly six months into a $1B AI-driven OPEX reduction program. Anecdote circulated widely; treat as “Chamath claims,” not verified independently.

One person, many agents

  • @browomo, May 5, 2026: account of a solo operator running 13 Claude Code agents serving 200 Shopify dropshippers at $800/month each (roughly $160K/mo revenue), working from two wall-mounted 3x2 grids of Claude windows. Viral (495K views); not independently verifiable.
  • @eng_khairallah1, May 6, 2026: describes an operator running 7 agents on Sonnet 4.6 analyzing Google Maps data in small towns, serving 47 small businesses at $400/month each for landing pages. Same caveat.
  • Greg Isenberg @gregisenberg, May 12, 2026: publishing a full course on running a “managed AI agent business” solo (Hermes Agent, Orgo, Obsidian, Codex, Claude Code). The “agency of agents” as a business model is now commodity content.
  • Eric Buess @EricBuess, Jul 25, 2025: personal agent portfolio covering to-dos, Obsidian second brain, computer use, browser use, calendar, kids’ school events, and a “medical diagnostic system for my oncologist wife using a mixture of agents.”

Tokens buying knowledge work, not just code

  • Andrej Karpathy @karpathy, Apr 2, 2026, “LLM Knowledge Bases” (21M views): “a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating knowledge (stored as markdown and images).” Describes an LLM-compiled personal wiki (roughly 100 articles, roughly 400K words) in Obsidian: ingest, compile, Q&A, render outputs, file back into the wiki, LLM lint passes for data integrity. “You rarely ever write or edit the wiki manually, it’s the domain of the LLM.”
  • Karpathy at Sequoia AI Ascent 2026 (Apr 29-30, via @stephzhan): “vibe coding raised the floor. Agentic engineering raises the ceiling.” He’s “never felt more behind as a programmer.”

Loop engineering: Steinberger, OpenClaw, and the top of the spend curve

  • Peter Steinberger (OpenClaw creator, now at OpenAI) posted a usage screenshot: $1,305,088.81 in OpenAI API spend in 30 days. 603 billion tokens across 7.6 million requests, roughly 100 Codex instances run by a team of about 3. OpenAI covers the bill. He frames it as “exploring how software would be built if token costs didn’t matter.” Claims roughly 70% of the cost could be cut by disabling Fast Mode. ROI, asked directly: “I’d say pretty high.” The Decoder, Tom’s Hardware, TNW
  • What the 100 agents do: review PRs, find security vulnerabilities in commits, dedupe issues, write fixes, generate PRs from a project vision file (VISION.md), monitor benchmarks for regressions, listen to team meetings and open feature PRs. OpenClaw itself: fastest-growing open-source project in GitHub history at the time (302K stars by April 2026).
  • Original bill post: May 15, 2026, 2.7M views. CodexBar screenshot: today $19,985.84, 7-day $249,661.09, 30-day $1,305,088.81, 603B tokens, 7.6M requests, top model gpt-5.5. His follow-up explainer the same day (2M views), “People freaking out over my AI spend. What nobody sees,” itemizes where the tokens go: ~100 Codex instances reviewing every PR and every issue; Codex on every commit for security review; agents deduplicating issues and sending reports; agents that spin up ephemeral machines, log into Telegram, record before/after video and post to the PR; Codex watching new issues and automatically creating PRs when they match the vision; another Codex reviewing those PRs; Codex scanning comments for spam; Codex verifying performance benchmarks and reporting regressions to Discord; agents listening to team meetings and opening PRs for features while they are still being discussed; clawpatch.ai splitting every project into functional units for bug and security review. Punchline: “All that automation allows us to run this project extremely lean.”
  • Origin of the PR sweep, Feb 15, 2026, 832K views: “PRs on OpenClaw are growing at an impossible rate. Worked all day yesterday and got like 600 commits in. It was 2700; now it’s over 3100. I need AI that scans every PR and Issue and de-dupes.” And: “There’s about 1 Million things people want me to do, I don’t have a magical team that verifies user generated content.” The automation was survival, not a flex.
  • Thread pushback worth noting (replies to the bill post): @jonathanbylos: “you better show something that $1MM worth of engineers couldn’t do… That is also subsidized pricing, holy. If it was the actual cost, it would be much higher.” @iulianlita: “$1.3m/month. Anything useful created yet?” Steinberger: “Other than millions of people enjoying OpenClaw? Yeah.” Alex Lieberman: “Would be fascinating to back into how many engineers worth of production code you’ve shipped… vs the $1.3 million in tokens.”
  • Loop engineering thesis, @steipete, Jun 8, 2026, 6.5M views: “Here’s your monthly reminder that you shouldn’t be prompting coding agents anymore. You should be designing loops that prompt your agents.”
  • His working definition in practice: May 14: “Wrote a skill that runs codex /review in a loop until there’s no booboos anymore. Caveat: It won’t fix system architecture for ya, so you still need BRAIN as master model.” May 30: “Ask it to review code for bugs and it will tell you all good, tell it there is a bug and it will LOOP AND LOOP and will find issues.” Jan 24: “need a ralph-loop so codex keeps running /review until it’s done.” May 3: “10 codex and ensuring you are a good manager and close the loop.”
  • Model-economics aside (May 29, reply): “Opus burns at least twice the token for similar tasks AND is more expensive, far easier to spend a million there.” Even the $1.3M/month spender reasons about tokens-per-task across models.
  • Addy Osmani, “Loop Engineering”: five components of a good loop: automations, worktrees, skills, plugins/connectors, and sub-agents as separate verifiers, plus persistent state between runs. Explicit token-economics warning: “you absolutely have to be careful about token costs”; spend sub-agent tokens strategically on verification. Quotables: “Design the loop. Stay the engineer.” and Boris Cherny (Claude Code creator): “I don’t prompt Claude anymore. I have loops running that prompt Claude.”
  • The control loop itself, published Jun 11, 2026: “Here’s a simple loop: Tell codex to maintain your repos, wake up every 5 minutes and direct work to threads. That makes it easy to parallelize+steer work as needed.” An orchestrator skill combined with triage, auto-review, and computer-use skills, “so some work can land autonomously.”
  • The maintainer-orchestrator skill is the published blueprint: a root orchestrator polls worker threads every five minutes, one Codex worker per repository, and workers cannot spawn subworkers. Live proof against real systems is “a pre-land requirement, not optional polish.” Releases require zero effective open issues and PRs, green CI, and explicit owner authorization.
  • The skill compresses the human’s job to four moves: “The normal owner interaction should be one of: land the prepared PR, delete/close it, provide one exact access step, or choose between clearly documented alternatives.”
  • The companion github-project-triage skill sorts every issue and PR into three buckets (autonomous candidates, needs the owner, defer/close) with an explicit go-vs-ask line: performance fixes, reproduced bugfixes, and docs proceed autonomously; new features, product direction, and security-sensitive changes wait for a human. It also runs a trust assessment on each author (account age, 12-month activity lookback) that “changes review depth, not correctness.”
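
The go-vs-ask line in the triage skill can be pictured as a small classification rule. The sketch below illustrates the three buckets as described in these notes; it is not the skill's actual implementation. The category labels, the `triage` function, and the choice to send unrecognized items to the owner are all assumptions.

```python
from enum import Enum


class Bucket(Enum):
    AUTONOMOUS = "autonomous candidate"
    NEEDS_OWNER = "needs the owner"
    DEFER_OR_CLOSE = "defer/close"  # criteria not given in the notes; unused below


# Hypothetical labels for the change types named in the notes.
GO = {"performance fix", "reproduced bugfix", "docs"}
ASK = {"new feature", "product direction", "security-sensitive"}


def triage(change_type: str) -> Bucket:
    """Map a change type to a bucket; anything unrecognized waits for a human."""
    if change_type in GO:
        return Bucket.AUTONOMOUS
    if change_type in ASK:
        return Bucket.NEEDS_OWNER
    return Bucket.NEEDS_OWNER  # unknown types: conservative assumption


for item in ("docs", "new feature", "refactor"):
    print(f"{item}: {triage(item).value}")
```

Defaulting unknown work to the owner keeps the autonomous bucket an explicit allowlist, which matches the spirit of the pre-land proof requirement described in the same list.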

The OpenClaw repo by the numbers (GitHub API, June 10, 2026)

Source: github.com/openclaw/openclaw

  • 377,972 stars, 79,046 forks, 7,997 open issues, created 2025-11-24 (roughly 6.5 months old at time of check).
  • Velocity, 7 days ending 2026-06-10: 2,076 commits on the main repo; 1,192 PRs created (GitHub search API, repo:openclaw/openclaw is:pr created:>2026-06-03). PR/issue numbers referenced in the changelog run to #91,551.
  • Releases near-daily: 15 tags between May 31 and Jun 10, changelog bodies 7K-21K chars.
  • Contributors: steipete at 32,382 commits (agent fleet lands under his identity), vincentkoc 7,552, shakkernerd 3,406, then a cliff; clawsweeper[bot] at 198.
  • Org ecosystem: 20+ repos (clawhub 8.9K stars, gogcli 7.7K, Peekaboo 4.7K, mcporter 4.6K, wacli 2.5K, and others).
  • v2026.6.5 changelog: real engineering across channels (WhatsApp, Matrix, Feishu, QQBot, Google Chat, iMessage, Telegram), providers (Anthropic extended-thinking recovery, Vertex ADC, OpenRouter cost reconciliation), SQLite state migrations, security (transcript image redaction, owner-only HTTP tools). Roughly half the items are verification machinery: QA Lab parity checks, fail-closed test policies, bounded proof, release-evidence repo at github.com/openclaw/releases. About half the spend buys verification, not generation.

Fable 5: the new top of the market (launched 2026-06-09)

Sources: Anthropic announcement, Fable 5 system card PDF, Ethan Mollick, “What it feels like to work with Mythos”

  • First Mythos-class model, a tier above Opus: $10/MTok input, $50/MTok output, exactly 2x Opus 4.8 ($5/$25). Free on Pro/Max/Team/seat-based Enterprise June 9-22; from June 23 it requires usage credits.
  • Headline capability claims from launch partners: Stripe compressed “months of engineering into days,” including a 50M-line Ruby codebase migration “in a day that would otherwise have taken a whole team over two months by hand.” A physics-research partner got results in 36 hours using one-third the reasoning tokens of competing models. A partner: Fable 5 “delivers more capable engineering in fewer turns than prior models.”
  • Token efficiency as a selling point: highest score on Cognition’s FrontierCode “even at medium effort.” “At the highest effort, Claude Fable 5 reflects on and validates its own work… the extra thinking pays for itself.”
  • Agentic coding benchmarks from the system card: Fable 5 SWE-bench Verified 95.0, SWE-bench Pro 80.0, Terminal-Bench 2.1 84.3 (vs GPT-5.5 at 58.6 SWE-Pro, Gemini 3.1 Pro at 54.2). Mythos 5: 95.5 / 80.3 / 88.0. SWE-bench Verified is effectively saturated; cost-per-solved-task replaces can-it-solve-it as the relevant question.
  • Autonomy evals from the system card: Mythos 5 reaches a 430.93x speedup over baseline on a kernel-optimization task. Risk thresholds are denominated in hours-of-human-effort-equivalent (200x = 8h equivalent, 300x = 40h equivalent). METR’s external testing was consistent with Anthropic’s conclusions.
  • Relevant to unattended loops: Mythos 5 shows a slight regression on overeager behavior and reward hacking in GUI computer-use tasks (“more likely to take destructive or overeager actions in other modalities as well”), mitigable by prompt steering.
  • Ethan Mollick on the relationship change: “The spell has gotten powerful enough that I am no longer sure I am the wizard.” “I no longer steer; I commission.” “Fable is closer to a whole studio, where I am the client who signs off on the final work.”
  • What Mollick built with it: a 9.5-hour autonomous run building a data-calibration system (Concord); an isochronic travel map where Fable launched dozens of sub-agents (mostly cheaper Sonnet instances) to research 2,200+ flights, rail schedules, and road speeds while coding in parallel. Note: even at the frontier, the model itself does model tiering, delegating grunt work to cheaper Sonnets.
  • On cost: Fable “burns through tokens at a rate that suggests the answer to how much it costs in production is ‘a lot.’”
  • The new problem is legibility, not capability: “the conjuring happens somewhere I cannot watch”; the model makes “hundreds of small choices I never get a vote on”; “it turns AI into the ultimate black box.” At the top rung you are no longer paying for answers or even processes, you are paying for outcomes you can only audit.

SemiAnalysis subscription stress test (X thread, 2026-06-10)

Thread: bought every Anthropic/OpenAI subscription and ran long-horizon coding tasks to the weekly limits. Measured max possible spend (API-equivalent):

  • claude-pro $20/mo: roughly $400/mo equivalent (20x)
  • claude-max-5x $100/mo: roughly $2,000/mo equivalent (20x)
  • claude-max-20x $200/mo: roughly $8,000/mo equivalent (40x; the believed ceiling was roughly $2,000)
  • chatgpt-plus $20: roughly $700; pro-5x $100: roughly $3,500; pro-20x $200: roughly $14,000 (70x)
  • Margin chart: at assumed 75% API gross margins, high-utilization subscribers are deeply margin-negative (max-20x subscribers at 50-100% utilization are negative 400% to negative 900%).
  • Their take: nerfing subscriptions causes backlash; falling costs mean “you’ll be able to profitably serve Opus 4.8 level models for $20/month in the near future.”
  • Best reply (@FateOfMuffins): “subscriptions… It’s like buffets where heavy eaters are subsidized by 99% of the population not eating their fair share.”
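
The multipliers and the margin range in the thread follow from simple arithmetic. A minimal sketch, assuming serving cost equals API-equivalent spend times (1 minus API gross margin) and that spend scales linearly with utilization; the function names are illustrative and not taken from the thread:

```python
def leverage(api_equivalent: float, price: float) -> float:
    """API-equivalent spend obtained per subscription dollar."""
    return api_equivalent / price


def subscription_margin(price: float, max_api_equivalent: float,
                        utilization: float,
                        api_gross_margin: float = 0.75) -> float:
    """Gross margin on one subscriber as a fraction of subscription revenue.

    Assumption: serving cost = API-equivalent spend * (1 - API gross margin).
    """
    serving_cost = max_api_equivalent * utilization * (1 - api_gross_margin)
    return (price - serving_cost) / price


# Measured ceilings from the thread (monthly price, API-equivalent maximum).
for name, price, ceiling in [("claude-pro", 20, 400),
                             ("claude-max-5x", 100, 2_000),
                             ("claude-max-20x", 200, 8_000)]:
    print(f"{name}: {leverage(ceiling, price):.0f}x")

# Max 20x at half and at full utilization of the measured ceiling.
print(f"{subscription_margin(200, 8_000, 0.5):.0%}")  # -400%
print(f"{subscription_margin(200, 8_000, 1.0):.0%}")  # -900%
```

Under these assumptions the negative 400% to negative 900% range corresponds to a Max 20x subscriber using between half and all of the $8,000 ceiling.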

The Pylon post: the pricing cliff in real time

Marty Kausas (Pylon CEO) on LinkedIn: Anthropic bill jumping from $400K to $1.4M/yr, “not because usage exploded,” but because crossing 150 seats forces the Enterprise tier: seats stop including usage, every token bills at API rates, 3.5x overnight at constant usage.

Policy independently confirmed: The Register, 2026-04-16 (bundled-token Enterprise plans ended; renewals $20/seat + metered API; March 8, 2026 hard cutoff).

  • CEO accidentally spent $4,000 in three days in Claude Code without realizing it.
  • Support-team top spenders: $800/month, “consistently across the company.” First non-engineering rung figure.
  • His arc: visibility, role judgment, spend limits (“more tokens now requires explicit approval”). “The era of token-maxxing is coming to an end.”
  • His judgment of waste: “apps that never get used, skills someone else already built; no actual ROI.”
  • Counter-shopping claim: for engineering, “paying for the best model probably saves more over the long run than shopping for cheaper pricing.”

The pricing asymmetry: personal plans are capped, enterprise is full consumption

Confirmed against claude.com/pricing, 2026-06-10:

  • Personal plans are flat-fee with usage limits: Pro $20/mo, Max from $100/mo at “5x or 20x more usage than Pro,” with rolling 5-hour window limits. Anthropic caps the downside; the heavy user captures the surplus (Ian Nuttall’s $1,300 of API-equivalent tokens for $100 is the example).
  • Enterprise is seat plus consumption: “$20/seat. Usage cost scales with model and task,” with spend controls and audit logs. The company eats the tail risk.
  • This asymmetry explains the shape of the discourse: the bragging ($8K-for-$200 flexes, ccusage screenshots) is all personal-plan arbitrage, and the horror stories (Uber’s blown budget, the $500M month) are all enterprise consumption. Same tokens, different contract, opposite emotional register.
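
The two contracts can be written as two cost functions over the same usage. A rough sketch, assuming usage is already expressed in API-equivalent dollars and ignoring usage limits, discounts, and spend controls; both function names are hypothetical:

```python
def personal_plan_cost(api_equivalent_usage: float,
                       flat_fee: float = 100.0) -> float:
    """Flat-fee plan: the bill is the same whatever the usage.

    Usage limits (not modeled here) cap how much usage is possible.
    """
    return flat_fee


def enterprise_cost(api_equivalent_usage: float,
                    seat_fee: float = 20.0) -> float:
    """Seat plus consumption: usage bills at API rates on top of the seat."""
    return seat_fee + api_equivalent_usage


usage = 1_300.0  # the $1,300 API-equivalent month cited above
print(personal_plan_cost(usage))  # 100.0
print(enterprise_cost(usage))     # 1320.0
```

The same month of tokens costs $100 under one contract and $1,320 under the other, which is the gap the bragging and the horror stories sit on either side of.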

The Five Levels / Dark Factory (Dan Shapiro)

The Five Levels: from spicy autocomplete to the software factory (Dan Shapiro, January 2026):

  • Level 0 Manual: no AI.
  • Level 1 Discrete task automation: AI does bounded tasks (tests, docstrings); job unchanged. “If you are just using ChatGPT to write your regex, you aren’t really getting the benefits of deflation.”
  • Level 2 Collaborative pairing: AI-native flow state; where most “AI-native” developers are. “Level 2… feels like you are done. But you are not done.”
  • Level 3 Human-in-the-loop management: you become a reviewer of large agent diffs. “You are… a manager. You are the human in the loop.” Feels worse, produces more. “Almost everyone tops out” here.
  • Level 4 Autonomous with specification: you write specs, skills, schedules; agents implement while you’re away. Shapiro places himself here.
  • Level 5 Dark Factory: spec in, software out, no humans. Named for Fanuc’s lights-out plant where robots build robots: “It’s dark, because it’s a place where humans are neither needed nor welcome.” Only “small teams, less than five people” operate here today.
  • Economics frame: “technical deflation,” the cost of code is collapsing. “Smart teams are deferring payment on human hours today to pay them back with cheaper AI hours tomorrow.”

Counter-currents

  • @VK_ROXy, Jun 10, 2026: “Stop paying $200/month for Claude Code. Run Claude Code on a Mac Mini for $3/month” (local models via ANTHROPIC_BASE_URL). Cites a Reddit thread: a dev posted a $170-in-10-days Claude Code bill; top reply: “I bought a Mac Mini M4. Haven’t paid Anthropic since.”
  • PublicAI Foundation @PublicAIData, Jun 10, 2026: “The Free Ride Is Over: Why AI Coding Tools Are Getting More Expensive to Use.”

Token-efficiency as a product category

  • mem0 @mem0ai, Apr 16, 2026: “Introducing the token-efficient memory algorithm,” benchmarking roughly 7,000 average tokens per query across memory benchmarks. Memory/context efficiency is now a product pitch, not just a practitioner trick.
  • Tanay @TanayVasishtha, Mar 20, 2026: “100% free coding agent just dropped, completely zero cost… runs 300 tokens per second, up to 10x faster than claude code… 9 built-in subagents” (Freebuff). Race-to-zero counter-current: free/local agents pitched on tokens-per-second and $0 cost.

What the humans are doing

Thirteen Anthropic Economic Index reports, Feb 2025 through Mar 2026: which occupations use Claude, for what tasks, and how usage changes with tenure.

What people actually do

  • Computer and Mathematical occupations account for 37.2% of Claude.ai conversations vs. 3.4% of the U.S. workforce, an approximately 11x overrepresentation. (arXiv:2503.04761, Feb 2025; Anthropic Economic Index)
  • Software development and writing together approach 50% of all Claude.ai usage. (arXiv:2503.04761)
  • Coding’s Claude.ai share peaked at 40% in March 2025 (Claude 3.7 Sonnet launch), settled to 34% by November 2025, while API coding share held at 46-50%. (Economic Index Jan 2026)
  • Single most common task: “modifying software to correct errors,” accounting for 6% of Claude.ai conversations and 10% of first-party API traffic. (Economic Index Jan 2026)
  • Top occupational clusters on Claude.ai (Feb 2025): Computer and Mathematical 37.2%, Arts/Design/Entertainment/Media 10.3% (vs. 1.4% of workforce), Education and Library 9.3% (grew to 15% by Nov 2025), Office and Administrative Support 7.9%, Life/Physical/Social Science 6.4%, Business and Financial 5.9% (fell to 3% by Sep 2025), Transportation 0.3% (vs. 9.1% of workforce, starkest under-representation), Farming/Fishing/Forestry 0.1%. (Anthropic Economic Index)
  • Within coding (Apr 2025, 500K interactions): JavaScript/TypeScript 31%, HTML/CSS 28%, Python 14%, SQL 6%. Top tasks: UI/UX component development 12%, web and mobile app development 8%. (Impact on Software Development)
  • Learning and educational instruction was 9.3% of usage in Feb 2025 and grew to 15% by Nov 2025, the fastest-growing non-coding category. (Economic Index Mar 2025; Economic Index Jan 2026)
  • Translation is a distinct validation-mode cluster: Validation conversations are nearly all translation tasks. (arXiv:2503.04761)
  • 630 granular task clusters were identified in the Mar 2025 taxonomy, including “water management systems” and “battery technologies,” showing breadth well past coding. (Economic Index Mar 2025)
  • Usage purpose split on Claude.ai (Nov 2025): Work 46%, Personal 35%, Coursework 19%. (Economic Index Jan 2026)
  • Only about 4% of occupations use Claude across 75% or more of their O*NET tasks; about 11% of occupations show usage in 50% or more of tasks; 36% of occupations show usage in 25% or more of tasks (Jan 2025 baseline), rising to 49% in data pooled across reports. (arXiv:2503.04761; Economic Index Jan 2026)
  • Task distribution follows a power law: the bottom 80% of tasks account for only 12.7% of Claude.ai usage (Gini 0.84) and 10.5% of API usage (Gini 0.86). (arXiv:2503.04761)
  • Top 10 tasks accounted for 24% of Claude.ai usage in Nov 2025, falling to 19% by Feb 2026 as usage diversified into the long tail. (Economic Index Jan 2026; Economic Index Mar 2026)
  • Heaviest users are mid-to-high wage: computer programmers ($75-100k), data scientists, copywriters. Minimal adoption at both extremes: lowest-wage occupations (e.g., Shampooers ~$25k, waitstaff) and highest-wage occupations (e.g., Obstetricians ~$200k, anesthesiologists). (arXiv:2503.04761; Economic Index Jan 2026)
  • Peak usage is at Job Zone 4 (four-year bachelor’s degree), with a sharp drop at Job Zone 5 (physicians, lawyers). Mean education for Claude-covered tasks: 14.4 years (Associate’s degree) vs. 13.2 years economy-wide. (arXiv:2503.04761)
  • Skills most exhibited in Claude conversations: Critical Thinking, Active Listening, Reading Comprehension, Writing, Systems Analysis, Programming, Complex Problem Solving, Instructing, Troubleshooting. Least present: Installation, Equipment Maintenance, Repairing, Operation and Control. (arXiv:2503.04761)
  • Median labor-cost equivalent per Claude.ai conversation: $54. Management conversations average $133, legal $119, business/financial $69, food preparation $8. (Estimating Productivity Gains)
  • Median time savings across all tasks: 84%, with 80-90% the most common range. Average baseline task without AI: 1.4 hours. (Estimating Productivity Gains)
  • Speedup by education level: college-level tasks achieve a 12x speedup, high-school-level tasks a 9x speedup. (Estimating Productivity Gains)
  • Curriculum development was cut from 4.5 hours to 11 minutes, a roughly 96% time savings. Tasks with judgment bottlenecks, such as diagnostic imaging, resist speedup: only 20%. (Estimating Productivity Gains)
  • Software developers drive 19% of total U.S. productivity gains from Claude. (Estimating Productivity Gains)
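
Time savings and speedup are two views of the same ratio, which makes the figures above easier to compare. A small sketch of the conversion, using only the definitions savings = 1 - after/before and speedup = before/after; the helper names are illustrative:

```python
def speedup_from_savings(savings: float) -> float:
    """Convert a fractional time saving (0.84 = 84%) into a speedup multiple."""
    return 1 / (1 - savings)


def savings_from_speedup(speedup: float) -> float:
    """Convert a speedup multiple (12 = 12x) into a fractional time saving."""
    return 1 - 1 / speedup


print(f"{speedup_from_savings(0.84):.2f}x")  # 6.25x
print(f"{speedup_from_savings(0.20):.2f}x")  # 1.25x
print(f"{savings_from_speedup(12):.1%}")     # 91.7%
print(f"{savings_from_speedup(9):.1%}")      # 88.9%
```

Read this way, the 84% median saving is about a 6x speedup, and the 12x and 9x education-level speedups correspond to savings of roughly 92% and 89%.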

Automation vs. augmentation

  • Baseline split on Claude.ai (Feb 2025): Augmentation 57% (Task Iteration 31.3%, Learning 23.3%, Validation 2.8%), Automation 43% (Directive 27.8%, Feedback Loop 14.8%). (arXiv:2503.04761; Anthropic Economic Index)
  • After Claude 3.7 Sonnet (Mar 2025): augmentation/automation remained stable at 57/43, but Learning interactions grew from about 23% to about 28%, the largest single behavioral shift observed. (Economic Index Mar 2025)
  • Sep 2025: automation exceeded augmentation for the first time. Directive jumped from 27% to 39%. Learning 24%, Task Iteration 19%, Feedback Loop 11%. Code creation doubled (+4.5pp to 8.6%), debugging fell (-2.8pp to 13.3%). (Economic Index Sep 2025)
  • Nov 2025: augmentation rebounded to 52%, automation 45%, directive fell back to 32%. (Economic Index Jan 2026)
  • Feb 2026: augmentation increased further on Claude.ai; API directive and automated interactions also decreased and collaborative patterns increased. (Economic Index Mar 2026)
  • By occupation (Sep 2025): Community and Social Services shows about 75% augmentation (highest); copywriters/editors about 58% task iteration; librarians about 56% learning; Production and Computer/Mathematical occupations roughly 50-50. (Economic Index Mar 2025)
  • Directive mode concentrates in: writing/content generation, email drafting, business documents, schoolwork/math problem-solving. Feedback Loop mode is almost entirely coding work, peaking in error correction and front-end debugging. (arXiv:2503.04761; Economic Index Mar 2025)
  • Claude Code vs. Claude.ai for coding (Apr 2025, 500K interactions): Claude Code is 79% automation (Directive 43.8%, Feedback Loop 35.8%); Claude.ai coding sessions are 49% automation (Directive 27.5%, Feedback Loop 21.3%). (Impact on Software Development)
  • Claude Code startup work: 32.9% of sessions vs. about 13% on Claude.ai, skewing toward startups building rather than enterprises maintaining. (Impact on Software Development)
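
The headline augmentation and automation shares are sums over the five interaction modes. A minimal sketch of that grouping, using the Feb 2025 Claude.ai baseline quoted above; the dictionary keys and the function name are illustrative labels, not identifiers from the reports:

```python
AUGMENTATION_MODES = {"task_iteration", "learning", "validation"}
AUTOMATION_MODES = {"directive", "feedback_loop"}


def split_shares(mode_shares: dict[str, float]) -> tuple[float, float]:
    """Return (augmentation %, automation %) from per-mode percentages."""
    augmentation = sum(v for k, v in mode_shares.items()
                       if k in AUGMENTATION_MODES)
    automation = sum(v for k, v in mode_shares.items()
                     if k in AUTOMATION_MODES)
    return round(augmentation, 1), round(automation, 1)


claude_ai_feb_2025 = {
    "task_iteration": 31.3,
    "learning": 23.3,
    "validation": 2.8,
    "directive": 27.8,
    "feedback_loop": 14.8,
}

print(split_shares(claude_ai_feb_2025))  # (57.4, 42.6)
```

The sums round to the 57/43 split reported for the baseline.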

Consumer vs. API/enterprise differences

  • All Economic Index data before Report 3 (Sep 2025) used Claude.ai Free and Pro only. First-party API data was first included in the Sep 2025 report and again in Jan 2026; Team and Enterprise are still excluded. (Economic Index Sep 2025; Economic Index Jan 2026)
  • API vs. Claude.ai (Nov 2025): Work-related share is 74% on API vs. 46% on Claude.ai. Automation-dominant share is 75% on API vs. 52% on Claude.ai. Directive mode is 64% on API vs. 32% on Claude.ai. Average session time is 5 minutes on API vs. 15 minutes on Claude.ai. Task success rate is 49% on API vs. 67% on Claude.ai. (Economic Index Jan 2026; Economic Index Sep 2025)
  • 97% of API task categories show automation-dominant patterns vs. 47% on Claude.ai. (Economic Index Sep 2025)
  • API task concentration: Gini 0.86 vs. 0.84 for Claude.ai. Top 10 tasks account for 32% of API traffic vs. 24% on Claude.ai, rising to 33% for API by Feb 2026 while Claude.ai fell to 19%. (Economic Index Jan 2026; Economic Index Mar 2026)
  • Top enterprise API use cases by Sep 2025: debugging web applications about 6%, resolving technical issues about 6%, developing/evaluating AI systems about 5%, marketing materials creation 4.7%, business/recruitment data processing 1.9%. (Economic Index Sep 2025)
  • Top enterprise API use cases by Jan 2026: generate personalized B2B cold sales emails 0.47%, analyze emails and draft business replies 0.28%, invoice processing systems 0.24%, email classification 0.23%, calendar/meeting coordination 0.16%. (Economic Index Jan 2026)
  • Model selection self-sorts by task value: each $10/hr increase in task wage corresponds to +1.5pp Opus selection on Claude.ai and +2.8pp on API. Software developers select Opus 34% of the time; tutors select it 12%. Opus was selected for 51% of all Claude.ai conversations overall by Feb 2026. (Economic Index Mar 2026)
  • U.S. firm AI adoption: 3.7% of U.S. firms used AI in Fall 2023, rising to 9.7% by August 2025 (a 2.6x increase in about 20 months). Information sector: about 25% adoption. Accommodation/food services: about 2.5%. (Economic Index Sep 2025)
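
The Gini and top-10 figures quoted in this section summarize how concentrated usage is across tasks. A minimal sketch of how both can be computed from per-task usage counts; the Zipf-like toy distribution is an assumption for illustration, not Economic Index data, and the reports may use a different estimator:

```python
def gini(values: list[float]) -> float:
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula over ascending values.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def top_k_share(values: list[float], k: int) -> float:
    """Fraction of total usage carried by the k largest tasks."""
    xs = sorted(values, reverse=True)
    return sum(xs[:k]) / sum(xs)


# Toy data: 1,000 hypothetical tasks with usage proportional to 1/rank.
toy_usage = [1 / rank for rank in range(1, 1001)]

print(f"Gini: {gini(toy_usage):.2f}")
print(f"Top 10 share: {top_k_share(toy_usage, 10):.0%}")
```

A higher Gini together with a higher top-10 share, as reported for the API relative to Claude.ai, indicates traffic packed into fewer tasks.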

Geography and adoption unevenness

  • Per-capita adoption leaders (Sep 2025): Israel 7.0x expected, Singapore 4.57x, Australia 4.10x, New Zealand 4.05x, South Korea 3.73x, US 3.62x, UK 2.67x. (Economic Index Sep 2025)
  • Low adopters per capita: Nigeria 0.2x, India 0.27x, Indonesia 0.36x. (Economic Index Sep 2025)
  • Absolute volume: US 21.6%, India 7.2%, Brazil 3.7% of global Claude usage. (Economic Index Sep 2025)
  • U.S. state leaders per capita: DC 3.82x, Utah 3.78x, California 2.13x, New York 1.58x, Virginia 1.57x. (Economic Index Sep 2025)
  • Income elasticity: a 1% increase in GDP per capita corresponds to 0.7% higher usage globally and 1.8% within U.S. states. Richer regions use AI more per person and for more diverse tasks. (Economic Index Jan 2026; Economic Index Mar 2026)
  • Geographic task specialization: DC concentrates on document editing (1.84x national average) and job applications; California on IT and digital marketing; India predominantly on software development (coding over 50% of usage vs. about 33% globally); Brazil on translation and legal services. (Economic Index Mar 2026)
  • U.S. state-level Gini fell from 0.37 to 0.32 between August and November 2025. Top 5 states’ per-capita share fell from 30% to 24% by Feb 2026, but convergence slowed: estimated equalization revised from 2-5 years to 5-9 years. (Economic Index Jan 2026; Economic Index Mar 2026)
  • Internationally, the top 20 countries’ per-capita adjusted share increased from 45% to 48%, indicating international diffusion is lagging U.S. domestic broadening. (Economic Index Mar 2026)
  • In countries with low adoption (e.g., India), coding dominates (over 50% of usage) and delegation/automation patterns prevail even controlling for task mix. High-adoption countries show more augmentation, more diverse task spread, and more personal and educational use. (Economic Index Jan 2026; Economic Index Mar 2026)
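
The income-elasticity figures above can be made concrete with a constant-elasticity model, in which usage scales as GDP per capita raised to the elasticity. This functional form is an assumption for illustration; the reports state the elasticities (0.7 globally, 1.8 within U.S. states), not this exact model:

```python
def usage_ratio(gdp_ratio: float, elasticity: float) -> float:
    """Expected ratio of per-capita usage between two regions.

    Assumes usage is proportional to (GDP per capita) ** elasticity.
    """
    return gdp_ratio ** elasticity


# A 1% richer region, global elasticity: about 0.7% more usage.
print(round(usage_ratio(1.01, 0.7), 4))  # 1.007

# A region twice as rich, global vs. within-U.S. elasticity.
print(round(usage_ratio(2.0, 0.7), 2))   # 1.62
print(round(usage_ratio(2.0, 1.8), 2))   # 3.48
```

Under this assumption a doubling of income implies roughly 1.6x the usage across countries but about 3.5x across U.S. states.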

Tenure effects

  • Users with 6 or more months of tenure show a 4pp higher task success rate, are 7pp more likely to use Claude for work, show 6% more education signal in inputs, have more diverse task spread, and perform tasks requiring approximately one additional year of education per year on the platform. (Economic Index Mar 2026)
  • Education required for tasks performed via Claude increases nearly one year per additional year of Claude usage. (Economic Index Mar 2026)

The falling floor

Chinese open-weight models as of early June 2026: who they are, what they cost, where they win and lose, and how much of OpenRouter they carry.

OpenRouter market share

  • As of early June 2026, six of the top ten models by weekly token volume on OpenRouter are Chinese open-weight models: DeepSeek V4 Flash (4.23T tokens, +55% week-over-week), Tencent Hy3 Preview (3.53T, +25%), MiniMax M3 (3.22T), Xiaomi MiMo-V2.5 (2.73T), DeepSeek V4 Pro (2.03T), and DeepSeek V3.2 (1.18T).
  • By model author on OpenRouter the same week: DeepSeek 18.0%, Anthropic 16.6%, Tencent 9.9%, Google 9.9%, Xiaomi 9.4%, MiniMax 9.0%, OpenAI 6.9%. Chinese authors sum to roughly 49.7% of all tokens; the four listed here (DeepSeek, Tencent, Xiaomi, MiniMax) account for 46.3 points of that. Source: openrouter.ai/rankings, read June 10, 2026.
  • Chinese-origin providers crossed 51.2% of all OpenRouter token volume by April 2026, up from roughly 1.2% in October 2024. Source: digitalapplied.com
  • In February 2026, Chinese models held roughly 85.7% of the top-five OpenRouter volume. Source: aicost.org
  • Chinese models hold 7 of the top 10 coding-collection slots on OpenRouter as of June 2026. Source: digitalapplied.com
  • Over the November 2024 to November 2025 period: DeepSeek ranked first at 14.37T total tokens, Qwen second at 5.59T, Meta LLaMA third at 3.96T. Source: arxiv.org/abs/2601.10088
  • Programming workloads on OpenRouter grew from roughly 11% to over 50% of total token volume through 2025. Source: techtimes.com
  • Chinese developers accounted for 17.1% of all HuggingFace downloads versus US developers at 15.8% (August 2024 to August 2025). 63% of all new fine-tuned models on HuggingFace in September 2025 were based on Chinese base models. Source: the-decoder.com

DeepSeek V4 Pro and V4 Flash

  • V4-Pro is 1.6T parameters, 49B active (MoE), MIT license, 1M-token context. V4-Flash is 284B parameters, 13B active. Both released April 24, 2026. Hybrid Compressed Sparse Attention reduces inference FLOPs to 27% and KV cache to 10% of V3.2 at 1M-token context. Three reasoning modes: non-think, think high, think max. Source: artificialanalysis.ai
  • V4-Pro SWE-bench Verified 80.6% (tied with Gemini 3.1 Pro, independently confirmed). Codeforces ELO 3,206. LiveCodeBench 93.5. MCPAtlas agentic score 73.6 (versus Claude Opus 4.6 at 73.8). Source: codersera.com
  • V4-Flash supports 5x higher concurrency than V4-Pro (2,500 versus 500 req/s) and is the volume leader by mid-2026, reaching 4.48T weekly tokens after its April 24 launch. Source: openrouter.ai/deepseek/deepseek-v4-flash
  • Pricing (permanent as of May 22, 2026): V4-Pro $0.435/M input (cache miss), $0.003625/M (cache hit), $0.87/M output. V4-Flash $0.14/M input, $0.028/M (cache hit), $0.28/M output. Source: api-docs.deepseek.com
  • Weaknesses: text-only at launch (no vision or audio). Roughly 24% timeout rate on hardest reasoning problems. SimpleQA 57.9 versus Gemini 3.1 Pro at 75.6, an 18-point factuality deficit. Hallucination rate of 94% on AA-Omniscience (V4-Flash 96%), among the worst recorded on that benchmark. SWE-bench Pro claim of 55.4% is unverified by third parties and trails Claude Opus 4.7 at 64.3%. Schema drift (field omissions, type errors) in 15-30 step tool-calling chains. Self-hosting requires 8x H100 minimum (862GB weights). Source: artificialanalysis.ai, milvus.io
  • Political censorship: roughly 85% refusal rate on Tiananmen Square, Taiwan, Uyghurs, Xi Jinping criticism, and the Cultural Revolution, embedded in weights and persisting when self-hosted. Cisco testing found 100% jailbreak success rate. Injects roughly 50% more security bugs when prompts include Chinese political trigger terms (CrowdStrike research). Formally quantified in arXiv:2506.12349. Sources: venturebeat.com, arxiv.org
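
Because cache-hit and cache-miss input tokens are priced so differently, the effective bill depends heavily on the cache-hit rate. A minimal sketch of a blended-cost calculation using the list prices quoted above; the workload (100M input tokens, 5M output tokens, 90% cache hits) is hypothetical, and the class and function names are illustrative:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_miss: float  # $ per million input tokens, cache miss
    input_hit: float   # $ per million input tokens, cache hit
    output: float      # $ per million output tokens


V4_PRO = Pricing(input_miss=0.435, input_hit=0.003625, output=0.87)
V4_FLASH = Pricing(input_miss=0.14, input_hit=0.028, output=0.28)


def workload_cost(p: Pricing, input_tokens: float, output_tokens: float,
                  cache_hit_rate: float) -> float:
    """Dollar cost of a workload given the share of input served from cache."""
    hit_tokens = input_tokens * cache_hit_rate
    miss_tokens = input_tokens - hit_tokens
    dollars_times_million = (miss_tokens * p.input_miss
                             + hit_tokens * p.input_hit
                             + output_tokens * p.output)
    return dollars_times_million / 1_000_000


# Hypothetical agentic workload: 100M input, 5M output, 90% cache hits.
for name, pricing in [("V4-Pro", V4_PRO), ("V4-Flash", V4_FLASH)]:
    print(f"{name}: ${workload_cost(pricing, 100e6, 5e6, 0.9):.2f}")
# V4-Pro: $9.03
# V4-Flash: $5.32
```

At these list prices the gap between the two models narrows as the cache-hit rate rises, since V4-Pro's quoted cache-hit price is the lower of the two.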

Qwen family (Alibaba)

  • Qwen3-Coder-Next is 80B total parameters, 3B active (MoE), Apache 2.0 license, 256K native context (extendable to 1M via YaRN), released April 2026. Because only 3B of the 80B parameters activate per token, local inference is viable. Source: huggingface.co/Qwen/Qwen3-Coder-Next
  • Qwen3-Coder-Next: 71.3% SWE-bench Verified, 92.7% average tool-call format accuracy across five IDE/CLI environments, 62.8% SWE-bench Multilingual. Qwen3-235B-A22B: 95.6 ArenaHard, 2056 Codeforces ELO, leads on BFCL and LiveCodeBench v5. Source: logrocket.com
  • Qwen family surpassed 1 billion cumulative HuggingFace downloads as of early 2026. In February 2026: 153.6M downloads, more than double the next eight providers combined. Overtook Meta Llama as the most-downloaded family in September 2025. 113,000+ derivative models. Source: scmp.com
  • Pricing (OpenRouter): Qwen3-Coder-Next $0.11/M input, $0.80/M output. Qwen3-Coder-480B $0.22/$1.80. Alibaba Cloud direct: Qwen-Turbo $0.05/$0.20, Qwen-Flash $0.10/$0.40, Qwen-Plus $0.40/$1.20, Qwen-Max $1.20/$6.00. Mainland China pricing runs 50-75% cheaper. Sources: alibabacloud.com, openrouter.ai
  • Weaknesses: hallucination rates are model-size-dependent and often high: Qwen3-235B-A22B 52%, Qwen3-next-instruct 32%, Qwen3.6-Plus 26.5% (BridgeBench). Small models consume 230-390M output tokens to run standard benchmarks, inflating effective cost. Trails Gemini 2.5 Pro on ArenaHard, AIME, and Aider Pass@2. Source: artificialanalysis.ai
  • Political alignment: China Media Project (February 2026) used thought token forcing to expose hidden system instructions. Questions about China trigger “Keep the answer positive and constructive” directives; identical questions about the US or Belgium trigger “neutral and objective” instructions. Mechanistic interpretability on Qwen3.5-9B found censorship is a small, identifiable circuit in the weights. Source: chinamediaproject.org, interconnects.ai
  • Runs locally on 32GB+ Apple Silicon (M2 Pro+) at 4-bit quantization. Qwen3-8B runs on 16GB M1/M2/M3/M4 (5.2GB download via Ollama). Qwen3-72B needs 64GB (M3/M4 Max). Source: macgpu.com
  • Qwen Code CLI: free/open, runs local Ollama backends with file editing, terminal commands, and MCP tool use. LogRocket hands-on: “not outperforming Claude Code yet, but close. The biggest draw is far more cost-effective.” Source: logrocket.com

Kimi K2.6 (Moonshot AI)

  • 1T parameters, 32B active (MoE), Modified MIT license, 262K context, released April 20, 2026. Agent swarm architecture: 300 sub-agents, 4,000 coordinated steps. Source: artificialanalysis.ai
  • SWE-bench Pro 58.6% (ties GPT-5.5). SWE-bench Verified 80.2%. BrowseComp Agent Swarm 86.3%. Toolathlon 50.0 versus Claude 47.2. Humanity’s Last Exam with tools 54.0% versus Claude Opus 4.6 at 53.0%. Artificial Analysis Intelligence Index 54, tied first among open-weight models with MiMo-V2.5-Pro. Source: artificialanalysis.ai
  • First week on OpenRouter: 1.88T tokens (7,683% growth rate), briefly overtaking Claude Sonnet 4.6 and DeepSeek V3.2. Demonstrated 13-hour autonomous coding session: 4,000+ tool calls, 4,000+ lines of code modified, improving a Zig inference engine from 15 to 193 tokens/second. Source: phemex.com
  • Pricing: $0.68/M input, $3.41/M output on OpenRouter. Kimi K2 Thinking variant: $0.60/$2.50, 262K context. Source: medium.com/@tentenco
  • Weaknesses: pathologically verbose, generating 170M output tokens during AA Intelligence Index evaluation versus platform average of 43M (4x more verbose, ranked 68 of 89 for verbosity). This compresses headline cost savings by 60-70% on reasoning-heavy tasks. Output speed 58.3 t/s versus median 60.4. 262K context versus DeepSeek V4’s 1M. vLLM out-of-the-box tool-call success rate: 18% (improved to roughly 76% after Moonshot-provided patches). K2.5 was documented retrying the same wrong tool selection 20+ consecutive times without self-correction. No independent third-party verification of K2.6 benchmark claims as of May 2026. Source: vllm.ai, artificialanalysis.ai
  • Safety evaluation (arXiv:2604.03121): K2.5 shows political censorship aligned with CCP, more compliance on disinformation and copyright infringement than GPT-5.2/Claude Opus 4.5, fewer refusals on CBRNE requests, and more permissive behavior in English/Spanish/Arabic than in Chinese. Source: arxiv.org/abs/2604.03121
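
Verbosity erodes a per-token price advantage because the bill is price times tokens. A rough sketch of the direction of that effect, assuming a frontier comparison model priced at $25/M output (the Claude Opus 4.8 list price cited in these notes) that generates the platform-average 43M tokens on the same evaluation. That pairing is an assumption, and the sketch does not attempt to reproduce the 60-70% figure, which depends on the comparison model and task mix:

```python
def output_cost(tokens_millions: float, price_per_million: float) -> float:
    """Dollar cost of generated output tokens."""
    return tokens_millions * price_per_million


KIMI_PRICE, KIMI_TOKENS = 3.41, 170          # $/M output, M tokens generated
BASELINE_PRICE, BASELINE_TOKENS = 25.00, 43  # assumed comparison model

kimi_cost = output_cost(KIMI_TOKENS, KIMI_PRICE)
baseline_cost = output_cost(BASELINE_TOKENS, BASELINE_PRICE)

headline_advantage = BASELINE_PRICE / KIMI_PRICE  # per-token price ratio
effective_advantage = baseline_cost / kimi_cost   # ratio of actual bills

print(f"headline: {headline_advantage:.1f}x cheaper")    # headline: 7.3x cheaper
print(f"effective: {effective_advantage:.1f}x cheaper")  # effective: 1.9x cheaper
```

Under these assumptions a roughly 7x price advantage per token shrinks to under 2x per finished evaluation.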

Xiaomi MiMo-V2.5-Pro

  • 1.02T parameter MoE, 1M context, open-weight, released April 22, 2026. Source: digitalapplied.com
  • Number one by OpenRouter weekly volume in April 2026 at 4.65T tokens/week (22.3% platform share). Holds 25.5% of all coding traffic on OpenRouter. SWE-bench Pro 57.2%, ahead of Claude Opus 4.6 at 53.4% and competitive with GPT-5.4 at 57.7%. Artificial Analysis Intelligence Index 54, tied for first among open-weight models with Kimi K2.6. Source: artificialanalysis.ai
  • Pricing: $0.435/M input, $0.87/M output (same as DeepSeek V4-Pro).
  • Weaknesses: Xiaomi is primarily a consumer hardware company; model provenance and fine-tuning data are less documented than DeepSeek or Alibaba. Limited third-party evaluation depth as of June 2026.

GLM-5.1 and GLM-5 (Zhipu AI / Z.ai)

  • 744-745B parameters, 40-44B active (MoE), MIT license, 200-203K context, text-only. Trained entirely on Huawei Ascend chips with no Western silicon dependency. GLM-5 released February 12, 2026; GLM-5.1 released April 7, 2026. Source: automatio.ai
  • GLM-5.1 SWE-bench Pro 58.4%, the top open-source position on that benchmark as of April 2026, edging GPT-5.4 (57.7%) and Claude Opus 4.6 (57.3%). SWE-bench Verified 77.8%. GPQA Diamond 86.0%, AIME 2026 92.7%. GLM-5 held 5.6% of OpenRouter weekly token share (1.12T tokens/week) as of April 2026.
  • Pricing: OpenRouter GLM-5 roughly $0.60/$1.92/M. GLM-5.1 $0.98/$3.08/M. Z.ai Coding Plan: $10/month (Lite, 120 requests), $30/month (Pro, 600 requests). Source: openrouter.ai/z-ai/glm-5.1
  • The Z.ai Coding Plan explicitly markets GLM-5.1 as a drop-in substitute for Claude Code via API key swap, compatible with Cline, Cursor, Kilo Code, and 20+ other tools. Source: help.apiyi.com
  • Weaknesses: slowest in the competitive tier at 44.3 t/s. Chinese-language openness score 39.6% versus 95.1% in Portuguese; a fake Claude system prompt raises Chinese score by 34 points, indicating censorship is prompt-sensitive rather than fully weight-locked. Identity confusion documented: model intermittently identified itself as Claude under indirect queries. Zhipu AI has been on the US Commerce Department Entity List since January 2025. Source: blog.return.moe

MiniMax M3 and M2.7

  • M3: multimodal MoE, native image and video input, 1M context, released June 1, 2026. M2.7: text-only, 205K context, released March 18, 2026. Source: artificialanalysis.ai
  • M3 vendor-reported benchmarks: SWE-bench Pro 59.0% (ahead of GPT-5.5 at 58.6%), Terminal-Bench 2.1 66.0%, BrowseComp 83.5%, MCP Atlas tool-use 74.2%, AA Intelligence Index 55. Hallucination rate 16.1% on AA-Omniscience, best in class among Chinese models if confirmed. All M3 benchmark numbers are vendor-run; independent third-party verification pending as of June 10, 2026. Source: artificialanalysis.ai
  • M2.7 held roughly 4.55T tokens/month in February 2026, the top OpenRouter slot at the time. Pricing: M2.7 and M3 promo at $0.30/$1.20/M. Source: artificialanalysis.ai
  • License risk: M2.7 was released under an MIT-style label, then quietly updated to “Modified-MIT” requiring written authorization for commercial use. Community reaction on HN and HuggingFace was strongly negative. The pattern is expected to continue with M3. Source: decrypt.co
  • Weaknesses: M2.7 output speed 45.5 t/s (below median 60.4). M2.7 generates non-functional API call signatures in production (independent May 2026 Ruby on Rails coding benchmark).

Pricing versus frontier (June 2026)

Model | Input $/MTok | Output $/MTok
DeepSeek V4 Flash | $0.098 | $0.197
Qwen3-Coder-Next | $0.11 | $0.80
Step 3.7 Flash | $0.20 | n/a
MiniMax M3 (promo) | $0.30 | $1.20
MiMo-V2.5-Pro | $0.435 | $0.87
DeepSeek V4-Pro | $0.435 | $0.87
Kimi K2.6 | $0.68 | $3.41
GLM-5.1 | $0.98 | $3.08
Claude Opus 4.8 | $5.00 | $25.00

Sources: digitalapplied.com, api-docs.deepseek.com, morphllm.com

  • DeepSeek V4-Pro cache-hit input at $0.003625/M versus Anthropic’s $0.50/M is a 138x difference. Source: morphllm.com
  • The broader market dropped from roughly $60/M tokens (early 2024) to $1-2/M by 2026, a roughly 96% average cost reduction. Source: morphllm.com
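
The price gaps in this section are easier to read as multiples of the frontier row. A short sketch that derives them from the table values and rechecks the 138x cache-hit ratio; it is plain arithmetic on the listed prices and says nothing about quality or tokens needed per task:

```python
FRONTIER_INPUT, FRONTIER_OUTPUT = 5.00, 25.00  # Claude Opus 4.8 row

# (input, output) in $ per million tokens, as listed in the table.
PRICES = {
    "DeepSeek V4 Flash": (0.098, 0.197),
    "Qwen3-Coder-Next": (0.11, 0.80),
    "Step 3.7 Flash": (0.20, None),  # output price listed as n/a
    "MiniMax M3 (promo)": (0.30, 1.20),
    "MiMo-V2.5-Pro": (0.435, 0.87),
    "DeepSeek V4-Pro": (0.435, 0.87),
    "Kimi K2.6": (0.68, 3.41),
    "GLM-5.1": (0.98, 3.08),
}


def multiple(frontier_price, price):
    """How many times cheaper a price is than the frontier price."""
    return None if price is None else frontier_price / price


for name, (inp, out) in PRICES.items():
    in_x = multiple(FRONTIER_INPUT, inp)
    out_x = multiple(FRONTIER_OUTPUT, out)
    out_text = "n/a" if out_x is None else f"{out_x:.0f}x"
    print(f"{name}: input {in_x:.0f}x, output {out_text}")

# Cache-hit input: $0.50/M versus $0.003625/M.
print(round(0.50 / 0.003625))  # 138
```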

Local and Mac-runnable

  • Qwen3-8B: 5.2GB download via Ollama, runs on 16GB M1/M2/M3/M4. Zero marginal API cost.
  • Qwen3-Coder-Next (80B MoE, 3B active): runs on 32GB+ M2 Pro+ at 4-bit quantization with MLX acceleration.
  • Qwen3-72B: needs 64GB RAM (M3/M4 Max).
  • DeepSeek V4-Flash (284B/13B active): requires roughly 170-175GB VRAM, needs 2x H200. Not consumer hardware.
  • DeepSeek V4-Pro (1.6T/49B active): 862GB weights, minimum 8x H100. Data center only.
  • Self-hosted Chinese open-weight models are “often 70-90% cheaper than equivalent closed models after self-hosting costs.” Source: digitimes.com

Macro context

  • Stanford HAI 2026: Chinese open-weight models “almost reached the level of leading closed models by end of 2025,” trailing US proprietary systems by approximately seven months. Source: hai.stanford.edu
  • Meta exited competitive open-weight AI: in April 2026, Meta Superintelligence Labs released Muse Spark as a closed-weight model, citing competitive pressure from Chinese open models. Llama 4 Behemoth was never shipped. Source: zapier.com
  • DeepSeek’s $5.576M training claim covers only the final GPU pre-training run. Stanford FSI notes DeepSeek purchased roughly 10,000 A100s (roughly $80M) and roughly 50,000 H800s (roughly $50M). Critics estimate the true all-in cost at $1.3-1.6 billion. Source: cyber.fsi.stanford.edu
  • House Homeland Security Committee and House China Select Committee opened a joint formal investigation into national security and cybersecurity risks of Chinese AI models on April 29, 2026. Letters were sent to Airbnb and Anysphere (Cursor). Sources: warontherocks.com, semafor.com
  • China’s National Intelligence Law requires all Chinese companies to “support, assist, and cooperate” with government intelligence activities, regardless of server location or privacy policies. Source: techtimes.com

Documents and deals

The publicly documented record of AI in commercial real estate and professional services: JLL GPT, Falcon, lease abstraction, and contract analysis at industry scale.

  • JLL launched JLL GPT in August 2023, publicly describing its core use cases as enabling professionals to “quickly draft documents, summarize documents, and brainstorm ideas.” More than 11,000 employees used it within the first 48 hours. Commercial Observer, Aug 2023

  • By October 2024, 47,000+ JLL professionals had used JLL GPT; 25,000+ use it monthly, and the platform processes 200,000+ prompts per week with 27,000 weekly active users. JLL IR, Oct 2024

  • JLL Falcon launched in October 2024 with 60+ AI-enabled features and 25x more working memory than the original JLL GPT. Publicly described use cases include: compile client research, pull key data from massive reports, draft emails, provide workplace planning advice, improve building efficiency, and generate 3D leasing visualizations. JLL PR Newswire, Oct 2024

  • Falcon includes document-extraction chatbots that answer specific questions from documents stretching 30+ pages and link responses back to source paragraphs, replacing what CTO Yao Morin described as the need to “sit for hours to scroll through a document to find one piece of information.” REJournals, Oct 2024

  • Broker prospecting email drafting time dropped from roughly 1 hour to 15 minutes using JLL GPT, a time-savings figure attributed to CTO Yao Morin at launch. Commercial Observer, Aug 2023

  • 1 in 5 JLL Capital Markets opportunities globally was AI-enabled in Q1 2023, cited in the JLL GPT launch press release. JLL IR, Aug 2023

  • JLL Capital Markets employees use APIs built on Falcon to extract numeric values from lease clause invoices, turning hours of manual contract review into a near-instant query. JLL guide, 2024

  • JLL’s AI lease invoice review program cross-checks rent invoices against lease terms to surface overpayments and missed clauses, generating nearly $70 million in client cost savings and avoidance in a single year. JLL guide

  • JLL Lease Navigator is a multi-agent AI solution covering lease administration, accounting and compliance, data analytics, and portfolio optimization. Specialized agents analyze both unstructured documents and structured databases. JLL insights

  • JLL publicly states it uses “AI-powered abstraction tools to enhance first-pass accuracy and reduce cycle times,” naming standardized lease abstraction, automated document management, compliance monitoring, and automated auditing as the four core AI applications within lease administration. JLL guide

  • JLL lease administration services have delivered $57 million in efficiencies to clients, and in one engagement projected $165 million in future cost avoidance for a single global financial institution managing roughly 7,000 data points and 700 transactions per year with 100% data accuracy across four consecutive audit quarters. JLL services page

  • JLL signed a global agreement with Leverton in December 2016 to deploy AI-powered lease abstraction and contract analysis across JLL clients in North America, Continental Europe, and Asia Pacific. Leverton’s deep-learning system extracts key terms from corporate leases and contracts in 20+ languages, outputting structured data compatible with Yardi and MRI. Artificial Lawyer, Dec 2016

  • JLL Leasing replaced manual lease and LOI abstraction with Cadastral’s AI platform. Abstract generation time dropped from multiple days to seconds, saving hundreds of thousands of dollars annually. Brokers can compare draft LOIs instantly and answer complex lease questions via AI chat. Cadastral case study

  • JLL UK Managed Services deployed DealSumm for AI lease abstraction, with JLL COO Jeffrey Jordan noting that the linked abstracted data has “reduced errors and given us greater confidence in the accuracy of our lease information.” DealSumm claims 70% improvement in delivery speed and efficiency. DealSumm

  • JLL Asset Beacon, a November 2024 joint venture with Slate Asset Management, is a SaaS platform integrating Falcon capabilities including lease abstraction and entity resolution for CRE investors. JLL newsroom, Nov 2024

  • JLL Azara adds natural language query to corporate real estate, facilities, and IoT data, converting questions into SQL queries. Corporate occupier clients average 300 natural-language questions answered per month. Previously 70% of JLL’s portfolio data was inaccessible to clients; a Microsoft case study states Azara delivered “a year’s worth of insights in one week.” JLL IR, 2024 Microsoft case study

  • JLL Property Assistant (2025) uses natural language chat to auto-generate tenancy reports, stacking plans, finance reports, and expense-trend analyses by integrating data from Yardi and MRI. JLL newsroom

  • JLL’s own published research estimates AI taking over tasks like lease abstraction could free roughly 20% of an asset or portfolio manager’s time, projecting approximately 0.5% revenue uplift from redeployment. JLL insights

  • Nearly 100% of JLL’s 110,000 employees have used the platform at least once; roughly 25% use internal AI platforms daily; 100% of JLL’s 800 software developers use AI coding agents. Runtime.news

  • 88% of real estate investors were already piloting AI as of a 2025 JLL investor survey, up from 5% in 2023. 92% of corporate real estate teams had initiated AI pilots by mid-2025. JLL identified 56 AI use cases across the full CRE value chain in its 2025 technology survey. JLL insights, 2025 JLL tech survey JLL newsroom

  • Yao Morin, CTO of JLL, on proprietary data as competitive advantage: “For us in JLL and commercial real estate in general, we really need to be the expert. There are a lot of data and information that is not available to the general public that we wanted to capture and leverage through the GPT technology.” CIO Dive, 2023

  • Yao Morin, CTO of JLL, on AI as action-taker: “AI is not just answering questions or processing data, but actually taking actions for you to get to that last mile.” Facilities Dive

  • Phoebe Holtzman, Global Director of Data Science Innovation, JLL: “We can now take data from documents, we can generate data from Street View or satellite images to create structured datasets previously difficult to produce.” JLL insights

  • Richard Bloxam, CEO Capital Markets, JLL, at the JLL GPT launch: “Connecting buyers, sellers and lenders at the right time, with the right data in hand, within seconds, is going to determine success in this new generative AI era.” PR Newswire, Aug 2023

  • JLL cites industry research showing AI adopters achieve 18% more accurate valuations and 23% faster transaction times versus conventional approaches. JLL insights

The price of a finished task

The efficiency turn: OpenAI and Anthropic both now sell cost per task, not cost per token. Launch claims, dollars-per-solved-task benchmarks, Cursor's live scoreboard, field reports, and the counter-evidence. Gathered June 11, 2026.

  • GPT-5 launch, Aug 7, 2025: “GPT-5 (with thinking) performs better than OpenAI o3 with 50-80% less output tokens across capabilities.” The developer post adds: vs o3-high on SWE-bench Verified, GPT-5 uses 22% fewer output tokens and 45% fewer tool calls.
  • GPT-5.1, Nov 2025: “adaptive reasoning.” “On straightforward tasks, GPT-5.1 spends fewer tokens thinking, enabling snappier product experiences and lower token bills.”
  • GPT-5.1-Codex-Max, Nov 19, 2025: first model natively trained for compaction across context windows; beats GPT-5.1-Codex at the same effort “while using 30% fewer thinking tokens.”
  • GPT-5.2, Dec 2025: the explicit argument, in the launch post: “despite GPT-5.2’s greater cost per token, the cost of attaining a given level of quality ended up less expensive due to GPT-5.2’s greater token efficiency.” Price rose to $1.75/$14 per million tokens.
  • GPT-5.4, Mar 5, 2026: “the most token efficient reasoning model yet, using significantly fewer tokens to solve problems when compared to GPT-5.2.” Price $2.50/$15.
  • GPT-5.5, Apr 23, 2026: roughly 40% fewer output tokens than GPT-5.4 on the same Codex tasks, at double the price ($5/$30). OpenAI’s math: 2x price times 0.6x tokens is roughly a 20% effective increase, not 100%.
  • The independent check: The Register, May 8, 2026 measured real GPT-5.5 cost increases of 49-92% vs GPT-5.4, offset by only 19-34% fewer completion tokens, and the savings show up mainly on prompts over 10K tokens. Corroborated by The Decoder and OpenRouter.
  • Claude Opus 4.5, Nov 24, 2025: at medium effort, “matches Sonnet 4.5’s best score on SWE-bench Verified, but uses 76% fewer output tokens”; at high effort, exceeds Sonnet 4.5 by 4.3 points using 48% fewer tokens. Introduced the effort parameter. Price cut from Opus 4.1’s $15/$75 to $5/$25.
  • Claude Opus 4.6, Feb 2026: Terminal-Bench 2.0 by effort level: 65.4% at max; 61.1% at medium with 23% fewer output tokens; 55.1% at low with 40% fewer.
  • Claude Opus 4.8, May 28, 2026: fast mode cut 3x to $10/$50. Cursor, as launch customer: “Tool calling is meaningfully more efficient, using fewer steps for the same intelligence.” Databricks: its Genie agent runs “at 61% cheaper token cost than Opus 4.7,” a token-efficiency gain at identical list pricing.
  • Anthropic’s model-choice docs: “Tuning effort is often a better lever than switching models.” The effort docs: effort affects all tokens in the response, including tool calls. The docs never claim Opus is cheaper per task than Sonnet; that framing lives only in the Opus 4.5 launch post.
  • The same vendor’s small-model counterclaim: Haiku 4.5, Oct 15, 2025 offers “similar levels of coding performance to Claude Sonnet 4 but at one-third the cost and more than twice the speed.”
  • Claude Fable 5, Jun 9, 2026: $10/$50, the most expensive frontier model; docs claim lower effort settings “often exceed xhigh performance on prior models.”
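
The effective-price arithmetic in the GPT-5.5 notes above can be written out in a few lines. This is a minimal sketch, not either vendor's method: the function name `effective_cost_multiple` is made up for illustration, and it assumes the bill scales with output tokens only, which is the same simplification the launch math makes.

```python
def effective_cost_multiple(price_multiple: float, token_multiple: float) -> float:
    """Per-task cost relative to the older model, assuming cost = price x tokens."""
    return price_multiple * token_multiple

# OpenAI's GPT-5.5 framing: 2x the price, roughly 40% fewer output tokens.
vendor_claim = effective_cost_multiple(2.0, 1 - 0.40)

# The Register's measured range: only 19-34% fewer completion tokens.
measured_best = effective_cost_multiple(2.0, 1 - 0.34)
measured_worst = effective_cost_multiple(2.0, 1 - 0.19)

print(f"vendor claim: {vendor_claim:.2f}x")
print(f"measured token savings imply: {measured_best:.2f}x to {measured_worst:.2f}x")
```

Under this simplification the measured token savings imply a 32% to 62% increase, which sits below the measured 49-92% range at both ends, so output tokens alone do not account for the measured bill.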

The price ladder under the claims

  • OpenAI’s per-token input price quadrupled and its output price tripled across the 5.x line in under nine months ($1.25/$10 in Aug 2025 to $5/$30 in Apr 2026) while every release claimed better token efficiency. Anthropic held Opus at $5/$25 across four releases, then priced Fable 5 at 2x. Current tier ratios: Opus is 1.67x Sonnet per token and 5x Haiku; GPT-5.5 is 2x GPT-5.4. Sources: Anthropic pricing, the-decoder.
  • Epoch AI, “The Price of Progress” (arXiv 2511.23455): the price of a given benchmark performance level falls 5-10x per year, fastest at the top (the highest GPQA-Diamond bin fell 31x/year vs 1.7x for the lowest). Same paper: “While per-token prices have generally declined, the cost of running frontier-level models has nonetheless risen approximately exponentially.”
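
The ladder can be checked rung by rung. A minimal sketch using only the list prices quoted in these notes; the helper name `multiples_vs_first` and the dictionary labels are illustrative, not anything the vendors publish.

```python
# (input, output) list prices in dollars per million tokens, as quoted in the notes.
LADDER = {
    "GPT-5 (Aug 2025)": (1.25, 10.00),
    "GPT-5.2 (Dec 2025)": (1.75, 14.00),
    "GPT-5.4 (Mar 2026)": (2.50, 15.00),
    "GPT-5.5 (Apr 2026)": (5.00, 30.00),
}

def multiples_vs_first(ladder: dict[str, tuple[float, float]]) -> dict[str, tuple[float, float]]:
    """Return each rung's input and output price as a multiple of the first rung."""
    base_in, base_out = next(iter(ladder.values()))
    return {
        name: (p_in / base_in, p_out / base_out)
        for name, (p_in, p_out) in ladder.items()
    }

for name, (m_in, m_out) in multiples_vs_first(LADDER).items():
    print(f"{name}: input {m_in:.2f}x, output {m_out:.2f}x")
```

Input and output prices do not move in lockstep: the last rung comes out at 4x the first on input and 3x on output.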

Dollars per solved task

  • Artificial Analysis on Opus 4.5: cost to run their full Intelligence Index on Anthropic’s flagship fell from about $3.1K (Opus 4.1) to about $1.5K (Opus 4.5). The mechanism was the price cut, not frugality: Opus 4.5 used 60% more tokens than Opus 4.1 (48M vs 30M).
  • Aider polyglot leaderboard (raw data in the repo): GPT-5 (high) scored 88.0% for $29.08 total, vs o3-pro (high) at 84.9% for $146.32 and Claude Opus 4 (32K thinking) at 72.0% for $65.75. The best frontier model beat pricier reasoning models on score and total cost at once.
  • Same data, cost per solved task: GPT-5 high about $0.15/solved at 88%; GPT-4.1-mini about $0.027/solved but only 32.4% solved; DeepSeek V3.2 Reasoner about $0.008/solved at 74.2%. Cheap models are cheaper per task they can solve, and cannot solve most of the set at any price.
  • Verbosity at the cheap end (Artificial Analysis, 2026): DeepSeek V4 Pro generated 190M tokens, DeepSeek V4 Flash 240M, Kimi K2.6 170M to run the Index, against an average near 43M. Low per-token prices are partly eaten by roughly 4-6x token volume.
  • Artificial Analysis on GPT-5.5: price doubled, token use fell about 40%, net cost to run the Index up roughly 20%. Third-party confirmation of OpenAI’s effective-cost math at benchmark scale.
  • ARC Prize GPT-5 tier data, Aug 2025: on ARC-AGI-1, GPT-5 Mini solved at about $0.22 per solved task vs GPT-5’s $0.78; the mini wins per solve when it can solve. On ARC-AGI-2 the cheap tiers barely solve anything (Mini 4.4%, Nano 2.5%).
  • The frontier cost trajectory: o3-preview’s high-compute ARC-AGI-1 run was re-estimated from about $3,000 to about $30,000 per task (TechCrunch, Apr 2025); a year later GPT-5.2 reportedly hit 90.5% at $11.64/task (via ARC Prize data). The roughly 390x per-task efficiency gain cited for that year implies a baseline near $4,500 per task, between the two o3-preview estimates.
  • Princeton HAL (arXiv 2510.11977), 21,730 agent rollouts: on SWE-bench Verified Mini, configurations hitting the same accuracy differed up to about 6x in cost (o4-mini low $259 vs Claude Opus 4.1 high $1,600 at identical 54%), and higher reasoning effort reduced accuracy in a majority of runs.
  • “The Danger of Overthinking” (arXiv 2502.08235), 4,018 trajectories: o1-high resolves 29.1% of SWE-bench issues for about $1,400; o1-low gets 21.0% at about $400. Selecting low-overthinking solutions improves performance almost 30% while cutting cost 43%.
  • Artificial Analysis, Apr 2025: reasoning models use up to 20x more tokens than non-reasoning models; GPT-5 at high effort uses 23x more tokens than at minimal effort.
  • The waste is removable, which proves it was waste: steering methods cut reasoning output “up to 71% while maintaining and even improving accuracy” (arXiv 2505.22411); batch prompting cuts 76% (arXiv 2511.04108).
  • Cost-per-resolved-task ranking, June 2026: DeepSeek V3.2 at about $0.028 per resolved SWE-bench Verified task, about 24x cheaper than Opus 4.5’s $0.68. The same analysis: “a pricier model with a higher solve rate can finish a feature for less total spend than a cheap model that needs three attempts.”
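
Cost per solved task is total benchmark spend divided by the number of tasks solved. A minimal sketch using the Aider figures above; the set size of 225 tasks is an assumption brought in from outside these notes, and `cost_per_solved` is an illustrative name.

```python
def cost_per_solved(total_cost: float, solve_rate: float, n_tasks: int) -> float:
    """Total benchmark spend divided by the number of tasks actually solved."""
    solved = solve_rate * n_tasks
    if solved <= 0:
        raise ValueError("no tasks solved; cost per solved task is undefined")
    return total_cost / solved

N_TASKS = 225  # assumption: size of the Aider polyglot set, not stated in the notes

# (total cost in dollars, solve rate) as reported on the leaderboard.
runs = {
    "GPT-5 (high)": (29.08, 0.880),
    "o3-pro (high)": (146.32, 0.849),
    "Claude Opus 4 (32K thinking)": (65.75, 0.720),
}

for name, (cost, rate) in runs.items():
    print(f"{name}: ${cost_per_solved(cost, rate, N_TASKS):.2f} per solved task")
```

With that set size the GPT-5 (high) row reproduces the roughly $0.15 per solved task quoted above. The function raises rather than returning infinity when nothing is solved, since a model that cannot solve a task has no meaningful per-solve price.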

Cursor’s scoreboard

  • cursor.com/evals is a live leaderboard that prices coding models per finished task: correctness score, average dollar cost per task, tokens per task, and steps per task, for every model at every effort setting. Cost is computed by applying each model’s published per-million-token prices (input, cache read, cache write, output) to the tokens it actually used.
  • Why it exists (Cursor’s CursorBench post, Mar 2026): public benchmarks saturated and contaminated. “SWE-bench Verified, Pro, and Multilingual all draw tasks from public repositories that end up in model training data, inflating scores. OpenAI recently stopped reporting SWE-bench Verified results entirely after finding that frontier models could reproduce gold patches from memory, and that nearly 60% of unsolved problems had flawed tests.”
  • Tasks come from real Cursor sessions via Cursor Blame, which traces committed code back to the agent request that produced it. Median task edits 181 lines vs 7-10 for SWE-bench variants; prompts are intentionally short and underspecified, like real requests (analysis). The harness is held constant across models, though it is not open-source, so that is asserted rather than verifiable (digitalapplied).
  • The board as of June 11, 2026: Claude Fable 5 holds the top four rows, one per effort setting from max down to medium: 72.9% at $18.02/task (63,842 tokens, 76 steps) at max, down to 69.8% at $8.27 at medium. The best non-Fable score, Opus 4.7 Max at 64.8%, costs $11.02: five points worse than Fable 5 Medium at a third more money.
  • The same-score, different-machine row set: Opus 4.7 Max scores 64.8% at $11.02 in 96 steps; GPT-5.5 Extra High scores 64.3% at $4.37 in 46 steps; Fable 5 Low scores 64.2% at $5.70 in 36 steps. Three configurations within 0.6 points; the newer models finish on a third of the tokens and in half the steps or fewer.
  • The mid-tier version: GPT-5.5 Low matches Sonnet 4.6 Max (48.8% vs 49.0%) at $1.19 vs $3.09, on 4,923 tokens vs 40,280. An 8x token gap between a new model loafing and an older model straining.
  • The counter-argument on the same board: Composer 2.5, Cursor’s own small fine-tuned model, scores 63.2% at $0.55/task, within 0.6 points of Opus 4.8 Max (63.8% at $7.59) at one-fourteenth the cost. Its base model (Kimi K2.5) scores 31.9%; the fine-tune roughly doubles it (benchlm.ai).
  • Effort settings span more of the board than model families: Opus 4.8 alone runs 54.3% to 63.8% ($2.93 to $7.59) across its effort levels. Steps per task is the closest published proxy for iteration rounds; no benchmark on the page measures human re-prompts or one-shot rates.
  • Independent partial replication (Artificial Analysis Coding Agent Index, own harness): Composer 2.5 third overall at $0.07-0.44/task vs Opus 4.7 max at $4.10 and GPT-5.5 xhigh at $4.82. The cost ratios replicate; the absolute dollars do not travel between harnesses (Cursor’s own page shows different absolutes for the same models).
  • The harness is worth as much as a model generation: one 100-feature benchmark found Cursor’s harness lifted Claude Opus from 77% to 93% and averaged +11% across models (buildmvpfast); Theo Browne: “Opus scored 20% higher in Cursor than in Claude Code”.
  • The conflict-of-interest critique (digitalapplied): “Cursor built the evaluation methodology, ran the v3.1 evaluation, published the scores, and simultaneously launched the model being scored.” The test set is not downloadable and the instance count is undisclosed; BenchLM displays CursorBench “for reference but excluded from the scoring formula.” The mitigating fact critics concede: Cursor’s own leaderboard puts a rival on top, not Composer.
  • The default-tier catch (topreviewed.ai): Composer 2.5’s headline economics use the Standard tier ($0.50/$2.50 per MTok); the interactive default is the Fast tier ($3/$15). The cheap rung requires configuration to reach.
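
The costing rule described for the leaderboard, published per-million-token prices applied to the tokens a model actually used, can be sketched as follows. This is an illustration of the stated rule, not Cursor's code: the `Prices` and `Usage` types, the function name, and every number in the example are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Prices:
    """Dollars per million tokens for each billed token class."""
    input: float
    cache_read: float
    cache_write: float
    output: float

@dataclass(frozen=True)
class Usage:
    """Tokens a model actually used on one task, by billed class."""
    input: int
    cache_read: int
    cache_write: int
    output: int

def task_cost(usage: Usage, prices: Prices) -> float:
    """Apply per-million-token list prices to the tokens used on a task."""
    per_token = 1 / 1_000_000
    return per_token * (
        usage.input * prices.input
        + usage.cache_read * prices.cache_read
        + usage.cache_write * prices.cache_write
        + usage.output * prices.output
    )

# Hypothetical rate card and usage, chosen only to exercise the formula.
example_prices = Prices(input=5.00, cache_read=0.50, cache_write=6.25, output=25.00)
example_usage = Usage(input=40_000, cache_read=900_000, cache_write=60_000, output=20_000)

print(f"${task_cost(example_usage, example_prices):.2f} per task")
```

Keeping the four token classes separate matters because cache reads are typically priced far below fresh input, so two runs with the same total token count can land on very different dollar figures.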

Field reports: fewer rounds, fewer retries

  • Simon Willison on Opus 4.5 launch day, doing the arithmetic in public: “Opus 4.5 is ~60% more expensive than Sonnet ($25/million output compared to $15/million) but if it can use 76% fewer output reasoning tokens for the same complex task it may end up cheaper!”
  • Simon Willison on GPT-5.5, Apr 23, 2026: default-effort output lagged GPT-5.4 and improved only “at the cost of far more tokens.” Vendor efficiency claims are effort-setting-dependent.
  • Zvi Mowshowitz on GPT-5.5: “first time since Claude Opus 4.5 came out… that I’ve considered a non-Anthropic model a competitive choice,” relaying OpenAI’s framing that “token use is more efficient now, so the headline price went up but real costs went down.”
  • Peter Steinberger, the human-time version: even when a model takes about 4x longer per task, he is net faster “because I don’t have to go back and fix issues.” Iteration rounds and rework, not per-token price, set his real cost.
  • Vantage, April 2026: “if the more capable model resolves tasks in fewer turns with fewer retry loops, the cost-per-outcome can end up lower despite the higher per-token price.” Their data: the highest-spending developer had the second-lowest cost per merged PR and nearly 2x the throughput of the next developer.
  • Vantage on retry mechanics: each retry is a full round-trip at inflated context size, so three failed attempts at turn 40 cost three times a turn that is already carrying 30,000+ input tokens.
  • Cline’s model-selection guidance: measure cost per task, not cost per prompt; “a cheaper prompt is not useful if it takes more retries to finish the work.”
  • The wheel-spinning anecdote: an agent spent 47 iterations retrying variations of the same ALTER TABLE statement, “a $30 learning experience about a $0.50 problem.”
  • Grizzly Peak Software: “the cheapest model per token is often the most expensive model per result.” One team’s cheaper-on-paper model cost 3x the budget on retries when it failed 40% of the time.
  • Unblocked on routing: the “almost right” trap. Cheap models produce output that compiles but misses edge cases, triggering retries that re-pay input tokens and together cost more than a single frontier call.
  • The attention ceiling (Luke Alvoeiro via BigGo): an experienced engineer can supervise roughly 2-3 parallel agents before quality slips. The scarce resource is review capacity, which makes a model that needs fewer interventions cheaper regardless of token price.
  • The subscription inversion (Boris Cherny): “Since Opus consumes rate limits faster than Sonnet, you’ll hit limits more quickly on the Pro plan.” Quota burn is price-weighted, so subscription users experience the opposite of the API math.
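
The retry argument running through these reports reduces to two small calculations. A minimal sketch under loud assumptions: attempts are treated as independent with a constant success rate, which real agent retries are not, and all prices, success rates, and function names here are hypothetical.

```python
def expected_cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend until the first success, assuming independent attempts
    with a constant success rate (a geometric distribution: 1/p attempts)."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate

def retry_input_cost(context_tokens: int, retries: int, input_price_per_mtok: float) -> float:
    """Input-side cost of re-sending the same context on each retry."""
    return retries * context_tokens * input_price_per_mtok / 1_000_000

# Hypothetical: a cheap model that fails 40% of the time vs a pricier one that rarely fails.
cheap = expected_cost_per_success(cost_per_attempt=1.00, success_rate=0.60)
strong = expected_cost_per_success(cost_per_attempt=1.50, success_rate=0.95)
print(f"cheap: ${cheap:.2f} per success, strong: ${strong:.2f} per success")

# Three retries at a turn carrying 30,000 input tokens, at an assumed $5 per million input tokens.
print(f"retry input cost: ${retry_input_cost(30_000, 3, 5.00):.2f}")
```

In this toy setup the model that costs 50% more per attempt is still cheaper per success. The sketch leaves out context growth between retries and the human time spent reviewing failures, both of which the reports above say push further in the same direction.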

Where cheap tokens still win

  • RouteLLM (ICLR 2025): a trained router cut costs 85% on MT-Bench while keeping 95% of GPT-4 quality; with augmented data, 95% quality with only 14% of calls going to the strong model. If a router can shed more than four-fifths of frontier calls and still keep 95% of the quality, frontier-for-everything was overspend.
  • OpenRouter aggregate data, 2026: the fastest-growing models are all free or under $1/M tokens; Chinese sub-dollar models exceed 45% of token volume; Anthropic holds about 12.3% of token share but an outsized dollar share. Token volume reveals a preference for cheap models; vendor revenue rests on the premium tier.
  • High-volume production routing (fine-tuning case study): a support-ticket pipeline sent 68% of traffic to a fine-tuned 7B model at $0.0004/call and only 6% to the frontier model.
  • For classification, extraction, and moderation at scale, sub-dollar models are the structural winner: “Budget models handle 70-80% of production AI workloads… none of these tasks need a $5/M output model” (TokenMix, 2026). The cost-per-task thesis fails entirely here.
  • Reasoning models are anti-economical on simple queries: “For simple questions like ‘What is the capital of France?’, DeepSeek-R1 might generate hundreds of internal ‘thought’ tokens to verify the answer before outputting ‘Paris,’ and you pay for those thought tokens” (PassHulk, 2026).
  • Anthropic itself capped the expensive tier (TechCrunch, Jul 2025): Opus-specific weekly limits after users burned “tens of thousands in model usage on a $200 plan.”
  • The vendor’s numbers are thinner than the headline (Implicator.ai): the 76%/48% claims cover output tokens only. At the high-effort figure the per-task saving nearly vanishes (0.52 x 1.67 is about 0.87x Sonnet’s output cost), and agentic bills are mostly input tokens, which the claims do not cover and which bill at 1.67x regardless.
  • Reported Claude Code telemetry, May 2026 (knightli.com): developers preferred Sonnet 4.6 over flagship Opus 59% of the time, citing better instruction following and less overengineering. Single source; unconfirmed against any primary data.
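
The Implicator point, that an output-only saving shrinks once input tokens are counted, can be made concrete. A minimal sketch: `relative_task_cost` is an illustrative name, the 80% input share is a hypothetical figure rather than measured data, and the model assumes input and output carry the same price ratio, as the 1.67x Opus-to-Sonnet ratio above does.

```python
def relative_task_cost(price_ratio: float, output_token_ratio: float,
                       input_share: float, input_token_ratio: float = 1.0) -> float:
    """Cost of the pricier model relative to the cheaper one on the same task.

    input_share is the fraction of the cheaper model's bill that comes from
    input tokens; the rest is output. Both token classes are assumed to carry
    the same price ratio.
    """
    if not 0 <= input_share <= 1:
        raise ValueError("input_share must be between 0 and 1")
    input_part = input_share * input_token_ratio
    output_part = (1 - input_share) * output_token_ratio
    return price_ratio * (input_part + output_part)

# Output tokens only, the case the launch claims cover: 48% fewer tokens at 1.67x the price.
output_only = relative_task_cost(1.67, 0.52, input_share=0.0)
print(f"output only: {output_only:.2f}x")

# Hypothetical agentic bill where 80% of the cheaper model's spend is input tokens.
input_heavy = relative_task_cost(1.67, 0.52, input_share=0.8)
print(f"input-heavy: {input_heavy:.2f}x")
```

The first case reproduces the roughly 0.87x figure quoted above. In the second the pricier model costs more per task despite the output saving, unless it also uses fewer input tokens, which `input_token_ratio` lets a reader test.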

Measurement traps

  • The Opus 4.7 tokenizer episode, April 2026, the cleanest proof that the rate card is not the bill: Anthropic’s pricing page disclosed the new tokenizer “may use up to 35% more tokens for the same fixed text.” OpenRouter analyzed 1M+ real requests: 32-45% more native tokens, actual costs up 12-27%, at an unchanged $5/$25 price.
  • Bill Chambers’ community measurement tool, named, fittingly, Tokenomics (tokens.billchambers.me), collected 483 request submissions showing about 37.4% more cost per request on Opus 4.7 vs 4.6; the HN thread ran to 453 points. Simon Willison measured about 1.46x the tokens for the same text and up to 3x for one high-res image: “it’s priced the same as Opus 4.6 on a per-token basis so this is actually a pretty big price bump.”
  • Verbosity comparisons are partly eval artifacts: Artificial Analysis’s methodology caps non-reasoning models at 16,384 output tokens while letting reasoning models run to their maximum, inflating the apparent gap by construction. OckBench (arXiv 2511.05722): standard evaluations “report only final accuracy, obscuring where tokens are spent or wasted.”
  • Effort settings can dominate model choice, in both directions: on hierarchical legal reasoning, raising GPT-5.x effort from medium to high dropped performance from 15.34 to 12.63 (arXiv 2510.08710); a single xhigh call on a long prompt can run 20K reasoning tokens (effort guide).
  • The harness confound: on SWE-bench Pro, four agents all running Claude Opus 4.5 diverged by up to 17 solved problems out of 731 (analysis). Per-task cost comparisons that hold the model constant still vary by scaffold.
