Tokenomics: what do thousands of dollars of tokens buy?

[Figure: a staircase from $10 to $100,000 a month. Each rung names what the spend buys: an assistant, an analyst, a team in parallel, self-running loops, a dark factory.]
The spend ladder. The rungs do not differ by quantity alone.

Last night I asked for the research for this post. Then I answered a few multiple-choice questions while agents worked through my browser and the open web. They read the Anthropic engineering corpus, swept X and Reddit, pulled the OpenRouter rankings live, and wrote research files with a few hundred cited findings. I still read all of it and cut what did not hold up. But a sweep that used to be a week of analyst time took one sitting and a few dollars of tokens.

Token bills are suddenly everywhere and nobody quite knows how to read them. Uber put 5,000 engineers on Claude Code, blew its entire 2026 AI budget in four months, and set a $1,500 monthly cap. Peter Steinberger posted a $1.3 million monthly bill for the hundred-odd agents behind OpenClaw. He itemized exactly what they do; the replies still asked what the million bought. He works at OpenAI now, and his employer covers the bill. My own bill runs into the thousands, mostly from building things at work: a lot running in parallel and a growing number of self-running loops, working toward the dark factory over time. Everyone is staring at the same line item and trying to figure out what it is. If the line item is yours, the end of this piece is written for you.

The meter

Part of the confusion is mechanical. Personal plans hide the meter: a $100 or $200 Max subscription is flat and capped, and quietly absorbs what would be thousands of dollars at API rates. SemiAnalysis measured the ceiling: run a $200 plan to its weekly limits and it yields about $8,000 a month of tokens at API prices. Subscriptions are buffets, priced on most people not eating much.

It also helps to know what the meter counts. Five facts make any token bill legible. You pay per million tokens, and a model’s output costs about five times its input. The meter reads documents by length: a 300-page lease costs thirty times as much to take in as a 10-page amendment. Agents re-send their history with every step, so long sessions compound: the tenth step pays for the first nine again. Cached context bills at about a tenth of fresh context, which is why two identical-looking workloads can differ ten to one. Stepping down a model tier cuts the price of the same tokens tenfold.

A subscription folds all five numbers into one flat fee, until you cross into metered enterprise billing and they reappear, itemized, as your bill. Enterprise plans run the other way, seats plus consumption, every token a line item. Individuals never see what they actually use. Companies see all of it at once. So the bragging about token spend comes from subscribers, and the alarm comes from CFOs, and they are describing the same usage.
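The five facts above fit in a few lines of arithmetic. What follows is a minimal sketch, not any provider's billing code: the function name, the parameter names, and the $3 input price are assumptions made up for illustration, while the five-times output multiple and the one-tenth cache rate are the rough figures quoted above. It assumes every step adds the same number of tokens and that re-sent history is either fully cached or not cached at all.

```python
def session_cost(steps, new_input_tokens, output_tokens,
                 input_price_per_m=3.0, output_multiple=5.0,
                 history_price_factor=1.0):
    """Rough dollar cost of an agent session that re-sends its history each step.

    All names and the default price are hypothetical. history_price_factor is
    1.0 for uncached history and about 0.1 if the history bills at cached rates.
    """
    output_price_per_m = input_price_per_m * output_multiple
    history_tokens = 0  # everything said so far, re-sent as input on the next step
    total = 0.0
    for _ in range(steps):
        # Re-sent history: the part of the bill that compounds with session length.
        total += history_tokens * input_price_per_m * history_price_factor / 1e6
        # Fresh input for this step (a tool result, a document, a new instruction).
        total += new_input_tokens * input_price_per_m / 1e6
        # Output is billed at the higher output rate.
        total += output_tokens * output_price_per_m / 1e6
        history_tokens += new_input_tokens + output_tokens
    return total


if __name__ == "__main__":
    uncached = session_cost(steps=50, new_input_tokens=4_000, output_tokens=1_000)
    cached = session_cost(steps=50, new_input_tokens=4_000, output_tokens=1_000,
                          history_price_factor=0.1)
    print(f"uncached: ${uncached:.2f}  cached: ${cached:.2f}")
```

The history term grows with the square of the step count, so doubling the length of a session roughly quadruples that part of the bill. That is the compounding, and it is also why the cache rate matters more the longer a loop runs.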

The ladder

The other part is what the bills actually represent. They sit on a ladder, one rung per order of magnitude: $10, $100, $1,000, $10,000, $100,000 a month. The rungs do not differ by quantity alone. Every rung is a different machine.

At $10 you own a chat window, an assistant. Occasional chats, a summary, some data processing. It answers; you do the work, and what comes back is a draft or a summary in minutes, not hours. Anthropic prices the median conversation at about $54 of professional labor; a lesson plan that took a teacher four and a half hours takes eleven minutes. It’s a small machine, and the work is real.

At $100 you own a daily driver, an analyst at the next desk. The app is open all day and you work through it, one task at a time, watching every step. What comes back is finished work: a research brief that used to take a week, in a day. Not much runs in parallel and nothing runs without you. Developers live here, and so do support reps, analysts, and writers, and the longer people stay, the harder the work they hand over: the education a user’s tasks require rises about a year for every year on the platform. Most people settle here and think they’ve arrived. There’s a rung above it.

At $1,000 you run a team in parallel: agents working autonomously on their own tracks while you move between them. Product development, strategy, analytics, and writing, each running at once. It did not replace anyone. It changed what we would have had to hire for: work that used to wait on a req starts the same week, and prototypes that would have taken a quarter take days.

The bill stops reading like a tool and starts reading like capacity.

This rung is also Uber’s story. They put Claude Code in front of five thousand engineers and did not expect how many would end up here: usage doubled in two months, most committed code is now AI-assisted, and the bill outran every projection. But the harder problem was not the bill. A business can only digest so much new work. More code per engineer does not automatically become more things customers can touch; review, release, and the organization itself become the bottleneck. When tokens produce more than the company can absorb, the spend stops mapping to value, and the cap follows.

At $10,000 you have self-running loops, always-on and owned, increasingly doing the prompting that used to be yours: a fleet that reviews every pull request, security scans on every commit, reconciliations that run nightly, benchmarks watched and regressions filed while the team sleeps. Outside engineering the shape is the same: email triage, invoice processing, outreach that drafts itself, lease abstracts that took days arriving in seconds, clause values pulled from a contract in one query. On Anthropic’s API, a handful of workflows like these consume nearly all the volume. At this rung governance is part of the machine: every loop has an owner, a budget, and a gate it cannot pass without proof.

And at $100,000 and up the machine never stops. This is the dark factory. The term comes from manufacturing: a plant so fully automated it runs with the lights off and nobody on the floor. In software it means code built around the clock, specs in and releases out, with no one at the keyboard. Hardly anyone lives there yet. The closest thing we have is OpenClaw: in the first week of June 2026, the repo took about 1,200 pull requests and cut a release nearly every day. Read the changelogs and half of what ships is verification machinery, gates and proof and releases that fail closed. The factory runs because the gates were built first. Steinberger has since published the loop itself. An orchestrator wakes every five minutes, directs work to threads, and reduces his job to four moves: land the prepared PR, close it, provide one access step, or choose between documented alternatives. His bill runs to $1.3 million because the surface is enormous, the fastest-growing repo in GitHub history plus the ecosystem around it. Point the same loop at one product and it is a $10,000 machine.

Spend vs. investment

The honest way to read the upper rungs is this: past a certain point you are not spending, you are investing. The chat window and the daily driver buy work you can price the same day. Loops and factories behave like capital: tokens spent building machines, and machines to check the machines, so that future work gets cheap. Some of it compounds. Some of it never pays back. And note what it is not: capex. All of this spend is opex that scales with usage. The investment is real, but it never sits on the balance sheet, which is part of why boards have trouble seeing it.

For me the honest accounting is speed and reach. The prototypes that shipped this quarter instead of next year. The analysis that happened at all. Hiring is slow and execution drags behind it; tokens turn on this afternoon. And the same test that catches Uber catches me: speed only counts when the work lands. Capacity you cannot digest is just a bigger bill.

The waste

The waste is real too. At any rung you can spend $500 and ship nothing, and it is rarely the model’s fault. It is loop design: tool output re-read on every turn, whole histories re-sent each step until old context is most of the bill, loops with no tests or gates accepting their own first answer at full price. Waste is a machine problem, and machine problems have fixes. If you run a platform, the audit is three questions per loop: what does it re-read every turn, what gate stops a bad answer, and what does one finished outcome cost.

What a token buys now

Meanwhile the floor of the ladder keeps dropping. As of early June 2026, roughly half the tokens on OpenRouter go to Chinese open-weight models priced 5 to 50 times below the frontier, and a decent coding model runs on a Mac. Commodity intelligence is nearly free. That sharpens the ladder rather than flattening it. The thousands buy what the cheap tokens can’t: frontier judgment, reliability across long unattended loops, and trust.

Routing is the same lever inside a single bill. The expensive model is for judgment; the cheap ones are for volume. People already price this by instinct: every extra $10 an hour a task is worth adds 1.5 points to the share run on the most expensive model. The research for this post ran that way, a frontier model directing fleets of cheaper ones. Most bills shrink the day someone checks which model is doing the grunt work.
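As a back-of-envelope check on that last claim, here is a sketch that assumes the cheaper tier needs the same number of tokens to do the grunt work and costs one tenth as much per token (the rough tier step quoted earlier in this piece). The function name and the example figures are hypothetical.

```python
def rerouted_bill(monthly_bill, share_moved, cheap_price_ratio=0.1):
    """Bill after moving a share of token spend to a cheaper model tier.

    Assumes (hypothetically) that the cheaper model uses the same token count
    for the moved work and that its per-token price is cheap_price_ratio of
    the expensive model's.
    """
    if not 0.0 <= share_moved <= 1.0:
        raise ValueError("share_moved must be between 0 and 1")
    kept = monthly_bill * (1.0 - share_moved)  # stays on the expensive model
    moved = monthly_bill * share_moved * cheap_price_ratio  # same work, cheaper tokens
    return kept + moved


if __name__ == "__main__":
    # Hypothetical example: a $10,000 bill with 60% of spend on routine work.
    print(rerouted_bill(10_000, 0.6))
```

The sketch leaves out the case where the cheap model takes more turns to finish, which would eat into the saving; that is a reason to measure per finished task rather than per token.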

The tier rule itself is getting more complicated, because the newest models changed what a token buys. Since late 2025 every frontier launch has led with the same claim: fewer tokens for the same work. Opus 4.5 matched the mid-tier’s best benchmark score using 76% fewer output tokens. GPT-5.5 doubled its price per token and OpenAI argued the real increase was closer to 20%, because it finishes the same tasks on 40% fewer tokens. The unit of account is shifting from the token to the task. Cursor now publishes a live scoreboard that prices every model per finished task, in dollars, tokens, and steps, because the public benchmarks stopped telling models apart. Read one row of it and the iteration argument becomes arithmetic: the previous frontier flagship at full effort took 96 steps and $11 to score 64.8% on their suite; the newest model at low effort scored 64.2% in 36 steps for half the money. And the same board shows the limit: a small fine-tuned model sits within two points of last month’s flagship at one-fourteenth the cost. Smarter models answer better and take fewer turns to get there, and the turns are the bill.
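The GPT-5.5 figures are a one-line calculation. Cost per task is price per token times tokens per task, so a price multiplier and a token reduction combine by multiplication. A sketch, with a hypothetical function name:

```python
def cost_per_task_change(price_multiplier, token_reduction):
    """Fractional change in cost per finished task.

    price_multiplier: new price per token divided by the old one (2.0 = doubled).
    token_reduction: fraction of tokens saved on the same task (0.40 = 40% fewer).
    """
    return price_multiplier * (1.0 - token_reduction) - 1.0


if __name__ == "__main__":
    # Doubled price, 40% fewer tokens, as in the figures quoted above.
    change = cost_per_task_change(2.0, 0.40)
    print(f"cost per task changes by {change:+.0%}")  # prints "+20%"
```

The same function covers the other direction: an unchanged price with 76% fewer output tokens is a 76% cut in output cost per task.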

Read your own bill

So read your own bill. At $10 you bought a chat window. At $100, a daily driver. If you want the parallel rung, the move is not to spend more on the same machine. It is to change how you work: more tracks, clearer specs, gates you trust.

And if you own a budget, the cap is the understandable first move. The better move is to ask two questions: what did each token buy, and how much of it actually landed? You do not need a measurement program to start. Pick one task your teams already track: a code review, a support ticket, a research brief. Put the token cost next to what it cost before, and count whether the output landed. That is the whole value function: things you already measure, with a new cost column.

A four-column ledger:

Task | Cost before | Tokens | Landed
Research sweep for one post | a week of analyst time | a few dollars | yes
Security audit | a pentest at $50k and a week | ~$500 and an hour | yes
Support ticket | a human rep at $4 to $8 | cents a ticket | the easy ones

The last column is the one that matters.
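The ledger can live in a spreadsheet, but as a sketch of the arithmetic, here it is in a few lines. The class, field, and function names are hypothetical, and the rows in the example are placeholders: the security audit uses the rough figures from the ledger above, and the second row is invented to show a task that did not land.

```python
from dataclasses import dataclass


@dataclass
class LedgerRow:
    task: str
    cost_before: float  # dollars the task cost before agents did it
    token_cost: float   # dollars of tokens spent on it
    landed: bool        # did the output actually ship or get used?


def summarize(rows):
    """Total spend, how much of it landed, and what one landed outcome cost."""
    spend = sum(r.token_cost for r in rows)
    landed = [r for r in rows if r.landed]
    landed_spend = sum(r.token_cost for r in landed)
    return {
        "token_spend": spend,
        "landed_share_of_spend": landed_spend / spend if spend else 0.0,
        # Savings count only where the work landed.
        "saved_on_landed": sum(r.cost_before - r.token_cost for r in landed),
        # Wasted spend is charged to the outcomes that did land.
        "cost_per_landed_task": spend / len(landed) if landed else None,
    }


if __name__ == "__main__":
    rows = [
        LedgerRow("security audit", cost_before=50_000, token_cost=500, landed=True),
        # Invented row: a prototype that was built and then abandoned.
        LedgerRow("abandoned prototype", cost_before=0, token_cost=500, landed=False),
    ]
    print(summarize(rows))
```

Dividing total spend by landed outcomes, rather than by attempts, is a deliberate choice here: it makes waste show up in the number a budget owner actually looks at.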

Two habits make it stick. Show every team its own meter before you charge them for it; people who can see their usage manage it. And forecast in outcomes, not seats; the bill follows the work, so a seat-based budget will always be surprised. A cap makes the number smaller. A value function makes it make sense.