A few weeks ago I ran the numbers on the token cost panic. I took the scariest figure in legal AI, the finding that agentic workflows burn a thousand times more tokens than a chat query, and followed it all the way down to a dollar amount on a real deal. The panic did not survive the arithmetic. The piece is here if you want the full walk-through.
This is not that piece. The panic has moved on since I wrote it, and the new versions are smarter than the old one. The thousand-times number has quietly retired, because a thousand times almost nothing is still almost nothing. In its place are three fresher anxieties, and they deserve a real answer. The first says the model makers have a monopoly now, the price of a token is climbing, and it will climb forever, so you had better lock in a flat rate or build your own models before it does. The second says forget the price of a token, watch the meter: every time the AI reads your contract it ticks, and a long agentic session reads your contract over and over and over. The third does not bother with an argument at all. It just points at a number. One company spent five hundred million dollars on AI in a single month, and the number is so large it does the panicking for you.
All three are wrong. They are wrong in more interesting ways than the original, which is the only reason I am writing this down instead of linking to the first piece again. But underneath the new costumes it is the same body. Every version of this panic makes the same mistake and reaches the same conclusion. So let us stop swatting the individual numbers and name the thing that keeps generating them.
The Mistake Underneath All of It
Here is the error, stated once, because everything below is a variation on it.
A token is the unit a model uses to bill you. It is not the unit your work is measured in, it is not the unit your client pays for, and it is not the unit anything you care about is denominated in. It is a meter reading. The entire genre of token panic consists of staring at the meter reading as though it were the fare, the destination, and the quality of the ride all at once.
It is not any of those things. It is the meter. And a meter, by itself, tells you nothing about whether you are getting a good deal. A taxi meter reading of forty dollars is a bargain to the airport and a robbery around the block. The number on the meter is the least informative number in the entire transaction, because it means nothing until you put it next to what the ride was worth. Every piece in this genre forgets that, and forgets it in a slightly different way. Let me take them in turn.
“Prices Only Go Up”
Start with the monopoly story, because it has a real fact inside it. Yes, the newest frontier model costs more per token than last year’s newest model. That part is true. What the story does with it is the problem.
It draws a line through two dots and calls it a trend. Frontier prices up, therefore prices up forever, therefore lock in a flat rate before the meter eats you. But you are watching the wrong number. The price of a frontier token is not your cost. Your cost is what it takes to finish a task, and the cost of finishing a given task has been in freefall for two straight years. The same capability that ran on the most expensive model available in 2022 runs today on something on the order of two hundred and eighty times cheaper. Last year’s frontier is this year’s mid-tier is next year’s free default. The token at the very tip of the frontier gets a little pricier each release; everything behind the tip collapses in price. Gartner expects another ninety percent drop in inference cost by 2030.
Watching the frontier price and concluding that AI is getting more expensive is reading the thermometer and announcing a fever, while ignoring that you are holding the thermometer over a candle. The evidence that the baseline is getting cheaper often sits right there in the same articles raising the alarm, quoted from the experts and then left unaddressed. You do not build a cost strategy on the one number in the system that is engineered to always be the highest.
“Watch the Meter Tick”
The second version is more seductive, because it comes with a picture. There is a meter. It is running. It is not visible and nobody is watching it. Every question you ask, every document you paste, every time the model reads back over the contract, the meter advances, and an agentic session is one long ride with the meter buried somewhere you cannot see it. Be afraid of the meter.
It is a good picture. It is also describing a machine that was rebuilt about a year ago.
Here is the mechanism the picture leaves out. When an AI reads a long document, that document is loaded into its context once, at full price. Every subsequent time the model reads back over it, that is a cached read, and every major model provider bills it at a discount, anywhere from fifty to ninety percent off, depending on the provider. You pay full freight to put the contract in the room once. After that, each re-read costs a fraction of that first pass.
So the entire horror story, the one image every version of this panic is built on, the machine reading your document over and over while the meter spins, describes a problem the providers fixed before most of these pieces were written. The agent that reads your contract fifty times is not paying fifty times to read it. It pays full price once and a steeply discounted rate on the other forty-nine. The panic prices every one of those reads at full freight; the actual bill runs from about half of that down to little more than a tenth, depending on the provider. The meter is real. It barely moves on the part everyone is pointing at.
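The arithmetic is short enough to write down. What follows is a minimal sketch, not any provider's billing formula: the document size and the per-token price are hypothetical placeholders, and it ignores output tokens, cache expiry, and any surcharge a provider may add for writing to the cache. Only the fifty reads and the fifty-to-ninety percent discount range come from the text above.

```python
def session_cost(doc_tokens, reads, price_per_mtok, cache_discount):
    """Input cost of one full-price read followed by cached re-reads."""
    full_read = doc_tokens / 1_000_000 * price_per_mtok
    cached_read = full_read * (1 - cache_discount)
    return full_read + (reads - 1) * cached_read


DOC_TOKENS = 100_000    # hypothetical contract size, in tokens
PRICE_PER_MTOK = 3.00   # hypothetical dollars per million input tokens
READS = 50              # the fifty reads from the text

# What the panic assumes: every read billed at full price.
naive = READS * DOC_TOKENS / 1_000_000 * PRICE_PER_MTOK

# What caching does at the two ends of the quoted discount range.
low_discount = session_cost(DOC_TOKENS, READS, PRICE_PER_MTOK, 0.50)
high_discount = session_cost(DOC_TOKENS, READS, PRICE_PER_MTOK, 0.90)

print(f"all reads at full price: ${naive:.2f}")
print(f"cached at 50% off:       ${low_discount:.2f}")
print(f"cached at 90% off:       ${high_discount:.2f}")
```

Change the placeholders as you like. The ratio between the first line and the other two depends only on the number of reads and the discount, not on the price or the size of the document.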
I want to be precise here, because imprecision is how someone discredits a whole argument over a footnote. Cached reads are discounted, not free, and a long enough session still adds up. The cache expires after a few minutes to an hour, so the discount lives inside a working session rather than forever. None of that rescues the panic. Agentic work is high-frequency, same-session, re-reading-the-same-context work. That is the precise workload the caching discount was built for.
“But Look At This Number”
And then there is the half-billion-dollar bill.
The story made every outlet, because it is built to. One company, unnamed, spent five hundred million dollars on Claude in a single month. It is worth knowing where that number comes from: a single AI consultant, quoted in a single report, describing a client they do not name. No company has confirmed it and no one has verified it. That did not slow it down for a second. The number is enormous and the number is unverified, and it is being passed around as proof that AI costs have slipped the leash and big companies are getting torched.
So let us take it at face value anyway, because even granting every word of it, the story argues the opposite of what it is being used to prove. Read the second sentence and it falls apart in your hands. The company spent five hundred million dollars because it put no usage limit on the licenses. Thousands of employees had unlimited, uncapped access, and for a month nobody looked at the meter. That is the story. That is the whole story.
This is not a company that got beaten by the cost of AI. This is a company that took its foot off the brake, tied the steering wheel down, climbed into the back seat, and then expressed surprise at where the car ended up. The number is not evidence that AI cost is uncontrollable. It is evidence of precisely the opposite, because the controls exist and this company chose not to use a single one of them.
The controls are not theoretical. Claude Enterprise ships four levels of spend control: an organization-wide monthly cap, group caps, caps by seat tier, and individual per-user caps. They are hierarchical, so a user cannot exceed their own cap, their group’s cap, or the organization’s, whichever is lowest. When someone hits the limit, they are blocked. That is finer-grained cost governance than most firms have ever had over their e-discovery spend, and every serious enterprise AI platform ships some version of it. It is a shipping product, not a roadmap promise.
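The "whichever is lowest" rule is simple to express. This is a sketch of the logic as described above, not Anthropic's implementation or API: the `Budget` class, the function names, and every dollar figure are hypothetical, invented only to show how stacked caps combine.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    cap: float    # monthly limit in dollars (hypothetical figures below)
    spent: float  # month-to-date spend in dollars

    def headroom(self) -> float:
        return max(self.cap - self.spent, 0.0)


def allowed_spend(requested, levels):
    """Dollars a request may use: bounded by the tightest level."""
    return min([requested, *(b.headroom() for b in levels)])


def is_blocked(levels):
    """A user is blocked as soon as any level above them is exhausted."""
    return any(b.headroom() == 0 for b in levels)


user = Budget(cap=500, spent=120)
seat_tier = Budget(cap=1_000, spent=120)
group = Budget(cap=5_000, spent=4_900)
org = Budget(cap=50_000, spent=20_000)
levels = [user, seat_tier, group, org]

print(allowed_spend(250, levels))  # 100: the group cap is the tightest
print(is_blocked(levels))          # False

group.spent = 5_000                # the group runs out
print(is_blocked(levels))          # True
```

The user in this example still has room under their own cap. They are stopped anyway, because the group above them is nearly out. That is what hierarchical means.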
A company spent half a billion dollars by switching all of it off.
A company that gives thousands of people uncapped access to any metered resource and looks away for thirty days does not have a token problem. It has a management problem wearing a token costume.
And there is a second reason the number should not frighten you, specific to the kind of work most professionals are actually doing. That bill came from unattended automation: thousands of processes looping on their own, re-reading and retrying around the clock, with no person waiting on any single step. That is the only kind of AI spending that can run away while you sleep, because it is the only kind with no human in the loop to stop it. Attended work, a person at a desk directing the tool and reading the results, is capped by something the runaway scenario removed: human attention. Even a whole firm only has so many working hours in a month. I have tried to reach half a billion dollars a month with nothing but attended sessions, stacking every worst case, the largest firms, the most expensive model, every lawyer running flat out all day with the caps off, and the ceiling lands in the low single-digit millions, roughly two orders of magnitude short, and even that requires assumptions no real firm would survive.
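The shape of that ceiling is people times hours times spend per hour, and nothing else. Here is a sketch with placeholder inputs. The headcount, the hours, and the dollars per attended hour are all hypothetical, picked to be generous rather than measured, so treat the output as an illustration of the structure, not as the estimate itself.

```python
def attended_ceiling(people, hours_per_month, dollars_per_hour):
    """Upper bound on monthly spend when every session has a person in it."""
    return people * hours_per_month * dollars_per_hour


LAWYERS = 4_000           # hypothetical: a very large firm
HOURS_PER_MONTH = 250     # hypothetical: everyone flat out
DOLLARS_PER_HOUR = 3      # hypothetical token spend per attended hour
HEADLINE_BILL = 500_000_000

ceiling = attended_ceiling(LAWYERS, HOURS_PER_MONTH, DOLLARS_PER_HOUR)
shortfall = HEADLINE_BILL / ceiling

# Spend per person-hour that would be needed to reach the headline figure.
required_rate = HEADLINE_BILL / (LAWYERS * HOURS_PER_MONTH)

print(f"attended ceiling:  ${ceiling:,}")
print(f"headline bill is {shortfall:.0f}x larger")
print(f"rate needed to match it: ${required_rate:,.0f} per person-hour")
```

Double every input if you think I am being stingy. The product moves, but person-hours are finite, and the rate needed to close the gap stays far outside anything a human at a keyboard can generate.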
The only way to clear that ceiling is to take the human out of the loop. Here someone will object that firms will simply start running headless processes of their own, looping unattended like the company that ran up the bill. They will not, because there is nothing for such a process to do. The runaway scenario needs a task with no natural end, something that can spin on itself for a thousand hours and always find more to spend on. Legal work is not shaped like that. A contract gets reviewed and marked up and it is finished; there is no ten-thousandth pass, because there is no ten-thousandth version. The deliverable has a floor, and the floor is the cap, whether or not anyone is watching.
And if none of that reassures you, there is a backstop that costs nothing to set. Cap each person at some deliberately absurd number, ten thousand dollars of usage a month, and forget about it. No one doing supervised legal work will ever come close, so it never touches real use. But it is not really a spending limit. It is a smoke alarm. The day someone actually hits ten thousand dollars in a month, the cap has told you something has gone wrong, a broken process, a headless loop someone set running, a mistake worth finding, long before it becomes a number worth fearing. The company in the story did not lack a way to prevent the bill. It declined to use one.
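In code, the smoke alarm is a few lines. This sketch assumes you can export month-to-date spend per user from whatever platform you use. The usage dictionary and the user names are made up for illustration; only the ten-thousand-dollar threshold comes from the paragraph above.

```python
ALARM_THRESHOLD = 10_000  # dollars per user per month, from the text


def tripped_alarms(monthly_usage, threshold=ALARM_THRESHOLD):
    """Return the users whose month-to-date spend has reached the threshold."""
    return sorted(
        user for user, spend in monthly_usage.items() if spend >= threshold
    )


# Hypothetical export: two people doing attended work, one runaway process.
usage = {
    "associate_a": 212.40,
    "partner_b": 88.10,
    "batch_job_c": 14_950.00,
}

print(tripped_alarms(usage))  # ['batch_job_c']
```

An empty list is the normal result. Anything else is a name to go look at, which is all a smoke alarm is supposed to give you.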
The most-cited number in the entire panic turns out to be the best argument against it.
The One True Thing
Now let me give the panic its due, because there is a real fact in here and I am not going to pretend otherwise.
Your AI bill probably is going up. Not the per-token price, the bill. Even as the price of intelligence collapses, total enterprise spend on it has risen sharply, because the work has moved from a single chat answer to an agent running an entire multi-step task, and that consumes vastly more tokens. That is real. That is the genuine signal buried under all the noise, and the people watching their invoices climb are not imagining it.
But look at what the climbing number represents. The bill went up because the machine stopped answering a question and started doing the job. Two years ago the meter measured a chatbot composing a paragraph. Today it measures an agent reading the deal room, drafting the issues list, checking it against the precedent, and revising its own work. Of course the meter is higher. The machine is doing thirty times the work, work that used to belong to a person and a timesheet.
So the question was never “why is the meter higher.” The question is the one this entire genre is constructed to avoid: what is the meter now doing that it could not do before, and what did that work cost you the last time a human did it? Put the number as high as you like. Four hundred dollars a deal, four thousand, forty thousand: the question does not change, and neither does the answer, as long as the work it replaced cost you more. A bill that tripled while absorbing the work of a first-year associate is not a cost problem. It is the best trade your firm made all year. You only get to be horrified by the number if you refuse, the entire time, to look at the other side of the ledger.
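The other side of the ledger is one subtraction. In this sketch the three AI figures are the ones named above. The cost of the human work is a hypothetical placeholder, there only to show the comparison; substitute whatever the same work actually cost you the last time a person did it.

```python
def net_per_deal(ai_cost, human_cost):
    """Positive means the AI bill was cheaper than the work it replaced."""
    return human_cost - ai_cost


HUMAN_COST_PER_DEAL = 50_000  # hypothetical cost of the replaced work

for ai_cost in (400, 4_000, 40_000):
    net = net_per_deal(ai_cost, HUMAN_COST_PER_DEAL)
    print(f"AI bill ${ai_cost:>6,}: net ${net:,} per deal")
```

The sign of the result is the whole question. The size of the AI bill on its own never tells you what it is.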
What Actually Deserves Your Attention
There is a version of cost discipline that is not panic, and it is worth naming so it does not get lost in the noise.
Use the cheapest model that does the job. Route the easy work to the small model and reserve the frontier for the tasks that need it. Do not paste the entire deal room in to summarize one clause. Cap your users. Watch your meter, not because the meter is the enemy, but because watching the meter is just management, and a firm that cannot see its AI spend by matter and by practice group should go build that visibility before it signs anything. None of this is glamorous. None of it sells a product. Nobody is going to write a breathless thought piece urging you to right-size your model selection, because there is no panic in it and no vendor on the other end of it. It is just the unsexy discipline of knowing what a task is worth before you run it, which is the same discipline the profession has always claimed to have and rarely does.
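Right-sizing model selection is not complicated either. Here is a minimal sketch of the routing rule, assuming you can rate each task by the capability it needs. The model names, the tiers, and the prices are all hypothetical, and a real router would need some way of deciding the required tier, which this does not attempt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    tier: int              # capability tier; higher handles harder tasks
    price_per_mtok: float  # hypothetical dollars per million tokens


# Hypothetical catalogue; substitute whatever your provider offers.
MODELS = [
    Model("small", tier=1, price_per_mtok=0.25),
    Model("mid", tier=2, price_per_mtok=3.00),
    Model("frontier", tier=3, price_per_mtok=15.00),
]


def cheapest_capable(required_tier, models=MODELS):
    """Pick the lowest-priced model whose tier meets the task's requirement."""
    capable = [m for m in models if m.tier >= required_tier]
    if not capable:
        raise ValueError(f"no model available at tier {required_tier}")
    return min(capable, key=lambda m: m.price_per_mtok)


print(cheapest_capable(1).name)  # small: clause summaries, simple lookups
print(cheapest_capable(3).name)  # frontier: reserved for the hard tasks
```

The rule defaults to the cheap model and makes the expensive one something a task has to justify, which is the discipline in one function.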
That is the whole legitimate concern. It fits in a paragraph. Everything past it is theater.
Why It Keeps Coming Back
Here is the part I actually want you to take away, because it will outlast the next five versions of this.
The token panic recurs because it is the comfortable debate. It lets a roomful of smart people argue urgently about something that does not threaten anyone. It is easier to compare per-seat pricing against per-token pricing than to ask what happens to associate leverage when the work a first-year used to bill for is absorbed by a machine that costs four hundred dollars a deal. It is easier to fear a meter than to ask who, exactly, captures the value when AI makes a partner ten times more productive: the client, the firm, or the vendor. It is easier to publish a chart of rising token prices than to sit with the fact that the entire economic structure of the firm, the leverage pyramid, the billable hour, the margin built on associate hours, is the thing actually being repriced, and the tokens are a rounding error inside that story.
The panic is a place to hide. Every few weeks it comes back wearing a new number, because the number is never the point. The number is the thing people reach for so they do not have to look at the ledger underneath it.
So do the boring things, the ones that fit in a paragraph, and then put the meter down and go have the uncomfortable conversation. That is the one that decides which firms are still standing in five years.
The tokens were never going to.