William Smith
CONVERSATIONS WITH CODE

Token maxing: How Meta and Amazon measure AI productivity backward

Meta ranked 85,000 employees by tokens consumed and crowned "Token Legends." Employees gamed it by running agents overnight. It's Goodhart's Law with a $900M price tag.

There's a pattern in how organizations measure work that never really goes away — it just finds a new costume.

For a while it was physical presence, then it was email response time, then it was Slack availability, then it was story points.

Now it's tokens.

Meta's internal leaderboard, called "Claudeonomics," ranked 85,000 employees by how many AI tokens they consumed.

Top performers got titles like "Token Legend."

In one 30-day window, those employees burned through 60.2 trillion tokens — which at standard Anthropic pricing would have cost somewhere around $900 million.
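To make those numbers concrete, here is a quick back-of-the-envelope sketch. It uses only the figures quoted above; the per-employee lines assume the total was spread evenly across all 85,000 ranked employees, which the reporting doesn't confirm, so treat them as rough averages rather than facts.

```python
# Figures quoted in the article.
TOKENS_IN_WINDOW = 60.2e12   # 60.2 trillion tokens in one 30-day window
ESTIMATED_COST_USD = 900e6   # roughly $900 million
EMPLOYEES_RANKED = 85_000    # assumption: the total is spread across everyone ranked
WINDOW_DAYS = 30


def implied_price_per_million(tokens: float, cost_usd: float) -> float:
    """Blended dollars per million tokens implied by a total cost."""
    return cost_usd / tokens * 1_000_000


price = implied_price_per_million(TOKENS_IN_WINDOW, ESTIMATED_COST_USD)
tokens_per_employee_per_day = TOKENS_IN_WINDOW / EMPLOYEES_RANKED / WINDOW_DAYS
cost_per_employee = ESTIMATED_COST_USD / EMPLOYEES_RANKED

print(f"Implied blended price: ${price:.2f} per million tokens")
print(f"Average tokens per employee per day: {tokens_per_employee_per_day:,.0f}")
print(f"Average cost per employee for the window: ${cost_per_employee:,.0f}")
```

The implied blended rate works out to about $15 per million tokens, and the even-split average is on the order of tens of millions of tokens per person per day. That is hard to square with a human typing prompts, and much easier to square with agents left running.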

Jensen Huang publicly said he'd be "deeply alarmed" if a $500,000 engineer wasn't consuming at least $250,000 worth of tokens annually.

What employees actually did when the leaderboard went up

At Amazon, where management set a target for 80% of developers to use AI weekly, employees report gaming the target by leaving an internal tool called MeshClaw running overnight on low-value tasks (monitoring deployments, triaging email, consolidating notes), not because those tasks needed doing that way, but because a running agent meant tokens were accumulating. One engineer apparently used an AI to find ways to mock a project manager.

At Meta, employees started inflating prompts deliberately. Instead of asking for a concise answer, you ask for a verbose explanation with multiple alternatives, detailed reasoning, and rollback options. You dump entire Slack histories into a model for trivial analysis. You feed large, irrelevant documents through summarization tasks first thing in the morning just to rack up input tokens before the real work starts. None of this is irrational behavior given the incentives. All of it is completely useless.

The engineers doing this aren't the problem. The people who designed a system where this is the logical response are.

Goodhart's Law has been waiting for this moment

There's a principle — Goodhart's Law — that says when a measure becomes a target, it stops being a useful measure. It's been around since the 1970s and economists have been watching it play out in every domain imaginable. Whenever you pick a proxy for the thing you actually want and then reward the proxy directly, people optimize for the proxy and the underlying thing you cared about quietly stops mattering.

Token consumption is a near-perfect Goodhart trap. It sounds like it should correlate with useful AI work — if you're using the tools, you're presumably doing something with them. But tokens measure compute consumed, not value created.

They measure the size of the conversation, not whether the conversation produced anything worth having.
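A toy sketch of the trap, with entirely made-up names and numbers: three hypothetical engineers are ranked by tokens, one of them leaves an agent running overnight, and the token ranking flips while the outcome ranking doesn't move. The `Engineer` fields and the `inflate` helper are my own illustration, not anything from Meta's or Amazon's tooling.

```python
from dataclasses import dataclass


@dataclass
class Engineer:
    name: str
    tasks_completed: int  # the outcome we actually care about (hypothetical)
    tokens_used: int      # the proxy the leaderboard rewards


def rank_by(engineers, key):
    """Return names sorted from highest to lowest on the given attribute."""
    ordered = sorted(engineers, key=lambda e: getattr(e, key), reverse=True)
    return [e.name for e in ordered]


def inflate(engineer, overnight_tokens):
    """Model gaming: more tokens burned, no additional tasks completed."""
    return Engineer(
        engineer.name,
        engineer.tasks_completed,
        engineer.tokens_used + overnight_tokens,
    )


# Made-up numbers, for illustration only.
team = [
    Engineer("ana", tasks_completed=12, tokens_used=4_000_000),
    Engineer("ben", tasks_completed=9, tokens_used=3_000_000),
    Engineer("cy", tasks_completed=3, tokens_used=1_000_000),
]

before = rank_by(team, "tokens_used")
gamed_team = [team[0], team[1], inflate(team[2], overnight_tokens=10_000_000)]
after = rank_by(gamed_team, "tokens_used")
by_outcome = rank_by(gamed_team, "tasks_completed")

print("Token ranking before gaming:", before)
print("Token ranking after gaming: ", after)
print("Outcome ranking (unchanged):", by_outcome)
```

The person at the bottom of the outcome ranking jumps to the top of the token ranking without finishing one more task. Once the proxy is the target, the cheapest way to move it wins.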

A 2026 PwC survey found that 56% of CEOs reported no significant revenue increase or cost reduction from AI investments in the past year. That's not a coincidence. That's what happens when adoption theater gets mistaken for transformation.

The defense isn't crazy, it's just incomplete

Meta's CTO Andrew Bosworth defended the practice by saying that when high token spending results in 5-10x productivity gains, it's "easy money." And honestly, that's not a wrong position — it's just a position that assumes the token spending and the productivity gains are connected, which is exactly what the gaming behavior puts in doubt.
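A rough way to see where the "easy money" argument holds and where it breaks, as a sketch only. It borrows Huang's $500,000 engineer and $250,000 token budget and the low end of the 5-10x claim, and it makes two simplifying assumptions of my own: that an engineer's output is worth their salary at 1x, and that only the share of tokens spent on real work contributes to the gain. Both function names are hypothetical.

```python
def break_even_multiplier(salary_usd: float, token_spend_usd: float) -> float:
    """Output multiplier at which salary plus tokens costs the same
    per unit of output as salary alone."""
    return (salary_usd + token_spend_usd) / salary_usd


def effective_multiplier(claimed_multiplier: float, useful_fraction: float) -> float:
    """Scale only the gain above 1x by the share of token spend doing
    real work (a simplifying assumption, not a measured relationship)."""
    return 1 + (claimed_multiplier - 1) * useful_fraction


SALARY = 500_000       # Huang's example engineer
TOKEN_SPEND = 250_000  # Huang's suggested annual token spend

needed = break_even_multiplier(SALARY, TOKEN_SPEND)
print(f"Break-even multiplier: {needed:.2f}x")

for useful in (1.0, 0.5, 0.1):
    gain = effective_multiplier(5.0, useful)
    verdict = "pays off" if gain > needed else "does not pay off"
    print(f"useful share {useful:.0%}: {gain:.2f}x -> {verdict}")
```

Under these assumptions the spend clears the bar comfortably when the tokens are doing real work, and stops clearing it once most of the tokens are busywork. The leaderboard can't tell you which of those worlds you're in.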

If you can't tell the difference between tokens burned on real work and tokens burned on invented busywork, the leaderboard isn't measuring productivity. It's measuring the willingness to run AI agents. Those are different things.

The steelman version of Bosworth's argument is that in the early days of a new tool, getting people to use it at all has value: even if some of that usage is inefficient, you're building familiarity, discovering use cases, and normalizing the workflow. I actually think that's partially true. I used AI in pretty dumb ways when I was first figuring out what it could do, asking it things I could have Googled and basically treating it as a search engine with better grammar. That's how most of us got started. There's no shame in the learning curve.

But there's a difference between a learning curve and a leaderboard. One is a phase you pass through. The other is a permanent incentive structure that rewards the wrong behavior indefinitely.

What measuring outcomes actually looks like

A few companies have tried to build something better, though none of them have been doing it long enough to know if it works at scale.

Salesforce created what they're calling "Agentic Work Units" — measuring completed AI tasks rather than tokens consumed. Marc Benioff's framing is worth quoting directly: "A token on its own doesn't know your customers, your pipeline, your org chart, but Salesforce does. And the value isn't in the token. The value is in what our platform does with it, the work." That's the right frame. Whether the AWU metric actually captures "the work" in a way that can't also be gamed is a genuinely open question — the metric is new enough that the reporting on it is still mostly Salesforce's own documentation.

Zapier took a different approach and tracked the percentage of employees actively using AI, the number of AI-powered workflows deployed, and the number of AI experiments launched. They hit 97% active employee usage. That's still an activity metric rather than an outcome metric, but it's at least measuring whether people are building things rather than whether their agents ran overnight.

Writer tracks "words generated," "recaps transcribed," "words rewritten" — more granular than tokens, closer to actual work product. It's not perfect either, but at least it's pointing at something a person produced rather than something a server processed.

None of these approaches is fully worked out. But they're all pointing in the right direction: toward what got done, not toward how much compute got burned doing it.

What this means if you're not at Meta

Most of the people reading this aren't running 85,000-person engineering organizations. But the dynamic isn't exclusive to big companies, and a version of the same pressure is already showing up in smaller contexts.

If you're a freelancer or a small-team operator, nobody's handing you a token leaderboard. But there's a softer version of the same trap: the impulse to demonstrate AI usage rather than demonstrate results. Showing a client a 40-page AI-generated research document when they needed a two-paragraph answer. Running a dozen different AI tools on a project to prove you're "using AI" when one tool used well would have been enough. I catch myself doing versions of this — reaching for AI on things where it doesn't actually help because it feels like the right move in 2026.

The question worth asking isn't "am I using enough AI?" It's "did the work get better?" Those are genuinely different questions and the second one is harder to answer, which is probably why companies keep defaulting to the first.

If your company or client starts tracking AI usage as a KPI — token counts, sessions, weekly active usage — you're watching the same trap get set. The metric will look like accountability. It will feel like progress. And the people being measured will respond exactly the way the Meta engineers did: rationally, efficiently, and in ways that produce nothing useful.

The tools are genuinely good. Claude, Gemini, GPT-4o — I use them constantly and they've changed how I work in real ways. But the value isn't in how many tokens you burn. It's in whether the output was worth having. That's a harder thing to put on a leaderboard, which is exactly why nobody's figured out how to do it yet.
