AI-shaped stuff
13 Jun 2026
We keep waiting for AI to get good enough. That’s backwards. The bottleneck is human: work that isn’t AI-shaped, verification we’re bad at, and confidence we’ve offloaded to the machine itself.
I’m writing a book chapter on ‘using AI well’ with a military colleague of mine. I have rather a lot of thoughts about AI. But mostly, these are thoughts about what AI can’t do (yet). This chapter is supposed to be about what AI can do. And in particular, about what we can do with AI.
In a more recent slew of marginalia, I’ve been working through a couple of questions. Again, these are questions about what AI can’t do. But they shed some light on what AI can do. Or perhaps, on why AI can’t do stuff well, and by implication, what we need to do so we can use AI better.
¶What is AI actually changing?
There’s no doubt that AI feels productive, for many people using it. But there is really very little sign of this productivity in productivity measures. There are a few reasons this could be.
Firstly, humans reliably overestimate the effect of a new technology in the short term—a phenomenon known as Amara’s Law. This appears to be the case for AI. Of course, Amara’s Law also points out that we underestimate the effect in the long run. Regardless, some of this effect may simply be us confusing enthusiasm for productivity.
Some of this seems likely to be scope creep (see also this). As AI enables us to do some things faster, we simply try to do more of those things. More, perhaps, than we might have before, and thus whatever productivity we had before is now getting dispersed across more directions.
However, much of this is going to be a simple case of jagged adoption. Different sectors are adopting AI at different rates, and different people are differently skilled at using AI. This isn’t just a skill issue though. It’s structural (pdf). People need time to build around AI. New processes, ways of working, business models, and so on, all have to be pieced together to make AI useful.
Finally, skeptical economists reckon that there just aren’t that many tasks AI can do profitably. The case is made stronger by the fact that the more difficult tasks AI succeeds at are achieved in lab-based settings, which might hide the fact that they’re much less achievable when we introduce the complexity of the real world.
But there is something buried in this skepticism that seems less like a problem of waiting until AI comes good, and more like a problem of bad implementation now.
¶The AI bottleneck
There’s a well-known problem of AI task achievement. The common name is the Jagged AI Frontier. In the paper behind that name, on tasks AI is good at, management consultants saw a substantial boost in productivity. On tasks AI was bad at, they were almost 20% more likely to make errors informed by the AI. Certain tasks are just beyond AI. Hence the economic skepticism I linked to earlier.
Of course, this gap is closing. As models improve, the frontier becomes less jagged. Or, perhaps, the jaggedness is pushed so far beyond human capacity that we don’t notice it anymore. It seems likely that there’ll be a cap to this, though. Yann LeCun, a pioneer of machine learning since the late 1980s, famously worries that the Large Language Models (LLMs) we currently have simply don’t have the capacity to create the kind of internal world model that would let them plan or predict consequences to the extent we want them to.
LLMs are essentially unimodal (i.e. they take in mostly text, which is a perceptual environment quite unlike the rich perceptual miasma humans swim in); they also really only do one thing: they talk. They talk by predicting the next token of meaning from the tokens that came before it. This means that they can’t deliberate substantially over alternatives, which requires sophisticated modelling of a multidimensional world. Some of this problem is ameliorated by ‘thinking’ LLMs, which spend some time talking through a problem before responding. But this isn’t quite the same thing as simulating a world: representing abstract concepts of the world in the mind to work out the general principles the world operates on. LLMs produce detail: the text-based representations of the world are (kind of) all they have access to.
So, these models, at a minimum, seem likely to be weak at some tasks for the foreseeable future. However, some non-trivial part of this jaggedness is completely human.
¶The human bottleneck
First of all, I mentioned earlier that humans are hampered by AI, when AI isn’t good at stuff. Equally though, AI is hampered by humans, when humans aren’t good at stuff. For some tasks, when humans team up with AI, the team does better than humans alone, but worse than AI alone. This seems kind of similar to other examples of human-machine teaming. In Advanced Chess, humans team up with computers to play. These human-machine pairs, known as ‘cyborgs’ or ‘centaurs’, beat computers alone in the earlier years, to 2005 or so. By the mid-2010s, computers far outstripped the ‘centaurs’: the humans were holding them back.
But again, this isn’t just a skill problem, it’s a structural problem.
See, some jobs seem to cluster into better AI-shapes than others. So if I think about what I do, lecturing has a fabulous cluster of AI-shaped tasks. Research, slide creation, example generation. All this stuff clusters into a ‘preparation block’ that can be almost entirely handed over to AI. Even the lecture itself can be AI-ified. NotebookLM generates podcasts that wouldn’t be much worse than me on stage. I can just come in at the end, and check everything looks good.
Tutorials are a different matter. It’s essentially the same activities, all of which can be AI-ified in the same way. But in a tutorial there’s a lot of live diagnosis and back-and-forth with students. Even though the collection of tasks is similarly AI-shaped, the way they cluster makes one easy to chain, and one harder.
This problem of AI chaining (pdf) is a major roadblock to implementing AI well. To the extent that humans aren’t thinking about bundling tasks into AI-shaped chains (or aren’t able to), the utility of AI won’t be properly realised.
This actually explains part of the economic skepticism from earlier. Papers like the one I linked to earlier, which claim that only a small fraction of tasks are automatable, rely on linear exposure indices (like this or this) that aren’t really sensitive to how tasks could be clustered into AI-shaped chains.
But, returning to the jagged edge, some tasks aren’t (yet) deferrable to AI at all. Some tasks are simply human-shaped. Like my tutorials, with their rapid back and forth between student and teacher. Or, since AI seems likely to eat even that kind of latency problem, consider a simple apology or a eulogy or a confession. All things which are meaningful from a human, but much less satisfying from a word-perfect AI.
Indeed any task where the performer is essential to the character of the task is going to be difficult to defer to AI. At Sandhurst, I teach ethical decision-making. In theory, a machine is going to be better at some aspects of ethical judgement. Making the kinds of proportionality judgements that characterise the Law of Armed Conflict (LOAC) for example. A human and an AI can both make a decision about whether striking a target that might hurt civilians is sufficiently proportional. But an AI, conceivably, could make that decision more effectively than a human dealing with fear, and the fog of war, and days of combat fatigue. An AI, however, can’t really bear responsibility for that decision. LOAC’s enforceability relies on things like command responsibility, individual criminal liability, and mens rea—the mental state of the person involved. An AI can produce a war crime, but it can’t be punished for it or deterred from doing it again. It can’t answer for it.
This human-shaped problem opens up an opportunity, but also a specific liability.
The opportunity is in what’s known as the O-ring problem of automation. It refers to the Challenger Disaster, in which the space shuttle came apart because of a single faulty gasket seal. O-ring automation points out that some jobs are multiplicative. One task feeds the next, and the quality of each depends on the quality of the tasks before it. It didn’t matter how good the rest of the Challenger was, one faulty O-ring determined the outcome.
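A rough way to see why multiplicative jobs behave like this is to treat each task's quality as a number between 0 and 1 and multiply them together. This is only a toy sketch: the function name `chain_quality` and the quality scores are my own illustrative assumptions, not figures from the O-ring literature.

```python
from math import prod

def chain_quality(task_qualities):
    """Toy O-ring model: overall quality is the product of each task's quality.

    Each value is assumed to lie between 0 (total failure) and 1 (perfect).
    """
    return prod(task_qualities)

# Hypothetical scores for a four-task chain.
strong_chain = [0.99, 0.99, 0.99, 0.99]
one_weak_link = [0.99, 0.99, 0.99, 0.50]

print(round(chain_quality(strong_chain), 3))   # 0.961
print(round(chain_quality(one_weak_link), 3))  # 0.485
```

Under these assumptions, three near-perfect tasks can't rescue the chain from one poor one, which is why the weakest task deserves the most attention.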
In task clusters like this, automating some of them seems like a good thing. It frees time up for humans to spend more time on the quality of the human-shaped bits. The AI can do all the targeting and calculation, freeing up important decision-time for a human to make the responsibility-bearing judgement about whether to strike the target or not.
The liability here, though, is the problem of verification. Remember the AI bottleneck—the jagged frontier. The solution is human verification. Where AI isn’t particularly good at tasks, or when it’s an O-ring task, or when you simply aren’t confident about the ability of an AI, you want to put a human-in-the-loop to verify the output.
For simple tasks, this is very straightforward and good. For something like mammography, AI can simply flag weird scans for a human to check. It doesn’t need to make any substantive decisions otherwise, and cutting down the expensive human triage is a straightforward value-add. Verifying that the AI is doing a good job is an easy case of quality-control sampling—checking some proportion of ‘not weird’ scans. Easy.
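The quality-control sampling described above might look something like this. It is a minimal sketch under my own assumptions: the function names, the 5% audit rate, and the pretend review results are all hypothetical, and a real screening programme would set its sample size far more carefully.

```python
import random

def sample_for_review(cleared_ids, rate, seed=0):
    """Pick a random fraction of AI-cleared items for a human to re-check."""
    if not cleared_ids:
        return []
    rng = random.Random(seed)  # seeded so the audit sample is reproducible
    k = min(len(cleared_ids), max(1, round(len(cleared_ids) * rate)))
    return rng.sample(cleared_ids, k)

def estimated_miss_rate(reviewed):
    """reviewed maps item id -> True if the human found something the AI missed."""
    if not reviewed:
        raise ValueError("No reviewed items to estimate from")
    return sum(reviewed.values()) / len(reviewed)

# Hypothetical batch: 1000 scans the AI marked 'not weird'.
cleared = list(range(1000))
audit = sample_for_review(cleared, rate=0.05)

# Stand-in for the human review step: pretend every 100th scan hid a miss.
reviewed = {scan_id: (scan_id % 100 == 0) for scan_id in audit}
print(len(audit), estimated_miss_rate(reviewed))
```

The human only looks at the audit sample, and the miss rate in that sample stands in for the miss rate across everything the AI cleared.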
In contrast, if you want to work out how good AI is at solving difficult or intricate coding tasks, you need someone who understands the code as well as if they’d written it themselves in order to verify that. Regression tests don’t capture future-oriented code structure, meaning code designed for new features or version changes. This kind of verification is do-able, but the verification process is really expensive. This is at least part of the story behind that infamous study, in which AI use seemed to slow expert coders.
In something more complicated still, like bearing the responsibility for an ethical judgement—to strike a target or not—it almost seems like one would want the human to go through the entire calculation process from scratch.
Madeleine Elish gestures at this in her work on “Moral Crumple Zones”. If the human doesn’t really audit or control the entire chain of AI reasoning before making a decision, to what extent are they truly taking on the responsibility for the decision? LLMs are particularly troubling for this. Their internal reasoning is a notorious black box—it’s near impossible to determine how inputs get translated into outputs. So, like a car’s bonnet absorbs the impact of a crash, humans have often borne the consequences of the decisions of automated systems, where realistically, they had no particular ability to respond at all.
Verification is a human bottleneck, and there’s a kind of implicit taxonomy to it. Tasks which are about sampling, tasks which require expert verification, and tasks which require, essentially, full re-derivation. And I suspect the costlier ones are the ones we’re most at risk of fumbling.
¶Nerfing humans
In video games, to nerf something is to make it worse for players. Usually this happens when something is overpowered. An in-game weapon that kills enemies too easily might often get ‘nerfed’ by developers to make the game more fair.
I think it’s fair to say that a general consensus among educators is that AI ‘nerfs’ humans. Specifically, it seems to really mess with our ability to engage in critical thinking. A recent paper makes the mechanics clear. When we’re confident about an AI’s output, we engage less critically with the material. When we’re more self-confident, we engage more critically with the material. The implication, of course, is that where we are less confident, we might be inclined to defer to the AI.
Worse still, we’re really bad at calibrating our confidence in AI. We, understandably, seem to judge AI competence on the human-shaped aspects of a task. If a task is easy for humans, but hard for AI, then we develop the sense that AI is not really very good. So much so, that some people just dump the AI entirely. If a task is hard for humans, even if it’s very easy for AI, we develop the sense that AI is very good. Given that AI has this irritating jagged edge, and due to the fact that—as LeCun points out—AI operates in a very different world to us, we end up miscalibrating our sense of AI competence in both directions.
Then there’s the problem of skill offloading. Before AI, people were very worried about the fact that we seemed to be googling away our memory: relying on the fact that we could search online for information, rather than memorising that information for ourselves. Something similar appears to be happening with AI. The general trend appears to be cognitive offloading to AI—knowing that AI can handle tasks stops us from learning how to do those tasks at all (see e.g. here and here).
Now, this isn’t always true. I myself notice that I learn patterns from AI on tasks where I’m less skilled. Coding patterns I didn’t know before, or disciplines of thought I’d never read into that parallel my own field. It seems that, for novices, AI can actually scaffold learning, rather than nerf it.
But this too has its drawbacks. AI suffers from the tyranny of the authority: drawing on ideas that are frequently discussed or published, rather than the information that might be best suited for the task at hand.
All in all, while AI acts as its own bottleneck, humans are a specific liability to the quality of AI output. Our value is in verifying AI output on the basis of its competence, but our ability to determine its competence isn’t just bad, it’s offloaded to the AI itself.
¶Outro
There are obvious implications here for improving things now, rather than simply waiting for AI to get better.
The first is to try to organise our jobs, where possible, into AI-shaped chains of AI-shaped tasks. Human-shaped tasks and the verification humans do need thoughtful and strategic placement. Essentially, the more we can get out of the way of the AI, the better the output is going to be.
But this introduces new problems. Verification of complex tasks requires a level of expertise close to that of doing the task in the first place. If the tasks are all automated, then who is learning how to do the tasks so that we can verify them? The same automation that demands our verification is eroding the skills we need to actually do it. Not just by taking away our ability to verify, by stopping us doing and thus learning the skill, but also by dampening the self-confidence that prompts us to verify in the first place.
Relatedly, good verification is a function of confidence in the machine. We’re bad at determining how good AI is at stuff, because we judge it against human-shaped criteria. Not only that, but our confidence in something only calibrates well in an environment that provides lots of good quality feedback. AI-assisted work is usually not very feedback rich—you rarely learn whether the output you see is actually good.
Neither of these has to obtain. As I pointed out earlier, AI does seem to help some people develop their skills. Using AI as a mentor to scaffold learning is part of the solution.
But beyond up-skilling novices, it seems like expertise is a required feature of our future with AI. Building skill development in alongside the AI-shaping of tasks is going to be a requirement of successful implementation.
To do that, something is going to have to make us do it. Forcing functions need to be built into the automation process to stop us deferring to the LLM. Enforced delays to encourage thought, asking users to make decisions before seeing AI output, preventing AI from making recommendations about courses of action: all things that force us to critically engage with the task at hand. Sadly, people fucking hate that.
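To make one of those forcing functions concrete, here is a small sketch of the ‘decide before you see the AI output’ idea. The class name `CommitFirstGate` and its methods are hypothetical, invented for illustration. This isn't a description of any existing tool, just one way such a gate might be wired.

```python
class CommitFirstGate:
    """Withhold an AI suggestion until the user has recorded their own decision."""

    def __init__(self, ai_suggestion):
        self._ai_suggestion = ai_suggestion
        self.user_decision = None

    def commit(self, decision):
        """Record the user's own answer; blank answers don't count."""
        if not decision or not decision.strip():
            raise ValueError("A non-empty decision is required")
        self.user_decision = decision.strip()

    def reveal(self):
        """Return the AI suggestion, but only after the user has committed."""
        if self.user_decision is None:
            raise PermissionError("Commit your own decision before viewing the AI output")
        return self._ai_suggestion


# Hypothetical usage: the reviewer must answer first, then compare.
gate = CommitFirstGate("use a dict here")
gate.commit("use a list")
print(gate.user_decision, "vs", gate.reveal())  # use a list vs use a dict here
```

The point of the design is that the user's answer is fixed before the AI's is visible, so any disagreement between the two becomes a prompt to think rather than a cue to defer.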
So, is AI changing anything? Whether the gains are small or hidden in this mess of human noise seems like something we can resolve with some of this re-shaping work. I’m not so sure I’m optimistic, in the short term. Oh well.