Ghost in a Loop
2026-01-14
What is Really Happening When LLMs Think
We've casually started calling LLMs "thinking" machines. Open any AI chatbot and you'll see options like "fast," "thinking," or "deep thinking." We talk about models "understanding" content. But do they? I've spent the past couple of years building with LLMs, and I've never been more intrigued by a technology. But I'm also concerned by how loosely we're using words that describe core human traits.
These words, "understanding," "reasoning," and "thinking," aren't just marketing fluff. They're terms we use to describe core human cognitive traits. When we apply them to LLMs without precision, we obscure what the technology actually does. And as this tech becomes as normal as FaceTiming someone across the globe, that obscurity becomes a problem. This post is my attempt to look under the hood and ground myself in what's really happening; hopefully it's useful to others too.
It's important to first understand that the core of an LLM is a static, frozen file of statistical weights. How that file gets generated is a much larger, more technical conversation that I won't get into here. The critical part is that the weights are used to generate statistical predictions from a given input. You see this in action when you tell ChatGPT something like "Jack and Jill went..." and it completes the sentence. That completion is generated by sampling from the statistically probable outputs those weights produce. The weights don't change. The model didn't "learn" anything from your input. In fact, the moment you start a new conversation, it's Groundhog Day: the model has no memory of the previous one.
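To make that concrete, here's a toy sketch of next-token sampling in Python. The vocabulary and probabilities are invented purely for illustration; in a real model the distribution comes out of the frozen weights via a forward pass, not a hand-written table.

```python
import random

# Toy stand-in for what the frozen weights compute: a probability
# distribution over candidate next tokens for a given prompt.
NEXT_TOKEN_PROBS = {
    "Jack and Jill went": {"up": 0.85, "to": 0.10, "home": 0.05},
}

def sample_next_token(prompt: str) -> str:
    """Pick one next token by sampling from the distribution the weights produce."""
    probs = NEXT_TOKEN_PROBS[prompt]
    tokens = list(probs.keys())
    weights = list(probs.values())
    return random.choices(tokens, weights=weights, k=1)[0]

print(sample_next_token("Jack and Jill went"))  # usually "up", sometimes "to" or "home"
```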
This is the second really important thing to understand about all LLMs: they are stateless. The core functionality is taking an input, running a function, and getting an output. There is no magical reasoning happening inside the model, and it has no inner monologue; it just executes a set of functions to generate a statistically probable output and returns it to the user. All the mainstream LLM-powered applications operate the same way. There are other layers, which we'll get to later in this post, but at the core of every LLM application in production right now, the low-level primitive powering everything is a stateless process that generates a statistically probable output for a given input.
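If you strip that primitive down to a function signature, it looks something like the sketch below. The function name and stubbed body are mine, not any library's API; the point is that everything the model "knows" comes in as arguments, and nothing persists between calls.

```python
def generate(weights: dict, prompt: str) -> str:
    """A pure function: weights and prompt in, text out. Nothing is written
    back to `weights`, and nothing is remembered between calls. The body is
    stubbed; in a real model it would be the forward pass plus sampling."""
    return "<completion>"

# Two separate "conversations". The second call has no access to the first:
# any continuity a chat app shows you is handled outside this function.
first = generate(weights={}, prompt="My name is Sam.")
second = generate(weights={}, prompt="What is my name?")
```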
Now that we understand what's at the core of every LLM, let's talk about this term "thinking" that we seem to have settled on to describe what an LLM is doing. Daniel Kahneman's "Thinking, Fast and Slow" offers a useful frame here. He distinguishes System 1 thinking, which is fast, impulsive, and pattern-matching, from System 2 thinking, which is slower, effortful, and logical: the kind you do when working through a layered engineering problem over days or weeks.
LLM behavior is fast, reactive, and built on pattern matching. That maps to System 1 thinking. Reasoning models like OpenAI's o3 appear to exhibit System 2 thinking because they take longer, produce intermediate steps, and give the illusion that the tech has "gone off to give your prompt a good think." But look under the hood and you find the same stateless function running repeatedly: clever engineering that provides the appearance of System 2 thinking through System 1 mechanics. In the end it's matrix multiplication and vector transformations. Very sophisticated math, but math nonetheless. No understanding, no intention, just computation that occasionally produces outputs so coherent we mistake them for thought. An LLM's output is, as Grady Booch put it, "an echo of a whisper" of what human intelligence is. Though if you listen closely, you can kind of hear something that faintly resembles it.
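Here's one hedged sketch of how that appearance can be manufactured at the application layer. This is not how any particular vendor implements reasoning; it just shows that "thinking" can simply be more generated tokens, produced by the same stateless primitive called more than once.

```python
def complete(prompt: str) -> str:
    """The same stateless next-token process as before, stubbed for illustration."""
    return "<generated text>"

def answer_with_thinking(user_prompt: str) -> str:
    # "Thinking" is just another round of generation: ask the model to emit
    # its intermediate steps as ordinary tokens...
    scratchpad = complete("Think step by step about: " + user_prompt)
    # ...then feed those tokens back in as part of the next prompt and
    # generate the final answer. Same math, run twice.
    return complete(user_prompt + "\n\nNotes so far: " + scratchpad + "\n\nFinal answer:")
```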
I'm sure that after everything we've seen in 2025, most people who have interacted with this tech will say, "Well, this is impossible. I've seen with my own two eyes an LLM vibe code an entire application that got 80% of the way to exactly what I was looking for...and the damn thing worked! That can't just be statistics and pattern matching!" That brings us to the next really important thing we have to understand about LLMs and the applications they power. For that, we have to go up the stack to the application layer.
All of the currently available production-grade LLM applications that offer "reasoning" or "thinking" operate in pretty much the same way. The user provides an input, and the application runs it through an LLM just like we discussed earlier, except this time it does so in a loop. The nuances of how this is done matter less for our purposes. The core primitive is the same one we covered earlier: a stateless function that takes an input and, using statistical probability and a static file of weights, generates an output. It's just called repeatedly. In 2025 we saw some genuinely creative ways to use this rather simple concept to build some very powerful applications. We saw things like MCP servers arrive, which let applications "intelligently" execute pieces of code that can search the internet, call an API, or modify files in a directory on your machine. Pretty cool stuff. But none of this is actually part of the LLM itself; it's all done with traditional software engineering and deterministic execution of code.
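Here is a minimal sketch of that loop, with stubbed stand-ins: `complete` for the stateless LLM call and `run_tool` for the deterministic code the application executes on the model's behalf. The `TOOL:`/`FINAL:` convention is invented here for illustration; real systems use structured tool-call formats, but the shape is the same.

```python
def complete(transcript: str) -> str:
    """The stateless LLM call, stubbed for illustration."""
    return "FINAL: here is your answer"

def run_tool(request: str) -> str:
    """Ordinary deterministic code: a web search, an API call, a file edit."""
    return "<tool result>"

def run_agent(user_request: str) -> str:
    transcript = user_request
    while True:
        output = complete(transcript)              # same stateless primitive, called again
        if output.startswith("TOOL:"):             # the model asked for a tool
            transcript += "\n" + output + "\n" + run_tool(output)
        else:                                      # the model produced a final answer
            return output.removeprefix("FINAL:").strip()
```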
Thus far we have talked about the key primitive that makes LLMs "think": a stateless function that takes an input and produces an output. We also went through how the concept of "reasoning" or "thinking" is achieved by running that primitive in a loop. And we can use some creative engineering to make that loop look like the machine is doing things...a ghost in a loop.
The last thing we have to talk about is how state is handled. Earlier I described LLMs as stateless functions, and that is still correct. However, state is maintained when a user initiates a session with an LLM-powered application, and that state is what most users know as their prompt: the conversation so far, plus any context that was provided to the application. This state is not a component of the LLM itself; the LLM remains completely stateless. The state belongs to the application. So without an application layer there is no state, and without state an LLM is just a static file of weights that does nothing of real value in the world.
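Here's a sketch of where that state actually lives, again with a stubbed `complete` standing in for the model call: the application keeps the transcript and replays it to the stateless function on every single turn.

```python
def complete(prompt: str) -> str:
    """The stateless LLM call, stubbed as before."""
    return "<reply>"

class ChatSession:
    """The 'memory' a user experiences lives here, in ordinary application
    code. The model call itself stays stateless."""

    def __init__(self) -> None:
        self.history: list[str] = []            # the state: just a list of strings

    def send(self, user_message: str) -> str:
        self.history.append("User: " + user_message)
        prompt = "\n".join(self.history)        # replay the whole conversation, every turn
        reply = complete(prompt)                # the same stateless function as before
        self.history.append("Assistant: " + reply)
        return reply
```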
My perspective is that LLMs are potential energy. A file of weights does nothing on its own. It needs an application layer to transform that potential into something kinetic: a chatbot, a coding assistant, an agent that can search a company knowledge base and take action for a customer.
But not all applications are equal. The useful ones understand that an LLM is only as good as the context it's given. An LLM with access to your codebase, your docs, and your specific constraints will outperform the same model given a generic prompt every time. The weights are the same; the context makes the difference.
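As a rough illustration of what "delivering context" means, here's a deliberately naive sketch that matches documents to a question by shared words. Real systems use embeddings, indexes, and retrieval pipelines, but the job is the same: put the right material in front of the same stateless model.

```python
def build_prompt(question: str, documents: list[str]) -> str:
    """Naive context assembly: keep documents that share words with the
    question and place them ahead of it in the prompt."""
    words = set(question.lower().split())
    relevant = [d for d in documents if words & set(d.lower().split())]
    context = "\n".join(relevant)
    return "Use the context below to answer.\n\nContext:\n" + context + "\n\nQuestion: " + question
```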
The models will keep improving. But the real leverage is in the application layer and in engineering systems that deliver the right context at the right moment. That's where the actual work is. That's where the value gets created.