The last few years of AI have been unusually forgiving for buyers. Teams have moved between ChatGPT, Claude, Gemini, Codex, and a dozen others almost opportunistically, picking whichever model looked best in a given month and switching again when the frontier moved. Switching costs stayed low because the models were generic. Your prompts and documents lived in one system, but the intelligence behind them was never deeply fused with any particular vendor, which is unusual even by the standards of software. Software normally locks customers in through integrations, data formats, workflows, and the slow accumulation of institutional habit. AI has so far been a peculiar exception, in that the thing you are buying is largely interchangeable with the thing the next vendor is selling.

That interchangeability survives only as long as the models stay generic, and it ends once they acquire real memory. Real memory is not a larger context window, a folder of documents the model can search, or a vector database behind an API, all of which already exist and have changed nothing fundamental about the economics. It is the model adapting to your organisation and absorbing how you make decisions, how your processes run, what is in your codebase, and the edge cases you have learned to handle. In its strongest form that means continual learning, or weights that have been reshaped around your business. Once a model reaches that point, the question is no longer which model is best but whether you can afford to give up the one that knows you, which is a far stronger form of lock-in than anything the current AI market has produced.

Why memory has to live in the weights

Shallow forms of memory, such as vector stores, retrieval layers, preference files, and conversation history, are already widely deployed. These are useful, but they are not the same thing as a model that has internalised your organisation, because they sit outside the model, get pulled in at inference time, and can be exported. Real memory probably requires one of two mechanisms instead: either the model learns continually as you interact with it and correct it, or its weights are fine-tuned around your organisation. In both cases the knowledge ends up inside the network rather than alongside it. That is the distinction that matters, because once knowledge is embedded in weights it stops being portable in the way documents and databases are, and you cannot easily export what the model has become after a year of learning your business.

The batching problem

Frontier serving works at scale because of batching, where a single forward pass through the network processes queries from thousands of unrelated users at once and amortises the cost of moving the weights through GPU memory across all of them. That is what has let API prices keep falling even as capability has risen, and it is exactly what memory in the weights breaks, because weights specific to your organisation cannot share a forward pass with weights specific to someone else’s.
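The amortisation argument can be made concrete with a toy cost model. This is only a sketch: the function name and the cost figures are illustrative assumptions, not measurements from any real serving stack.

```python
def cost_per_query(weight_load_cost: float, per_query_compute: float, batch_size: int) -> float:
    """Toy model of one forward pass.

    The pass pays a fixed cost to read the weights, shared by every query in
    the batch, plus a marginal compute cost for each individual query.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return weight_load_cost / batch_size + per_query_compute


# Hypothetical units: 1000 to read the weights once, 1 of compute per query.
shared = cost_per_query(1000.0, 1.0, batch_size=500)  # generic weights, many customers per pass
private = cost_per_query(1000.0, 1.0, batch_size=5)   # customer-specific weights, one customer's traffic

print(f"shared weights:  {shared:.1f} per query")
print(f"private weights: {private:.1f} per query")
```

The fixed cost does not change between the two cases. What changes is how many queries are available to divide it among once the weights belong to a single customer.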

The labs will solve this, because they have to, even if it is hard to say in advance how. The fix might be partial batching, where the first twenty-four layers of a thirty-six-layer network stay shared across customers while the last twelve carry each customer’s adaptations. It might be an adapter architecture that keeps the base model shared and injects per-customer changes through small swappable modules, or a routing system that groups similar customers into the same batch. It might be something nobody has built yet.
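As one illustration of the adapter idea, the sketch below keeps a single shared weight matrix and adds a small low-rank correction per customer, in the style of low-rank adaptation. It is a minimal pure-Python sketch under stated assumptions: the class name, the customer identifier, and the tiny matrices are hypothetical, and it does not describe how any lab actually serves models.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class AdaptedLayer:
    """One linear layer: shared base weights plus optional per-customer adapters.

    For a customer with an adapter the output is W x + up (down x), where
    `down` projects to a small rank and `up` projects back to the output size.
    """

    def __init__(self, base: Matrix) -> None:
        self.base = base  # shared by every customer
        self.adapters: Dict[str, Tuple[Matrix, Matrix]] = {}

    def register(self, customer_id: str, down: Matrix, up: Matrix) -> None:
        """Attach a swappable low-rank module for one customer."""
        self.adapters[customer_id] = (down, up)

    def forward(self, customer_id: str, x: Vector) -> Vector:
        y = matvec(self.base, x)  # the shared part of the computation
        adapter = self.adapters.get(customer_id)
        if adapter is None:
            return y  # no adapter registered: plain base-model behaviour
        down, up = adapter
        delta = matvec(up, matvec(down, x))  # the per-customer correction
        return [a + b for a, b in zip(y, delta)]


# Hypothetical example: a 2x2 identity base and a rank-1 adapter for "acme".
layer = AdaptedLayer(base=[[1.0, 0.0], [0.0, 1.0]])
layer.register("acme", down=[[1.0, 1.0]], up=[[0.5], [0.0]])

print(layer.forward("acme", [1.0, 2.0]))   # base output plus acme's correction
print(layer.forward("other", [1.0, 2.0]))  # base output only
```

The base matrix is never modified, so every customer can share it. Everything a customer has taught the system has to fit inside the two small matrices, which is the kind of ceiling on depth that any shared-serving design imposes.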

Whatever the solution turns out to be, though, it will involve compromises that today’s pure shared-serving model avoids, and those compromises will limit how deeply a vendor can adapt a hosted model to any single customer. If your memory lives only in the last twelve layers because the first twenty-four have to stay shared, the depth of that memory is set by an architectural decision the lab made for its own serving economics, rather than by what would be best for you. A self-hosted open-source model carries no such constraint, because you can tune the whole network, which means that once memory is the thing that matters, the freedom to adapt a model as far as you want may favour self-hosting in its own right, before cost even enters the picture.

Lock-in is structural either way

Even if the labs engineer their way to memory-bearing inference at competitive cost, the lock-in does not go away. The vendor is now hosting a unique artefact, your organisation’s accumulated operating intelligence, that exists nowhere else and cannot be rebuilt elsewhere, and that gives them pricing power they did not have before, not because their costs went up but because your switching costs did. The Claude rate-limit and pricing scares of early 2026 were a preview of this, at a point when most AI tools were still relatively easy to swap out, and with memory embedded in the weights a customer has far less room to walk away. The cost question is real, but it is the lesser of the two problems, and the bigger one is that the vendor ends up holding something that used to live inside your business.

Upgrades become migrations

Today, when a provider releases a new model, customers simply get it, and the upgrade is clean because the model is generic. Once memory is in the weights, though, an upgrade turns into a state migration. The new model either inherits the old one’s learned state or has to relearn it, performance may dip while the transition settles, and you are left checking what knowledge carried over, what was lost, and whether any of it can be rolled back. None of these are questions that subscribing to an API normally forces on you. Owning the learning loop means you decide when and how these transitions happen, while leaving it with the vendor means you live with whatever they decide, whenever they decide it.
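One way to make "what carried over, what was lost" checkable is to keep a set of probe questions with known answers and run them against both the old and the new model. The sketch below assumes exact-match scoring and models exposed as plain functions. The probes and the stand-in models are hypothetical, and a real evaluation would need a more forgiving comparison.

```python
from typing import Callable, Dict, List

Model = Callable[[str], str]


def migration_report(probes: Dict[str, str], old_model: Model, new_model: Model) -> Dict[str, List[str]]:
    """Sort each probe by whether the old and new model answer it correctly."""
    report: Dict[str, List[str]] = {
        "carried_over": [],  # both models correct
        "lost": [],          # old correct, new wrong
        "gained": [],        # old wrong, new correct
        "never_known": [],   # both wrong
    }
    for prompt, expected in probes.items():
        old_ok = old_model(prompt) == expected
        new_ok = new_model(prompt) == expected
        if old_ok and new_ok:
            key = "carried_over"
        elif old_ok:
            key = "lost"
        elif new_ok:
            key = "gained"
        else:
            key = "never_known"
        report[key].append(prompt)
    return report


# Hypothetical stand-ins for an old and a new model, backed by lookup tables.
old_answers = {"refund window?": "30 days", "escalation owner?": "ops"}
new_answers = {"refund window?": "30 days", "escalation owner?": "support"}
probes = {"refund window?": "30 days", "escalation owner?": "ops"}

report = migration_report(
    probes,
    old_model=lambda p: old_answers.get(p, ""),
    new_model=lambda p: new_answers.get(p, ""),
)
print(report)
```

A report like this only helps if you can run it before the switch and can choose to delay the switch, which is the control that owning the learning loop gives you.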

The practical response

None of this means you should abandon closed frontier models, which will still be useful, particularly for work that needs raw capability now and where memory is not yet the binding constraint. The point is to separate short-term performance from long-term memory ownership, and to be deliberate about which of the two you are paying for at any given moment.

At New Gradient we already do a small version of this, for now mostly to save money. Simple queries, like short rewrites, classification, extraction, and routine formatting, go first to open-source models hosted on our own infrastructure, and only harder or more ambiguous work escalates to frontier systems. The frontier is overkill for most of what people actually ask AI to do, and an open-source model six months behind it is already good enough for the bulk of everyday work.
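The routing described above can be sketched roughly as follows. This is an illustrative assumption, not New Gradient's actual implementation: the task labels, the length threshold, the acceptance check, and the model functions are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical labels for work that is tried on the self-hosted model first.
SIMPLE_TASKS = {"rewrite", "classification", "extraction", "formatting"}


@dataclass
class Query:
    task: str
    text: str


def route(query: Query, max_simple_chars: int = 2000) -> str:
    """Pick a first destination: "local" for short, simple tasks, else "frontier"."""
    if query.task in SIMPLE_TASKS and len(query.text) <= max_simple_chars:
        return "local"
    return "frontier"


def answer(
    query: Query,
    local_model: Callable[[str], str],
    frontier_model: Callable[[str], str],
    is_acceptable: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try the local model first where allowed and escalate if its draft fails the check."""
    if route(query) == "local":
        draft = local_model(query.text)
        if is_acceptable(draft):
            return "local", draft
    return "frontier", frontier_model(query.text)


# Hypothetical stand-ins for the two models and the acceptance check.
def local(text: str) -> str:
    return "spam" if "prize" in text else "unsure"


def frontier(text: str) -> str:
    return "not spam"


def acceptable(draft: str) -> bool:
    return draft in {"spam", "not spam"}


print(answer(Query("classification", "claim your prize"), local, frontier, acceptable))
print(answer(Query("classification", "meeting at noon"), local, frontier, acceptable))
print(answer(Query("analysis", "compare these contracts"), local, frontier, acceptable))
```

The acceptance check is where most of the judgement lives. Here it is a simple membership test, but in practice it would be whatever evaluation you trust enough to let a cheaper model's answer stand.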

When memory arrives, the same architecture matters a great deal more. The routing layer, the evaluation pipeline, the hosting, and the feedback loops together make up the apparatus by which an organisation teaches machines. If that apparatus sits inside your own infrastructure, the memory it produces stays portable and inspectable and under your control. If it sits inside a vendor, you are effectively renting a shadow of your own company.

Taken far enough, the decision about where AI memory lives stops resembling a software purchase and starts resembling a staffing one. Treat your AI systems as something close to employees and the logic is clear: if they are trained inside the business and embedded in its processes, the institutional knowledge they build up stays with the organisation. Host them with a vendor instead and you have outsourced a growing part of your workforce, on the understanding that the longer those systems work for you, the harder it becomes to bring the work back in-house. At that point the vendor is not really selling you software at all. They are selling you access to a workforce that has learned your business, on terms they can revisit whenever they like.

The companies that get the most out of AI memory will not be the ones that happened to pick the best assistant. They will be the ones that own the process by which their organisation teaches machines. Once that accumulated knowledge sits inside someone else’s weights, changing vendors is no longer a procurement decision but the loss of much of what the business has learned, something closer to institutional amnesia than to swapping out a piece of software. That is the real reason to keep the learning loop in-house, and it is why, once the labs crack memory, the businesses ready to run their own open models will be the ones that still own what they know.

The model that knows your business is the one you can’t leave
