Beyond the Language Model: Why World Models Are the Next Frontier for Physical AI

Language models do not understand the world. They understand text about the world - a distinction that sounds subtle until you try to build a machine that has to act in physical reality. For most enterprise AI applications, this gap has been manageable. For Physical AI and AIoT, it is not. The systems that sense environments, reason over sensor data, and take consequential action in the physical world require an architecture that language models were never designed to provide - and in my experience advising on Physical AI strategy, that mismatch is the most underappreciated risk in how organisations are currently investing in AI. That architecture exists. It is called a world model - and the transition toward it is already underway.

This article is part of a series exploring the Physical AI and AIoT landscape. If you have not read the earlier pieces - "From Code to Concrete: The Rise of Physical AI and the Power of Convergence" and "Data Has Currency: The Case for the Internal Data Brokerage" - I would encourage you to do so. The argument in this article builds on both. It also extends into territory that I think is underappreciated in most strategic AI discussions: the role of sovereign AI and edge models as the infrastructure layer that makes the transition to Physical AI not just architecturally coherent, but operationally real.

1. The LLM Plateau

Large Language Models have been a genuine technological achievement. The ability to generate coherent, contextually relevant text at scale - to reason across domains, to summarise, translate, and synthesise - represents a meaningful leap in what machines can do with human knowledge. I am not dismissing that.

But a clear-eyed assessment requires acknowledging what LLMs are, structurally. They are statistical models trained on text - vast quantities of it. They learn the patterns of how language is used to describe the world. They do not learn the world itself. This distinction sounds philosophical until you try to deploy an LLM in a Physical AI context, at which point it becomes an engineering constraint with significant consequences.

The sustainability case against LLMs is compelling on its own. According to the International Energy Agency's 2025 Energy and AI report, data centre electricity consumption is projected to roughly double from 485 TWh in 2025 to 950 TWh by 2030 - with AI-focused facilities driving the majority of that growth.^[1] Electricity demand from AI-focused data centres alone surged 50% in 2025.^[2] A single large data centre can consume as much electricity as 100,000 households; the largest currently under construction could match the consumption of 2 million.^[1]

Recent empirical benchmarking adds further texture. Research published in 2025 found that the most energy-intensive reasoning models consume over 33 watt-hours per long prompt - more than 70 times the consumption of their leaner counterparts - and that scaling a single model to hundreds of millions of daily queries results in annual electricity use comparable to powering tens of thousands of homes.^[3] These are not edge cases. They are the operational reality of deploying LLMs at scale.
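As a rough check on these figures, the arithmetic can be sketched in a few lines of Python. The 485 and 950 TWh endpoints and the 33 watt-hour long-prompt figure come from the text above. The daily query volume and the per-household consumption are illustrative assumptions rather than figures from the cited sources, so the output shows how the numbers scale and is not a measurement.

```python
# Back-of-envelope energy arithmetic. Values marked ASSUMPTION are
# illustrative placeholders, not figures from the cited reports.

def implied_annual_growth(start_twh: float, end_twh: float, years: int) -> float:
    """Compound annual growth rate implied by two endpoint values."""
    return (end_twh / start_twh) ** (1 / years) - 1

def annual_energy_gwh(wh_per_query: float, queries_per_day: float) -> float:
    """Annual inference energy in GWh for a constant daily query volume."""
    return wh_per_query * queries_per_day * 365 / 1e9

def household_equivalents(energy_gwh: float, kwh_per_household: float) -> float:
    """Number of households whose annual use matches the given energy."""
    return energy_gwh * 1e6 / kwh_per_household

growth = implied_annual_growth(485, 950, 5)   # 2025 -> 2030, from the text
energy = annual_energy_gwh(
    wh_per_query=33,            # long-prompt figure quoted in the text
    queries_per_day=1_000_000,  # ASSUMPTION: hypothetical daily volume
)
homes = household_equivalents(energy, kwh_per_household=10_000)  # ASSUMPTION

print(f"Implied growth: {growth:.1%} per year")
print(f"Annual energy: {energy:.2f} GWh, roughly {homes:,.0f} households")
```

Because the energy term is linear in query volume, multiplying the assumed volume by one hundred multiplies the household equivalent by the same factor.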

More consequentially, the energy and compute cost is not yielding proportionate returns. The capability gains from scaling LLMs - throwing more parameters, more data, and more compute at the problem - are showing diminishing marginal returns. The architecture is approaching a ceiling that cannot be raised simply by making the model bigger.

For Physical AI specifically, this creates a structural problem that compute alone cannot resolve.

2. What LLMs Cannot Do

To understand why world models matter, it helps to be precise about what LLMs cannot do - not as a limitation to be patched, but as a structural property of what they are.

LLMs predict the next token in a sequence based on patterns learned from text. They are, at their core, extraordinarily sophisticated text completion engines. When they appear to reason, they are leveraging statistical regularities in language that correlate with correct reasoning - not an internal model of cause and effect. When they hallucinate, it is not a bug in the conventional sense. It is the inevitable output of a system that has no mechanism to distinguish between a plausible-sounding claim and a true one, because it has no grounded representation of reality against which to check.
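A deliberately tiny illustration of that point is a bigram text completer. It is far simpler than a transformer, and the corpus is hypothetical, but the shape of the objective is the same: continue the sequence with whatever the training text makes most likely. Nothing in it can check the claim against the actual state of the valve.

```python
from collections import Counter, defaultdict

def train_bigram(corpus: list[str]) -> dict[str, Counter]:
    """Count which word follows which across the training sentences."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

def complete(counts: dict[str, Counter], prompt: str, max_words: int = 5) -> str:
    """Greedily append the most frequent follower of the last word."""
    words = prompt.split()
    for _ in range(max_words):
        followers = counts.get(words[-1])
        if not followers:
            break  # no statistics for this word, so stop generating
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

# Hypothetical toy corpus: the majority pattern wins regardless of truth.
corpus = [
    "the valve is open",
    "the valve is open",
    "the valve is closed",
]
model = train_bigram(corpus)
completion = complete(model, "the valve")
print(completion)
```

The completer reports the valve as open because that is the more frequent phrasing in its training text, not because any sensor says so.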

This matters profoundly for Physical AI. Consider the Sense - Reason - Act framework that underpins the Synaptec Physical AI and AIoT Framework. Each pillar depends on something that LLMs cannot provide:

Sense requires a system that can ingest and interpret high-fidelity real-world data - not language about the world, but sensor streams, video feeds, LiDAR returns, and environmental telemetry. LLMs were not designed for this.

Reason requires causal inference - the ability to model what will happen if an action is taken, to simulate consequences, to plan across time. Predicting the next token is not causal reasoning.

Act requires a feedback loop between the system's predictions and the physical consequences of its actions. LLMs are stateless between inferences. They cannot learn from what they do in the world.

A system that cannot close the loop between prediction and consequence cannot get better through operation. It can only be retrained - expensively, periodically, centrally - on new text describing what happened. That is not intelligence operating in the physical world. That is a very sophisticated lookup table with a refresh cycle.
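What closing the loop means can be sketched with a toy one-dimensional controller. The plant dynamics, the alternating setpoint, and the update rule below are all assumptions chosen for brevity, not a description of any particular world model. The point is only that each prediction error observed after acting is fed straight back into the model.

```python
def run_closed_loop(true_gain: float, steps: int = 20, learning_rate: float = 0.5):
    """Track an alternating setpoint while learning the actuator's real gain."""
    estimated_gain = 1.0   # initial, deliberately wrong belief about the actuator
    state = 0.0
    errors = []
    for step in range(steps):
        target = 10.0 if step % 2 == 0 else 0.0   # hypothetical setpoint schedule
        # Reason: pick the action the current model says will reach the target.
        action = (target - state) / estimated_gain
        predicted = state + estimated_gain * action
        # Act, then sense the real outcome (hypothetical linear plant).
        observed = state + true_gain * action
        # Close the loop: use the prediction error to correct the model.
        error = observed - predicted
        errors.append(abs(error))
        if action != 0.0:
            estimated_gain += learning_rate * error / action
        state = observed
    return estimated_gain, errors

gain, errors = run_closed_loop(true_gain=0.4)
print(f"learned gain: {gain:.4f}")
print(f"first error: {errors[0]:.3f}, last error: {errors[-1]:.6f}")
```

In this sketch the estimate improves purely through operation. A system without the update step would repeat the same error on every cycle until someone retrained it.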

3. What World Models Actually Are

The concept of a world model is not new. The term has lineage in reinforcement learning research going back decades, describing systems that build internal representations of their environment and use those representations to plan and predict. What is new is that the field is converging on architectures that make this viable at scale - and that the case for world models as the foundation for advanced AI is being made with increasing rigour.

The most substantive articulation of this position comes from Yann LeCun, whose 2022 position paper "A Path Towards Autonomous Machine Intelligence" argued that intelligence fundamentally requires predictive models of the world rather than pure pattern matching over tokens.^[4] LeCun's central proposal - the Joint Embedding Predictive Architecture (JEPA) - departs from autoregressive generation in a critical respect: rather than predicting raw outputs in pixel or token space, JEPA predicts in abstract representation space. It learns what is relevant about the world, not a pixel-perfect reconstruction of it.^[5]

The distinction is important. A generative model asked to predict the next frame of a video must reconstruct every pixel - most of which are irrelevant to understanding what is happening. A world model using JEPA predicts the abstract state of the scene: what objects are present, how they are moving, what physical relationships hold between them. It learns to model the causal structure of reality, not its surface texture.
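The difference between the two prediction targets can be shown with a toy example. This is not an implementation of JEPA: the encoder below is hand-written instead of learned, and the one-dimensional "frame", its width, and the noise range are assumptions made for illustration. It only shows why an error measured in representation space can ignore unpredictable surface detail that a pixel-space error must pay for.

```python
import random

WIDTH = 32  # hypothetical 1-D "frame" width

def render(position: int, rng: random.Random) -> list[float]:
    """One bright object on a background of unpredictable texture."""
    frame = [rng.uniform(0.0, 0.3) for _ in range(WIDTH)]
    frame[position] = 1.0
    return frame

def encode(frame: list[float]) -> int:
    """Hand-written stand-in for a learned encoder: keep only the object position."""
    return max(range(WIDTH), key=lambda i: frame[i])

def mse(a: list[float], b: list[float]) -> float:
    """Mean squared error between two equally sized frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
velocity = 2
current = render(5, rng)
actual_next = render(5 + velocity, rng)

# Pixel-space prediction: shift the whole current frame by the known velocity.
pixel_prediction = [0.0] * velocity + current[:-velocity]
pixel_error = mse(pixel_prediction, actual_next)

# Representation-space prediction: predict only the next abstract state.
latent_prediction = encode(current) + velocity
latent_error = (latent_prediction - encode(actual_next)) ** 2

print(f"pixel-space error: {pixel_error:.4f}")
print(f"representation-space error: {latent_error}")
```

The object's motion is predicted correctly in both cases. The pixel-space error is non-zero only because the background texture cannot be predicted, which is precisely the detail the abstract representation discards.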

Meta's V-JEPA 2, released in mid-2025, extended this architecture to video understanding and physical reasoning at scale - trained on over one million hours of internet video and adapted with robot trajectory data to enable planning on real robot arms.^[6] In early 2026, LeCun left Meta to found AMI Labs, which raised $1.03 billion at a $3.5 billion valuation - Europe's largest seed round - and subsequently published formal mathematical proofs establishing the conditions under which JEPA architectures reliably recover the hidden causal variables driving observations.^[7]

Importantly, this is not a single-company bet. The convergence on world models as the architecture for physical intelligence is now industry-wide. Google DeepMind released Genie 3 in August 2025 - the first real-time interactive general-purpose world model - which generates navigable 3D environments at 24 frames per second without hard-coded physics engines, learning how the world works through training rather than rule specification.^[8] Fei-Fei Li's World Labs shipped Marble, a platform for generating physically coherent 3D spatial environments, and is reportedly in discussions at a $5 billion valuation.^[8] NVIDIA launched Cosmos 3 on 1 June 2026 - an open physical AI foundation model trained on 20 trillion tokens of multimodal data, including nearly a billion images, 400 million real and synthetic videos, and action data from humans and robots, explicitly designed to model how machines move and act in the world rather than how scenes look.^[9] The approaches differ - JEPA-based prediction and planning, generative interactive environments, and open world foundation models - but they share the same foundational conviction: language alone is not a sufficient architecture for physical intelligence.

4. Why This Is an AIoT and Physical AI Imperative

For organisations operating in the Physical AI and AIoT space, the shift toward world models is not a theoretical preference. It is the architecture that maps onto what physical systems actually require.

Consider what sensor-rich AIoT deployments generate: continuous streams of time-series data from environmental sensors, video feeds from cameras and drones, positional data from LiDAR and GPS, operational telemetry from industrial equipment. This data is not language. It is a direct record of physical state - precisely the kind of signal that world models are designed to learn from, and that LLMs have no native capacity to process or reason over.

The edge device - a sensor node, an industrial controller, a smart infrastructure component - is not just a deployment target for AI. It is an epistemic anchor. It is where the model's predictions meet physical consequence. A warehouse robot that misjudges a pick and drops a component gets immediate, unambiguous feedback. A predictive maintenance system that correctly flags a bearing failure has its model of the machine confirmed by the outcome. These feedback loops are the training signal that world models require and that LLMs, retrained centrally on text, cannot leverage.

This also reframes the data brokerage argument I made in the previous article. The sensor data that AIoT deployments generate - operational telemetry, environmental state, asset performance streams - is not just an input to analytics. It is, in a world model paradigm, the training substrate itself. The organisation that has invested in a governed, well-structured internal data brokerage is not just better positioned for analytics and reporting. It is sitting on the raw material for continuously improving physical intelligence. Every governed dataset is a potential input to a world model that learns from it.
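One way to picture sensor data as training substrate is the step that turns a telemetry log into state, action, and next-state examples, the form in which dynamics models are commonly trained. The schema, field names, and gap threshold below are hypothetical. A real brokerage would carry far richer metadata, such as lineage, units, and access policy.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reading:
    """One governed telemetry record (hypothetical schema)."""
    timestamp: float            # seconds since start of log
    state: tuple[float, ...]    # e.g. temperature, vibration
    action: float               # control setpoint applied at this time

@dataclass(frozen=True)
class Transition:
    """A single training example: what happened after an action was taken."""
    state: tuple[float, ...]
    action: float
    next_state: tuple[float, ...]

def to_transitions(readings: list[Reading], max_gap_s: float = 2.0) -> list[Transition]:
    """Pair consecutive readings, skipping pairs separated by a data gap."""
    ordered = sorted(readings, key=lambda r: r.timestamp)
    return [
        Transition(a.state, a.action, b.state)
        for a, b in zip(ordered, ordered[1:])
        if b.timestamp - a.timestamp <= max_gap_s
    ]

readings = [
    Reading(0.0, (20.0, 0.1), 0.5),
    Reading(1.0, (20.4, 0.1), 0.5),
    Reading(2.0, (20.9, 0.2), 0.0),
    Reading(9.0, (19.5, 0.1), 0.0),   # sensor was offline, so this pair is dropped
]
transitions = to_transitions(readings)
print(len(transitions), "usable transitions")
```

The gap check is where governance shows up in practice: without reliable timestamps and known outages, the pairs fed to the model would silently misrepresent cause and effect.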

The Synaptec Physical AI and AIoT Framework's three pillars map directly onto what world model architectures require:

The Sense pillar's emphasis on sensor ecosystem quality, data sovereignty, and connectivity becomes the data pipeline that feeds world model training.

The Reason pillar's focus on AI model suitability, digital twins, and adaptive learning is precisely the domain where world models outperform LLMs - causal reasoning, simulation, and consequence prediction.

The Act pillar's feedback loops and continuous improvement mechanisms are the operational manifestation of what world models are designed to do - close the loop between prediction and physical reality.

Organisations that have built their AI strategy around LLMs for physical applications are not necessarily wrong about the ambition. They are likely wrong about the architecture.

5. Sovereignty and the Edge - The Infrastructure Layer That Makes It Real

Architecture is only half the argument. The other half is where that architecture runs, and who controls it.

Sovereign AI - broadly defined as a nation's or organisation's capacity to develop, deploy, and govern AI systems using infrastructure it controls, data it owns, and models it can audit - has moved from policy discussion to strategic investment at a speed that many boards have not yet registered. By 2026, global spending on sovereign AI infrastructure is projected to surpass $100 billion, with countries across Europe, Asia, and Latin America treating compute capacity as critical national infrastructure on a par with energy and communications networks.^[10] South Korea announced plans to deploy more than 260,000 GPUs across sovereign clouds and AI factories. The EU is expanding its network of public AI Factories built on EuroHPC supercomputers. These are not pilot projects. They are national infrastructure programmes.^[11]

The sovereign AI imperative connects directly to the world model argument in a way that is worth making explicit. If world models derive their intelligence from physical, local feedback loops - from the specific sensor data generated by a specific facility, city, or operational environment - then the data those models learn from is inherently sovereign in character. It reflects local conditions, local assets, and local operational realities. Routing that data to a foreign cloud provider for processing, and receiving inference results back, is not just a latency problem. It is a data sovereignty problem, a competitive intelligence problem, and increasingly a regulatory problem.

In my earlier article on Data Residency, Data Sovereignty, and the Rise of Inference Sovereignty, I argued that the most consequential question for organisations is not where their data is stored, but where inference happens - because inference is where data becomes decision. That argument becomes even more pointed in a Physical AI context. If the inference is happening in a cloud jurisdiction outside your control, using a model you did not train and cannot audit, acting on sensor data that describes your physical operations, then you have outsourced both the intelligence and the control of your physical systems. That is not a technology risk. It is a governance risk that belongs on the board agenda.

Edge models are the architectural mechanism that resolves this. The rapid maturation of Small Language Models and purpose-built edge AI frameworks has fundamentally changed what is deployable on-device. Where 7 billion parameters once seemed a minimum for coherent reasoning, sub-billion models now handle many practical inference tasks effectively. Model families including Gemma 4, Qwen 3.5, Phi-4, and Llama 3.2 all target efficient on-device deployment across mobile, embedded, and industrial hardware, with the smallest variants running in under 5 GB of RAM at 4-bit quantisation.^[12] Quantised models running on neural processing units in ruggedised industrial hardware are delivering sub-10 millisecond inference at power envelopes of 15 to 40 watts - without a cloud dependency, without per-token pricing, and without the latency that makes cloud inference unsuitable for real-time physical control.^[13]
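The memory effect of quantisation follows from simple arithmetic: weight storage is roughly parameter count times bits per weight. The sketch below uses generic model sizes, not the specific families named above. It counts weights only, so activations, the key-value cache, and quantisation metadata would add to these totals.

```python
def weight_memory_gb(parameters: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return parameters * bits_per_weight / 8 / 1e9

# Generic illustrative sizes, not tied to any named model family.
for name, params in [("7B", 7e9), ("3B", 3e9), ("1B", 1e9)]:
    fp16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"{name}: {fp16:.1f} GB at 16-bit, {int4:.1f} GB at 4-bit")
```

On this estimate, a 7-billion-parameter model drops from about 14 GB of weights at 16-bit to about 3.5 GB at 4-bit, which is what brings it within reach of embedded hardware.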

For Physical AI deployments, this matters across three dimensions:

Latency. A drone avoiding a collision, a robotic arm adjusting its grip, an autonomous vehicle responding to an obstacle - none of these can tolerate the round-trip latency of a cloud inference call. The intelligence must be local. Edge models make it local.

Resilience. Physical systems operate in environments where connectivity cannot be guaranteed - remote infrastructure, underground facilities, maritime operations, disaster response. A Physical AI system that requires cloud connectivity to reason is a system that fails precisely when reliable operation matters most. On-device inference removes the dependency.

Sovereignty. Sensor data describing physical operations stays within the operational perimeter. Inference happens locally. The model can be audited, updated, and controlled by the organisation that deploys it. This is what inference sovereignty looks like in practice - and it is the only version of Physical AI deployment that organisations with genuine data governance obligations can responsibly adopt.
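The latency dimension can be made concrete by asking how far a moving system travels while it waits for an answer. The drone speed and the cloud round-trip time below are assumptions chosen for illustration. The edge figure uses the upper end of the sub-10 millisecond range quoted earlier.

```python
def distance_during_latency_m(speed_m_per_s: float, latency_ms: float) -> float:
    """Distance a moving system covers before an inference result arrives."""
    return speed_m_per_s * latency_ms / 1000

DRONE_SPEED_M_PER_S = 15.0   # ASSUMPTION: illustrative cruise speed

scenarios_ms = {
    "edge inference": 10.0,      # upper end of the sub-10 ms figure in the text
    "cloud round trip": 150.0,   # ASSUMPTION: illustrative network + inference time
}

blind_distance = {
    name: distance_during_latency_m(DRONE_SPEED_M_PER_S, latency)
    for name, latency in scenarios_ms.items()
}
for name, metres in blind_distance.items():
    print(f"{name}: {metres:.2f} m travelled before a decision is available")
```

Under these assumptions the drone covers 0.15 m while waiting on the edge model and 2.25 m while waiting on the cloud, and the cloud figure becomes unbounded whenever connectivity drops.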

The combination of world model architecture and edge deployment is not just technically coherent. It is strategically aligned in a way that centralised LLM deployment cannot be. The intelligence is local because the physical reality it models is local. The sovereignty is preserved because the inference stays on-device. The feedback loops that make world models learn are closed at the edge, where the consequences of predictions are immediately observable.

6. Sustainability - The Full Picture

The sustainability argument against centralised LLMs is typically framed in energy terms, and the numbers are striking enough to stand on their own. But there is a deeper sustainability concern that matters more for strategic decision-makers: architectural sustainability.

An LLM is a static snapshot of knowledge at the point of training. It does not update through use. It cannot incorporate new information without expensive retraining. Deploying it in a physical environment where conditions change - seasonal variation in an agricultural setting, equipment degradation in an industrial one, shifting traffic patterns in a smart city - means the model's understanding of the world becomes stale the moment deployment begins. The world moves on. The model does not.

A world model, by contrast, is designed to update continuously from the feedback loops it operates within. Each interaction with the physical environment refines its predictive representation. The model deployed in a smart building today is a better model of that building tomorrow - not because it has been retrained on text about buildings, but because it has accumulated a direct operational history of this building's behaviour.
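The contrast between a frozen snapshot and a continuously updated estimate can be sketched with a slowly drifting quantity, such as a building's baseline temperature. The drift rate and the exponential moving average are assumptions standing in for real environmental change and real continual learning. The sketch is not a world model, only an illustration of staleness.

```python
def simulate_drift(steps: int = 200, drift_per_step: float = 0.05, alpha: float = 0.2):
    """Compare a frozen estimate with an online estimate as a quantity drifts."""
    true_value = 20.0
    static_estimate = true_value   # correct at deployment, never updated
    online_estimate = true_value   # refined from every new observation
    static_error = online_error = 0.0
    for _ in range(steps):
        true_value += drift_per_step               # the environment moves on
        static_error += abs(true_value - static_estimate)
        online_error += abs(true_value - online_estimate)
        # Exponential moving average: move part-way toward what was observed.
        online_estimate += alpha * (true_value - online_estimate)
    return static_error / steps, online_error / steps

static_mae, online_mae = simulate_drift()
print(f"frozen model mean error: {static_mae:.3f}")
print(f"online model mean error: {online_mae:.3f}")
```

The frozen estimate's error grows for as long as the drift continues, while the online estimate settles at a small, bounded lag set by the update rate.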

Edge deployment compounds this advantage. A model running on-device is not just more efficient in energy terms - though it is, dramatically so relative to cloud inference at scale. It is also more efficient epistemically: it learns from the specific environment it operates in, rather than a generalised representation of all environments. A world model running at the edge of a port facility learns that facility's rhythms, its equipment quirks, its seasonal load patterns. That specificity is not a limitation. It is the source of its value.

This is what genuine architectural sustainability looks like: a system whose accuracy improves with operation rather than degrades with time, whose energy footprint scales with local deployment rather than centralised compute demand, and whose governance is transparent because the inference happens where the decision-maker can see it.

For boards and technology leaders evaluating AI investments, this distinction is material. The total cost of maintaining a centralised LLM-based Physical AI system includes not just inference energy but the ongoing cost of retraining, the risk cost of decisions made on stale models, the latency cost of cloud-dependent real-time control, and the sovereignty cost of inference that happens outside your jurisdiction. World models at the edge do not eliminate cost - but they change the cost structure in ways that compound favourably over time.

7. The Strategic Implications

The shift from language models to world models - and from centralised to edge deployment - has practical implications that I want to make explicit, because the risk is that this reads as an architectural argument rather than a strategic one.

First, AI infrastructure investment decisions need to account for this transition. Organisations building or procuring AI capability for physical applications - smart infrastructure, predictive maintenance, autonomous systems, operational intelligence - should be asking their vendors and technology partners how their architecture handles physical feedback loops, causal reasoning, on-device inference, and continuous learning. If the answer is "we fine-tune an LLM on your data and run it in our cloud," that is a materially different capability from a world model at the edge, and it should be evaluated accordingly.

Second, data strategy, AI strategy, and sovereignty strategy need to be unified. As I argued in the data brokerage article, the quality, governance, and accessibility of an organisation's operational data is the single largest determinant of AI performance. In a world model paradigm, this becomes even more direct: the governed sensor data you generate is the training signal for the intelligence that operates your physical systems. Organisations that treat sensor data as an operational byproduct will find themselves unable to leverage the next generation of physical AI architectures. Those that treat it as a governed, sovereign asset will find themselves holding exactly what those architectures require.

Third, edge capability is strategic infrastructure, not a technical preference. LLM-based AI concentrates intelligence in the cloud and pushes outputs to the edge. World model-based Physical AI concentrates learning at the edge - where the physical consequences of predictions are observable - and aggregates understanding upward. Organisations building edge compute capability today are not just solving a latency or connectivity problem. They are building the epistemic and sovereign infrastructure for the next generation of AI.

Finally, the transition will not be instantaneous. LLMs will remain useful - particularly for language tasks, knowledge retrieval, and interfaces that require natural language interaction. The argument is not that they disappear, but that they are not the foundation for Physical AI. The organisations that understand this distinction now - that invest in the data governance, the edge infrastructure, and the architectural literacy required to work with world models - will hold a compounding advantage as the paradigm shifts.

Conclusion: The Intelligence Is in the Loop

The most important insight from the world model paradigm is deceptively simple: intelligence is not a property of a model sitting in a data centre. It is a property of the loop between a system and its environment - the capacity to predict, act, observe consequence, and refine. That loop is what biological intelligence runs on. It is what world models are designed to replicate. And it is what LLMs, by architectural design, cannot participate in.

Sovereign AI and edge deployment are not separate considerations sitting alongside this argument. They are its operational conclusion. If the intelligence must be grounded in local physical reality, it follows that the inference must be local. If the data that feeds the feedback loop is operationally sensitive, it follows that sovereignty must be preserved from sensor to decision. The architecture, the deployment model, and the governance framework converge on the same answer: intelligence at the edge, under your control, learning from your world.

We are at a moment where the AI field is beginning to acknowledge what the Physical AI community has understood for longer: that the challenge is not generating plausible text about the world. The challenge is building systems that understand the world well enough to act in it reliably, safely, and at scale - and that do so under governance frameworks that organisations can actually stand behind.

The language model era gave us remarkable tools for working with human knowledge. The world model era - running at the edge, under sovereign control, closing feedback loops with physical reality - will give us the architecture for machines that can genuinely participate in the world. For organisations building in the Physical AI and AIoT space, the strategic imperative is clear: invest in the data, the edge infrastructure, and the architectural understanding that positions you for a transition that is already underway.

The intelligence is not in the model. It is in the loop.

References

[1] International Energy Agency. (2025, April). Energy and AI. IEA. https://www.iea.org/reports/energy-and-ai; International Energy Agency. (2026, April). Key questions on energy and AI. IEA. https://www.iea.org/reports/key-questions-on-energy-and-ai
[2] International Energy Agency. (2026, April). Data centre electricity use surged in 2025 [News release]. IEA. https://www.iea.org/news/data-centre-electricity-use-surged-in-2025-even-with-tightening-bottlenecks-driving-a-scramble-for-solutions
[3] Luccioni, A., et al. (2025, May). How hungry is AI? Benchmarking energy, water, and carbon footprint of LLM inference (arXiv:2505.09598). arXiv. https://arxiv.org/abs/2505.09598; Ozcan, E., et al. (2025, July). Quantifying the energy consumption and carbon footprint of LLM inference (arXiv:2507.11417). arXiv. https://arxiv.org/abs/2507.11417
[4] LeCun, Y. (2022, June). A path towards autonomous machine intelligence, version 0.9.2. OpenReview. https://openreview.net/forum?id=BZ5a1r-kVsf
[5] Meta AI. (2023). I-JEPA: The first AI model based on Yann LeCun's vision for more human-like AI [Blog post]. Meta AI. https://ai.meta.com/blog/yann-lecun-ai-model-i-jepa/; Balestriero, R., & LeCun, Y. (2025, November). LeJEPA: Provable and scalable self-supervised learning without the heuristics (arXiv:2511.08544). arXiv. https://arxiv.org/abs/2511.08544
[6] Assran, M., et al. (2025, June). V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. Meta AI; Turing Post. (n.d.). What is joint embedding predictive architecture (JEPA)? https://www.turingpost.com/p/jepa
[7] AMI Labs. (2026, May). When does LeJEPA learn a world model? [arXiv preprint]. arXiv. As reported in: Daws, R. (2026, May 31). Yann LeCun's world model earns a formal proof. TechTimes. https://www.techtimes.com/articles/317452/20260531/yann-lecuns-world-model-earns-formal-proof-benchmark-finds-current-models-brittle.htm
[8] Built In. (2026, February). World models are the next big thing in AI. https://builtin.com/articles/ai-world-models-explained; TechTimes. (2026, June). Yann LeCun world models bet: AMI Labs stakes $1.03 billion against large language models. https://www.techtimes.com/articles/317928/20260606/yann-lecun-world-models-bet-ami-labs-stakes-103-billion-against-large-language-models.htm; Introl. (2026, January). World models race 2026 [Blog post]. https://introl.com/blog/world-models-race-agi-2026
[9] NVIDIA. (2026, June 1). NVIDIA launches Cosmos 3, the open frontier foundation model for physical AI [Press release]. NVIDIA Newsroom. https://nvidianews.nvidia.com/news/nvidia-launches-cosmos-3-the-open-frontier-foundation-model-for-physical-ai; Axios. (2026, June). Nvidia's Cosmos 3 open AI world model helps robots, autonomous vehicles. https://www.axios.com/2026/06/01/nvidia-ai-push-cosmos-3-world-model; NVIDIA Research. (2025, January). Cosmos world foundation model platform for physical AI (arXiv:2501.03575). arXiv. https://arxiv.org/abs/2501.03575
[10] RAISE Summit. (2026). Sovereign AI: Why nations are treating compute as critical infrastructure. https://www.raisesummit.com/post/sovereign-ai-compute-critical-infrastructure; Lawfare Media. (2024, November). Sovereign AI in a hybrid world: National strategies and policy responses. https://www.lawfaremedia.org/article/sovereign-ai-in-a-hybrid-world--national-strategies-and-policy-responses
[11] Swiss Institute of Artificial Intelligence. (2025, December). Sovereign AI is becoming public infrastructure. SIAI. https://siai.org/memo/2025/12/202512284707
[12] Digital Applied. (2026, April). Small language models business guide: Gemma, Phi, Qwen [Blog post]. https://www.digitalapplied.com/blog/small-language-models-business-guide-gemma-phi-qwen; Edge AI and Vision Alliance. (2026, January). On-device LLMs in 2026: What changed, what matters, what's next. https://www.edge-ai-vision.com/2026/01/on-device-llms-in-2026-what-changed-what-matters-whats-next/
[13] iFactory. (2026, March). Small language models at the edge: Transforming factory-floor AI for real-time applications [Blog post]. https://ifactoryapp.com/blog/small-language-models-slm-edge-factory-ai; ZEDEDA. (2026, January). 2026 predictions: How edge AI is reshaping industrial operations [Blog post]. https://zededa.com/blog/2026-predictions-how-edge-ai-is-reshaping-industrial-operations/
