Robotics Foundation Models in 2026: From RT-2 to Production

How robotics foundation models have progressed from research to production deployment in 2026, what they can and cannot do, and what they mean for industrial automation.

By Agentic Convergent

Three years ago, "foundation model" was a term used almost exclusively for language models. In 2026, it is being used seriously in the context of robot control — and the implications for industrial automation are substantial.

This article explains what robotics foundation models are, how the field has developed from early research to production deployment, and what they mean for manufacturers evaluating AI-based robot programming today.


What Is a Robotics Foundation Model?

In AI, a foundation model is a large model trained on broad data that can be adapted to many downstream tasks. GPT-4 is a foundation model for language tasks. CLIP is a foundation model for image-text correspondence.

A robotics foundation model extends this to physical manipulation: a large model trained on diverse robot trajectory data and perception data that can be adapted to control robots across many different tasks and environments.

The key distinction from traditional robot control: traditional approaches require task-specific programming for every new task. A foundation model, properly trained, can generalise — taking a natural language instruction or a video demonstration and generating robot actions for tasks it was not explicitly programmed to do.


The Research Progression

2022–2023: Proof of Concept

The concept of using large-scale language and vision models for robot control emerged from concurrent work at Google, Stanford, UC Berkeley, and CMU. SayCan (2022) showed that large language models could decompose household tasks into robot-executable steps. RT-1 (2022) demonstrated that transformer architectures could learn to control robots from large datasets of demonstrations.

These were proof-of-concept results — impressive within their training distribution, brittle outside it.

2023: RT-2 and Cross-Modal Transfer

Google DeepMind's RT-2 (2023) was a significant step. By co-training on internet-scale vision-language data alongside robot trajectory data, RT-2 gained the ability to apply concepts from the broader world (learned from text and images) to physical manipulation.

The result was a robot that could follow instructions such as "pick up the object that represents danger", inferring what "danger" means from visual context, without having been trained on that specific instruction. It was one of the first demonstrations at scale of cross-modal transfer: generalising from language concepts to physical action.

2023–2024: Open X-Embodiment and Cross-Platform Generalisation

Open X-Embodiment, first released in late 2023 and presented at ICRA 2024, expanded the scope dramatically: a dataset pooled from 22 different robot embodiments and covering 527 skills, used to train a single model architecture across platforms. The key finding was that training on diverse robot data improved performance even on individual robot types. Data diversity, not data concentration, was what mattered.

This was the first clear evidence that a single foundation model could transfer across robot brands and form factors.

2024–2025: OpenVLA and Open Source

OpenVLA (2024) brought the vision-language-action (VLA) model architecture into open source, making it available for any company to inspect, use, and adapt. This addressed the black-box objection that enterprise buyers raised about proprietary AI systems.
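
One detail that makes the VLA idea concrete is how robot actions become something a language-style model can emit. RT-2 and OpenVLA discretise each continuous action dimension into 256 bins so that an action can be written as a short sequence of tokens. The sketch below illustrates only that round trip. The function names, the [-1, 1] action range, and the sample action are assumptions made for illustration, not code from either project.

```python
from typing import List, Sequence

N_BINS = 256  # bins per action dimension, the count used by RT-2 and OpenVLA


def tokenize(action: Sequence[float], low: float, high: float,
             n_bins: int = N_BINS) -> List[int]:
    """Map each continuous action value to an integer bin index in [0, n_bins - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)        # out-of-range values are clipped
        scaled = (clipped - low) / (high - low)     # now in [0, 1]
        tokens.append(min(int(scaled * n_bins), n_bins - 1))
    return tokens


def detokenize(tokens: Sequence[int], low: float, high: float,
               n_bins: int = N_BINS) -> List[float]:
    """Map each bin index back to the centre of its bin."""
    return [low + (t + 0.5) / n_bins * (high - low) for t in tokens]


# Hypothetical 7-dimensional action (e.g. 6 pose deltas plus gripper), normalised to [-1, 1].
# The last value is deliberately out of range to show clipping.
action = [0.0, 0.5, -1.0, 1.0, 0.123, -0.777, 2.0]
tokens = tokenize(action, -1.0, 1.0)
recovered = detokenize(tokens, -1.0, 1.0)

for a, t, r in zip(action, tokens, recovered):
    print(f"action {a:+.3f} -> token {t:3d} -> recovered {r:+.4f}")
```

The round trip loses at most half a bin width per dimension, which is one reason the achievable precision of such a policy depends on how the action range is normalised.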

MIT CSAIL's PhysicsGen (2025) demonstrated that 24 human demonstrations could be expanded into thousands of robot training examples using physics simulation — achieving a 60% improvement in task success rate without additional human effort.

2025–2026: Production Deployment

The current state: VLA models are in production in commercial products. They power demo-to-deploy workflows where a worker demonstrates a task on video and the model generates a robot program. They are in industrial test environments at companies across Europe, North America, and Asia.

They are not yet in every factory. But they are past the research-only stage.


What Production-Grade Robotics Foundation Models Can Do in 2026

Generalise to New Objects

A model trained on demonstrations with one set of objects can typically handle new objects with similar visual and physical properties. A model that has learned "pick up the metal cylinder, place it in the fixture" can generalise to cylinders of different sizes without being retrained.

The limits: very different object geometries, unusual surface properties, or significantly different weights may require additional demonstrations or fine-tuning.

Understand Natural Language Instructions

Models like RT-2 and its successors can receive instructions in plain English and translate them to robot actions. "Put the red cap on the blue bottle" does not require a programmer to translate it into Cartesian coordinates — the model handles that.

This is the direct enabler of no-code programming: the barrier of needing robot-specific language to communicate task intent is removed.

Handle Variation in Real Environments

Properly trained VLA models are more robust to real-world variation than traditional programmed trajectories. A traditional trajectory specifying exact waypoints fails when a part is 3 cm off-position. A VLA model can often adapt to position variation within a reasonable range.

The limits: significant deviation from training-distribution conditions (very different lighting, a completely different gripper configuration, unfamiliar part geometry) still degrades performance.
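
The contrast between a fixed waypoint and a policy that conditions on what it observes can be shown with a toy example. This is not a VLA model: the "policy" here simply follows the observed part position, and only within an assumed trained range. The taught position, grasp tolerance, and trained offset range are all hypothetical values chosen for illustration.

```python
import math
from typing import Tuple

XY = Tuple[float, float]

TAUGHT_PICK_XY: XY = (0.40, 0.10)   # hypothetical taught pick position, metres
GRASP_TOLERANCE_M = 0.01            # hypothetical: grasp works within 1 cm of the part
MAX_TRAINED_OFFSET_M = 0.05         # hypothetical: offsets covered by the training data


def fixed_waypoint_target(observed_xy: XY) -> XY:
    """Open-loop trajectory: always goes to the taught position, ignoring the observation."""
    return TAUGHT_PICK_XY


def observation_conditioned_target(observed_xy: XY) -> XY:
    """Toy stand-in for a learned policy: tracks the part, but only within the trained range."""
    dx = observed_xy[0] - TAUGHT_PICK_XY[0]
    dy = observed_xy[1] - TAUGHT_PICK_XY[1]
    offset = math.hypot(dx, dy)
    if offset <= MAX_TRAINED_OFFSET_M:
        return observed_xy
    # Outside the trained range the toy policy saturates at the range boundary.
    scale = MAX_TRAINED_OFFSET_M / offset
    return (TAUGHT_PICK_XY[0] + dx * scale, TAUGHT_PICK_XY[1] + dy * scale)


def grasp_succeeds(target_xy: XY, part_xy: XY) -> bool:
    return math.dist(target_xy, part_xy) <= GRASP_TOLERANCE_M


for part_xy in [(0.405, 0.10), (0.43, 0.10), (0.50, 0.10)]:
    fixed_ok = grasp_succeeds(fixed_waypoint_target(part_xy), part_xy)
    adaptive_ok = grasp_succeeds(observation_conditioned_target(part_xy), part_xy)
    print(f"part at {part_xy}: fixed={fixed_ok}, observation-conditioned={adaptive_ok}")
```

With these assumed numbers, both approaches handle a 5 mm shift, only the observation-conditioned one handles the 3 cm shift, and neither handles a 10 cm shift that falls outside the trained range.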

Work Across Multiple Robot Brands

Models trained on multi-embodiment data (like Open X-Embodiment) can, with appropriate fine-tuning, transfer across robot brands. A model that has learned to pick-and-place on a UR can be adapted to ABB with significantly less data than training from scratch.

This matters for industrial buyers who run multi-brand environments or plan to expand to additional robot brands.


What Production-Grade Robotics Foundation Models Cannot Do in 2026

Sub-Millimetre Precision Tasks

Foundation model policies in current production systems typically operate at millimetre-level precision. Tasks requiring sub-millimetre positioning (electronics assembly, precision press-fitting) generally still benefit from traditional programmed trajectories with calibrated tooling.

This is a current limitation, not a fundamental one — research systems are already achieving finer precision. Production-grade sub-millimetre VLA control is likely on a 2–4 year horizon.

Novel Gripper Configurations

Models need gripper-specific training data. An unusual custom end-effector that the model has not encountered requires additional data collection or fine-tuning. Standard two-finger parallel grippers and suction cups have the most training data; custom tooling requires more work.

Safety-Critical Decision Making

Foundation models are not safety systems. A model's output trajectory must be validated before deployment and monitored during operation. The model makes probabilistic predictions — it is not deterministic in the sense that a programmed PLC is. Safety validation by qualified engineers is not replaced by foundation model deployment; it is still required.
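
As a rough illustration of what "validated before deployment" can mean in software, the sketch below checks a model-proposed Cartesian trajectory against a workspace box and a speed limit before anything is sent to the robot. The limits, the time step, and the function name are hypothetical. A check like this complements, and does not replace, a certified safety system and review by qualified engineers.

```python
import math
from typing import List, Sequence, Tuple

Waypoint = Tuple[float, float, float]  # x, y, z in metres

# Hypothetical limits; real values come from the cell's risk assessment.
WORKSPACE_MIN: Waypoint = (0.2, -0.4, 0.0)
WORKSPACE_MAX: Waypoint = (0.8, 0.4, 0.6)
MAX_SPEED_M_S = 0.25
DT_S = 0.1  # assumed fixed time between consecutive waypoints


def validate_trajectory(waypoints: Sequence[Waypoint]) -> List[str]:
    """Return a list of human-readable violations; an empty list means all checks passed."""
    violations = []
    for i, point in enumerate(waypoints):
        outside = any(c < lo or c > hi
                      for c, lo, hi in zip(point, WORKSPACE_MIN, WORKSPACE_MAX))
        if outside:
            violations.append(f"waypoint {i} outside workspace: {point}")
    for i in range(1, len(waypoints)):
        speed = math.dist(waypoints[i - 1], waypoints[i]) / DT_S
        if speed > MAX_SPEED_M_S:
            violations.append(f"segment {i - 1}->{i} too fast: {speed:.2f} m/s")
    return violations


safe = [(0.40, 0.0, 0.2), (0.41, 0.0, 0.2), (0.42, 0.0, 0.2)]
unsafe = [(0.40, 0.0, 0.2), (0.50, 0.0, 0.2), (0.90, 0.0, 0.2)]

for name, trajectory in [("safe", safe), ("unsafe", unsafe)]:
    problems = validate_trajectory(trajectory)
    print(name, "OK" if not problems else problems)
```

The design choice worth noting is that the validator is deterministic and independent of the model: whatever the policy proposes, the same fixed rules decide whether it may run.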

Fully Autonomous New-Task Learning

In 2026, using a VLA model to handle a genuinely novel task still typically requires some human demonstration data. The vision of a robot that watches a human do a task once and immediately executes it reliably is closer than it was — but still not fully general. Some tasks need multiple demonstrations; some need explicit fine-tuning.


What This Means for Industrial Buyers

The Opportunity

Foundation model-based robot programming reduces the specialist requirement for task setup. When the programming interface is demonstration-plus-language rather than code-plus-pendant, the person who does the task can also teach the robot to do it. This is the central opportunity: removing the bottleneck of specialist dependency from high-mix industrial automation.

The Due-Diligence Question

When a vendor says "powered by AI" or "uses foundation models," ask specifically: which model architecture? Trained on what data? What is the performance benchmark on tasks similar to mine? The term "foundation model" is being applied to a wide range of systems, some of which are genuinely sophisticated and some of which are marketing language attached to rule-based systems.
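
When a vendor does quote a benchmark, the number of trials behind it matters as much as the headline success rate. One common way to express that uncertainty is the Wilson score interval for a proportion. The sketch below is a minimal version, assuming independent trials and a 95% confidence level (z = 1.96). The trial counts are made-up examples, not results from any real system.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate; z = 1.96 gives roughly 95% confidence."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return centre - half_width, centre + half_width


# Two hypothetical vendor claims with the same headline rate of 90%.
for successes, trials in [(9, 10), (45, 50)]:
    low, high = wilson_interval(successes, trials)
    print(f"{successes}/{trials} successes: plausible range {low:.0%} to {high:.0%}")
```

The same 90% headline figure is far less informative over 10 trials than over 50, so it is reasonable to ask how many trials a quoted success rate is based on and how similar those trials were to the target task.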

The Honest Current State

A manufacturing buyer in 2026 should understand: this technology is real, it works for a meaningful subset of industrial tasks (pick-and-place, machine tending, assembly with standard grippers), and it is improving rapidly. It does not yet work for all industrial tasks, and honest vendors will tell you which tasks are in scope and which are not.

The manufacturers who benefit most from investing in this technology now are the ones who have a clear set of target tasks that fall within current capabilities — and who are willing to be early adopters in exchange for the competitive advantage that comes from learning before their competitors.


The Bottom Line

Robotics foundation models in 2026 are past the research-curiosity stage and into production deployment — for a defined and growing set of industrial applications. They are the technical foundation under the current generation of no-code robot programming tools, and they represent a genuine step change in how quickly and cheaply new robot tasks can be deployed.

They are not magic. They have well-defined current limitations. But for the right tasks, they deliver on the core promise: robot programming that a factory worker can do, in hours, without specialist help.

