Tech · About an 11-minute read

Will NVIDIA's Cosmos 2.5 world-model series make it into pet robots?


NVIDIA announced the latest versions of its World Foundation Models, called Cosmos Transfer 2.5, Cosmos Predict 2.5, and Cosmos Reason 2, at GTC 2026. The presentations show only factory transport robots and autonomous trucks, but at its core this is a general-purpose capability: recognizing objects and predicting the future according to the laws of physics. It is not a technology that can only be used in factories.

The question, then, is whether this can also be used for pet robots and household companion robots. More concretely: can it be shrunk to a size that fits on a small robot? The answer is “parts of it will make it in, but not all of it,” and where that line falls is quite interesting.

First, the data problem in physical AI

In image recognition and natural language processing, models can be trained on vast amounts of web data, but robots have no such luxury. Learning motions such as grasping, carrying, and avoiding objects requires demonstration data from a real physical environment.

Collecting data with real robots is time-consuming and costly, and covering environmental variation such as lighting, floor surfaces, and obstacle placement is unrealistic. This shortage of real-world data has been the fundamental bottleneck holding back the spread of robot AI.

Furthermore, conditions at home are harsher than in a factory. In a factory, lighting and shelf positions are fixed; in a home, every room has a different furniture arrangement, and children and pets move unpredictably. A pet robot needs a model that can make sense of this chaotic environment.

World models are one answer to this problem: generate large amounts of physically realistic synthetic data from simulation to compensate for the shortage of training data.

Cosmos 2.5 series models

The Cosmos platform consists of multiple models divided by purpose.

Cosmos Transfer 2.5

It is a model that generates data simulating a wide range of real-world conditions from simulation environments and 3D scan data. The architecture is ControlNet-based: “spatio-temporal control maps” dynamically tie the simulation to its real-world appearance while preserving the pre-trained knowledge of the base model.

The following formats are supported as input sources.

| Input type | Usage |
| --- | --- |
| Segmentation map | Identifying object boundaries and regions |
| Depth map | Understanding 3D structure |
| Edge map | Contour and shape information |
| LiDAR scan | Point-cloud data for autonomous-driving scenarios |
| HD map | Road and infrastructure structure |

Since variations in environment and lighting conditions can be generated automatically, it can fill in edge-case data that is difficult to cover by collecting from the real world alone.
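
To make the control-map idea concrete, the sketch below derives edge- and segmentation-style control inputs from a single RGB frame using OpenCV. The file name is a placeholder and the output dictionary is purely illustrative; NVIDIA has not published the exact input format Cosmos Transfer expects in this announcement.

```python
import cv2

# Load one RGB frame from a simulation render ("sim_render.png" is a placeholder).
frame = cv2.imread("sim_render.png")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Edge map: Canny contours approximate the "edge map" style of control input.
edge_map = cv2.Canny(gray, threshold1=100, threshold2=200)

# Crude segmentation stand-in: Otsu thresholding into two regions.
# A real pipeline would use a trained segmentation model instead.
_, seg_map = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Hypothetical bundle of control inputs; Cosmos Transfer's real API may differ.
controls = {"edge": edge_map, "segmentation": seg_map}
print({name: m.shape for name, m in controls.items()})
```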

Cosmos Predict 2.5

It takes text, video, and image sequences as multimodal input and predicts and generates the next state. Its Transformer-based architecture handles temporal consistency and frame interpolation, and it can generate sequences up to 30 seconds long. Multi-view output and custom camera layouts are supported.

The strength of this model is the efficiency of fine-tuning on domain-specific data. NVIDIA explains that additional training on your own environment data can improve accuracy by up to 10x over the baseline. Beyond producing simulation data tailored to a specific factory line, the same mechanism could in principle be fine-tuned to a household's furniture layout and daily activities.
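
To make the input/output shape of such a model concrete, here is a deliberately hypothetical sketch of what calling a Predict-style world model could look like. Every class and method name below is invented; none of this is NVIDIA's actual Cosmos SDK.

```python
# Hypothetical interface sketch; WorldModel and rollout() are invented names,
# not NVIDIA's actual Cosmos Predict API.
from dataclasses import dataclass
from typing import List

@dataclass
class Rollout:
    frames: List[bytes]    # predicted future frames (encoded images)
    horizon_seconds: float

class WorldModel:
    def rollout(self, frames: List[bytes], prompt: str,
                horizon_seconds: float) -> Rollout:
        # A real model would autoregressively generate future frames here,
        # conditioned on the past frames and the text prompt.
        return Rollout(frames=[], horizon_seconds=horizon_seconds)

model = WorldModel()
past = [b"<frame0>", b"<frame1>"]  # stand-in for a short camera clip
future = model.rollout(
    past,
    prompt="the robot moves toward the sofa, avoiding the toy on the floor",
    horizon_seconds=5.0,  # Predict 2.5 reportedly generates up to 30 s
)
print(future.horizon_seconds)
```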

Cosmos Reason 2

It is a model that acquires physical reasoning ability through a three-stage training pipeline.

```mermaid
graph TD
    A[Stage 1: Pre-training<br/>Process video frames<br/>with a Vision Transformer] --> B[Stage 2: Supervised fine-tuning<br/>Fine-tune on physical reasoning tasks]
    B --> C[Stage 3: Reinforcement learning<br/>Optimize with rule-based rewards for<br/>spatial constraints and temporal reasoning]
    C --> D[Spatio-temporal understanding<br/>2D/3D point-cloud detection<br/>Bounding-box coordinate output]
```

Since it can output 2D/3D point-cloud coordinates and bounding-box coordinates, it can be wired directly into a robot's grasp planning and collision avoidance.
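
As a sketch of what that wiring could mean, suppose the model returns bounding boxes as (label, x1, y1, x2, y2) tuples in image coordinates. That format is an assumption for illustration, not Reason 2's documented output schema.

```python
# Assumed output format for illustration only; Cosmos Reason 2's real
# schema may differ. Boxes are (label, x1, y1, x2, y2) in pixels.
DANGER_ZONE = (200, 440)  # horizontal band directly ahead of the robot

def plan_step(boxes):
    """Stop if any detected object overlaps the strip in front of the robot."""
    for label, x1, y1, x2, y2 in boxes:
        if x2 >= DANGER_ZONE[0] and x1 <= DANGER_ZONE[1]:
            return f"stop: {label} ahead"
    return "advance"

detections = [
    ("toy", 250, 300, 330, 380),   # toy sitting in the robot's path
    ("sofa", 500, 100, 640, 400),  # off to the side
]
print(plan_step(detections))  # -> "stop: toy ahead"
```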

So, will it go into a pet robot?

Now for the main point. Cosmos technology itself is general-purpose and not limited to industry. The problem is purely one of size.

Full spec is out of the question

Looking at the GPU memory required for full inference on the Cosmos 2.5 series, the numbers are hopeless in the context of a home robot.

| Model | Required VRAM |
| --- | --- |
| Cosmos-Predict2.5 (720p, 16 FPS) | 32.54 GB |
| Cosmos-Transfer2.5-2B | 65.4 GB |
| Multi-view inference | 80 GB × 8 |

NVIDIA recommends H100 80GB or A100 80GB class GPUs for the full models. Cramming a server rack's worth of GPUs into a pet robot is physically impossible, and the electricity bill alone would exceed what it costs to feed a real pet.

Quantization changed the situation

However, in February 2026, NVIDIA engineers successfully quantized Cosmos Reason2-2B (2 billion parameters) to W4A16 precision and got it running across the Jetson family. Notably, it also runs on the Jetson Orin Nano 8GB Super (8GB of unified memory, priced under $500).

```mermaid
graph LR
    A[Cosmos Reason2-2B<br/>Full model] --> B[W4A16 quantization<br/>4-bit weights]
    B --> C[Jetson Orin Nano 8GB<br/>Under $500]
    B --> D[Jetson AGX Orin<br/>275 TOPS]
    B --> E[Jetson Thor<br/>2070 TFLOPS<br/>128GB memory]
```
Recognizing objects from camera images, understanding spatial relationships, and planning actions: all of it completes at the edge, with no cloud connection.
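
Rough arithmetic shows why W4A16 is the difference-maker. Weight memory is parameter count times bytes per weight, so a 2-billion-parameter model drops from about 4 GB at FP16 to about 1 GB at 4-bit (activations stay 16-bit, and KV cache and runtime overhead are not counted here):

```python
# Back-of-the-envelope weight memory for a 2B-parameter model.
# Activations, KV cache, and runtime overhead are ignored.
params = 2e9

def weight_gb(bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 1e9

print(f"FP16 weights : {weight_gb(16):.1f} GB")  # ~4.0 GB
print(f"W4A16 weights: {weight_gb(4):.1f} GB")   # ~1.0 GB
# ~1 GB of weights leaves real headroom inside Orin Nano's 8 GB of memory.
```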

Jetson family spec comparison

| Module | AI performance | Memory | Power consumption | Assumed use |
| --- | --- | --- | --- | --- |
| Orin Nano 8GB Super | | 8GB | Low | Small robots / IoT |
| AGX Orin | 275 TOPS | 32-64GB | 15-60W | Autonomous driving / industrial robots |
| Thor | 2070 TFLOPS (FP4) | 128GB | 40-130W | Humanoids / advanced autonomous control |

The Orin Nano measures 70mm x 45mm and weighs approximately 60g. It is small enough to fit inside a pet robot's casing, and at around 7-15W of power draw, running it on batteries is realistic.

The boundary between what fits and what doesn't

Organized, it looks like this.

| Feature | Runs on the edge? | Required hardware |
| --- | --- | --- |
| Physical reasoning (spatial recognition / object tracking) | Yes | Jetson Orin Nano (under $500) |
| High-quality synthetic data generation | No | Data-center GPU |
| Future prediction (full spec) | No | H100/A100 class |

In other words, a pet robot can go as far as “understanding the world in front of it and deciding its own actions.” Generating the training data that makes the robot smarter, on the other hand, remains the cloud's job. The realistic configuration is to split the work: training in the cloud, inference at the edge.

In a pet-robot development flow, Cosmos would generate large amounts of synthetic data of home environments, Isaac Lab 3.0 would run reinforcement learning on it, and the resulting model would be baked into an Orin Nano and shipped. It is also technically possible for the robot to upload data collected while roaming the user's home and receive periodic model updates from the cloud.
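
Reduced to a skeleton, that split looks like the loop below. Every function here is an invented stub for illustration; none of it is an NVIDIA API.

```python
# Illustrative cloud/edge split; every function is a made-up stub.
import time

def collect_episode():
    """Edge: record camera frames and odometry while the robot roams."""
    return {"frames": [], "odometry": []}

def upload(episode):
    """Edge -> cloud: ship the episode for synthetic-data augmentation and RL."""
    print(f"uploaded episode with {len(episode['frames'])} frames")

def check_for_model_update():
    """Edge: poll the cloud for a newly fine-tuned, quantized model artifact."""
    return None  # e.g. a path to new W4A16 weights when one is available

while True:
    upload(collect_episode())
    if (new_model := check_for_model_update()) is not None:
        print(f"hot-swapping model: {new_model}")
    time.sleep(3600)  # hourly polling; a real robot would be event-driven
```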

Isaac Lab 3.0

The latest version of the robot-learning platform that works with Cosmos-based models was announced alongside it. Reinforcement-learning efficiency has been improved and adaptability to diverse environments strengthened. Feeding Cosmos-generated synthetic data into reinforcement learning on Isaac Lab makes it possible to train robots at scale in a physically accurate simulation environment.
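
The RL side, stripped to its skeleton, is the familiar loop below. This is a generic structure, not Isaac Lab's actual API; the Cosmos-generated scenes would supply the environment's visual variation.

```python
# Generic RL training skeleton; not Isaac Lab's actual API.
import random

class HomeSimEnv:
    """Stand-in environment whose observations would come from
    Cosmos-generated synthetic home scenes."""
    def reset(self):
        return random.random()
    def step(self, action):
        reward = 1.0 if action == "avoid" else 0.0
        done = random.random() < 0.1
        return random.random(), reward, done

env = HomeSimEnv()
for episode in range(3):
    obs, done, total = env.reset(), False, 0.0
    while not done:
        action = random.choice(["advance", "avoid"])  # placeholder policy
        obs, reward, done = env.step(action)
        total += reward
    print(f"episode {episode}: return {total:.1f}")
```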

Robot learning for the real world has long faced the “sim-to-real gap”: a policy that moves perfectly in simulation fails on the actual machine because of real friction, slight shifts in the center of gravity, and differences in materials. Cosmos's physics-grounded synthetic data aims to narrow this gap. For devices like pet robots that must operate across diverse home environments, the sim-to-real gap is even more serious than for industrial robots, because they cannot take the factory approach of “standardizing the environment.”

Audio is a completely different world

After reading this far, you may be thinking, “Amazing, put all of this together and you have a companion robot,” but one thing is fatally missing: audio.

Cosmos is a purely visual and physical model. It understands space, recognizes objects, and predicts future states from camera images, but it does not handle speech recognition (ASR) or speech synthesis (TTS) at all. There is no function for processing microphone input and none for producing speaker output. In the GTC 2026 demos, no robot ever “listens” or “talks.” They only move.

For a pet robot or companion robot, “seeing and moving” is not enough. It has to hear its owner's voice and answer with a bark or a reply. This part has to be supplied by something other than Cosmos.

Then what about the audio?

There is already an audio stack that runs on Jetson.

| Function | Candidates | Jetson compatibility |
| --- | --- | --- |
| Speech recognition (ASR) | Whisper (distil-whisper), Riva ASR | Optimized for Jetson |
| Speech synthesis (TTS) | NVIDIA Riva TTS, Piper, VITS family | Riva TTS supports Jetson; lightweight TTS (Piper etc.) runs easily |
| Voice activity detection (VAD) | Silero VAD | Lightweight, edge-oriented |

NVIDIA's Riva is an SDK that provides both ASR and TTS, with official support for deployment on Jetson. The distilled Whisper variants (distil-whisper) allow real-time inference on an Orin Nano 8GB. The TTS side is far lighter: even models like Piper, at under 100MB, offer sufficient quality.
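
To give a sense of how little code the ASR side takes, here is a minimal sketch using the Hugging Face transformers pipeline with a distil-whisper checkpoint. The audio file name is a placeholder; on a Jetson you would additionally pin the pipeline to the GPU.

```python
# Minimal ASR sketch: distil-whisper via the transformers pipeline.
# "owner_voice.wav" is a placeholder recording of the owner speaking.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-small.en",  # small English checkpoint
)

result = asr("owner_voice.wav")
print(result["text"])
```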

In other words, the configuration of the pet robot is as follows.

```mermaid
graph TD
    A[Camera input] --> B[Cosmos Reason2-2B<br/>Spatial recognition, object tracking, action planning]
    C[Microphone input] --> D[Whisper / Riva ASR<br/>Speech recognition]
    B --> E[Action control<br/>Motors and actuators]
    D --> F[Intent understanding<br/>LLM / rule-based]
    F --> G[Riva TTS / Piper<br/>Speech synthesis]
    G --> H[Speaker output]
    F --> E
```
Cosmos is the “eyes and body”; ASR+TTS are the “ears and mouth.” What ties them together is an LLM or a lighter rule-based control layer. Whether it all fits on a single Orin Nano comes down to memory management during integration, but since the quantized Cosmos Reason2-2B runs in a few GB, there is headroom left over for ASR+TTS.
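
A skeleton of that control layer might look like the following. All four classes are invented stand-ins for the real stacks (quantized Cosmos Reason2-2B, Whisper/Riva ASR, Piper/Riva TTS, a motor driver), so treat this purely as a structural sketch.

```python
# Structural sketch of the integration layer; every class is a stand-in.
class Vision:  # quantized Cosmos Reason2-2B would sit here
    def plan(self, frame):
        return "advance"

class Ears:    # distil-whisper / Riva ASR would sit here
    def listen(self):
        return "come here"

class Mouth:   # Piper / Riva TTS would sit here
    def say(self, text):
        print(f"[speaker] {text}")

class Body:    # motor / actuator driver would sit here
    def act(self, command):
        print(f"[motors] {command}")

def tick(vision, ears, mouth, body, frame):
    """One loop iteration: hear, see, decide, speak, move."""
    heard = ears.listen()
    move = vision.plan(frame)
    if "come" in heard:  # trivial rule-based intent understanding
        mouth.say("Woof! Coming!")
        move = "approach owner"
    body.act(move)

tick(Vision(), Ears(), Mouth(), Body(), frame=None)
```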

Industrial/Medical/Autonomous Driving

  • Boston Dynamics is applying the Cosmos world model to robot development
  • CMR Surgical (surgical assistance robots) and Medtronic are exploring the use of simulation. Medical devices face especially strict data requirements for safety validation, and synthetic data is expected to expand their verification environments.
  • Uber adopts Physical AI Data Factory Blueprint for robot and autonomous driving development

The consumer humanoid space

At GTC 2026, there was also a noticeable push toward consumer products.

  • NEURA Robotics unveils Porsche-designed Gen 3 humanoid (powered by Jetson Thor)
  • LG Electronics unveils a home robot that performs household tasks
  • AGIBOT presents humanoids for both industrial and consumer use
  • Multiple humanoid startups adopt Cosmos, including Figure AI, Galbot, and Skild AI

There are no direct examples in pet robots yet. But given the trend of humanoids entering the home at prices around the 2-million-yen class, companion robots at far lower prices are only a matter of time. A configuration of an Orin Nano (under $500) plus camera plus actuators could put the hardware cost in the 100,000-yen range.


So, what I'm personally curious about is whether this stack could go onto Kana-chan (this blog's AI character). Kana-chan is currently a text-based character, but if she ever gets a physical body, I can picture a configuration where Cosmos handles the eyes and body movements while Riva TTS or VITS handles the voice. In a previous voice-chat experiment we got as far as producing her voice in software alone; this time, a single Orin Nano at under $500 might cover it, physical body included. Movement uses the quantized Cosmos Reason2-2B for spatial recognition, and the voice comes out through TTS. Technically, we have reached the stage where all that's left is to build it. Well, whether I actually build it is another story.