Why human experience is the only scalable path to Physical AGI
February 6, 2026
Disclaimer: The views expressed here are my own and do not represent Nvidia or the GEAR group.
Every major breakthrough in artificial intelligence comes down to one thing: data. GPT-3 didn't emerge from architectural innovation alone—it emerged from training on the collective written output of humanity. Diffusion models didn't learn to generate photorealistic images through clever loss functions alone—they learned by ingesting billions of images created by humans over decades. The pattern is unmistakable: intelligence, artificial or otherwise, is a function of experience.
This observation leads to an uncomfortable question for the robotics community: if data is the bottleneck for intelligence, and robot data is notoriously expensive to collect, how will we ever achieve Physical AGI?
The answer, I believe, points in only one direction.
The fundamental challenge in achieving Physical AGI is not hardware, not algorithms, not compute—it's data. Consider the asymmetry: LLMs were trained on text representing the accumulated knowledge of billions of people across human history. Vision models learned from images captured by billions of cameras over decades. But robot learning? We're celebrating when a lab collects thousands of hours of teleoperation data.
The gap isn't just about scale—it's about diversity. Text and vision data are "in-the-wild": scraped from the internet, captured incidentally, reflecting the full messiness of human life across cultures, contexts, and edge cases. Robot data is on-demand: collected intentionally in controlled settings, limited to the scenarios researchers choose to demonstrate. This inherently constrains the distribution. You can't teleoperate your way to the long tail.
Teleoperation doesn't scale. It requires expensive hardware, skilled operators, physical space, and—crucially—time that cannot be parallelized beyond the number of robots you own. Even the most ambitious data collection efforts represent a tiny fraction of the experience needed for Physical AGI. Approaches like UMI (Sunday Robotics, Generalist) have made impressive strides by enabling portable, in-the-wild data collection without requiring actual robots—but they still depend on dedicated human demonstrators deliberately performing tasks with specialized hardware. Their tightly integrated vertical stacks also mean the collected data must be carefully calibrated for replay on specific target robots, which fundamentally limits scale.
So where does the data come from?
Humans are robots that have already been deployed at scale. There are 8 billion of us, each accumulating roughly 16 waking hours of sensorimotor experience per day—manipulating objects, navigating environments, using tools, interacting with each other. All of it is recordable.
The path forward is obvious: we need to capture this data.
Recent advances in human egocentric video datasets point toward this future. The rapid emergence of human egocentric data vendors suggests the market sees what we see. But let's think bigger. Imagine we had 100 million hours of human egocentric video data. Sounds like a lot? Given a life expectancy of 76 years, that's roughly 150 human lifetimes of recorded experience.
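For the curious, the arithmetic behind that figure, counting every hour of a 76-year life:

```python
# Back-of-the-envelope: how many 76-year lifetimes fit in 100 million hours of video?
hours_per_lifetime = 76 * 365.25 * 24        # roughly 666,000 hours in a 76-year life
dataset_hours = 100_000_000                  # the hypothetical egocentric dataset above
print(dataset_hours / hours_per_lifetime)    # ~150 lifetimes
```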
For comparison, LLM training data represents the written output of billions of people over centuries. We're not even close to saturating the human experience data regime—we're barely scratching the surface.
Of course, video alone doesn't capture the full richness of physical interaction—forces, contacts, and tactile feedback are all missing. My hypothesis is that video can serve as the high-data-regime modality that enables transfer to these lower-data modalities.
Raw video data is necessary but not sufficient. The question is: what do we do with 100 million hours of human footage?
The answer, in my view, lies in world modeling. The standard approach—mapping observations directly to actions—is a dead end: it locks learned knowledge to the specific embodiment and viewpoint it was collected on, offers no mechanism for verifying what the model has actually understood, and provides no path to leveraging the ocean of action-free human video that dwarfs any robot dataset. It may produce impressive demos, but I believe it will plateau well short of the generalization needed for real deployment. We should instead train models to predict how the world evolves. Physics. Affordances. Cause and effect. Video is the ideal medium for this: it's dense, captures spatial and temporal relationships, and implicitly encodes the dynamics of physical interaction.
Our recent work on DreamZero demonstrates this principle. By training a 14B parameter autoregressive video diffusion model to predict future frames and actions in a single forward pass, we showed that world modeling enables generalization that pure action prediction cannot achieve. Why generate pixels at all? Because generating pixels is the only way for humans to verify that the model truly understands how to perform a task. If it can accurately predict what happens next—the object moving, the hand grasping, the cup tilting—it must have internalized the underlying dynamics. The pixels are the proof, and that proof translates to action.
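To make this concrete, here is a minimal sketch of what such a world-action model's interface could look like. It is not DreamZero's actual architecture: the names and dimensions are placeholders, and a small transformer stands in for the 14B autoregressive video diffusion backbone. The point is the shape of the computation, where one forward pass yields both a prediction of the future and an action chunk.

```python
import torch
import torch.nn as nn

class WorldActionModel(nn.Module):
    """Illustrative sketch of a unified video + action predictor (not DreamZero's real code)."""

    def __init__(self, latent_dim=1024, action_dim=26, horizon=16):  # placeholder dimensions
        super().__init__()
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=latent_dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.frame_head = nn.Linear(latent_dim, latent_dim)              # "what will I see next"
        self.action_head = nn.Linear(latent_dim, action_dim * horizon)   # "what should the body do"

    def forward(self, past_frame_latents, instruction_embedding):
        # Condition the visual history on the language instruction, then predict
        # future frame latents and an action chunk from the same representation.
        tokens = torch.cat([instruction_embedding, past_frame_latents], dim=1)
        h = self.backbone(tokens)
        future_latents = self.frame_head(h[:, -1:])
        actions = self.action_head(h[:, -1])
        return future_latents, actions
```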
The training objective we should aspire to is breathtaking in its simplicity and difficulty: given N hours of history, predict the next N hours of video. It's next-token prediction for the physical world—simple to state, extraordinarily hard to solve. Think about what this requires: remembering the past, understanding the present, and predicting the future. This is what humans do when they plan. It's what we call thinking.
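In pseudocode, the objective is just next-chunk prediction on video. The sketch below assumes a hypothetical `predict_future` method and uses plain regression where a real system would use a diffusion or flow objective over frame latents:

```python
import torch.nn.functional as F

def next_chunk_prediction_loss(model, video_latents, chunk_len):
    """Given the past, predict the future: video_latents is (batch, T, dim) encoded
    frames from one long egocentric recording; chunk_len is how far ahead we predict."""
    history = video_latents[:, :-chunk_len]        # "the past N hours"
    target_future = video_latents[:, -chunk_len:]  # "the next N hours"
    predicted_future = model.predict_future(history, num_frames=chunk_len)  # hypothetical API
    return F.mse_loss(predicted_future, target_future)
```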
Let's assume we successfully train a massive autoregressive video world model on 100 million hours of human egocentric data—a milestone I expect we'll hit within the next few years. It understands how humans accomplish physical tasks. It can predict how scenes will evolve. It has internalized the "how" of physical intelligence, learned from 150 lifetimes of human experience.
But raw video prediction isn't enough. A true world model must predict the next state conditioned on action—otherwise it's just a physics simulator you can observe but not steer. So what does "action" mean for a world model trained on human video? We don't have joint torques or end-effector velocities—humans don't come with an action API. What we do have is language: a natural, scalable way to express intent that can be paired with any video. The 100 million hours must be densely annotated with captions that segment continuous experience into meaningful subtask intervals—"reaching for the cup," "grasping the handle," "pouring"—teaching the model that when a human intends to do X, the world evolves in a predictable way. Language becomes both the training signal that teaches intentional action and the interface through which humans later command the robot.
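A sketch of what such annotations could look like, and how they turn continuous video into (history, intent, future) training triples. The schema and the `clip` helper are illustrative, not a description of any existing dataset format:

```python
# Hypothetical dense-caption annotations that segment a continuous egocentric
# recording into subtask intervals.
annotations = [
    {"start_s": 12.0, "end_s": 14.5, "caption": "reaching for the cup"},
    {"start_s": 14.5, "end_s": 16.0, "caption": "grasping the handle"},
    {"start_s": 16.0, "end_s": 21.0, "caption": "pouring"},
]

def to_training_examples(video, annotations, history_s=4.0):
    """Pair each captioned interval with the video that precedes and follows it:
    when a human intends to do X, the world evolves in a predictable way."""
    examples = []
    for seg in annotations:
        examples.append({
            "history": video.clip(seg["start_s"] - history_s, seg["start_s"]),  # hypothetical helper
            "intent": seg["caption"],   # language as the action interface
            "future": video.clip(seg["start_s"], seg["end_s"]),
        })
    return examples
```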
Now what? The model exists in silicon. To act in the physical world, it needs a body. Can physical understanding learned from human video transfer to a robot?
We already have evidence that it does. DreamZero generalizes zero-shot to unseen tasks (see the gallery of 100+ zero-shot tasks)—untying shoelaces, ironing, shaking hands—motions no robot in the training data ever performed. The knowledge came from the video diffusion backbone, pretrained on web-scale data that is overwhelmingly human. When DreamZero generates a video of a robot untying a shoelace and executes the corresponding actions, it is transferring physical understanding from human experience to a robot body.
But let's be honest: today's "zero-shot" is still in what I consider the "AI Slop" phase. The robot attempts the right motion—correct approach direction, plausible contact points—but doesn't execute accurately or reliably. The model is doing two hard things at once: (1) transferring physical knowledge from pretraining, where humans perform these tasks, usually from an exocentric rather than egocentric view, and (2) adapting it to an embodiment with completely different kinematics. The transfer is real but lossy.
Targeted human egocentric data attacks the first problem. The pretrained model learned human physical knowledge incidentally—from internet video that happened to contain manipulation amid talking heads and landscapes. Egocentric data collected deliberately for robotics would sharpen the transfer: first-person viewpoints closer to robot cameras, manipulation-dense footage, task-relevant temporal structure. This doesn't introduce a new mechanism—it amplifies the one already working, so less capacity is spent filtering noise and more goes toward bridging the embodiment gap.
DreamZero's cross-embodiment experiments confirm this logic. Adding just 10–20 minutes of video-only demonstrations—from a different robot or from humans—improved unseen task performance by over 42%. No action labels, just video. More surprisingly, adapting to an entirely new robot required only 30 minutes of play data, with zero-shot generalization carrying over to the new body. Physical dynamics transferred; only the kinematics needed relearning. Critically, both results depended on similar camera viewpoints across embodiments—wrist and head-mounted cameras that approximate a shared visual perspective. This is precisely why egocentric human data is the right bet: a simple wrist or head-mounted camera is all it takes to capture the first-person, hand-centric viewpoint that maximizes visual overlap with robot observations.
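One plausible way video-only data enters training, assuming a model with the interface sketched earlier (I am not claiming this is DreamZero's exact recipe): the frame-prediction loss applies to every clip, human or robot, while the action loss applies only where action labels exist.

```python
import torch.nn.functional as F

def adaptation_loss(model, batch):
    """Sketch of mixed fine-tuning with video-only clips; field names are illustrative."""
    pred_frames, pred_actions = model(batch["past_frames"], batch["instruction"])
    # The future-frame prediction loss is always available, even for human video
    # or demonstrations from a different robot.
    loss = F.mse_loss(pred_frames, batch["future_frames"])
    # The action loss only applies to clips that come with action labels.
    if batch.get("actions") is not None:
        loss = loss + F.mse_loss(pred_actions, batch["actions"])
    return loss
```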
This is the paradigm shift. The standard approach—mapping observations directly to actions—locks learned knowledge to a single embodiment. World models break that coupling. They learn how the world works, not just how one robot should move, making their knowledge a shared asset across bodies. Train once on human experience, deploy everywhere with calibration. Humanoids minimize the remaining embodiment gap, but even dramatically different robots can benefit with modest adaptation. How far that adaptation can stretch remains an open question.
This brings us to the thesis of this post. If human experience is the only data source large enough to reach Physical AGI, and if the physical understanding learned from that experience transfers best when the robot's body and viewpoint resemble a human's, then the optimal robot embodiment is one that maximizes similarity to humans.
But "optimal" depends on what's feasible. Here's the tension:
The hardware path. If humanoid hardware matures quickly—reliable actuators, dexterous hands, robust locomanipulation—then the minimal transfer gap makes humanoids the obvious choice. Why collect embodiment-specific data when the embodiment already matches?
The algorithm path. If embodiment adaptation algorithms improve faster than hardware, the transfer gap matters less. A world model that can calibrate to a wheeled manipulator in 30 minutes of play data doesn't need a humanoid body to leverage human experience. Intermediate forms become viable indefinitely.
But adaptation isn't free. Consider picking up a credit card from a flat table. With a dexterous hand, you'd slide a fingertip under the edge and pinch. With a parallel gripper, you'd push the card to the table's edge and clamp it from the side. Same task, completely different motion strategy. A world model trained on human video has seen the first approach a million times—but the second must be learned from embodiment-specific data. The further you stray from human form, the more of these gaps you accumulate.
We don't know which will win. My intuition is that hardware is the slower variable—iteration cycles are brutal, supply chains are complex, and the path from prototype to product is long. But I'd love to be proven wrong.
What's certain is this: the world understanding learned from human video will power robots of all forms. Humanoids minimize the remaining gap. Whether that gap is worth closing depends on a race we're only beginning to run.
I don't want to oversell the simplicity of this path. Significant challenges remain, and we will have to tackle them relentlessly over the next few years.
The inverse dynamics problem. Even with perfect world prediction, we still need to map predicted futures to motor commands. Our experiments suggest this implicit inverse dynamics model (IDM) can be learned efficiently, but the data requirements will likely scale with the system's degrees of freedom. As humanoid robots approach human-level dexterity—dozens of actuators per hand, compliant joints, complex dynamics—the IDM learning problem grows harder. How much actual robot play data is truly needed? We don't yet know.
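For concreteness, here is a sketch of the mapping involved: from the current observation and a predicted future, both as latents, to a chunk of motor commands, trained on logged robot play data. The IDM in our models is implicit rather than a separate head like this; the explicit module below, with placeholder names and dimensions, is only meant to make the learning problem concrete.

```python
import torch
import torch.nn as nn

class InverseDynamicsHead(nn.Module):
    """Illustrative inverse dynamics model: (current state, predicted future) -> action chunk."""

    def __init__(self, latent_dim=1024, num_dof=40, horizon=16):  # a dexterous humanoid has many DoF
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * latent_dim, 2048),
            nn.GELU(),
            nn.Linear(2048, num_dof * horizon),  # an action chunk, not a single step
        )

    def forward(self, current_latent, predicted_future_latent):
        # Concatenate "where I am" with "where the world model says the scene is headed"
        # and regress the motor commands that bridge the two.
        x = torch.cat([current_latent, predicted_future_latent], dim=-1)
        return self.mlp(x)
```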
Multimodal sensing. Humans don't just see—we feel pressure, temperature, texture, and pain. We hear objects make contact. We sense our body's position through proprioception. Current world models operate primarily in the visual domain. Our hypothesis, based on DreamZero's success in transferring from video to action, is that video can serve as the high-data regime modality that enables transfer to lower-data modalities, such as tactile and force sensing, the same way it did for the action modality. But this remains to be demonstrated.
Non-intrusive human data capture. The richer the sensory data you want—even just egocentric video from head or wrist-mounted cameras—the harder it is to make capture truly unobtrusive. And if the rig becomes too specialized, you've essentially reinvented UMI: dedicated operators with purpose-built equipment, losing the scalability advantage of passively recording natural behavior. Consumer smart glasses are one potential path—if products like Meta's Ray-Ban glasses gain mass adoption, egocentric video becomes a byproduct of daily life. But we're not there yet, for vision or any other modality.
Long-horizon coherence. Predicting the next few seconds of video is achievable today. Predicting the next few hours—the timescale of meaningful human tasks—is harder, but the path is visible: longer context windows, better architectures, more efficient attention. We're making real progress. But true persistent memory? Coherence over days, weeks, the full arc of a human life? That's not a scaling problem. That's an open research question we don't yet know how to answer.
Close your eyes and imagine the following:
A video world model trained on 100 million hours of human experience. It has watched people cook, clean, build, repair, care for each other, navigate cities, and use every tool humans have invented. It understands not just what happens, but why and how.
This model is deployed in a humanoid robot whose body mirrors human kinematics. Because the embodiment matches the training distribution, the model can immediately leverage its vast experience. With a few hours of play data to "calibrate" its proprioception, it can begin to act—just as a skilled human teleoperator adapts to a new robot in hours, not weeks.
The robot doesn't need to be teleoperated through every task. It has already watched humans do those tasks a million times. It knows how the world responds to manipulation. It can perceive, predict, plan, and execute.
This is Physical AGI. And the path to get there runs through world models trained on the only data source large enough to matter—the physical experience of 8 billion humans.
This post draws on our previous work DreamGen, which first showed that video world models can serve as synthetic data generators enabling generalization, and our recent work DreamZero, a World Action Model that unifies video prediction and action generation to enable zero-shot generalization.
I would especially like to thank Danfei Xu for providing some of the inspiration for this post during our conversations. I would also like to thank Seonghyeon Ye, Youliang Tan, Chuning Zhu, Yevgen Chebotar, Yilun Du, and Jim Fan for their comments and feedback on this essay.