Fusheng-Ji · 浮生记

On 3D, Video World Models, and the Approaching ImageNet Moment of Perception-Action Learning

A Question About What Scales

I recently read Vincent Sitzmann’s essay, “The flavor of the bitter lesson for computer vision.” What stayed with me was not only its argument that many familiar vision tasks may be historical decompositions rather than final objectives, but a more structural question: what makes a research direction truly scalable?

For a long time, computer vision was organized around mappings from images to intermediate representations because direct perception-action learning was simply too hard. In that setting, reconstruction, segmentation, tracking, and pose estimation were not arbitrary choices. They were the forms in which the problem became tractable.

That question feels personal to me because much of my own research training has been built around 3D.

Why 3D Felt Central

For a long time, 3D felt like the natural representation for understanding humans and scenes. It offered structure, consistency, and a workable interface between raw observations and explicit reasoning. In human-centered and scene-centric problems, geometry is not just elegant. It is operational. It makes models easier to control, outputs easier to inspect, and failure modes easier to diagnose. That is why 3D never felt like a temporary engineering choice to me. It felt like the right language for the problem.

What has changed for me is not that I now think 3D was mistaken. It is that I have become less convinced that 3D should be treated as the destination.

I still do not think 3D simply disappears. In many of the problems I care about, explicit geometry remains uniquely valuable for structure, control, and consistency. But I do think its role is shifting. More and more, I find myself treating 3D not as the final object around which the whole agenda should be organized, but as an important intermediate layer in a larger chain: from visual structure, to world modeling, and perhaps eventually to action.

What ImageNet Really Changed

One way I have started to understand this shift is through the idea of research regimes.

What made ImageNet historically decisive for 2D vision was not scale alone. It was that scale came bundled with a task definition, shared benchmarks, and a common optimization target. ImageNet did not merely provide more data; it helped define the task around which modern 2D vision could organize itself. That is what allowed representation learning in 2D vision to genuinely scale.

Seen from that perspective, recent large-scale 3D models feel important in a different way from earlier task-specific systems. VGGT, for example, suggests the beginning of a more unified and scalable regime for 3D understanding: one in which cameras, depth, point maps, and tracks can be predicted within a more general framework, rather than through isolated systems built for one task at a time.

That is why VGGT feels significant to me. Not because it solves 3D once and for all, and not because it is literally the ImageNet of 3D, but because it hints that 3D may be entering a new regime: less fragmented, less tied to narrow pipelines, and more shaped by large-scale data and unified training.

The Regime Perception-Action Learning Still Lacks

But if 2D vision had its ImageNet era, and 3D may now be entering something analogous, then perception-action learning still seems to lack an equivalent moment.

For me, the missing piece is not just “more data” in the abstract. It is the absence of a full task regime for perception-action learning: data that scales, a training setup that scales, and a task definition that the field can genuinely organize itself around. That is what 2D vision had. That is what 3D may be beginning to develop. And that is what perception-action learning still seems to be approaching rather than possessing.

This is also why video world models matter to me.

Beyond Visual Plausibility

But what matters here is not visual realism alone. A video model that produces plausible-looking futures is still not necessarily a usable world model. The difference, to me, is increasingly about causality and physical accuracy.

A model can generate a visually convincing continuation and still fail as a world model in the stronger sense. If it does not capture causal structure, then it does not really tell us what would happen under intervention. If it is not physically accurate, then even a compelling-looking future may be useless for control. A trajectory that looks smooth but violates contact, inertia, collision, or object consistency might still be good video generation, but it is a weak substrate for action.
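One way to make this concrete is a toy plausibility check on a predicted trajectory. The sketch below is purely illustrative (the function name, threshold, and data are mine, not from any real system): it flags frames where an object's implied acceleration exceeds a physical bound, the kind of inertia violation a visually smooth generation can still commit.

```python
def inertia_violations(positions, dt=1.0, max_accel=10.0):
    """Return indices of frames whose implied acceleration exceeds max_accel.

    positions: 1D object positions sampled at interval dt (toy example).
    """
    # Finite-difference velocities between consecutive frames.
    velocities = [(b - a) / dt for a, b in zip(positions, positions[1:])]
    # Flag frames where velocity jumps faster than the physical bound allows.
    return [
        i for i, (v0, v1) in enumerate(zip(velocities, velocities[1:]), start=1)
        if abs(v1 - v0) / dt > max_accel
    ]

# A track that looks smooth frame-to-frame except for one teleport-like jump:
track = [0.0, 1.0, 2.0, 3.0, 40.0, 41.0]
print(inertia_violations(track))  # the jump around frame 4 is flagged
```

A check like this says nothing about appearance; it only asks whether the predicted future could have been produced by a world with inertia, which is exactly the property passive video metrics do not measure.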

In that sense, “causal” and “physically accurate” are not side conditions. They are part of what separates generative plausibility from action-relevant world modeling.

Why 3D Still Matters

This is also why I still care about 3D. One reason explicit structure remains important is that it can contribute exactly where purely visual generation is weakest: consistency across views and time, geometric coherence, and a better path toward physically grounded prediction. I no longer see 3D as the end of the story, but I do see it as one of the ingredients that may help world models become more than impressive predictors of appearance.

The causal requirement matters just as much. A perception-action model needs to do more than predict what is likely next: it must predict what changes when an action is taken. That is a stronger demand than passive forecasting. It requires models whose internal organization is at least somewhat intervention-sensitive rather than purely correlational. Without causal grounding, a model may summarize the world’s regularities without becoming a model one can truly act through.
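The difference between forecasting and intervention can be shown with a deliberately minimal sketch. Every name here is illustrative (there is no real library behind `PassiveForecaster` or `ActionConditionedModel`); the point is only the interface: one model can answer “what is likely next,” the other can also answer “what happens if I act.”

```python
from dataclasses import dataclass


@dataclass
class State:
    position: float
    velocity: float


class PassiveForecaster:
    """Extrapolates the observed trajectory; actions play no role."""

    def predict(self, s: State, dt: float = 1.0) -> State:
        return State(s.position + s.velocity * dt, s.velocity)


class ActionConditionedModel:
    """Predicts what changes when an action is taken: the rollout
    depends on the intervention, not only on past statistics."""

    def predict(self, s: State, action: float, dt: float = 1.0) -> State:
        new_velocity = s.velocity + action * dt  # action as acceleration
        return State(s.position + new_velocity * dt, new_velocity)


s = State(position=0.0, velocity=1.0)
# The passive forecaster can only continue the trajectory:
print(PassiveForecaster().predict(s))
# The action-conditioned model can answer "what if I brake?":
print(ActionConditionedModel().predict(s, action=-1.0))
```

However crude, the second interface is the one planning and control need: it exposes a knob for the action, so different interventions yield different futures.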

Video World Models as a Bridge

This is where video world models become especially interesting. They do not solve the perception-action problem directly, but they may offer the most realistic current bridge toward changing its data regime. They let us learn temporal structure, dynamics, behavior, and partial causal regularities from abundant video before we have enough large-scale action-conditioned data to train fully general perception-action systems directly.

At the same time, that bridge is incomplete. Interactive controllability depends on data in which visual changes and corresponding actions are aligned, and obtaining such aligned data at scale remains a central open problem. That is why I do not see current video generation as the solution. I see it as a transition mechanism.

What This Changes in My Own Research

What interests me now is not simply moving from 3D to action. That would be too abrupt, and also not quite true. What interests me more is a longer transition: from 3D as a central intermediate representation, to video world models as a bridge, and ultimately to perception-action models that may one day have their own scalable regime.

In that picture, 3D still matters, but differently. It becomes part of the path rather than the endpoint of the path.

And within that transition, the properties I now care about most are no longer just representational quality or visual realism. I care more and more about whether a model is learning something causal rather than merely correlational, and whether its predictions are physically accurate enough to support intervention rather than only observation.

Those requirements are exactly what make the problem harder, but they are also what make it worth pursuing. If world models are going to matter for action, then they cannot stop at being good videos. They have to become usable models of what actions do in worlds that obey structure.

That is also why I have become increasingly interested in the possibility of a unified framework for video generation and action generation. If video world models are one of the few plausible ways to push learning toward richer dynamics before large-scale action data arrives, then it seems natural to ask whether video prediction and action-conditioned prediction should remain separate for much longer.

I do not take this as a solved direction. If anything, the open problems are exactly what make it interesting. But it is the direction that now feels most alive to me.

A Longer Transition

So the change in my thinking is not that 3D became irrelevant, and not that current video models have already solved embodied intelligence. It is something more modest and more important to me: I have started to see my earlier work on 3D as part of a larger transition in how visual intelligence may scale.

If 2D vision had its ImageNet moment, and 3D is beginning to form a more unified learning regime of its own, then perhaps video world models are helping prepare the ground for the still-approaching ImageNet moment of perception-action learning. But for that bridge to really hold, it will not be enough for these models to become more realistic. They will have to become more causal, more interactive, and more physically accurate.