01 / 10

World Models

From Architecture to Frontier — A Practitioner's Overview

Guilin Zhang · March 2026
github.com/GuilinDev

02 / 10

What Is a World Model?

"A world model is an internal simulator that allows an agent to predict the consequences of its actions without actually performing them — the key to human-level AI."
— Yann LeCun, Turing Award laureate, "A Path Towards Autonomous Machine Intelligence" (2022)

Current large models excel at pattern matching but fail at four fundamental capabilities:

🧲

No Physical Common Sense

LLMs don't understand gravity, momentum, or spatial reasoning. They predict tokens, not physics.

🔭

Weak Long-Horizon Planning

End-to-end systems can't look ahead — they react step-by-step without simulating consequences.

🔗

No Causal Reasoning

Correlation ≠ causation. Models learn "what co-occurs" but not "what causes what."

📊

Data Inefficiency

Current models require orders of magnitude more data than humans. Humans learn to catch a ball in minutes, not millions of trials.

World Models address these gaps by building an internal simulator of the environment — learn the dynamics, simulate outcomes, plan before acting. Model-based agents routinely report large (often 90%+) reductions in required real-world interaction.
03 / 10

LeCun's Blueprint: JEPA

Joint Embedding Predictive Architecture — the core of LeCun's vision for how world models should work.

🧠 Key Insight

Don't predict pixels. Predict representations.

Instead of generating raw images/video of the future (expensive, noisy), predict in a learned abstract representation space. Think "predict the meaning, not the pixels."

How JEPA Works

  • Encode observation x → representation s_x (context encoder)
  • Encode target y → representation s_y (target encoder)
  • Predictor learns: ŝ_y = f(s_x, action)
  • Loss compares ŝ_y to s_y in embedding space, not pixel space
  • Avoids representation collapse via an asymmetric design (EMA target encoder, stop-gradient)
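The loop above can be sketched with toy linear encoders. This is an illustrative numpy sketch, not a real JEPA: actual implementations use deep networks, patch-level masking, and autograd, and the shapes and learning rates here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "encoders": linear maps from 8-D observations to 4-D representations.
W_ctx = rng.normal(size=(4, 8)) * 0.1       # context (online) encoder
W_tgt = W_ctx.copy()                        # target encoder, updated only by EMA
W_pred = rng.normal(size=(4, 4 + 2)) * 0.1  # predictor: f(s_x, action) -> s_y_hat

def jepa_step(obs, target, action, lr=1e-2, ema=0.99):
    """One JEPA update: predict the target's representation, not its pixels."""
    global W_ctx, W_pred, W_tgt
    s_x = W_ctx @ obs                        # context representation
    s_y = W_tgt @ target                     # target representation (no gradient)
    inp = np.concatenate([s_x, action])
    s_y_hat = W_pred @ inp                   # predicted target representation
    err = s_y_hat - s_y                      # loss = 0.5 * ||err||^2 in embedding space
    g_pred = np.outer(err, inp)              # d loss / d W_pred (manual gradients)
    g_ctx = np.outer(W_pred[:, :4].T @ err, obs)  # chain rule through s_x
    W_pred -= lr * g_pred
    W_ctx -= lr * g_ctx
    W_tgt = ema * W_tgt + (1 - ema) * W_ctx  # EMA target encoder: the asymmetry
    return 0.5 * float(err @ err)

obs, target, action = rng.normal(size=8), rng.normal(size=8), rng.normal(size=2)
losses = [jepa_step(obs, target, action) for _ in range(200)]
assert losses[-1] < losses[0]                # embedding-space loss decreases
```

Note the two collapse-prevention ingredients from the bullets: gradients never flow into the target encoder, and the target encoder trails the context encoder via EMA.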

Why It Matters

  • Efficient — no need to reconstruct every pixel
  • Robust — focuses on "what matters" not visual noise
  • Scalable — representations are compact, physics-friendly
  • Multi-modal — same framework for video, audio, touch

V-JEPA (2024)

Meta's Video JEPA — learns world dynamics from unlabeled video by predicting masked spatiotemporal regions in representation space. No pixel reconstruction, no text supervision.

Self-Supervised Video Understanding
JEPA vs Generative Models: GPT/Sora predict the next token/frame. JEPA predicts the next abstract state. This is why LeCun argues generative models are a dead end for world understanding — they waste capacity modeling irrelevant details.
04 / 10

Two Types of World Models

🎯 Predictive (Decision-Oriented)

Core Mainstream
  • Models state transitions, rewards, causal dynamics
  • Goal: predict future states accurately for planning
  • Input: low-dim physics (position, velocity, joint angles)
  • Output: future states + values + uncertainty
  • Sim-to-Real transfer: easier
PlaNet Dreamer V3/V4 TD-MPC2

🎨 Generative (Perception-Oriented)

Simulation / Data Aug
  • Generates high-dim observations (images, point clouds)
  • Goal: visually reconstruct scenes, build virtual environments
  • Input: raw perception (pixels, video frames)
  • Output: reconstructed observations, virtual scenes
  • Physics fidelity: weaker, needs extra constraints
Genie 3 UniSim Sora-like
Where does JEPA fit? JEPA is a predictive world model that operates in representation space — it doesn't generate pixels (like generative models) and doesn't require reward signals (like RL-based predictive models). It's LeCun's "third way."
05 / 10

Core Architecture

The classic three-layer pipeline (Ha & Schmidhuber, 2018) → evolved into RSSM (Dreamer series):

Input
🎥 Raw Observation
Images, LiDAR, IMU
Layer 1: Encoder
📐 Perception
VAE / CNN / ViT → compact latent
Layer 2: Dynamics
🧠 World State Model
RSSM: deterministic + stochastic
Layer 3: Predictor
🔮 Future States
Rewards, values, rollouts

RSSM — The Standard (Dreamer)

Recurrent State-Space Model — the backbone of modern predictive world models.

  • Deterministic path: RNN/GRU remembers history
  • Stochastic path: latent variable models uncertainty
  • Together: accurate + robust long-horizon rollouts
  • Handles partial observability gracefully
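The deterministic + stochastic split can be sketched in a few lines. Numpy sketch with hypothetical shapes; a real RSSM uses a learned GRU, Gaussian heads trained with a KL objective, and an observation encoder.

```python
import numpy as np

rng = np.random.default_rng(1)
H, Z, A = 16, 4, 2                           # deterministic, stochastic, action dims

# Random parameters standing in for learned networks.
W_h = rng.normal(size=(H, H + Z + A)) * 0.1  # deterministic path (GRU stand-in)
W_mu = rng.normal(size=(Z, H)) * 0.1         # prior mean head
W_std = rng.normal(size=(Z, H)) * 0.1        # prior std head (pre-softplus)

def rssm_step(h, z, a):
    """One RSSM transition: deterministic memory plus stochastic latent."""
    # Deterministic path: recurrent update remembers history.
    h_next = np.tanh(W_h @ np.concatenate([h, z, a]))
    # Stochastic path: sample the next latent from a state-conditioned Gaussian.
    mu = W_mu @ h_next
    std = np.log1p(np.exp(W_std @ h_next))   # softplus keeps std positive
    z_next = mu + std * rng.normal(size=Z)
    return h_next, z_next

# Imagined rollout: start from an initial state, roll 15 steps on random actions.
h, z = np.zeros(H), np.zeros(Z)
traj = []
for _ in range(15):
    h, z = rssm_step(h, z, rng.normal(size=A))
    traj.append(np.concatenate([h, z]))
assert len(traj) == 15 and traj[0].shape == (H + Z,)
```

Rollouts like this never touch pixels: reward and value heads read (h, z) directly, which is what makes long-horizon imagination cheap.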

DreamerV4 — The Evolution

Replaces RSSM's RNN with Transformer World Model + Video Tokenizer.

  • PyTorch native (V3 was JAX)
  • Shortcut Forcing for stable training
  • Offline-only diamond mining in Minecraft
  • Scales better with compute
p_θ(s_{t+1}, o_t | s_t, a_t) ,   q_φ(s_t | o_{1:t}, a_{1:t-1})
06 / 10

World Models vs VLA

Two paradigms for embodied intelligence — converging, not competing.

Dimension | 🧲 World Model | 🗣️ VLA (Vision-Language-Action)
Core | Physics dynamics model | Large language model as hub
Focus | Space, motion, causality | Semantics, instructions, reasoning
Analogy | AI "Physicist" | AI "Linguist"
Real-time | High — closed-loop control | Lower — language reasoning overhead
Sim-to-Real | Easier — physics-aligned | Harder — needs execution adapters
Best for | Self-driving, industrial robots, drones | Service robots, home assistants, HRI
Key works | DreamerV3, TD-MPC2, Huawei ADS | Wayve LINGO-1, Li Auto MindVLA
Convergence: The frontier is VLA + World Model — VLA handles high-level semantic understanding ("pick up the red cup"), world model handles low-level physics simulation (how to actually move the arm). Think: language on top, physics underneath.
07 / 10

Fusion: RL + MPC + World Models

World model as the simulator. RL for global optimization. MPC for real-time control. Three components, one loop.

Paradigm 1: WM + MPC

Most Mature

World model replaces unknown dynamics → MPC optimizes actions in learned latent space. Rolling horizon, re-plan every step.

Key: PlaNet, Dreamer, TD-MPC2

Paradigm 2: RL + WM + MPC

Global + Local

RL learns global value function V(s) for long-term guidance. MPC handles local constraint optimization in real-time.

Best for: Autonomous driving

Paradigm 3: RL Tunes MPC

Adaptive

RL optimizes MPC hyperparameters (cost weights, dynamics params) instead of hand-tuning. Auto-adapts to changing environments.

Paradigm 4: Differentiable MPC

End-to-End

MPC acts as a differentiable actor; a critic evaluates its plans. Trained jointly, end-to-end. Best for high-speed drones and robots that need RL adaptability plus MPC stability.

a*_{0:T} = argmin Σ_k c(ŝ_k, a_k) + V̂(ŝ_T)    subject to   ŝ_{k+1} = P̂_θ(ŝ_k, a_k)
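The rolling-horizon objective can be made concrete with a tiny cross-entropy-method planner, the sampling-based optimizer used by PlaNet-style systems. Numpy sketch: `dynamics` and `cost` are toy stand-ins for the learned model P̂_θ and the cost c, not any real implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def dynamics(s, a):          # stand-in for the learned model P_theta
    return s + 0.1 * a       # toy point mass: the action nudges the state

def cost(s, a):              # stand-in for c(s, a): reach the origin cheaply
    return float(s @ s + 0.01 * a @ a)

def cem_plan(s0, horizon=10, pop=64, elites=8, iters=5, dim=2):
    """Sample action sequences, score them in the model, refit to the elites."""
    mu, std = np.zeros((horizon, dim)), np.ones((horizon, dim))
    for _ in range(iters):
        seqs = mu + std * rng.normal(size=(pop, horizon, dim))
        scores = []
        for seq in seqs:                     # roll out each candidate in the model
            s, total = s0, 0.0
            for a in seq:
                total += cost(s, a)
                s = dynamics(s, a)
            scores.append(total)
        best = seqs[np.argsort(scores)[:elites]]
        mu, std = best.mean(axis=0), best.std(axis=0) + 1e-3
    return mu[0]                             # execute the first action, then re-plan

s0 = np.array([1.0, -1.0])
a0 = cem_plan(s0)
# The planned first action should push the state toward the origin.
assert cost(dynamics(s0, a0), np.zeros(2)) < cost(s0, np.zeros(2))
```

Executing only `mu[0]` and re-planning from the next real observation is exactly the "re-plan every step" loop from Paradigm 1; swapping the toy `dynamics` for a learned latent model gives latent-space MPC.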
08 / 10

Sim-to-Real: The Hard Part

Models trained in simulation often fail catastrophically on real hardware. Three gaps explain why.

💥

Dynamics Gap

Most fatal. Sim assumes perfect rigid bodies, zero friction, no latency. Reality has friction, flex, motor lag, mass errors. One step off → cascading failure.

👁️

Perception Gap

Sim has perfect state. Reality has noisy cameras, IMU drift, occlusion, lighting changes. Partial observability breaks state estimation.

🦾

Execution Gap

Sim executes actions perfectly. Real motors have torque limits, dead zones, control delay. Action mismatch amplifies compound errors.

Engineering Checklist for Real Deployment

  • State: Use low-dim physics, not raw pixels
  • Physics priors: Embed constraints in loss function
  • Domain randomization: Randomize mass, friction, lighting in sim
  • DAgger fine-tune: Small real-world data, update dynamics only
  • MPC ≥ 20Hz: Re-plan every step from real observation
  • Short rollouts: 5–20 steps max, avoid error explosion
  • Safety layer: Hard-coded limits, independent collision detection
  • Uncertainty: Bayesian WM → back off when uncertain
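The domain-randomization item from the checklist can be as simple as resampling physical parameters every episode. Illustrative numpy sketch; the parameter ranges and the damped point-mass "simulator" are hypothetical stand-ins for a physics engine.

```python
import numpy as np

rng = np.random.default_rng(3)

def randomized_sim_params():
    """Resample physics parameters per episode so the policy never overfits
    to one (inevitably wrong) simulator configuration."""
    return {
        "mass": rng.uniform(0.8, 1.2),        # ±20% around the nominal mass
        "friction": rng.uniform(0.5, 1.5),    # wide band around the sim default
        "obs_noise_std": rng.uniform(0.0, 0.02),  # perception-gap stand-in
    }

def run_episode(params, steps=50, dt=0.02):
    """Toy rollout: a damped point mass under a fixed PD-style controller,
    whose behavior depends on the randomized parameters."""
    x, v = 1.0, 0.0
    for _ in range(steps):
        x_obs = x + rng.normal(0, params["obs_noise_std"])  # noisy sensing
        force = -2.0 * x_obs - 0.5 * v
        v += (force - params["friction"] * v) / params["mass"] * dt
        x += v * dt
    return abs(x)                             # distance from the goal at the end

# Train and evaluate across many randomized worlds, not one idealized one.
finals = [run_episode(randomized_sim_params()) for _ in range(20)]
assert len(finals) == 20 and all(f < 1.0 for f in finals)
```

A controller that keeps `finals` small across the whole parameter band is far more likely to survive the dynamics and perception gaps on real hardware than one tuned to a single configuration.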
09 / 10

Milestones Timeline

2018
World Models — Ha & Schmidhuber
First formal framework. VAE encoder + RNN dynamics + controller. Proved virtual training transfers.
2019
PlaNet — Hafner et al.
RSSM architecture. Pure model-based planning achieves SOTA in continuous control.
2020–23
Dreamer V1 → V2 → V3
V3: single config, 150+ tasks, SOTA on Minecraft diamond. Published in Nature 2025.
2022
LeCun — "A Path Towards AMI"
JEPA position paper. Argues generative models are wrong path; prediction in representation space is key.
2023–24
TD-MPC / TD-MPC2
Implicit world model + latent-space MPC. Gradient-based planning. Massive sample efficiency gains.
2024
V-JEPA — Meta
Video JEPA. Self-supervised video understanding by predicting masked regions in embedding space. No pixels, no text.
2024
Genie — DeepMind
Large-scale video pretraining → zero-shot generalization to interactive environments from a single image.
2025
DreamerV4
Transformer world model + video tokenizer. PyTorch native. Shortcut Forcing. Offline Minecraft diamonds.
2025
Genie 3 — DeepMind
Interactive 3D world model. Dynamic scene editing, long-horizon physics, real-time virtual training environments.
2026
WorldCache / PERSIST
Inference acceleration (2.6–3.7× speedup) and persistent 3D state models. Making world models practical.
10 / 10

My Explorations

🏙️ Traffic Intersection Simulation

Multi-Agent World Model Demo

Isometric 2.5D city simulation with 92 autonomous agents navigating 4 intersections, 35 buildings. Agents follow learned traffic dynamics with collision avoidance.

Python simulation → frame-level state export → Remotion React+SVG → cinematic MP4 (1920×1080, 30fps).

Python Remotion React + SVG H.264

🏘️ AI Town — Generative Agents

Godot + FastAPI

Inspired by Stanford's Generative Agents (Park et al., 2023). A small town where AI agents live, remember, reflect, and interact — built with Godot game engine for real-time visualization.

Each agent has: perception → memory stream → reflection → planning → action loop. FastAPI backend handles LLM calls and agent state.

Godot FastAPI LLM Agents Memory Stream

🔬 Research Direction: World Model as Universal Agent Evaluation Sandbox

Core question: Can a world model's simulation fidelity replace or augment traditional scripted simulation for evaluating multi-agent systems?

If yes → counterfactual reasoning, causal attribution, emergent behavior prediction — things scripted simulators can't do.

DreamerV4 Multi-Agent Eval Sim-to-Real ICML / NeurIPS Target

Let's discuss.  →  github.com/GuilinDev