01 / 10

World Models

From Architecture to Frontier — A Practitioner's Overview

Guilin Zhang · March 2026
github.com/GuilinDev

02 / 10

What Is a World Model?

"A world model is an internal simulator that allows an agent to predict the consequences of its actions without actually performing them — the key to human-level AI."
— Yann LeCun, Turing Award laureate, "A Path Towards Autonomous Machine Intelligence" (2022)

Current large models excel at pattern matching but fail at four fundamental capabilities:

🧲

No Physical Common Sense

LLMs don't understand gravity, momentum, or spatial reasoning. They predict tokens, not physics.

🔭

Weak Long-Horizon Planning

End-to-end systems can't look ahead — they react step-by-step without simulating consequences.

🔗

No Causal Reasoning

Correlation ≠ causation. Models learn "what co-occurs" but not "what causes what."

📊

Data Inefficiency

Current models require orders of magnitude more data than humans. Humans learn to catch a ball in minutes, not millions of trials.

World Models address these gaps by building an internal simulator of the environment — learn the dynamics, simulate outcomes, plan before acting. Model-based agents routinely report large (often 90%+) reductions in required real-world interaction.
03 / 10

LeCun's Blueprint: JEPA

Joint Embedding Predictive Architecture — the core of LeCun's vision for how world models should work.

🧠 Key Insight

Don't predict pixels. Predict representations.

Instead of generating raw images/video of the future (expensive, noisy), predict in a learned abstract representation space. Think "predict the meaning, not the pixels."

How JEPA Works

  • Encode observation x → representation s_x (context encoder)
  • Encode target y → representation s_y (target encoder)
  • Predictor learns: ŝ_y = f(s_x, action)
  • Loss compares ŝ_y to s_y in embedding space, not pixel space
  • Avoids representation collapse via an asymmetric design (EMA target encoder, stop-gradient)
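The loop above can be sketched with toy linear encoders. This is an illustrative numpy sketch, not a real JEPA: actual implementations use deep networks, patch-level masking, and autograd, and the shapes and learning rates here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "encoders": linear maps from 8-D observations to 4-D representations.
W_ctx = rng.normal(size=(4, 8)) * 0.1       # context (online) encoder
W_tgt = W_ctx.copy()                        # target encoder, updated only by EMA
W_pred = rng.normal(size=(4, 4 + 2)) * 0.1  # predictor: f(s_x, action) -> s_y_hat

def jepa_step(obs, target, action, lr=1e-2, ema=0.99):
    """One JEPA update: predict the target's representation, not its pixels."""
    global W_ctx, W_pred, W_tgt
    s_x = W_ctx @ obs                        # context representation
    s_y = W_tgt @ target                     # target representation (no gradient)
    inp = np.concatenate([s_x, action])
    s_y_hat = W_pred @ inp                   # predicted target representation
    err = s_y_hat - s_y                      # loss = 0.5 * ||err||^2 in embedding space
    g_pred = np.outer(err, inp)              # d loss / d W_pred (manual gradients)
    g_ctx = np.outer(W_pred[:, :4].T @ err, obs)  # chain rule through s_x
    W_pred -= lr * g_pred
    W_ctx -= lr * g_ctx
    W_tgt = ema * W_tgt + (1 - ema) * W_ctx  # EMA target encoder: the asymmetry
    return 0.5 * float(err @ err)

obs, target, action = rng.normal(size=8), rng.normal(size=8), rng.normal(size=2)
losses = [jepa_step(obs, target, action) for _ in range(200)]
assert losses[-1] < losses[0]                # embedding-space loss decreases
```

Note the two collapse-prevention ingredients from the bullets: gradients never flow into the target encoder, and the target encoder trails the context encoder via EMA.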

Why It Matters

  • Efficient — no need to reconstruct every pixel
  • Robust — focuses on "what matters" not visual noise
  • Scalable — representations are compact, physics-friendly
  • Multi-modal — same framework for video, audio, touch

V-JEPA (2024)

Meta's Video JEPA — learns world dynamics from unlabeled video by predicting masked spatiotemporal regions in representation space. No pixel reconstruction, no text supervision.

Self-Supervised Video Understanding
JEPA vs Generative Models: GPT/Sora predict the next token/frame. JEPA predicts the next abstract state. This is why LeCun argues generative models are a dead end for world understanding — they waste capacity modeling irrelevant details.
04 / 10

Two Types of World Models

🎯 Predictive (Decision-Oriented)

Core Mainstream
  • Models state transitions, rewards, causal dynamics
  • Goal: predict future states accurately for planning
  • Input: low-dim physics (position, velocity, joint angles)
  • Output: future states + values + uncertainty
  • Sim-to-Real transfer: easier
PlaNet Dreamer V3/V4 TD-MPC2

🎨 Generative (Perception-Oriented)

Simulation / Data Aug
  • Generates high-dim observations (images, point clouds)
  • Goal: visually reconstruct scenes, build virtual environments
  • Input: raw perception (pixels, video frames)
  • Output: reconstructed observations, virtual scenes
  • Physics fidelity: weaker, needs extra constraints
Genie 3 UniSim Sora-like
Where does JEPA fit? JEPA is a predictive world model that operates in representation space — it doesn't generate pixels (like generative models) and doesn't require reward signals (like RL-based predictive models). It's LeCun's "third way."
05 / 10

Core Architecture

The classic three-layer pipeline (Ha & Schmidhuber, 2018) → evolved into RSSM (Dreamer series):

Input
🎥 Raw Observation
Images, LiDAR, IMU
Layer 1: Encoder
📐 Perception
VAE / CNN / ViT → compact latent
Layer 2: Dynamics
🧠 World State Model
RSSM: deterministic + stochastic
Layer 3: Predictor
🔮 Future States
Rewards, values, rollouts

RSSM — The Standard (Dreamer)

Recurrent State-Space Model — the backbone of modern predictive world models.

  • Deterministic path: RNN/GRU remembers history
  • Stochastic path: latent variable models uncertainty
  • Together: accurate + robust long-horizon rollouts
  • Handles partial observability gracefully
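The deterministic + stochastic split can be sketched in a few lines. Numpy sketch with hypothetical shapes; a real RSSM uses a learned GRU, Gaussian heads trained with a KL objective, and an observation encoder.

```python
import numpy as np

rng = np.random.default_rng(1)
H, Z, A = 16, 4, 2                           # deterministic, stochastic, action dims

# Random parameters standing in for learned networks.
W_h = rng.normal(size=(H, H + Z + A)) * 0.1  # deterministic path (GRU stand-in)
W_mu = rng.normal(size=(Z, H)) * 0.1         # prior mean head
W_std = rng.normal(size=(Z, H)) * 0.1        # prior std head (pre-softplus)

def rssm_step(h, z, a):
    """One RSSM transition: deterministic memory plus stochastic latent."""
    # Deterministic path: recurrent update remembers history.
    h_next = np.tanh(W_h @ np.concatenate([h, z, a]))
    # Stochastic path: sample the next latent from a state-conditioned Gaussian.
    mu = W_mu @ h_next
    std = np.log1p(np.exp(W_std @ h_next))   # softplus keeps std positive
    z_next = mu + std * rng.normal(size=Z)
    return h_next, z_next

# Imagined rollout: start from an initial state, roll 15 steps on random actions.
h, z = np.zeros(H), np.zeros(Z)
traj = []
for _ in range(15):
    h, z = rssm_step(h, z, rng.normal(size=A))
    traj.append(np.concatenate([h, z]))
assert len(traj) == 15 and traj[0].shape == (H + Z,)
```

Rollouts like this never touch pixels: reward and value heads read (h, z) directly, which is what makes long-horizon imagination cheap.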

DreamerV4 — The Evolution

Replaces RSSM's RNN with Transformer World Model + Video Tokenizer.

  • PyTorch native (V3 was JAX)
  • Shortcut Forcing for stable training
  • Offline-only diamond mining in Minecraft
  • Scales better with compute
p_θ(s_{t+1}, o_t | s_t, a_t) ,   q_φ(s_t | o_{1:t}, a_{1:t-1})
06 / 10

World Models vs VLA

Two paradigms for embodied intelligence — converging, not competing.

Dimension | 🧲 World Model | 🗣️ VLA (Vision-Language-Action)
Core | Physics dynamics model | Large language model as hub
Focus | Space, motion, causality | Semantics, instructions, reasoning
Analogy | AI "Physicist" | AI "Linguist"
Real-time | High — closed-loop control | Lower — language reasoning overhead
Sim-to-Real | Easier — physics-aligned | Harder — needs execution adapters
Best for | Self-driving, industrial robots, drones | Service robots, home assistants, HRI
Key works | DreamerV3, TD-MPC2, Huawei ADS | Wayve LINGO-1, Li Auto MindVLA
Convergence: The frontier is VLA + World Model — VLA handles high-level semantic understanding ("pick up the red cup"), world model handles low-level physics simulation (how to actually move the arm). Think: language on top, physics underneath.
07 / 10

Fusion: RL + MPC + World Models

World model as the simulator. RL for global optimization. MPC for real-time control. Three components, one loop.

Paradigm 1: WM + MPC

Most Mature

World model replaces unknown dynamics → MPC optimizes actions in learned latent space. Rolling horizon, re-plan every step.

Key: PlaNet, Dreamer, TD-MPC2

Paradigm 2: RL + WM + MPC

Global + Local

RL learns global value function V(s) for long-term guidance. MPC handles local constraint optimization in real-time.

Best for: Autonomous driving

Paradigm 3: RL Tunes MPC

Adaptive

RL optimizes MPC hyperparameters (cost weights, dynamics params) instead of hand-tuning. Auto-adapts to changing environments.

Paradigm 4: Differentiable MPC

End-to-End

MPC acts as a differentiable actor; a critic evaluates its plans. Trained jointly, end-to-end. Best for high-speed drones and robots that need RL adaptability plus MPC stability.

a*_{0:T} = argmin Σ_k c(ŝ_k, a_k) + V̂(ŝ_T)    subject to   ŝ_{k+1} = P̂_θ(ŝ_k, a_k)
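The rolling-horizon objective can be made concrete with a tiny cross-entropy-method planner, the sampling-based optimizer used by PlaNet-style systems. Numpy sketch: `dynamics` and `cost` are toy stand-ins for the learned model P̂_θ and the cost c, not any real implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def dynamics(s, a):          # stand-in for the learned model P_theta
    return s + 0.1 * a       # toy point mass: the action nudges the state

def cost(s, a):              # stand-in for c(s, a): reach the origin cheaply
    return float(s @ s + 0.01 * a @ a)

def cem_plan(s0, horizon=10, pop=64, elites=8, iters=5, dim=2):
    """Sample action sequences, score them in the model, refit to the elites."""
    mu, std = np.zeros((horizon, dim)), np.ones((horizon, dim))
    for _ in range(iters):
        seqs = mu + std * rng.normal(size=(pop, horizon, dim))
        scores = []
        for seq in seqs:                     # roll out each candidate in the model
            s, total = s0, 0.0
            for a in seq:
                total += cost(s, a)
                s = dynamics(s, a)
            scores.append(total)
        best = seqs[np.argsort(scores)[:elites]]
        mu, std = best.mean(axis=0), best.std(axis=0) + 1e-3
    return mu[0]                             # execute the first action, then re-plan

s0 = np.array([1.0, -1.0])
a0 = cem_plan(s0)
# The planned first action should push the state toward the origin.
assert cost(dynamics(s0, a0), np.zeros(2)) < cost(s0, np.zeros(2))
```

Executing only `mu[0]` and re-planning from the next real observation is exactly the "re-plan every step" loop from Paradigm 1; swapping the toy `dynamics` for a learned latent model gives latent-space MPC.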
08 / 10

Sim-to-Real: The Hard Part

Models trained in simulation often fail catastrophically on real hardware. Three gaps explain why.

💥

Dynamics Gap

Most fatal. Sim assumes perfect rigid bodies, zero friction, no latency. Reality has friction, flex, motor lag, mass errors. One step off → cascading failure.

👁️

Perception Gap

Sim has perfect state. Reality has noisy cameras, IMU drift, occlusion, lighting changes. Partial observability breaks state estimation.

🦾

Execution Gap

Sim executes actions perfectly. Real motors have torque limits, dead zones, control delay. Action mismatch amplifies compound errors.

Engineering Checklist for Real Deployment

  • State: Use low-dim physics, not raw pixels
  • Physics priors: Embed constraints in loss function
  • Domain randomization: Randomize mass, friction, lighting in sim
  • DAgger fine-tune: Small real-world data, update dynamics only
  • MPC ≥ 20Hz: Re-plan every step from real observation
  • Short rollouts: 5–20 steps max, avoid error explosion
  • Safety layer: Hard-coded limits, independent collision detection
  • Uncertainty: Bayesian WM → back off when uncertain
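The domain-randomization item from the checklist can be as simple as resampling physical parameters every episode. Illustrative numpy sketch; the parameter ranges and the damped point-mass "simulator" are hypothetical stand-ins for a physics engine.

```python
import numpy as np

rng = np.random.default_rng(3)

def randomized_sim_params():
    """Resample physics parameters per episode so the policy never overfits
    to one (inevitably wrong) simulator configuration."""
    return {
        "mass": rng.uniform(0.8, 1.2),        # ±20% around the nominal mass
        "friction": rng.uniform(0.5, 1.5),    # wide band around the sim default
        "obs_noise_std": rng.uniform(0.0, 0.02),  # perception-gap stand-in
    }

def run_episode(params, steps=50, dt=0.02):
    """Toy rollout: a damped point mass under a fixed PD-style controller,
    whose behavior depends on the randomized parameters."""
    x, v = 1.0, 0.0
    for _ in range(steps):
        x_obs = x + rng.normal(0, params["obs_noise_std"])  # noisy sensing
        force = -2.0 * x_obs - 0.5 * v
        v += (force - params["friction"] * v) / params["mass"] * dt
        x += v * dt
    return abs(x)                             # distance from the goal at the end

# Train and evaluate across many randomized worlds, not one idealized one.
finals = [run_episode(randomized_sim_params()) for _ in range(20)]
assert len(finals) == 20 and all(f < 1.0 for f in finals)
```

A controller that keeps `finals` small across the whole parameter band is far more likely to survive the dynamics and perception gaps on real hardware than one tuned to a single configuration.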
09 / 10

Milestones Timeline

2018
World Models — Ha & Schmidhuber
First formal framework. VAE encoder + RNN dynamics + controller. Proved virtual training transfers.
2019
PlaNet — Hafner et al.
RSSM architecture. Pure model-based planning achieves SOTA in continuous control.
2020–23
Dreamer V1 → V2 → V3
V3: single config, 150+ tasks, SOTA on Minecraft diamond. Published in Nature 2025.
2022
LeCun — "A Path Towards AMI"
JEPA position paper. Argues generative models are wrong path; prediction in representation space is key.
2023–24
TD-MPC / TD-MPC2
Implicit world model + latent-space MPC. Gradient-based planning. Massive sample efficiency gains.
2024
V-JEPA — Meta
Video JEPA. Self-supervised video understanding by predicting masked regions in embedding space. No pixels, no text.
2024
Genie — DeepMind
Large-scale video pretraining → zero-shot generalization to interactive environments from a single image.
2025
DreamerV4
Transformer world model + video tokenizer. PyTorch native. Shortcut Forcing. Offline Minecraft diamonds.
2025
Genie 3 — DeepMind
Interactive 3D world model. Dynamic scene editing, long-horizon physics, real-time virtual training environments.
2026
WorldCache / PERSIST
Inference acceleration (2.6–3.7× speedup) and persistent 3D state models. Making world models practical.
10 / 10

My Explorations

🏙️ Traffic Intersection Simulation

Multi-Agent World Model Demo

Isometric 2.5D city simulation with 92 autonomous agents navigating 4 intersections, 35 buildings. Agents follow learned traffic dynamics with collision avoidance.

Python simulation → frame-level state export → Remotion React+SVG → cinematic MP4 (1920×1080, 30fps).

Python Remotion React + SVG H.264

🏘️ AI Town — Generative Agents

Godot + FastAPI

Inspired by Stanford's Generative Agents (Park et al., 2023). A small town where AI agents live, remember, reflect, and interact — built with Godot game engine for real-time visualization.

Each agent has: perception → memory stream → reflection → planning → action loop. FastAPI backend handles LLM calls and agent state.

Godot FastAPI LLM Agents Memory Stream

🔬 Research Direction: World Model as Universal Agent Evaluation Sandbox

Core question: Can a world model's simulation fidelity replace or augment traditional scripted simulation for evaluating multi-agent systems?

If yes → counterfactual reasoning, causal attribution, emergent behavior prediction — things scripted simulators can't do.

DreamerV4 Multi-Agent Eval Sim-to-Real ICML / NeurIPS Target

Let's discuss.  →  github.com/GuilinDev