From Architecture to Frontier — A Practitioner's Overview
Guilin Zhang · March 2026
github.com/GuilinDev
Current large models excel at pattern matching but fail at four fundamental capabilities:
1. No intuitive physics: LLMs don't understand gravity, momentum, or spatial reasoning. They predict tokens, not physics.
2. No lookahead: end-to-end systems react step-by-step without simulating consequences.
3. No causal understanding: correlation ≠ causation. Models learn "what co-occurs," not "what causes what."
4. Poor sample efficiency: they require massive labeled data. Humans learn to catch a ball in minutes, not millions of trials.
Joint Embedding Predictive Architecture — the core of LeCun's vision for how world models should work.
Don't predict pixels. Predict representations.
Instead of generating raw images/video of the future (expensive, noisy), predict in a learned abstract representation space. Think "predict the meaning, not the pixels."
Meta's V-JEPA (Video JEPA) learns world dynamics from unlabeled video by predicting masked spatiotemporal regions in representation space: no pixel reconstruction, no text supervision.
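The "predict representations, not pixels" idea can be sketched in a few lines. This is a minimal toy, not the actual JEPA architecture: the linear-plus-tanh "encoders", the pooled predictor, and all weight matrices here are hypothetical stand-ins for learned networks. The point it illustrates is only that the loss is computed between representations of masked regions, never between pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Tiny stand-in encoder: linear map + tanh (hypothetical, not JEPA's real net)."""
    return np.tanh(x @ W)

# Toy "video": 8 patches of 16-dim features; mask half, predict their representations.
patches = rng.normal(size=(8, 16))
mask = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=bool)  # patches to predict

W_ctx = rng.normal(scale=0.1, size=(16, 32))   # context encoder weights
W_tgt = rng.normal(scale=0.1, size=(16, 32))   # target encoder weights (EMA copy in practice)
W_pred = rng.normal(scale=0.1, size=(32, 32))  # predictor weights

z_ctx = encode(patches[~mask], W_ctx)               # encode only visible patches
z_pred = z_ctx.mean(axis=0) @ W_pred                # predict masked-region representation
z_tgt = encode(patches[mask], W_tgt).mean(axis=0)   # target representation

# JEPA-style loss: distance in representation space, never in pixel space.
loss = float(np.mean((z_pred - z_tgt) ** 2))
print(loss)
```

In the real system the target encoder is an exponential moving average of the context encoder, which prevents representation collapse; the toy above skips that detail.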
The classic three-component V-M-C pipeline (Ha & Schmidhuber, 2018) evolved into the RSSM (Dreamer series):
Recurrent State-Space Model — the backbone of modern predictive world models.
Newer transformer-based variants replace RSSM's RNN with a Transformer world model plus a video tokenizer.
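The RSSM transition above can be sketched as follows. This is a minimal sketch with random matrices standing in for learned networks (all weight names are hypothetical): a deterministic recurrent path carries history, and a stochastic state is sampled from either a prior (imagination, no observation) or a posterior (when an observation is available).

```python
import numpy as np

rng = np.random.default_rng(0)
H, S, A, O = 8, 4, 2, 6  # deterministic, stochastic, action, observation dims

# Hypothetical random weights standing in for learned networks.
W_h = rng.normal(scale=0.1, size=(H + S + A, H))
W_prior = rng.normal(scale=0.1, size=(H, 2 * S))
W_post = rng.normal(scale=0.1, size=(H + O, 2 * S))

def rssm_step(h, s, a, o=None):
    """One RSSM transition: deterministic recurrent update + stochastic state sample."""
    h = np.tanh(np.concatenate([h, s, a]) @ W_h)           # deterministic path
    stats = (np.concatenate([h, o]) @ W_post) if o is not None else (h @ W_prior)
    mean, log_std = stats[:S], stats[S:]
    s = mean + np.exp(log_std) * rng.normal(size=S)        # reparameterized sample
    return h, s

h, s = np.zeros(H), np.zeros(S)
for t in range(5):  # imagine 5 steps open-loop, prior only — this is "dreaming"
    h, s = rssm_step(h, s, a=rng.normal(size=A))
print(h.shape, s.shape)
```

The split matters: the deterministic path keeps long-range memory stable, while the stochastic state absorbs the irreducible uncertainty of the environment.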
Two paradigms for embodied intelligence — converging, not competing.
| Dimension | 🧲 World Model | 🗣️ VLA (Vision-Language-Action) |
|---|---|---|
| Core | Physics dynamics model | Large language model as hub |
| Focus | Space, motion, causality | Semantics, instructions, reasoning |
| Analogy | AI "Physicist" | AI "Linguist" |
| Real-time | High — closed-loop control | Lower — language reasoning overhead |
| Sim-to-Real | Easier — physics-aligned | Harder — needs execution adapters |
| Best for | Self-driving, industrial robots, drones | Service robots, home assistants, HRI |
| Key works | DreamerV3, TD-MPC2, Huawei ADS | Wayve LINGO-1, Li Auto MindVLA |
World model as the simulator. RL for global optimization. MPC for real-time control. Three components, one loop.
World model replaces unknown dynamics → MPC optimizes actions in learned latent space. Rolling horizon, re-plan every step.
Key: PlaNet, Dreamer, TD-MPC2
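The world-model-plus-MPC loop can be sketched with the cross-entropy method (CEM), the planner used in PlaNet-style systems. Everything learned is faked here with random linear maps (hypothetical stand-ins), but the planning logic is the standard one: sample action sequences, score them by rolling the latent dynamics forward, refit the sampling distribution to the elites, and execute only the first action before re-planning.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, HORIZON, POP, ITERS, ELITE = 4, 2, 6, 64, 3, 8

# Hypothetical stand-ins for the learned latent dynamics and reward models.
W_dyn = rng.normal(scale=0.3, size=(S + A, S))
w_rew = rng.normal(size=S)

def rollout_reward(s0, actions):
    """Score an action sequence by imagining it through the learned dynamics."""
    s, total = s0, 0.0
    for a in actions:
        s = np.tanh(np.concatenate([s, a]) @ W_dyn)
        total += float(s @ w_rew)
    return total

def plan_cem(s0):
    """Cross-entropy method over action sequences in latent space."""
    mean, std = np.zeros((HORIZON, A)), np.ones((HORIZON, A))
    for _ in range(ITERS):
        cand = mean + std * rng.normal(size=(POP, HORIZON, A))
        scores = np.array([rollout_reward(s0, c) for c in cand])
        elite = cand[np.argsort(scores)[-ELITE:]]          # keep the best sequences
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0]  # rolling horizon: execute first action only, then re-plan

a0 = plan_cem(rng.normal(size=S))
print(a0.shape)
```

Returning only `mean[0]` is the "re-plan every step" part of the slide: the rest of the optimized sequence is discarded once new state information arrives.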
RL learns a global value function V(s) for long-term guidance. MPC handles local constrained optimization in real time.
Best for: Autonomous driving
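The RL + MPC split shows up directly in the planning objective: sum short-horizon rewards from the model rollout, then bootstrap with the learned value V(s) at the horizon's end. A minimal sketch, with random linear maps as hypothetical stand-ins for the learned dynamics, reward, and value networks:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, H = 4, 2, 3  # short MPC horizon; V(s) covers everything beyond it

W_dyn = rng.normal(scale=0.3, size=(S + A, S))  # hypothetical learned dynamics
w_rew = rng.normal(size=S)                      # hypothetical learned reward
w_val = rng.normal(size=S)                      # hypothetical learned value V(s)

def objective(s, actions, gamma=0.99):
    """MPC objective: short-horizon rewards + discounted terminal value from RL."""
    total = 0.0
    for t, a in enumerate(actions):
        s = np.tanh(np.concatenate([s, a]) @ W_dyn)
        total += gamma**t * float(s @ w_rew)
    return total + gamma**H * float(s @ w_val)  # V(s_H) extends the horizon "for free"

j = objective(np.zeros(S), rng.normal(size=(H, A)))
print(j)
```

This is the structure TD-MPC2 uses: the terminal value lets the planner keep a very short rollout (cheap, less model drift) while still optimizing for long-term return.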
RL optimizes MPC hyperparameters (cost weights, dynamics params) instead of hand-tuning them. The controller auto-adapts to changing environments.
MPC acts as a differentiable Actor while a Critic evaluates its actions; the two are trained jointly end-to-end. Best for high-speed drones and robots that need RL's adaptability plus MPC's stability.
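The actor-critic role split can be sketched as follows. This toy uses finite differences as a stand-in for analytic gradients through an MPC solver, and all weights (`W_dyn`, `w_cost`, `w_q`) are hypothetical; the point is only the division of labor: the MPC layer produces the action by optimizing a model-based cost, and a critic scores that action.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A = 4, 2

W_dyn = rng.normal(scale=0.3, size=(S + A, S))  # hypothetical learned dynamics
w_cost = np.abs(rng.normal(size=S))             # hypothetical quadratic state cost
w_q = rng.normal(size=S + A)                    # hypothetical critic weights

def mpc_cost(s, a):
    """One-step model-based cost: penalize the predicted next state."""
    s_next = np.tanh(np.concatenate([s, a]) @ W_dyn)
    return float(np.sum(w_cost * s_next**2))

def mpc_actor(s, steps=20, lr=0.1):
    """MPC-as-actor: refine the action by gradient descent on the model cost.
    (Finite differences here; real systems differentiate through the solver.)"""
    a = np.zeros(A)
    for _ in range(steps):
        grad = np.zeros(A)
        for i in range(A):
            e = np.zeros(A); e[i] = 1e-4
            grad[i] = (mpc_cost(s, a + e) - mpc_cost(s, a - e)) / 2e-4
        a -= lr * grad
    return a

s = rng.normal(size=S)
a = mpc_actor(s)                               # actor: structured, model-based
q_value = float(np.concatenate([s, a]) @ w_q)  # critic: evaluates the action
print(a.shape)
```

The appeal of the real version: because the action is the output of an optimization layer, critic gradients can flow back into the MPC cost terms, so RL tunes *what the controller optimizes* rather than the raw motor commands.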
Models trained in simulation often fail catastrophically on real hardware. Three gaps explain why.
The most severe. Sim assumes perfect rigid bodies, zero friction, no latency. Reality has friction, flex, motor lag, mass errors. One step off → cascading failure.
Sim has perfect state. Reality has noisy cameras, IMU drift, occlusion, lighting changes. Partial observability breaks state estimation.
Sim executes actions perfectly. Real motors have torque limits, dead zones, control delay. Action mismatch amplifies compound errors.
Isometric 2.5D city simulation with 92 autonomous agents navigating 4 intersections, 35 buildings. Agents follow learned traffic dynamics with collision avoidance.
Python simulation → frame-level state export → Remotion React+SVG → cinematic MP4 (1920×1080, 30fps).
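The "frame-level state export" step of this pipeline can be sketched as below. The field names (`frame`, `agents`, `x`, `y`, `heading`) and the circular toy motion are illustrative assumptions, not the project's actual schema; the idea is just that the Python side serializes one state snapshot per video frame, which Remotion then maps to SVG elements.

```python
import json
import math

FPS, SECONDS, AGENTS = 30, 2, 3  # tiny toy run (the real project uses 92 agents)

frames = []
for f in range(FPS * SECONDS):
    t = f / FPS
    frames.append({
        "frame": f,
        "agents": [
            {"id": i,
             "x": round(100 * math.cos(t + i), 2),   # toy circular motion
             "y": round(100 * math.sin(t + i), 2),
             "heading": round(t + i, 3)}
            for i in range(AGENTS)
        ],
    })

# Remotion reads this JSON and renders one React+SVG frame per entry at 30fps.
payload = json.dumps({"fps": FPS, "frames": frames})
print(len(frames))
```

Exporting per-frame state instead of rendering in Python keeps the simulation deterministic and replayable: the same JSON always produces the same MP4, and visual styling can change without re-running the simulation.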
Inspired by Stanford's Generative Agents (Park et al., 2023). A small town where AI agents live, remember, reflect, and interact — built with Godot game engine for real-time visualization.
Each agent has: perception → memory stream → reflection → planning → action loop. FastAPI backend handles LLM calls and agent state.
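The perception → memory stream → reflection → planning → action loop can be sketched as a class skeleton. All names here are illustrative, not the repo's API, and the LLM calls that the FastAPI backend would make are replaced with trivial string logic; only the loop structure is the point.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Minimal generative-agent loop (illustrative names, no real LLM calls)."""
    name: str
    memory: list = field(default_factory=list)

    def perceive(self, world):
        obs = f"{self.name} sees {world['event']}"
        self.memory.append(obs)          # append observation to the memory stream
        return obs

    def reflect(self):
        # Real system: an LLM summarizes salient memories; here, take the latest.
        return self.memory[-1] if self.memory else "nothing yet"

    def plan(self, reflection):
        # Real system: LLM turns reflection into a schedule; here, a stub.
        return f"respond to: {reflection}"

    def act(self, plan):
        return f"{self.name} executes '{plan}'"

agent = Agent("Ada")
agent.perceive({"event": "a crowd at the cafe"})
action = agent.act(agent.plan(agent.reflect()))
print(action)
```

In the actual architecture each of these methods becomes an LLM-backed endpoint, and the memory stream is scored for recency, importance, and relevance before reflection, per Park et al. (2023).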
Core question: Can a world model's simulation fidelity replace or augment traditional scripted simulation for evaluating multi-agent systems?
If yes → counterfactual reasoning, causal attribution, emergent behavior prediction — things scripted simulators can't do.
Let's discuss. → github.com/GuilinDev