MBRL World Models Planning

An end-to-end Model-Based Reinforcement Learning pipeline that learns environment dynamics and plans via imagined trajectories, achieving higher sample efficiency than model-free baselines.

The Problem

Model-free reinforcement learning agents learn by interacting with an environment thousands — often millions — of times before converging on a useful policy. This sample inefficiency makes real-world deployment impractical: physical robots break, simulators are expensive, and data collection is slow.

The core question this project answers: can an agent learn a compressed mental model of its environment, and use that model to plan — without needing to live through every experience?

Approach

This project builds a full Model-Based RL pipeline from scratch, inspired by the Dreamer / PlaNet family of world models.

World Model Architecture

The world model has three tightly coupled components:

1. VAE — Visual Encoder / Decoder
Raw observations (image frames) are compressed into a low-dimensional latent space using a Variational Autoencoder. This gives the agent a compact, continuous representation of "what it sees."
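The heart of the VAE is the reparameterization trick and the KL regularizer on the latent distribution. The sketch below is a minimal numpy illustration of those two pieces, not the project's actual code: the `encode` function stands in for a learned convolutional encoder with a fixed random projection, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(obs, latent_dim=8):
    """Stand-in encoder: a fixed random projection to (mu, logvar).
    The real pipeline would use a learned convolutional network here."""
    flat = obs.reshape(obs.shape[0], -1)
    w = rng.standard_normal((flat.shape[1], 2 * latent_dim)) * 0.01
    out = flat @ w
    return out[:, :latent_dim], out[:, latent_dim:]

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps, so gradients can flow through mu/logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL(q(z|x) || N(0, I)), summed over latent dims, averaged over batch."""
    return float(np.mean(np.sum(0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar), axis=1)))

obs = rng.standard_normal((4, 16, 16, 3))   # batch of 4 tiny "frames"
mu, logvar = encode(obs)
z = reparameterize(mu, logvar)              # compact latent the agent plans in
```

The KL term is what keeps the latent space smooth and continuous, which matters later when the planner samples and perturbs latents.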

2. RSSM — Recurrent State Space Model
The Recurrent State Space Model maintains a joint latent state that combines:

  • A deterministic hidden state (GRU/LSTM) — carries temporal memory across steps
  • A stochastic component — captures environment uncertainty

At each timestep, the RSSM predicts the next latent state given the current state and action. This is the agent's "imagination engine."

3. Reward Model
A lightweight MLP trained to predict scalar rewards from latent states — enabling the agent to evaluate imagined trajectories without touching the real environment.
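A minimal sketch of such a reward head, assuming a single tanh hidden layer and a linear scalar output (function and parameter names here are illustrative, not the project's API):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_reward_head(in_dim, hidden=32):
    """Hypothetical two-layer reward head with random (untrained) weights."""
    return {
        "W1": rng.standard_normal((hidden, in_dim)) * 0.1,
        "b1": np.zeros(hidden),
        "W2": rng.standard_normal(hidden) * 0.1,
    }

def predict_reward(p, latent):
    """Map a latent state to a scalar reward: tanh hidden layer, linear out."""
    return float(p["W2"] @ np.tanh(p["W1"] @ latent + p["b1"]))

head = init_reward_head(in_dim=20)            # e.g. deterministic + stochastic dims
r_hat = predict_reward(head, rng.standard_normal(20))
```

Because this head reads latent states rather than raw frames, evaluating an imagined trajectory costs only a few matrix multiplies per step.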

Planning in Latent Space

Given a trained world model, the agent plans entirely inside the learned latent space:

  • MCTS (Monte Carlo Tree Search): Builds a search tree of imagined futures, selecting actions that maximize expected cumulative reward.
  • CEM (Cross-Entropy Method): Iteratively refines action sequences by sampling, evaluating in imagination, and fitting a new distribution to the best candidates.

Both planners operate without any further interaction with the real environment during planning.
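The CEM loop in particular is compact enough to sketch end to end. Below, `imagined_return` is a toy stand-in for rolling out the learned world model (simple known linear latent dynamics with reward `-||z||^2`), so the optimizer's job is to drive the latent to zero; the real planner would unroll the RSSM and reward model instead. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def imagined_return(action_seq, z0=np.array([2.0, -1.0])):
    """Stand-in for a world-model rollout: linear latent dynamics with
    reward -||z||^2, so good plans push the latent toward zero."""
    z, total = z0.copy(), 0.0
    for a in action_seq:
        z = 0.9 * z + a            # toy latent transition
        total += -np.sum(z**2)     # toy reward model
    return total

def cem_plan(horizon=5, act_dim=2, pop=200, elites=20, iters=10):
    """Cross-Entropy Method: sample, score in imagination, refit to elites."""
    mu = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(iters):
        samples = mu + std * rng.standard_normal((pop, horizon, act_dim))
        scores = np.array([imagined_return(s) for s in samples])
        elite = samples[np.argsort(scores)[-elites:]]   # keep best candidates
        mu, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu

best_plan = cem_plan()
```

In a receding-horizon setup, only the first action of `best_plan` is executed before replanning from the new state.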

Training Pipeline

Real Env → Collect trajectories → Encode with VAE
                                        ↓
                              Train RSSM on sequences
                                        ↓
                              Train Reward Model
                                        ↓
                    Plan with MCTS / CEM in latent space
                                        ↓
                        Execute best action → Repeat
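The outer loop above can be sketched as a plan-act-collect cycle. The toy environment and one-step greedy "planner" below are deliberate stand-ins (a real agent would plan through the learned RSSM and retrain the models from the buffer); every name here is hypothetical.

```python
import numpy as np

class ToyEnv:
    """Stand-in 1-D environment: state decays toward 0, reward = -state^2."""
    def reset(self):
        self.s = 3.0
        return self.s

    def step(self, a):
        self.s = 0.9 * self.s + float(a)
        return self.s, -self.s**2

def plan_greedy(s, candidates=np.linspace(-1, 1, 21)):
    """Stand-in planner: score candidate actions under a one-step model.
    The real pipeline would run MCTS/CEM through the learned RSSM here."""
    return candidates[np.argmin((0.9 * s + candidates) ** 2)]

env, buffer = ToyEnv(), []
obs = env.reset()
for _ in range(20):                    # collect -> plan -> act -> repeat
    a = plan_greedy(obs)               # plan using the (here: known) model
    nxt, r = env.step(a)
    buffer.append((obs, a, r, nxt))    # real transitions for model training
    obs = nxt
```

In the full pipeline, the buffered transitions feed the VAE/RSSM/reward-model training steps between rounds of acting, closing the loop in the diagram above.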

Results

| Metric | Model-Free (PPO/SAC) | MBRL (This Work) |
|--------|----------------------|------------------|
| Sample efficiency | Baseline | Significantly fewer interactions |
| Policy stability | Moderate variance | Stable convergence |
| Planning horizon | N/A | Multi-step imagined rollouts |

The pipeline demonstrates stable policy learning in continuous control tasks (locomotion, manipulation) with substantially fewer real-environment interactions than PPO and SAC baselines.

Key Learnings

  • Latent dynamics are powerful but fragile. The RSSM must be trained carefully — compounding prediction errors over long rollout horizons can destabilize planning.
  • CEM is deceptively effective. Despite its simplicity, the Cross-Entropy Method produces competitive plans when the world model is accurate.
  • TensorBoard is essential. Tracking latent reconstructions, KL divergence, and reward prediction loss separately was critical for diagnosing training issues.

What's Next

  • Integrating a learned value function (actor-critic) alongside the planner
  • Extending to pixel-based 3D environments (Isaac Gym / MuJoCo)
  • Experimenting with transformer-based world models (TWIRL / Genie-style)