Castalia Institute

Dungeon

A benchmark and implementation for evaluating AI world models in structured adversarial environments, inspired by Dungeons & Dragons and built around latent inference.

Research

Why DWMB

World-model learning and planning are widely seen as prerequisites for robust intelligence. Yet most RL benchmarks (Atari, Procgen, MuJoCo) let reactive policies succeed without truly understanding hidden structure—either because hidden state is rare or because success can be achieved by luck or memoryless play. The Dungeon World Model Benchmark (DWMB) aims to isolate latent inference: each instance is a grid-world POMDP where traps look identical to safe floor, switches far away arm or disarm hazards, and the agent gets only a local egocentric view. Reaching the goal is possible by taking risks; achieving safe success requires building and updating a belief over what is hidden.

Formal setup

DWMB is a distribution over episodic POMDPs. State is (agent pose, static topology, latent state): topology includes walls, doors, secret edges; latent state includes which traps are armed and which switches are toggled. Observations are a radius-r egocentric view plus events (e.g. “you hear a click,” damage). Crucially, hidden hazards are visually identical to floor in the view. Actions: Move, Inspect (low success probability to reveal a trap or secret), Interact (toggle switch, open door), UseItem. Reward is sparse (goal + small step cost); we also support a constraint regime where goal success and safety (no trap triggered, no death) are reported separately to avoid reward hacking.
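The observation/action interface above can be sketched in Python. The names below (`Action`, `Observation`, the `env.step` signature) are illustrative assumptions, not the released schema:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Action(Enum):
    MOVE_N = auto()
    MOVE_S = auto()
    MOVE_E = auto()
    MOVE_W = auto()
    INSPECT = auto()   # low success probability to reveal a trap or secret
    INTERACT = auto()  # toggle switch, open door
    USE_ITEM = auto()

@dataclass
class Observation:
    view: list    # (2r+1) x (2r+1) egocentric tile grid; armed traps render as floor
    events: list  # e.g. ["click"], ["damage"]
    pose: tuple   # agent pose, e.g. (x, y, facing)

# One step of the sparse-reward episodic loop (interface only, not runnable here):
# obs, reward, done, info = env.step(Action.INSPECT)
```

The key property is visible in the types: `view` never distinguishes an armed trap from floor, so danger must be inferred from `events` and history.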

The generator enforces adversarial constraints: (a) at least one trap has no unique visual hint; (b) there exists an alternate safe path; (c) in higher tiers, at least one switch far away affects a hazard with no local cue; (d) in T5, distractors (dead-ends, longer horizon) waste time. So agents cannot shortcut: they must infer latent structure and plan.

Difficulty tiers (T1–T5)

  • T1 — Single hazard, no switch.
  • T2 — 1–2 hazards, optional local switch.
  • T3 — Multiple hazards, at least one switch; the short path may be trapped.
  • T4 — Non-local causality: at least one switch controls a distant trap with no local cue.
  • T5 — T4 plus distractors (dead-end corridors, longer horizon).

Map size and visibility radius scale with tier. Exact generator parameters are fixed in the released code for reproducibility.
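A tier table might be encoded as a config dict like the one below. The numeric values here are placeholders for illustration; the real parameters are the ones fixed in the released generator:

```python
# Illustrative tier parameters only; the released code fixes the real values.
# hazards is an (inclusive) min/max range per instance.
TIERS = {
    "T1": dict(hazards=(1, 1), switches=0, nonlocal_switch=False, distractors=False,
               map_size=8,  view_radius=2),
    "T2": dict(hazards=(1, 2), switches=1, nonlocal_switch=False, distractors=False,
               map_size=10, view_radius=2),
    "T3": dict(hazards=(2, 4), switches=1, nonlocal_switch=False, distractors=False,
               map_size=12, view_radius=3),
    "T4": dict(hazards=(2, 4), switches=1, nonlocal_switch=True,  distractors=False,
               map_size=14, view_radius=3),
    "T5": dict(hazards=(3, 5), switches=2, nonlocal_switch=True,  distractors=True,
               map_size=16, view_radius=3),
}
```

Keeping the tiers as data rather than code makes the "map size and visibility scale with tier" claim auditable at a glance.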

Preemptive Inference Rate (PIR) and metrics

We propose PIR: for each hidden hazard, did the agent’s predicted probability of danger exceed a threshold before its first step on that tile? PIR is the fraction of hazards for which the answer is yes. That separates “safe” latent inference from “unsafe success” (reaching the goal by tolerating trap damage or luck). A uniform belief-extraction convention is required so PIR is comparable across methods: every agent must expose per-hazard probabilities at each step (from belief state, an auxiliary head, or prompted LLM output); evaluation scripts use only these logged values.
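Given the logged per-hazard probabilities described above, PIR reduces to a short function. The log format here (`probs`, `first_step_on_tile`) is a hypothetical convention, not the repo's actual schema:

```python
def pir(hazard_logs, threshold=0.5):
    """Preemptive Inference Rate over a set of hidden hazards.

    hazard_logs: one record per hazard, each a dict with
      - "probs": the agent's logged danger probability at every step, in order
      - "first_step_on_tile": step index of the agent's first step on the hazard
        tile, or None if it never stepped there (then every step counts as 'before')

    A hazard counts as preemptively inferred if the logged probability
    exceeded `threshold` at some step strictly before the first visit.
    """
    if not hazard_logs:
        return 0.0
    hits = 0
    for h in hazard_logs:
        cutoff = h["first_step_on_tile"]
        before = h["probs"] if cutoff is None else h["probs"][:cutoff]
        if any(p > threshold for p in before):
            hits += 1
    return hits / len(hazard_logs)
```

Because only logged values are used, the same function scores a belief-state planner, an auxiliary head, or a prompted LLM identically.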

We also report AUPIR (area under PIR vs. threshold), calibration as a secondary diagnostic, goal completion and survival rates, hazard activation counts, and optionally topology/causal discovery F1 and sample-efficiency AUCs. Success and safety are decoupled: an agent can have high goal rate but low PIR (risk-taking) or high PIR and high safety (belief-driven planning).
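AUPIR can be computed by sweeping the threshold and integrating with the trapezoidal rule. The sketch below is self-contained and reuses the same hypothetical log format (`probs`, `first_step_on_tile`); the threshold grid is an assumption:

```python
def _pir(hazard_logs, threshold):
    # Fraction of hazards whose logged danger probability exceeded `threshold`
    # before the agent's first step on the hazard tile. A cutoff of None
    # (never visited) slices to the whole list, so all steps count as 'before'.
    if not hazard_logs:
        return 0.0
    hits = sum(
        any(p > threshold for p in h["probs"][: h["first_step_on_tile"]])
        for h in hazard_logs
    )
    return hits / len(hazard_logs)

def aupir(hazard_logs, n=20):
    """Area under the PIR-vs-threshold curve on thresholds 0..1 (trapezoidal)."""
    ts = [i / n for i in range(n + 1)]
    ys = [_pir(hazard_logs, t) for t in ts]
    return sum((ts[i + 1] - ts[i]) * (ys[i] + ys[i + 1]) / 2 for i in range(n))
```

AUPIR removes the dependence on any single threshold choice, which is why per-threshold PIR plus calibration remain secondary diagnostics.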

Baselines and hypothesis

Baselines include:

  • Model-free RL (e.g. PPO+LSTM) — expected to achieve some goal success but low PIR (risk-taking).
  • MuZero-like planners — may ignore trap semantics or mis-assign credit over long causal chains.
  • Dreamer/RSSM recurrent world models — may struggle with discrete triggers or belief collapse.
  • LLM + memory — strong priors but risk of hallucinated maps and poor calibration.
  • JEPA-style encoder + predictor + planner — the proposed architecture; must map latent state to per-hazard probabilities for PIR.

Hypothesis (pending validation): At matched goal-success rates, agents that explicitly represent and update beliefs over latent structure will achieve higher PIR and better robustness on counterfactual tests (e.g. randomly permuting which tile is the trap or which switch controls which trap) than purely reactive agents. DWMB is a benchmark definition and preregistered evaluation protocol; we specify statistical tests (paired Wilcoxon, mixed-effects models, FDR correction). No empirical results are reported yet—the claim is falsifiable once experiments are run.
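The counterfactual tests mentioned above can be sketched as an instance transformation: re-sample trap locations among visually identical floor tiles and permute the switch-to-trap wiring. The instance format (`floor`, `traps`, `wiring`) is hypothetical, chosen only to make the idea concrete:

```python
import random

def counterfactual_variant(instance, rng=None):
    """Return a counterfactual copy of a dungeon instance (hypothetical format:
    {"floor": [tiles], "traps": [tiles], "wiring": {switch_id: trap_id}}).

    Trap tiles are re-sampled from ordinary-looking floor, and which switch
    controls which trap is permuted. A reactive agent that memorized the
    original layout should fail here; a belief-updating agent should not.
    """
    rng = rng or random.Random(0)
    new = {k: (v.copy() if hasattr(v, "copy") else v) for k, v in instance.items()}
    # Re-sample trap locations from tiles that look like ordinary floor.
    candidates = [t for t in instance["floor"] if t not in instance["traps"]]
    new["traps"] = rng.sample(candidates, len(instance["traps"]))
    # Permute which switch controls which trap.
    trap_ids = list(instance["wiring"].values())
    rng.shuffle(trap_ids)
    new["wiring"] = dict(zip(instance["wiring"].keys(), trap_ids))
    return new
```

Holding topology fixed while permuting only the latent assignments isolates exactly the structure the hypothesis says belief-based agents should track.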

Micro-credential

Learn World Models through D&D

We offer a micro-credential course through Castalia where we use Dungeons & Dragons to teach world models: latent state, partial observability, belief updating, and planning under uncertainty. Dungeons are the perfect sandbox—hidden traps, distant switches, and choices that reveal how well you (or an agent) hold a model of the world in mind.

The course ties directly to the Dungeon World Model Benchmark (DWMB). By the end, you'll understand the research, the metrics (like Preemptive Inference Rate), and how to run and extend the code—so you can evaluate AI agents that think ahead instead of just reacting.

Implementation

Code & infrastructure

The repo provides the DWMB schema, canonical unit-test dungeons, evaluation scripts, baseline agents, and optional Supabase storage for runs and metrics.

  • dwmb/ — Core: schema, env, validation, generator, Supabase storage
  • instances/unit_test/ — Canonical dungeons (e.g. worked 16×16)
  • scripts/evaluate.py — Run agent on instance, compute PIR, optional sync
  • scripts/generate_splits.py — Generate train/test/counterfactual JSONs per tier
  • scripts/train.py — Train PPO+LSTM (optional)
  • supabase/ — Migrations and edge functions for instances, runs, metrics

Agents: random, heuristic, ppo_lstm — each exposes per-hazard beliefs for PIR. See InquiryInstitute/dungeon for quick start and layout.
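The belief-exposure contract the agents share might look like the interface below. The class and method names are illustrative, not the repo's actual API:

```python
import random

class BeliefAgent:
    """Minimal interface for PIR evaluation: at every step the agent exposes a
    per-hazard danger probability alongside its chosen action."""

    def act(self, obs):
        raise NotImplementedError

    def hazard_beliefs(self):
        """Return {hazard_tile: P(armed trap here)} for the current step."""
        raise NotImplementedError

class UniformRandomAgent(BeliefAgent):
    """Trivial baseline: random actions, flat 0.5 belief on every candidate tile."""

    def __init__(self, actions, candidate_tiles, seed=0):
        self._rng = random.Random(seed)
        self._actions = actions
        self._tiles = candidate_tiles

    def act(self, obs):
        return self._rng.choice(self._actions)

    def hazard_beliefs(self):
        return {t: 0.5 for t in self._tiles}
```

An evaluation loop would call `hazard_beliefs()` after each `act()` and log the values; PIR is then computed from the logs alone, never from agent internals.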