tuul.dev · interactive learning deck

Rebuilding AlphaGo from scratch

A hands-on deck for understanding and re-implementing AlphaGo, AlphaZero, and MuZero from scratch. AlphaGo plays Go by combining a policy network (which moves look good) and a value network (who's winning) with Monte Carlo Tree Search guided by the PUCT rule. The original AlphaGo bootstrapped its policy network from human expert games; its successors AlphaGo Zero and AlphaZero train entirely by self-play reinforcement learning, with no human games required. This deck walks through the whole idea across 32 scenes and 31 deep-dives & labs, with animated equations, a playable Go board, flashcards, and margin notes you can write in.
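For a concrete feel of how the policy and value estimates meet inside the search, here is a minimal sketch of PUCT child selection, assuming the form used in the AlphaGo Zero paper: score = Q + c_puct · P · sqrt(total visits) / (1 + N). The function name, the list-based node layout, and the default c_puct of 1.5 are illustrative assumptions, not taken from the lecture or the AutoGo repository.

```python
import math


def puct_select(priors, visit_counts, value_sums, c_puct=1.5):
    """Return the index of the child with the highest PUCT score.

    priors       -- policy-network probability P(s, a) for each child
    visit_counts -- visit count N(s, a) for each child
    value_sums   -- sum of backed-up values W(s, a) for each child
    c_puct       -- exploration constant (1.5 is an arbitrary assumption)
    """
    total_visits = sum(visit_counts)
    best_index, best_score = 0, -math.inf
    for i, (p, n, w) in enumerate(zip(priors, visit_counts, value_sums)):
        # Exploitation term: mean value so far; unvisited children default to 0.
        q = w / n if n > 0 else 0.0
        # Exploration term: large for high-prior, rarely visited children.
        u = c_puct * p * math.sqrt(total_visits) / (1 + n)
        if q + u > best_score:
            best_index, best_score = i, q + u
    return best_index


# Child 0 has the best mean value, but child 1 is under-explored
# relative to its prior, so the exploration bonus wins here.
choice = puct_select([0.6, 0.3, 0.1], [10, 1, 0], [5.0, 0.2, 0.0])
```

With c_puct set to 0 the rule collapses to greedy selection on the mean value, which is one way to see that the prior only steers exploration and does not override what the search has actually measured.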

What you'll learn: the rules of Go · why brute-force search fails · Monte Carlo Tree Search · the PUCT selection rule · policy & value networks · self-play & the training loop · grounding value without rollouts · variance & advantage · search-free play · scaling laws & compute · off-policy learning · MuZero's learned model.

Built from Eric Jang's AlphaGo lecture on the Dwarkesh Podcast and his AutoGo repository. Part of tuul.dev, a research playground from tuul.ai.

This is an interactive deck and needs JavaScript for the animated scenes, the playable Go board, and the flashcards. The full curriculum (20 chapters): why AlphaGo · Go & the board · why search is hard · Monte Carlo Tree Search · the networks · the game state · the architecture · the warm start · the search loop · self-play · grounding value · the shared net · why it's profound · variance · no-search RL · MCTS vs Q-learning · scaling & taste · off-policy learning · bits per FLOP · automating research.