Rebuilding AlphaGo from scratch
A hands-on deck for understanding and re-implementing AlphaGo, AlphaZero, and MuZero from scratch. AlphaGo plays Go by combining a policy network (which moves look good) and a value network (who's winning) with Monte Carlo Tree Search guided by the PUCT rule. In the AlphaZero form, both networks are trained purely by self-play reinforcement learning, with no human games required. This deck walks through the whole idea in 32 scenes and 31 deep-dives & labs, with animated equations, a playable Go board, flashcards, and margin notes you can write in.
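To make the PUCT rule concrete, here is a minimal Python sketch of the selection step at one search node. It assumes the AlphaZero-style form score(a) = Q(s,a) + c_puct · P(s,a) · √N(s) / (1 + N(s,a)), where P is the policy network's prior, Q is the mean backed-up value, and N counts visits. The names `Edge`, `puct_score`, `select_action`, `backup`, and `visit_policy`, and the constant `c_puct = 1.5`, are hypothetical illustrations, not taken from the deck or the AutoGo repository. `visit_policy` sketches how, in AlphaZero-style self-play, root visit counts can become the policy network's training target.

```python
import math
from dataclasses import dataclass


@dataclass
class Edge:
    prior: float            # P(s, a): policy-network prior for this move
    visits: int = 0         # N(s, a): how often search took this edge
    value_sum: float = 0.0  # W(s, a): sum of values backed up through it

    @property
    def q(self) -> float:
        # Mean action value Q(s, a). Unvisited edges default to 0 here,
        # an assumption; implementations differ on this "first play" value.
        return self.value_sum / self.visits if self.visits else 0.0


def puct_score(edge: Edge, parent_visits: int, c_puct: float = 1.5) -> float:
    # Exploitation term Q plus an exploration bonus that favors moves with a
    # high prior and few visits, and grows as the parent is visited more.
    u = c_puct * edge.prior * math.sqrt(parent_visits) / (1 + edge.visits)
    return edge.q + u


def select_action(edges: dict, c_puct: float = 1.5) -> str:
    # N(s) is taken as the total visit count over this node's children.
    parent_visits = sum(e.visits for e in edges.values())
    return max(edges, key=lambda a: puct_score(edges[a], parent_visits, c_puct))


def backup(edge: Edge, value: float) -> None:
    # After evaluating a leaf (e.g. with the value network), update statistics.
    edge.visits += 1
    edge.value_sum += value


def visit_policy(edges: dict, temperature: float = 1.0) -> dict:
    # Training target pi(a) proportional to N(s, a) ** (1 / temperature).
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    weights = {a: e.visits ** (1.0 / temperature) for a, e in edges.items()}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no visits recorded yet")
    return {a: w / total for a, w in weights.items()}


if __name__ == "__main__":
    edges = {
        "a": Edge(prior=0.7, visits=10, value_sum=2.0),
        "b": Edge(prior=0.2, visits=1, value_sum=0.9),
        "c": Edge(prior=0.1),
    }
    print(select_action(edges))  # → b
    print(visit_policy(edges))
```

Move "b" wins despite its lower prior because its one visit returned a strong value, while "a" has already used up most of its exploration bonus. Because the bonus scales with √N(s), every child scores 0 on a node's very first selection and ties fall to dict order; some implementations avoid this by adding 1 inside the square root.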
What you'll learn: the rules of Go · why brute-force search fails · Monte Carlo Tree Search · the PUCT selection rule · policy & value networks · self-play & the training loop · grounding value without rollouts · variance & advantage · search-free play · scaling laws & compute · off-policy learning · MuZero's learned model.
Built from Eric Jang's AlphaGo lecture on the Dwarkesh Podcast and his AutoGo repository. Part of tuul.dev, a research playground from tuul.ai.