Stanford's CS336 Walks Students Through Building a Language Model End-to-End

Stanford's CS336 applies the build-an-OS-from-scratch teaching approach to language models. Across five assignments, students implement a Transformer from the tokenizer up, write their own FlashAttention2 kernel in Triton, build distributed training code, fit scaling laws against a training API, process raw Common Crawl into usable pretraining data, and finish with supervised finetuning plus reinforcement learning for math reasoning (with optional DPO-based safety alignment).

The course assumes serious prerequisites: strong Python skills, PyTorch fluency, familiarity with the GPU memory hierarchy, and prior ML coursework. Scaffolding is deliberately minimal, and students write roughly an order of magnitude more code than in a typical AI class. The AI policy is unusually strict for 2026: LLMs are permitted for conceptual questions but not for solving assignments, and students are urged to disable Copilot-style autocomplete so they do not shortcut the learning.

For self-study, the staff publish current B200 GPU pricing across Modal (the course sponsor, at $6.25/hr with $30 in free monthly credit), Lambda, RunPod, Nebius, and Together. They recommend debugging on CPU before spending GPU hours on the training runs in assignments 1, 4, and 5.
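
To put the Modal figures in perspective, here is a rough back-of-the-envelope sketch of how far the free credit stretches. It assumes flat per-hour billing for a single B200 and ignores any storage, CPU, or networking charges. The function name `gpu_hours` and the constants are illustrative, not part of any provider's API.

```python
def gpu_hours(budget_usd: float, rate_usd_per_hour: float) -> float:
    """Return how many GPU-hours a budget buys at a flat hourly rate.

    Assumption: billing is strictly proportional to time on one GPU,
    with no extra charges for storage, CPU, or data transfer.
    """
    if rate_usd_per_hour <= 0:
        raise ValueError("hourly rate must be positive")
    return budget_usd / rate_usd_per_hour


# Figures quoted in the text above (Modal, B200).
MODAL_B200_RATE_USD = 6.25    # USD per GPU-hour
MODAL_FREE_CREDIT_USD = 30.0  # USD of free credit per month

hours = gpu_hours(MODAL_FREE_CREDIT_USD, MODAL_B200_RATE_USD)
print(f"Free credit covers about {hours:.1f} B200-hours per month")
```

At these rates the monthly credit buys a little under five hours of B200 time, which is one reason to iron out bugs on CPU first.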