Tiny-vLLM: Build a CUDA LLM Inference Engine from Scratch in C++
Original source: Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA (Hacker News)
Tiny-vLLM is an open-source project that pairs a working LLM inference server with a step-by-step course teaching readers how to build one themselves. The engine loads Llama 3.2 1B Instruct from Safetensors and runs a full forward pass entirely through custom CUDA kernels, implementing the techniques that make production serving fast: KV cache, static and continuous batching, online softmax with FlashAttention-style computation, and PagedAttention.
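To make the online softmax idea concrete, here is a minimal CPU sketch in plain C++ of the single-pass recurrence that FlashAttention-style kernels build on. It is not taken from Tiny-vLLM: the names `OnlineSoftmaxState`, `online_softmax_update`, and `attend_row` are hypothetical, it handles one query row in float32 rather than bfloat16, and it assumes every score is finite (masked positions are skipped by the caller).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Running state for one query row (hypothetical struct, not from Tiny-vLLM).
struct OnlineSoftmaxState {
    float running_max = -std::numeric_limits<float>::infinity();
    float denom = 0.0f;      // sum of exp(score - running_max) seen so far
    std::vector<float> acc;  // sum of exp(score - running_max) * value seen so far
};

// Fold one key's score and value vector into the state.
// Assumes `score` is finite; masked positions should be skipped by the caller.
void online_softmax_update(OnlineSoftmaxState& s, float score,
                           const std::vector<float>& value) {
    if (s.acc.empty()) s.acc.assign(value.size(), 0.0f);
    assert(value.size() == s.acc.size());
    const float new_max = std::max(s.running_max, score);
    // Rescale what was accumulated under the old max.
    // On the first call this is exp(-inf) == 0, which clears the empty state.
    const float correction = std::exp(s.running_max - new_max);
    const float weight = std::exp(score - new_max);
    s.denom = s.denom * correction + weight;
    for (std::size_t i = 0; i < s.acc.size(); ++i) {
        s.acc[i] = s.acc[i] * correction + weight * value[i];
    }
    s.running_max = new_max;
}

// Attention output for one query: the softmax(scores)-weighted sum of the
// value rows, computed in a single pass over the keys.
std::vector<float> attend_row(const std::vector<float>& scores,
                              const std::vector<std::vector<float>>& values) {
    assert(!scores.empty() && scores.size() == values.size());
    OnlineSoftmaxState s;
    for (std::size_t t = 0; t < scores.size(); ++t) {
        online_softmax_update(s, scores[t], values[t]);
    }
    std::vector<float> out(s.acc.size());
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = s.acc[i] / s.denom;
    }
    return out;
}
```

Because the state is rescaled whenever a larger score appears, the full row of scores never has to be materialized, and large scores such as 1000 do not overflow `exp`. A GPU kernel would typically apply the same update per tile of keys rather than per key.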
The accompanying curriculum derives the underlying math and systems concepts from scratch, covering bfloat16 numerics, RMSNorm and parallel reduction, rotary position embeddings (RoPE), grouped-query attention (GQA), causal masking, cuBLAS GEMM with the layout tricks needed to reconcile its column-major convention with row-major tensors, and the prefill-versus-decode split that motivates the KV cache. The author frames C++ and CUDA as the right tools because LLM workloads reduce to large volumes of matrix multiplication that demand GPU parallelism.
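The layout trick for cuBLAS can be illustrated on the CPU. The sketch below is an assumption about what the course means, not code from the project: `gemm_col_major` is a hypothetical stand-in that follows the column-major convention cuBLAS uses, and `matmul_row_major` shows how swapping the operands and dimensions produces a row-major result without any explicit transpose.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference GEMM in column-major convention (hypothetical stand-in for a
// cuBLAS call with no transposes): C (m x n) = A (m x k) * B (k x n),
// where element (i, j) of a matrix X lives at X[i + j * ldx].
void gemm_col_major(int m, int n, int k,
                    const float* A, int lda,
                    const float* B, int ldb,
                    float* C, int ldc) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p) {
                sum += A[i + p * lda] * B[p + j * ldb];
            }
            C[i + j * ldc] = sum;
        }
    }
}

// Row-major C (M x N) = A (M x K) * B (K x N).
// A row-major M x N buffer has the same memory layout as a column-major
// N x M buffer holding the transpose. Computing C^T = B^T * A^T in
// column-major terms therefore writes C in row-major order: pass B first,
// A second, and swap M and N.
std::vector<float> matmul_row_major(int M, int N, int K,
                                    const std::vector<float>& A,
                                    const std::vector<float>& B) {
    assert(A.size() == static_cast<std::size_t>(M) * K);
    assert(B.size() == static_cast<std::size_t>(K) * N);
    std::vector<float> C(static_cast<std::size_t>(M) * N);
    gemm_col_major(N, M, K, B.data(), N, A.data(), K, C.data(), N);
    return C;
}
```

With the real library, the same argument order would map onto `cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, M, K, &alpha, B, N, A, K, &beta, C, N)` for float32 data. Whether Tiny-vLLM uses this exact form or a transposed variant for its bfloat16 weights is not stated in this summary.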
Training and model design are explicitly out of scope, with pointers to Karpathy’s nanoGPT and llm.c, tinygrad, and GPU MODE for adjacent territory. The project targets NVIDIA hardware on Linux with CUDA 13.1 and invites readers to fork, adapt build paths, and upstream fixes, positioning itself as both a learning resource and a teaching aid for university courses.
This is an AI-generated summary. Read the original on Hacker News for the full story.