LLM101n: Let's build a Storyteller

LLM101n header image

> What I cannot create, I do not understand. — Richard Feynman

In this course, we build a Storyteller AI Large Language Model (LLM) from scratch — from a one-line bigram model all the way to a deployed, multimodal web app. Everything is implemented end-to-end in Python with minimal prerequisites. By the end you will have a deep, hands-on understanding of how modern LLMs work.

The training corpus throughout is TinyStories — a dataset of short children's stories — keeping experiments fast enough to run on a laptop while still producing meaningful results.


Syllabus

| #  | Chapter | Key concepts |
|----|---------|--------------|
| 01 | Bigram Language Model | language modeling, NLL loss, character-level tokenization |
| 02 | Micrograd | scalar autodiff, backpropagation from scratch |
| 03 | N-gram MLP | multi-layer perceptron, matmul, GELU |
| 04 | Attention | self-attention, softmax, positional encoding |
| 05 | Transformer | GPT-2 architecture, residual connections, LayerNorm |
| 06 | Tokenization | Byte Pair Encoding (BPE), minBPE |
| 07 | Optimization | weight initialization, AdamW, LR schedules |
| 08 | Need for Speed I: Device | CPU vs GPU, device-agnostic PyTorch |
| 09 | Need for Speed II: Precision | mixed precision, fp16, bf16, fp8 |
| 10 | Need for Speed III: Distributed | DDP, ZeRO, DeepSpeed |
| 11 | Datasets | data loading, synthetic data generation |
| 12 | Inference I: KV-Cache | key-value cache, autoregressive generation |
| 13 | Inference II: Quantization | INT8/INT4 quantization |
| 14 | Finetuning I: SFT | supervised finetuning, PEFT, LoRA, chat format |
| 15 | Finetuning II: RL | RLHF, PPO, DPO |
| 16 | Deployment | FastAPI server, streaming, web UI |
| 17 | Multimodal | VQVAE, diffusion transformer, image+text |
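The course opens with the bigram model of chapter 01. As a taste of where it starts, here is an illustrative toy sketch of a character-level bigram model with an NLL loss (the corpus string below is made up for the example; this is not the course's ch01 code):

```python
import math
from collections import Counter

# Toy character-level bigram model: count adjacent character pairs,
# then estimate P(next | current) by maximum likelihood.
text = "once upon a time there was a tiny story. the end."

counts = Counter(zip(text, text[1:]))   # bigram counts
totals = Counter(text[:-1])             # how often each char appears as "current"

def prob(cur: str, nxt: str) -> float:
    """Maximum-likelihood estimate of P(nxt | cur)."""
    return counts[(cur, nxt)] / totals[cur] if totals[cur] else 0.0

# Average negative log-likelihood of the training text under the model.
# Every observed bigram has nonzero probability, so the log is defined.
nll = -sum(math.log(prob(c, n)) for c, n in zip(text, text[1:])) / (len(text) - 1)
```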

Repository Layout

```text
LLM101n/
├── chNN.md          # Chapter narratives + embedded Python code (kept in sync via inject.py)
├── codes/
│   ├── inject.py    # Syncs named blocks from codes/chNN/main.py → chNN.md
│   ├── extract.py   # Legacy: extracted markdown → main.py (no longer the active workflow)
│   ├── chNN/
│   │   ├── main.py  # Runnable script — SOURCE OF TRUTH for code
│   │   └── run.log  # Expected output
│   └── data/        # Shared datasets, checkpoints, tokenizers
└── llm101n.jpg
```

`codes/chNN/main.py` is the source of truth. Edit the Python scripts directly; run `inject.py` to sync changes back into the markdown chapter files.


Getting Started

Prerequisites

  • Python 3.10+
  • A GPU is helpful but not required for the early chapters

Setup

```bash
git clone https://github.com/bagustris/LLM101n.git
cd LLM101n

# Create and activate the virtual environment
uv venv codes/.venv
source codes/.venv/bin/activate   # Windows: codes\.venv\Scripts\activate

uv pip install torch datasets transformers tqdm fastapi uvicorn
```

Running a chapter

```bash
source codes/.venv/bin/activate
cd codes/ch01
python main.py
```

Each chapter is self-contained. Chapter 01 downloads the TinyStories dataset on the first run and saves it to codes/data/ so subsequent chapters can reuse it without hitting the network again.

Editing code and syncing to markdown

`codes/chNN/main.py` is the source of truth. Edit it directly, then sync named blocks back into the chapter markdown:

```bash
cd codes
python inject.py            # sync all chapters
python inject.py ch05       # sync one chapter
python inject.py --dry-run  # preview diffs without writing
python inject.py --status   # show which blocks are marked
```

Note on block markers: in `main.py`, wrap editable sections with `# === block: <name> ===` / `# === /block: <name> ===`. In the markdown, wrap the matching code fence with `<!-- block: <name> -->` / `<!-- /block: <name> -->`. `inject.py` replaces only those fenced regions.
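To make the marker convention concrete, here is a minimal sketch of how such syncing could be implemented with regular expressions. This is an illustration of the mechanism, not the actual `inject.py`; the function name `sync_blocks` is made up:

```python
import re

# Matches "# === block: <name> ===" ... "# === /block: <name> ===" in a .py file.
PY_BLOCK = re.compile(
    r"# === block: (?P<name>\S+) ===\n(?P<body>.*?)# === /block: (?P=name) ===",
    re.DOTALL,
)
# Matches the corresponding HTML-comment-wrapped fence in the markdown.
MD_BLOCK_TMPL = r"(<!-- block: {name} -->\n```python\n).*?(\n```\n<!-- /block: {name} -->)"

def sync_blocks(py_src: str, md_src: str) -> str:
    """Replace each marked fenced region in md_src with the matching block from py_src."""
    for m in PY_BLOCK.finditer(py_src):
        name, body = m.group("name"), m.group("body").rstrip("\n")
        pat = re.compile(MD_BLOCK_TMPL.format(name=re.escape(name)), re.DOTALL)
        md_src = pat.sub(lambda mm: mm.group(1) + body + mm.group(2), md_src)
    return md_src
```

Because only the region between the comment markers is rewritten, the surrounding chapter prose in the markdown is left untouched.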


Shared Data (codes/data/)

| File / Directory | Description |
|------------------|-------------|
| tinystories_train.txt | 50 K training stories |
| tinystories_val.txt | 5 K validation stories |
| gpt_tinystories.pt | Pretrained GPT-2 checkpoint on TinyStories |
| tinystories_bpe_tokenizer.json | BPE tokenizer (128-token vocab) |
| lora_adapter/ | Saved LoRA adapter (rank=8, alpha=16) |
| vqvae_cifar10.pt | Pretrained VQVAE for CIFAR-10 (ch17) |
| cifar-10-batches-py/ | CIFAR-10 image dataset (ch17) |
| server.py | FastAPI streaming text-generation server (ch16) |
| frontend.html | Web UI (ch16) |
| Dockerfile | Container for the deployed server (ch16) |
| ds_config.json | DeepSpeed config for distributed training (ch10) |

Appendix — Topics to Explore Further

  • Programming languages: Assembly, C, Python internals
  • Data types: Integer, Float, String (ASCII, Unicode, UTF-8)
  • Tensors: shapes, views, strides, contiguous memory
  • Frameworks: PyTorch, JAX
  • Architectures: GPT-1/2/3/4, Llama (RoPE, RMSNorm, GQA), Mixture-of-Experts
  • Multimodal: Images, Audio, Video, VQVAE, VQGAN, Diffusion models
