LLM Maker — Model Registry & Design Documentation
A from-scratch decoder-only LLM project: pretraining tiny models on FineWeb-Edu, instruction-tuning them, and scaling up. This document covers every model built, the architecture and why, design decisions and their rationale, measured performance, and input/output examples.
Per-topic docs also exist: `TRAINING_RUN.md` (200M run detail, coming soon), `SFT_PLAN.md` (coming soon), `PLAN_100M.md` (coming soon). This file is the index.
1. Model registry (at a glance)
| # | Model / run | Params | Vocab | Train tokens | Val loss (ppl) | Status |
|---|---|---|---|---|---|---|
| A | smoke_10m_full_200m | 10,096,896 | 8,192 | 200M (×1) | 3.559 (35.1) | ✅ done |
| B | smoke_12m_16k_4ep | 12,194,048 | 16,384 | 2.88B (4×720M) | 3.433 (30.97) | ✅ done |
| C | sft_12m_16k (SFT of B) | 12,194,048 | 16,384 | 10.8k SFT pairs | resp. 2.758 | ✅ done |
| D | smoke_100m_16k_4b | 97,536,768 | 16,384 | 4B (×1) | ~3.04 @ 52% | ⏸ paused |
| D+ | sft_100m_16k (SFT of D) | 97,536,768 | 16,384 | (reuses C's data) | — | ⏳ queued |
Perplexity is comparable only within the same tokenizer. A is 8k; B/C/D are 16k (B→D directly comparable, same tokenizer + data distribution).
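The perplexity column follows directly from the loss column. A minimal check, assuming the reported validation losses are mean per-token cross-entropies in nats:

```python
import math

def perplexity(loss_nats: float) -> float:
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(loss_nats)

# Validation losses copied from the registry table above.
val_losses = {"A": 3.559, "B": 3.433}
ppl = {name: round(perplexity(loss), 2) for name, loss in val_losses.items()}
print(ppl)
```

Because the average is taken per token, a tokenizer that splits the same text into more or fewer tokens shifts the number, which is why A (8k) is not directly comparable with B/C/D (16k).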
2. Shared architecture (and why each piece)
All models are the same decoder-only Transformer family, scaled by width/depth:
| Component | Choice | Why |
|---|---|---|
| Type | Decoder-only, causal | Standard autoregressive LM; next-token prediction |
| Positional encoding | RoPE | Relative positions, extrapolates better than learned/absolute |
| Normalization | RMSNorm | Cheaper than LayerNorm, no mean-centering, stable |
| MLP | SwiGLU (hidden ≈2.7×d at 100M, 3.75×d at 10M/12M) | Gated activation, stronger than GELU MLP at equal params |
| Embeddings | Tied input/output | Saves params (big deal at small scale), regularizes |
| Attention | Causal scaled-dot-product (PyTorch SDPA) | Fast fused kernel, is_causal=True |
| Precision | bf16 autocast, fp32 loss | Stable on Ampere (3090), no loss scaler needed |
| Optimizer | Fused AdamW (β 0.9/0.95, wd 0.1) | Standard LM recipe; fused for speed |
| LR schedule | Cosine to 10%, linear warmup | Standard; smooth anneal |
| Grad clip | Global norm 1.0 | Stability |
| Context | 1024 tokens | Fixed by training; extension needs separate work |
| Tokenizer | Byte-level BPE (8k → 16k) | No <unk> issues; see §5 |
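The LR schedule row (linear warmup, then a single cosine down to 10% of peak) can be sketched as below. The function name and the warmup length are hypothetical. The peak of 1e-3 is inferred from §7, which calls the SFT rate of 2e-5 "50× lower than pretraining", and 43,992 is Model B's step count.

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_steps: int, floor_frac: float = 0.10) -> float:
    """Linear warmup to peak_lr, then one cosine decay to floor_frac * peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (floor_frac + (1.0 - floor_frac) * cosine)

# WARMUP is a made-up value; it is not stated in this document.
TOTAL, PEAK, WARMUP = 43_992, 1e-3, 1_000
for step in (0, WARMUP, TOTAL // 2, TOTAL):
    print(step, f"{lr_at(step, TOTAL, PEAK, WARMUP):.2e}")
```

Passing the full multi-epoch step count as `total_steps` gives the "single cosine over N epochs" behaviour described in §5, as opposed to restarting the schedule each epoch.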
Per-model dimensions
| | A (10M) | B/C (12M) | D (100M) |
|---|---|---|---|
| Layers | 8 | 8 | 12 |
| d_model | 256 | 256 | 768 |
| Heads (head_dim) | 4 (64) | 4 (64) | 12 (64) |
| MLP hidden | 960 | 960 | 2048 |
| Vocab | 8,192 | 16,384 | 16,384 |
| Embedding share of params | 21% | 34% | 13% |
The trainer derives the expected param count from these dims and asserts it — so a config typo can't silently change the model.
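A sketch of such a check, not the trainer's actual code. The layout assumed here (no bias terms, four d×d attention projections, a three-matrix SwiGLU, two RMSNorm gains per block plus a final norm, tied embeddings) is an assumption, but it does reproduce the three reported totals exactly. `Dims` and `expected_params` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dims:
    n_layers: int
    d_model: int
    mlp_hidden: int
    vocab: int

def expected_params(d: Dims) -> int:
    """Parameter count under the assumed layout (no biases, tied embeddings)."""
    attn = 4 * d.d_model * d.d_model      # Wq, Wk, Wv, Wo
    mlp = 3 * d.d_model * d.mlp_hidden    # SwiGLU: gate, up, down
    norms = 2 * d.d_model                 # two RMSNorm gains per block
    embed = d.vocab * d.d_model           # shared by input and output
    final_norm = d.d_model
    return d.n_layers * (attn + mlp + norms) + embed + final_norm

# Dimensions and reported totals copied from the tables above.
MODELS = {
    "A": (Dims(8, 256, 960, 8_192), 10_096_896),
    "B": (Dims(8, 256, 960, 16_384), 12_194_048),
    "D": (Dims(12, 768, 2048, 16_384), 97_536_768),
}

for name, (dims, reported) in MODELS.items():
    got = expected_params(dims)
    assert got == reported, f"{name}: derived {got:,} != reported {reported:,}"
    share = dims.vocab * dims.d_model / got
    print(f"{name}: {got:,} params, embedding share {share:.0%}")
```

The same arithmetic yields the embedding-share row of the table.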
3. The pipeline
download shards → select subset (deterministic hash) → train BPE tokenizer →
pack to uint16 blocks → pretrain → SFT → serve (chat_app) / eval (chat_samples)
| Stage | Script | Notes |
|---|---|---|
| Source | fineweb-edu/ | FineWeb-Edu sample-10BT (~9.97B GPT-2 tokens, 14 shards) |
| Select | select_fineweb_smoke_subset.py | Keyed BLAKE2b hash ranking → deterministic, reproducible subset; same seed ⇒ superset/identical splits |
| Tokenizer | train_smoke_tokenizer.py | Byte-level BPE on a hashed text sample |
| Pack | prepare_smoke_tokens.py | Encode + <eos> per doc, concat, write 1024-token uint16 blocks |
| Pretrain | train_smoke_model.py | Memory-mapped blocks, multi-epoch w/ per-epoch reshuffle, --resume, graceful STOP |
| SFT | prepare_sft.py + sft_finetune.py | Masked instruction tuning (§7) |
| Eval | sft_eval.py, chat_samples.py | Base vs SFT, <eos>-stop reporting |
| Serve | chat_app.py | Local Flask chat (top-k 40, temp 0.75) |
| Orchestrate | run_100m_pipeline.cmd | All phases, resumable, pausable |
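The Select and Pack stages can be illustrated with a stdlib-only sketch. It is not the code in `select_fineweb_smoke_subset.py` or `prepare_smoke_tokens.py`. The function names, the seed value, budgeting by document count, and dropping the trailing partial block are all assumptions.

```python
import hashlib
from array import array

def rank_key(doc_id: str, seed: bytes) -> bytes:
    """Keyed BLAKE2b digest used as a deterministic sort key for one document."""
    return hashlib.blake2b(doc_id.encode("utf-8"), key=seed, digest_size=16).digest()

def select_subset(doc_ids, budget: int, seed: bytes):
    """Take the `budget` lowest-ranked documents; rank depends only on (seed, id)."""
    return sorted(doc_ids, key=lambda d: rank_key(d, seed))[:budget]

def pack_blocks(token_docs, eos_id: int, block_len: int = 1024):
    """Append <eos> to each doc, concatenate, and cut into full uint16 blocks."""
    stream = array("H")  # unsigned 16-bit items; enough for a 16,384-entry vocab
    for tokens in token_docs:
        stream.extend(tokens)
        stream.append(eos_id)
    n_full = len(stream) // block_len  # assumption: the partial tail is dropped
    return [stream[i * block_len:(i + 1) * block_len] for i in range(n_full)]

docs = [f"doc-{i}" for i in range(1000)]
small = select_subset(docs, 100, seed=b"smoke")
large = select_subset(docs, 400, seed=b"smoke")
print(set(small) <= set(large))  # larger budget is a superset of the smaller one

blocks = pack_blocks([[1, 2, 3], [4, 5]], eos_id=0, block_len=4)
print([b.tolist() for b in blocks])
```

Ranking by a keyed hash rather than sampling with a stateful RNG is what makes a larger budget a superset of a smaller one: each document's rank never depends on which other documents are considered.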
4. Per-model detail
Model A — smoke_10m_full_200m (10M / 8k / 200M)
The original baseline. 10,096,896 params, 8k vocab, 200,015,872 tokens (3,052 steps).
- Val loss 3.5592 (ppl 35.14), train 3.5418, gap +0.017 (no overfit; loss-limited by size).
- Throughput 227,563 tok/s, peak 4.6 GB.
- Output: fluent English, weak coherence, never stops. Detail in `TRAINING_RUN.md` (coming soon).
Model B — smoke_12m_16k_4ep (12M / 16k / 2.88B, 4 epochs)
Scaled vocab 8k→16k and trained 4 epochs over ~720M unique tokens (single cosine across all 4). 12,194,048 params, 2,883,059,712 token-passes (43,992 steps).
- Init val 3.513 → final/best 3.4330 (ppl 30.97); train/val gap −0.042 (zero overfit).
- Per-epoch curve: 3.655 (38.7) → 3.555 (35.0) → 3.477 (32.4) → 3.433 (30.97). The per-epoch gains shrink (0.100, 0.078, 0.044), consistent with ~4 epochs being the practical ceiling for repeated data.
- Throughput 214,890 tok/s, peak 13.9 GB.
- Survived two crashes (sleep @ step 6,558; session-teardown @ 27,106) via `--resume`; see §9.
Model C — sft_12m_16k (SFT of B)
Instruction-tuned B. 10,789 train + 800 val pairs (8,111 Dolly + 3,500 SQuAD), max_len 512, 30.2% of tokens supervised. 3 epochs, lr 2e-5, batch 32.
- Response val loss 3.256 → 2.758. Cleanly converged, no overfit.
- Behavior change: now answers and stops (`<eos>`); genuine extractive QA (§8).
Model D — smoke_100m_16k_4b (100M / 16k / 4B) — IN PROGRESS
GPT-2-small-class scale-up. 97,536,768 params, 4B-token single pass (target 61,036 steps).
- Paused at step 31,892 (~52%), val loss ~3.04 and still dropping — already below B's final 3.433 at barely half-trained (same tokenizer/distribution ⇒ fair comparison). This is the scaling payoff: capability is param-limited, and more params is the lever.
- Measured throughput 45,703 tok/s (benchmark predicted 34,537 — conservative), peak 7.9 GB.
- Phase 3 (`sft_100m_16k`) auto-runs after pretraining, reusing C's SFT data.
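The token and step counts quoted for A and B imply a fixed 65,536 tokens per optimizer step (consistent with, for example, 64 sequences of 1024 tokens, though the actual batch and accumulation split is not stated here). A short check, which also reproduces D's step target and progress under the assumption that D uses the same step size:

```python
# (tokens, steps) as reported in the per-model sections above.
REPORTED = {
    "A": (200_015_872, 3_052),
    "B": (2_883_059_712, 43_992),
}
tokens_per_step = {}
for name, (tokens, steps) in REPORTED.items():
    assert tokens % steps == 0, f"{name}: tokens are not a whole multiple of steps"
    tokens_per_step[name] = tokens // steps
print(tokens_per_step)

# Model D, assuming the same tokens-per-step as A and B.
step_tokens = tokens_per_step["B"]
target_steps = -(-4_000_000_000 // step_tokens)  # ceiling division
progress = 31_892 / target_steps
remaining_tokens = (target_steps - 31_892) * step_tokens
hours_lower_bound = remaining_tokens / 45_703 / 3600  # at the measured tok/s
print(target_steps, f"{progress:.1%}", f"{hours_lower_bound:.1f} h")
```

The hours figure is a pure-throughput lower bound. It excludes evaluation and checkpoint time, so it should come out somewhat below the wall-clock estimate.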
5. Key design decisions & rationale
- 8k → 16k vocab. The bigger corpus justified a bigger vocab (better compression: 16k packs the same text in ~10% fewer tokens than 8k). Tradeoff is embedding share of params (21%→34% at 12M) — acceptable, and only 13% at 100M.
- Reuse the tokenizer across runs. Perplexity is only comparable with an identical tokenizer, so B and D share the 16k tokenizer → their loss curves are directly comparable.
- Deterministic, seed-based data selection. Same seed ⇒ larger budgets are supersets of smaller ones and validation splits are identical — controlled scaling experiments.
- Single cosine over N epochs, not warm restarts. "Train 4 epochs" = one continuous cosine over all 4 (B). Bolting a fresh cosine onto already-annealed weights is a different (worse) regime.
- Fresh data > repeats. Under data-constrained scaling, repeated data is "nearly as good as fresh" only up to ~4 epochs; beyond that the returns diminish quickly. B repeated ~720M tokens 4×; D uses 4B unique tokens in a single pass, which is strictly better when the data is available.
- Token budget. ~20 tokens/param is Chinchilla-optimal; small models benefit from over-training. A ≈ 20/param; D ≈ 41/param (mild over-train).
- SFT, not RL. Making the model answer is supervised fine-tuning (loss masked to the response, every example ends in `<eos>` to teach stopping). RL/DPO only polishes preferences and is the weakest lever at this capacity, so it is deferred indefinitely.
- Play to capacity strengths. A tiny model can't store facts but can extract from a given passage, so SFT mixes in SQuAD; the highest-value future lever is RAG.
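The tokens-per-parameter figures in the token-budget bullet can be reproduced from the registry numbers. The B row uses the approximate ~720M unique-token figure, so treat it as a rough value:

```python
# (training tokens, parameters) taken from the registry and per-model sections.
RUNS = {
    "A": (200_015_872, 10_096_896),
    "B (unique)": (720_000_000, 12_194_048),  # ~720M is an approximate figure
    "D": (4_000_000_000, 97_536_768),
}
tokens_per_param = {name: tokens / params for name, (tokens, params) in RUNS.items()}
for name, ratio in tokens_per_param.items():
    print(f"{name}: {ratio:.1f} tokens/param")
```

Relative to the ~20 tokens/param rule of thumb, A sits on it and D is at roughly twice that.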
Improvement levers, ranked (for a capacity-limited model)
1. Bigger model (the real unlock: Model D)
2. More/better pretrain data
3. Distillation from a larger teacher
4. RAG (best practical, no retrain)
5. Decoding fixes (repetition penalty)
6. More SFT data
7. RL/DPO (last)
6. Performance summary
| Model | Params | Throughput (tok/s) | Peak VRAM | Val loss | Notes |
|---|---|---|---|---|---|
| A 10M/8k | 10.1M | 227,563 | 4.6 GB | 3.559 | baseline |
| B 12M/16k | 12.2M | 214,890 | 13.9 GB | 3.433 | +vocab, +epochs |
| D 100M/16k | 97.5M | 45,703 | 7.9 GB | ~3.04 (52%) | scaling win |
Scaling observation: at ~52% trained, the 100M model already beats the fully-trained 12M model's loss — confirming the bottleneck at small scale is parameters, not data or tuning.
7. SFT mechanics (how instruction-tuning works here)
- Format: every example is `User: {prompt}\n\nAssistant: {response}<eos>`.
- Loss masking: only response tokens + `<eos>` carry loss; prompt and padding are set to `-100` (ignored by `cross_entropy`). The model learns to generate the answer and stop, not to predict the (given) prompt.
- Data: Dolly-15k (filtered to short, no-context categories) for instruction breadth + SQuAD (passage→answer) for extractive skill the model can actually do well.
- Hyperparameters: init from pretrained weights, LR 2e-5 (50× lower than pretraining, to avoid washing out the base), 3 epochs, no weight decay.
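The loss-masking rule can be shown on token IDs alone. This is a minimal sketch, not the code in `prepare_sft.py`. `build_example`, the token IDs, and the padding ID are hypothetical, and the one-position shift for next-token prediction is left to the training loop. `-100` is the default `ignore_index` of PyTorch's `cross_entropy`.

```python
IGNORE_INDEX = -100  # positions with this label contribute no loss

def build_example(prompt_ids, response_ids, eos_id, max_len, pad_id):
    """Return (input_ids, labels) with loss only on response tokens and <eos>."""
    input_ids = list(prompt_ids) + list(response_ids) + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids) + [eos_id]
    # Truncate, then right-pad; note that truncation can cut off <eos>.
    input_ids, labels = input_ids[:max_len], labels[:max_len]
    pad = max_len - len(input_ids)
    return input_ids + [pad_id] * pad, labels + [IGNORE_INDEX] * pad

# Made-up IDs: prompt = [5, 6, 7], response = [8, 9], <eos> = 0, pad = 1.
input_ids, labels = build_example([5, 6, 7], [8, 9], eos_id=0, max_len=8, pad_id=1)
supervised = sum(label != IGNORE_INDEX for label in labels) / len(labels)
print(input_ids)
print(labels)
print(f"supervised fraction: {supervised:.1%}")
```

Averaging this supervised fraction over a dataset gives the kind of figure quoted for Model C (30.2% of tokens supervised), although how that number was computed is not specified here.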
8. Input / output examples
Base models (next-token predictors — fluent, no answering)
12M/16k base, "What is the capital of France?" → "…the capital of France is the capital of France. We can look at it and know it…" (loops, never stops).
After SFT (Model C) — base vs SFT, same prompts
| Prompt | Base (epoch_4) | SFT (best) |
|---|---|---|
| Capital of France? | never stops; loops | eos@13 "The capital of France is Ville de Lafayette." (stops; wrong — capacity) |
| How tall is the Eiffel Tower? (passage given) | hallucinates "Wounded Hole…President Madison" | eos@3 "330 metres" ✅ |
| What gas does photosynthesis release? (passage given) | rambles, never stops | eos@1 "oxygen" ✅ |
Takeaway: SFT taught the model to answer and stop; with facts in the prompt (extractive), it's correct. Open factual recall still fails — the 12M/100M capacity ceiling, which SFT cannot fix (RAG can).
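The stop behaviour in the table can be sketched as a sampling loop using the serving settings from §3 (top-k 40, temperature 0.75). This is an illustration, not `chat_samples.py`. `next_logits` is a hypothetical stand-in for the model, and reading `eos@N` as "N tokens were produced before `<eos>`" is an assumption about the reporting convention.

```python
import math
import random

def sample_top_k(logits, k=40, temperature=0.75, rng=random):
    """Sample a token id from the k highest logits after temperature scaling."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # unnormalised softmax
    return rng.choices(top, weights=weights, k=1)[0]

def generate(next_logits, prompt_ids, eos_id, max_new_tokens=64, **sampling):
    """Return (new_tokens, eos_at); eos_at is None if <eos> was never sampled."""
    ids, out = list(prompt_ids), []
    for _ in range(max_new_tokens):
        token = sample_top_k(next_logits(ids), **sampling)
        if token == eos_id:
            return out, len(out)
        ids.append(token)
        out.append(token)
    return out, None

def toy_model(ids):
    """Toy 5-token 'model': emits 2, then 3, then <eos> (id 0)."""
    logits = [0.0] * 5
    logits[0 if len(ids) >= 3 else len(ids) + 1] = 10.0
    return logits

print(generate(toy_model, [4], eos_id=0, k=1))  # k=1 makes sampling greedy
```

A base model that never assigns much probability to `<eos>` runs to `max_new_tokens` and returns `None` for `eos_at`, which corresponds to the "never stops" entries above.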
9. Infrastructure & operations (resilience)
Long runs on this single-GPU Windows box died twice — sleep (step 6,558) and session teardown (step 27,106), both silent (no traceback). Fixes, now standard:
- `--resume <ckpt>`: restores model + optimizer + RNG + data cursor → bit-exact continuation (verified resume delta `0.0`). `last.pt` saved every 250 steps.
- Scheduled Task launch: Task Scheduler owns the process, so it survives the app/session closing.
- Graceful `STOP` file: pause-at-will; trainer checkpoints within ~2 s and exits, freeing VRAM.
- Auto-resume loop (`run_100m_pipeline.cmd`): relaunches from `last.pt` on any crash, honors `STOP`, stops at completion.
- Sleep disabled (`powercfg`) during runs.
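The resume and `STOP` mechanics can be illustrated with a toy, stdlib-only loop. This is a sketch of the idea and not the trainer: a single float stands in for model and optimizer state, `last.pkl` stands in for `last.pt`, and the function name and atomic-rename detail are assumptions.

```python
import pickle
import random
from pathlib import Path

def train(run_dir: Path, total_steps: int, save_every: int = 250, seed: int = 0):
    """Toy loop: resumes from last.pkl if present, checkpoints on a STOP file."""
    ckpt, stop = run_dir / "last.pkl", run_dir / "STOP"
    rng = random.Random(seed)
    state = {"step": 0, "weight": 0.0}
    if ckpt.exists():  # resume: restore "model", RNG and step cursor together
        saved = pickle.loads(ckpt.read_bytes())
        state = saved["state"]
        rng.setstate(saved["rng"])

    def save():
        tmp = ckpt.with_suffix(".tmp")
        tmp.write_bytes(pickle.dumps({"state": state, "rng": rng.getstate()}))
        tmp.replace(ckpt)  # rename, so a crash mid-write leaves the old file

    while state["step"] < total_steps:
        state["weight"] += rng.random()  # stands in for one optimizer step
        state["step"] += 1
        if stop.exists():  # graceful pause: checkpoint now and exit
            save()
            return state, "paused"
        if state["step"] % save_every == 0:
            save()
    save()
    return state, "done"
```

Running this once to completion, and separately with a `STOP` file present followed by a second call after removing it, yields the same final `weight`. That is the toy analogue of the "resume delta 0.0" check, and it only holds because the RNG state is saved alongside the weights.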
Pause / resume workflow (current 100M run)
- Pause: create `runs/smoke_100m_16k_4b/STOP` (`pause_100m.cmd`) → checkpoints + frees GPU.
- Resume: clear `STOP`, restart the task (`resume_100m.cmd`) → continues from `last.pt`.
- All state is on disk → a reboot is safe; it stays parked until explicitly resumed.
10. Artifact index
- Configs: `configs/smoke_10m.yaml` (A), `smoke_12m_16k_800m.yaml` (B/C), `smoke_100m_16k.yaml` (D)
- Tokenizers: `tokenizers/fineweb_edu_smoke_8k`, `…_16k`
- Data: `data/fineweb_edu_smoke_8k` (A), `…_16k_800m` (B), `fineweb_edu_100m_16k` (D), `sft_16k` (C)
- Runs: `runs/<name>/`; each has `summary.json`, `metrics.jsonl`, checkpoints, `generated_samples.txt`
- Ops scripts: `run_100m_pipeline.cmd`, `pause_100m.cmd`, `resume_100m.cmd`
Model D is paused mid-pretraining at step 31,892/61,036. Resume to finish the remaining ~13 h + auto-SFT. This document should be updated when D and its SFT complete.