PAUSED · UPDATED 28 June 2026

LLM Maker — Model Registry & Design Documentation

A from-scratch decoder-only LLM project: pretraining tiny models on FineWeb-Edu, instruction-tuning them, and scaling up. This document covers every model built, the architecture and why, design decisions and their rationale, measured performance, and input/output examples.

Per-topic docs are planned but not yet published: TRAINING_RUN.md (200M run detail), SFT_PLAN.md, and PLAN_100M.md. This file is the index.

1. Model registry (at a glance)

#	Model / run	Params	Vocab	Train tokens	Val loss (ppl)	Status
A	`smoke_10m_full_200m`	10,096,896	8,192	200M (×1)	3.559 (35.1)	✅ done
B	`smoke_12m_16k_4ep`	12,194,048	16,384	2.88B (4×720M)	3.433 (30.97)	✅ done
C	`sft_12m_16k` (SFT of B)	12,194,048	16,384	10.8k SFT pairs	resp. 2.758	✅ done
D	`smoke_100m_16k_4b`	97,536,768	16,384	4B (×1)	~3.04 @ 52%	⏸ paused
D+	`sft_100m_16k` (SFT of D)	97,536,768	16,384	(reuses C's data)	—	⏳ queued

Perplexity is comparable only within the same tokenizer. A is 8k; B/C/D are 16k (B→D directly comparable, same tokenizer + data distribution).
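
The perplexity column can be re-derived from the loss column. The sketch below assumes the reported validation loss is mean cross-entropy per token in nats (the usual PyTorch convention; the source does not state the unit). The function name is illustrative.

```python
import math


def perplexity(loss_nats: float) -> float:
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(loss_nats)


# Validation losses copied from the registry table above.
for name, loss in [("A (8k tokenizer)", 3.559), ("B (16k tokenizer)", 3.433)]:
    print(f"{name}: loss {loss:.3f} -> ppl {perplexity(loss):.2f}")
```

Because perplexity is per token, a tokenizer that splits the same text into fewer tokens changes the value even for an equally good model, which is why A is not compared against B/C/D.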

2. Shared architecture (and why each piece)

All models are the same decoder-only Transformer family, scaled by width/depth:

Component	Choice	Why
Type	Decoder-only, causal	Standard autoregressive LM; next-token prediction
Positional encoding	RoPE	Relative positions, extrapolates better than learned/absolute
Normalization	RMSNorm	Cheaper than LayerNorm, no mean-centering, stable
MLP	SwiGLU (hidden ≈2.7×d at 100M, 3.75×d at 10M/12M)	Gated activation, stronger than GELU MLP at equal params
Embeddings	Tied input/output	Saves params (big deal at small scale), regularizes
Attention	Causal scaled-dot-product (PyTorch SDPA)	Fast fused kernel, `is_causal=True`
Precision	bf16 autocast, fp32 loss	Stable on Ampere (3090), no loss scaler needed
Optimizer	Fused AdamW (β 0.9/0.95, wd 0.1)	Standard LM recipe; fused for speed
LR schedule	Cosine to 10%, linear warmup	Standard; smooth anneal
Grad clip	Global norm 1.0	Stability
Context	1024 tokens	Fixed by training; extension needs separate work
Tokenizer	Byte-level BPE (8k → 16k)	No `<unk>` issues; see §5
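
As an illustration of the LR schedule row, here is a minimal sketch of linear warmup followed by a single cosine decay to 10% of peak. The peak LR of 1e-3 is inferred from §7 (SFT LR 2e-5 is described as 50× lower than pretraining). The warmup length of 1,000 steps and the function name are assumptions for illustration; the trainer's actual values are not shown in this document.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_steps: int, min_ratio: float = 0.10) -> float:
    """Linear warmup to peak_lr, then one cosine decay to min_ratio * peak_lr."""
    if step < warmup_steps:
        # Linear ramp; reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)


# One schedule spanning every step of a multi-epoch run (Model B's step count).
TOTAL = 43_992
for s in (0, 999, 1_000, TOTAL // 2, TOTAL):
    print(s, f"{lr_at(s, TOTAL, peak_lr=1e-3, warmup_steps=1_000):.6f}")
```

Passing the full multi-epoch step count as `total_steps` is what makes this one continuous cosine rather than a restart per epoch.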

Per-model dimensions

	A (10M)	B/C (12M)	D (100M)
Layers	8	8	12
d_model	256	256	768
Heads (head_dim)	4 (64)	4 (64)	12 (64)
MLP hidden	960	960	2048
Vocab	8,192	16,384	16,384
Embedding share of params	21%	34%	13%

The trainer derives the expected param count from these dims and asserts it — so a config typo can't silently change the model.
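
A sketch of what such a check could look like. The accounting is an assumption inferred from the tables: bias-free linear layers, four d×d attention projections, three SwiGLU matrices, two RMSNorm gain vectors per block plus one final norm, and tied embeddings counted once. It reproduces the registry's parameter counts exactly, but the trainer's real formula is not shown here and the function name is hypothetical.

```python
def expected_params(n_layers: int, d_model: int, mlp_hidden: int, vocab: int) -> int:
    """Parameter count under the assumptions stated above (no biases, tied embeddings)."""
    attn = 4 * d_model * d_model        # Q, K, V and output projections
    mlp = 3 * d_model * mlp_hidden      # SwiGLU: gate, up and down matrices
    norms = 2 * d_model                 # two RMSNorm gains per block
    embedding = vocab * d_model         # shared by input and output (tied)
    final_norm = d_model
    return embedding + n_layers * (attn + mlp + norms) + final_norm


CONFIGS = {
    "A": dict(n_layers=8, d_model=256, mlp_hidden=960, vocab=8_192),
    "B/C": dict(n_layers=8, d_model=256, mlp_hidden=960, vocab=16_384),
    "D": dict(n_layers=12, d_model=768, mlp_hidden=2_048, vocab=16_384),
}
REGISTRY = {"A": 10_096_896, "B/C": 12_194_048, "D": 97_536_768}

for name, cfg in CONFIGS.items():
    n = expected_params(**cfg)
    # Fail loudly if a config typo changes the model size.
    assert n == REGISTRY[name], f"{name}: expected {REGISTRY[name]:,}, got {n:,}"
    share = cfg["vocab"] * cfg["d_model"] / n
    print(f"{name}: {n:,} params, embedding share {share:.0%}")
```

The same arithmetic yields the embedding shares in the table (21%, 34%, 13%).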

3. The pipeline

download shards → select subset (deterministic hash) → train BPE tokenizer →
pack to uint16 blocks → pretrain → SFT → serve (chat_app) / eval (chat_samples)

Stage	Script	Notes
Source	`fineweb-edu/`	FineWeb-Edu `sample-10BT` (~9.97B GPT-2 tokens, 14 shards)
Select	`select_fineweb_smoke_subset.py`	Keyed BLAKE2b hash ranking → deterministic, reproducible subset; same seed ⇒ larger budgets are supersets of smaller ones, with identical validation splits
Tokenizer	`train_smoke_tokenizer.py`	Byte-level BPE on a hashed text sample
Pack	`prepare_smoke_tokens.py`	Encode + `<eos>` per doc, concat, write 1024-token uint16 blocks
Pretrain	`train_smoke_model.py`	Memory-mapped blocks, multi-epoch w/ per-epoch reshuffle, `--resume`, graceful `STOP`
SFT	`prepare_sft.py` + `sft_finetune.py`	Masked instruction tuning (§7)
Eval	`sft_eval.py`, `chat_samples.py`	Base vs SFT, `<eos>`-stop reporting
Serve	`chat_app.py`	Local Flask chat (top-k 40, temp 0.75)
Orchestrate	`run_100m_pipeline.cmd`	All phases, resumable, pausable
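
To make the Pack stage concrete, here is a minimal sketch of the "encode + `<eos>` per doc, concat, write 1024-token uint16 blocks" step using only the standard library. It is not the code in `prepare_smoke_tokens.py`: the `EOS_ID` value, the function name, and the choice to drop the incomplete final block are assumptions.

```python
from array import array

BLOCK = 1024   # context length from the architecture table
EOS_ID = 0     # hypothetical id; the real tokenizer's <eos> id is not stated


def pack_blocks(docs, eos_id=EOS_ID, block=BLOCK):
    """Concatenate tokenized docs with <eos> after each one and keep full blocks.

    docs: iterable of lists of token ids.
    Returns (uint16 array whose length is a multiple of `block`, dropped tail length).
    """
    stream = array("H")  # unsigned 16-bit on common platforms: ids must be < 65,536
    for ids in docs:
        stream.extend(ids)
        stream.append(eos_id)
    n_full = len(stream) // block
    return stream[: n_full * block], len(stream) - n_full * block


blocks, dropped = pack_blocks([[5] * 1500, [7] * 600])
print(len(blocks) // BLOCK, "blocks,", dropped, "tail tokens dropped")
# blocks.tofile(open_binary_file) would write the raw uint16 data that a
# memory-mapped reader can later slice into 1024-token rows.
```

A 16,384-entry vocabulary fits comfortably in uint16, which halves storage relative to int32 token files.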

4. Per-model detail

Model A — `smoke_10m_full_200m` (10M / 8k / 200M)

The original baseline. 10,096,896 params, 8k vocab, 200,015,872 tokens (3,052 steps).

Val loss 3.5592 (ppl 35.14), train 3.5418, gap +0.017 (no overfit; loss-limited by size).
Throughput 227,563 tok/s, peak 4.6 GB.
Output: fluent English, weak coherence, never stops. Detail in TRAINING_RUN.md (coming soon).

Model B — `smoke_12m_16k_4ep` (12M / 16k / 2.88B, 4 epochs)

Scaled vocab 8k→16k and trained 4 epochs over ~720M unique tokens (single cosine across all 4). 12,194,048 params, 2,883,059,712 token-passes (43,992 steps).

Init val 3.513 → final/best 3.4330 (ppl 30.97); train/val gap −0.042 (zero overfit).
Per-epoch curve: 3.655 (38.7) → 3.555 (35.0) → 3.477 (32.4) → 3.433 (30.97) — taper confirms ~4 epochs is the practical ceiling for repeated data.
Throughput 214,890 tok/s, peak 13.9 GB.
Survived two crashes (sleep @ step 6,558; session-teardown @ 27,106) via --resume — see §9.

Model C — `sft_12m_16k` (SFT of B)

Instruction-tuned B. 10,789 train + 800 val pairs (8,111 Dolly + 3,500 SQuAD), max_len 512, 30.2% of tokens supervised. 3 epochs, lr 2e-5, batch 32.

Response val loss 3.256 → 2.758. Cleanly converged, no overfit.
Behavior change: now answers and stops (<eos>); genuine extractive QA (§8).

Model D — `smoke_100m_16k_4b` (100M / 16k / 4B) — IN PROGRESS (PAUSED)

GPT-2-small-class scale-up. 97,536,768 params, 4B-token single pass (target 61,036 steps).

Paused at step 31,892 (~52%), val loss ~3.04 and still dropping — already below B's final 3.433 at barely half-trained (same tokenizer/distribution ⇒ fair comparison). This is the scaling payoff: capability is param-limited, and more params is the lever.
Measured throughput 45,703 tok/s (benchmark predicted 34,537 — conservative), peak 7.9 GB.
Phase 3 (sft_100m_16k) auto-runs after pretraining, reusing C's SFT data.
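
The progress and budget figures for D can be cross-checked with a little arithmetic. The sketch below assumes 65,536 tokens per optimizer step, a value inferred from Model A (200,015,872 tokens over 3,052 steps) and consistent with B's totals; the function name is illustrative.

```python
TOKENS_PER_STEP = 65_536  # inferred: 200,015,872 tokens / 3,052 steps (Model A)


def run_stats(params: int, done_steps: int, target_steps: int, tok_per_s: float) -> dict:
    """Progress, token budget per parameter, and compute-only time remaining."""
    total_tokens = target_steps * TOKENS_PER_STEP
    remaining_tokens = (target_steps - done_steps) * TOKENS_PER_STEP
    return {
        "progress": done_steps / target_steps,
        "tokens_seen": done_steps * TOKENS_PER_STEP,
        "tokens_per_param": total_tokens / params,
        "remaining_hours": remaining_tokens / tok_per_s / 3600,
    }


d = run_stats(params=97_536_768, done_steps=31_892,
              target_steps=61_036, tok_per_s=45_703)
print(f"progress {d['progress']:.1%}, seen {d['tokens_seen'] / 1e9:.2f}B tokens")
print(f"{d['tokens_per_param']:.1f} tokens/param at completion")
print(f"{d['remaining_hours']:.1f} h of pure training compute remaining")
```

At the measured throughput this gives roughly 11.6 h of training compute for the remaining steps. It ignores evaluation and checkpoint overhead, so wall-clock time will be somewhat longer.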

5. Key design decisions & rationale

8k → 16k vocab. The bigger corpus justified a bigger vocab (better compression: 16k packs the same text in ~10% fewer tokens than 8k). Tradeoff is embedding share of params (21%→34% at 12M) — acceptable, and only 13% at 100M.
Reuse the tokenizer across runs. Perplexity is only comparable with an identical tokenizer, so B and D share the 16k tokenizer → their loss curves are directly comparable.
Deterministic, seed-based data selection. Same seed ⇒ larger budgets are supersets of smaller ones and validation splits are identical — controlled scaling experiments.
Single cosine over N epochs, not warm restarts. "Train 4 epochs" = one continuous cosine over all 4 (B). Bolting a fresh cosine onto already-annealed weights is a different (worse) regime.
Fresh data > repeats. Data-constrained scaling: repeating data is "nearly as good as fresh" only up to ~4 epochs, then collapses. B repeated 4× (720M data); D uses 4B unique tokens (single pass) — strictly better when data is available.
Token budget. ~20 tokens/param is Chinchilla-optimal; small models benefit from over-training. A ≈ 20/param; D ≈ 41/param (mild over-train).
SFT, not RL. Making the model answer is supervised fine-tuning (loss masked to the response, every example ends in <eos> to teach stopping). RL/DPO only polishes preferences and is the weakest lever at this capacity — deferred indefinitely.
Play to capacity strengths. A tiny model can't store facts but can extract from a given passage, so SFT mixes in SQuAD; the highest-value future lever is RAG.
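
The superset property of the seed-based selection can be sketched as follows: rank every document by a keyed BLAKE2b hash, then take a prefix of that ranking until the token budget is met. This is an illustration, not the code in `select_fineweb_smoke_subset.py`; the seed value, document ids, and function names are hypothetical.

```python
import hashlib


def rank_key(doc_id: str, seed: bytes) -> int:
    """Stable pseudo-random rank for a document, keyed by the seed."""
    h = hashlib.blake2b(doc_id.encode("utf-8"), key=seed, digest_size=8)
    return int.from_bytes(h.digest(), "big")


def select(docs: dict, token_budget: int, seed: bytes = b"smoke-seed") -> list:
    """docs maps doc_id -> token count. Returns selected ids in rank order."""
    ranked = sorted(docs, key=lambda d: (rank_key(d, seed), d))
    chosen, used = [], 0
    for doc_id in ranked:
        if used >= token_budget:
            break
        chosen.append(doc_id)
        used += docs[doc_id]
    return chosen


# Toy corpus: 200 documents with made-up token counts.
corpus = {f"doc-{i}": 100 + (i % 7) * 10 for i in range(200)}
small = select(corpus, token_budget=1_000)
large = select(corpus, token_budget=5_000)
print(len(small), len(large), large[: len(small)] == small)
```

Because the ranking depends only on the seed and the document id, a larger budget extends the same prefix, and a different seed gives an unrelated ordering.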

Improvement levers, ranked (for a capacity-limited model)

1. Bigger model (the real unlock: Model D)
2. More/better pretrain data
3. Distillation from a larger teacher
4. RAG (best practical, no retrain)
5. Decoding fixes (repetition penalty)
6. More SFT data
7. RL/DPO (last)

6. Performance summary

Model	Params	Throughput (tok/s)	Peak VRAM	Val loss	Notes
A 10M/8k	10.1M	227,563	4.6 GB	3.559	baseline
B 12M/16k	12.2M	214,890	13.9 GB	3.433	+vocab, +epochs
D 100M/16k	97.5M	45,703	7.9 GB	~3.04 (52%)	scaling win

Scaling observation: at ~52% trained (about 2.09B tokens seen), the 100M model already beats the fully trained 12M model's loss, even though the 12M model saw 2.88B token-passes. This points to parameters, not data or tuning, as the bottleneck at small scale.

7. SFT mechanics (how instruction-tuning works here)

Format: every example is User: {prompt}\n\nAssistant: {response}<eos>.
Loss masking: only response tokens + <eos> carry loss; prompt and padding are set to -100 (ignored by cross_entropy). The model learns to generate the answer and stop, not to predict the (given) prompt.
Data: Dolly-15k (filtered to short, no-context categories) for instruction breadth + SQuAD (passage→answer) for extractive skill the model can actually do well.
Hyperparameters: init from pretrained weights, LR 2e-5 (50× lower than pretraining, to avoid washing out the base), 3 epochs, no weight decay.
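
The masking rule above can be sketched in a few lines. This is a minimal illustration, not the code in `prepare_sft.py`: the token ids, `pad_id`, and function name are made up, and the one-position shift between inputs and targets is assumed to happen in the training loop.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy


def build_example(prompt_ids, response_ids, eos_id, max_len, pad_id):
    """Return (input_ids, labels), both of length max_len.

    Only response tokens and the closing <eos> keep their ids in `labels`;
    prompt and padding positions are set to IGNORE_INDEX so they add no loss.
    """
    input_ids = prompt_ids + response_ids + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [eos_id]
    # Truncation can cut off <eos>; a real pipeline may prefer to drop such examples.
    input_ids, labels = input_ids[:max_len], labels[:max_len]
    pad = max_len - len(input_ids)
    return input_ids + [pad_id] * pad, labels + [IGNORE_INDEX] * pad


inputs, labels = build_example(
    prompt_ids=[11, 12, 13, 14],   # stands for "User: ...\n\nAssistant: "
    response_ids=[21, 22],
    eos_id=2, max_len=10, pad_id=0,
)
supervised = sum(label != IGNORE_INDEX for label in labels)
print(inputs)
print(labels)
print(f"{supervised}/{len(labels)} positions supervised")
```

The supervised fraction computed this way is the per-example analogue of the 30.2% figure reported for Model C.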

8. Input / output examples

Base models (next-token predictors — fluent, no answering)

12M/16k base, "What is the capital of France?" → "…the capital of France is the capital of France. We can look at it and know it…" (loops, never stops).

After SFT (Model C) — base vs SFT, same prompts

Prompt	Base (`epoch_4`)	SFT (`best`)
Capital of France?	never stops; loops	`eos@13` "The capital of France is Ville de Lafayette." (stops; wrong — capacity)
How tall is the Eiffel Tower? (passage given)	hallucinates "Wounded Hole…President Madison"	`eos@3` "330 metres" ✅
What gas does photosynthesis release? (passage given)	rambles, never stops	`eos@1` "oxygen" ✅

Takeaway: SFT taught the model to answer and stop; with facts in the prompt (extractive), it's correct. Open factual recall still fails — the 12M/100M capacity ceiling, which SFT cannot fix (RAG can).
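
For reference, a minimal sketch of the decoding loop implied by the serving settings in §3 (top-k 40, temperature 0.75) with an `<eos>` stop. It is pure Python over a list of logits, not the code in `chat_app.py`. The function names and the toy model are hypothetical, and counting tokens generated before `<eos>` is only one plausible reading of the `eos@N` annotations, which the source does not define.

```python
import math
import random


def sample_top_k(logits, k=40, temperature=0.75, rng=random):
    """Sample a token id from the k highest logits after temperature scaling."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalized softmax weights
    return rng.choices(top, weights=weights, k=1)[0]


def generate(next_logits, prompt_ids, eos_id, max_new_tokens=64, **sampling):
    """Return (new_token_ids, tokens_before_eos), or None if <eos> never appeared."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        tok = sample_top_k(next_logits(ids), **sampling)
        if tok == eos_id:
            new = ids[len(prompt_ids):]
            return new, len(new)
        ids.append(tok)
    return ids[len(prompt_ids):], None


def toy_model(ids):
    """Hypothetical stand-in for a model: prefers token 2, then <eos> (id 1)."""
    return [0.0, 10.0 if len(ids) >= 4 else -10.0, 5.0]


print(generate(toy_model, prompt_ids=[2], eos_id=1, k=1))
```

A base model that never assigns high probability to `<eos>` runs to `max_new_tokens` in this loop, which matches the "never stops" behavior in the table.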

9. Infrastructure & operations (resilience)

Long runs on this single-GPU Windows box died twice — sleep (step 6,558) and session teardown (step 27,106), both silent (no traceback). Fixes, now standard:

--resume <ckpt> — restores model + optimizer + RNG + data cursor → bit-exact continuation (verified resume delta 0.0). last.pt saved every 250 steps.
Scheduled Task launch — Task Scheduler owns the process, so it survives the app/session closing.
Graceful STOP file — pause-at-will: trainer checkpoints within ~2 s and exits, freeing VRAM.
Auto-resume loop (run_100m_pipeline.cmd) — relaunches from last.pt on any crash, honors STOP, stops at completion.
Sleep disabled (powercfg) during runs.
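
The resume and STOP mechanics above can be illustrated with a toy loop. This is a sketch under stated assumptions, not the trainer: the "model" is a single float, the checkpoint is a pickle named `last.pkl` rather than `last.pt`, and all function names are hypothetical. It shows why saving the RNG state alongside the step counter gives an exact continuation.

```python
import os
import pickle
import random


def save_atomic(path, payload):
    """Write to a temp file, then rename, so a crash never leaves a partial checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(payload, f)
    os.replace(tmp, path)


def train(run_dir, total_steps, save_every=250, resume=False):
    ckpt_path = os.path.join(run_dir, "last.pkl")
    stop_path = os.path.join(run_dir, "STOP")
    rng = random.Random(1234)
    step, weight = 0, 0.0
    if resume and os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            saved = pickle.load(f)
        step, weight = saved["step"], saved["weight"]
        rng.setstate(saved["rng"])  # restore the random stream exactly
    while step < total_steps:
        weight += rng.random()      # stand-in for one optimizer step
        step += 1
        stop_requested = os.path.exists(stop_path)
        if step % save_every == 0 or stop_requested or step == total_steps:
            save_atomic(ckpt_path, {"step": step, "weight": weight,
                                    "rng": rng.getstate()})
        if stop_requested:
            break                   # graceful pause: state is already on disk
    return step, weight
```

A wrapper that reruns `train(..., resume=True)` until the returned step equals `total_steps`, and that exits when the STOP file exists, would mirror the auto-resume loop described above.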

Pause / resume workflow (current 100M run)

Pause: create runs/smoke_100m_16k_4b/STOP (pause_100m.cmd) → checkpoints + frees GPU.
Resume: clear STOP, restart the task (resume_100m.cmd) → continues from last.pt.
All state is on disk → a reboot is safe; it stays parked until explicitly resumed.

10. Artifact index

Configs: configs/smoke_10m.yaml (A), smoke_12m_16k_800m.yaml (B/C), smoke_100m_16k.yaml (D)
Tokenizers: tokenizers/fineweb_edu_smoke_8k, …_16k
Data: data/fineweb_edu_smoke_8k (A), …_16k_800m (B), fineweb_edu_100m_16k (D), sft_16k (C)
Runs: runs/<name>/ — each has summary.json, metrics.jsonl, checkpoints, generated_samples.txt
Ops scripts: run_100m_pipeline.cmd, pause_100m.cmd, resume_100m.cmd

Model D is paused mid-pretraining at step 31,892/61,036. Resume to finish the remaining ~13 h + auto-SFT. This document should be updated when D and its SFT complete.