PAUSED · UPDATED 28 June 2026

LLM Maker — Model Registry & Design Documentation



A from-scratch decoder-only LLM project: pretraining tiny models on FineWeb-Edu, instruction-tuning them, and scaling up. This document covers every model built, the architecture and why, design decisions and their rationale, measured performance, and input/output examples.

Per-topic docs also exist: TRAINING_RUN.md (200M run detail, coming soon), SFT_PLAN.md (coming soon), PLAN_100M.md (coming soon). This file is the index.


1. Model registry (at a glance)

| # | Model / run | Params | Vocab | Train tokens | Val loss (ppl) | Status |
|---|---|---|---|---|---|---|
| A | smoke_10m_full_200m | 10,096,896 | 8,192 | 200M (×1) | 3.559 (35.1) | ✅ done |
| B | smoke_12m_16k_4ep | 12,194,048 | 16,384 | 2.88B (4×720M) | 3.433 (30.97) | ✅ done |
| C | sft_12m_16k (SFT of B) | 12,194,048 | 16,384 | 10.8k SFT pairs | resp. 2.758 | ✅ done |
| D | smoke_100m_16k_4b | 97,536,768 | 16,384 | 4B (×1) | ~3.04 @ 52% | ⏸ paused |
| D+ | sft_100m_16k (SFT of D) | 97,536,768 | 16,384 | (reuses C's data) | | ⏳ queued |

Perplexity is comparable only within the same tokenizer. A is 8k; B/C/D are 16k (B→D directly comparable, same tokenizer + data distribution).
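
The perplexity figures in the registry follow from the validation loss as exp(loss), assuming the loss is the mean per-token cross-entropy in nats. A minimal check using the registry values (the function name is illustrative, not taken from the project's scripts):

```python
import math

def perplexity(val_loss: float) -> float:
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(val_loss)

# Validation losses copied from the registry table.
ppl_a = perplexity(3.559)  # Model A, 8k tokenizer
ppl_b = perplexity(3.433)  # Model B, 16k tokenizer

print(f"A: {ppl_a:.1f}  B: {ppl_b:.2f}")  # A: 35.1  B: 30.97
```

Because A and B use different tokenizers, their tokens cover different amounts of text, so these two numbers should not be compared with each other.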


2. Shared architecture (and why each piece)

All models are the same decoder-only Transformer family, scaled by width/depth:

| Component | Choice | Why |
|---|---|---|
| Type | Decoder-only, causal | Standard autoregressive LM; next-token prediction |
| Positional encoding | RoPE | Relative positions, extrapolates better than learned/absolute |
| Normalization | RMSNorm | Cheaper than LayerNorm, no mean-centering, stable |
| MLP | SwiGLU (hidden ≈2.7×d at 100M; 3.75×d at 10M/12M) | Gated activation, stronger than GELU MLP at equal params |
| Embeddings | Tied input/output | Saves params (big deal at small scale), regularizes |
| Attention | Causal scaled-dot-product (PyTorch SDPA) | Fast fused kernel, is_causal=True |
| Precision | bf16 autocast, fp32 loss | Stable on Ampere (3090), no loss scaler needed |
| Optimizer | Fused AdamW (β 0.9/0.95, wd 0.1) | Standard LM recipe; fused for speed |
| LR schedule | Cosine to 10%, linear warmup | Standard; smooth anneal |
| Grad clip | Global norm 1.0 | Stability |
| Context | 1024 tokens | Fixed by training; extension needs separate work |
| Tokenizer | Byte-level BPE (8k → 16k) | No <unk> issues; see §5 |
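
The "cosine to 10%, linear warmup" schedule can be sketched as a pure function of the step. This is a minimal sketch, not the trainer's code: `lr_at`, `warmup_steps`, and the example peak learning rate are hypothetical, since the document does not state the pretraining learning rate or warmup length.

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_steps: int, min_ratio: float = 0.10) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_ratio * peak_lr."""
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches exactly peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)

# Hypothetical values: peak_lr normalized to 1.0, 100 warmup steps.
for s in (0, 99, 500, 1000):
    print(s, round(lr_at(s, total_steps=1000, peak_lr=1.0, warmup_steps=100), 4))
```

For a multi-epoch run with a single cosine across all epochs (as in Model B), `total_steps` would be the step count of the whole run rather than of one epoch.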

Per-model dimensions

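| | A (10M) | B/C (12M) | D (100M) |
|---|---|---|---|
| Layers | 8 | 8 | 12 |
| d_model | 256 | 256 | 768 |
| Heads (head_dim) | 4 (64) | 4 (64) | 12 (64) |
| MLP hidden | 960 | 960 | 2048 |
| Vocab | 8,192 | 16,384 | 16,384 |
| Embedding share of params | 21% | 34% | 13% |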

The trainer derives the expected param count from these dims and asserts it — so a config typo can't silently change the model.
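
The check described above can be reproduced from the dimensions alone. The sketch below is not the trainer's code; it assumes bias-free linear layers, weight-only RMSNorm (two per block plus one final norm), and tied embeddings counted once. Under those assumptions the formula reproduces all three registry parameter counts exactly.

```python
def expected_params(n_layers: int, d_model: int, mlp_hidden: int, vocab: int) -> int:
    """Parameter count for the assumed layout (no biases, tied embeddings)."""
    embed = vocab * d_model          # tied: shared by input and output
    attn = 4 * d_model * d_model     # Wq, Wk, Wv, Wo
    mlp = 3 * d_model * mlp_hidden   # SwiGLU: gate, up, down projections
    norms = 2 * d_model              # two RMSNorm weight vectors per block
    final_norm = d_model
    return embed + n_layers * (attn + mlp + norms) + final_norm

# (layers, d_model, mlp_hidden, vocab, param count from the registry)
CONFIGS = {
    "A": (8, 256, 960, 8192, 10_096_896),
    "B": (8, 256, 960, 16384, 12_194_048),
    "D": (12, 768, 2048, 16384, 97_536_768),
}

for name, (layers, d, hidden, vocab, registry) in CONFIGS.items():
    got = expected_params(layers, d, hidden, vocab)
    assert got == registry, f"{name}: {got} != {registry}"
    print(f"{name}: {got:,} params, embeddings {vocab * d / got:.0%}")
```

The printed embedding shares (21%, 34%, 13%) match the table, which shows why doubling the vocabulary is expensive at 12M parameters and nearly free at 100M.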


3. The pipeline

download shards → select subset (deterministic hash) → train BPE tokenizer →
pack to uint16 blocks → pretrain → SFT → serve (chat_app) / eval (chat_samples)

| Stage | Script | Notes |
|---|---|---|
| Source | fineweb-edu/ | FineWeb-Edu sample-10BT (~9.97B GPT-2 tokens, 14 shards) |
| Select | select_fineweb_smoke_subset.py | Keyed BLAKE2b hash ranking → deterministic, reproducible subset; same seed ⇒ superset/identical splits |
| Tokenizer | train_smoke_tokenizer.py | Byte-level BPE on a hashed text sample |
| Pack | prepare_smoke_tokens.py | Encode + <eos> per doc, concat, write 1024-token uint16 blocks |
| Pretrain | train_smoke_model.py | Memory-mapped blocks, multi-epoch w/ per-epoch reshuffle, --resume, graceful STOP |
| SFT | prepare_sft.py + sft_finetune.py | Masked instruction tuning (§7) |
| Eval | sft_eval.py, chat_samples.py | Base vs SFT, <eos>-stop reporting |
| Serve | chat_app.py | Local Flask chat (top-k 40, temp 0.75) |
| Orchestrate | run_100m_pipeline.cmd | All phases, resumable, pausable |
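
The Select stage's keyed-hash ranking can be illustrated with Python's `hashlib.blake2b`. This is a sketch of the general technique, not the contents of select_fineweb_smoke_subset.py: the function names, the document IDs, and the seed value are hypothetical.

```python
import hashlib

def rank_key(doc_id: str, seed: bytes) -> int:
    """Keyed BLAKE2b digest of the document ID, read as an integer rank."""
    digest = hashlib.blake2b(doc_id.encode("utf-8"), key=seed, digest_size=8).digest()
    return int.from_bytes(digest, "big")

def select_subset(doc_ids, n, seed=b"smoke-seed"):
    """Take the n documents with the lowest keyed-hash rank."""
    # The doc ID is a tie-breaker so the order never depends on input order.
    return sorted(doc_ids, key=lambda d: (rank_key(d, seed), d))[:n]

docs = [f"doc-{i}" for i in range(1000)]
small = select_subset(docs, 100)
large = select_subset(docs, 300)

# Same seed: a larger subset contains the smaller one,
# and the result does not depend on the order the shards were read in.
assert set(small) <= set(large)
assert small == select_subset(list(reversed(docs)), 100)
```

Ranking by hash rather than sampling with a random number generator means the subset depends only on the seed and the document IDs, so it can be recomputed on another machine.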

4. Per-model detail

Model A — smoke_10m_full_200m (10M / 8k / 200M)

The original baseline. 10,096,896 params, 8k vocab, 200,015,872 tokens (3,052 steps).

Model B — smoke_12m_16k_4ep (12M / 16k / 2.88B, 4 epochs)

Scaled vocab 8k→16k and trained 4 epochs over ~720M unique tokens (single cosine across all 4). 12,194,048 params, 2,883,059,712 token-passes (43,992 steps).

Model C — sft_12m_16k (SFT of B)

Instruction-tuned B. 10,789 train + 800 val pairs (8,111 Dolly + 3,500 SQuAD), max_len 512, 30.2% of tokens supervised. 3 epochs, lr 2e-5, batch 32.
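
"Masked" instruction tuning usually means the loss is computed only on response tokens. The sketch below shows one common way to build such labels, assuming prompt positions are set to -100 (the default `ignore_index` of PyTorch's cross-entropy) and that the closing <eos> is supervised. The helper names and token IDs are hypothetical, and the next-token shift is assumed to happen in the training loop.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross_entropy

def build_example(prompt_ids, response_ids, eos_id, max_len=512):
    """Return (input_ids, labels) with the prompt masked out of the loss."""
    input_ids = (prompt_ids + response_ids + [eos_id])[:max_len]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids + [eos_id])[:max_len]
    return input_ids, labels

def supervised_fraction(examples):
    """Share of label positions that contribute to the loss."""
    total = sum(len(labels) for _, labels in examples)
    kept = sum(1 for _, labels in examples for t in labels if t != IGNORE_INDEX)
    return kept / total

# Toy example: 7 prompt tokens, 2 response tokens, 1 <eos>.
example = build_example(prompt_ids=[5, 6, 7, 8, 9, 10, 11],
                        response_ids=[42, 43], eos_id=2)
print(example[1])
print(supervised_fraction([example]))
```

Averaged over a whole dataset, this fraction is the kind of statistic reported above as "30.2% of tokens supervised".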

Model D — smoke_100m_16k_4b (100M / 16k / 4B) — IN PROGRESS

GPT-2-small-class scale-up. 97,536,768 params, 4B-token single pass (target 61,036 steps).
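
The step and token counts quoted for Models A, B, and D are mutually consistent with 65,536 tokens per optimizer step. The check below assumes that this is 64 sequences of 1024 tokens; only the product is implied by the figures, the split is a guess.

```python
TOKENS_PER_STEP = 64 * 1024  # assumed split; only the product 65,536 is implied

# (steps, tokens) as quoted in the per-model detail
RUNS = {
    "A": (3_052, 200_015_872),
    "B": (43_992, 2_883_059_712),
}
for name, (steps, tokens) in RUNS.items():
    assert steps * TOKENS_PER_STEP == tokens, name

# Model B: four epochs over the same unique tokens.
b_unique_tokens = 2_883_059_712 // 4
# Model D: the target step count corresponds to just over 4B tokens.
d_target_tokens = 61_036 * TOKENS_PER_STEP

print(f"B unique tokens per epoch: {b_unique_tokens:,}")  # 720,764,928
print(f"D target tokens: {d_target_tokens:,}")            # 4,000,055,296
```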


5. Key design decisions & rationale

Improvement levers, ranked (for a capacity-limited model)

  1. Bigger model (the real unlock — Model D)
  2. More/better pretrain data
  3. Distillation from a larger teacher
  4. RAG (best practical, no retrain)
  5. Decoding fixes (repetition penalty)
  6. More SFT data
  7. RL/DPO (last)
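
As an illustration of the "decoding fixes" lever, the sketch below combines a repetition penalty with the top-k 40 / temperature 0.75 settings that chat_app.py is listed as using. It is not the project's sampler: `sample_next` and the penalty value 1.2 are hypothetical, and the penalty follows the common divide-positive / multiply-negative form.

```python
import math
import random

def sample_next(logits, history, top_k=40, temperature=0.75,
                repetition_penalty=1.2, rng=random):
    """Sample one token ID from a list of logits, discouraging repeats."""
    adjusted = list(logits)
    for tok in set(history):
        # Make already-generated tokens less likely in both sign cases.
        if adjusted[tok] > 0:
            adjusted[tok] /= repetition_penalty
        else:
            adjusted[tok] *= repetition_penalty
    # Keep only the top_k highest-scoring tokens.
    top = sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)[:top_k]
    scaled = [adjusted[i] / temperature for i in top]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [2.0, 1.9, 0.1]
# With top_k=1 the choice is deterministic: token 0 wins unless it was already used.
print(sample_next(logits, history=[], top_k=1))   # 0
print(sample_next(logits, history=[0], top_k=1))  # 1
```

A penalty like this targets the looping seen in the base-model samples, but it does not add knowledge, which is why it ranks below model size, data, and RAG in the list above.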

6. Performance summary

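| Model | Params | Throughput (tok/s) | Peak VRAM | Val loss | Notes |
|---|---|---|---|---|---|
| A 10M/8k | 10.1M | 227,563 | 4.6 GB | 3.559 | baseline |
| B 12M/16k | 12.2M | 214,890 | 13.9 GB | 3.433 | +vocab, +epochs |
| D 100M/16k | 97.5M | 45,703 | 7.9 GB | ~3.04 (52%) | scaling win |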

Scaling observation: at ~52% trained, the 100M model already beats the fully trained 12M model's loss (~3.04 vs 3.433, same tokenizer and data distribution), which points to parameters, rather than data or tuning, as the bottleneck at small scale.


7. SFT mechanics (how instruction-tuning works here)


8. Input / output examples

Base models (next-token predictors — fluent, no answering)

12M/16k base, prompt "What is the capital of France?": "…the capital of France is the capital of France. We can look at it and know it…" (loops, never stops).

After SFT (Model C) — base vs SFT, same prompts

| Prompt | Base (epoch_4) | SFT (best) |
|---|---|---|
| Capital of France? | never stops; loops | eos@13 "The capital of France is Ville de Lafayette." (stops; wrong: capacity) |
| How tall is the Eiffel Tower? (passage given) | hallucinates "Wounded Hole…President Madison" | eos@3 "330 metres" |
| What gas does photosynthesis release? (passage given) | rambles, never stops | eos@1 "oxygen" |

Takeaway: SFT taught the model to answer and stop; with facts in the prompt (extractive), it's correct. Open factual recall still fails — the 12M/100M capacity ceiling, which SFT cannot fix (RAG can).


9. Infrastructure & operations (resilience)

Long runs on this single-GPU Windows box died twice — sleep (step 6,558) and session teardown (step 27,106), both silent (no traceback). Fixes, now standard:

Pause / resume workflow (current 100M run)


10. Artifact index


Model D is paused mid-pretraining at step 31,892/61,036. Resume to finish the remaining ~13 h + auto-SFT. This document should be updated when D and its SFT complete.
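
A rough cross-check of the remaining time, using only figures from this document. It assumes 65,536 tokens per step (implied by 61,036 steps for about 4B tokens) and that throughput stays at the measured 45,703 tok/s. The result is pure training compute; the ~13 h estimate above presumably also covers evaluation and checkpoint overhead.

```python
TOKENS_PER_STEP = 65_536   # implied by 61,036 steps for ~4B tokens
THROUGHPUT_TOK_S = 45_703  # Model D throughput from the performance summary
done, target = 31_892, 61_036

progress = done / target
remaining_tokens = (target - done) * TOKENS_PER_STEP
compute_hours = remaining_tokens / THROUGHPUT_TOK_S / 3600

print(f"{progress:.1%} done, ~{compute_hours:.1f} h of pure training compute left")
```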