A step-by-step guide to building a GPT-2-class LLM using nothing but 80-column punch cards.
We assume no prior punch card experience, but a strong foundation in linear algebra and exceptional attention to detail are recommended.
TARGET MODEL SPECIFICATIONS
Architecture ....... Transformer (decoder-only)
Parameters ......... 124 million
Layers ............. 12
Embedding Dim ...... 768
Attention Heads .... 12
Vocab Size ......... 50,257
Context Length ..... 1,024 tokens
Precision .......... float32 (4 bytes)
Reference .......... GPT-2 Small
00.Prerequisites & Bill of Materials
Before writing a single line of punched code, you need to secure the physical infrastructure. Building an LLM is primarily a logistics problem. The math is straightforward. The scale is not.
REQUIRED MATERIALS
▸~150,000,000 blank IBM 80-column punch cards
▸3× industrial keypunch machines (IBM 029 or equivalent)
▸128-page procedure manual (you’ll write this yourself)
▸~9.5 trillion years of uninterrupted labor
▸Coffee (amount: unbounded)
WARNING
The total weight of punch cards required for this project is approximately 265 metric tons. Verify that your warehouse floor can support this load. Standard commercial flooring is rated for ~250 kg/m². You will need reinforced flooring.
01.Floating-Point Arithmetic Library
Neural networks run on floating-point math. Every parameter is a 32-bit float. You need four fundamental operations (add, subtract, multiply, divide) plus transcendental functions (exp, log, tanh, sqrt) for activation functions and softmax.
On punch cards, you’ll implement these as lookup tables and procedure decks. A procedure deck is a sequence of cards that, when fed through the tabulating machine, performs a specific operation on input cards and produces output cards.
REQUIRED LOOKUP TABLES
▸Multiplication table — Pre-computed products for common mantissa pairs. 10,000 entries × 8 bytes = 80,000 bytes = 1,000 cards.
▸Exponential table (eˣ) — Required for softmax and GELU. 5,000 entries × 8 bytes = 500 cards.
Additionally, you need procedure decks for interpolation (values between table entries), carry propagation, and IEEE 754 special cases (NaN, infinity, denormalized numbers). Approximately 300 procedure cards.
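If you had a computer to check your procedure decks against (which rather defeats the purpose), the lookup-with-interpolation procedure works as sketched below in Python. The grid range and step size are hypothetical choices for a roughly 5,000-entry table:

```python
import math

# Hypothetical table-driven exp(x), mirroring the punch-card procedure:
# pre-compute entries on a grid, then linearly interpolate between cards.
TABLE_MIN, TABLE_MAX, TABLE_STEP = -10.0, 10.0, 0.004  # ~5,000 entries
EXP_TABLE = [math.exp(TABLE_MIN + i * TABLE_STEP) for i in range(5001)]

def exp_lookup(x: float) -> float:
    """exp(x) via table lookup plus linear interpolation between entries."""
    x = min(max(x, TABLE_MIN), TABLE_MAX)       # clamp to table range
    pos = (x - TABLE_MIN) / TABLE_STEP
    i = min(int(pos), len(EXP_TABLE) - 2)       # index of the lower card
    frac = pos - i                              # distance to the next card
    return EXP_TABLE[i] * (1 - frac) + EXP_TABLE[i + 1] * frac
```

The interpolation step is what keeps the table at 5,000 entries instead of millions; each call is two card retrievals and a handful of arithmetic operations.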
PUNCH CARD COUNT
3,000
math library (lookup tables + procedures)
RUNNING TOTAL: 3,000
02.Matrix Operations Library
Transformers are matrix multiplication engines. You need procedures for: matrix–matrix multiply, matrix–vector multiply, element-wise operations (add, multiply, apply activation), transpose, and row-wise softmax.
The core operation is matrix multiplication. For matrices A (m×k) and B (k×n), the result C (m×n) requires m×n×k multiply-accumulate operations. Each one involves a table lookup (multiply), then an addition with carry propagation.
TIME PER MATRIX MULTIPLY
A skilled operator using pre-computed lookup tables can perform one float32 multiply-accumulate in approximately 30 seconds, including card retrieval, lookup, punching the result, and filing.
Example: multiply two 768 × 768 matrices
Operations: 768 × 768 × 768 = 452,984,832
Time: 452,984,832 × 30 sec = 13,589,544,960 sec
= 430.7 years per matrix multiply
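The multiply-accumulate loop each operator executes by hand is, in Python:

```python
def matmul(A, B):
    """C = A @ B via the same multiply-accumulate loop an operator runs.

    A is m x k, B is k x n, each stored as lists of rows. Every C[i][j]
    costs k multiply-accumulates (k table lookups on punch cards).
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):          # one MAC per card lookup
                acc += A[i][p] * B[p][j]
            C[i][j] = acc
    return C
```

For two 768 × 768 matrices the inner line executes 768³ times, which at 30 seconds per execution is the 430-year figure above.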
WARNING
A single transformer layer requires approximately 14 matrix multiplications. At this rate, one forward pass through one layer takes about 6,030 years. Plan accordingly.
PUNCH CARD COUNT
500
matrix operation procedure decks
RUNNING TOTAL: 3,500
03.Build the Tokenizer (BPE)
Before training, text must be converted into integer token IDs using Byte-Pair Encoding (BPE). GPT-2 uses a vocabulary of 50,257 tokens. You need to store:
▸Vocabulary mapping — 50,257 token entries, each mapping a token ID to its byte sequence. Average ~10 bytes per entry. 502,570 bytes = 6,282 cards.
▸Merge table — ~50,000 BPE merge rules specifying which byte pairs merge in what priority. ~20 bytes per rule. 1,000,000 bytes = 12,500 cards.
▸Encoding procedure deck — The algorithm for applying merges to raw text. ~200 cards.
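A sketch of what the encoding procedure deck does, in Python, with a hypothetical two-rule merge table standing in for GPT-2's ~50,000 rules:

```python
def bpe_encode(text: str, merges: dict) -> list:
    """Apply BPE merges to one word, mirroring the encoding procedure deck.

    `merges` maps a symbol pair to its priority (lower = merge first).
    """
    symbols = list(text)
    while len(symbols) > 1:
        # find the adjacent pair with the best (lowest) merge priority
        pairs = [(merges.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(pairs)
        if rank == float("inf"):
            break                       # no mergeable pair remains
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

toy_merges = {("l", "o"): 0, ("lo", "w"): 1}   # hypothetical miniature table
```

On cards, each iteration of the loop is one pass through the merge-table cabinet, which is why the indexing advice below matters.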
TIP
File the vocabulary cards alphabetically by token bytes and use tab dividers. You’ll be looking up tokens billions of times. An index system will save you centuries.
PUNCH CARD COUNT
19,000
tokenizer (vocabulary + merge table + procedures)
RUNNING TOTAL: 22,500
04.Prepare the Training Corpus
GPT-2 was trained on WebText, approximately 40 GB of text. We’ll use a modest 10 GB subset. After tokenization, this yields approximately 2.5 billion tokens, stored as 2-byte (uint16) token IDs.
Raw text: 10 GB = 10,737,418,240 bytes
Raw text cards: 10,737,418,240 ÷ 80 = 134,217,728 cards
You only need the tokenized version for training. Store the raw text separately as a reference archive. The tokenized corpus must be organized into sequences of 1,024 tokens each for training batches.
SHUFFLING THE DATASET
Training requires random access to sequences. Using an IBM 083 card sorter at 1,000 cards/minute, a single shuffle of the 62,500,000-card tokenized corpus takes 62,500 minutes, roughly 43 days.
You will need to re-shuffle the corpus each epoch. Budget 43 days per epoch just for data shuffling. With a standard training run of 3–10 epochs, that’s 130–434 days of pure card sorting.
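The arithmetic behind these shuffle figures, as a Python sketch:

```python
# Shuffle-time arithmetic for the tokenized corpus.
CARDS = 62_500_000      # tokenized training corpus, from Step 04
SORT_RATE = 1_000       # IBM 083 throughput, cards per minute

minutes_per_shuffle = CARDS / SORT_RATE           # 62,500 minutes
days_per_shuffle = minutes_per_shuffle / (60 * 24)

print(round(days_per_shuffle, 1))   # ~43.4 days per epoch
```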
PUNCH CARD COUNT
62,500,000
tokenized training corpus
RUNNING TOTAL: 62,522,500
05.Define the Transformer Architecture
Create a specification deck that defines every layer, dimension, and connection in the model. This is your blueprint — the punch card equivalent of a model config file.
// architecture.deck
EMBEDDING: vocab=50257, dim=768
POSITION: max_len=1024, dim=768
// repeat 12×
LAYER_NORM: dim=768
ATTENTION: heads=12, dim=768, head_dim=64
RESIDUAL: add
LAYER_NORM: dim=768
FFN: dim=768, hidden=3072, activation=GELU
RESIDUAL: add
// output
LAYER_NORM: dim=768
LINEAR: in=768, out=50257 (tied with embedding)
SOFTMAX: dim=50257
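A Python sketch verifying that this deck really specifies ~124 million parameters (counting biases, and assuming, as the deck notes, that the output LINEAR is tied to the embedding and so adds nothing):

```python
# Parameter count implied by architecture.deck.
V, D, L, H = 50_257, 768, 12, 3_072

embed = V * D                  # token embedding (tied with output layer)
pos = 1_024 * D                # positional embedding
attn = 4 * (D * D + D)         # Q, K, V, output projections + biases
ffn = D * H + H + H * D + D    # two FFN matrices + biases
ln = 2 * 2 * D                 # two layer norms (gamma, beta) per layer
per_layer = attn + ffn + ln
final_ln = 2 * D

total = embed + pos + L * per_layer + final_ln
print(total)   # 124,439,808, i.e. ~124 million
```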
The GELU activation function is critical. Its tanh approximation requires computing:
GELU(x) = 0.5 × x × (1 + tanh(√(2/π) × (x + 0.044715 × x³)))
for each of the 3,072 hidden units, per layer, per token, per training step. This is why your tanh lookup table needs to be very thorough.
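For checking your lookup tables against ground truth, the standard tanh-approximation GELU in Python:

```python
import math

def gelu(x: float) -> float:
    """GELU via the tanh approximation: the formula the lookup tables serve."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))
```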
PUNCH CARD COUNT
500
architecture specification deck
RUNNING TOTAL: 62,523,000
06.Initialize Model Weights
The model has 124 million parameters. Each is a 32-bit floating-point number (4 bytes). They must be initialized with small random values, e.g. Xavier/Glorot initialization, which draws each weight with variance 2/(n_in + n_out). Punch cards cannot generate randomness, so you need a physical entropy source:
▸Physical entropy — Flip a coin for each bit. For 496 million bytes × 8 bits = 3.97 billion coin flips. At 3 flips per second, this takes approximately 42 years.
▸Atmospheric noise — Record radio static, convert amplitude samples to bits. Faster, but requires additional equipment.
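A sketch of turning raw coin-flip bits into one initialized weight via a Box-Muller transform. The bit-packing scheme and the standard deviation are illustrative choices, not prescribed by the original recipe:

```python
import math, random

def bits_to_uniform(bits):
    """Pack 32 coin-flip bits into a uniform float in (0, 1)."""
    n = 0
    for b in bits:
        n = (n << 1) | b
    return (n + 0.5) / 2**32

def gaussian_weight(bits64, std=0.02):
    """Two 32-bit draws -> one normal(0, std^2) weight via Box-Muller."""
    u1 = bits_to_uniform(bits64[:32])
    u2 = bits_to_uniform(bits64[32:])
    return std * math.sqrt(-2.0 * math.log(u1)) * math.cos(2 * math.pi * u2)

# e.g. 64 flips drawn from the ENTROPY RESERVE cabinet
flips = [random.randint(0, 1) for _ in range(64)]
w = gaussian_weight(flips)
```

Each weight consumes 64 coin flips, which is why the coin-flip budget runs to billions.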
TIP
Pre-punch a set of "random number cards" using atmospheric noise. Store them in a separate cabinet labeled "ENTROPY RESERVE." You will need these again for dropout (if you implement regularization).
PUNCH CARD COUNT
6,200,000
initial model weights
RUNNING TOTAL: 68,723,000
07.Implement the Forward Pass
The forward pass converts a sequence of 1,024 token IDs into a probability distribution over the next token. Here is every operation you must perform, in order:
STEP 7.1: TOKEN EMBEDDING
Look up each of the 1,024 input tokens in the embedding weight matrix (50,257 × 768). Each lookup retrieves a 768-dimensional vector. Result: a 1,024 × 768 matrix. This is a card lookup operation — find the right card among 50,257 filed entries, 1,024 times.
STEP 7.2: POSITIONAL ENCODING
Add the positional embedding matrix (1,024 × 768) element-wise to the token embeddings. This is 786,432 floating-point additions.
STEP 7.3: TRANSFORMER LAYERS (×12)
For each of the 12 layers, perform the following sub-steps:
// 7.3a: Layer Normalization
Compute mean and variance over dim=768 for each of 1,024 positions
Normalize, scale (γ), and shift (β)
Ops: ~3 × 1,024 × 768 = 2,359,296
// 7.3b: Multi-Head Self-Attention
Project Q, K, V: three (1024×768) × (768×768) matmuls
Split into 12 heads of dim 64
Attention scores: Q × Kᵀ for each head: (1024×64) × (64×1024)
Scale by 1/√64 = 1/8
Apply causal mask (upper triangle → -∞)
Softmax over each row (1,024 exp + div operations per row)
Weighted sum: attention × V for each head, then concatenate heads
Output projection: one (1024×768) × (768×768) matmul
// 7.3c: Feed-Forward Network
Two matmuls (768→3072 and 3072→768) with GELU in between
Intermediate activations to store for backprop:
All 12 layers: ~792 MB (must store all for backprop)
= 792,000,000 ÷ 80 = 9,900,000 cards
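What one attention head in step 7.3b computes, sketched in Python (single head, plain lists, no batching; the real model runs this 12 times per layer):

```python
import math

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V: lists of T rows, each a list of d floats.
    """
    T, d = len(Q), len(Q[0])
    out = []
    for i in range(T):
        # scores against positions 0..i only (the causal mask)
        scores = [sum(Q[i][x] * K[j][x] for x in range(d)) / math.sqrt(d)
                  for j in range(i + 1)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]      # stable softmax
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(weights[j] * V[j][x] for j in range(i + 1))
                    for x in range(d)])
    return out
```

Every `math.exp` here is a lookup in the Step 01 exponential table, and every multiply inside the sums is a 30-second manual operation.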
WARNING
You must keep all intermediate activations from the forward pass in order to compute gradients during backpropagation. Do not discard or re-file scratch cards until backprop is complete. Label every scratch card with its layer index and position.
PUNCH CARD COUNT
9,900,000
scratch space cards (reusable per step)
RUNNING TOTAL: 78,623,000
08.Compute the Loss Function
GPT-2 uses cross-entropy loss: for each position in the sequence, compute the negative log probability of the correct next token.
L = -(1/T) ∑ log(p(correct_token))
After softmax gives you a probability distribution over 50,257 tokens, look up the probability assigned to the correct token and take its natural log. This requires your log lookup table from Step 01. Repeat for all 1,024 positions and average.
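The loss computation as a Python sketch; each `math.log` stands in for one lookup in the Step 01 log table:

```python
import math

def cross_entropy(probs_rows, targets):
    """Mean negative log-probability of the correct next token.

    probs_rows: per-position softmax outputs; targets: correct token ids.
    """
    total = 0.0
    for probs, t in zip(probs_rows, targets):
        total -= math.log(probs[t])
    return total / len(targets)
```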
PUNCH CARD COUNT
100
loss computation procedure deck
RUNNING TOTAL: 78,623,100
09.Implement Backpropagation
Backpropagation computes the gradient of the loss with respect to every parameter in the network by applying the chain rule in reverse order through every layer. The compute is approximately 2× the forward pass.
For each layer (in reverse, 12 → 1), you must:
Compute gradients for the FFN weights (two large matmuls)
Backprop through GELU (element-wise, using the GELU derivative lookup table)
Compute gradients for the attention projection weights
Backprop through softmax attention (per-head)
Compute gradients for Q, K, V projection weights (three matmuls)
Backprop through layer normalization (involves stored means/variances)
Accumulate gradients through residual connections
Backward FLOPs ≈ 2 × forward FLOPs
= 2 × 107 billion = ~214 billion FLOPs
At 30 sec/FLOP: ~203,000 years per backward pass
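The gradient rule behind most of those per-layer matmuls, sketched in Python for a single bias-free linear layer and a single input vector:

```python
def linear_backward(x, W, dy):
    """Gradients for y = x @ W (no bias, single example).

    x: list of n_in inputs, W: n_in x n_out weight rows,
    dy: list of n_out upstream gradients (dL/dy).
    Returns (dW, dx): the two extra "matmuls" backprop charges per layer.
    """
    n_in, n_out = len(W), len(W[0])
    # dL/dW[i][j] = x[i] * dy[j]  (outer product)
    dW = [[x[i] * dy[j] for j in range(n_out)] for i in range(n_in)]
    # dL/dx[i] = sum over j of dy[j] * W[i][j]
    dx = [sum(dy[j] * W[i][j] for j in range(n_out)) for i in range(n_in)]
    return dW, dx
```

The two returned quantities are why the backward pass costs roughly 2× the forward pass: one matmul-shaped computation for the weight gradient, one for the input gradient that flows to the layer below.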
You must also store the gradients for all 124 million parameters. That’s another full set of weight cards:
PUNCH CARD COUNT
6,200,000
gradient storage cards
RUNNING TOTAL: 84,823,100
10.Implement the Adam Optimizer
Adam maintains two running averages per parameter: the first moment (mean of gradients, m) and the second moment (mean of squared gradients, v). This requires two additional complete copies of the parameter storage.
// Adam update for each parameter θ:
m = β₁ × m + (1 - β₁) × gradient
v = β₂ × v + (1 - β₂) × gradient²
m̂ = m / (1 - β₁ᵗ)
v̂ = v / (1 - β₂ᵗ)
θ = θ - lr × m̂ / (√v̂ + ε)
Each parameter update involves ~15 floating-point operations. For 124 million parameters, that’s 1.86 billion operations per training step just for the optimizer.
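The update above, as a runnable Python sketch for a single scalar parameter:

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for one parameter: the ~15 FLOPs counted above."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)        # bias correction, step t >= 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

Note that m and v must persist between steps, which on punch cards means re-punching two cards per parameter per step.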
PUNCH CARD COUNT
12,400,000
Adam optimizer state (m and v copies)
RUNNING TOTAL: 97,223,100
11.Run the Training Loop
Training for 3 epochs over 2.5 billion tokens, at 1,024 tokens per step, comes to 7,324,218 optimizer steps. At 30 seconds per operation, a single operator would finish in approximately 2.25 trillion years, about 163 times the current age of the universe (13.8 billion years).
PARALLELIZATION
You can reduce training time by hiring more operators. Matrix operations are embarrassingly parallel — each element of the output can be computed independently.
1 operator: ......... 2.25 trillion years
10 operators: ........ 225 billion years
1,000 operators: ..... 2.25 billion years
1,000,000 operators: . 2.25 million years
7,000,000 operators: . ~321,000 years
TIP
At 7 million parallel operators, training completes in roughly 321,000 years, approximately the time since anatomically modern humans first appeared. This is the minimum viable staffing level for a single training run.
Checkpoint cards (saving weights every 5,000 steps):
Checkpoints: 7,324,218 training steps ÷ 5,000 ≈ 1,465 saves
Cards per checkpoint: 6,200,000
Total checkpoint cards: 1,465 × 6,200,000 = 9,083,000,000
WARNING
Checkpoint storage alone requires 9 billion cards, weighing approximately 16,000 metric tons. You will need a second warehouse. Consider checkpointing less frequently.
PUNCH CARD COUNT
9,083,000,000
training checkpoints (every 5,000 steps)
RUNNING TOTAL: 9,180,223,100
12.Run Inference
After training, you can generate text. Inference is autoregressive: to generate N tokens, you run the forward pass N times, each time appending the predicted token to the input sequence.
To generate a single token:
1 forward pass: ~107 billion FLOPs
At 30 sec/FLOP: ~101,700 years per token
To generate a 100-token response:
= 100 × 101,700 years
= 10,170,000 years per response
You’ll also need to implement temperature scaling and top-k sampling during token selection. Temperature divides the logits by a scalar before softmax. Top-k requires sorting 50,257 values to find the k largest — approximately 16 minutes on the card sorter per token generated.
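Temperature scaling and top-k selection, as a Python sketch (the sort below is the ~16-minute card-sorter pass; the default temperature and k are illustrative, not prescribed):

```python
import math, random

def sample_next(logits, temperature=0.8, k=40):
    """Temperature + top-k sampling for one token-selection step."""
    # keep the k indices with the largest logits (the card-sorter pass)
    top = sorted(range(len(logits)), key=lambda i: logits[i])[-k:]
    scaled = [logits[i] / temperature for i in top]   # temperature scaling
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]          # stable softmax
    z = sum(exps)
    r = random.random() * z                           # sample from the k
    for tok, e in zip(top, exps):
        r -= e
        if r <= 0:
            return tok
    return top[-1]
```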
TIP
For faster inference, consider pre-computing a KV cache: store the key and value matrices from previous positions so you only compute attention for the new token. This reduces per-token compute from O(n²) to O(n), at the cost of additional cache storage cards.