◆ COMPREHENSIVE GUIDE ◆

ANYTHING IS POSSIBLE
WITH PUNCH CARDS

A step-by-step guide to building a GPT-2-class LLM using nothing but 80-column punch cards.

We assume no prior punch card experience, but a strong foundation in linear algebra and attention to detail are recommended.

TARGET MODEL SPECIFICATIONS
Architecture
Transformer (decoder)
Parameters
124 million
Layers
12
Embedding Dim
768
Attention Heads
12
Vocab Size
50,257
Context Length
1,024 tokens
Precision
float32 (4 bytes)
Reference
GPT-2 Small

00.Prerequisites & Bill of Materials

Before writing a single line of punched code, you need to secure the physical infrastructure. Building an LLM is primarily a logistics problem. The math is straightforward. The scale is not.

REQUIRED MATERIALS
WARNING
The working set of punch cards for this project (before training checkpoints; see Step 11) weighs approximately 265 metric tons. Verify that your warehouse floor can support this load. Standard commercial flooring is rated for ~250 kg/m². You will need reinforced flooring.

01.Floating-Point Arithmetic Library

Neural networks run on floating-point math. Every parameter is a 32-bit float. You need four fundamental operations (add, subtract, multiply, divide) plus transcendental functions (exp, log, tanh, sqrt) for activation functions and softmax.

On punch cards, you’ll implement these as lookup tables and procedure decks. A procedure deck is a sequence of cards that, when fed through the tabulating machine, performs a specific operation on input cards and produces output cards.

REQUIRED LOOKUP TABLES

At minimum: multiplication, reciprocal (for division), exp, log, tanh, and sqrt. Roughly 2,700 table cards.

Additionally, you need procedure decks for interpolation (values between table entries), carry propagation, and IEEE 754 special cases (NaN, infinity, denormalized numbers). Approximately 300 procedure cards.
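Before punching thousands of table cards, the scheme is worth verifying. Here is a sketch in modern Python of multiplication via log tables with linear table lookup; the 4-digit mantissa granularity is an assumption for illustration, not part of the bill of materials:

```python
import math

# Hypothetical sketch: multiplication via base-10 log lookup tables, the way
# a card-based math library might work. Mantissa granularity (0.001) is an
# assumed deck size, not a spec from this guide.
LOG_TABLE = {i: math.log10(i / 1000) for i in range(1000, 10000)}  # mantissas 1.000-9.999

def table_log10(x):
    """Look up log10(x) for a mantissa x in [1, 10) via the nearest table card."""
    key = int(x * 1000)
    return LOG_TABLE[min(max(key, 1000), 9999)]

def table_multiply(a, b):
    """Multiply two positive floats by adding their logs, as an operator
    with log and antilog card decks would."""
    # Normalize each operand to a mantissa in [1, 10) and an exponent.
    ea, eb = math.floor(math.log10(a)), math.floor(math.log10(b))
    ma, mb = a / 10**ea, b / 10**eb
    log_sum = table_log10(ma) + table_log10(mb)
    return 10 ** (log_sum + ea + eb)  # the final step is an antilog lookup

print(table_multiply(3.0, 4.0))  # ≈ 12, up to table precision
```

The log-table trick turns each multiply into two lookups and one addition, which is exactly why slide rules and printed log tables dominated hand computation.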

PUNCH CARD COUNT
3,000
math library (lookup tables + procedures)
RUNNING TOTAL: 3,000

02.Matrix Operations Library

Transformers are matrix multiplication engines. You need procedures for: matrix–matrix multiply, matrix–vector multiply, element-wise operations (add, multiply, apply activation), transpose, and row-wise softmax.

The core operation is matrix multiplication. For matrices A (m×k) and B (k×n), the result C (m×n) requires m×n×k multiply-accumulate operations. Each one involves a table lookup (multiply), then an addition with carry propagation.

TIME PER MATRIX MULTIPLY

A skilled operator using pre-computed lookup tables can perform one float32 multiply-accumulate in approximately 30 seconds, including card retrieval, lookup, punching the result, and filing.

Example: multiply two 768 × 768 matrices
Operations: 768 × 768 × 768 = 452,984,832
Time: 452,984,832 × 30 sec = 13,589,544,960 sec
= 430.7 years per matrix multiply
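The arithmetic above can be double-checked with a few lines; `matmul_time_years` is just a checking aid, not part of any deck:

```python
# Check the matrix-multiply arithmetic from the example above.
def matmul_time_years(m, k, n, sec_per_mac=30):
    macs = m * k * n                      # multiply-accumulate operations
    seconds = macs * sec_per_mac
    return macs, seconds / (365.25 * 24 * 3600)

macs, years = matmul_time_years(768, 768, 768)
print(f"{macs:,}")       # 452,984,832 operations
print(round(years, 1))   # ≈ 431 years, matching the figure above
```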
WARNING
A single transformer layer requires approximately 14 matrix multiplications. At this rate, one forward pass through one layer takes about 6,030 years. Plan accordingly.
PUNCH CARD COUNT
500
matrix operation procedure decks
RUNNING TOTAL: 3,500

03.Build the Tokenizer (BPE)

Before training, text must be converted into integer token IDs using Byte-Pair Encoding (BPE). GPT-2 uses a vocabulary of 50,257 tokens. You need to store the vocabulary itself (50,257 entries), the BPE merge table (roughly 50,000 merge rules in priority order), and the encoding procedure decks.

TIP
File the vocabulary cards alphabetically by token bytes and use tab dividers. You’ll be looking up tokens billions of times. An index system will save you centuries.
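The merge procedure itself is simple enough to sketch. The toy merge table below is illustrative, not GPT-2's real table:

```python
# Minimal BPE encoder sketch: repeatedly apply the highest-priority adjacent
# merge, which is exactly what the merge-table card decks encode.
MERGES = {("t", "h"): 0, ("th", "e"): 1, ("i", "s"): 2}  # pair -> priority

def bpe_encode(word):
    tokens = list(word)
    while True:
        # Find the mergeable adjacent pair with the best (lowest) priority.
        pairs = [(MERGES[p], i)
                 for i, p in enumerate(zip(tokens, tokens[1:])) if p in MERGES]
        if not pairs:
            return tokens
        _, i = min(pairs)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]

print(bpe_encode("the"))   # ['the']
print(bpe_encode("this"))  # ['th', 'is']
```

Each merge is one pass over the word's character cards; the priority order is what makes the encoding deterministic.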
PUNCH CARD COUNT
19,000
tokenizer (vocabulary + merge table + procedures)
RUNNING TOTAL: 22,500

04.Prepare the Training Corpus

GPT-2 was trained on WebText, approximately 40 GB of text. We’ll use a modest 10 GB subset. After tokenization, this yields approximately 2.5 billion tokens, stored as 2-byte (uint16) token IDs.

Raw text: 10 GB = 10,737,418,240 bytes
Raw text cards: 10,737,418,240 ÷ 80 = 134,217,728 cards
Tokenized: ~2.5 billion tokens × 2 bytes = 5,000,000,000 bytes
Tokenized cards: 5,000,000,000 ÷ 80 = 62,500,000 cards

You only need the tokenized version for training. Store the raw text separately as a reference archive. The tokenized corpus must be organized into sequences of 1,024 tokens each for training batches.
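The packing and the card arithmetic can be sketched as follows; the helper names here are ours, for illustration only:

```python
# Sketch: pack a tokenized corpus into fixed 1,024-token training sequences
# and count the uint16 storage cards needed (80 bytes per card).
def pack_sequences(token_ids, seq_len=1024):
    n = len(token_ids) // seq_len            # drop the ragged tail
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n)]

def cards_needed(num_tokens, bytes_per_token=2, bytes_per_card=80):
    return num_tokens * bytes_per_token // bytes_per_card

corpus = list(range(5000))                   # stand-in for 2.5B real tokens
seqs = pack_sequences(corpus)
print(len(seqs))                             # 4 full sequences
print(f"{cards_needed(2_500_000_000):,}")    # 62,500,000 cards, as above
```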

SHUFFLING THE DATASET

Training requires random access to sequences. Using an IBM 083 card sorter at 1,000 cards/minute, a single shuffle of the tokenized corpus takes:

62,500,000 cards ÷ 1,000 cards/min = 62,500 minutes
= 1,042 hours
= 43.4 days per shuffle
WARNING
You will need to re-shuffle the corpus each epoch. Budget 43 days per epoch just for data shuffling. With a standard training run of 3–10 epochs, that’s 130–434 days of pure card sorting.
PUNCH CARD COUNT
62,500,000
tokenized training corpus
RUNNING TOTAL: 62,522,500

05.Define the Transformer Architecture

Create a specification deck that defines every layer, dimension, and connection in the model. This is your blueprint — the punch card equivalent of a model config file.

// architecture.deck
EMBEDDING: vocab=50257, dim=768
POSITION: max_len=1024, dim=768
// repeat 12×
LAYER_NORM: dim=768
ATTENTION: heads=12, dim=768, head_dim=64
RESIDUAL: add
LAYER_NORM: dim=768
FFN: dim=768, hidden=3072, activation=GELU
RESIDUAL: add
// output
LAYER_NORM: dim=768
LINEAR: in=768, out=50257 (tied with embedding)
SOFTMAX: dim=50257
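The spec deck can be sanity-checked against the 124-million-parameter target. This sketch assumes weight tying (per the LINEAR line above), so the output projection adds no parameters:

```python
# Sanity-check the parameter count implied by architecture.deck.
vocab, d, layers, ffn = 50257, 768, 12, 3072
ctx = 1024

embed = vocab * d + ctx * d                  # token + position embeddings
per_layer = (
    4 * d * d + 4 * d                        # Q, K, V, output proj (+ biases)
    + 2 * (d * ffn) + ffn + d                # FFN weights (+ biases)
    + 4 * d                                  # two LayerNorms (gamma, beta each)
)
total = embed + layers * per_layer + 2 * d   # + final LayerNorm
print(f"{total:,}")  # ≈ 124 million parameters
```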

The GELU activation function is critical. Its formula requires computing:

GELU(x) = 0.5 × x × (1 + tanh(√(2/π) × (x + 0.044715 × x³)))

For each of the 3,072 hidden units, per layer, per token, per training step. This is why your tanh lookup table needs to be very thorough.
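Here is a sketch of GELU served from a coarse tanh lookup deck with linear interpolation between entries; the 0.01 table step is an assumed granularity, not a requirement:

```python
import math

# GELU via the tanh formula above, with tanh served from a lookup table plus
# linear interpolation, as the card decks would do it.
STEP = 0.01
TANH = {i: math.tanh(i * STEP) for i in range(-500, 501)}  # covers [-5, 5]

def tanh_interp(x):
    x = max(-5.0, min(5.0, x))           # saturate outside the table range
    i = min(math.floor(x / STEP), 499)
    frac = x / STEP - i
    return TANH[i] + frac * (TANH[i + 1] - TANH[i])

def gelu(x):
    inner = math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1 + tanh_interp(inner))

print(round(gelu(1.0), 3))  # ≈ 0.841
```

With a 0.01 step the interpolation error is tiny; a coarser deck saves cards at the cost of accuracy, which is the trade-off an operator would have to budget.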

PUNCH CARD COUNT
500
architecture specification deck
RUNNING TOTAL: 62,523,000

06.Initialize Model Weights

The model has 124 million parameters. Each is a 32-bit floating-point number (4 bytes). They must be initialized with small random values using Xavier/Glorot initialization:

W ~ Uniform(-√(6 / (fan_in + fan_out)), √(6 / (fan_in + fan_out)))
Parameters: 124,000,000
Bytes per parameter: 4 (float32)
Total bytes: 496,000,000
Cards: 496,000,000 ÷ 80 = 6,200,000
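The initialization formula, sketched in Python; the seeded `random.Random` stands in for a drawer of entropy-reserve cards:

```python
import math
import random

# Xavier/Glorot uniform initialization, per the formula above.
def xavier_uniform(fan_in, fan_out, rng):
    limit = math.sqrt(6 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

rng = random.Random(42)
W = xavier_uniform(768, 768, rng)
limit = math.sqrt(6 / (768 + 768))
print(limit)                                              # bound = 0.0625
print(all(-limit <= w <= limit for row in W for w in row))  # True
```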

RANDOM NUMBER GENERATION

You have no PRNG. Recommended approaches: transcribe atmospheric radio noise, roll dice in bulk, or copy digits from a published random number table (RAND's A Million Random Digits was, fittingly, distributed on punch cards).

TIP
Pre-punch a set of "random number cards" using atmospheric noise. Store them in a separate cabinet labeled "ENTROPY RESERVE." You will need these again for dropout (if you implement regularization).
PUNCH CARD COUNT
6,200,000
initial model weights
RUNNING TOTAL: 68,723,000

07.Implement the Forward Pass

The forward pass converts a sequence of 1,024 token IDs into a probability distribution over the next token. Here is every operation you must perform, in order:

STEP 7.1: TOKEN EMBEDDING

Look up each of the 1,024 input tokens in the embedding weight matrix (50,257 × 768). Each lookup retrieves a 768-dimensional vector. Result: a 1,024 × 768 matrix. This is a card lookup operation — find the right card among 50,257 filed entries, 1,024 times.

STEP 7.2: POSITIONAL ENCODING

Add the positional embedding matrix (1,024 × 768) element-wise to the token embeddings. This is 786,432 floating-point additions.

STEP 7.3: TRANSFORMER LAYERS (×12)

For each of the 12 layers, perform the following sub-steps:

// 7.3a: Layer Normalization
Compute mean and variance over dim=768 for each of 1,024 positions
Normalize, scale (γ), and shift (β)
Ops: ~3 × 1,024 × 768 = 2,359,296
// 7.3b: Multi-Head Self-Attention
Project Q, K, V: three (1024×768) × (768×768) matmuls
Split into 12 heads of dim 64
Attention scores: Q × Kᵀ for each head: (1024×64) × (64×1024)
Scale by 1/√64 = 1/8
Apply causal mask (upper triangle → -∞)
Softmax over each row (1,024 exp + div operations per row)
Weighted sum: attn × V for each head
Concatenate heads, project: (1024×768) × (768×768)
Ops: ~4 × 1,024 × 768 × 768 + 2 × 12 × 1,024 × 1,024 × 64
≈ 2.42 billion + 1.61 billion = ~4.03 billion
// 7.3c: Residual Connection
Add attention output to layer input: 786,432 additions
// 7.3d: Layer Normalization (again)
Ops: ~2,359,296
// 7.3e: Feed-Forward Network
Linear 768 → 3072: (1024×768) × (768×3072) matmul
GELU activation: 3,145,728 evaluations (lookup + interpolate)
Linear 3072 → 768: (1024×3072) × (3072×768) matmul
Ops: ~2 × 1,024 × 768 × 3,072 = ~4.83 billion
// 7.3f: Residual Connection
Add FFN output to sub-layer input: 786,432 additions
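To keep the sub-steps concrete, here is a tiny pure-Python sketch of 7.3b (one head, sequence length 3) showing scaling, the causal mask, and softmax in order:

```python
import math

# Tiny sketch of causal single-head self-attention (seq_len=3, head_dim=2),
# so each card operation in step 7.3b has a concrete analog.
def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]          # exp lookup cards
    s = sum(exps)
    return [e / s for e in exps]

def causal_attention(Q, K, V):
    scale = 1 / math.sqrt(len(Q[0]))               # 1/sqrt(head_dim)
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, K[j])) * scale if j <= i
                  else float("-inf")               # causal mask
                  for j in range(len(K))]
        w = softmax(scores)
        out.append([sum(w[j] * V[j][d] for j in range(len(V)))
                    for d in range(len(V[0]))])
    return out

Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(Q, K, V)
print(out[0])  # first position attends only to itself: [1.0, 0.0]
```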

TOTAL FORWARD PASS COMPUTE

Per layer: ~8.9 billion FLOPs
12 layers: ~106.8 billion FLOPs
Output projection + softmax: ~79 million FLOPs
Total per forward pass: ~107 billion FLOPs
At 30 sec/FLOP: 107 × 10^9 × 30 = 3.21 × 10^12 seconds
= ~101,700 years per forward pass

The forward pass also requires scratch space cards to store intermediate activations. These are reusable between passes.

Activations per layer: 1,024 × 768 × 4 bytes = 3.15 MB
Attention matrices: 12 heads × 1,024 × 1,024 × 4 = 50.3 MB
FFN intermediates: 1,024 × 3,072 × 4 = 12.6 MB
Total scratch per layer: ~66 MB
All 12 layers: ~792 MB (must store all for backprop)
= 792,000,000 ÷ 80 = 9,900,000 cards
WARNING
You must keep all intermediate activations from the forward pass in order to compute gradients during backpropagation. Do not discard or re-file scratch cards until backprop is complete. Label every scratch card with its layer index and position.
PUNCH CARD COUNT
9,900,000
scratch space cards (reusable per step)
RUNNING TOTAL: 78,623,000

08.Compute the Loss Function

GPT-2 uses cross-entropy loss: for each position in the sequence, compute the negative log probability of the correct next token.

L = -(1/T) ∑ log(p(correct_token))

After softmax gives you a probability distribution over 50,257 tokens, look up the probability assigned to the correct token and take its natural log. This requires your log lookup table from Step 01. Repeat for all 1,024 positions and average.
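The loss computation, sketched on a toy 4-token vocabulary; the probability values are made up for illustration:

```python
import math

# Cross-entropy loss per the formula above: mean negative log probability
# of each correct next token (toy vocabulary, T=3 positions).
def cross_entropy(probs, targets):
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)

probs = [
    [0.7, 0.1, 0.1, 0.1],       # position 0: model favors token 0
    [0.25, 0.25, 0.25, 0.25],   # position 1: model is maximally unsure
    [0.1, 0.8, 0.05, 0.05],     # position 2: model favors token 1
]
targets = [0, 2, 1]
print(round(cross_entropy(probs, targets), 4))  # ≈ 0.6554
```

Each `math.log` call is one trip to the log lookup deck from Step 01.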

PUNCH CARD COUNT
100
loss computation procedure deck
RUNNING TOTAL: 78,623,100

09.Implement Backpropagation

Backpropagation computes the gradient of the loss with respect to every parameter in the network by applying the chain rule in reverse order through every layer. The compute is approximately 2× the forward pass.

For each layer (in reverse, 12 → 1), you must:

  1. Compute gradients for the FFN weights (two large matmuls)
  2. Backprop through GELU (element-wise, using the GELU derivative lookup table)
  3. Compute gradients for the attention projection weights
  4. Backprop through softmax attention (per-head)
  5. Compute gradients for Q, K, V projection weights (three matmuls)
  6. Backprop through layer normalization (involves stored means/variances)
  7. Accumulate gradients through residual connections
Backward FLOPs ≈ 2 × forward FLOPs
= 2 × 107 billion = ~214 billion FLOPs
At 30 sec/FLOP: ~203,000 years per backward pass
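Each derivative deck should be verified before use. For example, the GELU-derivative cards (step 2 above) can be checked against a finite difference:

```python
import math

# Verify the GELU derivative (what the derivative cards tabulate) against a
# numerical finite difference.
def gelu(x):
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

def gelu_grad(x):
    # Analytic derivative of the tanh form used in Step 05.
    c = math.sqrt(2 / math.pi)
    u = c * (x + 0.044715 * x**3)
    t = math.tanh(u)
    du = c * (1 + 3 * 0.044715 * x**2)
    return 0.5 * (1 + t) + 0.5 * x * (1 - t * t) * du

h = 1e-6
fd = (gelu(1.0 + h) - gelu(1.0 - h)) / (2 * h)   # central finite difference
print(abs(fd - gelu_grad(1.0)) < 1e-6)  # True
```

A mistabulated derivative card would silently corrupt every gradient that passes through it, so this kind of spot check is cheap insurance.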

You must also store the gradients for all 124 million parameters. That’s another full set of weight cards:

PUNCH CARD COUNT
6,200,000
gradient storage cards
RUNNING TOTAL: 84,823,100

10.Implement the Adam Optimizer

Adam maintains two running averages per parameter: the first moment (mean of gradients, m) and the second moment (mean of squared gradients, v). This requires two additional complete copies of the parameter storage.

// Adam update for each parameter θ:
m = β₁ × m + (1 - β₁) × gradient
v = β₂ × v + (1 - β₂) × gradient²
m̂ = m / (1 - β₁ᵗ)
v̂ = v / (1 - β₂ᵗ)
θ = θ - lr × m̂ / (√v̂ + ε)

Each parameter update involves ~15 floating-point operations. For 124 million parameters, that’s 1.86 billion operations per training step just for the optimizer.

m storage: 124M × 4 bytes = 496 MB = 6,200,000 cards
v storage: 124M × 4 bytes = 496 MB = 6,200,000 cards
Optimizer compute: 1.86B FLOPs × 30 sec = 1,770 years/step
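A single Adam update, sketched in Python with the standard defaults β₁=0.9, β₂=0.999, ε=1e-8 (the hyperparameter choices are assumptions, not punched in stone):

```python
import math

# One Adam update for a single parameter, following the formulas above.
def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)              # bias correction
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 0.5, 0.0, 0.0
theta, m, v = adam_step(theta, grad=0.2, m=m, v=v, t=1)
print(round(theta, 6))  # first step moves by ~lr: 0.499
```

Note that m and v persist across steps, which is exactly why the m and v card cabinets above must never be re-filed out of order.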
PUNCH CARD COUNT
12,400,000
Adam optimizer state (m + v)
RUNNING TOTAL: 97,223,100

11.Run the Training Loop

You now have all the components. The training loop is:

// training loop
for epoch in 1..3:
    shuffle(training_corpus)            // 43 days
    for batch in batches:
        activations = forward(batch)    // ~101,700 years
        loss = cross_entropy(...)       // 2 days
        gradients = backward(...)       // ~203,400 years
        adam_update(weights, grads)     // 1,770 years
        if step % 5000 == 0:
            checkpoint(weights)         // 6.2M cards

TOTAL TRAINING STEPS

Training tokens: 2.5 billion
Sequence length: 1,024
Sequences: 2,500,000,000 ÷ 1,024 ≈ 2,441,406
Batch size: 1 (limited by operator throughput)
Steps per epoch: 2,441,406
Epochs: 3
Total steps: 7,324,218

TOTAL TRAINING TIME

FLOPs per step: ~321 billion (forward + backward)
Optimizer per step: ~1.86 billion
Total FLOPs per step: ~323 billion
Total FLOPs: 7,324,218 steps × 323 × 10^9 = 2.37 × 10^18
At 30 sec/FLOP (1 operator):
= 2.37 × 10^18 × 30 = 7.1 × 10^19 seconds
= 2.25 × 10^12 years
= 2.25 trillion years
TIMELINE CONTEXT
163×
the current age of the universe (13.8 billion years)

PARALLELIZATION

You can reduce training time by hiring more operators. Matrix operations are embarrassingly parallel — each element of the output can be computed independently.

1 operator: ......... 2.25 trillion years
10 operators: ........ 225 billion years
1,000 operators: ..... 2.25 billion years
1,000,000 operators: . 2.25 million years
7,000,000 operators: . ~321,000 years
TIP
At 7 million parallel operators, training completes in roughly the time anatomically modern humans have existed (~300,000 years). This is the minimum viable staffing level for a single training run.

Checkpoint cards (saving weights every 5,000 steps):

Checkpoints: 7,324,218 ÷ 5,000 ≈ 1,465 saves
Cards per checkpoint: 6,200,000
Total checkpoint cards: 1,465 × 6,200,000 = 9,083,000,000
WARNING
Checkpoint storage alone requires 9 billion cards, weighing approximately 16,000 metric tons. You will need a second warehouse. Consider checkpointing less frequently.
PUNCH CARD COUNT
9,083,000,000
training checkpoints (every 5,000 steps)
RUNNING TOTAL: 9,180,223,100

12.Run Inference

After training, you can generate text. Inference is autoregressive: to generate N tokens, you run the forward pass N times, each time appending the predicted token to the input sequence.

To generate a single token:

1 forward pass: ~107 billion FLOPs
At 30 sec/FLOP: ~101,700 years per token
To generate a 100-token response:
= 100 × 101,700 years
= ~10.2 million years per response

You’ll also need to implement temperature scaling and top-k sampling during token selection. Temperature divides the logits by a scalar before softmax. Top-k requires sorting 50,257 values to find the k largest — approximately 16 minutes on the card sorter per token generated.
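Both steps can be sketched together; the logits below are toy values, and the seeded generator stands in for the entropy cards:

```python
import math
import random

# Temperature scaling plus top-k sampling, as described above (toy logits;
# the real decks would sort 50,257 logit cards per generated token).
def sample_top_k(logits, k=2, temperature=0.8, rng=random.Random(0)):
    scaled = [l / temperature for l in logits]
    # Keep only the k largest logits; everything else gets probability 0.
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    m = max(scaled)
    exps = [math.exp(scaled[i] - m) if i in top else 0.0
            for i in range(len(scaled))]
    s = sum(exps)
    probs = [e / s for e in exps]
    # Draw one token from the truncated distribution.
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i, probs
    return len(probs) - 1, probs

idx, probs = sample_top_k([2.0, 1.0, 0.1, -1.0], k=2)
print(probs[2], probs[3])  # masked tokens get probability 0.0
```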

TIP
For faster inference, consider pre-computing a KV cache: store the key and value matrices from previous positions so you only compute attention for the new token. This reduces per-token compute from O(n²) to O(n), at the cost of additional cache storage cards.
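A rough op-count comparison makes the tip concrete (one head, head_dim=64; `attention_macs` is an illustrative estimate, not an exact count):

```python
# Rough sketch: cumulative attention multiply-accumulates while generating
# n tokens, with and without a KV cache. Without the cache, every step
# recomputes attention for all positions; with it, only for the new token.
def attention_macs(n_tokens, head_dim=64, cached=False):
    total = 0
    for t in range(1, n_tokens + 1):
        rows = 1 if cached else t
        total += rows * t * head_dim * 2   # Q·K scores + weighted sum over V
    return total

n = 100
print(f"{attention_macs(n, cached=False):,}")  # grows ~O(n^3) over the run
print(f"{attention_macs(n, cached=True):,}")   # grows ~O(n^2)
```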

13.Total Requirements Summary

FINAL PROJECT TALLY
Punch Cards (training): ~9.18 billion
Stack Height: ~1,634 km (about 4× the ISS's orbital altitude)
Total Weight: ~16,194 metric tons
Total FLOPs: ~2.37 × 10^18
Time (1 operator): 2.25 trillion years
Time (1M operators): 2.25 million years
Universe Ages Required: ~163
Warehouses Required: ≥2
Filing Cabinets: ~200,000
Coffee (estimated)

COST ESTIMATE

Cards at $0.01/card: 9.18B × $0.01 = $91,800,000
Keypunch machines: 3 × $8,500 = $25,500
Card sorters: 2 × $12,000 = $24,000
Warehouse (2,500 sqft/yr): ~$30,000/yr
Operators (1M × $50k/yr × 2.25M yrs): $112,500,000,000,000,000
= ~$112.5 quadrillion
(approximately 1,000× current global GDP)

FOR COMPARISON

Training GPT-2 Small on a single A100 GPU: ~24 hours
Training GPT-2 Small on punch cards: ~2.25 trillion years
The GPU is approximately 8.2 × 10^14 times faster.

Or you could just use a GPU. Your call.

◆ READY TO START? ◆

Pick up your first 9.18 billion cards and get punching.

SHOP PUNCH CARDS →