A step-by-step guide to building a GPT-2-class LLM using nothing but 80-column punch cards.
We assume no prior punch card experience, but a strong foundation in linear algebra and exceptional attention to detail are recommended.
TARGET MODEL SPECIFICATIONS
Architecture ....... Transformer (decoder-only)
Parameters ......... 124 million
Layers ............. 12
Embedding Dim ...... 768
Attention Heads .... 12
Vocab Size ......... 50,257
Context Length ..... 1,024 tokens
Precision .......... float32 (4 bytes)
Reference .......... GPT-2 Small
00.Prerequisites & Bill of Materials
Before writing a single line of punched code, you need to secure the physical infrastructure. Building an LLM is primarily a logistics problem. The math is straightforward. The scale is not.
REQUIRED MATERIALS
▸~150,000,000 blank IBM 80-column punch cards
▸3× industrial keypunch machines (IBM 029 or equivalent)
▸128-page procedure manual (you’ll write this yourself)
▸~9.5 trillion years of uninterrupted labor
▸Coffee (amount: unbounded)
WARNING
The total weight of punch cards required for this project is approximately 265 metric tons. Verify that your warehouse floor can support this load. Standard commercial flooring is rated for ~250 kg/m². You will need reinforced flooring.
01.Floating-Point Arithmetic Library
Neural networks run on floating-point math. Every parameter is a 32-bit float. You need four fundamental operations (add, subtract, multiply, divide) plus transcendental functions (exp, log, tanh, sqrt) for activation functions and softmax.
On punch cards, you’ll implement these as lookup tables and procedure decks. A procedure deck is a sequence of cards that, when fed through the tabulating machine, performs a specific operation on input cards and produces output cards.
REQUIRED LOOKUP TABLES
▸Multiplication table — Pre-computed products for common mantissa pairs. 10,000 entries × 8 bytes = 80,000 bytes = 1,000 cards.
▸Exponential table (eˣ) — Required for softmax and GELU. 5,000 entries × 8 bytes = 500 cards.
Additionally, you need procedure decks for interpolation (values between table entries), carry propagation, and IEEE 754 special cases (NaN, infinity, denormalized numbers). Approximately 300 procedure cards.
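If you had a computer to check your procedure decks against (which rather defeats the purpose), the lookup-with-interpolation procedure works as sketched below in Python. The grid range and step size are hypothetical choices for a roughly 5,000-entry table:

```python
import math

# Hypothetical table-driven exp(x), mirroring the punch-card procedure:
# pre-compute entries on a grid, then linearly interpolate between cards.
TABLE_MIN, TABLE_MAX, TABLE_STEP = -10.0, 10.0, 0.004  # ~5,000 entries
EXP_TABLE = [math.exp(TABLE_MIN + i * TABLE_STEP) for i in range(5001)]

def exp_lookup(x: float) -> float:
    """exp(x) via table lookup plus linear interpolation between entries."""
    x = min(max(x, TABLE_MIN), TABLE_MAX)       # clamp to table range
    pos = (x - TABLE_MIN) / TABLE_STEP
    i = min(int(pos), len(EXP_TABLE) - 2)       # index of the lower card
    frac = pos - i                              # distance to the next card
    return EXP_TABLE[i] * (1 - frac) + EXP_TABLE[i + 1] * frac
```

The interpolation step is what keeps the table at 5,000 entries instead of millions; each call is two card retrievals and a handful of arithmetic operations.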
PUNCH CARD COUNT
3,000
math library (lookup tables + procedures)
RUNNING TOTAL: 3,000
02.Matrix Operations Library
Transformers are matrix multiplication engines. You need procedures for: matrix–matrix multiply, matrix–vector multiply, element-wise operations (add, multiply, apply activation), transpose, and row-wise softmax.
The core operation is matrix multiplication. For matrices A (m×k) and B (k×n), the result C (m×n) requires m×n×k multiply-accumulate operations. Each one involves a table lookup (multiply), then an addition with carry propagation.
TIME PER MATRIX MULTIPLY
A skilled operator using pre-computed lookup tables can perform one float32 multiply-accumulate in approximately 30 seconds, including card retrieval, lookup, punching the result, and filing.
Example: multiply two 768 × 768 matrices
Operations: 768 × 768 × 768 = 452,984,832
Time: 452,984,832 × 30 sec = 13,589,544,960 sec
= 430.7 years per matrix multiply
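The multiply-accumulate loop each operator executes by hand is, in Python:

```python
def matmul(A, B):
    """C = A @ B via the same multiply-accumulate loop an operator runs.

    A is m x k, B is k x n, each stored as lists of rows. Every C[i][j]
    costs k multiply-accumulates (k table lookups on punch cards).
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):          # one MAC per card lookup
                acc += A[i][p] * B[p][j]
            C[i][j] = acc
    return C
```

For two 768 × 768 matrices the inner line executes 768³ times, which at 30 seconds per execution is the 430-year figure above.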
WARNING
A single transformer layer requires approximately 14 matrix multiplications. At this rate, one forward pass through one layer takes about 6,030 years. Plan accordingly.
PUNCH CARD COUNT
500
matrix operation procedure decks
RUNNING TOTAL: 3,500
03.Build the Tokenizer (BPE)
Before training, text must be converted into integer token IDs using Byte-Pair Encoding (BPE). GPT-2 uses a vocabulary of 50,257 tokens. You need to store:
▸Vocabulary mapping — 50,257 token entries, each mapping a token ID to its byte sequence. Average ~10 bytes per entry. 502,570 bytes = 6,282 cards.
▸Merge table — ~50,000 BPE merge rules specifying which byte pairs merge in what priority. ~20 bytes per rule. 1,000,000 bytes = 12,500 cards.
▸Encoding procedure deck — The algorithm for applying merges to raw text. ~200 cards.
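A sketch of what the encoding procedure deck does, in Python, with a hypothetical two-rule merge table standing in for GPT-2's ~50,000 rules:

```python
def bpe_encode(text: str, merges: dict) -> list:
    """Apply BPE merges to one word, mirroring the encoding procedure deck.

    `merges` maps a symbol pair to its priority (lower = merge first).
    """
    symbols = list(text)
    while len(symbols) > 1:
        # find the adjacent pair with the best (lowest) merge priority
        pairs = [(merges.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(pairs)
        if rank == float("inf"):
            break                       # no mergeable pair remains
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

toy_merges = {("l", "o"): 0, ("lo", "w"): 1}   # hypothetical miniature table
```

On cards, each iteration of the loop is one pass through the merge-table cabinet, which is why the indexing advice below matters.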
TIP
File the vocabulary cards alphabetically by token bytes and use tab dividers. You’ll be looking up tokens billions of times. An index system will save you centuries.
PUNCH CARD COUNT
19,000
tokenizer (vocabulary + merge table + procedures)
RUNNING TOTAL: 22,500
04.Prepare the Training Corpus
GPT-2 was trained on WebText, approximately 40 GB of text. We’ll use a modest 10 GB subset. After tokenization, this yields approximately 2.5 billion tokens, stored as 2-byte (uint16) token IDs.
Raw text: 10 GB = 10,737,418,240 bytes
Raw text cards: 10,737,418,240 ÷ 80 = 134,217,728 cards
You only need the tokenized version for training. Store the raw text separately as a reference archive. The tokenized corpus must be organized into sequences of 1,024 tokens each for training batches.
SHUFFLING THE DATASET
Training requires random access to sequences. Using an IBM 083 card sorter at 1,000 cards/minute, a single shuffle of the 62,500,000-card tokenized corpus takes 62,500 minutes, roughly 43 days.
You will need to re-shuffle the corpus each epoch. Budget 43 days per epoch just for data shuffling. With a standard training run of 3–10 epochs, that’s 130–434 days of pure card sorting.
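The arithmetic behind these shuffle figures, as a Python sketch:

```python
# Shuffle-time arithmetic for the tokenized corpus.
CARDS = 62_500_000      # tokenized training corpus, from Step 04
SORT_RATE = 1_000       # IBM 083 throughput, cards per minute

minutes_per_shuffle = CARDS / SORT_RATE           # 62,500 minutes
days_per_shuffle = minutes_per_shuffle / (60 * 24)

print(round(days_per_shuffle, 1))   # ~43.4 days per epoch
```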
PUNCH CARD COUNT
62,500,000
tokenized training corpus
RUNNING TOTAL: 62,522,500
05.Define the Transformer Architecture
Create a specification deck that defines every layer, dimension, and connection in the model. This is your blueprint — the punch card equivalent of a model config file.
// architecture.deck
EMBEDDING: vocab=50257, dim=768
POSITION: max_len=1024, dim=768
// repeat 12×
LAYER_NORM: dim=768
ATTENTION: heads=12, dim=768, head_dim=64
RESIDUAL: add
LAYER_NORM: dim=768
FFN: dim=768, hidden=3072, activation=GELU
RESIDUAL: add
// output
LAYER_NORM: dim=768
LINEAR: in=768, out=50257 (tied with embedding)
SOFTMAX: dim=50257
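A Python sketch verifying that this deck really specifies ~124 million parameters (counting biases, and assuming, as the deck notes, that the output LINEAR is tied to the embedding and so adds nothing):

```python
# Parameter count implied by architecture.deck.
V, D, L, H = 50_257, 768, 12, 3_072

embed = V * D                  # token embedding (tied with output layer)
pos = 1_024 * D                # positional embedding
attn = 4 * (D * D + D)         # Q, K, V, output projections + biases
ffn = D * H + H + H * D + D    # two FFN matrices + biases
ln = 2 * 2 * D                 # two layer norms (gamma, beta) per layer
per_layer = attn + ffn + ln
final_ln = 2 * D

total = embed + pos + L * per_layer + final_ln
print(total)   # 124,439,808, i.e. ~124 million
```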
The GELU activation function is critical. Its tanh approximation requires computing:
GELU(x) = 0.5 × x × (1 + tanh(√(2/π) × (x + 0.044715 × x³)))
for each of the 3,072 hidden units, per layer, per token, per training step. This is why your tanh lookup table needs to be very thorough.
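For checking your lookup tables against ground truth, the standard tanh-approximation GELU in Python:

```python
import math

def gelu(x: float) -> float:
    """GELU via the tanh approximation: the formula the lookup tables serve."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))
```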
PUNCH CARD COUNT
500
architecture specification deck
RUNNING TOTAL: 62,523,000
06.Initialize Model Weights
The model has 124 million parameters. Each is a 32-bit floating-point number (4 bytes). They must be initialized with small random values, e.g. Xavier/Glorot initialization, which draws each weight with variance 2/(n_in + n_out). Punch cards cannot generate randomness, so you need a physical entropy source:
▸Physical entropy — Flip a coin for each bit. For 496 million bytes × 8 bits = 3.97 billion coin flips. At 3 flips per second, this takes approximately 42 years.
▸Atmospheric noise — Record radio static, convert amplitude samples to bits. Faster, but requires additional equipment.
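A sketch of turning raw coin-flip bits into one initialized weight via a Box-Muller transform. The bit-packing scheme and the standard deviation are illustrative choices, not prescribed by the original recipe:

```python
import math, random

def bits_to_uniform(bits):
    """Pack 32 coin-flip bits into a uniform float in (0, 1)."""
    n = 0
    for b in bits:
        n = (n << 1) | b
    return (n + 0.5) / 2**32

def gaussian_weight(bits64, std=0.02):
    """Two 32-bit draws -> one normal(0, std^2) weight via Box-Muller."""
    u1 = bits_to_uniform(bits64[:32])
    u2 = bits_to_uniform(bits64[32:])
    return std * math.sqrt(-2.0 * math.log(u1)) * math.cos(2 * math.pi * u2)

# e.g. 64 flips drawn from the ENTROPY RESERVE cabinet
flips = [random.randint(0, 1) for _ in range(64)]
w = gaussian_weight(flips)
```

Each weight consumes 64 coin flips, which is why the coin-flip budget runs to billions.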
TIP
Pre-punch a set of "random number cards" using atmospheric noise. Store them in a separate cabinet labeled "ENTROPY RESERVE." You will need these again for dropout (if you implement regularization).
PUNCH CARD COUNT
6,200,000
initial model weights
RUNNING TOTAL: 68,723,000
07.Implement the Forward Pass
The forward pass converts a sequence of 1,024 token IDs into a probability distribution over the next token. Here is every operation you must perform, in order:
STEP 7.1: TOKEN EMBEDDING
Look up each of the 1,024 input tokens in the embedding weight matrix (50,257 × 768). Each lookup retrieves a 768-dimensional vector. Result: a 1,024 × 768 matrix. This is a card lookup operation — find the right card among 50,257 filed entries, 1,024 times.
STEP 7.2: POSITIONAL ENCODING
Add the positional embedding matrix (1,024 × 768) element-wise to the token embeddings. This is 786,432 floating-point additions.
STEP 7.3: TRANSFORMER LAYERS (×12)
For each of the 12 layers, perform the following sub-steps:
// 7.3a: Layer Normalization
Compute mean and variance over dim=768 for each of 1,024 positions
Normalize, scale (γ), and shift (β)
Ops: ~3 × 1,024 × 768 = 2,359,296
// 7.3b: Multi-Head Self-Attention
Project Q, K, V: three (1024×768) × (768×768) matmuls
Split into 12 heads of dim 64
Attention scores: Q × Kᵀ for each head: (1024×64) × (64×1024)
Scale by 1/√64 = 1/8
Apply causal mask (upper triangle → -∞)
Softmax over each row (1,024 exp + div operations per row)
Weighted sum: attention × V for each head, then concatenate heads
Output projection: one (1024×768) × (768×768) matmul
// 7.3c: Feed-Forward Network
Two matmuls (768→3072 and 3072→768) with GELU in between
Intermediate activations to store for backprop:
All 12 layers: ~792 MB (must store all for backprop)
= 792,000,000 ÷ 80 = 9,900,000 cards
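What one attention head in step 7.3b computes, sketched in Python (single head, plain lists, no batching; the real model runs this 12 times per layer):

```python
import math

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V: lists of T rows, each a list of d floats.
    """
    T, d = len(Q), len(Q[0])
    out = []
    for i in range(T):
        # scores against positions 0..i only (the causal mask)
        scores = [sum(Q[i][x] * K[j][x] for x in range(d)) / math.sqrt(d)
                  for j in range(i + 1)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]      # stable softmax
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(weights[j] * V[j][x] for j in range(i + 1))
                    for x in range(d)])
    return out
```

Every `math.exp` here is a lookup in the Step 01 exponential table, and every multiply inside the sums is a 30-second manual operation.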
WARNING
You must keep all intermediate activations from the forward pass in order to compute gradients during backpropagation. Do not discard or re-file scratch cards until backprop is complete. Label every scratch card with its layer index and position.
PUNCH CARD COUNT
9,900,000
scratch space cards (reusable per step)
RUNNING TOTAL: 78,623,000
08.Compute the Loss Function
GPT-2 uses cross-entropy loss: for each position in the sequence, compute the negative log probability of the correct next token.
L = -(1/T) ∑ log(p(correct_token))
After softmax gives you a probability distribution over 50,257 tokens, look up the probability assigned to the correct token and take its natural log. This requires your log lookup table from Step 01. Repeat for all 1,024 positions and average.
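The loss computation as a Python sketch; each `math.log` stands in for one lookup in the Step 01 log table:

```python
import math

def cross_entropy(probs_rows, targets):
    """Mean negative log-probability of the correct next token.

    probs_rows: per-position softmax outputs; targets: correct token ids.
    """
    total = 0.0
    for probs, t in zip(probs_rows, targets):
        total -= math.log(probs[t])
    return total / len(targets)
```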
PUNCH CARD COUNT
100
loss computation procedure deck
RUNNING TOTAL: 78,623,100
09.Implement Backpropagation
Backpropagation computes the gradient of the loss with respect to every parameter in the network by applying the chain rule in reverse order through every layer. The compute is approximately 2× the forward pass.
For each layer (in reverse, 12 → 1), you must:
Compute gradients for the FFN weights (two large matmuls)
Backprop through GELU (element-wise, using the GELU derivative lookup table)
Compute gradients for the attention projection weights
Backprop through softmax attention (per-head)
Compute gradients for Q, K, V projection weights (three matmuls)
Backprop through layer normalization (involves stored means/variances)
Accumulate gradients through residual connections
Backward FLOPs ≈ 2 × forward FLOPs
= 2 × 107 billion = ~214 billion FLOPs
At 30 sec/FLOP: ~203,000 years per backward pass
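The gradient rule behind most of those per-layer matmuls, sketched in Python for a single bias-free linear layer and a single input vector:

```python
def linear_backward(x, W, dy):
    """Gradients for y = x @ W (no bias, single example).

    x: list of n_in inputs, W: n_in x n_out weight rows,
    dy: list of n_out upstream gradients (dL/dy).
    Returns (dW, dx): the two extra "matmuls" backprop charges per layer.
    """
    n_in, n_out = len(W), len(W[0])
    # dL/dW[i][j] = x[i] * dy[j]  (outer product)
    dW = [[x[i] * dy[j] for j in range(n_out)] for i in range(n_in)]
    # dL/dx[i] = sum over j of dy[j] * W[i][j]
    dx = [sum(dy[j] * W[i][j] for j in range(n_out)) for i in range(n_in)]
    return dW, dx
```

The two returned quantities are why the backward pass costs roughly 2× the forward pass: one matmul-shaped computation for the weight gradient, one for the input gradient that flows to the layer below.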
You must also store the gradients for all 124 million parameters. That’s another full set of weight cards:
PUNCH CARD COUNT
6,200,000
gradient storage cards
RUNNING TOTAL: 84,823,100
10.Implement the Adam Optimizer
Adam maintains two running averages per parameter: the first moment (mean of gradients, m) and the second moment (mean of squared gradients, v). This requires two additional complete copies of the parameter storage.
// Adam update for each parameter θ:
m = β₁ × m + (1 - β₁) × gradient
v = β₂ × v + (1 - β₂) × gradient²
m̂ = m / (1 - β₁ᵗ)
v̂ = v / (1 - β₂ᵗ)
θ = θ - lr × m̂ / (√v̂ + ε)
Each parameter update involves ~15 floating-point operations. For 124 million parameters, that’s 1.86 billion operations per training step just for the optimizer.
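The update above, as a runnable Python sketch for a single scalar parameter:

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for one parameter: the ~15 FLOPs counted above."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)        # bias correction, step t >= 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

Note that m and v must persist between steps, which on punch cards means re-punching two cards per parameter per step.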
PUNCH CARD COUNT
12,400,000
Adam optimizer state (m and v copies)
RUNNING TOTAL: 97,223,100
11.Run the Training Loop
Training for 3 epochs over 2.5 billion tokens, at 1,024 tokens per step, comes to 7,324,218 optimizer steps. At 30 seconds per operation, a single operator would finish in approximately 2.25 trillion years, about 163 times the current age of the universe (13.8 billion years).
PARALLELIZATION
You can reduce training time by hiring more operators. Matrix operations are embarrassingly parallel — each element of the output can be computed independently.
1 operator: ......... 2.25 trillion years
10 operators: ........ 225 billion years
1,000 operators: ..... 2.25 billion years
1,000,000 operators: . 2.25 million years
7,000,000 operators: . ~321,000 years
TIP
At 7 million parallel operators, training completes in roughly 321,000 years, approximately the time since anatomically modern humans first appeared. This is the minimum viable staffing level for a single training run.
Checkpoint cards (saving weights every 5,000 steps):
Checkpoints: 7,324,218 training steps ÷ 5,000 ≈ 1,465 saves
Cards per checkpoint: 6,200,000
Total checkpoint cards: 1,465 × 6,200,000 = 9,083,000,000
WARNING
Checkpoint storage alone requires 9 billion cards, weighing approximately 16,000 metric tons. You will need a second warehouse. Consider checkpointing less frequently.
PUNCH CARD COUNT
9,083,000,000
training checkpoints (every 5,000 steps)
RUNNING TOTAL: 9,180,223,100
12.Run Inference
After training, you can generate text. Inference is autoregressive: to generate N tokens, you run the forward pass N times, each time appending the predicted token to the input sequence.
To generate a single token:
1 forward pass: ~107 billion FLOPs
At 30 sec/FLOP: ~101,700 years per token
To generate a 100-token response:
= 100 × 101,700 years
= 10,170,000 years per response
You’ll also need to implement temperature scaling and top-k sampling during token selection. Temperature divides the logits by a scalar before softmax. Top-k requires sorting 50,257 values to find the k largest — approximately 16 minutes on the card sorter per token generated.
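Temperature scaling and top-k selection, as a Python sketch (the sort below is the ~16-minute card-sorter pass; the default temperature and k are illustrative, not prescribed):

```python
import math, random

def sample_next(logits, temperature=0.8, k=40):
    """Temperature + top-k sampling for one token-selection step."""
    # keep the k indices with the largest logits (the card-sorter pass)
    top = sorted(range(len(logits)), key=lambda i: logits[i])[-k:]
    scaled = [logits[i] / temperature for i in top]   # temperature scaling
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]          # stable softmax
    z = sum(exps)
    r = random.random() * z                           # sample from the k
    for tok, e in zip(top, exps):
        r -= e
        if r <= 0:
            return tok
    return top[-1]
```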
TIP
For faster inference, consider pre-computing a KV cache: store the key and value matrices from previous positions so you only compute attention for the new token. This reduces per-token compute from O(n²) to O(n), at the cost of additional cache storage cards.