<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://brayanbrayan.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://brayanbrayan.github.io/" rel="alternate" type="text/html" /><updated>2026-04-02T12:48:09+00:00</updated><id>https://brayanbrayan.github.io/feed.xml</id><title type="html">Brayan’s Blog</title><subtitle>AI/ML Research, Engineering</subtitle><author><name>Brayan</name></author><entry><title type="html">RLHF from Scratch: A Complete Alignment Study</title><link href="https://brayanbrayan.github.io/2026/04/02/rlhf-post-blog.html" rel="alternate" type="text/html" title="RLHF from Scratch: A Complete Alignment Study" /><published>2026-04-02T00:00:00+00:00</published><updated>2026-04-02T00:00:00+00:00</updated><id>https://brayanbrayan.github.io/2026/04/02/rlhf-post-blog</id><content type="html" xml:base="https://brayanbrayan.github.io/2026/04/02/rlhf-post-blog.html"><![CDATA[<p><em>Personal project · 2026 · PyTorch · tatsu-lab/alpaca · 1M parameter model</em></p>

<hr />

<h2 id="table-of-contents">Table of Contents</h2>

<ol>
  <li><a href="#1-overview">Overview</a></li>
  <li><a href="#2-the-four-algorithms">The four algorithms</a>
    <ul>
      <li><a href="#21-supervised-fine-tuning-sft--the-baseline">2.1 SFT — the baseline</a></li>
      <li><a href="#22-ppo--proximal-policy-optimisation">2.2 PPO</a></li>
      <li><a href="#23-grpo--group-relative-policy-optimisation">2.3 GRPO</a></li>
      <li><a href="#24-dpo--direct-preference-optimisation">2.4 DPO</a></li>
    </ul>
  </li>
  <li><a href="#3-phase-1--baseline-evaluation">Phase 1 — Baseline evaluation</a></li>
  <li><a href="#4-phase-5--hyperparameter-tuning">Phase 5 — Hyperparameter tuning</a></li>
  <li><a href="#5-dpo-training-dynamics-phase-1-vs-phase-5">DPO training dynamics</a></li>
  <li><a href="#6-grpo-group-collapse-phase-1-vs-phase-5">GRPO group collapse</a></li>
  <li><a href="#7-phase-5--results">Phase 5 results</a></li>
  <li><a href="#8-per-prompt-delta-analysis">Per-prompt delta analysis</a></li>
  <li><a href="#9-the-ranking-reversal">The ranking reversal</a></li>
  <li><a href="#10-conclusion">Conclusion</a></li>
</ol>

<hr />

<h2 id="1-overview">1. Overview</h2>

<p>This is the final report of a ten-part project implementing reinforcement learning from human feedback (RLHF) entirely from scratch in PyTorch. Every component, from the tokenizer and language model to the reward model and all three post-SFT alignment algorithms, was built from first principles, without relying on pretrained weights or alignment libraries.</p>

<p>The project runs two evaluation phases. <strong>Phase 1</strong> establishes baselines by running SFT, PPO, GRPO, and DPO checkpoints through a fixed evaluation suite of 16 prompts scored by a trained reward model. <strong>Phase 5</strong> reruns all four algorithms with targeted hyperparameter changes, motivated by the specific failure modes identified in Phase 1. The result is a complete before-and-after picture of what each algorithm does, what breaks it, and what fixes it.</p>

<h3 id="architecture">Architecture</h3>

<p>All four models share the same architecture throughout:</p>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Layers</td>
      <td>2</td>
    </tr>
    <tr>
      <td>Attention heads</td>
      <td>2</td>
    </tr>
    <tr>
      <td>Embedding dim</td>
      <td>128</td>
    </tr>
    <tr>
      <td>Parameters</td>
      <td>~1M</td>
    </tr>
    <tr>
      <td>Tokenizer</td>
      <td>BPE, 8,000 vocab</td>
    </tr>
    <tr>
      <td>Block size</td>
      <td>256 tokens</td>
    </tr>
    <tr>
      <td>Reward model</td>
      <td>4-layer, 256-dim bidirectional transformer</td>
    </tr>
  </tbody>
</table>

<blockquote>
  <p><strong>Evaluation protocol:</strong> All scores come from the same reward model scoring the same 16 prompts from <code class="language-plaintext highlighter-rouge">tatsu-lab/alpaca</code> via <code class="language-plaintext highlighter-rouge">sample_prompts()</code>. Phase 1 generation uses <code class="language-plaintext highlighter-rouge">temperature=0.7, top_k=50, max_new_tokens=64</code>. Phase 5 uses <code class="language-plaintext highlighter-rouge">temperature=0.3, top_k=20, max_new_tokens=96</code>. The same protocol for all methods makes scores directly comparable within each phase.</p>
</blockquote>
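<p>A minimal sketch of what this protocol implies, assuming stand-in <code class="language-plaintext highlighter-rouge">generate</code> and <code class="language-plaintext highlighter-rouge">score</code> callables for the policy's sampler and the reward model (these names are illustrative, not the project's actual evaluation script):</p>

```python
def average_reward(generate, score, prompts,
                   temperature=0.3, top_k=20, max_new_tokens=96):
    """Average reward-model score over a fixed prompt set.

    `generate(prompt, ...)` returns a response string; `score(prompt, response)`
    returns a scalar. Defaults mirror the Phase 5 sampling settings.
    """
    totals = [
        score(p, generate(p, temperature=temperature, top_k=top_k,
                          max_new_tokens=max_new_tokens))
        for p in prompts
    ]
    return sum(totals) / len(totals)
```

<p>Holding <code class="language-plaintext highlighter-rouge">generate</code>'s sampling parameters fixed across all four checkpoints is what makes the per-phase scores directly comparable.</p>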

<hr />

<h2 id="2-the-four-algorithms">2. The four algorithms</h2>

<h3 id="21-supervised-fine-tuning-sft--the-baseline">2.1 Supervised Fine-Tuning (SFT) — the baseline</h3>

<p>SFT is not an alignment algorithm; it is the starting point for all three. The SFT model is fine-tuned on <code class="language-plaintext highlighter-rouge">tatsu-lab/alpaca</code> using standard cross-entropy loss over the response tokens only, learning to imitate the distribution of human-written instruction responses.</p>

<p>The SFT checkpoint serves two roles: it is both the evaluation baseline that all post-SFT methods must beat, and the initialisation point from which PPO, GRPO, and DPO all start.</p>
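<p>A sketch of the response-only cross-entropy objective described above, assuming a <code class="language-plaintext highlighter-rouge">(B, T)</code> mask with 1s on response tokens (function and variable names are illustrative):</p>

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, response_mask):
    """Cross-entropy over response tokens only; prompt tokens are masked out.

    logits: (B, T, V); input_ids, response_mask: (B, T).
    """
    shift_logits = logits[:, :-1, :]          # position t predicts token t+1
    shift_labels = input_ids[:, 1:]
    shift_mask = response_mask[:, 1:].float()
    per_token = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).view(shift_labels.shape)
    # Average only over unmasked (response) positions
    return (per_token * shift_mask).sum() / shift_mask.sum().clamp_min(1.0)
```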

<hr />

<h3 id="22-ppo--proximal-policy-optimisation">2.2 PPO — Proximal Policy Optimisation</h3>

<p>PPO frames alignment as a reinforcement learning problem. The policy generates rollouts (responses), the reward model scores them, and the policy is updated to maximise reward subject to a KL constraint preventing too much drift from the reference model.</p>

<p><strong>The KL-constrained RL objective:</strong></p>

<p><img src="/images/rlhfblogimages/math_ppo_objective.png" alt="KL-constrained RL objective" />
<!-- INSERT: math_ppo_objective.png --></p>

<p><strong>The clipped surrogate loss:</strong></p>

<p><img src="/images/rlhfblogimages/math_ppo_loss.png" alt="The clipped surrogate loss" />
<!-- INSERT: math_ppo_loss.png --></p>

<p>Where <code class="language-plaintext highlighter-rouge">r_t(θ) = π_θ(aₜ|sₜ) / π_old(aₜ|sₜ)</code> is the probability ratio and <code class="language-plaintext highlighter-rouge">Â_t</code> is the advantage estimated by the value head using GAE.</p>
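<p>In code, the clipped surrogate is compact. This is a sketch consistent with the formulas above, not the project's exact trainer:</p>

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Negative clipped surrogate (to minimise), over per-token tensors of shape (T,)."""
    ratio = torch.exp(logp_new - logp_old)                        # r_t(θ)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # min() keeps the pessimistic bound, removing the incentive for large ratio moves
    return -torch.min(unclipped, clipped).mean()
```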

<p><strong>The shaped reward used at each token position:</strong></p>

<p><img src="/images/rlhfblogimages/math_ppo_shaped_reward.png" alt="The shaped reward used at each token position" />
<!-- INSERT: math_ppo_shaped_reward.png --></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">shaped_reward</span><span class="p">[</span><span class="n">t</span><span class="p">]</span> <span class="o">=</span> <span class="n">r_scalar</span> <span class="o">-</span> <span class="n">kl_coef</span> <span class="o">*</span> <span class="p">(</span><span class="n">log_pi_theta</span><span class="p">[</span><span class="n">t</span><span class="p">]</span> <span class="o">-</span> <span class="n">log_pi_ref</span><span class="p">[</span><span class="n">t</span><span class="p">])</span>

<span class="c1"># Phase 1: kl_coef = 0.01
# Phase 5: kl_coef = 0.1
</span></code></pre></div></div>
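<p>The advantages <code class="language-plaintext highlighter-rouge">Â_t</code> come from the value head via GAE. A sketch of that recursion under the standard definitions (illustrative; assumes <code class="language-plaintext highlighter-rouge">values</code> carries one extra bootstrap entry for the state after the last token):</p>

```python
import torch

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalised Advantage Estimation over one rollout.

    rewards: (T,) shaped per-token rewards; values: (T+1,) value-head estimates.
    """
    T = rewards.size(0)
    adv = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        running = delta + gamma * lam * running                  # discounted sum of deltas
        adv[t] = running
    return adv
```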

<hr />

<h3 id="23-grpo--group-relative-policy-optimisation">2.3 GRPO — Group Relative Policy Optimisation</h3>

<p>GRPO eliminates the value function entirely. Instead of estimating a baseline from a critic, it generates a group of <code class="language-plaintext highlighter-rouge">k</code> responses to the same prompt and uses the group statistics as the baseline. The advantage for each response is its normalised position within the group.</p>

<p><strong>Group-relative advantage:</strong></p>

<p><img src="/images/rlhfblogimages/math_grpo_advantage.png" alt="Group-relative advantage" />
<!-- INSERT: math_grpo_advantage.png --></p>

<p><strong>The GRPO loss:</strong></p>

<p><img src="/images/rlhfblogimages/math_grpo_loss.png" alt="The GRPO loss" />
<!-- INSERT: math_grpo_loss.png --></p>

<p>The critical vulnerability: when all <code class="language-plaintext highlighter-rouge">k</code> responses are near-identical, <code class="language-plaintext highlighter-rouge">std(r) → 0</code> and all advantages <code class="language-plaintext highlighter-rouge">→ 0</code>. No gradient flows. This is <strong>group collapse</strong>, and it is GRPO’s primary failure mode at small model scale.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rewards</span>    <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="n">rm</span><span class="p">(</span><span class="n">resp_i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k</span><span class="p">)])</span>
<span class="n">mean_r</span>     <span class="o">=</span> <span class="n">rewards</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">std_r</span>      <span class="o">=</span> <span class="n">rewards</span><span class="p">.</span><span class="n">std</span><span class="p">().</span><span class="n">clamp_min</span><span class="p">(</span><span class="mf">1e-6</span><span class="p">)</span>   <span class="c1"># prevents div-by-zero
</span><span class="n">advantages</span> <span class="o">=</span> <span class="p">(</span><span class="n">rewards</span> <span class="o">-</span> <span class="n">mean_r</span><span class="p">)</span> <span class="o">/</span> <span class="n">std_r</span>      <span class="c1"># (k,) — zero if all same
</span>
<span class="c1"># Phase 1: k=4, gen_temp=0.8
# Phase 5: k=8, gen_temp=1.0
</span></code></pre></div></div>
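<p>Running the normalisation above on a collapsed group versus a healthy one makes the failure mode concrete (toy reward values, not from the training runs):</p>

```python
import torch

def group_advantages(rewards, eps=1e-6):
    # Same normalisation as above: zero-mean, unit-std within the group
    return (rewards - rewards.mean()) / rewards.std().clamp_min(eps)

collapsed = torch.tensor([2.5, 2.5, 2.5, 2.5])   # all k samples identical
healthy   = torch.tensor([1.0, 3.0, -2.0, 4.0])  # diverse group

print(group_advantages(collapsed))   # all zeros: no gradient flows
print(group_advantages(healthy))     # spread out: useful gradient signal
```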

<hr />

<h3 id="24-dpo--direct-preference-optimisation">2.4 DPO — Direct Preference Optimisation</h3>

<p>DPO eliminates both the explicit reward model and the RL loop from training. Instead, it reparameterises the optimal policy in terms of log-ratios between the policy and reference, then derives a loss directly over preference pairs (chosen, rejected).</p>

<p>The key insight from Rafailov et al. (NeurIPS 2023) is that the optimal policy under the KL-constrained reward objective satisfies:</p>

<p><strong>Optimal policy form:</strong></p>

<p><img src="/images/rlhfblogimages/math_dpo_optimal_policy.png" alt="Optimal policy form" />
<!-- INSERT: math_dpo_optimal_policy.png --></p>

<p>Rearranging to express reward in terms of the policy:</p>

<p><strong>Reward reparameterisation:</strong></p>

<p><img src="/images/rlhfblogimages/math_dpo_reward_reparam.png" alt="Reward reparameterisation" />
<!-- INSERT: math_dpo_reward_reparam.png --></p>

<p>Substituting into the Bradley-Terry preference model causes <code class="language-plaintext highlighter-rouge">Z(x)</code> to cancel, yielding the DPO loss:</p>

<p><strong>The DPO loss:</strong></p>

<p><img src="/images/rlhfblogimages/math_dpo_loss.png" alt="The DPO loss" />
<!-- INSERT: math_dpo_loss.png --></p>

<blockquote>
  <p>The reward model does not appear anywhere in the DPO training loop. It is used only for post-hoc evaluation in <code class="language-plaintext highlighter-rouge">dpo_logger.py</code> and <code class="language-plaintext highlighter-rouge">eval_dpo.py</code>. This is the key architectural distinction from PPO and GRPO.</p>
</blockquote>

<p><strong>The <code class="language-plaintext highlighter-rouge">get_logps</code> function — the shift logic that must be correct:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_logps</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">,</span> <span class="n">response_mask</span><span class="p">):</span>
    <span class="n">logits</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">lm</span><span class="p">(</span><span class="n">input_ids</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>     <span class="c1"># (B, T, V)
</span>    <span class="n">shift_logits</span> <span class="o">=</span> <span class="n">logits</span><span class="p">[:,</span> <span class="p">:</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">:]</span>              <span class="c1"># predict positions 1..T
</span>    <span class="n">shift_labels</span> <span class="o">=</span> <span class="n">input_ids</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:]</span>               <span class="c1"># actual tokens 1..T
</span>    <span class="n">shift_mask</span>   <span class="o">=</span> <span class="n">response_mask</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:]</span>           <span class="c1"># only response positions
</span>    <span class="n">log_probs</span>    <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">log_softmax</span><span class="p">(</span><span class="n">shift_logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">token_logps</span>  <span class="o">=</span> <span class="n">log_probs</span><span class="p">.</span><span class="n">gather</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">shift_labels</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)).</span><span class="n">squeeze</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">token_logps</span> <span class="o">*</span> <span class="n">shift_mask</span><span class="p">.</span><span class="nb">float</span><span class="p">()).</span><span class="nb">sum</span><span class="p">(</span><span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>  <span class="c1"># (B,)
</span>
<span class="c1"># Phase 1: beta=0.1
# Phase 5: beta=0.3
</span></code></pre></div></div>
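<p>With <code class="language-plaintext highlighter-rouge">get_logps</code> producing per-sequence log-probs, the DPO loss itself is a few lines. A sketch consistent with the loss above (variable names are illustrative):</p>

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.3):
    """-log σ(β·[(logπ_w − logπ_ref_w) − (logπ_l − logπ_ref_l)]), batch-averaged.

    Each argument is a (B,) tensor of sequence log-probs from get_logps.
    """
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()
```

<p>At initialisation the policy equals the reference, the logits are zero, and the loss starts at log 2, dropping as the chosen log-ratio pulls ahead of the rejected one.</p>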

<hr />

<h2 id="3-phase-1--baseline-evaluation">3. Phase 1 — Baseline evaluation</h2>

<p>All four checkpoints were evaluated on the same 16 prompts with the same reward model at <code class="language-plaintext highlighter-rouge">temperature=0.7, top_k=50</code>.</p>

<h3 id="per-prompt-results">Per-prompt results</h3>

<p><img src="/images/rlhfblogimages/fig1_phase1_per_prompt.png" alt="Per-prompt results" />
<!-- INSERT: fig1_phase1_per_prompt.png --></p>

<p><em>Fig 1 — Phase 1 per-prompt reward scores across all four methods.</em></p>

<h3 id="average-reward-by-algorithm">Average reward by algorithm</h3>

<p><img src="/images/rlhfblogimages/fig2_phase1_averages.png" alt="Average reward by algorithm" />
<!-- INSERT: fig2_phase1_averages.png --></p>

<p><em>Fig 2 — Phase 1 average reward. PPO wins at +3.99. GRPO is the only method below the SFT baseline.</em></p>

<h3 id="summary-table">Summary table</h3>

<table>
  <thead>
    <tr>
      <th>Prompt</th>
      <th>SFT</th>
      <th>PPO</th>
      <th>GRPO</th>
      <th>DPO</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Stay healthy tips</td>
      <td>+6.77</td>
      <td>+5.39</td>
      <td>-1.23</td>
      <td>+5.44</td>
    </tr>
    <tr>
      <td>Three primary colors</td>
      <td>+4.05</td>
      <td>+3.34</td>
      <td><strong>-5.97</strong></td>
      <td>+6.93</td>
    </tr>
    <tr>
      <td>Structure of an atom</td>
      <td>+0.42</td>
      <td>+5.95</td>
      <td><strong>-7.25</strong></td>
      <td>-6.63</td>
    </tr>
    <tr>
      <td>Reduce air pollution</td>
      <td>+3.83</td>
      <td>+5.58</td>
      <td>-3.27</td>
      <td><strong>-7.75</strong></td>
    </tr>
    <tr>
      <td>Difficult decision</td>
      <td>-3.92</td>
      <td>+1.99</td>
      <td>-2.18</td>
      <td>+0.25</td>
    </tr>
    <tr>
      <td>Odd one out (Twitter…)</td>
      <td>+6.43</td>
      <td><strong>+8.07</strong></td>
      <td>+4.12</td>
      <td>+3.94</td>
    </tr>
    <tr>
      <td>4/16 = 1/4 explain</td>
      <td>+7.04</td>
      <td>-4.93</td>
      <td>+4.16</td>
      <td>+6.14</td>
    </tr>
    <tr>
      <td>Short story career</td>
      <td>+0.87</td>
      <td><strong>+8.05</strong></td>
      <td>+6.37</td>
      <td>+1.67</td>
    </tr>
    <tr>
      <td>Render 3D house</td>
      <td>+6.27</td>
      <td>+6.27</td>
      <td>-0.77</td>
      <td>+2.97</td>
    </tr>
    <tr>
      <td>Spelling &amp; grammar</td>
      <td>+1.85</td>
      <td>+7.47</td>
      <td>+5.92</td>
      <td>+6.99</td>
    </tr>
    <tr>
      <td>Julius Caesar death</td>
      <td>+6.49</td>
      <td>+5.78</td>
      <td>-2.63</td>
      <td><strong>+7.13</strong></td>
    </tr>
    <tr>
      <td>Capital of France</td>
      <td>-7.10</td>
      <td>-7.10</td>
      <td>+0.01</td>
      <td>+4.91</td>
    </tr>
    <tr>
      <td>Camping trip list</td>
      <td><strong>+7.44</strong></td>
      <td>-1.21</td>
      <td>-5.14</td>
      <td>-4.28</td>
    </tr>
    <tr>
      <td>Great Depression causes</td>
      <td>+0.59</td>
      <td>+6.96</td>
      <td>+1.55</td>
      <td>-4.91</td>
    </tr>
    <tr>
      <td>Classify oak/copper/eleph</td>
      <td>+7.01</td>
      <td>+7.01</td>
      <td>+4.99</td>
      <td><strong>+7.84</strong></td>
    </tr>
    <tr>
      <td>Word embeddings NLP</td>
      <td>+0.11</td>
      <td>+5.27</td>
      <td>-0.53</td>
      <td><strong>+7.82</strong></td>
    </tr>
    <tr>
      <td><strong>AVERAGE</strong></td>
      <td><strong>+3.009</strong></td>
      <td><strong>+3.992</strong></td>
      <td><strong>-0.116</strong></td>
      <td><strong>+2.403</strong></td>
    </tr>
  </tbody>
</table>

<p><em>Bold = highest score in row.</em></p>

<h3 id="phase-1-findings">Phase 1 findings</h3>

<p><strong>PPO</strong> wins Phase 1 at +3.99 but fails on prompts where SFT was already strong. The camping list drops from +7.44 to -1.21, and the capital of France scores identically to SFT at -7.10: the policy learned nothing on that prompt.</p>

<p><strong>GRPO</strong> is the only method to regress below SFT (-0.12 average). “What are the three primary colors?” yields -5.97 because all four generated samples collapsed to “Theal” with group std ≈ 0. No gradient flowed on this prompt type.</p>

<p><strong>DPO</strong> has the highest variance of any method: +7.82 on word embeddings and -7.75 on air pollution in the same evaluation run. Reward margin explosion (the margin reached 599 by step 150) caused catastrophic forgetting on specific prompt types.</p>

<blockquote>
  <p><strong>On reward model reliability:</strong> “The capital of France is Paris” scores -7.10 under both SFT and PPO, while incoherent DPO output scores +4.91. The reward model penalises short, definitive answers regardless of correctness. All Phase 1 rankings must be read with this caveat in mind.</p>
</blockquote>

<hr />

<h2 id="4-phase-5--hyperparameter-tuning">4. Phase 5 — Hyperparameter tuning</h2>

<p>Each algorithm’s Phase 1 failure mode was diagnosed and a targeted multi-parameter tweak was applied. One retrain per algorithm, all changes applied simultaneously, evaluated with improved sampling (<code class="language-plaintext highlighter-rouge">temperature=0.3, top_k=20</code>).</p>

<h3 id="41-sft--sampling-only-no-retraining">4.1 SFT — sampling only, no retraining</h3>

<table>
  <thead>
    <tr>
      <th>Parameter</th>
      <th>Phase 1 → Phase 5</th>
      <th>Rationale</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>temperature</td>
      <td>0.7 → <strong>0.3</strong></td>
      <td>Forces model to commit to highest-probability tokens</td>
    </tr>
    <tr>
      <td>top_k</td>
      <td>50 → <strong>20</strong></td>
      <td>Tighter top-k sampling, higher output consistency</td>
    </tr>
    <tr>
      <td>max_new_tokens</td>
      <td>64 → <strong>96</strong></td>
      <td>More complete responses for the RM to score</td>
    </tr>
  </tbody>
</table>

<h3 id="42-ppo--stronger-kl-constraint">4.2 PPO — stronger KL constraint</h3>

<p><em>Phase 1 diagnosis: kl_coef=0.01 was too weak to prevent forgetting of SFT-strong prompts.</em></p>

<table>
  <thead>
    <tr>
      <th>Parameter</th>
      <th>Phase 1 → Phase 5</th>
      <th>Rationale</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>kl_coef</td>
      <td>0.01 → <strong>0.1</strong></td>
      <td>10× stronger KL anchor to reference</td>
    </tr>
    <tr>
      <td>learning rate</td>
      <td>1e-5 → <strong>5e-6</strong></td>
      <td>Slower updates, less aggressive policy drift</td>
    </tr>
    <tr>
      <td>resp_len</td>
      <td>64 → <strong>96</strong></td>
      <td>Longer rollouts give RM more signal</td>
    </tr>
    <tr>
      <td>eval temp / top_k</td>
      <td>0.7/50 → <strong>0.3/20</strong></td>
      <td>Consistent with other methods</td>
    </tr>
  </tbody>
</table>

<h3 id="43-grpo--larger-group-higher-diversity">4.3 GRPO — larger group, higher diversity</h3>

<p><em>Phase 1 diagnosis: group collapse. With k=4 on a low-entropy model, many steps had std ≈ 0.</em></p>

<table>
  <thead>
    <tr>
      <th>Parameter</th>
      <th>Phase 1 → Phase 5</th>
      <th>Rationale</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>group_size k</td>
      <td>4 → <strong>8</strong></td>
      <td>More samples per group, lower collapse probability</td>
    </tr>
    <tr>
      <td>gen_temperature</td>
      <td>0.8 → <strong>1.0</strong></td>
      <td>Higher entropy during rollout keeps group std positive</td>
    </tr>
    <tr>
      <td>learning rate</td>
      <td>1e-5 → <strong>5e-6</strong></td>
      <td>Stabilises noisy diverse batches</td>
    </tr>
  </tbody>
</table>

<h3 id="44-dpo--stronger-β-slower-learning-rate">4.4 DPO — stronger β, slower learning rate</h3>

<p><em>Phase 1 diagnosis: reward margin explosion. With β=0.1, the margin reached 599 by step 150.</em></p>

<p>β controls how strongly large margins are penalised in the loss:</p>

<p><strong>DPO reward margin:</strong></p>

<p><img src="/images/rlhfblogimages/math_dpo_margin.png" alt="DPO reward margin" />
<!-- INSERT: math_dpo_margin.png --></p>
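<p>A toy calculation illustrates the mechanism: for the same log-ratio gap Δ between chosen and rejected responses, a larger β saturates the per-pair loss sooner, so the gradient pushing the margin still further vanishes earlier and the policy drifts less. The numbers below are illustrative, not from the training runs:</p>

```python
import math

def dpo_pair_loss(delta, beta):
    """Per-pair DPO loss -log σ(β·Δ) for a given chosen-vs-rejected log-ratio gap Δ."""
    return -math.log(1.0 / (1.0 + math.exp(-beta * delta)))

for delta in (0.0, 5.0, 20.0):
    print(f"Δ={delta:5.1f}  β=0.1 → {dpo_pair_loss(delta, 0.1):.4f}"
          f"  β=0.3 → {dpo_pair_loss(delta, 0.3):.4f}")
```

<p>At Δ=20 the β=0.3 loss is already near zero while the β=0.1 loss still rewards further drift, which is consistent with the Phase 1 margin climbing to 599 under the weaker β.</p>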

<table>
  <thead>
    <tr>
      <th>Parameter</th>
      <th>Phase 1 → Phase 5</th>
      <th>Rationale</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>beta β</td>
      <td>0.1 → <strong>0.3</strong></td>
      <td>3× stronger implicit KL, slows margin explosion</td>
    </tr>
    <tr>
      <td>learning rate</td>
      <td>1e-5 → <strong>5e-6</strong></td>
      <td>Combined with stronger β prevents catastrophic drift</td>
    </tr>
    <tr>
      <td>rej_temperature</td>
      <td>0.9 → <strong>1.1</strong></td>
      <td>More diverse rejected responses, cleaner preference signal</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="5-dpo-training-dynamics-phase-1-vs-phase-5">5. DPO training dynamics: Phase 1 vs Phase 5</h2>

<p>The DPO training logs provide the clearest picture of what the β change achieved.</p>

<p><img src="/images/rlhfblogimages/fig3_dpo_training_dynamics.png" alt=" DPO training dynamics" />
<!-- INSERT: fig3_dpo_training_dynamics.png --></p>

<p><em>Fig 3 — DPO training dynamics. Top row: Phase 1 (β=0.1). Bottom row: Phase 5 (β=0.3). Left: loss. Right: reward margin.</em></p>

<p><strong>Phase 1 (β=0.1):</strong> Loss collapses to ~0 by step 30 and stays there. The reward margin grows monotonically, reaching 599 at step 150. The model is overfitting each pair to zero loss with no recovery.</p>

<p><strong>Phase 5 (β=0.3):</strong> The loss shows genuine variation: several steps sit near zero, but there are recoveries at steps 90 (1.44) and 100 (5.60). The margin peaks at 261 rather than 599, and goes negative at steps 90 and 100, indicating the model occasionally prefers the rejected response, a healthier training signal that triggers correction.</p>

<p>The negative margins in Phase 5 are not failures; they are the loss function doing its job. When the margin is negative, the loss is high, a strong gradient fires, and the policy corrects. With β=0.1, the loss reached zero so fast that these corrections never registered.</p>

<hr />

<h2 id="6-grpo-group-collapse-phase-1-vs-phase-5">6. GRPO group collapse: Phase 1 vs Phase 5</h2>

<p>The group standard deviation is the critical GRPO diagnostic. When <code class="language-plaintext highlighter-rouge">std → 0</code>, advantages <code class="language-plaintext highlighter-rouge">→ 0</code>, and no gradient flows.</p>

<p><img src="/images/rlhfblogimages/fig4_grpo_group_std.png" alt="GRPO group collapse" />
<!-- INSERT: fig4_grpo_group_std.png --></p>

<p><em>Fig 4 — GRPO group std. Left: Phase 1 per-prompt (k=4). Right: Phase 5 per training step (k=8, temp=1.0). Red dashed line = collapse threshold.</em></p>

<p>Phase 1 had 2 of 16 prompts at exactly std=0 (primary colors, atom structure) and several more near the threshold. These correspond directly to GRPO’s worst Phase 1 scores.</p>

<p>Phase 5 shows only one collapse event, at step 140 (the France prompt, where the model has a near-deterministic output regardless of k). At every other step std &gt; 0.5, so useful gradient signal was available throughout training.</p>

<p>The training logs confirm this: Phase 5 GRPO group mean rewards show the model successfully learning (mean_r of 5.718 at step 40, 6.419 at step 120, 5.760 at step 200), whereas in Phase 1 many groups produced no learning signal at all because every sample sat at the group mean.</p>

<hr />

<h2 id="7-phase-5--results">7. Phase 5 — Results</h2>

<h3 id="per-prompt-results-1">Per-prompt results</h3>

<p><img src="/images/rlhfblogimages/fig5_phase5_per_prompt.png" alt="Phase 5 — Results" />
<!-- INSERT: fig5_phase5_per_prompt.png --></p>

<p><em>Fig 5 — Phase 5 per-prompt reward scores after hyperparameter tuning.</em></p>

<h3 id="before-and-after-averages">Before and after averages</h3>

<p><img src="/images/rlhfblogimages/fig6_phase1_vs_phase5_averages.png" alt="Before and after averages" />
<!-- INSERT: fig6_phase1_vs_phase5_averages.png --></p>

<p><em>Fig 6 — Average reward, Phase 1 vs Phase 5. Hatched = Phase 1. Solid = Phase 5. Deltas annotated.</em></p>

<h3 id="summary">Summary</h3>

<table>
  <thead>
    <tr>
      <th>Algorithm</th>
      <th>Phase 1 avg</th>
      <th>Phase 5 avg</th>
      <th>Delta</th>
      <th>Direction</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>SFT</td>
      <td>+3.009</td>
      <td>+4.131</td>
      <td><strong>+1.122</strong></td>
      <td>Improved</td>
    </tr>
    <tr>
      <td>PPO</td>
      <td>+3.992</td>
      <td>+3.523</td>
      <td>-0.469</td>
      <td>Regressed</td>
    </tr>
    <tr>
      <td>GRPO</td>
      <td>-0.116</td>
      <td>+3.312</td>
      <td><strong>+3.428</strong></td>
      <td>Largest gain</td>
    </tr>
    <tr>
      <td>DPO</td>
      <td>+2.403</td>
      <td>+4.148</td>
      <td><strong>+1.745</strong></td>
      <td>Improved</td>
    </tr>
  </tbody>
</table>

<p>Three algorithms improved. One regressed. GRPO has the largest absolute gain at +3.428 — directly validating the group collapse hypothesis.</p>

<hr />

<h2 id="8-per-prompt-delta-analysis">8. Per-prompt delta analysis</h2>

<p><img src="/images/rlhfblogimages/fig7_delta_heatmap.png" alt="Per-prompt delta analysis" />
<!-- INSERT: fig7_delta_heatmap.png --></p>

<p><em>Fig 7 — Delta heatmap (Phase 5 − Phase 1). Green = improvement. Red = regression.</em></p>

<p>Several patterns stand out:</p>

<ul>
  <li>The <strong>capital of France row</strong> is all red or zero: this is a structural reward model failure. The correct answer (“Paris”) is penalised by the RM regardless of which algorithm generates it, and no hyperparameter change can fix this.</li>
  <li>The <strong>classify oak/copper/elephant row</strong> shows near-zero deltas: SFT already scores highly here (+7.01) and all methods converge to essentially the same output regardless of configuration.</li>
  <li><strong>GRPO’s improvements</strong> are concentrated on structured list tasks (staying healthy: +8.00, camping list: +11.57) where a more diverse group correctly identifies higher-quality completions.</li>
  <li><strong>DPO’s improvements</strong> are most notable on knowledge retrieval (atom structure: +13.09, from -6.63 in Phase 1 to +6.46 in Phase 5), where stronger β prevented the drift that destroyed these representations.</li>
  <li><strong>PPO’s regression</strong> is clearest on tasks where SFT already had good representations (4/16 fraction: -7.07, word embeddings: -4.31) where <code class="language-plaintext highlighter-rouge">kl_coef=0.1</code> over-constrained the policy in the opposite direction.</li>
</ul>

<hr />

<h2 id="9-the-ranking-reversal">9. The ranking reversal</h2>

<p><img src="/images/rlhfblogimages/fig8_ranking_bump_chart.png" alt="The ranking reversal" />
<!-- INSERT: fig8_ranking_bump_chart.png --></p>

<p><em>Fig 8 — Algorithm ranking, Phase 1 → Phase 5. DPO moves from 3rd to 1st. PPO falls from 1st to 3rd. GRPO stays 4th but closes most of the gap.</em></p>

<table>
  <thead>
    <tr>
      <th>Rank</th>
      <th>Phase 1</th>
      <th>Avg</th>
      <th>Rank</th>
      <th>Phase 5</th>
      <th>Avg</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1st</td>
      <td>PPO</td>
      <td>+3.99</td>
      <td>1st</td>
      <td>DPO</td>
      <td>+4.15</td>
    </tr>
    <tr>
      <td>2nd</td>
      <td>SFT</td>
      <td>+3.01</td>
      <td>2nd</td>
      <td>SFT</td>
      <td>+4.13</td>
    </tr>
    <tr>
      <td>3rd</td>
      <td>DPO</td>
      <td>+2.40</td>
      <td>3rd</td>
      <td>PPO</td>
      <td>+3.52</td>
    </tr>
    <tr>
      <td>4th</td>
      <td>GRPO</td>
      <td>-0.12</td>
      <td>4th</td>
      <td>GRPO</td>
      <td>+3.31</td>
    </tr>
  </tbody>
</table>

<p>The ranking reshuffled substantially. The Phase 1 winner (PPO) fell to third and is the only method that regressed in absolute terms. The Phase 1 loser (GRPO) closed most of the gap to the pack. DPO, which was below SFT in Phase 1, became the overall winner.</p>

<p>This outcome directly demonstrates that Phase 1 results were as much about <strong>hyperparameter sensitivity</strong> as about algorithmic quality. PPO with <code class="language-plaintext highlighter-rouge">kl_coef=0.01</code> performs differently from PPO with <code class="language-plaintext highlighter-rouge">kl_coef=0.1</code>. GRPO with <code class="language-plaintext highlighter-rouge">k=4</code> performs differently from GRPO with <code class="language-plaintext highlighter-rouge">k=8</code>. The algorithm identity alone is not sufficient to predict ranking.</p>

<blockquote>
  <p><strong>Key takeaway for practitioners:</strong> At 1M parameter scale, PPO is most sensitive to <code class="language-plaintext highlighter-rouge">kl_coef</code>, GRPO is most sensitive to group size and generation temperature (group collapse is a binary failure mode, not a gradual one), and DPO is most sensitive to <code class="language-plaintext highlighter-rouge">beta</code>. All three are also sensitive to eval temperature: the SFT gain of +1.12 from <code class="language-plaintext highlighter-rouge">temp=0.7</code> to <code class="language-plaintext highlighter-rouge">temp=0.3</code> with <strong>no retraining</strong> illustrates how much the evaluation protocol matters independently of training.</p>
</blockquote>

<hr />

<h2 id="10-conclusion">10. Conclusion</h2>

<ul>
  <li>
    <p><strong>DPO</strong> is theoretically the most elegant and empirically the most sensitive to β. With β=0.3 it is the best-performing method across both phases. With β=0.1 it degrades catastrophically on specific prompt types due to reward margin explosion.</p>
  </li>
  <li>
    <p><strong>GRPO’s group collapse failure mode</strong> is real, diagnosable from the group standard deviation during training, and directly fixable by increasing k and generation temperature. The +3.43 improvement from Phase 1 to Phase 5 is the clearest causal result in the entire project.</p>
  </li>
  <li>
    <p><strong>PPO</strong> is the most robust to suboptimal hyperparameters in Phase 1 but the most vulnerable to over-correction in Phase 5. <code class="language-plaintext highlighter-rouge">kl_coef=0.01</code> was too weak; <code class="language-plaintext highlighter-rouge">kl_coef=0.1</code> was too strong. The optimal value lies between them.</p>
  </li>
  <li>
    <p><strong>The reward model is the binding constraint</strong> on evaluation quality. Multiple results — including “The capital of France is Paris” scoring -7.10 — reveal that the RM has learned surface patterns that do not correlate with factual correctness. All rankings here are relative to the trained RM, not human preference.</p>
  </li>
  <li>
    <p><strong>Evaluation sampling matters independently of training.</strong> The SFT model improved by +1.12 with zero retraining just by changing from <code class="language-plaintext highlighter-rouge">temperature=0.7</code> to <code class="language-plaintext highlighter-rouge">temperature=0.3</code>. Phase 1 underestimated SFT’s capabilities and all post-SFT deltas should be read with this baseline correction in mind.</p>
  </li>
</ul>

<hr />

<h2 id="references">References</h2>

<ul>
  <li>Schulman, J. et al. (2017). <em>Proximal Policy Optimization Algorithms.</em> arXiv:1707.06347.</li>
  <li>Shao, Z. et al. (2024). <em>DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models</em> (GRPO). arXiv:2402.03300.</li>
  <li>Rafailov, R. et al. (2023). <em>Direct Preference Optimization: Your Language Model is Secretly a Reward Model.</em> NeurIPS 2023. arXiv:2305.18290.</li>
  <li>Taori, R. et al. (2023). <em>Stanford Alpaca: An Instruction-following LLaMA model.</em> tatsu-lab/alpaca.</li>
  <li>Christiano, P. et al. (2017). <em>Deep Reinforcement Learning from Human Preferences.</em> NeurIPS 2017.</li>
</ul>

<hr />

<p><em>Built entirely from scratch in PyTorch · No pretrained weights · No alignment libraries</em></p>]]></content><author><name>Brayan</name></author><summary type="html"><![CDATA[SFT · PPO · GRPO · DPO implementation, evaluation, and hyperparameter sensitivity]]></summary></entry><entry><title type="html">Implementing Direct Preference Optimization (DPO)</title><link href="https://brayanbrayan.github.io/machine-learning/rlhf/2026/03/24/dpo-implementation-blog.html" rel="alternate" type="text/html" title="Implementing Direct Preference Optimization (DPO)" /><published>2026-03-24T00:00:00+00:00</published><updated>2026-03-24T00:00:00+00:00</updated><id>https://brayanbrayan.github.io/machine-learning/rlhf/2026/03/24/dpo-implementation-blog</id><content type="html" xml:base="https://brayanbrayan.github.io/machine-learning/rlhf/2026/03/24/dpo-implementation-blog.html"><![CDATA[<p><em>Series: Multi-Stage RLHF from Scratch · Phase: Part 10 of 10 · Algorithm: DPO</em></p>

<hr />

<h2 id="1-context-and-motivation">1. Context and motivation</h2>

<p>This write-up documents Part 10 of a multi-stage project implementing reinforcement learning from human feedback (RLHF) from scratch. The full project covers: Supervised Fine-Tuning (SFT), a Reward Model, Proximal Policy Optimisation (PPO), Group Relative Policy Optimisation (GRPO), and finally Direct Preference Optimisation (DPO). Each algorithm is implemented from first principles against the same model architecture, tokenizer, and evaluation suite, enabling a direct side-by-side comparison of approaches.</p>

<p>DPO is chosen as the final stage because it represents the most elegant solution to the preference alignment problem. Where PPO requires a live reward model, a value function, rollout collection, and a clipped policy gradient update, DPO collapses the entire pipeline into a single classification loss. The goal of implementing it in this project is not just to obtain good scores, but to understand precisely what it does differently, where it succeeds, and where it falls short at small model scale.</p>

<hr />

<h2 id="2-what-the-dpo-paper-says">2. What the DPO paper says</h2>

<h3 id="21-the-core-problem">2.1 The core problem</h3>

<p>Standard RLHF (as used in PPO and InstructGPT) has two stages after SFT: first train a reward model on human preference data, then use reinforcement learning to maximise the learned reward subject to a KL constraint from the reference policy. The optimisation objective is:</p>

<p><img src="/images/eq1_kl_objective.png" alt=" Equation 1 — KL-Constrained RL Objective " /></p>

<p>This is expensive: it requires sampling from the LM during training, maintaining a separate reward model and critic, and careful hyperparameter tuning of the KL coefficient. The paper’s central insight is that this objective has a closed-form optimal solution:</p>

<p><img src="/images/eq2_optimal_policy.png" alt=" Equation 2 — Closed-Form Optimal Policy " /></p>

<p>Rearranging this to express the reward in terms of the policy gives:</p>

<p><img src="/images/eq3_reward_reparam.png" alt=" Equation 3 — Reward Reparameterisation " /></p>

<p>The key observation is that when this reparameterisation is substituted into the Bradley-Terry preference model, the intractable partition function $Z(x)$ cancels out entirely. This allows the preference probability to be expressed purely in terms of the policy and the reference — no reward model required.</p>
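<p>Spelled out with the notation of Equations 2–3, the cancellation works because the partition term depends only on $x$ and therefore appears identically in both rewards:</p>

```latex
% Eq. 3: reward expressed via the policy (Z(x) is the partition function)
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Substituting into the Bradley-Terry model
% p(y_w \succ y_l \mid x) = \sigma(r(x, y_w) - r(x, y_l)),
% the two \beta \log Z(x) terms cancel:
p(y_w \succ y_l \mid x)
  = \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
                 - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right)
```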

<h3 id="22-the-dpo-loss">2.2 The DPO loss</h3>

<p>Substituting the reparameterised reward into the Bradley-Terry preference model and framing it as a maximum likelihood objective over preference pairs $(x, y_w, y_l)$ yields the DPO loss:</p>

<p><img src="/images/eq4_dpo_loss.png" alt=" Equation 4 — DPO Loss " /></p>

<p>Where $y_w$ is the chosen (preferred) response, $y_l$ is the rejected (dispreferred) response, and $\beta$ controls how tightly the policy stays near the reference. This is a binary cross-entropy loss — the model learns to assign higher implicit reward to chosen over rejected, with the gradient automatically weighting harder examples more heavily.</p>

<h3 id="23-what-the-gradient-does">2.3 What the gradient does</h3>

<p>The paper provides an explicit gradient analysis. The gradient of the DPO loss with respect to the parameters $\theta$ increases the log-probability of $y_w$ and decreases the log-probability of $y_l$. Crucially, the weight applied to each example is $\sigma(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w))$ — proportional to how badly the current model mis-ranks the rejected response over the chosen one. This dynamic weighting prevents trivial updates on already-solved pairs and concentrates learning on the hardest examples.</p>
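<p>The effect of this weighting is easy to check numerically. A small pure-Python sketch (the implicit reward values are made-up, not from any training run):</p>

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def example_weight(r_chosen, r_rejected):
    """DPO gradient weight: sigma(r_rejected - r_chosen).
    Large when the model wrongly prefers the rejected response."""
    return sigmoid(r_rejected - r_chosen)

solved = example_weight(r_chosen=5.0, r_rejected=-5.0)   # model already ranks correctly
wrong  = example_weight(r_chosen=-5.0, r_rejected=5.0)   # model badly mis-ranks the pair

# Already-solved pairs receive near-zero weight; mis-ranked pairs dominate the update.
print(round(solved, 5), round(wrong, 5))
```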

<h3 id="24-experimental-setup-in-the-paper">2.4 Experimental setup in the paper</h3>

<p>The paper evaluates DPO on three tasks. In controlled sentiment generation on IMDb, it uses GPT-2-large SFT’d on movie reviews, with preference pairs generated synthetically using a pre-trained sentiment classifier. In TL;DR summarisation, it uses a GPT-J SFT model with human preference labels from Stiennon et al. In single-turn dialogue on the Anthropic HH dataset, it uses Pythia-2.8B fine-tuned on preferred completions. Evaluation uses the frontier of reward vs KL divergence (sentiment task, where the ground-truth reward function is known) and GPT-4 win rates against reference completions (summarisation and dialogue).</p>

<p>Crucially, all three experiments use pre-collected preference datasets. The paper never generates rejected responses on-the-fly during training — it works from a static offline dataset of $(x, y_w, y_l)$ triplets. This distinction is central to understanding how this implementation differs.</p>

<hr />

<h2 id="3-this-implementation">3. This implementation</h2>

<h3 id="31-architecture">3.1 Architecture</h3>

<p>The policy uses the <code class="language-plaintext highlighter-rouge">PolicyWithValue</code> class introduced in Part 8 (PPO). It wraps a <code class="language-plaintext highlighter-rouge">GPTModern</code> language model (a small transformer with RMSNorm, SwiGLU activations, and RoPE positional embeddings) and adds a linear value head over the logit space. In DPO, the value head is not trained — it is frozen and retained only for checkpoint compatibility with the earlier PPO and GRPO phases. Only the LM parameters receive gradient updates.</p>

<p>The reference model is an identical frozen copy of the SFT checkpoint. Its weights are fixed throughout training. All reference log-probabilities are computed inside <code class="language-plaintext highlighter-rouge">torch.no_grad()</code> blocks.</p>

<p><img src="/images/chart_arch.png" alt="DPO Pipeline Architecture" />
<em>Figure 3 — DPO pipeline: the frozen reference model and the trainable policy both receive the preference pair and output log-probabilities that feed the DPO loss. Gradients flow only to the trainable policy.</em></p>

<h3 id="32-the-dataset-challenge">3.2 The dataset challenge</h3>

<blockquote>
  <p><strong>Key departure from the paper</strong></p>

  <p>The original DPO paper uses datasets that already contain (prompt, chosen, rejected) triplets — the Anthropic HH dataset, TL;DR with Stiennon et al. preferences, or synthetically generated pairs. <code class="language-plaintext highlighter-rouge">tatsu-lab/alpaca</code> only provides (instruction, output) pairs with no rejected response. This requires constructing rejected responses on-the-fly during training, which is a meaningful departure from the pure offline DPO setup.</p>
</blockquote>

<p>After filtering rows with empty outputs, the Alpaca dataset provides 51,974 instruction-response pairs. For each training step:</p>

<ul>
  <li>The dataset’s human-written output becomes the chosen response $y_w$.</li>
  <li>A rejected response $y_l$ is generated on-the-fly from the frozen reference model using <code class="language-plaintext highlighter-rouge">temperature=0.9</code> and <code class="language-plaintext highlighter-rouge">top_k=50</code>.</li>
  <li>The assumption ‘human output is always better than a high-temperature model generation’ is treated as valid without reward model verification.</li>
</ul>

<p>This assumption is defensible and is consistent with the spirit of the paper’s approach, but it introduces noise: there will be cases where the generated response is actually acceptable, making the preference signal weak. The high temperature and diverse sampling are intended to maximise the probability that the rejected response is genuinely inferior.</p>

<p>The closest analogue in the paper is the IMDb sentiment task, where preference pairs are also constructed automatically rather than from human annotation — though there the classification is done with a ground-truth reward function, whereas here we rely on the quality gap between human text and a random generation.</p>
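<p>The on-the-fly pair construction described above can be sketched as follows. <code>reference_generate</code> is a hypothetical stand-in for sampling from the frozen reference model with <code>temperature=0.9</code> and <code>top_k=50</code> — a stub here so the sketch is self-contained:</p>

```python
def reference_generate(prompt, temperature=0.9, top_k=50, max_new_tokens=64):
    """Hypothetical stand-in for sampling from the frozen reference model."""
    return "<high-temperature reference sample for: " + prompt + ">"

def build_preference_triplet(example):
    """Turn one Alpaca (instruction, output) row into a DPO (x, y_w, y_l) triplet.
    Assumes the human output beats a high-temperature model generation."""
    prompt = example["instruction"]
    chosen = example["output"]                # human-written -> y_w
    rejected = reference_generate(prompt)     # model-sampled -> y_l
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

row = {"instruction": "Give three tips for staying healthy.",
       "output": "Eat well, sleep enough, and exercise regularly."}
triplet = build_preference_triplet(row)
```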

<h3 id="33-the-get_logps-function">3.3 The <code class="language-plaintext highlighter-rouge">get_logps</code> function</h3>

<p>A central implementation detail is the correct computation of per-sequence log-probabilities over the response tokens only. The function must respect the same token-shift logic used throughout the codebase:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_logps</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">,</span> <span class="n">response_mask</span><span class="p">):</span>
    <span class="c1"># Forward pass
</span>    <span class="n">logits</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">lm</span><span class="p">(</span><span class="n">input_ids</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>   <span class="c1"># (B, T, V)
</span>
    <span class="c1"># Shift: logits[:-1] predict tokens[1:]
</span>    <span class="n">shift_logits</span> <span class="o">=</span> <span class="n">logits</span><span class="p">[:,</span> <span class="p">:</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">:]</span>            <span class="c1"># (B, T-1, V)
</span>    <span class="n">shift_labels</span> <span class="o">=</span> <span class="n">input_ids</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:]</span>             <span class="c1"># (B, T-1)
</span>
    <span class="c1"># Shift mask by 1 to align with shifted labels
</span>    <span class="n">shift_mask</span>   <span class="o">=</span> <span class="n">response_mask</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:]</span>         <span class="c1"># (B, T-1)
</span>
    <span class="c1"># Per-token log-probs, zeroed on prompt tokens
</span>    <span class="n">log_probs</span>   <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">log_softmax</span><span class="p">(</span><span class="n">shift_logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">token_logps</span> <span class="o">=</span> <span class="n">log_probs</span><span class="p">.</span><span class="n">gather</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">shift_labels</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)).</span><span class="n">squeeze</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">token_logps</span> <span class="o">=</span> <span class="n">token_logps</span> <span class="o">*</span> <span class="n">shift_mask</span><span class="p">.</span><span class="nb">float</span><span class="p">()</span>

    <span class="k">return</span> <span class="n">token_logps</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>              <span class="c1"># (B,) — one value per sequence
</span></code></pre></div></div>

<p>The mask is shifted by one position to match the shifted labels. This ensures that only response tokens contribute to the log-probability sum, leaving prompt tokens at zero weight — which is the correct behaviour since we want to measure how well the model assigns probability to the response, not the prompt.</p>

<h3 id="34-the-dpo-loss">3.4 The DPO loss</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">dpo_loss</span><span class="p">(</span><span class="n">policy_chosen_logps</span><span class="p">,</span> <span class="n">policy_rejected_logps</span><span class="p">,</span>
             <span class="n">ref_chosen_logps</span><span class="p">,</span>    <span class="n">ref_rejected_logps</span><span class="p">,</span> <span class="n">beta</span><span class="o">=</span><span class="mf">0.1</span><span class="p">):</span>

    <span class="n">chosen_log_ratios</span>   <span class="o">=</span> <span class="n">policy_chosen_logps</span>   <span class="o">-</span> <span class="n">ref_chosen_logps</span>
    <span class="n">rejected_log_ratios</span> <span class="o">=</span> <span class="n">policy_rejected_logps</span> <span class="o">-</span> <span class="n">ref_rejected_logps</span>

    <span class="c1"># DPO margin: β * (log-ratio_chosen - log-ratio_rejected)
</span>    <span class="n">logits</span> <span class="o">=</span> <span class="n">beta</span> <span class="o">*</span> <span class="p">(</span><span class="n">chosen_log_ratios</span> <span class="o">-</span> <span class="n">rejected_log_ratios</span><span class="p">)</span>

    <span class="n">loss</span>          <span class="o">=</span> <span class="o">-</span><span class="n">F</span><span class="p">.</span><span class="n">logsigmoid</span><span class="p">(</span><span class="n">logits</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span>
    <span class="n">reward_margin</span> <span class="o">=</span> <span class="p">(</span><span class="n">chosen_log_ratios</span> <span class="o">-</span> <span class="n">rejected_log_ratios</span><span class="p">).</span><span class="n">detach</span><span class="p">().</span><span class="n">mean</span><span class="p">()</span>
    <span class="n">accuracy</span>      <span class="o">=</span> <span class="p">(</span><span class="n">logits</span><span class="p">.</span><span class="n">detach</span><span class="p">()</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">).</span><span class="nb">float</span><span class="p">().</span><span class="n">mean</span><span class="p">()</span>

    <span class="k">return</span> <span class="n">DPOLossOutput</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="n">loss</span><span class="p">,</span>
                         <span class="n">reward_margin</span><span class="o">=</span><span class="n">reward_margin</span><span class="p">,</span>
                         <span class="n">accuracy</span><span class="o">=</span><span class="n">accuracy</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="35-training-configuration">3.5 Training configuration</h3>

<p>Training ran for 200 steps on CPU using a single example per step (batch size 1). The key hyperparameters:</p>

<ul>
  <li>$\beta = 0.1$ (DPO temperature, same as paper default)</li>
  <li>Learning rate = <code class="language-plaintext highlighter-rouge">1e-5</code> with AdamW, betas <code class="language-plaintext highlighter-rouge">(0.9, 0.999)</code></li>
  <li>Gradient clipping at <code class="language-plaintext highlighter-rouge">1.0</code></li>
  <li>Response generation: <code class="language-plaintext highlighter-rouge">temperature=0.9</code>, <code class="language-plaintext highlighter-rouge">top_k=50</code>, max 64 new tokens</li>
  <li>Block size: 256 tokens</li>
  <li>Model: 2-layer, 2-head, 128-dim transformer (same as PPO phase)</li>
</ul>

<p>The small model size (2 layers, 128-dim) is consistent across all phases of the project. This is intentional — the project is about algorithm implementation and comparison, not about maximising absolute scores.</p>

<hr />

<h2 id="4-training-dynamics">4. Training dynamics</h2>

<h3 id="41-loss-behaviour">4.1 Loss behaviour</h3>

<p>The loss curve tells a clear story. At step 10, loss = 0.641 and accuracy = 1.0 — the model is already preferring chosen over rejected, but with moderate confidence. At step 20, loss = 0.693 (exactly $\log 2$), accuracy = 0.0, margin = 0.0. This is the degenerate case predicted by theory: when the policy exactly mirrors the reference, all log-ratios are zero, the DPO margin is zero, and the loss equals $-\log(\sigma(0)) = \log 2 \approx 0.693$. The model assigns equal probability to chosen and rejected.</p>
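<p>The $\log 2$ value can be verified directly from the loss definition: when the policy's log-probabilities coincide with the reference's, both log-ratios are zero and the loss is $-\log \sigma(0)$ regardless of $\beta$. A quick pure-Python check (the log-probability values are arbitrary):</p>

```python
import math

def dpo_loss_scalar(pol_c, pol_r, ref_c, ref_r, beta=0.1):
    """Scalar DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = (pol_c - ref_c) - (pol_r - ref_r)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy identical to reference: every log-ratio is zero.
degenerate = dpo_loss_scalar(-10.0, -12.0, -10.0, -12.0)
print(degenerate)   # ~0.6931 = log(2)
```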

<p>From step 30 onward, loss collapses toward zero and stays there for the remainder of training, reaching exactly 0.0 at steps 70, 80, 110, 130, 150, 160, 170, 190, and 200. Accuracy locks at 1.0. This rapid convergence is consistent with what the paper describes as the efficiency of DPO — but the speed here is a warning sign, not a success signal.</p>

<h3 id="42-reward-margin-explosion">4.2 Reward margin explosion</h3>

<p>The most important metric to examine is the reward margin — the mean gap between chosen and rejected log-ratios. A margin of 1–10 is healthy and indicates the model has learned a meaningful preference signal while staying close to the reference. The values observed here are qualitatively different:</p>

<table>
  <thead>
    <tr>
      <th>Step</th>
      <th>Reward Margin</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>30</td>
      <td>56.9</td>
    </tr>
    <tr>
      <td>70</td>
      <td>240.7</td>
    </tr>
    <tr>
      <td>80</td>
      <td>295.8</td>
    </tr>
    <tr>
      <td>150</td>
      <td>599.2</td>
    </tr>
    <tr>
      <td>200</td>
      <td>329.7</td>
    </tr>
  </tbody>
</table>

<p>Margins of this magnitude indicate that the policy has drifted very far from the reference distribution. The loss reaching zero and staying there is not evidence of good generalisation — it is evidence that the model has memorised the preference signal on each individual pair to the point where it assigns near-zero probability mass to the rejected response. This is reward hacking in the DPO sense: the policy has exploited the training signal rather than learning a generalisable preference.</p>

<p><img src="/images/chart_training.png" alt="Training Dynamics" />
<em>Figure 1 — DPO training dynamics over 200 steps. Blue: loss (left axis). Red dashed: reward margin (right axis). Note the margin explosion after step 30, reaching 599 at step 150.</em></p>

<blockquote>
  <p><strong>Root cause</strong></p>

  <p>The primary driver is the batch size of 1 combined with a small model and a strong quality gap between human text and high-temperature generations. With a single example per step, there is no averaging across a batch to stabilise gradients. Each update can overfit completely to one (chosen, rejected) pair before moving to the next. A batch size of 16–64 with gradient accumulation would stabilise this significantly.</p>
</blockquote>
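<p>The stabilising effect of batching is plain variance reduction: averaging $B$ independent per-example gradients shrinks the step-to-step variance by roughly $1/B$. A toy demonstration with synthetic scalar “gradients” (random numbers, nothing from the actual training run):</p>

```python
import random
import statistics

random.seed(0)

def simulated_update(batch_size):
    """Mean of batch_size noisy per-example 'gradients' (synthetic)."""
    return statistics.mean(random.gauss(0.0, 1.0) for _ in range(batch_size))

updates_b1  = [simulated_update(1)  for _ in range(2000)]
updates_b32 = [simulated_update(32) for _ in range(2000)]

# With batch size 1, every step is a full-variance sample; with 32, the
# step variance drops ~32x, so no single (chosen, rejected) pair can drag
# the policy far enough to blow up the reward margin.
print(statistics.pstdev(updates_b1), statistics.pstdev(updates_b32))
```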

<hr />

<h2 id="5-evaluation-results">5. Evaluation results</h2>

<h3 id="51-summary">5.1 Summary</h3>

<p>Post-training evaluation was run on 16 standard prompts using the Part 7 reward model. Average reward: <strong>2.40</strong>. Average KL divergence from the SFT base: <strong>3.31</strong>.</p>

<table>
  <thead>
    <tr>
      <th>Prompt</th>
      <th>Reward</th>
      <th>Avg KL</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Give three tips for staying healthy.</td>
      <td>5.44</td>
      <td>3.45</td>
    </tr>
    <tr>
      <td>What are the three primary colors?</td>
      <td>6.93</td>
      <td>2.54</td>
    </tr>
    <tr>
      <td>Describe the structure of an atom.</td>
      <td><strong>-6.63</strong></td>
      <td>1.21</td>
    </tr>
    <tr>
      <td>How can we reduce air pollution?</td>
      <td><strong>-7.75</strong></td>
      <td>4.45</td>
    </tr>
    <tr>
      <td>Describe a time when you had to make a difficult decision.</td>
      <td>0.25</td>
      <td>2.06</td>
    </tr>
    <tr>
      <td>Identify the odd one out. Twitter, Instagram, Telegram</td>
      <td>3.94</td>
      <td>2.87</td>
    </tr>
    <tr>
      <td>Explain why 4/16 is equivalent to 1/4</td>
      <td>6.14</td>
      <td>3.02</td>
    </tr>
    <tr>
      <td>Write a short story about a career decision.</td>
      <td>1.67</td>
      <td>8.70</td>
    </tr>
    <tr>
      <td>Render a 3D model of a house</td>
      <td>2.97</td>
      <td>2.35</td>
    </tr>
    <tr>
      <td>Evaluate this sentence for spelling/grammar mistakes</td>
      <td>6.99</td>
      <td>5.91</td>
    </tr>
    <tr>
      <td>How did Julius Caesar die?</td>
      <td>7.13</td>
      <td>2.88</td>
    </tr>
    <tr>
      <td>What is the capital of France?</td>
      <td>4.91</td>
      <td>3.51</td>
    </tr>
    <tr>
      <td>Generate a list of ten items for a camping trip</td>
      <td><strong>-4.28</strong></td>
      <td>1.65</td>
    </tr>
    <tr>
      <td>Discuss the causes of the Great Depression</td>
      <td><strong>-4.91</strong></td>
      <td>1.41</td>
    </tr>
    <tr>
      <td>Classify: Oak tree, copper ore, elephant</td>
      <td>7.84</td>
      <td>2.59</td>
    </tr>
    <tr>
      <td>Explain the use of word embeddings in NLP</td>
      <td>7.82</td>
      <td>4.27</td>
    </tr>
  </tbody>
</table>

<p><img src="/images/chart_rewards.png" alt="Per-Prompt Reward Scores" />
<em>Figure 2 — Per-prompt reward scores for 16 evaluation prompts. Blue bars = positive reward; red bars = negative reward. Average: 2.40.</em></p>

<h3 id="52-response-quality">5.2 Response quality</h3>

<p>The responses are incoherent. Representative examples:</p>

<ul>
  <li>Prompt: ‘Describe the structure of an atom.’ → Response: <em>‘An atom is ways make inform of a severe glic acid.’</em></li>
  <li>Prompt: ‘How can we reduce air pollution?’ → Response: <em>‘There are reducing made up16 is made up of a risular of 13’</em></li>
  <li>Prompt: ‘What is the capital of France?’ → Response: <em>‘The capital of France is made up of Water Exercise ore of comm of at 1973on Bret, make amends.’</em></li>
</ul>

<p>These outputs are syntactically broken and semantically meaningless. The model has learned to shift its distribution away from the reference but has not learned coherent language — it has simply collapsed into a different incoherent distribution that happens to score positively under the reward model.</p>

<h3 id="53-reward-model-limitations">5.3 Reward model limitations</h3>

<p>The high variance in reward scores (range: −7.75 to +7.84) and the presence of very high rewards on clearly nonsensical responses point to a known limitation: the reward model was trained at this same small scale (2-layer, 256-dim) and has limited ability to discriminate response quality on prompts far outside its training distribution. A response like <em>‘Julius Caur G conflict, forces, had just graphelling’</em> receives a reward of 7.13. This is reward model overfitting — the RM has learned surface patterns that map to high scores without capturing semantic quality.</p>

<p>This is not a failure unique to DPO. It is a general property of small-scale RLHF: the reward model and the policy are jointly limited by model capacity, and the policy can find reward-maximising outputs that fool the RM without producing genuinely good text.</p>

<hr />

<h2 id="6-how-this-compares-to-the-paper">6. How this compares to the paper</h2>

<h3 id="61-what-was-faithfully-implemented">6.1 What was faithfully implemented</h3>

<ul>
  <li>The DPO loss equation from the paper (Equation 7) is implemented exactly, including the beta temperature and the log-sigmoid objective.</li>
  <li>The frozen reference model pattern is correct: $\pi_\mathrm{ref}$ is initialised from the SFT checkpoint and receives no gradient updates throughout training.</li>
  <li>The <code class="language-plaintext highlighter-rouge">val_head</code> is preserved in the architecture for checkpoint compatibility, as required by the multi-stage project structure.</li>
  <li>The <code class="language-plaintext highlighter-rouge">get_logps</code> masking correctly supervises only response tokens, consistent with the paper’s per-sequence log-probability formulation.</li>
  <li>The reward model is absent from the training loop entirely, consistent with DPO’s RL-free design.</li>
  <li>$\beta = 0.1$ matches the paper’s default hyperparameter.</li>
</ul>

<h3 id="62-where-this-differs-from-the-paper">6.2 Where this differs from the paper</h3>

<ul>
  <li><strong>Dataset:</strong> The paper uses pre-collected $(x, y_w, y_l)$ triplets. This implementation constructs rejected responses on-the-fly from the reference model because <code class="language-plaintext highlighter-rouge">tatsu-lab/alpaca</code> contains no rejected responses.</li>
  <li><strong>Batch size:</strong> The paper uses batch size 64 with RMSprop. This implementation uses batch size 1 with AdamW, which destabilises training and contributes to margin explosion.</li>
  <li><strong>Model scale:</strong> The paper evaluates models up to 6B parameters. This implementation uses a ~1M parameter model. At this scale, the learned representations are too weak to produce coherent responses even after successful preference alignment.</li>
  <li><strong>Evaluation:</strong> The paper uses GPT-4 win rates for realistic tasks and a ground-truth sentiment classifier for the controlled task. This implementation uses a small trained reward model, which has its own limitations at small scale.</li>
  <li><strong>Training steps:</strong> The paper trains to convergence (thousands of steps). This run trained for 200 steps, which is sufficient to observe the training dynamics but not to assess long-run behaviour.</li>
</ul>

<h3 id="63-on-the-rm-free-design">6.3 On the RM-free design</h3>

<p>The absence of the reward model from the DPO training loop is architecturally correct and theoretically motivated. In standard RLHF, the RM appears twice: once during reward model training (Part 7), and again inside the PPO training loop as the reward signal for each rollout. DPO eliminates the second appearance entirely. The implicit reward is instead encoded in the preference pairs themselves — the policy learns to assign higher implicit reward to chosen responses purely by shifting log-ratios, with no RM forward pass at training time.</p>

<p>In this project, the RM reappears only at evaluation time (<code class="language-plaintext highlighter-rouge">dpo_logger.py</code> and <code class="language-plaintext highlighter-rouge">eval_dpo.py</code>), where it serves as a consistent scoring function to allow comparison with PPO results. This is the correct separation: RM as evaluator, not as training signal.</p>

<hr />

<h2 id="7-what-the-results-tell-us-and-what-to-try-next">7. What the results tell us and what to try next</h2>

<p>The training dynamics are informative even though the final outputs are incoherent. The loss and accuracy curves are behaving as expected theoretically — rapid convergence is a known property of DPO on easy preference pairs. The reward margin explosion is the clear pathology, and its cause is well-understood: a batch size of 1 gives the optimiser no averaging, each step overfits the current pair, and the margin grows without bound.</p>

<p>To obtain coherent outputs at this model scale, the most impactful changes in order of priority would be:</p>

<ol>
  <li><strong>Increase batch size to 16–64</strong> using gradient accumulation. This is the single most important change.</li>
  <li><strong>Add a margin cap:</strong> clip the reward margin at a threshold (e.g., 10.0) to prevent saturation, or use a length-normalised version of the log-probability sum instead of the raw sum.</li>
  <li><strong>Reduce $\beta$ to 0.05.</strong> A smaller beta tightens the KL constraint and keeps the policy closer to the reference, reducing the risk of degenerate drift.</li>
  <li><strong>Train for more steps</strong> with early stopping on a held-out reward score. 200 steps on a 51k dataset barely scratches the surface.</li>
  <li><strong>Upgrade the model.</strong> The 2-layer 128-dim architecture is the binding constraint on response quality. Even moving to 4 layers and 256-dim would substantially improve coherence.</li>
</ol>
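<p>The margin cap in item 2 can be sketched as a one-line change to the loss. This is a hypothetical variant, not something from the paper, and the cap value of 10.0 is an assumption:</p>

```python
import math

def capped_dpo_loss(pol_c, pol_r, ref_c, ref_r, beta=0.1, margin_cap=10.0):
    """DPO loss with the raw log-ratio margin clipped to [-margin_cap, margin_cap]
    so a single pair cannot saturate the objective."""
    margin = (pol_c - ref_c) - (pol_r - ref_r)
    clipped = max(-margin_cap, min(margin_cap, margin))
    return -math.log(1.0 / (1.0 + math.exp(-beta * clipped)))

# An exploded raw margin of 599 is treated the same as a margin of exactly 10.
loss_exploded = capped_dpo_loss(600.0, 0.0, 1.0, 0.0)    # raw margin 599
loss_at_cap   = capped_dpo_loss(11.0, 0.0, 1.0, 0.0)     # raw margin 10
print(loss_exploded, loss_at_cap)
```

<p>One design consequence worth noting: clipping zeroes the gradient on saturated pairs (in a PyTorch version, a clamp on the margin before the log-sigmoid), which is precisely what stops already-won pairs from pushing the policy further from the reference.</p>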

<p>Despite the output quality, the implementation is architecturally sound. The DPO loss, the reference model pattern, the <code class="language-plaintext highlighter-rouge">get_logps</code> masking, and the RM-free training loop are all correct. What is broken is the training regime, not the algorithm. This is a meaningful distinction — it means the codebase is a valid starting point for a proper run with better hyperparameters and a larger model.</p>

<hr />

<h2 id="8-conclusion">8. Conclusion</h2>

<p>DPO is a genuinely elegant algorithm. The theoretical insight — that the RLHF objective has a closed-form optimal policy, and that substituting a log-ratio reparameterisation into the Bradley-Terry model eliminates both the reward model and the RL loop — is one of the cleaner results in the alignment literature. The implementation is simpler than PPO by a significant margin: no value function, no rollout collection, no advantage estimation, no clipped ratio objective.</p>

<p>At small scale with a batch size of 1, the reward margin explodes and output quality degrades. This is not a failure of DPO as an algorithm — it is a failure of the training configuration relative to model capacity. The paper’s results at 6B parameters with batch size 64 are not directly comparable to a 1M parameter model trained one example at a time. What this run does confirm is the theoretical behaviour: rapid loss convergence, accuracy saturating at 1.0, and the preference signal being successfully encoded (even if over-encoded) in the policy weights.</p>

<p>The next step is the cross-algorithm comparison across PPO, GRPO, and DPO using the same 16 evaluation prompts and the same reward model — the comparison this entire project has been building toward.</p>

<hr />

<h2 id="references">References</h2>

<p>Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., &amp; Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. <em>NeurIPS 2023</em>. arXiv:2305.18290.</p>

<p>Schulman, J., Wolski, F., Dhariwal, P., Radford, A., &amp; Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.</p>

<p>Taori, R. et al. (2023). Stanford Alpaca: An Instruction-following LLaMA model. tatsu-lab/alpaca dataset.</p>]]></content><author><name>Brayan</name></author><category term="machine-learning" /><category term="rlhf" /><category term="dpo" /><category term="alignment" /><category term="nlp" /><category term="transformers" /><category term="pytorch" /><summary type="html"><![CDATA[Series: Multi-Stage RLHF from Scratch · Phase: Part 10 of 10 · Algorithm: DPO]]></summary></entry><entry><title type="html">Building a GPT from Scratch in JAX/Flax</title><link href="https://brayanbrayan.github.io/2026/03/13/building-minGPT-in-jax.html" rel="alternate" type="text/html" title="Building a GPT from Scratch in JAX/Flax" /><published>2026-03-13T00:00:00+00:00</published><updated>2026-03-13T00:00:00+00:00</updated><id>https://brayanbrayan.github.io/2026/03/13/building-minGPT-in-jax</id><content type="html" xml:base="https://brayanbrayan.github.io/2026/03/13/building-minGPT-in-jax.html"><![CDATA[<h1 id="building-a-gpt-from-scratch-in-jaxflax">Building a GPT from Scratch in JAX/Flax</h1>

<p><img src="/images/jax_logo.png" alt="JAX logo" /></p>

<p><em>An honest account of building a transformer language model using JAX, Flax NNX, and the TinyStories dataset — including every wall I hit along the way.</em></p>

<hr />

<h2 id="why-jax">Why JAX?</h2>

<p>Most transformer tutorials start with PyTorch. It’s intuitive, well-documented, and the ecosystem is enormous. So why would anyone choose JAX for a from-scratch GPT implementation?</p>

<p>Three reasons:</p>

<p><strong>1. XLA compilation.</strong> JAX compiles your code down to XLA (Accelerated Linear Algebra), which means the same code runs on CPU, GPU, and TPU without modification. You decorate a function with <code class="language-plaintext highlighter-rouge">@jax.jit</code> and JAX handles the rest.</p>

<p><strong>2. Functional purity.</strong> JAX forces you to write pure functions — no hidden state, no in-place mutations. This is uncomfortable at first, but it makes your model logic explicit and easier to reason about.</p>

<p><strong>3. <code class="language-plaintext highlighter-rouge">vmap</code>.</strong> JAX’s <code class="language-plaintext highlighter-rouge">vmap</code> lets you write code for a single example and automatically vectorize it across a batch. This isn’t just a convenience — it changes how you think about batching entirely.</p>
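<p>A minimal sketch of that idea (the function below is illustrative, not from this project): write the computation for one example, then let <code class="language-plaintext highlighter-rouge">vmap</code> add the batch dimension for you.</p>

```python
import jax
import jax.numpy as jnp

# Written for a SINGLE example: squared L2 norm of one feature vector.
def sq_norm(x):
    return jnp.dot(x, x)

# vmap lifts it over a leading batch axis — no loops, no manual reshaping.
batched_sq_norm = jax.vmap(sq_norm)

batch = jnp.arange(6.0).reshape(3, 2)   # 3 examples, 2 features each
out = batched_sq_norm(batch)            # shape (3,)
```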

<p>That said, JAX has a steeper learning curve than PyTorch. This post is an honest account of what that looks like in practice.</p>

<hr />

<h2 id="the-architecture">The Architecture</h2>

<p>The model is a decoder-only transformer — the same family as GPT — trained to predict the next token in a sequence. Here’s the full picture:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Input tokens (batch_size, seq_len)
        ↓
Token Embeddings + Positional Embeddings
        ↓
  ┌─────────────────────┐
  │  Transformer Block  │  × 6
  │  ┌───────────────┐  │
  │  │ Causal Multi- │  │
  │  │ Head Attention│  │
  │  └───────────────┘  │
  │         ↓           │
  │    Residual Add     │
  │         ↓           │
  │  ┌───────────────┐  │
  │  │ Feed-Forward  │  │
  │  └───────────────┘  │
  │         ↓           │
  │    Residual Add     │
  └─────────────────────┘
        ↓
Linear Projection → Logits (batch_size, seq_len, vocab_size)
</code></pre></div></div>

<p><strong>Hyperparameters:</strong></p>

<table>
  <thead>
    <tr>
      <th>Parameter</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Transformer blocks</td>
      <td>6</td>
    </tr>
    <tr>
      <td>Embedding dimension</td>
      <td>192</td>
    </tr>
    <tr>
      <td>Attention heads</td>
      <td>6</td>
    </tr>
    <tr>
      <td>Feed-forward dimension</td>
      <td>512</td>
    </tr>
    <tr>
      <td>Max sequence length</td>
      <td>128</td>
    </tr>
    <tr>
      <td>Vocabulary</td>
      <td>GPT-2 tokenizer (50,257 tokens)</td>
    </tr>
  </tbody>
</table>

<p><img src="/images/transformer_arch.png" alt="Transformer architecture" /></p>

<p><em>The transformer architecture — our model uses the decoder side (right) only.</em></p>

<p>Small by modern standards, but trainable on a single GPU and expressive enough to learn story structure.</p>

<hr />

<h2 id="part-1-embeddings">Part 1: Embeddings</h2>

<p>The first layer combines token embeddings (what is this word?) with positional embeddings (where is it in the sequence?):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TokenAndPositionEmbedding</span><span class="p">(</span><span class="n">nnx</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">maxlen</span><span class="p">,</span> <span class="n">vocab_size</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">,</span> <span class="o">*</span><span class="p">,</span> <span class="n">rngs</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">token_emb</span> <span class="o">=</span> <span class="n">nnx</span><span class="p">.</span><span class="n">Embed</span><span class="p">(</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">,</span> <span class="n">rngs</span><span class="o">=</span><span class="n">rngs</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">pos_emb</span>   <span class="o">=</span> <span class="n">nnx</span><span class="p">.</span><span class="n">Embed</span><span class="p">(</span><span class="n">maxlen</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">,</span> <span class="n">rngs</span><span class="o">=</span><span class="n">rngs</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="n">seq_len</span>   <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
        <span class="n">positions</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">seq_len</span><span class="p">)[</span><span class="bp">None</span><span class="p">,</span> <span class="p">:]</span>  <span class="c1"># shape: (1, seq_len)
</span>        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">token_emb</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">pos_emb</span><span class="p">(</span><span class="n">positions</span><span class="p">)</span>
</code></pre></div></div>

<p>The key line is <code class="language-plaintext highlighter-rouge">jnp.arange(seq_len)[None, :]</code> — the <code class="language-plaintext highlighter-rouge">[None, :]</code> adds a batch dimension so positions broadcast correctly across the batch. This is a pattern you will use constantly in JAX.</p>

<p><img src="/images/pos_encoding.jpg" alt="Positional encoding" /></p>

<p><em>Token embeddings encode meaning; positional embeddings encode order. Both are summed before entering the transformer.</em></p>
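<p>The broadcasting behind that sum is worth seeing once. NumPy follows the same broadcasting rules as <code class="language-plaintext highlighter-rouge">jnp</code>, so here is a shape-only sketch (the dimensions are made up for illustration):</p>

```python
import numpy as np

batch, seq_len, embed_dim = 4, 5, 8
tok = np.zeros((batch, seq_len, embed_dim))      # token embeddings, one row per example
pos = np.ones((seq_len, embed_dim))[None, :, :]  # positional, shape (1, seq_len, embed_dim)

# The leading 1 broadcasts against batch: every example gets the same positions.
combined = tok + pos                             # shape (batch, seq_len, embed_dim)
```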

<hr />

<h2 id="part-2-causal-attention">Part 2: Causal Attention</h2>

<p>The attention block uses Flax NNX’s built-in <code class="language-plaintext highlighter-rouge">MultiHeadAttention</code>, but the critical piece is the <strong>causal mask</strong> — without it, the model can look into the future when predicting the next token, which is cheating.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">MiniGPT</span><span class="p">(</span><span class="n">nnx</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">causal_attention_mask</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">seq_len</span><span class="p">):</span>
        <span class="c1"># Lower triangular matrix — token i can only attend to tokens 0..i
</span>        <span class="k">return</span> <span class="n">jnp</span><span class="p">.</span><span class="n">tril</span><span class="p">(</span><span class="n">jnp</span><span class="p">.</span><span class="n">ones</span><span class="p">((</span><span class="n">seq_len</span><span class="p">,</span> <span class="n">seq_len</span><span class="p">)))</span>

    <span class="k">def</span> <span class="nf">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">token_ids</span><span class="p">):</span>
        <span class="n">seq_len</span> <span class="o">=</span> <span class="n">token_ids</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
        <span class="n">mask</span>    <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">causal_attention_mask</span><span class="p">(</span><span class="n">seq_len</span><span class="p">)</span>

        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">embedding</span><span class="p">(</span><span class="n">token_ids</span><span class="p">)</span>

        <span class="k">for</span> <span class="n">block</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">transformer_blocks</span><span class="p">:</span>
            <span class="n">x</span> <span class="o">=</span> <span class="n">block</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">mask</span><span class="p">)</span>

        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">output_layer</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">jnp.tril</code> produces a lower-triangular matrix of ones. Position (i, j) is 1 if j ≤ i, meaning token i is allowed to attend to token j. This single matrix enforces the autoregressive property of the model.</p>

<p><img src="/images/causal.png" alt="Causal mask" /></p>

<p><em>The causal mask — each token (row) can only attend to itself and previous tokens (columns). Future positions are masked out.</em></p>
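<p>A quick way to convince yourself the mask does what the caption claims — here in NumPy, which mirrors <code class="language-plaintext highlighter-rouge">jnp.tril</code> exactly (the scores are contrived):</p>

```python
import numpy as np

seq_len = 4
mask = np.tril(np.ones((seq_len, seq_len)))
# Row i (the query token) has ones only in columns 0..i (the keys it may see).
# In attention, masked positions receive -inf before the softmax,
# which sends their attention weight to zero.
scores = np.full((seq_len, seq_len), 0.5)
masked = np.where(mask == 1, scores, -np.inf)
```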

<hr />

<h2 id="part-3-the-flax-nnx-learning-curve">Part 3: The Flax NNX Learning Curve</h2>

<p>Flax has two APIs: the older <code class="language-plaintext highlighter-rouge">linen</code> API and the newer <code class="language-plaintext highlighter-rouge">nnx</code> API. This project uses NNX, which is more Pythonic — modules hold their own state rather than requiring external parameter trees.</p>

<p><strong>The gotcha that cost me real time:</strong></p>

<p>Flax 0.11.0 changed the <code class="language-plaintext highlighter-rouge">Optimizer</code> and <code class="language-plaintext highlighter-rouge">update</code> signatures without much fanfare:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Flax &lt; 0.11.0
</span><span class="n">optimizer</span> <span class="o">=</span> <span class="n">nnx</span><span class="p">.</span><span class="n">Optimizer</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">optax</span><span class="p">.</span><span class="n">adamw</span><span class="p">(...))</span>
<span class="n">optimizer</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="n">grads</span><span class="p">)</span>

<span class="c1"># Flax &gt;= 0.11.0 — both arguments now required
</span><span class="n">optimizer</span> <span class="o">=</span> <span class="n">nnx</span><span class="p">.</span><span class="n">Optimizer</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">optax</span><span class="p">.</span><span class="n">adamw</span><span class="p">(...),</span> <span class="n">wrt</span><span class="o">=</span><span class="n">nnx</span><span class="p">.</span><span class="n">Param</span><span class="p">)</span>
<span class="n">optimizer</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">grads</span><span class="p">)</span>
</code></pre></div></div>

<p>The error message (<code class="language-plaintext highlighter-rouge">Missing required argument 'wrt'</code>) points you in the right direction, but if you’re following a tutorial written before 0.11.0 you’ll hit this immediately. Always check your Flax version against the tutorial’s requirements.</p>
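<p>If you want to guard a notebook against this, a small hypothetical helper can branch on the installed version string (the function name and the branching logic are mine, not part of Flax; plain <code class="language-plaintext highlighter-rouge">X.Y.Z</code> version strings are assumed):</p>

```python
def needs_wrt_kwarg(flax_version: str) -> bool:
    """True if this Flax version uses the new Optimizer signature (>= 0.11.0)."""
    # Assumes a plain X.Y.Z string; suffixes like "0.11.0rc1" would need extra parsing.
    parts = tuple(int(p) for p in flax_version.split(".")[:3])
    return parts >= (0, 11, 0)

# Hypothetical usage, guarding the Optimizer construction:
# if needs_wrt_kwarg(flax.__version__):
#     optimizer = nnx.Optimizer(model, tx, wrt=nnx.Param)
# else:
#     optimizer = nnx.Optimizer(model, tx)
```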

<hr />

<h2 id="part-4-data-loading-with-grain">Part 4: Data Loading with Grain</h2>

<p>Rather than writing a custom DataLoader, this project uses Google’s <code class="language-plaintext highlighter-rouge">grain</code> library — a JAX-native data loading library built for performance.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataset</span> <span class="o">=</span> <span class="n">StoryDataset</span><span class="p">(</span><span class="n">stories</span><span class="p">,</span> <span class="n">maxlen</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">)</span>

<span class="n">sampler</span> <span class="o">=</span> <span class="n">pygrain</span><span class="p">.</span><span class="n">IndexSampler</span><span class="p">(</span>
    <span class="n">num_records</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">dataset</span><span class="p">),</span>
    <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">seed</span><span class="o">=</span><span class="mi">42</span><span class="p">,</span>
    <span class="n">shard_options</span><span class="o">=</span><span class="n">pygrain</span><span class="p">.</span><span class="n">NoSharding</span><span class="p">(),</span>
    <span class="n">num_epochs</span><span class="o">=</span><span class="n">num_epochs</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">dataloader</span> <span class="o">=</span> <span class="n">pygrain</span><span class="p">.</span><span class="n">DataLoader</span><span class="p">(</span>
    <span class="n">data_source</span><span class="o">=</span><span class="n">dataset</span><span class="p">,</span>
    <span class="n">sampler</span><span class="o">=</span><span class="n">sampler</span><span class="p">,</span>
    <span class="n">operations</span><span class="o">=</span><span class="p">[</span><span class="n">pygrain</span><span class="p">.</span><span class="n">Batch</span><span class="p">(</span><span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">drop_remainder</span><span class="o">=</span><span class="bp">True</span><span class="p">)]</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Each story is tokenized and right-padded to <code class="language-plaintext highlighter-rouge">maxlen=128</code> with zeros. The target sequence is simply the input shifted one position to the right:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Input:  [Once, upon, a, time, ...]
# Target: [upon, a,    time, ..., &lt;pad&gt;]
</span>
<span class="n">prep_target_batch</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">vmap</span><span class="p">(</span>
    <span class="k">lambda</span> <span class="n">tokens</span><span class="p">:</span> <span class="n">jnp</span><span class="p">.</span><span class="n">concatenate</span><span class="p">((</span><span class="n">tokens</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="n">jnp</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">])))</span>
<span class="p">)</span>
</code></pre></div></div>

<p>This is where <code class="language-plaintext highlighter-rouge">vmap</code> shines — write the transformation for a single sequence, apply it across the entire batch automatically.</p>
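<p>The same shift, spelled out in NumPy for one tiny batch (the token values are made up):</p>

```python
import numpy as np

tokens = np.array([[11, 12, 13, 14],
                   [21, 22, 23, 24]])

# Drop the first token, append a pad (0) — target[t] is the token after input[t].
pad = np.zeros((tokens.shape[0], 1), dtype=tokens.dtype)
targets = np.concatenate([tokens[:, 1:], pad], axis=1)
```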

<hr />

<h2 id="part-5-training-loop">Part 5: Training Loop</h2>

<p>The training loop uses <code class="language-plaintext highlighter-rouge">nnx.value_and_grad</code> to compute loss and gradients in a single pass:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">nnx</span><span class="p">.</span><span class="n">jit</span>
<span class="k">def</span> <span class="nf">train_step</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">metrics</span><span class="p">,</span> <span class="n">batch</span><span class="p">):</span>
    <span class="n">input_ids</span><span class="p">,</span> <span class="n">target_ids</span> <span class="o">=</span> <span class="n">batch</span>

    <span class="k">def</span> <span class="nf">loss_fn</span><span class="p">(</span><span class="n">model</span><span class="p">):</span>
        <span class="n">logits</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">input_ids</span><span class="p">)</span>
        <span class="n">loss</span>   <span class="o">=</span> <span class="n">optax</span><span class="p">.</span><span class="n">softmax_cross_entropy_with_integer_labels</span><span class="p">(</span>
            <span class="n">logits</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">vocab_size</span><span class="p">),</span>
            <span class="n">target_ids</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="p">).</span><span class="n">mean</span><span class="p">()</span>
        <span class="k">return</span> <span class="n">loss</span><span class="p">,</span> <span class="n">logits</span>

    <span class="p">(</span><span class="n">loss</span><span class="p">,</span> <span class="n">logits</span><span class="p">),</span> <span class="n">grads</span> <span class="o">=</span> <span class="n">nnx</span><span class="p">.</span><span class="n">value_and_grad</span><span class="p">(</span><span class="n">loss_fn</span><span class="p">,</span> <span class="n">has_aux</span><span class="o">=</span><span class="bp">True</span><span class="p">)(</span><span class="n">model</span><span class="p">)</span>
    <span class="n">optimizer</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">grads</span><span class="p">)</span>
    <span class="n">metrics</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="n">loss</span><span class="p">,</span> <span class="n">logits</span><span class="o">=</span><span class="n">logits</span><span class="p">,</span> <span class="n">labels</span><span class="o">=</span><span class="n">target_ids</span><span class="p">)</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">@nnx.jit</code> decorator compiles the entire train step — forward pass, loss computation, gradient calculation, and weight update — into a single compiled XLA computation. The first call is slow (compilation); every subsequent call is fast.</p>

<p><em>How <code class="language-plaintext highlighter-rouge">@jax.jit</code> works — Python traces your function once, XLA compiles it, then every subsequent call skips Python entirely.</em></p>

<p><strong>A subtle bug to watch for in the training loop:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># WRONG — step only increments once per epoch
</span><span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">dataloader</span><span class="p">:</span>
    <span class="n">train_step</span><span class="p">(...)</span>
<span class="n">step</span> <span class="o">+=</span> <span class="mi">1</span>  <span class="c1"># ← outside the for loop
</span>
<span class="c1"># CORRECT — step increments every batch
</span><span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">dataloader</span><span class="p">:</span>
    <span class="n">train_step</span><span class="p">(...)</span>
    <span class="n">step</span> <span class="o">+=</span> <span class="mi">1</span>  <span class="c1"># ← inside the for loop
</span></code></pre></div></div>

<p>Python indentation bugs are silent and insidious in training loops.</p>

<hr />

<h2 id="part-6-checkpointing-with-orbax">Part 6: Checkpointing with Orbax</h2>

<p>Orbax is JAX’s native checkpointing library. Saving and restoring model state:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Save
</span><span class="n">checkpointer</span> <span class="o">=</span> <span class="n">orbax</span><span class="p">.</span><span class="n">checkpoint</span><span class="p">.</span><span class="n">PyTreeCheckpointer</span><span class="p">()</span>
<span class="n">checkpointer</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="n">checkpoint_path</span><span class="p">,</span> <span class="n">nnx</span><span class="p">.</span><span class="n">state</span><span class="p">(</span><span class="n">model</span><span class="p">))</span>

<span class="c1"># Restore
</span><span class="n">restored_state</span> <span class="o">=</span> <span class="n">checkpointer</span><span class="p">.</span><span class="n">restore</span><span class="p">(</span>
    <span class="n">checkpoint_path</span><span class="p">,</span>
    <span class="n">item</span><span class="o">=</span><span class="n">nnx</span><span class="p">.</span><span class="n">state</span><span class="p">(</span><span class="n">model</span><span class="p">),</span>
    <span class="n">restore_args</span><span class="o">=</span><span class="n">restore_args</span>
<span class="p">)</span>
<span class="n">nnx</span><span class="p">.</span><span class="n">update</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">restored_state</span><span class="p">)</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">nnx.state(model)</code> extracts the parameter pytree from the model. <code class="language-plaintext highlighter-rouge">nnx.update(model, restored_state)</code> writes it back in. The model architecture must match exactly — if you change <code class="language-plaintext highlighter-rouge">embed_dim</code> from 192 to 256, the checkpoint will fail to load because the weight shapes no longer match.</p>

<p>This also means you can load someone else’s checkpoint on your machine, instantly inheriting their training without running a single training step. This is how the 20M-token checkpoint used in this project was loaded and run on a fresh Colab session.</p>

<hr />

<h2 id="part-7-text-generation">Part 7: Text Generation</h2>

<p>Generation uses greedy decoding (argmax) with temperature scaling:</p>

<p><img src="/images/AutoGen.png" alt="Autoregressive generation" /></p>

<p><em>Autoregressive generation — the model predicts one token at a time, appending each prediction back to the input for the next step.</em></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">generate_text</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">start_tokens</span><span class="p">,</span> <span class="n">max_new_tokens</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mf">1.0</span><span class="p">):</span>
    <span class="n">tokens</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">start_tokens</span><span class="p">)</span>

    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_new_tokens</span><span class="p">):</span>
        <span class="n">context</span>      <span class="o">=</span> <span class="n">tokens</span><span class="p">[</span><span class="o">-</span><span class="n">model</span><span class="p">.</span><span class="n">maxlen</span><span class="p">:]</span>
        <span class="n">actual_len</span>   <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">context</span><span class="p">)</span>

        <span class="c1"># Right-pad to maxlen to match training
</span>        <span class="k">if</span> <span class="n">actual_len</span> <span class="o">&lt;</span> <span class="n">model</span><span class="p">.</span><span class="n">maxlen</span><span class="p">:</span>
            <span class="n">context</span> <span class="o">=</span> <span class="n">context</span> <span class="o">+</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">maxlen</span> <span class="o">-</span> <span class="n">actual_len</span><span class="p">)</span>

        <span class="n">context_array</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">context</span><span class="p">)[</span><span class="bp">None</span><span class="p">,</span> <span class="p">:]</span>
        <span class="n">logits</span>        <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">context_array</span><span class="p">)</span>

        <span class="c1"># Sample from the position of the LAST real token, not position 0
</span>        <span class="n">next_token_logits</span> <span class="o">=</span> <span class="n">logits</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="n">actual_len</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="p">:]</span> <span class="o">/</span> <span class="n">temperature</span>
        <span class="n">next_token</span>        <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">jnp</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">next_token_logits</span><span class="p">))</span>

        <span class="k">if</span> <span class="n">next_token</span> <span class="o">==</span> <span class="n">END_TOKEN_ID</span><span class="p">:</span>
            <span class="k">break</span>

        <span class="n">tokens</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">next_token</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span>
</code></pre></div></div>

<p>The line <code class="language-plaintext highlighter-rouge">logits[0, actual_len - 1, :]</code> is easy to get wrong. You want the logits at the position of the last real token — not position 0, and not the last padded position. Getting this wrong results in the model repeating the prompt with no new tokens generated.</p>
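<p>The pitfall is easy to reproduce with a toy logits tensor (NumPy, values contrived): only the slice at <code class="language-plaintext highlighter-rouge">actual_len - 1</code> carries the model's actual prediction.</p>

```python
import numpy as np

maxlen, vocab = 6, 10
actual_len = 3                          # 3 real tokens, then right-padding
logits = np.zeros((1, maxlen, vocab))
logits[0, actual_len - 1, 7] = 5.0      # pretend the model predicts token 7 here

right = int(np.argmax(logits[0, actual_len - 1, :]))  # position of the last REAL token
wrong = int(np.argmax(logits[0, -1, :]))              # last PADDED position: all zeros
```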

<p><strong>Temperature</strong> controls how peaked the probability distribution is. Note that with pure argmax decoding, as above, it cannot change which token is chosen — it only matters once you sample from the scaled distribution (e.g. with top-k):</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">temperature=0.2</code> → conservative, repetitive output</li>
  <li><code class="language-plaintext highlighter-rouge">temperature=1.0</code> → balanced</li>
  <li><code class="language-plaintext highlighter-rouge">temperature=1.5</code> → creative, sometimes incoherent</li>
</ul>
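<p>Temperature is just a divisor on the logits before the softmax; a tiny NumPy sketch (with contrived logits) shows how it reshapes a distribution:</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.0])
cold = softmax(logits / 0.2)   # sharpens: nearly all mass on the top logit
warm = softmax(logits / 1.0)   # the unscaled distribution
hot  = softmax(logits / 1.5)   # flattens: mass spreads to lower logits
```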

<hr />

<h2 id="results">Results</h2>

<p><img src="/images/generation_output.png" alt="Screenshot of generation" /></p>

<p>Trained on the TinyStories dataset with a 20M-token checkpoint, the model generates coherent short stories:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Prompt: "Once upon a time a big bear"
Output: "Once upon a time a big bear lived in the forest.
         He liked to walk and find berries. One day he
         met a little rabbit who was lost..."
</code></pre></div></div>

<p>The model learned basic story structure, character introduction, and simple cause-and-effect narrative — all from a 6-layer, 192-dimensional transformer.</p>

<hr />

<h2 id="whats-next">What’s Next</h2>

<ul>
  <li>Add layer normalisation to the transformer blocks (currently missing)</li>
  <li>Replace greedy decoding with top-k or nucleus sampling for more varied output</li>
  <li>Scale up: larger <code class="language-plaintext highlighter-rouge">embed_dim</code>, more blocks, more training data</li>
  <li>Experiment with RoPE positional embeddings instead of learned positions</li>
</ul>

<hr />

<h2 id="try-it-yourself">Try It Yourself</h2>

<p>The full code is on GitHub: <a href="https://github.com/Brayanbrayan/MinGPT-Implementation-with-Jax">MinGPT-Implementation-with-Jax</a></p>

<p>A Colab notebook is included — mount your Drive, run the cells, and you can load the pretrained checkpoint and start generating stories in under a minute.</p>

<hr />

<p><em>Built with JAX, Flax NNX, Optax, Orbax, Grain, and tiktoken.</em></p>]]></content><author><name>Brayan</name></author><summary type="html"><![CDATA[Building a GPT from Scratch in JAX/Flax]]></summary></entry><entry><title type="html">Why I Built an LLM From Scratch</title><link href="https://brayanbrayan.github.io/2026/03/03/why-i-built-an-llm-from-scratch.html" rel="alternate" type="text/html" title="Why I Built an LLM From Scratch" /><published>2026-03-03T00:00:00+00:00</published><updated>2026-03-03T00:00:00+00:00</updated><id>https://brayanbrayan.github.io/2026/03/03/why-i-built-an-llm-from-scratch</id><content type="html" xml:base="https://brayanbrayan.github.io/2026/03/03/why-i-built-an-llm-from-scratch.html"><![CDATA[<p>Coming soon.</p>]]></content><author><name>Brayan</name></author><summary type="html"><![CDATA[Coming soon.]]></summary></entry></feed>