Reinforcement Learning: An Overview

Kevin P. Murphy

arXiv 2024 (v5, Dec 2025) · 2025 · ★★★★½ 4.5/5

My reading notes

Why it matters

A single, current, mathematically rigorous reference that connects classical RL to the LLM post-training methods (RLHF, RLVR, GRPO/PPO/DPO, reasoning models) Arun needs for the pi-grpo stack and RS/ML-engineer interviews. Also frames optimization, active learning, and Bayesian optimization as sequential decision problems, which ties to his active-learning surrogate and anomaly-platform work.

Summary

This is a book-length tutorial monograph (253 pages, v5 dated Dec 3, 2025) by Kevin Murphy that surveys modern reinforcement learning from first principles to the frontier. It opens by framing RL as sequential decision making under the maximum expected utility principle, laying out the canonical model zoo: POMDPs, MDPs, goal-conditioned and contextual MDPs, contextual bandits, and belief-state MDPs, plus the observation that black-box optimization, Bayesian optimization, active learning, and even SGD can all be read as decision problems.

The core chapters proceed through the standard taxonomy. Value-based RL covers Bellman equations, value/policy iteration, Monte Carlo and TD(lambda) learning, SARSA, and Q-learning with its deep extensions (DQN, experience replay, the deadly triad, target networks, double/dueling/Rainbow, hindsight relabeling). Policy-based RL covers policy gradients, the policy gradient theorem, actor-critic, trust-region/proximal methods, off-policy variants, gradient-free optimization, and the RL-as-inference view. Model-based RL covers decision-time planning (MCTS-style, MuZero), background planning, learned world models, and predictive representations. A multi-agent chapter treats games, solution concepts, and algorithms.
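
To make the value-based thread concrete, here is a minimal sketch of one tabular Q-learning update, the simplest member of the family that DQN extends. It follows the textbook rule, not the monograph's notation. The function name, the dictionary-backed table, and the toy states and actions are illustrative assumptions.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning step and return the TD error.

    Update rule: Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    """
    # Bootstrap from the greedy value of the next state (this is what makes
    # Q-learning off-policy); terminal transitions bootstrap from zero.
    next_value = 0.0 if done else max(q[(next_state, a)] for a in actions)
    td_error = reward + gamma * next_value - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


# Toy usage: an all-zero table and a single rewarded transition.
q = defaultdict(float)
td = q_learning_update(q, state=0, action="right", reward=1.0,
                       next_state=1, actions=["left", "right"])
```

With an all-zero table the TD error equals the reward, so the updated entry moves a fraction alpha of the way toward it. Deep variants replace the table with a network and add replay and target networks to keep the same bootstrapped target stable.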

The chapter most relevant to current practice is LLMs and RL, which is unusually thorough for a survey. It walks through RL fine-tuning vs SFT, reward models (process vs outcome), RLHF via the Bradley-Terry preference model, RL with verifiable rewards (RLVR) for math and code, reasoning/thinking models, and the families of policy-optimization algorithms now standard in post-training: PPO, GRPO and its Dr-GRPO correction, VinePPO, DPO and direct-alignment variants, KL (and chi-squared) regularization, and best-of-N. Later chapters cover regret minimization, exploration-exploitation, distributional RL, intrinsic motivation, hierarchical RL, imitation and offline RL, and a short note on general RL/AIXI. The treatment is notation-consistent and citation-dense, functioning as both a teaching text and a map into the primary literature rather than presenting new results.
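
As a reminder of what "RLHF via the Bradley-Terry preference model" amounts to, the sketch below scores a single preference pair. It uses the standard Bradley-Terry form, where the probability that the chosen response wins is the sigmoid of the reward difference. The function names and scalar inputs are illustrative assumptions; in practice the rewards come from a learned reward model over batches.

```python
import math


def bt_preference_prob(reward_chosen, reward_rejected):
    """P(chosen is preferred) = sigmoid(r_chosen - r_rejected)."""
    margin = reward_chosen - reward_rejected
    # Branch on the sign so math.exp never receives a large positive argument.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    z = math.exp(margin)
    return z / (1.0 + z)


def bt_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the observed preference.

    Computed as softplus(-margin) in a numerically stable form.
    """
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the two rewards are equal the model is indifferent (probability 0.5, loss log 2). Training the reward model pushes the margin up on pairs where humans preferred the chosen response.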

Key ideas

Unifies RL under one universal sequential-decision framework (agent internal state, policy, world model) and shows bandits, BayesOpt, active learning, and SGD as special cases
Complete modern coverage of the value/policy/model-based/multi-agent taxonomy, including the deep-RL stabilization tricks (deadly triad, target networks, double/dueling/Rainbow DQN)
A standout LLMs-and-RL chapter: RLHF (Bradley-Terry), RLVR for math/code, process vs outcome reward models, and reasoning/thinking models
Side-by-side derivations of the LLM post-training algorithm family: PPO, GRPO, Dr-GRPO, VinePPO, DPO, with KL / chi-squared regularization and best-of-N
Makes the conceptual case for RL over SFT: verifying an answer is often easier than generating one, which opens a path to super-human performance
Acts as an annotated map into the primary literature; citation-dense, notation-consistent, no new empirical results
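
For the direct-alignment branch of that algorithm family, a minimal sketch of the standard DPO loss on one preference pair. It takes sequence log-probabilities under the policy and under a frozen reference model. The argument names and the default beta are illustrative assumptions, not values taken from the monograph.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    loss = -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                - (log pi(y_l) - log pi_ref(y_l))])
    """
    # Implicit rewards are the log-ratios of policy to reference.
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(margin) == softplus(-margin), written stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

If the policy equals the reference, both log-ratios are zero and the loss is log 2. The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, with beta controlling how strongly the policy is tied to the reference.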

Takeaways for my work

Use as the canonical reference for the pi-grpo work: the GRPO -> Dr-GRPO advantage-normalization fix and VinePPO baselines are spelled out with the exact estimators worth implementing/citing
The RLVR and process-vs-outcome-reward framing is directly reusable for designing verifiable reward signals in physics-informed or anomaly-detection RL loops where ground truth is checkable
The bandit / BayesOpt / active-learning-as-RL framing connects to Arun's active-learning surrogate and sampling work; the belief-state and exploration-exploitation chapters give principled acquisition-policy language
A high-signal interview-prep resource for RS/ML-engineer roles: covers the post-training stack (RLHF, DPO, PPO/GRPO) and classical RL at a depth that maps to common system-design and ML-fundamentals questions
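
A rough sketch tying the first two takeaways together: a checkable reward feeding group-relative advantages, with and without the standard-deviation normalization that Dr-GRPO drops. The exact-match checker is a hypothetical stand-in for a real verifier. Using the population standard deviation and this eps value is an assumption, since implementations differ. Only the advantage term is shown; as I understand the published method, Dr-GRPO also removes the per-response length normalization in the loss.

```python
import statistics


def exact_match_reward(completion, reference):
    """Hypothetical verifiable reward: 1.0 if the answer matches, else 0.0."""
    return 1.0 if completion.strip() == reference.strip() else 0.0


def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: centre on the group mean, divide by group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr-GRPO-style advantage: centre on the group mean only, no std scaling."""
    mean = statistics.fmean(rewards)
    return [r - mean for r in rewards]


# One prompt, a group of four sampled completions, reference answer "42".
completions = ["42", "41", " 42 ", "7"]
rewards = [exact_match_reward(c, "42") for c in completions]
grpo_adv = grpo_advantages(rewards)
dr_grpo_adv = dr_grpo_advantages(rewards)
```

Dividing by the group standard deviation inflates advantages for prompts where nearly all samples agree (very easy or very hard questions). That is the bias the mean-only variant avoids.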

reinforcement-learning · LLM-post-training · RLHF-RLVR · policy-optimization · survey