arXiv 2024 (v5, Dec 2025)
My reading notes
A single, current, mathematically rigorous reference connecting classical RL to the LLM post-training methods (RLHF, RLVR, GRPO/PPO/DPO, reasoning models) that Arun needs for the pi-grpo stack and RS/ML-engineer interviews. It also frames optimization, active learning, and Bayesian optimization as sequential decision problems, which ties in with his active-learning surrogate and anomaly-platform work.
This is a book-length tutorial monograph (253 pages, v5 dated Dec 3, 2025) by Kevin Murphy that surveys modern reinforcement learning from first principles to the frontier. It opens by framing RL as sequential decision making under the maximum expected utility principle and lays out the canonical model zoo: POMDPs, MDPs, goal-conditioned and contextual MDPs, contextual bandits, and belief-state MDPs. It also observes that black-box optimization, Bayesian optimization, active learning, and even SGD can all be read as decision problems.
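For the fully observed MDP case, the maximum expected utility principle reduces to the familiar discounted-return objective. The symbols below (policy π, discount γ, reward function R, state s_t, action a_t) follow common textbook convention and are an assumption of these notes, not necessarily the monograph's own notation.

```latex
\pi^{*} \in \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \, R(s_t, a_t) \right],
\qquad 0 \le \gamma < 1
```

The other members of the model zoo can be read as variations on this objective: a POMDP replaces the state with an observation history or belief, and a contextual bandit is the one-step case in which the action does not influence the next context.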
The core chapters proceed through the standard taxonomy. Value-based RL covers Bellman equations, value/policy iteration, Monte Carlo and TD(λ) learning, SARSA, and Q-learning with its deep extensions (DQN with experience replay and target networks, the deadly triad, double and dueling DQN, Rainbow, hindsight relabeling). Policy-based RL covers policy gradients, the policy gradient theorem, actor-critic, trust-region/proximal methods, off-policy variants, gradient-free optimization, and the RL-as-inference view. Model-based RL covers decision-time planning (MCTS-style methods, MuZero), background planning, learned world models, and predictive representations. A multi-agent chapter treats games, solution concepts, and algorithms.
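To make the value-based material concrete, here is a minimal sketch of value iteration, which repeatedly applies the Bellman optimality backup until the values stop changing. The three-state chain MDP, the names `CHAIN`, `q_values`, and `value_iteration`, and the in-place (Gauss-Seidel style) update order are all illustrative assumptions of these notes, not taken from the monograph.

```python
# Minimal value iteration on a toy MDP (illustrative; not from the monograph).
# P[s][a] is a list of (probability, next_state, reward) outcomes.

def q_values(P, V, s, gamma):
    """One-step lookahead: expected reward plus discounted value of next state."""
    return [
        sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
        for outcomes in P[s]
    ]

def value_iteration(P, gamma=0.9, tol=1e-10, max_iters=10_000):
    V = [0.0] * len(P)
    for _ in range(max_iters):
        delta = 0.0
        for s in range(len(P)):
            best = max(q_values(P, V, s, gamma))  # Bellman optimality backup
            delta = max(delta, abs(best - V[s]))
            V[s] = best                           # in-place update (an assumption)
        if delta < tol:
            break
    # Greedy policy with respect to the converged values (ties -> lowest action).
    policy = []
    for s in range(len(P)):
        q = q_values(P, V, s, gamma)
        policy.append(q.index(max(q)))
    return V, policy

# Hypothetical chain: action 0 = stay, action 1 = move right.
# Reaching state 2 pays reward 1; state 2 is absorbing with zero reward.
CHAIN = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 0.0)]],  # state 0
    [[(1.0, 1, 0.0)], [(1.0, 2, 1.0)]],  # state 1
    [[(1.0, 2, 0.0)], [(1.0, 2, 0.0)]],  # state 2 (absorbing)
]

V, policy = value_iteration(CHAIN)
print(V, policy)
```

With a discount of 0.9, state 1 is worth 1 (one step from the reward) and state 0 is worth 0.9 (two steps away), so the greedy policy moves right in both. Q-learning estimates the same quantities from sampled transitions instead of a known model `P`.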
The chapter most relevant to current practice is the one on LLMs and RL, which is unusually thorough for a survey. It walks through RL fine-tuning vs SFT, reward models (process vs outcome), RLHF via the Bradley-Terry preference model, RL with verifiable rewards (RLVR) for math and code, reasoning/thinking models, and the families of policy-optimization algorithms now standard in post-training: PPO, GRPO and its Dr-GRPO correction, VinePPO, DPO and direct-alignment variants, KL (and chi-squared) regularization, and best-of-N. Later chapters cover regret minimization, exploration-exploitation, distributional RL, intrinsic motivation, hierarchical RL, imitation and offline RL, and a short note on general RL/AIXI. The treatment is notation-consistent and citation-dense; it presents no new results and functions instead as both a teaching text and a map into the primary literature.
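As a reminder of what separates GRPO from PPO in practice, the sketch below computes group-relative advantages: each sampled response to a prompt is scored against the mean reward of its own group, with no learned critic. The function name `group_advantages`, the `normalize_std` flag, the use of the population standard deviation, and the `eps` constant are assumptions of these notes; real implementations differ in these details. Setting `normalize_std=False` corresponds to the advantage used in the Dr-GRPO correction, which drops the division by the group standard deviation (Dr-GRPO also removes the per-response length normalization in the loss, which is not shown here).

```python
import math

def group_advantages(rewards, normalize_std=True, eps=1e-8):
    """Group-relative advantages for one prompt's sampled responses.

    rewards: list of scalar rewards, one per sampled response.
    normalize_std=True  -> GRPO-style: (r - mean) / (std + eps)
    normalize_std=False -> Dr-GRPO-style: r - mean (no std division)
    """
    mean = sum(rewards) / len(rewards)
    centered = [r - mean for r in rewards]
    if not normalize_std:
        return centered
    # Population standard deviation (an assumption; some code uses the sample std).
    std = math.sqrt(sum(c * c for c in centered) / len(rewards))
    return [c / (std + eps) for c in centered]

# Hypothetical verifiable rewards: 1.0 if the answer checked out, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
grpo_adv = group_advantages(rewards)
dr_grpo_adv = group_advantages(rewards, normalize_std=False)
print(grpo_adv)
print(dr_grpo_adv)
```

One thing the sketch makes visible: when every response in a group receives the same reward, all centered rewards are zero, so that prompt contributes no learning signal under either variant.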