Scaling Reinforcement Learning with Verifiable Reward (RLVR)

The Basics of Post-training – Scaling Test-Time Compute. The biggest innovation from GPT-o1 is that it proves test-time scaling is another dimension besides scaling data and model parameters. Both RL and best-of-n share a common structure, differing only in when the cost of optimization is paid: RL pays it during training, best-of-n pays it during inference.
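To make the contrast concrete, here is a minimal best-of-n sketch under a verifiable reward; generate and verify are hypothetical placeholders for a sampler and a reward checker, not any particular library's API:

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              verify: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidates at inference time and keep the highest-scoring one."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    # The verifier (e.g., a unit-test runner or exact-match checker) assigns
    # each candidate a scalar reward; best-of-n simply takes the argmax.
    return max(candidates, key=lambda c: verify(prompt, c))
```

An RLVR setup would instead feed the same verifier signal back as a training reward, so the optimization cost is paid before inference rather than at it.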

State of SFT

In the previous post, we articulated the distinction between supervised fine-tuning (SFT) and instruction fine-tuning (IFT). As the field advanced through 2024, the focus shifted to applications: IFT is now widely regarded as alignment research, while SFT serves as a versatile tool for adapting generic LLM checkpoints to specific domains. For practitioners, SFT remains the …

LLM Agent Studies Chapter 2 – Self Reflection

What is Self-Reflection? Self-reflection is one of the key primitives for unlocking AI Agents, and it is already heavily studied in AI Alignment research. The definition is simple: in agentic workflows, an agent's execution is defined as a sequence of state-action pairs (see the sketch below). The transition from step t-1 to t is governed by the environment (external feedback); self-reflection is …
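A minimal sketch of that sequence, assuming the usual state-action notation s_t, a_t and writing the environment's transition as \mathcal{E} (symbols are my own, not from the post):

```latex
% Trajectory as a sequence of state-action pairs.
% \mathcal{E} is an assumed symbol for the environment (external feedback)
% that governs the transition from step t-1 to step t.
\tau = (s_0, a_0, s_1, a_1, \dots, s_T, a_T),
\qquad s_t \sim \mathcal{E}(\,\cdot \mid s_{t-1}, a_{t-1}).
```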

LLM Agent Studies Chapter 1 – Cognitive Architectures

Agents are such a fascinating topic in 2024 that it will take a long series of posts to cover them. In Chapter 1, I'll lay out a conceptual view of an Agent; later chapters will dive into detailed design and ML topics. Before we talk about Agents, let me ask a question first: Why Is Modern Software …