Author Archives: wsmhy2011

Scaling Reinforcement Learning with Verifiable Reward (RLVR)
The biggest innovation from GPT-o1 is that it proves test-time scaling is another dimension besides scaling data and model parameters. Both RL and best-of-n share a common structure, differing only in when the optimization cost is paid: RL pays it during training, best-of-n pays it during inference…
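To make the training-time vs. test-time contrast concrete, here is a minimal best-of-n sketch; `generate` and `score` are hypothetical stand-ins for a sampler and a verifier or reward model, and the budget n=16 is arbitrary:

```python
def best_of_n(prompt, generate, score, n=16):
    """Spend test-time compute: sample n candidates and keep the one the
    verifier scores highest. The cost is paid per query at inference,
    whereas RL would bake the same search into the weights during training."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: score(prompt, y))
```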
LLM-RL Fine-Tuning – Math Collections
A humble attempt to establish a systematic, theoretical understanding of LLM RL fine-tuning. This is an initial effort to summarize how traditional RL loss formulations transition into those used in LLMs. Note that this is an ongoing list; I plan to gradually enrich it with more equations as the framework matures.
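As a hedged sketch of the transition the excerpt describes, compare the classic policy-gradient update with its token-level LLM analogue, where the state is the prompt plus the generated prefix and the action is the next token (the symbols x, y, A_t are my notation, not necessarily the post's):

```latex
% Classic policy gradient over trajectories:
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}
    \Big[ \textstyle\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t \Big]

% LLM analogue: state = prompt x plus generated prefix y_{<t}, action = token y_t:
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \Big[ \textstyle\sum_t \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})\, A_t \Big]
```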
State of SFT
In the previous post, we articulated the distinction between supervised fine-tuning (SFT) and instruction fine-tuning (IFT). As the field advanced through 2024, the focus shifted to applications. IFT is now widely regarded as alignment research, while SFT serves as a versatile tool to adapt generic LLM checkpoints to specific domains. For practitioners, SFT remains the…
LLM Agent Studies Chapter 2 – Self Reflection
What is Self-Reflection? Self-reflection is one of the key primitives for unlocking AI Agents. It’s already heavily studied in AI alignment research. The definition is super simple: in agentic workflows, an agent’s execution is defined as a sequence of state-action pairs, sketched below. The transition from step t to t+1 is governed by the environment (external feedback), while self-reflection is…
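A plausible form of that state-action sequence, in my own notation rather than the post's original equation (the symbols below are assumptions):

```latex
% An agent's execution as a sequence of state-action pairs, with
% transitions governed by the environment E (external feedback):
\tau = (s_0, a_0, s_1, a_1, \dots, s_T, a_T),
\qquad s_{t+1} \sim \mathcal{E}(\,\cdot \mid s_t, a_t\,)
```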
LLM Agent Studies Chapter 1 – Cognitive Architectures
Agents are such a fascinating topic in 2024 that it will take a long series of posts to cover them. In Chapter 1, I’ll put out a conceptual view of the Agent; later chapters will dive into detailed design and ML topics. Before we talk about Agents, let me ask a question first: Why Is Modern Software…
LLM Studies (Part 4) – Reinforcement Learning from Human Feedback (RLHF)
Reinforcement Learning from Human Feedback (RLHF) has been identified as the key innovation for improving LLM alignment. It is regarded as the key proprietary technique held by OpenAI and Anthropic for achieving superior performance on universal tasks with their chatbots. Some papers that have disclosed their RLHF training details are Learning to summarize, Harmless &…
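For reference, the KL-regularized objective these RLHF papers optimize, in its standard form (β is the KL coefficient; the notation is the conventional one, not quoted from the post):

```latex
% Maximize the reward-model score r_phi while staying close, in KL,
% to the supervised reference policy pi_ref:
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \big[ r_\phi(x, y) \big]
\;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
```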
LLM Studies (Part 3) – Prompt Tuning
In this post we talk about an interesting direction in parameter-efficient tuning for LLMs: prompt tuning, i.e., using gradient information from the LLM to construct soft prompts. Compared with fine-tuning, prompt tuning is much cheaper for LLM users adapting to their downstream tasks. Here we experimented with prompt tuning initialized from human-engineered prompts; the inspiration comes…
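A minimal sketch of the soft-prompt idea in PyTorch, assuming a frozen LLM that accepts precomputed input embeddings; the class name, shapes, and the 0.02 init scale are illustrative choices, not the post's code:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """n_tokens learnable prompt vectors prepended to the input embeddings.
    Only these parameters receive gradients; the LLM itself stays frozen."""
    def __init__(self, n_tokens, embed_dim, init_embeds=None):
        super().__init__()
        if init_embeds is not None:
            # warm-start from the embeddings of a human-engineered prompt
            self.prompt = nn.Parameter(init_embeds.clone())
        else:
            self.prompt = nn.Parameter(torch.randn(n_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds):
        # input_embeds: (batch, seq_len, embed_dim); prepend the soft prompt
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)
```

Training would pass the concatenated embeddings to the frozen model (e.g. via an inputs_embeds-style argument, where the model exposes one) and backpropagate only into the prompt parameters.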
LLM Studies (Part 2) — Understanding Fine-tuning
Fine-tuning LLMs is no doubt the hottest topic of early 2023. Thanks to the release of LLaMA, so many high-impact fine-tuning projects emerged within a couple of months: Alpaca, Dolly, Vicuna… (apparently exhausting the list, thanks to Facebook’s unique taste for this species). It’s worth noting that the current “fine-tuning” is fundamentally different from the…
LLM Studies (Part 1) — Pre-training Feasibility
This is a series of posts discussing the technical details of LLMs. As an early practitioner of LLMs, I would like to share a series of technical reports with the community as my work progresses. For LLM pre-training, I want to focus on “feasibility” in this blog post. Summarizing the training recipe from published papers is easy; in fact…
Software Platform 2.0 – Large Scale Machine Learning Platform Design
I’m not talking about Andrej Karpathy’s “Software 2.0”, the ML-as-software idea; here I’m trying to lay out the new paradigm of ML platforms, which poses very different development challenges from traditional software development. Revisiting traditional software development first, let’s look at two popular software platform architectures. The principle is to split a “monolithic” design into singular function…