Blog

Scaling Reinforcement Learning with Verifiable Reward (RLVR)

The Basics: Post-training – Scaling Test-Time Compute. The biggest innovation from OpenAI's o1 is that it demonstrates test-time scaling is another dimension beyond scaling data and model parameters. Both RL and best-of-n therefore share a common structure, differing only in when the optimization cost is paid: RL pays it during training, best-of-n pays it during…
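To make the cost comparison concrete, here is a minimal best-of-n sketch in Python, assuming hypothetical generate and score callables (a sampler and a verifier/reward model; neither name comes from the post). The optimization is paid at inference: n samples are drawn and a verifier picks the winner, whereas RL would spend that compute in training so a single sample suffices at inference.

# Minimal best-of-n sketch: the optimization cost is paid at inference time.
# `generate` and `score` are hypothetical stand-ins for a sampler and a
# verifier/reward model; they are not from the original post.
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # samples one candidate answer
    score: Callable[[str, str], float],  # verifier / reward model
    n: int = 16,
) -> str:
    """Sample n candidates and return the highest-scoring one."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

With n = 1 this degenerates to ordinary decoding; raising n trades inference compute for answer quality.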

LLM-RL Fine-Tuning – Math Collections

A humble attempt to establish a systematic, theoretical understanding of LLM RL fine-tuning. This is an initial effort to summarize how traditional RL loss formulations transition into those used in LLMs. Note that this is an ongoing list; I plan to gradually enrich it with more equations as the framework matures.
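As one illustration of the kind of transition the post catalogs (my own rendering, not necessarily the post's notation): the classic REINFORCE policy gradient over full trajectories,

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \big]

becomes a token-level objective in the LLM setting, treating the response y = (y_1, \dots, y_T) to prompt x as the trajectory:

\nabla_\theta J(\theta) = \mathbb{E}_{x,\, y \sim \pi_\theta(\cdot \mid x)}\Big[ R(x, y) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \Big]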

State of SFT

In the previous post, we articulated the distinction between supervised fine-tuning (SFT) and instruction fine-tuning (IFT). As the field advanced through 2024, the focus shifted to applications. IFT is now widely regarded as alignment research, while SFT serves as a versatile tool to adapt generic LLM checkpoints to specific domains. For practitioners, SFT remains the…

LLM Agent Studies Chapter 2 – Self-Reflection

What is Self-Reflection? Self-reflection is one of the key primitives for unlocking AI agents. It's already heavily studied in AI alignment research. The definition is simple: in agentic workflows, an agent's execution is defined as a sequence of state-action pairs, \tau = (s_0, a_0, s_1, a_1, \dots, s_t, a_t). Here the transition from step t to t+1 is governed by the environment (external feedback); self-reflection is…
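A minimal sketch of that loop, assuming hypothetical policy, env_step, and reflect functions (placeholders, not the post's design): the environment drives the s_t to s_{t+1} transition, while reflection turns external feedback into notes the agent conditions on at the next step.

# Minimal self-reflection loop sketch; `policy`, `env_step`, and `reflect`
# are hypothetical placeholders, not APIs from the original post.
def run_episode(policy, env_step, reflect, s0, max_steps=10):
    trajectory = []   # sequence of (state, action) pairs
    memory = ""       # accumulated self-reflection notes
    state = s0
    for t in range(max_steps):
        action = policy(state, memory)                        # agent picks a_t
        next_state, feedback, done = env_step(state, action)  # environment: s_t -> s_{t+1}
        trajectory.append((state, action))
        memory = reflect(trajectory, feedback, memory)        # agent critiques its own behavior
        state = next_state
        if done:
            break
    return trajectory, memory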

LLM Agent Studies Chapter 1 – Cognitive Architectures

Agent is such a fascinating topic in 2024 that it will take a long series of posts to cover. In Chapter 1, I'll lay out a conceptual view of Agent; later chapters will dive into detailed design and ML topics. Before we talk about Agent, let me ask a question first: Why Is Modern Software…

LLM Studies (Part 4) – Reinforcement Learning from Human Feedback (RLHF)

Background: "Reinforcement Learning from Human Feedback" (RLHF) has been identified as the key innovation for improving LLM alignment. It is regarded as the key proprietary technique that lets OpenAI and Anthropic achieve superior performance on universal tasks with their chatbots. Some papers that have disclosed their RLHF training details are Learning to summarize, Harmless &…

LLM Studies (Part 3) – Prompt Tuning

In this post we talk about an interesting direction in parameter-efficient tuning for LLMs: prompt-tuning, meaning the use of gradient information from the LLM to construct soft prompts. Compared with fine-tuning, prompt-tuning is much cheaper for LLM users adapting to their downstream tasks. Here we experimented with prompt-tuning initialized from human-engineered prompts; the inspiration comes…
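A minimal PyTorch-style sketch of the soft-prompt idea, with illustrative names and shapes that are not taken from the post: only a small matrix of prompt embeddings is trained and prepended to the frozen model's input embeddings, optionally initialized from the embeddings of a human-engineered prompt.

# Soft-prompt sketch in PyTorch; names and shapes are illustrative,
# not taken from the original post. Only `soft_prompt` receives gradients;
# the base LLM stays frozen.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_tokens: int, d_model: int, init_emb: torch.Tensor = None):
        super().__init__()
        if init_emb is not None:
            # Initialize from the embeddings of a human-engineered prompt.
            self.soft_prompt = nn.Parameter(init_emb.clone())
        else:
            self.soft_prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend the trainable prompt to each sequence in the batch.
        batch = input_embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

Because only the prompt matrix is optimized, each downstream task costs a few thousand parameters instead of a full model copy.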

LLM Studies (Part 2) — Understanding Fine-tuning

Finetuning LLMs is no doubt the hottest topic of early 2023. Thanks to the release of LLaMA, so many high-impact fine-tuning projects emerged within a couple of months: Alpaca, Dolly, Vicuna… (apparently exhausting the list, thanks to Facebook's unique taste for this species). It's worth noting that the incumbent "fine-tuning" is fundamentally different from the…

LLM Studies (Part 1) — Pre-training Feasibility

This is a series of posts discussing technical details of LLMs. As an early practitioner of LLMs, I would like to share a series of technical reports with the community as my work progresses. For LLM pre-training, I want to focus on "feasibility" in this blog post. Summarizing the training recipe from published papers is easy; in fact…

Software Platform 2.0 – Large Scale Machine Learning Platform Design

I'm not talking about Software 2.0 by Andrej Karpathy, the ML-as-software idea; here I'm trying to lay out the new paradigm of ML platforms, which poses very different development challenges from traditional software development. Revisit Traditional Software Development: Let's look at two popular software platform architectures. The principle is to split the "monolithic" design into singular function…
