Blog

Scaling Reinforcement Learning with Verifiable Reward (RLVR)

The Basics: Post-training – Scaling Test-Time Compute. The biggest innovation from OpenAI's o1 is that it demonstrates test-time scaling is another dimension beyond scaling data and model parameters. Both RL and best-of-n therefore share a common structure, differing only in when the optimization cost is paid: RL pays it during training, best-of-n pays it during…
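To make the cost comparison concrete, here is a minimal best-of-n sketch in Python, assuming hypothetical generate and score callables (a sampler and a verifier/reward model; neither name comes from the post). The optimization is paid at inference: n samples are drawn and a verifier picks the winner, whereas RL would spend that compute in training so a single sample suffices at inference.

# Minimal best-of-n sketch: the optimization cost is paid at inference time.
# `generate` and `score` are hypothetical stand-ins for a sampler and a
# verifier/reward model; they are not from the original post.
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # samples one candidate answer
    score: Callable[[str, str], float],  # verifier / reward model
    n: int = 16,
) -> str:
    """Sample n candidates and return the highest-scoring one."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

With n = 1 this degenerates to ordinary decoding; raising n trades inference compute for answer quality.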

LLM-RL Fine-Tuning – Math Collections

A humble attempt to establish a systematic, theoretical understanding of LLM RL fine-tuning. This is an initial effort to summarize how traditional RL loss formulations transition into those used in LLMs. Note that this is an ongoing list; I plan to gradually enrich it with more equations as the framework matures.
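As one illustration of the kind of transition the post catalogs (my own rendering, not necessarily the post's notation): the classic REINFORCE policy gradient over full trajectories,

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \big]

becomes a token-level objective in the LLM setting, treating the response y = (y_1, \dots, y_T) to prompt x as the trajectory:

\nabla_\theta J(\theta) = \mathbb{E}_{x,\, y \sim \pi_\theta(\cdot \mid x)}\Big[ R(x, y) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \Big]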

State of SFT

In the previous post, we articulated the distinction between supervised fine-tuning (SFT) and instruction fine-tuning (IFT). As the field advanced through 2024, the focus shifted to applications. IFT is now widely regarded as alignment research, while SFT serves as a versatile tool to adapt generic LLM checkpoints to specific domains. For practitioners, SFT remains the…

LLM Agent Studies Chapter 2 – Self-Reflection

What is Self-Reflection? Self-reflection is one of the key primitives for unlocking AI agents. It's already heavily studied in AI alignment research. The definition is simple: in agentic workflows, an agent's execution is defined as a sequence of state-action pairs, \tau = (s_0, a_0, s_1, a_1, \dots, s_t, a_t). Here the transition from step t to t+1 is governed by the environment (external feedback); self-reflection is…
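A minimal sketch of that loop, assuming hypothetical policy, env_step, and reflect functions (placeholders, not the post's design): the environment drives the s_t to s_{t+1} transition, while reflection turns external feedback into notes the agent conditions on at the next step.

# Minimal self-reflection loop sketch; `policy`, `env_step`, and `reflect`
# are hypothetical placeholders, not APIs from the original post.
def run_episode(policy, env_step, reflect, s0, max_steps=10):
    trajectory = []   # sequence of (state, action) pairs
    memory = ""       # accumulated self-reflection notes
    state = s0
    for t in range(max_steps):
        action = policy(state, memory)                        # agent picks a_t
        next_state, feedback, done = env_step(state, action)  # environment: s_t -> s_{t+1}
        trajectory.append((state, action))
        memory = reflect(trajectory, feedback, memory)        # agent critiques its own behavior
        state = next_state
        if done:
            break
    return trajectory, memory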

LLM Agent Studies Chapter 1 – Cognitive Architectures

Agent is such a fascinating topic in 2024 that it will take a long series of posts to cover. In Chapter 1, I'll lay out a conceptual view of Agent; later chapters will dive into detailed design and ML topics. Before we talk about Agent, let me ask a question first: Why Is Modern Software…

LLM Studies (Part 4) – Reinforcement Learning from Human Feedback (RLHF)

Background: "Reinforcement Learning from Human Feedback" (RLHF) has been identified as the key innovation for improving LLM alignment. It is regarded as the key proprietary technique that lets OpenAI and Anthropic achieve superior performance on universal tasks with their chatbots. Some papers that have disclosed their RLHF training details are Learning to summarize, Harmless &…

LLM Studies (Part 3) – Prompt Tuning

In this post we talk about an interesting direction in parameter-efficient tuning for LLMs: prompt-tuning, meaning the use of gradient information from the LLM to construct soft prompts. Compared with fine-tuning, prompt-tuning is much cheaper for LLM users adapting to their downstream tasks. Here we experimented with prompt-tuning initialized from human-engineered prompts; the inspiration comes…
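A minimal PyTorch-style sketch of the soft-prompt idea, with illustrative names and shapes that are not taken from the post: only a small matrix of prompt embeddings is trained and prepended to the frozen model's input embeddings, optionally initialized from the embeddings of a human-engineered prompt.

# Soft-prompt sketch in PyTorch; names and shapes are illustrative,
# not taken from the original post. Only `soft_prompt` receives gradients;
# the base LLM stays frozen.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_tokens: int, d_model: int, init_emb: torch.Tensor = None):
        super().__init__()
        if init_emb is not None:
            # Initialize from the embeddings of a human-engineered prompt.
            self.soft_prompt = nn.Parameter(init_emb.clone())
        else:
            self.soft_prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend the trainable prompt to each sequence in the batch.
        batch = input_embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

Because only the prompt matrix is optimized, each downstream task costs a few thousand parameters instead of a full model copy.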

LLM Studies (Part 2) — Understanding Fine-tuning

Finetuning LLMs is no doubt the hottest topic of early 2023. Thanks to the release of LLaMA, so many high-impact fine-tuning projects emerged within a couple of months: Alpaca, Dolly, Vicuna… (apparently exhausting the list, thanks to Facebook's unique taste for this species). It's worth noting that the incumbent "fine-tuning" is fundamentally different from the…

LLM Studies (Part 1) — Pre-training Feasibility

This is a series of posts discussing technical details of LLMs. As an early practitioner of LLMs, I would like to share a series of technical reports with the community as my work progresses. For LLM pre-training, I want to focus on "feasibility" in this blog post. Summarizing the training recipe from published papers is easy; in fact…

Software Platform 2.0 – Large Scale Machine Learning Platform Design

I'm not talking about Software 2.0 by Andrej Karpathy, the ML-as-software idea; here I'm trying to lay out the new paradigm of ML platforms, which poses very different development challenges from traditional software development. Revisit Traditional Software Development: Let's look at two popular software platform architectures. The principle is to split the "monolithic" design into singular function…
