Digestly

Mar 23, 2025

s1: Simple test-time scaling: Just “wait…” + 1,000 training examples? | PAPER EXPLAINED

AI Coffee Break with Letitia

The discussion highlights a method for training large language models (LLMs) to output reasoning chains using only a thousand carefully selected examples, rather than millions. The approach relies on supervised fine-tuning and distillation, where a model is trained on the output of another model. The authors started from 59,000 challenging questions drawn from various academic fields and standardized tests, generated reasoning traces with Google's Gemini Flash Thinking model, and filtered these down to a thousand examples, favoring hard questions with longer reasoning traces. The resulting model, s1, was fine-tuned on these examples and achieved high accuracy on benchmarks such as MATH500 and AIME24, outperforming some existing models. The authors also introduced a test-time scaling trick called 'budget forcing': the model is nudged to keep reasoning by replacing the end-of-thinking token with 'wait', which improves accuracy. By sharply reducing the data requirement for training, the method makes high-performance reasoning models more accessible to smaller organizations.

Key Points:

  • A thousand well-chosen examples can effectively train LLMs for reasoning tasks, reducing the need for massive datasets.
  • Supervised fine-tuning and distillation are used instead of reinforcement learning, making the process more efficient.
  • The model s1 achieved 92.6% accuracy on MATH500 and improved its AIME24 score with a test-time trick.
  • 'Budget forcing' involves replacing the end-of-thinking token with 'wait' to enhance reasoning accuracy.
  • This approach democratizes access to high-performance models, allowing smaller entities to compete with limited data resources.

Details:

1. 🔍 Introduction to Efficient LLM Training

1.1. Optimal Number of Training Examples

1.2. Test-Time Compute Trick

2. 🧠 Beyond Massive Data and Models

  • Research indicates that impressive reasoning performance can be achieved through supervised fine-tuning on another model's outputs, eliminating the need for reinforcement learning (a minimal sketch follows this list).
  • Models such as OpenAI's o1 and DeepSeek R1 rely on massive datasets and reinforcement learning; OpenAI, however, does not disclose its exact methodology.
  • Training LLMs to output reasoning chains has traditionally required very large numbers of examples plus reinforcement learning.
  • DeepSeek R1 exposes its reasoning as visible chains, whereas OpenAI's o1 keeps the chains hidden and only shows the user a summary.
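
To make the contrast concrete, here is a minimal sketch of distillation via supervised fine-tuning: the student model is trained with a plain next-token loss on text containing a teacher's reasoning trace. The model name, data format, and hyperparameters below are illustrative placeholders, not the paper's exact setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
import torch

# Small stand-in student model; s1 itself fine-tunes a much larger base model.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

# Each training example concatenates a question with the teacher's
# reasoning trace and final answer, so the student imitates the full chain.
examples = [
    "Question: If 2x + 3 = 11, what is x?\n"
    "<think>Subtract 3 from both sides: 2x = 8. Divide by 2: x = 4.</think>\n"
    "Answer: 4"
]

class TraceDataset(torch.utils.data.Dataset):
    """Wraps teacher-generated (question, trace, answer) strings for SFT."""
    def __init__(self, texts):
        self.enc = [tokenizer(t, truncation=True, max_length=2048) for t in texts]
    def __len__(self):
        return len(self.enc)
    def __getitem__(self, i):
        ids = self.enc[i]["input_ids"]
        # Labels equal inputs: plain next-token prediction over the whole trace.
        return {"input_ids": ids, "labels": ids}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="s1-sft-sketch", num_train_epochs=5,
                           per_device_train_batch_size=1),
    train_dataset=TraceDataset(examples),
)
trainer.train()
```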

3. 🔬 The Distillation Process

  • s1 relies on distillation and needs only a thousand examples, compared to the roughly 800,000 used to distill DeepSeek R1 into smaller models.
  • An initial pool of 59,000 challenging questions was collected from various international exams and standardized tests.
  • The pool was filtered down to a thousand examples by removing items with formatting issues and discarding questions that weaker reference models could already solve.
  • The final set favored hard examples with longer reasoning chains while preserving subject-area diversity (a sketch of this selection pipeline follows the list).
  • The fine-tuned model, s1, was trained on the selected thousand examples and achieved 92.6% accuracy on the MATH500 dataset.
  • s1 outperformed previous models and scored 50% on the challenging AIME24 test, up from 44.6%.
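
The selection can be pictured as quality and difficulty filters followed by a diversity-aware pick. The sketch below is a hypothetical reconstruction: the field names (well_formatted, solved_by_easy_models, subject, trace) are made up for illustration, and the paper's exact criteria and sampling procedure differ in detail.

```python
def select_1k(pool, k=1000):
    # 1) Quality: drop items with formatting problems.
    pool = [ex for ex in pool if ex["well_formatted"]]
    # 2) Difficulty: drop questions that weaker models already solve.
    pool = [ex for ex in pool if not ex["solved_by_easy_models"]]
    # 3) Diversity: group by subject, preferring longer reasoning traces.
    by_subject = {}
    for ex in pool:
        by_subject.setdefault(ex["subject"], []).append(ex)
    for items in by_subject.values():
        items.sort(key=lambda e: len(e["trace"]), reverse=True)
    # Round-robin across subjects until k examples are selected.
    selected = []
    while len(selected) < k and any(by_subject.values()):
        for items in by_subject.values():
            if items and len(selected) < k:
                selected.append(items.pop(0))
    return selected
```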

4. ⚙️ Test Time Scaling and Budget Forcing

  • Test-time scaling raises accuracy by letting the model reason for longer at inference: when the model emits its end-of-thinking token, that token is replaced with 'wait' so that reasoning continues.
  • Budget forcing applies this intervention to control how long the model thinks, and it measurably improves reasoning accuracy: on AIME24, accuracy rose from 50% to 56.7% (a decode-time sketch follows this list).
  • MATH500 also showed accuracy gains under budget forcing, demonstrating the technique's effectiveness across datasets.
  • Beyond the raw gains, the result suggests that similar decode-time interventions could be applied to other reasoning tasks.
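
A decode-time sketch of budget forcing follows, assuming a `</think>`-style delimiter (the real delimiter, budget, and force count depend on the model's chat template): whenever the model closes its reasoning, the delimiter is cut off and 'Wait' is appended, and generation resumes inside the trace.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small stand-in model; the mechanism is independent of model size.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

END_THINK = "</think>"   # assumed end-of-thinking delimiter
MAX_FORCES = 2           # how many times to override the delimiter

def decode(text: str) -> str:
    """One greedy generation pass continuing the given text."""
    ids = tok(text, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=512, do_sample=False)
    return tok.decode(out[0], skip_special_tokens=True)

def budget_forced_generate(prompt: str) -> str:
    text = prompt
    for _ in range(MAX_FORCES):
        text = decode(text)
        if END_THINK not in text:
            return text  # the model never closed its reasoning
        # Cut the trace at the delimiter and append 'Wait' so decoding
        # resumes inside the reasoning instead of committing to an answer.
        text = text.split(END_THINK)[0].rstrip() + "\nWait"
    return decode(text)  # final pass: let the model finish normally
```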

5. 🚀 Implications and Closing Thoughts

  • High-performance reasoning models are becoming more accessible due to the ability to use a carefully chosen thousand examples instead of millions for fine-tuning. This benefits individuals and smaller organizations with limited data resources.
  • Appending 'wait' lets reasoning models spend more compute at test time, but it also raises inference costs: there is a trade-off between waiting longer for smarter answers and paying for additional inference tokens.