🧠 SmolLM2-zero: Teaching SmolLM2 to "Think Before Answering"

This repository is a fork dedicated to reproducing the "aha moment" observed during DeepSeek R1 training, using the SmolLM2 1.7B model and Group Relative Policy Optimization (GRPO).

🔍 Project Overview

The DeepSeek R1 release demonstrated that pure reinforcement learning can teach language models to allocate more thinking time to problems and develop better reasoning strategies, without explicit human feedback or demonstrations. This project aims to reproduce that phenomenon with the smaller SmolLM2 1.7B model, showing that small models can also learn to "think before answering" through reinforcement learning.

We use the Countdown Game (a numbers puzzle) as our training task, and leverage GRPO, a reinforcement learning approach introduced in the DeepSeekMath paper, to teach the model to:

- Show its reasoning process inside <think>...</think> tags
- Provide final solutions inside <answer>...</answer> tags
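
To make the target output format concrete, here is a minimal sketch of how a completion could be checked against it. The exact layout (one think block followed by one answer block) and the names `parse_completion` and `format_reward` are assumptions for illustration, not the repository's actual code.

```python
import re

# Assumed layout: a <think> block followed by an <answer> block, nothing else.
COMPLETION_PATTERN = re.compile(
    r"^\s*<think>(?P<think>.+?)</think>\s*<answer>(?P<answer>.+?)</answer>\s*$",
    re.DOTALL,  # let reasoning and answer span multiple lines
)


def parse_completion(text):
    """Return (reasoning, answer) if the text follows the layout, else None."""
    match = COMPLETION_PATTERN.match(text)
    if match is None:
        return None
    return match.group("think").strip(), match.group("answer").strip()


def format_reward(text):
    """Hypothetical format score: 1.0 for a well-formed completion, else 0.0."""
    return 1.0 if parse_completion(text) is not None else 0.0
```

For example, `parse_completion("<think>3 + 4 = 7</think>\n<answer>3 + 4</answer>")` returns `("3 + 4 = 7", "3 + 4")`, while a completion with no tags returns `None` and scores 0.0.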

⚙️ How It Works

The training process works by:

1. Starting with the SmolLM2 1.7B model, which has basic instruction-following capability
2. Using GRPO to optimize the model to solve Countdown Game math puzzles
3. Designing rule-based reward functions that score both formatting and correctness
4. Training the model to maximize these rewards using reinforcement learning
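
A rule-based correctness reward for this kind of puzzle could look roughly like the sketch below. It assumes the common Countdown rules (combine the given numbers with +, -, * and /, using each number exactly once, to reach a target); the function name `correctness_reward` and the 0/1 scoring are hypothetical and may differ from the rewards used in this repository.

```python
import ast
import operator
from collections import Counter

# Only the four basic arithmetic operators are accepted (assumed rule set).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node, used):
    """Evaluate a parsed arithmetic expression, recording each integer it uses."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, int)
        and not isinstance(node.value, bool)
    ):
        used.append(node.value)
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    # Names, function calls, unary operators, floats, etc. are rejected.
    raise ValueError("unsupported expression")


def correctness_reward(answer, numbers, target):
    """Return 1.0 if `answer` uses each number exactly once and hits `target`."""
    try:
        tree = ast.parse(answer.strip(), mode="eval")
        used = []
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0
    if Counter(used) != Counter(numbers):
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0
```

Parsing the answer with `ast` instead of calling `eval` keeps the reward function from executing arbitrary model output. With the numbers 3, 5 and 25 and a target of 60, the answer `(25 - 5) * 3` earns 1.0, while `25 * 3` earns 0.0 because it leaves 5 unused.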

The model learns not just to solve problems but also to articulate its thinking process, a capability that emerges purely from reinforcement learning without explicit demonstrations.
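
The learning signal in GRPO comes from comparing several completions sampled for the same prompt. As a rough illustration of the group-relative advantage described in the DeepSeekMath paper, each completion's reward is normalized by the mean and standard deviation of its group. The function name, the use of the population standard deviation, and the `eps` value are assumptions of this sketch; training libraries may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt.

    Completions scoring above the group mean get a positive advantage,
    those below get a negative one. `eps` (an assumed value) avoids
    division by zero when every completion receives the same reward.
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)
    return [(r - baseline) / (spread + eps) for r in rewards]
```

If every completion in a group receives the same reward, all advantages are zero and that prompt contributes no gradient signal, which is why a reward that separates good completions from bad ones matters.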
