Finetune Llama 3.1, Mistral, Phi-3.5 & Gemma 2-5x faster with 80% less memory!
✨ Finetune for Free
All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, Ollama, vLLM or uploaded to Hugging Face.
📣 NEW! pip install unsloth now works! Head over to PyPI to check it out! This allows installs without a git pull. Use pip install unsloth[colab-new] to install without dependencies.
All kernels written in OpenAI's Triton language. Manual backprop engine.
0% loss in accuracy - no approximation methods - all exact.
No change of hardware required. Supports NVIDIA GPUs from 2018 onward, with a minimum CUDA Capability of 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc). Check your GPU! GTX 1070 and 1080 work, but are slow.
Works on Linux and Windows via WSL.
Supports 4bit and 16bit QLoRA / LoRA finetuning via bitsandbytes.
Open source trains 5x faster - see Unsloth Pro for up to 30x faster training!
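The CUDA Capability requirement in the list above can be checked programmatically. The following is a minimal sketch, not part of Unsloth itself: the helper names `meets_minimum` and `detect_capability` are hypothetical, and the detection step assumes PyTorch is installed (it returns `None` otherwise).

```python
from typing import Optional, Tuple

MIN_CAPABILITY = (7, 0)  # minimum CUDA Capability stated in the list above


def meets_minimum(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability is at least 7.0."""
    return tuple(capability) >= MIN_CAPABILITY


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the capability of GPU 0, or None if torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


cap = detect_capability()
if cap is None:
    print("No CUDA GPU detected (or PyTorch is not installed).")
elif meets_minimum(cap):
    print(f"Compute capability {cap[0]}.{cap[1]} meets the 7.0 minimum.")
else:
    print(f"Compute capability {cap[0]}.{cap[1]} is below 7.0; expect slow training.")
```

For reference, a T4 reports capability 7.5 and a GTX 1080 reports 6.1, which is why the latter falls into the "works, but is slow" group.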
If you trained a model with 🦥Unsloth, you can use this cool sticker!
🥇 Performance Benchmarking
For the full list of reproducible benchmarking tables, go to our website
The benchmarks in the table below were conducted by 🤗Hugging Face.
| Free Colab T4 | Dataset | 🤗Hugging Face | Pytorch 2.1.1 | 🦥Unsloth | 🦥 VRAM reduction |
|---|---|---|---|---|---|
| Llama-2 7b | OASST | 1x | 1.19x | 1.95x | -43.3% |
| Mistral 7b | Alpaca | 1x | 1.07x | 1.56x | -13.7% |
| Tiny Llama 1.1b | Alpaca | 1x | 2.06x | 3.87x | -73.8% |
| DPO with Zephyr | Ultra Chat | 1x | 1.09x | 1.55x | -18.6% |
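The speedups above are all relative to the 🤗Hugging Face baseline (1x). As a rough illustration of how to read them, the sketch below derives the speedup of Unsloth over the Pytorch 2.1.1 column and an estimated wall-clock time. The 60-minute baseline is a hypothetical figure chosen for the example, not a measured result.

```python
# Speedups relative to the Hugging Face baseline (1x), copied from the table above.
benchmarks = {
    "Llama-2 7b": {"pytorch": 1.19, "unsloth": 1.95},
    "Mistral 7b": {"pytorch": 1.07, "unsloth": 1.56},
    "Tiny Llama 1.1b": {"pytorch": 2.06, "unsloth": 3.87},
    "DPO with Zephyr": {"pytorch": 1.09, "unsloth": 1.55},
}


def relative_speedup(row):
    """Speedup of Unsloth over the Pytorch 2.1.1 column for one table row."""
    return row["unsloth"] / row["pytorch"]


def estimated_minutes(baseline_minutes, speedup):
    """Wall-clock estimate for a run that takes baseline_minutes at 1x."""
    return baseline_minutes / speedup


for name, row in benchmarks.items():
    minutes = estimated_minutes(60, row["unsloth"])  # hypothetical 60-minute baseline
    print(f"{name}: {relative_speedup(row):.2f}x over Pytorch 2.1.1, "
          f"about {minutes:.1f} min instead of 60")
```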
💾 Installation Instructions
Conda Installation
⚠️Only use Conda if you already have it. If not, use Pip. Select pytorch-cuda=11.8 for CUDA 11.8 or pytorch-cuda=12.1 for CUDA 12.1. If you have mamba, use mamba instead of conda for faster solving. We support python=3.10,3.11,3.12.
Pip Installation
⚠️Do **NOT** use this if you have Conda. Pip is a bit more complex since there are dependency issues. The pip command differs by torch version (2.2, 2.3, 2.4) and by CUDA version.
In general, if you have torch 2.4 and CUDA 12.1, use:
Afterwards, confirm that nvcc, xformers and bitsandbytes have installed successfully. If not, install them individually until they work, then install Unsloth.
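One way to run that confirmation from Python is sketched below. It only checks that each dependency can be located (the `nvcc` executable on `PATH`, and the two packages as importable modules); it does not prove that they work with your GPU. The function name `check_environment` is hypothetical and not part of Unsloth.

```python
import importlib.util
import shutil


def check_environment(modules=("xformers", "bitsandbytes"), tools=("nvcc",)):
    """Map each dependency name to True if it can be located, else False."""
    status = {}
    for tool in tools:
        # shutil.which returns the executable's path, or None if it is not on PATH.
        status[tool] = shutil.which(tool) is not None
    for module in modules:
        # find_spec locates a top-level package without importing it.
        status[module] = importlib.util.find_spec(module) is not None
    return status


for name, ok in check_environment().items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

Anything reported as missing should be installed on its own before installing Unsloth.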
Go to our official Documentation for saving to GGUF, checkpointing, evaluation and more!
We support Hugging Face's TRL, Trainer, Seq2SeqTrainer or even plain PyTorch code!
We're in 🤗Hugging Face's official docs! Check out the SFT docs and DPO docs!
from unsloth import FastLanguageModel
from unsloth import is_bfloat16_supported
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

max_seq_length = 2048  # Supports RoPE Scaling internally, so choose any!

# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files={"train": url}, split="train")

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",  # New Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",  # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",  # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",  # Gemma 2.2x faster!
]  # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=None,
    load_in_4bit=True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,  # Supports any, but = 0 is optimized
    bias="none",  # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    max_seq_length=max_seq_length,
    use_rslora=False,  # We support rank stabilized LoRA
    loftq_config=None,  # And LoftQ
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        max_steps=60,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        output_dir="outputs",
        optim="adamw_8bit",
        seed=3407,
    ),
)
trainer.train()

# Go to https://github.com/unslothai/unsloth/wiki for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Customized chat templates
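
To make the hyperparameters in the example above more concrete, here is a small arithmetic sketch. It assumes a single GPU, uses the standard LoRA scaling of `lora_alpha / r` (and `lora_alpha / sqrt(r)` for rank-stabilized LoRA), and the 4096 x 4096 layer size is a hypothetical value chosen only for illustration. The helper `lora_params` is not an Unsloth function.

```python
import math

# Values copied from the SFT example above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60
r = 16
lora_alpha = 16

# Sequences contributing to each optimizer step on a single GPU.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Sequences consumed over the whole run.
sequences_seen = effective_batch_size * max_steps

# Standard LoRA scales the low-rank update by alpha / r;
# rank-stabilized LoRA (use_rslora=True) uses alpha / sqrt(r) instead.
lora_scale = lora_alpha / r
rslora_scale = lora_alpha / math.sqrt(r)


def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    # The adapter is two matrices: A with shape (rank, d_in), B with shape (d_out, rank).
    return rank * (d_in + d_out)


print("effective batch size:", effective_batch_size)
print("sequences seen in 60 steps:", sequences_seen)
print("LoRA scale:", lora_scale, "| rsLoRA scale:", rslora_scale)
# Hypothetical 4096 x 4096 projection, for illustration only.
print("adapter params for one layer:", lora_params(4096, 4096, r))
```

With these settings the 60-step run is a short demonstration: it touches only a few hundred sequences of the dataset, so raise `max_steps` (or switch to epochs) for a real finetune.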
DPO Support
DPO (Direct Preference Optimization), PPO, Reward Modelling all seem to work as per 3rd party independent testing from Llama-Factory. We have a preliminary Google Colab notebook for reproducing Zephyr on Tesla T4 here: notebook.
from unsloth import FastLanguageModel, PatchDPOTrainer
from unsloth import is_bfloat16_supported
PatchDPOTrainer()

import torch
from transformers import TrainingArguments
from trl import DPOTrainer

max_seq_length = 2048  # Assumed here; same value as the SFT example above

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/zephyr-sft-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=None,
    load_in_4bit=True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=64,
    lora_dropout=0,  # Supports any, but = 0 is optimized
    bias="none",  # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    max_seq_length=max_seq_length,
)

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        warmup_ratio=0.1,
        num_train_epochs=3,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        seed=42,
        output_dir="outputs",
    ),
    beta=0.1,
    train_dataset=YOUR_DATASET_HERE,
    # eval_dataset=YOUR_DATASET_HERE,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
)
dpo_trainer.train()
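
The `YOUR_DATASET_HERE` placeholder above needs a preference dataset. TRL's DPOTrainer conventionally expects each example to carry a `prompt`, a `chosen` response and a `rejected` response. The sketch below is a hypothetical pre-flight check on plain Python records before they are turned into a dataset; `validate_preference_records` and the toy records are illustrative and not part of Unsloth or TRL.

```python
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def validate_preference_records(records):
    """Return a list of (index, problem) tuples; empty means every record looks usable."""
    problems = []
    for i, record in enumerate(records):
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            # Each field must be a non-empty string.
            if not isinstance(value, str) or not value.strip():
                problems.append((i, f"missing or empty '{field}'"))
        # A pair with identical responses carries no preference signal.
        if record.get("chosen") and record.get("chosen") == record.get("rejected"):
            problems.append((i, "chosen and rejected are identical"))
    return problems


# Hypothetical toy records, for illustration only.
records = [
    {"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"},
    {"prompt": "Name a primary colour.", "chosen": "Red", "rejected": ""},
]
print(validate_preference_records(records))
```

Once the list of problems is empty, the records can be wrapped with `datasets.Dataset.from_list(records)` and passed as `train_dataset`.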
🥇 Detailed Benchmarking Tables
Click "Code" for fully reproducible examples
"Unsloth Equal" is a preview of our PRO version, with code stripped out. All settings and the loss curve remain identical.