Fine-Tuning LLMs with Unsloth: A Complete Guide
Every time you call Claude or GPT-4, you’re paying for tokens. At $3-15 per million tokens, costs add up fast. A customer service bot handling 10,000 conversations per day at 2,000 tokens each burns through 20M tokens—that’s $60-300 daily, or $1,800-9,000 per month.
But here’s the thing: most of those conversations are domain-specific. Your bot answers questions about your product, your documentation, your policies. A general-purpose model like GPT-4 is massive overkill. You’re paying for the model’s ability to write poetry in Klingon when all you need is accurate answers about your return policy.
Fine-tuning changes the economics. Take a smaller model (3B-8B parameters), train it on your specific domain, and run it locally. The upfront cost is a few hours of GPU time ($5-50). The ongoing cost? Electricity. Your MacBook’s M2 chip can run a 7B model at 30+ tokens/second—for free.
This guide covers the complete workflow:
- Understanding fine-tuning approaches — LoRA, QLoRA, and when to use each
- Reinforcement learning methods — DPO, ORPO, GRPO for preference alignment
- Data preparation — Building Q&A datasets from documentation
- Training with Unsloth — 2x faster, 70% less memory
- Evaluation — Measuring if your model actually improved
- Local deployment — Export to GGUF, run with Ollama on Mac
- Cost analysis — When fine-tuning beats API providers at every scale
By the end, you’ll have a fine-tuned model running locally that outperforms generic LLMs on your specific task—at a fraction of the cost.
Table of Contents
- The Case for Fine-Tuning
- Understanding Fine-Tuning Approaches
- Reinforcement Learning Methods
- Data Preparation
- Setting Up Unsloth
- Supervised Fine-Tuning (SFT)
- Preference Alignment with DPO
- Evaluation and Metrics
- Local Deployment with Ollama
- Cost-Benefit Analysis
- Conclusion
The Case for Fine-Tuning
Before diving into the how, let’s establish the when. Fine-tuning isn’t always the right choice.
When Fine-Tuning Makes Sense
| Scenario | Why Fine-Tuning Helps |
|---|---|
| Domain expertise | Generic models don’t know your product, codebase, or policies |
| Cost at scale | API costs grow linearly; local inference is nearly free |
| Latency requirements | Local inference: 20-50ms. API calls: 200-2000ms |
| Privacy/compliance | Data never leaves your infrastructure |
| Consistent behavior | Fine-tuned models give predictable, reproducible outputs |
| Offline operation | No internet required once deployed |
When to Stick with APIs
| Scenario | Why APIs Win |
|---|---|
| Low volume | Under 100K tokens/day, API costs are negligible |
| General tasks | Creative writing, broad knowledge questions |
| Rapid iteration | Changing prompts is faster than retraining |
| No ML expertise | Fine-tuning requires understanding of training dynamics |
| Cutting-edge capability | GPT-4/Claude Opus still beat smaller fine-tuned models on complex reasoning |
What We’ll Build
A domain Q&A model that answers questions about technical documentation. The workflow:
Documentation → Q&A Dataset → Fine-tune with Unsloth → Export to GGUF → Run with Ollama
This is the most common fine-tuning use case: taking a small model and making it an expert on your specific content.
Understanding Fine-Tuning Approaches
There are three main approaches to adapting a pre-trained LLM to your task:
1. Full Fine-Tuning
Update all model parameters. This is what the original BERT/GPT papers did.
The problem: A 7B parameter model in fp16 needs 14GB just to store weights. During training, you also need gradients (14GB) and optimizer states (28GB for Adam). That’s 56GB minimum—and that’s before activations.
When to use: Almost never for LLMs. Only if you have massive compute budget and need maximum performance.
2. LoRA (Low-Rank Adaptation)
The key insight: weight updates during fine-tuning are low-rank. Instead of updating the full weight matrix W, we learn two small matrices A and B where the update is BA.
Original: W (4096 x 4096) = 16M parameters
LoRA update: B (4096 x 16) × A (16 x 4096) = 131K parameters
Reduction: 99.2% fewer trainable parameters
The rank r (16 in this example) controls the trade-off between expressiveness and efficiency. Typical values: 8-64.
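The parameter arithmetic above is easy to reproduce; a small sketch (plain Python, names are our own) for counting trainable parameters at a given rank:

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable parameter counts for one weight matrix.

    full: updating W directly (d_out x d_in)
    lora: training B (d_out x r) and A (r x d_in) instead
    """
    full = d_out * d_in
    lora = d_out * r + r * d_in
    return full, lora

full, lora = lora_param_counts(4096, 4096, r=16)
print(f"Full: {full:,}  LoRA: {lora:,}  reduction: {1 - lora / full:.1%}")
# Full: 16,777,216  LoRA: 131,072  reduction: 99.2%
```

Doubling the rank doubles the adapter size but leaves it tiny relative to the frozen weights.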
Benefits:
- Train on consumer GPUs (16GB VRAM is enough for 7B models)
- Adapter weights are tiny (~10-50MB)
- Can swap adapters at inference time (multi-task)
- No inference overhead when merged
3. QLoRA (Quantized LoRA)
QLoRA takes LoRA further by quantizing the base model to 4-bit while training the LoRA adapters in higher precision.
Full fine-tune 7B: 56GB+ VRAM
LoRA 7B (fp16): ~16GB VRAM
QLoRA 7B (4-bit): ~6GB VRAM
This is what makes fine-tuning accessible on consumer hardware. A 7B model fits on an RTX 3060 (12GB) or even a free Google Colab T4 (15GB).
Key QLoRA innovations:
- 4-bit NormalFloat (NF4): Optimal quantization for normally-distributed weights
- Double quantization: Quantize the quantization constants (saves ~0.5GB per 7B params)
- Paged optimizers: Spill optimizer states to CPU RAM during spikes
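These same innovations are exposed outside Unsloth through Hugging Face's `BitsAndBytesConfig`; a sketch of QLoRA-style loading (assumes `transformers` and `bitsandbytes` are installed; the checkpoint name is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,       # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# The base model is stored in 4-bit; LoRA adapters train in higher precision on top.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",   # example checkpoint
    quantization_config=bnb_config,
)
```

Unsloth wraps this in a single `load_in_4bit=True` flag, as shown later in this guide.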
import altair as alt
import polars as pl
# Memory comparison visualization
memory_data = pl.DataFrame({
"Method": ["Full Fine-Tune", "Full Fine-Tune", "Full Fine-Tune",
"LoRA (fp16)", "LoRA (fp16)", "LoRA (fp16)",
"QLoRA (4-bit)", "QLoRA (4-bit)", "QLoRA (4-bit)"],
"Component": ["Weights", "Gradients", "Optimizer"] * 3,
"Memory_GB": [
14.0, 14.0, 28.0, # Full fine-tune
14.0, 0.1, 0.2, # LoRA (only adapter gradients/optimizer)
3.5, 0.1, 0.2, # QLoRA (4-bit base + adapter)
]
})
chart = alt.Chart(memory_data).mark_bar().encode(
x=alt.X("Method:N", sort=["Full Fine-Tune", "LoRA (fp16)", "QLoRA (4-bit)"],
axis=alt.Axis(labelAngle=0)),
y=alt.Y("Memory_GB:Q", title="GPU Memory (GB)"),
color=alt.Color("Component:N",
scale=alt.Scale(scheme="category10"),
legend=alt.Legend(orient="top")),
order=alt.Order("Component:N", sort="descending")
).properties(
title="Memory Requirements for Fine-Tuning a 7B Model",
width=400,
height=300
)
chart

LoRA Architecture Deep Dive
Here’s how LoRA modifies a transformer layer:
Original forward pass:
h = W @ x
LoRA forward pass:
h = W @ x + (B @ A) @ x
h = W @ x + B @ (A @ x) # More efficient: A is small
Where:
- W is the frozen pre-trained weight matrix (d × d)
- A is the down-projection (r × d), initialized with Gaussian noise
- B is the up-projection (d × r), initialized with zeros
- r is the rank (hyperparameter, typically 8-64)
Why initialize B with zeros? So at the start of training, B @ A = 0, meaning the model behaves exactly like the pre-trained model. Training gradually adds the learned adaptation.
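A toy sketch of this in pure Python (dimensions shrunk for readability): with B initialized to zeros, the adapted forward pass reproduces the frozen layer exactly.

```python
import random

d, r = 8, 2
random.seed(0)

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]      # frozen weight
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]   # Gaussian init
B = [[0.0] * r for _ in range(d)]                                   # zero init
x = [random.gauss(0, 1) for _ in range(d)]

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

base = matvec(W, x)
Ax = matvec(A, x)                         # r-dimensional: the cheap order
adapted = [b + extra for b, extra in zip(base, matvec(B, Ax))]

print(all(abs(u - v) < 1e-12 for u, v in zip(base, adapted)))  # True: B = 0
```

As training updates B away from zero, `B @ (A @ x)` starts contributing the learned adaptation.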
Which layers to adapt? Typically the attention projections (q_proj, k_proj, v_proj, o_proj) and sometimes the MLP layers. Unsloth makes this easy to configure.
When to Use Each Approach
| Available VRAM | Model Size | Recommended Approach |
|---|---|---|
| 8GB | Up to 3B | QLoRA |
| 12-16GB | Up to 7B | QLoRA |
| 24GB | Up to 13B | QLoRA or LoRA (fp16) |
| 48GB+ | Up to 70B | LoRA (fp16) |
| 80GB+ (A100) | Any | LoRA or Full Fine-Tune |
Rule of thumb: Start with QLoRA. It’s 90-95% as good as full fine-tuning for most tasks, at a fraction of the cost. Only upgrade if you see quality issues.
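The table can be condensed into a back-of-the-envelope formula (a heuristic of our own, calibrated to the figures above; it ignores activations, batch size, and sequence length):

```python
def rough_vram_gb(params_b: float, method: str) -> float:
    """Very rough training-VRAM estimate in GB for a params_b-billion model.

    Heuristic only -- real usage varies with context length and batch size.
    """
    if method == "full":    # fp16 weights + fp16 grads + Adam states
        return params_b * (2 + 2 + 4)
    if method == "lora":    # fp16 base + small adapter grads/optimizer + overhead
        return params_b * 2 + 2
    if method == "qlora":   # ~0.5 bytes/param for the 4-bit base + overhead
        return params_b * 0.5 + 2.5
    raise ValueError(f"unknown method: {method}")

for m in ("full", "lora", "qlora"):
    print(f"7B via {m}: ~{rough_vram_gb(7, m):.0f} GB")
# matches the ~56 / ~16 / ~6 GB figures above
```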
Reinforcement Learning Methods
Supervised Fine-Tuning (SFT) teaches the model what to say. Reinforcement Learning from Human Feedback (RLHF) teaches it how to say it—making outputs more helpful, harmless, and honest.
The classic RLHF pipeline:
1. SFT: Train on demonstrations
2. Reward Model: Train on human preferences (A > B)
3. PPO: Optimize policy against reward model
This is complex and expensive. Modern methods simplify it significantly.
PPO (Proximal Policy Optimization)
The original RLHF method, used by OpenAI for GPT-4.
How it works:
- Generate responses from current policy
- Score with reward model
- Update policy to increase reward (with KL penalty to stay close to reference)
Problems:
- Requires training a separate reward model
- Needs to keep reference model in memory
- Unstable training dynamics
- High memory: policy + reference + reward = 3 models
DPO (Direct Preference Optimization)
The breakthrough: DPO proves you can skip the reward model entirely. The policy is the reward model.
PPO: Data → Reward Model → Policy Optimization
DPO: Data → Policy Optimization (directly on preferences)
How it works:
- Input: pairs of (prompt, chosen_response, rejected_response)
- Loss: increase probability of chosen, decrease probability of rejected
- Implicit reward: log-probability ratio between policy and reference
Benefits:
- No reward model needed
- Stable supervised learning (no RL instability)
- Memory: just policy + reference (can share weights with Unsloth)
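The per-example objective is simple enough to write out directly (the log-probabilities below are made-up numbers for illustration; libraries like TRL compute them from token logits):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: -log sigmoid(beta * margin).

    The implicit reward of a response is beta * (policy logp - reference logp).
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)

# Policy already prefers the chosen response relative to the reference:
print(dpo_loss(-12.0, -20.0, -14.0, -18.0))  # margin > 0, so loss < log(2)
```

At initialization (policy == reference) the margin is zero and the loss is log 2; training pushes the chosen/rejected gap apart.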
ORPO (Odds Ratio Preference Optimization)
The insight: DPO still requires SFT first, then DPO. ORPO combines them.
Traditional: SFT → DPO (two training stages)
ORPO: SFT + Preference (single stage)
ORPO adds an odds-ratio penalty to the SFT loss, teaching the model to prefer good responses while learning the task.
Benefits:
- Single training stage (faster, cheaper)
- No reference model needed
- Simpler hyperparameters
GRPO (Group Relative Policy Optimization)
DeepSeek’s method for training reasoning models (DeepSeek-R1).
Key innovations:
- No reward model: use rule-based verification (math correctness, code execution)
- No value model: replace with group statistics
- Sample multiple responses, rank by reward, update toward better ones
Best for: Tasks with verifiable outputs (math, code, structured data).
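The "group relative" part is easy to illustrate: advantages come from normalizing rewards within each group of sampled responses, not from a learned value model (a simplified sketch):

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one group of sampled responses.

    GRPO replaces a learned value baseline with the group mean/std.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# e.g. 4 sampled answers to one math problem, scored by a rule-based checker
rewards = [1.0, 0.0, 0.0, 1.0]  # 1 = verified correct, 0 = incorrect
print(group_advantages(rewards))  # correct answers get positive advantage
```

The policy is then updated to increase the probability of high-advantage responses within each group.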
# RL Methods comparison
rl_comparison = pl.DataFrame({
"Method": ["PPO", "DPO", "ORPO", "GRPO"],
"Complexity": [5, 2, 1, 3],
"Memory": [5, 3, 2, 3],
"Stability": [2, 4, 5, 4],
"Data_Requirement": [4, 3, 3, 2],
})
# Melt for grouped bar chart
rl_melted = rl_comparison.unpivot(
index="Method",
on=["Complexity", "Memory", "Stability", "Data_Requirement"],
variable_name="Metric",
value_name="Score"
)
chart = alt.Chart(rl_melted).mark_bar().encode(
x=alt.X("Method:N", axis=alt.Axis(labelAngle=0)),
y=alt.Y("Score:Q", title="Score (lower is better for Complexity/Memory)"),
color=alt.Color("Metric:N", scale=alt.Scale(scheme="tableau10")),
xOffset="Metric:N"
).properties(
title="RL Method Comparison (1-5 scale)",
width=500,
height=300
)
chart

Which Method Should You Use?
Start Here
│
▼
Do you have preference data?
(A > B pairs)
/ \
Yes No
│ │
▼ ▼
Want single-stage? Is output verifiable?
/ \ (math, code, JSON)
Yes No / \
│ │ Yes No
▼ ▼ │ │
ORPO DPO GRPO SFT only
For most domain Q&A tasks: Start with SFT. Add DPO if you have preference data and want to improve quality further.
Data Preparation
Data quality is the single biggest factor in fine-tuning success. A model trained on 1,000 high-quality examples often outperforms one trained on 100,000 noisy examples.
Dataset Formats
The two most common formats:
1. Alpaca Format (instruction-following)
{
"instruction": "Explain the return policy",
"input": "",
"output": "Our return policy allows returns within 30 days..."
}

2. ShareGPT/Conversation Format (multi-turn)
{
"conversations": [
{"from": "human", "value": "What's the return policy?"},
{"from": "gpt", "value": "Our return policy allows..."},
{"from": "human", "value": "What about opened items?"},
{"from": "gpt", "value": "Opened items can be returned if..."}
]
}

Building a Q&A Dataset from Documentation
For domain Q&A, we need to:
- Extract content from documentation
- Generate question-answer pairs
- Validate quality
Let’s build a pipeline to create a Q&A dataset from Python documentation.
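The examples below are hand-written for clarity, but in practice step 2 is LLM-driven. Here is what the prompt construction for that might look like (the prompt wording and the `n_pairs` parameter are our own invention, not a fixed API):

```python
def build_qa_generation_prompt(title: str, content: str, n_pairs: int = 3) -> str:
    """Build a prompt asking an LLM to generate Q&A pairs from one doc chunk."""
    return (
        f"You are generating training data for a domain Q&A model.\n"
        f"Read the documentation below and write {n_pairs} question-answer pairs.\n"
        f"Questions should be phrased the way a developer would ask them.\n"
        f"Answers must be fully supported by the documentation.\n"
        f"Respond as a JSON list of objects with 'instruction' and 'output' keys.\n\n"
        f"## {title}\n{content}"
    )

prompt = build_qa_generation_prompt("asyncio.gather", "Run awaitables concurrently...")
print(prompt.splitlines()[0])
```

Send this prompt to any capable model (via Ollama locally or an API), parse the JSON, and you have the same structure the hand-crafted examples below produce.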
import polars as pl
from typing import TypedDict
import json
class QAPair(TypedDict):
"""A question-answer pair for fine-tuning."""
instruction: str
input: str
output: str
source: str
class PreferencePair(TypedDict):
"""A preference pair for DPO training."""
prompt: str
chosen: str
rejected: str
# Example documentation chunks (in practice, extract from your docs)
DOCUMENTATION_CHUNKS = [
{
"title": "asyncio.gather",
"content": """asyncio.gather(*aws, return_exceptions=False)
Run awaitable objects in the aws sequence concurrently.
If any awaitable in aws is a coroutine, it is automatically scheduled as a Task.
If all awaitables are completed successfully, the result is an aggregate list of
returned values. The order of result values corresponds to the order of awaitables in aws.
If return_exceptions is True, exceptions in the awaitables are treated the same as
successful results, and gathered in the result list.""",
"source": "python-docs/asyncio"
},
{
"title": "dataclasses.field",
"content": """dataclasses.field(*, default=MISSING, default_factory=MISSING, repr=True,
hash=None, init=True, compare=True, metadata=None, kw_only=MISSING)
For common and simple use cases, no other functionality is required. There are, however,
some dataclass features that require additional per-field information. To satisfy this
need for additional information, you can replace the default field value with a call to
the provided field() function.
The default_factory parameter provides a way to specify a field's default value using a
callable. This is useful for mutable default values like lists or dicts.""",
"source": "python-docs/dataclasses"
},
{
"title": "typing.TypedDict",
"content": """class typing.TypedDict(dict)
Special construct to add type hints to a dictionary. At runtime it is a plain dict.
TypedDict declares a dictionary type that expects all of its instances to have a certain
set of keys, where each key is associated with a value of a consistent type.
This expectation is not checked at runtime but is only enforced by type checkers.""",
"source": "python-docs/typing"
}
]
print(f"Loaded {len(DOCUMENTATION_CHUNKS)} documentation chunks")
print(f"\nExample chunk: {DOCUMENTATION_CHUNKS[0]['title']}")

def generate_qa_pairs_from_chunk(chunk: dict) -> list[QAPair]:
    """Generate Q&A pairs from a documentation chunk.

    In production, use an LLM (via ollama or API) to generate these.
    Here we show the pattern with hand-crafted examples.
    """
    # These would be generated by an LLM in practice
    qa_templates = {
        "asyncio.gather": [
            QAPair(
                instruction="How do I run multiple async functions concurrently in Python?",
                input="",
                output="""Use `asyncio.gather()` to run multiple awaitables concurrently. It takes any number of awaitable objects and runs them in parallel:

```python
import asyncio

async def fetch_data(url):
    # async operation
    pass

async def main():
    results = await asyncio.gather(
        fetch_data("url1"),
        fetch_data("url2"),
        fetch_data("url3")
    )
    # results is a list in the same order as inputs
```

The results are returned as a list in the same order as the input awaitables.""",
                source=chunk["source"]
            ),
            QAPair(
                instruction="What does return_exceptions do in asyncio.gather?",
                input="",
                output="""The return_exceptions parameter in asyncio.gather() controls how exceptions are handled:

- `return_exceptions=False` (default): If any awaitable raises an exception, it propagates immediately and cancels other tasks.
- `return_exceptions=True`: Exceptions are captured and returned in the results list alongside successful values.

Example with exception handling:

```python
results = await asyncio.gather(
    task1(),
    task2(),  # might raise
    task3(),
    return_exceptions=True
)
# results might be: [result1, ValueError(...), result3]
```

This is useful when you want all tasks to complete even if some fail.""",
                source=chunk["source"]
            )
        ],
        "dataclasses.field": [
            QAPair(
                instruction="How do I set a default list value for a dataclass field?",
                input="",
                output="""Use field(default_factory=list) to set a mutable default value in a dataclass. Never use a mutable default directly, as it would be shared across all instances.

```python
from dataclasses import dataclass, field

# WRONG - shared mutable default
@dataclass
class Wrong:
    items: list = []  # This is a bug!

# CORRECT - factory creates a new list per instance
@dataclass
class Correct:
    items: list = field(default_factory=list)
```

The default_factory is called each time a new instance is created, ensuring each instance gets its own list.""",
                source=chunk["source"]
            )
        ],
        "typing.TypedDict": [
            QAPair(
                instruction="What is TypedDict and when should I use it?",
                input="",
                output="""TypedDict is a way to add type hints to dictionaries with a fixed set of keys. Use it when:

- You need a dictionary with known keys (like JSON responses)
- You want IDE autocompletion and type checking
- You don't need the overhead of a full class

```python
from typing import TypedDict

class UserDict(TypedDict):
    name: str
    age: int
    email: str | None

def process_user(user: UserDict) -> None:
    print(user["name"])  # Type checker knows this is str
```

Note: TypedDict is only for static type checking. At runtime, it's just a regular dict with no validation.""",
                source=chunk["source"]
            )
        ]
    }
    return qa_templates.get(chunk["title"], [])

# Generate Q&A pairs from all chunks
all_qa_pairs: list[QAPair] = []
for chunk in DOCUMENTATION_CHUNKS:
    pairs = generate_qa_pairs_from_chunk(chunk)
    all_qa_pairs.extend(pairs)

print(f"Generated {len(all_qa_pairs)} Q&A pairs")
print(f"\nExample Q&A pair:")
print(f"Q: {all_qa_pairs[0]['instruction']}")
print(f"A: {all_qa_pairs[0]['output'][:200]}...")
def format_for_training(qa_pairs: list[QAPair], format_type: str = "alpaca") -> list[dict]:
"""Format Q&A pairs for training.
Args:
qa_pairs: List of Q&A pairs
format_type: 'alpaca' or 'sharegpt'
Returns:
List of formatted training examples
"""
if format_type == "alpaca":
return [
{
"instruction": qa["instruction"],
"input": qa.get("input", ""),
"output": qa["output"]
}
for qa in qa_pairs
]
elif format_type == "sharegpt":
return [
{
"conversations": [
{"from": "human", "value": qa["instruction"]},
{"from": "gpt", "value": qa["output"]}
]
}
for qa in qa_pairs
]
else:
raise ValueError(f"Unknown format: {format_type}")
# Format for Alpaca (most common for Unsloth)
training_data = format_for_training(all_qa_pairs, "alpaca")
# Save to JSON
with open("training_data.json", "w") as f:
json.dump(training_data, f, indent=2)
print(f"Saved {len(training_data)} examples to training_data.json")
print(f"\nExample formatted for training:")
print(json.dumps(training_data[0], indent=2)[:500])
Creating Preference Data for DPO
For DPO training, we need pairs of (prompt, chosen_response, rejected_response). There are several ways to create this:
- Human annotation: Have humans rank responses (gold standard but expensive)
- Model-based ranking: Use a stronger model to rank responses from weaker models
- Synthetic generation: Generate good and bad examples programmatically
# Example preference data for DPO
preference_data: list[PreferencePair] = [
    PreferencePair(
        prompt="How do I run multiple async functions concurrently?",
        chosen="""Use `asyncio.gather()` to run multiple awaitables concurrently:

```python
import asyncio

async def main():
    results = await asyncio.gather(
        task1(),
        task2(),
        task3()
    )
```

Results are returned in the same order as inputs. Use return_exceptions=True to handle errors gracefully.""",
        rejected="""You can use threading or multiprocessing to run functions concurrently. Just create multiple threads and start them."""
    ),
    PreferencePair(
        prompt="What's the best way to handle mutable default arguments in dataclasses?",
        chosen="""Use field(default_factory=...) to avoid the mutable default argument bug:

```python
from dataclasses import dataclass, field

@dataclass
class Config:
    options: list = field(default_factory=list)
    settings: dict = field(default_factory=dict)
```

The factory function is called for each new instance, ensuring isolation.""",
        rejected="""Just set the default to an empty list: options: list = []. Python handles it automatically."""
    )
]

print(f"Created {len(preference_data)} preference pairs for DPO")
print(f"\nExample preference pair:")
print(f"Prompt: {preference_data[0]['prompt']}")
print(f"Chosen (correct): {preference_data[0]['chosen'][:100]}...")
print(f"Rejected (wrong): {preference_data[0]['rejected'][:100]}...")
### Data Quality Checklist
Before training, verify your data:
- [ ] **Accuracy**: Are the answers factually correct?
- [ ] **Completeness**: Do answers fully address the question?
- [ ] **Formatting**: Consistent use of code blocks, lists, etc.
- [ ] **Length distribution**: Mix of short and detailed answers
- [ ] **Diversity**: Cover different topics and question types
- [ ] **No data leakage**: Test set questions shouldn't overlap with training
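Parts of this checklist are mechanical enough to automate. A sketch of basic checks for duplicates and train/test leakage (exact-match overlap only; near-duplicate detection would need embedding similarity):

```python
def check_dataset(train: list[dict], test: list[dict]) -> dict:
    """Run basic quality checks on Alpaca-format Q&A data."""
    train_qs = [ex["instruction"].strip().lower() for ex in train]
    test_qs = {ex["instruction"].strip().lower() for ex in test}
    duplicates = len(train_qs) - len(set(train_qs))       # exact repeats in train
    leaked = sorted(set(train_qs) & test_qs)              # train/test overlap
    lengths = [len(ex["output"]) for ex in train]
    return {
        "duplicates": duplicates,
        "leaked_questions": leaked,
        "min_answer_len": min(lengths),
        "max_answer_len": max(lengths),
    }

train = [{"instruction": "What is TypedDict?", "output": "A typed dict..."},
         {"instruction": "What is TypedDict?", "output": "A typed dict..."}]
test = [{"instruction": "what is typeddict?", "output": "..."}]
report = check_dataset(train, test)
print(report["duplicates"], report["leaked_questions"])  # 1 ['what is typeddict?']
```

Run this before every training run; a leaked test question silently inflates your evaluation numbers.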
---
## Setting Up Unsloth
[Unsloth](https://github.com/unslothai/unsloth) is an optimization library that makes LLM fine-tuning **2x faster** and uses **70% less memory** compared to standard Hugging Face training.
### Why Unsloth?
- **Custom CUDA kernels**: Fused operations for attention, RoPE, cross-entropy
- **Memory optimization**: Gradient checkpointing, dynamic quantization
- **Easy LoRA/QLoRA**: Single function call to add adapters
- **Wide model support**: Llama, Mistral, Qwen, Gemma, Phi, and more
- **Export options**: GGUF, vLLM, Hugging Face formats
### Installation
**Google Colab (recommended for training):**
```python
%%capture
!pip install unsloth
# Get the latest from GitHub
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
Local with CUDA:
pip install "unsloth[cu121-ampere-torch240]" # For RTX 30xx/40xx
# OR
pip install "unsloth[cu121-torch240]"  # For older GPUs

# ============================================================
# NOTE: The following code requires a CUDA GPU.
# Run this section in Google Colab or on a machine with NVIDIA GPU.
# ============================================================
# Check if CUDA is available
import subprocess
import sys
def check_cuda():
"""Check if CUDA is available."""
try:
import torch
if torch.cuda.is_available():
print(f"CUDA available: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
return True
else:
print("CUDA not available. Training code will be shown but not executed.")
return False
except ImportError:
print("PyTorch not installed. Training code will be shown but not executed.")
return False
HAS_CUDA = check_cuda()

# Model loading with Unsloth (requires CUDA)
if HAS_CUDA:
from unsloth import FastLanguageModel
# Load model in 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Llama-3.2-3B-Instruct", # Good balance of quality and size
max_seq_length=2048, # Can go up to 8192 for longer contexts
load_in_4bit=True, # Use QLoRA (4-bit quantization)
dtype=None, # Auto-detect (bf16 for Ampere+, fp16 otherwise)
)
print(f"Model loaded: {model.config._name_or_path}")
print(f"Parameters: {model.num_parameters() / 1e9:.2f}B")
else:
print("Skipping model loading (no CUDA)")
    print("\nTo run this code, use Google Colab with GPU runtime.")

# Add LoRA adapters (requires CUDA)
if HAS_CUDA:
model = FastLanguageModel.get_peft_model(
model,
r=16, # LoRA rank - higher = more capacity, more memory
lora_alpha=16, # Scaling factor (usually equal to r)
lora_dropout=0, # Dropout for regularization (0 is fine for small datasets)
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj", # Attention projections
"gate_proj", "up_proj", "down_proj", # MLP layers
],
bias="none", # Don't train biases
use_gradient_checkpointing="unsloth", # Memory optimization
random_state=42,
use_rslora=False, # Rank-stabilized LoRA (experimental)
)
# Print trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")
else:
    print("Skipping LoRA setup (no CUDA)")

LoRA Configuration Deep Dive
| Parameter | Typical Values | Effect |
|---|---|---|
| r (rank) | 8, 16, 32, 64 | Higher = more capacity, more VRAM |
| lora_alpha | Same as r | Scaling factor for adapter output |
| lora_dropout | 0, 0.05, 0.1 | Regularization (higher = less overfitting) |
| target_modules | See below | Which layers to adapt |
Target modules by model:
- Llama/Mistral: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Qwen: Same as Llama
- Gemma: Same as Llama
Rule of thumb: Start with r=16 and all target modules. Reduce r if running out of VRAM.
Supervised Fine-Tuning (SFT)
SFT is the foundation of fine-tuning. We train the model to predict the next token given the prompt and expected output.
# Define the prompt template
ALPACA_PROMPT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:
{output}"""
def format_prompt(example: dict) -> str:
"""Format a training example into the prompt template."""
return ALPACA_PROMPT.format(
instruction=example["instruction"],
output=example["output"]
)
# Test the formatting
print("Example formatted prompt:")
print(format_prompt(training_data[0])[:500])

# SFT Training (requires CUDA)
if HAS_CUDA:
    import torch  # needed below for the bf16/fp16 check
    from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset
# Convert to HuggingFace dataset
dataset = Dataset.from_list(training_data)
# Format with the prompt template
def format_examples(examples):
texts = []
for instruction, output in zip(examples["instruction"], examples["output"]):
text = ALPACA_PROMPT.format(instruction=instruction, output=output)
texts.append(text)
return {"text": texts}
dataset = dataset.map(format_examples, batched=True)
# Training arguments
training_args = TrainingArguments(
output_dir="outputs",
per_device_train_batch_size=2, # Adjust based on VRAM
gradient_accumulation_steps=4, # Effective batch size = 2 * 4 = 8
num_train_epochs=3, # More epochs for small datasets
learning_rate=2e-4, # Standard for LoRA
lr_scheduler_type="cosine", # Cosine annealing
warmup_steps=10, # Warmup for stability
logging_steps=1, # Log every step
save_strategy="epoch", # Save after each epoch
optim="adamw_8bit", # Memory-efficient optimizer
seed=42,
fp16=not torch.cuda.is_bf16_supported(), # Use bf16 if available
bf16=torch.cuda.is_bf16_supported(),
)
# Create trainer
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
args=training_args,
dataset_text_field="text",
max_seq_length=2048,
packing=False, # Don't pack multiple examples (cleaner gradients)
)
print("Starting SFT training...")
trainer_stats = trainer.train()
print(f"Training complete! Final loss: {trainer_stats.training_loss:.4f}")
else:
print("Skipping SFT training (no CUDA)")
    print("\nIn Colab, this would train the model on your Q&A dataset.")

# Simulated training metrics for visualization
# (Replace with actual metrics from trainer.state.log_history in Colab)
training_metrics = pl.DataFrame({
"step": list(range(0, 100, 5)),
"loss": [2.8, 2.4, 2.1, 1.9, 1.7, 1.5, 1.4, 1.3, 1.2, 1.15,
1.1, 1.05, 1.0, 0.95, 0.92, 0.88, 0.85, 0.82, 0.8, 0.78],
})
chart = alt.Chart(training_metrics).mark_line(point=True).encode(
x=alt.X("step:Q", title="Training Step"),
y=alt.Y("loss:Q", title="Loss", scale=alt.Scale(domain=[0.5, 3])),
).properties(
title="SFT Training Loss Curve",
width=500,
height=300
)
chart

Hyperparameter Tuning Guide
| Parameter | Starting Value | Adjust If… |
|---|---|---|
| learning_rate | 2e-4 | Loss spikes: reduce. Loss plateaus early: increase. |
| num_train_epochs | 3-5 | Small dataset: more epochs. Large dataset: fewer. |
| batch_size | 2-4 | Out of memory: reduce. Training slow: increase with gradient accumulation. |
| gradient_accumulation | 4-8 | Want larger effective batch without more VRAM. |
| warmup_steps | 5-10% of total | Training unstable at start: increase. |
Common issues:
- Loss doesn’t decrease: Learning rate too low, or data formatting wrong
- Loss spikes: Learning rate too high, reduce by 50%
- Overfitting: Add dropout, reduce epochs, or get more data
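The "loss spikes" failure mode can be caught programmatically; a crude heuristic sketch (the threshold factor is our own choice, not a standard):

```python
def detect_loss_spikes(losses: list[float], factor: float = 1.5) -> list[int]:
    """Return indices where loss jumps above `factor` times the previous step.

    A simple heuristic for spotting unstable training; inspect the logs
    (e.g. trainer.state.log_history) and consider halving the learning rate.
    """
    return [i for i in range(1, len(losses))
            if losses[i] > factor * losses[i - 1]]

history = [2.8, 2.4, 2.1, 1.9, 3.6, 1.7, 1.5]
print(detect_loss_spikes(history))  # [4] -- the jump from 1.9 to 3.6
```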
Preference Alignment with DPO
After SFT, we can optionally train with DPO to improve response quality. This teaches the model to prefer better responses.
# DPO Training (requires CUDA and SFT model)
if HAS_CUDA:
    import torch  # needed below for the bf16/fp16 check
    from trl import DPOTrainer, DPOConfig
from datasets import Dataset
# Convert preference data to HuggingFace dataset
dpo_dataset = Dataset.from_list(preference_data)
# DPO configuration
dpo_config = DPOConfig(
output_dir="outputs_dpo",
beta=0.1, # KL penalty weight (higher = more conservative)
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
num_train_epochs=1, # DPO usually needs fewer epochs
learning_rate=5e-5, # Lower than SFT
lr_scheduler_type="cosine",
warmup_ratio=0.1,
logging_steps=1,
optim="adamw_8bit",
seed=42,
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
)
# Create DPO trainer
# Note: Unsloth shares weights between policy and reference, saving memory
dpo_trainer = DPOTrainer(
model=model,
ref_model=None, # Unsloth handles reference model internally
train_dataset=dpo_dataset,
tokenizer=tokenizer,
args=dpo_config,
)
print("Starting DPO training...")
dpo_trainer.train()
print("DPO training complete!")
else:
print("Skipping DPO training (no CUDA)")
    print("\nDPO improves response quality by learning from preference pairs.")

DPO vs ORPO
If you want to skip the two-stage SFT → DPO pipeline, ORPO combines both in a single training run:
from trl import ORPOTrainer, ORPOConfig
orpo_config = ORPOConfig(
output_dir="outputs_orpo",
beta=0.1, # Same as DPO
# ... other training args
)
# ORPO dataset needs instruction + chosen + rejected
orpo_trainer = ORPOTrainer(
model=model,
train_dataset=orpo_dataset,
tokenizer=tokenizer,
args=orpo_config,
)
orpo_trainer.train()

When to use ORPO:
- You have preference data from the start
- You want faster iteration (single training run)
- You’re okay with slightly less fine-grained control
Evaluation and Metrics
How do you know if fine-tuning actually helped? You need systematic evaluation.
Quantitative Metrics
1. Perplexity - How well the model predicts the next token
- Lower is better
- Compare before/after fine-tuning on held-out data
2. Task-specific metrics - Depends on your use case
- Q&A: Exact match, F1 score, BLEU/ROUGE
- Classification: Accuracy, precision, recall
- Generation: Human preference ratings
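Exact match and token-level F1 are simple to implement yourself (the SQuAD-style formulation, with whitespace tokenization for simplicity):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction: str, reference: str) -> bool:
    """Exact match after lowercasing and whitespace normalization."""
    return " ".join(prediction.lower().split()) == " ".join(reference.lower().split())

print(token_f1("use asyncio.gather to run tasks", "use asyncio.gather"))
```

Average these over your held-out set before and after fine-tuning to quantify the improvement.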
Qualitative Evaluation
Numbers don’t tell the whole story. Always do qualitative testing:
- Create 20-50 test questions covering edge cases
- Get responses from base model AND fine-tuned model
- Compare side-by-side
# Evaluation test cases
TEST_QUESTIONS = [
"How do I run multiple async functions at the same time in Python?",
"What's the difference between asyncio.gather and asyncio.wait?",
"How do I create a dataclass with a default list field?",
"When should I use TypedDict vs a regular class?",
"How do I handle exceptions in asyncio.gather?",
]
def run_inference(model, tokenizer, question: str) -> str:
"""Run inference on a single question."""
prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{question}
### Response:
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=512,
temperature=0.7,
top_p=0.9,
do_sample=True,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Extract just the response part
response = response.split("### Response:")[-1].strip()
return response
# Example evaluation (requires CUDA)
if HAS_CUDA:
print("Running evaluation on test questions...")
for q in TEST_QUESTIONS[:2]: # Just first 2 for demo
print(f"\n{'='*60}")
print(f"Q: {q}")
print(f"{'='*60}")
response = run_inference(model, tokenizer, q)
print(f"A: {response[:500]}...")
else:
print("Skipping inference evaluation (no CUDA)")
    print(f"\nTest questions prepared: {len(TEST_QUESTIONS)}")

# Simulated evaluation results for visualization
eval_results = pl.DataFrame({
"Model": ["Base Llama-3.2-3B", "Base Llama-3.2-3B", "Base Llama-3.2-3B",
"Fine-tuned", "Fine-tuned", "Fine-tuned"],
"Metric": ["Accuracy", "Relevance", "Completeness"] * 2,
"Score": [
0.45, 0.55, 0.40, # Base model
0.82, 0.88, 0.75, # Fine-tuned
]
})
chart = alt.Chart(eval_results).mark_bar().encode(
x=alt.X("Model:N", axis=alt.Axis(labelAngle=0)),
y=alt.Y("Score:Q", title="Score (0-1)", scale=alt.Scale(domain=[0, 1])),
color=alt.Color("Metric:N", scale=alt.Scale(scheme="tableau10")),
xOffset="Metric:N"
).properties(
title="Model Evaluation: Base vs Fine-tuned on Domain Q&A",
width=400,
height=300
)
chart
Local Deployment with Ollama
Once training is complete, we export the model to GGUF format and run it locally with Ollama.
Why Ollama?
- Dead simple: ollama run model-name and you’re chatting
- Mac-optimized: Takes full advantage of Apple Silicon
- No Python needed: Runs as a standalone binary
- API compatible: OpenAI-compatible API for easy integration
- Model management: Easy to download, update, and switch models
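The OpenAI-compatible API means existing client code can point at Ollama with only a base-URL change. A minimal sketch using only the standard library (the model name my-qa-model is the one created later in this section; the request fails gracefully if no server is running):

```python
import json
import urllib.request

# OpenAI-style chat request against Ollama's /v1 endpoint.
payload = {
    "model": "my-qa-model",
    "messages": [{"role": "user", "content": "How do I use asyncio.gather?"}],
}
req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        body = json.loads(resp.read())
        print(body["choices"][0]["message"]["content"])
except OSError:
    print("Ollama not reachable; start it with `ollama serve`.")
```

The official openai Python client works the same way: pass base_url="http://localhost:11434/v1" and any dummy API key.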
Export to GGUF
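File size scales roughly with bits per weight, which helps pick a quantization before exporting. A back-of-envelope sketch — the bits-per-weight figures below are approximations, since k-quants mix precisions per tensor:

```python
# Rough GGUF size: parameters x bits-per-weight / 8.
# Bits-per-weight values are approximate, not exact format constants.
QUANT_BITS = {"q4_k_m": 4.8, "q5_k_m": 5.7, "q8_0": 8.5}

def estimate_gguf_gb(params_billions: float, quant: str) -> float:
    """Approximate GGUF file size in GB for a given quantization."""
    return params_billions * QUANT_BITS[quant] / 8

for quant in QUANT_BITS:
    print(f"{quant}: ~{estimate_gguf_gb(3.2, quant):.1f} GB for a 3.2B model")
```

For a 3B-class model, every option fits comfortably in the unified memory of any Apple Silicon Mac.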
# Export model to GGUF (requires CUDA and trained model)
if HAS_CUDA:
# Merge LoRA adapters and save as GGUF
model.save_pretrained_gguf(
"my-qa-model",
tokenizer,
quantization_method="q4_k_m", # Good balance of quality and size
)
print("Model exported to GGUF format!")
print("\nQuantization options:")
print(" q4_k_m - 4-bit, medium quality (recommended)")
print(" q5_k_m - 5-bit, higher quality, larger size")
print(" q8_0 - 8-bit, near full quality, largest size")
else:
print("Skipping GGUF export (no CUDA)")
print("\nIn Colab, this would create: my-qa-model-unsloth.Q4_K_M.gguf")
Create Ollama Modelfile
The Modelfile tells Ollama how to run your model:
# Create Ollama Modelfile
modelfile_content = """# Modelfile for domain Q&A model
FROM ./my-qa-model-unsloth.Q4_K_M.gguf
# Model parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 2048
# System prompt
SYSTEM You are a helpful assistant specialized in Python programming. Answer questions accurately and provide code examples when appropriate.
# Template for chat format
TEMPLATE \"\"\"Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{{ .Prompt }}
### Response:
\"\"\"
"""
with open("Modelfile", "w") as f:
f.write(modelfile_content)
print("Modelfile created!")
print("\nTo use with Ollama:")
print(" 1. Install Ollama: https://ollama.ai")
print(" 2. ollama create my-qa-model -f Modelfile")
print("  3. ollama run my-qa-model")
Running with Ollama
Once the model is created in Ollama, you can interact with it:
# Interactive chat
ollama run my-qa-model
# Single query
ollama run my-qa-model "How do I use asyncio.gather?"
# Via API
curl http://localhost:11434/api/generate -d '{
"model": "my-qa-model",
"prompt": "How do I use asyncio.gather?"
}'
# Python integration with Ollama
# This works on Mac without CUDA!
def check_ollama():
"""Check if Ollama is running."""
try:
import ollama
ollama.list()
return True
except Exception:
return False
if check_ollama():
import ollama
# List available models
models = ollama.list()
print("Available Ollama models:")
for model in models.get('models', []):
print(f" - {model['name']}")
# Example: Query the model
# response = ollama.chat(
# model='my-qa-model',
# messages=[{'role': 'user', 'content': 'How do I use asyncio.gather?'}]
# )
# print(response['message']['content'])
else:
print("Ollama not running. Start it with: ollama serve")
print("\nExample code for when Ollama is available:")
print("""
import ollama
response = ollama.chat(
model='my-qa-model',
messages=[{'role': 'user', 'content': 'How do I use asyncio.gather?'}]
)
print(response['message']['content'])
""")
# Latency comparison (simulated data)
latency_data = pl.DataFrame({
"Method": (["Claude API"] * 20 + ["GPT-4 API"] * 20 +
["Local Ollama (M2)"] * 20 + ["Local Ollama (M3 Pro)"] * 20),
"Latency_ms": (
[450 + i * 10 for i in range(20)] + # Claude API
[380 + i * 12 for i in range(20)] + # GPT-4 API
[25 + i * 2 for i in range(20)] + # Local M2
[18 + i * 1.5 for i in range(20)] # Local M3 Pro
)
})
chart = alt.Chart(latency_data).mark_boxplot().encode(
x=alt.X("Method:N", axis=alt.Axis(labelAngle=-15)),
y=alt.Y("Latency_ms:Q", title="First Token Latency (ms)"),
color=alt.Color("Method:N", legend=None)
).properties(
title="Inference Latency: API vs Local Ollama",
width=400,
height=300
)
chart
Cost-Benefit Analysis
Let’s calculate when fine-tuning and local inference beat API providers.
API Pricing (January 2026)
| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|---|
| Anthropic | Claude 3.5 Sonnet | $3.00 | $15.00 |
| Anthropic | Claude 3.5 Haiku | $0.80 | $4.00 |
| OpenAI | GPT-4o | $2.50 | $10.00 |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 |
| Google | Gemini 1.5 Pro | $1.25 | $5.00 |
Local Inference Costs
| Item | One-time Cost | Monthly Cost |
|---|---|---|
| MacBook Pro M3 | $2,000-4,000 | - |
| Electricity (24/7 operation) | - | ~$5-15 |
| Fine-tuning (Colab Pro) | $10-50 per run | - |
# Cost analysis calculations
# API costs per 1M tokens (average of input/output, weighted 1:2)
API_COSTS = {
"Claude 3.5 Sonnet": (3.00 + 15.00 * 2) / 3, # $11.00
"Claude 3.5 Haiku": (0.80 + 4.00 * 2) / 3, # $2.93
"GPT-4o": (2.50 + 10.00 * 2) / 3, # $7.50
"GPT-4o-mini": (0.15 + 0.60 * 2) / 3, # $0.45
"Gemini 1.5 Pro": (1.25 + 5.00 * 2) / 3, # $3.75
}
# Local costs
LOCAL_MONTHLY_COST = 10 # Electricity
FINE_TUNING_COST = 25 # One-time Colab cost
HARDWARE_COST = 2500 # MacBook Pro (amortized over 3 years = $70/month)
HARDWARE_MONTHLY = HARDWARE_COST / 36
def calculate_monthly_cost(tokens_per_day: int, provider: str) -> float:
"""Calculate monthly API cost."""
tokens_per_month = tokens_per_day * 30
return (tokens_per_month / 1_000_000) * API_COSTS[provider]
def calculate_local_monthly_cost() -> float:
"""Calculate monthly local cost (ignoring one-time fine-tuning)."""
return LOCAL_MONTHLY_COST + HARDWARE_MONTHLY
# Test scenarios
scenarios = [
("Personal/Hobby", 50_000), # 50K tokens/day
("Startup", 1_000_000), # 1M tokens/day
("Growth", 10_000_000), # 10M tokens/day
("Enterprise", 100_000_000), # 100M tokens/day
]
print("Monthly Cost Comparison")
print("=" * 70)
print(f"{'Scale':<15} {'Tokens/Day':<15} {'GPT-4o-mini':<12} {'Claude Haiku':<14} {'Local':<10}")
print("-" * 70)
for name, tokens in scenarios:
gpt_cost = calculate_monthly_cost(tokens, "GPT-4o-mini")
claude_cost = calculate_monthly_cost(tokens, "Claude 3.5 Haiku")
local_cost = calculate_local_monthly_cost()
print(f"{name:<15} {tokens:>12,} ${gpt_cost:>10,.0f} ${claude_cost:>12,.0f} ${local_cost:>8,.0f}")
print("\n* Local cost assumes hardware amortized over 3 years")
print("* One-time fine-tuning cost (~$25) not included in monthly")
# Cost visualization
# Generate cost data across different scales
tokens_range = [10_000, 50_000, 100_000, 500_000, 1_000_000,
5_000_000, 10_000_000, 50_000_000, 100_000_000]
cost_data = []
for tokens in tokens_range:
cost_data.append({
"Tokens_per_day": tokens,
"Provider": "GPT-4o-mini",
"Monthly_Cost": calculate_monthly_cost(tokens, "GPT-4o-mini")
})
cost_data.append({
"Tokens_per_day": tokens,
"Provider": "Claude Haiku",
"Monthly_Cost": calculate_monthly_cost(tokens, "Claude 3.5 Haiku")
})
cost_data.append({
"Tokens_per_day": tokens,
"Provider": "GPT-4o",
"Monthly_Cost": calculate_monthly_cost(tokens, "GPT-4o")
})
cost_data.append({
"Tokens_per_day": tokens,
"Provider": "Local (Fine-tuned)",
"Monthly_Cost": calculate_local_monthly_cost()
})
cost_df = pl.DataFrame(cost_data)
chart = alt.Chart(cost_df).mark_line(point=True).encode(
x=alt.X("Tokens_per_day:Q",
scale=alt.Scale(type="log"),
title="Tokens per Day (log scale)"),
y=alt.Y("Monthly_Cost:Q",
scale=alt.Scale(type="log"),
title="Monthly Cost USD (log scale)"),
color=alt.Color("Provider:N", scale=alt.Scale(scheme="category10")),
strokeDash=alt.condition(
alt.datum.Provider == "Local (Fine-tuned)",
alt.value([5, 5]),
alt.value([0])
)
).properties(
title="Monthly Cost by Scale: API Providers vs Local Fine-tuned Model",
width=600,
height=400
)
chart
# Break-even analysis
def find_break_even(provider: str) -> int:
"""Find tokens/day where local becomes cheaper than API."""
local_monthly = calculate_local_monthly_cost()
cost_per_token = API_COSTS[provider] / 1_000_000
tokens_per_month = local_monthly / cost_per_token
return int(tokens_per_month / 30)
print("Break-Even Points (where local becomes cheaper)")
print("=" * 50)
for provider in ["GPT-4o-mini", "Claude 3.5 Haiku", "GPT-4o", "Claude 3.5 Sonnet"]:
break_even = find_break_even(provider)
print(f"{provider:<20}: {break_even:>12,} tokens/day")
print("\n* Above these volumes, local inference is more cost-effective")
print("* Break-even assumes hardware is already owned")
# Summary visualization: When to use what
summary_data = pl.DataFrame({
"Scale": ["Hobby\n(<50K/day)", "Startup\n(50K-1M/day)", "Growth\n(1M-10M/day)", "Enterprise\n(>10M/day)"],
"Recommendation": ["API", "API or Local", "Local", "Local"],
"Monthly_Savings": [0, 50, 2000, 50000], # Approximate savings with local
"Order": [1, 2, 3, 4]
})
base = alt.Chart(summary_data).encode(
x=alt.X("Scale:N", sort=alt.EncodingSortField(field="Order"), axis=alt.Axis(labelAngle=0)),
)
bars = base.mark_bar().encode(
y=alt.Y("Monthly_Savings:Q", title="Monthly Savings with Local ($)"),
color=alt.Color("Recommendation:N",
scale=alt.Scale(domain=["API", "API or Local", "Local"],
range=["#e45756", "#f58518", "#54a24b"]))
)
text = base.mark_text(dy=-10).encode(
y="Monthly_Savings:Q",
text=alt.Text("Monthly_Savings:Q", format="$,.0f")
)
chart = (bars + text).properties(
title="When Local Fine-Tuning Makes Sense (vs GPT-4o-mini)",
width=400,
height=300
)
chart
Cost Summary
| Scale | Tokens/Day | Recommendation | Notes |
|---|---|---|---|
| Hobby | <50K | Use API | Not worth setup effort |
| Startup | 50K-1M | Consider local | Break-even depends on model quality needs |
| Growth | 1M-10M | Local strongly recommended | Saves $500-5,000/month |
| Enterprise | >10M | Local essential | Saves $10,000+/month |
Additional factors favoring local:
- Latency-sensitive applications (local is 10-20x faster)
- Privacy requirements (data never leaves your machine)
- Offline operation needs
- Predictable costs (no surprise bills)
Additional factors favoring API:
- Cutting-edge capability needs (GPT-4/Claude still ahead on complex tasks)
- Rapid experimentation (easier to switch models)
- No ML expertise on team
Conclusion
Fine-tuning your own LLM is no longer a research project—it’s a practical engineering choice. With Unsloth, you can:
- Fine-tune on consumer hardware (or free Colab)
- Train 2x faster with 70% less memory
- Export to GGUF and run anywhere
- Deploy locally with Ollama for free inference
Key Takeaways
1. Start with QLoRA - It’s 90-95% as good as full fine-tuning at a fraction of the cost
2. Data quality > quantity - 1,000 perfect examples beat 100,000 noisy ones
3. SFT first, then DPO - Get the basics right before preference optimization
4. Evaluate systematically - Both quantitative metrics and qualitative testing
5. Local deployment is viable - Ollama makes it dead simple on Mac
6. Cost math favors local at scale - Above 100K tokens/day, consider fine-tuning
Recommended Workflow
1. Collect domain data (docs, Q&A, examples)
2. Generate training dataset (Alpaca format)
3. Fine-tune with Unsloth on Colab (QLoRA, Llama-3.2-3B)
4. Evaluate on held-out test set
5. If needed, add DPO with preference data
6. Export to GGUF
7. Deploy with Ollama locally
8. Iterate based on real-world feedback
Resources
- Unsloth GitHub - Installation and examples
- Unsloth Documentation - Detailed guides
- Ollama - Local model deployment
- TRL Library - SFT, DPO, ORPO trainers
- PEFT Library - LoRA implementation
The best model for your task isn’t the biggest—it’s the one trained on your data. Fine-tuning makes that possible for everyone.