Fine-Tuning LLMs with Unsloth: A Complete Guide
Every time you call Claude or GPT-4, you’re paying for tokens. At $3-15 per million tokens, costs add up fast. A customer service bot handling 10,000 conversations per day at 2,000 tokens each burns through 20M tokens—that’s $60-300 daily, or $1,800-9,000 per month.
But here’s the thing: most of those conversations are domain-specific. Your bot answers questions about your product, your documentation, your policies. A general-purpose model like GPT-4 is massive overkill. You’re paying for the model’s ability to write poetry in Klingon when all you need is accurate answers about your return policy.
Fine-tuning changes the economics. Take a smaller model (3B-8B parameters), train it on your specific domain, and run it locally. The upfront cost is a few hours of GPU time ($5-50). The ongoing cost? Electricity. Your MacBook’s M2 chip can run a 7B model at 30+ tokens/second—for free.
This guide covers the complete workflow:
- Understanding fine-tuning approaches — LoRA, QLoRA, and when to use each
- Reinforcement learning methods — DPO, ORPO, GRPO for preference alignment
- Data preparation — Building Q&A datasets from documentation
- Training with Unsloth — 2x faster, 70% less memory
- Evaluation — Measuring if your model actually improved
- Local deployment — Export to GGUF, run with Ollama on Mac
- Cost analysis — When fine-tuning beats API providers at every scale
By the end, you’ll have a fine-tuned model running locally that outperforms generic LLMs on your specific task—at a fraction of the cost.
Table of Contents
- The Case for Fine-Tuning
- Understanding Fine-Tuning Approaches
- Reinforcement Learning Methods
- Data Preparation
- Setting Up Unsloth
- Supervised Fine-Tuning (SFT)
- Preference Alignment with DPO
- Evaluation and Metrics
- Local Deployment with Ollama
- Cost-Benefit Analysis
- Conclusion
The Case for Fine-Tuning
Before diving into the how, let’s establish the when. Fine-tuning isn’t always the right choice.
When Fine-Tuning Makes Sense
| Scenario | Why Fine-Tuning Helps |
|---|---|
| Domain expertise | Generic models don’t know your product, codebase, or policies |
| Cost at scale | API costs grow linearly; local inference is nearly free |
| Latency requirements | Local inference: 20-50ms. API calls: 200-2000ms |
| Privacy/compliance | Data never leaves your infrastructure |
| Consistent behavior | Fine-tuned models give predictable, reproducible outputs |
| Offline operation | No internet required once deployed |
When to Stick with APIs
| Scenario | Why APIs Win |
|---|---|
| Low volume | Under 100K tokens/day, API costs are negligible |
| General tasks | Creative writing, broad knowledge questions |
| Rapid iteration | Changing prompts is faster than retraining |
| No ML expertise | Fine-tuning requires understanding of training dynamics |
| Cutting-edge capability | GPT-4/Claude Opus still beat smaller fine-tuned models on complex reasoning |
What We’ll Build
A domain Q&A model that answers questions about technical documentation. The workflow:
Documentation → Q&A Dataset → Fine-tune with Unsloth → Export to GGUF → Run with Ollama
This is the most common fine-tuning use case: taking a small model and making it an expert on your specific content.
Understanding Fine-Tuning Approaches
There are three main approaches to adapting a pre-trained LLM to your task:
1. Full Fine-Tuning
Update all model parameters. This is what the original BERT/GPT papers did.
The problem: A 7B parameter model in fp16 needs 14GB just to store weights. During training, you also need gradients (14GB) and optimizer states (28GB for Adam). That’s 56GB minimum—and that’s before activations.
When to use: Almost never for LLMs. Only if you have massive compute budget and need maximum performance.
2. LoRA (Low-Rank Adaptation)
The key insight: weight updates during fine-tuning are low-rank. Instead of updating the full weight matrix W, we learn two small matrices A and B where the update is BA.
Original: W (4096 x 4096) = 16M parameters
LoRA update: B (4096 x 16) × A (16 x 4096) = 131K parameters
Reduction: 99.2% fewer trainable parameters
The rank r (16 in this example) controls the trade-off between expressiveness and efficiency. Typical values: 8-64.
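The parameter arithmetic above is easy to reproduce; a small sketch (plain Python, names are our own) for counting trainable parameters at a given rank:

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable parameter counts for one weight matrix.

    full: updating W directly (d_out x d_in)
    lora: training B (d_out x r) and A (r x d_in) instead
    """
    full = d_out * d_in
    lora = d_out * r + r * d_in
    return full, lora

full, lora = lora_param_counts(4096, 4096, r=16)
print(f"Full: {full:,}  LoRA: {lora:,}  reduction: {1 - lora / full:.1%}")
# Full: 16,777,216  LoRA: 131,072  reduction: 99.2%
```

Doubling the rank doubles the adapter size but leaves it tiny relative to the frozen weights.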
Benefits:
- Train on consumer GPUs (16GB VRAM is enough for 7B models)
- Adapter weights are tiny (~10-50MB)
- Can swap adapters at inference time (multi-task)
- No inference overhead when merged
3. QLoRA (Quantized LoRA)
QLoRA takes LoRA further by quantizing the base model to 4-bit while training the LoRA adapters in higher precision.
Full fine-tune 7B: 56GB+ VRAM
LoRA 7B (fp16): ~16GB VRAM
QLoRA 7B (4-bit): ~6GB VRAM
This is what makes fine-tuning accessible on consumer hardware. A 7B model fits on an RTX 3060 (12GB) or even a free Google Colab T4 (15GB).
Key QLoRA innovations:
- 4-bit NormalFloat (NF4): Optimal quantization for normally-distributed weights
- Double quantization: Quantize the quantization constants (saves ~0.5GB per 7B params)
- Paged optimizers: Spill optimizer states to CPU RAM during spikes
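These same innovations are exposed outside Unsloth through Hugging Face's `BitsAndBytesConfig`; a sketch of QLoRA-style loading (assumes `transformers` and `bitsandbytes` are installed; the checkpoint name is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,       # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# The base model is stored in 4-bit; LoRA adapters train in higher precision on top.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",   # example checkpoint
    quantization_config=bnb_config,
)
```

Unsloth wraps this in a single `load_in_4bit=True` flag, as shown later in this guide.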
import altair as alt
import polars as pl
# Memory comparison visualization
memory_data = pl.DataFrame({
"Method": ["Full Fine-Tune", "Full Fine-Tune", "Full Fine-Tune",
"LoRA (fp16)", "LoRA (fp16)", "LoRA (fp16)",
"QLoRA (4-bit)", "QLoRA (4-bit)", "QLoRA (4-bit)"],
"Component": ["Weights", "Gradients", "Optimizer"] * 3,
"Memory_GB": [
14.0, 14.0, 28.0, # Full fine-tune
14.0, 0.1, 0.2, # LoRA (only adapter gradients/optimizer)
3.5, 0.1, 0.2, # QLoRA (4-bit base + adapter)
]
})
chart = alt.Chart(memory_data).mark_bar().encode(
x=alt.X("Method:N", sort=["Full Fine-Tune", "LoRA (fp16)", "QLoRA (4-bit)"],
axis=alt.Axis(labelAngle=0)),
y=alt.Y("Memory_GB:Q", title="GPU Memory (GB)"),
color=alt.Color("Component:N",
scale=alt.Scale(scheme="category10"),
legend=alt.Legend(orient="top")),
order=alt.Order("Component:N", sort="descending")
).properties(
title="Memory Requirements for Fine-Tuning a 7B Model",
width=400,
height=300
)
chart

LoRA Architecture Deep Dive
Here’s how LoRA modifies a transformer layer:
Original forward pass:
h = W @ x
LoRA forward pass:
h = W @ x + (B @ A) @ x
h = W @ x + B @ (A @ x) # More efficient: A is small
Where:
- W is the frozen pre-trained weight matrix (d × d)
- A is the down-projection (r × d), initialized with Gaussian noise
- B is the up-projection (d × r), initialized with zeros
- r is the rank (hyperparameter, typically 8-64)
Why initialize B with zeros? So at the start of training, B @ A = 0, meaning the model behaves exactly like the pre-trained model. Training gradually adds the learned adaptation.
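A toy sketch of this in pure Python (dimensions shrunk for readability): with B initialized to zeros, the adapted forward pass reproduces the frozen layer exactly.

```python
import random

d, r = 8, 2
random.seed(0)

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]      # frozen weight
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]   # Gaussian init
B = [[0.0] * r for _ in range(d)]                                   # zero init
x = [random.gauss(0, 1) for _ in range(d)]

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

base = matvec(W, x)
Ax = matvec(A, x)                         # r-dimensional: the cheap order
adapted = [b + extra for b, extra in zip(base, matvec(B, Ax))]

print(all(abs(u - v) < 1e-12 for u, v in zip(base, adapted)))  # True: B = 0
```

As training updates B away from zero, `B @ (A @ x)` starts contributing the learned adaptation.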
Which layers to adapt? Typically the attention projections (q_proj, k_proj, v_proj, o_proj) and sometimes the MLP layers. Unsloth makes this easy to configure.
When to Use Each Approach
| Available VRAM | Model Size | Recommended Approach |
|---|---|---|
| 8GB | Up to 3B | QLoRA |
| 12-16GB | Up to 7B | QLoRA |
| 24GB | Up to 13B | QLoRA or LoRA (fp16) |
| 48GB+ | Up to 70B | LoRA (fp16) |
| 80GB+ (A100) | Any | LoRA or Full Fine-Tune |
Rule of thumb: Start with QLoRA. It’s 90-95% as good as full fine-tuning for most tasks, at a fraction of the cost. Only upgrade if you see quality issues.
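The table can be condensed into a back-of-the-envelope formula (a heuristic of our own, calibrated to the figures above; it ignores activations, batch size, and sequence length):

```python
def rough_vram_gb(params_b: float, method: str) -> float:
    """Very rough training-VRAM estimate in GB for a params_b-billion model.

    Heuristic only -- real usage varies with context length and batch size.
    """
    if method == "full":    # fp16 weights + fp16 grads + Adam states
        return params_b * (2 + 2 + 4)
    if method == "lora":    # fp16 base + small adapter grads/optimizer + overhead
        return params_b * 2 + 2
    if method == "qlora":   # ~0.5 bytes/param for the 4-bit base + overhead
        return params_b * 0.5 + 2.5
    raise ValueError(f"unknown method: {method}")

for m in ("full", "lora", "qlora"):
    print(f"7B via {m}: ~{rough_vram_gb(7, m):.0f} GB")
# matches the ~56 / ~16 / ~6 GB figures above
```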
Reinforcement Learning Methods
Supervised Fine-Tuning (SFT) teaches the model what to say. Reinforcement Learning from Human Feedback (RLHF) teaches it how to say it—making outputs more helpful, harmless, and honest.
The classic RLHF pipeline:
1. SFT: Train on demonstrations
2. Reward Model: Train on human preferences (A > B)
3. PPO: Optimize policy against reward model
This is complex and expensive. Modern methods simplify it significantly.
PPO (Proximal Policy Optimization)
The original RLHF method, used by OpenAI for GPT-4.
How it works:
- Generate responses from current policy
- Score with reward model
- Update policy to increase reward (with KL penalty to stay close to reference)
Problems:
- Requires training a separate reward model
- Needs to keep reference model in memory
- Unstable training dynamics
- High memory: policy + reference + reward = 3 models
DPO (Direct Preference Optimization)
The breakthrough: DPO proves you can skip the reward model entirely. The policy is the reward model.
PPO: Data → Reward Model → Policy Optimization
DPO: Data → Policy Optimization (directly on preferences)
How it works:
- Input: pairs of (prompt, chosen_response, rejected_response)
- Loss: increase probability of chosen, decrease probability of rejected
- Implicit reward: log-probability ratio between policy and reference
Benefits:
- No reward model needed
- Stable supervised learning (no RL instability)
- Memory: just policy + reference (can share weights with Unsloth)
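The per-example objective is simple enough to write out directly (the log-probabilities below are made-up numbers for illustration; libraries like TRL compute them from token logits):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: -log sigmoid(beta * margin).

    The implicit reward of a response is beta * (policy logp - reference logp).
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)

# Policy already prefers the chosen response relative to the reference:
print(dpo_loss(-12.0, -20.0, -14.0, -18.0))  # margin > 0, so loss < log(2)
```

At initialization (policy == reference) the margin is zero and the loss is log 2; training pushes the chosen/rejected gap apart.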
ORPO (Odds Ratio Preference Optimization)
The insight: DPO still requires SFT first, then DPO. ORPO combines them.
Traditional: SFT → DPO (two training stages)
ORPO: SFT + Preference (single stage)
ORPO adds an odds-ratio penalty to the SFT loss, teaching the model to prefer good responses while learning the task.
Benefits:
- Single training stage (faster, cheaper)
- No reference model needed
- Simpler hyperparameters
GRPO (Group Relative Policy Optimization)
DeepSeek’s method for training reasoning models (DeepSeek-R1).
Key innovations:
- No reward model: use rule-based verification (math correctness, code execution)
- No value model: replace with group statistics
- Sample multiple responses, rank by reward, update toward better ones
Best for: Tasks with verifiable outputs (math, code, structured data).
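The "group relative" part is easy to illustrate: advantages come from normalizing rewards within each group of sampled responses, not from a learned value model (a simplified sketch):

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one group of sampled responses.

    GRPO replaces a learned value baseline with the group mean/std.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# e.g. 4 sampled answers to one math problem, scored by a rule-based checker
rewards = [1.0, 0.0, 0.0, 1.0]  # 1 = verified correct, 0 = incorrect
print(group_advantages(rewards))  # correct answers get positive advantage
```

The policy is then updated to increase the probability of high-advantage responses within each group.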
# RL Methods comparison
rl_comparison = pl.DataFrame({
"Method": ["PPO", "DPO", "ORPO", "GRPO"],
"Complexity": [5, 2, 1, 3],
"Memory": [5, 3, 2, 3],
"Stability": [2, 4, 5, 4],
"Data_Requirement": [4, 3, 3, 2],
})
# Melt for grouped bar chart
rl_melted = rl_comparison.unpivot(
index="Method",
on=["Complexity", "Memory", "Stability", "Data_Requirement"],
variable_name="Metric",
value_name="Score"
)
chart = alt.Chart(rl_melted).mark_bar().encode(
x=alt.X("Method:N", axis=alt.Axis(labelAngle=0)),
y=alt.Y("Score:Q", title="Score (lower is better for Complexity/Memory)"),
color=alt.Color("Metric:N", scale=alt.Scale(scheme="tableau10")),
xOffset="Metric:N"
).properties(
title="RL Method Comparison (1-5 scale)",
width=500,
height=300
)
chart

Which Method Should You Use?
Start Here
│
▼
Do you have preference data?
(A > B pairs)
/ \
Yes No
│ │
▼ ▼
Want single-stage? Is output verifiable?
/ \ (math, code, JSON)
Yes No / \
│ │ Yes No
▼ ▼ │ │
ORPO DPO GRPO SFT only
For most domain Q&A tasks: Start with SFT. Add DPO if you have preference data and want to improve quality further.
Data Preparation
Data quality is the single biggest factor in fine-tuning success. A model trained on 1,000 high-quality examples often outperforms one trained on 100,000 noisy examples.
Dataset Formats
The two most common formats:
1. Alpaca Format (instruction-following)
{
"instruction": "Explain the return policy",
"input": "",
"output": "Our return policy allows returns within 30 days..."
}

2. ShareGPT/Conversation Format (multi-turn)
{
"conversations": [
{"from": "human", "value": "What's the return policy?"},
{"from": "gpt", "value": "Our return policy allows..."},
{"from": "human", "value": "What about opened items?"},
{"from": "gpt", "value": "Opened items can be returned if..."}
]
}

Building a Q&A Dataset from Documentation
For domain Q&A, we need to:
- Extract content from documentation
- Generate question-answer pairs
- Validate quality
Let’s build a pipeline to create a Q&A dataset from Python documentation.
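The examples below are hand-written for clarity, but in practice step 2 is LLM-driven. Here is what the prompt construction for that might look like (the prompt wording and the `n_pairs` parameter are our own invention, not a fixed API):

```python
def build_qa_generation_prompt(title: str, content: str, n_pairs: int = 3) -> str:
    """Build a prompt asking an LLM to generate Q&A pairs from one doc chunk."""
    return (
        f"You are generating training data for a domain Q&A model.\n"
        f"Read the documentation below and write {n_pairs} question-answer pairs.\n"
        f"Questions should be phrased the way a developer would ask them.\n"
        f"Answers must be fully supported by the documentation.\n"
        f"Respond as a JSON list of objects with 'instruction' and 'output' keys.\n\n"
        f"## {title}\n{content}"
    )

prompt = build_qa_generation_prompt("asyncio.gather", "Run awaitables concurrently...")
print(prompt.splitlines()[0])
```

Send this prompt to any capable model (via Ollama locally or an API), parse the JSON, and you have the same structure the hand-crafted examples below produce.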
import polars as pl
from typing import TypedDict
import json
class QAPair(TypedDict):
"""A question-answer pair for fine-tuning."""
instruction: str
input: str
output: str
source: str
class PreferencePair(TypedDict):
"""A preference pair for DPO training."""
prompt: str
chosen: str
rejected: str
# Example documentation chunks (in practice, extract from your docs)
DOCUMENTATION_CHUNKS = [
{
"title": "asyncio.gather",
"content": """asyncio.gather(*aws, return_exceptions=False)
Run awaitable objects in the aws sequence concurrently.
If any awaitable in aws is a coroutine, it is automatically scheduled as a Task.
If all awaitables are completed successfully, the result is an aggregate list of
returned values. The order of result values corresponds to the order of awaitables in aws.
If return_exceptions is True, exceptions in the awaitables are treated the same as
successful results, and gathered in the result list.""",
"source": "python-docs/asyncio"
},
{
"title": "dataclasses.field",
"content": """dataclasses.field(*, default=MISSING, default_factory=MISSING, repr=True,
hash=None, init=True, compare=True, metadata=None, kw_only=MISSING)
For common and simple use cases, no other functionality is required. There are, however,
some dataclass features that require additional per-field information. To satisfy this
need for additional information, you can replace the default field value with a call to
the provided field() function.
The default_factory parameter provides a way to specify a field's default value using a
callable. This is useful for mutable default values like lists or dicts.""",
"source": "python-docs/dataclasses"
},
{
"title": "typing.TypedDict",
"content": """class typing.TypedDict(dict)
Special construct to add type hints to a dictionary. At runtime it is a plain dict.
TypedDict declares a dictionary type that expects all of its instances to have a certain
set of keys, where each key is associated with a value of a consistent type.
This expectation is not checked at runtime but is only enforced by type checkers.""",
"source": "python-docs/typing"
}
]
print(f"Loaded {len(DOCUMENTATION_CHUNKS)} documentation chunks")
print(f"\nExample chunk: {DOCUMENTATION_CHUNKS[0]['title']}")

def generate_qa_pairs_from_chunk(chunk: dict) -> list[QAPair]:
    """Generate Q&A pairs from a documentation chunk.

    In production, use an LLM (via ollama or API) to generate these.
    Here we show the pattern with hand-crafted examples.
    """
    # These would be generated by an LLM in practice
    qa_templates = {
        "asyncio.gather": [
            QAPair(
                instruction="How do I run multiple async functions concurrently in Python?",
                input="",
                output="""Use `asyncio.gather()` to run multiple awaitables concurrently. It takes any number of awaitable objects and runs them in parallel:

```python
import asyncio

async def fetch_data(url):
    # async operation
    pass

async def main():
    results = await asyncio.gather(
        fetch_data("url1"),
        fetch_data("url2"),
        fetch_data("url3")
    )
    # results is a list in the same order as inputs
```

The results are returned as a list in the same order as the input awaitables.""",
                source=chunk["source"]
            ),
            QAPair(
                instruction="What does return_exceptions do in asyncio.gather?",
                input="",
                output="""The return_exceptions parameter in asyncio.gather() controls how exceptions are handled:

- `return_exceptions=False` (default): If any awaitable raises an exception, it propagates immediately and cancels other tasks.
- `return_exceptions=True`: Exceptions are captured and returned in the results list alongside successful values.

Example with exception handling:

```python
results = await asyncio.gather(
    task1(),
    task2(),  # might raise
    task3(),
    return_exceptions=True
)
# results might be: [result1, ValueError(...), result3]
```

This is useful when you want all tasks to complete even if some fail.""",
                source=chunk["source"]
            )
        ],
        "dataclasses.field": [
            QAPair(
                instruction="How do I set a default list value for a dataclass field?",
                input="",
                output="""Use field(default_factory=list) to set a mutable default value in a dataclass. Never use a mutable default directly, as it would be shared across all instances.

```python
from dataclasses import dataclass, field

# WRONG - shared mutable default
@dataclass
class Wrong:
    items: list = []  # This is a bug!

# CORRECT - factory creates a new list per instance
@dataclass
class Correct:
    items: list = field(default_factory=list)
```

The default_factory is called each time a new instance is created, ensuring each instance gets its own list.""",
                source=chunk["source"]
            )
        ],
        "typing.TypedDict": [
            QAPair(
                instruction="What is TypedDict and when should I use it?",
                input="",
                output="""TypedDict is a way to add type hints to dictionaries with a fixed set of keys. Use it when:

- You need a dictionary with known keys (like JSON responses)
- You want IDE autocompletion and type checking
- You don't need the overhead of a full class

```python
from typing import TypedDict

class UserDict(TypedDict):
    name: str
    age: int
    email: str | None

def process_user(user: UserDict) -> None:
    print(user["name"])  # Type checker knows this is str
```

Note: TypedDict is only for static type checking. At runtime, it's just a regular dict with no validation.""",
                source=chunk["source"]
            )
        ]
    }
    return qa_templates.get(chunk["title"], [])

# Generate Q&A pairs from all chunks
all_qa_pairs: list[QAPair] = []
for chunk in DOCUMENTATION_CHUNKS:
    pairs = generate_qa_pairs_from_chunk(chunk)
    all_qa_pairs.extend(pairs)

print(f"Generated {len(all_qa_pairs)} Q&A pairs")
print(f"\nExample Q&A pair:")
print(f"Q: {all_qa_pairs[0]['instruction']}")
print(f"A: {all_qa_pairs[0]['output'][:200]}...")
def format_for_training(qa_pairs: list[QAPair], format_type: str = "alpaca") -> list[dict]:
"""Format Q&A pairs for training.
Args:
qa_pairs: List of Q&A pairs
format_type: 'alpaca' or 'sharegpt'
Returns:
List of formatted training examples
"""
if format_type == "alpaca":
return [
{
"instruction": qa["instruction"],
"input": qa.get("input", ""),
"output": qa["output"]
}
for qa in qa_pairs
]
elif format_type == "sharegpt":
return [
{
"conversations": [
{"from": "human", "value": qa["instruction"]},
{"from": "gpt", "value": qa["output"]}
]
}
for qa in qa_pairs
]
else:
raise ValueError(f"Unknown format: {format_type}")
# Format for Alpaca (most common for Unsloth)
training_data = format_for_training(all_qa_pairs, "alpaca")
# Save to JSON
with open("training_data.json", "w") as f:
json.dump(training_data, f, indent=2)
print(f"Saved {len(training_data)} examples to training_data.json")
print(f"\nExample formatted for training:")
print(json.dumps(training_data[0], indent=2)[:500])
Creating Preference Data for DPO
For DPO training, we need pairs of (prompt, chosen_response, rejected_response). There are several ways to create this:
- Human annotation: Have humans rank responses (gold standard but expensive)
- Model-based ranking: Use a stronger model to rank responses from weaker models
- Synthetic generation: Generate good and bad examples programmatically
# Example preference data for DPO
preference_data: list[PreferencePair] = [
    PreferencePair(
        prompt="How do I run multiple async functions concurrently?",
        chosen="""Use `asyncio.gather()` to run multiple awaitables concurrently:

```python
import asyncio

async def main():
    results = await asyncio.gather(
        task1(),
        task2(),
        task3()
    )
```

Results are returned in the same order as inputs. Use return_exceptions=True to handle errors gracefully.""",
        rejected="""You can use threading or multiprocessing to run functions concurrently. Just create multiple threads and start them."""
    ),
    PreferencePair(
        prompt="What's the best way to handle mutable default arguments in dataclasses?",
        chosen="""Use field(default_factory=...) to avoid the mutable default argument bug:

```python
from dataclasses import dataclass, field

@dataclass
class Config:
    options: list = field(default_factory=list)
    settings: dict = field(default_factory=dict)
```

The factory function is called for each new instance, ensuring isolation.""",
        rejected="""Just set the default to an empty list: options: list = []. Python handles it automatically."""
    )
]

print(f"Created {len(preference_data)} preference pairs for DPO")
print(f"\nExample preference pair:")
print(f"Prompt: {preference_data[0]['prompt']}")
print(f"Chosen (correct): {preference_data[0]['chosen'][:100]}...")
print(f"Rejected (wrong): {preference_data[0]['rejected'][:100]}...")
### Data Quality Checklist
Before training, verify your data:
- [ ] **Accuracy**: Are the answers factually correct?
- [ ] **Completeness**: Do answers fully address the question?
- [ ] **Formatting**: Consistent use of code blocks, lists, etc.
- [ ] **Length distribution**: Mix of short and detailed answers
- [ ] **Diversity**: Cover different topics and question types
- [ ] **No data leakage**: Test set questions shouldn't overlap with training
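Parts of this checklist are mechanical enough to automate. A sketch of basic checks for duplicates and train/test leakage (exact-match overlap only; near-duplicate detection would need embedding similarity):

```python
def check_dataset(train: list[dict], test: list[dict]) -> dict:
    """Run basic quality checks on Alpaca-format Q&A data."""
    train_qs = [ex["instruction"].strip().lower() for ex in train]
    test_qs = {ex["instruction"].strip().lower() for ex in test}
    duplicates = len(train_qs) - len(set(train_qs))       # exact repeats in train
    leaked = sorted(set(train_qs) & test_qs)              # train/test overlap
    lengths = [len(ex["output"]) for ex in train]
    return {
        "duplicates": duplicates,
        "leaked_questions": leaked,
        "min_answer_len": min(lengths),
        "max_answer_len": max(lengths),
    }

train = [{"instruction": "What is TypedDict?", "output": "A typed dict..."},
         {"instruction": "What is TypedDict?", "output": "A typed dict..."}]
test = [{"instruction": "what is typeddict?", "output": "..."}]
report = check_dataset(train, test)
print(report["duplicates"], report["leaked_questions"])  # 1 ['what is typeddict?']
```

Run this before every training run; a leaked test question silently inflates your evaluation numbers.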
---
## Setting Up Unsloth
[Unsloth](https://github.com/unslothai/unsloth) is an optimization library that makes LLM fine-tuning **2x faster** and uses **70% less memory** compared to standard Hugging Face training.
### Why Unsloth?
- **Custom CUDA kernels**: Fused operations for attention, RoPE, cross-entropy
- **Memory optimization**: Gradient checkpointing, dynamic quantization
- **Easy LoRA/QLoRA**: Single function call to add adapters
- **Wide model support**: Llama, Mistral, Qwen, Gemma, Phi, and more
- **Export options**: GGUF, vLLM, Hugging Face formats
### Installation
**Google Colab (recommended for training):**
```python
%%capture
!pip install unsloth
# Get the latest from GitHub
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
Local with CUDA:
pip install "unsloth[cu121-ampere-torch240]" # For RTX 30xx/40xx
# OR
pip install "unsloth[cu121-torch240]"  # For older GPUs

# ============================================================
# NOTE: The following code requires a CUDA GPU.
# Run this section in Google Colab or on a machine with NVIDIA GPU.
# ============================================================
# Check if CUDA is available
import subprocess
import sys
def check_cuda():
"""Check if CUDA is available."""
try:
import torch
if torch.cuda.is_available():
print(f"CUDA available: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
return True
else:
print("CUDA not available. Training code will be shown but not executed.")
return False
except ImportError:
print("PyTorch not installed. Training code will be shown but not executed.")
return False
HAS_CUDA = check_cuda()

# Model loading with Unsloth (requires CUDA)
if HAS_CUDA:
from unsloth import FastLanguageModel
# Load model in 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Llama-3.2-3B-Instruct", # Good balance of quality and size
max_seq_length=2048, # Can go up to 8192 for longer contexts
load_in_4bit=True, # Use QLoRA (4-bit quantization)
dtype=None, # Auto-detect (bf16 for Ampere+, fp16 otherwise)
)
print(f"Model loaded: {model.config._name_or_path}")
print(f"Parameters: {model.num_parameters() / 1e9:.2f}B")
else:
print("Skipping model loading (no CUDA)")
    print("\nTo run this code, use Google Colab with GPU runtime.")

# Add LoRA adapters (requires CUDA)
if HAS_CUDA:
model = FastLanguageModel.get_peft_model(
model,
r=16, # LoRA rank - higher = more capacity, more memory
lora_alpha=16, # Scaling factor (usually equal to r)
lora_dropout=0, # Dropout for regularization (0 is fine for small datasets)
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj", # Attention projections
"gate_proj", "up_proj", "down_proj", # MLP layers
],
bias="none", # Don't train biases
use_gradient_checkpointing="unsloth", # Memory optimization
random_state=42,
use_rslora=False, # Rank-stabilized LoRA (experimental)
)
# Print trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")
else:
    print("Skipping LoRA setup (no CUDA)")

LoRA Configuration Deep Dive
| Parameter | Typical Values | Effect |
|---|---|---|
| r (rank) | 8, 16, 32, 64 | Higher = more capacity, more VRAM |
| lora_alpha | Same as r | Scaling factor for adapter output |
| lora_dropout | 0, 0.05, 0.1 | Regularization (higher = less overfitting) |
| target_modules | See below | Which layers to adapt |
Target modules by model:
- Llama/Mistral: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Qwen: Same as Llama
- Gemma: Same as Llama
Rule of thumb: Start with r=16 and all target modules. Reduce r if running out of VRAM.
Supervised Fine-Tuning (SFT)
SFT is the foundation of fine-tuning. We train the model to predict the next token given the prompt and expected output.
# Define the prompt template
ALPACA_PROMPT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:
{output}"""
def format_prompt(example: dict) -> str:
"""Format a training example into the prompt template."""
return ALPACA_PROMPT.format(
instruction=example["instruction"],
output=example["output"]
)
# Test the formatting
print("Example formatted prompt:")
print(format_prompt(training_data[0])[:500])

# SFT Training (requires CUDA)
if HAS_CUDA:
    import torch  # needed below for the bf16/fp16 check
    from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset
# Convert to HuggingFace dataset
dataset = Dataset.from_list(training_data)
# Format with the prompt template
def format_examples(examples):
texts = []
for instruction, output in zip(examples["instruction"], examples["output"]):
text = ALPACA_PROMPT.format(instruction=instruction, output=output)
texts.append(text)
return {"text": texts}
dataset = dataset.map(format_examples, batched=True)
# Training arguments
training_args = TrainingArguments(
output_dir="outputs",
per_device_train_batch_size=2, # Adjust based on VRAM
gradient_accumulation_steps=4, # Effective batch size = 2 * 4 = 8
num_train_epochs=3, # More epochs for small datasets
learning_rate=2e-4, # Standard for LoRA
lr_scheduler_type="cosine", # Cosine annealing
warmup_steps=10, # Warmup for stability
logging_steps=1, # Log every step
save_strategy="epoch", # Save after each epoch
optim="adamw_8bit", # Memory-efficient optimizer
seed=42,
fp16=not torch.cuda.is_bf16_supported(), # Use bf16 if available
bf16=torch.cuda.is_bf16_supported(),
)
# Create trainer
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
args=training_args,
dataset_text_field="text",
max_seq_length=2048,
packing=False, # Don't pack multiple examples (cleaner gradients)
)
print("Starting SFT training...")
trainer_stats = trainer.train()
print(f"Training complete! Final loss: {trainer_stats.training_loss:.4f}")
else:
print("Skipping SFT training (no CUDA)")
    print("\nIn Colab, this would train the model on your Q&A dataset.")

# Simulated training metrics for visualization
# (Replace with actual metrics from trainer.state.log_history in Colab)
training_metrics = pl.DataFrame({
"step": list(range(0, 100, 5)),
"loss": [2.8, 2.4, 2.1, 1.9, 1.7, 1.5, 1.4, 1.3, 1.2, 1.15,
1.1, 1.05, 1.0, 0.95, 0.92, 0.88, 0.85, 0.82, 0.8, 0.78],
})
chart = alt.Chart(training_metrics).mark_line(point=True).encode(
x=alt.X("step:Q", title="Training Step"),
y=alt.Y("loss:Q", title="Loss", scale=alt.Scale(domain=[0.5, 3])),
).properties(
title="SFT Training Loss Curve",
width=500,
height=300
)
chart

Hyperparameter Tuning Guide
| Parameter | Starting Value | Adjust If… |
|---|---|---|
| learning_rate | 2e-4 | Loss spikes: reduce. Loss plateaus early: increase. |
| num_train_epochs | 3-5 | Small dataset: more epochs. Large dataset: fewer. |
| batch_size | 2-4 | Out of memory: reduce. Training slow: increase with gradient accumulation. |
| gradient_accumulation | 4-8 | Want larger effective batch without more VRAM. |
| warmup_steps | 5-10% of total | Training unstable at start: increase. |
Common issues:
- Loss doesn’t decrease: Learning rate too low, or data formatting wrong
- Loss spikes: Learning rate too high, reduce by 50%
- Overfitting: Add dropout, reduce epochs, or get more data
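The "loss spikes" failure mode can be caught programmatically; a crude heuristic sketch (the threshold factor is our own choice, not a standard):

```python
def detect_loss_spikes(losses: list[float], factor: float = 1.5) -> list[int]:
    """Return indices where loss jumps above `factor` times the previous step.

    A simple heuristic for spotting unstable training; inspect the logs
    (e.g. trainer.state.log_history) and consider halving the learning rate.
    """
    return [i for i in range(1, len(losses))
            if losses[i] > factor * losses[i - 1]]

history = [2.8, 2.4, 2.1, 1.9, 3.6, 1.7, 1.5]
print(detect_loss_spikes(history))  # [4] -- the jump from 1.9 to 3.6
```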
Preference Alignment with DPO
After SFT, we can optionally train with DPO to improve response quality. This teaches the model to prefer better responses.
# DPO Training (requires CUDA and SFT model)
if HAS_CUDA:
    import torch  # needed below for the bf16/fp16 check
    from trl import DPOTrainer, DPOConfig
from datasets import Dataset
# Convert preference data to HuggingFace dataset
dpo_dataset = Dataset.from_list(preference_data)
# DPO configuration
dpo_config = DPOConfig(
output_dir="outputs_dpo",
beta=0.1, # KL penalty weight (higher = more conservative)
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
num_train_epochs=1, # DPO usually needs fewer epochs
learning_rate=5e-5, # Lower than SFT
lr_scheduler_type="cosine",
warmup_ratio=0.1,
logging_steps=1,
optim="adamw_8bit",
seed=42,
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
)
# Create DPO trainer
# Note: Unsloth shares weights between policy and reference, saving memory
dpo_trainer = DPOTrainer(
model=model,
ref_model=None, # Unsloth handles reference model internally
train_dataset=dpo_dataset,
tokenizer=tokenizer,
args=dpo_config,
)
print("Starting DPO training...")
dpo_trainer.train()
print("DPO training complete!")
else:
print("Skipping DPO training (no CUDA)")
    print("\nDPO improves response quality by learning from preference pairs.")

DPO vs ORPO
If you want to skip the two-stage SFT → DPO pipeline, ORPO combines both in a single training run:
from trl import ORPOTrainer, ORPOConfig
orpo_config = ORPOConfig(
output_dir="outputs_orpo",
beta=0.1, # Same as DPO
# ... other training args
)
# ORPO dataset needs instruction + chosen + rejected
orpo_trainer = ORPOTrainer(
model=model,
train_dataset=orpo_dataset,
tokenizer=tokenizer,
args=orpo_config,
)
orpo_trainer.train()

When to use ORPO:
- You have preference data from the start
- You want faster iteration (single training run)
- You’re okay with slightly less fine-grained control
Evaluation and Metrics
How do you know if fine-tuning actually helped? You need systematic evaluation.
Quantitative Metrics
1. Perplexity - How well the model predicts the next token
- Lower is better
- Compare before/after fine-tuning on held-out data
2. Task-specific metrics - Depends on your use case
- Q&A: Exact match, F1 score, BLEU/ROUGE
- Classification: Accuracy, precision, recall
- Generation: Human preference ratings
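Exact match and token-level F1 are simple to implement yourself (the SQuAD-style formulation, with whitespace tokenization for simplicity):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction: str, reference: str) -> bool:
    """Exact match after lowercasing and whitespace normalization."""
    return " ".join(prediction.lower().split()) == " ".join(reference.lower().split())

print(token_f1("use asyncio.gather to run tasks", "use asyncio.gather"))
```

Average these over your held-out set before and after fine-tuning to quantify the improvement.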
Qualitative Evaluation
Numbers don’t tell the whole story. Always do qualitative testing:
- Create 20-50 test questions covering edge cases
- Get responses from base model AND fine-tuned model
- Compare side-by-side
# Evaluation test cases
TEST_QUESTIONS = [
"How do I run multiple async functions at the same time in Python?",
"What's the difference between asyncio.gather and asyncio.wait?",
"How do I create a dataclass with a default list field?",
"When should I use TypedDict vs a regular class?",
"How do I handle exceptions in asyncio.gather?",
]
def run_inference(model, tokenizer, question: str) -> str:
"""Run inference on a single question."""
prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{question}
### Response:
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=512,
temperature=0.7,
top_p=0.9,
do_sample=True,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Extract just the response part
response = response.split("### Response:")[-1].strip()
return response
# Example evaluation (requires CUDA)
if HAS_CUDA:
print("Running evaluation on test questions...")
for q in TEST_QUESTIONS[:2]: # Just first 2 for demo
print(f"\n{'='*60}")
print(f"Q: {q}")
print(f"{'='*60}")
response = run_inference(model, tokenizer, q)
print(f"A: {response[:500]}...")
else:
print("Skipping inference evaluation (no CUDA)")
    print(f"\nTest questions prepared: {len(TEST_QUESTIONS)}")

# Simulated evaluation results for visualization
eval_results = pl.DataFrame({
"Model": ["Base Llama-3.2-3B", "Base Llama-3.2-3B", "Base Llama-3.2-3B",
"Fine-tuned", "Fine-tuned", "Fine-tuned"],
"Metric": ["Accuracy", "Relevance", "Completeness"] * 2,
"Score": [
0.45, 0.55, 0.40, # Base model
0.82, 0.88, 0.75, # Fine-tuned
]
})
chart = alt.Chart(eval_results).mark_bar().encode(
x=alt.X("Model:N", axis=alt.Axis(labelAngle=0)),
y=alt.Y("Score:Q", title="Score (0-1)", scale=alt.Scale(domain=[0, 1])),
color=alt.Color("Metric:N", scale=alt.Scale(scheme="tableau10")),
xOffset="Metric:N"
).properties(
title="Model Evaluation: Base vs Fine-tuned on Domain Q&A",
width=400,
height=300
)
chart
Local Deployment with Ollama
Once training is complete, we export the model to GGUF format and run it locally with Ollama.
Why Ollama?
- Dead simple: ollama run model-name and you’re chatting
- Mac-optimized: Takes full advantage of Apple Silicon
- No Python needed: Runs as a standalone binary
- API compatible: OpenAI-compatible API for easy integration
- Model management: Easy to download, update, and switch models
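The OpenAI-compatible API means existing client code can point at Ollama with only a base-URL change. A minimal sketch using only the standard library (the model name my-qa-model is the one created later in this section; the request fails gracefully if no server is running):

```python
import json
import urllib.request

# OpenAI-style chat request against Ollama's /v1 endpoint.
payload = {
    "model": "my-qa-model",
    "messages": [{"role": "user", "content": "How do I use asyncio.gather?"}],
}
req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        body = json.loads(resp.read())
        print(body["choices"][0]["message"]["content"])
except OSError:
    print("Ollama not reachable; start it with `ollama serve`.")
```

The official openai Python client works the same way: pass base_url="http://localhost:11434/v1" and any dummy API key.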
Export to GGUF
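File size scales roughly with bits per weight, which helps pick a quantization before exporting. A back-of-envelope sketch — the bits-per-weight figures below are approximations, since k-quants mix precisions per tensor:

```python
# Rough GGUF size: parameters x bits-per-weight / 8.
# Bits-per-weight values are approximate, not exact format constants.
QUANT_BITS = {"q4_k_m": 4.8, "q5_k_m": 5.7, "q8_0": 8.5}

def estimate_gguf_gb(params_billions: float, quant: str) -> float:
    """Approximate GGUF file size in GB for a given quantization."""
    return params_billions * QUANT_BITS[quant] / 8

for quant in QUANT_BITS:
    print(f"{quant}: ~{estimate_gguf_gb(3.2, quant):.1f} GB for a 3.2B model")
```

For a 3B-class model, every option fits comfortably in the unified memory of any Apple Silicon Mac.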
# Export model to GGUF (requires CUDA and trained model)
if HAS_CUDA:
# Merge LoRA adapters and save as GGUF
model.save_pretrained_gguf(
"my-qa-model",
tokenizer,
quantization_method="q4_k_m", # Good balance of quality and size
)
print("Model exported to GGUF format!")
print("\nQuantization options:")
print(" q4_k_m - 4-bit, medium quality (recommended)")
print(" q5_k_m - 5-bit, higher quality, larger size")
print(" q8_0 - 8-bit, near full quality, largest size")
else:
print("Skipping GGUF export (no CUDA)")
print("\nIn Colab, this would create: my-qa-model-unsloth.Q4_K_M.gguf")
Create Ollama Modelfile
The Modelfile tells Ollama how to run your model:
# Create Ollama Modelfile
modelfile_content = """# Modelfile for domain Q&A model
FROM ./my-qa-model-unsloth.Q4_K_M.gguf
# Model parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 2048
# System prompt
SYSTEM You are a helpful assistant specialized in Python programming. Answer questions accurately and provide code examples when appropriate.
# Template for chat format
TEMPLATE \"\"\"Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{{ .Prompt }}
### Response:
\"\"\"
"""
with open("Modelfile", "w") as f:
f.write(modelfile_content)
print("Modelfile created!")
print("\nTo use with Ollama:")
print(" 1. Install Ollama: https://ollama.ai")
print(" 2. ollama create my-qa-model -f Modelfile")
print("  3. ollama run my-qa-model")
Running with Ollama
Once the model is created in Ollama, you can interact with it:
# Interactive chat
ollama run my-qa-model
# Single query
ollama run my-qa-model "How do I use asyncio.gather?"
# Via API
curl http://localhost:11434/api/generate -d '{
"model": "my-qa-model",
"prompt": "How do I use asyncio.gather?"
}'
# Python integration with Ollama
# This works on Mac without CUDA!
def check_ollama():
"""Check if Ollama is running."""
try:
import ollama
ollama.list()
return True
except Exception:
return False
if check_ollama():
import ollama
# List available models
models = ollama.list()
print("Available Ollama models:")
for model in models.get('models', []):
print(f" - {model['name']}")
# Example: Query the model
# response = ollama.chat(
# model='my-qa-model',
# messages=[{'role': 'user', 'content': 'How do I use asyncio.gather?'}]
# )
# print(response['message']['content'])
else:
print("Ollama not running. Start it with: ollama serve")
print("\nExample code for when Ollama is available:")
print("""
import ollama
response = ollama.chat(
model='my-qa-model',
messages=[{'role': 'user', 'content': 'How do I use asyncio.gather?'}]
)
print(response['message']['content'])
""")
# Latency comparison (simulated data)
latency_data = pl.DataFrame({
"Method": (["Claude API"] * 20 + ["GPT-4 API"] * 20 +
["Local Ollama (M2)"] * 20 + ["Local Ollama (M3 Pro)"] * 20),
"Latency_ms": (
[450 + i * 10 for i in range(20)] + # Claude API
[380 + i * 12 for i in range(20)] + # GPT-4 API
[25 + i * 2 for i in range(20)] + # Local M2
[18 + i * 1.5 for i in range(20)] # Local M3 Pro
)
})
chart = alt.Chart(latency_data).mark_boxplot().encode(
x=alt.X("Method:N", axis=alt.Axis(labelAngle=-15)),
y=alt.Y("Latency_ms:Q", title="First Token Latency (ms)"),
color=alt.Color("Method:N", legend=None)
).properties(
title="Inference Latency: API vs Local Ollama",
width=400,
height=300
)
chart
Cost-Benefit Analysis
Let’s calculate when fine-tuning and local inference beat API providers.
API Pricing (January 2026)
| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|---|
| Anthropic | Claude 3.5 Sonnet | $3.00 | $15.00 |
| Anthropic | Claude 3.5 Haiku | $0.80 | $4.00 |
| OpenAI | GPT-4o | $2.50 | $10.00 |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 |
| Google | Gemini 1.5 Pro | $1.25 | $5.00 |
Local Inference Costs
| Item | One-time Cost | Monthly Cost |
|---|---|---|
| MacBook Pro M3 | $2,000-4,000 | - |
| Electricity (24/7 operation) | - | ~$5-15 |
| Fine-tuning (Colab Pro) | $10-50 per run | - |
# Cost analysis calculations
# API costs per 1M tokens (average of input/output, weighted 1:2)
API_COSTS = {
"Claude 3.5 Sonnet": (3.00 + 15.00 * 2) / 3, # $11.00
"Claude 3.5 Haiku": (0.80 + 4.00 * 2) / 3, # $2.93
"GPT-4o": (2.50 + 10.00 * 2) / 3, # $7.50
"GPT-4o-mini": (0.15 + 0.60 * 2) / 3, # $0.45
"Gemini 1.5 Pro": (1.25 + 5.00 * 2) / 3, # $3.75
}
# Local costs
LOCAL_MONTHLY_COST = 10 # Electricity
FINE_TUNING_COST = 25 # One-time Colab cost
HARDWARE_COST = 2500 # MacBook Pro (amortized over 3 years = $70/month)
HARDWARE_MONTHLY = HARDWARE_COST / 36
def calculate_monthly_cost(tokens_per_day: int, provider: str) -> float:
"""Calculate monthly API cost."""
tokens_per_month = tokens_per_day * 30
return (tokens_per_month / 1_000_000) * API_COSTS[provider]
def calculate_local_monthly_cost() -> float:
"""Calculate monthly local cost (ignoring one-time fine-tuning)."""
return LOCAL_MONTHLY_COST + HARDWARE_MONTHLY
# Test scenarios
scenarios = [
("Personal/Hobby", 50_000), # 50K tokens/day
("Startup", 1_000_000), # 1M tokens/day
("Growth", 10_000_000), # 10M tokens/day
("Enterprise", 100_000_000), # 100M tokens/day
]
print("Monthly Cost Comparison")
print("=" * 70)
print(f"{'Scale':<15} {'Tokens/Day':<15} {'GPT-4o-mini':<12} {'Claude Haiku':<14} {'Local':<10}")
print("-" * 70)
for name, tokens in scenarios:
gpt_cost = calculate_monthly_cost(tokens, "GPT-4o-mini")
claude_cost = calculate_monthly_cost(tokens, "Claude 3.5 Haiku")
local_cost = calculate_local_monthly_cost()
print(f"{name:<15} {tokens:>12,} ${gpt_cost:>10,.0f} ${claude_cost:>12,.0f} ${local_cost:>8,.0f}")
print("\n* Local cost assumes hardware amortized over 3 years")
print("* One-time fine-tuning cost (~$25) not included in monthly")
# Cost visualization
# Generate cost data across different scales
tokens_range = [10_000, 50_000, 100_000, 500_000, 1_000_000,
5_000_000, 10_000_000, 50_000_000, 100_000_000]
cost_data = []
for tokens in tokens_range:
cost_data.append({
"Tokens_per_day": tokens,
"Provider": "GPT-4o-mini",
"Monthly_Cost": calculate_monthly_cost(tokens, "GPT-4o-mini")
})
cost_data.append({
"Tokens_per_day": tokens,
"Provider": "Claude Haiku",
"Monthly_Cost": calculate_monthly_cost(tokens, "Claude 3.5 Haiku")
})
cost_data.append({
"Tokens_per_day": tokens,
"Provider": "GPT-4o",
"Monthly_Cost": calculate_monthly_cost(tokens, "GPT-4o")
})
cost_data.append({
"Tokens_per_day": tokens,
"Provider": "Local (Fine-tuned)",
"Monthly_Cost": calculate_local_monthly_cost()
})
cost_df = pl.DataFrame(cost_data)
chart = alt.Chart(cost_df).mark_line(point=True).encode(
x=alt.X("Tokens_per_day:Q",
scale=alt.Scale(type="log"),
title="Tokens per Day (log scale)"),
y=alt.Y("Monthly_Cost:Q",
scale=alt.Scale(type="log"),
title="Monthly Cost USD (log scale)"),
color=alt.Color("Provider:N", scale=alt.Scale(scheme="category10")),
strokeDash=alt.condition(
alt.datum.Provider == "Local (Fine-tuned)",
alt.value([5, 5]),
alt.value([0])
)
).properties(
title="Monthly Cost by Scale: API Providers vs Local Fine-tuned Model",
width=600,
height=400
)
chart
# Break-even analysis
def find_break_even(provider: str) -> int:
"""Find tokens/day where local becomes cheaper than API."""
local_monthly = calculate_local_monthly_cost()
cost_per_token = API_COSTS[provider] / 1_000_000
tokens_per_month = local_monthly / cost_per_token
return int(tokens_per_month / 30)
print("Break-Even Points (where local becomes cheaper)")
print("=" * 50)
for provider in ["GPT-4o-mini", "Claude 3.5 Haiku", "GPT-4o", "Claude 3.5 Sonnet"]:
break_even = find_break_even(provider)
print(f"{provider:<20}: {break_even:>12,} tokens/day")
print("\n* Above these volumes, local inference is more cost-effective")
print("* Break-even assumes hardware is already owned")
# Summary visualization: When to use what
summary_data = pl.DataFrame({
"Scale": ["Hobby\n(<50K/day)", "Startup\n(50K-1M/day)", "Growth\n(1M-10M/day)", "Enterprise\n(>10M/day)"],
"Recommendation": ["API", "API or Local", "Local", "Local"],
"Monthly_Savings": [0, 50, 2000, 50000], # Approximate savings with local
"Order": [1, 2, 3, 4]
})
base = alt.Chart(summary_data).encode(
x=alt.X("Scale:N", sort=alt.EncodingSortField(field="Order"), axis=alt.Axis(labelAngle=0)),
)
bars = base.mark_bar().encode(
y=alt.Y("Monthly_Savings:Q", title="Monthly Savings with Local ($)"),
color=alt.Color("Recommendation:N",
scale=alt.Scale(domain=["API", "API or Local", "Local"],
range=["#e45756", "#f58518", "#54a24b"]))
)
text = base.mark_text(dy=-10).encode(
y="Monthly_Savings:Q",
text=alt.Text("Monthly_Savings:Q", format="$,.0f")
)
chart = (bars + text).properties(
title="When Local Fine-Tuning Makes Sense (vs GPT-4o-mini)",
width=400,
height=300
)
chart
Cost Summary
| Scale | Tokens/Day | Recommendation | Notes |
|---|---|---|---|
| Hobby | <50K | Use API | Not worth setup effort |
| Startup | 50K-1M | Consider local | Break-even depends on model quality needs |
| Growth | 1M-10M | Local strongly recommended | Saves $500-5,000/month |
| Enterprise | >10M | Local essential | Saves $10,000+/month |
Additional factors favoring local:
- Latency-sensitive applications (local is 10-20x faster)
- Privacy requirements (data never leaves your machine)
- Offline operation needs
- Predictable costs (no surprise bills)
Additional factors favoring API:
- Cutting-edge capability needs (GPT-4/Claude still ahead on complex tasks)
- Rapid experimentation (easier to switch models)
- No ML expertise on team
Conclusion
Fine-tuning your own LLM is no longer a research project—it’s a practical engineering choice. With Unsloth, you can:
- Fine-tune on consumer hardware (or free Colab)
- Train 2x faster with 70% less memory
- Export to GGUF and run anywhere
- Deploy locally with Ollama for free inference
Key Takeaways
1. Start with QLoRA - It’s 90-95% as good as full fine-tuning at a fraction of the cost
2. Data quality > quantity - 1,000 perfect examples beat 100,000 noisy ones
3. SFT first, then DPO - Get the basics right before preference optimization
4. Evaluate systematically - Both quantitative metrics and qualitative testing
5. Local deployment is viable - Ollama makes it dead simple on Mac
6. Cost math favors local at scale - Above 100K tokens/day, consider fine-tuning
Recommended Workflow
1. Collect domain data (docs, Q&A, examples)
2. Generate training dataset (Alpaca format)
3. Fine-tune with Unsloth on Colab (QLoRA, Llama-3.2-3B)
4. Evaluate on held-out test set
5. If needed, add DPO with preference data
6. Export to GGUF
7. Deploy with Ollama locally
8. Iterate based on real-world feedback
Resources
- Unsloth GitHub - Installation and examples
- Unsloth Documentation - Detailed guides
- Ollama - Local model deployment
- TRL Library - SFT, DPO, ORPO trainers
- PEFT Library - LoRA implementation
The best model for your task isn’t the biggest—it’s the one trained on your data. Fine-tuning makes that possible for everyone.