A 7B model fine-tuned on your domain data can outperform a generically prompted frontier model on well-defined tasks, and it can do so consistently, cheaply, and without sending your data to a third-party endpoint. The process is more accessible than most teams realise.
What fine-tuning actually does
When you prompt a general-purpose model, you are working with weights that were trained to be useful to everyone. The model knows a great deal about the world, but it does not know your domain, your format conventions, your edge cases, or the subtle quality signals that your team cares about. Every prompt is a negotiation with a model that has no memory of the last time you ran it.
Supervised fine-tuning changes this by adjusting the model's weights directly. You provide a set of curated input-output pairs — examples of the task done well, in your format, using your terminology — and the training process updates the model's parameters so that it produces those kinds of outputs by default. The result is not a model that has been given better instructions. It is a model that has internalised your task at the weight level.
The practical effect: the fine-tuned model produces your format, follows your conventions, and handles your domain edge cases without needing to be reminded in every prompt. The system prompt gets shorter. The outputs get more consistent. The failure modes become narrower and more predictable.
Why LoRA makes this accessible
Full fine-tuning — updating every parameter in a 7B model — requires substantial compute and produces a full copy of the model weights for each training run. LoRA (Low-Rank Adaptation) takes a different approach: instead of updating the original weights, it trains a small set of additional parameters — the adapter — that sits on top of the frozen base model. The adapter is tiny relative to the full model, which means training is dramatically faster and cheaper.
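To put a number on "tiny relative to the full model", here is a rough parameter count. For a weight matrix with d_in inputs and d_out outputs, a rank-r LoRA adapter trains two thin matrices totalling r × (d_in + d_out) parameters instead of d_in × d_out. The layer shapes below are an assumption (a Llama-style 7B layout with hidden size 4096, MLP width 11008, and 32 layers), and rank 16 is an illustrative choice rather than a recommendation.

```python
# Rough count of trainable LoRA parameters for a Llama-style 7B model.
# Assumed shapes: hidden size 4096, MLP width 11008, 32 transformer layers.
HIDDEN, MLP, LAYERS = 4096, 11008, 32

# (d_in, d_out) for each linear projection inside one transformer layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(rank: int) -> int:
    """Trainable parameters when every projection gets a rank-`rank` adapter."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in PROJECTIONS.values())
    return per_layer * LAYERS


def full_params() -> int:
    """Parameters in the same projections if they were trained directly."""
    per_layer = sum(d_in * d_out for d_in, d_out in PROJECTIONS.values())
    return per_layer * LAYERS


if __name__ == "__main__":
    adapter, full = lora_params(16), full_params()
    print(f"adapter: {adapter:,} params, full: {full:,} params")
    print(f"adapter is {100 * adapter / full:.2f}% of the projection weights")
```

Under these assumptions a rank-16 adapter has roughly 40 million trainable parameters against about 6.5 billion in the projections it adapts, which is well under 1%. Stored at 2 bytes per parameter, that is about 80 MB, and a rank-64 adapter is about 320 MB.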
QLoRA extends this further by quantising the base model weights to 4-bit precision during training, which cuts memory usage enough that a 7B or 13B model can be trained on a single modern GPU. The quality difference compared to full fine-tuning is small for most tasks. The accessibility difference is large.
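A back-of-envelope estimate shows where the memory saving comes from. The sketch below counts only the frozen base weights, at 2 bytes per parameter for 16-bit and 0.5 bytes for 4-bit. It is a simplification: activations, optimiser state, the adapter itself, and quantisation overhead are all ignored, so real usage will be higher.

```python
# Back-of-envelope memory needed just to hold the frozen base weights.
# Ignores activations, optimiser state, the adapter itself, and the small
# overhead of quantisation constants, so real usage will be higher.
BYTES_PER_PARAM = {"fp16": 2.0, "4bit": 0.5}


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Memory in GB (10^9 bytes) for `n_params` weights at the given precision."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for name, n in [("7B", 7e9), ("13B", 13e9)]:
        fp16 = weight_memory_gb(n, "fp16")
        q4 = weight_memory_gb(n, "4bit")
        print(f"{name}: {fp16:.1f} GB in fp16, {q4:.1f} GB in 4-bit")
```

On this estimate the weights of a 7B model shrink from about 14 GB to about 3.5 GB, and those of a 13B model from about 26 GB to about 6.5 GB, which leaves room on a single card for everything else the training job needs.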
What this means in practice: you are not running a data centre operation. You are running a training job on a single GPU instance — rented by the hour if you do not own the hardware — and producing an adapter file that is a few hundred megabytes rather than tens of gigabytes.
The toolchain
Three libraries do most of the work in the Python ecosystem:
- HF TRL: Hugging Face's Transformer Reinforcement Learning library, which handles the supervised fine-tuning training loop, data collation, and evaluation hooks. The SFTTrainer class abstracts away most of the boilerplate.
- Unsloth: a library of optimised GPU kernels, written in Triton, for LoRA training. It significantly reduces memory usage and increases training speed compared to vanilla TRL. For single-GPU runs where memory is the constraint, Unsloth often makes the difference between a run that fits and one that does not.
- Axolotl: a configuration-driven training framework that wraps TRL and handles the complexity of multi-GPU training, dataset formatting, and hyperparameter management through YAML config files. When you want to move from a single GPU to a small cluster, or when you want reproducible, version-controlled training configurations, Axolotl is the right layer to add.
These three tools compose cleanly. A typical workflow starts with TRL for experimentation, adds Unsloth for speed on a single GPU, and adopts Axolotl config files when the training setup needs to be reproducible across runs or team members.
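As a rough illustration of the first step in that workflow, the sketch below wires a LoRA configuration into TRL's SFTTrainer. It assumes recent releases of trl, peft, and datasets are installed; argument names have changed between TRL versions, so check it against the version you use. The model ID, file name, output directory, and LoRA values are all placeholders chosen for illustration, and the training file is assumed to be JSONL with a "text" field per record. It uses plain LoRA only, without Unsloth, Axolotl, or 4-bit quantisation.

```python
# Sketch of a LoRA fine-tuning run with TRL and PEFT. Hypothetical names:
# MODEL_ID, TRAIN_FILE and OUTPUT_DIR are placeholders, and the LoRA values
# are illustrative starting points rather than tuned settings.
MODEL_ID = "your-org/your-7b-base-model"  # placeholder, not a real checkpoint
TRAIN_FILE = "train.jsonl"                # one JSON object per line with a "text" field
OUTPUT_DIR = "checkpoints/run-001"

LORA_SETTINGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": "all-linear",
    "task_type": "CAUSAL_LM",
}


def build_trainer(model_id=MODEL_ID, train_file=TRAIN_FILE, output_dir=OUTPUT_DIR):
    """Assemble an SFTTrainer. Imports are local so this file loads without the GPU stack."""
    from datasets import load_dataset
    from peft import LoraConfig
    from trl import SFTConfig, SFTTrainer

    train_dataset = load_dataset("json", data_files=train_file, split="train")
    return SFTTrainer(
        model=model_id,
        args=SFTConfig(output_dir=output_dir, num_train_epochs=1),
        train_dataset=train_dataset,
        peft_config=LoraConfig(**LORA_SETTINGS),
    )


def run():
    """Train and save. With a PEFT config, save_model writes the adapter weights."""
    trainer = build_trainer()
    trainer.train()
    trainer.save_model(OUTPUT_DIR)
```

Calling `run()` on a machine with a suitable GPU would start training. Keeping the LoRA settings in one dictionary makes them easy to move into a config file later, which is the kind of reproducibility that Axolotl formalises.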
What single-GPU-class means in practice
You do not need a cluster. A single A100 80GB, H100, or equivalent — available from most cloud providers for $2–4 per hour — is enough to fine-tune a 7B or 13B model using QLoRA. A typical training run on a few thousand examples takes two to six hours. The total compute cost for an initial fine-tuning run is usually under $30.
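The cost figure follows directly from the ranges quoted above. A minimal check, using only those hourly rates and run lengths:

```python
# Compute cost range for one training run, using the figures quoted above.
def run_cost(hours: float, rate_per_hour: float) -> float:
    """Cost in dollars of renting one GPU for `hours` at `rate_per_hour`."""
    return hours * rate_per_hour


if __name__ == "__main__":
    low = run_cost(hours=2, rate_per_hour=2.0)
    high = run_cost(hours=6, rate_per_hour=4.0)
    print(f"one run: ${low:.0f} to ${high:.0f}")
```

That gives a range of $4 to $24 per run, so even the longest run at the highest rate stays under $30. Storage, data transfer, and any failed or repeated runs are not included.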
If you already have access to a workstation with a high-end GPU, the training can run there. If you are renting cloud compute, you pay only for the hours the instance is running, not for a persistent reservation. The economics are fundamentally different from API costs, which scale with every inference call.
What you need to bring
The main input is labelled data: pairs of inputs and the outputs you want the model to produce. The quality of this data matters more than the quantity. A few hundred high-quality, representative examples, especially ones that cover your known failure modes and edge cases, will usually produce a better fine-tuned model than thousands of mediocre ones.
Good sources for fine-tuning data include:
- Outputs from your prompt-tuned system that were reviewed and accepted by human experts
- Historical examples of the task done well by your team, formatted consistently
- Synthetic examples generated by a frontier model and then filtered or corrected by domain experts
You also need a held-out eval set — examples you do not train on, which you use to measure whether the fine-tuned model is actually better on new inputs. Without this, you cannot distinguish genuine improvement from overfitting to the training set.
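One simple way to carve out the eval set is to assign each example to a split based on a hash of its input, as sketched below. The field names "input" and "output" and the 10% eval fraction are assumptions for illustration. Compared with a random shuffle, hashing keeps every example in the same split when the dataset grows, so an eval example cannot drift into a later training run.

```python
import hashlib


def assign_split(example: dict, eval_fraction: float = 0.1) -> str:
    """Return "eval" or "train" based on a stable hash of the example's input."""
    digest = hashlib.sha256(example["input"].encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "eval" if bucket < eval_fraction else "train"


def split_examples(examples: list, eval_fraction: float = 0.1) -> tuple:
    """Partition examples into (train, held_out) lists without overlap."""
    train = [ex for ex in examples if assign_split(ex, eval_fraction) == "train"]
    held_out = [ex for ex in examples if assign_split(ex, eval_fraction) == "eval"]
    return train, held_out


if __name__ == "__main__":
    # Hypothetical records; real data would be loaded from your own files.
    examples = [{"input": f"ticket {i}", "output": f"label {i % 3}"} for i in range(1000)]
    train, held_out = split_examples(examples)
    print(len(train), "train /", len(held_out), "eval")
```

Two inputs with identical text always land in the same split, which also guards against a duplicated example appearing on both sides.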
What you get at the end
A proprietary model checkpoint — the base model plus your trained LoRA adapter — stored in your artifact registry. This checkpoint runs on your own infrastructure. Inference costs nothing beyond your compute. There is no API call to an external endpoint, no data leaving your environment, no per-token pricing that scales with volume.
The checkpoint is version-controlled. You can compare the performance of checkpoint v3 to checkpoint v7 against the same eval set. You can roll back if a training run produces a regression. You can track exactly what data was used for each version and reproduce any run from its configuration.
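The comparison between versions can be as simple as scoring each checkpoint's outputs against the same references. The sketch below uses exact-match accuracy as a stand-in metric; your task may need a different one. The function names and checkpoint labels are hypothetical, and generating the outputs from each checkpoint is outside the sketch.

```python
# Compare checkpoint versions on one fixed eval set using exact-match accuracy.
# `predictions` maps a checkpoint name to its outputs, in eval-set order.
def exact_match_accuracy(predicted: list, expected: list) -> float:
    """Fraction of predictions that equal the reference after trimming whitespace."""
    if len(predicted) != len(expected):
        raise ValueError("predictions and references must align one to one")
    if not expected:
        raise ValueError("eval set is empty")
    hits = sum(p.strip() == e.strip() for p, e in zip(predicted, expected))
    return hits / len(expected)


def find_regressions(predictions: dict, expected: list, baseline: str,
                     tolerance: float = 0.0) -> dict:
    """Return checkpoints scoring below the baseline by more than `tolerance`."""
    scores = {name: exact_match_accuracy(out, expected) for name, out in predictions.items()}
    floor = scores[baseline] - tolerance
    return {name: score for name, score in scores.items() if score < floor}


if __name__ == "__main__":
    expected = ["A", "B", "C", "D"]
    predictions = {
        "v3": ["A", "B", "X", "D"],
        "v7": ["A", "X", "X", "D"],
    }
    print(find_regressions(predictions, expected, baseline="v3"))
```

In this toy example v7 scores lower than v3 on the shared eval set, so it is flagged as a regression and v3 would stay in production.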
The checkpoint is yours
The checkpoint does not expire. It does not get deprecated when a vendor releases a new model. It does not require renegotiation when API pricing changes. It can keep improving with each optimisation cycle you run on top of it, provided each new version is checked against your eval set.
This is the compounding dynamic that makes fine-tuning worthwhile at scale. Each training run starts from the previous checkpoint. Each eval cycle produces better labelled data for the next run. The model that is running your production workload in eighteen months is not the model you started with — it is something your organisation built, iteratively, from your own data and your own quality bar.
If fine-tuning has felt too complex or too expensive to attempt, the tooling has moved significantly in the last two years. The barrier is lower than most teams expect. If you want to work through what a first fine-tuning run would look like for your use case, get in touch.