LoRA vs QLoRA

When you submit a tune, Yachay automatically picks LoRA or QLoRA based on model size. Override the choice manually if you have a specific reason.

Yachay defaults

Model size	Default	Typical GPU	Why
≤ 4B params	LoRA	L4 or A10G (Spot)	Fits comfortably in fp16 on a single mid-tier GPU; no quantisation needed.
4B – 16B	LoRA	A100 40 GB (Spot)	Single A100 holds the model in fp16; LoRA gives the best quality/speed trade-off.
16B – 24B	LoRA	A100 80 GB (Spot)	fp16 weights no longer fit in 40 GB. Still LoRA, because the quality difference vs QLoRA is measurable at this size.
24B – 75B	QLoRA	A100 80 GB (Spot)	16-bit LoRA needs multiple GPUs above ~24B, which doubles the price. QLoRA's 4-bit base fits on a single A100 80 GB with a ~1–2% quality cost.
> 75B (incl. MoE)	QLoRA	H100 80 GB (Spot)	Models like Llama 4 Scout need H100-class memory bandwidth to train cost-effectively. QLoRA keeps the per-job cost bounded.
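
The table above can be read as a simple lookup on parameter count. The following is a minimal sketch of that lookup, not Yachay's actual implementation: the names `TuneDefault` and `pick_default` are hypothetical, and it assumes each size band includes its upper bound (so a model of exactly 16B falls in the "4B – 16B" row).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TuneDefault:
    method: str  # "LoRA" or "QLoRA"
    gpu: str     # typical GPU from the defaults table


# (upper bound in billions of parameters, method, typical GPU), in table order.
# Assumption: upper bounds are inclusive; the source table does not say.
_BANDS = [
    (4, "LoRA", "L4 or A10G"),
    (16, "LoRA", "A100 40 GB"),
    (24, "LoRA", "A100 80 GB"),
    (75, "QLoRA", "A100 80 GB"),
]


def pick_default(params_billions: float) -> TuneDefault:
    """Return the default method and typical GPU for a model size."""
    if params_billions <= 0:
        raise ValueError("params_billions must be positive")
    for upper, method, gpu in _BANDS:
        if params_billions <= upper:
            return TuneDefault(method, gpu)
    # Anything above 75B, including MoE models, falls through to the last row.
    return TuneDefault("QLoRA", "H100 80 GB")


if __name__ == "__main__":
    for size in (3, 14, 20, 70, 109):
        print(size, pick_default(size))
```

An ordered list of bands keeps the rule in one place, so a manual override only has to replace the returned `method`.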

LoRA: Higher quality. Adapter weights are trained against the unquantised 16-bit base.
QLoRA: Slightly lower quality, usually within 1–2% of LoRA on benchmarks and hard to detect on most downstream tasks.

LoRA: Heavy on memory. The 16-bit base weights must fit in GPU VRAM.
QLoRA: Light on memory. A 4-bit base plus adapter fits 2–3× larger models on the same card.

LoRA: Faster per step, because forward passes carry no dequantisation overhead.
QLoRA: About 30–50% slower per step because the 4-bit weights are dequantised on the fly; it often still wins on cost because the job fits on a smaller machine.
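
The memory difference comes down to bytes per parameter: 2 bytes at 16-bit, 0.5 bytes at 4-bit. The sketch below estimates the size of the base weights only. The function name `weight_gb` is illustrative, and the estimate deliberately ignores activations, adapter weights, optimizer state and quantisation constants, which is presumably why the practical gain is closer to 2–3× than the raw 4×.

```python
def weight_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate size of the base weights in GB (1 GB = 1e9 bytes).

    Weights only: activations, adapters and optimizer state are not counted.
    """
    return params_billions * bits_per_param / 8


if __name__ == "__main__":
    for size in (4, 16, 24, 70):
        fp16 = weight_gb(size, 16)
        int4 = weight_gb(size, 4)
        print(f"{size}B params: 16-bit {fp16:.0f} GB, 4-bit {int4:.0f} GB")
```

By this estimate a 16B model needs about 32 GB at 16-bit, which is why it fits on a 40 GB card, while a 70B model needs about 140 GB at 16-bit but only about 35 GB at 4-bit.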

Force LoRA on 70B+: if you’ve benchmarked QLoRA on your task and the small quality regression matters more than cost. Expect to pay 2–3× the price.
Force QLoRA on 14B: if you’re running many small tunes and want the lowest possible per-job cost. The quality difference is rarely measurable at this size.
Stick with the default: for everything else. Yachay’s defaults are tuned to sit on the Pareto frontier of cost and accuracy on standard benchmarks.
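
Whether an override pays off on cost can be estimated with a back-of-the-envelope comparison. This sketch assumes both methods run the same number of steps, that "30–50% slower per step" means step time is multiplied by 1.3 to 1.5, and that the hourly prices are made-up placeholders rather than Yachay's rates. The function names are hypothetical.

```python
def job_cost(hourly_price: float, lora_hours: float, slowdown: float = 0.0) -> float:
    """Cost of a job that would take `lora_hours` with LoRA.

    `slowdown` is the fractional per-step slowdown (0.0 for LoRA,
    roughly 0.3 to 0.5 for QLoRA according to the comparison above).
    """
    return hourly_price * lora_hours * (1 + slowdown)


def qlora_is_cheaper(lora_hourly: float, qlora_hourly: float, slowdown: float) -> bool:
    """True when QLoRA's cheaper machine outweighs its slower steps."""
    return qlora_hourly * (1 + slowdown) < lora_hourly


if __name__ == "__main__":
    # Placeholder prices: LoRA on two GPUs vs QLoRA on one.
    lora_hourly, qlora_hourly = 4.0, 2.0
    for slowdown in (0.3, 0.5):
        lora = job_cost(lora_hourly, 10)
        qlora = job_cost(qlora_hourly, 10, slowdown)
        print(f"slowdown {slowdown:.0%}: LoRA {lora:.2f}, QLoRA {qlora:.2f}")
```

With these placeholder numbers QLoRA stays cheaper even at a 50% slowdown; when both methods fit on the same machine at the same price, the slowdown makes QLoRA the more expensive option.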