LoRA vs QLoRA

When you submit a tune, Yachay automatically picks LoRA or QLoRA based on model size. Override the choice manually only if you have a specific reason.

Yachay defaults

| Model size | Default | Typical GPU | Why |
| --- | --- | --- | --- |
| ≤ 4B params | LoRA | L4 or A10G (Spot) | Fits comfortably in fp16 on a single mid-tier GPU; no quantisation needed. |
| 4B – 16B | LoRA | A100 40 GB (Spot) | Single A100 holds the model in fp16; LoRA gives the best quality/speed trade-off. |
| 16B – 24B | LoRA | A100 80 GB (Spot) | fp16 weights spill past 40 GB. Still LoRA: the quality difference vs QLoRA is measurable at this size. |
| 24B – 75B | QLoRA | A100 80 GB (Spot) | Full LoRA needs multi-GPU above ~24B and the price doubles. QLoRA's 4-bit base fits on a single A100 80 GB with a ~1–2% quality cost. |
| > 75B (incl. MoE) | QLoRA | H100 80 GB (Spot) | Models like Llama 4 Scout need H100-class memory bandwidth to train cost-effectively. QLoRA keeps the per-job cost bounded. |
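The table can be read as a simple lookup on parameter count. The sketch below is illustrative only: `pick_default`, `TuneDefault`, and `_BANDS` are hypothetical names, not part of any Yachay API, and treating each band's upper bound as inclusive is an assumption (the table does not say which row a model of exactly 16B or 24B falls into).

```python
from typing import NamedTuple


class TuneDefault(NamedTuple):
    method: str  # "LoRA" or "QLoRA"
    gpu: str     # typical GPU listed in the defaults table


# (upper bound in billions of parameters, method, typical GPU), in table order.
# Assumption: upper bounds are inclusive, so exactly 16B lands in the 4B-16B row.
_BANDS = [
    (4.0, "LoRA", "L4 or A10G"),
    (16.0, "LoRA", "A100 40 GB"),
    (24.0, "LoRA", "A100 80 GB"),
    (75.0, "QLoRA", "A100 80 GB"),
]


def pick_default(params_billion: float) -> TuneDefault:
    """Return the default method and typical GPU for a given model size."""
    if params_billion <= 0:
        raise ValueError("params_billion must be positive")
    for upper, method, gpu in _BANDS:
        if params_billion <= upper:
            return TuneDefault(method, gpu)
    # Anything above 75B, including large MoE models, falls in the last row.
    return TuneDefault("QLoRA", "H100 80 GB")


if __name__ == "__main__":
    for size in (3, 14, 20, 70, 109):
        print(size, pick_default(size))
```

For MoE models, the lookup above assumes you pass the total parameter count, since the whole model has to sit in GPU memory; that reading is also an assumption.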

Trade-off matrix

Quality

LoRA: Higher. Adapter weights are trained against an unquantised 16-bit base.
QLoRA: Slightly lower, usually within 1–2% on benchmarks and hard to detect on most downstream tasks.

Memory

LoRA: Heavy. The 16-bit base weights must fit in GPU VRAM.
QLoRA: Light. A 4-bit base plus adapter fits 2–3× larger models on the same card.
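The memory gap comes down to bytes per parameter: 2 bytes at 16-bit versus 0.5 bytes at 4-bit. The rough estimate below covers base weights only. It deliberately ignores adapter weights, optimizer state, activations, and quantisation metadata, so real usage is higher; `base_weight_gb` is a hypothetical helper written for this illustration.

```python
def base_weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate size of the base weights alone, in GB (1 GB = 1e9 bytes).

    Excludes adapters, optimizer state, activations and quantisation
    metadata, so treat the result as a lower bound on VRAM needed.
    """
    # billions of params * (bits / 8) bytes per param = billions of bytes = GB
    return params_billion * bits_per_param / 8


if __name__ == "__main__":
    for size in (4, 16, 24, 70):
        fp16 = base_weight_gb(size, 16)
        int4 = base_weight_gb(size, 4)
        print(f"{size}B: 16-bit ~{fp16:.0f} GB, 4-bit ~{int4:.0f} GB")
```

On weights alone the 4-bit base is 4× smaller. The 2–3× figure quoted above is lower because the parts this estimate leaves out do not shrink with the base.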

Speed

LoRA: Faster forward passes (no dequantisation overhead).
QLoRA: Slower per step (~30–50%) because of dequantisation; often still wins on cost because you use a smaller machine.

Cost

LoRA: Higher for ≥32B models, which need more or bigger GPUs.
QLoRA: Materially cheaper for 32B+. About the same for ≤14B.
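Whether QLoRA's slower steps are offset by a cheaper machine is a one-line calculation: job cost is wall-clock time multiplied by hourly price. The sketch below works in ratios rather than real prices. It assumes both methods run the same number of steps, and it reads "~30–50% slower per step" as a wall-time multiplier of 1.3 to 1.5; the function names are hypothetical.

```python
def qlora_cost_ratio(slowdown: float, price_ratio: float) -> float:
    """QLoRA job cost divided by LoRA job cost.

    slowdown:    QLoRA wall time / LoRA wall time (e.g. 1.5 = 50% slower)
    price_ratio: QLoRA hourly price / LoRA hourly price
    Assumes the same number of training steps for both methods.
    """
    if slowdown <= 0 or price_ratio <= 0:
        raise ValueError("slowdown and price_ratio must be positive")
    return slowdown * price_ratio


def break_even_price_ratio(slowdown: float) -> float:
    """Largest price ratio at which QLoRA still costs no more than LoRA."""
    if slowdown <= 0:
        raise ValueError("slowdown must be positive")
    return 1.0 / slowdown


if __name__ == "__main__":
    # If the LoRA setup costs twice as much per hour (price_ratio = 0.5),
    # QLoRA comes out cheaper even at the slow end of the range.
    for slowdown in (1.3, 1.5):
        ratio = qlora_cost_ratio(slowdown, price_ratio=0.5)
        print(f"slowdown {slowdown}: QLoRA costs {ratio:.2f}x the LoRA job")
```

Under these assumptions, a 50% slowdown on hardware at half the hourly price gives a job that costs 0.75× the LoRA run. QLoRA stops being cheaper once its machine costs more than about two thirds of the LoRA machine's hourly price.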

When to override

  • Force LoRA on 70B+: if you’ve benchmarked QLoRA on your task and the small quality regression matters more than cost. Expect to pay 2–3× the price.
  • Force QLoRA on 14B: if you’re running many small tunes and want the lowest possible per-job cost. The quality difference is rarely measurable at this size.
  • Stick with the default: for everything else. Yachay’s defaults are tuned to sit on the Pareto frontier of cost and accuracy on standard benchmarks.
