Picking a base model

Yachay’s catalog has 17 commercial-safe base models. This guide is the cheat-sheet for picking the right one without testing all of them.

If you know what you want

I want maximum quality at any cost.

Best instruction-following general-purpose base. Naming-constrained (derivative must include 'Llama') but otherwise commercial-OK.

I want zero license friction.

Apache 2.0, no naming constraint, multilingual, strong reasoning. The default safe pick if you're shipping a product publicly.

I want the cheapest possible tune.

Fast tunes (under 10 minutes typical), good for narrow tasks like extraction, classification, or short summarization.

I'm tuning for reasoning or math.

Phi-4 is trained on heavy synthetic reasoning data. R1 distills are explicitly chain-of-thought tuned.

I need long context (>32k tokens).

Scout has a massive context window but a 5–15 minute cold start. Nemo has a 128k context window and a faster cold start.

I'm building for edge / on-device.

Sub-5B param targets that quantize cleanly for ONNX, MLX, or llama.cpp deployment after tuning.

Multilingual product, non-Latin scripts important.

Strong on East Asian, Arabic, and Indic scripts out of the box. Qwen 3 especially.

Microsoft / MIT-license environment.

MIT licensed, with no naming or branding requirements anywhere downstream. The license only asks that its copyright notice stay with copies.

Step-by-step

  1. Decide your license tolerance.

    If your downstream product can't include 'Llama' in its model name, skip the Llama family. If your legal team flags Gemma's TOU, skip Gemma. The Apache 2.0 and MIT models (Qwen, Mistral, Phi) carry no naming or terms-of-use constraints.

  2. Pick the smallest model that does the job.

    Tuning a 70B model for an extraction task wastes money. Start at 8B and go bigger only if eval scores fall short. Yachay shows the expected tune cost on every catalog entry.

  3. Check the recommended tasks.

    Each model page lists the tasks it's known to be strong on: Llama 3.3 70B for instruction following, Phi-4 for reasoning, Qwen for multilingual work. Picking against a model's strengths means fighting its upstream training.

  4. Cold-start tier matters for iteration speed.

    Models flagged 'Instant start' are pre-mirrored, so your job begins within seconds. Cold-start models add 5–15 minutes per job. If you're iterating fast, prefer the 'Instant start' models.

  5. Run a 1-epoch tune first.

    Before committing to a long tune, run 1 epoch on a 1k-row sample. On small models that settles near the $5 floor, the adapter trains in minutes, and you can evaluate it against your real test set before scaling up data and epochs. If the short tune underperforms the base model, try a different base before paying for the long run.
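The five steps above can be read as a small filter-then-trial procedure. The sketch below is one way to write that down in Python. It does not use any Yachay API: the `BaseModel` record, its field names, the example catalog entries, and the helper names `shortlist`, `sample_rows`, and `next_action` are all hypothetical, and the sizes and task tags are made up for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseModel:
    # Hypothetical catalog record; the fields are illustrative, not Yachay's schema.
    name: str
    family: str
    params_b: float       # parameter count in billions
    tasks: frozenset      # tasks the model page lists as strengths
    instant_start: bool   # True if the entry is flagged 'Instant start'


def shortlist(catalog, task, blocked_families=frozenset()):
    """Steps 1-4: drop families your license review rules out, keep models
    that list the task as a strength, then order smallest first, with
    'Instant start' models winning ties."""
    candidates = [
        m for m in catalog
        if m.family not in blocked_families and task in m.tasks
    ]
    return sorted(candidates, key=lambda m: (m.params_b, not m.instant_start))


def sample_rows(rows, n=1000, seed=0):
    """Step 5: take a reproducible sample of up to n rows for the 1-epoch trial."""
    rows = list(rows)
    if len(rows) <= n:
        return rows
    return random.Random(seed).sample(rows, n)


def next_action(base_score, tuned_score):
    """Step 5 decision: if the short tune underperforms the untuned base,
    switch bases before paying for a long run."""
    if tuned_score < base_score:
        return "try a different base"
    return "scale up"


# Made-up entries: names, sizes, and task tags are placeholders.
catalog = [
    BaseModel("example-llama-70b", "llama", 70.0, frozenset({"instruction-following"}), True),
    BaseModel("example-qwen-8b", "qwen", 8.0, frozenset({"extraction", "multilingual"}), False),
    BaseModel("example-phi-4b", "phi", 4.0, frozenset({"extraction", "reasoning"}), True),
    BaseModel("example-gemma-4b", "gemma", 4.0, frozenset({"extraction"}), True),
]

picks = shortlist(catalog, task="extraction", blocked_families={"gemma"})
print([m.name for m in picks])  # ['example-phi-4b', 'example-qwen-8b']
```

Sorting by size before start tier is a choice made here, not something the guide prescribes. If iteration speed matters more than tune cost for you, swap the two elements of the sort key.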
