Eval reports
How Yachay evaluates a finished fine-tune — what we record, what you get, and what’s coming.
v1 status
The per-job dashboard panel for loss curves and held-out perplexity is on the v1.1 roadmap. v1 records the underlying data (see “What we capture today” below), but there’s no built-in viewer yet. If you need the metrics now, email hello@condorbox.ai with your job ID and we’ll send the raw JSON within one business day.
What we capture today
- Per-step training loss: every gradient-update step writes a (step, loss) tuple to the orchestrator’s structured log. Available via the Cloud Logging export to GCS for the length of the log retention window (30 days by default).
- Validation perplexity at every checkpoint: we hold out 5% of your dataset (deterministic split, seed=42) and compute perplexity at each checkpoint. The final value is stamped on the Firestore job doc as finalValPerplexity.
- Training args: the full hyperparameter set used by the trainer (epochs, batch size, learning rate, optimizer, dtype, LoRA rank/alpha) ships in trainer_args.json inside the adapter bundle. It survives the standard 30-day retention.
- Tokens-seen and wall-clock: stamped on the Firestore doc as trainTokens and trainSecondsActual. Both already appear on the dashboard’s per-job page.
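To make the validation bullet concrete, here is a minimal sketch of a seeded 5% hold-out split and of perplexity computed from per-token losses. It is an illustration only: the function names are hypothetical, and the shuffle-then-slice strategy is an assumption, since this page states only the fraction and the seed, not how the split is drawn. The perplexity formula itself (the exponential of the mean per-token negative log-likelihood) is the standard definition.

```python
import math
import random


def holdout_split(n_examples, fraction=0.05, seed=42):
    """Return (train_indices, val_indices) for a reproducible hold-out split.

    Assumption: shuffle-then-slice. The real trainer may split differently.
    """
    indices = list(range(n_examples))
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(indices)
    n_val = max(1, round(n_examples * fraction))
    return sorted(indices[n_val:]), sorted(indices[:n_val])


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


train_idx, val_idx = holdout_split(1000)
print(len(train_idx), len(val_idx))  # 950 50

# A model that is uniformly unsure among 4 tokens has perplexity of about 4.
print(perplexity([math.log(4.0)] * 10))
```

Because the seed is fixed, calling the split twice on a dataset of the same size yields the same indices, which is what makes perplexity values comparable across reruns of the same data.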
v1.1 roadmap
- Loss-curve panel — per-job dashboard chart of (step, train loss, val perplexity) across the whole run. Source data already exists in Cloud Logging; this is a UI build.
- Benchmark probe (opt-in) — for a flat $1.50 add-on at submit time, we run your tuned adapter against a small fixed benchmark (MMLU 5-shot, HumanEval pass@1) and surface the score next to your baseline.
- Side-by-side eval — pick two completed jobs and compare loss curves + perplexity head-to-head. Useful for hyperparameter sweeps.
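Until the side-by-side view ships, a comparison of this kind can be approximated offline from two runs’ (step, loss) records. The sketch below is not the planned dashboard logic. It assumes you have already extracted each run’s records into a list of tuples, and the function name and sample values are made up for illustration.

```python
def compare_runs(run_a, run_b):
    """Align two runs on the steps they share.

    Each run is an iterable of (step, loss) tuples. Returns rows of
    (step, loss_a, loss_b, loss_b - loss_a), sorted by step.
    """
    a = dict(run_a)
    b = dict(run_b)
    shared_steps = sorted(a.keys() & b.keys())
    return [(s, a[s], b[s], b[s] - a[s]) for s in shared_steps]


# Made-up values, purely to show the shape of the output.
run_a = [(100, 2.0), (200, 1.5), (300, 1.25)]
run_b = [(100, 2.25), (200, 1.25), (400, 1.0)]

for row in compare_runs(run_a, run_b):
    print(row)
```

Steps that appear in only one run (300 and 400 here) are dropped rather than interpolated, so a negative delta always means run B had the lower loss at a step both runs actually logged.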
What we will NOT do
We won’t evaluate your adapter on a held-out dataset we don’t share with you. The v1.1 benchmark probe runs against fixed public benchmarks (MMLU, HumanEval) so the score is comparable and reproducible. If you need a custom eval set, run it client-side against the downloaded adapter — that keeps the eval data on your machine.
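A client-side eval can be as small as the harness sketched below. It assumes you wrap inference against the downloaded adapter in a `generate(prompt) -> str` callable of your own; that callable, the function name, and the exact-match metric are all illustrative choices, not something Yachay provides or prescribes.

```python
from typing import Callable, Iterable, Tuple


def exact_match_accuracy(
    generate: Callable[[str], str],
    examples: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of prompts whose output equals the reference after trimming whitespace."""
    total = 0
    correct = 0
    for prompt, reference in examples:
        total += 1
        if generate(prompt).strip() == reference.strip():
            correct += 1
    if total == 0:
        raise ValueError("eval set is empty")
    return correct / total


# Stand-in for a real model call; swap in inference against your adapter.
def fake_generate(prompt: str) -> str:
    canned = {"2+2=": "4", "Capital of Peru?": " Lima "}
    return canned.get(prompt, "")


eval_set = [("2+2=", "4"), ("Capital of Peru?", "Lima"), ("3*3=", "9")]
print(exact_match_accuracy(fake_generate, eval_set))
```

Since the harness only ever sees a callable and a local list of examples, the eval data never has to leave your machine.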