
Dataset format

Yachay accepts JSONL (OpenAI chat / Alpaca / ShareGPT) or CSV / TSV. We auto-detect and normalize to OpenAI chat-style before fine-tuning — drop whatever you have. The submit form lints every row before you pay; if anything’s off, you’ll see exactly which line.

Accepted formats

Yachay auto-detects the dataset format at upload and normalizes everything to OpenAI chat-style JSONL before fine-tuning. You don’t need to convert anything yourself.

Format             Where it comes from
OpenAI chat JSONL  Default for new fine-tuning
Alpaca JSONL       Academic LoRA tutorials, HF datasets
ShareGPT JSONL     Older HF datasets, Vicuna-era
CSV / TSV          Spreadsheets, exports from databases

Good example

{"messages": [{"role": "system", "content": "You are a polite customer support agent for Acme."}, {"role": "user", "content": "My order #1234 hasn't shipped yet."}, {"role": "assistant", "content": "I'm sorry to hear that. Let me look into order #1234 for you."}]}
{"messages": [{"role": "user", "content": "What's your return policy?"}, {"role": "assistant", "content": "We accept returns within 30 days of purchase, no questions asked. Email returns@acme.com to start the process."}]}
{"messages": [{"role": "user", "content": "Where are you based?"}, {"role": "assistant", "content": "Acme is headquartered in Portland, Oregon, with fulfilment from Indianapolis."}]}

Three examples, one per line. The first has a system prompt; the other two start straight with the user.
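If your examples live in Python objects rather than a file, a minimal sketch like the one below produces the same shape. The function name to_jsonl and the sample data are illustrative, not part of Yachay; the only real dependency is the standard json module.

```python
import json

def to_jsonl(examples):
    """Serialize chat examples to JSONL: one JSON object per line, no array wrapper."""
    lines = []
    for messages in examples:
        # json.dumps escapes embedded newlines, so each example stays on one line.
        lines.append(json.dumps({"messages": messages}, ensure_ascii=False))
    return "\n".join(lines) + "\n"

# Illustrative data mirroring the first two rows of the example above.
examples = [
    [
        {"role": "system", "content": "You are a polite customer support agent for Acme."},
        {"role": "user", "content": "My order #1234 hasn't shipped yet."},
        {"role": "assistant", "content": "I'm sorry to hear that. Let me look into order #1234 for you."},
    ],
    [
        {"role": "user", "content": "What's your return policy?"},
        {"role": "assistant", "content": "We accept returns within 30 days of purchase, no questions asked."},
    ],
]

jsonl_text = to_jsonl(examples)
```

Writing jsonl_text to a file opened with encoding="utf-8" gives a file with one example per line.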

Common mistakes

{"messages": [{"role": "user", "content": ""}]}
{"input": "What's the weather?", "output": "Sunny."}
{"messages": [{"role": "user", "content": "Hi"}, {"role": "user", "content": "Hi again"}]}
{"instruction": "Translate to French", "input": "Hello", "output": "Bonjour"}
  • Line 1: empty content. Rejected.
  • Line 2: looks like Alpaca but is missing the required instruction field. A bare input/output pair isn’t a recognized shape and gets dropped at normalization.
  • Line 3: two consecutive user turns. Rejected.
  • Line 4: a valid Alpaca row mixed into an OpenAI-chat file. Either format works on its own; mixing in one file makes auto-detection pick one and silently drop the others. Stick to a single format per file.
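To catch mixed-format files before uploading, you could count row shapes locally. The sketch below is a rough approximation, not Yachay's actual detector: the function names are hypothetical, and the conversations key used for ShareGPT is an assumption based on the common community layout.

```python
import json
from collections import Counter

def row_shape(row):
    """Guess which accepted JSONL shape a parsed row resembles."""
    if not isinstance(row, dict):
        return "unknown"
    if isinstance(row.get("messages"), list):
        return "openai_chat"
    if isinstance(row.get("conversations"), list):
        return "sharegpt"  # assumption: the usual ShareGPT top-level key
    if "instruction" in row and "output" in row:
        return "alpaca"
    return "unknown"

def shape_counts(jsonl_text):
    """Count how many non-blank lines fall into each shape."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        if line.strip():
            counts[row_shape(json.loads(line))] += 1
    return counts

# The four rows from the example above.
sample = """\
{"messages": [{"role": "user", "content": ""}]}
{"input": "What's the weather?", "output": "Sunny."}
{"messages": [{"role": "user", "content": "Hi"}, {"role": "user", "content": "Hi again"}]}
{"instruction": "Translate to French", "input": "Hello", "output": "Bonjour"}
"""

counts = shape_counts(sample)
# More than one recognized shape in a single file means the formats are mixed.
is_mixed = len([shape for shape in counts if shape != "unknown"]) > 1
```

This only looks at top-level keys; it does not check content or turn order.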

CSV / TSV — how columns map

  • First column becomes the user turn.
  • Last column becomes the assistant turn.
  • Any columns in between are ignored. A 4-column file like id,prompt,metadata,response maps id → user and response → assistant; the middle columns are silently dropped. If id isn’t what you want as the prompt, reorder so prompt is in the first column.
  • A header row is required (≥ 2 columns) and is skipped during training. Column names don’t matter; only position does.
  • Embedded commas / tabs / newlines inside a quoted field are handled per RFC 4180. A literal " inside a quoted cell is written as "".
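The column mapping above can be previewed locally with Python's csv module, whose default dialect handles quoted fields and doubled quotes in the RFC 4180 style. This is a sketch of the described behavior, not Yachay's own code; csv_to_chat_rows is a hypothetical helper name.

```python
import csv
import io

def csv_to_chat_rows(text, delimiter=","):
    """Map the first column to the user turn and the last column to the assistant turn."""
    # newline="" lets the csv module handle newlines inside quoted fields itself.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = next(reader)  # header row is required and skipped
    if len(header) < 2:
        raise ValueError("header row needs at least 2 columns")
    rows = []
    for record in reader:
        if not record:  # skip completely blank lines
            continue
        rows.append({
            "messages": [
                {"role": "user", "content": record[0]},
                {"role": "assistant", "content": record[-1]},
            ]
        })
    return rows

# Three columns: the middle one is ignored. Note the embedded comma and the "" escape.
sample = 'prompt,notes,response\n"Hello, there",ignored,"She said ""hi"""\n'
rows = csv_to_chat_rows(sample)
```

For a TSV file, pass delimiter="\t".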

Rules

  • One JSON object per line, with no commas between lines and no array wrapper.

    The file is JSONL (newline-delimited JSON), not JSON. Most tools that emit "NDJSON" or "line-delimited JSON" produce the right shape.

  • Each line has a top-level messages array.

    Inside, each entry is an object with role and content fields. role is one of system, user, or assistant. content is plain text — no markdown, no images, no nested JSON.

  • Turns alternate user → assistant.

    An optional system message can lead. After that, user and assistant turns must alternate. Two consecutive user turns get rejected.

  • Examples should be representative.

    Yachay trains on what you give it. 50 great examples beat 5000 mediocre ones. Aim for diversity of phrasing and difficulty inside a tight domain.
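The structural rules above (one object per line, a top-level messages array, valid roles, non-empty content, alternating turns) can be checked locally with a sketch like this one. It approximates the rules as written here and is not the submit form's actual linter; lint_row and its messages are hypothetical.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def lint_row(line):
    """Return a list of problems for one JSONL line; an empty list means it passes this sketch."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    messages = row.get("messages") if isinstance(row, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing top-level messages array"]
    problems = []
    for message in messages:
        if not isinstance(message, dict) or message.get("role") not in ALLOWED_ROLES:
            problems.append("bad role")
        elif not isinstance(message.get("content"), str) or not message["content"].strip():
            problems.append("empty content")
    # An optional system message may lead; after that, turns alternate starting with user.
    roles = [m.get("role") for m in messages if isinstance(m, dict)]
    if roles and roles[0] == "system":
        roles = roles[1:]
    expected = "user"
    for role in roles:
        if role != expected:
            problems.append("turns do not alternate user -> assistant")
            break
        expected = "assistant" if expected == "user" else "user"
    return problems

good = '{"messages": [{"role": "user", "content": "Where are you based?"}, {"role": "assistant", "content": "Portland, Oregon."}]}'
repeated_user = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "user", "content": "Hi again"}]}'
empty = '{"messages": [{"role": "user", "content": ""}]}'
```

Calling lint_row on each line of a file and printing the line number alongside any problems gives a rough local preview of what a linter might flag.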

Limits

Min examples        20
Recommended         200–5,000
Max file size       1 GB
Max content length  Model context window minus 256 tokens
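As a worked example of the limits, the sketch below checks a dataset summary against them. Several things here are assumptions: "1 GB" is read as 10**9 bytes, the context window value is whatever your chosen base model documents, and token counts are supplied by the caller because they depend on the model's tokenizer. The function name limit_problems is hypothetical.

```python
MIN_EXAMPLES = 20
MAX_FILE_BYTES = 10**9   # assumption: "1 GB" interpreted as a decimal gigabyte
RESERVED_TOKENS = 256    # subtracted from the model's context window

def limit_problems(num_examples, file_bytes, longest_content_tokens, context_window):
    """Return a list of limit violations; an empty list means within the documented limits."""
    problems = []
    if num_examples < MIN_EXAMPLES:
        problems.append(f"only {num_examples} examples; minimum is {MIN_EXAMPLES}")
    if file_bytes > MAX_FILE_BYTES:
        problems.append("file exceeds 1 GB")
    budget = context_window - RESERVED_TOKENS
    if longest_content_tokens > budget:
        problems.append(f"content of {longest_content_tokens} tokens exceeds budget of {budget}")
    return problems

# With a hypothetical 8,192-token context window, the content budget is 8192 - 256 = 7936.
issues = limit_problems(num_examples=12, file_bytes=5_000,
                        longest_content_tokens=9_000, context_window=8_192)
```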

Got something stranger than these five? HuggingFace Parquet, custom JSON schemas, multi-modal — email hello@condorbox.ai with a sample and we’ll send back a converter script.
