Dataset format
Yachay accepts JSONL (OpenAI chat / Alpaca / ShareGPT) or CSV / TSV. We auto-detect and normalize to OpenAI chat-style before fine-tuning — drop whatever you have. The submit form lints every row before you pay; if anything’s off, you’ll see exactly which line.
Accepted formats
Yachay auto-detects the dataset format at upload and normalizes everything to OpenAI chat-style JSONL before fine-tuning. You don’t need to convert anything yourself.
| Format | Shape | Where it comes from |
|---|---|---|
| OpenAI chat JSONL | keys: messages[].role, messages[].content | Default for new fine-tuning |
| Alpaca JSONL | keys: instruction, input (optional), output | Academic LoRA tutorials, HF datasets |
| ShareGPT JSONL | keys: conversations[].from, conversations[].value | Older HF datasets, Vicuna-era |
| CSV / TSV | header row + ≥ 2 columns; first → user, last → assistant | Spreadsheets, exports from databases |
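As a rough illustration of what normalizing to OpenAI chat-style means for the JSONL shapes in the table, here is a minimal sketch. It is not Yachay's actual implementation. The function name `to_openai_chat`, the `human`/`gpt` role mapping for ShareGPT, and the way the optional Alpaca `input` is appended to `instruction` are all assumptions made for the example.

```python
# Assumed mapping for ShareGPT "from" values (a common convention,
# not confirmed by this page).
SHAREGPT_ROLES = {"system": "system", "human": "user", "gpt": "assistant"}


def to_openai_chat(row: dict) -> dict:
    """Convert one parsed JSONL row to {"messages": [...]} (illustrative only)."""
    if "messages" in row:  # already OpenAI chat
        return row
    if "conversations" in row:  # ShareGPT
        return {
            "messages": [
                {"role": SHAREGPT_ROLES[turn["from"]], "content": turn["value"]}
                for turn in row["conversations"]
            ]
        }
    if "instruction" in row and "output" in row:  # Alpaca
        prompt = row["instruction"]
        if row.get("input"):
            # Assumption: the optional input follows the instruction
            # after a blank line.
            prompt = f"{prompt}\n\n{row['input']}"
        return {
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": row["output"]},
            ]
        }
    raise ValueError("unrecognized row shape")
```

In this sketch a row with only `input` and `output` raises `ValueError`, and an unknown ShareGPT `from` value raises `KeyError`; the real service may handle both cases differently.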
Good example
{"messages": [{"role": "system", "content": "You are a polite customer support agent for Acme."}, {"role": "user", "content": "My order #1234 hasn't shipped yet."}, {"role": "assistant", "content": "I'm sorry to hear that. Let me look into order #1234 for you."}]}
{"messages": [{"role": "user", "content": "What's your return policy?"}, {"role": "assistant", "content": "We accept returns within 30 days of purchase, no questions asked. Email returns@acme.com to start the process."}]}
{"messages": [{"role": "user", "content": "Where are you based?"}, {"role": "assistant", "content": "Acme is headquartered in Portland, Oregon, with fulfilment from Indianapolis."}]}
Three examples, one per line. The first has a system prompt; the other two start straight with the user.
Common mistakes
{"messages": [{"role": "user", "content": ""}]}
{"input": "What's the weather?", "output": "Sunny."}
{"messages": [{"role": "user", "content": "Hi"}, {"role": "user", "content": "Hi again"}]}
{"instruction": "Translate to French", "input": "Hello", "output": "Bonjour"}
- Line 1: empty content. Rejected.
- Line 2: looks like Alpaca but is missing the required instruction field; bare input/output isn’t a recognized shape and gets dropped at normalization.
- Line 3: two consecutive user turns. Rejected.
- Line 4: a valid Alpaca row mixed into an OpenAI-chat file. Either format works on its own; mixing in one file makes auto-detection pick one and silently drop the others. Stick to a single format per file.
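To catch mixed or unrecognized rows before uploading, you could count the shape of every line yourself. The sketch below is a local pre-check under assumed detection rules (key presence only); `detect_shape` and `shape_report` are hypothetical helpers, not part of Yachay.

```python
import json
from collections import Counter


def detect_shape(row: dict) -> str:
    """Classify one parsed row by its top-level keys (assumed rules)."""
    if "messages" in row:
        return "openai"
    if "conversations" in row:
        return "sharegpt"
    if "instruction" in row and "output" in row:
        return "alpaca"
    return "unknown"


def shape_report(jsonl_text: str) -> Counter:
    """Count row shapes; more than one key means the file mixes formats."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        if line.strip():  # ignore blank lines
            counts[detect_shape(json.loads(line))] += 1
    return counts
```

Run on the four lines above, this reports two `openai` rows, one `unknown` row (the bare input/output line), and one `alpaca` row, which is a signal to split or convert the file before submitting.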
CSV / TSV — how columns map
- First column becomes the user turn.
- Last column becomes the assistant turn.
- Any columns in between are ignored. A 4-column file like id,prompt,metadata,response uses only id → user and response → assistant; middle columns are silently dropped. If id isn’t what you want as the prompt, reorder so prompt is in the first column.
- A header row is required (≥ 2 columns) and is skipped during training. Column names don’t matter, only position.
- Embedded commas / tabs / newlines inside a quoted field are handled per RFC 4180. A literal " inside a quoted cell is written as "".
Rules
One JSON object per line — no commas, no array wrapper.
The file is JSONL (newline-delimited JSON), not JSON. Most tools that emit "NDJSON" or "line-delimited JSON" produce the right shape.
Each line has a top-level messages array.
Inside, each entry is an object with role and content fields. role is one of system, user, or assistant. content is plain text — no markdown, no images, no nested JSON.
Turns alternate user → assistant.
An optional system message can lead. After that, user and assistant turns must alternate. Two consecutive user turns get rejected.
Examples should be representative.
Yachay trains on what you give it. 50 great examples beat 5000 mediocre ones. Aim for diversity of phrasing and difficulty inside a tight domain.
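The structural rules above (a top-level messages array, valid roles, non-empty text content, an optional leading system message, then alternating user and assistant turns) can be approximated with a small local check. This is a sketch of those rules as stated on this page, not the submit form's actual linter; `lint_row` is a hypothetical helper.

```python
VALID_ROLES = {"system", "user", "assistant"}


def lint_row(row: dict) -> list[str]:
    """Return a list of problems for one parsed row (empty list means OK)."""
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty top-level messages array"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict) or msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: role must be system, user, or assistant")
        elif not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i}: content must be non-empty text")
    roles = [m.get("role") for m in messages if isinstance(m, dict)]
    if "system" in roles[1:]:
        problems.append("system message may only appear first")
    # Drop an optional leading system message, then require user/assistant alternation.
    turns = roles[1:] if roles[:1] == ["system"] else roles
    expected = "user"
    for role in turns:
        if role != expected:
            problems.append(f"expected {expected} turn, got {role}")
            break
        expected = "assistant" if expected == "user" else "user"
    return problems
```

Parse each line with `json.loads` and pass the result to `lint_row`; reporting the line number alongside any problems mirrors what the submit form shows.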
Limits
- Min examples: 20
- Recommended: 200–5,000
- Max file size: 1 GB
- Max content length: Model context window minus 256 tokens
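Two of these limits are easy to check before uploading a JSONL file. The sketch below assumes "1 GB" means 10^9 bytes and counts one example per non-blank line; both are assumptions, and `check_limits` is a hypothetical helper. The content-length limit depends on the chosen model's tokenizer and is not covered here.

```python
import os

MIN_EXAMPLES = 20
MAX_FILE_BYTES = 10**9  # Assumption: "1 GB" read as 10^9 bytes.


def check_limits(path: str) -> list[str]:
    """Return limit violations for a JSONL file (empty list means OK)."""
    problems = []
    if os.path.getsize(path) > MAX_FILE_BYTES:
        problems.append("file is larger than 1 GB")
    with open(path, encoding="utf-8") as f:
        examples = sum(1 for line in f if line.strip())
    if examples < MIN_EXAMPLES:
        problems.append(f"only {examples} examples; minimum is {MIN_EXAMPLES}")
    return problems
```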
Got something stranger than these five? HuggingFace Parquet, custom JSON schemas, multi-modal — email hello@condorbox.ai with a sample and we’ll send back a converter script.