The Alpaca data format in LLaMA-Factory's dataset_info.json: a wrong columns mapping produces empty messages
Your dataset_info.json is largely correct in format, especially if you are using a tool such as LLaMA-Factory.
The correct form:
"columns": {
  "prompt": "instruction",
  "query": "input",
  "response": "output"
}
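The dataset_info.json here was auto-generated by an AI, which wrote the mapping incorrectly as:
"columns": {
  "instruction": "instruction",
  "input": "input",
  "output": "output"
}
A sample record from the JSONL data:
{"instruction": "aigc ", "input": "aaa", "output": "1111"}
The full dataset_info.json as generated: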
{
  "binding_part_000": {
    "file_name": "part_000.jsonl",
    "formatting": "alpaca",
    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }
  },
  "binding_part_001": {
    "file_name": "part_001.jsonl",
    "formatting": "alpaca",
    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }
  },
  "binding_part_002": {
    "file_name": "part_002.jsonl",
    "formatting": "alpaca",
    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }
  },
  "binding_part_003": {
    "file_name": "part_003.jsonl",
    "formatting": "alpaca",
    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }
  },
  "binding_part_004": {
    "file_name": "part_004.jsonl",
    "formatting": "alpaca",
    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }
  },
  "binding_part_005": {
    "file_name": "part_005.jsonl",
    "formatting": "alpaca",
    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }
  },
  "binding_part_006": {
    "file_name": "part_006.jsonl",
    "formatting": "alpaca",
    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }
  },
  "binding_part_007": {
    "file_name": "part_007.jsonl",
    "formatting": "alpaca",
    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }
  },
  "binding_part_008": {
    "file_name": "part_008.jsonl",
    "formatting": "alpaca",
    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }
  },
  "binding_part_009": {
    "file_name": "part_009.jsonl",
    "formatting": "alpaca",
    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }
  }
}
[WARNING|2026-04-19 10:53:04] llamafactory.data.processor.supervised:148 >> Dropped invalid example: [{'content': '', 'role': 'user'}]
E0419 10:53:12.555000 6968 lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 7035) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 7, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/usr/local/lib/python3.10/dist-packages/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2026-04-19_10:53:12
host : crdnotebook-2045318677710061569-acz6ty4okx-26777
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 7035)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Traceback (most recent call last):
File "/usr/local/bin/llamafactory-cli", line 7, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/llamafactory/cli.py", line 130, in main
process = subprocess.run(
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['torchrun', '--nnodes', '1', '--node_rank', '0', '--nproc_per_node', '1', '--master_addr', '127.0.0.1', '--master_port', '52027', '/usr/local/lib/python3.10/dist-packages/llamafactory/launcher.py', '/root/private_data/training_configs/temp_part_000_shuguang.yaml']' returned non-zero exit status 1
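1. Core correction: the key names in columns
In LLaMA-Factory and similar mainstream frameworks, the keys on the left side of the columns object are fixed by the framework; the values on the right are the actual column names in your JSONL file.
- Your current form:
"columns": { "instruction": "instruction", "input": "input", "output": "output" }
- The standard form (for LLaMA-Factory):
The framework generally recognizes prompt (the instruction), query (the input), and response (the output).
Change it to:
"columns": { "prompt": "instruction", "query": "input", "response": "output" }
Note: if the columns in your original file are named instruction/input/output, the right-hand values stay as they are.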
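The renaming described in this section can be checked mechanically. What follows is a minimal sketch, not part of LLaMA-Factory: `ALPACA_KEYS` lists the alpaca-format keys as given in the LLaMA-Factory data README (an assumption to verify against your installed version), and `RENAMES`, `find_unknown_keys`, and `fix_columns` are hypothetical helper names introduced only for illustration.

```python
import json

# Keys accepted on the left side of "columns" for the alpaca format,
# as listed in the LLaMA-Factory data README (assumption: verify for your version).
ALPACA_KEYS = {"prompt", "query", "response", "system", "history"}

# The mistaken key names seen in this post, mapped to the expected ones.
RENAMES = {"instruction": "prompt", "input": "query", "output": "response"}


def find_unknown_keys(dataset_info):
    """Return {dataset_name: [unrecognized column keys]} for each offending entry."""
    problems = {}
    for name, entry in dataset_info.items():
        unknown = [key for key in entry.get("columns", {}) if key not in ALPACA_KEYS]
        if unknown:
            problems[name] = unknown
    return problems


def fix_columns(dataset_info):
    """Return a copy with the mistaken keys renamed; the values are left untouched."""
    fixed = {}
    for name, entry in dataset_info.items():
        new_entry = dict(entry)
        if "columns" in entry:
            new_entry["columns"] = {
                RENAMES.get(key, key): value for key, value in entry["columns"].items()
            }
        fixed[name] = new_entry
    return fixed


# One entry copied from the generated dataset_info.json in this post.
info = {
    "binding_part_000": {
        "file_name": "part_000.jsonl",
        "formatting": "alpaca",
        "columns": {"instruction": "instruction", "input": "input", "output": "output"},
    }
}
print(find_unknown_keys(info))
print(json.dumps(fix_columns(info), indent=2))
```

For a real file, load it with `json.load`, pass the result through `fix_columns`, and write it back with `json.dump`. Only the keys change, so the values keep pointing at the columns that actually exist in the JSONL files.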
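2. Efficiency tip: merging files
You have registered the 10 shards (part_000 to part_009) as 10 separate datasets. That works, but you then have to select all 10 datasets by hand when launching training.
If you would rather treat them as one dataset, you can try a list under file_name. Not every LLaMA-Factory version accepts this: the project's data documentation describes file_name as a single dataset file or folder, so check your installed version first:
"binding_parts_all": {
  "file_name": [
    "part_000.jsonl",
    "part_001.jsonl",
    "part_002.jsonl",
    "part_003.jsonl",
    "part_004.jsonl",
    "part_005.jsonl",
    "part_006.jsonl",
    "part_007.jsonl",
    "part_008.jsonl",
    "part_009.jsonl"
  ],
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output"
  }
}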
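If the list form is not accepted by your version, one version-independent alternative is to concatenate the shards into a single JSONL file and register only that file. The sketch below uses the Python standard library only; `merge_jsonl` and the output name `binding_all.jsonl` are hypothetical names chosen for illustration, not anything LLaMA-Factory provides.

```python
import glob
import os


def merge_jsonl(pattern, out_path):
    """Concatenate all files matching pattern, in sorted order, into out_path.

    Blank lines are skipped. Returns the number of records written.
    """
    count = 0
    out_abs = os.path.abspath(out_path)
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(pattern)):
            if os.path.abspath(path) == out_abs:
                continue  # never read the output file back into itself
            with open(path, encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if line:
                        out.write(line + "\n")
                        count += 1
    return count


# Example call, run from the directory that holds the shards:
# merge_jsonl("part_*.jsonl", "binding_all.jsonl")
```

A single dataset_info.json entry whose file_name is the merged file, with the prompt/query/response mapping, would then cover all ten shards. Sorting the matched paths keeps the record order stable across runs.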
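3. Final checklist
Before running, confirm these three points:
- File paths: part_000.jsonl and the other shards must sit in the same directory as dataset_info.json (or file_name must give the correct relative path).
- Data content: since you set formatting: alpaca, make sure each line of your .jsonl is one JSON object containing the three fields instruction, input, and output.
- Commas and syntax: your JSON is syntactically valid as written, and the last entry correctly has no trailing comma.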
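The data-content check in this list can be automated. The following is a rough sketch under stated assumptions: `check_jsonl_lines` is a hypothetical helper, and the rule that a record with both instruction and input empty is suspicious is inferred from the warning in the log (`{'content': '', 'role': 'user'}`), not taken from LLaMA-Factory's source.

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")


def check_jsonl_lines(lines):
    """Yield (line_number, problem) for every line that looks unusable."""
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            yield lineno, "blank line"
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            yield lineno, f"invalid JSON: {exc.msg}"
            continue
        if not isinstance(record, dict):
            yield lineno, "not a JSON object"
            continue
        for field in REQUIRED_FIELDS:
            if field not in record:
                yield lineno, f"missing field: {field}"
            elif not isinstance(record[field], str):
                yield lineno, f"field is not a string: {field}"
        # Assumption: with both fields empty there is no text for the user turn.
        instruction = record.get("instruction")
        query = record.get("input")
        if isinstance(instruction, str) and isinstance(query, str):
            if not instruction.strip() and not query.strip():
                yield lineno, "instruction and input are both empty"


# The first line is the sample record from this post; the rest are made-up bad lines.
sample = [
    '{"instruction": "aigc ", "input": "aaa", "output": "1111"}',
    '{"instruction": "", "input": "", "output": "1111"}',
    '{"instruction": "x", "input": "", "output": 1111}',
    "not json",
]
for lineno, problem in check_jsonl_lines(sample):
    print(lineno, problem)
```

To scan a real shard, pass an open file object instead of the list, for example `check_jsonl_lines(open("part_000.jsonl", encoding="utf-8"))`. If the scan reports nothing for any shard, the data is less likely to be the cause and the mapping in dataset_info.json is the next place to look.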
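Summary
If the columns in your .jsonl really are named instruction, input, and output, try changing the left side of columns to prompt, query, and response. This is one of the most common places for newcomers to get stuck.
Your shards are sliced as neatly as a chef's cuts; once the mapping names are fixed, the configuration should run.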