Your dataset_info.json is mostly correct, especially if you are using a tool such as LLaMA-Factory.

Correct form:

    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output"
    }

Because dataset_info.json was auto-generated by an AI tool, it currently contains:

    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }

{"instruction": "aigc ", "input": "aaa", "output": "1111"}


{
  "binding_part_000": {
    "file_name": "part_000.jsonl",
    "formatting": "alpaca",
    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }
  },
  "binding_part_001": {
    "file_name": "part_001.jsonl",
    "formatting": "alpaca",
    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }
  },
  "binding_part_002": {
    "file_name": "part_002.jsonl",
    "formatting": "alpaca",
    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }
  },
  "binding_part_003": {
    "file_name": "part_003.jsonl",
    "formatting": "alpaca",
    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }
  },
  "binding_part_004": {
    "file_name": "part_004.jsonl",
    "formatting": "alpaca",
    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }
  },
  "binding_part_005": {
    "file_name": "part_005.jsonl",
    "formatting": "alpaca",
    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }
  },
  "binding_part_006": {
    "file_name": "part_006.jsonl",
    "formatting": "alpaca",
    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }
  },
  "binding_part_007": {
    "file_name": "part_007.jsonl",
    "formatting": "alpaca",
    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }
  },
  "binding_part_008": {
    "file_name": "part_008.jsonl",
    "formatting": "alpaca",
    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }
  },
  "binding_part_009": {
    "file_name": "part_009.jsonl",
    "formatting": "alpaca",
    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }
  }
}


[WARNING|2026-04-19 10:53:04] llamafactory.data.processor.supervised:148 >> Dropped invalid example: [{'content': '', 'role': 'user'}]
E0419 10:53:12.555000 6968 lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 7035) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 7, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/usr/local/lib/python3.10/dist-packages/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-04-19_10:53:12
  host      : crdnotebook-2045318677710061569-acz6ty4okx-26777
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 7035)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html 
============================================================
Traceback (most recent call last):
  File "/usr/local/bin/llamafactory-cli", line 7, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/llamafactory/cli.py", line 130, in main
    process = subprocess.run(
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['torchrun', '--nnodes', '1', '--node_rank', '0', '--nproc_per_node', '1', '--master_addr', '127.0.0.1', '--master_port', '52027', '/usr/local/lib/python3.10/dist-packages/llamafactory/launcher.py', '/root/private_data/training_configs/temp_part_000_shuguang.yaml']' returned non-zero exit status 1

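1. Core fix: the key names in columns

In LLaMA-Factory and similar mainstream frameworks, the keys on the left side of the columns object are fixed by the framework; the values on the right side are the actual column names in your JSONL file.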

  • What you have now:
    "columns": {
      "instruction": "instruction",
      "input": "input",
      "output": "output"
    }
    
  • The standard form (for LLaMA-Factory):
    The framework recognizes prompt (the instruction), query (the input), and response (the output).
    Change it to:
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output"
    }
    
    Note: if the column names in your source file are instruction/input/output, leave the right-hand values unchanged.
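
Since all ten entries share the same columns block, the rename can be scripted rather than edited by hand. The following is a minimal sketch, assuming the key mapping described above; `KEY_FIX` and `fix_columns` are hypothetical names used for illustration, not part of LLaMA-Factory.

```python
import json

# Assumed mapping from the auto-generated keys to the keys the text above
# says LLaMA-Factory expects for alpaca-style data.
KEY_FIX = {"instruction": "prompt", "input": "query", "output": "response"}


def fix_columns(dataset_info: dict) -> dict:
    """Return a copy of dataset_info with the keys inside each columns block renamed."""
    fixed = {}
    for name, entry in dataset_info.items():
        entry = dict(entry)  # shallow copy so the caller's entry is not mutated
        columns = entry.get("columns")
        if isinstance(columns, dict):
            # Rename known keys; leave any other key untouched.
            entry["columns"] = {KEY_FIX.get(k, k): v for k, v in columns.items()}
        fixed[name] = entry
    return fixed


# One entry copied from the config in this post, used as a small demo.
example = {
    "binding_part_000": {
        "file_name": "part_000.jsonl",
        "formatting": "alpaca",
        "columns": {"instruction": "instruction", "input": "input", "output": "output"},
    }
}
fixed_example = fix_columns(example)
print(json.dumps(fixed_example, indent=2))
```

To apply it to the real file, load dataset_info.json with `json.load`, pass the result through `fix_columns`, and write it back with `json.dump`; keeping a backup of the original file first is a sensible precaution.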

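2. Efficiency tip: merge the files

You have registered the 10 shards (part_000 through part_009) as 10 separate datasets. That works, but you then have to select all 10 datasets by hand when you launch training.

If you would rather treat them as a single dataset, you can try the list syntax for file_name. Not every LLaMA-Factory version may accept a list there, so test it with a short run first: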

  "binding_parts_all": {
    "file_name": [
      "part_000.jsonl",
      "part_001.jsonl",
      "part_002.jsonl",
      "part_003.jsonl",
      "part_004.jsonl",
      "part_005.jsonl",
      "part_006.jsonl",
      "part_007.jsonl",
      "part_008.jsonl",
      "part_009.jsonl"
    ],
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output"
    }
  }
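
If the list syntax turns out not to work in your version, an alternative that does not depend on it is to concatenate the shards into one JSONL file and register that single file. This is a sketch under that assumption; `merge_shards` and the output name `binding_all.jsonl` are hypothetical, and the demo writes two throwaway shards to a temporary directory instead of touching real data.

```python
import json
import tempfile
from pathlib import Path


def merge_shards(shard_paths, out_path):
    """Concatenate JSONL shards into one file, skipping blank lines.

    Returns the number of records written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in shard_paths:
            with open(path, encoding="utf-8") as shard:
                for line in shard:
                    line = line.strip()
                    if not line:
                        continue
                    json.loads(line)  # fail early on a malformed record
                    out.write(line + "\n")
                    count += 1
    return count


# Demo with two tiny throwaway shards in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    for i in range(2):
        record = {"instruction": f"q{i}", "input": "", "output": f"a{i}"}
        (tmp / f"part_{i:03d}.jsonl").write_text(json.dumps(record) + "\n", encoding="utf-8")
    shards = sorted(tmp.glob("part_*.jsonl"))  # sorted keeps shard order stable
    merged_count = merge_shards(shards, tmp / "binding_all.jsonl")
    merged_lines = (tmp / "binding_all.jsonl").read_text(encoding="utf-8").splitlines()

print(merged_count)
```

The merged file can then be registered with a single entry whose file_name is the merged file and whose columns block uses the prompt/query/response keys.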

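3. Final checklist

Before you run, confirm the following three points:

  1. File paths: part_000.jsonl and the other shards must sit in the same directory as dataset_info.json (or file_name must spell out the relative path).
  2. Data content: since you use formatting: alpaca, make sure every line of each .jsonl file is a JSON object containing the three fields instruction, input, and output.
  3. Commas and formatting: your JSON syntax looks valid as written. In particular, there is no trailing comma after the last entry, which is good.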
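
Point 2 of the checklist can be checked mechanically. The sketch below is one way to do it, assuming the three field names from the checklist; `check_jsonl_lines` is a hypothetical helper. It also flags rows where both instruction and input are empty, because the "Dropped invalid example" warning in the log shows an empty user message, which suggests (but does not prove) that at least one row produced an empty prompt.

```python
import json

# Field names taken from the checklist above.
REQUIRED = ("instruction", "input", "output")


def check_jsonl_lines(lines):
    """Return a list of (line_number, problem) for alpaca-style JSONL lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = [key for key in REQUIRED if key not in row]
        if missing:
            problems.append((number, "missing fields: " + ", ".join(missing)))
            continue
        # A row with no instruction and no input yields an empty user turn.
        if not str(row["instruction"]).strip() and not str(row["input"]).strip():
            problems.append((number, "empty instruction and input"))
    return problems


sample = [
    '{"instruction": "aigc ", "input": "aaa", "output": "1111"}',
    '{"instruction": "", "input": "", "output": "x"}',
    '{"instruction": "hi", "output": "x"}',
    'not json',
]
report = check_jsonl_lines(sample)
for number, problem in report:
    print(f"line {number}: {problem}")
```

For a real shard, pass an open file handle, for example `check_jsonl_lines(open("part_000.jsonl", encoding="utf-8"))`, since iterating over a file yields its lines.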

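Summary

If the column names in your .jsonl files really are instruction, input, and output, try changing the left-hand side of columns to prompt, query, and response. This is usually where beginners get stuck!

Your shards are split as neatly as a chef's slices; once the mapping names are fixed, the config should run.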
