Recently, while running SFT on chatglm2, I hit the error below. I was training with bf16 and DeepSpeed ZeRO-3, because I was worried that fp16 would produce a lot of NaNs.

File "/home/suser/.conda/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    return func(*args, **kwargs)
  File "/home/suser/.conda/envs/llm/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2532, in all_gather_into_tensor
    result = forward_call(*args, **kwargs)
  File "/home/suser/.cache/huggingface/modules/transformers_modules/chatglm2-6b/modeling_chatglm.py", line 805, in forward
    inputs_embeds = self.embedding(input_ids)
  File "/home/suser/.conda/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    work = group._allgather_base(output_tensor, input_tensor)
RuntimeError: output tensor must have the same type as input tensor

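Solution

Adding a bf16 section to the ZeRO stage 3 config is enough to fix it. The error message points to a dtype mismatch between the input and output tensors of the all-gather, and enabling bf16 explicitly in the DeepSpeed config resolves it: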

{
    "bf16": {
        "enabled": true
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "fp16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}
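
If you already have a ZeRO-3 config file and want to apply the same change programmatically, a minimal sketch could look like the following. The helper names `enable_bf16` and `precision_problems` are hypothetical and not part of DeepSpeed; the code only manipulates the config as a plain dict. It assumes that an explicit `"fp16": {"enabled": true}` should be switched off when bf16 is turned on, since DeepSpeed does not allow both modes at once, and it leaves an `"auto"` value untouched so the launcher can resolve it.

```python
import copy
import json


def enable_bf16(config: dict) -> dict:
    """Return a copy of a DeepSpeed config dict with bf16 explicitly enabled."""
    patched = copy.deepcopy(config)
    patched["bf16"] = {"enabled": True}
    # An explicit fp16=true would conflict with bf16, so turn it off.
    # A value such as "auto" is left alone for the launcher to resolve.
    fp16 = patched.get("fp16")
    if isinstance(fp16, dict) and fp16.get("enabled") is True:
        fp16["enabled"] = False
    return patched


def precision_problems(config: dict) -> list[str]:
    """List precision settings that look wrong for a bf16 run (hypothetical checks)."""
    problems = []
    bf16_on = config.get("bf16", {}).get("enabled")
    fp16_on = config.get("fp16", {}).get("enabled")
    if bf16_on is not True:
        problems.append("bf16 is not explicitly enabled")
    if bf16_on is True and fp16_on is True:
        problems.append("fp16 and bf16 are both enabled")
    return problems


if __name__ == "__main__":
    # A config without a bf16 section, like the one that triggered the error.
    old = {"fp16": {"enabled": "auto"}, "zero_optimization": {"stage": 3}}
    print(precision_problems(old))
    print(json.dumps(enable_bf16(old), indent=4))
```

The original dict is never modified, so the patched result can be inspected before it is written back with `json.dump`.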

References

[BUG]RuntimeError: output tensor must have the same type as input tensor