
  • 1. AttributeError: 'DataParallel' object has no attribute 'init_hidden_state'
  • 2. input and hidden tensors are not at the same device, found input tensor at GPU and hidden tensor at CPU
  • 3. input and hidden tensors are not at the same device, found input tensor at cuda:1 and hidden tensor at cuda:0
  • 4. RuntimeError: Expected hidden[0] size (x, x, x), got [x, x, x]
  • 1. Related to the layout of the LSTM input and of the initial states h_0, c_0
  • 2. The batch size of h_0, c_0 must follow the batch size of the input dynamically


PyTorch multi-GPU data parallelism: a diary of pitfalls





class Classfication_Model(nn.Module):
    def __init__(self):
        super(Classfication_Model, self).__init__()
        self.hidden_size = 128
        self.embedding_dim = 200
        self.number_layer = 4
        self.bidirectional = True
        self.bi_number = 2 if self.bidirectional else 1
        self.dropout = 0.5
        self.embedding = nn.Embedding(num_embeddings=len(model.index_to_key)+200
                                       , embedding_dim=self.embedding_dim)

        self.lstm = nn.LSTM(input_size=self.embedding_dim
                            , hidden_size=self.hidden_size
                            , num_layers=self.number_layer
                            , dropout=self.dropout
                            , bidirectional=self.bidirectional)
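        self.fc = nn.Sequential(
            # First layer reconstructed: it is missing from the original listing.
            # Input width is hidden_size * 2 = 256 (concatenated directions).
            nn.Linear(self.hidden_size * self.bi_number, 20)
            , nn.ReLU()
            , nn.Linear(20,2)
        )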

    def init_hidden_state(self, batch_size):
        h_0 = torch.rand(batch_size, self.number_layer * self.bi_number,  self.hidden_size).to(device)
        c_0 = torch.rand(batch_size, self.number_layer * self.bi_number, self.hidden_size).to(device)
        return (h_0, c_0)

    def forward(self, input, hidden):
        input_embeded = self.embedding(input)
        input_embeded = input_embeded.permute(1, 0, 2) # reorder to [seq_len, batch_size, embedding_dim]
        hidden = [x.permute(1,0,2).contiguous() for x in hidden]
        _, (h_n, c_n) = self.lstm(input_embeded, hidden)
        out = torch.cat((h_n[-2, :, :], h_n[-1, :, :]), -1) # [batch_size, 2 * hidden_size], i.e. 256 features
        out = self.fc(out)
        return out

def train(epoch):
    ds = corpus_dataset(train_model=True, max_sentence_length=50,train_set=train_set,test_set=test_set)
    train_dataloader = DataLoader(ds, batch, shuffle=True,num_workers=5)
    total_loss = 0
    # hidden = classfication_model.init_hidden_state(batch)  # fails under DataParallel
    # hidden = classfication_model.module.init_hidden_state(batch)  # batch size is fixed here
    for idx, (input, target) in enumerate(train_dataloader):
        target = target.to(device)
        input = input.to(device)
        hidden = classfication_model.module.init_hidden_state(len(input))  # batch size follows the actual batch
        output = classfication_model(input, hidden)
        loss = criterion(output, target)  # targets must be 0-based class indices (e.g. 0..9, not 1..10)
        total_loss += loss.item()
    print(f"epoch:{epoch}  ######  total_loss:{total_loss:.6f}")
1. AttributeError: 'DataParallel' object has no attribute 'init_hidden_state'


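Solution: nn.DataParallel wraps the original model and keeps it in its module attribute, so custom attributes and methods of the original model have to be reached through .module. For example, change hidden = classfication_model.init_hidden_state(batch) to hidden = classfication_model.module.init_hidden_state(batch), and this bug is gone.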
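To make the wrapper behaviour concrete, here is a minimal sketch. TinyModel and unwrap are hypothetical names used only for illustration; they are not part of my model or of PyTorch. The sketch should also run on a machine without GPUs, where DataParallel simply calls the wrapped module.

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Hypothetical stand-in for Classfication_Model."""

    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=4, hidden_size=8)

    def init_hidden_state(self, batch_size):
        return (torch.zeros(1, batch_size, 8), torch.zeros(1, batch_size, 8))

    def forward(self, x):
        return self.lstm(x)[0]


def unwrap(model):
    """Return the underlying model whether or not it is wrapped (hypothetical helper)."""
    return model.module if isinstance(model, nn.DataParallel) else model


wrapped = nn.DataParallel(TinyModel())

# Custom methods are not forwarded by the wrapper.
try:
    wrapped.init_hidden_state(2)
    direct_call_failed = False
except AttributeError:
    direct_call_failed = True

# Going through .module (here via the helper) reaches the original method.
h_0, c_0 = unwrap(wrapped).init_hidden_state(2)
print(direct_call_failed, tuple(h_0.shape))
```

A helper like unwrap keeps the same training script usable with and without DataParallel.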

2. input and hidden tensors are not at the same device, found input tensor at GPU and hidden tensor at CPU

Cause: part of the data is on the GPU and part on the CPU. Here the LSTM's initial hidden state was still on the CPU. Moving the initial h_0 and c_0 to the GPU fixes it.


def init_hidden_state(self, batch_size):
        h_0 = torch.rand(batch_size, self.number_layer * self.bi_number,  self.hidden_size).to(device)
        c_0 = torch.rand(batch_size, self.number_layer * self.bi_number, self.hidden_size).to(device)
        return (h_0, c_0)
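A variation that avoids relying on a global device variable is to create the state on whatever device the input already lives on. The sketch below is an assumption-laden illustration: init_hidden_like is a hypothetical helper, and the sizes mirror the model above.

```python
import torch


def init_hidden_like(reference, num_layers, num_directions, hidden_size):
    """Build batch-first (h_0, c_0) on the same device as `reference`.

    `reference` is assumed to be batch-first: [batch_size, ...].
    """
    batch_size = reference.shape[0]
    shape = (batch_size, num_layers * num_directions, hidden_size)
    h_0 = torch.rand(shape, device=reference.device)
    c_0 = torch.rand(shape, device=reference.device)
    return h_0, c_0


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokens = torch.zeros(6, 50, dtype=torch.long, device=device)  # [batch, seq_len]
h_0, c_0 = init_hidden_like(tokens, num_layers=4, num_directions=2, hidden_size=128)
same_device = h_0.device == tokens.device and c_0.device == tokens.device
print(same_device, tuple(h_0.shape))
```

Passing device= at creation time also skips the extra CPU-to-GPU copy that .to(device) performs.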
3. input and hidden tensors are not at the same device, found input tensor at cuda:1 and hidden tensor at cuda:0


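Some background first: this is a new problem I hit when I took an LSTM model that already ran fine on a single GPU and started training it on multiple GPUs.

Analysis: as the message says, DataParallel split the input data across different GPUs, but the hidden state was not split across GPUs with it, so the input and hidden tensors ended up on different devices.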



def forward(self, input):
        input_embeded = self.embedding(input)
        input_embeded = input_embeded.permute(1, 0, 2)
        h_0, c_0 = self.init_hidden_state(input_embeded.shape[1])
        _, (h_n, c_n) = self.lstm(input_embeded, (h_0, c_0))
        out = torch.cat((h_n[-2, :, :], h_n[-1, :, :]), -1)
        out = self.fc(out)
        return out


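In the version above, forward builds (h_0, c_0) itself through the global device, so every replica gets a state on the same GPU. I changed it so that (h_0, c_0) is passed into forward as the hidden argument; DataParallel then splits h_0 and c_0 across the GPUs together with the input. Once this bug is fixed, though, the next one may show up right away.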

def forward(self, input, hidden):
        input_embeded = self.embedding(input)
        input_embeded = input_embeded.permute(1, 0, 2) # reorder to [seq_len, batch_size, embedding_dim]
        hidden = [x.permute(1,0,2).contiguous() for x in hidden]
        _, (h_n, c_n) = self.lstm(input_embeded, hidden)
        out = torch.cat((h_n[-2, :, :], h_n[-1, :, :]), -1) # [batch_size, 2 * hidden_size], i.e. 256 features
        out = self.fc(out)
        return out
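An alternative, which is not what I did above, is to keep creating the state inside forward but derive its device from the input slice that the replica received. The following is a minimal sketch under that assumption; SelfInitLSTM and its sizes are hypothetical, and it uses zeros rather than torch.rand for the initial state.

```python
import torch
import torch.nn as nn


class SelfInitLSTM(nn.Module):
    """Hypothetical model that creates its own initial state per replica."""

    def __init__(self, vocab_size=100, embedding_dim=16, hidden_size=8, num_layers=2):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, num_layers=num_layers,
                            bidirectional=True)

    def forward(self, tokens):
        # tokens: [local_batch, seq_len], already on this replica's device
        embedded = self.embedding(tokens).permute(1, 0, 2)
        state_shape = (self.num_layers * 2, embedded.shape[1], self.hidden_size)
        # new_zeros inherits device and dtype from `embedded`
        h_0 = embedded.new_zeros(state_shape)
        c_0 = embedded.new_zeros(state_shape)
        _, (h_n, _) = self.lstm(embedded, (h_0, c_0))
        # concatenate the last layer's forward and backward states
        return torch.cat((h_n[-2], h_n[-1]), dim=-1)


model = SelfInitLSTM()
out = model(torch.randint(0, 100, (5, 7)))
print(tuple(out.shape))
```

Because the state is sized from the local slice, this variant also sidesteps the batch-size mismatch discussed next, at the cost of not being able to carry a state in from outside.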
4. RuntimeError: Expected hidden[0] size (x, x, x), got [x, x, x]




        input_size: The number of expected features in the input `x`
        hidden_size: The number of features in the hidden state `h`
        num_layers: Number of recurrent layers. E.g., setting ``num_layers=2``
            would mean stacking two LSTMs together to form a `stacked LSTM`,
            with the second LSTM taking in outputs of the first LSTM and
            computing the final results. Default: 1
        bias: If ``False``, then the layer does not use bias weights `b_ih` and `b_hh`.
            Default: ``True``
        batch_first: If ``True``, then the input and output tensors are provided
            as `(batch, seq, feature)` instead of `(seq, batch, feature)`.
            Note that this does not apply to hidden or cell states. See the
            Inputs/Outputs sections below for details.  Default: ``False``
        dropout: If non-zero, introduces a `Dropout` layer on the outputs of each
            LSTM layer except the last layer, with dropout probability equal to
            :attr:`dropout`. Default: 0
        bidirectional: If ``True``, becomes a bidirectional LSTM. Default: ``False``
        proj_size: If ``> 0``, will use LSTM with projections of corresponding size. Default: 0


Input data: if batch_first is set to True, the input has shape (batch_size, sequence_length, embedding_size). The default is False, in which case the input has shape (sequence_length, batch_size, embedding_size).

The initial states h_0, c_0 are different: whether batch_first is False or True, their shape is always the batch_first=False layout, that is

(num_layers * num_directions, batch_size, hidden_size)
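A quick shape check illustrates this. The sizes below are arbitrary and chosen only for the demonstration; the last part shows that a batch-first state is rejected even when batch_first=True.

```python
import torch
import torch.nn as nn

batch_size, seq_len, features, hidden_size = 3, 5, 4, 6
num_layers, num_directions = 2, 2

lstm = nn.LSTM(features, hidden_size, num_layers=num_layers,
               bidirectional=True, batch_first=True)

x = torch.randn(batch_size, seq_len, features)  # batch-first input
h_0 = torch.zeros(num_layers * num_directions, batch_size, hidden_size)
c_0 = torch.zeros_like(h_0)

output, (h_n, c_n) = lstm(x, (h_0, c_0))
print(tuple(output.shape))  # batch-first, as requested
print(tuple(h_n.shape))     # still (layers * directions, batch, hidden)

# A batch-first state is rejected with the "Expected hidden[0] size" error.
try:
    lstm(x, (h_0.permute(1, 0, 2).contiguous(), c_0.permute(1, 0, 2).contiguous()))
    batch_first_state_rejected = False
except RuntimeError:
    batch_first_state_rejected = True
```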

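Cause: once the model is wrapped in nn.DataParallel, every call to model.forward() splits the tensor arguments along the batch dimension and sends the pieces to different GPUs for parallel computation. The split dimension defaults to the first one (dim=0), but it can be changed (for example, if you keep all tensors batch-second, you can set dim=1). Only tensors are split this way, including tensors nested in tuples, lists, and dicts; arguments of other types are handed to every replica unsplit. If the first dimension of the input data is not batch_size, or the first dimension of the hidden state (h_0, c_0) passed in is not batch_size, you run into this error. At first my init_hidden_state put the batch size of h_0 and c_0 in the second dimension, which is what triggered it.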
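The effect of the split can be approximated on the CPU with torch.chunk. This is only a simulation of what DataParallel's scatter does along dim=0, with a hypothetical device count of 2 and sizes picked for illustration.

```python
import torch

num_gpus = 2  # hypothetical device count
batch_size, seq_len = 6, 50
layers_x_dirs, hidden_size = 8, 128

tokens = torch.zeros(batch_size, seq_len, dtype=torch.long)
h_batch_second = torch.zeros(layers_x_dirs, batch_size, hidden_size)
h_batch_first = torch.zeros(batch_size, layers_x_dirs, hidden_size)

# Split everything along dim 0, as the default scatter would.
token_shards = torch.chunk(tokens, num_gpus, dim=0)
bad_shards = torch.chunk(h_batch_second, num_gpus, dim=0)
good_shards = torch.chunk(h_batch_first, num_gpus, dim=0)

for t, bad, good in zip(token_shards, bad_shards, good_shards):
    # bad: the layer dimension was cut in half and the batch was not split
    # good: the batch was split; a permute restores the LSTM layout
    print(tuple(t.shape), tuple(bad.shape), tuple(good.permute(1, 0, 2).shape))
```

Each replica sees 3 samples, so its LSTM expects a state of shape (8, 3, 128). The batch-second state arrives as (4, 6, 128), while the batch-first one can be permuted to exactly the expected shape.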


def init_hidden_state(self, batch_size):
        h_0 = torch.rand(self.number_layer * self.bi_number, batch_size, self.hidden_size).to(device)
        c_0 = torch.rand(self.number_layer * self.bi_number, batch_size, self.hidden_size).to(device)
        return h_0, c_0

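Solution: pass the hidden state (h_0, c_0) into forward in batch-first layout, then swap the first and second dimensions inside forward to get the batch-second layout that the LSTM requires. Likewise, the input data must have batch_size as its first dimension (unless you set a custom split dimension, in which case adjust accordingly).


In forward I added: hidden = [x.permute(1,0,2).contiguous() for x in hidden]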

def init_hidden_state(self, batch_size):
        h_0 = torch.rand(batch_size, self.number_layer * self.bi_number,  self.hidden_size).to(device)
        c_0 = torch.rand(batch_size, self.number_layer * self.bi_number, self.hidden_size).to(device)
        return (h_0, c_0)
def forward(self, input, hidden):
        input_embeded = self.embedding(input)
        input_embeded = input_embeded.permute(1, 0, 2) # reorder to [seq_len, batch_size, embedding_dim]
        hidden = [x.permute(1,0,2).contiguous() for x in hidden]
        _, (h_n, c_n) = self.lstm(input_embeded, hidden)
        out = torch.cat((h_n[-2, :, :], h_n[-1, :, :]), -1) # [batch_size, 2 * hidden_size], i.e. 256 features
        out = self.fc(out)
        return out

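Cause: if the DataLoader is built with drop_last=False (which is the default), the last len(datasets) % batch_size samples are not dropped and form a smaller final batch. If h_0 and c_0 are still created with the full batch_size instead of that remainder, the same RuntimeError: Expected hidden[0] size (x, x, x), got [x, x, x] is raised.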


for idx, (input, target) in enumerate(train_dataloader):
    hidden = classfication_model.module.init_hidden_state(len(input))  # batch size changes with each batch
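The smaller final batch is easy to see with a toy dataset. The sketch below uses a hypothetical 10-sample TensorDataset and a batch size of 4, and compares the default with drop_last=True.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 samples with batch_size=4 leave a remainder of 10 % 4 = 2
dataset = TensorDataset(torch.arange(10).unsqueeze(1),
                        torch.zeros(10, dtype=torch.long))

keep_last = [len(x) for x, _ in DataLoader(dataset, batch_size=4)]
drop_last = [len(x) for x, _ in DataLoader(dataset, batch_size=4, drop_last=True)]

print(keep_last)  # [4, 4, 2]
print(drop_last)  # [4, 4]
```

So there are two ways out: size the hidden state from len(input) on every iteration, as I do above, or set drop_last=True and accept losing the remainder in each epoch.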


DataParallel LSTM/GRU wrong hidden batch size (8 GPUs) - PyTorch Forums

