PyTorch Notes

WHAT IS TORCH.NN REALLY?:

This tutorial is fairly comprehensive and builds things up step by step.

  • TensorDataset -> DataLoader
  • validation
  • fit: run training and validation in one function
  • defining your own layer - the class Lambda example

Some classes and functions

torch.Tensor operations

  • tensor.size(1) returns the size of the given dimension

  • For a scalar tensor, tensor.item() converts it to a native Python type

  • For a Variable, Variable.data retrieves the underlying tensor:

    # Variable
    <class 'torch.autograd.variable.Variable'>

    # tensor
    <class 'torch.FloatTensor'>

When flattening a 4D tensor along the first dimension, (8, 3, 255, 255) -> (8, 3*255*255), tensor.view() turned out to be slightly faster than tensor.flatten().
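A minimal sketch of the two equivalent ways to do this flattening (the timing difference is not reproduced here):

import torch

x = torch.randn(8, 3, 255, 255)

# both produce an (8, 3*255*255) tensor; view() requires contiguous memory
a = x.view(x.size(0), -1)
b = x.flatten(start_dim=1)

print(a.shape, b.shape)  # torch.Size([8, 195075]) torch.Size([8, 195075])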

Dataset-related

torch.utils.data.DataLoader
Docs
This class is mainly responsible for splitting data into batches; what you pass in should be all of the training or test data (including labels).
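A minimal sketch of the usual TensorDataset -> DataLoader pattern (the shapes and batch_size here are arbitrary placeholders):

import torch
from torch.utils.data import TensorDataset, DataLoader

# wrap all samples and labels, then let DataLoader split them into batches
x = torch.randn(100, 3, 32, 32)
y = torch.randint(0, 10, (100,))
loader = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)

for batch_x, batch_y in loader:
    print(batch_x.shape, batch_y.shape)  # e.g. torch.Size([16, 3, 32, 32]) torch.Size([16])
    break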

torchvision.utils.make_grid
Docs
Used to display multiple images, but it needs a helper function, typically:

import numpy as np
import matplotlib.pyplot as plt
from torchvision.utils import make_grid

def show(img):
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)), interpolation='nearest')

show(make_grid(imglist, padding=100))

make_grid seems to combine multiple images into a single image (it does not display anything); a separate function is needed to actually show it.

About the num_workers parameter:

The more data you put into the GPU memory, the less memory is available for the model.
If your model and data is small, it shouldn’t be a problem. Otherwise I would rather use the DataLoader to load and push the samples onto the GPU than to make my model smaller.

A DataLoader object can be iterated with for ... enumerate(), or alternatively:

dataloaderIter = iter(dataloader)
next(dataloaderIter)

The drawback of this approach is that once the number of iterations reaches the length of the dataloader, a StopIteration exception is raised.

With for ... enumerate(), the dataloader can be iterated over again:

epoch_num = 50
for epoch in range(epoch_num):
    for step, batch in enumerate(loader):
        ...  # training step

To build an infinitely long (i.e., cycling) iterator, a generator around enumerate can be used:

def infinite_next(data_loader):
    while True:
        for _, v in enumerate(data_loader):
            yield v


seq_batch = DataLoader(...)
seq_gts_iter = infinite_next(seq_batch)

while True:
    next(seq_gts_iter)

Convolution

2D convolution

The difference between torch.nn.Conv2d and torch.nn.functional.conv2d is that the former specifies the input/output channel sizes and creates the weights internally, while the latter takes the actual input and weight tensors explicitly. Note that both have a groups parameter.
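A minimal sketch of the two call styles (the shapes here are arbitrary placeholders):

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)

# nn.Conv2d: specify channel sizes, the weights are created (and learned) internally
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1, groups=1)
y1 = conv(x)

# F.conv2d: pass the weight tensor explicitly
weight = torch.randn(8, 3, 3, 3)  # (out_channels, in_channels // groups, kH, kW)
y2 = F.conv2d(x, weight, bias=None, padding=1, groups=1)

print(y1.shape, y2.shape)  # both torch.Size([1, 8, 32, 32])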

torch.nn.Module

If a class inherits from torch.nn.Module, self.modules() returns the modules it already contains. To check whether a module is a specific kind of layer:

for m in self.modules():
    if isinstance(m, torch.nn.Conv2d):
        pass
    if isinstance(m, torch.nn.BatchNorm2d):
        pass

An example application:

def freeze_base_cnn(cnn):
    if isinstance(cnn, torch.nn.DataParallel):
        net_modules = cnn.module.modules()
    else:
        net_modules = cnn.modules()

    for m in net_modules:
        if isinstance(m, nn.Conv2d):
            m.weight.requires_grad = False  # method 1
            m.bias.requires_grad = False
        elif isinstance(m, nn.BatchNorm2d):
            for i in m.parameters():  # same effect as method 1
                i.requires_grad = False
            m.eval()

Module.children() vs Module.modules()
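A minimal sketch of the difference (the toy network here is just an illustration):

import torch.nn as nn

net = nn.Sequential(
    nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()),
    nn.Linear(8, 2),
)

# children(): only the immediate sub-modules
print([type(c).__name__ for c in net.children()])
# ['Sequential', 'Linear']

# modules(): the module itself plus all sub-modules, recursively
print([type(m).__name__ for m in net.modules()])
# ['Sequential', 'Sequential', 'Conv2d', 'ReLU', 'Linear']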

Multi-GPU

DATA PARALLELISM
Note the explanation of dimensions in the example further down in that tutorial.

Because model parameters in PyTorch are placed on GPU 0 by default, DataParallel can essentially be seen as copying the training parameters from that GPU to the other GPUs and training on all of them simultaneously. When loading data with the DataLoader, batch_size therefore needs to be set to n times the original size, where n is the number of GPUs.

torch.nn.DataParallel

MULTI-GPU EXAMPLES:

In general, pytorch’s nn.parallel primitives can be used independently. We have implemented simple MPI-like primitives:

  • replicate: replicate a Module on multiple devices
  • scatter: distribute the input in the first-dimension
  • gather: gather and concatenate the input in the first-dimension
  • parallel_apply: apply a set of already-distributed inputs to a set of already-distributed models.

CUDA SEMANTICS

However, once a tensor is allocated, you can do operations on it irrespective of the selected device, and the results will be always placed on the same device as the tensor.

Running the following code:

model = nn.DataParallel(model, device_ids=[0])
model.feature_extract()
...

produces the following error:

AttributeError: 'DataParallel' object has no attribute 'feature_extract'

The feature_extract method should instead be called like this:

model.module.feature_extract()

Saving weights also needs to go through module; see the weight-loading errors at the bottom of this page.

Training

Learning rate

How to adjust Learning Rate

Note the first two example routines about scheduler.step().

Saving and reloading the learning rate

PyTorch - How to get learning rate during training?

With torch.optim.lr_scheduler.ExponentialLR, the learning rate should decrease as the epochs increase. However, after saving the optimizer's state_dict, reloading it, and resuming training, the learning rate jumped back to the initial (largest) value.

The problem is caused by the learning-rate update (lr_scheduler.step()): when the lr_scheduler is initialized, it always starts from initial_lr.
The lr_scheduler's state_dict can also be saved and loaded (the official tutorial does not mention this); loading the matching state_dict into the lr_scheduler solves the problem.
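A minimal sketch of saving and restoring the scheduler together with the model and optimizer (the toy model and file name are hypothetical placeholders):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

# save: include the scheduler state, not just model/optimizer
torch.save({
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'scheduler': scheduler.state_dict(),
}, 'ckpt.pth')

# resume: load all three, so the learning rate is not reset to initial_lr
ckpt = torch.load('ckpt.pth')
model.load_state_dict(ckpt['model'])
optimizer.load_state_dict(ckpt['optimizer'])
scheduler.load_state_dict(ckpt['scheduler'])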

Loss functions

binary_cross_entropy_with_logits

torch.nn.functional.binary_cross_entropy_with_logits

For the formula, see BCEWithLogitsLoss.

Reference:

How is Pytorch's binary_cross_entropy_with_logits function related to sigmoid and binary_cross_entropy
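A minimal sketch of the relationship described in that thread (the random tensors are just placeholders):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)
targets = torch.empty(4, 3).uniform_(0, 1)

# sigmoid followed by binary_cross_entropy ...
loss_a = F.binary_cross_entropy(torch.sigmoid(logits), targets)
# ... matches the fused (and numerically more stable) version
loss_b = F.binary_cross_entropy_with_logits(logits, targets)

print(torch.allclose(loss_a, loss_b, atol=1e-6))  # True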

freeze

If all network parameters are frozen and backward() is then called, an error is raised:

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

This error means that at least one trainable tensor (requires_grad = True) is needed for backward() to run.

Saving and loading

Saving & Loading Model Across Devices

When saving, the device (CPU, GPU1, GPU2, …) that the parameters were saved from is recorded.

The map_location argument of torch.load controls which device (CPU, GPU1, GPU2, …) the saved parameters are loaded onto:

# Save on GPU, Load on CPU
...
device = torch.device('cpu')
model.load_state_dict(torch.load(PATH, map_location=device))

# Save on GPU, Load on GPU
...
device = torch.device("cuda")
model.load_state_dict(torch.load(PATH))
model.to(device)
# Make sure to call input = input.to(device) on any input tensors that you feed to the model

# Save on CPU, Load on GPU
...
device = torch.device("cuda")
model.load_state_dict(torch.load(PATH, map_location="cuda:0")) # Choose whatever GPU device number you want
model.to(device)
# Make sure to call input = input.to(device) on any input tensors that you feed to the model

Examples from torch.load:

# Load all tensors onto the CPU
>>> torch.load('tensors.pt', map_location=torch.device('cpu'))

# Load all tensors onto the CPU, using a function
>>> torch.load('tensors.pt', map_location=lambda storage, loc: storage)

# Load all tensors onto GPU 1
>>> torch.load('tensors.pt', map_location=lambda storage, loc: storage.cuda(1))

# Map tensors from GPU 1 to GPU 0
>>> torch.load('tensors.pt', map_location={'cuda:1':'cuda:0'})

# Load tensor from io.BytesIO object
>>> with open('tensor.pt', 'rb') as f:
...     buffer = io.BytesIO(f.read())
>>> torch.load(buffer)

The lambda storage, loc: storage.cuda(1) form plays a similar role to the dict form {'cuda:1': 'cuda:0'}.

Note that when resuming training, the lr_scheduler also needs its state_dict saved and loaded; otherwise the learning rate resets to the initial learning rate.

RNN

LSTM

The official example; rewriting it as below makes it easier to understand.

Demo1:

import torch
import torch.nn as nn

torch.manual_seed(1)

lstm = nn.LSTM(3, 3) # Input dim is 3, output dim is 3
inputs = [torch.randn(1, 1, 3) for _ in range(5)] # make a sequence of length 5

# initialize the hidden state.
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))

for i in inputs:
    # Step through the sequence one element at a time.
    # after each step, hidden contains the hidden state.
    out, hidden = lstm(i, hidden)
    print('out1:\n', out)
    print('hidden1:\n', hidden)

Output:

out1:
tensor([[[-0.2682, 0.0304, -0.1526]]], grad_fn=<StackBackward>)
hidden1:
(tensor([[[-0.2682, 0.0304, -0.1526]]], grad_fn=<StackBackward>),
tensor([[[-1.0766, 0.0972, -0.5498]]], grad_fn=<StackBackward>))
out1:
tensor([[[-0.5370, 0.0346, -0.1958]]], grad_fn=<StackBackward>)
hidden1:
(tensor([[[-0.5370, 0.0346, -0.1958]]], grad_fn=<StackBackward>),
tensor([[[-1.1552, 0.1214, -0.2974]]], grad_fn=<StackBackward>))
out1:
tensor([[[-0.3947, 0.0391, -0.1217]]], grad_fn=<StackBackward>)
hidden1:
(tensor([[[-0.3947, 0.0391, -0.1217]]], grad_fn=<StackBackward>),
tensor([[[-1.0727, 0.1104, -0.2179]]], grad_fn=<StackBackward>))
out1:
tensor([[[-0.1854, 0.0740, -0.0979]]], grad_fn=<StackBackward>)
hidden1:
(tensor([[[-0.1854, 0.0740, -0.0979]]], grad_fn=<StackBackward>),
tensor([[[-1.0530, 0.1836, -0.1731]]], grad_fn=<StackBackward>))
out1:
tensor([[[-0.3600, 0.0893, 0.0215]]], grad_fn=<StackBackward>)
hidden1:
(tensor([[[-0.3600, 0.0893, 0.0215]]], grad_fn=<StackBackward>),
tensor([[[-1.1298, 0.4467, 0.0254]]], grad_fn=<StackBackward>))

Demo2:

import torch
import torch.nn as nn

torch.manual_seed(1)

lstm = nn.LSTM(3, 3) # Input dim is 3, output dim is 3
inputs = [torch.randn(1, 1, 3) for _ in range(5)] # make a sequence of length 5

# initialize the hidden state.
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))

inputs = torch.cat(inputs).view(len(inputs), 1, -1) # size: (seq_len, batch, input_size)

out, hidden = lstm(inputs, hidden)
print('out2:\n', out)
print('hidden2:\n', hidden)

Output:

out2:
tensor([[[-0.2682, 0.0304, -0.1526]],

[[-0.5370, 0.0346, -0.1958]],

[[-0.3947, 0.0391, -0.1217]],

[[-0.1854, 0.0740, -0.0979]],

[[-0.3600, 0.0893, 0.0215]]], grad_fn=<StackBackward>)
hidden2:
(tensor([[[-0.3600, 0.0893, 0.0215]]], grad_fn=<StackBackward>),
tensor([[[-1.1298, 0.4467, 0.0254]]], grad_fn=<StackBackward>))

The two demos are:

  1. Step through one cell at a time (feed one time step of data).
  2. Process the whole sequence at once (feed all the data).

The difference is whether the seq_len dimension of the input is 1.

From the LSTM diagram above, the two outputs out, hidden of PyTorch's LSTM are $a^{\langle t \rangle}$ and $(a^{\langle t \rangle}, c^{\langle t \rangle})$ respectively. When the whole sequence goes through a single forward pass (Demo2), only the intermediate $a^{\langle t \rangle}$ for every step and the final $(a^{\langle T \rangle}, c^{\langle T \rangle})$ are kept:

alternatively, we can do the entire sequence all at once.
the first value returned by LSTM is all of the hidden states throughout the sequence. the second is just the most recent hidden state (compare the last slice of “out” with “hidden” below, they are the same)
The reason for this is that:
“out” will give you access to all hidden states in the sequence, “hidden” will allow you to continue the sequence and backpropagate, by passing it as an argument to the lstm at a later time.

Some errors

Training error

optimizer.step()

was written as:

optimizer.step

so the weights were never updated, yet the program did not raise any error...

About in-place

The official docs say in-place operations are problematic for autograd: https://pytorch.org/docs/master/notes/autograd.html#in-place-operations-with-autograd

However, torch.nn.ReLU is often used in-place, and that does not seem to affect autograd: https://discuss.pytorch.org/t/whats-the-difference-between-nn-relu-and-nn-relu-inplace-true/948

Initializing the bias of convolutional layers to 0:

for m in self.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight.data, mode='fan_out',
                                nonlinearity='relu')
        m.bias.data.fill_(0)

Note that the last line is m.bias.data.fill_(0), not m.bias.fill_(0); the latter raises an in-place autograd error.

x.view(...) does not modify x; it must be written as x = x.view(...).
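A tiny illustration of this point:

import torch

x = torch.randn(2, 3, 4)
x.view(2, -1)      # the reshaped result is discarded, x keeps its shape
x = x.view(2, -1)  # correct: re-bind x to the reshaped tensor
print(x.shape)     # torch.Size([2, 12])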

Tensor format

Using one of PyTorch's built-in loss functions:

loss = F.binary_cross_entropy_with_logits(responses, labels, weight=weights)

raises:

File "/home/ubuntu/anaconda2/envs/pytorch/lib/python3.6/site-packages/torch/nn/functional.py", line 2074, in binary_cross_entropy_with_logits
if not (target.size() == input.size()):
TypeError: 'int' object is not callable

This happens because not all of the arguments were torch.Tensors (some were np.ndarray), so the call to tensor.size() inside the function fails (ndarray uses .shape instead). The arrays need to be converted:

labels = torch.from_numpy(labels).to(device).float()
weights = torch.from_numpy(weights).to(device).float()

torchvision.transforms.ToTensor

Note, as stated in the documentation, that it automatically normalizes the values (scaling [0, 255] down to [0.0, 1.0]).
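A minimal sketch of that scaling (the 2x2 all-white image is just a placeholder):

import numpy as np
from PIL import Image
from torchvision import transforms

img = Image.fromarray(np.full((2, 2, 3), 255, dtype=np.uint8))  # all-white uint8 image
t = transforms.ToTensor()(img)
print(t.shape, t.max())  # torch.Size([3, 2, 2]) tensor(1.)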

nan

When the loss from binary_cross_entropy_with_logits is nan, possible causes are:

  • Gradients too large:
    1. reduce the learning rate
    2. clip the gradient range (see the sketch after the reference below)
  • Input data not normalized:
    1. use a normalization layer

Reference: https://oldpan.me/archives/careful-train-loss-nan-inf
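A minimal sketch of gradient clipping between backward() and step() (the toy model and data are placeholders):

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 10)
labels = torch.randint(0, 2, (8, 1)).float()

loss = F.binary_cross_entropy_with_logits(model(x), labels)

optimizer.zero_grad()
loss.backward()
# limit the gradient norm so a single bad batch cannot blow the loss up to nan
nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()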

Weight loading errors

Saving weights on multiple GPUs, loading on a single GPU

DataParallel was used:

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)

model.to(device)

Saving and loading the state_dict both work fine in this setup.

But loading a DataParallel model's state_dict into a plain model fails.

For the analysis, see KeyError: 'unexpected key "module.encoder.embedding.weight" in state_dict'

You probably saved the model using nn.DataParallel, which stores the model in module, and now you are trying to load it without DataParallel. You can either add a nn.DataParallel temporarily in your network for loading purposes, or you can load the weights file, create a new ordered dict without the module prefix, and load it back.

Solutions

Method 1:

Wrap model in nn.DataParallel(model) using only one GPU:

model = nn.DataParallel(model, device_ids=[0])

However, the model's own methods then need to be called through model.module.

Method 2:

Modify the state_dict when loading:

# original saved file with DataParallel
state_dict = torch.load('myfile.pth.tar')
# create new OrderedDict that does not contain `module.`
from collections import OrderedDict
new_state_dict = OrderedDict()
for k, v in state_dict.items():
    name = k[7:]  # remove `module.`
    new_state_dict[name] = v
# load params
model.load_state_dict(new_state_dict)

Or modify the state_dict when saving:

if isinstance(model, torch.nn.DataParallel):
    model = model.module
torch.save(model.state_dict(), path_to_file)

When using DataParallel, use model.module.state_dict() instead of model.state_dict().

About .to

Placing variables on the GPU:

  1. torch.nn.Module.to
  2. torch.Tensor.to

Note that 1. only moves the parameters registered in the Module; it does not move tensors created by hand in __init__, for which 2. is needed.
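A minimal sketch of this pitfall (MyNet is a hypothetical toy module):

import torch
import torch.nn as nn

class MyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)                      # parameter: moved by Module.to
        self.mask = torch.ones(4)                      # plain tensor: NOT moved by Module.to
        self.register_buffer('scale', torch.ones(4))   # buffer: moved by Module.to

    def forward(self, x):
        # hand-made tensors must be moved explicitly with Tensor.to
        return self.fc(x * self.mask.to(x.device) * self.scale)

if torch.cuda.is_available():
    net = MyNet().to('cuda')
    print(net.fc.weight.device)  # cuda:0
    print(net.scale.device)      # cuda:0 (buffer follows the module)
    print(net.mask.device)       # cpu (plain attribute is left behind)

Registering such a tensor with register_buffer is the usual way to have Module.to move it automatically.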


----------over----------