torch.nn.Parameter is a subclass of torch.Tensor. When assigned as an attribute of a Module, it is automatically added to the module's parameter list and shows up in the Module.parameters() iterator, so the optimizer will update it as long as it is included in the optimized parameter list. Its arguments:
data (Tensor): parameter tensor.
requires_grad (bool, optional): if the parameter requires gradient. Default: True.
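A minimal sketch of how this registration works (the module name and shapes below are illustrative, not from the original text):

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    def __init__(self):
        super().__init__()
        # registered automatically because it is an nn.Parameter attribute
        self.weight = nn.Parameter(torch.ones(3))
        # a plain tensor attribute is NOT registered as a parameter
        self.offset = torch.zeros(3)

    def forward(self, x):
        return x * self.weight + self.offset

m = Scale()
print([name for name, _ in m.named_parameters()])  # ['weight']
```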
Some algorithms, such as Conjugate Gradient and LBFGS, need to re-evaluate the function multiple times per step, so you have to pass in a closure that clears the gradients, computes the loss, and returns it.
```python
for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    optimizer.step(closure)  # pass in a closure
```
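For example, torch.optim.LBFGS is one of the optimizers whose step() expects such a closure (the lr value below is just an illustrative choice):

```python
optimizer = torch.optim.LBFGS(model.parameters(), lr=0.1)
```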
Adjust learning rate
Use torch.optim.lr_scheduler.
```python
scheduler = ...
for epoch in range(100):
    train(...)
    evaluate(...)
    scheduler.step()
```
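As a concrete sketch, a StepLR schedule that decays the learning rate by a factor of 0.1 every 30 epochs (the optimizer, step_size, and gamma values are illustrative, and train(...)/evaluate(...) are assumed to be defined as in the skeleton above):

```python
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    train(...)        # optimizer.step() happens inside train(...)
    evaluate(...)
    scheduler.step()  # update the learning rate once per epoch
```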
Gradient clipping
There are two common strategies: clip by value, i.e. clamp every gradient element to a fixed threshold (torch.nn.utils.clip_grad_value_), and clip by norm, i.e. rescale the gradients so that their total norm does not exceed max_norm (torch.nn.utils.clip_grad_norm_). A reference implementation of clip_grad_norm_:
```python
def clip_grad_norm_(parameters, max_norm, norm_type=2):
    # keep only parameters that actually received a gradient
    parameters = list(filter(lambda p: p.grad is not None, parameters))
    max_norm = float(max_norm)
    norm_type = float(norm_type)
    if norm_type == float('inf'):
        total_norm = max(p.grad.data.abs().max() for p in parameters)
    else:
        total_norm = 0
        for p in parameters:
            param_norm = p.grad.data.norm(norm_type)
            total_norm += param_norm.item() ** norm_type
        total_norm = total_norm ** (1. / norm_type)
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        # rescale all gradients in place so the total norm equals max_norm
        for p in parameters:
            p.grad.data.mul_(clip_coef)
    return total_norm
```
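In practice you normally call the built-in torch.nn.utils.clip_grad_norm_ right after backward() and before the optimizer step; a sketch (model, optimizer, loss_fn, input, and target are assumed to exist, and max_norm=1.0 is an illustrative choice):

```python
from torch.nn.utils import clip_grad_norm_

output = model(input)
loss = loss_fn(output, target)
optimizer.zero_grad()
loss.backward()
clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale gradients in place
optimizer.step()
```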
Misc
Define layers
Layers should be set directly as attributes of a subclass of torch.nn.Module, so that model.parameters() collects them and can be passed directly to torch.optim. Otherwise, the additional parameters must be passed to the optimizer on top of model.parameters().
In other words, if layers are wrapped in a plain container such as a dict(), model.parameters() cannot find those layers' parameters, so the first argument of the optimizer has to be assembled manually (see the Python code below).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # __init__ reconstructed for completeness; the layer sizes are illustrative
        # layers stored in a plain dict are NOT registered as sub-modules
        self.d = {'f1': nn.Linear(20, 10), 'f2': nn.Linear(10, 10)}
        self.f3 = nn.Linear(10, 1)  # set as a direct attribute, registered normally

    def forward(self, x):
        x = self.d['f1'](x)
        x = self.d['f2'](x)
        x = self.f3(x)
        return x


if __name__ == '__main__':
    net = Net()
    x = torch.rand(1, 20)
    y = torch.rand(1, 1)
    optimizer = optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(10):
        y_ = net(x)
        loss = F.mse_loss(y_, y)
        optimizer.zero_grad()
        loss.backward()   # do backprop
        optimizer.step()  # does NOT optimize the layers wrapped in net.d!
        # [p for p in net.d['f1'].parameters()] never change!
```
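One way to work around this (a sketch continuing the Net above) is to hand the dict-wrapped layers' parameters to the optimizer explicitly:

```python
extra_params = list(net.d['f1'].parameters()) + list(net.d['f2'].parameters())
optimizer = optim.Adam(list(net.parameters()) + extra_params, lr=1e-3)
```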
Also, if we need to run on a GPU, we usually just call Net().to(device). But if some layers are wrapped inside a dict attribute of the module class, .to(device) does not reach them, and we have to call layer.to(device) on each of them individually.
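A sketch of what that looks like (the device name is illustrative):

```python
device = torch.device('cuda:0')
net = Net().to(device)      # moves f3, but not the layers inside net.d
for layer in net.d.values():
    layer.to(device)        # move the dict-wrapped layers by hand
```

Using torch.nn.ModuleDict instead of a plain dict avoids both problems, since it registers its entries as sub-modules.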