```python
import torch
```
Basic op
numpy to/from tensor
```python
# numpy -> tensor
```
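A minimal sketch of the two-way conversion (note that torch.from_numpy shares memory with the source array, while torch.tensor makes a copy):

```python
import numpy as np
import torch

a = np.ones((2, 3))
t = torch.from_numpy(a)   # numpy -> tensor, shares memory with a
b = t.numpy()             # tensor -> numpy, also shares memory
t2 = torch.tensor(a)      # numpy -> tensor, makes a copy
```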
contiguous
```python
x.transpose(1, 2).contiguous().view(...)
```
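view() requires contiguous memory, which transpose() breaks, hence the .contiguous() in the call above. A small sketch:

```python
import torch

x = torch.randn(2, 3, 4)
y = x.transpose(1, 2)            # shape (2, 4, 3), memory no longer contiguous
print(y.is_contiguous())         # False
# y.view(2, 12)                  # would raise a RuntimeError
z = y.contiguous().view(2, 12)   # copy into contiguous memory first, then view
z2 = y.reshape(2, 12)            # reshape() performs the copy automatically when needed
```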
Parameter vs. register_buffer
nn.Parameter
is considered a module parameter and will appear in the parameters() iterator. Gradients are computed for it during backprop.
nn.Parameter(data: Tensor, requires_grad: bool = True)
register_buffer
Adds a persistent buffer to the module. It is used to register a tensor that is part of the module's state but is not a parameter, so no gradients are computed for it.
self.register_buffer(name: str, tensor: Tensor)
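A minimal sketch contrasting the two (the module and attribute names here are illustrative):

```python
import torch
import torch.nn as nn

class Scaler(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # trainable: shows up in parameters() and receives gradients
        self.scale = nn.Parameter(torch.ones(dim))
        # persistent state: saved in state_dict(), moved by .to(device), no gradients
        self.register_buffer('running_mean', torch.zeros(dim))

    def forward(self, x):
        return (x - self.running_mean) * self.scale

m = Scaler(4)
print([name for name, _ in m.named_parameters()])  # ['scale']
print([name for name, _ in m.named_buffers()])     # ['running_mean']
```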
Multiplication
torch.einsum
Evaluates multi-linear expressions, i.e. sums of products, using the Einstein summation convention.
torch.einsum(equation, *operands) → Tensor

```python
As = torch.randn(3, 2, 5)
Bs = torch.randn(3, 5, 4)
torch.einsum('bij,bjk->bik', As, Bs)  # batch matrix multiplication
```
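As a quick sanity check (not part of the original snippet), the same contraction matches torch.bmm:

```python
import torch

As = torch.randn(3, 2, 5)
Bs = torch.randn(3, 5, 4)
out = torch.einsum('bij,bjk->bik', As, Bs)
print(out.shape)                               # torch.Size([3, 2, 4])
print(torch.allclose(out, torch.bmm(As, Bs)))  # True
```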
torch.ger
torch.ger(input, vec2, out=None) → Tensor
Outer product of input and vec2. (In recent PyTorch versions torch.ger is deprecated in favor of torch.outer.)
```python
v1 = torch.arange(1., 5.)
v2 = torch.arange(1., 4.)
torch.ger(v1, v2)
# tensor([[  1.,   2.,   3.],
#         [  2.,   4.,   6.],
#         [  3.,   6.,   9.],
#         [  4.,   8.,  12.]])
```
dimension
```python
t = torch.randn(4, 5, 6)
```
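A few dimension-related operations on a tensor like the one above (a hedged sketch, not from the original snippet):

```python
import torch

t = torch.randn(4, 5, 6)
print(t.dim())                    # 3
print(t.size())                   # torch.Size([4, 5, 6])
print(t.size(1))                  # 5
print(t.unsqueeze(0).shape)       # torch.Size([1, 4, 5, 6])
print(t.permute(2, 0, 1).shape)   # torch.Size([6, 4, 5])
```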
nn.Parameter
torch.nn.Parameter, a subclass of torch.Tensor, is automatically added to the module's parameter list when assigned as a module attribute, and therefore appears in the Module.parameters() iterator. It is optimized by the optimizer as long as it is included in the optimizer's parameter list. Its arguments:
- data (Tensor): parameter tensor.
- requires_grad (bool, optional): if the parameter requires gradient. Default: True.
Tensor
byte()
```python
t = torch.ones(2, 3)
```
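byte() casts a tensor to torch.uint8; a short sketch continuing the snippet above:

```python
import torch

t = torch.ones(2, 3)
b = t.byte()
print(b.dtype)                         # torch.uint8
mask = (torch.randn(2, 3) > 0).byte()  # commonly applied to boolean-like masks
```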
topk()
torch.topk(input, k, dim=None, largest=True, sorted=True, out=None) -> (Tensor, LongTensor)
- a namedtuple of (values, indices) is returned, where the indices are the indices of the elements in the original input tensor.
```python
x = torch.arange(1., 6.)
```
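Continuing the snippet, an illustrative call:

```python
import torch

x = torch.arange(1., 6.)            # tensor([1., 2., 3., 4., 5.])
values, indices = torch.topk(x, 3)
print(values)                       # tensor([5., 4., 3.])
print(indices)                      # tensor([4, 3, 2])
```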
Loss functions
NLLLoss
torch.nn.NLLLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean')
- The negative log likelihood loss. It is useful for training a classification problem with C classes.
- size_average, reduce: deprecated.
- reduction: one of 'none', 'mean' (default), 'sum'.
- It requires adding a LogSoftmax layer as the last layer. Combining LogSoftmax with NLLLoss is the same as using CrossEntropyLoss (see the example below).
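A small sketch illustrating the equivalence (the shapes here are illustrative):

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 10)              # batch of 4 samples, C = 10 classes
target = torch.randint(0, 10, (4,))

log_probs = nn.LogSoftmax(dim=1)(logits)
loss_nll = nn.NLLLoss()(log_probs, target)

loss_ce = nn.CrossEntropyLoss()(logits, target)  # takes raw logits directly
print(torch.allclose(loss_nll, loss_ce))         # True
```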
Optim
Per-parameter optim
Pass in an iterable of dicts, e.g. to specify per-layer learning rates:

```python
optim.SGD([
        {'params': model.base.parameters()},   # default lr
        {'params': model.classifier.parameters(), 'lr': 1e-3}
    ],
    lr=1e-2,      # default
    momentum=.9   # for all params
)
```
Optim step
Optimizer.step()
Call it once the gradients have been computed with loss.backward().
```python
for input, target in dataset:
```
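A typical completion of the loop (model, loss_fn, and optimizer are assumed to be defined elsewhere):

```python
for input, target in dataset:
    optimizer.zero_grad()           # clear gradients from the previous step
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()                 # compute gradients
    optimizer.step()                # update parameters
```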
Optim algorithms
Optimization methods in deep learning
Optimizer.step(closure)
Some algorithms, such as Conjugate Gradient and LBFGS, need to re-evaluate the function multiple times, so pass in a closure that clears the gradients, computes the loss, and returns it.
```python
for input, target in dataset:
```
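A typical shape of the closure-based loop (again assuming model, loss_fn, and optimizer are defined elsewhere):

```python
for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    optimizer.step(closure)
```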
Adjust learning rate
use torch.optim.lr_scheduler
```python
scheduler = ...
for epoch in range(100):
    train(...)
    evaluate(...)
    scheduler.step()
```
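For example, with StepLR (the optimizer and step sizes here are illustrative; model, train(...) and evaluate(...) are placeholders as in the snippet above):

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# multiply the learning rate by 0.1 every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    train(...)
    evaluate(...)
    scheduler.step()
```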
gradient clipping
- Clip by value: set a threshold and clamp each gradient element to it (torch.nn.utils.clip_grad_value_).
- Clip by norm: rescale the gradients if their total norm exceeds max_norm (torch.nn.utils.clip_grad_norm_, sketched below).
```python
def clip_grad_norm_(parameters, max_norm, norm_type=2):
```
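A simplified sketch of what this function does (not the exact library implementation): compute the total gradient norm across all parameters, and scale every gradient down if that norm exceeds max_norm.

```python
import torch

def clip_grad_norm_(parameters, max_norm, norm_type=2):
    # simplified sketch of clipping by total norm
    parameters = [p for p in parameters if p.grad is not None]
    total_norm = torch.norm(
        torch.stack([torch.norm(p.grad.detach(), norm_type) for p in parameters]),
        norm_type)
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        for p in parameters:
            p.grad.detach().mul_(clip_coef)
    return total_norm

# typical usage: clip between backward() and step()
# loss.backward()
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step()
```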
Misc
Define layers
Layers should be set directly as attributes of a subclass of torch.nn.Module, so that model.parameters() can be passed directly to an optimizer from torch.optim. Otherwise, the extra parameters have to be passed in alongside model.parameters().
- In other words, if layers are wrapped in a plain container such as dict(), model.parameters() cannot collect their parameters, so the first argument of the optimizer has to be assembled manually (see the Python code below).
```python
import torch
```
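A hedged sketch of what assembling the optimizer's parameter list manually could look like when some layers live in a plain dict (the class and attribute names are illustrative):

```python
import itertools
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.d = {'f1': nn.Linear(20, 10), 'f2': nn.Linear(10, 10)}  # NOT registered
        self.f3 = nn.Linear(10, 1)                                   # registered

net = Net()
# net.parameters() only sees f3, so the dict layers must be added by hand
params = itertools.chain(net.parameters(),
                         net.d['f1'].parameters(),
                         net.d['f2'].parameters())
optimizer = torch.optim.SGD(params, lr=1e-2)
```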
- Also, if we need to run on a GPU, we usually just call Net().to(device). But if there are layers held in a dict attribute of the module class, we have to call layer.to(device) on each of them individually.
```python
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        d = {}
        d['f1'] = nn.Linear(20, 10).to(device)
        d['f2'] = nn.Linear(10, 10).to(device)
        self.d = d
        self.f3 = nn.Linear(10, 1)
        self.loss = nn.MSELoss()
        ...

if __name__ == '__main__':
    net = Net().to(device)
    x = torch.rand(1, 20).to(device)
    y = torch.rand(1, 1).to(device)
    ...
```
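A cleaner alternative (not in the original note) is nn.ModuleDict, which registers the contained layers so that both parameters() and .to(device) see them:

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.d = nn.ModuleDict({
            'f1': nn.Linear(20, 10),
            'f2': nn.Linear(10, 10),
        })
        self.f3 = nn.Linear(10, 1)

    def forward(self, x):
        h = torch.relu(self.d['f1'](x))
        h = torch.relu(self.d['f2'](h))
        return self.f3(h)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
net = Net().to(device)                           # moves f1, f2, and f3
print(sum(p.numel() for p in net.parameters()))  # includes f1 and f2
```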
NaN
If NaN appears during training:
- If it appears within the first ~100 iterations, the learning rate may be too large. Try reducing it to 1/2~1/10 of its current value.
- When using RNNs, it may be caused by gradient explosion. Solution: add gradient clipping.
- Division by 0.
- Taking the logarithm of 0 or of a negative number, e.g. when computing entropy or cross entropy.
- In exponential computations the result can become INF/INF, e.g. in softmax. Solution: subtract the maximum value before exponentiating if possible (see the sketch below).
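A sketch of the max-subtraction trick for a numerically stable softmax (torch.softmax / torch.log_softmax already do this internally):

```python
import torch

def stable_softmax(x, dim=-1):
    # subtracting the max leaves the result mathematically unchanged,
    # but keeps exp() from overflowing to inf
    x = x - x.max(dim=dim, keepdim=True).values
    e = torch.exp(x)
    return e / e.sum(dim=dim, keepdim=True)

logits = torch.tensor([1000., 1001., 1002.])   # naive exp() would overflow to inf
print(stable_softmax(logits))                  # tensor([0.0900, 0.2447, 0.6652])
print(torch.softmax(logits, dim=-1))           # same values, no NaN
```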
Count the parameter numbers
```python
# approach 1
```
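Two common ways to count parameters (a hedged sketch; model is any nn.Module):

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 10), nn.ReLU(), nn.Linear(10, 1))

# approach 1: count every parameter element
total = sum(p.numel() for p in model.parameters())

# approach 2: count only trainable parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(total, trainable)  # 221 221
```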
Configuration error
MacOSX
import torch error
```python
>>> import torch
```