Reasoning about the relations between objects and their properties is a hallmark of intelligence. Here are some notes on relational reasoning neural networks.
Relation Network
Relation Networks (RNs)
Relation Networks (RNs)[1] adopt the functional form of a neural network for relational reasoning. RNs consider the potential relations between all object pairs.
“RNs learn to infer the existence and implications of object relations.” (Santoro et al., 2017)[1]
The general form of an RN is:
$$\mathrm{RN}(O) = f_{\phi}\Big(\color{red}{\pmb{a}}\big(g_{\theta}(o_i, o_j)\big)\Big)$$
where $\color{red}{\pmb{a}}$ is the aggregation function.
When we take it as summation, the simplest form is:
$$\mathrm{RN}(O) = f_{\phi}\Big(\sum_{i,j} g_{\theta}(o_i, o_j)\Big)$$
where
- the input is a set of objects $O = \{o_1, o_2, \dots, o_n\}$
- $f_{\phi}$ and $g_{\theta}$ are two MLPs with parameters $\phi$ and $\theta$. The same MLP $g_{\theta}$ operates on all possible pairs: $g_{\theta}$ captures the representation of pair-wise relations, and $f_{\phi}$ integrates information about all pairs.
The summation in the RN equation makes the model invariant to the order (permutation) of the object set; max or average pooling can be used instead.
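Below is a minimal PyTorch sketch of the summation form. It assumes pre-extracted object vectors, and the layer widths and two-layer MLPs are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    """RN(O) = f_phi( sum_{i,j} g_theta(o_i, o_j) ) over all ordered object pairs."""
    def __init__(self, obj_dim, hidden_dim=128, out_dim=10):
        super().__init__()
        self.g_theta = nn.Sequential(
            nn.Linear(2 * obj_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.f_phi = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim))

    def forward(self, objects):                          # objects: (batch, n, obj_dim)
        b, n, d = objects.shape
        o_i = objects.unsqueeze(2).expand(b, n, n, d)    # object i broadcast over j
        o_j = objects.unsqueeze(1).expand(b, n, n, d)    # object j broadcast over i
        pairs = torch.cat([o_i, o_j], dim=-1)            # all n*n pairs (o_i, o_j)
        relations = self.g_theta(pairs)                  # pair-wise relation features
        pooled = relations.sum(dim=(1, 2))               # permutation-invariant aggregation
        return self.f_phi(pooled)
```

For example, `RelationNetwork(obj_dim=32)(torch.randn(4, 6, 32))` scores a batch of 4 scenes with 6 objects each.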
Wild Relation Network (WReN)
The Wild Relation Network (WReN) applies the RN module multiple times, once per candidate answer, to infer the inter-panel relationships and score each candidate.[3] Afterward, all candidate scores are passed to a softmax function.
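A hedged sketch of that scoring loop, assuming panel embeddings are already computed and reusing the `RelationNetwork` sketch above with `out_dim=1`; the function and variable names are hypothetical.

```python
import torch

def wren_scores(rn, context_panels, candidates):
    """Score each candidate answer by completing the panel set and applying the RN.
    context_panels: (batch, n_context, d) embedded context panels
    candidates:     (batch, n_candidates, d) embedded candidate answers
    rn:             a relation-network module mapping a panel set to a single score
    """
    scores = []
    for k in range(candidates.size(1)):
        # append candidate k to the context panels and score the completed set
        panel_set = torch.cat([context_panels, candidates[:, k:k + 1]], dim=1)
        scores.append(rn(panel_set))                     # (batch, 1)
    scores = torch.cat(scores, dim=-1)                   # (batch, n_candidates)
    return torch.softmax(scores, dim=-1)                 # distribution over answers
```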
Visual Interaction Network (VIN)
The Visual Interaction Network (VIN) adopts ConvNets to encode images: two consecutive input frames are convolved into a state code.
Afterward, an RN is employed in its Interaction Net (IN):
- For each slot, the RN is applied to the concatenation of that slot with each other slot.
- Then a self-dynamics net is applied to the slot itself.
- Finally, all the results are summed to produce the output (a minimal sketch of this update follows the list).
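A minimal PyTorch sketch of that per-slot update, assuming the state code has already been split into slot vectors; the full VIN additionally aggregates interaction cores over several temporal offsets, which is omitted here.

```python
import torch
import torch.nn as nn

class InteractionCore(nn.Module):
    """Per-slot update: summed pairwise relation terms plus a self-dynamics term."""
    def __init__(self, slot_dim, hidden_dim=64):
        super().__init__()
        self.relation = nn.Sequential(
            nn.Linear(2 * slot_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, slot_dim))
        self.self_dynamics = nn.Sequential(
            nn.Linear(slot_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, slot_dim))

    def forward(self, slots):                            # slots: (batch, n_slots, slot_dim)
        b, n, d = slots.shape
        s_i = slots.unsqueeze(2).expand(b, n, n, d)
        s_j = slots.unsqueeze(1).expand(b, n, n, d)
        effects = self.relation(torch.cat([s_i, s_j], dim=-1))  # effect of slot j on slot i
        mask = 1.0 - torch.eye(n, device=slots.device).view(1, n, n, 1)
        pairwise = (effects * mask).sum(dim=2)           # sum effects from the *other* slots
        return pairwise + self.self_dynamics(slots)      # add the self-dynamics term
```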
Relational Memory Core (RMC)
The Relational Memory Core (RMC)[4] combines LSTMs with non-local networks (i.e., Transformer-style attention).
- Encoding new memories
Let the matrix $M$ denote the stored memories, one memory $m_i$ per row. The RMC applies multi-head dot-product attention (MHDPA) to let the memories interact with one another: $\color{green}{[M;x]}$ includes both the memories and the new observations, and the output $\tilde{M}$ has the same size as $M$ (a sketch of this step follows this section).
- Introducing recurrence via a variant LSTM
The attended memories then pass through gated, LSTM-style updates to introduce recurrence; the MLP used in this update is applied row/memory-wise, with layer normalization.[4]
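A PyTorch sketch of the memory-interaction step only, assuming a single attention block; the residual/normalization placement is a simplification, and the LSTM-style gating that makes the module recurrent is omitted.

```python
import torch
import torch.nn as nn

class RMCAttention(nn.Module):
    """Memories attend over [M; x] via multi-head dot-product attention,
    followed by a row-wise MLP with layer normalization."""
    def __init__(self, d_model, num_heads=4):
        super().__init__()
        self.mhdpa = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.row_mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model))

    def forward(self, M, x):
        # M: (batch, n_mem, d) stored memories; x: (batch, n_in, d) new observations
        mem_and_input = torch.cat([M, x], dim=1)         # [M; x]
        M_tilde, _ = self.mhdpa(M, mem_and_input, mem_and_input)
        return self.norm(M + self.row_mlp(M_tilde))      # same size as M
```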
Recurrent Memory, Attention and Composition (MAC)
The MAC recurrent cell consists of a control unit, a read unit, and a write unit.
- control unit: attends to different parts of the task question
- read unit: extracts information from the knowledge base (the image, in a VQA task)
- write unit: integrates the retrieved information into the memory state
Input:
- concatenate the last hidden states of a bi-LSTM run over the question words to form $\pmb{q}$
- convolve the image to form the knowledge base (a sketch of this input stage follows the list)
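A PyTorch sketch of that input stage; the vocabulary handling, layer sizes, and the assumption of 1024-channel pre-extracted image features are illustrative choices.

```python
import torch
import torch.nn as nn

class MACInputs(nn.Module):
    """Encode the question with a bi-LSTM and the image features with a small CNN."""
    def __init__(self, vocab_size, word_dim=300, d=512, img_channels=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.bilstm = nn.LSTM(word_dim, d // 2, bidirectional=True, batch_first=True)
        self.conv = nn.Sequential(
            nn.Conv2d(img_channels, d, kernel_size=3, padding=1), nn.ELU(),
            nn.Conv2d(d, d, kernel_size=3, padding=1), nn.ELU())

    def forward(self, question_tokens, image_feats):
        # question_tokens: (batch, seq); image_feats: (batch, img_channels, H, W)
        cw, (h_n, _) = self.bilstm(self.embed(question_tokens))  # contextual word states
        q = torch.cat([h_n[0], h_n[1]], dim=-1)                  # concat last fwd/bwd states
        kb = self.conv(image_feats)                              # knowledge base (batch, d, H, W)
        return cw, q, kb
```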
Control unit
Given the contextual question words $cw_s$, the question representation $q$, and the previous control state $c_{i-1}$:
- Concatenate $q$ and $c_{i-1}$ and feed the result into a feed-forward layer.
- Measure the similarity between the result and each question word $cw_s$; then use a softmax layer to normalize the weights, acquiring an attention distribution over the question words.
- Take the weighted average of the question context words to get the current control state $c_i$ (a sketch of this unit follows the list).
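A PyTorch sketch of the control unit under the notation above; the single shared linear layers are illustrative assumptions (step-specific projections used in the paper are omitted).

```python
import torch
import torch.nn as nn

class ControlUnit(nn.Module):
    """Attend over the contextual question words to produce the next control state."""
    def __init__(self, d):
        super().__init__()
        self.fuse = nn.Linear(2 * d, d)                  # feed-forward over [q; c_{i-1}]
        self.score = nn.Linear(d, 1)                     # similarity with each question word

    def forward(self, cw, q, c_prev):
        # cw: (batch, seq, d) contextual words; q, c_prev: (batch, d)
        cq = self.fuse(torch.cat([q, c_prev], dim=-1))
        logits = self.score(cq.unsqueeze(1) * cw).squeeze(-1)
        cv = torch.softmax(logits, dim=-1)               # attention over question words
        return (cv.unsqueeze(-1) * cw).sum(dim=1)        # new control state c_i
```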
Read unit
- Interact each knowledge-base element $k_{h,w}$ with the previous memory $m_{i-1}$ to obtain an interaction term.
- Concatenate the interaction term with $k_{h,w}$ and feed the result into a dense layer.
- Guided by the control state $c_i$, compute an attention distribution over the knowledge base and finally take the weighted average to obtain the retrieved information $r_i$ (a sketch of this unit follows the list).
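A PyTorch sketch of the read unit, assuming the knowledge base has been flattened to a (batch, H·W, d) tensor; the specific projections are illustrative.

```python
import torch
import torch.nn as nn

class ReadUnit(nn.Module):
    """Retrieve information from the knowledge base, guided by the control state."""
    def __init__(self, d):
        super().__init__()
        self.mem_proj = nn.Linear(d, d)
        self.kb_proj = nn.Linear(d, d)
        self.combine = nn.Linear(2 * d, d)               # dense layer over [interaction; kb]
        self.score = nn.Linear(d, 1)

    def forward(self, kb, m_prev, c):
        # kb: (batch, HW, d) flattened knowledge base; m_prev, c: (batch, d)
        interaction = self.mem_proj(m_prev).unsqueeze(1) * self.kb_proj(kb)
        fused = self.combine(torch.cat([interaction, kb], dim=-1))
        logits = self.score(c.unsqueeze(1) * fused).squeeze(-1)  # guided by the control state
        rv = torch.softmax(logits, dim=-1)               # attention over the knowledge base
        return (rv.unsqueeze(-1) * kb).sum(dim=1)        # retrieved information r_i
```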
Write unit
Integrate the retrieved information $r_i$ with the previous memory state $m_{i-1}$ to produce the new memory $m_i$.
Output unit
Concatenate $q$ and the final memory state, then pass the result through a 2-layer feed-forward network followed by a softmax function.
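A PyTorch sketch of that output classifier; `num_answers` and the hidden width are illustrative.

```python
import torch
import torch.nn as nn

class OutputUnit(nn.Module):
    """Concatenate q with the final memory state and classify with a 2-layer network."""
    def __init__(self, d, num_answers):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * d, d), nn.ELU(),
            nn.Linear(d, num_answers))

    def forward(self, q, m_final):
        logits = self.classifier(torch.cat([q, m_final], dim=-1))
        return torch.softmax(logits, dim=-1)             # distribution over answers
```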
References
- 1. Santoro, A., Raposo, D., Barrett, D. G., Malinowski, M., Pascanu, R., Battaglia, P., & Lillicrap, T. (2017). A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems (pp. 4967-4976).
- 2. Raposo, D., Santoro, A., Barrett, D., Pascanu, R., Lillicrap, T., & Battaglia, P. (2017). Discovering objects and their relations from entangled scene representations. arXiv preprint arXiv:1702.05068.
- 3. Barrett, D. G., Hill, F., Santoro, A., Morcos, A. S., & Lillicrap, T. (2018). Measuring abstract reasoning in neural networks. arXiv preprint arXiv:1807.04225.
- 4. Santoro, A., Faulkner, R., Raposo, D., Rae, J., Chrzanowski, M., Weber, T., ... & Lillicrap, T. (2018). Relational recurrent neural networks. In Advances in Neural Information Processing Systems (pp. 7299-7310).
- 5. Hudson, D. A., & Manning, C. D. (2018). Compositional attention networks for machine reasoning. arXiv preprint arXiv:1803.03067.