DenseCRF, Correlation Filter and Their End-to-End Training

In this post we will review two classical techniques in computer vision:
  • conditional random fields in semantic segmentation;
  • correlation filters in object tracking.
We will also review some recent advances in which these techniques are integrated into a convolutional neural network model. As we shall see, a network need NOT be deep to achieve state-of-the-art performance, as is the case in object tracking.

As is the trend today, one would like to have a universal model for different computer vision tasks. Neural network models seem to have the potential to be such a model, although understanding them remains challenging.



1. Basics of CRF in semantic segmentation


Figure taken from [1].


Let $ {{\bf I}}$ be an image with $ {N}$ pixels. We want to segment it into semantic classes from a label set $ {\mathcal{L}=\{l_{1},\cdots,l_{k}\}}$. That is, each pixel $ {p_{j}}$ is assigned a label $ {X_{j}\in\mathcal{L}}$. The entire labeling of the image $ {{\bf I}}$ is denoted $ {{\bf X}=(X_{j})_{j}}$. The problem is to find a good way to assign $ {{\bf X}}$ to $ {{\bf I}}$.

From a statistical point of view, $ {{\bf X}}$ can be modeled as a conditional random field (CRF) defined on the image pixels, where the conditioning is on the image $ {{\bf I}}$. In the paper of Krähenbühl & Koltun [1] it is defined by a Gibbs distribution
$ \displaystyle P({\bf X}|{\bf I})=\frac{1}{Z({\bf I})}\exp(-E({\bf X},{\bf I})) $
where
  • $ {E({\bf X},{\bf I})=\sum_{c\in\mathcal{C}_{\mathcal{G}}}\phi_{c}({\bf X}\,|\,{\bf I})}$ is called the Gibbs energy.
  • $ {\mathcal{G}=(V,E)}$ is a fully connected graph, whose vertex set $ {V}$ is the set of pixels of $ {{\bf I}}$. In other words, for every two distinct pixels there is an edge connecting them.
  • $ {\mathcal{C}_{\mathcal{G}}}$ is a subset of $ {V\cup E}$, the elements of which are called cliques.
  • $ {\phi_{c}}$ is called the potential/cost function associated to the clique $ {c}$.
  • $ {Z({\bf I})}$ is a normalization constant, $ {Z({\bf I})=\sum_{x\in\mathcal{L}^{N}}\exp(-E(x,{\bf I}))}$.
Once the probability distribution is set, the optimal labeling of the image $ {{\bf I}}$ is obtained by MAP estimation
$ \displaystyle x^{*}=\arg\max_{x\in\mathcal{L}^{N}}P(x|{\bf I}) $

In the case $ {\mathcal{C}_{\mathcal{G}}=V\cup E}$, the model is called a fully connected CRF, or dense CRF. And
  • if $ {c=u\in V}$, $ {\phi_{u}}$ is called a unary potential;
  • if $ {c=p\in E}$, $ {\phi_{p}}$ is called an edge/pairwise potential.
In their model, the unary potentials are given by an extended version of TextonBoost. Later work replaces them with the outputs of a CNN.

The edge potentials have the form
$ \displaystyle \begin{array}{rcl} & & \phi_{p}(x_{i},x_{j}\,|\,{\bf I})\\ & = & \frac{1}{2}\mu(x_{i},x_{j})\left(w^{(1)}\exp\left(-\frac{|p_{i}-p_{j}|^{2}}{2\theta_{\alpha}^{2}}-\frac{|I_{i}-I_{j}|^{2}}{2\theta_{\beta}^{2}}\right)+w^{(2)}\exp\left(-\frac{|p_{i}-p_{j}|^{2}}{2\theta_{\gamma}^{2}}\right)\right)\\ & =: & \frac{1}{2}\mu(x_{i},x_{j})\sum_{m=1}^{2}w^{(m)}k^{(m)}(f_{i},f_{j}) \end{array} $
where
  • $ {x_{i}}$ is the label given to the pixel $ {p_{i}}$;
  • $ {I_{i}}$ is the color vector of the pixel $ {p_{i}}$;
  • $ {\mu(\cdot,\cdot)}$ is a symmetric label compatibility function; the weights $ {w^{(1)}}$, $ {w^{(2)}}$ and the bandwidths $ {\theta_{\alpha},\theta_{\beta},\theta_{\gamma}}$ are learnable parameters; $ {f_{i}}$ is the feature vector, which in this case consists of $ {p_{i}}$ and $ {I_{i}}$.
Commonly $ {\mu(\cdot,\cdot)}$ is chosen to be the Potts model, which is $ {0}$ when the pixels $ {p_{i}}$ and $ {p_{j}}$ have the same label and $ {1}$ otherwise. It is also possible to make it learnable and non-symmetric. This model thus penalises pairs of pixels that are assigned different labels but have similar features.
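To make these formulas concrete, here is a brute-force numpy sketch (illustrative only; all function and variable names are our own) that evaluates the Gibbs energy of a labeling on a small image, using the two Gaussian kernels and the Potts compatibility above:

```python
import numpy as np

def gibbs_energy(labels, img, unary, w1=1.0, w2=1.0,
                 theta_a=10.0, theta_b=10.0, theta_g=3.0):
    """Brute-force O(N^2) Gibbs energy for a small image.

    labels: (H, W) integer label map X
    img:    (H, W, 3) color image I
    unary:  (H, W, K) unary costs phi_u(x_i | I)
    """
    H, W = labels.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)  # p_i
    col = img.reshape(-1, 3).astype(float)                          # I_i
    lab = labels.ravel()

    # Unary term: sum_i phi_u(x_i | I)
    E = unary.reshape(-1, unary.shape[-1])[np.arange(lab.size), lab].sum()

    # Pairwise term: appearance kernel + smoothness kernel, Potts mu
    dp = ((pos[:, None] - pos[None]) ** 2).sum(-1)   # |p_i - p_j|^2
    dc = ((col[:, None] - col[None]) ** 2).sum(-1)   # |I_i - I_j|^2
    k = (w1 * np.exp(-dp / (2 * theta_a**2) - dc / (2 * theta_b**2))
         + w2 * np.exp(-dp / (2 * theta_g**2)))
    mu = (lab[:, None] != lab[None]).astype(float)   # Potts: 1 iff labels differ
    # The 1/2 in phi_p compensates for counting each pair (i, j) twice here
    E += 0.5 * (mu * k).sum()
    return E
```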


Figures taken from [1].


However, exact inference under the distribution $ {P}$ is computationally infeasible: the normalization constant alone sums over all $ {|\mathcal{L}|^{N}}$ labelings.

1.1. Mean field approximation

Figure taken from [1].

The mean field approximation computes a distribution $ {Q}$ of the form $ {Q({\bf X})=\prod_{i}Q_{i}(X_{i})}$ that minimises the KL-divergence
$ \displaystyle \begin{array}{rcl} Div(Q\|P) & = & \sum_{x\in\mathcal{L}^{N}}Q(x)\log\left(\frac{Q(x)}{P(x)}\right)\\ & = & -\mathbb{E}_{{\bf U}\sim Q}\left[\log P({\bf U})\right]+\mathbb{E}_{{\bf U}\sim Q}\left[\log Q({\bf U})\right]\\ & = & \mathbb{E}_{{\bf U}\sim Q}\left[E({\bf U},{\bf I})\right]+\log Z({\bf I})+\sum_{i=1}^{N}\mathbb{E}_{U_{i}\sim Q_{i}}\left[\log Q_{i}(U_{i})\right]. \end{array} $
To minimize the above expression, the mean field approximation takes the update
$ \displaystyle \begin{array}{rcl} Q_{i}(x_{i}) & = & \frac{1}{Z_{i}}\exp\left(-\phi_{u}(x_{i}\,|\,{\bf I})-2\sum_{j\neq i}\mathbb{E}_{U_{j}\sim Q_{j}}\left[\phi_{p}(x_{i},U_{j}\,|\,{\bf I})\right]\right) \end{array} $
where
$ \displaystyle \begin{array}{rcl} \mathbb{E}_{U_{j}\sim Q_{j}}\left[\phi_{p}(x_{i},U_{j}\,|\,{\bf I})\right] & = & \sum_{l\in\mathcal{L}}Q_{j}(l)\,\phi_{p}(x_{i},l\,|\,{\bf I}). \end{array} $

This update can be broken into smaller substeps (a toy numpy sketch follows the list):
  • Message passing step: $ {\tilde{Q}_{i}^{(m)}(l)\leftarrow\sum_{j\neq i}k^{(m)}(f_{i},f_{j})Q_{j}(l)}$, $ {m=1,2}$.
  • Filter weighting: $ {\bar{Q}_{i}(l)=\sum_{m=1}^{2}w^{(m)}\tilde{Q}_{i}^{(m)}(l)}$.
  • Compatibility transform: $ {\hat{Q}_{i}(x_{i})\leftarrow\sum_{l\in\mathcal{L}}\mu(x_{i},l)\bar{Q}_{i}(l)}$.
  • Local update: $ {Q_{i}(x_{i})\leftarrow\exp\left(-\phi_{u}(x_{i}\,|\,{\bf I})-\hat{Q}_{i}(x_{i})\right)}$.
  • Normalization: $ {Q_{i}(x_{i})\leftarrow Q_{i}(x_{i})/\sum_{l\in\mathcal{L}}Q_{i}(l)}$
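Here is the iteration written out literally in numpy (a toy O(N²) sketch with our own naming; the kernel matrices are assumed precomputed, which is exactly what real implementations avoid):

```python
import numpy as np

def mean_field_step(Q, unary, k1, k2, w1, w2, mu):
    """One mean-field update, following the five substeps literally.

    Q:      (N, K) current marginals Q_i(l)
    unary:  (N, K) unary costs phi_u
    k1, k2: (N, N) kernel matrices k^(m)(f_i, f_j) with zero diagonal (j != i)
    mu:     (K, K) label compatibility matrix
    """
    # 1. Message passing: Q~_i^(m)(l) = sum_{j != i} k^(m)(f_i, f_j) Q_j(l)
    Qt1, Qt2 = k1 @ Q, k2 @ Q
    # 2. Filter weighting: Qbar = sum_m w^(m) Q~^(m)
    Qbar = w1 * Qt1 + w2 * Qt2
    # 3. Compatibility transform: Qhat_i(x) = sum_l mu(x, l) Qbar_i(l)
    Qhat = Qbar @ mu.T
    # 4. Local update: Q_i(x) = exp(-phi_u(x) - Qhat_i(x))
    Q = np.exp(-unary - Qhat)
    # 5. Normalization over the label set
    return Q / Q.sum(axis=1, keepdims=True)
```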
The computational cost is concentrated in the message passing step. One of the main contributions of the paper is the observation that this step can be computed efficiently using high-dimensional filtering techniques.
More precisely,
$ \displaystyle \begin{array}{rcl} \tilde{Q}_{i}^{(m)}(l) & = & G_{\Theta^{(m)}}*Q\,(l)-G_{\Theta^{(m)}}(0)\,Q_{i}(l)\\ & = & G_{\Theta^{(m)}}*Q\,(l)-Q_{i}(l), \end{array} $
where
$ \displaystyle \begin{array}{rcl} G_{\Theta^{(1)}}(f_{i}-f_{j}) & = & w^{(1)}\exp\left(-\frac{|p_{i}-p_{j}|^{2}}{2\theta_{\alpha}^{2}}-\frac{|I_{i}-I_{j}|^{2}}{2\theta_{\beta}^{2}}\right)\\ G_{\Theta^{(2)}}(f_{i}-f_{j}) & = & w^{(2)}\exp\left(-\frac{|p_{i}-p_{j}|^{2}}{2\theta_{\gamma}^{2}}\right) \end{array} $
There are very efficient ways to perform high-dimensional filtering (e.g. the permutohedral lattice used in [1]), which will not be our focus here.
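Note that for the purely spatial kernel $ {k^{(2)}}$, message passing is literally a Gaussian blur of each marginal map; a small illustration using scipy (the bilateral kernel $ {k^{(1)}}$ has no such shortcut and needs the high-dimensional filtering mentioned above):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def spatial_message_passing(Q, theta_g):
    """Q: (H, W, K) marginals. Computes G * Q(l) - Q_i(l) per label channel,
    following the formula above (kernel normalization constants glossed over)."""
    out = np.empty_like(Q)
    for l in range(Q.shape[-1]):
        out[..., l] = gaussian_filter(Q[..., l], sigma=theta_g) - Q[..., l]
    return out
```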

Higher order CRF. It is possible to include more terms in the Gibbs energy, such as energy terms based on superpixels, object detections, etc. An end-to-end training approach with CNNs is proposed by Arnab et al. [4]. Below the fold we shall look at how end-to-end training is achieved.


2. End-to-end network training with a CRF module

The architecture of a network for semantic segmentation is often derived from popular image classification nets, such as AlexNet, VGG and ResNet. The current approaches involving CRFs can be divided into two categories:
  • The network and the CRF module are trained separately, and the CRF is used as post-processing. Examples include DeepLab.
Figure taken from [3]

  • The CRF module is explicitly contained as part of the network, and its parameters are trained jointly with the rest of the network. Examples include CRF-as-RNN.
Figure taken from [4]

Note that in some sense the CRF module has not been fully integrated into the network: the approaches above always train a network that outputs a good unary potential first, and only then is the CRF module included and trained, either jointly or separately.

2.1. Some empirical fine-feature extraction techniques with CNNs

These techniques are basically extensions of and modifications to the ``fully convolutional network'' of J. Long, E. Shelhamer, and T. Darrell [2], which converts a classification CNN by replacing the fully connected layers with several upsampling layers with learnable filters. The outputs of this approach suffer from blobby, noisy regions.
Figure taken from [2]


To produce a good unary potential, the network must be able to find the structure of the image at multiple scales, and express it (as a score map) at a certain level of detail. There are two basic challenges that one must face:
  • Objects of the same class can appear at multiple scales, within and across images;
  • The many max-pooling and downsampling stages destroy the fine details of the output once it is upsampled back to the input resolution; meanwhile, there are memory limits on storing large filters.
Therefore, some common features of the current approaches include:
  • The max-pooling and downsampling operations are removed in the last few layers.
  • Apply spatial pyramid pooling at several grid scales, or several parallel atrous (sometimes also called dilated) convolutions with different rates: R-CNN, PSPNet, DeepLabv2.
  • Enlarge the field of view of the kernels by using special kernels, either separable kernels or filter rarefaction (as in atrous convolution), which avoids memory overflow and also makes training easier: Large-kernel-matters, DeepLab. Note that the CRF can also be regarded as a filter whose receptive field spans the whole image.
  • Use an auto-encoder structure, possibly with skip connections: FCN, DeepLabv3, StackedDNN.
We now turn to some of the above techniques in more detail.

2.1.1. Upsampling layers and auto-encoder structures.


Figure taken from [2]

Fully connected layers in a classification convnet can be thought of as convolutions with kernels that cover the entire input region (with a correspondingly large stride). It is therefore easy to replace the last few layers with conventional convolutions, resulting in a spatially two-dimensional output.
Figure taken from [2]

The network so far has very coarse output. For example, a 500$ {\times}$500 input image will be transformed into $ {N}$ output images of size $ {10\times10}$, where $ {N}$ is the number of classes. Each pixel intensity in the $ {k}$-th $ {10\times10}$ output image is the probability of that pixel belonging to the $ {k}$-th class. To obtain a pixel-wise labeling of the input image, one needs to upsample this output to the input size.
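As a concrete illustration, here is a hedged PyTorch sketch (toy sizes of our own choosing) showing that a fully connected classifier head is equivalent to a convolution whose kernel covers the whole feature map, and that the resulting convolution then slides over larger inputs to produce a coarse score map:

```python
import torch
import torch.nn as nn

# A classifier head: a fully connected layer over a flattened 7x7x512 feature map
fc = nn.Linear(512 * 7 * 7, 21)            # 21 classes, e.g. PASCAL VOC

# The equivalent convolution: the kernel covers the whole 7x7 input region
conv = nn.Conv2d(512, 21, kernel_size=7)
conv.weight.data = fc.weight.data.view(21, 512, 7, 7)
conv.bias.data = fc.bias.data

x = torch.randn(1, 512, 7, 7)
assert torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-3)

# On a larger input the conv slides, yielding a spatially 2D score map
scores = conv(torch.randn(1, 512, 16, 16))  # -> (1, 21, 10, 10)
```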

A trivial way is to use e.g. bilinear interpolation. A cleverer way is to use the transposed convolution, also known as backwards convolution or deconvolution. Upsampling can then be performed in-network, for end-to-end learning by backpropagation from the pixelwise loss.
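In practice (e.g. in FCN [2]) the transposed convolution is initialized with bilinear interpolation weights and then fine-tuned; a sketch, assuming the usual bilinear-kernel construction:

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels, k):
    """Bilinear upsampling weights of size k x k, one per channel."""
    factor = (k + 1) // 2
    center = factor - 1 if k % 2 == 1 else factor - 0.5
    og = torch.arange(k, dtype=torch.float32)
    filt = 1 - torch.abs(og - center) / factor        # 1D triangle filter
    kernel = filt[:, None] * filt[None, :]            # separable 2D kernel
    weight = torch.zeros(channels, channels, k, k)
    for c in range(channels):                         # channel-wise upsampling
        weight[c, c] = kernel
    return weight

K = 21  # number of classes
# Upsample the score map by a factor of 8 (stride 8, kernel 16)
up = nn.ConvTranspose2d(K, K, kernel_size=16, stride=8, padding=4, bias=False)
up.weight.data = bilinear_kernel(K, 16)  # start as bilinear interpolation, then learn
```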

This way of upsampling can be seen to be the same as the one used in auto-encoder architectures.

2.1.2. Multi-scale feature extraction, atrous convolution and the spatial pyramid pooling module.

In [2] a preliminary version of multi-scale feature extraction is proposed. The basic idea is to use skip connections.
Figures taken from [2]


Atrous convolution originated in the efficient computation of the undecimated wavelet transform, the ``algorithme à trous''. It is also known as dilated convolution. Essentially, it uses an enlarged filter when doing convolution; the filter is enlarged by inserting zeros in appropriate places.
Figure taken from [3]

When the filter is enlarged this way, it has a larger receptive field. Equivalently, it is the same filter applied to an appropriately downsampled input image, but inserting zeros is easier to implement.
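This zero-insertion view is easy to verify; a quick PyTorch check (toy shapes of our own choosing):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 16, 16)
k = torch.randn(1, 1, 3, 3)

# Dilated convolution with rate 2 ...
out_dilated = F.conv2d(x, k, dilation=2)

# ... equals ordinary convolution with a 5x5 kernel obtained by inserting zeros
k_holes = torch.zeros(1, 1, 5, 5)
k_holes[..., ::2, ::2] = k
out_holes = F.conv2d(x, k_holes)
assert torch.allclose(out_dilated, out_holes, atol=1e-6)
```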

To capture the multi-scale features of the image, atrous convolution can be used: zeros are inserted into a filter at different rates, and these enlarged filters are applied in parallel to the output of the previous layer (so there are separate branches in the network to accomplish this).
Figure taken from [3]

The results from different branches are then upsampled and fused as the final output. This is essentially a spatial pyramid pooling method.
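A minimal sketch of such a module, in the spirit of DeepLab's atrous spatial pyramid pooling (a simplification of ours, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class AtrousPyramid(nn.Module):
    """Parallel 3x3 atrous convolutions at several rates, fused by summation."""
    def __init__(self, in_ch, num_classes, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, num_classes, kernel_size=3,
                      padding=r, dilation=r)   # padding = rate keeps the size
            for r in rates)

    def forward(self, x):
        return sum(b(x) for b in self.branches)

features = torch.randn(1, 512, 64, 64)      # backbone output
scores = AtrousPyramid(512, 21)(features)   # -> (1, 21, 64, 64)
```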

Figure taken from [3]


2.2. Differentiable programming with a CRF module

Now we come back to the mean field approximation iteration outlined in Section 1. The main contribution of the paper CRF-as-RNN is to note that these steps can be implemented within convnets and trained end-to-end. Unlike conventional convolution, where the filters are static after training, in the CRF the filter coefficients depend on the input. We now look at each substep more closely.
Figure taken from [4]

Message passing step. The following operation is performed
$ \displaystyle \tilde{Q}_{i}^{(m)}(l)\leftarrow\sum_{j\neq i}k^{(m)}(f_{i},f_{j})Q_{j}(l) $
which, as explained before, can be seen as high-dimensional filtering. In matrix notation, it can be written as
$ \displaystyle \tilde{Q}^{(m)}(l)=K(\Theta^{(m)})Q(l). $
The differential of this equation with respect to the parameters is
$ \displaystyle d\tilde{Q}_{i}^{(m)}(l)=Q(l)^{T}dK_{i}^{T}(\Theta^{(m)}),\quad i=1,\dots,N $
where we think of $ {d\tilde{Q}_{i}^{(m)}(l)}$ as a row vector. The backpropagation algorithm, utilizing the computational graph of the permutohedral lattice algorithm, computes these differentials $ {d\tilde{Q}^{(m)}(l)}$ implicitly.

Filter weighting.
$ \displaystyle \bar{Q}_{i}(l)=\sum_{m=1}^{2}w^{(m)}\tilde{Q}_{i}^{(m)}(l). $
For each label $ {l}$, we can think of this as a layer whose input has two channels, which are convolved with $ {1\times1}$ filters and summed together. It is also possible to increase the number of learnable parameters by giving different labels different weights.

Compatibility transform.
$ \displaystyle \hat{Q}_{i}(x_{i})\leftarrow\sum_{l\in\mathcal{L}}\mu(x_{i},l)\bar{Q}_{i}(l). $
We can think of this as a layer with $ {k}$ input channels, where each channel represents a label. Each channel is convolved with a $ {1\times1}$ filter and the results are summed together. It is also possible to increase the number of learnable parameters by giving different pairs $ {(x_{i},l)}$ different compatibility values.

Local update and normalization. These are standard operations which do not involve parameters.

Whole network training of CRF-as-RNN. The unary potential part is trained first. The CRF module then performs a fixed number of mean field iterations, which resembles a recurrent neural network structure; backpropagation through time is used to train this module.
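Putting the substeps together, here is a hedged PyTorch sketch of the unrolled mean-field loop (a simplification of ours: the message passing is stubbed out with a fixed blur instead of the input-dependent bilateral filtering used in the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanFieldCRF(nn.Module):
    """T unrolled mean-field iterations with learnable weighting/compatibility."""
    def __init__(self, num_classes, T=5):
        super().__init__()
        self.T = T
        # Filter weighting and compatibility transform as 1x1 convolutions
        self.weighting = nn.Conv2d(num_classes, num_classes, 1, groups=num_classes)
        self.compat = nn.Conv2d(num_classes, num_classes, 1, bias=False)

    def message_passing(self, Q):
        # Stub: a fixed 5x5 blur standing in for bilateral + spatial filtering
        k = torch.ones(Q.shape[1], 1, 5, 5, device=Q.device) / 25.0
        return F.conv2d(Q, k, padding=2, groups=Q.shape[1]) - Q

    def forward(self, unary):
        # unary: (B, K, H, W) negated unary potentials, e.g. CNN logits
        Q = F.softmax(unary, dim=1)
        for _ in range(self.T):
            Qt = self.message_passing(Q)        # message passing
            Qbar = self.weighting(Qt)           # filter weighting (per-label scale)
            Qhat = self.compat(Qbar)            # compatibility transform
            Q = F.softmax(unary - Qhat, dim=1)  # local update + normalization
        return Q
```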

3. Another example of end-to-end training: siamese networks with correlation filters in object tracking

The basic idea of a siamese network is that it transforms the two input images in such a way that the similarity between patches becomes more apparent.
Figure taken from [5]

Here we consider the fully convolutional siamese network proposed by Bertinetto, Valmadre et al. [5]. Let us denote the feature map by
$ \displaystyle f_{\theta}:(x',z')\mapsto(f_{\theta}(x'),f_{\theta}(z')) $
where $ {x'}$ is the reference object image, and $ {z'}$ is the image containing the objects to be detected. Cross correlation is used to find the position of the most similar patch:
$ \displaystyle g_{\theta}(f_{\theta}(x'),f_{\theta}(z'))=f_{\theta}(x')*f_{\theta}(z')+b. $
Here, $ {g_{\theta}}$ effectively outputs a score map that tells where the object likely lies in $ {z'}$.
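The cross correlation itself is a convolution whose kernel is the exemplar's feature map; a minimal PyTorch sketch (toy shapes of our own choosing):

```python
import torch
import torch.nn.functional as F

def score_map(fx, fz, b=0.0):
    """Cross-correlate exemplar features fx with search features fz.

    fx: (C, h, w)  features of the reference object image x'
    fz: (C, H, W)  features of the search image z', with H >= h, W >= w
    Returns an (H - h + 1, W - w + 1) map of similarity scores.
    """
    # F.conv2d computes cross correlation, which is exactly what we want here
    return F.conv2d(fz.unsqueeze(0), fx.unsqueeze(0)).squeeze() + b

fx = torch.randn(256, 6, 6)      # f_theta(x')
fz = torch.randn(256, 22, 22)    # f_theta(z')
g = score_map(fx, fz)            # -> (17, 17); the peak marks the likely location
```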
An interesting extension of this work is to include a correlation filter module and train the whole pipeline end-to-end, which enables a much more lightweight network and fast processing.
Figure taken from [6]

Let us describe the correlation filter in some detail below.

3.1. Correlation filters

Roughly speaking, this technique aims to design a filter which, when applied to the input, produces a response as close as possible to a desired one. In practice, the desired response is fixed to be a Gaussian peak. In a sense, the correlation filter is a linear classifier.
Figure taken from [7]

One simple formulation of the correlation filter suitable for online object tracking is the following
$ \displaystyle w^{*}=\arg\min_{w}\frac{1}{N}\|w*x-y\|^{2}+\frac{\lambda}{2}\|w\|^{2} $
where $ {w^{*}}$ is the desired correlation filter, $ {x}$ is the input image, $ {N}$ is its size, and $ {y}$ is the desired response.

This type of optimization problem is known as ridge regression, and in this case a simple closed-form solution can be obtained. We will stress here the computational graph of this solution, through which the gradient is backpropagated.
Figure taken from [6]

It involves two additional variables, $ {k}$ and $ {\alpha}$ (a Lagrange multiplier). Together with $ {x,y,w}$, they satisfy the system of equations
$ \displaystyle \begin{cases} k=\frac{1}{N}(x*x)+\lambda\delta\\ k*\alpha=\frac{1}{N}y\\ w=\alpha*x \end{cases}. $
With the FFT, the above computations become extremely efficient in the frequency domain (here $ {*}$ denotes circular cross correlation, which introduces complex conjugates, written $ {\hat{x}^{*}}$, under the Fourier transform):
$ \displaystyle \begin{cases} \hat{k}=\frac{1}{N}(\hat{x}^{*}\odot\hat{x})+\lambda\mathbb{1}\\ \hat{\alpha}=\frac{1}{N}\hat{k}^{-1}\odot\hat{y}\\ \hat{w}=\hat{\alpha}^{*}\odot\hat{x} \end{cases}, $
where $ {\odot}$ denotes element-wise multiplication. The following backpropagation rules can then easily be derived:
$ \displaystyle \begin{cases} \widehat{\nabla_{\alpha}\ell}=\hat{x}\odot\overline{\widehat{\nabla_{w}\ell}}\\ \widehat{\nabla_{y}\ell}=\frac{1}{N}\overline{\hat{k}^{-1}}\odot\widehat{\nabla_{\alpha}\ell}\\ \widehat{\nabla_{k}\ell}=-\overline{\hat{k}^{-1}}\odot\overline{\widehat{\alpha}}\odot\widehat{\nabla_{\alpha}\ell}\\ \widehat{\nabla_{x}\ell}=\widehat{\alpha}\odot\widehat{\nabla_{w}\ell}+\frac{2}{N}\hat{x}\odot\text{Re}(\widehat{\nabla_{k}\ell}) \end{cases}. $
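The forward pass of this solver is just a few FFTs; a single-channel numpy sketch of the closed-form solution (our own variable names; CFNet [6] handles multi-channel features, windowing and online updates on top of this):

```python
import numpy as np

def solve_cf(x, y, lam=1e-2):
    """Closed-form correlation filter in the Fourier domain.

    x: (H, W) training image patch, y: (H, W) desired (Gaussian) response.
    """
    N = x.size
    xf, yf = np.fft.fft2(x), np.fft.fft2(y)
    kf = np.conj(xf) * xf / N + lam   # k_hat = (1/N) x_hat^* . x_hat + lam (real)
    af = yf / (N * kf)                # alpha_hat = (1/N) k_hat^{-1} . y_hat
    wf = np.conj(af) * xf             # w_hat = alpha_hat^* . x_hat
    return np.real(np.fft.ifft2(wf))

# Desired response: a Gaussian peak, shifted so the maximum sits at the origin
H = W = 64
yy, xx = np.mgrid[0:H, 0:W] - H // 2
y = np.exp(-(xx**2 + yy**2) / (2 * 2.0**2))
w = solve_cf(np.random.randn(H, W), np.fft.ifftshift(y))
```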

3.2. Some more discussion on the results of CF-net

Figures taken from [6]



References
[1] Krähenbühl, Philipp, and Vladlen Koltun. "Efficient inference in fully connected crfs with gaussian edge potentials." Advances in neural information processing systems. 2011.
[2] Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
[3] Chen, Liang-Chieh, et al. "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs." arXiv preprint arXiv:1606.00915 (2016).
[4] Arnab, Anurag, et al. "Conditional Random Fields Meet Deep Neural Networks for Semantic Segmentation: Combining Probabilistic Graphical Models with Deep Learning for Structured Prediction." IEEE Signal Processing Magazine 35.1 (2018): 37-52.
[5] Bertinetto, Luca, et al. "Fully-convolutional siamese networks for object tracking." European conference on computer vision. Springer, Cham, 2016.
[6] Valmadre, Jack, et al. "End-to-end representation learning for correlation filter based tracking." Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017.
[7] Chen, Zhe, Zhibin Hong, and Dacheng Tao. "An experimental survey on correlation filter-based tracking." arXiv preprint arXiv:1509.05520 (2015).
