# Putting an End to End-to-End

End-to-end backpropagation is the backbone of most of the recent big successes of artificial intelligence: neural networks trained with end-to-end backprop can beat world champions in chess and Go, detect lung cancer from CT scans, and even write stories, poems and news articles. In this blog post, we will explore how we can train a neural network without end-to-end backpropagation while achieving competitive performance.
But why would we want to train a neural network without end-to-end backpropagation in the first place?
Imagine you want to train a very deep neural network on very high-dimensional data. That can pose quite a problem if you want to train this network on a GPU using end-to-end backpropagation: you need to fit the entire computational graph – everything including all the weights, activations and gradients – into GPU memory at once. You might be able to alleviate this problem by reducing the batch size, but this will only get you so far. The main constraint is that with the typical implementation of end-to-end backpropagation, we cannot simply train individual layers of the network asynchronously. In the forward pass, each layer needs to wait for the activations from its predecessor before it can do any computation. And in the backward pass with standard end-to-end backprop, each layer needs to wait for the gradients from its successor before it can update its weights. This so-called forward and backward locking not only prevents us from separating the training of individual subparts of the network, which could help mitigate our memory problem, but also prevents us from parallelizing the training of these subparts, which could make training more efficient.
In our NeurIPS paper Putting An End to End-to-End: Gradient-Isolated Learning of Representations, we try to resolve these computational issues of end-to-end backpropagation by taking inspiration from the brain. Our brains are extremely efficient learning machines. One possible explanation for this is that biological synapses are mostly adjusted based on local information, i.e. based on information from their immediate neighboring neurons. They do not wait for a global signal to update their connection strengths. Thus, different areas of the brain can learn highly asynchronously and in parallel.
Inspired by this local learning found in the brain, we propose a new learning algorithm – Greedy InfoMax. With Greedy InfoMax, we show that we can train a neural network without end-to-end backpropagation and achieve competitive performance. At the same time, Greedy InfoMax makes use of a self-supervised loss function. Thus, we also remove the reliance on a labeled dataset for the training of our model.
Let's have a look at Greedy InfoMax in a bit more detail.

# Greedy InfoMax

Take any conventional neural network architecture. With Greedy InfoMax, we want to train it without end-to-end backpropagation. In order to do so, we remove the global loss, divide the network into separate modules (e.g. individual layers or stacks of layers) and use a separate local loss for the training of each module. Then, we block gradients from flowing in between these modules. This enforces that each module greedily optimizes its own local objective.
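To make the gradient isolation concrete, here is a minimal numpy sketch of two greedily trained modules. Note the assumptions: the "modules" are plain linear maps, and the local loss is a toy squared-error objective rather than the actual local objective used by Greedy InfoMax – the point is only that blocking gradients between modules means one module's update never touches another module's weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy "modules": plain linear maps trained with a made-up squared-error
# local loss (NOT the actual Greedy InfoMax objective).
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(4, 4))

x = rng.normal(size=(8,))
target1 = rng.normal(size=(4,))
target2 = rng.normal(size=(4,))
lr = 0.1

# Module 1: forward pass, then a local weight update.
z1 = W1 @ x
grad_W1 = np.outer(2 * (z1 - target1), x)   # d/dW1 of ||z1 - target1||^2
W1 = W1 - lr * grad_W1

# Gradient isolation: module 2 receives z1 as a constant input,
# so no gradient can flow back through it into module 1.
z1_detached = z1.copy()

W1_before = W1.copy()
z2 = W2 @ z1_detached
grad_W2 = np.outer(2 * (z2 - target2), z1_detached)
W2 = W2 - lr * grad_W2

# Module 2's local update leaves module 1's weights untouched.
print(np.array_equal(W1, W1_before))  # True
```

In a deep-learning framework, the same effect is achieved by detaching each module's output from the computational graph before passing it on.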

But how do we make sure that modules provide meaningful inputs to one another – even though they do not share gradients?

## Contrastive Learning

Our magic trick is called contrastive learning. Recently, several papers successfully applied variants of this method in combination with end-to-end backpropagation to different tasks and domains. Have a look, for example, at Deep InfoMax by Hjelm et al. (2019), Contrastive Predictive Coding by van den Oord et al. (2018), Contrastive Multiview Coding by Tian et al. (2019), Contrastively-trained Structured World Models by Kipf et al. (2019) and Multi-View Information Bottleneck by Federici et al. (2020).
In Greedy InfoMax, we contrast information from different time-steps. To understand the idea behind this variant of contrastive learning, imagine we are given a short speech sample:

Now consider two small patches from this sample, "cats" and "awe" for example. These speech patches share a lot of information with one another. For example, the identity of the speaker, the emotion that is being expressed, and (if we select patches of a smaller time-scale) also the words and phonemes that are being said. All this shared information – speaker identity, emotions, words – could potentially be helpful if we can extract it with a neural network. Thus, our goal is to train a neural network in such a way that it learns to preserve this information that is shared between temporally nearby patches.
For this, we use the InfoNCE objective developed by van den Oord et al. (2018). Essentially, this objective pairs the representations of temporally nearby patches $$(z_t, z_{t+k})$$, termed the positive sample, and contrasts them against random pairs $$(z_t, z_j)$$, termed negative samples. Thus, the positive sample could correspond to the patches "cats" and "awe" from above and the negative sample could correspond to the patch "cats" paired with any other patch that we can sample from our dataset. "Contrasting" in this setting means that the neural network has to learn to differentiate between these two types of samples, i.e. to classify them correctly.
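The classification step at the heart of InfoNCE can be sketched in a few lines of numpy. Assumptions to note: the representations here are random toy vectors, and pairs are scored with a plain dot product, whereas the real model scores pairs with a learned transformation.

```python
import numpy as np

def info_nce(z_t, z_pos, z_negs):
    """InfoNCE loss for a single anchor z_t: classify the positive pair
    among the negatives via dot-product scores and a softmax."""
    candidates = np.vstack([z_pos[None, :], z_negs])  # positive sits at index 0
    scores = candidates @ z_t
    scores = scores - scores.max()                    # numerical stability
    log_softmax = scores - np.log(np.sum(np.exp(scores)))
    return -log_softmax[0]  # negative log-probability of the positive pair

rng = np.random.default_rng(0)
dim = 16
z_t = rng.normal(size=dim)

# Toy stand-ins: the "temporally nearby" representation is correlated with
# z_t, while negatives are unrelated random vectors.
z_pos = z_t + 0.1 * rng.normal(size=dim)
z_negs = rng.normal(size=(10, dim))

loss_nearby = info_nce(z_t, z_pos, z_negs)
losses_random = [info_nce(z_t, rng.normal(size=dim), z_negs) for _ in range(20)]

# Correlated (positive) pairs are far easier to classify than random pairs,
# so they incur a much lower loss.
print(loss_nearby < np.mean(losses_random))
```

Minimizing this loss pushes the network to embed temporally nearby patches close together in representation space, which is exactly what makes the positive pair distinguishable from the negatives.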
How does this contrasting help us to train a neural network without end-to-end backpropagation?
Van den Oord et al. (2018) showed that the InfoNCE objective forces the model to preserve the information that is shared between pairs of temporally nearby patches. Mathematically speaking, this means that the InfoNCE objective maximizes the mutual information between temporally nearby representations. In other words, it maximizes the mutual information between the representations for the time-steps $$t$$ and $$t+k$$:

$\max I(z_t, z_{t+k})$

Intuitively, we can draw the connection between contrastive learning and the mutual information by looking at the definition of the mutual information as the KL divergence between the joint distribution of $$z_t$$ and $$z_{t+k}$$ and the product of their marginals:

$I(z_t, z_{t+k}) = KL(P_{z_t, z_{t+k}} || P_{z_t} P_{z_{t+k}})$

In this context, our positive sample $$(z_t, z_{t+k})$$ can be seen as a sample from the joint distribution $$P_{z_t, z_{t+k}}$$ since $$z_t$$ and $$z_{t+k}$$ are drawn together. The negative sample $$(z_t, z_j)$$, on the other hand, can be seen as a sample from the product of the marginal distributions $$P_{z_t} P_{z_{t+k}}$$ since $$z_t$$ and $$z_j$$ are drawn irrespective of one another. Since the InfoNCE objective pushes these two samples to be as distinguishable as possible (such that it can classify them correctly), this implicitly increases the KL divergence and thus the mutual information between $$z_t$$ and $$z_{t+k}$$.
Since we use the InfoNCE loss for the training of each module (i.e. each individually trained subpart of the neural network) in Greedy InfoMax, all of the above also applies to the representations that these modules create. Thus, in the Greedy InfoMax setting, each module $$m$$ maximizes the mutual information between the representations that it creates for time-steps $$t$$ and $$t+k$$:

$\max I(z_t^m, z_{t+k}^m)$

This is highly relevant for the Greedy InfoMax setting, since maximizing the mutual information between temporally nearby representations in turn also maximizes the mutual information between temporally nearby inputs and outputs of each module. In other words, using the InfoNCE objective, we also maximize the mutual information between $$z^{m-1}_{t+k}$$, the input to module $$m$$ at time-step $$t+k$$, and $$z_{t}^m$$, the output at time-step $$t$$:

$I(z_t^m, z_{t+k}^m) \leq I(z_t^m, z_{t+k}^{m-1})$

This provides us with an intuitive explanation as to why Greedy InfoMax works: By maximizing the mutual information between the input and output of each module, we force each module to keep as much information about its inputs as possible. Since we optimize the mutual information between different time-steps, we simultaneously discourage the modules from simply copying their inputs. Thus, the InfoNCE objective pushes modules to create useful inputs for their successors.

## Memory Efficiency

What do we gain from this greedy training? It allows us to train modules completely separately. This can improve memory efficiency – as only one module has to fit into the GPU memory at a time – and can allow us to train deeper networks on higher-dimensional input.
This distributed training of individual modules is possible since Greedy InfoMax allows us to remove both forward and backward locking. By default, Greedy InfoMax modules are not backward locked – they do not share gradients with one another and thus they do not need to wait for one another's gradients in the backward pass. However, in the default implementation, modules are still forward locked – they depend on their predecessor's output to do their own calculations in the forward pass. This dependency is not very strict, though, since modules do not depend on their predecessor's most recent output. Thus, we can remove the forward locking with a simple trick: by regularly (e.g. every $$x$$ epochs) storing each module's output as a dataset for the next module to train on. This reduces the amount of communication needed between modules tremendously and allows us to train modules on separate devices. Ultimately, this enables Greedy InfoMax to achieve a more memory-efficient, asynchronous and distributed training.
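The caching trick that removes the forward locking can be sketched as follows. Assumptions: the "modules" are hypothetical fixed random maps standing in for the greedily trained Greedy InfoMax modules, and the cache is an in-memory array rather than a dataset on disk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for greedily trained modules: fixed random
# nonlinear maps (in practice, each would be trained with its local loss).
def make_module(in_dim, out_dim):
    W = rng.normal(size=(out_dim, in_dim))
    return lambda batch: np.tanh(batch @ W.T)

module1 = make_module(8, 6)
module2 = make_module(6, 4)

raw_data = rng.normal(size=(100, 8))

# Every few epochs, store module 1's outputs as a static dataset ...
cached_z1 = module1(raw_data)

# ... so module 2 can train on the cache – potentially on a separate
# device – without waiting for module 1's most recent forward pass.
z2 = module2(cached_z1)
print(z2.shape)  # (100, 4)
```

Because module 2 only ever reads from the cache, the two modules never need to synchronize within an epoch, which is what enables the asynchronous, distributed training described above.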

# Experiments

We applied Greedy InfoMax to the audio and vision domain. In both domains, we first train all modules within our architecture greedily with the InfoNCE objective. Then, while keeping the weights within our architecture fixed, we create representations for all inputs and train a linear classifier on these. Our results for both domains are on par with self-supervised state-of-the-art methods that make use of end-to-end backpropagation, indicating the competitiveness of Greedy InfoMax.
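This linear-evaluation protocol can be sketched in numpy. Assumptions: the frozen representations are synthetic class clusters rather than real encoder outputs, and the probe is a least-squares fit to one-hot labels – a simple stand-in for the linear classifier trained in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy frozen representations for 3 classes; in practice these would come
# from the greedily trained encoder, whose weights stay fixed here.
n_per_class, dim, n_classes = 50, 16, 3
centers = rng.normal(size=(n_classes, dim)) * 3
feats = np.vstack([c + rng.normal(size=(n_per_class, dim)) for c in centers])
labels = np.repeat(np.arange(n_classes), n_per_class)

# Linear probe: least-squares fit to one-hot labels. Only this linear map
# is trained; the representations themselves are never updated.
one_hot = np.eye(n_classes)[labels]
W, *_ = np.linalg.lstsq(feats, one_hot, rcond=None)
preds = np.argmax(feats @ W, axis=1)

accuracy = np.mean(preds == labels)
print(accuracy > 0.9)  # linearly separable representations probe well
```

If the frozen representations make the classes linearly separable, the probe achieves high accuracy – which is exactly the property the evaluation measures.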
Let's have a look at how the representations evolve throughout the model. For this, we train a linear classifier on top of each greedily trained module of the Greedy InfoMax architecture. On the y-axis, we plot the error rate when evaluating our model on a speaker classification task.

This plot gives us three great insights:
1. Greedy InfoMax works! Not only does it achieve a competitive performance to the other tested methods, we can even see that each Greedy InfoMax module improves upon its predecessors. This shows us that the individual modules in Greedy InfoMax provide inputs to one another that their successors can use to create even better representations.
2. Contrastive Predictive Coding (CPC), which uses the InfoNCE objective combined with end-to-end backpropagation for its training, shows a similar performance to Greedy InfoMax throughout all modules. This shows us that the InfoNCE objective “stacks well”, such that its greedy, iterative application performs similarly to its global application.
3. For models trained with a supervised cross-entropy loss, the performance of the intermediate modules differs quite strongly depending on whether we apply the cross-entropy loss greedily to each module (Greedy Supervised), or globally at the final module (Supervised). This suggests that the InfoNCE objective, and potentially contrastive learning in general, is especially suited for the greedy, module-wise training with Greedy InfoMax.

# Recap

With Greedy InfoMax, we can train a neural network without end-to-end backpropagation and achieve competitive performance.
To achieve this, Greedy InfoMax divides a neural network into separate, greedily optimized modules. By enforcing modules to preserve the information of their inputs, we enable the stack of modules to create useful features without sharing gradients.
This is exciting since it enables a more memory-efficient, asynchronous and distributed training.

## Credit

I'd like to give a shout-out to my co-authors Bastiaan Veeling and Peter O'Connor for developing Greedy InfoMax with me, and I'd like to thank Marco Federici, Emiel Hoogeboom and Joop Pascha for proofreading and giving feedback on this blog post.

## Citation

If you want to make use of Greedy InfoMax in your own work, you can cite our paper:
```
@inproceedings{lowe2019putting,
  title={Putting An End to End-to-End: Gradient-Isolated Learning of Representations},
  author={L{\"o}we, Sindy and O'Connor, Peter and Veeling, Bastiaan},
  booktitle={Advances in Neural Information Processing Systems},
  year={2019}
}
```

## Paper, Source Code & More

If you can't wait to train your own neural network without end-to-end backpropagation, check out our source code to get you started.
If you want a more technical and in-depth description of Greedy InfoMax, have a look at our full paper.
If you still haven't gotten enough of Greedy InfoMax by then, have a look at my master's thesis. It covers the background and related work on contrastive learning, as well as our biological inspiration, more extensively.

### Contact

Sindy Löwe

loewe [dot] sindy [at] gmail [dot] com

AMLab, University of Amsterdam

Science Park 904

1098 XH Amsterdam

The Netherlands