ICLR, Volume abs/1609.09106, 2017.
Compositional Pattern-Producing Networks, Quoc V. Le, Andrew Dai, Google, Long Short-Term Memory
We focused on two use cases of hypernetworks: static hypernetworks that generate weights for a convolutional network, and dynamic hypernetworks that generate weights for recurrent networks.
- We consider an approach of using a small network to generate the weights for a larger network.
- In addition, our embedding vectors can be generated dynamically by our hypernetwork, allowing the weights of a recurrent network to change over timesteps and adapt to the input sequence.
- The focus of this work is to generate weights for practical architectures, such as convolutional networks and recurrent networks by taking layer embedding vectors as inputs.
- The model takes the layer embeddings z^j learned during training to reproduce the kernel weights for layer j in the main convolutional network.
- Since the HyperRNN cell can indirectly modify the rows of each weight matrix and the bias of the main RNN, it is implicitly performing a linear scaling to the inputs of the activation function.
- The question is whether we can train a deep convolutional network, using a single hypernetwork generating a set of weights for each layer, on a dataset more challenging than MNIST.
- We take a simple version of a residual network and use a single hypernetwork to generate the weights of all of its layers for an image classification task on the CIFAR-10 dataset.
- We see that enforcing a relaxed weight-sharing constraint on the deep residual network costs us about 1.25-1.5% in classification accuracy, while drastically reducing the number of parameters in the model as a trade-off.
- This suggests that, by dynamically adjusting the weight scaling vectors, the HyperLSTM cell has learned a policy for scaling the inputs to the activation functions that is as efficient as the statistical moments-based strategy employed by Layer Norm, and that the extra computation required is embedded inside the extra 128 units of the HyperLSTM cell.
- One might wonder whether the HyperLSTM cell, via dynamically tuning the weight scaling vectors, has developed a policy that is similar to the statistics-based approach used by Layer Norm, given that both methods have similar performance.
- Similar to the earlier character generation experiment, we show a generated handwriting sample from the HyperLSTM model in Figure 7, along with a plot below the sample of how the weight scaling vectors of the main RNN change over time.
- While the LSTM model alone already does a formidable job of generating time-varying parameters of a Mixture Gaussian distribution used to generate realistic handwriting samples, the ability to go one level deeper, and to dynamically generate the generative model is one of the key advantages of HyperRNN over a normal RNN.
- On language modeling and handwriting generation, hypernetworks are competitive with, or sometimes better than, state-of-the-art models.
- In the handwriting generation task, we note that data augmentation and recurrent dropout improved the performance of all models compared to the original setup of Graves (2013).
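
The static-hypernetwork highlights above (one shared hypernetwork, one learned embedding per layer) can be illustrated with a small sketch. It is not the paper's architecture: it assumes a single linear map as the hypernetwork, random untrained embeddings, and hypothetical sizes (`N_LAYERS`, `EMBED_DIM`, `OUT_CH`, `IN_CH`, `F`) picked only to make the parameter arithmetic visible.

```python
import random

# Hypothetical sizes chosen for illustration; none of them come from the paper.
N_LAYERS, EMBED_DIM = 36, 4
OUT_CH, IN_CH, F = 16, 16, 3
KERNEL_SIZE = OUT_CH * IN_CH * F * F  # number of weights in one conv kernel

rng = random.Random(0)

# Shared hypernetwork parameters: one linear map (an assumed simplification).
W = [[rng.gauss(0.0, 0.1) for _ in range(EMBED_DIM)] for _ in range(KERNEL_SIZE)]
b = [0.0] * KERNEL_SIZE

# One embedding per layer; random here because nothing is trained in this sketch.
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(N_LAYERS)]


def generate_kernel(z):
    """Return the flattened kernel weights produced from layer embedding z."""
    return [sum(w * zi for w, zi in zip(row, z)) + bi for row, bi in zip(W, b)]


# Every layer's kernel comes from the same hypernetwork, driven by its own embedding.
kernels = [generate_kernel(z) for z in embeddings]

# Parameters stored when each layer keeps its own kernel ...
direct_params = N_LAYERS * KERNEL_SIZE
# ... versus the shared map, its bias, and the per-layer embeddings.
hyper_params = KERNEL_SIZE * EMBED_DIM + KERNEL_SIZE + N_LAYERS * EMBED_DIM

print(direct_params, hyper_params)
```

With these made-up sizes the hypernetwork variant stores roughly one seventh of the parameters, which is the kind of trade-off the CIFAR-10 highlight describes: fewer parameters in exchange for a relaxed form of weight sharing.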
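
The HyperRNN highlight above says that modifying the rows of a weight matrix amounts to a linear scaling of the inputs to the activation function. A minimal sketch of that identity follows, assuming a fixed scaling vector `d`; in the model described above the hypernetwork would produce it at every timestep. The helper names and the numbers are illustrative only.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def scale_rows(W, d):
    """Scale row i of W by d[i]."""
    return [[di * w for w in row] for di, row in zip(d, W)]


# Toy main-RNN weight matrix, hidden state, and scaling vector (all made up).
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
h = [0.5, -1.0]
d = [2.0, 0.5, -1.0]

# Option 1: build a modified weight matrix, then apply it.
modified = matvec(scale_rows(W, d), h)

# Option 2: apply the original matrix, then scale each pre-activation.
scaled = [di * yi for di, yi in zip(d, matvec(W, h))]

# Both routes give the same pre-activations.
assert all(math.isclose(a, b) for a, b in zip(modified, scaled))
```

The second route never materializes a full new weight matrix, so only a vector with one entry per row has to be generated at each step.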