# Optimizing Data Usage via Differentiable Rewards

ICML, pp. 9983-9995, 2020.

Abstract:

To acquire a new skill, humans learn better and faster if a tutor, based on their current knowledge level, informs them of how much attention they should pay to particular content or practice problems. Similarly, a machine learning model could potentially be trained better with a scorer that "adapts" to its current learning state and esti...

Introduction

- While deep learning models are remarkably good at fitting large data sets, their performance is highly sensitive to the structure and domain of their training data.
- To avoid hand-designed heuristics, several works propose optimizing a parameterized neural network to learn the data usage schedule, but most are tailored to specific use cases, such as handling noisy data for classification (Jiang et al., 2018), learning a curriculum for NMT (Kumar et al., 2019), and actively selecting data for annotation (Fang et al., 2017; Wu et al., 2018).

Highlights

- While deep learning models are remarkably good at fitting large data sets, their performance is highly sensitive to the structure and domain of their training data
- We propose an alternative: a general Reinforcement Learning (RL) framework for optimizing training data usage by training a scorer network that minimizes the model loss on the development set
- We find that Differentiable Data Selection outperforms SPCL (self-paced curriculum learning; Jiang et al., 2015) by a large margin on both tasks, especially multilingual neural machine translation
- We see that incorporating prior knowledge into the scorer network leads to further improvements
- We present Differentiable Data Selection, an efficient Reinforcement Learning framework for optimizing training data usage
- We formulate two algorithms under the Differentiable Data Selection framework for two realistic and very different tasks, image classification and multilingual neural machine translation, which lead to consistent improvements over strong baselines

Results

- The results of the baselines and the method are listed in Table 1.
- Comparing the standard “Uniform” baseline with the proposed “DDS” method shows that DDS improves over the uniform baseline in all 8 settings.
- For NMT, in comparison to Related and TCS, vanilla DDS performs favorably with respect to these state-of-the-art data selection baselines, outperforming each in 3 out of the 4 settings.
- For image classification, retrained DDS can significantly improve over regular DDS, leading to the new state-of-the-art result on the CIFAR-10 dataset.
- For multilingual NMT, TCS+DDS achieves the best performance in three out of four cases.

Conclusion

- The authors present Differentiable Data Selection, an efficient RL framework for optimizing training data usage.
- The authors parameterize the scorer network as a differentiable function of the data, and provide an intuitive reward function for efficiently training the scorer network.
- The authors formulate two algorithms under the DDS framework for two realistic and very different tasks, image classification and multilingual NMT, which lead to consistent improvements over strong baselines.
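
To make the scorer-plus-reward idea concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation. It assumes a tabular scorer (one logit per training example instead of a scorer network), a two-parameter linear model, made-up data in which half the training targets are flipped, and illustrative learning rates. The reward is the cosine between the dev-set gradient and the sampled example's gradient, following the appendix excerpt quoted near the end of this page. All names (`run_dds_toy`, `CLEAN`, `NOISY`, `DEV`) are hypothetical.

```python
import math
import random

# Toy data for the model y = w*x + b. All values are made up for illustration.
CLEAN = [(0.5, 2.0), (1.0, 3.0), (1.5, 4.0), (2.0, 5.0)]  # targets follow y = 2x + 1
NOISY = [(x, -y) for x, y in CLEAN]                       # same inputs, flipped targets
TRAIN = CLEAN + NOISY
DEV = [(0.75, 2.5), (1.25, 3.5), (1.75, 4.5)]             # clean development set


def example_grad(theta, x, y):
    """Gradient of 0.5 * (w*x + b - y)^2 with respect to (w, b)."""
    w, b = theta
    err = w * x + b - y
    return (err * x, err)


def dev_grad(theta):
    """Gradient of the mean dev loss J(theta, D_dev)."""
    grads = [example_grad(theta, x, y) for x, y in DEV]
    n = len(grads)
    return (sum(g[0] for g in grads) / n, sum(g[1] for g in grads) / n)


def cosine(u, v):
    """Cosine similarity of two 2-D vectors; 0.0 if either is (near) zero."""
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu < 1e-12 or nv < 1e-12:
        return 0.0
    return (u[0] * v[0] + u[1] * v[1]) / (nu * nv)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def run_dds_toy(steps=3000, lr_model=0.05, lr_scorer=0.2, seed=0):
    rng = random.Random(seed)
    theta = (0.0, 0.0)
    psi = [0.0] * len(TRAIN)  # assumed tabular scorer: one logit per example
    for _ in range(steps):
        probs = softmax(psi)
        i = rng.choices(range(len(TRAIN)), weights=probs)[0]
        x, y = TRAIN[i]
        g_train = example_grad(theta, x, y)  # gradient at theta_{t-1}
        theta = (theta[0] - lr_model * g_train[0],
                 theta[1] - lr_model * g_train[1])
        # Reward: alignment of the dev gradient (at theta_t) with the
        # training gradient (at theta_{t-1}).
        reward = cosine(dev_grad(theta), g_train)
        # REINFORCE step: grad_psi log softmax(psi)[i] = onehot(i) - probs.
        psi = [p + lr_scorer * reward * ((1.0 if j == i else 0.0) - probs[j])
               for j, p in enumerate(psi)]
    return theta, softmax(psi)


if __name__ == "__main__":
    theta, probs = run_dds_toy()
    print("model parameters:", theta)
    print("probability mass on clean examples:", sum(probs[:len(CLEAN)]))
```

In this toy setting the flipped-target examples produce gradients that oppose the dev gradient, so their sampling probability should shrink over time. A real scorer network would generalize across examples instead of storing one logit per example.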

Tables

- Table 1: Results for image classification accuracy (left) and multilingual MT BLEU (right). For MT, statistical significance is indicated with ∗ (p < 0.005) and † (p < 0.0001).
- Table 2: Statistics of the multilingual NMT datasets.

Related work

- Many machine learning approaches consider how to best present data to models. First, difficulty-based curriculum learning estimates the presentation order based on a heuristic understanding of the hardness of examples (Bengio et al., 2009; Spitkovsky et al., 2010; Tsvetkov et al., 2016; Zhang et al., 2016; Graves et al., 2017; Zhang et al., 2018; Platanios et al., 2019). These methods, though effective, often generalize poorly because they require task-specific difficulty measures. On the other hand, self-paced learning (Kumar et al., 2010; Lee & Grauman, 2011) defines the hardness of the data based on the loss from the model, but is still based on the assumption that the model should learn from easy examples. Our method does not make these assumptions. The closest work to ours is the learning to teach framework (Fan et al., 2018), but its formulation involves manual feature design and requires expensive multi-pass optimization. Instead, we formulate our reward using bi-level optimization, which has been successfully applied to a variety of other tasks (Colson et al., 2007; Anandalingam & Friesz, 1992; Liu et al., 2019a; Baydin et al., 2018; Ren et al., 2018).
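
As a rough illustration of what bi-level optimization means here, the objective can be written as follows. This is a sketch rather than a quotation of the paper's equations: $\theta$, $\psi$, $J$ and $\mathcal{D}_{\text{dev}}$ follow the notation used elsewhere on this page, while the per-example loss $\ell$ and the scorer-induced data distribution $P(X, Y; \psi)$ are assumed names.

```latex
\begin{aligned}
\psi^{*} &= \arg\min_{\psi}\; J\big(\theta^{*}(\psi),\, \mathcal{D}_{\text{dev}}\big), \\
\theta^{*}(\psi) &= \arg\min_{\theta}\; \mathbb{E}_{(x,y)\sim P(X,Y;\psi)}\big[\ell(x, y; \theta)\big].
\end{aligned}
```

The outer problem picks scorer parameters that minimize the dev loss of the model obtained from the inner problem, which in turn trains the model on data weighted by the scorer.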

Reference

- G. Anandalingam and Terry L. Friesz. Hierarchical optimization: An introduction. Annals OR, 1992.
- Amittai Axelrod, Xiaodong He, and Jianfeng Gao. Domain adaptation via pseudo in-domain data selection. In EMNLP, 2011.
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
- Atilim Gunes Baydin, Robert Cornish, David Martínez-Rubio, Mark Schmidt, and Frank Wood. Online learning rate adaptation with hypergradient descent. In ICLR, 2018.
- Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, 2009.
- Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In ACL, 2011.
- Benoît Colson, Patrice Marcotte, and Gilles Savard. An overview of bilevel optimization. Annals OR, 153(1), 2007.
- Yunshu Du, Wojciech M. Czarnecki, Siddhant M. Jayakumar, Razvan Pascanu, and Balaji Lakshminarayanan. Adapting auxiliary losses using gradient similarity. CoRR, abs/1812.02224, 2018. URL http://arxiv.org/abs/1812.02224.
- Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. Learning to teach. In ICLR, 2018.
- Meng Fang, Yuan Li, and Trevor Cohn. Learning how to active learn: A deep reinforcement learning approach. In EMNLP, pp. 595–605, 2017.
- Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
- George Foster, Cyril Goutte, and Roland Kuhn. Discriminative instance weighting for domain adaptation in statistical machine translation. In EMNLP, 2010.
- Alex Graves, Marc G. Bellemare, Jacob Menick, Rémi Munos, and Koray Kavukcuoglu. Automated curriculum learning for neural networks. In ICML, 2017.
- Frantisek Grézl, Martin Karafiát, Stanislav Kontár, and Jan Cernocky. Probabilistic and bottle-neck features for lvcsr of meetings. In ICASSP, volume 4, pp. IV–757. IEEE, 2007.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- Jing Jiang and ChengXiang Zhai. Instance weighting for domain adaptation in nlp. In ACL, 2007.
- Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander G. Hauptmann. Self-paced curriculum learning. In AAAI, 2015.
- Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, 2018.
- Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- Katrin Kirchhoff and Jeff A. Bilmes. Submodularity for data selection in machine translation. In EMNLP, 2014.
- Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better imagenet models transfer better? In CVPR, 2019.
- Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
- Gaurav Kumar, George Foster, Colin Cherry, and Maxim Krikun. Reinforcement learning based curriculum optimization for neural machine translation. In NAACL, pp. 2054–2061, 2019.
- M. Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In NIPS, 2010.
- Yong Jae Lee and Kristen Grauman. Learning the easy things first: Self-paced visual category discovery. In CVPR, 2011.
- Zachary C Lipton, Yu-Xiang Wang, and Alex Smola. Detecting and correcting for label shift with black box predictors. arXiv preprint arXiv:1802.03916, 2018.
- Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: differentiable architecture search. 2019a.
- Shikun Liu, Andrew J. Davison, and Edward Johns. Self-supervised generalisation with meta auxiliary learning. CoRR, abs/1901.08933, 2019b. URL http://arxiv.org/abs/1901.08933.
- Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In ICLR, 2017.
- Robert C Moore and William Lewis. Intelligent selection of language model training data. In ACL, 2010.
- Yurii E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 1983.
- Graham Neubig and Junjie Hu. Rapid adaptation of neural machine translation to new languages. EMNLP, 2018.
- Jiquan Ngiam, Daiyi Peng, Vijay Vasudevan, Simon Kornblith, Quoc V. Le, and Ruoming Pang. Domain adaptive transfer learning with specialist models. CVPR, 2018.
- Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.
- Minh Quang Pham, Josep Crego, Jean Senellart, and François Yvon. Fixing translation divergences in parallel corpora for neural MT. In EMNLP, 2018.
- Emmanouil Antonios Platanios, Otilia Stretcu, Graham Neubig, Barnabas Poczos, and Tom Mitchell. Competence-based curriculum learning for neural machine translation. In NAACL, 2019.
- Ye Qi, Devendra Singh Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. When and why are pre-trained word embeddings useful for neural machine translation? NAACL, 2018.
- Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In ICML, pp. 4331–4340, 2018.
- Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
- Rico Sennrich and Biao Zhang. Revisiting low-resource neural machine translation: A case study. In ACL, 2019.
- Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
- Sunit Sivasankaran, Emmanuel Vincent, and Irina Illina. Discriminative importance weighting of augmented training data for acoustic model training. In ICASSP, 2017.
- Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. From baby steps to leapfrog: How "less is more" in unsupervised dependency parsing. In NAACL, 2010.
- Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. In JMLR, 2014.
- Sebastian Tschiatschek, Rishabh K. Iyer, Haochen Wei, and Jeff A. Bilmes. Learning mixtures of submodular functions for image collection summarization. In NIPS, 2014.
- Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Brian MacWhinney, and Chris Dyer. Learning the curriculum with bayesian optimization for task-specific word representation learning. In ACL, 2016.
- Marlies van der Wees, Arianna Bisazza, and Christof Monz. Dynamic data selection for neural machine translation. In EMNLP, 2017.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pp. 5998–6008, 2017.
- Yogarshi Vyas, Xing Niu, and Marine Carpuat. Identifying semantic divergences in parallel text without annotations. In NAACL, 2018.
- Wei Wang, Isaac Caswell, and Ciprian Chelba. Dynamically composing domain-data selection with clean-data selection by "co-curricular learning" for neural machine translation. In ACL, 2019a.
- Xinyi Wang and Graham Neubig. Target conditioned sampling: Optimizing data selection for multilingual neural machine translation. In ACL, 2019.
- Xinyi Wang, Hieu Pham, Philip Arthur, and Graham Neubig. Multilingual neural machine translation with soft decoupled encoding. In ICLR, 2019b.
- Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
- Jiawei Wu, Lei Li, and William Yang Wang. Reinforced co-training. In NAACL, 2018.
- Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
- Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
- arXiv 1612.06138, 2016.
- Xuan Zhang, Gaurav Kumar, Huda Khayrallah, Kenton Murray, Jeremy Gwinnup, Marianna J. Martindale, Paul McNamee, Kevin Duh, and Marine Carpuat. An empirical exploration of curriculum learning for neural machine translation. arXiv 1811.00739, 2018.
- Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. Transfer learning for low-resource neural machine translation. In EMNLP, 2016.
- where $g(\cdot)$ is any function that may be applied to the gradient $\nabla_\theta J(\theta_{t-1}, \psi)$. For instance, in standard gradient descent $g(\cdot)$ is simply a linear scaling of $\nabla_\theta J(\theta_{t-1}, \psi)$ by a learning rate $\eta_t$, while with the Adam optimizer (Kingma & Ba, 2015) $g(\cdot)$ also modifies the learning rate on a parameter-by-parameter basis.
- Here we first derive $\nabla_\psi g$ for the general stochastic gradient descent (SGD) update, then provide examples for two other common optimization algorithms, namely Momentum (Nesterov, 1983) and Adam (Kingma & Ba, 2015).
- Here, the last equation follows from the log-derivative trick in the REINFORCE algorithm (Williams, 1992). We can consider the alignment of the dev set and training data gradients as the reward for updating $\psi$. In practice, we found that using the cosine similarity of the gradients is more stable than simply taking their dot product. Thus, in our implementation of the image classification and machine translation algorithms, we use $\cos\big(\nabla_\theta J(\theta_t, \mathcal{D}_{\text{dev}}),\ \nabla_\theta \ell(x, y; \theta_{t-1})\big)$ as the reward signal.
- Adam Updates. We use a slightly modified update rule based on Adam (Kingma & Ba, 2015): $g_t \leftarrow \nabla_\theta J(\theta_{t-1}, \psi)$
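
The role of $g(\cdot)$ in the excerpts above can be sketched in plain Python for the three optimizers mentioned. This is a hedged illustration and not the paper's code. It assumes the update convention $\theta_t = \theta_{t-1} - g(\nabla_\theta J)$, uses classical (heavy-ball) momentum as a stand-in for the Momentum variant, and uses textbook Adam, not the "slightly modified" rule the excerpt refers to, whose details are not given on this page. The names `sgd_g`, `MomentumG` and `AdamG` and all default hyperparameters are hypothetical.

```python
import math


def sgd_g(grad, lr=0.1):
    """SGD: g(.) is a plain scaling of the gradient by the learning rate."""
    return [lr * gi for gi in grad]


class MomentumG:
    """Classical momentum: g(.) scales a running sum of past gradients."""

    def __init__(self, dim, lr=0.1, mu=0.9):
        self.lr, self.mu = lr, mu
        self.velocity = [0.0] * dim

    def __call__(self, grad):
        self.velocity = [self.mu * v + gi for v, gi in zip(self.velocity, grad)]
        return [self.lr * v for v in self.velocity]


class AdamG:
    """Textbook Adam: g(.) rescales each coordinate by its own statistics."""

    def __init__(self, dim, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * dim  # running mean of gradients
        self.v = [0.0] * dim  # running mean of squared gradients
        self.t = 0

    def __call__(self, grad):
        self.t += 1
        self.m = [self.beta1 * m + (1 - self.beta1) * gi
                  for m, gi in zip(self.m, grad)]
        self.v = [self.beta2 * v + (1 - self.beta2) * gi * gi
                  for v, gi in zip(self.v, grad)]
        m_hat = [m / (1 - self.beta1 ** self.t) for m in self.m]
        v_hat = [v / (1 - self.beta2 ** self.t) for v in self.v]
        return [self.lr * mh / (math.sqrt(vh) + self.eps)
                for mh, vh in zip(m_hat, v_hat)]


def apply_update(theta, update):
    """Assumed convention: theta_t = theta_{t-1} - g(grad)."""
    return [t - u for t, u in zip(theta, update)]


if __name__ == "__main__":
    grad = [3.0, -0.5]
    print("SGD update:     ", sgd_g(grad))
    print("Momentum update:", MomentumG(dim=2)(grad))
    print("Adam update:    ", AdamG(dim=2)(grad))
```

On the first step, textbook Adam produces an update of roughly the learning rate in magnitude for every coordinate regardless of gradient scale, which is the parameter-by-parameter behavior the excerpt contrasts with SGD's uniform scaling.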
