Optimizing Data Usage via Differentiable Rewards

ICML, pp. 9983-9995, 2020.

Keywords: data selection, domain data, domain adaptation, Differentiable Data Selection, scorer network
Weibo:
We present Differentiable Data Selection, an efficient Reinforcement Learning framework for optimizing training data usage

Abstract:

To acquire a new skill, humans learn better and faster if a tutor, based on their current knowledge level, informs them of how much attention they should pay to particular content or practice problems. Similarly, a machine learning model could potentially be trained better with a scorer that "adapts" to its current learning state and estimates the importance of each training data instance.

Introduction
  • While deep learning models are remarkably good at fitting large data sets, their performance is highly sensitive to the structure and domain of their training data.
  • To avoid hand-designed heuristics, several works propose to optimize a parameterized neural network to learn the data usage schedule, but most of them are tailored to specific use cases, such as handling noisy data for classification (Jiang et al., 2018), learning a curriculum for NMT (Kumar et al., 2019), and actively selecting data for annotation (Fang et al., 2017; Wu et al., 2018).
Highlights
  • While deep learning models are remarkably good at fitting large data sets, their performance is highly sensitive to the structure and domain of their training data
  • We propose an alternative: a general Reinforcement Learning (RL) framework for optimizing training data usage by training a scorer network that minimizes the model loss on the development set (a minimal sketch of such a training loop follows this list)
  • We find that Differentiable Data Selection outperforms SPCL by a large margin for both of the tasks, especially for multilingual neural machine translation
  • We see that incorporating prior knowledge into the scorer network leads to further improvements
  • We present Differentiable Data Selection, an efficient Reinforcement Learning framework for optimizing training data usage
  • We formulate two algorithms under the Differentiable Data Selection framework for two realistic and very different tasks, image classification and multilingual neural machine translation, which lead to consistent improvements over strong baselines
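As a concrete illustration of the framework described in these highlights, below is a minimal, self-contained sketch of a DDS-style training loop. Everything here is an illustrative assumption (toy random data, a linear model, a linear scorer, plain SGD for both updates, per-example gradients computed in a loop); the paper's actual architectures, Adam-based updates, and efficiency approximations are not reproduced, so treat this as a schematic rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n_train, n_dev, batch = 16, 512, 64, 16
X_tr, y_tr = torch.randn(n_train, d), torch.randint(0, 2, (n_train,))
X_dev, y_dev = torch.randn(n_dev, d), torch.randint(0, 2, (n_dev,))

model = torch.nn.Linear(d, 2)      # main model p(y | x; theta)
scorer = torch.nn.Linear(d, 1)     # scorer p(x; psi) over training examples
opt_model = torch.optim.SGD(model.parameters(), lr=0.1)
opt_scorer = torch.optim.SGD(scorer.parameters(), lr=0.01)
params = list(model.parameters())

def flat_grad(loss):
    """Gradient of `loss` w.r.t. the model parameters, flattened to a vector."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

for step in range(100):
    # 1) Sample a minibatch according to the scorer's distribution over the data.
    with torch.no_grad():
        probs = torch.softmax(scorer(X_tr).squeeze(-1), dim=0)
    idx = torch.multinomial(probs, num_samples=batch, replacement=True)
    xb, yb = X_tr[idx], y_tr[idx]

    # 2) Per-example training gradients at theta_{t-1} (before the model update).
    train_grads = [flat_grad(F.cross_entropy(model(xb[i:i+1]), yb[i:i+1]))
                   for i in range(batch)]

    # 3) Update the main model on the sampled minibatch.
    opt_model.zero_grad()
    F.cross_entropy(model(xb), yb).backward()
    opt_model.step()

    # 4) Dev-set gradient at theta_t (after the update); the reward for each
    #    example is the cosine similarity of its gradient with the dev gradient.
    dev_grad = flat_grad(F.cross_entropy(model(X_dev), y_dev))
    rewards = torch.stack([F.cosine_similarity(dev_grad, g, dim=0)
                           for g in train_grads])

    # 5) REINFORCE update of the scorer: up-weigh data whose gradients align
    #    with the dev-set gradient.
    log_probs = torch.log_softmax(scorer(X_tr).squeeze(-1), dim=0)[idx]
    opt_scorer.zero_grad()
    (-(rewards.detach() * log_probs).mean()).backward()
    opt_scorer.step()
```

The key design choice is visible in step 4: the scorer never needs a completed training run to be evaluated; its reward at every step is simply how well each sampled example's gradient aligns with the current dev-set gradient.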
Results
  • The results of the baselines and the method are listed in Table 1.
  • Comparing the standard “Uniform” baseline strategy with the proposed “DDS” method, the authors can see that DDS improves over the uniform baseline in all 8 settings.
  • For NMT, in comparison to Related and TCS, vanilla DDS performs favorably with respect to these state-of-the-art data selection baselines, outperforming each in 3 out of the 4 settings.
  • For image classification, retrained DDS can significantly improve over regular DDS, leading to the new state-of-the-art result on the CIFAR-10 dataset.
  • For multilingual NMT, TCS+DDS achieves the best performance in three out of four cases.
Conclusion
  • The authors present Differentiable Data Selection, an efficient RL framework for optimizing training data usage.
  • The authors parameterize the scorer network as a differentiable function of the data, and provide an intuitive reward function for efficiently training the scorer network.
  • The authors formulate two algorithms under the DDS framework for two realistic and very different tasks, image classification and multilingual NMT, which lead to consistent improvements over strong baselines.
Tables
  • Table 1: Results for image classification accuracy (left) and multilingual MT BLEU (right). For MT, the statistical significance is indicated with ∗ (p < 0.005) and † (p < 0.0001)
  • Table 2: Statistics of the multilingual NMT datasets
Reference
  • G. Anandalingam and Terry L. Friesz. Hierarchical optimization: An introduction. Annals OR, 1992.
  • Amittai Axelrod, Xiaodong He, and Jianfeng Gao. Domain adaptation via pseudo in-domain data selection. In EMNLP, 2011.
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
  • Atilim Gunes Baydin, Robert Cornish, David Martínez-Rubio, Mark Schmidt, and Frank Wood. Online learning rate adaptation with hypergradient descent. In ICLR, 2018.
  • Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, 2009.
  • Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In ACL, 2011.
  • Benoît Colson, Patrice Marcotte, and Gilles Savard. An overview of bilevel optimization. Annals OR, 153(1), 2007.
  • Yunshu Du, Wojciech M. Czarnecki, Siddhant M. Jayakumar, Razvan Pascanu, and Balaji Lakshminarayanan. Adapting auxiliary losses using gradient similarity. CoRR, abs/1812.02224, 2018. URL http://arxiv.org/abs/1812.02224.
  • Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. Learning to teach. In ICLR, 2018.
  • Meng Fang, Yuan Li, and Trevor Cohn. Learning how to active learn: A deep reinforcement learning approach. In EMNLP, pp. 595–605, 2017.
  • Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
  • George Foster, Cyril Goutte, and Roland Kuhn. Discriminative instance weighting for domain adaptation in statistical machine translation. In EMNLP, 2010.
  • Alex Graves, Marc G. Bellemare, Jacob Menick, Rémi Munos, and Koray Kavukcuoglu. Automated curriculum learning for neural networks. In ICML, 2017.
  • Frantisek Grézl, Martin Karafiát, Stanislav Kontár, and Jan Cernocky. Probabilistic and bottle-neck features for LVCSR of meetings. In ICASSP, volume 4, pp. IV–757. IEEE, 2007.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • Jing Jiang and ChengXiang Zhai. Instance weighting for domain adaptation in nlp. In ACL, 2007.
  • Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander G. Hauptmann. Self-paced curriculum learning. In AAAI, 2015.
  • Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning datadriven curriculum for very deep neural networks on corrupted labels. In ICML, 2018.
  • Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Katrin Kirchhoff and Jeff A. Bilmes. Submodularity for data selection in machine translation. In EMNLP, 2014.
  • Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better imagenet models transfer better? In CVPR, 2019.
  • Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
  • Gaurav Kumar, George Foster, Colin Cherry, and Maxim Krikun. Reinforcement learning based curriculum optimization for neural machine translation. In NAACL, pp. 2054–2061, 2019.
  • M. Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In NIPS, 2010.
  • Yong Jae Lee and Kristen Grauman. Learning the easy things first: Self-paced visual category discovery. In CVPR, 2011.
  • Zachary C Lipton, Yu-Xiang Wang, and Alex Smola. Detecting and correcting for label shift with black box predictors. arXiv preprint arXiv:1802.03916, 2018.
  • Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In ICLR, 2019a.
  • Shikun Liu, Andrew J. Davison, and Edward Johns. Self-supervised generalisation with meta auxiliary learning. CoRR, abs/1901.08933, 2019b. URL http://arxiv.org/abs/1901.08933.
  • Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In ICLR, 2017.
  • Robert C Moore and William Lewis. Intelligent selection of language model training data. In ACL, 2010.
  • Yurii E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 1983.
  • Graham Neubig and Junjie Hu. Rapid adaptation of neural machine translation to new languages. In EMNLP, 2018.
  • Jiquan Ngiam, Daiyi Peng, Vijay Vasudevan, Simon Kornblith, Quoc V. Le, and Ruoming Pang. Domain adaptive transfer learning with specialist models. CVPR, 2018.
  • Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.
  • Minh Quang Pham, Josep Crego, Jean Senellart, and François Yvon. Fixing translation divergences in parallel corpora for neural MT. In EMNLP, 2018.
  • Emmanouil Antonios Platanios, Otilia Stretcu, Graham Neubig, Barnabas Poczos, and Tom Mitchell. Competence-based curriculum learning for neural machine translation. In NAACL, 2019.
  • Ye Qi, Devendra Singh Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. When and why are pre-trained word embeddings useful for neural machine translation? In NAACL, 2018.
  • Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In ICML, pp. 4331–4340, 2018.
  • Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
  • Rico Sennrich and Biao Zhang. Revisiting low-resource neural machine translation: A case study. In ACL, 2019.
  • Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
  • Sunit Sivasankaran, Emmanuel Vincent, and Irina Illina. Discriminative importance weighting of augmented training data for acoustic model training. In ICASSP, 2017.
  • Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. From baby steps to leapfrog: How "less is more" in unsupervised dependency parsing. In NAACL, 2010.
  • Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.
  • Sebastian Tschiatschek, Rishabh K. Iyer, Haochen Wei, and Jeff A. Bilmes. Learning mixtures of submodular functions for image collection summarization. In NIPS, 2014.
  • Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Brian MacWhinney, and Chris Dyer. Learning the curriculum with bayesian optimization for task-specific word representation learning. In ACL, 2016.
  • Marlies van der Wees, Arianna Bisazza, and Christof Monz. Dynamic data selection for neural machine translation. In EMNLP, 2017.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pp. 5998–6008, 2017.
  • Yogarshi Vyas, Xing Niu, and Marine Carpuat. Identifying semantic divergences in parallel text without annotations. In NAACL, 2018.
  • Wei Wang, Isaac Caswell, and Ciprian Chelba. Dynamically composing domain-data selection with clean-data selection by "co-curricular learning" for neural machine translation. In ACL, 2019a.
  • Xinyi Wang and Graham Neubig. Target conditioned sampling: Optimizing data selection for multilingual neural machine translation. In ACL, 2019.
  • Xinyi Wang, Hieu Pham, Philip Arthur, and Graham Neubig. Multilingual neural machine translation with soft decoupled encoding. In ICLR, 2019b.
  • Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
  • Jiawei Wu, Lei Li, and William Yang Wang. Reinforced co-training. In NAACL, 2018.
  • Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
  • Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
  • Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
  • arXiv:1612.06138, 2016.
  • Xuan Zhang, Gaurav Kumar, Huda Khayrallah, Kenton Murray, Jeremy Gwinnup, Marianna J. Martindale, Paul McNamee, Kevin Duh, and Marine Carpuat. An empirical exploration of curriculum learning for neural machine translation. arXiv:1811.00739, 2018.
  • Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. Transfer learning for low-resource neural machine translation. In EMNLP, 2016.
  • The model parameters are updated as θt ← θt−1 − g(∇θJ(θt−1, ψ)), where g(·) is any function that may be applied to the gradient ∇θJ(θt−1, ψ). For instance, in standard gradient descent g(·) is simply a linear scaling of ∇θJ(θt−1, ψ) by a learning rate ηt, while with the Adam optimizer (Kingma & Ba, 2015) g also modifies the learning rate on a parameter-by-parameter basis (see the sketch of g(·) at the end of this list).
  • Here we first derive ∇ψg for the general stochastic gradient descent (SGD) update, then provide examples for two other common optimization algorithms, namely Momentum (Nesterov, 1983) and Adam (Kingma & Ba, 2015).
  • Here, the last equation follows from the log-derivative trick in the REINFORCE algorithm (Williams, 1992). We can consider the alignment of the dev-set and training-data gradients as the reward for updating ψ. In practice, we found that using the cosine distance is more stable than simply taking the dot product between the gradients. Thus, in our implementation of the image classification and machine translation algorithms, we use cos(∇θJ(θt, Ddev), ∇θℓ(x, y; θt−1)) as the reward signal (a small numerical sketch of this reward appears at the end of this list).
  • Adam Updates. We use a slightly modified update rule based on Adam (Kingma & Ba, 2015): gt ← ∇θJ(θt−1, ψ)
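To make the role of g(·) in the bullets above concrete, here is a small sketch contrasting the SGD form of g, a plain scaling of the gradient by the learning rate, with Adam's per-parameter adaptive scaling. Variable names are hypothetical, and the Adam shown is the standard rule from Kingma & Ba (2015), not the paper's slightly modified variant.

```python
# Sketch of the optimizer-specific update function g(.) applied to the
# gradient g_t = grad J(theta_{t-1}, psi); all names here are illustrative.
import numpy as np

def g_sgd(grad, lr=0.1):
    """SGD: g is just a linear scaling of the gradient by the learning rate."""
    return lr * grad

def g_adam(grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam: running first/second moments give a per-parameter
    effective learning rate (the paper uses a slightly modified variant)."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return lr * m_hat / (np.sqrt(v_hat) + eps)

# Either way, the model update is theta_t = theta_{t-1} - g(grad).
theta = np.zeros(4)
grad = np.array([0.5, -0.2, 0.1, 0.0])
adam_state = {"t": 0, "m": np.zeros(4), "v": np.zeros(4)}
print(theta - g_sgd(grad))
print(theta - g_adam(grad, adam_state))
```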
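And a minimal sketch of the reward computation described above: the scorer gradient comes from the log-derivative (REINFORCE) trick, with the cosine similarity between the dev-set gradient and each sampled example's gradient as the reward. The flattened gradient vectors are passed in as plain arrays purely for illustration; in a real system they would come from the model and scorer.

```python
import numpy as np

def cosine(u, v, eps=1e-12):
    """Cosine similarity between two flattened gradient vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def scorer_gradient(dev_grad, example_grads, logprob_grads):
    """REINFORCE estimate of the gradient w.r.t. psi: each sampled example
    contributes reward * grad_psi log p(x, y; psi), with
    reward = cos(grad_theta J(theta_t, D_dev), grad_theta loss(x, y; theta_{t-1}))."""
    total = np.zeros_like(logprob_grads[0])
    for g_ex, g_logp in zip(example_grads, logprob_grads):
        total += cosine(dev_grad, g_ex) * g_logp
    return total / len(example_grads)

# Toy usage with random vectors standing in for the real gradients.
rng = np.random.default_rng(0)
dev_grad = rng.normal(size=10)
example_grads = [rng.normal(size=10) for _ in range(4)]   # per-example model grads
logprob_grads = [rng.normal(size=3) for _ in range(4)]    # grad_psi log p(x, y; psi)
print(scorer_gradient(dev_grad, example_grads, logprob_grads))
```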