s influence on generalization further into question. We show that: (1) while larger learning rates reduce $\\lambda_{max}$ for all batch sizes, generalization benefits sometimes vanish at larger batch sizes; (2) by scaling batch size and learning rate simultaneously, we can change $\\lambda_{max}$ without affecting generalization; (3) while SAM produces smaller $\\lambda_{max}$ for all batch sizes, generalization benefits (also) vanish with larger batch sizes; (4) for dropout, excessively high dropout probabilities can degrade generalization, even as they promote smaller $\\lambda_{max}$; and (5) while batch-normalization does not consistently produce smaller $\\lambda_{max}$, it nevertheless confers generalization benefits. While our experiments affirm the generalization benefits of large learning rates and SAM for minibatch SGD, the GD-SGD discrepancy demonstrates limits to $\\lambda_{max} s ability to explain generalization in neural networks. ","authors":[{"name":"Simran Kaur"},{"id":"53f45b2fdabfaee0d9c05e65","name":"Jeremy Cohen"},{"id":"562c7cc645cedb3398c38fdf","name":"Zachary C. Lipton"}],"create_time":"2022-06-23T04:44:13.431Z","hashs":{"h1":"mheg"},"id":"62b3da1e5aee126c0fb1b12c","lang":"en","num_citation":0,"pages":{"end":"65","start":"51"},"pdf":"https:\u002F\u002Fcz5waila03cyo0tux1owpyofgoryroob.aminer.cn\u002F24\u002F03\u002F82\u002F24038226868A814F849608A7A936AFA6.pdf","pdf_src":["https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10654"],"title":"On the Maximum Hessian Eigenvalue and Generalization","update_times":{"u_a_t":"2022-06-23T04:51:33.945Z"},"urls":["db\u002Fconf\u002Ficbinb\u002Ficbinb2022.html#KaurCL22","https:\u002F\u002Fproceedings.mlr.press\u002Fv187\u002Fkaur23a.html","https:\u002F\u002Farxiv.org\u002Fabs\u002F2206.10654"],"versions":[{"id":"62b3da1e5aee126c0fb1b12c","sid":"2206.10654","src":"arxiv","year":2022},{"id":"64f5616f3fda6d7f06f1fc67","sid":"conf\u002Ficbinb\u002FKaurCL22","src":"dblp","year":2022},{"id":"6392ac4090e50fcafd9f515f","sid":"journals\u002Fcorr\u002Fabs-2206-10654","src":"dblp","vsid":"journals\u002Fcorr","year":2022}],"year":2022},{"abstract":" Very little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we empirically demonstrate that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical value -- the stability threshold of a gradient descent algorithm. For Adam with step size $\\eta$ and $\\beta_1 = 0.9$, this stability threshold is $38\u002F\\eta$. Similar effects occur during minibatch training, especially as the batch size grows. Yet, even though adaptive methods train at the ``Adaptive Edge of Stability'' (AEoS), their behavior in this regime differs in a significant way from that of non-adaptive methods at the EoS. Whereas non-adaptive algorithms at the EoS are blocked from entering high-curvature regions of the loss landscape, adaptive gradient methods at the AEoS can keep advancing into high-curvature regions, while adapting the preconditioner to compensate. Our findings can serve as a foundation for the community's future understanding of adaptive gradient methods in deep learning. ","authors":[{"id":"53f45b2fdabfaee0d9c05e65","name":"Jeremy M. Cohen"},{"id":"562dfb5c45ce1e5967b9e474","name":"Behrooz Ghorbani"},{"name":"Shankar Krishnan"},{"name":"Naman Agarwal"},{"id":"6403abf37691d561fb216c0a","name":"Sourabh Medapati"},{"id":"6525491c55b3f8ac46705fd9","name":"Michal Badura"},{"name":"Daniel Suo"},{"id":"63725197ec88d95668ccd385","name":"David Cardoze"},{"name":"Zachary Nado"},{"id":"53f3752edabfae4b349ce5b4","name":"George E. Dahl"},{"name":"Justin Gilmer"}],"create_time":"2022-08-01T04:45:26.441Z","hashs":{"h1":"agmes"},"id":"62e744555aee126c0f33c435","lang":"en","num_citation":0,"pdf":"https:\u002F\u002Fcz5waila03cyo0tux1owpyofgoryroob.aminer.cn\u002FDC\u002F63\u002FE3\u002FDC63E3FEE4CBE2C8A183FD3EC9198AF5.pdf","pdf_src":["https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.14484"],"title":"Adaptive Gradient Methods at the Edge of Stability","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.14484"],"versions":[{"id":"62e744555aee126c0f33c435","sid":"2207.14484","src":"arxiv","year":2022},{"id":"6578b387939a5f4082392024","sid":"W4289446290","src":"openalex","vsid":"journals\u002Fcorr","year":2022},{"id":"63d7ae8390e50fcafdacf48e","sid":"journals\u002Fcorr\u002Fabs-2207-14484","src":"dblp","vsid":"journals\u002Fcorr","year":2022}],"year":2022},{"abstract":"We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability. In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value $2 \u002F \\text{(step size)}$, and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales. Since this behavior is inconsistent with several widespread presumptions in the field of optimization, our findings raise questions as to whether these presumptions are relevant to neural network training. We hope that our findings will inspire future efforts aimed at rigorously understanding optimization at the Edge of Stability. Code is available at https:\u002F\u002Fgithub.com\u002Flocuslab\u002Fedge-of-stability.","authors":[{"id":"53f45b2fdabfaee0d9c05e65","name":"Jeremy M. Cohen"},{"name":"Simranpreet Kaur"},{"id":"562d325945cedb3398d85513","name":"Yuanzhi Li"},{"name":"J. Zico Kolter"},{"id":"53f42dd5dabfaedf4351c97c","name":"Ameet Talwalkar"}],"create_time":"2024-02-10T23:19:28.078Z","hashs":{"h1":"gdnnt","h3":"oes"},"id":"6577cbbe939a5f40823cbfff","keywords":["gradient","stability","neural networks typically","neural networks"],"num_citation":0,"title":"Gradient Descent on Neural Networks Typically Occurs at the Edge of\n Stability","urls":["https:\u002F\u002Fopenalex.org\u002FW4287322898"],"venue":{"info":{"name":"arXiv (Cornell University)"}},"versions":[{"id":"6577cbbe939a5f40823cbfff","sid":"W4287322898","src":"openalex"}],"year":2021},{"abstract":"We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability. In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the value 2\u002F(step size), and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales. Since this behavior is inconsistent with several widespread presumptions in the field of optimization, our findings raise questions as to whether these presumptions are relevant to neural network training. We hope that our findings will inspire future efforts aimed at rigorously understanding optimization at the Edge of Stability.","authors":[{"id":"53f45b2fdabfaee0d9c05e65","name":"Jeremy Cohen"},{"id":"5609efd245cedb3396fdbf28","name":"Simran Kaur"},{"id":"562d325945cedb3398d85513","name":"Yuanzhi Li"},{"name":"J. Zico Kolter"},{"id":"53f42dd5dabfaedf4351c97c","name":"Ameet Talwalkar"}],"create_time":"2023-01-07T15:28:31.216Z","hashs":{"h1":"- dge tability"},"id":"61ca12895244ab9dcbe33e1f","num_citation":0,"title":"At the e dge of s tability","urls":["https:\u002F\u002Fsemanticscholar.org\u002Fpaper\u002F8d950b710be8d52203f483b9c51c574cc3015a35"],"venue":{},"versions":[{"id":"61ca12895244ab9dcbe33e1f","sid":"8d950b710be8d52203f483b9c51c574cc3015a35","src":"semanticscholar"}],"year":2021},{"abstract":"We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability. In this regime, the leading eigenvalue of the training loss Hessian hovers just above the value 2\u002F(step size), and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales. Since this behavior is inconsistent with several widespread presumptions in the field of optimization, our findings raise questions as to whether these presumptions are relevant to neural network training. We hope that our findings will inspire future efforts aimed at rigorously understanding optimization at the Edge of Stability.","add_method":true,"authors":[{"id":"53f45b2fdabfaee0d9c05e65","name":"Jeremy Cohen","org":"Carnegie - Mellon University#TAB#","orgid":"5f71b2861c455f439fe3c771","orgs":["Carnegie - Mellon University#TAB#"]},{"id":"5609efd245cedb3396fdbf28","name":"Simran Kaur","org":"Carnegie - Mellon University#TAB#","orgid":"5f71b2861c455f439fe3c771","orgs":["Carnegie - Mellon University#TAB#"]},{"id":"562d325945cedb3398d85513","name":"Yuanzhi Li","org":"CMU Carnegie Mellon University","orgid":"5f71b2861c455f439fe3c771","orgs":["CMU Carnegie Mellon University"]},{"id":"53f42f3fdabfaeb22f41ff94","name":"J Zico Kolter","org":"Carnegie - Mellon University#TAB#","orgid":"5f71b2861c455f439fe3c771","orgs":["Carnegie - Mellon University#TAB#"]},{"id":"53f42dd5dabfaedf4351c97c","name":"Ameet Talwalkar","org":"University of California Los Angeles","orgid":"5f71b2851c455f439fe3c70a","orgs":["University of California Los Angeles"]}],"citations":{"google_citation":85,"last_citation":80},"create_time":"2021-01-20T13:43:15.323Z","flags":[{"flag":"affirm_author","person_id":"53f42f3fdabfaeb22f41ff94"},{"flag":"affirm_author","person_id":"53f42dd5dabfaedf4351c97c"}],"hashs":{"h1":"gdnnt","h3":"oes"},"id":"600833739e795ed227f5318a","lang":"en","num_citation":166,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fupload\u002Fpdf\u002F1162\u002F674\u002F1429\u002F600833739e795ed227f5318a_0.pdf","title":"Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability","update_times":{"u_a_t":"2021-10-20T14:00:26.348Z","u_c_t":"2024-02-18T04:21:44.575Z","u_v_t":"2021-06-25T13:52:11.01Z"},"urls":["https:\u002F\u002Fopenreview.net\u002Fforum?id=jh-rTtvkGeM","db\u002Fconf\u002Ficlr\u002Ficlr2021.html#CohenKLKT21","https:\u002F\u002Fopenreview.net\u002Fpdf?id=jh-rTtvkGeM","https:\u002F\u002Fdblp.org\u002Frec\u002Fconf\u002Ficlr\u002FCohenKLKT21","db\u002Fjournals\u002Fcorr\u002Fcorr2103.html#abs-2103-00065","https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.00065"],"venue":{"info":{"name":"ICLR","name_s":"ICLR"},"lang":"en","type":2,"volume":"abs\u002F2103.00065"},"venue_hhb_id":"5ea1d518edb6e7d53c0100cb","versions":[{"id":"600833739e795ed227f5318a","sid":"600833739e795ed227f5318a","src":"user-57f9ed429ed5dbd78af2a05d","year":2021},{"id":"603e022991e01129ef28fb4e","sid":"2103.00065","src":"arxiv","year":2022},{"id":"605aa4a0e4510cd7c86f14a5","sid":"3132026801","src":"mag","vsid":"2584161585","year":2021},{"id":"60d5b36f91e011c8cebefcc0","sid":"conf\u002Ficlr\u002FCohenKLKT21","src":"dblp","vsid":"conf\u002Ficlr","year":2021},{"id":"634a168690e50fcafd6638ed","sid":"conf\u002Ficlr\u002Findex.html\u002FCohenKLKT21","src":"dblp","vsid":"conf\u002Ficlr\u002Findex.html","year":2021},{"id":"645647c8d68f896efae43c46","sid":"journals\u002Fcorr\u002Fabs-2103-00065","src":"dblp","year":2021},{"id":"622828ec5aee126c0f97eb7f","sid":"W3132026801","src":"openalex","vsid":"conf\u002Ficlr","year":2021},{"id":"6577d9c0939a5f40827c0eea","sid":"W3135365305","src":"openalex","vsid":"conf\u002Ficlr","year":2021},{"id":"61c8a3a35244ab9dcbecceb1","sid":"026bb8a1066f50ddc8797e1341353603149a8cb8","src":"semanticscholar","vsid":"conf\u002Ficlr","year":2021}],"year":2021},{"abstract":" Recent work has shown that any classifier which classifies well under Gaussian noise can be leveraged to create a new classifier that is provably robust to adversarial perturbations in L2 norm. However, existing guarantees for such classifiers are suboptimal. In this work we provide the first tight analysis of this \"randomized smoothing\" technique. We then demonstrate that this extremely simple method outperforms by a wide margin all other provably L2-robust classifiers proposed in the literature. Furthermore, we train an ImageNet classifier with e.g. a provable top-1 accuracy of 49% under adversarial perturbations with L2 norm less than 0.5 (=127\u002F255). No other provable adversarial defense has been shown to be feasible on ImageNet. While randomized smoothing with Gaussian noise only confers robustness in L2 norm, the empirical success of the approach suggests that provable methods based on randomization at test time are a promising direction for future research into adversarially robust classification. Code and trained models are available at https:\u002F\u002Fgithub.com\u002Flocuslab\u002Fsmoothing . ","authors":[{"id":"53f45b2fdabfaee0d9c05e65","name":"Jeremy M Cohen"},{"id":"6380815cfc451b2d602bdd07","name":"Elan Rosenfeld"},{"name":"J. Zico Kolter"}],"create_time":"2024-01-25T03:39:06.878Z","hashs":{"h1":"carrs"},"id":"5c6146ceda562969d07b6b7a","num_citation":0,"pdf":"https:\u002F\u002Fcz5waila03cyo0tux1owpyofgoryroob.aminer.cn\u002F16\u002FE5\u002FB1\u002F16E5B15CE93ACE847B777BD508BA36E5.pdf","title":"Certified Adversarial Robustness via Randomized Smoothing","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F1902.02918"],"venue":{},"versions":[{"id":"5c6146ceda562969d07b6b7a","sid":"1902.02918","src":"arxiv","year":2019}],"year":2019},{"abstract":" For a standard convolutional neural network, optimizing over the input pixels to maximize the score of some target class will generally produce a grainy-looking version of the original image. However, researchers have demonstrated that for adversarially-trained neural networks, this optimization produces images that uncannily resemble the target class. In this paper, we show that these \"perceptually-aligned gradients\" also occur under randomized smoothing, an alternative means of constructing adversarially-robust classifiers. Our finding suggests that perceptually-aligned gradients may be a general property of robust classifiers, rather than a specific property of adversarially-trained neural networks. We hope that our results will inspire research aimed at explaining this link between perceptually-aligned gradients and adversarial robustness. ","authors":[{"id":"5609efd245cedb3396fdbf28","name":"Kaur Simran"},{"id":"53f45b2fdabfaee0d9c05e65","name":"Cohen Jeremy"},{"id":"562c7cc645cedb3398c38fdf","name":"Lipton Zachary C."}],"create_time":"2020-04-06T21:05:40.096Z","flags":[{"flag":"affirm_author","person_id":"562c7cc645cedb3398c38fdf"}],"hashs":{"h1":"pggpr","h3":"c"},"id":"5daed3413a55ac025c26e282","num_citation":0,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F19\u002F1910\u002F1910.08640.pdf","pdf_src":["https:\u002F\u002Farxiv.org\u002Fpdf\u002F1910.08640"],"title":"Are Perceptually-Aligned Gradients a General Property of Robust Classifiers?","update_times":{"u_a_t":"2020-07-04T04:33:08.76Z"},"urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.08640"],"versions":[{"id":"5daed3413a55ac025c26e282","sid":"1910.08640","src":"arxiv","year":2019},{"id":"5e8d928d9fced0a24b6139f0","sid":"journals\u002Fcorr\u002Fabs-1910-08640","src":"dblp","vsid":"journals\u002Fcorr","year":2019}],"year":2019},{"abstract":"We show how to turn any classifier that classifies well under Gaussian noise into a new classifier that is certifiably robust to adversarial perturbations under the l(2) norm. While this \"randomized smoothing\" technique has been proposed before in the literature, we are the first to provide a tight analysis, which establishes a close connection between l(2) robustness and Gaussian noise. We use the technique to train an ImageNet classifier with e.g. a certified top-1 accuracy of 49% under adversarial perturbations with l(2) norm less than 0.5 (=127\u002F255). Smoothing is the only approach to certifiably robust classification which has been shown feasible on full-resolution ImageNet. On smaller-scale datasets where competing approaches to certified l(2) robustness are viable, smoothing delivers higher certified accuracies. The empirical success of the approach suggests that provable methods based on randomization at prediction time are a promising direction for future research into adversarially robust classification. Code and models are available at http:\u002F\u002Fgithub.com\u002Flocuslab\u002Fsmoothing.","authors":[{"email":"jeremycohen@cmu.edu","id":"53f45b2fdabfaee0d9c05e65","name":"Jeremy Cohen","org":"Carnegie Mellon Univ, Pittsburgh, PA 15213 USA","orgid":"5f71b2861c455f439fe3c771","orgs":["Carnegie Mellon Univ, Pittsburgh, PA 15213 USA"]},{"id":"6380815cfc451b2d602bdd07","name":"Elan Rosenfeld","org":"Carnegie Mellon Univ, Pittsburgh, PA 15213 USA","orgid":"5f71b2861c455f439fe3c771","orgs":["Carnegie Mellon Univ, Pittsburgh, PA 15213 USA"]},{"id":"53f42f3fdabfaeb22f41ff94","name":"J. Zico Kolter","org":"Carnegie Mellon Univ, Pittsburgh, PA 15213 USA","orgid":"5f71b2861c455f439fe3c771","orgs":["Carnegie Mellon Univ, Pittsburgh, PA 15213 USA","Bosch Ctr AI, Baden, Switzerland"]}],"create_time":"2024-01-08T03:01:52.162Z","hashs":{"h1":"carrs"},"id":"5cede108da562983788e803f","issn":"2640-3498","num_citation":1768,"pages":{"end":"1320","start":"1310"},"title":"Certified Adversarial Robustness via Randomized Smoothing","update_times":{"u_c_t":"2024-02-18T04:19:21.986Z"},"urls":["db\u002Fconf\u002Ficml\u002Ficml2019.html#CohenRK19","http:\u002F\u002Fproceedings.mlr.press\u002Fv97\u002Fcohen19c.html","db\u002Fjournals\u002Fcorr\u002Fcorr1902.html#abs-1902-02918","http:\u002F\u002Farxiv.org\u002Fabs\u002F1902.02918","http:\u002F\u002Fwww.webofknowledge.com\u002F","https:\u002F\u002Fopenalex.org\u002FW2911634294"],"venue":{"info":{"name":"International Conference on Machine Learning"},"volume":"97"},"versions":[{"id":"64e966663fda6d7f06357642","sid":"W2911634294","src":"openalex","year":2019},{"id":"64d4f8563fda6d7f061c7576","sid":"WOS:000684034301045","src":"wos","year":2019},{"id":"5e8d929e9fced0a24b626335","sid":"journals\u002Fcorr\u002Fabs-1902-02918","src":"dblp","year":2019},{"id":"5d245bb5da56295a28fcd4a0","sid":"conf\u002Ficml\u002FCohenRK19","src":"dblp","year":2019},{"id":"6216835f5aee126c0ffbfecf","sid":"10.1145\u002F166955.167010","src":"crossref","vsid":"INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 97","year":2019},{"id":"5c6146ceda562969d07b6b7a","sid":"1902.02918","src":"arxiv","vsid":"INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 97","year":2019}],"year":2019}],"profilePubsTotal":9,"profilePatentsPage":0,"profilePatents":null,"profilePatentsTotal":null,"profilePatentsEnd":false,"profileProjectsPage":1,"profileProjects":{"success":true,"msg":"","data":null,"log_id":"2ch7GlfxvgwjFW1X7GzDgP8UxM5"},"profileProjectsTotal":0,"newInfo":null,"checkDelPubs":[]}};