, with $\\alpha \\leq 1$. We show that exploiting invariance in the architecture saves a $d^\\alpha$ factor ($d$ stands for the dimension) in sample size and number of hidden units to achieve the same test error as for unstructured architectures. Finally, we show that output symmetrization of an unstructured kernel estimator does not give a significant statistical improvement; on the other hand, data augmentation with an unstructured kernel estimator is equivalent to an invariant kernel estimator and enjoys the same improvement in statistical efficiency. ","authors":[{"name":"Song Mei"},{"name":"Theodor Misiakiewicz"},{"id":"53f437e5dabfaeb22f47f3c9","name":"Andrea Montanari"}],"flags":[{"flag":"affirm_author","person_id":"53f437e5dabfaeb22f47f3c9"}],"id":"603cb38991e011aeee15061d","num_citation":0,"order":2,"pages":{"end":"3418","start":"3351"},"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F21\u002F2102\u002F2102.13219.pdf","title":"Learning with invariances in random features and kernel models.","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.13219","https:\u002F\u002Fdblp.org\u002Frec\u002Fconf\u002Fcolt\u002FMeiMM21","http:\u002F\u002Fproceedings.mlr.press\u002Fv134\u002Fmei21a.html"],"venue":{"info":{"name":"COLT"}},"versions":[{"id":"603cb38991e011aeee15061d","sid":"2102.13219","src":"arxiv","year":2021},{"id":"6124dda391e0114cbe7a9a95","sid":"conf\u002Fcolt\u002FMeiMM21","src":"dblp","vsid":"conf\u002Fcolt","year":2021}],"year":2021},{"abstract":" Despite their many appealing properties, kernel methods are heavily affected by the curse of dimensionality. For instance, in the case of inner product kernels in $\\mathbb{R}^d$, the Reproducing Kernel Hilbert Space (RKHS) norm is often very large for functions that depend strongly on a small subset of directions (ridge functions). Correspondingly, such functions are difficult to learn using kernel methods. This observation has motivated the study of generalizations of kernel methods, whereby the RKHS norm -- which is equivalent to a weighted $\\ell_2$ norm -- is replaced by a weighted functional $\\ell_p$ norm, which we refer to as $\\mathcal{F}_p$ norm. Unfortunately, tractability of these approaches is unclear. The kernel trick is not available and minimizing these norms requires to solve an infinite-dimensional convex problem. We study random features approximations to these norms and show that, for $p\u003E1$, the number of random features required to approximate the original learning problem is upper bounded by a polynomial in the sample size. Hence, learning with $\\mathcal{F}_p$ norms is tractable in these cases. We introduce a proof technique based on uniform concentration in the dual, which can be of broader interest in the study of overparametrized models. ","authors":[{"name":"Michael Celentano"},{"name":"Theodor Misiakiewicz"},{"id":"53f437e5dabfaeb22f47f3c9","name":"Andrea Montanari"}],"flags":[{"flag":"affirm_author","person_id":"53f437e5dabfaeb22f47f3c9"}],"id":"60644c0491e011538305cf78","num_citation":0,"order":2,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F21\u002F2103\u002F2103.15996.pdf","title":"Minimum complexity interpolation in random features models","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.15996"],"versions":[{"id":"60644c0491e011538305cf78","sid":"2103.15996","src":"arxiv","year":2021}],"year":2021},{"abstract":" The community detection problem requires to cluster the nodes of a network into a small number of well-connected \"communities\". There has been substantial recent progress in characterizing the fundamental statistical limits of community detection under simple stochastic block models. However, in real-world applications, the network structure is typically dynamic, with nodes that join over time. In this setting, we would like a detection algorithm to perform only a limited number of updates at each node arrival. While standard voting approaches satisfy this constraint, it is unclear whether they exploit the network information optimally. We introduce a simple model for networks growing over time which we refer to as streaming stochastic block model (StSBM). Within this model, we prove that voting algorithms have fundamental limitations. We also develop a streaming belief-propagation (StreamBP) approach, for which we prove optimality in certain regimes. We validate our theoretical findings on synthetic and real data. ","authors":[{"id":"542dcb00dabfae11fc4a170f","name":"Yuchen Wu"},{"name":"MohammadHossein Bateni"},{"name":"Andre Linhares"},{"name":"Filipe Miguel Goncalves de Almeida"},{"id":"53f437e5dabfaeb22f47f3c9","name":"Andrea Montanari"},{"name":"Ashkan Norouzi-Fard"},{"id":"618505578672f1ddaff25a63","name":"Jakab Tardos"}],"flags":[{"flag":"affirm_author","person_id":"53f437e5dabfaeb22f47f3c9"}],"id":"60c2fb7091e0117e30ca2953","lang":"en","num_citation":0,"order":4,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F21\u002F2106\u002F2106.04805.pdf","title":"Streaming Belief Propagation for Community Detection","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.04805"],"versions":[{"id":"60c2fb7091e0117e30ca2953","sid":"2106.04805","src":"arxiv","year":2021}],"year":2021},{"abstract":" We consider the problem of estimating a low-dimensional parameter in high-dimensional linear regression. Constructing an approximately unbiased estimate of the parameter of interest is a crucial step towards performing statistical inference. Several authors suggest to orthogonalize both the variable of interest and the outcome with respect to the nuisance variables, and then regress the residual outcome with respect to the residual variable. This is possible if the covariance structure of the regressors is perfectly known, or is sufficiently structured that it can be estimated accurately from data (e.g., the precision matrix is sufficiently sparse). Here we consider a regime in which the covariate model can only be estimated inaccurately, and hence existing debiasing approaches are not guaranteed to work. When errors in estimating the covariate model are correlated with errors in estimating the linear model parameter, an incomplete elimination of the bias occurs. We propose the Correlation Adjusted Debiased Lasso (CAD), which nearly eliminates this bias in some cases, including cases in which the estimation errors are neither negligible nor orthogonal. We consider a setting in which some unlabeled samples might be available to the statistician alongside labeled ones (semi-supervised learning), and our guarantees hold under the assumption of jointly Gaussian covariates. The new debiased estimator is guaranteed to cancel the bias in two cases: (1) when the total number of samples (labeled and unlabeled) is larger than the number of parameters, or (2) when the covariance of the nuisance (but not the effect of the nuisance on the variable of interest) is known. Neither of these cases is treated by state-of-the-art methods. ","authors":[{"name":"Michael Celentano"},{"id":"53f437e5dabfaeb22f47f3c9","name":"Andrea Montanari"}],"flags":[{"flag":"affirm_author","person_id":"53f437e5dabfaeb22f47f3c9"}],"id":"61039cd15244ab9dcb8b6b8b","lang":"en","num_citation":0,"order":1,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F21\u002F2107\u002F2107.14172.pdf","title":"CAD: Debiasing the Lasso with inaccurate covariate model","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2107.14172"],"versions":[{"id":"61039cd15244ab9dcb8b6b8b","sid":"2107.14172","src":"arxiv","year":2021}],"year":2021},{"abstract":" Modern neural networks are often operated in a strongly overparametrized regime: they comprise so many parameters that they can interpolate the training set, even if actual labels are replaced by purely random ones. Despite this, they achieve good prediction error on unseen data: interpolating the training set does not induce overfitting. Further, overparametrization appears to be beneficial in that it simplifies the optimization landscape. Here we study these phenomena in the context of two-layers neural networks in the neural tangent (NT) regime. We consider a simple data model, with isotropic feature vectors in $d$ dimensions, and $N$ hidden neurons. Under the assumption $N \\le Cd$ (for $C$ a constant), we show that the network can exactly interpolate the data as soon as the number of parameters is significantly larger than the number of samples: $Nd\\gg n$. Under these assumptions, we show that the empirical NT kernel has minimum eigenvalue bounded away from zero, and characterize the generalization error of min-$\\ell_2$ norm interpolants, when the target function is linear. In particular, we show that the network approximately performs ridge regression in the raw features, with a strictly positive `self-induced' regularization. ","authors":[{"id":"53f437e5dabfaeb22f47f3c9","name":"Andrea Montanari"},{"name":"Yiqiao Zhong"}],"flags":[{"flag":"affirm_author","person_id":"53f437e5dabfaeb22f47f3c9"}],"id":"5f1feefa91e011d50a6219b2","num_citation":0,"order":0,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F20\u002F2007\u002F2007.12826.pdf","title":"The Interpolation Phase Transition in Neural Networks: Memorization and Generalization under Lazy Training","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2007.12826"],"versions":[{"id":"5f1feefa91e011d50a6219b2","sid":"2007.12826","src":"arxiv","year":2021}],"year":2021},{"abstract":" In the negative perceptron problem we are given $n$ data points $({\\boldsymbol x}_i,y_i)$, where ${\\boldsymbol x}_i$ is a $d$-dimensional vector and $y_i\\in\\{+1,-1\\}$ is a binary label. The data are not linearly separable and hence we content ourselves to find a linear classifier with the largest possible \\emph{negative} margin. In other words, we want to find a unit norm vector ${\\boldsymbol \\theta}$ that maximizes $\\min_{i\\le n}y_i\\langle {\\boldsymbol \\theta},{\\boldsymbol x}_i\\rangle$. This is a non-convex optimization problem (it is equivalent to finding a maximum norm vector in a polytope), and we study its typical properties under two random models for the data. We consider the proportional asymptotics in which $n,d\\to \\infty$ with $n\u002Fd\\to\\delta$, and prove upper and lower bounds on the maximum margin $\\kappa_{\\text{s}}(\\delta)$ or -- equivalently -- on its inverse function $\\delta_{\\text{s}}(\\kappa)$. In other words, $\\delta_{\\text{s}}(\\kappa)$ is the overparametrization threshold: for $n\u002Fd\\le \\delta_{\\text{s}}(\\kappa)-\\varepsilon$ a classifier achieving vanishing training error exists with high probability, while for $n\u002Fd\\ge \\delta_{\\text{s}}(\\kappa)+\\varepsilon$ it does not. Our bounds on $\\delta_{\\text{s}}(\\kappa)$ match to the leading order as $\\kappa\\to -\\infty$. We then analyze a linear programming algorithm to find a solution, and characterize the corresponding threshold $\\delta_{\\text{lin}}(\\kappa)$. We observe a gap between the interpolation threshold $\\delta_{\\text{s}}(\\kappa)$ and the linear programming threshold $\\delta_{\\text{lin}}(\\kappa)$, raising the question of the behavior of other algorithms. ","authors":[{"id":"53f437e5dabfaeb22f47f3c9","name":"Andrea Montanari"},{"name":"Yiqiao Zhong"},{"name":"Kangjie Zhou"}],"flags":[{"flag":"affirm_author","person_id":"53f437e5dabfaeb22f47f3c9"}],"id":"617f5aa35244ab9dcbaa71c3","lang":"en","num_citation":0,"order":0,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F21\u002F2110\u002F2110.15824.pdf","title":"Tractability from overparametrization: The example of the negative\n perceptron","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.15824"],"versions":[{"id":"617f5aa35244ab9dcbaa71c3","sid":"2110.15824","src":"arxiv","year":2021}],"year":2021},{"abstract":" Given a graph $G$ of degree $k$ over $n$ vertices, we consider the problem of computing a near maximum cut or a near minimum bisection in polynomial time. For graphs of girth $L$, we develop a local message passing algorithm whose complexity is $O(nkL)$, and that achieves near optimal cut values among all $L$-local algorithms. Focusing on max-cut, the algorithm constructs a cut of value $nk\u002F4+ n\\mathsf{P}_\\star\\sqrt{k\u002F4}+\\mathsf{err}(n,k,L)$, where $\\mathsf{P}_\\star\\approx 0.763166$ is the value of the Parisi formula from spin glass theory, and $\\mathsf{err}(n,k,L)=o_n(n)+no_k(\\sqrt{k})+n \\sqrt{k} o_L(1)$ (subscripts indicate the asymptotic variables). Our result generalizes to locally treelike graphs, i.e., graphs whose girth becomes $L$ after removing a small fraction of vertices. Earlier work established that, for random $k$-regular graphs, the typical max-cut value is $nk\u002F4+ n\\mathsf{P}_\\star\\sqrt{k\u002F4}+o_n(n)+no_k(\\sqrt{k})$. Therefore our algorithm is nearly optimal on such graphs. An immediate corollary of this result is that random regular graphs have nearly minimum max-cut, and nearly maximum min-bisection among all regular locally treelike graphs. This can be viewed as a combinatorial version of the near-Ramanujan property of random regular graphs. ","authors":[{"name":"Ahmed El Alaoui"},{"id":"53f437e5dabfaeb22f47f3c9","name":"Andrea Montanari"},{"name":"Mark Sellke"}],"id":"6191cfa05244ab9dcb16bf1b","lang":"en","num_citation":0,"order":1,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F21\u002F2111\u002F2111.06813.pdf","title":"Local algorithms for Maximum Cut and Minimum Bisection on locally\n treelike regular graphs of large degree","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.06813"],"versions":[{"id":"6191cfa05244ab9dcb16bf1b","sid":"2111.06813","src":"arxiv","year":2021}],"year":2021},{"abstract":"Let ${A}\\\\in{\\\\mathbb R}^{n\\\\times n}$ be a symmetric random matrix with independent and identically distributed (i.i.d.) Gaussian entries above the diagonal. We consider the problem of maximizing $\\\\l...","authors":[{"id":"53f437e5dabfaeb22f47f3c9","name":"Andrea Montanari"}],"doi":"10.1137\u002F20M132016X","flags":[{"flag":"affirm_author","person_id":"53f437e5dabfaeb22f47f3c9"}],"id":"60376663d3485cfff1da1c00","num_citation":0,"order":0,"title":"Optimization of the Sherrington--Kirkpatrick Hamiltonian","urls":["https:\u002F\u002Fepubs.siam.org\u002Fdoi\u002Fabs\u002F10.1137\u002F20M132016X"],"venue":{"info":{"name":"SIAM Journal on Computing"}},"versions":[{"id":"60376663d3485cfff1da1c00","sid":"3119398506","src":"mag","vsid":"153560523","year":2021}],"year":2021},{"abstract":" Given a probability measure $\\mu$ over ${\\mathbb R}^n$, it is often useful to approximate it by the convex combination of a small number of probability measures, such that each component is close to a product measure. Recently, Ronen Eldan used a stochastic localization argument to prove a general decomposition result of this type. In Eldan's theorem, the `number of components' is characterized by the entropy of the mixture, and `closeness to product' is characterized by the covariance matrix of each component. We present an elementary proof of Eldan's theorem which makes use of an information theory (or estimation theory) interpretation. The proof is analogous to the one of an earlier decomposition result known as the `pinning lemma.' ","authors":[{"id":"562d5f5845cedb3398ddd3fd","name":"Ahmed El Alaoui"},{"id":"53f437e5dabfaeb22f47f3c9","name":"Andrea Montanari"}],"flags":[{"flag":"affirm_author","person_id":"53f437e5dabfaeb22f47f3c9"}],"id":"613192745244ab9dcb9dfb32","lang":"en","num_citation":0,"order":1,"title":"An Information-Theoretic View of Stochastic Localization","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2109.00709"],"versions":[{"id":"613192745244ab9dcb9dfb32","sid":"2109.00709","src":"arxiv","year":2021}],"year":2021},{"abstract":" Consider the classical supervised learning problem: we are given data $(y_i,{\\boldsymbol x}_i)$, $i\\le n$, with $y_i$ a response and ${\\boldsymbol x}_i\\in {\\mathcal X}$ a covariates vector, and try to learn a model $f:{\\mathcal X}\\to{\\mathbb R}$ to predict future responses. Random features methods map the covariates vector ${\\boldsymbol x}_i$ to a point ${\\boldsymbol \\phi}({\\boldsymbol x}_i)$ in a higher dimensional space ${\\mathbb R}^N$, via a random featurization map ${\\boldsymbol \\phi}$. We study the use of random features methods in conjunction with ridge regression in the feature space ${\\mathbb R}^N$. This can be viewed as a finite-dimensional approximation of kernel ridge regression (KRR), or as a stylized model for neural networks in the so called lazy training regime. We define a class of problems satisfying certain spectral conditions on the underlying kernels, and a hypercontractivity assumption on the associated eigenfunctions. These conditions are verified by classical high-dimensional examples. Under these conditions, we prove a sharp characterization of the error of random features ridge regression. In particular, we address two fundamental questions: $(1)$~What is the generalization error of KRR? $(2)$~How big $N$ should be for the random features approximation to achieve the same error as KRR? In this setting, we prove that KRR is well approximated by a projection onto the top $\\ell$ eigenfunctions of the kernel, where $\\ell$ depends on the sample size $n$. We show that the test error of random features ridge regression is dominated by its approximation error and is larger than the error of KRR as long as $N\\le n^{1-\\delta}$ for some $\\delta\u003E0$. We characterize this gap. For $N\\ge n^{1+\\delta}$, random features achieve the same error as the corresponding KRR, and further increasing $N$ does not lead to a significant change in test error. ","authors":[{"name":"Song Mei"},{"name":"Theodor Misiakiewicz"},{"id":"53f437e5dabfaeb22f47f3c9","name":"Andrea Montanari"}],"flags":[{"flag":"affirm_author","person_id":"53f437e5dabfaeb22f47f3c9"}],"id":"601139d991e0117654c5fe67","num_citation":0,"order":2,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F21\u002F2101\u002F2101.10588.pdf","title":"Generalization error of random features and kernel methods: hypercontractivity and kernel matrix concentration","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2101.10588"],"versions":[{"id":"601139d991e0117654c5fe67","sid":"2101.10588","src":"arxiv","year":2021}],"year":2021},{"abstract":"The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting, that is, accurate predictions despite overfitting training data. In this article, we survey recent progress in statistical learning theory that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behaviour of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favourable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.","authors":[{"id":"53f46be1dabfaeecd6a225b5","name":"Peter L. Bartlett"},{"id":"53f437e5dabfaeb22f47f3c9","name":"Andrea Montanari"},{"name":"Alexander Rakhlin"}],"doi":"10.1017\u002FS0962492921000027","flags":[{"flag":"affirm_author","person_id":"53f437e5dabfaeb22f47f3c9"}],"id":"6051dcdc91e011c24e59920f","num_citation":0,"order":1,"pages":{"end":"201","start":"87"},"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F21\u002F2103\u002F2103.09177.pdf","title":"Deep Learning: A Statistical Viewpoint","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.09177","http:\u002F\u002Fwww.webofknowledge.com\u002F"],"venue":{"info":{"name":"ACTA NUMERICA"},"volume":"30"},"versions":[{"id":"6051dcdc91e011c24e59920f","sid":"2103.09177","src":"arxiv","year":2021},{"id":"619bba651c45e57ce9723837","sid":"WOS:000680868300003","src":"wos","vsid":"ACTA NUMERICA","year":2021}],"year":2021},{"abstract":"Statistical learning methods in a retrospective cohort study automatically identified a set of significant features for prediction and yielded high prediction performance for preeclampsia risk from routine early pregnancy information.","authors":[{"id":"53f4cafcdabfaee579780bdb","name":"Ivana Marić"},{"id":"561936f645ce1e596429f094","name":"Abraham Tsur"},{"id":"53f448e8dabfaee2a1d3ec2b","name":"Nima Aghaeepour"},{"id":"53f437e5dabfaeb22f47f3c9","name":"Andrea Montanari"},{"id":"53f7f2eedabfae9060afc08e","name":"David K. Stevenson"},{"id":"54483b76dabfae87b7deedad","name":"Gary M. Shaw"},{"id":"53f42feedabfaee1c0a5417c","name":"Virginia D. Winn"}],"doi":"10.1016\u002Fj.ajogmf.2020.100100","id":"5f157d429fced0a24b1c157e","num_citation":4,"order":3,"pages":{"start":"100100"},"pdf":"https:\u002F\u002Fcz5waila03cyo0tux1owpyofgoryrooa.oss-cn-beijing.aliyuncs.com\u002FE9\u002FF9\u002F51\u002FE9F951230A80D17D2D9C9E84AAF9D5BF.pdf","title":"Early prediction of preeclampsia via machine learning","urls":["https:\u002F\u002Fwww.sciencedirect.com\u002Fscience\u002Farticle\u002Fpii\u002FS2589933320300306","https:\u002F\u002Fwww.ncbi.nlm.nih.gov\u002Fpubmed\u002F33345966","http:\u002F\u002Fwww.webofknowledge.com\u002F"],"venue":{"info":{"name":"American Journal of Obstetrics & Gynecology MFM"},"issue":"2","volume":"2"},"versions":[{"id":"5f157d429fced0a24b1c157e","sid":"S2589933320300306","src":"sciencedirect","vsid":"american-journal-of-obstetrics-and-gynecology-mfm","year":2020},{"id":"5fe1c4e1d4150a363c01f12d","sid":"33345966","src":"pubmed","vsid":"101746609","year":2020},{"id":"619bb3d81c45e57ce92a9847","sid":"WOS:000658221800016","src":"wos","vsid":"AMERICAN JOURNAL OF OBSTETRICS & GYNECOLOGY MFM","year":2020}],"year":2020},{"authors":[{"id":"53f437e5dabfaeb22f47f3c9","name":"A Montanari"}],"id":"61a58ef06750f842185dcb9f","lang":"en","num_citation":1,"order":0,"title":"Mean field asymptotics in high-dimensional statistics: A few references","urls":["https:\u002F\u002Fscholar.google.com.hk"],"versions":[{"id":"61a58ef06750f842185dcb9f","sid":"61a58ef06750f842185dcb9f","src":"user-6144298de55422cecdaf68a5","year":2020}],"year":2020},{"abstract":" ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain. ","authors":[{"name":"Alexander D'Amour"},{"id":"53f43a8fdabfaee02acf28b9","name":"Katherine Heller"},{"name":"Dan Moldovan"},{"name":"Ben Adlam"},{"name":"Babak Alipanahi"},{"id":"53f477c4dabfaee2a1df37c6","name":"Alex Beutel"},{"name":"Christina Chen"},{"name":"Jonathan Deaton"},{"id":"53f66205dabfae6a71b634ef","name":"Jacob Eisenstein"},{"name":"Matthew D. Hoffman"},{"name":"Farhad Hormozdiari"},{"id":"53f37a4bdabfae4b349e0122","name":"Neil Houlsby"},{"name":"Shaobo Hou"},{"id":"61770b7760a9653833e1a50a","name":"Ghassen Jerfel"},{"name":"Alan Karthikesalingam"},{"id":"562cf7ac45cedb3398d15def","name":"Mario Lucic"},{"name":"Yian Ma"},{"name":"Cory McLean"},{"name":"Diana Mincu"},{"name":"Akinori Mitani"}],"flags":[{"flag":"affirm_author","person_id":"53f437e5dabfaeb22f47f3c9"}],"id":"5fa9147a91e011e83f7407b4","num_citation":27,"order":20,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F20\u002F2011\u002F2011.03395.pdf","title":"Underspecification Presents Challenges for Credibility in Modern Machine Learning","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2011.03395"],"versions":[{"id":"5fa9147a91e011e83f7407b4","sid":"2011.03395","src":"arxiv","year":2020}],"year":2020},{"abstract":" Optimizing a high-dimensional non-convex function is, in general, computationally hard and many problems of this type are hard to solve even approximately. Complexity theory characterizes the optimal approximation ratios achievable in polynomial time in the worst case. On the other hand, when the objective function is random, worst case approximation ratios are overly pessimistic. Mean field spin glasses are canonical families of random energy functions over the discrete hypercube $\\{-1,+1\\}^N$. The near-optima of these energy landscapes are organized according to an ultrametric tree-like structure, which enjoys a high degree of universality. Recently, a precise connection has begun to emerge between this ultrametric structure and the optimal approximation ratio achievable in polynomial time in the typical case. A new approximate message passing (AMP) algorithm has been proposed that leverages this connection. The asymptotic behavior of this algorithm has been analyzed, conditional on the nature of the solution of a certain variational problem. In this paper we describe the first implementation of this algorithm and the first numerical solution of the associated variational problem. We test our approach on two prototypical mean-field spin glasses: the Sherrington-Kirkpatrick (SK) model, and the $3$-spin Ising spin glass. We observe that the algorithm works well already at moderate sizes ($N\\gtrsim 1000$) and its behavior is consistent with theoretical expectations. For the SK model it asymptotically achieves arbitrarily good approximations of the global optimum. For the $3$-spin model, it achieves a constant approximation ratio that is predicted by the theory, and it appears to beat the `threshold energy' achieved by Glauber dynamics. Finally, we observe numerically that the intermediate states generated by the algorithm have the properties of ancestor states in the ultrametric tree. ","authors":[{"id":"562d5f5845cedb3398ddd3fd","name":"Ahmed El Alaoui"},{"id":"53f437e5dabfaeb22f47f3c9","name":"Andrea Montanari"}],"flags":[{"flag":"affirm_author","person_id":"53f437e5dabfaeb22f47f3c9"}],"id":"5f6dbae791e011533700548d","num_citation":0,"order":1,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F20\u002F2009\u002F2009.11481.pdf","title":"Algorithmic Thresholds in Mean Field Spin Glasses","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2009.11481"],"versions":[{"id":"5f6dbae791e011533700548d","sid":"2009.11481","src":"arxiv","year":2020}],"year":2020},{"abstract":" The Lasso is a method for high-dimensional regression, which is now commonly used when the number of covariates $p$ is of the same order or larger than the number of observations $n$. Classical asymptotic normality theory is not applicable for this model due to two fundamental reasons: $(1)$ The regularized risk is non-smooth; $(2)$ The distance between the estimator $\\bf \\widehat{\\theta}$ and the true parameters vector $\\bf \\theta^\\star$ cannot be neglected. As a consequence, standard perturbative arguments that are the traditional basis for asymptotic normality fail. On the other hand, the Lasso estimator can be precisely characterized in the regime in which both $n$ and $p$ are large, while $n\u002Fp$ is of order one. This characterization was first obtained in the case of standard Gaussian designs, and subsequently generalized to other high-dimensional estimation procedures. Here we extend the same characterization to Gaussian correlated designs with non-singular covariance structure. This characterization is expressed in terms of a simpler ``fixed design'' model. We establish non-asymptotic bounds on the distance between distributions of various quantities in the two models, which hold uniformly over signals $\\bf \\theta^\\star$ in a suitable sparsity class, and values of the regularization parameter. As applications, we study the distribution of the debiased Lasso, and show that a degrees-of-freedom correction is necessary for computing valid confidence intervals. ","authors":[{"name":"Michael Celentano"},{"id":"53f437e5dabfaeb22f47f3c9","name":"Andrea Montanari"},{"id":"562d600245cedb3398ddebd4","name":"Yuting Wei"}],"id":"5f200dea91e011d50a621d2c","num_citation":0,"order":1,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F20\u002F2007\u002F2007.13716.pdf","title":"The Lasso with general Gaussian designs with applications to hypothesis testing","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2007.13716"],"versions":[{"id":"5f200dea91e011d50a621d2c","sid":"2007.13716","src":"arxiv","year":2020}],"year":2020},{"abstract":"Given a collection of data points, nonnegative matrix factorization (NMF) suggests expressing them as convex combinations of a small set of \"archetypes\" with nonnegative entries. This decomposition is unique only if the true archetypes are nonnegative and sufficiently sparse (or the weights are sufficiently sparse), a regime that is captured by the separability condition and its generalizations. In this article, we study an approach to NMF that can be traced back to the work of Cutler and Breiman [(1994), \"Archetypal Analysis,\" Technometrics, 36, 338-347] and does not require the data to be separable, while providing a generally unique decomposition. We optimize a trade-off between two objectives: we minimize the distance of the data points from the convex envelope of the archetypes (which can be interpreted as an empirical risk), while also minimizing the distance of the archetypes from the convex envelope of the data (which can be interpreted as a data-dependent regularization). The archetypal analysis method of Cutler and Breiman is recovered as the limiting case in which the last term is given infinite weight. We introduce a \"uniqueness condition\" on the data which is necessary for identifiability. We prove that, under uniqueness (plus additional regularity conditions on the geometry of the archetypes), our estimator is robust. While our approach requires solving a nonconvex optimization problem, we find that standard optimization methods succeed in finding good solutions for both real and synthetic data. for this article are available online","authors":[{"name":"Hamid Javadi"},{"id":"53f437e5dabfaeb22f47f3c9","name":"Andrea Montanari"}],"doi":"10.1080\u002F01621459.2019.1594832","id":"5d9edc4a47c8f7664603af86","num_citation":0,"order":1,"pages":{"end":"907.0","start":"896.0"},"pdf":"https:\u002F\u002Fcz5waila03cyo0tux1owpyofgoryrooa.oss-cn-beijing.aliyuncs.com\u002F60\u002F84\u002F51\u002F60845198724AC101E269A4319C75EB39.pdf","title":"Nonnegative Matrix Factorization Via Archetypal Analysis","venue":{"info":{"name":"JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION"},"issue":"530.0","volume":"115.0"},"versions":[{"id":"5d9edc4a47c8f7664603af86","sid":"2963884903","src":"mag","vsid":"62401924","year":2020},{"id":"5fc70c7be8bf8c1045340a10","sid":"WOS:000538423300033","src":"wos","vsid":"JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION","year":2020}],"year":2020},{"abstract":" For a certain scaling of the initialization of stochastic gradient descent (SGD), wide neural networks (NN) have been shown to be well approximated by reproducing kernel Hilbert space (RKHS) methods. Recent empirical work showed that, for some classification tasks, RKHS methods can replace NNs without a large loss in performance. On the other hand, two-layers NNs are known to encode richer smoothness classes than RKHS and we know of special examples for which SGD-trained NN provably outperform RKHS. This is true even in the wide network limit, for a different scaling of the initialization. How can we reconcile the above claims? For which tasks do NNs outperform RKHS? If feature vectors are nearly isotropic, RKHS methods suffer from the curse of dimensionality, while NNs can overcome it by learning the best low-dimensional representation. Here we show that this curse of dimensionality becomes milder if the feature vectors display the same low-dimensional structure as the target function, and we precisely characterize this tradeoff. Building on these results, we present a model that can capture in a unified framework both behaviors observed in earlier work. We hypothesize that such a latent low-dimensional structure is present in image classification. We test numerically this hypothesis by showing that specific perturbations of the training distribution degrade the performances of RKHS methods much more significantly than NNs. ","authors":[{"id":"562dfb5c45ce1e5967b9e474","name":"Behrooz Ghorbani"},{"id":"54325089dabfaeb54214e094","name":"Song Mei"},{"id":"540fc91edabfae450f4a49b8","name":"Theodor Misiakiewicz"},{"id":"53f437e5dabfaeb22f47f3c9","name":"Andrea Montanari"}],"doi":"10.1088\u002F1742-5468\u002Fac3a81","flags":[{"flag":"affirm_author","person_id":"53f437e5dabfaeb22f47f3c9"}],"id":"5ef476b691e01165a63bbb12","num_citation":36,"order":3,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fupload\u002Fpdf\u002F682\u002F764\u002F28\u002F5ef476b691e01165a63bbb12_1.pdf","title":"When Do Neural Networks Outperform Kernel Methods?","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.13409","https:\u002F\u002Fneurips.cc\u002FConferences\u002F2020\u002FAcceptedPapersInitial","https:\u002F\u002Fdblp.org\u002Frec\u002Fconf\u002Fnips\u002FGhorbaniMMM20","https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper\u002F2020\u002Fhash\u002Fa9df2255ad642b923d95503b9a7958d8-Abstract.html","https:\u002F\u002Fdblp.uni-trier.de\u002Fdb\u002Fjournals\u002Fcorr\u002Fcorr2006.html#abs-2006-13409","https:\u002F\u002Fwww.arxiv-vanity.com\u002Fpapers\u002F2006.13409\u002F","https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F2020\u002Fhash\u002Fa9df2255ad642b923d95503b9a7958d8-Abstract.html","https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F2020\u002Ffile\u002Fa9df2255ad642b923d95503b9a7958d8-Paper.pdf","http:\u002F\u002Fui.adsabs.harvard.edu\u002Fabs\u002F2020arXiv200613409G\u002Fabstract"],"venue":{"info":{"name":"NIPS 2020"},"volume":"33"},"versions":[{"id":"5ef476b691e01165a63bbb12","sid":"2006.13409","src":"arxiv","year":2021},{"id":"5f7fdd328de39f0828397bbc","sid":"neurips2020#547","src":"conf_neurips","year":2020},{"id":"5ff8844791e011c8326763a8","sid":"conf\u002Fnips\u002FGhorbaniMMM20","src":"dblp","vsid":"conf\u002Fnips","year":2020},{"id":"5ff68c39d4150a363cd1c3c0","sid":"3101787089","src":"mag","vsid":"1127325140","year":2020}],"year":2020},{"abstract":" Modern large-scale statistical models require to estimate thousands to millions of parameters. This is often accomplished by iterative algorithms such as gradient descent, projected gradient descent or their accelerated versions. What are the fundamental limits to these approaches? This question is well understood from an optimization viewpoint when the underlying objective is convex. Work in this area characterizes the gap to global optimality as a function of the number of iterations. However, these results have only indirect implications in terms of the gap to statistical optimality. Here we consider two families of high-dimensional estimation problems: high-dimensional regression and low-rank matrix estimation, and introduce a class of `general first order methods' that aim at efficiently estimating the underlying parameters. This class of algorithms is broad enough to include classical first order optimization (for convex and non-convex objectives), but also other types of algorithms. Under a random design assumption, we derive lower bounds on the estimation error that hold in the high-dimensional asymptotics in which both the number of observations and the number of parameters diverge. These lower bounds are optimal in the sense that there exist algorithms whose estimation error matches the lower bounds up to asymptotically negligible terms. We illustrate our general results through applications to sparse phase retrieval and sparse principal component analysis. ","authors":[{"name":"Celentano Michael"},{"id":"53f437e5dabfaeb22f47f3c9","name":"Montanari Andrea"},{"name":"Wu Yuchen"}],"id":"5e5cd97391e011498fe97377","num_citation":0,"order":1,"pages":{"end":"1141","start":"1078"},"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F20\u002F2002\u002F2002.12903.pdf","title":"The estimation error of general first order methods","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2002.12903","https:\u002F\u002Fwww.colt2020.org\u002Faccepted.html","https:\u002F\u002Fdblp.org\u002Frec\u002Fconf\u002Fcolt\u002FCelentanoMW20","http:\u002F\u002Fproceedings.mlr.press\u002Fv125\u002Fcelentano20a.html","https:\u002F\u002Fdblp.uni-trier.de\u002Fdb\u002Fjournals\u002Fcorr\u002Fcorr2002.html#abs-2002-12903"],"venue":{"info":{"name":"COLT"}},"versions":[{"id":"5e5cd97391e011498fe97377","sid":"2002.12903","src":"arxiv","year":2020},{"id":"5f068d7211dc830562232218","sid":"colt2020#104","src":"conf_colt","year":2020},{"id":"5f2541f491e0113474e36dda","sid":"conf\u002Fcolt\u002FCelentanoMW20","src":"dblp","vsid":"conf\u002Fcolt","year":2020},{"id":"5ff681e7d4150a363cb822dd","sid":"3046451915","src":"mag","vsid":"1177622950","year":2020}],"year":2020},{"abstract":" We study high-dimensional regression with missing entries in the covariates. A common strategy in practice is to \\emph{impute} the missing entries with an appropriate substitute and then implement a standard statistical procedure acting as if the covariates were fully observed. Recent literature on this subject proposes instead to design a specific, often complicated or non-convex, algorithm tailored to the case of missing covariates. We investigate a simpler approach where we fill-in the missing entries with their conditional mean given the observed covariates. We show that this imputation scheme coupled with standard off-the-shelf procedures such as the LASSO and square-root LASSO retains the minimax estimation rate in the random-design setting where the covariates are i.i.d.\\ sub-Gaussian. We further show that the square-root LASSO remains \\emph{pivotal} in this setting. It is often the case that the conditional expectation cannot be computed exactly and must be approximated from data. We study two cases where the covariates either follow an autoregressive (AR) process, or are jointly Gaussian with sparse precision matrix. We propose tractable estimators for the conditional expectation and then perform linear regression via LASSO, and show similar estimation rates in both cases. We complement our theoretical results with simulations on synthetic and semi-synthetic examples, illustrating not only the sharpness of our bounds, but also the broader utility of this strategy beyond our theoretical assumptions. ","authors":[{"name":"Chandrasekher Kabir Aladin"},{"id":"562d5f5845cedb3398ddd3fd","name":"Alaoui Ahmed El"},{"id":"53f437e5dabfaeb22f47f3c9","name":"Montanari Andrea"}],"id":"5e3006423a55ac0524a25a96","num_citation":0,"order":2,"pdf":"https:\u002F\u002Fstatic.aminer.cn\u002Fstorage\u002Fpdf\u002Farxiv\u002F20\u002F2001\u002F2001.09180.pdf","title":"Imputation for High-Dimensional Linear Regression","urls":["https:\u002F\u002Farxiv.org\u002Fabs\u002F2001.09180"],"versions":[{"id":"5e3006423a55ac0524a25a96","sid":"2001.09180","src":"arxiv","year":2020}],"year":2020}],"profilePubsTotal":356,"profilePatentsPage":1,"profilePatents":[],"profilePatentsTotal":4,"profilePatentsEnd":true,"profileProjectsPage":0,"profileProjects":null,"profileProjectsTotal":null,"newInfo":null,"checkDelPubs":[]}};