Learning with deep neural networks has been very successful recently in a wide variety of tasks. Even though the training loss is a complex non-convex objective, simple methods such as stochastic gradient descent (SGD) are able to find a global minima. More surprisingly, the solutions found by SGD have small test error, even though the problem is generally over-parametrized (model has more parameters than the number of samples). This is a result of the implicit regularization of the optimization algorithm (SGD), that results in picking a global minima with a smaller test error. To understand this generalization behavior and its relation to the optimization methods, we study the following two questions.