How can we use tools from signal processing to better understand neural networks?

By: Raja Giryes and Joan Bruna
Deep neural networks achieve state-of-the-art performance in many domains of signal processing. The common practice is to collect pairs of examples, each consisting of an input and its desired output, and then to train a network to reproduce these outputs, with the goal that the network will also generalize to new, unseen data, which is indeed the case in many scenarios.
 
One of the main questions asked these days is whether there is still a need for the “classical tools” of signal processing, or whether all we need is training data. In this short article, we provide several examples where classical techniques from signal processing help improve our understanding of neural networks and, in some cases, even lead to improved performance. In particular, we focus on the use of (i) splines to characterize the function space of neural networks; (ii) sampling theory to explain generalization and improve it; (iii) linear programming to explain the optimization; and (iv) spectral (Fourier) analysis to understand and improve network robustness to label noise.
 
Spline interpolation. A famous property of neural networks is the universal approximation theorem, which states that with sufficiently large neural networks one may approximate virtually any function [1, 2]. While this result shows that neural networks have strong expressive power, it does not explain the generalization ability of neural networks when they overfit the training data. In general, one might expect that when a network overfits the data, its performance on new, unseen data (the “test data”) should degrade. Yet, perhaps surprisingly, the opposite is often observed in practice.
 
Recently, the function space of neural networks with bounded weights, which is the common case when training a network, has been analyzed in various works [3, 4, 5, 6]. It was first shown in [3] that univariate shallow neural networks with infinite width and bounded weights perform a smooth interpolation of the data points, which is at least a first-order (linear) spline. In [4], it was shown that, depending on the parameterization of the network, one may even obtain a guarantee of second-order spline interpolation of the training data in the same setting. In [5], the result of [3] was extended to shallow networks with multidimensional input, and in [6] the case of finite-width networks was analyzed.
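 
To make the univariate claim concrete, here is a minimal numerical sketch (our own toy illustration, not the construction used in [3]-[6]): a wide, shallow ReLU network is trained with a small weight decay on a handful of 1D points, and the learned function is compared with the first-order (linear) spline interpolant of the same points. The width (2000 neurons), the weight-decay value, and the optimizer settings are arbitrary choices made for illustration only.

import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)

# A handful of 1D training points.
x_train = torch.tensor([[-1.0], [-0.4], [0.1], [0.6], [1.0]])
y_train = torch.tensor([[0.2], [-0.5], [0.9], [0.3], [-0.1]])

# Wide, shallow ReLU network trained with a small weight decay (bounded-norm regime).
net = nn.Sequential(nn.Linear(1, 2000), nn.ReLU(), nn.Linear(2000, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-6)
for _ in range(5000):
    opt.zero_grad()
    ((net(x_train) - y_train) ** 2).mean().backward()
    opt.step()

# Compare the learned function with the first-order (linear) spline interpolant.
x_grid = torch.linspace(-1, 1, 201).unsqueeze(1)
with torch.no_grad():
    y_net = net(x_grid).squeeze().numpy()
y_spline = np.interp(x_grid.squeeze().numpy(),
                     x_train.squeeze().numpy(), y_train.squeeze().numpy())
print("max |network - linear spline| on [-1, 1]:", np.abs(y_net - y_spline).max())

The final print reports how far the trained network is from the linear spline interpolant on a dense grid, making the spline-like behavior discussed above easy to inspect.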
 
Sampling theory. The fact that neural networks perform a smooth interpolation has been used to explain their ability to generalize even when they overfit the data. Using tools from sampling theory, it has been shown that if the data is generated by a band-limited mapping (or, more generally, a sufficiently smooth one), then the error of the trained network on unseen data decays rapidly, on the order of 1/n^3, where n is the number of training examples [6]. This analysis may explain the intriguing ability of neural networks to overfit and generalize at the same time.
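 
As a crude empirical probe of this statement (not the analysis of [6]; the band-limited target, architecture, and training budget below are our own arbitrary choices), one can fit a small network to n samples of a band-limited mapping and watch how the error on unseen points shrinks as n grows.

import math
import torch
import torch.nn as nn

torch.manual_seed(0)

def target(x):
    # A band-limited mapping: a few low-frequency sinusoids.
    return torch.sin(2 * math.pi * x) + 0.5 * torch.cos(4 * math.pi * x)

x_test = torch.linspace(0, 1, 1000).unsqueeze(1)
y_test = target(x_test)

for n in [10, 20, 40, 80]:
    x_train = torch.linspace(0, 1, n).unsqueeze(1)
    y_train = target(x_train)
    net = nn.Sequential(nn.Linear(1, 512), nn.ReLU(), nn.Linear(512, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-7)
    for _ in range(4000):                      # fit (and over-fit) the n samples
        opt.zero_grad()
        ((net(x_train) - y_train) ** 2).mean().backward()
        opt.step()
    with torch.no_grad():
        err = (net(x_test) - y_test).abs().max().item()
    print(f"n = {n:3d}: max error on unseen points = {err:.4f}")

The observed decay depends on the width, the optimizer, and the training budget; the sketch is only meant to show how such a claim can be probed numerically.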
 
In a different context, it was shown that sampling theory can be used to improve the robustness of networks to the way their input is generated. In particular, in the task of image super-resolution, neural networks achieve remarkable quality in the generated images. Yet, these networks typically rely heavily on the way their low-resolution inputs were originally subsampled; specifically, most of them assume the inputs were generated by bicubic downsampling. When given an image sampled with a different kernel, their results degrade significantly. Using principles of sampling theory, it was shown that applying a simple kernel correction to the input image improves the quality of the reconstruction remarkably [7].
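 
The following is a toy 1D sketch of the kernel-correction idea only; it is not the algorithm of [7], which designs a correction filter for the already-downsampled low-resolution image. Here we simply map a signal blurred with a "wrong" Gaussian kernel close to the signal blurred with the "assumed" anti-aliasing kernel, via a regularized division in the frequency domain. All kernel and signal choices are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n = 256
x = np.cumsum(rng.standard_normal(n))       # a smooth-ish 1D test "image"

def blur(sig, kernel):
    # Circular convolution via the DFT (both kernels are padded the same way,
    # so any common delay cancels out in the comparisons below).
    return np.real(np.fft.ifft(np.fft.fft(sig) * np.fft.fft(kernel, n)))

t = np.arange(-8, 9)
k_assumed = np.sinc(t / 2) * np.hanning(len(t))   # "assumed" anti-aliasing kernel
k_assumed /= k_assumed.sum()
k_actual = np.exp(-t**2 / (2 * 2.0**2))           # the kernel actually used
k_actual /= k_actual.sum()

y_assumed = blur(x, k_assumed)   # what the network was trained to expect
y_actual = blur(x, k_actual)     # what we actually observe

# Correction filter in the frequency domain: a regularized
# "undo the actual kernel, apply the assumed one".
K_assumed = np.fft.fft(k_assumed, n)
K_actual = np.fft.fft(k_actual, n)
H = K_assumed * np.conj(K_actual) / (np.abs(K_actual) ** 2 + 1e-3)
y_corrected = np.real(np.fft.ifft(np.fft.fft(y_actual) * H))

print("mismatch before correction:", np.abs(y_actual - y_assumed).max())
print("mismatch after  correction:", np.abs(y_corrected - y_assumed).max())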
 
Linear programming and representer theorems. Neural networks are non-parametric models enjoying universal approximation over the class of continuous functions, similarly to reproducing kernel Hilbert spaces. Fitting such neural networks on a finite dataset of n data points in dimension d thus naturally defines two distinct regimes: the overparametrised (resp. underparametrised) regime, corresponding to the setting in which the number of neurons is larger (resp. smaller) than the number of data points. Several authors have noted remarkable phenomena around this transition, such as the so-called double descent [8, 9, 10, 11], leading to the apparent paradox of good generalization capabilities even as the number of neurons grows indefinitely [8].
 
A possible hypothesis supporting this behavior comes from the implicit regularisation built in by gradient descent methods when solving the empirical risk minimisation problem. Weight decay is a popular regularisation strategy that penalizes the squared L2 norm of the neurons' weights. In the context of shallow ReLU networks, the resulting training problem can be formulated in terms of a probability measure over the neuron parameters [5, 12], leading to a sparse regularised convex program over a ‘continuous’ dictionary {φ_θ ; θ ∈ Θ}, where Θ ⊆ R^d is the space of neuron parameters and φ_θ(x) = σ(⟨x, θ⟩) is a single neuron. Classic results from convex geometry assert that such convex programs admit sparse solutions with at most n atoms [13, 14, 15], the so-called representer theorem. In fact, it was recently shown in [16] that all solutions of this program are sparse, therefore consisting of only a finite number of neurons, irrespective of the amount of overparameterization. In other words, training shallow ReLU networks in the overparametrised regime amounts to solving a linear program in finite dimension, given by the hyperplane arrangement generated by the dataset when identifying a point x ∈ R^d with a hyperplane in the dual. Although this reduction is not computationally useful (since the size of the hyperplane arrangement is O(n^d)), studying properties of the dataset that allow for substantially smaller arrangements is a promising direction for future research.
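 
Schematically, and omitting normalizations and constants (this is a paraphrase in the signed-measure/total-variation form, not a verbatim statement from [5, 12, 16]), the weight-decay training problem for a width-m shallow ReLU network and its infinite-width, measure-space counterpart read:

\min_{\{(a_j,\theta_j)\}_{j=1}^{m}} \;
  \sum_{i=1}^{n} \ell\Big(\sum_{j=1}^{m} a_j\,\sigma(\langle x_i,\theta_j\rangle),\, y_i\Big)
  \;+\; \frac{\lambda}{2}\sum_{j=1}^{m}\big(a_j^{2}+\|\theta_j\|^{2}\big)

\min_{\mu\in\mathcal{M}(\Theta)} \;
  \sum_{i=1}^{n} \ell\Big(\int_{\Theta}\sigma(\langle x_i,\theta\rangle)\,d\mu(\theta),\, y_i\Big)
  \;+\; \lambda\,\|\mu\|_{\mathrm{TV}}

Representer-type results [13, 14, 15] then guarantee minimizers of the measure-space program of the form \mu^{\star}=\sum_{k=1}^{K} c_k\,\delta_{\theta_k} with K \le n, matching the sparsity discussed above.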
 
Spectral/Fourier analysis. Deep networks usually require a massive amount of labeled data for training, and such data may include mistakes in the labels. Interestingly, networks have been shown to be robust to such errors. Analyzing the function space of neural networks in the spectral domain has provided an explanation for this robustness: neural networks tend to learn smooth functions, i.e., functions whose spectrum decays rapidly [17, 18, 19, 20]. This has been related both to the fact that networks have bounded weights, as discussed above, and to the structure of the network [21]. Noisy labels mainly affect the high frequencies of the learned function, so the attenuation of high frequencies by neural networks may explain their robustness to label noise. This understanding has led to the use of spectral regularization to further improve network robustness to label noise [22].
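 
A small numerical sketch of this spectral bias (our own illustration; the regularizer of [22] is not implemented here): train a small network on 1D data whose targets are corrupted by noise, then compare the high-frequency energy of the learned function with that of the noisy targets. The cutoff bin, architecture, and training budget are arbitrary illustrative choices.

import math
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
n = 256
x = torch.linspace(0, 1, n).unsqueeze(1)
y_clean = torch.sin(2 * math.pi * x)                    # smooth "clean" target
y_noisy = y_clean + 0.3 * torch.randn_like(y_clean)     # stand-in for label noise

net = nn.Sequential(nn.Linear(1, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(2000):                                   # a moderate training budget
    opt.zero_grad()
    ((net(x) - y_noisy) ** 2).mean().backward()
    opt.step()

with torch.no_grad():
    y_net = net(x).squeeze().numpy()

def high_freq_fraction(sig, cutoff=10):
    # Fraction of spectral magnitude above a chosen cutoff bin.
    spec = np.abs(np.fft.rfft(sig))
    return spec[cutoff:].sum() / spec.sum()

print("high-frequency fraction, noisy targets   :",
      high_freq_fraction(y_noisy.squeeze().numpy()))
print("high-frequency fraction, learned function:",
      high_freq_fraction(y_net))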
 
Summary. We have briefly discussed how classic signal processing tools can help in understanding and improving deep neural networks. The directions above are only examples; other works have applied additional signal processing tools to deep learning, such as sparse representations [23], max-affine splines [24], and wavelets [25]. We hope that others in the research community will be encouraged to pursue such directions as well.
 

References:

  1. K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, July 1989.
  2. G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems, 2:303–314, 1989.
  3. Pedro Savarese, Itay Evron, Daniel Soudry, and Nathan Srebro. How do infinite width bounded norm networks look in function space? In Conference on Learning Theory, pages 2667–2690, 2019.
  4. Francis Williams, Matthew Trager, Daniele Panozzo, Claudio Silva, Denis Zorin, and Joan Bruna. Gradient dynamics of shallow univariate relu networks. In Advances in Neural Information Processing Systems 32, pages 8376–8385, 2019.
  5. Greg Ongie, Rebecca Willett, Daniel Soudry, and Nathan Srebro. A function space view of bounded norm infinite width ReLU nets: The multivariate case. In International Conference on Learning Representations (ICLR), 2020.
  6. Raja Giryes. A function space analysis of finite neural networks with insights from sampling theory. CoRR, abs/2004.06989, 2020.
  7. Shady Abu Hussein, Tom Tirer, and Raja Giryes. Correction filter for single image super-resolution: Robustifying off-the-shelf deep super-resolvers. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  8. Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning practice and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.
  9. Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
  10. Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
  11. Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 2020.
  12. Lenaic Chizat. Sparse optimization on measures with over-parameterized gradient descent. arXiv preprint arXiv:1907.10300, 2019.
  13. S. Zuhovickii. Remarks on problems in approximation theory. Mat. Zbirnik KDU, 1948.
  14. S. D. Fisher and Joseph W. Jerome. Spline solutions to L1 extremal problems in one and several variables. Journal of Approximation Theory, 13(1):73–83, 1975.
  15. Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.
  16. Jaume de Dios and Joan Bruna. On sparsity in overparametrised shallow relu networks. arXiv preprint arXiv:2006.10225, 2020.
  17. Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 5301–5310, 2019.
  18. Zhi-Qin John Xu, Yaoyu Zhang, and Yanyang Xiao. Training behavior of deep neural network in frequency domain. In International Conference on Neural Information Processing, pages 264–274, 2019.
  19. Ronen Basri, David Jacobs, Yoni Kasten, and Shira Kritchman. The convergence rate of neural networks for learned functions of different frequencies. In Advances in Neural Information Processing Systems (NeurIPS), pages 4761–4771, 2019.
  20. Ronen Basri, Meirav Galun, Amnon Geifman, David Jacobs, Yoni Kasten, and Shira Kritchman. Frequency bias in neural networks for input of non-uniform density. In ICML, 2020.
  21. Reinhard Heckel and Mahdi Soltanolkotabi. Denoising and regularization via exploiting the structural bias of convolutional generators. In International Conference on Learning Representations (ICLR), 2020.
  22. Oshrat Bar, Amnon Drory, and Raja Giryes. A function space spectral perspective of neural network robustness to label noise. CoRR, 2020.
  23. Vardan Papyan, Yaniv Romano, and Michael Elad. Convolutional neural networks analyzed via convolutional sparse coding. The Journal of Machine Learning Research, 18(1):2887–2938, 2017.
  24. Randall Balestriero and Richard Baraniuk. A spline theory of deep learning. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 374–383, 2018.
  25. Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.
