Fantastic Generalization Measures and Where to Find Them
Generalization of deep networks has been of great interest in recent years, resulting in a number of theoretically and empirically motivated complexity measures. However, most papers proposing such measures study only a small set of models, leaving open the question of whether the conclusions drawn from those experiments would remain valid in other settings. We present the first large-scale study of generalization in deep networks. We investigate more than 40 complexity measures taken from both theoretical bounds and empirical studies. We train over 10,000 convolutional networks by systematically varying commonly used hyperparameters. Hoping to uncover potentially causal relationships between each measure and generalization, we analyze carefully controlled experiments and show surprising failures of some measures as well as promising measures for further research.
https://arxiv.org/pdf/1912.02178.pdf
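To make "complexity measure" concrete, here is a minimal sketch (my illustration, not the paper's protocol) of one simple norm-based measure, the product of Frobenius norms of the weight matrices, for a toy PyTorch model; the model and the specific choice of norm are assumptions for illustration.
```python
# Minimal sketch (not the paper's protocol): a norm-based complexity measure,
# here the product of Frobenius norms of all weight matrices, for a toy model.
# In several theoretical bounds, larger values of such measures are associated
# with looser generalization guarantees.
import torch.nn as nn

model = nn.Sequential(              # illustrative toy network
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

def product_of_frobenius_norms(net: nn.Module) -> float:
    """Product of ||W||_F over all weight tensors (biases ignored)."""
    prod = 1.0
    for param in net.parameters():
        if param.dim() >= 2:        # weight matrices / conv kernels only
            prod *= param.norm(p="fro").item()
    return prod

print("complexity (product of Frobenius norms):",
      product_of_frobenius_norms(model))
```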
On Characterizing the Capacity of Neural Networks using Algebraic Topology
The learnability of different neural architectures can be characterized directly by computable measures of data complexity. In this paper, we reframe the problem of architecture selection as understanding how data determines the most expressive and generalizable architectures suited to that data, beyond inductive bias. After suggesting algebraic topology as a measure for data complexity, we show that the power of a network to express the topological complexity of a dataset in its decision region is a strictly limiting factor in its ability to generalize. We then provide the first empirical characterization of the topological capacity of neural networks. Our empirical analysis shows that at every level of dataset complexity, neural networks exhibit topological phase transitions.
https://arxiv.org/pdf/1802.04443.pdf
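As a rough illustration of using algebraic topology as a data-complexity measure, the sketch below estimates persistent homology of a toy dataset; the ripser package and the lifetime threshold are my assumptions, not the paper's pipeline.
```python
# Minimal sketch of measuring "topological complexity" of a dataset with
# persistent homology. Long-lived H_1 bars roughly count loops/holes.
import numpy as np
from ripser import ripser   # assumption: ripser is installed

rng = np.random.default_rng(0)
# Toy dataset: a noisy circle, i.e. a single prominent 1-dimensional hole.
theta = rng.uniform(0, 2 * np.pi, size=300)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(300, 2))

dgms = ripser(X, maxdim=1)["dgms"]      # persistence diagrams for H_0 and H_1
h1 = dgms[1]
lifetimes = h1[:, 1] - h1[:, 0]
print("approx. number of significant loops (b_1):",
      int(np.sum(lifetimes > 0.5)))     # 0.5 is an ad-hoc threshold
```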
Interpreting the Latent Space of GANs for Semantic Face Editing
Despite the recent advances of Generative Adversarial Networks (GANs) in high-fidelity image synthesis, little is understood about how GANs are able to map a latent code sampled from a random distribution to a photorealistic image. Previous work assumes the latent space learned by GANs follows a distributed representation but observes the vector arithmetic phenomenon. In this work, we propose a novel framework, called InterFaceGAN, for semantic face editing by interpreting the latent semantics learned by GANs. In this framework, we conduct a detailed study on how different semantics are encoded in the latent space of GANs for face synthesis. We find that the latent code of well-trained generative models actually learns a disentangled representation after linear transformations. We explore the disentanglement between various semantics and manage to decouple some entangled semantics with subspace projection.
https://arxiv.org/abs/1907.1078
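A minimal sketch of the editing primitive this line of work builds on: fit a linear boundary for a binary attribute in latent space, then move a latent code along the boundary normal. The attribute labels, SVM probe, and latent dimensionality below are illustrative stand-ins, not InterFaceGAN's actual setup.
```python
# Sketch of latent-space editing in the spirit of InterFaceGAN: find a
# hyperplane normal separating an attribute, then push codes across it.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 512))                  # latent codes (e.g. GAN z-space)
y = (Z[:, 0] + 0.1 * rng.normal(size=1000) > 0)   # synthetic attribute labels

svm = LinearSVC(C=1.0).fit(Z, y)
n = svm.coef_[0] / np.linalg.norm(svm.coef_[0])   # unit normal = semantic direction

z = rng.normal(size=512)
alpha = 3.0                                       # editing strength
z_edit = z + alpha * n                            # move the code across the boundary
# In a real pipeline, z_edit would be fed to the pretrained generator.
```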
A Geometrical Perspective on Image Style Transfer with Adversarial Learning
Recent years have witnessed a booming trend of applying Generative Adversarial Nets (GANs) and their variants to image style transfer. Although many reported results strongly demonstrate the power of GANs on this task, little is known about either the interpretation of several fundamental phenomena of image style transfer by generative adversarial learning or its underlying mechanism. To bridge this gap, this paper presents a general framework for analyzing style transfer with adversarial learning through the lens of differential geometry. To demonstrate the utility of our proposed framework, we provide an in-depth analysis of Isola et al.'s pioneering style transfer model pix2pix [1] and reach a comprehensive interpretation of their major experimental phenomena. Furthermore, we extend the notion of generalization to conditional GANs and derive a condition to control the generalization capability of the pix2pix model.
Forwarded from Высшая школа экономики
The winners of the Global AI Challenge, an international artificial-intelligence hackathon, have been announced. One of the teams that took third place included Higher School of Economics doctoral students German Magay, Dmitry Kiselev, and Maxim Beketov.
Congratulations!
Data Interpolations in Deep Generative Models under Non-Simply-Connected Manifold Topology
Exploiting the deep generative model's remarkable ability to learn the data-manifold structure, several recent studies have proposed a geometric data interpolation method based on geodesic curves on the learned data manifold. However, this interpolation method often gives poor results due to a topological difference between the model and the dataset. The model defines a family of simply connected manifolds, whereas the dataset generally contains disconnected regions or holes that make it non-simply connected. To compensate for this difference, we propose a novel density regularizer that makes the interpolation path circumvent the holes, which are marked by low probability density. We confirm that our method gives consistently better interpolation results in experiments with real-world image datasets.
https://arxiv.org/pdf/1901.08553.pdf
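A hedged sketch of the general recipe: discretize the latent path, then minimize curve length in output space plus a penalty on low-density regions. The decoder and log-density below are toy stand-ins for a trained generative model, not the paper's regularizer.
```python
# Sketch of density-regularized interpolation:
#   loss = (path length in output space) + lambda * (low-density penalty).
import torch

def decoder(z):                       # stand-in for a trained decoder g(z)
    return torch.tanh(z @ torch.ones(2, 8))

def log_density(z):                   # stand-in for the model's log p(z)
    return -0.5 * (z ** 2).sum(dim=-1)

z0, z1 = torch.randn(2), torch.randn(2)
T = 16
path = torch.stack([z0 + (z1 - z0) * t / (T - 1) for t in range(T)])
interior = path[1:-1].clone().requires_grad_(True)   # optimize interior points
opt = torch.optim.Adam([interior], lr=1e-2)

for _ in range(200):
    pts = torch.cat([z0[None], interior, z1[None]], dim=0)
    x = decoder(pts)
    length = ((x[1:] - x[:-1]) ** 2).sum()               # geodesic-energy term
    density_penalty = (-log_density(interior)).mean()    # keep path out of "holes"
    loss = length + 0.1 * density_penalty
    opt.zero_grad(); loss.backward(); opt.step()
```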
SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient
In this paper, we propose a sequence generation framework, called SeqGAN, to address the difficulties of applying GANs to discrete sequence generation. Modeling the data generator as a stochastic policy in reinforcement learning (RL), SeqGAN bypasses the generator differentiation problem by directly performing the policy gradient update. The RL reward signal comes from the GAN discriminator judged on a complete sequence, and is passed back to the intermediate state-action steps using Monte Carlo search. Extensive experiments on synthetic data and real-world tasks demonstrate significant improvements over strong baselines.
https://arxiv.org/pdf/1609.05473.pdf
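A minimal, self-contained sketch of the training signal described above, with a toy token-level policy and a hand-written discriminator standing in for SeqGAN's actual RNN generator and CNN discriminator:
```python
# Sketch: generator = stochastic policy over tokens; discriminator scores only
# complete sequences; intermediate rewards are Monte Carlo rollout estimates.
import torch
import torch.nn.functional as F

VOCAB, SEQ_LEN, ROLLOUTS = 8, 6, 16
logits = torch.zeros(SEQ_LEN, VOCAB, requires_grad=True)   # toy "generator"
opt = torch.optim.Adam([logits], lr=0.05)

def discriminator(seq):             # toy stand-in: rewards sequences of even tokens
    return (seq % 2 == 0).float().mean()

def rollout(prefix):                # complete a prefix by sampling from the policy
    seq = list(prefix)
    for t in range(len(prefix), SEQ_LEN):
        seq.append(torch.multinomial(F.softmax(logits[t], dim=-1), 1).item())
    return torch.tensor(seq)

for step in range(200):
    tokens, log_probs, rewards = [], [], []
    for t in range(SEQ_LEN):
        probs = F.softmax(logits[t], dim=-1)
        a = torch.multinomial(probs, 1).item()
        tokens.append(a)
        log_probs.append(torch.log(probs[a]))
        # Monte Carlo estimate of the expected final reward after choosing a.
        r = torch.stack([discriminator(rollout(tokens)) for _ in range(ROLLOUTS)]).mean()
        rewards.append(r)
    loss = -torch.stack([lp * r for lp, r in zip(log_probs, rewards)]).sum()
    opt.zero_grad(); loss.backward(); opt.step()

# Greedy decode after training (tends toward even tokens in this toy setup).
print(F.softmax(logits, dim=-1).argmax(dim=-1).tolist())
```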
Universal Joint Approximation of Manifolds and Densities by Simple Injective Flows
We study approximation of probability measures supported on n-dimensional manifolds embedded in R^m by injective flows -- neural networks composed of invertible flows and injective layers. We show that in general, injective flows between R^n and R^m universally approximate measures supported on images of extendable embeddings, which are a subset of standard embeddings: when the embedding dimension m is small, topological obstructions may preclude certain manifolds as admissible targets. When the embedding dimension is sufficiently large, m \geq 3n+1, we use an argument from algebraic topology known as the clean trick to prove that the topological obstructions vanish and injective flows universally approximate any differentiable embedding. Along the way we show that the studied injective flows admit efficient projections on the range, and that their optimality can be established "in reverse."
https://arxiv.org/abs/2110.04227
Can Neural Nets Learn the Same Model Twice? Investigating Reproducibility and Double Descent from the Decision Boundary Perspective
We discuss methods for visualizing neural network decision boundaries and decision regions. We use these visualizations to investigate issues related to reproducibility and generalization in neural network training. We observe that changes in model architecture (and its associated inductive bias) cause visible changes in decision boundaries, while multiple runs with the same architecture yield results with strong similarities, especially in the case of wide architectures. We also use decision boundary methods to visualize double descent phenomena. We see that decision boundary reproducibility depends strongly on model width. Near the threshold of interpolation, neural network decision boundaries become fragmented into many small decision regions, and these regions are non-reproducible. Meanwhile, very narrow and very wide networks exhibit high reproducibility.
https://arxiv.org/abs/2203.08124
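A minimal sketch of the visualization idea, with a toy classifier and 2-D data rather than the paper's setup of planes sliced through triplets of training images:
```python
# Sketch: evaluate a trained classifier on a dense 2-D grid and colour the
# predicted class to reveal its decision regions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=2000,
                    random_state=0).fit(X, y)

xx, yy = np.meshgrid(np.linspace(-2, 3, 300), np.linspace(-1.5, 2, 300))
grid = np.c_[xx.ravel(), yy.ravel()]
pred = clf.predict(grid).reshape(xx.shape)

plt.contourf(xx, yy, pred, alpha=0.3)          # decision regions
plt.scatter(X[:, 0], X[:, 1], c=y, s=10)       # training points
plt.title("Decision regions of a small MLP (toy example)")
plt.show()
```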
Analyzing the Latent Space of GAN through Local Dimension Estimation
Building upon this observation, we propose a local dimension estimation algorithm for an arbitrary intermediate layer in a pre-trained GAN model. The estimated intrinsic dimension corresponds to the number of disentangled local perturbations. In this perspective, we analyze the intermediate layers of the mapping network in StyleGANs. Our analysis clarifies the success of W-space in StyleGAN and suggests an alternative. Moreover, the intrinsic dimension estimation opens the possibility of unsupervised evaluation of global-basis compatibility and disentanglement for a latent space. Our proposed metric, called Distortion, measures the inconsistency of the intrinsic tangent space on the learned latent space. The metric is purely geometric and does not require any additional attribute information. Nevertheless, the metric shows a high correlation with global-basis compatibility and the supervised disentanglement score.
https://arxiv.org/abs/2205.13182
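One plausible reading of "local dimension estimation" as code: push local perturbations at a latent point through a mapping layer and count the singular values of the local Jacobian that carry most of the variance. The random MLP and the 95% variance threshold below are assumptions, not the paper's algorithm.
```python
# Sketch of local intrinsic dimension estimation at a latent point via the
# singular values of the local Jacobian of a (stand-in) mapping layer.
import torch

g = torch.nn.Sequential(torch.nn.Linear(64, 256), torch.nn.Tanh(),
                        torch.nn.Linear(256, 512))    # stand-in mapping layer

z = torch.randn(64)
J = torch.autograd.functional.jacobian(g, z)           # (512, 64) local Jacobian
s = torch.linalg.svdvals(J)
energy = torch.cumsum(s ** 2, dim=0) / torch.sum(s ** 2)
local_dim = int(torch.searchsorted(energy, torch.tensor(0.95))) + 1
print("estimated local dimension (95% of variance):", local_dim)
```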
Forwarded from Мишин Лернинг 🇺🇦🇮🇱
🌊 A prompt-engineering lifehack for DALL·E and Imagen. Baba Lyuba gets excellent pictures: all it takes is adding the following to every generation..
This wild prompt-engineering trick can be described by the following formula (a runnable version appears after this post):
👉 “A ” + “very “*n + “beautiful painting of “ + your_prompt, where n is an integer >= 0
The author of the idea used the formula to generate a waterfall in the mountains.
🐥 First picture, n = 0: "A beautiful painting of a mountain next to a waterfall." The network understands the description perfectly but produces a dull image with no detail.
🐱 Second picture, n = 1: "A very beautiful painting of a mountain next to a waterfall." The network generates a richer image and, most importantly, crisp details of the mountains, greenery, and trees appear.
🧠 Third picture, n = 6: “A very very very very very very beautiful painting of a mountain next to a waterfall." The generations are very beautiful, with a wealth of fine details, objects, reflections, and weather effects.
🤖 Fourth picture, n = 22: it feels as if every graphics setting were cranked up to ultra!
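The trick, written out as a tiny prompt builder (the function name is mine):
```python
def very_prompt(prompt: str, n: int = 0) -> str:
    """Build 'A ' + 'very ' * n + 'beautiful painting of ' + prompt."""
    return "A " + "very " * n + "beautiful painting of " + prompt

print(very_prompt("a mountain next to a waterfall", n=6))
# -> "A very very very very very very beautiful painting of a mountain next to a waterfall"
```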
COLT 2021 papers:
1) Size and Depth Separation in Approximating Benign Functions with Neural Networks - https://arxiv.org/pdf/2102.00314.pdf
As we show, this problem is more challenging than the corresponding problem for non-benign functions. We give complexity-theoretic barriers to showing depth lower bounds: proving the existence of a benign function that cannot be approximated by polynomial-sized networks of depth 4 would settle longstanding open problems in computational complexity.
2) A law of robustness for two-layers neural networks - https://arxiv.org/pdf/2009.14444.pdf
We initiate the study of the inherent tradeoffs between the size of a neural network and its robustness, as measured by its Lipschitz constant. We make a precise conjecture that, for any Lipschitz activation function and for most datasets, any two-layer neural network with k neurons that perfectly fits the data must have its Lipschitz constant larger (up to a constant) than √(n/k), where n is the number of datapoints; a restatement of this inequality appears after the list.
3) On the Approximation Power of Two-Layer Networks of Random ReLUs - https://arxiv.org/pdf/2102.02336.pdf
This paper considers the following question: how well can depth-two ReLU networks with randomly initialized bottom-level weights represent smooth functions? We give near-matching upper and lower bounds for L2-approximation in terms of the Lipschitz constant, the desired accuracy, and the dimension of the problem, as well as similar results in terms of Sobolev norms.
4) The Connection Between Approximation, Depth Separation and Learnability in Neural Networks - https://arxiv.org/pdf/2102.00434.pdf
In this work we study the intricate connection between learnability and approximation capacity. We show that learnability with deep networks of a target function depends on the ability of simpler classes to approximate the target. Specifically, we show that a necessary condition for a function to be learnable by gradient descent on deep neural networks is that shallow neural networks can approximate it, at least in a weak sense.
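For reference, a hedged restatement of the law-of-robustness conjecture from paper 2, paraphrased with constants omitted:
```latex
% Paraphrase of the conjecture (constants omitted): a two-layer network f with
% k neurons that perfectly fits n datapoints cannot be too smooth.
\[
  f(x_i) = y_i \ \text{for all } i = 1, \dots, n
  \quad \Longrightarrow \quad
  \operatorname{Lip}(f) \ \gtrsim \ \sqrt{\frac{n}{k}} .
\]
```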
Torsional Diffusion for Molecular Conformer Generation
Diffusion-based generative models generate samples by mapping noise to data via the reversal of a diffusion process that typically consists of independent Gaussian noise in every data coordinate. This diffusion process is, however, not well suited to the fundamental task of molecular conformer generation, where the degrees of freedom differentiating conformers lie mostly in torsion angles. We therefore propose Torsional Diffusion, which generates conformers by leveraging a diffusion process defined over the space T^m, a high-dimensional torus representing torsion angles, and an SE(3)-equivariant model capable of accurately predicting the score over this process. Empirically, we demonstrate that our model outperforms state-of-the-art methods in terms of both diversity and precision of generated conformers, reducing the mean minimum RMSD by 31% and 17%, respectively.
https://openreview.net/forum?id=D9IxPlXPJJS
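A sketch of only the noising side of the idea: torsion angles live on a torus, so a forward diffusion step adds Gaussian noise and wraps the result back into [-pi, pi). This is not the paper's SE(3)-equivariant score model or its sampler.
```python
# Sketch of one forward diffusion (noising) step on the torus T^m.
import numpy as np

def wrap(angles):
    """Map angles to the canonical interval [-pi, pi)."""
    return (angles + np.pi) % (2 * np.pi) - np.pi

def forward_diffuse(torsions, sigma, rng):
    """Add Gaussian noise to torsion angles, then wrap back onto the torus."""
    return wrap(torsions + sigma * rng.normal(size=torsions.shape))

rng = np.random.default_rng(0)
torsions = rng.uniform(-np.pi, np.pi, size=7)   # m = 7 rotatable bonds (toy)
print(forward_diffuse(torsions, sigma=0.5, rng=rng))
```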
A Geometric Analysis of Neural Collapse with Unconstrained Features
We provide the first global optimization landscape analysis of Neural Collapse -- an intriguing empirical phenomenon that arises in the last-layer classifiers and features of neural networks during the terminal phase of training. As recently reported by Papyan et al., this phenomenon implies that (i) the class means and the last-layer classifiers all collapse to the vertices of a Simplex Equiangular Tight Frame (ETF) up to scaling, and (ii) cross-example within-class variability of last-layer activations collapses to zero. We study the problem based on a simplified unconstrained feature model, which isolates the topmost layers from the classifier of the neural network. In this context, we show that the classical cross-entropy loss with weight decay has a benign global landscape, in the sense that the only global minimizers are the Simplex ETFs.
https://arxiv.org/abs/2105.02375
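A tiny numerical illustration of the Simplex ETF geometry the result refers to: K unit vectors whose pairwise cosine similarity is exactly -1/(K-1).
```python
# Construct the standard K-class Simplex Equiangular Tight Frame and check
# that all off-diagonal cosine similarities equal -1/(K-1).
import numpy as np

K = 5
M = np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)   # simplex ETF
M = M / np.linalg.norm(M, axis=0, keepdims=True)               # columns = class means
cos = M.T @ M
print(np.round(cos, 3))   # off-diagonal entries are all -1/(K-1) = -0.25 here
```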
Redundant representations help generalization in wide neural networks
Deep neural networks (DNNs) defy the classical bias-variance trade-off: adding parameters to a DNN that interpolates its training data will typically improve its generalization performance. Explaining the mechanism behind this "benign overfitting" in deep networks remains an outstanding challenge. Here, we study the last hidden layer representations of various state-of-the-art convolutional neural networks and find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise. The number of such groups increases linearly with the width of the layer, but only if the width is above a critical value. We show that redundant neurons appear only when the training process reaches interpolation and the training error is zero.
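One simple way to probe for such redundant groups (my sketch with synthetic activations, not the paper's analysis): correlate last-layer neurons over a batch and cluster the highly correlated ones.
```python
# Sketch: find groups of last-layer neurons that carry (nearly) identical
# information by hierarchical clustering of their pairwise correlations.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
signal = rng.normal(size=(1000, 4))                        # 4 underlying "groups"
acts = np.repeat(signal, 8, axis=1) + 0.1 * rng.normal(size=(1000, 32))

corr = np.corrcoef(acts, rowvar=False)                     # (32, 32) neuron correlations
dist = np.clip(1.0 - np.abs(corr), 0.0, None)              # correlated neurons are "close"
condensed = dist[np.triu_indices(32, k=1)]                 # condensed distance matrix
labels = fcluster(linkage(condensed, method="average"), t=0.5, criterion="distance")
print("number of redundant groups found:", len(set(labels)))
```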
Prevalence of neural collapse during the terminal phase of deep learning training
Modern deep neural networks for image classification have achieved superhuman performance. Yet, the complex details of trained networks have forced most practitioners and researchers to regard them as black boxes with little that could be understood. This paper considers in detail a now-standard training methodology: driving the cross-entropy loss to zero, continuing long after the classification error is already zero. Applying this methodology to an authoritative collection of standard deepnets and datasets, we observe the emergence of a simple and highly symmetric geometry of the deepnet features and of the deepnet classifier, and we document important benefits that the geometry conveys—thereby helping us understand an important component of the modern deep learning training paradigm.
https://www.pnas.org/doi/10.1073/pnas.2015509117