[GIF attachment: neural_collapse (1).gif, 10.3 MB]
Deep learning theory and Neural Collapse of the feature space

1) Prevalence of Neural Collapse during the terminal phase of deep learning training

2) Limitations of Neural Collapse for Understanding Generalization in Deep Learning

3) Neural Collapse: A Review on Modelling Principles and Generalization

4) A Geometric Analysis of Neural Collapse with Unconstrained Features

5) Grassmannian Frames with Applications to Coding and Communication

Revisiting the Calibration of Modern Neural Networks

Accurate estimation of predictive uncertainty (model calibration) is essential for the safe application of neural networks. Many instances of miscalibration in modern neural networks have been reported, suggesting a trend that newer, more accurate models produce poorly calibrated predictions. Here, we revisit this question for recent state-of-the-art image classification models. We systematically relate model calibration and accuracy, and find that the most recent models, notably those not using convolutions, are among the best calibrated. Trends observed in prior model generations, such as decay of calibration with distribution shift or model size, are less pronounced in recent architectures. We also show that model size and amount of pretraining do not fully explain these differences, suggesting that architecture is a major determinant of calibration properties.

https://openreview.net/forum?id=QRBvLayFXI
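For reference, the calibration these papers measure is usually summarized by the expected calibration error (ECE): bin predictions by confidence and average the |accuracy − confidence| gap over the bins. A minimal NumPy sketch (15 equal-width bins is a common default, not taken from this paper):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE with equal-width confidence bins.

    probs:  (N, C) predicted class probabilities
    labels: (N,)   integer ground-truth labels
    """
    confidences = probs.max(axis=1)           # top-1 confidence per sample
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)

    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # |accuracy - confidence| inside the bin, weighted by bin mass
            gap = abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```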
On the emergence of simplex symmetry in the final and penultimate layers of neural network classifiers

A recent numerical study observed that neural network classifiers enjoy a large degree of symmetry in the penultimate layer. Namely, if h(x) = A f(x) + b, where A is a linear map and f is the output of the penultimate layer of the network (after activation), then all data points x_{i,1}, …, x_{i,N_i} in a class C_i are mapped to a single point y_i by f, and the points y_i are located at the vertices of a regular (k−1)-dimensional standard simplex in a high-dimensional Euclidean space. We explain this observation analytically in toy models for highly expressive deep neural networks. In complementary examples, we demonstrate rigorously that even the final output of the classifier h is not uniform over data samples from a class C_i if h is a shallow network (or if the deeper layers do not bring the data samples into a convenient geometric configuration).

https://arxiv.org/abs/2012.05420
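For intuition, the simplex configuration described above is easy to construct and verify numerically. A small NumPy sketch (the centering construction is one standard way to obtain a regular simplex, not code from the paper):

```python
import numpy as np

k = 5                                   # number of classes
# Start from the standard basis e_1..e_k and subtract the centroid:
# the k centered points span a (k-1)-dim subspace and form a regular simplex.
E = np.eye(k)
Y = E - E.mean(axis=0, keepdims=True)
Y /= np.linalg.norm(Y, axis=1, keepdims=True)    # unit-norm vertices

# Pairwise cosines are all equal to -1/(k-1) for a regular simplex.
G = Y @ Y.T
off_diag = G[~np.eye(k, dtype=bool)]
print(np.allclose(off_diag, -1.0 / (k - 1)))     # True
```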
Dirichlet Neural Networks, a Bayesian Approach

Epistemic uncertainty relates to the spread of the categorical distribution p(π | x*, D) on the simplex, which for a Dirichlet distribution corresponds to the precision α_0 = ∑_{c=1}^{C} α_c. Aleatoric uncertainty is linked to the position of the mean on the simplex. Equipped with this configuration, we would like the Dirichlet network to yield the desired behaviors shown in the figure below:

When the model is confident in its prediction, it should yield a sharp distribution centered on one of the corners of the simplex (Fig a.). For an input in a region with high degrees of noise or class overlap (aleatoric uncertainty), it should yield a sharp distribution focused on the center of the simplex, which corresponds to being confident in predicting a flat categorical distribution over class labels (Fig b.). Finally, for out-of-distribution inputs, the Dirichlet network should yield a flat distribution over the simplex, indicating large epistemic uncertainty (Fig c.).

https://chcorbi.github.io/posts/2020/11/dirichlet-networks/
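A minimal sketch of the two uncertainty readouts described above, given the Dirichlet parameters α predicted by a network; the mutual-information decomposition is a common choice for the epistemic part, and the function and variable names here are mine, not the post's:

```python
import numpy as np
from scipy.special import digamma

def dirichlet_uncertainties(alpha):
    """alpha: (C,) Dirichlet concentration parameters."""
    alpha0 = alpha.sum()                  # precision: small alpha0 = spread out
    mean = alpha / alpha0                 # expected categorical distribution

    # Aleatoric: entropy of the mean categorical (high near the simplex center).
    aleatoric = -(mean * np.log(mean)).sum()

    # Expected entropy of categoricals drawn from the Dirichlet; subtracting it
    # from the entropy of the mean gives the mutual information, a common
    # epistemic measure for Dirichlet outputs.
    expected_entropy = -(mean * (digamma(alpha + 1) - digamma(alpha0 + 1))).sum()
    epistemic = aleatoric - expected_entropy
    return aleatoric, epistemic

print(dirichlet_uncertainties(np.array([100., 1., 1.])))   # confident: both low
print(dirichlet_uncertainties(np.array([50., 50., 50.])))  # aleatoric high
print(dirichlet_uncertainties(np.array([1., 1., 1.])))     # flat: epistemic highest
```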
Maximally Compact and Separated Features with Regular Polytope Networks

In this work we show how to extract from CNNs features with the properties of maximum inter-class separability and maximum intra-class compactness by setting the parameters of the classifier transformation as non-trainable (i.e. fixed). We obtain features similar to those produced by the well-known "Center Loss" [1] and other similar approaches, but with several practical advantages, including maximal exploitation of the available feature-space representation, a reduction in the number of network parameters, and no need for auxiliary losses besides the softmax. Our approach unifies and generalizes into a common framework two apparently different classes of methods: discriminative features, pioneered by the Center Loss [1], and fixed classifiers, first evaluated in [2].

https://openaccess.thecvf.com/content_CVPRW_2019/papers/Deep%20Vision%20Workshop/Pernici_Maximally_Compact_and_Separated_Features_with_Regular_Polytope_Networks_CVPRW_2019_paper.pdf
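The central trick, making the final classifier non-trainable, is a one-liner in most frameworks. A hedged PyTorch sketch (the backbone and dimensions are placeholders, not the paper's architecture):

```python
import torch.nn as nn

class FixedClassifierNet(nn.Module):
    def __init__(self, feat_dim=128, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(            # stand-in feature extractor
            nn.Linear(784, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )
        self.classifier = nn.Linear(feat_dim, num_classes, bias=False)
        # Freeze the classifier: its weights keep their fixed values, and
        # only the features adapt to them during training.
        self.classifier.weight.requires_grad_(False)

    def forward(self, x):
        return self.classifier(self.backbone(x))
```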
Regular Polytope Networks

Neural networks are widely used as models for classification in a large variety of tasks. Typically, a learnable transformation (i.e. the classifier) is placed at the end of such models, returning a value for each class used for classification. This transformation plays an important role in determining how the generated features change during the learning process. In this work, we argue that this transformation not only can be fixed (i.e. set as non-trainable) with no loss of accuracy and with a reduction in memory usage, but it can also be used to learn stationary and maximally separated embeddings. We show that the stationarity of the embedding and its maximally separated representation can be theoretically justified by setting the weights of the fixed classifier to values taken from the coordinate vertices of the three regular polytopes available in R^d, namely the d-Simplex, the d-Cube and the d-Orthoplex. These regular polytopes have the maximal amount of symmetry that can be exploited to generate stationary features angularly centered around their corresponding fixed weights.

https://arxiv.org/pdf/2103.15632.pdf
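A sketch of the three vertex sets the paper proposes as fixed classifier weights; these are standard coordinate constructions for the d-Simplex, d-Cube and d-Orthoplex, not the authors' code:

```python
import numpy as np
from itertools import product

def d_orthoplex(d):
    """2d vertices: +/- e_i."""
    return np.vstack([np.eye(d), -np.eye(d)])

def d_cube(d):
    """2^d vertices: all sign patterns, normalized to the unit sphere."""
    V = np.array(list(product([-1.0, 1.0], repeat=d)))
    return V / np.sqrt(d)

def d_simplex(d):
    """d+1 vertices of a regular simplex: center the standard basis of
    R^{d+1}; the points span a d-dim subspace, and any orthonormal map
    down to R^d preserves their angles."""
    E = np.eye(d + 1)
    Y = E - E.mean(axis=0)
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

for V in (d_simplex(9), d_cube(4), d_orthoplex(8)):
    G = V @ V.T
    print(V.shape, "min pairwise cosine:",
          G[~np.eye(len(V), dtype=bool)].min())
```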
Hausdorff Dimension, Heavy Tails, and Generalization in Neural Networks

Aiming to bridge this gap, in this paper we prove generalization bounds for SGD under the assumption that its trajectories can be well approximated by a Feller process, a rich class of Markov processes that includes several recent SDE representations (both Brownian and heavy-tailed) as special cases. We show that the generalization error can be controlled by the Hausdorff dimension of the trajectories, which is intimately linked to the tail behavior of the driving process. Our results imply that heavier-tailed processes should achieve better generalization; hence, the tail index of the process can be used as a notion of "capacity metric". We support our theory with experiments on deep neural networks, illustrating that the proposed capacity metric accurately estimates the generalization error and, unlike existing capacity metrics in the literature, does not necessarily grow with the number of parameters.

https://arxiv.org/pdf/2006.09313.pdf
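The "tail index as capacity metric" idea can be probed with a generic heavy-tail tool such as the Hill estimator, applied for instance to stochastic-gradient noise. A sketch of the standard estimator on synthetic samples (not necessarily the paper's exact procedure):

```python
import numpy as np

def hill_tail_index(samples, k=100):
    """Hill estimator of the tail index from the k largest magnitudes.
    Smaller alpha = heavier tail."""
    x = np.sort(np.abs(samples))[::-1]           # descending order statistics
    logs = np.log(x[:k]) - np.log(x[k])
    return 1.0 / logs.mean()

rng = np.random.default_rng(0)
print(hill_tail_index(rng.standard_normal(100_000)))   # light tail: large value
print(hill_tail_index(rng.standard_cauchy(100_000)))   # heavy tail: alpha ~ 1
```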
«Над желтизной правительственных зданий кружилась долго мутная метель» ("Above the yellowness of government buildings, a murky blizzard swirled for a long time"), generated by ruDALL-E Kandinsky (XXL)
Group Symmetry in PAC Learning

In this paper we show rigorously how learning in the PAC framework with invariant or equivariant hypotheses reduces to learning in a space of orbit representatives. Our results hold for any compact group, including infinite groups such as rotations. In addition, we show how to use these equivalences to derive generalisation bounds for invariant/equivariant models in terms of the geometry of the input and output spaces. To the best of our knowledge, our results are the most general of their kind to date.

https://openreview.net/pdf?id=HxeTEZJaxq
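For intuition about "learning in a space of orbit representatives": under the cyclic-shift group on sequences, one valid representative is the lexicographically smallest rotation, so any hypothesis defined on representatives is automatically shift-invariant. A toy sketch (my example group, not one from the paper):

```python
def orbit_representative(seq):
    """Canonical representative of seq under cyclic shifts:
    the lexicographically smallest rotation."""
    rotations = [tuple(seq[i:]) + tuple(seq[:i]) for i in range(len(seq))]
    return min(rotations)

# Two inputs in the same orbit map to the same point, so learning a
# hypothesis on representatives reduces invariant learning to ordinary
# PAC learning on the quotient space.
print(orbit_representative([3, 1, 2]))   # (1, 2, 3)
print(orbit_representative([2, 3, 1]))   # (1, 2, 3)
```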
Deep Double Descent: Where Bigger Models and More Data Hurt

We show that a variety of modern deep learning tasks exhibit a "double-descent" phenomenon where, as we increase model size, performance first gets worse and then gets better. Moreover, we show that double descent occurs not just as a function of model size, but also as a function of the number of training epochs. We unify the above phenomena by defining a new complexity measure we call the effective model complexity and conjecture a generalized double descent with respect to this measure. Furthermore, our notion of model complexity allows us to identify certain regimes where increasing (even quadrupling) the number of training samples actually hurts test performance.

https://arxiv.org/abs/1912.02292
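Model-size double descent can be reproduced in a few lines with min-norm random-feature regression, where test error typically peaks near the interpolation threshold (number of features ≈ number of samples). A minimal sketch of this classic setup (an illustration, not the paper's deep-learning experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_test = 100, 20, 2000
X, Xt = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
w = rng.standard_normal(d)
y, yt = X @ w + 0.5 * rng.standard_normal(n), Xt @ w

for p in [10, 50, 90, 100, 110, 200, 1000]:        # number of random features
    W = rng.standard_normal((d, p)) / np.sqrt(d)   # fixed random projection
    F, Ft = np.tanh(X @ W), np.tanh(Xt @ W)
    beta = np.linalg.pinv(F) @ y                   # min-norm least squares
    print(p, ((Ft @ beta - yt) ** 2).mean())       # test MSE peaks near p = n
```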
Explaining Neural Scaling Laws

The test loss of well-trained neural networks often follows precise power-law scaling relations with either the size of the training dataset or the number of parameters in the network. We propose a theory that explains and connects these scaling laws. We identify variance-limited and resolution-limited scaling behavior for both dataset and model size, for a total of four scaling regimes. The variance-limited scaling follows simply from the existence of a well-behaved infinite data or infinite width limit, while the resolution-limited regime can be explained by positing that models are effectively resolving a smooth data manifold. In the large width limit, this can be equivalently obtained from the spectrum of certain kernels, and we present evidence that large width and large dataset resolution-limited scaling exponents are related by a duality. We exhibit all four scaling regimes in the controlled setting of large random feature and pretrained models and test the predictions empirically on a range of standard architectures and datasets. We also observe several empirical relationships between datasets and scaling exponents: super-classing image tasks does not change exponents, while changing input distribution (via changing datasets or adding noise) has a strong effect. We further explore the effect of architecture aspect ratio on scaling exponents.

https://arxiv.org/pdf/2102.06701.pdf
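In practice such scaling laws are checked by fitting a saturating power law L(N) = a·N^(−α) + c to measured losses. A hedged SciPy sketch on synthetic data (the functional form is the standard one; all constants here are made up):

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(N, a, alpha, c):
    return a * N ** (-alpha) + c     # c = irreducible loss floor

# Synthetic "dataset size vs test loss" measurements.
N = np.array([1e3, 3e3, 1e4, 3e4, 1e5, 3e5, 1e6])
L = 5.0 * N ** (-0.35) + 0.7 \
    + 0.01 * np.random.default_rng(0).standard_normal(7)

(a, alpha, c), _ = curve_fit(power_law, N, L, p0=[1.0, 0.5, 0.5])
print(f"fitted exponent alpha = {alpha:.3f}")    # should recover ~0.35
```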
Rank Diminishing in Deep Neural Networks

The rank of neural networks measures information flowing across layers. It is an instance of a key structural condition that applies across broad domains of machine learning. In particular, the assumption of low-rank feature representations leads to algorithmic developments in many architectures. For neural networks, however, the intrinsic mechanism that yields low-rank structures remains vague and unclear. To fill this gap, we perform a rigorous study on the behavior of network rank, focusing particularly on the notion of rank deficiency. We theoretically establish a universal monotonic decreasing property of network rank from the basic rules of differential and algebraic composition, and uncover rank deficiency of network blocks and deep function coupling. By virtue of our numerical tools, we provide the first empirical analysis of the per-layer behavior of network rank in practical settings, i.e., ResNets, deep MLPs, and Transformers on ImageNet. These empirical results are in direct accord with our theory. Furthermore, we reveal a novel phenomenon of independence deficit caused by the rank deficiency of deep networks, where the classification confidence of a given category can be linearly decided by the confidence of a handful of other categories. The theoretical results of this work, together with the empirical findings, may advance understanding of the inherent principles of deep neural networks.

https://arxiv.org/abs/2206.06072
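A sketch of the kind of per-layer measurement the paper reports: run a batch through the network and compute the numerical rank of each layer's feature matrix (the toy MLP and the rank threshold are my choices, not the paper's setup):

```python
import torch
import torch.nn as nn

layers = [nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(8)]
model = nn.Sequential(*layers)

x = torch.randn(256, 64)                       # batch of inputs
with torch.no_grad():
    for i, layer in enumerate(model):
        x = layer(x)
        s = torch.linalg.svdvals(x)            # singular values, descending
        rank = (s > s[0] * 1e-4).sum().item()  # numerical rank of features
        # The theory predicts this is non-increasing with depth.
        print(f"layer {i}: rank {rank} / {x.shape[1]}")
```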
Forwarded from DLStories
MIT researchers studied the grokking phenomenon and seem to have found a plausible explanation for the effect.

Grokking is the effect where, after the model has already overfit (test loss rising while train loss falls), continued training suddenly makes the test loss start dropping and the network generalize better. The effect has been observed for a long time (for example, here), but mostly on small, "toy" datasets (e.g., on a task as simple as adding numbers). As far as I know, there was no really convincing explanation, and nobody had managed to reproduce the effect on real, non-toy data.

The MIT group came closer to understanding the nature of grokking and even managed to reproduce the effect on several large datasets.

The authors link grokking to the shape of the loss surface. A recent Stanford paper showed that this surface contains a spherical region in which the model generalizes optimally: for parameters from this region, both train and test loss are small and there is no overfitting. The Stanford group called this spherical region the Goldilocks zone (shown in green in the picture). Since the region is a sphere, it corresponds to model parameters of a certain norm; the inside of the sphere corresponds to parameters with a smaller norm, the outside to parameters with a larger norm.

Next: it turns out that grokking starts to appear if at some point the network's parameters had a large norm (i.e., corresponded to a point outside the Goldilocks sphere). In that case the parameters quickly slide from this point into a local minimum, which also usually lies outside the Goldilocks sphere, i.e., outside the region of optimal generalization. The train loss becomes small while the test loss stays large: the model overfits.

Then the following happens. If the model has no regularization (weight decay, for instance), training ends there: the weights never leave the minimum, no grokking occurs, and the overfitting remains. But if regularization is applied, it pushes the weights out of the local minimum and they start drifting toward the Goldilocks zone. The regularization is usually fairly weak, so many training epochs pass before the weights reach a point inside the Goldilocks zone and the model starts to generalize. This is exactly the grokking effect: long after overfitting, the test loss suddenly drops and the model generalizes better.

Incidentally, if the regularization is fairly strong, there is no grokking either: the weights move straight into the Goldilocks zone without lingering in a local minimum outside the sphere, and the model simply generalizes well right away.

Now, why grokking is often observed on "toy" datasets and almost never on real ones. First, the weights of most networks are initialized with a fairly small norm, i.e., inside the Goldilocks zone. On very simple datasets, however, the weight norm grows faster during training, because gradient descent quickly pushes the weights into a local minimum of the train loss far outside the sphere. When training large models this effect is much weaker: the weights rarely leave the Goldilocks zone at all, and no grokking is observed.

To confirm this hypothesis about the origin of grokking, the researchers trained several models on different CV and NLP tasks with different weight norms. Increasing the weight norm did indeed produce the grokking effect, though not as pronounced as on the simpler datasets.
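A minimal PyTorch sketch of the two knobs this explanation identifies: scaling up the initialization norm (to start outside the Goldilocks zone) and weak weight decay (to slowly pull the weights back in). Generic code under those assumptions, not the authors' implementation:

```python
import torch

def rescale_init(model, alpha=8.0):
    # Multiply all weights by alpha: the parameter vector starts far
    # outside the Goldilocks sphere (large norm).
    with torch.no_grad():
        for p in model.parameters():
            p.mul_(alpha)

model = torch.nn.Sequential(torch.nn.Linear(32, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, 10))
rescale_init(model, alpha=8.0)
# Weak weight decay slowly shrinks ||w|| toward the well-generalizing shell;
# with it, delayed generalization (grokking) can appear, while without it the
# model can stay overfit in the large-norm minimum.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```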

Read more in the paper.
Originally spotted the news here.
TOAST: Topological Algorithm for Singularity Tracking

The manifold hypothesis, which assumes that data lie on or close to an unknown manifold of low intrinsic dimensionality, is a staple of modern machine learning research. However, recent work has shown that real-world data exhibit distinct non-manifold structures, which result in singularities that can lead to erroneous conclusions about the data. Detecting such singularities is therefore crucial as a precursor to interpolation and inference tasks. We address detecting singularities by developing (i) persistent local homology, a new topology-driven framework for quantifying the intrinsic dimension of a data set locally, and (ii) Euclidicity, a topology-based multi-scale measure for assessing the 'manifoldness' of individual points. We show that our approach can reliably identify singularities of complex spaces, while also capturing singular structures in real-world data sets.

https://arxiv.org/abs/2210.00069
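The paper's persistent local homology is more involved, but the underlying question, how the local intrinsic dimension behaves around a point, can be previewed with a plain local-PCA estimate. This is a simpler proxy, explicitly not the paper's method:

```python
import numpy as np

def local_pca_dim(X, idx, k=50, var_threshold=0.95):
    """Estimate intrinsic dimension at X[idx] from its k nearest neighbors:
    number of principal components explaining var_threshold of the variance."""
    dists = np.linalg.norm(X - X[idx], axis=1)
    nbrs = X[np.argsort(dists)[1:k + 1]]       # exclude the point itself
    nbrs = nbrs - nbrs.mean(axis=0)
    svals = np.linalg.svd(nbrs, compute_uv=False)
    var = np.cumsum(svals ** 2) / np.sum(svals ** 2)
    return int(np.searchsorted(var, var_threshold) + 1)

# Two planes glued along a line: a non-manifold space whose seam points
# look higher-dimensional than generic points.
rng = np.random.default_rng(0)
A = np.c_[rng.uniform(-1, 1, (500, 2)), np.zeros(500)]                       # xy-plane
B = np.c_[rng.uniform(-1, 1, 500), np.zeros(500), rng.uniform(-1, 1, 500)]   # xz-plane
X = np.vstack([A, B])
print(local_pca_dim(X, idx=0))   # generic point: ~2; near the seam: higher
```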
PointCLIP: Point Cloud Understanding by CLIP

Recently, zero-shot and few-shot learning via Contrastive Vision-Language Pre-training (CLIP) has shown inspiring performance on 2D visual recognition, which learns to match images with their corresponding texts in open-vocabulary settings. However, it remains underexplored whether CLIP, pre-trained on large-scale image-text pairs in 2D, can be generalized to 3D recognition. In this paper, we show that such a setting is feasible by proposing PointCLIP, which conducts alignment between CLIP-encoded point clouds and 3D category texts. Specifically, we encode a point cloud by projecting it onto multi-view depth maps and aggregate the view-wise zero-shot predictions in an end-to-end manner, which achieves efficient knowledge transfer from 2D to 3D. We further design an inter-view adapter to better extract the global feature and adaptively fuse the 3D few-shot knowledge into CLIP pre-trained in 2D. By just fine-tuning the adapter under few-shot settings, the performance of PointCLIP can be largely improved. In addition, we observe a knowledge-complementarity property between PointCLIP and classical 3D-supervised networks. Via simple ensembling during inference, PointCLIP contributes favorable performance enhancement over state-of-the-art 3D networks. PointCLIP is therefore a promising alternative for effective 3D point cloud understanding in the low-data regime with marginal resource cost. We conduct thorough experiments on ModelNet10, ModelNet40 and ScanObjectNN to demonstrate the effectiveness of PointCLIP. Code is available at https://github.com/ZrrSkywalker/PointCLIP.

https://arxiv.org/abs/2112.02413
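The key preprocessing step, projecting a point cloud onto depth maps so a 2D model can consume it, is easy to sketch. A single-axis orthographic projection in NumPy (PointCLIP's actual projection and choice of views differ):

```python
import numpy as np

def depth_map(points, res=64, axis=2):
    """Orthographic depth image: project points along `axis`, keeping the
    nearest depth per pixel (a simple z-buffer)."""
    pts = points - points.min(axis=0)
    pts = pts / pts.max()                        # normalize into [0, 1]^3
    uv_axes = [a for a in range(3) if a != axis]
    u = np.clip((pts[:, uv_axes[0]] * (res - 1)).astype(int), 0, res - 1)
    v = np.clip((pts[:, uv_axes[1]] * (res - 1)).astype(int), 0, res - 1)
    depth = np.full((res, res), np.inf)
    np.minimum.at(depth, (u, v), pts[:, axis])   # keep the closest point
    depth[np.isinf(depth)] = 0.0                 # empty pixels -> background
    return depth

cloud = np.random.default_rng(0).uniform(size=(2048, 3))
views = [depth_map(cloud, axis=a) for a in range(3)]   # 3 orthogonal views
print([v.shape for v in views])
```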
What does a deep neural network confidently perceive? The effective dimension of high certainty class manifolds and their low confidence boundaries

Deep neural network classifiers partition input space into high confidence regions for each class. The geometry of these class manifolds (CMs) is widely studied and intimately related to model performance; for example, the margin depends on CM boundaries. We exploit the notions of Gaussian width and Gordon's escape theorem to tractably estimate the effective dimension of CMs and their boundaries through tomographic intersections with random affine subspaces of varying dimension. We show several connections between the dimension of CMs, generalization, and robustness. In particular, we investigate how CM dimension depends on 1) the dataset, 2) architecture (including ResNet, WideResNet & Vision Transformer), 3) initialization, 4) stage of training, 5) class, 6) network width, 7) ensemble size, 8) label randomization, 9) training set size, and 10) robustness to data corruption. Together, a picture emerges that higher-performing and more robust models have higher-dimensional CMs. Moreover, we offer a new perspective on ensembling via intersections of CMs.
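A sketch of the tomographic probe the abstract describes: restrict the classifier to a random k-dimensional affine subspace through an input and measure where its confidence stays high (toy model and thresholds are mine; the paper's Gaussian-width estimator is more careful):

```python
import torch
import torch.nn.functional as F

def subspace_confidence(model, x, label, k=2, n_samples=512, radius=5.0):
    """Fraction of a random k-dim affine subspace through x (within a box of
    the given radius) where the model assigns >0.9 confidence to `label`."""
    d = x.numel()
    V, _ = torch.linalg.qr(torch.randn(d, k))        # orthonormal subspace basis
    z = radius * (2 * torch.rand(n_samples, k) - 1)  # random subspace offsets
    pts = x.flatten() + z @ V.T
    with torch.no_grad():
        probs = F.softmax(model(pts.view(n_samples, *x.shape)), dim=-1)
    return (probs[:, label] > 0.9).float().mean().item()

# Toy usage: sweep k; the k at which the high-confidence fraction collapses
# gives a rough handle on the class manifold's effective dimension.
model, x = torch.nn.Linear(10, 3), torch.randn(10)
for k in (1, 2, 5, 9):
    print(k, subspace_confidence(model, x, label=0, k=k))
```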