
# Compressing Invariant Manifolds in Neural Nets

Reference: article

## The Stripe Model

We consider a binary classification task in which the label function depends only on one direction of the data space. Along this informative direction, regions carrying the two labels alternate in stripes separated by parallel planes, and the two labels are equiprobable. The points of the training and test sets are drawn from d uncorrelated Gaussian distributions.
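To make the setup concrete, here is a minimal sketch of how such a dataset could be generated; the positions of the stripe boundaries, the choice of the first coordinate as the informative direction, and the function names are illustrative assumptions, not taken from the article or its code.

```python
import torch

def stripe_dataset(p, d, boundaries=(-0.5, 0.5)):
    """Sample p points from a standard Gaussian in d dimensions and label them
    according to which stripe along the first coordinate they fall into.

    The boundary positions (parallel planes) are a hypothetical choice.
    """
    x = torch.randn(p, d)                       # i.i.d. Gaussian coordinates
    # Count how many boundary planes lie below x_1; parity gives the label.
    k = sum((x[:, 0] > b).long() for b in boundaries)
    y = 2 * (k % 2) - 1                         # alternating labels in {-1, +1}
    return x, y

# Example: 1000 training points in 10 dimensions
x, y = stripe_dataset(1000, 10)
```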

We take a fully-connected neural network with one hidden layer and ReLU activations, and we train the network function with a discrete approximation of gradient flow; the trained parameters are the weights and biases of the network.

Varying the scale of the network output drives the training dynamics from the feature regime (small scale) to the lazy regime (large scale).
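As an illustration of this parameterisation, here is a sketch of a one-hidden-layer ReLU network together with a scale parameter, called `alpha` below, that rescales the difference with the function at initialization. The symbol names, the 1/h output normalisation and the exact form of the rescaling are assumptions borrowed from the feature/lazy-training literature, not necessarily the article's conventions.

```python
import copy
import torch
import torch.nn as nn

class OneHiddenLayer(nn.Module):
    """Fully-connected network with a single hidden layer of ReLU units."""
    def __init__(self, d, h):
        super().__init__()
        self.hidden = nn.Linear(d, h)            # input weights and biases
        self.out = nn.Linear(h, 1, bias=False)   # output weights
        self.h = h

    def forward(self, x):
        return self.out(torch.relu(self.hidden(x))).squeeze(-1) / self.h

d, h, alpha = 10, 1000, 1e-2   # small alpha: feature regime, large alpha: lazy regime
f = OneHiddenLayer(d, h)
f0 = copy.deepcopy(f)          # frozen copy: the network function at initialization

def predictor(x):
    # Trained predictor: rescaled difference with the function at initialization.
    with torch.no_grad():
        ref = f0(x)
    return alpha * (f(x) - ref)
```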

In the following animation we show the amplification effect taking place in the evolution of the neurons' weight vectors during training in the feature regime.

The animation is consistent with the fact that the amplification factor diverges.

Note: since the weight magnitudes explode during learning, the vector lengths are rescaled so that they fit in the frame; relative norms and orientations are the quantities of interest.

Since the activation is a ReLU, we can plot, for each neuron, the point in data space nearest to the origin at which the ReLU argument vanishes: for a neuron with input weights ω and bias b, this point is −b ω / |ω|².

In the following we plot the evolution during learning of these points for the active neurons (by active we mean that the corresponding output weight is non-negligible). Points are colored according to whether the ReLU is oriented towards the origin or away from it.
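A sketch of this computation, using the module from the sketch above; the activity threshold and the orientation convention (sign of the bias) are assumptions:

```python
import torch

def relu_hyperplane_points(hidden, out, threshold=1e-3):
    """For each 'active' neuron of a one-hidden-layer ReLU net, return the
    point of its hyperplane omega . x + b = 0 closest to the origin, and the
    sign of b, which tells whether the active half-space contains the origin.
    """
    omega = hidden.weight.detach()               # shape (h, d)
    b = hidden.bias.detach()                     # shape (h,)
    beta = out.weight.detach().squeeze(0)        # output weights, shape (h,)

    active = beta.abs() > threshold * beta.abs().max()   # hypothetical criterion
    omega, b = omega[active], b[active]

    closest = -b.unsqueeze(1) * omega / omega.pow(2).sum(dim=1, keepdim=True)
    towards_origin = torch.sign(b)               # b > 0: origin lies in the active half-space
    return closest, towards_origin

# points, orientation = relu_hyperplane_points(f.hidden, f.out)  # f from the sketch above
```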

We can distinguish here the three temporal regimes described in the paper:

1. Compressing regime: before a first characteristic time, all neuron vectors converge towards a finite number of fixed points. All fixed points lie on the informative subspace, in accordance with our prediction for the large-p limit.
2. Fitting regime: after that time, a finite fraction of the training points escapes the loss function and the dynamics depart from the ODE prediction: the fixed points start moving.
3. Over-fitting regime: when the number of points still contributing to the loss function becomes of order one, learning focuses on these points, which can result in overfitting (see the sketch below).
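With a hinge loss, a training point stops contributing once its margin exceeds the hinge threshold, so the regime can be monitored through the fraction of points still entering the loss. A minimal sketch, assuming a unit margin (the function names are illustrative):

```python
import torch

def hinge_loss(pred, y, margin=1.0):
    """Hinge loss; zero for points whose margin is already satisfied."""
    return torch.relu(margin - y * pred).mean()

def contributing_fraction(pred, y, margin=1.0):
    """Fraction of training points that still enter the loss: close to 1 in the
    compressing regime, a shrinking finite fraction in the fitting regime,
    and of order 1/p in the over-fitting regime."""
    return (y * pred < margin).float().mean().item()

# Example, with the predictor and data from the sketches above:
# frac = contributing_fraction(predictor(x), y.float())
```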

## The Cylinder Model

We consider here an extension of the stripe model where the labelling function depends only on a small subset of the coordinates, which span a low-dimensional informative subspace.
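A sketch of one possible labelling function of this kind, assuming, as the model's name suggests and as the 3D animations below indicate, a three-dimensional data space with a two-dimensional informative subspace and a label determined by the radial distance within it; the dimensions and the cylinder radius are assumptions, not values from the article:

```python
import torch

def cylinder_dataset(p, d=3, d_parallel=2, radius=1.0):
    """Gaussian points labelled by the norm of their projection onto the
    first d_parallel coordinates: +1 inside the cylinder, -1 outside."""
    x = torch.randn(p, d)
    r = x[:, :d_parallel].norm(dim=1)
    y = 2 * (r < radius).long() - 1
    return x, y

x_cyl, y_cyl = cylinder_dataset(1000)
```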

We train a fully-connected architecture on this dataset, as described in the previous section.

The following animation shows the weight evolution during learning for two different sections of the 3d space:

Similarly to what is observed in the stripe model, we see an amplification of the weights taking place along the informative directions.

Parameters:

- Training set size:
- Number of neurons:
- Activation function: ReLU
- Hinge loss:
- All weights are initialized
- Network function at initialization:
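
Putting the pieces together, here is a minimal, self-contained training loop that discretises gradient flow on the hinge loss with full-batch gradient descent, in the spirit of the setup above. The learning rate, number of steps, network width, label function and the alpha-rescaled predictor are all illustrative choices, not the article's actual parameters.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
d, h, p, alpha, lr = 10, 500, 1000, 1e-2, 0.1

x = torch.randn(p, d)                            # Gaussian inputs
y = 2.0 * (x[:, 0] > 0).float() - 1.0            # single-boundary stripe label, for brevity

f = nn.Sequential(nn.Linear(d, h), nn.ReLU(), nn.Linear(h, 1, bias=False))
f0 = copy.deepcopy(f)                            # network function at initialization

def predictor(inp):
    with torch.no_grad():
        ref = f0(inp).squeeze(-1)
    return alpha * (f(inp).squeeze(-1) - ref)

opt = torch.optim.SGD(f.parameters(), lr=lr)     # full-batch GD ~ discretised gradient flow
for step in range(5000):
    opt.zero_grad()
    loss = torch.relu(1.0 - y * predictor(x)).mean()   # hinge loss with unit margin
    loss.backward()
    opt.step()
    if loss.item() == 0.0:                       # every margin satisfied: fitting is complete
        break
```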