*We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5*

- CNN architecture:
  - 650,000 neurons (60 million parameters)
  - 5 convolutional layers
    - Some of them followed by a max-pooling layer
  - 3 fully-connected layers
  - A final 1000-way softmax

- Dropout regularization to reduce overfitting in the fully-connected layers
- Training time: 5-6 days on two GTX 580 3GB GPUs
- Dataset:
  - ILSVRC-2010
  - Images down-sampled to a fixed resolution of 256x256
  - Subtract the mean activity over the training set from each pixel
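
The mean-subtraction step can be sketched as follows. This is a minimal pure-Python illustration on tiny nested lists rather than real 256x256 images; the helper names `mean_image` and `subtract_mean` are hypothetical and not taken from the paper.

```python
def mean_image(images):
    """Per-pixel mean over a list of equally sized images (H x W x C nested lists)."""
    n = len(images)
    h, w, c = len(images[0]), len(images[0][0]), len(images[0][0][0])
    return [[[sum(img[y][x][ch] for img in images) / n
              for ch in range(c)]
             for x in range(w)]
            for y in range(h)]


def subtract_mean(image, mean):
    """Center one image by subtracting the training-set mean, pixel by pixel."""
    return [[[v - m for v, m in zip(px, mpx)]
             for px, mpx in zip(row, mrow)]
            for row, mrow in zip(image, mean)]


# Two toy 1x2 RGB "images" standing in for the 256x256 training set.
train = [
    [[[0, 10, 20], [30, 40, 50]]],
    [[[10, 30, 40], [50, 60, 90]]],
]
mean = mean_image(train)
centered = [subtract_mean(img, mean) for img in train]
```

After centering, every pixel position averages to zero across the training set, which is the only property this preprocessing step is meant to guarantee.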

- ReLU:
  - Trains faster than tanh
  - ReLU: 6 epochs
  - tanh: 36 more epochs to reach the same performance
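
One common explanation for the speed-up is that ReLU does not saturate for positive inputs, while the gradient of tanh shrinks towards zero as the input grows. The sketch below only compares the two derivatives at a few points; it does not reproduce the training experiment, and the function names are illustrative.

```python
import math


def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which vanishes for large |x|."""
    return 1.0 - math.tanh(x) ** 2


for x in (0.5, 2.0, 5.0):
    print(f"x={x}: relu'={relu_grad(x):.4f}  tanh'={tanh_grad(x):.4f}")
```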

- Local Response Normalization
  - Reduces top-1 and top-5 error rates
  - Helps generalization
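
A rough sketch of local response normalization at a single spatial position: each kernel's activation is divided by a term that grows with the squared activations of its neighbouring kernels. The default hyperparameters (`k=2`, `n=5`, `alpha=1e-4`, `beta=0.75`) are, as far as recalled, the values reported in the original paper; they do not appear in these notes, so treat them as assumptions.

```python
def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize the activations `a` of N kernels at one spatial position.

    b_i = a_i / (k + alpha * sum_j a_j^2) ** beta, where j runs over up to
    `n` adjacent kernels centred on i (clipped at the first and last kernel).
    """
    num_kernels = len(a)
    half = n // 2
    out = []
    for i in range(num_kernels):
        lo, hi = max(0, i - half), min(num_kernels - 1, i + half)
        energy = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * energy) ** beta)
    return out


# One strongly active kernel surrounded by silent ones.
activations = [0.0, 0.0, 0.0, 100.0, 0.0, 0.0, 0.0]
normalized = local_response_norm(activations)
```

A large activation is damped by its own energy and would also damp any active neighbours, which is the lateral-inhibition effect the layer is meant to provide.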

- Overlapping Pooling
  - Reduces top-1 and top-5 error rates
  - 3x3 pooling grid
  - stride = 2
  - Adjacent pooling windows overlap by one pixel row/column
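
The overlap is easiest to see in one dimension. The sketch below assumes a window of 3 and a stride of 2, so neighbouring windows share one element; `max_pool_1d` is a hypothetical helper, not code from the paper.

```python
def max_pool_1d(xs, size=3, stride=2):
    """Max-pool a 1-D sequence; windows overlap whenever stride < size."""
    return [max(xs[i:i + size])
            for i in range(0, len(xs) - size + 1, stride)]


row = [1, 5, 2, 0, 3, 9, 4]

# Overlapping windows: [1, 5, 2], [2, 0, 3], [3, 9, 4]
overlapping = max_pool_1d(row, size=3, stride=2)

# Traditional non-overlapping windows: [1, 5], [2, 0], [3, 9]
non_overlapping = max_pool_1d(row, size=2, stride=2)
```

The 2-D case applies the same idea along rows and columns.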

- Overall Architecture
  - Input: 224x224x3 (RGB image)
  - Conv: 96 kernels of size 11x11x3 with a stride of 4 pixels
  - Response normalization and max-pooling
  - Conv: 256 kernels of size 5x5x48 with a stride of ? pixels
  - Response normalization and max-pooling
  - Conv: 384 kernels of size 3x3x256
  - Conv: 384 kernels of size 3x3x192
  - Conv: 256 kernels of size 3x3x192
  - Max-pooling (response normalization follows only the first two convolutional layers)
  - Fully connected: 4096
  - Fully connected: 4096
  - Fully connected: 1000
  - Softmax
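
As a sanity check on the "60 million parameters" figure, the layer list above can be turned into a rough parameter count. The sketch assumes one bias per kernel or unit and a 6x6x256 volume feeding the first fully-connected layer; that volume size is not stated in these notes and is an assumption here.

```python
# (name, number of kernels or units, weights per kernel or unit)
LAYERS = [
    ("conv1", 96, 11 * 11 * 3),
    ("conv2", 256, 5 * 5 * 48),
    ("conv3", 384, 3 * 3 * 256),
    ("conv4", 384, 3 * 3 * 192),
    ("conv5", 256, 3 * 3 * 192),
    ("fc6", 4096, 6 * 6 * 256),   # assumption: 6x6x256 input volume
    ("fc7", 4096, 4096),
    ("fc8", 1000, 4096),
]


def param_count(units, fan_in):
    """Weights plus one bias per kernel or unit."""
    return units * fan_in + units


counts = {name: param_count(units, fan_in) for name, units, fan_in in LAYERS}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name}: {n:,}")
print(f"total: {total:,}")
```

Under these assumptions the total comes out at roughly 61 million, with the three fully-connected layers accounting for the large majority of the weights.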

- Data augmentation
  - Reduces error rate
  - Original images rescaled and cropped to 256x256
  - Extract 5 patches of 224x224: the four corners plus the center
  - Mirror them horizontally to get 5 more patches
  - Augment data by altering the RGB channel intensities:
    - Perform PCA on the RGB pixel values over the whole training set
    - To each training image, add multiples of the principal components, weighted by Gaussian noise
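
Both augmentations can be sketched with NumPy. The function names are hypothetical. Scaling each principal component by its eigenvalue and using a noise standard deviation of 0.1 follow the original paper as far as recalled, and are assumptions relative to these notes. Random arrays stand in for real images.

```python
import numpy as np


def ten_patches(image, size=224):
    """Four corner crops plus the center crop, each with its horizontal mirror."""
    h, w = image.shape[:2]
    tops = [0, 0, h - size, h - size, (h - size) // 2]
    lefts = [0, w - size, 0, w - size, (w - size) // 2]
    crops = [image[t:t + size, l:l + size] for t, l in zip(tops, lefts)]
    return crops + [c[:, ::-1] for c in crops]


def fit_rgb_pca(images):
    """Eigen-decomposition of the 3x3 covariance of all RGB pixel values."""
    pixels = np.concatenate([im.reshape(-1, 3) for im in images]).astype(np.float64)
    cov = np.cov(pixels, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # columns of eigvecs are the PCs
    return eigvals, eigvecs


def pca_color_jitter(image, eigvals, eigvecs, rng, sigma=0.1):
    """Add one random RGB offset, built from the PCs, to every pixel."""
    alphas = rng.normal(0.0, sigma, size=3)  # one Gaussian draw per component
    offset = eigvecs @ (alphas * eigvals)    # shape (3,)
    return image.astype(np.float64) + offset


rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(256, 256, 3)) for _ in range(4)]
patches = ten_patches(images[0])
eigvals, eigvecs = fit_rgb_pca(images)
jittered = pca_color_jitter(patches[0], eigvals, eigvecs, rng)
```

The colour offset is the same for every pixel of a given image, so the jitter changes overall colour and intensity without altering object shape.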

- Dropout
  - Set the output of each hidden neuron to zero with probability 0.5
  - At test time, use all neurons but multiply their outputs by 0.5
  - Applied to the first two fully-connected layers
  - Substantially reduces overfitting
  - Roughly doubles the number of iterations required to converge
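
A minimal sketch of the two dropout modes described above, using plain Python lists. The function names are illustrative; the test-time scale factor is the keep probability `1 - p`, which equals 0.5 for `p = 0.5`.

```python
import random


def dropout_train(activations, p=0.5, rng=random):
    """Training: zero each activation independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]


def dropout_test(activations, p=0.5):
    """Test: keep every unit but scale by the keep probability (1 - p)."""
    return [a * (1.0 - p) for a in activations]


rng = random.Random(0)            # seeded so the example is repeatable
hidden = [1.0] * 10_000
train_out = dropout_train(hidden, rng=rng)
test_out = dropout_test(hidden)
```

On average the training-time output matches the scaled test-time output, which is why the 0.5 factor is needed when all units are active.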

- Details of learning
  - batch size = 128
  - momentum = 0.9
  - weight decay = 0.0005
  - Initial weights drawn from a zero-mean Gaussian with std = 0.01
  - biases = 1 in the second, fourth, and fifth convolutional layers and in the fully-connected hidden layers
  - biases = 0 in the remaining layers
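
The hyperparameters above fit a momentum update with weight decay of roughly the following form. This is a sketch on plain lists; the learning rate `lr` and the gradient values are made up for illustration and do not come from these notes.

```python
import random


def init_weights(n, std=0.01, rng=random):
    """Draw n initial weights from a zero-mean Gaussian."""
    return [rng.gauss(0.0, std) for _ in range(n)]


def sgd_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One momentum update with weight decay for parallel lists w, v, grad.

    v <- momentum * v - weight_decay * lr * w - lr * grad
    w <- w + v
    """
    new_v = [momentum * vi - weight_decay * lr * wi - lr * gi
             for wi, vi, gi in zip(w, v, grad)]
    new_w = [wi + vi for wi, vi in zip(w, new_v)]
    return new_w, new_v


rng = random.Random(0)
w = init_weights(4, rng=rng)
v = [0.0] * len(w)
grad = [0.1, -0.2, 0.0, 0.3]           # made-up gradient averaged over one batch
w, v = sgd_step(w, v, grad, lr=0.01)   # lr is an assumption, not from the notes
```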

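- Evaluation
  - Consider the feature activations induced by an image at the last, 4096-dimensional hidden layer
  - Images whose activation vectors are close in Euclidean distance are treated as similar by the network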
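
One way to use these activations is as an image descriptor for nearest-neighbour retrieval. The sketch below uses Euclidean distance on toy 4-dimensional vectors in place of real 4096-dimensional activations; `nearest` is a hypothetical helper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest(query, gallery, k=1):
    """Indices of the k gallery vectors closest to `query`."""
    order = sorted(range(len(gallery)),
                   key=lambda i: euclidean(query, gallery[i]))
    return order[:k]


# Toy 4-D vectors standing in for 4096-D activations.
gallery = [
    [0.0, 1.0, 0.0, 2.0],
    [5.0, 5.0, 5.0, 5.0],
    [0.1, 0.9, 0.0, 2.1],
]
query = [0.0, 1.0, 0.1, 2.0]
closest = nearest(query, gallery, k=2)
```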