When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate. Random "dropout" gives big improvements on many benchmark tasks and sets new records for speech and object recognition.
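A minimal numpy sketch of the training-time procedure the abstract describes; the function name, layer size and seed are illustrative, not from the paper:

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_forward(h, p_drop=0.5):
        # Randomly omit each hidden unit with probability p_drop
        # (0.5 in the paper), independently on every training case.
        mask = rng.random(h.shape) >= p_drop   # True where the unit is kept
        return h * mask

    # Hidden-layer activations for a single training case
    h = np.array([0.3, 1.2, 0.0, 0.8, 0.5])
    print(dropout_forward(h))   # roughly half the units are zeroed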
- Paper about Dropout
- The standard way to reduce test error is averaging the predictions of many different models
- But this is computationally expensive at both training and test time
- Dropout
- Aimed at large networks trained on small training sets
- Prevents ``overfitting''
- They use a per-unit weight constraint: instead of an L2-norm penalty on the whole weight vector, they set an upper bound on the L2 norm of the incoming weight vector of each individual hidden unit (a minimal sketch follows below)
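A rough sketch of that constraint, assuming the weight matrix stores one hidden unit per column; the bound c = 3.0 is a placeholder, since the paper does not state the value here:

    import numpy as np

    def clip_incoming_norms(W, c=3.0):
        # W: (n_inputs, n_units); column j is unit j's incoming weight vector.
        # Rescale any column whose L2 norm exceeds the bound c back onto
        # the ball of radius c; columns inside the bound are untouched.
        norms = np.linalg.norm(W, axis=0)
        scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
        return W * scale   # per-unit scale broadcasts across each column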
- Mean network: at test time, all outgoing weights of the hidden units are divided by 2 to compensate for dropout (sketched below)
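The same compensation, written for a general drop probability; for p_drop = 0.5 this is exactly the divide-by-2 above:

    def mean_network_weights(W_out, p_drop=0.5):
        # Test-time ``mean network'': scale the outgoing weights of the
        # dropped units by the retention probability (1 - p_drop), so the
        # expected input to the next layer matches its value during training.
        return (1.0 - p_drop) * W_out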
- Specific case:
- Single hidden layer network
- N hidden units
- ``Softmax'' output
- Dropout applied to the hidden units during training
- Testing done with the mean network
- This is exactly equivalent to taking the geometric mean of the probability distributions over labels predicted by all 2^N possible sub-networks (a short reconstruction of the argument follows this list)
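A short reconstruction of why the equivalence holds (the notation is mine, not the paper's). For a dropout mask $m \in \{0,1\}^N$ over the hidden activations $h_j$, the sub-network's output is

    z_k(m) = b_k + \sum_j m_j w_{jk} h_j, \qquad
    P(y = k \mid m) = \frac{e^{z_k(m)}}{\sum_{k'} e^{z_{k'}(m)}}

Taking the geometric mean over all $2^N$ masks, the softmax normalizers contribute a factor independent of $k$, so

    \Big( \prod_m P(y = k \mid m) \Big)^{1/2^N}
      \propto \exp\!\Big( \frac{1}{2^N} \sum_m z_k(m) \Big)
      = \exp\!\Big( b_k + \sum_j \tfrac{1}{2} w_{jk} h_j \Big)

because each unit $j$ is kept in exactly half of the $2^N$ masks. Renormalizing over $k$ gives precisely the ``softmax'' of the mean network with outgoing weights $w_{jk}/2$.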
- Results
- MNIST
- No dropout : 160 errors
- Dropout : 130 errors
- Dropout + dropping random input pixels : 110 errors
- Deep Boltzmann machine : 88 errors
- + Dropout : 79 errors
Figure 2: Visualization of features learned by first-layer hidden units (left: without dropout; right: with dropout).
- TIMIT
- 4 fully-connected hidden layers, 4000 units per layer
- + 185 ``softmax'' output units
- Without dropout :
- Dropout on hidden units :
- CIFAR-10
- Best published :
- 3 convolutional + max-pooling layers, 1 fully-connected layer :
- + Dropout in last hidden layer :
- ImageNet
- Average of 6 separate models :
- State-of-the-art :
- 5 convolutional + max-pooling layers
- + 2 fully-connected layers
- + 1000-way ``softmax'' output
- Without dropout :
- Dropout in the 6th layer :
- Reuters
- 2 fully-connected layers of 2000 hidden units each
- Without dropout :
- Dropout :