Next: Spectral Networks and Deep
Up: Summary of References Related
Previous: Learning Multi-modal Latent Attributes
Contents
Subsections
Convolutional Neural Networks (CNNs) have been es-tablished as a powerful class of models for image recog-nition problems. Encouraged by these results, we pro-vide an extensive empirical evaluation of CNNs on large- scale video classification using a new dataset of 1 millionYouTube videos belonging to 487 classes. We study mul-tiple approaches for extending the connectivity of a CNNin time domain to take advantage of local spatio-temporalinformation and suggest a multiresolution, foveated archi-tecture as a promising way of speeding up the training.Our best spatio-temporal networks display significant per-formance improvements compared to strong feature-basedbaselines (55.3
- Compare different CNN architectures for video classification
- Create a new dataset with 1 million of YouTube sport videos and 487 classes
- They required one month of training
- Multiresolution CNNs: New CNN with low resolution context and high resolution center
- Context stream: seems to learn color filters
- Fovea stream: learns grayscale features
- Compare with and without pretraining on other dataset UCF-101
- Architectures (increasing spatio-temporal relations)
- Single frame: Classify with one single shot
- Late Fusion: Classify with separate-in-time shots
- Early Fusion: Classify with adjacent shots merging on first convolution layer
- Slow Fusion: Classify with adjacent shots progressively mergin in upper layers
- Results (best models):
- clip Hit, Video Hit, Video Hit top5
- 42.4 60.0 78.5 Single-Frame + Multiresolution
- 41.9 60.9 80.2 Slow Fusion
- Results on UCF-101 with pretraining:
- 41.3 No pretraining
- 64.1 Fine-tune top layer
- 65.4 Fine-tune top 3 layers
- 62.2 Fine-tune all layers
- Conclusions:
- From video classification can be derived that camera movements
deteriorate the predictions
- Single frame gives very good results
- Further work:
- Apply some filter for camera movements
- Explore RNN from clip-level into video-level
Next: Spectral Networks and Deep
Up: Summary of References Related
Previous: Learning Multi-modal Latent Attributes
Contents
Miquel Perello Nieto
2014-11-28