

Large-scale Video Classification with Convolutional Neural Networks [45]

Original Abstract

Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggest a multiresolution, foveated architecture as a promising way of speeding up the training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3

Main points

Compare different CNN architectures for video classification
Create a new dataset of 1 million YouTube sports videos in 487 classes
Training the networks took about one month
Multiresolution CNNs: a new CNN with a low-resolution context stream and a high-resolution center (fovea) stream
- Context stream: seems to learn color filters
- Fovea stream: learns grayscale features
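The two input streams can be illustrated with a minimal NumPy sketch. The 178x178 frame size, the 89x89 stream size, and the use of 2x2 average pooling for downsampling are assumptions made for illustration; the notes above only state that the context is low resolution and the center is high resolution. The function name `multires_streams` is hypothetical.

```python
import numpy as np

def multires_streams(frame, out_size=89):
    """Split one H x W x C frame into (context, fovea) inputs.

    Assumes the frame is exactly 2 * out_size pixels per side
    (an illustrative choice, not taken from the notes).
    """
    h, w = frame.shape[:2]
    if h != 2 * out_size or w != 2 * out_size:
        raise ValueError("expected a frame of 2 * out_size pixels per side")
    # Context stream: the whole frame at half resolution,
    # obtained here by averaging each 2x2 block of pixels.
    context = frame.reshape(out_size, 2, out_size, 2, -1).mean(axis=(1, 3))
    # Fovea stream: the central crop at the original resolution.
    top = (h - out_size) // 2
    left = (w - out_size) // 2
    fovea = frame[top:top + out_size, left:left + out_size]
    return context, fovea

frame = np.random.rand(178, 178, 3)
context, fovea = multires_streams(frame)
print(context.shape, fovea.shape)
```

Both streams have the same spatial size, so each one carries a quarter of the pixels of the full frame, which is where the training speed-up would come from in this sketch.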
Compare results on another dataset, UCF-101, with and without pretraining
Architectures (in order of increasing spatio-temporal connectivity)
- Single frame: classify from one single frame
- Late Fusion: classify from frames that are separated in time
- Early Fusion: classify from adjacent frames, merged at the first convolution layer
- Slow Fusion: classify from adjacent frames, merged progressively in the upper layers
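The difference between early and slow fusion can be sketched by tracing how many temporal positions remain after each layer. The clip length of 10 frames and the per-layer (extent, stride) values below are illustrative assumptions, not figures taken from these notes, and the helper names are hypothetical.

```python
def temporal_outputs(num_frames, extent, stride):
    """Temporal positions left after a layer that spans `extent` frames."""
    if extent > num_frames:
        raise ValueError("layer extent exceeds the available frames")
    return (num_frames - extent) // stride + 1

def trace(num_frames, layers):
    """Return the temporal size before and after each (extent, stride) layer."""
    sizes = [num_frames]
    for extent, stride in layers:
        sizes.append(temporal_outputs(sizes[-1], extent, stride))
    return sizes

CLIP_LEN = 10  # assumed clip length in frames

# Early fusion: the first layer sees the whole clip at once.
early = trace(CLIP_LEN, [(CLIP_LEN, 1)])
# Slow fusion: each layer merges only a few neighbouring positions.
slow = trace(CLIP_LEN, [(4, 2), (2, 2), (2, 1)])

print(early)
print(slow)
```

In this sketch early fusion collapses time in a single step, while slow fusion keeps several temporal positions alive through the lower layers and only reaches a single position at the third layer.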
Results (best models):
- Model                            Clip hit (%)  Video hit (%)  Video hit top-5 (%)
- Single-Frame + Multiresolution   42.4          60.0           78.5
- Slow Fusion                      41.9          60.9           80.2
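The clip-level and video-level scores differ in what is being classified. A minimal sketch of how such metrics could be computed follows, assuming that a video-level prediction is the average of its clip-level class probabilities; the notes do not state the aggregation rule, so the averaging, the function names, and the toy numbers are all assumptions.

```python
import numpy as np

def video_prediction(clip_probs):
    """Average per-clip class probabilities (num_clips x num_classes)."""
    return np.asarray(clip_probs, dtype=float).mean(axis=0)

def hit_at_k(probs, label, k=1):
    """True if `label` is among the k highest-scoring classes."""
    top_k = np.argsort(probs)[::-1][:k]
    return bool(label in top_k)

# Toy example: 3 clips from one video, 4 classes, true class is 1.
clip_probs = [
    [0.5, 0.3, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.5, 0.3, 0.1],
]
label = 1

clip_hits = [hit_at_k(p, label) for p in clip_probs]
video_probs = video_prediction(clip_probs)

print(clip_hits)
print(hit_at_k(video_probs, label, k=1))
```

In the toy data one clip is misclassified on its own, but the averaged video-level prediction is still correct, which is consistent with video hit being higher than clip hit in the results above.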
Results on UCF-101, with and without pretraining (accuracy in %):
- 41.3 No pretraining
- 64.1 Fine-tune top layer
- 65.4 Fine-tune top 3 layers
- 62.2 Fine-tune all layers
Conclusions:
- The video classification results suggest that camera motion degrades the predictions
- Single frame gives very good results
Further work:
- Filter out or compensate for camera motion
- Explore RNNs for combining clip-level predictions into video-level predictions


Miquel Perello Nieto 2014-11-28