Figure: Schematic diagram illustrating the interconnections between layers in the Neocognitron [K. Fukushima, 1980]

Merging chrominance and luminance in early, medium and late fusion using Convolutional Neural Networks

The field of Machine Learning has received extensive attention in recent years. In particular, computer vision problems have drawn considerable interest as the use of images and pictures in our daily routines keeps growing.

The classification of images is one of the most important tasks, used to organize, store, retrieve, and explain pictures. To this end, researchers have designed algorithms that automatically detect objects in images. During the last decades, the common approach has been to create sets of features -- manually designed -- that could be exploited by image classification algorithms. More recently, researchers have designed algorithms that learn these sets of features automatically, surpassing the previous state of the art.

However, learning optimal sets of features is computationally expensive. The task can be relaxed by adding prior knowledge, which improves and accelerates the learning phase. Furthermore, for problems with a large feature space (e.g. the recognition of human actions in videos), the complexity of the models needs to be reduced to keep learning computationally tractable.

Consequently, we propose to use multimodal learning techniques to reduce the complexity of the learning phase in Artificial Neural Networks by incorporating prior knowledge about the connectivity of the network. Furthermore, we analyze state-of-the-art models for image classification and propose new architectures that can learn a locally optimal set of features faster and more easily.

In this thesis, we demonstrate that merging the luminance and the chrominance parts of images using multimodal learning techniques can improve the acquisition of good sets of visual features. We compare the validation accuracy of several models and show that our approach outperforms the baseline model with statistically significant results.

Documents: Thesis pdf, Slides pdf


Colorspace transformations

The RGB cube represented under different colorspace transformations of the RGB channels, rendered in 3D. Click on an image to see the corresponding 3D GIF (each GIF is ~11 MB).

RGB and YUV

3D RGB cube in RGB axes | 3D RGB cube in YUV axes

XYZ and YIQ

3D RGB cube in XYZ axes | 3D RGB cube in YIQ axes

AlexNet first-convolution filters in original RGB and transformed to YUV

It is possible to see that the principal component focuses on the luminance (or luma, Y′) axis, while the chrominance is equally distributed along the U and V channels.

Conv1 weights for each pixel | Conv1 weights for each pixel transformed to YUV

The same visualization with the boundaries and the diagonal of the RGB colorspace.

Conv1 weights for each pixel | Conv1 weights for each pixel transformed to YUV

CIFAR-10 (Berkeley) first-convolution filters in original RGB and transformed to YUV

In this case the principal component is not as clean, but experimental results show that reducing the number of connections by separating the luminance from the chrominance channels does not hurt performance.

Conv1 weights for each pixel | Conv1 weights for each pixel transformed to YUV

Very Deep CNN (19 convolutional layers) first-convolution filters in original RGB and transformed to YUV

Conv1 weights for each pixel | Conv1 weights for each pixel transformed to YUV

Datasets

This section contains a list of useful datasets. These datasets, and more computer vision resources, can be found at cvpapers.

ImageNet

ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. Currently we have an average of over five hundred images per node. We hope ImageNet will become a useful resource for researchers, educators, students and all of you who share our passion for pictures.


[ webpage | download ]

Video classification USAA dataset

"The USAA dataset includes 8 different semantic class videos which are home videos of social occassions such e birthday party, graduation party,music performance, non-music performance, parade, wedding ceremony, wedding dance and wedding reception which feature activities of group of people. It contains around 100 videos for training and testing respectively. Each video is labeled by 69 attributes. The 69 attributes can be broken down into five broad classes: actions, objects, scenes, sounds, and camera movement. It can be used for evaluating approaches for video classification, N-shot and zero-shot learning, multi-task learning, attribute/concept-annotation, attribute/concepts-modality prediction, suprising attributes/concepts discovery, and latent-attribute(concepts) discovery etc."


[ webpage | download ]

YouTube Faces

"The data set contains 3,425 videos of 1,595 different people. All the videos were downloaded from YouTube. An average of 2.15 videos are available for each subject. The shortest clip duration is 48 frames, the longest clip is 6,070 frames, and the average length of a video clip is 181.3 frames."


[ webpage | download ]

Hollywood-2 Human Actions and Scenes dataset

"Hollywood-2 datset contains 12 classes of human actions and 10 classes of scenes distributed over 3669 video clips and approximately 20.1 hours of video in total. The dataset intends to provide a comprehensive benchmark for human action recognition in realistic and challenging settings. The dataset is composed of video clips extracted from 69 movies, it contains approximately 150 samples per action class and 130 samples per scene class in training and test subsets. A part of this dataset was originally used in the paper "Actions in Context", Marszałek et al. in Proc. CVPR'09. Hollywood-2 is an extension of the earlier Hollywood dataset."


[ webpage | download ]

KTH - Recognition of Human Actions

"The current video database containing six types of human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors s1, outdoors with scale variation s2, outdoors with different clothes s3 and indoors s4 as illustrated below. Currently the database contains 2391 sequences. All sequences were taken over homogeneous backgrounds with a static camera with 25fps frame rate. The sequences were downsampled to the spatial resolution of 160x120 pixels and have a length of four seconds in average."


[ webpage ]

UCF50

UCF50 is an action recognition data set with 50 action categories, consisting of realistic videos taken from YouTube. This data set is an extension of the YouTube Action data set (UCF11), which has 11 action categories.

Most of the available action recognition data sets are not realistic and are staged by actors. In our data set, the primary focus is to provide the computer vision community with an action recognition data set consisting of realistic videos taken from YouTube. Our data set is very challenging due to large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etc. For all 50 categories, the videos are grouped into 25 groups, where each group consists of more than 4 action clips. The video clips in the same group may share some common features, such as the same person, similar background, similar viewpoint, and so on. [...]


[ webpage | download ]

UCF101

UCF101 is an action recognition data set of realistic action videos, collected from YouTube, with 101 action categories. This data set is an extension of the UCF50 data set, which has 50 action categories.

With 13,320 videos from 101 action categories, UCF101 gives the largest diversity in terms of actions, and with the presence of large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etc., it is the most challenging data set to date. As most of the available action recognition data sets are not realistic and are staged by actors, UCF101 aims to encourage further research into action recognition by learning and exploring new realistic action categories. [...]


[ webpage | download ]



Workshops

ILSVRC - ImageNet Large Scale Visual Recognition Challenge

This challenge evaluates algorithms for object detection and image classification at large scale.


THUMOS'13 - ICCV Workshop on Action Recognition with a Large Number of Classes

THUMOS: The First International Workshop on Action Recognition with a Large Number of Classes, in conjunction with ICCV '13, Sydney, Australia.


[ webpage ]

4DMOD - 14th International Conference on Computer Vision

4DMOD is a workshop on the modeling of dynamic scenes. Modeling shapes that evolve over time and analyzing and interpreting their motion is a subject of increasing interest to many research communities, including computer vision, computer graphics and medical imaging. Following the 1st edition in 2011, the purpose of this workshop is to provide a venue for researchers from various communities, working in the field of dynamic scene modeling from various modalities, to present their work, exchange ideas and identify challenging issues in this domain. Contributions are sought on new and original research on any aspect of 4D modeling.


[ webpage ]

TRECVID

The main goal of the TREC Video Retrieval Evaluation (TRECVID) is to promote progress in content-based analysis of and retrieval from digital video via open, metrics-based evaluation. TRECVID is a laboratory-style evaluation that attempts to model real world situations or significant component tasks involved in such situations.


[ webpage ]



Comparison of CNN libraries

There are several libraries that can be used for training CNNs.

Theano

"Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently."

[ webpage | GitHub ]

Pylearn2

"Pylearn2 is a machine learning library. Most of its functionality is built on top of Theano. This means you can write Pylearn2 plugins (new models, algorithms, etc) using mathematical expressions, and theano will optimize and stabilize those expressions for you, and compile them to a backend of your choice (CPU or GPU)."

[ webpage | GitHub ]

Caffe

"Caffe is a framework for convolutional neural network algorithms, developed with speed in mind. It was created by Yangqing Jia, and is in active development by the Berkeley Vision and Learning Center."

[ webpage | GitHub ]

OverFeat

"OverFeat is an image recognizer and feature extractor built around a convolutional network.

The OverFeat convolutional net was trained on the ImageNet 1K dataset. It participated in the ImageNet Large Scale Visual Recognition Challenge 2013 under the name “OverFeat NYU”.

This release provides C/C++ code to run the network and output class probabilities or feature vectors. It also includes a webcam-based demo."

[ webpage | GitHub ]

Some comparisons

Figure: commit counts for each library's repository



References with abstract sorted by year

These are some references I read during my master's thesis. They contain the original abstract and my own notes. Some of the notes are verbatim excerpts from the original references, while others are my own interpretation. The list is also available in HTML and PDF (LaTeX) format: [ html | pdf ]

