Mats Sjöberg, Hannes Muurinen, Jorma Laaksonen, and Markus Koskela. Online Proceedings of the TRECVID 2006 Workshop. Gaithersburg, MD, USA. November 2006.
Our experiments in TRECVID 2006 include participation in the shot boundary detection, high-level feature extraction, and search tasks, using a common system framework based on multiple parallel Self-Organizing Maps (SOMs). In the shot boundary detection task we projected feature vectors calculated from successive frames on parallel SOMs and monitored the trajectories to detect the shot boundaries. We submitted the following ten runs:
PicSOM_CA: cut-optimized using all the training videos
PicSOM_GA: gradual-optimized using all the training videos
PicSOM_BA: optimized for both cuts and gradual transitions using all the training videos
PicSOM_CN: cut-optimized using only the news videos (without the NASA videos)
PicSOM_GN: gradual-optimized using only the news videos
PicSOM_CS: cut-optimized using channel-specific training videos
PicSOM_GS: gradual-optimized using channel-specific training videos
PicSOM_CNF: cut-optimized using only the news videos and only a few features
PicSOM_CNE: cut-optimized using only the news videos and one additional edge feature
PicSOM_CAE: cut-optimized using all the training videos and one additional edge feature
The trajectory-based method performed comparatively well in the task. Comparing the F1 scores of the runs, we found that the results mostly degraded when only a portion of the data was used in training. The channel-specific detectors in particular appeared to suffer from overfitting and performed poorly, probably because of the small amount of channel-specific training data relative to the number of adjustable parameters.
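The trajectory idea can be sketched roughly as follows. This is not the exact PicSOM implementation; the codebook, grid shape, distance measure, and threshold here are illustrative assumptions. Each frame's feature vector is projected to its best-matching unit (BMU) on a trained SOM grid, and a cut is flagged wherever the trajectory jumps farther than a threshold between consecutive frames:

```python
import numpy as np

def bmu_coords(frame_features, som_codebook, som_shape):
    """Map each frame's feature vector to the grid coordinates of its
    best-matching unit (BMU) on a trained SOM codebook."""
    coords = []
    for f in frame_features:
        bmu = np.argmin(np.linalg.norm(som_codebook - np.asarray(f), axis=1))
        coords.append(np.unravel_index(bmu, som_shape))
    return np.asarray(coords, dtype=float)

def detect_cuts(frame_features, som_codebook, som_shape, threshold=3.0):
    """Flag a cut wherever the SOM trajectory jumps farther than
    `threshold` grid units between consecutive frames.
    (Illustrative threshold; a real detector would tune it on training data.)"""
    traj = bmu_coords(frame_features, som_codebook, som_shape)
    jumps = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    return [i + 1 for i, d in enumerate(jumps) if d > threshold]
```

In practice several such trajectories, one per parallel SOM, would be monitored jointly, and gradual transitions would need a smoothed, multi-frame version of the same jump test.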
In the high-level feature extraction task, we applied a method of representing semantic concepts as class models on a set of parallel SOMs, combined with an inverse file created from automated speech recognition and machine translation (ASR/MT) data. We submitted six runs as follows:
A_SOM_F3_6: still-image and video features
A_SOM_F4_5: visual features and ASR/MT data
A_SOM_F5_4: visual features and stemmed ASR/MT data
A_SOM_F6_3: visual features and ASR/MT and closed-caption data
B_PicSOM_F7_2: visual features and LSCOM concepts
B_PicSOM_F9_1: visual features, ASR/MT data, and LSCOM concepts
We observed an increase in performance when both the textual features and the auxiliary concepts were added to the visual-features baseline.
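Combining evidence from several such sources can be sketched as a simple score-level fusion. The helper names and the min-max normalization below are illustrative assumptions, not the paper's exact combination rule; the idea is only that per-shot scores from different modalities (e.g. visual class models, ASR/MT inverse-file matches) are brought to a common scale and summed:

```python
def normalise(scores):
    """Min-max normalise a {shot_id: score} map to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero on constant scores
    return {k: (v - lo) / span for k, v in scores.items()}

def fuse_scores(score_maps, weights=None):
    """Weighted sum of normalised per-shot scores from several sources;
    returns shot ids ranked best-first. (Illustrative fusion rule.)"""
    weights = weights or [1.0] * len(score_maps)
    fused = {}
    for w, scores in zip(weights, score_maps):
        for shot, s in normalise(scores).items():
            fused[shot] = fused.get(shot, 0.0) + w * s
    return sorted(fused, key=fused.get, reverse=True)
```

Per-source weights could then be tuned on the training collection, which is one place where the observed gain from adding textual and concept evidence would show up.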
In the search task, we submitted a total of six runs (five automatic and one interactive run). Our method used SOM and inverse file indices from visual and textual features combined with class models of appropriate semantic concepts. The overall settings for the runs were as follows:
F_A_1_OM-f1_6: baseline automatic run using only ASR/MT data
F_A_2_OM-f2_5: automatic run using only visual features
F_A_2_OM-f3_4: automatic run using ASR/MT data and visual features
F_B_2_OM-f4_2: automatic run using ASR/MT data, visual features, and LSCOM concepts
F_B_2_OM-f5_3: automatic run using either only ASR/MT data or visual features and LSCOM concepts, selected by named entity detection
I_B_2_OM-i_1: interactive run with ASR/MT data, visual features, and LSCOM concepts
Using class models created from the LSCOM concepts improved retrieval performance as measured by MAP scores. The named entity detection in the last automatic run also proved successful and seems to be a promising topic for future experiments.
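For reference, the MAP measure used to compare these runs is the mean over topics of non-interpolated average precision. The function names below are illustrative; a minimal sketch of the standard TREC-style computation is:

```python
def average_precision(ranked_ids, relevant):
    """Non-interpolated average precision: mean of the precision values
    at the ranks where relevant shots are retrieved, divided by the
    total number of relevant shots."""
    hits, precision_sum = 0, 0.0
    for rank, sid in enumerate(ranked_ids, start=1):
        if sid in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over (ranked_list, relevant_set) pairs, one pair per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Official TRECVID scoring (via trec_eval) additionally caps result-list lengths per topic, but the per-topic quantity averaged is this same average precision.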
Page maintained by markus.koskela (at) hut.fi, last updated Friday, 16-Mar-2007 15:34:54 EET