Action recognition in visual sensor networks: a data fusion perspective

Author:
  1. Cilla Ugarte, Rodrigo

Supervised by:
  1. Antonio Berlanga de Jesús (Director)
  2. Miguel Ángel Patricio Guisado (Director)

Defence university: Universidad Carlos III de Madrid

Date of defence: 14 December 2012

Committee:
  1. María Araceli Sanchís de Miguel (Chair)
  2. José Ramón Casar Corredera (Secretary)
  3. Luis Miguel Parreira Correira (Committee member)

Type: Thesis

Abstract

Visual Sensor Networks have emerged as a new technology for bringing computer vision algorithms to the real world. However, they impose restrictions on the computational resources and bandwidth available to solve target problems. This thesis is concerned with the definition of new, efficient algorithms to perform Human Action Recognition with Visual Sensor Networks. Human Action Recognition systems apply sequence modelling methods to integrate the temporal sensor measurements available. Among sequence modelling methods, the Hidden Conditional Random Field has shown great performance in sequence classification tasks, outperforming many other methods. However, no parameter estimation procedure with feature and model selection properties has been proposed for it. This thesis fills this gap by proposing a new objective function to optimize during training. The L2 regularizer employed in the standard objective function is replaced by an overlapping group-L1 regularizer that produces feature and model selection effects at the optimum. A gradient-based search strategy is proposed to find the optimal parameters of the objective function. Experimental evidence shows that Hidden Conditional Random Fields whose parameters are estimated with the proposed method have a higher predictive accuracy than those estimated with the standard method, with a smaller inference cost.

This thesis also deals with the problem of human action recognition from multiple cameras, with a focus on reducing the amount of network bandwidth required. A multiple-view dimensionality reduction framework is developed to obtain similar low-dimensional representations for the motion descriptors extracted from multiple cameras. An alternative approach is also proposed: predicting the action class locally at each camera from the motion descriptors extracted from that view, and then integrating the individual action decisions into a global decision on the action performed. 
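The decision-level integration just described can be sketched roughly as follows. The fusion rules, action names, and posterior values below are illustrative assumptions, not details taken from the thesis; the point is that only a local class decision or a short probability vector, rather than the raw motion descriptors, needs to cross the network.

```python
import numpy as np

# Hypothetical action classes; each camera runs a local classifier on its
# own motion descriptors and reports a distribution over these classes.
ACTIONS = ["walk", "wave", "bend"]

def fuse_decisions(camera_posteriors, rule="sum"):
    """Combine per-camera class posteriors into one global action decision.

    camera_posteriors: list of 1-D arrays, one per camera, each summing to 1.
    rule: 'sum' (average the posteriors) or 'vote' (majority over local argmax).
    """
    P = np.asarray(camera_posteriors, dtype=float)
    if rule == "sum":
        fused = P.mean(axis=0)              # sum rule: average the posteriors
        return ACTIONS[int(fused.argmax())]
    if rule == "vote":
        votes = P.argmax(axis=1)            # each camera votes for its best class
        return ACTIONS[int(np.bincount(votes, minlength=P.shape[1]).argmax())]
    raise ValueError(f"unknown rule: {rule}")

# Three cameras observing the same subject (illustrative numbers only).
posteriors = [
    np.array([0.7, 0.2, 0.1]),
    np.array([0.6, 0.3, 0.1]),
    np.array([0.2, 0.5, 0.3]),
]
print(fuse_decisions(posteriors, rule="sum"))   # prints "walk"
```

Either rule keeps the per-clip network traffic to a few bytes per camera, which is the bandwidth advantage the abstract refers to.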
The reported experiments show that the proposed framework has a predictive performance similar to 3D state-of-the-art methods, but with a lower computational complexity and lower bandwidth requirements.
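As a rough illustration of how a group-L1 regularizer produces the feature and model selection effect described above, here is a minimal sketch of the penalty and of block soft-thresholding. Note two hedges: the closed-form proximal step shown assumes disjoint groups, whereas the thesis uses overlapping groups (whose proximal operator has no closed form and needs a more involved gradient-based procedure), and the toy parameter vector stands in for HCRF parameters.

```python
import numpy as np

def group_l1_penalty(w, groups, lam):
    """Group-L1 penalty: lam * sum_g ||w[g]||_2.

    'groups' is a list of index arrays; indices may repeat across groups
    (overlapping groups). Driving a whole group to zero removes an entire
    feature or hidden state from the model, giving the selection effect.
    """
    return lam * sum(np.linalg.norm(w[g]) for g in groups)

def prox_group_l1(w, groups, step_lam):
    """Proximal operator (block soft-thresholding) for NON-overlapping groups.

    Simplified sketch: each block is shrunk toward zero as a unit, and small
    blocks are zeroed out exactly, which is how sparsity arises at the optimum.
    """
    out = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        scale = max(0.0, 1.0 - step_lam / norm) if norm > 0 else 0.0
        out[g] = scale * w[g]               # shrink the whole block together
    return out

# Toy parameter vector split into two disjoint groups (illustrative only).
w = np.array([0.05, -0.02, 1.0, 2.0])
groups = [np.array([0, 1]), np.array([2, 3])]
w_new = prox_group_l1(w, groups, step_lam=0.1)
# The small first group is zeroed out entirely; the large one is only shrunk.
```

In a proximal-gradient loop, a gradient step on the (smooth) negative conditional log-likelihood would alternate with this shrinkage step; groups that stay at zero can then be pruned, which is the source of the smaller inference cost reported in the abstract.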