Action recognition in visual sensor networks: a data fusion perspective

Author:
  1. Cilla Ugarte, Rodrigo

Supervised by:
  1. Antonio Berlanga de Jesús (Thesis advisor)
  2. Miguel Ángel Patricio Guisado (Thesis advisor)

Defence university: Universidad Carlos III de Madrid

Date of defence: 14 December 2012

Committee:
  1. María Araceli Sanchís de Miguel (Chair)
  2. José Ramón Casar Corredera (Secretary)
  3. Luis Miguel Parreira Correira (Committee member)

Type: Dissertation

Abstract

Visual Sensor Networks have emerged as a new technology for bringing computer vision algorithms to the real world. However, they impose restrictions on the computational resources and bandwidth available to solve target problems. This thesis is concerned with the definition of new efficient algorithms to perform Human Action Recognition with Visual Sensor Networks.

Human Action Recognition systems apply sequence modelling methods to integrate the temporal sensor measurements available. Among sequence modelling methods, the Hidden Conditional Random Field has shown great performance in sequence classification tasks, outperforming many other methods. However, no parameter estimation procedure with feature and model selection properties had been proposed for it. This thesis fills that gap by proposing a new objective function to optimize during training: the L2 regularizer employed in the standard objective function is replaced by an overlapping group-L1 regularizer that produces feature and model selection effects at the optima (a sketch of this penalty is given below). A gradient-based search strategy is proposed to find the optimal parameters of the objective function. Experimental evidence shows that Hidden Conditional Random Fields whose parameters are estimated with the proposed method achieve higher predictive accuracy than those estimated with the standard method, with a smaller inference cost.

This thesis also deals with the problem of human action recognition from multiple cameras, with a focus on reducing the amount of network bandwidth required. A multiple-view dimensionality reduction framework is developed to obtain similar low-dimensional representations for the motion descriptors extracted from the different cameras. An alternative is also proposed: the action class is predicted locally at each camera from the motion descriptors extracted from its view, and the per-camera decisions are then integrated into a global decision on the action performed (see the fusion sketch below). The reported experiments show that the proposed framework has a predictive performance similar to state-of-the-art 3D methods, but with lower computational complexity and lower bandwidth requirements.
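As a minimal sketch of the kind of training penalty described above (not the thesis's implementation; the grouping of parameters, the function names, and the hyperparameter lam are all assumptions for illustration), an overlapping group-L1 regularizer sums the L2 norms of possibly overlapping parameter groups, so whole groups can be driven exactly to zero at the optimum:

    import numpy as np

    def group_l1_penalty(theta, groups, lam):
        # Overlapping group-L1 penalty: lam * sum_g ||theta[g]||_2.
        # theta  -- flat HCRF parameter vector
        # groups -- list of index arrays; groups may overlap (e.g. all
        #           parameters tied to one feature or one hidden state)
        # lam    -- regularization strength (assumed hyperparameter)
        return lam * sum(np.linalg.norm(theta[g]) for g in groups)

    def group_l1_subgradient(theta, groups, lam):
        # A subgradient of the penalty, usable inside a gradient-based
        # search of the kind the abstract describes.
        sub = np.zeros_like(theta)
        for g in groups:
            norm = np.linalg.norm(theta[g])
            if norm > 0.0:
                sub[g] += lam * theta[g] / norm
            # at norm == 0 the zero vector is a valid subgradient choice
        return sub

Because the penalty is non-smooth at zero, practical gradient-based solvers typically rely on proximal or orthant-wise steps to produce exact zeros; the abstract does not specify which variant the thesis uses.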
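The multiple-view dimensionality reduction idea can be illustrated with off-the-shelf two-view CCA as a stand-in (this is not the framework developed in the thesis; the data shapes and variable names below are invented for the example). Each view learns its own projection, and corresponding frames land near each other in a shared low-dimensional space:

    import numpy as np
    from sklearn.cross_decomposition import CCA

    # Hypothetical motion descriptors: n frames observed by two cameras,
    # with d1- and d2-dimensional descriptors respectively.
    rng = np.random.default_rng(0)
    n, d1, d2, k = 200, 64, 48, 8
    X1 = rng.standard_normal((n, d1))               # camera 1 descriptors
    X2 = (X1 @ rng.standard_normal((d1, d2))
          + 0.1 * rng.standard_normal((n, d2)))     # correlated camera 2 view

    # Two-view CCA fits one projection per view so that the projected
    # representations are maximally correlated, i.e. "similar".
    cca = CCA(n_components=k)
    Z1, Z2 = cca.fit_transform(X1, X2)              # both of shape (n, k)

    # Each camera now transmits only k values per frame instead of
    # d1 or d2, which is where the bandwidth saving comes from.

A network with more than two cameras would need a multi-view generalization of this two-view construction.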
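For the decision-level alternative, a minimal fusion rule is sketched below (assuming, hypothetically, that each camera outputs log class posteriors; only these small score vectors cross the network, so bandwidth grows with the number of action classes rather than with descriptor dimensionality):

    import numpy as np

    def fuse_camera_decisions(log_probs_per_camera):
        # log_probs_per_camera -- array of shape (n_cameras, n_classes)
        # holding each camera's log class posteriors.
        # Summing log posteriors is a product-of-experts rule under a
        # naive conditional-independence assumption across views.
        fused = np.sum(log_probs_per_camera, axis=0)
        return int(np.argmax(fused))

For example, with three cameras and five action classes the input is a 3 x 5 array, and the output is the index of the globally most likely action.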