@PHDTHESIS{varol19_thesis,
title     = {Learning human body and human action representations from visual data},
author    = {G{\"u}l Varol},
  school    = {{\'E}cole Normale Sup{\'e}rieure (ENS)},
year      = {2019}
}
People are often the central subject of visual content. Automatic analysis of people from visual data is therefore of great importance for numerous applications, including content search, autonomous driving, surveillance, health care, and entertainment.
The goal of this thesis is to learn visual representations for human understanding. Particular emphasis is given to two closely related areas of computer vision: human body analysis and human action recognition.
In human body analysis, we first introduce SURREAL, a new synthetic dataset of people whose ground-truth labels come for free with the rendering process, for training convolutional neural networks (CNNs). We show that such models generalize to real images on the tasks of body part segmentation and human depth estimation: models trained only on synthetic data achieve reasonable accuracy on real images and provide a good initialization for further training. Next, we use this data to estimate 3D body shape from a single image. We propose the BodyNet architecture, which benefits from a volumetric output representation, a multi-view re-projection loss, and multi-task training with related tasks such as 2D/3D pose estimation and part segmentation. Our experiments demonstrate the advantage of each of these components. We further observe that the volumetric representation is flexible enough to capture 3D clothing deformations, unlike the more commonly used parametric representation.
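To make the re-projection idea concrete, the following is a minimal sketch of how such a loss could look, assuming the network outputs a voxel occupancy grid and that orthographic projection is approximated by a max over the viewing axis; the function name, tensor shapes, and PyTorch framing are illustrative assumptions, not the thesis's exact implementation.

    import torch.nn.functional as F

    def reprojection_loss(voxels, front_sil, side_sil):
        # voxels:    (B, D, H, W) predicted occupancy probabilities in [0, 1]
        # front_sil: (B, H, W) binary ground-truth front-view silhouette
        # side_sil:  (B, H, D) binary ground-truth side-view silhouette
        # A pixel is foreground if any voxel along its viewing ray is
        # occupied; a max over the ray approximates this projection.
        front_proj = voxels.max(dim=1).values                  # drop depth -> (B, H, W)
        side_proj = voxels.max(dim=3).values.permute(0, 2, 1)  # drop width -> (B, H, D)
        return (F.binary_cross_entropy(front_proj, front_sil) +
                F.binary_cross_entropy(side_proj, side_sil))

In a multi-task setting, this term would be added with a weighting factor to the main volumetric loss and to the 2D/3D pose and segmentation losses.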
In human action recognition, we explore two aspects of action representations. The first is their discriminative power, which we improve with long-term temporal convolutions. We present an extensive study of the spatial and temporal resolutions of the input video. Our results suggest that 3D CNNs should operate on long input videos to reach state-of-the-art performance. We further extend 3D CNNs to optical flow input and highlight the importance of optical flow quality. The second aspect is the view-independence of the learned video representations. We add a similarity loss that maximizes the agreement between the representations of two temporally synchronized videos capturing the same action from different viewpoints. Used in conjunction with the action classification loss in 3D CNNs, this similarity constraint helps improve generalization to unseen viewpoints.
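The two ingredients above can be combined into a single training objective: a 3D CNN over long clips produces an embedding that is trained jointly with an action classification loss and a cross-view similarity term. Below is a minimal sketch under stated assumptions; the encoder and classifier modules, the L2 form of the similarity term, and the loss weight are illustrative placeholders rather than the thesis's exact formulation.

    import torch.nn.functional as F

    def joint_loss(encoder, classifier, clip_v1, clip_v2, labels, lam=1.0):
        # clip_v1, clip_v2: (B, C, T, H, W) temporally synchronized clips of
        # the same actions seen from two different viewpoints; T is long
        # (tens of frames) in the spirit of long-term temporal convolutions.
        z1 = encoder(clip_v1)    # (B, feat_dim) clip embedding from a 3D CNN
        z2 = encoder(clip_v2)

        # Action classification loss on both views.
        cls = F.cross_entropy(classifier(z1), labels) + \
              F.cross_entropy(classifier(z2), labels)

        # Similarity term pulling embeddings of synchronous views together.
        sim = F.mse_loss(z1, z2)

        return cls + lam * sim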
In summary, our contributions are the following: (i) we generate photo-realistic synthetic data of people that allows training CNNs for human body analysis; (ii) we propose a multi-task architecture to recover a volumetric body shape from a single image; (iii) we study the benefits of long-term temporal convolutions for human action recognition with 3D CNNs; (iv) we incorporate similarity training on multi-view videos to obtain view-independent representations for action recognition.