## Gül Varol

Short bio: Gül Varol is a permanent researcher at the IMAGINE team of École des Ponts ParisTech. Previously, she was a postdoctoral researcher at the University of Oxford (VGG), working with Andrew Zisserman. She obtained her PhD from the WILLOW team of Inria Paris and École Normale Supérieure (ENS). Her thesis, co-advised by Ivan Laptev and Cordelia Schmid, received the ELLIS PhD Award. During her PhD, she spent time at MPI, Adobe, and Google. Prior to that, she received her BS and MS degrees from Boğaziçi University. Her research is focused on computer vision, specifically human understanding in videos, such as action recognition, body shape and motion analysis, and sign languages.

*Master's internship*: Apply for a 2022 internship on sign languages, co-supervised with Andrew Zisserman. Deaf candidates are highly encouraged. The position is suitable for final-year Master's students who are willing to continue to a PhD.

## Research

See Google Scholar profile for a full list of publications.

BOBSL: BBC-Oxford British Sign Language Dataset
Samuel Albanie*, Gül Varol*, Liliane Momeni*, Hannah Bull*, Triantafyllos Afouras, Himel Chowdhury, Neil Fox, Bencie Woll, Rob Cooper, Andrew McParland and Andrew Zisserman
arXiv 2021.
@ARTICLE{albanie21bobsl,
title   = {{BOBSL}: {BBC}-{O}xford {B}ritish {S}ign {L}anguage Dataset},
author  = {Albanie, Samuel and Varol, G{\"u}l and Momeni, Liliane and Bull, Hannah and Afouras, Triantafyllos and Chowdhury, Himel and Fox, Neil and Woll, Bencie and Cooper, Rob and McParland, Andrew and Zisserman, Andrew},
journal = {arXiv},
year    = {2021}
}

In this work, we introduce the BBC-Oxford British Sign Language (BOBSL) dataset, a large-scale video collection of British Sign Language (BSL). BOBSL is an extended and publicly released dataset based on the BSL-1K dataset introduced in previous work. We describe the motivation for the dataset, together with statistics and available annotations. We conduct experiments to provide baselines for the tasks of sign recognition, sign language alignment, and sign language translation. Finally, we describe several strengths and limitations of the data from the perspectives of machine learning and linguistics, note sources of bias present in the dataset, and discuss potential applications of BOBSL in the context of sign language technology. The dataset is publicly available.

Towards unconstrained joint hand-object reconstruction from RGB videos
Yana Hasson, Gül Varol, Cordelia Schmid and Ivan Laptev
3DV 2021.
@INPROCEEDINGS{hasson21homan,
title     = {Towards unconstrained joint hand-object reconstruction from RGB videos},
author    = {Hasson, Yana and Varol, G{\"u}l and Schmid, Cordelia and Laptev, Ivan},
booktitle = {3DV},
year      = {2021}
}

Our work aims to obtain 3D reconstruction of hands and manipulated objects from monocular videos. Reconstructing hand-object manipulations holds a great potential for robotics and learning from human demonstrations. The supervised learning approach to this problem, however, requires 3D supervision and remains limited to constrained laboratory settings and simulators for which 3D ground truth is available. In this paper we first propose a learning-free fitting approach for hand-object reconstruction which can seamlessly handle two-hand object interactions. Our method relies on cues obtained with common methods for object detection, hand pose estimation and instance segmentation. We quantitatively evaluate our approach and show that it can be applied to datasets with varying levels of difficulty for which training data is unavailable.

Aligning Subtitles in Sign Language Videos
Hannah Bull*, Triantafyllos Afouras*, Gül Varol, Samuel Albanie, Liliane Momeni and Andrew Zisserman
ICCV 2021.
@INPROCEEDINGS{bull21bslalign,
title     = {Aligning Subtitles in Sign Language Videos},
author    = {Bull, Hannah and Afouras, Triantafyllos and Varol, G{\"u}l and Albanie, Samuel and Momeni, Liliane and Zisserman, Andrew},
booktitle = {ICCV},
year      = {2021}
}

The goal of this work is to temporally align asynchronous subtitles in sign language videos. In particular, we focus on sign-language interpreted TV broadcast data comprising (i) a video of continuous signing, and (ii) subtitles corresponding to the audio content. Previous work exploiting such weakly-aligned data only considered finding keyword-sign correspondences, whereas we aim to localise a complete subtitle text in continuous signing. We propose a Transformer architecture tailored for this task, which we train on manually annotated alignments covering over 15K subtitles that span 17.7 hours of video. We use BERT subtitle embeddings and CNN video representations learned for sign recognition to encode the two signals, which interact through a series of attention layers. Our model outputs frame-level predictions, i.e., for each video frame, whether it belongs to the queried subtitle or not. Through extensive evaluations, we show substantial improvements over existing alignment baselines that do not make use of subtitle text embeddings for learning. Our automatic alignment model opens up possibilities for advancing machine translation of sign languages via providing continuously synchronized video-text data.
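As an illustration of how frame-level predictions could be turned into a subtitle time span, here is a minimal sketch. The thresholding and longest-run rule, the function name `frames_to_interval`, and the default `fps` and `threshold` values are all assumptions made for this example; the abstract does not describe the paper's own post-processing.

```python
from typing import List, Optional, Tuple


def frames_to_interval(
    probs: List[float], fps: float = 25.0, threshold: float = 0.5
) -> Optional[Tuple[float, float]]:
    """Return (start_sec, end_sec) of the longest run of frames scoring >= threshold.

    Hypothetical post-processing: `probs` holds one score per video frame,
    standing in for the per-frame output of an alignment model.
    """
    best_start, best_len = -1, 0
    run_start, run_len = 0, 0
    for i, p in enumerate(probs):
        if p >= threshold:
            if run_len == 0:
                run_start = i  # a new run of positive frames begins here
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0  # the run is broken by a negative frame
    if best_len == 0:
        return None  # no frame was assigned to the queried subtitle
    # The end time is exclusive: it marks the end of the last positive frame.
    return best_start / fps, (best_start + best_len) / fps


# Toy scores at 1 frame per second: the longest positive run covers frames 4 to 6.
scores = [0.1, 0.8, 0.9, 0.2, 0.7, 0.9, 0.95, 0.1]
print(frames_to_interval(scores, fps=1.0))  # (4.0, 7.0)
```

Keeping only the longest run is one simple way to force a single contiguous interval per subtitle; other choices, such as merging nearby runs, would be equally plausible.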

Action-Conditioned 3D Human Motion Synthesis with Transformer VAE
Mathis Petrovich, Michael J. Black and Gül Varol
ICCV 2021.
@INPROCEEDINGS{petrovich21actor,
title     = {Action-Conditioned 3{D} Human Motion Synthesis with Transformer {VAE}},
author    = {Petrovich, Mathis and Black, Michael J. and Varol, G{\"u}l},
booktitle = {ICCV},
year      = {2021}
}

We tackle the problem of action-conditioned generation of realistic and diverse human motion sequences. In contrast to methods that complete, or extend, motion sequences, this task does not require an initial pose or sequence. Here we learn an action-aware latent representation for human motions by training a generative variational autoencoder (VAE). By sampling from this latent space and querying a certain duration through a series of positional encodings, we synthesize variable-length motion sequences conditioned on a categorical action. Specifically, we design a Transformer-based architecture, ACTOR, for encoding and decoding a sequence of parametric SMPL human body models estimated from action recognition datasets. We evaluate our approach on the NTU RGB+D, HumanAct12 and UESTC datasets and show improvements over the state of the art. Furthermore, we present two use cases: improving action recognition through adding our synthesized data to training, and motion denoising. Our code and models will be made available.

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Max Bain, Arsha Nagrani, Gül Varol and Andrew Zisserman
ICCV 2021.
@INPROCEEDINGS{bain21_frozen,
title     = {Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval},
author    = {Bain, Max and Nagrani, Arsha and Varol, G{\"u}l and Zisserman, Andrew},
booktitle = {ICCV},
year      = {2021}
}

Our objective in this work is video-text retrieval - in particular a joint embedding that enables efficient text-to-video retrieval. The challenges in this area include the design of the visual architecture and the nature of the training data, in that the available large scale video-text training datasets, such as HowTo100M, are noisy and hence competitive performance is achieved only at scale through large amounts of compute. We address both these challenges in this paper. We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets. Our model is an adaptation and extension of the recent ViT and Timesformer architectures, and consists of attention in both space and time. The model is flexible and can be trained on both image and video text datasets, either independently or in conjunction. It is trained with a curriculum learning schedule that begins by treating images as 'frozen' snapshots of video, and then gradually learns to attend to increasing temporal context when trained on video datasets. We also provide a new video-text pretraining dataset WebVid-2M, comprised of over two million videos with weak captions scraped from the internet. Despite training on datasets that are an order of magnitude smaller, we show that this approach yields state-of-the-art results on standard downstream video-retrieval benchmarks including MSR-VTT, MSVD, DiDeMo and LSMDC.

Read and Attend: Temporal Localisation in Sign Language Videos
Gül Varol*, Liliane Momeni*, Samuel Albanie*, Triantafyllos Afouras* and Andrew Zisserman
CVPR 2021.
@INPROCEEDINGS{varol21_bslattend,
title     = {Read and Attend: Temporal Localisation in Sign Language Videos},
author    = {Varol, G{\"u}l and Momeni, Liliane and Albanie, Samuel and Afouras, Triantafyllos and Zisserman, Andrew},
booktitle = {CVPR},
year      = {2021}
}

The objective of this work is to annotate sign instances across a broad vocabulary in continuous sign language. We train a Transformer model to ingest a continuous signing stream and output a sequence of written tokens on a large-scale collection of signing footage with weakly-aligned subtitles. We show that through this training it acquires the ability to attend to a large vocabulary of sign instances in the input sequence, enabling their localisation. Our contributions are as follows: (1) we demonstrate the ability to leverage large quantities of continuous signing videos with weakly-aligned subtitles to localise signs in continuous sign language; (2) we employ the learned attention to automatically generate hundreds of thousands of annotations for a large sign vocabulary; (3) we collect a set of 37K manually verified sign instances across a vocabulary of 950 sign classes to provide a more robust sign language benchmark; (4) by training on the newly annotated data from our method, we outperform the prior state of the art on the BSL-1K sign language recognition benchmark.

Sign Language Segmentation with Temporal Convolutional Networks
Katrin Renz, Nicolaj C. Stache, Samuel Albanie and Gül Varol
ICASSP 2021.
@INPROCEEDINGS{renz21_segmentation,
title     = {Sign Language Segmentation with Temporal Convolutional Networks},
author    = {Renz, Katrin and Stache, Nicolaj C. and Albanie, Samuel and Varol, G{\"u}l},
booktitle = {ICASSP},
year      = {2021}
}

The objective of this work is to determine the location of temporal boundaries between signs in continuous sign language videos. Our approach employs 3D convolutional neural network representations with iterative temporal segment refinement to resolve ambiguities between sign boundary cues. We demonstrate the effectiveness of our approach on the BSLCORPUS, PHOENIX14 and BSL-1K datasets, showing considerable improvement over the state of the art and the ability to generalise to new signers, languages and domains.

Synthetic Humans for Action Recognition from Unseen Viewpoints
Gül Varol, Ivan Laptev, Cordelia Schmid and Andrew Zisserman
IJCV 2021.
@ARTICLE{varol21_surreact,
title   = {Synthetic Humans for Action Recognition from Unseen Viewpoints},
author  = {Varol, G{\"u}l and Laptev, Ivan and Schmid, Cordelia and Zisserman, Andrew},
journal = {IJCV},
year    = {2021}
}

Our goal in this work is to improve the performance of human action recognition for viewpoints unseen during training by using synthetic training data. Although synthetic data has been shown to be beneficial for tasks such as human pose estimation, its use for RGB human action recognition is relatively unexplored. We make use of the recent advances in monocular 3D human body reconstruction from real action sequences to automatically render synthetic training videos for the action labels. We make the following contributions: (i) we investigate the extent of variations and augmentations that are beneficial to improving performance at new viewpoints. We consider changes in body shape and clothing for individuals, as well as more action relevant augmentations such as non-uniform frame sampling, and interpolating between the motion of individuals performing the same action; (ii) We introduce a new dataset, SURREACT, that allows supervised training of spatio-temporal CNNs for action classification; (iii) We substantially improve the state-of-the-art action recognition performance on the NTU RGB+D and UESTC standard human action multi-view benchmarks; Finally, (iv) we extend the augmentation approach to in-the-wild videos from a subset of the Kinetics dataset to investigate the case when only one-shot training data is available, and demonstrate improvements in this case as well.

Watch, read and lookup: learning to spot signs from multiple supervisors
Liliane Momeni*, Gül Varol*, Samuel Albanie*, Triantafyllos Afouras and Andrew Zisserman
ACCV 2020. (Best Application Paper Award)
@INPROCEEDINGS{momeni20_spotting,
title     = {Watch, read and lookup: learning to spot signs from multiple supervisors},
author    = {Momeni, Liliane and Varol, G{\"u}l and Albanie, Samuel and Afouras, Triantafyllos and Zisserman, Andrew},
booktitle = {ACCV},
year      = {2020}
}

The focus of this work is sign spotting: given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video. To achieve this sign spotting task, we train a model using multiple types of available supervision by: (1) watching existing sparsely labelled footage with a semi-supervised learning objective; (2) reading associated subtitles (readily available translations of the signed content) which provide additional weak-supervision; (3) looking up words (for which no co-articulated labelled examples are available) in visual sign language dictionaries to enable novel sign spotting. These three tasks are integrated into a unified learning framework using the principles of Noise Contrastive Estimation and Multiple Instance Learning. We validate the effectiveness of our approach on few-shot sign spotting benchmarks. In addition, we contribute a machine-readable British Sign Language (BSL) dictionary dataset of isolated signs, BslDict, to facilitate study of this task. The dataset, models and code are available at our project page.

BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues
Samuel Albanie*, Gül Varol*, Liliane Momeni, Triantafyllos Afouras, Joon Son Chung, Neil Fox and Andrew Zisserman
ECCV 2020.
@INPROCEEDINGS{albanie20_bsl1k,
title     = {{BSL-1K}: {S}caling up co-articulated sign language recognition using mouthing cues},
author    = {Albanie, Samuel and Varol, G{\"u}l and Momeni, Liliane and Afouras, Triantafyllos and Chung, Joon Son and Fox, Neil and Zisserman, Andrew},
booktitle = {ECCV},
year      = {2020}
}

Recent progress in fine-grained gesture and action classification, and machine translation, point to the possibility of automated sign language recognition becoming a reality. A key stumbling block in making progress towards this goal is a lack of appropriate training data, stemming from the high complexity of sign annotation and a limited supply of qualified annotators. In this work, we introduce a new scalable approach to data collection for sign recognition in continuous videos. We make use of weakly-aligned subtitles for broadcast footage together with a keyword spotting method to automatically localise sign-instances for a vocabulary of 1,000 signs in 1,000 hours of video. We make the following contributions: (1) We show how to use mouthing cues from signers to obtain high-quality annotations from video data - the result is the BSL-1K dataset, a collection of British Sign Language (BSL) signs of unprecedented scale; (2) We show that we can use BSL-1K to train strong sign recognition models for co-articulated signs in BSL and that these models additionally form excellent pretraining for other sign languages and benchmarks - we exceed the state of the art on both the MSASL and WLASL benchmarks. Finally, (3) we propose new large-scale evaluation sets for the tasks of sign recognition and sign spotting and provide baselines which we hope will serve to stimulate research in this area.

Learning joint reconstruction of hands and manipulated objects
Yana Hasson, Gül Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J. Black, Ivan Laptev and Cordelia Schmid
CVPR 2019.
@INPROCEEDINGS{hasson19_obman,
title     = {Learning joint reconstruction of hands and manipulated objects},
author    = {Hasson, Yana and Varol, G{\"u}l and Tzionas, Dimitrios and Kalevatykh, Igor and Black, Michael J. and Laptev, Ivan and Schmid, Cordelia},
booktitle = {CVPR},
year      = {2019}
}

Estimating hand-object manipulations is essential for interpreting and imitating human actions. Previous work has made significant progress towards reconstruction of hand poses and object shapes in isolation. Yet, reconstructing hands and objects during manipulation is a more challenging task due to significant occlusions of both the hand and object. While presenting challenges, manipulations may also simplify the problem since the physics of contact restricts the space of valid hand-object configurations. For example, during manipulation, the hand and object should be in contact but not interpenetrate. In this work, we regularize the joint reconstruction of hands and objects with manipulation constraints. We present an end-to-end learnable model that exploits a novel contact loss that favors physically plausible hand-object constellations. Our approach improves grasp quality metrics over baselines, using RGB images as input. To train and evaluate the model, we also propose a new large-scale synthetic dataset, ObMan, with hand-object manipulations. We demonstrate the transferability of ObMan-trained models to real data.

BodyNet: Volumetric Inference of 3D Human Body Shapes
Gül Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev and Cordelia Schmid
ECCV 2018.
@INPROCEEDINGS{varol18_bodynet,
title     = {{BodyNet}: Volumetric Inference of {3D} Human Body Shapes},
author    = {Varol, G{\"u}l and Ceylan, Duygu and Russell, Bryan and Yang, Jimei and Yumer, Ersin and Laptev, Ivan and Schmid, Cordelia},
booktitle = {ECCV},
year      = {2018}
}

Human shape estimation is an important task for video editing, animation and the fashion industry. Predicting 3D human body shape from natural images, however, is highly challenging due to factors such as variation in human bodies, clothing and viewpoint. Prior methods addressing this problem typically attempt to fit parametric body models with certain priors on pose and shape. In this work we argue for an alternative representation and propose BodyNet, a neural network for direct inference of volumetric body shape from a single image. BodyNet is an end-to-end trainable network that benefits from (i) a volumetric 3D loss, (ii) a multi-view re-projection loss, and (iii) intermediate supervision of 2D pose, 2D body part segmentation, and 3D pose. Each of them results in performance improvement as demonstrated by our experiments. To evaluate the method, we fit the SMPL model to our network output and show state-of-the-art results on the SURREAL and Unite the People datasets, outperforming recent approaches. Besides achieving state-of-the-art performance, our method also enables volumetric body-part segmentation.

Long-term Temporal Convolutions for Action Recognition
Gül Varol, Ivan Laptev and Cordelia Schmid
TPAMI 2018.
@ARTICLE{varol18_ltc,
title     = {Long-term Temporal Convolutions for Action Recognition},
author    = {Varol, G{\"u}l and Laptev, Ivan and Schmid, Cordelia},
journal   = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year      = {2018},
volume    = {40},
number    = {6},
pages     = {1510--1517},
doi       = {10.1109/TPAMI.2017.2712608}
}

Typical human actions last several seconds and exhibit characteristic spatio-temporal structure. Recent methods attempt to capture this structure and learn action representations with convolutional neural networks. Such representations, however, are typically learned at the level of a few video frames failing to model actions at their full temporal extent. In this work we learn video representations using neural networks with long-term temporal convolutions (LTC). We demonstrate that LTC-CNN models with increased temporal extents improve the accuracy of action recognition. We also study the impact of different low-level representations, such as raw values of video pixels and optical flow vector fields and demonstrate the importance of high-quality optical flow estimation for learning accurate action models. We report state-of-the-art results on two challenging benchmarks for human action recognition UCF101 (92.7%) and HMDB51 (67.2%).

Learning from Synthetic Humans
Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev and Cordelia Schmid
CVPR 2017.
@INPROCEEDINGS{varol17_surreal,
title     = {Learning from Synthetic Humans},
author    = {Varol, G{\"u}l and Romero, Javier and Martin, Xavier and Mahmood, Naureen and Black, Michael J. and Laptev, Ivan and Schmid, Cordelia},
booktitle = {CVPR},
year      = {2017}
}

Estimating human pose, shape, and motion from images and video are fundamental challenges with many applications. Recent advances in 2D human pose estimation use large amounts of manually-labeled training data for learning convolutional neural networks (CNNs). Such data is time consuming to acquire and difficult to extend. Moreover, manual labeling of 3D pose, depth and motion is impractical. In this work we present SURREAL: a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data. We generate more than 6 million frames together with ground truth pose, depth maps, and segmentation masks. We show that CNNs trained on our synthetic dataset allow for accurate human depth estimation and human part segmentation in real RGB images. Our results and the new dataset open up new possibilities for advancing person analysis using cheap and large-scale synthetic data.

Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding
Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev and Abhinav Gupta
ECCV 2016.
@INPROCEEDINGS{sigurdsson16_charades,
title     = {Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding},
author    = {Gunnar A. Sigurdsson and G{\"u}l Varol and Xiaolong Wang and Ali Farhadi and Ivan Laptev and Abhinav Gupta},
booktitle = {ECCV},
year      = {2016}
}

Computer vision has a great potential to help our daily lives by searching for lost keys, watering flowers or reminding us to take a pill. To succeed with such tasks, computer vision methods need to be trained from real and diverse examples of our daily dynamic scenes. While most of such scenes are not particularly exciting, they typically do not appear on YouTube, in movies or TV broadcasts. So how do we collect sufficiently many diverse but *boring* samples representing our lives? We propose a novel Hollywood in Homes approach to collect such data. Instead of shooting videos in the lab, we ensure diversity by distributing and crowdsourcing the whole process of video creation from script writing to video recording and annotation. Following this procedure we collect a new dataset, *Charades*, with hundreds of people recording videos in their own homes, acting out casual everyday activities. The dataset is composed of 9,848 annotated videos with an average length of 30 seconds, showing activities of 267 people from three continents, and over 15% of the videos have more than one person. Each video is annotated by multiple free-text descriptions, action labels, action intervals and classes of interacted objects. In total, Charades provides 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes. Using this rich data, we evaluate and provide baseline results for several tasks including action recognition and automatic description generation. We believe that the realism, diversity, and casual nature of this dataset will present unique challenges and new opportunities for the computer vision community.
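To give a feel for the annotation density implied by these totals, the short sketch below derives a few averages from the numbers quoted in the abstract. The derived values are simple arithmetic added for illustration, not figures reported in the paper, and the total duration is only an approximation based on the stated 30-second average length. The function name `charades_summary` is hypothetical.

```python
def charades_summary() -> dict:
    """Derive per-video and per-class averages from the totals quoted above."""
    videos = 9_848
    avg_length_sec = 30          # stated average video length
    descriptions = 27_847
    intervals = 66_500
    action_classes = 157
    object_labels = 41_104
    object_classes = 46
    return {
        # Approximation: number of videos times the average length.
        "approx_total_hours": videos * avg_length_sec / 3600,
        "descriptions_per_video": descriptions / videos,
        "intervals_per_video": intervals / videos,
        "intervals_per_action_class": intervals / action_classes,
        "labels_per_object_class": object_labels / object_classes,
    }


for name, value in charades_summary().items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the dataset amounts to roughly 82 hours of video, with about 2.8 descriptions and 6.8 localized action intervals per video on average.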

### PhD Thesis

Learning human body and human action representations from visual data
Gül Varol
École Normale Supérieure (ENS), 2019.
@PHDTHESIS{varol19_thesis,
title     = {Learning human body and human action representations from visual data},
author    = {G{\"u}l Varol},
school    = {Ecole Normale Sup\'erieure (ENS)},
year      = {2019}
}

The focus of visual content is often people. Automatic analysis of people from visual data is therefore of great importance for numerous applications in content search, autonomous driving, surveillance, health care, and entertainment.

The goal of this thesis is to learn visual representations for human understanding. Particular emphasis is given to two closely related areas of computer vision: human body analysis and human action recognition.

In human body analysis, we first introduce a new synthetic dataset for people, the SURREAL dataset, for training convolutional neural networks (CNNs) with free labels. We show the generalization capabilities of such models on real images for the tasks of body part segmentation and human depth estimation. Our work demonstrates that models trained only on synthetic data obtain sufficient generalization on real images while also providing good initialization for further training. Next, we use this data to learn the 3D body shape from images. We propose the BodyNet architecture that benefits from the volumetric representation, the multi-view re-projection loss, and the multi-task training of relevant tasks such as 2D/3D pose estimation and part segmentation. Our experiments demonstrate the advantages from each of these components. We further observe that the volumetric representation is flexible enough to capture 3D clothing deformations, unlike the more frequently used parametric representation.

In human action recognition, we explore two different aspects of action representations. The first one is the discriminative aspect which we improve by using long-term temporal convolutions. We present an extensive study on the spatial and temporal resolutions of an input video. Our results suggest that the 3D CNNs should operate on long input videos to obtain state-of-the-art performance. We further extend 3D CNNs for optical flow input and highlight the importance of the optical flow quality. The second aspect that we study is the view-independence of the learned video representations. We enforce an additional similarity loss that maximizes the similarity between two temporally synchronous videos which capture the same action. When used in conjunction with the action classification loss in 3D CNNs, this similarity constraint helps improve generalization to unseen viewpoints.

In summary, our contributions are the following: (i) we generate photo-realistic synthetic data for people that allows training CNNs for human body analysis, (ii) we propose a multi-task architecture to recover a volumetric body shape from a single image, (iii) we study the benefits of long-term temporal convolutions for human action recognition using 3D CNNs, (iv) we incorporate similarity training in multi-view videos to design view-independent representations for action recognition.
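The similarity training across views mentioned in the abstract can be pictured with a small sketch. The abstract does not give the exact form of the loss, so the cosine-based formulation, the `weight` parameter, and the function names below are assumptions chosen for illustration; the feature vectors stand in for 3D CNN embeddings of two synchronized videos of the same action.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_a * norm_b)


def similarity_loss(feat_view_a: Sequence[float], feat_view_b: Sequence[float]) -> float:
    """Assumed form: 0 when the two views agree, growing as they diverge."""
    return 1.0 - cosine_similarity(feat_view_a, feat_view_b)


def total_loss(
    classification_loss: float,
    feat_view_a: Sequence[float],
    feat_view_b: Sequence[float],
    weight: float = 1.0,
) -> float:
    """Classification loss plus a weighted cross-view similarity term."""
    return classification_loss + weight * similarity_loss(feat_view_a, feat_view_b)


# Two hypothetical embeddings of the same action seen from different cameras.
front_view = [0.9, 0.1, 0.3]
side_view = [0.8, 0.2, 0.4]
print(total_loss(classification_loss=0.5, feat_view_a=front_view, feat_view_b=side_view))
```

Minimizing the similarity term pulls the two view embeddings together, which is one way to encourage view-independent representations alongside the classification objective.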


## People

Alumni:
• 2020 - 2021 Katrin Renz (MS with Samuel Albanie and Nicolaj Stache) - now PhD at MPI
• 2020 - 2021 Aure Enkaoua (MS with Ester Bonmati Coll and Neil Fox) - now PhD at UCL
• 2021 - Charles Raude (MS intern with Justine Cassell, Ivan Laptev) - now at MVA
• 2021 - Jonathan Carter (MS intern with Samuel Albanie) - now PhD at Oxford
