Abstract

In this work, our goal is large-vocabulary continuous sign language recognition (CSLR) — the task of predicting a sequence of signs from a continuous signing video clip, without constraints on the lexicon. Previous works have been unable to address this realistic setting due to the absence of appropriate benchmarks. We introduce a new CSLR test benchmark consisting of manually annotated continuous sign-level data. It is the largest CSLR benchmark to date, both in terms of vocabulary (5K different signs) and duration (6 hours), containing 48K gloss annotations. To address the CSLR task, we also propose a Transformer model that ingests a signing sequence and outputs embeddings in a joint embedding space shared between signed language and spoken language text. We demonstrate that, with a careful choice of loss functions, training the model jointly for the CSLR and sign language retrieval tasks improves CSLR performance by providing context. Our model significantly outperforms the previous state of the art on our newly collected CSLR benchmark, serving as a strong baseline for the community.
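To make the joint-training idea concrete, below is a minimal illustrative sketch of how a recognition loss and a retrieval loss over a shared video–text embedding space can be combined. This is a generic PyTorch example under our own assumptions (a CTC-style recognition term and a symmetric InfoNCE retrieval term, with a made-up weighting `lambda_retrieval`); it is not the exact formulation used in the paper.

```python
import torch
import torch.nn.functional as F

def joint_loss(gloss_logits, video_embs, text_embs,
               targets, input_lens, target_lens,
               temperature=0.07, lambda_retrieval=1.0):
    """Illustrative joint objective: recognition (CTC) + retrieval (InfoNCE).

    Assumed shapes (not taken from the paper):
      gloss_logits: (T, B, V) per-frame gloss logits for CTC
      video_embs, text_embs: (B, D) clip-level embeddings in the joint space
    """
    # Recognition term: CTC over per-frame gloss predictions.
    log_probs = gloss_logits.log_softmax(dim=-1)
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens,
                     blank=0, zero_infinity=True)

    # Retrieval term: symmetric contrastive loss between L2-normalised
    # video and text embeddings, matching each clip to its own sentence.
    v = F.normalize(video_embs, dim=-1)
    t = F.normalize(text_embs, dim=-1)
    sim = v @ t.T / temperature
    labels = torch.arange(sim.size(0), device=sim.device)
    nce = 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.T, labels))

    return ctc + lambda_retrieval * nce
```

The retrieval term encourages clip-level embeddings to align with the corresponding spoken-language text, which is one way such a model can inject sentence-level context into the sign-level recognition objective.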

License

The BOBSL-CSLR dataset is available to download for commercial/research purposes under a Creative Commons Attribution 4.0 International License. A complete version of the license can be found here. Note that this release contains only the textual annotations; see the terms of use of the original BOBSL dataset at https://www.robots.ox.ac.uk/~vgg/data/bobsl/ for the non-commercial research license associated with the videos.

Examples from the dataset

[Figure: overview of example annotations from the dataset]

Citation

Contact

For any queries relating to BOBSL-CSLR, please email bobsl@googlegroups.com.

Acknowledgements

This work was granted access to the HPC resources of IDRIS under the allocation 2023-AD011013569 made by GENCI. The authors would like to acknowledge the ANR project CorVis ANR-21-CE23-0003-01 and the Royal Society Research Professorship RP\R1\191132.
The website template was borrowed from Michaël Gharbi.