docExtractor: An off-the-shelf historical document element extraction

Tom Monnier, Mathieu Aubry

LIGM, Ecole des Ponts, Univ Gustave Eiffel, CNRS
teaser.jpg

Paper | Code | Online demo | Slides | IlluHisDoc dataset

Abstract


We present docExtractor, a generic approach for extracting visual elements such as text lines or illustrations from historical documents without requiring any real data annotation. We demonstrate that it provides high-quality performance as an off-the-shelf system across a wide variety of datasets and leads to results on par with the state of the art when fine-tuned. We argue that the performance obtained without fine-tuning on a specific dataset is critical for applications, in particular in digital humanities, and that the line-level page segmentation we address is the most relevant for a general-purpose element extraction engine. We rely on a fast generator of a rich synthetic document dataset called SynDoc and design a fully convolutional network, which we show to generalize better than a detection-based approach. Furthermore, we introduce a new public dataset dubbed IlluHisDoc dedicated to the fine evaluation of illustration segmentation in historical documents.

Online demo

Check out our web application at https://enherit.paris.inria.fr/ for a live demo! We recommend using Google Chrome for a better user experience. The application was developed in collaboration with Pierre-Guillaume Raverdy from Inria and was supported in part by ENPC and Ecole Nationale des Chartes.

Method


SynDoc

syndoc.png
A new 10k synthetic document dataset with fine-grained annotations (bounding shapes for illustrations, x-height+border for text lines) for line-level page segmentation.

Segmentation network

A U-Net architecture with a simple ResNet-18 without max-pooling as encoder and deconvolutional layers for upsampling. Small components are filtered out from segmentation output.
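
The small-component filtering mentioned for the segmentation network can be illustrated with a minimal sketch. This is not the docExtractor implementation: the function name `filter_small_components`, the `min_size` threshold and the choice of 4-connectivity are all assumptions made for illustration, and the mask is a plain nested list rather than a network output tensor.

```python
from collections import deque


def filter_small_components(mask, min_size):
    """Return a copy of a binary mask without its small connected components.

    Hypothetical helper: `min_size` (in pixels) and 4-connectivity are
    illustrative assumptions, not values taken from docExtractor.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    filtered = [[0] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first search collects one connected component.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                neighbours = ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1))
                for ny, nx in neighbours:
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Keep the component only if it is large enough.
            if len(component) >= min_size:
                for cy, cx in component:
                    filtered[cy][cx] = 1
    return filtered


# Toy predicted mask: one 4-pixel blob and two isolated pixels.
predicted = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
]
cleaned = filter_small_components(predicted, min_size=3)
```

With `min_size=3`, only the 4-pixel blob survives in `cleaned`. In practice the same step would more likely be applied per predicted class with a library routine for connected-component labelling; the pure-Python version here only keeps the example self-contained.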

Results


cBAD2017 - Simple and Complex

cbad2017.jpg

cBAD2019

cbad2019.jpg

IlluHisDoc (new)

illuhidoc.jpg

Mandragore

mandragore.jpg

RASM2019

rasm2019.jpg

How to cite?


If you find this work useful in your research, please consider citing:

@inproceedings{monnier2020docExtractor,
  title={docExtractor: An off-the-shelf historical document element extraction},
  author={Monnier, Tom and Aubry, Mathieu},
  booktitle={ICFHR},
  year={2020},
}

Resources


Paper

paper.jpg

Code

github.png

Demo

webapp.png

Slides

slides.png

IlluHisDoc

illuhisdoc_raw.jpg

Acknowledgments


This work was supported in part by ANR project EnHerit ANR-17-CE23-0008, project Rapid Tabasco and gifts from Adobe. We thank Beatrice Joyeux-Prunel, K. Bender, Joanna Fronska, Matthieu Husson, Stavros Lazaris, Galla Topalian, Claudia Rabel, Jean-Philippe Moreux and Alexandre Turc for their help in the data collection and fruitful discussions. We also thank Francois Darmon, Pierre-Guillaume Raverdy, Tristan Dot and Ryad Kaoua for code testing and feedback.