Internship proposal: Automatic Analysis of Art Exhibition Catalogues

Mathieu Aubry (ENPC, Computer Vision), Béatrice Joyeux-Prunel (ENS, Art History)

Context:

Among the wealth of artifacts produced by artistic activity, exhibition catalogues are a prime source for retracing the making of facts and knowledge within the art field. From their first appearance in the 17th century (Salon de l'Académie, Paris, 1673) to their contemporary usage amid the peripheries of the global art economy, they have varied greatly in size, content, and purpose. Yet since the 17th century, the structure of catalogues has remained constant enough in its inclusion of dates, titles, places, and names to fuel a desire for comparison and convergence. Since the 19th century, catalogues have also reached such a level of global ubiquity that they can serve as a transnational and transperiodical tool of commensurability.

This is why the Artl@s Project has developed a PostGIS database of exhibition catalogues (http://artlas-wp-dev.ens.fr/fr/): to gather and centralize the information contained in exhibition catalogues, over time and on a global scale, and to provide scholars with digital tools and methods to make the best use of this source.

One of the greatest challenges for the Artl@s Project today is the way the team gathers its data from exhibition catalogues. The current harvesting process misses an essential part of the catalogues: artwork reproductions and page layouts. It is also difficult, time-consuming, and error-prone, since it relies on copying from the source by hand. The team also needs a more efficient harvesting process to work with remote partners in Brazil, Lebanon, Morocco, Canada, Israel, and Japan, who want to send large numbers of scanned catalogues, add their data to the database, and cross-reference and connect them with other global data.

Goal:

The goal of this internship is to apply Computer Vision methods to extract structured information from these catalogues at large scale. In particular, it will be necessary to recognize the text, extract the list of presented works from the text, extract the images, and associate them with the different works. The approach will first be developed on a small subset of catalogues, but should ultimately be applied at very large scale, to catalogues with very diverse presentations. The student will first test standard OCR (e.g. https://github.com/tesseract-ocr/tesseract) and layout estimation (e.g. https://github.com/tmbdev/ocropy) software, but will likely have to develop a new approach able to jointly use information from the layout and the semantics of the text to identify and extract structured information, in the spirit of Yang, X., Yumer, E., Asente, P., Kraley, M., Kifer, D., & Giles, C. L., "Learning to Extract Semantic Structure from Documents Using a Multimodal Fully Convolutional Neural Network", CVPR 2017.
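
As a concrete illustration of this first step, the sketch below shows how one might run Tesseract on a scanned catalogue page and recover both the recognized text and word-level layout information. It is only a minimal sketch, assuming the pytesseract and Pillow Python packages are installed on top of Tesseract itself; the file name catalogue_page.png is a hypothetical example scan, not part of the Artl@s corpus.

# Minimal OCR + layout sketch (assumes: pip install pytesseract pillow,
# plus a local Tesseract installation with the French language model).
import pytesseract
from pytesseract import Output
from PIL import Image

# Hypothetical scanned catalogue page.
page = Image.open("catalogue_page.png")

# Plain text recognition; lang="fra" selects Tesseract's French model.
text = pytesseract.image_to_string(page, lang="fra")
print(text)

# Word-level layout: image_to_data returns, for each detected word, its
# bounding box (left, top, width, height) and a confidence score.
data = pytesseract.image_to_data(page, lang="fra", output_type=Output.DICT)
for word, left, top, conf in zip(data["text"], data["left"],
                                 data["top"], data["conf"]):
    if word.strip():
        print(f"{word!r} at ({left}, {top}), confidence {conf}")

The word bounding boxes returned by image_to_data are exactly the kind of joint layout-and-text signal that a learned extractor, in the spirit of the multimodal network cited above, could consume to separate entry numbers, artist names, titles, and reproductions.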

Details: