Perspectives and New Challenges in Data Science

Program with abstracts

Ecole des Ponts ParisTech, February 3rd 2016

Program: morning session

At Ecole des Ponts in Amphi Caquot.

9h15

Arrival of attendees

9h30-9h40

Organisers

ENPC

Foreword

9h40-10h20

François Bancilhon

CEO of Data Publica

Data science and its application to predictive marketing

Predictive marketing is a new and emerging field. I will give a general overview of B2B digital marketing, then focus on predictive marketing. I will address the issue of collecting data about enterprises, and the recent impact of open data in that space. I will show the new opportunities that enterprise web sites and enterprise social network accounts offer in terms of enterprise data collection. Then I will demonstrate how applying data science to this massive data can help solve real-life marketing problems.

10h20-11h00

François Yvon

Professor at Université Paris Sud, Spoken Language Processing team, LIMSI-CNRS

The Unreasonable Effectiveness of Machine Learning Techniques in Natural Language Processing?

Language processing technologies have improved tremendously in the past few years, and several applications such as Machine Translation, Spell Checking, and Speech Recognition have now reached the general public. Such improvements were made possible by the availability of very large text or speech corpora, allied with advances in statistical machine learning techniques. In this talk, I will review the current state of the art in Natural Language Processing, outline the main reasons for these successes, and highlight areas where difficulties remain and where better Machine Learning, possibly combined with better linguistic resources, could help.

11h00-11h20

coffee break

11h20-12h00

Pierre-Paul Vidal

MD-PhD, HOD, COGnition & ACtion Group, Université Paris Descartes

Quantifying human sensorimotricity

In the last five years, the development of non-invasive or minimally invasive tools at reasonable cost has made it possible to quantify the behavior of individuals, their biological, psychological, and sociological characteristics, and their genotypes. In that process, it was confirmed that the human population is very polymorphic at every level investigated: genomics, proteomics, and, importantly, ethomics. In contrast, individuals tend to be surprisingly consistent over time. In the past ten years, we have conducted several studies, which confirmed that this was also the case in the field of sensory and motor control, in both human and animal models. Around 2010, this led us, and many others, to three hypotheses that we will discuss:

First, training humans to perform complex tasks, maintaining them in good health by detecting their pathologies at early stages, and determining their optimal therapy will require us to customize whatever is necessary to maintain their efficiency as operators and their well-being. Our future is “sur mesure” (made to measure).
Second, building large stacks of personal databases grouping together sensorimotor, biological, psychological, and sociological variables will be the key to our “maintenance” as biological and cognitive machines.
Third, building and mining databases on human behavior will confer major political and economic leverage on those who control them.

So, during these past five years, we undertook several studies aimed at building databases on human sensorimotor behavior in various environments. The work turned out to be fascinating, complex, and multidisciplinary in nature. All these preliminary studies, reflections, and hypotheses led to the proposal to create Cognac G in 2014.

Cognac G stands for COGnition and ACtion Group. The name also echoes Théodore-Ernest Cognacq (1839-1928) and his wife Marie-Louise Jay (1838-1925), founders of La Samaritaine, a huge art deco department store in Paris, and exceptional philanthropists. They built a remarkable collection of fine art and decorative items, about 1200 items in total, with an emphasis on 18th century France, ranging from European and Chinese ceramics, jewels, and snuffboxes to paintings. The collection was given to the City of Paris. It can be visited for free at the Hôtel Donon, 8 rue Elzévir, in the 3rd arrondissement.

Cognac G is a cooperative project that investigates the long-term follow-up of human groups (ethomics) that have in common that they are engaged in complex behavioral tasks over a long stretch of time. These populations need to be followed in order to evaluate their training and, once trained, to check that their skills are operational. They also need to be monitored carefully to avoid excessive pressures, which could lead to pathologies such as burnout syndrome, overtraining, and PTSD. We propose to name these groups “High maintenance cohorts” or HMCs. They are very diverse and, given the evolution of society, their number will inevitably rise. To give a few examples, HMCs include military groups on active duty, athletes at high levels of competition, patients with neurological diseases, patients in rehabilitation, psychiatric patients, people with heavy chronic handicaps, very senior citizens, etc. All our projects are carried out in collaboration with our colleagues of the CMLA, and many of them in partnership with several industrial groups: Thales, Cofely Ineo, Tarkett, CNES, etc.

12h00-12h40

Yannig Goude

EDF R&D, Osiris dpt.

Challenges in energy consumption forecasts

Electricity load forecasting faces rising challenges due to the advent of innovative technologies such as smart grids, electric cars and renewable energy production. For utilities, good knowledge of future electricity consumption is central to the reliability of the network, investment strategies, energy trading, production optimization, etc. Many statistical models have been investigated recently at EDF (Electricité de France) to forecast electricity consumption at different geographical scales and temporal horizons, for both point and probabilistic forecasts. Among them are regression on functional data, additive models, spatio-temporal modelling, ensemble methods and on-line aggregation of experts. We will give an overview of these studies, focusing on real data applications, and suggest some research perspectives.
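
As an illustration of the on-line aggregation of experts mentioned in this abstract, here is a minimal sketch of one standard scheme, the exponentially weighted average forecaster with squared loss. It is not taken from the talk: the function names, the learning rate `eta` and the toy data are assumptions chosen for the example.

```python
import math


def expert_weights(cum_loss, eta):
    """Turn cumulative losses into normalized exponential weights."""
    best = min(cum_loss)  # shift by the best loss for numerical stability
    raw = [math.exp(-eta * (loss - best)) for loss in cum_loss]
    total = sum(raw)
    return [r / total for r in raw]


def ewa_aggregate(expert_forecasts, observations, eta=0.5):
    """Exponentially weighted average forecaster (squared loss).

    expert_forecasts: one list per round, holding one forecast per expert.
    observations: the realized value for each round.
    Returns the aggregated forecasts and the weights after the last round.
    """
    n_experts = len(expert_forecasts[0])
    cum_loss = [0.0] * n_experts
    aggregated = []
    for forecasts, y in zip(expert_forecasts, observations):
        # The forecast for this round uses only losses from past rounds.
        weights = expert_weights(cum_loss, eta)
        aggregated.append(sum(w * f for w, f in zip(weights, forecasts)))
        # Once the observation is revealed, update each expert's loss.
        for k, f in enumerate(forecasts):
            cum_loss[k] += (f - y) ** 2
    return aggregated, expert_weights(cum_loss, eta)


if __name__ == "__main__":
    # Hypothetical toy data: expert 0 is always right, expert 1 is always off.
    forecasts = [[10.0, 20.0]] * 3
    observed = [10.0] * 3
    combined, final_weights = ewa_aggregate(forecasts, observed)
    print(combined, final_weights)
```

In this toy run the first aggregated forecast is the plain average of the two experts, and the weight then shifts almost entirely to the expert with the smaller cumulative loss.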

12h45-14h10

lunch break

Program: afternoon session

At Ecole des Ponts in Amphi Caquot.

14h10-14h50

Cyrille Dubarry

Criteo

Data Science for online advertising

The field of online advertising represents an exciting opportunity for machine learning engineers and scientists: large volumes of data, a fast-paced environment enabling quick iterations on research ideas, and an efficient performance-based marketplace that makes it possible to measure ad effectiveness precisely at a worldwide scale. Industrial players have leveraged the recent advances in large-scale computation to build facilities capable of hosting massive amounts of data and CPU-hungry algorithms. For all these reasons, the last decade has seen tremendous progress in the application of machine learning to online marketing.

In this talk, we will briefly introduce the online advertising marketplace, its stakeholders and the key performance metrics. We will then present the algorithms we have developed at Criteo for bidding in real-time auctions and product recommendation at scale and explain how we evaluate them both offline and online. We will describe the infrastructure for large-scale data processing that these algorithms rely upon. Finally, we will conclude with future areas of research and open the floor for a panel discussion with participants.

14h50-15h30

Vivien Mallet

Clime team, INRIA

Processing environmental simulations and observations for smart cities

The environmental state of cities is monitored by an increasing number of sensors and analyzed with well-established numerical models. Both the resulting observations and simulations are used for environmental management in the context of smart cities. The simulations generate multivariate spatio-temporal fields, down to street resolution, but they are subject to large uncertainties because of shortcomings in the physical models and their input data. In contrast, the observations are sparse but possibly highly accurate. At the same time, low-cost sensors can now provide large numbers of measurements, but with low accuracy. It is possible to optimally combine the observations and the simulations, provided their uncertainties can be evaluated. We will illustrate the approach with applications in air quality and noise pollution. We will discuss where big data is involved, both in the observations and the simulations.
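
To make the idea of combining a simulation with an observation concrete, here is a minimal scalar sketch of inverse-variance weighting, the simplest form of a best linear unbiased estimate. It is an illustration only, not the method presented in the talk; it assumes both errors are unbiased and independent, and the function name and numbers are hypothetical.

```python
def combine(simulated, sim_var, observed, obs_var):
    """Merge a simulated value and an observed value of the same quantity.

    Assumes both errors are unbiased and independent, with known variances.
    Returns the combined estimate and its error variance.
    """
    if sim_var <= 0 or obs_var <= 0:
        raise ValueError("variances must be positive")
    # Weight given to the observation: large when the simulation is uncertain.
    gain = sim_var / (sim_var + obs_var)
    estimate = simulated + gain * (observed - simulated)
    # The combined variance is smaller than either input variance.
    estimate_var = (1.0 - gain) * sim_var
    return estimate, estimate_var


if __name__ == "__main__":
    # Hypothetical pollutant concentration: uncertain model, accurate sensor.
    value, variance = combine(simulated=40.0, sim_var=16.0,
                              observed=50.0, obs_var=4.0)
    print(value, variance)
```

With these example numbers the combined value is about 48, closer to the accurate sensor than to the model, with an error variance of about 3.2, below both input variances. A low-accuracy sensor would simply enter with a larger `obs_var` and pull the estimate less.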

15h30-16h00

coffee break

16h00-16h40

Alexandre Gramfort

CNRS LTCI, Télécom ParisTech

Mind reading from neuroimaging data

Understanding how the brain works in healthy and pathological conditions is considered one of the challenges of the 21st century. After the first electroencephalography (EEG) measurements in 1929, the 1990s saw the birth of modern functional brain imaging with the first functional MRI (fMRI) and full-head magnetoencephalography (MEG) systems. By noninvasively offering unique insights into the living brain, imaging has revolutionized both clinical and cognitive neuroscience over the last twenty years. More recently, the field of brain imaging and electrophysiology has embraced a new set of tools. Using statistical machine learning, new applications have emerged, ranging from brain-computer interaction systems to "mind reading". In this talk, I will briefly explain what the different techniques can offer and show some impressive results recently reported in the literature. I will then show some of our recent results that demonstrate how low-rank optimization can offer better statistical power when learning from fMRI. Finally, I will explain how one can learn a subject-specific forward model of fMRI data in visual areas, allowing us to replicate neuroscience findings without actual task-specific measurements. This is achieved using features obtained via a deep convolutional network applied to input stimuli, features that we use to predict fMRI data.

16h40-17h20

Josef Sivic

Willow team, INRIA

Large-scale quantitative visual analysis of urban environments

Map-based street-level imagery, such as Google Street View, provides a comprehensive visual record of many cities worldwide. In this talk I will describe our efforts to develop automatic tools for large-scale quantitative analysis of this visual data. The aim is to provide quantitative answers to questions like: What are the typical architectural elements (e.g. different types of windows or balconies) characterizing the visual style of a city district? What is their geo-spatial distribution? How does the visual style of a geo-spatial area evolve over time? Progress on these goals could have a significant impact on urban planning and architecture, enabling applications such as quantitative mapping and visualization of existing urban spaces, modeling and predicting the evolution of cities, or obtaining more detailed 3D reconstructions of urban environments.

end of the event

Organisers:

Mohammed El Rhabi	Academic Director of the Maths and CS Engineering department, Ecole des Ponts - ParisTech.
Guillaume Obozinski	Researcher in machine learning, Equipe Imagine A3SI, LIGM Ecole des Ponts - ParisTech.