Statistics · Econometrics · Machine Learning Seminar at ENSAE Paris

The Statistics, Econometrics and Machine Learning seminar aims to deepen cooperation between researchers from the departments of statistics, econometrics and machine learning at CREST and beyond.

Upcoming Talks

Clément Gauchy
(CEA/CMAP)
January 20th, 2021
Adaptive importance sampling for fragility curve estimation
As part of the risk assessment of the seismic safety of industrial installations, it is necessary to characterize the robustness of civil engineering structures to seismic loads. This characterization is often expressed in the form of fragility curves, which represent the conditional probability that the mechanical demand exceeds a given threshold for a given seismic intensity. Unfortunately, numerical simulations of mechanical structures are often costly in terms of computation time. In this context, it is crucial to develop experimental design methods to gain the maximum information with the smallest number of numerical code evaluations. Hence, our methodology intertwines importance sampling and statistical learning in an adaptive fashion, in order to reduce the asymptotic variance of the training loss. We show by asymptotic analysis and numerical simulations that this allows fast convergence of the estimated fragility curve to the true one.
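To make the two ingredients named above concrete, here is a minimal sketch (not the speaker's actual algorithm): a parametric lognormal fragility curve is fitted by importance-sampling-weighted maximum likelihood, with a cheap synthetic stand-in for the expensive mechanical code. The proposal distribution, the failure threshold, and the parametrization are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

# Nominal intensity distribution p and proposal q (q oversamples large intensities).
p = lambda a: norm.pdf(np.log(a), 0.0, 0.5) / a   # lognormal density
q = lambda a: norm.pdf(np.log(a), 0.5, 0.7) / a   # shifted lognormal proposal

a = np.exp(rng.normal(0.5, 0.7, size=200))        # intensities drawn from q
w = p(a) / q(a)                                   # importance weights

# Hypothetical stand-in for the costly mechanical code: failure indicator.
demand = a * np.exp(rng.normal(0.0, 0.3, size=a.size))
y = (demand > 1.5).astype(float)

def neg_loglik(theta):
    """Weighted negative log-likelihood of a lognormal fragility curve."""
    alpha, log_beta = theta
    prob = norm.cdf((np.log(a) - alpha) / np.exp(log_beta))
    prob = np.clip(prob, 1e-12, 1 - 1e-12)
    return -np.sum(w * (y * np.log(prob) + (1 - y) * np.log(1 - prob)))

theta_hat = minimize(neg_loglik, x0=np.array([0.0, 0.0])).x
print("estimated (alpha, beta):", theta_hat[0], np.exp(theta_hat[1]))
```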
Anthony Strittmatter
(CREST)
February 3rd, 2021
Efficient Targeting in Fundraising
This paper studies efficient targeting in fundraising. In a large-scale field experiment, we randomly provide potential donors with a small unconditional gift. We then use causal machine learning methods to derive the efficient targeting of the fundraising instrument based on socio-economic characteristics, donation history, and geo-spatial information. In the warm list, efficient targeting significantly increases the charity's profits, by 14%, even if the algorithm uses only the publicly available geo-spatial information. In the cold list, efficient targeting does not raise donations sufficiently to justify the additional costs of the fundraising instrument. We conclude that charities that do not efficiently target their fundraising efforts may waste significant resources.
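For readers unfamiliar with causal machine learning, here is a hedged sketch of the generic targeting logic (not the authors' pipeline): estimate each donor's treatment effect on donations with a simple T-learner on synthetic data, then target the gift only where the predicted uplift exceeds its cost. All variable names and the cost figure are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 5))              # donor covariates (synthetic)
T = rng.integers(0, 2, size=n)           # randomized gift assignment
y = X[:, 0] + 0.5 * T * (X[:, 1] > 0) + rng.normal(size=n)  # donations

# T-learner: one outcome model per treatment arm.
m1 = RandomForestRegressor(random_state=0).fit(X[T == 1], y[T == 1])
m0 = RandomForestRegressor(random_state=0).fit(X[T == 0], y[T == 0])

gift_cost = 0.2
uplift = m1.predict(X) - m0.predict(X)   # estimated effect per donor
target = uplift > gift_cost              # send the gift only where it pays
print(f"share of donors targeted: {target.mean():.2f}")
```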
Solenne Gaucher
(LMO)
March 17th, 2021
TBA
Abstract: TBA

Past Talks

Talks from 2020-2021
Boris Muzellec
(INRIA)
January 6th, 2021
Imputing missing data using regularized optimal transport
Missing data is a crucial issue when applying machine learning algorithms to real-world datasets. Even with a small fixed proportion of missing values, discarding every data point with a missing entry quickly becomes impractical as the dimension increases. It is therefore necessary to design strategies that replace missing values with reasonable guesses.
In this talk, we show how optimal transport (OT) tools can be used to impute data in a distribution-preserving way. We start with an introduction to the missing data problem and to regularized OT. We then show how OT can be used to turn a simple assumption - that two batches extracted randomly from the same dataset should share the same distribution - into a loss function for imputing missing values. Finally, we present and demonstrate practical methods to minimize this loss, which may or may not exploit parametric assumptions on the underlying distribution of the values.
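As a self-contained illustration of this loss (a simplification of the method described above, with the optimization loop omitted), the sketch below evaluates the entropy-regularized OT cost between two random batches of a mean-imputed dataset. In the actual method this loss would be minimized over the imputed entries, e.g. with automatic differentiation; batch sizes and the regularization strength are illustrative.

```python
import numpy as np

def sinkhorn_cost(Xa, Xb, reg=0.5, n_iter=200):
    """Entropy-regularized OT cost between two point clouds (uniform weights)."""
    C = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    K = np.exp(-C / reg)                                  # Gibbs kernel
    a = np.full(len(Xa), 1.0 / len(Xa))
    b = np.full(len(Xb), 1.0 / len(Xb))
    v = np.ones(len(Xb))
    for _ in range(n_iter):                               # Sinkhorn fixed-point iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]                       # approximate optimal coupling
    return (P * C).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.2] = np.nan                     # missing-at-random entries
X_imp = np.where(np.isnan(X), np.nanmean(X, axis=0), X)   # naive initial imputation

i, j = rng.permutation(100)[:32], rng.permutation(100)[:32]
print("batch-matching OT loss:", sinkhorn_cost(X_imp[i], X_imp[j]))
```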
Talks from 2019-2020
Victor-Emmanuel Brunel
(CREST)
March 4th, 2020
Stein's method and Berry-Esseen bounds
I will present the fundamentals of Stein's method, which rests on fairly simple functional equations. The method allows one to prove central limit theorems, as well as finite-sample approximation bounds such as the well-known Berry-Esseen bounds for normal approximations. It is a simple yet elegant method which extends far beyond normal approximations for sums of independent variables: it also yields Berry-Esseen-type bounds for more general random variables (such as the number of triangles in an Erdős-Rényi graph), as well as finite-sample bounds for exponential, Poisson, or other asymptotic approximations.
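For reference, the standard normal instance of the method looks as follows (standard notation, not specific to the talk):

```latex
% Stein's equation for the standard normal: for a test function h, solve
\[
  f_h'(x) - x\, f_h(x) \;=\; h(x) - \mathbb{E}\,h(Z),
  \qquad Z \sim \mathcal{N}(0,1),
\]
% so that bounding E[f_h'(W) - W f_h(W)] over a class of test functions h
% controls the distance from W to the Gaussian. For a standardized sum
% W_n = n^{-1/2}(X_1 + ... + X_n) of i.i.d. variables with mean 0,
% variance 1 and finite third moment, this route recovers the
% Berry-Esseen bound
\[
  \sup_{t \in \mathbb{R}} \bigl|\, \mathbb{P}(W_n \le t) - \Phi(t) \,\bigr|
  \;\le\; \frac{C\, \mathbb{E}|X_1|^3}{\sqrt{n}}.
\]
```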
Thomas Berrett
(CREST)
February 26th, 2020
Local Differential Privacy
In recent years, it has become clear that certain studies require preserving the privacy of the individuals whose data are collected. The framework of differential privacy has prevailed as a natural way of formalising this problem: the privacy of the individuals is protected by randomising their original data before any statistical analysis is carried out, so that the original data are hidden from the statistician. In fact, under local differential privacy, each original data point is only ever seen by the individual it belongs to.
Research in the area focuses on constructing privatisation mechanisms that strike the optimal balance between protecting the privacy of the individuals in the study and allowing the best statistical performance. In many cases it is possible to find minimax rates of convergence under this constraint, and thus to quantify the statistical cost of privacy. In this talk I will provide an introduction to the field before presenting some new results.
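A classical toy instance of such a mechanism is randomized response, sketched below: each individual flips their own bit before releasing it, so the analyst never sees the raw data, and then debiases the privatized mean. This is a generic illustration, not a construction from the talk.

```python
import numpy as np

def privatize(bit, eps, rng):
    """Release the true bit with prob. e^eps/(1+e^eps), else its flip (eps-LDP)."""
    keep = rng.random() < np.exp(eps) / (1 + np.exp(eps))
    return bit if keep else 1 - bit

rng = np.random.default_rng(0)
eps = 1.0
x = rng.integers(0, 2, size=10_000)                # original private bits
z = np.array([privatize(b, eps, rng) for b in x])  # only z leaves each user

# Debias: E[z] = (1 - p) + E[x] * (2p - 1) with p = e^eps / (1 + e^eps).
p = np.exp(eps) / (1 + np.exp(eps))
mean_hat = (z.mean() - (1 - p)) / (2 * p - 1)
print(f"true mean {x.mean():.3f}, private estimate {mean_hat:.3f}")
```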
Evgenii Chzhen
(LMO)
February 5th, 2020
Algorithmic Fairness in Classification and Regression
The goal of this talk is to introduce the audience to the problem of algorithmic fairness. I will give a general overview of the topic, describe the various available frameworks for fairness in classification and regression, and present the main approaches to tackling this problem. If time permits, I will also present some very recent theoretical results in both classification and regression.
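One common criterion appearing in such overviews is demographic parity: the classifier's positive rate should not depend on the sensitive attribute. The sketch below measures the demographic parity gap of an ordinary classifier on synthetic data; the data-generating process and model are illustrative assumptions, not material from the talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
S = rng.integers(0, 2, size=n)                    # sensitive attribute
X = rng.normal(size=(n, 3)) + 0.5 * S[:, None]    # features correlated with S
y = (X[:, 0] + rng.normal(size=n) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
pred = clf.predict(X)

# Demographic parity gap: difference in positive rates across groups.
rates = [pred[S == s].mean() for s in (0, 1)]
print(f"positive rates: {rates[0]:.3f} vs {rates[1]:.3f}, "
      f"gap: {abs(rates[0] - rates[1]):.3f}")
```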
Jules Depersin
(CREST)
January 22nd, 2020
Robust and Fast Estimation for Heavy-Tailed Distributions
When it comes to estimating the mean of a heavy-tailed distribution (or in the presence of outliers), the empirical mean does not give satisfactory results. This issue has been addressed using tools such as Median-of-Means (MOM) estimators. Such estimators are very simple to compute and achieve optimal rates of convergence when the dimension of the random variable is small, but fail to do so in high-dimensional set-ups. We will try to explain why, giving simple examples and intuitions, and we will introduce the tools needed to study the high-dimensional case.
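The one-dimensional MOM estimator mentioned above is simple enough to sketch in a few lines: split the sample into blocks, average within each block, and take the median of the block means. The block count is a tuning parameter; the heavy-tailed test distribution below is just an example.

```python
import numpy as np

def median_of_means(x, n_blocks, rng):
    """Median of the means of n_blocks random blocks of x."""
    blocks = np.array_split(rng.permutation(x), n_blocks)
    return np.median([block.mean() for block in blocks])

rng = np.random.default_rng(1)
x = rng.standard_t(df=2.0, size=10_000)   # heavy-tailed sample, true mean 0

print("empirical mean :", x.mean())
print("median-of-means:", median_of_means(x, n_blocks=50, rng=rng))
```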
Julien Chhor
(CREST)
January 8th, 2020
Minimax Testing in Random Graphs
In many recent statistical applications, the growing use of networks has made large random graphs a central object of interest. Examples include community detection (in the stochastic block model or in social networks), network modelling, and modelling of the brain. The existing literature on hypothesis testing is profuse, yet, quite surprisingly, little of it concerns hypothesis testing in random graphs. In this talk, we fill this gap by studying two testing problems in inhomogeneous Erdős-Rényi random graphs. After introducing general tools for minimax testing, we first study a two-sample testing problem in random graphs under sparsity constraints, and second the goodness-of-fit problem (also called the identity testing problem), for which we identify minimax-optimal adaptive tests.
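Schematically, and in our own notation rather than necessarily the talk's, the two-sample problem can be phrased as follows:

```latex
% Given independent inhomogeneous Erdős-Rényi graphs with edge-probability
% matrices P and Q, test
\[
  H_0 : P = Q
  \qquad \text{against} \qquad
  H_1 : \| P - Q \| \ge \rho,
\]
% and determine the smallest separation rho for which some test keeps both
% error probabilities small, uniformly over the sparsity class considered:
% this threshold is the minimax separation rate.
```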
Théo Lacombe
(INRIA Saclay)
December 4th, 2019
An Introduction to Topological Data Analysis
Topological Data Analysis (TDA) is a recent approach in data science that aims to encode structured objects (graphs, time series, or points on a manifold, for instance) according to the topological information they contain.
The first half of this introductory lecture will give a high-level picture of TDA.
We will then briefly introduce persistent homology, a notion from algebraic topology that is central to TDA and is used to build topological signatures.
Finally, the last part of the talk will present some statistical and learning aspects of TDA.
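As a concrete taste of the standard pipeline, the sketch below builds a Vietoris-Rips filtration on a noisy circle and reads off its persistence diagram. It uses the GUDHI library, one common choice (the talk itself is library-agnostic), and the point cloud and parameters are illustrative.

```python
import numpy as np
import gudhi

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=100)
points = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(100, 2))

# Rips filtration up to dimension 2, persistence of its homology.
rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
st = rips.create_simplex_tree(max_dimension=2)
diagram = st.persistence()              # list of (dim, (birth, death)) pairs

# A noisy circle should exhibit one prominent 1-dimensional (loop) feature.
loops = [(b, d) for dim, (b, d) in diagram if dim == 1 and d != float("inf")]
print("most persistent loop (birth, death):",
      max(loops, key=lambda p: p[1] - p[0]))
```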
François-Pierre Paty
(CREST)
November 20th, 2019
An Introduction to Optimal Transport
Optimal transport (OT) dates back to the end of the 18th century, when the French mathematician Gaspard Monge proposed to solve the problem of déblais and remblais (moving earth from excavations to embankments at minimal cost). Monge's mathematical formulation, however, soon met its limits: the existence of the studied objects could not be proven. Only 150 years later did OT enjoy a resurgence, when Kantorovich found the suitable framework in which Monge's problem could be solved, giving rise to fundamental tools and theories in probability, optimization, differential equations and geometry. While applications in economics have a long history, OT has only recently been applied to statistics and machine learning as a way to analyze data. In this mini-lecture, I will first define OT and present the most prominent results of OT theory. Then, I will give an overview of current research in statistical and algorithmic OT, with an emphasis on applications to machine learning and economics.
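For reference, the two classical formulations alluded to above, in standard notation:

```latex
% Monge seeks a transport map pushing mu forward to nu;
% Kantorovich relaxes the map to a coupling.
\[
  \inf_{T \,:\, T_{\#}\mu = \nu} \int c\bigl(x, T(x)\bigr)\, d\mu(x)
  \qquad \text{(Monge)},
\]
\[
  \inf_{\pi \in \Pi(\mu, \nu)} \int c(x, y)\, d\pi(x, y)
  \qquad \text{(Kantorovich)},
\]
% where Pi(mu, nu) is the set of couplings with marginals mu and nu.
% The relaxed problem always admits a minimizer, which resolves the
% existence issue in Monge's original formulation.
```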
Badr-Eddine Chérief-Abdellatif
(CREST)
November 6th, 2019
Theoretical Study of Variational Inference
Bayesian inference provides an attractive learning framework to analyze and sequentially update knowledge on streaming data, but is rarely computationally feasible in practice. In recent years, variational inference (VI) has become more and more popular for approximating intractable posterior distributions in Bayesian statistics and machine learning. Nevertheless, despite promising results in real-life applications, little attention has been paid in the literature to the theoretical properties of VI. In this talk, we present some recent advances in the theory of VI. First, we show that variational inference is consistent under mild conditions and retains the same properties as exact Bayesian inference in the batch setting. Then, we study several online VI algorithms, inspired by sequential optimization, that compute the variational approximations in an online fashion. We provide theoretical guarantees by deriving generalization bounds, and we present empirical evidence in support of these results.
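The generic variational objective behind this line of work, in standard notation rather than the speaker's:

```latex
% VI replaces the intractable posterior pi(. | X_{1:n}) by the closest
% member of a tractable family F in Kullback-Leibler divergence,
\[
  \widehat{q} \;=\; \operatorname*{arg\,min}_{q \in \mathcal{F}}
  \mathrm{KL}\bigl(q \,\|\, \pi(\cdot \mid X_{1:n})\bigr),
\]
% equivalently obtained by maximizing the evidence lower bound (ELBO).
```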