
Symposium on Statistical Learning for Complex Data

Date: November 25, 2025

Time: 8:45 a.m. to 4:30 p.m.

Location: Salle Power Corporation du Canada (3452)

Parking available ($)

Free event

About the event

The new Chair in Statistical Learning is launching its activities and is pleased to invite you to a day of talks on the topics at the heart of its mission.

The day will address the interaction between statistics and machine learning for the analysis of complex and high-dimensional data, such as complex multivariate data, functional data, data on manifolds, and unstructured data.

This event is organized in partnership with Université Laval's Département d'opérations et systèmes de décision, the Institut de valorisation des données (IVADO), the Centre de recherche du CHU de Québec, and the Centre interuniversitaire de recherche sur les réseaux d'entreprise, la logistique et le transport (CIRRELT).

Presentations will be given in English.

Registration is required by November 17.

The two plenary lectures will be webcast live. Register to receive the viewing link.

Program

Plenary lecture
Object Oriented Spatial Statistics: Advancing Geostatistics in the Era of Complex Data

The explosion of complex, georeferenced data is transforming how we study natural and social phenomena. Object Oriented Spatial Statistics (O2S2) provides a unifying framework to address this challenge, by treating complex entities – such as curves, probability distributions or covariance matrices – as the primary objects of analysis. Built on a rigorous geometrical and topological foundation, O2S2 extends classical tools of geostatistics such as variograms and kriging to new data types and domains, enabling robust prediction and uncertainty quantification. In this lecture, I will introduce a few key ideas of O2S2 in accessible terms and illustrate its potential through recent case studies, including epidemiological and environmental monitoring. These examples highlight how O2S2 fosters interdisciplinary innovation and supports informed decision-making in complex spatial systems.
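The core O2S2 move, treating whole curves as the atoms of a geostatistical analysis, can be illustrated with an empirical trace-variogram: half the mean squared L2 distance between the curves observed at pairs of sites, binned by spatial separation. The sketch below is a minimal illustration on synthetic georeferenced curves, not the lecture's methodology.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic georeferenced functional data: one curve observed at each of n sites.
n, m = 60, 50                      # sites, grid points per curve
sites = rng.uniform(0, 10, (n, 2))
t = np.linspace(0, 1, m)

# Spatially correlated amplitudes, so nearby sites carry similar curves.
d = np.linalg.norm(sites[:, None] - sites[None, :], axis=-1)
cov = np.exp(-d / 2.0)
amp = rng.multivariate_normal(np.zeros(n), cov)
curves = amp[:, None] * np.sin(2 * np.pi * t)[None, :] \
    + 0.1 * rng.standard_normal((n, m))

def trace_variogram(curves, d, bins):
    """Empirical trace-variogram: half the mean squared L2 distance
    between curves at site pairs whose separation falls in each bin."""
    l2sq = ((curves[:, None, :] - curves[None, :, :]) ** 2).mean(axis=-1)
    iu = np.triu_indices_from(d, k=1)
    gamma = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (d[iu] >= lo) & (d[iu] < hi)
        gamma.append(0.5 * l2sq[iu][mask].mean() if mask.any() else np.nan)
    return np.array(gamma)

bins = np.linspace(0, 8, 9)
gamma = trace_variogram(curves, d, bins)
print(np.round(gamma, 3))
```

As in classical geostatistics, the fitted variogram would then drive kriging, here predicting an entire curve at an unobserved site.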

Invited speaker
Rethinking Image Data through Functional Representations of Shapes and Surfaces

In this presentation, we introduce a novel perspective on image analysis that emphasizes objects and their shapes rather than individual pixels. By moving away from pixel-based approaches and analyzing images as collections of objects characterized by both color and contour, we establish a new framework that is more interpretable, less sensitive to resolution changes, and much more parsimonious. We show how to represent contours using coordinate functions, which can be expanded in a Fourier basis. This provides an elegant solution to the alignment problem and enables the use of a wide range of functional data analysis tools developed in the literature, such as functional principal component analysis and functional regression, for shape analysis. We then discuss how to extend our approach to more complex images containing multiple objects. We illustrate the performance of the proposed framework through a variety of statistical applications, including sampling, classification, and clustering on multiple real image datasets. Finally, we discuss representing entire images as smooth surfaces via basis expansions. This representation again offers a parsimonious and flexible description of images, well suited for statistical analysis.
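The contour representation described above can be sketched concretely: a closed contour becomes a pair of coordinate functions, or equivalently a complex-valued function z(t) = x(t) + i y(t), whose Fourier coefficients give a parsimonious description. The snippet below uses a synthetic contour and the FFT; it is an illustration of the idea, not the authors' implementation.

```python
import numpy as np

# A closed contour as coordinate functions (x(t), y(t)) on [0, 1);
# here a synthetic ellipse-like shape (in practice, extracted from an image).
m = 256
t = np.linspace(0, 1, m, endpoint=False)
x = 3 * np.cos(2 * np.pi * t) + 0.2 * np.cos(6 * np.pi * t)
y = np.sin(2 * np.pi * t)

# Fourier expansion via the FFT of the complex contour z(t) = x(t) + i y(t).
z = x + 1j * y
coeffs = np.fft.fft(z) / m

def reconstruct(coeffs, keep, m):
    """Keep the `keep` lowest frequencies (both signs) and invert."""
    freqs = np.fft.fftfreq(m, d=1 / m)       # integer frequencies
    c = np.where(np.abs(freqs) <= keep, coeffs, 0)
    return np.fft.ifft(c * m)

z_hat = reconstruct(coeffs, keep=5, m=m)
err = np.max(np.abs(z - z_hat))
print(f"max reconstruction error with 11 Fourier coefficients: {err:.2e}")
```

A starting-point shift of the parametrization only multiplies the coefficients by a phase factor, which is one reason the Fourier representation eases the alignment problem mentioned above.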

Selection of functional predictors and smooth coefficient estimation for scalar-on-function regression models

In the framework of scalar-on-function regression models, in which several functional variables are employed to predict a scalar response, we propose a methodology for selecting relevant functional predictors while simultaneously providing accurate smooth (or more generally regular) estimates of the functional coefficients. We suppose that the functional predictors belong to a real separable Hilbert space, while the functional coefficients belong to a specific subspace of this Hilbert space. Such a subspace can be a Reproducing Kernel Hilbert Space (RKHS) to ensure the desired regularity characteristics, such as smoothness or periodicity, for the coefficient estimates. Our procedure, called SOFIA (Scalar-On-Function Integrated Adaptive Lasso), is based on an adaptive penalized least squares algorithm that leverages functional subgradients to efficiently solve the minimization problem. We demonstrate that the proposed method satisfies an appropriate version of the oracle property adapted to the functional setting, even when the number of predictors exceeds the sample size. SOFIA’s effectiveness in variable selection and coefficient estimation is evaluated through extensive simulation studies and a real-data application to predict GDP growth.

  • Fathi, H., Cremona, M.A., & Severino, F. (2025). Selection of functional predictors and smooth coefficient estimation for scalar-on-function regression models. arXiv:2506.17773.
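The selection problem can be made concrete with a heavily simplified stand-in: expand each functional predictor on a basis, recover the basis scores by numerical inner products, and run an ordinary lasso on the stacked scores, calling a predictor selected when any of its coefficients is nonzero. This sketch is not SOFIA (no adaptive weights, no functional subgradients, no RKHS machinery), just the underlying intuition on synthetic data.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, K, m = 300, 8, 4, 101        # samples, predictors, basis size, grid
t = np.linspace(0, 1, m)
dt = t[1] - t[0]

# Orthonormal cosine basis on [0, 1].
basis = np.vstack([np.ones(m)]
                  + [np.sqrt(2) * np.cos(k * np.pi * t) for k in range(1, K)])

# Functional predictors built from random basis scores.
S = rng.standard_normal((n, p, K))
curves = S @ basis                  # shape (n, p, m)

# True model: only predictors 0 and 2 enter, via smooth coefficients.
b0 = np.array([1.0, -0.5, 0.0, 0.0])    # basis coefficients of beta_0(t)
b2 = np.array([0.0, 0.8, 0.3, 0.0])
y = S[:, 0] @ b0 + S[:, 2] @ b2 + 0.1 * rng.standard_normal(n)

# Recover scores by Riemann inner products, then lasso on stacked scores.
Z = (curves @ basis.T) * dt             # shape (n, p, K)
fit = Lasso(alpha=0.05).fit(Z.reshape(n, p * K), y)
selected = sorted({j // K for j in np.flatnonzero(fit.coef_)})
print("selected functional predictors:", selected)
```

The design choice mirrors the abstract's setting: smoothness of the coefficient estimates comes from restricting them to the span of a regular basis, while sparsity across predictors comes from the penalty.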

Unified multi-state models for learning latent behaviors from complex animal movement data

Complex movement trajectories encode latent behavioral modes and environmental responses. This work develops a unified multi-state framework that (i) extends directional random walks to incorporate multiple environmental “targets” via circular–linear modeling of turning angles and step lengths; (ii) learns hidden behavioral states with HMM/HSMM structure; and (iii) formulates a one-step multi-state conditional logistic regression (step selection) that propagates state uncertainty and reveals the link (indeed an equivalence) between directional and discrete-choice approaches. Inference uses maximum likelihood with an EM algorithm and forward–backward smoothing for state decoding. Simulation and case studies on boreal caribou and bison demonstrate that the model recovers encamped versus exploratory states, quantifies resource-driven taxis, and separates “pure movement” from habitat selection. The unified view increases flexibility (arbitrary directional biases; exponential-family step lengths), improves interpretability of environmental drivers, and supports practical analysis of high-frequency telemetry as complex, structured data. Overall, the contributions provide a statistically principled, scalable basis for learning behavior–environment interactions from trajectories and for comparing or combining directional and step-selection analyses.

  • Duchesne, T., Fortin, D., & Nicosia, A. (2015). A General Angular Regression Model for the Analysis of Data on Animal Movement in Ecology. Journal of the Royal Statistical Society: Series C (Applied Statistics), 64(3), 497-513.
  • Nicosia, A., Duchesne, T., Rivest, L.-P., & Fortin, D. (2016). A General Hidden State Random Walk Model for Animal Movement. Computational Statistics & Data Analysis, 93, 27-41.
  • Nicosia, A., Duchesne, T., Rivest, L.-P., & Fortin, D. (2017). A Multi-State Conditional Logistic Regression Model for the Analysis of Animal Movement. Annals of Applied Statistics, 11(3), 1537-1560.
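The hidden-state idea at the center of these models can be sketched with a toy two-state HMM on step lengths alone: an "encamped" state with short exponential steps and an "exploratory" state with long ones, decoded with the scaled forward recursion. This is a deliberately stripped-down illustration (no turning angles, no circular–linear structure, no covariates), not the papers' full model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two behavioral states with exponential step lengths:
# state 0 "encamped" (mean 1), state 1 "exploratory" (mean 10).
means = np.array([1.0, 10.0])
A = np.array([[0.9, 0.1],           # sticky transition matrix
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])

# Simulate a trajectory of step lengths.
T = 500
states = np.zeros(T, dtype=int)
for s in range(1, T):
    states[s] = rng.choice(2, p=A[states[s - 1]])
steps = rng.exponential(means[states])

# Scaled forward algorithm: log-likelihood plus filtered state
# probabilities, used here to decode the behavioral mode at each step.
emis = np.exp(-steps[:, None] / means) / means   # exponential densities
loglik = 0.0
filtered = np.empty((T, 2))
for s in range(T):
    alpha = pi * emis[0] if s == 0 else (alpha @ A) * emis[s]
    c = alpha.sum()
    loglik += np.log(c)
    alpha = alpha / c
    filtered[s] = alpha

decoded = filtered.argmax(axis=1)
print(f"log-likelihood: {loglik:.1f}, "
      f"decoding accuracy: {(decoded == states).mean():.2f}")
```

In the full framework the same recursion underlies EM-based maximum likelihood, with forward–backward smoothing replacing the filtered probabilities for state decoding.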

From point to probabilistic gradient boosting for claim frequency and severity prediction

Gradient boosting algorithms for decision trees are increasingly used in actuarial applications, as they show superior predictive performance over traditional generalised linear models. Many enhancements to the first gradient boosting machine algorithm exist. Our first objective is to present in a unified notation, and contrast, all the existing point and probabilistic gradient boosting algorithms for decision trees: GBM, XGBoost, DART, LightGBM, CatBoost, EGBM, PGBM, XGBoostLSS, cyclic GBM, and NGBoost. Our second objective is to compare, in a comprehensive numerical study, their performance on five publicly available datasets for claim frequency and severity, of various sizes and comprising different numbers of (high-cardinality) categorical variables. Our third objective is to explain how varying exposure-to-risk, a peculiarity of actuarial data, can be handled with boosting in frequency models. We compare the algorithms based on computational efficiency, predictive performance, and model adequacy. LightGBM and XGBoostLSS win in terms of computational efficiency. CatBoost sometimes improves predictive performance, especially in the presence of high-cardinality categorical variables, common in actuarial science. The fully interpretable EGBM achieves competitive predictive performance compared to the black-box algorithms considered. We find that there is no trade-off between model adequacy and predictive accuracy: both are achievable simultaneously.

  • Chevalier, D., Côté, M.P. (2025) From point to probabilistic gradient boosting for claim frequency and severity prediction. European Actuarial Journal.

Clustering Glycemic Profiles to Improve Glucose Forecasting in Type 1 Diabetes

This work introduces a clustering-based framework for improving glucose forecasting in individuals with type 1 diabetes (T1D). Multivariate time series segments comprising variables such as glucose levels, physical activity, and carbohydrate intake are encoded into fixed-length embeddings using a bidirectional GRU-based autoencoder with attention. The embeddings are then clustered, and a dedicated prediction model is trained on each cluster to forecast future continuous glucose monitoring (CGM) values. Our dataset includes real data from 380 patients over 4 weeks. Our results show relatively small prediction errors and safe predictions for the majority of T1D patients.
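The two-stage pipeline (embed segments, cluster, train one forecaster per cluster) can be sketched end to end on synthetic CGM-like data. To keep the sketch self-contained, PCA stands in for the GRU-based autoencoder and ridge regression for the per-cluster forecaster; every variable name and the 10-minute sampling assumption are illustrative, not from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)

# Synthetic CGM-like segments: each row is a 2-hour window of glucose
# values (12 samples at 10-minute resolution); the target is the value
# 30 minutes past the window end. Two regimes mimic profile types.
n, w = 600, 12
regime = rng.integers(0, 2, n)
base = 6 + rng.standard_normal(n)
slope = np.where(regime == 0, 0.0, 0.15)     # stable vs rising glucose
grid = np.arange(w)
segments = base[:, None] + slope[:, None] * grid \
    + 0.2 * rng.standard_normal((n, w))
target = base + slope * (w + 2) + 0.2 * rng.standard_normal(n)

# Stage 1: fixed-length embeddings (PCA as a stand-in for the
# bidirectional GRU autoencoder with attention used in the paper).
emb = PCA(n_components=3).fit_transform(segments)

# Stage 2: cluster the embeddings, then train one forecaster per cluster.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
errors = np.empty(n)
for c in range(2):
    idx = labels == c
    fit = Ridge().fit(segments[idx], target[idx])
    errors[idx] = np.abs(fit.predict(segments[idx]) - target[idx])

print(f"mean absolute error: {errors.mean():.3f}")
```

The rationale is that each cluster groups glycemically similar segments, so its dedicated model faces a more homogeneous, easier forecasting problem than a single global model would.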

Invited speaker
Distilling Heterogeneous Treatment Effects: Stable Subgroup Estimation in Causal Inference

Recent methodological developments have introduced new black-box approaches to better estimate heterogeneous treatment effects; however, these methods fall short of providing interpretable characterizations of the underlying individuals who may be most at risk or benefit most from receiving the treatment, thereby limiting their practical utility. In this work, we introduce causal distillation trees (CDT) to estimate interpretable subgroups. CDT allows researchers to fit any machine learning model to estimate the individual-level treatment effect, and then leverages a simple, second-stage tree-based model to “distill” the estimated treatment effect into meaningful subgroups. As a result, CDT inherits the improvements in predictive performance from black-box machine learning models while preserving the interpretability of a simple decision tree. We derive theoretical guarantees for the consistency of the estimated subgroups using CDT, and introduce stability-driven diagnostics for researchers to evaluate the quality of the estimated subgroups. We illustrate our proposed method on a randomized controlled trial of antiretroviral treatment for HIV from the AIDS Clinical Trials Group Study 175 and show that CDT out-performs state-of-the-art approaches in constructing stable, clinically relevant subgroups.
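The distillation idea is easy to demonstrate: fit any black-box estimator of individual treatment effects, then fit a shallow regression tree to the estimated effects so that its leaves define interpretable subgroups. The sketch below uses a T-learner with random forests as an illustrative stand-in for the black-box stage (the abstract allows any machine learning model; this particular choice and the synthetic data are assumptions).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
n = 2000
X = rng.uniform(-1, 1, (n, 4))
treat = rng.integers(0, 2, n)            # randomized binary treatment

# True effect depends only on X[:, 0]: benefit when X0 > 0.
tau = np.where(X[:, 0] > 0, 2.0, 0.0)
y = X[:, 1] + treat * tau + 0.5 * rng.standard_normal(n)

# Stage 1: black-box estimate of individual-level treatment effects
# (T-learner: separate outcome models for treated and control arms).
m1 = RandomForestRegressor(n_estimators=100, random_state=0)
m0 = RandomForestRegressor(n_estimators=100, random_state=0)
m1.fit(X[treat == 1], y[treat == 1])
m0.fit(X[treat == 0], y[treat == 0])
tau_hat = m1.predict(X) - m0.predict(X)

# Stage 2: distill the estimated effects into a shallow tree whose
# leaves define interpretable subgroups.
tree = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, tau_hat)
subgroup = tree.apply(X)
for leaf in np.unique(subgroup):
    print(f"subgroup {leaf}: mean estimated effect "
          f"{tau_hat[subgroup == leaf].mean():+.2f}")
```

The second-stage tree inherits the black box's predictive signal while staying readable; CDT adds the theoretical guarantees and stability diagnostics described in the abstract on top of this basic recipe.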

Statistical inference for complex network models

I will present a very brief overview of some statistical problems in complex networks (network science). Complex networks are graph models of the real-world systems that surround us; they are common in biology, telecommunications, social media, and medicine, to name just a few areas. To adequately understand these systems, it is essential to study the huge, now widely available data sets they generate. This data, however, requires a specific set of analytical tools: in most cases, traditional statistical tools are unsuited to network data. Indeed, network data is not Euclidean, which renders most traditional statistical tools inapplicable. Additionally, real-world networks are typically very large, which creates a need for scalable tools; here too, traditional statistical or machine learning tools are often inadequate, as most were designed for much smaller data sets. With the recent widespread availability of network data, interest in studying it is growing: for example, there are growing numbers of national (sessions at the Stat. Soc. of Can. meetings, 2025 and 2026) and international annual scientific events (e.g., https://sinm.network/) dedicated to the topic.

*For this presentation, I will not assume any prior knowledge of graph theory or network science.

Global testing of SNP–methylation interactions on binary phenotypes via a logistic functional regression model

Understanding the joint influence of genetic and epigenetic factors on binary health outcomes remains a key challenge in modern biomedical research. In this work, we propose a logistic functional regression model to capture global interactions between DNA methylation patterns and single nucleotide polymorphisms (SNPs) in the context of binary phenotypes. Unlike traditional models that assess interactions between SNPs and methylation at individual CpG sites independently, our approach treats methylation as a functional predictor across genomic regions, interacting with discrete SNP genotypes through a smooth interaction term governed by a tunable kernel. This formulation enables the detection of complex region-level interactions that may underlie disease susceptibility. Simulation results show that the proposed method reliably controls type I error rates and achieves high statistical power when interaction signals are present, particularly under appropriate choices of the interaction window parameter ρ, which determines the extent of the methylation region interacting with each SNP. We apply our framework to real methylation data from obese and non-obese individuals, revealing interaction effects not captured by additive or marginal-effect models. These findings highlight the potential of our method to improve predictive accuracy and offer new insights into the epigenetic architecture of complex diseases.
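A heavily simplified version of such a global interaction test can be sketched as a likelihood-ratio test in a logistic model: reduce the methylation curve to a few basis scores, then compare a main-effects model against one that adds SNP x score interaction terms. This sketch ignores the paper's smooth kernel-governed interaction term and its window parameter ρ; the data, basis, and effect sizes are synthetic assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from scipy.stats import chi2

rng = np.random.default_rng(6)
n, m = 500, 40
t = np.linspace(0, 1, m)

# Methylation as a functional predictor (smooth random curves) and a
# SNP genotype coded 0/1/2.
meth = rng.standard_normal((n, 1)) * np.sin(np.pi * t) \
    + 0.3 * rng.standard_normal((n, m))
snp = rng.integers(0, 3, n)

# Reduce each curve to a few sine-basis scores.
basis = np.vstack([np.sin((k + 1) * np.pi * t) for k in range(3)])
scores = meth @ basis.T / m

# Binary phenotype with a genuine SNP x methylation interaction.
eta = scores[:, 0] * snp - 0.2 * snp
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))

def loglik(model, Z, y):
    p = np.clip(model.predict_proba(Z)[:, 1], 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Likelihood-ratio test: main effects vs main effects + interactions.
Z0 = np.column_stack([scores, snp])
Z1 = np.column_stack([Z0, scores * snp[:, None]])
m0 = LogisticRegression(C=1e6, max_iter=2000).fit(Z0, y)   # ~unpenalized
m1 = LogisticRegression(C=1e6, max_iter=2000).fit(Z1, y)
stat = 2 * (loglik(m1, Z1, y) - loglik(m0, Z0, y))
pval = chi2.sf(stat, df=scores.shape[1])
print(f"LRT statistic: {stat:.2f}, p-value: {pval:.2e}")
```

The proposed method replaces the crude basis-score interactions used here with a smooth region-level interaction whose extent is controlled by ρ, which is what allows it to detect spatially coherent SNP-methylation signals.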

Invited speaker
Is Stephen Curry really a guard? Some insights using functional data analysis

Basketball shot charts provide rich spatial information on players’ shooting behavior. However, they are often summarized using simple counts or averages that overlook their functional nature. In this talk, I will present how functional data analysis can offer a more nuanced view of player performance by treating shot charts as two-dimensional functional objects. Using data from NBA players, I represent each player’s shooting frequency as a smooth surface over the court and apply functional principal component analysis to uncover the main modes of spatial variation in shooting behavior. These functional components capture interpretable shooting patterns, e.g. perimeter-oriented vs paint-focused styles, that go beyond traditional position labels. Then, using clustering methods on the functional scores, I identify groups of players with similar shooting profiles. This analysis raises a particular question: based on his shooting surface, does Stephen Curry really fit the profile of a guard?
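The pipeline described in the talk (shot surfaces, functional PCA, clustering the scores) can be sketched on synthetic data. Plain PCA on a discretized grid stands in for a smoothed, basis-expansion FPCA, and the two shooting "styles" are simulated rather than taken from NBA data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)

# Synthetic shot-frequency surfaces on a coarse half-court grid:
# "perimeter" players shoot far from the basket, "paint" players close.
gx, gy = np.meshgrid(np.linspace(-1, 1, 20), np.linspace(0, 1, 10))
r = np.sqrt(gx**2 + gy**2)

def surface(style, noise=0.05):
    center = 0.9 if style == "perimeter" else 0.2
    s = np.exp(-((r - center) ** 2) / 0.05) + noise * rng.random(r.shape)
    return (s / s.sum()).ravel()         # normalize to a frequency surface

players = [surface("perimeter") for _ in range(25)] \
    + [surface("paint") for _ in range(25)]
Xs = np.array(players)

# FPCA on the discretized surfaces (PCA on grid values as a stand-in),
# then clustering of the functional scores.
pc_scores = PCA(n_components=2).fit_transform(Xs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pc_scores)

# The leading component should separate perimeter- from paint-oriented styles.
true = np.array([0] * 25 + [1] * 25)
agree = max((labels == true).mean(), (labels != true).mean())
print(f"cluster/style agreement: {agree:.2f}")
```

Comparing the cluster a player falls into against his listed position is exactly the kind of question the talk's Stephen Curry example raises.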

Plenary lecture
Tackling supervised problems when the data is (ultra) high-dimensional or structured: an overview

Much contemporary research, especially biomedical research employing high-throughput assays, is based on associating outcomes to a very large amount of potentially predictive information – in the form of (ultra) high-dimensional predictor vectors, and sometimes of measurements structured over time, space or other domains. To tackle the resulting supervised problems, statisticians design tools to screen, select and combine predictive information, often leveraging notions of sparsity, low complexity, and sufficiency. As the size and intricacy of the data increase, these tools must retain effectiveness and interpretability – but also scale efficiently and offer some guarantees of robustness. In this lecture, I will review some of the methods my collaborators and I developed in this area, highlighting their strengths and shortcomings in these respects, and illustrating some of their biomedical applications. As time allows, we will touch upon dimension reduction and feature screening methods based on Covariate Information Matrices and Numbers (Yao et al., 2019; Nandy et al., 2022); feature selection and outlier detection methods based on Mixed Integer Programs (Kenney et al., 2021; Insolia et al., 2022); and feature selection methods for functional regressions based on Elastic Net-type algorithms (Boschi et al., 2021; Boschi et al., 2024).

  • Yao, W., Nandy, D., Lindsay, B. G., and Chiaromonte, F. (2019). Covariate Information Matrix for Sufficient Dimension Reduction. Journal of the American Statistical Association, 114(528), 1752–1764.
  • Nandy, D., Chiaromonte, F., and Li, R. (2022). Covariate Information Number for Feature Screening in Ultrahigh-Dimensional Supervised Problems. Journal of the American Statistical Association, 117(539), 1516–1529.
  • Kenney, A., Chiaromonte, F., and Felici, G. (2021). MIP-BOOST: Efficient and Effective L0 Feature Selection for Linear Regression. Journal of Computational and Graphical Statistics, 30(3), 566–577.
  • Insolia, L., Kenney, A., Chiaromonte, F., and Felici, G. (2022). Simultaneous Feature Selection and Outlier Detection with Optimality Guarantees. Biometrics, 78(4), 1592–1603.
  • Boschi, T., Reimherr, M., and Chiaromonte, F. (2021). A Highly Efficient Group Elastic Net Algorithm with an Application to Function-On-Scalar Regression. NeurIPS 2021 Advances in Neural Information Processing Systems, M. Ranzato et al. eds., Curran Associates, Inc. 34, 9264–9277.
  • Boschi, T., Testa, L., Chiaromonte, F., and Reimherr, M. (2024). FAStEN: An Efficient Adaptive Method for Feature Selection and Estimation in High-Dimensional Functional Regressions. Journal of Computational and Graphical Statistics, 34(2), 567–579.

Cédric Beaulac

Cédric Beaulac has been a professor in the Département de mathématiques at the Université du Québec à Montréal (UQAM) since 2022. He earned his PhD in statistical sciences at the University of Toronto, then completed postdoctoral fellowships at Simon Fraser University and the University of Victoria.

His research centers on developing new methodologies for the analysis of complex data structures, with a focus on statistical learning, functional data analysis, and dimension reduction methods. His recent work aims to establish a functional representation of shapes and surfaces for image analysis, and to explore its applications in health, neuroscience, and road safety.

Piercesare Secchi

Piercesare Secchi is a full professor of statistics at the Politecnico di Milano, where he is a member of MOX, the department's laboratory for modeling and scientific computing, and head of the doctoral program in Data Analytics and Decision Sciences. He holds degrees in mathematics and statistics from the University of Milan, the University of Trento, and the University of Minnesota.

His research focuses on statistical methods for complex and spatially dependent data, including object-oriented spatial statistics, functional data analysis, and the classification of complex data. He has coordinated and participated in numerous national and international research projects, with applications ranging from environmental monitoring and energy systems to healthcare, transportation, and urban studies.

Piercesare Secchi is one of the founders of Moxoff, a Politecnico di Milano spin-off dedicated to applied mathematics and statistics for business and industry. He has held several leadership positions, notably as head of the Department of Mathematics (2009-2016) and president of the European Center for Nanomedicine (2015-2019).

Ana Maria Kenney

Ana Maria Kenney is an assistant professor in the Department of Statistics at the University of California, Irvine. Her work lies at the interface of statistics, interpretable machine learning, and large-scale optimization, in the service of biomedical research. She has been part of several cross-institutional interdisciplinary teams, notably in cardiovascular genetics, omics for infant growth, and early cancer detection. She completed a postdoctoral fellowship at the University of California, Berkeley, after earning a dual PhD in statistics and operations research at Pennsylvania State University, where she was a fellow of the Biomedical Big Data to Knowledge training program and held an Alfred P. Sloan doctoral fellowship.

Francesca Chiaromonte

Francesca Chiaromonte holds a laurea in statistics and economics from the University of Rome La Sapienza, Italy, and a PhD in statistics from the University of Minnesota, United States. Her research focuses on methods for the analysis of high-dimensional, complex, structured, and potentially undersampled data, including supervised dimension reduction and variable selection methods, computational techniques for the empirical assessment of significance and stability, latent structure and Markov modeling approaches, and functional data analysis methods. She applies these methods to contemporary omics sciences, biomedical research, and other fields, notably meteorology and economics.

A full professor of statistics, she also holds the Dorothy Foehr Huck and J. Lloyd Huck Chair in Statistics for the Life Sciences at Pennsylvania State University, United States, where she is an active member of the Center for Computational Biology & Bioinformatics, the Center for Medical Genomics, and the Institute for Genome Sciences. She is also a full professor at the Institute of Economics of the Sant'Anna School of Advanced Studies in Pisa, Italy, where she has led the scientific coordination of EMbeDS (economics, management, and law in the era of data science) since its creation in 2018.

Francesca Chiaromonte has been a fellow of the American Statistical Association since 2016 and of the Institute of Mathematical Statistics since 2022. She has published some 120 peer-reviewed articles, including many in the most influential general and disciplinary journals (e.g., Nature, PNAS, JASA, Annals of Statistics). Her collaborative work has been funded repeatedly by the NIH and the NSF.

Steven Golovkine

Steven Golovkine is an assistant professor in the Department of Mathematics and Statistics at Université Laval. His research interests center on developing statistical methods and algorithms for modeling multivariate functional data. Examples of applications of his work include functional magnetic resonance imaging (fMRI), accelerometer data in sport science, and electrocardiogram (ECG) monitoring in cardiology.

Steven Golovkine received his engineering degree in statistics and his M.Sc. in Big Data in 2017 from the École nationale de la statistique et de l'analyse de l'information (ENSAI), France. He then obtained a PhD in mathematics and their interactions from ENSAI, in collaboration with Groupe Renault, on the topic "Statistical methods for multivariate functional data".
