Statistics and Data Science Seminar

Department of Mathematics and Statistics

COSAM Statistics & Data Science Seminar


The Statistics & Data Science Seminar is hosted by the Department of Mathematics and Statistics and provides a weekly platform for academics and researchers from different domains to present and discuss problems and solutions regarding data collection, management and analysis.

Spring 2026 Seminars

Welcome to the Spring 2026 Seminar series! The seminar takes place on Wednesdays at 2 p.m. CT. Seminars are either hybrid (in person and over Zoom) or virtual only (over Zoom). The in-person location is Parker Hall 358. For questions or requests, please contact Huan He or Haotian Xu. The speakers for this series are listed in the table below, followed by the title and abstract of each talk.

Speaker	Institution	Date	Format
		Feb. 4
Sayar Karmakar	U of Florida	Feb. 11	In-person
		Feb. 18
Jiajin Sun	Florida State	Feb. 25	In-person
Shuoyang Wang	U of Louisville	Mar. 4	In-person
NA	NA	Mar. 11	NA
Florian Gunsilius	Emory	Mar. 18	In-person
Yan Li	Auburn	Mar. 25	In-person
Rich Lehoucq	Sandia National Labs	Apr. 1	In-person
Mine Dogucu	UC Irvine	Apr. 8	Online
		Apr. 15
Shivam Kumar	U Chicago	Apr. 22	In-person

Sayar Karmakar (U of Florida)

Title: Epidemic Changepoints: Applications in spatial anomaly detection and localizing LLM watermarks

Abstract: We present epidemic change-points as a unifying lens for two localization problems: (i) detecting spatial anomalies and (ii) segmenting watermarked regions in mixed-source text. For spatial data, we formalize a 'spatial' change-point as an anomalous region (an epidemic in space), provide detection-accuracy results for single and multiple breaks, and propose a block-based scan that delivers substantial computational savings with guarantees. Next, we move to a seemingly unrelated but very pertinent topic.

As large language models proliferate, ensuring content provenance has become a statistical challenge. For the problem of finding localized modified segments in text data, we introduce WISER, a fast epidemic-segmentation approach with finite-sample error bounds and consistency for multiple watermarked segments, and we demonstrate empirical gains over state-of-the-art baselines on benchmark datasets.

We emphasize how classical changepoint ideas tailored to epidemic and transient departures yield principled, scalable solutions to modern problems in text provenance and spatial anomaly detection. Simulations and empirical studies corroborate the theory and point to open questions for PhD-level research.

Joint work with Soham Bonnerjee & Subhrajyoty Roy (watermarks) and with Soham Bonnerjee & George Michailidis (spatial anomaly)

Jiajin Sun (Florida State)

Title: Efficient Analysis of Latent Spaces in Heterogeneous Networks

Abstract: This work proposes a unified framework for efficient estimation under latent space modeling of heterogeneous networks. We consider a class of latent space models that decompose latent vectors into shared and network-specific components across networks. We develop a novel procedure that first identifies the shared latent vectors and further refines estimates through efficient score equations to achieve statistical efficiency. Oracle error rates for estimating the shared and heterogeneous latent vectors are established simultaneously. The analysis framework offers remarkable flexibility, accommodating various types of edge weights under general distributions.

Shuoyang Wang (U of Louisville)

Title: Deep Learning for Complex Functional Data Analysis

Abstract: Functional data are realizations of random functions observed over a continuum, such as signals and images. In many modern applications, including neuroscience and biomedical research, observations are more naturally represented as random functions rather than finite-dimensional vectors. The intrinsic complexity of such data stems from high-dimensional functional domains, cross-cohort heterogeneity, and unknown data-generating distributions, which together complicate principled modeling and performance guarantees. Although deep learning has shown strong empirical performance in biomedical studies, its methodological and theoretical foundations for complex functional data settings remain limited. In this talk, I will present two methodological contributions that develop principled deep learning frameworks for complex functional data.

First, I will introduce a federated deep learning approach for functional data classification across multiple heterogeneous cohorts. The learner visits each cohort once, performs local updates, and transmits only compressed model weights, thereby preserving privacy and reducing communication and computational costs. To address cross-cohort heterogeneity, we develop an adaptive sequential weight updating strategy that progressively corrects distributional shifts and improves performance on a target cohort. We establish minimax optimal excess risk bounds and characterize a sharp sampling threshold governing learnability under both densely and sparsely observed functional data.

Second, I will present a deep-learning-based functional graphical modeling framework for learning conditional independence structures in multivariate functional data. Each node’s neighborhood is estimated via flexible functional regression with embedded feature selection, allowing a fully nonparametric specification, and the overall graph is recovered by aggregating the neighborhood estimates. The method avoids restrictive distributional assumptions and does not rely on a well-defined functional precision operator. We prove global model selection consistency and establish convergence rates that attain the classical nonparametric regression rate up to a logarithmic factor, with a fundamental sampling threshold determining the estimator’s convergence behavior. Empirical performance is demonstrated through simulations and real data applications, including analyses of the ADNI and ADHD-200 Consortium datasets.

Florian Gunsilius (Emory)

Title: Partial Identification with Schrödinger Bridges

Abstract: Partial identification provides an alternative to point identification: instead of pinning down a unique parameter estimate, the goal is to characterize a set guaranteed to contain the true parameter value. Many partial identification approaches take the form of linear optimization problems, which seek the "best- and worst-case scenarios" of a proposed model subject to the constraint that the model replicates correct observable information. However, such linear programs become intractable in settings with multivalued or continuous variables. This paper introduces a novel method to overcome this computational and statistical curse of cardinality: we provide a duality between a general class of optimal transportation problems and the lower bound of a partially identified effect. Building on this duality, we propose a discretization of the instrument realizations and an entropy transform of these potentially infinite-dimensional linear programs. This maps the problem into general versions of multi-marginal Schrödinger bridges, enabling efficient approximation of their solutions. In the process, we establish novel statistical and mathematical properties of such multi-marginal Schrödinger bridges, including consistency of the estimator and an analysis of the asymptotic distribution of entropic approximations to infinite-dimensional linear programs. We illustrate this approach by analyzing instrumental variable models with continuous variables, a setting that has been out of reach for existing methods that do not rely on sampling. (Joint work with Bruno Nunes Costa from the University of Michigan.)

Yan Li (Auburn)

Title: Structured Statistical Methods for High-Dimensional Scientific Data: From Climate Projection to Spatial Genomics

Abstract: Modern scientific data are increasingly high-dimensional, structured, and spatially indexed, requiring statistical methods that integrate multiple sources of information while accounting for complex dependence and domain-specific constraints. We develop structured statistical modeling approaches across several settings, including methods that (i) connect models with observations, (ii) leverage historical information to improve prediction in climate studies, and (iii) learn structured relationships among features in gene expression data.

The talk is organized into three parts. First, I briefly present the optimal fingerprinting framework, which links historical climate model simulations and observations to estimate the contributions of different radiative forcings, corresponding to the detection and attribution of climate change. Second, I introduce a spatially resolved regression framework that links historical and future climate states to constrain projections, leveraging observational information to reduce uncertainty and improve the reliability of future climate predictions. This framework also motivates extensions that incorporate more general spatial structure in the regression operator, such as combining local sparse interactions with global latent factors and incorporating region-based aggregation using hierarchical regional information.

This perspective on structured regression naturally extends beyond climate applications. Third, I introduce a multivariate regression framework for spatial gene expression data that jointly models gene expression and compositional predictors, incorporating structured feature aggregation through hierarchical relationships among compositional variables and fusion across genes, leading to improved biological interpretability and statistical efficiency.

Across these settings, the common theme is the development of regression-based methods that incorporate structural information—such as dependence, spatial organization, and hierarchical relationships—to improve inference and prediction in high-dimensional data.

Rich Lehoucq (Sandia National Labs)

Title: The Poisson tensor completion parametric estimator

Abstract: We introduce the Poisson tensor completion (PTC) estimator that exploits inter-sample relationships to compute a low-rank Poisson tensor decomposition of the frequency histogram for samples of a multivariate distribution. Our crucial observation is that the histogram bins are an instance of a space partitioning of counts and thus can be identified with a spatial non-homogeneous Poisson process. The Poisson tensor decomposition leads to a completion of the mean measure over all bins, including those containing few to no samples, and leads to our proposed estimator. A Poisson tensor decomposition models the underlying distribution of the count data and guarantees non-negative estimated values, obviating the need for additional constraints to ensure non-negativity. Furthermore, we demonstrate that our PTC estimator is a substantial improvement over standard histogram-based estimators for sub-Gaussian probability distributions because of the concentration-of-norm phenomenon.

Mine Dogucu (UC Irvine)

Title: Statistics Education Research as a Catalyst for Professional and Departmental Growth

Abstract: As the demand for data science and statistical literacy explodes across university campuses, mathematics and statistics departments face a dual challenge: how to scale high-quality instruction without sacrificing rigor, and how to prepare the next generation of scholars for an evolving academic job market. In this talk, I argue that Statistics and Data Science Education Research (SDSER) is not merely a service to the department, but a rigorous research domain that acts as a catalyst for growth at both the individual and institutional levels. For graduate students, I will outline examples of a "Teaching PI" who leads a research group, secures federal funding, and contributes to the scholarly conversation. For faculty and departmental leadership, I will demonstrate how SDSER provides solutions to practical bottlenecks.

Shivam Kumar (U Chicago)

Title: Optimal Bias-variance Tradeoff in Matrix and Tensor Estimation

Abstract: We study matrix and tensor denoising when the underlying signal is not necessarily low-rank. In the tensor setting, we observe
\[
Y = X^\ast + Z \in \mathbb{R}^{p_1 \times p_2 \times p_3},
\]
where \(X^\ast\) is an unknown signal tensor and \(Z\) is a noise tensor. We propose a one-step variant of the higher-order SVD (HOSVD) estimator, denoted \(\widetilde X\), and show that, uniformly over any user-specified Tucker ranks \((r_1,r_2,r_3)\), with high probability,
\[
\|\widetilde X - X^\ast\|_{\mathrm F}^2
= O\Big( \kappa^2\Big\{r_1r_2r_3 + \sum_{k=1}^3 p_k r_k\Big\} + \xi_{(r_1,r_2,r_3)}^2 \Big).
\]
Here, \(\xi_{(r_1,r_2,r_3)}\) is the best achievable Tucker rank-\((r_1,r_2,r_3)\) approximation error of \(X^\ast\) (bias), \(\kappa^2\) quantifies the noise level, and \(\kappa^2\{r_1r_2r_3+\sum_{k=1}^3 p_k r_k\}\) is the variance term scaling with the effective degrees of freedom of \(\widetilde X\). This yields a rank-adaptive bias-variance tradeoff: increasing \((r_1,r_2,r_3)\) decreases the bias \(\xi_{(r_1,r_2,r_3)}\) while increasing variance. In the matrix setting, we show that truncated SVD achieves an analogous bias-variance tradeoff for arbitrary signal matrices. Notably, our matrix result requires no assumptions on the signal matrix, such as finite rank or spectral gaps. Finally, we complement our upper bounds with matching information-theoretic lower bounds, showing that the resulting bias-variance tradeoff is minimax optimal up to universal constants in both the matrix and tensor settings.
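For readers who want to see the matrix-setting tradeoff numerically, the following is a minimal sketch and not the speaker's code or estimator. It assumes i.i.d. Gaussian noise and a synthetic full-rank signal with polynomially decaying singular values; the helper names `truncated_svd` and `tradeoff_table`, the dimensions, and the decay rate are all hypothetical choices made for illustration. For each rank r it reports the squared bias (the best rank-r approximation error of the signal) and the squared Frobenius error of the rank-r truncated SVD of the noisy observation.

```python
import numpy as np

def truncated_svd(Y, r):
    """Best rank-r approximation of Y in Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def tradeoff_table(X_star, Y, ranks):
    """For each rank r, return (r, squared bias, squared estimation error)."""
    rows = []
    for r in ranks:
        # Squared bias: how well the true signal can be approximated at rank r.
        bias_sq = np.linalg.norm(X_star - truncated_svd(X_star, r), "fro") ** 2
        # Squared error of the rank-r truncated SVD of the noisy observation.
        err_sq = np.linalg.norm(truncated_svd(Y, r) - X_star, "fro") ** 2
        rows.append((r, bias_sq, err_sq))
    return rows

rng = np.random.default_rng(0)
p1, p2, sigma = 60, 50, 1.0  # illustrative sizes and noise level (assumptions)

# Hypothetical full-rank signal with polynomially decaying singular values.
U, _ = np.linalg.qr(rng.standard_normal((p1, p1)))
V, _ = np.linalg.qr(rng.standard_normal((p2, p2)))
sv = 50.0 / np.arange(1, p2 + 1) ** 1.5
X_star = (U[:, :p2] * sv) @ V.T

# Noisy observation Y = X* + Z with i.i.d. Gaussian noise (an assumption).
Y = X_star + sigma * rng.standard_normal((p1, p2))

for r, bias_sq, err_sq in tradeoff_table(X_star, Y, [1, 2, 5, 10, 25, 50]):
    print(f"r={r:2d}  bias^2={bias_sq:10.2f}  error^2={err_sq:10.2f}")
```

In this sketch the squared bias shrinks as r grows, while at full rank the estimate equals the noisy observation, so the error there is just the squared norm of the noise.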
